Probabilistic Data Separation

Clusters, modes, distributions, categories, sub-populations, sub-signals, mixtures, proportions, ratios, density curves. These concepts are related by a central theme: separation boundaries, the distinctions between one "thing" and another "thing". If separation boundaries did not exist, then everything in the universe would be nothing more than random noise.

This concept is behind various processes, algorithms, and tools in statistics, data science, and machine learning. Mixture models fit a model to a dataset that contains more than one sub-population by grouping the data points according to commonalities in their attributes, and those commonalities act as the boundaries between the sub-populations. Clustering algorithms do this too. Distribution-fitting algorithms like Expectation Maximization do it as well, and so do modality testing algorithms. The various categorical (non-regression) machine learning algorithms, e.g. decision trees, are designed to accomplish the same task. So in a sense these methodologies are all duplicative in what they are designed to do... separate one set of data points from the others in a larger dataset.

Once the dataset is separated into these sub-categories, we assign labels to them, such as IDs or names, that represent the concept(s) these data points describe. We can also end up with other descriptors, such as each sub-category's proportion of the entire dataset and/or the probability that a new data point added to the dataset should be assigned to each of the clusters. This last point, assigning a probability to each cluster, is what is meant when the word "prediction" is used in this context. Each sub-population, each distinct distribution mode, gets a percentage value, and that percentage predicts how likely it is that a new data point belongs to that grouping. If a cluster has 66% of the data points in the dataset assigned to it, because 66% of the data points have similar attributes, then the best prediction is that there is a 66% chance a new data point will belong there... and each of the other sub-groups gets a smaller percentage such as 5% or 14%. The percentages of all the sub-groups sum to exactly 100%, so that all possibilities are covered.
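As a minimal illustration of that last idea, the sketch below turns made-up cluster counts into proportions that act as prediction probabilities and sum to 1. The cluster names and counts are assumptions for the example only.

```python
# Made-up cluster sizes: how many points were assigned to each sub-group.
cluster_counts = {"A": 660, "B": 140, "C": 200}

total = sum(cluster_counts.values())

# Each cluster's share of the dataset doubles as the predicted probability
# that a brand new data point belongs to that cluster.
priors = {name: count / total for name, count in cluster_counts.items()}

print(priors)                # {'A': 0.66, 'B': 0.14, 'C': 0.2}
print(sum(priors.values()))  # 1.0 -- all possibilities are covered
```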

When you want to automate this type of data separation, you need to come up with a process, an algorithm, that can distinguish one data subset from another step by step, without human intervention. Expectation Maximization is one such algorithm: it can take in a bunch of data that is all mixed up and figure out what the groupings of similar attributes are.

When reading data science literature about this process you might see these datasets referred to as "matrices", i.e. a matrix. In reality these are just data tables (rows and columns). The rows represent the individual data points (i.e. things) and the columns are the different attributes of those "things". Expectation Maximization is often used by other, more complex algorithms as one part of their multi-step process to achieve this category separation. For example, you can take one of the attribute columns for every row in the data table and run the values in that column through the EM process to determine whether there are sub-groupings within that one attribute and what the statistical properties of those sub-groupings are (e.g. mean and standard deviation). A machine learning process that uses EM might loop over every attribute column and run each one through the expectation maximization process to find these sub-groups and their statistical values, as sketched below.
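The sketch below shows that column-by-column loop, using scikit-learn's GaussianMixture as a stand-in for the EM step. The DataFrame contents and the choice of two components per column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# A small data table: rows are "things", columns are attributes of those things.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "attribute_1": np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 100)]),
    "attribute_2": rng.normal(10, 2, 300),
})

for column in df.columns:
    values = df[column].to_numpy().reshape(-1, 1)   # EM expects a 2-D array
    gm = GaussianMixture(n_components=2, random_state=0).fit(values)
    means = gm.means_.ravel()
    stds = np.sqrt(gm.covariances_).ravel()
    print(column, "means:", means.round(2), "stds:", stds.round(2))
```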

One negative with EM is that you must tell it up front how many subgroups (i.e. clusters) exist within the data. This means human intervention to plot the data, look at it, and then tell the EM process how many subgroups it should fit its model to and gather statistics on. If we want to completely automate this process, we need to get around this by also automatically figuring out how many subgroups there are and then passing that count into the EM function before it starts. In the example code project below we do this with modality testing, implemented in a Python library called UniDip. Essentially, modality testing builds a histogram of the data and then finds the peaks, which tells it how many "modes" or subgroups there are, along with a rough guess about the minimum and maximum values that are common within each of those modes.
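A minimal sketch of that step, assuming the unidip package's UniDip class, which (as I understand its API) returns (start, end) index intervals over the sorted data, one interval per detected mode. The synthetic data is an assumption for illustration.

```python
import numpy as np
from unidip import UniDip

# Synthetic mixed-up data with two sub-populations (illustration only).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 300), rng.normal(4, 1, 200)])
data = np.sort(data)                     # UniDip expects sorted 1-D data

intervals = UniDip(data).run()           # e.g. [(10, 290), (315, 495)]
print("number of modes found:", len(intervals))

for start, end in intervals:
    mode_values = data[start:end + 1]    # treat the interval as inclusive
    print("rough min:", round(mode_values.min(), 2),
          "rough max:", round(mode_values.max(), 2))
```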

Given the rough minimum and maximum returned by UniDip, we can calculate a rough Mean and Standard Deviation for each mode represented by the peaks.
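A minimal sketch of one way to do that conversion. Treating each mode's interval as roughly four standard deviations wide is an assumption for illustration, not something prescribed by UniDip.

```python
def rough_stats(mode_min, mode_max):
    """Turn a mode's rough min/max into a starting mean and standard deviation."""
    mean = (mode_min + mode_max) / 2.0   # center of the interval
    std = (mode_max - mode_min) / 4.0    # assume the interval spans ~4 std devs
    return mean, std

print(rough_stats(-6.0, 0.0))   # (-3.0, 1.5)
print(rough_stats(1.0, 7.0))    # (4.0, 1.5)
```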

These modality testing results can then be used to initialize the expectation maximization algorithm, which then iterates over the data to fine-tune what the actual minimums / maximums and other statistical values are for each "mode" or subgrouping in the data.
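A sketch of that initialization, again using scikit-learn's GaussianMixture as a stand-in for the EM code in the project; the data and the two starting means are assumptions carried over from the earlier sketches.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 300), rng.normal(4, 1, 200)]).reshape(-1, 1)

initial_means = np.array([[-3.0], [4.0]])   # rough means from the modality test

gm = GaussianMixture(n_components=len(initial_means),
                     means_init=initial_means,
                     random_state=0).fit(data)

print("fitted means:  ", gm.means_.ravel().round(2))
print("fitted stds:   ", np.sqrt(gm.covariances_).ravel().round(2))
print("fitted weights:", gm.weights_.round(2))
```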

Expectation Maximization does two steps over and over in a loop. The first step is called Expectation and, of course, the second step is called Maximization. In the Expectation step it iterates over each data point and calculates the probability that the data point belongs to each subgrouping, or "mode".
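A minimal sketch of that Expectation step using the current means, standard deviations, and mode weights; all the numbers here are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

data = np.array([-3.2, -2.8, 0.1, 3.9, 4.4])
means = np.array([-3.0, 4.0])
stds = np.array([1.0, 1.0])
weights = np.array([0.6, 0.4])          # current mode proportions

# Weighted density of every point under every mode: shape (n_points, n_modes).
densities = weights * norm.pdf(data[:, None], loc=means, scale=stds)

# Normalize each row to sum to 1: the probability that the point belongs
# to each mode.
probabilities = densities / densities.sum(axis=1, keepdims=True)
print(probabilities.round(3))
```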

Then, after this is done for all data points, it assigns each data point to the mode that had the highest probability calculated for it. Assigning modes by highest probability is the "maximization" part. Then it recalculates the Mean and Standard Deviation for each mode using the data points that were assigned to it. This causes each mode's distribution density curve to scoot over and widen or narrow, getting a little bit closer to an exact "fit" of the actual data. Fitting the data as closely as possible is what "modeling" in data science is all about.
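A minimal, self-contained sketch of one Expectation plus Maximization pass as described above. Note this is the hard-assignment variant described here (assign each point to its highest-probability mode, then recompute); classic EM instead weights every point by its membership probabilities. All starting values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

data = np.array([-3.2, -2.8, 0.1, 3.9, 4.4])
means = np.array([-3.0, 4.0])
stds = np.array([1.0, 1.0])
weights = np.array([0.6, 0.4])

# Expectation: per-point membership probabilities (same as the sketch above).
densities = weights * norm.pdf(data[:, None], loc=means, scale=stds)
probabilities = densities / densities.sum(axis=1, keepdims=True)

# Maximization: assign each point to its highest-probability mode, then
# recompute each mode's mean, standard deviation, and weight from its points.
labels = probabilities.argmax(axis=1)
for k in range(len(means)):
    assigned = data[labels == k]
    if assigned.size:                   # skip empty modes
        means[k] = assigned.mean()
        stds[k] = assigned.std()
        weights[k] = assigned.size / data.size

print("updated means:  ", means.round(3))
print("updated stds:   ", stds.round(3))
print("updated weights:", weights.round(3))
```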

After enough loops between Expectation and Maximization, the model fit gets pretty close to exact. This allows you to calculate more accurate statistics of various kinds on the data points within each cluster of data.
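As a sketch of that final step, once a fitted mixture is available you can assign every point to a cluster and compute whatever statistics you need per cluster. The fit and the particular statistics chosen here are illustrative assumptions, using scikit-learn's GaussianMixture again.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 300), rng.normal(4, 1, 200)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(data)

labels = gm.predict(data)               # hard cluster assignment after fitting
for k in range(gm.n_components):
    cluster = data[labels == k].ravel()
    print(f"cluster {k}: n={cluster.size}, mean={cluster.mean():.2f}, "
          f"std={cluster.std():.2f}, median={np.median(cluster):.2f}")
```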

The code project for the process described above can be seen here on Datacamp.com.

Distribution Mixture Model via Expectation Maximization

When trying to learn data science you run across so many concepts that it is easy to feel like you are drowning in a sea of details. In reality many of these concepts have a root of commonality. Abstracting away the details to find these central concepts and then learning those common themes makes the journey of mastering the world of statistics, data science, and machine learning much more approachable. As you come across new concepts you can then associate (or map) them to these central common themes instead of feeling like you are starting from scratch and climbing an all new mountain ... when really it is just a small hill on the side of the large mountain that is data science.
