Probabilistic Data Separation
Gerald Gibson
Principal Engineer @ Salesforce | Hyperscale, Machine Learning, Patented Inventor
Clusters, modes, distributions, categories, sub-populations, sub-signals, mixtures, proportions, ratios, density curves. These concepts are related by a central theme: the idea of separation boundaries, the distinctions between one "thing" and another "thing". If separation boundaries did not exist, then everything in the universe would be nothing more than random noise.
This concept is behind various processes, algorithms, and tools in statistics, data science, and machine learning. Mixture Models attempt to fit a model to a dataset that contains more than one sub-population by grouping the data points by commonalities in their attributes, and those commonalities act as the boundaries between the types. Clustering algorithms do this also. Distribution fitting algorithms like Expectation Maximization do this as well. Modality testing algorithms do the same thing. All the different Categorical (non-regression) machine learning algorithms, e.g. Decision Trees, are designed to accomplish this task. So in a sense these methodologies are all duplicative in what they are designed to do... separate one set of data points from the others in a larger dataset.
Once the dataset is separated into these sub-categories, we assign labels to them, such as IDs or names, that represent the concept(s) those data points describe. We can also end up with other descriptors, such as each group's proportion of the entire dataset and / or the probability that a new data point added to the dataset should be assigned to each of the clusters within it. This last point, assigning probability percentages to each cluster, is what is meant when the word "prediction" is used in this context. Each sub-population, each distinct distribution mode, gets a percentage value assigned, and that percentage predicts how likely a new data point is to belong to that grouping in the data. If a cluster has 66% of the data points in the dataset assigned to it, because 66% of the data points have similar attributes, then the best prediction is that there is a 66% chance a new data point will belong there... and each of the other sub-groups will have a smaller probability such as 5% or 14%. The percentage values of all of the sub-groups combined sum to exactly 100%, so that all possibilities are covered.
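As a concrete illustration of those percentages, here is a minimal sketch with made-up cluster assignments matching the 66% example above; it shows how each cluster's share of the data doubles as a baseline prediction probability, and how the shares sum to 100%:

```python
import numpy as np

# Hypothetical cluster assignments for 100 data points (cluster IDs 0, 1, 2).
labels = np.array([0] * 66 + [1] * 20 + [2] * 14)

# Each cluster's proportion of the dataset is also the baseline prediction:
# the chance that a brand-new data point belongs to that cluster.
unique, counts = np.unique(labels, return_counts=True)
proportions = counts / counts.sum()        # sums to exactly 1.0 (i.e. 100%)
for cluster_id, p in zip(unique, proportions):
    print(f"cluster {cluster_id}: {p:.0%} of points -> {p:.0%} prior probability")
```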
When you want to automate this type of data separation you need to come up with a process, an algorithm, that can, step by step and without human intervention, distinguish one data subset from another. Expectation Maximization is one such algorithm: it can take in a bunch of data that is all mixed up and figure out what the groupings of similar attributes are.
When reading data science literature about this process you might see these datasets referred to as "matrices", i.e. a matrix. In reality these are just data tables (rows and columns). The rows represent the individual data points (i.e. things) and the columns are the different attributes of those "things". Expectation Maximization is often used by other, more complex algorithms as one part of a multistep process to achieve this category separation. For example, you can take one of the attribute columns for every row in the data table and run the values in that column through the EM process to determine whether there are sub-groupings within that one attribute and what the statistical properties of those sub-groupings are (e.g. mean and standard deviation). A machine learning process that uses EM might loop over every attribute column and run each one through expectation maximization to find these sub-groups and their statistical values, as sketched below.
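Here is a rough sketch of that per-column loop. The toy table and its column meanings are made up for illustration; they are not from the Datacamp project:

```python
import numpy as np

# A toy "matrix": each row is one data point, each column is an attribute.
# The column meanings (e.g. cpu %, memory MB, latency ms) are hypothetical.
data_table = np.array([
    [12.0, 512.0,  30.1],
    [85.3, 2048.0, 120.7],
    [11.5, 490.0,  28.9],
    [90.1, 2100.0, 115.2],
])

# Pull out each attribute column and hand it to the separation process;
# a full pipeline would run modality testing + EM (shown below) per column.
for col_idx in range(data_table.shape[1]):
    column_values = data_table[:, col_idx]
    print(f"column {col_idx}: {column_values}")
```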
One negative with EM is that you must tell it up front how many subgroups (i.e. clusters) exist within the data. That means human intervention: plot the data, look at it, and then tell the EM process how many subgroups it should fit its model to and gather statistics on. If we want to completely automate this process we need a way around that, by also automatically figuring out how many subgroups there are and passing this count into the EM function before it starts. In the example code project below we do this with an algorithm called Modality Testing that is implemented in a Python library called UniDip. Essentially, Modality Testing builds a histogram of the data and then finds the peaks, which tells it how many "modes" or subgroups there are, along with a rough guess at the minimum and maximum values that are common within each of those "modes".
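Here is a minimal sketch of that modality-testing step using the UniDip library on synthetic two-mode data. The alpha value and the exact index convention of the returned intervals are assumptions worth checking against the library's documentation:

```python
import numpy as np
from unidip import UniDip   # pip install unidip

# Synthetic data: two sub-populations mixed together in one column.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 1, 300), rng.normal(25, 2, 200)])

# UniDip works on sorted 1-D data and returns (start, stop) index pairs,
# one pair per detected mode, into that sorted array.
data = np.sort(data)
intervals = UniDip(data, alpha=0.05).run()

print("number of modes found:", len(intervals))
for start, stop in intervals:
    mode_slice = data[start:stop + 1]   # rough span of this mode's values
    print("rough mode range:", mode_slice.min(), "to", mode_slice.max())
```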
Given the rough minimum and maximum returned by UniDip we can calculate a rough Mean and Standard Deviation for each mode represented by the peaks.
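One simple way to turn that rough range into starting statistics (a crude approximation for illustration, not necessarily how the Datacamp project computes it) is to take the midpoint of the range as the mean and a quarter of the range as the standard deviation:

```python
import numpy as np

def rough_stats(mode_values):
    """Crude starting estimates for one mode from its rough value range.

    The midpoint of the range approximates the mean; a quarter of the range
    approximates the standard deviation (most of a roughly bell-shaped mode
    falls within about two standard deviations of its center). These are
    only starting points for expectation maximization to refine.
    """
    lo, hi = float(np.min(mode_values)), float(np.max(mode_values))
    mean = (lo + hi) / 2.0
    std = max((hi - lo) / 4.0, 1e-6)   # guard against a zero-width mode
    return mean, std

# Example: values falling roughly between 8 and 12 for one detected mode.
print(rough_stats([8.1, 9.4, 10.2, 11.7, 11.9]))   # -> (10.0, 0.95)
```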
These modality testing results can then be used to initialize the expectation maximization algorithm, which then iterates over the data to fine-tune what the actual minimums / maximums and other statistical values are for each "mode" or subgrouping in the data.
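For illustration, here is how those rough estimates could seed an off-the-shelf EM implementation. This sketch uses scikit-learn's GaussianMixture (which runs EM internally) rather than the article's own code, and the initial (mean, std) pairs are hypothetical values of the kind a modality test might produce:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# The same kind of synthetic two-mode data as in the UniDip sketch above.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 1, 300), rng.normal(25, 2, 200)])

# Rough (mean, std) pairs from modality testing (hypothetical values).
initial_params = [(10.0, 1.0), (25.0, 2.0)]
means_init = np.array([[m] for m, _ in initial_params])
precisions_init = np.array([[[1.0 / (s ** 2)]] for _, s in initial_params])

# n_components comes from the modality test, so no human has to supply it.
gmm = GaussianMixture(
    n_components=len(initial_params),
    means_init=means_init,
    precisions_init=precisions_init,
)
gmm.fit(data.reshape(-1, 1))

print("refined means:", gmm.means_.ravel())
print("refined stds:", np.sqrt(gmm.covariances_).ravel())
print("mixture weights:", gmm.weights_)
```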
Expectation Maximization does two steps over and over in a loop. The first step is called Expectation and, of course, the second step is called Maximization. In the Expectation step it iterates over each data point and calculates the probability that the data point belongs to each subgrouping "mode".
Then, after this is done for all data points, it separates the data points and assigns each one to the mode that had the highest probability calculated for that data point. Assigning modes by highest probability is the "maximization" part. It then recalculates the Mean and Standard Deviation for each mode using the data points that were assigned to it by highest probability. This causes the distribution density curves you see in the plots above to scoot over and widen or narrow, getting a little bit closer to an exact "fit" of the actual data. Fitting the model to the data as closely as possible is what "modeling" in data science is all about.
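A minimal sketch of that loop for one-dimensional data is below. It follows the hard-assignment description above (each point goes entirely to its most probable mode); textbook EM uses fractional "responsibilities" instead of hard assignments, but the rhythm of the two steps is the same:

```python
import numpy as np
from scipy.stats import norm

def em_1d(values, means, stds, weights, n_iter=50):
    """Hard-assignment EM for a 1-D Gaussian mixture, per the description above.

    Expectation step: score how likely each point is under each mode.
    Maximization step: assign each point to its highest-scoring mode, then
    recompute that mode's mean, standard deviation, and share of the data.
    """
    values = np.asarray(values, dtype=float)
    means = np.array(means, dtype=float)
    stds = np.array(stds, dtype=float)
    weights = np.array(weights, dtype=float)

    for _ in range(n_iter):
        # Expectation: weighted likelihood of every point under every mode.
        scores = np.array([w * norm.pdf(values, m, s)
                           for m, s, w in zip(means, stds, weights)])
        labels = scores.argmax(axis=0)           # maximization part 1: hard assignment

        for k in range(len(means)):              # maximization part 2: refit each mode
            members = values[labels == k]
            if members.size == 0:
                continue                         # keep old estimates for empty modes
            means[k] = members.mean()
            stds[k] = max(members.std(), 1e-6)   # avoid zero-width modes
            weights[k] = members.size / values.size
    return means, stds, weights

# Start from rough modality-testing estimates and let the loop refine them.
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(10, 1, 300), rng.normal(25, 2, 200)])
print(em_1d(values, means=[9.0, 26.0], stds=[2.0, 3.0], weights=[0.5, 0.5]))
```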
After enough loops between Expectation and Maximization the model fit gets pretty close to exact. This allows you to then calculate more accurate statistics of various kinds on the data points within each cluster of data.
The code project for the process described above can be seen here on Datacamp.com.
When trying to learn data science you run across so many concepts that it is easy to feel like you are drowning in a sea of details. In reality many of these concepts have a root of commonality. Abstracting away the details to find these central concepts and then learning those common themes makes the journey of mastering the world of statistics, data science, and machine learning much more approachable. As you come across new concepts you can then associate (or map) them to these central common themes instead of feeling like you are starting from scratch and climbing an all new mountain ... when really it is just a small hill on the side of the large mountain that is data science.