Machine Learning: Dimensionality
Bhalchandra (Bhal) Madhekar
Software Engineering Leader | Big Data | Advanced Analytics | AI/ML | Cloud Technologies | SaaS / PaaS | KaggleX BIPOC Mentor | GitHub.com/madhekar
One of the tough problems in machine learning is dimensionality, that is, the number of features. Bellman coined the term "curse of dimensionality" in 1961 to describe the fact that many algorithms that work fine in low dimensions become intractable when the input has many features. Generalizing correctly becomes exponentially harder as the dimensionality of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space with each added dimension. Put differently, the number of training examples available for each combination of feature values becomes vanishingly small. In numerical terms, even with a moderate dimension of 100 (say, binary features) and a huge training set of a trillion examples, the training set covers only a fraction of about 10^-18 of the input space. This is what makes machine learning both essential and hard.
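As a sanity check on that arithmetic, here is a minimal Python sketch (my own illustration), assuming binary features so that a d-dimensional input space contains 2^d possible examples:

```python
from fractions import Fraction

def coverage_fraction(n_examples: int, n_features: int) -> Fraction:
    """Fraction of a binary input space (2**n_features possible points)
    that n_examples distinct training examples can cover at most."""
    return Fraction(n_examples, 2 ** n_features)

# A trillion examples in a 100-dimensional binary space:
frac = coverage_fraction(10**12, 100)
print(float(frac))  # ~7.9e-19, i.e. on the order of 10**-18
```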
The similarity-based reasoning that machine learning algorithms depend on breaks down in high dimensions. For simplicity, consider a nearest neighbor classifier with Hamming distance as the similarity measure, and suppose the class is just a function of two features, say x1 AND x2. This is an easy problem, but if there are 98 additional irrelevant features x3, ..., x100, the noise from them completely swamps the signal in x1 and x2, and nearest neighbor effectively makes random predictions. Even more disturbing, nearest neighbor has a problem even if all 100 features are relevant! This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example x_test. If the grid is N-dimensional, x_test's 2N nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of x_test, until the choice of nearest neighbor is effectively random.
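Here is a small NumPy simulation of that setup (a sketch, with training and test sizes chosen arbitrarily): the label depends only on x1 AND x2, yet 1-NN accuracy under Hamming distance decays toward the base rate as irrelevant binary features are added.

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_accuracy(n_irrelevant: int, n_train: int = 500, n_test: int = 200) -> float:
    """1-nearest-neighbor accuracy (Hamming distance) when the label
    is x1 AND x2 and all remaining features are random noise."""
    d = 2 + n_irrelevant
    X_train = rng.integers(0, 2, size=(n_train, d))
    X_test = rng.integers(0, 2, size=(n_test, d))
    y_train = X_train[:, 0] & X_train[:, 1]
    y_test = X_test[:, 0] & X_test[:, 1]

    correct = 0
    for x, y in zip(X_test, y_test):
        # Hamming distance = number of disagreeing features
        dists = (X_train != x).sum(axis=1)
        correct += y_train[dists.argmin()] == y
    return correct / n_test

for n in (0, 10, 98):
    print(f"{n:3d} irrelevant features -> accuracy {nn_accuracy(n):.2f}")
```

With no irrelevant features every test point has an exact training duplicate and accuracy is essentially perfect; with 98 it sinks toward what you would get by guessing.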
This is only one instance of a more general problem with high dimensions: intuitions that come from a three-dimensional world often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean but in an increasingly distant "shell" around it, and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

Building a classifier in two or three dimensions is easy; we can find a reasonable frontier between examples of different classes just by visual inspection. It has even been said that if people could see in high dimensions, machine learning would not be necessary. But in high dimensions it is hard to understand what is happening, which in turn makes it difficult to design a good classifier or model. One might naively think that gathering more features never hurts, since at worst they provide no new information about the class. In fact, their benefits may be outweighed by the curse of dimensionality.
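The hypersphere claim above is easy to check numerically. The sketch below uses the standard closed-form volume of a unit d-ball, pi^(d/2) / Gamma(d/2 + 1), inscribed in a hypercube of side 2:

```python
import math

def inscribed_ball_fraction(d: int) -> float:
    """Volume of the unit d-ball divided by the volume of the
    hypercube of side 2 that circumscribes it."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube = 2.0 ** d
    return ball / cube

for d in (2, 3, 10, 50):
    print(f"d={d:2d}: sphere fills {inscribed_ball_fraction(d):.2e} of the cube")
```

The fraction falls from about 0.79 in two dimensions to roughly 2.5e-3 at d=10 and is astronomically small by d=50.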
So what's the way out…?
Fortunately, there is an effect that partly counteracts the curse, which might be called the "blessing of non-uniformity." In most applications, examples are not spread uniformly throughout the space but are concentrated on or near a lower-dimensional manifold. For example, k-nearest neighbor works quite well for handwritten digit recognition even though images of digits have one dimension per pixel, because the space of digit images is much smaller than the space of all possible images. Learners can implicitly take advantage of this lower effective dimension, or algorithms that explicitly reduce the dimensionality can be used.
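As one concrete illustration of explicit dimensionality reduction (a sketch, not the only option), the scikit-learn snippet below projects the 64-pixel digits data onto 20 principal components with PCA before running k-NN; the component count and neighbor count are arbitrary choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64 pixel features per image
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Project the 64 pixel dimensions onto a lower-dimensional subspace,
# then classify with k-NN in that reduced space.
model = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```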
Ensemble techniques, which increase model diversity, also help mitigate the problem. Simply injecting randomness into an ensemble can make it outperform its base model. In a k-NN ensemble, for example, varying k varies the topological neighborhood defined by the data points, providing different lenses through which to view the predictor-outcome relationship and smoothing the resulting regression function. An ensemble built this way effectively averages over randomly chosen neighborhoods. The same reasoning applies to the random forest, which is an ensemble of decision trees.
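Below is a minimal sketch of that idea (my own illustration, with member k values and dataset parameters chosen arbitrarily): an ensemble that averages the predicted class probabilities of k-NN members with different neighborhood sizes, compared against a single k-NN on synthetic data with many uninformative features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each ensemble member views the data through a different neighborhood size k.
ks = [1, 3, 5, 9, 15, 25]
probas = []
for k in ks:
    member = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    probas.append(member.predict_proba(X_test))

# Average the members' probabilities and take the most likely class.
ensemble_pred = np.mean(probas, axis=0).argmax(axis=1)
print("ensemble accuracy:", (ensemble_pred == y_test).mean())

single = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("single 5-NN accuracy:", single.score(X_test, y_test))
```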
These are some thoughts on mitigating the accuracy issues that arise in high-dimensional machine learning models.
References:
- https://en.wikipedia.org/wiki/Curse_of_dimensionality
- https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
- https://en.wikipedia.org/wiki/Ensemble_learning