Machine Learning: Dimensionality

One of the hard problems in machine learning is dimensionality, that is, the number of features. The expression "curse of dimensionality" was coined by Bellman in 1961 to describe the fact that many algorithms that work fine in low dimensions become intractable when the input has many features. Generalizing correctly becomes exponentially harder as the dimensionality of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space: each combination of feature values is represented by very few examples, if any. In numerical terms, even with a moderate dimension of 100 binary features and a huge training set of a trillion examples, the training set covers only a minuscule fraction, on the order of 10^-18, of the input space. This is what makes machine learning both essential and hard.
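
To make that coverage figure concrete, here is a quick back-of-the-envelope check in Python (a minimal sketch, assuming 100 binary features, so the input space has 2^100 possible examples):

n_features = 100
input_space_size = 2 ** n_features        # about 1.27e30 possible binary inputs
training_set_size = 10 ** 12              # one trillion examples
coverage = training_set_size / input_space_size
print(f"fraction of input space covered: {coverage:.1e}")  # roughly 8e-19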

The similarity-based reasoning that many machine learning algorithms depend on breaks down in high dimensions. For simplicity, consider a nearest-neighbor classifier with Hamming distance as the similarity measure, and suppose the class is determined by just two features, x1 and x2. This is an easy problem, but if there are 98 additional irrelevant features x3, ..., x100, the noise from them completely swamps the signal in x1 and x2, and nearest neighbor effectively makes random predictions. Even more disturbing, nearest neighbor still has a problem when all 100 features are relevant, because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example x-test. If the grid is N-dimensional, x-test's 2N nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of x-test, until the choice of nearest neighbor is effectively random.
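
The effect is easy to reproduce. The sketch below (assuming numpy and scikit-learn are installed; the data generator and sample sizes are illustrative, not from the original) labels examples with x1 AND x2 and measures 1-nearest-neighbor accuracy under Hamming distance as irrelevant binary features are added:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_data(n_samples, n_irrelevant):
    # Binary features; only the first two determine the class.
    X = rng.integers(0, 2, size=(n_samples, 2 + n_irrelevant))
    y = X[:, 0] & X[:, 1]
    return X, y

for n_irrelevant in (0, 10, 98):
    X, y = make_data(2000, n_irrelevant)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=1, metric="hamming", algorithm="brute")
    knn.fit(X_tr, y_tr)
    print(n_irrelevant, "irrelevant features -> accuracy:", round(knn.score(X_te, y_te), 3))

As the irrelevant features pile up, accuracy should fall toward the base rate of the majority class.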

This is only one instance of a more general problem with high dimensions: intuition that comes from a three-dimensional world often does not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean but in an increasingly distant shell around it; most of the volume of a high-dimensional fruit is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all of the hypercube's volume lies outside the hypersphere. This is unpleasant news for machine learning, where shapes of one type are often approximated by shapes of another. Building a classifier in two or three dimensions is easy; we can find a reasonable frontier between examples of different classes just by visual inspection. It has even been said that if people could see in high dimensions, machine learning would not be necessary. But in high dimensions it is hard to understand what is happening, and that in turn makes it difficult to design a good classifier or model. Naively, one might think that gathering more features never hurts, since at worst they provide no new information about the class; in fact, their benefits may be outweighed by the curse of dimensionality.
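
The hypersphere-versus-hypercube claim can be checked directly from the standard closed-form volume formula; the Python sketch below (dimensions chosen purely for illustration) prints the fraction of a hypercube's volume occupied by its inscribed hypersphere:

import math

def inscribed_sphere_fraction(d):
    # Volume of the unit-radius d-ball divided by that of its bounding cube (side 2).
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube = 2.0 ** d
    return ball / cube

for d in (2, 3, 10, 20, 50):
    print(f"d = {d:2d}: inscribed sphere fills {inscribed_sphere_fraction(d):.2e} of the cube")

The fraction collapses toward zero as d grows, which is exactly the "almost all the volume is outside the hypersphere" statement above.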

So what's the way out?

Fortunately, there is an effect that partly counteracts the curse of dimensionality, which might be called the blessing of non-uniformity. In most applications, examples are not spread uniformly throughout the space but are concentrated on or near a lower-dimensional manifold. For example, k-nearest neighbor works quite well for handwritten digit recognition even though images of digits have one dimension per pixel, because the space of digit images is much smaller than the space of all possible images. Learners can implicitly take advantage of this lower effective dimension, or algorithms for explicitly reducing the dimensionality can be used.
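
As an illustrative sketch of explicit dimensionality reduction (using scikit-learn's bundled 8x8 digits dataset as a stand-in for the digit-recognition example; the 90% threshold is an arbitrary choice), principal component analysis shows that far fewer than 64 pixel dimensions are needed to capture most of the variance:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)            # 1797 images, 64 pixel features each
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"{n_components_90} of {X.shape[1]} components explain 90% of the variance")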

Ensemble techniques, which add model diversity, also help mitigate the problem of building effective detection or scoring models in high dimensions. Simply injecting randomness into an ensemble often lets it outperform its base model. In a k-nearest-neighbor model, for instance, varying k changes the topological neighborhood defined by the data points, providing different lenses through which to view predictor-outcome relationships and smoothing the implied regression function. An ensemble built this way effectively averages over randomly chosen topological neighborhoods. The same reasoning applies to a random forest, which is an ensemble of decision trees.
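
A rough sketch of the idea (not the exact setup described above; the synthetic dataset and parameter choices below are made up for illustration) averages the class probabilities of k-NN models trained with different values of k, and compares the result with a random forest, which averages over randomized trees:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each k defines a different topological neighborhood; averaging their
# predicted probabilities acts like an ensemble over neighborhoods.
ks = (1, 3, 5, 9, 15)
probas = np.mean([KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).predict_proba(X_te)
                  for k in ks], axis=0)
knn_ensemble_acc = np.mean(probas.argmax(axis=1) == y_te)

forest_acc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"k-NN ensemble accuracy: {knn_ensemble_acc:.3f}")
print(f"random forest accuracy: {forest_acc:.3f}")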

These are some thoughts on mitigating accuracy issues in higher-dimensional machine learning models.
