Machine Learning: Dimensionality
Bhalchandra (Bhal) Madhekar
Software Engineering Leader | Big Data | Advanced Analytics | AI/ML | Cloud Technologies | SaaS / PaaS | KaggleX BIPOC Mentor | GitHub.com/madhekar
One of the tough problems in machine learning is dimensionality, that is, the number of features. Bellman coined the term "curse of dimensionality" in 1961 to describe the fact that many algorithms that work fine in low dimensions become intractable when the input has many features. Generalizing correctly becomes exponentially harder as the dimensionality of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space with each added dimension. Put differently, the number of training examples available for each combination of feature values becomes vanishingly small. In numerical terms, even with a moderate dimension of 100 (say, binary features) and a huge training set of a trillion examples, the training set covers only a fraction of about 10^-18 of the input space. This is what makes machine learning both essential and hard.
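As a sanity check on that arithmetic, here is a minimal Python sketch (my own illustration), assuming binary features so that a d-dimensional input space contains 2^d possible examples:

```python
from fractions import Fraction

def coverage_fraction(n_examples: int, n_features: int) -> Fraction:
    """Fraction of a binary input space (2**n_features possible points)
    that n_examples distinct training examples can cover at most."""
    return Fraction(n_examples, 2 ** n_features)

# A trillion examples in a 100-dimensional binary space:
frac = coverage_fraction(10**12, 100)
print(float(frac))  # ~7.9e-19, i.e. on the order of 10**-18
```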
The similarity-based reasoning that machine learning algorithms depend on breaks down in high dimensions. For simplicity, consider a nearest neighbor classifier with Hamming distance as the similarity measure, and suppose the class is just a function of two features, say x1 AND x2. This is an easy problem, but if there are 98 additional irrelevant features x3, ..., x100, the noise from them completely swamps the signal in x1 and x2, and nearest neighbor effectively makes random predictions. Even more disturbing, nearest neighbor has a problem even if all 100 features are relevant! This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example x_test. If the grid is N-dimensional, x_test's 2N nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of x_test, until the choice of nearest neighbor is effectively random.
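Here is a small NumPy simulation of that setup (a sketch, with training and test sizes chosen arbitrarily): the label depends only on x1 AND x2, yet 1-NN accuracy under Hamming distance decays toward the base rate as irrelevant binary features are added.

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_accuracy(n_irrelevant: int, n_train: int = 500, n_test: int = 200) -> float:
    """1-nearest-neighbor accuracy (Hamming distance) when the label
    is x1 AND x2 and all remaining features are random noise."""
    d = 2 + n_irrelevant
    X_train = rng.integers(0, 2, size=(n_train, d))
    X_test = rng.integers(0, 2, size=(n_test, d))
    y_train = X_train[:, 0] & X_train[:, 1]
    y_test = X_test[:, 0] & X_test[:, 1]

    correct = 0
    for x, y in zip(X_test, y_test):
        # Hamming distance = number of disagreeing features
        dists = (X_train != x).sum(axis=1)
        correct += y_train[dists.argmin()] == y
    return correct / n_test

for n in (0, 10, 98):
    print(f"{n:3d} irrelevant features -> accuracy {nn_accuracy(n):.2f}")
```

With no irrelevant features every test point has an exact training duplicate and accuracy is essentially perfect; with 98 it sinks toward what you would get by guessing.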
This is only one instance of a more general problem with high dimensions: intuitions that come from a three-dimensional world often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean but in an increasingly distant "shell" around it, and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

Building a classifier in two or three dimensions is easy; we can find a reasonable frontier between examples of different classes just by visual inspection. It has even been said that if people could see in high dimensions, machine learning would not be necessary. But in high dimensions it is hard to understand what is happening, which in turn makes it difficult to design a good classifier or model. One might naively think that gathering more features never hurts, since at worst they provide no new information about the class. In fact, their benefits may be outweighed by the curse of dimensionality.
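The hypersphere claim above is easy to check numerically. The sketch below uses the standard closed-form volume of a unit d-ball, pi^(d/2) / Gamma(d/2 + 1), inscribed in a hypercube of side 2:

```python
import math

def inscribed_ball_fraction(d: int) -> float:
    """Volume of the unit d-ball divided by the volume of the
    hypercube of side 2 that circumscribes it."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube = 2.0 ** d
    return ball / cube

for d in (2, 3, 10, 50):
    print(f"d={d:2d}: sphere fills {inscribed_ball_fraction(d):.2e} of the cube")
```

The fraction falls from about 0.79 in two dimensions to roughly 2.5e-3 at d=10 and is astronomically small by d=50.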
So what's the way out…?
Fortunately, there is an effect that partly counteracts the curse, which might be called the "blessing of non-uniformity." In most applications, examples are not spread uniformly throughout the space but are concentrated on or near a lower-dimensional manifold. For example, k-nearest neighbor works quite well for handwritten digit recognition even though images of digits have one dimension per pixel, because the space of digit images is much smaller than the space of all possible images. Learners can implicitly take advantage of this lower effective dimension, or algorithms that explicitly reduce the dimensionality can be used.
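As one concrete illustration of explicit dimensionality reduction (a sketch, not the only option), the scikit-learn snippet below projects the 64-pixel digits data onto 20 principal components with PCA before running k-NN; the component count and neighbor count are arbitrary choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64 pixel features per image
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Project the 64 pixel dimensions onto a lower-dimensional subspace,
# then classify with k-NN in that reduced space.
model = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```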
Ensemble techniques, which increase model diversity, also help mitigate the problem. Simply injecting randomness into an ensemble can make it outperform its base model. In a k-NN ensemble, for example, varying k varies the topological neighborhood defined by the data points, providing different lenses through which to view the predictor-outcome relationship and smoothing the resulting regression function. An ensemble built this way effectively averages over randomly chosen neighborhoods. The same reasoning applies to the random forest, which is an ensemble of decision trees.
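Below is a minimal sketch of that idea (my own illustration, with member k values and dataset parameters chosen arbitrarily): an ensemble that averages the predicted class probabilities of k-NN members with different neighborhood sizes, compared against a single k-NN on synthetic data with many uninformative features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each ensemble member views the data through a different neighborhood size k.
ks = [1, 3, 5, 9, 15, 25]
probas = []
for k in ks:
    member = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    probas.append(member.predict_proba(X_test))

# Average the members' probabilities and take the most likely class.
ensemble_pred = np.mean(probas, axis=0).argmax(axis=1)
print("ensemble accuracy:", (ensemble_pred == y_test).mean())

single = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("single 5-NN accuracy:", single.score(X_test, y_test))
```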
These are some thoughts on mitigating the accuracy issues that arise in high-dimensional machine learning models.
References:
- https://en.wikipedia.org/wiki/Curse_of_dimensionality
- https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
- https://en.wikipedia.org/wiki/Ensemble_learning