The Curse of Dimensionality – Behind the Scenes

Originally posted on the Dataloop AI Blog by Shira Navot

Imagine yourself in the middle of telling someone a long story, or struggling through an explanation of something complicated, when the other person looks at you and asks, “What’s the point?”

We can relate to that person’s need: all they really want is for us to deliver the information quickly and to the point. This problem is the essence of what’s known as “the curse of dimensionality”.

We live in a world of too much information. The same holds true in machine learning. There is almost always a huge set of potential features we can use to make our prediction. When confronted with a massive amount of data that has a high number of characteristics, we can use dimensionality reduction algorithms to make the data “get to the point”.

High-Dimensional Data and the Problem It Holds

“The curse of dimensionality” sounds like an intimidating phrase, but what it actually refers to is when your data has too many features.

Many texts on statistics, machine learning, and computer science reference the “curse of dimensionality”, a phrase attributed to the applied mathematician Richard Bellman. It refers to the phenomenon that each added dimension exponentially increases the volume of the space the data can occupy, so with a fixed number of observations, high-dimensional spaces leave the available data sparse.
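
To make that sparsity concrete, here is a minimal Python sketch. The post names no tooling, so numpy and the specific numbers here are my own choices: it samples points uniformly in the unit hypercube and checks how few of them land in a central sub-cube as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 100_000

# The central sub-cube [0.25, 0.75]^d covers a (0.5)^d fraction of the unit
# hypercube, so the share of points falling inside it collapses with d.
for d in (1, 2, 5, 10, 20):
    x = rng.uniform(size=(n_points, d))
    inside = np.all((x > 0.25) & (x < 0.75), axis=1).mean()
    print(f"d={d:2d}  fraction of points in the central sub-cube: {inside:.5f}")
```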

Let’s take an example from our world. Think of an input image fed into a deep neural network: each layer extracts features at a different scale, so that at some point the image is represented as a 2048-sized vector of features. Which of these features are the most important? Which contribute most to the structure underlying our data? Luckily, neural networks are built in a way that lets important features be extracted from these high-dimensional spaces.
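
The 2048-dimensional vector matches the penultimate layer of a ResNet-50. The post doesn’t name a framework, so the sketch below is only an assumption that PyTorch and torchvision are in play; it simply strips the classification head to expose those features.

```python
import torch
from torchvision import models

# Load a pretrained ResNet-50 and drop its final classification layer,
# leaving the part of the network that outputs 2048-dimensional features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

# A random tensor stands in for a preprocessed 224x224 RGB image.
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = feature_extractor(image).flatten(1)

print(features.shape)  # torch.Size([1, 2048])
```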

As much as we would like to, we can’t use all of the features. We just don’t have enough observations. Using thousands of features to fit a model when we have significantly fewer observations of the target variable is a bad idea: we run the risk of massively overfitting our model, which generally results in terrible out-of-sample performance in real-world scenarios.
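
A quick way to see this risk is to fit an ordinary linear regression with far more features than observations. The sketch below is a minimal illustration, assuming scikit-learn and purely synthetic data: the training fit looks perfect while held-out performance collapses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n_train, n_test, n_features = 50, 200, 1000  # far more features than rows

# Only the first 5 features carry any signal; the other 995 are pure noise.
true_coef = np.zeros(n_features)
true_coef[:5] = 3.0

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = X_train @ true_coef + rng.normal(size=n_train)
y_test = X_test @ true_coef + rng.normal(size=n_test)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:", round(model.score(X_train, y_train), 3))  # ~1.0: fits the noise
print("test  R^2:", round(model.score(X_test, y_test), 3))    # much worse out of sample
```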

When we have too many features, observations also become harder to cluster. Too many dimensions cause every observation in your dataset to appear roughly equidistant from all the others. And because clustering uses a distance measure such as Euclidean distance to quantify the similarity between observations, this is a major issue. If the distances are all approximately equal, then all the observations appear equally alike (or equally different), and no meaningful clusters can be formed.
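
This concentration of distances is easy to reproduce. The sketch below (numpy and scipy, with uniform random data standing in for real observations) tracks the ratio between the smallest and largest pairwise Euclidean distance as the dimension grows; it drifts toward 1.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
n_points = 500

# As the dimension grows, the nearest and farthest pairwise distances converge,
# so a distance-based notion of similarity stops discriminating between points.
for d in (2, 10, 100, 1000):
    x = rng.uniform(size=(n_points, d))
    dists = pdist(x)  # all pairwise Euclidean distances
    print(f"d={d:4d}  min/max distance ratio: {dists.min() / dists.max():.3f}")
```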

[Image: Clusters formed using different dimensionality reduction techniques]

Without domain expertise and modeling experience, it can be hard to decide which features are truly worth keeping. This is where dimensionality reduction algorithms come into play, helping us transform our set of thousands of features into an ideal set of features.

The Ideal Set of Features

If high-dimensional data has an underlying low-dimensional structure, then that lower-dimensional structure can be extracted and explored. This is the concept behind traditional feature extraction and feature engineering methods. In practice, feature extraction may employ linear combinations, projections, and transformations to derive new, lower-dimensional sets of features with which to model.
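
As one concrete instance of such a linear projection (not necessarily what any particular pipeline uses), the sketch below applies scikit-learn’s Gaussian random projection to synthetic 2048-dimensional data, mapping it onto 64 linear combinations of the original features.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2048))  # 1,000 observations of 2,048 features

# Each new feature is a random linear combination of the original ones.
projector = GaussianRandomProjection(n_components=64, random_state=0)
X_low = projector.fit_transform(X)

print(X.shape, "->", X_low.shape)  # (1000, 2048) -> (1000, 64)
```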

If we could create an ideal set of features from scratch, then we should think about some main properties that this set would have:

High Variance

Variance is basically how much the data bounces up and down. The more bouncy it is, the harder it is to predict or explain. Data with no variance has no uncertainty, so there is nothing for us to predict. Features with a high variance contain a lot of potential signals, which is a basic requirement for building a good and robust model.
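
One blunt way to act on this property is a variance filter. The sketch below (scikit-learn’s VarianceThreshold, synthetic data, and an arbitrarily chosen threshold) discards features that barely move at all.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
n_samples = 500

# Three spread-out features plus two that are almost constant.
informative = rng.normal(scale=2.0, size=(n_samples, 3))
near_constant = 5.0 + rng.normal(scale=0.01, size=(n_samples, 2))
X = np.hstack([informative, near_constant])

# Keep only features whose variance clears the (arbitrary) threshold.
selector = VarianceThreshold(threshold=0.1)
X_kept = selector.fit_transform(X)
print("kept features:", selector.get_support())  # the near-constant columns are dropped
```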

Uncorrelated

Features that are highly correlated with each other are less useful and in certain cases downright harmful (when the correlation is so high as to cause multicollinearity).
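
A common, if simple, way to act on this is to scan the correlation matrix and drop one member of each highly correlated pair. The sketch below (pandas and numpy, synthetic columns, and a 0.95 cutoff chosen purely for illustration) shows the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
base = rng.normal(size=500)

# "feat_b" is almost an exact copy of "feat_a"; "feat_c" is independent.
df = pd.DataFrame({
    "feat_a": base,
    "feat_b": base + rng.normal(scale=0.01, size=500),
    "feat_c": rng.normal(size=500),
})

corr = df.corr().abs()

# Look only at the upper triangle so each pair is counted once, then drop
# any feature that correlates above the cutoff with an earlier feature.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("dropping:", to_drop)  # ['feat_b']
```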

Not Too Many

We wish to have a low number of features relative to our number of target variable observations. Too many features relative to observations would result in an overfit model that performs poorly out of sample.

[Image: Principal Component Analysis (PCA) example for three final components]
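
To mirror the figure, here is a minimal PCA sketch with scikit-learn; the data is synthetic and the choice of three components simply matches the caption, since the code behind the original plot isn’t shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 300 observations of 50 correlated features built from 3 hidden factors.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + rng.normal(scale=0.1, size=(300, 50))

# Standardize, then keep the three directions of highest variance.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
components = pca.fit_transform(X_std)

print(components.shape)                        # (300, 3)
print(pca.explained_variance_ratio_.round(3))  # most of the variance survives
```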

Summed Up

Today, we have seen how having too many features can be harmful to machine learning models. In the real world of noisy, messy data and complicated relationships, dimensionality reduction represents a trade-off between a more robust model and lower interpretability.

Stay tuned for the second part of this series, where I’ll discuss the tools needed for combating the curse of dimensionality and explore practical algorithms for dimensionality reduction.
