Curse of Dimensionality
Utkarsh Sharma
SME & Manager | SAP Certified Application Associate | Certified Data Scientist | Intel Certified Machine Learning Instructor | Mentor
Yes, data scientists and the wider data-handling community really do suffer from this well-known curse. So, is it actually a curse, or just a fancy concept made up by some author? Let’s first understand the concept of dimensionality.
In the simple context of a table containing some data, the dimension is the number of columns in that table. So, what does a column in a table represent? A column represents a property of the record or entity in question. For example, if a table stores information about students, the columns will be the possible attributes of a student, such as Enroll no., Name, Age, Branch, etc. The attributes, or columns, make it easy to differentiate one student from another; an important property is that every record in the table should be a unique combination of attribute values.
Now a question arises: what is the optimal number of attributes required to represent an entity properly? The answer is that there is no fixed number of dimensions that works for every dataset; the right number depends on the entity and the problem statement you are dealing with.
But what is the problem if we have a large number of dimensions? Would that not be more helpful in describing the entities? Let’s discuss this point with the help of the clustering problem. In clustering, we intend to group records into clusters based on the similarity of their characteristics or attributes. And how do we do that? We calculate the distance between records based on how similar or different their attribute values are. Suppose I have 10 attributes in my table and, based on those attributes, I group my data points into clusters. Now, if I add one more column or attribute to my data, it may happen that some records which were totally different from one another share the same value for this newly added attribute.
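To make this concrete, here is a minimal sketch (not from the original article) showing why distances become less useful as dimensions grow. It assumes uniformly random points and plain Euclidean distance; the specific choices of 100 points and dimensions 2, 10, 100 and 1000 are arbitrary. As the dimension increases, the gap between the nearest and the farthest pair of records shrinks relative to the distances themselves, so "similar" and "dissimilar" records become harder to tell apart.

```python
# Illustration: distance contrast shrinks as the number of attributes grows.
# Assumes uniformly random records and Euclidean distance (an assumption for
# this sketch, not a statement about any particular dataset).
import numpy as np

rng = np.random.default_rng(seed=42)

for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(100, dim))           # 100 records with `dim` attributes
    # Pairwise Euclidean distances between all records
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = dists[np.triu_indices(100, k=1)]        # unique pairs only
    contrast = (upper.max() - upper.min()) / upper.min()
    print(f"dim={dim:4d}  (farthest - nearest) / nearest = {contrast:.3f}")
```

Running this prints a contrast value that drops sharply as `dim` increases, which is the behaviour a distance-based clustering algorithm struggles with.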
And imagine if I add 100 more columns to my dataset: think how difficult it becomes for any machine learning algorithm to compute a meaningful distance between two entities. This problem is what is termed the curse of dimensionality. There are several ways to deal with it; some of them are listed below (a small sketch of the first one follows the list):
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
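As a small sketch of the first remedy, the snippet below (not part of the original article) applies PCA from scikit-learn to a synthetic dataset; the choice of PCA, the synthetic data, and `n_components=10` are all assumptions made purely for illustration, as many other dimensionality-reduction techniques exist.

```python
# Sketch of dimensionality reduction with PCA on synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 100))              # 500 records with 100 attributes

pca = PCA(n_components=10)                   # keep the 10 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (500, 10): far fewer columns to compare
print(pca.explained_variance_ratio_.sum())   # fraction of the original variance retained
```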
We always need a good balance in the number of attributes: it should be neither so large that analysis becomes cumbersome, nor so small that we cannot capture the complete properties of the entity.