Curse of Dimensionality
Utkarsh Sharma
SME & Manager | SAP Certified Application Associate | Certified Data Scientist | Intel Certified Machine Learning Instructor | Mentor
Yes, data scientists and the wider data-handling community really do suffer from this well-known curse. So, is it actually a curse, or just a fancy concept made up by some author? Let’s first understand the concept of dimensionality.
In the simple context of a table containing some data, the dimension is the number of columns in that table. So, what does a column in a table represent? A column represents a property of the record or entity in question. For example, if a table stores information about students, the columns will be the possible attributes of a student, such as Enroll no., Name, Age, Branch, etc. The attributes, or columns, make it easy to differentiate one student from another; an important property is that every record in the table should be a unique combination of attribute values.
Now a question arises: what is the optimal number of attributes required to represent an entity properly? The answer is that there is no fixed number of dimensions that works for every dataset; the right number depends on the entity and the problem statement you are dealing with.
But what is the problem if we have a large number of dimensions? Would that not be more helpful in describing the entities? Let’s discuss this point with the help of the clustering problem. In clustering, we intend to group records into clusters based on the similarity of their characteristics or attributes. And how do we do that? We calculate the distance between records based on how similar or different their attribute values are. Suppose I have 10 attributes in my table and, based on those attributes, I group my data points into clusters. Now, if I add one more column or attribute to my data, it may happen that some records which were totally different from one another share the same value for this newly added attribute.
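To make this concrete, here is a minimal sketch (not from the original article) showing why distances become less useful as dimensions grow. It assumes uniformly random points and plain Euclidean distance; the specific choices of 100 points and dimensions 2, 10, 100 and 1000 are arbitrary. As the dimension increases, the gap between the nearest and the farthest pair of records shrinks relative to the distances themselves, so "similar" and "dissimilar" records become harder to tell apart.

```python
# Illustration: distance contrast shrinks as the number of attributes grows.
# Assumes uniformly random records and Euclidean distance (an assumption for
# this sketch, not a statement about any particular dataset).
import numpy as np

rng = np.random.default_rng(seed=42)

for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(100, dim))           # 100 records with `dim` attributes
    # Pairwise Euclidean distances between all records
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = dists[np.triu_indices(100, k=1)]        # unique pairs only
    contrast = (upper.max() - upper.min()) / upper.min()
    print(f"dim={dim:4d}  (farthest - nearest) / nearest = {contrast:.3f}")
```

Running this prints a contrast value that drops sharply as `dim` increases, which is the behaviour a distance-based clustering algorithm struggles with.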
And imagine if I add 100 more columns to my dataset: think how difficult it becomes for any machine learning algorithm to compute a meaningful distance between two entities. This problem is what is termed the curse of dimensionality. There are several ways to deal with it; some of them are listed below (a small sketch of the first one follows the list):
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
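As a small sketch of the first remedy, the snippet below (not part of the original article) applies PCA from scikit-learn to a synthetic dataset; the choice of PCA, the synthetic data, and `n_components=10` are all assumptions made purely for illustration, as many other dimensionality-reduction techniques exist.

```python
# Sketch of dimensionality reduction with PCA on synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 100))              # 500 records with 100 attributes

pca = PCA(n_components=10)                   # keep the 10 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (500, 10): far fewer columns to compare
print(pca.explained_variance_ratio_.sum())   # fraction of the original variance retained
```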
We always need a good balance in the number of attributes: it should be neither so large that analysis becomes cumbersome, nor so small that we cannot capture the complete properties of the entity.