The Curse of Dimensionality in Machine Learning
SmartSoC Solutions Pvt Ltd
Silicon to System- Your Turnkey Partner for Complete Engineering Solutions
The Curse of Dimensionality arises when the number of features or variables in a dataset becomes too large. This can lead to a range of issues, including multicollinearity, overfitting, and high computational complexity, all of which can degrade the accuracy and reliability of machine learning models.
The Curse of Dimensionality (CoD) refers to the challenges encountered when handling high-dimensional data that do not arise in low-dimensional spaces. These difficulties relate to the data's sparsity and to how "closeness" between data points is measured during classification, organization, and analysis.
To understand this mathematically, let's consider an example where we have a binary classification problem with only two features (or dimensions): x1 and x2. Let's assume that we have a dataset of N samples. We want to fit a linear classifier to this dataset in order to predict the class label of new, unseen samples.
The linear classifier can be represented by a linear equation of the form:
y = w1·x1 + w2·x2 + b
where w1 and w2 are the weights associated with the two features, and b is the bias.
The goal of the learning algorithm is to learn the values of the weights and bias that minimize the classification error on the training data.
Now, let's say we add a third feature, x3, to the dataset. The linear classifier becomes y = w1·x1 + w2·x2 + w3·x3 + b. To learn the new weights w1, w2, w3 and the bias term b, we must estimate four parameters instead of three, which means we need more data to estimate them accurately. In general, the amount of data required to cover the feature space grows exponentially with the number of features.
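To make this concrete, here is a minimal sketch (not from the original article) that fits a linear classifier with scikit-learn on synthetic data and counts the parameters to be estimated as features are added; the synthetic data and the choice of logistic regression are illustrative assumptions.

```python
# Sketch: counting learned parameters (weights + bias) of a linear
# classifier as the number of features grows. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N = 200  # number of training samples (arbitrary choice)

for d in (2, 3, 10):
    X = rng.normal(size=(N, d))               # d features: x1, x2, ..., xd
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary labels
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    n_params = clf.coef_.size + clf.intercept_.size  # weights + bias
    print(f"d = {d:2d} features -> {n_params} parameters to estimate")
```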
More formally, the number of unique points required to uniformly sample a unit hypercube of dimensionality d is given by:
N = (1/ε)^d
where ε is the desired distance between neighbouring points.
Fig. 1: Data sparsity increases as the number of features grows.
As the illustration shows, the number of points required grows exponentially with d. This means that as the dimensionality of the data increases, the amount of data required to avoid overfitting and multicollinearity also increases exponentially.
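A short sketch of the formula above, with ε = 0.1 as an assumed spacing, shows how quickly the required number of points grows:

```python
# Sketch: points needed to sample a unit hypercube at spacing epsilon,
# using N = (1/epsilon)^d from the formula above.
epsilon = 0.1  # desired distance between neighbouring points (assumed value)

for d in (1, 2, 3, 5, 10):
    n_points = (1 / epsilon) ** d
    print(f"d = {d:2d} -> {n_points:.0e} points required")
```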
Why is it a problem?
CoD can lead to overfitting, increased computational complexity, multicollinearity, and difficulty in visualizing and interpreting the data.
Challenges in Model Building:
CoD poses several challenges in model building. One of the primary challenges is feature selection: in high-dimensional spaces it is essential to select the most relevant features to avoid overfitting and improve the generalization performance of the model, yet identifying those features becomes harder as their number grows.
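As one illustrative option (not prescribed in the article), univariate feature selection with scikit-learn's SelectKBest keeps only the most informative features; the synthetic dataset and the choice of k below are assumptions.

```python
# Sketch: keep the k most informative features out of 100 with SelectKBest.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic high-dimensional data: 100 features, only 5 informative.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)   # (500, 100) -> (500, 10)
```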
Model Selection Issue:
Another challenge is model selection: the performance of different models can vary significantly in high-dimensional spaces, so it is essential to choose an appropriate model for the problem at hand.
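Cross-validation is one common way to compare candidate models; the sketch below is illustrative, and the synthetic data and the two models compared are assumptions.

```python
# Sketch: compare two candidate models on high-dimensional data
# using 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=5, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("k-nearest neighbours", KNeighborsClassifier())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```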
Overfitting:
CoD also leads to overfitting, where an ML model becomes too complex and fits the noise in the data instead of the underlying patterns or relationships. This typically occurs when the number of features in the dataset is too large relative to the number of training samples.
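A small sketch (with assumed synthetic data) shows the symptom: as features are added while the sample count stays fixed, training accuracy climbs while test accuracy does not keep up.

```python
# Sketch: overfitting as feature count grows with a fixed number of samples.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

for d in (5, 50, 500):
    X, y = make_classification(n_samples=100, n_features=d,
                               n_informative=5, n_redundant=0,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    print(f"d = {d:3d}: train = {clf.score(X_tr, y_tr):.2f}, "
          f"test = {clf.score(X_te, y_te):.2f}")
```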
Fig. 2: Dimensionality reduction from 3D to 2D.
To overcome CoD, we can use feature selection together with dimensionality reduction methods such as Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). These methods help identify the most important features and reduce the dimensionality of the data.
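For example, here is a minimal PCA sketch with scikit-learn; the digits dataset and the 95% explained-variance threshold are assumed choices for illustration.

```python
# Sketch: reduce 64-dimensional digit images with PCA, keeping enough
# components to explain 95% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64-dimensional digit images
pca = PCA(n_components=0.95)          # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```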
Author: Saurabh Chakraborty