The curse and cure of dimensionality
A daily task for data scientists is to answer questions or derive new insights from a body of data they’ve been given. They typically wonder: Do I have sufficient data attributes to answer the question I’m interested in? Too few attributes? Too many? Hence, a lot of data pre-processing revolves around a concept called “dimensionality.”
This term may sound complicated. But dimensionality is relevant to a lot of decisions we make in daily life, like how long a road trip will take, just as much as it is to the kinds of business decisions that many Digitate customers make.
Data scientists spend a sizable share of their time on data preparation – primarily working with the features in the data. Features refer to attributes or columns present in the data set. The number of such features is known as the dimensionality of a data set.
Consider a data set containing details about employees. It might have columns such as role, department, location, tenure, address, and so forth. These columns are considered features. They play a vital role in tasks such as finding user segments, detecting anomalies, and predicting future events.
Features act as inputs to a machine learning (ML) algorithm. Most of these algorithms work by establishing a relationship between the features and the target values we wish to predict. Naturally, relevant and adequate features increase the algorithm’s predictive power.
Suppose we want to train an ML algorithm to predict the time it will take a user to travel by road from one city to another – let’s say Pune to Mumbai. There could be a multitude of determining factors, such as distance, road conditions, weather, or the vehicle used. Every one of these factors could be a feature for our algorithm. For example, the larger the distance, the longer the trip takes. Or the better the road conditions, the faster we travel.
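To make this concrete, here is a minimal sketch of how such features might feed a prediction model, written in Python with pandas and scikit-learn. The column names and sample values are invented for illustration only; this is not Digitate’s data or pipeline.

```python
# Minimal sketch (illustrative data): each row is one past trip, each column a
# feature, and the target is the observed travel time in minutes.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

trips = pd.DataFrame({
    "distance_km":    [148, 150, 152, 149, 151],   # hypothetical feature names
    "road_condition": [3, 4, 2, 5, 3],              # 1 = poor ... 5 = excellent
    "rain_mm":        [0, 12, 0, 3, 20],
    "travel_minutes": [165, 190, 175, 150, 210],    # target we want to predict
})

features = trips.drop(columns="travel_minutes")
target = trips["travel_minutes"]

model = RandomForestRegressor(random_state=0).fit(features, target)

# Predict a new Pune-to-Mumbai trip from its feature values.
new_trip = pd.DataFrame({"distance_km": [150], "road_condition": [4], "rain_mm": [5]})
print(model.predict(new_trip))
```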
The curse of dimensionality
In ML algorithms, having very few features may limit our understanding of the problem and in turn limit our ability to predict the outcome. That might make you think the more features we have, the better our accuracy will be.
But this is where the curse of dimensionality kicks in! If you have a large number of features in your data (i.e., high dimensionality), the data becomes too difficult to visualize. It can lead to increased execution time, and it confuses the algorithm, reducing accuracy.
Let’s look at both cases to understand this dilemma better.
Low dimensionality: Consider predicting the travel time between two cities with just two features: the start point and the end point of the journey. Naturally, this results in inadequate prediction accuracy because it does not take traffic, weather, road conditions, or other factors into consideration. If we do not have enough features in a data set, any algorithm we use will have an incomplete picture of the problem, resulting in low-accuracy predictions.
High dimensionality: The natural course of action to fix our problem might be to add more features! So we add temperature, humidity, wind speed, terrain, and road conditions. And we could add even more, such as distance in miles, distance in kilometers, the color of the vehicle, the registration number, or the name of the driver! As you can guess, too many features not only introduce additional noise into the data, they also carry the risk of inaccurate predictions. Imagine the algorithm deriving a pattern that passengers in red cars with even-numbered registration plates tend to drive too slowly!
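You can see this effect in a small synthetic experiment. The sketch below is my own illustration (not something from the article): it pads a data set with columns of pure random noise and checks cross-validated accuracy as the noise columns pile up. With a fixed amount of training data, accuracy typically drifts downward.

```python
# Illustrative experiment: adding random, irrelevant features to a fixed-size
# training set tends to hurt a model's cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# A synthetic regression problem with 5 genuinely informative features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rng = np.random.default_rng(0)
for n_noise in (0, 50, 500):
    # Append n_noise columns of pure noise that carry no signal about y.
    X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], n_noise))]) if n_noise else X
    score = cross_val_score(RandomForestRegressor(random_state=0), X_noisy, y, cv=5).mean()
    print(f"{n_noise:>3} noise features -> mean R^2 = {score:.2f}")
```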
Other factors also play a role in determining prediction accuracy, such as the nature of the test and training data sets, any underlying bias in the data, and the classification or regression algorithms used. Feature engineering is one of the most effective tools in a data scientist’s kit for overcoming these issues. Finding the right balance and picking the right features becomes a very important first step.
The cure: Sift out irrelevant features
Let’s first discuss the case where we have too many features. Often, many of the features are either irrelevant or redundant to the problem at hand. For instance, in our example, the color of the vehicle is likely irrelevant. Redundant features, meanwhile, present the same information in different formats, such as distance in miles and distance in kilometers.
Here are some tricks of the trade to winnow them down.
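One common trick, for example, is to drop a feature that is almost perfectly correlated with another one, the way distance in miles duplicates distance in kilometers. Below is a minimal sketch of that idea with invented column names; real pipelines would typically combine it with other filters, such as removing near-constant columns.

```python
# Sketch of one winnowing trick: drop numeric columns that are near-duplicates
# (very high absolute correlation) of an earlier column.
import pandas as pd

def drop_redundant(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = df.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i, col_a in enumerate(cols):
        if col_a in to_drop:
            continue
        for col_b in cols[i + 1:]:
            if col_b not in to_drop and corr.loc[col_a, col_b] > threshold:
                to_drop.add(col_b)
    return df.drop(columns=list(to_drop))

# Example: distance_km and distance_miles carry the same information.
trips = pd.DataFrame({
    "distance_km":    [148, 150, 152, 149],
    "distance_miles": [92.0, 93.2, 94.4, 92.6],
    "rain_mm":        [0, 12, 0, 3],
})
print(drop_redundant(trips).columns.tolist())  # distance_miles is dropped
```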
Dig deep in the toolbox when you have too few features
In contrast to the many ways of reducing an abundance of features, it’s tricky to work with too few. Your options are more limited.
Digitate’s take
Digitate data scientists work with a variety of use cases, such as process prediction, transaction anomaly detection, and change management, among others. All of these use cases involve a variety of data sets, such as time series, events, and sets. Almost all of these data sets require some sort of feature engineering in order to yield useful insights.
Our award-winning AIOps suite, ignio, works with a wide range of ML problems across multiple domains. This has helped ignio “learn” how to excel at both ends of the dimensionality spectrum. It leverages a unique blend of domain-specific knowledge paired with generalized data-wrangling tricks and techniques.
In some time series prediction use cases, it sees data with the bare minimum of features, such as entity name, timestamp, and value. In such situations, ignio uses feature decomposition and feature derivation to expand the feature set. At the other end of the spectrum, ignio may have to manage a copious number of features in business transaction data. In such cases, ignio uses statistical methods to eliminate redundant features and brute-force methods to pick relevant ones.
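As an illustration of what feature derivation from such a bare-bones time series can look like (a generic sketch, not ignio’s actual implementation), a single timestamp column can be decomposed into several calendar features that a model can learn from:

```python
# Sketch of feature derivation: decompose one timestamp column into calendar
# features such as hour, day of week, and weekend flag. Data is invented.
import pandas as pd

series = pd.DataFrame({
    "entity": ["server-01"] * 4,
    "timestamp": pd.to_datetime(
        ["2024-01-01 02:00", "2024-01-01 14:00", "2024-01-06 02:00", "2024-01-06 14:00"]
    ),
    "value": [42.0, 88.0, 17.0, 35.0],
})

series["hour"] = series["timestamp"].dt.hour
series["day_of_week"] = series["timestamp"].dt.dayofweek   # 0 = Monday
series["is_weekend"] = series["day_of_week"] >= 5
series["month"] = series["timestamp"].dt.month

print(series.drop(columns="timestamp"))
```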
Conclusion
Playing with data dimensions offers a wide range of possibilities. The quality and usability of analytics depend heavily on the maturity of feature engineering. Today, the analytics pipeline is becoming fairly standardized. However, data preparation and feature engineering are still an art, and they are often what determines success for machine learning algorithms. Great results are often a function of how experienced you are and how bold and creative you can get with the data!
Written by Parag Agrawal
About the author
Parag Agrawal works as a Data Scientist at Digitate, developing AI/ML solutions for real-world business problems. His areas of expertise include event analysis, subgroup mining, and knowledge discovery.