How do you select the right machine learning algorithm for a given problem?
This article was an early beta test.
Selecting the right machine learning algorithm for a given problem can be daunting, since no single algorithm addresses every need. Here are some things to consider as you weigh the differences between machine learning algorithms.
1. Know your data: One of the most important aspects of selecting a machine learning algorithm is understanding the nature of the data you are working with.
A crucial question to ask is whether the data is labeled, which determines whether you need supervised or unsupervised learning. In supervised learning, the training dataset consists of input-output pairs, and the aim is to learn a function that maps inputs to outputs. In this case, you might consider algorithms such as linear regression, naive Bayes or random forests. For unsupervised learning, where the training data has no output labels, you might consider clustering algorithms, such as k-means clustering, or dimensionality reduction techniques, such as principal component analysis. These methods extract patterns and relationships from the data itself.
Another aspect to consider is the size of the dataset: larger datasets can support more complex algorithms than smaller ones. For some algorithms, however, training time on large datasets can be prohibitively long.
“If the training data is smaller or if the dataset has a smaller number of observations and a higher number of features like genetics or textual data, choose algorithms with high bias/low variance like Linear regression, Naïve Bayes or Linear SVM. If the training data is sufficiently large and the number of observations is higher as compared to the number of features, one can go for low bias/high variance algorithms like KNN, Decision trees or kernel SVM.”
-- David Nason is the chief technology officer at software development company Scoutr. He has over 30 years of experience in the technology industry and earned his master’s in computer science from Carnegie Mellon University.
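The rules of thumb above can be written down as a small lookup. This is only a sketch: the function name `shortlist_algorithms` is hypothetical, and because no numeric threshold for "small" data is given, the code assumes that "small" means no more observations than features.

```python
def shortlist_algorithms(labeled, n_samples, n_features):
    """Return candidate algorithm names based on the rules of thumb above.

    Assumption: "small" data is approximated as n_samples <= n_features.
    This cutoff is illustrative, not a standard value.
    """
    if not labeled:
        # No output labels: unsupervised learning.
        return ["k-means clustering", "principal component analysis"]
    if n_samples <= n_features:
        # Few observations relative to features: high bias / low variance.
        return ["linear regression", "naive Bayes", "linear SVM"]
    # Many observations relative to features: low bias / high variance.
    return ["k-nearest neighbors", "decision trees", "kernel SVM"]


# Text-like data: few documents, many features.
print(shortlist_algorithms(labeled=True, n_samples=200, n_features=5000))
# Large tabular dataset: many rows, few columns.
print(shortlist_algorithms(labeled=True, n_samples=100000, n_features=20))
# No labels at all.
print(shortlist_algorithms(labeled=False, n_samples=1000, n_features=10))
```

A lookup like this only narrows the field. The shortlisted candidates still need to be compared on the actual data.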
2. Consider the type of problem: In particular, consider whether you are facing a classification problem, where the output is a category, or a regression problem, where the output is a continuous value. For classification problems, algorithms such as logistic regression, k-nearest neighbors or support vector machines might be appropriate. For regression problems, you might consider linear regression, decision trees or neural networks.
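One rough way to tell the two problem types apart is to look at the values you want to predict. The sketch below is a heuristic under stated assumptions, not a definitive test: `infer_problem_type` and its `max_classes` cutoff are hypothetical, and a target made of integers can still be a regression target (for example, a count).

```python
def infer_problem_type(targets, max_classes=20):
    """Guess 'classification' or 'regression' from the target values.

    Assumption: strings, booleans, or a small set of distinct integers are
    treated as class labels; everything else is treated as continuous.
    max_classes is an illustrative cutoff, not a standard value.
    """
    distinct = set(targets)
    if not distinct:
        raise ValueError("targets must not be empty")
    if all(isinstance(t, (str, bool)) for t in distinct):
        return "classification"
    if all(isinstance(t, int) for t in distinct) and len(distinct) <= max_classes:
        return "classification"
    return "regression"


# Candidate lists taken from the paragraph above.
CANDIDATES = {
    "classification": ["logistic regression", "k-nearest neighbors",
                       "support vector machines"],
    "regression": ["linear regression", "decision trees", "neural networks"],
}

for targets in (["spam", "ham", "spam"], [0, 1, 1, 0], [12.5, 13.1, 9.8]):
    kind = infer_problem_type(targets)
    print(targets, "->", kind, CANDIDATES[kind])
```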
3. Evaluate your performance requirements: Different machine learning algorithms have different performance characteristics in terms of accuracy, training time and inference time. Decide what you require on each of these three dimensions, then eliminate the algorithms that cannot meet those requirements. For example, if you need a model that can rapidly make predictions on new data, you might want to rule out options with comparatively long inference times, such as large neural networks or kernel support vector machines with many support vectors. Similarly, if you need a highly accurate model, you might rule out simpler algorithms such as linear regression, which can struggle to capture nonlinear relationships in more complex datasets.
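Training time and inference time are easy to measure directly. The following sketch times the two phases separately for any object exposing `fit` and `predict`. The names `measure` and `within_budget`, the time limits, and the `MajorityClass` stand-in model are all hypothetical and exist only so the example runs with the standard library; in practice you would pass in your real candidates.

```python
import time
from collections import Counter


class MajorityClass:
    """Stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


def measure(model, X_train, y_train, X_new):
    """Time fit() and predict() separately for one candidate model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_new)
    inference_seconds = time.perf_counter() - start

    return {
        "train_seconds": train_seconds,
        "inference_seconds": inference_seconds,
        "predictions": predictions,
    }


def within_budget(result, max_train_seconds, max_inference_seconds):
    """True if the measured times satisfy both (illustrative) limits."""
    return (result["train_seconds"] <= max_train_seconds
            and result["inference_seconds"] <= max_inference_seconds)


X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = ["a", "a", "a", "b"]
result = measure(MajorityClass(), X_train, y_train, X_new=[[1.5], [2.5]])
print(result["predictions"], within_budget(result, 1.0, 0.1))
```

Timings on toy data say little about production behavior, so measure on data of realistic size and on hardware similar to where the model will run.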
4. Experiment and compare: Ultimately, there is no one-size-fits-all approach to selecting the right machine learning algorithm for your needs. However, by following the principles outlined above, you should be able to narrow down your selection to a subset of algorithms that you can experiment with. Create a shortlist of candidates and compare the results of each on the same dataset. Often, the best performing algorithm will depend significantly on the nuances of the particular dataset you are using. By running comparisons across multiple algorithms, you can get a sense of which one might hold an edge over the others.
“You can also try a bunch of different algorithms and then take a closer look at the results. Keep in mind that this can be time consuming and use substantial computing power, so don't expect to get results immediately. Let's say that you're working with supervised machine learning, you want to use three different algorithms on your training data, decision trees, naive bayes or k-nearest neighbor. Then you can look at the results and see which one had the highest level of accuracy.”
-- Doug Rose is an author and LinkedIn Learning instructor. He holds a master’s in information management from Syracuse University and a J.D. from Syracuse University College of Law.
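Comparing a shortlist on the same dataset can be sketched as follows. To stay self-contained, the example swaps in three tiny pure-Python classifiers (a majority-class baseline, 1-nearest neighbor, and nearest centroid) in place of the algorithms named above, and it uses a synthetic two-cluster dataset. All function and class names here are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class MajorityClass:
    """Baseline: always predicts the most common training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class OneNearestNeighbor:
    """Predicts the label of the closest training point."""
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        return [self.y[min(range(len(self.X)),
                           key=lambda i: sq_dist(row, self.X[i]))]
                for row in X]


class NearestCentroid:
    """Predicts the label whose class mean is closest."""
    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids = {label: [sum(col) / len(rows) for col in zip(*rows)]
                          for label, rows in groups.items()}
        return self

    def predict(self, X):
        return [min(self.centroids,
                    key=lambda label: sq_dist(row, self.centroids[label]))
                for row in X]


def holdout_split(X, y, test_fraction=0.3, seed=0):
    """Shuffle once, then hold out a fraction of the rows for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_fraction))
    test, train = indices[:n_test], indices[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def compare(candidates, X, y, seed=0):
    """Score every candidate on the same train/test split."""
    X_train, y_train, X_test, y_test = holdout_split(X, y, seed=seed)
    scores = {}
    for name, model in candidates.items():
        predictions = model.fit(X_train, y_train).predict(X_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        scores[name] = correct / len(y_test)
    return scores


# Synthetic data: two well-separated 5x5 grids of points, one per class.
grid = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
X = [[a, b] for a, b in grid] + [[a + 10, b + 10] for a, b in grid]
y = ["low"] * len(grid) + ["high"] * len(grid)

results = compare({"majority baseline": MajorityClass(),
                   "1-nearest neighbor": OneNearestNeighbor(),
                   "nearest centroid": NearestCentroid()}, X, y)
for name, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: accuracy {score:.2f}")
```

Using one shared split keeps the comparison fair, and including a trivial baseline shows whether the more complex candidates add anything at all. A single holdout score can be noisy on small datasets, so repeating the comparison with different seeds is a reasonable extra check.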
5. Keep iterating: Remember that machine learning is an iterative process. Building a successful model often means reevaluating and refining your algorithm over time. As you gain more experience with your dataset, you can tweak your algorithm in order to improve its performance.
This article was edited by LinkedIn News Editor Felicia Hou and was curated leveraging the help of AI technology.
Writer (Self-employed)
1 year ago
It's worth considering where you are in your solution life cycle. If you're in the early investigation stage -- identifying potential approaches or even exploring your data -- that's one set of issues. If you know your data and are seeking to build an operational engine, that's another. The choice of algorithms also has to be constrained by governance issues. To whom are you going to have to explain the results? How far (logically) is the "consumer" from the engineer? If something looks odd, is it an IM or email to resolve, or is there a complex business process involved? The algorithm is the engine of the problem-solving vehicle -- but the rest of the infrastructure has to be compatible.
Choosing which machine learning algorithm to use should be aligned to (1) the business question or problem that you’re trying to solve, and (2) the statistical distribution of the target variable of interest (e.g., binomial, categorical, continuous). I always prefer to develop a holdout sample and test the initial model on it. This helps to understand the efficacy and lift that the ML model gives. There are multiple phases of ML, and the statistical analyses and model development are typically a small percentage of the overall effort. Identifying, standardizing, connecting, and transforming disparate data into the appropriate unit of analysis is typically 85%-90% of the time spent on machine learning. Running different ML models and statistical analyses provides different perspectives on the relationships between data variables. The ML model that best aligns to the business question and is used by the business (goes into production) is the best. Don't let perfect be the enemy of good!
President at BA Probst, LLC
1 year ago
Choosing a machine learning tool is analogous to a mechanic choosing a wrench. The right tool needs to be applied to the right task. Some of the relevant criteria are:
o Regulations -- for example, credit decisions need to be explainable
o Availability of data -- how often is data refreshed and are all attributes available
o Type I risk and Type II risk: the risk of false positives and false negatives
o Quickness of recalibration needs
o etc.
Also, not everything needs to have an ML component. The business objective function should take this into account. Oftentimes good ole common sense can be more effective than an advanced algorithm.
Internationally recognized chief analytics officer who is a thought leader, speaker, consultant, and author focused on analytics, data science, and AI
1 year ago
In addition to the ones above, the algorithm should be as simple as possible to maximize scalability. It should also be as explainable as possible to maximize business sponsor acceptance. It can often make sense to go with a somewhat less powerful model if in return you get high scalability and simplicity. In other words, only get fancy and complex to the extent you need to.