The Big 3 of Machine Learning Tasks

The "Big 3" machine learning tasks, which are by far the most common ones. They are:

  1. Regression
  2. Classification
  3. Clustering

1. Regression

1.1. (Regularized) Linear Regression

  • Strengths: Linear regression is straightforward to understand and explain, and can be regularized to avoid overfitting. In addition, linear models can be updated easily with new data using stochastic gradient descent.
  • Weaknesses: Linear regression performs poorly when there are non-linear relationships. Linear models are not naturally flexible enough to capture more complex patterns, and adding the right interaction terms or polynomials can be tricky and time-consuming.
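
To make the regularization and SGD points concrete, here is a minimal scikit-learn sketch; the synthetic data and the penalty strengths are illustrative assumptions, not tuned settings.

```python
# A minimal sketch of regularized linear regression (illustrative data/alpha).
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# L2-regularized ("ridge") fit; alpha controls regularization strength.
model = Ridge(alpha=1.0).fit(X, y)

# The same family of models can be updated incrementally via SGD.
sgd = SGDRegressor(penalty="l2", alpha=1e-4)
sgd.partial_fit(X, y)            # initial batch
sgd.partial_fit(X[:50], y[:50])  # later update with newly arrived data
```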

1.2. Regression Tree

  • Strengths: Decision trees can learn non-linear relationships, and are fairly robust to outliers. Ensembles perform very well in practice, winning many classical (i.e. non-deep-learning) machine learning competitions.
  • Weaknesses: Unconstrained, individual trees are prone to overfitting because they can keep branching until they memorize the training data. However, this can be alleviated by using ensembles.
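
A minimal sketch of the single-tree-versus-ensemble trade-off, assuming synthetic non-linear data and illustrative hyperparameters:

```python
# Contrast a depth-limited regression tree with a tree ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)  # non-linear target

# Constraining depth curbs a single tree's tendency to memorize the data.
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)

# An ensemble of trees usually generalizes better than any single tree.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
```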

1.3. Deep Learning

  • Strengths: Deep learning is the current state-of-the-art for certain domains, such as computer vision and speech recognition. Deep neural networks perform very well on image, audio, and text data, and they can be easily updated with new data using mini-batch gradient descent. Their architectures (i.e. number and structure of layers) can be adapted to many types of problems, and their hidden layers reduce the need for feature engineering.
  • Weaknesses: Deep learning algorithms are usually not suitable as general-purpose algorithms because they require a very large amount of data. In fact, they are usually outperformed by tree ensembles for classical machine learning problems. In addition, they are computationally intensive to train, and they require much more expertise to tune (i.e. set the architecture and hyperparameters).
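
As a rough illustration, scikit-learn's MLPRegressor can stand in for a small feed-forward network; the architecture and data below are assumptions for demonstration, and real deep-learning work typically uses dedicated frameworks:

```python
# A small feed-forward network for regression (illustrative architecture).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

# Two hidden layers; the hidden layers learn features automatically.
net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500,
                   random_state=0).fit(X, y)

# Mini-batch updates with new data are also supported.
net.partial_fit(X[:100], y[:100])
```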

2. Classification

2.1. (Regularized) Logistic Regression

  • Strengths: Outputs have a nice probabilistic interpretation, and the algorithm can be regularized to avoid overfitting. Logistic models can be updated easily with new data using stochastic gradient descent.
  • Weaknesses: Logistic regression tends to underperform when there are multiple or non-linear decision boundaries. Logistic models are not flexible enough to naturally capture more complex relationships.
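
A minimal sketch of a regularized logistic model and its probabilistic outputs, using an illustrative synthetic dataset:

```python
# Regularized logistic regression with probability outputs.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Smaller C means stronger L2 regularization.
clf = LogisticRegression(C=1.0, penalty="l2").fit(X, y)

# Outputs are class probabilities, not just hard labels.
probs = clf.predict_proba(X[:5])
```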

2.2. Classification Tree

  • Strengths: As with regression, classification tree ensembles also perform very well in practice. They are robust to outliers, scalable, and able to naturally model non-linear decision boundaries thanks to their hierarchical structure.
  • Weaknesses: Unconstrained, individual trees are prone to overfitting, but this can be alleviated by ensemble methods.
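
A minimal sketch contrasting a constrained single tree with a random-forest ensemble on a non-linear toy problem (all hyperparameters are illustrative):

```python
# Single classification tree vs. a tree ensemble on a non-linear boundary.
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

tree = DecisionTreeClassifier(max_depth=5).fit(X, y)    # constrained single tree
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
```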

2.3. Deep Learning

  • Strengths: Deep learning performs very well when classifying audio, text, and image data.
  • Weaknesses: As with regression, deep neural networks require very large amounts of data to train, so they are not treated as general-purpose algorithms.
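
A minimal sketch using scikit-learn's MLPClassifier on a small toy image dataset; real image, audio, and text classifiers typically use specialized architectures (CNNs, transformers) in dedicated deep-learning frameworks:

```python
# A small neural-network classifier on a toy 8x8-pixel digits dataset.
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

net = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300,
                    random_state=0).fit(X, y)
```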

2.4. Support Vector Machines

  • Strengths: SVMs can model non-linear decision boundaries, and there are many kernels to choose from. They are also fairly robust against overfitting, especially in high-dimensional space.
  • Weaknesses: However, SVMs are memory-intensive, trickier to tune due to the importance of picking the right kernel, and don't scale well to larger datasets. Currently in the industry, random forests are usually preferred over SVMs.
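
A minimal sketch of a kernelized SVM; the RBF kernel and the C/gamma settings are illustrative and would normally be tuned (e.g. via grid search):

```python
# A kernelized SVM on a non-linearly separable toy dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# The RBF kernel lets the SVM model a non-linear decision boundary.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
```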

2.5. Naive Bayes

  • Strengths: Even though the conditional independence assumption rarely holds true, NB models actually perform surprisingly well in practice, especially for how simple they are. They are easy to implement and can scale with your dataset.
  • Weaknesses: Due to their sheer simplicity, NB models are often beaten by properly trained and tuned models built with the algorithms listed above.
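
A minimal sketch of a Naive Bayes text classifier; the tiny corpus and labels are invented purely for illustration:

```python
# Multinomial Naive Bayes on bag-of-words features (toy spam example).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills now", "meeting at noon", "win money fast", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (illustrative)

X = CountVectorizer().fit_transform(docs)  # sparse word counts
clf = MultinomialNB().fit(X, labels)
```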

3. Clustering

3.1. K-Means

  • Strengths: K-Means is hands-down the most popular clustering algorithm because it's fast, simple, and surprisingly flexible if you pre-process your data and engineer useful features.
  • Weaknesses: The user must specify the number of clusters, which won't always be easy to do. In addition, if the true underlying clusters in your data are not globular, then K-Means will produce poor clusters.
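
A minimal sketch of K-Means with the pre-processing step mentioned above; the choice of k = 3 is an illustrative assumption:

```python
# K-Means with feature scaling (K-Means is distance-based, so scale first).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

# The user must supply the number of clusters up front.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```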

3.2. Affinity Propagation

  • Strengths: The user doesn't need to specify the number of clusters (but does need to specify 'sample preference' and 'damping' hyperparameters).
  • Weaknesses: The main disadvantage of Affinity Propagation is that it's quite slow and memory-heavy, making it difficult to scale to larger datasets. It also assumes the true underlying clusters are globular.
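
A minimal sketch showing that Affinity Propagation infers the cluster count itself; the damping and preference values are illustrative:

```python
# Affinity Propagation: no n_clusters argument, only preference and damping.
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ap = AffinityPropagation(damping=0.9, preference=-50).fit(X)
n_found = len(ap.cluster_centers_indices_)  # cluster count is inferred
```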

3.3. Hierarchical / Agglomerative

  • Strengths: The main advantage of hierarchical clustering is that the clusters are not assumed to be globular. In addition, it scales to larger datasets more gracefully than Affinity Propagation does.
  • Weaknesses: Much like K-Means, the user must choose the number of clusters (i.e. the level of the hierarchy to "keep" after the algorithm completes).
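
A minimal sketch of agglomerative clustering on non-globular data; the cut level (n_clusters) and the linkage choice are illustrative:

```python
# Agglomerative clustering on elongated, non-globular clusters.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# 'single' linkage can follow elongated shapes; 'ward' favors compact ones.
labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
```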

3.4. DBSCAN

  • Strengths: DBSCAN does not assume globular clusters, and it scales well. In addition, it doesn't require every point to be assigned to a cluster, which keeps noisy points out of the clusters (this may be a weakness, depending on your use case).
  • Weaknesses: The user must tune the hyperparameters 'epsilon' and 'min_samples,' which define the density of clusters. DBSCAN is quite sensitive to these hyperparameters.
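
A minimal sketch of DBSCAN; the eps and min_samples values below are illustrative and in practice must be tuned to the data's density:

```python
# DBSCAN: density-based clustering that can leave points unassigned.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
# Points labeled -1 were left unassigned (treated as noise).
```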

