Getting Started with Machine Learning: 4 Essential Models to Know
Luis Fernando Torres
AML/FT Intelligence Analyst @ CloudWalk, Inc. | Microsoft Certified AI Engineer
For a better understanding of the models presented in this article, and to see their implementation in Python, I suggest reading The ABCs of Machine Learning: 4 Essential Models notebook on Kaggle.
Introduction
Machine Learning has been reshaping industries throughout the world. After OpenAI's release of its ChatGPT tool, many enthusiasts and newcomers started talking about AI and its possible effects on society.
With the attention drawn by all the buzzwords and news coverage, many people are discovering the world of machine learning and want to know how it works, or even how to build their own models.
In this article, I'll briefly introduce the four most essential models every beginner in data science and machine learning should know: Linear Regression, Logistic Regression, Decision Trees, and K-Means.
Machine learning and AI are rapidly changing our society and revolutionizing many industries and markets. With the abundance of data available today, machines can extract insights and recognize patterns that would be nearly impossible for humans to spot with the naked eye, making this an exciting subject to study.
Even though the vast number of machine learning algorithms and techniques may seem overwhelming at first, understanding these four models will make it much easier to grasp more complex and advanced concepts ahead.
Let's get started!
Linear Regression
Linear regression is the ideal starting point for a beginner!
It is a supervised learning algorithm, which means we must have a target variable in mind when building a linear regression model. The model predicts a continuous target variable based on its relationship with one or more input features.
Linear regression models are built on the assumption that the target variable y has a linear relationship with the independent features X that can be modeled as a straight line.
The formula for a simple linear regression is:
Y = mX + b
Where Y represents the output (the target variable), X the independent feature (also referred to as the predictor), m the estimated slope, and b the estimated intercept.
The slope (m) indicates the rate at which Y changes for every unit increase or decrease in X. If m = 2, for instance, then for every unit increase in X, Y is expected to increase by 2 units.
The intercept (b) represents the value of Y when X = 0. It is the point at which the line crosses the y-axis.
The job of a linear regression model is to find the optimal values for m and b, so that it can predict the target variable Y for any given value of X.
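To make the formula concrete, here is a minimal sketch of fitting a simple linear regression with scikit-learn; the synthetic data, variable names, and library choice are my own illustration and are not taken from the Kaggle notebook.

```python
# A minimal sketch: fitting a simple linear regression with scikit-learn.
# The synthetic data below roughly follows Y = 2X + 1 plus some noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))                # single predictor feature
y = 2 * X.ravel() + 1 + rng.normal(0, 1, size=100)   # continuous target variable

model = LinearRegression()
model.fit(X, y)

print("Estimated slope (m):", model.coef_[0])         # should be close to 2
print("Estimated intercept (b):", model.intercept_)   # should be close to 1
print("Prediction for X = 5:", model.predict([[5.0]])[0])
```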
Linear regression models are widely used in fields such as economics, finance, and engineering. They can predict house and car prices, forecast sales, estimate stock prices, and much more.
Logistic Regression
Logistic regression is also a supervised learning algorithm. However, instead of fitting a straight line to the data, we fit an S-shaped curve, called a sigmoid, to predict binary outcomes based on one or more input features.
It assumes a linear relationship between the log odds of the target variable and the independent features. The output of a logistic regression model is a value between 0 and 1, indicating the probability of an event y, such as the chance of passing an exam, given a feature X, such as the hours spent studying.
The formula for logistic regression is:
p(y = 1 | x) = 1 / (1 + exp(-(β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ)))
Where p(y = 1 | x) is the probability of the target variable y taking the value 1 given the predictor features x₁, x₂, …, xₙ. The β coefficients are the parameters of the logistic regression model, estimated from the data so that the log odds best fit the observations.
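As a hands-on illustration of the exam example above, here is a minimal sketch using scikit-learn's LogisticRegression; the hours studied and pass/fail labels below are made up purely for demonstration.

```python
# A minimal sketch: logistic regression with scikit-learn.
# X holds hours studied and y whether the exam was passed (1) or not (0);
# the numbers are made up purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns probabilities per class; column 1 is p(y = 1 | x)
for hours, prob in zip([1.0, 3.0, 5.0], model.predict_proba([[1.0], [3.0], [5.0]])[:, 1]):
    print(f"{hours:.1f} hours studied -> estimated probability of passing: {prob:.2f}")
```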
Logistic regression models may be used in healthcare for predicting the likelihood of patients developing certain diseases, in finance to predict the likelihood of default, in marketing to predict the likelihood of a customer purchasing a product based on demographics, etc.
Decision Tree
The decision tree is a powerful and intuitive supervised learning algorithm widely used for both classification and regression tasks. It receives its name from the tree-like structure it is built on: internal nodes evaluate attributes, branches represent the possible outcomes of each evaluation, and the path followed eventually ends at a leaf node, which holds the predicted outcome.
Decision trees are extremely easy to interpret and visualize. They are also capable of handling missing values and are less sensitive to outliers. Another advantage is that the model can capture non-linear relationships between the input features and the output variable, including complex interactions.
Because it handles both classification and regression tasks, the decision tree is used across a variety of industries for activities such as investment analysis, default risk assessment, healthcare, and more.
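Below is a minimal sketch of training a decision tree classifier with scikit-learn on its built-in Iris dataset; the dataset and hyperparameters are chosen here purely for illustration and are not from the Kaggle notebook.

```python
# A minimal sketch: a decision tree classifier with scikit-learn,
# trained on the built-in Iris dataset (chosen purely for illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # shallow tree for readability
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
# export_text prints the learned structure: internal nodes, branches, and leaves
print(export_text(tree, feature_names=data.feature_names))
```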
K-Means
K-Means is an unsupervised model, which means it learns patterns from unlabeled data. In this case, there is no target variable for the model to predict.
K-Means is a clustering algorithm used to identify patterns in data and group similar data points together based on their proximity to one another.
It works by randomly selecting K centroids, one for each cluster, and then repeating two steps: each data point is assigned to its nearest centroid, and each centroid is recomputed as the arithmetic mean of the data points assigned to its cluster. The process continues until the assignments stop changing.
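Here is a minimal sketch of K-Means with scikit-learn; the synthetic blob data stands in for something like customer records and is generated purely for illustration.

```python
# A minimal sketch: K-Means clustering with scikit-learn on synthetic data.
# make_blobs generates well-separated groups of points purely for illustration.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # K = 3 clusters
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First 10 cluster assignments:", labels[:10])
```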
The goal of a K-Means model is to identify subsets of data that are both meaningful and useful. It is widely used in retail for customer segmentation, dividing the customer base into distinct groups with similar characteristics so that each group can be targeted with the right products, services, and marketing strategies.
Conclusion
In conclusion, machine learning is a rapidly growing field that has been revolutionizing many industries and decision-making processes throughout the world. It is an exciting subject to learn and work on, as it offers endless possibilities to explore and innovate, especially with the increasing availability of large datasets and powerful computing resources.
Overall, it is a fascinating and dynamic field that holds great promise for the future. Whether you are a researcher, developer, or simply curious about this exciting domain, there has never been a better time to get involved and explore the possibilities of machine learning.
This article is just a brief explanation of how these models work. In my Kaggle notebook, The ABCs of Machine Learning: 4 Essential Models, I dive a bit deeper into the details of each model and demonstrate hands-on how to apply them in Python. I highly encourage you to take a look.
Thank you for reading,
Luís Fernando Torres