A Quick Introduction to Machine Learning

Machine learning allows systems to learn from data and make decisions automatically, with minimal human intervention. Whether it is predicting outcomes, sorting items into groups, or finding patterns, machine learning is becoming a very useful tool for working with data. I come from a BI background and have worked extensively with data, but I have now come to realize the power of machine learning, which adds a whole new layer of intelligence to how data is processed and used. Let us break down how machine learning models work:

Thanks a lot to the creator of the original image (found on LinkedIn) that helped explain this process so easily!

1. Getting the Data

Every machine learning model starts with data, which typically consists of input variables and output variables:

  • Input variables (Features): These are the pieces of information the model uses to make predictions or decisions; they are also called independent variables or predictors. If you are trying to predict the price of a house, the input variables could be the size of the house, the number of bedrooms, the location, the age of the house, and so on.
  • Output variables (Target): This is what we want the model to predict based on the input data, such as the price of the house or whether someone gets a loan.

The first task is cleaning the data, which includes handling missing values, checking for outliers (bad records), and understanding its distribution.
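As a rough sketch of what this cleaning step can look like in practice (the tiny housing dataset below is made up purely for illustration):

```python
import pandas as pd

# Hypothetical toy dataset: one missing size and one obviously bad record.
df = pd.DataFrame({
    "size_sqft": [1500, 2000, None, 1800, 99999],
    "price": [300_000, 400_000, 350_000, 360_000, 380_000],
})

# Handle missing values: fill the missing size with the median of observed sizes.
df["size_sqft"] = df["size_sqft"].fillna(df["size_sqft"].median())

# Handle outliers: drop rows whose size falls outside a plausible range.
clean = df[df["size_sqft"] < 10_000]
```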

2. Exploring the Data

Next, we explore the data to better understand it:

  • We check if there are relationships (correlations) between different pieces of data. For example, does the size of a house affect its price? Does age affect whether someone gets a loan?
  • We calculate important statistical metrics like mean, median, and standard deviation.
  • We also identify any missing values and determine how to handle them.
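These exploration steps map directly onto a few pandas calls; a minimal sketch, again with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "size_sqft": [1500, 2000, 2500, 3000, 3500],
    "price": [200_000, 260_000, 330_000, 390_000, 450_000],
})

# Correlation between size and price (Pearson; ranges from -1 to 1).
corr = df["size_sqft"].corr(df["price"])

# Key statistics per column: count, mean, std, min, quartiles, max.
stats = df.describe()

# Per-column count of missing values.
missing = df.isna().sum()
```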

3. Preparing the Data

Once we understand the data, we need to pre-process it to make it usable for the machine learning model:

  • Scaling: Adjusts the values of features so that no variable dominates over others due to differences in their ranges. For example, if one feature like Square Footage has values from 500 to 5,000, and another like Number of Bedrooms ranges from 1 to 10, Square Footage will influence the model more. Scaling makes both features comparable, helping the model learn effectively. The two most common methods are Min-Max Scaling and Standardization (Z-score normalization).
  • Encoding: Converts categorical data (e.g., gender, education) into numbers so the computer can process it.
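A minimal sketch of both steps, assuming scikit-learn is available (the two-column dataset and the education categories are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "size_sqft": [500.0, 2750.0, 5000.0],
    "bedrooms": [1.0, 5.0, 10.0],
})

# Min-Max scaling squeezes each column into the range [0, 1].
minmax = MinMaxScaler().fit_transform(df)

# Standardization (z-score) gives each column mean 0 and unit variance.
standard = StandardScaler().fit_transform(df)

# Encoding: turn a categorical column into numeric dummy columns.
edu = pd.get_dummies(pd.Series(["HS", "BS", "MS"]), prefix="edu")
```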

4. Splitting the Data

We split the data into two parts:

  • Training data (70%): This is used to train the model. We use it to teach the model how to make predictions.
  • Testing data (30%): This is kept aside to test the model and see how well it learned.
  • The split can also be 80:20, depending on the requirement.
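With scikit-learn, the split above is one function call; a sketch on a ten-row dummy dataset:

```python
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (10 rows) and target vector.
X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 70/30 split; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```

For an 80:20 split, you would simply pass `test_size=0.2` instead.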

5. Teaching the Model

Now, we teach the computer using the training data. The model learns from examples using learning algorithms such as:

1. Random Forests

  • Why it is popular: Random Forests are extremely versatile, powerful, and perform well on a wide range of tasks, including classification and regression. Their ability to handle large datasets and prevent overfitting makes them a go-to choice for many.

2. Support Vector Machines (SVM)

  • Why it is popular: SVMs are highly effective for classification, especially in high-dimensional spaces. They are widely used in fields like text classification, image recognition, and bioinformatics.

3. Decision Trees

  • Why it is popular: Decision Trees are easy to interpret and understand, making them great for beginners. They work well for classification and regression tasks but are often used as the base for more complex models like Random Forests and Gradient Boosting.

4. K-Nearest Neighbors (KNN)

  • Why it is popular: KNN is simple and intuitive, making it a good starting point for many classification problems. It’s particularly useful for small datasets but can struggle with large datasets because of its computational complexity.

5. Logistic Regression

  • Why it is popular: Despite its name, Logistic Regression is a classification algorithm. It is popular for binary classification problems (e.g., predicting whether something happens or not) because it is fast, interpretable, and effective for linearly separable data.
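Any of the algorithms above trains with a few lines of scikit-learn. A sketch using two of them on a synthetic dataset (`make_classification` just generates random labelled points, so the accuracy numbers are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Small synthetic classification dataset: 200 rows, 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
lr = LogisticRegression(max_iter=1000).fit(X, y)

# Accuracy on the training data itself (optimistic; real evaluation
# must use held-out test data, as described in the next sections).
rf_acc = rf.score(X, y)
lr_acc = lr.score(X, y)
```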

6. Fine-Tuning the Model

The model may need fine-tuning, which is done through its hyperparameters:

  • Hyperparameters are the settings or configurations you define before training a machine learning model. They are set manually and control the overall behavior of the model.
  • Hyperparameter optimization (tuning) adjusts these settings to help the model learn better and be as accurate as possible.
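One common way to search hyperparameters is a grid search, which tries every combination and keeps the best cross-validated score. A sketch on synthetic data (the parameter values tried here are arbitrary examples):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each combination of hyperparameters is scored with 3-fold cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, None]},
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_  # the winning hyperparameter combination
```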

7. Testing the Model

Once the model is trained, we test its performance on the test data that was set aside earlier. We check how well the model can make predictions on unseen data.

Performance metrics include:

  • Accuracy: How often the model makes correct predictions.
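Putting the split, training, and testing steps together, again on synthetic data for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy = fraction of test predictions that match the true labels.
acc = accuracy_score(y_test, y_pred)
```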

8. Cross-Validation

To ensure the model is not just memorizing the data (overfitting), we use cross-validation. This involves training the model multiple times on different subsets of the data and testing it on the remaining subset each time.
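A sketch of 5-fold cross-validation with scikit-learn (each fold takes a turn as the test set, so we get five scores rather than one):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Train on 4 folds, test on the 5th, rotating five times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_score = scores.mean()
```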

9. Making Predictions

Once the model is trained and tested, it is ready to make predictions. For example, we input someone’s age and income and see if the model predicts they will get a loan or not.
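A sketch of the loan example, assuming a tiny invented training set of [age, income] pairs (far too small for a real model, but enough to show the prediction call):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [age, income]; 1 = loan approved, 0 = denied.
X = [[25, 30_000], [30, 40_000], [45, 90_000],
     [50, 100_000], [35, 60_000], [22, 20_000]]
y = [0, 0, 1, 1, 1, 0]

model = LogisticRegression(max_iter=10_000).fit(X, y)

# Predict for a new applicant: 40 years old, earning 80,000.
prediction = model.predict([[40, 80_000]])[0]
```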

10. Evaluating the Model

Once we have built and trained a machine learning model, it is time to see how well it works. Evaluation is like testing how good the model is at making predictions. We need to check if it is making accurate predictions or if it needs improvement. We use popular metrics to evaluate its performance:

  • Accuracy: This tells us how often the model is correct. It is a simple metric but works well when the data is balanced (positive and negative cases are about equally common).
  • Sensitivity and Specificity: These metrics tell us how well the model detects different outcomes:
      • Sensitivity (Recall): How well it detects positive outcomes (e.g., people who should get the loan).
      • Specificity: How well it detects negative outcomes (e.g., people who shouldn’t get the loan).
  • RMSE and R2: These are used for models that predict continuous values (like predicting someone’s salary or the price of a house):
      • RMSE (Root Mean Square Error) measures how far off the model’s predictions are from the actual values.
      • R2 (R-squared) tells us how well the model explains the variation in the data.
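The metrics above can be computed directly; a sketch with small hand-made label and value lists so the numbers are easy to verify:

```python
from sklearn.metrics import confusion_matrix, mean_squared_error, r2_score

# Classification example: true labels vs. model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # recall: detected positives / actual positives
specificity = tn / (tn + fp)  # detected negatives / actual negatives

# Regression example: RMSE and R-squared on continuous values.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 310.0]
rmse = mean_squared_error(actual, predicted) ** 0.5
r2 = r2_score(actual, predicted)
```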

Machine learning is all about learning from data, making predictions, and improving those predictions over time. While the terminology may sound technical, at its core it is just like teaching a computer to make decisions based on examples. There is a lot to learn in machine learning, and what we have covered in this article is just the tip of the iceberg.

#MachineLearning #LearningJourney

