Assumptions for machine learning algorithms - important things to know

IID (independent and identically distributed) data is the fundamental assumption behind almost all statistical learning methods: each data point in a sample must be drawn from the same distribution and be independent of the others. When this assumption fails, many machine learning algorithms perform poorly.
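As a quick illustration (a minimal sketch with made-up data; numpy is assumed, and lag1_autocorr is just a helper defined here), one simple way to spot a violation of independence is to check the autocorrelation of the sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# IID sample: each point drawn independently from the same distribution
iid = rng.normal(loc=0.0, scale=1.0, size=1000)

# Non-IID sample: an AR(1) process, where each point depends on the previous one
ar1 = np.zeros(1000)
for t in range(1, 1000):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()

def lag1_autocorr(x):
    """Correlation between x[t] and x[t-1]; near 0 suggests independence."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(f"IID lag-1 autocorrelation:   {lag1_autocorr(iid):+.3f}")  # close to 0
print(f"AR(1) lag-1 autocorrelation: {lag1_autocorr(ar1):+.3f}")  # close to 0.9
```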


Logistic Regression

Unlike linear regression, logistic regression does not require a linear relationship between the dependent and independent variables, the error terms (residuals) do not need to be normally distributed, homoscedasticity is not required, and the dependent variable is not measured on an interval or ratio scale. It does, however, come with assumptions of its own.

First, logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.

Second, logistic regression requires there to be little or no multicollinearity among the independent variables; that is, the independent variables should not be too highly correlated with each other.
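A common way to check this is the variance inflation factor (VIF), where values above roughly 5 to 10 are often read as a warning sign. A minimal sketch using statsmodels (the data here is made up purely for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                          # independent predictor

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
X = X.assign(const=1.0)  # VIF expects an intercept column

for i, col in enumerate(X.columns[:-1]):
    print(f"VIF({col}) = {variance_inflation_factor(X.values, i):.1f}")
# x1 and x2 should show very large VIFs; x3 should be close to 1
```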

Third, logistic regression assumes linearity of the independent variables and the log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it does require that the independent variables are linearly related to the log odds of the outcome.
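One informal way to eyeball this assumption is to bin a continuous predictor, compute the empirical log odds within each bin, and check that the trend looks roughly linear. A rough sketch on simulated data (all names and numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=5000)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))  # true model: logit(p) = 0.5 + 1.2x
y = rng.binomial(1, p)

# Bin x and compute the empirical log odds within each bin
bins = np.linspace(-3, 3, 13)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (x >= lo) & (x < hi)
    rate = np.clip(y[mask].mean(), 1e-3, 1 - 1e-3)  # avoid log(0)
    print(f"x in [{lo:+.1f}, {hi:+.1f}): log odds = {np.log(rate / (1 - rate)):+.2f}")
# If the printed log odds grow roughly linearly with x, the assumption is plausible.
```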

Finally, logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is 0.10, then you would need a minimum sample size of 500 (10 × 5 / 0.10).
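That rule of thumb is easy to encode (min_sample_size is a hypothetical helper written for this post, not a standard library function):

```python
import math

def min_sample_size(n_predictors: int, p_rare: float, events_per_var: int = 10) -> int:
    """Rule-of-thumb minimum N: (events_per_var * n_predictors) / p_rare."""
    return math.ceil(events_per_var * n_predictors / p_rare)

print(min_sample_size(5, 0.10))  # 500, matching the example above
```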




SVM

SVMs are quite tolerant of input data, especially the soft-margin version, and make few explicit distributional assumptions about the data. The main practical caveat is that features should be on comparable scales, since the margin is distance-based.
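A sketch of that caveat in practice, assuming scikit-learn (the dataset and parameters are arbitrary choices for illustration): putting a scaler in front of the SVM usually helps when features live on very different scales.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # features on wildly different scales

raw = SVC(kernel="rbf")
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

print("unscaled:", cross_val_score(raw, X, y, cv=5).mean())
print("scaled:  ", cross_val_score(scaled, X, y, cv=5).mean())
# Scaling typically improves the RBF-kernel SVM noticeably on data like this.
```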

Naive Bayes

Naive Bayes is called naive because it makes the naive assumption that the features are independent of each other given the class: within a class, knowing the value of one feature tells you nothing about any other.

By making this assumption, the class-conditional joint distribution can be found easily by just multiplying the per-feature probabilities: P(x1, …, xn | y) = P(x1 | y) × … × P(xn | y). In the real world the features may not be independent, and you would have to estimate the full joint distribution. It is naive because of this simplification.
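To make the simplification concrete, here is a toy sketch (all counts and words are hypothetical; log space is used to avoid numerical underflow) of how a naive Bayes classifier scores a class by multiplying, i.e. summing the logs of, per-feature probabilities:

```python
import numpy as np

# Toy per-feature likelihoods P(word appears | class), keyed by class, for the
# words ["free", "meeting", "offer"], estimated from hypothetical training counts.
log_prior = {"spam": np.log(0.4), "ham": np.log(0.6)}
log_likelihood = {
    "spam": np.log([0.30, 0.05, 0.25]),
    "ham":  np.log([0.02, 0.20, 0.03]),
}

x = np.array([1, 0, 1])  # document contains "free" and "offer", not "meeting"

for cls in ("spam", "ham"):
    lp = log_likelihood[cls]
    # Naive assumption: joint log P(x | class) is the SUM of per-feature terms
    score = log_prior[cls] + np.sum(x * lp + (1 - x) * np.log1p(-np.exp(lp)))
    print(f"log P({cls}, x) = {score:.3f}")
# The class with the higher score wins; no joint distribution is ever estimated.
```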

Pros:

  • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier performs well compared to other models like logistic regression, and you need less training data.
  • It performs well with categorical input variables compared to numerical ones. For numerical variables, a normal distribution is assumed (the bell curve, which is a strong assumption).

Cons:

  • If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of zero and will be unable to make a prediction. This is often known as the “zero frequency” problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace smoothing, sketched after this list.
  • On the other side, naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba should not be taken too seriously.
  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
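As noted in the first point above, Laplace smoothing avoids zero probabilities by adding a pseudo-count to every category. A minimal sketch with hypothetical counts:

```python
# Counts of a categorical feature within one class, from hypothetical training
# data. The category "green" was never observed for this class.
counts = {"red": 30, "blue": 10, "green": 0}
alpha = 1.0  # Laplace pseudo-count
total = sum(counts.values())
k = len(counts)

for cat, c in counts.items():
    unsmoothed = c / total
    smoothed = (c + alpha) / (total + alpha * k)
    print(f"P({cat}|class): raw={unsmoothed:.3f}, smoothed={smoothed:.3f}")
# "green" moves from probability 0 (which would zero out the whole product)
# to a small positive value, so unseen test categories no longer break prediction.
```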

Decision Trees and Random Forest

The main implicit assumption here is perfect sampling: the training set is assumed to cover every part of the input distribution well.

For example, if one class consists of two components and, in our dataset, one component is represented by 100 samples while the other is represented by a single sample, most individual decision trees will see only the first component, and the Random Forest will misclassify the second one.
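A rough simulation of that scenario, assuming scikit-learn (all coordinates and counts are arbitrary): class 0 has two well-separated components, one of which appears only once in training. The max_samples setting makes each tree see a small subsample, so most trees never encounter the rare component.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Class 0, component A: 100 training samples around (0, 0)
comp_a = rng.normal(loc=[0, 0], scale=0.3, size=(100, 2))
# Class 0, component B: a single training sample around (5, 5)
comp_b = rng.normal(loc=[5, 5], scale=0.3, size=(1, 2))
# Class 1: 100 training samples around (0, 5)
other = rng.normal(loc=[0, 5], scale=0.3, size=(100, 2))

X = np.vstack([comp_a, comp_b, other])
y = np.array([0] * 101 + [1] * 100)

# Each tree trains on a 20% bootstrap subsample, so the lone component-B
# point is absent from the large majority of trees.
clf = RandomForestClassifier(n_estimators=200, max_samples=0.2,
                             random_state=0).fit(X, y)

# Fresh test points from the rare component B: still truly class 0
test_b = rng.normal(loc=[5, 5], scale=0.3, size=(20, 2))
print("Predicted labels for component B:", clf.predict(test_b))
# Most trees never saw component B and vote class 1, so the forest
# tends to misclassify these points.
```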

Beyond that, tree-based methods make few formal assumptions: they do not require linearity, and they make no assumption about multicollinearity among variables in the multi-dimensional feature space.


References:

  • Rennie et al., "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", ICML 2003: https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
  • The Importance of Analyzing Model Assumptions in Machine Learning: https://towardsdatascience.com/the-importance-of-analyzing-model-assumptions-in-machine-learning-a1ab09fb5e76
