IID in machine learning

Ajit Jaokar

Published: July 7, 2024

Many thanks for the comments and feedback on yesterday's post ("Why is machine learning challenging for some engineers"). I will come back to those later this week.

In machine learning, "IID" stands for "Independent and Identically Distributed."

The assumption of independence implies that the generation of any data point in a dataset does not influence and is not influenced by the generation of any other data point. In other words, each data point is generated without regard to the others.

The assumption of identical distribution means that all data points come from the same probability distribution. In other words, each piece of data is drawn from the same underlying process, ensuring that the dataset has a consistent statistical profile.

When data points are IID, the way you split the data into training and test sets should not matter, because any random subset will be representative of the whole. If the IID condition is not satisfied, you could be comparing apples to oranges (distribution-wise). If it is satisfied, the training and test sets are both random samples from the same distribution.

So, before you do the train/test split in machine learning, you need to check for IID.

And how do you do that?

Through statistical tests and analysis.

And therein lies the touchpoint between the two approaches.

While statistical inference is different from machine learning inference, statistical techniques can still be used in machine learning.

Statistical tests are procedures used to make decisions or inferences about populations based on sample data. Statistical tests provide a framework to evaluate hypotheses, assess relationships between variables, and determine the significance of predictive features.

You can use several statistical tests and approaches to check whether data is IID.

Tests for Independence: Autocorrelation tests can check whether observations at different times in a time series are correlated.
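As a rough illustration of an autocorrelation check, here is a minimal sketch in plain Python. The function names `autocorrelation` and `looks_independent` are made up for this example, and the band of 1.96 / sqrt(n) is the usual large-sample approximation for white noise, not a formal test such as Ljung-Box.

```python
import math
import random


def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


def looks_independent(series, lag=1, z=1.96):
    """Rough check: is the autocorrelation inside the ~95% white-noise band?"""
    bound = z / math.sqrt(len(series))
    return abs(autocorrelation(series, lag)) <= bound


rng = random.Random(0)

# Independent draws: each value ignores the previous ones.
noise = [rng.gauss(0, 1) for _ in range(2000)]

# A random walk: each value is the previous value plus a step,
# so neighbouring observations are strongly dependent.
walk = []
total = 0.0
for step in noise:
    total += step
    walk.append(total)

print("noise lag-1 autocorrelation:", round(autocorrelation(noise), 3))
print("walk lag-1 autocorrelation:", round(autocorrelation(walk), 3))
print("walk looks independent:", looks_independent(walk))
```

A single lag is only a first look. In practice you would inspect several lags, and a value outside the band is a hint of dependence rather than proof.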

Tests for Identical Distribution: The Kolmogorov-Smirnov test is a nonparametric test that compares the cumulative distributions of two datasets, or of one dataset against a known distribution. It can be used to test whether two samples come from the same distribution.
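To make the two-sample Kolmogorov-Smirnov idea concrete, here is a small sketch in plain Python that compares a training sample against two candidate test samples. The helper names and the synthetic data are assumptions for illustration. The critical value uses the standard large-sample approximation, so it is only a guide for small samples.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


rng = random.Random(42)
train = [rng.gauss(0, 1) for _ in range(1000)]
test_same = [rng.gauss(0, 1) for _ in range(1000)]     # same process as train
test_shifted = [rng.gauss(1, 1) for _ in range(1000)]  # mean has drifted

threshold = ks_critical_value(len(train), len(test_same))
print("threshold:", round(threshold, 4))
print("same process, D =", round(ks_statistic(train, test_same), 4))
print("shifted process, D =", round(ks_statistic(train, test_shifted), 4))
```

A statistic above the threshold suggests the two samples do not share a distribution. If SciPy is available, `scipy.stats.ks_2samp` performs the same comparison and also returns a p-value.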

Chi-square Goodness of Fit Test: Tests whether the distribution of sample categorical data matches an expected distribution.
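For categorical data, the goodness-of-fit statistic can be sketched as follows. The class labels, expected proportions, and counts are invented for the example. The critical value 5.991 is the standard chi-square table value for 2 degrees of freedom at the 5% level.

```python
from collections import Counter


def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic for observed counts vs. expected probabilities."""
    total = sum(observed.values())
    stat = 0.0
    for category, p in expected_probs.items():
        expected = total * p
        stat += (observed.get(category, 0) - expected) ** 2 / expected
    return stat


# Hypothetical class proportions seen in the training data.
expected_probs = {"cat": 0.5, "dog": 0.3, "bird": 0.2}

# Hypothetical label counts in a newly collected batch.
new_batch = Counter({"cat": 30, "dog": 30, "bird": 40})

# Three categories give 3 - 1 = 2 degrees of freedom.
CRITICAL_VALUE_DF2_ALPHA_05 = 5.991

stat = chi_square_statistic(new_batch, expected_probs)
print("chi-square statistic:", round(stat, 3))
print("batch differs from expected mix:", stat > CRITICAL_VALUE_DF2_ALPHA_05)
```

The test is usually considered reliable only when the expected counts are not too small. If SciPy is available, `scipy.stats.chisquare` computes the statistic together with a p-value.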

Visual Inspection: Of course, before applying statistical tests, visual inspections using plots (e.g., histograms, scatter plots) can provide insights into violations of the IID assumption.

Domain-Specific Tests: Depending on the data and context, domain-specific tests might be more appropriate.

Metadata Analysis: Reviewing metadata and the context in which data was collected can reveal sources of potential violations, such as changes in data collection instruments or protocols.

Logging Data Changes: Maintaining logs of how data is collected, processed, and stored, to identify any changes in the system that could influence the data distribution.

Cohort Analysis: Segmenting the data into cohorts based on time of entry or other relevant factors and examining if outputs significantly differ across cohorts.
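The cohort idea can be sketched by cutting time-ordered data into consecutive blocks and comparing their summary statistics. This is only one possible reading of cohort analysis. The function names, the choice of four cohorts, and the synthetic drifting series are assumptions for illustration.

```python
import math
import random
import statistics


def split_into_cohorts(values, n_cohorts):
    """Split time-ordered values into at most n_cohorts consecutive blocks."""
    size = math.ceil(len(values) / n_cohorts)
    return [values[i:i + size] for i in range(0, len(values), size)]


def cohort_summary(values, n_cohorts=4):
    """(mean, standard deviation) for each cohort; each cohort needs >= 2 values."""
    return [
        (statistics.mean(c), statistics.stdev(c))
        for c in split_into_cohorts(values, n_cohorts)
    ]


def max_mean_gap(values, n_cohorts=4):
    """Largest gap between cohort means, in units of the overall standard deviation."""
    means = [mean for mean, _ in cohort_summary(values, n_cohorts)]
    return (max(means) - min(means)) / statistics.stdev(values)


rng = random.Random(7)

# Stable process: the same distribution throughout.
stable = [rng.gauss(0, 1) for _ in range(1000)]

# Drifting process: the mean rises steadily over time.
drifting = [rng.gauss(3 * t / 1000, 1) for t in range(1000)]

print("stable gap:", round(max_mean_gap(stable), 3))
print("drifting gap:", round(max_mean_gap(drifting), 3))
```

A large gap between cohorts suggests the data are not identically distributed over time. Pairs of cohorts could also be compared with a two-sample test such as Kolmogorov-Smirnov.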

Thus, a number of strategies are available to check that the IID assumption holds.

Image source and a good reference for IID: https://www.youtube.com/watch?v=EGKbPww2_rc&t=3s

Hans Z.

4 months ago

If I understand correctly, this ignores chaos theory?

Venkat dharaneswar reddy

Currently pursuing my B.Tech in Artificial Intelligence and Data Science at Amrita Vishwa Vidyapeetham

4 months ago

I want to know how raw data is treated in the real world, what is done to it, and what steps are taken to modify the data before the training and testing steps. Can you write an article on this? It would be really useful. Thank you.

Rodney Beard

International vagabond and vagrant at sprachspiegel.com, Economist and translator - Fisheries Economics Advisor

4 months ago

One needs to be careful as some machine learning algorithms have a sum or product specification (see the loss function) which essentially is an i.i.d. assumption. Also sometimes you can get away with weaker assumptions concerning the mixing properties of the underlying process/data which basically generalize i.i.d.

Bill Luker Jr PhD

Senior Economist and Methodologist. Statistics, Applied Econometrics, General Analytics, and the Data Sciences. Incisive Thinker, Writer, Researcher, Teacher. Entrepreneur. Author, Writer, Editor, Blogger, Poet.

4 months ago

I think it’s very interesting, Ajit, that the IID issue you discuss here, and the methods you suggest in testing for it, are wholly from the theoretical and practical ground of “small data” statistics and its traditional socio- (including econo-) and psycho-metrics applications. This is in direct opposition to the yearnings, first articulated in the early 20-teens by “big data evangelists,” to break free of the constraints imposed by the dead white men who invented statistics and probability theory. Those of us who said “wait a minute, you’re saying big data voids the laws of probability?” were dismissed as useless fossils, resistant to change. (And I ran into just such an attitude today from what I would call an “AI evangelist” on LI.) Frankly, though, I see a renewed recognition of the need for statistical understanding and learning in the data sciences (plural), in ML and AI, and feel a measure of vindication. I have used all the tests and approaches you describe in sampling practice, and will continue to do so in full confidence that they are neither outdated nor a trivial nuisance.
