IID in machine learning

Many thanks for the comments and feedback on yesterday's post ("Why is machine learning challenging for some engineers"). I will come back to those later this week.

In machine learning, "IID" stands for "independent and identically distributed," and the "IID assumption" is the assumption that the data points in a dataset have both properties.

The assumption of independence implies that the generation of any data point in a dataset does not influence and is not influenced by the generation of any other data point. In other words, each data point is generated without regard to the others.

The assumption of identical distribution means that all data points come from the same probability distribution. In other words, each piece of data is drawn from the same underlying process, ensuring that the dataset has a consistent statistical profile.

When data points are IID, it's assumed that the way you randomly split the data into training and test sets doesn't matter, because each subset of the data will be representative of the whole. If the IID condition is not satisfied, you could be comparing apples to oranges (distribution-wise): the test set may follow a different distribution from the training set. If IID is satisfied, any random split gives you subsets drawn from the same distribution.

So, before you go down the train/test split route in machine learning, you need to check for IID.

And how do you do that?

Through statistical tests and analysis.

And therein lies the touchpoint between statistics and machine learning.

While statistical inference is different from machine learning inference, statistical techniques can still be used in machine learning.

Statistical tests are procedures used to make decisions or inferences about populations based on sample data. They provide a framework to evaluate hypotheses, assess relationships between variables, and determine the significance of predictive features.

You can use several statistical tests and approaches to detect violations of the IID assumption.

Tests for independence: autocorrelation tests check whether observations at different times in a time series are correlated.
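As a rough illustration of the autocorrelation idea, here is a minimal sketch using only the Python standard library. The function name, the simulated series, and the 0.8 dependence coefficient are all assumptions made for this example. It covers lag 1 only and detects linear dependence only; in practice a library routine such as the Ljung-Box test in statsmodels would be the more usual choice.

```python
import random
from statistics import fmean

def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = fmean(series)
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

rng = random.Random(0)

# Illustrative data: independent draws from one normal distribution.
independent = [rng.gauss(0, 1) for _ in range(2000)]

# Illustrative data: each value depends on the previous one (an AR(1)
# process), so the independence assumption is violated by construction.
dependent = [0.0]
for _ in range(1999):
    dependent.append(0.8 * dependent[-1] + rng.gauss(0, 1))

# For independent data, about 95% of sample autocorrelations
# fall within +/- 1.96 / sqrt(n).
bound = 1.96 / len(independent) ** 0.5

r_ind = autocorrelation(independent)
r_dep = autocorrelation(dependent)
print(f"independent: r={r_ind:.3f}, dependent: r={r_dep:.3f}, bound={bound:.3f}")
```

A lag-1 autocorrelation far outside the bound, as for the dependent series here, suggests the observations are not independent and that a random train/test split may leak information between the two sets.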

Tests for identical distribution: we can use the Kolmogorov-Smirnov test, a nonparametric test that compares the cumulative distributions of two samples, or of one sample against a known distribution, to check whether they come from the same distribution.
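To make the two-sample Kolmogorov-Smirnov comparison concrete, here is a small standard-library sketch. The helper names and the simulated "train" and "test" samples are assumptions for illustration, and the critical value uses the usual large-sample approximation rather than an exact p-value. In practice `scipy.stats.ks_2samp` does the same job and also returns a p-value.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # The empirical CDFs only change at data points, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ks_critical(n, m, c_alpha=1.358):
    """Large-sample critical value; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha * ((n + m) / (n * m)) ** 0.5

rng = random.Random(1)

# Illustrative samples: the "shifted" test set has a different mean.
train = [rng.gauss(0.0, 1.0) for _ in range(1000)]
test_same = [rng.gauss(0.0, 1.0) for _ in range(1000)]
test_shifted = [rng.gauss(0.5, 1.0) for _ in range(1000)]

critical = ks_critical(len(train), len(test_shifted))
d_same = ks_statistic(train, test_same)
d_shifted = ks_statistic(train, test_shifted)
print(f"same: D={d_same:.3f}, shifted: D={d_shifted:.3f}, critical={critical:.3f}")
```

A statistic above the critical value is evidence that the two samples do not share a distribution, which is the "apples to oranges" situation described earlier.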

Chi-square goodness-of-fit test: tests whether the distribution of categorical sample data matches an expected distribution.
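The chi-square goodness-of-fit statistic is simple enough to compute by hand. The sketch below assumes hypothetical class proportions seen in training and compares two made-up batches of labels against them. The constant 5.991 is the standard critical value for 2 degrees of freedom at alpha = 0.05. `scipy.stats.chisquare` would give a p-value directly.

```python
from collections import Counter

def chi_square_statistic(observed_counts, expected_probs):
    """Sum of (observed - expected)^2 / expected over all categories."""
    total = sum(observed_counts.values())
    stat = 0.0
    for category, prob in expected_probs.items():
        expected = total * prob
        observed = observed_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical class proportions observed in the training data.
expected_probs = {"a": 0.5, "b": 0.3, "c": 0.2}

# Two hypothetical batches of new labels.
matching_batch = Counter(["a"] * 50 + ["b"] * 30 + ["c"] * 20)
drifted_batch = Counter(["a"] * 30 + ["b"] * 30 + ["c"] * 40)

# Critical value for df = (3 categories - 1) = 2 at alpha = 0.05.
CRITICAL_DF2_ALPHA_05 = 5.991

stat_matching = chi_square_statistic(matching_batch, expected_probs)
stat_drifted = chi_square_statistic(drifted_batch, expected_probs)
print(f"matching: {stat_matching:.2f}, drifted: {stat_drifted:.2f}")
```

The matching batch gives a statistic of essentially zero, while the drifted batch gives roughly 28, well above the critical value, so its label distribution would be flagged as different from the expected one.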

Visual Inspection: Of course, before applying statistical tests, visual inspections using plots (e.g., histograms, scatter plots) can provide insights into violations of the IID assumption.

Domain-Specific Tests: Depending on the data and context, domain-specific tests might be more appropriate.

Metadata Analysis: Reviewing metadata and the context in which data was collected can reveal sources of potential violations, such as changes in data collection instruments or protocols.

Logging Data Changes: Maintaining logs of how data is collected, processed, and stored, to identify any changes in the system that could influence the data distribution.

Cohort Analysis: Segmenting the data into cohorts based on time of entry or other relevant factors and examining if outputs significantly differ across cohorts.
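One possible way to sketch the cohort analysis idea: group records by a cohort key such as month of entry, summarize each cohort, and flag cohorts whose mean differs sharply from a reference cohort. Everything here is an assumption for illustration, including the function names, the cohort labels, the simulated values, and the threshold of 3 standard errors.

```python
import random
from collections import defaultdict
from statistics import fmean, stdev

def cohort_summary(records):
    """Group (cohort, value) pairs into {cohort: (count, mean, std)}."""
    groups = defaultdict(list)
    for cohort, value in records:
        groups[cohort].append(value)
    return {c: (len(v), fmean(v), stdev(v)) for c, v in sorted(groups.items())}

def flag_drifting_cohorts(summary, reference, z_threshold=3.0):
    """Cohorts whose mean is more than z_threshold standard errors
    away from the reference cohort's mean."""
    n0, mean0, std0 = summary[reference]
    flagged = []
    for cohort, (n, mean, std) in summary.items():
        if cohort == reference:
            continue
        std_err = (std0 ** 2 / n0 + std ** 2 / n) ** 0.5
        if abs(mean - mean0) / std_err > z_threshold:
            flagged.append(cohort)
    return flagged

rng = random.Random(2)

# Illustrative records: the third month is generated with a higher mean.
records = (
    [("2024-01", rng.gauss(10, 2)) for _ in range(500)]
    + [("2024-02", rng.gauss(10, 2)) for _ in range(500)]
    + [("2024-03", rng.gauss(11, 2)) for _ in range(500)]
)

summary = cohort_summary(records)
flagged = flag_drifting_cohorts(summary, reference="2024-01")
print(flagged)
```

Comparing means is only one choice. The same grouping could feed each cohort into a Kolmogorov-Smirnov or chi-square test to compare whole distributions.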

Thus, a number of strategies are available to check whether the IID assumption holds.

Image source and a good reference for IID: https://www.youtube.com/watch?v=EGKbPww2_rc&t=3s

If I understand correctly, this ignores chaos theory?

Venkat dharaneswar reddy

Currently pursuing my B.Tech in Artificial Intelligence and Data Science at Amrita Vishwa Vidyapeetham

4 months ago

I want to know how raw data is treated in the real world: what is done to that data, and what steps are taken to modify it before starting the training and testing steps. Can you write an article on this? It would be really useful. Thank you.

Rodney Beard

International vagabond and vagrant at sprachspiegel.com, Economist and translator - Fisheries Economics Advisor

4 months ago

One needs to be careful as some machine learning algorithms have a sum or product specification (see the loss function) which essentially is an i.i.d. assumption. Also sometimes you can get away with weaker assumptions concerning the mixing properties of the underlying process/data which basically generalize i.i.d.

Bill Luker Jr PhD

Senior Economist and Methodologist. Statistics, Applied Econometrics, General Analytics, and the Data Sciences. Incisive Thinker, Writer, Researcher, Teacher. Entrepreneur. Author, Writer, Editor, Blogger, Poet.

4 months ago

I think it’s very interesting, Ajit, that the IID issue you discuss here, and the methods you suggest in testing for it, are wholly from the theoretical and practical ground of “small data” statistics and its traditional socio- (including econo-) and psycho-metrics applications. This is in direct opposition to the yearnings, first articulated in the early 20-teens by “big data evangelists,” to break free of the constraints imposed by the dead white men who invented statistics and probability theory. Those of us who said “wait a minute, you’re saying big data voids the laws of probability?” were dismissed as useless fossils and resistant to change. (And I ran into just such an attitude today from what I would call an “AI evangelist” on LI.) Frankly, though, I see a renewed recognition of the need for statistical understanding and learning in the data sciences (plural), in ML and AI, and feel a measure of vindication. I have used all the tests and approaches you describe in sampling practice, and will continue to do so in full confidence that they are neither outdated nor a trivial nuisance.
