Topic: Repeatability vs. Reproducibility in Experiments
Reproducibility vs. Repeatability in Machine Learning

This article explains the difference between reproducibility and repeatability in scientific experiments and why it matters. It also relates that difference to machine learning experiment tracking, projects, model deployment, and operations.

Repeatability means that, having obtained a result from an experiment, you can run the same experiment again with the same setup and get exactly the same result.

Reproducibility is a measure of whether a different team can attain the same result using the same artifacts.

The reproducibility crisis, seen especially in research and development, refers to scientists being unable to reproduce a particular experimental result in the lab. That is a separate concern, however, and outside the scope of this article.

End-to-end machine learning development and deployment is complex, and given the non-deterministic nature of the subject, it becomes important to plan how to integrate it reliably with business and products. Machine learning has deep roots in statistics and probability, and sometimes a slight change in the order of the input data can change the output of an ML program. So let's dive in!

What is reproducibility in Machine Learning?

In simple terms, the same input should produce the same output across multiple runs. Despite its non-deterministic nature, machine learning code should be reproducible; it should not give different results on every run. We usually use scikit-learn's train_test_split, which by default shuffles the dataset randomly before splitting it. If a seed value is not set (through the random_state parameter), every run produces a different train/test split.
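The effect of the seed on a shuffled split can be sketched with Python's standard library alone. The helper `split_dataset` and its parameters are hypothetical, written only for illustration; they stand in for scikit-learn's `train_test_split`, whose `random_state` argument serves the same purpose.

```python
import random


def split_dataset(rows, test_fraction=0.25, seed=None):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper for illustration. With seed=None the generator
    is seeded from system entropy, so the shuffle differs between runs.
    """
    rng = random.Random(seed)   # private generator; global state is untouched
    shuffled = list(rows)       # copy so the caller's data is not mutated
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(20))

# Same seed: the split is identical on every call and every run.
train_a, test_a = split_dataset(data, seed=42)
train_b, test_b = split_dataset(data, seed=42)
print(train_a == train_b and test_a == test_b)  # True

# No seed: the split will almost certainly change from run to run,
# although it is still a valid partition of the data.
train_c, test_c = split_dataset(data)
print(sorted(train_c + test_c) == data)  # True
```

Using a private `random.Random(seed)` instance instead of the module-level functions keeps the split reproducible even if other code also draws random numbers.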

What is a seed? Why does defining a seed help make Machine Learning algorithms reproducible?

A seed makes results reproducible while still bringing randomness into machine learning algorithms through random numbers. Random numbers are of two types: pseudorandom numbers and true random numbers. Pseudorandom numbers appear to be random, but they are not truly random. Typically, they are generated from a seed value (provided by the user), which is passed to an algorithm that uses it to generate a new number.

For example, let’s say we use the following simple equation to generate a series of random numbers:

R = (387 × S + 217) mod 953

Where:

R is the random number to be produced and S is the seed value for R. Let's start with a seed value (S) of 43.

R = (387 × 43 + 217) mod 953 = 16858 mod 953

R = 657 (First Random Number)

To produce the second random number, we then insert 657 as S back into the equation:

R = (387 × 657 + 217) mod 953 = 254476 mod 953

R = 25 (Second Random Number)

This process can be repeated as many times as needed, generating an apparently random series of numbers. If the seed value (S) is the same, the sequence of "random" numbers produced by the algorithm will be exactly the same every time. This means that if you know the equation and the seed value, you can predict the entire sequence. The numbers seem random to us, but they are actually produced by a deterministic algorithm that creates number sequences that only look random.
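The worked example can be checked with a few lines of Python. This is a minimal sketch that reads the equation in its modulo form, R = (387 × S + 217) mod 953, since that form yields the values 657 and 25 given above; a generator of this shape is known as a linear congruential generator. The function name `lcg` and its parameter names are illustrative assumptions.

```python
def lcg(seed, count, a=387, c=217, m=953):
    """Return `count` pseudorandom numbers from R = (a * S + c) mod m."""
    numbers = []
    state = seed
    for _ in range(count):
        state = (a * state + c) % m   # each output becomes the next seed
        numbers.append(state)
    return numbers


first_run = lcg(seed=43, count=5)
second_run = lcg(seed=43, count=5)

print(first_run[:2])            # [657, 25], matching the worked example
print(first_run == second_run)  # True: same seed, same sequence
print(lcg(seed=44, count=2))    # a different seed starts a different sequence
```

A generator this small is only useful for illustration: its outputs are always below 953 and the sequence must eventually repeat.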

“Using a pseudorandom number generator ensures that we are able to replicate our results and in this particular case able to generate the exact same train-test split dataset from the data corpus”

Why Does It Matter?

When a random sample for a scientific study is selected using pseudorandom numbers, others can replicate your results by using the same seed value.

  • In video games, being able to trigger the same "random" events is very useful when the game is being tested.
  • In applications that require encryption, using true random numbers is particularly important. It helps to ensure that data remains protected.
  • Similarly, for online gambling, gaming companies need a very high level of confidence that the way results are produced, in everything from blackjack (how the cards are shuffled) to roulette (where the ball lands) and poker machines, is a truly random process, or they risk someone reverse engineering the algorithm.
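The contrast between replayable and non-replayable randomness can be sketched in Python. One assumption to note: the standard-library `secrets` module is used here as a stand-in for "true" random numbers. It draws from the operating system's cryptographically secure source, which is unpredictable in practice but is not necessarily a hardware true-random generator.

```python
import random
import secrets

# Seeded pseudorandom generators: the same seed replays the same "random"
# events, which is what a game tester wants.
run_1 = random.Random(2024)
run_2 = random.Random(2024)
rolls_1 = [run_1.randint(1, 6) for _ in range(10)]
rolls_2 = [run_2.randint(1, 6) for _ in range(10)]
print(rolls_1 == rolls_2)  # True

# The secrets module cannot be seeded, so its output cannot be replayed
# or predicted from a known starting value.
key_1 = secrets.token_hex(16)   # 16 random bytes as 32 hex characters
key_2 = secrets.token_hex(16)
print(key_1 != key_2)
```

The seeded generator is the right tool when results must be repeated; the unseedable source is the right tool when they must not be.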

Why is there a need for reproducibility in Machine Learning architecture?

Reproducibility is the ability to duplicate a machine learning model exactly: given the same raw data as input, it should return the same result. We generally don't deploy just a machine learning algorithm; we deploy the entire ML pipeline. We need to make sure that every single step of that pipeline is reproducible.

Every step in the machine learning pipeline should be reproducible, including data gathering, feature creation, model building, and deployment, so versioning the code, the data, and the exact infrastructure environment with the right configuration is critical.
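One way to make this concrete is to record a small manifest for each pipeline run. The sketch below is only an illustration, using Python's standard library; the function names and manifest fields are assumptions, not a standard format. It captures a fingerprint of the data, the configuration, the seed, and basic environment details. Code versioning (for example, a commit identifier) would be recorded alongside it but is left out here.

```python
import hashlib
import json
import platform
import sys


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies an exact data snapshot."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(raw_data: bytes, config: dict, seed: int) -> dict:
    """Collect what is needed to re-run a pipeline step the same way.

    Hypothetical structure; the field names are assumptions.
    """
    return {
        "data_sha256": fingerprint(raw_data),
        "config": config,
        "seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


raw = b"id,feature,label\n1,0.5,0\n2,0.9,1\n"
manifest = build_manifest(raw, {"test_fraction": 0.25, "model": "example"}, seed=42)
print(json.dumps(manifest, indent=2, sort_keys=True))

# Any change to the data changes the fingerprint, so drift is detectable.
print(fingerprint(raw) == fingerprint(raw + b"3,0.1,0\n"))  # False
```

Storing such a manifest next to each trained model lets a different team check that they are starting from the same data, settings, and environment before comparing results.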

This becomes the basis of Machine Learning Operations (MLOps).


