Probability Inequalities in Machine Learning

Two of the common prerequisites that students are asked to master before diving into machine learning are probability and linear algebra.

During one of my journeys through the famed Bangalore traffic, staring deep into the abyss (metaphorically, and literally, from the metro), I found myself wondering, after years of experience, what in probability is actually useful in practice and in theory.

Probability for ML Practice:

In my experience, a basic understanding of probability mass functions, density functions, expectations, variance, and covariance is usually enough to follow most new papers with a predominantly practical focus.

Probability for ML Theory:

On the other hand, even a basic understanding of machine learning theory requires knowing a handful of probability inequalities that, I think, are not taught as part of most undergraduate courses. Probability inequalities upper-bound the probability of a random variable taking a value beyond a specified target. These inequalities are useful for theoretically guaranteeing that, with very high probability, the risk cannot exceed a certain value.

In this post, I will briefly list a few probability inequalities and point to references for the proofs. Selfishly, I hope that by writing this post I will remember them when needed and not forget them completely, as I usually do. (Aside: how do most of you remember concepts? I seem to forget nearly everything I read!)

1. Markov Inequality

The Markov inequality states that the probability of a non-negative random variable X taking a value of at least a (> 0) is bounded as follows:

P(X ≥ a) ≤ E[X] / a, for a > 0 and X ≥ 0 (Markov inequality)

Note that, apart from non-negativity, there is no constraint on the type of random variable or its distribution. This is also not considered a "tight" bound, since in many cases a quantity smaller than the RHS would also upper-bound the probability.
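To see how loose it can be, here is a small Python sketch (my own illustration, not from the references): it compares the Markov bound with the empirical tail probability of an exponential random variable with mean 1.

import numpy as np

# Empirically check Markov's inequality, P(X >= a) <= E[X] / a,
# for a non-negative random variable (exponential with mean 1).
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)

a = 3.0
empirical_tail = np.mean(x >= a)   # estimate of P(X >= a)
markov_bound = x.mean() / a        # E[X] / a, estimated from the same samples

print(f"P(X >= {a}) ~ {empirical_tail:.4f}")  # around 0.05
print(f"Markov bound   {markov_bound:.4f}")   # around 0.33

The empirical tail probability (about 0.05) sits far below the bound (about 0.33), which is exactly the sense in which Markov is loose.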

2. Chebyshev Inequality

The Chebyshev inequality is used to bound the probability that a random variable takes values more than k standard deviations away from its mean. As one can expect, this bound is weaker than what we could obtain if we knew the exact distribution.

P(|X − μ| ≥ kσ) ≤ 1 / k² (Chebyshev inequality)
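A similar sanity check, again just an illustrative sketch of mine, this time for a uniform random variable on [0, 1]:

import numpy as np

# Empirically check Chebyshev's inequality, P(|X - mu| >= k*sigma) <= 1/k^2,
# for a uniform random variable on [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)
mu, sigma = x.mean(), x.std()

for k in (1.5, 2.0, 3.0):
    empirical = np.mean(np.abs(x - mu) >= k * sigma)
    print(f"k={k}: empirical {empirical:.4f} <= bound {1 / k**2:.4f}")

For k ≥ 2 the empirical probability is essentially zero while the bound is still positive, so the bound holds with plenty of slack for this particular distribution.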

3. Chernoff Bounds

Chernoff bounds are obtained by applying the Markov inequality to exp(tX) instead of X. This makes the right-hand side a moment generating function (a function of t). For the uninitiated, the moment generating function is an expression that allows us to generate the different moments of a distribution by differentiating with respect to the parameter t (the mean is the first moment, the variance is derived from the second moment, and so on). Since the right-hand side is now a function of t, we can optimize it for the distribution at hand and make the bound tighter than Markov; in many applications the optimized bound decays exponentially, which is much tighter than the variance-based Chebyshev bound.

P(X ≥ a) ≤ min over t > 0 of E[exp(tX)] / exp(ta) (Chernoff bound)
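As an illustration (the standard normal is my own example choice, not from the post), take X to be standard normal, whose moment generating function is E[exp(tX)] = exp(t²/2). Optimizing over t can be done numerically or in closed form:

import math
import numpy as np

# Chernoff bound for a standard normal X:
#   P(X >= a) <= min over t > 0 of exp(t^2/2 - t*a),
# which is minimized at t = a, giving exp(-a^2 / 2).
a = 2.0
t_grid = np.linspace(1e-3, 10.0, 10_000)
chernoff = np.min(np.exp(t_grid**2 / 2 - t_grid * a))  # numerical optimization over t

true_tail = 0.5 * math.erfc(a / math.sqrt(2))            # exact P(X >= a) for a standard normal

print(f"Chernoff bound (grid):   {chernoff:.4f}")        # ~ exp(-2) = 0.1353
print(f"Closed form exp(-a^2/2): {math.exp(-a**2 / 2):.4f}")
print(f"True tail P(X >= 2):     {true_tail:.4f}")       # ~ 0.0228

The grid search agrees with the closed-form optimum, and as a grows, the exponential decay of the Chernoff bound makes it far tighter than Chebyshev's 1/k² decay.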

4. Hoeffding Inequality

The Hoeffding inequality bounds the probability that the sum of n bounded iid random variables deviates from its expectation. This is particularly useful in machine learning, as the samples in a dataset are treated as iid random variables; hence Hoeffding is perhaps one of the most used inequalities in machine learning.

P(|S_n − E[S_n]| ≥ t) ≤ 2 exp( −2t² / Σ_{i=1..n} (b_i − a_i)² ) (Hoeffding inequality)

Here, S_n denotes the sum of n iid random variables X_1, ..., X_n, with each X_i bounded in [a_i, b_i].
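This is where the connection to learning theory is most direct: if each X_i is a bounded loss (say a 0/1 classification error, which is my own example choice here), Hoeffding tells us how far the empirical risk S_n/n can stray from the true risk for a fixed predictor. A small sketch under that assumption:

import math

# Hoeffding for n iid variables bounded in [0, 1] (e.g. 0/1 losses):
#   P(|S_n/n - E[X]| >= eps) <= 2 * exp(-2 * n * eps**2)
def hoeffding_bound(n: int, eps: float) -> float:
    """Upper bound on P(|empirical mean - true mean| >= eps)."""
    return 2.0 * math.exp(-2.0 * n * eps**2)

def samples_needed(eps: float, delta: float) -> int:
    """Smallest n for which the Hoeffding bound drops below delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps**2))

print(hoeffding_bound(n=1000, eps=0.05))    # ~ 0.0135
print(samples_needed(eps=0.05, delta=0.05)) # 738

So roughly 738 samples are enough to guarantee, with probability at least 0.95, that the empirical risk of a fixed predictor is within 0.05 of its true risk.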


I think beyond these there are a couple of other inequalities, like Hölder's inequality, but I am not sure how much they are really used in practice.

References and Proofs:

  1. ch15M.pdf (uconn.edu)
  2. hoeffding.pdf (stanford.edu)
