A first problem with relative frequencies
Alessandro Gecchele
Expert in Software for Electroplating Plants and Industrial Automation. Over 20 years of experience and more than 60 plants commissioned around the world
One of the problems with relative frequencies is something very similar to overfitting.
Overfitting is a classic problem that afflicts statistical models: it occurs when a model is given an excessive number of parameters relative to the data collected. Trusting an overfitted model is a mistake, because the model perfectly explains only the data at hand, while giving incorrect indications on other data coming from the same process or from similar processes.
Using relative frequencies in place of the actual probabilities of the process underlying the data has the same effect on us as using an overfitted model: the relative frequencies give us an illusory, apparently clear understanding of the partial phenomenon we observed, because the uncertainty associated with the limited data has been completely removed from the scene. The data look absolutely perfect, but, unfortunately, in general they do not represent the process we want to understand.
Given this problem, the precautionary principle suggests injecting a certain amount of uncertainty into the relative frequencies, especially when the data are placed in high-dimensional sample spaces and when the data density is low.
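As a minimal sketch of what such an injection of uncertainty might look like, the Python snippet below compares the entropy of raw relative frequencies with that of an additively smoothed (Laplace) estimate on a small chunk of loaded-die throws. Additive smoothing is only one simple, illustrative choice, not necessarily the estimator this article has in mind.

```python
import numpy as np

rng = np.random.default_rng(0)

# True process: a loaded six-sided die, P("1") = 0.7, remaining mass spread evenly.
p_true = np.array([0.7] + [0.3 / 5] * 5)

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A small data chunk: the relative frequencies "look" too certain.
n = 30
counts = rng.multinomial(n, p_true)
rel_freq = counts / n

# Injecting uncertainty: additive (Laplace) smoothing, one pseudo-count per face.
alpha = 1.0
smoothed = (counts + alpha) / (n + alpha * len(counts))

print("true entropy      :", round(entropy(p_true), 4))
print("relative frequency:", round(entropy(rel_freq), 4))   # typically below the true value
print("smoothed estimate :", round(entropy(smoothed), 4))   # pushed back toward higher entropy
```

The pseudo-count alpha controls how much uncertainty is injected; the raw relative frequencies correspond to alpha = 0.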
In order to explain what we risk when we take bare, raw data at face value, I showed in the initial figure of this article two diagrams containing the results of an experiment. The colored points represent the state (expressed in terms of specific energy and specific entropy) of one thousand relative-frequency distributions obtained from data coming from four distinct IID processes: throws of a loaded die with a different prevalence of "1", at 50% (red points), 60% (blue points), 70% (purple points) and 80% (blue points). Each process has a precise, fixed theoretical state, regardless of the dimension of the sample space used. The theoretical state of each process is marked by a cross on the curve of maximum possible entropy, the processes being IID. The first diagram uses data chunks of 500 samples per realization, while the second uses chunks of 10,000 samples per realization. Each cloud of a given color is composed of points referring to the same 1,000 realizations.
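The experiment can be reproduced in miniature. Assuming a 70% prevalence of "1", the sketch below simulates 1,000 realizations at both chunk sizes and computes, for each relative-frequency distribution, the specific-entropy coordinate of its state; the specific-energy coordinate and the larger sample spaces of the figure are left out, since they depend on definitions not restated here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Loaded die: P("1") = 0.7, the other five faces share the remaining 0.3.
p_true = np.array([0.7] + [0.3 / 5] * 5)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_true = entropy(p_true)   # theoretical entropy of the process

for chunk_size in (500, 10_000):
    # 1,000 realizations: each row of counts yields one relative-frequency distribution.
    counts = rng.multinomial(chunk_size, p_true, size=1000)
    entropies = np.array([entropy(c / chunk_size) for c in counts])
    print(f"chunk size {chunk_size:>6}: mean entropy {entropies.mean():.4f}, "
          f"spread {entropies.std():.4f}, theoretical {h_true:.4f}")
```

With the larger chunks the cloud of entropies is both narrower and closer to the theoretical value, as in the second diagram.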
Well, as the density of the data decreases because they are placed in sample spaces of increasing size, the relative-frequency distributions give the impression of coming from different processes, because their state points generate different clouds. In particular, the processes from which the data originate appear progressively simpler, because the apparent distributions become more and more concentrated, with less and less entropy. If we follow the descending curves of the clouds of relative frequencies, we are led to believe that the more the density is reduced, the more defined, clear and understandable the process will appear. Ultimately, we end up with:
"With zero data, all is perfectly clear and predictable".
This is the paradox hidden in relative frequencies, and the one from which it is advisable to protect oneself with a valid probability estimator, capable of bringing the center of gravity of the cloud back to the right entropy.
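A quick numerical illustration of the paradox keeps the data budget fixed while the sample space grows. The sketch below uses a uniform IID process (chosen only because its true entropy is simply log2 K, not because it matches the experiment above) and shows the plug-in entropy of the relative frequencies falling further and further below the true value as the density of data per cell decreases.

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

n = 500  # fixed data budget per realization

for k in (10, 100, 1_000, 10_000):
    # Same uniform IID process observed on a sample space of k cells.
    counts = rng.multinomial(n, np.full(k, 1.0 / k))
    h_emp = entropy(counts / n)
    print(f"K = {k:>6}: plug-in entropy {h_emp:6.3f} bits, true entropy {np.log2(k):6.3f} bits")
```

In the limit of a single observation per realization the plug-in entropy reaches zero, which is the "all is perfectly clear" situation of the quote above.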
One thing the diagrams highlight is the presence of two types of error in the data: in addition to the classical statistical error (represented by the width of the cloud), there is a systematic error, represented by the distance between the cloud's center of gravity and the cross indicating the theoretical state. The diagrams show that, by averaging the results of several realizations of the same process, we can only bring the estimate closer to the center of gravity of the cloud; we cannot bring the cloud closer to the theoretical state of the process. For that purpose it is essential to use a probability estimator, possibly the best one, able to make the points of the cloud climb back up along the right curve, stopping when the energy of the new distribution reaches the estimated energy of the underlying probabilities.
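A sketch of the first part of this point: averaging the plug-in entropies of more and more realizations narrows the statistical spread around the cloud's center of gravity, but leaves the systematic offset from the theoretical entropy essentially unchanged; closing that gap is precisely the job of the estimator (for instance, a smoothed estimate of the kind sketched earlier).

```python
import numpy as np

rng = np.random.default_rng(3)

p_true = np.array([0.7] + [0.3 / 5] * 5)   # loaded die, P("1") = 0.7

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_true = entropy(p_true)
n = 50   # small chunks exaggerate the systematic error

for n_realizations in (10, 100, 10_000):
    counts = rng.multinomial(n, p_true, size=n_realizations)
    h = np.array([entropy(c / n) for c in counts])
    # The mean converges to the cloud's center of gravity, not to the theoretical entropy.
    print(f"{n_realizations:>6} realizations: mean plug-in entropy {h.mean():.4f} "
          f"(theoretical {h_true:.4f})")
```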
A consequence follows from what has been said: if we believe that relative frequencies are equivalent to probabilities, we arrive at the reassuring - but completely wrong - conclusion that our individual experiences are sufficient to produce a reliable and credible representation of Reality.