What is Machine Learning: a point of view
1. The Size of the Problem
Imagine the following scene: at the center of a football pitch stands an asymmetrical object composed of different parts. You may think of this object as a set of irregularly shaped sculptures. Suppose that a few people, scattered haphazardly along the edges of the field, observe this object, and that each of them gives a description of what they see. It is only natural to expect these accounts to differ from one another, to the extent that they may not even seem to be about the same object. The disparity is caused by the differences between the viewing angles. No single account is sufficient to give a complete description of the observed entity. If we wanted complete knowledge of the object, we would have to reconcile all the standpoints we have.
Our goal here is to give an answer to an interesting but capricious question: "What is Machine Learning?". First, it is necessary to keep in mind that there are different types of, and approaches to, Machine Learning. Moreover, since Machine Learning is the outcome of efforts coming from distinct areas of human knowledge, we are bound to find a considerable number of viewpoints on the subject. The point of view of the computer scientist may differ from that of the statistician, and the electrical engineer's may be very different from the philosopher's. Yet all of them describe the same thing, and none of these points of view alone can provide a complete picture of the whole situation. For example, the more technical and applied view of the engineer will lack the broader, more theoretical considerations of which a philosopher is capable. We have a situation analogous to the parable of the football pitch described in the preceding paragraph.
For those reasons we could never dream of giving a definitive answer to the question posed. Instead, we can attempt one that touches on the science only lightly, while remaining faithful to the subject.
It is usual to define Machine Learning as the recognition of patterns supposedly present in large amounts of data, in order to arrive at a rule that allows us to "make predictions" about new examples arising from the same context. We will not depart from this convention: in addition to illustrating it, we will argue that this definition is a natural complement to the scientific method. To exemplify what this scientific method would be, I see here the opportunity to speak a little about Physics, which is perhaps, quantitatively and conceptually, the most successful example of science.
In general, "learning" can be described as the process of transforming a set of information, obtained in some way, into knowledge. This definition of learning conforms to the above definition of Machine Learning in the following terms:
(i) Some "information" is extracted from a set of data. It is crucial that the components of this data set follow some underlying logic, meaning that they all originate from the same population. For example, in the context of image recognition, the data set would consist of thousands of pictures of cats, each one of those properly labeled as “cat”, plus thousands of dog pictures labeled "dog", plus thousands of photos of cars with the proper labels and so on.
(ii) The "process" is called training. It is where, through the use of one of the so-called "Machine Learning algorithms", the patterns in the data are recognized. Often, such algorithms are versions of procedures already well established in Statistics. In Machine Learning, the programming paradigm differs from that of traditional programming: the logical flow is controlled more by goals set through a mathematical optimization process than by strict rules imposed by the programmer. Continuing with the example of (i): the model learns through a kind of "trial and error" process. The optimization algorithm "punishes" the model whenever a mistaken classification takes place, e.g., when it says that it "sees" an image of a dog but the picture is actually of a cat. The training process is considered complete when the model has finally "learned" to classify the images in the data set as well as it can, which is the point at which the model ceases to improve its performance.
(iii) "Knowledge" is obtained in the following way: once we have a trained model, we can hope that future cases will present similar patterns. Similar, but not necessarily the same, because they will most likely exhibit variations within the underlying "logic" of the examples already seen in training. From "understanding" these possible variations, in a statistical sense, derives the ability to make predictions about future situations. Following the example of image recognition, assume that the trained model is presented with a picture of a cat that was not in the original data set. If the training was performed properly, we can expect the model to correctly classify this new photo as a cat. (A small code sketch of this whole workflow is given just after this list.)
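To make items (i)-(iii) concrete, here is a minimal sketch in Python. It is only an illustration under strong simplifying assumptions: the pictures are pretended to be already converted into fixed-length numeric feature vectors (here just random numbers with an artificial shift for the "cat" examples), and scikit-learn's LogisticRegression stands in for "some Machine Learning algorithm"; a real image classifier would be considerably more elaborate.

# A toy illustration of (i) data, (ii) training, (iii) prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# (i) Information: a labeled data set whose examples all come from the same population.
X_train = rng.normal(size=(200, 64))               # 200 "images", 64 features each
y_train = np.array(["cat"] * 100 + ["dog"] * 100)  # the label attached to each picture
X_train[:100] += 1.0                               # give the "cat" examples a detectable pattern

# (ii) Process: training, i.e. an optimization that fits the model's parameters to the data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# (iii) Knowledge: classify a new example the model has never seen before.
x_new = rng.normal(size=(1, 64)) + 1.0             # a new, unlabeled "cat-like" picture
print(model.predict(x_new))                        # expected: ['cat']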
2. A Case of Success
For millennia mankind looked up at the sky, drawing up astronomical maps. In modern terms, one can say that these observations sought to collect data in order to recognize the patterns of movement of the celestial bodies. Based on these primitive observations, the Babylonians were able to predict lunar eclipses with considerable accuracy. Centuries later, the Greek mathematician and philosopher Thales of Miletus would be credited with being the first man to predict a solar eclipse. We point out that both the lunar and the solar feats were achieved long before even the most basic principles of Physics had been established.
However, the ability to predict lunar eclipses proved fruitless for predicting solar eclipses; otherwise we would not have had to wait hundreds of years for Thales: the Babylonians would have taken the knowledge they had of the lunar case and extrapolated it to the solar one. That is to say: if these two phenomena follow the same physical laws, why did the solution of one case not immediately allow the solution of the other? Because, as noted in the previous paragraph, at that time no fundamental law explaining these phenomena had been discovered (the Earth was still believed to be flat).
The power to predict eclipses in the Babylonian era derived purely from inferences based on the periodicity spotted in the observations (the "Big Data" of the time); for example, that lunar eclipses repeated every 18 years, ten days and eight hours (the Saros cycle). This type of information applied only to particular cases, without being useful in general ones. That is only natural: if you bought a two-pound salmon today, it does not mean that all salmon will always have the same weight. In other words, recognizing the period with which one specific manifestation of a phenomenon repeats itself is not sufficient to describe the periods of all possible manifestations of the same type of phenomenon. In this case, something else has to be learned. In Machine Learning, the worst thing that can happen is for your model to fail to serve general situations. In the salmon case, you would need to buy far more fish over a much longer period of time in order to be able to infer a distribution for salmon size.
Physics seeks to understand the nature of "objects" such as matter, time, space and energy and, essentially, how the interactions between such "objects" work. To these things, and to the relations between them, we give the name "physical phenomena". In other words, Physics aims to discover the fundamental principles that govern these phenomena. These descriptions should be expressed in the form of basic laws, usually through mathematical formulas. Typically such principles are induced from past observations (which correspond to the "data"), a process Newton called inductive reasoning, and their accuracy must then be tested against future observations: the knowledge contained in these principles and formulas should suffice to describe what will happen in future occurrences of the same phenomenon. That is to say, it is crucial that such laws be consistent with both past and future facts. In the Machine Learning world, the ability to predict the outcome of future observations is called the "power of generalization". In the image recognition task, a good power of generalization is illustrated by the capacity of our model to properly classify unseen images, i.e., pictures of cars, dogs, cats, etc. that were not part of the set used for training.
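In practice, the power of generalization is usually estimated by holding back part of the labeled data during training and measuring the model's accuracy on that unseen part. A minimal sketch, reusing the same synthetic feature vectors and stand-in model as in the earlier example:

# Estimating generalization: train on one part of the data, test on the rest.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 64))
y = np.array(["cat"] * 200 + ["dog"] * 200)
X[:200] += 1.0                                   # artificial "cat" pattern, as before

# Hold back 25% of the labeled examples; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on the held-out pictures is our proxy for the power of generalization.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))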
Physics has been very successful in this endeavour of finding laws that hold for both past and future observations. For example, Newton's theory of universal gravitation (as well as, more generally, Einstein's general relativity) is able to describe the trajectory of the Earth around the Sun, i.e., Earth's orbit. The physical objects involved here are the Sun, the planet Earth, their respective masses, the distance that separates the center of one body from the center of the other, and finally the interaction between them. The orbit of the Earth follows a clear pattern: it describes an ellipse around the Sun that takes one year to complete. Based on data from astronomical observations made over millennia, and following the efforts of Copernicus, Kepler, Galileo, Hooke and others, Newton understood, at least to a first approximation, the general principles governing gravity: the force is proportional to the product of the masses and inversely proportional to the square of the distance between their centers. This made it possible to describe the patterns present both in the "fall of an apple" and in the motion of all the planets of the solar system (except for Mercury, which had to wait more than two centuries until Einstein presented a theory that "explained" its orbit), as well as of its moons, of the comets that approach us, and so on. This is a case of successful generalization. In Machine Learning, the quality of your algorithm must be measured by its capacity for generalization.
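In symbols, the law Newton arrived at reads

F = G · m₁ · m₂ / r²,

where m₁ and m₂ are the two masses, r is the distance between their centers, and G is the gravitational constant. One short formula accounts for the falling apple and for the orbits alike.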
3. If Only Everything Were That Simple...
Have you ever thought about how an electron, as a structure, is much simpler than a human being? Or about how predicting the total falling time of an apple from the top of a three-meter-tall tree is a much easier problem than trying to predict today the outcome of next year's elections?
In part, the success of Physics can be explained by the simplicity of the phenomena it deals with. In other areas, this simplicity is often absent. While Physics studies relatively primitive objects, such as the motion of massive bodies and elementary particles, fields that study relations between more complex entities, such as the interactions between human beings, tend to fail when it comes to determining the basic principles that govern those interactions. Our intellect usually disappoints when the data present great complexity, which can come both from their sheer quantity and from the intricacy of their supposed patterns. When we cannot understand all the basic principles of the phenomenon about which we want to make predictions, Machine Learning offers hope.
Someone may point out: "But this is already one of the roles of Statistics (more precisely, of Inferential Statistics): to provide scientific treatment to situations where we do not know the principles governing the phenomenon we are interested in." Yes, that is correct. So would Machine Learning be a competitor of Statistics? Not really: Machine Learning and Statistics are allies in this endeavor to extract from data sets the information that may serve as the basis for good decision making. Formally, they are united by Statistical Learning Theory.
The point is that once the principles governing a phenomenon are known, we immediately know the patterns generated by this phenomenon. On the other hand, recognizing patterns does not necessarily imply immediately knowing the fundamental principles behind the phenomenon. But most of the time, recognizing patterns is the best we can aspire to. And here, when the amount of data is too large and its nature too complex, Machine Learning can be of great help. That is to say, Machine Learning is an extension of the scientific method in the sense that it hopes to assist us in obtaining knowledge even when we are unable to fully understand the phenomenon we are studying.
For example, it is indeed very complicated to try to explain what makes a dog, a dog, or a cat, a cat. We humans know how to differentiate the two animals more synthetically than analytically. That is to say, when we recognize a dog, we do it automatically, without having to resort to any complex analytical processes. Actually, even a dog is able to make such a distinction. Or, in the same vein, when you meet someone you know, how do you know who it is? How can you tell that John is not Joseph? Do you measure the distance between the eyes or the thickness of that person's lip? No, you simply recognize him or her. We simply learned to repeat a classification process, without knowing how to describe it precisely. In other words, it is a knowledge that we obtained empirically.
In order for an image-recognition Machine Learning model to succeed in the task of classifying whether a given image is of a cat or of a dog, it needs only to be able to replicate this classification process after being exposed to zillions of examples, i.e., properly labeled photos of these animals.
In other words, when it is difficult to find the basic principles governing a given phenomenon, our next natural move is to try to explain the patterns found in our observations as being generated by some well-known "statistical rules", usually called "probability distributions". Long before the birth of Machine Learning, Statistics was a successful, fully developed area of Mathematics, already widely used in scientific research across a wide range of fields. This means that these probability distributions had already been extensively tested in applications.
4. A Certain Personal Disappointment
I must confess that I was absolutely fascinated when I first heard the term Machine Learning.
What is the point of saying that a machine learns? Does the machine get happier when it learns something? Does it understand what it has learned to the point of being able to extrapolate such knowledge into totally different contexts? Do the people doing this Machine Learning thing know something that members of other scientific communities are still to find out?
I must admit that, when I began to understand what it is about, I felt a bit blasé: classical "machine learning" is often a matter of relatively old statistical techniques allied to the processing power of today's computers. And deep down, I expected something a bit more magical. But it is also a relief that things turn out to be as they are: Machine Learning is nothing more than a natural evolution of the search for knowledge that mankind has been pursuing for thousands of years.
And why do we say that the machine learns? It learns in the sense that it does not need hard-coded programming, i.e., the programmer does not need to anticipate every kind of situation that may happen in a given context, for example: "if A happens, then do B". Everything is controlled by a mathematical optimization process chosen by the programmer. The type of statistical model to be learned is also assumed in advance by the programmer, but the machine, with the aid of the optimization algorithm, learns to adjust the necessary parameters.
Essential in Machine Learning, therefore, is the introduction of an "error function". The objective of the optimization process is to make the value of this function as small as possible. If the model "sees" the picture of a dog and "guesses" that it is a car, the value of the error function increases, which is not good, since it runs contrary to the objective. The optimization algorithm (the most used one has its origins more than two hundred years ago) then forces the model to adjust its parameters so that the next time the same picture shows up, the guess is not "it's a car". Normally the parameters exist in colossal quantity and must be adjusted, at each step, in a way that decreases the overall error, and not just the error attributed to that particular photo. Conversely, whenever the model gets the label right, the error decreases. When the error gets as small as possible, the training is complete. In an ideal world, we will have learned the probability distribution to which our data belong, and the model is ready to assist us in our tasks. Perhaps, at the end of the day, we are the ones who learn through the machine.
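To make this loop concrete, here is a deliberately tiny sketch in Python: a single parameter, a squared-error "error function" and plain gradient descent, standing in for the huge models and classification losses discussed above.

# A one-parameter caricature of training: gradient descent on an error function.
# We want the parameter w to recover the relationship y = 3 * x hidden in the data.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # (input, correct label) pairs

w = 0.0              # the parameter the "machine" must learn
learning_rate = 0.01

for step in range(1000):
    # Error function: mean squared difference between the guesses and the correct labels.
    gradient = 0.0
    for x, y in data:
        guess = w * x
        gradient += 2 * (guess - y) * x       # derivative of (guess - y)^2 with respect to w
    gradient /= len(data)

    # Adjust the parameter a little in the direction that decreases the error.
    w -= learning_rate * gradient

print(w)   # close to 3.0: the pattern hidden in the data has been "learned"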
Or it may be that everything I wrote above is bullshit: it may be that indeed the machines learn the fundamental principles behind the phenomena they examine and that they have decided, at least for the time being, not to tell us their secret!