What is Machine Learning? A Simple Definition
Machine Learning (ML), a subfield of Artificial Intelligence (AI), is claimed to be the “new oil” although it has been around for decades, actively studied in academia and used in industry. The advent of the Web has brought new problems that are usually easy to describe but seem difficult to solve; ML seems to be a perfect fit for such problems.
Now ML is the talk of the town, so much so that some regard it as holding the keys to our future. Multiple well-known companies seem to have even changed their strategic vision to bet almost the whole house on their ML investments. Yet it seems difficult to find a definition of ML that is easy for the layperson to understand. This adds more mystery to its already mysterious fame and probably contributes to the existing hype around it.
In this article, I will try to give a very simple introduction to ML. I hope to show that ML is actually easy to understand at its core.
What is Machine Learning? It is relatively easy to give a mathematical definition, but unfortunately there is no standard definition simple enough for the layperson, that is, a definition "in English" with minimal or no math. One good definition is by Tom Mitchell of Carnegie Mellon University: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." We can parse this definition further to answer the obvious questions that it raises, but let us not digress. Let us first attempt to develop a simpler definition from a different perspective and tie it back to this definition in an upcoming article.
Let us start. Think of two data sets, called Input (X) and Output (Y). The members of X are called inputs and those of Y are called outputs. For simplicity, think of X and Y together as a two-dimensional table with rows and columns, as in a spreadsheet application like Excel, or as shown below. The table below has five rows and two columns.
For ML applications, it is usually the case that X will have multiple attributes (or columns of its own) while Y will have a single attribute or column. For example, if X and Y are about people, X may have columns for a person's demographic attributes such as age, gender, and education level while Y may have a single column such as the person's income tier.
Now think of a set F of functions, each of which takes in an input in X and produces an output in Y; a function in F is also said to map each input in X to an output in Y. For example, in the table above, the function takes in 20 in Row 2 and produces 40 on the same row. On a computer, each function will be represented using a computer program ("the machine") that reads its inputs and produces its outputs.
A function can be as simple as the doubling function, which takes any number in as an input and outputs its double (as in the table above); a function can also be as sophisticated as a search engine, which takes a search query as an input and produces the search results as an output, or as an email classifier, which finds out whether or not a given email is spam.
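To make the idea of a function as a program concrete, here is a minimal Python sketch of the doubling function. The name `double` and the sample inputs are illustrative choices and are not taken from the original table.

```python
def double(x: float) -> float:
    """Map an input to its double."""
    return 2 * x


# Hypothetical inputs; applying the function to each one yields the outputs.
inputs = [10, 20, 30]
outputs = [double(x) for x in inputs]
print(outputs)
```

A search engine or a spam classifier plays the same role as `double` here: it reads an input and produces an output, only with far more elaborate internals.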
Now consider the following key table. It lists three cases involving what is known (by us) and unknown about Input, Function, and Output. We need to know at least two to be able to derive the unknown one.
To understand this table, let us use a simple example. Suppose both Input and Output consist of real numbers. In addition, for the first two cases, suppose that the known function is the doubling function. For simplicity, let us assume that functions we are interested in always refer to functions that can compute their outputs in a reasonable amount of time.
In Case 1, Output is unknown but we can compute each unknown output by evaluating the function, i.e., by simply multiplying each known input by 2. This case corresponds to the most common case of computing the output of or evaluating a given function.
In Case 2, Input is unknown so we perform the reverse of Case 1: To compute each unknown input, we invert the function; we simply divide the known output by 2. This case is less common than Case 1 because not every function is invertible.
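Cases 1 and 2 can be sketched in a few lines of Python, assuming the known function is the doubling function. The names `double` and `halve` and the sample numbers are hypothetical and chosen only for illustration.

```python
def double(x: float) -> float:
    """The known function: multiply the input by 2."""
    return 2 * x


def halve(y: float) -> float:
    """The inverse of the known function: divide the output by 2."""
    return y / 2


# Case 1: inputs and function are known, so evaluate to get the outputs.
known_inputs = [10.0, 20.0, 30.0]
computed_outputs = [double(x) for x in known_inputs]

# Case 2: outputs and function are known, so invert to get the inputs.
known_outputs = [20.0, 40.0, 60.0]
recovered_inputs = [halve(y) for y in known_outputs]

print(computed_outputs)
print(recovered_inputs)
```

Doubling can be undone because each output comes from exactly one input. The squaring function, by contrast, maps both 3 and -3 to 9, so over the real numbers an output of 9 does not determine a unique input.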
For Case 3, given input and output numbers, we need to discover, find, or derive the unknown function. Deriving the unknown function is unfortunately not easy, as there can be a huge number of functions that map the input numbers to the output numbers. To speed up this search process so as to find the right or the best function, we usually need to introduce various criteria as constraints. For example, let us assume we know what Function looks like: say, each function simply returns a constant multiple of its input number. What we need to find out now is the constant factor to multiply a given input by. In math notation, we have assumed that Y = F(X) = K * X, so we know K is a constant but we do not know its value. Then, say, given 30 as an input and 60 as an output, we can conclude that K = 60 / 30 = 2, so the best function is the doubling function.
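The constrained search in this example reduces to solving Y = K * X for K. The sketch below shows that single step; the helper name `find_constant` is hypothetical, and the guard against a zero input is an added assumption, since a zero input carries no information about K.

```python
def find_constant(x: float, y: float) -> float:
    """Solve y = K * x for K, given one input/output pair with x != 0."""
    if x == 0:
        raise ValueError("x must be nonzero to determine K")
    return y / x


# The pair from the text: input 30, output 60.
k = find_constant(30, 60)
print(k)
```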
Here is the punchline: ML is about Case 3. In other words, Machine Learning tries to find or derive the best function that maps a given set of inputs to a given set of outputs. Here “the best” depends on the criteria introduced in the function search process, as we did in the example above. Of course, we do not stop at finding the best function. Once the best function is found, we usually move to Case 1 to compute new outputs for given inputs.
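The two-step pattern described here, first Case 3 and then Case 1, can be sketched as follows. This is only an illustration under the constant-multiple assumption Y = K * X; the name `learn_multiplier` and the new inputs are hypothetical.

```python
from typing import Callable


def learn_multiplier(x: float, y: float) -> Callable[[float], float]:
    """Case 3: derive K from one known pair and return the learned function."""
    k = y / x  # assumes x != 0

    def learned(new_x: float) -> float:
        return k * new_x

    return learned


# Step 1 (Case 3): find the function from a known input/output pair.
f = learn_multiplier(30, 60)

# Step 2 (Case 1): evaluate the learned function on inputs not seen before.
new_inputs = [7, 12.5]
new_outputs = [f(x) for x in new_inputs]
print(new_outputs)
```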
At a high level, this is all you need to know about the definition of ML but let us discuss some details. Let us start with an example.
Given the table above, suppose we are trying to find a function F1 that maps the inputs to the outputs. From these inputs and outputs, it seems highly likely from the plot of F1 above that F1 is the doubling function. It is definitely possible that given more inputs, the outputs may diverge from the exact doubles but we do not have evidence of that yet.
This time, let us alter the outputs slightly as follows. Suppose again that we are trying to find a function F2 that maps the input to the output. Notice that for each row of the table below the output is close to the double of the input. What does the function F2 look like, then?
As I mentioned above, there are many functions that can map these inputs to these outputs. So which function is the one we are looking for? To make progress, we need to introduce constraints, as mentioned above. One typical constraint is to limit what F2 looks like. A simple choice, as was done above, is to assume that F2 simply outputs a constant multiple of its input, i.e., F2(X) = K * X. Now the next step is to find the particular constant K.
Using the rows of the table above, we derive multiple values for K, as shown in the last column above: 1.50, 2.00, 2.50, 1.75, and 2.00. Which one is the right one, or which one gives us the best function? A reasonable choice of performance measure is to select the K that leads to the smallest overall average deviation between the computed output and the (expected) output in the table above. In other words, we try to minimize the error between the computed and expected outputs.
The deviation between the computed and expected outputs is simply their difference. The sum of these differences can be computed in multiple ways. One common way is to square each difference before adding and minimize the overall average. For F2, this leads to K = 2. In other words, despite the slight deviations in the computed outputs, a reasonable choice is to assume that F2 is also the doubling function, as shown by the graph above.
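The selection of K by minimizing the average squared difference can be sketched as below. The data is an assumption: the inputs 10 through 50 are hypothetical, and the outputs were chosen so that the per-row ratios match the values 1.50, 2.00, 2.50, 1.75, and 2.00 listed in the text. The closed-form expression follows from setting the derivative of the sum of (K * x - y)^2 with respect to K to zero, which gives K = sum(x * y) / sum(x * x).

```python
# Assumed data: hypothetical inputs, with outputs chosen to reproduce
# the per-row ratios 1.50, 2.00, 2.50, 1.75, 2.00 quoted in the text.
xs = [10, 20, 30, 40, 50]
ys = [15, 40, 75, 70, 100]


def mean_squared_error(k: float, xs: list, ys: list) -> float:
    """Average of squared differences between computed (k * x) and expected (y)."""
    return sum((k * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Per-row candidates for K, one per row of the table.
ratios = [y / x for x, y in zip(xs, ys)]

# Option 1: pick the candidate with the smallest mean squared error.
best_candidate = min(ratios, key=lambda k: mean_squared_error(k, xs, ys))

# Option 2: closed-form minimizer over all real K, not just the candidates.
k_best = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(ratios)
print(best_candidate, k_best)
```

On this assumed data both options agree on K = 2, which matches the conclusion in the text. With other data the closed-form value need not coincide with any of the per-row ratios.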
Now I of course have ignored many details about ML so that I can give a simple definition. To convince you that this definition is still a very useful one, let us look at a few consequences of it below, with some analogies drawn from various other fields.
… Let me stop here though I think I can go on with more consequences. I hope you are now convinced that the simple definition of ML above is still a very useful one.
Acknowledgments. Thanks to Joshua Koran, Sahin Geyik, and Shahriar Shariat for their feedback on a draft of this article.
Note (Jan 1, 2024): I had to revise this article and add images again as it seems LinkedIn lost the original formatting and images.
Disclaimer: This article presents the opinions of the author. It does not necessarily reflect the views of the author's employer(s).
Technical Quarterback for Sales and Marketing
(7 years ago) Ali, this is great. A typical example of "Simplicity is the ultimate sophistication".
Data, AI and ML Engineering | Distributed Computing | Graph Theory
(7 years ago) Case 1 is not completely true for one-way functions. Even if we know the input and the function, it may be computationally very hard to find the output. On the other hand, for Case 3, if the input is, say, a sequence of n values in {0,1} and the expected output is a 3-DNF on n variables that is consistent with the input sequence, then even learning that could be NP-hard. I was also wondering whether the examples given to a learning algorithm can be thought of as access to an oracle (even a probabilistic one).