What is Machine Learning? A Simple Definition

Machine Learning (ML), a subfield of Artificial Intelligence (AI), is claimed to be the “new oil” although it has been around for decades, actively studied in academia and used in industry. The advent of the Web has brought new problems that are usually easy to describe but seem difficult to solve; ML seems to be a perfect fit for such problems.

Now ML is the talk of the town, so much so that some regard it as holding the keys to our future. Multiple well-known companies seem to have changed their strategic vision to bet almost the whole house on their ML investments. Yet it is difficult to find a definition of ML that is easy to understand for the layperson. This adds more mystery to its already mysterious fame and probably contributes to the existing hype around it.

In this article, I will try to give a very simple introduction to ML. I hope to show that ML is actually easy to understand at its core.

What is Machine Learning? It is relatively easy to give a mathematical definition, but unfortunately there is no standard definition simple enough for the layperson, that is, a definition "in English" with minimal or no math. One good definition is by Tom Mitchell of Carnegie Mellon University: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." We can parse this definition further to answer the obvious questions it raises, but let us not digress. Let us first attempt to develop a simpler definition from a different perspective and tie it back to this definition in an upcoming article.

Let us start. Think of two data sets, called Input (X) and Output (Y). The members of X are called inputs and those of Y are called outputs. For simplicity, think of X and Y as a two-dimensional table with rows and columns, as in a spreadsheet application like Excel or as shown below. The table below has five rows and two columns.

Two data sets X and Y. Assuming a relationship between them, we may call X input and Y output.

For ML applications, it is usually the case that X will have multiple attributes (or columns of its own) while Y will have a single attribute or column. For example, if X and Y are about people, X may have columns for a person's demographic attributes such as age, gender, and education level while Y may have a single column such as the person's income tier.

Now think of a set F of functions, each of which takes in an input in X and produces an output in Y; a function in F is also said to map each input in X to an output in Y. For example, in the table above, the function takes in 20 in Row 2 and produces 40 on the same row. On a computer, each function is represented by a computer program ("the machine") that reads its inputs and produces its outputs.

A function can be as simple as the doubling function, which takes any number as an input and outputs its double (as in the table above); a function can also be as sophisticated as a search engine, which takes a search query as an input and produces the search results as an output, or as an email classifier, which determines whether or not a given email is spam.

Now consider the following key table. It lists three cases involving what is known (to us) and unknown about Input, Function, and Output. We need to know at least two of the three to be able to derive the unknown one.

Three cases over three parameters, with two of them known and one of them unknown.

To understand this table, let us use a simple example. Suppose both Input and Output consist of real numbers. In addition, for the first two cases, suppose that the known function is the doubling function. For simplicity, let us assume that functions we are interested in always refer to functions that can compute their outputs in a reasonable amount of time.

In Case 1, Output is unknown but we can compute each unknown output by evaluating the function, i.e., by simply multiplying each known input by 2. This case corresponds to the most common case of computing the output of or evaluating a given function.
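Case 1 can be sketched in a few lines of code. The inputs below are hypothetical, chosen to match the doubling table discussed above:

```python
# Case 1: the inputs and the function are known; compute the unknown outputs.
def double(x):
    return 2 * x

inputs = [10, 20, 30, 40, 50]               # hypothetical inputs
outputs = [double(x) for x in inputs]       # evaluate the function on each input
print(outputs)  # [20, 40, 60, 80, 100]
```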

In Case 2, Input is unknown so we perform the reverse of Case 1: To compute each unknown input, we invert the function; we simply divide the known output by 2. This case is less common than Case 1 because not every function is invertible.
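Case 2 is the mirror image: for the doubling function, the inverse is halving. A minimal sketch, again with hypothetical outputs:

```python
# Case 2: the outputs and the function are known; invert to recover the inputs.
def double_inverse(y):
    return y / 2

outputs = [20, 40, 60, 80, 100]             # hypothetical known outputs
inputs = [double_inverse(y) for y in outputs]
print(inputs)  # [10.0, 20.0, 30.0, 40.0, 50.0]
```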

For Case 3, given input and output numbers, we need to discover, find, or derive the unknown function. Deriving the unknown function is unfortunately not easy, as there can be a huge number of functions that map the input numbers to the output numbers. To speed up this search and find the right or best function, we usually need to introduce various criteria as constraints. For example, let us assume we know what the function looks like, say, that it simply returns a constant multiple of its input. What we now need to find is that constant factor. In math notation, we assume Y = F(X) = K * X, where K is a constant whose value we do not know. Then, given 30 as an input and 60 as an output, we can conclude that the best function is the doubling function.
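Under the assumed form Y = K * X, a single input-output pair pins down K:

```python
# Case 3: the inputs and outputs are known; the function is unknown.
# We constrain the search by assuming F(X) = K * X and solve for K.
def derive_k(x, y):
    return y / x

k = derive_k(30, 60)
print(k)  # 2.0, i.e., the doubling function
```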

Here is the punchline: ML is about Case 3. In other words, Machine Learning tries to find or derive the best function that maps a given set of inputs to a given set of outputs. Here "the best" depends on the criteria introduced in the function search process, as in the example above. Of course, we do not stop at finding the best function. Once the best function is found, we usually move to Case 1 to compute new outputs for given inputs.

At a high level, this is all you need to know about the definition of ML but let us discuss some details. Let us start with an example.

Data for Y=F1(X).
F1's plot.

Given the table above, suppose we are trying to find a function F1 that maps the inputs to the outputs. From these inputs and outputs, and from the plot of F1 above, it seems highly likely that F1 is the doubling function. It is certainly possible that, given more inputs, the outputs may diverge from exact doubles, but we have no evidence of that yet.

This time, let us alter the outputs slightly, as follows. Suppose again that we are trying to find a function F2 that maps the inputs to the outputs. Notice that, for each row of the table below, the output is close to double the input. What does the function F2 look like, then?

Data for Y=F2(X).
F2's plot.

As I mentioned above, there are many functions that can map these inputs to these outputs. So which function is the one we are looking for? To make progress, we need to introduce constraints, as mentioned above. One typical constraint is to limit what F2 looks like. A simple choice, as was done above, is to assume that F2 simply outputs a constant multiple of its input, i.e., F2(X) = K * X. The next step is then to find the particular constant K.

Using the rows of the table above, we derive multiple candidate values for K, as shown in the last column above: 1.50, 2.00, 2.50, 1.75, and 2.00. Which one is the right one, or which one gives us the best function? A reasonable choice of performance measure is to select the K that leads to the smallest overall average deviation between the computed outputs and the (expected) outputs in the table above. In other words, we try to minimize the error between the computed and expected outputs.

The deviation between a computed and an expected output is simply their difference. These differences can be aggregated in multiple ways. One common way is to square each difference and minimize the overall average of the squares. For F2, this leads to K = 2. In other words, despite the slight deviations in the outputs, a reasonable choice is to conclude that F2 is also the doubling function, as shown by the graph above.
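Since the article's table is not reproduced here, the data below is hypothetical but consistent with the per-row ratios 1.50, 2.00, 2.50, 1.75, and 2.00 listed above. For the model F2(X) = K * X, minimizing the average squared error has a well-known closed-form solution:

```python
# Fit F2(X) = K * X by minimizing the mean squared error.
# For this model, the best K has a closed form: K = sum(x*y) / sum(x*x).
# Hypothetical data consistent with the ratios 1.50, 2.00, 2.50, 1.75, 2.00:
xs = [10, 20, 30, 40, 50]
ys = [15, 40, 75, 70, 100]

k = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(k)  # 2.0: despite the noise, the best fit is the doubling function
```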

Now I of course have ignored many details about ML so that I can give a simple definition. To convince you that this definition is still a very useful one, let us look at a few consequences of it below, with some analogies drawn from various other fields.

  1. The derived function in Case 3 depends fully on the given inputs and outputs. If the inputs and outputs are bad, biased, insufficient, “garbage,” or whatever, the derived function will carry those undesirable characteristics too. Using an analogy, if we run a survey interviewing only the young people in a small town, we do not expect that the survey results will be valid for all age groups or all parts of a country.
  2. If inputs and outputs are actually not related to each other, the derived function is not expected to work well. Using an analogy, if a student is taught only French but is instead tested on German in a language test, we do not expect the student to do well.
  3. Due to the inputs and outputs being bad or due to lack of care in deriving the function, the resulting function in Case 3 may not work or “generalize” well. This means it can produce inaccurate outputs for unseen inputs in Case 1.
  4. On the flip side of 3, we usually assume that the function in Case 3, derived using only the given inputs and outputs, will generalize to yet unseen inputs in Case 1. Of course, we do not simply make this assumption and move from Case 3 to Case 1. Instead, we increase our confidence in the goodness of the derived function by testing it before running it in Case 1. To do that, we again turn to the given data: we use part of it to derive the function and the rest to test the function. ML has multiple ways of doing this step well. Once our confidence improves with good test results, we deploy the derived function to Case 1 for the real-world scenarios with yet unseen data.
  5. Inputs and outputs can change with time. For example, a human user’s interests can shift with age. This means we simply cannot derive the function once and run it for good for many years. ML has ways of solving this problem too. For example, we can monitor the results of Case 1 to ensure that the derived function is still good. We can also derive the function every so often with new data to take into account the changes that come with time.
  6. As we mentioned earlier, we need to introduce various criteria in Case 3 in order to make progress towards deriving the "best" function. ML researchers have come up with many types or forms of functions that are best suited to this or that class of problems under different sets of criteria. Many of these forms also have a visual structure beyond just a math equation; some look like a decision tree or a network.
  7. We may not have all the attributes for each input. A missing attribute may be missing because it may not exist at all or its value may not be known to us. Yet we still expect the derived function to produce an output. For example, in e-commerce, users are not expected to do all their shopping on a single site nor are they expected to buy every product available on the site. Yet e-commerce sites still attempt to derive a function to recommend a set of products, directly or inside advertisements, to a given user with the expectation that the recommendation is good, i.e., the user will probably buy those products.
  8. Given the task as in Case 3, a function will usually be found. This does not mean that a function really exists, or what is found is the “real” function that has been waiting to be discovered. What we know is that it is the best function with the given inputs, outputs, and criteria.
  9. Since the function (its parameters and, partly, its form) is derived directly from the data, it is not explicitly programmed. At the same time, even when the function performs really well, it may not be easy to reduce how it does its job to a simple set of rules or insights.
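Point 4 above can be sketched in a few lines. The data here is hypothetical (noisy doubling), and the derivation uses the same least-squares fit for F(X) = K * X discussed earlier:

```python
import random

# A minimal sketch of point 4: derive the function on part of the data
# and test it on the held-out rest. Hypothetical noisy data: each output
# is roughly double its input.
random.seed(0)
data = [(x, 2 * x + random.uniform(-1, 1)) for x in range(1, 21)]
random.shuffle(data)

train, test = data[:15], data[15:]  # hold out 25% of the rows for testing

# Derive F(X) = K * X on the training split via the least-squares fit.
k = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# Test the derived function on the unseen rows.
mse = sum((y - k * x) ** 2 for x, y in test) / len(test)
print(k, mse)
```

With good test results (a small error on the held-out rows), we gain confidence that the derived function will generalize to yet unseen inputs.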

Let me stop here, though I could go on with more consequences. I hope you are now convinced that the simple definition of ML above is still a very useful one.

Acknowledgments. Thanks to Joshua Koran, Sahin Geyik, and Shahriar Shariat for their feedback on a draft of this article.

Note (Jan 1, 2024): I had to revise this article and add images again as it seems LinkedIn lost the original formatting and images.

Disclaimer: This article presents the opinions of the author. It does not necessarily reflect the views of the author's employer(s).
