Machine Learning in Simple Words
Adapted from https://vas3k.com/blog/machine_learning/

Motivation

People tend to make mistakes when facing huge volumes of information; we are not designed for that. So we need machines to do the math for us. Let's go the computational way here: provide the machine some data and ask it to find all the hidden patterns related to the problem we want to solve.

The machine copes with this task much better than a real person does when carefully analyzing all the dependencies in their mind.

Definition

Machine Learning (ML) is a field within Computer Science whose goal is to understand the structure of data and fit that data into models that can be understood and utilized by people. It differs from traditional computational approaches: in traditional computing, algorithms are sets of explicitly programmed instructions used by computers to calculate or solve problems.

Machine learning algorithms instead allow computers to train on data inputs and use statistical analysis to output values that fall within a specific range. Because of this, machine learning lets computers build models from sample data in order to automate decision-making processes based on data inputs.

In simple words, the only goal of Machine Learning is to predict results based on incoming data. That's it. All ML tasks can be represented this way; if a task can't be, it wasn't an ML problem to begin with.

Components

Therefore, we need three components to teach the machine how to solve a problem:

  • Data: these are the samples in which you look for relevant patterns in order to predict the result. Want to forecast stocks? Find the price history. Want to detect spam? Get samples of spam messages. The more diverse the data, the better the result.
  • Features: also known as parameters or variables. These could be car mileage, a user's gender, a stock price, or word frequency in a text. In other words, these are the factors for the machine to look at. When data is stored in tables, it's simple: features are the column names. But what are they if you have 100 GB of cat pics? We cannot consider each pixel a feature. That's why selecting the right features usually takes way longer than all the other ML parts.
  • Algorithms: any problem can be solved in different ways. The method you choose affects the precision, performance, and size of the final model. There is one important nuance though: if the data is crappy, even the best algorithm won't help. This is often summed up as "garbage in, garbage out".

More Definitions

It's easy to get confused by so many terms like Artificial Intelligence, Machine Learning, Neural Networks, etc. Let's define them briefly to see how they relate to each other.

  • Artificial intelligence is the name of a whole knowledge field, similar to biology or chemistry.
  • Machine Learning is a part of artificial intelligence. An important part, but not the only one.
  • Neural Networks are one of machine learning types. A popular one, but there are other good guys in the class.
  • Deep Learning is a modern method of building, training, and using neural networks. Basically, it's a new architecture.

Types

There is never a sole way to solve a problem in the machine learning world. There are always several algorithms that fit a problem, and you have to choose which one fits best. Nowadays there are four main directions in machine learning to choose from:


1. Classical Machine Learning

The first methods came from pure statistics in the '50s. They solved formal math tasks — searching for patterns in numbers, evaluating the proximity of data points, and calculating vectors' directions.

Nowadays, half of the Internet runs on these algorithms. When you see a list of articles to "read next" or your bank blocks your card at a random gas station in the middle of nowhere, most likely it's the work of one of those little guys.

Classical approaches are so natural that you could easily explain them to anyone. They are like basic arithmetic: we use them every day without even thinking.


1.1 Supervised Learning

In this case, the machine has a "supervisor" or a "teacher" who gives the machine all the answers, like whether it's a cat in the picture or a dog. The teacher has already divided (labeled) the data into cats and dogs, and the machine is using these examples to learn. One by one. Dog by cat.

Clearly, the machine will learn faster with a teacher, so supervised learning is more commonly used in real-life tasks. There are two types of such tasks: classification, the prediction of an object's category, and regression, the prediction of a specific point on a numeric axis.

1.1.1 Classification

In classification, you always need a teacher. The data should be labeled with features so the machine can assign classes based on them. Everything can be classified: users based on interests (as algorithmic feeds do), articles based on language and topic (important for search engines), music based on genre (Spotify playlists), and even your emails.

"Splits objects based at one of the attributes known beforehand. Separate socks by based on color, documents based on language, music by genre"

Today used for:

  • Spam filtering
  • Language detection
  • A search of similar documents
  • Sentiment analysis
  • Recognition of handwritten characters and numbers
  • Fraud detection

Popular algorithms: Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbours, Support Vector Machine
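
To make this concrete, here is a minimal spam-filter sketch in Python using scikit-learn (assumed to be installed); the four "emails" and their labels are invented purely for illustration:

```python
# A minimal classification sketch: Naive Bayes on toy, invented emails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",      # spam
    "meeting moved to friday",   # not spam
    "free money claim now",      # spam
    "lunch tomorrow?",           # not spam
]
labels = ["spam", "ham", "spam", "ham"]

# Turn each email into word-frequency features (the "columns" of our table).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Naive Bayes is one of the classic classifiers listed above.
model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["claim your free prize"])))
# -> ['spam'] (on this toy data)
```

Any other classifier from the list above plugs into the same fit/predict pattern.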

1.1.2 Regression

Regression is basically classification where we forecast a number instead of a category. Examples: car price by its mileage, traffic by time of day, demand volume by growth of the company, etc. Regression is perfect when something depends on time.

"Draw a line through these dots. Yep, that's the machine learning"

Today this is used for:

  • Stock price forecasts
  • Demand and sales volume analysis
  • Medical diagnosis
  • Any number-time correlations

Popular algorithms are Linear and Polynomial regressions.
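
As a minimal sketch, with invented mileage and price numbers and scikit-learn assumed installed, "drawing the line through the dots" takes a few lines:

```python
# A minimal regression sketch: car price as a function of mileage.
import numpy as np
from sklearn.linear_model import LinearRegression

mileage = np.array([[10_000], [50_000], [100_000], [150_000]])
price   = np.array([30_000, 24_000, 17_000, 11_000])

model = LinearRegression()
model.fit(mileage, price)  # "draw a line through these dots"

print(model.predict([[80_000]]))  # estimated price for a car with 80k miles
```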

1.2 Unsupervised learning

Unsupervised learning was invented a bit later, in the '90s. It is used less often, but sometimes we simply have no choice.

Labeled data is a luxury. But what if I want to create, let's say, a bus classifier? Should I manually take photos of a million fucking buses on the streets and label each of them? No way, that would take a lifetime.

Instead of doing the manual work, you can try unsupervised learning. It's usually useful for exploratory data analysis but not as the main algorithm. Several approaches are described below.

1.2.1 Clustering

Clustering is classification with no predefined classes. It's like dividing socks by color when you don't remember all the colors you have. A clustering algorithm tries to find similar objects (by some features) and merge them into a cluster. Objects with lots of similar features are joined into one class.

"Divides objects based on unknown features. Machine chooses the best way"

Nowadays used:

  • For market segmentation (types of customers, loyalty)
  • To merge close points on a map
  • For image compression
  • To analyze and label new data
  • To detect abnormal behavior

Popular algorithms: K-Means, Mean-Shift, DBSCAN
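
A minimal K-Means sketch with invented 2-D points and scikit-learn assumed installed; note that we only tell the machine how many clusters to look for, never what they mean:

```python
# A minimal clustering sketch: K-Means on two invented blobs of points.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one blob
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],   # another blob
])

# We specify only the NUMBER of clusters; the machine finds the grouping.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] -- two groups, no labels needed
```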

1.2.2 Generalization

Previously these methods were used by hardcore data scientists, who had to find "something interesting" in huge piles of numbers. When Excel charts didn't help, they forced machines to do the pattern-finding. That's how they got Dimension Reduction or Feature Learning methods.

"Assembles specific features into more high-level ones"

Nowadays used for:

  • Recommender systems (★)
  • Beautiful visualizations
  • Topic modeling and similar document search
  • Fake image analysis
  • Risk management

Popular algorithms: Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Latent Dirichlet allocation (LDA), Latent Semantic Analysis (LSA, pLSA, GLSA), t-SNE (for visualization)

Recommender Systems and Collaborative Filtering are a super-popular use of dimensionality reduction. It turns out that if you use it to abstract user ratings, you get a great system to recommend movies, music, games, or whatever you want.

Machines get these high-level concepts even without understanding them, based only on knowledge of user ratings. Nicely done, Mr. Computer.
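
Here is a toy sketch of that ratings-matrix idea in plain NumPy; the ratings are invented, and a real recommender would treat the unrated zeros far more carefully:

```python
# A toy sketch of the ratings matrix behind collaborative filtering.
# Ratings are invented; 0 stands for "not rated yet".
import numpy as np

# rows = users, columns = movies
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

# SVD factors the matrix; keeping only k components compresses users and
# movies into k hidden "taste" dimensions.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The low-rank reconstruction fills the gaps with plausible scores.
print(approx.round(1))
```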

1.2.3 Association Rule Learning

This includes all the methods to analyze shopping carts, automate marketing strategy, and handle other event-related tasks: whenever you have a sequence of something and want to find patterns in it.

"Look for patterns in the orders' stream"

Nowadays used:

  • To forecast sales and discounts
  • To analyze goods bought together
  • To place the products on the shelves
  • To analyze web surfing patterns

Popular algorithms: Apriori, Euclat, FP-growth
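
As a back-of-the-envelope illustration of the idea (not Apriori itself, just raw pair counting), with invented shopping carts:

```python
# Count which items land in the same shopping cart most often.
# Real systems use Apriori or FP-Growth; these carts are made up.
from itertools import combinations
from collections import Counter

carts = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "diapers", "beer"},
    {"bread", "milk"},
]

pair_counts = Counter()
for cart in carts:
    # Every unordered pair of items in the same cart counts as one co-purchase.
    for pair in combinations(sorted(cart), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
# e.g. [(('bread', 'butter'), 2), (('bread', 'milk'), 2), ...]
```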

2. Reinforcement Learning

Reinforcement learning is used in cases where your problem is not related to data at all, but you have an environment to live in. Like a video game world or a city for a self-driving car.

Knowledge of all the road rules in the world will not teach the autopilot how to drive. No matter how much data we collect, we still can't foresee every possible situation. This is why the goal of reinforcement learning is to minimize error, not to predict all the moves.

There may be two different approaches — Model-Based and Model-Free.

Model-Based means that the car needs to memorize a map or parts of it. That's a pretty outdated approach, since it's impossible for the poor self-driving car to memorize the whole planet.

In Model-Free learning, the car doesn't memorize every movement but tries to generalize situations and act rationally while obtaining a maximum reward.
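
As a minimal model-free sketch, here is tabular Q-learning on a made-up one-dimensional world: five states in a row, with a reward for reaching the rightmost one. The environment, rewards, and hyperparameters are all invented for illustration:

```python
# Tabular Q-learning on a toy 1-D world: states 0..4, reward at state 4.
import random

n_states, actions = 5, [-1, +1]              # step left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma = 0.5, 0.9                      # learning rate, discount factor

for episode in range(500):
    s = 0
    while s != 4:
        a = random.choice(actions)           # explore at random while learning
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == 4 else 0.0
        # Nudge the estimate toward reward plus discounted best future value.
        best_next = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy: the best action in every non-terminal state.
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(4)])  # -> [1, 1, 1, 1]
```

The agent never memorizes a map; it only learns which action generalizes best in each situation, which is exactly the model-free idea.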


3. Ensemble Methods

If you take a bunch of inefficient algorithms and force them to correct each other's mistakes, the overall quality of a system will be higher than even the best individual algorithms.

You'll get even better results if you take the most unstable algorithms, ones that predict completely different results on small noise in the input data, like regression and decision trees. These algorithms are so sensitive that even a single outlier in the input data can make the models go mad. There are three battle-tested methods to create ensembles.

3.1 Stacking

The output of several different models is passed as input to the last one, which makes the final decision. Like that girl who asks her friends whether to meet with you in order to make the final decision herself.

The emphasis here is on the word "different". Mixing the same algorithms on the same data would make no sense. The choice of algorithms is completely up to you. However, for the final decision-making model, regression is usually a good choice.
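
Here is a minimal sketch with scikit-learn (assumed available): two different base models hand their predictions to a logistic regression, which plays the "girl making the final decision"; the dataset is synthetic:

```python
# A minimal stacking sketch: different base models, one final judge.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),  # regression-style final decision
)
stack.fit(X, y)
print(stack.score(X, y))
```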

3.2 Bagging

Use the same algorithm but train it on different subsets of the original data. In the end, just average the answers. Data in random subsets may repeat. For example, from a set like "1-2-3" we can get subsets like "2-2-3", "1-2-2", "3-1-2" and so on. We use these new datasets to teach the same algorithm several times and then predict the final answer via simple majority voting.

The most famous example of bagging is the Random Forest algorithm, which is simply bagging on decision trees. When you open your phone's camera app and see it drawing boxes around people's faces, it's probably the result of a Random Forest at work.
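
A minimal Random Forest sketch on synthetic data, with scikit-learn assumed installed; each of the 100 trees sees its own bootstrap sample of the data, and the forest decides by majority vote:

```python
# A minimal bagging sketch: Random Forest = bagging over decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# 100 trees, each trained on a random bootstrap subset of the data,
# with the final answer decided by majority voting.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]), y[:5])
```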

3.3 Boosting

Same as in bagging, we use subsets of our data but this time they are not randomly generated. Now, in each sub-sample we take a part of the data the previous algorithm failed to process. Thus, we make a new algorithm learn to fix the errors of the previous one.

If you want a real example of boosting — open Facebook or Google and start typing in a search query. Can you hear an army of trees roaring and smashing together to sort results by relevancy? That's because they are using boosting.
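
And the matching boosting sketch, again on synthetic data with scikit-learn assumed; each new tree is fit to the mistakes left behind by the trees before it (real search-ranking ensembles are far bigger, but the idea fits in a few lines):

```python
# A minimal boosting sketch: each tree corrects its predecessors' errors.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)

boost = GradientBoostingClassifier(n_estimators=100, random_state=0)
boost.fit(X, y)
print(boost.score(X, y))
```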

4. Neural Networks and Deep Learning

A neural network is basically a collection of neurons and the connections between them. A neuron is a function with a bunch of inputs and one output. Its task is to take all the numbers from its inputs, apply a function to them, and send the result to the output.

Here is an example of a simple but useful real-life neuron: sum up all the numbers from the inputs, and if that sum is bigger than N, give 1 as a result; otherwise, zero.

Connections are like channels between neurons. They connect the outputs of one neuron with the inputs of another so they can send numbers to each other. Each connection has only one parameter: its weight. It's like a connection strength for a signal. When the number 10 passes through a connection with a weight of 0.5, it turns into 5.

These weights tell the neuron to respond more to one input and less to another. Weights are adjusted when training — that's how the network learns.
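
Those two paragraphs fit in a few lines of Python; the threshold N and the weight values below are arbitrary numbers picked purely for illustration:

```python
# The threshold neuron and weighted connections described above.
# N and the weights are arbitrary values chosen for illustration.
def neuron(inputs, weights, N=1.0):
    # Each input arrives through a connection that scales it by its weight.
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > N else 0

print(10 * 0.5)                      # the weight-0.5 connection: 10 -> 5.0
print(neuron([10, 2], [0.5, 0.1]))   # 5.0 + 0.2 = 5.2 > 1.0 -> fires 1
print(neuron([0.5, 2], [0.5, 0.1]))  # 0.25 + 0.2 = 0.45 <= 1.0 -> stays 0
```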


To prevent the network from falling into anarchy, the neurons are linked by layers, not randomly. Within a layer neurons are not connected, but they are connected to neurons of the next and previous layers. Data in the network goes strictly in one direction — from the inputs of the first layer to the outputs of the last.

If you throw in a sufficient number of layers and set the weights correctly, you get the following: by feeding in, say, an image of the handwritten digit 4, the black pixels activate the associated neurons, those activate the next layers, and so on, until it finally lights up the output in charge of the four. The result is achieved.


After we've constructed a network, our task is to assign proper weights so neurons will react correctly to incoming signals. Now is the time to remember that we have data: samples of 'inputs' and proper 'outputs'. We will show our network a drawing of the same digit 4 and tell it 'adapt your weights so whenever you see this input your output emits 4'.

To start with, all weights are assigned randomly. After we show it a digit, it emits a random answer because the weights are not correct yet, and we compare how much this result differs from the right one. Then we traverse the network backward from outputs to inputs and tell every neuron 'hey, you did activate here but you did a terrible job and everything went wrong from here downwards; let's pay less attention to this connection and more to that one, please'.

After hundreds of thousands of such 'infer-check-punish' cycles, there is hope that the weights will be corrected and act as intended. The scientific name for this approach is Backpropagation, or the 'method of backpropagating an error'.
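
Real backpropagation pushes errors back through many layers via the chain rule, but the 'infer-check-punish' loop can be shown with a single weight and gradient descent; everything below is a made-up toy where the network should learn to double its input:

```python
# Toy "infer-check-punish": one weight, squared-error gradient descent.
# The training pairs are invented; the target rule is output = 2 * input.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.1        # the weight starts out (essentially) random
lr = 0.01      # learning rate: how hard each mistake "punishes" the weight

for _ in range(1000):
    for x, target in data:
        y = w * x                 # infer: run the input through the network
        error = y - target        # check: compare with the right answer
        w -= lr * error * x       # punish: nudge the weight down the gradient

print(round(w, 3))                # -> 2.0 (or very close)
```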

The difference between deep learning and classical neural networks lies in new training methods that can handle bigger networks. Nowadays only theorists would try to draw the line between what counts as deep learning and what doesn't. Practitioners use popular 'deep' libraries like Keras, TensorFlow, and PyTorch even when they build a mini-network with five layers.

There are several neural network architectures, each with many applications. The most popular nowadays are:

  • Convolutional Neural Networks (CNN): used to search for objects in photos and videos, for face recognition, style transfer, generating and enhancing images, creating effects like slow-mo, and improving image quality.
  • Recurrent Neural Networks (RNN): these networks gave us useful things like neural machine translation, speech recognition, and voice synthesis in smart assistants. RNNs are best suited for sequential data like voice, text, or music.

Will Machines Replace Us?

The main problem here is that the question "when will the machines become smarter than us and enslave everyone?" is wrong from the start. There are too many hidden assumptions in it.

We say "become smarter than us" as if there were a single unified scale of intelligence, with a human at the top, dogs a bit lower, and stupid pigeons hanging around at the very bottom.

That's wrong.

If this were the case, every human would have to beat animals at everything, but that's not true. The average squirrel can remember a thousand hidden places with nuts; I can't even remember where my keys are.

So is intelligence a set of different skills rather than a single measurable value? Or is remembering the locations of stashed nuts not included in intelligence?

An even more interesting question for me: why do we believe that the human brain's possibilities are limited? There are many popular graphs on the Internet where technological progress is drawn as an exponential curve while human possibilities stay constant. But is that so?

Ok, multiply 1680 by 950 right now in your mind. I know you won't even try, lazy bastards. But give you a calculator and you'll do it in two seconds. Does this mean that the calculator just expanded the capabilities of your brain?

If yes, can I continue to expand them with other machines? Like using notes in my phone so I don't have to remember a shitload of data? Oh, seems like I'm doing that right now. I'm expanding the capabilities of my brain with machines.

References:

vas3k, Machine Learning for Everyone: https://vas3k.com/blog/machine_learning/