Applying Machine Learning in your Company: 10 criteria for the best AI solutions
By reading this third part of the blog series, you will get an understanding of what Machine Learning models are.
Reading this article will enable you to communicate competently with your data scientist on an equal footing.
Here is what you will learn:
· Which prediction methods exist and what benefits they offer.
· Understanding the criteria to select the right method.
· Understanding common terms in the context of machine learning.
This is part 3 of 5. Part 3 itself is split into two blog posts: Part 3-A and Part 3-B.
- Part 1: Machine Learning Projects: 5 Steps to Success!
- Part 2: Data are the new oil: 10 essential Machine Learning tricks to maximize exploitation
- Part 3 - First Part: 10 criteria for the best Machine Learning models
- Part 3 - Second Part: 10 criteria for the best Machine Learning models (this part)
- Part 4 & 5: Coming soon! Follow me on LinkedIn so you won’t miss it.
Read both parts to the end to fully understand all aspects.
PART 3B: Applying Machine Learning in your Company: Ten criteria for the best solutions
Please note: This is part B of the full article. I recommend you read part A first and then continue with this article.
Regression
Another type of prediction problem is called regression.
Regression works just like classification; however, the target variable (the quantity to be predicted) is continuous, that is, it can take infinitely many values.
Typical examples of such a variable are consumption values, time spans, or other measurements.
Regression algorithms are often used to make quantitative predictions while classification is used to make decisions.
Loosely speaking, regression can be considered a continuous form of classification, that is, classification with infinitely many classes.
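To make this concrete, here is a minimal regression sketch, assuming the scikit-learn library and made-up data (the temperature and consumption values are purely illustrative):

```python
# Minimal regression sketch: predict a continuous quantity (made-up data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up example: predict power consumption (kWh) from outside temperature (°C).
temperature = np.array([[-5], [0], [5], [10], [15], [20]])
consumption = np.array([42.0, 37.5, 31.0, 26.5, 22.0, 18.5])

model = LinearRegression().fit(temperature, consumption)

# The prediction is a continuous number, not a class label.
print(model.predict(np.array([[12]])))
```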
Clustering – The invisible risk
If you have a situation where the sample data does not contain any realizations of the target variable (an unsupervised problem), you can apply so-called clustering methods to analyze your data.
Since the target variable, or outcome, has never been observed and is therefore not available, it is impossible to measure the quality of the prediction.
In particular, this means that no examples of data together with their outcome (target variable) are available, so it will not be possible to compare any algorithmic predictions with historically observed (and proven) outcomes.
As a consequence, you will not be able to measure the quality of your results.
Instead, you blindly rely on the result of your clustering algorithm.
Unfortunately, the algorithms are mathematically constructed in a way such that they always converge, i.e., they will always find a “solution.”
However, beware! The result you obtain is disconnected from your target variable: without any target values, the algorithm cannot take them into account when learning from the data.
Your business goal is usually directly or indirectly associated with the target variable.
So having no values for your target variable, and thus no representation of your business goal, implies that the result of your algorithm is disconnected from the business goal itself.
Your result depends only on the mathematical “architecture” of the algorithm and can therefore be arbitrary with respect to what you actually want – in particular, arbitrarily bad.
Moreover, since there are usually far more ways to get something wrong than right, the probability of obtaining the correct result is minimal.
However, it gets even worse: you lack examples of the target variable to assess the quality of your prediction.
Not only will you not know what you get, you won't even be able to detect how bad the result is.
Clustering means relying on a mathematical crystal ball, and its results should be treated with the utmost caution.
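The following sketch illustrates the point, assuming scikit-learn and purely random data: k-means will happily return a clustering even when the data contains no meaningful structure at all.

```python
# K-means always "finds" clusters - even in structureless random data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))          # uniform noise: no real clusters exist

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])              # every point still gets assigned to a cluster
print(kmeans.inertia_)                  # and the algorithm reports a perfectly finite "solution"
```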
Clustering: Test yourself!
Look at an unlabeled scatter plot: are you able to detect any clusters of associated data points? No – and neither will a machine learning algorithm.
Only labeled data allows you to detect the correct structure within the given data. For an algorithm, this is no different than for a human.
With labeled data, you can see regions where, by tendency, either green or blue data points dominate. This is something a machine learning model can learn.
Zero error is bad – Small errors are great
Besides the target variable, there are further essential criteria which will determine which model you should apply.
The most important one will probably be the quality of your model prediction.
That means: how good or how “right” are the predictions for unseen and new data?
In the supervised problem, you can simulate this by splitting the example data into two batches.
The first one will be called training dataset while the second one will be called test or evaluation dataset.
As the name suggests, the machine learning algorithm is trained (fitted) on the training dataset.
Subsequently, the model obtained in this way is used to make predictions for the evaluation data.
To do so, the algorithm is given an evaluation data point and makes a prediction. The prediction result is then compared with the actually observed result.
Aggregating all prediction errors gives you the average prediction error.
Intuitively, you probably think a prediction error of zero is best. And you are right.
However, in reality, you will never encounter a situation where all predictions can be correctly made.
Because all data is to some degree a little messy, you will always have to accept a certain error rate.
While an error rate as small as possible is desirable, an error of exactly zero tells you that something is wrong – precisely because real data is messy. Thus, an error of zero is usually a bad sign, and you should revise your implementation.
Generally, it can be said that good algorithms achieve small errors while bad ones make bigger prediction errors.
Please note: For the algorithm, the evaluation data is new unseen data since it did not see the evaluation data during the learning phase. Thus, the model does not “know” this evaluation data.
This makes this evaluation method objective and comparable.
Since your model will (in the future) make predictions for truly unseen data, you should use the model with the lowest prediction error on the evaluation data.
The split ratio of train and evaluation data is in principle arbitrary. However, as a rule of thumb, you can consider 80% training data and 20% evaluation data.
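As a rough sketch of this procedure – the 80/20 split and the measurement of the average prediction error – it could look like this with scikit-learn (the features and target are made up):

```python
# Sketch of the 80/20 train/evaluation split described above (made-up data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                                                    # made-up features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)  # made-up target

# 80% training data, 20% evaluation data
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)                        # the model only "sees" the training data

predictions = model.predict(X_eval)                # predict the unseen evaluation data
print(mean_absolute_error(y_eval, predictions))    # average prediction error
```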
Black Box Algorithms
Sometimes the best performing model is not necessarily the best candidate.
Frequently you would like to understand what is going on within your model.
Why does an algorithm make this or that decision?
To put it differently: Do you understand your chosen Machine Learning Model?
A deeper understanding of the decision processes makes your model understandable for humans.
Typical examples of non-transparent models are neural networks – in particular deep learning models – or Hidden Markov models (models that assume non-observable states and causes).
The benefit of a deeper model understanding is potentially new insights into your existing business processes that were previously unknown.
By applying this reversed method, your employees might discover new decision criteria. You could say the knowledge is returned from the apprentice back to its master.
Therefore, comprehensibility will give you an additional value that goes beyond the automation of a process.
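As a small illustration of a transparent model, a decision tree's learned rules and feature importances can be printed and discussed with domain experts – something a deep neural network does not offer out of the box. A minimal sketch, assuming scikit-learn and its bundled iris toy dataset:

```python
# A decision tree as an example of a transparent, human-readable model (toy data).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

print(export_text(tree, feature_names=list(data.feature_names)))  # the decision rules themselves
print(dict(zip(data.feature_names, tree.feature_importances_)))   # which attributes drive decisions
```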
Fast models – Fast decisions
Sometimes your setting requires quick decisions.
Classic examples are real-time applications, i.e., results have to be calculated within, e.g., 30 seconds or even within milliseconds (depending on what kind of latency you call real-time).
For instance, programmatic advertising (real-time bidding) requires decisions within about 100 milliseconds.
Therefore, in real-time bidding, the application of less complex and therefore fast models is preferred.
On the other hand, other settings do not require those kinds of short latencies, for instance, processes that can calculate results overnight.
In such a case more complex and computationally more intense algorithms can be used that will deliver better prediction results.
Generally, the duration of your process also depends on the amount of data available.
Small datasets can be used with complex models, since the amount of data also limits the required runtime.
Therefore, when choosing a model, keep the model's latency in mind.
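A quick way to check whether a candidate model meets such a latency requirement is to simply time single predictions. A rough sketch, assuming scikit-learn and made-up data (the numbers depend entirely on your hardware and model):

```python
# Rough latency check for a single prediction (made-up data).
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = rng.integers(0, 2, size=10_000)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

start = time.perf_counter()
model.predict(X[:1])                                   # one real-time-style prediction
print(f"{(time.perf_counter() - start) * 1000:.1f} ms per prediction")
```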
Your model doesn’t understand your data?
Your model selection process will also depend on the given data.
Some Machine Learning models only accept certain types of data, like discrete, continuous, binary, or categorical data.
Simply put, continuous data are decimal numbers like 1.43, 20.58, etc.
Discrete numbers are integer numbers like 1, 2, 3, 4, 5, while binary data is a special kind since only two values might be taken, for instance, 0 or 1.
Categorical data are non-numeric values, for instance green, large, or liquid.
Categorical data usually does not have any order like for instance “green is better than liquid.”
Many models require numbers as input. However, a nice little trick turns categorical data into processable values.
For instance, the canonical Support Vector Machine only accepts continuous data while a decision tree can process out-of-the-box (non-transformed) categorical data.
Another kind of data is time-dependent. There is a variety of algorithms specially designed for this kind of time series data.
Always choose the Machine Learning algorithm that is best suited for the given data.
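The "nice little trick" mentioned above is typically one-hot encoding, which turns each category into its own 0/1 column. A minimal sketch, assuming pandas and hypothetical column names:

```python
# One-hot encoding: turning categorical values into numeric columns (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "color": ["green", "blue", "green"],
    "state": ["liquid", "solid", "solid"],
    "price": [1.43, 20.58, 7.10],
})

encoded = pd.get_dummies(df, columns=["color", "state"])  # one 0/1 column per category value
print(encoded)
```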
Missing Data
One of the more significant difficulties when inferring an AI model is missing or incomplete data.
The main problem is that one cannot generate any knowledge from non-existent data.
No conclusions or decisions can be drawn.
However, many Machine Learning models require complete and gapless data.
In some cases, this problem can be solved by filling up those gaps with artificial data.
For instance, if in a series of prices one price is missing you might use the average of all other prices and assign this value to the unknown price.
By applying this substitution method, you can fill the gaps with the help of the other data.
This approach then will enable you to use specific models.
However, caution! In this case, you are “inventing” data that does not necessarily bear any relationship to the actual value.
In the worst-case scenario, this might have a harmful effect, since the unobserved reality might have nothing in common with the invented reality.
Data substitution can therefore be very dangerous and should only be considered if only a small amount of data is missing.
If too much data is missing for a particular attribute, you should probably ignore that attribute completely.
Some Machine Learning algorithms have no problems with missing data, though. Therefore, it might make sense to consider such models (for instance, decision trees).
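The substitution described above – filling a gap with the average of the observed values – can be sketched as follows, assuming pandas and made-up prices:

```python
# Mean imputation: filling a missing price with the average of the known prices (made-up data).
import numpy as np
import pandas as pd

prices = pd.Series([12.5, 13.0, np.nan, 12.8, 13.2])

filled = prices.fillna(prices.mean())   # the gap receives the average of the observed values
print(filled)
# Caution: the "invented" value may have nothing to do with the true, unobserved price.
```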
Many models offer many choices
There exist a lot of Machine Learning models, and each one has its pros and cons.
So, testing many models makes sense.
However, the more models you test, the more time the selection process will take.
The question is: How many should you use?
As a rule of thumb, you should consider at least several models since the additional effort will be acceptable.
On the one hand, the probability of finding the best model increases; on the other hand, you will have several results that you can compare.
Sometimes it turns out that another less appealing model beats the model you expected to perform the best.
If the data is incompatible with the algorithm, or if it is already clear in advance that the data structure does not fit the AI algorithm, you should discard that model.
In case you want to apply a clustering method, make sure that very experienced experts support you.
Make sure the criteria for selecting a model match those of your business values.
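A simple way to follow this advice is to evaluate a handful of candidate models side by side on the same data, for example with cross-validation. A minimal sketch, assuming scikit-learn and a made-up dataset:

```python
# Comparing several candidate models on the same data (made-up data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # made-up binary target

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold accuracy per model
    print(f"{name}: {scores.mean():.3f}")
```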
Checklist
The following compact checklist summarizes the essential aspects of this article:
- Does the chosen prediction model serve your business goal?
- Which model is best suited for your target variable?
- Do you consider the model that best fits your data?
- Did you consider and discuss the risks of applying a clustering method?
- Do you understand what your chosen model does?
- Does your model meet the requirements concerning the response times?
- Did you remove attributes with too few data points?
- Do you consider more than one model?
- Do you compare the performance of different models?
- If possible, try to avoid clustering.
Conclusion
The phase of finding the right model is a rather technical aspect of a Machine Learning project.
In contrast to the data preparation process (described in the second part of this blog series), it is a rather systematic process.
The person doing the job does not need outstanding skills but does need basic knowledge of machine learning frameworks and how to use their algorithm libraries.
The next part of the blog series will be about how to apply a correct and systematic evaluation of a machine learning project.