Data are the new oil: 10 essential Machine Learning tricks to maximize exploitation

The following blog series will give you all the necessary knowledge to plan and execute a Machine Learning based Predictive Analytics project. By reading this article you will learn the following:

  • Why you must understand your data – the so-called ‘new oil’ – to implement a successful Predictive Data Analytics project, starting with the difference between a supervised and an un-supervised Machine Learning problem.
  • Why the choice of the correct target variable is critical to your business.
  • What the labeling process is and why it is important.

This is the second part of a five-part blog series on how to implement a Predictive Data Analytics project.

(Read to the end of the article for a list of tips)

Feel free to follow me on LinkedIn or Twitter for new articles about Machine Learning and how to apply it to your data.

Data preparation and data understanding

Since the industrial revolution, our reliance on fossil fuels has ensured that oil remains a high-value resource, bringing wealth to those who discover and control it.

As we move deeper into the 21st century, however, it is becoming clear that data – the wealth of facts and figures available through digital means – are reaching a similar premium, earning them the moniker of the “new oil”. Only through effective analytics and the predictive insights they deliver can this new oil be refined and its value tapped.

Data are the basis for each Predictive Data Analytics project. Only good and meaningful data will ultimately enable you to achieve great results.

It is important that you gain a fundamental understanding of your data’s properties, applicability and informational value, specifically the information that could be used to improve your company. 

Consequently, data are the most important resource in any Big Data Analytics or Machine Learning project. If your data do not contain any valuable information, they will most likely be useless, since they cannot be used to generate any knowledge. Proper data, therefore, are the foundation of your project, and you must develop a good intuition for identifying this “raw material”.

Supervised or Un-Supervised? That is the (data) question

At the beginning of your project, you should find out what kind of data you have. 

Roughly speaking, there are two types of data available. The first comprises data for which the result is already known. That means the recorded data contain not only properties but also the resulting effect of those properties. For instance: an online shop records data points such as “customer looks at items X, Y and Z and then buys Y”. This data point can be split into the properties “customer looks at items X, Y and Z” and the resulting outcome “customer buys” or “customer does not buy”, which is usually what you want to predict in your future applications. This kind of data situation is called a “supervised setting” or “supervised problem”.
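The following minimal sketch shows how such recorded shop data split into properties (features) and the already-known result. The column names are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch of a supervised data situation: properties plus a recorded result.
import pandas as pd

sessions = pd.DataFrame({
    "viewed_x": [1, 0, 1],   # properties: which items the customer looked at
    "viewed_y": [1, 1, 0],
    "viewed_z": [0, 1, 1],
    "bought":   [1, 0, 0],   # recorded result: 1 = "customer buys", 0 = "does not buy"
})

X = sessions[["viewed_x", "viewed_y", "viewed_z"]]   # the properties
y = sessions["bought"]                               # the known outcome you later want to predict
```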

The second kind of problem is called the “un-supervised problem”. It is like the supervised setting, except that the result cannot be observed. This means that you cannot record the very thing you would like to predict for future data. In the online shop example, this could mean that you want to determine a customer’s gender from their shopping behavior. This cannot be observed, at least not directly, and thus the result cannot be recorded.

If you already know the difference between supervised and un-supervised problems, feel free to skip the following side note; otherwise, reading it will help you better understand the distinction.

Side Note:

Supervised Problem (supervised learning)

The supervised problem consists of observations and their corresponding effects or results.

For instance: for a predictive maintenance project, a company wishes to know how likely a machine failure is. Therefore, the company will collect data such as vibration, operating noise, energy consumption etc. In addition to these data, it will also record the result “machine has a malfunction” or “machine is ok”. That means the historical data contain not only (machine) measurements but also the resulting effect (on the machine).

Using these historical effects, algorithms can connect the observed measurements with the true results (the recorded effects). During the learning or inference phase, they can check every prediction against the correct result, effectively supervising their own learning of correct and incorrect outcomes. The knowledge of the result gives the algorithm the correct feedback, just like a human student who learns from examples by solving a problem and then looking up the correct answer. This way, the algorithm can adjust itself to the real situation, which is why this problem is called supervised.
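A hedged sketch of this supervised setting for the predictive maintenance example: the sensor values, column names and choice of classifier are assumptions made purely for illustration.

```python
# Supervised learning sketch: measurements plus recorded effects train a classifier.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

history = pd.DataFrame({
    "vibration":   [0.2, 0.9, 0.3, 1.1, 0.25, 1.0],
    "noise_db":    [55, 78, 57, 81, 56, 79],
    "energy_kwh":  [3.1, 4.8, 3.0, 5.2, 3.2, 5.0],
    "malfunction": [0, 1, 0, 1, 0, 1],   # recorded effect = value of the target variable
})

X = history.drop(columns="malfunction")   # measurements (attributes)
y = history["malfunction"]                # true results provide the "supervision"

model = RandomForestClassifier(random_state=0).fit(X, y)

new_reading = pd.DataFrame({"vibration": [0.8], "noise_db": [75], "energy_kwh": [4.6]})
print(model.predict(new_reading))         # predicted effect for an unseen measurement
```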

The observed result or effect is often called the value of the target variable. Or, to put it differently, the target variable is the quantity you would like to predict.

Un-Supervised Problem (unsupervised learning)

The un-supervised problem is also often called clustering.

It is just like the supervised Machine Learning problem, except that the values of the target variable cannot be observed. Consider again the predictive maintenance example above. Using the collected data you will be able to predict whether a machine is likely to fail, but not which part of the machine will cause the failure. At best, you can only guess at the exact cause.

Of course, you can investigate an incident and search for the actual cause, which will give you the correct feedback. This way you can turn an unsupervised problem into a supervised one. In a predictive maintenance setting this should ideally be possible.

Unfortunately, in other situations this will not be possible. A typical example is the categorization of customers into different groups based on their behavior. Usually, you will not know what kinds of categories exist, or even how many categories you should consider; all you can do is look for similar behaviors and group those similarities together. Imagine the behavior in an online shop: you identify a group of people who preferentially look at cosmetic items and another, separate group who usually do not consider these products. Having identified two groups, you could now argue that the first group consists of female shop visitors and the second of male ones. However, this is just a guess and could be wrong, and since you cannot observe the true result, you cannot be sure whether your guess is correct.
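A hedged sketch of this clustering situation, on made-up behavioral data: without a recorded target variable, the algorithm can only group similar behavior, and what the groups mean remains your interpretation.

```python
# Unsupervised (clustering) sketch: group shop visitors by behavior without any labels.
import numpy as np
from sklearn.cluster import KMeans

# rows = shop visitors, columns = [views of cosmetic items, views of other products]
behavior = np.array([[9, 1], [8, 0], [7, 2],
                     [1, 9], [0, 8], [2, 7]])

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(behavior)
print(groups)   # two clusters emerge, but what they mean remains an interpretation
```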

Such situations allow several solutions – or, better said, interpretations – which makes an objective analysis impossible.

Unfortunately, the lack of objective assessment makes clustering a dangerous technique, since it does not allow a proper quality evaluation. Depending on the sensitivity of the problem, misinterpretations and wrong conclusions are easily made.

Therefore, always try to convert an unsupervised problem into a supervised one. Only the supervised problem gives you correct feedback and enables an objective evaluation.

The question of whether your data make for a supervised or an unsupervised problem is important because it largely determines the quality of the predictions you can make.

Supervised Predictive Analytics problems usually allow a good statistical and objective evaluation. The historical data guide the algorithm during the learning or inference phase and help it to better fit the data.

Unsupervised problems, on the other hand, require a very deep understanding of the problem at hand. The disadvantage is the lack of objective assessment of the results.

This lack of knowledge about the effects or results increases the risk of drawing vague or even incorrect conclusions. Objectively justified judgements are not possible. This lack of information makes the application of unsupervised or clustering methods prone to error. Fundamentally, you should always prefer the supervised situation; only then will you be able to assess the quality of your outcomes correctly.

Therefore, the rest of this article will focus on the more common and reliable supervised problem.

Do not forget to aim for the right goal

In a supervised problem, it is important that you define your target variable clearly. For instance, a target variable could express the functionality of a machine (“machine fails” / “machine is okay”) or a customer’s purchase decision (“buys” / “doesn’t buy”). In these simple examples, the target variable is well-defined because it both describes the desired outcome and defines all possible outcome values.

Now consider that you want to increase the impact of a newsletter campaign. You could prepare the data to optimize either for “user opens newsletter” or for “user clicks on link XYZ” within the newsletter. In this case, the target variable is not yet clearly defined because the business goal (open or click) is ambiguous.

Therefore, be clear and specific about your target variable. Here, the precision and uniqueness of the definition play a central role.

Knowing what form you want the predicted values to take will also avoid ambiguity. For instance, do you want probabilities, like an 80% probability of purchasing a product, or is a mere yes/no prediction enough? Or do you just want to know whether a data point belongs to a certain group, such as whether a customer is interested in products from the home accessories or the technology category?
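The following toy sketch contrasts the two output formats mentioned above; the data and model choice are purely illustrative.

```python
# Probability output vs. plain yes/no output from the same fitted classifier.
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [8], [9], [10]]   # e.g. number of products viewed per session
y = [0, 0, 0, 1, 1, 1]                # 1 = "buys", 0 = "doesn't buy"

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[7]])[:, 1])  # purchase probability, e.g. roughly 0.8
print(clf.predict([[7]]))              # plain yes/no (1/0) prediction
```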

When designing a target variable, therefore, you should pay attention to its definition, the precision of that definition and the range of values it can take.

These three criteria relate to the supervised Machine Learning problem. The un-supervised setting, however, does not consider a target variable and, as a consequence, the project result can be arbitrarily bad. This violation of the requirements is fundamentally what makes an un-supervised approach unreliable.

One caveat: in some situations the application of un-supervised methods can be reasonable. If this is the case, make sure that you understand every detail of these exploratory data analysis methods, so that you can be sure you understand why they produce certain outcomes. Often enough, applying an unsupervised method requires such a deep understanding of the problem that the un-supervised method is no longer necessary because, to all intents and purposes, it has become a supervised problem.

Data are the new oil

In the supervised problem, the target variable will correspond to the prediction you would like to make. This prediction is made by considering the input data.

The input data consist of so-called attributes, features or dimensions – all terms describing the same thing. In the predictive maintenance example, one attribute could be the vibration and another the energy consumption. In an online shop, it could be the retention time (in seconds) and the number of products viewed (1, 2, 3, … integers). These attributes are what is often called the new oil, because they contain the information from which you can generate value.

Algorithms analyze and evaluate the structure of the attributes and generate predictions based on it. The attributes have to satisfy the same requirements as the target variable: definition, precision and range of values. These three properties have a significant influence on the choice of algorithms, and not every algorithm can handle every type of input data (more on that in the next part of this blog series).

After checking the target variable and the input data, it is time to ask yourself again whether the target variable corresponds to your business goal and whether it will generate the desired business value. Answering this question will help you justify the effort your Predictive Analytics project requires.

Data preparation or the art of Data Science

It is rare that you will encounter a situation in which you can feed your data into a Machine Learning model right away.

Instead, thorough analysis, preprocessing and transformation of the data are required to extract a high degree of informative value from them. This phase of a Predictive Analytics project is usually referred to as Data Processing or the Data Preparation phase. Commonly, this is where the art comes into play. To put it simply: the better the Data Scientist can prepare or modify the data, the better the results will be.

While there are some standard data preparation techniques, this phase is usually application-specific. Therefore, a detailed explanation would exceed the scope of this article. 

The most important part to remember is that the data preparation phase consumes a substantial portion of your resources in any Machine Learning Project. This step will require the greatest part of the time budget as well as an experienced Data Scientist. The effort expended in this phase will have a dramatic influence on the quality of the final prediction outcomes.

Following data preparation, the data should be stored in a table or matrix, which will then be the input to the Machine Learning algorithm. Usually this matrix consists of a feature (=attribute, =dimension) in each column and an individual data point in each row. If this kind of data consolidation is not possible, you should consider some other kind of data format as the input for the algorithm.
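A minimal sketch of this layout, with illustrative column names: one feature per column, one data point per row, plus the target variable.

```python
# Prepared data as a table: features in columns, data points in rows.
import pandas as pd

prepared = pd.DataFrame({
    "retention_time_s": [34, 120, 12],   # attribute 1
    "products_viewed":  [3, 7, 1],       # attribute 2
    "bought":           [0, 1, 0],       # target variable
})

X = prepared[["retention_time_s", "products_viewed"]].to_numpy()  # input matrix for the algorithm
y = prepared["bought"].to_numpy()                                 # target vector
```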

The more the better

While planning the application of a supervised method, you should ensure that a sufficient amount of training and test data can be collected. How much data you have to collect often depends on the dimensionality of the data (that is, the number of different attributes or features) and on its (unknown) structure. By properly testing and comparing the evaluation results, you can figure out whether you have used a sufficient amount of data. There is more on that in the evaluation part of this blog series.
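As a hedged sketch of the usual approach, part of the labeled data is held back as a test set so that the evaluation can reveal whether enough data were collected; the sizes below are arbitrary toy values.

```python
# Hold out a test set from the labeled data for later evaluation.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)               # 1,000 data points, 5 features (toy data)
y = np.random.randint(0, 2, size=1000)    # toy target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```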

The values of the target variable must occur frequently enough to be observed in a statistically meaningful way. For instance, an online shop can relatively easily collect the target variable “bought” / “not bought” because every session ends with one of these results. Clearly, in situations such as the factory machine example mentioned above, there may not be sufficient examples of each malfunction to allow accurate predictive maintenance.

In other situations, the target variable cannot be recorded automatically. Instead, these values must be assigned manually to complete the analysis. This usually means having human workers look at each data point and decide which label it should receive. For instance, consider the classification of images: a human worker can tag images of cars with brand names in a manual process called “labeling”. If you are unfamiliar with this term, please read the side note below.

Automatic labeling is possible when a computer can record the value of the target variable on its own (online shop, predictive maintenance) and is usually very cheap. In contrast, manual labeling costs additional money and time; take this into account when calculating the budget. As a basic principle, you should always use labeled data if possible, and resort to manual labeling if required, because this approach will lead to a much higher result quality in your Machine Learning project.

Side Note:

Labeling:

Labeling is the process of assigning a value to the target variable for a given data point. In an online shop, this could mean that a user session with the attributes (features) “user looks at products XYZ” and “exactly 3 times” gets the target variable value, or label, “buys” / “does not buy” assigned. Labeling can be carried out either automatically or manually by a human worker. During manual labeling, a person is shown a data point and has to decide which value the target variable of this data point should receive.

The goal of the labeling process is therefore always to define the correct outcome or effect for a given data point or dataset.

Another example is the assignment of keywords or tags to different texts. For later processing of those texts, a person reads each text and assigns it one or more labels or tags. In this example, a text is the data point and a tag is its label.
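A tiny, hedged sketch of what the result of such a manual labeling pass might look like; the texts and tags are invented for illustration.

```python
# Each data point (a text) receives one or more tags assigned by a human annotator.
texts = [
    "New graphics card released this week",
    "Review of a popular moisturizer",
    "Best budget laptops for students",
]
labels = [["technology"], ["cosmetics"], ["technology"]]   # manually assigned labels
```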

Hint: Labeling can become a cost- and labor-intensive task. Nevertheless, it is still a necessary and reasonable step within the data preparation phase. There are several SaaS platforms, such as Amazon Mechanical Turk, that promise to be very cheap. However, using such a service requires additional planning and quality measures, which in turn can lead to higher costs.

Avoid looking at your data through rose-tinted glasses

Finally, a general but very important aspect of any Predictive Analytics project is whether your data reflect the true situation and are unbiased. For instance, consider the collected shopping behavior data of tech-minded persons. In theory, you could use their historical data to train an algorithm that predicts the buying probability of people who like to buy household products. However, this might not be optimal, since tech-minded people will probably prefer tech-oriented household products, which in turn results in a biased buying prediction, skewed away from the behavior of an average customer. This kind of situation is called Sample Selection Bias, because the samples used for training do not reflect or represent the actual, neutral situation.

From the online shopping example you might think that such a bias is obvious and easy to detect. In reality, a sample selection bias not only occurs quite often but is usually very difficult to detect, which can lead to poor predictive performance. Typically such a bias builds up over time, incrementally degrading the predictions. This is one of the reasons why Machine Learning models have to be updated regularly.

If you have detected a Sample Selection Bias, you can consider different methods to compensate for it and fix the situation.
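One common, hedged compensation sketch is to re-weight the biased training sample toward the true customer mix; the group shares, column names and model choice below are illustrative assumptions, not real figures.

```python
# Compensate a known sample selection bias by re-weighting the training data.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({
    "products_viewed": [5, 7, 2, 8, 1, 3],
    "bought":          [1, 1, 0, 1, 0, 0],
    "group":           ["tech", "avg", "tech", "tech", "avg", "avg"],
})

true_share  = {"tech": 0.25, "avg": 0.75}                  # assumed shares in the real customer base
train_share = train["group"].value_counts(normalize=True)  # shares in the (biased) training sample
weights     = train["group"].map(lambda g: true_share[g] / train_share[g])

clf = LogisticRegression().fit(train[["products_viewed"]],
                               train["bought"],
                               sample_weight=weights)
```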

Checklist

The following checklist highlights the most important takeaways from this article:

  1. Is the problem at hand a supervised or unsupervised problem?
  2. Can you uniquely interpret your target variable?
  3. Have you checked the definition, precision and range of values for both attributes and target variable?
  4. Can you easily obtain new values for the target variable?
  5. Does this particular target variable help you solve your business goal?
  6. Do you have sufficiently large training and test datasets?
  7. Do the data reflect the true situation (Sample Selection Bias)?
  8. Do you understand your data well enough to justify the application of a clustering method?
  9. Have you factored in the data preparation phase as the most time consuming part of the project?
  10. Have you labelled your data?

Conclusion

Coupled with the evaluation phase covered in part four of this series, data preparation is the most important and fundamental step in any Predictive Analytics project. As such, it should be carried out by an experienced Data Scientist who understands your data and knows how to combine different data sources in order to get the most out of them. Mistakes in this phase usually have a devastating effect on the overall performance of the Machine Learning project.

About the author

Dr. Thomas Vanck is an expert in Machine Learning and Data Analysis. For years, he has been helping companies use their data for greater success. He looks forward to hearing your questions about your planned or ongoing data projects. Feel free to send him a message.
