AI and Machine Learning for Qlik Developers - Part 5

This is the fifth article in my series on building an AI using only Qlik Script. The first article can be found here, and the previous one, where I talk about multivariate linear regression, here.


Trying to develop something with machine learning in Qlik Script may seem like trying to eat soup with a fork.

Well, I am not hungry. My purpose is not to eat the soup. It is rather to understand what is in the soup, and for that reason, a fork is really helpful because I can pick up the noodles much better.

Leaving the linear world (sort of…)

It is time to start looking at problems that are not linear. Before we do anything, let's talk a little about linear versus non-linear relations, because you need to understand linear regression, and what we did in the previous article, before you can understand logistic regression.

Look at this series of numbers:

10,12,14,16

As you can see, there is an equal step between the numbers: each number increases by 2.

Having equal steps between each data point is the same as saying there is a linear relation between the data points.


We visualize a linear relationship between several data points by drawing a straight line through them.





In the previous articles we experimented with linear regression, where we tried to find a straight line that fits all data points as well as possible. In a perfect case the line passes through every point, but with real data life is not always perfect and we have small variances - the goal is still to find the straight line that minimizes the sum of errors.



What we did was to find a line that minimized the MSE - Mean Squared Error.



Now look at this series of numbers:

1, 4, 9, 16



Here the step between the numbers is not the same. There is still some kind of relation between them, but it is no longer "linear" - in this case it is quadratic (1, 4, 9, 16 are the squares of 1, 2, 3, 4).

Let's plot those values in a graph:





Trying to find a straight line that fits all the points will not work well, at least not when the number of data points grows. We might get lucky with a few data points, but that is not good enough - the line needs to fit all of them.



The sum of errors will just get bigger and bigger; there is no way to find a straight line with a minimal sum of errors.

So we need another algorithm, because linear regression will not work. We need non-linear regression / logistic regression.


Now look at these numbers:

109, 131, 127, 127, 240, 341, 123, 325, 340, 347, 122, 251

Plotting these data points, we get something like a scattered cloud, and it will not be possible to find a line of any type that matches them.





But we can draw shapes around them to "classify" them, right?



Classification with logistic regression

While linear regression is a regression algorithm, logistic regression - despite the name - is a classification algorithm. The result is a Yes/No answer together with a probability. So we do not use logistic regression / classification to estimate the price of a house; we use it to estimate how likely it is that the house will be sold or not.

Other popular classification algorithms in machine learning: Naive Bayes, Nearest Neighbor, Support Vector Machines, Decision Trees, Boosted Trees, Random Forest, Neural Networks



Decision boundary

The task of a classification algorithm is to separate the points by drawing a line or shape of some kind between them. That line is called the decision boundary. Everything below, or inside, the decision boundary belongs to category A, and everything else belongs to category B, and so on.


A decision boundary can be linear (a straight line) or non-linear (a curve, a box, a circle, etc). For now, we will keep things simple and stick to logistic regression, which uses a linear decision boundary.

Activation functions

A central term in classical machine learning is the Sigmoid. The word comes from the Greek sigmoeidēs, meaning shaped like Sigma - the Greek letter for S: Σ, σ, ς.

On a side note, the last letter, ς, is also a sigma, but it is only used at the end of a word.

The Sigmoid function is what is called an Activation Function.

There are many other activation functions used in Machine Learning: TanH, ReLU, Softmax and Swish, to name a few. Each has strengths and weaknesses for different purposes. Right now, we will just go with Sigmoid.

Activation functions are a key component of the neurons when you create neural networks - they make the neuron "pop". :-)

Let's have a look at it in a Sigmoid graph before we look at the Sigmoid formula:


You can see that the higher the value on the x-axis, the closer the function (y-axis) gets to 1, and the more negative the value on the x-axis, the closer the function gets to 0. So the conclusion of the graph: whatever you put into the Sigmoid function, it gives back a value between 0 and 1.

So what does Sigmoid have to do with logistic regression?

The Sigmoid function, written in mathematical notation, looks like this:

g(z) = 1 / ( 1 + e^-z )

In this example, z is the value we saw earlier on the x-axis, and g(z) is the value of the y-axis.

For instance, when z=5 in the above graph, g(z) will return close to 1, and when z=-5, g(z) will return something close to 0.

Don't bother trying to understand every part of the function right now. It is just maths, and all you need to know is that it takes one value and outputs another value between 0 and 1.

Translated into a Qlik Script variable, the Sigmoid formula looks like this:

Set Sigmoid = 1/ ( 1 + pow( e(),-$1 ) );

And you use it to calculate a variable where z=1 like this:

Let myValue = $(Sigmoid(1));

Or, to create a table with 20 example rows:

Set Sigmoid = 1 / ( 1 + pow( e(),-$1 ) );

LOAD
    rowno()-10 as z,
    $(Sigmoid((rowno()-10))) as g   // the extra parentheses make -$1 expand to -(rowno()-10)
autogenerate (20);

This table will look like this (a few of the 20 rows shown):

 z      g
-9      0.0001
-5      0.0067
-1      0.2689
 0      0.5000
 1      0.7311
 5      0.9933
10      1.0000

Ah! Melons, again

A practical example, mentioned in my previous article, was the task of deciding whether a melon is sweet or not, given for instance its volume or weight. Let us pretend the sweetness of a melon depends only on its weight: the heavier it is, the more likely it is to be sweet, and the lighter it is, the more likely it is to be sour.

Now let us try to put the weight of our melon into the Sigmoid function as the value of z.

Let's just try with the weight in kg, like this:

Melon size = 1 kg -> 

z=1


Sigmoid was: 1 / ( 1 + e^-z )

So:

1 / ( 1 + e^-1 ) = approximately 0.73 

So given this simple test, we can say we think it is 73% likely that this is a sweet melon.

But wait a minute... This will not work in the long run, because with Sigmoid we need negative values of z to get sour melons! We cannot just put in the melon weight as it is, because then every melon with weight > 0 will be considered more likely to be sweet than sour.

So we need to put something more than just the melon weight into the Sigmoid function, right?

The solution is to look back at how linear regression works, with the hypothesis:

h(x) = θ0 + θ1 * x1 + θ2 * x2 + ... + θn * xn

where each θ is a feature weight assigned to a particular feature x.

In our melon dilemma we have just one input column (the weight of the melon), so if this were a linear problem (for instance, if we wanted the price of a melon given its weight), we would have the formula h = θ0 + θ1 * x1.

The solution to our melon adventure is that we simply re-use this linear solution together with the Sigmoid function, and voilà! We have an algorithm that works for us!

This is now a logistic regression solution using a linear decision boundary, where everything on one side of the line is Sweet and everything on the other side is Sour.


If we have the "melon sweetness hypothesis" using Sigmoid(z) and we say

z = θ0 + θ1 * x1, where x1 is the melon weight, then:

hypothesis = Sigmoid(θ0 + θ1 * x1)

The only difference from the linear regression in the previous article is that we wrap our hypothesis in the Sigmoid function, which gives us back a value between 0 and 1.

Now we just need to find the best values for θ0 and θ1, so that the hypothesis gives us a value > 0.5 (= sweet) or <= 0.5 (= sour).
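As a small sketch of the idea in Qlik Script - the θ values below are just made-up numbers for illustration, not trained values - evaluating the hypothesis for one melon could look like this:

Set Sigmoid = 1 / ( 1 + pow( e(),-$1 ) );   // same Sigmoid variable as above

// Made-up example weights - in practice these come out of the regression loop
Let θ0 = -4;
Let θ1 = 3;

// Classify a 1.8 kg melon: a result > 0.5 means sweet, <= 0.5 means sour
Let z = $(θ0) + $(θ1) * 1.8;          // -4 + 3 * 1.8 = 1.4
Let prediction = $(Sigmoid($(z)));    // Sigmoid(1.4) is about 0.80 -> probably sweet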



Let's say we have a list of melons with their weights, and a rating from customers saying whether each melon was sweet or sour, like this:
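Purely as an illustration - the weights and ratings below are made up, not the article's actual data - such a table could be loaded in Qlik Script with an INLINE statement:

Melons:
LOAD * INLINE [
Weight, Sweet, Sour
0.3, , x
0.5, x,
1.1, , x
1.8, x,
2.5, x,
3.2, , x
];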


Feature Engineering

We transform the two columns Sweet and Sour into one column with value = 1 for sweet and 0 for sour.

In classification we need to turn everything into numbers, preferably numbers between -1 and 1, because in the end we will combine all columns and their weights into one sum, which we then drop into the Sigmoid function.

So for instance, if we had a column with colour (red, orange, yellow, green), we would have to change those into values like red=1, orange=2, yellow=3, green=4.

This process is called Feature Engineering, and if you start working with Machine Learning, you will spend a lot of time on feature engineering. Probably, most of your time will be spent on this.
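Continuing the hypothetical Melons table from above, a minimal sketch of this step in Qlik Script (all field names are illustrative) could be:

MelonsPrepared:
LOAD
    Weight                          as x1,
    if(len(trim(Sweet)) > 0, 1, 0)  as y   // 1 = sweet, 0 = sour
RESIDENT Melons;

DROP TABLE Melons;

// If we also had a text column such as Colour, a MAPPING LOAD combined with
// ApplyMap('ColourMap', Colour) could turn red/orange/yellow/green into 1-4.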

Cost function

The cost function we used in linear regression will not work here, because we have wrapped everything in the non-linear Sigmoid function. Instead, the cost function now looks like this:

Cost = -(1/m) * Σ [ y * log( h(x) ) + (1 - y) * log( 1 - h(x) ) ]

where m is the number of rows, y is the actual answer (0 or 1) and h(x) is the predicted value from the Sigmoid.

Luckily, you don't need to bother much about how to implement the cost function, because in the end we loop things in the same way as we did in linear regression - we just need to keep wrapping Mr. Sigmoid around everything. As long as we do that, we're good to go!
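If you do want to watch the cost while training, a minimal sketch - assuming the MelonsPrepared table from above and current weight variables θ0 and θ1 - could be:

// Per-row cost, only useful for monitoring convergence - not needed by the training loop.
// log() in Qlik Script is the natural logarithm.
CostPerRow:
LOAD
    x1,
    y,
    h,
    -( y * log(h) + (1 - y) * log(1 - h) ) as cost;
LOAD
    x1,
    y,
    (1 / (1 + pow(e(), -($(θ0) + x1 * $(θ1))))) as h
RESIDENT MelonsPrepared;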

The regression method is the same as the method in linear regression

You loop through all rows in your input table just like we did in the linear regression. Start with random values for your weights, for instance in our melon example:

θ0=1 and θ1=1

Take the first melon weight (in the table above, the first weight = 0.3 kg) together with those θ values, and place them into the hypothesis:

z = θ0 + θ1 * x1 = 1 + 1 * 0.3 = 1.3

Then insert it into Sigmoid:

Sigmoid(1.3) = 0.7858... = our "predicted y"

So our initial weights suggest the melon is sweet, with a probability of roughly 79%.

The melon of 0.3 kg was supposed to be sour (y=0), not sweet, so the hypothesis was quite wrong. We calculate how wrong it was in the same way as in the linear example:

error = predicted y - actual y

0.7858 - 0 = 0.7858

Loop through all rows with weights (in the same way as in the linear example), and sum up all errors using this formula:

error_X1 = Σ ( Sigmoid(θ0 + θ1 * x1) - y ) * x1   (summed over all rows)

written in Qlik Script (one example):

  rangesum(
      ( (1/(1 + pow(e(), -($(θ0) + (X1 * $(θ1)))))) - Y ) * X1,   // this row's error, weighted by the feature X1
      peek('error_X1')                                            // running total from the previous row
  ) as error_X1

The rest - calculating the change in the θ weights - is the same as in linear regression.

For each θ (in the melon example, θ0 and θ1): sum up all the errors for that θ (each error already multiplied by its feature value, as in the formula above), divide the sum by the number of rows to get an average, multiply by a good α, and use the result to adjust the θ - just as in linear regression.

Repeat until convergence. Just like before.
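A minimal sketch of the update step in Qlik Script, assuming the loop has produced a table ErrorTable whose last row holds the summed errors error_X0 and error_X1, and that the variables alpha and m (number of rows) exist - all of these names are illustrative:

// Gradient descent update: θ := θ - α * (1/m) * summed error for that θ
Let sumError_X0 = peek('error_X0', -1, 'ErrorTable');   // -1 = last row
Let sumError_X1 = peek('error_X1', -1, 'ErrorTable');

Let θ0 = $(θ0) - $(alpha) * (1/$(m)) * $(sumError_X0);
Let θ1 = $(θ1) - $(alpha) * (1/$(m)) * $(sumError_X1);

// Then drop ErrorTable, rebuild it with the new θ values, and repeat until the θ's stop changing.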

When the regression is done, you can test your formula on sample data (just like in linear regression): given a certain melon weight, the value the Sigmoid function returns tells you whether the melon is likely to be sweet or sour.

Remember to extract 10-20% of the "training" data to use it later when you want to validate how correct your hypothesis is.
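A minimal sketch of such a split in Qlik Script, assuming the prepared table is called MelonsPrepared (an illustrative name), could be:

// Random 85/15 split into training and validation data
SplitTmp:
LOAD *, Rand() as SplitRand
RESIDENT MelonsPrepared;
DROP TABLE MelonsPrepared;

TrainingData:
NoConcatenate
LOAD * RESIDENT SplitTmp WHERE SplitRand <= 0.85;

ValidationData:
NoConcatenate
LOAD * RESIDENT SplitTmp WHERE SplitRand > 0.85;

DROP TABLE SplitTmp;
DROP FIELD SplitRand;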


In this example I took the weights 0.7 kg and 2.5 kg out of the training data and tested them on the final hypothesis. We can see that 0.7 kg returned 0.76, which is wrong - it should have been below 0.5 to be classified as Sour. The other example, 2.5 kg, returned a fairly certain 0.95, which is Sweet. So my hypothesis is currently 50% correct, which is pretty bad.

Why is my hypothesis not working?

You will rarely reach 100% correctness, but you should always try to do better than you could by simply guessing. In this melon example there are melons that don't follow the idea that weight is the only thing affecting sweetness: the 0.5 kg melon is sweet while all other melons below 1.6 kg are sour, and the 3.2 kg melon was sour. This makes us suspect that more columns are needed to predict a melon's sweetness - but this is sadly all the data we have.

To get a better result, the first thing to try is simply to train more: loop more times and try different alpha values. Perhaps you had not reached convergence yet?

If you still don't get a good result, you need to look at the data and try to add more data into your hypothesis.


More feature engineering

When working with logistic regression it is always recommended that all features are on the same scale - as I mentioned earlier, ideally all values should be between -1 and 1. If the min/max values differ a lot between the columns in your dataset, you should try to find a way to bring them onto a similar scale. If you don't, it may take much longer to reach convergence, because some features will be too dominant and have too much to say about where the lowest point of error is.

Having features on a similar scale helps gradient descent converge more quickly towards the minimum.


Normalization (a.k.a. MinMax Scaling)

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

x_norm = ( x - x_min ) / ( x_max - x_min )

Qlik code: get the min and max values of the column by resident loading the table, save them in two variables such as x1_min and x1_max, and then resident load the table again with this formula:

Load … (x1-$(x1_min))/($(x1_max)-$(x1_min)) as x_norm resident table
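Put together, a minimal sketch of the whole normalization step - assuming a table called Data with a field x1 (names are illustrative) - could look like this:

MinMaxTmp:
LOAD
    min(x1) as x1_min,
    max(x1) as x1_max
RESIDENT Data;

Let x1_min = peek('x1_min', 0, 'MinMaxTmp');
Let x1_max = peek('x1_max', 0, 'MinMaxTmp');
DROP TABLE MinMaxTmp;

DataNormalized:
NoConcatenate
LOAD
    *,
    (x1 - $(x1_min)) / ($(x1_max) - $(x1_min)) as x1_norm
RESIDENT Data;

DROP TABLE Data;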

Standardization

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

x_std = ( x - μ ) / σ

μ is the mean of the feature and σ is the standard deviation of the feature.

Qlik code: Calculate the mean and standard deviation by resident loading the table and save the values in two variables like x1_mean and x1_stdav, then resident load the table again with this formula:

Load … (x1-$(x1_mean))/$(x1_stdav) as x_std resident table
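In the same way, a minimal sketch of the standardization step - again assuming a table called Data with a field x1 - could be:

StatsTmp:
LOAD
    avg(x1)   as x1_mean,
    stdev(x1) as x1_stdav
RESIDENT Data;

Let x1_mean  = peek('x1_mean', 0, 'StatsTmp');
Let x1_stdav = peek('x1_stdav', 0, 'StatsTmp');
DROP TABLE StatsTmp;

DataStandardized:
NoConcatenate
LOAD
    *,
    (x1 - $(x1_mean)) / $(x1_stdav) as x1_std
RESIDENT Data;

DROP TABLE Data;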

Normalize or Standardize?

  • Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution.
  • Standardization can be helpful in cases where the data follows a Gaussian distribution. Standardization does not have a bounding range. So, even if you have outliers in your data, they will not be affected by standardization.

Qlik_ML_kit

I am starting to create Qlik subs for Machine Learning. A tool kit.

So far I have just done subs for normalization and standardization:

Call NormalizeField(TableName,FieldName);

or

Call CreateMinMaxField (TableName,FieldName);

Writing subs like this makes the code easier to understand, and the final application gets much shorter. Using cleverly built subs, the code for classifying the melons would be something like this:

$(Include=lib://Scripts/Qlik_ML_kit.txt);

Let iterations = 1000;
Let alpha = 0.001;

Call ImportData(FileName,TableName);

Call NormalizeField(TableName,Weight);

Call LogisticRegression(TableName,Weight=x1,alpha,iterations);

Call VerifyResult(TableName);


Qlik_ML_kit contains all the code behind those subs. The subs read the columns from TableName and, for instance, check how many columns it has with the function NoOfFields(), in order to automatically build the hypothesis and create the necessary θ's.
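The kit itself is not published yet, so the following is only a guess at what one of those subs could look like - a minimal sketch of a NormalizeField sub, where the parameter, table and field names are my own assumptions, not the kit's actual code:

Sub NormalizeField(vTable, vField)

    // Find min and max of the field
    _minmax:
    LOAD
        min([$(vField)]) as _min,
        max([$(vField)]) as _max
    RESIDENT [$(vTable)];

    Let _min = peek('_min', 0, '_minmax');
    Let _max = peek('_max', 0, '_minmax');
    DROP TABLE _minmax;

    // Rebuild the table with the field rescaled to 0-1
    _normalized:
    NoConcatenate
    LOAD
        *,
        ([$(vField)] - $(_min)) / ($(_max) - $(_min)) as _norm_tmp
    RESIDENT [$(vTable)];

    DROP TABLE [$(vTable)];
    DROP FIELD [$(vField)] FROM _normalized;
    RENAME FIELD _norm_tmp to [$(vField)];
    RENAME TABLE _normalized to [$(vTable)];

End Sub

// Example call:
// Call NormalizeField('Melons', 'Weight');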

The Qlik_ML_kit is something I have started to create, but it is not finished. I am thinking of starting a collaborative Git project, and if you want to help create this kit, reach out to me. Either way, the result will be free for anyone to use.

With this strategy we actually get close to the number of lines that a final Python script would require, too! :-)

Most of your time will instead be spent making sure the data you import is well prepared. Try to fill in as many blanks as you can.

In our melon case, for instance: if a melon is missing its weight, you have to decide whether to remove it from the dataset or try to guess its weight based on other data in the dataset. Perhaps we also have the price in the data, and by comparing the price with the other melons, we can guess the weight of the melon that has no weight.
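A minimal sketch of that kind of gap filling, assuming a hypothetical Melons table that has both Weight and Price columns (both the price-per-kg approach and all names here are illustrative):

// Average price per kg, calculated from the melons that do have a weight
PricePerKgTmp:
LOAD
    sum(Price) / sum(Weight) as AvgPricePerKg
RESIDENT Melons
WHERE len(Weight) > 0;

Let vAvgPricePerKg = peek('AvgPricePerKg', 0, 'PricePerKgTmp');
DROP TABLE PricePerKgTmp;

// Fill in missing weights with an estimate based on the price
MelonsFilled:
NoConcatenate
LOAD
    *,
    if(len(Weight) > 0, Weight, Price / $(vAvgPricePerKg)) as WeightFilled
RESIDENT Melons;

DROP TABLE Melons;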


I have tested the concept on the Kaggle Titanic dataset ("Predict survival on the Titanic and get familiar with ML basics"), and I am getting close to a working example in Qlik Sense that delivers nearly 80% correctness. That is reasonably good, since most people who get better predictions on this dataset use other algorithms that are better suited than logistic regression. Choosing the right algorithm for the right dataset is very important to get a good result.

See the code in action: Case Study 3 - Survivals of Titanic


Thank you for reading. Please share if you like it, and comment on the post if you have questions or feedback about the content.


Comment below, or connect with me here on LinkedIn if you want to get a copy of a Qlik Sense App using Machine Learning.



