AI and Machine Learning for Qlik Developers - Case Study 3 - part 2
Robert Svebeck
Driving Responsible AI Implementation in Region Stockholm / Karolinska University Hospital
This is an example in Qlik where we are using a Logistic Regression Algorithm, built only with Qlik Script, to be able to predict if a person who boarded Titanic survived or not.
For more in-depth information on the Qlik Script concept, review the articles from start before you begin, or jump right into the article about logistic regression here.
This is part 2. Part 1 can be found here.
The Learning app
The learning app has 4 steps:
- Import about 90% of the rows from the QVD that was created in the Preparation App
- Run the Logistical Regression Algorithm
- Import the remaining ~ 10% of rows from the QVD that was created from the Preparation App and verify the prediction model on that data. Here we get a preliminary result on how well our model works.
- Store the result of the prediction model in a new QVD for the next app "Testing App / "Kaggle app"
Let's jump straight to the code:
1. Import data
Learning_table: LOAD * FROM [lib://DataFiles/learn.qvd] (qvd) where [Learn/Verify]='L';
Nothing special there, normal procedures for Qlik developers. No need to specify fields since this was prepared in the previous app, but we select with the "where clause" - we only want the "Learning" records, and we leave the "Verify" records for later.
2. Run logistical regression algorithm
Here we have a lot of code to explain.... take a deep breath! You have an advantage if you understood some of the reasoning in earlier articles about gradient descent, to fully get what is happening here... I have a slightly different approach to solve the maths than I had in the previous case studies though. This solution is more elegant and efficient.
The linear hypothesis we will use inside the logistical regression is the same as in the examples using the gradient descent algorithm:
SET Hypothesis = '(θ0 + (X1*θ1) + (X2*θ2) + (X3*θ3) + (X4*θ4) + (X5*θ5) + (X6*θ6) + (X7*θ7) + (X8*θ8) + (X9*θ9))';
So we need to declare 10 weights ( θ0 ... θ9 )
I am using 10 weights because I want to explain that this code can be used for any recordset that has 10 or less festures. The data set we produced in the previous app only has 9 actual features but we set the 10th one (X9) = 0 for all records so that is the method to handle a dataset with less than 10 features. It is of course also easy to expand the max features to what ever max you want, but the code will get much bigger and it will be more difficult to follow the logic so let's save that for the QLIK_ML_toolkit project...
In the previous case studies I stored the weights as Qlik variables with LET θ0 = ... but in this example I will instead store the weights as columns inside the learning_table.
I start with all initial θ0 ... θ9 (weights) = 0 and joining it to the learning table without using any key, which result in that all weights are added to all records in the table.
Join (Learning_table) LOAD 0 as θ0, 0 as θ1, 0 as θ2, 0 as θ3, 0 as θ4, 0 as θ5, 0 as θ6, 0 as θ7, 0 as θ8, 0 as θ9 AutoGenerate(1);
So we just added 10 more columns named θ0 ... θ9 into the table where all our features are (X1...X9).
What we will do now is calculate new weights hundreds, or thousands of times, each time updating the weight columns on all rows with new weights. The weights will always have the same value for all rows, but each weight θ0 ... θ9 will always be unique, and hopefully "better" and "better" for every time we loop.
Before the loop starts, our learning table looks like this (only showing two features and 3 weights here):
After a few loops our table will perhaps look like this:
In the end, when we have looped "enough", we use the final weights on the algorithm again, but this time not using training data, instead using new values from the verification data with new values of X1....X9 to calculate an expected survival on each passenger.
Visualization above showing how each weight is slightly adjusted for every iteration. This is also a fantastic experience of using Qlik for running the algorithm. To visualize what is actually going on inside the code makes it easier to understand!
Defining some variables for how many iterations and how many rows we have to in the table.
Let iterations= 500; Let m = NoOfRows('Learning_table'); Let α1 = 0.1; Let α2 = 0.0001;
Number of iterations; Start with a small value and increase it gradually as you need.
Remember the α from the gradient descent examples? We could keep using just one α here but I decided to try and use a α that is gradually smaller and smaller for each iteration. Still will speed up the learning and require less iterations to stabilize the weights. The start value for α will be α1, and the final value will be α2. See below...
Using a dynamic α speed up the calculation a lot because we don't need to do so many iterations. In my tests I reduced from 10'000 iterations to 1'000 to get the same result. So 10 times faster!
Lets start the loop!
For i = 1 to iterations Let α = (1/( pow((1/α1) + i,1/3))) + α2;
Calculating α using a 1/x function, it will make it exponentially smaller and smaller for every iteration, just like the amount of food my son eats during a meal:
As the weight error gets smaller and smaller, the adjustment factor α also get smaller and smaller.
Apply logistical regression algorithm
Creating a temp table based on the learning table data, where I apply the logistical regression algorithm for each row like this:
temp: LOAD Row, θ0, θ1, θ2, θ3, θ4, θ5, θ6, θ7, θ8, θ9, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) as deviation_0, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) * X1 as deviation_X1, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) * X2 as deviation_X2, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) * X3 as deviation_X3, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) * X4 as deviation_X4, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) * X5 as deviation_X5, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) * X6 as deviation_X6, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) * X7 as deviation_X7, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) * X8 as deviation_X8, ((1/(1 + pow( e(),-$(Hypothesis) )))-Y) * X9 as deviation_X9 Resident Learning_table;
Read more about the logistical regression theory here.
To explain that code a little bit...
What this line do (for instance)
( ( 1/ ( 1 + pow( e(), -$(Hypothesis) ) ) ) -Y ) * X2 as deviation_X2
Expanding this $(Hypothesis) first:
( ( 1/ ( 1 + pow( e(), -( θ0 + (X1*θ1) + (X2*θ2) + (X3*θ3) + (X4*θ4) + (X5*θ5) + (X6*θ6) + (X7*θ7) + (X8*θ8) + (X9*θ9) ) ) ) -Y ) * X2 as deviation_X2
Or in pseudo code text: Calculate all values on each row in the table, using all columns on the row in some "special mathematical thing" and multiply it with X2 to create a column named deviation_X2
Remember the Sigmoid function from the article about logistic regression?
1/(1 + pow( e(),-X ))
or in math language
or as a graph over x:
It gives a value between 0 and 1, which is exactly what we want to compare it to our survival rate which also has a value between 0 and 1 (or rather, is always either 0 or 1).
So for each row, the sigmoid function calculates a value between 0 and 1, then we subtract Y (which is 0 or 1), then we multiply with all X values, for each feature, one by one and we get a deviation for each x (in this case deviation_x2).
Simplified:
( Sigmoid(hypothesis) - Survival ) * Feature = Deviation_feature
The Deviations calculated there for each feature are then in the next step used to adjust the weights.
How to adjust weights
To calculate new weights, we do the same way as in gradient descent. But instead of doing it on variables, we do it in a table called new_weights:
new_weights: Load Distinct FirstValue(θ0) - ( $(α) * 1/$(m) * (sum(deviation_0))) as θ0, FirstValue(θ1) - ( $(α) * 1/$(m) * (sum(deviation_X1))) as θ1, FirstValue(θ2) - ( $(α) * 1/$(m) * (sum(deviation_X2))) as θ2, FirstValue(θ3) - ( $(α) * 1/$(m) * (sum(deviation_X3))) as θ3, FirstValue(θ4) - ( $(α) * 1/$(m) * (sum(deviation_X4))) as θ4, FirstValue(θ5) - ( $(α) * 1/$(m) * (sum(deviation_X5))) as θ5, FirstValue(θ6) - ( $(α) * 1/$(m) * (sum(deviation_X6))) as θ6, FirstValue(θ7) - ( $(α) * 1/$(m) * (sum(deviation_X7))) as θ7, FirstValue(θ8) - ( $(α) * 1/$(m) * (sum(deviation_X8))) as θ8, FirstValue(θ9) - ( $(α) * 1/$(m) * (sum(deviation_X9))) as θ9 Resident temp; drop table temp;
- FirstValue() is used just because we need "any" value from the weight columns (remember all rows have the same value for the same weight).
- α is a variable that will be smaller and smaller for every iteration.
- m is the number of rows in our learning table. Will always be the same.
- sum(deviation_xn) is the sum of all errors for all rows for each feature.
To simplify: We make a sum of all deviations (per feature) calculated in the temp table above, divide it with how many rows we have in the table (so we have the mean error per row and per feature) and then we multiply that mean value with our α. This final value is how much we adjust the new weight with.
Store new weights to the learning table and loop again!
Finally, we need to join back the new weights on the learning table and do everything again (iterate).
First we remove the old weights from the learning table, then we join the new weights.
drop fields θ0,θ1,θ2,θ3,θ4,θ5,θ6,θ7,θ8,θ9 from Learning_table; left join (Learning_table) LOAD * resident new_weights; drop table new_weights; next i;
The complete algorithm concept in one picture:
And that is it. We have done logistical regression!
Store the final weights
We want to store the final weights as a QVD because we need them in the final app where we will create a file to send to Kaggle. Then we can also drop the learning table.
Weights: LOAD DISTINCT θ0,θ1,θ2,θ3,θ4,θ5,θ6,θ7,θ8,θ9 resident Learning_table; STORE Weights INTO [lib://DataFiles/learning_result.qvd] (qvd); drop table Learning_table;
Verify the result
To check how well our learning algorithm performed, we apply the hypothesis on the remaining data created in the preparation app.
Load the remaining training data and join the weights.
Verify_table: LOAD * FROM [lib://DataFiles/$(FileNameImport)] (qvd) where [Learn/Verify]='V'; join (Verify_table) LOAD * resident Weights;
For the user interface, we want to store the weights as variables. Then we drop the weights table:
let θ0 = peek('θ0',-1,'Weights'); let θ1 = peek('θ1',-1,'Weights'); let θ2 = peek('θ2',-1,'Weights'); let θ3 = peek('θ3',-1,'Weights'); let θ4 = peek('θ4',-1,'Weights'); let θ5 = peek('θ5',-1,'Weights'); let θ6 = peek('θ6',-1,'Weights'); let θ7 = peek('θ7',-1,'Weights'); let θ8 = peek('θ8',-1,'Weights'); let θ9 = peek('θ9',-1,'Weights'); drop table Weights;
Join back the verify_table on itself to apply the result of the algorithm:
join (Verify_table) Load Row, Y, if(SigmoidHypothesis>0.5,1,0) as MLPrediction; LOAD Row, Y, 1/(1 + pow(e(),-$(Hypothesis))) as SigmoidHypothesis resident Verify_table;
The field MLPrediction field will, for each Row, tell us what the hypothesis say is the outcome of that passenger, we also have the actual outcome of the passenger in Y.
The sigmoid function will transform the prediction into a value between 0 and 1, and we just round up the result into nearest integer, meaning 1=survived and 0=died. It's up to you to decide if 0.5 should be considered survived or died. Hopefully very few (if any) will be precisely 0.5...
Last steps is just interface things. Some nice variables to present the result in the app:
temp: LOAD COUNT(IF(MLPrediction=Y,Row)) as MLPrediction, COUNT(IF(MLPrediction=1 AND Y = 0,Row)) as MLPrediction_false_positive, COUNT(IF(MLPrediction=0 AND Y = 1,Row)) as MLPrediction_false_negative, COUNT(Row) as Total Resident Verify; drop table Verify; Let MLCorrect = peek('MLPrediction',-1,'temp'); Let MLPrediction_false_positive = peek('MLPrediction_false_positive',-1,'temp'); Let MLPrediction_false_negative = peek('MLPrediction_false_negative',-1,'temp'); Let Total = peek('Total',-1,'temp'); drop table temp;
False negatives and false positives are good to see if the prediction is overly optimistic or not towards one type of error.
Final calculation for the Machine Learning Guess Rate takes how many passengers that is i correctly estimated vs the total. In my case, 6 passengers are not correct. I think with some better feature engineering I could do better, but this is a good enough result for this study.
Log the progress
Perhaps you also enjoy looking at how the weights are slowly reaching their goal?
To create this log, just add this code before the loop:
log: Load 0 as log.row autogenerate (1);
and add this in the loop (before dropping the table new_weights:
Concatenate(log) Load rowno() as log.row,* resident new_weights;
Then make a line chart with log.row as dimension and ten "θn" ... as n expressions.
That's all it takes! We are done.
Final article in this Titanic series is how to run the algorithm on the test data, create a test file and send it to Kaggle for verification.
Thank you for reading. Please share if you like it, and comment on the post if you have questions or feedback about the content.
Comment below, or connect with me here on LinkedIn if you want to get a copy of the Qlik Sense App.
Data Analist at KWF Kankerbestrijding
2 年Thank you for this post Robert. Is it possible that I can get the App? Thank you on beforehand. Johan
Qlik Developer with many years of experience
3 年great article, Robert - really informative. I have been building dashboards in Qlik for many years now and only recently been asked to branch out into using Python e.g. Pandas/Numpy/SKlearn to apply machine learning to some business cases / questions. Always thought there would be a way of bring ML and Qlik together -this confirms my hopes :-) would you be able to send me a copy of the qvfs for the titanic apps ?