AI and Machine Learning for Qlik Developers - Case Study 1

AI and Machine Learning for Qlik Developers - Case Study 1

This is an simple example in Qlik where we are using a Supervised Univariate Gradient Descent Algorithm to be able to predict the amount of a insurance claim given the number of claims.

(The purpose of this example is to make you understand something about Machine Learning. This task could of course be solved much easier and quicker without using ML, by using built in Qlik functionality, or by calculating Pearson's Correlation Coefficient and Linear Regression.

For more in-depth information on the Qlik Script, review the articles from start before you begin.

If you want a copy of the Qlik Sense QVF here - contact me on LinkedIn and I will send it to you.

Reference: Swedish Committee on Analysis of Risk Premium in Motor Insurance

Data Source: Auto Insurance in Sweden (download data here)

No alt text provided for this image

Data extract (first 10 rows)

No alt text provided for this image

Hypothesis

Y = θ0 + X * θ1



Example: Qlik Sense Chart after training

No alt text provided for this image
  • X = number of claims
  • Y = total payment for all the claims in thousands of Swedish Kronor
  • Red Line = The result of our algorithm after 15'000 iterations

Example: Qlik Sense Log report after training

No alt text provided for this image

We can see that the value of θ1 became stable after around 5'000 iterations and θ0 (the bias) around 10'000 iterations using a fixed learning rate (α) at 0.001.

Script

Sheet 1 (GENERAL SETUP)

// REGIONAL SETTINGS

SET ThousandSep=' ';
SET DecimalSep='.';


Sheet 2 (INPUT DATA)

Training_table:

LOAD

  num(num#(X,'',',')) as X, 
  num(num#(Y,'',',')) as Y
    
FROM [lib://AI_DATA/INSURANCE/data.txt]
(txt, codepage is 28591, embedded labels, delimiter is '\t', msq);

Sheet 3 (LOG TABLE)

// SUB TO GENERATE A LOGFILE IN A LOG TABLE
// EVERY TIME THE SUB IS CALLED, IT GENERATES A NEW ROW IN THE TABLE Log.

Sub Log

 Log: 

 LOAD 

  rowno() as logRow,
  $(θ0) as θ0,
  $(θ1) as θ1

 autogenerate(1);
End sub


Sheet 4 (GRADIENT DESCENT IN ACTION)

// GRADIENT DESCENT FUNCTION



// CHECK THE TIME NOW (TO BE ABLE TO VISUALIZE HOW MUCH TIME LEARNING TAKES)
Let StartTrainingTime = now();

// DEFINE HOW MANY TIMES TO LOOP
Let iterations= 15000;

// DEFINE LEARNING RATE
Let α = 0.001;

// DEFINE START VALUES FOR ALL THETAS (WEIGHTS). 
Let θ0 = 0;
Let θ1 = 0;

// GET HOW MANY ROWS TRAINING DATA HAS.

Let m = NoOfRows('Training_table');

// START LOOP
For i = 1 to iterations

    // CREATE A SUMMARY TEMP TABLE
    temp:
    LOAD

        Rangesum(($(θ0) + (X* $(θ1)) - Y),
                    peek(deviation_0)) as deviation_0,

        Rangesum(($(θ0) + (X * $(θ1)) - Y) * X,
                    peek(deviation_X)) as deviation_X  

    Resident Training_table;

    // GET THE LAST ROW FROM THE temp TABLE 
    // THAT HAS THE TOTAL SUM OF ALL ROWS
    Let deviation_0 = Peek('deviation_0',-1,'temp');
    Let deviation_X = Peek('deviation_X',-1,'temp');
    
    // DROP THE temp TABLE. NO LONGER NEEDED
    drop table temp;
    
    // CHANGE THE VALUE OF EACH θ TOWARDS A BETTER θ
    Let θ0 = θ0 - ( α * 1/m * deviation_0);
    Let θ1 = θ1 - ( α * 1/m * deviation_X);
    

    // CREATE A LOG OF EACH θ
    Call Log;

// REPEAT UNTIL ITERATIONS HAVE REACHED THE GIVEN MAX
next i;


// STORE END TIME AND CALCUALTE TOTAL TIME AND SPEED
Let EndTrainingTime = now();
Let TrainingTime = timestamp(EndTrainingTime - StartTrainingTime);
Let learningSpeed = iterations / Second(TrainingTime);

Sheet 5 (User interface preparation)

// RENAME X AND Y TO ACTUAL NAMES FROM IMPORTED DATA

rename field X to [Number of claims];
rename field Y to [Total payment];


Improvements

I wanted to keep the code above as simple as possible, but I can see some improvements for the code that you can change/add yourself if you have understood how the code works.

The main thing to improve would be to test the learning rate (α) to a slightly bigger number so that the number of iterations don't have to be so high. If you try with α too high you will get this kind of weird error:

No alt text provided for this image


This is because you get divergence (failure to converge) and the size of our θ becomes to big for Qlik Sense to handle. Adding some checks on how θ0 and θ1 is changing after each iteration is a good idea to avoid this problem, and exit the loop if they don't converge.

Since θ0 is changing slower to reach it's target than θ1, you can try to use two different learning rates for each θ. Try with α=0.01 for θ0 and keep α=0.001 for θ1, for instance.

Finally, add code to exit the loop once you have reached a reasonable level of convergence, no need to continue after 15'000 iterations for sure.

Linear Regression and Pearson Correlation Coefficient (r)

And now for something completely different!

Pearson Correlation Coefficient (r):

No alt text provided for this image



Using the same data as in the ML example above, finding the constants θ0 and θ1 but in a more traditional mathematical approach.

The goal is to find a and b in the form f(x) = a + b*x by first finding the r.

// GET AVERAGE OF X AND y
temp_avg:
Load 
 sum(X)/count(X) as avgX,
    sum(Y)/count(Y) as avgY 
resident Training_table;
Let avgX = Peek('avgX',-1,'temp_avg');
Let avgY = Peek('avgY',-1,'temp_avg');
drop table temp_avg;

// CALULATE THINGS NEEDED LATER...
temp_calc:
Load 
 (X-$(avgX))*(Y-$(avgY)) as [X-avgX*Y-avgY],
 sqr((X-$(avgX))) as [SQR(X-avgX)],
 sqr((Y-$(avgY))) as [SQR(Y-avgY)]
Resident Training_table;

// SUM UP
temp_sum:
LOAD 
 sum([X-avgX*Y-avgY]) as [SUM X-avgX*Y-avgY], 
    sum([SQR(X-avgX)]) as [SQR(X-avgX)], 
    sum([SQR(Y-avgY)]) as [SQR(Y-avgY)] 
Resident temp_calc;
drop table temp_calc;

Let sumXavgX_YavgY = Peek('SUM X-avgX*Y-avgY',-1,'temp_sum');
Let sqr_X_avgX   = Peek('SQR(X-avgX)',-1,'temp_sum');
Let sqr_Y_avgY   = Peek('SQR(Y-avgY)',-1,'temp_sum');

drop table temp_sum;

// Pearson Correlation Coefficient (r)
// This is a measure of the linear correlation between two variables X and Y
Let r = sumXavgX_YavgY / sqrt(sqr_X_avgX*sqr_Y_avgY);

// GET NUMBER OF ROWS (NEEDED FOR STANADRD DEVIATION CALC)
Let m = NoOfRows('Training_table');

//Standard Deviations of X and Y
let stddev_x = sqrt(sqr_X_avgX/m);
let stddev_y = sqrt(sqr_Y_avgY/m);

// CALCULATE b
let b = r * (stddev_y/stddev_x);

// CALCULATE a
Let a = avgY - b * avgX;

And there we go!

The a and b should more or less get the same values as θ0 and θ1. Much quicker and less coding, but no Machine Learning there. Just math.

Good luck with your coding! More advanced Use Cases studies will come.


Thank you for reading. Please share if you like it, and comment on the post if you have questions or feedback about the content.


Comment below, or connect with me here on LinkedIn if you want to get a copy of the Qlik Sense App.


No alt text provided for this image



No alt text provided for this image


Nikhileswara Reddy S

Senior Consultant

11 个月

please send me the qliksense files [email protected]

回复

Hello Robert, what a nice work! excellent! would possible to get the QVF file? Thanks in advance:) ([email protected])

Fernando S.

Data Science at Volkswagen Group (Logistic Data Lake)

4 年

Hello Robert, what a nice work! excellent. would possible to get the Qlik Sense QVF? Thanks in advance ([email protected])

Hugo Simancas

Consultor Senior Izertis

4 年

Hello excellent work! would it be possible for me to share the .qvs file? Thanks [email protected]

Christophe Brault

Qlik Enthusiast ?? Make Qlik happen

4 年

Thanks, this case study really helps me understand the concept!

要查看或添加评论,请登录

Robert Svebeck的更多文章

社区洞察

其他会员也浏览了