AI and Machine Learning for Qlik Developers - Case Study 2


This is an example in Qlik where we use a supervised, multivariate gradient descent algorithm to predict the Stock Index Price from Year, Month, Unemployment Rate and Interest Rate.

(The purpose of this example is to help you understand something about machine learning. The task itself could perhaps be solved more easily without ML, or with built-in Qlik functionality.)

For more in-depth information on the Qlik script, review the earlier articles from the start before you begin.

If you want a copy of the Qlik Sense QVF, contact me on LinkedIn and I will send it to you.

Reference: the example data is found here, where you can also find Python code for the same data.

In the previous case study we had one single variable. In this example we have 4 columns of input data and one output.

Data

The data has 24 rows and 5 columns: Year, Month, Interest_Rate, Unemployment_Rate and Stock_Index_Price, with one row per month for 2016 and 2017.

The last date stamp in the file is December 2017. Let's pretend we are now in January 2018. The task is to predict the Stock Index Price for the current and coming months, given the interest rate and unemployment rate.

Since there are 5 columns, let's first simply guess that this hypothesis will work:

y = θ0 + X1 * θ1 + X2 * θ2 + X3 * θ3 + X4 * θ4

We will now write a script to find out if that hypothesis works.

Qlik Script

Please note: I will only comment on the parts of the code that differ from the first case study. If you don't understand something in this code, review the previous case study or go back to the main article where I explain more about the method and the math.

Loading Data

Input_table:

LOAD
    Interest_Rate       as x01,
    Unemployment_Rate   as x02,
    Year                as x03,
    Month               as x04,
    Stock_Index_Price   as y

FROM [lib://AI_DATA/STOCKMARKET/data.xlsx]
(ooxml, embedded labels, table is Data);

The order of the columns and how we name them are not relevant.

Log table

Sub Log

   Log: 

   LOAD 
        rowno() as logRow,
        $(θ00) as θ0,
        $(θ01) as θ1,
        $(θ02) as θ2,
        $(θ03) as θ3,
        $(θ04) as θ4

   autogenerate(1);
End sub

The only difference from the previous case study is that we add more weights (θ2, θ3, θ4) to the log.
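(Each call to the Log sub adds one new row; since the field names are identical every time, Qlik automatically concatenates the rows into the same Log table, and rowno() gives you one logRow per iteration.)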

Gradient Descent Algorithm

Define initial weights.

Let θ00 = 0;
Let θ01 = 0;
Let θ02 = 0;
Let θ03 = 0;
Let θ04 = 0;

 
Let m = NoOfRows('Input_table');
 
Let iterations = 100;
Let α = 0.001;


Five weights instead of two as in the first case study. Start with α = 0.001 and 100 iterations. There is no need for many iterations until we know the weights are converging.

Loop

For i = 1 to iterations

Inside loop

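  // Each expression below accumulates, row by row, the sum over all m rows of
  // (hypothesis - y) * xj for one weight, using Rangesum() + Peek().
  // After the load, the last row of temp holds the complete sums.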
  temp:
  LOAD
   Rangesum(($(θ00) + 
            (x01 * $(θ01)) + 
            (x02 * $(θ02)) + 
            (x03 * $(θ03)) + 
            (x04 * $(θ04)) - y),
            peek(error_x00)) as error_x00,

   Rangesum(($(θ00) + 
           (x01 * $(θ01)) + 
           (x02 * $(θ02)) + 
           (x03 * $(θ03)) + 
           (x04 * $(θ04)) - y) * x01,
           peek(error_x01)) as error_x01,

   Rangesum(($(θ00) + 
           (x01 * $(θ01)) + 
           (x02 * $(θ02)) + 
           (x03 * $(θ03)) + 
           (x04 * $(θ04)) - y) * x02,
           peek(error_x02)) as error_x02,

   Rangesum(($(θ00) + 
           (x01 * $(θ01)) + 
           (x02 * $(θ02)) + 
           (x03 * $(θ03)) + 
           (x04 * $(θ04)) - y) * x03,
           peek(error_x03)) as error_x03,

   Rangesum(($(θ00) + 
           (x01 * $(θ01)) + 
           (x02 * $(θ02)) + 
           (x03 * $(θ03)) + 
           (x04 * $(θ04)) - y) * x04,
           peek(error_x04)) as error_x04
                      
Resident Input_table;

Let sum_error_x00 = Peek('error_x00',-1,'temp');
Let sum_error_x01 = Peek('error_x01',-1,'temp');
Let sum_error_x02 = Peek('error_x02',-1,'temp');
Let sum_error_x03 = Peek('error_x03',-1,'temp');
Let sum_error_x04 = Peek('error_x04',-1,'temp');

drop table temp; 


Still not a big difference compared to case study 1. Because the hypothesis is larger, more code is required and there are more error sums to accumulate against the actual y.

Adjust weights (θ)

  Let θ00 = θ00 - ( α * 1/m * sum_error_x00);
  Let θ01 = θ01 - ( α * 1/m * sum_error_x01);
  Let θ02 = θ02 - ( α * 1/m * sum_error_x02);
  Let θ03 = θ03 - ( α * 1/m * sum_error_x03);
  Let θ04 = θ04 - ( α * 1/m * sum_error_x04);


Again, the same code, just more repeated lines, one per weight.
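Each of these lines is the standard gradient descent update, θj = θj - α * (1/m) * Σ (h(x) - y) * xj, where the sum is exactly what we accumulated into sum_error_x0j in the temp table (with the input fixed to 1 for the intercept θ00).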

Log the weights and close the loop

 Call Log;
  
  
next i;

And that is it!

We now have a complete multivariate linear regression script. Do a reload!

"Data has not been loaded. Please correct the error and try loading again."

... Doesn't work? -> Divergence?

Don't worry! This is normal! We need to study the data BEFORE we create our hypothesis!

We assumed every column has a direct linear relationship with the output column. That is not the case in this data, and in fact you should get used to it: it is normally not the case that the data you get is ready to be used directly without any manipulation.

If you don't understand your data, one simple way to find out which columns to remove or transform is to remove as many as possible and then add them back one by one. That way you will find out which columns are causing the divergence.

Change the input in this way:

Input_table:

LOAD
    Interest_Rate         as x01,

    //Unemployment_Rate   as x02,
    0                     as x02,

    //Year                as x03,
    0                     as x03,

    //Month               as x04,
    0                     as x04,

    Stock_Index_Price     as y

FROM [lib://AI_DATA/STOCKMARKET/data.xlsx]
(ooxml, embedded labels, table is Data);

The change is that we default columns x02, x03 and x04 to 0. Since they are always zero, we have effectively removed them from the hypothesis.

Our hypothesis is now rather this: y = θ0 + X1 * θ1

Run the code. It works better, right? So we know that the interest rate has a linear relation with the stock index price. However, this hypothesis will not be very good at predicting future stock index prices. Repeat the method for all variables and you will find that the column Year is the problem: it cannot be added to the hypothesis in the same way as the other columns.

How do we add the year?

The solution, given that we pretend we are now in January 2018, is to add a new column to the input data that calculates (from the Year and Month columns) the number of months from "today" for each row, and to name that column x03.

Like this:

Input_table:

LOAD
    Interest_Rate as x01,
    Unemployment_Rate as x02,

    ((Year*12)+Month) - ((2018*12)+1) as x03,

    0 as x04,
    Stock_Index_Price as y

FROM [lib://AI_DATA/STOCKMARKET/data.xlsx]
(ooxml, embedded labels, table is Data);

x03 is now a negative number: how many months before January 2018 that row belongs to.
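For example, the row for December 2017 gets x03 = ((2017*12)+12) - ((2018*12)+1) = 24216 - 24217 = -1, i.e. one month before our pretend "today".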

Run the code again. Now it works with all columns!

Our working hypothesis is now: y = θ0 + X1 * θ1 + X2 * θ2 + X3 * θ3

To save some reload time you can remove all traces of x04 and θ04, but keeping x04 = 0 will not affect the prediction if you leave them in the code.

Analyze your log

Experiment manually with different values for α and the number of iterations, and study the log table to find the best values for this dataset.

Build a nice user interface for the log data; with Qlik Sense you can do this in five minutes.


That will help you to find optimal parameters.
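One simple option (an assumption on my part, not necessarily the chart in the original app) is a line chart with logRow as the dimension and θ0 ... θ4 as measures: converging weights show up as curves that flatten out, while a diverging weight shoots off towards ever larger values.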

Change the learning rate (α)

To make the training converge faster, try varying α: start the first iterations with a large value and make it smaller for every iteration. For instance like this:

Let α = (1/( pow((1/0.01) + i,1))) + 0.0001;

Since pow((1/0.01) + i, 1) is simply 100 + i, this gives α = 1/(100 + i) + 0.0001, which starts at roughly 0.01 and decays towards 0.0001 as i grows large.
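One way to use this (my assumption on where to place it) is to recompute α at the top of each iteration, right after the line For i = 1 to iterations and before the temp load, so that it shrinks on every pass.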

You can also check in the log how fast each weight is adjusting itself and use a different α per weight (where α1, α2 and α3 are separate learning-rate variables you define), like this:

  Let θ00 = θ00 - ( α1 * 1/m * sum_error_x00);
  Let θ01 = θ01 - ( α2 * 1/m * sum_error_x01);
  Let θ02 = θ02 - ( α2 * 1/m * sum_error_x02);
  Let θ03 = θ03 - ( α3 * 1/m * sum_error_x03);


In my code, the weights were stable after about 800 iterations, which took around 3 seconds on my laptop.

But is the hypothesis useful?

By now we know the code runs and the weights converge nicely. But is the hypothesis able to predict anything with good accuracy? How do we know that the weights are the correct weights?

Since this is example data and we have no data for the months after December 2017, we have to use the data we have to verify whether our hypothesis is good or not.

What we do is remove some rows from the training data and use those rows to test our hypothesis against. This is standard procedure in machine learning development. The recommendation I have read is to hold out about 10% of the data for verification, but I guess it depends on how much data you have to start with.

Training and Verification

We do not want to take out the first 10% of the rows, nor the last 10%, because the data might be sorted and we need to train our model on data samples of all kinds. Therefore I suggest taking out random rows from the data. I do it like this:

Create a new field in the input table:

if($(GenerateRandomNumber(0,1,0.1))>0.8,'V','T') as [Train/Verify]

I talked about the variable $(GenerateRandomNumber()) in article 4.
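If you don't have that variable at hand, a minimal sketch of a parameterised variable that would work here could look like this (my assumption; not necessarily the exact definition from article 4):

// Returns a random number between $1 and $2, rounded to the nearest multiple of $3.
// SET (not LET) stores the expression unevaluated, so Rand() is re-evaluated for every row.
SET GenerateRandomNumber = Round(Rand() * ($2 - $1) + $1, $3);

With this definition, $(GenerateRandomNumber(0,1,0.1)) expands to Round(Rand() * (1 - 0) + 0, 0.1), giving a value in 0, 0.1, ..., 1.0 for each loaded row.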

The column [Train/Verify] will hold a V or a T depending on the random number generated. In this case, approximately 80% of the rows will have T and the rest V.

After this, I just filter on [Train/Verify] = 'T' for the gradient descent; a sketch of how the pieces fit together follows below. When training is complete and we have stable values for all thetas, I check the hypothesis on each of the verification rows with the Result_Table load that comes after the sketch.
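For completeness, here is a sketch of what the final Input_table load could look like with the extra fields used from here on (the Row field, my choice of RecNo() for it, and the placement of [Train/Verify] are assumptions on my part; they are not shown explicitly in the original load statements):

Input_table:
LOAD
    RecNo()                                            as Row,   // assumed row identifier, used in the result table
    Interest_Rate                                      as x01,
    Unemployment_Rate                                  as x02,
    ((Year*12)+Month) - ((2018*12)+1)                  as x03,   // months before January 2018
    0                                                  as x04,
    Stock_Index_Price                                  as y,
    if($(GenerateRandomNumber(0,1,0.1))>0.8,'V','T')   as [Train/Verify]   // roughly 80% 'T', 20% 'V'

FROM [lib://AI_DATA/STOCKMARKET/data.xlsx]
(ooxml, embedded labels, table is Data);

Inside the gradient descent loop, the only change is to end the temp load with a where clause so that only the training rows are used:

Resident Input_table
Where [Train/Verify] = 'T';

Note that m should then be the number of training rows rather than all 24 rows; using the full count only scales the effective learning rate slightly.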

Result_Table:

LOAD

 Row,
 x01 as [Interest Rate], 
 x02 as [Unemployment Rate], 
 x03 as [Months ago], 
 y as [Stock Index Price],

 $(θ00) + 
   (x01 * $(θ01)) + 
   (x02 * $(θ02)) + 
   (x03 * $(θ03)) as [Predicted Stock Index Price]

Resident Input_table

where [Train/Verify] = 'V';

The result table will hold the actual stock index price and the predicted stock index price.
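If you also want a single error figure instead of only a visual comparison, one possible addition (my own, not part of the original app) is to aggregate the verification error, for example as a mean absolute percentage error:

Error_Check:
LOAD
    // average relative error over the verification rows
    Avg( fabs([Predicted Stock Index Price] - [Stock Index Price]) / [Stock Index Price] ) as [Mean Abs Pct Error]
Resident Result_Table;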

Visualising this in Qlik Sense:

[Chart: actual vs. predicted Stock Index Price for the verification rows]

Conclusion

Doing supervised multivariate linear regression in Qlik Sense is possible, easy to do and fast to execute, at least as long as we do not have too many variables. Running the code on thousands of rows of data is still extremely quick, because Qlik is so fast at loading data into memory.

Having hundreds of columns of input data, and thus hundreds of weights, would not be reasonable to code the way I have done in this example. The code quickly becomes unreadable as the number of parameters increases, because the script expands in two dimensions for every parameter added.

But I can see possibilities for a different coding approach, combining Qlik Subs and Qlik variables with $-parameters, to also handle problems with hundreds of columns in the hypothesis.

However, I think it is rare to find problems with hundreds of columns where the final hypothesis will predict well using only linear regression. In reality there is seldom a clear linear relation between all the columns in a dataset, at least not for all the columns needed to predict a useful output.

Thank you for reading. Please share if you like it, and comment on the post if you have questions or feedback about the content.


Comment below, or connect with me here on LinkedIn if you want to get a copy of the Qlik Sense App.

