AI and Machine Learning for Qlik Developers - Case Study 2
Robert Svebeck
"Success consists of going from failure to failure without a loss of enthusiasm." Innovation and Digitalization Leader | AI Strategist | Solution Architect
This is an example in Qlik where we use a supervised multivariate gradient descent algorithm to predict Stock Index Prices based on Year, Month, Unemployment Rate and Interest Rate.
(The purpose of this example is to help you understand something about machine learning. This task could perhaps be solved more easily without ML, or by using built-in Qlik functionality.)
For more in-depth information on the Qlik script, review the earlier articles in this series before you begin.
If you want a copy of the Qlik Sense QVF, contact me on LinkedIn and I will send it to you.
Reference: the example data is found here, where you can also find Python code for the same data.
In the previous case study we had a single input variable. In this example we have four columns of input data and one output column.
Data
The data has 24 rows and 5 columns: Year, Month, Interest_Rate, Unemployment_Rate and Stock_Index_Price. The first rows look like this:
The last "date stamp" in the file is December 2017. Let's pretend we are now in January 2018. The task is to predict the Stock Index Price for the current and the coming months, depending on what the interest rate and unemployment rate are.
Since there are 5 columns, let's first simply guess that this hypothesis will work:
y = θ0 + X1 * θ1 + X2 * θ2 + X3 * θ3 + X4 * θ4
We will now write a script to find out if that hypothesis works.
Qlik Script
Please note: I will only comment on the parts of the code that differ from the first case study. If you don't understand something in this code, review the previous case study or go back to the main article where I explain more about the method and the math.
Loading Data
Input_table:
LOAD
    Interest_Rate as x01,
    Unemployment_Rate as x02,
    Year as x03,
    Month as x04,
    Stock_Index_Price as y
FROM [lib://AI_DATA/STOCKMARKET/data.xlsx]
(ooxml, embedded labels, table is Data);
The order of the columns and how we name them are not relevant.
Log table
Sub Log
Log:
LOAD
    rowno() as logRow,    // increments for every call, since each load auto-concatenates onto the Log table
    $(θ00) as θ0,
    $(θ01) as θ1,
    $(θ02) as θ2,
    $(θ03) as θ3,
    $(θ04) as θ4
autogenerate(1);
End Sub
The only difference from the previous case study is that we add more weights (θ2, θ3, θ4) to the log.
Gradient Descent Algorithm
Define initial weights.
Let θ00 = 0;
Let θ01 = 0;
Let θ02 = 0;
Let θ03 = 0;
Let θ04 = 0;
Let m = NoOfRows('Input_table');
Let iterations = 100;
Let α = 0.001;
Five weights instead of two as in the first case study. Start with α = 0.001 and 100 iterations. There is no need for many iterations until we know the weights are converging.
Loop
For i = 1 to iterations
Inside loop
temp:
LOAD
    // Accumulate the summed prediction error for the intercept (θ0) and, multiplied by each feature,
    // for the other weights, using Rangesum + peek to build a running total down the table.
    Rangesum(($(θ00) + (x01 * $(θ01)) + (x02 * $(θ02)) + (x03 * $(θ03)) + (x04 * $(θ04)) - y), peek(error_x00)) as error_x00,
    Rangesum(($(θ00) + (x01 * $(θ01)) + (x02 * $(θ02)) + (x03 * $(θ03)) + (x04 * $(θ04)) - y) * x01, peek(error_x01)) as error_x01,
    Rangesum(($(θ00) + (x01 * $(θ01)) + (x02 * $(θ02)) + (x03 * $(θ03)) + (x04 * $(θ04)) - y) * x02, peek(error_x02)) as error_x02,
    Rangesum(($(θ00) + (x01 * $(θ01)) + (x02 * $(θ02)) + (x03 * $(θ03)) + (x04 * $(θ04)) - y) * x03, peek(error_x03)) as error_x03,
    Rangesum(($(θ00) + (x01 * $(θ01)) + (x02 * $(θ02)) + (x03 * $(θ03)) + (x04 * $(θ04)) - y) * x04, peek(error_x04)) as error_x04
Resident Input_table;

// The last row of temp now holds the total error sums.
Let sum_error_x00 = Peek('error_x00',-1,'temp');
Let sum_error_x01 = Peek('error_x01',-1,'temp');
Let sum_error_x02 = Peek('error_x02',-1,'temp');
Let sum_error_x03 = Peek('error_x03',-1,'temp');
Let sum_error_x04 = Peek('error_x04',-1,'temp');

Drop table temp;
Still no big difference compared to case study 1. Because the hypothesis is larger, the code simply gets longer: there are more error terms to accumulate and check against the actual y.
Adjust weights (θ)
Let θ00 = θ00 - ( α * 1/m * sum_error_x00);
Let θ01 = θ01 - ( α * 1/m * sum_error_x01);
Let θ02 = θ02 - ( α * 1/m * sum_error_x02);
Let θ03 = θ03 - ( α * 1/m * sum_error_x03);
Let θ04 = θ04 - ( α * 1/m * sum_error_x04);
Again, the same code, just more repeated lines, one per weight.
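For reference, each Let statement above is just the standard gradient descent update rule written out per weight:

θj := θj - α * (1/m) * Σ( (θ0 + θ1*X1 + θ2*X2 + θ3*X3 + θ4*X4) - y ) * Xj

where the sum runs over all rows, Xj is the j:th input column (with X0 = 1 for the intercept), and that sum is exactly what was accumulated into sum_error_xNN in the previous step.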
Log the weights and close the loop
Call Log;
Next i;
And that is it!
That is the multivariate linear regression example in full. Do a reload!
"Data has not been loaded. Please correct the error and try loading again."
... Doesn't work? -> Divergence?
Don't worry! This is normal! We need to study the data BEFORE we create our hypothesis!
We assumed that every column has a direct linear relationship with the output column. That is not the case in this data, and in fact, get used to it: it is normally not the case that the data you get is ready to be used directly, without any manipulation.
If you don't understand your data, one simple way to find out which columns to remove or manipulate is to remove as many as possible and then add them back one by one. That way you will find out which columns are causing the divergence.
Change the input in this way:
Input_table:
LOAD
    Interest_Rate as x01,
    //Unemployment_Rate as x02,
    0 as x02,
    //Year as x03,
    0 as x03,
    //Month as x04,
    0 as x04,
    Stock_Index_Price as y
FROM [lib://AI_DATA/STOCKMARKET/data.xlsx]
(ooxml, embedded labels, table is Data);
The change is that we default columns x02, x03 and x04 to 0; in effect we remove them from the hypothesis, since their terms will always be zero.
Our hypothesis is now rather this: y = θ0 + X1 * θ1
Run the code. Works better, right? So we know that the interest rate has a linear relation with the stock index price. However, this hypothesis will not be very good at predicting future stock index prices. Repeat the method for all variables and you will find that the column Year is the problem: its values (around 2016-2017) are huge and nearly constant compared to the other columns, so the gradient steps for that weight blow up and the weights diverge. It is not possible to add that column to our hypothesis like the other columns.
How do we add the year?
The solution: given that we pretend we are now in January 2018, add a new column to the input data that calculates (based on the Year and Month columns) the number of months from "today" for each row, and name that column x03.
Like this:
Input_table:
LOAD
    Interest_Rate as x01,
    Unemployment_Rate as x02,
    ((Year*12)+Month) - ((2018*12)+1) as x03,
    0 as x04,
    Stock_Index_Price as y
FROM [lib://AI_DATA/STOCKMARKET/data.xlsx]
(ooxml, embedded labels, table is Data);
x03 is now a small negative number: how many months before "today" (January 2018) the row belongs to.
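For example, the December 2017 row gives ((2017*12)+12) - ((2018*12)+1) = -1, and a January 2017 row gives ((2017*12)+1) - ((2018*12)+1) = -12.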
Run the code again. Now it works with all columns!
Our working hypothesis is now: y = θ0 + X1 * θ1 + X2 * θ2 + X3 * θ3
To save some reload time you can remove all traces of x04 and θ04, but keeping x04 = 0 will not mess up the prediction if you leave them in the code.
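If you do remove them, a minimal sketch of the trimmed load (same file and fields as above, just without the fourth feature) could look like this:

Input_table:
LOAD
    Interest_Rate as x01,
    Unemployment_Rate as x02,
    ((Year*12)+Month) - ((2018*12)+1) as x03,
    Stock_Index_Price as y
FROM [lib://AI_DATA/STOCKMARKET/data.xlsx]
(ooxml, embedded labels, table is Data);

You would then also drop the x04/θ04 terms from the temp load, the weight updates and the Log sub.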
Analyze your log
Experiment manually with different values of α and different numbers of iterations, and study the log table to find the best values for this dataset.
Build a nice user interface for the log data; with Qlik Sense you can do this in 5 minutes. For instance, a line chart with logRow as dimension and the five θ columns as measures shows at a glance whether the weights flatten out (converge) or keep growing (diverge).
That will help you to find optimal parameters.
Change the learning rate (α)
To make the code converge faster, try changing α during the run: start the first iterations with a large value of α and make it smaller and smaller for every iteration. For instance like this:
Let α = (1/( pow((1/0.01) + i,1))) + 0.0001;
This makes α start at roughly 0.01 and approach 0.0001 as the number of iterations grows.
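Plugging a few iteration numbers into the formula (which simplifies to α = 1/(100 + i) + 0.0001) shows the schedule:

i = 1    ->  α ≈ 0.0100
i = 100  ->  α ≈ 0.0051
i = 800  ->  α ≈ 0.0012

so α only gets close to 0.0001 after several thousand iterations.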
You can also check in the log how fast each weight is adjusting itself and use a different α value per weight, like this:
Let θ00 = θ00 - ( α1 * 1/m * sum_error_x00);
Let θ01 = θ01 - ( α2 * 1/m * sum_error_x01);
Let θ02 = θ02 - ( α2 * 1/m * sum_error_x02);
Let θ03 = θ03 - ( α3 * 1/m * sum_error_x03);
In my code, the weights became stable after about 800 iterations, which took about 3 seconds to run on my laptop.
But is the hypothesis useful?
By now we know that the code runs and that the weights converge nicely. But can the hypothesis actually predict anything with reasonable accuracy? How do we know that the weights are the correct weights?
Since this is example data and we don't have data for the months after December 2017, we have to use the data we have to verify whether our hypothesis is good or not.
What we do is remove some rows from the training data and use those rows to test our hypothesis against. This is normal procedure in machine learning development. The recommendation I have read is to take out 10% of the data for verification, but I guess it depends on how much data you have to start with.
Training and Verification
We do not want to take out the first 10% of the rows, nor the last 10%, because the data might be sorted and we need to train our model on samples of all kinds. Therefore I suggest taking out random rows from the data. I do it like this:
Create a new field in the input table:
if($(GenerateRandomNumber(0,1,0.1))>0.8,'V','T') as [Train/Verify]
I talked about the variable $(GenerateRandomNumber()) in article 4.
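I won't repeat the exact definition from article 4 here, but a minimal sketch of such a parameterized variable, assuming it takes min, max and rounding step as $1, $2 and $3, could look like this:

// Hypothetical re-creation of the helper variable from article 4 (min, max, rounding step)
SET GenerateRandomNumber = Round(Rand() * ($2 - $1) + $1, $3);

Expanded as $(GenerateRandomNumber(0,1,0.1)) it then returns a random value between 0 and 1, rounded to the nearest 0.1.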
The column [Train/Verify] will hold a V or a T depending on the random number generated. In this case, approximately 80% of the rows will have T and the rest V.
After this, I simply filter on [Train/Verify]='T' for the gradient descent, and when training is complete and we have stable values for all thetas, I check the hypothesis on each of the verification rows like this:
Result_Table:
LOAD
    Row,
    x01 as [Interest Rate],
    x02 as [Unemployment Rate],
    x03 as [Months ago],
    y as [Stock Index Price],
    $(θ00) + (x01 * $(θ01)) + (x02 * $(θ02)) + (x03 * $(θ03)) as [Predicted Stock Index Price]
Resident Input_table
Where [Train/Verify] = 'V';
The result table will hold the actual stock index price and the predicted stock index price.
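If you want a single number instead of eyeballing the rows, a small follow-up load can summarise the gap between actual and predicted price. This is my own addition, not part of the original app, and it assumes the Result_Table above:

// Hypothetical summary of prediction error on the verification rows
Error_Summary:
LOAD
    Count(Row) as [Verification rows],
    Round(Avg(Fabs([Stock Index Price] - [Predicted Stock Index Price])), 0.01) as [Mean absolute error]
Resident Result_Table;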
Visualising this in Qlik Sense:
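Finally, to answer the original question of predicting the current and coming months, you can feed hypothetical future rows through the same hypothesis. A sketch, where the future interest rate and unemployment rate (2.75 and 5.9 here) are pure assumptions that you would have to supply yourself:

// Hypothetical forecast rows for January-March 2018 (x03 = 0, 1, 2 months from "today")
Forecast:
LOAD
    RecNo() - 1 as [Months ahead],
    2.75 as [Assumed Interest Rate],       // assumption, not from the data file
    5.9 as [Assumed Unemployment Rate],    // assumption, not from the data file
    $(θ00) + (2.75 * $(θ01)) + (5.9 * $(θ02)) + ((RecNo() - 1) * $(θ03)) as [Predicted Stock Index Price]
AutoGenerate(3);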
Conclusion
Doing supervised multivariate linear regression in Qlik Sense is possible, easy to do and fast to execute, at least as long as we do not have too many variables. Running the code on thousands of rows of data will still be very quick, because Qlik is so fast at loading data into memory.
Having hundreds of columns of input data, with hundreds of weights, would not be reasonable to code the way I have done it here. The code quickly becomes unreadable as the number of parameters increases, because the algorithm expands in two dimensions for every parameter added.
But I can see possibilities for a different coding approach, using a combination of Qlik subs and Qlik variables with $-parameters, to also solve problems with hundreds of columns in the hypothesis.
However, I think it is rare to find problems with hundreds of columns where the final hypothesis will predict well using only linear regression. In reality there is seldom a clear linear relation between all the columns in a dataset, at least not for all the columns needed to predict a useful output.
Thank you for reading. Please share if you like it, and comment on the post if you have questions or feedback about the content.
Comment below, or connect with me here on LinkedIn if you want to get a copy of the Qlik Sense App.