AI and Machine Learning for Qlik Developers - Case Study 3 - part 1
Robert Svebeck
This is an example of using a logistic regression algorithm, built only with Qlik Script, to predict whether a person who boarded the Titanic survived or not.
For more in-depth information on the Qlik Script concept, review the earlier articles in this series from the start before you begin, or jump right into the article about logistic regression here.
"Machine Learning from Disaster - Predict survival on the Titanic" is a classic Supervised Machine Learning case study. If you google it, you will find hundreds of articles on this case study. Solved in almost equal number of ways. But as I understand, never before using only Qlik Script. Normally this is solved using R or Python.
The data source I use here comes directly from Kaggle, and if you want to start working with machine learning, getting a Kaggle account is a good idea.
The dataset consists of two csv files (download here):
- Learning data set
- Test data set
The two files have identical columns except that the test data set doesn't have the column "Survived". The goal is to develop your prediction model on the learning data set, then run your prediction on the test data set, send the result to Kaggle, and they tell you how well your prediction worked.
The result is given back to you as a percentage: how many of the passengers in your test data got the correct prediction. If you predicted that a passenger survived but they died, or predicted that they died but they survived, it is classified as an error. Kaggle simply counts how many errors you have against how many correct predictions.
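As a purely hypothetical example: the test file contains 418 passengers, so if 330 of your predictions are correct, your score is 330/418, roughly 79%.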
Kaggle has a leaderboard for this data set showing how well you rank against other machine learning beginners, but since this data set has been around for a long time, there are hundreds of people with 100% correct predictions. Many of them are probably not honest predictions, but some might be. To get at or close to 100% correct predictions you need to have done your feature engineering perfectly and also tested different kinds of algorithms. I am pretty sure that logistic regression is not the perfect algorithm for this data set, but it works well enough for my purpose here: to explain how to run logistic regression with Qlik Script only, and basically how logistic regression actually works.
My solution
I solved this case study using 3 Qlik Sense apps. Not because it was needed (everything could have been done in one app), but because it is easier to explain and present the code, handle the data and describe the methodology this way.
1. Preparation App (part 1 - this article)
- Importing the training data file (which is a csv file).
- Analysing the imported data.
- Feature Engineering.
- Export the manipulated data to a QVD.
2. Learning App (part 2 - read it here)
- Import about 90% of the rows from the QVD that was created from the Preparation App
- Run the logistic regression algorithm
- Import the remaining ~ 10% of rows from the QVD that was created from the Preparation App and verify the prediction model on that data. Here we get a preliminary result on how well our model works.
- Store the result of the prediction model in a new QVD
3. Testing App "Kaggle app" (part 3 - coming article)
- Import the testing data file (which is a csv file).
- Apply same Feature Engineering as in the preparation app
- Import Prediction Model QVD from the Learning App
- Apply the prediction to the test data (here we don't know how good the prediction is).
- Export result to CSV
- Post the file to Kaggle (if you have created your account there)!
- Expect great results!
Preparation app
The preparation app really has no machine-learning-specific coding. The script is more or less like any normal BI application in Qlik Sense, except that the "expected user" of the data is a machine and not a human, so the demands on completely clean data are higher.
Importing the training data
LOAD PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked FROM [lib://DataFiles/train.csv] (txt, codepage is 28592, embedded labels, delimiter is ',', msq);
or just
LOAD * FROM [lib://DataFiles/train.csv] (txt, codepage is 28592, embedded labels, delimiter is ',', msq);
Because at this stage we want to load and look at all data in the file.
The data file contains the following columns:
PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
In BI projects, you need to understand the data before you can build good dashboards and reports on it.
This is equally important in Machine Learning projects.
Here is the description provided by Kaggle:
- PassengerId: unique ID
- Survived: 0 = No, 1 = Yes
- pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- Name
- Sex: male / female
- Age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
We could skip looking at the data further and send it all into our algorithm without any manipulation, but trust me (I tried): that will give you a super low prediction rate, even lower than just randomly guessing... :-)
What we want to do in this preparation app is to make sure that the data we export to the logistic regression algorithm is as good as possible, so that the algorithm can predict well.
Feature Engineering
We need to look at a lot of factors for each field and decide whether the field is needed in the model or not.
Most importantly, we need to understand whether the field is actually related to the survival of a passenger. For instance, it's quite easy to guess that PassengerId is not related at all to whether a passenger survived. Right? But what about the other fields? Ticket? Cabin? Name?
One way to approach feature engineering is to start with the fields you think are most obviously useful for a prediction, run the algorithm to get your prediction rate (how correct your prediction is), and then keep adding more columns until your prediction rate stops improving. You may find that adding features sometimes actually reduces the prediction rate. This is not always because the information is irrelevant; sometimes it just needs more engineering before you put it into your model.
We don't need to guess whether a feature has a positive or negative effect on survival, only whether it has more or less effect, regardless of direction. The weights in the model will handle how important a feature is for survival.
So let's make a wild guess... Sex (being male or female) is probably important, and Age too. We can guess that passengers travelling in low class (pclass=3) were deeper down in the ship and were therefore perhaps less likely to survive. Ticket number is probably irrelevant for survival. Let's load the data into our Sense app and have a look at it.
NULLS, blanks and 0's!
When loading the data you need to see what quality it has. To get a smooth and quick process in these machine learning projects I have developed a lot of subs that I find useful to re-use: my own Qlik ML toolkit.
This is an example of a sub that looks up all sorts of relevant facts about a given field in a table, such as "# missing values", "uniqueness", "average value" and "max value", and also shows the correlation coefficient towards another given field. In this case I run the sub for the field Age against the field Survived in a table named temp.
call CheckTableField('temp','Age','Survived');
The code behind this sub is too long to show in this interface, but it creates a table row like this:
We have 891 values, 89 distinct values and 177 blank values. The correlation to Survived is low (0.08), probably because there are so many blanks.
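If you want to build something similar yourself, here is a minimal sketch of what such a sub could look like (a simplified illustration, not the full toolkit version, and the metric names are just examples):

SUB CheckTableField(pTable, pField, pTarget)
  FieldFacts:
  // Label the row with the profiled field name (preceding load)
  LOAD '$(pField)' as Field, *;
  LOAD
    Count([$(pField)])                         as [# values],
    Count(DISTINCT [$(pField)])                as [# distinct values],
    NullCount([$(pField)])
      + Sum(If(Len(Trim([$(pField)]))=0,1,0))  as [# missing values],
    Avg([$(pField)])                           as [Average],
    Max([$(pField)])                           as [Max],
    Correl([$(pField)], [$(pTarget)])          as [Correlation to $(pTarget)]
  RESIDENT [$(pTable)];
END SUB

// Example: profile Age against Survived in the table temp
CALL CheckTableField('temp','Age','Survived');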
Running this sub on all columns in the data set - we get this:
At this stage basically only Pclass and Fare look relevant enough to send to the algorithm. Too many fields are not numeric (Sex, Embarked, Cabin and Name), so they will probably become more relevant as we transform them into numeric values.
Overall, the data is a wreck (pun intended)! -> We need a lot of feature engineering!
Since we will train using only logistic regression, we need all data to be numeric, with no blanks and preferably with similar ranges between min and max values, high correlation, low standard deviation and standard error, etc.
Here is a summary of the things I do with the training data before I send it to the learning algorithm, but I encourage you to do your own analysis! I'm sure you can do better than me and find things I never thought of!
- PassengerId: Will be used as unique identifier but not as a feature.
- Survived: Can be used just like it is (will be our y-value in the prediction model)
- pclass: I use it almost as it is. All passengers have class 1, 2 or 3. I like to start from zero, so I just reduce the class by 1 to get class 0, 1, 2 instead.
- Name: I categorize passenger names into 7 buckets (0-6) depending on whether they contain titles like "Mr", "Mrs", "Miss", "Jr", etc. I selected titles that seemed relevant in terms of how many passengers had them in their name. Passengers without a matching title end up in bucket 0. This creates a feature called Title_group (see the sketch after this list).
- Sex: Transform into numeric, female=0 and male=1.
- Age: Many passengers are missing age, but I can use an average age for those. I calculate the average age by Title (extracted from the name above), because if you are a "Mr" you are probably within the age range of other Mr's, etc. Then all ages are classified into age groups, which creates a feature called Age_group.
- sibsp: Used together with parch to create a feature called FamilySize and later a feature called NotAlone (=1 or 0). Passengers who were alone on the boat perhaps had a different chance of surviving than passengers with family. The FamilySize feature is also used later to calculate the correct fare per person.
- parch: See sibsp.
- ticket: Not used as a feature, but used to look up Fare and Cabin for passengers who have ticket info but are missing fare and cabin.
- fare: I discovered that fare is the fare per ticket, not per person, so I divide the fare by FamilySize. After that I group the field to create the feature Fare_per_person_group.
- cabin: The first character of Cabin gives me the deck on the boat. For passengers missing the cabin, I use the ticket or the passenger class to find or estimate a deck. In the end I transform the deck into a numeric value where deck A=1, B=2, etc.
- embarked: Transform into the numbers 0, 1, 2 depending on port. Blanks are filled with Cherbourg, because the passengers missing a value in embarked were in the same pclass as the majority of the passengers from Cherbourg.
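To make the Name and Age steps more concrete, here is a simplified sketch of how Title_group and Age_group could be built in Qlik Script (the title list, bucket numbers and field names here are illustrative, not my exact final choices):

Titles:
LOAD *,
     // Bucket 1-6 for the first matching title, 0 for passengers without a matching title
     WildMatch(Name, '*Mrs.*', '*Miss.*', '*Mr.*', '*Master.*', '*Dr.*', '*Rev.*') as Title_group
FROM [lib://DataFiles/train.csv]
(txt, codepage is 28592, embedded labels, delimiter is ',', msq);

// Average age per title group, used to fill in missing ages
AgeMap:
MAPPING LOAD Title_group, Round(Avg(Age)) as AvgAge
RESIDENT Titles
WHERE Len(Trim(Age)) > 0
GROUP BY Title_group;

Engineered:
// Bucket the repaired age into groups of 10 years
LOAD *, Floor(Age_fixed/10) as Age_group;
// Replace missing ages with the average age for the same title
LOAD *, If(Len(Trim(Age)) = 0, ApplyMap('AgeMap', Title_group), Age) as Age_fixed
RESIDENT Titles;

DROP TABLE Titles;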
How to find out what to transform?
Use the great charts included in Sense! I use the Histogram, Sankey and Combo Chart mostly, and of course the table.
Create master items measures for #survived, #died and #passengers.
count(DISTINCT {<Survived={0}>} PassengerId) etc...
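For completeness, the three measures could look something like this (the expression above is the #died example):

#died:       Count(DISTINCT {<Survived={0}>} PassengerId)
#survived:   Count(DISTINCT {<Survived={1}>} PassengerId)
#passengers: Count(DISTINCT PassengerId)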
Feature Engineering example
I will not post all feature engineering code here, instead I give just an example.
Sex and family size features:
TrainingData:
// Preceding load: uses FamilySize from the load below to derive NotAlone
LOAD
    if(FamilySize>1,1,0) as NotAlone,
    ...
// Base load from the csv file: derives FamilySize and a numeric Sex field
LOAD
    if(Parch+SibSp=0,1,Parch+SibSp) as FamilySize,
    if(Sex='female',1,0) as Sex_fixed,
    ...
FROM [lib://DataFiles/train.csv]
(txt, codepage is 28592, embedded labels, delimiter is ',', msq);
Drop me a line in this post or message me on LinkedIn if you want a copy of the full QVF, or a copy of the complete script (the script will work in QlikView too).
After this engineering, my new features look like this:
The result is much better! But is it good enough? Prediction rate will let me know...
Since I want to use most passengers for learning but leave some for verifying, I also create a field called "Learn/Verify" like this:
if($(GenerateRandomNumber(0,1,0.1))>$(RatioLearningVerification),'V','L') as [Learn/Verify]
using these two variables, created earlier in the script:
Set RatioLearningVerification = 0.90;
Set GenerateRandomNumber = 'Round(Rand()*($2-$1)+$1,$3)';
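To see how this works: for a given row, $(GenerateRandomNumber(0,1,0.1)) expands to Round(Rand()*(1-0)+0, 0.1), i.e. a random number between 0 and 1 rounded to the nearest 0.1. The row expression therefore becomes:

if(Round(Rand()*(1-0)+0, 0.1) > 0.90, 'V', 'L') as [Learn/Verify]

so only a small share of the rows (on the order of 10%, depending on the rounding) end up as 'V' for verification, and the rest as 'L' for learning.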
Last step in the Preparation app is to export the data into a QVD.
Here I rename my features so that I don't have to change anything in the learning app (which can then be re-used for learning something else, regardless of field names and content):
EXPORT:
LOAD
    PassengerId as Row,
    0 as X9,
    NotAlone as X8,
    Embarked_fixed_mapped as X7,
    Age_group as X6,
    Deck_fixed_mapped as X5,
    Fare_per_person_grouped as X4,
    Pclass as X3,
    Title_group as X2,
    Sex_fixed as X1,
    Survived as Y,
    [Learn/Verify]
RESIDENT temp;

STORE EXPORT INTO [lib://DataFiles/learn.qvd] (qvd);
DROP TABLE EXPORT;
My algorithm can currently only take 9 variables, but for this dataset I just need 8, so I set the last variable (X9) to zero.
That is all there is to the preparation app. Next task is to train the algorithm on this data.
Learning app
This article is already too long, so I decided I will write about the learning app in part 2. Read part 2 here.
For those of you who want to get into Machine Learning but still have not started your journey, I encourage you to download the data sets here and start building your preparation app right away in Sense!
Soon you will be able to send your Export QVD into the algorithm to test your data!
In the end, you might also be sending the export data to Python or R algorithms, but no matter what, you gain a lot of skills by understanding feature engineering and what is important to think about, and why Qlik Sense and QlikView are the perfect tools for this part of machine learning. In my opinion, there is really no need to do the feature engineering in another language. You will see that this is where you spend most of your time anyway.
Thank you for reading. Please share if you like it, and comment on the post if you have questions or feedback about the content.
Comment below, or connect with me here on LinkedIn if you want to get a copy of the Qlik Sense App.