AI and Machine Learning for Qlik Developers - Case Study 3 - part 1

This is an example of using a logistic regression algorithm, built only with Qlik Script, to predict whether a person who boarded the Titanic survived.

For more in-depth information on the concepts, review the series from the start before you begin, or jump right into the article about logistic regression here.

"Machine Learning from Disaster - Predict survival on the Titanic" is a classic Supervised Machine Learning case study. If you google it, you will find hundreds of articles on this case study. Solved in almost equal number of ways. But as I understand, never before using only Qlik Script. Normally this is solved using R or Python.

The data source I use here comes directly from Kaggle, and if you want to start working with Machine Learning, getting a Kaggle account is a good idea.

The dataset consists of two CSV files (download here):

  • Learning data set
  • Test data set

The two files have identical columns, except that the test data set doesn't have the column "Survived". The goal is to develop your prediction model on the learning data set, then run your prediction on the test data set, submit it to Kaggle, and they tell you how well your prediction worked.

The result is given back to you as a percentage: how many of the passengers in your test data received the correct prediction. If you predicted that a passenger survived when they did not, or that they died when they actually survived, it is classified as an error. Kaggle simply counts your errors against your correct predictions.
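That accuracy measure is easy to reproduce locally once you have a prediction. A minimal sketch in Qlik Script, assuming a hypothetical table Results holding the actual outcome in a field Y and the model output in a field Predicted:

// Hypothetical fields: Y = actual outcome (0/1), Predicted = model output (0/1).
Accuracy:
LOAD
    Sum(If(Predicted = Y, 1, 0)) / Count(Y) as PredictionRate
Resident Results;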

Kaggle has a leaderboard for this data set showing how you rank against other Machine Learning beginners, but since the data set has been around a long time, there are hundreds of people with 100% correct predictions. Many of those are probably not honest predictions, but some might be. To get at or close to 100% you need to have done your feature engineering perfectly and also tested different kinds of algorithms. I am pretty sure logistic regression is not the perfect algorithm for this data set, but it works well enough for my purpose here: to explain how to run logistic regression with Qlik Script only, and basically to explain how logistic regression actually works.


My solution

I solved this case study using 3 Qlik Sense apps. Not because it was needed (everything could have been done in one app), but because it makes it easier to explain and present the code, the data handling and the methodology.

1. Preparation App (part 1 - this article)

  1. Import the training data file (a CSV file).
  2. Analyse the imported data.
  3. Do the feature engineering.
  4. Export the manipulated data to a QVD.

2. Learning App (part 2 - read it here)

  1. Import about 90% of the rows from the QVD created by the Preparation App.
  2. Run the logistic regression algorithm.
  3. Import the remaining ~10% of rows from the QVD and verify the prediction model on that data. Here we get a preliminary result of how well our model works.
  4. Store the result of the prediction model in a new QVD.

3. Testing App "Kaggle app" (part 3 - coming article)

  1. Import the test data file (a CSV file).
  2. Apply the same feature engineering as in the Preparation App.
  3. Import the prediction model QVD from the Learning App.
  4. Apply the prediction to the test data. Here we don't know in advance how good the prediction is.
  5. Export the result to CSV.
  6. Post the file to Kaggle (if you have created your account there)!
  7. Expect great results!

Preparation app

The preparation app contains no Machine Learning specific code. The script is more or less like that of any normal BI application in Qlik Sense, except that the "expected user" of the data is a machine and not a human, so the demands on completely clean data are higher.

Importing the training data

LOAD
    PassengerId,
    Survived,
    Pclass,
    Name,
    Sex,
    Age,
    SibSp,
    Parch,
    Ticket,
    Fare,
    Cabin,
    Embarked
FROM [lib://DataFiles/train.csv]
(txt, codepage is 28592, embedded labels, delimiter is ',', msq);


or just

LOAD * FROM [lib://DataFiles/train.csv]
(txt, codepage is 28592, embedded labels, delimiter is ',', msq);

because at this stage we want to load and look at all the data in the file.

The data file contains the following columns:

PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

In BI projects, you need to understand the data before you can build good dashboards and reports on it.

This is equally important in Machine Learning projects.


Here is the description provided by Kaggle:

  1. PassengerId: unique ID
  2. Survived: 0 = No, 1 = Yes
  3. Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
  4. Name: Passenger name
  5. Sex: male / female
  6. Age: Age in years
  7. SibSp: # of siblings / spouses aboard the Titanic
  8. Parch: # of parents / children aboard the Titanic
  9. Ticket: Ticket number
  10. Fare: Passenger fare
  11. Cabin: Cabin number
  12. Embarked: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

We could skip looking at the data further and send it all into our algorithm without any manipulation, but trust me (I tried): that gives you a very low prediction rate, even lower than random guessing... :-)

What we want to do in this preparation app is to make sure that the data we export to the logistic regression algorithm is as good as possible, so that the algorithm can predict well.


Feature Engineering

We need to look at a number of factors for each field, and decide whether the field is needed in the model or not.

Most importantly, we need to understand whether the field is actually related to the survival of a passenger. For instance, it's quite easy to guess that PassengerId is not related at all to whether a passenger survived. Right? But what about the other fields? Ticket? Cabin? Name?

One way to approach feature engineering is to start with the fields you think are most obviously useful for a prediction, run the algorithm to get your prediction rate (how correct your prediction is), and then keep adding more columns until your prediction rate stops improving. You may find that adding a feature sometimes actually reduces the prediction rate. This is not always because the information is irrelevant; sometimes it just needs more engineering before you put it into your model.

We don't need to guess whether a feature has a positive or negative effect on survival, only whether it has much or little effect, regardless of direction. The weights in the model will handle how important each feature is for survival.

So let's make a wild guess... Sex (being male or female) is probably important, and Age too. We can guess that passengers travelling in low class (Pclass = 3) were deeper down in the ship and therefore perhaps less likely to survive. Ticket number is probably irrelevant for survival. Let's load the data into our Sense app and have a look at it.

NULLS, blanks and 0's!

When loading the data, you need to assess its quality. To keep the process smooth and quick in these Machine Learning projects, I have developed a number of subs that I re-use a lot. My own Qlik-ML-toolkit.

This is an example of a sub that looks up all sorts of relevant facts about a given field in a table, such as number of missing values, uniqueness, average value and max value, and that also shows the correlation coefficient against another given field. In this case I run the sub for the field Age against the field Survived in a table named temp.

call CheckTableField('temp','Age','Survived');

The code behind this sub is too long to show in this interface, but it creates a table row like this:

[Image: CheckTableField output row for the field Age]

We have 891 values, 89 distinct values, 177 blank values. Correlation to Survival is low (0.08, probably because there are so many blanks).
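The real sub does more than I can show here, but a stripped-down sketch of the idea could look like this (the output labels and the blank-counting logic are simplified):

Sub CheckTableField(vTable, vField, vTarget)
  // Basic quality statistics for one field, plus its correlation
  // to a target field. Because every call produces identical field
  // names, repeated calls auto-concatenate into one FieldStats table.
  FieldStats:
  LOAD
    '$(vField)'                               as Field,
    Count([$(vField)])                        as [# values],
    Count(DISTINCT [$(vField)])               as [# distinct],
    Sum(If(Len(Trim([$(vField)]))=0, 1, 0))   as [# blank],
    Avg([$(vField)])                          as [Avg value],
    Max([$(vField)])                          as [Max value],
    Correl([$(vField)], [$(vTarget)])         as [Correlation]
  Resident $(vTable);
End Sub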

Running this sub on all columns in the data set, we get this:

[Image: CheckTableField output for all fields in the training data]

At this stage, basically only Pclass and Fare are ready to send to the algorithm. Too many fields are not numeric (Sex, Embarked, Cabin and Name), so they will probably become more relevant once we transform them into numeric values.

Overall, the data is a wreck (pun intended)! -> We need a lot of feature engineering!

Since we will train using only logistic regression, all data needs to be numeric, with no blanks, and preferably with similar min and max ranges, high correlation to the target, low standard deviation and standard error, and so on.
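As an illustration of the "similar ranges" point: a common trick is min-max scaling, which rescales a numeric feature into the range 0 to 1. A minimal sketch, using Age and the table temp from above:

// Find the range of the feature.
AgeRange:
LOAD
    Min(Age) as MinAge,
    Max(Age) as MaxAge
Resident temp;

LET vMinAge = Peek('MinAge', 0, 'AgeRange');
LET vMaxAge = Peek('MaxAge', 0, 'AgeRange');
DROP TABLE AgeRange;

// Later, in the feature load:
//   (Age - $(vMinAge)) / ($(vMaxAge) - $(vMinAge)) as Age_scaled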

Here is a summary of what I do with the training data before sending it to the learning algorithm, but I encourage you to do your own analysis! I'm sure you can do better than me and find things I never thought of!

  1. PassengerId: Used as a unique identifier, but not as a feature.
  2. Survived: Used just as it is (this will be our y-value in the prediction model).
  3. Pclass: Used almost as it is. All passengers have class 1, 2 or 3. I like to start at zero, so I subtract 1 to get classes 0, 1, 2 instead.
  4. Name: I categorize passenger names into 7 buckets (0-6) depending on whether they contain titles like "Mr", "Mrs", "Miss", "Jr", etc. I selected titles that seemed relevant in terms of how many passengers had them in their name. Passengers without a matching title end up in bucket 0. This creates a feature called Title_group (see the sketch after this list).
  5. Sex: Transformed into numeric, female = 1 and male = 0 (as in the code example further down).
  6. Age: Many passengers are missing age, so I use an average age for those. I calculate the average age per title (extracted from the name above), reasoning that if you are a "Mr", you are probably close in age to the other "Mr"s, and so on. Then all ages are classified into age groups, creating a feature called Age_group (also sketched after this list).
  7. SibSp: Used together with Parch to create a feature called FamilySize, and later a feature called NotAlone (1 or 0). Passengers who were alone on the boat perhaps had a different chance of survival than passengers with family. FamilySize is also used later to calculate the correct fare per person.
  8. Parch: See SibSp.
  9. Ticket: Not used as a feature, but used to look up Fare and Cabin for passengers who have ticket info but are missing fare and cabin.
  10. Fare: I discovered that fare is fare per ticket, not fare per person, so I divide fare by FamilySize. After that I group the fare field to create the feature Fare_per_person_grouped.
  11. Cabin: The first character of Cabin gives me the deck on the boat. For passengers missing a cabin, I use the ticket or passenger class to find or estimate a deck. Finally, I transform it to a numeric value where deck A = 1, B = 2, etc.
  12. Embarked: Transformed to 0, 1 or 2 depending on port. Blanks are filled with Cherbourg, because the passengers missing Embarked were in the same Pclass as the majority of the passengers from Cherbourg.
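To give a flavour of points 4 and 6, here is a stripped-down sketch of the title bucketing and the age imputation (the bucket numbers and the list of titles are illustrative, not my exact final choices):

// Map selected titles to buckets; anything else falls into bucket 0.
TitleMap:
MAPPING LOAD * INLINE [
Title, Bucket
Mr, 1
Mrs, 2
Miss, 3
Master, 4
];

// Names look like "Braund, Mr. Owen Harris", so the title sits
// between ", " and the first ".".
temp2:
LOAD
    *,
    ApplyMap('TitleMap', TextBetween(Name, ', ', '.'), 0) as Title_group
Resident temp;
DROP TABLE temp;

// Average age per title group, used to fill in missing ages.
AgeMap:
MAPPING LOAD
    Title_group,
    Round(Avg(Age), 1) as AvgAge
Resident temp2
Where Len(Trim(Age)) > 0
Group By Title_group;

// Replace blank ages with the average for the passenger's title group.
temp3:
LOAD
    *,
    If(Len(Trim(Age)) = 0, ApplyMap('AgeMap', Title_group), Age) as Age_fixed
Resident temp2;
DROP TABLE temp2;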

How do you find out what to transform?

Use the great charts included in Sense! I use the Histogram, Sankey and Combo Chart mostly, and of course the table.



Create master item measures for #survived, #died and #passengers:

count(DISTINCT {<Survived={0}>} PassengerId) etc...
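The set expression with Survived={0} counts the passengers who died. For completeness, the three measures could look like this:

#died:       count(DISTINCT {<Survived={0}>} PassengerId)
#survived:   count(DISTINCT {<Survived={1}>} PassengerId)
#passengers: count(DISTINCT PassengerId)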


Feature Engineering example

I will not post all the feature engineering code here; instead I give just one example.

Sex and family size features:

TrainingData:
LOAD
    // Preceding load: this step reads from the output of the LOAD below,
    // so FamilySize already exists when NotAlone is derived.
    if(FamilySize>1,1,0) as NotAlone,
...
LOAD
    // A passenger with no relatives aboard counts as a family of one.
    if(Parch+SibSp=0,1,Parch+SibSp) as FamilySize,
    if(Sex='female',1,0) as Sex_fixed,
...
FROM [lib://DataFiles/train.csv]
(txt, codepage is 28592, embedded labels, delimiter is ',', msq);

Drop me a line in this post or message me on LinkedIn if you want a copy of the full QVF or the complete script (the script works in QlikView too).

After this engineering, my new features look like this:

[Image: table of the engineered features]

The result is much better! But is it good enough? Prediction rate will let me know...

Since I want to use most passengers for learning but leave some for verifying, I also create a field called "Learn/Verify" like this:

if($(GenerateRandomNumber(0,1,0.1))>$(RatioLearningVerification),'V','L') as [Learn/Verify]

using two variables created earlier:

Set RatioLearningVerification=0.90;

Set GenerateRandomNumber  = 'Round(Rand()*($2-$1)+$1,$3)';
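To see what happens under the hood: the dollar-sign expansion $(GenerateRandomNumber(0,1,0.1)) resolves to Round(Rand()*(1-0)+0, 0.1), a random number between 0 and 1 rounded to one decimal. Written out, the row-level expression becomes:

// Expanded form of the Learn/Verify expression:
if(Round(Rand()*(1-0)+0, 0.1) > 0.90, 'V', 'L') as [Learn/Verify]

One caveat worth noting: with rounding to one decimal, only values that round to exactly 1.0 exceed the 0.90 threshold, so somewhat fewer than 10% of the rows end up tagged 'V'. A finer rounding step (or skipping the rounding) gives a split closer to the ratio variable.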

The last step in the Preparation app is to export the data into a QVD.

Here I rename my features so that I don't have to change anything in the Learning app (which can then be re-used for learning something else, regardless of field names and content):

EXPORT:

LOAD
    PassengerId                as Row,

    0                          as X9,
    NotAlone                   as X8,
    Embarked_fixed_mapped      as X7,
    Age_group                  as X6,
    Deck_fixed_mapped          as X5,
    Fare_per_person_grouped    as X4,
    Pclass                     as X3,
    Title_group                as X2,
    Sex_fixed                  as X1,

    Survived                   as Y,

    [Learn/Verify]

resident temp;

STORE EXPORT INTO [lib://DataFiles/learn.qvd] (qvd);

drop table EXPORT;

My algorithm can currently only take 9 variables, but for this dataset I just need 8, so I set the last variable (X9) to zero.

That is all there is to the preparation app. Next task is to train the algorithm on this data.

Learning app

This article is already too long, so I decided I will write about the learning app in part 2. Read part 2 here.

For those of you who want to get into Machine Learning but still have not started your journey, I encourage you to download the data sets here and start building your preparation app right away in Sense!

Soon you will be able to send your Export QVD into the algorithm to test your data!

In the end, you might also send the exported data to Python or R algorithms, but no matter what, you gain a lot of skill by understanding feature engineering, what is important to think about and why. Qlik Sense and QlikView are just perfect tools for this part of Machine Learning; in my opinion, there is really no need to do the feature engineering in another language. You will see that this is where you spend most of your time anyway.

Thank you for reading. Please share if you like it, and comment on the post if you have questions or feedback about the content.


Comment below, or connect with me here on LinkedIn if you want to get a copy of the Qlik Sense App.

