AI and Machine Learning for Qlik Developers - Case Study 3 - part 1
Robert Svebeck
This is an example of using a logistic regression algorithm, built only with Qlik Script, to predict whether a person who boarded the Titanic survived or not.
For more in-depth information on the Qlik Script concept, review the earlier articles in this series from the start before you begin, or jump right into the article about logistic regression here.
"Machine Learning from Disaster - Predict survival on the Titanic" is a classic Supervised Machine Learning case study. If you google it, you will find hundreds of articles on this case study. Solved in almost equal number of ways. But as I understand, never before using only Qlik Script. Normally this is solved using R or Python.
The data source I use here comes directly from Kaggle, and if you want to start working with machine learning, getting a Kaggle account is a good idea.
The dataset consists of two csv files (download here):
- Learning data set
- Test data set
The two files have identical columns except that the test data set doesn't have the column "Survived". The goal is to develop your prediction model on the learning data set, then run your prediction on the test data set, send the result to Kaggle, and they tell you how well your prediction worked.
The result is given back to you as a percentage: how many of the passengers in your test data got the correct prediction. If you predicted that a passenger survived but they died, or predicted that they died but they survived, it is classified as an error. Kaggle simply counts how many errors you have against how many correct predictions.
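As a purely hypothetical example: the test file contains 418 passengers, so if 330 of your predictions are correct, your score is 330/418, roughly 79%.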
Kaggle has a leaderboard for this data set showing how well you rank against other machine learning beginners, but since this data set has been around for a long time, there are hundreds of people with 100% correct predictions. Many of them are probably not honest predictions, but some might be. To get at or close to 100% correct predictions you need to have done your feature engineering perfectly and also tested different kinds of algorithms. I am pretty sure that logistic regression is not the perfect algorithm for this data set, but it works well enough for my purpose here: to explain how to run logistic regression with Qlik Script only, and basically how logistic regression actually works.
My solution
I solved this case study using 3 Qlik Sense apps. Not because it was needed (everything could have been done in one app), but because it is easier to explain and present the code, handle the data and describe the methodology this way.
1. Preparation App (part 1 - this article)
- Importing the training data file (which is a csv file).
- Analysing the imported data.
- Feature Engineering.
- Export the manipulated data to a QVD.
2. Learning App (part 2 - read it here)
- Import about 90% of the rows from the QVD that was created from the Preparation App
- Run the logistic regression algorithm
- Import the remaining ~ 10% of rows from the QVD that was created from the Preparation App and verify the prediction model on that data. Here we get a preliminary result on how well our model works.
- Store the result of the prediction model in a new QVD
3. Testing App "Kaggle app" (part 3 - coming article)
- Import the testing data file (which is a csv file).
- Apply same Feature Engineering as in the preparation app
- Import Prediction Model QVD from the Learning App
- Apply the prediction to the test data (here we don't know how good the prediction is).
- Export result to CSV
- Post the file to Kaggle (if you have created your account there)!
- Expect great results!
Preparation app
The preparation app really has no machine-learning-specific coding. The script is more or less like any normal BI application in Qlik Sense, except that the "expected user" of the data is a machine and not a human, so the demands on completely clean data are higher.
Importing the training data
LOAD PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked FROM [lib://DataFiles/train.csv] (txt, codepage is 28592, embedded labels, delimiter is ',', msq);
or just
LOAD * FROM [lib://DataFiles/train.csv] (txt, codepage is 28592, embedded labels, delimiter is ',', msq);
Because at this stage we want to load and look at all data in the file.
The data file contains the following columns:
PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
In BI projects, you need to understand the data before you can build good dashboards and reports on it.
This is equally important in Machine Learning projects.
Here is the description provided by Kaggle:
- PassengerId: unique ID
- Survived: 0 = No, 1 = Yes
- pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- Name
- Sex: male / female
- Age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
We could skip looking at the data further and send it all into our algorithm without any manipulation, but trust me (I tried): that will give you a super low prediction rate, even lower than just randomly guessing... :-)
What we want to do in this preparation app is to make sure that the data we export to the logistic regression algorithm is as good as possible, so that the algorithm can predict well.
Feature Engineering
We need to look at a lot of factors for each field and decide whether the field is needed in the model or not.
Most importantly, we need to understand whether the field is actually related to the survival of a passenger. For instance, it's quite easy to guess that PassengerId is not related at all to whether a passenger survived. Right? But what about the other fields? Ticket? Cabin? Name?
One way to approach feature engineering is to start with the fields you think are most obviously useful for a prediction, run the algorithm to get your prediction rate (how correct your prediction is), and then keep adding more columns until your prediction rate stops improving. You may find that adding features sometimes actually reduces the prediction rate. This is not always because the information is irrelevant; sometimes it just needs more engineering before you put it into your model.
We don't need to guess whether a feature has a positive or negative effect on survival, only whether it has more or less effect, regardless of direction. The weights in the model will handle how important a feature is for survival.
So let's make a wild guess... Sex (being male or female) is probably important, and Age too. We can guess that passengers travelling in low class (pclass=3) were deeper down in the ship and were therefore perhaps less likely to survive. Ticket number is probably irrelevant for survival. Let's load the data into our Sense app and have a look at it.
NULLS, blanks and 0's!
When loading the data you need to see what quality it has. To get a smooth and quick process in these machine learning projects I have developed a lot of subs that I find useful to re-use: my own Qlik ML toolkit.
This is an example of a sub that looks up all sorts of relevant facts about a given field in a table, such as "# missing values", "uniqueness", "average value" and "max value", and also shows the correlation coefficient towards another given field. In this case I run the sub for the field Age against the field Survived in a table named temp.
call CheckTableField('temp','Age','Survived');
The code behind this sub is too long to show in this interface, but it creates a table row like this:
We have 891 values, 89 distinct values and 177 blank values. The correlation to Survived is low (0.08), probably because there are so many blanks.
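If you want to build something similar yourself, here is a minimal sketch of what such a sub could look like (a simplified illustration, not the full toolkit version, and the metric names are just examples):

SUB CheckTableField(pTable, pField, pTarget)
  FieldFacts:
  // Label the row with the profiled field name (preceding load)
  LOAD '$(pField)' as Field, *;
  LOAD
    Count([$(pField)])                         as [# values],
    Count(DISTINCT [$(pField)])                as [# distinct values],
    NullCount([$(pField)])
      + Sum(If(Len(Trim([$(pField)]))=0,1,0))  as [# missing values],
    Avg([$(pField)])                           as [Average],
    Max([$(pField)])                           as [Max],
    Correl([$(pField)], [$(pTarget)])          as [Correlation to $(pTarget)]
  RESIDENT [$(pTable)];
END SUB

// Example: profile Age against Survived in the table temp
CALL CheckTableField('temp','Age','Survived');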
Running this sub on all columns in the data set - we get this:
At this stage basically only Pclass and Fare look relevant enough to send to the algorithm. Too many fields are not numeric (Sex, Embarked, Cabin and Name), so they will probably become more relevant as we transform them into numeric values.
Overall, the data is a wreck (pun intended)! -> We need a lot of feature engineering!
Since we will train using only logistic regression, we need all data to be numeric, with no blanks and preferably with similar ranges between min and max values, high correlation, low standard deviation and standard error, etc.
Here is a summary of the things I do with the training data before I send it to the learning algorithm, but I encourage you to do your own analysis! I'm sure you can do better than me and find things I never thought of!
- PassengerId: Will be used as unique identifier but not as a feature.
- Survived: Can be used just like it is (will be our y-value in the prediction model)
- pclass: I use it almost as it is. All passengers have class 1, 2 or 3. I like to start from zero, so I just reduce the class by 1 to get class 0, 1, 2 instead.
- Name: I categorize passenger names into 7 buckets (0-6) depending on whether they contain titles like "Mr", "Mrs", "Miss", "Jr", etc. I selected titles that seemed relevant in terms of how many passengers had them in their name. Passengers without a matching title end up in bucket 0. This creates a feature called Title_group (see the sketch after this list).
- Sex: Transform into numeric, female=0 and male=1.
- Age: Many passengers are missing age, but I can use an average age for those. I calculate the average age by Title (extracted from the name above), because if you are a "Mr" you are probably within the age range of other Mr's, etc. Then all ages are classified into age groups, which creates a feature called Age_group.
- sibsp: Used together with parch to create a feature called FamilySize and later a feature called NotAlone (=1 or 0). Passengers who were alone on the boat perhaps had a different chance of surviving than passengers with family. The FamilySize feature is also used later to calculate the correct fare per person.
- parch: See sibsp.
- ticket: Not used as a feature, but used to look up Fare and Cabin for passengers who have ticket info but are missing fare and cabin.
- fare: I discovered that fare is the fare per ticket, not per person, so I divide the fare by FamilySize. After that I group the field to create the feature Fare_per_person_group.
- cabin: The first character of Cabin gives me the deck on the boat. For passengers missing the cabin, I use the ticket or the passenger class to find or estimate a deck. In the end I transform the deck into a numeric value where deck A=1, B=2, etc.
- embarked: Transform into the numbers 0, 1, 2 depending on port. Blanks are filled with Cherbourg, because the passengers missing a value in embarked were in the same pclass as the majority of the passengers from Cherbourg.
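To make the Name and Age steps more concrete, here is a simplified sketch of how Title_group and Age_group could be built in Qlik Script (the title list, bucket numbers and field names here are illustrative, not my exact final choices):

Titles:
LOAD *,
     // Bucket 1-6 for the first matching title, 0 for passengers without a matching title
     WildMatch(Name, '*Mrs.*', '*Miss.*', '*Mr.*', '*Master.*', '*Dr.*', '*Rev.*') as Title_group
FROM [lib://DataFiles/train.csv]
(txt, codepage is 28592, embedded labels, delimiter is ',', msq);

// Average age per title group, used to fill in missing ages
AgeMap:
MAPPING LOAD Title_group, Round(Avg(Age)) as AvgAge
RESIDENT Titles
WHERE Len(Trim(Age)) > 0
GROUP BY Title_group;

Engineered:
// Bucket the repaired age into groups of 10 years
LOAD *, Floor(Age_fixed/10) as Age_group;
// Replace missing ages with the average age for the same title
LOAD *, If(Len(Trim(Age)) = 0, ApplyMap('AgeMap', Title_group), Age) as Age_fixed
RESIDENT Titles;

DROP TABLE Titles;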
How to find out what to transform?
Use the great charts included in Sense! I use the Histogram, Sankey and Combo Chart mostly, and of course the table.
Create master items measures for #survived, #died and #passengers.
count(DISTINCT {<Survived={0}>} PassengerId) etc...
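For completeness, the three measures could look something like this (the expression above is the #died example):

#died:       Count(DISTINCT {<Survived={0}>} PassengerId)
#survived:   Count(DISTINCT {<Survived={1}>} PassengerId)
#passengers: Count(DISTINCT PassengerId)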
Feature Engineering example
I will not post all feature engineering code here, instead I give just an example.
Sex and family size features:
TrainingData:
// Preceding load: uses FamilySize from the load below to derive NotAlone
LOAD
    if(FamilySize>1,1,0) as NotAlone,
    ...
// Base load from the csv file: derives FamilySize and a numeric Sex field
LOAD
    if(Parch+SibSp=0,1,Parch+SibSp) as FamilySize,
    if(Sex='female',1,0) as Sex_fixed,
    ...
FROM [lib://DataFiles/train.csv]
(txt, codepage is 28592, embedded labels, delimiter is ',', msq);
Drop me a line in this post or message me on LinkedIn if you want a copy of the full QVF, or a copy of the complete script (the script will work in QlikView too).
After this engineering, my new features look like this:
The result is much better! But is it good enough? Prediction rate will let me know...
Since I want to use most passengers for learning but leave some for verifying, I also create a field called "Learn/Verify" like this:
if($(GenerateRandomNumber(0,1,0.1))>$(RatioLearningVerification),'V','L') as [Learn/Verify]
using these two variables, created earlier in the script:
Set RatioLearningVerification = 0.90;
Set GenerateRandomNumber = 'Round(Rand()*($2-$1)+$1,$3)';
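To see how this works: for a given row, $(GenerateRandomNumber(0,1,0.1)) expands to Round(Rand()*(1-0)+0, 0.1), i.e. a random number between 0 and 1 rounded to the nearest 0.1. The row expression therefore becomes:

if(Round(Rand()*(1-0)+0, 0.1) > 0.90, 'V', 'L') as [Learn/Verify]

so only a small share of the rows (on the order of 10%, depending on the rounding) end up as 'V' for verification, and the rest as 'L' for learning.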
Last step in the Preparation app is to export the data into a QVD.
Here I rename my features so that I don't have to change anything in the learning app (which can then be re-used for learning something else, regardless of field names and content):
EXPORT:
LOAD
    PassengerId as Row,
    0 as X9,
    NotAlone as X8,
    Embarked_fixed_mapped as X7,
    Age_group as X6,
    Deck_fixed_mapped as X5,
    Fare_per_person_grouped as X4,
    Pclass as X3,
    Title_group as X2,
    Sex_fixed as X1,
    Survived as Y,
    [Learn/Verify]
RESIDENT temp;

STORE EXPORT INTO [lib://DataFiles/learn.qvd] (qvd);
DROP TABLE EXPORT;
My algorithm can currently only take 9 variables, but for this dataset I just need 8, so I set the last variable (X9) to zero.
That is all there is to the preparation app. Next task is to train the algorithm on this data.
Learning app
This article is already too long, so I decided I will write about the learning app in part 2. Read part 2 here.
For those of you who want to get into Machine Learning but still have not started your journey, I encourage you to download the data sets here and start building your preparation app right away in Sense!
Soon you will be able to send your Export QVD into the algorithm to test your data!
In the end, you might also be sending the export data to Python or R algorithms, but no matter what, you gain a lot of skills by understanding feature engineering and what is important to think about, and why Qlik Sense and QlikView are the perfect tools for this part of machine learning. In my opinion, there is really no need to do the feature engineering in another language. You will see that this is where you spend most of your time anyway.
Thank you for reading. Please share if you like it, and comment on the post if you have questions or feedback about the content.
Comment below, or connect with me here on LinkedIn if you want to get a copy of the Qlik Sense App.