Choosing Your Companion for the Data and AI Journey: Jupyter Notebook vs. Dataiku DSS. Part 2.
Vladimir Parkov
Program & Project Leader | Strategic Planning, Data Analytics, Digital & AI Transformation | I Help Businesses Boost ROI with Data-Driven Strategies
In the first part (available here), we took a first look at our dataset and did some data cleaning by removing duplicates. We also identified several outliers in employees' tenure that could impair the quality of our predictions.
Now we can use the power of Python and Dataiku DSS to understand the relationships between variables in the data.
To recap, the target variable we would like to predict is called "left". It is a Boolean variable with only two possible values: 0 means the employee is still working with the company, and 1 means the employee has left the company.
Step 2. Making sense of relationships between variables
Let's start by creating a heatmap of correlations. This is easy to make with both platforms:
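On the Python side, here is a minimal sketch of how such a heatmap might be produced, assuming the cleaned dataframe from Part 1 is loaded from a file (the file name is hypothetical, and the column names follow the common version of this HR dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned dataset from Part 1 (file name is hypothetical)
df = pd.read_csv("hr_data_cleaned.csv")

# Correlate the numeric columns only; categorical columns such as
# "salary" and "department" are skipped automatically
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()
```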
There are only 8 variables in this heatmap because there are two columns with categorical values: "salary" (high, low, medium) and "department" (accounting, HR, sales etc. – 10 departments in total). We will deal with this issue later.
Data Visualizations with Python
You can generate pairwise scatterplots in Python with a single command: sns.pairplot(df). I'll show just one of these scatterplots: satisfaction level plotted against the latest evaluation score.
On its own, this picture isn't very helpful. But here the complexity of Python turns into its strength: if you put enough effort into making sense of its intricacies, you can create customized visualizations that surface valuable insights.
Let's see what would happen if we highlight data points for those employees that left the company:
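A sketch of one way to do this with seaborn, assuming the standard column names of this HR dataset (satisfaction_level, last_evaluation, left):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

# Colour each point by the "left" flag so leavers stand out
sns.scatterplot(data=df, x="last_evaluation", y="satisfaction_level",
                hue="left", alpha=0.4)
plt.title("Satisfaction vs. last evaluation, coloured by attrition")
plt.show()
```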
This is truly a "picture is worth a thousand words" moment!
We immediately see two distinct clusters of those who left:
We can isolate these groups of people and plot them against other variables, such as tenure, monthly working hours, or salary. Let's quickly do this:
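A sketch of how the clusters could be isolated in pandas. The cut-off values below are assumptions to be read off your own scatterplot, and the column names (including the dataset's characteristically misspelled average_montly_hours) follow the common version of this HR dataset:

```python
import pandas as pd

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

leavers = df[df["left"] == 1]

# Illustrative cut-offs; read the actual boundaries off the scatterplot
cluster_1 = leavers[(leavers["satisfaction_level"] < 0.5) &
                    (leavers["last_evaluation"] < 0.6)]   # low/low group
cluster_2 = leavers[leavers["last_evaluation"] >= 0.75]   # high-evaluation group

# Compare each cluster's profile with the company-wide averages
cols = ["time_spend_company", "average_montly_hours", "number_project"]
print(df[cols].mean())         # overall averages
print(cluster_1[cols].mean())  # cluster 1 profile
print(cluster_2[cols].mean())  # cluster 2 profile
```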
In the first cluster, 835 employees left, and they did so after three years with the company, working fewer than the average number of hours (144 vs. 200) and on fewer than the average number of projects (2 vs. 3.8).
These people found themselves in a job that wasn't the right fit. It's common to take a few years to find your groove in a company before feeling restless and looking for new opportunities. The data shows that, on average, these people spent three years trying to find their place in the company before eventually leaving.
In the second cluster, 497 employees left, and they did so after four years with the company, working more than the average number of hours (277 vs. 200) and on more than the average number of projects (6 vs. 3.8).
These poor lads work tirelessly and achieve high evaluation scores, yet remain stuck in the same position year after year. Going without a promotion despite four years of hard work is a frustrating experience that can lead to burnout and, ultimately, the decision to leave.
The 3-4 year mark is crucial in an employee's journey with a company. By monitoring the number of working hours and paying attention to warning signs, companies can identify potential issues and take steps to address them. One area worth investigating is the company's promotion policies, specifically at the four-year mark.
You have the first significant insights you can share with business leaders to make data-driven decisions!
Python is the Swiss Army Knife for data.
But you must familiarize yourself with its tools and functions: read the documentation, explore Stack Overflow, or ask ChatGPT when you can describe what you are trying to achieve but struggle with the syntax.
For example, we can also create a customized boxplot and histogram to visually compare satisfaction and salary levels:
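A sketch of how this comparison could be drawn with seaborn, assuming the same dataframe and column names as before; a count plot stands in for the histogram of employee counts per salary band:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Boxplot: satisfaction by salary band, split by attrition
sns.boxplot(data=df, x="salary", y="satisfaction_level", hue="left",
            order=["low", "medium", "high"], ax=ax1)
ax1.set_title("Satisfaction by salary band")

# Count plot: number of employees per salary band, split by attrition
sns.countplot(data=df, x="salary", hue="left",
              order=["low", "medium", "high"], ax=ax2)
ax2.set_title("Employee counts by salary band")

plt.show()
```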
For low and medium salaries, both the number of employees who left and their satisfaction levels look the same. The trend changes with high salaries: fewer highly paid people left, but those who did had significantly lower satisfaction levels. Money can't buy work satisfaction.
Data Visualizations with Dataiku DSS
In Dataiku DSS, the process of creating data visualizations is virtually effortless. Not a single line of code is needed to get the insights.
Just open the dataset, go to “Charts”, and use drag-and-drop to generate scatterplots for any variables.
Testing it with satisfaction level plotted against the latest evaluation score, we quickly get the same picture:
Then, by applying a filter, you can quickly show only the data points corresponding to employees who left the company:
Using filters again, we can quickly identify the cluster of 835 employees with low satisfaction coupled with low evaluation scores:
With Dataiku DSS, you can explore your data visually in seconds using pairwise visualizations and statistical tests.
Moreover, you can immediately turn any visualization you have made into a dashboard and start building your data storyline!
Step 3. Choosing AI models
We want to predict whether an employee will leave the company. The outcome can be either 1 (employee left) or 0 (employee stayed with the company).
This prediction task is called binary classification, which leaves us with two possible machine learning approaches to choose from.
Let's start with logistic regression. It is sensitive to outliers and assumes no multicollinearity among the independent variables (i.e., they shouldn't be correlated with each other).
Step 4. Feature Selection, Engineering and Extraction
That gives us three potential problems to solve before we can train the model: multicollinearity, outliers, and categorical variables.
Identifying Multicollinearity
We can use the Variance inflation factor (VIF) to measure how much an independent variable's behaviour (variance) is influenced, or inflated, by its interaction/correlation with the other independent variables.
In Python, we can do it with a little bit of coding:
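A minimal sketch using statsmodels, assuming the same dataframe as before; the constant term is added so the missing intercept doesn't distort the VIFs:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

# VIF is defined for numeric predictors only; drop the target column "left"
X = add_constant(df.select_dtypes("number").drop(columns=["left"]))

# One VIF per predictor: how much its variance is inflated by the others
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))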
A rule of thumb for interpreting the variance inflation factor: 1 = not correlated. Between 1 and 5 = moderately correlated. Greater than 5 = highly correlated.
We may have a problem here as most of the variables are correlated. For now, let's keep all the variables and see whether the quality of logistic regression prediction will suffer.
Removing outliers
In Python, we create a new data frame, removing the rows whose “time_spend_company” values lie more than 1.5 times the interquartile range beyond the quartiles. People working less than 1.5 years (none) or more than 5.5 years (824 rows) with the company will be removed from the dataset.
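A sketch of the IQR rule in pandas; with the first quartile at 3 years and the third at 4, the fences work out to exactly the 1.5 and 5.5 years mentioned above:

```python
import pandas as pd

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

# Standard IQR rule: keep rows within 1.5 * IQR of the quartiles
q1 = df["time_spend_company"].quantile(0.25)
q3 = df["time_spend_company"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 1.5 and 5.5 years here

df_no_outliers = df[df["time_spend_company"].between(lower, upper)]
print(len(df), "->", len(df_no_outliers), "rows")
```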
As usual, removing outliers is way easier with Dataiku DSS. You just need to open the dataset, click on the “time_spend_company” column, and choose the criteria for outliers that need removal.
The dataset keeps shrinking! Out of the original 14,999 rows, we first removed 3,008 duplicates and now removed 824 outliers. 11,167 rows remain.
Encoding the categorical variables
In Python, one-hot encoding is done with the pandas pd.get_dummies() function. It converts categorical variables into binary features, with each possible category getting its own column of 0s and 1s.
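For example (column names assumed as before), a single call handles both categorical columns:

```python
import pandas as pd

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

# Each category becomes its own 0/1 column, e.g. salary_low,
# salary_medium, salary_high, department_sales, ...
df_encoded = pd.get_dummies(df, columns=["salary", "department"], dtype=int)
print(df_encoded.columns.tolist())
```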
As a result, the categorical column “salary” is replaced by three numerical columns (salary_high, salary_medium and salary_low), where each value is either 0 or 1, reflecting the salary level an employee receives. Likewise, the “department” categorical column is replaced by ten columns, one for each department in the company.
What about Dataiku? Dataiku DSS takes care of the nitty-gritty details of model training, so you don't have to. You don't need to encode text values explicitly! The platform handles this under the hood, allowing you to focus on the bigger picture and derive insights from your data.
Now we are ready to build our first model! Stay tuned for the next part!