Choosing Your Companion for the Data and AI Journey: Jupyter Notebook vs. Dataiku DSS. Part 2.
Vladimir Parkov
Program & Project Leader | Strategic Planning, Data Analytics, Digital & AI Transformation | I Help Businesses Boost ROI with Data-Driven Strategies
In the first part (available here), we took a first look at our dataset and did some data cleaning by removing duplicates. We also identified several outliers in employees' tenure that could impair the quality of our predictions.
Now we can use the power of Python and Dataiku DSS to understand the relationships between variables in the data.
To recap, the target variable we would like to predict is called "left". It is a Boolean variable with only two possible values: 0 means the employee is still working with the company, and 1 means the employee has left the company.
Step 2. Making sense of relationships between variables
Let's start by creating a heatmap of correlations. This is easy to make with both platforms:
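On the Python side, here is a minimal sketch of how such a heatmap might be produced, assuming the cleaned dataframe from Part 1 is loaded from a file (the file name is hypothetical, and the column names follow the common version of this HR dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned dataset from Part 1 (file name is hypothetical)
df = pd.read_csv("hr_data_cleaned.csv")

# Correlate the numeric columns only; categorical columns such as
# "salary" and "department" are skipped automatically
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()
```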
There are only 8 variables in this heatmap because there are two columns with categorical values: "salary" (high, low, medium) and "department" (accounting, HR, sales etc. – 10 departments in total). We will deal with this issue later.
Data Visualizations with Python
You can generate pairwise scatterplots in Python with a single command: sns.pairplot(df). I'll show just one of these scatterplots: satisfaction level plotted against the latest evaluation score.
On its own, this picture isn't very helpful. But here the complexity of Python turns into its strength: if you put enough effort into making sense of its intricacies, you can create customized visualizations that surface valuable insights.
Let's see what would happen if we highlight data points for those employees that left the company:
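A sketch of one way to do this with seaborn, assuming the standard column names of this HR dataset (satisfaction_level, last_evaluation, left):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

# Colour each point by the "left" flag so leavers stand out
sns.scatterplot(data=df, x="last_evaluation", y="satisfaction_level",
                hue="left", alpha=0.4)
plt.title("Satisfaction vs. last evaluation, coloured by attrition")
plt.show()
```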
This is truly a "picture is worth a thousand words" moment!
We immediately see two distinct clusters of those who left:
We can isolate these groups of people and plot them against other variables, such as tenure, monthly working hours, or salary. Let's quickly do this:
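A sketch of how the clusters could be isolated in pandas. The cut-off values below are assumptions to be read off your own scatterplot, and the column names (including the dataset's characteristically misspelled average_montly_hours) follow the common version of this HR dataset:

```python
import pandas as pd

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

leavers = df[df["left"] == 1]

# Illustrative cut-offs; read the actual boundaries off the scatterplot
cluster_1 = leavers[(leavers["satisfaction_level"] < 0.5) &
                    (leavers["last_evaluation"] < 0.6)]   # low/low group
cluster_2 = leavers[leavers["last_evaluation"] >= 0.75]   # high-evaluation group

# Compare each cluster's profile with the company-wide averages
cols = ["time_spend_company", "average_montly_hours", "number_project"]
print(df[cols].mean())         # overall averages
print(cluster_1[cols].mean())  # cluster 1 profile
print(cluster_2[cols].mean())  # cluster 2 profile
```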
In the first cluster, 835 employees left, and they did so after three years with the company, working fewer than the average number of hours (144 vs. 200) and on fewer than the average number of projects (2 vs. 3.8).
These people found themselves in a job that wasn't the right fit. It's common to take a few years to find your groove in a company before feeling restless and looking for new opportunities. The data shows that, on average, these people spent three years trying to find their place in the company before eventually leaving.
In the second cluster, 497 employees left, and they did so after four years with the company, working more than the average number of hours (277 vs. 200) and on more than the average number of projects (6 vs. 3.8).
These poor lads work tirelessly and achieve high evaluation scores, yet remain stuck in the same position year after year. Going without a promotion despite four years of hard work is a frustrating experience that can lead to burnout and, ultimately, the decision to leave.
The 3-4 year mark is crucial in an employee's journey with a company. By monitoring the number of working hours and paying attention to warning signs, companies can identify potential issues and take steps to address them. One area worth investigating is the company's promotion policies, specifically at the four-year mark.
You have the first significant insights you can share with business leaders to make data-driven decisions!
Python is the Swiss Army Knife for data.
But you must familiarize yourself with its tools and functions: read the documentation, explore Stack Overflow, or ask ChatGPT when you can describe what you are trying to achieve but struggle with the syntax.
For example, we can also create a customized boxplot and histogram to visually compare satisfaction and salary levels:
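A sketch of how this comparison could be drawn with seaborn, assuming the same dataframe and column names as before; a count plot stands in for the histogram of employee counts per salary band:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Boxplot: satisfaction by salary band, split by attrition
sns.boxplot(data=df, x="salary", y="satisfaction_level", hue="left",
            order=["low", "medium", "high"], ax=ax1)
ax1.set_title("Satisfaction by salary band")

# Count plot: number of employees per salary band, split by attrition
sns.countplot(data=df, x="salary", hue="left",
              order=["low", "medium", "high"], ax=ax2)
ax2.set_title("Employee counts by salary band")

plt.show()
```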
For low and medium salaries, both the number of employees who left and their satisfaction levels look the same. The trend changes with high salaries: fewer highly paid people left, but those who did had significantly lower satisfaction levels. Money can't buy work satisfaction.
Data Visualizations with Dataiku DSS
In Dataiku DSS, the process of creating data visualizations is virtually effortless. Not a single line of code is needed to get the insights.
Just open the dataset, go to “Charts”, and use drag-and-drop to generate scatterplots for any variables.
Testing it with satisfaction level plotted against the latest evaluation score, we quickly get the same picture:
Then, by applying a filter, you can quickly show only the data points corresponding to employees who left the company:
Using filters again, we can quickly identify the cluster of 835 employees with low satisfaction coupled with low evaluation scores:
With Dataiku DSS, you can explore your data visually in seconds using pairwise visualizations and statistical tests.
Moreover, you can immediately turn any visualization you have made into a dashboard and start building your data storyline!
Step 3. Choosing AI models
We want to predict whether an employee will leave the company. The outcome can be either 1 (employee left) or 0 (employee stayed with the company).
This prediction task is called binary classification, which leaves us with two possible machine learning approaches to choose from.
Let's start with logistic regression. It is sensitive to outliers and assumes no multicollinearity among the independent variables (i.e., they shouldn't be correlated with each other).
Step 4. Feature Selection, Engineering and Extraction
That gives us three potential problems to solve before we can train the model: multicollinearity, outliers, and categorical variables.
Identifying Multicollinearity
We can use the Variance inflation factor (VIF) to measure how much an independent variable's behaviour (variance) is influenced, or inflated, by its interaction/correlation with the other independent variables.
In Python, we can do it with a little bit of coding:
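A minimal sketch using statsmodels, assuming the same dataframe as before; the constant term is added so the missing intercept doesn't distort the VIFs:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

# VIF is defined for numeric predictors only; drop the target column "left"
X = add_constant(df.select_dtypes("number").drop(columns=["left"]))

# One VIF per predictor: how much its variance is inflated by the others
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))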
A rule of thumb for interpreting the variance inflation factor: 1 = not correlated. Between 1 and 5 = moderately correlated. Greater than 5 = highly correlated.
We may have a problem here as most of the variables are correlated. For now, let's keep all the variables and see whether the quality of logistic regression prediction will suffer.
Removing outliers
In Python, we create a new data frame, removing the rows whose “time_spend_company” values lie more than 1.5 times the interquartile range beyond the quartiles. People working less than 1.5 years (none) or more than 5.5 years (824 rows) with the company will be removed from the dataset.
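A sketch of the IQR rule in pandas; with the first quartile at 3 years and the third at 4, the fences work out to exactly the 1.5 and 5.5 years mentioned above:

```python
import pandas as pd

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

# Standard IQR rule: keep rows within 1.5 * IQR of the quartiles
q1 = df["time_spend_company"].quantile(0.25)
q3 = df["time_spend_company"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 1.5 and 5.5 years here

df_no_outliers = df[df["time_spend_company"].between(lower, upper)]
print(len(df), "->", len(df_no_outliers), "rows")
```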
As usual, removing outliers is way easier with Dataiku DSS. You just need to open the dataset, click on the “time_spend_company” column, and choose the criteria for outliers that need removal.
The dataset keeps shrinking! Out of the original 14,999 rows, we first removed 3,008 duplicates and now removed 824 outliers. 11,167 rows remain.
Encoding the categorical variables
In Python, one-hot encoding is done with the pandas pd.get_dummies() function. It converts categorical variables into binary features, with each possible category getting its own column of 0s and 1s.
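For example (column names assumed as before), a single call handles both categorical columns:

```python
import pandas as pd

df = pd.read_csv("hr_data_cleaned.csv")  # file name is hypothetical

# Each category becomes its own 0/1 column, e.g. salary_low,
# salary_medium, salary_high, department_sales, ...
df_encoded = pd.get_dummies(df, columns=["salary", "department"], dtype=int)
print(df_encoded.columns.tolist())
```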
As a result, the categorical column “salary” is replaced by three numerical columns (salary_high, salary_medium and salary_low), where each value is either 0 or 1, reflecting the salary level an employee receives. Likewise, the “department” categorical column is replaced by ten columns, one for each department in the company.
What about Dataiku? Dataiku DSS takes care of the nitty-gritty details of model training, so you don't have to. You don't need to encode text values explicitly! The platform handles this under the hood, allowing you to focus on the bigger picture and derive insights from your data.
Now we are ready to build our first model! Stay tuned for the next part!