The Art of Deduction: The Sherlock Holmes Way of Exploratory Data Analysis
Netra Varun Ramachandran
Data analysts and strategists around the globe, consciously or subconsciously, follow the path of the great fictional genius, Sherlock Holmes. He single-handedly revolutionized the art of reasoning, elevating it to a proper science. Indeed, some of his methods were officially adopted by Scotland Yard.
Unlike Mr. Holmes, data analysts are not equipped with special capabilities. They have only the power of induction, whereby one meticulously evolves one's process and intuitively builds one's power of inductive reasoning.
Every byte of data presented to a data analyst is much like a well-preserved crime scene: untouched, raw and devoid of manipulation, or at least, ideally so.
It all begins with a single "Why?". This is followed by many more "Whys". Each "Why" stacks upon the last, and before long a pattern forms, the hypothesis is proved or negated, and a final theory is presented. One does not jump to conclusions, or massage the data so that it speaks a convenient truth.
EDA is “Elementary, my Dear Watson” to building any kind of model.
Exploratory Data Analysis is not just viewing the data but closely studying it: for patterns, for anomalies, to check assumptions and hypotheses. However, before anything else, we start with a big "WHAT?"
The entire EDA process can be divided as such:
Chapter One : The Plot
Chapter Two : The Setting
Chapter Three : Character Development (Suspect identification and elimination)
Chapter Four : Decoding & Encoding Characters
Chapter Five : The Point of View
Chapter Six : The Story Going Forward
Chapter One:
The Plot
WHY ARE WE DOING THIS? WHAT ARE WE SOLVING? WHAT ARE WE PROVING WRONG? WHAT IS THE BUSINESS, and DO WE FULLY UNDERSTAND THE BUSINESS CONTEXT?
As Data Scientists or Business Analysts, commencing by asking “WHAT IS THE BUSINESS PROBLEM?” is a very pertinent beginning. By asking the right question, one is able to make an informed deduction thereby eliminating the possibility of a wasteful and ultimately unproductive wild goose chase.
What our ideal intellectual, Mr. Holmes, would do next is to mindfully engage with the data presented.
Chapter Two:
The Setting: “To a great mind nothing is little.”
Closely examine the Head, the Tail, the Shape and everything else with healthy skepticism.
Upload every bit of information (the data set), channelize the inner mind and activate mindfulness. Leave every little prejudice and bias at the door.
- Observe meticulously, column by column. Ask yourself: "Are there any NaN values, missing values or special characters?"
- In the attic of your mind, compartmentalize observations into separate boxes: categorical variables (nominal or ordinal) and numeric variables.
- Are there variables that do not add value to your observation and only present “noise”?
- Which is the most important feature or variable that can be leveraged to solve the defined business problem?
- How many columns (features / variables) and how many rows (observations) encompass the data set?
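In pandas, this opening survey of the crime scene might look as follows. The DataFrame, its column names, and its values are purely illustrative stand-ins for whatever data set you are handed:

```python
import pandas as pd

# A small hypothetical data set standing in for the "crime scene".
df = pd.DataFrame({
    "city": ["London", "Paris", "London", None],
    "rating": ["Good", "Excellent", "Good", "Bad"],
    "revenue": [120.5, 98.0, 120.5, 87.2],
})

# How many observations (rows) and features (columns)?
print(df.shape)  # (4, 3)

# Peek at the head and the tail with healthy skepticism.
print(df.head(2))
print(df.tail(2))

# Compartmentalize features into categorical and numeric "boxes".
categorical = df.select_dtypes(include="object").columns.tolist()
numeric = df.select_dtypes(include="number").columns.tolist()
print(categorical)  # ['city', 'rating']
print(numeric)      # ['revenue']
```

At this stage nothing is changed; the data is only observed.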
Observing all the circumstantial details is the first step business analysts and data science professionals need to embark on to be able to replicate Mr. Holmes’s reasoning process. At this stage, one must only assimilate and not hasten to polish the data.
Actively engage with the dataset – how much more does the data reveal?
Sherlock said, “Never trust general impressions, my boy, but concentrate yourself upon details.”
.info() + .value_counts() + .isnull().sum()
- What are the clues provided by each variable / function?
- Does each variable have the same number of observations as noted earlier with the .shape command?
- What is the data type – integer, float or string? Should a string / object variable be converted to an integer or float type?
- Are there any missing values?
.duplicated() – Are we dealing with imposters - duplicates that could mislead our reading and affect the overall model?
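These interrogations can be run in one sitting. Using the same kind of hypothetical DataFrame as before (column names are illustrative only), note how a repeated row is unmasked as an imposter:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["London", "Paris", "London", None],
    "revenue": [120.5, 98.0, 120.5, 87.2],
})

df.info()                         # data types and non-null counts per column
print(df["city"].value_counts())  # frequency of each category
print(df.isnull().sum())          # missing values per column

# Rows 0 and 2 are identical, so one imposter is flagged.
print(df.duplicated().sum())      # 1
```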
Observe all first impressions closely.
“There is nothing more deceptive than an obvious fact.” – Sherlock Holmes
.describe().T
- Missing value indicator (count): Does the count of each variable match the shape of the data and the observations in the .info() output? If any are missing, corrective measures are to be taken next.
- What is the average (mean) of each variable, and how far apart are the averages of the different variables?
- What is the difference between the minimum, the mean, the inter-quartile range and the maximum? If Sherlock were to perform EDA, he would call these "the general tendencies". How do the extremes behave? This is the point where suspicion has its say.
- If the minimum, average, inter-quartile range and maximum differ wildly, then there are outliers: the odd ones that need to be put under the scanner.
- How does the standard deviation of each variable compare with that of the others, i.e. how widely is each variable spread around its mean?
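A quick sketch of this summary, on a hypothetical revenue column where one extreme value hides among ordinary ones:

```python
import pandas as pd

# Illustrative data: four ordinary values and one suspiciously large one.
df = pd.DataFrame({"revenue": [87.2, 98.0, 120.5, 95.0, 400.0]})

summary = df.describe().T
print(summary[["count", "mean", "std", "min", "25%", "50%", "75%", "max"]])
# A max (400.0) far above the 75% quartile hints at an outlier.
```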
Chapter Three:
Character Development (Suspect identification and elimination)
It’s time now to become objective.
“Crime is common. Logic is rare. Therefore it is upon the logic rather than upon the crime, that you should dwell.”
To Mr. Holmes it would be unpardonable if missing values, special characters, outliers or duplicates were ignored or overlooked. Below are a few logical steps to perform:
Missing Values: Drop or Impute
- Drop: After the hawk-eyed analysis, one can determine whether the missing values are significant. If insignificant, drop the missing rows or the variable itself. For example, if a variable has more than 60% zero or missing values, drop the variable. However, this should be the last option and is generally not recommended.
- Impute with the mean: For a numerical column / variable, you can replace the missing values with the mean. Before doing so, it is advisable to check that the variable has no extreme values, i.e. outliers.
- Impute with the median: For a numerical column / variable, you can also replace the missing values with the median. If you have extreme values such as outliers, the median approach is advisable.
- Impute with the mode: For a categorical column / variable, you can replace the missing values with the mode, i.e. the most frequent value. If there is more than one mode, define which mode you propose to use as the replacement.
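The three imputation strategies above can be sketched in pandas as follows (the columns and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "revenue": [87.2, None, 120.5, 95.0],            # numeric, no outliers
    "sales":   [10.0, 400.0, None, 12.0],            # numeric, 400 is an outlier
    "city":    ["London", "Paris", None, "London"],  # categorical
})

# Mean imputation: safe when there are no extreme values.
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())

# Median imputation: robust when outliers are present.
df["sales"] = df["sales"].fillna(df["sales"].median())

# Mode imputation for categorical columns; mode() can return more
# than one value, so pick the intended one explicitly.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df.isnull().sum().sum())  # 0 -- no missing values remain
```

Note how the outlier-laden sales column gets the median (12.0) rather than a mean dragged upward by the 400.0 value.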
Duplicate and Unique Identifier column
If the data set has unique identifiers (Sl. No., Registration ID), identify the duplicates and drop them. Next, drop the column that holds the unique identifier, as it does not contribute to any analysis. Remember, the art of deduction also includes eliminating noise and / or the crowd. Mr. Holmes would approve…
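A minimal sketch of this elimination step, with an invented registration-ID column:

```python
import pandas as pd

df = pd.DataFrame({
    "reg_id": [101, 102, 103, 103],
    "city": ["London", "Paris", "Rome", "Rome"],
    "revenue": [120.5, 98.0, 87.2, 87.2],
})

# Drop duplicate observations (row 3 repeats row 2).
df = df.drop_duplicates()

# Drop the unique-identifier column: it carries no analytical signal.
df = df.drop(columns=["reg_id"])

print(df.shape)  # (3, 2)
```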
Make Inclusive Observation
“You know my methods, Watson. There was not one of them which I did not apply to the inquiry. And it ended with me discovering traces, but very different ones from those which I had expected.”
Focus on fine-tuning the data with visualization, especially outliers with box plots. Afterwards one can use these visualizations objectively and move on to correcting the outliers. They might be the usual suspects, but they can slip by in broad daylight.
With the help of a box plot, one can identify the observations outside the lower range (Quartile 1 - 1.5 times IQR) and the upper range (Quartile 3 + 1.5 times IQR).
Do treat the outliers appropriately. One can either drop them (not recommended) or cap them at the IQR-based bounds.
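The IQR-based capping described above can be sketched like this (the series is illustrative; the 400.0 plays the outlier):

```python
import pandas as pd

s = pd.Series([87.2, 98.0, 95.0, 120.5, 400.0])

# The box plot's whiskers: Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the whiskers rather than dropping them.
capped = s.clip(lower=lower, upper=upper)
print(capped.max() <= upper)  # True -- the 400.0 outlier has been capped
```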
Chapter Four:
Decoding & Encoding Characters
“What further inferences may we draw? Do none suggest themselves? You know my methods. Apply them!”
Before we start to analyze the data, there are two crucial turning points that decide where the data can lead and the truths it might throw up.
1. What about data that have different scales?
2. How does one deal with Categorical data?
Normalizing and Scaling Data
If you find the variables on different and varying scales, bring them all to the same platform. Mr. Holmes would despise you for comparing apples and oranges.
If one variable is in kilometres, another in units and yet another in millions, use the Z-score transformation (Standard Scaler) to normalize them into a comparable format: each value is expressed as the number of standard deviations it lies from its variable's mean. This is done only for the numerical variables.
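The Z-score transformation is simply (x - mean) / standard deviation, applied column by column. A sketch with two invented columns on very different scales:

```python
import pandas as pd

df = pd.DataFrame({
    "distance_km": [1.2, 3.5, 10.8, 2.1],
    "revenue_millions": [120.5, 98.0, 87.2, 95.0],
})

# Z-score: after scaling, every column has mean ~0 and std 1,
# so kilometres and millions become directly comparable.
scaled = (df - df.mean()) / df.std()

print(scaled.mean().round(6))  # ~0 for every column
print(scaled.std().round(6))   # 1.0 for every column
```

scikit-learn's StandardScaler does the same thing (with a population rather than sample standard deviation); the arithmetic above keeps the sketch dependency-free.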
Treating Categorical Data
Machine Learning Algorithms only work on NUMERICAL DATA. That is why there is a need to convert the categorical column into a numerical one. Categorical Data can be classified into:
Ordinal: data that can be set in order – Yes / No; Excellent / Very Good / Good / Satisfactory / Bad; Big / Medium / Small; etc.
Ordinal data can be label encoded, i.e. mapped to integers that preserve the order.
Nominal: data with no inherent order – city, country, college name, subject, marital status, gender, education, etc. These can be one-hot encoded, i.e. replaced with dummy numerical variables.
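Both encodings can be sketched in pandas (the rating scale and city names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "rating": ["Good", "Excellent", "Bad", "Good"],  # ordinal
    "city": ["London", "Paris", "London", "Rome"],   # nominal
})

# Ordinal: map categories to integers that preserve their order.
order = {"Bad": 0, "Satisfactory": 1, "Good": 2, "Very Good": 3, "Excellent": 4}
df["rating_encoded"] = df["rating"].map(order)

# Nominal: one-hot / dummy encode, since the categories have no order.
df = pd.get_dummies(df, columns=["city"])

print(df.columns.tolist())
# ['rating', 'rating_encoded', 'city_London', 'city_Paris', 'city_Rome']
```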
Now that we have all variables set NUMERICALLY, the next big step is to ensure the Variables are all scaled to one Numerical Scale to enable comparison.
Chapter Five:
The Point of View
“Eliminate all the factors. The one that remains is the truth.”
Uni-variate and Bi-variate Analysis
Sherlock never left his home without a pencil, a small notebook and a magnifying glass. When the data is cleaned, polished and made mutually comparable, it’s time to start the story telling. And graphs tell stories that words fail to capture.
Uni-variate Analysis narrates the story of the mean, median and mode: the measures of central tendency. Be mindful, uni-variate analysis does not deal with cause and effect or relationships. It only describes the variable: what is its range and what is its frequency distribution?
The most widely used uni-variate graphs or visualization techniques are: histogram, bar plot / count plot, pie chart, and box plot.
On the other hand, we have Bi-variate Analysis. As the name suggests, this set of analyses defines the relationship between two variables and narrates their cause and effect. While univariate analysis derives from dispersion, bi-variate analysis derives from correlation. Because the data involves categorical / object data and numerical data, here is a handy guide to the kind of plot that can be employed for the story telling.
1. Numerical vs. Numerical
1. Scatterplot
2. Line plot
3. Heatmap for correlation
4. Joint plot
2. Categorical vs. Numerical
1. Bar chart
2. Violin plot
3. Categorical box plot
4. Swarm plot
3. Two Categorical Variables
1. Bar chart
2. Grouped bar chart
3. Point plot
As a seasoned Data & Business Analyst, one needs to work on Covariance and Correlation to draw insights and pick suitable models that can be developed, implemented and deployed.
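Covariance and correlation come straight out of pandas. A sketch with two invented variables that move together (a correlation near +1 would suggest a strong linear relationship):

```python
import pandas as pd

df = pd.DataFrame({
    "ad_spend": [10.0, 20.0, 30.0, 40.0],
    "revenue": [100.0, 210.0, 290.0, 405.0],
})

print(df.cov())   # covariance: the direction of the joint movement
print(df.corr())  # Pearson correlation: the strength, scaled to [-1, 1]
```

Covariance depends on the units of the variables, which is why the unit-free correlation matrix (often drawn as a heatmap) is the usual tool for choosing candidate features and models.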
Chapter Six:
The Story Going Forward
In every single case that Sherlock Holmes worked on, before he presented his theory or was one step away from solving a case, he would take a STEP BACK and wash away all assumptions that may have existed in the first place.
Then he would look at all the available information, theories and biases as an impartial spectator. This way he was able to build a narrative that led him to the suspect and eventually to the one who committed the crime.
As data science and business analytics professionals, one needs to follow the same approach. Step back, let all the data sink in, look at the nuances with an unbiased lens and then start to look at the models we need to build. And finally, deliver the best model that addresses the Business Problem. Answering the “BIG WHY”.
What would be a Holmes Solution?
Habit, Habit, Habit.
And remember, EDA is ELEMENTARY.