Acing your take home data science interview exercises!
How do you stand out for that one sweet job you know you deserve? Do you increase your model accuracy? Do you try all the different techniques under the sun? So did ten others. Well, that's boring. And bland. And not at all what most interviewers are looking for (at least not the reasonable ones). To clarify, I am not telling you to skip the above. Of course, in our line of work, model accuracy does matter, but what most interviewers are looking for is clarity of thought and presentation. Why did you make those decisions? Would you be able to convey key findings to non-technical folk? And, more importantly for senior roles, is your work organised?
I am going to take you through a journey of how to be an absolute MAD LAD and flex on your data science interviewers! I will be categorising them into simple flexes, medium flexes and mega flexes in order of increasing difficulty.
1. Simple Flexes
1.1 Compartmentalise - Maintain structure
I made this point in my previous article too, and I am repeating it because it is essential if you do not want to be seen as a novice. Good code should read like good poetry: simple to understand and with a flow to it. More on this here.
While it's necessary to maintain structure within the code, it's also great if you can do this outside of it, namely in your folder structure. No one will look at your code if all your outputs, code, plots and your entire analysis are in one flipping folder! No one wants to read Untitled13.ipynb! It's crude. Beat this habit out of your system. A good folder structure should look something like:
Looking at it, you can clearly understand the hierarchy in which your files are structured. Always keep an html version of the notebook (if you are using Jupyter) so that the reviewer does not need to fire up their Python shell to view the notebook. Now doesn't this look beautiful?
What makes this stand out? Well, while you are making your submission in this format, some of your competition is submitting this:
Mega yuck right? Also, this looks like the work of a five year old.
1.2 Compartmentalise - Table of Contents
This tip is mostly for Jupyter notebook users. If you are using it, you might as well exploit all the features it has to offer! Now Jupyter has many great extensions. Check them out here. My favourite one is undoubtedly the Table of Contents extension. After installing them, enable this extension by firing up your Jupyter shell and going to the Nbextensions tab, which will now appear.
What's great about this extension? I am glad you asked. Assuming you documented your Jupyter notebook with well-placed markdown cells, the first cell of your notebook would look like this:
And yes, these are hyperlinks. So you can click on them to navigate to that specific section of your code.
1.3 Aesthetics - Better visualizations
Sticking to the default colour palette of matplotlib? Wrong choice. The stock orange/blue combo has never been a career maker. Trust me. Even if the information conveyed in the charts is meaningful, it will not amount to anything if the charts fail to hold your audience's undivided attention. So instead of:
Please use:
This particular visualisation style is well known and highly acclaimed as "fivethirtyeight", and it's based on Nate Silver's website, which does statistical journalism.
All that is needed to transform your charts into this (please add titles, subtitles, axes labels, legend etc as well!) is some three lines of code:
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
Consider moving to seaborn or, failing that, plotly. Plotly produces some nice interactive (non-static) charts.
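To make the "please add titles, axes labels and a legend" advice concrete, here is a minimal sketch. The monthly sales figures and channel names are made up purely for illustration, and the Agg backend is only selected so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

plt.style.use("fivethirtyeight")

# Hypothetical data, for illustration only
months = ["Jan", "Feb", "Mar", "Apr"]
online = [12, 15, 14, 19]
in_store = [20, 18, 17, 16]

fig, ax = plt.subplots(figsize=(8, 4.5))
ax.plot(months, online, label="Online")
ax.plot(months, in_store, label="In store")

# The style sheet handles colours and grid; the labelling is still on you
ax.set_title("Monthly sales by channel")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold (thousands)")
ax.legend()
fig.tight_layout()
```

From here, fig.savefig() with a path of your choice would write the chart into whichever folder holds your plots.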
2. Medium Flexes
2.1 Code Optimisation - Replace for loops
Them for-loops giving you the run around? My advice is never to use them unless necessary. Swap the usage of for-loops with apply and lambda functions. Or, use them in the context of list comprehensions.
To illustrate, let's take a list containing a certain number of elements. If you were asked to find the index positions at which a certain element occurs in the list, what would be your first instinct? This?
result = []
shouts = ['Fus', 'Roh', 'Dah', 'Fus']
for index, shout in enumerate(shouts):
    if 'Fus' in shout:
        result.append(index)
# result: [0, 3]
Nope. While this does perform the job, we can do it better, like so:
shouts = ['Fus', 'Roh', 'Dah', 'Fus']
result = [index for index, shout in enumerate(shouts) if 'Fus' in shout]
# result: [0, 3]
List comprehensions are more concise, usually run faster than the equivalent append loop, and show that you are an experienced coder. Could you think of how to rewrite this as a lambda function?
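One possible answer to the lambda question, offered as a sketch rather than the only way to do it: pair a lambda with the built-in filter() and run it over the index positions.

```python
shouts = ['Fus', 'Roh', 'Dah', 'Fus']

# Keep only the index positions whose element contains 'Fus'
result = list(filter(lambda index: 'Fus' in shouts[index], range(len(shouts))))

print(result)  # [0, 3]
```

For a case this small, many reviewers will find the list comprehension easier to read. The lambda version becomes more useful when the same test is passed around as an argument, for example to pandas' apply.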
2.2 Code Optimisation - Using dictionaries to replace categories
Dictionaries are very under-utilised in data science code, in my opinion. They are one of the most useful data structures in Python. During data cleaning, I have seen many candidates use long, nested np.where() statements, especially when they need to find and replace categories with something else. This is what an average coder does:
df['Hero'] = np.where(df['Name'] == 'Peter Parker', 'Spiderman',
                      np.where(df['Name'] == 'Clark Kent', 'Superman', df['Name']))
Oof. Even writing this out was tough :/ . Consider something like this:
hero_dict = {'Peter Parker': 'Spiderman', 'Clark Kent': 'Superman'}
# replace() returns a new Series, so no explicit copy() is needed
df['Hero'] = df['Name'].replace(hero_dict)
Pandas provides an excellent replace() API especially for this. Crisp. Clear. And a whole lot better. Be the 1%.
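To check that the dictionary approach really is a drop-in replacement, here is a small sketch on a made-up three-row frame (the 'Bruce Wayne' row is an assumption added to show what happens to names missing from the dictionary). It also shows map() with a fallback as an alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"Name": ["Peter Parker", "Clark Kent", "Bruce Wayne"]})
hero_dict = {"Peter Parker": "Spiderman", "Clark Kent": "Superman"}

# The nested np.where() version
nested = np.where(df["Name"] == "Peter Parker", "Spiderman",
                  np.where(df["Name"] == "Clark Kent", "Superman", df["Name"]))

# replace() leaves values that are not in the dictionary untouched
replaced = df["Name"].replace(hero_dict)

# map() returns NaN for missing keys, so fall back to the original name
mapped = df["Name"].map(hero_dict).fillna(df["Name"])

print(replaced.tolist())  # ['Spiderman', 'Superman', 'Bruce Wayne']
```

Adding a new category is now a one-line change to the dictionary instead of another level of nesting.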
2.3 Code Optimisation - Specifying the independent variables
While specifying the target and independent variables pre-modelling, more often than not this is what actually gets coded:
y = df['Amount'].tolist()
X = df[['X1', 'X2', ..., 'X100']]
Yes, people actually type out the entire hundred column names (or copy paste them).
Can we improve this like so?
target_col = 'Amount'
y = df[target_col].tolist()
# drop() returns a new DataFrame, so df itself is left untouched
X = df.drop(columns=target_col)
Much cleaner, in my opinion, and you don't subject your reviewer to reading a hundred or so column names.
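The same idea extends to any column that should not be a feature. A minimal sketch, assuming a made-up frame with a hypothetical 'CustomerID' identifier column next to the target:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "CustomerID": [101, 102, 103],
    "X1": [1.0, 2.0, 3.0],
    "X2": [4.0, 5.0, 6.0],
    "Amount": [10.0, 20.0, 30.0],
})

target_col = "Amount"
non_feature_cols = ["CustomerID", target_col]  # everything that is not a predictor

y = df[target_col]
X = df.drop(columns=non_feature_cols)

print(X.columns.tolist())  # ['X1', 'X2']
```

Because drop() returns a new DataFrame, df still holds all four columns afterwards.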
2.4 Code Optimisation - Serialization
Serialization is converting a Python object into a byte stream. Pickle files are the standard option. Two really simple rules to follow here:
- Serialize large dataframes into pickle files for faster read times
- Serialize machine learning models for reproducibility
We can use the joblib package for this. An example:
import joblib

# Storing dataframe as pickle file
joblib.dump(df, 'df.pkl')

# Storing model as pickle file
joblib.dump(model, 'model.pkl')
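Storing is only half of the story; the reviewer needs to read the object back. Here is a sketch of the round trip using joblib.load(). The dictionary is a hypothetical stand-in for a fitted model, and a temporary directory is used so that the example cleans up after itself.

```python
import tempfile
from pathlib import Path

import joblib

# Hypothetical stand-in for a fitted model or a large dataframe
model_params = {"n_estimators": 200, "max_depth": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = str(Path(tmp_dir) / "model.pkl")
    joblib.dump(model_params, path)   # write the object to disk
    restored = joblib.load(path)      # read it back in a later session

print(restored == model_params)  # True
```

In a real submission, the .pkl files would live in your project's folder structure rather than in a temporary directory.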
3. Mega Flexes
3.1 Statistical Thinking - Interpret open ended exercises to your advantage
Sometimes, if you are very lucky, you do get open ended exercises where you are free to explore any and all options available. Take the case of a problem statement like so:
You are given a Retail Dataset. Develop a model to Predict Sales Next Month and "why" you chose that model? Generate Forecast for 3 months based on best selected model.
I can confidently say that most candidates will jump on the Time Series bandwagon just from seeing the word forecast in there. Not to say that it is wrong, but is it the only option? Consider transforming this into a supervised regression problem and exploring the regressors. If framing the problem in a supervised regression context proves appropriate after testing and cross validation, then you, sir, have flexed hard on your reviewer! While this is only an example, you are free to interpret any open-ended problem statement in your own way.
Kindly have a look at Jason Brownlee's excellent article on how to transform a time series problem into a supervised learning problem here.
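As a rough sketch of what that reframing can look like: build lagged copies of the series with pandas' shift(), so that each row holds the previous months as features and the current month as the target. The sales numbers and the make_supervised helper are made up for illustration, and this is one common sliding-window approach, not necessarily the exact method in the linked article.

```python
import pandas as pd

def make_supervised(series: pd.Series, n_lags: int = 3) -> pd.DataFrame:
    """Turn a series into lag features plus a target (hypothetical helper)."""
    # lag_k holds the value observed k steps before the target
    frame = pd.DataFrame(
        {f"lag_{k}": series.shift(k) for k in range(n_lags, 0, -1)}
    )
    frame["target"] = series
    # The first n_lags rows have no complete history, so drop them
    return frame.dropna().reset_index(drop=True)

# Hypothetical monthly sales, for illustration only
sales = pd.Series([100, 120, 130, 125, 140, 160], name="sales")

supervised = make_supervised(sales, n_lags=2)
X = supervised.drop(columns="target")
y = supervised["target"]

print(supervised.shape)  # (4, 3)
```

X and y can now go into any regressor. When cross validating, keep the folds in time order so that the model is never trained on months that come after the ones it is tested on.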
3.2 Statistical Thinking - Exploring categorical encoders
Most candidates spend a lot of time trying out different models while spending little to no time on encoding techniques other than OneHotEncoder() or pd.get_dummies(). There are other methods of encoding categorical features. I especially recommend the Weight of Evidence technique. You can find an excellent article here regarding this. The Python package with different categorical encoders can be found here.
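To give a feel for Weight of Evidence, here is a hand-rolled sketch for a binary target. It assumes the convention WoE = ln(share of events / share of non-events) per category; some references flip the ratio, which only changes the sign. The data, column names and helper function are made up for illustration, and the sketch assumes every category contains both classes (otherwise the ratio is zero or infinite and needs smoothing).

```python
import numpy as np
import pandas as pd

def weight_of_evidence(categories: pd.Series, target: pd.Series) -> pd.Series:
    """WoE per category, as ln(share of events / share of non-events)."""
    counts = pd.crosstab(categories, target)            # rows: category, cols: 0/1
    events_share = counts[1] / counts[1].sum()          # P(category | target = 1)
    non_events_share = counts[0] / counts[0].sum()      # P(category | target = 0)
    return np.log(events_share / non_events_share)

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "churned": [1, 1, 1, 0, 1, 0, 0, 0],
})

woe = weight_of_evidence(df["city"], df["churned"])
df["city_woe"] = df["city"].map(woe)  # one numeric column instead of N dummies
```

City A holds 3 of the 4 events and 1 of the 4 non-events, so its WoE is ln(3), about 1.10, and city B gets the mirror image, about -1.10. To avoid leakage, compute the mapping on the training data only and then apply it to the test data.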
3.3 Decision Making - Solution Writeup
If by reading your analysis/code, the reviewer is not convinced of the decisions you took while approaching the problem, consider writing an additional notebook/markdown file documenting your decisions.
The above screenshot is of a coding exercise I had done a few years back. As you can see, every decision regarding the data and the models has been documented.
While there are more ways to make your work stand out from the rest (pipelines, multiprocessing, etc.), these should be enough to establish your prowess as Lord Flexenor of candidates and earn your way to a callback! Good Luck!