Is Coding the Least Important Core Skill of Data Science?

Don’t get me wrong. Coding is a core skill of data science. But it is overhyped. Or, at least, overemphasized.

I see job postings for data scientists and data analysts that look for candidates with fluency in SQL, R, Python, C++ etc.

What’s the problem?

A list of languages does nothing to stop you from hiring someone who knows just enough about data science and analytics to be dangerous. These are fields that require deep technical expertise, and the coding component is just the tip of the iceberg.

There’s nothing more dangerous than someone who can run an algorithm but has no idea how it works or whether it worked.

So, what are the more important core skills in data science? Here are my thoughts in order of importance. And, yes, this is my opinion so you are more than welcome to disagree with me!

Business Context

There is no value to data science without context. You need to know the company and the sector for analysis to make sense. If you work in the FMCG sector and want to measure changes in spend over time, you’ll need to factor seasonality into your analysis. If you work in a subscriber-based industry or product category (like postpaid mobile phones, where you make fixed monthly payments), you’ll need to figure out which inputs are symptoms of churn and which are causes of churn. Although much of this work will be solved analytically, the initial context will more likely be set by the business. This is the framework around which initiatives are proposed and creativity thrives.
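As a rough illustration of the seasonality point, here is a minimal Python sketch. The spend index, the seasonal shape and the 5% growth rate are all made up for illustration; they are not real FMCG figures.

```python
# Hypothetical monthly spend index: a fixed seasonal shape (December peak)
# multiplied by a year-level growth factor. All numbers are made up.
SEASONAL_SHAPE = [0.8, 0.8, 0.9, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.1, 1.2, 1.5]
BASE_SPEND = 100.0
YEARLY_GROWTH = 0.05  # assumed 5% underlying growth per year


def monthly_spend(year_index: int) -> list[float]:
    """Return 12 months of spend for the given year (0 = first year)."""
    level = BASE_SPEND * (1 + YEARLY_GROWTH) ** year_index
    return [level * shape for shape in SEASONAL_SHAPE]


def pct_change(new: float, old: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


year1, year2 = monthly_spend(0), monthly_spend(1)

# Naive comparison: December vs. November of the same year mixes season and trend.
month_over_month = pct_change(year2[11], year2[10])
# Seasonally matched comparison: December vs. the previous December.
year_over_year = pct_change(year2[11], year1[11])

print(f"Dec vs Nov (same year):  {month_over_month:+.1f}%")
print(f"Dec vs Dec (prior year): {year_over_year:+.1f}%")
```

Under these assumptions the December-versus-November comparison reports roughly 25% growth, while the like-for-like comparison recovers the assumed 5%. Someone without the business context could easily report the first number as an insight.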

Business context also provides a clear understanding of the business impact of each project. In many cases, data science initiatives require too much time to be worth the investment.

In the absence of business context, analytics teams can spend long periods of time creating models that deliver negative returns, by either failing to solve a worthwhile problem or requiring more resources than the problem warrants.

Statistical Thinking/Applied Mathematics

Data science and analytics are about separating the signal from the noise. Business context gives you a clear picture of the problem; mathematical knowledge enables the solution. I’d argue that there are three core components here:

  1. Probability: Understanding how to compare any result to what you would expect by chance. Given the popularity of significance testing, people often trawl through data looking for significant effects. This leads to chance effects being offered as “insights” and a loss of trust in analytics when the effect is never replicated. Probabilistic thinking also extends to research design (such as stratified A/B testing with control groups). Probability is one of those things that the human mind lacks an intuition for, and so it requires study. It’s not data science per se but is a consideration for many analytical techniques.
  2. Modelling Techniques: Knowing how modelling techniques work is crucial when it comes to fitting a mathematical function to an analytical problem. In supervised learning, you need to have some hypotheses about how the input variables map onto the output variable, and you need to know the sample size considerations for the approach you choose. The more complex algorithms (Gradient Boosting Machines etc.) can map non-linear interaction effects, which is useful when they are present but wasteful if the relationships are largely linear and bivariate. Moreover, algorithms that can map more complexity require more statistical power: you need larger sample sizes so they can fit to the sparser regions of the feature space. In unsupervised learning, you need to consider what your data looks like. Do you have highly skewed data with plenty of outliers? If so, K-Means is likely to produce clusters that misrepresent your data and do not generalize. The bottom line is that understanding how these methods work will help you figure out which techniques to apply to the problem.
  3. Model Evaluation: Knowing how models work is the first piece of the puzzle. The next is how to validate your models. The problem is that these tools will fit to your data and pump out results no matter what you put in them. That doesn’t mean the model fit anything other than noise. There are many things that can go wrong here. For example, cross-validation can overestimate how well your model describes reality (particularly when information leaks between folds), leading to a false sense of security regarding your model. For classification, accuracy can be misleading when classes are imbalanced or when you predict one class better than another. This is only scratching the surface; smarter people than me have written entire books on this.
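To make the accuracy point concrete, here is a minimal Python sketch. The churn labels and the "model" that always predicts the majority class are hypothetical, and the helper functions are written out by hand rather than taken from any particular library.

```python
# Hypothetical churn labels: 1 = churned, 0 = stayed. 5% of customers churn.
actual = [1] * 5 + [0] * 95
# A "model" that always predicts the majority class (nobody churns).
predicted = [0] * 100


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of the actual `positive` cases that the model identified."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    return (recall(y_true, y_pred, 1) + recall(y_true, y_pred, 0)) / 2


acc = accuracy(actual, predicted)
churn_recall = recall(actual, predicted)
bal_acc = balanced_accuracy(actual, predicted)

print(f"Accuracy:          {acc:.2f}")
print(f"Recall on churn:   {churn_recall:.2f}")
print(f"Balanced accuracy: {bal_acc:.2f}")
```

In this toy case the model scores 95% accuracy without identifying a single churner, which is why per-class metrics are worth checking before trusting a headline number.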

I’d personally much rather have a data scientist with terrible coding skills but exceptional analytical skills. They may need to lean on IT to deploy an algorithm but will be better able to navigate the mathematical complexity of data science.
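The earlier point about trawling for significant effects can also be sketched in a few lines of Python. This is a simulation under stated assumptions, not a recipe: the number of segments, the group size and the use of a z-test with known unit variance are all choices made for illustration. Every comparison draws both groups from the same distribution, so any "significant" result is a chance effect.

```python
import random
from statistics import NormalDist, fmean

random.seed(42)  # fixed seed so the sketch is reproducible

N_SEGMENTS = 200  # hypothetical number of segments trawled for effects
GROUP_SIZE = 500  # hypothetical customers per group in each comparison
ALPHA = 0.05


def null_p_value(group_size: int) -> float:
    """Two-sided z-test p-value for two groups drawn from the SAME distribution."""
    a = [random.gauss(0, 1) for _ in range(group_size)]
    b = [random.gauss(0, 1) for _ in range(group_size)]
    # Known unit variance, so the standard error of the difference is sqrt(2/n).
    z = (fmean(a) - fmean(b)) / (2 / group_size) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_values = [null_p_value(GROUP_SIZE) for _ in range(N_SEGMENTS)]

# Uncorrected: count every comparison that clears the usual threshold.
naive_hits = sum(p < ALPHA for p in p_values)
# Bonferroni correction: divide the threshold by the number of tests.
bonferroni_hits = sum(p < ALPHA / N_SEGMENTS for p in p_values)

print(f"'Significant' at {ALPHA}: {naive_hits} of {N_SEGMENTS} "
      f"(about {ALPHA * N_SEGMENTS:.0f} expected by chance)")
print(f"Significant after Bonferroni: {bonferroni_hits}")
```

With no real effect anywhere, roughly 5% of the comparisons should still clear the uncorrected threshold. A correction such as Bonferroni (one of several options) shrinks that list, which is the kind of judgement that coding fluency alone does not supply.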

Interpersonal Skills


Articles often cite communication as a key skill in analytics and data science, but often gloss over the details. I’d argue that this has led to misunderstandings about what is required. Let’s be a bit more detailed:

  1. Trust-Building: Analytics teams work with other teams (data managers, marketing departments etc.), and close relationships and trust are required for great work to happen. If you have a team of PhDs who don’t get on well with the data warehousing teams and confuse the marketing teams, then you’ll struggle to get what you need and sell in great initiatives. It is very easy for business stakeholders and analytics professionals to develop adversarial relationships when this process breaks down.
  2. Bargaining Skills: Once trust is established, you’ll need to learn how to negotiate mutual victories. The problem with analytics teams is that they often get hit with lists of poorly scoped requests and non-analytical tasks that other teams can’t do. It’s not practical to simply say no every time, but lines will need to be drawn on occasion. It’s about picking your battles and avoiding the traps of pushing back too often or not often enough.
  3. Empathy: Although being able to communicate complexity is very important, I’d say it plays second fiddle to understanding other people’s perspectives and speaking to them. For many subject-matter experts, analytics is a threat. They may feel that data-driven insights will work against their recommendations. Once trust is established, partnering with them by understanding and speaking to their perspective will make them an ally rather than a roadblock.

Information Systems

Each company stores information in different ways. The more technologically developed have non-relational, big data architectures, whereas most have the more common relational databases. Good analysts understand where information comes from and what it refers to, and know how to combine these sources conceptually. They also know how to ensure the environment is looked after. A common issue in these environments is tech debt, where poor code that is not deployment-ready is pushed into deployment environments and leads to inefficient systems with bugs that may cause outages and inaccuracies.

For example, a data scientist may want to create some derived fields for a simple dashboard because they are strong predictors of customer attrition. Although these could be manually created in Tableau or Shiny, this could make the interface slower and, depending on how fields are updated, leave the dashboard out of sync with the rest of the system, so that end-users wonder why they are seeing different numbers in different places. Or changes to the data ecosystem could introduce bugs where certain dashboard fields come up with nulls (or worse). Knowing when a change should be briefed to IT/data management and when it should be done in a non-scalable way requires sound judgement, and this comes from understanding both the information systems and the priorities of the business.

Coding Skills

So, here is where I put coding. Once you have the above, you are in a good position to improve your ability to execute by further developing your coding skills. In reality, it’s hard to have the skills above and not have a reasonable grasp of coding (it’s almost always learned in tandem). However, it’s worth mentioning that programming is the easiest part. That’s not to say it’s easy. It is incredibly difficult. When I was learning to program in R, I wrote some awful code. And I still get schooled every time I log onto Stack Overflow. Good data scientists write efficient code that is easy to read, check and modify, but let’s not fall into the trap of thinking that great data science will happen from coding skills alone.

When I see people looking for data scientists but mentioning only programming skills, I ask myself:

“Are they looking for a software developer who dabbles in data science or a data scientist who can program?”

Because these two things are not the same. 

But those are just my thoughts. What do you think?

Anandam Sarcar

Data & AI | People | 20+ years of Global Experience in BOTH Business & Engineering roles for multiple Microsoft businesses

7 years ago

Both are required. A good balance of core technical and business skills is required to be a good data scientist. That’s why the field is niche and it is hard to find folks who are great at both ends of the spectrum.

Ranvijay Singh

Sr. Data and Reporting Engineer

7 years ago

I agree, but both skills are vital components for a data scientist, and that is the reason the world is moving from hard-core analytics to machine learning and deep learning. The tools are capable of handling complex analytics implicitly, without much core knowledge of statistics.

Vijay Bhogi

M365/Power Platform Specialist at Rio Tinto

7 years ago
More articles by Jehan Gonsal
