The Next Generation Data Scientist
With Deloitte predicting that 80% of UK companies will hire a data scientist in the next year and Glassdoor rating the role as the #1 most sought-after job for six consecutive years, there has never been a better time to be a data scientist. However, with McKinsey predicting a shortage of 250,000 data scientists in the US alone by 2025, how can we better define and identify ‘The Next Generation Data Scientist’? To better understand where the industry is moving, I recently had the pleasure of speaking with:
- Pedro Pinto Coelho – Chief Executive Officer at Banco BNI Europa
- Antti Myllymäki – Head of Artificial Intelligence at OP Financial Group
- Tad Slaff – Data Science Product Lead at Picnic Technologies
- Georges Mansourati – Chief Analytics Officer at Northmill Bank
- Thomas Berngruber – Head of Data Analytics & Development at Jyske Bank
As an Account Manager at Altair, I spend a significant amount of time speaking to data executives, and their definitions of a data scientist can vary wildly – from people in charge of preparing basic reports and running simple SQL queries to people building complex machine learning models, data processing pipelines and everything in between! Widespread adoption of the title ‘Data Scientist’ has helped bring the field to prominence, but it has also created a severe lack of clarity, as the role means so many different things to so many different industries. For example, Georges Mansourati (Chief Analytics Officer at Northmill Bank) “distinguished between three different types of data scientist-related roles:
- Data Scientist: people who do the modelling
- Data Engineer: architects – the ones who build the plumbing
- BI Developers: doing the visualisation and presenting this to the end users”
Similarly, Tad Slaff (Data Science Product Lead) followed this up by stating that Picnic is “looking for full-stack data scientists. We are not hiring only data scientists, or only engineering side, or only data extraction / ETL work but instead we want someone who can pull data out, do all cleaning they want, build the model and productionalise it themselves”.
One thing that is clear is that companies are gradually shifting away from employing a single, often overworked, data scientist to creating a broader, multi-skilled data team. No matter the industry, a data scientist is tasked with creating value through data. The common trait is not necessarily in-depth programming knowledge, or a computer science background; rather, all data scientists seem to share an intense curiosity and a willingness to delve deeper into a problem to find their answers – a trait shared by scientists across every discipline!
The Covid-19 pandemic has stopped a lot of industries in their tracks, but data science, on the other hand, continues to be brought deeper into the mainstream. It is no longer an industry reserved exclusively for the nerds amongst us! Turn on the news and you will see an endless stream of graphs and models charting this epidemic and predicting its future spread. Governments and health organisations have relied heavily on data science to form the basis of many of the existing global restrictions – the ideas of lockdowns, masks and social distancing were all borne out of analysing vast amounts of historical and real-time data. Furthermore, as the world shut down, data science continued to greet us from within the confines of our own homes, with Netflix recommending our new favourite movie, Spotify creating tailored playlists depending on our mood and Amazon delivering everything we needed – from toilet paper to quiz books!
Evidently, there are big expectations on the shoulders of data scientists, but what are their most pressing challenges and how can we help make their jobs a little bit easier? Let’s expand on a few critical trends:
Data Quality - Poorly prepared data continues to be one of the biggest obstacles to success within data science. As Georges Mansourati (Northmill Bank) said, “It’s about the data. That’s the boring answer but that’s the honest answer. It has always been about having good data quality, good data infrastructure and being able to join interesting datasets from different parts of the business. In all of my previous roles, the one common factor where it’s been problematic has always been about data and data availability as opposed to modelling difficulties”. Our audience echoed Georges’ sentiment, with 50% blaming poor data quality as the main reason for data science project failure. Clearly, CIOs and CDOs need to emphasise the need for good quality, relevant and timely datasets.
The concept of GIGO (Garbage In, Garbage Out) is more relevant than ever before. Oftentimes, data scientists are too distracted by the shiny neural network or random forest model and completely forget about the importance of data quality. The Next Generation Data Scientist should not skip any steps in the modelling process and must avoid the temptation to impress with complex machine learning models that are mismatched with the problem being solved. It is perfectly acceptable to spend a significant portion of your time down in the trenches working with the data if this is what it takes to build a more accurate model.
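Time “in the trenches” often looks less glamorous than modelling: flagging impossible values and imputing gaps with simple, auditable rules. The sketch below uses a small, entirely hypothetical loan dataset (the column names and values are illustrative, not from any real bank) to show the kind of basic quality checks that should happen before any model is fitted:

```python
import numpy as np
import pandas as pd

# Hypothetical loan dataset, used purely for illustration.
df = pd.DataFrame({
    "income": [42000, np.nan, 58000, 61000, -1, 47000],
    "age": [34, 29, 41, 300, 38, 45],  # 300 is an obvious entry error
    "defaulted": [0, 0, 1, 0, 1, 0],
})

# 1. Flag impossible values rather than silently training on them.
df.loc[df["income"] < 0, "income"] = np.nan
df.loc[df["age"] > 120, "age"] = np.nan

# 2. Impute the remaining gaps with a simple, auditable rule (the median).
df["income"] = df["income"].fillna(df["income"].median())
df["age"] = df["age"].fillna(df["age"].median())

# Nothing missing or impossible should reach the model.
assert df.isna().sum().sum() == 0
```

Nothing here is sophisticated, which is the point: a documented rule like “negative incomes become missing, then median-imputed” is easy to explain and audit, whereas garbage passed straight into a model is not.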
Automation – It is often estimated that a typical data scientist spends 80% of their day working on mundane, repetitive, time-consuming, error-prone tasks such as data preparation, feature engineering and feature selection. This is neither efficient nor productive, and data scientists need to quickly adopt an automation-first mindset. In a KDnuggets poll released last year, 51% of respondents said that they expect most expert-level Predictive Analytics/Data Science tasks currently done by human data scientists to be automated by 2025.
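One practical form of an automation-first mindset is encoding the repetitive steps – scaling, feature selection, modelling – as a single repeatable pipeline rather than redoing them by hand for every experiment. A minimal sketch using scikit-learn (on synthetic data, purely for illustration) might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset: 30 features, only 5 informative.
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)

# Scaling, feature selection and modelling run as one repeatable unit,
# so the "mundane" steps are automated and applied consistently.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
```

Once the workflow lives in a pipeline, swapping a model or re-running on fresh data is one line, and the preprocessing can never silently drift out of sync with the model.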
Automation is a great thing! However, there is a dark side to automation, and this occurs when Machine Learning becomes difficult to follow and incredibly complex to audit and govern. As Pedro Pinto Coelho (CEO at Banco BNI Europa) mentioned:
“We are coming to a point in the financial services industry which is very concerning which I call the black box issue. We have great data scientists working in a model but we are not able to explain our results – for instance how did we actually come out with that recommendation in terms of credit - so I believe we really need to be able to trace back the decision process and be able to explain how the model has behaved and why to stakeholders like the regulators and the shareholders”
AutoML and Model Explainability must mirror each other in order to generate success. With greater organisational understanding, data scientists will ensure increased buy-in and budget from senior management.
Communication - Without astute communication skills, a lot of the hard technical graft a data scientist goes through can quickly be undone. Explaining technical concepts to a non-technical audience is a key challenge for most data scientists, who tend to find it difficult to take a step back from something they have been immersed in for days and weeks. As data science is still a relatively new craft, we find that decision-makers and senior executives can have difficulty understanding these new tools. As evidenced by the recent Congressional hearings in the US, even tech leaders like Mark Zuckerberg can have difficulty getting their ideas across when faced with non-technical people. As Antti Myllymäki (Head of AI at OP Financial Group) said, “A data scientist must be excellent in communicating complex things - not making them simple but making them understandable”.
Almost just as important as training data scientists in the modelling workflow is investing time and resources on presentation training. Unlike other scientific fields, a data scientist will rarely be solely presenting to a group of like-minded peers. Oftentimes, data scientists will be working in cross-functional teams where understanding will vary. Accordingly, Thomas Berngruber (Head of Data Analytics & Development at Jyske Bank) stated that “the single bits and pieces are not necessarily that complex but putting it all together is what creates a lot of complexity in the solution and then with that in mind, Change Management is something that becomes really important. What we’re experiencing every day is wrestling a traditional business model and trying to change it to something else.”
Domain Knowledge – Many businesses either cannot afford to keep enough data scientists (according to Glassdoor, data scientists make an average of $113,309 per year) or simply cannot find experts with the right balance of skills. However, Gartner’s coining of the term “citizen data scientist” promises to reduce the barriers to entry for many organisations. Gartner defines a citizen data scientist as “a person who creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics”. Educating our business-savvy experts on data science will reduce the gap between the business and the data scientist. As Thomas Berngruber (Jyske Bank) stated, “if I were to boil it down to one sentence, matching decision making and knowledge is an extremely important thing”.
Of course, there will be lots of business use cases with a strong enough ROI to support investing heavily in complex algorithms that require a deep understanding of machine learning. It is therefore critically important to design technology that is powerful enough for data scientists while still being accessible enough for data analysts. As Pedro (Banco BNI Europa) stated, “ultimately, data scientists will be so central to everything we do that hopefully one day every one of us will have skills and we’ll cross the skills with the knowledge on the specific sector”. In my opinion, data literacy should be at the forefront of every organisational strategy as we move further into this new decade.
A lot of these challenges recently converged at Picnic Technologies – Europe’s fastest-growing online supermarket – when Tad Slaff and his team were building a demand forecasting model for purchase order management. According to Tad,
“A lot of the items we are ordering are based on current demand we have. Being able to forecast that – for instance, how many bananas do we need to purchase today to deliver to customers tomorrow - becomes incredibly important. From a pure data science perspective, it doesn’t seem too much of a challenge as we have tons of historical data and people are making fairly regular orders. However, what makes this a challenge is the fact that this needs to be an operational system and needs to operate in real-time. We really need to have all the pieces fit together – from how fresh is the data to is the model working correctly to does it produce a highly accurate forecast? Furthermore, this is high impact – if you are making mistakes on how many items you need to order, it can lead to losses quite quickly. Building the model is only one really small piece to what makes this challenging. What sets companies apart isn’t necessarily how great they are at building models but everything around it – what does that entire pipeline look like, can you actually get products to market quick and is it generating actual business value?”
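Tad’s point that “building the model is only one really small piece” is easy to see with the banana example. A sensible forecasting baseline is trivially simple – average demand on the same weekday over recent weeks – and the numbers and function below are entirely made up for illustration, not Picnic’s actual method:

```python
import statistics

# Hypothetical daily banana demand (units) for the past two weeks,
# ending today. Note the weekend dips at the end of each week.
history = [120, 135, 128, 140, 150, 90, 80,
           125, 138, 130, 145, 155, 95, 85]

def forecast_next_day(history, weeks=2):
    """Naive seasonal baseline: tomorrow shares its weekday with the
    observations 7, 14, ... days back, so average those. Simple, fast
    and fully explainable."""
    same_weekday = [history[-7 * (w + 1)] for w in range(weeks)]
    return statistics.mean(same_weekday)

print(forecast_next_day(history))  # averages history[-7] and history[-14]
```

The forecast itself takes four lines; everything Tad lists – fresh data arriving on time, monitoring that the model is behaving, wiring the output into real purchase orders – is where the actual engineering lives.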
However, the benefits of investing in next-generation data science are impressive. Antti Myllymäki (OP) stated that “when we started our AI transformation journey in late-2016, everybody thought this Big Data and AI trend was all about customer insight and sales operations, i.e. the better you understand individual users and their preferences, the bigger the share of their wallet you will have. However, at OP, we are seeing a significant impact on efficiency – not just chatbots in customer service, but also seriously reducing manual reports in areas like fraud prediction and anti-money laundering.”
In fact, OP’s Centre of Excellence has already achieved more than €23 million of operational efficiency savings in its three-year AI journey to date.
There is a lot of information here but the most important thing to remember is that data science is just like riding a bike. You can only learn by doing. The benefits are clear. Take the stabilisers off. Remember the importance of data quality. Adopt an automation-first mindset and champion data literacy throughout. The Next Generation Data Scientist is just around the corner.
If you would like to hear how Altair’s technology stack is designed with these concepts in mind, please check out our upcoming webinar on October 28th:
https://web.altair.com/en/innovative-tools-for-the-next-generation-data-scientist