Data Science Platforms: A Necessary Evil?
Data Science will change the world (https://houseofgeekery.com/2015/04/28/top-10-superheroes-who-didnt-originate-in-comics/)

“Yes! Another article about data science platforms,” said no one. But you’re here already, so you might as well stay for the next 3 or so minutes as I share my personal reflections about data science platforms. Specifically, 3 things: what data science platforms are; why they are necessary; and the evil part. The views in this article are solely mine and do not represent the views of my employers, past or present [and I guess, ipso facto, future employers].

Let’s start with something familiar: data is like [insert favorite embellishment]. Unless you’ve spent the last couple of decades driving around Mars in one of those cool rovers (https://mars.nasa.gov/mer/), you’ve heard it, seen it, and, whether you know it or not, experienced the wonders of technology, data, and machine learning. Remember spam email? Me neither. What about paper maps? Or better yet, for those of us who have driven around India, how about navigating by landmarks such as dhabas (roadside eateries) or statues? Ok, that one I still remember but, thankfully, don’t have to employ any more. Then, there is the breakthrough I am really praying for, self-driving cars, so I don’t have to teach my daughters how to drive. Sure, there are lots of other benefits, but mostly it’s the not having to teach driving to my daughters.

Behind these and so many other innovations, there is a village of smart people who make it all possible. For the purposes of this article, however, I focus on those working on the data-related aspects, mainly the data scientists and ML engineers. Let me paint a mental picture for you: imagine a person sitting in front of multiple curved monitors, effortlessly piecing together huge amounts of data from multiple sensors, crunching the numbers using the deepest learning model on the clouds of Venus and spitting out ideas on how to feed the world. That’s what data scientists dream of, besides electric sheep. Back in reality, the data scientist is most likely frustrated because the model keeps crashing due to the multiple misspellings of the word “misspelling” [for those curious, Oxford University Press keeps a list of the most commonly misspelled English words: https://oupeltglobalblog.com/2010/09/30/20-most-commonly-misspelt-words-in-english/]. Yes, for the most part, the work of a data scientist consists of unglamorous activities including data vacuuming, data gluing, data cleaning, and data massaging, and only a small part is spent discerning those ever-important patterns from data that become the smarts of the innovation.

The following picture illustrates a “day in the life” of a data scientist.

[Image: a "day in the life" of a data scientist]

This is a simplified picture. It only shows the mechanics. There is a lot of planning and thinking that goes on behind the scenes, like “Is there enough data?” or “What is the right performance metric?” or “How do I collaborate?” or “How do I reduce/eliminate biases from the model?” Given all this, wouldn’t it be nice if we had some way to make the life of the data scientist easier, even a little bit? Enter data science platforms. Here’s how I think about data science platforms: a data science platform is a collection of connected tools that make a data scientist’s life easier.

Breaking down the definition a bit more, the collection of tools refers to tools that enable data scientists to do the mechanics of accessing data, wrangling, modeling, deploying, and operating. There are a lot of such tools. Which brings me to the second important aspect of the definition: the connected part. This is about the experience a data scientist will have with the platform. I must admit that I am overloading the term connected to imply 3 different, yet related, concepts. Concept 1: In September 2021, Matt Turck, John Wu and FirstMark released the Machine Learning, AI, and Data (MAD) landscape shown below [reference: https://mattturck.com/data2021/, high res version: https://46eybw2v1nh52oe80d3bi91u-wpengine.netdna-ssl.com/wp-content/uploads/2021/12/2021-MAD-Landscape-v3.pdf].

[Image: the 2021 Machine Learning, AI, and Data (MAD) landscape]

This depicts the myriad of tools, companies, products, etc., available for data scientists to do what data scientists do. Calling this an eye chart would be like saying VY Canis Majoris is big. This landscape is massive, and it is very easy to get lost among the offerings, with their often-imperceptible differences. This is where the connectedness property of a [good] data science platform becomes a lifesaver: it enables the data scientist to do the various activities in a single environment, with a choice of tools that automagically interoperate with each other to provide a seamless experience.

Taking this argument further gets us to concept number 2 for connectedness: data science platforms need to be extensible, i.e., data scientists should be able to add new tools to the toolbox and integrate them into their existing experience. This is especially critical given the pace at which the MAD landscape is evolving [again, refer to Matt Turck’s website where he has MAD landscapes from previous years]. No data science platform can keep up with that pace and offer canned integrated solutions, hence the need for self-serve integration.
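To make the self-serve integration idea a little more concrete, here is a minimal sketch of a toolbox that users can extend on their own. Everything in it is an assumption made for illustration: `ToolRegistry`, `register`, `run`, and the two toy tools are hypothetical names and do not come from any real data science platform.

```python
from typing import Callable, Dict, List

# Hypothetical names throughout: ToolRegistry, register, and run are
# illustrative only and are not taken from any real platform.
Tool = Callable[[list], list]


class ToolRegistry:
    """A toy 'platform': a toolbox that users can extend themselves."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, name: str) -> Callable[[Tool], Tool]:
        """Return a decorator that adds a tool to the toolbox under `name`."""
        def decorator(tool: Tool) -> Tool:
            if name in self._tools:
                raise ValueError(f"tool {name!r} is already registered")
            self._tools[name] = tool
            return tool
        return decorator

    def run(self, steps: List[str], data: list) -> list:
        """Apply the named tools to `data` in the order given."""
        for step in steps:
            data = self._tools[step](data)
        return data


platform = ToolRegistry()


@platform.register("drop_missing")
def drop_missing(rows: list) -> list:
    # A tool that ships with the toy platform: discard rows that are None.
    return [r for r in rows if r is not None]


@platform.register("lowercase")
def lowercase(rows: list) -> list:
    # A tool added later by a user; it plugs in through the same interface.
    return [r.lower() for r in rows]


cleaned = platform.run(["drop_missing", "lowercase"], ["Wheat", None, "COTTON"])
print(cleaned)
```

The point of the sketch is only that a new tool joins the existing workflow through the same small interface as the built-in ones, so the platform itself does not have to ship every integration.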

Which brings us to concept number 3 tied to connectedness: collaboration. Data science is a team sport, and for collaboration to occur, people don’t even have to be part of the same team or organization. What is key is the ability to share work [code, model parameters, results, etc.] with the broader community, whether within your team, across your organization, or outside it, for reuse or as a foundation. A good data science platform makes it easy for users to share their work.

So, there we have it. A data science platform has made life easier, and problem solved. Well, I did mention an evil part. That has to do with over-reliance on data science platforms and forgetting what Uncle Ben taught us about great power and great responsibility. Data science platforms should not be used to circumvent the data science process. The “science” in data science demands that we go through a rigorous process of model design, keeping in mind the various subtleties of variance, bias, ethics, interpretability, etc., which may not be fully represented in data science platforms. The record-breaking pace at which vaccines were developed for COVID-19 has taught us that the mechanics can be accelerated without compromising the science. Here’s to hoping that data-driven science sees the same acceleration in mechanics and enhancement in cognition, with platforms bolstering the journey all along. Now, I’ll be getting back to finding a solution for life, the universe, and everything!

Rohit Meheta

Data & Analytics Leader | Analytics Solution Architect | Accelerating Data & Digital Journey for Industry Leaders using Cloud and AI capabilities

2 years ago

Great read and learnings, Naveen Singla - thanks for sharing! It also made me chuckle a few times!

I love the point about attending to unglamorous aspects of data science as an integral part of the process. I've seen time and time again companies structure linear projects, like "first we get the data right...then do analytics...then profit." This linear way of thinking is why so many projects get stuck in the data phase, and never move to model deployment and monitoring.
