Rise of the (Data Science) Robots

I started out at university studying Molecular Genetics, and for a long time I considered doing a doctorate and building a career in the field. The thing that ultimately stopped me was the pipette. A micro-pipette is a simple device that enables users to measure tiny amounts of liquid accurately and consistently. Experimental work might require the preparation of hundreds or thousands of individual concoctions, each of which needs to contain a mix of ingredients in proportions as precise as possible. The pipette is essential to minimise variation due to human error.

Our final-year projects sometimes involved setting up a lot of samples, and people's thumbs would ache after a few days of pipetting. Post-grads and research staff suffered in a similar way, even though they sometimes got to use more expensive multi-channel pipettes, which allow a block of many receptacles to be prepared in parallel, significantly reducing the duration of a large task.

[Image: A single-channel micro-pipette and a 64-point multi-channel micro-pipette]

Around the same time I watched a BBC documentary showing US labs where researchers had robotic pipette machines that could prepare batches of test samples automatically. The pipetting that a researcher might do manually in a week could be done using one of these machines in a few hours. When, I wondered, would the labs at my university have one of those?

[Image: A Tecan 'Freedom Evo Clinical' laboratory pipetting robot]

A Career in Futility

This got me thinking more broadly about the automation of traditionally manual tasks, the division of labour, the availability of the latest equipment, and the effect of funding on the productivity of research staff. A researcher in a poorly funded facility would spend countless hours over the course of their career executing manual tasks like pipetting that scientists in better-funded labs could leave to machines. Working outside a top-class lab seemed to me then to be something of a career in futility.

This insight left me with three choices: leave Ireland to pursue a top-flight research career abroad, stay in Ireland and settle for a probably second-rate one, or give up on research altogether. I decided to take a year out to try other options. The internet revolution was picking up speed, and soon I got sucked into programming and a career in IT, a progression that I have never looked back on with regret.

Some of my classmates who pursued research careers in Ireland did well enough, but many eventually gave up and got 'proper jobs', leveraging their experience in medical or pharma sales, education or manufacturing. Some who went abroad were very successful, but to maintain that success they had to stay abroad. The careers of those who returned were never quite the same.

Another Robot-Pipette Moment

Now why am I sharing all of this? Because recently I had another 'robot pipette' moment, and its consequences could affect many of my customers and colleagues in the data science business.

As a 'data engineer' I have worked with a lot of 'data scientists': professionals who use machine learning algorithms to identify patterns in data and generate 'models' that anticipate future outcomes.

Some algorithms are proprietary and associated with particular tools and vendors, but most are 'open source' IP: developed by university mathematicians, published in academic journals, and implemented as programs in analytic tools like RapidMiner or SAS. Data scientists don't write these algorithms; rather, they provide value by using their knowledge and experience to quickly determine what data is required and which kind of algorithm is right for a given problem.

Having applied pattern-finding algorithms to the source data, the data scientist's next step in the cycle is to assess the effectiveness of the candidate model. They typically repeat the process, perhaps selecting a different algorithm, tweaking the parameters, or using different variations of the source data, until they arrive at the most measurably useful 'model'.
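To make that cycle concrete, here is a minimal sketch of the manual loop. It assumes Python and scikit-learn (the article mentions RapidMiner, SAS and R, but the workflow is the same) and uses a bundled demo data set in place of real source data:

```python
# A minimal sketch of the manual modelling cycle: fit a candidate
# model, assess it on held-out data, then tweak and repeat.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: choose an algorithm and fit a candidate model.
model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

# Step 2: assess the candidate's effectiveness on unseen data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 3: change the algorithm, the parameters or the input data,
# and repeat until the most measurably useful model emerges.
```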

[Image: A Traditional Data Science Process]

While modelling tools like SAS have been around for decades, data science has really emerged from the basement in recent years and become increasingly commodified. Freeware like RapidMiner and open source languages like R have democratised the technology and opened up a once-closed world to everyone. There is a fairly small set of commercially applicable algorithms, and a plethora of journals and user groups will quickly direct you to methods and approaches that are tried and tested for most scenarios.

Each algorithm has a finite range of parameters that can be set (e.g. the maximum number of levels in a decision tree), and tuning them is subject to diminishing returns. Large alterations to the default settings rarely yield much benefit, and in many cases data scientists don't change the defaults at all – ever!
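As an illustration of how little the defaults can matter, here is a hedged sketch, again assuming scikit-learn, that compares an all-defaults decision tree against one with a hand-tuned depth:

```python
# Compare a decision tree with all default settings against one with
# a hand-tuned max_depth; the gap is often small, which is the
# 'diminishing returns' point made above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "defaults": DecisionTreeClassifier(random_state=0),
    "max_depth=4": DecisionTreeClassifier(max_depth=4, random_state=0),
}
for name, clf in candidates.items():
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```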

The most valuable thing you can do to improve machine learning outcomes is actually to improve the data that is input to the algorithms – but that challenge is best approached with a combination of deep knowledge of business processes and data engineering skills, not high-end statistics. Data scientists agree, with 60% of them saying that preparing and organising data is their least favourite part of the job. The bit that data scientists really love is 'modeling data, mining data for patterns, and refining algorithms'.
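By way of illustration, here is a small, purely hypothetical sketch (pandas assumed; the column names are invented) of the kind of business-informed data preparation that often moves the needle more than algorithm tuning does:

```python
# Hypothetical customer data: domain knowledge says recency, tenure
# and basket size predict behaviour better than the raw columns do.
import pandas as pd

raw = pd.DataFrame({
    "signup": pd.to_datetime(["2018-01-01", "2019-05-01"]),
    "last_purchase": pd.to_datetime(["2019-01-10", "2019-06-01"]),
    "total_spend": [1200.0, 90.0],
    "n_orders": [24, 3],
})

as_of = pd.Timestamp("2019-07-01")
features = pd.DataFrame({
    "days_since_purchase": (as_of - raw["last_purchase"]).dt.days,
    "tenure_days": (raw["last_purchase"] - raw["signup"]).dt.days,
    "avg_order_value": raw["total_spend"] / raw["n_orders"],
})
print(features)
```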

Robot Data Science

Theoretically it has always been possible to automate the data science process by simply feeding the training data into all possible variations of all algorithms and comparing all of the results to find the best one. This is a ‘brute force’ approach, historically discounted due to the prohibitive computational costs.
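A toy version of that brute-force idea, again assuming scikit-learn, might exhaustively cross-validate several algorithms and parameter settings and keep the best performer. Real services search far larger spaces, including transformations of the input data:

```python
# Brute force in miniature: try every combination of algorithm and
# parameter in a declared search space, score each by cross-validation
# and keep the winner.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("model", DecisionTreeClassifier())])
search_space = [
    {"model": [DecisionTreeClassifier()], "model__max_depth": [2, 4, 8, None]},
    {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1.0, 10.0]},
]
search = GridSearchCV(pipe, search_space, cv=5).fit(X, y)
print("best model:", search.best_params_)
print("best score:", round(search.best_score_, 3))
```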

However, I recently became aware of new services like DataRobot and XpanseAI that do exactly that. You upload your data set to the service in the cloud, tell it what your success criteria are, and it churns through different representations of the data with different algorithms, compares the results, and ultimately identifies the ideal input data and model to achieve the best outcome.

[Image: The New 'Automated' Data Science Process]

Offering data science as a service is not new: many data science consultancy firms will take in a data set and a business problem and hand you back a model a few weeks later. But they all do it manually, and their services are slow and expensive. These new cloud-based services cut out the manual effort and minimise the need for modelling expertise by relying mostly on brute-force automation.

The hardware and computational capability lies in the cloud, so customers don't need any infrastructure, software or modelling expertise of their own, and the resulting models are provided back to customers in common code formats so that they can be easily implemented on most systems.
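The article doesn't say which formats these particular vendors return, so as a purely hypothetical sketch, here is what that hand-off might look like using ONNX, one widely supported interchange format (scikit-learn and the skl2onnx package assumed):

```python
# Export a trained model to ONNX, a portable format that can be
# scored on most systems without the original training stack.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=4).fit(X, y)

onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```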

Conclusion

DataRobot and XpanseAI are just two of a number of vendors now offering automated data science services, and more are appearing. No doubt these kinds of services will become ever more powerful, ever cheaper and ever more accessible. In a relatively short time they are certain to become a standard offering of cloud data platforms like Azure and AWS.

My friends and colleagues who work as data scientists should take a long hard look at this trend - it might be time to think about a career adjustment.

Questions on Data Strategy, Data Warehousing, Data Integration, Data Quality, Business Intelligence, Data Management or Data Governance? Click Here to begin a conversation.

John Thompson is a Managing Partner with Client Solutions Data Insights Division. His primary focus for the past 15 years has been the effective design, management and optimal utilisation of large analytic data systems.
