Rise of the (Data Science) Robots
I started out at university studying Molecular Genetics and for a long time considered doing a doctorate and building a career in that field, but the thing that ultimately stopped me was the pipette. A micro-pipette is a simple device that enables users to measure tiny amounts of liquid accurately and consistently. Experimental work might require the preparation of hundreds or thousands of individual concoctions, each of which needs to contain a mix of ingredients in proportions as precise as possible. The pipette is essential to minimise variation due to human error.
Our final year projects sometimes involved setting up a lot of samples and people's thumbs would be aching after a few days of pipetting. Post-grads and research staff suffered in a similar way even though they sometimes got to use more expensive multi-channel pipettes which allow a block of many receptacles to be prepared in parallel, significantly reducing the duration of a large task.
A single-channel micro-pipette and a 64-point multi-channel micro-pipette
Around the same time I watched a BBC documentary showing US labs where researchers had robotic pipette machines that could prepare batches of test samples automatically. The pipetting that a researcher might do manually in a week could be done using one of these machines in a few hours. When, I wondered, would the labs at my university have one of those?
A Tecan 'Freedom Evo Clinical' Laboratory Pipetting Robot
A Career in Futility
This got me thinking more broadly about the automation of traditionally manual tasks, the division of labour, the availability of the latest equipment and the effect of funding on the productivity of research staff. A researcher in a poorly funded facility would spend countless hours over the course of their career executing manual tasks like pipetting that scientists in better funded labs could leave to machines. Working outside a top-class lab seemed to me then to be something of a career in futility.
This insight left me with three choices - leave Ireland to pursue a top-flight research career abroad, stay in Ireland to pursue a probably second-rate career, or give up on research altogether. I decided to take a year out to try other options. The internet revolution was picking up speed and soon I got sucked into programming and a career in IT - a progression that I have never looked back on with regret.
Some of my classmates who pursued research careers in Ireland did well enough but many eventually gave up and got ‘proper jobs’, leveraging their experience in medical or pharma sales, education or manufacturing. Some who went abroad were very successful but to maintain that success they had to stay abroad. The careers of those who returned were never quite the same.
Another Robot-Pipette Moment
Now why am I sharing all of this? Because recently I had another 'robot pipette' moment, and its consequences could affect many of my customers and colleagues in the data science business.
As a ‘data engineer' I have worked with a lot of ‘data scientists' – professionals who use machine learning algorithms that identify patterns in data to generate ‘models’ that anticipate future outcomes.
Some algorithms are proprietary and are associated with particular tools and vendors, but most are ‘open source’ IP - developed by university mathematicians, published in academic journals and implemented as programs in analytic tools like RapidMiner or SAS. Data scientists don't write these algorithms; rather, they provide value by using their knowledge and experience to quickly determine what data is required and the right kind of algorithm to use for a given problem.
Having applied pattern-finding algorithms to the source data, the next step in the cycle is to assess the effectiveness of the candidate model. Data scientists typically repeat the process, perhaps selecting a different algorithm, tweaking the parameters or using different variations of the source data, until they arrive at the most measurably useful ‘model’.
A Traditional Data Science Process
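To make that cycle concrete, here is a minimal sketch in Python using scikit-learn. The built-in dataset, the choice of algorithm and the evaluation metric are all illustrative assumptions rather than a recommendation for any particular problem.

# One pass through the manual modelling cycle: fit a candidate
# algorithm, measure it on held-out data, then go back and tweak.
from sklearn.datasets import load_breast_cancer          # stand-in for real source data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"candidate model AUC: {score:.3f}")               # not good enough? tweak and repeat

In practice the data scientist loops back from that last line many times, swapping the algorithm or its parameters - which is exactly the repetition described above.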
While modelling tools like SAS have been around for decades, data science has really emerged from the basement in recent years and become increasingly commodified. Freeware like RapidMiner and open source coding languages like R have democratised the technology and opened up a closed world to everyone. There is a fairly finite set of commercially applicable algorithms and a plethora of journals and user groups that will quickly direct you to methods and approaches that are tried and tested for most scenarios.
Each algorithm has a finite range of parameters that can be set (e.g. the maximum number of levels in a decision tree) and parameter changes are limited by diminishing returns. There is only rarely much benefit to be gained from making large alterations to default settings and in many cases data scientists don’t change the default settings at all – ever!
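To illustrate those diminishing returns, the short Python sketch below sweeps a single decision-tree parameter with scikit-learn on a stand-in dataset; the particular depths chosen are an assumption for illustration only.

# Sweep one parameter (max_depth) and watch the cross-validated
# score flatten out - the diminishing returns described above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)     # stand-in dataset
for depth in [2, 4, 8, 16, 32, None]:          # None falls back to the library default
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean accuracy={score:.3f}")

Beyond a certain depth the score barely moves, which is why the default settings so often survive untouched.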
The most valuable thing you can do to improve machine learning outcomes is actually to improve the data that is fed into the algorithms – but that challenge is best tackled with a combination of deep knowledge of business processes and data engineering skills, not high-end statistical expertise. Data scientists agree, with 60% of them saying that preparing and organising data is their least favourite part of the job. The bit that data scientists really love is 'modeling data, mining data for patterns, and refining algorithms'.
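By way of contrast, here is a minimal pandas sketch of that preparation work; the file and column names are hypothetical and stand in for whatever the business data actually looks like.

# The unglamorous part: cleaning, filling gaps and deriving
# business-meaningful features before any modelling happens.
import pandas as pd

raw = pd.read_csv("transactions.csv")          # hypothetical extract of source data
raw = raw.drop_duplicates()
raw["amount"] = raw["amount"].fillna(raw["amount"].median())

# Derive the kind of feature a business expert would suggest.
features = raw.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    visits=("amount", "count"),
)
features["spend_per_visit"] = features["total_spend"] / features["visits"]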
Robot Data Science
Theoretically it has always been possible to automate the data science process by simply feeding the training data into all possible variations of all algorithms and comparing all of the results to find the best one. This is a ‘brute force’ approach, historically discounted due to the prohibitive computational costs.
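A toy version of that brute-force idea can be written in a dozen lines with scikit-learn's GridSearchCV; commercial services go much further (automated feature engineering, ensembling and so on), so treat this only as an illustration of the principle.

# Brute force in miniature: try several algorithms with several
# parameter settings each, and keep whichever scores best.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)     # stand-in dataset
candidates = [
    (DecisionTreeClassifier(), {"max_depth": [3, 5, 10, None]}),
    (RandomForestClassifier(), {"n_estimators": [50, 200]}),
    (LogisticRegression(max_iter=5000), {"C": [0.1, 1.0, 10.0]}),
]

best_score, best_model = 0.0, None
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc").fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, round(best_score, 3))

The cost is simply that the number of model fits grows multiplicatively with the number of algorithms and parameter values - which is why cheap, elastic cloud compute changes the economics.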
However, I recently became aware of new services like DataRobot and XpanseAI that do exactly that. You load your data set into the cloud-based service, tell it what your success criteria are, and it churns through different representations of the data with different algorithms and compares the results, ultimately identifying the ideal input data and model to achieve the best outcome.
The New 'Automated' Data Science Process
Offering Data Science as a service is not new – there are many data science consultancy firms that will take in a data set and a business problem and hand you back a model a few weeks later - but they all do it manually, and their services are slow and expensive. These new cloud-based services cut out the manual effort and minimise the need for modelling expertise by relying mostly on brute force automation.
The hardware and computational capability lies in the cloud, so customers don't need any infrastructure, software or modelling expertise of their own and the resultant models are provided back to customers in common code formats so that they can be easily implemented on most systems.
Conclusion
DataRobot and XpanseAI are just two of a number of vendors now offering automated data science services, and more are appearing. No doubt these kinds of services will become ever more powerful, ever cheaper and ever more accessible. In a relatively short time they are certain to become a standard offering on cloud data platforms like Azure and AWS.
My friends and colleagues who work as data scientists should take a long hard look at this trend - it might be time to think about a career adjustment.
John Thompson is a Managing Partner with Client Solutions Data Insights Division. His primary focus for the past 15 years has been the effective design, management and optimal utilisation of large analytic data systems.