How do we decide what Data Science is?
If you ask a group of data scientists what they do and how they do it, it’s unlikely that everyone will agree. So how do we decide what "Data Science" is?
When we discuss data science, we’re falling into the trap of talking about a number of different disciplines and roles with the same term, whilst failing to recognise other things as being data science.
My view is that, as a set of professionals in an emerging discipline, we need to start to set the boundaries of what we do.
When I was at primary school in the 80s, the autumn of 2015 was firmly a long way in the future - a future full of hover boards, compost fuel and flying cars. For all the things that Back to the Future got wrong, there are plenty of things it got right. I remember thinking that hand-held computers, video conferencing and flat-screen TVs seemed like the fiction they were meant to be.
Our lives are very different today: they are filled with technology and would not be the same without our tablets, smart phones or 4K TVs.
In one of the “keynote” speeches at the International Congress of Mathematicians, held in Paris in August 1900, the German mathematician David Hilbert reviewed the progress of mathematics in the 19th century. He went on to look to the future and set out a mathematical roadmap: a list of unsolved problems, and a challenge to the world’s greatest mathematicians to solve them in their lifetime.
The “Hilbert problems” are widely acknowledged to have influenced a large portion of 20th century mathematics – a “Back to the Future” prediction for modern mathematics.
David Hilbert was a talented mathematician who spent his time looking for a set of universal rules that would provide the foundation for all of mathematics. To the general public, however, his legacy is less about his work in proof theory or mathematical logic and more about a set of questions that prompted international problem solvers to spring into action.
The Hilbert problems provided the motivation for some of the most important mathematical work of the last century and a history of modern mathematics would not be complete without a reference to these 23 problems.
The history of data science would not be complete without a mention of John W. Tukey. He set out his vision in “The Future of Data Analysis”, a seminal paper that was published in 1962. In that paper, Tukey talks about his interest in the mechanics of “data analysis”. He shared his fascination for the machinery needed to perform the analysis, focusing very much on the “what” and less so on the “so what”.
Tukey sets out a range of topics in his paper that today make up the core components of data science. In my view, data science is a mix of mathematics and statistics, data skills, storytelling and a deep understanding of the problem domain we're working with. In his own words, Tukey set out these pillars. He talks about how we need to think more about:
- Procedures for analysing data
- Techniques for interpreting the results of such procedures
- Ways of planning the gathering of data to make its analysis easier
- The machinery of mathematics which applies to analysing data
Tukey’s world was one without widespread computing, let alone tablets and smart phones. In a world before relational databases, open data, distributed computing, MapReduce and APIs, the thoughts John Tukey presented to the world were profound and quite prophetic.
Since Tukey’s death in 2000, data has become bigger and more varied, and it comes with a relentless need to deliver value and meaning in a world of fragmented stories.
The general principles that Tukey set out are as sound today as they were in 1962 and they still need as much innovation today as they did 50 years ago.
Tukey discussed the way in which data analytics, statistics and mathematics intertwine and questioned the role of each area in relation to the others. He laid down the rhetoric for modern-day data scientists to mark out the territory for the discipline. Tukey started the “what is data science” conversation, but we now look to others to help reach a consensus.
Are data scientists mathematicians? Are we mathematical modellers or statisticians, or both? Are we computer scientists who use some mathematics when it suits us, or are we mathematicians who can handle the complexity of data in our work? Are we the new incarnation of the Database Analyst, DBA or Sys Admin? Are we a unicorn, sent to confuse recruiters and senior management?
I’ve given a number of talks this year where I’ve referred to how “we talk about the same things in different ways and different things in the same way”. My increasing frustration is that we’re falling into the trap of talking about a number of different disciplines and roles with the same term, whilst failing to recognise other things as being data science.
Data Science must be about the use of scientific principles to abstract the specific case we find in our data to the general population, whilst understanding the issues we face with data sets which are exploding in size and width. Where data analysis was limited to a relatively narrow analysis of a data set in the 1960s, data science must seek to find ways to automate data exploration and help sift the general from the specific.
How do you set the boundaries for where data analysis stops and data science starts? How much should we be concerned with the distinction?
Data science is allowing us a view into problems previously too complex to investigate. Some of my work at Bloom with the Universities of Oxford and Strathclyde uses social media data to map out the complex structures that exist in cities. This is an example of an existing data source being used in different ways, at scale and in real time, to tackle previously intractable problems. Our motivation is to give civic leaders a real opportunity to monitor their decisions in real time.
Our aim is to look for the general features which affect cities all of the time, and our challenge in this work is to remove the noise, the uncertainty and the risk of our insights being skewed by temporal data. For the data science to succeed, we must navigate the path of abstraction from an increasingly noisy, unstable and error-prone data set, and we must use our mix of data and science skills to do this in a way that is acceptable to all stakeholders.
When the history of data science is finally written and agreed, we’ll have a chapter which succinctly records how we defined the boundaries of data science in a way which all parties are happy with. This definition will give credit to the talented professionals who are innovating and allow for a useful distinction, without creating silos, from other professional disciplines such as data governance, storage and creation.
As a profession, we need to come together and debate this issue - our definition of what we do will surely come from talking and discussing these very points that I raise in this article. It's going to be one interesting journey setting those boundaries and I'm looking forward to taking my part.
It would be great to know what you think - and great to hear your comments.
Image Credit: The Paris Night Sky by Benh LIEU SONG