Don't steal from Statistics!

This text presents a personal, nuanced perspective on the nature and origins of data science. It is also a cry against those who steal from the giants whose past contributions put data science where it stands today!


Unveiling the roots of Data Science: a historical and multidisciplinary perspective

As we know it today, data science stands at the forefront of technological innovation and business strategy. Yet, despite its contemporary allure, it is far from being a novel domain. Yesterday I took part in a panel discussion on “What is Data Science?”, organised by Universidade Nova de Lisboa at The Knowledge Hub Universities. There I delved into the historical evolution of data science, arguing for its recognition as both an autonomous branch of science and an inheritor of a deep statistical tradition.


The emergence of Data Science

The term “data science” gained momentum post-2014 as data storage costs plummeted, catalysing the autonomisation of information management fields like machine learning, data analytics, business intelligence, and artificial intelligence. This convergence has led to the proliferation of techniques and methods reshaping industries and scientific inquiry. However, to consider data science as a groundbreaking discipline means overlooking the rich tapestry of its origins.


A statistical lineage

Long before the advent of massive computational power and big data, the field of statistics provided the foundational techniques that underpin many algorithms we now associate with machine learning. Logistic regression, linear regression, and clustering are not inventions of the modern era but are statistical methods repurposed and rebranded within the machine learning lexicon.
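To make this lineage concrete, here is a minimal sketch, not taken from the panel discussion, that fits the same straight line in two ways: with the textbook least-squares formulas from classical statistics, and with the iterative gradient descent loop typical of machine learning tutorials. The function names, the synthetic data, and the learning rate are illustrative assumptions.

```python
import random


def fit_closed_form(xs, ys):
    """Classical statistics: least-squares estimates from sample moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_gradient_descent(xs, ys, learning_rate=0.1, steps=5000):
    """'Machine learning' framing: minimise mean squared error iteratively."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        residuals = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2.0 / n * sum(residuals)
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


# Synthetic data (an assumption for illustration): y = 1.5 + 2.0 * x + noise
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
ys = [1.5 + 2.0 * x + random.gauss(0.0, 0.1) for x in xs]

stats_fit = fit_closed_form(xs, ys)
ml_fit = fit_gradient_descent(xs, ys)
print("closed form:     ", stats_fit)
print("gradient descent:", ml_fit)
```

Both routines minimise the same squared-error criterion, so on well-behaved data like this they should arrive at essentially the same intercept and slope. Only the route to the answer differs.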


The misappropriation of techniques

Nate Silver, a renowned statistician and the founder of FiveThirtyEight, has been vocal about statistical techniques being oversold as machine learning innovations, which leads to a lack of transparency and accountability in how these methods are used (https://hbr.org/2013/09/nate-silver-on-finding-a-mentor-teaching-yourself-statistics-and-not-settling-in-your-career). His perspective aligns with the sentiment that the term “data scientist” is often used to glamorise the role of statisticians.

Gil Press notes that some regard the term “data scientist” as slightly redundant. He argues that, historically, most data scientists were statisticians or had strong backgrounds in statistics, and that tensions between statisticians and computer scientists working in data science still exist but may be easing. Press emphasises that data science is a team sport, akin to baseball, in which very few individuals can competently play every position. He also points to a skills shortage in analytics and data management, and to concerns about over-specialisation and diluted skills among statisticians as the field evolves (https://www.thedigitaltransformationpeople.com/channels/analytics/what-the-heck-is-data-science-anyway/).

The 2015 American Statistical Association (ASA) statement on the role of statistics in data science describes statistics as one of three foundational communities in data science, alongside database management and distributed and parallel systems. The statement underscores the importance of collaboration among statisticians, data scientists, and other professionals to maximise the potential of data science. It acknowledges that data science encompasses more than just statistics, while recognising the critical role that statistical science plays in this rapidly growing field. The ASA encourages multifaceted collaboration to enhance researchers’ abilities to accumulate knowledge and address the challenges faced in data science and artificial intelligence (https://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/#:~:text=For%20data%20science%20to%20fully,knowledge%20and%20obtain%20better%20answers).


The technological evolution

Indeed, some methods, such as decision trees and ensemble methods like random forests, emerged distinctly within machine learning to harness the brute force of modern computation. Yet it is essential to recognise that some of these techniques are based on heuristics and lack the formal proof and rigorous scientific foundation that traditional statistical methods possess.


The true nature of Data Science

Data science is a multidisciplinary effort: a fusion of statistics, computer science, and domain expertise. By leveraging technology, we have accelerated and automated processes, but this does not give data scientists ownership of statistical techniques.


A call for recognition and collaboration

As we continue to advance in this exciting domain, it is crucial to acknowledge the shoulders of giants upon which we stand – the statisticians who crafted the tools we now readily employ. Data science is not a usurper but a beneficiary of statistical wisdom.

Let us not misconstrue the past as we forge the future. Data scientists, statisticians, and other professionals must collaborate, recognising the multidisciplinary essence of our field. Only this acknowledgement will allow us to push the boundaries of what is possible with data.

Rui Gouveia

Senior Technical Specialist in Statistics at INE

10 months ago

At a time when new models are generated every day, we cannot forget those produced by traditional statistics, both because of the contribution they have made and still make, and because the models generated by data science need to be tested and consolidated. This holds despite the many contributions of data science to understanding phenomena and generating value. We say this knowing the gigantic resources that data science has and that statistics did not.

More articles by Jorge M. Mendes
