DATA DRIVEN SCIENCE
source: https://content.sciendo.com/view/journals/jdis/4/2/article-p79.xml


Clustering algorithms, among the most popular techniques in unsupervised machine learning today, are ubiquitous across scientific fields. With plenty of color and configuration options, they can give scientists a wealth of insight from a single chart, enriching presentations and white papers. The charts can be considered masterpieces in themselves.

Multi-colored, 3-D charts are produced by algorithms that were once the state of the art in complexity, yet today they can be written in a few lines of code. Know the right packages to import and a handful of functions, and voilà: your slide is ready to surprise your audience.
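As a sketch of just how little code such a chart now takes, here is a minimal clustering example. The dataset, parameters, and library choice (scikit-learn) are illustrative assumptions, not a prescription:

```python
# A few lines now suffice for a clustering result that once
# required state-of-the-art algorithmic work.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Illustrative data: three well-separated Gaussian blobs in 2-D.
points = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in [(0, 0), (3, 3), (0, 3)]
])

# Fit k-means and label every point with its cluster.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# Coloring a scatter plot by `labels` gives the familiar
# multi-colored chart, e.g. plt.scatter(*points.T, c=labels).
print(sorted(np.bincount(labels)))
```

On data this cleanly separated, the three clusters are recovered with one function call; the scientist never touches the algorithm's internals.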

But things go far beyond that.

We can say that machine learning and deep learning algorithms and structures, born from data science, have made this branch of computer science an intrinsic part of every quantitative science, besides being a science in itself. There is much we could say about data science, even as critics deny its scientific status. But here I would like to focus on an interesting phenomenon: data, and the science built on them, have changed the inner core of the other sciences.

Laboratory experiments and testing, which lead to measurement, have always been part of this core. We can even say science was born from experiment. And between experiment and theory, measurement and results have apparently created a science of their own. Here we saw statistics, probability, numerical analysis and other disciplines develop almost from zero; today all of them, beyond being core topics themselves, act like a compass that can steer the core itself.

I mean that data – algorithms, structures, packages, and all the related insights – have been changing the way science itself happens and is done. More and more measurements depend on huge datasets that can only be handled with equally powerful data tools. An experiment with a few small datasets is worth far less than one with a massive data ensemble that can only be processed with the right machine learning package and tool. This means that if you are a scientist – an astronomer, an applied mathematician, an environmental biologist, an organic chemistry engineer, a game theory practitioner – you must also be a little (or not so little) of a data scientist.

Of course, ML packages were originally created and tested by computer scientists. But this has changed: inspect the source code of the most recent developments and you will find physicists, medical doctors, and molecular specialists among the code authors. No, they are not abandoning their core areas: they are developing machine learning code as part of their science. Code that makes their results faster, more robust, and safer – and, yes, code that is made publicly available to be used, reused, and eventually updated to evolve and improve.

When the free software concept exploded back in the 1990s as a way for computer science to evolve free of the powerful software corporations, nobody imagined it would reach this far. Proprietary developers versus free software developers; computer scientists earning high corporate salaries against idealists coding for free at home when they should have been sleeping. Good products were created, but not as many as the movement desired. Then data science arrived and told the geologist that she could create her own custom software without mastering the details of algorithm development. And, best of all: software she could build without having to explain the core concepts of her science to a computer expert, with all the risk of misunderstanding that entails.

So we can say that today the applied and quantitative sciences have inverted the flow: in the beginning, concepts flowed from science to computer science, and software flowed the opposite way. Today, scientists generate software that is used by other scientists and even by computer experts.

This has changed the way their science happens: the timing of results, the proofs of concept, the experimental evaluation. Practice and theory have always fed each other, but data have evolved and gained the status of a third leg. This resembles grid computing, where many computers ran some data-intensive software as one giant multiprocessing machine; today you have many scientists producing small (or not so small) pieces of code for specific purposes, but always publicly available to all.

And the role of computer scientists remains important, too. They usually evolve those pieces of code, making them better, faster, and scalable to face the growing amount and size of data. Brothers in science, we can say, helping each other – to everyone's benefit.
