Data Driven Science

We are living on the edge of an exponential growth of information (Figure 1). This information is composed of data generated by both humans and computers, and it is served not only to humans but, increasingly, to other computers [1], fueling the Internet of Things.

[Figure 1: the exponential growth of information]

Data Driven Science, which can be described as the systematic extraction of knowledge from data [1], brings this large availability of data into the field of science, making room for speculation about how the work of scientists as we know it will change.

According to Professor Hans Rosling, we need to:

“have access to the databases freely, make them searchable and, with a second click, make them in the graph format so they can be instantaneously understood” [2]

From this point of view, it is taken into consideration that data hold stories preserved in their patterns, far from human comprehension, that observation is information, and that what we as scientists should do is retrieve data, visualize it and analyse it [2].
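
As a concrete illustration, the sketch below follows that retrieve, visualize and analyse loop in Python. The file name and column names ("life_expectancy.csv", "year", "life_expectancy") are hypothetical placeholders, not a real dataset; this is a minimal sketch of the workflow, not a prescription.

```python
# A minimal sketch of the "retrieve, visualize, analyse" loop described above.
# The file name and column names are hypothetical placeholders, not a real dataset.
import pandas as pd
import matplotlib.pyplot as plt

# 1. Retrieve: load an openly available table into a dataframe.
df = pd.read_csv("life_expectancy.csv")

# 2. Visualize: one more click turns the table into a graph that can be read at a glance.
df.plot(x="year", y="life_expectancy", kind="line")
plt.title("Life expectancy over time")
plt.show()

# 3. Analyse: simple summary statistics as a first reading of the pattern.
print(df["life_expectancy"].describe())
```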

The problem with this perspective appears when we work with data from dynamic areas that cannot be easily segregated into independent categories [3]. For example, when humans are the object of an experiment, as in the life sciences or social sciences, inconsistent nomenclature, subjective opinions or measurement mistakes can seriously contaminate the data and propagate errors through it [2].


Data-Driven vs Hypothesis-Driven

In his book Factfulness, Professor Hans Rosling puts it this way:

“[…] I never trust data to 100% […] there’s always some uncertainty.” [4]

He admits, therefore, that the observation method of colorful graphs is solely a generator of hypotheses [5]. This generation of hypotheses is of great importance in astronomy [6], particle physics, computational linguistics and bioinformatics [7]. In these areas, whether through simulations, machine learning or the visualization of data, it can guide us to questions we would not think of otherwise [8].

Since data cannot always be trusted, it may seem obvious to conclude that it is safer not to base our science on data at all, but to make it purely hypothesis-driven instead. Nevertheless, since such hypotheses hang from nothing more than the scientist's imagination, we might end up creating an extra-terrestrial reality of our own.

Nevertheless, when the cycle of hypothesis formulation is inverted in this way, the same data should not be used both to generate and to validate a hypothesis, since the conclusions are tailor-made to those particular cases, resulting in unreproducible science [9]. In the Many Labs 2 project, a team of 200 psychologists repeated several studies from the psychology literature and found that only 14 out of 28 yielded the same results as originally reported [10]; similar findings have been made in economics [11] and medicine [12].
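
One simple way to respect that separation is to split the data once, explore one half to generate a hypothesis, and test it only on the untouched half. The sketch below assumes a hypothetical table ("study_data.csv") with "treatment" and "outcome" columns; it illustrates the principle rather than any particular study.

```python
# A minimal sketch of keeping hypothesis generation and validation on disjoint data.
# The file "study_data.csv" and its "treatment"/"outcome" columns are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("study_data.csv")

# Split once, up front: one half for exploring, one half locked away for confirmation.
explore = df.sample(frac=0.5, random_state=42)
confirm = df.drop(explore.index)

# Browse the exploration half only, e.g. "the treated group has a higher mean outcome".
print(explore.groupby("treatment")["outcome"].mean())

# Test that single, pre-specified hypothesis on the held-out half.
treated = confirm.loc[confirm["treatment"] == 1, "outcome"]
control = confirm.loc[confirm["treatment"] == 0, "outcome"]
print(stats.ttest_ind(treated, control, equal_var=False))
```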

What is needed, then, are scientists with computational thinking [6] who, besides using computational tools, remain critical of the results the algorithms give them.


The End of Theory

In a provocative article titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, the author explores the end of the need to create models, taking as his example Google's success in targeted advertising [13]. In that case there was no need for scientific models of human behaviour, nor for models of language in direct translation between languages, the machine being left to develop its own unspoken language [14]; just data, tons of data. The main argument is that we should forget taxonomy, ontology and psychology, since nobody knows why people do what they do, leaving room simply to track and precisely measure those behaviours:

“the petabytes will talk by themselves” [13]

An opposing opinion is defended by Mark Graham, who points to the digital trace as evidence that competing platforms tend to close their users' data within their own borders, originating subsets of information whose potential as a source for the social sciences is diluted [15]. If you also take into consideration that not all people relate to computers and the internet in the same way, you can end up with a non-representative sample that excludes the digitally illiterate from your scientific results.

In astronomy, we can collect petabytes of information with our telescopes, but we still need to know where to point them [6].

To sum up, the opinion of specialists who can give context and offer a deeper analysis of the subject under study is still needed [15].


The 4th Paradigm

According to the North American philosopher Thomas Kuhn, the history of science can be divided into two kinds of period: normal science and brief moments of revolutionary science. The first describes a scientist working within a scientific theory that is well accepted at the time, while the second occurs when the accumulation of anomalies that do not fit the prevailing theory culminates in a paradigm shift [16].

The first paradigm was linked to the age of empiricism, whose methodology was observation, formulation of hypotheses and experiments designed to refute them. The second paradigm was to create theories that could generalise and explain the empirical observations: scientific laws were coined in mathematical form, such as Newton's law of gravitation. The third paradigm is attributed to the middle of the 20th century, when scientists became able to run simulations of complex systems, such as the weather. This brings us to the proposed fourth paradigm, data-driven science, which gathers data from simulations, experiments and other sources in volumes that, without the aid of a computer, would be impossible for a human brain to assimilate. Powerful algorithms are used to find a searched-for object or a probable hypothesis. It should be seen as an additional method for dealing with big volumes of data.

This perspective is strongly questioned by Kim Kastens [17], who argues that the fourth paradigm is less a truly revolutionary mechanism than a consequence of an accumulation of data that overloads the scientific community's capacity to make sense of it.


Takeaways

Data do not always tell the full story, since they may suffer interference from hidden secondary effects, but they are the only means we have of disproving the hypotheses our imagination creates after observing an event.

Technologies such as data mining can help us visualize data and use it as a generator of hypotheses, keeping in mind that we should not build a model that merely fits the data already collected. We should therefore curb our desire to acquire ever more data, to avoid the illusion of having it all. Instead, we need to develop a critical computational thinking that explores this unprecedented computational power, since what we can get out of it depends on what we choose to ask.
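
As a toy illustration of that warning about models that merely fit the data already collected, the sketch below fits both a high-degree polynomial and a simple line to a few noisy points and compares their errors on fresh observations. All numbers are synthetic and the setup is an assumption made purely for illustration.

```python
# A small, synthetic illustration of overfitting: a degree-9 polynomial matches the
# collected points almost perfectly but tends to predict new observations worse than a line.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.2, size=x.size)   # the true trend is linear plus noise

overfit = np.poly1d(np.polyfit(x, y, deg=9))     # model tailored to the collected data
simple = np.poly1d(np.polyfit(x, y, deg=1))      # a restrained, plausible hypothesis

x_new = np.linspace(0, 1, 50)                    # "future" observations
y_new = 2 * x_new + rng.normal(scale=0.2, size=x_new.size)

print("overfit model, error on new data:", np.mean((overfit(x_new) - y_new) ** 2))
print("simple model, error on new data: ", np.mean((simple(x_new) - y_new) ** 2))
```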

[1] J. Crowell, "Insight Paper," Trexin, 11 March 2016. [Online]. Available: https://www.trexin.com/the-philosophy-and-process-of-data-science/. [Accessed 12 January 2019].

[2] P. Keil, "R-bloggers," R-bloggers, 2 January 2013. [Online]. Available: https://www.r-bloggers.com/data-driven-science-is-a-failure-of-imagination/. [Accessed 12 January 2019].

[3] T. Wang, "The Human Insights Missing From Big Data," TEDxCambridge, 2016.

[4] H. Rosling, Factfulness, New York: Flatiron Books, 2018.

[5] H. Rosling, "The Best Stats You've Ever Seen," TED Talks, 2006.

[6] T. Murphy, "The Conversation," 10 May 2017. [Online]. Available: https://theconversation.com/why-data-driven-science-is-more-than-just-a-buzzword-76949. [Accessed 6 February 2019].

[7] T. Murphy, "Data-driven Astronomy," The University of Sydney, [Online]. Available: https://www.coursera.org/learn/data-driven-astronomy. [Accessed 6 February 2019].

[8] V. Dhar, "Data science and prediction," Communications of the ACM, pp. 64–73, December 2013.

[9] K. Zhang, "The Conversation," The Conversation Trust, 13 December 2018. [Online]. Available: https://theconversation.com/how-big-data-has-created-a-big-crisis-in-science-102835. [Accessed 6 February 2019].

[10] E. Yong, "The Atlantic," The Atlantic Monthly Group, 19 November 2018. [Online]. Available: https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/. [Accessed 6 February 2019].

[11] J. P. A. Ioannidis, T. D. Stanley and H. Doucouliagos, "The Power of Bias in Economics Research," The Economic Journal, vol. 127, no. 605, pp. 236–265, 1 October 2017.

[12] J. P. A. Ioannidis, "Why Most Published Research Findings Are False," PLoS Med, 30 August 2005.

[13] C. Anderson, "Wired," Condé Nast, 23 June 2008. [Online]. Available: https://www.wired.com/2008/06/pb-theory/. [Accessed 5 February 2019].

[14] D. Coldewey, "TechCrunch," 22 November 2016. [Online]. Available: https://techcrunch.com/2016/11/22/googles-ai-translation-tool-seems-to-have-invented-its-own-secret-internal-language/?guccounter=1. [Accessed 6 February 2019].

[15] M. Graham, "The Guardian," 9 March 2012. [Online]. Available: https://www.theguardian.com/news/datablog/2012/mar/09/big-data-theory. [Accessed 6 February 2019].

[16] S. Godfrey, "Understanding Science: How Science Really Works," University of California, Berkeley, 2003. [Online]. Available: https://undsci.berkeley.edu/article/philosophy. [Accessed 6 February 2019].

[17] K. Kastens, "Earth and Mind: the Blog," 20 October 2012. [Online]. Available: https://serc.carleton.edu/earthandmind/posts/4thpardigm.html. [Accessed 6 February 2019].

[18] I. Wladawsky-Berger, "The Wall Street Journal," Dow Jones & Company, 2 May 2014. [Online]. Available: https://blogs.wsj.com/cio/2014/05/02/why-do-we-need-data-science-when-weve-had-statistics-for-centuries/. [Accessed 12 January 2019].

[19] M. E. McCue and A. M. McCoy, "The Scope of Big Data in One Medicine: Unprecedented Opportunities and Challenges," Frontiers in Veterinary Science, vol. 4, no. 194, 16 November 2017.
