What is the current thinking in complexity science, applied mathematics, and computational social science on the analysis of social media?
This is one of eleven primers that the organizing team of the WHO infodemiology conference (June/July 2020) prepared to feed into multidisciplinary discussions in working groups that were developing a public health research agenda. The primer is not intended to be an exhaustive review of the literature, but rather a rapid review and a starting point for discussion. I will be publishing the primers over the course of the next few weeks. I hope you find them useful as well. Thank you to colleagues from the Demand for Immunization Team at US CDC for participating in the preparation of this primer.
Definitions and Key Concepts
- Complexity science: An emerging multidisciplinary field for understanding complex physical, biological, and social systems. It acknowledges the limitations of traditional reductionist approaches used to understand complex systems (e.g. standard statistical methods based on averaging of a system’s many components). It provides an alternative framework by integrating the network of relationships between components within and between systems, and by accounting for uncertainty [1],[2].
- Emergent behavior: A term used in complexity science to describe a system’s behaviors that arise from the relationships between its components rather than from the components themselves [2].
- Computational social science: A new interdisciplinary field that studies human behavior and social interactions through the analysis of “big data” without relying heavily on traditional research methods used in social and behavioral sciences (e.g. surveys of narrowly defined populations). It exists at the intersection of varied disciplines, including social sciences, computer and information science, physics, and mathematics [3].
- Big data: A type of data that are high volume, high velocity (speed of data in and out), and high variety in terms of range of data types and sources [4].
- Unstructured data: A type of data that lacks the structural organization usually needed for analysis, including text, images, video, and audio data. Unstructured data constitute 95% of big data [5].
- Machine learning: The automated detection of meaningful patterns in data using computational programs that can “learn” and “train” themselves based on existing datasets. Examples of technologies that use machine learning include search engines and face detection for digital cameras [6].
Leveraging Emerging Insights from New Sciences and Big Data
In the three related fields of complexity science, computational social science, and applied mathematics, innovative methods that were traditionally not available to social and behavioral scientists are being used to analyze large volumes of unstructured data obtained from social media. A common characteristic of these methods is that many of them apply recent advances in artificial intelligence and machine learning [7]. For example, natural language processing (NLP), which refers to a range of computational techniques used for the automatic analysis of human language, has been used to detect online hate speech and fake news [8],[9],[10],[11]. Further, convolutional neural networks (CNNs), a class of machine learning models originally designed for processing image data, can also be applied to the analysis of visually driven social media such as Instagram [12]. Leveraging these computational tools, researchers are able to explore unprecedentedly high volumes of unstructured data on a global scale.
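To make the NLP idea concrete, here is a minimal sketch of one classic text-classification technique, a bag-of-words naive Bayes classifier with Laplace smoothing. It is not the method used in any of the cited studies. The labels, the six toy posts, and the function names are assumptions invented for illustration, and real systems train on far larger annotated datasets.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(labeled_posts):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_posts:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocabulary = model
    total_posts = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_posts)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data; labels are illustrative only.
TOY_POSTS = [
    ("miracle cure doctors hide the truth", "questionable"),
    ("secret remedy they do not want you to know", "questionable"),
    ("shocking cure banned by doctors", "questionable"),
    ("health agency publishes updated vaccination schedule", "reliable"),
    ("study reports results of randomized vaccine trial", "reliable"),
    ("clinic announces new opening hours for vaccination", "reliable"),
]

model = train_naive_bayes(TOY_POSTS)
print(classify("doctors hide secret cure", model))
print(classify("agency reports vaccine trial results", model))
```

The classifier only learns which words co-occur with which label, so it reflects whatever biases the training posts contain. This is one reason annotated data quality matters so much in this kind of work.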
Understanding offline human interactions and behaviors based on data extracted from the online world is another approach common to the three fields, particularly complexity science. For example, through the analysis of Twitter and credit card shopping data, complexity scientists demonstrated that online interactions were segregated by income just as physical interactions were [13]. Similarly, by using machine learning to analyze spatiotemporal metadata associated with Twitter posts (i.e., time posted and geolocation of users), it is possible to investigate the dynamics of illegal wildlife trade taking place physically [14]. By triangulating data sources and types, these disciplines provide insight into the nature of both offline and online worlds.
The frameworks and methods discussed above have been employed in various studies looking at public health information and misinformation, ranging from predicting the veracity of online rumors about the 2014 Ebola epidemic [15] to NLP-based analyses of over 200,000 online posts to examine pregnant women’s information-seeking behaviors [16]. Computational methods can also be used to identify and assess risks of negative health outcomes. A CNN-based model successfully estimated the risk of alcohol abuse based on images and texts people had shared on Instagram [17]. Likewise, a machine learning model trained on 44,000 child electronic health records identified children at risk of not being vaccinated [18]. Understanding online social networks is another way in which these disciplines contribute to public health. For example, a group of complexity scientists drew on the global pool of around three billion Facebook users to provide a system-level understanding of the contention between pro-vaccination, anti-vaccination, and undecided views [19]. They mapped the online ecology of clusters (i.e., Facebook pages and their members) holding differing vaccination views and concluded that anti-vaccination clusters are highly entangled with undecided clusters, while pro-vaccination clusters are more peripheral. They also used mathematical models to predict the conditions needed to prevent the spread of anti-vaccination narratives, including manipulating the rate at which links between sets of clusters are created [19].
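The notion of clusters being "entangled" can be illustrated with a very small sketch that counts links between clusters by stance. The cluster names, stances, and links below are entirely hypothetical and are not data from the cited study. The two summary measures are simple assumptions chosen for illustration, not the authors' own metrics.

```python
from collections import Counter

# Hypothetical clusters (e.g. pages), each labelled with a stance.
STANCE = {
    "A1": "anti", "A2": "anti", "A3": "anti",
    "P1": "pro", "P2": "pro",
    "U1": "undecided", "U2": "undecided",
    "U3": "undecided", "U4": "undecided",
}

# Hypothetical undirected links between clusters.
LINKS = [
    ("A1", "U1"), ("A1", "U2"), ("A2", "U2"), ("A2", "U3"), ("A3", "U4"),
    ("A1", "A2"), ("P1", "P2"), ("P1", "U1"),
]


def links_by_stance_pair(stance, links):
    """Count links for each unordered pair of stances."""
    counts = Counter()
    for a, b in links:
        pair = tuple(sorted((stance[a], stance[b])))
        counts[pair] += 1
    return counts


def undecided_reach(stance, links, camp):
    """Fraction of undecided clusters with at least one link to `camp`."""
    undecided = {c for c, s in stance.items() if s == "undecided"}
    reached = set()
    for a, b in links:
        # Check the link in both directions because it is undirected.
        for x, y in ((a, b), (b, a)):
            if stance[x] == camp and y in undecided:
                reached.add(y)
    return len(reached) / len(undecided)


pair_counts = links_by_stance_pair(STANCE, LINKS)
print(pair_counts[("anti", "undecided")], pair_counts[("pro", "undecided")])
print(undecided_reach(STANCE, LINKS, "anti"), undecided_reach(STANCE, LINKS, "pro"))
```

In this toy network the anti-vaccination clusters touch every undecided cluster while the pro-vaccination clusters touch only one. This is the kind of structural pattern the paragraph above describes, shown here on made-up data.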
A number of COVID-19-related studies and research protocols informed by the three disciplines are starting to be published [20]. For example, NLP approaches are being used to identify the main topics of COVID-19-related posts shared by Twitter users [21] and to understand their perceptions of mitigation policies [22]. Others have approached the topic with a broader scope. Using a range of computational tools, a group of researchers performed a comparative analysis of more than 8 million comments and posts collected from five social media platforms (Twitter, Instagram, YouTube, Reddit and Gab) [23]. Based on this analysis, they developed a model for characterizing the “reproduction numbers” of information for each platform. There are more opportunities for research in these areas, especially because researchers are openly sharing social media data sets. For example, there is a public repository of Twitter data containing more than 123 million tweets related to COVID-19, which is actively being updated on a weekly basis [24].
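The idea of a "reproduction number" for information borrows from epidemic modeling. As a rough sketch, and not necessarily the exact procedure of the cited study, a basic SIR-style model treats users as susceptible to a topic, actively spreading it, or no longer engaging. In that model the basic reproduction number is the ratio of the spreading rate to the rate at which users stop engaging. The parameter names `beta` and `gamma`, their values, and the simple Euler time-stepping are assumptions made for illustration.

```python
def basic_reproduction_number(beta, gamma):
    """R0 in a basic SIR model: spreading rate over stopping rate."""
    return beta / gamma


def simulate_sir(beta, gamma, initial_active=0.001, steps=2000, dt=0.1):
    """Euler-step a normalized SIR model and summarize the outbreak.

    s: fraction of users not yet reached, i: fraction actively
    spreading, r: fraction that has stopped engaging.
    """
    s, i, r = 1.0 - initial_active, initial_active, 0.0
    peak = i
    for _ in range(steps):
        new_spreaders = beta * s * i * dt
        new_stopped = gamma * i * dt
        s -= new_spreaders
        i += new_spreaders - new_stopped
        r += new_stopped
        peak = max(peak, i)
    return {"final_reached": r + i, "peak_active": peak}


# Hypothetical rates: one case with R0 above 1 and one below 1.
r0_above = basic_reproduction_number(beta=0.6, gamma=0.3)
r0_below = basic_reproduction_number(beta=0.2, gamma=0.4)
spread_above = simulate_sir(beta=0.6, gamma=0.3)
spread_below = simulate_sir(beta=0.2, gamma=0.4)

print(r0_above, spread_above)
print(r0_below, spread_below)
```

When the ratio is above 1 the topic reaches a large share of users before fading, and when it is below 1 the topic dies out almost immediately. Estimating such a ratio from real posting data requires fitting the model to observed activity curves, which this sketch does not attempt.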
To summarize, given the importance of social media in infodemiology, complexity science, computational social science, and applied mathematics will all be essential because they equip researchers with the tools necessary for analyzing unstructured data on a global scale. Future research can look into ways of leveraging these tools beyond analysis and explore how they can inform interventions that address issues associated with the COVID-19 infodemic.
[1] New England Complex Systems Institute. (2019). Research. New England Complex Systems Institute. https://necsi.edu/research
[2] Siegenfeld, A. F., & Bar-Yam, Y. (2020). An Introduction to Complex Systems Science and its Applications. ArXiv:1912.05088 [Physics]. https://arxiv.org/abs/1912.05088
[3] Cornell University. (2020). Computational Social Science. Masters in Computational Social Science. https://as.cornell.edu/block/computational-social-sciences
[4] Chen, S.-H., & Yu, T. (2018). Big Data in Computational Social Sciences and Humanities: An Introduction. In S.-H. Chen (Ed.), Big Data in Computational Social Science and Humanities (pp. 1–25). Springer International Publishing. https://doi.org/10.1007/978-3-319-95465-3_1
[5] Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–144. https://doi.org/10.1016/j.ijinfomgt.2014.10.007
[6] Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
[7] Columbia University. (2020). Computational Social Science | Data Science Institute. https://www.datascience.columbia.edu/computational-social-science
[8] FakerFact. (2017). FakerFact. About FakerFact. https://www.fakerfact.org/
[9] Oshikawa, R., Qian, J., & Wang, W. Y. (2020). A Survey on Natural Language Processing for Fake News Detection. ArXiv:1811.00770 [Cs]. https://arxiv.org/abs/1811.00770
[10] Schmidt, A., & Wiegand, M. (2017). A Survey on Hate Speech Detection using Natural Language Processing. Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, 1–10. https://doi.org/10.18653/v1/W17-1101
[11] Young, T., Hazarika, D., Poria, S., & Cambria, E. (2018). Recent Trends in Deep Learning Based Natural Language Processing. ArXiv:1708.02709 [Cs]. https://arxiv.org/abs/1708.02709
[12] Lopez Pinaya, W. H., Vieira, S., Garcia-Dias, R., & Mechelli, A. (2020). Chapter 10—Convolutional neural networks. In A. Mechelli & S. Vieira (Eds.), Machine Learning (pp. 173–191). Academic Press. https://doi.org/10.1016/B978-0-12-815739-8.00010-9
[13] Morales, A. J., Dong, X., Bar-Yam, Y., & ‘Sandy’ Pentland, A. (2019). Segregation and polarization in urban areas. Royal Society Open Science, 6(10), 190573. https://doi.org/10.1098/rsos.190573
[14] Di Minin, E., Fink, C., Hiippala, T., & Tenkanen, H. (2019). A framework for investigating illegal wildlife trade on social media with machine learning. Conservation Biology, 33(1), 210–213. https://doi.org/10.1111/cobi.13104
[15] Vosoughi, S., Mohsenvand, M. ‘Neo,’ & Roy, D. (2017). Rumor Gauge: Predicting the Veracity of Rumors on Twitter. ACM Transactions on Knowledge Discovery from Data, 11(4), 1–36. https://doi.org/10.1145/3070644
[16] Wexler, A., Davoudi, A., Weissenbacher, D., Choi, R., O’Connor, K., Cummings, H., & Gonzalez-Hernandez, G. (2020). Pregnancy and health in the age of the Internet: A content analysis of online “birth club” forums. PloS One, 15(4), e0230947. https://doi.org/10.1371/journal.pone.0230947
[17] Hassanpour, S., Tomita, N., DeLise, T., Crosier, B., & Marsch, L. A. (2019). Identifying substance use risk based on deep neural networks and Instagram social media data.
[18] Bell, A., Rich, A., Teng, M., Orešković, T., Bras, N. B., Mestrinho, L., Golubovic, S., Pristas, I., & Zejnilovic, L. (2019). Proactive advising: A machine learning driven approach to vaccine hesitancy. 2019 IEEE International Conference on Healthcare Informatics (ICHI), 1–6. https://doi.org/10.1109/ICHI.2019.8904616
[19] Johnson, N. F., Velásquez, N., Restrepo, N. J., Leahy, R., Gabriel, N., El Oud, S., Zheng, M., Manrique, P., Wuchty, S., & Lupu, Y. (2020). The online competition between pro- and anti-vaccination views. Nature, 1–4. https://doi.org/10.1038/s41586-020-2281-1
[20] Bullock, J., Luccioni, A., Pham, K. H., Lam, C. S. N., & Luengo-Oroz, M. (2020). Mapping the Landscape of Artificial Intelligence Applications against COVID-19. ArXiv:2003.11336 [Cs]. https://arxiv.org/abs/2003.11336
[21] Abd-Alrazaq, A., Alhuwail, D., Househ, M., Hamdi, M., & Shah, Z. (2020). Top Concerns of Tweeters During the COVID-19 Pandemic: Infoveillance Study. Journal of Medical Internet Research, 22(4), e19016. https://doi.org/10.2196/19016
[22] Lopez, C. E., Vasu, M., & Gallemore, C. (2020). Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset. ArXiv:2003.10359 [Cs]. https://arxiv.org/abs/2003.10359
[23] Cinelli, M., Quattrociocchi, W., Galeazzi, A., Valensise, C. M., Brugnoli, E., Schmidt, A. L., Zola, P., Zollo, F., & Scala, A. (2020). The COVID-19 Social Media Infodemic. ArXiv:2003.05004 [Nlin, Physics:Physics]. https://arxiv.org/abs/2003.05004
[24] Chen, E., Lerman, K., & Ferrara, E. (2020). Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health and Surveillance, 6(2), e19273. https://doi.org/10.2196/19273