BIG DATA
Trishla Pathak
Business Integration and Arch Analyst || DHL Gradstar Top 100 in SA 2021 Awardee || International Honours Society - Golden Key Holder
Introduction
We have all experienced times when we scroll through social media, and all of a sudden, we see an advertisement for a product that we spoke about a while back. How is it that our devices know exactly what we talk about? How is it that companies like Amazon or Takealot recommend products to us that we only ever thought of? One may think that we are being spied upon if such occurrences take place quite often, but the truth is that all of these things are possible due to Big Data.
Big data is now an essential part of the technology industry and affects a large part of our lives in ways we are often unaware of. This essay discusses the definition of big data, its background, its advantages and disadvantages, its tools, its relationship with cloud computing, and its application.
Definition
In today’s world, data is being generated by almost every individual, often without their knowledge. What is data? All facts and figures which can be stored in a digital format are called data. For instance, smartphones are devices used by almost everyone around the planet. These devices generate copious amounts of data, also known as the data deluge (Khan, Liu, Shakil and Alam, 2017), in the form of phone calls, e-mails, music, photos, and videos, exceeding the ability of the mind to process or calculate how much of it is being generated. In fact, this amount of data is too much for even traditional computing systems to process and analyze. This enormous amount of data is what we term Big Data.
Various definitions of Big Data have been proposed. One is that of Gartner, Inc.: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” (Gandomi and Haider, 2015). Another, by De Mauro et al., states that “Big Data is the information asset characterized by such a high Volume, Velocity, and Variety to require specific Technology and Analytical Methods for its transformation into Value” (Ahmed and Ameen, 2017). Both definitions have one thing in common: the 3V’s of volume, velocity, and variety, known as Gartner's 3V model. The model initially consisted of only these three V’s, but as more knowledge was gathered on the subject, two further V’s were added, which are discussed in detail in the next section.
Background
The Initial 3V Model
- Volume – The first characteristic is the size of the data, i.e., megabytes, gigabytes, etc. An abundance of data is produced every second from a myriad of sources, in quantities too big for personal computers to process (What is Big Data, anyway?, 2013). The latest technologies have helped in accessing, identifying, and collecting significantly large amounts of data (Sznaier, 2021). Something as simple as a Google search can generate more data than you can imagine. Some information is produced in megabytes, which our personal computers can process. However, analyzing and processing terabytes or petabytes of data, for example CCTV footage, requires professional analysis tools, as the computers would be dealing with massive chunks of data (Ahmed and Ameen, 2017).
- Velocity – The second characteristic is the speed associated with big data, i.e., batch and stream processing, availability, transfer across multiple locations, etc. As we advance technologically and create more systems, we increase the speed at which data is generated, processed, and analyzed (What is Big Data, anyway?, 2013). For instance, a social media platform like Twitter can produce about 6000 tweets in a second, which adds up to close to 200 billion tweets every year (Ahmed and Ameen, 2017).
- Variety – With the abundance of data comes a diversity of data types, i.e., structured or unstructured data types. Different systems generate different types of data; for instance, an e-mail is a different type of data from that produced in a financial transaction (What is Big Data, anyway?, 2013). The various data types can be divided into two categories: textual and non-textual data. Data that you produce in basic applications like Microsoft Excel can be stored within an Excel sheet, but big data such as chat messages, e-mails, blogs, etc., is saved in files with different extensions and cannot simply be entered into a spreadsheet. To deal with continuous occurrences of such data, a set variety of measuring patterns and sources that analyze data should be saved to work efficiently and effectively (Ahmed and Ameen, 2017).
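The velocity figure in the list above (about 6000 tweets per second) can be checked with simple arithmetic. The short Python sketch below is only a back-of-the-envelope calculation; the function name and the assumption of a constant rate over a 365-day year are illustrative and do not come from the cited sources.

```python
# Back-of-the-envelope check of the tweet rate quoted in the text.
TWEETS_PER_SECOND = 6_000        # rate quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
DAYS_PER_YEAR = 365              # assumption: leap years ignored


def tweets_per_year(rate_per_second: int) -> int:
    """Scale a constant per-second rate up to a full year."""
    return rate_per_second * SECONDS_PER_DAY * DAYS_PER_YEAR


if __name__ == "__main__":
    total = tweets_per_year(TWEETS_PER_SECOND)
    print(f"{total:,} tweets per year")  # 189,216,000,000 tweets per year
```

At a constant 6000 tweets per second the total comes to roughly 189 billion tweets a year, which is in line with the figure of about 200 billion that is usually quoted.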
The Addition to the 3V Model
- Value – An essential characteristic of big data, i.e., the value added to achieve objectives, organizational values, events, etc. Two major kinds of value are described for big data. The first is the cost of IT infrastructure and processing, which companies spend and invest in order to store and process the data. The second is the turnover value, which they earn after their stored data has been processed (Ishwarappa and Anuradha, 2015).
- Veracity – The last characteristic of big data is associated with trustworthiness, authenticity, accountability, etc. The veracity of data relates to how relevant and accurate the data is in terms of the initial 3V model. It is this factor that helps discard redundant or unnecessary data so that the main objective of the data can be fulfilled with ease (Ahmed and Ameen, 2017).
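As a rough illustration of the veracity idea above, discarding redundant or unusable data, the Python sketch below filters a small list of records. The record layout (an "id" and a "value" field) and the cleaning rules are hypothetical and chosen only for the example; real systems apply far richer validation.

```python
from typing import Iterable


def clean_records(records: Iterable[dict]) -> list[dict]:
    """Keep the first copy of each record and skip incomplete ones."""
    seen = set()
    cleaned = []
    for record in records:
        # Hypothetical rule: a usable record has a non-empty "id" and a "value".
        if not record.get("id") or record.get("value") is None:
            continue
        key = (record["id"], record["value"])
        if key in seen:  # exact duplicate, so it is redundant
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    raw = [
        {"id": "a", "value": 1},
        {"id": "a", "value": 1},     # duplicate
        {"id": "", "value": 2},      # missing id
        {"id": "b", "value": None},  # missing value
        {"id": "c", "value": 3},
    ]
    print(clean_records(raw))
```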
Information: The fuel of Big Data
Big data demands the integration of techniques, tools, and technologies to draw a clear understanding from data sets, given their diversity, complexity, and gigantic scale (Demigha, 2020). Digitization has made it possible to provide information access to people all around the globe. Big data expands, and is shared and utilized, so quickly because data that is structured and relevant to the user becomes information, and information then acts as the fuel that spreads the phenomenon of Big Data, i.e., information is converted to knowledge, which provides value to the organizations that use it (De Mauro, Greco, and Grimaldi, 2016).
Advantages and disadvantages
Benefits and Advantages
- Increased productivity – Using modern technological tools together with appropriate knowledge management skills has helped data scientists analyze and process data faster. This not only increases and refines personal productivity but also brings overall efficiency to organizations that use these practices effectively (Suciu et al., 2018).
- Cost Reduction – Large amounts of data such as GPS signals or satellite imagery may require extensive technological tools in order to be stored and processed. However, in the long run, these tools help reduce the cost of buying additional technology and also aid in doing business efficiently (Favaretto, De Clercq, Schneble and Elger, 2020).
- Fraud Detection – Highly advanced tools are used to process Big Data. These tools are exceptional at identifying patterns and anomalies that would perhaps not be detected if analyzed by humans.
- Improved Customer Service – The Internet is being used by almost every individual on the planet, spreading from an individual’s home to the whole country. Data extracted about these individuals (potential customers) from social media platforms, websites, or systems help companies serve their customers better by providing them the services that they would like to receive – some of which they themselves are unaware of (Agnellutti, 2014).
- Increased Revenue – Information extracted from mobile platforms helps connect businesses and people better. This helps businesses make better decisions to improve their customer services, which in turn increases their revenue (Hudy, 2015).
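The fraud detection benefit in the list above rests on tools spotting values that do not fit the usual pattern. The Python sketch below shows one simple way this can be done, using the median absolute deviation (a standard robust statistic). It is a toy illustration, not the method of any particular fraud detection product; the function name, the sample amounts, and the threshold of 3.5 are assumptions made for the example.

```python
from statistics import median


def flag_anomalies(amounts: list[float], threshold: float = 3.5) -> list[float]:
    """Return the amounts that sit far from the median of the batch."""
    med = median(amounts)
    # Median absolute deviation: the typical distance from the median.
    mad = median([abs(x - med) for x in amounts])
    if mad == 0:
        # At least half the values equal the median, so no scale to compare against.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [x for x in amounts if 0.6745 * abs(x - med) / mad > threshold]


if __name__ == "__main__":
    card_payments = [20, 25, 22, 19, 24, 21, 500]  # hypothetical amounts
    print(flag_anomalies(card_payments))  # [500]
```

A median-based score is used here instead of the mean because a single extreme payment would otherwise distort the very average it is being compared against.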
Risks and Disadvantages
- Technical education and expertise – Using the tools, processing frameworks, and engines needed to compute the data within a data system (Gurusamy, Kannan and Nandhini, 2017) in the right manner requires education and expertise that are difficult to acquire. It takes a lot of focus and hard work, so few people are able to cope, as most are unaware of how to handle these matters (Bamiah, Brohi and Rad, 2018).
- Security and Privacy – People tend to display their personal information while performing their daily activities without being aware of the fact that all this information is recorded within a database. This can make them vulnerable to cyberattacks. (Mai, 2016)
- Inaccurate or Redundant Data – The sources that one may receive data from can be inaccurate and contain errors, and the process to analyze this data has a chance of increasing these errors if not detected. The data can then become redundant and not be used effectively. (Krasnow Waterman and Bruening, 2014)
- Tools – Applying analytical tools can cause a sharp rise in the costs that organizations must bear. These tools can themselves become corrupted if inaccurate or unreliable data is processed through them. This may cost the organization using these tools a fortune, because uncertainties in the tools can corrupt all the organization's data. (Clarke, 2015)
- Change in Technology – The development of data is bringing about a change in the kinds of technology we operate. Every now and then there is an innovation in how information is produced from this data. This can force businesses to update their technological tools regularly, which changes team dynamics and can produce mixed and disruptive results. (Kitchin, 2015)
Tools
Various tools are used to handle big data. A theorem named HACE, for “Heterogeneous, Autonomous, Complex and Evolving data”, has been used to describe the features and characteristics of big data (Srinivasan and Thirumalai Kumari, 2018). This theorem is used to understand the function and abilities of the tools needed to process this data. A brief description of two of these tools follows:
- HADOOP – This is “open-source software” running on a collection of commodity machines. It is a widely used tool with the ability to store and process a tremendous amount of data. It is flexible, with high computing power and excellent fault tolerance. It is cost-effective and highly scalable (Salina and Rao, 2016).
- HPCC – This system consists of two integrated platforms, each with a distinct cluster. The first is THOR, a “back-end data refinery” cluster used for refining the data. The second is ROXIE, a “front-end data delivery” cluster used for delivering the data. ECL, the Enterprise Control Language, is an extremely powerful language used to create applications that run on both these clusters. These components combine to produce a thorough and widely scalable solution for processing and analyzing big data (Salina and Rao, 2016).
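To give a feel for how a tool like Hadoop spreads work over many machines, the sketch below imitates the MapReduce programming model that Hadoop is known for, using plain Python on a single computer. It does not use Hadoop or its API, and all function names are illustrative assumptions; on a real cluster each document (or split of a file) would be mapped on a different machine and the grouped results reduced in parallel.

```python
from collections import defaultdict


def map_phase(document: str) -> list[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce step: collapse each word's list of ones into a total count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: list[str]) -> dict[str, int]:
    pairs = []
    for doc in documents:  # here sequential; a cluster would run these in parallel
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big tools"]))
    # {'big': 2, 'data': 1, 'tools': 1}
```

Because each map call only needs its own document and each reduce call only needs one word's values, the work can be divided among many commodity machines, which is where the scalability described above comes from.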
Relationship between big data and cloud computing
Cloud computing is a method that is trending in the technology industry at the moment. This technology allows information to be saved and become available electronically, by providing its services running on servers that are accessible from anywhere in the world by using internet services (Miller, 2013).
This removes the need to buy expensive devices to store massive chunks of data (Hashem et al., 2015).
Big Data and Cloud computing share a consistent relationship, as it is this data that accelerates the development and usage of cloud computing, due to the huge capacity storage resources and effective computing of data incorporated within the cloud system (Nabeel, Al-Haj and Khwaldeh, 2017). Cloud computing and Big Data complement each other in the ways that they store, analyze and process data. If excessive data is a problem, then excessive cloud storage is the solution to that problem. With increased amounts of data being generated today, cloud storage is being developed and expanded in order to absorb these large amounts of data, and make it available from multiple locations, just by the use of the internet. The cloud does this by expanding its storage through virtual machines and making different kinds of data more accessible.
In order for this relationship to work better, the cloud computing environment needs to be altered from time to time based on the kind of data that it is being integrated with. In the upcoming years, many more alterations will be made to bring out maximum efficiency of data storage and usage.
Application - Healthcare
Big Data has been able to integrate its functions within various fields, one that we are well aware of – Technology. Artificial intelligence within technology makes our daily activities faster and helps us make decisions more efficiently (Bauguess, 2017). However, Big data has also helped massively in other fields such as the medical field.
Big Data has allowed health professionals and patients to engage better with one another. Where examinations and medical records in hospitals are digitized, it is easier to maintain records and provide quick services to patients (Dash, Shakyawar, Sharma, and Kaushik, 2019). It has also helped medical researchers develop new drugs using tools and algorithms, which reduces clinical trial failure (Hong et al., 2018). Big Data has contributed greatly to saving lives in the medical field, and in the near future it may even help medical researchers find a cure for cancer.
Conclusion
In conclusion, Big Data is a phenomenon that requires data to be processed and analyzed in order to produce suitable results that provide value to an organization or person. It has its advantages and disadvantages, but it is currently spreading rapidly around the world. It relies on various tools and technologies to be processed effectively so that the data can be used in various fields and help organizations and individuals work with ease.
References
Agnellutti, C. (2014) Big Data: an Exploration of Opportunities, Values, and Privacy Issues. New York: Nova Science Publishers, Inc (Internet Theory, Technology and Applications). Available at: https://search.ebscohost.com.uplib.idm.oclc.org/login.aspx?direct=true&db=nlebk&AN=811106&site=ehost-live&scope=site (Accessed: 2 June 2021).
Ahmed, W. and Ameen, K., 2017. Defining big data and measuring its associated trends in the field of information and library management. Library Hi Tech News, 34(9), pp.21-24.
Bamiah, M., Brohi, S. and Rad, B., 2018. Big Data Technology in Education: Advantages, Implementations, and Challenges. [online] Jestec.taylors.edu.my. Available at: <https://jestec.taylors.edu.my/Special%20Issue%20ICCSIT%202018/ICCSIT18_19.pdf> [Accessed 2 June 2021].
Bauguess, S., 2017. The Role of Big Data, Machine Learning, and AI in Assessing Risks: A Regulatory Perspective. SSRN Electronic Journal.
Clarke, R., 2015. Big data, big risks. Information Systems Journal, 26(1), pp.77-90.
Dash, S., Shakyawar, S., Sharma, M. and Kaushik, S., 2019. Big data in healthcare: management, analysis and future prospects. Journal of Big Data, 6(1).
De Mauro, A., Greco, M. and Grimaldi, M., 2016. A formal definition of Big Data based on its essential features. Library Review, 65(3), pp.122-135.
Demigha, S. (2020) ‘Information Management (IM) and Big Data’, Proceedings of the European Conference on Knowledge Management, pp. 157–163. doi: 10.34190/EKM.20.118.
Favaretto, M., De Clercq, E., Schneble, C. and Elger, B., 2020. What is your definition of Big Data? Researchers’ understanding of the phenomenon of the decade. PLOS ONE, 15(2), p.e0228987.
Ishwarappa, A. and Anuradha, J. (2015), “A brief introduction on big data 5Vs characteristics and Hadoop technology”, Procedia Computer Science, Vol. 48, pp. 319-324.
Khan, S., Liu, X., Shakil, K. and Alam, M., 2017. A survey on scholarly data: From big data perspective. Information Processing & Management, 53(4), pp.923-944.
Kitchin, R., 2015. The opportunities, challenges and risks of big data for official statistics. Statistical Journal of the IAOS, 31(3), pp.471-481.
Krasnow Waterman, K. and Bruening, P., 2014. Big Data analytics: risks and responsibilities. International Data Privacy Law, 4(2), pp.89-95.
Gandomi, A. and Haider, M., 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), pp.137-144.
Gurusamy, V., Kannan, S. and Nandhini, K., 2017. The Real Time Big Data Processing Framework Advantages and Limitations. International Journal of Computer Sciences and Engineering, 5(12), pp.305-312.
Hashem, I., Yaqoob, I., Anuar, N., Mokhtar, S., Gani, A. and Ullah Khan, S., 2015. The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, pp.98-115.
Hong, L., Luo, M., Wang, R., Lu, P., Lu, W. and Lu, L., 2018. Big Data in Health Care: Applications and Challenges. Data and Information Management, 2(3), pp.175-197.
Hudy, A. C. (2015) ‘Turning the Big Data Crush into an Advantage’, Information Management Journal, 49(1), pp. 38–39. Available at: https://search.ebscohost.com.uplib.idm.oclc.org/login.aspx?direct=true&db=tnh&AN=100362263&site=ehost-live&scope=site (Accessed: 2 June 2021).
Mai, J., 2016. Big data privacy: The datafication of personal information. The Information Society, 32(3), pp.192-199.
Miller, H., 2013. Big-data in cloud computing: a taxonomy of risks. [online] Informationr.net. Available at: <https://informationr.net/ir/18-1/paper571.html#.YLbS_aj36Uk> [Accessed 2 June 2021].
Nabeel, Z., Al-Haj, A. and Khwaldeh, S., 2017. Cloud Computing and Big Data is there a Relation between the Two: A Study. [online] Ripublication.com. Available at: <https://www.ripublication.com/ijaer17/ijaerv12n17_89.pdf> [Accessed 2 June 2021].
Salina, A. and Rao, K., 2016. A Study on Tools of Big Data Analytics. [online] Research Gate. Available at: <https://www.researchgate.net/publication/310627056_A_Study_on_Tools_of_Big_Data_Analytics> [Accessed 2 June 2021].
Srinivasan, S. and Thirumalai Kumari, T., 2018. Big data analytics tools a review. International Journal of Engineering & Technology, 7(3.3), p.685.
Suciu, M.-C. et al. (2018) ‘The Impact of Big Data on Knowledge Management Systems in Romanian e-Commerce Retailers’, Proceedings of the European Conference on Knowledge Management, 2, pp. 821–828. Available at: https://search.ebscohost.com.uplib.idm.oclc.org/login.aspx?direct=true&db=lls&AN=132145966&site=ehost-live&scope=site (Accessed: 2 June 2021).
Sznaier, M., 2021. Control Oriented Learning in the Era of Big Data. IEEE Control Systems Letters, 5(6), pp.1855-1867.
‘What Is Big Data, Anyway?’ (2013) Public CIO, 11(1), pp. 6–7. Available at: https://search.ebscohost.com.uplib.idm.oclc.org/login.aspx?direct=true&db=tnh&AN=87747189&site=ehost-live&scope=site (Accessed: 2 June 2021).