The Ascendancy of Big Data…
Aisha Ekundayo, PhD
Data Analytics Consulting | AI Consultant | Data Product Management
Technological advances over the past two decades have led to a significant increase in the data generated, with almost every action we take producing new data.
With cloud computing, we now have massive storage capacity and compute power to crunch data, identify hidden patterns and use AI to make predictions.
Big Data is data that cannot be processed on your local machine or any single computer. A single laptop cannot handle it because it is generated quickly, in large amounts and in different formats. Think about social media posts: they contain text, pictures, audio and video. These three properties (volume, velocity and variety) are the three V’s used to describe Big Data.
This article will provide an overview of what makes data ‘Big Data’ and its primary sources.
There are three main sources of big data.
Human-generated: This is data created by human activities, including social media posts, video files, audio files, text messages, emails, presentations, etc. Social media leads this category, with its vast number of profiles, status updates, shares, likes, etc. Human-generated data makes up about 60% of the total data generated.
With continued innovation and the use of smartphones, tablets, mobile networks, and internet availability globally, about 2.5 quintillion bytes of data are produced by humans every day.
According to Statista, over 4.66 billion people were active on the internet in January 2021, making up 59.5% of the global population. Of this total, 92.6% (4.32 billion) accessed the internet via mobile devices, and 4.15 billion of them were active mobile social media users, with 1.5 billion people using Facebook every day. Instagram users post 95 million photos and videos daily.
Machine-generated: Programmed machines can generate data without active human intervention. Examples include sensors on appliances, satellites, medical devices, smartphone apps, fitness trackers, etc. Once installed, these devices generate data without anyone telling them to record it. For example, your smartphone apps record usage, including logins, time spent, screen activities, etc., subject to local laws, compliance requirements and regulations.
Organisation-generated: Organisations generate data from their activities, including customer data such as demographics, location, date and time, etc. Other data can include sales, revenue, expenses, employee utilisation, customer enquiries, churn numbers and transaction data.
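To put the figure of about 2.5 quintillion bytes per day quoted above into perspective, it can be converted into a per-second rate. The short sketch below is only back-of-the-envelope arithmetic; it assumes decimal units (1 TB = 10^12 bytes), and the variable names are illustrative.

```python
# Rough scale check for the "2.5 quintillion bytes per day" figure.
# Assumes decimal units: 1 TB = 10**12 bytes, 1 EB = 10**18 bytes.
BYTES_PER_DAY = 2_500_000_000_000_000_000  # 2.5 quintillion bytes
SECONDS_PER_DAY = 24 * 60 * 60

exabytes_per_day = BYTES_PER_DAY / 10**18
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 10**12

print(f"{exabytes_per_day:.1f} EB per day")
print(f"{terabytes_per_second:.1f} TB per second")
```

At roughly 29 TB every second, it is clear why no single laptop could keep up with data at this scale.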
From the big data sources mentioned and their types, you might be thinking that some of this data would not be large, and rightfully so.
Next, we highlight the characteristics of big data to see what type of data can be considered Big Data.
Characteristics of Big Data
Volume- This refers to the size of the dataset. Big data must have a very large number of data points. Think of human-generated data such as social media data, machine-generated data like IoT data from machinery sensors, and organisation data like Netflix users' watch history or Amazon shopping histories and wish lists, etc.
Companies that generate big data need infrastructure to process, store and analyse the data collected. Big data analytics tools are required to store and process the data effectively.
Velocity- Velocity refers to the rate at which the data is generated. Big data is generated at a very fast pace and moves around quickly. Imagine a tweet that goes viral on Twitter: every like and retweet it attracts is itself new data. The same goes for sensors fitted on appliances for predictive maintenance; once an anomaly is detected, the company quickly arranges a visit to check the machine and prevent a breakdown.
All in all, a company working with big data must have the right technology to store and analyse the data at every point, even if it changes every millisecond. Real-time analytics is the best solution here.
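The predictive-maintenance example above can be sketched as a simple check on a stream of sensor readings. This is a minimal illustration, not a production method: the function name `detect_anomalies`, the window size, the threshold and the sample readings are all assumptions made for the example. Each new reading is compared with the mean and standard deviation of the most recent readings as it arrives.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean.

    `window` and `threshold` are illustrative defaults, not tuned values.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # Flag the reading if it is more than `threshold` standard
            # deviations from the mean of the recent window.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
        recent.append(value)


# Hypothetical temperature readings with one obvious spike.
stream = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 35.0, 20.0]
anomalies = list(detect_anomalies(stream))
print(anomalies)
```

Because the function processes one reading at a time and keeps only a small window in memory, the same idea carries over to data that never stops arriving.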
Variety- Big Data comes in various formats. For example, a company that has sensors fitted on machines might also have geographic and biometric information. All data types need to be stored, processed and analysed using the appropriate techniques and tools. Big data can come in different formats that include structured, unstructured or semi-structured data.
Structured data has an explicit schema in terms of rows and columns. An example of structured data is data in a spreadsheet.
Unstructured data refers to a dataset without a defined schema, such as photographs, free text, social media posts, streams from event hubs, videos, etc. This data type makes up about 90% of big data, making it essential for generating business insights. Artificial Intelligence is used to process unstructured data to uncover hidden patterns and information.
Semi-structured data does not fit a rigid schema of rows and columns, but it still has a well-defined level of organisation. A good example is HTML code, where tags form a hierarchy of parent-child relationships.
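The difference between the three formats is easiest to see by how each one is read in code. The sketch below uses only the Python standard library; the sample CSV, HTML and text values, and the `TagCollector` class, are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Structured: every row has the same named columns, so values can be
# looked up directly by column name.
csv_text = "customer_id,country,amount\n1,NG,250\n2,UK,100\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_amount = sum(int(row["amount"]) for row in rows)


# Semi-structured: no rows and columns, but tags give a parent-child
# hierarchy that a parser can walk.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # (nesting depth, tag name)

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


collector = TagCollector()
collector.feed("<ul><li>Sales</li><li>Revenue</li></ul>")

# Unstructured: free text has no schema at all; even a basic question
# such as "how many words?" needs processing of the raw content.
post = "Loving the new phone! Battery life is great, camera is great."
word_count = len(post.split())

print(total_amount, collector.tags, word_count)
```

The structured data answers a question with a column lookup, the semi-structured data needs a parser to recover its hierarchy, and the unstructured text needs dedicated processing before any meaning can be extracted.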
Conclusion
In summary, big data is generated through human, machine or organisation activities. Big data has a vast volume and is generated at a fast rate in various formats.
From the characteristics of big data, it is evident that we cannot use traditional methods to store or process such datasets. So, what do we use? Distributed Computing is the answer. Microsoft Cloud has a variety of tools for Big Data Analytics.
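The core idea of distributed computing can be sketched on one machine: split the data into chunks, let several workers process the chunks independently, then combine the partial results. In the illustration below, threads stand in for the separate machines of a real cluster. The function names and sample lines are assumptions for the example, and this is not how any particular cloud service is implemented.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, n_chunks):
    """Split a list of lines into at most n_chunks roughly equal parts."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_words(chunk):
    """'Map' step: each worker counts words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, n_workers=2):
    chunks = split_into_chunks(lines, n_workers)
    # Threads here are a stand-in for worker nodes in a cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # 'Reduce' step: merge the partial results into one total.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


lines = ["big data is big", "data moves fast", "big data varies"]
totals = distributed_word_count(lines)
print(totals.most_common(2))
```

No single worker ever needs to see the whole dataset, which is what allows this approach to scale beyond the memory and compute of one computer.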
We will look at the Storage and Processing options for Big Data in the next article.
Please share your understanding of Big Data in the comment section and the type of data generated by your organisation or through your everyday activities.