Big Data Explained Simply
Dr Mario Bojilov - MEngSc, CISA, F Fin, PhD
I work with forward-looking, deep-thinking non-executive directors (NEDs) to help them harness Artificial Intelligence (AI) and create profoundly impactful organisations.
The first discussion of Big Data appeared in an article written in 2001 by Mr Doug Laney, at the time an analyst at META Group (later acquired by Gartner). The paper did not mention the term Big Data, but it discussed the three main characteristics of Big Data for the first time: Volume, Velocity and Variety. The term Big Data only started appearing online in 2006-2007 and has taken hold since then.
Today, a Google search for "Big Data definition" will produce 3.08Bn results. Such an abundance of definitions inevitably creates significant confusion around what Big Data is and when organisations need to start looking at specific Big Data applications and solutions. Furthermore, considering only the amount of data is not always sufficient, since some organisations routinely process hundreds of terabytes per month, while others struggle with hundreds of gigabytes.
Instead, one way is to look at Big Data in a business-centric manner and consider its effectiveness within an organisational context. This perspective leads to a definition focused purely on business value rather than technical aspects: we are dealing with Big Data when we cannot obtain the required information within the timeframes necessary for it to add value to organisational activities. Or, to rephrase: organisations need the information to be available before certain events occur; otherwise, it is useless.
Characteristics
The three Big Data characteristics, or 3Vs, identified by Mr Laney in his work form the foundation used to build Big Data business initiatives and technology infrastructure. While new characteristics constantly appear and broaden the original definition, they often seem redundant and pretentious, created with a marketing purpose in mind. The original 3Vs are discussed below.
Volume
As the name implies, this characteristic refers to the size of the datasets that must be processed. When discussing volume, we first need to define how it is measured. As consumers and professionals, we are familiar with kilobytes, megabytes, and gigabytes.
However, Big Data volumes go well beyond any of these quantities, so a few definitions are needed at this stage. The list below explains the terms currently used to measure data quantities, expressed as bytes:

Kilobyte (KB) – 1,000 bytes (10^3)
Megabyte (MB) – 10^6 bytes
Gigabyte (GB) – 10^9 bytes
Terabyte (TB) – 10^12 bytes
Petabyte (PB) – 10^15 bytes
Exabyte (EB) – 10^18 bytes
Zettabyte (ZB) – 10^21 bytes
Yottabyte (YB) – 10^24 bytes

The above are the so-called "decimal definitions", which law courts consider the most appropriate in trade and commerce. The volumes from the terabyte range upward are the ones generally considered Big Data.
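To make the decimal units concrete, here is a minimal Python sketch (the function name and unit list are illustrative, not from any particular library) that expresses a raw byte count in the largest applicable decimal unit:

```python
# Decimal (SI) data units: each step up is a factor of 1,000, not 1,024.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Express a raw byte count in the largest applicable decimal unit."""
    unit_index = 0
    while num_bytes >= 1000 and unit_index < len(UNITS) - 1:
        num_bytes /= 1000
        unit_index += 1
    return f"{num_bytes:.2f} {UNITS[unit_index]}"

print(human_readable(33e21))   # 33 Zettabytes -> "33.00 ZB"
print(human_readable(2.5e14))  # hundreds of terabytes -> "250.00 TB"
```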
To put volume in context, it is worth noting that, according to IDC, in 2018 the total amount of data on Earth was 33 Zettabytes, and this amount will grow to 175 Zettabytes by 2025. Moreover, the emergence of COVID-19 and the associated rise in digital technology usage will likely increase this figure even further.
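For context, a quick back-of-the-envelope calculation (my own arithmetic, not IDC's) shows the annual growth rate this forecast implies:

```python
# Implied compound annual growth rate (CAGR) of global data volume,
# based on IDC's figures: 33 ZB in 2018 growing to 175 ZB in 2025.
start_zb, end_zb, years = 33, 175, 2025 - 2018
cagr = (end_zb / start_zb) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")  # roughly 27% per year
```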
Velocity
Velocity is the second characteristic of Big Data. It refers both to the speed at which data is created and to the rate at which it is processed and consumed. The emergence of new business models, innovative applications, and the widespread use of portable devices have increased velocity significantly.
The US Federal Reserve estimates that in 2012 a total of 24.4Bn general-purpose credit card transactions were made, while in 2018 that figure grew to 40.9Bn, an increase of 68%. Moreover, the electronic payments trend will accelerate further because of COVID-19: electronic transactions were the only option for most people during the lockdowns, and consumers are now very comfortable with digital technology. This trend, however, was visible even before the pandemic, when banks in some countries, such as Australia, started reducing the number of their Automated Teller Machines (ATMs).
The increased e-payment volumes are just one example of increasing data velocity. Another is social media: Microsoft, LinkedIn's parent company, reported that in Q4 2020 engagement on LinkedIn was up 31%. These engagements include text alongside other data types such as video, audio and graphics. This assortment of data brings us to the last characteristic of Big Data – variety.
Variety
When related to Big Data, Variety refers to the range of data sources that need to be processed. There are three main types of data we need to deal with:
Structured – this data resides within enterprise systems, and its structure is well-defined. Examples include Payroll, Finance, and other ERP systems, where a database stores all the data. An example of such a data record is an HR system's employee record: it will contain, as a minimum, an employee ID, a first name, a last name and other fields as required, as sketched below.
Structured data has been around since the early 80s. It is the easiest to process and the smallest of the three types in quantity.
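To illustrate how rigid this structure is, here is a minimal Python sketch of such an employee record; the fields beyond employee ID, first name and last name are my own assumptions, added for illustration:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class EmployeeRecord:
    """One row in an HR system's employee table: every field has a
    fixed name and type, which is what makes the data 'structured'."""
    employee_id: int
    first_name: str
    last_name: str
    hire_date: date        # illustrative extra field
    department: str = ""   # illustrative extra field

record = EmployeeRecord(1001, "Jane", "Doe", date(2020, 3, 16), "Finance")
print(record)
```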
Semi-structured – this data type consists of large volumes of individual records, each small in size and with a simple record structure. An example would be the data sent by an intelligent power meter to a central system. Each packet has the same format: timestamp – 10 bytes, location – 10 bytes, consumption – 10 bytes, plus other information – 80 bytes.
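A minimal sketch of how such a fixed-width packet might be unpacked follows; the field offsets match the layout above, while the field encoding and sample values are assumptions for illustration:

```python
# Fixed-width layout from the example above:
# timestamp (10 bytes) | location (10 bytes) | consumption (10 bytes) | other (80 bytes)
RECORD_SIZE = 110

def parse_packet(packet: bytes) -> dict:
    """Split one 110-byte meter packet into its named fields."""
    assert len(packet) == RECORD_SIZE, "each meter record is exactly 110 bytes"
    return {
        "timestamp": packet[0:10].decode("ascii").strip(),
        "location": packet[10:20].decode("ascii").strip(),
        "consumption": packet[20:30].decode("ascii").strip(),
        "other": packet[30:110],  # opaque payload, left as raw bytes
    }

sample = b"1617187200" + b"METER00042" + b"0000004.20" + b" " * 80
print(parse_packet(sample))
```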
Thus, the information about one electricity reading takes 110 bytes. However, the 110 bytes is misleading, since the daily volume in a city of 500,000 households, with readings at 5-second intervals, will be about 950GB (110 x 12 x 60 x 24 x 500,000 bytes). Within a month, this dataset grows to roughly 28.5 Terabytes; after one year, its size reaches about 347 Terabytes.
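The arithmetic is worth making explicit, given how small each record is. A quick sketch, assuming a 30-day month and one reading every 5 seconds, as above:

```python
# Storage generated by smart-meter packets, using decimal units (1 GB = 1e9 bytes).
record_bytes = 110          # one reading: 10 + 10 + 10 + 80 bytes
readings_per_minute = 12    # one reading every 5 seconds
households = 500_000

daily_bytes = record_bytes * readings_per_minute * 60 * 24 * households
print(f"daily:   {daily_bytes / 1e9:,.1f} GB")      # ~950.4 GB
print(f"monthly: {daily_bytes * 30 / 1e12:,.1f} TB")   # ~28.5 TB
print(f"yearly:  {daily_bytes * 365 / 1e12:,.1f} TB")  # ~346.9 TB
```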
Intelligent electricity meters are just one example of semi-structured data. With the proliferation of Internet-of-Things (IoT) devices, semi-structured data will be the fastest-growing of the three types.
Unstructured – strictly speaking, this data is still structured. However, in this case we deal with many different structures and formats. A more accurate term would be multi-structured; however, unstructured is the term currently used, for one reason or another.
Examples of unstructured data include social media content – audio, video, graphics, and text – as well as data from external systems and from enterprise sources such as Word files, emails, and PDFs.
Figure 1 shows the wide variety of data items generated every minute in 2022. Some highlights contributing to Big Data include users sharing 1.7 million pieces of content on Facebook, uploading 500 hours of YouTube videos, and spending 104.6 thousand hours in Zoom meetings.
Summary
Figure 1 highlights the continuous, significant growth in Big Data across all three characteristics – volume, velocity, and variety. However, this infographic presents only part of the picture: the data generated by the activities of individual consumers. Even higher data volumes are coming from organisations in various industries. And this growth in Big Data will accelerate significantly during COVID-19 and afterwards, as organisations adopt new technologies and deploy new infrastructure, while "connected" consumers embrace new ways of connecting, shopping and working with great confidence.
What do you think of the need for Big Data in your organisation? Please feel free to drop me a message or leave a comment below.