Big Data: Problem and Cure
Aman Jhagrolia
SRE @ Zscaler | Ex-SRE @ Signzy | Ex-DevOps Intern @ TO THE NEW | ARTH Learner | Amity University Rajasthan
What is Data?
Data are characteristics or information that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects.
What is Big Data?
Big Data is still data, just at a huge scale. The term describes a collection of data that is huge in volume and keeps growing exponentially over time. In short, such data is so large and complex that traditional data management tools cannot store or process it efficiently.
Data Growth over the years -
3 Vs of Big Data -
(i) Volume – The name Big Data itself refers to an enormous size. The size of the data plays a crucial role in determining the value that can be derived from it, and whether a particular data set counts as Big Data at all depends on its volume.
(ii) Velocity – The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines its real potential.
(iii) Variety – Variety in Big Data refers to all the structured and unstructured data that can be generated either by humans or by machines. Structured data, such as rows in a database table, follows a fixed format. The most commonly added data, however, are texts, tweets, pictures and videos, which have no fixed format. Other unstructured data like emails, voicemails, hand-written text, ECG readings and audio recordings are also important elements under Variety.
Facebook Stats -
Facebook has revealed some stats on its big data. Unless noted otherwise, these are figures for a single day -
- 2.5 Billion - Content items shared
- 2.7 Billion - Likes
- 300 Million - Photos uploaded
- 100+ Petabytes - Disk space in a single HDFS cluster (a capacity figure, not a daily one)
- 105 Terabytes - Data scanned via Hive every 30 minutes
- 70,000 - Queries executed
- 500+ Terabytes - New data ingested
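To get a feel for the velocity behind these numbers, the daily totals can be turned into average per-second rates. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes) and a load spread evenly over 24 hours, which real traffic is not. The helper name `per_second` is made up for this example.

```python
# Convert the daily figures quoted above into average per-second rates.
# Assumptions: decimal units (1 TB = 10**12 bytes) and an even load over 24 hours.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(daily_total: float) -> float:
    """Average rate per second for a quantity accumulated over one day."""
    return daily_total / SECONDS_PER_DAY


ingest_bytes_per_s = per_second(500 * 10**12)  # 500 TB of new data per day
photos_per_s = per_second(300 * 10**6)         # 300 million photos per day
likes_per_s = per_second(2.7 * 10**9)          # 2.7 billion likes per day

print(f"Ingest: {ingest_bytes_per_s / 10**9:.2f} GB/s")
print(f"Photos: {photos_per_s:,.0f} per second")
print(f"Likes:  {likes_per_s:,.0f} per second")
```

Under these assumptions the ingest alone works out to roughly 5.79 GB every second, alongside about 3,472 photos and 31,250 likes per second.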
Why do we need Big Data?
Data are generated incessantly and contain nuggets of valuable insight that are critical for business success. The challenge is how to analyse and process these data to extract those nuggets and use them to strengthen business strategy, efficiency and performance, be it customer feedback, market trends, demand for a product or competitor activity.
Big Data solutions help companies make sense out of random information, become proactive and start setting the pace instead of continuously putting out fires and following competition.
How is Big Data a Problem?
- Big MNCs like Facebook, Google and Amazon receive a huge amount of data every day, on the order of terabytes or petabytes, and they have to store it. Think about how large a storage device would have to be to hold all of this data: to date, no single storage device with such a large capacity is available.
- Even if we could build a single volume with such a large capacity, another problem comes up: I/O. As the size of the storage device increases, its I/O rate (i.e. velocity) relative to the amount of data it holds decreases, so reading or writing all of the data takes a very long time.
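The I/O problem can be made concrete with some rough arithmetic. The sketch below compares reading the same data set from one device with reading it from many devices in parallel. The 200 MB/s throughput is an illustrative assumption, not a measured figure, and the calculation ignores network and coordination overhead; `read_hours` is a hypothetical helper.

```python
# Time to read a data set end to end: one device versus many devices in parallel.
# Assumptions (illustrative only): 200 MB/s sequential throughput per disk,
# decimal units, a perfectly even split, no network or coordination overhead.
DISK_THROUGHPUT = 200 * 10**6  # bytes per second per disk (hypothetical)
DATA_SIZE = 500 * 10**12       # 500 TB, the daily ingest figure quoted earlier


def read_hours(total_bytes: int, disks: int) -> float:
    """Hours needed when the data is split evenly and all disks read at once."""
    bytes_per_disk = total_bytes / disks
    return bytes_per_disk / DISK_THROUGHPUT / 3600


for disks in (1, 100, 1000):
    print(f"{disks:>5} disk(s): {read_hours(DATA_SIZE, disks):8.2f} hours")
```

With these numbers a single disk would need about 694 hours (roughly 29 days) just to read one day of data, while 1,000 disks working in parallel would need about 42 minutes.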
Solution to Big Data -
A Distributed Storage System is the infrastructure that can split data across multiple physical servers, and often across more than one data centre. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
It works on the principle of master-slave architecture: the master is the system to which every other system (the slaves) contributes its hard disk, so that together they solve the big data storage problem.
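The master-slave idea can be sketched in a few lines. This is a toy, single-process model, not how any real distributed file system is implemented: the class names `Master` and `Slave`, the round-robin placement and the 4-byte block size are all assumptions made for illustration, and there is no replication or failure handling.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes; tiny on purpose so the split is easy to see


@dataclass
class Slave:
    """A node that contributes its storage to the cluster."""
    name: str
    blocks: dict = field(default_factory=dict)  # (file_name, block_no) -> bytes


@dataclass
class Master:
    """Keeps track of which slave holds which block of each file."""
    slaves: list = field(default_factory=list)
    index: dict = field(default_factory=dict)  # file_name -> slave name per block

    def register(self, slave: Slave) -> None:
        self.slaves.append(slave)

    def store(self, file_name: str, data: bytes) -> None:
        if not self.slaves:
            raise RuntimeError("no slave nodes registered")
        placement = []
        for block_no, start in enumerate(range(0, len(data), BLOCK_SIZE)):
            slave = self.slaves[block_no % len(self.slaves)]  # round-robin
            slave.blocks[(file_name, block_no)] = data[start:start + BLOCK_SIZE]
            placement.append(slave.name)
        self.index[file_name] = placement

    def read(self, file_name: str) -> bytes:
        by_name = {slave.name: slave for slave in self.slaves}
        return b"".join(
            by_name[name].blocks[(file_name, block_no)]
            for block_no, name in enumerate(self.index[file_name])
        )


master = Master()
for node_name in ("slave-1", "slave-2", "slave-3"):
    master.register(Slave(node_name))

master.store("log.txt", b"big data demo!")
print(master.index["log.txt"])  # which slave holds each block
print(master.read("log.txt"))   # blocks reassembled in order
```

The master stores only the placement information, while the actual bytes live on the slaves, so capacity grows simply by registering more slave nodes.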
Hadoop -
There are many big data tools available, such as Apache Hadoop, Apache Spark, Flink, Apache Storm, Apache Cassandra, MongoDB, Kafka and many more, but Hadoop is one of the most famous tools for Big Data.
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
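The MapReduce programming model itself can be illustrated with the classic word-count example. The sketch below runs in a single Python process and does not use Hadoop or its API. The function names `map_phase`, `shuffle` and `reduce_phase` are made up for this example, and in a real cluster each phase would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data is everywhere"]

# Each document could be mapped on a different machine; here it is a plain loop.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be spread over many computers, which is what makes the model suitable for big data.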
Thanks, hope you liked it!