BIG DATA
Anurag Kumar Singh
Senior Software Engineer II at MakeMyTrip | FrontEnd | React Native
About 90% of the data around the globe was produced in the last two years alone. A huge amount of data is produced every day. Google currently processes over 20 petabytes of data per day, while 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated in the form of photo and video uploads, message exchanges, comments, etc. Social media reports show that almost 550 new social media users are added every minute. These are the numbers generated every minute of the day:
- Snapchat users share 527,760 photos
- More than 120 professionals join LinkedIn
- 456,000 tweets are sent on Twitter
Moreover -
- Instagram users upload over 100 million photos and videos every day. That is about 69,444 posts every minute!
- Over 3.5 billion Google searches are conducted worldwide every day. That is about 2 trillion searches per year worldwide, or over 40,000 search queries per second!
- More than 4 million hours of content are uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day.
- 4.3 billion Facebook messages are posted and 5.76 billion Facebook likes are registered every day.
If the capacity of storage devices is much smaller than the data such companies generate per day, how do they manage it? How do they process such a large amount of data without storing it anywhere? How does Google search give results in just a few seconds after processing data accumulated over years? Can we store data whose volume exceeds the capacity of a single storage device?
Such an enormous amount of data is known as Big Data. Many think Big Data is a new technology, but it is actually a term/concept given to a plethora of problems.
What is Big Data?
There are many definitions of big data. One popular interpretation refers to extremely large data sets. A National Institute of Standards and Technology report defined big data as consisting of “extensive datasets primarily in the characteristics of volume, velocity, and/or variability that require a scalable architecture for efficient storage, manipulation, and analysis.” Some have defined big data as an amount of data that exceeds a petabyte (one million gigabytes). Some of you may be thinking that big data is a kind of technology, but let me be clear: it is not a technology, it is a problem that many big companies like Google, Facebook, and Amazon are facing nowadays.
These are the main challenges associated with Big Data that people face:
1. Input/output processing:
Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization.
It includes the following stages (a rough code sketch of them follows the list):
- Data collection
- Data preparation
- Data input
- Processing
- Data output/interpretation
- Data storage
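Here is a minimal, hypothetical Python sketch of those stages end to end; the function names and the toy log records are made up purely for illustration.

```python
# A minimal sketch of the processing stages above, using a toy list of raw
# log lines as input. The function names here are illustrative, not a real API.

def collect():
    # Data collection: gather raw records (here, hard-coded log lines).
    return ["user=1 action=click", "user=2 action=view", "user=1 action=view"]

def prepare(raw):
    # Data preparation / input: clean each record and split it into fields.
    return [dict(field.split("=") for field in line.split()) for line in raw]

def process(records):
    # Processing: count actions per user.
    counts = {}
    for rec in records:
        counts[rec["user"]] = counts.get(rec["user"], 0) + 1
    return counts

def output(counts):
    # Data output/interpretation: present the result in a readable form.
    for user, n in sorted(counts.items()):
        print(f"user {user}: {n} actions")

def store(counts, path="counts.txt"):
    # Data storage: persist the processed result for later use.
    with open(path, "w") as f:
        for user, n in counts.items():
            f.write(f"{user},{n}\n")

if __name__ == "__main__":
    records = prepare(collect())
    counts = process(records)
    output(counts)
    store(counts)
```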
2. Volume:
To store this much data we need a lot of storage. Suppose the biggest storage device in the whole world holds 10 GB, but your data is 20 GB; how are you going to fit it? According to popular storage solution companies like Dell EMC, IBM, etc., creating huge amounts of storage is not a big deal. But if we store the data inside one or a few very large storage devices, we face two more issues: the first is cost and the other is velocity. Also, if that one huge storage device somehow gets corrupted, it will be the biggest disaster for the company. Here I am only pointing out a few key challenges we have under Big Data; don't think these are the only ones.
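As a rough back-of-the-envelope sketch (all numbers assumed purely for illustration), this is how a data volume translates into a count of commodity nodes once replication is taken into account:

```python
# Back-of-the-envelope sketch: how many commodity nodes does a data set need?
# Every number here is an assumption chosen only for illustration.
import math

data_tb = 500              # raw data to store, in terabytes (assumed)
replication_factor = 3     # each block kept on 3 different nodes (a common default)
node_capacity_tb = 10      # usable storage per commodity node (assumed)

total_tb = data_tb * replication_factor
nodes_needed = math.ceil(total_tb / node_capacity_tb)

print(f"Raw data: {data_tb} TB")
print(f"With replication x{replication_factor}: {total_tb} TB")
print(f"Commodity nodes of {node_capacity_tb} TB each: {nodes_needed}")
```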
3. Velocity: Have you ever wondered why Google is so fast? The simple answer is velocity.
Usually, when we store our data in RAM, you will notice that RAM is super fast. But when we store our data on hard disks or SSDs, it is comparatively much slower. Now you might say: then store the data in RAM, why do you need an SSD or hard disk at all? The problem lies in the nature of RAM. RAM is ephemeral storage, so as soon as you close a program, its data vanishes from RAM. That means we can't store data permanently in RAM. So we need to find solutions that are fast, meaning they can read and write data very quickly.
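A tiny, unscientific sketch of the RAM-versus-disk gap, assuming a 64 MB payload; absolute timings depend entirely on the hardware, only the relative difference matters here.

```python
# Rough, unscientific comparison of in-memory vs on-disk writes.
# Absolute timings depend on your hardware; the point is only the gap.
import io
import time

payload = b"x" * (64 * 1024 * 1024)   # 64 MB of data

start = time.perf_counter()
buf = io.BytesIO()
buf.write(payload)                    # "RAM": write into an in-memory buffer
ram_time = time.perf_counter() - start

start = time.perf_counter()
with open("payload.bin", "wb") as f:  # disk: write the same bytes to a file
    f.write(payload)
    f.flush()
disk_time = time.perf_counter() - start

print(f"in-memory write: {ram_time:.4f} s")
print(f"on-disk write:   {disk_time:.4f} s")
```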
4. Cost: sometimes this also becomes a challenge, depending on the company or business.
As we all know, when you use bigger storage devices their price increases exponentially. So companies also need to think about how to lower their costs, because a business runs on revenue, and if revenue starts decreasing because of storage purchases, the company might die. So companies can't buy huge storage devices outright to store their data. After looking at the solutions discussed next, you will see how they keep their costs under control.
Solution:
Big Data can be taken as a field that covers ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be handled by traditional data processing.
One of the concepts that solves the above problems is Distributed Storage.
Distributed Storage Cluster:
Distributed storage is an attempt to offer the advantages of centralized storage with the scalability and cost base of local storage. A distributed object store is made up of many individual object stores, normally consisting of one or a small number of physical disks. These object stores run on commodity server hardware, which might be the compute nodes or might be separate servers configured solely for providing storage services. As such, the hardware is relatively inexpensive. The disk of each virtual machine is broken up into a large number of small segments, typically a few megabytes in size each, and each segment is stored several times (often three) on different object stores. Each copy of a segment is called a replica. The team of master and slave nodes working together for one purpose is called a cluster.
In simple terms, think of it this way: you have 4 laptops or 4 storage servers, typically known as Slave Nodes or Data Nodes. Every laptop is connected via a network to one main laptop, typically known as the Master Node or Name Node. Now suppose each server has 10 GB of storage; if 40 GB of data comes in, we won't be able to store it on one server, and this is where Distributed Storage comes into play (a minimal code sketch of the idea follows the list below).
- The master always receives the data and distributes it among the slaves. That means we no longer have to worry about the volume problem: no matter how big the data is, we can easily distribute it across the slaves, and we don't need to purchase bigger storage devices.
- Since we are not purchasing bigger storage devices, our cost also decreases. We can buy lots of small storage servers and attach them to the master. If the data grows further in the future, we simply purchase more storage servers and keep attaching them to the master.
- The final thing is speed. Suppose one storage server takes 10 minutes to store 10 GB of data. Because multiple storage servers work in parallel, storing the same 40 GB of data on 4 storage devices (10 GB on each server) still takes only about 10 minutes, whereas using a single storage device to read or write 40 GB would take over 40 minutes. And it's not only about storing the data; it's also about how fast you can read it back. These are simple examples; in industry these architectures are much bigger, with lots of components attached to each other.
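To make the block distribution concrete, here is a minimal Python sketch of a master splitting data into small blocks and spreading replicas across slave nodes. The node names, block size, and replication factor are assumptions for illustration; real systems such as HDFS are far more involved.

```python
# Minimal sketch of a "master" splitting data into blocks and spreading
# replicas across "slave" nodes. Names and sizes are illustrative only.

BLOCK_SIZE = 4          # bytes per block (tiny, just for the demo)
REPLICATION = 3         # copies kept of each block
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Cut the incoming data into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    # Round-robin placement: block i goes to `replication` consecutive nodes.
    placement = {}
    for i, block in enumerate(blocks):
        targets = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        placement[i] = {"block": block, "nodes": targets}
    return placement

data = b"this is a small file pretending to be big data"
for block_id, info in place_replicas(split_into_blocks(data), NODES).items():
    print(block_id, info["block"], "->", info["nodes"])
```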
One of the topologies or architectures used for implementing a distributed storage system is the master-slave architecture.
Master-slave architecture: Master-slave is a model of asymmetric communication or control in which one device (the "master") is connected to one or more other devices (the "slaves") over a network protocol. The slave nodes/devices contribute part or all of their storage to the master node, so the centralized master node can store data exceeding its own local storage limit.
The slave nodes can be commodity hardware/devices that need not be expensive or high-performing.
As the data is split and stored in (and loaded from) other devices/nodes, the I/O operations (data transfers) can be performed in parallel on all the devices, reducing the time consumed by a large factor.
Hence, distributed storage solves all the problems: Volume, Cost, and Velocity.
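The speed argument can be illustrated with a toy simulation; each chunk transfer is faked with a one-second sleep, so only the serial-versus-parallel gap is meaningful, not the absolute times.

```python
# Toy simulation of the speed argument: writing 4 chunks one after another
# versus writing them to 4 "slave nodes" in parallel. Each write is faked
# with a sleep, so the numbers only illustrate the idea.
import time
from concurrent.futures import ThreadPoolExecutor

def write_chunk(node, seconds=1.0):
    time.sleep(seconds)          # pretend this is a 10 GB transfer to one node
    return f"{node}: done"

nodes = ["node1", "node2", "node3", "node4"]

start = time.perf_counter()
for node in nodes:               # serial: one node after another
    write_chunk(node)
serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
    list(pool.map(write_chunk, nodes))   # parallel: all nodes at once
parallel = time.perf_counter() - start

print(f"serial:   {serial:.2f} s")
print(f"parallel: {parallel:.2f} s")
```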
Some Star performers behind Facebook’s Big Data:
Facebook uses technologies like Hadoop, Scuba, Hive, Cassandra, and many more.
Hadoop:
Facebook runs the world's largest Hadoop cluster, says Jay Parikh, Vice President of Infrastructure Engineering at Facebook. Facebook's Hadoop cluster goes beyond 40,000 machines and stores hundreds of petabytes of data.
Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From search, log processing, recommendation systems, and data warehousing to video and image analysis, Hadoop empowers this social networking platform in every way possible. Facebook developed its first user-facing application, Facebook Messenger, based on the Hadoop database, i.e., Apache HBase.
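To give a feel for the MapReduce model that Hadoop popularized, here is a toy, single-machine word count in Python; this only illustrates the map/shuffle/reduce idea and is not how Facebook's actual jobs are written.

```python
# The MapReduce idea behind Hadoop, shrunk to a single machine:
# map each line to (word, 1) pairs, then reduce by summing per word.
from collections import defaultdict

lines = [
    "big data is not a technology",
    "big data is a problem",
]

# Map phase: emit (key, value) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key.
reduced = {word: sum(counts) for word, counts in grouped.items()}

print(reduced)   # e.g. {'big': 2, 'data': 2, 'is': 2, ...}
```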
Scuba
With a huge amount of unstructured data coming in each day, Facebook slowly realized that it needed a platform to speed up the entire analysis part. That's when it developed Scuba, which could help the Hadoop developers dive into the massive data sets and carry out ad-hoc analyses in real time.
According to Jay Parikh, “Scuba gives us this very dynamic view into how our infrastructure is doing — how our servers are doing, how our network is doing, how the different software systems are interacting.”
Scuba is Facebook's fast slice-and-dice data store. It stores thousands of tables in about 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second, with most response times under 1 second.
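The "slice-and-dice" idea can be illustrated with a plain in-memory list of rows; this is not Scuba's actual API, and the metrics below are invented for the example.

```python
# The "slice-and-dice" idea behind an in-memory store like Scuba,
# illustrated with a plain Python list of rows. Not Scuba's real API.
rows = [
    {"ts": 1, "dc": "eu", "service": "web",   "latency_ms": 120},
    {"ts": 1, "dc": "us", "service": "web",   "latency_ms": 95},
    {"ts": 2, "dc": "eu", "service": "cache", "latency_ms": 4},
    {"ts": 2, "dc": "us", "service": "web",   "latency_ms": 101},
]

# Slice: keep only the rows we care about (web traffic).
web = [r for r in rows if r["service"] == "web"]

# Dice: group by data center and aggregate latency.
by_dc = {}
for r in web:
    by_dc.setdefault(r["dc"], []).append(r["latency_ms"])

for dc, latencies in by_dc.items():
    print(dc, "avg latency:", sum(latencies) / len(latencies), "ms")
```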
Hive
After Yahoo implemented Hadoop for its search engine, Facebook thought about empowering its data scientists, who needed to store a larger amount of data than the Oracle data warehouse could handle. Hence, Hive came into existence. This tool improved the query capability of Hadoop by using a subset of SQL.
Hive is Facebook's data warehouse, with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day. Presto, HiveQL, Hadoop, and Giraph are the common query engines over Hive.
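To show what "a subset of SQL" over Hadoop looks like in practice, here is a hedged sketch assuming the third-party PyHive client and a hypothetical Hive server; the host name, table, and columns are made up for illustration.

```python
# Sketch of querying a Hive warehouse with HiveQL, assuming the PyHive client.
# The server, table, and column names below are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL looks like ordinary SQL but is executed as jobs over Hadoop data.
cursor.execute("""
    SELECT country, COUNT(*) AS signups
    FROM user_events
    WHERE event_type = 'signup'
    GROUP BY country
    ORDER BY signups DESC
    LIMIT 10
""")

for country, signups in cursor.fetchall():
    print(country, signups)
```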
ODS:
Operational Data Store (ODS) stores 2 billion time series of counters. It is used most commonly in alerts and dashboards and for troubleshooting system metrics with 1 to 5 minutes of time lag. ODS comprises a time series database (TSDB), a query service, and a detection and alerting system. ODS's TSDB is built on top of the HBase storage system.
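A toy sketch of the time-series-counter idea behind ODS, with counters bucketed by minute and a range query feeding a dashboard or alert; this is an in-memory illustration, not ODS's (or HBase's) actual design.

```python
# Toy illustration of a time-series counter store: values are bucketed by
# minute per metric, then a dashboard or alert reads a time range.
from collections import defaultdict

# metric name -> {minute bucket -> counter value}
tsdb = defaultdict(lambda: defaultdict(int))

def record(metric, minute, value=1):
    # Increment the counter for this metric in the given minute bucket.
    tsdb[metric][minute] += value

def query(metric, start, end):
    # Return (minute, value) points in [start, end] for dashboards/alerts.
    return [(m, v) for m, v in sorted(tsdb[metric].items()) if start <= m <= end]

record("web.errors", 100, 3)
record("web.errors", 101, 7)
record("web.errors", 102, 2)

points = query("web.errors", 100, 102)
print(points)
if any(v > 5 for _, v in points):
    print("ALERT: error spike detected")
```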