Big Data
Ankit Kumar
DevOps | Terraform | Linux | Azure | Azure DevOps | AWS | RH294 (Ansible) | Python | Docker | Kubernetes | Grafana | ELK | Prometheus
We live in the world of data, and most big companies like Facebook, Google, Microsoft, Netflix, and Amazon deal with enormous amounts of data on a daily basis.
What is Big Data?
It is a term that describes large volumes of data, both structured and unstructured. It is not a technology; it is a problem that we have to deal with.
Take Facebook as an example. According to figures revealed by Facebook:
- It collects 500+ terabytes of data every day.
- Users generate 2.7 billion likes and upload 300 million photos per day.
- It scans 105 terabytes of data every 30 minutes.
So the data these companies collect every day is enormous, and to deal with it they require equally enormous storage.
Why is Big Data important?
When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:
- Determining root causes of failures, issues, and defects in near-real-time.
- Generating coupons at the point of sale based on the customer’s buying habits.
- Recalculating entire risk portfolios in minutes.
- Detecting fraudulent behavior before it affects your organization.
What are the challenges that come with Big Data?
- Volume
- Velocity
- Variety
These are the 3 V's of big data. Let us talk about each of them.
Volume
Volume is essentially a storage problem. Companies receive terabytes or even petabytes of data every day, and storing all of it is a serious issue. Storage vendors can build high-capacity hard disks, but such hardware is very expensive, so companies have to think carefully before investing that kind of money.
Let us try to picture how much data these companies receive every day.
We upload photographs to Facebook. That statement does not boggle the mind until you realize that Facebook has more users than the population of China, and each of those users uploads photos. Facebook is storing roughly 250 billion images. Just think about 250 billion images. In 2016, Facebook had 2.5 trillion posts.
Netflix, in turn, processes about 400 billion events daily, peaking at 17 GB per second.
The next big company is Google. Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide, and it currently processes over 20 petabytes of data per day.
All of this data needs to be stored on hard disks, and the numbers are almost too big to imagine. This is the volume vector.
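As a back-of-the-envelope check on these numbers, here is a small Python sketch. The 500 TB/day figure comes from the Facebook statistics above; the 10 TB drive capacity and 3x replication factor are assumptions for illustration:

```python
# Back-of-the-envelope storage estimate.
# DAILY_INTAKE_TB comes from the Facebook figure quoted above;
# the drive capacity and replication factor are assumptions.

DAILY_INTAKE_TB = 500        # terabytes collected per day
DRIVE_CAPACITY_TB = 10       # assumed capacity of one commodity disk
REPLICATION_FACTOR = 3       # assumed HDFS-style 3x replication

drives_per_day = DAILY_INTAKE_TB * REPLICATION_FACTOR / DRIVE_CAPACITY_TB
yearly_intake_tb = DAILY_INTAKE_TB * 365

print(f"Drives filled per day (with replication): {drives_per_day:.0f}")
print(f"Raw intake per year: {yearly_intake_tb:,} TB (~{yearly_intake_tb / 1000:.1f} PB)")
```

Even with these rough assumptions, that is on the order of 150 drives filled every day and over 180 PB of raw intake per year, for a single company.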
Velocity
Although storage vendors are capable of building petabyte-scale storage, a new problem appears: I/O (input/output).
The I/O problem: companies are flooded with terabytes of data every few minutes or hours.
Suppose a company somehow manages to get all this data onto hard disks; the problem then becomes the time it takes to write and read it.
If we look at the write speed of a normal SATA hard disk, storing 1 GB of data takes roughly ten seconds to a minute, and companies receive data in terabytes. Imagine how long it takes just to write that data to disk; and when we process the data, loading it back into RAM takes a long time again. This is the input/output problem of big data. If a company spent a day or two just saving its data, then imagine searching something on Google and getting your results two days later.
So, this is the velocity vector of big data.
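To see the arithmetic, here is a quick sketch assuming a sequential write speed of about 100 MB/s for a commodity SATA disk (an assumed figure; real speeds vary):

```python
# Time to write 1 TB sequentially to one SATA disk vs. in parallel
# across many disks. The 100 MB/s write speed is an assumed figure.

WRITE_SPEED_MB_S = 100                  # assumed sequential write speed
DATA_MB = 1 * 1_000_000                 # 1 TB expressed in MB (decimal)

for disks in (1, 10, 100):
    seconds = DATA_MB / (WRITE_SPEED_MB_S * disks)
    print(f"{disks:>3} disk(s): {seconds / 3600:.2f} hours to write 1 TB")
```

A single disk needs hours for one terabyte, but spreading the write across 100 disks brings it down to minutes. This is exactly the observation that the distributed storage discussed below exploits.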
Variety
As we have already seen with Facebook, Netflix, and Google, we upload photos, like and comment on them, log in, log out, and perform many other activities. Each of these generates a different type of data, and that is what variety means.
The same is true of the network packets we generate while surfing the internet. Or consider email: a legal discovery process might require sifting through thousands or even millions of messages in a collection, and no two of them are exactly alike. Along with the message itself, location and timing metadata are attached. So companies receive a wide variety of data from everyday events, and all of it has to be processed before it can be used.
This is the variety vector of big data.
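To make variety concrete, here is a tiny Python illustration of the three broad shapes data arrives in. All of the record contents are made up for the example:

```python
import json

# Structured: a fixed-schema record, like a row in a relational table.
structured_row = ("user_42", "2016-05-01 10:32:07", "login")

# Semi-structured: a JSON event whose fields can vary between records.
semi_structured = json.loads(
    '{"user": "user_42", "action": "like", "photo_id": 1234,'
    ' "geo": {"lat": 28.61, "lon": 77.21}}'
)

# Unstructured: raw binary content, e.g. the first bytes of a JPEG photo.
unstructured = b"\xff\xd8\xff\xe0"

for label, value in [("structured", structured_row),
                     ("semi-structured", semi_structured),
                     ("unstructured", unstructured)]:
    print(f"{label}: {type(value).__name__} -> {value!r}")
```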
How to manage Big Data
To tackle big data, companies use distributed storage techniques.
The core concept of distributed storage is to break the data into small pieces and store them across many servers.
So what actually happens when we split the data and store it across servers?
The data arrives at a main server, which is connected to many small servers via the network. When data comes in, the main server splits it into small chunks and sends each chunk to one of the small servers.
From the volume perspective, storing the data as many small pieces across many machines solves the storage problem.
Writing many small pieces in parallel also takes much less time, so it solves our velocity issue as well. All of these servers are connected to the master over the network.
This is how distributed storage helps solve the problems of big data. This design is called the master-slave model: the master is the main server, and the small servers are called slaves. To create such a cluster we need software, and the name of that software is Hadoop. A minimal sketch of the idea appears below.
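To make the chunk-and-distribute idea concrete, here is a minimal Python sketch. The chunk size, node names, and round-robin placement are all illustrative assumptions; real HDFS uses 128 MB blocks, replicates each block, and lets the master (NameNode) decide placement:

```python
import itertools

CHUNK_SIZE = 64                               # bytes; HDFS uses 128 MB blocks
SLAVES = ["slave-1", "slave-2", "slave-3"]    # hypothetical node names

def split_into_chunks(data: bytes, size: int) -> list[bytes]:
    """Break incoming data into fixed-size chunks, as the master does."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def distribute(chunks: list[bytes], slaves: list[str]) -> dict:
    """Assign chunks to slave nodes round-robin over the 'network'."""
    nodes = itertools.cycle(slaves)
    return {idx: (next(nodes), chunk) for idx, chunk in enumerate(chunks)}

incoming = b"x" * 300                         # stand-in for a large file
placement = distribute(split_into_chunks(incoming, CHUNK_SIZE), SLAVES)
for idx, (node, chunk) in placement.items():
    print(f"chunk {idx} ({len(chunk)} bytes) -> {node}")
```

In a real Hadoop cluster you would not write this yourself: a command such as `hdfs dfs -put bigfile /data/` asks the master (NameNode) to split the file into blocks and place replicas on the slaves (DataNodes).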
Prism Project
Together with Yahoo, Facebook spearheaded the creation of Hadoop, a sweeping software platform for processing and analyzing the epic amounts of data streaming across the modern web. Facebook is staring down an even larger avalanche of data, and there are new limitations that need fixing.
This project aims to solve one of the biggest problems Facebook has faced operating at its uniquely massive scale: how to create server clusters that can operate as a unit even when they are geographically distributed. In short, it is a means of managing Hadoop across data centers.
I hope you liked the article. Have a good day.
Thank you!