How Big Companies Like Facebook and Google Store Big Data


"DATA IS A PRECIOUS THING AND WILL LAST LONGER THAN THE SYSTEMS THEMSELVES."
— Tim Berners-Lee

Nowadays, in this era of growing technology, we store data such as photos, documents, and videos on devices that have 32 GB or 1 TB of storage. But have you ever wondered where the data you get after searching anything on Google is stored, or where the photos and videos you upload to your social media accounts end up?

After some time we start facing storage problems on our devices; in that case we insert a memory card or buy a new hard drive and carry on with our work quite easily. Is it possible to do the same thing for big companies?

Let's have a look at the data Facebook stores per day.



Facebook's systems process 2.5 billion pieces of content and 500+ terabytes of data each day. The platform pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data every half hour.


Let's also have a look at Google's storage.



Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.

The place where Google stores and handles all its data is a data center. Google doesn't operate the biggest data centers, but it still handles a huge amount of data. A data center normally holds petabytes to exabytes of data.

Now, what are these new terms, petabytes and exabytes? The largest data size most of us have heard of is the terabyte (TB). 1 petabyte (PB) = 1024 terabytes (TB), and 1 exabyte (EB) = 1024 petabytes (PB); an exabyte can be understood as roughly one million terabytes. From this we can slowly start to grasp how much data we are talking about.

Google uses its own data centers and also collaborates with other data centers to store its data. Each data center can cover an area of about 20 football fields combined. It is hard to calculate the exact amount of data, but with some educated guessing, using the capital expenditures at remote locations, the electricity consumption at each data center, and the number of servers they run, we can conclude that Google holds roughly 10-15 exabytes of data. That is equivalent to the data of about 30 million PCs combined. So the next time someone stops you and asks how much data Google handles, you can boldly answer that Google handles 10-15 exabytes of data.
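As a rough back-of-the-envelope check of these numbers, here is a minimal Python sketch; the 15 EB figure is the upper end of the estimate quoted above, and the 500 GB average PC disk is an assumption made only for illustration:

```python
# Back-of-the-envelope conversions for the storage units mentioned above.
GB = 1
TB = 1024 * GB            # 1 TB = 1024 GB
PB = 1024 * TB            # 1 PB = 1024 TB
EB = 1024 * PB            # 1 EB = 1024 PB (~1 million TB)

google_estimate_eb = 15                       # upper end of the 10-15 EB guess
google_estimate_tb = google_estimate_eb * EB / TB

avg_pc_gb = 500                               # assumed average PC disk size (illustrative)
pcs_needed = google_estimate_eb * EB / avg_pc_gb

print(f"{google_estimate_eb} EB = {google_estimate_tb:,.0f} TB")
print(f"~ {pcs_needed:,.0f} PCs of {avg_pc_gb} GB each")
```

Running this gives roughly 32 million PCs for 15 EB, which lines up with the "30 million PCs combined" figure above.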

And this is for Facebook and Google alone. If you also consider Gmail, Instagram, Google Maps, and other companies, it is not possible for us to calculate the data; we would not even know the unit to measure that much data without googling it, because every company needs different storage.

And this is what is known as BIG DATA. It is not a technology; it is a problem faced by the data world. It is a storage problem: how do we create this much storage, and how do we store the data and read/write it quickly and efficiently?

What is Big Data?

Big data problems have brought many changes in the way data is processed and managed over time. Today, data is not just posing a challenge in terms of volume but also in terms of the high speed at which it is generated. Data quality and validity vary from source to source and are thus difficult to process. This issue has led to the development of several stream processing engines/platforms by companies such as Yahoo, LinkedIn, etc. Besides better performance in terms of latency, stream processing overcomes another shortcoming of batch data processing systems, i.e., scaling with high-"velocity" data. The availability of several platforms has also created another challenge for user organizations: selecting the most appropriate stream processing platform for their needs.

In simple words, let's take an example. If Google had only 10 TB of data storage and you wanted to send an email, but Google did not have enough storage to store and send it, that would be a problem, because Google has no more storage. Similarly, if you searched something on Google and it told you to come back after 4 days for your results, you would say Google is not useful because it does not show the data on time. This is the Big Data problem: slow speed and insufficient storage.

The 4 V's of Big Data




Volume

The main characteristic that makes data “big” is the sheer volume. It makes no sense to focus on minimum storage units because the total amount of information is growing exponentially every year. In 2010, Thomson Reuters estimated in its annual report that it believed the world was “awash with over 800 exabytes of data and growing.”

For that same year, EMC, a hardware company that makes data storage devices, thought it was closer to 900 exabytes and would grow by 50 percent every year. No one really knows how much new data is being generated, but the amount of information being collected is huge.
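To make the 50 percent annual growth figure concrete, here is a small sketch that simply compounds EMC's 2010 estimate forward; the starting figure and growth rate are the ones quoted above, and the projection is purely illustrative:

```python
# Compound EMC's 2010 estimate (~900 EB) forward at 50% growth per year.
data_eb = 900          # estimated exabytes of data worldwide in 2010
growth_rate = 0.50     # assumed 50% growth per year

for year in range(2010, 2016):
    print(f"{year}: ~{data_eb:,.0f} EB")
    data_eb *= (1 + growth_rate)
```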

Variety

Variety is one of the most interesting developments in technology as more and more information is digitized. Traditional data types (structured data) include things on a bank statement like date, amount, and time. These are things that fit neatly in a relational database.

Structured data is augmented by unstructured data, which is where things like Twitter feeds, audio files, MRI images, web pages, and web logs are put — anything that can be captured and stored but doesn't have a meta model (a set of rules to frame a concept or idea; it defines a class of information and how to express it) that neatly defines it.

Unstructured data is a fundamental concept in big data. The best way to understand unstructured data is by comparing it to structured data. Think of structured data as data that is well defined in a set of rules. For example, money will always be numbers and have at least two decimal points; names are expressed as text; and dates follow a specific pattern.


With unstructured data, on the other hand, there are no rules. A picture, a voice recording, a tweet — they all can be different but express ideas and thoughts based on human understanding. One of the goals of big data is to use technology to take this unstructured data and make sense of it.
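A tiny sketch of the distinction: the structured record below fits a fixed schema, while the unstructured items are just raw content with no predefined fields. The field names and values here are made up purely for illustration:

```python
from datetime import date

# Structured: every field has a defined type and meaning, like a bank statement row.
transaction = {
    "date": date(2020, 5, 14),   # dates follow a fixed pattern
    "amount": 249.99,            # money is numeric, with two decimal places
    "payee": "ACME Corp",        # names are text
}

# Unstructured: raw content with no schema describing what is inside.
tweet = "Just watched the launch -- incredible!"
voice_note = b"<raw audio bytes>"   # e.g. the contents of a .wav recording

# A relational database can validate the first directly; the second needs
# extra processing (NLP, speech-to-text, image recognition) to extract meaning.
print(transaction["amount"], len(tweet), len(voice_note))
```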

Veracity

Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that the data is representative? Every good manager knows that there are inherent discrepancies in all the data collected.

Velocity

Velocity is the frequency of incoming data that needs to be processed. Think about how many SMS messages, Facebook status updates, or credit card swipes are being sent on a particular telecom carrier every minute of every day, and you'll have a good appreciation of velocity. A streaming service such as Amazon Kinesis (on Amazon Web Services) is an example of a platform built to handle the velocity of data.
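A minimal sketch of what velocity means in code: events arrive continuously and must be counted or aggregated as they stream in, rather than loaded as one batch. The event source below is simulated in plain Python, not a real Kinesis stream:

```python
import random
import time
from collections import Counter

def event_stream(n_events):
    """Simulate a continuous feed of incoming events (SMS, status updates, card swipes)."""
    kinds = ["sms", "status_update", "card_swipe"]
    for _ in range(n_events):
        yield {"type": random.choice(kinds), "ts": time.time()}

# Process each event as it arrives instead of waiting for a complete data set.
counts = Counter()
for event in event_stream(10_000):
    counts[event["type"]] += 1

print(counts)
```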

Value

It may seem painfully obvious to some, but a real objective is critical to this mashup of the four V’s. Will the insights you gather from analysis create a new product line, a cross-sell opportunity, or a cost-cutting measure? Or will your data analysis lead to the discovery of a critical causal effect that results in a cure to a disease?

The Main Problem of Big Data

Even though big data is changing businesses by providing actionable insights, there are certain problems related to it. One problem with big data is that it grows constantly, and organizations often fail to capture the opportunities and extract actionable data. Companies often fail to recognize where they need to allocate their resources, and this failure results in not making the most of the information. Apart from that, organizations often end up with talent that does not understand how to use big data analytics. Such a dearth of trained employees who can extract information results in companies not making the most of the information they hold. Furthermore, while extracting insights from their big data, companies fail to identify the right objective and end up with insights that are not so helpful for their growth.


How do we overcome this problem?

Distributed Storage Cluster:

Big Data involves two huge problems: one is huge data, i.e., volume (size), and the other is huge speed, i.e., velocity (I/O). To solve these problems we have one approach, or concept, or technology, and that concept is known as Distributed Storage. It sits at the core of all the issues of the Big Data world.

To implement this concept we need a product, and that product is known as Hadoop. In Hadoop we create a master and slave relationship, i.e., a cluster, and the whole setup is known as a Hadoop Cluster.
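A highly simplified sketch of the master/slave idea: one master keeps the metadata about which slave holds which block, while the slaves hold the actual data. Real Hadoop (NameNode and DataNodes) is far more involved; the class and method names below are invented for illustration only:

```python
from itertools import cycle

class Master:
    """Toy 'master' node: records which slave nodes store each block of a file."""

    def __init__(self, slaves, replication=3):
        self.slaves = slaves
        self.replication = replication
        self.block_map = {}            # block_id -> list of slave names
        self._placer = cycle(slaves)   # simple round-robin placement

    def place_block(self, block_id):
        """Choose slaves to hold one block and remember the placement."""
        targets = [next(self._placer) for _ in range(self.replication)]
        self.block_map[block_id] = targets
        return targets

master = Master(slaves=["slave-1", "slave-2", "slave-3", "slave-4"])
for block in ["bigfile.part0", "bigfile.part1", "bigfile.part2"]:
    print(block, "->", master.place_block(block))
```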



What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware (inexpensive, widely available, standard machines rather than specialized servers). It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.

In simple words, take your system as the master system. If you need more storage, you ask your friends to give you storage (they act as slaves), but here it is not handed to you physically; it is provided through this software, which creates a new drive on your system that you can use. If you take storage from 50 people, imagine how much storage you get. Similarly, if you want to store a large file that you cannot fit on your own system, you break that file into pieces and store them on different drives, which also lets you read and write the file more quickly and efficiently.
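The "break the file into pieces" idea from the paragraph above, as a small sketch: a large file is cut into fixed-size blocks that could then be written to different drives or nodes and read back in parallel. The block size and the file name are arbitrary placeholders here, not real HDFS settings:

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB pieces, an illustrative size only

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield (block_number, chunk_of_bytes) pieces of a large file."""
    with open(path, "rb") as f:
        block_no = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            yield block_no, chunk
            block_no += 1

# Each piece could be written to a different drive or slave node and later
# read back (or processed) in parallel.
for block_no, chunk in split_into_blocks("big_video.mp4"):
    print(f"block {block_no}: {len(chunk):,} bytes")
```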

Advantages of a Hadoop Cluster

  • Hadoop clusters can boost the processing speed of many big data analytics jobs, given their ability to break down large computational tasks into smaller tasks that can be run in a parallel, distributed fashion.
  • Hadoop clusters are easily scalable and can quickly add nodes to increase throughput and maintain processing speed when faced with growing volumes of data.
  • The use of low cost, high availability commodity hardware makes Hadoop clusters relatively easy and inexpensive to set up and maintain.
  • Hadoop clusters replicate a data set across the distributed file system, making them resilient to data loss and cluster failure.
  • Hadoop clusters make it possible to integrate and leverage data from multiple different source systems and data formats.
  • It is possible to deploy Hadoop using a single-node installation, for evaluation purposes.

Hope you liked the article! Thank you! Like and comment!

