How Big Companies Like Facebook and Google Store Big Data


"DATA IS A PRECIOUS THING AND WILL LAST LONGER THAN THE SYSTEMS THEMSELVES."
— Tim Berners-Lee

Nowadays, in this era of growing technology, we store data such as photos, documents, and videos on devices that have 32 GB or 1 TB of storage. But have you ever wondered where the data you get after searching anything on Google is stored, or where the photos and videos you upload to your social media accounts end up?

After some time we start facing storage problems on our devices; in that case we insert a memory card or buy a new hard drive and carry on with our work quite easily. Is it possible to do the same thing for big companies?

Let's have a look at the data Facebook stores per day.



Facebook's systems process 2.5 billion pieces of content and 500+ terabytes of data each day. The platform pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data every half hour.


Let's also have a look at Google's storage.



Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.

The place where Google stores and handles all its data is a data center. Google doesn't operate the biggest data centers, but it still handles a huge amount of data. A data center normally holds petabytes to exabytes of data.

Now, what are these new terms, petabytes and exabytes? The largest data size most of us have heard of is the terabyte (TB). 1 petabyte (PB) = 1024 terabytes (TB), and 1 exabyte (EB) = 1024 petabytes (PB); an exabyte can be understood as roughly one million terabytes. From this we can slowly start to grasp how much data we are talking about.

Google uses its own data centers and also collaborates with other data centers to store its data. Each data center can cover an area of about 20 football fields combined. It is hard to calculate the exact amount of data, but with some educated guessing, using the capital expenditures at remote locations, the electricity consumption at each data center, and the number of servers they run, we can conclude that Google holds roughly 10-15 exabytes of data. That is equivalent to the data of about 30 million PCs combined. So the next time someone stops you and asks how much data Google handles, you can boldly answer that Google handles 10-15 exabytes of data.
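As a rough back-of-the-envelope check of these numbers, here is a minimal Python sketch; the 15 EB figure is the upper end of the estimate quoted above, and the 500 GB average PC disk is an assumption made only for illustration:

```python
# Back-of-the-envelope conversions for the storage units mentioned above.
GB = 1
TB = 1024 * GB            # 1 TB = 1024 GB
PB = 1024 * TB            # 1 PB = 1024 TB
EB = 1024 * PB            # 1 EB = 1024 PB (~1 million TB)

google_estimate_eb = 15                       # upper end of the 10-15 EB guess
google_estimate_tb = google_estimate_eb * EB / TB

avg_pc_gb = 500                               # assumed average PC disk size (illustrative)
pcs_needed = google_estimate_eb * EB / avg_pc_gb

print(f"{google_estimate_eb} EB = {google_estimate_tb:,.0f} TB")
print(f"~ {pcs_needed:,.0f} PCs of {avg_pc_gb} GB each")
```

Running this gives roughly 32 million PCs for 15 EB, which lines up with the "30 million PCs combined" figure above.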

And this is for Facebook and Google alone. If you also consider Gmail, Instagram, Google Maps, and other companies, it is not possible for us to calculate the data; we would not even know the unit to measure that much data without googling it, because every company needs different storage.

And this is what is known as BIG DATA. It is not a technology; it is a problem faced by the data world. It is a storage problem: how do we create this much storage, and how do we store the data and read/write it quickly and efficiently?

What is Big Data?

Big data problems have brought many changes in the way data is processed and managed over time. Today, data is not just posing a challenge in terms of volume but also in terms of the high speed at which it is generated. Data quality and validity vary from source to source and are thus difficult to process. This issue has led to the development of several stream processing engines/platforms by companies such as Yahoo, LinkedIn, etc. Besides better performance in terms of latency, stream processing overcomes another shortcoming of batch data processing systems, i.e., scaling with high-"velocity" data. The availability of several platforms has also created another challenge for user organizations: selecting the most appropriate stream processing platform for their needs.

In simple words, let's take an example. If Google had only 10 TB of data storage and you wanted to send an email, but Google did not have enough storage to store and send it, that would be a problem, because Google has no more storage. Similarly, if you searched something on Google and it told you to come back after 4 days for your results, you would say Google is not useful because it does not show the data on time. This is the Big Data problem: slow speed and insufficient storage.

The 4 V's of Big Data




Volume

The main characteristic that makes data “big” is the sheer volume. It makes no sense to focus on minimum storage units because the total amount of information is growing exponentially every year. In 2010, Thomson Reuters estimated in its annual report that it believed the world was “awash with over 800 exabytes of data and growing.”

For that same year, EMC, a hardware company that makes data storage devices, thought it was closer to 900 exabytes and would grow by 50 percent every year. No one really knows how much new data is being generated, but the amount of information being collected is huge.
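To make the 50 percent annual growth figure concrete, here is a small sketch that simply compounds EMC's 2010 estimate forward; the starting figure and growth rate are the ones quoted above, and the projection is purely illustrative:

```python
# Compound EMC's 2010 estimate (~900 EB) forward at 50% growth per year.
data_eb = 900          # estimated exabytes of data worldwide in 2010
growth_rate = 0.50     # assumed 50% growth per year

for year in range(2010, 2016):
    print(f"{year}: ~{data_eb:,.0f} EB")
    data_eb *= (1 + growth_rate)
```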

Variety

Variety is one of the most interesting developments in technology as more and more information is digitized. Traditional data types (structured data) include things on a bank statement like date, amount, and time. These are things that fit neatly in a relational database.

Structured data is augmented by unstructured data, which is where things like Twitter feeds, audio files, MRI images, web pages, and web logs are put — anything that can be captured and stored but doesn't have a meta model (a set of rules to frame a concept or idea; it defines a class of information and how to express it) that neatly defines it.

Unstructured data is a fundamental concept in big data. The best way to understand unstructured data is by comparing it to structured data. Think of structured data as data that is well defined in a set of rules. For example, money will always be numbers and have at least two decimal points; names are expressed as text; and dates follow a specific pattern.


With unstructured data, on the other hand, there are no rules. A picture, a voice recording, a tweet — they all can be different but express ideas and thoughts based on human understanding. One of the goals of big data is to use technology to take this unstructured data and make sense of it.
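A tiny sketch of the distinction: the structured record below fits a fixed schema, while the unstructured items are just raw content with no predefined fields. The field names and values here are made up purely for illustration:

```python
from datetime import date

# Structured: every field has a defined type and meaning, like a bank statement row.
transaction = {
    "date": date(2020, 5, 14),   # dates follow a fixed pattern
    "amount": 249.99,            # money is numeric, with two decimal places
    "payee": "ACME Corp",        # names are text
}

# Unstructured: raw content with no schema describing what is inside.
tweet = "Just watched the launch -- incredible!"
voice_note = b"<raw audio bytes>"   # e.g. the contents of a .wav recording

# A relational database can validate the first directly; the second needs
# extra processing (NLP, speech-to-text, image recognition) to extract meaning.
print(transaction["amount"], len(tweet), len(voice_note))
```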

Veracity

Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that the data is representative? Every good manager knows that there are inherent discrepancies in all the data collected.

Velocity

Velocity is the frequency of incoming data that needs to be processed. Think about how many SMS messages, Facebook status updates, or credit card swipes are being sent on a particular telecom carrier every minute of every day, and you'll have a good appreciation of velocity. A streaming service such as Amazon Kinesis (on Amazon Web Services) is an example of a platform built to handle the velocity of data.
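A minimal sketch of what velocity means in code: events arrive continuously and must be counted or aggregated as they stream in, rather than loaded as one batch. The event source below is simulated in plain Python, not a real Kinesis stream:

```python
import random
import time
from collections import Counter

def event_stream(n_events):
    """Simulate a continuous feed of incoming events (SMS, status updates, card swipes)."""
    kinds = ["sms", "status_update", "card_swipe"]
    for _ in range(n_events):
        yield {"type": random.choice(kinds), "ts": time.time()}

# Process each event as it arrives instead of waiting for a complete data set.
counts = Counter()
for event in event_stream(10_000):
    counts[event["type"]] += 1

print(counts)
```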

Value

It may seem painfully obvious to some, but a real objective is critical to this mashup of the four V’s. Will the insights you gather from analysis create a new product line, a cross-sell opportunity, or a cost-cutting measure? Or will your data analysis lead to the discovery of a critical causal effect that results in a cure to a disease?

The Main Problem of Big Data

Even though big data is changing businesses by providing actionable insights, there are certain problems related to it. One problem with big data is that it grows constantly, and organizations often fail to capture the opportunities and extract actionable data. Companies often fail to recognize where they need to allocate their resources, and this failure results in not making the most of the information. Apart from that, organizations often end up with talent that does not understand how to use big data analytics. Such a dearth of trained employees who can extract information results in companies not making the most of the information they hold. Furthermore, while extracting insights from their big data, companies fail to identify the right objective and end up with insights that are not so helpful for their growth.


How do we overcome this problem?

Distributed Storage Cluster:

Big Data involves two huge problems: one is huge data, i.e., volume (size), and the other is huge speed, i.e., velocity (I/O). To solve these problems we have one approach, or concept, or technology, and that concept is known as Distributed Storage. It sits at the core of all the issues of the Big Data world.

To implement this concept we need a product, and that product is known as Hadoop. In Hadoop we create a master and slave relationship, i.e., a cluster, and the whole setup is known as a Hadoop Cluster.
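A highly simplified sketch of the master/slave idea: one master keeps the metadata about which slave holds which block, while the slaves hold the actual data. Real Hadoop (NameNode and DataNodes) is far more involved; the class and method names below are invented for illustration only:

```python
from itertools import cycle

class Master:
    """Toy 'master' node: records which slave nodes store each block of a file."""

    def __init__(self, slaves, replication=3):
        self.slaves = slaves
        self.replication = replication
        self.block_map = {}            # block_id -> list of slave names
        self._placer = cycle(slaves)   # simple round-robin placement

    def place_block(self, block_id):
        """Choose slaves to hold one block and remember the placement."""
        targets = [next(self._placer) for _ in range(self.replication)]
        self.block_map[block_id] = targets
        return targets

master = Master(slaves=["slave-1", "slave-2", "slave-3", "slave-4"])
for block in ["bigfile.part0", "bigfile.part1", "bigfile.part2"]:
    print(block, "->", master.place_block(block))
```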



What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware (inexpensive, widely available, standard machines rather than specialized servers). It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.

In simple words, take your system as the master system. If you need more storage, you ask your friends to give you storage (they act as slaves), but here it is not handed to you physically; it is provided through this software, which creates a new drive on your system that you can use. If you take storage from 50 people, imagine how much storage you get. Similarly, if you want to store a large file that you cannot fit on your own system, you break that file into pieces and store them on different drives, which also lets you read and write the file more quickly and efficiently.
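The "break the file into pieces" idea from the paragraph above, as a small sketch: a large file is cut into fixed-size blocks that could then be written to different drives or nodes and read back in parallel. The block size and the file name are arbitrary placeholders here, not real HDFS settings:

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB pieces, an illustrative size only

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield (block_number, chunk_of_bytes) pieces of a large file."""
    with open(path, "rb") as f:
        block_no = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            yield block_no, chunk
            block_no += 1

# Each piece could be written to a different drive or slave node and later
# read back (or processed) in parallel.
for block_no, chunk in split_into_blocks("big_video.mp4"):
    print(f"block {block_no}: {len(chunk):,} bytes")
```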

Advantages of a Hadoop Cluster

  • Hadoop clusters can boost the processing speed of many big data analytics jobs, given their ability to break down large computational tasks into smaller tasks that can be run in a parallel, distributed fashion.
  • Hadoop clusters are easily scalable and can quickly add nodes to increase throughput and maintain processing speed when faced with growing volumes of data.
  • The use of low cost, high availability commodity hardware makes Hadoop clusters relatively easy and inexpensive to set up and maintain.
  • Hadoop clusters replicate a data set across the distributed file system, making them resilient to data loss and cluster failure.
  • Hadoop clusters make it possible to integrate and leverage data from multiple different source systems and data formats.
  • It is possible to deploy Hadoop using a single-node installation, for evaluation purposes.

Hope you liked the article! Thank you! Like and comment!

