BIG DATA – a problem
In a world full of technology, millions of users connect to these technologies every day, and as the number of users grows, the amount of data grows in parallel, running into petabytes. The more users there are, the more data is generated. This ever-growing mass of data is what we call Big Data.
What is Big Data?
Big Data is a term used to describe data that is huge in volume and keeps growing with time. Such data is difficult and time-consuming to process with traditional tools. Big Data is mainly characterized by:
· Volume – the quantity of data generated and stored
· Velocity – the speed at which data is produced and processed (I/O)
How and where is this Big Data managed?
To answer this question, I did some research on one of the biggest MNCs: Google. Google is one of the largest tech companies in the world and holds a huge database of user information. It stores almost every kind of data: pictures, videos, contacts, locations, documents, search history, download history and much more. A few days ago, Facebook was in the news for leaking the personal data of more than 50 crore (500 million) of its users; Google stores even more personal data than Facebook.
Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. With all of these products and services, and the unthinkable amount of data that comes with them, how does a company like Google go about storing its information? If we get a little meta and turn to Google itself with this question, we learn that the answer lies in the functionality of thousands upon thousands of servers. In August 2011, Data Center Knowledge reported that the number was close to 900,000. Pretty remarkable, right?
Google, like any other company that generates huge amounts of data, uses cloud-style storage, because the number of users is always volatile and so the amount of data generated on any given day is volatile too. Google therefore does not use an off-the-shelf storage product for its data.
1 GB of storage costs about $0.03.
20 petabytes therefore cost 0.03 × 20 × 1,000,000 = $600,000 per day.
That's quite a bit of money: over a year it comes to roughly $219 million, like hiring 2,190 employees at $100,000 a year!
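A quick back-of-the-envelope calculation, assuming the illustrative $0.03/GB price used above, makes the scale concrete:

# Back-of-the-envelope storage cost, using the illustrative $0.03/GB figure above.
COST_PER_GB = 0.03           # assumed price in dollars
GB_PER_PETABYTE = 1_000_000  # 1 PB = 1,000,000 GB (decimal units)

daily_ingest_pb = 20
daily_cost = COST_PER_GB * daily_ingest_pb * GB_PER_PETABYTE
yearly_cost = daily_cost * 365

print(f"Cost per day : ${daily_cost:,.0f}")                         # $600,000
print(f"Cost per year: ${yearly_cost:,.0f}")                        # $219,000,000
print(f"Equivalent $100k salaries: {yearly_cost / 100_000:,.0f}")   # 2,190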
Purchasing 20 PB of hardware every day is out of the question. Google needs storage that is not only scalable but also durable.
How does Google solve this problem?
A Distributed File System (DFS) is a way of storing and reading data across many servers through the same interface as accessing a local file. Google solves its scalability problem with its own distributed file system, built around chunk-based storage and known as the Google File System (GFS).
GFS consists of 3 layers (a minimal sketch of how a read flows through them follows the list):
· The Client – handles requests for data from applications.
· The Master – stores the metadata, mainly the names of data files and the locations of their chunks.
· The Chunk Server – stores the data itself; huge files are broken down into fixed-size chunks (64 MB in GFS) and stored across servers, with replicas kept for backup.
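To make the three layers concrete, here is a minimal, hypothetical sketch (not Google's actual code; all names and paths are made up) of how a read might flow: the client asks the master only for metadata, then fetches the bytes directly from a chunk server.

# Hypothetical sketch of a GFS-style read path; names and data are illustrative.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

class Master:
    """Stores only metadata: file name -> list of (chunk_id, [replica servers])."""
    def __init__(self):
        self.metadata = {
            "/logs/search.log": [("chunk-001", ["cs-a", "cs-b", "cs-c"]),
                                 ("chunk-002", ["cs-b", "cs-d", "cs-e"])],
        }

    def lookup(self, path, offset):
        chunk_index = offset // CHUNK_SIZE
        return self.metadata[path][chunk_index]   # (chunk_id, replicas)

class ChunkServer:
    """Holds the actual chunk bytes; here just a dict for illustration."""
    def __init__(self, chunks):
        self.chunks = chunks

    def read(self, chunk_id, offset_in_chunk, length):
        return self.chunks[chunk_id][offset_in_chunk:offset_in_chunk + length]

def client_read(master, chunk_servers, path, offset, length):
    # 1. Ask the master WHERE the data lives (metadata only).
    chunk_id, replicas = master.lookup(path, offset)
    # 2. Read the bytes directly from one replica; no data flows through the master.
    server = chunk_servers[replicas[0]]
    return server.read(chunk_id, offset % CHUNK_SIZE, length)

# Toy wiring: every chunk server holds a small stand-in for its chunks.
servers = {name: ChunkServer({"chunk-001": b"hello from chunk 1",
                              "chunk-002": b"hello from chunk 2"})
           for name in ["cs-a", "cs-b", "cs-c", "cs-d", "cs-e"]}
print(client_read(Master(), servers, "/logs/search.log", offset=0, length=5))  # b'hello'

The key design point this illustrates is that the master hands out only metadata, while the heavy chunk traffic goes straight between clients and chunk servers.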
If you are curious about how such large amounts of data flow from one part of the system to another, with multiple master and slave machines involved, the following excerpt gives a glimpse of how Google might handle this and share enormous volumes of information across a very widely distributed network –
“A system having a resource manager, and a plurality of slaves, interconnected by a communications network. To distribute data, a master determines that a destination slave of the plurality of slaves requires data. The master then generates a list of slaves from which to transfer data to the destination slave. The master transmits the list to the resource manager. The resource manager is configured to select a source slave from the list based on available system resources.”
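In plain terms: the master decides which existing replicas could supply the data, and the resource manager picks the least-loaded source. Here is a hypothetical sketch of that selection step (function and server names are illustrative, not taken from the excerpt):

# Hypothetical sketch of the distribution step described above.
def generate_source_list(replica_locations, destination):
    """Master: list every slave that already holds the data, except the destination."""
    return [s for s in replica_locations if s != destination]

def select_source(candidates, system_load):
    """Resource manager: pick the candidate with the most available resources
    (here simply the one with the lowest current load)."""
    return min(candidates, key=lambda slave: system_load[slave])

# Example: a chunk lives on cs-a, cs-b, cs-c and must be copied to cs-f.
load = {"cs-a": 0.9, "cs-b": 0.2, "cs-c": 0.6}
candidates = generate_source_list(["cs-a", "cs-b", "cs-c"], destination="cs-f")
print(select_source(candidates, load))   # -> "cs-b", the least-loaded source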
Google’s web servers are those that will probably resonate most with the common user, as they are responsible for handling the queries that we enter into Google Search. When a user enters a query, web servers carry out the process of interacting with other server types (e.g. index, spelling, ad, etc.) and returning results/serving ads in HTML format. Web servers are the ‘results-gathering’ servers. On a similar note, Google has servers designated to perform specific tasks –
1. Data-Gathering Servers
Data-gathering servers send out bots to crawl the web.
2. Index Servers
Google’s index servers hold the lists of document IDs for the documents that contain the user’s query terms.
3. Document Servers
Document servers store copies of web page content, saved in forms such as JPEG files, PDF files, and more.
4. Ad Servers
Ad servers manage the ads shown on the search results pages.
5. Spelling Servers
If you have ever searched for something on Google and the results came up with the phrase “Did you mean <correct spelling>?”, know that a spelling server was at work.
How can we implement such a Distributed File System?
Google uses distributed computing to satisfy its customers’ needs: more than 1,000 computers are involved in answering every query. The most popular open-source framework for distributed computing is Apache Hadoop, whose storage layer, the Hadoop Distributed File System (HDFS), is designed to run on commodity hardware. The Hadoop market has a compound annual growth rate of 58% and was projected to surpass $1 billion by 2020. A rough sketch of the block-and-replica idea behind HDFS follows.
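As a rough illustration (not the real HDFS code; the datanode names are made up), this sketch shows the core idea HDFS borrows from GFS: a file is cut into fixed-size blocks (128 MB by default in recent HDFS versions) and each block is replicated, by default onto three datanodes.

# Illustrative sketch of HDFS-style block placement; datanode names are invented.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size
REPLICATION = 3                  # default HDFS replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def place_blocks(file_size_bytes):
    """Return a block -> [datanodes] placement for a file of the given size."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    rotation = itertools.cycle(DATANODES)
    placement = {}
    for block_id in range(num_blocks):
        placement[f"block-{block_id}"] = [next(rotation) for _ in range(REPLICATION)]
    return placement

# A 1 GB file becomes 8 blocks, each stored on 3 different datanodes.
for block, nodes in place_blocks(1024 ** 3).items():
    print(block, nodes)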
What is Hadoop?
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of Big Data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware. A minimal word-count example in the MapReduce style is sketched below.
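To give a feel for the MapReduce model, here is a minimal word-count example that simulates the map, shuffle and reduce phases locally in Python. It is a sketch of the programming model only, not an actual Hadoop job.

# Minimal word count in the MapReduce style: map emits (word, 1) pairs,
# the "shuffle" groups pairs by key, and reduce sums the counts per word.
from collections import defaultdict

def map_phase(line):
    for word in line.strip().lower().split():
        yield word, 1

def reduce_phase(word, counts):
    return word, sum(counts)

def run_job(lines):
    # Shuffle step: group every emitted value by its key.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(run_job(["big data is big", "hadoop processes big data"]))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}

In a real Hadoop cluster the map and reduce functions run in parallel on many machines, with HDFS holding the input and output data.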
Thanks for reading!!