How Do Social Media Sites Like Facebook and Google Manage Big Data?
Tejashwini Kottha
Sr. Software Developer ★ AWS DevOps ★ Python Developer ★ MLOps Intern ★ Backend Developer ★ ARTH Learner
What is Big Data?
Big Data is a term used to describe a collection of data that is huge in volume and still growing exponentially with time. In short, such data is so large and complex that traditional data management tools cannot store or process it efficiently.
Characteristics Of Big Data
(i) Volume – The name Big Data itself refers to enormous size. The volume of data plays a crucial role in determining the value that can be extracted from it.
(ii) Variety – Variety refers to the heterogeneous sources and nature of data. Data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is now considered in analysis applications. This variety of data poses challenges for storing, mining, and analyzing it.
(iii) Velocity – The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, and sensors. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency that data can show at times, which hampers the process of handling and managing it effectively.
Benefits of Big Data Processing
The ability to process Big Data brings multiple benefits, such as:
- Businesses can utilize external intelligence while making decisions
- Improved customer service
- Early identification of risk to the product/services, if any
- Better operational efficiency
Hadoop:
Apache Hadoop is an open-source software framework used to develop data processing applications that run in a distributed computing environment.
HDFS (the Hadoop Distributed File System) is a distributed file system for storing very large data files on clusters of commodity hardware. It is fault-tolerant, scalable, and extremely simple to expand, and it comes bundled with Hadoop.
When data exceeds the storage capacity of a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage operations across a network of machines is called a distributed file system, and HDFS is one such system.
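To make this concrete, here is a minimal sketch (not from the original setup) of reading and writing HDFS files from Python via the third-party hdfs WebHDFS client. The NameNode address, user name, and paths are placeholder assumptions:

```python
# Minimal HDFS sketch using the third-party `hdfs` package (WebHDFS client).
# Assumes WebHDFS is enabled and reachable at the placeholder address below.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")  # placeholder host/user

client.makedirs("/data/logs")                          # create a directory in HDFS
client.upload("/data/logs/events.log", "events.log")   # copy a local file into HDFS

with client.read("/data/logs/events.log") as reader:   # stream the file back out
    content = reader.read()
print(content[:200])                                   # first bytes of the file
print(client.list("/data/logs"))                       # list the directory contents
```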
Hadoop Architecture:
NameNode: The NameNode holds the metadata for every file and directory in the HDFS namespace.
DataNode: A DataNode manages the storage attached to an HDFS node and serves read and write requests for the actual data blocks.
Master node: The master node coordinates the parallel processing of data using Hadoop MapReduce.
Slave node: The slave nodes are the additional machines in the Hadoop cluster that store the data and carry out the computations. Each slave node runs a DataNode and a Task Tracker, which synchronize with the NameNode and the Job Tracker respectively.
In Hadoop, the master and slave systems can be set up in the cloud or on premises.
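To see how work is spread across the slave nodes in practice, here is a minimal word-count job written for Hadoop Streaming. It is a rough sketch; the file name and run commands are illustrative:

```python
#!/usr/bin/env python3
"""Minimal word-count job for Hadoop Streaming (illustrative sketch).

Run the same file in two modes:
    python3 wordcount.py map     (used as the -mapper)
    python3 wordcount.py reduce  (used as the -reducer)
"""
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin; Hadoop shuffles/sorts by key.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Keys arrive sorted, so counts for the same word are consecutive.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A job like this would typically be submitted with the Hadoop Streaming jar, e.g. hadoop jar hadoop-streaming-*.jar -files wordcount.py -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" -input /data/logs -output /data/wordcount (the jar name and paths are placeholders).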
Features of Hadoop
Suitable for Big Data Analysis
As Big Data tends to be distributed and unstructured in nature, Hadoop clusters are well suited for analyzing it. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept is called data locality, and it helps increase the efficiency of Hadoop-based applications.
Scalability
Hadoop clusters can easily be scaled to any extent by adding cluster nodes, which allows for the growth of Big Data. Scaling does not require modifications to the application logic.
Fault Tolerance
The Hadoop ecosystem replicates the input data onto other cluster nodes. That way, in the event of a node failure, data processing can still proceed using the data stored on another node.
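As a small illustration of this replication (assuming the hdfs CLI is on the PATH and the example path exists), the sketch below raises a file's replication factor and asks HDFS where each block replica lives:

```python
import subprocess

# Illustrative only: set the replication factor of one file to 3, then ask
# HDFS to report each block and the DataNodes holding its replicas.
path = "/data/logs/events.log"   # placeholder path

subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", path], check=True)

report = subprocess.run(
    ["hdfs", "fsck", path, "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True,
)
print(report.stdout)
```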
Big Data in Social Media:
Statistics show that 500+ terabytes of new data are ingested into the databases of social media sites (Facebook, Google, YouTube, etc.) every day. This data is generated mainly through photo and video uploads, message exchanges, comments, and so on.
Example: A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, data generation reaches many petabytes.
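A rough back-of-the-envelope calculation makes the scale clear; everything except the 10 TB per 30 minutes figure is an assumption chosen for illustration:

```python
# Back-of-the-envelope check of the jet-engine example above
# (all inputs except the 10 TB / 30 min figure are assumptions).
tb_per_half_hour = 10        # ~10 TB per engine per 30 minutes (from the article)
flight_hours = 2             # assumed average flight length
engines_per_plane = 2        # assumed twin-engine aircraft
flights_per_day = 5_000      # "many thousand flights per day"

tb_per_flight = tb_per_half_hour * 2 * flight_hours * engines_per_plane
pb_per_day = tb_per_flight * flights_per_day / 1024
print(f"~{pb_per_day:,.0f} PB of engine data per day")   # -> ~391 PB
```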
Facebook Uses Hadoop to Manage Big Data:
Hadoop is the key tool Facebook uses, not simply for analysis, but as an engine to power many features of the Facebook site, including messaging. That multitude of monster workloads drove the company to launch its Prism project, which supports geographically distributed Hadoop data stores.
Google's Big Data Challenge:
Google, like any other company that generates a huge amount of data, uses the cloud to store it. Given that the number of users is always volatile, the amount of data generated on any given day is also volatile. Therefore Google does not use off-the-shelf storage; purchasing 20 PB of hardware every day is out of the question. Google needs storage that is not only scalable but also durable.
Google's solution to this:
Distributed File System, BigTable, and Object-Based Storage!
Object Based Storage:
In simple words, it means storing data as objects. An object consists of the data itself, metadata (information about the stored data, e.g. size and type), and a global identifier.
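A toy model in Python shows the three pieces of such an object; the field names and values here are purely illustrative:

```python
import uuid
from dataclasses import dataclass, field

# Toy model of an "object" in object-based storage: the payload itself,
# metadata describing it, and a global identifier used to locate it.
@dataclass
class StorageObject:
    data: bytes
    metadata: dict                      # e.g. size, content type, owner
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))

photo = StorageObject(
    data=b"...jpeg bytes...",
    metadata={"size": 2_048_576, "type": "image/jpeg", "owner": "user_42"},
)
print(photo.object_id, photo.metadata["type"])
```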
Distributed File System:
A distributed file system is a way of storing and reading data across different servers, but through the same interface as accessing a local file. Google uses its own distributed file system, known as the Google File System (GFS), to solve its scalability problem by incorporating object-based storage.
GFS consists of three layers:
The client: handles requests for data from applications.
The master: stores the metadata, mainly the names of data files and the locations of their chunks.
The chunk servers: huge amounts of data are broken down into fixed-size chunks (64 MB in GFS) and stored across servers, with replicas kept for backup.
This describes a single cluster with a single master. Google runs a distributed master system that can handle hundreds of masters, each of which can manage about 100 million files: a distributed master system on top of a distributed file system.
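The division of labour between the master (metadata only) and the chunk servers (the bytes, replicated) can be sketched in a few lines of Python. This is a toy model under assumed chunk sizes and server names, not Google's implementation:

```python
# Toy sketch of the GFS idea: the master keeps only metadata
# (file name -> chunk ids -> chunk-server locations), while chunk servers
# hold the bytes, replicated three ways.
CHUNK_SIZE = 64          # bytes here, for the demo; real GFS uses 64 MB chunks
REPLICAS = 3

chunk_servers = {f"cs{i}": {} for i in range(1, 6)}   # five pretend chunk servers
master = {}                                           # file -> [(chunk_id, [servers])]

def write_file(name: str, data: bytes) -> None:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    master[name] = []
    for n, chunk in enumerate(chunks):
        chunk_id = f"{name}#{n}"
        # Place each chunk on REPLICAS different servers (round-robin here).
        servers = [f"cs{(n + r) % 5 + 1}" for r in range(REPLICAS)]
        for s in servers:
            chunk_servers[s][chunk_id] = chunk
        master[name].append((chunk_id, servers))

def read_file(name: str) -> bytes:
    # The client asks the master where the chunks are, then reads the bytes
    # directly from any live replica.
    return b"".join(chunk_servers[servers[0]][cid] for cid, servers in master[name])

write_file("crawl/part-0001", b"x" * 200)   # 200 bytes -> 4 chunks of <= 64 bytes
print(len(master["crawl/part-0001"]), "chunks tracked by the master")
assert read_file("crawl/part-0001") == b"x" * 200
```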
BigTable :
At Google's enormous scale, GFS alone is not enough. The system needs to scale every day, and that is where BigTable comes into play: it solves the problem of managing petabytes of storage that grow daily.
- BigTable stores data in tables.
- A row key is a URL.
- A column can hold the features of the web page.
- A cell contains the data, which is time-stamped.
- Row ranges are broken up into partitions called tablets.
- Tablets are distributed across multiple servers for load balancing.
The concept of tablets is what gives BigTable its enormous power to handle such huge amounts of data.
So BigTable, with a distributed master system controlling an army of distributed file systems, is the secret behind Google's seemingly infinite scalability.
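Here is a toy sketch of the data model described above: each cell is addressed by (row key, column, timestamp), and the sorted row-key space is split into tablets. The column names, split points, and server names are illustrative assumptions, not Google's actual schema:

```python
import time
from bisect import bisect_right
from collections import defaultdict

# Toy model of the BigTable data model: each cell is addressed by
# (row key, column, timestamp).
table = defaultdict(dict)          # row key -> {(column, timestamp): value}

def put(row, column, value, ts=None):
    table[row][(column, ts if ts is not None else time.time())] = value

def latest(row, column):
    # Return the most recent value written to this (row, column).
    cells = [(ts, v) for (col, ts), v in table[row].items() if col == column]
    return max(cells)[1] if cells else None

# Row keys are URLs; columns hold features of the page (contents, anchors, ...).
put("com.example/index.html", "contents:html", "<html>...</html>")
put("com.example/index.html", "anchor:news.example.org", "Example link")
print(latest("com.example/index.html", "contents:html"))

# Tablets: split the sorted row-key space at chosen boundaries and assign
# each range to a (pretend) tablet server for load balancing.
split_points = ["com.example", "org.wikipedia"]   # assumed split points

def tablet_for(row_key):
    return f"tablet-server-{bisect_right(split_points, row_key)}"

print(tablet_for("com.example/index.html"))   # -> tablet-server-1
```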