How Do Social Media Sites Like Facebook and Google Manage Big Data?
Tejashwini Kottha
Sr. Software Developer ★ AWS DevOps ★ Python Developer ★ MLOps Intern ★ Backend Developer ★ ARTH Learner
What is Big Data?
Big Data is a term used to describe a collection of data that is huge in volume and still growing exponentially with time. In short, such data is so large and complex that traditional data management tools cannot store or process it efficiently.
Characteristics Of Big Data
(i) Volume – The name Big Data itself refers to enormous size. The volume of data plays a crucial role in determining the value that can be extracted from it.
(ii) Variety – Variety refers to the heterogeneous sources and nature of data. Data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is now considered in analysis applications. This variety of data poses challenges for storing, mining, and analyzing it.
(iii) Velocity – The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, and sensors. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency that data can show at times, which hampers the process of handling and managing it effectively.
Benefits of Big Data Processing
The ability to process Big Data brings multiple benefits, such as:
- Businesses can utilize external intelligence while making decisions
- Improved customer service
- Early identification of risk to the product/services, if any
- Better operational efficiency
Hadoop:
Apache Hadoop is an open-source software framework used to develop data processing applications that run in a distributed computing environment.
HDFS (the Hadoop Distributed File System) is a distributed file system for storing very large data files on clusters of commodity hardware. It is fault-tolerant, scalable, and extremely simple to expand, and it comes bundled with Hadoop.
When data exceeds the storage capacity of a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage operations across a network of machines is called a distributed file system, and HDFS is one such system.
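To make this concrete, here is a minimal sketch (not from the original setup) of reading and writing HDFS files from Python via the third-party hdfs WebHDFS client. The NameNode address, user name, and paths are placeholder assumptions:

```python
# Minimal HDFS sketch using the third-party `hdfs` package (WebHDFS client).
# Assumes WebHDFS is enabled and reachable at the placeholder address below.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")  # placeholder host/user

client.makedirs("/data/logs")                          # create a directory in HDFS
client.upload("/data/logs/events.log", "events.log")   # copy a local file into HDFS

with client.read("/data/logs/events.log") as reader:   # stream the file back out
    content = reader.read()
print(content[:200])                                   # first bytes of the file
print(client.list("/data/logs"))                       # list the directory contents
```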
Hadoop Architecture:
NameNode: The NameNode holds the metadata for every file and directory in the HDFS namespace.
DataNode: A DataNode manages the storage attached to an HDFS node and serves read and write requests for the actual data blocks.
Master node: The master node coordinates the parallel processing of data using Hadoop MapReduce.
Slave node: The slave nodes are the additional machines in the Hadoop cluster that store the data and carry out the computations. Each slave node runs a DataNode and a Task Tracker, which synchronize with the NameNode and the Job Tracker respectively.
In Hadoop, the master and slave systems can be set up in the cloud or on premises.
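To see how work is spread across the slave nodes in practice, here is a minimal word-count job written for Hadoop Streaming. It is a rough sketch; the file name and run commands are illustrative:

```python
#!/usr/bin/env python3
"""Minimal word-count job for Hadoop Streaming (illustrative sketch).

Run the same file in two modes:
    python3 wordcount.py map     (used as the -mapper)
    python3 wordcount.py reduce  (used as the -reducer)
"""
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin; Hadoop shuffles/sorts by key.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Keys arrive sorted, so counts for the same word are consecutive.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A job like this would typically be submitted with the Hadoop Streaming jar, e.g. hadoop jar hadoop-streaming-*.jar -files wordcount.py -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" -input /data/logs -output /data/wordcount (the jar name and paths are placeholders).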
Features of Hadoop
Suitable for Big Data Analysis
As Big Data tends to be distributed and unstructured in nature, Hadoop clusters are well suited for analyzing it. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept is called data locality, and it helps increase the efficiency of Hadoop-based applications.
Scalability
Hadoop clusters can easily be scaled to any extent by adding cluster nodes, which allows for the growth of Big Data. Scaling does not require modifications to the application logic.
Fault Tolerance
The Hadoop ecosystem replicates the input data onto other cluster nodes. That way, in the event of a node failure, data processing can still proceed using the data stored on another node.
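As a small illustration of this replication (assuming the hdfs CLI is on the PATH and the example path exists), the sketch below raises a file's replication factor and asks HDFS where each block replica lives:

```python
import subprocess

# Illustrative only: set the replication factor of one file to 3, then ask
# HDFS to report each block and the DataNodes holding its replicas.
path = "/data/logs/events.log"   # placeholder path

subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", path], check=True)

report = subprocess.run(
    ["hdfs", "fsck", path, "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True,
)
print(report.stdout)
```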
Big Data in Social Media:
Statistics show that 500+ terabytes of new data are ingested into the databases of social media sites (Facebook, Google, YouTube, etc.) every day. This data is generated mainly through photo and video uploads, message exchanges, comments, and so on.
Example: A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, data generation reaches many petabytes.
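A rough back-of-the-envelope calculation makes the scale clear; everything except the 10 TB per 30 minutes figure is an assumption chosen for illustration:

```python
# Back-of-the-envelope check of the jet-engine example above
# (all inputs except the 10 TB / 30 min figure are assumptions).
tb_per_half_hour = 10        # ~10 TB per engine per 30 minutes (from the article)
flight_hours = 2             # assumed average flight length
engines_per_plane = 2        # assumed twin-engine aircraft
flights_per_day = 5_000      # "many thousand flights per day"

tb_per_flight = tb_per_half_hour * 2 * flight_hours * engines_per_plane
pb_per_day = tb_per_flight * flights_per_day / 1024
print(f"~{pb_per_day:,.0f} PB of engine data per day")   # -> ~391 PB
```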
Facebook Uses Hadoop to Manage Big Data:
Hadoop is the key tool Facebook uses, not simply for analysis, but as an engine to power many features of the Facebook site, including messaging. That multitude of monster workloads drove the company to launch its Prism project, which supports geographically distributed Hadoop data stores.
Google's Big Data Challenge:
Google, like any other company that generates a huge amount of data, uses the cloud to store it. Given that the number of users is always volatile, the amount of data generated on any given day is also volatile. Therefore Google does not use off-the-shelf storage; purchasing 20 PB of hardware every day is out of the question. Google needs storage that is not only scalable but also durable.
Google's solution to this:
Distributed File System, BigTable, and Object-Based Storage!
Object Based Storage:
In simple words, it means storing data as objects. An object consists of the data itself, metadata (information about the stored data, e.g. size and type), and a global identifier.
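A toy model in Python shows the three pieces of such an object; the field names and values here are purely illustrative:

```python
import uuid
from dataclasses import dataclass, field

# Toy model of an "object" in object-based storage: the payload itself,
# metadata describing it, and a global identifier used to locate it.
@dataclass
class StorageObject:
    data: bytes
    metadata: dict                      # e.g. size, content type, owner
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))

photo = StorageObject(
    data=b"...jpeg bytes...",
    metadata={"size": 2_048_576, "type": "image/jpeg", "owner": "user_42"},
)
print(photo.object_id, photo.metadata["type"])
```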
Distributed File System:
A distributed file system is a way of storing and reading data across different servers, but through the same interface as accessing a local file. Google uses its own distributed file system, known as the Google File System (GFS), to solve its scalability problem by incorporating object-based storage.
GFS consists of three layers:
The client: handles requests for data from applications.
The master: stores the metadata, mainly the names of data files and the locations of their chunks.
The chunk servers: huge amounts of data are broken down into fixed-size chunks (64 MB in GFS) and stored across servers, with replicas kept for backup.
This describes a single cluster with a single master. Google runs a distributed master system that can handle hundreds of masters, each of which can manage about 100 million files: a distributed master system on top of a distributed file system.
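The division of labour between the master (metadata only) and the chunk servers (the bytes, replicated) can be sketched in a few lines of Python. This is a toy model under assumed chunk sizes and server names, not Google's implementation:

```python
# Toy sketch of the GFS idea: the master keeps only metadata
# (file name -> chunk ids -> chunk-server locations), while chunk servers
# hold the bytes, replicated three ways.
CHUNK_SIZE = 64          # bytes here, for the demo; real GFS uses 64 MB chunks
REPLICAS = 3

chunk_servers = {f"cs{i}": {} for i in range(1, 6)}   # five pretend chunk servers
master = {}                                           # file -> [(chunk_id, [servers])]

def write_file(name: str, data: bytes) -> None:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    master[name] = []
    for n, chunk in enumerate(chunks):
        chunk_id = f"{name}#{n}"
        # Place each chunk on REPLICAS different servers (round-robin here).
        servers = [f"cs{(n + r) % 5 + 1}" for r in range(REPLICAS)]
        for s in servers:
            chunk_servers[s][chunk_id] = chunk
        master[name].append((chunk_id, servers))

def read_file(name: str) -> bytes:
    # The client asks the master where the chunks are, then reads the bytes
    # directly from any live replica.
    return b"".join(chunk_servers[servers[0]][cid] for cid, servers in master[name])

write_file("crawl/part-0001", b"x" * 200)   # 200 bytes -> 4 chunks of <= 64 bytes
print(len(master["crawl/part-0001"]), "chunks tracked by the master")
assert read_file("crawl/part-0001") == b"x" * 200
```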
BigTable :
At Google's enormous scale, GFS alone is not enough. The system needs to scale every day, and that is where BigTable comes into play: it solves the problem of managing petabytes of storage that grow daily.
- BigTable stores data in tables.
- A row key is a URL.
- A column can hold the features of the web page.
- A cell contains the data, which is time-stamped.
- Row ranges are broken up into partitions called tablets.
- Tablets are distributed across multiple servers for load balancing.
The concept of tablets is what gives BigTable its enormous power to handle such huge amounts of data.
So BigTable, with a distributed master system controlling an army of distributed file systems, is the secret behind Google's seemingly infinite scalability.
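Here is a toy sketch of the data model described above: each cell is addressed by (row key, column, timestamp), and the sorted row-key space is split into tablets. The column names, split points, and server names are illustrative assumptions, not Google's actual schema:

```python
import time
from bisect import bisect_right
from collections import defaultdict

# Toy model of the BigTable data model: each cell is addressed by
# (row key, column, timestamp).
table = defaultdict(dict)          # row key -> {(column, timestamp): value}

def put(row, column, value, ts=None):
    table[row][(column, ts if ts is not None else time.time())] = value

def latest(row, column):
    # Return the most recent value written to this (row, column).
    cells = [(ts, v) for (col, ts), v in table[row].items() if col == column]
    return max(cells)[1] if cells else None

# Row keys are URLs; columns hold features of the page (contents, anchors, ...).
put("com.example/index.html", "contents:html", "<html>...</html>")
put("com.example/index.html", "anchor:news.example.org", "Example link")
print(latest("com.example/index.html", "contents:html"))

# Tablets: split the sorted row-key space at chosen boundaries and assign
# each range to a (pretend) tablet server for load balancing.
split_points = ["com.example", "org.wikipedia"]   # assumed split points

def tablet_for(row_key):
    return f"tablet-server-{bisect_right(split_points, row_key)}"

print(tablet_for("com.example/index.html"))   # -> tablet-server-1
```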