How can we store huge data that is beyond our capacity (Big Data)?
Vibhor Jain
How It All Started
Before getting into the technicalities in this Hadoop tutorial article, let me begin with an interesting story about how Hadoop came into existence and why it is so popular in the industry today.
So, it all started with two people, Mike Cafarella and Doug Cutting, who were building a search engine system that could index one billion pages. After their research, they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. However, they soon realized that their architecture would not be capable of scaling to the billions of pages on the web.
They came across a paper, published in 2003, that described the architecture of Google's distributed file system, called GFS, which was being used in production at Google. This paper on GFS proved to be exactly what they were looking for: they soon realized that it would solve their problem of storing the very large files generated as part of the web crawl and indexing process.
Later, in 2004, Google published another paper that introduced MapReduce to the world. These two papers laid the foundation of the framework called "Hadoop". Doug Cutting later said of Google's contribution to the development of the Hadoop framework:
“Google is living a few years in the future and sending the rest of us messages.”
So, that is how Hadoop came into existence. Now, before moving on to Hadoop itself, let us start the discussion with Big Data, the problem that led to Hadoop's development.
What is Big Data?
Have you ever wondered how technologies evolve to fulfil emerging needs?
For example:
Earlier we had landline phones, but now we have shifted to smartphones. Similarly, how many of you remember the floppy disks that were used extensively back in the '90s? Floppy disks were replaced by hard disks because they had very low storage capacity and transfer speed.
Thus, floppy disks became insufficient for handling the amount of data we deal with today. In fact, we can now store terabytes of data in the cloud without being bothered about size constraints.
Which type of data can be termed Big Data?
Data that exhibits the characteristic "V's" of Big Data, namely Volume, Velocity, and Variety (and often Veracity and Value as well), is considered Big Data.
Let us understand this with an example:
Big Data & Hadoop – Restaurant Analogy
Let us take an analogy of a restaurant to understand the problems associated with Big Data and how Hadoop solved that problem.
Bob is a businessman who has opened a small restaurant. Initially, his restaurant received two orders per hour, and he had one chef and one food shelf, which were sufficient to handle all the orders.
Fig: Traditional Restaurant Scenario
Now let us compare the restaurant example with the traditional scenario, where data was generated at a steady rate and traditional systems such as an RDBMS were capable of handling it, just like Bob's chef. Here, you can relate data storage to the restaurant's food shelf and the traditional processing unit to the chef, as shown in the figure above.
Fig: Traditional Scenario
After a few months, Bob thought of expanding his business, so he started taking online orders and added a few more cuisines to the restaurant's menu in order to reach a larger audience. Because of this transition, the rate of incoming orders rose to an alarming 10 orders per hour, and it became quite difficult for a single cook to cope with the load. Aware of the delays in processing orders, Bob started thinking about a solution.
Fig: Distributed Processing Scenario
Similarly, in the Big Data scenario, data started being generated at an alarming rate because of various data growth drivers such as social media, smartphones, and so on.
Now, the traditional system, just like the cook in Bob's restaurant, was not efficient enough to handle this sudden change. Thus, there was a need for a different kind of solution strategy to cope with this problem.
After a lot of thought, Bob came up with a solution: he hired four more chefs to tackle the huge rate of incoming orders. Everything went quite well for a while, but this solution led to one more problem. Since all the chefs were sharing the same food shelf, the food shelf itself became the bottleneck of the whole process. Hence, the solution was not as efficient as Bob had thought.
Fig: Distributed Processing Scenario Failure
Similarly, to tackle the problem of processing huge data sets, multiple processing units were installed so as to process the data in parallel (just like Bob hired 4 chefs). But even in this case, bringing multiple processing units was not an effective solution because the centralized storage unit became the bottleneck.
In other words, the performance of the whole system is driven by the performance of the central storage unit. Therefore, the moment our central storage goes down, the whole system gets compromised. Hence, again there was a need to resolve this single point of failure.
Fig: Solution to Restaurant Problem
Bob came up with another, more efficient solution: he divided the chefs into two levels, junior chefs and a head chef, and assigned each junior chef a food shelf. Let us assume that the dish being ordered is meat sauce. According to Bob's plan, one junior chef prepares the meat and another junior chef prepares the sauce. Each then passes his part to the head chef, who combines the meat and the sauce to prepare the meat sauce, which is delivered as the final order.
Fig: Hadoop in Restaurant Analogy
Hadoop functions in a similar fashion to Bob's restaurant. Just as the food shelves are distributed in Bob's restaurant, in Hadoop the data is stored in a distributed fashion, with replication, to provide fault tolerance. For parallel processing, the data is first processed on the slave nodes where it is stored, producing intermediate results, and those intermediate results are then merged to produce the final output, just as the head chef combines what the junior chefs have prepared.
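To make this map-and-merge pattern concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API: the mappers play the role of the junior chefs, each working on its own block of data, while the reduce step plays the role of the head chef, merging the intermediate results. The input and output paths passed on the command line are placeholders for illustration.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "Junior chefs": each mapper works on its own block of data
  // and emits intermediate (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // "Head chef": the reducer merges the intermediate counts
  // for each word into a final total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Placeholder input and output paths supplied on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In a real cluster, a job like this is packaged into a jar and submitted with the hadoop jar command; the framework then takes care of running the mappers near the data and merging their results.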
By now, you should have an idea of why Big Data is a problem statement and how Hadoop solves it. As we just discussed, there were three major challenges with Big Data:
- The first problem is storing the colossal amount of data
Storing huge data in a traditional system is not feasible. The reason is obvious: storage is limited to a single system, while the data keeps growing at a tremendous rate.
- The second problem is storing heterogeneous data
Now we know that storage is a problem, but it is only one part of the problem. The data is not only huge, it also comes in various formats: unstructured, semi-structured, and structured. So, you need to make sure that you have a system that can store the different types of data generated from various sources.
- Finally let’s focus on the third problem, which is the processing speed
The time taken to process this huge amount of data on a single system is very high, because no single machine can read and compute over such a volume quickly enough.
To solve the storage and processing issues, two core components were created in Hadoop: HDFS and YARN. HDFS solves the storage issue, as it stores data in a distributed fashion and is easily scalable. YARN solves the processing issue by scheduling the processing of that data in parallel across the cluster, which reduces the processing time drastically. Moving ahead, let us understand what Hadoop is.
What is Hadoop?
Hadoop is an open-source software framework used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under the Apache v2 license.
Hadoop was developed based on the paper written by Google on the MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is among the top-level Apache projects. Hadoop was developed by Doug Cutting and Michael J. Cafarella.
Hadoop-as-a-Solution
Fig: Hadoop-as-a-Solution
Features of Hadoop
Hadoop Core Components
While setting up a Hadoop cluster, you have the option of choosing many services as part of your Hadoop platform, but two services are always mandatory for setting up Hadoop. One is HDFS (storage) and the other is YARN (processing). HDFS stands for Hadoop Distributed File System, the scalable storage unit of Hadoop, whereas YARN is used to process the data stored in HDFS in a distributed and parallel fashion. A small HDFS client sketch follows the figures below.
Fig: HDFS
Fig: YARN
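To make the HDFS side more concrete, here is a minimal sketch of a client program that copies a local file into HDFS and lists it, using the Hadoop FileSystem Java API. The namenode address and the file paths are assumptions for illustration; in a real cluster the filesystem URI comes from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed namenode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; HDFS splits it into blocks and
    // replicates each block across datanodes for fault tolerance.
    fs.copyFromLocalFile(new Path("/tmp/orders.log"),
                         new Path("/data/orders.log"));

    // List what is now stored under /data (hypothetical directory).
    for (FileStatus status : fs.listStatus(new Path("/data"))) {
      System.out.println(status.getPath() + " " + status.getLen() + " bytes");
    }

    fs.close();
  }
}
```

The same operations are also available from the command line through hdfs dfs -put and hdfs dfs -ls.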
Hadoop Ecosystem
So far, you will have figured out that Hadoop is neither a programming language nor a service; it is a platform or framework that solves Big Data problems. You can consider it a suite that encompasses a number of services for ingesting, storing, and analyzing huge data sets, along with tools for configuration management.
Now, let us see how Last.FM used Hadoop as part of its solution strategy.
Hadoop Tutorial: Last.FM Case Study
Last.FM is an internet radio and community-driven music discovery service founded in 2002. Users transmit information to Last.FM servers indicating which songs they are listening to. The received data is processed and stored so that users can access it in the form of charts. This allows Last.FM to make intelligent decisions about taste and compatibility when generating recommendations. The data is obtained from one of the two sources stated below:
- scrobble: When a user plays a track of his or her own choice and sends the information to Last.FM through a client application.
- radio listen: When the user tunes into a Last.FM radio station and streams a song.
Last.FM applications allow users to love, skip or ban each track they listen to. This track listening data is also transmitted to the server.
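To give a rough idea of what such a listening event might look like in code, here is a hedged sketch of a simple Java class modeling a single track-listening record. The field names, the Source and Action enums, and the tab-separated logline format are illustrative assumptions, not Last.FM's actual log schema.

```java
import java.time.Instant;

public class TrackListen {

  // How the listen reached the servers: a scrobble from the user's own
  // playback, or a stream from a Last.FM radio station.
  public enum Source { SCROBBLE, RADIO_LISTEN }

  // Optional feedback the user attached to the track.
  public enum Action { NONE, LOVE, SKIP, BAN }

  private final String userId;
  private final String trackId;
  private final Instant listenedAt;
  private final Source source;
  private final Action action;

  public TrackListen(String userId, String trackId, Instant listenedAt,
                     Source source, Action action) {
    this.userId = userId;
    this.trackId = trackId;
    this.listenedAt = listenedAt;
    this.source = source;
    this.action = action;
  }

  // One tab-separated logline per listen (hypothetical format).
  public String toLogLine() {
    return String.join("\t", userId, trackId, listenedAt.toString(),
                       source.name(), action.name());
  }
}
```

The usage statistics below give a sense of the volume at which such records arrive.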
- Over 40M unique visitors and 500M page views each month
- Scrobble stats:
  - Up to 800 scrobbles per second
  - More than 40 million scrobbles per day
  - Over 75 billion scrobbles so far
- Radio stats:
  - Over 10 million streaming hours per month
  - Over 400 thousand unique stations per day
- Each scrobble and radio listen generates at least one logline
Hadoop at Last.FM
- 100 Nodes
- 8 cores per node (dual quad-core)
- 24GB memory per node
- 8 TB of storage per node (4 disks of 2 TB each)
- Hive integration to run optimized SQL queries for analysis
Last.FM started using Hadoop in 2006 because its user base had grown from thousands to millions. With the help of Hadoop, they processed hundreds of daily, weekly, and monthly jobs, including website stats and metrics, chart generation (i.e., track statistics), metadata corrections (e.g., misspellings of artist names), indexing for search, combining and formatting data for recommendations, data insights, evaluations, and reporting. This helped Last.FM grow tremendously and figure out the tastes of its users, on the basis of which it started recommending music.
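As an illustration of the Hive integration mentioned above, here is a hedged sketch of the kind of chart-style aggregation such jobs might run, using the standard HiveServer2 JDBC driver. The connection URL, credentials, and the scrobbles table with its track_id column are hypothetical placeholders, not Last.FM's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopTracksQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver (hive-jdbc must be on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "analyst", "");
         Statement stmt = conn.createStatement();
         // Count listens per track and keep the ten most played
         // (hypothetical table and columns).
         ResultSet rs = stmt.executeQuery(
             "SELECT track_id, COUNT(*) AS plays " +
             "FROM scrobbles GROUP BY track_id " +
             "ORDER BY plays DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("track_id") + "\t" + rs.getLong("plays"));
      }
    }
  }
}
```

Behind such a query, Hive compiles the SQL into distributed jobs that run over the data stored in HDFS, which is exactly the storage-plus-parallel-processing pattern discussed throughout this article.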