Big Data Diagnosis: Hadoop & Distributed Storage Clusters
Ajeenkya S.
Jr. Soft Engg @Cognizant, EDI-Maps Developer, 2X OCI, 1xAWS Certified, 1X Aviatrix Certified, AT&T Summer Learning Academy Extern, LW summer Research Intern, ARTH Learner, 1X Gitlab Certified Associate, ARTH 2.0 LW_TV
We are all living in a technology-driven era, so it is worth understanding how tech giants like Google and Facebook currently manage storage at the scale of thousands and thousands of terabytes. According to Forbes, about 2.5 quintillion bytes of data are generated every day, and this number is only projected to keep increasing in the coming years (roughly 90% of the data stored today was produced within the last two years).
Big Data is defined by 3 properties:
- Volume = because of the sheer amount of data, storing it on a single machine is impossible. How can we store and process data across multiple machines while ensuring fault tolerance?
- Variety = How can we deal with data coming from varied sources which have been formatted using different schemas?
- Velocity = How can we quickly store and process new data?
Big data can be processed in two ways:
- Batch processing = usually used when we are concerned with the volume and variety of our data. We first store all the needed data and then process it in one go (which can lead to high latency). A common application example is calculating monthly payroll summaries.
- Stream processing = usually employed when we are interested in fast response times. We process data as soon as it is received (low latency). An application example is determining whether a bank transaction is fraudulent. A minimal sketch contrasting the two approaches follows below.
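To make the distinction concrete, here is a minimal Python sketch contrasting the two models. The transaction records and the simple "large amount" fraud rule are invented purely for illustration; real stream processing would typically sit on top of a system such as Kafka or Spark Streaming.

```python
from typing import Dict, Iterable, Iterator, List

Transaction = Dict[str, float]  # illustrative record, e.g. {"amount": 250.0}

def batch_monthly_total(transactions: List[Transaction]) -> float:
    """Batch processing: all records are collected first, then processed in one go."""
    return sum(t["amount"] for t in transactions)

def stream_fraud_alerts(transactions: Iterable[Transaction],
                        threshold: float = 10_000.0) -> Iterator[str]:
    """Stream processing: each record is handled as soon as it arrives."""
    for t in transactions:
        if t["amount"] > threshold:  # illustrative rule, not a real fraud detector
            yield f"ALERT: suspicious amount {t['amount']}"

if __name__ == "__main__":
    data = [{"amount": 120.0}, {"amount": 15_000.0}, {"amount": 60.0}]
    print("Monthly total (batch):", batch_monthly_total(data))
    for alert in stream_fraud_alerts(iter(data)):  # iterator stands in for a live feed
        print(alert)
```

The only real difference is when the work happens: the batch function needs the complete list up front and reports once, while the streaming generator reacts to each record as it arrives.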
Big Data can be processed using different tools such as MapReduce, Spark, Hadoop, Pig, Hive, Cassandra and Kafka. Each of these tools has its own advantages and disadvantages, which determine how companies might decide to employ them.
Big tech companies such as Google and Facebook also rely on a technique for storing enormous amounts of data, popularly referred to as distributed storage clusters and implemented, most famously, in Hadoop. Hadoop is an open-source framework overseen by the Apache Software Foundation, written in Java, for storing and processing huge datasets on clusters of commodity hardware. There are mainly two problems with big data: the first is storing such a huge amount of data, and the second is processing that stored data. A traditional approach like RDBMS is not sufficient due to the heterogeneity of the data, so Hadoop emerged as the solution for storing and processing big data, with some extra capabilities. The two main components of Hadoop are the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).
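Since YARN is the component that tracks cluster resources and schedules work, a quick way to see it in action is its ResourceManager REST API. The sketch below is a hedged example: the host name is a placeholder, and it assumes the default ResourceManager web port 8088, an unsecured cluster, and that the `requests` library is installed.

```python
# Hedged sketch: querying YARN's ResourceManager REST API from Python.
# Assumptions: reachable ResourceManager at the placeholder host below,
# default web port 8088, no authentication enabled.
import requests

RM = "http://resourcemanager.example.com:8088"

# Cluster-wide metrics: active nodes, running applications, memory, etc.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics", timeout=10).json()
print("Active NodeManagers:", metrics["clusterMetrics"]["activeNodes"])
print("Apps running:", metrics["clusterMetrics"]["appsRunning"])

# Applications currently known to the ResourceManager.
apps = requests.get(f"{RM}/ws/v1/cluster/apps", timeout=10).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])
```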
In 2003, Doug Cutting and Mike Cafarella, then working on the Apache Nutch web crawler, came across a paper published by Google describing the architecture of its distributed file system, GFS (Google File System), for storing large datasets. They realized this paper could solve their problem of storing the very large files being generated by Nutch's web crawling and indexing processes. But it was only half the solution.
In 2004, Google published another paper, on MapReduce, which addressed the processing of those large datasets. This was the other half of the solution for Doug Cutting and Mike Cafarella's Nutch project. Both techniques (GFS and MapReduce) existed only as white papers outside Google; Google did not release an implementation. Doug Cutting knew from his work on Apache Lucene (a free, open-source information retrieval software library, originally written in Java by Doug Cutting in 1999) that open source is a great way to spread a technology to more people. So, together with Mike Cafarella, he started implementing Google's techniques (GFS and MapReduce) as open source in the Apache Nutch project.
In 2007, Yahoo successfully tested Hadoop on a 1,000-node cluster and started using it.
In January 2008, Yahoo released Hadoop as an open-source project to the Apache Software Foundation (ASF). In July 2008, the ASF successfully tested a 4,000-node cluster with Hadoop.
In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than 17 hours, handling billions of searches and indexing millions of web pages. That same year, Doug Cutting left Yahoo and joined Cloudera to take up the challenge of spreading Hadoop to other industries.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.
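As a concrete illustration of HDFS's high-throughput, stream-oriented access, here is a small hedged Python sketch using the `pyarrow` HDFS bindings. The namenode host, port, and paths are placeholders, and the example assumes a reachable cluster plus a client machine with `libhdfs` and the Hadoop CLASSPATH configured.

```python
# Hedged sketch: writing, reading, and listing files on HDFS via pyarrow.
# Assumptions: a reachable namenode (placeholder host below), libhdfs
# available, and HADOOP_HOME / CLASSPATH set up for the pyarrow bindings.
from pyarrow import fs

# "namenode.example.com" and port 8020 are placeholders for a real cluster.
hdfs = fs.HadoopFileSystem("namenode.example.com", 8020)

# Write a small file; HDFS splits large files into blocks and replicates
# them across DataNodes behind the scenes.
with hdfs.open_output_stream("/user/demo/hello.txt") as out:
    out.write(b"hello from a commodity-hardware cluster\n")

# Stream the file back (HDFS favours high-throughput, streaming reads).
with hdfs.open_input_stream("/user/demo/hello.txt") as src:
    print(src.read().decode())

# List the directory to confirm the file exists.
for info in hdfs.get_file_info(fs.FileSelector("/user/demo")):
    print(info.path, info.size)
```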
Hadoop’s architecture has four main elements:
1. Hadoop Common: These are the Java utilities and libraries required by the other Hadoop modules and applications; they also provide the OS- and file-system-level abstractions needed to start Hadoop.
2. Hadoop YARN: This application within Hadoop supports cluster management and job scheduling.
3. Hadoop Distributed File System (HDFS): This is a distributed file system that ensures high-throughput access and processing of data.
4. Hadoop MapReduce: This framework enables the parallel processing of jobs and tasks on big data (a hedged word-count sketch follows this list).
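To ground the MapReduce idea, here is a minimal word-count sketch written for Hadoop Streaming, which lets the map and reduce phases be plain scripts that read stdin and write stdout. The script name, HDFS paths, and the streaming-jar location in the trailing comment are placeholders that depend on your installation.

```python
#!/usr/bin/env python3
# Minimal word-count mapper/reducer for Hadoop Streaming (illustrative sketch).
# Run with "map" to act as the mapper and with "reduce" to act as the reducer;
# Hadoop Streaming sorts mapper output by key before the reduce phase.
import sys

def mapper():
    # Emit "word<TAB>1" for every word in the input split.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by word, so counts for one word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:2] == ["map"] else reducer()

# Illustrative cluster invocation (jar path and HDFS paths are placeholders):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -input /user/demo/books -output /user/demo/wordcount \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -file wordcount.py
```

Locally, the same pipeline can be simulated with `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`, which mirrors how Hadoop sorts and shuffles mapper output by key before handing it to the reducers.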
Major Advantages of Hadoop
1. Scalable
Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers operating in parallel, unlike traditional relational database systems (RDBMS), which cannot scale to process such large amounts of data.
2. Cost-effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The problem with traditional relational database management systems is that it is extremely cost-prohibitive to scale them to process such massive volumes of data. To reduce costs, many companies in the past would down-sample data and classify it based on assumptions about which data was the most valuable; the raw data would be deleted, as it would be too expensive to keep.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media and email conversations.
4. Fast
Hadoop’s unique storage method is based on a distributed file system that essentially ‘maps’ data wherever it is located on a cluster. The data-processing tools often run on the same servers where the data resides, resulting in much faster processing. If you’re dealing with large volumes of unstructured data, Hadoop can efficiently process terabytes of data in just minutes, and petabytes in hours.
#bigdata #hadoop #bigdatamanagement #arthbylw #vimaldaga #righteducation #educationredefine #rightmentor #worldrecordholder #ARTH #linuxworld #makingindiafutureready