Big Data Diagnosis: Hadoop & Distributed Storage Clusters
Ajeenkya S.
Jr. Soft Engg @Cognizant, EDI-Maps Developer, 2X OCI, 1xAWS Certified, 1X Aviatrix Certified, AT&T Summer Learning Academy Extern, LW summer Research Intern, ARTH Learner, 1X Gitlab Certified Associate, ARTH 2.0 LW_TV
We are all living in a technology-driven era, so it is worth understanding how tech giants like Google and Facebook currently manage storage at the scale of thousands and thousands of terabytes. According to Forbes, about 2.5 quintillion bytes of data are generated every day, and this number is only projected to keep increasing in the coming years (roughly 90% of the data stored today was produced within the last two years).
Big Data is defined by 3 properties:
- Volume = because of the sheer amount of data, storing it on a single machine is impossible. How can we store and process data across multiple machines while ensuring fault tolerance?
- Variety = How can we deal with data coming from varied sources which have been formatted using different schemas?
- Velocity = How can we quickly store and process new data?
Big data can be processed in two ways:
- Batch processing = usually used when we are concerned with the volume and variety of our data. We first store all the needed data and then process it in one go (which can lead to high latency). A common application example is calculating monthly payroll summaries.
- Stream processing = usually employed when we are interested in fast response times. We process data as soon as it is received (low latency). An application example is determining whether a bank transaction is fraudulent. A minimal sketch contrasting the two approaches follows below.
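To make the distinction concrete, here is a minimal Python sketch contrasting the two models. The transaction records and the simple "large amount" fraud rule are invented purely for illustration; real stream processing would typically sit on top of a system such as Kafka or Spark Streaming.

```python
from typing import Dict, Iterable, Iterator, List

Transaction = Dict[str, float]  # illustrative record, e.g. {"amount": 250.0}

def batch_monthly_total(transactions: List[Transaction]) -> float:
    """Batch processing: all records are collected first, then processed in one go."""
    return sum(t["amount"] for t in transactions)

def stream_fraud_alerts(transactions: Iterable[Transaction],
                        threshold: float = 10_000.0) -> Iterator[str]:
    """Stream processing: each record is handled as soon as it arrives."""
    for t in transactions:
        if t["amount"] > threshold:  # illustrative rule, not a real fraud detector
            yield f"ALERT: suspicious amount {t['amount']}"

if __name__ == "__main__":
    data = [{"amount": 120.0}, {"amount": 15_000.0}, {"amount": 60.0}]
    print("Monthly total (batch):", batch_monthly_total(data))
    for alert in stream_fraud_alerts(iter(data)):  # iterator stands in for a live feed
        print(alert)
```

The only real difference is when the work happens: the batch function needs the complete list up front and reports once, while the streaming generator reacts to each record as it arrives.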
Big Data can be processed using different tools such as MapReduce, Spark, Hadoop, Pig, Hive, Cassandra and Kafka. Each of these tools has its own advantages and disadvantages, which determine how companies might decide to employ them.
Big tech companies such as Google and Facebook also rely on a technique for storing enormous amounts of data, popularly referred to as distributed storage clusters and implemented, most famously, in Hadoop. Hadoop is an open-source framework overseen by the Apache Software Foundation, written in Java, for storing and processing huge datasets on clusters of commodity hardware. There are mainly two problems with big data: the first is storing such a huge amount of data, and the second is processing that stored data. A traditional approach like RDBMS is not sufficient due to the heterogeneity of the data, so Hadoop emerged as the solution for storing and processing big data, with some extra capabilities. The two main components of Hadoop are the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).
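Since YARN is the component that tracks cluster resources and schedules work, a quick way to see it in action is its ResourceManager REST API. The sketch below is a hedged example: the host name is a placeholder, and it assumes the default ResourceManager web port 8088, an unsecured cluster, and that the `requests` library is installed.

```python
# Hedged sketch: querying YARN's ResourceManager REST API from Python.
# Assumptions: reachable ResourceManager at the placeholder host below,
# default web port 8088, no authentication enabled.
import requests

RM = "http://resourcemanager.example.com:8088"

# Cluster-wide metrics: active nodes, running applications, memory, etc.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics", timeout=10).json()
print("Active NodeManagers:", metrics["clusterMetrics"]["activeNodes"])
print("Apps running:", metrics["clusterMetrics"]["appsRunning"])

# Applications currently known to the ResourceManager.
apps = requests.get(f"{RM}/ws/v1/cluster/apps", timeout=10).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"])
```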
In 2003, Doug Cutting and Mike Cafarella, then working on the Apache Nutch web crawler, came across a paper published by Google describing the architecture of its distributed file system, GFS (Google File System), for storing large datasets. They realized this paper could solve their problem of storing the very large files being generated by Nutch's web crawling and indexing processes. But it was only half the solution.
In 2004, Google published another paper, on MapReduce, which addressed the processing of those large datasets. This was the other half of the solution for Doug Cutting and Mike Cafarella's Nutch project. Both techniques (GFS and MapReduce) existed only as white papers outside Google; Google did not release an implementation. Doug Cutting knew from his work on Apache Lucene (a free, open-source information retrieval software library, originally written in Java by Doug Cutting in 1999) that open source is a great way to spread a technology to more people. So, together with Mike Cafarella, he started implementing Google's techniques (GFS and MapReduce) as open source in the Apache Nutch project.
In 2007, Yahoo successfully tested Hadoop on a 1,000-node cluster and started using it.
In January 2008, Yahoo released Hadoop as an open-source project to the Apache Software Foundation (ASF). In July 2008, the ASF successfully tested a 4,000-node cluster with Hadoop.
In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than 17 hours, handling billions of searches and indexing millions of web pages. That same year, Doug Cutting left Yahoo and joined Cloudera to take up the challenge of spreading Hadoop to other industries.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.
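As a concrete illustration of HDFS's high-throughput, stream-oriented access, here is a small hedged Python sketch using the `pyarrow` HDFS bindings. The namenode host, port, and paths are placeholders, and the example assumes a reachable cluster plus a client machine with `libhdfs` and the Hadoop CLASSPATH configured.

```python
# Hedged sketch: writing, reading, and listing files on HDFS via pyarrow.
# Assumptions: a reachable namenode (placeholder host below), libhdfs
# available, and HADOOP_HOME / CLASSPATH set up for the pyarrow bindings.
from pyarrow import fs

# "namenode.example.com" and port 8020 are placeholders for a real cluster.
hdfs = fs.HadoopFileSystem("namenode.example.com", 8020)

# Write a small file; HDFS splits large files into blocks and replicates
# them across DataNodes behind the scenes.
with hdfs.open_output_stream("/user/demo/hello.txt") as out:
    out.write(b"hello from a commodity-hardware cluster\n")

# Stream the file back (HDFS favours high-throughput, streaming reads).
with hdfs.open_input_stream("/user/demo/hello.txt") as src:
    print(src.read().decode())

# List the directory to confirm the file exists.
for info in hdfs.get_file_info(fs.FileSelector("/user/demo")):
    print(info.path, info.size)
```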
Hadoop’s architecture has four main elements:
1. Hadoop Common: These are the Java utilities and libraries required by the other Hadoop modules and applications; they also provide the OS- and file-system-level abstractions needed to start Hadoop.
2. Hadoop YARN: This application within Hadoop supports cluster management and job scheduling.
3. Hadoop Distributed File System (HDFS): This is a distributed file system that ensures high-throughput access and processing of data.
4. Hadoop MapReduce: This framework enables the parallel processing of jobs and tasks on big data (a hedged word-count sketch follows this list).
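To ground the MapReduce idea, here is a minimal word-count sketch written for Hadoop Streaming, which lets the map and reduce phases be plain scripts that read stdin and write stdout. The script name, HDFS paths, and the streaming-jar location in the trailing comment are placeholders that depend on your installation.

```python
#!/usr/bin/env python3
# Minimal word-count mapper/reducer for Hadoop Streaming (illustrative sketch).
# Run with "map" to act as the mapper and with "reduce" to act as the reducer;
# Hadoop Streaming sorts mapper output by key before the reduce phase.
import sys

def mapper():
    # Emit "word<TAB>1" for every word in the input split.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by word, so counts for one word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:2] == ["map"] else reducer()

# Illustrative cluster invocation (jar path and HDFS paths are placeholders):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -input /user/demo/books -output /user/demo/wordcount \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -file wordcount.py
```

Locally, the same pipeline can be simulated with `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`, which mirrors how Hadoop sorts and shuffles mapper output by key before handing it to the reducers.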
Major Advantages of Hadoop
1. Scalable
Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers operating in parallel, unlike traditional relational database systems (RDBMS), which cannot scale to process such large amounts of data.
2. Cost-effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The problem with traditional relational database management systems is that it is extremely cost-prohibitive to scale them to process such massive volumes of data. To reduce costs, many companies in the past would down-sample data and classify it based on assumptions about which data was the most valuable; the raw data would be deleted, as it would be too expensive to keep.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media and email conversations.
4. Fast
Hadoop’s unique storage method is based on a distributed file system that essentially ‘maps’ data wherever it is located on a cluster. The data-processing tools often run on the same servers where the data resides, resulting in much faster processing. If you’re dealing with large volumes of unstructured data, Hadoop can efficiently process terabytes of data in just minutes, and petabytes in hours.
#bigdata #hadoop #bigdatamanagement #arthbylw #vimaldaga #righteducation #educationredefine #rightmentor #worldrecordholder #ARTH #linuxworld #makingindiafutureready