Big Data Technologies Resume
Anas DADI
Senior DevOps Engineer | Cloud Engineer | SRE Engineer | Freelance
As a Big Data and Cloud Computing engineering student at ENSET Mohammedia, I attended several remarkable Big Data courses and earned several certificates in the field. In this article, I will share with everyone interested in Big Data a short summary of Big Data and the Hadoop ecosystem technologies.
Big Data Industry and Technological Trends
Big data refers to collections of data that take more than a tolerable amount of time to capture, manage, and process using common software tools. Analyzing this data can reveal insights that lead to better decisions and strategic business moves.
It is used in various fields, such as politics, finance, education, travel and tourism, government and public safety, and sports.
The big data challenges faced by enterprises include unclear data requirements, serious data silo problems, low data availability, a lack of management technology, and data security concerns.
HDFS - Hadoop Distributed File System
HDFS, developed based on the Google File System, is the distributed file system of the Hadoop framework; it manages files stored across multiple independent physical machines.
HDFS keeps its metadata in memory and relies on metadata persistence to back it up. Different data storage strategies can be chosen, including tiered storage and label storage.
The high reliability of HDFS is mainly reflected in its use of ZooKeeper to implement active and standby NameNodes, which solves the NameNode single point of failure problem.
The purpose of Colocation is to store associated data, or data that is likely to be associated, on the same node.
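To make this concrete, here is a minimal sketch of reading and writing HDFS from Python through the third-party `hdfs` (WebHDFS) client; the NameNode URL, user name, and paths are assumptions chosen for illustration.

```python
# Minimal HDFS interaction over WebHDFS using the third-party `hdfs` package.
# The NameNode URL, user, and paths below are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient('http://namenode:9870', user='hadoop')

# Write a small text file, list the directory, then read the file back.
client.write('/user/hadoop/notes/hello.txt', data=b'hello hdfs', overwrite=True)
print(client.list('/user/hadoop/notes'))

with client.read('/user/hadoop/notes/hello.txt') as reader:
    print(reader.read())
```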
MapReduce - Distributed Off-line Batch Processing
MapReduce is based on the MapReduce paper published by Google and is used for parallel computing of large data sets (larger than 1TB).
The MapReduce process includes two phases, Map and Reduce.
- Map() performs sorting and filtering of the data and thereby organizes it into groups. Map generates key-value pairs that are later processed by the Reduce() method.
- Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples (see the word-count sketch below).
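To make the two phases concrete, here is a minimal word-count sketch written as Hadoop Streaming scripts in Python; the file names mapper.py and reducer.py are my own, and Hadoop Streaming feeds each script lines on standard input and expects tab-separated key-value pairs on standard output.

```python
# mapper.py -- the Map phase: emit a (word, 1) pair for every word.
# Hadoop Streaming sorts these pairs by key before the Reduce phase.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- the Reduce phase: input arrives sorted by key, so counts
# for the same word are adjacent and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The same pair of scripts can be tried locally with a shell pipeline (`cat input.txt | python mapper.py | sort | python reducer.py`) before submitting them to the cluster.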
Yarn - Yet Another Resource Negotiator
As the name implies, YARN (Yet Another Resource Negotiator) helps manage resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system. It consists of three major components:
- Resource Manager
- Node Manager
- Application Master
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers handle the allocation of resources such as CPU, memory, and bandwidth on each machine and report back to the Resource Manager. The Application Master works as an interface between the Resource Manager and the Node Managers and negotiates as per the requirements of the two.
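As a small illustration, the Resource Manager exposes this information over a REST API; the sketch below asks it for the running applications, assuming a hypothetical host name and the default web port 8088.

```python
# Query the YARN ResourceManager REST API for running applications.
# The host name is hypothetical; 8088 is the default ResourceManager web port.
import requests

url = "http://resourcemanager:8088/ws/v1/cluster/apps"
resp = requests.get(url, params={"states": "RUNNING"}, timeout=10)
resp.raise_for_status()

# The response nests the application list under apps -> app.
for app in (resp.json().get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"],
          f'{app["allocatedMB"]} MB', f'{app["allocatedVCores"]} vcores')
```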
Spark2x - In-Memory Distributed Computing Engine
Apache Spark is a fast, general-purpose, and extensible big data computing engine that combines batch processing, real-time stream processing, interactive query, graph computing, and machine learning. Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data and batch processing, hence both are used interchangeably in most companies.
Spark SQL is a module of Spark, mainly used for processing structured data. DataFrame is the core programming abstraction.
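Below is a minimal PySpark sketch of the DataFrame abstraction; the sample rows and column names are made up for illustration.

```python
# Spark SQL / DataFrame basics in PySpark: the same data can be queried
# through the DataFrame API or through plain SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

df = spark.createDataFrame(
    [("Anas", "Casablanca", 3), ("Sara", "Rabat", 5)],
    ["name", "city", "projects"],
)

# DataFrame API: filter and project columns.
df.filter(df.projects > 2).select("name", "city").show()

# The same data queried with SQL through a temporary view.
df.createOrReplaceTempView("engineers")
spark.sql("SELECT city, COUNT(*) AS n FROM engineers GROUP BY city").show()

spark.stop()
```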
Spark Streaming is an extension of Spark's core API that enables high-throughput, fault-tolerant, real-time streaming data processing.
Structured Streaming is a stream processing engine built on Spark SQL that handles streaming data the same way it handles static data, treating the stream as a continuously growing table.
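The classic word count shows this "stream as a table" idea; the socket source on localhost:9999 below is an assumption for illustration (it can be fed with `nc -lk 9999`).

```python
# Structured Streaming word count: DataFrame operations applied to an
# unbounded stream read from a local socket (illustrative source).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and maintain a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```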
HBase - Distributed NoSQL Database
HBase uses the MemStore and StoreFiles to store and update table data. It is a highly reliable, high-performance, column-oriented, scalable distributed storage system. The secondary index feature implements indexing by the values of selected columns. An HBase cluster has two roles: HMaster and HRegionServer.
It is a NoSQL database that supports all kinds of data and is thus capable of handling anything within a Hadoop ecosystem. It provides the capabilities of Google's Bigtable and is therefore able to work on big data sets effectively.
When we need to search for or retrieve a small number of occurrences in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a fault-tolerant way of storing and looking up sparse data.
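Here is a small sketch of that usage pattern with the third-party `happybase` Thrift client; the host, table, and column family names are assumptions, and the table is expected to already exist with a 'cf' column family.

```python
# Basic HBase reads and writes through the third-party `happybase` client.
# Host, table, and column family names are illustrative assumptions.
import happybase

connection = happybase.Connection('hbase-thrift-host', port=9090)
table = connection.table('user_profiles')

# HBase stores raw bytes: every row key, column, and value is bytes.
table.put(b'user:1001', {b'cf:name': b'Anas', b'cf:city': b'Mohammedia'})

row = table.row(b'user:1001')
print(row[b'cf:name'], row[b'cf:city'])

# Scans retrieve rows by key range, which is how HBase answers
# "find the few matching rows in a huge table" quickly.
for key, data in table.scan(row_prefix=b'user:'):
    print(key, data)

connection.close()
```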
Hive - Distributed Data Warehouse
Hive is a data warehouse software based on Hadoop that can query and manage distributed data at the PB level. Its basic principle is to automatically convert HQL statements into MapReduce tasks. The syntax used by Hive is HiveQL (HQL), a SQL-like query language.
A Hive table can be divided into partitions according to the value of a certain field, and rows can further be placed into different buckets according to a bucketing column.
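A short PyHive sketch of a partitioned and bucketed table; the HiveServer2 host and the table and column names are illustrative assumptions.

```python
# Create and query a partitioned, bucketed Hive table through PyHive.
# Host, table, and column names are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host='hiveserver2-host', port=10000, username='hadoop')
cursor = conn.cursor()

# Partition by event date, and bucket rows by user_id within each partition.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id BIGINT,
        action  STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (user_id) INTO 8 BUCKETS
    STORED AS ORC
""")

# Queries that filter on the partition column only scan matching partitions.
cursor.execute(
    "SELECT action, COUNT(*) FROM events "
    "WHERE event_date = '2021-01-01' GROUP BY action"
)
print(cursor.fetchall())
```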
Hive's enhanced features include Colocation (identical data distribution), column encryption, and configurable row delimiters.
Apache Pig
Pig was originally developed by Yahoo. It works with Pig Latin, a query-based language similar to SQL. It is a platform for structuring data flows and for processing and analyzing huge data sets.
Pig executes the commands, and in the background all the MapReduce activities are taken care of. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the JVM. It provides ease of programming and optimization and is hence a major segment of the Hadoop ecosystem.
Mahout - Free Implementations of Distributed/Scalable Machine Learning Algorithms
Mahout brings machine learning capability to a system or application. Machine learning, as the name suggests, helps a system develop itself based on patterns, user/environment interaction, or on the basis of algorithms.
It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are all machine learning concepts. It allows algorithms to be invoked as needed with the help of its own libraries.
Streaming - Distributed Stream Computing Engine
Streaming is based on open-source Storm and is a distributed, real-time computing framework. It applies to scenarios with strict response-time requirements, usually on the order of milliseconds. Streaming also ensures message reliability through an Ack mechanism.
StreamCQL is a continuous query language (CQL) built on a distributed stream processing platform.
Flink - Stream Processing and Batch Processing Platform
Flink is a unified computing framework that combines batch processing and stream processing. It is an event-driven real-time streaming system.
A Flink program consists of streams (the data) and transformations (the operators).
The Checkpoint mechanism is an important means of fault tolerance during Flink operation.
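A minimal PyFlink DataStream sketch with checkpointing switched on; the input collection and the 10-second checkpoint interval are assumptions for illustration.

```python
# A small PyFlink DataStream job with checkpointing enabled.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Take a consistent snapshot of operator state every 10 seconds so the job
# can recover from failures (Flink's checkpoint mechanism).
env.enable_checkpointing(10_000)

# A bounded collection stands in for a real stream source here.
stream = env.from_collection(["flink", "spark", "flink"])

# Transformation: map each word to a (word, 1) pair and print the result.
stream.map(lambda word: (word, 1)).print()

env.execute("checkpointed-wordcount")
```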
Flume - Massive Logs Aggregation
Flume is a streaming log collection tool. It provides the ability to simply process data and write to various data recipients.
Flume can run in single-node or cascading mode, and encryption can be enabled between cascaded nodes.
Flume supports simple data cleaning by adding an interceptor where the Source and the Channel are connected.
Modify the related configuration of the Source, Channel, and Sink and generate the properties.properties configuration file to complete the configuration of the Flume collection process.
To collect data for Kafka, you need to specify the Topic.
Kafka - Distributed Message Subscription
Kafka is a high-throughput, distributed, publish-subscribe messaging system.
A typical Kafka cluster contains several Producers, several Brokers, several Consumers, and a Zookeeper cluster.
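A minimal producer/consumer sketch using the third-party `kafka-python` package; the broker address and the topic name are assumptions for illustration.

```python
# Publish and consume messages on a Kafka topic with kafka-python.
# Broker address and topic name are illustrative assumptions.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('logs', key=b'host-1', value=b'service started')
producer.flush()

# The consumer subscribes to the same topic and reads from the beginning,
# stopping after 5 seconds without new messages.
consumer = KafkaConsumer('logs',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```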
Zookeeper - Cluster Distributed Coordination Service
The ZooKeeper distributed service framework is mainly used to solve some data management problems that are often encountered in distributed applications and provide distributed high-availability coordination service capabilities.
The ZooKeeper cluster consists of a group of Server nodes. There is only one leader node in this group of Server nodes, and the other nodes are Followers.
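A small coordination sketch with the third-party `kazoo` client; the ensemble addresses and znode paths are assumptions for illustration.

```python
# Store and read shared coordination state in ZooKeeper with kazoo.
# Ensemble addresses and znode paths are illustrative assumptions.
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

# Znodes hold the small, replicated pieces of state that distributed
# applications share (configuration, locks, leader election, ...).
zk.ensure_path('/app/config')
if not zk.exists('/app/config/mode'):
    zk.create('/app/config/mode', b'batch')

value, stat = zk.get('/app/config/mode')
print(value, stat.version)

zk.stop()
```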
Other Components
Apart from all of these, there are some other components too that carry out a huge task to make Hadoop capable of processing large datasets. They are as follows:
- Solr, Lucene: These two services perform the task of searching and indexing with the help of Java libraries. Lucene is a Java library that also provides a spell-check mechanism, and Solr is a search platform driven by Lucene.
- Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered when some data or an external stimulus is given to them.
I hope these notes are useful to you.