Sqoop Architecture in Depth

Apache Sqoop is a data ingestion tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases, and vice versa.

As part of this blog, I will explain how the architecture works when a Sqoop command is executed: the jar generation via Codegen, the execution of the MapReduce job, and the various stages involved in running a Sqoop import/export command.

1. Codegen:

Understanding Codegen is essential, as it internally converts our Sqoop job into a jar consisting of several Java classes, such as a POJO, an ORM class, and a class that implements DBWritable and extends SqoopRecord, used to read and write data between relational databases and Hadoop.

You can run Codegen explicitly, as shown below, to check the classes present in the jar.
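
For example, a codegen command looks roughly like the one below. The connection string, credentials, and table name are placeholder values I am assuming for illustration, not necessarily the ones used in the original post.

    sqoop codegen \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username retail_user \
      --password '<password>' \
      --table products

You can additionally pass --outdir and --bindir to control where the generated .java source and the compiled .class/.jar files are written.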

The output is written to your local file system: you get the jar file, the generated .java file, and the .class files it was compiled into.

Let us see a snippet of the code that will be generated.

ORM class for table ‘products’ // the object-relational model generated for mapping

Setter and getter methods to access the column values

Internally, the generated class uses a JDBC PreparedStatement to write data to the database and a ResultSet to read data from the database.
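
As a rough, simplified sketch (the real generated class is much longer, extends SqoopRecord, handles nulls and delimiters, and is named after the table), the generated code for a hypothetical products table with two columns looks something like this:

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    // Illustrative sketch only; not the exact Sqoop-generated source.
    public class products implements DBWritable {
      private Integer product_id;
      private String product_name;

      // Getter and setter methods for each column
      public Integer get_product_id() { return product_id; }
      public void set_product_id(Integer value) { this.product_id = value; }
      public String get_product_name() { return product_name; }
      public void set_product_name(String value) { this.product_name = value; }

      // Used during import: populate the record from a JDBC ResultSet row
      public void readFields(ResultSet rs) throws SQLException {
        this.product_id = rs.getInt(1);
        this.product_name = rs.getString(2);
      }

      // Used during export: bind the record's fields to a JDBC PreparedStatement
      public void write(PreparedStatement stmt) throws SQLException {
        stmt.setInt(1, this.product_id);
        stmt.setString(2, this.product_name);
      }
    }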

2. Sqoop Import:

It is used to import data from traditional relational databases into Hadoop.

Let’s look at a sample import command.
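
A typical import command looks roughly like the one below; the connection details, table name, and target directory are placeholder values assumed for illustration.

    sqoop import \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username retail_user \
      --password '<password>' \
      --table products \
      --target-dir /user/hadoop/products \
      --delete-target-dir \
      --num-mappers 4

The --delete-target-dir and --num-mappers options shown here correspond to the target-directory deletion and the mapper parallelism discussed in the steps below.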

The following steps take place internally during the execution of Sqoop.

Step-1: Read data from MySQL in a streaming fashion and perform various operations before writing the data into HDFS.

As part of this, Sqoop first generates code, i.e., typical MapReduce code, which is nothing but Java code, and uses this generated code to perform the import:

  • Generate the code (Hadoop MR).
  • Compile the code and generate the jar file.
  • Submit the jar file and perform the import operations.

During the import, Sqoop has to decide how to divide the data across multiple parallel tasks so that the import can scale.

Step-2: Understand the structure of the data and perform CodeGen

Sqoop issues a query against the source table to fetch one record along with the column names; using this information, it extracts the metadata of the columns, their data types, etc.
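
For reference, the metadata query Sqoop issues is typically of the form SELECT t.* FROM products AS t LIMIT 1 or SELECT t.* FROM products AS t WHERE 1=0, depending on the connector and Sqoop version; either way, the result set's metadata exposes the column names and data types.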

Step-3: Create the Java file, compile it, and generate a jar file

As part of code generation, Sqoop needs to understand the structure of the data and apply the generated class to the incoming records internally, to make sure the data is correctly copied to the target. Each table has its own Java file describing the structure of its data.

This jar file is shipped along with the Sqoop MapReduce job so that the structure can be applied to the incoming data.

Step-4: Delete the target directory if it already exists.
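
This step corresponds to the --delete-target-dir option, which I am assuming was part of the import command; without it, Sqoop fails the job if the target directory already exists.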

Step-5: Import the data

Here Sqoop connects to the ResourceManager, gets the resources allocated, and starts the ApplicationMaster.

To distribute the data equally among the map tasks, Sqoop internally executes a boundary query, based on the primary key by default, to find the minimum and maximum values of the split column in the table. The range between these two values is then divided by the number of mappers, and each mapper is assigned one of the resulting splits.
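
As a hypothetical worked example (the numbers are mine, not from the original post): if the table is split on product_id, the boundary query returns MIN(product_id) = 1 and MAX(product_id) = 1000, and 4 mappers are used, each mapper receives a range of roughly (1000 - 1) / 4 ≈ 250 values:

    Mapper 1: product_id from 1   up to about 250
    Mapper 2: product_id from 251 up to about 500
    Mapper 3: product_id from 501 up to about 750
    Mapper 4: product_id from 751 up to 1000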

It uses 4 mappers by default.

It executes these map tasks in containers on different worker nodes.

The default number of mappers can be changed by setting the following parameter.
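
The parameter is --num-mappers (shorthand -m). For example, the snippet below (username/password omitted for brevity) asks for 8 parallel mappers and explicitly names the split column:

    sqoop import \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --table products \
      --split-by product_id \
      --num-mappers 8

If the table has no primary key, you must either supply --split-by explicitly or fall back to a single mapper with --num-mappers 1; the default MIN/MAX boundary query can also be overridden with --boundary-query.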

So in our case, it uses 4 parallel tasks, and each task processes a mutually exclusive subset of the data, i.e., no two tasks process the same records.

Each mapper is assigned a different range of split values. On each of the worker nodes, the map task reads its assigned split from the source database and writes the records to the target directory in HDFS.

If you perform a Sqoop Hive import, one extra step takes place as part of the execution.

Step-6: Copy the data into the Hive table
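
A Hive import is the same import command with Hive options added, roughly as below; the table name is a placeholder.

    sqoop import \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username retail_user \
      --password '<password>' \
      --table products \
      --hive-import \
      --hive-table products

Under the hood, the data is first imported into a staging directory on HDFS and then loaded into the Hive table, which is the extra step mentioned above.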

3. Sqoop Export:

It is used to export data from Hadoop into traditional relational databases.

Let’s look at a sample export command.
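
A sample export command looks roughly like the one below; the connection details, target table, and HDFS directory are placeholder values assumed for illustration.

    sqoop export \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username retail_user \
      --password '<password>' \
      --table products_export \
      --export-dir /user/hadoop/products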

On executing the above command, execution steps (1–4) similar to those of the Sqoop import take place, but the source data is read from the file system, which in this case is HDFS. Here the data is divided into splits based on the HDFS block size, and Sqoop takes care of this internally.

The input files are split and the splits are processed by the mappers in parallel.

After connecting to the respective database to which the records are to be exported, each map task reads its data from HDFS and issues JDBC INSERT statements to store it in the database.

Now that we have seen how Sqoop works internally, you can trace the flow of execution from jar generation to the execution of the MapReduce job whenever a Sqoop job is submitted.

Note: The code related to this post is available in my Git account: https://github.com/Jayvardhan-Reddy/BigData-Ecosystem-Architecture/tree/master/Sqoop-Architecture

Similarly, you can also read about the other articles in this BigData Architecture series, such as HDFS Architecture in Depth and Hive Architecture in Depth.

If you enjoyed reading it, you can click like, share it, and let others know about it. If you would like me to add anything else, please feel free to leave a response.
