Sqoop Architecture in Depth

Apache Sqoop is a data ingestion tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases, and vice versa.

As part of this blog, I will explain how the architecture works when a Sqoop command is executed: the jar generation via Codegen, the execution of the MapReduce job, and the various stages involved in running a Sqoop import/export command.

1. Codegen:

Understanding Codegen is essential, as it internally converts our Sqoop job into a jar consisting of several Java classes, such as a POJO, an ORM class, and a class that implements DBWritable and extends SqoopRecord, used to read and write data between relational databases and Hadoop.

You can run Codegen explicitly, as shown below, to check the classes present in the jar.
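
For example, a codegen command looks roughly like the one below. The connection string, credentials, and table name are placeholder values I am assuming for illustration, not necessarily the ones used in the original post.

    sqoop codegen \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username retail_user \
      --password '<password>' \
      --table products

You can additionally pass --outdir and --bindir to control where the generated .java source and the compiled .class/.jar files are written.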

The output is written to your local file system: you get the jar file, the generated .java file, and the .class files it was compiled into.

Let us see a snippet of the code that will be generated.

ORM class for table ‘products’ // the object-relational model generated for mapping

Setter and getter methods to access the column values

Internally, the generated class uses a JDBC PreparedStatement to write data to the database and a ResultSet to read data from the database.
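
As a rough, simplified sketch (the real generated class is much longer, extends SqoopRecord, handles nulls and delimiters, and is named after the table), the generated code for a hypothetical products table with two columns looks something like this:

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    // Illustrative sketch only; not the exact Sqoop-generated source.
    public class products implements DBWritable {
      private Integer product_id;
      private String product_name;

      // Getter and setter methods for each column
      public Integer get_product_id() { return product_id; }
      public void set_product_id(Integer value) { this.product_id = value; }
      public String get_product_name() { return product_name; }
      public void set_product_name(String value) { this.product_name = value; }

      // Used during import: populate the record from a JDBC ResultSet row
      public void readFields(ResultSet rs) throws SQLException {
        this.product_id = rs.getInt(1);
        this.product_name = rs.getString(2);
      }

      // Used during export: bind the record's fields to a JDBC PreparedStatement
      public void write(PreparedStatement stmt) throws SQLException {
        stmt.setInt(1, this.product_id);
        stmt.setString(2, this.product_name);
      }
    }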

2. Sqoop Import:

It is used to import data from traditional relational databases into Hadoop.

Let’s look at a sample import command.
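
A typical import command looks roughly like the one below; the connection details, table name, and target directory are placeholder values assumed for illustration.

    sqoop import \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username retail_user \
      --password '<password>' \
      --table products \
      --target-dir /user/hadoop/products \
      --delete-target-dir \
      --num-mappers 4

The --delete-target-dir and --num-mappers options shown here correspond to the target-directory deletion and the mapper parallelism discussed in the steps below.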

The following steps take place internally during the execution of Sqoop.

Step-1: Read data from MySQL in a streaming fashion and perform various operations before writing the data into HDFS.

As part of this, Sqoop first generates code, i.e., typical MapReduce code, which is nothing but Java code, and uses this generated code to perform the import:

  • Generate the code (Hadoop MR).
  • Compile the code and generate the jar file.
  • Submit the jar file and perform the import operations.

During the import, Sqoop has to decide how to divide the data across multiple parallel tasks so that the import can scale.

Step-2: Understand the structure of the data and perform CodeGen

Sqoop issues a query against the source table to fetch one record along with the column names; using this information, it extracts the metadata of the columns, their data types, etc.
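
For reference, the metadata query Sqoop issues is typically of the form SELECT t.* FROM products AS t LIMIT 1 or SELECT t.* FROM products AS t WHERE 1=0, depending on the connector and Sqoop version; either way, the result set's metadata exposes the column names and data types.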

Step-3: Create the Java file, compile it, and generate a jar file

As part of code generation, Sqoop needs to understand the structure of the data and apply the generated class to the incoming records internally, to make sure the data is correctly copied to the target. Each table has its own Java file describing the structure of its data.

This jar file is shipped along with the Sqoop MapReduce job so that the structure can be applied to the incoming data.

Step-4: Delete the target directory if it already exists.
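
This step corresponds to the --delete-target-dir option, which I am assuming was part of the import command; without it, Sqoop fails the job if the target directory already exists.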

Step-5: Import the data

Here Sqoop connects to the ResourceManager, gets the resources allocated, and starts the ApplicationMaster.

To distribute the data equally among the map tasks, Sqoop internally executes a boundary query, based on the primary key by default, to find the minimum and maximum values of the split column in the table. The range between these two values is then divided by the number of mappers, and each mapper is assigned one of the resulting splits.
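
As a hypothetical worked example (the numbers are mine, not from the original post): if the table is split on product_id, the boundary query returns MIN(product_id) = 1 and MAX(product_id) = 1000, and 4 mappers are used, each mapper receives a range of roughly (1000 - 1) / 4 ≈ 250 values:

    Mapper 1: product_id from 1   up to about 250
    Mapper 2: product_id from 251 up to about 500
    Mapper 3: product_id from 501 up to about 750
    Mapper 4: product_id from 751 up to 1000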

It uses 4 mappers by default.

It executes these map tasks in containers on different worker nodes.

The default number of mappers can be changed by setting the following parameter.
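
The parameter is --num-mappers (shorthand -m). For example, the snippet below (username/password omitted for brevity) asks for 8 parallel mappers and explicitly names the split column:

    sqoop import \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --table products \
      --split-by product_id \
      --num-mappers 8

If the table has no primary key, you must either supply --split-by explicitly or fall back to a single mapper with --num-mappers 1; the default MIN/MAX boundary query can also be overridden with --boundary-query.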

So in our case, it uses 4 parallel tasks, and each task processes a mutually exclusive subset of the data, i.e., no two tasks process the same records.

Each mapper is assigned a different range of split values. On each of the worker nodes, the map task reads its assigned split from the source database and writes the records to the target directory in HDFS.

If you perform a Sqoop Hive import, one extra step takes place as part of the execution.

Step-6: Copy the data into the Hive table
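
A Hive import is the same import command with Hive options added, roughly as below; the table name is a placeholder.

    sqoop import \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username retail_user \
      --password '<password>' \
      --table products \
      --hive-import \
      --hive-table products

Under the hood, the data is first imported into a staging directory on HDFS and then loaded into the Hive table, which is the extra step mentioned above.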

3. Sqoop Export:

It is used to export data from Hadoop into traditional relational databases.

Let’s look at a sample export command.
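
A sample export command looks roughly like the one below; the connection details, target table, and HDFS directory are placeholder values assumed for illustration.

    sqoop export \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username retail_user \
      --password '<password>' \
      --table products_export \
      --export-dir /user/hadoop/products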

On executing the above command, execution steps (1–4) similar to those of the Sqoop import take place, but the source data is read from the file system, which in this case is HDFS. Here the data is divided into splits based on the HDFS block size, and Sqoop takes care of this internally.

The input files are split and the splits are processed by the mappers in parallel.

After connecting to the respective database to which the records are to be exported, each map task reads its data from HDFS and issues JDBC INSERT statements to store it in the database.

Now that we have seen how Sqoop works internally, you can trace the flow of execution from jar generation to the execution of the MapReduce job whenever a Sqoop job is submitted.

Note: The code related to this post is available in my Git account: https://github.com/Jayvardhan-Reddy/BigData-Ecosystem-Architecture/tree/master/Sqoop-Architecture

Similarly, you can also read about the other articles in this BigData Architecture series, such as HDFS Architecture in Depth and Hive Architecture in Depth.

If you enjoyed reading it, you can click like, share it, and let others know about it. If you would like me to add anything else, please feel free to leave a response.
