Working of Sqoop Incremental Load
Image credits: https://data-flair.training/blogs/features-of-sqoop/

In my Big Data Architecture series, we have seen the internal working of Sqoop. In this article, we'll look at the process behind executing a Sqoop incremental load in a production environment, where Sqoop jobs are typically triggered via scheduling tools.

The process is as follows:

1. Let's first create a Sqoop job as shown below. Note the space after the standalone -- that separates the job definition from the import tool's arguments:

sqoop job --create CustomerJob \
-- import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password ****** \
--table customers \
--check-column customer_id \
--incremental append \
--last-value 0 \
--as-textfile \
--warehouse-dir /user/cloudera/sqoop_imported_tables/Cust
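
Since the check column customer_id is a monotonically increasing key, --incremental append is the appropriate mode here; --incremental lastmodified (with a timestamp check column) would be the choice if existing rows could also be updated. Once created, the saved job and its stored parameters can be inspected with the standard job tools:

[cloudera@quickstart Desktop]$ sqoop job --list
[cloudera@quickstart Desktop]$ sqoop job --show CustomerJob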

2. Sqoop metastore

The metastore is the place where the metadata of each Sqoop job is stored. It typically holds information such as the schema of the table to be fetched from the source database, the database connection details, and so on.

For an incremental load, Sqoop reads the 'last value' from the metastore to determine which new records still need to be imported.
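
By default this metastore is a private, file-based HSQLDB instance under ~/.sqoop, which is what we inspect below. For shared production setups, Sqoop can also run the metastore as a service that multiple users point their jobs at. A minimal sketch, where metastore-host is an assumed hostname and 16000 is Sqoop's default metastore port:

sqoop-metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --list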

[cloudera@quickstart Desktop]$ cd ~
[cloudera@quickstart ~]$ cd .sqoop
[cloudera@quickstart .sqoop]$ ll
[cloudera@quickstart .sqoop]$ nano metastore.db.script
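
The metastore.db.script file is just the SQL log that HSQLDB replays to rebuild the database, so every saved job parameter shows up as an INSERT into the SQOOP_SESSIONS table. Rather than scrolling through it in an editor, you can pull out the counter directly; the property name used below matches Sqoop 1.4.x but is worth verifying against your own file:

[cloudera@quickstart .sqoop]$ grep 'incremental.last.value' metastore.db.script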

3. Execute the Sqoop Job

[cloudera@quickstart Desktop]$ sqoop job --exec CustomerJob
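
This first run executes a full import, since the job starts with a last value of 0. To confirm where the data landed, list the target directory; with --warehouse-dir, Sqoop appends the table name as a subdirectory, and the number of part files depends on the number of mappers:

[cloudera@quickstart Desktop]$ hadoop fs -ls /user/cloudera/sqoop_imported_tables/Cust/customers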

4. At the end of the Sqoop job execution, you can see that the job details are written back to the metastore.

5. Let's check the 'last value' counter in the Sqoop metastore after the job execution.

[cloudera@quickstart .sqoop]$ nano metastore.db.script

6. Let's insert a new record into the customers table in order to perform an incremental import:

mysql> insert into customers values(12436,'Jay','Reddy','****@gmail.com','hello','R T Nagar','Bangalore','Karnataka','500032');
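
Because the job tracks customer_id, the next execution will import only the rows whose id is greater than the stored last value. You can confirm the new maximum on the source side before re-running the job:

mysql> select max(customer_id) from customers;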

7. Execute the Sqoop job again, and you will see that only the new record is imported into the target directory.

[cloudera@quickstart Desktop]$ sqoop job --exec CustomerJob
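
In append mode, each incremental run adds new part files to the target directory instead of rewriting the existing ones. Listing the directory again should reveal an extra part file with the latest timestamp, and cat-ing it should show only the newly inserted row (the file name below is an assumption; the actual name varies per run):

[cloudera@quickstart Desktop]$ hadoop fs -cat /user/cloudera/sqoop_imported_tables/Cust/customers/part-m-00001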

8. Finally, let's validate that the last value has been updated from 12435 to 12436.

[cloudera@quickstart .sqoop]$ nano metastore.db.script

Now that we have seen how the incremental update takes place, you can easily debug any issues during the incremental load of a Sqoop job via the Sqoop metastore, and also reset the 'last value' accordingly.
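
Sqoop provides no flag to edit a saved job's counter in place, so one common way to reset it is to delete the job and recreate it with the desired starting point. A minimal sketch, assuming you want to re-import every record after id 12000:

[cloudera@quickstart Desktop]$ sqoop job --delete CustomerJob
[cloudera@quickstart Desktop]$ sqoop job --create CustomerJob \
-- import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password ****** \
--table customers \
--check-column customer_id \
--incremental append \
--last-value 12000 \
--as-textfile \
--warehouse-dir /user/cloudera/sqoop_imported_tables/Cust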

Note: The commands related to this post are available in my GitHub account, as part of the Big Data Ecosystem Architecture series: https://github.com/Jayvardhan-Reddy/BigData-Ecosystem-Architecture/blob/master/README.md

If you liked this article, don't forget to like and share! If you would like me to add anything else, please feel free to leave a response.

#Sqoop #Bigdata #Hadoop #Learn #Share #Grow #SqoopIncrementalImport #JayReddy

