Working of Sqoop Incremental Load
Image credits: https://data-flair.training/blogs/features-of-sqoop/

In my Big Data Architecture series, we have seen the internal working of Sqoop. In this article, we'll look at the process behind executing a Sqoop incremental load in a production environment, where Sqoop jobs are typically triggered via scheduling tools.

The process is as follows:

1. Let's first create a Sqoop job as shown below. Note the space after the standalone -- that separates the job definition from the import tool's arguments:

sqoop job --create CustomerJob \
-- import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password ****** \
--table customers \
--check-column customer_id \
--incremental append \
--last-value 0 \
--as-textfile \
--warehouse-dir /user/cloudera/sqoop_imported_tables/Cust
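
Since the check column customer_id is a monotonically increasing key, --incremental append is the appropriate mode here; --incremental lastmodified (with a timestamp check column) would be the choice if existing rows could also be updated. Once created, the saved job and its stored parameters can be inspected with the standard job tools:

[cloudera@quickstart Desktop]$ sqoop job --list
[cloudera@quickstart Desktop]$ sqoop job --show CustomerJob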

2. Sqoop metastore

The metastore is the place where the metadata of each Sqoop job is stored. It typically holds information such as the schema of the table to be fetched from the source database, the database connection details, and so on.

For an incremental load, Sqoop reads the 'last value' from the metastore to determine which new records still need to be imported.
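
By default this metastore is a private, file-based HSQLDB instance under ~/.sqoop, which is what we inspect below. For shared production setups, Sqoop can also run the metastore as a service that multiple users point their jobs at. A minimal sketch, where metastore-host is an assumed hostname and 16000 is Sqoop's default metastore port:

sqoop-metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --list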

[cloudera@quickstart Desktop]$ cd ~
[cloudera@quickstart ~]$ cd .sqoop
[cloudera@quickstart .sqoop]$ ll
[cloudera@quickstart .sqoop]$ nano metastore.db.script
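
The metastore.db.script file is just the SQL log that HSQLDB replays to rebuild the database, so every saved job parameter shows up as an INSERT into the SQOOP_SESSIONS table. Rather than scrolling through it in an editor, you can pull out the counter directly; the property name used below matches Sqoop 1.4.x but is worth verifying against your own file:

[cloudera@quickstart .sqoop]$ grep 'incremental.last.value' metastore.db.script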

3. Execute the Sqoop Job

[cloudera@quickstart Desktop]$ sqoop job --exec CustomerJob
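
This first run executes a full import, since the job starts with a last value of 0. To confirm where the data landed, list the target directory; with --warehouse-dir, Sqoop appends the table name as a subdirectory, and the number of part files depends on the number of mappers:

[cloudera@quickstart Desktop]$ hadoop fs -ls /user/cloudera/sqoop_imported_tables/Cust/customers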

4. At the end of the Sqoop job execution, you can see that the job details are written back to the metastore.

5. Let's check the 'last value' counter in the Sqoop metastore after the job execution.

[cloudera@quickstart .sqoop]$ nano metastore.db.script

6. Let's insert a new record into the customers table in order to perform an incremental import:

mysql> insert into customers values(12436,'Jay','Reddy','****@gmail.com','hello','R T Nagar','Bangalore','Karnataka','500032');
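
Because the job tracks customer_id, the next execution will import only the rows whose id is greater than the stored last value. You can confirm the new maximum on the source side before re-running the job:

mysql> select max(customer_id) from customers;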

7. Execute the Sqoop job again, and you will see that only the new record is imported into the target directory.

[cloudera@quickstart Desktop]$ sqoop job --exec CustomerJob
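
In append mode, each incremental run adds new part files to the target directory instead of rewriting the existing ones. Listing the directory again should reveal an extra part file with the latest timestamp, and cat-ing it should show only the newly inserted row (the file name below is an assumption; the actual name varies per run):

[cloudera@quickstart Desktop]$ hadoop fs -cat /user/cloudera/sqoop_imported_tables/Cust/customers/part-m-00001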

8. Finally, let's validate that the last value has been updated from 12435 to 12436.

[cloudera@quickstart .sqoop]$ nano metastore.db.script

Now that we have seen how the incremental update takes place, you can easily debug any issues during the incremental load of a Sqoop job via the Sqoop metastore, and also reset the 'last value' accordingly.
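
Sqoop provides no flag to edit a saved job's counter in place, so one common way to reset it is to delete the job and recreate it with the desired starting point. A minimal sketch, assuming you want to re-import every record after id 12000:

[cloudera@quickstart Desktop]$ sqoop job --delete CustomerJob
[cloudera@quickstart Desktop]$ sqoop job --create CustomerJob \
-- import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password ****** \
--table customers \
--check-column customer_id \
--incremental append \
--last-value 12000 \
--as-textfile \
--warehouse-dir /user/cloudera/sqoop_imported_tables/Cust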

Note: The commands related to this post are available in my GitHub account, as part of the Big Data Ecosystem Architecture series: https://github.com/Jayvardhan-Reddy/BigData-Ecosystem-Architecture/blob/master/README.md

If you liked this article, don't forget to like and share! If you would like me to add anything else, please feel free to leave a response.

#Sqoop #Bigdata #Hadoop #Learn #Share #Grow #SqoopIncrementalImport #JayReddy

