Working of Sqoop Incremental Load
Jayvardhan Reddy Vanchireddy
Senior Data Engineer at Cognizant?? | Ex-Honeywell | #ONO ?? | #Azure ? | #German B1 Level Certified ???? | Writer@Medium ? | #BigData Engineer ??
In my series of BigData Architecture, we have seen the internal working of Sqoop. Now as part of this article, we'll see the process behind the execution of Sqoop incremental load in a production environment via various scheduling tools of a Sqoop Job.
The process is as follows:
- Let’s first create a Sqoop job as shown below
sqoop job -- create CustomerJob \-- import \-- connect jdbc:mysql://localhost:3306/retail_db \-- username root \-- password ****** \-- table customers \-- check-column customer_id \-- incremental append \-- last-value 0 \-- as-textfile \-- warehouse-dir= /user/cloudera/sqoop_imported_tables/Cust
2. Sqoop metastore:
Metastore is the place where the metadata of each action performed is logged. Usually, it has information like the schema of the table that it needs to fetch from the source database, the database connection and so on…
In this case, Sqoop determines the ‘last value’ of the records to be imported in order to perform incremental load.
[cloudera@quickstart Desktop]$ cd ~ [cloudera@quickstart ~]$ cd .sqoop [cloudera@quickstart .sqoop]$ ll [cloudera@quickstart .sqoop]$ nano metastore.db.script
[cloudera@quickstart .sqoop]$ nano metastore.db.script
3. Execute the Sqoop Job
[cloudera@quickstart Desktop]$ sqoop job — exec CustomerJob
4. You can see at the end of the sqoop job execution details are written to the metastore.
5. Let’s check the last value counter in sqoop metastore after sqoop job execution.
[cloudera@quickstart .sqoop]$ nano metastore.db.script
6. Let's insert a new record into customer table in order to perform an incremental update
mysql> insert into customers values(12436,'Jay','Reddy','****@gmail.com','hello','R T Nagar','Bangalore','Karnataka','500032');
7. Execute sqoop job again and you will see only the new record being imported into the sqoop customer table.
[cloudera@quickstart Desktop]$ sqoop job — exec CustomerJob
8. Finally, let's validate if the last value is being updated from 12435 to 12436
[cloudera@quickstart .sqoop]$ nano metastore.db.script
Now that we have seen how the incremental update takes, you can debug easily incase of any issues during incremental load of a Sqoop job via sqoop metastore and also reset the 'last value' accordingly.
Note: The commands that were executed related to this post are added as part of my GIT account.
If you liked this article, don’t forget to like and share!. If you would like me to add anything else, please feel free to leave a response ??
#Sqoop #Bigdata #Hadoop #Learn #Share #Grow #SqoopIncrementalImport #JayReddy
Does anyone know how to connect to the metastore so that you can query and inspect the SQOOP_SESSIONS table?
Senior Data Engineer at IBM | AI & Big Data Specialist
5 年Its very useful content and you made it very clear on sqoop metastore. Thanks Jay for sharing your knowledge.. Keep posting :)
Senior Data Engineer at Cognizant?? | Ex-Honeywell | #ONO ?? | #Azure ? | #German B1 Level Certified ???? | Writer@Medium ? | #BigData Engineer ??
5 年Link to the Series of In-depth Bigdata Ecosystem Architecture with Code :? https://github.com/Jayvardhan-Reddy/BigData-Ecosystem-Architecture/blob/master/README.md