Sqoop Tutorial: Big Data on Hadoop

Table of Contents

  • What is Sqoop and Why Use Sqoop?
  • Sqoop Features
  • Sqoop Architecture
  • Sqoop Import
  • Sqoop Export
  • Sqoop Processing
  • Demo on Sqoop

The Hadoop ecosystem consists of various facets specific to different career specialties. One such discipline centers around Sqoop, which is a tool in the Hadoop ecosystem used to load data from relational database management systems (RDBMS) to Hadoop and export it back to the RDBMS. Simply put, Sqoop helps professionals work with large amounts of data in Hadoop.

This Sqoop tutorial gives you an in-depth walkthrough for using the Sqoop tool in Hadoop to manage Big Data. It digs into everything from the basics of Sqoop and its architecture to how to actually use it.

Let us now begin our Sqoop tutorial by understanding what exactly Sqoop is.

What is Sqoop and Why Use Sqoop?

Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such as relational databases (MS SQL Server, MySQL).

To process data using Hadoop, the data first needs to be loaded into Hadoop clusters from several sources. However, it turned out that the process of loading data from several heterogeneous sources was extremely challenging. The problems administrators encountered included:

  1. Maintaining data consistency
  2. Ensuring efficient utilization of resources
  3. The inability to load bulk data into Hadoop
  4. The slowness of loading data using scripts

The solution was Sqoop. Using Sqoop in Hadoop helped overcome all the challenges of the traditional approach; it could load bulk data from an RDBMS to Hadoop with ease.

Now that we've understood what Sqoop is and why it is needed, let's learn about the features of Sqoop as the next topic in this Sqoop tutorial.

Sqoop Features

Sqoop has several features that make it helpful in the Big Data world:

  1. Parallel Import/Export: Sqoop uses the YARN framework to import and export data, which provides fault tolerance on top of parallelism.
  2. Import Results of an SQL Query: Sqoop enables us to import the results returned from an SQL query into HDFS.
  3. Connectors for All Major RDBMS Databases: Sqoop provides connectors for multiple RDBMSs, such as MySQL and Microsoft SQL Server.
  4. Kerberos Security Integration: Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure network to authenticate users securely.
  5. Provides Full and Incremental Load: Sqoop can load the entire table or parts of the table with a single command (see the example commands after this list).
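
For instance, features 2 and 5 can be illustrated with the two commands below. These are minimal sketches; the connection string, credentials, table, and column names are placeholder values and not part of this tutorial's demo:

sqoop import --connect jdbc:mysql://dbhost:3306/sales --username user --password pass --query 'SELECT order_id, order_date FROM orders WHERE $CONDITIONS' --split-by order_id --target-dir /user/hadoop/order_query //Imports the result of a free-form SQL query into HDFS

sqoop import --connect jdbc:mysql://dbhost:3306/sales --username user --password pass --table orders --incremental append --check-column order_id --last-value 0 //Imports only the rows whose order_id is greater than the last recorded value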

After going through the features of Sqoop as a part of this Sqoop tutorial, let us understand the Sqoop architecture.

Sqoop Architecture

Now, let’s dive deep into the architecture of Sqoop, step by step:

1. The client submits the import/export command.

2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational database. We have a connector for each of these; connectors help Sqoop work with a range of accessible databases.

3. Multiple mappers perform map tasks to load the data onto HDFS.

4. Similarly, numerous map tasks will export the data from HDFS onto the RDBMS using the Sqoop export command.

This Sqoop tutorial now gives you an insight into Sqoop import.

Sqoop Import

The diagram below represents the Sqoop import mechanism.

  1. In this example, a company’s data is present in the RDBMS. Sqoop first performs an introspection of the database to gather metadata, such as primary key information.
  2. It then submits a map-only job. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS.

A few of the arguments used in Sqoop import are shown below:
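
For illustration, a typical import command combines several of these arguments, such as --connect, --table, --target-dir, -m (number of mappers), --where, and --fields-terminated-by. The command below is a minimal sketch; the connection string, credentials, and table name are placeholders rather than values from this tutorial's demo:

sqoop import --connect jdbc:mysql://dbhost:3306/shop --username user --password pass --table customers --target-dir /user/hadoop/customers -m 4 --fields-terminated-by ',' //Imports the customers table into HDFS using four parallel mappers and comma-separated fields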

Sqoop Export

Let us understand the Sqoop export mechanism stepwise:

  1. The first step is to gather the metadata through introspection.
  2. Sqoop then divides the input dataset into splits and uses individual map tasks to push the splits to the RDBMS.

Let’s now have a look at a few of the arguments used in Sqoop export:
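
For illustration, a typical export command pairs an HDFS directory with a target table using arguments such as --connect, --table, --export-dir, and --input-fields-terminated-by. The command below is a minimal sketch; the connection string, credentials, table name, and directory are placeholders rather than values from this tutorial's demo:

sqoop export --connect jdbc:mysql://dbhost:3306/shop --username user --password pass --table customers_backup --export-dir /user/hadoop/customers --input-fields-terminated-by ',' -m 4 //Exports the comma-separated files in the HDFS directory back into the customers_backup table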

After understanding the Sqoop import and export, the next section in this Sqoop tutorial is the processing that takes place in Sqoop.

Sqoop Processing

Processing takes place step by step, as shown below:

  1. Sqoop runs in the Hadoop cluster.
  2. It imports data from the RDBMS or NoSQL database to HDFS.
  3. It uses mappers to slice the incoming data into multiple formats and loads the data into HDFS.
  4. It exports data back into the RDBMS while ensuring that the schema of the data in the database is maintained.

Next, let us take a look at the Sqoop demo as a part of this Sqoop tutorial.

Demo on Sqoop

For this Sqoop demo, we will be using the Cloudera QuickStart VM. Once you are in the Cloudera QuickStart VM, you will see a GUI. It opens to the Hue GUI by default; if it doesn’t, click on Hue. In Hue, go to Query, choose Editor, and finally choose Sqoop. The screen looks as shown below:

You will then get a screen where you can load the Sqoop editor in Hue. Here, however, we will switch over to the command line, as the CLI runs quicker on many computers. Go ahead and open the command line.

For this hands-on demo, we will be using Oracle VirtualBox Manager and the Cloudera QuickStart VM. Here, we have the databases in MySQL, so to get into the MySQL server, open a terminal in Cloudera, and type:

mysql -u root -pcloudera // This depends on your setup. Here, we have typed Cloudera’s username and password.

After typing the above command, you will get into MySQL. Let’s look at a few quick commands here:

show databases; // This will display a list of 12 databases present in MySQL

After typing the above command, the output will be as follows:

As seen in the above image, MySQL comes with a standard set of databases. We will use the retail_db database for our demo. To use this database and check the tables in it:

use retail_db;

By typing the command above, that database will be set as the default in MySQL. To check the list of tables present inside the retail_db database, you have to type the following command:

show tables;

As shown below, the above command will provide a list of six tables present inside the database:

Now, we will use the “departments” table from the “retail_db” database. For that, you have to type the following:

select * from departments; // This views the contents of the departments table.

After typing the above command, the output will be as follows:

After this, open another terminal in Cloudera and type the following:

hostname -f //Checks the hostname

sqoop list-databases --connect jdbc:mysql://localhost/ --password cloudera --username root; //Lists the databases using Sqoop

After typing the above command, the output will be as follows:

As seen above, the databases match. Now to list the tables present in retail_db using Sqoop, type the following command:

sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root; //Lists the tables present in retail_db using Sqoop

After typing the above command, the output will be as follows:

After this, we have to connect to the MySQL database and import the departments table to HDFS. Let’s go ahead and run our first import command:

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments; //Executes Map Tasks at the back end

After typing the above command, the output will be as follows:

After the code is executed, we can check the web UI of HDFS using localhost:50075, where the data is imported. Now, you can go back to the terminal window and type the following:

hadoop fs -ls /user/cloudera //Checks if importing data into HDFS was successful

hadoop fs -cat /user/cloudera/departments/part* //Views the contents of the departments table

After typing the above commands, the output will be as shown below:

The next command to be typed is as follows:

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --target-dir /user/cloudera/dept1; //Imports the departments data to a target directory

This process will take some time to run. After that, type the following to view the contents of the departments table inside the dept1 directory:

hadoop fs -cat /user/cloudera/dept1/part*

The output will be as follows:

Let us move on to filtering data now; we can use the following import command to filter specific rows using a where clause:

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments -m 3 --where "department_id>4" --target-dir /user/cloudera/dept2;

The above command will only return department ids that are greater than four and will store them in dept2. After running it, view the contents of the dept2 directory by typing the following:

hadoop fs -cat /user/cloudera/dept2/part*

The output after filtering will be:

Now that we have imported the data, the next stage is to export the data. For that, we have to go back to the MySQL server and create a table to export into. We will simply create a new table, “dept”, and export the “departments” table data into it. Create the table below in the MySQL terminal:

create table dept(department_id int not null auto_increment, department_name varchar(45) not null, primary key(department_id));

Now we have to export the departments data in HDFS to the dept table using the following command:

sqoop export --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table dept --export-dir /user/cloudera/dept2;

The above command has to be run in the terminal where Sqoop is set up, not in the MySQL prompt. After the command is executed, go back to your MySQL terminal and type:

select * from dept;

The above command will let you check the contents of the dept table in the MySQL terminal.
