Sqoop Tutorial: Big Data on Hadoop
The Hadoop ecosystem consists of many tools, each suited to a different specialty. One such tool is Sqoop, which is used to load data from relational database management systems (RDBMS) into Hadoop and to export it back to the RDBMS. Simply put, Sqoop helps professionals work with large amounts of data in Hadoop.
This Sqoop tutorial gives you an in-depth walkthrough for using the Sqoop tool in Hadoop to manage Big Data. It digs into everything from the basics of Sqoop and its architecture to how to actually use it.
Let us now begin our Sqoop tutorial by understanding what exactly Sqoop is.
What is Sqoop and Why Use Sqoop?
Sqoop is a tool used to transfer bulk data between Hadoop and external datastores, such as relational databases (MS SQL Server, MySQL).
To process data using Hadoop, the data first needs to be loaded into Hadoop clusters from several sources. However, loading data from several heterogeneous sources proved extremely challenging for administrators: scripts had to be written and maintained for each source, bulk loads were slow, and keeping the data consistent was difficult.
The solution was Sqoop. Using Sqoop in Hadoop helped to overcome all the challenges of the traditional approach, as it could load bulk data from an RDBMS into Hadoop with ease.
Now that we've understood what Sqoop is and why it is needed, let's look at the features of Sqoop as the next topic in this Sqoop tutorial.
Sqoop Features
Sqoop has several features that make it helpful in the Big Data world:
- Parallel import/export: Sqoop transfers data using multiple parallel map tasks, which speeds up bulk loads and provides fault tolerance.
- Import of SQL query results: Sqoop can import the result of an arbitrary SQL query, not just whole tables.
- Connectors for all major RDBMSs: connectors are available for databases such as MySQL, PostgreSQL, Oracle, and MS SQL Server.
- Kerberos security integration: Sqoop supports Kerberos authentication.
- Full and incremental load: Sqoop can load an entire table or only the rows added since the last import.
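As a quick illustration of incremental load, a command along these lines imports only the new rows (the connect string, table, and column names here are placeholders, not values from this tutorial's setup):
sqoop import --connect jdbc:mysql://dbhost/shopdb --username dbuser --password dbpass --table orders --incremental append --check-column order_id --last-value 100 // Imports only rows whose order_id is greater than 100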
After going through the features of Sqoop as a part of this Sqoop tutorial, let us understand the Sqoop architecture.
Sqoop Architecture
Now, let’s dive deep into the architecture of Sqoop, step by step:
1. The client submits the import/export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise data warehouse, document-based systems, and a relational database. We have a connector for each of these; connectors allow Sqoop to work with a range of databases.
3. Multiple mappers perform map tasks to load the data onto HDFS.
4. Similarly, numerous map tasks export the data from HDFS to the RDBMS using the Sqoop export command, as sketched below.
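To make the flow concrete, a minimal import and the matching export might look like the following (the connect string, credentials, table names, and paths are placeholders for your own setup):
sqoop import --connect jdbc:mysql://dbhost/shopdb --username dbuser --password dbpass --table orders -m 4 // Four map tasks load the table into HDFS in parallel
sqoop export --connect jdbc:mysql://dbhost/shopdb --username dbuser --password dbpass --table orders_copy --export-dir /user/hadoop/orders // Map tasks push the HDFS files back into the RDBMS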
This Sqoop tutorial now gives you an insight into Sqoop import.
Sqoop Import
The diagram below represents the Sqoop import mechanism.
A few of the arguments used in Sqoop import are shown below:
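These are some of the commonly used options; the Sqoop documentation covers the full list:
--connect <jdbc-uri> // JDBC connect string for the source database
--username <name> / --password <pass> // Database credentials
--table <table-name> // Table to import
--target-dir <dir> // HDFS directory to import into
--columns <col,col,...> // Import only the listed columns
--where <condition> // Import only rows matching a WHERE clause
-m <n> // Number of map tasks (degree of parallelism)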
Sqoop Export
Let us understand the Sqoop export mechanism stepwise:
1. Sqoop first introspects the target database for the table's metadata.
2. It then divides the input dataset in HDFS into splits.
3. Individual map tasks push the splits to the database in parallel, each performing its transfer over many transactions.
Let's now have a look at a few of the arguments used in Sqoop export:
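Again, this is a representative subset rather than the full list:
--connect <jdbc-uri> // JDBC connect string for the target database
--username <name> / --password <pass> // Database credentials
--table <table-name> // Table in the RDBMS to populate
--export-dir <dir> // HDFS directory containing the data to export
--update-key <col,col,...> // Columns used to identify rows to update
--update-mode <mode> // updateonly (the default) or allowinsert
--input-fields-terminated-by <char> // Field separator used in the input files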
After understanding Sqoop import and export, the next section in this Sqoop tutorial covers the processing that takes place in Sqoop.
Sqoop Processing
Processing takes place step by step, as shown below:
1. Sqoop parses the arguments supplied on the command line and connects to the relational database to fetch the table's metadata.
2. Based on that metadata, Sqoop generates a Java class that can serialize and deserialize the table's records.
3. Sqoop then launches a map-only MapReduce job: the input is divided into splits, by default on the table's primary key, and each mapper transfers its slice of the data in parallel.
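For example, the splitting column and the number of mappers can be controlled explicitly (the table and column names here are illustrative):
sqoop import --connect jdbc:mysql://dbhost/shopdb --username dbuser --password dbpass --table orders --split-by order_id -m 4 // Divides the import on order_id across four map tasks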
Next, let us take a look at the Sqoop demo as a part of this Sqoop tutorial.
Demo on Sqoop
For this Sqoop demo, we will be using the Cloudera QuickStart VM. Once you are in the Cloudera QuickStart VM, you will see a GUI. It will open to the Hue GUI by default; if it doesn't, then you have to click on Hue. Under Hue, go to Query, choose Editor, and finally choose Sqoop. The screen looks as shown below:
You will then get a screen where you can load the Sqoop editor into Hue. Here, we will be switching over to the command line, as the CLI runs quicker on many machines. Now you can go ahead and open the command line.
For this hands-on demo, we will be using Oracle VirtualBox Manager and the Cloudera QuickStart VM. Here, we have the databases in MySQL, so to get into the MySQL server, open a terminal in Cloudera, and type:
mysql -u root -pcloudera // This depends on your setup. Here, we have typed Cloudera's username and password.
After typing the above commands, you will get into MySQL. Let’s look into a few quick commands here:
show databases; // This will display a list of 12 databases present in MySQL
After typing the above command, the output will be as follows:
As seen in the above image, MySQL comes with a standard set of databases. We will use the retail_db database for our demo. To use this database and check the tables in it:
use retail_db;
By typing the command above, that database will be set as the default in MySQL. To check the list of tables present inside the retail_db database, you have to type the following command:
show tables;
As shown below, the above command will provide a list of six tables present inside the database:
Now, we will use the “departments” table from the “retail_db” database. For that, you have to type the following:
select * from departments; // This views the contents of the departments table.
After typing the above command, the output will be as follows:
After this, open another terminal in Cloudera and type the following:
hostname -f //Checks the hostname
sqoop list-databases --connect jdbc:mysql://localhost/ --password cloudera --username root; //Lists the databases using Sqoop
After typing the above command, the output will be as follows:
As seen above, the databases match. Now to list the tables present in retail_db using Sqoop, type the following command:
sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root; //Lists the tables present in retail_db using Sqoop
After typing the above command, the output will be as follows:
After this, we have to connect to the MySQL database and import the departments table into HDFS. Let's go ahead and run our first import command:
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments; //Executes Map Tasks at the back end
After typing the above command, the output will be as follows:
After the code is executed, we can check the web UI of HDFS at localhost:50075, where the data is imported. Now, you can go back to the terminal window and type the following:
hadoop fs -ls /user/cloudera //Checks if importing data into HDFS was successful
hadoop fs -cat /user/cloudera/departments/part* //Views the contents of the departments table
After typing the above commands, the output will be as shown below:
The next command to be typed is as follows:
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --target-dir /user/cloudera/dept1; //Imports the departments data to a target directory
This process will take some time to run. After that, type the following to view the contents of the departments table inside the dept1 directory:
hadoop fs -cat /user/cloudera/dept1/part*
The output will be as follows:
Let us move on to filtering data now; we can use the following import command to select specific rows using a WHERE clause:
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments -m 3 --where "department_id>4" --target-dir /user/cloudera/dept2;
The above command will import only the rows whose department_id is greater than four and store them in dept2. After running the above command, we have to view the contents of the dept2 directory; for that, type the following:
hadoop fs -cat /user/cloudera/dept2/part*
The output after filtering will be:
Now that we have imported the data, the next stage is to export it. For that, we have to go back to the MySQL server and create a table to export into. We will simply create a new table, dept, and export the departments table's data into it. Create the table in the MySQL terminal:
create table dept(department_id int not null auto_increment, department_name varchar(45) not null, primary key(department_id));
Now we have to export the departments table to the dept table using the following command:
sqoop export --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table dept --export-dir /user/cloudera/dept2;
The above command has to be typed in the terminal where Sqoop is set up. After the command is executed, go back to your MySQL terminal and type:
select * from dept;
The above command will let you check the contents of the dept table in the MySQL terminal.