Bulk Data Load using Apache Sqoop
Jisto Jose
Data Architect | DE | DBE | DB Consultant | DB Migration Expert | 5x Multi-cloud Certificate
Apache Sqoop is an open-source tool that is widely used for transferring bulk data between Hadoop and structured data stores such as relational databases, data warehouses, and NoSQL databases. Sqoop was initially developed at Cloudera and became a top-level Apache Software Foundation project in 2012. It provides a simple command-line interface to import and export data, as well as a scalable and fault-tolerant architecture that can handle large volumes of data.
Bulk Data Load:
Bulk data loading is the process of transferring large volumes of data from one system to another. It is often required when migrating data between different systems or when loading data into a new system. Bulk data loading can be a time-consuming and complex process, especially when dealing with large datasets. Apache Sqoop provides a convenient and efficient way to load bulk data into Hadoop, where it can be easily analyzed and integrated with other data sources.
Apache Sqoop for Bulk Data Load:
Apache Sqoop can be used to import data from various sources, including relational databases, data warehouses, and NoSQL databases. Sqoop uses JDBC drivers to connect to these sources and extract data using SQL queries. Sqoop supports a wide range of databases, including Oracle, MySQL, PostgreSQL, SQL Server, and many more. Sqoop can also import data from data warehouses such as Teradata and Netezza, as well as NoSQL databases such as MongoDB and Cassandra.
The import process in Sqoop is straightforward. First, Sqoop connects to the source database and retrieves metadata about the table(s) to be imported. Sqoop then generates a MapReduce job to extract the data using SQL queries. The extracted data is stored in the Hadoop Distributed File System (HDFS) in a format that is optimized for querying and processing with Hadoop tools such as MapReduce, Hive, and Pig.
Sqoop provides several options for controlling the import process, including selecting specific columns, filtering rows, and controlling the number of mappers used to extract the data. Sqoop can also handle complex data types such as BLOBs, CLOBs, and binary data.
In addition, Sqoop provides several features to optimize the import process for large datasets, including parallel imports spread across multiple mappers (-m/--num-mappers with --split-by), incremental imports (--incremental append or lastmodified) that transfer only new or updated rows, direct mode (--direct) for databases that ship native bulk-export tools, and compression of the imported files (--compress).
Now let me demonstrate a use case with Apache Sqoop.
Use case: Migration of historical data from an RDBMS such as Postgres or MySQL
Source Database: Postgres
Target Database: GCP BigQuery
Here, we are going to walk through a bulk migration from a Postgres database to GCP BigQuery.
Prerequisites:
STEP 1: Creation of the Hadoop (Dataproc) cluster.
GCS_BUCKET_NAME = "<BUCKET NAME>"      # Name of the bucket
GCP_PROJECT_NAME = "<PROJECT NAME>"    # Name of the project
SUBNETWORK_URI_ = f"projects/{GCP_PROJECT_NAME}/regions/<region>/subnetworks/<subnet name>"

DATAPROC_CLUSTER_CONFIGURATION = {
    "config_bucket": GCS_BUCKET_NAME,
    "master_config": {
        "num_instances": 1,
        "machine_type_uri": "n1-standard-2",   # Machine type
        "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 100},
    },
    "worker_config": {
        "num_instances": 2,                    # No. of primary worker instances
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 100},
    },
    "secondary_worker_config": {
        "num_instances": 2,
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 100},
    },
    "software_config": {
        "image_version": "1.5-debian10",       # Dataproc image version (OS + Hadoop stack)
        "properties": {"dataproc:dataproc.conscrypt.provider.enable": "false"},
    },
    "lifecycle_config": {
        "idle_delete_ttl": {"seconds": 1800},  # Auto-delete the cluster after 30 minutes idle
    },
    "gce_cluster_config": {
        "subnetwork_uri": SUBNETWORK_URI_,
        "internal_ip_only": True,
        "service_account_scopes": ["https://www.googleapis.com/auth/cloud-platform"],
    },
    "initialization_actions": [
        {"executable_file": f"gs://{GCS_BUCKET_NAME}/<Folder name>/<Script name>.sh"},
    ],
    "autoscaling_config": {
        "policy_uri": f"projects/{GCP_PROJECT_NAME}/regions/<region>/autoscalingPolicies/postgres-migration-autoscaling-policy",
    },
}
Note: The cluster can be created using the Dataproc SDK or an Airflow operator.
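For reference, a minimal Airflow task using the DataprocCreateClusterOperator might look like the sketch below. This is only an illustration: the task_id, region, and cluster name are placeholders, not part of the original setup.

# Sketch: creating the Dataproc cluster from Airflow with the configuration defined above.
# Assumes the apache-airflow-providers-google package is installed; region and cluster name are placeholders.
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

create_dataproc_cluster = DataprocCreateClusterOperator(
    task_id="create_dataproc_cluster",
    project_id=GCP_PROJECT_NAME,
    region="<region>",
    cluster_name="postgres-migration-cluster",      # hypothetical cluster name
    cluster_config=DATAPROC_CLUSTER_CONFIGURATION,  # the dictionary defined in STEP 1
)

The operator would normally be declared inside a DAG; the initialization action in the config is presumably where Sqoop and the JDBC driver get installed on the cluster.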
STEP 2: Bulk migration of data from the Postgres DB to a Google Cloud Storage bucket.
# To load an entire table
sqoop import \
--connect <JDBC url> \
--username root --password <Actual password> \
--table <Table name> -m <Number of mappers> \
--split-by <field name> --target-dir <Target directory> \
--<file-type>   # e.g. --as-avrodatafile or --as-parquetfile
# To transform data in the table using a query. You can apply filters here.
# Note: the query must include the token $CONDITIONS in its WHERE clause so Sqoop can split the work across mappers.
sqoop import \
--connect <JDBC url> \
--username root --password <Actual password> \
--query <Query string> -m <Number of mappers> \
--split-by <field name> --target-dir <Target directory> \
--<file-type>
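One way to run these Sqoop commands on the Dataproc cluster from Airflow, rather than from a shell on the master node, is to submit them as a Hadoop job. The sketch below is an assumption-heavy illustration: it presumes the Sqoop jar and the PostgreSQL JDBC driver have been staged in the GCS bucket, and the cluster name, jar paths, and Sqoop arguments are placeholders.

# Sketch: submitting the Sqoop import to the Dataproc cluster as a Hadoop job.
# Assumptions: the Sqoop jar and PostgreSQL JDBC driver are staged in GCS; all names/paths are placeholders.
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

sqoop_import = DataprocSubmitJobOperator(
    task_id="sqoop_import_postgres_table",
    project_id=GCP_PROJECT_NAME,
    region="<region>",
    job={
        "reference": {"project_id": GCP_PROJECT_NAME},
        "placement": {"cluster_name": "postgres-migration-cluster"},  # hypothetical cluster name
        "hadoop_job": {
            "main_class": "org.apache.sqoop.Sqoop",
            "jar_file_uris": [
                f"gs://{GCS_BUCKET_NAME}/jars/sqoop-1.4.7.jar",      # assumed location of the Sqoop jar
                f"gs://{GCS_BUCKET_NAME}/jars/postgresql-jdbc.jar",  # assumed location of the JDBC driver
            ],
            "args": [
                "import",
                "--connect", "<JDBC url>",
                "--username", "root", "--password", "<Actual password>",
                "--table", "<Table name>", "-m", "<Number of mappers>",
                "--split-by", "<field name>", "--target-dir", "<Target directory>",
            ],
        },
    },
)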
STEP 3: Load data from GCS to BigQuery using Dataflow.
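As a sketch, the Google-provided GCS_Text_to_BigQuery Dataflow template can be launched from Airflow roughly as shown below. This assumes the Sqoop output was written as delimited text files and that a BigQuery JSON schema plus a small JavaScript transform function have been staged in the bucket; the dataset, table, file pattern, and function name are all placeholders.

# Sketch: loading the Sqoop output from GCS into BigQuery with the GCS_Text_to_BigQuery Dataflow template.
# Assumptions: text/CSV output from Sqoop, and a JSON schema + JS UDF staged in the bucket; names are placeholders.
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

gcs_to_bigquery = DataflowTemplatedJobStartOperator(
    task_id="gcs_to_bigquery",
    project_id=GCP_PROJECT_NAME,
    location="<region>",
    template="gs://dataflow-templates/latest/GCS_Text_to_BigQuery",
    parameters={
        "inputFilePattern": f"gs://{GCS_BUCKET_NAME}/<Target directory>/part-*",
        "JSONPath": f"gs://{GCS_BUCKET_NAME}/<Folder name>/bq_schema.json",                      # assumed schema file
        "javascriptTextTransformGcsPath": f"gs://{GCS_BUCKET_NAME}/<Folder name>/transform.js",  # assumed UDF file
        "javascriptTextTransformFunctionName": "transform",                                      # assumed function name
        "outputTable": f"{GCP_PROJECT_NAME}:<Dataset>.<Table name>",
        "bigQueryLoadingTemporaryDirectory": f"gs://{GCS_BUCKET_NAME}/tmp",
    },
)

If the Sqoop output is written as Avro or Parquet instead, a plain BigQuery load job (for example via the GCSToBigQueryOperator or bq load) can be simpler than a Dataflow pipeline, since those formats carry their own schema.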