Fast Data Loading into Greenplum “Open-source” Data Warehouse
As the nature of enterprise data changes, data warehouses must change too. Today's data warehousing relies on fast data loading, with reliable tools that help in retrieving and managing data. "Fast" means you should be able to load very large amounts of data into your warehouse from any external source at any given time, to meet your business needs. (Data is the core asset, and companies are using it in a much richer way now to understand their customers; by some estimates, the volume of business data captured and stored worldwide doubles every 1.2 years.)
Data loading is the first step in the entire analytics process. This post provides a little history and an overview of several methods for loading data into a Greenplum Database (GPDB) data warehouse.
One very important aspect of a data warehouse is the ease and speed with which data can be loaded into it from external sources; it is a key part of both ETL (Extract, Transform, and Load) and ELT (Extract, Load, and Transform). This is one area that most companies overlook when purchasing a data warehouse: most decision makers assume that loading data is simply that, just loading data, but in reality it is more than that.
“What Makes GPDB so Fast?”
Greenplum leverages Scatter/Gather streaming technology to provide the highest performing data loading capabilities in the industry. This technology eliminates the bottlenecks associated with other approaches to data loading, and supports both large batch and continuous near-real-time loading. Data is transformed and processed on the fly, utilizing all available database nodes. The final gathering of data (when data is written to the disk) is done by all nodes simultaneously.
Scatter/Gather streaming is a significant step forward in business analytics, because it does more than just parallelize the simultaneous ingest of multiple data streams from multiple sources. This capability alone represents a big advancement over the single source Extract-Transform-Load (ETL) process, common to the more traditional data warehouses.
At a high level, Scatter/Gather streaming:
- Simultaneously manages the flow of data into all nodes of the database.
- Does not require additional software or systems.
- Takes advantage of the Greenplum parallel data flow engine.
Greenplum offers a very powerful, comprehensive suite of parallel and non-parallel data loading and unloading methods.
Method 1 → “Standard SQL INSERT command”
This is the same SQL INSERT command that is available in every type of database. The SQL INSERT command is the slowest method, as it operates row by row as each INSERT statement is executed (this method should not be considered for bulk loads of large data volumes). Once the data enters the system, it is distributed across the Greenplum segments based on the distribution key specified during table creation.
All data loaded using the SQL INSERT command passes through the Greenplum master into the segment servers, so it does not take advantage of the full power of the GPDB segments. This method is therefore only suited for adding very small amounts of data to the database.
Demonstration:
Non-Parallel Load Architecture
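As a minimal sketch (table and column names are purely illustrative), a row-by-row load with INSERT looks like this:

```sql
-- Hypothetical table; the distribution key decides which segment stores each row.
CREATE TABLE customers (
    customer_id   integer,
    customer_name text
) DISTRIBUTED BY (customer_id);

-- Each INSERT is processed row by row through the master, so this is only
-- appropriate for very small data volumes.
INSERT INTO customers VALUES (1, 'Acme Corp');
INSERT INTO customers VALUES (2, 'Globex');
```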
Method 2 → “PostgreSQL COPY command”
This is the statement available in PostgreSQL, which can load the contents of an entire file, or the output of a process, into the database. The PostgreSQL COPY command is a non-parallel load/unload mechanism, where data is loaded/unloaded in a single process.
In this method you use an external file that contains the data to be loaded. This external file could also be the dynamic output of a program or script, which might be reading from another database, for example. The COPY command copies the entire block of data to the Greenplum master, which then distributes the data to the Greenplum segments based on the distribution key of the target table. This method is more efficient and significantly faster than loading the data row by row with the standard SQL INSERT command.
For small amounts of data, COPY offers a simple way to move data into or out of a database in a single transaction. However, the data is still sent through the Greenplum master, which again does not take advantage of the full power of the segment servers and limits the available throughput.
Demonstration:
Unloading: Table → Flat File
Loading: Flat File → Table
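As a minimal sketch (file paths and table names are illustrative), the two directions look like this:

```sql
-- Loading: read a delimited flat file (visible to the master) into a table,
-- in a single process and a single transaction.
COPY customers FROM '/data/customers.dat' WITH DELIMITER '|';

-- Unloading: write the table contents back out to a flat file.
COPY customers TO '/data/customers_export.dat' WITH DELIMITER '|';
```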
Method 3 → “Parallel Loading using External Tables”
The most efficient method for loading data into a Greenplum Database data warehouse is to leverage the power of the segment servers and load external data in parallel using External Tables (which can be queried as if they were regular database tables). Unlike other loading mechanisms, you can access multiple data sources with one SELECT of an External Table (ET).
External Tables in Greenplum are commonly used with the Extract, Load, and Transform (ELT) process, an alternative to ETL. They support SQL operations such as SELECT, SORT, and JOIN, so the data can be extracted, loaded, and transformed simultaneously; alternatively, you can extract data from external data sources, load it into a load table, transform the raw data using SQL operations, and finally make the data available in the target table.
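For example, a minimal sketch of the staged variant, assuming a hypothetical load table sales_load and target table sales_fact:

```sql
-- ELT step: raw data has already been loaded into a staging ("load") table;
-- the transformation is done inside the database with plain SQL.
INSERT INTO sales_fact (sale_id, sale_date, amount_usd)
SELECT sale_id,
       sale_date::date,          -- cast raw text into proper types
       amount::numeric(12,2)
FROM   sales_load
WHERE  amount IS NOT NULL;       -- simple cleansing rule as part of the transform
```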
Greenplum provides the “gpfdist” utility (a parallel file distribution program used in conjunction with external tables), which enables the segment servers to connect directly to the external sources, such as flat files or programs, and load the data directly into the database. This greatly increases the ingest rate by bypassing the master and letting the segment servers operate in parallel over the high-speed interconnect network (by default 10 Gigabit in the Data Computing Appliance).
Parallel Load Architecture
gpfdist is used by readable external tables and “gpload” to serve external table files to all Greenplum Database (GPDB) segments in parallel. It is also used by writable external tables to accept output streams from GPDB segments in parallel, and write them out to a file.
To set up the parallel load, you first CREATE an EXTERNAL TABLE (readable or writable) that points to the gpfdist program running on an ETL server. gpfdist can stream the contents of data files, or the results of programs, when the table is read. This command sets up the load, but no data is loaded yet. Then, to load the data, a CREATE TABLE AS command is executed which selects the records from the external table into the new table. At this point all the Greenplum segments read the external files in parallel and load the data into the database.
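A rough sketch of this setup, assuming gpfdist is already serving a staging directory on a hypothetical ETL host named etl1:

```sql
-- On the ETL host, gpfdist would typically be started first, e.g.:
--   gpfdist -d /data/staging -p 8081 -l /tmp/gpfdist.log &
-- Defining the external table only registers the location; no data moves yet.
CREATE EXTERNAL TABLE ext_sales (
    sale_id   integer,
    sale_date date,
    amount    numeric(12,2)
)
LOCATION ('gpfdist://etl1:8081/sales_*.dat')
FORMAT 'TEXT' (DELIMITER '|');

-- The actual load: every segment pulls its share of the files in parallel.
CREATE TABLE sales AS
SELECT * FROM ext_sales
DISTRIBUTED BY (sale_id);
```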
Demonstration:
Unloading (via Writable External Table): GPDB Table → Flat File
Loading (via Readable External Table): Flat File → GPDB Table
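Likewise, a sketch of the unloading direction with a writable external table (same illustrative host and table names as above):

```sql
-- A writable external table streams rows from the segments to gpfdist,
-- which writes them to a file on the ETL host.
CREATE WRITABLE EXTERNAL TABLE ext_sales_out (LIKE sales)
LOCATION ('gpfdist://etl1:8081/sales_export.dat')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (sale_id);

-- Unload: all segments send their rows out in parallel.
INSERT INTO ext_sales_out SELECT * FROM sales;
```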
To further simplify the data load process, Greenplum provides “gpload”, a wrapper that uses information in a YAML (YAML Ain't Markup Language) formatted control file to eliminate manual steps. It invokes the parallel file server, gpfdist, on the remote server and basically acts as an interface to the external table parallel loading feature.
The gpload utility takes care of creating and executing SQL against the external table, including the INSERT, UPDATE, or MERGE operation that loads the data automatically from your source into the target table in the database (the load is accomplished in a single transaction). The utility supports pre- and post-load SQL statements. It drops the temporary external table once it has completed the load and terminates the gpfdist process on the remote server. Note: creating and dropping external tables can sometimes cause catalog bloat (frequent vacuuming of the system catalog helps to avoid it).
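For illustration, a rough sketch of what such a control file can look like (database, host, file, and table names are all hypothetical); it would be run with gpload -f sales_load.yml:

```yaml
---
VERSION: 1.0.0.1
DATABASE: warehouse            # target database (illustrative)
USER: gpadmin
HOST: mdw                      # Greenplum master host
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1              # ETL host where gpfdist will serve the files
         PORT: 8081
         FILE:
           - /data/staging/sales_*.dat
    - FORMAT: text
    - DELIMITER: '|'
   OUTPUT:
    - TABLE: sales             # target table
    - MODE: insert             # insert | update | merge
   SQL:
    - AFTER: "ANALYZE sales"   # post-load statement
```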
For users who love Informatica: gpload is integrated with the Informatica PowerExchange connector (PWX), which seamlessly manages the entire parallel ETL process by making direct connections from the Greenplum interconnect network to the ETL hosts. There are also third-party programs (Attunity, Outsourcer, Oracle GoldenGate) that load data through External Tables from remote databases such as SQL Server and Oracle into GPDB.
Broader View on Load Architectures
The picture above helps you visualize the data flow from the master and ETL hosts to the segment hosts. The ETL hosts hold the data files and the gpfdist program, which serves those files to the Greenplum segments when the load starts. A simple SELECT from the external table, issued by the client, starts the segments, which read the external files directly from the ETL hosts via gpfdist.
The combination of direct connections to the high-speed interconnect network, the parallel use of all the segments, and bypassing the master for data transfer lets the Greenplum database achieve extremely high data load rates. For even higher rates, multiple ETL hosts running multiple gpfdist programs can be configured. This scales the parallel effect and can produce truly remarkable ingest rates for large amounts of data. Sample tests have shown that GPDB can load about 10 TB/hour on a full-rack DCA when using gpfdist.
If you are working on data lake platforms like me, you can take data loading to the next level by using the “gphdfs” protocol, which acts as an interface between the Greenplum Database (GPDB) and the Hadoop Distributed File System (HDFS). It is a broader topic, and someday I will write about it.
Points to Remember while Loading/Unloading the data:
- Run ANALYZE on the table after loading.
- Run VACUUM after load errors.
- Ideally, indexes are not needed in Greenplum Database; if you do end up creating an index on a table, drop the index before loading data into the table, and re-create it after the load is complete (see the sketch after this list).
- Do not use ODBC INSERT to load large volumes of data into GPDB.
- External Tables are not intended for frequent or ad hoc access.
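As a rough illustration (table and index names are hypothetical), the housekeeping around a bulk load typically looks like this:

```sql
-- Drop indexes before a large load; re-create them afterwards.
DROP INDEX IF EXISTS idx_sales_date;

-- ... perform the bulk load here (COPY, external table, or gpload) ...

CREATE INDEX idx_sales_date ON sales (sale_date);
ANALYZE sales;   -- refresh optimizer statistics after loading
VACUUM sales;    -- reclaim space, especially after load errors or aborted loads
```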
“Conclusion”
In today's world, big data, analytics, and business intelligence are becoming more and more critical to an organization's success. Everything starts with getting the data into the data warehouse. The minute the data loading process becomes difficult or slow, it becomes a barrier to a company's agility and ultimately its performance. Fast data loading is therefore the key to its ETL/ELT processes, which move large amounts of data into the data warehouse.
With the rise of big data, demands from more data sources, such as EMR (Electronic Medical Record) data, patient logs, call detail records, banking transactions, weblog files, and social media files, combined with dynamically changing business requirements (some BI systems are merely updated incrementally, whereas others require a complete reload at each iteration), make loading data into your data warehouse a big challenge.
To overcome these challenges, GPDB offers both parallel and non-parallel methods for loading data into the data warehouse. Data loading no longer has to be a slow, batch-oriented process. Parallel loading with gpfdist and gpload provides extremely high ingest rates, and the data transfer can be scaled up further by configuring multiple ETL hosts with multiple instances of gpfdist.
“Whatever you choose today is the key to a better tomorrow” — Be wise in choosing the right data warehouse platform for your organization.
References
- EMC Pulse Product & Technology Blog
- EMC Greenplum Data Computing Appliance Enhances EMC IT's Global Data Warehouse (White Paper)
- Pivotal Greenplum
- Inspired by Pivotal Open Source Hub