Fast Data Loading into Greenplum “Open-source” Data Warehouse
As the nature of enterprise data changes, data warehouses must change too. Today's data warehousing relies on fast data loading, with reliable tools that help in retrieving and managing data. "Fast" means you should be able to load very large amounts of data into your warehouse from any external source at any given time, to meet your business needs. (Data is the core asset, and companies are using it in a much richer way now to understand their customers; by some estimates, the volume of business data captured and stored worldwide doubles every 1.2 years.)
Data loading is the first step in the entire analytics process. This post provides a little history and an overview of several methods for loading data into a Greenplum Database (GPDB) data warehouse.
One very important aspect of a data warehouse is the ease and speed with which data can be loaded into it from external sources; it is a key part of both ETL (Extract, Transform, and Load) and ELT (Extract, Load, and Transform). This is one area that most companies overlook when purchasing a data warehouse: most decision makers assume that loading data is simply that, just loading data, but in reality it is more than that.
“What Makes GPDB so Fast?”
Greenplum leverages Scatter/Gather streaming technology to provide the highest performing data loading capabilities in the industry. This technology eliminates the bottlenecks associated with other approaches to data loading, and supports both large batch and continuous near-real-time loading. Data is transformed and processed on the fly, utilizing all available database nodes. The final gathering of data (when data is written to the disk) is done by all nodes simultaneously.
Scatter/Gather streaming is a significant step forward in business analytics, because it does more than just parallelize the simultaneous ingest of multiple data streams from multiple sources. This capability alone represents a big advancement over the single source Extract-Transform-Load (ETL) process, common to the more traditional data warehouses.
At a high level, Scatter/Gather streaming:
- Simultaneously manages the flow of data into all nodes of the database.
- Does not require additional software or systems.
- Takes advantage of the Greenplum parallel data flow engine.
Greenplum offers a very powerful, comprehensive suite of parallel and non-parallel data loading and unloading methods.
Method 1 → “Standard SQL INSERT command”
This is the same SQL INSERT command that is available in every type of database. The SQL INSERT command is the slowest method, as it operates row by row as each INSERT statement is executed (this method should not be considered for bulk loads of large data volumes). Once the data enters the system, it is distributed across the Greenplum segments based on the distribution key specified during table creation.
All data loaded using the SQL INSERT command passes through the Greenplum master into the segment servers, so it does not take advantage of the full power of the GPDB segments. This method is therefore only suited for adding very small amounts of data to the database.
Demonstration:
Non-Parallel Load Architecture
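As a minimal sketch (table and column names are purely illustrative), a row-by-row load with INSERT looks like this:

```sql
-- Hypothetical table; the distribution key decides which segment stores each row.
CREATE TABLE customers (
    customer_id   integer,
    customer_name text
) DISTRIBUTED BY (customer_id);

-- Each INSERT is processed row by row through the master, so this is only
-- appropriate for very small data volumes.
INSERT INTO customers VALUES (1, 'Acme Corp');
INSERT INTO customers VALUES (2, 'Globex');
```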
Method 2 → “PostgreSQL COPY command”
This is the statement available in PostgreSQL, which can load the contents of an entire file, or the output of a process, into the database. The PostgreSQL COPY command is a non-parallel load/unload mechanism, where data is loaded/unloaded in a single process.
In this method you use an external file that contains the data to be loaded. This external file could also be the dynamic output of a program or script, which might be reading from another database, for example. The COPY command copies the entire block of data to the Greenplum master, which then distributes the data to the Greenplum segments based on the distribution key of the target table. This method is more efficient and significantly faster than loading the data row by row with the standard SQL INSERT command.
For small amounts of data, COPY offers a simple way to move data into or out of a database in a single transaction. However, the data is still sent through the Greenplum master, which again does not take advantage of the full power of the segment servers and limits the available throughput.
Demonstration:
Unloading: Table → Flat File
Loading: Flat File → Table
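As a minimal sketch (file paths and table names are illustrative), the two directions look like this:

```sql
-- Loading: read a delimited flat file (visible to the master) into a table,
-- in a single process and a single transaction.
COPY customers FROM '/data/customers.dat' WITH DELIMITER '|';

-- Unloading: write the table contents back out to a flat file.
COPY customers TO '/data/customers_export.dat' WITH DELIMITER '|';
```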
Method 3 → “Parallel Loading using External Tables”
The most efficient method for loading data into a Greenplum Database data warehouse is to leverage the power of the segment servers and load external data in parallel using External Tables (which can be queried as if they were regular database tables). Unlike other loading mechanisms, you can access multiple data sources with one SELECT of an External Table (ET).
External Tables in Greenplum are commonly used with the Extract, Load, and Transform (ELT) process, an alternative to ETL. They support SQL operations such as SELECT, SORT, and JOIN, so the data can be extracted, loaded, and transformed simultaneously; alternatively, you can extract data from external data sources, load it into a load table, transform the raw data using SQL operations, and finally make the data available in the target table.
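For example, a minimal sketch of the staged variant, assuming a hypothetical load table sales_load and target table sales_fact:

```sql
-- ELT step: raw data has already been loaded into a staging ("load") table;
-- the transformation is done inside the database with plain SQL.
INSERT INTO sales_fact (sale_id, sale_date, amount_usd)
SELECT sale_id,
       sale_date::date,          -- cast raw text into proper types
       amount::numeric(12,2)
FROM   sales_load
WHERE  amount IS NOT NULL;       -- simple cleansing rule as part of the transform
```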
Greenplum provides the “gpfdist” utility (a parallel file distribution program used in conjunction with external tables), which enables the segment servers to connect directly to the external sources, such as flat files or programs, and load the data directly into the database. This greatly increases the ingest rate by bypassing the master and letting the segment servers operate in parallel over the high-speed interconnect network (by default 10 Gigabit in the Data Computing Appliance).
Parallel Load Architecture
gpfdist is used by readable external tables and “gpload” to serve external table files to all Greenplum Database (GPDB) segments in parallel. It is also used by writable external tables to accept output streams from GPDB segments in parallel, and write them out to a file.
To set up the parallel load, you first CREATE an EXTERNAL TABLE (readable or writable) that points to the gpfdist program running on an ETL server. gpfdist can stream the contents of data files, or the results of programs, when the table is read. This command sets up the load, but no data is loaded yet. Then, to load the data, a CREATE TABLE AS command is executed which selects the records from the external table into the new table. At this point all the Greenplum segments read the external files in parallel and load the data into the database.
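A rough sketch of this setup, assuming gpfdist is already serving a staging directory on a hypothetical ETL host named etl1:

```sql
-- On the ETL host, gpfdist would typically be started first, e.g.:
--   gpfdist -d /data/staging -p 8081 -l /tmp/gpfdist.log &
-- Defining the external table only registers the location; no data moves yet.
CREATE EXTERNAL TABLE ext_sales (
    sale_id   integer,
    sale_date date,
    amount    numeric(12,2)
)
LOCATION ('gpfdist://etl1:8081/sales_*.dat')
FORMAT 'TEXT' (DELIMITER '|');

-- The actual load: every segment pulls its share of the files in parallel.
CREATE TABLE sales AS
SELECT * FROM ext_sales
DISTRIBUTED BY (sale_id);
```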
Demonstration:
Unloading (via Writable External Table): GPDB Table → Flat File
Loading (via Readable External Table): Flat File → GPDB Table
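Likewise, a sketch of the unloading direction with a writable external table (same illustrative host and table names as above):

```sql
-- A writable external table streams rows from the segments to gpfdist,
-- which writes them to a file on the ETL host.
CREATE WRITABLE EXTERNAL TABLE ext_sales_out (LIKE sales)
LOCATION ('gpfdist://etl1:8081/sales_export.dat')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (sale_id);

-- Unload: all segments send their rows out in parallel.
INSERT INTO ext_sales_out SELECT * FROM sales;
```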
To further simplify the data load process, Greenplum provides “gpload”, a wrapper that uses information in a YAML (YAML Ain't Markup Language) formatted control file to eliminate manual steps. It invokes the parallel file server, gpfdist, on the remote server and basically acts as an interface to the external table parallel loading feature.
The gpload utility takes care of creating and executing SQL against the external table, including the INSERT, UPDATE, or MERGE operation that loads the data automatically from your source into the target table in the database (the load is accomplished in a single transaction). The utility supports pre- and post-load SQL statements. It drops the temporary external table once it has completed the load and terminates the gpfdist process on the remote server. Note: creating and dropping external tables can sometimes cause catalog bloat (frequent vacuuming of the system catalog helps to avoid it).
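For illustration, a rough sketch of what such a control file can look like (database, host, file, and table names are all hypothetical); it would be run with gpload -f sales_load.yml:

```yaml
---
VERSION: 1.0.0.1
DATABASE: warehouse            # target database (illustrative)
USER: gpadmin
HOST: mdw                      # Greenplum master host
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1              # ETL host where gpfdist will serve the files
         PORT: 8081
         FILE:
           - /data/staging/sales_*.dat
    - FORMAT: text
    - DELIMITER: '|'
   OUTPUT:
    - TABLE: sales             # target table
    - MODE: insert             # insert | update | merge
   SQL:
    - AFTER: "ANALYZE sales"   # post-load statement
```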
For users who love Informatica: gpload is integrated with the Informatica PowerExchange connector (PWX), which seamlessly manages the entire parallel ETL process by making direct connections from the Greenplum interconnect network to the ETL hosts. There are also third-party programs (Attunity, Outsourcer, Oracle GoldenGate) that load data through External Tables from remote databases such as SQL Server and Oracle into GPDB.
Broader View on Load Architectures
The picture above helps you visualize the data flow from the master and ETL hosts to the segment hosts. The ETL hosts hold the data files and the gpfdist program, which serves those files to the Greenplum segments when the load starts. A simple SELECT from the external table, issued by the client, starts the segments, which read the external files directly from the ETL hosts via gpfdist.
The combination of direct connections to the high-speed interconnect network, the parallel use of all the segments, and bypassing the master for data transfer lets the Greenplum database achieve extremely high data load rates. For even higher rates, multiple ETL hosts running multiple gpfdist programs can be configured. This scales the parallel effect and can produce truly remarkable ingest rates for large amounts of data. Sample tests have shown that GPDB can load about 10 TB/hour on a full-rack DCA when using gpfdist.
If you are working on data lake platforms like me, you can take data loading to the next level by using the “gphdfs” protocol, which acts as an interface between the Greenplum Database (GPDB) and the Hadoop Distributed File System (HDFS). It is a broader topic, and someday I will write about it.
Points to Remember while Loading/Unloading the data:
- Run ANALYZE on the table after loading.
- Run VACUUM after load errors.
- Ideally, indexes are not needed in Greenplum Database; if you do end up creating an index on a table, drop the index before loading data into the table, and re-create it after the load is complete (see the sketch after this list).
- Do not use ODBC INSERT to load large volumes of data into GPDB.
- External Tables are not intended for frequent or ad hoc access.
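As a rough illustration (table and index names are hypothetical), the housekeeping around a bulk load typically looks like this:

```sql
-- Drop indexes before a large load; re-create them afterwards.
DROP INDEX IF EXISTS idx_sales_date;

-- ... perform the bulk load here (COPY, external table, or gpload) ...

CREATE INDEX idx_sales_date ON sales (sale_date);
ANALYZE sales;   -- refresh optimizer statistics after loading
VACUUM sales;    -- reclaim space, especially after load errors or aborted loads
```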
“Conclusion”
In today's world, big data, analytics, and business intelligence are becoming more and more critical to an organization's success. Everything starts with getting the data into the data warehouse. The minute the data loading process becomes difficult or slow, it becomes a barrier to a company's agility and ultimately its performance. Fast data loading is therefore the key to its ETL/ELT processes, which move large amounts of data into the data warehouse.
With the rise of big data, demands from more data sources, such as EMR (Electronic Medical Record) data, patient logs, call detail records, banking transactions, weblog files, and social media files, combined with dynamically changing business requirements (some BI systems are merely updated incrementally, whereas others require a complete reload at each iteration), make loading data into your data warehouse a big challenge.
To overcome these challenges, GPDB offers both parallel and non-parallel methods for loading data into the data warehouse. Data loading no longer has to be a slow, batch-oriented process. Parallel loading with gpfdist and gpload provides extremely high ingest rates, and the data transfer can be scaled up further by configuring multiple ETL hosts with multiple instances of gpfdist.
“Whatever you choose today is the key to a better tomorrow” — Be wise in choosing the right data warehouse platform for your organization.
References
- EMC Pulse Product & Technology Blog
- EMC Greenplum Data Computing Appliance Enhances EMC IT's Global Data Warehouse (White Paper)
- Pivotal Greenplum
- Inspired by Pivotal Open Source Hub