IMPROVING FLAT FILES LOADING TIMES USING MULTILOAD AND FASTLOAD TERADATA EXTERNAL LOADERS WITH INFORMATICA POWERCENTER
Introduction
We often need to load flat files containing considerable amounts of information, ranging from millions to billions of rows, and to apply some kind of adjustment to each of those rows to comply with the business rules. Informatica PowerCenter can load fixed-width or delimited flat files into target tables, but this isn't one of its strongest abilities. Depending on your configuration, Informatica PowerCenter inserts on average 500 rows/second from a flat file source into the target. That wouldn't be a problem with smaller files, but with bigger ones the load can take days. Such exaggerated loading times aren't acceptable in a modern work environment. To solve this problem, Informatica PowerCenter can rely on External Loaders: it acts as an intermediary that generates the execution scripts and calls the external loader, but no longer works as the primary loading tool. Depending on which external loader we choose (and on our environment), we can reach an average velocity of 17,000 rows/second.
1.- External Loaders
An External Loader is a tool designed to execute Data Manipulation Language (DML) statements such as DELETE, UPDATE, INSERT, or UPSERT in bulk, without executing an SQL statement for each received row; this approach greatly improves loading times. Each Database Management System (DBMS) has its own External Loaders: IBM DB2 has the DB2 LOAD utility, Oracle has SQL*Loader, SQL Server has the BCP utility, and Teradata, the tool we'll use in this article, has three: MultiLoad, for loading up to 5 empty or non-empty tables; FastLoad, for the fastest loading times on empty tables; and MultiLoadX (MLoadX), an improved version of the original MultiLoad introduced in Teradata Database 14.10.
1.2.- Teradata External Loaders
Teradata has three External Loaders (EL), each with its own requirements and particularities. MLoad can INSERT, DELETE, UPDATE, or UPSERT into up to 5 tables at the same time, which may be empty or non-empty; its loading times are slightly slower than FastLoad's, and it is prone to the 2571 locked-table error. The FastLoad EL can only perform INSERTs and the target table must be empty; nevertheless, it has the fastest loading times. MLoadX is an improvement over the original MLoad EL that removes the limitation of working only with Primary Index (PI) tables. In this article, I'm going to use the two most popular Teradata ELs: MLoad and FastLoad.
1.2.1.- Teradata MultiLoad (Mload)
The Teradata MultiLoad EL is a bulk utility which comes in the Teradata Tools and Utilities (TTU) package.
MultiLoad can work with the following DML statements: INSERT, DELETE, UPDATE, or UPSERT. As mentioned before, it can load up to 5 tables at the same time, and we have to keep in mind that MLoad places a table-level lock while loading. When updating or deleting, MLoad requires a Primary Index (PI) on the target table or it won't work, because No Primary Index (NPI) tables are not supported. Every time an MLoad job runs, two error tables are created per target table, plus one work table and one log table.
MLoad works in five phases: Preliminary, DML Transaction, Acquisition, Application, and Cleanup. If an error occurs in the Application phase, a lock is held on the target table, and we will need to release it manually, or back up the locked table's data so we can drop and re-create the table to release the lock. This issue is commonly known as the MLoad 2571 error.
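As a hedged sketch of the manual unlock (the database and table names below are placeholders, and the auxiliary table names follow the common ET_/UV_/WT_ conventions; your actual names appear in the MLoad log), Teradata provides the RELEASE MLOAD statement, which can be issued from BTEQ or any SQL client. If the job failed after the Application phase started, the IN APPLY clause is required, and the leftover work and error tables usually have to be dropped as well:

```sql
-- Placeholders: adjust database/table names to your environment.
RELEASE MLOAD db_usuario1.MLoad_Table;            -- failure before the Application phase
RELEASE MLOAD db_usuario1.MLoad_Table IN APPLY;   -- failure during the Application phase
DROP TABLE db_usuario1.ET_MLoad_Table;            -- acquisition error table
DROP TABLE db_usuario1.UV_MLoad_Table;            -- application error table
DROP TABLE db_usuario1.WT_MLoad_Table;            -- work table
```

If RELEASE MLOAD is rejected because the load was too far along, the fallback is the backup/drop/re-create procedure described above.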
1.2.2.- Teradata FastLoad
The Teradata FastLoad EL is a bulk utility which comes in the Teradata Tools and Utilities (TTU) package.
The Teradata FastLoad EL is designed to do a single-table bulk load, supports tables with No Primary Index (NoPI), and is the fastest of all the EL utilities Teradata has to offer. Nevertheless, FastLoad can perform only INSERT statements on empty tables; UPDATE/DELETE/UPSERT is not supported, and duplicate rows can't be inserted. Like its counterpart MLoad, every time a FastLoad session runs, two error tables plus one work table and one log table are created.
The FastLoad EL utility works in just two phases: Acquisition and Application. While data is being loaded, a lock is placed on the target table, and that lock is released only when FastLoad encounters the END LOADING command inside the calling script. If an error occurs in the Application phase, the table will remain locked until we remove the lock manually or drop and re-create the target table. This failure is known as FastLoad error 2652.
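A common way to clear that lock, sketched below with hypothetical credentials and table names, is to run a FastLoad script that contains only the BEGIN LOADING / END LOADING pair; this lets the paused load complete and releases the lock:

```text
/* release_lock.fld -- run with: fastload < release_lock.fld        */
/* Hostname, credentials, and table names below are placeholders.   */
LOGON TeradataServerSuse/usuario1,password;
BEGIN LOADING db_usuario1.FLoad_Table
    ERRORFILES db_usuario1.ET_FLoad_Table, db_usuario1.UV_FLoad_Table;
END LOADING;
LOGOFF;
```

If even this fails, dropping and re-creating the target table remains the last resort.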
2.- Teradata external loaders practical example
In this practical example, we will import a 63,418,979-row CSV file from the Addresses and geolocations for South American countries Kaggle dataset, using the MultiLoad and FastLoad Teradata External Loaders with Informatica PowerCenter as an intermediary to automatically create and execute the scripts for MLoad and FLoad. To do so, we will create a Teradata target table to receive the CSV file, an MLoad/FLoad ODBC connection to connect the external loaders to the Teradata database, and an Informatica PowerCenter mapping and workflow to call MLoad and FLoad.
2.1.- Prerequisites
To execute the following example, first we will need to download the Addresses and geolocations for South American countries dataset from Kaggle and create the Teradata target tables that will receive these addresses.
2.1.1.- OpenAddresses - South America CSV File
Kaggle is a website that hosts large datasets, used mainly as sources for machine learning models. To test the velocity of MLoad and FLoad, we will use one of the larger datasets Kaggle has to offer; we chose a 63,418,979-row CSV dataset with 11 columns called openaddresses-south-america. Inside the openaddresses-south-america.zip file there are six files, and we will use the 5.8 GB brazil.csv file, since it is the biggest one.
The brazil.csv file has a total of 63,418,979 rows, of which 56,625,086 are unique.
2.1.2.- Teradata table creation
The brazil.csv file has 11 columns, of which the first three are numbers and the remainder are strings.
The script below creates three Teradata target tables: db_usuario1.FLoad_Table for the FLoad load, db_usuario1.MLoad_Table for the MLoad load, and db_usuario1.IPC_Load_Table for the classic IPC load. The brazil.csv records will be stored in these three tables.
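A hedged sketch of that creation script follows. The column names and sizes are assumptions (adjust them to the actual brazil.csv header); the MULTISET and NO PRIMARY INDEX choices match the Insert load mode used later for the MLoad connection, and the same statement is repeated for the other two table names:

```sql
-- Sketch only: column names/sizes are assumed, not taken from the dataset.
-- Repeat for db_usuario1.MLoad_Table and db_usuario1.IPC_Load_Table.
CREATE MULTISET TABLE db_usuario1.FLoad_Table (
    lon          DECIMAL(18,6),
    lat          DECIMAL(18,6),
    house_number DECIMAL(18,0),
    street       VARCHAR(200),
    unit         VARCHAR(100),
    city         VARCHAR(100),
    district     VARCHAR(100),
    region       VARCHAR(100),
    postcode     VARCHAR(20),
    id_field     VARCHAR(100),
    hash_field   VARCHAR(100)
) NO PRIMARY INDEX;
```

Note that, as discussed in the Results section, even MULTISET does not let FastLoad insert duplicate rows.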
2.2.- Informatica PowerCenter configuration
2.2.1.- Importing the source brazil.CSV as a source definition
Like any other source file, we'll import the brazil.csv file by going to the Informatica PowerCenter Designer, clicking Sources, Import from File, and selecting it. As brazil.csv has headers in its first row, we must check the Import field names from first line box. Because this dataset can have data-type problems (strings in number columns), and to avoid future issues, we will declare all the ports as strings. The resulting source definition is shown in the figure below.
2.2.2.- Importing the FLoad_Table Teradata target table as a Target Definition
To import the FLoad_Table target definition, we will go to the Target Designer, click Targets, Import from Database, connect to our Teradata database, and select our table. Notice that the same steps should be followed for the MLoad_Table and IPC_Load_Table Teradata target tables. After importing the three Teradata table definitions, the result will look like the figure below.
2.2.3.- Creating the m_fload_teradata, m_mload_teradata, m_ipc_load Informatica PowerCenter mappings
The steps for creating the three mappings are the same. We will drag the Brazil (Flat File) source definition and the Teradata target table into the Mapping Designer zone, then connect the ports from the SQ_brazil Source Qualifier to the Teradata target definition. The figure below shows what the three mappings look like.
Notice that creating three mappings is redundant; the same process could be done with one mapping, and likewise for the three Teradata tables. I'm doing it this way purely for clarity.
2.2.4.- Informatica PowerCenter ODBC Teradata external loaders configuration
Every time we need to add an ODBC connection from the Informatica PowerCenter server to a DBMS, we do it from the Informatica PowerCenter Workflow Manager. This Workflow ODBC connection differs from the Mapping ODBC connection: the first uses a 64-bit ODBC connection and the second a 32-bit one. Moreover, we need to have previously installed the ODBC drivers corresponding to our DBMS. To create MultiLoad and FastLoad Teradata ODBC connections, we need the Teradata ODBC drivers, which come in the Teradata Tools and Utilities (TTU) package. Besides the ODBC drivers, the external loaders need a host entry in the Windows/Linux (depending on your configuration) hosts file.
2.2.4.1.- Windows Hosts File configuration
Depending on your configuration, the hosts file will reside in a different path. For this example I'm using a Windows 7 machine, on which the hosts file resides in C:\Windows\System32\Drivers\etc\hosts.
To edit the hosts file, we need Administrator credentials. We will add the IP address of our Teradata server together with a name, as shown in the figure below.
The name by which I'll refer to our Teradata server in the MLoad/FLoad ODBC workflow connections will be TeradataServerSuse.
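As an illustrative sketch (the IP address below is a documentation placeholder, not the real server address), the hosts entry looks like this. Note that some TTU client versions resolve the TDPID with a "cop1" suffix, so an aliased cop1 entry is often added alongside:

```text
# C:\Windows\System32\Drivers\etc\hosts  (path varies by OS)
# 192.0.2.10 is a placeholder; use your Teradata server's real IP.
192.0.2.10    TeradataServerSuse
192.0.2.10    TeradataServerSusecop1
```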
2.2.4.2.- ODBC FastLoad Informatica PowerCenter configuration
After opening the Informatica PowerCenter Workflow Manager, we'll click on Connections, Loader.
A window called Loader Connection Browser will appear; after clicking New..., a Select Subtype window will show up. In this window we can see all the external loaders for which Informatica PowerCenter can act as an intermediary, generating and calling the EL scripts. For the FastLoad connection, we will choose the Teradata FastLoad External Loader option and click OK.
A window called Connection Object Definition will appear. We choose a name by which we'll identify the FastLoad connection from the Workflow Manager, plus a user name and password to connect to the Teradata database server. In the Attributes section, for the TDPID attribute we will enter the value TeradataServerSuse (the same name we wrote in the hosts file), and for the Database Name attribute the default database of our user, in which the automatically generated scripts will create the auxiliary tables and insert values into the target tables. This option is critical because it can't be changed later in the FLoad/MLoad flow execution. Finally, we leave the Drop Error Tables box checked and click OK.
2.2.4.3.- ODBC MultiLoad Informatica PowerCenter configuration
The steps for configuring the Teradata MLoad ODBC connection are pretty much the same as for FLoad. Inside the Loader Connection Browser, we'll choose the Teradata MultiLoad External Loader option and click OK.
Inside the Connection Object Definition window, we will enter the same credentials as for the FLoad ODBC loader configuration, but setting the value Insert for the Load Mode attribute, since our target tables don't have a PI.
2.3 Informatica PowerCenter Workflow creation and execution.
We need to create a workflow called wf_external_loaders, with three sessions: one for the FLoad mapping, one for the MLoad mapping, and one for the IPC load mapping. To test each case, we will disable the other two sessions; for example, to test the m_ipc_load mapping, we'll disable the m_fload_teradata and m_mload_teradata sessions. The following figure describes the workflow configuration.
For each session, we need to properly configure the Source Filetype and Source File Directory values, pointing to the brazil.csv file. For the s_ipc_load session, we also need to declare the Table Name Prefix and Target Table Name.
Now comes the critical part: the ODBC MLoad/FLoad Teradata configuration for the sessions s_fload_teradata and s_mload_teradata. Opening the s_fload_teradata session, in the Mapping tab we will click on FLoad_Table under the Targets section. In the Writers section, we will choose the File Writer option, even though the target table is relational. In the Connections section, we will select the type Loader, and for the value we choose the Teradata_FastLoad_EL ODBC connection we just created.
For the s_mload_teradata session, the steps for configuring the external loader are the same, but for the Loader value we must choose the Teradata_Mload_EL ODBC connection.
3.- Results
The following graphs and figures show the number of records processed per second and the time taken to finish loading the target table in the three following cases. First, loading the brazil.csv file and inserting all the records with the Teradata FastLoad external loader. Second, using the same source but inserting all the records with the Teradata MultiLoad external loader. Third, inserting the records with the Informatica PowerCenter server alone. To be clear, the original task was to load all 63 million rows from the brazil.csv file, but since the author doesn't have too much time after work, I'll use a 10-million-row subset of it.
3.1 Teradata FastLoad External Loader
As we can see in the figure below, using the Teradata FastLoad external loader we can process the 10 million rows at a velocity of 17,065 rows/sec, and the workflow takes only 10 minutes.
Nevertheless, as I mentioned earlier, using the FastLoad external loader implicitly removes all duplicate rows from the source file (the brazil.csv file in this case). This behavior prevails even if the table is defined as MULTISET instead of SET. Even though the Informatica PowerCenter session log doesn't show any error and the total row count is the original 10 million, the log file created by the FastLoad EL (located in the default IPC target directory) shows the number of rows actually inserted into the target table as well as the number of duplicate rows.
3.2 Teradata MultiLoad External Loader
The Teradata MultiLoad external loader reaches, for this exercise, a velocity of 4,706 rows/sec, and its corresponding workflow takes 35 minutes to finish.
If we review the log file created by the Teradata MultiLoad EL, we can see that all 10 million rows are correctly inserted (because MultiLoad has no duplicate-row restriction).
3.3 Informatica PowerCenter Loader
It's no news that the default IPC flat file loader is one of the slowest out there. For this example, the IPC loader reaches a velocity of only 341 rows/sec, and after 2 hours of running it had barely loaded two million rows. As the wise lady told us, Ain't Nobody Got Time For That, so I stopped the session and did the math: at 341 rows/second, it would take about 8 hours and 8 minutes to finish the corresponding workflow.
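The math above is simple enough to check. The snippet below reproduces the total loading times from the per-second rates observed in this exercise (the rates themselves come from the session and EL logs):

```python
# Back-of-the-envelope check of the loading times reported in this article.
ROWS = 10_000_000
rates = {  # observed average rows/sec per loading method
    "FastLoad": 17_065,
    "MultiLoad": 4_706,
    "PowerCenter": 341,
}

for name, rate in rates.items():
    minutes = ROWS / rate / 60  # total minutes to load all rows
    print(f"{name:12s} ~{minutes:,.1f} min")
```

This confirms roughly 10 minutes for FastLoad, 35 for MultiLoad, and almost 489 minutes (about 8 hours) for the plain PowerCenter load.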
The figures below summarize the rows/second velocity of the three loading methods presented here and the time each one required to load the 10 million rows.
4.- Conclusions
Depending on your DBMS, you have a variety of external loaders at your disposal. In this article, we used the two most common Teradata external loaders, FastLoad and MultiLoad, with Informatica PowerCenter as an intermediary to create and call their execution scripts. FastLoad is the fastest of the three (including the IPC flat file loader) but requires an empty target table and can't insert duplicate rows. MultiLoad is slower than FastLoad but can insert/update/delete/upsert into non-empty target tables and can handle duplicate records, though we need to be aware of our Teradata AMP space. The default IPC flat file loader is not recommended when dealing with massive amounts of information.
Thanks for reading!