WHAT IS ETL
In computing, extract, transform, load (ETL) is a three-phase process in which data is extracted, transformed (cleaned, sanitized, scrubbed) and loaded into an output data container. The data can be collated from one or more sources, and it can also be output to one or more destinations. ETL processing is typically executed using software applications, but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on recurring schedules, either as single jobs or aggregated into a batch of jobs.
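As a minimal sketch of the three phases, consider the following Python job; the file names (orders_raw.csv, orders_clean.csv) and field layout are hypothetical, and the output container is a simple delimited flat file rather than a warehouse:

```python
import csv

def extract(path):
    """Extract: pull raw records from a source (here, a CSV flat file)."""
    with open(path, newline="", encoding="utf-8") as src:
        return list(csv.DictReader(src))

def transform(rows):
    """Transform: clean each record and conform it to the target's format."""
    return [
        {
            "id": int(r["id"]),                       # enforce data type
            "name": r["name"].strip().title(),        # scrub stray whitespace/case
            "amount": round(float(r["amount"]), 2),   # normalize precision
        }
        for r in rows
    ]

def load(rows, path):
    """Load: write the conformed records to the output data container."""
    with open(path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["id", "name", "amount"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    load(transform(extract("orders_raw.csv")), "orders_clean.csv")
```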
[Figure: Conventional ETL diagram.[1]]
A properly designed ETL system extracts data from source systems, enforces data-type and data-validity standards, and ensures that the data conforms structurally to the requirements of the output. Some ETL systems can also deliver data in a presentation-ready format so that application developers can build applications and end users can make decisions.[1]
The ETL process is often used in data warehousing.[2] ETL systems commonly integrate data from multiple applications (systems), typically developed and supported by different vendors or hosted on separate computer hardware. The separate systems containing the original data are frequently managed and operated by different stakeholders. For example, a cost accounting system may combine data from payroll, sales, and purchasing.
Extract
Data extraction involves extracting data from homogeneous or heterogeneous sources; data transformation processes the data by cleaning it and transforming it into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of the data into the final target database, such as an operational data store, a data mart, a data lake or a data warehouse.[3][4]
ETL processing involves extracting the data from the source system(s). In many cases, this represents the most important aspect of ETL, since extracting data correctly sets the stage for the success of subsequent processes. Most data-warehousing projects combine data from different source systems. Each separate system may also use a different data organization and/or format. Common data-source formats include relational databases, XML, JSON and flat files, but may also include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even formats fetched from outside sources by means such as web spidering or screen scraping. Streaming the extracted data and loading it on the fly into the destination database is another way of performing ETL when no intermediate data storage is required.
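To illustrate extraction from heterogeneous sources, here is a sketch (field names and file paths are invented) that reads a CSV flat file and a JSON document with different layouts and merges them into one uniform record stream; because it uses generators, records can also be transformed and loaded on the fly without intermediate storage:

```python
import csv
import json

def extract_csv(path):
    # Flat-file source: one record per row, columns named in the header.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {"customer_id": row["customer_id"], "total": float(row["total"])}

def extract_json(path):
    # JSON source: an array of objects with a different field layout.
    with open(path, encoding="utf-8") as f:
        for obj in json.load(f):
            yield {"customer_id": str(obj["custId"]), "total": float(obj["orderTotal"])}

def extract_all(csv_path, json_path):
    # Merge the heterogeneous sources into one uniform, lazily produced
    # record stream that downstream transform/load steps can consume.
    yield from extract_csv(csv_path)
    yield from extract_json(json_path)
```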
An intrinsic part of the extraction involves data validation to confirm whether the data pulled from the sources has the correct/expected values in a given domain (such as a pattern/default or a list of values). If the data fails the validation rules, it is rejected entirely or in part. The rejected data is ideally reported back to the source system for further analysis, to identify and rectify the incorrect records or to perform data wrangling.
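A sketch of such extraction-time validation follows; the rules (a regex pattern for an email field, a list of allowed country codes) and field names are assumptions for illustration. Failing records are partitioned out together with their violations so they can be reported back to the source system:

```python
import re

# Hypothetical validation rules: a pattern for one field,
# and a list of allowed values for another.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
ALLOWED_COUNTRIES = {"US", "CA", "MX"}

def validate(record):
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if not EMAIL_PATTERN.match(record.get("email", "")):
        errors.append("email: does not match expected pattern")
    if record.get("country") not in ALLOWED_COUNTRIES:
        errors.append("country: not in list of allowed values")
    return errors

def split_valid(records):
    """Partition records so rejects can be reported back to the source system."""
    accepted, rejected = [], []
    for rec in records:
        errors = validate(rec)
        if errors:
            rejected.append({"record": rec, "errors": errors})
        else:
            accepted.append(rec)
    return accepted, rejected
```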
Transform
In the data transformation stage, a series of rules or functions are applied to the extracted data in order to prepare it for loading into the end target.
An important function of transformation is data cleansing, which aims to pass only "proper" data to the target. A challenge arises when different systems must interact, because each system's interfacing and communication conventions can differ; in particular, character sets that are available in one system may not be available in others, as the sketch below illustrates.
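One way to handle the character-set problem is to decode each source's bytes from its declared encoding and normalize the result to a canonical Unicode form before passing it downstream; the Latin-1 source encoding here is an assumed example:

```python
import unicodedata

def normalize_text(raw_bytes, source_encoding="latin-1"):
    """Decode text from a legacy source encoding and normalize it
    to a canonical Unicode form before passing it downstream."""
    text = raw_bytes.decode(source_encoding)
    # NFC composes characters (e.g. 'e' + combining acute -> 'é') so that
    # equivalent strings from different systems compare equal in the target.
    return unicodedata.normalize("NFC", text)

# Example: the same name arriving as bytes from a Latin-1 system.
print(normalize_text(b"Jos\xe9"))  # -> José
```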
In other cases, one or more transformation types may be required to meet the business and technical needs of the server or data warehouse: for example, selecting only certain columns to load, translating coded values, deriving new calculated values, joining data from multiple sources, aggregating rows, or generating surrogate-key values. A few of these are sketched after this paragraph.
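The following sketch chains a few such transformations as a series of functions over hypothetical sales records (the column names and code mappings are invented for illustration):

```python
def select_columns(row):
    # Select only the columns the warehouse needs.
    return {k: row[k] for k in ("id", "gender", "qty", "unit_price")}

def translate_codes(row):
    # Translate coded values (e.g. the source stores 1/2, the warehouse M/F).
    row["gender"] = {1: "M", 2: "F"}.get(row["gender"], "U")
    return row

def derive_values(row):
    # Derive a new calculated value from existing fields.
    row["revenue"] = row["qty"] * row["unit_price"]
    return row

def transform(rows):
    # Apply the series of rules/functions, in order, to each extracted record.
    for step in (select_columns, translate_codes, derive_values):
        rows = map(step, rows)
    return list(rows)

rows = [{"id": 1, "gender": 2, "qty": 3, "unit_price": 9.5, "extra": "dropped"}]
print(transform(rows))
# [{'id': 1, 'gender': 'F', 'qty': 3, 'unit_price': 9.5, 'revenue': 28.5}]
```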
Load
The load phase loads the data into the end target, which can be any data store, including a simple delimited flat file or a data warehouse.[5] Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals, for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year: it overwrites any data older than a year with newer data, but the entry of data for any one-year window is made in a historical manner. The timing and scope of replacing or appending data are strategic design choices that depend on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the data warehouse. Because the load phase interacts with a database, the constraints defined in the database schema (as well as in triggers activated upon data load) apply, for example uniqueness, referential integrity, and mandatory fields; these also contribute to the overall data-quality performance of the ETL process.
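The following sketch shows both loading strategies against a SQLite target whose schema enforces constraints at load time; the table and column names are hypothetical, and SQLite's INSERT OR REPLACE stands in for the overwrite strategy:

```python
import sqlite3

def load(rows, db_path, mode="append"):
    """Load transformed rows (dicts with sale_id, region, amount keys)
    into the target table.

    mode="append"    adds new data in historical form (interval loads);
    mode="overwrite" replaces existing rows, keyed on the primary key.
    """
    with sqlite3.connect(db_path) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS sales (
                sale_id   INTEGER PRIMARY KEY,          -- uniqueness constraint
                region    TEXT NOT NULL,                -- mandatory field
                amount    REAL NOT NULL,
                loaded_at TEXT DEFAULT CURRENT_TIMESTAMP -- simple audit column
            )""")
        verb = "INSERT OR REPLACE" if mode == "overwrite" else "INSERT"
        sql = (f"{verb} INTO sales (sale_id, region, amount) "
               "VALUES (:sale_id, :region, :amount)")
        # Schema constraint violations raise sqlite3.IntegrityError here,
        # surfacing data-quality problems at load time.
        conn.executemany(sql, rows)
```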