ETL (extract, transform, load) is the general term for the process of pulling data out of source systems, reshaping it, and loading it into a data warehouse.
- Extract:
- Extraction is the process of retrieving data from a source system for use in a data warehouse. It is the first step of the ETL process.
- The extraction phase is frequently one of the most time-consuming parts of ETL.
- Because source systems are often complex and poorly documented, it can be challenging to identify exactly which data needs to be extracted.
- To ensure the warehouse always has access to up-to-date data, extraction must be repeated on a regular schedule.
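The repeated extraction described above is usually incremental: only rows changed since the previous run are pulled. A minimal sketch in Python, where the record fields and the `last_run` bookmark are illustrative assumptions rather than a real tool's API:

```python
from datetime import datetime

def extract_incremental(rows, last_run):
    """Return only the source rows modified after the previous extraction.

    `rows` stands in for a query against the source system; in practice the
    filter would be pushed into the source query itself.
    """
    return [r for r in rows if r["modified_at"] > last_run]

# Hypothetical source table with a "last modified" timestamp per row.
source_rows = [
    {"id": 1, "modified_at": datetime(2024, 1, 5)},
    {"id": 2, "modified_at": datetime(2024, 3, 9)},
]

# Only row 2 changed since the last run on 2024-02-01.
changed = extract_incremental(source_rows, last_run=datetime(2024, 2, 1))
```

Persisting the `last_run` timestamp between runs (in a control table or file) is what makes the extraction repeatable without re-reading the full source.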
- Transform:
- The transformation step is centered on reconciliation: records are converted from their operational source format into the data warehouse format. In a three-layer architecture, this step produces the reconciled data layer. Typical transformations include:
- Filtering, cleansing, deduplicating, validating, and authenticating the data.
- Performing calculations, translations, or summarizations on the raw data. This can include changing text strings, converting currencies or other units of measure, standardizing row and column headers for consistency, and more.
- Auditing the data to ensure quality and compliance.
- Removing, encrypting, or otherwise protecting data governed by government or industry regulations.
- Formatting the data into tables, or joined tables, to match the schema of the target data warehouse.
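Several of the transformations above (cleansing text, deduplicating, converting currency) can be sketched in a few lines of Python. The field names and the fixed EUR-to-USD rate are assumptions made for the example, not part of any standard:

```python
EUR_TO_USD = 1.1  # assumed fixed rate, purely for illustration

def transform(records):
    """Cleanse, deduplicate, and convert source records to a target format."""
    seen, out = set(), []
    for rec in records:
        key = rec["order_id"]
        if key in seen:              # deduplicate on the business key
            continue
        seen.add(key)
        amount = rec["amount"]
        if rec["currency"] == "EUR":  # unit/currency conversion
            amount = round(amount * EUR_TO_USD, 2)
        out.append({
            "order_id": key,
            "customer": rec["customer"].strip().title(),  # cleanse text
            "amount_usd": amount,
        })
    return out

raw = [
    {"order_id": 1, "customer": " alice ", "amount": 100.0, "currency": "EUR"},
    {"order_id": 1, "customer": " alice ", "amount": 100.0, "currency": "EUR"},
    {"order_id": 2, "customer": "BOB",     "amount": 50.0,  "currency": "USD"},
]
clean = transform(raw)
```

Real ETL tools express the same operations declaratively, but the shape is the same: each record is validated and reshaped into the warehouse schema before loading.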
- Load:
- Loading is the step of writing the data into the destination database. It is vital that the load runs correctly and with as few resources as possible.
Two methods exist for carrying out loading:
- Refresh: the data in the warehouse is rewritten completely, i.e., the older data is overwritten. Refresh is typically combined with static extraction to populate a data warehouse initially.
- Update: only the changes made to the source data are applied to the warehouse, usually without deleting or altering previously stored data. Combined with incremental extraction, this technique is used to update data warehouses regularly.
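The two load methods can be contrasted against a SQLite table standing in for the warehouse. The table and column names are illustrative; a real warehouse would use bulk-load utilities rather than row-by-row inserts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

def load_refresh(conn, rows):
    """Refresh: overwrite the whole table with a new snapshot."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def load_update(conn, rows):
    """Update: apply only changed rows, leaving existing data untouched."""
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        rows,
    )

load_refresh(conn, [(1, 100.0), (2, 50.0)])   # initial full load
load_update(conn, [(2, 75.0), (3, 20.0)])     # incremental changes only
```

After the update, row 1 is untouched, row 2 carries its new amount, and row 3 has been added, which is exactly the "changes only" behavior described above.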
- Strengths:
- Development Time: by designing backward from the required output, only data relevant to the solution is extracted and processed, which can reduce development, extraction, and processing overhead.
- Targeted data: because the load process is targeted, the warehouse holds only data relevant to the presentation. Reduced warehouse content makes security measures easier to enforce, which lowers administrative costs.
- Tools Availability: many tools are available to perform ETL, giving flexibility of approach and the chance to choose the most appropriate tool for the job. The proliferation of tools, however, can lead to a competitive war over functionality at the expense of maintainability.
- Weaknesses:
- Flexibility: because the ETL routines target only the data relevant to the output, any future requirement that needs data excluded from the original design forces the routines to be modified. Since the routines are tightly coupled, this often requires a significant rethink and redevelopment, which raises the time and cost involved.
- Hardware: most third-party ETL tools run their own engine, which may require purchasing additional hardware to host it, regardless of the solution's projected cost. Using third-party tools also means learning new scripting languages and procedures.