What is the ETL or ELT or something - Something we often see on job postings...
Lala Ibadullayeva
PhD candidate @CSB | Certified TensorFlow Developer | Google WTM Ambassador
As a recent job seeker, I kept running into the terms ETL and ELT in job postings for almost a week. However, between PhD life and my main interest in Data Science and Machine Learning, I didn't take the time to research them.
After all, how many people have a resume that matches every job requirement?
However, curiosity got the better of me today. And I decided to investigate this unknown matter.
So, let's get started :)
What is ETL?
Today, large enterprises in particular store their data in multiple sources. If each unit within the company wants to prepare its own analyses and reports, the data in these sources has to be brought together and converted into a single format. This process is called ETL.
The name is made up of the initials of Extract, Transform and Load. Data is extracted (Extract) from various sources, passed through certain processing steps (Transform), and then loaded (Load) into the target data warehouse.
Extract is the process of taking the company's data from all the environments in which it is stored. These sources are mostly databases and cloud environments, and data can also be read from Excel, CSV or text files.
Transform is the process of converting data from various sources into a single format that is suitable for reporting or analysis.
Load is the process of loading the converted data into the target database or data warehouse for analysis.
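To make the three steps concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the two sources (CSV_SOURCE and API_SOURCE), the sales table and the function names are all made up, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export from one department.
CSV_SOURCE = "customer,amount,currency\nAlice,10.50,usd\nBob,20.00,USD\n"

# Hypothetical source 2: records already read from another system.
API_SOURCE = [{"name": "carol", "total": "7.25", "cur": "Usd"}]


def extract():
    """Extract: read the raw records from every source as they are."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    return csv_rows, API_SOURCE


def transform(csv_rows, api_rows):
    """Transform: bring both sources into one common format."""
    unified = []
    for row in csv_rows:
        unified.append((row["customer"].title(), float(row["amount"]), row["currency"].upper()))
    for row in api_rows:
        unified.append((row["name"].title(), float(row["total"]), row["cur"].upper()))
    return unified


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse
csv_rows, api_rows = extract()
load(warehouse, transform(csv_rows, api_rows))

etl_result = warehouse.execute(
    "SELECT customer, amount, currency FROM sales ORDER BY customer"
).fetchall()
print(etl_result)
```

Note that the data is already clean and in one format before it ever touches the target table. That ordering is what makes this ETL.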
There are many ETL tools on the market for these operations. According to research published by Gartner in July 2020, Informatica, IBM (InfoSphere DataStage), Talend, SAP (BODS) and Oracle (ODI) are among the most widely used in the sector.
ETL (Extract Transform Load) is roughly the transfer of data from the source system to the target system after certain processes.
We can do ETL in two different ways. In the first method, we transform the data with hand-written code in languages such as T-SQL or PL/SQL and load it into the target system. In the second method, we use an ETL tool (SSIS, Data Integrator, PowerCenter, etc.). Which of these two methods we use depends on our needs. For example, an ETL tool is more costly and more difficult to learn. But if we use an ETL tool, we get rid of the code confusion. When we look at something written in code, it is difficult to understand and not everyone reads it the same way. When we look at something built with the tool, every viewer sees and understands the same thing.
Before loading the data into the target system, we perform the transform operation. Its purpose is to bring the data in line with a certain standard. Transforming the data lets us get faster and cleaner answers from queries. Standardized, clean, quality data means more accurate information.
ETL types:
Truncate/Load: In this type, the data is first taken from the source system and loaded into the target system.
When there is a change in the source system, the table in the target system is truncated and the new data is completely reloaded. In other words, if the target holds 200 rows and the source now holds 250, the 200 rows in the target are deleted and all 250 rows from the source are written. Constantly deleting the data and reloading the updated version is not very good in terms of performance. But if we have very little data, this method may be preferred.
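The 200 and 250 row example can be sketched like this. It is a rough illustration, not a production pattern: the customers table and the truncate_and_load function are names I made up, and since SQLite has no TRUNCATE statement, a DELETE without a WHERE clause plays that role here.

```python
import sqlite3


def truncate_and_load(target, source_rows):
    """Empty the target table, then reload the full source snapshot."""
    with target:  # one transaction: the delete and the reload succeed or fail together
        # SQLite has no TRUNCATE; DELETE without WHERE removes every row.
        target.execute("DELETE FROM customers")
        target.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", source_rows)


target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# First run: the source holds 200 rows.
truncate_and_load(target, [(i, f"customer-{i}") for i in range(200)])
count_first = target.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

# Later run: the source has grown to 250 rows. All 200 old rows are deleted
# and all 250 are written again, even though only 50 of them are new.
truncate_and_load(target, [(i, f"customer-{i}") for i in range(250)])
count_second = target.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

print(count_first, count_second)
```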
Incremental Update: The time of the last run is recorded, and only changes that occur after that run are added. It is a widely used method.
For example, if 100 rows were inserted in the source after the last run, only those 100 rows are added to the target table.
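Here is a small sketch of that idea, assuming the source table has an updated_at column that can serve as the "last run" marker (often called a watermark). The orders table, the column name and the incremental_load function are my own illustration.

```python
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")
target.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")


def incremental_load(source, target, last_run):
    """Copy only rows changed after last_run; return (new watermark, rows moved)."""
    changed = source.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_run,),
    ).fetchall()
    with target:
        # INSERT OR REPLACE also covers rows that were updated, not only new ones.
        target.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?)", changed)
    # Advance the watermark to the newest change seen; keep it if nothing changed.
    new_watermark = changed[-1][1] if changed else last_run
    return new_watermark, len(changed)


# Initial state: 200 rows, all picked up by the first run (empty watermark).
source.executemany(
    "INSERT INTO orders VALUES (?, '2024-01-01T00:00:00')", [(i,) for i in range(200)]
)
watermark, moved_first = incremental_load(source, target, "")

# 100 new rows arrive after the first run; the second run moves only those.
source.executemany(
    "INSERT INTO orders VALUES (?, '2024-01-02T00:00:00')", [(i,) for i in range(200, 300)]
)
watermark, moved_second = incremental_load(source, target, watermark)
total_rows = target.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(moved_first, moved_second, total_rows)
```

The timestamps are ISO-formatted strings, so comparing them as text gives the same order as comparing them as dates.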
Slowly Changing Dimension: It is used for data that rarely changes. Marital status, ID number and similar information change infrequently or do not change at all. SCD is very useful for such data.
There are several types of SCD. We decide which type to use by looking at our requirements.
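As one example, the variant usually called Type 2 keeps the full history: instead of overwriting the old value (which is what Type 1 does), it closes the current row and adds a new one. The sketch below is only my illustration of that idea; the customer_dim table, its columns and the apply_scd2 function are assumptions, not something a specific tool prescribes.

```python
import sqlite3

dim = sqlite3.connect(":memory:")
dim.execute(
    """CREATE TABLE customer_dim (
           customer_id INTEGER,
           marital_status TEXT,
           valid_from TEXT,
           valid_to TEXT,
           is_current INTEGER
       )"""
)


def apply_scd2(conn, customer_id, marital_status, change_date):
    """Type 2: close the current row and add a new version, keeping history."""
    current = conn.execute(
        "SELECT marital_status FROM customer_dim WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None and current[0] == marital_status:
        return  # nothing changed, so no new version is needed
    with conn:
        # Close the row that was current until now (no effect for a new customer).
        conn.execute(
            "UPDATE customer_dim SET valid_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (change_date, customer_id),
        )
        # Add the new version; valid_to stays NULL while the row is current.
        conn.execute(
            "INSERT INTO customer_dim VALUES (?, ?, ?, NULL, 1)",
            (customer_id, marital_status, change_date),
        )


apply_scd2(dim, 1, "single", "2020-01-01")
apply_scd2(dim, 1, "single", "2021-06-01")   # no change: ignored
apply_scd2(dim, 1, "married", "2023-05-20")  # change: new version added

history = dim.execute(
    "SELECT marital_status, valid_from, valid_to, is_current "
    "FROM customer_dim ORDER BY valid_from"
).fetchall()
print(history)
```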
What is ELT?
It contains the same steps as ETL, but the order in which they are executed is different. Unlike ETL, where the data is transformed on an intermediate staging server before being loaded into the target system, ELT loads the raw data directly into the target system and transforms it there. Therefore, the loading step in ELT is faster than in ETL.
If we are working with very large data and have resource problems, it will be more efficient to use ELT.
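For comparison, here is a minimal ELT sketch. The raw records are loaded untouched, and the cleaning happens afterwards with SQL that runs inside the target. As before, everything here is my own illustration: RAW_ROWS, the raw_sales and sales tables are made-up names, and an in-memory SQLite database stands in for the target system.

```python
import sqlite3

# Hypothetical raw records, loaded exactly as they arrive from the sources.
RAW_ROWS = [
    ("alice", "10.50", "usd"),
    ("BOB", "20.00", "USD"),
    ("carol", "not-a-number", "usd"),  # bad record: still loaded, filtered later
]

warehouse = sqlite3.connect(":memory:")  # stands in for the target system

# Extract + Load: no cleaning yet, everything is stored as text.
warehouse.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT, currency TEXT)")
warehouse.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", RAW_ROWS)

# Transform: runs inside the target, using its own SQL engine.
warehouse.execute(
    """
    CREATE TABLE sales AS
    SELECT lower(customer) AS customer,
           CAST(amount AS REAL) AS amount,
           upper(currency) AS currency
    FROM raw_sales
    WHERE amount GLOB '[0-9]*.[0-9]*'
    """
)
warehouse.commit()

raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
clean_rows = warehouse.execute(
    "SELECT customer, amount, currency FROM sales ORDER BY customer"
).fetchall()
print(raw_count, clean_rows)
```

The raw table keeps all three records, including the bad one, while the transformed table only contains the rows that passed the cleaning step.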
Difference Between ETL and ELT:
In the ETL process, only the data required for analysis and reporting is loaded into the target store and other redundant data is left in the sources. In the ELT system, on the other hand, all data is loaded into the target store without conversion.
Most traditional ETL tools require on-premises installation, which can be costly. ELT, on the other hand, is mostly cloud-based and does not need to be physically installed.
Frequent maintenance is required on an ETL pipeline that runs on physical servers. Because ELT is cloud-based or serverless, it requires little or no maintenance.
While both systems can handle semi-structured or unstructured data in some way, ELT is generally the preferred approach for unstructured data, because it loads the raw data as-is so that any data structure can be transformed and analyzed later.
ETL vs ELT in short:
That is what my research turned up. I hope it will be useful to everyone who reads it.
I am proud of myself today, because I finally found an answer to a question that has been swirling in my mind for the last week. Not knowing is nothing to be ashamed of; not learning and not researching is.
If there are any mistakes in the article, please let me know in the comments so that I can correct them. I gathered this information from blogs and Medium.