Effortless Data Fusion: Apache SeaTunnel Delivers Lightning-Fast Integration!
Apache SeaTunnel, the latest project to achieve top-level status within the Apache Software Foundation (ASF), addresses common challenges in data integration. With its ability to efficiently ingest and synchronize vast amounts of data from diverse sources, Apache SeaTunnel significantly reduces the cost associated with data transfer.
The current big data ecosystem comprises a wide range of data engines: Hadoop, Hive, Kudu, Kafka, and HDFS in the big data landscape; MongoDB, Redis, ClickHouse, and Doris in the broader database ecosystem; and cloud-based solutions such as AWS S3, Redshift, BigQuery, and Snowflake. In addition, there are many other data systems such as MySQL, PostgreSQL, IoTDB, TDengine, Salesforce, and Workday. Seamlessly connecting these diverse data sources requires a comprehensive tool.
Apache SeaTunnel acts as a crucial bridge that enables accurate, real-time, and simplified integration of these complex data sources. It serves as the central "highway" for data flow in the expansive big data landscape, facilitating smooth and efficient data integration processes.
The rise in the popularity of codeless/low-code data integration frameworks is a direct response to the changing landscape of the big data world. To grasp the factors that sparked this demand, it is essential to explore the evolutionary journey of the data landscape.
From ETL to EtLT: A Paradigm Shift Redefining Data Integration Architecture
For a better understanding, let’s first revisit the evolution of the data warehouse architecture from ETL to EtLT.
Looking back, the data warehouse architecture relied predominantly on ETL (Extract, Transform, Load) from 1990 to 2015. During this period, structured data sources such as MySQL, SQL Server, Oracle, ERP, and CRM were the primary inputs. In the OLTP era, Oracle and DB2 took charge of data warehouse computing, serving both as query engines and as stores for historical data. However, the computing power of these databases remained relatively limited, making it difficult to meet the demands of data warehouse computing across various scenarios.
During this progression, dedicated ETL software such as Informatica, Talend, and Kettle came into existence, and many companies continue to rely on these tools today. However, with the advent of MPP (Massively Parallel Processing) and the widespread adoption of distributed technologies such as Hadoop and Hive, a significant shift occurred. People realized they could use cost-effective commodity hardware instead of relying on expensive Oracle and DB2 installations. This shift marked the entrance into the era of ELT (Extract, Load, Transform).
The core feature of this era is that data from different sources, both structured and unstructured (logs and so on), could be loaded into the data warehouse without any processing, or after simple standardization such as deduplication, parsing URL parameters into individual components, or masking sensitive data. The loaded data was then computed layer by layer by engines such as MapReduce and Spark. Because data sources were not yet very complicated, the path from data sources to the data warehouse was handled mainly by writing MapReduce or Spark programs.
As new technologies continued to emerge, data sources grew ever more complex: SaaS services and cloud data storage appeared, adding further variety. At the same time, the target side changed dramatically. With the emergence of data lakes and real-time data warehouse technology, the destinations of data integration also became more complicated. If data engineers kept hand-writing MapReduce programs as before, integration efficiency would be very low, so professional teams and professional tools were urgently needed to handle this ELT process.
Thus, the field of data integration was born. Apache SeaTunnel is the platform for next-generation data integration.
In the ELT scenario, there is a concept called EtLT. The lowercase t is different from the uppercase T that follows it: it stands for data standardization, such as field filtering or the structured conversion of unstructured data, and does not involve core business logic such as joins or aggregations. The personnel are also split across these two stages. The EtL part of the process is handled mainly by data engineers, who do not need to understand the business deeply; they only need to understand the relationships between different data sources and the characteristics and differences of the data. Once the data is loaded into the warehouse, business-savvy people such as data scientists, data analysts, and SQL developers perform calculations on top of the raw data.
This is the evolution from ETL to the EtLT architecture. In 2020, James Densmore proposed the EtLT architecture in the book “Data Pipelines Pocket Reference”, predicting that this would be the direction in which architectures evolve from 2020 onward.
Challenges in the field of data integration & common solutions
From this, we extend to some common challenges and solutions in the field of data integration.
Through earlier technology exploration, we identified the core challenges in the field of data integration:
To address these needs, a comprehensive data integration product is required: one that is user-friendly, easily expandable, manageable, and maintainable. Extensive research into candidate solutions was carried out to develop such a product.
Different data integration products primarily cater to the following scenarios:
To address these challenges, users often need to employ a combination of the above components in a complex architecture, requiring a comprehensive big data platform and incurring significant learning costs. Additionally, managing different codebases can be challenging.
Apache SeaTunnel, the next-generation data integration platform, offers a solution to these pain points. It provides a unified platform that supports all the mentioned scenarios, simplifying the overall architecture and reducing the learning curve. SeaTunnel addresses the limitations of existing tools and aims to deliver a comprehensive and user-friendly data integration experience.
Next-generation data integration platform Apache SeaTunnel
SeaTunnel is a very easy-to-use, ultra-high-performance, distributed data integration platform that supports real-time synchronization of massive data. It can stably and efficiently synchronize tens of billions of records every day and is used in production by nearly 100 companies.
Six Design Goals
SeaTunnel is primarily focused on data integration and synchronization, aiming to address common challenges in the field. Apache SeaTunnel's design goals can be summarized into six key aspects.
Firstly, it emphasizes simplicity and ease of use. Users can initiate synchronization jobs with minimal configuration and simple commands, ensuring a smooth and straightforward experience.
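To make this concrete, here is a minimal sketch of such a job definition in SeaTunnel's HOCON-style configuration, using the built-in FakeSource and Console connectors; exact option names can vary between versions, so treat this as an illustration rather than a reference.

    # Minimal batch job: generate a few mock rows and print them to the console.
    # With the default Zeta engine it can be submitted in local mode, for example:
    #   ./bin/seatunnel.sh --config ./config/fake_to_console.conf -m local
    env {
      job.mode    = "BATCH"   # run once and exit
      parallelism = 1         # one parallel task is enough for a demo
    }

    source {
      FakeSource {
        result_table_name = "fake"  # register the output for later stages
        row.num = 16                # number of mock rows to generate
        schema = {
          fields {
            name = "string"
            age  = "int"
          }
        }
      }
    }

    sink {
      Console {
        source_table_name = "fake"  # consume the registered table
      }
    }

Swapping FakeSource and Console for real connectors is largely a matter of changing these two blocks, which is what keeps the entry cost low.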
Secondly, SeaTunnel prioritizes monitoring the synchronization process. It provides quantifiable indicators, enabling users to track the status of ongoing synchronization operations. Transparency is crucial, and SeaTunnel avoids being a black box.
The third goal is to offer extensive support for various data sources. The community has identified over 500 data sources, with more than 100 already supported by SeaTunnel. Support for data sources continues to expand rapidly, with around 40 to 50 new sources added each quarter.
The fourth objective is to cover diverse synchronization scenarios. SeaTunnel aims to support real-time and offline synchronization, incremental and full data transfers, Change Data Capture (CDC), multi-table synchronization, and more. It eliminates the need for users to rely on multiple tools to achieve their integration goals.
Fifthly, SeaTunnel addresses the critical issue of data consistency. It ensures that systems with high consistency requirements do not lose data and avoid data duplication, guaranteeing the integrity and accuracy of synchronized data.
Finally, performance optimization is another consideration. SeaTunnel strives to minimize resource utilization and reduce impact on data sources while delivering the necessary functionality. Balancing performance and efficiency is crucial to provide a smooth and efficient data integration experience.
Project development history
Begun in 2017 and originally called Waterdrop, the project was renamed in October 2021 and entered the ASF incubator in December of the same year. Created by a small group in China, SeaTunnel has since grown to more than 180 contributors around the world. The most recent version supports more than 70 data sources, and the number is surging.
Users all over the world
The Apache SeaTunnel community currently has nearly 5,000 members and more than 200 contributors, and pull requests are submitted and merged relatively quickly. Users include Chinese Internet companies such as Bilibili and Tencent Cloud, while overseas adopters include Shopee and Bharti Telecom, India’s second-largest telecom operator.
Core Design and Architecture
Overall structure
The Apache SeaTunnel architecture is divided into three main modules. The first is the data source side, which includes databases from both inside and outside China; the second is the target side. The source and target sides can be combined: collectively they are referred to as data sources, and they consist mainly of databases, SaaS services, and components such as data lakes and warehouses. Between the source and the target sits a set of APIs dedicated to data synchronization. These APIs are decoupled from the engine and can, in theory, be extended to many engines; the engines currently supported are Apache SeaTunnel Zeta, Flink, and Spark.
SeaTunnel work flowchart
The runtime process of SeaTunnel is shown in the figure above.
The process begins with the user configuring job information and selecting the execution engine for job submission.
The Source Connector plays a crucial role in parallel data retrieval and forwarding it either to the downstream Transform or directly to the Sink. The Sink, in turn, handles writing the data to the desired destination. It's worth noting that users have the flexibility to develop and extend their own custom Source, Transform, and Sink connectors.
SeaTunnel operates as an EL(T) data integration platform, where the Transform component is primarily employed for performing simple data transformations. These transformations may include tasks like converting column data to uppercase or lowercase, renaming columns, or splitting a column into multiple columns.
By default, SeaTunnel utilizes the SeaTunnel Engine as its primary execution engine. However, if users choose to leverage the Flink or Spark engine, SeaTunnel packages the Connector into a Flink or Spark program and submits it for execution within the selected engine.
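As a sketch of how these pieces fit together, the hypothetical job below chains a JDBC source, a small transform, and a console sink, and the comments show how the same file could be handed to different engines; the SQL transform and the launcher script names are assumptions based on recent SeaTunnel releases and may differ in your version.

    # Source -> transform -> sink. The same configuration can be submitted to
    # different engines by using the corresponding launcher script, for example:
    #   Zeta:  ./bin/seatunnel.sh --config mysql_to_console.conf -m local
    #   Flink: ./bin/start-seatunnel-flink-15-connector-v2.sh --config mysql_to_console.conf
    #   Spark: ./bin/start-seatunnel-spark-3-connector-v2.sh --config mysql_to_console.conf
    env {
      job.mode = "BATCH"
    }

    source {
      Jdbc {
        result_table_name = "users"
        url      = "jdbc:mysql://localhost:3306/shop"   # hypothetical database
        driver   = "com.mysql.cj.jdbc.Driver"
        user     = "reader"
        password = "secret"
        query    = "SELECT id, name, email FROM users"
      }
    }

    transform {
      Sql {
        source_table_name = "users"
        result_table_name = "users_clean"
        # lowercase-t work only: light standardization, no business joins or aggregations
        query = "SELECT id, UPPER(name) AS name, email FROM users"
      }
    }

    sink {
      Console {
        source_table_name = "users_clean"
      }
    }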
Connector API decoupled from the engine
This set of APIs is designed primarily to be decoupled from the engine and to cater specifically to data integration scenarios. It is divided into the Source API, the Transform API (the lowercase t mentioned earlier), the Sink API, and the CDC API. Through the Translation API, these connectors can be executed on different engines.
Across all engines, the connector API is built on the foundation of the checkpoint mechanism. The main objective is to integrate with the distributed snapshot algorithms of the various engines and utilize the checkpoint capability of the underlying engine. This enables features such as two-phase commit, ensuring data consistency and reliability.
Source Connector
The Source connector is implemented based on a set of APIs. Using the JDBC connector as an example, it offers support for both offline and real-time operation modes. With this connector, you only need to specify the job mode as BATCH or STREAMING in the environment configuration. This allows for easy switching between offline and real-time synchronization modes.
The Source connector provides several key capabilities, including parallel reading, dynamic shard discovery, field projection, and exactly-once semantic guarantees. At its core, it relies on the checkpoint capability provided by the engine: the Source API lets the underlying engine trigger checkpoints, ensuring that data is neither lost nor duplicated during synchronization.
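As an illustrative sketch, a parallel batch read over JDBC might be configured as below; the partition options are taken from the JDBC source documentation of recent releases, and the database, column, and bounds are placeholders.

    env {
      # switch to "STREAMING" for real-time synchronization with the same connector
      job.mode    = "BATCH"
      parallelism = 4
    }

    source {
      Jdbc {
        url      = "jdbc:mysql://localhost:3306/shop"   # hypothetical database
        driver   = "com.mysql.cj.jdbc.Driver"
        user     = "reader"
        password = "secret"
        query    = "SELECT * FROM orders"
        # split the table into shards that are read in parallel
        partition_column      = "order_id"
        partition_num         = 4
        partition_lower_bound = 1
        partition_upper_bound = 1000000
      }
    }

    sink {
      Console {}
    }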
Sink Connector
The main features supported by Sink Connector include:
Transform Connector
Key features of the Transform Connector include:
CDC Connector
The CDC Connector mainly provides the following functions:
Checkpoint function design
Finally, it should be emphasized that all Apache SeaTunnel connectors are designed around checkpoint logic. A job starts from the Split enumerator, proceeds to the Source reader, which sends the data it reads to the Sink Writer, and is finally committed by the AggregateCommitter.
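Checkpointing itself is driven by the engine, but the snapshot interval can be tuned in the job's env block. The snippet below uses the checkpoint.interval option (in milliseconds) as documented for the Zeta engine; the exact name and default may differ by version.

    env {
      job.mode    = "STREAMING"
      parallelism = 2
      # take a distributed snapshot every 10 seconds so the job can resume from
      # the last successful checkpoint without losing or duplicating data
      checkpoint.interval = 10000
    }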
Next-generation data integration engine Apache SeaTunnel Zeta
Apache SeaTunnel Zeta, the next-generation data integration engine, is positioned as an easy-to-use engine dedicated to full-scenario data integration; on that foundation it aims to be faster, more stable, and more resource-efficient.
Apache SeaTunnel Zeta cluster management
The cluster management approach of Apache SeaTunnel Zeta is distinctive due to the following features:
Below are some of the distinctive features of the Apache SeaTunnel Zeta engine and the core problems they solve.
Apache SeaTunnel Zeta Pipeline-Based Failover
Failover is handled at the granularity of a single pipeline: once the underlying problem is resolved, a single pipeline can be restored manually.
Apache SeaTunnel Zeta Dynamic Thread Sharing
Dynamic thread sharing addresses a problem in CDC multi-table synchronization, especially when a large number of small tables must be synchronized with limited resources: too many threads are created and performance degrades. Dynamic threads are matched to tasks based on running time and data volume, saving resources. In tests of a job with 500 small tables in a single JVM, enabling dynamic threads improved performance by more than 2x.
Apache SeaTunnel Zeta Connection Pool Sharing
Connection pool sharing addresses scenarios that would otherwise consume a large number of JDBC connections, such as a single very large table processed by many parallel tasks, offline synchronization of multiple tables, or CDC synchronization of multiple tables. It allows tasks of the same Job on the same TaskExecutionService node to share JDBC connections, thereby reducing JDBC usage.
Apache SeaTunnel Zeta multi-table synchronization
The last feature is multi-table synchronization. After the CDC Source reads the data, a table-partition transform distributes it to different Sinks, with each Sink handling the data of one table. In this process, connection sharing reduces the number of JDBC connections used and dynamic thread sharing reduces thread usage, thereby improving performance.
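From the user's point of view, such a multi-table CDC job can be expressed in a single configuration. The sketch below assumes the MySQL-CDC connector and option names from recent releases, with placeholder database, tables, and credentials; it prints the change stream to the console instead of a real target.

    env {
      job.mode = "STREAMING"   # CDC jobs run continuously
    }

    source {
      MySQL-CDC {
        result_table_name = "cdc_events"
        base-url    = "jdbc:mysql://localhost:3306/shop"
        username    = "cdc_user"
        password    = "secret"
        # capture changes from several tables in one job; Zeta shares threads and
        # JDBC connections across them as described above
        table-names = ["shop.orders", "shop.order_items", "shop.customers"]
      }
    }

    sink {
      Console {
        source_table_name = "cdc_events"
      }
    }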
Performance comparison
SeaTunnel Zeta synchronizes data around 30–50% faster than open-source data integration frameworks like DataX, and memory has no significant impact on its performance. In MySQL-to-S3 scenarios, Apache SeaTunnel's performance is more than 30 times that of Airbyte and 2 to 5 times that of AWS DMS and Glue. Apache SeaTunnel has been observed to complete synchronization with a small memory footprint, and these results were obtained on a single node. Because Zeta supports distributed execution, Apache SeaTunnel is expected to perform even better at larger data volumes with multi-machine parallelism.
Roadmap going ahead
SeaTunnel is set to advance in multiple areas, including:
Apache SeaTunnel Engine keeps evolving, and it will be the first great, independently developed big data synchronization engine you have ever known!