Effortless Data Fusion: Apache SeaTunnel Delivers Lightning-Fast Integration!

Apache SeaTunnel, the latest project to achieve top-level status within the Apache Software Foundation (ASF), addresses common challenges in data integration. With its ability to efficiently ingest and synchronize vast amounts of data from diverse sources, Apache SeaTunnel significantly reduces the cost associated with data transfer.

The current big data ecosystem comprises a wide range of engines: Hadoop, Hive, Kudu, Kafka, and HDFS on the big data side; MongoDB, Redis, ClickHouse, and Doris in the broader database ecosystem; and cloud-based solutions such as AWS S3, Redshift, BigQuery, and Snowflake. Beyond these sit many other data systems, including MySQL, PostgreSQL, IoTDB, TDengine, Salesforce, and Workday. Seamlessly connecting such diverse data sources requires a comprehensive tool.

Apache SeaTunnel acts as a crucial bridge that enables accurate, real-time, and simplified integration of these complex data sources. It serves as the central "highway" for data flow in the expansive big data landscape, facilitating smooth and efficient data integration processes.

[Figure: Low-code development platform market revenue worldwide from 2018 to 2025 (in billion U.S. dollars)]

The rise in the popularity of codeless/low-code data integration frameworks is a direct response to the changing landscape of the big data world. To grasp the factors that sparked this demand, it is essential to explore the evolutionary journey of the data landscape.

From ETL to EtLT: A Paradigm Shift Redefining Data Integration Architecture

For a better understanding, let’s first revisit the evolution of the data warehouse architecture from ETL to EtLT.

Looking back, the data warehouse architecture relied predominantly on ETL (Extract, Transform, Load) from roughly 1990 to 2015. During this period, structured data sources such as MySQL, SQL Server, Oracle, ERP, and CRM systems were the primary inputs. In this OLTP era, Oracle and DB2 took charge of data warehouse computing, serving both as query engines and as stores of historical data. However, the computing power of these databases remained relatively limited, making it difficult to meet the demands of data warehouse computing across all scenarios.


During this progression, dedicated ETL software such as Informatica, Talend, and Kettle came into existence, and many companies continue to rely on these tools today. However, with the advent of MPP (Massively Parallel Processing) technology and the widespread adoption of distributed architectures such as Hadoop and Hive, a significant shift occurred. People realized they could use cost-effective commodity hardware instead of expensive Oracle and DB2 systems. This shift marked the entrance into the era of ELT (Extract, Load, Transform).


The defining feature of this era is that data from different sources, structured and unstructured alike (including logs), could be loaded into the data warehouse with no processing at all, or after only simple standardization such as deduplication, parsing URL parameters into individual components, masking or obfuscating sensitive data, and simple counting, and was then computed layer by layer by engines such as MapReduce and Spark. Because data sources were not yet very complicated, the pipeline from sources to the warehouse was handled mainly by hand-written MapReduce or Spark programs.


As data sources grew more complex, new technologies kept emerging; SaaS services and cloud data storage appeared, complicating the source side even further. At the same time, the target side changed dramatically: with the emergence of data lakes and real-time data warehouse technology, the targets of data integration also became more complicated. If data engineers continued to hand-write MapReduce programs as before, integration efficiency would be very low, so professional teams and professional tools were urgently needed to handle this ELT process.

Thus, the field of data integration was born. Apache SeaTunnel is the platform for next-generation data integration.

In the ELT scenario there is a refinement called EtLT. The lowercase t differs from the uppercase T at the end: it stands for data standardization, such as field screening and structured conversion of unstructured data, and does not involve core business logic such as joins or aggregations. The people involved also split along these two stages. The data EL process (the Et-L part) is handled mainly by data engineers, who do not need deep business knowledge; they only need to understand the relationships between different data sources and the characteristics and differences of the data. Once the data is loaded into the warehouse, business-savvy people such as data scientists, data analysts, and SQL developers perform the calculations on top of the raw data.

This is the evolution from ETL to EtLT. James Densmore described the EtLT architecture in his 2020 book “Data Pipelines Pocket Reference” and predicted that it would be the architectural trend from 2020 onward.

Challenges in the field of data integration & common solutions

From this, we extend to some common challenges and solutions in the field of data integration.

During the previous technology exploration, I identified core challenges in the field of data integration:

  1. Abundance of Data Sources: The Apache SeaTunnel community has identified nearly 500 data sources, with the number continuously increasing. Ensuring compatibility with evolving data source versions and quickly adapting to new technologies pose significant challenges in data integration.
  2. Complex Synchronization Scenarios: Data synchronization encompasses various scenarios such as offline, real-time, full, incremental synchronization, CDC (change data capture), and multi-table synchronization. Supporting CDC requires reading and analyzing database change logs, handling different log data formats, transaction processing, and synchronizing at various granularities (whole databases, sub-databases, and sub-tables).
  3. Monitoring and Quantifiable Indicators: Lack of proper monitoring during the synchronization process leads to information opacity, making it uncertain to track the amount of synchronized data. Establishing effective monitoring mechanisms with quantifiable indicators is crucial.
  4. High Throughput and Low Latency with Limited Resources: Achieving high throughput and low latency while optimizing resource utilization is essential to reduce costs in data integration processes.
  5. Minimizing Impact on Data Sources: Real-time synchronization and frequent reading of binlogs can burden data sources and affect their stability. Excessive JDBC connections can also lead to instability. Minimizing the impact on data sources involves reducing connection occupation and limiting synchronization speeds.
  6. Data Consistency, Loss Prevention, and Duplication Avoidance: Systems with stringent data consistency requirements necessitate strategies to ensure data integrity, prevent data loss, and eliminate duplication.

To address these needs, a comprehensive data integration product is required—one that is user-friendly, easily expandable, manageable, and maintainable. Extensive scheme research has been conducted to develop such a solution.

Different data integration products primarily cater to the following scenarios:

  1. Full and incremental offline synchronization: Sqoop was widely used for this scenario, but it supported a limited set of data sources and relied on the MapReduce architecture, resulting in slow performance; it has since been retired from Apache and belongs to the previous generation of data integration projects. DataX is currently a popular tool for offline synchronization. While it offers useful features, it lacks support for real-time synchronization and multi-level parallel processing, and because it has no distributed snapshot algorithm it cannot guarantee data consistency or support resuming from a breakpoint.
  2. Real-time Synchronization: Flink and Spark Streaming are commonly used for real-time scenarios. However, since these products are positioned primarily as computing engines, their core capabilities revolve around complex data computation, and they do not provide the breadth of data source support that dedicated synchronization products offer. Furthermore, their fault-tolerance design means that in multi-table synchronization a failure in one table forces the entire job to be stopped and re-executed. Using Flink or Spark may also involve writing code, which adds to the learning curve.
  3. CDC Scenario: Flink CDC is frequently employed for CDC scenarios. However, it inherits the underlying issues of Flink, does not support table structure changes, and each source can read only one table, so the number of JDBC connections required for CDC synchronization equals the number of tables.

To address these challenges, users often need to employ a combination of the above components in a complex architecture, requiring a comprehensive big data platform and incurring significant learning costs. Additionally, managing different codebases can be challenging.

Apache SeaTunnel, the next-generation data integration platform, offers a solution to these pain points. It provides a unified platform that supports all the mentioned scenarios, simplifying the overall architecture and reducing the learning curve. SeaTunnel addresses the limitations of existing tools and aims to deliver a comprehensive and user-friendly data integration experience.

Next-generation data integration platform Apache SeaTunnel


SeaTunnel is a very easy-to-use, ultra-high-performance, distributed data integration platform that supports real-time synchronization of massive data. It can synchronize tens of billions of records stably and efficiently every day and has been used in production by nearly 100 companies.

Six Design Goals

SeaTunnel is primarily focused on data integration and synchronization, aiming to address common challenges in the field. Apache SeaTunnel's design goals can be summarized into six key aspects.

Firstly, it emphasizes simplicity and ease of use. Users can initiate synchronization jobs with minimal configuration and simple commands, ensuring a smooth and straightforward experience.
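As a rough illustration of what that minimal configuration looks like (a sketch only; connector names, option keys, and the launcher script may differ between SeaTunnel versions), a toy job that generates fake rows and prints them to the console can be defined and submitted roughly like this:

  # submit to the local Zeta engine (script name and flags may vary by release):
  #   ./bin/seatunnel.sh --config ./config/fake_to_console.conf -e local

  env {
    job.mode    = "BATCH"      # "STREAMING" for continuous synchronization
    parallelism = 1
  }

  source {
    FakeSource {               # built-in test source that generates rows
      row.num = 16
      schema = {
        fields {
          name = "string"
          age  = "int"
        }
      }
    }
  }

  sink {
    Console {}                 # print the rows to stdout
  }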

Secondly, SeaTunnel prioritizes monitoring the synchronization process. It provides quantifiable indicators, enabling users to track the status of ongoing synchronization operations. Transparency is crucial, and SeaTunnel avoids being a black box.

The third goal is to offer extensive support for various data sources. The community has identified over 500 data sources, with more than 100 already supported by SeaTunnel. Support for data sources continues to expand rapidly, with around 40 to 50 new sources added each quarter.

The fourth objective is to cover diverse synchronization scenarios. SeaTunnel aims to support real-time and offline synchronization, incremental and full data transfers, Change Data Capture (CDC), multi-table synchronization, and more. It eliminates the need for users to rely on multiple tools to achieve their integration goals.

Fifthly, SeaTunnel addresses the critical issue of data consistency. For systems with stringent consistency requirements, it ensures that data is neither lost nor duplicated, guaranteeing the integrity and accuracy of synchronized data.

Finally, performance optimization is another consideration. SeaTunnel strives to minimize resource utilization and reduce impact on data sources while delivering the necessary functionality. Balancing performance and efficiency is crucial to provide a smooth and efficient data integration experience.

Project development history

Begun in 2017 and originally called Waterdrop, the project was renamed in October 2021 and entered the ASF Incubator in December of the same year. Created by a small group in China, SeaTunnel has since grown to more than 180 contributors around the world. The most recent version supports more than 70 data sources, and the number is surging.


Users all over the world

The Apache SeaTunnel community currently has nearly 5,000 members and more than 200 contributors, and pull requests are submitted and merged relatively quickly. Its users include Chinese Internet companies such as Bilibili and Tencent Cloud, while overseas adopters include Shopee and Bharti Telecom, India’s second-largest telecom operator.

Core Design and Architecture


Overall structure

The Apache SeaTunnel architecture is divided into three main modules. The first is the data sources, which include databases from China and abroad; the second is the target end. Sources and targets can be considered together: collectively they are called data sources, and they are mainly databases, SaaS services, and components of data lakes and warehouses. Between source and target sits a set of APIs dedicated to data synchronization. This API layer is decoupled from the engine and can, in theory, be extended to many engines; the engines currently supported are Apache SeaTunnel Zeta, Flink, and Spark.

SeaTunnel work flowchart


The runtime process of SeaTunnel is shown in the figure above.

The process begins with the user configuring job information and selecting the execution engine for job submission.

The Source Connector plays a crucial role in parallel data retrieval and forwarding it either to the downstream Transform or directly to the Sink. The Sink, in turn, handles writing the data to the desired destination. It's worth noting that users have the flexibility to develop and extend their own custom Source, Transform, and Sink connectors.

SeaTunnel operates as an EL(T) data integration platform, where the Transform component is primarily employed for performing simple data transformations. These transformations may include tasks like converting column data to uppercase or lowercase, renaming columns, or splitting a column into multiple columns.
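For illustration, such a transform stage might be configured roughly as follows. This is a hedged sketch: the transform names (FieldMapper, Split) and option keys follow the general shape of SeaTunnel's bundled transforms but should be verified against the documentation for your version.

  transform {
    # rename columns and keep only the ones listed (illustrative option names)
    FieldMapper {
      source_table_name = "source_users"
      result_table_name = "renamed_users"
      field_mapper = {
        id        = id
        user_name = name        # rename user_name -> name
      }
    }

    # split one column into several columns (illustrative option names)
    Split {
      source_table_name = "renamed_users"
      result_table_name = "split_users"
      split_field   = "name"
      separator     = " "
      output_fields = [first_name, last_name]
    }
  }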

By default, SeaTunnel utilizes the SeaTunnel Engine as its primary execution engine. However, if users choose to leverage the Flink or Spark engine, SeaTunnel packages the Connector into a Flink or Spark program and submits it for execution within the selected engine.

Connector API decoupled from the engine

This set of APIs is designed primarily to be decoupled from the engine and to cater specifically to data integration scenarios. It is divided into the Source API, Transform API (the lowercase t mentioned earlier), Sink API, and CDC API. Through the Translation API, these connectors can be executed on different engines.

In all engines, the connector API is built on the foundation of the checkpoint mechanism. The main objective is to integrate distributed snapshot algorithms across various engines and utilize the checkpoint capability of the underlying engine. This enables the implementation of features like two-phase commit, ensuring data consistency and reliability.
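For users this mostly surfaces as a checkpoint interval on the job. A minimal sketch, assuming the env-level option name used in recent SeaTunnel releases (verify against your version):

  env {
    job.mode = "STREAMING"
    # take a state snapshot every 10 seconds (value in milliseconds); these
    # snapshots back the two-phase commit that keeps sources and sinks consistent
    checkpoint.interval = 10000
  }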

Source Connector


The Source connector is implemented based on a set of APIs. Using the JDBC connector as an example, it offers support for both offline and real-time operation modes. With this connector, you only need to specify the job mode as BATCH or STREAMING in the environment configuration. This allows for easy switching between offline and real-time synchronization modes.

The Source connector provides several key capabilities, including parallel reading, dynamic shard (split) discovery, field projection, and an exactly-once semantic guarantee. At its core it uses the checkpoint capability provided by the engine: the Source API allows the underlying engine to invoke checkpoint-related calls, so that synchronized data is preserved without loss or duplication.
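A hedged sketch of such a JDBC source definition follows; the partitioning options shown (partition_column, partition_num) mirror the documented JDBC connector options but may differ by version.

  env {
    job.mode    = "BATCH"      # switch to "STREAMING" for continuous reading
    parallelism = 4
  }

  source {
    Jdbc {
      url      = "jdbc:mysql://localhost:3306/shop"
      driver   = "com.mysql.cj.jdbc.Driver"
      user     = "reader"
      password = "***"
      query    = "SELECT id, name, amount FROM orders"

      # split the result set on a numeric column for parallel reading
      partition_column = "id"
      partition_num    = 4
    }
  }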

Sink Connector

The main features supported by the Sink Connector include the following (a configuration sketch follows the list):

  • SaveMode support, with flexible choices for how existing data on the target side is handled
  • Automatic table creation, with customizable table-creation templates, so multi-table synchronization needs no manual DDL
  • Exactly-once semantics, so data is neither lost nor duplicated; checkpointing adapts to all three engines (Zeta, Spark, and Flink)
  • CDC support for processing database log events
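Below is a hedged configuration sketch of a JDBC sink using these features. The SaveMode option names and values (schema_save_mode, data_save_mode) follow recent SeaTunnel documentation but should be checked for your version and connector.

  sink {
    Jdbc {
      url      = "jdbc:mysql://localhost:3306/warehouse"
      driver   = "com.mysql.cj.jdbc.Driver"
      user     = "writer"
      password = "***"

      # let the connector generate the INSERT statements for the target table
      generate_sink_sql = true
      database = "warehouse"
      table    = "orders_copy"

      # SaveMode-style behavior (illustrative names):
      schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"   # auto-create the table
      data_save_mode   = "APPEND_DATA"                    # keep existing rows
    }
  }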

Transform Connector

Key features of the Transform Connector include:

  • Support for copying a column to a new column
  • Support for renaming fields, changing field order, modifying types, and deleting columns
  • Support for replacing content in data
  • Support for splitting a column into multiple columns
CDC Connector Design



The CDC connector mainly provides the following functions (a configuration sketch follows the list):

  • Lock-free, parallel snapshotting of historical data
  • Dynamic addition of new tables to a running job
  • Reading sharded databases/tables and tables with different structures
  • Schema evolution
  • Checkpointing, ensuring data is neither lost nor duplicated
  • Offline (batch) CDC synchronization
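A hedged sketch of a CDC source reading several tables through one connector instance; the option keys (base-url, table-names, startup.mode) follow the documented MySQL-CDC connector but may differ by version.

  source {
    MySQL-CDC {
      base-url = "jdbc:mysql://localhost:3306/shop"
      username = "cdc_user"
      password = "***"

      # one source, many tables: a single set of connections serves them all
      database-names = ["shop"]
      table-names    = ["shop.orders", "shop.order_items", "shop.customers"]

      # take a lock-free parallel snapshot first, then tail the binlog
      startup.mode = "initial"
    }
  }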

Checkpoint function design


Finally, it should be emphasized that all Apache SeaTunnel connectors are designed around checkpoint logic. A job starts from the Split enumerator, proceeds to the Source reader, which sends the data it reads to the Sink writer, and the results are finally committed by the AggregateCommitter.

Next-generation data integration engine Apache SeaTunnel Zeta

Apache SeaTunnel Zeta, the next-generation data integration engine, is positioned as an easy-to-use engine dedicated to full-scenario data integration, and on that basis it aims to be faster, more stable, and more resource-efficient.

Apache SeaTunnel Zeta cluster management

The cluster management approach of Apache SeaTunnel Zeta is distinctive due to the following features:

  • Independence from third-party components and big data platforms, eliminating dependencies and providing flexibility.
  • No dedicated master node required; a masterless setup reduces single points of failure and enhances fault tolerance.
  • Write-Ahead Logging (WAL) mechanism enables full cluster restarts, ensuring the recovery of previously running jobs and maintaining job state continuity.
  • Built-in support for distributed snapshot algorithms, ensuring data consistency during the synchronization process. This guarantees that data is accurately captured and replicated across the cluster.
These characteristics make Apache SeaTunnel Zeta a robust and self-sufficient cluster management solution for efficient and reliable data integration and synchronization.

Below are some of the distinctive features of the Apache SeaTunnel Zeta engine and the core problems they solve.


Apache SeaTunnel Zeta Pipeline-Based Failover


  • For batch and streaming jobs alike, resources are allocated at the granularity of a Pipeline, and a Pipeline can start executing as soon as its own resources are allocated, without waiting for all tasks to obtain resources. This addresses a pain point of engines such as Flink in data synchronization: when a job synchronizes multiple sources and sinks, a problem at any one of them marks the entire job as failed and stops it.
  • Fault tolerance (checkpointing and state rollback) is also implemented at the granularity of a Pipeline. When a problem occurs in a target table, only its upstream and downstream tasks are affected; other tasks continue to run normally.

Once the problem is resolved, a single Pipeline can be restored manually.

Apache SeaTunnel Zeta Dynamic Thread Sharing


Dynamic thread sharing mainly mitigates a problem in CDC multi-table synchronization: when a large number of small tables are synchronized with limited resources, the large number of threads degrades performance. Dynamic threads are matched to tasks according to running time and data volume, saving resources. In testing, a job with 500 small tables running in a single JVM improved performance by more than 2x after dynamic thread sharing was enabled.

Apache SeaTunnel Zeta Connection Pool Sharing


Connection pool sharing mainly addresses scenarios that would otherwise occupy a large number of JDBC connections, such as a single very large table processed by many parallel tasks, offline synchronization of multiple tables, or CDC synchronization of multiple tables. It allows tasks of the same job on the same TaskExecutionService node to share JDBC connections, thereby reducing JDBC usage.

Apache SeaTunnel Zeta multi-table synchronization


The last feature is multi-table synchronization, which is mainly used after the CDC Source reads the data: a table-partition transform distributes the data to different Sinks, and each Sink processes the data of one table. In this process, connection pool sharing reduces the number of JDBC connections and dynamic thread sharing reduces thread usage, thereby improving performance.

Performance comparison

SeaTunnel Zeta syncs data around 30–50% faster than open-source data integration frameworks like DataX, and memory size has no significant impact on its performance. In MySQL-to-S3 scenarios, Apache SeaTunnel's performance is more than 30 times that of Airbyte and 2 to 5 times that of AWS DMS and Glue. It has also been observed that Apache SeaTunnel can complete synchronization with a small memory footprint, and that is still on a single node. Because Zeta supports distributed execution, Apache SeaTunnel is expected to perform even better at larger data volumes with multi-machine parallelism.

Roadmap going ahead

SeaTunnel is set to advance in multiple areas, including:

  1. Performance and stability enhancements for the Zeta engine.
  2. Fulfillment of planned features such as synchronization of DDL (data definition language) changes, error data handling, flow rate control, and multi-table synchronization.
  3. Transition of SeaTunnel-Web from alpha to release stage, enabling users to directly define and control synchronization processes through the interface.
  4. Strengthened collaboration with artificial general intelligence components, expanding integration with vector databases and plugins for large models.
  5. Improved connectivity with upstream and downstream ecosystems, including Apache DolphinScheduler and Apache Airflow.
  6. Focus on constructing SaaS connectors for platforms such as ChatGPT, Salesforce, and Workday.
  7. Continued evolution of the Apache SeaTunnel Engine, becoming a leading independent big data synchronization engine.
  8. Support for Kubernetes deployment.

The Apache SeaTunnel Engine continues to evolve and aims to be a first-class, independently developed big data synchronization engine.
