Effortless Data Fusion: Apache SeaTunnel Delivers Lightning-Fast Integration!

Apache SeaTunnel, the latest project to achieve top-level status within the Apache Software Foundation (ASF), addresses common challenges in data integration. With its ability to efficiently ingest and synchronize vast amounts of data from diverse sources, Apache SeaTunnel significantly reduces the cost associated with data transfer.

The current big data ecosystem comprises a wide range of engines: Hadoop, Hive, Kudu, Kafka, and HDFS on the big data side; MongoDB, Redis, ClickHouse, and Doris in the broader database ecosystem; and cloud-based solutions such as AWS S3, Redshift, BigQuery, and Snowflake. Beyond these sit many other data systems, including MySQL, PostgreSQL, IoTDB, TDengine, Salesforce, and Workday. Seamlessly connecting such diverse data sources requires a comprehensive tool.

Apache SeaTunnel acts as a crucial bridge that enables accurate, real-time, and simplified integration of these complex data sources. It serves as the central "highway" for data flow in the expansive big data landscape, facilitating smooth and efficient data integration processes.

[Figure: Low-code development platform market revenue worldwide from 2018 to 2025 (in billion U.S. dollars)]

The rise in the popularity of codeless/low-code data integration frameworks is a direct response to the changing landscape of the big data world. To grasp the factors that sparked this demand, it is essential to explore the evolutionary journey of the data landscape.

From ETL to EtLT: A Paradigm Shift Redefining Data Integration Architecture

For a better understanding, let’s first revisit the evolution of the data warehouse architecture from ETL to EtLT.

Looking back, the data warehouse architecture relied predominantly on ETL (Extract, Transform, Load) from roughly 1990 to 2015. During this period, structured data sources such as MySQL, SQL Server, Oracle, ERP, and CRM systems were the primary inputs. In this OLTP era, Oracle and DB2 took charge of data warehouse computing, serving both as query engines and as stores of historical data. However, the computing power of these databases remained relatively limited, making it difficult to meet the demands of data warehouse computing across all scenarios.


During this progression, dedicated ETL software such as Informatica, Talend, and Kettle came into existence, and many companies continue to rely on these tools today. However, with the advent of MPP (Massively Parallel Processing) technology and the widespread adoption of distributed architectures such as Hadoop and Hive, a significant shift occurred. People realized they could use cost-effective commodity hardware instead of expensive Oracle and DB2 systems. This shift marked the entrance into the era of ELT (Extract, Load, Transform).


The defining feature of this era is that data from different sources, structured and unstructured alike (including logs), could be loaded into the data warehouse with no processing at all, or after only simple standardization such as deduplication, parsing URL parameters into individual components, masking or obfuscating sensitive data, and simple counting, and was then computed layer by layer by engines such as MapReduce and Spark. Because data sources were not yet very complicated, the pipeline from sources to the warehouse was handled mainly by hand-written MapReduce or Spark programs.


As data sources grew more complex, new technologies kept emerging; SaaS services and cloud data storage appeared, complicating the source side even further. At the same time, the target side changed dramatically: with the emergence of data lakes and real-time data warehouse technology, the targets of data integration also became more complicated. If data engineers continued to hand-write MapReduce programs as before, integration efficiency would be very low, so professional teams and professional tools were urgently needed to handle this ELT process.

Thus, the field of data integration was born. Apache SeaTunnel is the platform for next-generation data integration.

In the ELT scenario there is a refinement called EtLT. The lowercase t differs from the uppercase T at the end: it stands for data standardization, such as field screening and structured conversion of unstructured data, and does not involve core business logic such as joins or aggregations. The people involved also split along these two stages. The data EL process (the Et-L part) is handled mainly by data engineers, who do not need deep business knowledge; they only need to understand the relationships between different data sources and the characteristics and differences of the data. Once the data is loaded into the warehouse, business-savvy people such as data scientists, data analysts, and SQL developers perform the calculations on top of the raw data.

This is the evolution from ETL to EtLT. James Densmore described the EtLT architecture in his 2020 book “Data Pipelines Pocket Reference” and predicted that it would be the architectural trend from 2020 onward.

Challenges in the field of data integration & common solutions

From this, we extend to some common challenges and solutions in the field of data integration.

During the previous technology exploration, I identified core challenges in the field of data integration:

  1. Abundance of Data Sources: The Apache SeaTunnel community has identified nearly 500 data sources, with the number continuously increasing. Ensuring compatibility with evolving data source versions and quickly adapting to new technologies pose significant challenges in data integration.
  2. Complex Synchronization Scenarios: Data synchronization encompasses various scenarios such as offline, real-time, full, incremental synchronization, CDC (change data capture), and multi-table synchronization. Supporting CDC requires reading and analyzing database change logs, handling different log data formats, transaction processing, and synchronizing at various granularities (whole databases, sub-databases, and sub-tables).
  3. Monitoring and Quantifiable Indicators: Lack of proper monitoring during the synchronization process leads to information opacity, making it uncertain to track the amount of synchronized data. Establishing effective monitoring mechanisms with quantifiable indicators is crucial.
  4. High Throughput and Low Latency with Limited Resources: Achieving high throughput and low latency while optimizing resource utilization is essential to reduce costs in data integration processes.
  5. Minimizing Impact on Data Sources: Real-time synchronization and frequent reading of binlogs can burden data sources and affect their stability. Excessive JDBC connections can also lead to instability. Minimizing the impact on data sources involves reducing connection occupation and limiting synchronization speeds.
  6. Data Consistency, Loss Prevention, and Duplication Avoidance: Systems with stringent data consistency requirements necessitate strategies to ensure data integrity, prevent data loss, and eliminate duplication.

To address these needs, a comprehensive data integration product is required—one that is user-friendly, easily expandable, manageable, and maintainable. Extensive scheme research has been conducted to develop such a solution.

Different data integration products primarily cater to the following scenarios:

  1. Full and incremental offline synchronization: Sqoop was widely used for this scenario, but it supported a limited set of data sources and relied on the MapReduce architecture, resulting in slow performance; it has since been retired from Apache and belongs to the previous generation of data integration projects. DataX is currently a popular tool for offline synchronization. While it offers useful features, it lacks support for real-time synchronization and multi-level parallel processing, and because it has no distributed snapshot algorithm it cannot guarantee data consistency or support resuming from a breakpoint.
  2. Real-time Synchronization: Flink and Spark Streaming are commonly used for real-time scenarios. However, since these products are positioned primarily as computing engines, their core capabilities revolve around complex data computation, and they do not provide the breadth of data source support that dedicated synchronization products offer. Furthermore, their fault-tolerance design means that in multi-table synchronization a failure in one table forces the entire job to be stopped and re-executed. Using Flink or Spark may also involve writing code, which adds to the learning curve.
  3. CDC Scenario: Flink CDC is frequently employed for CDC scenarios. However, it inherits the underlying issues of Flink, does not support table structure changes, and each source can read only one table, so the number of JDBC connections required for CDC synchronization equals the number of tables.

To address these challenges, users often need to employ a combination of the above components in a complex architecture, requiring a comprehensive big data platform and incurring significant learning costs. Additionally, managing different codebases can be challenging.

Apache SeaTunnel, the next-generation data integration platform, offers a solution to these pain points. It provides a unified platform that supports all the mentioned scenarios, simplifying the overall architecture and reducing the learning curve. SeaTunnel addresses the limitations of existing tools and aims to deliver a comprehensive and user-friendly data integration experience.

Next-generation data integration platform Apache SeaTunnel


SeaTunnel is a very easy-to-use, ultra-high-performance, distributed data integration platform that supports real-time synchronization of massive data. It can synchronize tens of billions of records stably and efficiently every day and has been used in production by nearly 100 companies.

Six Design Goals

SeaTunnel is primarily focused on data integration and synchronization, aiming to address common challenges in the field. Apache SeaTunnel's design goals can be summarized into six key aspects.

Firstly, it emphasizes simplicity and ease of use. Users can initiate synchronization jobs with minimal configuration and simple commands, ensuring a smooth and straightforward experience.
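As a rough illustration of what that minimal configuration looks like (a sketch only; connector names, option keys, and the launcher script may differ between SeaTunnel versions), a toy job that generates fake rows and prints them to the console can be defined and submitted roughly like this:

  # submit to the local Zeta engine (script name and flags may vary by release):
  #   ./bin/seatunnel.sh --config ./config/fake_to_console.conf -e local

  env {
    job.mode    = "BATCH"      # "STREAMING" for continuous synchronization
    parallelism = 1
  }

  source {
    FakeSource {               # built-in test source that generates rows
      row.num = 16
      schema = {
        fields {
          name = "string"
          age  = "int"
        }
      }
    }
  }

  sink {
    Console {}                 # print the rows to stdout
  }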

Secondly, SeaTunnel prioritizes monitoring the synchronization process. It provides quantifiable indicators, enabling users to track the status of ongoing synchronization operations. Transparency is crucial, and SeaTunnel avoids being a black box.

The third goal is to offer extensive support for various data sources. The community has identified over 500 data sources, with more than 100 already supported by SeaTunnel. Support for data sources continues to expand rapidly, with around 40 to 50 new sources added each quarter.

The fourth objective is to cover diverse synchronization scenarios. SeaTunnel aims to support real-time and offline synchronization, incremental and full data transfers, Change Data Capture (CDC), multi-table synchronization, and more. It eliminates the need for users to rely on multiple tools to achieve their integration goals.

Fifthly, SeaTunnel addresses the critical issue of data consistency. For systems with stringent consistency requirements, it ensures that data is neither lost nor duplicated, guaranteeing the integrity and accuracy of synchronized data.

Finally, performance optimization is another consideration. SeaTunnel strives to minimize resource utilization and reduce impact on data sources while delivering the necessary functionality. Balancing performance and efficiency is crucial to provide a smooth and efficient data integration experience.

Project development history

Begun in 2017 and originally called Waterdrop, the project was renamed in October 2021 and entered the ASF Incubator in December of the same year. Created by a small group in China, SeaTunnel has since grown to more than 180 contributors around the world. The most recent version supports more than 70 data sources, and the number is surging.


Users all over the world

The Apache SeaTunnel community currently has nearly 5,000 members and more than 200 contributors, and pull requests are submitted and merged relatively quickly. Its users include Chinese Internet companies such as Bilibili and Tencent Cloud, while overseas adopters include Shopee and Bharti Telecom, India’s second-largest telecom operator.

Core Design and Architecture


Overall structure

The Apache SeaTunnel architecture is divided into three main modules. The first is the data sources, which include databases from China and abroad; the second is the target end. Sources and targets can be considered together: collectively they are called data sources, and they are mainly databases, SaaS services, and components of data lakes and warehouses. Between source and target sits a set of APIs dedicated to data synchronization. This API layer is decoupled from the engine and can, in theory, be extended to many engines; the engines currently supported are Apache SeaTunnel Zeta, Flink, and Spark.

SeaTunnel work flowchart


The runtime process of SeaTunnel is shown in the figure above.

The process begins with the user configuring job information and selecting the execution engine for job submission.

The Source Connector plays a crucial role in parallel data retrieval and forwarding it either to the downstream Transform or directly to the Sink. The Sink, in turn, handles writing the data to the desired destination. It's worth noting that users have the flexibility to develop and extend their own custom Source, Transform, and Sink connectors.

SeaTunnel operates as an EL(T) data integration platform, where the Transform component is primarily employed for performing simple data transformations. These transformations may include tasks like converting column data to uppercase or lowercase, renaming columns, or splitting a column into multiple columns.
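For illustration, such a transform stage might be configured roughly as follows. This is a hedged sketch: the transform names (FieldMapper, Split) and option keys follow the general shape of SeaTunnel's bundled transforms but should be verified against the documentation for your version.

  transform {
    # rename columns and keep only the ones listed (illustrative option names)
    FieldMapper {
      source_table_name = "source_users"
      result_table_name = "renamed_users"
      field_mapper = {
        id        = id
        user_name = name        # rename user_name -> name
      }
    }

    # split one column into several columns (illustrative option names)
    Split {
      source_table_name = "renamed_users"
      result_table_name = "split_users"
      split_field   = "name"
      separator     = " "
      output_fields = [first_name, last_name]
    }
  }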

By default, SeaTunnel utilizes the SeaTunnel Engine as its primary execution engine. However, if users choose to leverage the Flink or Spark engine, SeaTunnel packages the Connector into a Flink or Spark program and submits it for execution within the selected engine.

Connector API decoupled from the engine

This set of APIs is designed primarily to be decoupled from the engine and to cater specifically to data integration scenarios. It is divided into the Source API, Transform API (the lowercase t mentioned earlier), Sink API, and CDC API. Through the Translation API, these connectors can be executed on different engines.

In all engines, the connector API is built on the foundation of the checkpoint mechanism. The main objective is to integrate distributed snapshot algorithms across various engines and utilize the checkpoint capability of the underlying engine. This enables the implementation of features like two-phase commit, ensuring data consistency and reliability.
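For users this mostly surfaces as a checkpoint interval on the job. A minimal sketch, assuming the env-level option name used in recent SeaTunnel releases (verify against your version):

  env {
    job.mode = "STREAMING"
    # take a state snapshot every 10 seconds (value in milliseconds); these
    # snapshots back the two-phase commit that keeps sources and sinks consistent
    checkpoint.interval = 10000
  }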

Source Connector


The Source connector is implemented based on a set of APIs. Using the JDBC connector as an example, it offers support for both offline and real-time operation modes. With this connector, you only need to specify the job mode as BATCH or STREAMING in the environment configuration. This allows for easy switching between offline and real-time synchronization modes.

The Source connector provides several key capabilities, including parallel reading, dynamic shard (split) discovery, field projection, and an exactly-once semantic guarantee. At its core it uses the checkpoint capability provided by the engine: the Source API allows the underlying engine to invoke checkpoint-related calls, so that synchronized data is preserved without loss or duplication.
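A hedged sketch of such a JDBC source definition follows; the partitioning options shown (partition_column, partition_num) mirror the documented JDBC connector options but may differ by version.

  env {
    job.mode    = "BATCH"      # switch to "STREAMING" for continuous reading
    parallelism = 4
  }

  source {
    Jdbc {
      url      = "jdbc:mysql://localhost:3306/shop"
      driver   = "com.mysql.cj.jdbc.Driver"
      user     = "reader"
      password = "***"
      query    = "SELECT id, name, amount FROM orders"

      # split the result set on a numeric column for parallel reading
      partition_column = "id"
      partition_num    = 4
    }
  }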

Sink Connector

The main features supported by the Sink Connector include the following (a configuration sketch follows the list):

  • SaveMode support, with flexible choices for how existing data on the target side is handled
  • Automatic table creation, with customizable table-creation templates, so multi-table synchronization needs no manual DDL
  • Exactly-once semantics, so data is neither lost nor duplicated; checkpointing adapts to all three engines (Zeta, Spark, and Flink)
  • CDC support for processing database log events
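Below is a hedged configuration sketch of a JDBC sink using these features. The SaveMode option names and values (schema_save_mode, data_save_mode) follow recent SeaTunnel documentation but should be checked for your version and connector.

  sink {
    Jdbc {
      url      = "jdbc:mysql://localhost:3306/warehouse"
      driver   = "com.mysql.cj.jdbc.Driver"
      user     = "writer"
      password = "***"

      # let the connector generate the INSERT statements for the target table
      generate_sink_sql = true
      database = "warehouse"
      table    = "orders_copy"

      # SaveMode-style behavior (illustrative names):
      schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"   # auto-create the table
      data_save_mode   = "APPEND_DATA"                    # keep existing rows
    }
  }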

Transform Connector

Key features of the Transform Connector include:

  • Support for copying a column to a new column
  • Support for renaming fields, changing field order, modifying types, and deleting columns
  • Support for replacing content in data
  • Support for splitting a column into multiple columns
CDC Connector Design



The CDC connector mainly provides the following functions (a configuration sketch follows the list):

  • Lock-free, parallel snapshotting of historical data
  • Dynamic addition of new tables to a running job
  • Reading sharded databases/tables and tables with different structures
  • Schema evolution
  • Checkpointing, ensuring data is neither lost nor duplicated
  • Offline (batch) CDC synchronization
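A hedged sketch of a CDC source reading several tables through one connector instance; the option keys (base-url, table-names, startup.mode) follow the documented MySQL-CDC connector but may differ by version.

  source {
    MySQL-CDC {
      base-url = "jdbc:mysql://localhost:3306/shop"
      username = "cdc_user"
      password = "***"

      # one source, many tables: a single set of connections serves them all
      database-names = ["shop"]
      table-names    = ["shop.orders", "shop.order_items", "shop.customers"]

      # take a lock-free parallel snapshot first, then tail the binlog
      startup.mode = "initial"
    }
  }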

Checkpoint function design


Finally, it should be emphasized that all Apache SeaTunnel connectors are designed around checkpoint logic. A job starts from the Split enumerator, proceeds to the Source reader, which sends the data it reads to the Sink writer, and the results are finally committed by the AggregateCommitter.

Next-generation data integration engine Apache SeaTunnel Zeta

Apache SeaTunnel Zeta, the next-generation data integration engine, is positioned as an easy-to-use engine dedicated to full-scenario data integration, and on that basis it aims to be faster, more stable, and more resource-efficient.

Apache SeaTunnel Zeta cluster management

The cluster management approach of Apache SeaTunnel Zeta is distinctive due to the following features:

  • Independence from third-party components and big data platforms, eliminating dependencies and providing flexibility.
  • No dedicated master node required; a masterless setup reduces single points of failure and enhances fault tolerance.
  • Write-Ahead Logging (WAL) mechanism enables full cluster restarts, ensuring the recovery of previously running jobs and maintaining job state continuity.
  • Built-in support for distributed snapshot algorithms, ensuring data consistency during the synchronization process. This guarantees that data is accurately captured and replicated across the cluster.
These characteristics make Apache SeaTunnel Zeta a robust and self-sufficient cluster management solution for efficient and reliable data integration and synchronization.

Below are some of the distinctive features of the Apache SeaTunnel Zeta engine and the core problems they solve.


Apache SeaTunnel Zeta Pipeline-Based Failover


  • For batch and streaming jobs alike, resources are allocated at the granularity of a Pipeline, and a Pipeline can start executing as soon as its own resources are allocated, without waiting for all tasks to obtain resources. This addresses a pain point of engines such as Flink in data synchronization: when a job synchronizes multiple sources and sinks, a problem at any one of them marks the entire job as failed and stops it.
  • Fault tolerance (checkpointing and state rollback) is also implemented at the granularity of a Pipeline. When a problem occurs in a target table, only its upstream and downstream tasks are affected; other tasks continue to run normally.

Once the problem is resolved, a single Pipeline can be restored manually.

Apache SeaTunnel Zeta Dynamic Thread Sharing


Dynamic thread sharing mainly mitigates a problem in CDC multi-table synchronization: when a large number of small tables are synchronized with limited resources, the large number of threads degrades performance. Dynamic threads are matched to tasks according to running time and data volume, saving resources. In testing, a job with 500 small tables running in a single JVM improved performance by more than 2x after dynamic thread sharing was enabled.

Apache SeaTunnel Zeta Connection Pool Sharing


Connection pool sharing mainly addresses scenarios that would otherwise occupy a large number of JDBC connections, such as a single very large table processed by many parallel tasks, offline synchronization of multiple tables, or CDC synchronization of multiple tables. It allows tasks of the same job on the same TaskExecutionService node to share JDBC connections, thereby reducing JDBC usage.

Apache SeaTunnel Zeta multi-table synchronization


The last feature is multi-table synchronization, which is mainly used after the CDC Source reads the data: a table-partition transform distributes the data to different Sinks, and each Sink processes the data of one table. In this process, connection pool sharing reduces the number of JDBC connections and dynamic thread sharing reduces thread usage, thereby improving performance.

Performance comparison

SeaTunnel Zeta syncs data around 30–50% faster than open-source data integration frameworks like DataX, and memory size has no significant impact on its performance. In MySQL-to-S3 scenarios, Apache SeaTunnel's performance is more than 30 times that of Airbyte and 2 to 5 times that of AWS DMS and Glue. It has also been observed that Apache SeaTunnel can complete synchronization with a small memory footprint, and that is still on a single node. Because Zeta supports distributed execution, Apache SeaTunnel is expected to perform even better at larger data volumes with multi-machine parallelism.

Roadmap going ahead

SeaTunnel is set to advance in multiple areas, including:

  1. Performance and stability enhancements for the Zeta engine.
  2. Fulfillment of planned features such as synchronization of DDL (data definition language) changes, error data handling, flow rate control, and multi-table synchronization.
  3. Transition of SeaTunnel-Web from alpha to release stage, enabling users to directly define and control synchronization processes through the interface.
  4. Strengthened collaboration with artificial general intelligence components, expanding integration with vector databases and plugins for large models.
  5. Improved connectivity with upstream and downstream ecosystems, including Apache DolphinScheduler and Apache Airflow.
  6. Focus on constructing SaaS connectors for platforms such as ChatGPT, Salesforce, and Workday.
  7. Continued evolution of the Apache SeaTunnel Engine, becoming a leading independent big data synchronization engine.
  8. Support for Kubernetes deployment.

The Apache SeaTunnel Engine continues to evolve and aims to be a first-class, independently developed big data synchronization engine.
