Stream Processing with Apache Kafka, Samza, and Flink
Event organizer: Adem Efe
July 20, 2023 (online event)
About the event
Meetup.com link: https://www.meetup.com/stream-processing-meetup-linkedin/events/294272507/
---
Location: https://linkedin.zoom.us/j/93886764982
6:00 - 6:05: Welcome
6:05 - 6:40: Trust but verify: A deep dive into the correctness of Apache Kafka's replication protocol
Divij Vaidya, Amazon Web Services
In this talk, we will explore Kafka's replication protocol and how it ensures data consistency and fault tolerance. We will also model and reason about the correctness of the protocol using a formal specification language (TLA+). By the end of this talk, attendees will have a better understanding of how Kafka's replication protocol works and why it can be trusted to ensure data consistency.
Divij has a decade of experience building large-scale data storage, movement, and retrieval systems at Amazon. He has bootstrapped multiple products and teams across different geographies, with Amazon Neptune (graph database) being the milestone zero-to-one story of his career. His latest adventure takes him into the data streaming world, with the objective of enhancing Apache Kafka.
6:40 - 7:15: Declarative Reasoning with Timelines: The Next Step in Event Processing
Ben Chambers, Kaskada
At the heart of modern data processing lie events. Events describe the roughest, most complete picture available of what has happened in the world, and practically every form of data processing ultimately begins with events. While the power of event processing has increased since the emergence of streaming data processing, current systems are still difficult to use when working on problems that deal with time and order, such as predictive AI/ML. Handling these problems requires a new kind of query language: a way to declaratively reason about events over time. This talk introduces the concept of timelines, an intuitive abstraction for reasoning about temporal values. They support a broad range of useful operations which can be efficiently computed at scale. We will demonstrate the power and differentiation of timelines:
- How timelines allow declarative queries over events and time in a simple and intuitive manner.
- Why timelines are ideal for applications such as behavioral predictions, trend analysis, and forecasting, and how existing solutions such as streaming SQL fall short.
- How to execute timeline-based queries using the open-source Kaskada event-processing engine.
Ben is a technology innovator with an extensive background in data processing, machine learning, cloud computing, and software engineering. Ben currently leads the Kaskada open-source project, a modern event-processing system that transforms how event-based data is processed and analyzed. Previously, Ben co-founded Kaskada, a startup specializing in advanced data platforms for machine learning applications. Prior to Kaskada's acquisition by DataStax, Ben played a pivotal role in driving the company's technical vision and implementing what would become the Kaskada open-source project. Prior to founding Kaskada, Ben was a software engineer at Google, where he made significant contributions to the development of the Apache Beam programming model.
7:15 - 7:50: Decoupling Compute and Storage for Stream Processing Systems: Benefits, Limitations, and Insights
Yingjun Wu, RisingWave
Stream processing plays a pivotal role in contemporary data infrastructure, but creating an efficient, scalable stream processing system can be a daunting task, particularly in a cloud environment. Decoupling compute from storage has emerged as a popular architectural solution.
In this presentation, we will delve into the pros and cons of decoupled compute and storage architecture in stream processing systems. Although this approach enables infinite scalability, it may also give rise to data consistency issues and increased latency, especially when handling complex continuous queries that demand the management of sizable internal states. To address these challenges, we propose a tiered storage mechanism. This approach combines high-performance and cost-effective storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing. At the end of the talk, we will present experimental results that illustrate the balance between performance and cost-efficiency achieved by our proposed method, as implemented in RisingWave, a distributed SQL streaming database.
Yingjun is the founder of RisingWave Labs, the company developing RisingWave, a distributed SQL database for stream processing. Before running the company, Yingjun was a software engineer on the Redshift team at Amazon Web Services and a researcher in the database group at IBM Almaden Research Center. Yingjun received his PhD from the National University of Singapore and was a visiting PhD student at Carnegie Mellon University. He has been working in stream processing and database systems for over a decade.