Streaming Data Solutions: Flink versus Spark
Dr. Shannon Block, CFE
Board Director • 3-time CEO • President • Chief Digital Officer • Chief Strategy Officer • COO • CBDO • Doctorate in Computer Science • M.S. Physics • B.S. Applied Mathematics • B.S. Physics
While real-time stream processing has been around for a while, businesses now need to process ever-larger volumes of streaming data quickly. Streaming data is everywhere: Twitter feeds, sensors, stock ticker prices, and weather reports. Because it arrives continuously, it poses challenges that batch data does not.
Flink was initially written in Java and Scala and exposes many Application Programming Interfaces (APIs), including the DataStream API. Flink grew out of a research project at the Technical University of Berlin and became an Apache incubator project in 2014.
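To give a flavor of the DataStream API, here is a minimal word-count sketch in Java; it is illustrative only, with the socket host and port as placeholders and a flink-streaming-java dependency assumed:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        // Entry point for any Flink streaming job
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source: lines of text arriving on a local socket
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    // Emit (word, 1) for every word in the line
                    for (String word : line.split("\\s+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                }
            })
            .keyBy(t -> t.f0)  // group the stream by word
            .sum(1)            // maintain a running count per word
            .print();

        env.execute("Streaming Word Count");
    }
}
```

The pipeline never "finishes"; results are updated continuously as new lines arrive, which is the defining trait of a streaming-first engine.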
Spark Streaming, similar in purpose but different in design, is one of the most heavily used libraries in Apache Spark. Spark developers create streaming applications using the DataFrame and Dataset APIs, which are available in languages such as Java, Scala, Python, and R. The product is essentially an extension of the core Spark API.
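For comparison, here is a minimal Structured Streaming sketch in Java (the app name, host, and port are placeholders), which treats the same kind of socket input as an unbounded DataFrame:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredStreamingExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("StructuredStreamingExample")
            .master("local[*]")  // local mode, for illustration only
            .getOrCreate();

        // Unbounded DataFrame: each row is one line read from the socket
        Dataset<Row> lines = spark.readStream()
            .format("socket")
            .option("host", "localhost")
            .option("port", 9999)
            .load();

        // Count occurrences of each distinct line, updated as data arrives
        Dataset<Row> counts = lines.groupBy("value").count();

        StreamingQuery query = counts.writeStream()
            .outputMode("complete")
            .format("console")
            .start();

        query.awaitTermination();
    }
}
```

Note how the streaming query reuses the familiar DataFrame operations (groupBy, count), underscoring that Spark Streaming extends the core batch API rather than replacing it.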
Similarities
Both Flink and Spark are big data systems that are fault-tolerant and built to scale to large volumes of data. Both are in-memory processing engines: while each can write data to permanent storage, the goal is to keep working data in memory for current use. Both products enable programmers to use MapReduce-style functions and apply machine learning algorithms to streaming data; that is, both handle machine learning well, processing large training and testing datasets across a distributed architecture. Both technologies can also work with Kafka (the streaming platform originally developed at LinkedIn), as well as Storm topologies.
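As a sketch of the Kafka integration on the Spark side (the broker address and topic name are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath), a streaming DataFrame can subscribe to a topic like this:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaSourceExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("KafkaSourceExample")
            .getOrCreate();

        // Subscribe to a hypothetical Kafka topic; key and value arrive as bytes
        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "sensor-events")
            .load();

        // Decode the raw bytes to strings before further processing
        Dataset<Row> decoded = events.selectExpr(
            "CAST(key AS STRING)", "CAST(value AS STRING)");

        decoded.writeStream()
            .format("console")
            .start()
            .awaitTermination();
    }
}
```

Flink offers an analogous Kafka connector on its DataStream API, so in both ecosystems Kafka commonly serves as the durable ingestion layer in front of the processing engine.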
Differences
Flink was built as a streaming product from the start, whereas Spark added streaming on top of an existing batch engine. Spark was initially built around static data, while Flink treats batch jobs as a special case of streaming over bounded data. In Spark, stream data was originally divided into micro-batches that repeat in a continuous loop, meaning that for each batch the file needs to be opened, processed, and then closed. In 2018, however, Spark 2.3 introduced an experimental continuous processing mode that began to move away from the micro-batch approach. In contrast, Flink has long processed streams record by record, using checkpoints to break streaming data into consistent, finite sets, which can be an advantage in the speed of running algorithms.
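To make the contrast concrete, here is a sketch of how Spark 2.3's experimental continuous mode is requested via a trigger on the query (the broker address and topic are placeholders, and only map-like operations such as select and filter are supported in this mode):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class ContinuousModeExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("ContinuousModeExample")
            .getOrCreate();

        // Hypothetical Kafka topic feeding the continuous query
        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "sensor-events")
            .load();

        events.selectExpr("CAST(value AS STRING)")
            .writeStream()
            .format("console")
            // Continuous processing: records flow through long-running tasks,
            // with a checkpoint every second instead of per-micro-batch scheduling
            .trigger(Trigger.Continuous("1 second"))
            .start()
            .awaitTermination();
    }
}
```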
Flink's Performance
Flink can be tuned for optimal performance. Both code logic and configuration choices matter here. For example, whether operators work in event time or processing time has a direct effect on performance.
Flink distinguishes "processing time," generated at each machine in a cluster, from time assigned as an event enters the system. The time stamped at the entry-point machine is known as "ingestion time," while "event time" is the timestamp created at the source when the event actually occurs. Several scholars recommend using event time because it is fixed per event, which means operations can generate deterministic results regardless of throughput. Processing time, on the other hand, is the time observed by the machine, so operations based on processing time are not deterministic. In practice, while events are thought of as real-time, there is an implicit assumption that the clocks at event sources are synchronized, which is rarely the case. This undermines the assumption that event time is monotonically increasing: configuring an allowed lateness solves the dropped-events problem, but a large lateness value can still have a significant effect on performance. Without setting the lateness, events can be dropped due to out-of-order timestamps.
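The sketch below, assuming Flink 1.12 or later and hypothetical (sensorId, value, timestampMillis) tuples, shows the event-time knobs discussed above: a bounded-out-of-orderness watermark and an allowed-lateness setting on the window. The five-second and thirty-second bounds are illustrative, not recommendations.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical readings: (sensorId, value, event timestamp in millis)
        env.fromElements(
                Tuple3.of("sensor-1", 20.0, 1_000L),
                Tuple3.of("sensor-1", 21.5, 3_000L),
                Tuple3.of("sensor-2", 19.0, 2_000L))
            // Use the embedded event timestamp, tolerating 5 s of out-of-orderness
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple3<String, Double, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, ts) -> event.f2))
            .keyBy(event -> event.f0)
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            // Keep windows open for late events; larger values cost state and latency
            .allowedLateness(Time.seconds(30))
            .sum(1)  // total reading value per sensor per 10-second window
            .print();

        env.execute("Event Time Example");
    }
}
```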
Regardless of which approach is chosen, the key to efficient processing is making sure the logic can handle events from the same event-time window being split across smaller processing-time windows. Researchers have also shown that some performance efficiency can be gained by not breaking up complex events, but the tradeoff is that operators must iterate through the dimensions of each event, and the event objects are larger.
Spark's Performance
In Spark, commonly identified bottlenecks include network and disk I/O. CPU can also be a bottleneck but is less common; optimizing CPU use is estimated to improve job completion time by only 1-2%. One challenge in managing Spark performance is that tasks can create bottlenecks on a variety of resources at different times, and concurrent tasks on a machine may compete for resources. Memory pressure is also a common issue, since Spark's traditional architecture is memory-centric. These performance setbacks are often caused by high concurrency, inefficient queries, and incorrect configurations. They can be mitigated with an understanding of both Spark and the data, recognizing that Spark's default configuration may not be optimal for a given workload.
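As an illustration of moving beyond the defaults (the values shown are placeholders to be tuned per workload, not recommendations), a few commonly adjusted settings can be passed when building the session:

```java
import org.apache.spark.sql.SparkSession;

public class TunedSessionExample {
    public static void main(String[] args) {
        // Example knobs only; the right values depend on the data and cluster
        SparkSession spark = SparkSession.builder()
            .appName("TunedSessionExample")
            // Match shuffle parallelism to data volume instead of the default 200
            .config("spark.sql.shuffle.partitions", "64")
            // Size executor memory to reduce spilling and GC pressure
            .config("spark.executor.memory", "4g")
            // Limit concurrent tasks per executor competing for resources
            .config("spark.executor.cores", "4")
            // Compress shuffle data to trade CPU for less disk and network I/O
            .config("spark.shuffle.compress", "true")
            .getOrCreate();

        System.out.println("spark.sql.shuffle.partitions = "
            + spark.conf().get("spark.sql.shuffle.partitions"));

        spark.stop();
    }
}
```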
Final Thoughts
The value of solutions like Flink and Spark lies in allowing businesses to make important decisions based on what is happening right now. No single framework solves every problem, so choosing one becomes a question of best fit. Understanding the system and its resources helps in addressing performance bottlenecks. There are many stream processing applications, and it is essential to pick a framework that best meets the business's needs, as not all products are the same. Flink and Spark are two of the most popular open-source stream processing frameworks. Depending on the application, parameters need to be set correctly to meet performance goals, and it is essential to understand the tradeoffs involved to get the best performance relative to business needs.
#Spark #Flink #Performance #StreamingData #BigData
About the Author
Shannon Block is an entrepreneur, mother and proud member of the global community. Her educational background includes a B.S. in Physics and B.S. in Applied Mathematics from George Washington University and an M.S. in Physics from Tufts University, and she is currently completing her Doctorate in Computer Science. She has been the CEO of both for-profit and non-profit organizations. Currently, as Executive Director of Skillful Colorado, Shannon and her team are working to bring a future of skills to the future of work. With more than a decade of leadership experience, Shannon is a pragmatic and collaborative leader, adept at bringing people together to solve complex problems. She approaches issues holistically, helps her team think strategically about solutions and fosters a strong network of partners with a shared interest in strengthening workforce and economic development across the United States. Prior to Skillful, Shannon served as CEO of the Denver Zoo, Rocky Mountain Cancer Centers, and World Forward Foundation. She is deeply engaged in the Colorado community and has served on multiple boards including the International Women's Forum, the Regional Executive Committee of the Young Presidents' Organization, Children's Hospital Quality and Safety Board, Women's Forum of Colorado, and the Colorado-based Presbyterian/St. Luke's Community Advisory Council. Follow her on Twitter @ShannonBlock or connect with her on LinkedIn.
Visit www.ShannonBlock.org for more on technology tools and trends.