Big Data: Spark vs. Flink
Santhosh Parampottupadam
Research Scientist | German Cancer Research Center | Generative AI | PPML | Federated Learning
What do they have in common? Flink and Spark are both general-purpose data processing platforms and top-level projects of the Apache Software Foundation (ASF). They have a wide field of application and are usable for dozens of big data scenarios, thanks to extensions for SQL queries (Spark: Spark SQL, Flink: MRQL), graph processing (Spark: GraphX, Flink: Spargel (base) and Gelly (library)), machine learning (Spark: MLlib, Flink: FlinkML) and stream processing (Spark Streaming, Flink Streaming). Both are capable of running in standalone mode, yet many users run them on top of Hadoop (YARN, HDFS). Both owe their strong performance to their in-memory nature.
However, the way they achieve this variety and the use cases they specialize in differ.
Differences:
In contrast to Flink, Spark (before version 1.5.x) is not capable of handling data sets larger than the available RAM.
Flink is optimized for cyclic and iterative processes by using iterative transformations on collections. This is achieved through an optimization of join algorithms, operator chaining, and reuse of partitioning and sorting. That said, Flink is also a strong tool for batch processing. Flink Streaming processes data streams as true streams, i.e., data elements are immediately "pipelined" through a streaming program as soon as they arrive. This makes it possible to perform flexible window operations on streams.
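To make this concrete, here is a minimal word-count sketch against the Flink 1.x Scala DataStream API; the socket source on localhost:9999 and the 5-second window are illustrative assumptions, not details from the article:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object FlinkWindowedWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Elements are pipelined through the operators as soon as they arrive;
    // the window below groups them per key into 5-second tumbling windows.
    val counts = env
      .socketTextStream("localhost", 9999) // hypothetical source
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(_._1)
      .timeWindow(Time.seconds(5)) // flexible window on a true stream
      .sum(1)

    counts.print()
    env.execute("Flink windowed word count")
  }
}
```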
Spark, on the other hand, is based on resilient distributed datasets (RDDs). This (mostly) in-memory data structure underpins Spark's functional programming paradigm. It is capable of large batch calculations by pinning memory. Spark Streaming wraps data streams into mini-batches, i.e., it collects all data that arrives within a certain period of time and runs a regular batch program on the collected data. While the batch program is running, the data for the next mini-batch is collected.
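For contrast, the same word count written against Spark's DStream API collects the stream into 5-second mini-batches and runs each one as a regular batch job over RDDs; again, the socket source and batch interval are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkMiniBatchWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MiniBatchWordCount")

    // Each 5-second mini-batch becomes an RDD that is processed
    // as a regular batch job while the next batch is being collected.
    val ssc = new StreamingContext(conf, Seconds(5))

    val counts = ssc
      .socketTextStream("localhost", 9999) // hypothetical source
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```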
Will Flink replace Hadoop?
No, it will not. Hadoop consists of different parts:
- HDFS - Hadoop Distributed File System
- YARN - Yet Another Resource Negotiator (or resource manager)
- MapReduce - the batch processing framework of Hadoop
HDFS and YARN are still necessary as integral parts of big data clusters. These two form the foundation for other distributed technologies such as distributed query engines and distributed databases. The main use case for MapReduce is batch processing of data sets larger than the cluster's RAM, while Flink is designed for iterative processing. So, in general, the two can co-exist.