Apache Flink, From a Developer's Point of View
Abhishek Choudhary
Data Infrastructure Engineering in RWE/RWD | Healthtech DhanvantriAI
What is Apache Flink?
Apache Flink is an open-source platform for distributed stream and batch data processing.
Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
For more details, follow the documentation and website; I am not going to explain that here.
A few features of Apache Flink:
Getting started with Apache Flink is very easy: just untar the release and run the process. There is not much hassle, and even setting it up in a cluster is easy to do. Flink is mainly developed in Java but supports Scala well, so a developer can use either Java or Scala. Python support is basic and lacks a few fundamental features compared to Apache Spark, so if you are a complete Python user, you need to wait a while.
Apache Flink has two main features, i.e. DataSet processing and data streaming, but streaming is not part of the latest 0.9.1 release; you need to use a developer build to test streaming. Building the developer build for 0.1.0-SNAPSHOT is extremely easy: just download the code from GitHub and simply run mvn clean package. Once you are done with that, you are ready to test streaming as well.
Apache Flink streaming actually gave me good throughput and low latency with very minimal configuration, so I accept this point. I tested continuous streaming for around 1.5 hours and processed the data to build an analytics cloud. Although processing the input stream didn't involve much work, it was blazing fast.
Apache Flink's fault tolerance was good; at least the processing engine didn't crash when I intentionally hung my Kafka log processing, though I didn't test this at the cluster level or under extreme conditions.
Flink gives both batch and continuous stream processing over a single runtime, so I didn't need to set up any new configuration.
Apache Flink's memory management is impressive, as it has its own memory manager. The Flink documentation says: "Applications scale to data sizes beyond main memory and experience less garbage collection overhead." Flink has always had its own way of processing data in memory. Instead of putting lots of objects on the heap, Flink serializes objects into a fixed number of pre-allocated memory segments. Its DBMS-style sort and join algorithms operate as much as possible on this binary data to keep the de/serialization overhead at a minimum. If more data needs to be processed than can be kept in memory, Flink's operators partially spill data to disk.
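To make the idea concrete, here is a minimal plain-Java sketch of the concept (this is not Flink's actual MemorySegment API; the class and method names are mine): records are serialized once into a single pre-allocated buffer, and sorting compares the binary data in place instead of shuffling heap objects around.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch of the idea behind Flink's managed memory (illustrative, not
// Flink's API): serialize fixed-size records into one pre-allocated
// "segment" and sort by comparing binary data in place, instead of
// allocating one heap object per record.
public class SegmentSketch {
    static final int RECORD_SIZE = 4; // each record is one int key

    public static int[] sortInSegment(int[] keys) {
        // one pre-allocated memory segment holding all records
        ByteBuffer segment = ByteBuffer.allocate(keys.length * RECORD_SIZE);
        for (int k : keys) segment.putInt(k);          // serialize once
        Integer[] offsets = new Integer[keys.length];  // sort offsets, not objects
        for (int i = 0; i < offsets.length; i++) offsets[i] = i * RECORD_SIZE;
        Arrays.sort(offsets, (a, b) ->
                Integer.compare(segment.getInt(a), segment.getInt(b)));
        int[] sorted = new int[keys.length];
        for (int i = 0; i < sorted.length; i++) sorted[i] = segment.getInt(offsets[i]);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortInSegment(new int[]{42, 7, 19})));
    }
}
```

Because the records live in a fixed-size, pre-allocated buffer, the garbage collector never sees millions of short-lived record objects, which is the overhead Flink's design avoids.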
More details are in Flink's write-ups on its basic and new off-heap memory management.
Apache Flink also has dedicated support for iterative computations, which are mainly required in machine learning.
Apache Flink already ships with quite a lot of streaming examples covering a broad range of use cases.
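Flink exposes this through iteration operators on its data sets; the control flow they encapsulate is roughly the following (a plain-Java sketch of the pattern, not Flink's API): a step function is applied repeatedly to a working value until a convergence criterion or an iteration limit is hit, which is the shape most ML training loops take.

```java
import java.util.function.DoubleUnaryOperator;

// Plain-Java sketch of the control flow behind bulk iterations
// (illustrative, not Flink's API): repeatedly apply a step function
// until convergence or an iteration limit.
public class BulkIterationSketch {
    public static double iterate(double initial, DoubleUnaryOperator step,
                                 int maxIterations, double epsilon) {
        double current = initial;
        for (int i = 0; i < maxIterations; i++) {
            double next = step.applyAsDouble(current);
            if (Math.abs(next - current) < epsilon) return next; // converged
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Newton's method for sqrt(2) as the step function
        double r = iterate(1.0, x -> 0.5 * (x + 2.0 / x), 100, 1e-12);
        System.out.println(r);
    }
}
```

The point of having the engine know about this loop (rather than the driver program resubmitting a job per pass) is that state can stay cached between iterations, which is why native iteration support matters for ML workloads.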
Running a Kafka streaming example on Apache Flink:
Flink already provides a few connectors to other systems such as Kafka, RabbitMQ, Flume, and Twitter.
The Flink UI gives very useful details:
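A connector is essentially the source (or sink) end of a job. Sketched in plain Java rather than Flink's DataStream API (all names below are illustrative, not Flink's), such a pipeline has a source → transform → sink shape, with a connector like the Kafka one playing the role of the source:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java sketch of the source -> transform -> sink shape of a
// connector-based streaming job (illustrative names, not Flink's API).
public class PipelineSketch<I, O> {
    private final Iterable<I> source;        // e.g. messages from a Kafka topic
    private final Function<I, O> transform;  // the user's processing logic
    private final Consumer<O> sink;          // e.g. a downstream store

    public PipelineSketch(Iterable<I> source, Function<I, O> transform, Consumer<O> sink) {
        this.source = source;
        this.transform = transform;
        this.sink = sink;
    }

    public void run() {
        for (I record : source) sink.accept(transform.apply(record));
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        new PipelineSketch<>(List.of("a", "b"), String::toUpperCase, out::add).run();
        System.out.println(out);
    }
}
```

With the built-in connectors, swapping Kafka for RabbitMQ or Flume only changes the source end; the transformation logic stays the same.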
Flink links:
https://cwiki.apache.org/confluence/display/FLINK/Flink+Roadmap
https://cwiki.apache.org/confluence/display/FLINK/Apache+Flink+Home
Example of Kafka Running in Flink
Conclusion:
I don't think it is a replacement for Apache Spark, but Apache Flink streaming is great, and it is true streaming, so if we need an actual streaming engine it will be a great solution. It is not yet that mature (I believe none of the engines in the streaming space are :-) ), but the next release looks more promising. I was also able to use both Apache Spark and Flink in the same application.