Apache Flink, From a Developer's Point of View
Abhishek Choudhary
Data Infrastructure Engineering in RWE/RWD | Healthtech DhanvantriAI
What is Apache Flink?
Apache Flink is an open-source platform for distributed stream and batch data processing.
Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
For more details, follow the documentation and website; I am not going to explain that here.
A few features of Apache Flink:
Getting started with Apache Flink is very easy: just untar the release and run the process. There is not much hassle, and even setting it up in a cluster is easy to do. Flink is mainly developed in Java but supports Scala well, so a developer can use either Java or Scala. Python support is basic and lacks a few fundamental features compared to Apache Spark, so if you are a complete Python user, you need to wait a while.
Apache Flink has two main features, i.e. DataSet processing and data streaming, but streaming is not part of the latest 0.9.1 release; you need to use a developer build to test streaming. Building the developer build for 0.1.0-SNAPSHOT is extremely easy: just download the code from GitHub and simply run mvn clean package. Once you are done with that, you are ready to test streaming as well.
Apache Flink streaming actually gave me good throughput and low latency with very minimal configuration, so I accept this point. I tested continuous streaming for around 1.5 hours and processed the data to build an analytics cloud. Although processing the input stream didn't involve much work, it was blazing fast.
Apache Flink's fault tolerance was good; at least the processing engine didn't crash when I intentionally hung my Kafka log processing, though I didn't test this at the cluster level or under extreme conditions.
Flink gives both batch and continuous stream processing over a single runtime, so I didn't need to set up any new configuration.
Apache Flink's memory management is impressive, as it has its own memory manager. The Flink documentation says: "Applications scale to data sizes beyond main memory and experience less garbage collection overhead." Flink has always had its own way of processing data in memory. Instead of putting lots of objects on the heap, Flink serializes objects into a fixed number of pre-allocated memory segments. Its DBMS-style sort and join algorithms operate as much as possible on this binary data to keep the de/serialization overhead at a minimum. If more data needs to be processed than can be kept in memory, Flink's operators partially spill data to disk.
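To make the idea concrete, here is a minimal plain-Java sketch of the concept (this is not Flink's actual MemorySegment API; the class and method names are mine): records are serialized once into a single pre-allocated buffer, and sorting compares the binary data in place instead of shuffling heap objects around.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch of the idea behind Flink's managed memory (illustrative, not
// Flink's API): serialize fixed-size records into one pre-allocated
// "segment" and sort by comparing binary data in place, instead of
// allocating one heap object per record.
public class SegmentSketch {
    static final int RECORD_SIZE = 4; // each record is one int key

    public static int[] sortInSegment(int[] keys) {
        // one pre-allocated memory segment holding all records
        ByteBuffer segment = ByteBuffer.allocate(keys.length * RECORD_SIZE);
        for (int k : keys) segment.putInt(k);          // serialize once
        Integer[] offsets = new Integer[keys.length];  // sort offsets, not objects
        for (int i = 0; i < offsets.length; i++) offsets[i] = i * RECORD_SIZE;
        Arrays.sort(offsets, (a, b) ->
                Integer.compare(segment.getInt(a), segment.getInt(b)));
        int[] sorted = new int[keys.length];
        for (int i = 0; i < sorted.length; i++) sorted[i] = segment.getInt(offsets[i]);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortInSegment(new int[]{42, 7, 19})));
    }
}
```

Because the records live in a fixed-size, pre-allocated buffer, the garbage collector never sees millions of short-lived record objects, which is the overhead Flink's design avoids.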
More details are in Flink's write-ups on its basic and new off-heap memory management.
Apache Flink also has dedicated support for iterative computations, which are mainly required in machine learning.
Apache Flink already ships with quite a lot of streaming examples covering a broad range of use cases.
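Flink exposes this through iteration operators on its data sets; the control flow they encapsulate is roughly the following (a plain-Java sketch of the pattern, not Flink's API): a step function is applied repeatedly to a working value until a convergence criterion or an iteration limit is hit, which is the shape most ML training loops take.

```java
import java.util.function.DoubleUnaryOperator;

// Plain-Java sketch of the control flow behind bulk iterations
// (illustrative, not Flink's API): repeatedly apply a step function
// until convergence or an iteration limit.
public class BulkIterationSketch {
    public static double iterate(double initial, DoubleUnaryOperator step,
                                 int maxIterations, double epsilon) {
        double current = initial;
        for (int i = 0; i < maxIterations; i++) {
            double next = step.applyAsDouble(current);
            if (Math.abs(next - current) < epsilon) return next; // converged
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Newton's method for sqrt(2) as the step function
        double r = iterate(1.0, x -> 0.5 * (x + 2.0 / x), 100, 1e-12);
        System.out.println(r);
    }
}
```

The point of having the engine know about this loop (rather than the driver program resubmitting a job per pass) is that state can stay cached between iterations, which is why native iteration support matters for ML workloads.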
Running a Kafka streaming example on Apache Flink:
Flink already provides a few connectors to other systems such as Kafka, RabbitMQ, Flume, and Twitter.
The Flink UI gives very useful details:
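A connector is essentially the source (or sink) end of a job. Sketched in plain Java rather than Flink's DataStream API (all names below are illustrative, not Flink's), such a pipeline has a source → transform → sink shape, with a connector like the Kafka one playing the role of the source:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java sketch of the source -> transform -> sink shape of a
// connector-based streaming job (illustrative names, not Flink's API).
public class PipelineSketch<I, O> {
    private final Iterable<I> source;        // e.g. messages from a Kafka topic
    private final Function<I, O> transform;  // the user's processing logic
    private final Consumer<O> sink;          // e.g. a downstream store

    public PipelineSketch(Iterable<I> source, Function<I, O> transform, Consumer<O> sink) {
        this.source = source;
        this.transform = transform;
        this.sink = sink;
    }

    public void run() {
        for (I record : source) sink.accept(transform.apply(record));
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        new PipelineSketch<>(List.of("a", "b"), String::toUpperCase, out::add).run();
        System.out.println(out);
    }
}
```

With the built-in connectors, swapping Kafka for RabbitMQ or Flume only changes the source end; the transformation logic stays the same.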
Flink links:
https://cwiki.apache.org/confluence/display/FLINK/Flink+Roadmap
https://cwiki.apache.org/confluence/display/FLINK/Apache+Flink+Home
Example of Kafka Running in Flink
Conclusion:
I don't think it is a replacement for Apache Spark, but Apache Flink streaming is great, and it is true streaming, so if we need an actual streaming engine it will be a great solution. It is not yet that mature (I believe none of the engines in the streaming space are :-) ), but the next release looks more promising. I was also able to use both Apache Spark and Flink in the same application.