Apache Flink, From a Developer's Point of View


What is Apache Flink?

Apache Flink is an open-source platform for distributed stream and batch data processing.

Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

For more details, follow the documentation and the website; I am not going to explain that here.

A few features of Apache Flink:

Getting started with Apache Flink is very easy: just untar the release and start the process. There is not much hassle, and even cluster setup is straightforward. Flink is mainly developed in Java but supports Scala well, so a developer can use either Java or Scala. Python support is basic and lacks a few fundamental features compared to Apache Spark, so if you are a pure Python user, you need to wait a while.
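A minimal local setup, as a sketch (the file name and version here are illustrative, and the script names follow the 0.9.x release layout, so adjust them to whatever you download):

```shell
# Unpack the release tarball (file name is illustrative)
tar -xzf flink-0.9.1-bin-hadoop1.tgz
cd flink-0.9.1

# Start a local single-node instance
./bin/start-local.sh

# The JobManager web UI should now be reachable at http://localhost:8081
```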

Apache Flink has two main APIs, DataSet processing and DataStream processing, but streaming is not part of the latest 0.9.1 release; you need a developer build to test streaming. Building the 0.10-SNAPSHOT developer build is extremely easy: just download the code from GitHub and run mvn clean package. Once that is done, you are ready to test streaming as well.
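The build steps can be sketched as follows (the repository URL is the standard Apache mirror on GitHub; skipping tests is optional but speeds the build up considerably):

```shell
# Clone the source and build the SNAPSHOT distribution
git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests

# The runnable distribution ends up under flink-dist/target
```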

Apache Flink Streaming actually gave me good throughput and low latency with very minimal configuration, so I accept this point. I tested continuous streaming for around 1.5 hours, processing the data to build an analytics cloud. Although processing the input stream didn't involve much work, it was blazing fast.

Apache Flink's fault-tolerance level was good: at least the processing engine didn't crash when I intentionally hung my Kafka log processing, though I didn't test this at the cluster level or under extraordinary conditions.

Flink gives both batch and continuous stream processing over a single runtime, so I didn't need to set up any new configuration.

Apache Flink's memory management is impressive, as it has its own memory manager. Flink says: "Applications scale to data sizes beyond main memory and experience less garbage collection overhead." Flink has always had its own way of processing data in memory. Instead of putting lots of objects on the heap, Flink serializes objects into a fixed number of pre-allocated memory segments. Its DBMS-style sort and join algorithms operate as much as possible on this binary data to keep the de/serialization overhead at a minimum. If more data needs to be processed than can be kept in memory, Flink's operators partially spill data to disk.
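The idea can be sketched in plain Java with no Flink dependency (this is a conceptual illustration of writing records into pre-allocated fixed-size binary segments, not Flink's actual MemorySegment code):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch: serialize (id, value) records into pre-allocated
// fixed-size binary segments instead of keeping objects on the heap.
public class SegmentSketch {
    static final int SEGMENT_SIZE = 32 * 1024;               // fixed segment size
    static final int RECORD_SIZE = Integer.BYTES + Long.BYTES;

    private final List<ByteBuffer> segments = new ArrayList<>();
    private ByteBuffer current;

    SegmentSketch() { addSegment(); }

    private void addSegment() {
        current = ByteBuffer.allocate(SEGMENT_SIZE);         // could be allocateDirect (off-heap)
        segments.add(current);
    }

    // Serialize a record into binary form; move to a new segment when full.
    void write(int id, long value) {
        if (current.remaining() < RECORD_SIZE) addSegment();
        current.putInt(id).putLong(value);
    }

    // Operate directly on the binary data: sum all values without
    // deserializing the records back into objects.
    long sumValues() {
        long sum = 0;
        for (ByteBuffer seg : segments) {
            ByteBuffer ro = seg.duplicate();
            ro.flip();                                       // read what was written
            while (ro.remaining() >= RECORD_SIZE) {
                ro.getInt();                                 // skip the id field
                sum += ro.getLong();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        SegmentSketch s = new SegmentSketch();
        for (int i = 0; i < 100_000; i++) s.write(i, i);     // fills several segments
        System.out.println(s.sumValues());                   // 0+1+...+99999 = 4999950000
    }
}
```

Because the records live in a small, fixed number of large buffers rather than as millions of heap objects, the garbage collector has far less to track, which is exactly the pressure Flink's design avoids.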

More details on the basic and the new off-heap memory management are available in the Flink documentation.

Apache Flink also has dedicated support for iterative computations, which are mainly required in machine learning.
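As a conceptual illustration of what an iterative job does, here is PageRank on a tiny three-page graph in plain Java (no Flink dependency; in Flink's DataSet API this loop body would be the step function of a bulk iteration, letting the engine cache loop-invariant data between iterations):

```java
import java.util.Arrays;

// Conceptual sketch of an iterative computation: PageRank run for a
// fixed number of iterations over a small adjacency list.
public class PageRankSketch {
    public static double[] pageRank(int[][] outLinks, int iterations, double damping) {
        int n = outLinks.length;
        double[] ranks = new double[n];
        Arrays.fill(ranks, 1.0 / n);                    // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);     // random-jump term
            for (int page = 0; page < n; page++) {
                // Each page shares its damped rank equally among its out-links.
                double share = damping * ranks[page] / outLinks[page].length;
                for (int target : outLinks[page]) next[target] += share;
            }
            ranks = next;                               // feed the result into the next iteration
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Page 0 links to 1 and 2; page 1 links to 2; page 2 links back to 0.
        int[][] graph = {{1, 2}, {2}, {0}};
        double[] ranks = pageRank(graph, 50, 0.85);
        System.out.println(Arrays.toString(ranks));     // the three ranks sum to ~1.0
    }
}
```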

 

Apache Flink already ships with quite a lot of streaming examples covering a broad range of use cases.

On running a Kafka streaming example on Apache Flink:

Flink already provides a few connectors to integrate with other systems such as Kafka, RabbitMQ, Flume, Twitter, etc.

The Flink UI gives very detailed information:

Flink links:

https://cwiki.apache.org/confluence/display/FLINK/Flink+Roadmap

https://cwiki.apache.org/confluence/display/FLINK/Apache+Flink+Home

Example of Kafka running in Flink:
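A sketch of consuming a Kafka topic as a Flink DataStream. The broker address, topic name, and group id below are placeholders, and the consumer class name changed across the early releases, so check the connector documentation for the exact class in your version:

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

// Sketch: read string records from a Kafka topic and echo them to stdout.
public class KafkaReadSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.setProperty("zookeeper.connect", "localhost:2181");   // placeholder ZooKeeper
        props.setProperty("group.id", "flink-demo");                // placeholder group id

        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer082<>("my-topic", new SimpleStringSchema(), props));

        stream.print();                       // echo each record to stdout
        env.execute("Kafka read sketch");     // submit the streaming job
    }
}
```

Running this requires a Kafka broker and the flink-connector-kafka dependency on the classpath; it is a shape to start from, not a drop-in program.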

 


Conclusion:

I don't think it is a replacement for Apache Spark, but Apache Flink Streaming is great, and it is true streaming, so if we need an actual streaming engine, it will be a great solution. It is not yet that mature (I believe none of the engines in the streaming space are :-) ), but the next release looks more promising. I was also able to use both Apache Spark and Flink in the same application.
