Apache Flink, the "4G of Big Data" - Introduction and a Quickstart Tutorial
1. Objective
In this tutorial we will introduce Apache Flink: what Flink is, and why and where to use it. The tutorial will answer the question of why Apache Flink is called the 4G of Big Data, and will also briefly cover Flink's APIs and features.
2. Video Tutorial
3. Introduction
Apache Flink is an open-source streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink is a top-level Apache project. It is a scalable data analytics framework, fully compatible with Hadoop, and it can execute both stream processing and batch processing workloads.
Apache Flink started as a research project called Stratosphere. In 2008, Volker Markl formed the idea for Stratosphere and attracted co-principal investigators from HU Berlin, TU Berlin, and the Hasso Plattner Institute in Potsdam. Together they worked toward a shared vision and invested great effort in systems building and open-source deployment. Several decisive steps followed to make the project popular in the commercial, research, and open-source communities, and a commercial entity was formed around Stratosphere. When the project applied for Apache incubation in April 2014, the name Flink was chosen. Flink is a German word meaning swift or agile.
4. Why Flink?
The key vision behind Apache Flink is to reduce the complexity faced by other distributed data processing engines. It achieves this by integrating query optimization and concepts from database systems with efficient parallel in-memory and out-of-core algorithms, on top of the MapReduce framework. Because Apache Flink is based on a streaming model, it iterates over data using its streaming architecture, and the concept of iterative algorithms is tightly bound into Flink's query optimizer. Apache Flink's pipelined architecture allows it to process streaming data with lower latency than micro-batch architectures such as Spark.
5. Apache Flink APIs
Apache Flink provides APIs for building applications that run on the Flink engine:
i. DataStream APIs
A DataStream program is a regular program in Apache Flink that implements transformations on data streams, for example filtering, aggregating, or updating state. Results are returned through sinks, which may, for example, write the data to files or print it to the command-line terminal.
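As a concrete illustration, here is a minimal DataStream sketch in Java, assuming the Flink streaming dependency (`flink-streaming-java`) is on the classpath; the socket host and port are placeholder values:

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FilterErrorLines {
    public static void main(String[] args) throws Exception {
        // Set up the streaming execution environment
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: read text lines from a socket (host/port are placeholders)
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Transformation: keep only the lines containing "ERROR"
        DataStream<String> errors =
                lines.filter((FilterFunction<String>) line -> line.contains("ERROR"));

        // Sink: print results to the command-line terminal
        errors.print();

        // Streaming programs run when execute() is called
        env.execute("Filter error lines");
    }
}
```

The filter transformation and the print sink here correspond directly to the filtering and terminal output mentioned above.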
ii. DataSet APIs
A DataSet program is a regular program in Apache Flink that implements transformations on data sets, for example joining, grouping, mapping, or filtering. This API is used for batch processing of data that is already available in a repository.
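A minimal batch sketch using the DataSet API, again assuming the Flink Java dependencies are available; the in-memory input stands in for data that would normally be read from a repository (e.g. with `env.readTextFile(...)`):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCount {
    public static void main(String[] args) throws Exception {
        // Set up the batch execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // A small in-memory data set; in practice this would be loaded from storage
        DataSet<String> text = env.fromElements("flink is swift", "flink is agile");

        // Map each line to (word, 1) pairs, then group by word and sum the counts
        DataSet<Tuple2<String, Integer>> counts = text
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split("\\s+")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .groupBy(0)
                .sum(1);

        // For batch programs, print() triggers execution and writes to the terminal
        counts.print();
    }
}
```

This shows the mapping, grouping, and aggregation transformations the DataSet API is typically used for on bounded data.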
iii. Table APIs
This API in Flink is used for handling relational operations. It is a SQL-like expression language for relational stream and batch processing, and it can also be integrated with the DataStream and DataSet APIs.
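A short Table API sketch, assuming the Flink Table dependency (`flink-table-api-java`) is available; the table contents and column names are illustrative:

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.row;

public class TableApiSketch {
    public static void main(String[] args) {
        // Batch-mode table environment
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // A small in-memory table of (user, amount) rows
        Table orders = tEnv.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("user", DataTypes.STRING()),
                        DataTypes.FIELD("amount", DataTypes.INT())),
                row("alice", 10), row("bob", 5), row("alice", 7));

        // SQL-like relational operations: group by user and sum the amounts
        Table totals = orders
                .groupBy($("user"))
                .select($("user"), $("amount").sum().as("total"));

        // Execute and print the result table to the terminal
        totals.execute().print();
    }
}
```

The same relational query could equivalently be expressed in SQL via `tEnv.sqlQuery(...)`, which is what makes the Table API convenient for users coming from database systems.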