A Comprehensive Analysis: Dataflow Technology
Rassul Fazelat
President & CEO @ Data Talent Advisors | Data, Analytics, RAG & GEN AI Recruiting
Disclaimer: This post is a combination of original content and facts gathered from reputable sources cited below. I've been compelled to write these posts because so many tech writers are putting out articles that are not technically sound; these posts are meant to serve as a factual "one-stop" reference. Also, please keep in mind that many of these topics are so new that they are evolving as I type this post, so your inputs are greatly appreciated & welcomed.
Those of you following Big Data technologies closely have probably heard about Apache NiFi becoming a top-level Apache project, Google making lots of waves about Google Cloud Dataflow, and many others. While attending the Spark Summit East conference a few weeks back in New York City, I found myself speaking with many folks about dataflow platforms like Apache NiFi and thought it would be a good topic for my next post.
In this post I am going to give an overview of the concepts behind Dataflow programming and also refer to a number of vendors that you can research for your own comparison.
Many organizations are in an ongoing evaluation of the various Hadoop frameworks and the open source tools they can leverage to build their Big Data systems. Dataflow products bundle these same open source tools into one solution and help automate some of the common Big Data functions, notably data ingestion & data processing.
Google's Dataflow website describes their Dataflow as follows; offerings will differ slightly from vendor to vendor, but I feel this is a good explanation:
"The Dataflow programming model was designed to simplify the mechanics of large-scale data processing. When you program with a Dataflow software development kit (SDK), you're essentially creating a data processing job to be executed in the future. This model lets you concentrate on the logical composition of your data processing job, rather than the physical orchestration of parallel processing. You can focus on what you need to do with your job instead of exactly how that job gets executed.
The Dataflow model provides a number of useful abstractions that insulate you from low-level details of distributed processing, such as coordinating individual workers, sharding data sets, and other such tasks. These low-level details are fully managed for you by Cloud Dataflow's runner services.
When you think about data processing with Dataflow, you can think in terms of four major concepts:
- Pipelines
- PCollections
- Transforms
- I/O Sources and Sinks
Once you're familiar with these principles, you can learn about pipeline design principles to help determine how best to use the Dataflow programming model to accomplish your data processing tasks.
Pipelines
A pipeline encapsulates an entire series of computations that accepts some input data from external sources, transforms that data to provide some useful intelligence, and produces some output data. That output data is often written to an external data sink. The input source and output sink can be the same, or they can be of different types, allowing you to easily convert data from one format to another.
Each pipeline represents a single, potentially repeatable job, from start to finish, in the Dataflow service.
See Pipelines for a complete discussion of how a pipeline is represented in the Dataflow SDKs.
PCollections
A PCollection represents a set of data in your pipeline. The Dataflow PCollection classes are specialized container classes that can represent data sets of virtually unlimited size. A PCollection can hold a data set of a fixed size (such as data from a text file or a BigQuery table), or an unbounded data set from a continuously updating data source (such as a subscription from Google Cloud Pub/Sub).
PCollections are the inputs and outputs for each step in your pipeline.
See PCollections for a complete discussion of how PCollection works in the Dataflow SDKs.
Transforms
A transform is a data processing operation, or a step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.
Your transforms don't need to be in a strict linear sequence within your pipeline. You can use conditionals, loops, and other common programming structures to create a branching pipeline or a pipeline with repeated structures. You can think of your pipeline as a directed graph of steps, rather than a linear sequence.
See Transforms for a complete discussion of how transforms work in the Dataflow SDKs.
I/O Sources and Sinks
The Dataflow SDKs provide data source and data sink APIs for pipeline I/O. You use the source APIs to read data into your pipeline, and the sink APIs to write output data from your pipeline. These source and sink operations represent the roots and endpoints of your pipeline.
The Dataflow source and sink APIs let your pipeline work with data from a number of different data storage formats, such as files in Google Cloud Storage, BigQuery tables, and more. You can also use a custom data source (or sink) by teaching Dataflow how to read from (or write to) it in parallel.
See Pipeline I/O for more information on how data sources and sinks work in the Dataflow SDKs."
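To make those four concepts concrete, here is a minimal word-count sketch written against the Apache Beam Java SDK, the open-source project that Google's Dataflow SDKs evolved into. Treat it as an illustration under that assumption rather than vendor-specific production code; the bucket paths are placeholders.

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCountSketch {
  public static void main(String[] args) {
    // Pipeline: encapsulates the whole job; options come from the command line
    // (e.g. which runner executes it -- a local runner or a managed cloud runner).
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // I/O source: read text files into a PCollection<String>, one element per line.
    // The path is a placeholder -- point it at your own bucket or local files.
    PCollection<String> lines =
        pipeline.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*.txt"));

    // Transforms: split each line into words, then count occurrences per word.
    PCollection<KV<String, Long>> counts =
        lines
            .apply("SplitWords",
                FlatMapElements.into(TypeDescriptors.strings())
                    .via((String line) -> Arrays.asList(line.split("\\W+"))))
            .apply("CountWords", Count.perElement());

    // I/O sink: format each key/value pair and write the results out as text.
    counts
        .apply("FormatResults",
            MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("WriteCounts", TextIO.write().to("gs://my-bucket/output/wordcounts"));

    // Nothing runs until the pipeline is submitted to a runner.
    pipeline.run().waitUntilFinish();
  }
}
```

Each apply() step is a transform, the PCollections flowing between the steps are the pipeline's intermediate data sets, and TextIO provides the source and the sink.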
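The point above about a pipeline being a directed graph rather than a straight line is also easy to see in code. The fragment below continues the sketch above (same pipeline object and placeholder paths, plus Filter from the same SDK) and fans one input PCollection out into two independent branches, each ending in its own sink.

```java
// Assumes the same pipeline object and imports as the sketch above,
// plus org.apache.beam.sdk.transforms.Filter.
PCollection<String> logLines =
    pipeline.apply("ReadLogs", TextIO.read().from("gs://my-bucket/logs/*.txt"));

// Branch 1: keep only lines containing "ERROR" and write them to their own sink.
logLines
    .apply("FilterErrors", Filter.by((String line) -> line.contains("ERROR")))
    .apply("WriteErrors", TextIO.write().to("gs://my-bucket/output/errors"));

// Branch 2: independently count all lines and write the total elsewhere.
logLines
    .apply("CountLines", Count.globally())
    .apply("FormatCount",
        MapElements.into(TypeDescriptors.strings())
            .via((Long n) -> "total lines: " + n))
    .apply("WriteCount", TextIO.write().to("gs://my-bucket/output/linecount"));
```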
Current vendors offering these Dataflow-type products include:
- Pivotal - Spring Cloud Data Flow
- Actian - Data Cloud & Data Connect
While Dataflow technologies are great, they are essentially "Big Data projects in a box," with many of the baseline tools (e.g. HDFS, MapReduce, Spark) "cooked" into the solution. There are also a number of tools that are not marketed as "Dataflow" but provide the same kind of "Big Data solution in a box" and create similar operational efficiencies with Big Data (e.g. Datameer, Splunk, Tamr).
At the end of the day, all enterprises are trying to achieve Enterprise Data Unification to gain better insights for informed decision-making. So whether a company decides to build, to buy a solution in a box, or to take a hybrid approach, any of these can work well depending on the organization's structure and overall needs.
I hope this post has explained these technologies at a high level. There will be differences from vendor to vendor, and of course companies market their solutions differently as well. For many of you, I hope this post is a good starting point for taking a deeper look at these technologies. Please share your comments below; these posts are meant to be as much collaborative as informative.
Rassul Fazelat (follow me here @BigDataVision) is Managing Partner - Founder of Data Talent Advisors, a boutique Data & Analytics Talent Advisory & Headhunting firm, Organizer of the NYC Big Data Visionaries Meetup, Co-Organizer of the NYC Marketing Analytics Forum, and Co-Organizer of the NYC Advanced Analytics Meetup.
Other posts in the Comprehensive Analysis (Big Data) series:
- Deconstructing AI - A Closer Look
- 3 Reasons Why Hadoop as a Service Is Making Sense For Business Analytics
- A Comprehensive Analysis: Blockchain Technology Beyond Bitcoin
- A Comprehensive Analysis: Big Data Security
- A Comprehensive Analysis: Dataflow Technology
- A Comprehensive Analysis: Data Processing Part Deux: Apache Spark vs Apache Storm
- A Comprehensive Analysis - NoSQL vs RDBMS
- A Comprehensive Analysis: Apache Kafka
- A Comprehensive Analysis: Java vs Scala
- A Comprehensive Analysis: Apache Flink and How it compares to Apache Spark
- A Comprehensive Analysis: Apache Spark vs MapReduce