A Comprehensive Analysis: Dataflow Technology

Disclaimer: This post is a combination of original content and facts gathered from reputable sources cited below. I've been compelled to write these posts because so many tech writers are putting out articles that are not technically sound; these posts are meant to be a factual "one-stop" reference. Also, please keep in mind that many of these topics are so new they are evolving as I type this post, so your inputs are greatly appreciated & welcomed.

Those of you who are following Big Data technologies closely have probably heard about Apache NiFi becoming a top-level Apache project, Google making lots of waves about Google Cloud Dataflow, and many others. While attending the Spark Summit East conference a few weeks back in New York City, I found myself speaking with many folks about dataflow platforms like Apache NiFi and thought it would be a good topic for my next post.

In this post I am going to give an overview of the concepts behind Dataflow programming and also point to a number of vendors that you can research for your own comparison.

Many organizations are in an ongoing evaluation of the various Hadoop frameworks and the open source tools they can leverage to build their Big Data systems. Dataflow products bundle these same open source tools into one solution and help automate many of the common Big Data functions, notably data ingestion & data processing.

Google's Dataflow website describes their Dataflow model as follows; the details will vary slightly from vendor to vendor, but I feel this is a good explanation:

"The Dataflow programming model was designed to simplify the mechanics of large-scale data processing. When you program with a Dataflow software development kit (SDK), you're essentially creating a data processing job to be executed in the future. This model lets you concentrate on the logical composition of your data processing job, rather than the physical orchestration of parallel processing. You can focus on what you need to do with your job instead of exactly how that job gets executed.

The Dataflow model provides a number of useful abstractions that insulate you from low-level details of distributed processing, such as coordinating individual workers, sharding data sets, and other such tasks. These low-level details are fully managed for you by Cloud Dataflow's runner services.

When you think about data processing with Dataflow, you can think in terms of four major concepts:

  • Pipelines
  • PCollections
  • Transforms
  • I/O Sources and Sinks

Once you're familiar with these principles, you can learn about pipeline design principles to help determine how best to use the Dataflow programming model to accomplish your data processing tasks.

Pipelines

A pipeline encapsulates an entire series of computations that accepts some input data from external sources, transforms that data to provide some useful intelligence, and produces some output data. That output data is often written to an external data sink. The input source and output sink can be the same, or they can be of different types, allowing you to easily convert data from one format to another.

Each pipeline represents a single, potentially repeatable job, from start to finish, in the Dataflow service.

See Pipelines for a complete discussion of how a pipeline is represented in the Dataflow SDKs.

PCollections

A PCollection represents a set of data in your pipeline. The Dataflow PCollection classes are specialized container classes that can represent data sets of virtually unlimited size. A PCollection can hold a data set of a fixed size (such as data from a text file or a BigQuery table), or an unbounded data set from a continuously updating data source (such as a subscription from Google Cloud Pub/Sub).

PCollections are the inputs and outputs for each step in your pipeline.

See PCollections for a complete discussion of how PCollection works in the Dataflow SDKs.

Transforms

A transform is a data processing operation, or a step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.

Your transforms don't need to be in a strict linear sequence within your pipeline. You can use conditionals, loops, and other common programming structures to create a branching pipeline or a pipeline with repeated structures. You can think of your pipeline as a directed graph of steps, rather than a linear sequence.

See Transforms for a complete discussion of how transforms work in the Dataflow SDKs.

I/O Sources and Sinks

The Dataflow SDKs provide data source and data sink APIs for pipeline I/O. You use the source APIs to read data into your pipeline, and the sink APIs to write output data from your pipeline. These source and sink operations represent the roots and endpoints of your pipeline.

The Dataflow source and sink APIs let your pipeline work with data from a number of different data storage formats, such as files in Google Cloud Storage, BigQuery tables, and more. You can also use a custom data source (or sink) by teaching Dataflow how to read from (or write to) it in parallel.

See Pipeline I/O for more information on how data sources and sinks work in the Dataflow SDKs."
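
To make these four concepts concrete, here is a minimal word-count sketch written against the classic Cloud Dataflow Java SDK (which has since evolved into Apache Beam, so package names and annotations differ in newer releases). The class name and the Google Cloud Storage paths are placeholders, not anything prescribed by Google's documentation; treat this as an illustrative sketch rather than a definitive implementation.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class MinimalWordCount {
  public static void main(String[] args) {
    // A Pipeline encapsulates the entire job, from source to sink.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // I/O source: read lines of text into a PCollection (placeholder path).
    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://my-bucket/input-*.txt"));

    // Transform 1: split each line into words (one PCollection in, one out).
    PCollection<String> words = lines.apply(ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        for (String word : c.element().split("[^a-zA-Z']+")) {
          if (!word.isEmpty()) {
            c.output(word);
          }
        }
      }
    }));

    // Transform 2: count occurrences of each word.
    PCollection<KV<String, Long>> counts = words.apply(Count.<String>perElement());

    // Transform 3 + I/O sink: format the counts and write them out (placeholder path).
    counts.apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().getKey() + ": " + c.element().getValue());
      }
    })).apply(TextIO.Write.to("gs://my-bucket/output"));

    // Nothing executes until run() submits the job to a runner.
    p.run();
  }
}

Note that building the pipeline only describes the job; the chosen runner (a local runner or the Cloud Dataflow service) decides how the work is parallelized and executed, which is exactly the separation of logical composition from physical orchestration described in the quote above.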

Current vendors offering these Dataflow-type products include:

Hortonworks - Apache NiFi

StreamSets - Data Collector

Google - Cloud Dataflow

Amazon - Kinesis Suite

Pivotal - Spring Cloud Data Flow

Actian - Data Cloud & Data Connect

While Dataflow technologies are great, they are essentially "Big Data projects in a box," with many of the baseline tools (i.e. HDFS, MapReduce, Spark) "cooked" into the solution. There are also a number of tools out there that provide this type of "Big Data solution in a box" which are not marketed as "Dataflow" but still create a similar kind of operational efficiency with Big Data (e.g. Datameer, Splunk, Tamr).

At the end of the day, all enterprises are trying to achieve Enterprise Data Unification to gain better insights for informed decision-making. So whether companies decide to build, buy it in a box, or take a hybrid approach, any of these can work well depending on the organization's structure and overall needs.

I hope this post has explained these particular technologies at a high level. There will be differences from vendor to vendor, and of course companies market their solutions differently as well. I hope this post is a good starting point for many of you to take a deeper look at these technologies. Please share your comments below; these posts are meant to be as informative as they are collaborative.

Rassul Fazelat (follow me here @BigDataVision) is Managing Partner - Founder of Data Talent Advisors, a boutique Data & Analytics Talent Advisory & Headhunting firm, Organizer of the NYC Big Data Visionaries Meetup, Co-Organizer of the NYC Marketing Analytics Forum, & Co-Organizer of the NYC Advanced Analytics Meetup.

Other posts in the Comprehensive Analysis (Big Data) series:

 Big Data Career series:
