Working with Data Pipelines on StreamSets

Working with Hive tables and StreamSets for data engineering.

StreamSets provides two modules for data engineering from Control Hub: Data Collector (written in Java) and Data Transformer (written in Scala). I opted to use Data Transformer, since it lets us test the performance of Scala for Big Data processing.

StreamSets starts you off with a blank interface where you can create a simple data pipeline.

The StreamSets pipeline creation screen

After we create a pipeline, StreamSets brings us to a blank canvas where we can connect to our Hive tables, create a filter, and write the results back to a Hive table.


The StreamSets pipeline canvas
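Data Transformer runs pipelines on Spark, so the three stages above are roughly equivalent to a Scala sketch like the one below. The table names accounts_raw and accounts_filtered are placeholders, since the actual Hive tables aren't named here; only the Accounts column comes from the pipeline itself.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object AccountsFilterPipeline {
      def main(args: Array[String]): Unit = {
        // Hive support lets Spark read and write the same tables the pipeline uses
        val spark = SparkSession.builder()
          .appName("accounts-filter-pipeline")
          .enableHiveSupport()
          .getOrCreate()

        // Origin: the initial Hive table (~4.5M rows in this run)
        val source = spark.table("accounts_raw")

        // Processor: keep only rows where Accounts is not null
        val filtered = source.filter(col("Accounts").isNotNull)

        // Destination: write the result back to a second Hive table
        filtered.write.mode("overwrite").saveAsTable("accounts_filtered")

        spark.stop()
      }
    }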

Once the pipeline writes to the second Hive table, we can see that the run has completed and query the output in Hue (the Hadoop User Experience notebook, or that's what I call it).
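The same sanity check you would run in Hue can also be expressed through Spark SQL, reusing the SparkSession from the sketch above; accounts_filtered again stands in for the actual destination table.

    // Quick check on the output table (the same query you could run in Hue)
    val remaining = spark.sql("SELECT COUNT(*) FROM accounts_filtered").first().getLong(0)
    println(s"Rows remaining after the filter: $remaining")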


That whole process took only around 33.55 seconds to work with Big Data on Scala using the StreamSets cloud tool. I did not even have to write any Scala code; I just used the drag-and-drop tools in StreamSets.

  • Initial Hive table: 4.5M rows
  • Filter function: keep rows that are not null on Accounts
  • Final Hive table after the filter: 562k rows

The cluster was configured with 2 GB of Spark driver memory and 2 GB of Spark executor memory.
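If you were launching an equivalent Spark job yourself rather than through the StreamSets UI, that sizing would typically be passed as Spark properties, roughly as sketched below; StreamSets applies these values through its cluster configuration rather than in pipeline code.

    import org.apache.spark.sql.SparkSession

    // The same 2 GB driver / 2 GB executor sizing expressed as Spark properties.
    // Note: spark.driver.memory only takes effect if set before the driver JVM
    // starts (e.g. via spark-submit --driver-memory or spark-defaults.conf),
    // so it is shown here for documentation rather than as a runtime override.
    val spark = SparkSession.builder()
      .appName("accounts-filter-pipeline")
      .config("spark.driver.memory", "2g")
      .config("spark.executor.memory", "2g")
      .enableHiveSupport()
      .getOrCreate()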

______________________________________________________________________

There is a lot more you can do with StreamSets, including configuring a Docker container on your local machine to spin up a cluster. Try the community edition.

https://streamsets.com/products/pricing/
