Working with Data Pipelines on StreamSets
Albert Anthony D. Gavino, MBA
Book Writer | Data Science | Cloud Solutions
Working with Hive tables and StreamSets for data engineering.
StreamSets provides two modules for data engineering from Control Hub: Data Collector (written in Java) and Data Transformer (written in Scala). I opted for Data Transformer so we could test Scala's performance for Big Data processing.
It usually starts with a blank interface where you can create a simple data pipeline.
After we create a pipeline, we land on a blank canvas where we can connect to our Hive tables, apply a filter, and write the result back to another Hive table, as sketched below.
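Under the hood, a Data Transformer pipeline runs as a Spark job. Here is a minimal Scala sketch of the equivalent read-filter-write logic; the table names (sales_raw, sales_filtered) and the filter condition are hypothetical placeholders, not the actual pipeline's values:

```scala
import org.apache.spark.sql.SparkSession

object HiveFilterPipeline {
  def main(args: Array[String]): Unit = {
    // Hive support lets Spark read and write metastore-backed tables.
    val spark = SparkSession.builder()
      .appName("hive-filter-pipeline")
      .enableHiveSupport()
      .getOrCreate()

    // Read the source Hive table (hypothetical name).
    val source = spark.table("default.sales_raw")

    // Apply a simple filter, mirroring the Filter stage on the canvas.
    val filtered = source.filter("amount > 1000")

    // Write the result back to a second Hive table (hypothetical name).
    filtered.write
      .mode("overwrite")
      .saveAsTable("default.sales_filtered")

    spark.stop()
  }
}
```

The drag-and-drop stages in the canvas map roughly one-to-one onto these three steps: a Hive origin, a filter processor, and a Hive destination.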
Once the pipeline writes to the second Hive table, the run completes and we can query the result in Hue (the Hadoop User Experience notebook, or that's what I call it).
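If you prefer to verify from code rather than Hue, the same check is a one-line Spark SQL query, continuing from the sketch above (table name again a placeholder):

```scala
// Confirm the run completed by counting rows in the destination table.
spark.sql("SELECT COUNT(*) FROM default.sales_filtered").show()
```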
The whole process took only around 33.55 seconds to work with Big Data on Scala through the StreamSets cloud tool. I did not even have to write Scala code; I just used StreamSets' drag-and-drop tools.
The cluster was configured with 2 GB of Spark driver memory and 2 GB of Spark executor memory.
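For reference, those same memory settings could be expressed when building a Spark session yourself; spark.driver.memory and spark.executor.memory are standard Spark configuration keys, not StreamSets-specific settings, and in the pipeline above they are set through the StreamSets UI rather than in code:

```scala
import org.apache.spark.sql.SparkSession

// A sketch of the equivalent 2 GB driver / 2 GB executor configuration.
// Note: driver memory must be fixed before the driver JVM starts (e.g. via
// spark-submit), so setting it here only takes effect in some deploy modes.
val spark = SparkSession.builder()
  .appName("hive-filter-pipeline")
  .config("spark.driver.memory", "2g")
  .config("spark.executor.memory", "2g")
  .enableHiveSupport()
  .getOrCreate()
```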
______________________________________________________________________
There is a lot more you can do with StreamSets, such as configuring a Docker container on your local machine to spin up a cluster. Try the community edition:
https://streamsets.com/products/pricing/