Spark Use Case Discussion in Italy
Last week in Naples, Italy, I engaged in a Spark use case discussion. I gave feedback on potential approaches to updating Hive tables with Spark Streaming from events triggered in a Cassandra database. The concern is that the Hive event data changes over time: if a new event is loaded into an existing Hive table, it could break processes that are reading from that table in parallel.
To resolve the issue, I discussed the following potential solutions:
- Use Kafka topics to broker the events. The existing process can load Hive tables based on the events published to the topic, and new events can still be published to Kafka while the job is executing. Once the job completes, subscribe to Kafka again to pick up the new events for the next process run (see the first sketch after this list).
- Use Spark Streaming to load a Hive transactional table, then run a major Hive compaction to rebuild the table for consumption from the delta data loaded in the Hive buckets (second sketch below).
- Create a staging Hive table to store new event data and use Hive HQL to merge the data into the existing table (third sketch below).
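For the Kafka approach, here is a minimal sketch of reading one bounded batch of brokered events and appending it to Hive between process runs, so readers always see a stable table. The broker address, topic name, and table names (`cassandra-events`, `events_db.hive_events`) are assumptions for illustration, not the client's actual configuration.

```scala
import org.apache.spark.sql.SparkSession

object KafkaEventLoader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaEventLoader")
      .enableHiveSupport()
      .getOrCreate()

    // Read one bounded batch of the events published so far; new events keep
    // accumulating in Kafka while the downstream job runs against the stable table.
    val events = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed broker address
      .option("subscribe", "cassandra-events")          // assumed topic name
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING) AS event_key",
                  "CAST(value AS STRING) AS event_payload")

    // Append the batch to the Hive table only between process runs.
    events.write.mode("append").saveAsTable("events_db.hive_events")
  }
}
```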
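For the transactional-table approach, note that writing to Hive ACID tables from Spark typically requires something like the Hive Warehouse Connector, so this sketch shows only the compaction step, issued through HiveServer2 since ACID compaction is Hive DDL rather than Spark SQL. The JDBC URL and table name are assumptions.

```scala
import java.sql.DriverManager

object HiveCompaction {
  def main(args: Array[String]): Unit = {
    // Assumed HiveServer2 endpoint; requires the Hive JDBC driver on the classpath.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hiveserver:10000/events_db")
    val stmt = conn.createStatement()
    // Rebuild base files from the delta files accumulated in each bucket,
    // so downstream consumers read the compacted table.
    stmt.execute("ALTER TABLE hive_events_acid COMPACT 'major'")
    stmt.close()
    conn.close()
  }
}
```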
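For the staging-table approach, a sketch of folding new events into the main table with Hive's MERGE statement (available on ACID tables since Hive 2.2), again executed over JDBC because MERGE is Hive-specific HQL. Table and column names (`hive_events`, `hive_events_staging`, `event_key`, `event_payload`) are hypothetical.

```scala
import java.sql.DriverManager

object StagingMerge {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hiveserver:10000/events_db") // assumed HiveServer2 endpoint
    val stmt = conn.createStatement()
    // Update events that already exist in the target; insert the rest.
    stmt.execute(
      """MERGE INTO hive_events AS t
        |USING hive_events_staging AS s
        |ON t.event_key = s.event_key
        |WHEN MATCHED THEN UPDATE SET event_payload = s.event_payload
        |WHEN NOT MATCHED THEN INSERT VALUES (s.event_key, s.event_payload)
        |""".stripMargin)
    stmt.close()
    conn.close()
  }
}
```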
It was a great week working with the client in Naples, and I also got to enjoy authentic Neapolitan pizza.