Spark Use Case Discussion in Italy
Last week in Naples, Italy, I engaged in a Spark use case discussion. I gave feedback on potential approaches to updating Hive tables with Spark Streaming from events triggered in a Cassandra database. The concern is that the Hive event data changes over time: if a new event is loaded into an existing Hive table, it could break processes that are reading from that table in parallel.
To resolve the issue, I discussed the following potential solutions:
- Use Kafka topics to broker the events. The existing process can load Hive tables based on the events published to the topic, and new events can still be published to Kafka while the job is executing. Once the job completes, subscribe to Kafka again to pick up the new events for the next process run (see the first sketch after this list).
- Use Spark Streaming to load a Hive transactional table, then run a major Hive compaction to rebuild the table for consumption from the delta data loaded in the Hive buckets (second sketch below).
- Create a staging Hive table to store new event data and use Hive HQL to merge the data into the existing table (third sketch below).
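For the Kafka approach, here is a minimal sketch of reading one bounded batch of brokered events and appending it to Hive between process runs, so readers always see a stable table. The broker address, topic name, and table names (`cassandra-events`, `events_db.hive_events`) are assumptions for illustration, not the client's actual configuration.

```scala
import org.apache.spark.sql.SparkSession

object KafkaEventLoader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaEventLoader")
      .enableHiveSupport()
      .getOrCreate()

    // Read one bounded batch of the events published so far; new events keep
    // accumulating in Kafka while the downstream job runs against the stable table.
    val events = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed broker address
      .option("subscribe", "cassandra-events")          // assumed topic name
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING) AS event_key",
                  "CAST(value AS STRING) AS event_payload")

    // Append the batch to the Hive table only between process runs.
    events.write.mode("append").saveAsTable("events_db.hive_events")
  }
}
```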
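For the transactional-table approach, note that writing to Hive ACID tables from Spark typically requires something like the Hive Warehouse Connector, so this sketch shows only the compaction step, issued through HiveServer2 since ACID compaction is Hive DDL rather than Spark SQL. The JDBC URL and table name are assumptions.

```scala
import java.sql.DriverManager

object HiveCompaction {
  def main(args: Array[String]): Unit = {
    // Assumed HiveServer2 endpoint; requires the Hive JDBC driver on the classpath.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hiveserver:10000/events_db")
    val stmt = conn.createStatement()
    // Rebuild base files from the delta files accumulated in each bucket,
    // so downstream consumers read the compacted table.
    stmt.execute("ALTER TABLE hive_events_acid COMPACT 'major'")
    stmt.close()
    conn.close()
  }
}
```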
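For the staging-table approach, a sketch of folding new events into the main table with Hive's MERGE statement (available on ACID tables since Hive 2.2), again executed over JDBC because MERGE is Hive-specific HQL. Table and column names (`hive_events`, `hive_events_staging`, `event_key`, `event_payload`) are hypothetical.

```scala
import java.sql.DriverManager

object StagingMerge {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hiveserver:10000/events_db") // assumed HiveServer2 endpoint
    val stmt = conn.createStatement()
    // Update events that already exist in the target; insert the rest.
    stmt.execute(
      """MERGE INTO hive_events AS t
        |USING hive_events_staging AS s
        |ON t.event_key = s.event_key
        |WHEN MATCHED THEN UPDATE SET event_payload = s.event_payload
        |WHEN NOT MATCHED THEN INSERT VALUES (s.event_key, s.event_payload)
        |""".stripMargin)
    stmt.close()
    conn.close()
  }
}
```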
It was a great week working with the client in Naples, and I also got to enjoy authentic Neapolitan pizza.