Spark ACID Support with Hive

Spark does not support any of Hive's transactional (ACID) table features: you cannot use Spark to update or delete rows in a Hive table, and Spark also has problems reading a consistent view of the data when no compaction has been done.

Reference - https://issues.apache.org/jira/browse/SPARK-15348
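To make the limitation concrete, here is a minimal sketch, assuming Spark 2.x with Hive support enabled; "db.acid_table" stands in for any Hive transactional table:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: db.acid_table is a hypothetical Hive transactional table.
val spark = SparkSession.builder()
  .appName("acid-check")
  .enableHiveSupport()
  .getOrCreate()

// Spark 2.x has no UPDATE/DELETE in its SQL parser, so this line fails
// with a parse error instead of modifying the row:
spark.sql("UPDATE db.acid_table SET amount = 0 WHERE id = 1")

// A plain read does not always fail outright, but before a Hive compaction
// it can return incomplete or incorrect data, because Spark cannot merge
// the uncompacted delta files that Hive writes for ACID tables:
spark.sql("SELECT * FROM db.acid_table").show()
```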

I am going to share my personal experience with this use case.

Some solutions to handle this use case

1. Spark JDBC call

Yes, we can use Spark over a JDBC connection to query Hive ACID tables, but the problem is that the JDBC read then runs in a single partition. Of course, we can use "numPartitions" (together with "partitionColumn", "lowerBound", and "upperBound") to parallelize it, but in my case these bounds needed to be computed dynamically. A sketch of the read is shown below.

(So, I was not able to use this solution.)
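For reference, here is a minimal sketch of the JDBC read, assuming a HiveServer2 endpoint and the Hive JDBC driver on the classpath; the host, table, and column names are hypothetical, and the "lowerBound"/"upperBound" values are exactly the part that would have to be recomputed for every micro batch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

// Hypothetical HiveServer2 endpoint and table; adjust to your cluster.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hiveserver2-host:10000/db")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "acid_table")
  // Without these four options the whole table is read in one partition:
  .option("partitionColumn", "id")   // must be a numeric or date column
  .option("lowerBound", "1")         // min(id) -- has to be kept current
  .option("upperBound", "1000000")   // max(id) -- has to be kept current
  .option("numPartitions", "8")
  .load()

df.show()
```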

2. Hive table compaction

Yes, we can query a Hive ACID table from Spark after the table has been compacted in Hive, because Spark cannot merge the many base and delta ORC files that exist before compaction. The problem here is that my job is a micro batch: even before a compaction completes, my next write starts. If your job is not a micro batch, this option may work for you; triggering a compaction is sketched below.

(So, I was not able to use this solution.)
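For completeness, compaction is a Hive-side operation; here is a minimal sketch of triggering it over Hive JDBC, assuming the hive-jdbc jar is on the classpath (host, database, and table names are hypothetical):

```scala
import java.sql.DriverManager

// Hypothetical HiveServer2 endpoint; compaction is issued in Hive,
// not through Spark.
val conn = DriverManager.getConnection(
  "jdbc:hive2://hiveserver2-host:10000/db", "user", "password")
val stmt = conn.createStatement()

// Queue a major compaction; it runs asynchronously in the metastore,
// which is exactly why a fast micro batch can outrun it.
stmt.execute("ALTER TABLE db.acid_table COMPACT 'major'")

// Optionally watch its progress:
// stmt.executeQuery("SHOW COMPACTIONS")

stmt.close()
conn.close()
```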

3. HWC with LLAP

Hortonworks developed a framework called LLAP, and with the Hive Warehouse Connector (HWC) on top of it we can read ACID tables directly. The problem here is that we need a dedicated YARN queue for all the Spark-LLAP jobs, which means every team using the same cluster has to share one queue, and that is not recommended. A sketch of the HWC read is shown below.

(So, I was not able to use this solution.)
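For reference, a minimal sketch of an HWC read, assuming the HWC jar is on the classpath and the usual HiveServer2/LLAP configs are set for your cluster; the table name is hypothetical:

```scala
import com.hortonworks.hwc.HiveWarehouseSession
import org.apache.spark.sql.SparkSession

// Assumes spark-submit was launched with the HWC jar and the
// spark.sql.hive.hiveserver2.jdbc.url / LLAP settings for your cluster.
val spark = SparkSession.builder().appName("hwc-read").getOrCreate()
val hive = HiveWarehouseSession.session(spark).build()

// executeQuery pushes the query through the LLAP daemons, which is what
// makes ACID reads work -- and what requires the dedicated YARN queue.
val df = hive.executeQuery("SELECT * FROM db.acid_table")
df.show()
```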

4.Delta Lake by Databricks

Delta lake is a new framework, and this will require Spark version 2.4.2 but most the cluster in real time has not yet upgraded to this version as of now.

(So, I was not able use this solution)
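For reference, a minimal sketch of the Delta Lake pattern, assuming Spark 2.4.2+ with the delta-core package on the classpath; the path and data are hypothetical:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Requires Spark 2.4.2+ with io.delta:delta-core on the classpath;
// /data/events is a hypothetical path.
val spark = SparkSession.builder().appName("delta-demo").getOrCreate()
import spark.implicits._

Seq((1, 100.0), (2, 250.0)).toDF("id", "amount")
  .write.format("delta").mode("overwrite").save("/data/events")

// Updates and deletes work natively, which is the whole attraction:
val table = DeltaTable.forPath(spark, "/data/events")
table.update(col("id") === 1, Map("amount" -> lit(0.0)))
table.delete(col("id") === 2)
```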

5. Hive HBase/Cassandra integration

Handle the updates in HBase and expose the data in Hive as a non-ACID table, then query that table via Spark. (This is not recommended when you must perform heavy transformations on the Hive table.) The Hive-side mapping is sketched below.

(So, I was not able to use this solution.)
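For reference, the Hive-side mapping is plain DDL; here is a minimal sketch issued over Hive JDBC, where the table, column family, and endpoint are hypothetical:

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:hive2://hiveserver2-host:10000/db", "user", "password")
val stmt = conn.createStatement()

// External Hive table backed by HBase: updates land in HBase, while
// Hive (and therefore Spark) sees an ordinary non-ACID table.
stmt.execute("""
  CREATE EXTERNAL TABLE db.events_hbase (id STRING, amount DOUBLE)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:amount')
  TBLPROPERTIES ('hbase.table.name' = 'events')
""")

stmt.close()
conn.close()
```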

6.HUDI by Uber

HUDI was developed by Uber engineers, Hudi (Hadoop Upsert Delete and Incremental). We did a POC with this so basically HUDI does all the updates in File system level and we can create a HUDI table in hive and perform the spark activates, But we didn’t use this because of the support and HUDI recreate complete snapshot for every updates.

(So, I was not able use this solution)
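For reference, a minimal sketch of a Hudi upsert, assuming the Hudi Spark bundle is on the classpath; the table name, keys, and path are hypothetical:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumes the Hudi Spark bundle on the classpath; names are hypothetical.
val spark = SparkSession.builder().appName("hudi-upsert").getOrCreate()
import spark.implicits._

val df = Seq((1, 100.0, 1L), (2, 250.0, 1L)).toDF("id", "amount", "ts")

df.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.table.name", "events_hudi")
  // Register the table in the Hive metastore so Spark/Hive can query it:
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.table", "events_hudi")
  .mode(SaveMode.Append)
  .save("/data/events_hudi")
```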

Conclusion

If any of the fixes above solves your problem, then fine; otherwise we can go with the traditional approach of non-ACID Hive tables (this is not what we aim for, but I do not have any other option here). So, we must wait for a new Spark release.

Melbin P.

Product Manager | Autonomous Mobile Robots

1 year ago

wonderful content Gowtham SB

Kavin Manikandan

Senior Big Data Engineer at Mastercard

2 years ago

Hey Gowtham SB - this is a nice outline. I have an issue where HWC is not supported in Spark 3.x. Does Spark 3 support ACID now, or should we use a datasource like Qubole's to achieve ACID? Any thoughts? Thanks!!

Pratyaksh Sharma

Apache Hudi Committer | Presto contributor | Open Source Developer at Ahana

3 years ago

Can you please elaborate on this - "HUDI recreate complete snapshot for every updates"? Updates happen at the file level, and only that file gets rewritten. The Hudi community has grown now, so you can expect all the support that you need. :)
