Julian to Gregorian: Timestamp Anomalies in Trino-Spark

Introduction

In data processing, accuracy is paramount. But what happens when something as fundamental as timestamp handling begins to falter? This article explores a critical problem affecting numerous data processing tools, illuminating the intricacies of timestamp inconsistencies and suggesting practical remedies.

Issue Summary

Timestamp processing across various tools (Trino, Presto, Snowflake, Ray, AWS Athena, Apache Drill, Druid, etc.) is affected by inaccuracies caused by the deprecated INT96 Parquet datatype, particularly in data created by Apache Spark. The discrepancy arises because Spark rebases timestamps between the Julian and proleptic Gregorian calendars while reading and writing data, notably for dates before the Gregorian calendar cutover of 1582-10-15, leading to inconsistencies in downstream data retrieval and analysis.

Analysis

Timestamps before the Gregorian cutover date of 1582-10-15 are processed inaccurately by Trino and other tools: extra days (and, in some cases, hours, minutes, or seconds) are added to these timestamp values during data retrieval.

SPARK - [screenshot: timestamp value as displayed in Spark]

TRINO - [screenshot: the same timestamp value as displayed in Trino, with extra days added]
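
To reproduce the mismatch without standing up Trino, the hedged sketch below stays entirely inside Spark (the configuration names are the Spark 3.1+ ones; the /tmp path is a placeholder): writing a pre-1582 timestamp as INT96 with the legacy rebase and then reading it back with rebasing disabled approximates what a non-rebasing reader such as Trino sees.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("int96-rebase-demo").getOrCreate()

    # Write a pre-Gregorian timestamp as INT96 with Spark's legacy rebase.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
    spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY")
    spark.sql("SELECT TIMESTAMP '1500-01-01 00:00:00' AS ts") \
        .write.mode("overwrite").parquet("/tmp/int96_demo")

    # Rebase on read (what Spark does): the value round-trips as 1500-01-01.
    spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "LEGACY")
    spark.read.parquet("/tmp/int96_demo").show()

    # No rebase on read (what Trino effectively does): the stored
    # hybrid-calendar value surfaces with extra days added, about nine
    # here (roughly 1500-01-10), matching the screenshots above.
    spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
    spark.read.parquet("/tmp/int96_demo").show()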

This discrepancy arises from the use of the deprecated INT96 datatype for timestamp fields, which is particularly prevalent in data created by Apache Spark. Spark internally uses INT96 to represent timestamps in Parquet files even though the type has been deprecated and marked as non-standard. More details are on the page below:

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

Furthermore, when reading and writing INT96 timestamp values, Spark performs a rebase between the Julian and proleptic Gregorian calendars, which leads to the observed discrepancies. The Spark source code implementing the rebase logic is here:

https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala
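
To see why a rebase exists at all, note that the same calendar-date label denotes different absolute days in the Julian and proleptic Gregorian calendars. The short sketch below, using the standard Julian Day Number conversion formulas, shows the 10-day gap at the cutover:

    def jdn_from_julian(y, m, d):
        # Julian-calendar date -> Julian Day Number (standard formula).
        a = (14 - m) // 12
        yy, mm = y + 4800 - a, m + 12 * a - 3
        return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

    def jdn_from_gregorian(y, m, d):
        # Proleptic-Gregorian date -> Julian Day Number (standard formula).
        a = (14 - m) // 12
        yy, mm = y + 4800 - a, m + 12 * a - 3
        return (d + (153 * mm + 2) // 5 + 365 * yy
                + yy // 4 - yy // 100 + yy // 400 - 32045)

    # The label "1582-10-04" sits 10 days apart in the two calendars, so a
    # reader that skips the rebase is off by up to 10 days around the cutover.
    print(jdn_from_gregorian(1582, 10, 4) - jdn_from_julian(1582, 10, 4))  # -10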

Below is the Parquet community Jira, opened seven years ago, that deprecates INT96 and recommends all major tools migrate to INT64:

https://issues.apache.org/jira/browse/PARQUET-323

Snowflake discusses a similar INT96 timestamp problem in the community post below; the recommendation there is to resolve the issue within the source application:

https://community.snowflake.com/s/article/TIMESTAMP-function-returns-wrong-date-time-value-from-Parquet-file

Addressing Timestamp Inaccuracy in Affected Datasets: A Solution Approach

Re-ingest timestamp data with the INT64 Parquet datatype: identify all datasets with timestamp columns and re-ingest them using INT64 timestamps in Spark. This approach ensures compatibility with the modern timestamp representation and eliminates the need for downstream conversions; a sketch follows below.
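
A minimal re-ingestion sketch, assuming hypothetical source and target paths: reading the INT96 data back through Spark and rewriting it with TIMESTAMP_MICROS stores the timestamps as standard INT64.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("int96-to-int64-reingest")   # placeholder app name
             # Write timestamps as INT64 microseconds instead of INT96.
             .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
             .getOrCreate())

    # Hypothetical paths, for illustration only.
    df = spark.read.parquet("s3://bucket/old_int96_dataset/")
    df.write.mode("overwrite").parquet("s3://bucket/int64_dataset/")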

Alternative approach, not the preferred solution - custom functions (UDFs) in downstream tools: UDFs in Trino could reverse the Julian-to-Gregorian conversion made by Apache Spark for dates before 1582-10-15. However, Trino, like other data processing tools, only provides an interface to build UDFs for tool-specific datatypes; here, the UDF would be built for the Trino timestamp datatype. Since timestamp values in Parquet can be stored as INT96, INT64, or INT32, applying such a UDF in Trino could also affect standard INT64 and INT32 values that need no correction. A sketch of the correction logic follows below.
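
Trino UDFs are implemented in Java against Trino's own types, so the following is only a Python sketch of the correction logic such a UDF would encode: shift pre-cutover timestamps back by the Julian/Gregorian day gap that Spark's write-side rebase introduced. The secular-difference formula below matches the Julian Day Number arithmetic shown earlier but ignores January/February edge cases around century years.

    from datetime import datetime, timedelta

    CUTOVER = datetime(1582, 10, 15)

    def julian_gregorian_gap_days(year):
        # Days by which the Julian calendar trails the proleptic Gregorian
        # calendar: 10 in 1582, shrinking by 3 days per 400 years going back.
        return year // 100 - year // 400 - 2

    def reverse_rebase(ts):
        # Timestamps on or after the cutover are unaffected by the rebase.
        if ts >= CUTOVER:
            return ts
        return ts - timedelta(days=julian_gregorian_gap_days(ts.year))

    print(reverse_rebase(datetime(1582, 10, 14)))  # 1582-10-04 00:00:00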

Given these complexities, any new tool introduced to process this Parquet data would require custom functions to reverse the Julian to Gregorian conversion, ensuring accurate timestamp processing across the board.

Permanent fix

Update default Spark configurations: modify the default Spark configuration to use INT64 instead of the deprecated INT96 datatype for timestamps. With this change, future datasets written by Spark automatically use the standard INT64 timestamp representation, avoiding similar issues going forward.

spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS
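
The setting can be applied per session as below, or cluster-wide through spark-defaults.conf (the snippet assumes an existing SparkSession named spark):

    # Per-session equivalent of the spark-defaults.conf line above.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")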

Conclusion

The inconsistency in timestamp values poses a critical challenge for data analysis and the decision-making processes that depend on it. Addressing this issue is essential to preserving the integrity of data operations: left unfixed, these discrepancies compromise the reliability of data insights and impede informed, strategic decisions.

P.S. You might wonder why datetime values exist before the year 1582. Indeed, there are various reasons for this, which I will cover in forthcoming articles.
