Julian to Gregorian: Timestamp Anomalies in Trino-Spark
Introduction
In the domain of data processing, accuracy is crucial. But what happens when something as fundamental as timestamp handling begins to falter? This article explores a critical problem affecting numerous data processing tools, illuminating the complexities of timestamp inconsistencies and suggesting practical remedies.
Issue Summary
Timestamp processing across various tools (Trino, Presto, Snowflake, Ray, AWS Athena, Apache Drill, Druid, etc.) is affected by inaccuracies caused by the INT96 Parquet datatype, particularly in data created by Apache Spark. The discrepancy arises because Spark rebases timestamps between the Julian and Gregorian calendars while reading and writing data, notably for dates before the Gregorian calendar switch on 1582-10-15, leading to inconsistencies in downstream data retrieval and analysis.
Analysis
Timestamps before the Gregorian switch date of 1582-10-15 are processed inaccurately by Trino and other tools: extra days (and, depending on the value, hours, minutes, and seconds) are added to these timestamps during data retrieval.
[Screenshots omitted: the same pre-1582 timestamp value as read by Spark and as read by Trino]
This discrepancy stems from the use of the deprecated INT96 datatype for timestamp fields, which is particularly prevalent in data created by Apache Spark. Spark internally uses INT96 to represent timestamps in Parquet files, even though the type has been deprecated and marked as non-standard. More details are on the page below.
Furthermore, when reading and writing INT96 timestamp values, Spark rebases dates between the Julian and Gregorian calendars, which produces the observed discrepancies. The Spark source code below contains the rebase logic.
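The day shift that this rebase introduces can be reproduced in plain Python. The sketch below (an illustration of the calendar arithmetic, not Spark's actual code) converts a Julian-calendar date to the proleptic Gregorian calendar that Python's datetime, and tools like Trino, assume; the gap between the two calendars is exactly the offset that appears when one side rebases and the other does not:

```python
from datetime import date

def julian_to_gregorian(y: int, m: int, d: int) -> date:
    """Convert a Julian-calendar date to the proleptic Gregorian
    calendar used by Python's datetime (and by Trino timestamps)."""
    # Julian Day Number of a Julian-calendar date (standard formula)
    a = (14 - m) // 12
    jy = y + 4800 - a
    jm = m + 12 * a - 3
    jdn = d + (153 * jm + 2) // 5 + 365 * jy + jy // 4 - 32083
    # date.toordinal() day 1 is 0001-01-01 proleptic Gregorian,
    # which corresponds to Julian Day Number 1721426
    return date.fromordinal(jdn - 1721425)

# The last Julian-calendar day maps to 1582-10-14 proleptic Gregorian:
print(julian_to_gregorian(1582, 10, 4))   # 1582-10-14
# Further back the calendars drift by fewer days; around year 1001 it is 6:
print(julian_to_gregorian(1001, 1, 1))    # 1001-01-07
```

So a timestamp written for 1001-01-01 and rebased to the Julian calendar on one side, but read as proleptic Gregorian on the other, surfaces six days late.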
The Parquet community Jira below, filed seven years ago, deprecates INT96 and recommends that all major tools migrate to INT64.
Snowflake, in the community post below, discusses a similar INT96 timestamp problem; its recommendation is to resolve the issue in the source application.
Addressing Timestamp Inaccuracy in Affected Datasets: A Solution Approach
Re-ingest timestamp data with the INT64 Parquet datatype: Identify all datasets containing timestamp columns and re-ingest them from Spark using INT64 timestamps. This ensures compatibility with the modern timestamp representation and eliminates the need for subsequent conversions.
Alternative approach (not the preferred solution): custom functions (UDFs) in downstream tools. A UDF in Trino could reverse the Julian-to-Gregorian conversion that Apache Spark applies to dates before 1582-10-15. However, Trino, like other data processing tools, only exposes interfaces to build UDFs for tool-specific datatypes; here the UDF would operate on the Trino timestamp type. Because Parquet timestamps can be stored as INT96, INT64, or INT32, a UDF applied in Trino would also affect standard INT64 and INT32 values, not just the problematic INT96 ones.
Given these complexities, any new tool introduced to process this Parquet data would require custom functions to reverse the Julian to Gregorian conversion, ensuring accurate timestamp processing across the board.
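To illustrate what such a reversal would do, here is a hypothetical sketch in Python rather than an actual Trino UDF (Trino UDFs are Java plugins): for days before the Gregorian cutover, reinterpret the displayed proleptic-Gregorian date as its Julian-calendar equivalent, recovering the label the writer originally intended. The function name `undo_spark_rebase` is invented for this illustration:

```python
from datetime import date

def gregorian_to_julian(g: date) -> tuple[int, int, int]:
    """Return the Julian-calendar (year, month, day) of the same
    physical day as the proleptic-Gregorian date g (standard
    Julian-Day-Number inversion formula)."""
    jdn = g.toordinal() + 1721425          # Julian Day Number of g
    c = jdn + 32082
    yq = (4 * c + 3) // 1461
    e = c - (1461 * yq) // 4
    mq = (5 * e + 2) // 153
    day = e - (153 * mq + 2) // 5 + 1
    month = mq + 3 - 12 * (mq // 10)
    year = yq - 4800 + mq // 10
    return (year, month, day)

def undo_spark_rebase(t: date) -> date:
    """Hypothetical correction for a date misread from an INT96
    that Spark wrote with the legacy Julian rebase."""
    if t >= date(1582, 10, 15):
        return t                           # modern dates are unaffected
    y, m, d = gregorian_to_julian(t)
    return date(y, m, d)

# A value written as 1001-01-01 surfaces as 1001-01-07; undo the shift:
print(undo_spark_rebase(date(1001, 1, 7)))   # 1001-01-01
print(undo_spark_rebase(date(2020, 5, 1)))   # 2020-05-01 (unchanged)
```

A real Trino UDF would also have to carry the sub-day rebase corrections for historical time zones, which is precisely why this route is brittle compared with re-ingesting the data.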
Permanent Fix
Update default Spark configurations: Modify the default Spark configuration to use INT64 instead of the deprecated INT96 datatype for timestamps. With this adjustment, future datasets written by Spark automatically use the standard INT64 timestamp representation, mitigating the risk of similar issues.
spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS
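For example, as a spark-defaults.conf sketch (the int96RebaseModeIn* settings exist in Spark 3.1+; whether they suit a given deployment depends on which Spark versions wrote the existing data):

```properties
# Write timestamps as standard INT64 microseconds instead of legacy INT96
spark.sql.parquet.outputTimestampType      TIMESTAMP_MICROS
# For any remaining INT96 data, read and write without the Julian rebase
spark.sql.parquet.int96RebaseModeInRead    CORRECTED
spark.sql.parquet.int96RebaseModeInWrite   CORRECTED
```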
Conclusion
Inconsistent timestamp values pose a critical challenge for data analysis and the decision-making processes that depend on it. Given this direct impact, addressing the issue is essential to preserving the integrity of data operations. Left unresolved, these discrepancies could compromise the reliability of data insights and impede informed, strategic decisions.
P.S. You might wonder why datetime values exist before the year 1582. Indeed, there are various reasons for this, which I will cover in forthcoming articles.