Julian to Gregorian: Timestamp Anomalies in Trino-Spark

Introduction

In data processing, accuracy is paramount. But what happens when something as fundamental as timestamp handling begins to falter? This article explores a critical problem affecting numerous data processing tools, illuminating the intricacies of timestamp inconsistencies and suggesting practical remedies.

Issue Summary

Timestamp processing across various tools (Trino, Presto, Snowflake, Ray, AWS Athena, Apache Drill, Druid, etc.) is affected by inaccuracies caused by the deprecated INT96 Parquet datatype, particularly in data created by Apache Spark. The discrepancy arises because Spark rebases timestamps between the Julian and proleptic Gregorian calendars while reading and writing data, notably for dates before the Gregorian calendar cutover of 1582-10-15, leading to inconsistencies in downstream data retrieval and analysis.

Analysis

Timestamps before the Gregorian cutover date of 1582-10-15 are processed inaccurately by Trino and other tools: extra days (and, in some cases, hours, minutes, or seconds) are added to these timestamp values during data retrieval.

SPARK - [screenshot: timestamp value as displayed in Spark]

TRINO - [screenshot: the same timestamp value as displayed in Trino, with extra days added]
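
To reproduce the mismatch without standing up Trino, the hedged sketch below stays entirely inside Spark (the configuration names are the Spark 3.1+ ones; the /tmp path is a placeholder): writing a pre-1582 timestamp as INT96 with the legacy rebase and then reading it back with rebasing disabled approximates what a non-rebasing reader such as Trino sees.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("int96-rebase-demo").getOrCreate()

    # Write a pre-Gregorian timestamp as INT96 with Spark's legacy rebase.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
    spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY")
    spark.sql("SELECT TIMESTAMP '1500-01-01 00:00:00' AS ts") \
        .write.mode("overwrite").parquet("/tmp/int96_demo")

    # Rebase on read (what Spark does): the value round-trips as 1500-01-01.
    spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "LEGACY")
    spark.read.parquet("/tmp/int96_demo").show()

    # No rebase on read (what Trino effectively does): the stored
    # hybrid-calendar value surfaces with extra days added, about nine
    # here (roughly 1500-01-10), matching the screenshots above.
    spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
    spark.read.parquet("/tmp/int96_demo").show()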

This discrepancy arises from the use of the deprecated INT96 datatype for timestamp fields, which is particularly prevalent in data created by Apache Spark. Spark internally uses INT96 to represent timestamps in Parquet files even though the type has been deprecated and marked as non-standard. More details are on the page below:

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

Furthermore, when reading and writing INT96 timestamp values, Spark performs a rebase between the Julian and proleptic Gregorian calendars, which leads to the observed discrepancies. The Spark source code implementing the rebase logic is here:

https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala
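
To see why a rebase exists at all, note that the same calendar-date label denotes different absolute days in the Julian and proleptic Gregorian calendars. The short sketch below, using the standard Julian Day Number conversion formulas, shows the 10-day gap at the cutover:

    def jdn_from_julian(y, m, d):
        # Julian-calendar date -> Julian Day Number (standard formula).
        a = (14 - m) // 12
        yy, mm = y + 4800 - a, m + 12 * a - 3
        return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

    def jdn_from_gregorian(y, m, d):
        # Proleptic-Gregorian date -> Julian Day Number (standard formula).
        a = (14 - m) // 12
        yy, mm = y + 4800 - a, m + 12 * a - 3
        return (d + (153 * mm + 2) // 5 + 365 * yy
                + yy // 4 - yy // 100 + yy // 400 - 32045)

    # The label "1582-10-04" sits 10 days apart in the two calendars, so a
    # reader that skips the rebase is off by up to 10 days around the cutover.
    print(jdn_from_gregorian(1582, 10, 4) - jdn_from_julian(1582, 10, 4))  # -10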

Below is the Parquet community Jira, opened seven years ago, that deprecates INT96 and recommends all major tools migrate to INT64:

https://issues.apache.org/jira/browse/PARQUET-323

Snowflake discusses a similar INT96 timestamp problem in the community post below; the recommendation there is to resolve the issue within the source application:

https://community.snowflake.com/s/article/TIMESTAMP-function-returns-wrong-date-time-value-from-Parquet-file

Addressing Timestamp Inaccuracy in Affected Datasets: A Solution Approach

Re-ingest timestamp data with the INT64 Parquet datatype: identify all datasets with timestamp columns and re-ingest them using INT64 timestamps in Spark. This approach ensures compatibility with the modern timestamp representation and eliminates the need for downstream conversions; a sketch follows below.
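
A minimal re-ingestion sketch, assuming hypothetical source and target paths: reading the INT96 data back through Spark and rewriting it with TIMESTAMP_MICROS stores the timestamps as standard INT64.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("int96-to-int64-reingest")   # placeholder app name
             # Write timestamps as INT64 microseconds instead of INT96.
             .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
             .getOrCreate())

    # Hypothetical paths, for illustration only.
    df = spark.read.parquet("s3://bucket/old_int96_dataset/")
    df.write.mode("overwrite").parquet("s3://bucket/int64_dataset/")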

Alternative approach, not the preferred solution - custom functions (UDFs) in downstream tools: UDFs in Trino could reverse the Julian-to-Gregorian conversion made by Apache Spark for dates before 1582-10-15. However, Trino, like other data processing tools, only provides an interface to build UDFs for tool-specific datatypes; here, the UDF would be built for the Trino timestamp datatype. Since timestamp values in Parquet can be stored as INT96, INT64, or INT32, applying such a UDF in Trino could also affect standard INT64 and INT32 values that need no correction. A sketch of the correction logic follows below.
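
Trino UDFs are implemented in Java against Trino's own types, so the following is only a Python sketch of the correction logic such a UDF would encode: shift pre-cutover timestamps back by the Julian/Gregorian day gap that Spark's write-side rebase introduced. The secular-difference formula below matches the Julian Day Number arithmetic shown earlier but ignores January/February edge cases around century years.

    from datetime import datetime, timedelta

    CUTOVER = datetime(1582, 10, 15)

    def julian_gregorian_gap_days(year):
        # Days by which the Julian calendar trails the proleptic Gregorian
        # calendar: 10 in 1582, shrinking by 3 days per 400 years going back.
        return year // 100 - year // 400 - 2

    def reverse_rebase(ts):
        # Timestamps on or after the cutover are unaffected by the rebase.
        if ts >= CUTOVER:
            return ts
        return ts - timedelta(days=julian_gregorian_gap_days(ts.year))

    print(reverse_rebase(datetime(1582, 10, 14)))  # 1582-10-04 00:00:00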

Given these complexities, any new tool introduced to process this Parquet data would require custom functions to reverse the Julian to Gregorian conversion, ensuring accurate timestamp processing across the board.

Permanent fix

Update default Spark configurations: modify the default Spark configuration to use INT64 instead of the deprecated INT96 datatype for timestamps. With this change, future datasets written by Spark automatically use the standard INT64 timestamp representation, avoiding similar issues going forward.

spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS
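
The setting can be applied per session as below, or cluster-wide through spark-defaults.conf (the snippet assumes an existing SparkSession named spark):

    # Per-session equivalent of the spark-defaults.conf line above.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")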

Conclusion

The inconsistency in timestamp values poses a critical challenge for data analysis and the decision-making processes that depend on it. Addressing this issue is essential to preserving the integrity of data operations: left unfixed, these discrepancies compromise the reliability of data insights and impede informed, strategic decisions.

P.S. You might wonder why datetime values exist before the year 1582. Indeed, there are various reasons for this, which I will cover in forthcoming articles.
