Getting ISO year right in PySpark
Part of working with data and dates is the age-old question "When does the first week of the year start?" In PySpark we can use the weekofyear function, which is ISO 8601 compliant, but what if you want the correct year according to ISO 8601? That was more of a hassle than I thought!
I'm using Databricks Runtime 13.3 LTS with Spark 3.4.1, so the behavior of the functions used below might change in newer versions and make this issue obsolete.
The issue
Imagine that you have data at date granularity and you need to derive a week number and year from the date to group on. You will run into issues if you compute the year using the standard PySpark year or date_format functions and the week number using the weekofyear function.
The following code produces a handful of dates spanning the year boundary (roughly the 29th of December to the 2nd of January; the exact days shift slightly because of leap years) for the years 2020 to 2025 to illustrate the issue.
%python
from pyspark.sql.functions import (
    col,
    year,
    month,
    weekofyear,
    date_format,
    when,
    lit,
    sum,
    udf,
)
from pyspark.sql.types import DateType, IntegerType
from datetime import datetime, timedelta

# Create a handful of dates around each year boundary for the past 5 years
# (subtracting 365 * i means the exact days drift by a day across leap years)
start_date = datetime(2024, 12, 29)
dates = [
    start_date - timedelta(days=365 * i) + timedelta(days=j)
    for i in range(5)
    for j in range(4)
]

# Create a DataFrame from the list of dates
dates_df = spark.createDataFrame([(date,) for date in dates], ["date"])

# Convert the date column to DateType
end_of_year = dates_df.withColumn("date", dates_df["date"].cast(DateType()))

display(end_of_year.orderBy("date"))
This produces a nice little DataFrame with the date ranges for each year; of special interest are the dates 2021-01-01, 2021-01-02, 2024-12-30, and 2024-12-31.
Let's extract the year from the 'date' column using the built-in year() and date_format() functions, and the week number using the built-in weekofyear() function. We'll also add a 'count' column to aggregate on later.
# Extract year and week number using built-in Spark functions
end_of_year_non_iso = (
    end_of_year.withColumn("year_standard", year("date"))
    .withColumn("year_date_format", date_format("date", "yyyy"))
    .withColumn("week_number", weekofyear("date"))
    .withColumn("count", lit(1))
)
display(end_of_year_non_iso.orderBy("date"))
If we take a closer look at 2024-12-30, we'll see that something isn't right as the year is set to 2024 but the week number is set to 1. The same goes for 2024-12-31.
But according to EpochConverter, week 1 of 2025 starts on the 30th of December 2024, meaning the data gets assigned to the wrong year.
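The Python standard library agrees. As a quick sanity check outside of Spark (not part of the pipeline, just a verification of the claim):
from datetime import date

# isocalendar() returns (ISO year, ISO week number, ISO weekday)
print(date(2024, 12, 30).isocalendar())  # ISO year 2025, week 1 (a Monday)
print(date(2021, 1, 1).isocalendar())    # ISO year 2020, week 53 (a Friday)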
This is the crux of the issue: the year() and date_format() functions don't comply with the ISO 8601 standard while weekofyear() does. As far as I can tell, this isn't explained in the Apache Spark documentation and there is no way to force the two non-compliant functions to adhere to ISO 8601.
The fix
There are two ways to make the year ISO 8601 compliant: creating a User Defined Function (UDF) around Python's isocalendar() (or a similar library), or building custom logic with native PySpark functions.
Both come with tradeoffs. UDFs can introduce performance overhead because they require serialization and deserialization of data between the JVM and Python, so for better performance it is recommended to use built-in Spark functions whenever possible. The custom logic, on the other hand, might not produce the correct ISO-compliant year for every date; it works on the ones I have tested, but there might be edge cases.
User Defined Function
Using Python's isocalendar() function is pretty easy, but since it is not native to PySpark it must be wrapped in a UDF to work on a DataFrame column. The function returns a tuple containing the ISO year, ISO week number, and ISO weekday, so we only need to extract the first element using [0].
# Extract year using a UDF

# Define a UDF to extract the ISO year using isocalendar()
def get_iso_year(dt):
    return dt.isocalendar()[0]

iso_year_udf = udf(get_iso_year, IntegerType())

end_of_year_udf = (
    end_of_year.withColumn("year_iso_calendar", iso_year_udf(col("date")))
    .withColumn("week_number", weekofyear("date"))
    .withColumn("count", lit(1))
)
display(end_of_year_udf.orderBy("date"))
Now the year is set to 2025 for 2024-12-30 and 2024-12-31, and to 2020 for 2021-01-01 and 2021-01-02, which complies with ISO 8601.
Keep in mind the potential performance degradation caused by using a UDF. It won't be noticeable on a small dataset like this, but if you are working with billions of rows you will surely feel it.
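If the overhead does become a problem, a vectorized pandas UDF is a reasonable middle ground: it is still a UDF, but it processes whole batches through Arrow instead of single rows. This is only a minimal sketch, assuming pandas and PyArrow are available (they are on Databricks Runtime 13.3); the function and column names are my own.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf(IntegerType())
def iso_year_pandas(dates: pd.Series) -> pd.Series:
    # Depending on the Spark/Arrow version the dates may arrive as datetime.date objects,
    # so normalize to datetime64 before using the .dt accessor
    return pd.to_datetime(dates).dt.isocalendar().year.astype("int32")

end_of_year_pandas = end_of_year.withColumn("year_iso_calendar", iso_year_pandas(col("date")))
display(end_of_year_pandas.orderBy("date"))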
Custom logic
The guard against that performance degradation is custom logic that only uses native PySpark functions. The logic combines the year(), month(), and weekofyear() functions to catch two situations: if the date's month is December but the week number is 1, year() produces a year that needs to be incremented by 1; and if the date's month is January but the week number is 52 or above, year() produces a year that needs to be decremented by 1. In every other case, year() is already correct.
# Extract year using custom logic
end_of_year_custom = (
    end_of_year.withColumn(
        "year_iso_manual_logic",
        when(
            (weekofyear(col("date")) == 1) & (month(col("date")) == 12),
            year(col("date")) + 1,
        )
        .when(
            (weekofyear(col("date")) >= 52) & (month(col("date")) == 1),
            year(col("date")) - 1,
        )
        .otherwise(year(col("date"))),
    )
    .withColumn("week_number", weekofyear("date"))
    .withColumn("count", lit(1))
)
display(end_of_year_custom.orderBy("date"))
The logic now correctly sets the year to 2025 for 2024-12-30 and 2024-12-31, and to 2020 for 2021-01-01 and 2021-01-02, complying with ISO 8601.
Closing remarks
While I haven't tested the custom logic against every possible date for edge cases, it should be the best way to produce ISO 8601-compliant years until that functionality is built into the native year() and date_format() Spark functions (maybe as a flag or parameter).
Grouping the DataFrames makes it apparent why taking the ISO year into account matters: not doing so produces incorrect aggregations, in this case the sum() of the 'count' column.
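If you want to convince yourself before relying on it, one option is to brute-force a comparison against the isocalendar() UDF from earlier over a longer date range. A sketch, assuming the imports and iso_year_udf defined above are still in scope; adjust the range to whatever your data covers.
from datetime import date, timedelta

# Every single day from 2015 through 2029
n_days = (date(2030, 1, 1) - date(2015, 1, 1)).days
all_days_df = spark.createDataFrame(
    [(date(2015, 1, 1) + timedelta(days=i),) for i in range(n_days)], ["date"]
)

check_df = all_days_df.withColumn(
    "iso_year_expected", iso_year_udf(col("date"))
).withColumn(
    "iso_year_custom",
    when((weekofyear(col("date")) == 1) & (month(col("date")) == 12), year(col("date")) + 1)
    .when((weekofyear(col("date")) >= 52) & (month(col("date")) == 1), year(col("date")) - 1)
    .otherwise(year(col("date"))),
)

# An empty result means the custom logic matches isocalendar() for every date in the range
display(check_df.filter(col("iso_year_expected") != col("iso_year_custom")))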
# Group by year_standard and week_number, and sum the count
grouped_end_of_year_non_iso = end_of_year_non_iso.groupBy("year_standard", "week_number").agg(
    sum("count").alias("total_count")
)
display(grouped_end_of_year_non_iso.orderBy("year_standard", "week_number"))
This produces a DataFrame where the counts for 2024-12-30, 2024-12-31, 2021-01-01, and 2021-01-02 are grouped under the wrong year/week combination. It also creates a week 53 in 2021, which doesn't exist in the ISO calendar.
Now let's group using the ISO-compliant year column from the custom logic instead.
# Group by year_iso_manual_logic and week_number, and sum the count
grouped_end_of_year_custom = end_of_year_custom.groupBy("year_iso_manual_logic", "week_number").agg(
    sum("count").alias("total_count")
)
display(grouped_end_of_year_custom.orderBy("year_iso_manual_logic", "week_number"))
It will assign the counts of 2024-12-30, 2024-12-31, 2021-01-01, and 2021-01-02 to the correct year and week, thus giving the correct aggregates.