The Power of the collect_list and collect_set Functions in Spark SQL
Zidan Rocha
Data Analyst | Analytics Engineer | Business Analyst | Power BI | PL-300 | Python | SQL | Databricks | ETL
Have you ever needed to group a client's timeline, or aggregate a client's products into a single row?
In Apache Spark, the collect_list and collect_set functions give you powerful tools to tackle these challenges.
What Are the Functions?
collect_list and collect_set are aggregate functions, used together with GROUP BY, and both are non-deterministic. What does this mean?
Non-deterministic means the result is not guaranteed to be identical every time the query runs: the order of the elements in the returned array depends on the order in which Spark happens to process the rows, which can change between executions. We'll explore how to handle this throughout the article.
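As a minimal, self-contained sketch of this behavior (the client and department values below are invented just for illustration):

-- Without an explicit ordering, the element order in the array simply reflects
-- whatever order Spark processed the rows in; it is not guaranteed, especially
-- once the data is spread across several partitions.
select client, collect_list(department) as departments
from values ('Ana', 'Sales'), ('Ana', 'IT'), ('Ana', 'HR') as t(client, department)
group by client
-- Possible output: ["Sales", "IT", "HR"] -- or the same three values in another order.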
How to Use in SQL?
In this article, we'll use Apache Spark SQL, but the same functions are also available through the PySpark API.
Example Table
We'll use an example table, tb_collect, with four columns: date_comment, client, department, and description_department.
This is an example of the table:
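A minimal sketch that creates a small version of tb_collect so you can follow along; the structure matches the queries below, but all of the sample rows are hypothetical values invented for illustration:

-- Hypothetical structure and sample data; the real table may differ.
create table tb_collect (
  date_comment timestamp,
  client string,
  department string,
  description_department string
);

insert into tb_collect values
  (timestamp'2024-01-05 09:00:00', 'Client A', 'Sales', 'sales follow-up call'),
  (timestamp'2024-01-05 11:30:00', 'Client A', 'Support', 'opened a support ticket'),
  (timestamp'2024-01-05 11:30:00', 'Client A', 'Support', 'opened a support ticket'),
  (timestamp'2024-01-06 10:15:00', 'Client B', 'Sales', 'sales proposal sent');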
Collect List
Both functions return an array. To use collect_list, you first need to prepare your table. Why? Because the function is non-deterministic, we need to order the rows by the column that should define the order of the elements in the array before aggregating. In this case, we order by date_comment, so the earliest date appears first.
with order_table as (
  select *
  from tb_collect
  order by date_comment asc
)
select
  to_date(date_comment) as date_comment
  ,client
  ,concat_ws(" | ", collect_list(department)) as department
  ,concat_ws(" ", collect_list(description_department)) as description_department
from order_table
group by all
By grouping on date_comment and client and using collect_list, we aggregate the department and description_department values into a single row per group. The result is:

This code also uses another function, concat_ws, which takes a separator as its first parameter, followed by the strings (or an array of strings) to combine into one string.
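If concat_ws is new to you, here is a standalone sketch with made-up literal values:

-- concat_ws joins the elements of an array (or a list of strings) with the separator
select concat_ws(" | ", array('Sales', 'Support', 'IT')) as departments
-- Returns: Sales | Support | IT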
Handling Duplicates: Collect Set
When we aggregate values, the source table may contain duplicate rows, so even after the GROUP BY the array built by collect_list can end up with repeated values. For this scenario, we use collect_set.
collect_set has the same structure as collect_list but doesn't return duplicate values. Here’s an example:
with order_table as (
  select *
  from tb_collect
  order by date_comment asc
)
select
  client
  ,concat_ws(" | ", collect_set(to_date(date_comment))) as date_comments
  ,concat_ws(" | ", collect_set(department)) as departments
  ,concat_ws(" ", collect_set(description_department)) as description_department
from order_table
group by all
Here’s our result:
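To see the difference between the two functions in isolation, here is a minimal sketch using inline, made-up values:

select
  collect_list(department) as with_duplicates
  ,collect_set(department) as without_duplicates
from values ('Sales'), ('Sales'), ('Support') as t(department)
-- collect_list keeps every value, e.g. ["Sales", "Sales", "Support"];
-- collect_set drops the duplicates, e.g. ["Sales", "Support"], but in no guaranteed order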
The Problem with Collect Set
As we know, this function is non-deterministic. So when collect_set removes duplicates, the ordering we prepared earlier is not respected: there is no guarantee whether the surviving value corresponds to the first or the last occurrence, and the order of the resulting array is not guaranteed either, because there is no condition dictating it.
Solving the Collect Set Problem
To remove duplicate values while keeping the order, use array_distinct together with collect_list: collect_list preserves the order established in the CTE, and array_distinct then keeps the first occurrence of each value and removes the rest.
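A quick way to see this behavior is to call array_distinct on a literal array (made-up values):

select array_distinct(array('Sales', 'Support', 'Sales', 'IT')) as departments
-- Returns ["Sales", "Support", "IT"]: the first occurrence of each value is kept, in order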
with order_table as (
  select *
  from tb_collect
  group by all
  order by date_comment asc
)
select
  client
  ,concat_ws(" | ", array_distinct(collect_list(to_date(date_comment)))) as date_comments
  ,concat_ws(" | ", array_distinct(collect_list(department))) as departments
  ,concat_ws(" ", array_distinct(collect_list(description_department))) as description_department
from order_table
group by all
Result:
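As a side note, if you only need a de-duplicated list and alphabetical order is acceptable (rather than the original event order), one sketch of an alternative is to sort the set explicitly, which also makes the output deterministic:

select
  client
  ,concat_ws(" | ", sort_array(collect_set(department))) as departments
from tb_collect
group by client
-- sort_array orders the array, so the same input always produces the same output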
Bonus - Using Collect List with CASE WHEN
You can use collect_list with a CASE WHEN expression to collect only the values that meet a condition. For example, if you want only the comments that contain the word "sales":
with order_table as (
  select *
  from tb_collect
  group by all
  order by date_comment asc
)
select
  to_date(date_comment) as date_comment
  ,client
  ,trim(concat_ws(" ", collect_list(case when lower(description_department) like "%sales%" then description_department else null end))) as description_department
from order_table
group by all
having description_department <> ""
order by all
Result:
Here, we use a HAVING clause because, for groups where no comment matches the condition, the CASE WHEN expression returns only NULLs; collect_list ignores NULLs, so concat_ws produces an empty string for those groups, and HAVING filters them out.
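A minimal sketch of that edge case, using made-up values that never match the condition:

select concat_ws(" ", collect_list(case when lower(x) like "%sales%" then x end)) as description_department
from values ('support ticket opened'), ('hr onboarding done') as t(x)
-- Neither value contains "sales", so the CASE returns NULL for every row;
-- collect_list ignores NULLs, the array is empty, and concat_ws returns ''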
I hope this article helps you in your analyses.