The Power of the collect_list and collect_set Functions in Spark SQL
Zidan Rocha
Data Analyst | Analytics Engineer | Business Analyst | Power BI | PL-300 | Python | SQL | Databricks | ETL
Have you ever needed to group a client's timeline, or aggregate a client's products into a single row?
In Apache Spark, the collect_list and collect_set functions give you powerful tools to tackle these challenges.
What Are the Functions?
collect_list and collect_set are aggregate functions, used together with GROUP BY, and both are non-deterministic. What does this mean?
Non-deterministic means the result is not guaranteed to be identical every time the query runs: the order of the elements in the returned array depends on the order in which Spark happens to process the rows, which can change between executions. We'll explore how to handle this throughout the article.
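As a minimal, self-contained sketch of this behavior (the client and department values below are invented just for illustration):

-- Without an explicit ordering, the element order in the array simply reflects
-- whatever order Spark processed the rows in; it is not guaranteed, especially
-- once the data is spread across several partitions.
select client, collect_list(department) as departments
from values ('Ana', 'Sales'), ('Ana', 'IT'), ('Ana', 'HR') as t(client, department)
group by client
-- Possible output: ["Sales", "IT", "HR"] -- or the same three values in another order.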
How to Use in SQL?
In this article, we'll use Apache Spark SQL, but the same functions are also available through the PySpark API.
Example Table
We'll use an example table, tb_collect, with four columns: date_comment, client, department, and description_department.
This is an example of the table:
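A minimal sketch that creates a small version of tb_collect so you can follow along; the structure matches the queries below, but all of the sample rows are hypothetical values invented for illustration:

-- Hypothetical structure and sample data; the real table may differ.
create table tb_collect (
  date_comment timestamp,
  client string,
  department string,
  description_department string
);

insert into tb_collect values
  (timestamp'2024-01-05 09:00:00', 'Client A', 'Sales', 'sales follow-up call'),
  (timestamp'2024-01-05 11:30:00', 'Client A', 'Support', 'opened a support ticket'),
  (timestamp'2024-01-05 11:30:00', 'Client A', 'Support', 'opened a support ticket'),
  (timestamp'2024-01-06 10:15:00', 'Client B', 'Sales', 'sales proposal sent');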
Collect List
Both functions return an array. To use collect_list, you first need to prepare your table. Why? Because the function is non-deterministic, we need to order the rows by the column that should define the order of the elements in the array before aggregating. In this case, we order by date_comment, so the earliest date appears first.
with order_table as (
  select *
  from tb_collect
  order by date_comment asc
)
select
  to_date(date_comment) as date_comment
  ,client
  ,concat_ws(" | ", collect_list(department)) as department
  ,concat_ws(" ", collect_list(description_department)) as description_department
from order_table
group by all
By grouping on date_comment and client and using collect_list, we aggregate the department and description_department values into a single row per group. The result is:

This code also uses another function, concat_ws, which takes a separator as its first parameter, followed by the strings (or an array of strings) to combine into one string.
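If concat_ws is new to you, here is a standalone sketch with made-up literal values:

-- concat_ws joins the elements of an array (or a list of strings) with the separator
select concat_ws(" | ", array('Sales', 'Support', 'IT')) as departments
-- Returns: Sales | Support | IT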
Handling Duplicates: Collect Set
When we aggregate values, the source table may contain duplicate rows, so even after the GROUP BY the array built by collect_list can end up with repeated values. For this scenario, we use collect_set.
collect_set has the same structure as collect_list but doesn't return duplicate values. Here’s an example:
with order_table as (
  select *
  from tb_collect
  order by date_comment asc
)
select
  client
  ,concat_ws(" | ", collect_set(to_date(date_comment))) as date_comments
  ,concat_ws(" | ", collect_set(department)) as departments
  ,concat_ws(" ", collect_set(description_department)) as description_department
from order_table
group by all
Here’s our result:
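To see the difference between the two functions in isolation, here is a minimal sketch using inline, made-up values:

select
  collect_list(department) as with_duplicates
  ,collect_set(department) as without_duplicates
from values ('Sales'), ('Sales'), ('Support') as t(department)
-- collect_list keeps every value, e.g. ["Sales", "Sales", "Support"];
-- collect_set drops the duplicates, e.g. ["Sales", "Support"], but in no guaranteed order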
The Problem with Collect Set
As we know, this function is non-deterministic. So when collect_set removes duplicates, the ordering we prepared earlier is not respected: there is no guarantee whether the surviving value corresponds to the first or the last occurrence, and the order of the resulting array is not guaranteed either, because there is no condition dictating it.
Solving the Collect Set Problem
To remove duplicate values while keeping the order, use array_distinct together with collect_list: collect_list preserves the order established in the CTE, and array_distinct then keeps the first occurrence of each value and removes the rest.
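A quick way to see this behavior is to call array_distinct on a literal array (made-up values):

select array_distinct(array('Sales', 'Support', 'Sales', 'IT')) as departments
-- Returns ["Sales", "Support", "IT"]: the first occurrence of each value is kept, in order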
with order_table as (
  select *
  from tb_collect
  group by all
  order by date_comment asc
)
select
  client
  ,concat_ws(" | ", array_distinct(collect_list(to_date(date_comment)))) as date_comments
  ,concat_ws(" | ", array_distinct(collect_list(department))) as departments
  ,concat_ws(" ", array_distinct(collect_list(description_department))) as description_department
from order_table
group by all
Result:
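As a side note, if you only need a de-duplicated list and alphabetical order is acceptable (rather than the original event order), one sketch of an alternative is to sort the set explicitly, which also makes the output deterministic:

select
  client
  ,concat_ws(" | ", sort_array(collect_set(department))) as departments
from tb_collect
group by client
-- sort_array orders the array, so the same input always produces the same output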
Bonus - Using Collect List with CASE WHEN
You can use collect_list with a CASE WHEN expression to collect only the values that meet a condition. For example, if you want only the comments that contain the word "sales":
with order_table as (
  select *
  from tb_collect
  group by all
  order by date_comment asc
)
select
  to_date(date_comment) as date_comment
  ,client
  ,trim(concat_ws(" ", collect_list(case when lower(description_department) like "%sales%" then description_department else null end))) as description_department
from order_table
group by all
having description_department <> ""
order by all
Result:
Here, we use a HAVING clause because, for groups where no comment matches the condition, the CASE WHEN expression returns only NULLs; collect_list ignores NULLs, so concat_ws produces an empty string for those groups, and HAVING filters them out.
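A minimal sketch of that edge case, using made-up values that never match the condition:

select concat_ws(" ", collect_list(case when lower(x) like "%sales%" then x end)) as description_department
from values ('support ticket opened'), ('hr onboarding done') as t(x)
-- Neither value contains "sales", so the CASE returns NULL for every row;
-- collect_list ignores NULLs, the array is empty, and concat_ws returns ''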
I hope this article helps you in your analyses.