The Power of the Functions in Spark SQL

Have you ever needed to group a client's timeline, or aggregate a client's products into a single row?

In Apache Spark, the collect_list and collect_set functions are powerful tools for tackling exactly these challenges.

What Are the Functions?

collect_list and collect_set are aggregate functions, and both are non-deterministic. What does this mean?

Non-deterministic means that the order of the elements in the resulting array is not guaranteed: the same query can return the same values arranged differently from one run to the next, because nothing constrains the order in which rows reach the aggregation. We'll explore how to handle this throughout this article.
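
As a minimal sketch of what this means in practice (the inline dataset below is invented purely for illustration), the query always returns the same three values, but the order in which they appear inside the array is not guaranteed from one run to the next:

select
  client
  ,collect_list(department) as departments
from values (1, 'sales'), (1, 'finance'), (1, 'hr') as t(client, department)
group by client
-- departments may come back as ["sales", "finance", "hr"] on one run
-- and in a different order on another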

How to Use Them in SQL?

In this article, we'll use Spark SQL, but the same functions are also available through PySpark.

Example Table

We’ll use an example table with four columns:

  • date_comment: Date of the comment
  • client: ID of the client
  • department: Department of the company
  • description_department: Comment made by a department representative

This is an example of the table:
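
If you want to follow along, a minimal, hypothetical version of tb_collect (the table used in all the queries below) can be created like this; the dates, client IDs, departments, and comments are invented for illustration:

create or replace temporary view tb_collect as
select * from values
  ('2024-01-10', 101, 'sales', 'Client asked about the new sales campaign.'),
  ('2024-01-10', 101, 'finance', 'Invoice sent to the client.'),
  ('2024-01-12', 101, 'sales', 'Client confirmed the sales order.'),
  ('2024-01-11', 202, 'support', 'Client reported a login issue.'),
  ('2024-01-11', 202, 'support', 'Client reported a login issue.')
as tb_collect(date_comment, client, department, description_department)

The last row is duplicated on purpose, so the collect_set examples later in the article have something to deduplicate.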

Collect List

The function returns an array. To use collect_list, you first need to prepare your table. Why? Because the function is non-deterministic, you need to order the rows by the column that should drive the order of the array elements before aggregating. In this case, we order by date_comment so that the earliest comment appears first.

with order_table as (
  select
    *
  from tb_collect
  order by date_comment asc
)
select
  to_date(date_comment) as date_comment
  ,client
  ,concat_ws(" | ",collect_list(department)) as department
  ,concat_ws(" ", collect_list(description_department)) as description_department
from order_table
group by all

Grouping by date_comment and client, collect_list gathers each client's departments and comments into arrays; the result is shown below. The code also uses another function, concat_ws, which takes a separator as its first argument followed by the values to combine; when given the array returned by collect_list, it joins the elements into a single string.
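
As a quick standalone sketch of concat_ws on its own (the array literal is invented just for illustration):

select concat_ws(" | ", array('sales', 'finance', 'hr')) as departments
-- returns "sales | finance | hr"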

Handling Duplicates: Collect Set

If the source table contains duplicate rows, collect_list carries those duplicates straight into the aggregated array. For this scenario, we use collect_set.

collect_set has the same structure as collect_list but doesn't return duplicate values. Here’s an example:

with order_table as (
  select
    *
  from tb_collect
  order by date_comment asc
)
select
  client
  ,concat_ws(" | ",collect_set(to_date(date_comment))) as date_comments
  ,concat_ws(" | ",collect_set(department)) as departments
  ,concat_ws(" ",collect_set(description_department)) as description_department
from order_table
group by all        

Here’s our result:

The Problem with Collect Set

As we know, this function is non-deterministic. So when collect_set removes duplicates, you cannot say whether the first or the last occurrence is the one that survives, and the order of the elements in the resulting array is arbitrary, because a set has no notion of order.

Solving the Collect Set Problem

To remove duplicate values while keeping the order, use array_distinct on top of collect_list. array_distinct removes duplicate elements from an array while preserving the order in which they first appear, so the first occurrence is kept and the later ones are dropped.

with order_table as (
  select
    *
  from tb_collect
  group by all
  order by date_comment asc
)
select
  client
  ,concat_ws(" | ",array_distinct(collect_list(to_date(date_comment)))) as date_comments
  ,concat_ws(" | ",array_distinct(collect_list(department))) as departments
  ,concat_ws(" ",array_distinct(collect_list(description_department))) as description_department
from order_table
group by all        

Result:

Bonus - Using Collect List with CASE WHEN

You can use collect_list with a CASE WHEN statement to collect only the values that match a condition. For example, if you want only the comments that contain the word "sales":

with order_table as (
  select
    *
  from tb_collect
  group by all
  order by date_comment asc
)
select
  to_date(date_comment) as date_comment
  ,client
  ,trim(concat_ws(" ",collect_list(case when lower(description_department) like "%sales%" then description_department else null end))) as description_department
from order_table 
group by all
having description_department <> ""
order by all        

Result:

Here, we use a HAVING clause because for clients whose comments never match the condition, the CASE WHEN returns only nulls; collect_list ignores nulls, so concat_ws produces an empty string for those rows, and HAVING filters them out.
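
As a minimal sketch of that behavior (inline data invented for illustration), an all-null group collapses into an empty array, which concat_ws turns into an empty string:

select concat_ws(" ", collect_list(x)) as joined
from values (cast(null as string)), (cast(null as string)) as t(x)
-- joined = "" (collect_list skipped both nulls)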

I hope this article helps you in your analyses.
