Common HiveQL to BigQuery Migration Errors: A Detailed Exploration - Part 2

Common HiveQL to BigQuery Migration Errors: A Detailed Exploration - Part 2

Intro

In the last post we covered four common errors that we have had to contend with during our time migrating HiveQL to BigQuery during a data warehouse migration. There we covered ranking issues, issues with using the TRIM function, size of NULL arrays, and issues with the Gregorian calendar reform. In this post we will continue with a few more examples.?

Using Arrays in Group By

In Hive it is possible to use an array in a GROUP BY statement.?


```
SELECT 
    name_array,
    MAX(age)
FROM
    source
GROUP BY
    name_array
```
        

However, in BigQuery this is not possible. Thus, a workaround is needed. One solution is to cast the array to a JSON string, as strings can be used in GROUP BY. You can wrap this in a CTE or subquery, and then convert the JSON string back to the original array afterwards

```
WITH group_by_helper AS (
SELECT 
    TO_JSON_STRING(name_array) AS name_array
    , MAX(age) AS max_age
FROM
    source
GROUP BY
    TO_JSON_STRING(name_array)
)
SELECT
    JSON_EXTRACT_ARRAY(name_array) AS name_array
    , max_age
FROM
    group_by_helper
```        

Using Aggregation Functions on STRUCT/RECORD

In Hive, it is possible to apply a function such as MIN OR MAX on a STRUCT. If we use MIN as an example, then the minimum STRUCT is determined based on the values of the sub-fields, where the first sub-field has precedence. So if we have two STRUCTs as such:

```
WITH two_structs AS (
SELECT named_struct(“name”, “abe”, “age”, 20, “height”, 170)
UNION ALL
SELECT named_struct(“name”, “barry”, “age”, 18, “height”, 180)
)
```        

Then the MIN of the two would return


```
{"name":"abe","age":20,”height”:170}
```        

Because abe < barry. Now if we change the data such that the first field has a tie

```
WITH two_structs AS (
SELECT named_struct(“name”, “barry”, “age”, 20, “height”, 170)
UNION ALL
SELECT named_struct(“name”, “barry”, “age”, 18, “height”, 180)
)
```        

Then the MIN would return?

```
{"name":"barry","age":18,“height”, 180}
```        

This is because the first field was not enough to determine the minimum, and thus the second field is used.

However, BigQuery does not allow STRUCT fields to be used in aggregation functions, and thus a workaround is needed. One possible solution is to apply the same ranking logic using the ROW_NUMBER() function.


```
WITH two_structs AS (
    SELECT STRUCT(“abe” AS name, 20 AS age, 170 as height) AS info
        UNION ALL
    SELECT STRUCT(“barry” AS name, 18 AS age, 180 as height) AS info
)
, ranked AS (
SELECT 
    info
    , ROW_NUMBER() OVER(ORDER BY info.name, info.age, info.height) as rn
FROM
    two_structs
)
SELECT
    info
FROM
    ranked
WHERE
    rn = 1
```        

From the above you would get the following result


Replicating MONTHS_BETWEEN in BigQuery

In Hive there is a function called MONTHS_BETWEEN. This function calculates a floating point value of the number of months between two dates. For example, if two dates are exactly a half month apart, then it will return 0.5. In BigQuery no such function exists, as the closest option is DATE_DIFF, which treats the difference in months as an integer, and thus if any date is within a different month from the original date, even if only 2 days apart (e.g. 2024-10-30 and 2024-11-1). In order to mimic the original behaviour, we wrote a custom user-defined function (UDF).

```
CREATE TEMP FUNCTION CUSTOM_MONTHS_DIFF(d1 DATETIME, d2 DATETIME) AS (
 (
     SELECT (EXTRACT(YEAR FROM d1) - EXTRACT(YEAR FROM d2)) * 12
     + (EXTRACT(MONTH FROM d1) - EXTRACT(MONTH FROM d2))
     + (
         (EXTRACT(DAY FROM d1) * 86400.0 + EXTRACT(HOUR FROM d1) * 3600 + EXTRACT(MINUTE FROM d1) * 60 + EXTRACT(SECOND FROM d1))
         - (EXTRACT(DAY FROM d2) * 86400.0 + EXTRACT(HOUR FROM d2) * 3600 + EXTRACT(MINUTE FROM d2) * 60 + EXTRACT(SECOND FROM d2))
     ) / 2678400.0
 )
);
```        

Conclusion

While there are more small issues that we have encountered along the way of migrating HiveQL to BigQuery, we simply cannot cover every single one of them. We hope that these examples from parts one and two will be enough to give you an idea of the types of issues you could face, and how to solve them. Even better if they addressed the exact problems that you are currently stuck on.

要查看或添加评论,请登录

社区洞察

其他会员也浏览了