Common HiveQL to BigQuery Migration Errors: A Detailed Exploration - Part 2
Intro
In the last post we covered four common errors that we have had to contend with while migrating HiveQL to BigQuery as part of a data warehouse migration: ranking issues, issues with the TRIM function, the size of NULL arrays, and issues caused by the Gregorian calendar reform. In this post we will continue with a few more examples.
Using Arrays in Group By
In Hive it is possible to use an array in a GROUP BY statement.
```
SELECT
name_array,
MAX(age)
FROM
source
GROUP BY
name_array
```
However, in BigQuery this is not possible, so a workaround is needed. One solution is to convert the array to a JSON string, since strings can be used in GROUP BY. You can wrap this in a CTE or subquery, and then convert the JSON string back to the original array afterwards.
```
WITH group_by_helper AS (
SELECT
TO_JSON_STRING(name_array) AS name_array
, MAX(age) AS max_age
FROM
source
GROUP BY
TO_JSON_STRING(name_array)
)
SELECT
JSON_EXTRACT_ARRAY(name_array) AS name_array
, max_age
FROM
group_by_helper
```
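One caveat worth noting as a possible refinement: JSON_EXTRACT_ARRAY returns the elements as JSON-formatted strings, so string values keep their surrounding quotes. If name_array was originally an array of plain strings, swapping the final SELECT for JSON_VALUE_ARRAY should give back the unquoted values:
```
SELECT
JSON_VALUE_ARRAY(name_array) AS name_array
, max_age
FROM
group_by_helper
```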
Using Aggregation Functions on STRUCT/RECORD
In Hive, it is possible to apply a function such as MIN or MAX to a STRUCT. If we use MIN as an example, the minimum STRUCT is determined based on the values of the sub-fields, with the first sub-field taking precedence. So if we have two STRUCTs like this:
```
WITH two_structs AS (
SELECT named_struct("name", "abe", "age", 20, "height", 170)
UNION ALL
SELECT named_struct("name", "barry", "age", 18, "height", 180)
)
```
Then the MIN of the two would return
```
{"name":"abe","age":20,”height”:170}
```
This is because abe < barry. Now, if we change the data so that there is a tie on the first field:
```
WITH two_structs AS (
SELECT named_struct("name", "barry", "age", 20, "height", 170)
UNION ALL
SELECT named_struct("name", "barry", "age", 18, "height", 180)
)
```
Then the MIN would return
```
{"name":"barry","age":18,“height”, 180}
```
This is because the first field was not enough to determine the minimum, and thus the second field is used.
However, BigQuery does not allow STRUCTs to be used in aggregation functions such as MIN or MAX, so a workaround is needed. One possible solution is to apply the same ranking logic using the ROW_NUMBER() function.
```
WITH two_structs AS (
SELECT STRUCT("abe" AS name, 20 AS age, 170 AS height) AS info
UNION ALL
SELECT STRUCT("barry" AS name, 18 AS age, 180 AS height) AS info
)
, ranked AS (
SELECT
info
, ROW_NUMBER() OVER(ORDER BY info.name, info.age, info.height) as rn
FROM
two_structs
)
SELECT
info
FROM
ranked
WHERE
rn = 1
```
From the above you would get the following result:
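```
{"name":"abe","age":20,"height":170}
```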
Replicating MONTHS_BETWEEN in BigQuery
In Hive there is a function called MONTHS_BETWEEN, which calculates the number of months between two dates as a floating-point value. For example, if two dates are exactly half a month apart, it will return 0.5. BigQuery has no such function; the closest option is DATE_DIFF, which treats the difference in months as an integer, so any date that falls in a different calendar month from the original counts as a full month apart, even if the two dates are only two days apart (e.g. 2024-10-30 and 2024-11-01). In order to mimic the original behaviour, we wrote a custom user-defined function (UDF).
```
CREATE TEMP FUNCTION CUSTOM_MONTHS_DIFF(d1 DATETIME, d2 DATETIME) AS (
(
SELECT (EXTRACT(YEAR FROM d1) - EXTRACT(YEAR FROM d2)) * 12
+ (EXTRACT(MONTH FROM d1) - EXTRACT(MONTH FROM d2))
+ (
(EXTRACT(DAY FROM d1) * 86400.0 + EXTRACT(HOUR FROM d1) * 3600 + EXTRACT(MINUTE FROM d1) * 60 + EXTRACT(SECOND FROM d1))
- (EXTRACT(DAY FROM d2) * 86400.0 + EXTRACT(HOUR FROM d2) * 3600 + EXTRACT(MINUTE FROM d2) * 60 + EXTRACT(SECOND FROM d2))
) / 2678400.0
)
);
```
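Here, the constant 2678400.0 is simply 31 days expressed in seconds (31 × 86,400), matching the 31-day month Hive uses for the fractional part. As a quick sanity check, with hypothetical dates and assuming the UDF above has been created in the same script:
```
SELECT CUSTOM_MONTHS_DIFF(DATETIME '2024-11-15', DATETIME '2024-10-31') AS months_diff
-- one whole month minus 16/31 of a month, i.e. roughly 0.48
```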
Conclusion
There are more small issues that we have encountered while migrating HiveQL to BigQuery, but we simply cannot cover every single one of them. We hope that these examples from parts one and two are enough to give you an idea of the types of issues you could face, and how to solve them. Even better if they addressed the exact problems you are currently stuck on.