Common HiveQL to BigQuery Migration Errors: A Detailed Exploration - Part 2
Intro
In the last post we covered four common errors that we have had to contend with while migrating HiveQL to BigQuery as part of a data warehouse migration: ranking issues, issues with the TRIM function, the size of NULL arrays, and issues caused by the Gregorian calendar reform. In this post we will continue with a few more examples.
Using Arrays in Group By
In Hive it is possible to use an array in a GROUP BY statement.
```
SELECT
name_array,
MAX(age)
FROM
source
GROUP BY
name_array
```
However, in BigQuery this is not possible, so a workaround is needed. One solution is to convert the array to a JSON string, since strings can be used in GROUP BY. You can wrap this in a CTE or subquery, and then convert the JSON string back to the original array afterwards.
```
WITH group_by_helper AS (
SELECT
TO_JSON_STRING(name_array) AS name_array
, MAX(age) AS max_age
FROM
source
GROUP BY
TO_JSON_STRING(name_array)
)
SELECT
JSON_EXTRACT_ARRAY(name_array) AS name_array
, max_age
FROM
group_by_helper
```
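One caveat worth noting as a possible refinement: JSON_EXTRACT_ARRAY returns the elements as JSON-formatted strings, so string values keep their surrounding quotes. If name_array was originally an array of plain strings, swapping the final SELECT for JSON_VALUE_ARRAY should give back the unquoted values:
```
SELECT
JSON_VALUE_ARRAY(name_array) AS name_array
, max_age
FROM
group_by_helper
```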
Using Aggregation Functions on STRUCT/RECORD
In Hive, it is possible to apply a function such as MIN or MAX to a STRUCT. If we use MIN as an example, the minimum STRUCT is determined based on the values of the sub-fields, with the first sub-field taking precedence. So if we have two STRUCTs like this:
```
WITH two_structs AS (
SELECT named_struct("name", "abe", "age", 20, "height", 170)
UNION ALL
SELECT named_struct("name", "barry", "age", 18, "height", 180)
)
```
Then the MIN of the two would return
```
{"name":"abe","age":20,”height”:170}
```
This is because abe < barry. Now, if we change the data so that there is a tie on the first field:
```
WITH two_structs AS (
SELECT named_struct("name", "barry", "age", 20, "height", 170)
UNION ALL
SELECT named_struct("name", "barry", "age", 18, "height", 180)
)
```
Then the MIN would return
```
{"name":"barry","age":18,“height”, 180}
```
This is because the first field was not enough to determine the minimum, and thus the second field is used.
However, BigQuery does not allow STRUCTs to be used in aggregation functions such as MIN or MAX, so a workaround is needed. One possible solution is to apply the same ranking logic using the ROW_NUMBER() function.
```
WITH two_structs AS (
SELECT STRUCT("abe" AS name, 20 AS age, 170 AS height) AS info
UNION ALL
SELECT STRUCT("barry" AS name, 18 AS age, 180 AS height) AS info
)
, ranked AS (
SELECT
info
, ROW_NUMBER() OVER(ORDER BY info.name, info.age, info.height) as rn
FROM
two_structs
)
SELECT
info
FROM
ranked
WHERE
rn = 1
```
From the above you would get the following result:
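```
{"name":"abe","age":20,"height":170}
```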
Replicating MONTHS_BETWEEN in BigQuery
In Hive there is a function called MONTHS_BETWEEN, which calculates the number of months between two dates as a floating-point value. For example, if two dates are exactly half a month apart, it will return 0.5. BigQuery has no such function; the closest option is DATE_DIFF, which treats the difference in months as an integer, so any date that falls in a different calendar month from the original counts as a full month apart, even if the two dates are only two days apart (e.g. 2024-10-30 and 2024-11-01). In order to mimic the original behaviour, we wrote a custom user-defined function (UDF).
```
CREATE TEMP FUNCTION CUSTOM_MONTHS_DIFF(d1 DATETIME, d2 DATETIME) AS (
(
SELECT (EXTRACT(YEAR FROM d1) - EXTRACT(YEAR FROM d2)) * 12
+ (EXTRACT(MONTH FROM d1) - EXTRACT(MONTH FROM d2))
+ (
(EXTRACT(DAY FROM d1) * 86400.0 + EXTRACT(HOUR FROM d1) * 3600 + EXTRACT(MINUTE FROM d1) * 60 + EXTRACT(SECOND FROM d1))
- (EXTRACT(DAY FROM d2) * 86400.0 + EXTRACT(HOUR FROM d2) * 3600 + EXTRACT(MINUTE FROM d2) * 60 + EXTRACT(SECOND FROM d2))
) / 2678400.0
)
);
```
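Here, the constant 2678400.0 is simply 31 days expressed in seconds (31 × 86,400), matching the 31-day month Hive uses for the fractional part. As a quick sanity check, with hypothetical dates and assuming the UDF above has been created in the same script:
```
SELECT CUSTOM_MONTHS_DIFF(DATETIME '2024-11-15', DATETIME '2024-10-31') AS months_diff
-- one whole month minus 16/31 of a month, i.e. roughly 0.48
```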
Conclusion
There are more small issues that we have encountered while migrating HiveQL to BigQuery, but we simply cannot cover every single one of them. We hope that these examples from parts one and two are enough to give you an idea of the types of issues you could face, and how to solve them. Even better if they addressed the exact problems you are currently stuck on.