Unlocking BigQuery SQL With User-Defined Functions (UDF’s)

Unlocking BigQuery SQL With User-Defined Functions (UDF’s)

Introduction

On August 25th, Google released a host of new and exciting changes to the BigQuery platform. While each of these changes deserves its own blog post, I wanted to spend a few minutes chatting about one type of change that has been recently been announced - added support for user-defined functions (UDFs). On the development side, while SQL itself is incredibly empowering, I am sure that many of you have been frustrated at one time or another by the restrictive nature of SQL when it comes to doing more complex work. As you and I know, SQL isn’t great at dealing with loops, complex conditionals, or challenging string parsing use cases which frequently present themselves when analyzing and reviewing large volumes of text data. Unfortunately, until this last release, there were little options for workarounds within BigQuery, Google’s BigData tool which operates using a SQL-based language. Thankfully, all of that changed as of the new release.

As part of the latest BigQuery release, Google has announced full support for executing user-defined functions (UDFs). This change gives developers tremendous power and is something I hope to display at my next Vancouver Enterprise Cloud Computing Users Group meetup. This change also gives users the opportunity to combine SQL code with JavaScript defined functions; something, that will give developers access to a wide variety of complex operations in BigQuery going forward.

Understanding The Basics

Now that you understand what UDF’s are, lets explore how they work in a little more detail. BigQuery UDFs are described by Google as being similar to “map functions in MapReduce”. These JavaScript UDF’s are executed on instances of Google V8 running on Google servers, with the code running close to your data in order to minimize added latency. In terms of specific UDF functions, a UDF takes a single row of data as an input and produces “zero or more rows of output, potentially with a different schema.”  The fact that these functions are written in JavaScript makes perfect sense because of the fact that JavaScript contains a built-in capability for decoding URL-encoded strings. Below is an example illustrating the framework for a UDF function.

As you can see in the above function, the UDF includes two formal parameters. Thomas Park, a Senior Software Engineering from Google describes these parameters in perfect detail. Thus, I reiterate his own description and attribute full credit to him for these initial descriptions. Within the two functions described, “the first parameter is a variable to which each input row will be bound whereas the second parameter is an “emitter” function. Each time the emitter is invoked with a JavaScript object, that object will be returned as a row to the query.”

On inspection, it is quite evident what's happening. In the above example, urlDecode refers to the actual UDF that will be invoked from BigQuery. When the UDF is run, it calls a helper function decodeHelper that uses JavaScript’s built-in decodeURI function to transform URL-encoded data into UTF-8. And, since the Google recognizes that data can be messy, the above example includes the use of a try / catch in decodeHelper which causes the function to return the original, un-decoded string if an error occurs.

Now that you understand the basics, you are probably asking yourself, okay, how do I execute this in BigQuery? To make this function visible to BigQuery, you have to include a registration call that describes the function. This call must include the input columns, output schema, as well as a name that you will use to reference the function within your SQL. Below is an example.

Once this step is complete, you can then invoke the function using the appropriate name, in the case “urlDecode” within the BigQuery SQL query. The source table, or subquery can then be passed as an argument. To help cement things, below is a full example that Thomas created showing what a full query looks like using the sample Wikipedia dataset that comes with BigQuery. This query seeks to find the most-visited "French Wikipedia articles from April 2015 that contain a cédille character (?) in the title":

Now that you have seen a generic example, you are probably itching to get rolling. One of the first things that you will probably bookmark relating to these functions is a new simple browser-based widget that the team from Google created. This simple browser-based widget allows you to test and debug UDFs and can be very useful, particularly because not all JavaScript functionality supported in the browser is available in BigQuery. For example, “anything related to the browser DOM is unsupported, including Window and Document objects, and any functions that require them, such as atob() / btoa().

I could on and on and will include a follow up post on UDF functions with many examples. That said, I hope that, by now, you have a full appreciation of what UDF functions are, how they work, and why they have the potential to transform how you currently use a tool like BigQuery. Over the last few months at Cardinal Path, I have seen dozens of use cases with the clients that I work with and look forward to sharing more of those general use cases in the weeks and months ahead! In the mean time, if you have any examples to share, please don’t hesitate to reach out!

Punnoose I.

SVP || Data, AI and Insights || Crunchyroll, LLC || Sony subsidiary

9 年

Wish redshift takes a note of this :-)

回复

要查看或添加评论,请登录

Nathaniel Payne, PhD (裴内森)的更多文章

社区洞察

其他会员也浏览了