Hive Functions -- UDF, UDAF and UDTF with Examples
Gaurav Kumar Singh
Hive Functions: User Defined Functions (UDFs) in Hive let us plug our own logic, written as code, into Hive when the built-in functions cannot produce the desired result. We can invoke UDFs directly from a Hive query.
There are three kinds of user defined functions in Hive:
1. Regular UDF, 2. User Defined Aggregate Function (UDAF), 3. User Defined Tabular Function (UDTF).
1. Regular UDF: A UDF works on a single row in a table and produces a single row as output; it is a one-to-one relationship between the input and the output of the function, e.g. Hive's built-in TRIM() function.
Hive allows us to define our own UDFs as well. Let's take a student record as an example.
Problem Statement: Find the maximum marks obtained across the four subjects by a student.
Sample Input: STUD_ID, NAME, CLASS, TOT_MARKS, MATH, ENGLISH, PHYSICS, SOCIAL_STUDY, YEAR [Click Here for complete source code]
Steps to Create & Execute UDFs in Java:
Step 1: Extend the base class UDF to write our business logic in Java. Step 2: To hold the business logic, write a method called evaluate() inside the class (it can be overloaded for different argument types).
Step 3: Copy the exported JAR file to a location accessible to Hive (HDFS or a local path on the machine where Hive is running). In my case I am running Hive from my local directory path /home/GAURAV/HIVE, so I have copied the exported JAR file to this path.
Step 4: Add the exported JAR file to the Hive classpath using the command below from the Hive terminal: ADD JAR EXPORTED_FILE_NAME.jar; Alternative: You can register the JAR permanently by editing ~/.bashrc (e.g. "nano ~/.bashrc") and setting HIVE_AUX_JARS_PATH='/home/GAURAV/HIVE/HIVE-UDF-MAXMARKS.jar'. This avoids adding the JAR to the classpath every time you log in to a Hive or Hadoop session, because the framework loads it when the cluster starts.
Step 5: To apply the business logic on a Hive column using our UDF, we need to create a temporary function that points to the class inside the exported JAR: CREATE TEMPORARY FUNCTION func_name AS 'fully_qualified_class_name';
The diagram above shows steps 4 and 5.
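As a rough illustration of steps 4 and 5 together, a Hive session could look like the commands below; the function name get_max_marks, the class com.hive.udf.GetMaxMarks and the table name STUDENT_RECORD are illustrative assumptions, not taken from the article's screenshots:

ADD JAR /home/GAURAV/HIVE/HIVE-UDF-MAXMARKS.jar;
CREATE TEMPORARY FUNCTION get_max_marks AS 'com.hive.udf.GetMaxMarks';
-- one-to-one: the UDF is applied to each student row
SELECT STUD_ID, NAME, get_max_marks(MATH, ENGLISH, PHYSICS, SOCIAL_STUDY) AS MAX_MARKS FROM STUDENT_RECORD;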
Java Code Snippet using Eclipse Tool:
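As a minimal sketch, a UDF for this problem could look like the class below; the class name GetMaxMarks, the package com.hive.udf and the use of Java Integer arguments are assumptions rather than the article's exact code:

package com.hive.udf;   // hypothetical package name

import org.apache.hadoop.hive.ql.exec.UDF;

// Simple one-to-one UDF: one row in, one value out.
public class GetMaxMarks extends UDF {

    // evaluate() is called once per row with the four subject marks
    // and returns the maximum of them.
    public Integer evaluate(Integer math, Integer english, Integer physics, Integer socialStudy) {
        if (math == null || english == null || physics == null || socialStudy == null) {
            return null; // propagate NULLs instead of failing the query
        }
        return Math.max(Math.max(math, english), Math.max(physics, socialStudy));
    }
}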
OUTPUT: Below is the output:
2. UDAF: A user defined aggregate function works on more than one row and gives a single row as output, e.g. Hive's built-in MAX() or COUNT() functions. Here the relationship is many to one. Say you have a table with students' name, id and total marks: if the table has 10 rows and I have to find the student who got the maximum marks, the query needs to check all 10 rows but ultimately returns only one output, the maximum. That is the many-to-one relationship.
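For instance, the built-in MAX() below reads every row of the table but returns a single row (the table name STUDENT_RECORD is only illustrative):

-- many-to-one: all rows are scanned, one value comes back
SELECT MAX(TOT_MARKS) AS HIGHEST_MARKS FROM STUDENT_RECORD;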
Hive allows us to define our own UDAFs. Let's reuse the student record from the UDF example, with the same input.
Problem Statement: Find the mean of the marks obtained in Math by all of the students.
Sample Input: STUD_ID, NAME, CLASS, TOT_MARKS, MATH, ENGLISH, PHYSICS, SOCIAL_STUDY, YEAR [Click Here for complete source code]
Steps to Create & Execute UDAFs in Java: Step 1: Extend the base class UDAF to write our business logic in Java. Step 2: Implement five methods, init(), iterate(), terminatePartial(), merge() and terminate(), for our GetMeanMarks class. Steps 3 to 5: Same as explained in the UDF example. Below are the screenshots for the UDAF.
Below is a sample flow diagram explaining the UDAF methods:
Java Code Snippet using Eclipse Tool:
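As a minimal sketch, assuming the classic UDAF API in which the five methods live inside a nested evaluator class implementing UDAFEvaluator (the package com.hive.udaf and the exact state handling are assumptions):

package com.hive.udaf;   // hypothetical package name

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

// Many-to-one: aggregates the MATH column of every row into a single mean value.
public class GetMeanMarks extends UDAF {

    // Partial aggregation state passed between map and reduce phases.
    public static class MeanState {
        private double sum;
        private long count;
    }

    public static class GetMeanMarksEvaluator implements UDAFEvaluator {
        private final MeanState state;

        public GetMeanMarksEvaluator() {
            state = new MeanState();
            init();
        }

        // init(): reset the aggregation state.
        public void init() {
            state.sum = 0;
            state.count = 0;
        }

        // iterate(): called once per input row with the MATH marks.
        public boolean iterate(Double marks) {
            if (marks != null) {
                state.sum += marks;
                state.count++;
            }
            return true;
        }

        // terminatePartial(): hand the partial state to Hive at the end of a map task.
        public MeanState terminatePartial() {
            return state.count == 0 ? null : state;
        }

        // merge(): fold a partial state from another task into this one.
        public boolean merge(MeanState other) {
            if (other != null) {
                state.sum += other.sum;
                state.count += other.count;
            }
            return true;
        }

        // terminate(): produce the final mean.
        public Double terminate() {
            return state.count == 0 ? null : state.sum / state.count;
        }
    }
}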
OUTPUT: Below is the output:
3. UDTF: A user defined tabular function works on one row as input and returns multiple rows as output, so here the relation is one to many, e.g. Hive's built-in EXPLODE() function. Take an array column USER_IDS holding the values array(10,12,5,45); then SELECT EXPLODE(USER_IDS) AS ID FROM T_USER will give 10, 12, 5 and 45 as four different rows in the output. A UDTF can also be used to split one column into multiple columns, which we will look at in the example below. Note that the "AS" alias clause is mandatory here.
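As a quick sketch, assuming T_USER holds a row with USER_IDS = array(10,12,5,45):

SELECT EXPLODE(USER_IDS) AS ID FROM T_USER;
-- output: four separate rows
-- 10
-- 12
-- 5
-- 45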
Problem Statement: Expand the book_detail column of the T_BOOK_DETAIL table into three separate columns in the output: BOOK_NUMBER, TITLE and AUTHOR. Here we need to split one column into three columns.
Sample Input: Here the book_detail column contains the book number, title and author name separated by commas. [Click Here for complete source code]
Steps to Create & Execute UDTFs in Java:
Step 1: Extend the base class GenericUDTF to write our business logic in Java.
Step 2: Override three methods, namely initialize(), process() and close(), in our ExpandBookDetail class.
Steps 3 to 5: Same as explained in the UDF example. Below are the screenshots for the UDTF.
Java Code Snippet using Eclipse Tool:
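As a minimal sketch of such a GenericUDTF (the class name ExpandBookDetail matches the article, while the package com.hive.udtf and the exact splitting logic are assumptions):

package com.hive.udtf;   // hypothetical package name

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

// Expands one string column into three output columns.
public class ExpandBookDetail extends GenericUDTF {

    private StringObjectInspector inputOI;

    // initialize(): validate the argument and declare the three output columns.
    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("ExpandBookDetail takes exactly one string argument");
        }
        inputOI = (StringObjectInspector) args[0];

        List<String> fieldNames = new ArrayList<String>();
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("book_number");
        fieldNames.add("title");
        fieldNames.add("author");
        for (int i = 0; i < 3; i++) {
            fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        }
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // process(): called once per input row; split the comma separated value
    // and forward one output row with three columns.
    @Override
    public void process(Object[] record) throws HiveException {
        String detail = inputOI.getPrimitiveJavaObject(record[0]);
        if (detail == null) {
            return;
        }
        String[] parts = detail.split(",", 3);
        Object[] row = new Object[3];
        for (int i = 0; i < 3; i++) {
            row[i] = i < parts.length ? parts[i].trim() : null;
        }
        forward(row);
    }

    // close(): nothing to clean up in this example.
    @Override
    public void close() throws HiveException {
    }
}

Once the JAR is added and a temporary function is created (the name expand_book_detail is assumed), the mandatory AS clause names the three output columns: SELECT expand_book_detail(book_detail) AS (BOOK_NUMBER, TITLE, AUTHOR) FROM T_BOOK_DETAIL;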
Output: Below is the Output:
This is a small article with examples to help you understand and get some hands-on experience with Hive UDFs. Hope you liked it. Please refer to books and the internet for more details on Hive functions.
Happy Learning.... Gaurav Singh