Hive Functions -- UDF, UDAF and UDTF with Examples
Gaurav Kumar Singh
Hive Functions: User Defined Functions (UDFs) in Hive let us plug our own logic, written as code, into Hive when the built-in functions cannot produce the desired result. We can invoke UDFs directly from a Hive query.
There are three kinds of user defined functions in Hive:
1. Regular UDF, 2. User Defined Aggregate Function (UDAF), 3. User Defined Tabular Function (UDTF).
1. Regular UDF: A UDF works on a single row in a table and produces a single row as output; it is a one-to-one relationship between the input and the output of the function, e.g. Hive's built-in TRIM() function.
Hive allows us to define our own UDFs as well. Let's take a student record as an example.
Problem Statement: Find the maximum marks obtained across the four subjects by a student.
Sample Input: STUD_ID, NAME, CLASS, TOT_MARKS, MATH, ENGLISH, PHYSICS, SOCIAL_STUDY, YEAR [Click Here for complete source code]
Steps to Create & Execute UDFs in Java:
Step 1: Extend the base class UDF to write our business logic in Java. Step 2: To hold the business logic, write a method called evaluate() inside the class (it can be overloaded for different argument types).
Step 3: Copy the exported JAR file to a location accessible to Hive (HDFS or a local path on the machine where Hive is running). In my case I am running Hive from my local directory path /home/GAURAV/HIVE, so I have copied the exported JAR file to this path.
Step 4: Add the exported JAR file to the Hive classpath using the command below from the Hive terminal: ADD JAR EXPORTED_FILE_NAME.jar; Alternative: You can register the JAR permanently by editing ~/.bashrc (e.g. "nano ~/.bashrc") and setting HIVE_AUX_JARS_PATH='/home/GAURAV/HIVE/HIVE-UDF-MAXMARKS.jar'. This avoids adding the JAR to the classpath every time you log in to a Hive or Hadoop session, because the framework loads it when the cluster starts.
Step 5: To apply the business logic on a Hive column using our UDF, we need to create a temporary function that points to the class inside the exported JAR: CREATE TEMPORARY FUNCTION func_name AS 'fully_qualified_class_name';
The diagram above shows steps 4 and 5.
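As a rough illustration of steps 4 and 5 together, a Hive session could look like the commands below; the function name get_max_marks, the class com.hive.udf.GetMaxMarks and the table name STUDENT_RECORD are illustrative assumptions, not taken from the article's screenshots:

ADD JAR /home/GAURAV/HIVE/HIVE-UDF-MAXMARKS.jar;
CREATE TEMPORARY FUNCTION get_max_marks AS 'com.hive.udf.GetMaxMarks';
-- one-to-one: the UDF is applied to each student row
SELECT STUD_ID, NAME, get_max_marks(MATH, ENGLISH, PHYSICS, SOCIAL_STUDY) AS MAX_MARKS FROM STUDENT_RECORD;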
Java Code Snippet using Eclipse Tool:
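As a minimal sketch, a UDF for this problem could look like the class below; the class name GetMaxMarks, the package com.hive.udf and the use of Java Integer arguments are assumptions rather than the article's exact code:

package com.hive.udf;   // hypothetical package name

import org.apache.hadoop.hive.ql.exec.UDF;

// Simple one-to-one UDF: one row in, one value out.
public class GetMaxMarks extends UDF {

    // evaluate() is called once per row with the four subject marks
    // and returns the maximum of them.
    public Integer evaluate(Integer math, Integer english, Integer physics, Integer socialStudy) {
        if (math == null || english == null || physics == null || socialStudy == null) {
            return null; // propagate NULLs instead of failing the query
        }
        return Math.max(Math.max(math, english), Math.max(physics, socialStudy));
    }
}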
OUTPUT: Below is the output:
2. UDAF: A user defined aggregate function works on more than one row and gives a single row as output, e.g. Hive's built-in MAX() or COUNT() functions. Here the relationship is many to one. Say you have a table with students' name, id and total marks: if the table has 10 rows and I have to find the student who got the maximum marks, the query needs to check all 10 rows but ultimately returns only one output, the maximum. That is the many-to-one relationship.
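For instance, the built-in MAX() below reads every row of the table but returns a single row (the table name STUDENT_RECORD is only illustrative):

-- many-to-one: all rows are scanned, one value comes back
SELECT MAX(TOT_MARKS) AS HIGHEST_MARKS FROM STUDENT_RECORD;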
Hive allows us to define our own UDAFs. Let's reuse the student record from the UDF example, with the same input.
Problem Statement: Find the mean of the marks obtained in Math by all of the students.
Sample Input: STUD_ID, NAME, CLASS, TOT_MARKS, MATH, ENGLISH, PHYSICS, SOCIAL_STUDY, YEAR [Click Here for complete source code]
Steps to Create & Execute UDAFs in Java: Step 1: Extend the base class UDAF to write our business logic in Java. Step 2: Implement five methods, init(), iterate(), terminatePartial(), merge() and terminate(), for our GetMeanMarks class. Steps 3 to 5: Same as explained in the UDF example. Below are the screenshots for the UDAF.
Below is a sample flow diagram explaining the UDAF methods:
Java Code Snippet using Eclipse Tool:
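As a minimal sketch, assuming the classic UDAF API in which the five methods live inside a nested evaluator class implementing UDAFEvaluator (the package com.hive.udaf and the exact state handling are assumptions):

package com.hive.udaf;   // hypothetical package name

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

// Many-to-one: aggregates the MATH column of every row into a single mean value.
public class GetMeanMarks extends UDAF {

    // Partial aggregation state passed between map and reduce phases.
    public static class MeanState {
        private double sum;
        private long count;
    }

    public static class GetMeanMarksEvaluator implements UDAFEvaluator {
        private final MeanState state;

        public GetMeanMarksEvaluator() {
            state = new MeanState();
            init();
        }

        // init(): reset the aggregation state.
        public void init() {
            state.sum = 0;
            state.count = 0;
        }

        // iterate(): called once per input row with the MATH marks.
        public boolean iterate(Double marks) {
            if (marks != null) {
                state.sum += marks;
                state.count++;
            }
            return true;
        }

        // terminatePartial(): hand the partial state to Hive at the end of a map task.
        public MeanState terminatePartial() {
            return state.count == 0 ? null : state;
        }

        // merge(): fold a partial state from another task into this one.
        public boolean merge(MeanState other) {
            if (other != null) {
                state.sum += other.sum;
                state.count += other.count;
            }
            return true;
        }

        // terminate(): produce the final mean.
        public Double terminate() {
            return state.count == 0 ? null : state.sum / state.count;
        }
    }
}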
OUTPUT: Below is the output:
3. UDTF: A user defined tabular function works on one row as input and returns multiple rows as output, so here the relation is one to many, e.g. Hive's built-in EXPLODE() function. Take an array column USER_IDS holding the values array(10,12,5,45); then SELECT EXPLODE(USER_IDS) AS ID FROM T_USER will give 10, 12, 5 and 45 as four different rows in the output. A UDTF can also be used to split one column into multiple columns, which we will look at in the example below. Note that the "AS" alias clause is mandatory here.
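As a quick sketch, assuming T_USER holds a row with USER_IDS = array(10,12,5,45):

SELECT EXPLODE(USER_IDS) AS ID FROM T_USER;
-- output: four separate rows
-- 10
-- 12
-- 5
-- 45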
Problem Statement: Expand the book_detail column of the T_BOOK_DETAIL table into three separate columns in the output: BOOK_NUMBER, TITLE and AUTHOR. Here we need to split one column into three columns.
Sample Input: Here the book_detail column contains the book number, title and author name separated by commas. [Click Here for complete source code]
Steps to Create & Execute UDTFs in Java:
Step 1: Extend the base class GenericUDTF to write our business logic in Java.
Step 2: Override three methods, namely initialize(), process() and close(), in our ExpandBookDetail class.
Steps 3 to 5: Same as explained in the UDF example. Below are the screenshots for the UDTF.
Java Code Snippet using Eclipse Tool:
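As a minimal sketch of such a GenericUDTF (the class name ExpandBookDetail matches the article, while the package com.hive.udtf and the exact splitting logic are assumptions):

package com.hive.udtf;   // hypothetical package name

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

// Expands one string column into three output columns.
public class ExpandBookDetail extends GenericUDTF {

    private StringObjectInspector inputOI;

    // initialize(): validate the argument and declare the three output columns.
    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("ExpandBookDetail takes exactly one string argument");
        }
        inputOI = (StringObjectInspector) args[0];

        List<String> fieldNames = new ArrayList<String>();
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("book_number");
        fieldNames.add("title");
        fieldNames.add("author");
        for (int i = 0; i < 3; i++) {
            fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        }
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // process(): called once per input row; split the comma separated value
    // and forward one output row with three columns.
    @Override
    public void process(Object[] record) throws HiveException {
        String detail = inputOI.getPrimitiveJavaObject(record[0]);
        if (detail == null) {
            return;
        }
        String[] parts = detail.split(",", 3);
        Object[] row = new Object[3];
        for (int i = 0; i < 3; i++) {
            row[i] = i < parts.length ? parts[i].trim() : null;
        }
        forward(row);
    }

    // close(): nothing to clean up in this example.
    @Override
    public void close() throws HiveException {
    }
}

Once the JAR is added and a temporary function is created (the name expand_book_detail is assumed), the mandatory AS clause names the three output columns: SELECT expand_book_detail(book_detail) AS (BOOK_NUMBER, TITLE, AUTHOR) FROM T_BOOK_DETAIL;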
Output: Below is the Output:
This is a small article with examples to help you understand and get some hands-on experience with Hive UDFs. Hope you liked it. Please refer to books and the internet for more details on Hive functions.
Happy Learning.... Gaurav Singh