Human Capital Management (HCM) - Sentence Similarity Language Model using Java
Jonathon Palmieri
HR Strategy | HR Technology | Talent Acquisition | IO Psychology | Workforce Planning | Consulting
Sentence similarity using ONNX Runtime (ORT), Java, and a Hugging Face Sentence Transformer.
Machine learning (ML) and artificial intelligence (AI) are all the rage. With constant advancements in commercial solutions like OpenAI's ChatGPT, many programmers are trying to figure out how they can leverage language models in their own code. I ran into a use case on an HCM project and wanted to explore how I could use the Hugging Face Sentence-Transformer model all-mpnet-base-v2. However, anyone familiar with the ML/AI/NLP space knows that many of the resources for working with language models (PyTorch, LangChain, llama.cpp, etc.) are written in Python or are commonly used from it. I wanted to use Java, but could not find many tutorials or much documentation. Therefore, I put together this informal article to be paired with this YouTube video and this GitHub code.
In the HCM & HR space, there are many inconsistencies in how things are labeled or described. I came across a large amount of data that I needed to categorize, and doing this manually would have taken a long time. Clustering like titles would have helped, but there was still a lot of variation between the titles. For simplicity, think about two job titles: Accounts Executive & Big Client Manager. They describe similar roles yet share almost no characters, so I needed a way to compare two strings with more understanding than the difference in length and characters. Comparing text strings is a fundamental task in many areas of computing, including natural language processing, information retrieval, and data analysis. Various techniques have been developed for this purpose, each with its own use cases and strengths. The right method depends on the requirements of the task, such as the nature of the text data, the need for speed versus accuracy, and the level of semantic understanding required.
For those who do not wish to dive into the details:
The method I employed allowed me to discern that "Large Account Executive" and "Big Account Manager" had an 87.24% match. Finding this match was something the other techniques could not do. Then, I automated this process across the data and clustered together anything with a high match percentage, reducing the manual time needed to merge these data sets. Please note that some other work was done in addition to what is described in this article.
For those who want to get technical:
For my use case, I chose to employ some advanced NLP techniques: a pre-trained transformer model for tokenization and embedding generation, followed by average pooling to create sentence-level embeddings, and finally cosine similarity between those embeddings to assess the semantic similarity of the input sentences. Here is a brief explanation of how I accomplished it with Java. The GitHub repo is here, and the YouTube video is here.
Tokenization and Encoding with a Pre-trained Transformer Model:
Embedding Extraction with ONNX Runtime:
Average Pooling of Token Embeddings:
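A minimal sketch of what the pooling step can look like in plain Java, assuming the model output has already been read into a float[tokens][hiddenSize] array and that an attention mask (1 for real tokens, 0 for padding) is available. The class and method names here are hypothetical and are not taken from the GitHub repo. Skipping padded positions is also an assumption on my part; for a single unpadded sentence it is the same as a plain average over all tokens.

```java
import java.util.Arrays;

/** Hypothetical helper: mask-aware average pooling over token embeddings. */
final class MeanPooling {

    /**
     * @param tokenEmbeddings one row per token, shape [tokens][hiddenSize]
     * @param attentionMask   1 for real tokens, 0 for padding, length [tokens]
     * @return one sentence-level vector of length hiddenSize
     */
    static float[] meanPool(float[][] tokenEmbeddings, long[] attentionMask) {
        if (tokenEmbeddings.length == 0 || tokenEmbeddings.length != attentionMask.length) {
            throw new IllegalArgumentException("embeddings and mask must be non-empty and equally long");
        }
        int hiddenSize = tokenEmbeddings[0].length;
        float[] pooled = new float[hiddenSize];
        int counted = 0;
        for (int t = 0; t < tokenEmbeddings.length; t++) {
            if (attentionMask[t] == 0) {
                continue; // padding token: leave it out of the average
            }
            for (int d = 0; d < hiddenSize; d++) {
                pooled[d] += tokenEmbeddings[t][d];
            }
            counted++;
        }
        if (counted == 0) {
            throw new IllegalArgumentException("attention mask excludes every token");
        }
        for (int d = 0; d < hiddenSize; d++) {
            pooled[d] /= counted; // sum -> mean
        }
        return pooled;
    }

    public static void main(String[] args) {
        // Toy values only; real embeddings from all-mpnet-base-v2 are much wider.
        float[][] tokens = { {1f, 2f}, {3f, 4f}, {100f, 100f} };
        long[] mask = {1, 1, 0}; // third row is padding
        System.out.println(Arrays.toString(meanPool(tokens, mask))); // prints [2.0, 3.0]
    }
}
```

The result is one fixed-length vector per sentence, regardless of how many tokens the sentence produced, which is what makes two sentences directly comparable in the next step.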
Cosine Similarity for Sentence Embeddings:
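Cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a small sketch in plain Java, assuming both sentence embeddings are float arrays of the same length. The class and method names are hypothetical, and returning 0 for an all-zero vector is a convention I picked for the sketch rather than anything from the repo.

```java
import java.util.Locale;

/** Hypothetical helper: cosine similarity between two sentence embeddings. */
final class CosineSimilarity {

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumed convention: a zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy vectors, not real model output.
        float[] first = {1f, 0f};
        float[] second = {1f, 1f};
        double score = cosine(first, second);
        System.out.println(String.format(Locale.ROOT, "%.2f%%", score * 100)); // prints 70.71%
    }
}
```

Multiplying the score by 100 gives the kind of percentage quoted earlier in the article. Vectors pointing in the same direction score 1.0 (100%), and unrelated directions score close to 0.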
I hope this helps! If you run into any issues or have any questions, please comment here or connect with me. Also, feel free to reach out to me with any HCM, Talent Management, or Talent Acquisition project you have.
Here is the YouTube video for the lazy: https://youtu.be/SuNpVql6Oec?si=b_gE8r84hx77jJ62 #ai #machinelearning #HCM #naturallanguageprocessing #tutorial #java