Human Capital Management (HCM) - Sentence Similarity Language Model using Java
Jonathon Palmieri

Sentence similarity using ONNX Runtime (ORT), Java, and a Hugging Face Sentence Transformer.

Machine learning (ML) and artificial intelligence (AI) are all the rage. With constant advancements in commercial solutions like OpenAI's ChatGPT, many programmers are trying to figure out how to leverage language models in their code. I ran into a use case on an HCM project and wanted to explore how I could use the Hugging Face Sentence Transformer all-mpnet-base-v2. However, anyone familiar with the ML/AI/NLP space knows that many of the resources for working with language models, such as PyTorch, LangChain, and llama.cpp, are written in or mostly used from Python. I wanted to use Java but could not find many tutorials or much documentation, so I put together this informal article to be paired with this YouTube video and this GitHub code.

In the HCM and HR space, there are many inconsistencies in how things are labeled or described. I came across a large amount of data that I needed to categorize, and doing this manually would have taken a long time. Clustering similar titles would have helped, but there was still a lot of variation between the titles. For simplicity, think about two job titles: Accounts Executive and Big Client Manager. This led me to look for a way to compare two strings with more understanding than the difference in their length and characters. Comparing text strings is a fundamental task in many areas of computing, including natural language processing, information retrieval, and data analysis. Various techniques have been developed for this purpose, each with its own use cases and strengths. The right method depends on the requirements of the task, such as the nature of the text data, the need for speed versus accuracy, and the level of semantic understanding required.

For my use case, I chose to employ some advanced NLP techniques:

For those who do not wish to dive into the details:

The method I employed allowed me to discern that "Large Account Executive" and "Big Account Manager" had an 87.24% match. Finding this match was something the other techniques could not do. I then automated this process across the data and clustered together anything with a high match percentage, reducing the manual time needed to merge these data sets. Please note that some other work was done in addition to what is described in this article.

For those who want to get technical:

For my use case, I chose to employ some advanced NLP techniques: a pre-trained transformer model for tokenization and embedding generation, average pooling to create sentence-level embeddings, and cosine similarity between those embeddings to assess the semantic similarity of the input sentences. Here is a brief explanation of how I accomplished it with Java. The GitHub repo is here, and the YouTube video is here.

Tokenization and Encoding with a Pre-trained Transformer Model:

  • The code uses HuggingFaceTokenizer to tokenize the input sentences. This tokenizer is part of the Hugging Face library, which provides a wide range of pre-trained models for natural language processing.
  • The tokenizer is initialized with "sentence-transformers/all-mpnet-base-v2". Sentence Transformers are designed to produce meaningful sentence embeddings (vector representations of sentences). A sentence embedding is a numeric representation of a sentence in the form of a vector of real numbers. In NLP, mapping words, phrases, or sentences to vectors like this is used for word prediction, measuring similarity in meaning, and similar tasks.
  • The batchEncode method is used to convert the input sentences into a format suitable for the model, generating token encodings that include input_ids and attention_mask. The transformer model uses these encodings to understand the context and semantics of each word in the sentences.
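To make input_ids and attention_mask more concrete, here is a toy sketch of what a padded batch looks like. It does not call the real tokenizer: the token ids are made up, the class and method names are hypothetical, and the padding id of 0 is an assumption (a real tokenizer defines its own). It only illustrates the shape of the data, assuming padding to the longest sentence in the batch.

```java
import java.util.Arrays;

// Toy illustration only: NOT the Hugging Face tokenizer.
class BatchPaddingSketch {
    // Hypothetical padding id; the real tokenizer defines its own value.
    static final long PAD_ID = 0L;

    // Length of the longest id sequence in the batch.
    static int maxLength(long[][] sequences) {
        int maxLen = 0;
        for (long[] s : sequences) {
            maxLen = Math.max(maxLen, s.length);
        }
        return maxLen;
    }

    // Pads every id sequence with PAD_ID up to the longest length in the batch.
    static long[][] padIds(long[][] sequences) {
        int maxLen = maxLength(sequences);
        long[][] padded = new long[sequences.length][maxLen];
        for (int i = 0; i < sequences.length; i++) {
            Arrays.fill(padded[i], PAD_ID);
            System.arraycopy(sequences[i], 0, padded[i], 0, sequences[i].length);
        }
        return padded;
    }

    // Builds the mask: 1 for real tokens, 0 for padded positions.
    static long[][] attentionMask(long[][] sequences) {
        int maxLen = maxLength(sequences);
        long[][] mask = new long[sequences.length][maxLen];
        for (int i = 0; i < sequences.length; i++) {
            Arrays.fill(mask[i], 0, sequences[i].length, 1L);
        }
        return mask;
    }

    public static void main(String[] args) {
        // Made-up token ids for two sentences of different length.
        long[][] ids = { {5, 6, 7}, {8, 9} };
        System.out.println(Arrays.deepToString(padIds(ids)));        // [[5, 6, 7], [8, 9, 0]]
        System.out.println(Arrays.deepToString(attentionMask(ids))); // [[1, 1, 1], [1, 1, 0]]
    }
}
```

The mask matters later: it tells the pooling step which positions hold real tokens and which are only padding.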

Embedding Extraction with ONNX Runtime:

  • The code then uses Microsoft's ONNX Runtime, a performance-focused engine for running machine learning models, to load and run the pre-trained transformer model (all-mpnet-base-v2.onnx).
  • The model is used to generate embeddings for each token in the input sentences. These embeddings are high-dimensional vectors that capture the contextual information of each token.
  • I exported this model with Hugging Face's optimum.exporters.onnx.

Average Pooling of Token Embeddings:

  • The method averageEmbeddings computes the average of the token embeddings for each sentence. This step is crucial as it condenses the information from all tokens into a single vector per sentence, providing a fixed-size representation regardless of the sentence length.
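A minimal sketch of this pooling step is below. It is not the averageEmbeddings method from the repo; the class and method names are hypothetical. It also assumes that padded positions should be skipped using the attention mask, which the description above does not state. With no padding, it reduces to a plain average over all tokens.

```java
import java.util.Arrays;

// Hypothetical helper, not the article's averageEmbeddings implementation.
class MeanPooling {
    /**
     * Averages token embeddings into one sentence vector.
     * tokenEmbeddings has shape [tokens][hiddenSize];
     * attentionMask holds 1 for real tokens and 0 for padding (assumed convention).
     */
    static float[] meanPool(float[][] tokenEmbeddings, long[] attentionMask) {
        if (tokenEmbeddings.length == 0 || tokenEmbeddings.length != attentionMask.length) {
            throw new IllegalArgumentException("embeddings and mask must be non-empty and the same length");
        }
        int hiddenSize = tokenEmbeddings[0].length;
        float[] pooled = new float[hiddenSize];
        int realTokens = 0;
        for (int t = 0; t < tokenEmbeddings.length; t++) {
            if (attentionMask[t] == 0) {
                continue; // skip padded positions
            }
            for (int d = 0; d < hiddenSize; d++) {
                pooled[d] += tokenEmbeddings[t][d];
            }
            realTokens++;
        }
        if (realTokens == 0) {
            throw new IllegalArgumentException("mask excludes every token");
        }
        for (int d = 0; d < hiddenSize; d++) {
            pooled[d] /= realTokens;
        }
        return pooled;
    }

    public static void main(String[] args) {
        // Three token vectors of size 2; the last one is padding and is ignored.
        float[][] tokens = { {1f, 2f}, {3f, 4f}, {100f, 100f} };
        long[] mask = {1, 1, 0};
        System.out.println(Arrays.toString(meanPool(tokens, mask))); // [2.0, 3.0]
    }
}
```

Whatever the sentence length, the result has the model's hidden size, which is what makes two sentences directly comparable.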

Cosine Similarity for Sentence Embeddings:

  • Finally, the code calculates the cosine similarity between the two sentence embeddings. Cosine similarity measures how similar two vectors are by comparing their direction rather than their magnitude, and it is often used in text analysis to assess the similarity of documents or sentences. In this context, it quantifies how similar the two sentences are in meaning, as captured by the model.
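For reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below is a plain-Java version with hypothetical class and method names; it is not taken from the repo. If the match percentage is simply the cosine times 100 (an assumption here), a score like 87.24% would correspond to a cosine of about 0.8724.

```java
// Hypothetical helper illustrating the standard cosine similarity formula.
class CosineSimilarity {
    /** Returns dot(a, b) / (|a| * |b|), a value between -1 and 1. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine similarity is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] same = {1f, 2f, 3f};
        float[] scaled = {2f, 4f, 6f};  // same direction, different length
        float[] orthogonal = {0f, 1f};
        float[] axis = {1f, 0f};
        System.out.println(cosine(same, scaled));      // close to 1.0
        System.out.println(cosine(axis, orthogonal));  // 0.0
    }
}
```

Because only direction matters, a vector and a scaled copy of it score as identical, which is why the measure works on embeddings without normalizing them first.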

I hope this helps! If you run into any issues or have any questions, please comment here or connect with me. Also, feel free to reach out to me with any HCM, Talent Management, or Talent Acquisition project you have.


The GitHub repo you linked to does not appear to have any code

Scott Leserman ACIR, CIR, PRC, CDR

Proactive, Intentional Talent Acquisition. Removing Stress from Hiring. Recruiting Consultant.

1 year ago

Well done!

Jonathon Palmieri

HR Strategy | HR Technology | Talent Acquisition | IO Psychology | Workforce Planning | Consulting

1 year ago

Here is the Youtube video for the lazy https://youtu.be/SuNpVql6Oec?si=b_gE8r84hx77jJ62 #ai #machinelearning #HCM #naturallanguageprocessing #tutorial #java
