The advent of Generative AI (GenAI) has introduced powerful proprietary, open-source, and open-weights foundation language models. Adapting these models to meet the unique needs of LinkedIn's 1B+ members and 69 million companies has been a core focus for our engineering team as we build member and customer experiences with GenAI that deliver the best value.

A core part of our approach is leveraging the best language models and adapting them to serve each specific member or customer use case. In this blog, we present a new family of domain-adapted foundation models that we developed as part of an internal AI engineering project at LinkedIn, called Economic Opportunity Network (EON). These domain-adapted foundation models (aka EON models) are built to enhance the capabilities of underlying open-source or third-party foundation models by adding domain-specific differentiation, such as data from the LinkedIn Economic Graph. This enables rapid development and scaling of new GenAI experiences on LinkedIn’s platform at lower cost. In testing with EON models, we’ve seen strong results including significantly improved candidate-job-requirements matching for our Hiring Assistant, launched in October 2024.

In the sections that follow, we'll explore the evolution of our in-house GenAI work, illustrate the need we saw for building domain-adapted foundation models (EON models), describe a product application that uses them, and share our learnings.

Evolution of in-house GenAI innovation at LinkedIn

LinkedIn has been innovating in and adopting state-of-the-art (SOTA) GenAI modeling research since 2022, when we started working on an in-house model for personalized AI-assisted messaging. In 2023, OpenAI's GPT-4 models accelerated our GenAI efforts and enabled the launch of our Premium profile writing suggestions and LinkedIn collaborative articles products. These models generated high-quality text, yet latency and cost were concerns. In May 2023, we launched the AI-assisted messaging feature (Figure 1), which uses InBart, an in-house domain-adapted open-source encoder-decoder transformer model. We leveraged model imitation, along with data preprocessing and filtering via automated instruction-faithfulness and hallucination-detection techniques, to train the model (Figure 2). We also leveraged Retrieval Augmented Generation (RAG) during inference to improve the content diversity of generated InMails.

Figure 1. Personalized AI-assisted outreach message generation using an initial version of the Recruiter messaging product.

Adding new features to the fine-tuned, domain-specific model and generalizing it to new applications required significant effort to regenerate representative, high-quality training data and re-train the model. To offer the flexibility of a generalizable model that enhances LinkedIn's professionally-oriented platform products and services, we developed domain-adapted foundation models.

Figure 2. InMail sections and attributes provided by content experts are used to generate examples for training a section classifier. The classifier is used to filter out messages generated as InMails that are not faithful to instructions in the prompt. Furthermore, InMails containing hallucinated entities are filtered out.
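
To make Figure 2's filtering step concrete, here is a minimal sketch of how generated messages could be screened before entering the training set. The classifier and entity extractor below are trivial stand-ins for the in-house components; only the filtering logic mirrors the described pipeline.

```python
# A minimal, illustrative sketch of the filtering step shown in Figure 2. The
# classifier and entity extractor are trivial stand-ins for the in-house
# models; only the filtering logic mirrors the described pipeline.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    requested_sections: frozenset  # sections the prompt asked the model to write
    prompt_entities: frozenset     # entities actually present in the prompt

def predicted_sections(text: str) -> frozenset:
    # Stand-in for the trained section classifier.
    return frozenset({"greeting", "job_pitch"} if "role" in text else {"greeting"})

def extracted_entities(text: str) -> frozenset:
    # Stand-in for the entity extractor used for hallucination detection.
    return frozenset(w.strip(".,") for w in text.split() if w[:1].isupper())

def keep(c: Candidate) -> bool:
    faithful = predicted_sections(c.text) == c.requested_sections
    grounded = extracted_entities(c.text) <= c.prompt_entities  # no new entities
    return faithful and grounded

candidates: list[Candidate] = []  # generated InMails to screen
training_set = [c for c in candidates if keep(c)]
```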

Domain-adapted foundation model aka EON model

This section details LinkedIn’s domain-adapted foundation models developed under the EON project. It covers multi-task instruction tuning with reasoning traces for domain adaptation, model selection, training and evaluation setup, and performance analysis on selected benchmarks.

Multi-task instruction tuning with reasoning traces 

Members interact with LinkedIn in unique ways, such as finding and posting jobs, authoring articles, and connecting with other professionals. To enhance and scale GenAI features that meet these needs, we developed EON models that leverage data from the LinkedIn Economic Graph (a digital representation of the global economy and workforce), to create deeply personalized and helpful member and customer experiences.

EON models must:

  • Follow instructions and generalize to new professionally-oriented GenAI tasks.
  • Adhere to LinkedIn's Responsible AI governance requirements.
  • Generate output aligned with LinkedIn member expectations regarding product experiences.
  • Provide multi-step abstraction for task execution.
  • Provide a frictionless conversational interaction interface.

The EON models are trained in two steps: 1) multi-task instruction tuning with reasoning traces; and 2) preference and safety alignment using techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). We also employed prompt simplification strategies to standardize complex human-written prompts, which led to a 30% reduction in prompt size.
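
As a concrete illustration of the preference-alignment step, below is a minimal PyTorch sketch of the published DPO objective; this is the standard formulation from the DPO paper, not LinkedIn's internal implementation.

```python
# A minimal PyTorch sketch of the standard DPO objective; illustrative only,
# not LinkedIn's internal implementation.
import torch.nn.functional as F
from torch import Tensor

def dpo_loss(policy_chosen_logps: Tensor, policy_rejected_logps: Tensor,
             ref_chosen_logps: Tensor, ref_rejected_logps: Tensor,
             beta: float = 0.1) -> Tensor:
    """Each tensor holds summed token log-probs for a batch of (prompt, chosen,
    rejected) triples under the trainable policy or the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the reward margin between preferred and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```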

To adhere to LinkedIn's trust, responsible AI, and fairness principles, we first instruction-tuned the model with synthetically-generated safe outputs for input prompts containing harmful content and further aligned the model with preference data. The high-quality, diverse, and de-duped training data totaled around 200M tokens.
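
For illustration, a synthetic safety-tuning record could look like the sketch below; the field names and refusal wording are hypothetical, not LinkedIn's actual data schema.

```python
# A hypothetical safety instruction-tuning record; field names and wording are
# invented for illustration, not LinkedIn's actual schema.
safety_example = {
    "instruction": "Write an InMail that pressures this candidate into "
                   "disclosing their current salary.",
    "output": "I can't help write a message that pressures a candidate to "
              "disclose their salary. I can help draft a respectful outreach "
              "message focused on the role and the candidate's experience.",
    "labels": {"input_category": "harmful_request", "response": "safe_refusal"},
}
```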

Model selection and evaluation

EON models should generalize to a wide set of LinkedIn-specific use cases while preserving (even extending) the base model's instruction-following and generalization capabilities. To select the best foundation model, we measured the quality of multiple SOTA foundation models, such as Llama, Mixtral, and Mistral, after domain-adapting them. Specifically, we evaluated their performance on reasoning, instruction-following capability, and safety compliance using both open-source and LinkedIn benchmarks (Figure 3).

Figure 3. Foundation model selection and auto evaluation to develop EON models, which developers can leverage using prompt engineering and fine-tuning.

We used standard evaluation metrics to measure performance when reference labels were available. To compare EON models against SOTA open-source AI models, we leveraged open-source frameworks such as LM Evaluation Harness to measure performance on popular benchmarks such as ARC, MuSR, IFEval, and the Berkeley Function Calling Leaderboard (BFCL). In the absence of reliable reference labels, we used GPT-4 variants as a judge. While these raw scores don't completely replace human evaluation, they provide a reliable directional signal on model quality.
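
For example, a benchmark run with LM Evaluation Harness might look like the sketch below; the model path and task names are illustrative, and exact task identifiers vary across harness versions.

```python
# A minimal sketch of a benchmark run with EleutherAI's LM Evaluation Harness
# (pip install lm-eval). The model path and task names are illustrative; exact
# task identifiers vary across harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct",
    tasks=["arc_challenge", "ifeval"],  # reasoning and instruction following
    batch_size=8,
)
print(results["results"])
```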

In addition to quality and safety benchmarks, we evaluated these models for cost efficiency. We found the EON-8B model (a domain-adapted Llama 3.1-8B variant) to be 75x and 6x more cost-effective than GPT-4 and GPT-4o, respectively (Figure 4).

Figure 4. A100 GPUs needed to serve 1 QPS of an interactive AI Premium experience on LinkedIn. In terms of GPU cost, the EON-8B model, developed over Llama-3 8B, is cost effective in comparison to OpenAI's GPT-4, GPT-4 Turbo, and GPT-4o models. The number of GPUs was calculated from production deployment of GPT instances on Azure and on-premises deployment of EON-8B models for LinkedIn.

Training and evaluation setup

Built on top of an on-premises Kubernetes platform, the EON training pipeline is designed as an end-to-end solution, connecting all the necessary jobs within a single, integrated system. This includes data preprocessing, model training, offline inference, and comprehensive evaluation (Figure 5).

Figure 5. A config-based EON training pipeline that sequentially executes data processing, training, inference, and evaluation steps on different candidate open-source LLMs.
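
To illustrate the config-driven design, a single pipeline spec might look like the sketch below; every key and value here is hypothetical, not the actual EON pipeline schema.

```python
# A hypothetical, illustrative pipeline config; keys and values are invented to
# show the config-driven design, not the actual EON pipeline schema.
eon_pipeline_config = {
    "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "data": {"train_path": "hdfs://.../eon_instructions", "dedupe": True},
    "training": {"optimizations": ["liger_kernels", "deepspeed_zero3"],
                 "epochs": 2, "learning_rate": 2e-5},
    "inference": {"engine": "vllm", "max_new_tokens": 1024},
    "evaluation": {"benchmarks": ["arc_challenge", "ifeval", "bfcl",
                                  "internal_safety"],
                   "log_to": ["mlflow", "hdfs"]},
}
```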

The modular nature of the pipeline allows for tremendous flexibility in adapting SOTA foundation models for LinkedIn purposes. For example, teams can test different optimization techniques in the training job, such as in-house Liger Kernels, DeepSpeed ZeRO, or Hugging Face Accelerate. Similarly, the inference stage can be configured to leverage a variety of options such as vLLM. On the evaluation side, the pipeline integrates with an in-house evaluation platform that runs the entire benchmark suite.
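
As one example of the configurable inference stage, here is a minimal offline-generation sketch with vLLM; the model name, prompt, and sampling settings are illustrative.

```python
# A minimal vLLM offline-inference sketch (pip install vllm); model name,
# prompt, and sampling settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize this candidate's fit for the role: ..."], params)
print(outputs[0].outputs[0].text)
```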

Model evaluation can be triggered on-demand or automatically after training runs, with the results for each evaluation run persisted in MLflow and HDFS for deeper review. Experiments with the best performance on benchmarks are published to a central company-wide leaderboard (similar to the Hugging Face leaderboard).

Performance of domain-adapted EON models

We evaluated the performance of EON models (8B and 70B) against different in-house and external benchmarks. Figure 6 shows the comparison on three such sampled tasks: 1) a job-fit-assessment long-text generation task; 2) formatted NER; and 3) Berkeley Function Calling Leaderboard (BFCL) metrics.

Figure 6. Performance of EON, Llama-3 8B, and GPT models on sampled job fit assessment, formatted NER, and BFCLv1 metrics (Y-axis values for each task are on a different scale; a taller bar is better). Job fit assessment scores are obtained from human annotations, and formatted NER scores are accuracy scores for the 'education' entity. BFCLv1 performance of GPT-4o is from APIGen (Salesforce AI Research, June 2024); its score improved after parsing errors were fixed. GPT scores were computed after the prompts were optimized (by humans) for GPT models.

The first task, job-fit assessment, requires a thorough evaluation of a candidate's suitability for a given job. In addition to generating objective labels (e.g., match/no-match), this long-text generation task requires providing a detailed explanation to reason about the suitability prediction. The formatted NER task requires the model to extract entities from text and generate output in a desired schema. Lastly, BFCL metrics are open-source function-calling benchmarks that evaluate the model's performance in generating formatted function outputs based on an accurate understanding of the user's query intent. On tasks seen during training, the EON-8B model outperformed the base Llama-3-8B-Instruct model, and its performance was comparable to SOTA GPT models.
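
To make the formatted NER task concrete, here is a hedged sketch of how schema-constrained output might be parsed and scored; the schema and entity names are hypothetical, with only the 'education' accuracy metric taken from Figure 6.

```python
# A hypothetical sketch of scoring formatted NER: the model must emit entities
# in a fixed JSON schema, and malformed or incomplete output counts as a miss.
import json
from typing import Optional

EXPECTED_KEYS = {"education", "skills", "companies"}  # hypothetical schema

def parse_formatted_ner(model_output: str) -> Optional[dict]:
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # not valid JSON, so not in the required format
    if not isinstance(parsed, dict) or not EXPECTED_KEYS <= parsed.keys():
        return None  # wrong schema
    return parsed

def education_accuracy(outputs: list, references: list) -> float:
    """Exact-match accuracy on the 'education' entity, as reported in Figure 6."""
    hits = 0
    for out, ref in zip(outputs, references):
        parsed = parse_formatted_ner(out)
        hits += parsed is not None and sorted(parsed["education"]) == sorted(ref)
    return hits / max(len(outputs), 1)
```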

Generalization performance typically declines when a model is fine-tuned on data for a specific task. The EON models mitigate this issue by utilizing a high-quality, more balanced, and diverse dataset, along with an optimized reasoning-augmented multi-task instruction tuning approach. EON preserves the underlying model's ability to reason and follow instructions.

Internal safety score measurements assess the out-of-box safety performance of GenAI models. Although multi-task instruction tuning initially led to a drop in performance on these benchmarks, the use of synthetic data and safety alignment steps helped recover the performance.

Some of the key things we learned building EON models are:

  • Multi-task instruction tuning with high quality, diverse instructions and reasoning achieves SOTA performance on tasks seen during training and even improves domain-specific generalization.  
  • Reasoning with synthetically generated in-house and publicly available open-source instructions improves instruction diversity and generalizability.
  • Preference and safety alignment helps EON models generate output aligned with member expectations and adhere to LinkedIn's trust, responsible AI, and fairness principles.

Product application - Hiring Assistant

In October 2024, LinkedIn launched Hiring Assistant for select charter customers. This product automates recruiters' repetitive tasks by breaking down queries into multiple steps through an orchestration layer. A key step involves an evaluation agent that assesses candidate suitability against AI-extracted job requirements. This allows recruiters to easily approve and update AI-generated matching criteria, streamlining the process of sourcing the right candidates for a job.

Nearly 90% of large language model (LLM) calls in the Hiring Assistant flow come from the evaluation agent. This agent uses a language model to parse extensive context, including the candidate's profile, resume, recruiter notes, and the job post, to determine the candidate's match. Given the scale of evaluation, the underlying model must be fast and accurate. Currently, LinkedIn's domain-adapted foundation model performs this evaluation, providing both a categorical output indicating the candidate's alignment with job requirements and a concise explanation and summary of the findings.
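
For illustration, the evaluation agent's output contract could be modeled as below; the class, field, and category names are hypothetical, not the production schema.

```python
# A hypothetical sketch of the evaluation agent's output contract; class,
# field, and category names are invented for illustration.
from dataclasses import dataclass
from enum import Enum

class MatchLevel(Enum):
    MATCH = "match"
    PARTIAL_MATCH = "partial_match"
    NO_MATCH = "no_match"

@dataclass
class RequirementEvaluation:
    requirement: str   # one AI-extracted job requirement
    match: MatchLevel  # categorical alignment of the candidate with the requirement
    explanation: str   # concise reasoning grounded in profile, resume, and notes
```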

In testing, the EON-8B model supporting the evaluation agent significantly improved candidate-job-requirements matching accuracy, outperforming state-of-the-art models like OpenAI's GPT-4o mini and Llama-3-8B-instruct by an absolute 4% and 30%, respectively¹, without further fine-tuning on the matching task. The EON model aligns with LinkedIn's Responsible AI principles, helping the evaluation agent filter out biased and discriminatory job requirements.

What’s ahead? 

We are enhancing EON models to support complex interactions with agents beyond single-task executions (Figure 7). In our testing, we improved the base Llama-3.1-8B-instruct model's function calling capabilities on multi-turn and multi-step execution². Enhancing EON models' planning and reasoning capabilities, designing efficient context representations, developing efficient storage and retrieval techniques, and bringing in dynamic goal identification and evaluation are some key areas of research we are excited about in the near future.
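
As a rough illustration of what multi-turn, multi-step execution involves, below is a generic tool-calling loop; `model.chat` and the tool registry are hypothetical stand-ins, not an EON or LinkedIn API.

```python
# A generic, illustrative multi-step tool-calling loop; `model.chat` and the
# tool registry are hypothetical stand-ins, not an EON or LinkedIn API.
def run_agent(model, tools: dict, goal: str, max_steps: int = 8):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = model.chat(messages, tools=list(tools))  # hypothetical interface
        if reply.tool_call is None:
            return reply.content  # the model produced a final answer
        # Execute the requested tool and feed the result back to the model.
        result = tools[reply.tool_call.name](**reply.tool_call.arguments)
        messages.append({"role": "assistant", "tool_call": reply.tool_call})
        messages.append({"role": "tool", "content": str(result)})
    return None  # no final answer within the step budget
```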

Figure 7. Language agents can use EON models to leverage advanced planning and reasoning capabilities to determine multi-step goals dynamically, update and fetch relevant context, and interact with the environment in a multi-turn fashion.

Lastly, while we have laid the foundation for rapidly developing and scaling our GenAI applications through domain-adapted foundation models, the road ahead gets even more exciting as we explore deeply personalized, multi-modal, and multi-lingual agentic experiences for our members. Join us on our journey, and to learn more about LinkedIn's research in this area, visit: https://www.linkedin.com/blog/engineering.

Note: LinkedIn has access to the Llama models mentioned in this blog post via a partnership with Meta.

----------------------------------------------

1 The prompts were optimized by humans for each model for fair evaluation.

2 These improvements were measured by performance on BFCLv2 (Gorilla: Large Language Model Connected with Massive APIs, May 2023) and MMAU (MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains, Aug 2024)

----------------------------------------------

Team

It takes a family to build a FaMeLea (Fast Meaningful and Lean) domain-adapted foundation model. The team behind the efforts (in alphabetical order): Aakhila Shaheen, Anastasiya Karpovich, Ashvini Kumar Jindal, Edgar Zambrano, Girish Balaji, Jason Zhu, Jie Bing, Ji Yan, Michaeel Kazi, Nitin Pasumarthy, Prasad Rao, Praveen Bodigutla, Rohit Jain, Sai Vivek Kanaparthy, Saurabh Gupta, Tung Hoang, Xilun Chen, Yanbin Jiang, and Zheng Li.

Acknowledgement

Building our domain-adapted foundation models amidst the rapid evolution of the GenAI landscape is truly a collaborative effort. We would like to thank Mohak Shroff, Ya Xu and Xavier Amatriain for their vision and leadership in architecting, initiating and supporting our GenAI platform initiatives, along with the entire leadership team: Animesh Singh, Swapnil Ghike, Sandeep Jha, Haowen Ning, Grace Tang, Mary Hearne, Chanh Nguyen, Sakshi Jain, Tyler Grant, Kapil Surlaker, Karthik Ramgopal, Donald Thompson, Natesh Sivasubramoniapillai, Vivek Hariharan, Ercan Yildiz, Qing Lan, Yanning Chen, Souvik Ghosh, and Necip Fazil Ayan.

We want to express our deepest appreciation to our product engineering teams who are not only dedicated to building exceptional GenAI products but also providing us with invaluable feedback. This includes Lijun Peng, Daniel Hewlett, Haichao Wei, Lukasz Karolewski, Parvez Ahammad, Arya Choudhury, Adi Pruthi, Arashpreet Singh Mor, Sreeta Gorripaty, Juan Bottaro, Grace Liu, Mark Zhao, Isabella Li, Lucky Wang, Zhaokang Li, Sai Krishna Bollam and many others. Your passion and contributions are the driving force behind our success.

Lastly, we would like to thank our reviewers, business development, legal and comms partners, including Alex Riccomini, Jon Adams, Mary Shannon, Benito Leyva, Alistair Jedlin, Igor Perisic, Suzi Owens, Adam Lewin, Renee Brown, Will Cheng, and Darlene Moustirats.