How we built domain-adapted foundation GenAI models to power our platform
The advent of Generative AI (GenAI) has introduced powerful proprietary, open-source, and open-weights foundation language models. Adapting these models to meet the unique needs of LinkedIn's 1B+ members and 69 million companies has been a core focus for our engineering team, and it shapes how we build member and customer experiences with GenAI that deliver the most value.
A core part of our approach is taking the best available language models and adapting them to serve each specific member or customer use case. In this blog, we present a new family of domain-adapted foundation models developed as part of an internal AI engineering project at LinkedIn called Economic Opportunity Network (EON). These domain-adapted foundation models (aka EON models) enhance the capabilities of underlying open-source or third-party foundation models by adding domain-specific differentiation, such as data from the LinkedIn Economic Graph. This enables rapid development and scaling of new GenAI experiences on LinkedIn’s platform at lower cost. In testing, EON models have delivered strong results, including significantly improved candidate-job-requirements matching for our Hiring Assistant, launched in October 2024.
We’ll explore the evolution of our in-house GenAI, explain why we saw a need to build domain-adapted foundation models (EON models), describe a product application that uses them, and share our learnings.
Evolution of in-house GenAI innovation at LinkedIn
LinkedIn has been innovating in and adopting state-of-the-art (SOTA) GenAI modeling research since 2022, when we started working on an in-house model for personalized AI-assisted messaging. In 2023, OpenAI’s GPT-4 models accelerated our GenAI efforts and enabled the launch of our Premium profile writing suggestions and LinkedIn collaborative articles products. These models generated high-quality text, yet latency and cost were concerns. In May 2023, we launched the AI-assisted messaging feature (Figure 1), which uses InBart, an in-house, domain-adapted, open-source encoder-decoder transformer model. To train the model, we leveraged model imitation along with data preprocessing and filtering via automated instruction-faithfulness and hallucination-detection techniques (Figure 2). We also leveraged Retrieval Augmented Generation (RAG) during inference to improve the content diversity of generated InMails.
Adding new features to the fine-tuned, domain-specific model and generalizing it to new applications required significant effort: representative, high-quality training data had to be regenerated to re-train the model. To gain the flexibility of a generalizable model that enhances LinkedIn’s professionally-oriented products and services, we developed domain-adapted foundation models.
Domain-adapted foundation models (aka EON models)
This section details LinkedIn’s domain-adapted foundation models developed under the EON project. It covers multi-task instruction tuning with reasoning traces for domain adaptation, model selection, training and evaluation setup, and performance analysis on selected benchmarks.
Multi-task instruction tuning with reasoning traces
Members interact with LinkedIn in unique ways, such as finding and posting jobs, authoring articles, and connecting with other professionals. To enhance and scale GenAI features that meet these needs, we developed EON models that leverage data from the LinkedIn Economic Graph (a digital representation of the global economy and workforce), to create deeply personalized and helpful member and customer experiences.
EON models must:
- Follow instructions and generalize to new professionally-oriented GenAI tasks.
- Adhere to LinkedIn's Responsible AI governance requirements.
- Generate output aligned with LinkedIn member expectations regarding product experiences.
- Provide multi-step abstraction for task execution.
- Provide a frictionless conversational interaction interface.
The EON models are trained in two steps: 1) multi-task instruction tuning with reasoning traces; and 2) preference and safety alignment using techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). We also employed prompt simplification strategies to standardize complex human-written prompts, which led to a 30% reduction in prompt size.
To adhere to LinkedIn's trust, responsible AI, and fairness principles, we first instruction-tuned the model on synthetically-generated safe outputs for input prompts containing harmful content, and then further aligned the model with preference data. The high-quality, diverse, and de-duplicated training data totaled around 200M tokens.
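To make the preference-alignment step concrete, here is a minimal sketch using the open-source TRL library's DPOTrainer. The checkpoint path, dataset file, and hyperparameters are hypothetical placeholders rather than our production configuration.

```python
# A minimal sketch of preference alignment via DPO with the open-source TRL
# library. Paths, data, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the checkpoint produced by step 1 (multi-task instruction tuning).
model = AutoModelForCausalLM.from_pretrained("path/to/instruction-tuned-model")
tokenizer = AutoTokenizer.from_pretrained("path/to/instruction-tuned-model")

# Preference data: each record holds a "prompt", a "chosen" response, and a
# "rejected" response.
preference_data = load_dataset(
    "json", data_files="preference_pairs.jsonl", split="train"
)

training_args = DPOConfig(
    output_dir="eon-dpo",
    beta=0.1,  # strength of the KL penalty against the reference model
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=preference_data,
    processing_class=tokenizer,  # older TRL versions use `tokenizer=` instead
)
trainer.train()
```

In practice, the chosen/rejected pairs would be populated with preference data of the kind described above, including safe responses paired against unsafe ones for prompts containing harmful content.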
Model selection and evaluation
EON models should generalize to a wide set of LinkedIn-specific use cases while preserving, and even extending, the base model’s instruction-following and generalization capabilities. To select the best foundation model, we measured the quality of multiple SOTA foundation models, such as Llama, Mixtral, and Mistral, after domain-adapting them. Specifically, we evaluated their performance on reasoning, instruction-following capability, and safety compliance using both open-source and LinkedIn benchmarks (Figure 3).
We used standard evaluation metrics to measure performance when reference labels were available. To compare EON models against SOTA open-source AI models, we leveraged open-source frameworks such as LM Evaluation Harness to measure performance on popular benchmarks like ARC, MuSR, IFEval, and the Berkeley Function Calling Leaderboard (BFCL). In the absence of reliable reference labels, we used GPT-4 variants as a judge. While the raw scores from these metrics don’t completely replace human evaluation, they provide a reliable directional signal on model quality.
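As an illustration of this setup, here is a minimal sketch of scoring a candidate model with the LM Evaluation Harness. The model path is a hypothetical placeholder, and exact task names vary by harness version.

```python
# A minimal sketch of benchmarking a candidate model with the open-source
# LM Evaluation Harness. The model path is a hypothetical placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face-backed model loader
    model_args="pretrained=path/to/eon-candidate,dtype=bfloat16",
    tasks=["arc_challenge", "ifeval"],  # task names vary by harness version
    batch_size=8,
)

# Each task reports its standard metrics (e.g., accuracy for ARC).
for task, metrics in results["results"].items():
    print(task, metrics)
```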
In addition to quality and safety benchmarks, we evaluated these models for cost efficiency. We found the EON-8B model (a domain-adapted Llama 3.1-8B variant) to be 75x and 6x more cost-effective than GPT-4 and GPT-4o, respectively (Figure 4).
Training and evaluation setup
Built on top of an on-premises Kubernetes platform, the EON training pipeline is designed as an end-to-end solution, connecting all the necessary jobs within a single, integrated system: data preprocessing, model training, offline inference, and comprehensive evaluation (Figure 5).
The modular nature of the pipeline allows for tremendous flexibility in adapting SOTA foundation models for LinkedIn's needs. For example, teams can test different optimization techniques in the training job, such as in-house Liger Kernels, DeepSpeed ZeRO, or Hugging Face Accelerate. Similarly, the inference stage can be configured to leverage a variety of options, such as vLLM. On the evaluation side, the pipeline integrates with an in-house evaluation platform that runs the entire benchmark suite.
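For example, a minimal sketch of the offline-inference stage using vLLM's Python API might look like the following; the model path and prompt are hypothetical placeholders.

```python
# A minimal sketch of batch offline inference with vLLM. The model path and
# prompt are illustrative placeholders, not our production configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/eon-8b", tensor_parallel_size=1)
params = SamplingParams(temperature=0.2, max_tokens=512)

prompts = ["Extract the job requirements from the following posting: ..."]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```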
Model evaluation can be triggered on-demand or automatically after training runs, with the results of each evaluation run persisted in MLflow and HDFS for deeper review. Experiments with the best benchmark performance are published to a central, company-wide leaderboard (similar to the Hugging Face leaderboard).
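As a simple illustration (not our production code), persisting a run's scores to MLflow can be as small as the following sketch; the run name and helper function are hypothetical.

```python
# A minimal sketch of recording one evaluation run's scores in MLflow so it
# can be reviewed later. Names and metric keys are illustrative.
import mlflow

def log_eval_run(run_name: str, scores: dict[str, float]) -> None:
    """Record benchmark scores for one model under a single MLflow run."""
    with mlflow.start_run(run_name=run_name):
        mlflow.log_metrics(scores)

# Usage: `scores` would come from the benchmark suite, e.g.
# log_eval_run("eon-candidate-eval", scores_from_benchmark_suite)
```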
Performance of domain-adapted EON models
We evaluated the performance of EON models (8B and 70B) against different in-house and external benchmarks. Figure 6 shows the comparison on three such sampled tasks: 1) a job-fit-assessment long-text generation task; 2) formatted named entity recognition (NER); and 3) Berkeley Function Calling Leaderboard (BFCL) metrics.
The first task, job-fit assessment, requires a thorough evaluation of a candidate’s suitability for a given job. In addition to generating objective labels (e.g., match/no-match), this long-text generation task requires providing a detailed explanation to reason about the suitability prediction. The formatted NER task requires the model to extract entities from text and generate output in a desired schema. Lastly, the BFCL metrics are open-source function-calling benchmarks that evaluate the model's performance in generating formatted function outputs based on an accurate understanding of the user’s query intent. On tasks seen during training, the EON-8B model outperformed base Llama-3-8B-Instruct, and its performance was comparable to SOTA GPT models.
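To make the formatted NER task concrete, here is a hypothetical input/output pair; the schema and field names are invented for illustration and are not our production schema.

```python
# A hypothetical formatted NER example: the model must extract entities and
# emit them in a fixed JSON schema. Field names are invented for illustration.
import json

prompt = (
    "Extract the skills and job titles from the text below and return JSON "
    'with keys "skills" and "titles".\n\n'
    "Text: Senior Machine Learning Engineer with experience in Python and Spark."
)

# A well-formed model response parses cleanly against the requested schema.
model_response = (
    '{"skills": ["Python", "Spark"], '
    '"titles": ["Senior Machine Learning Engineer"]}'
)
parsed = json.loads(model_response)
assert set(parsed) == {"skills", "titles"}
```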
Generalization performance typically declines when a model is fine-tuned on data for a specific task. The EON models mitigate this issue by using a high-quality, balanced, and diverse dataset, along with an optimized reasoning-augmented multi-task instruction tuning approach. As a result, EON preserves the underlying model’s ability to reason and follow instructions.
Internal safety score measurements assess the out-of-the-box safety performance of GenAI models. Although multi-task instruction tuning initially led to a drop in performance on these benchmarks, the use of synthetic data and safety alignment steps helped recover it.
Some of the key things we learned building EON models are:
- Multi-task instruction tuning with high-quality, diverse instructions and reasoning traces achieves SOTA performance on tasks seen during training and even improves domain-specific generalization.
- Reasoning traces over both synthetically generated in-house instructions and publicly available open-source instructions improve instruction diversity and generalizability.
- Preference and safety alignment helps EON models generate output aligned with member expectations and adhere to LinkedIn's trust, responsible AI, and fairness principles.
Product application - Hiring Assistant
In October 2024, LinkedIn launched Hiring Assistant for select charter customers. This product automates recruiters' repetitive tasks by breaking down queries into multiple steps through an orchestration layer. A key step involves an evaluation agent that assesses candidate suitability against AI-extracted job requirements. This allows recruiters to easily approve and update AI-generated matching criteria, streamlining the process of sourcing the right candidates for a job.
Nearly 90% of large language model (LLM) calls in the Hiring Assistant flow come from the evaluation agent. This agent uses a language model to parse extensive context, including the candidate's profile, resume, recruiter notes, and the job post, to determine the candidate's match. Given the scale of evaluation, the underlying model must be fast and accurate. Currently, LinkedIn's domain-adapted foundation model performs this evaluation, providing both a categorical output indicating the candidate's alignment with the job requirements and a concise explanation and summary of the findings.
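To illustrate the shape of this contract (not the production implementation), here is a hedged sketch in which the model must return a categorical match label plus a concise explanation; the label set, prompt, and helper functions are hypothetical.

```python
# A hedged sketch of the evaluation agent's input/output contract. The label
# set, prompt wording, and helpers are hypothetical, not LinkedIn's code.
import json

MATCH_LABELS = {"MATCH", "PARTIAL_MATCH", "NO_MATCH"}  # illustrative label set

def build_evaluation_prompt(profile: str, resume: str, notes: str, job_post: str) -> str:
    """Assemble the long context the evaluation agent reasons over."""
    return (
        "Assess the candidate against the job requirements. Respond with JSON "
        'containing "label" (MATCH, PARTIAL_MATCH, or NO_MATCH) and a concise '
        '"explanation".\n\n'
        f"Profile:\n{profile}\n\nResume:\n{resume}\n\n"
        f"Recruiter notes:\n{notes}\n\nJob post:\n{job_post}"
    )

def parse_evaluation(raw_response: str) -> tuple[str, str]:
    """Validate the categorical label and return it with the explanation."""
    parsed = json.loads(raw_response)
    if parsed["label"] not in MATCH_LABELS:
        raise ValueError(f"unexpected label: {parsed['label']}")
    return parsed["label"], parsed["explanation"]
```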
In testing, the EON-8B model supporting the evaluation agent significantly improved candidate-job-requirements matching accuracy, outperforming state-of-the-art models¹ like OpenAI’s GPT-4o mini and Llama-3-8B-instruct by an absolute 4% and 30%, respectively, without further fine-tuning on the matching task. The EON model aligns with LinkedIn’s Responsible AI principles, helping the evaluation agent filter out biased and discriminatory job requirements.
What’s ahead?
We are enhancing EON models to support complex interactions with agents beyond single-task executions (Figure 7). In our testing, we improved the base Llama-3.1-8B-instruct model’s function calling capabilities on multi-turn and multi-step execution². Enhancing EON models’ planning and reasoning capabilities, designing efficient context representations, developing efficient storage and retrieval techniques, and bringing in dynamic goal identification and evaluation are some key areas of research we are excited about in the near future.
Lastly, while we have laid the foundation for rapidly developing and scaling our GenAI applications through domain-adapted foundation models, the road ahead gets even more exciting as we explore deeply personalized, multi-modal, and multi-lingual agentic experiences for our members. Join us on our journey; to learn more about LinkedIn’s research in this area, visit https://www.linkedin.com/blog/engineering.
Note: LinkedIn has access to the Llama models mentioned in this blog post via a partnership with Meta.
----------------------------------------------
¹ The prompts were optimized by humans for each model to ensure fair evaluation.
² These improvements were measured by performance on BFCLv2 (Gorilla: Large Language Model Connected with Massive APIs, May 2023) and MMAU (MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains, Aug 2024).
----------------------------------------------
Team
It takes a family to build a FaMeLea (Fast Meaningful and Lean) domain-adapted foundation model. The team behind the efforts (in alphabetical order): Aakhila Shaheen, Anastasiya Karpovich, Ashvini Kumar Jindal, Edgar Zambrano, Girish Balaji, Jason Zhu, Jie Bing, Ji Yan, Michaeel Kazi, Nitin Pasumarthy, Prasad Rao, Praveen Bodigutla, Rohit Jain, Sai Vivek Kanaparthy, Saurabh Gupta, Tung Hoang, Xilun Chen, Yanbin Jiang, and Zheng Li.
Acknowledgement
Building our domain-adapted foundation models amidst the rapid evolution of the GenAI landscape is truly a collaborative effort. We would like to thank Mohak Shroff, Ya Xu and Xavier Amatriain for their vision and leadership in architecting, initiating and supporting our GenAI platform initiatives, along with the entire leadership team: Animesh Singh, Swapnil Ghike, Sandeep Jha, Haowen Ning, Grace Tang, Mary Hearne, Chanh Nguyen, Sakshi Jain, Tyler Grant, Kapil Surlaker, Karthik Ramgopal, Donald Thompson, Natesh Sivasubramoniapillai, Vivek Hariharan, Ercan Yildiz, Qing Lan, Yanning Chen, Souvik Ghosh, and Necip Fazil Ayan.
We want to express our deepest appreciation to our product engineering teams who are not only dedicated to building exceptional GenAI products but also providing us with invaluable feedback. This includes Lijun Peng, Daniel Hewlett, Haichao Wei, Lukasz Karolewski, Parvez Ahammad, Arya Choudhury, Adi Pruthi, Arashpreet Singh Mor, Sreeta Gorripaty, Juan Bottaro, Grace Liu, Mark Zhao, Isabella Li, Lucky Wang, Zhaokang Li, Sai Krishna Bollam and many others. Your passion and contributions are the driving force behind our success.
Lastly, we would like to thank our reviewers, business development, legal and comms partners, including Alex Riccomini, Jon Adams, Mary Shannon, Benito Leyva, Alistair Jedlin, Igor Perisic, Suzi Owens, Adam Lewin, Renee Brown, Will Cheng, and Darlene Moustirats.