Behind the platform: the journey to create the LinkedIn GenAI application tech stack
In early 2023, we started rolling out a completely reimagined product portfolio, including for the first time features that leveraged Generative AI (GenAI). These GenAI features have enabled our members and customers to work more efficiently, using tools such as collaborative articles, AI-assisted Recruiter, AI-powered insights, and most recently, our first AI agent, Hiring Assistant.
In the short period of time since the first GenAI feature made its way onto the platform, we’ve undergone remarkable change, learning, and growth in our approach to building these unique product features and experiences. We went from launching products based on simple “prompt in, string out” solutions to crafting assistive agent experiences that offered multi-turn conversation capabilities supported by advanced contextual memory. In parallel, we gradually realized our early vision of building a GenAI application tech stack to power these products, continuously balancing time-to-market goals against long-term leverage.
This blog post will cover the work that happens below the surface of our AI-powered products and focus on our journey in building the GenAI application tech stack that enables them. As we delve deeper into various decisions around GenAI frameworks, programming languages, and the intricacies of bridging offline development with online deployment, you will get a peek into what it takes to build AI-first products at LinkedIn scale.
Genesis and evolution
One of the earliest challenges we faced was not an uncommon one for any new technology being adopted rapidly at scale - building a shared technology foundation to maximize leverage. The solution manifested as a framework to act as the hub of GenAI application development at LinkedIn, providing standard mechanisms for common tasks like prompt construction, inference, memory access, and more.
Since most of the LinkedIn online serving stack was programmed in Java, our initial pragmatic approach was to build a shared Java midtier for all GenAI products encapsulating common functionality as built-in reusable Java code. As the number of use cases grew, this midtier became a development and operational bottleneck, prompting us to split things up into multiple different use-case specific Java midtier services.
The AI engineers working on the offline Large Language Model (LLM)-based workflows, prompt engineering and evaluations preferred Python given the plethora of open source Python solutions in this domain. Rather than be blocked by the onerous task of rebuilding this tooling in Java or getting our online serving stack to support Python, we made a conscious call to start with fragmented online and offline stacks with basic tooling to bridge them.
Moving forward with the fragmented approach helped keep our momentum in the short-term, but we quickly faced challenges in maintaining that direction as we scaled. It required substantial effort to prevent divergence across the various Java midtier services and keep logic mirrored across the offline Python and online Java stacks. Maintaining all of this amidst active product development and updates to underlying framework versions proved taxing.
As the GenAI landscape and associated open source libraries continued to evolve primarily in Python, it became clear that staying on Java for serving was a suboptimal long-term choice. We decided to invest in Python as a first-class language for both offline iteration and online serving at LinkedIn.
By this time, significant momentum had built around the LangChain open source project, including our own adoption of it for our offline stack. Our collaborative and productive relationship with the LangChain developers, combined with a deep analysis of functionality, operability, open source community involvement, evolution track record, and future extensibility, convinced us to use it for online serving as well.
Since LinkedIn had historically used Java almost exclusively for online serving, a lot of our online infrastructure for RPC, storage access, request context passing, distributed tracing and more, only had Java client implementations. Considering our emphasis on Python, we kicked off an initiative to enable Python support for critical infrastructure dependencies with a few key principles in mind.
- Pragmatic prioritization: While it would have been great to have Python equivalents of everything available in Java, doing so was rather expensive. We resorted to stack ranking requirements while coming up with creative solutions to bridge the gaps. For example, we implemented only a part of the request context spec in Python, because we didn’t need some functionality like bidirectional context passing for our GenAI applications, significantly reducing scope. We did not build a native Python client for Espresso (our online distributed document store) and instead used an existing REST proxy to talk to it.
- Opportunistic alignment with migrations to future tech: Rather than building Python client support for existing infrastructure, we looked at the state and adoption progress of upcoming tech, and opportunistically built support only for future tech. For example, LinkedIn is transitioning RPCs from rest.li to gRPC; so rather than building support and closing gaps for several key RPC-related infrastructure pieces (such as request context passing, distributed tracing, call logging, service-to-service ACLs, etc.) atop both rest.li and gRPC, we built them only for gRPC.
- First-class developer experience: We wanted the Python developer experience to feel native, without any shortcuts affecting debuggability or operability. This has translated to investing in Python-native builds (in lieu of our legacy Gradle-based build system), re-engineering frequently debugged code paths where Python calls into C/C++ to be fully Python native, automating the tooling that imports open source Python libraries for use within LinkedIn, and investing in tooling and processes to ensure that we stay on a reasonably new version of the Python language and runtime.
Our GenAI application framework is now a thin wrapper atop LangChain, bridging it with LinkedIn infrastructure for logging, instrumentation, storage access and more. It is vended as a versioned internal library that is now mandated for use by all new GenAI applications at LinkedIn.
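To make this concrete, here is a minimal sketch of what such a wrapper layer can look like, assuming hypothetical internal hooks: the emit_metric function, handler name, and build_chain helper are illustrative, not LinkedIn’s actual APIs.

```python
from langchain_core.callbacks import BaseCallbackHandler


def emit_metric(name: str, value: float) -> None:
    """Stand-in for an internal metrics/instrumentation client (hypothetical)."""
    print(f"metric {name}={value}")


class ServingInstrumentationHandler(BaseCallbackHandler):
    """Bridges LangChain callback events into (hypothetical) internal telemetry."""

    def on_llm_start(self, serialized, prompts, **kwargs):
        emit_metric("llm.request.count", 1)

    def on_llm_end(self, response, **kwargs):
        emit_metric("llm.response.count", 1)

    def on_llm_error(self, error, **kwargs):
        emit_metric("llm.error.count", 1)


def build_chain(prompt, llm):
    """Compose a standard LCEL chain with instrumentation attached by default."""
    chain = prompt | llm
    return chain.with_config(callbacks=[ServingInstrumentationHandler()])
```

Routing infrastructure concerns through a default set of callbacks like this keeps application code focused on product logic while the framework supplies logging and metrics uniformly.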
Top Takeaway: Given the relative recency and rapid evolution of GenAI, there is no one-size-fits-all formula for scalable product development. So engineering organizations - from practitioners to leaders - should make calculated framework investments that balance pragmatism and time to market with long-term leverage.
Prompt management
Prompt engineering is core to building GenAI applications with prompts providing the primary mechanism to “program” Large Language Models (LLMs). Prompt management refers to the system and processes in place to manage and curate these prompts across GenAI applications.
Our initial approach was manual string interpolation in code. While sufficient for basic use cases, this quickly proved error-prone and unscalable.
As we considered more complex approaches to prompt engineering, we made some initial observations that were critical in shaping our ultimate path forward.
- We noticed that many use cases benefited from sharing partial or full prompts for common functionality. Since Trust and Responsible AI are core requirements for all our products, there was an opportunity to universally inject guardrails supporting both into all our prompts.
- It was also essential to ramp new prompt versions gradually to ensure that they did not break or worsen existing product experiences.
To provide more structure around modularization and versioning, we introduced a Prompt Source of Truth component. We standardized the use of the Jinja template language for authoring prompts, and built a Java prompt resolution library to avoid common string interpolation bugs.
After we built our standard application framework in Python atop LangChain, we subsumed the prompt resolution library into it, rewriting it in Python. Since Jinja offers Python-like expressions, the developer experience became even more fluent and native after this switch.
As conversational assistants with multi-turn conversational UIs emerged, we enhanced this component to provide more structure around human and AI roles in conversations, eventually converging on the OpenAI Chat Completions API once it was released and widely adopted.
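As an illustration of this flow, the hedged sketch below renders a versioned Jinja template, with a shared guardrail partial included, into Chat Completions-style messages. The template names, guardrail text, and render helper are invented for the example and are not our actual prompt source of truth.

```python
from jinja2 import Environment, DictLoader

# Illustrative templates: a shared guardrail partial included by a versioned task prompt.
templates = {
    "shared/guardrails.jinja": (
        "Follow LinkedIn's Trust and Responsible AI guidelines. "
        "Do not reveal private member data."
    ),
    "profile_summary/v2.jinja": (
        "{% include 'shared/guardrails.jinja' %}\n"
        "Summarize the following member profile in a {{ tone }} tone:\n"
        "{{ profile_text }}"
    ),
}

env = Environment(loader=DictLoader(templates))


def render_messages(template_name: str, **variables) -> list[dict]:
    """Render a versioned Jinja prompt into Chat Completions style messages."""
    system_prompt = env.get_template(template_name).render(**variables)
    return [{"role": "system", "content": system_prompt}]


messages = render_messages(
    "profile_summary/v2.jinja",
    tone="professional",
    profile_text="Staff engineer with 10 years of distributed systems experience.",
)
```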
All prompt engineers at LinkedIn today author prompts using the prompt source of truth guidelines. Developers are also required to adhere to the modularization and versioning requirements imposed by this component for storing their prompts, and in exchange receive fluent sharing across prompts as well as seamless integration into the rest of the application framework.
Top Takeaway: Prompt management can start off as deceptively simple string management, but there is a ton of nuance that includes (but is not limited to) providing systems and developer guidance for managing templating, versioning and prompt structure to make it work at scale for engineering complex GenAI applications.
Task automation via skills
LinkedIn has been a strong proponent of a skills-based approach to many aspects of work. A skill is an attribute, acquired through learning or experience, that is useful for completing tasks associated with a job. We extended the same abstraction to our GenAI applications, using skills as the mechanism to enable task automation.
The skill abstraction in our framework enables LLMs to move beyond vanilla text generation and use function calling for Retrieval Augmented Generation (RAG) or task automation by converting natural language instructions in the prompts into API calls. In our product, this manifests itself as skills for viewing profiles, searching for posts, querying internal analytics systems and even accessing external tools like Bing for search and news.
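For illustration, here is a simplified sketch of how a skill can wrap an internal API as a LangChain tool with an LLM-friendly schema. The profile client and its fields are hypothetical stand-ins, not our actual APIs.

```python
from langchain_core.tools import tool


def fetch_member_profile(member_id: str) -> dict:
    """Stand-in for an internal profile API client (hypothetical)."""
    return {"member_id": member_id, "headline": "Staff Software Engineer"}


@tool
def view_profile(member_id: str) -> str:
    """View a LinkedIn member's profile given their member ID."""
    profile = fetch_member_profile(member_id)
    return f"{profile['member_id']}: {profile['headline']}"


# The decorated tool carries a name, description, and argument schema that a
# function-calling LLM can consume, e.g. via llm.bind_tools([view_profile]).
print(view_profile.name, view_profile.args)
```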
Initially, we built this within each GenAI product as custom code that wrapped existing LinkedIn internal and external APIs (like Bing Search) using LLM friendly JSON schemas that could be used with the LangChain tool API. However, this approach ran into some scaling bottlenecks.
- Teams often re-implemented the same skills in different products. While we tried to consolidate some popular ones via internal libraries, keeping the libraries up to date and accommodating minor product differences became cumbersome.
- As the downstream service invoked by a skill evolved, the skill also needed to be updated in tandem.
- Application developers had to manually specify the skill or set of skills to use in the prompt.
To overcome these issues, we came up with Skill Inversion. Instead of the calling applications defining skills over the implementing downstreams, the downstreams define the skill and expose it to the calling application, thus organically eliminating the duplication and evolution problems.
We have eased the process of skill access, development and operations by building the following:
- A centralized skill registry service that allows definitions to be added and retrieved (via skill ID or semantic search) at runtime.
- Build plugins that easily enable downstream applications to annotate their endpoint implementations and automatically register them in the skill registry service with the necessary validations around schema structure and documentation, as part of the build.
- A dynamic LangChain tool that retrieves skill definitions from the skill registry and invokes the actual skill with the supplied arguments, eliminating developer-specified skills in prompts and giving significantly larger agency to LLMs (see the sketch below).
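Below is a hedged sketch of that dynamic-tool idea: a single LangChain tool that looks up skill definitions at runtime and dispatches the call. The SkillRegistryClient, its methods, and the example skill are hypothetical placeholders, not our actual registry API.

```python
from langchain_core.tools import StructuredTool


class SkillRegistryClient:
    """Hypothetical client for a centralized skill registry service."""

    def search(self, query: str) -> dict:
        # In practice this would be a semantic search over registered skill definitions.
        return {
            "skill_id": "search_posts",
            "description": "Search LinkedIn posts by keyword.",
        }

    def invoke(self, skill_id: str, arguments: dict) -> str:
        # In practice this would call the downstream endpoint that registered the skill.
        return f"invoked {skill_id} with {arguments}"


registry = SkillRegistryClient()


def invoke_skill(skill_id: str, arguments: dict) -> str:
    """Look up a skill definition at runtime and invoke it with the given arguments."""
    return registry.invoke(skill_id, arguments)


dynamic_skill_tool = StructuredTool.from_function(
    func=invoke_skill,
    name="invoke_skill",
    description="Invoke a registered LinkedIn skill by ID with JSON arguments.",
)
```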
With this infrastructure in place, we are gradually evolving our tech stack by creating skill abstractions for all APIs, enabling LLMs to interact seamlessly with them to achieve richer, more impactful outcomes.
Top Takeaway: There is significant product value unlocked by using GenAI for task automation, not just content generation. However, intentional full-stack tooling is necessary to enable GenAI applications to perform task automation by scalably leveraging the same APIs that human developers call with imperative code.
Contextual awareness and personalization
Contextualization and personalization are considered essential for a great GenAI product experience, but are not available out of the box since LLMs are stateless by default (i.e., each incoming query is processed independently of other interactions).
The workaround for this limitation is to build a Conversational Memory Infrastructure to store LLM interactions, retrieve past context and inject it into future prompts, to share “state” with the LLM and offer a coherent product experience.
Our initial solution to this problem used Couchbase or Espresso databases as storage. Application teams were responsible for repetitive tasks such as setting up databases, writing requests/responses to the databases, and reading from memory before inference.
However, we soon needed more than raw conversation storage and retrieval. Since LLM context windows are limited and increasing input tokens has cost/latency implications, it was important to retrieve only the relevant parts of the conversation rather than the entire conversation history. To enable this, we needed semantic search (using embeddings) and summarization capabilities.
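A minimal sketch of the retrieval idea is shown below, assuming an embedding function as a stand-in for a real embedding model; it scores past turns by cosine similarity to the incoming query and keeps only the top few, so the prompt carries relevant context without the full history.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model call (hypothetical); returns a fixed-size vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(768)


def retrieve_relevant_turns(query: str, history: list[str], top_k: int = 3) -> list[str]:
    """Return only the conversation turns most semantically similar to the new query."""
    query_vec = embed(query)
    scored = []
    for turn in history:
        turn_vec = embed(turn)
        similarity = float(
            np.dot(query_vec, turn_vec)
            / (np.linalg.norm(query_vec) * np.linalg.norm(turn_vec))
        )
        scored.append((similarity, turn))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [turn for _, turn in scored[:top_k]]
```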
Rather than build yet another system from the ground up to solve this at scale, we decided to leverage the LinkedIn messaging stack for a few reasons:
- The conversations between a human and a GenAI application were akin to human-to-human conversations, so much of the functionality of the messaging stack could be reused as-is.
- The stack was proven to work in production with high availability and reliability.
- Enhancements we needed, like semantic search and summarization, would also be useful for product use cases outside of GenAI applications.
- The stack’s low-latency, reliable delivery of messages to mobile/web clients and its state synchronization across devices created additional leverage.
We have now integrated our LinkedIn messaging based Conversational Memory infrastructure into our GenAI application framework using the LangChain Conversational Memory abstraction, making integration seamless for application developers.
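As a rough illustration, the sketch below exposes a messaging-backed conversation store through LangChain’s chat message history abstraction; the MessagingClient is a hypothetical in-memory stand-in for the real messaging stack, not our actual integration.

```python
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import BaseMessage, message_to_dict, messages_from_dict


class MessagingClient:
    """Hypothetical client for a messaging-backed conversation store."""

    def __init__(self):
        self._store: dict[str, list[dict]] = {}

    def fetch(self, conversation_id: str) -> list[dict]:
        return self._store.get(conversation_id, [])

    def append(self, conversation_id: str, message: dict) -> None:
        self._store.setdefault(conversation_id, []).append(message)

    def delete(self, conversation_id: str) -> None:
        self._store.pop(conversation_id, None)


class MessagingChatHistory(BaseChatMessageHistory):
    """Conversational memory exposed through LangChain's chat history abstraction."""

    def __init__(self, client: MessagingClient, conversation_id: str):
        self.client = client
        self.conversation_id = conversation_id

    @property
    def messages(self) -> list[BaseMessage]:
        return messages_from_dict(self.client.fetch(self.conversation_id))

    def add_message(self, message: BaseMessage) -> None:
        self.client.append(self.conversation_id, message_to_dict(message))

    def clear(self) -> None:
        self.client.delete(self.conversation_id)
```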
As our GenAI applications became more advanced, we noticed a trend of needing to derive signals from user-application interactions. Examples of such signals in our product include voice and tone for authoring text, preferred notification channel (in-app vs. push vs. email), and choice of UI templates for visualizing AI-generated content. We call this Experiential Memory (i.e., memory derived from experience) and offer a solution with drop-in GenAI application framework integration and support for partial retrievals and updates.
Top Takeaway: Memory is core to the ability to learn from activities or interactions, incorporate feedback and preferences, and build expected personalized and beneficial experiences in product. Depending on the use case, memory is fast becoming a critical capability that requires thoughtful integration into tech stacks.
Model inference and fine tuning
The advent of GenAI with powerful foundational LLMs that could be prompted to perform a wide variety of tasks turned traditional AI modeling paradigms on their head. Our initial GenAI applications wholeheartedly embraced this new trend and solely used LLMs provided by the Azure OpenAI service. All requests were routed via a centralized GenAI proxy which offered functionalities like Trust and Responsible AI checks, seamless support for new models and model versions, incremental response streaming to reduce user-perceived latency and quota management to ensure fair use of relatively expensive LLM resources across various products.
Our GenAI applications have increasingly started depending on our AI platform, which is built atop open source frameworks like PyTorch, DeepSpeed, and vLLM, and provides robust and highly scalable fine-tuning and serving infrastructure. In our experience, LLMs like Llama, when fine-tuned for LinkedIn-specific tasks, often achieve comparable or better quality than state-of-the-art commercial foundation models on these tasks, but at much lower cost and latency. In keeping with LinkedIn’s ‘members first’ ethos, we also built a setting to enable LinkedIn members to more easily control whether their data is used for training or fine-tuning these models.
To make the experience of application developers transparent across external and internal models, we have invested in the following areas:
- Our inference layer exposes an OpenAI Chat Completions API for all LLMs in use. This allows application developers to always program to this API regardless of the underlying model.
- Configuration hooks in the application framework allow easy switching between on-prem and external models, without application developers having to worry about the routing details. This also allows developers to easily experiment with different underlying models for the same use case, for local debugging and A/B tests in production (see the sketch below).
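To illustrate the effect of these two investments, the hedged sketch below routes a Chat Completions request to either an on-prem or an external backend based on configuration. The route table, endpoints, model names, and auth placeholder are invented for the example and are not our actual configuration system.

```python
from openai import OpenAI

# Hypothetical per-use-case routing between an on-prem deployment and an external provider.
MODEL_ROUTES = {
    "profile_summary": {
        "base_url": "http://onprem-inference.example.com/v1",
        "model": "llama-3-8b-finetuned",
    },
    "recruiter_assistant": {
        "base_url": "https://external-provider.example.com/v1",
        "model": "gpt-4o",
    },
}


def chat(use_case: str, messages: list[dict]) -> str:
    """Send a Chat Completions request to whichever backend the use case is configured for."""
    route = MODEL_ROUTES[use_case]
    client = OpenAI(base_url=route["base_url"], api_key="internal-token")  # placeholder auth
    response = client.chat.completions.create(model=route["model"], messages=messages)
    return response.choices[0].message.content
```

Because both backends speak the same Chat Completions API, switching models for an experiment is a configuration change rather than a code change.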
Top Takeaway: Both proprietary and open-source LLMs are evolving at a rapid pace. There is often no one-size fits all solution, and there are various nuanced trade-offs around quality, cost, latency and more to navigate. Engineering organizations that strategically build their tech stack with sufficient abstractions to reuse the same core infrastructure across different models will see long-term benefits and leverage in efficiency and the capabilities of their products.
Migration
As our GenAI application stack developed and evolved, it was important to rapidly migrate off legacy bespoke solutions onto more standardized ones, to minimize tech debt and increase leverage. We handled these migrations using a lean team that combined engineers with deep knowledge of our historical Java stack and engineers working on the new stack. Our migrations followed these principles:
- Incrementality: Rather than do a big bang migration of all components, we decided to migrate individual components one by one. For example, as soon as the LinkedIn messaging based conversational memory infrastructure was ready, we migrated many of the Java based GenAI applications to it, without waiting for the Python LangChain GenAI application framework migration. For the arguably larger GenAI application framework migration, we started with the simpler and smaller apps first before handling the more complex and larger ones. Each migration followed a depth first approach with a small team first prototyping, identifying gaps and fixing them in a small part of the app. This would be followed by a larger team going breadth first across the app doing the actual migration and setting up the A/B tests to ramp the new stack gradually. This led to progressive learnings and wins, ensured high levels of operational stability, and minimized the impact to the product roadmap of application teams.
- Upskilling talent: Many of our senior engineers were proficient Java developers but needed to gain more Python experience. We paired them with earlier-in-career but more experienced Python developers so they could learn on the job, taking an “accelerated Python class” of sorts.
We have successfully migrated many of our existing applications to our new stack, with a handful of stragglers expected to finish within the next month.
Final thoughts
Our new GenAI application tech stack embraces AI-first development, lays a solid foundation for building GenAI apps efficiently and responsibly, and accelerates innovation in collaboration with the AI community. This technological groundwork is crucial in achieving our vision to bring economic opportunity to every member of the global workforce.
This blog only covered part of the tech stack centered around the GenAI application framework and its immediate adjacencies. To build and launch production-grade GenAI applications, we also depend on several other critical areas like our AI platform, in-house modeling, Observability and Monitoring stack, Responsible AI/Trust services and Evaluation process/frameworks, some of which we may cover in a future blog post.
Despite reaching significant milestones, numerous challenges remain. As the cutting edge of product experiences moves from conversational assistants to AI agents, we are seeing an influx of additional functional and operational requirements for our rapidly evolving tech stack. Stay tuned for more technical details around these areas as we build and learn.
Acknowledgement
Building our application tech stack amidst the rapid evolution of the GenAI landscape is truly a collaborative effort. We would like to thank Donald Thompson, Xavier Amatriain and Ya Xu for their vision and leadership in architecting and initiating our GenAI platform initiatives, along with the entire leadership team: Kapil Surlaker, Zheng Li, Animesh Singh, Swapnil Ghike, Praveen Bodigutla, Grace Tang, Mary Hearne, Tyler Grant, and Chanh Nguyen.
A special thanks goes to our engineering teams who developed the critical components of our tech stack: David Tag, Yi Liu, Nicolas Nkiere, Xiaonan Ding, Don Jung, Eugene Jin from the GenAI Foundations team; Suruchi Shah, Tiffany Zhou, Shangjin Zhang, Jason Belmonti, Ali Naqvi, Sasha Ovsankin from the AI Platform team; Han Wang, Tony Guan from the Python team; Benny Soetarman, Qishen Li, Pantelis Apostolopoulos from the System Infra team; Adam Kaplan, Tim Chao, Eddy Li from the Product Engineering team, and Priyanka Gariba, Sunil Ayyappan, Manish Khanna, Alex Xia, and Naqeeb Abbasi from the TPM team.
Finally, we want to express our deepest appreciation to our product engineering teams who are not only dedicated to building exceptional GenAI products but also providing us with invaluable feedback. This includes Juan Bottaro, Chenglong Hao, Daniel Hewlett, Xie Lu, Haichao Wei, Christopher Lloyd, Lukasz Karolewski, Parvez Ahammad, Lijun Peng, Avi Romascanu, and many others. Your passion and contributions are the driving force behind our success.