Product Design

Building collaborative prompt engineering playgrounds using Jupyter Notebook

At LinkedIn, we’ve been exploring opportunities to use the transformative power of generative AI to enhance products like LinkedIn Sales Navigator. Sales Navigator helps sales professionals and teams connect with key people in companies through LinkedIn's network to drive better sales outcomes. Since GenAI began gaining popularity, we’ve tested various ideas to identify ways to improve the product experience for both our members and our internal developers.

As we explored these possibilities, we needed to bridge the gap between rapid prototyping by engineers and insights from non-technical domain experts. To validate ideas effectively, we needed a very tight feedback loop that allowed for real-time collaboration and iteration. To achieve this, we adopted a unique prompt engineering process leveraging Jupyter Notebooks, which broke down barriers between technical and non-technical team members. It enabled rapid experimentation, speeding up product development and improving cross-functional collaboration.

AccountIQ is one example that emerged from this exploration, achieving significant product-market fit and demonstrating the potential of LLM-driven enhancements. AccountIQ automates company research, reducing what used to take two hours to just five minutes. It gathers data from various sources and uses LLMs to generate key company insights, helping sellers quickly understand financials, priorities, challenges, competition, and key people. This allows them to engage effectively with leads without extensive manual research.

In this blog, we’ll discuss how we utilized Jupyter Notebooks to create collaborative prompt engineering playgrounds for developing AI-powered features like AccountIQ. You'll learn about the challenges of leveraging large language models in production, our approach to prompt engineering, and how we bridged the gap between technical and non-technical team members to accelerate development and improve collaboration. 

Building features using LLMs

Developing features with LLMs is very different from traditional product development. LLMs are inherently non-deterministic and generate varied content. This makes maintaining quality at scale extremely challenging and requires innovative approaches to validation, iteration, and quality control.

Prompt engineering opportunity and challenges

Prompt engineering is a fundamental aspect of leveraging LLMs, representing a significant shift in how we interact with technology. Instead of requiring humans to learn how to code, prompt engineering allows LLMs to directly interpret and respond to human language. This shift has enabled new forms of team collaboration, with engineers and domain experts working together more seamlessly than ever before.

To fully leverage this potential, we needed to address several challenges that limited the accessibility and effectiveness of prompt engineering for a diverse, cross-functional team.

Problem: complex setup

Many tools provided by LLM providers are available online for experimenting with LLMs. While these are useful for initial exploration, developing customer-ready features requires a custom setup that integrates smoothly with the development environment and its requirements. Some of the things that make it complex:

LLM configurations

There are different LLMs available, such as OpenAI models, internal models, and publicly available models like Llama. It is important for everyone on the team to use the same baseline model configuration so they can compare outputs while making prompt or configuration changes. Generally, these configurations are maintained in code, and users need to be able to quickly launch the prompt engineering setup with these configurations.
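As a hypothetical illustration, a shared baseline configuration might look like the sketch below; the field names and values are assumptions, not our actual configuration.

```python
# Hypothetical baseline LLM configuration shared by the whole team.
# Field names and values are illustrative assumptions.
BASELINE_LLM_CONFIG = {
    "provider": "openai",   # could also point to an internal model or an open model like Llama
    "model": "gpt-4o",      # example model name, not necessarily what we use
    "temperature": 0.2,     # kept low so output differences come from prompt changes, not sampling
    "max_tokens": 1024,
}
```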

Managing prompt templates and chains

Prompts to LLMs that are used for the features are commonly stored as prompt templates. Prompt templates support placeholders so that values can be dynamically computed and substituted based on the context. For example, in the prompts used for the AccountIQ feature, the context could be the company's name, description, industry, website, and data from external sources like public SEC information.

When testing the prompts, users should be able to specify the company for which the values need to be resolved; the prompt templates should then be loaded and the placeholder values fetched and substituted automatically. Manually replacing these values with a test company's data in the prompts is not a feasible or scalable option.
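For example, a simplified Jinja template and its resolution for a test company might look like the following sketch; the template text and field names are illustrative, not the actual AccountIQ prompts.

```python
from jinja2 import Template

# Illustrative prompt template with placeholders (not the actual AccountIQ prompt).
PROMPT_TEMPLATE = Template(
    "You are a sales research assistant.\n"
    "Summarize the key financials and priorities of {{ company_name }} "
    "({{ industry }}), using the following context:\n{{ external_context }}"
)

# Placeholder values are resolved automatically from the selected test company,
# rather than being pasted in by hand.
prompt = PROMPT_TEMPLATE.render(
    company_name="Example Corp",
    industry="Enterprise Software",
    external_context="...data fetched from public sources such as SEC filings...",
)
```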

When performing prompt engineering, it is not enough to test just the prompts in isolation; sometimes we need to test multiple prompts chained together using orchestration frameworks like LangChain.
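A minimal sketch of chaining two prompts with LangChain is shown below; the model name and prompt wording are assumptions, not our production chains.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Illustrative two-step chain: summarize raw company data, then extract challenges.
llm = ChatOpenAI(model="gpt-4o", temperature=0.2)  # model name is an assumption

summarize = (
    ChatPromptTemplate.from_template("Summarize this company data:\n{raw_data}")
    | llm
    | StrOutputParser()
)

extract_challenges = (
    ChatPromptTemplate.from_template("From this summary, list the company's top challenges:\n{summary}")
    | llm
    | StrOutputParser()
)

# The first prompt's output feeds the second prompt's {summary} placeholder.
pipeline = {"summary": summarize} | extract_challenges
result = pipeline.invoke({"raw_data": "...company description, posts, and public filings..."})
```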

Solution

We decided to leverage Jupyter Notebooks to create a reusable solution that allows engineers to quickly set up custom prompt engineering playgrounds for their use-cases along with test datasets. This setup is also accessible to non-technical users, making prompt engineering inclusive and user-friendly.

Setup

We set up a Python backend service that powers all the LLM-based features in Sales Navigator, like AccountIQ. It internally uses LangChain to orchestrate all the operations, such as fetching and transforming data, preparing the prompts, and then calling LLMs. All the prompts to the LLMs are managed as Jinja templates.

Every use-case is mapped to a specific chain, which takes input parameters in the form of a dictionary of strings and generates a response. The service exposes a gRPC endpoint that accepts the input parameters along with metadata like the use-case ID; the corresponding chain is executed and the generated response is returned. For example, as shown in Figure 1 below, to generate a specific AccountIQ insight like revenue information, there is a chain that takes input parameters such as the company name, website, description, and other required attributes, and uses an LLM along with tools like Bing search to generate insights about the company's revenue.

Figure 1. High level design of a chain
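Conceptually, the mapping from a use-case ID to its chain can be pictured with the sketch below; the registry, names, and handler signature are hypothetical simplifications of the actual service.

```python
from typing import Callable, Dict

# Stand-in "chain": in the real service this is a LangChain chain that may call
# an LLM and tools like Bing search; here it is a plain callable for illustration.
def revenue_insight_chain(params: Dict[str, str]) -> str:
    return f"(revenue insight for {params['company_name']})"

# Hypothetical registry mapping a use-case ID to its chain.
CHAIN_REGISTRY: Dict[str, Callable[[Dict[str, str]], str]] = {
    "account_iq_revenue": revenue_insight_chain,  # use-case ID is an assumption
}

def handle_request(use_case_id: str, input_params: Dict[str, str]) -> str:
    """Simplified stand-in for the gRPC endpoint: look up the chain and execute it."""
    return CHAIN_REGISTRY[use_case_id](input_params)

print(handle_request(
    "account_iq_revenue",
    {"company_name": "Example Corp", "website": "example.com", "description": "..."},
))
```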

We have added support to launch Jupyter Notebooks from our Python code repository. This allows users to clone the repository, build the code, and launch the notebooks in a browser or IDE with a single command. These notebooks have access to all Python code and libraries, and all library dependencies are managed and resolved automatically during the build. This allows us to access the chains, invoke them with test data, and see the responses, all from within the notebooks. The chains behave exactly the same on the user's local machine as they do in production environments. Using this, engineers can set up notebooks for their use-case, save them to the code repository, and make them available to others on the team.

Other team members can then use those notebooks, make changes to the prompt files (Jinja templates) or code, and rerun the cells to see the effect of their changes.

Both the notebooks and the test data files live in the same code repository. This ensures that any changes to the notebooks, test data, etc. also go through code review by other engineers before being merged.

Figure 2. Example code in the notebook
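As a rough illustration of this workflow, a notebook cell might load a test company and invoke a chain along the lines of the sketch below; the test-data file and the reuse of the hypothetical handle_request() from the earlier sketch are assumptions.

```python
import json

# Hypothetical test-data file checked into the same repository as the notebooks.
with open("test_data/companies.json") as f:
    companies = json.load(f)

test_company = companies[0]  # e.g. a publicly traded software company

# In the real setup this would invoke a chain imported from the service's Python code;
# here we reuse the hypothetical handle_request() from the earlier sketch.
response = handle_request("account_iq_revenue", {
    "company_name": test_company["name"],
    "website": test_company["website"],
    "description": test_company["description"],
})
print(response)
```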

Problem: test data

When testing prompt changes, access to high-quality test data is crucial to ensure that users can test the prompts against realistic scenarios. A good test dataset should be representative of real-world scenarios and should also be diverse. This helps with testing the prompt changes and catching any obvious issues quickly.

For example, in AccountIQ, the test data is company information. A good test dataset should contain all the information about the company, such as the company name, description, industry, website, and company posts. In terms of diversity, the test set should have a mix of companies across different industries, employee counts, countries, and publicly traded vs. privately held status.
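A single test record might look something like the sketch below; the field names and values are illustrative, not the actual schema.

```python
# Illustrative test-data record for AccountIQ-style prompt testing.
test_company = {
    "name": "Example Corp",
    "description": "Cloud-based HR software for mid-sized businesses.",
    "industry": "Enterprise Software",
    "website": "https://www.example.com",
    "employee_count": 2500,
    "country": "US",
    "is_public": True,
    "recent_posts": ["Announcing our Q3 results...", "We are hiring in Dublin..."],
}
```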

External sources

Sometimes we need to fetch data directly from external sources in real time based on the dynamic output of an LLM. In AccountIQ, we use Bing Search APIs to perform web searches to learn more about specific aspects of a company.

Because the web search query is created dynamically, we cannot save Bing results for later use; the search has to happen in real time. Doing this manually is not feasible when testing changes end-to-end, so we need to automate it for the user when working on prompt engineering.
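A hedged sketch of such an automated, real-time search helper is shown below; the endpoint and header follow Bing's public Web Search API, but the wrapper itself and the environment variable name are assumptions.

```python
import os
import requests

def bing_search(query: str, count: int = 5) -> list[str]:
    """Run a live web search and return result snippets (sketch only)."""
    response = requests.get(
        "https://api.bing.microsoft.com/v7.0/search",
        headers={"Ocp-Apim-Subscription-Key": os.environ["BING_API_KEY"]},  # env var name is an assumption
        params={"q": query, "count": count},
        timeout=10,
    )
    response.raise_for_status()
    return [page["snippet"] for page in response.json().get("webPages", {}).get("value", [])]

# The query itself comes from an earlier LLM step, so results cannot be cached ahead of time.
snippets = bing_search("Example Corp annual revenue")
```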

Handling data

We have implemented numerous privacy-enhancing processes and techniques to keep personal information out of the test datasets.

Solution

Test data collection 

Preparing high-quality test data is essential, and Jupyter Notebooks make it straightforward. At LinkedIn, we can query data from our data lake using Trino. Creating good test datasets is not a one-time task; the data needs to be re-sampled on a regular basis to keep the dataset fresh. Engineers can set up this test data collection in the notebooks, which everyone can then use for prompt engineering.

We have also created utilities using custom magics to make it very simple for engineers to perform common tasks. For example, to query data from the data lake, we can use a %%sql cell magic to write Trino SQL queries.
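For illustration, a minimal %%sql cell magic could be registered roughly as in the sketch below; the connection details and the use of the trino Python client are assumptions about the setup.

```python
import pandas as pd
import trino
from IPython.core.magic import register_cell_magic

@register_cell_magic
def sql(line, cell):
    """Minimal %%sql cell magic: run the cell body as a Trino query (sketch only)."""
    conn = trino.dbapi.connect(
        host="trino.example.internal",  # host, catalog, and auth details are assumptions
        port=443,
        user="prompt-playground",
        catalog="hive",
        schema="default",
        http_scheme="https",
    )
    cursor = conn.cursor()
    cursor.execute(cell)
    columns = [col[0] for col in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)
```

In a notebook, a cell starting with %%sql and containing a Trino query would then return the sampled rows as a DataFrame that can be saved as a test dataset.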

When a team wants to start building a new feature or sub-feature, engineers on the team use this framework to set up the notebooks.

Making it easy

We wanted to make it very easy for anyone to get set up and start using the playgrounds without having to worry about dealing with dependencies. We made use of containers to make the code setup and build process seamless, with the help of LinkedIn's remote development platform.

Figure 3. High level architecture of the setup

With just a few commands, anyone on the team can launch their remote build containers with the code repositories fetched and pre-built. They can then connect to their containers via terminal or using IDEs like VS Code or IntelliJ, launch the notebooks, and start working on prompts. Any changes made to the prompt files can be committed to the internal GitHub repository, versioned, and deployed to production.

Since the containers run on our internal servers, others within the company can also access the notebooks remotely, provided they are connected to the company's VPN. This setup has allowed us to give LinkedIn Sales team members access to the prompt engineering playground. Live prompt engineering sessions with end users, observing how they tweak prompts and interact with the system, have provided valuable insights directly from our users.

As users interact with our features, we sample real usage data through tracking. This data helps us better understand actual usage patterns and also allows us to refresh the test datasets. This continuous feedback loop helps us keep improving our features.

With this setup, we're able to involve the cross-functional team directly in prompt iteration within days, and they can see the effect of any prompt change run interactively and instantly. This helps us build confidence early and lets internal users test and provide feedback on direction. As feedback comes in, engineers can focus on optimizing the structure of prompts, the orchestration of multiple prompts, and integration with the actual products for the next stage of testing.

Conclusion

With Jupyter Notebooks, we created a fast feedback loop, enabling seamless collaboration across teams. This approach allows engineers to focus on infrastructure and scalability while enabling domain experts to perform quick product iterations. The result is a rapid prototyping environment, customer-focused development, and continuous improvement, leveraging the strengths of each function and contributing to the delivery of effective solutions that meet real customer needs.