Building an Agentic Application Using On-device Open-source Generative AI
Chris Pappalardo
Senior Director at Alvarez & Marsal | Software Engineer and FinTech Innovator | CPA, AWS Solutions Architect
Since their mainstream debut, Large Language Models have found many interesting applications. One of the most intriguing is the agentic pattern, where the model is prompted to call a function when appropriate, allowing it to interact with the world or produce a better mix of generative and predictive results. Interacting with financial or economic data is a common use case, since financial and economic quantities tend to have discrete values, such as stock prices or GDP.
There are many tutorials and examples of function calling using external APIs for both proprietary and open models from providers such as OpenAI and Mistral. But what about on-device open-source models? And why does that matter?
It matters for several reasons, including privacy, cost, and edge-device capabilities.
Both Apple and NVIDIA have signaled as much with their recent announcements. Apple announced Apple Intelligence this week, which among other things will make generative AI a core part of its operating systems and, where possible, use on-device models to power new features. Last week NVIDIA announced Chat RTX, which lets Windows users run an LLM on their NVIDIA hardware, providing a chat interface that connects to their local content.
These announcements are in line with recent advances in the capabilities of small language models. Microsoft recently released Phi-3-mini, a 3.8-billion-parameter model with a 4K context window that outperforms models more than twice its size and is good at code generation. As smaller open-source models continue to narrow the gap with larger (and proprietary) models, it raises the question: why use an API and run anything off-device at all?
That brings me to the purpose of this article: to demonstrate how to design and build an agentic LLM application using only on-device open-source language models, APIs, and tooling.
Since applications are more fun to build than notebooks, I decided to create an application that could pull and discuss data from Federal Reserve Economic Data (“FRED”), a database maintained by the Federal Reserve Bank of St. Louis.
In building this agentic LLM application, I set the following objectives and constraints:
To achieve these objectives, I utilized two key technology components: Ollama and Haystack. Ollama handles downloading and serving the LLMs via an API on my laptop, while Haystack provides a Python framework for building composable LLM-powered pipelines in an elegant way. These tools proved to be a joy to use and are highly recommended.
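To give a sense of how the pieces fit together, here is a minimal sketch of calling an Ollama-served model from Haystack via the ollama-haystack integration. The model name is an illustrative assumption, not necessarily what Agent FRED uses; any model pulled with Ollama will do.

```python
# Minimal sketch (not from the repo): Ollama serves the model locally and
# Haystack talks to it over Ollama's HTTP API. Assumes `ollama pull phi3`
# has been run and the Ollama server is running on its default port.
from haystack_integrations.components.generators.ollama import OllamaGenerator

llm = OllamaGenerator(model="phi3")  # model choice is illustrative
result = llm.run(prompt="What does FRED stand for?")
print(result["replies"][0])
```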
The Agent FRED source code is open-source and available on GitHub. The project README explains how to install and configure the application, and the implementation is easy to follow.
The Agentic LLM Prompt
When designing the agentic AI pipeline, I needed to address two key design considerations. First, how do I prompt an LLM to call a function, given that this ability is not handled for me behind the scenes as it would be with a proprietary model API? After a few attempts, a prompt inspired by Simon Willison’s keynote at PyCon 2024 in Pittsburgh, PA last month worked reasonably well.
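The exact prompt text is in the repository; the sketch below shows its general shape, with illustrative wording and a hypothetical `fetch_fred_series` tool signature. The idea is to describe a single function and instruct the model to either emit a bare function call or answer normally.

```python
# Illustrative prompt template (not the repo's exact wording). The single
# tool, fetch_fred_series, is described to the model, which must either
# emit a bare function call or answer conversationally.
AGENT_PROMPT = """
You have access to one function:

    fetch_fred_series(series_id) -> recent observations for a FRED data
    series, e.g. "UNRATE" for unemployment or "GDP" for gross domestic product.

If the user's question requires economic data, reply with ONLY the function
call, for example:

    fetch_fred_series("UNRATE")

Otherwise, answer the question conversationally.

Question: {{ question }}
"""
```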
The Conditional Pipeline
The other key design element was to incorporate conditionality into the LLM pipeline so it could handle different workflow paths depending on the user’s prompt, with one path handling discrete function calling and another handling generative conversation. This had to be done in a way that preserved the pipeline’s ability to carry the state (history) of the conversation and provide relevant data points based on the question.
One of Haystack's best features is the ability to render pipeline diagrams directly from code, which is how the Agent FRED pipeline diagram was produced.
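For example, a built Pipeline object can be written out as an image (a sketch; recent Haystack 2.x releases render the diagram via the Mermaid service, which requires network access by default):

```python
# Sketch: render a pipeline diagram to a PNG file.
from pathlib import Path
from haystack import Pipeline

pipe = Pipeline()
# ... add_component / connect calls as described below ...
pipe.draw(Path("agent_fred_pipeline.png"))  # writes the rendered diagram
```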
Another issue I had to solve was working around the way Haystack implements conditionality in its router, which uses Jinja2, a templating system that allows for simple expressions. I needed something more specific to my use case to verify that the agent LLM had constructed a proper function call. So I created a router that can accept and use a custom Jinja2 filter.
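A minimal sketch of such a router follows, modeled on Haystack's ConditionalRouter; the class and parameter names here are illustrative, not the repo's actual code.

```python
# Sketch of a ConditionalRouter-style component whose Jinja2 environment
# accepts user-supplied filters. Names (FilterRouter, custom_filters) are
# illustrative assumptions.
import ast
from typing import Any, Callable, Dict, List, Optional

from haystack import component
from jinja2 import Environment

@component
class FilterRouter:
    def __init__(
        self,
        routes: List[Dict[str, str]],
        custom_filters: Optional[Dict[str, Callable]] = None,
    ):
        self.routes = routes
        self.env = Environment()
        self.env.filters.update(custom_filters or {})  # register custom filters
        # declare one output socket per route so the pipeline can branch
        component.set_output_types(self, **{r["output_name"]: Any for r in routes})

    def run(self, replies: List[str]):
        for route in self.routes:
            # each condition is a Jinja2 expression that renders to "True"/"False"
            rendered = self.env.from_string(route["condition"]).render(replies=replies)
            if ast.literal_eval(rendered):
                output = self.env.from_string(route["output"]).render(replies=replies)
                return {route["output_name"]: output}
        raise ValueError("No route matched the LLM reply")
```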
Creating and passing a custom filter that uses regex to extract arguments from the function-call response resulted in conditional router logic that was quite simple.
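Illustratively (the repo's actual regex and route definitions may differ), the filter and routes reduce to two entries, one per branch:

```python
# Illustrative filter and routes, assuming the FilterRouter sketch above.
import re

def fred_args(reply: str) -> str:
    """Extract the series id from a reply like fetch_fred_series("UNRATE")."""
    match = re.search(r'fetch_fred_series\(\s*["\']([^"\']+)["\']\s*\)', reply)
    return match.group(1) if match else ""

routes = [
    {   # the agent produced a well-formed function call -> data branch
        "condition": '{{ replies[0]|fred_args != "" }}',
        "output": "{{ replies[0]|fred_args }}",
        "output_name": "series_id",
    },
    {   # no function call -> conversational branch
        "condition": '{{ replies[0]|fred_args == "" }}',
        "output": "{{ replies[0] }}",
        "output_name": "chat_reply",
    },
]

router = FilterRouter(routes=routes, custom_filters={"fred_args": fred_args})
```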
Talking to FRED
Calling the FRED API is thankfully straightforward (special thanks to my colleague Zach Harner for researching the FRED API so I could focus on the AI aspects). The API works the same way for each data series and the returned data is standardized, so I could make simplifying schema assumptions about the responses.
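Here is a sketch of that call against FRED's observations endpoint; the helper name and response trimming are my own simplifications, and a FRED_API_KEY environment variable is assumed:

```python
# Sketch: pull recent observations for a FRED series. Assumes a free API key
# from fred.stlouisfed.org is exported as FRED_API_KEY.
import os
import requests

def fetch_fred_series(series_id: str, limit: int = 100) -> list[dict]:
    """Return recent observations as [{'date': ..., 'value': ...}, ...]."""
    resp = requests.get(
        "https://api.stlouisfed.org/fred/series/observations",
        params={
            "series_id": series_id,
            "api_key": os.environ["FRED_API_KEY"],
            "file_type": "json",
            "sort_order": "desc",
            "limit": limit,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # every series returns the same schema, so one parser covers all of them
    return [
        {"date": obs["date"], "value": obs["value"]}
        for obs in resp.json()["observations"]
    ]

# e.g. fetch_fred_series("UNRATE") for the US unemployment rate
```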
The Final Touches
Two final steps brought the different elements of this project together.
First, I needed an integrated chat model that could benefit from the data pulled by the agent and behave like a normal conversational agent, for example by remembering chat history. It also needed a separate prompt template, because the chat model plays a different role than the agent does.
For this I selected the Haystack ChatPromptBuilder which, after a bit of trial and error, I configured with a simple prompt that integrated my document store and conversation history. This is the final pipeline, which is just 40 lines of Python code.
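A condensed sketch of that pipeline follows; it is not the repo's exact 40 lines, and the model choice and message wording are illustrative assumptions:

```python
# Condensed sketch of the chat pipeline: ChatPromptBuilder folds FRED data
# and the user's question into chat messages for an Ollama-served model.
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.ollama import OllamaChatGenerator

template = [
    ChatMessage.from_system(
        "You are a helpful economics assistant. Use this data when relevant:\n"
        "{{ documents }}"
    ),
    ChatMessage.from_user("{{ question }}"),
]

pipe = Pipeline()
pipe.add_component("prompt", ChatPromptBuilder(template=template))
pipe.add_component("llm", OllamaChatGenerator(model="phi3"))  # model is illustrative
pipe.connect("prompt.prompt", "llm.messages")

result = pipe.run({
    "prompt": {
        "documents": [{"date": "2024-04-01", "value": "28,284"}],  # sample shape
        "question": "What is the latest GDP reading?",
    }
})
print(result["llm"]["replies"][0])
```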
Second, I wanted a slick web interface, because I wanted to display the data the application pulled from FRED, and chatting with an LLM is nicer in the browser than in a terminal window. For this I used Gradio (https://www.gradio.app/), which provides a composable Python framework for building responsive web applications.
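The chat portion of such a UI can be as small as this sketch (illustrative, not the app's actual interface code, which also renders the FRED data):

```python
# Minimal Gradio chat front end wired to the pipeline sketched above.
import gradio as gr

def respond(message, history):
    # in the real app this would invoke the agent + chat pipelines
    result = pipe.run({"prompt": {"documents": [], "question": message}})
    return str(result["llm"]["replies"][0])

gr.ChatInterface(respond, title="Agent FRED").launch()
```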
It works! Delicately…
I will admit that this is a simple application, and it generally works better when used delicately. I discuss this aspect of the application in the key takeaways below. However, these integrations and this tech stack provide a foundation for building more fault-tolerant and robust applications, and they are fully local and fully open-source.
Conclusion and key takeaways
This was a fun exercise that took me about two weeks to research and build, mostly in my spare time. Here are some insights from my experience for future consideration: