Building an Agentic Application Using On-device Open-source Generative AI
From the St. Louis FRB, which encourages each of us to “make a sweet app using our FRED through our free API”.

Since their debut in 2023, there have been many interesting applications of Large Language Models. One of the most intriguing is the agentic pattern, where the model is prompted to use a function when appropriate to interact with the world or to produce a better mix of generative and predictive results. Interacting with financial or economic data is a common use case, since financial quantities tend to have discrete values, such as stock prices or GDP.

There are many tutorials and examples of function calling with external APIs for both proprietary and open models, such as those from OpenAI and Mistral. But what about on-device open-source models? And why is that important?

It is important for several reasons, including privacy, cost, and edge device capabilities.

Both Apple and NVIDIA have signaled as much with their recent announcements. Apple announced Apple Intelligence this week, which, among other things, will make generative AI a core part of its operating systems and, where possible, use on-device models to power new features. Last week NVIDIA announced Chat RTX, which allows Windows users to run an LLM on their NVIDIA hardware, providing a chat interface that connects to their local content.

These announcements are in line with recent advances in the power of small language models. Recently, Microsoft released Phi-3-mini, a 3.8-billion-parameter model with a 4k context window that outperforms models more than twice its size and is good at code generation. As smaller open-source models continue to narrow the gap with larger (and proprietary) models, it raises the question: why use an API and run anything off-device at all?

Which leads to the purpose of this article: to demonstrate designing and building an agentic LLM application that uses only on-device open-source language models, APIs, and tooling.

Since applications are more fun to build than notebooks, I decided to create an application that could pull and discuss data from the Federal Reserve Economic Data (“FRED”) database maintained by the Federal Reserve Bank of St. Louis.

In building this agentic LLM application, I set the following objectives and constraints:

  • Local Generative Models: Use only local AI, no external API calls, including embeddings.
  • Function Calling: Build an agentic application that calls its own functions.
  • Fallback Capability: Integrate a second, context-aware LLM for conversation.
  • Open-Source Framework: Leverage an existing open-source AI development framework.
  • Ease of Installation: Ensure easy installation without requiring special or outsized hardware.
  • User Interface: Add a web interface for chat and data visualization.

To achieve these objectives, I utilized two key technology components: Ollama and Haystack. Ollama handles downloading and serving the LLMs via an API on my laptop, while Haystack provides a Python framework for building composable LLM-powered pipelines in an elegant way. These tools proved to be a joy to use and are highly recommended.
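The article doesn't reproduce the model-serving code, but as an illustration, talking to an Ollama server from Python can be as simple as a single POST to its local REST endpoint. This sketch assumes Ollama is running on its default port (11434) with a model such as `phi3` already pulled; the function names are my own, not the project's:

```python
import json
import urllib.request

# Ollama's default local generate endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    """Build the JSON payload for a single, non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send the prompt to the locally running Ollama server and return the text."""
    payload = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server):
#   generate("phi3", "In one sentence, what is FRED?")
```

Haystack's Ollama integration wraps this same API, so in the actual pipeline the call happens inside a generator component rather than by hand.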

The Agent FRED source code is open-source and available on GitHub. The project README explains how to install and configure the application, and the code is easy to follow.

The Agentic LLM Prompt

When designing the agentic AI pipeline, I needed to address two key design considerations. First, how do I prompt an LLM to call a function, since this ability is not handled for me behind the scenes as it would be with a proprietary model API? After a few attempts, this prompt, inspired by Simon Willison’s keynote at PyCon 2024 in Pittsburgh, PA last month, worked reasonably well:

Figure 2 - There are different ways to do this, including Chain of Thought, but this one works here.
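The exact prompt lives in the figure, which isn't reproduced here, but a prompt in the same spirit might look like the following. The tool name and signature (`get_fred_series`) are hypothetical stand-ins, not the project's actual function:

```python
# Illustrative system prompt for the agent LLM. The model is told to emit
# either a single well-formed function call or a sentinel word, which makes
# its output easy to parse downstream.
AGENT_PROMPT = """You are a function-calling assistant.
You have access to one tool:

  get_fred_series(series_id, start_date, end_date)
    Fetches observations for a FRED data series.

If the user's request requires economic data, respond with ONLY one line
of the form:
  get_fred_series("SERIES_ID", "YYYY-MM-DD", "YYYY-MM-DD")
Otherwise respond with the single word: CHAT

User request: {question}
"""

def render_prompt(question: str) -> str:
    """Fill the user's question into the agent prompt template."""
    return AGENT_PROMPT.format(question=question)
```

Constraining the reply to one parseable line (or a sentinel) is what lets a simple router decide which branch of the pipeline to take.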

The Conditional Pipeline

The other key design element was incorporating conditionality into the LLM pipeline so that it can follow different workflow paths depending on the user’s prompt, with one path handling discrete function calling and another handling generative conversation. This had to be done in a way that did not break the pipeline’s ability to track the state (history) of the conversation and provide relevant data points based on the question.

One of Haystack's best features is the ability to render diagrams from code. This is the Haystack pipeline for Agent FRED:

Figure 3 – This pipeline includes two LLMs, a custom router, and a retriever to pull previously fetched data.

Another issue I had to solve was working around the way Haystack implements conditionality in its router, which uses Jinja2, a templating system that allows simple expressions. I needed something more specific to my use case: evaluating whether the agent LLM constructed a proper function call. So I created a router that can accept and use a custom Jinja2 filter:

Figure 4 - Once registered, the custom filter can be called from a Jinja2 template in the router.

Creating and passing a custom filter that uses regex to extract arguments from the function call response resulted in conditional router logic that was as simple as this:

Figure 5 - Agent FRED's agentic conditional router.
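Since the filter itself is only shown in the figures, here is a library-free sketch of what a regex-based extraction filter and the resulting two-way routing decision could look like. The function call format (`get_fred_series(...)`) is a hypothetical example, and the Jinja2 registration shown in the trailing comment is illustrative rather than Haystack's exact API:

```python
import re

# Matches a well-formed call like:
#   get_fred_series("GDP", "2020-01-01", "2024-01-01")
CALL_RE = re.compile(
    r'get_fred_series\(\s*"([^"]+)"\s*,\s*"([^"]+)"\s*,\s*"([^"]+)"\s*\)'
)

def extract_function_call(reply: str):
    """Custom-filter logic: pull (series_id, start, end) out of the LLM reply,
    or return None if the reply is not a well-formed function call."""
    match = CALL_RE.search(reply)
    return match.groups() if match else None

def route(reply: str) -> str:
    """Stand-in for the conditional router: choose a pipeline branch."""
    return "function_call" if extract_function_call(reply) else "chat"

# In the Haystack router, a filter like extract_function_call would be
# registered with the template environment so a route condition can use it,
# along the lines of:
#   {{ replies[0] | extract_function_call is not none }}
```

Keeping the condition this simple is the payoff of the strict output format the agent prompt enforces: the template only has to ask whether the filter found a match.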

Talking to FRED

Calling the FRED API is thankfully straightforward (special thanks to my colleague Zach Harner for researching the FRED API so I could focus on the AI aspect). The API for each data series appears to work the same way, and the returned data is standardized, so I could make simplifying schema assumptions about the responses:

Figure 6 - The agent LLM determines the inputs to this function and when to call it.
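As a rough illustration of that uniformity, a single fetcher can serve every series. This sketch uses FRED's public observations endpoint; the function names are mine, a free API key is assumed, and FRED's convention of marking missing values with "." is filtered out:

```python
import json
import urllib.parse
import urllib.request

# FRED's observations endpoint works the same way for every data series.
FRED_BASE = "https://api.stlouisfed.org/fred/series/observations"

def build_fred_url(series_id: str, api_key: str, start: str, end: str) -> str:
    """Build the request URL for any FRED series over a date range."""
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": start,
        "observation_end": end,
    }
    return f"{FRED_BASE}?{urllib.parse.urlencode(params)}"

def get_fred_series(series_id, api_key, start, end):
    """Fetch observations and return (date, value) pairs.
    FRED uses '.' to mark missing observations, so those are skipped."""
    url = build_fred_url(series_id, api_key, start, end)
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read())
    return [(o["date"], o["value"])
            for o in data["observations"] if o["value"] != "."]

# Example (requires a FRED API key):
#   get_fred_series("GDP", api_key, "2020-01-01", "2024-01-01")
```

Because the response schema is identical across series, the agent only has to supply a series ID and a date range, which is exactly the kind of discrete, structured input an LLM can be prompted to produce.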

The Final Touches

The final steps in bringing the different elements in this project together were twofold.

First, I needed an integrated chat model that could benefit from data pulled by the agent and that would behave like a normal conversational agent, including remembering chat history. It also needed a separate prompt template, because the chat model plays a different role from the agent.

For this I selected the Haystack ChatPromptBuilder, which, after a bit of trial and error, I was able to configure with a simple prompt that integrated with my document store and conversation history. This is the final pipeline, which is just 40 lines of Python code:

Figure 7 - The final Agent FRED pipeline, 40 lines of Python.
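Stripped of the Haystack machinery, the idea behind that chat prompt is simple: the chat LLM's message list is assembled from a system prompt carrying whatever data the agent has fetched, followed by the conversation history and the new question. A library-free sketch (the function and field names are illustrative, not ChatPromptBuilder's actual API):

```python
# Minimal sketch of context-aware chat prompt assembly: retrieved data
# snippets go into the system message, history and the new question follow.
def build_chat_messages(history, documents, question):
    """history: list of (role, text) pairs; documents: retrieved snippets."""
    context = "\n".join(documents) if documents else "No data fetched yet."
    messages = [{
        "role": "system",
        "content": ("You are a helpful economics assistant. Use the data "
                    "below when it is relevant.\n\nData:\n" + context),
    }]
    messages += [{"role": role, "content": text} for role, text in history]
    messages.append({"role": "user", "content": question})
    return messages
```

Keeping the fetched data in the system message means the chat model stays "context aware" across turns without the agent branch having to run again.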

Second, I needed a slick web interface, because I wanted to display the data that the application pulled from FRED, and chatting with an LLM is nicer in the browser than in a terminal window. For this I used Gradio (https://www.gradio.app/), which provides a composable Python framework for building responsive web applications.
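Gradio's ChatInterface makes the chat portion of such a UI nearly a one-liner: it calls a handler function with the new message and the running history. The sketch below is not the project's UI code; `respond` is a placeholder for a call into the pipeline:

```python
# respond() stands in for a call into the Agent FRED pipeline. Gradio's
# ChatInterface invokes it with the new message and the chat history.
def respond(message: str, history: list) -> str:
    return f"(pipeline reply to: {message})"

# Gradio wiring (requires `pip install gradio`):
#   import gradio as gr
#   gr.ChatInterface(fn=respond, title="Agent FRED").launch()
```

The actual application also renders the fetched data series alongside the chat, which Gradio's composable layout components accommodate in the same app.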

It works! Delicately…

I will admit that this is a simple application that generally works better when it is used delicately. I will discuss this aspect of the application in the key takeaways below. However, these integrations and this tech stack provide a foundation for building more fault-tolerant and robust applications, and they are fully local and fully open source.

Figure 8 - Agent FRED is on the case.

Conclusion and key takeaways

This was a fun exercise that took me about two weeks to research and build, mostly in my spare time. Here are some insights from the experience for future consideration:

  • Design for generative, design for predictive: LLM technology is generative, which means it will occasionally not work as expected. Use cases can be tolerant or intolerant of this probabilistic nature. Most solutions will likely involve a mix of both, so factor that into how LLM-powered applications are designed and discussed. Make sure to educate users on how the probabilistic nature of these applications will manifest.
  • Pick the right tool for the problem: There are many domain-specific use cases that generative AI can solve without an expensive proprietary model and heavy frameworks. Building out and/or using AI *aaS is fine, but it isn’t necessary for every problem, and lightweight solutions like this one can be fast and accretive.
  • On-device models have major advantages: Cost and privacy, primarily, plus more integration opportunities. In addition to Apple, NVIDIA just announced a new tool for running inference on larger models by leveraging their GPUs directly in Windows. Microsoft released Phi-3, which in my experiments has turned out to be a great little model. Expect faster growth in this area than on large platforms and proprietary models, and be ready to take advantage of these capabilities.
  • Language models are good at language: Sometimes Large Language Models seem magical, so it is easy to forget that they are fundamentally just language models. Implementing my own agentic prompt made sense because that is what language models do. They are forms of narrow intelligence, not general intelligence, so treat them as such. LLMs are good at creating natural language interfaces, searching content, and generating content. They are not a good substitute for non-existent interfaces or bad data governance practices.
  • Most applications do not work out of the box: Much like wrapping Python code that should work in a try/except block, it became clear when building Agent FRED that layers of fault tolerance are needed at each node of the pipeline, with the models themselves evaluating the results and retrying as necessary. A good, lightweight framework makes building out fault tolerance easier.


With over 20 years of experience in financial risk management, fintech software development, and valuation services, I am a Senior Director at Alvarez & Marsal, a leading global professional services firm. As a CPA and AWS Certified Solutions Architect, I combine technical, business, and regulatory expertise to deliver innovative and scalable solutions for complex financial challenges.

