What is OpenAI's Assistants API and Why Should I Care?
Dennis Layton
A Senior IT architect and a proponent of the responsible adoption of AI
OpenAI introduced version 2 of the Assistants API in April 2024, and it was met with a collective yawn. Yet it may have been OpenAI's most significant announcement in months, and this article explains why. It is a non-technical introduction to the Assistants API and to the concept of agents. Soon, agents will be everywhere, and the Assistants API is one of the frameworks bringing that reality closer.
Application Programming Interface (API)
To keep this article non-technical, let’s explain what an API is. An Application Programming Interface (API) is a software-to-software interface that enables two applications to exchange data with each other.
OpenAI’s API allows a developer to write code that calls on a Large Language Model (LLM) such as GPT-4o when needed. In this case, the two software applications are the one the developer has created and the LLM itself, GPT-4o.
OpenAI provides two levels of APIs. The first is the Chat Completions API, which allows a developer to create a program that interacts with the LLM in much the same manner as a user does when working with ChatGPT.
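For readers comfortable with a little code, here is a minimal sketch of what that looks like using OpenAI's Python SDK. It assumes the openai package is installed and an OPENAI_API_KEY environment variable is set; the model name and prompt are purely illustrative.

```python
# A minimal Chat Completions call: one prompt in, one response out,
# much like typing a message into ChatGPT.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any chat-capable model works
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Draft a short email proposing a meeting next week."},
    ],
)
print(response.choices[0].message.content)
```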
For anything more significant, there is the Assistants API. It is the Assistants API that moves us closer to what agents will be able to do. Agents are described in the next section.
You’ve likely heard of agents: the idea that, rather than asking a Large Language Model (LLM) like GPT-4 to write an email for you, you ask an agent of that LLM to set up and book an appointment with one or more individuals. The former is task-oriented, where the tasks are determined by a human in the loop. The latter is goal-oriented, where the individual tasks are determined by the non-human agent.
Writing an email is accomplished through a well-crafted prompt and response. This is well within the realm of what you could do with ChatGPT. Setting up a meeting, however, requires thinking about the goal in a step-by-step manner and then executing those steps. While a chat is a series of prompts and responses, agents are about performing a series of thoughts and actions to achieve a goal in a largely autonomous manner.
Furthermore, it would help if there was some additional knowledge and memory to ensure the goal is achieved. For example, if the meeting is about your company’s new product, knowledge pertaining to that product, pulled from the company’s own product catalog, might be useful. In addition, there could have been an exchange of information prior to the meeting, such as suggested locations, times, and who to contact.
In preparation for the meeting, you might require other tools, including web search, financial calculations, and data analysis. Agents need access to tools like these to produce the summary-level information the meeting requires. This ability to use tools intelligently, when needed, is another hallmark of agents.
Finally, what if the goal goes beyond what a single specialized agent can accomplish? A good framework would allow you to set up a number of specialized agents that could work together in a collaborative fashion, sharing ideas. Perhaps for this meeting, you need one agent to handle the scheduling, another to do the financial analysis of the documents provided, and a third to put together a presentation based on the analysis and summarization of information.
In a nutshell, agents are about generative AI becoming more than a co-pilot and instead being a co-worker. OpenAI’s Assistants API is rapidly moving us one step closer to that reality. I used the simple example of setting up a meeting, but with the right kinds of agents, it could be about any goal that requires upfront planning and execution.
How Does the Assistants API Move Us Closer to Agents?
At the core of the Assistants API are, you guessed it, assistants. An agent requires the ability to access domain-specific knowledge, perform actions, and have a memory of past thoughts and actions from which it can learn. Let’s take each of these in turn and describe what an OpenAI assistant can do today.
What if I told you that an OpenAI assistant can now ingest up to 10,000 files before it even gets started on what you want to know? This means that an assistant could have at its disposal all the emails and financial and technical documents for a given client. Moreover, what if that assistant could not only do keyword searches on that repository of documents but also search based on the meaning of the words? In other words, it would know whether you are referring to Apple the company or apple the fruit.
So why does this matter? Think about all the tasks you do today that require expertise pertaining to your particular domain. Whether you work for a large company, an institution, or a one-person home renovation business, you have proprietary knowledge about how you do business, much of which is persisted in documents, formal and otherwise. Upload this information to an assistant, and it will know what you know.
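To make this concrete, here is a hedged sketch of how documents might be attached to an assistant using the v2 file search tool and a vector store. The store name, file name, and instructions below are illustrative placeholders.

```python
# A sketch of giving an assistant semantic search over your own documents.
# The file and store names below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# Create a vector store and upload a document into it; the API chunks,
# embeds, and indexes the file so it can be searched by meaning.
vector_store = client.beta.vector_stores.create(name="Client Documents")
with open("product_catalog.pdf", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, files=[f]
    )

# Create an assistant that consults those documents via the file_search tool.
assistant = client.beta.assistants.create(
    name="Client Expert",
    instructions="Answer questions using the attached client documents.",
    model="gpt-4o",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)
```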
We talked about agents performing a series of steps, thoughts, and actions. How does an agent access the memory of what has happened before when it gets to step 3, for example? With ChatGPT, I can carry out a series of prompts and responses while the conversation is kept alive and contextually relevant, and I can go back to that conversation at any time and pick up where I left off.
For the Assistants API, we have the concept of a thread where messages from assistants and users are retained and made accessible. I can have any number of threads in a program. In one thread, I could have a user interacting with an assistant, and in another thread, I could have multiple assistants.
Imagine a scenario where I have three generative AI assistants helping me prepare for a meeting: a scheduler, a financial analyst reviewing documents, and a media specialist working on the presentation slides. They can communicate by posting messages on the same thread and work collaboratively towards the goal by sharing what they know at each step.
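Here is a minimal sketch of a thread in practice; the assistant ID is a placeholder for an assistant created earlier.

```python
# A sketch of a shared thread: messages accumulate on the thread, and each
# run appends an assistant's response where every participant can read it.
from openai import OpenAI

client = OpenAI()

thread = client.beta.threads.create()

# The user posts the goal as a message on the thread.
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Find a time next week for the product review meeting.",
)

# Run the thread against an assistant; its reply is added to the same thread.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id="asst_...",  # placeholder ID of the scheduling assistant
)

# The full conversation history remains on the thread, newest first.
messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```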
Earlier, I mentioned that agents are an orchestrated sequence of thoughts and actions. To perform actions, the Assistants API provides a tools capability. Large Language Models (LLMs) are highly capable at many things, but sometimes more conventional programming tools are more effective when used intelligently and selectively. Web searches, or data analysis where charts are needed, are best handled when assistants use these capabilities as tools.
Most of these tools are nothing more than functions in the program code, designed to leverage an API to perform some action in the outside world, such as sending an email, getting the weather, or doing a web search. In a world where tens of thousands of APIs are available, that is a lot of possible actions.
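As a sketch, this is what declaring such a tool might look like. The send_email function here is hypothetical; the developer writes the actual code that sends the email, and the assistant merely decides when to call it and with what arguments.

```python
# A sketch of registering a function tool with an assistant.
# send_email is a hypothetical, developer-implemented action.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Meeting Helper",
    model="gpt-4o",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "send_email",  # hypothetical developer-defined action
                "description": "Send an email to a recipient.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "to": {"type": "string", "description": "Recipient address"},
                        "subject": {"type": "string"},
                        "body": {"type": "string"},
                    },
                    "required": ["to", "subject", "body"],
                },
            },
        }
    ],
)

# When a run decides to send an email, its status becomes "requires_action".
# The program then executes its own send_email code and returns the result
# via client.beta.threads.runs.submit_tool_outputs(...), and the run continues.
```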
Assistants API: What Are the Gaps?
Agents are the next step as generative AI moves from being a co-pilot, incapable of working autonomously on a goal without a human in the pilot’s seat, to an autonomous co-worker, with everything that implies. The Assistants API is important because it takes us 80% of the way there. Here are two of the major gaps.
The first gap is that assistants are not yet capable of planning and then executing from that plan; that part is still in the hands of a developer. In other words, I can write a program that creates or reuses three assistants: one that reviews, analyzes, and summarizes documents in preparation for a meeting; a second that helps schedule the meeting; and a third that creates the content of a slide presentation, or at least its text. I can create a thread where each of the assistants and the user post their prompts and responses, and at any time an assistant or the user can review the prompts and responses of previous steps in the process. However, it is up to me, as the developer, to activate the assistants in the correct order, as the sketch below illustrates.
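The assistant IDs below are placeholders, and the fixed order of the loop is exactly the planning that, today, lives in the developer's code rather than in the model.

```python
# Developer-driven orchestration: the program, not the model, decides
# the order in which assistants take their turns on the shared thread.
from openai import OpenAI

client = OpenAI()

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Prepare for next week's client meeting.",
)

# Placeholder assistant IDs, activated in an order hard-coded by the developer.
for assistant_id in ["asst_analyst", "asst_scheduler", "asst_presenter"]:
    client.beta.threads.runs.create_and_poll(
        thread_id=thread.id,
        assistant_id=assistant_id,
    )
    # Each run appends its output to the thread, where the next assistant reads it.
```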
The second gap concerns reflection. The Assistants API provides memory, but assistants do not reflect on what they have learned unless you prompt them to do so. This is not hard to do, but the question is whether assistants can learn from what has happened before and thereby yield better outcomes with experience. Early research suggests that they can.
There are other capabilities we might want agents to have in order to operate in our physical reality, such as knowledge of how to interact with the real world and the spatial reasoning needed to navigate it. For a purely cognitive agent, however, the capabilities described above are likely all we need.
If we have agents with all of the capabilities described earlier, capable of thoughts and actions, with memory of earlier thoughts and actions, and with the ability to reflect, then the question arises: can agents get better at what they do the more often they perform the tasks they were designed for?
So far, we have used a fairly trivial, everyday example of how agents may be used, but what about a more significant use case, such as medical diagnosis? A recent paper, “Agent Hospital: A Simulacrum of a Hospital with Evolvable Medical Agents,” published in May 2024, created a simulacrum of a working hospital. It not only simulated the entire process of treating illnesses; all of the patients, nurses, and doctors were themselves autonomous agents powered by LLMs. This allowed the researchers to treat ten thousand patients in a matter of days. Here is how the Agent Hospital paper describes it:
“After treating around ten thousand patients (real-world doctors may take over two years), the evolved doctor agent achieves a state-of-the-art accuracy of 93.06% on a subset of the MedQA dataset that covers major respiratory diseases.”
Here is what happened to the accuracy of examination, diagnosis, and treatment as the knowledge gained from interacting with thousands of patients increased (see the chart below).
So why is this important? It means that you can get significantly better outcomes from LLM technologies, now and in the future, without changing the underlying LLM. Here is a quote from the same paper:
“… LLMs may encounter limitations in performance as task complexity and diversity escalate. The existing training paradigms, which require the use of extensive data corpora or heavy human supervision, are deemed costly. Therefore, the development of self-evolutionary approaches has gained momentum. These approaches enable LLM-powered agents to autonomously acquire, refine, and learn through self-evolving strategies.”
All of this means that, up to a point, agents can get better at what they do. The improvement is not unlimited, as you can see from the flattening curve in the previous diagram, but the efficacy of the agents improved significantly. The Agent Hospital paper describes it this way:
“Through such reflective processes, agents can self-evolve, refine their methodologies, and thus achieve improved performance.”
Summary
The first part of this article described how close we are to full agent capabilities using only OpenAI’s Assistants API. Assistants can access large amounts of relatively unstructured domain knowledge, such as documents and emails, on top of the more general knowledge the LLM provides. They have memory, so the context of repeated runs of an assistant is retained and can be referenced. Finally, they can perform actions through tools that a developer can define.
In other words, you can get an assistant to compose a reply to an email using knowledge from the LLM and previous emails with that particular individual. It can then send that email through the use of a tool; in this case, by calling an existing API for the right email client.
The gap between OpenAI’s assistants and true agents is upfront planning, as well as reflection. That is still up to the developer to build in, and while none of it is especially difficult for most day-to-day tasks, it limits the autonomy of the assistants themselves. So what happens when that gap is closed? Assistants become cognitive agents and move from being a kind of co-pilot to being a co-worker.
Finally, agents running on their own are capable of getting better at what they do, at least up to a point. At the same time, the underlying LLMs are going to get smarter as well.
All of this leads us to a future where assistants evolve into cognitive agents and their relationship with humans shifts from co-pilot to co-worker. This is likely where the impact of generative AI starts to have significant effects, both positive and negative, on the economy and society as a whole.