Multimodal Agents in AI
Micheal Kuhn
Software and Data Engineering Leader | Problem Solver | Driving Business Success with Innovative Ideas, Disciplined Execution, and Continuous Growth | Empowering Healthy and Inspired Teams
Disclaimer: This post is going to discuss components of AI and AWS services. I may misuse or incorrectly associate words or definitions. I welcome polite corrections, so I can learn.
On Friday, I attended a virtual session put on by AWS about AI. You can watch for yourself here. For me, there were two major takeaways. The first was about multimodal AI. I see this as the next stage of AI now emerging, where model applications like ChatGPT move from strictly text in, text out to accepting multiple types of input and producing different types of output. Think speech in, text out; text in, image out; or text and document in, text out, all through the same interface! The agent accesses the foundation model (the brains, where the information lives) and makes use of a number of tools at its disposal to manipulate data and create outputs. Please see the sample application diagram below, taken from the video, as an example.
In essence, the agent is the worker/orchestrator of a process: it takes a given set of inputs and organizes the steps and actions necessary to reach the output.
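To make the orchestrator idea concrete, here is a minimal sketch in Python. Everything here is illustrative and hypothetical, not an actual AWS or Bedrock API: the `Agent` class, the tool names, and the hard-coded plans are stand-ins for what a real agent would get from the foundation model.

```python
# Minimal sketch of the agent-as-orchestrator idea described above.
# All names (Agent, transcribe, answer) are illustrative, not a real API.

from typing import Callable, Dict, List

class Agent:
    """Routes a request through an ordered list of tool steps."""

    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools

    def plan(self, modality: str) -> List[str]:
        # A real agent would ask the foundation model to plan the steps;
        # here the plan is simply hard-coded per input modality.
        plans = {
            "speech": ["transcribe", "answer"],
            "text": ["answer"],
        }
        return plans[modality]

    def run(self, modality: str, payload: str) -> str:
        result = payload
        for step in self.plan(modality):
            result = self.tools[step](result)  # each tool transforms the data
        return result

# Toy tools standing in for real services (speech-to-text, an LLM call).
tools = {
    "transcribe": lambda audio: f"transcript of {audio}",
    "answer": lambda text: f"answer based on: {text}",
}

agent = Agent(tools)
print(agent.run("speech", "greeting.wav"))
# → answer based on: transcript of greeting.wav
```

The point is the shape, not the details: one interface, multiple input types, and the agent deciding which chain of tools turns the input into the output.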
Things really got interesting to me when they started talking about Agents for Amazon Bedrock, though. This part of the discussion opened my eyes to what might be next, and it's where my biggest takeaway came from.
Agents - what if there is more than one?
While the idea above, that agents can organize a set of operations and execute them, was interesting and shed some light on how things work behind the curtain, adding multiple agents creates some very interesting possibilities, especially if some of those agents have access to APIs that can enrich the data in the foundation model (FM). The first example given was an application that recommends what to wear for the day: a public weather API enriches the model with current conditions, you provide only your location, and you get back suggestions for appropriate clothing. Nice, but who really wants or needs an app for that? Then they blew my mind. The question posed was: what kinds of tasks do you spend tons of time on that involve a bunch of repeatable actions? The first answer was booking flights, with all the comparing and research you do before picking the right flight for you!
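The clothing example can be sketched in a few lines. This is a toy, assuming a hypothetical weather lookup: the `get_weather` function is stubbed with fixed values, where a real agent tool would call an actual public weather API.

```python
# Sketch of the clothing-recommendation example: an agent tool that
# enriches the model's context with live weather data. The weather
# lookup is stubbed; a real version would call a public weather API.

def get_weather(location: str) -> dict:
    # Stand-in for a real API call against a weather service.
    return {"temp_c": 4, "precipitation": True}

def recommend_clothing(location: str) -> str:
    weather = get_weather(location)  # the enrichment step the FM alone can't do
    base = "a warm coat" if weather["temp_c"] < 10 else "a light jacket"
    extra = " and an umbrella" if weather["precipitation"] else ""
    return f"For {location}, wear {base}{extra}."

print(recommend_clothing("Seattle"))
# → For Seattle, wear a warm coat and an umbrella.
```

The interesting part is the enrichment: the model's static knowledge gets combined with data it could never have on its own, fetched at the moment you ask.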
The possibilities! Can you imagine? Applications that help you plan your trip: take in the details of where you're going, when, and what's important to you; let it go out and find the options for where to go, where to stay, how to get around, and what to do; then you make your selections and it makes the purchases and creates your itinerary. The hours this would save! And this is just the start of the possibilities that have come to mind for so many different industries. The amount of time people spend on research and weighing options is kind of insane on its own! Many of these applications will have to be built by companies, probably for internal use to begin with. They may even have to pursue partnerships with other companies to get access to the right kind of data to refine their models and create a truly exciting experience for consumers.
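The multi-agent trip planner above can also be sketched. All names and data here are fabricated for illustration: each specialist function stands in for an agent that would really call airline, hotel, or events APIs, and a coordinator merges their findings.

```python
# Hypothetical sketch of the trip-planning idea: several specialized
# agents each research one part of the trip, and a coordinator merges
# their results into one itinerary. All data is fabricated.

def flight_agent(request: dict) -> dict:
    # Would query airline APIs and compare options in a real system.
    return {"flight": f"nonstop to {request['destination']}"}

def hotel_agent(request: dict) -> dict:
    # Would query hotel inventory APIs in a real system.
    return {"hotel": f"3-night stay in {request['destination']}"}

def coordinator(request: dict) -> dict:
    itinerary = {}
    for agent in (flight_agent, hotel_agent):  # fan out to each specialist
        itinerary.update(agent(request))
    return itinerary

plan = coordinator({"destination": "Lisbon"})
print(plan)
```

Each specialist could grow into a full agent with its own tools; the coordinator is where the time savings come from, since it does the fan-out and comparison a person would otherwise do by hand.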
In the end, the concepts of how AI agents and Agents for Amazon Bedrock work have awakened new business cases that can be pursued. We're starting to see the steps that lead from answering simple questions or taking basic commands toward taking in varying types of information, refining and synthesizing it with specialized or more current data, and then layering on multiple actions to create tools that simplify time-wasting and repetitive steps. In some cases that might be something great in the hands of the consumer; in others, it becomes an assistant for an employee, saving time and effort on research and providing options they can apply their expertise to in deciding where to go next, what to do, and why. Get ready: I may only be thinking of this now, but I'm certain there are many out there who are multiple steps ahead of me, and the future is basically here!