Scalable Generative AI with Optimized Models
Guilhaume LEROY-MELINE
IBM Distinguished Engineer, Transforming Businesses with AI, Quantum and Data, IBM Consulting France
Each week brings new model updates, breaking benchmarks in multiple domains: reasoning, extraction, querying, transcribing, recognizing… Focusing only on pure functional performance is a big mistake, especially in the near future where we will move from monolithic generative AI to an agentic approach, where multiple agents collaborate.
This is the reason why we need models like IBM Granite 3.0: a family of small, optimized language models suited for scalable and trustworthy AI, striking a balance between trust, quality, latency and cost that is critical for moving towards agentic workflows at scale.
Build your model portfolio for scaling use cases
I am not asking you to stop integrating and using large models: access to those models is key for innovation at smaller scale, for prototyping, or for low-volume use cases with the highest variability. To discover novel use cases, or to provide a highly flexible daily assistant to your employees, large and often costly models are fine. But once a pattern is discovered, like content creation in marketing, code generation for data transformation, knowledge extraction from scientific papers, call center summarization, support ticket handling for code fixing… we need to move to scalable generative AI with a collection of optimized models.
For transformational GenAI use cases to have an impact on the way enterprises operate, it is not sufficient to put assistants in the hands of employees; you also need to radically transform whole processes that are called millions of times. Let's take some examples from personal experience:
Example 1: Call center transformation
Making it possible to gather insights from call center conversations with clients is not about equipping 2,000 agents with an assistant; it is about processing more than 1 million minutes of calls each day and extracting insights from 130,000+ transcripts. That amounts to half a billion tokens per day to feed into a generative AI model.
To produce the transcripts, a faster-whisper implementation on V100 GPUs would need around 30 GPUs, in a world where GPUs are difficult to get. Highly optimized large speech models can run on CPU, a much more accessible resource: around 200 CPU cores handle the same volume, a 10x cost reduction and 3x less energy consumption compared to 30 GPUs. The trade-off is a marginal increase in word error rate, which doesn't affect the quality of the downstream transcript processing.
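To make the CPU option concrete, here is a minimal sketch using faster-whisper's built-in CPU mode with int8 quantization. It is an illustration only: the audio file name and model size are placeholders, and the exact stack used in the project described above may differ.

```python
# Minimal sketch: CPU transcription with faster-whisper and int8 quantization.
# The file name and model size are placeholders; tune cpu_threads to your cores.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=8)

segments, info = model.transcribe("call_recording.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:7.2f}s -> {segment.end:7.2f}s] {segment.text}")
```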
Now for the processing part: if we use one of the biggest open models, like Llama-405B, it would take 120 H200 GPUs in daily usage, with a latency of 10 seconds to get the result for each transcript. Your call center transcripts are sensitive: you don't want any potential leak and you must respect all data-privacy requirements, so you run the model yourself. From a pure hardware, management and energy perspective, that would cost approximately $10k per day and 2 MWh. Not sustainable from an energy or cost perspective. For the same use case, a highly targeted and optimized model, an 8B model or even a 3B MoE model, can achieve the same accuracy (and sometimes better when properly tuned); with optimizations like micro-batching, 2 H100 GPUs per day are sufficient: a 70x cost reduction and a 90x energy reduction, minimum.
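As a back-of-envelope check, the sizing above can be reproduced in a few lines. The figures are the rough ones quoted in this example, not measurements to reuse as-is; the script only derives per-transcript averages and reduction ratios from them.

```python
# Back-of-envelope sizing from the figures quoted above (order of magnitude only).
MINUTES_PER_DAY = 1_000_000      # call minutes processed daily
TRANSCRIPTS_PER_DAY = 130_000    # transcripts produced daily
TOKENS_PER_DAY = 500_000_000     # ~half a billion input tokens per day

minutes_per_call = MINUTES_PER_DAY / TRANSCRIPTS_PER_DAY      # ~7.7 minutes
tokens_per_transcript = TOKENS_PER_DAY / TRANSCRIPTS_PER_DAY  # ~3,850 tokens

LARGE_MODEL_GPUS = 120   # H200s for a 405B-class model
SMALL_MODEL_GPUS = 2     # H100s for a tuned 8B / 3B-MoE model with micro-batching

print(f"~{minutes_per_call:.1f} min/call, ~{tokens_per_transcript:,.0f} tokens/transcript")
print(f"raw GPU-count reduction: {LARGE_MODEL_GPUS / SMALL_MODEL_GPUS:.0f}x "
      "(the 70x cost / 90x energy figures also account for the cheaper GPU type)")
```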
In conclusion, scalable generative AI in a call center requires in-depth optimization of the different models used, to support real, large-scale call center operations. This is something that is often not considered in simple prototypes and small pilots.
Example 2: Scientific documentation process transformation
We see stunning examples of the latest models processing a full 10-page document, with mathematical equations, and solving them. It involves multimodal generative AI, and it is certainly impressive. We can imagine such models augmenting knowledge workers by extracting and interpreting scientific documents from private as well as public sources. In the physics domain alone, we are talking about 1,000,000 pages of publications over the last 5 years.
The easy temptation is to run a multimodal model over the 1,000,000 pages to find and process the data you need. But, fortunately, science is continuously evolving: each day brings new concepts, new ideas, new ways of combining scientific domains to produce novel research and innovation. Does that mean you will reprocess the 1,000,000 pages as images with a new prompt? Or would you rather process markdown / text converted documents (have a look at Docling from IBM Research) with an optimized text-based model? Think 48 GB of memory for the smallest multimodal model versus 8 GB and below for text-based models. Convert once, process multiple times: that is the principle of scalable knowledge processing, especially for scientific documents. Of course, multimodal models are still useful and important to keep in the model portfolio, for sporadic analysis or to process parts of documents that require image and positional understanding, like diagrams, schematics, graphs...
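A minimal sketch of the convert-once, process-many pattern, assuming Docling's Python API (DocumentConverter / export_to_markdown) and a placeholder summarize() function standing in for whatever small text model you serve:

```python
# Sketch: convert PDFs to markdown once with Docling, then reprocess the text
# as often as needed with a small text-based model. `summarize` is a placeholder.
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
corpus_dir, text_dir = Path("papers_pdf"), Path("papers_md")
text_dir.mkdir(exist_ok=True)

# One-off (and relatively expensive) conversion step.
for pdf in corpus_dir.glob("*.pdf"):
    result = converter.convert(str(pdf))
    (text_dir / f"{pdf.stem}.md").write_text(result.document.export_to_markdown())

def summarize(text: str) -> str:
    """Placeholder: call your optimized text model (e.g. a small Granite endpoint)."""
    raise NotImplementedError

# Cheap, repeatable processing step: rerun with new prompts or new models at will.
for md in text_dir.glob("*.md"):
    print(md.name, "->", summarize(md.read_text())[:80])
```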
As a result, an 80%–90% reduction in energy consumption can be achieved by keeping a good equilibrium between what is processed by a pure-text model and what goes to multimodal ones. More importantly, you gain the ability to reprocess at scale with newer models, or with models fine-tuned for specific domains.
The future: Impact of agentic workflows, a 100x increase in generative AI workloads
In the past 3 years we got used to interacting with generative AI through chat-based conversations, injecting prompts and context, working at the pace of our inputs. With the new generation of models able to build code and some logic to achieve goals, like fixing code and producing unit tests, generating diagrams using mermaid code, or automating internet searches to gather information, we are moving from a single call to an LLM to multiple calls in parallel, testing multiple scenarios to achieve the goal requested by the user.
Recent agentic implementations take 5 to 10 steps. Let's take a simple example: I want to quickly understand the impact of a new release of a framework I use, to leverage the new features in my current code. The agent would get the release notes, grab the code I work on, identify potential updates I could make, and write unit tests to cover the new features... turning a simple request into 10+ sub-requests, and more if there are multiple paths to explore.
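To show why one user request becomes 10+ model calls, here is a hypothetical sketch of that framework-upgrade agent. call_llm is a placeholder for any inference client; nothing here is a real agent-framework API.

```python
# Hypothetical sketch: one user request fanning out into many LLM sub-requests.
# `call_llm` stands in for your inference client; it is not a real library call.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a (preferably small, optimized) model."""
    raise NotImplementedError

def assess_framework_release(release_notes: str, code_files: dict[str, str]) -> dict:
    # Call 1: extract the relevant changes from the release notes.
    changes = call_llm(f"List breaking changes and new features in:\n{release_notes}")

    # Calls 2..N+1: one request per source file, issued in parallel.
    prompts = {
        name: f"Framework changes:\n{changes}\n\nSuggest updates for this file:\n{code}"
        for name, code in code_files.items()
    }
    with ThreadPoolExecutor(max_workers=8) as pool:
        suggestions = dict(zip(prompts, pool.map(call_llm, prompts.values())))

    # Final call: generate unit tests covering the proposed updates.
    tests = call_llm("Write unit tests covering:\n" + "\n".join(suggestions.values()))
    return {"suggestions": suggestions, "tests": tests}
```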
So it has never been so critical to think about using the right model, one able to absorb 100x more requests with 100x less energy.
At a glance:
We are moving from big, monolithic, energy-hungry models to a mesh of reactive and sustainable micro models orchestrated by intelligent routers. It is time to build at scale, to deliver value across millions of transactions at the right carbon impact, cost and latency.
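As a final illustration, a toy version of such a router might look like the sketch below. The model names and the complexity heuristic are invented for the example; a real router would rely on classification, confidence scores or cost budgets.

```python
# Toy "intelligent router": send the bulk of traffic to a small optimized model and
# escalate only unusual or complex requests. Names and thresholds are illustrative.
def complexity_score(prompt: str) -> float:
    """Crude heuristic: long, multi-question prompts score higher."""
    return min(1.0, len(prompt) / 4000 + 0.1 * prompt.count("?"))

def route(prompt: str) -> str:
    if complexity_score(prompt) < 0.5:
        return "small-optimized-model"   # e.g. an 8B-class model for the bulk of calls
    return "large-frontier-model"        # reserved for rare, high-variability requests

if __name__ == "__main__":
    print(route("Summarize this call transcript: ..."))  # -> small-optimized-model
```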
Note: personal post reflecting my own experience in implementing Generative AI at Scale.