Scalable Generative AI with Optimized Models
Guilhaume LEROY-MELINE
IBM Distinguished Engineer, Transforming Businesses with AI, Quantum and Data, IBM Consulting France
Each week brings new model updates, breaking benchmarks in multiple domains: reasoning, extraction, querying, transcribing, recognizing… Focusing only on pure functional performance is a big mistake, especially in the near future where we will move from monolithic generative AI to an agentic approach, where multiple agents collaborate.
This is the reason why we need models like IBM Granite 3.0: a family of small, optimized language models suited for scalable and trustworthy AI, striking a balance between trust, quality, latency and cost that is critical for moving towards agentic workflows at scale.
Build your model portfolio for scaling use cases
I am not asking you to stop integrating and using large models: access to those models is key for innovation at smaller scale, for prototyping, or for low-volume use cases with the highest variability. To discover novel use cases, or to provide a highly flexible daily assistant to your employees, large and often costly models are fine. But once a pattern is discovered, like content creation in marketing, code generation for data transformation, knowledge extraction from scientific papers, call center summarization, support ticket handling for code fixing… we need to move to scalable generative AI with a collection of optimized models.
For transformational GenAI use cases to have an impact on the way enterprises operate, it is not sufficient to put assistants in the hands of employees; you also need to radically transform whole processes that are called millions of times. Let's take some examples from personal experience:
Example 1: Call center transformation
Making it possible to gather insights from call center conversations with clients is not about equipping 2,000 agents with an assistant; it is about processing more than 1 million minutes of calls each day and extracting insights from 130,000+ transcripts. That amounts to half a billion tokens per day to feed into a generative AI model.
To produce the transcripts, a faster-whisper implementation on V100 GPUs would need around 30 GPUs, in a world where GPUs are difficult to get. Highly optimized large speech models can run on CPU, a much more accessible resource: around 200 CPU cores handle the same volume, a 10x cost reduction and 3x less energy consumption compared to 30 GPUs. The trade-off is a marginal increase in word error rate, which doesn't affect the quality of the downstream transcript processing.
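To make the CPU option concrete, here is a minimal sketch using faster-whisper's built-in CPU mode with int8 quantization. It is an illustration only: the audio file name and model size are placeholders, and the exact stack used in the project described above may differ.

```python
# Minimal sketch: CPU transcription with faster-whisper and int8 quantization.
# The file name and model size are placeholders; tune cpu_threads to your cores.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=8)

segments, info = model.transcribe("call_recording.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:7.2f}s -> {segment.end:7.2f}s] {segment.text}")
```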
Now for the processing part: if we use one of the biggest open models, like Llama-405B, it would take 120 H200 GPUs in daily usage, with a latency of 10 seconds to get the result for each transcript. Your call center transcripts are sensitive: you don't want any potential leak and you must respect all data-privacy requirements, so you run the model yourself. From a pure hardware, management and energy perspective, that would cost approximately $10k per day and 2 MWh. Not sustainable from an energy or cost perspective. For the same use case, a highly targeted and optimized model, an 8B model or even a 3B MoE model, can achieve the same accuracy (and sometimes better when properly tuned); with optimizations like micro-batching, 2 H100 GPUs per day are sufficient: a 70x cost reduction and a 90x energy reduction, minimum.
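As a back-of-envelope check, the sizing above can be reproduced in a few lines. The figures are the rough ones quoted in this example, not measurements to reuse as-is; the script only derives per-transcript averages and reduction ratios from them.

```python
# Back-of-envelope sizing from the figures quoted above (order of magnitude only).
MINUTES_PER_DAY = 1_000_000      # call minutes processed daily
TRANSCRIPTS_PER_DAY = 130_000    # transcripts produced daily
TOKENS_PER_DAY = 500_000_000     # ~half a billion input tokens per day

minutes_per_call = MINUTES_PER_DAY / TRANSCRIPTS_PER_DAY      # ~7.7 minutes
tokens_per_transcript = TOKENS_PER_DAY / TRANSCRIPTS_PER_DAY  # ~3,850 tokens

LARGE_MODEL_GPUS = 120   # H200s for a 405B-class model
SMALL_MODEL_GPUS = 2     # H100s for a tuned 8B / 3B-MoE model with micro-batching

print(f"~{minutes_per_call:.1f} min/call, ~{tokens_per_transcript:,.0f} tokens/transcript")
print(f"raw GPU-count reduction: {LARGE_MODEL_GPUS / SMALL_MODEL_GPUS:.0f}x "
      "(the 70x cost / 90x energy figures also account for the cheaper GPU type)")
```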
In conclusion, scalable generative AI in a call center requires in-depth optimization of the different models used, to support real, large-scale call center operations. This is something that is often not considered in simple prototypes and small pilots.
Example 2: Scientific documentation process transformation
We see stunning examples of the latest models processing a full 10-page document, with mathematical equations, and solving them. It involves multimodal generative AI, and it is certainly impressive. We can imagine such models augmenting knowledge workers by extracting and interpreting scientific documents from private as well as public sources. In the physics domain alone, we are talking about 1,000,000 pages of publications over the last 5 years.
The easy temptation is to run a multimodal model over the 1,000,000 pages to find and process the data you need. But, fortunately, science is continuously evolving: each day brings new concepts, new ideas, new ways of combining scientific domains to produce novel research and innovation. Does that mean you will reprocess the 1,000,000 pages as images with a new prompt? Or would you rather process markdown / text converted documents (have a look at Docling from IBM Research) with an optimized text-based model? Think 48 GB of memory for the smallest multimodal model versus 8 GB and below for text-based models. Convert once, process multiple times: that is the principle of scalable knowledge processing, especially for scientific documents. Of course, multimodal models are still useful and important to keep in the model portfolio, for sporadic analysis or to process parts of documents that require image and positional understanding, like diagrams, schematics, graphs...
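A minimal sketch of the convert-once, process-many pattern, assuming Docling's Python API (DocumentConverter / export_to_markdown) and a placeholder summarize() function standing in for whatever small text model you serve:

```python
# Sketch: convert PDFs to markdown once with Docling, then reprocess the text
# as often as needed with a small text-based model. `summarize` is a placeholder.
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
corpus_dir, text_dir = Path("papers_pdf"), Path("papers_md")
text_dir.mkdir(exist_ok=True)

# One-off (and relatively expensive) conversion step.
for pdf in corpus_dir.glob("*.pdf"):
    result = converter.convert(str(pdf))
    (text_dir / f"{pdf.stem}.md").write_text(result.document.export_to_markdown())

def summarize(text: str) -> str:
    """Placeholder: call your optimized text model (e.g. a small Granite endpoint)."""
    raise NotImplementedError

# Cheap, repeatable processing step: rerun with new prompts or new models at will.
for md in text_dir.glob("*.md"):
    print(md.name, "->", summarize(md.read_text())[:80])
```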
As a result, an 80%–90% reduction in energy consumption can be achieved by keeping a good equilibrium between what is processed by a pure-text model and what goes to multimodal ones. More importantly, you gain the ability to reprocess at scale with newer models, or with models fine-tuned for specific domains.
The future: Impact of agentic workflows, a 100x increase in generative AI workloads
In the past 3 years we got used to interacting with generative AI through chat-based conversations, injecting prompts and context, working at the pace of our inputs. With the new generation of models able to build code and some logic to achieve goals, like fixing code and producing unit tests, generating diagrams using mermaid code, or automating internet searches to gather information, we are moving from a single call to an LLM to multiple calls in parallel, testing multiple scenarios to achieve the goal requested by the user.
Recent agentic implementations take 5 to 10 steps. Let's take a simple example: I want to quickly understand the impact of a new release of a framework I use, to leverage the new features in my current code. The agent would get the release notes, grab the code I work on, identify potential updates I could make, and write unit tests to cover the new features... turning a simple request into 10+ sub-requests, and more if there are multiple paths to explore.
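To show why one user request becomes 10+ model calls, here is a hypothetical sketch of that framework-upgrade agent. call_llm is a placeholder for any inference client; nothing here is a real agent-framework API.

```python
# Hypothetical sketch: one user request fanning out into many LLM sub-requests.
# `call_llm` stands in for your inference client; it is not a real library call.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a (preferably small, optimized) model."""
    raise NotImplementedError

def assess_framework_release(release_notes: str, code_files: dict[str, str]) -> dict:
    # Call 1: extract the relevant changes from the release notes.
    changes = call_llm(f"List breaking changes and new features in:\n{release_notes}")

    # Calls 2..N+1: one request per source file, issued in parallel.
    prompts = {
        name: f"Framework changes:\n{changes}\n\nSuggest updates for this file:\n{code}"
        for name, code in code_files.items()
    }
    with ThreadPoolExecutor(max_workers=8) as pool:
        suggestions = dict(zip(prompts, pool.map(call_llm, prompts.values())))

    # Final call: generate unit tests covering the proposed updates.
    tests = call_llm("Write unit tests covering:\n" + "\n".join(suggestions.values()))
    return {"suggestions": suggestions, "tests": tests}
```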
So it has never been so critical to think about using the right model, one able to absorb 100x more requests with 100x less energy.
At a glance:
We are moving from big, monolithic, energy-hungry models to a mesh of reactive and sustainable micro models orchestrated by intelligent routers. It is time to build at scale, to deliver value across millions of transactions at the right carbon impact, cost and latency.
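As a final illustration, a toy version of such a router might look like the sketch below. The model names and the complexity heuristic are invented for the example; a real router would rely on classification, confidence scores or cost budgets.

```python
# Toy "intelligent router": send the bulk of traffic to a small optimized model and
# escalate only unusual or complex requests. Names and thresholds are illustrative.
def complexity_score(prompt: str) -> float:
    """Crude heuristic: long, multi-question prompts score higher."""
    return min(1.0, len(prompt) / 4000 + 0.1 * prompt.count("?"))

def route(prompt: str) -> str:
    if complexity_score(prompt) < 0.5:
        return "small-optimized-model"   # e.g. an 8B-class model for the bulk of calls
    return "large-frontier-model"        # reserved for rare, high-variability requests

if __name__ == "__main__":
    print(route("Summarize this call transcript: ..."))  # -> small-optimized-model
```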
Note: personal post reflecting my own experience in implementing Generative AI at Scale.