Making GenAI Affordable: The Need to Slash Language Model Costs
Chandramouli (CM)
SVP, Artificial Intelligence // Head of Korn Ferry Digital India GCC
There has been a lot of chatter recently about the cost of GenAI and the pricing models LLM builders and enterprises should follow to build a sustainable, high-margin business around the technology. Seen at the token level, the cost of GenAI looks minuscule, but dive a little deeper and you will soon realize that the costs add up quickly.
TL;DR: A single GenAI-powered feature like call transcript summarization can run to roughly $4.60 per user per month in API fees alone, before any infrastructure or compliance overhead, so cost mitigation deserves attention from day one.
Context:
In this article, I will discuss a prevalent use case of GenAI: Call Transcription Analysis and Summarization. This involves analyzing a call transcript to provide a summary and action items, features that are now part of the Teams Copilot and various other communication-related applications.
'Teams Copilot' is a standout feature within the Microsoft Copilot experience that I find particularly beneficial: taking call notes has become a thing of the past, and its accuracy and speed are impressive. Drawing from my experience building a vertical-specific app for call transcript analysis, however, I can attest that the cost of delivering this feature is currently high.
Quick math:
Analyzing and summarizing a 60-minute call transcript with GenAI requires roughly 14,900 input tokens and generates about 2,600 output tokens. Here is how those numbers break down.
A 60-minute call transcript typically runs 8,000 to 9,500 words: around 8,000 words of actual content, plus overhead such as each sentence starting with the speaker's name or email, and full-length timestamps. Applying OpenAI's rule of thumb that 1,500 words correspond to roughly 2,048 tokens, the transcript works out to approximately 12,900 tokens.
The output is assumed to be about 20% of the input (an 80:20 input-to-output split), which gives roughly 2,600 output tokens. On top of the transcript, a typical simple prompt adds an estimated 2,000 tokens; if the extraction is highly domain- and context-specific, the prompt may need to grow 2 to 3 times larger to accommodate a few-shot prompting approach.
In total, a 60-minute call transcript therefore requires about 14,900 input tokens (12,900 transcript + 2,000 prompt) and produces about 2,600 output tokens.
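The quick math above can be sketched in a few lines of Python. The 2,048-tokens-per-1,500-words ratio and the 80:20 split are the assumptions from this article; exact counts always depend on the tokenizer and the specific transcript:

```python
def words_to_tokens(words: int) -> int:
    """Estimate token count from word count using OpenAI's rule of
    thumb that 1,500 words correspond to roughly 2,048 tokens."""
    return round(words * 2048 / 1500)

transcript_words = 9500   # 60-min call incl. speaker names and timestamps
prompt_tokens = 2000      # simple prompt; 2-3x larger for few-shot prompting

transcript_tokens = words_to_tokens(transcript_words)  # ~12,970 (article rounds to ~12,900)
input_tokens = transcript_tokens + prompt_tokens       # ~14,970 (article: ~14,900)
output_tokens = round(transcript_tokens * 0.20)        # 80:20 split -> ~2,590 (article: ~2,600)

print(input_tokens, output_tokens)
```

Swapping in a real tokenizer (e.g. `tiktoken`) for `words_to_tokens` would tighten these estimates for a specific model.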
LLM Inference Cost Analysis
Let's examine the costs of utilizing OpenAI's models as of January 24, 2024, as detailed on their pricing page (https://openai.com/pricing).
GPT-3.5-turbo-instruct has a context window of only 4K tokens, so making this use case work would demand very high engineering effort for chunking, queuing, and merging the summarized outputs. I would avoid GPT-3.5-turbo-instruct for this use case.
For processing a 60-minute call transcript, the costs for various models are as follows:
GPT-4 Turbo: $0.23
GPT-4 32k: $1.21
GPT-3.5 Turbo: $0.02
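These per-call figures can be reproduced from per-1K-token list prices. The prices below are the January 2024 rates consistent with the article's totals; always check the live pricing page, as these change frequently:

```python
# Per-1K-token prices in USD (input, output), per OpenAI's Jan 2024 pricing
PRICES = {
    "gpt-4-turbo":   (0.01,  0.03),
    "gpt-4-32k":     (0.06,  0.12),
    "gpt-3.5-turbo": (0.001, 0.002),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost for a single call at the given token counts."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

# The 60-minute transcript from the quick math: 14,900 in / 2,600 out
for model in PRICES:
    print(f"{model}: ${call_cost(model, 14_900, 2_600):.2f}")
```

Running this prints $0.23, $1.21, and $0.02 respectively, matching the list above.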
Product Use Case
Now let us extend the LLM inference cost to 1,000 users. The number of users and how often they interact with the 'Call Transcript Analysis' feature will naturally vary, so data scientists and product strategists should simulate different scenarios to gauge the financial repercussions of deploying such a feature.
Consider the following scenario: 1,000 users each record one 60-minute call per working day (about 20 per month), analyzed with GPT-4 Turbo at roughly $0.23 per call.
This equates to an expenditure of $4,600 per month to service those 1,000 users, translating to a cost of $4.60 per user per month.
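The extrapolation is straightforward. The per-call cost and the assumption of roughly 20 working days per month come from the scenario above:

```python
def monthly_cost(users: int, calls_per_day: int, cost_per_call: float,
                 working_days: int = 20) -> float:
    """Total monthly API spend for a user base, assuming every user
    makes the same number of calls on each working day."""
    return users * calls_per_day * working_days * cost_per_call

# 1,000 users, one call per working day, GPT-4 Turbo at ~$0.23/call
total = monthly_cost(users=1000, calls_per_day=1, cost_per_call=0.23)
print(f"${total:,.0f}/month, ${total / 1000:.2f} per user")
```

Re-running with GPT-3.5 Turbo's ~$0.02 per call drops the bill to roughly $400/month, which is why model selection dominates this cost equation.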
It's important to note that these figures reflect API usage costs only. They exclude the additional overheads of engineering resources, application integration, cloud infrastructure maintenance, application security, and the tooling required for regulatory compliance and audits.
Takeaway:
Whether you're a startup developing call transcript analysis features or an enterprise creating advanced LLM-based solutions, it's evident that the existing expenses associated with transcript analysis are prohibitively steep, and action is required to mitigate the costs of Generative AI and Large Language Models.
Many may not be aware of the cost implications of using OpenAI and other Large Language Models (LLMs) when deployed at scale. That is why it is critical to first understand these cost implications and then mitigate them deliberately.
Note: The views expressed in this article are solely my own and do not reflect the opinions or positions of my employer.