Making GenAI Affordable: The Need to Slash Language Model Costs

There has been a lot of chatter recently about the cost of GenAI and the pricing models that LLM builders and enterprises should follow to build a sustainable, high-margin business around the technology. Seen at the token level, the cost of GenAI appears minuscule, but dive deeper into the numbers and the costs quickly add up.

TL;DR:

  • The current costs for GenAI, specifically in the application of 'Call Transcript Analysis and Summarization' using Large Language Models, are significant.
  • For processing a 60-minute call transcript, the costs for various models are as follows: GPT-4 Turbo costs $0.23, GPT-4 32k costs $1.21, and GPT-3.5 Turbo costs $0.02.
  • To reduce costs in LLM applications, consider using smaller, more efficient models like GPT-3.5 Turbo and automating data preprocessing to lower token counts.
  • Additionally, adopt consumption-based pricing and evaluate the potential costs of open-source models against proprietary services like OpenAI's.
  • Microsoft's development of more cost-effective models, such as Turing-NLG, indicates a market trend towards affordability.

Context:

In this article, I will discuss a prevalent use case of GenAI: Call Transcription Analysis and Summarization. This involves analyzing a call transcript to provide a summary and action items, features that are now part of the Teams Copilot and various other communication-related applications.

Drawing from the experience of building a vertical-specific app for call transcript analysis, I can attest that the cost of delivering this feature is currently high. ‘Teams Copilot’ is a standout feature within the Microsoft Copilot experience that I find particularly beneficial. Taking call notes has become a thing of the past, and the accuracy and speed of Teams Copilot are impressive.

Quick math:

For the analysis and summarization of a 60-minute call transcript using GenAI, the process requires 14,900 input tokens and generates 2,600 output tokens.

In a 60-minute call, the spoken content typically runs to around 8,000 words; with additional elements such as the speaker's name or email at the start of each sentence and full-length timestamps, the transcript grows to roughly 8,000 to 9,500 words. Per OpenAI's guidelines, where 1,500 words correspond to roughly 2,048 tokens, this translates to approximately 12,900 tokens.

The output is assumed to be roughly 20% of the transcript's input tokens (an 80:20 split), so 12,900 transcript tokens yield about 2,600 output tokens. A simple prompt adds an estimated 2,000 tokens; if the extraction is highly domain- and context-specific, the prompt may need to grow 2 to 3 times larger to accommodate a few-shot prompting method.

For a 60-minute call transcript, the total input tokens required are 14,900 (12,900 transcript tokens plus a 2,000-token prompt), and the output tokens are 2,600.
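The quick math above can be sketched as a small script. The word counts, the 1,500-words-to-2,048-tokens rule of thumb, the 2,000-token prompt, and the 20% output ratio are all the article's assumptions; exact rounding differs slightly from the article's rounded figures.

```python
# Token estimate for a 60-minute call transcript, per the article's
# rule of thumb that 1,500 words ~ 2,048 tokens.
TOKENS_PER_WORD = 2048 / 1500  # ~1.365

def estimate_tokens(word_count: int) -> int:
    """Convert a word count to an approximate token count."""
    return round(word_count * TOKENS_PER_WORD)

transcript_words = 9_500        # content + speaker names + timestamps
transcript_tokens = estimate_tokens(transcript_words)  # ~12,900
prompt_tokens = 2_000           # simple instruction prompt
input_tokens = transcript_tokens + prompt_tokens       # ~14,900
output_tokens = round(transcript_tokens * 0.20)        # ~2,600

print(input_tokens, output_tokens)
```

For billing-accurate counts you would tokenize the actual transcript rather than rely on a words-to-tokens ratio.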

LLM Inference Cost Analysis

Let's examine the costs of utilizing OpenAI's models as of January 24, 2024, as detailed on their pricing page (https://openai.com/pricing).

Call Transcript LLM Inference Cost Comparison - OpenAI Models

GPT-3.5-turbo-instruct has a context window of 4K tokens, so making this use case work would require very high engineering effort for chunking, queuing, and summarizing the intermediate outputs. I would avoid GPT-3.5-turbo-instruct for this use case and have therefore marked it in red.
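To illustrate the engineering overhead a 4K context window forces, here is a minimal map-reduce-style sketch: split the transcript into chunks that fit the window, summarize each, then summarize the summaries. The `summarize` function is a placeholder for a real LLM call, and the token budget is illustrative.

```python
def chunk_by_tokens(tokens: list[str], budget: int) -> list[list[str]]:
    """Greedy split of a token list into pieces of at most `budget` tokens."""
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]

def summarize(text: str) -> str:
    # Placeholder for an LLM call (e.g. to gpt-3.5-turbo-instruct).
    return text[:200]

def map_reduce_summary(transcript_tokens: list[str], budget: int = 3_000) -> str:
    # Reserve part of the 4K window for the prompt and the output,
    # summarize each chunk, then combine the chunk summaries.
    chunk_summaries = [
        summarize(" ".join(chunk))
        for chunk in chunk_by_tokens(transcript_tokens, budget)
    ]
    return summarize("\n".join(chunk_summaries))
```

Even this simplified version shows why a 16K+ window model is far easier to operate: no chunk boundaries to manage, no loss of cross-chunk context, and one API call instead of many.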

For processing a 60-minute call transcript, the costs for the various models are as follows:

  • GPT-4 Turbo: $0.23
  • GPT-4 32k: $1.21
  • GPT-3.5 Turbo: $0.02
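The per-transcript figures above can be reproduced from per-1K-token rates. The rates below are assumptions reflecting OpenAI's published pricing around early 2024; check the pricing page for current numbers.

```python
# Per-transcript cost from per-1K-token rates (assumed early-2024 pricing).
INPUT_TOKENS, OUTPUT_TOKENS = 14_900, 2_600

# model -> (input $/1K tokens, output $/1K tokens)
PRICES = {
    "gpt-4-turbo": (0.01, 0.03),
    "gpt-4-32k": (0.06, 0.12),
    "gpt-3.5-turbo": (0.001, 0.002),
}

def transcript_cost(model: str) -> float:
    """Dollar cost of one 60-minute transcript for the given model."""
    in_rate, out_rate = PRICES[model]
    return INPUT_TOKENS / 1000 * in_rate + OUTPUT_TOKENS / 1000 * out_rate

for model in PRICES:
    print(f"{model}: ${transcript_cost(model):.2f}")
```

Note how heavily the total skews toward input: for this use case roughly 85% of the tokens are input, so input pricing dominates the bill.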

Product Use Case

Now let us extend the LLM inference cost to 1,000 users. The number of users and the frequency of their interactions with the 'Call Transcript Analysis' feature will inherently vary, so data scientists and product strategists must simulate different scenarios to gauge the financial repercussions of deploying such a feature.

Consider the following scenario:

  • Estimated user base utilizing the call transcript feature: 1,000.
  • Average number of calls per user in a typical work month: 20 (This assumes one call per day, which is modest by today's standards of remote work.)
  • Total number of call transcripts for analysis monthly: 20 calls/user * 1,000 users = 20,000 transcripts
  • If the Data Science team opts for GPT-4 Turbo for enhanced precision and reliability, the total monthly cost would be: $0.23/transcript * 20,000 transcripts = $4,600

This equates to an expenditure of $4,600 per month to service 1,000 users who record one call each day, translating to a cost of $4.60 per user per month.
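The scenario arithmetic, spelled out so you can swap in your own user counts and per-transcript costs:

```python
# Monthly API spend for the scenario above: 1,000 users, 20 calls each,
# at the article's $0.23 per transcript for GPT-4 Turbo.
USERS = 1_000
CALLS_PER_USER_PER_MONTH = 20
COST_PER_TRANSCRIPT = 0.23  # GPT-4 Turbo

monthly_cost = USERS * CALLS_PER_USER_PER_MONTH * COST_PER_TRANSCRIPT
cost_per_user = monthly_cost / USERS

print(f"${monthly_cost:,.0f}/month, ${cost_per_user:.2f}/user/month")
```

At GPT-3.5 Turbo's $0.02 per transcript the same scenario would cost $400 per month, which is the order-of-magnitude gap the takeaways below turn on.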

It's important to note that these figures solely reflect the API usage costs. They do not include the additional overheads associated with resource cost, application integration, cloud infrastructure maintenance, app security measures, and tools required for regulatory compliance and audits.

Takeaway:

Whether you're a startup developing call transcript analysis features, or an enterprise creating advanced LLM-based solutions, it's evident that the existing expenses associated with transcript analysis are prohibitively steep. Action is required to mitigate the costs associated with Generative AI and Large Language Models through a variety of strategies:

  • Smaller models: Use smaller, less complex models that require fewer resources to run and have lower inference costs. This might lead to slightly less impressive outputs but could be sufficient for many tasks. Consider using GPT-3.5 Turbo, which has costs that are 90% lower than those of the GPT-4 Turbo model.
  • Automate Preprocessing: Develop pipelines that automatically clean and remove unnecessary data in transcripts, thus reducing the token count.
  • Pricing Models: Create consumption-based pricing strategies for end-users based on usage and value provided. There are lessons to be learned from survey providers who have restructured their pricing models to be more use-case-specific and aligned with the value delivered.
  • Open-source and Community Models: Consider using or contributing to open-source models; however, given the rapid development in the LLM space, this approach may incur greater costs compared to utilizing OpenAI’s services.
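As a concrete example of the preprocessing strategy above, here is a minimal sketch that strips per-line speaker emails and full-length timestamps before the transcript is sent to the model. The line format (email, then an HH:MM:SS timestamp) is an assumption; real transcript formats vary by vendor.

```python
import re

# Matches a leading speaker email and HH:MM:SS timestamp on each line.
LINE_PREFIX = re.compile(
    r"^\s*[\w.+-]+@[\w.-]+\s+\d{2}:\d{2}:\d{2}\s*", re.MULTILINE
)

def preprocess(transcript: str) -> str:
    """Remove per-line speaker emails and timestamps to cut token count."""
    return LINE_PREFIX.sub("", transcript)

raw = (
    "alice@corp.com 00:01:02 Let's review the roadmap.\n"
    "bob@corp.com 00:01:10 Sounds good."
)
print(preprocess(raw))
```

On the article's numbers this metadata accounts for roughly 1,500 of the 9,500 transcript words, so stripping it trims the input tokens, and the bill, by around 15% before the model ever sees the text.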

Many may not be aware of the cost implications of using OpenAI and other Large Language Models (LLMs) when deployed at scale. That is why it is critical to first understand these cost implications and then mitigate them accordingly using the strategies mentioned above.

Note: The views expressed in this article are solely my own and do not reflect the opinions or positions of my employer.


Sukrit Goel

Founder & CEO @InteligenAI

1y

I guess cost optimization is now the focus for most AI specialists. I have found that for use cases like the one you described, where there is a constant, predictable volume, an open-source model such as Mixtral deployed on an on-prem server is more cost-effective. I was able to build an in-house GPU-based server with 124GB of VRAM for approximately $2,500 using some pre-used components. Loaded with consistent volume, this system will cost much less than pay-per-use services.

Akshay Gupta

Sr. Product Manager at SeekOut

1y

Chandramouli (CM) it was helpful! Have you evaluated the costs of models other than OpenAI's? How do they compare with these? And have you compared results; is there a significant difference?

Yash Dubey

Data Scientist at openwashdata (Global Health Engineering)

1y

Agreed with your points about cutting costs. As you mentioned, smaller models trained for specific tasks are likely to be cheaper and more accurate. If you have the engineering bandwidth, or can outsource it, you can use your own open-source, fine-tuned model pipelines that preprocess data and reduce the need for large context windows on long meetings. I'd also argue that reliability and ownership of data are not insignificant concerns ("lazy" GPT-4, Azure AI downtime, etc.).
