How 4X Speedup on Generative Video Model (FILM) Created Huge Cost Savings for Wombo

Generative AI is the hottest workload on the planet, but it’s also the most compute intensive, and therefore expensive to run. This puts startups building generative AI businesses in a tricky position. Not only must they deliver killer product experiences that grab attention and market share – they need to make the economics work too. To lower compute costs, generative AI models need to run faster and more efficiently on a more diverse set of hardware.

WOMBO: an OctoML customer story

WOMBO makes popular mobile apps for content creation using generative AI. Their apps use ML models like stable diffusion to help people create fun videos and images to share online.

[Image: think face-swapping, mixed with lip syncing]

Nearly 75 million people across more than 180 countries downloaded the app, making WOMBO one of the fastest-growing consumer apps in history. Like any generative AI startup, user growth translates to higher compute costs. With a fleet of GPUs nearing capacity, any model efficiency gains were a top priority.

One production model is FILM, which predicts and generates intermediate frames between two existing frames in a video sequence. For premium WOMBO users, FILM generates a video clip showing their “transformation” into a celebrity or historical figure. The more frames you have in between the images, the better the final video, but the more costly and time-consuming it becomes to generate. Optimizing the model across different hardware helps WOMBO balance user experience (faster, higher-quality video) against cost.
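FILM’s learned interpolation is far beyond a blog snippet, but the naive baseline it improves on, a per-pixel cross-fade between two frames, is easy to sketch. The function name and toy frames below are illustrative, not part of the FILM codebase:

```python
def blend_frames(frame_a, frame_b, t=0.5):
    """Naive frame interpolation: per-pixel linear cross-fade at time t.

    This is the crude baseline FILM improves on. A learned model like
    FILM estimates motion between the two frames, so moving objects stay
    sharp in the intermediate frame instead of ghosting as they do here.
    """
    return [
        [(1 - t) * pa + t * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]

# Two tiny 2x2 grayscale "frames"
dark = [[0.0, 0.0], [0.0, 0.0]]
bright = [[100.0, 100.0], [100.0, 100.0]]
mid = blend_frames(dark, bright)  # midpoint frame: every pixel is 50.0
```

Generating more intermediate frames (t = 0.25, 0.5, 0.75, …) smooths the transition, which is exactly why higher-quality clips cost more to produce.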

OctoML ran a series of experiments to optimize FILM on two different GPUs: NVIDIA A100 and A10G. We used the OctoML platform to compare a baseline version of FILM (TensorFlow) with several other optimized configurations.

Results snapshot:

  • Cut model serving costs by 98% compared to baseline
  • 3.9x speedup on FILM model over baseline configurations
  • Reduced image-to-image interpolation (AKA transformation) time from 10.1 seconds to 2.6 seconds

[Image: 98% cost savings from baseline configuration]

Better speed makes for a nice user experience, but these efficiency gains also slashed the compute cost per 1,000 image interpolations from $11.95 to $0.24. Supposing 10,000 clips are created each day, that’s the difference between annual model serving costs of $43,617.50 and $876. For WOMBO, FILM traffic doesn’t represent the majority of overall usage, but even so, these cost savings are significant.
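The annual figures above follow from straightforward arithmetic. A quick sketch, assuming one interpolation per clip and 365 days of traffic (the function name is ours, not OctoML’s):

```python
def annual_serving_cost(cost_per_1k, clips_per_day, days=365):
    """Annual model-serving cost, assuming one interpolation per clip."""
    return cost_per_1k * (clips_per_day / 1000) * days

baseline = annual_serving_cost(11.95, 10_000)   # about $43,617.50/yr
optimized = annual_serving_cost(0.24, 10_000)   # about $876/yr
```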

With so many media, entertainment, and gaming applications, it’s easy to see how lowering model compute costs can make FILM more accessible to more creators. The more efficiently it runs, the more you can do for the same cost or less.

Here’s an example:

Let’s say a documentarian has access to 1,000 hours of archival video, and wants to use FILM to restore and enhance the missing footage. Working with the standard model configurations, running on NVIDIA A100 in AWS, this could cost upwards of $66,195.40 (assuming 24fps).

Combining OctoML model optimizations and the ability to run on the lower cost A10G in AWS, this cost comes down to $1,382.40.

Check out the OctoML blog for the full results of our work with WOMBO on FILM.

If you want to achieve better speeds and lower costs for your AI workloads, be one of the first to try the new OctoML Compute Service. We’re building an efficient compute layer that’s as easy to use as OpenAI, but flexible enough to run any model.

Sign up for early access here.
