Using Snowflake for AI completion in Manychat
Ready to make your chats smarter and your users happier? This was one of our goals at Manychat, the leading chat marketing platform for Instagram, Messenger, and WhatsApp, trusted by over 1 million businesses and powering more than 4 billion conversations. Now we're integrating AI completion features to fuel chat automation and increase user engagement.
Building an AI-Powered Data Platform at Manychat
Continuous innovation
To implement these features, we are actively using LLMs with chat completion capabilities. We started with the OpenAI API and Azure OpenAI Service using GPT-3.5, GPT-4, and GPT-4o. However, we have several reasons to switch to hosted open-source LLMs like Mixtral and Meta Llama 3, including the need for more flexibility with models and independence from a single vendor.
The main reasons we started thinking about hosting LLMs on our own:
Based on the three reasons above, we decided to try self-hosted LLMs for Manychat's Intention Recognition feature mentioned above.
But how can we host LLMs in production?
Hosting Open-Source LLMs with Snowflake
As our main datastore, we use Snowflake, a scalable, cloud-based data platform. Snowflake offers two ways to host and run LLMs next to our data: Snowpark Container Services and Cortex AI.
Evaluating Performance: Snowpark Container Services vs. Cortex AI
It was important for us to understand if we could use these services for our AI tasks. The primary concern was not model accuracy (modern open-source LLMs are comparable in accuracy with OpenAI models) but performance: would Snowpark Container Services and Cortex be fast enough for us?
Performance Results with Snowpark Container Services
We developed a simple model serving app using Flask + Hugging Face and deployed it to Snowpark Container Services. We created a user-defined function that scores the model with the same SQL interface as using a model from Cortex.
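For illustration, here is a minimal sketch of such a serving app, assuming the batch format Snowflake service functions use ({"data": [[row_index, value], ...]}); the endpoint, port, and model ID are placeholders, not our production code.

# Minimal sketch of a Flask + Hugging Face serving app for Snowpark Container
# Services. Endpoint name, port, and model ID are illustrative placeholders.
import torch
from flask import Flask, jsonify, request
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

app = Flask(__name__)

# Load the model once at startup; an 8B model in fp16 fits on a single A10G (24 GB).
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

@app.route("/complete", methods=["POST"])
def complete():
    # Snowflake sends batches as {"data": [[row_index, value], ...]}
    rows = request.get_json()["data"]
    results = []
    for idx, prompt in rows:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        output = model.generate(**inputs, max_new_tokens=20)
        completion = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        results.append([idx, completion])
    return jsonify({"data": results})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

On the SQL side, a service function mapped to this endpoint lets the backend and analysts score the model with a plain SELECT, exactly as they would with a Cortex model.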
We used the smallest GPU compute pool, GPU | S (GPU: NVIDIA A10G, 24 GB; RAM: 27 GB). Its size is a good fit for smaller LLMs such as Mistral 7B and Meta Llama 3 8B, and we decided to use the latter.
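Creating a pool of that size is a short setup step. The sketch below uses placeholder connection details and pool name, and assumes GPU_NV_S is the instance family behind GPU | S.

# Hypothetical setup script; replace the connection parameters with your own.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",              # placeholder
    user="my_user",                    # placeholder
    authenticator="externalbrowser",   # or whatever auth method you use
)

# GPU_NV_S is, to our knowledge, the instance family that backs the GPU | S pool.
conn.cursor().execute(
    """
    CREATE COMPUTE POOL IF NOT EXISTS llm_gpu_pool
      MIN_NODES = 1
      MAX_NODES = 1
      INSTANCE_FAMILY = GPU_NV_S
    """
)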
In our experiment, we queried the model every few seconds with typical text input and 20 tokens to complete. The goal was to estimate the average scoring time and check the reliability of the service. On Snowpark Container Services, the average scoring time was 3.1 s, with considerable variance.
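A rough sketch of the measurement harness is shown below; the service-function name llama3_complete, the prompt, and the connection details are placeholders. The Cortex experiment in the next section reused the same loop with the query swapped for SNOWFLAKE.CORTEX.COMPLETE.

# Rough sketch of the latency check (placeholder names and credentials).
import statistics
import time

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # placeholder
    user="my_user",         # placeholder
    authenticator="externalbrowser",
    warehouse="my_wh",
)

PROMPT = "Classify the intent of this message: 'How much does the Pro plan cost?'"
QUERY = "SELECT llama3_complete(%s)"   # placeholder name of our service function

cursor = conn.cursor()
latencies = []
for _ in range(50):
    start = time.perf_counter()
    cursor.execute(QUERY, (PROMPT,))
    cursor.fetchone()
    latencies.append(time.perf_counter() - start)
    time.sleep(5)   # query every few seconds, as in the experiment

print(f"avg={statistics.mean(latencies):.2f}s  stdev={statistics.stdev(latencies):.2f}s")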
Performance Results with Cortex AI
We carried out a similar experiment using Cortex AI. The best thing about Cortex is that, in the simplest case, you can use it with a single query:
SELECT SNOWFLAKE.CORTEX.COMPLETE(
'llama3-8b', 'What are large language models?'
);
And that's it; no additional work is needed. We ran the same experiment as with Snowpark Container Services, and the result was great: an average scoring time of 0.2 s.
Conclusion: Optimizing AI Completion with Snowflake
We confirmed that both methods of hosting open-source LLMs in Snowflake work well, providing acceptable response times and reliability. However, Cortex is faster and more optimized out of the box, while Snowpark Container Services offers greater flexibility and more room for model-serving optimization.
How do we plan to use Cortex in production?
Here is the basic illustration (below). We'll hide all the complexity of the hosted LLM from the main backend. Instead, the backend will simply call an API endpoint of a Python-based microservice, which will then call a model from Cortex.
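A minimal sketch of that microservice is shown below; the route, connection details, and default model name are placeholders rather than our actual service.

# Minimal sketch of the Cortex-facing microservice (placeholder names and credentials).
import snowflake.connector
from flask import Flask, jsonify, request

app = Flask(__name__)

conn = snowflake.connector.connect(
    account="my_account",   # placeholder
    user="my_user",         # placeholder
    authenticator="externalbrowser",
    warehouse="my_wh",
)

@app.route("/v1/complete", methods=["POST"])
def complete():
    body = request.get_json()
    prompt = body["prompt"]
    model = body.get("model", "llama3-8b")
    # Cortex hides all hosting details behind a single SQL function call.
    cursor = conn.cursor()
    cursor.execute("SELECT SNOWFLAKE.CORTEX.COMPLETE(%s, %s)", (model, prompt))
    (completion,) = cursor.fetchone()
    return jsonify({"completion": completion})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

With this layer in place, the main backend never needs to know where the completion came from; swapping Cortex for a model in Snowpark Container Services would only touch this service.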
As we continue to push the boundaries of what's possible with AI in the Manychat application, we will explore and document the capabilities and performance of LLMs in various scenarios. We aim to investigate more precise characteristics of these LLMs, focusing on accuracy and other metrics.
We also plan to share our results with multilingual embeddings in Snowpark Container Services for hundreds of millions of texts, as well as our progress in fine-tuning and in practically improving business metrics with AI and LLMs.