Low-cost & low-complexity LLM Deployment with Monster Deploy


Thinking of deploying a popular Large Language Model (LLM), or a custom fine-tuned one, in production with low cost and low complexity?

MonsterAPI is the best LLM deployment solution I've come across recently. It lets me host pre-trained and fine-tuned LLMs in one click on its GPU cloud, with great scalability and a range of GPU options from 16GB to 80GB of VRAM.

I've used it for a wide range of use cases: quick Q&A, short commands, data summarization, and more sophisticated queries.

You can quickly get an API endpoint that serves text-generation requests using models like Llama 2 7B, CodeLlama 34B, Falcon 40B, or any of your own custom/fine-tuned models.

Built on the vLLM project as its serving foundation, Monster Deploy is optimized for high throughput.

As per their official blog, they recently delivered up to 10 million tokens of peak throughput for a mere $1.25 (about $0.125 per million tokens) while serving the Zephyr 7B model: 39K requests per hour with an average request latency of 16 ms, on a 24GB GPU.

And this was using Monster Deploy on GPUs such as NVIDIA RTX A5000 (24GB) and A100 (80GB).


I've worked a ton with MonsterAPI, and its deployment platform gives you the following:

→ Seamless one-click deployments through an intuitive UI

→ Programmatic access via the Python client or a single curl request (see the sketches later in this post)

→ Deployment of LLMs as REST API endpoints, and of any custom Docker image as a hosted container

→ A range of GPU and RAM configurations with up to 160GB of VRAM

→ Detailed API documentation with ready-to-use Colab notebooks

→ Website: https://monsterapi.ai


A recent benchmark of Monster Deploy serving the Zephyr 7B model on an 80GB NVIDIA A100 demonstrated its performance:

→ Number of users (peak concurrency): 200

→ Spawn rate (users started per second): 1

→ Run time: 15 minutes

→ Input token length: 256 tokens (max)

→ Output token length: 1,500 tokens (max)

→ Cost: $0.65



To access the Monster Deploy beta:

→ Sign up on MonsterAPI: https://monsterapi.ai/signup

→ Apply for the Monster Deploy beta: https://forms.gle/2vdzBca3B9qWqXXZ6

→ Deploy LLMs with these examples: https://developer.monsterapi.ai/docs/projects#demo-notebooks-for-using-monster-deploy



The code snippet below shows how you can use the MonsterAPI Python SDK to quickly deploy the Mixtral 8x7B Chat model on Monster Deploy. The deployment will serve the model as a REST API with support for both static and streaming token responses.

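Here's a minimal sketch of that deployment call. The method name deploy_llm, its parameters, and the response fields are my assumptions based on the docs linked above, not verbatim SDK usage, so check the API reference for the exact signatures:

    # Minimal deployment sketch. `deploy_llm`, its parameters, and the
    # response fields are assumptions -- consult the Monster Deploy docs
    # for the exact names.
    import os
    from monsterapi import client as mclient

    deploy_client = mclient(api_key=os.environ["MONSTER_API_KEY"])

    # Request a Mixtral 8x7B Instruct deployment on a single 80GB GPU.
    launch = deploy_client.deploy_llm(
        deployment_name="mixtral-8x7b-chat",
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        per_gpu_vram=80,
        gpu_count=1,
    )

    deployment_id = launch["deployment_id"]  # assumed response field
    print("Deployment ID:", deployment_id)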

Next, track the deployment's progress. Keep in mind that it takes a few minutes to spin up the instance; the status will transition from 'building' to 'live' as the build progresses, and you can read the logs while it's in the 'building' state to follow along.


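A polling loop along these lines does the job, continuing from the snippet above. Again, get_deployment_status and the response field names are assumptions, not confirmed SDK surface:

    import time

    # Poll until the instance is live. `get_deployment_status` and the
    # fields 'status', 'URL', and 'api_auth_token' are assumed names.
    while True:
        info = deploy_client.get_deployment_status(deployment_id)
        print("status:", info["status"])  # 'building' -> 'live'
        if info["status"] == "live":
            break
        time.sleep(30)

    endpoint_url = info["URL"]
    auth_token = info["api_auth_token"]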

Then, once the deployment is live, let's query our deployed LLM endpoint.

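The sketch below assumes the deployment exposes a vLLM-style /generate route; the exact path and payload schema may differ, so treat both as assumptions and check the docs page for your deployment:

    import requests

    # Query the live endpoint. The '/generate' path and payload shape
    # mirror vLLM's reference API server and are assumptions here.
    resp = requests.post(
        f"{endpoint_url}/generate",
        headers={"Authorization": f"Bearer {auth_token}"},
        json={
            "prompt": "[INST] Summarize this article in two lines. [/INST]",
            "max_tokens": 256,
            "temperature": 0.7,
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json())

The prompt is wrapped in [INST] ... [/INST] tags because that's the chat format Mixtral's Instruct variant was trained on.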

Once your work is done, you can terminate your LLM deployment and stop the billing on your account.

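Something like the following, with terminate_deployment again being an assumed method name:

    # Terminating the deployment stops the GPU billing.
    # `terminate_deployment` is an assumed method name.
    result = deploy_client.terminate_deployment(deployment_id)
    print(result)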



The report below showcases a benchmark of serving Zephyr 7B with Monster Deploy on GPUs such as the NVIDIA RTX A5000 (24GB) and A100 (80GB), across multiple scenarios.




That's a wrap. All the important links are below.

After you've signed up on MonsterAPI, apply for Deploy beta access here: https://developer.monsterapi.ai/docs/monster-deploy-beta#beta-phase--feedback

You'll also get free trial credits.

Monster Deploy API docs: https://developer.monsterapi.ai/docs/monster-deploy-beta

MonsterAPI Discord: https://discord.com/invite/mVXfag4kZN
