Chutes: did you try it?

Hi there, I found something and want to ask if you've tried it.

It's called Chutes and can be found here: https://chutes.ai/app/create

They describe it like this:

What is a VLLM chute?
VLLM is a high-performance library for LLM inference, offering significant speed improvements over traditional deployment methods. A VLLM chute on Chutes is a pre-configured deployment template that handles all the complexity of setting up VLLM correctly, including:
- Proper CUDA configuration
- Flash attention installation
- Optimized serving configuration
- Standardized API endpoints

My impression is that this is built on private (maybe not?) Docker images (i.e. a private Docker Hub repo) with templates for deploying LLMs, so you don't have to struggle with installing dependencies and picking the right GPU configuration. Right?

If so, have any of you used it? If the templates were open, this could be useful to reuse and save time when you run a new LLM or other model for the first time. But as far as I can see, templates look like this:

```python
chute = build_vllm_chute(
    username="Your username here",
    readme="## My Custom Model\nDescription of your model here",
    model_name="huggingface-model-owner/model-name",  # e.g. "unsloth/Llama-3.2-1B-Instruct"
    concurrency=4,
    node_selector=NodeSelector(
        gpu_count=1,
        min_vram_gb_per_gpu=16,
        # Optional: specify GPU types
        # include=["a100", "a6000"]
    ),
)
```

So you use their API, and the actual Dockerfile is hidden from you?

There are also decorators like @chute.cord(...) and @chute.on_startup(); you can check them out here:

https://chutes.ai/app/docs?file=custom_chute.md
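Conceptually, decorators like these register functions with a deployment object: one set becomes callable endpoints, the other runs once at container startup. Here is a self-contained toy sketch of that pattern. This is not the chutes SDK, just an illustration of the idea; all class and method internals below are assumptions.

```python
# Toy illustration of the decorator-registration pattern (NOT the real
# chutes SDK): a "chute" object collects endpoint handlers and startup hooks.

class Chute:
    def __init__(self, name):
        self.name = name
        self.endpoints = {}      # path -> handler function
        self.startup_hooks = []  # functions run once before serving

    def cord(self, path):
        """Register a function as a callable endpoint (like @chute.cord)."""
        def decorator(fn):
            self.endpoints[path] = fn
            return fn
        return decorator

    def on_startup(self):
        """Register a function to run at container startup."""
        def decorator(fn):
            self.startup_hooks.append(fn)
            return fn
        return decorator

chute = Chute("demo")

@chute.on_startup()
def load_model():
    print("loading model weights...")

@chute.cord("/generate")
def generate(prompt: str) -> str:
    return f"echo: {prompt}"

# Simulate what the platform would do when the container starts:
for hook in chute.startup_hooks:
    hook()
print(chute.endpoints["/generate"]("hello"))  # echo: hello
```

The appeal of this pattern is that your inference code stays plain Python functions; the platform decides how to expose and schedule them.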

But personally, I would never use a private template without understanding what is inside, because in the end I would need to write it myself to optimize it.

Also, if it is private, then the latest models would appear there with a delay, because the full community doesn't work on it.

Maybe you know of an open alternative?

As I understand it, they also provide optimizations under the hood: you deploy their templates on their GPUs, and they can switch you between cheaper and more expensive GPUs on the fly, depending on the current load (batch size, etc.), to reduce your costs.
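The core of such load-based switching can be sketched in a few lines: pick the cheapest GPU tier that still handles the current batch size. The GPU names, prices, and batch limits below are made-up illustration values, and this is my guess at the logic, not Chutes' actual scheduler.

```python
# Toy sketch of load-based GPU selection. All numbers are invented
# for illustration; this is NOT the real autoscaling logic.

GPUS = [
    {"name": "a6000", "usd_per_hr": 0.79, "max_batch": 8},
    {"name": "a100",  "usd_per_hr": 1.89, "max_batch": 32},
]

def pick_gpu(current_batch_size: int) -> str:
    """Pick the cheapest GPU that can still handle the current load."""
    candidates = [g for g in GPUS if g["max_batch"] >= current_batch_size]
    if not candidates:
        # Load exceeds every tier: fall back to the largest GPU.
        return max(GPUS, key=lambda g: g["max_batch"])["name"]
    return min(candidates, key=lambda g: g["usd_per_hr"])["name"]

print(pick_gpu(4))   # cheap GPU is enough at low load
print(pick_gpu(16))  # switch up when the batch grows
```

A real scheduler would also account for migration cost and latency targets, but the cost-vs-capacity trade-off is the same.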

But if you are just running an experiment, you already know its limits: you can deploy on vast.ai and write your own image with the dependencies. Yes, this would take slightly longer, but you will know all the details and will be able to customize it effectively at a low level instead of just using an external API, and at that scale the costs don't matter much. In the opposite case, when you have high load, it means you have already tried it (or something similar), so you can invest the time to create your own image and keep everything under your own control.

What do you think?

FYI, all of the code, not only the chutes but the entire platform, is open source!

- https://github.com/rayonlabs/chutes
- https://github.com/rayonlabs/chutes-api
- https://github.com/rayonlabs/chutes-miner
- https://github.com/rayonlabs/chutes-audit

In this case, you can see exactly how the vllm templates work here: https://github.com/rayonlabs/chutes/blob/main/chutes/chute/template/vllm.py

You can also just open any of the LLM chutes on the site (https://chutes.ai/app) and click the "Source" tab to see the exact code used to create it. The Image code is all just a Dockerfile wrapper, basically. You can print(image) on any image and see the exact Dockerfile contents. The default vllm image is typically:

```python
image = (
    Image(
        username="chutes",
        name="vllm",
        tag="0.7.3.p0",
        readme="## vLLM - fast, flexible llm inference"
    )
    .from_base("parachutes/base-python:3.12.7")
    .run_command("pip install --no-cache 'vllm==0.7.3' wheel packaging git+https://github.com/huggingface/transformers.git@a18b7fdd9e79e8dd0379f6afe7883e8220d24c4d qwen-vl-utils[decord]==0.0.8")
    .run_command("pip install --no-cache flash-attn")
)
```

You absolutely can do the same on vast, runpod, etc.; we just also handle the autoscaling and such.
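The "Dockerfile wrapper" idea above is easy to picture with a self-contained toy version: a small builder that accumulates Dockerfile lines, so printing the object shows the exact file. This mimics the chaining pattern only; it is not the real chutes Image class, and its methods are simplified assumptions.

```python
# Toy builder that accumulates Dockerfile lines, imitating the chained
# Image(...).from_base(...).run_command(...) style. NOT the real class.

class Image:
    def __init__(self, name, tag):
        self.name, self.tag = name, tag
        self.lines = []

    def from_base(self, base):
        self.lines.append(f"FROM {base}")
        return self  # return self so calls chain

    def run_command(self, cmd):
        self.lines.append(f"RUN {cmd}")
        return self

    def __str__(self):
        # print(image) reveals the exact Dockerfile contents
        return "\n".join(self.lines)

image = (
    Image(name="vllm", tag="0.7.3.p0")
    .from_base("parachutes/base-python:3.12.7")
    .run_command("pip install --no-cache 'vllm==0.7.3'")
)
print(image)
# FROM parachutes/base-python:3.12.7
# RUN pip install --no-cache 'vllm==0.7.3'
```

The design choice is transparency: because the builder only ever produces a plain Dockerfile, nothing about the final image is hidden behind the API.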

Angelo Lamonaca

Co-Founder @ Neuramare | Merging AI & Art to Redefine Creative Expression

1 month ago

Thanks for the article! How did you find this platform? At RunGen.AI we are working on a very similar platform. We aim to become the "Vercel" of AI model deployment: easy and fast deployment with optimized inference. It's hard, partly because of the huge number of possible configurations, but we have decided to take up this challenge.
