GPU renting for Kaggle and work. My experience

Recently I wrote an article about GCE advantages and drawbacks.

This article is about Vast.ai, a service I use for similar purposes: renting GPUs to train ML models for Kaggle and for work.

What is Vast.ai?

Vast is a cost-effective service for peer-to-peer GPU renting.

A brief overview of Vast's functionality and features:

  • Vast has out-of-the-box templates for PyTorch, TF, CUDA, Kaggle GPU images, etc.: https://cloud.vast.ai/templates/
  • You can stop your instance when you have finished training models and later copy data from it to another rented instance.
  • Friendly and quite quick Discord support.

Once I had an issue with a GPU and was able to reach the GPU's owner in the Discord chat, and he helped me fix it (I had about 1 TB of data, and migrating it to another GPU would have been quite time-consuming).

  • SSH access to the instance and its storage from your local machine, so you can download/upload data and run bash scripts.

Worth mentioning: the SSH session is closed after you stop an instance, and you can no longer run bash scripts on it, but downloading data from the instance over SSH remains available even after it is stopped.

  • You can stop an instance to pause it and restart it later if nobody has rented it in the meantime; if it has already been rented, you are put in a queue to rent it again. In my experience, re-renting the same instance after pausing it works very seldom (I succeeded only once), so if you pause it, there is a 99% chance you will need to start another instance and migrate your data to it.
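The SSH workflow above can be sketched as follows. The host and port are hypothetical placeholders; the Vast console shows the real "ssh -p <port> root@<host>" line for each instance:

```python
# Sketch of pulling data off a Vast instance over SSH/scp.
# HOST and PORT are placeholders -- copy the real values from the
# Vast console's SSH connection string for your instance.
HOST = "ssh5.vast.ai"    # hypothetical host
PORT = 12345             # hypothetical mapped SSH port

def scp_download_cmd(host: str, port: int, remote_path: str, local_path: str) -> list[str]:
    """Build the scp command that copies a remote directory locally."""
    return ["scp", "-r", "-P", str(port), f"root@{host}:{remote_path}", local_path]

cmd = scp_download_cmd(HOST, PORT, "/workspace/models", "./models")
print(" ".join(cmd))
```

Because downloads over SSH keep working after the instance is stopped, running a command like this is how you rescue data from a paused instance you can no longer re-rent.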

Vast advantages and drawbacks

Vast advantages:

  • Very easy to use
  • Instance rental and deployment in two clicks, in a few seconds
  • Very cost effective

$0.25 per hour for a GPU with 24 GB of memory; storage costs nearly nothing, and there are no hidden costs. If you use it 24/7, this is just $180 per month. Similar resources on GCE cost more than $2K per month, including all the unclear hidden costs for networking, disk operations, etc. (I tried it).

  • Reliable and friendly
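The monthly figure quoted above is simple arithmetic, assuming a 30-day month of round-the-clock use:

```python
# Back-of-the-envelope check of the cost above:
# $0.25/hour, running 24/7 for a 30-day month.
hourly_rate_usd = 0.25
hours_per_month = 24 * 30   # 720 hours

monthly_cost = hourly_rate_usd * hours_per_month
print(monthly_cost)  # 180.0
```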

Vast drawbacks:

  • Doesn’t have TPUs (obviously).
  • Has templates (PyTorch, TF, etc.), but each time you must manually install the packages you need (pandas, scikit-learn, etc.). I fixed this quite easily: I created a Jupyter notebook with an exhaustive list of !pip install cells for the packages I need, and I simply run all the cells on a new GPU immediately after deploying it.
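The bootstrap notebook described above can also be collapsed into a single cell. The package list below is only an example; adjust it to your own stack:

```python
# One-cell bootstrap for a freshly deployed instance: install
# everything the template image lacks in a single pip call.
import subprocess
import sys

PACKAGES = ["pandas", "scikit-learn", "matplotlib", "tqdm"]  # example list

def pip_install_cmd(packages: list[str]) -> list[str]:
    """Build the pip install command for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", *packages]

cmd = pip_install_cmd(PACKAGES)
# subprocess.check_call(cmd)  # uncomment to actually run the install
print(" ".join(cmd))
```

Using `sys.executable -m pip` (rather than a bare `pip`) makes sure the packages land in the same environment the notebook kernel is running in.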

A couple of words about Colab + Gdrive for large data

In addition to Vast and GCE, I tried Colab for working with large amounts of data (the RSNA competition, with nearly 1 TB of data).

Colab's VM storage for a GPU is only 100 GB (about $20/day), and a few hundred GB for a TPU (about $30/day). So to work with 1 TB of data in Colab, I also rented 2 TB of Gdrive storage (about $10 per month). But even with all of this rented, I couldn't make it work. Possibly it is feasible, but you are still limited by the hundred or few hundred GB of GPU/TPU VM storage.

Additional drawbacks of Colab and Gdrive you should consider when using them together:

  • Gdrive reads/writes are slow.
  • I had a yearly subscription for 200 GB and upgraded it to 2 TB for a month. As mentioned above, I wasn't able to use it due to the slow speed and Colab's GPU/TPU VM constraints. After the 2 TB monthly Gdrive subscription ended, the yearly 200 GB subscription was lost (it didn't roll back) and I had to pay for it again.

I hope sharing this experience is helpful for those looking for a suitable service to rent GPUs for training ML models on large amounts of data. Good luck!
