GPU renting for Kaggle and work. My experience

Recently I wrote an article about GCE advantages and drawbacks.

This article is about Vast.ai, a service I use for similar purposes: renting GPUs to train ML models for Kaggle and for work.

What is Vast.ai?

Vast is a cost-effective service for peer-to-peer GPU renting.

A brief overview of Vast's functionality and features:

  • Vast has out-of-the-box templates for PyTorch, TF, CUDA, Kaggle GPU images, etc.: https://cloud.vast.ai/templates/
  • You can stop your instance when you have finished training models and later copy data from it to another rented instance.
  • Friendly and quite quick Discord support.

Once I had an issue with a GPU and was able to reach the GPU's owner in the Discord chat, and he helped me fix it (I had about 1 TB of data, and migrating it to another GPU would have been quite time-consuming).

  • SSH access to the instance and its storage from your local machine, so you can download/upload data and run bash scripts.

Worth mentioning: the SSH session is closed after you stop an instance, and you can no longer run bash scripts on it, but downloading data from the instance over SSH remains available even after it is stopped.

  • You can stop an instance to pause it and restart it later if nobody has rented it in the meantime; if it has already been rented, you are put in a queue to rent it again. In my experience, re-renting the same instance after pausing it works very seldom (I succeeded only once), so if you pause it, there is a 99% chance you will need to start another instance and migrate your data to it.
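The SSH workflow above can be sketched as follows. The host and port are hypothetical placeholders; the Vast console shows the real "ssh -p <port> root@<host>" line for each instance:

```python
# Sketch of pulling data off a Vast instance over SSH/scp.
# HOST and PORT are placeholders -- copy the real values from the
# Vast console's SSH connection string for your instance.
HOST = "ssh5.vast.ai"    # hypothetical host
PORT = 12345             # hypothetical mapped SSH port

def scp_download_cmd(host: str, port: int, remote_path: str, local_path: str) -> list[str]:
    """Build the scp command that copies a remote directory locally."""
    return ["scp", "-r", "-P", str(port), f"root@{host}:{remote_path}", local_path]

cmd = scp_download_cmd(HOST, PORT, "/workspace/models", "./models")
print(" ".join(cmd))
```

Because downloads over SSH keep working after the instance is stopped, running a command like this is how you rescue data from a paused instance you can no longer re-rent.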

Vast advantages and drawbacks

Vast advantages:

  • Very easy to use
  • Instance rental and deployment in two clicks, in a few seconds
  • Very cost effective

$0.25 per hour for a GPU with 24 GB of memory; storage costs nearly nothing, and there are no hidden costs. If you use it 24/7, this is just $180 per month. Similar resources on GCE cost more than $2K per month, including all the unclear hidden costs for networking, disk operations, etc. (I tried it).

  • Reliable and friendly
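The monthly figure quoted above is simple arithmetic, assuming a 30-day month of round-the-clock use:

```python
# Back-of-the-envelope check of the cost above:
# $0.25/hour, running 24/7 for a 30-day month.
hourly_rate_usd = 0.25
hours_per_month = 24 * 30   # 720 hours

monthly_cost = hourly_rate_usd * hours_per_month
print(monthly_cost)  # 180.0
```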

Vast drawbacks:

  • Doesn’t have TPUs (obviously).
  • Has templates (PyTorch, TF, etc.), but each time you must manually install the packages you need (pandas, scikit-learn, etc.). I fixed this quite easily: I created a Jupyter notebook with an exhaustive list of !pip install cells for the packages I need, and I simply run all the cells on a new GPU immediately after deploying it.
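The bootstrap notebook described above can also be collapsed into a single cell. The package list below is only an example; adjust it to your own stack:

```python
# One-cell bootstrap for a freshly deployed instance: install
# everything the template image lacks in a single pip call.
import subprocess
import sys

PACKAGES = ["pandas", "scikit-learn", "matplotlib", "tqdm"]  # example list

def pip_install_cmd(packages: list[str]) -> list[str]:
    """Build the pip install command for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", *packages]

cmd = pip_install_cmd(PACKAGES)
# subprocess.check_call(cmd)  # uncomment to actually run the install
print(" ".join(cmd))
```

Using `sys.executable -m pip` (rather than a bare `pip`) makes sure the packages land in the same environment the notebook kernel is running in.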

A couple of words about Colab + Gdrive for large data

In addition to Vast and GCE, I tried Colab for working with large amounts of data (the RSNA competition, with nearly 1 TB of data).

Colab's VM storage for a GPU is only 100 GB (about $20/day), and a few hundred GB for a TPU (about $30/day). So to work with 1 TB of data in Colab, I also rented 2 TB of Gdrive storage (about $10 per month). But even with all of this rented, I couldn't make it work. Possibly it is feasible, but you are still limited by the hundred or few hundred GB of GPU/TPU VM storage.

Additional drawbacks of Colab and Gdrive you should consider when using them together:

  • Gdrive reads/writes are slow.
  • I had a yearly subscription for 200 GB and upgraded it to 2 TB for a month. As mentioned above, I wasn't able to use it due to the slow speed and Colab's GPU/TPU VM constraints. After the 2 TB monthly Gdrive subscription ended, the yearly 200 GB subscription was lost (it didn't roll back) and I had to pay for it again.

I hope sharing this experience is helpful for those looking for a suitable service to rent GPUs for training ML models on large amounts of data. Good luck!
