Trainy

Software Development

San Francisco, California · 535 followers

Building Konduktor, a platform for GPU cluster management and AI workload scheduling.

About us

We've built a high-performance platform for training AI. We take the guesswork out of speeding up AI model training so you can build your models faster. Check out our dashboard for understanding multi-GPU profiles, plus tools to train and serve LLMs, all of which you can try right now at https://github.com/Trainy-ai/

Website
https://trainy.ai/
Industry
Software Development
Company size
2-10 employees
Headquarters
San Francisco, California
Type
Public Company
Founded
2023

Posts

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    Small GPU cloud providers are in a race to the bottom. With so much focus on cutting price, many CSPs fail to deliver high-quality machines. At Trainy, we've seen countless horror stories from customers getting burned by cloud providers and losing six-figure sums of money.

    Choosing a GPU cloud is not a decision to take lightly, and it can be worth spending a bit more per hour to ensure uptime and performance. We can't stress enough the importance of benchmarking your hardware yourself, since many CSPs are cutting corners. I've linked a blog post in the comments where we show you a couple of simple tests to make sure your GPUs and network fabric are up to par (a sketch of the idea follows below). #AI #artificialintelligence #GPU
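    The real tests live in the linked blog post; as a flavor of what benchmarking it yourself looks like, here is a minimal all-reduce bandwidth check, assuming PyTorch with the NCCL backend on a multi-GPU node. The buffer size, iteration counts, and script name are illustrative choices, not Trainy's actual benchmark.

      # Launch with: torchrun --nproc_per_node=8 allreduce_check.py
      import os
      import time

      import torch
      import torch.distributed as dist

      dist.init_process_group("nccl")
      torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
      world = dist.get_world_size()

      # 1 GiB of fp32 per rank, large enough to saturate the fabric.
      x = torch.randn(256 * 1024 * 1024, device="cuda")

      # Warm up so lazy NCCL initialization doesn't skew the timing.
      for _ in range(5):
          dist.all_reduce(x)
      torch.cuda.synchronize()

      iters = 20
      start = time.time()
      for _ in range(iters):
          dist.all_reduce(x)
      torch.cuda.synchronize()
      elapsed = (time.time() - start) / iters

      # A ring all-reduce moves 2 * (world - 1) / world of the buffer per rank.
      bytes_moved = x.numel() * 4 * 2 * (world - 1) / world
      if dist.get_rank() == 0:
          print(f"all-reduce bus bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")
      dist.destroy_process_group()

    A healthy high-bandwidth fabric should land near the interconnect's rated speed; a number far below that is the kind of cut corner the post warns about.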

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    Going into 2025, does your AI team still use outdated tools like Slurm for job submission? Or do you pay huge premiums for platforms like AWS SageMaker?

    Trainy's Konduktor platform lets you schedule AI workloads by priority, improve GPU reliability, and control GPU allocation. Our platform is flexible: it requires zero code changes to submit jobs and runs on ANY cloud provider. You get all the benefits of a platform like SageMaker at a fraction of the cost. Shoot me a message to find out how you can reduce your GPU compute spend! Or click the link in the comments to get started with self-hosting.

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    Top-tier AI research teams (Meta, OpenAI, etc.) have figured out the most efficient way to work with a cluster of GPUs. Instead of managing each GPU separately, they create pools of GPU nodes and let sophisticated schedulers manage GPU availability efficiently. This leads to significantly higher (>80%) GPU usage. Add in some fault tolerance to the infrastructure, and we see:
    - No more manual restarts at 2am.
    - ML engineers get to focus on their jobs, rather than becoming DevOps experts.
    Trainy's Konduktor platform brings the benefits of a leading research team to your GPU cluster. We provide a fault-tolerant scheduler, integrated observability, and more. Interested? Drop me a message or check out our docs below!

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    ML engineers shouldn't have to waste time debugging infrastructure, especially when H100s have a fault rate of 25-30%. We believe ML infrastructure should be able to handle bumps and bruises to the underlying hardware.

    Within Trainy's Konduktor platform, we've built a controller that monitors node health for GPUs. It isolates unhealthy nodes and prevents new work from being scheduled onto them. This way, if a job fails from a GPU or network card fault, zero manual intervention is required: K8s does its magic of placing work only on healthy nodes, and we forward the relevant logs to your CSP (see the sketch below). Does your team struggle with GPU failures? Drop me a message, or read the blog post I've linked in the comments. #AI #artificialintelligence #GPU #K8s
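    As a rough illustration of the isolation step (a sketch of the general K8s mechanism, not Konduktor's actual controller), cordoning a node with the official Kubernetes Python client looks like this; the node name and the health trigger are hypothetical:

      # Cordon a node so Kubernetes stops scheduling new pods onto it.
      # Setting spec.unschedulable is exactly what `kubectl cordon` does.
      from kubernetes import client, config

      def cordon(node_name: str) -> None:
          config.load_kube_config()  # use load_incluster_config() inside a pod
          v1 = client.CoreV1Api()
          v1.patch_node(node_name, {"spec": {"unschedulable": True}})
          print(f"cordoned {node_name}: no new work will land here")

      # A real controller would call this on a health signal, e.g. an NVIDIA
      # Xid error showing up in the node's kernel log ("NVRM: Xid ...").
      cordon("gpu-node-7")  # hypothetical node name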

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    Setting up and validating GPU networking is far less trivial than you'd think. GPU fabric technology varies widely across cloud providers, even when using H100 GPUs: Google Cloud has TCP-X, AWS has EFA, and once you've committed to figuring out a solution, it locks you in. We've seen countless AI research teams waste >$10,000 when the time comes to scale out because of incorrectly configured GPU fabrics.

    One of the biggest value-adds of Trainy's Konduktor platform is that we abstract away the network configuration for you as well. This means you can launch multinode training with high-bandwidth networking in the exact same way, even across different clouds. Is your team struggling to set up multinode training on a cloud provider? We'll cut your setup time from weeks to minutes. Shoot me a message or check out our docs in the comments to self-host! #AI #artificialintelligence #GPU #K8s

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    Really enjoyed watching the latest Machine Learning Street Talk (MLST) episode, with François Chollet discussing the inherent limitations of LLMs. It was a breath of fresh air after all the usual Doomer/Accelerationist talk on AGI. He makes some great points:
    1. The core limitations of Transformer-based architectures have not changed in over 5 years.
    - Inability to adapt to small deviations from memorized patterns
    - Weak, patchy generalization
    2. For any LLM, for any query that seems to work, there exists an equivalent rephrasing of the query that will break it.
    - This ties into LLMs' inability to handle deviations from a pattern
    - Highlights the modern LLM's lack of robustness
    3. Skill does not show intelligence. And displaying skill at any number of tasks does not show intelligence.
    - This misguided view of intelligence is what causes our current form of benchmarking to be inadequate.
    He lays out the ARC-AGI benchmark, how it tests generalization abilities rather than memorization, and his thoughts on what kind of AI system will be necessary to improve on the SoTA. Watch here: https://lnkd.in/gw-BcPQE #AI #artificialintelligence

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    How can AI teams slash their GPU compute spend? Many AI teams invest heavily in GPU clusters without fully understanding their *actual compute needs*. Teams are blind to inefficiencies, while their cloud provider laughs all the way to the bank. Trainy's Konduktor platform is here to change that. With its advanced cluster management and AI workload scheduling, Konduktor delivers three key benefits (a toy sketch of the first follows below):
    1. Maximize GPU utilization: With Konduktor, engineers can queue up a large number of jobs of varying priorities on their GPU cluster. The most important workloads run first, and your GPUs keep crunching numbers overnight, on weekends, etc.
    2. Minimize downtime disruptions: Traditional setups require manual intervention when a job fails, and with H100 GPUs these hardware faults are quite frequent (Llama 3 training ran for 54 days and required a restart roughly every 3 hours). Konduktor automates this process by detecting hardware issues on failure, resuming jobs on healthy GPUs, and alerting your provider with detailed logs.
    3. Enhanced observability: Our platform offers comprehensive dashboards that give a clear view of cluster usage and performance. Metrics like SM efficiency help you understand how effectively your GPUs are being used, enabling better alignment of compute resources with your business objectives.
    With the features above and more, AI teams using Trainy's Konduktor platform see at least 2x the utilization out of their GPU cluster. Curious? Drop me a message or click the link in the comments to check out our docs. If your AI team self-hosts Konduktor, I'd love to hear how it goes! #AI #artificialintelligence #K8s #GPU
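    To make point 1 concrete, here is a toy sketch of priority-ordered admission; Konduktor does this with a real Kubernetes scheduler, and the job names, priorities, and GPU counts below are made up:

      # Jobs wait in a heap and are admitted highest-priority-first
      # whenever enough GPUs are free.
      import heapq
      from dataclasses import dataclass, field

      @dataclass(order=True)
      class Job:
          neg_priority: int                 # heapq is a min-heap, so negate
          name: str = field(compare=False)
          gpus: int = field(compare=False)

      free_gpus = 8
      queue: list[Job] = []
      for name, prio, gpus in [("ablation", 1, 2), ("prod-train", 10, 8), ("eval", 5, 2)]:
          heapq.heappush(queue, Job(-prio, name, gpus))

      # prod-train (priority 10) is admitted first and takes all 8 GPUs;
      # eval and ablation wait until capacity frees up again.
      while queue and queue[0].gpus <= free_gpus:
          job = heapq.heappop(queue)
          free_gpus -= job.gpus
          print(f"admitted {job.name} ({job.gpus} GPUs), {free_gpus} free")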

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    What is gang scheduling, and why is it important for AI workloads? Gang scheduling is a scheduling algorithm that ensures a group of correlated pods is scheduled at the same time: if there are not enough resources to schedule all of the pods, none are scheduled. For AI workloads, this means a job will not launch unless there are enough GPUs to satisfy its requirements.

    Imagine two engineers sharing a 6-node cluster who both request 4 nodes at the same time. Without gang scheduling, the workloads can end up in a deadlock. It also matters for on-demand workloads: say AWS only has capacity for 3 nodes right now and you've requested 4. Without a gang scheduler, you'd spin up 3 nodes and be charged while waiting for more capacity! We made sure to include gang scheduling as a default in Trainy's Konduktor platform (a toy sketch of the idea follows below). If you'd like to learn more, feel free to drop me a message or click the link below to self-host! #AI #artificialintelligence #GPU #K8s
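    The all-or-nothing rule is easy to see in miniature. The following is an illustrative toy, not Konduktor's scheduler (on Kubernetes this is typically expressed through a pod group with a minimum member count):

      # Gang scheduling in miniature: admit a job's pods all-or-nothing.
      def try_schedule_gang(pods_needed: int, free_nodes: list[str]) -> list[str]:
          """Return the nodes assigned to the gang, or [] if it cannot fully fit."""
          if pods_needed > len(free_nodes):
              return []  # not enough capacity: schedule NONE of the pods
          assigned = free_nodes[:pods_needed]
          del free_nodes[:pods_needed]
          return assigned

      free = ["node-1", "node-2", "node-3"]  # AWS only has 3 nodes today
      print(try_schedule_gang(4, free))       # [] -> nothing spins up, no idle spend
      print(try_schedule_gang(2, free))       # ['node-1', 'node-2']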

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    Had a great time at LinkedIn's Scaling AI Infra conference! One thing LinkedIn made clear at the event was its huge investment in developing scalable & efficient AI infrastructure. It's amazing to see large-scale AI research teams make many of the same design decisions we did while developing Trainy's Konduktor platform. With features like Gang Scheduling & Priority/Quota Management, we know we're moving in the right direction.

    To any smaller AI teams out there: if you want scalable and efficient AI infrastructure on your GPU cluster, send me a message! We've seen countless AI teams falter due to a lack of infra, and we can stop you from making the same mistakes. #AI #artificialintelligence #GPU #K8s

  • Trainy reposted

    Roanak Baviskar

    Co-Founder & CEO at Trainy (YC S23) | Building high-performance GPU Infra for your AI Team

    Managing GPU health becomes extremely difficult when running a GPU cluster of more than 16 nodes. We've seen many AI teams' development velocity stall after scaling their cluster: ML engineers are constantly plagued by difficult-to-interpret XID errors, degraded network fabrics, and compatibility issues.

    This is where Trainy's Konduktor platform comes in. We run intermittent health checks across your cluster to ensure nodes are healthy. If an AI workload suffers a hardware failure, our platform automatically isolates the faulty hardware and resumes training on healthy GPUs (see the sketch below). Modern AI research teams know that a layer of infrastructure like this is absolutely necessary to run an efficient AI team. We bring OpenAI levels of efficiency to your GPU cluster. Interested? Drop me a message or read our blog to learn more about how we manage GPU health: https://lnkd.in/guusR62r #AI #artificialintelligence #GPU

    Automatic GPU Node Health and Pod Scheduling

    trainy.ai
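
    As a minimal sketch of the resume-on-failure idea from the post above (illustration only, not Konduktor's code; the checkpoint path, step counts, and retry budget are made up):

      # Checkpoint regularly; on a hardware fault, restart from the latest
      # checkpoint instead of paging an engineer at 2am.
      import glob
      import torch

      def latest_checkpoint():
          ckpts = sorted(glob.glob("/shared/ckpts/step_*.pt"))
          return ckpts[-1] if ckpts else None

      def train(resume_from):
          step = torch.load(resume_from)["step"] if resume_from else 0
          while step < 10_000:
              step += 1                  # stand-in for a real training step
              if step % 500 == 0:
                  torch.save({"step": step}, f"/shared/ckpts/step_{step:07d}.pt")

      for attempt in range(5):           # bounded retries
          try:
              train(latest_checkpoint())
              break                      # finished cleanly
          except RuntimeError:           # e.g. a CUDA error from a bad GPU
              # Konduktor would also cordon the faulty node and reschedule the
              # pod; here we simply restart from the last checkpoint.
              continue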

Funding

Trainy: 1 round in total

Last round

Pre-seed

US$500,000

Investors

Y Combinator
See more on Crunchbase