DatologyAI

Technology, Information and Internet

Redwood City, California · 2,289 followers

Better Data, Better Models, Better Business.

About us

DatologyAI builds tools to automatically select the best data on which to train deep learning models. Our tools leverage cutting-edge research—much of which we perform ourselves—to identify redundant, noisy, or otherwise harmful data points. The algorithms that power our tools are modality-agnostic—they’re not limited to text or images—and don’t require labels, making them ideal for realizing the next generation of large deep learning models. Our products allow customers in nearly any vertical to train better models for cheaper.
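
As a rough illustration of what label-free, modality-agnostic curation can look like (a minimal sketch of one common approach, not necessarily DatologyAI's pipeline): embed every example with any pretrained encoder, then flag examples whose embeddings are nearly identical to something already kept.

# Minimal sketch, illustration only: flag near-duplicate examples by cosine
# similarity of embeddings from any pretrained encoder (text, image, audio, ...),
# which is what makes the approach modality-agnostic and label-free.
import numpy as np

def flag_redundant(embeddings: np.ndarray, threshold: float = 0.95) -> set[int]:
    """Return indices of examples that are near-duplicates of an earlier kept example."""
    # L2-normalize so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept, redundant = [], set()
    for i, vec in enumerate(normed):
        if kept and float(np.max(np.stack(kept) @ vec)) >= threshold:
            redundant.add(i)   # too close to something already kept -> redundant
        else:
            kept.append(vec)
    return redundant

# Toy example: 128-d embeddings with exact copies injected at indices 500-509.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128))
emb[500:510] = emb[0:10]
print(len(flag_redundant(emb)))   # -> 10, the injected copies

At web scale the brute-force loop above would be replaced with approximate nearest-neighbor search, but the idea is the same: redundancy is detected from the data itself, with no labels and no modality-specific logic.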

Website
www.datologyai.com
Industry
Technology, Information and Internet
Company size
11-50 employees
Headquarters
Redwood City, California
Type
Privately held
Founded
2023

Locations

  • Primary

    699 Veterans Blvd

    Redwood City, California 94063, US

DatologyAI employees

Updates

  • View DatologyAI's organization page

    2,289 followers

    Leverage AI effectively with DatologyAI-curated data tailored to your specific business needs.

    View Ari Morcos's profile

    CEO and Co-founder at DatologyAI | ex-FAIR, DeepMind

    There's been lots of discussion about whether DeepSeek's $5.5M number is correct. Their math is very simple and checks out. Many have pointed out that this number doesn't include the cost of purchasing GPUs, salaries, R&D, etc., but this was all explicitly stated in the paper: the $5.5M number covers the cost of compute for the training run alone.

    However, that's still a huge deal! Many have been arguing that frontier models will soon cost hundreds of millions to billions of dollars in training costs alone. DeepSeek's ability to do it far more efficiently demonstrates that this is patently false. All of that R&D will be commoditized into easy-to-use solutions for training models (this is our explicit goal at DatologyAI -- make it so that you don't need to be an expert in order to train a model on your own data with the best possible data curation). This means that in a few years, an enterprise that wants to develop its own incredibly powerful, specialized small model for whatever use case its business requires will be able to do so end-to-end for a few million dollars at most in marginal cost.

    Jevons paradox has become surprisingly popular over the last week, and it's because it applies perfectly here. If training costs hundreds of millions to billions of dollars, very few entrants can work on it. But a world in which training costs a few hundred thousand to a few million dollars will massively change the landscape. This will be especially important as inference costs become the main driver of cost in model development and deployment: the enterprise will favor small, specialized models that don't have the general ability of frontier models, but that can perform the single task they're needed for with five 9s of reliability, and that can be deployed for a fraction of the cost because they have far fewer parameters than general models.
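
    For reference, the arithmetic behind the headline number is just GPU-hours times an assumed rental price. The figures below are the ones reported in the DeepSeek-V3 technical report (roughly 2.788M H800 GPU-hours at an assumed $2 per GPU-hour); this is a quick sanity check of the reported claim, not an independent cost estimate.

    # Back-of-the-envelope check of the reported DeepSeek-V3 training-compute cost.
    # Figures are those reported by DeepSeek; the $2/GPU-hour rental price is their
    # stated assumption, not a measured cost.
    gpu_hours = 2.788e6          # reported H800 GPU-hours for the training run
    price_per_gpu_hour = 2.00    # assumed rental price in USD
    print(f"${gpu_hours * price_per_gpu_hour / 1e6:.2f}M")  # -> $5.58M, the ~$5.5M headline figure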

  • View DatologyAI's organization page

    2,289 followers

    We’re thrilled to announce that DatologyAI will be at NeurIPS 2024 in Vancouver, Canada! Visit us at Booth 303 from December 10-12 to learn more about how we’re revolutionizing data curation. If you’re passionate about unlocking the power of your data or just curious about what we do, we’d love to meet you! Let’s talk about how data curation can transform your AI initiatives. See you! #NeurIPS2024 #DataCuration #DatologyAI

  • View DatologyAI's organization page

    2,289 followers

    Starting your Thanksgiving holiday with some fresh-out-of-the-oven DatologyAI data:

    View Ari Morcos's profile

    CEO and Co-founder at DatologyAI | ex-FAIR, DeepMind

    Two weeks ago, we at DatologyAI released our first results demonstrating massive gains from data curation on contrastive image-text models. Today, I'm incredibly excited to share our newest results, applying our curation pipeline to LLMs. It's absolutely astonishing to see what a small, incredibly talented group of individuals can accomplish, and boy have we cooked!

    Starting with an exact-deduplicated version of Red Pajama v1 as our baseline, and by manipulating only the training data for the model:

    Train Faster -- Training on our curated data reached the same baseline performance 7.7x faster, meaning results cost dramatically less to obtain and iteration speed improves drastically.

    Train Better -- Push the frontier of what's possible with a given budget, improving performance by 8.5 absolute percentage points (60.5% Datology vs. 52.0% RPJv1). This isn't just because of Red Pajama: compared to the strongest publicly curated datasets, DataComp-LM and FineWeb-Edu, we improve performance by 4.4% and 6.1%, respectively.

    Train Smaller -- Better data enables you to train smaller models. Reduce cost per query at inference by 2.1x while simultaneously increasing performance over the baseline by 5.7%.

    As with our image-text results, we present these results both at a high level (https://lnkd.in/g_hMR5Tx) and in an extremely meaty technical deep dive for all of you who want the nitty-gritty details (https://lnkd.in/gY5tpq3s). We are just getting started on our journey and are so excited about what's in store.

    Are you training or customizing your own text models and want to improve performance, training efficiency, and inference efficiency through better data? Get in touch (https://lnkd.in/gSGckr6s)! Are you a data-obsessed researcher, engineer, or somewhere in between who wants to push the bounds of what's possible with better data? We're hiring Members of Technical Staff across a number of roles (https://lnkd.in/gHCwPk8e).
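
    As context for the "exact-deduplicated" baseline mentioned above: exact deduplication is commonly done by hashing each document's normalized text and keeping one document per hash. The sketch below is illustrative only and is not the specific recipe used to build the Red Pajama v1 baseline.

    # Illustrative exact deduplication: hash normalized text, keep the first
    # document per hash. Not the exact recipe used for the RPJv1 baseline.
    import hashlib

    def exact_dedup(docs: list[str]) -> list[str]:
        """Keep the first occurrence of each document with identical normalized text."""
        seen, kept = set(), []
        for doc in docs:
            # Light normalization so trivially identical copies collide on the same hash.
            key = hashlib.sha256(" ".join(doc.split()).lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        return kept

    print(exact_dedup(["The cat sat.", "the cat  sat.", "A different document."]))
    # -> ['The cat sat.', 'A different document.']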

  • View DatologyAI's organization page

    2,289 followers

    Train better, train faster, train smaller with DatologyAI!

    View Ari Morcos's profile

    CEO and Co-founder at DatologyAI | ex-FAIR, DeepMind

    Models are what they eat: high-quality data leads to high-quality models, enabling faster training of better models with fewer parameters. However, identifying and curating high-quality data at scale, automatically, is an incredibly challenging problem requiring deep expertise. Our goal at DatologyAI is to make state-of-the-art data curation accessible to anyone who wants to train a model, and we’ve been hard at work realizing this vision over the last year.

    On a personal note, I am so proud of the incredible work our small but mighty team has accomplished, and today I’m incredibly excited to share our first set of results at DatologyAI! We focused on contrastive models (à la CLIP) trained on the large-scale DataComp dataset, and the results we’ve been able to achieve have exceeded our already high expectations!

    Train Faster - Training on DatologyAI’s optimized dataset, we were able to reach the same performance with up to ~98% less compute, meaning that models cost dramatically less to train and train dramatically faster!

    Train Better - Models trained on our optimized data for the same compute budget achieve up to 13 absolute percentage points better performance relative to models trained on raw data.

    Train Smaller - Train models with >60% fewer parameters to better performance by training on our curated data.

    Check out our high-level blog post here (https://shorturl.at/jkYqk), and if you’re interested in all the nitty-gritty details, check out our technical deep dive here (https://shorturl.at/Mt0k9). We are so excited about these results, and we are just getting started! Stay tuned for more exciting results on text models coming very soon!
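
    For readers less familiar with contrastive image-text models: CLIP-style training pushes matching image/text pairs in a batch to score higher than every mismatched pair, using a symmetric cross-entropy over the batch's pairwise similarity matrix. The sketch below is the generic objective only, not DatologyAI's training code.

    # Generic CLIP-style symmetric contrastive loss, illustrative only.
    import numpy as np

    def clip_style_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
        """img_emb, txt_emb: (batch, dim); row i of each is a matching image/text pair."""
        img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
        txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
        logits = img @ txt.T / temperature              # (batch, batch); diagonal = true pairs
        labels = np.arange(logits.shape[0])

        def cross_entropy(l: np.ndarray, y: np.ndarray) -> float:
            l = l - l.max(axis=1, keepdims=True)        # numerical stability
            log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
            return float(-log_probs[np.arange(len(y)), y].mean())

        # Symmetric: pick the right caption for each image, and the right image for each caption.
        return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

    rng = np.random.default_rng(0)
    print(clip_style_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))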

  • DatologyAI reposted this

    View Haoli Yin's profile

    MTS @DatologyAI | Neo Finalist, Goldwater, CV Scholar | prev @ Modern Intelligence, Bowden Lab | See haoliyin.me

    Check out what I've been working on for the past 6 months! tl;dr: we've pretrained the most data-efficient and best-in-class CLIP models! See this X/Twitter thread for more information as well: https://lnkd.in/g9RwS7uG

Similar pages

View jobs

Funding

DatologyAI · 2 total rounds

Last round

Series A

US$46,000,000.00

See more funding info on Crunchbase