DatologyAI

Technology, Information and Internet

Redwood City, California · 2,289 followers

Better Data, Better Models, Better Business.

About us

DatologyAI builds tools to automatically select the best data on which to train deep learning models. Our tools leverage cutting-edge research—much of which we perform ourselves—to identify redundant, noisy, or otherwise harmful data points. The algorithms that power our tools are modality-agnostic—they’re not limited to text or images—and don’t require labels, making them ideal for realizing the next generation of large deep learning models. Our products allow customers in nearly any vertical to train better models for cheaper.
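
As a rough illustration of what label-free, modality-agnostic curation can look like (a minimal sketch of one common approach, not necessarily DatologyAI's pipeline): embed every example with any pretrained encoder, then flag examples whose embeddings are nearly identical to something already kept.

# Minimal sketch, illustration only: flag near-duplicate examples by cosine
# similarity of embeddings from any pretrained encoder (text, image, audio, ...),
# which is what makes the approach modality-agnostic and label-free.
import numpy as np

def flag_redundant(embeddings: np.ndarray, threshold: float = 0.95) -> set[int]:
    """Return indices of examples that are near-duplicates of an earlier kept example."""
    # L2-normalize so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept, redundant = [], set()
    for i, vec in enumerate(normed):
        if kept and float(np.max(np.stack(kept) @ vec)) >= threshold:
            redundant.add(i)   # too close to something already kept -> redundant
        else:
            kept.append(vec)
    return redundant

# Toy example: 128-d embeddings with exact copies injected at indices 500-509.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128))
emb[500:510] = emb[0:10]
print(len(flag_redundant(emb)))   # -> 10, the injected copies

At web scale the brute-force loop above would be replaced with approximate nearest-neighbor search, but the idea is the same: redundancy is detected from the data itself, with no labels and no modality-specific logic.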

Website
www.datologyai.com
Industry
Technology, Information and Internet
Company size
11-50 employees
Headquarters
Redwood City, California
Type
Privately held
Founded
2023

Locations

  • Primary

    699 Veterans Blvd

    Redwood City, California 94063, US

DatologyAI employees

Updates

  • View DatologyAI's organization page

    2,289 followers

    Leverage AI effectively with DatologyAI-curated data tailored to your specific business needs.

    View Ari Morcos's profile

    CEO and Co-founder at DatologyAI | ex-FAIR, DeepMind

    There's been lots of discussion about whether DeepSeek's $5.5M number is correct. Their math is very simple and checks out. Many have pointed out that this number doesn't include the cost of purchasing GPUs, salaries, R&D, etc., but this was all explicitly stated in the paper: the $5.5M number covers the cost of compute for the training run alone.

    However, that's still a huge deal! Many have been arguing that frontier models will soon cost hundreds of millions to billions of dollars in training costs alone. DeepSeek's ability to do it far more efficiently demonstrates that this is patently false. All of that R&D will be commoditized into easy-to-use solutions for training models (this is our explicit goal at DatologyAI -- make it so that you don't need to be an expert in order to train a model on your own data with the best possible data curation). This means that in a few years, an enterprise that wants to develop its own incredibly powerful, specialized small model for whatever use case its business requires will be able to do so end-to-end for a few million dollars at most in marginal cost.

    Jevons paradox has become surprisingly popular over the last week, and it's because it applies perfectly here. If training costs hundreds of millions to billions of dollars, very few entrants can work on it. But a world in which training costs a few hundred thousand to a few million dollars will massively change the landscape. This will be especially important as inference costs become the main driver of cost in model development and deployment: the enterprise will favor small, specialized models that don't have the general ability of frontier models, but that can perform the single task they're needed for with five 9s of reliability, and that can be deployed for a fraction of the cost because they have far fewer parameters than general models.
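
    For reference, the arithmetic behind the headline number is just GPU-hours times an assumed rental price. The figures below are the ones reported in the DeepSeek-V3 technical report (roughly 2.788M H800 GPU-hours at an assumed $2 per GPU-hour); this is a quick sanity check of the reported claim, not an independent cost estimate.

    # Back-of-the-envelope check of the reported DeepSeek-V3 training-compute cost.
    # Figures are those reported by DeepSeek; the $2/GPU-hour rental price is their
    # stated assumption, not a measured cost.
    gpu_hours = 2.788e6          # reported H800 GPU-hours for the training run
    price_per_gpu_hour = 2.00    # assumed rental price in USD
    print(f"${gpu_hours * price_per_gpu_hour / 1e6:.2f}M")  # -> $5.58M, the ~$5.5M headline figure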

  • View DatologyAI's organization page

    2,289 followers

    We’re thrilled to announce that DatologyAI will be at NeurIPS 2024 in Vancouver, Canada! Visit us at Booth 303 from December 10-12 to learn more about how we’re revolutionizing data curation. If you’re passionate about unlocking the power of your data or just curious about what we do, we’d love to meet you! Let’s talk about how data curation can transform your AI initiatives. See you! #NeurIPS2024 #DataCuration #DatologyAI

  • View DatologyAI's organization page

    2,289 followers

    Starting your Thanksgiving holiday with some fresh-out-of-the-oven DatologyAI data:

    View Ari Morcos's profile

    CEO and Co-founder at DatologyAI | ex-FAIR, DeepMind

    Two weeks ago, we at DatologyAI released our first results demonstrating massive gains from data curation on contrastive image-text models. Today, I'm incredibly excited to share our newest results, applying our curation pipeline to LLMs. It's absolutely astonishing to see what a small, incredibly talented group of individuals can accomplish, and boy have we cooked!

    Starting with an exact-deduplicated version of Red Pajama v1 as our baseline, and by manipulating only the training data for the model:

    Train Faster -- Training on our curated data reached the same baseline performance 7.7x faster, meaning results cost dramatically less to obtain and iteration speed improves drastically.

    Train Better -- Push the frontier of what's possible with a given budget, improving performance by 8.5 absolute percentage points (60.5% Datology vs. 52.0% RPJv1). This isn't just because of Red Pajama: compared to the strongest publicly curated datasets, DataComp-LM and FineWeb-Edu, we improve performance by 4.4% and 6.1%, respectively.

    Train Smaller -- Better data enables you to train smaller models. Reduce cost per query at inference by 2.1x while simultaneously increasing performance over the baseline by 5.7%.

    As with our image-text results, we present these results both at a high level (https://lnkd.in/g_hMR5Tx) and in an extremely meaty technical deep dive for all of you who want the nitty-gritty details (https://lnkd.in/gY5tpq3s). We are just getting started on our journey and are so excited about what's in store.

    Are you training or customizing your own text models and want to improve performance, training efficiency, and inference efficiency through better data? Get in touch (https://lnkd.in/gSGckr6s)! Are you a data-obsessed researcher, engineer, or somewhere in between who wants to push the bounds of what's possible with better data? We're hiring Members of Technical Staff across a number of roles (https://lnkd.in/gHCwPk8e).
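
    As context for the "exact-deduplicated" baseline mentioned above: exact deduplication is commonly done by hashing each document's normalized text and keeping one document per hash. The sketch below is illustrative only and is not the specific recipe used to build the Red Pajama v1 baseline.

    # Illustrative exact deduplication: hash normalized text, keep the first
    # document per hash. Not the exact recipe used for the RPJv1 baseline.
    import hashlib

    def exact_dedup(docs: list[str]) -> list[str]:
        """Keep the first occurrence of each document with identical normalized text."""
        seen, kept = set(), []
        for doc in docs:
            # Light normalization so trivially identical copies collide on the same hash.
            key = hashlib.sha256(" ".join(doc.split()).lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        return kept

    print(exact_dedup(["The cat sat.", "the cat  sat.", "A different document."]))
    # -> ['The cat sat.', 'A different document.']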

  • View DatologyAI's organization page

    2,289 followers

    Train better, train faster, train smaller with DatologyAI!

    View Ari Morcos's profile

    CEO and Co-founder at DatologyAI | ex-FAIR, DeepMind

    Models are what they eat: high-quality data leads to high-quality models, enabling faster training of better models with fewer parameters. However, identifying and curating high-quality data at scale, automatically, is an incredibly challenging problem requiring deep expertise. Our goal at DatologyAI is to make state-of-the-art data curation accessible to anyone who wants to train a model, and we’ve been hard at work realizing this vision over the last year.

    On a personal note, I am so proud of the incredible work our small but mighty team has accomplished, and today I’m incredibly excited to share our first set of results at DatologyAI! We focused on contrastive models (à la CLIP) trained on the large-scale DataComp dataset, and the results we’ve been able to achieve have exceeded our already high expectations!

    Train Faster - Training on DatologyAI’s optimized dataset, we were able to reach the same performance with up to ~98% less compute, meaning that models cost dramatically less to train and train dramatically faster!

    Train Better - Models trained on our optimized data for the same compute budget achieve up to 13 absolute percentage points better performance relative to models trained on raw data.

    Train Smaller - Train models with >60% fewer parameters to better performance by training on our curated data.

    Check out our high-level blog post here (https://shorturl.at/jkYqk), and if you’re interested in all the nitty-gritty details, check out our technical deep dive here (https://shorturl.at/Mt0k9). We are so excited about these results, and we are just getting started! Stay tuned for more exciting results on text models coming very soon!
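
    For readers less familiar with contrastive image-text models: CLIP-style training pushes matching image/text pairs in a batch to score higher than every mismatched pair, using a symmetric cross-entropy over the batch's pairwise similarity matrix. The sketch below is the generic objective only, not DatologyAI's training code.

    # Generic CLIP-style symmetric contrastive loss, illustrative only.
    import numpy as np

    def clip_style_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
        """img_emb, txt_emb: (batch, dim); row i of each is a matching image/text pair."""
        img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
        txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
        logits = img @ txt.T / temperature              # (batch, batch); diagonal = true pairs
        labels = np.arange(logits.shape[0])

        def cross_entropy(l: np.ndarray, y: np.ndarray) -> float:
            l = l - l.max(axis=1, keepdims=True)        # numerical stability
            log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
            return float(-log_probs[np.arange(len(y)), y].mean())

        # Symmetric: pick the right caption for each image, and the right image for each caption.
        return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

    rng = np.random.default_rng(0)
    print(clip_style_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))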

  • DatologyAI reposted this

    View Haoli Yin's profile

    MTS @DatologyAI | Neo Finalist, Goldwater, CV Scholar | prev @ Modern Intelligence, Bowden Lab | See haoliyin.me

    Check out what I've been working on for the past 6 months! tl;dr: we've pretrained the most data-efficient and best-in-class CLIP models! See this X/Twitter thread for more information as well: https://lnkd.in/g9RwS7uG

Similar pages

View jobs

Funding

DatologyAI · 2 total rounds

Last round

Series A

US$46,000,000.00

See more funding info on Crunchbase