iterative.ai

Software Development

San Francisco, California · 7,622 followers

Data tools for AI

View all 19 employees

About us

We create open-source and SaaS developer tools dedicated to advancing machine learning data management. Our journey began with the creation of DVC, which is now an open-source standard for data versioning and reproducibility. Today we are launching DataChain, a multimodal data processing framework for ETL and data analytics at scale.

Enterprise Support
Our team is dedicated to providing top-notch Enterprise support, ensuring your teams are set up for success.

Let's Connect
Curious to learn more? Schedule a 45-minute discussion with our experts to explore how Iterative can tailor solutions to your unique use case. Book a meeting here: https://calendly.com/dmitry-at-iterative/dmitry-petrov-30-minutes

Why Iterative
We are on a mission to simplify the complexities of managing datasets and ML infrastructure. At Iterative, we bring the best engineering practices to data science and machine learning teams, empowering them to thrive in the ever-evolving landscape of Generative AI. Join us as we redefine possibilities and shape the future of Generative AI innovation.

Website: https://datachain.ai
Industry: Software Development
Company size: 11-50 employees
Headquarters: San Francisco, California
Type: Privately Held
Founded: 2018
Specialties: Data Science, Machine Learning, Developer Tools, Data management, Continuous Integration, MLOps, ModelOps, DataOps, GitOps, Generative AI, and Unstructured Data

Locations

Primary

450 Townsend St

San Francisco, California, US


Updates

iterative.ai

1 month ago
Exciting news! Design partnership with Sigen AI. We are thrilled to announce a design partnership with Sigen AI! Through this collaboration, Sigen AI's selective anonymization technology will now leverage our cutting-edge solutions to enhance security, reliability, and privacy compliance for clients worldwide. Stay tuned as we continue to push the boundaries of privacy-first AI innovations and deliver even more robust and reliable solutions together! #DataChain #PrivacyCompliance #Partnership #Innovation #SigenAI

iterative.ai

1 month ago
"... metadata marts could play a key role in making video data more accessible and structured for model training and analysis ...". Read the new #datapains post by Simon Thelin, who reviews DataChain and gives some excellent recommendations. We definitely plan to add Iceberg support in open source (Studio already uses ClickHouse, which is scalable and will support Iceberg soon). Stay tuned!

Simon Thelin

Tech Lead ML Platform (DataOps) | Lead Data Engineer
1 month ago, edited

Another Sunday, another set of #datapains to discuss! I've just published a new Medium post reviewing DataChain from iterative.ai and its video metadata capabilities. DataChain is a Python-based AI-data warehouse for transforming and analysing unstructured data like images, audio, videos, text, and PDFs. It integrates with external storage (e.g. S3, GCP, Azure, HuggingFace) and manages metadata in an internal database for easy querying. In my post, I take a first look at how well DataChain handles video metadata extraction, to see if it can truly simplify video metadata management.

Key Takeaways:
- Managing video metadata is more complex than it seems; DataChain aims to simplify this.
- While DataChain offers useful features, the use of SQLite in the open source version raises concerns about scalability for larger datasets and about cross-collaboration.
- A more scalable approach could involve integrating with open table formats like Delta or Apache Iceberg.

I also explore the implementation process with code snippets and discuss areas for improvement. Ultimately, metadata marts could play a key role in making video data more accessible and structured for model training and analysis.

Check out my full review here: https://lnkd.in/eQxsuqNi
I'll also be sharing a follow-up video on my YouTube channel! Stay tuned and subscribe: https://lnkd.in/erYVFVgM

What are your thoughts on handling video metadata at scale? Let me know in the comments!

DataPains

youtube.com

iterative.ai

1 month ago
Watch an excellent PyData Berlin talk by Julian Wagenschütz (Volkswagen Group) on automating and managing fluid simulations with Python and DVC. See details below and more links in the comments.

Julian Wagenschütz suggests using DVC, built on Git, to streamline simulation iterations and track results. DVC keeps it lightweight (no servers to run, just the CLI, Git, and basic Python) while making the whole process far more manageable and scalable. The usual DVC building blocks are used to achieve this:
- DVC data versioning makes sure input data is saved and attached to an iteration, so it can be restored or accessed at any time in the future.
- Lightweight CLI pipelines declaratively describe and run data processing, and DVC captures the products of this processing.
- Metrics and parameters are captured and attached to Git and to iterations, so results can be compared and visualized.

See the full talk here: https://lnkd.in/ebPJteUc
#dvc #simulations #OpenFOAM #dataversioning #reproducibility #simops

There is a Better Way to Automate and Manage Your (Fluid) Simulations

https://www.youtube.com/

2 comments

iterative.ai

2 months ago
DataChain organizes your AI data and makes it queryable! What does that mean, and why does it matter?

All efficient AI teams have excellent data hygiene: they use databases, ETLs, and custom scripts or tools to build a metadata layer on top of their binary data (videos, images, etc.). This is essential to make data accessible, understand it, clean it, and build better datasets. Some examples and ideas we've seen:
- Photoroom: see the link to the talk by Eliot Andres in the comments.
- Iceberg for metadata (and sometimes binaries) plus Spark is one of the default choices, but it usually requires data engineering skills or a dedicated team.
- At smaller scale: DVC + CSV files, or Postgres plus some custom ETL to feed it.

DataChain is an open source library (and a SaaS platform for collaboration and scale) that implements this idea at scale, and our goal was to make it easy for ML teams to use. Give it a try! See the links in the comments.

2 comments

iterative.ai

2 months ago, edited
Under the hood, DataChain combines the power of warehouses and distributed clusters with proper data access patterns to process millions of video, image, and audio files:
- Never copy data. Store references to files instead, while still preserving versioning, data loading, and efficient processing.
- Use warehouses under the hood (e.g. ClickHouse) to store metadata and perform as many operations as possible inside them (e.g. filters).
- Distributed compute that runs close to the data to execute Python-based UDFs.
- Data access: pre-fetching, batching, caching, streaming. Different workloads require different ways of using data.

#unstructured #datachain #dvc #machinelearning #opensource
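
To make the "store references, filter in the warehouse, then run a UDF" pattern from this post concrete, here is a minimal sketch. It is not DataChain's API: Python's built-in sqlite3 stands in for a warehouse such as ClickHouse, and the URIs, table layout, and the `build_index`, `select_uris`, and `file_stem` helpers are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical file references: (uri, media type, duration in seconds).
# Only these references are stored; the binary files stay in object storage.
FILES = [
    ("s3://bucket/videos/a.mp4", "video", 12.0),
    ("s3://bucket/videos/b.mp4", "video", 95.5),
    ("s3://bucket/audio/c.wav", "audio", 30.0),
]


def build_index(rows):
    """Create an in-memory metadata table holding references, not file contents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (uri TEXT PRIMARY KEY, kind TEXT, duration REAL)"
    )
    conn.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)
    return conn


def select_uris(conn, kind, min_duration):
    """Run the filter inside the database and return only the matching URIs."""
    cur = conn.execute(
        "SELECT uri FROM files WHERE kind = ? AND duration >= ? ORDER BY uri",
        (kind, min_duration),
    )
    return [row[0] for row in cur]


def file_stem(uri):
    """Stand-in for a Python UDF; a real one would open the file and run a model."""
    return uri.rsplit("/", 1)[-1].split(".")[0]


conn = build_index(FILES)
long_videos = select_uris(conn, "video", 60.0)
# The UDF touches only the rows that survived the filter.
results = {uri: file_stem(uri) for uri in long_videos}
```

The design point is that the cheap filter runs on metadata first, so the expensive per-file function is applied to as few files as possible.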

3 comments

iterative.ai

2 months ago
A quick glimpse from our CEO, Dmitry Petrov, into the ETL and data governance aspects of DataChain and our SaaS for unstructured data processing:
- Each dataset is immutable and versioned, with fingerprints for all data objects so it can be reproduced.
- All dependencies are tracked and saved: code, datasets, raw data sources.
- ETL can be run automatically or on a schedule to produce new versions of the datasets.

Interested in learning more? Contact us here: https://datachain.ai/
The open source version is available to try here: https://lnkd.in/emFvJD84
#unstructured #dvc #datachain #machinelearning
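
One way to picture the fingerprinting and dependency tracking described in this post is the sketch below. It is an illustration only, not how DataChain implements versioning: it assumes SHA-256 content hashes, and the `fingerprint` and `dataset_version` helpers and the record layout are hypothetical.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Content hash of one data object."""
    return hashlib.sha256(data).hexdigest()


def dataset_version(objects: dict, code: str, parents: list) -> dict:
    """Build a version record from object fingerprints and tracked dependencies."""
    record = {
        "objects": {name: fingerprint(blob) for name, blob in sorted(objects.items())},
        "code": fingerprint(code.encode("utf-8")),
        "parents": sorted(parents),
    }
    # The version id is derived from the record itself, so any change to the
    # data, the code, or the upstream datasets yields a different id.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["id"] = hashlib.sha256(payload).hexdigest()[:12]
    return record


v1 = dataset_version({"a.txt": b"hello"}, "def etl(): ...", [])
v1_again = dataset_version({"a.txt": b"hello"}, "def etl(): ...", [])
v2 = dataset_version({"a.txt": b"hello!"}, "def etl(): ...", [v1["id"]])
```

Here `v1` and `v1_again` receive the same id because their inputs are identical, while `v2` differs both in data and in its recorded parent.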

1 comment

iterative.ai

3 months ago
DataChain got hand-picked on `r/Python` as one of the top 2024 tools in the "AI / ML / Data" category. Thanks folks, we are also convinced that we need better tools for unstructured / AI data management. It is still a very hard problem, and existing platforms don't address all the needs. Meanwhile, there is very strong and growing demand from AI companies and from all the companies that now build RAGs and other apps that tap into unstructured data. We are working hard on DataChain and DVC to make the whole data processing experience for images, audio, text, PDFs, etc. scalable, faster, and more pleasant. Stay tuned, more to come!

Quote: "Our selection criteria remain focused on innovation, active maintenance, and broad impact potential. ...."
#datachain #dvc #unstructured #machinelearning #opensource
1 comment

iterative.ai

3 months ago
Dealing with a lot of unstructured or multimodal (audio, PDFs, images, videos) data is hard. We clearly need new tools for unstructured data: processing, governance, analytics, preparing it for RAGs, and so on. This short video by Ivan Shcheklein is a glimpse into how our DataChain SaaS helps with those aspects:
- stream audio files from tar or wds archives!
- enrich, prepare, version, publish datasets ...
- bonus! Hugging Face is now natively integrated as a storage provider!

Colab notebook: https://lnkd.in/g4W4qF4i
Jupyter Notebook: https://lnkd.in/gTbj8ZG2
DataChain Repo: https://lnkd.in/emFvJD84
#huggingface #machinelearning #unstructured #dvc #datachain

1 comment

iterative.ai reposted this
iterative.ai

3 months ago
DataChain hit 2000 stars on GitHub a week ago. Thank you for your interest and support! It was built to address the needs and pain points we saw in the DVC community when people have to deal with millions of files (e.g. images, PDFs, audio, etc.):
- How to "query" them to find similar files, deduplicate, filter based on some insights, etc.?
- What if those are tar or WebDataset archives?
- How to apply transformations (e.g. LLMs or any other models) at scale to get insights and do analytics on top of that?
- How to collaborate and share datasets with those insights? How to version and reproduce them?
- What about ETLs with granular updates (it's expensive to run GPUs to get embeddings)?

And many, many more questions ... We've just scratched the surface and more features are coming, but DataChain (open source and enterprise SaaS) is already saving data engineers and ML researchers many hours.

https://lnkd.in/emFvJD84
https://datachain.ai

How do you manage your unstructured data?
#unstructured #machinelearning #opensource #dataengineering #dvc #datachain
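
The "granular updates" question in this post can be sketched as follows: cache results by content hash so that an expensive model is only called for content that has not been seen before, which also covers exact duplicates. This is a rough illustration under stated assumptions, not DataChain's mechanism; `fake_embedding` is a placeholder for a real model call, and the file names and helper names are hypothetical.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Identify a file by its content, so exact duplicates share one key."""
    return hashlib.sha256(data).hexdigest()


def fake_embedding(data: bytes) -> list:
    """Placeholder for an expensive model call (for example a GPU embedding)."""
    return [len(data), sum(data) % 256]


def incremental_embed(files: dict, cache: dict) -> tuple:
    """Embed only unseen content; return (uri -> embedding, number of model calls)."""
    calls = 0
    out = {}
    for uri, data in files.items():
        key = content_key(data)
        if key not in cache:
            cache[key] = fake_embedding(data)
            calls += 1
        out[uri] = cache[key]
    return out, calls


cache = {}
batch1 = {"a.jpg": b"cat", "b.jpg": b"dog", "copy_of_a.jpg": b"cat"}
_, calls1 = incremental_embed(batch1, cache)

# A later run adds one new file; only that file triggers a model call.
batch2 = dict(batch1, **{"c.jpg": b"bird"})
_, calls2 = incremental_embed(batch2, cache)
```

In this toy run the first batch needs two model calls (the duplicate of `a.jpg` is served from the cache) and the second batch needs one.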

CML - Continuous Machine Learning

Machine Learning Software

DVC - Data Version Control

Version Control Systems

Studio - ML Platform & Model Registry

Data Science & Machine Learning Platforms

Employees at iterative.ai

Maurice (Marc) McSweeney

Director at Iterative Bio, Inc., ISI Life Sciences, Inc., and L'Eft Bank Wine, Ltd.

Vladimir Rudnykh

Software Engineer

Ivan Longin

Founder of Longin IT

Martin Jasion


Similar pages

Union.ai

fal

DagsHub

MLflow

LLM Engineer

Data Community Africa

BentoML

Hugging Face

Iterative;

DevNetwork
