iterative.ai

Software Development

San Francisco, California · 7,622 followers

Data tools for AI

About us

We create open-source and SaaS developer tools for machine learning data management. Our journey began with DVC, which is now an open-source standard for data versioning and reproducibility. Today we are launching DataChain, a multimodal data processing framework for ETL and data analytics at scale.

Enterprise Support: Our team provides Enterprise support to ensure your teams are set up for success.

Let's Connect: Curious to learn more? Schedule a 45-minute discussion with our experts to explore how Iterative can tailor solutions to your unique use case. Book a meeting here: https://calendly.com/dmitry-at-iterative/dmitry-petrov-30-minutes

Why Iterative: We are on a mission to simplify the complexities of managing datasets and ML infrastructure. At Iterative, we bring the best engineering practices to data science and machine learning teams, empowering them to thrive in the ever-evolving landscape of Generative AI. Join us as we redefine possibilities and shape the future of Generative AI innovation.

Website
https://datachain.ai
Industry
Software Development
Company size
11-50 employees
Headquarters
San Francisco, California
Type
Privately Held
Founded
2018
Specialties
Data Science, Machine Learning, Developer Tools, Data management, Continuous Integration, MLOps, ModelOps, DataOps, GitOps, Generative AI, and Unstructured Data

Updates

  • iterative.ai

    Exciting news! Design partnership with Sigen AI. We are thrilled to announce a design partnership with Sigen AI! Through this collaboration, Sigen AI's selective anonymization technology will now leverage our cutting-edge solutions to enhance security, reliability, and privacy compliance for clients worldwide. Stay tuned as we continue to push the boundaries of privacy-first AI innovations and deliver even more robust and reliable solutions together! #DataChain #PrivacyCompliance #Partnership #Innovation #SigenAI

  • iterative.ai

    "... metadata marts could play a key role in making video data more accessible and structured for model training and analysis ...". Read the new #datapains post by Simon Thelin. He reviews DataChain and gives some excellent recommendations. We definitely plan to add Iceberg support in open source (Studio already uses ClickHouse, which is scalable and will support Iceberg soon). Stay tuned!

    Simon Thelin

    Tech Lead ML Platform (DataOps) | Lead Data Engineer

    Another Sunday, another set of #datapains to discuss! I've just published a new Medium post reviewing DataChain from iterative.ai and its video metadata capabilities. DataChain is a Python-based AI-data warehouse for transforming and analysing unstructured data like images, audio, videos, text, and PDFs. It integrates with external storage (e.g. S3, GCP, Azure, HuggingFace) and manages metadata in an internal database for easy querying. In my post, I take a first look at how well DataChain handles video metadata extraction, to see if it can truly simplify video metadata management.

    Key takeaways:
    - Managing video metadata is more complex than it seems; DataChain aims to simplify this.
    - While DataChain offers useful features, the use of SQLite in the open-source version raises concerns about scalability for larger datasets and about cross-team collaboration.
    - A more scalable approach could involve integrating with open table formats like Delta or Apache Iceberg.

    I also explore the implementation process with code snippets and discuss areas for improvement. Ultimately, metadata marts could play a key role in making video data more accessible and structured for model training and analysis. Check out my full review here: https://lnkd.in/eQxsuqNi I'll also be sharing a follow-up video on my YouTube channel! Stay tuned and subscribe: https://lnkd.in/erYVFVgM What are your thoughts on handling video metadata at scale? Let me know in the comments!

  • iterative.ai

    Watch an excellent PyData Berlin talk by Julian Wagenschütz (Volkswagen Group) on automating and managing fluid simulations with Python and DVC. See the details below and more links in the comments. Julian Wagenschütz suggests using DVC, built on Git, to streamline simulation iterations and track results. DVC keeps it lightweight (no servers to run, just the CLI, Git, and basic Python) while making the whole process far more manageable and scalable. The usual DVC building blocks are used to achieve this:
    - DVC data versioning makes sure input data is saved and attached to an iteration, so it can be restored or accessed at any time in the future.
    - Lightweight CLI pipelines declaratively describe and run data processing, and DVC captures the products of this processing.
    - Finally, metrics and parameters are captured and attached to Git and to iterations, so results can be compared and visualized.

    See the full talk here: https://lnkd.in/ebPJteUc #dvc #simulations #OpenFOAM #dataversioning #reproducibility #simops

  • iterative.ai

    DataChain organizes your AI data and makes it queryable! What does that mean, and why does it matter? All efficient AI teams have excellent data hygiene: they use databases, ETLs, and custom scripts or tools to build a metadata layer on top of their binary data (videos, images, etc.). This is essential to make data accessible, to understand it, and to clean it, which leads to better datasets. Some examples and ideas we've seen:
    - Photoroom: see the link to the talk by Eliot Andres in the comments.
    - Iceberg for metadata (and sometimes binaries) plus Spark is one of the default choices, but it usually requires data engineering skills or a dedicated team.
    - At smaller scale: DVC plus CSV files, or Postgres plus some custom ETL to feed it.

    DataChain is an open-source library (and a SaaS platform for collaboration and scale) that implements this idea at scale, and our goal was to make it easy for ML teams to use. Give it a try! See the links in the comments.

  • iterative.ai

    Under the hood, DataChain combines the power of warehouses, distributed clusters, and proper data access patterns to process millions of video, image, and audio files:
    - Never copy data. Store references to files instead, while still preserving versioning, data loading, and efficient processing.
    - Use warehouses under the hood (e.g. ClickHouse) to store metadata and perform as many operations as possible inside them (e.g. filters).
    - Distributed compute that runs close to the data to compute Python-based UDFs.
    - Data access: pre-fetching, batching, caching, streaming. Different workloads require different ways of using data.

    #unstructured #datachain #dvc #machinelearning #opensource
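
The first two points (store references instead of copies, and push filters into the metadata store) can be illustrated with a minimal sketch. This is not DataChain's API: it uses Python's built-in sqlite3 as a stand-in for a warehouse such as ClickHouse, and the bucket name, file list, and function names are hypothetical.

```python
import sqlite3

# Hypothetical file references: (uri, media kind, size in bytes).
# Only these small records are stored; the binaries stay in object storage.
FILE_REFS = [
    ("s3://example-bucket/clips/a.mp4", "video", 52_000_000),
    ("s3://example-bucket/clips/b.mp4", "video", 4_000_000),
    ("s3://example-bucket/images/c.jpg", "image", 300_000),
]


def build_index(refs):
    """Create an in-memory metadata table holding references, not file contents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (uri TEXT PRIMARY KEY, kind TEXT, size INTEGER)"
    )
    conn.executemany("INSERT INTO files VALUES (?, ?, ?)", refs)
    return conn


def large_videos(conn, min_size):
    """Run the filter inside the database and return only the matching URIs."""
    rows = conn.execute(
        "SELECT uri FROM files WHERE kind = ? AND size >= ? ORDER BY uri",
        ("video", min_size),
    )
    return [uri for (uri,) in rows]


conn = build_index(FILE_REFS)
selected = large_videos(conn, 10_000_000)
print(selected)
```

Only the URIs that survive the filter would then need to be fetched or streamed for further processing, which is the reason for keeping the filter in the metadata store.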

  • iterative.ai

    A quick glimpse from our CEO, Dmitry Petrov, into the ETL and data governance aspects of DataChain and our SaaS for unstructured data processing:
    - Each dataset is immutable and versioned, and has fingerprints for all data objects so it can be reproduced.
    - All dependencies are tracked and saved: code, datasets, raw data sources.
    - ETL can be run automatically or on a schedule to produce new versions of the datasets.

    Interested to learn more? Contact us here: https://datachain.ai/ The open-source version is available to try here: https://lnkd.in/emFvJD84 #unstructured #dvc #datachain #machinelearning
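
As a rough illustration of what "immutable, versioned, with fingerprints for all data objects" can mean in practice, here is a small sketch. It is an assumption-laden toy, not DataChain's implementation: the `DatasetVersion` class, the `new_version` helper, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(data: bytes) -> str:
    """Content fingerprint of one data object (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DatasetVersion:
    name: str
    version: int
    entries: tuple  # sorted tuple of (path, fingerprint) pairs

    def digest(self) -> str:
        """Fingerprint of the whole version, derived from its entries."""
        h = hashlib.sha256()
        for path, fp in self.entries:
            h.update(f"{path}:{fp}\n".encode())
        return h.hexdigest()


def new_version(name, files, previous=None):
    """Build the next version from a {path: bytes} mapping; never mutates old ones."""
    entries = tuple(sorted((p, fingerprint(b)) for p, b in files.items()))
    version = 1 if previous is None else previous.version + 1
    return DatasetVersion(name, version, entries)


v1 = new_version("clips", {"a.txt": b"hello", "b.txt": b"world"})
v2 = new_version("clips", {"a.txt": b"hello", "b.txt": b"world!"}, previous=v1)

# Entries whose fingerprint differs from v1 show which objects changed.
changed = sorted(path for path, _ in set(v2.entries) - set(v1.entries))
print(v1.version, v2.version, changed)
```

Because the version digest depends only on paths and content fingerprints, rebuilding a version from the same inputs yields the same digest, which is what makes a dataset version checkable after the fact.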

  • iterative.ai

    DataChain got hand-picked on `r/Python` as one of the top 2024 tools in the "AI / ML / Data" category. Thanks, folks! We are also convinced that we need better tools for unstructured / AI data management. It is still a very hard problem, and existing platforms don't address all the needs. Meanwhile, there is strong and growing demand from AI companies and from all the companies that now build RAG and other apps that tap into unstructured data. We are working hard on DataChain and DVC to make data processing for images, audio, text, PDFs, etc. scalable, faster, and a pleasant experience. Stay tuned, more to come! Quote: "Our selection criteria remain focused on innovation, active maintenance, and broad impact potential. ..." #datachain #dvc #unstructured #machinelearning #opensource

  • iterative.ai

    Dealing with a lot of unstructured or multimodal data (audio, PDFs, images, videos) is hard. We clearly need new tools for unstructured data: processing, governance, analytics, preparing it for RAG, and so on. This short video by Ivan Shcheklein is a glimpse into how our DataChain SaaS helps with those aspects:
    - Stream audio files from tar or WebDataset (wds) archives!
    - Enrich, prepare, version, and publish datasets.
    - Bonus: Hugging Face is now natively integrated as a storage provider!

    Colab notebook: https://lnkd.in/g4W4qF4i Jupyter Notebook: https://lnkd.in/gTbj8ZG2 DataChain Repo: https://lnkd.in/emFvJD84 #huggingface #machinelearning #unstructured #dvc #datachain

  • iterative.ai reposted this

    DataChain hit 2000 stars on GitHub a week ago. Thank you for your interest and support! It was built to address the needs and pain points we saw in the DVC community when people have to deal with millions of files (e.g. images, PDFs, audio, etc.):
    - How to "query" them to find similar files, deduplicate, filter based on some insights, etc.?
    - What if those are tar or WebDataset archives?
    - How to apply transformations (e.g. LLMs or any other models) at scale to get insights and do analytics on top of that?
    - How to collaborate and share datasets with those insights? How to version and reproduce them?
    - What about ETLs with granular updates (it's expensive to run GPUs to get embeddings)?

    And many, many more questions. We've just scratched the surface and there are more features to come, but DataChain (open source and enterprise SaaS) is already saving data engineers and ML researchers many hours. https://lnkd.in/emFvJD84 https://datachain.ai How do you manage your unstructured data? #unstructured #machinelearning #opensource #dataengineering #dvc #datachain
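
The "granular updates" question can be made concrete with a small sketch: recompute an expensive result only for files whose content changed. This is a generic illustration, not how DataChain does it; `fake_embed` is a hypothetical stand-in for a GPU model call, and the cache keyed by a SHA-256 content hash is an assumed design.

```python
import hashlib


def fake_embed(data: bytes) -> list:
    """Hypothetical stand-in for an expensive GPU embedding call."""
    return [len(data), sum(data) % 256]


def update_embeddings(files: dict, cache: dict):
    """Embed only content that is not in the cache yet.

    Returns (embeddings by path, list of paths that were recomputed).
    """
    embeddings, recomputed = {}, []
    for path, data in sorted(files.items()):
        key = hashlib.sha256(data).hexdigest()  # cache key is the content hash
        if key not in cache:
            cache[key] = fake_embed(data)
            recomputed.append(path)
        embeddings[path] = cache[key]
    return embeddings, recomputed


cache = {}
files = {"a.wav": b"first clip", "b.wav": b"second clip"}
_, first_run = update_embeddings(files, cache)

# One file changes and one is added; the unchanged file is served from cache.
files["b.wav"] = b"second clip, re-recorded"
files["c.wav"] = b"third clip"
_, second_run = update_embeddings(files, cache)
print(first_run, second_run)
```

Keying the cache by content rather than by path means a renamed or duplicated file is also served from the cache without another model call.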
