DataChain hit 2000 stars on GitHub a week ago. Thank you for your interest and support!

It was built to address the needs and pain points we saw in the DVC community when people have to deal with millions of files (e.g. images, PDFs, audio).

How do you "query" them to find similar files, deduplicate, filter based on some insights, etc.?
What if those are tar or WebDataset archives?
How do you apply transformations (e.g. LLMs or any other models) at scale to get insights and do analytics on top of that?
How do you collaborate and share datasets with those insights? How do you version and reproduce them?
What about ETLs with granular updates (it's expensive to run GPUs to get embeddings)?

And many, many more questions...

We've just scratched the surface and more features are to come, but DataChain (open source and enterprise SaaS) is already saving data engineers and ML researchers many hours.

How do you manage your unstructured data?

#unstructured #machinelearning #opensource #dataengineering #dvc #datachain