Under the hood, DataChain combines the power of data warehouses and distributed clusters with appropriate data access patterns to process millions of video, image, and audio files:
- Never copy data. Store references to files instead, while still preserving versioning, data loading, and efficient processing.
- Use a warehouse under the hood (e.g. ClickHouse) to store metadata and perform as many operations as possible inside it (e.g. filters).
- Distributed compute runs close to the data to execute Python-based UDFs.
- Data access: pre-fetching, batching, caching, streaming. Different workloads require different ways of using data.
#unstructured #datachain #dvc #machinelearning #opensource
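To make the pattern concrete, here is a toy sketch in plain Python. It is not DataChain's API: an in-memory SQLite table stands in for the warehouse, a thread pool stands in for the distributed cluster, and the URIs, the `describe` UDF, and all helper names are hypothetical. It only illustrates the idea of storing references, filtering on metadata inside the database, and running a Python UDF in batches over what survives the filter.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical file references: only URIs and metadata, never file contents.
FILE_REFS = [
    ("s3://bucket/videos/a.mp4", "video", 120_000_000),
    ("s3://bucket/images/b.jpg", "image", 2_000_000),
    ("s3://bucket/images/c.jpg", "image", 9_000_000),
    ("s3://bucket/audio/d.wav", "audio", 40_000_000),
]


def build_index(refs):
    """Load references into an in-memory SQLite table (stand-in for a warehouse)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (uri TEXT, kind TEXT, size INTEGER)")
    db.executemany("INSERT INTO files VALUES (?, ?, ?)", refs)
    return db


def select_refs(db, kind, max_size):
    """Filter inside the database so only matching references reach Python."""
    rows = db.execute(
        "SELECT uri FROM files WHERE kind = ? AND size <= ? ORDER BY uri",
        (kind, max_size),
    )
    return [uri for (uri,) in rows]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def describe(uri):
    """Toy UDF: a real one would stream the file from storage and process it."""
    return uri.rsplit("/", 1)[-1].upper()


def run_udf(uris, batch_size=2, workers=2):
    """Apply the UDF batch by batch; threads stand in for cluster workers."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(uris, batch_size):
            results.extend(pool.map(describe, batch))
    return results


db = build_index(FILE_REFS)
small_images = select_refs(db, "image", 5_000_000)
print(run_udf(small_images))
```

The filter runs as SQL, so the Python side never sees references it does not need, and the files themselves are never copied.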
It's impressive how DataChain optimizes unstructured data processing. Could you share more about how the distributed compute works next to the data?
Try it here: https://datachain.ai/ or use the less scalable open-source version here: https://github.com/iterative/datachain