Spawning reposted this
There is a wave of new AI datasets being collected that are free of copyright concerns. Congratulations to Spawning and Cullen Miller for this strong new addition!
Today Spawning is releasing PD12M, a dataset of 12.4M public domain images and synthetic captions, alongside research detailing a novel approach to dataset governance.

Dataset: https://source.plus/pd12m

While large-scale datasets have driven AI progress, they've also raised concerns about copyright, consent, bias, and safety. Rather than viewing these as trade-offs against scale, we demonstrate how careful sourcing and active governance can address these challenges while maintaining dataset utility.

PD12M key aspects:
- Sourced entirely from public domain and CC0 materials
- Matches the size of the widely used CC12M dataset
- Self-hosted to prevent degradation and externalized costs
- Community-driven refinement through Source.Plus
- Formal process for dataset evolution and improvement

Beyond the dataset itself, we propose mechanisms for dataset governance that enable:
1. Public auditing and transparent revisions
2. Statistical stability for reproducible research
3. Clear processes for community feedback
4. Systematic content replacement

The paper details our methodology and presents a framework for others building public datasets. We welcome feedback from the broader community as we work to expand access to high-quality, responsibly sourced training data.

Preprint paper: https://lnkd.in/efnkD4B5

This work builds on recommendations from dataset documentation pioneers, legal scholars studying AI & copyright, and advocates for responsible AI development. We aim to demonstrate how public datasets can evolve while maintaining scientific utility.

A special thank you to the cultural heritage institutions whose commitment to open access made this work possible. And, of course, to the entire Spawning team.

Let's build AI systems that serve the public good together.

#PublicAI #DataGovernance #OpenScience