Teraflop AI is excited to help support the Caselaw Access Project and Harvard LIL in the release of over 6.6 million state and federal court decisions published throughout U.S. history. In collaboration with Ravel Law, the Harvard Law School Library (hlslib) digitized roughly 40 million pages of U.S. court decisions, comprising 6.7 million cases from the last 360 years, into a widely accessible dataset. You can bulk download the data using the CAP API: https://case.law/caselaw/

It is important to democratize fair access to data for the public, the legal community, and researchers. You can find a processed and cleaned version of the data on Hugging Face here: https://lnkd.in/ezkqG5bH

You can find more information about accessing state and federal written court decisions of common law through the bulk data service documentation: https://case.law/docs/

You can learn more about the Caselaw Access Project and all of the phenomenal work done by Jack Cushman, Greg Leppert, and macargnelutti here: https://case.law/about/

OCR errors were introduced during the digitization of these texts. We post-processed each text for model training to fix encoding, normalization, repetition, redundancy, parsing, and formatting issues. Teraflop AI's data engine allows for the massively parallel processing of web-scale datasets into cleaned text form. Our one-click deployment allowed us to easily split the computation across thousands of nodes on our managed infrastructure.

Thank you to Nomic AI for providing us with Atlas research credits to store and visualize each of the jurisdictions in this dataset. You can access the New York jurisdiction map and all of the other Nomic AI Atlas maps on Hugging Face here: https://lnkd.in/e2JGH7Bf

Nomic's Atlas projection algorithm clusters semantically similar data together, generating a topic hierarchy.
You can find more information here: https://lnkd.in/e9JwrPJw

Nomic AI released nomic-embed-text-v1.5, an open-source text embedding model with an 8192-token context length, here: https://lnkd.in/e7qx6-Hy

You can find the detailed research paper on the methodologies used by Zach Nussbaum, Andriy Mulyar, and Brandon Duderstadt for the nomic-embed-text-v1.5 model here: https://lnkd.in/ejMHsT2W

You can find all of this information detailed in this post: https://lnkd.in/e5hyAvKr

Thank you to Shayne Longpre, Robert Mahari, Jon Tow, StabilityAI, Barry Zhang, Sam Ching, Eleuther AI, Daniel Chang, and the many others who have been supportive over these last months.

We plan to release trillions of commercially licensed text tokens, images, audio, videos, and other datasets spanning numerous domains and modalities over the coming months.
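For readers curious what the OCR post-processing mentioned above (encoding, normalization, repetition, redundancy, and formatting fixes) can look like in practice, here is a minimal Python sketch. It is not Teraflop AI's actual pipeline: the function name `clean_ocr_text` and the specific rules are assumptions chosen for illustration, using only the Python standard library.

```python
import re
import unicodedata


def clean_ocr_text(text: str) -> str:
    """Apply a few generic cleanup passes to OCR'd text (illustrative only)."""
    # Normalization: fold compatibility characters such as ligatures.
    text = unicodedata.normalize("NFKC", text)
    # Encoding debris: drop control characters except newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Formatting: rejoin words hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Formatting: collapse runs of spaces and tabs, trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Repetition: drop a line that exactly repeats the previous non-empty line.
    cleaned = []
    previous = None
    for line in lines:
        if line and line == previous:
            continue
        cleaned.append(line)
        if line:
            previous = line
    # Redundancy: collapse three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned)).strip()


sample = "The plainti\ufb00  \x0cappealed the judg-\nment.\nPage 12\nPage 12\n\n\n\nAffirmed."
print(clean_ocr_text(sample))
```

Rules like the hyphen rejoin are heuristics: they would also merge a genuine compound such as "well-known" if it happened to break across lines, so a real pipeline would likely need dictionary checks or per-source tuning.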
About us
Big data
- Website
- https://www.teraflop.ai
- Industry
- Data Infrastructure and Analytics
- Company size
- 1 employee
- Type
- Privately held