We are pleased to announce a new index and query API system for Common Crawl! We're excited that the Common Crawl community is embracing this new feature, and we have already added the January and February 2015 datasets (that's 300+ TB). Going forward, each month's crawl will be accompanied by a new index. Check out the full announcement on our blog: https://lnkd.in/bJn4rp2
Common Crawl Foundation
Technology, Information and Internet
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
About us
The Common Crawl Foundation is a California 501(c)(3) registered non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable. Our vision is of a truly open web that allows open access to information and enables greater innovation in research, business and education. We level the playing field by making wholesale extraction, transformation and analysis of web data cheap and easy.
- Website
- https://www.commoncrawl.org
- Industry
- Technology, Information and Internet
- Company size
- 2-10 employees
- Type
- Nonprofit
- Founded
- 2007
Employees at Common Crawl Foundation
Updates
-
The Common Crawl Foundation’s January crawl (CC-MAIN-2025-05) and its corresponding Web Graph release (cc-main-2024-25-nov-dec-jan) are now available. The January crawl contains 3 billion web pages (460 TiB of uncompressed content). The Web Graph consists of 277.7 million nodes and 2.7 billion edges at the host level, and 100.8 million nodes and 1.9 billion edges at the domain level. This release also includes a fix for a bug with SURT URLs; thanks to Tom Morris for identifying the issue (see iipc/webarchive-commons#102 for details). Further reading: [January Crawl] https://lnkd.in/eqs7TzqV [January Web Graph] https://lnkd.in/ekrjCUiU [Web Graph Statistics] https://lnkd.in/eAKMDn8W [SURT URLs Erratum] https://lnkd.in/edcVgRxs [IIPC/webarchive-commons#102] https://lnkd.in/eMWN2CKN #CommonCrawl #WebData #OpenData #WebArchiving
-
I'm very pleased to share my Internet-Draft "Vocabulary for Expressing Content Preferences for AI Training" on the IETF Datatracker. The draft lays a foundation for clear and consistent signalling of rightsholder intentions. https://lnkd.in/eK8CiThz #AI #Standards #IETF
-
I’m excited to share insights from the Common Crawl Foundation’s participation at NeurIPS 2024 in Vancouver. Our team engaged with over 40 organizations, hosted a collaborative event with Wikimedia, and connected with AI leaders. Dive into our experiences and the future of open data in AI in our latest blog post. Rich Skrenta, Greg Lindahl, Wayne Yamamoto, Sarmeesha Reddy, Chris Tolles, Jason Grey https://lnkd.in/gVyYDrGS
-
Everyone loves graphs. Well, most people. OK, fine, only nerds like graphs. But I’m a nerd and you probably are too. Nerd. We rely on Web Graphs heavily throughout our work at the Common Crawl Foundation. I’m happy to share our recently launched statistics page, which features: - Top-ranked domains and hosts determined by Harmonic Centrality and PageRank (and explanations of what these mean) - Detailed statistics on nodes, edges, indegree/outdegree distributions, SCCs (and a lot more) for each graph release - Links to related papers. Check it out here: https://lnkd.in/exBwsMyU #GraphAnalysis #OpenData #WebGraphs #Nerd
-
Data is the lifeblood of #innovation and the foundation of successful #AI initiatives. But how can organizations effectively scale their #data capabilities to unlock transformative value? In collaboration with Rich Skrenta, Executive Director of the Common Crawl Foundation, Bharath Thota and Aswin Chandrasekharana show how. Read more: https://bit.ly/3ZSgIqg
-
For anyone going to NeurIPS in Vancouver next week, Wikimedia (https://lnkd.in/gHZvzjCa) is hosting a social with Common Crawl (https://commoncrawl.org/) on Wednesday at 7:30 pm Pacific Time. Come learn how the two nonprofits are providing data in the current ML/AI landscape and meet the teams: Christopher Petrillo and Prabhat Tiwary from Wikimedia Enterprise; Rich Skrenta, Greg Lindahl, and Stephen Burns from Common Crawl. Join us for an evening of learning and light refreshments! Detailed description of the event here: https://lnkd.in/gs8M99C3
-
We are excited to release Nemotron-CC, our high-quality, Common Crawl-based dataset of 6.3 trillion tokens for LLM pretraining (4.4T globally deduplicated original tokens and 1.9T synthetically generated tokens). Compared to the leading open DCLM dataset, Nemotron-CC makes it possible either to create a 4x larger dataset of similar quality or to increase MMLU by more than 5 points using a high-quality subset of the tokens. Blog post: https://lnkd.in/gvK2tCyB Paper: https://lnkd.in/gWwm5uUb Dataset: https://lnkd.in/gSf6tSQu We thank the Common Crawl Foundation for hosting the dataset. (with Dan Su*, Kezhi Kong*, Ying Lin*, Joseph Jennings, Brandon Norick, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro)
-
Last week I had the opportunity to present a series of talks on "Harnessing Common Crawl at Scale" with Pedro Ortiz Suarez. We presented one talk at The Alan Turing Institute's NLP Special Interest Group, and another in collaboration with Valyu at UCL. Here's the Common Crawl Foundation's blog post on these talks: https://lnkd.in/gJNJc9Q2 Many thanks to Robert Blackwell, PhD and Anthony Hills from the Turing Institute; Hirsh Pithadia, Harvey Yorke, and Hendrik van der Sande from Valyu; and Prof. Philip Treleaven from UCL.
-