Web Data Commons will extract all Microformat, Microdata and RDFa data contained in the Common Crawl corpus and will provide the extracted data for free download as RDF quads, as well as CSV tables for common entity types (e.g. product, organization, location, ...).
We have finished developing the software infrastructure for the extraction and will start an extraction run over the complete Common Crawl corpus once the new 2012 version of the corpus becomes available in February. To test our extraction framework, we have extracted structured data from 1% of the currently available Common Crawl corpus, which dates from October 2010. The results of this extraction run are provided below. We will provide the data from the complete 2010 corpus together with the data from the 2012 corpus to enable comparisons of how data provision has evolved over the last two years.
Web Data Commons is a joint effort of the Web-based Systems Group at Freie Universität Berlin (Christian Bizer, Hannes Mühleisen) and the Institute AIFB at the Karlsruhe Institute of Technology (Andreas Harth, Steffen Stadtmüller). We plan to update the extracted datasets regularly as new Common Crawl corpora become available.
This project uses Anything To Triples (Any23) as a library to extract triples from the various formats, and the ArcFileReader from the Common Crawl codebase.
The extraction statistics list all URLs on which we found structured data, and how many triples each format produced. In total, our 1% subset contained 1.3 million URLs with embedded structured data, which were parsed into 37.7 million RDF triples. The following table lists the number of URLs and the number of triples for each format:
| Format | URLs | Triples |
|---|---|---|
| html-mf-hcard | 624929 | 19738448 |
| html-mf-xfn | 491187 | 10168979 |
| html-rdfa | 158184 | 1047250 |
| html-mf-hcalendar | 13499 | 446179 |
| html-mf-hreview | 8017 | 251992 |
| html-mf-geo | 17452 | 56819 |
| html-mf-hlisting | 456 | 40816 |
| html-microdata | 1298 | 17890 |
| html-mf-hresume | 96 | 1437 |
| html-mf-hrecipe | 262 | 1401 |
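
The figures in the table above can also be read as a density, that is, the average number of triples extracted per URL for each format. The following is a minimal sketch of that arithmetic; the names `STATS` and `triples_per_url` are illustrative and not part of the extraction framework.

```python
# (format, URLs, triples) copied from the table above.
STATS = [
    ("html-mf-hcard", 624929, 19738448),
    ("html-mf-xfn", 491187, 10168979),
    ("html-rdfa", 158184, 1047250),
    ("html-mf-hcalendar", 13499, 446179),
    ("html-mf-hreview", 8017, 251992),
    ("html-mf-geo", 17452, 56819),
    ("html-mf-hlisting", 456, 40816),
    ("html-microdata", 1298, 17890),
    ("html-mf-hresume", 96, 1437),
    ("html-mf-hrecipe", 262, 1401),
]


def triples_per_url(stats):
    """Return a dict mapping each format to its average triples per URL."""
    return {fmt: triples / urls for fmt, urls, triples in stats}


if __name__ == "__main__":
    density = triples_per_url(STATS)
    # Print formats from densest to sparsest.
    for fmt in sorted(density, key=density.get, reverse=True):
        print(f"{fmt:20s} {density[fmt]:8.1f}")
```

Sorting by density rather than by raw triple count shows which formats yield the most data per page, independent of how widely each format is deployed.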
The extracted structured data is given in the N-Quads RDF encoding and split according to the format the data was encoded in.
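
As an illustration of how such N-Quads files could be consumed, the sketch below splits each line into subject, predicate, object and graph, and counts quads per graph. It is a simplified reader, not a complete N-Quads parser. It assumes one quad per line and that the fourth element identifies the page the triple was extracted from. The sample lines, including the `example.org` URLs, are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# One RDF term: an IRI, a blank node, or a literal with an optional
# language tag or datatype. Simplified with respect to the N-Quads grammar.
_TERM = r'(<[^>]*>|_:[^\s]+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
# Four terms separated by whitespace, terminated by a dot.
_QUAD = re.compile(r'^\s*' + r'\s+'.join([_TERM] * 4) + r'\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line is not a quad."""
    match = _QUAD.match(line)
    return match.groups() if match else None


def quads_per_graph(lines):
    """Count the quads that share the same fourth element (graph)."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            counts[quad[3]] += 1
    return counts


# Hypothetical sample data, for illustration only.
SAMPLE = [
    '_:n1 <http://www.w3.org/2006/vcard/ns#fn> "Jane Doe" <http://example.org/a> .',
    '_:n1 <http://www.w3.org/2006/vcard/ns#note> "Hallo"@de <http://example.org/a> .',
    '<http://example.org/b#me> <http://xmlns.com/foaf/0.1/knows> _:n2 <http://example.org/b> .',
    '# a comment line, skipped by the reader',
]

if __name__ == "__main__":
    for graph, count in quads_per_graph(SAMPLE).items():
        print(graph, count)
```

For real data, a full RDF library is the safer choice, since the regular expression above does not validate IRIs or escape sequences.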