.. highlight:: sh

.. _sec-common_crawl:

Common Crawl
============

Crawled Web sites in the Common Crawl dataset are stored in the `WARC (Web ARChive) `_ format.
The Common Crawl archive `used in this project `_ is divided into 66,500 parts, each containing
on average more than 46,000 WARC records, which amounts to more than 3.07 billion Web pages in total.

URL Index
---------

To avoid having to download and search through all of the crawled data, Common Crawl
`provides a URL index `_ for the crawled Web pages. For example, if we look up
``http://www.uni-freiburg.de/`` we get:

.. code-block:: json
    :caption: Example of a Common Crawl index entry.
    :name: cc-index-entry

    {
        "digest": "VJOEXKZAJQ56LKVB7ZP3WHQJ7YWU7EIX",
        "filename": "crawl-data/CC-MAIN-2017-13/segments/1490218187945.85/warc/CC-MAIN-20170322212947-00421-ip-10-233-31-227.ec2.internal.warc.gz",
        "length": "17128",
        "mime": "text/html",
        "offset": "726450631",
        "status": "200",
        "timestamp": "20170324120615",
        "url": "http://www.uni-freiburg.de/",
        "urlkey": "de,uni-freiburg)/"
    }

Now we can use ``filename``, ``offset`` and ``length`` to download only this captured page from
the archive, for example using ``curl``:

.. code-block:: bash
    :caption: Download a single WARC entry from the Common Crawl archive.
    :name: cc-download-using-curl

    # values taken from the index entry above; the archive files are served
    # from Common Crawl's public S3 bucket
    offset=726450631
    length=17128
    file="crawl-data/CC-MAIN-2017-13/segments/1490218187945.85/warc/CC-MAIN-20170322212947-00421-ip-10-233-31-227.ec2.internal.warc.gz"

    curl -s -r "${offset}-$((offset + length - 1))" \
        "https://commoncrawl.s3.amazonaws.com/${file}" >> "uni-freiburg.warc.gz"

Common Crawl even provides a `server with an API to retrieve the WARC file locations
<http://index.commoncrawl.org/>`_.

Fetch university WARC file locations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

`endSly/world-universities-csv <https://github.com/endSly/world-universities-csv>`_ provides a
list of currently 9,363 university Web sites from all over the world. The `index_fetcher.py `_
script is a tweaked version of `ikreymer/cdx-index-client <https://github.com/ikreymer/cdx-index-client>`_
that fetches the WARC file locations for the domains listed in the `world-universities.csv `_ file.

Download the WARC files
-----------------------

After the locations of the WARC files in the Common Crawl archive have been fetched,
`src/download_warc.py `_ downloads the WARC files and stores them compressed. Use
``NUM_PARALLEL_JOBS`` to adjust how many download jobs run in parallel.

Coverage in Common Crawl
------------------------

During the project the question arose how many of the personal Web pages are contained in the
Common Crawl archive. In general, each crawl overlaps to some degree with its predecessors.
For example, in the April 2017 archive 56% of the crawled URLs overlap with the March 2017
crawl (`as stated in the April 2017 crawl announcement `_).

To evaluate the coverage for a specific host, we collected `342 URLs of scientists' personal
Web pages `_ on the `uni-freiburg.de <http://www.uni-freiburg.de/>`_ domain and checked whether
they are present in the different crawls. Combined over all indexes (Summer 2013 to June 2017),
46.7% of the sample sites were covered. :ref:`Figure 8 <figure-8>` shows the matches for the
164 out of 342 samples with at least one match in a crawl archive. :ref:`Figure 9 <figure-9>`
shows the total coverage for each single crawl and the coverage for the union of each crawl
with its predecessors.

.. _figure-8:

.. figure:: images/coverage_matrix.png
    :align: center

    Figure 8: Coverage matrix for the 164 samples with at least one match in a crawl.

.. _figure-9:

.. figure:: images/coverage_plot.png
    :align: center

    Figure 9: Plots of sample URL coverage: total coverage for single crawl archives and
    coverage for the union of each crawl archive and its predecessors.

To reproduce the results, or to run the evaluation for another set of URLs, you can use
`evaluate_common_crawl_coverage.py `_ to evaluate the coverage for a list of URLs.
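The URL index lookups used throughout this section can also be scripted directly against the
public CDX API at ``index.commoncrawl.org``. The following is a minimal sketch, assuming the
``requests`` package and the ``CC-MAIN-2017-13`` index from the example entry above; it is not
the project's ``index_fetcher.py``.

.. code-block:: python
    :caption: Sketch: look up a URL in the Common Crawl URL index.
    :name: cc-index-lookup-sketch

    import json

    import requests

    # Public CDX index endpoint for the crawl used in the example above.
    INDEX_API = "http://index.commoncrawl.org/CC-MAIN-2017-13-index"

    def lookup(url):
        """Return all index entries (as dicts) for the given URL."""
        response = requests.get(INDEX_API, params={"url": url, "output": "json"})
        if response.status_code == 404:   # no capture of this URL in the crawl
            return []
        response.raise_for_status()
        # The server answers with one JSON object per line.
        return [json.loads(line) for line in response.text.splitlines()]

    for entry in lookup("http://www.uni-freiburg.de/"):
        print(entry["filename"], entry["offset"], entry["length"])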
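Building on such index entries, the range download shown with ``curl`` can be run for many
records in parallel. This sketch only illustrates how ``NUM_PARALLEL_JOBS`` and HTTP range
requests could be combined; the entry list, the file naming and the use of ``requests`` are
assumptions for the example, not the actual ``src/download_warc.py``.

.. code-block:: python
    :caption: Sketch: download WARC records in parallel via HTTP range requests.
    :name: cc-parallel-download-sketch

    from concurrent.futures import ThreadPoolExecutor

    import requests

    NUM_PARALLEL_JOBS = 4
    BASE_URL = "https://commoncrawl.s3.amazonaws.com/"

    # Index entries as returned by the URL index (here only the example entry).
    entries = [{
        "filename": "crawl-data/CC-MAIN-2017-13/segments/1490218187945.85/warc/"
                    "CC-MAIN-20170322212947-00421-ip-10-233-31-227.ec2.internal.warc.gz",
        "offset": "726450631",
        "length": "17128",
    }]

    def download(entry, out_path):
        """Fetch one WARC record by byte range and keep it gzip-compressed on disk."""
        start = int(entry["offset"])
        end = start + int(entry["length"]) - 1
        response = requests.get(BASE_URL + entry["filename"],
                                headers={"Range": "bytes={}-{}".format(start, end)})
        response.raise_for_status()
        with open(out_path, "wb") as out_file:
            out_file.write(response.content)

    with ThreadPoolExecutor(max_workers=NUM_PARALLEL_JOBS) as pool:
        futures = [pool.submit(download, entry, "record_{:05d}.warc.gz".format(i))
                   for i, entry in enumerate(entries)]
        for future in futures:
            future.result()   # surface any download errors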
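Finally, a coverage check similar to the one above can be approximated by asking each crawl's
index whether it holds at least one capture of a URL. The crawl IDs, the input file and the
helper function below are hypothetical and not taken from ``evaluate_common_crawl_coverage.py``.

.. code-block:: python
    :caption: Sketch: check coverage of a URL list across several crawls.
    :name: cc-coverage-check-sketch

    import requests

    CDX_API = "http://index.commoncrawl.org/{}-index"

    # Example crawl IDs; extend this list to cover all crawls of interest.
    CRAWLS = ["CC-MAIN-2013-20", "CC-MAIN-2017-13"]

    def is_covered(url, crawl):
        """Return True if the index of `crawl` has at least one capture of `url`."""
        response = requests.get(CDX_API.format(crawl),
                                params={"url": url, "output": "json", "limit": 1})
        return response.status_code == 200 and bool(response.text.strip())

    # Hypothetical input file with one sample URL per line.
    with open("sample_urls.txt") as url_file:
        sample_urls = [line.strip() for line in url_file if line.strip()]

    for crawl in CRAWLS:
        covered = sum(is_covered(url, crawl) for url in sample_urls)
        print("{}: {}/{} sample URLs covered".format(crawl, covered, len(sample_urls)))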