Usage
Download WARC files
Crawled websites in the Common Crawl dataset are stored in the WARC (Web ARChive) format. Common Crawl’s storage is currently divided into 64,700 parts, each containing on average 65,400 WARC records.
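For a sense of what these records look like, here is a minimal sketch that iterates over a local WARC file using the third-party warcio library (an illustrative assumption, not a dependency of this project):

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Iterate over the records of a (gzipped) WARC file and print the URL
# and HTTP status of every crawled response.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            print(status, url)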
Hostnames of universities
The file
vendor/world-universities-csv/world-universities.csv
contains a list of about 9,300 university websites. It is a clone of endSly/world-universities-csv from GitHub.
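To illustrate how hostnames can be read from this file, here is a standard-library sketch; the column layout (country code, university name, website URL, no header row) is an assumption based on endSly/world-universities-csv:

import csv
from urllib.parse import urlparse

hostnames = []
with open("vendor/world-universities-csv/world-universities.csv", newline="") as f:
    for row in csv.reader(f):
        url = row[2]  # assumed column: website URL
        # Rows without a scheme (e.g. "www.example.edu") have an empty netloc.
        host = urlparse(url).netloc or url
        if host:
            hostnames.append(host)

print(len(hostnames), "hostnames, e.g.", hostnames[:3])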
Get WARC file locations in CC-archive for hostnames
To download the WARC records for a given hostname from the CC-archive, their locations in the archive have to be fetched first.
Run:
$ src/index_fetcher.py
This will fetch the locations of all available HTML pages crawled for each host in the world-universities.csv file.
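Conceptually, this step queries the Common Crawl CDX index API once per hostname. The following hand-rolled sketch shows the idea; the crawl ID (CC-MAIN-2023-50) and the exact query parameters are illustrative assumptions, not the actual behaviour of src/index_fetcher.py:

import json

import requests

# CDX index endpoint for one specific crawl (illustrative crawl ID).
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def fetch_locations(hostname):
    """Yield (filename, offset, length) for every record of a hostname."""
    resp = requests.get(
        CDX_API,
        params={"url": hostname + "/*", "output": "json"},
        timeout=60,
    )
    resp.raise_for_status()  # note: the API answers 404 if there are no hits
    for line in resp.text.splitlines():  # one JSON object per line
        entry = json.loads(line)
        yield entry["filename"], int(entry["offset"]), int(entry["length"])

for location in fetch_locations("www.mit.edu"):
    print(location)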
Download WARC files from CC-archive
After fetching the locations, use:
$ src/download_warc.py
to download the WARC records at those locations. Use NUM_PARALLEL_JOBS to adjust how many jobs run in parallel.
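Each location identifies a byte range inside one of the large WARC files, so a single download boils down to an HTTP range request against Common Crawl’s public data endpoint. A minimal sketch of the parallel download step follows; the locations list is a placeholder and the code is not the actual interface of src/download_warc.py:

from concurrent.futures import ThreadPoolExecutor

import requests

NUM_PARALLEL_JOBS = 8  # number of simultaneous downloads
BASE_URL = "https://data.commoncrawl.org/"  # Common Crawl's public HTTP endpoint

def download_record(location):
    """Fetch one gzipped WARC record via an HTTP range request."""
    filename, offset, length = location
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(BASE_URL + filename, headers=headers, timeout=60)
    resp.raise_for_status()
    return resp.content  # one complete, individually gzipped WARC record

# Placeholder input: (filename, offset, length) tuples from the index step.
locations = [
    ("crawl-data/CC-MAIN-2023-50/segments/.../example.warc.gz", 1234, 5678),
]

with ThreadPoolExecutor(max_workers=NUM_PARALLEL_JOBS) as pool:
    for record in pool.map(download_record, locations):
        print(len(record), "bytes downloaded")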