.. highlight:: sh

Usage
=====

Download WARC files
-------------------

Crawled websites in the Common Crawl dataset are stored in the WARC (Web ARChive)
format. Common Crawl's storage is currently divided into 64,700 parts, each
containing on average 65,400 WARC records.

Hostnames of universities
^^^^^^^^^^^^^^^^^^^^^^^^^

``vendor/world-universities-csv/world-universities.csv`` contains a list of about
9,300 university websites. It is a clone of
`endSly/world-universities-csv <https://github.com/endSly/world-universities-csv>`_
from GitHub.

Get WARC file locations in CC-archive for hostnames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before the WARC records for a given hostname can be downloaded from the
CC-archive, their locations in the archive have to be looked up. Run::

    $ src/index_fetcher.py

This fetches the locations of all crawled HTML pages for each host in the
world-universities.csv file.

Download WARC files from CC-archive
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After fetching the locations, use::

    $ src/download_warc.py

to download the WARC records at those locations. Use ``NUM_PARALLEL_JOBS`` to
adjust how many download jobs run in parallel.

Generate Training set
---------------------
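
For reference, the two download steps above boil down to querying the Common
Crawl index for a hostname and then fetching each referenced WARC record with an
HTTP range request. The following is a minimal sketch of that workflow; the crawl
ID, endpoint URLs, field handling, and use of ``requests`` are assumptions for
illustration, not the project's actual code.

.. code-block:: python

    import gzip
    import json

    import requests

    # Assumed for illustration: the public Common Crawl index/data endpoints
    # and an example crawl ID.
    CDX_API = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
    DATA_URL = "https://data.commoncrawl.org/{filename}"


    def lookup_records(hostname):
        """Return the index entries (WARC locations) for all captures of a host."""
        params = {"url": f"{hostname}/*", "output": "json"}
        resp = requests.get(CDX_API, params=params, timeout=60)
        if resp.status_code == 404:  # no captures for this host
            return []
        resp.raise_for_status()
        # One JSON object per line: WARC filename plus byte offset and length
        # of the record inside that file.
        return [json.loads(line) for line in resp.text.splitlines()]


    def fetch_warc_record(entry):
        """Download a single gzipped WARC record with an HTTP range request."""
        offset, length = int(entry["offset"]), int(entry["length"])
        headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
        resp = requests.get(DATA_URL.format(filename=entry["filename"]),
                            headers=headers, timeout=60)
        resp.raise_for_status()
        return gzip.decompress(resp.content)  # WARC headers + HTTP response + HTML


    if __name__ == "__main__":
        entries = lookup_records("www.mit.edu")
        if entries:
            print(fetch_warc_record(entries[0])[:300])

``src/download_warc.py`` presumably runs the second step for many records at
once; ``NUM_PARALLEL_JOBS`` then bounds how many such requests run in parallel.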