.. highlight:: sh

Usage
=====

Download WARC files
-------------------

Crawled websites in the Common Crawl dataset are stored in the `WARC (Web ARChive) <https://en.wikipedia.org/wiki/Web_ARChive>`_ format.
Common Crawl's storage is `currently <http://commoncrawl.org/2017/05/april-2017-crawl-archive-now-available/>`_ divided into 64,700 parts,
each containing on average 65,400 WARC records.
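
For orientation, each part is a compressed WARC file consisting of individual records, one per crawled URL.
A minimal sketch of how such a file can be inspected locally, using the third-party ``warcio`` package
(this project does not necessarily use it):

.. code-block:: python

    from warcio.archiveiterator import ArchiveIterator

    # Print the target URL of every HTTP response record in a WARC file.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))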

Hostnames of universities
^^^^^^^^^^^^^^^^^^^^^^^^^

The file::

	vendor/world-universities-csv/world-universities.csv

contains a list of about 9,300 university websites. It is a clone of
`endSly/world-universities-csv <https://github.com/endSly/world-universities-csv>`_ from GitHub.
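
The hostnames can be extracted from this CSV with Python's standard ``csv`` module.
A minimal sketch, assuming each row holds a country code, the university name and the website URL:

.. code-block:: python

    import csv
    from urllib.parse import urlparse

    # Collect the hostname of every university website listed in the CSV.
    hostnames = []
    with open("vendor/world-universities-csv/world-universities.csv", newline="") as f:
        for country_code, name, url in csv.reader(f):
            hostnames.append(urlparse(url).netloc)

    print(len(hostnames), "hostnames, for example:", hostnames[:3])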

Get WARC file locations in CC-archive for hostnames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To be able to download the WARC files for a given hostname from the CC-archive, one first has to determine their locations within the archive.

Run::

	$ src/index_fetcher.py

This fetches the archive locations of all crawled HTML pages for each host in the ``world-universities.csv`` file.
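
Conceptually, getting these locations amounts to querying the Common Crawl index server for each hostname.
A minimal sketch of such a query, assuming the ``requests`` package and the ``CC-MAIN-2017-13`` collection
(the actual approach of ``index_fetcher.py`` may differ):

.. code-block:: python

    import json

    import requests

    host = "www.uda.ad"
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2017-13-index",
        params={"url": host + "/*", "output": "json"},
    )
    resp.raise_for_status()

    # Every line is one JSON record telling in which WARC file, at which
    # byte offset and with which length a crawled page is stored.
    for line in resp.text.splitlines():
        entry = json.loads(line)
        print(entry["filename"], entry["offset"], entry["length"])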

Download WARC files from CC-archive
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After fetching the locations, use::

	$ src/download_warc.py

to download the WARC files from the fetched locations. Set ``NUM_PARALLEL_JOBS`` to adjust how many downloads run in parallel.
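
Each location identifies a byte range inside a large ``.warc.gz`` file on Common Crawl's servers, so a single
record can be fetched with an HTTP ``Range`` request, and several records can be fetched in parallel.
A minimal sketch of this technique, assuming the ``requests`` package and the ``data.commoncrawl.org``
download host; ``download_warc.py`` itself may work differently:

.. code-block:: python

    import gzip
    import os
    from concurrent.futures import ThreadPoolExecutor

    import requests

    NUM_PARALLEL_JOBS = int(os.environ.get("NUM_PARALLEL_JOBS", "4"))

    def fetch_record(entry):
        # "entry" is one index record with "filename", "offset" and "length".
        start = int(entry["offset"])
        end = start + int(entry["length"]) - 1
        resp = requests.get(
            "https://data.commoncrawl.org/" + entry["filename"],
            headers={"Range": "bytes={}-{}".format(start, end)},
        )
        resp.raise_for_status()
        # The returned byte range is a self-contained gzip member holding
        # exactly one WARC record.
        return gzip.decompress(resp.content)

    # "entries" would be the index records fetched in the previous step.
    entries = []
    with ThreadPoolExecutor(max_workers=NUM_PARALLEL_JOBS) as pool:
        records = list(pool.map(fetch_record, entries))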

Generate training set
---------------------