**Extract and Analyze Scientists' Homepages Using Common Crawl**
====================================================================

This page provides information and documentation about a bachelor project (6 ECTS) carried out in the summer semester of 2017 at the
`Chair for Algorithms and Data Structures <http://ad.informatik.uni-freiburg.de/>`_,
`Department of Computer Science <https://www.tf.uni-freiburg.de/>`_,
`University of Freiburg <http://www.uni-freiburg.de/>`_, headed by
`Hannah Bast <http://ad.informatik.uni-freiburg.de/staff/bast>`_.

Project description
-------------------
The goal of this project is to use the open web crawl archive of `Common Crawl <http://commoncrawl.org/>`_ to
find scientists' personal web pages, and then to extract structured data from these pages, such as each
scientist's name, profession, affiliation, and gender.
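
The actual pipeline is documented in the sections linked below; as a first orientation, here is a minimal
sketch of how a single page can be looked up and retrieved from a Common Crawl archive. It assumes the
``requests`` and ``warcio`` Python packages; the crawl ID, example URL, and download endpoint are
illustrative placeholders, not necessarily what this project used.

.. code-block:: python

    import io
    import json

    import requests
    from warcio.archiveiterator import ArchiveIterator

    CRAWL_ID = "CC-MAIN-2017-26"  # placeholder: any available monthly crawl
    url = "http://ad.informatik.uni-freiburg.de/staff/bast"

    # 1. Ask the CDX index server where the page is stored inside the crawl.
    resp = requests.get(
        f"http://index.commoncrawl.org/{CRAWL_ID}-index",
        params={"url": url, "output": "json"},
    )
    record = json.loads(resp.text.splitlines()[0])

    # 2. Fetch only the bytes of that single WARC record via an HTTP range
    #    request (endpoint may differ; data.commoncrawl.org also serves these).
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    warc_bytes = requests.get(
        "https://commoncrawl.s3.amazonaws.com/" + record["filename"],
        headers={"Range": f"bytes={start}-{end}"},
    ).content

    # 3. Parse the (gzipped) WARC record and print the raw HTML of the page.
    for rec in ArchiveIterator(io.BytesIO(warc_bytes)):
        if rec.rec_type == "response":
            print(rec.content_stream().read().decode("utf-8", errors="replace"))

Bulk processing over an entire crawl works on whole WARC files rather than single records; the
:ref:`sec-experiments` section describes the workflow actually used in this project.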

What is covered on this project page?
-------------------------------------
This documentation describes the approach, the results, and the code produced in the project. It should
enable you to reproduce the results, use them as a starting point for your own work, or simply get an
idea of how certain parts work.

For an overview, read the :ref:`sec-experiments` section, which summarizes all steps and results of this
project in a nutshell and links to more detailed documentation for each step.

Contents
--------

.. toctree::
    :maxdepth: 3

    experiments
    common_crawl
    software_requirements