###############################################################################
TRIPLE SCORES FOR TRIPLE STORES
Reproducibility material for the 2015 VLDB paper by Hannah Bast, Bjoern Buchhold
and Elmar Haussmann, University of Freiburg
###############################################################################

### REQUIREMENTS

 * bash
 * python (>=2.7) with numpy (>=1.8); scikit-learn (>=0.14.1) is needed only
   to recompute some results from scratch, see below
 * gnu make


### PRINTING RESULTS

To produce the major result tables from our paper, use one of the following
commands:

* make print-score-result-table
* make print-rank-result-table
* make print-nationality-results-table


### RESULT FILES

For each of our approaches there is a result file of the same name, which can
be inspected to see the individual judgments made by that approach. The files
are:

  * first
  * random
  * prefixes
  * llda
  * words_regression
  * words_counting
  * words_mle
  * counting_combined
  * mle_combined

The format of these files is simple. Here is an excerpt from the file llda:

:e:Jesus_Christ	66638	Preacher	1.0	5.0	0.012183092171
:e:Jesus_Christ	66638	Prophet	7.0	6.0	0.574005830593
:e:Jesus_Christ	66638	Carpenter	6.0	2.0	0.412908247109

Columns are TAB-separated and appear in this order (see the parsing sketch
after this list):
  * the entity name (prefixed with :e:, spaces replaced by _)
  * a popularity measure (the number of times the entity is mentioned in
    Wikipedia)
  * the profession
  * the computed score for this profession (mapped to 0..7)
  * the correct score for this profession (as determined by the crowdsourcing
    task)
  * the original score/probability (not mapped to 0..7)
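
For illustration, here is a minimal Python sketch that reads a result file in
this format. The filename "llda" and the column order are taken from the
description above; the function name and the final aggregate are our own
illustrative choices, not part of the provided scripts:

  import csv

  def read_result_file(path):
      """Yield one judgment per line of a TAB-separated result file."""
      with open(path) as f:
          for row in csv.reader(f, delimiter='\t'):
              entity, mentions, profession, computed, correct, original = row
              yield (entity, int(mentions), profession,
                     float(computed), float(correct), float(original))

  # Example: mean absolute difference between the computed and the correct
  # score (both on the 0..7 scale) over all judgments in the llda file.
  diffs = [abs(computed - correct)
           for _, _, _, computed, correct, _ in read_result_file('llda')]
  print(sum(diffs) / len(diffs))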


### RE-COMPUTING RESULTS

To reproduce an approach from scratch, use the corresponding make target to
clean its result file as well as all intermediate files. You can then either
call "make <approach>" (where <approach> is one of first, random, ...) or call
one of the print commands above, which re-compute previously cleaned results.
E.g., to re-build the "first" results, call "make clean-first" and then
"make first".

Below is a list of the available clean targets, together with what is required
to rebuild what they clean:

IMPORTANT: Set the variable "PYTHON" in the Makefile to a pypy installation
for a speedup of up to a factor of 10 on several tasks.

clean-all:
  Cleans everything below. CAREFUL! Note the requirements, especially for
  words_regression!

clean-words-counting:
  Requires about 1GB of free disk space. Takes about 15 minutes when using
  pypy.

clean-words-mle:
  Requires about 1GB of free disk space. Takes about 2h when using pypy.
  (Careful, with the default python interpreter this may run for a full
  night.) Parts that require the numpy library always use the default python
  interpreter; you can still set the PYTHON variable to pypy for the rest.

clean-combined:
  No special requirements. Takes a few seconds.

clean-prefixes:
  No special requirements. Takes a few seconds.

clean-first:
  No special requirements. Takes a few seconds.

clean-nationality:
  Requires about 1GB of free disk space. Performs (among other steps) MLE and
  counting, and hence takes as long as those two combined.

clean-llda:
  Downloads and compiles the JGibbLDA library
  (https://github.com/myleott/JGibbLabeledLDA). A rebuild requires a lot of
  RAM (around 64GB) and around 2GB of disk space, and takes roughly an hour.

clean-words-regression:
  IMPORTANT: Edit the Makefile and set the variable at its top to a filesystem
  location with sufficient space.
  CAREFUL! This approach is not optimized: it requires ~620GB of free disk
  space and runs for roughly 3 hours. In addition, the Python library
  scikit-learn has to be installed (see REQUIREMENTS above).


### CONTACT

If you have questions, feel free to contact us:
[bast,buchhold,haussmann] at informatik.uni-freiburg.de