Classifier

We feed the properties that our Analyser collected from the dataset into the Classifier to find suitable parameters for configuring CompleteSearch.
This is essentially a classification problem: for each column of the input file, the Classifier decides which CompleteSearch parameter classes apply.

For further details, see chapter 3 of the thesis.

Output

To configure the CompleteSearch Web Application, the Classifier suggests a value for each of the following parameters, for every column in the initial input file:

Parameter	Value Range
full-text	{true, false}
filter	{true, false}
facets	{true, false}
allow-multiple-items	{true, false}
field-format	{0, 1, 2} *
show	{true, false}
excerpt	{true, false}
ordering	{0, 1, 2} **
url	{true, false}
email	{true, false}
label	{true, false}

* Formats: 0: plain text, 1: JSON, 2: XML
** Ordering: 0: lexicographical, 1: numerical, 2: by date
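
As a rough illustration of the value ranges in the table, the following Python sketch checks one column's suggested parameters against them. The parameter names and ranges are taken from the table; the dictionary layout and the function name `validate_suggestion` are assumptions made for this example and do not describe the Classifier's actual output format.

```python
# Value ranges copied from the table above; this dict layout is an assumption.
BOOLEAN_PARAMETERS = [
    "full-text", "filter", "facets", "allow-multiple-items",
    "show", "excerpt", "url", "email", "label",
]
ENUM_PARAMETERS = {
    "field-format": {0: "plain text", 1: "JSON", 2: "XML"},
    "ordering": {0: "lexicographical", 1: "numerical", 2: "by date"},
}


def validate_suggestion(suggestion):
    """Return a list of problems found in one column's suggested parameters."""
    problems = []
    for name in BOOLEAN_PARAMETERS:
        if not isinstance(suggestion.get(name), bool):
            problems.append(f"{name}: expected true or false")
    for name, allowed in ENUM_PARAMETERS.items():
        value = suggestion.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value not in allowed:
            problems.append(f"{name}: expected one of {sorted(allowed)}")
    return problems


# Hypothetical suggestion for a single column: a shown full-text field
# stored as plain text (0) and ordered numerically (1).
example = {name: False for name in BOOLEAN_PARAMETERS}
example.update({"full-text": True, "show": True, "field-format": 0, "ordering": 1})
print(validate_suggestion(example))  # prints []
```

A check like this could be run on each column before the suggestions are passed on, so that out-of-range values are caught early.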

Usage

Usage: ClassifierMain [mode] [parameters]

Available modes:

--classify <inputFile> classifies a given dataset into the different parameter classes. The input file is not the actual dataset but the JSON output file returned by the Analyser, which contains the dataset's features.
--train trains the classifier by performing all steps that can be computed in advance and saving the training data.
--benchmark <configuration> splits off a part of the training set into a test set, trains the classifier on the reduced training set and evaluates the classification results of the test set. Possible configurations: default, no-augmentation, no-prop-merge, no-sep-predetermination
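
The benchmark procedure described above (split off a test set, train on the rest, evaluate on the test set) can be sketched in Python as a simple hold-out split. This is only an illustration: the split ratio, the fixed seed, the accuracy metric, and the names `holdout_split` and `accuracy` are assumptions, since the text does not state how the Classifier splits or scores its data.

```python
import random


def holdout_split(dataset_names, test_fraction=0.2, seed=0):
    """Shuffle dataset names and split them into (training, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(dataset_names)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


def accuracy(predicted, expected):
    """Fraction of positions where the predicted value equals the label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot score an empty test set")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Hypothetical training set of ten datasets, two of which are held out.
names = [f"dataset_{i}" for i in range(10)]
training, test = holdout_split(names)
```

The classifier would then be trained on `training` only, and its suggestions for the datasets in `test` compared against their label files with a metric such as `accuracy`.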

Parameters:

--props <datasetPropDirectory>
Path to the directory containing the dataset property files for the input datasets in our training set. This parameter is required for training and benchmarking.
--labels <datasetLabelDirectory>
Path to the directory containing the dataset label files for the input datasets in our training set. This parameter is required for training and benchmarking.
--cache <trainingDataCacheDirectory>
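
The rule that --props and --labels are required for training and benchmarking can be checked before any work starts. The Python sketch below uses only the modes and flags listed above; the function `missing_parameters` and the dictionary of parsed parameters are hypothetical and do not reflect how ClassifierMain actually parses its arguments.

```python
MODES = {"--classify", "--train", "--benchmark"}
# Modes that read the training set and therefore need --props and --labels.
MODES_NEEDING_TRAINING_SET = {"--train", "--benchmark"}


def missing_parameters(mode, parameters):
    """Return the required parameters that are absent for the given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode not in MODES_NEEDING_TRAINING_SET:
        return []
    return [flag for flag in ("--props", "--labels") if flag not in parameters]


# Hypothetical call: training was requested with only a property directory.
print(missing_parameters("--train", {"--props": "props/"}))  # prints ['--labels']
```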
