The Input File Analyser extracts features that characterise the structure and the data formats of each column in the input dataset. It essentially converts the input file into a set of scores, which are then passed to the Classifier to find a suitable application configuration for the given file.
The Analyser proceeds in the following steps:
- Column separator detection
- File structure validation
- Column parsing
- Item index generation
- Column-based feature determination
- Item-based feature determination
- Item preprocessing
- Item characterisation
- Column score calculation
- Subitem separator detection
- File property summary
For a detailed explanation of every step, see Chapter 2 of the thesis.
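The following minimal Python sketch illustrates how the first few of these steps could be wired together. The function names, the majority-vote structure check, and the separator-scoring heuristic are illustrative assumptions, not the implementation described in the thesis.

```python
from collections import Counter

# Candidate separators, mirroring the default set listed under "Usage"
# (illustrative sketch only, not the actual Analyser implementation).
CANDIDATE_SEPARATORS = [",", "\t", ";", ".", "|", ":", "#", "/"]

def detect_separator(lines):
    """Pick the candidate that occurs frequently and consistently on every line (assumed heuristic)."""
    best, best_score = ",", float("-inf")
    for sep in CANDIDATE_SEPARATORS:
        counts = [line.count(sep) for line in lines]
        if not counts or min(counts) == 0:
            continue  # a plausible column separator appears on every line
        score = min(counts) - (max(counts) - min(counts))  # frequent and stable
        if score > best_score:
            best, best_score = sep, score
    return best

def analyse_file(path, sampling_step=1):
    with open(path, newline="") as f:
        lines = [line.rstrip("\n") for line in f][::sampling_step]
    sep = detect_separator(lines)                       # column separator detection
    rows = [line.split(sep) for line in lines]          # column parsing
    expected = Counter(len(r) for r in rows).most_common(1)[0][0]
    rows = [r for r in rows if len(r) == expected]      # crude file structure validation
    columns = list(zip(*rows))                          # item index generation
    # Column-based and item-based feature determination would follow here, per column.
    return {"separator": CANDIDATE_SEPARATORS.index(sep), "column count": len(columns)}
```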
Output
The following table lists all features collected by the Analyser:
Feature | Value Type | Value Range |
---|---|---|
separator | discrete | {0, 1, 2, 3, 4, 5, 6, 7} * |
fill rate | continuous | [0, 1] |
item uniqueness | continuous | [0, 1] |
item length mean | continuous | [0, +∞[ |
item length deviation | continuous | [0, +∞[ |
item word count mean | continuous | [0, +∞[ |
item word count deviation | continuous | [0, +∞[ |
item numeric value | continuous | [0, 1] |
item max integer places | continuous | [0, +∞[ |
item max decimal places | continuous | [0, +∞[ |
item exclusive property score | continuous | [0, 1] |
item exclusive property type | discrete | {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ** |
item letter occurrence | continuous | [0, 1] |
item digit occurrence | continuous | [0, 1] |
item symbol occurrence | continuous | [0, 1] |
item letter/digit ratio mean | continuous | [0, +∞[ |
item letter/digit ratio deviation | continuous | [0, +∞[ |
subitem separator | discrete | {0, 1, 2, 3, 4, 5, 6, 7} * |
list occurrence | continuous | [0, 1] |
subitem count mean | continuous | [0, +∞[ |
subitem count deviation | continuous | [0, +∞[ |
subitem uniqueness | continuous | [0, 1] |
subitem length mean | continuous | [0, +∞[ |
subitem length deviation | continuous | [0, +∞[ |
subitem word count mean | continuous | [0, +∞[ |
subitem word count deviation | continuous | [0, +∞[ |
subitem numeric value | continuous | [0, 1] |
subitem max integer places | continuous | [0, +∞[ |
subitem max decimal places | continuous | [0, +∞[ |
subitem exclusive property score | continuous | [0, 1] |
subitem exclusive property type | discrete | {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ** |
subitem letter occurrence | continuous | [0, 1] |
subitem digit occurrence | continuous | [0, 1] |
subitem symbol occurrence | continuous | [0, 1] |
subitem letter/digit ratio mean | continuous | [0, +∞[ |
subitem letter/digit ratio deviation | continuous | [0, +∞[ |
* Separators: 0: ',', 1: '\t', 2: ';', 3: '.', 4: '|', 5: ':', 6: '#', 7: '/'
** Mutually exclusive properties: 1: Incremental Index, 2: Boolean, 3: Value with unit, 4: Phone Number, 5: Date, 6: Timestamp, 7: Email, 8: URL, 9: JSON, 10: XML
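To make the column-based scores more concrete, the Python sketch below computes a few of them (fill rate, item uniqueness, item length statistics, and item numeric value) for a single parsed column. The formulas are plausible readings of the feature names above and should be treated as assumptions rather than the exact definitions from the thesis.

```python
import re
import statistics

def column_features(items):
    """Compute a handful of the column scores listed above for one parsed column (illustrative)."""
    non_empty = [item.strip() for item in items if item.strip()]
    fill_rate = len(non_empty) / len(items) if items else 0.0
    uniqueness = len(set(non_empty)) / len(non_empty) if non_empty else 0.0
    lengths = [len(item) for item in non_empty] or [0]
    numeric = sum(bool(re.fullmatch(r"[+-]?\d+(\.\d+)?", item)) for item in non_empty)
    return {
        "fill rate": fill_rate,
        "item uniqueness": uniqueness,
        "item length mean": statistics.mean(lengths),
        "item length deviation": statistics.pstdev(lengths),
        "item numeric value": numeric / len(non_empty) if non_empty else 0.0,
    }

# Example: a partially filled numeric column.
print(column_features(["1", "2", "", "2", "10"]))
```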
Usage
AnalyserMain [options] <inputfile> <outputfile>
Available options:
--columnNames: <col1>,<col2>,<col3>,...
  Manually assigns the column names in case they cannot be found in the first line of the input file.
--mergeExclusiveProps
  Combines all mutually exclusive column properties to improve the independence between the retrieved features. This becomes important for the column classification, where the Naive Bayes algorithm is used, which makes strong assumptions about the independence of the training set features.
--samplingStep <step>
  Can be used when a shorter runtime is more important than the highest precision of the feature scores. It is especially useful for large input datasets.
--separators <sep1><sep2><sep3>
  Replaces the default separator set {",", "\t", ";", ".", "|", ":", "#", "/"} with a custom set.
--subitemSeparator <separatorId>
  Analyses the dataset for the given subitem separator, which will then be used in the classification step. This avoids performing unnecessary computations for improper separators.
--jsonOutput
  Sets the output format to JSON. By default, the Analyser returns a CSV file.
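For example, a run that samples a large input file and writes JSON output could look as follows (the file names are placeholders):
AnalyserMain --samplingStep 10 --jsonOutput input_data.csv features.json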