The Input File Analyser extracts features that characterise the structure and the data formats of each column in the input dataset. It essentially converts the input file into a set of scores, which are then passed to the Classifier to find a suitable application configuration for the given file.
The Analyser proceeds in the following steps:
- Column separator detection
- File structure validation
- Column parsing
- Item index generation
- Column-based feature determination
- Item-based feature determination
- Item preprocessing
- Item characterisation
- Column score calculation
- Subitem separator detection
- File property summary
For a detailed explanation of every step, see Chapter 2 of the thesis.
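The following minimal Python sketch illustrates how the first few of these steps could be wired together. The function names, the majority-vote structure check, and the separator-scoring heuristic are illustrative assumptions, not the implementation described in the thesis.

```python
from collections import Counter

# Candidate separators, mirroring the default set listed under "Usage"
# (illustrative sketch only, not the actual Analyser implementation).
CANDIDATE_SEPARATORS = [",", "\t", ";", ".", "|", ":", "#", "/"]

def detect_separator(lines):
    """Pick the candidate that occurs frequently and consistently on every line (assumed heuristic)."""
    best, best_score = ",", float("-inf")
    for sep in CANDIDATE_SEPARATORS:
        counts = [line.count(sep) for line in lines]
        if not counts or min(counts) == 0:
            continue  # a plausible column separator appears on every line
        score = min(counts) - (max(counts) - min(counts))  # frequent and stable
        if score > best_score:
            best, best_score = sep, score
    return best

def analyse_file(path, sampling_step=1):
    with open(path, newline="") as f:
        lines = [line.rstrip("\n") for line in f][::sampling_step]
    sep = detect_separator(lines)                       # column separator detection
    rows = [line.split(sep) for line in lines]          # column parsing
    expected = Counter(len(r) for r in rows).most_common(1)[0][0]
    rows = [r for r in rows if len(r) == expected]      # crude file structure validation
    columns = list(zip(*rows))                          # item index generation
    # Column-based and item-based feature determination would follow here, per column.
    return {"separator": CANDIDATE_SEPARATORS.index(sep), "column count": len(columns)}
```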
Output
The following table lists all features collected by the Analyser:
Feature | Value Type | Value Range |
---|---|---|
separator | discrete | {0, 1, 2, 3, 4, 5, 6, 7} * |
fill rate | continuous | [0, 1] |
item uniqueness | continuous | [0, 1] |
item length mean | continuous | [0, +∞[ |
item length deviation | continuous | [0, +∞[ |
item word count mean | continuous | [0, +∞[ |
item word count deviation | continuous | [0, +∞[ |
item numeric value | continuous | [0, 1] |
item max integer places | continuous | [0, +∞[ |
item max decimal places | continuous | [0, +∞[ |
item exclusive property score | continuous | [0, 1] |
item exclusive property type | discrete | {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ** |
item letter occurrence | continuous | [0, 1] |
item digit occurrence | continuous | [0, 1] |
item symbol occurrence | continuous | [0, 1] |
item letter/digit ratio mean | continuous | [0, +∞[ |
item letter/digit ratio deviation | continuous | [0, +∞[ |
subitem separator | discrete | {0, 1, 2, 3, 4, 5, 6, 7} * |
list occurrence | continuous | [0, 1] |
subitem count mean | continuous | [0, +∞[ |
subitem count deviation | continuous | [0, +∞[ |
subitem uniqueness | continuous | [0, 1] |
subitem length mean | continuous | [0, +∞[ |
subitem length deviation | continuous | [0, +∞[ |
subitem word count mean | continuous | [0, +∞[ |
subitem word count deviation | continuous | [0, +∞[ |
subitem numeric value | continuous | [0, 1] |
subitem max integer places | continuous | [0, +∞[ |
subitem max decimal places | continuous | [0, +∞[ |
subitem exclusive property score | continuous | [0, 1] |
subitem exclusive property type | discrete | {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ** |
subitem letter occurrence | continuous | [0, 1] |
subitem digit occurrence | continuous | [0, 1] |
subitem symbol occurrence | continuous | [0, 1] |
subitem letter/digit ratio mean | continuous | [0, +∞[ |
subitem letter/digit ratio deviation | continuous | [0, +∞[ |
* Separators: 0: ',', 1: '\t', 2: ';', 3: '.', 4: '|', 5: ':', 6: '#', 7: '/'
** Mutually exclusive properties: 1: Incremental Index, 2: Boolean, 3: Value with unit, 4: Phone Number, 5: Date, 6: Timestamp, 7: Email, 8: URL, 9: JSON, 10: XML
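To make the column-based scores more concrete, the Python sketch below computes a few of them (fill rate, item uniqueness, item length statistics, and item numeric value) for a single parsed column. The formulas are plausible readings of the feature names above and should be treated as assumptions rather than the exact definitions from the thesis.

```python
import re
import statistics

def column_features(items):
    """Compute a handful of the column scores listed above for one parsed column (illustrative)."""
    non_empty = [item.strip() for item in items if item.strip()]
    fill_rate = len(non_empty) / len(items) if items else 0.0
    uniqueness = len(set(non_empty)) / len(non_empty) if non_empty else 0.0
    lengths = [len(item) for item in non_empty] or [0]
    numeric = sum(bool(re.fullmatch(r"[+-]?\d+(\.\d+)?", item)) for item in non_empty)
    return {
        "fill rate": fill_rate,
        "item uniqueness": uniqueness,
        "item length mean": statistics.mean(lengths),
        "item length deviation": statistics.pstdev(lengths),
        "item numeric value": numeric / len(non_empty) if non_empty else 0.0,
    }

# Example: a partially filled numeric column.
print(column_features(["1", "2", "", "2", "10"]))
```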
Usage
AnalyserMain [options] <inputfile> <outputfile>
Available options:
--columnNames: <col1>,<col2>,<col3>,...
  Manually assigns the column names in case they cannot be found in the first line of the input file.
--mergeExclusiveProps
  Combines all mutually exclusive column properties to improve the independence between the retrieved features. This becomes important for the column classification, where the Naive Bayes algorithm is used, which makes strong assumptions about the independence of the training set features.
--samplingStep <step>
  Can be used when a shorter runtime is more important than the highest precision of the feature scores. It is especially useful for large input datasets.
--separators <sep1><sep2><sep3>
  Replaces the default separator set {",", "\t", ";", ".", "|", ":", "#", "/"} with a custom set.
--subitemSeparator <separatorId>
  Analyses the dataset for the given subitem separator, which will then be used in the classification step. This avoids performing unnecessary computations for improper separators.
--jsonOutput
  Sets the output format to JSON. By default, the Analyser returns a CSV file.
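For example, a run that samples a large input file and writes JSON output could look as follows (the file names are placeholders):
AnalyserMain --samplingStep 10 --jsonOutput input_data.csv features.json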