Complete Search UI

The Input File Analyser extracts features that characterise the structure and the data format of each column in the input dataset. In essence, it converts the input file into a set of scores, which are then passed to the Classifier to find a suitable application configuration for the given file.

The Analyser proceeds in the following steps:

  1. Column separator detection
  2. File structure validation
  3. Column parsing
    • Item index generation
    • Column-based feature determination
    • Item-based feature determination
      • Item preprocessing
      • Item characterisation
      • Column score calculation
    • Subitem separator detection
  4. File property summary

For a detailed explanation of each step, see chapter 2 of the thesis.
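
As an illustration of step 1, column separator detection can be sketched as follows. This is a minimal sketch and not the Analyser's actual algorithm: the function name `detect_separator` and the scoring heuristic (prefer the candidate whose per-line count is most consistent) are assumptions made for this example. Only the list of candidate separators and their codes is taken from the Output section.

```python
from collections import Counter

# Candidate separators; the list index is the separator code
# used in the Output section of this document.
SEPARATORS = [",", "\t", ";", ".", "|", ":", "#", "/"]


def detect_separator(lines):
    """Return the code of the most consistent candidate separator.

    Heuristic (an assumption, not the documented method): for each
    candidate, find the most frequent per-line occurrence count. Skip
    candidates whose most frequent count is zero. Prefer the candidate
    whose most frequent count covers the largest share of lines, and
    break ties by the higher count.
    """
    lines = [line for line in lines if line.strip()]
    if not lines:
        return None
    best_code, best_key = None, None
    for code, sep in enumerate(SEPARATORS):
        counts = Counter(line.count(sep) for line in lines)
        modal_count, frequency = counts.most_common(1)[0]
        if modal_count == 0:
            continue
        key = (frequency / len(lines), modal_count)
        if best_key is None or key > best_key:
            best_code, best_key = code, key
    return best_code


sample = ["id\tname\tprice", "1\tApple, red\t0.5", "2\tPear\t1.25"]
print(detect_separator(sample))  # code 1, i.e. the tab character
```

In the sample, the tab occurs exactly twice in every line, whereas the comma and the dot appear only in some lines, so the tab wins even though other candidates are present in the data.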

Output

The following table lists all features collected by the Analyser:

| Feature | Value Type | Value Range |
|---|---|---|
| separator | discrete | {0, 1, 2, 3, 4, 5, 6, 7} * |
| fill rate | continuous | [0, 1] |
| item uniqueness | continuous | [0, 1] |
| item length mean | continuous | [0, +∞[ |
| item length deviation | continuous | [0, +∞[ |
| item word count mean | continuous | [0, +∞[ |
| item word count deviation | continuous | [0, +∞[ |
| item numeric value | continuous | [0, 1] |
| item max integer places | continuous | [0, +∞[ |
| item max decimal place | continuous | [0, +∞[ |
| item exclusive property score | continuous | [0, 1] |
| item exclusive property type | discrete | {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ** |
| item letter occurrence | continuous | [0, 1] |
| item digit occurrence | continuous | [0, 1] |
| item symbol occurrence | continuous | [0, 1] |
| item letter/digit ratio mean | continuous | [0, +∞[ |
| item letter/digit ratio deviation | continuous | [0, +∞[ |
| subitem separator | discrete | {0, 1, 2, 3, 4, 5, 6, 7} * |
| list occurrence | continuous | [0, 1] |
| subitem count mean | continuous | [0, +∞[ |
| subitem count deviation | continuous | [0, +∞[ |
| subitem uniqueness | continuous | [0, 1] |
| subitem length mean | continuous | [0, +∞[ |
| subitem length deviation | continuous | [0, +∞[ |
| subitem word count mean | continuous | [0, +∞[ |
| subitem word count deviation | continuous | [0, +∞[ |
| subitem numeric value | continuous | [0, 1] |
| subitem max integer places | continuous | [0, +∞[ |
| subitem max decimal place | continuous | [0, +∞[ |
| subitem exclusive property score | continuous | [0, 1] |
| subitem exclusive property type | discrete | {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ** |
| subitem letter occurrence | continuous | [0, 1] |
| subitem digit occurrence | continuous | [0, 1] |
| subitem symbol occurrence | continuous | [0, 1] |
| subitem letter/digit ratio mean | continuous | [0, +∞[ |
| subitem letter/digit ratio deviation | continuous | [0, +∞[ |

* Separators: 0: ‘,’, 1: ‘\t’, 2: ‘;’, 3: ‘.’, 4: ‘|’, 5: ‘:’, 6: ‘#’, 7: ‘/’
** Mutually exclusive properties: 1: Incremental Index, 2: Boolean, 3: Value with unit, 4: Phone Number, 5: Date, 6: Timestamp, 7: Email, 8: URL, 9: JSON, 10: XML
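
To give an idea of how some of the features above could be derived for a single column, here is a minimal sketch. The definitions used are assumptions for illustration, and the thesis may define them differently: fill rate is taken as the share of non-empty items, item uniqueness as the share of distinct values among the non-empty items, and the deviations as population standard deviations. The function name `column_features` is hypothetical.

```python
import statistics


def column_features(items):
    """Compute a few item-based features for one column.

    `items` is the list of raw cell strings of that column. All
    definitions below are assumptions made for this sketch.
    """
    total = len(items)
    # Treat cells that are empty or whitespace-only as unfilled.
    filled = [item.strip() for item in items if item.strip()]
    if not filled:
        return {"fill rate": 0.0}
    lengths = [len(item) for item in filled]
    word_counts = [len(item.split()) for item in filled]
    return {
        "fill rate": len(filled) / total,
        "item uniqueness": len(set(filled)) / len(filled),
        "item length mean": statistics.fmean(lengths),
        "item length deviation": statistics.pstdev(lengths),
        "item word count mean": statistics.fmean(word_counts),
        "item word count deviation": statistics.pstdev(word_counts),
    }


features = column_features(["red apple", "pear", "", "pear"])
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, the fill rate and uniqueness values fall in [0, 1], while the means and deviations are unbounded above, which matches the value ranges given in the table.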

Usage

AnalyserMain [options] <inputfile> <outputfile>

Available options: