Automatic recognition of values in Wikipedia articles
Project type: | ESE Project
|
Project name: | Automatic recognition of values in Wikipedia articles
|
Author: | Regina König
|
Supervisor: | Prof. Dr. Hannah Bast, Björn Buchold
|
1. Project description
The goal was to automatically find values in Wikipedia articles and convert them into metric units for the semantic search engine Broccoli. The value-finding component runs as one step in a UIMA pipeline.
2. Explanations
2.1 UIMA Framework
UIMA stands for Unstructured Information Management Architecture. It was developed by IBM in 2005 and has been overseen by Apache since 2006. The concept is to implement a pipeline that reads in unstructured information, runs it through various analysis steps in which the information is marked with specific annotations, and finally delivers the results to consumers, which process the information further.
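The pipeline idea can be illustrated with a minimal sketch in plain Java. It does not use the UIMA API; the names Span, TextDocument, AnalysisStep, NumberStep and SketchPipeline are hypothetical and only mirror the concept of analysis steps that add annotations to a shared document.

```java
import java.util.ArrayList;
import java.util.List;

/** A marked span of text, loosely analogous to an annotation. */
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

/** Holds the text and all annotations added so far. */
final class TextDocument {
    final String text;
    final List<Span> annotations = new ArrayList<>();

    TextDocument(String text) {
        this.text = text;
    }
}

/** One analysis step: reads the document and may add annotations. */
interface AnalysisStep {
    void process(TextDocument doc);
}

/** Hypothetical step that marks every maximal run of digits as a "number". */
final class NumberStep implements AnalysisStep {
    @Override
    public void process(TextDocument doc) {
        String text = doc.text;
        int i = 0;
        while (i < text.length()) {
            if (Character.isDigit(text.charAt(i))) {
                int start = i;
                while (i < text.length() && Character.isDigit(text.charAt(i))) {
                    i++;
                }
                doc.annotations.add(new Span("number", start, i));
            } else {
                i++;
            }
        }
    }
}

/** Runs the steps in order and returns the annotated document. */
final class SketchPipeline {
    private final List<AnalysisStep> steps = new ArrayList<>();

    SketchPipeline add(AnalysisStep step) {
        steps.add(step);
        return this;
    }

    TextDocument run(String text) {
        TextDocument doc = new TextDocument(text);
        for (AnalysisStep step : steps) {
            step.process(doc);
        }
        return doc;
    }
}
```

A consumer would then read doc.annotations after a call such as new SketchPipeline().add(new NumberStep()).run(text). Later steps can build on the annotations that earlier steps have added.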
3. Implementation
3.1 Prework
To assess the frequency and syntax of the various value types, a statistical analysis of a number of Wikipedia articles was carried out. Not only actual units are of interest here, but also words that can indicate a type of value. For example, the word "in" followed by a number indicates that the value is very probably a year ("in 1975").
The most frequent value syntaxes were:
Pattern | Relative frequency (%) | Example |
Unit Value | 15.6 | AD 1980 |
Value Unit | 14.3 | 30 km |
Value% | 13.1 | 7% |
Value without Unit | 5.8 | 1886 |
s after Value | 1.6 | 1980s |
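The five patterns in the table can be expressed as regular expressions. The following is a minimal sketch, not the project's actual code: the class name ValuePatternClassifier and the tiny unit lists are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class ValuePatternClassifier {
    // An integer or decimal number such as 30 or 7.5.
    private static final String NUMBER = "\\d+(?:\\.\\d+)?";
    // Hypothetical, deliberately short unit lists (assumptions, not the project's lists).
    private static final String LEADING_UNITS = "(?:AD|BC)";
    private static final String TRAILING_UNITS = "(?:km|m|kg|miles|ft)";

    private static final Pattern UNIT_VALUE = Pattern.compile(LEADING_UNITS + " " + NUMBER);
    private static final Pattern VALUE_UNIT = Pattern.compile(NUMBER + " " + TRAILING_UNITS);
    private static final Pattern VALUE_PERCENT = Pattern.compile(NUMBER + "%");
    private static final Pattern VALUE_ONLY = Pattern.compile(NUMBER);
    private static final Pattern VALUE_S = Pattern.compile("\\d+s");

    private ValuePatternClassifier() {
    }

    /** Returns the table's pattern name for a phrase, or "unknown" if none fits. */
    static String classify(String phrase) {
        // matches() requires the whole phrase to fit the pattern.
        if (UNIT_VALUE.matcher(phrase).matches()) {
            return "Unit Value";
        }
        if (VALUE_UNIT.matcher(phrase).matches()) {
            return "Value Unit";
        }
        if (VALUE_PERCENT.matcher(phrase).matches()) {
            return "Value%";
        }
        if (VALUE_ONLY.matcher(phrase).matches()) {
            return "Value without Unit";
        }
        if (VALUE_S.matcher(phrase).matches()) {
            return "s after Value";
        }
        return "unknown";
    }
}
```

With these assumptions, classify("30 km") yields "Value Unit" and classify("1980s") yields "s after Value". The checks run in the order of the table, so the most frequent patterns are tried first.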
3.2 Implementation
The code is written in Java.
The ValueAnnotator checks the UTF-8 encoded Wikipedia articles token by token for the occurrence of a number. If a number is found, an object of the class ReadValue is created, which contains the actual value-reading function. The search algorithm is based on the statistical analysis of the frequency of the different value patterns. As soon as the unit is found, the value is converted to metric units and the ValueAnnotator creates an annotation.
The value types to search for can be specified in config.txt.
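The token-by-token scan and the conversion to metric units can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of ValueAnnotator or ReadValue: the names MetricValue and ValueScanSketch, the whitespace tokenization and the small conversion table are hypothetical, and only the "Value Unit" pattern is handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A value after conversion to a metric unit. */
final class MetricValue {
    final double value;
    final String unit;

    MetricValue(double value, String unit) {
        this.value = value;
        this.unit = unit;
    }
}

final class ValueScanSketch {
    /** Conversion rule: multiply by factor to obtain the metric unit. */
    private static final class Rule {
        final double factor;
        final String metricUnit;

        Rule(double factor, String metricUnit) {
            this.factor = factor;
            this.metricUnit = metricUnit;
        }
    }

    // Hypothetical conversion table; a real one would cover many more units.
    private static final Map<String, Rule> RULES = new HashMap<>();
    static {
        RULES.put("km", new Rule(1.0, "km"));
        RULES.put("m", new Rule(1.0, "m"));
        RULES.put("kg", new Rule(1.0, "kg"));
        RULES.put("miles", new Rule(1.609344, "km"));
        RULES.put("ft", new Rule(0.3048, "m"));
        RULES.put("lb", new Rule(0.45359237, "kg"));
    }

    private ValueScanSketch() {
    }

    /** Scans whitespace-separated tokens for "number unit" pairs. */
    static List<MetricValue> scan(String text) {
        List<MetricValue> found = new ArrayList<>();
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i++) {
            // Only plain integers or decimals count as numbers in this sketch.
            if (!tokens[i].matches("\\d+(\\.\\d+)?")) {
                continue;
            }
            // The token after the number must be a known unit (no punctuation handling).
            Rule rule = RULES.get(tokens[i + 1]);
            if (rule == null) {
                continue;
            }
            double value = Double.parseDouble(tokens[i]);
            found.add(new MetricValue(value * rule.factor, rule.metricUnit));
        }
        return found;
    }
}
```

For example, scan("The trail is 10 miles long") would produce one value of roughly 16.09 km. A fuller version would also try the other patterns from the statistical analysis, such as a unit that precedes the number.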
4. Results
The search distinguishes between exact values (e.g. 30 km) and ranges (e.g. 1986 - 1992).
The type "value" has 5 features:
int begin - the position in the text, where the value begins
int end - the position in the text, where the value ends
float value - the value itself
string unit - unit of the value
string type - type of the value (e.g. length)
The type "range" has two value features: the beginning and the end value of the range.
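The two annotation types can be pictured as plain Java classes. This is only a sketch: in a UIMA project such types are normally declared in a type system descriptor, and the class names ValueSketch and RangeSketch, the field names from and to, and the assumption that a range holds two complete values are hypothetical.

```java
/** Mirrors the five features of the "value" type. */
final class ValueSketch {
    final int begin;    // position in the text where the value begins
    final int end;      // position in the text where the value ends
    final float value;  // the value itself
    final String unit;  // unit of the value
    final String type;  // type of the value, e.g. length

    ValueSketch(int begin, int end, float value, String unit, String type) {
        this.begin = begin;
        this.end = end;
        this.value = value;
        this.unit = unit;
        this.type = type;
    }
}

/** Mirrors the "range" type: a beginning value and an end value. */
final class RangeSketch {
    final ValueSketch from; // hypothetical field name for the beginning value
    final ValueSketch to;   // hypothetical field name for the end value

    RangeSketch(ValueSketch from, ValueSketch to) {
        this.from = from;
        this.to = to;
    }
}
```

Under these assumptions, the range "1986 - 1992" at the start of a text could be represented as new RangeSketch(new ValueSketch(0, 4, 1986f, "", "year"), new ValueSketch(7, 11, 1992f, "", "year")), where the empty unit and the type name "year" are illustrative choices.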