The motivation behind this project is to build an OSM data search engine that delivers results comparable to those of Nominatim. The aim is to set up a basic web application (parser, index structures, basic ranking and the UI) so that it can be improved in future work focusing mainly on the ranking functionality. In the following, the structure of the application, selected parts of it and their development are presented. The order of presentation also reflects the order of development and the logical execution sequence.
In order to run the application, do the following:
The backend is written in C++. It consists of a parser, an inverted index over attributes such as name, and a QGramIndex that produces proposals for the search query. This project mainly focuses on the parsing part of the application and the UI. The application sends the results to the frontend as a JSON string.
There are several XML parsers for C++, falling into a few classes. DOM parsers build a tree structure representing the parsed document, which allows random access to each element of the tree. No parser of the DOM class is suitable for the available 8 GB of RAM, because the tree expands to a multiple of the original document size. For example, a 2 GB file may blow up to 10 GB when the full DOM is created. Compared to the SAX API, using the DOM API of a parser like rapidXML is rather simple and intuitive. SAX parsers are event-based; the available SAX parsers have not been considered for this task yet.
Instead, a custom parser has been implemented that reads the document line by line and stores the entities internally in simple arrays. These entities are representations of the parsed XML elements. The implementation takes advantage of the structure of the specific XML file used for this task. OSM data as XML is shallow with respect to the depth of the XML tree. There are only three kinds of elements: nodes, ways and relations. A node represents a point in space, a way represents a group of nodes, which one may consider a path, and a relation is a group of all three kinds of elements. The depth of the XML tree is therefore essentially only 2: the first level contains, for example, the kind of element and its ID, while the second level stores the attributes and references to other elements. The parser identifies certain keywords in order to distinguish elements and to recognize an element's attributes or references. At the moment only the values of the name attribute are considered: the implementation loops through a list that contains only the name string to look for during parsing. Other attributes can be included by adding key terms to this list.
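The line-by-line scan described above can be sketched as follows. This is a minimal sketch, not the project's actual code: the helper `attributeValue`, the struct `OsmEntity` and the function `parseLines` are hypothetical names, and the real parser additionally handles references between elements.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: extract the value of an XML attribute from one
// line of text. Returns an empty string if the attribute is absent.
std::string attributeValue(const std::string& line, const std::string& attr) {
    std::string needle = attr + "=\"";
    size_t start = line.find(needle);
    if (start == std::string::npos) return "";
    start += needle.size();
    size_t end = line.find('"', start);
    if (end == std::string::npos) return "";
    return line.substr(start, end - start);
}

struct OsmEntity {
    std::string kind;  // "node", "way" or "relation"
    std::string id;
    std::string name;  // value of the "name" tag, if any
};

// Line-by-line scan: an opening tag starts a new entity; a
// <tag k="name" v="..."/> line on the second depth level fills in
// the name attribute the index is interested in.
std::vector<OsmEntity> parseLines(const std::vector<std::string>& lines) {
    std::vector<OsmEntity> entities;
    for (const std::string& line : lines) {
        if (line.find("<node") != std::string::npos ||
            line.find("<way") != std::string::npos ||
            line.find("<relation") != std::string::npos) {
            OsmEntity e;
            e.kind = line.find("<node") != std::string::npos ? "node"
                   : line.find("<way") != std::string::npos ? "way"
                   : "relation";
            e.id = attributeValue(line, "id");
            entities.push_back(e);
        } else if (line.find("<tag") != std::string::npos &&
                   attributeValue(line, "k") == "name" && !entities.empty()) {
            entities.back().name = attributeValue(line, "v");
        }
    }
    return entities;
}
```

Because the tree depth is fixed at 2, no stack of open elements is needed; the most recently opened entity is always the one a `<tag>` line belongs to.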
The local Regierungsbezirk Freiburg OSM data set of 2.3 GB has been used. The machine used for this task has an Intel Core i5-2400 @ 3.1 GHz with 6 MB L3 cache and, as previously mentioned, 8 GB of RAM. Over 10 runs, the average runtime of the parsing function is 56.224 seconds, with a deviation of no more than 0.3 seconds. Increasing the RAM in order to work with DOM parsers, or using the SAX API of common parsers, may influence the performance; an improvement in runtime is assumed. Memory performance decreases for DOM parsing, whereas random access becomes instant.
While parsing, all values of the name attribute of OSM elements are stored separately, without duplicates. They are used for the QGram index in order to allow for fuzzy search proposals, which take effect in the UI later on. QGrams with Q = 3 are used for this task, serving as a measure of string similarity.
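The idea behind q-grams with Q = 3 can be illustrated as below. This is a sketch under assumptions: the padding scheme (prefixing with `$` characters, which gives word beginnings more weight) and the function names are choices made here for illustration, not necessarily those of the project.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Split a word into overlapping q-grams of a '$'-padded copy.
// For q = 3, "freiburg" yields "$$f", "$fr", "fre", "rei", ...
std::vector<std::string> qGrams(const std::string& word, size_t q = 3) {
    std::string padded = std::string(q - 1, '$') + word;
    std::vector<std::string> grams;
    for (size_t i = 0; i + q <= padded.size(); ++i)
        grams.push_back(padded.substr(i, q));
    return grams;
}

// Number of q-grams two strings share: a cheap proxy for string
// similarity that a q-gram index can use to rank fuzzy proposals.
size_t commonQGrams(const std::string& a, const std::string& b) {
    std::multiset<std::string> ga, gb;
    for (const auto& g : qGrams(a)) ga.insert(g);
    for (const auto& g : qGrams(b)) gb.insert(g);
    std::vector<std::string> common;
    std::set_intersection(ga.begin(), ga.end(), gb.begin(), gb.end(),
                          std::back_inserter(common));
    return common.size();
}
```

Misspellings such as "freiberg" for "freiburg" still share most of their 3-grams, which is what makes similar names retrievable even when the exact lookup fails.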
Besides powering the fuzzy search proposals, a functionality that may be added in the future, the QGram index can be used in cases where a query does not return any results but matches similar words. Google Search similarly offers an alternative but similar query to the one the user has typed in. Another approach that may be implemented in the future is that similar words are also queried by the backend and their score is discounted. In effect, multiple queries are processed, and the ranking is handled via the discounting of the score.
In order to process queries with near-instant latency, preprocessing is required. Hence, the values of attributes like name are used for the inverted index, which makes the lookup for a given query trivial. The values of the OSM element attributes are used as keys in a map/dictionary data structure; a vector of references to the mapped OSM elements serves as the value corresponding to each key. This amounts to mapping the name of a geographical landmark onto references to all landmarks carrying that name. In this case, the data structure is the essential part rather than a particular algorithm. Processing a multi-term query requires aggregation of results.
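The map-based structure just described can be sketched as follows. The class and member names are assumptions for illustration, not the project's actual classes; elements are referenced here simply by their index in the entity array.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Minimal sketch of the inverted index: an attribute value (e.g. a
// name) maps to the indices of all OSM elements carrying it.
class InvertedIndex {
public:
    void add(const std::string& name, size_t elementIndex) {
        index_[name].push_back(elementIndex);
    }

    // Lookup is a single hash-map access, hence effectively constant
    // time once the index has been built during preprocessing.
    std::vector<size_t> lookup(const std::string& query) const {
        auto it = index_.find(query);
        return it == index_.end() ? std::vector<size_t>{} : it->second;
    }

private:
    std::unordered_map<std::string, std::vector<size_t>> index_;
};
```

All the work happens at parse time, when `add` is called once per name occurrence; at query time the index only hands back the precomputed vector.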
The scores of terms in a document generally vary more than the scores of a term in an OSM attribute, because those attributes do not in general consist of a logical sequence of complete sentences. The scores are merged and sorted. Improving the ranking is out of the scope of this project; another project or thesis in succession to this one, addressing the ranking and the quality of results, is intended.
The layout is based on Google Maps, where the map covers essentially the whole space apart from a few UI elements. The result list on the left side is still inspired by Nominatim. To get a notion of it, here is a glimpse of the UI. The fuzzy proposals, the information box as well as other backend functionality are disabled.