The actual motivation behind this project is to build an OSM-Data search engine that has comparable results as Nominatim. The aim of this project is setting up a basic web application(Parser, index structures, basic ranking and the UI) so that the application can be improved in a future work, in which one will mainly focus on the ranking functionality. In the following, the structure of the application, certain parts of it and their development will be presented. The order of presentation also reflects the order of development of the application and the logical execution sequence.
In order to run the appliction, do the following:
The backend is written in C++. It consists of a parser, an inverted index that uses attributes as name and a QGramIndex in order to get proposals for the search query. This project mainly focuses on the parsing part of the application and the UI. This application does send the results as a string to the frontend in the json-format.
There are several XML Parser for C++. In general, there are few classes of parsers. DOM-parsers are building a tree structure that represents the parsed document, which allows random access to each element of the tree. Any parser of the DOM-parser class is not suitable for my low specification of 8 GB RAM, due to the expansion onto a multiple of the original size of the document. For example, a 2 GB file may blow up to 10 GB when the full DOM is created. Compared to SAX-API, using the DOM-API of a parser like rapidXML is rather simple and intuitive for me as the performing subject. SAX Parsers are event-based and the given SAX-parser have not been taken into consideration for this task yet.
An own parser that reads the document and processes it line-by-line and stores the entities internally in simple arrays has been implemented instead. Those entities are a representation of the read XML-elements. This implementation takes the structure of the OSM-data (xml-file) we specifically use for this task into account. We deal with OSM-data as XML which is shallow wrt. the depth of the xml tree. We only deal with 3 kinds of elements, which are namely nodes, ways and a relations. A node represents a point in space, a way does represent a group of nodes, which one may consider as a path and a relation is a group of all three kinds of elements. So the depth of the XML tree essentialy is only 2, the first depth level contains for example the kind of element and its ID. The second depth level does store the attributes and references to other elements. Certain keywords are identified by the parser in order to distinguish elements and recognize an element's attributes or references. At this moment the values of the name attribute are considered. The implementation is looping through a list that only contains the name string to look for in the parsing. Other attributes can be included by adding key-terms to the list.
The local Regierungsberzirk Freiburg OSM data set with 2.3 GB has been used. The machine used for this task has a an Intel Core i5-2400 @3100 with 6 MB L3 cache and as previousley mentioned 8 GB of RAM. Out of 10 runs the average runtime of the parsing function is 56.224 seconds, whereas the the value does not deviate more than 0.3 seconds. Increasing RAM in order to work with DOM-parsers or using SAX-API of common parsers may influence the performance, and an increase of the runtime performance is assumed. Memory performance decreases for DOM-parsing, whereas the random access becomes instant.
Further preprocessing is done throughout or directly after the parsing i.e.: aggregation of ways into relations(clustering), building a geo-index and the addition of "is_in"-tags for which the geo-index is used. The aggregation of ways into relations does reduce the amount of elements, that can be queried in the inverted index by a significant amount because in general, there are a significant number of ways found in an OSM data-set that which are can be grouped with other ways in order to form a new relation. One can think the problem as a street that has been fragmented to several parts describing the street in OSM, but there is no relation that groups those parts together. Adding the "is_in"-tags is used in order to be able to use concatenated queries that are in the form of "$subjectToBeFound $placeOfSubject". E.g. one specific example could be the query "hautklinik herdern". The information of the place of the sought subject is helpful when the sought subject is a generic term that needs further narrowing down.
While parsing all values for the name attribute of an OSM-element, they are stored separately without allowing for duplicates. These are used for the QGram-Index in order to allow for fuzzy search proposals that take effect in the UI later on. QGrams with Q = 3 are used for this task. QGrams are here used as a measure of string similarity.
Beside the fuzzy search proposals, as a functionality that may be added in future, they can be used in cases where a query does not return any results, but the QGram-index matches similar words. Google search also may offer an alternative but similar query as the user has typed in. Another approach that may be implenented in future is that similar words are also queried by the backend and their score gets discounted. So in fact multiple queries are processed, but the ranking is taken care of via the discounting of the score.
In order to provide a latent instant processing of the query preprocessing is required. Hence, the values of attributes like name are used for the Inverted-Index. This will make the lookup, given a query, trivial. The values of the osm-element-attributes are used as keys in a map/dictionary data-structure. A vector of references to the mapped osm-elements are used as the values correspoinding to the keys. E.g. this signifies, the mapping of a name of a geographical landmark onto references of geographical landmarks containing the same name. So in this case, the data structure here is the essential part rather than a certain algorithm. Processing a multiterm query requires aggregation of results.
The scores of terms in a document have in general a higher variation than the scores of a term in an OSM-attribute, because those attributes do not in general consist of a logical sequence of complete sentences. The scores are being merged and sorted. Furthermore the inverse of the distance is also taken into account for the ranking in local vicinities. Matching and non-matching of the query or the result terms is also affecting the ranking accordingly. Further improvement of the ranking is out of the scope of this project. Another project or thesis in succession to this, addressing the ranking and improvement of the quality of results is intended.
Based on the GoogleMaps layout, where the map covers essentially the whole space neglecting a few UI elements. The result list on the left side is still inspired by Nominatim. To get a notion of it: here is a glimpse of the UI. The fuzzy proposals, the information box as well as other backend functionality is disabled.