Implementation details
We implemented Dijkstra's shortest path algorithm on time expanded and time dependent graph structures.
Both graphs were created based on OpenStreetMap data for road
networks and GTFS data for
public transport times. Our implementation connects both data sources and finds the fastest way from a start
to a goal position and is able to consider transfer times when changing trains.
All implementation was done in ISO-C++.
Time expanded approach:
Constructing the time expanded graph concerning the GTFS dataset, we create three types of nodes
for each station at different time events. These are arrival nodes, transfer nodes and departure nodes.
An arrival node reflects when a vehicle arrives at a station at which time, while a transfer node models
the ability to change the vehicle respecting the time that is needed for the change. A departure node
reflects when a vehicle leaves at station at a certain time. The edges between the nodes are drawn as follows:
an edge exists between the arrival and the transfer node having the waiting time for changing the vehicle as cost,
and it also exists an edge between arrival and departure node having the time one stays in the vehicle at a station
as cost. The edges that connect different stations, say A and B, are drawn between the departure node
of A (assuming
a trip goes over A to B) to the arrival node of B where the travel duration is the cost of that edge.
The transfer nodes
can have edges to several departure nodes at a later time, but have one edge to the next transfer node in time.
Having the complete time expanded graph both the OSM graph and the GTFS graph are merged. This is done by
finding the closest OSM node to a station and connecting the OSM node to all departure nodes of that station. Then
all arrival nodes of that station are connected with the OSM node. In the end we have a complete time expanded graph
where it is possible to walk between stations and also use the transit network where possible in oder to get from
A to B via the shortest path.
Concerning the time expanded approach we use a very straight forward Dijkstra implementation to calculate
the shortest path from A to B.
Time dependent approach:
The time dependent approach is a solution where each station/osm-node is represented by exactly one node. An edge between two osm nodes exists if they are consecutive points on a way. Furthermore, for each station, an edge to its closest osm-node exists. In addition, two stations A and B are connected via a directed edge if a train/bus passes B directly after A. Each edge (u,v) has a function which returns, for a given time, the earliest arrivals at v. Note that due to transfer buffers, several arrivals at v might be returned. Imagine that one could arrive with line L1 at v at 10:00 o'clock but also with line L2 at v at 10:01. Assume both trains depart after two minutes and transfers buffers are five minutes. If node w were only reachable using L2, considering only the earliest arrival at v (the one with L1 at 10:00) would exclude the solution to reach w. Note that the edge function for (u,v) depends also on the line with which one arrives at u. On such a graph, we run a slightly modified dijkstra which in addition to a start- and target-position takes a departure time. Starting at an OSM node, all outgoing edges and their edge-functions (using the departure-time) are considered. For each edge (u,v), the edge-function is calculated which returns a label for the earliest arrival at v and all labels for which the departure time is smaller or equal than the best arrival plus transfer buffer time (and larger or equal than the best departure time). These are the only labels needed to compute the optimal (in the sense of shortest travel time) path. All following departures from v can be reached by changing the line, since the transfer time is sufficient. All mentioned labels are inserted in the priority queue and treated independtly from each other. Then, the node with the smallest arrival time is taken from the priority queue and the same steps are repeated.
Problems:
Protocol Buffers: The main issue regarding Google's Protocol Buffers is the poor documentation of the library. All examples and tutorial files propose to use a nested data structure. This structure can not be applied in our case, because of the low message size limit and the huge size of one graph. So instead of writing one message containing the complete graph, we used the approach to write many messages with different types each containing a node, edge, or travel time data. The library does not contain a function to distinguish and read those messages. So we implemented a function marking every message type and its size at the beginning of each message.Implementation structure can not be converted directly to Protocol Buffer sturcture.
Changing elements in priority queue: In the beginning, we added pointers to nodes to the
std::priority_queue
with a custom comparator to define an order on
the pointers. As we later changed the cost values assigned to the nodes, we changed the order of the
elements and thus violated the heap condition of the priority queue. This is a rather bad idea and
leads to undefined behavior. Avoid storing pointers to priority queues.
Time expanded: Arcs to future nodes: When we are at a specific station at a specific time, it is desirable to be able to use every bus from that station that has a later departure time. But inserting O(n2) edges in a graph with some hundred thousand nodes is a bad idea. Instead, sorting all nodes at a station and continuously connecting each of them with its successor makes all later nodes accessible from an earlier node (by visiting all nodes in between) and reduces memory consumption from several GB to few MB.
Time expanded: Same departure times: When implementing the solution to the previous problem, special care have to be taken when departure times are the same for some nodes. While this is not the general case, it may lead to very subtle errors and can be avoided by still building a transitive closure on all (and only those!) nodes with the same departure time.
Database for GTFS data: At first it may be tempting to just parse the easily structured GTFS files line by line and only get what you want. While this may work for small example data, it is a far more flexible, faster and better idea to use the GTFS data for what it really is: a set of relational database tables. We do complex sorting and selecting on the data that spans not less than three tables. Using SQLite, we reduced the time to parse by factor two to three, reduced the code size and complexity and thus the source of bugs and extending the code is far more easy. A ready to use Python script to convert a bunch of GTFS files to a sqlite database can be found here. Doing the work of a RDBMS "by hand" is always a bad idea.
Erroneous GTFS files: The GTFS file specification is a rather fuzzy file description. When parsing the files, we stumbled upon missing fields, blank lines in the middle of the file and other stuff you have to take care of. The previously mentioned convert script did cope with our sample data without problems.
Close OSM nodes: Sometimes, OSM data may contain nodes that are very close together, for example when the street makes a close bend. We save the distance in meters as integer values. Consequently, nodes closer than 1m apart will have a distance of 0 and you can travel there in 0 time. Such a node is added with the current time to the priority queue. When it is allowed to revisit nodes (as necessary for transfer buffers in time dependent graphs), the previous node is again inserted to the priority queue with the same time. This leads to an infinite loop. Special care has to be taken to handle nodes with distance 0.
Erroneous OSM files: Openstreetmap data is open for modification by everyone, which sometimes leads to points being tracked multiple times. Our sample data contained several hundred nodes that did not belong to any way but represented the exact same position as an already existing node. This occurred as a problem for our first implementation of a kD tree, as it did not cope well with nodes that have the exact same coordinates. This also is a formal error in the OSM mapping and should not occur (or should be fixed if it occurs). Only use OSM points that actually belong to roads, this also reduces the number of nodes in the graph.