You must be registered to this challenge in order to access the files.
A long time ago radio stations came up with an idea of collecting current information about traffic jams, roadworks or accidents, and radio broadcasting it to all the drivers, so that they can go around the unpassable places. In the "Jams" task we want to go one step further and try to mine the data gathered during the initial phase of the morning peak, in order to predict where the next jams will occur during main phase of the peak. Such predictions could be used to warn drivers in advance, before the jams actually occur.
Data collected by radio broadcasters have some distinct features that influence the way how they should be mined and what algorithms can be used. These features are reflected in competition data. Firstly, they are imprecise: detailed measurements of speed and number of cars, as well as the time when the jam first occurred, are missing. Secondly, they cover only major roads that constitute 25% of the whole road network. The missing part of information, related to minor roads, may influence jams on major roads and thus should be included, e.g., as latent variables, in decision models. Competition data exhibit the same properties: they contain ordered sequences of street identifiers alone, without any numeric information on congestion or timing of jams, covering only major streets of Warsaw.
Data samples were generated from independent 1-hour long simulations, starting each time from an almost empty road network, fed subsequently with cars according to the selected distributions of start and destination points (distributions selected randomly from a pool of predefined ones). At the beginning of every simulation, 5 randomly selected road segments - 2 on major and 3 on minor roads - were removed from the graph, imitating the presence of roadworks. As input data, the algorithm receives identifiers of the 5 excluded segments followed by a sequence of major roads where the first jams occurred during initial 20 minutes of the simulation. The task is to predict a sequence of major roads where next jams will occur in the next 40 minutes. The sequences are ordered according to time of jam appearance and their length may vary between samples.
The criterion that defines a jam involves both the average speed and number of cars on a given segment. The segment is said to be jammed if average speed over the last 6 minutes was lower than 5 km/h and the number of cars that have passed or stay on this segment is larger than 10. A given road segment may be listed at most once during simulation, when it gets jammed for the first time.
Training and test sets consist of 5000 samples (hours) each. Every line corresponds to a single simulation and consists of a sequence of egde labels: two identifiers of nodes connected by underscore "_". Training set contains data for two parts of the simulation - initial 20-minute period and the following 40-minute period - separated by colon ":". Test set contains only recording for the first 20-minute part. The output part - a sequence of next predicted jams - should be submitted as a solution. Graph of Warsaw streets is available, so you can exploit structural dependencies between different parts of the network. Major and minor roads can be distinguished by looking at average maximum velocity, which equals 60 km/h in case of minor roads, or more in case of major roads.
Baseline solution predicts always the most frequently jammed edges that don't appear in the input part of a given sample. Length of the prediction is equal to the average length of output sequences in training data. Frequency of jams on a given edge was calculated over the whole training data.
Due to atypical form of output decisions - varying-length sequence of identifiers - we employed a non-standard evaluation metric, based on the concept of Mean Average Precision (MAP) from the domain of Information Retrieval, adapted to the specifics of this task. For each sample, quality of prediction is calculated as:
This measure takes on the largest values when both sequences are exactly equal and punishes any deviations of predictions from the target, like: different length of the sequence, different order of identifiers, lack of expected identifiers or presence of non-expected identifiers. Mistakes on initial positions of the sequence are punished stronger than on further ones.
Overall result for the test set is calculated as an arithmetic average of values calculated for each sample.