Recently, two independent technologies, GPS driver navigation and wireless internet access in cell phones, have become so popular that many drivers use both and can therefore send their current GPS positions to a server in real time. Streams of such data coming from different drivers can be merged to reconstruct the current traffic map of the whole city and, using data mining methods, to make predictions for the next period of time. These predictions would be sent back to drivers and used for smart real-time journey planning: choosing faster routes and optimizing global traffic in the city. In the 3rd task, "GPS", you have to devise an algorithm for solving this problem.
In the simulator, 1% of drivers use GPS navigators, each of which sends a notification with its current GPS position and velocity to the central server every 10 seconds. Your algorithm receives this stream, covering a 30-minute interval, and has to predict the average velocity of vehicles passing 100 randomly selected road segments in two 6-minute time periods: from now until 6 minutes ahead, and between the 24th and 30th minute ahead. The algorithm should be scalable, because the data are voluminous: several GB uncompressed.
Training and test datasets cover 500 hours of simulation each. The same type of simulation as in the "Traffic" task, consisting of 10-hour cycles, was used. Distinct simulations are separated in the data files by an empty line.
Training data consist of two files: the stream of data obtained from the vehicles, and information about the actual average velocities on the selected road segments in the corresponding simulation cycles and time periods. Stream data comprise the following attributes:
The format of the second file is:
Test data are split into 1-hour windows, separated by empty lines. The stream of GPS notifications for the first half of each window is revealed. You should predict the harmonic average of the velocities of vehicles that will pass the 100 segments in two time periods, 0-6' and 24-30', of the second half of the window. The harmonic average is used instead of the arithmetic mean because it corresponds better to travel times, which are the ultimate criterion of optimization in a real-world setting. Test data have a format similar to that of the first file of the training data, except that the simulation cycles are split into separate windows and timestamps are always counted from the beginning of a given window.
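To illustrate why the harmonic average tracks travel times, here is a minimal Python sketch. The function name `harmonic_mean` is hypothetical, and dropping non-positive velocities is an assumption: the task description does not say how stationary vehicles are treated in the ground truth.

```python
def harmonic_mean(velocities):
    """Harmonic mean of the positive velocities; None if there are none.

    Assumption: non-positive velocities are skipped, since their inverse
    is undefined. The task description does not specify this case.
    """
    values = [v for v in velocities if v > 0]
    if not values:
        return None
    return len(values) / sum(1.0 / v for v in values)


# Two cars cross a 1 km segment at 20 km/h and 60 km/h,
# taking 3 minutes and 1 minute respectively (2 minutes on average).
speeds = [20.0, 60.0]
harmonic = harmonic_mean(speeds)          # about 30 km/h, i.e. 2 min per km
arithmetic = sum(speeds) / len(speeds)    # 40 km/h, i.e. 1.5 min per km
print(harmonic, arithmetic)
```

The harmonic mean of about 30 km/h corresponds exactly to the mean travel time of 2 minutes per km, whereas the arithmetic mean of 40 km/h would understate it.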
Note: after the data were published, it was noticed that they occasionally contain wrong GPS values that lie outside the range of geographical coordinates of Warsaw. These values occur rarely and should be ignored. See this post for details.
Further clarifications posted on the forum in response to participants' questions: harmonic mean of velocities.
A solution is a text file with 100,000 values: 200 values for each of the 500 windows, that is, 100 values for the first period (0-6') and another 100 for the second one (24-30'), listed on consecutive lines, with an empty line at the end of every window.
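The layout described above could be produced with a sketch like the following. The function name `write_solution`, the shape of its input, and the use of Python's default float formatting are all assumptions; the description does not state how the numbers should be formatted, and "consecutive lines" is read here as one value per line.

```python
import io


def write_solution(windows, out, per_period=100):
    """Write predictions: one value per line, empty line after each window.

    `windows` is a sequence of (first_period, second_period) pairs, each
    holding `per_period` predicted velocities (hypothetical input shape).
    """
    for first, second in windows:
        if len(first) != per_period or len(second) != per_period:
            raise ValueError("each period needs exactly %d values" % per_period)
        for value in list(first) + list(second):
            out.write(f"{value}\n")
        out.write("\n")  # empty line closes the window


# Usage: two dummy windows written to an in-memory buffer.
buffer = io.StringIO()
dummy = [([30.0] * 100, [35.0] * 100), ([28.0] * 100, [31.0] * 100)]
write_solution(dummy, buffer)
```

For a real submission, `out` would be a file opened for writing and `windows` would hold all 500 windows.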
The baseline solution is the average velocity of all cars that passed through each of the edges during the whole given 30-minute period. If no car passed through an edge, the overall average is taken.
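A rough sketch of this baseline in Python follows. It assumes the stream has already been reduced to hypothetical `(segment_id, velocity)` pairs, and it uses the arithmetic mean because the description says only "average"; neither detail is confirmed by the task description.

```python
from collections import defaultdict


def baseline_predictions(observations, segments):
    """Per-segment mean velocity, with the overall mean as a fallback.

    `observations` is an iterable of (segment_id, velocity) pairs seen in
    the 30-minute window; `segments` lists the segments to predict.
    Both are hypothetical input shapes chosen for illustration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total, n = 0.0, 0
    for segment, velocity in observations:
        sums[segment] += velocity
        counts[segment] += 1
        total += velocity
        n += 1
    if n == 0:
        raise ValueError("no observations in the window")
    overall = total / n
    return {
        s: (sums[s] / counts[s]) if counts.get(s) else overall
        for s in segments
    }


# Usage: segment "c" has no observations, so it gets the overall mean.
obs = [("a", 30.0), ("a", 50.0), ("b", 20.0)]
preds = baseline_predictions(obs, ["a", "b", "c"])
```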
Solutions are evaluated by the Root Mean Squared Error (RMSE) of inverted predictions. That is, predicted velocities are transformed, by inverting and multiplying by 60, into predicted travel times over 1 km of the road segment, expressed in minutes. These travel times are compared with the ground truth using RMSE.
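The metric can be sketched in Python as below. The function names are hypothetical, and the sketch assumes that velocities are in km/h (so that 60 / v gives minutes per km) and that the ground truth is also supplied as velocities and inverted in the same way; the description does not say in which form the ground truth is stored.

```python
import math


def travel_time_minutes(velocity_kmh):
    """Minutes needed to cover 1 km at the given velocity (assumed km/h)."""
    return 60.0 / velocity_kmh


def rmse_travel_time(predicted, actual):
    """RMSE between travel times derived from predicted and true velocities."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (travel_time_minutes(p) - travel_time_minutes(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Usage: predicting 30 km/h where the truth is 60 km/h is a 1-minute error.
score = rmse_travel_time([30.0, 60.0], [60.0, 60.0])
```

Because the error is measured on 60 / v, the same velocity error costs far more at low speeds than at high speeds, so predictions for congested segments dominate the score.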