Hello,

Were the train and test Traffic datasets sampled from 10-hour cycles generated by the same procedure with the same generation parameters? In other words, is it correct to assume that the train and test samples are not biased?

Thanks

4 posts • Page **1** of **1**

- alegro
**Posts:** 5 **Joined:** Mon Dec 14, 2009 2:36 pm

Hello,

The generation procedure was the same for the training and test datasets, so statistically they should exhibit the same characteristics. Note, however, that if you consider single simulations, even within the same dataset (e.g. the training set), the generation settings may differ, because some simulation parameters were picked randomly for each simulation. This 'randomness' was the same across the training and test datasets.

- Marcin
**Posts:** 115 **Joined:** Fri Oct 09, 2009 6:45 pm

Hello,

Thank you for the explanation.

For each of the two samples (Train and Test) I computed 21000 sums of the measured congestion over all road segments in both directions, using sliding 10-minute intervals within the first 30 minutes of each hour. The sums, sorted by value, are shown in the attached graph. Pseudocode of the procedure for one of the samples:

```
Sum array [1..21000]
SumIndex = 1
for Hour from 1 to 1000
    for WinFirstMinute from 1 to 21
        Sum[SumIndex] = 0
        for Minute from WinFirstMinute to WinFirstMinute+9
            for Segment from 1 to 20
                Sum[SumIndex] = Sum[SumIndex] + GetDataValue(Hour, Minute, Segment)
        SumIndex = SumIndex + 1
Sort 21000 values in Sum
for SumIndex from 1 to 21000
    show graph point X = SumIndex, Y = Sum[SumIndex]
```

Could you explain the difference between the resulting distributions? Does it indicate that the samples are poorly representative, that the sampling procedure was non-uniform (biased), or something else?

Thank you.

- alegro
**Posts:** 5 **Joined:** Mon Dec 14, 2009 2:36 pm
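
For readers who want to reproduce the procedure described in the post above, here is a minimal Python sketch. It assumes the measurements are already loaded into a NumPy array of shape (hours, minutes, segments), which stands in for `GetDataValue(Hour, Minute, Segment)`; the function name `sorted_window_sums` and the array layout are assumptions for illustration, not part of the contest materials.

```python
import numpy as np

def sorted_window_sums(data, window=10, first_minutes=30):
    """Sorted sliding-window congestion sums (hypothetical helper).

    data: array of shape (hours, minutes, segments); this layout is assumed.
    Returns hours * (first_minutes - window + 1) sums in ascending order.
    """
    data = np.asarray(data, dtype=float)
    # Total congestion over all segments for each minute of each hour.
    per_minute = data[:, :first_minutes, :].sum(axis=2)
    # Prefix sums with a leading zero column, so that a window sum
    # becomes the difference of two prefix sums.
    zeros = np.zeros((per_minute.shape[0], 1))
    csum = np.concatenate([zeros, np.cumsum(per_minute, axis=1)], axis=1)
    # Column j holds the sum over minutes j .. j+window-1 (0-based),
    # matching WinFirstMinute = 1..21 in the pseudocode.
    sums = csum[:, window:] - csum[:, :-window]
    return np.sort(sums.ravel())
```

With 1000 hours, 60 minutes per hour and 20 segments, this returns 21000 values, as in the pseudocode. Plotting the returned array against its index for each sample would give the kind of sorted-sums graph the post describes.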

Hi alegro,

Yes, this difference may indeed originate from the small sample size.

regards, M

- Marcin
**Posts:** 115 **Joined:** Fri Oct 09, 2009 6:45 pm
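
One way to probe the small-sample explanation is a permutation test, which is not mentioned in the thread and is offered here only as a sketch. Overlapping 10-minute windows within the same hour share most of their minutes, so the 21000 sums are strongly correlated and the effective sample size is much smaller than 21000. The sketch below therefore assumes one summary value per hour (for example, the mean window sum of that hour) for each sample; the function name and its inputs are hypothetical.

```python
import numpy as np

def permutation_test_mean_diff(train_hours, test_hours, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    train_hours, test_hours: 1-D arrays with one summary value per hour
    (an assumed aggregation, chosen to avoid correlated overlapping windows).
    Returns an approximate p-value.
    """
    rng = np.random.default_rng(seed)
    train_hours = np.asarray(train_hours, dtype=float)
    test_hours = np.asarray(test_hours, dtype=float)
    observed = abs(train_hours.mean() - test_hours.mean())
    pooled = np.concatenate([train_hours, test_hours])
    n_train = train_hours.size
    count = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled hours to "train" and "test".
        perm = rng.permutation(pooled)
        diff = abs(perm[:n_train].mean() - perm[n_train:].mean())
        if diff >= observed:
            count += 1
    # Add-one correction so the estimate is never exactly zero.
    return (count + 1) / (n_perm + 1)
```

A large p-value would be consistent with the two samples coming from the same generation procedure, with the visible gap explained by sampling noise; a very small one would suggest a real difference. Hours from the same simulation cycle may still be correlated, so permuting whole cycles instead of hours would be the more cautious variant.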

Return to IEEE ICDM Contest: Road Traffic Prediction for Intelligent GPS Navigation

Users browsing this forum: No registered users and 1 guest