Genre contest - datasets not comparable?

Questions, answers, discussions related to ISMIS 2011 Contest

Genre contest - datasets not comparable?

Postby JustForFun » Fri Feb 18, 2011 10:46 am

Dear all,

Here is the short version of my question: when you test a model on the genre training dataset, either with a simple hold-out/validation set or with a full cross-validation, do you obtain accuracies comparable to those you get when you submit the same model to the leaderboard?

The longer version: I started working on the problem a few days ago and, as a first step, I tried to train some baseline models. I wrote a small cross-validation framework to check the performance of my models on the training dataset itself. The astounding result: my baseline model already achieves ~90% accuracy on the validation sets. A quick check against the results on the leaderboard told me that I must have done something wrong. I have been searching for a possible bug in my validation framework for two days now, but so far I could not find any. Out of curiosity I submitted the solution for the (real) test dataset to the leaderboard, and as expected it is only in the range of ~50% accuracy.
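
To make the setup concrete, here is a minimal sketch of this kind of naive k-fold check (not my actual framework; the file names and the random-forest baseline are only placeholders):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder loading - the real feature/label files and their format differ.
X = np.loadtxt("genre_train_features.csv", delimiter=",")
y = np.loadtxt("genre_train_labels.csv", dtype=str)

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# A standard stratified 5-fold split over the raw training rows.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("naive CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))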

Now I am wondering whether you observe similar things (meaning I do not have a bug). On the other hand, I also cannot really believe that the contest was designed with a completely non-representative training dataset?!

Any information or help would be highly appreciated ;)

Kind regards,
Fabian
JustForFun
 
Posts: 2
Joined: Mon Jan 17, 2011 6:49 pm

Re: Genre contest - datasets not comparable?

Postby domcastro » Fri Feb 18, 2011 11:43 am

Hi
I think there's a move away these days from a "representative test set" - the goal seems to be more about finding a "general" model that can cope with new, unknown and sometimes different data. The last biological competition was like this but a lot worse - a 96% CV score on training went to 75% on the preliminary set, which then dropped to 0.48% on the final test set.

I've had a look through the test set and there are some instances that aren't represented in the training set - however I'm assuming this is more representative of what happens in the real world.

I'm a little worried that the labels were chosen by a computerised model though - no one has answered my message below yet.

On average I'm getting a 15-20% drop in performance in this competition from CV on training to the test set - but I'm a lot more confident with this dataset than with the biology one.
domcastro
 
Posts: 10
Joined: Fri Nov 26, 2010 5:00 pm

Re: Genre contest - datasets not comparable?

Postby JustForFun » Fri Feb 18, 2011 5:59 pm

Hi,

Thank you very much for your explanations. During my long search for a possible bug, I thought about the possibility of a non-representative training dataset several times, but I always discarded the idea because I just thought it would be nonsense. Now I'm pretty surprised ;)

This leads to my follow-up question: what do you do with a training set that does not really allow you to train? Do you use the training set for any validation at all? Since generalization on the training data is (almost?) irrelevant, any validation seems pointless. Or are we just supposed to dumb down the models and press the submit button 1000 times until we happen to obtain a good model?

Actually, I don't really see the point of having such a big discrepancy between the training dataset and the forecast dataset for the purposes of a contest. Isn't it a basic principle of data mining that you should be able to have faith in your training data? If a model generalizes well on the given dataset, you have to expect that it generalizes well on other data. Reducing the quality of a good model on purpose seems very unnatural. And I also don't see the relevance to real-world problems: the real-world solution in this case would be to increase the quality of the training dataset, not to dumb down a model with high generalization?! Apart from that, a real-world situation would not allow you to press the submit button a thousand times to see if the model is simple enough. Don't get me wrong, I'm still grateful for the possibility to combine my two big passions, music and data mining, in one project, but I'm nevertheless a bit disappointed that this contest shows characteristics of a game of pure chance.

But back to the topic: are there any scientific approaches to simplify the models systematically? Introducing class noise would be an option. Did anyone experiment with adding noise to the features, too?
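
To make the idea concrete, here is a minimal sketch of both kinds of noise injection (the noise levels are arbitrary and the features are assumed to be standardized):

import numpy as np

rng = np.random.default_rng(0)

def add_feature_noise(X, scale=0.1):
    # Perturb each (standardized) feature with Gaussian noise.
    return X + rng.normal(0.0, scale, size=X.shape)

def add_class_noise(y, flip_rate=0.05):
    # Randomly reassign a small fraction of the labels to other classes.
    y_noisy = y.copy()
    classes = np.unique(y)
    flip = rng.random(len(y)) < flip_rate
    y_noisy[flip] = rng.choice(classes, size=int(flip.sum()))
    return y_noisy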

Kind regards,
Fabian
JustForFun
 
Posts: 2
Joined: Mon Jan 17, 2011 6:49 pm

Re: Genre contest - datasets not comparable?

Postby jamesxli » Sat Feb 19, 2011 4:26 am

No reason to be frustrated! You are not the only one who has experienced this situation. It just shows that the cross-validation method you used is not appropriate for the problem underlying this contest. Maybe you should try some other CV methods.
jamesxli
 
Posts: 19
Joined: Wed Dec 09, 2009 6:55 pm

Re: Genre contest - datasets not comparable?

Postby TunedIT » Mon Feb 21, 2011 8:45 pm

Dear all,

Indeed, the training and testing sets are different, but it's no secret: according to the task description, "Training and test datasets contain tracks of distinct performers". The reason for such a division is that we would like your model to be able to classify new performers, not only those you already know. This way you really need to find distinctive features of genres, not of individual artists.

Therefore, simple CV is definitely not representative and it is natural that it gives better scores - the training and validation sets may contain two fragments of the same track, which are obviously much more similar than fragments of tracks by two different artists. However, as jamesxli said, you can try to generate a better division of the training set; one possible grouped split is sketched below.
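
For example, assuming you can attach a group id to each fragment (the track or performer it comes from - these ids are not part of the released data, so they would have to be reconstructed, e.g. by clustering very similar fragments), a group-aware split along the lines of scikit-learn's GroupKFold keeps all fragments of a group on the same side of the split. A rough sketch with synthetic placeholder data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic placeholders: in practice X and y come from the training set and
# `groups` holds one track/performer id per fragment.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = rng.integers(0, 6, size=600)          # six genre labels, purely synthetic
groups = np.repeat(np.arange(100), 6)     # six fragments per "track"

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Fragments sharing a group id never appear in both training and validation,
# which mimics the train/test division by performer much more closely.
scores = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5),
                         groups=groups, scoring="accuracy")
print("grouped CV accuracy: %.3f" % scores.mean())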

I hope everything is clearer now :)

Regards,
Joanna
TunedIT
 
Posts: 26
Joined: Fri Oct 09, 2009 6:45 pm

Re: Genre contest - datasets not comparable?

Postby pwf » Tue Feb 22, 2011 4:36 pm

Creating prediction models with broad generality is a worthwhile goal. However, the obvious way to encourage participants to build models with greater generality would have been to create a training dataset with greater diversity in musical passages and performers. It is somewhat strange for participants to find that performance on the leaderboard is improved by "dumbing down" the prediction model. Normally we expend a great deal of effort trying to model the training set as precisely as possible.
pwf
 
Posts: 4
Joined: Thu Jan 13, 2011 2:41 am

Re: Genre contest - datasets not comparable?

Postby pwf » Wed Feb 23, 2011 5:15 am

For clarity, let me add two additional observations to my previous note.

If the organizers wanted participants to produce prediction models that focused on genres and not performers, they could have prepared a training set in which each musical passage was played by multiple performers (say 20 different groups for each genre). This would have led to models that effectively distinguish among the genres independent of the differences among performers.

Given the limitations of the current training set, the organizers could have provided two additional variables identifying each passage and each performing group. This would have greatly improved the participants' opportunity to build models that were more effective in identifying genres, independent of which group was performing.
pwf
 
Posts: 4
Joined: Thu Jan 13, 2011 2:41 am

