Here is the short version of my question: When you guys test a model on the genre training dataset, either with a simple hold-out/validation set or with a full cross-validation, do you obtain accuracies comparable to the ones you get when you submit the same model to the leaderboard?
The longer version: I started working on the problem a few days ago, and as a first step I tried to train some baseline models. I wrote a small cross-validation framework to check the performance of my models on the training dataset itself. The astounding result: my baseline model already achieves ~90% accuracy on the validation sets. A quick check against the results on the leaderboard told me that I must have done something wrong. I have been searching for a possible bug in my validation framework for two days now, but so far I have not found any. Out of curiosity I submitted the predictions for the (real) test dataset to the leaderboard, and as expected it only reaches ~50% accuracy.
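For reference, here is a minimal sketch of the kind of check my framework performs. This is not my actual code: it assumes scikit-learn, and the synthetic data from make_classification just stands in for the real extracted features and genre labels to keep the snippet self-contained.

```python
# Minimal sketch of a baseline cross-validation check (illustrative only).
# Assumption: features are already extracted into X (one row per track)
# with genre labels in y; synthetic data is used here as a placeholder.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder for the real feature matrix and label vector.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=4, random_state=0)

baseline = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")
print("fold accuracies:", np.round(scores, 3))
print("mean CV accuracy:", scores.mean())
```

On the real training data, this kind of setup is what gives me the ~90% figure mentioned above.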
Now I am wondering whether you observe something similar (which would mean I do not have a bug). On the other hand, I can hardly believe that the contest was designed with a completely non-representative training dataset?!
Any information or help would be highly appreciated.

Kind regards,
Fabian