Questions, answers, discussions related to the FIND Technologies Inc. challenge

Postby bad_guy_2 »

Right now nobody wants to work on this because 1) the data are incredibly noisy and 2) the target accuracy is unrealistic.

That is true. But my main reservations come from the low credibility of this contest.

1) Look at what the contestants are supplied with: reports that are confusing (different data, and it is far from clear what they do and why), and claims whose validity is unclear ("coherences/consistency in the data"). No reasonable, credible clues are given.

2) Leaderboard evaluation. I think it is a well-known and well-established practice for such contests to evaluate the contestants on blind data. Usually people are asked to classify unknown data, of which a portion (30%) is used for computing the leaderboard results, while the classification result on the entire ensemble is undisclosed. In contrast to this, the organisers use **known data** for the leaderboard, so everyone can overfit (knowingly or not), and the leaderboard results are then a complete mess.
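For readers unfamiliar with that practice, here is a minimal sketch of a blind leaderboard. Everything in it is an assumption made for illustration: the function names, the 500-item set, the labels and the constant submission are all made up, and only the 30% public fraction comes from the description above.

```python
import random


def split_public(n_items, public_fraction=0.3, seed=0):
    """Pick a fixed random subset of item indices for the public leaderboard."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_public = round(n_items * public_fraction)
    return sorted(indices[:n_public])


def accuracy(y_true, y_pred, indices):
    """Fraction of correct predictions over the given item indices."""
    indices = list(indices)
    hits = sum(1 for i in indices if y_true[i] == y_pred[i])
    return hits / len(indices)


# Hypothetical hidden labels, known to the organisers only.
hidden_labels = [i % 3 for i in range(500)]
# Hypothetical submission: a contestant who always predicts class 0.
submission = [0] * 500

public_idx = split_public(len(hidden_labels))
public_score = accuracy(hidden_labels, submission, public_idx)  # shown during the contest
final_score = accuracy(hidden_labels, submission, range(len(hidden_labels)))  # kept undisclosed
print(f"public: {public_score:.3f}")
```

Because contestants never see the labels and only see the score on the public subset, tuning against the leaderboard cannot fit the full set directly.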

Postby tritical »

After looking at this data some more, I'm starting to think there is actually something there which discriminates the substances. The problem is that it's impossible to accurately detect it using only a single two second sample (data file) because of the noise variance. Specifically, the PSD (power spectral density) of the signals from the materials appears to be different. If you try to estimate the PSD from a single sample, though, the variance of the estimate is much, much bigger than the small changes induced on the background PSD by the materials. If you instead estimate the PSD using 'N' independent samples, the variance of the estimate is reduced by a factor of N, so once N is large enough the noise can be overcome, allowing more accurate discrimination.
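The factor-of-N claim can be checked on synthetic data. The sketch below is not run on the contest files: it assumes unit-variance white Gaussian noise, short 64-point records and an arbitrary bin, and all function names are made up for the illustration. It compares the variance of a single-record power estimate at one bin with the variance of the average over 16 records.

```python
import cmath
import math
import random
import statistics


def bin_power(samples, k):
    """Periodogram value |X[k]|^2 / n at a single DFT bin k."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
              for t, x in enumerate(samples))
    return abs(acc) ** 2 / n


def averaged_bin_power(rng, n_records, n_samples=64, k=5):
    """Average the bin power over n_records independent white-noise records."""
    total = 0.0
    for _ in range(n_records):
        record = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        total += bin_power(record, k)
    return total / n_records


def estimate_variance(rng, n_records, trials=400):
    """Empirical variance of the averaged estimate across repeated trials."""
    estimates = [averaged_bin_power(rng, n_records) for _ in range(trials)]
    return statistics.variance(estimates)


rng = random.Random(0)
var_single = estimate_variance(rng, n_records=1)
var_averaged = estimate_variance(rng, n_records=16)
print(f"variance ratio (1 record vs 16): {var_single / var_averaged:.1f}")
```

With 400 trials the printed ratio is itself noisy, but it should land in the neighbourhood of 16, which is the 1/N behaviour described above.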

I ran a simple test which consisted of splitting the train1000 and test500 sets randomly into groups of 'K' files (each group containing K data files from the same substance), and then classifying each group by estimating the PSD using the K independent measurements (just take the FFT of each sample and average the magnitudes). For simplicity, I did binary classification: substance 1 or 2 (class 0) vs substance 3 (class 1). I did the classification using a threshold on a single bin from the PSD (one which I found to give good separation/consistency via cross-validation on the training set). That bin was 11745, which corresponds to a center frequency of 5872.5 KHz ((16384/32768)*11745). I obviously have no idea why that frequency would be important, but for some reason it seems to discriminate substance 3 from 1 and 2 pretty well.

Below are the classification accuracies, calculated as: (#_correctly_predicted_0/#_true_class_0 + #_correctly_predicted_1/#_true_class_1)*0.5 on the train/test sets respectively as K is increased from 1 to 22. These numbers are averaged over 5 different runs (random groupings of the data):

K     train     test
01   0.5915    0.5959
02   0.6199    0.6452
06   0.7136    0.7241
10   0.7545    0.7719
14   0.8217    0.8091
18   0.8278    0.8556
22   0.8967    0.8857
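For anyone who wants to play with the idea, here is a rough sketch of the procedure on synthetic numbers. It does not touch the contest files and is not the exact code behind the table: each "file" is reduced to one made-up magnitude at the chosen bin, the class shift of 0.5 and the file counts are arbitrary assumptions, and the threshold is simply fixed at the midpoint instead of being picked by cross-validation. All function names are invented for the example. `bin_center` just evaluates the (16384/32768)*11745 expression from the post, and `balanced_accuracy` is the accuracy formula given above.

```python
import random


def bin_center(bin_index, sample_rate=16384, fft_length=32768):
    """Center frequency of an FFT bin, in the same unit as sample_rate."""
    return sample_rate / fft_length * bin_index


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls for classes 0 and 1."""
    recalls = []
    for cls in (0, 1):
        preds = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(1 for p in preds if p == cls) / len(preds))
    return 0.5 * sum(recalls)


def group_means(values, k, rng):
    """Shuffle the per-file magnitudes and average them in groups of k."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    n_groups = len(shuffled) // k
    return [sum(shuffled[g * k:(g + 1) * k]) / k for g in range(n_groups)]


def simulate_run(k, rng, shift=0.5, n_files=330):
    """One random grouping on synthetic magnitudes (assumed, not contest data)."""
    class0 = [rng.gauss(1.0, 1.0) for _ in range(2 * n_files)]      # substances 1 and 2
    class1 = [rng.gauss(1.0 + shift, 1.0) for _ in range(n_files)]  # substance 3
    threshold = 1.0 + shift / 2  # fixed midpoint, a simplification
    y_true, y_pred = [], []
    for cls, values in ((0, class0), (1, class1)):
        for mean in group_means(values, k, rng):
            y_true.append(cls)
            y_pred.append(1 if mean > threshold else 0)
    return balanced_accuracy(y_true, y_pred)


rng = random.Random(0)
for k in (1, 6, 22):
    mean_acc = sum(simulate_run(k, rng) for _ in range(20)) / 20
    print(f"K={k:2d}  balanced accuracy ~ {mean_acc:.3f}")
```

The point of the sketch is only the trend: a small, fixed difference between classes that is nearly invisible for K=1 becomes easier to threshold as more files are averaged per group.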

So I guess my question to the organizers is: would making multiple measurements of a material before trying to classify it be possible?

Postby bad_guy_2 »

Cool result! Looks like a true milestone in this competition - so far we had no evidence of any possibility to classify better than chance.

Postby franklabella »

tritical wrote: So I guess my question to the organizers is: would making multiple measurements of a material before trying to classify it be possible?

We had not considered that possibility, but we would need to see how many measurements, and how much time, it would take to classify the material.

Frank LaBella
The Organizers

Postby jeremie »

Your observation is related to the fact that around 200 of the class C records have a higher PSD. I think this is due to bad data collection, because they are mainly in the second half of the observations.

To illustrate, here is a graph of the PSD (no normalization) with the class shown as a color (A: red, B: green, C: blue):


How is the temperature of the room controlled? Did something change during the acquisition of the class C samples?

It is true that the distribution of the PSD over the records is not Gaussian (but it is not multimodal either). It could be a log-normal distribution.
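One quick way to probe the log-normal guess is to compare the skewness of the values before and after taking the logarithm: if the values are log-normal, their logs are Gaussian and the skewness should drop to about zero. This is only a sketch on synthetic numbers (drawn with `random.lognormvariate`, not taken from the contest records), and the `skewness` helper is written for the example.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)


rng = random.Random(0)
# Synthetic stand-in for per-record PSD values (not the contest data).
psd_values = [rng.lognormvariate(0.0, 0.5) for _ in range(5000)]

raw_skew = skewness(psd_values)
log_skew = skewness([math.log(v) for v in psd_values])
print(f"skewness raw: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

On the real records the same comparison would only be suggestive; a skewness near zero after the log is consistent with a log-normal shape but does not prove it.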

To me these data are random.
Edit: in fact this is not random in the sense of observing pure noise. It measures something (with some noise), but the measurement is not related to the class. Something is moving during the acquisition (and there is no such behaviour in the standard deviation). If there is no external influence, then it should be possible to separate the classes with longer observations.

Edit2: Note that the file order I used is alphabetical, so the 99 of each class are not in their 'natural' position.
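For reference, a small sketch of how alphabetical and numeric ("natural") ordering differ. The file names below are hypothetical, since the real naming scheme is not given in this thread, and `natural_key` is a helper written for the example.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so numbers compare by value."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


# Hypothetical file names; the contest's real naming scheme is an assumption.
names = ["A_100.dat", "A_99.dat", "A_2.dat", "A_10.dat"]

alphabetical = sorted(names)
natural = sorted(names, key=natural_key)
print(alphabetical)
print(natural)
```

Sorting with such a key before plotting would put the records back in their numeric order, assuming the numbers in the names follow acquisition order.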


