random data?

Questions, answers, discussions related to the FIND Technologies Inc. challenge

Re: random data?

Postby bad_guy_2 » Thu May 19, 2011 1:30 pm

Right now nobody wants to work on this because 1) the data are incredibly noisy and 2) the target accuracy is unrealistic.


jake:
That is true. But my main reservation is the low credibility of this contest.

1) Look at what the contestants are supplied with: some confusing reports (different data, far from clear what they did and why), and claims whose validity is unclear ("coherences/consistency in the data"). No reasonable and credible clues are given.

2) Leaderboard Evaluation. It is a well-known and well-established practice in such contests to evaluate the contestants on blind data: people are asked to classify unknown data, of which a portion (say 30%) is used to compute the leaderboard results, while the classification result on the entire ensemble stays undisclosed. In contrast, the organisers here use **known data** for the leaderboard, so everyone can overfit (knowingly or not), and the leaderboard results are then a complete mess.
bad_guy_2
 
Posts: 4
Joined: Mon May 16, 2011 10:52 am

Re: random data?

Postby tritical » Tue May 24, 2011 11:56 pm

After looking at this data some more, I'm starting to think there is actually something there which discriminates the substances. The problem is that it's impossible to detect it reliably from a single two-second sample (data file) because of the noise variance. Specifically, the PSD (power spectral density) of the signals appears to differ between materials, but if you estimate the PSD from a single sample, the variance of the estimate is much, much bigger than the small changes the materials induce on the background PSD. If instead you estimate the PSD from 'N' independent samples, the variance of the estimate is reduced by a factor of N... so once N is large enough, the noise can be overcome, allowing more accurate discrimination.
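The 1/N variance reduction from averaging periodograms is easy to check numerically. Here is a minimal sketch on white noise (a stand-in for the real files, which I don't reproduce here); the sample length and bin index are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def periodogram(x):
    """Raw single-file PSD estimate: magnitude-squared FFT (very noisy)."""
    return np.abs(np.fft.rfft(x)) ** 2 / len(x)

def averaged_psd(files):
    """Average the per-file periodograms over N independent files."""
    return np.mean([periodogram(x) for x in files], axis=0)

def sim_files(n_files, n_samp=4096):
    # White-noise stand-in for a data file; the true PSD is flat.
    return [rng.standard_normal(n_samp) for _ in range(n_files)]

# Variance of one PSD bin across 400 trials, with N=1 vs N=16 files averaged.
bin_idx = 100
v1 = np.var([averaged_psd(sim_files(1))[bin_idx] for _ in range(400)])
v16 = np.var([averaged_psd(sim_files(16))[bin_idx] for _ in range(400)])
print(v1 / v16)  # roughly 16: variance shrinks by the number of files averaged
```

The ratio of the two variances comes out close to the number of files averaged, which is exactly why grouping measurements should make the small PSD differences detectable.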

I ran a simple test: split the train1000 and test500 sets randomly into groups of 'K' files (each group containing K data files from the same substance), then classify each group by estimating the PSD from the K independent measurements (just take the FFT of each sample and average the magnitudes). For simplicity, I did binary classification: substance 1 or 2 (class 0) vs substance 3 (class 1), using a threshold on a single bin of the PSD (one I found to give good separation/consistency via cross-validation on the training set). That bin was 11745, which corresponds to a center frequency of 5872.5 kHz ((16384/32768)*11745). I obviously have no idea why that frequency would be important, but for some reason it seems to discriminate substance 3 from 1 and 2 pretty well. Below are the classification accuracies, calculated as (#_correctly_predicted_0/#_true_class_0 + #_correctly_predicted_1/#_true_class_1)*0.5, on the train/test sets respectively as K is increased from 1 to 22. These numbers are averaged over 5 different runs (random groupings of the data):

Code:
K     train     test
01   0.5915    0.5959
02   0.6199    0.6452
06   0.7136    0.7241
10   0.7545    0.7719
14   0.8217    0.8091
18   0.8278    0.8556
22   0.8967    0.8857
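The grouping-and-averaging procedure above is easy to prototype. Below is a toy sketch, not the contest data: I plant a tiny extra tone at an arbitrary bin (bin 50, amplitude 0.06) in one class, which is an invented stand-in for whatever the real PSD difference is, and threshold that single bin exactly as described:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the contest files: class 1 carries a tiny tone at bin 50
# (random phase) that a single file's FFT cannot resolve from the noise.
def make_file(cls, n=1024, bump=0.06):
    x = rng.standard_normal(n)
    if cls == 1:
        t = np.arange(n)
        x += bump * np.cos(2 * np.pi * 50 * t / n + rng.uniform(0, 2 * np.pi))
    return x

def group_psd(files):
    """The estimator described above: FFT each file, average the magnitudes."""
    return np.mean([np.abs(np.fft.rfft(f)) for f in files], axis=0)

def accuracy(K, n_groups=200):
    feats, labels = [], []
    for _ in range(n_groups):
        cls = int(rng.integers(0, 2))
        feats.append(group_psd([make_file(cls) for _ in range(K)])[50])
        labels.append(cls)
    feats, labels = np.array(feats), np.array(labels)
    thresh = np.median(feats)  # crude threshold; the classes are balanced
    return float(np.mean((feats > thresh) == labels))

a1, a16 = accuracy(1), accuracy(16)
print(a1, a16)  # accuracy climbs with K, as in the table above
```

The qualitative behaviour matches the table: near-chance accuracy on single files, and steadily better separation as more files are averaged per group.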


So I guess my question to the organizers is: would making multiple measurements of a material before trying to classify it be possible?
tritical
 
Posts: 4
Joined: Wed Dec 23, 2009 9:13 am

Re: random data?

Postby bad_guy_2 » Thu May 26, 2011 11:03 am

tritical:
Cool result! This looks like a true milestone for the competition: so far we had no evidence of any possibility to classify better than chance.
bad_guy_2
 
Posts: 4
Joined: Mon May 16, 2011 10:52 am

Re: random data?

Postby franklabella » Mon May 30, 2011 3:28 pm

tritical wrote:So I guess my question to the organizers is: would making multiple measurements of a material before trying to classify it be possible?

We never considered that possibility, but we would need to see how many measurements would be needed and how long it would take to classify the material.

Frank LaBella
The Organizers
franklabella
 
Posts: 13
Joined: Thu May 12, 2011 12:08 am

Re: random data?

Postby jeremie » Wed Jun 22, 2011 10:53 pm

Your observation is related to the fact that around 200 of the class C data files have a higher PSD. I think this is due to bad data collection, because they fall mainly in the second half of the observations.
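One way to make this kind of drift concrete: compute the total power of each file in acquisition order and compare the first half of the run with the second. A synthetic sketch (the 30% gain step and file counts are invented stand-ins, not values measured from the contest data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in: 400 class-C "files"; the second half of the acquisition run
# picks up extra gain, mimicking the drift described above.
n_files, n_samp = 400, 1024
powers = []
for i in range(n_files):
    gain = 1.0 if i < n_files // 2 else 1.3  # assumed mid-run drift
    x = gain * rng.standard_normal(n_samp)
    powers.append(np.mean(x ** 2))           # total power = PSD integral
powers = np.array(powers)

# Simple drift check: mean power in the first vs second half of the run.
first, second = powers[:n_files // 2], powers[n_files // 2:]
print(first.mean(), second.mean())  # second ≈ 1.69 × first (the 1.3 gain, squared)
```

On the real files the same comparison (in acquisition order, if it can be recovered) would show whether the high-PSD class C records really cluster in one part of the run.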

To illustrate it, here is a graph of the PSD (no normalization) with class shown as a color (A: red, B: green, C: blue):

[Image: per-record PSD for each file, colored by class]

How is the temperature of the room controlled? Did something change during the acquisition of the class C samples?

It is true that the PSD distribution over the records is not Gaussian (though not multimodal either). It could be a log-normal distribution.
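A quick way to probe the log-normal hypothesis (a sketch on synthetic draws, not the contest records): a log-normal sample is strongly right-skewed, and taking logs should bring the skewness close to zero if the hypothesis holds.

```python
import numpy as np

rng = np.random.default_rng(2)

def skewness(x):
    """Third standardized moment of a sample."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    return float(np.mean((x - m) ** 3) / x.std() ** 3)

# Synthetic "per-record PSD values" drawn log-normal; if the real records
# follow the same law, their logs should look roughly Gaussian.
psd_vals = rng.lognormal(mean=0.0, sigma=0.8, size=5000)
print(skewness(psd_vals), skewness(np.log(psd_vals)))
# strongly skewed before the log, near zero after
```

Applied to the real per-record PSD values, a skewness that collapses after a log transform (plus, say, a Q-Q plot of the logs) would support the log-normal reading.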

To me these data are random.
Edit: in fact they are not random in the sense of observing pure noise. The setup measures something (with some noise), but the measurement is not related to the class. Something is drifting during the acquisition (and there is no such behaviour in the standard deviation). If there is no external influence, then it should be possible to separate the classes with longer observations.

Edit2: Note that the order I used for the files is alphabetical, so the 99 of each class are not in their 'natural' position.
jeremie
 
Posts: 1
Joined: Sun Jun 19, 2011 2:39 pm


Return to Materials Identification Based on Measurements of Electromagnetic Radiation
