
random data?

PostPosted: Sun May 15, 2011 10:35 pm
by jake
I'm going out on a limb here, but I think the data might be random noise. So far, none of the credible attempts have broken through the ~33% accuracy expected of a random classifier, and my own (unsubmitted) attempts haven't done any better. The sponsors of the challenge claim 79% accuracy with their own method, though they offer no details on how this was achieved. I'm starting to suspect that the sponsors know this sensor doesn't produce signal, that they don't even have a classifier that performs at 79%, and that they're hoping some crowdsourced computational wizardry can save their investment.

I think it would behoove the sponsors to be more forthright and release details on how their baseline classifier works, so that the people working on this know that they're not wasting their time trying to predict on data that has no usable signal.

Re: random data?

PostPosted: Sun May 15, 2011 11:44 pm
by tgflynn
I second this sentiment.

Nothing I have seen so far, including the STFT plots, suggests that this data is anything more than Gaussian white noise.

Publishing the organizers' classification algorithm, assuming that it really gives 75% correct classification, would assuage concerns that this is a wild goose chase, as well as provide a more reasonable baseline for improvement.

Re: random data?

PostPosted: Mon May 16, 2011 12:27 am
by esrauch
I did some experiments with 1NN DTW, which is one of the only methods the literature I looked at considers consistently effective for time-series classification (though it is computationally slow), but the results were only in the 40-50% range on the tests I ran. I really doubt it's just random noise (though I suppose it's conceivable that the 1NN DTW run came out that much higher by coincidence, but that would be pretty unlikely given the sample size I used). I also looked at a lot of other features of the data as candidates for weak learners, but nothing else seemed relevant at all.
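For anyone who wants to try the same approach, here is a minimal, unoptimized sketch of 1NN DTW classification. The series and labels below are toy data, not the contest files; a real run would load the .lvm series instead.

```python
# 1-nearest-neighbor classification with Dynamic Time Warping distance.
# Toy example only; the contest series are far longer, so an O(n*m) DTW
# per pair is exactly the "computationally slow" part mentioned above.

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with squared point-wise cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # dp[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                  dp[i][j - 1],      # deletion
                                  dp[i - 1][j - 1])  # match
    return dp[n][m]

def classify_1nn_dtw(query, train_series, train_labels):
    """Label of the training series with the smallest DTW distance."""
    dists = [dtw_distance(query, s) for s in train_series]
    return train_labels[dists.index(min(dists))]

train = [[0, 1, 2, 1, 0], [0, 2, 4, 2, 0], [5, 5, 5, 5, 5]]
labels = ["A", "A", "B"]
print(classify_1nn_dtw([0, 1, 3, 1, 0], train, labels))  # → A
```

A lower-bounding scheme or a warping-window constraint would be the usual way to make this tractable on 1,500 training series.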

Everyone who is openly talking about it here and on reddit seems to hold the opinion that this contest is either impossible, or so difficult that no one will come anywhere near solving it, especially since the vagueness of the problem specification makes it impossible to do anything other than apply naive algorithms (no actual domain-specific physics knowledge can be used here). I can't really speak to that...

If their engineers had actual knowledge of the problem domain and only achieved ~75%, it's unlikely that anyone without that knowledge would be able to do much better, especially given how noisy the supplied data is.

Re: random data?

PostPosted: Mon May 16, 2011 11:56 am
by BwanaKubwa
Where is the reddit thread? I didn't find it. The only thing about the company I found was this: "early stage" means this is probably the first potential product being worked on, funded by some venture capital.
I agree that we need assurance that this is not just a cheap way for the company to find out whether their sensor data contains any information at all. Because, for a startup, that seems like a great idea: you only have to pay if somebody exceeds that 70-something % success rate (which could be a minimal threshold for making the sensor commercially usable). If somebody reaches that threshold, you happily pay 45000 because it's worth it; if not, you pay nothing and get, with high reliability, a test of whether your sensor delivers anything more than noise. Please, Dr Labella, do you see a possibility of countering such suspicions?

Re: random data?

PostPosted: Mon May 16, 2011 12:18 pm
by noli

I wanted to try 1NN DTW. Would you mind pointing me to links or code samples of an implementation?

In exchange, I could offer some info I found on the problem domain, along with discoveries from looking at the data in LabVIEW.

Re: random data?

PostPosted: Mon May 16, 2011 9:32 pm
by jamesxli
If the data seems random, that could also be an indication that you have the wrong model or method.
And being emotional while doing analysis is not exactly constructive.

I have plotted the value density (after subtracting the Gaussian baseline) of the three material classes. From a statistical point of view, it does not look impossible to distinguish the three classes with an algorithm:

Value density of three materials. (attachment: ValueDensity.png)

Re: random data?

PostPosted: Mon May 16, 2011 11:05 pm
by leec
By value densities, do you mean a histogram? I am not familiar with the phrase "value density".
If it is a histogram, then the chart you show looks much like random data with a nice Gaussian distribution; the humps are likely due to the device's discrete value bucketing on the DAC.
I suspect that people's skepticism is growing as everyone's submission sits right around the slightly-better-than-random-guessing mark!
Looking at the data, I still find no overly prevalent principal components in the test1000 set, nor any discrete peaks in the Fourier transforms of various samples.
Not to mention there has been no response here in the forums from the contest posters. This would be an incredibly unprofessional use of the TunedIT community.

Re: random data?

PostPosted: Mon May 16, 2011 11:21 pm
by byronknoll
jamesxli wrote: I have plotted the value density (after de-gaussian the baseline) of the three material classes. It does not look like impossible to distinguish the three classes with an algorithm from statistical point of view:

What do you mean by "de-gaussian the baseline"? I don't think you can conclude the classes are separable based on the visualization you attached. Since you are plotting only three lines, I assume each line is the mean of the training datapoints in each material class? The problem is that there might be a large variance in the data within each class. For example, I suspect that if you plotted the 1,500 training points (instead of just the three averages) using the exact same visualization technique, the data would no longer appear separable.

Re: random data?

PostPosted: Mon May 16, 2011 11:40 pm
by jake
I just found this: ... ignatures/

In our laboratory we now rely on pattern recognition of images generated from the signals acquired from the sensor.

So if I'm reading this correctly, they opened the data in something like LabVIEW, plotted it, took screenshots, and applied an image recognition algorithm to the screenshots? If this is indicative of the methodology they were relying on, I have serious doubts that the 79% figure attached to the challenge is grounded in reality.

I spent my MS developing a biosensor, and I also used machine learning approaches to classify samples based on the sensor's output. But I can tell you: if my sensor had produced raw data like what we're seeing here, I would have taken the hardware straight back to the drawing board. There's something wrong with the underlying principle of a sensor if scores of analysts and statisticians can't find a signal in its data.

I really hope someone can step forward with some information that proves that I'm just being paranoid, but that's looking less likely.


Re: random data?

PostPosted: Tue May 17, 2011 8:20 am
by MJH
Frank Labella has a couple of patents regarding detection of electromagnetic fields. I'm guessing the sensor has something to do with these patents. This is the link for the patents in case anyone is interested. I didn't read them but they might be helpful in figuring out if this whole thing is worth working on or not, especially if the company is holding back information.

Re: random data?

PostPosted: Tue May 17, 2011 4:05 pm
by bad_guy_2
Yes, it seems like random data...

1) Looking at the data, you do not see anything special (no discriminative features).
2) Spectral analysis confirms that there are no dominant frequencies; the spectrum is very flat for all three classes.
3) I converted a number of the samples to audio and listened to them. They all sound the same (like the same noise).
4) I see no dominant spikes in the data.
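Point 2 is easy to check numerically. The sketch below compares the periodogram's peak power to its median power; white noise gives a modest ratio, while even a weak buried tone gives an enormous one. The synthetic signals and the 440 Hz tone are illustrative stand-ins, not contest data.

```python
# Flat-spectrum check: periodogram peak relative to median bin power.
import numpy as np

def peak_to_median_ratio(x):
    """Periodogram peak power divided by median power (DC bin excluded).

    A modest ratio means a flat, noise-like spectrum; a ratio that is
    orders of magnitude larger means a dominant frequency is present."""
    spectrum = np.abs(np.fft.rfft(x - np.mean(x))) ** 2
    spectrum = spectrum[1:]  # drop the DC bin
    return spectrum.max() / np.median(spectrum)

rng = np.random.default_rng(0)
fs = 16000                                         # the contest sampling rate
noise = rng.normal(0, 0.004, fs)                   # one second of Gaussian noise
t = np.arange(fs) / fs
tone = noise + 0.05 * np.sin(2 * np.pi * 440 * t)  # noise plus a buried 440 Hz tone

print(peak_to_median_ratio(noise))  # modest: flat spectrum
print(peak_to_median_ratio(tone))   # orders of magnitude larger
```

Running this on the actual .lvm series would make "the spectrum is very flat" a quantitative statement rather than a visual impression.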

This is, of course, not an exhaustive analysis, but I wonder what kind of magic we are supposed to employ when there is nothing there. I do not want to sound negative, but taking further features of this competition into account (the company does not specify anything else, e.g. how the data were measured, what the underlying physical process is, or how they approach the classification problem themselves), I am inclined to be very sceptical about their claim of 79%. I see no reason for them not to tell us how to achieve even 45% accuracy. It looks as if they do not have a clue themselves and are just fishing in the crowd.

Re: random data?

PostPosted: Tue May 17, 2011 5:42 pm
by leec
So maybe we can assume there is some noise floor, below some value/frequency, where we reject data. It would be nice to know what this is expected to be based on the device, and perhaps the device's saturation limits as well.

Re: random data?

PostPosted: Tue May 17, 2011 6:14 pm
by jamesxli
The picture I posted above is basically a histogram of each of the three material classes, with a Gaussian distribution calculated from the complete dataset subtracted. By "histogram" (what I called value density) I mean the normal histogram of the radiation values, smoothed with the usual exponential kernel and then normalized. Subtracting the Gaussian distribution shifts the focus to micro-variations. The picture shows clearly that the three classes are separable from a statistical point of view. I don't want to speculate how far this and other information can be used to improve predictions, but the data is NOT random, as several posts in this thread try to suggest.

I don't want to discredit the effort made by a few people here, still less to join the charged part of this discussion. But I would be rather surprised to see significant progress in such a short time; and, based on my preliminary investigation, I do think the data is real and worth studying in its own right.
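As I understand the description above, the computation looks roughly like the following sketch. The bin layout, the exponential kernel width, and the synthetic class samples are all my assumptions; jamesxli has not published his exact procedure.

```python
# "Value density": smoothed, normalized per-class histogram minus a
# Gaussian fitted to the pooled data, leaving the micro-variations.
import numpy as np

def value_density(x, edges, tau=2):
    """Normalized histogram of x, smoothed with an exponential kernel."""
    hist, _ = np.histogram(x, bins=edges, density=True)
    k = np.exp(-np.abs(np.arange(-5, 6)) / tau)  # assumed kernel width
    return np.convolve(hist, k / k.sum(), mode="same")

edges = np.linspace(-0.02, 0.02, 101)
centers = (edges[:-1] + edges[1:]) / 2

rng = np.random.default_rng(1)
# Synthetic stand-ins for the three material classes
classes = {c: rng.normal(0, 0.004 + i * 0.0002, 20000)
           for i, c in enumerate("ABC")}

# Gaussian fitted to the pooled data, evaluated at the bin centers
pooled = np.concatenate(list(classes.values()))
mu, sigma = pooled.mean(), pooled.std()
gauss = (np.exp(-((centers - mu) ** 2) / (2 * sigma ** 2))
         / (sigma * np.sqrt(2 * np.pi)))

residuals = {c: value_density(x, edges) - gauss for c, x in classes.items()}
```

Whether curves produced this way actually separate the classes depends, as byronknoll points out, on the within-class variance, not just on the class means.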

Re: random data?

PostPosted: Tue May 17, 2011 7:54 pm
by franklabella
Dear All,

This problem was studied by the National Research Council, Informatics Group. With three different sets of data, they obtained about 90, 80, 70% classification accuracy with their 'proprietary' algorithm. We have their reports with methods and results in detail. We disclose them here for your information:

The Report

Some remarks:

1. I presume that most of the analyses performed on our data thus far were focused on the frequency domain. It should be noted that the high-accuracy analyses by the NRC Informatics Group focused specifically on the time domain, i.e. signature features of a given substance as a function of the time of data acquisition after exposure of the sensor to the material of interest.

2. In the most recent analysis of our data, the NRC Group told us in advance that they would try to improve their accuracy by making classifications on the basis of frequencies. We had predicted that the algorithms chosen, like wavelets, would be inappropriate. Indeed, their accuracy was in the 0.3 range, i.e. equivalent to random guessing and comparable to the values on the Leaderboard.

3. Have you looked at the Sigma Plots of the contest data?
There are many consistent features across the different rows of panels: the size of 'peaks', the degree of uniformity of the various peak sizes, the regularity and distribution of unusually large peaks within a given row of images, and the overall coherence of the pattern across each image of a row and across all five panels in the row, among others. That the patterns for a given row (substance) are consistent across the five panels (representing signatures across the whole period of data acquisition) implies a coherence in the signal signatures. I would expect image analysis procedures to dissect these types of differences, and others, to provide a digital equivalent of the images.

Let me point out that the imposition of coherence by exposure of the sensor to a substance is seen in every instance, and immediately. We noted this with the original analog output of the sensor, i.e. a paper chart recorder, and it is characteristic of "stochastic resonance", which appears to be the basic mechanism underlying our observations: a very weak signal can only be detected through its modulation of the background noise. Indeed, this is a characteristic of the phenomenon: a weak signal imposes coherence on the background noise pattern and, as a consequence, reduces the noise.

Frank LaBella
The Organizers

Re: random data?

PostPosted: Tue May 17, 2011 9:14 pm
by tritical
Is the 79% claim coming from the report? Because it considers three two-class problems (not one three-class problem), with best classification accuracies of 79, 66, and 78%. Obviously, if you attempt three-class classification with those two-class accuracies, you would not come anywhere close to 79%, although you would still expect higher accuracy than what has been achieved so far on the Leaderboard. In any case, the methods they used are nothing special: they achieved binary classification accuracies of 76, 66, and 75% using wavelet coefficients as features and a linear discriminant. If so, there must be a significant difference between the data they worked with and what is being used for this contest, or their single train/test split was not a reliable estimate of the classification accuracy.
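A rough way to see why three-class accuracy falls well short of the best pairwise number is to simulate a one-vs-one voting scheme. The sketch below assumes the reported accuracies map to the three class pairs as shown, and that a pairwise classifier picks randomly when the true class is outside its pair; both are assumptions, since the report does not describe how (or whether) its binary classifiers would be combined.

```python
# Monte Carlo estimate of three-class accuracy from three pairwise
# (one-vs-one) classifiers combined by majority vote.
import random

PAIR_ACC = {(0, 1): 0.79, (0, 2): 0.66, (1, 2): 0.78}  # assumed pairing

def pairwise_vote(true_class, pair, acc, rng):
    """One binary classifier's vote. If the true class is not in its
    pair it must still pick one side; assume it picks at random."""
    i, j = pair
    if true_class in pair:
        other = j if true_class == i else i
        return true_class if rng.random() < acc else other
    return rng.choice(pair)

def simulate(n_trials=100_000, seed=0):
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        true_class = rng.randrange(3)
        votes = [0, 0, 0]
        for pair, acc in PAIR_ACC.items():
            votes[pairwise_vote(true_class, pair, acc, rng)] += 1
        top = max(votes)
        winners = [c for c in range(3) if votes[c] == top]
        if rng.choice(winners) == true_class:  # break ties at random
            correct += 1
    return correct / n_trials

print(simulate())  # well below the best pairwise accuracy of 0.79
```

Under these assumptions the combined accuracy lands in the low 60s, which supports the point: the 79% binary figure is not an expected three-class performance.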

Re: random data?

PostPosted: Wed May 18, 2011 3:13 am
by t_falk
The NRC report is also based on a validation dataset of 300 samples (100 from each class), not the 500 we have been instructed to use. Their data was also recorded at 1/4 of the sample rate used here (4 kHz as opposed to 16 kHz), and thus discards a good amount of the noise in the 2-8 kHz range.

Re: random data?

PostPosted: Wed May 18, 2011 12:17 pm
by bad_guy_2
I am pleased that the organisers released additional information.

However I am still confused about how their words should be interpreted.

Regarding Point 3, which Mr Labella mentioned ("Have you looked at the Sigma Plots of the contest data?"): I am unable to observe any within-class consistency/coherence. Can anybody on this board?

Regarding the NRC report, I do not understand the reasoning behind what they call "class averages" (see the figure at the top of page 4).
For each measurement time, they seem to average the data across different samples within the class.
1. Why?
I thought the measurement was time-shift invariant (it is passive, so it should not matter when I start my 2 s measurement).
What sense does it make, then, to average at **specific time** instants?
2. Reproducibility.
I certainly do not get curves like those in the figure. My averages fully overlap, while in their figure the AS curve sits towards the bottom and the WA curve towards the top.

Re: random data?

PostPosted: Wed May 18, 2011 4:24 pm
by tritical
bad_guy_2 wrote: Regarding Point 3. which Mr Labella mentioned ("Have you looked at the Sigma Plots of the contest data?") I am unable to observe any within-class consistency/coherence. Can anybody on this board?

I have been thinking the same thing. To me his comments appear to be wishful thinking, coupled with bias from knowing how the images are arranged. If I didn't know how the images were arranged (and that there were three substances), and someone asked me whether each row or each column consisted of the same substance, I would not be able to tell. A very simple test of whether what he thinks he's seeing is actually real would be to generate two types of frames, each consisting of 3 rows and 5 (or 7, 9, 11, etc.) columns of images. For the first type, randomly select a data file from the entire training set for each image (no correspondence between row or column and substance). For the second type, select data files only from the same substance for each row. If those two types of frames could be distinguished reliably (i.e. in more than one cherry-picked example), it would prove that there is some coherence within substances, and some difference between substances, that can be picked up on. I would also be very surprised! There are many simple variations of this test which could be performed.
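Generating the two frame types is straightforward. In this sketch the sample IDs are placeholders (the actual contest file naming may differ), and rendering the tiles is left out; the point is only the sampling scheme for the blind test.

```python
# Build 3x5 frames of sample IDs: either rows drawn from a single
# substance, or tiles drawn at random from the whole training set.
import random

def make_frame(samples_by_class, same_substance_rows, rng, cols=5):
    """Return a 3 x cols grid of sample IDs."""
    classes = sorted(samples_by_class)
    frame = []
    for row in range(3):
        if same_substance_rows:
            pool = samples_by_class[classes[row]]  # one substance per row
        else:
            pool = [s for c in classes for s in samples_by_class[c]]
        frame.append(rng.sample(pool, cols))
    return frame

rng = random.Random(0)
# Placeholder IDs; substitute the real training file names here.
samples_by_class = {c: [f"{c}_{i:03d}" for i in range(500)] for c in "ABC"}

structured = make_frame(samples_by_class, True, rng)   # rows = substances
shuffled = make_frame(samples_by_class, False, rng)    # no structure
# The two frames would then be rendered and shown, unlabeled, to whoever
# claims to see within-row coherence.
```

If the claimed coherence is real, an observer should beat chance at telling `structured` frames from `shuffled` ones over many repetitions.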

Re: random data?

PostPosted: Wed May 18, 2011 10:18 pm
by jake
A few things seem clear, regardless of the debate over how much signal is truly in this data.

1) the NRC analysts did not evaluate the data we're looking at, so our efforts (and performance) will not be comparable to theirs no matter what we do.
2) the NRC analysis did not treat this as a three-class problem, so again, this is apples and oranges, and 79% (let alone 95%) should not be used as any kind of expected performance in this contest.

I have a hunch that if FIND were truly satisfied with NRC's analysis, and their classifier performed well in practice, we wouldn't be having this contest. They would have contracted further development to NRC to boost performance, since NRC already had experience with the problem.

Right now we have two goals/benchmarks: one that means nothing because it is incomparable to the task we face (79%), and another which I think is unrealistic with this data (the 95% target). I think it might be worth the sponsors' consideration to switch to a milestone-centric prize system. Right now nobody wants to work on this because 1) the data are incredibly noisy and 2) the target accuracy is unrealistic. Switch to a milestone system and you might get closer to your goal.

Re: random data?

PostPosted: Thu May 19, 2011 2:14 am
by robertf
Sadly, I am bailing out; this is far beyond my area of expertise, which is chemistry :) Besides, I have sacrificed enough of my so-called spare time. On the way out, though, I would like to share my latest results; perhaps some of the signal processing experts will find inspiration in them. I picked one of the time series at random (303_C.lvm) and calculated the ergodic mean and standard deviation, obtaining -0.0061021 and 0.0041896, respectively. Next, I generated Gaussian white noise with the same length, sampling rate, and mu/sigma. I then conducted some classical signal analysis on both series using a 327-dimensional autocorrelation matrix. Here is the spectrum of autocorrelation eigenvalues:

(attachment: Eigenvalues.PNG)

the power correlogram:

(attachment: Correlogram.PNG)

and the EV spectra (MUSIC spectra look very similar):

(attachment: EV_spectra.PNG)

To me, 303_C and white noise with the same distribution are indistinguishable, except that in the 303_C case the last ~10 eigenvalues taper down. Whether this has any meaning, I don't know. I see it as an argument that the challenge data are a bunch of white noises. Perhaps the presence of materials, as Prof. LaBella put it, does affect the parameters of the white noise in some way; in that case, statistical noise analysis could yield suitable modeling descriptors. If so, though, FIND Technologies should share 500 time series obtained from the same probe in the _absence_ of any material, so that the background characteristics of the probe can be taken into account.
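For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of the eigenvalue analysis described above: build the 327-dimensional autocorrelation (Toeplitz) matrix of a series and of matched white noise, and compare their spectra. The series here is itself synthetic, standing in for 303_C.lvm, and the series length is an assumption.

```python
# Eigenvalue spectrum of a 327 x 327 autocorrelation Toeplitz matrix.
import numpy as np

def autocorr_eigenvalues(x, dim=327):
    """Sorted eigenvalues of the dim x dim autocorrelation matrix of x."""
    x = x - x.mean()
    # FFT-based linear autocorrelation (zero-padded), lags 0..dim-1
    spec = np.abs(np.fft.rfft(x, 2 * len(x))) ** 2
    acf = np.fft.irfft(spec)[:dim]
    acf = acf / acf[0]  # normalize so lag 0 = 1
    # Toeplitz matrix: entry (i, j) is the autocorrelation at lag |i - j|
    idx = np.abs(np.arange(dim)[:, None] - np.arange(dim)[None, :])
    R = acf[idx]
    return np.sort(np.linalg.eigvalsh(R))[::-1]

rng = np.random.default_rng(0)
n, mu, sigma = 32000, -0.0061021, 0.0041896  # stats robertf reports
series = rng.normal(mu, sigma, n)            # stand-in for 303_C.lvm
noise = rng.normal(mu, sigma, n)             # matched white noise

ev_series = autocorr_eigenvalues(series)
ev_noise = autocorr_eigenvalues(noise)
# For pure white noise both spectra are flat (all eigenvalues near 1);
# real structure would show up as a few dominant eigenvalues.
```

Running this with the real 303_C.lvm in place of `series` would show whether the tapering tail robertf observed exceeds the sampling variation of white noise.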

Good luck,