## Certainty of the data?

Questions, answers, discussions related to RSCTC'2010 Discovery Challenge

### Certainty of the data?

Friends at TunedIT:

Initial analysis of the training data indicates that some class members have noticeably different profiles from the majority of members of their classes.

How certain are the class memberships in the Training data? Should we regard them as "absolutely correct" or "most probably, but not definitely, correct"? And a similar question applies to the secret classes in the Test data.

The same question applies to the variable values. What is their real level of precision? Are they truly precise to 3 decimal places? (For instance, I can measure my height to 3 decimal places, but the true precision of my height is much less, perhaps not even one decimal place.)
Winsteps

Posts: 3
Joined: Thu Dec 03, 2009 2:52 am

### Re: Certainty of the data?

Hi Winsteps,

> How certain are the class memberships in the Training data? Should we regard them as "absolutely correct" or "most probably, but not definitely, correct"? And a similar question applies to the secret classes in the Test data.

The class memberships are as certain as the diagnoses made by the medical doctors who labeled the samples. Depending on the dataset, I would call them "absolutely correct" or "almost absolutely correct". The differences among members of the same decision class are most probably caused by the fact that some medical conditions may have many rare subtypes or variations.

> The same question applies to the variable values. What is their real level of precision? Are they truly precise to 3 decimal places?

The original variable values were more precise. They were rounded to 3 decimal places in order to reduce the size of the text files containing the data.

Andrzej Janusz
janusza

Posts: 3
Joined: Fri Oct 09, 2009 6:45 pm

### Re: Certainty of the data?

> The class memberships are certain to the extent of certainty with which medical doctors diagnosed and labeled the samples.

Just a quick comment here: Judging from previous experience with healthcare-related data, it's quite possible that some samples represent multiple classifications. For example, let's say one classification represents "type 2 diabetes", and another classification represents "hypertension". Co-occurrence of these two diseases is not uncommon, and since they both tend to be progressive, it's quite possible that one will be diagnosed conclusively before the other is ever tested for. It's also possible that the test results for one or the other were borderline, even though the disease is actually present or will be present in the near future. In other words, both disease and diagnosis are very complex.

My conclusion: I will accept that the categories are "almost absolutely correct" (although I've seen many errors in diagnosis), but, unless steps were taken by the diagnosticians who categorized the data to rule out co-occurrence, it seems prudent to be wary of categorical overlap.
clueless

Posts: 1
Joined: Sun Dec 06, 2009 5:30 am

### Re: Certainty of the data?

Thank you, Clueless. Some members of classes have considerably different profiles from the majority of class members. These non-conforming members seem to merit a second opinion, or an alternative or new classification. This could be a useful side-effect of this Challenge: methods to identify new classes.
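One way such non-conforming members might be flagged is sketched below. This is only an illustration under my own assumptions, not a method used in the Challenge: each class profile is taken as the per-variable median, fit is plain Euclidean distance, and a member is flagged when its distance exceeds `factor` times the median distance in its class. The name `flag_nonconforming` and the default `factor=3.0` are made up for the example.

```python
import math
from collections import defaultdict
from statistics import median

def flag_nonconforming(samples, labels, factor=3.0):
    """Return sorted indices of samples that sit unusually far from their class profile.

    Assumptions (hypothetical, for illustration only):
    - the class profile is the per-variable median of the class members;
    - "far" means more than `factor` times the class's median distance.
    """
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    flagged = []
    for idxs in by_class.values():
        # Per-variable median is less distorted by odd members than the mean.
        profile = [median(col) for col in zip(*(samples[i] for i in idxs))]
        dists = [math.dist(samples[i], profile) for i in idxs]
        typical = median(dists)
        if typical == 0:
            continue  # no spread to compare against, so nothing is flagged
        flagged.extend(i for i, d in zip(idxs, dists) if d > factor * typical)
    return sorted(flagged)

# Toy data: the last member of class "A" has a very different profile.
samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05], [5.0, 5.0],
           [10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [10.1, 10.1]]
labels = ["A"] * 6 + ["B"] * 4
suspects = flag_nonconforming(samples, labels)
```

The flagged members would then be the candidates for a second opinion or a new class; the threshold is arbitrary and would need tuning on the real data.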

This also suggests that instead of striving for one profile for each class, we build several profiles for each class. Then the target data-string is matched to all the profiles for each class, and the best fit decides the target's class. In real data, we would probably bias the fit by the prevalence in the population. So we would assign a target to a frequent class rather than a rare class, even if the fit to the rare class was slightly better.
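A rough sketch of this matching idea, under assumptions of my own: the profiles are given as plain vectors, fit is Euclidean distance to the closest profile of each class, and the prevalence bias is applied by adding `-bias * log(prevalence)` to the distance. The function name `classify`, the scoring rule, and the value `bias=0.2` are hypothetical choices, not anything specified for the Challenge.

```python
import math

def classify(target, profiles, prevalence, bias=0.2):
    """Assign `target` to the class with the best prevalence-adjusted fit.

    profiles:   dict mapping class label -> list of profile vectors
    prevalence: dict mapping class label -> share of the population (0 < p <= 1)
    bias:       how strongly prevalence counts (assumed, arbitrary scale);
                bias=0 means the best fit alone decides.
    """
    best_label, best_score = None, math.inf
    for label, class_profiles in profiles.items():
        # Best fit among all the profiles built for this class.
        fit = min(math.dist(target, p) for p in class_profiles)
        # Rare classes get a larger penalty, frequent classes a smaller one.
        score = fit - bias * math.log(prevalence[label])
        if score < best_score:
            best_label, best_score = label, score
    return best_label

# Toy example: two profiles for a frequent class, one for a rare class.
profiles = {"common": [[0.0, 0.0], [4.0, 4.0]], "rare": [[1.0, 1.0]]}
prevalence = {"common": 0.9, "rare": 0.1}

unbiased = classify([0.6, 0.6], profiles, prevalence, bias=0.0)
biased = classify([0.6, 0.6], profiles, prevalence, bias=0.2)
```

In this toy case the target fits the rare class slightly better, so the unbiased call picks the rare class, while the biased call assigns it to the frequent class. A target that matches the rare profile closely is still assigned to the rare class.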
Guest