JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers
Task:
Our team has invested a significant amount of time and effort in gathering a corpus of 20,000 journal articles from the PubMed Central open-access subset. Each of those documents was labeled by biomedical experts from PubMed with several MeSH subheadings, which can be viewed as different contexts or topics discussed in the text. Using our automatic tagging algorithm, which we will describe in detail after completion of the contest, we associated each document with its most related MeSH terms (headings). The competition data consists of the strengths of those associations, expressed as numerical values. Intuitively, they can be interpreted as values of a rough membership function that measures the degree to which a term is present in a given text.
The task for the participants is to devise algorithms capable of accurately predicting the MeSH subheadings (topics) assigned by the experts, based on the association strengths of the automatically generated tags. Each document can be labeled with several subheadings, and this number is not fixed. To ensure that participants who are not familiar with biomedicine, and with the MeSH ontology in particular, have the same chances as domain experts, the names of concepts and topical classifications have been removed from the data. Those names and the relations between data columns, as well as a dictionary translating decision class identifiers into MeSH subheadings, can be provided on request after completion of the challenge.
Data format:
The data set is provided in tabular form as two tab-separated values files, namely trainingData.csv (the training set) and testData.csv (the test set). They can be downloaded only after successful registration for the competition. Each row of those data files represents a single document and contains, in consecutive columns, integers ranging from 0 to 1000 that express association strengths to the corresponding MeSH terms.
Additionally, there is a trainingLables.txt file, whose consecutive rows correspond to entries in the training set (trainingData.csv). Each row of that file is a list of topic identifiers (integers ranging from 1 to 83), separated by commas, which can be regarded as a generalized classification of a journal article. This information is not available for the test set and has to be predicted by participants.
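As an illustration of the format described above, the following Python sketch parses both kinds of files. It is only a sketch under stated assumptions: the function names are hypothetical, the files are assumed to have no header row, and an empty label row is assumed to mean "no topics". File names are used exactly as they are spelled in this description.

```python
from typing import Iterable, List, Set, Tuple


def parse_features(lines: Iterable[str]) -> List[List[int]]:
    """Parse rows of tab-separated association strengths (integers 0..1000)."""
    rows = []
    for line in lines:
        # Assumption: no header row; every line is one document.
        rows.append([int(value) for value in line.rstrip("\r\n").split("\t")])
    return rows


def parse_labels(lines: Iterable[str]) -> List[Set[int]]:
    """Parse rows of comma-separated topic identifiers (integers 1..83)."""
    labels = []
    for line in lines:
        line = line.strip()
        # Assumption: an empty row is read as an empty set of topics.
        labels.append({int(topic) for topic in line.split(",")} if line else set())
    return labels


def load_training(
    data_path: str = "trainingData.csv",
    labels_path: str = "trainingLables.txt",
) -> Tuple[List[List[int]], List[Set[int]]]:
    """Load the training set and check that rows and label rows line up."""
    with open(data_path) as data_file:
        features = parse_features(data_file)
    with open(labels_path) as labels_file:
        labels = parse_labels(labels_file)
    if len(features) != len(labels):
        raise ValueError("training data and labels have different lengths")
    return features, labels
```

The parsers take any iterable of lines, so they can be tried on a few hand-written rows before the real files are downloaded.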
Format of submissions:
The predictions should be submitted in a single text file containing exactly the same number of lines as the test data set. Consecutive lines of this file should contain identifiers of the predicted topics (integers ranging from 1 to 83), separated by commas and without any spaces. The file majorityClasses.txt is an example of a well-formatted submission. It assigns each document to the five most frequently occurring classes.
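A submission in this format can be produced with a few lines of Python. The sketch below also shows one plausible way to build a majority-class baseline of the kind described above. The function names are hypothetical, and sorting the identifiers, removing duplicates, and ending the file with a newline are assumptions, not requirements stated by the organizers.

```python
from collections import Counter
from typing import Iterable, List

NUM_TOPICS = 83  # topic identifiers range from 1 to 83


def top_k_topics(training_labels: Iterable[Iterable[int]], k: int = 5) -> List[int]:
    """Return the k topics that occur most often in the training labels."""
    counts = Counter(topic for labels in training_labels for topic in labels)
    return [topic for topic, _ in counts.most_common(k)]


def format_submission(predictions: Iterable[Iterable[int]]) -> str:
    """Render one comma-separated line per test document, without spaces."""
    lines = []
    for topics in predictions:
        unique = sorted(set(topics))  # assumption: order and duplicates do not matter
        for topic in unique:
            if not 1 <= topic <= NUM_TOPICS:
                raise ValueError(f"topic identifier out of range: {topic}")
        lines.append(",".join(str(topic) for topic in unique))
    return "\n".join(lines) + "\n"
```

For a test set of `n_test` documents, a baseline file could then be written from `format_submission([top_k_topics(training_labels)] * n_test)`.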
Evaluation of results:
The submitted solutions will be evaluated online and the preliminary results will be published on the leaderboard. The preliminary score will be computed on a random subset of the test set, fixed for all participants, which will correspond to approximately 10% of the test data. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published online. It is important to note that only teams that submit a short report describing their approach before the end of the contest will qualify for the final evaluation. The winning teams will be officially announced during a special session devoted to the competition at the JRS 2012 (http://sist.swjtu.edu.cn/JRS2012/) conference.
We can define Precision and Recall of a prediction for a single document:
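Assuming the standard set-based definitions, these can be written as follows, where $T_i$ is the set of topics assigned by the experts to the $i$-th test document and $P_i$ is the set of predicted topics (the symbols are notation introduced here for illustration):

$$
\mathrm{Precision}_i = \frac{|T_i \cap P_i|}{|P_i|}, \qquad
\mathrm{Recall}_i = \frac{|T_i \cap P_i|}{|T_i|}
$$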
Average F-score over the test documents will be defined as:
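Assuming the usual harmonic-mean F-score, and writing $N$ for the number of test documents (notation introduced here for illustration), this can be written as:

$$
F_i = \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}, \qquad
\mathrm{AvgF} = \frac{1}{N} \sum_{i=1}^{N} F_i
$$

The same computation can be sketched in Python for local validation on a held-out part of the training set. This is not the organizers' evaluation code: the function names are hypothetical, and scoring a document as 0 when the prediction and the true topics do not overlap (which also covers empty sets) is an assumption.

```python
from typing import Sequence, Set


def f_score(true_topics: Set[int], pred_topics: Set[int]) -> float:
    """F-score of a single document: harmonic mean of precision and recall."""
    hits = len(true_topics & pred_topics)
    # Assumption: no overlap (including an empty prediction or an empty
    # set of true topics) scores 0 instead of dividing by zero.
    if hits == 0:
        return 0.0
    precision = hits / len(pred_topics)
    recall = hits / len(true_topics)
    return 2 * precision * recall / (precision + recall)


def average_f_score(
    true_sets: Sequence[Set[int]], pred_sets: Sequence[Set[int]]
) -> float:
    """Mean of the per-document F-scores over all documents."""
    if len(true_sets) != len(pred_sets):
        raise ValueError("true and predicted label lists differ in length")
    if not true_sets:
        raise ValueError("no documents to evaluate")
    total = sum(f_score(t, p) for t, p in zip(true_sets, pred_sets))
    return total / len(true_sets)
```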
Intuitively, the average F-score combines precision and recall of predictions over the set of all test documents. In case of ties in the results, the final ranking of participants will be decided based on the time of the submissions.
Please direct all questions to the Forum. We will reply as soon as possible.


