JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers

Status Closed
Type Scientific
Start 2012-01-02 00:00:00 CET
End 2012-03-30 23:59:59 CET
Prize 1,500$

Registration is required.


Our team has invested a significant amount of time and effort in gathering a corpus of 20,000 journal articles from the PubMed Central open-access subset. Each of these documents was labeled by biomedical experts from PubMed with several MeSH subheadings, which can be viewed as different contexts or topics discussed in the text. Using our automatic tagging algorithm, which we will describe in detail after the contest is completed, we associated each document with its most closely related MeSH terms (headings). The competition data consists of the strengths of those associations, expressed as numerical values. Intuitively, they can be interpreted as values of a rough membership function that measures the degree to which a term is present in a given text.
The task for the participants is to devise algorithms capable of accurately predicting the MeSH subheadings (topics) assigned by the experts, based on the association strengths of the automatically generated tags. Each document can be labeled with several subheadings, and their number is not fixed. To ensure that participants who are not familiar with biomedicine, and with the MeSH ontology in particular, have the same chances as domain experts, the names of concepts and topical classifications have been removed from the data. Those names and the relations between data columns, as well as a dictionary translating decision class identifiers into MeSH subheadings, can be provided on request after the challenge is completed.

Data format: The data set is provided in tabular form as two tab-separated values files, namely trainingData.csv (the training set) and testData.csv (the test set). They can be downloaded only after successful registration for the competition. Each row of those data files represents a single document and contains, in consecutive columns, integers ranging from 0 to 1000 that express the association strengths to the corresponding MeSH terms. Additionally, there is a trainingLables.txt file, whose consecutive rows correspond to entries in the training set (trainingData.csv). Each row of that file is a comma-separated list of topic identifiers (integers ranging from 1 to 83), which can be regarded as a generalized classification of a journal article. This information is not available for the test set and has to be predicted by participants.
It is worth noting that, due to the nature of the considered problem, the data sets are high-dimensional: the number of columns roughly corresponds to the size of the MeSH ontology. The data sets are also sparse, since usually only a small fraction of the MeSH terms is assigned to a particular document by our tagging algorithm. Finally, a large number of data columns have few (or even no) non-zero values, because the corresponding concepts are rarely assigned to documents. It is up to participants to decide which of them are still useful for the task.
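As an illustration of the format described above, the following minimal sketch parses one row of a data file into a sparse mapping and one row of the labels file into a list of topic identifiers. The function names are hypothetical, and the sketch assumes that the files contain no header row, which the description does not state.

```python
from typing import Dict, Iterable, List


def parse_data_row(line: str) -> Dict[int, int]:
    """Return {column_index: strength} for the non-zero entries of one row.

    Assumes tab-separated integers in the range 0..1000 and no header row.
    """
    values = line.rstrip("\r\n").split("\t")
    return {j: int(v) for j, v in enumerate(values) if int(v) != 0}


def parse_label_line(line: str) -> List[int]:
    """Return the topic identifiers (1..83) listed on one labels line."""
    stripped = line.strip()
    return [int(t) for t in stripped.split(",")] if stripped else []


def nonzero_column_counts(rows: Iterable[Dict[int, int]]) -> Dict[int, int]:
    """Count, per column, how many documents have a non-zero strength."""
    counts: Dict[int, int] = {}
    for row in rows:
        for j in row:
            counts[j] = counts.get(j, 0) + 1
    return counts
```

Storing only the non-zero entries keeps memory use proportional to the number of assigned terms, not to the full column count. The per-column counts give one simple way to spot columns that are never or almost never used.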

Format of submissions: The predictions should be submitted in a single text file containing exactly the same number of lines as the test data set. Each consecutive line of this file should contain the identifiers of the topics predicted for the corresponding test document (integers ranging from 1 to 83), separated by commas and without any spaces. The file majorityClasses.txt is an example of a well-formatted submission. It assigns each document to the five most frequently occurring classes.
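A baseline in the spirit of majorityClasses.txt could be sketched as follows. The function names are hypothetical, and the tie-breaking rule and the ascending order of identifiers within a line are assumptions, since the description does not specify them.

```python
from collections import Counter
from typing import List, Sequence


def most_frequent_topics(labels: Sequence[Sequence[int]], k: int = 5) -> List[int]:
    """Return the k topic identifiers that occur most often in the labels."""
    counts = Counter(t for doc in labels for t in doc)
    # Rank by descending frequency; break ties by ascending identifier
    # (an assumption made only to keep the result deterministic).
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return sorted(ranked[:k])


def format_submission(predictions: Sequence[Sequence[int]]) -> str:
    """Render one line per document: comma-separated identifiers, no spaces."""
    return "\n".join(",".join(str(t) for t in doc) for doc in predictions) + "\n"
```

To build a baseline file, compute `most_frequent_topics` on the training labels and repeat the result once per test document before passing the list to `format_submission`.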
Apart from submitting the solution file, each participating team is required to send a short report (not exceeding 1000 words) describing their final solution. The report should contain the name of the team, the names of all team members, the last preliminary evaluation score, and a brief overview of the approach used, such as data preprocessing steps, algorithms, parameter tuning techniques, and so on. The reports (preferably in PDF format) should be sent by 02 April, 2012.

Downloads: You must be logged in and registered for this challenge in order to access the files.

  • Training data - an information system containing 10,000 objects and 25,640 attributes
  • Test data - an information system containing 10,000 objects and 25,640 attributes
  • Training decision labels - generalized topical classification of training objects
  • Example solution - an example of a well-formatted submission file

Evaluation of results: The submitted solutions will be evaluated on-line and the preliminary results will be published on the leaderboard. The preliminary score will be computed on a random subset of the test set, fixed for all participants. It will correspond to approximately 10% of the test data size. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published on-line. It is important to note that only teams which send a short report describing their approach before the end of the contest will qualify for the final evaluation. The winning teams will be officially announced during a special session devoted to the competition at the JRS 2012 ( conference.
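The split into a preliminary and a final evaluation part can be pictured with the following sketch. It is only an illustration: the function name and the seed are hypothetical, and the way the organizers actually drew the subset is not described here.

```python
import random
from typing import List, Tuple


def split_test_indices(
    n_docs: int, fraction: float = 0.1, seed: int = 2012
) -> Tuple[List[int], List[int]]:
    """Split document indices into a preliminary part and a final part.

    A fixed seed makes the subset identical for every participant.
    """
    rng = random.Random(seed)
    n_prelim = round(n_docs * fraction)
    preliminary = sorted(rng.sample(range(n_docs), n_prelim))
    chosen = set(preliminary)
    final = [i for i in range(n_docs) if i not in chosen]
    return preliminary, final
```

Because the preliminary subset is small and fixed, a leaderboard score computed on it is only an estimate of the final score on the remaining documents.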
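Quality of the submissions will be evaluated using a standard Information Retrieval measure that fits the considered problem naturally, namely the average F-score of the predictions. Let us use the following notation:

  • $N$ - the number of test documents,
  • $\mathrm{TrueTopics}_i$ - the set of topics assigned to the $i$-th document by the experts,
  • $\mathrm{PredTopics}_i$ - the set of topics predicted for the $i$-th document.

We can define Precision and Recall of a prediction for a single document:

\[
\mathrm{Precision}_i = \frac{|\mathrm{TrueTopics}_i \cap \mathrm{PredTopics}_i|}{|\mathrm{PredTopics}_i|},
\qquad
\mathrm{Recall}_i = \frac{|\mathrm{TrueTopics}_i \cap \mathrm{PredTopics}_i|}{|\mathrm{TrueTopics}_i|}
\]

Average F-score over the test documents will be defined as:

\[
\mathrm{FScore}_i = \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i},
\qquad
\mathrm{AvgFScore} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{FScore}_i
\]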

Intuitively, the average F-score combines the precision and recall of predictions over the set of all test documents. In case of ties in the results, the final ranking of participants will be decided based on the time of the submissions.
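For local validation, the measure can be approximated with the sketch below. It assumes the usual set-based definitions of precision and recall for a single document, with the F-score as their harmonic mean. Scoring a document as 0 when the predicted and true topic sets do not overlap (including an empty prediction) is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Set


def f_score(true_topics: Set[int], pred_topics: Set[int]) -> float:
    """F-score of one prediction: harmonic mean of precision and recall."""
    overlap = len(true_topics & pred_topics)
    if overlap == 0:
        # Covers empty predictions too; treated as a score of 0 (assumption).
        return 0.0
    precision = overlap / len(pred_topics)
    recall = overlap / len(true_topics)
    return 2 * precision * recall / (precision + recall)


def average_f_score(
    true: Sequence[Iterable[int]], pred: Sequence[Iterable[int]]
) -> float:
    """Mean of the per-document F-scores over all documents."""
    if len(true) != len(pred):
        raise ValueError("true and pred must contain the same number of documents")
    if not true:
        raise ValueError("at least one document is required")
    return sum(f_score(set(t), set(p)) for t, p in zip(true, pred)) / len(true)
```

Holding out part of the training set and scoring predictions with `average_f_score` gives an estimate that can be compared with the preliminary leaderboard score.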

Please direct all questions to the Forum. We will reply as soon as possible.

Copyright © 2008-2013 by TunedIT