This page describes the task of the Advanced Track only. The task of the Basic Track is described here.
Example datasets for 6 different problems of DNA microarray data analysis and classification can be found in the Repository, in the RSCTC/2010/B/public folder. These are the training and test datasets from the Basic Track, available in ARFF or CSV format, either separately or zipped together. You can use them as benchmarks when experimenting with different algorithms to find the best one to submit to the challenge: they have characteristics similar to those of the secret datasets that will be used to evaluate solutions.
All datasets, both public and secret, contain between 100 and 400 samples, each characterized by the values of 20,000 to 65,000 attributes. Samples are assigned to several (2 to 10) classes. All attributes are numeric and represent measurements from DNA microarrays. Attributes are normalized in some way; in the test data you should expect distributions of attribute values similar to those in the example data.
A solution has the form of a JAR file containing the Java source code of a classification algorithm. The implementation should be based on the architecture (API) of one of the following systems: Debellor 1.0, Weka 3.6.1 or Rseslib 3.0.2. Depending on the chosen architecture, the algorithm class should inherit either from:
The JAR file may also contain any other classes used by the algorithm class. It is not necessary to include the classes of Debellor, Rseslib or Weka; you may assume that their JAR files will be available on the classpath. You are free to use any of the algorithms available in these systems in your implementation.
In the solution submission form, you must choose the JAR file to be submitted and type the full name (including the package) of the class that implements the algorithm. After submission, the code in the JAR file is compiled on the server under Sun JDK 6 and the algorithm is tested. The preliminary result will appear on the Leaderboard.
Solutions are evaluated on several secret datasets corresponding to different problems of DNA microarray data classification. Secret datasets represent different medical problems than the public ones, but possess similar characteristics: number of samples and attributes, statistical distributions of attribute values etc.
Preliminary evaluation uses 5 datasets and final evaluation uses 6. The algorithm is evaluated 5 (preliminary) or 20 (final) times on each dataset using the Train+Test (T+T) procedure. Each T+T trial randomly splits the data into two equal, disjoint parts, a training subset and a test subset. The algorithm is trained on the first part and tested on the second, and the quality measure, balanced accuracy, is calculated. The measurements from all T+T trials on all datasets are then averaged. The randomization of data splits is the same for every submitted solution, so every algorithm is evaluated on the same splits.
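A single split of this kind can be sketched as follows. This is only an illustration: the class name `TrainTestSplit`, the seed value, and the handling of an odd number of samples are assumptions, and the page does not say whether the server's splits are stratified by class. Fixing the seed is what makes the split reproducible across solutions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Debellor, Weka, Rseslib or TunedTester.
final class TrainTestSplit {

    /**
     * Shuffles the sample indices 0..n-1 with a fixed seed and cuts them in half.
     * Returns {trainIndices, testIndices}. For odd n the test part gets the
     * extra sample (an assumption made for this sketch).
     */
    static int[][] split(int n, long seed) {
        List<Integer> indices = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // The same seed always produces the same permutation.
        Collections.shuffle(indices, new Random(seed));

        int half = n / 2;
        int[] train = new int[half];
        int[] test = new int[n - half];
        for (int i = 0; i < n; i++) {
            if (i < half) {
                train[i] = indices.get(i);
            } else {
                test[i - half] = indices.get(i);
            }
        }
        return new int[][] { train, test };
    }
}
```

Calling `TrainTestSplit.split(n, seed)` with a different seed for each trial, but the same list of seeds for every algorithm, gives every algorithm the same sequence of splits.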
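Balanced accuracy is the average of the standard classification accuracies (acc_k) calculated separately for each decision class k = 1, ..., K:

$$\text{balanced accuracy} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{acc}_k, \qquad \mathrm{acc}_k = \frac{\text{number of test samples of class } k \text{ classified correctly}}{\text{number of test samples of class } k}$$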
In this way, every class has the same contribution to the final result, no matter how frequent it is.
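The definition above can be sketched in a few lines of Java. The class and method names are hypothetical, class labels are assumed to be encoded as integers, and classes that do not occur among the true labels are simply left out of the average (the page does not say how the server treats that case).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the measure; not the server's implementation.
final class BalancedAccuracy {

    /** Returns the mean of the per-class accuracies over the classes present in 'actual'. */
    static double compute(int[] actual, int[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays must be non-empty and of equal length");
        }
        // For every true class: {number classified correctly, number of samples}.
        Map<Integer, int[]> counts = new HashMap<Integer, int[]>();
        for (int i = 0; i < actual.length; i++) {
            int[] c = counts.get(actual[i]);
            if (c == null) {
                c = new int[2];
                counts.put(actual[i], c);
            }
            if (predicted[i] == actual[i]) {
                c[0]++;
            }
            c[1]++;
        }
        // Average the per-class accuracies, giving each class equal weight.
        double sum = 0.0;
        for (int[] c : counts.values()) {
            sum += (double) c[0] / c[1];
        }
        return sum / counts.size();
    }
}
```

For example, with nine test samples of class 0 and one of class 1, predicting class 0 for everything gives a plain accuracy of 0.9 but a balanced accuracy of only 0.5, because the accuracy on class 1 is zero.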
Time and Memory
The algorithm should be not only accurate, but also time- and memory-efficient. There is a time limit for the whole evaluation: 4 hours in preliminary tests and 20 hours in final tests. With 5 datasets × 5 trials = 25 trials in preliminary tests and 6 datasets × 20 trials = 120 trials in final tests, a single Train+Test trial of the algorithm should therefore take no longer than about 10 minutes on average.
The memory limit is 1,500 MB in both preliminary and final evaluation. Note that up to 450 MB is used by the evaluation procedure to load the dataset into memory, so about 1 GB is left for the algorithm.
Tests are performed on a machine with a 1.9 GHz dual-core CPU, 32-bit Linux and 2 GB of memory, running Sun Java HotSpot Server 14.2 as the JVM.
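When testing locally, a rough way to watch memory use is to read the heap figures reported by the JVM. The sketch below assumes the limit behaves like a JVM heap cap (for example, a local run started with the standard HotSpot option `-Xmx1500m`), which this page does not state. The class name `HeapReport` is hypothetical.

```java
// Hypothetical helper for local experiments; not part of the evaluation procedure.
final class HeapReport {

    private static final long MB = 1024L * 1024L;

    /** Heap currently in use by the JVM, in megabytes (includes garbage not yet collected). */
    static long usedMegabytes() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxMegabytes() {
        return Runtime.getRuntime().maxMemory() / MB;
    }
}
```

Printing `HeapReport.usedMegabytes()` after loading a dataset and again after training gives a rough idea of how much of the remaining budget the algorithm itself consumes.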
The Examples folder in the Repository contains sample implementations of the simplest classification algorithm, the majority classifier, written for the Debellor and Rseslib architectures. They include compiled code as well as Java sources, so they may help you understand the API that your class should implement. See also the Examples section in Docs.
JAR files of Debellor, Rseslib and Weka are available in the Repository. You can download them and put them on the classpath of your algorithm in order to run and test it locally, outside TunedTester. For instance, if you develop under Eclipse and want to add a library JAR to the classpath, click on the project, choose Project -> Properties -> Java Build Path from the menu and then, in the Libraries tab, click on "Add JARs" or "Add External JARs".
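To show the idea behind a majority classifier independently of any particular API, here is a minimal plain-Java sketch. It deliberately does not extend any Debellor, Rseslib or Weka class. The class name, the `train` and `classify` methods and the use of string labels are all assumptions made for illustration; the actual examples in the Repository follow the respective system's API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a real submission must inherit from the chosen system's base class.
final class MajorityClassifierSketch {

    private String majorityLabel;

    /** Remembers the most frequent label in the training data (first to reach the top count wins ties). */
    void train(String[] labels) {
        majorityLabel = null;
        Map<String, Integer> counts = new HashMap<String, Integer>();
        int best = 0;
        for (String label : labels) {
            Integer old = counts.get(label);
            int now = (old == null) ? 1 : old + 1;
            counts.put(label, now);
            if (now > best) {
                best = now;
                majorityLabel = label;
            }
        }
    }

    /** Ignores the attribute values and always answers with the majority label. */
    String classify(double[] attributes) {
        if (majorityLabel == null) {
            throw new IllegalStateException("train() must be called with at least one label first");
        }
        return majorityLabel;
    }
}
```

Such a classifier is useful mainly as a baseline: under balanced accuracy it scores 1/K when all K classes occur in the test set, since it gets one class entirely right and every other class entirely wrong.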
It is also possible to test the algorithm on each of the six public datasets, RSCTC/2010/B/public/dataX_train.arff, where X = 1,2,...,6, under conditions similar to those of the challenge evaluation on the server (i.e., using TunedTester). For this purpose:
If you have any questions, please post them on the discussion forum of the challenge. We also encourage you to subscribe to this forum ("Subscribe forum" link at the bottom), so that you receive notifications about new posts, which may contain further explanations of the challenge tasks.
Marcin Wojnarski, Andrzej Janusz, Hung Son Nguyen, Jan Bazan