After a significant investment of resources, Simulations Plus, Inc. has collected experimental data on a biological property Y for a total of 1046 chemical compounds. Y is of great importance to the pharmaceutical industry and, in the interest of fairness, its exact nature will be revealed after the submission deadline. All that can be said at this moment is that the measured values are binary:
- The “S” value indicates a positive outcome of the biological experiment
- The “N” value indicates a negative outcome of the biological experiment
Next, 242 molecular descriptors labeled “X1, X2, …, X242” were calculated for each of the 1046 chemical molecules by the Simulations Plus program ADMET Predictor(TM). The descriptors were then normalized to a 0–1 scale.
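Such 0–1 normalization is typically done by min-max scaling each descriptor column independently. A minimal sketch of the idea (the function name is ours, not part of the challenge materials):

```python
def min_max_normalize(column):
    """Scale a list of descriptor values to the 0-1 range via min-max scaling."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant descriptor: map every value to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

Note that the scaling parameters (lo, hi) come from the data at hand; when normalization is redone locally, the same parameters should be reused for any held-out compounds.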
The whole dataset was divided into 3 parts:
- 209 representative molecules with descriptor values only were set aside as the Final Test Data; the Y values for this set will remain unknown to you for the duration of the challenge.
- 163 representative molecules from the remaining part were selected as the Preliminary Test Set. The Y values for this set will be revealed 14 days before the end of the challenge, so it will also be possible to combine this set with the Training Data described below to build the final model.
- The remaining 674 molecules with both descriptors and Y values known are labeled as Training Data.
All datasets are publicly available from TunedIT. Please note that all the sets are unbalanced in that the negatives outnumber the positives approximately 4:1, posing additional challenges to the modeling algorithm but also reflecting real-world scenarios. This, in our opinion, is the main source of difficulty in this challenge. In addition, please be aware that biological data is almost always characterized by significant noise.
The competition will proceed in two distinct phases. The first phase, "Exercise", lasting about 2 months, will be the time to search for the best modeling algorithm. It will proceed in a manner similar to earlier TunedIT competitions, in that the Preliminary Test Set will serve as a common comparison platform for all participants. Preliminary results calculated on this set will be posted on the Leaderboard, so that participants can compare the performance of their respective methods. The Leaderboard results, though, will not be used to pick the winner.

The second phase, "Model Building", will begin when the Preliminary Test Set Y values are revealed to all participants, 14 days before the submission deadline. During this time, participants, who should have their optimal algorithm in hand, are encouraged to combine the Preliminary Test Set and the Training Data into a single training pool of 837 compounds. In the next step, participants will split this pool into new training/testing subsets for building the final model with the best predictivity. A good splitting algorithm is crucial and will have a sizable impact on the final model's performance; hence designing it is also part of this competition. The final submission will consist of prediction results calculated on the Final Test Set only, as the measure of predictivity.
Please also keep in mind that good performance on the Preliminary Test Set does NOT automatically guarantee high final predictivity!
We ask you to:
- Design new or learn existing ways of measuring the predictivity of binary classification QSPR models.
- "Exercise Phase": Use the Training Data to build a QSPR classification model with the highest projected predictivity. This process will necessarily involve proper descriptor selection (mandatory), the train/validation data division described in the Overview section (optional), and possibly the use of the Leaderboard and Preliminary Test Set results as guidance. Otherwise, the method and manner of model training are left to the contestant.
- "Model Building Phase": After the Y values for the Preliminary Test Set are revealed, combine this set with the Training Data into a single training pool of 837 compounds. Next, use your own best method to separate this pool into new Training and Testing subsets that you think will lead to a QSPR classification model with the highest projected predictivity. Build the final model.
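One simple way to split the pooled data while preserving the roughly 4:1 N:S class ratio is a stratified random split; more elaborate schemes (e.g. diversity-based selection) are of course allowed. A minimal sketch, with function and variable names of our own choosing:

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test sets, preserving the class ratio.

    labels: list of "S"/"N" labels, one per compound in the pool.
    Returns (train_indices, test_indices), both sorted.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # randomize within each class
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Because both subsets keep the same S/N proportions as the pool, performance statistics estimated on the test subset are less distorted by the class imbalance than those from a purely random split.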
- Run the final model on the Final Test Data using the appropriate descriptors as inputs, and then submit the 209 predicted values using the “N” and “S” labels. The order of the predicted values must be the same as the order of the Final Test Data!
The quality of models will be judged solely on their predictivity, evaluated on the Final Test Data predictions. Any statistics derived from the Preliminary Test Set and Training Data will be ignored. Solutions should contain a list of labels ("S" or "N"), one per line: 163 labels for the Preliminary Test examples, followed by an empty line and another 209 labels for the Final Test examples.
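The expected submission layout can be produced with a few lines of code; this is an illustrative sketch (the function name and file path are ours):

```python
def write_solution(preliminary_labels, final_labels, path="solution.txt"):
    """Write 163 Preliminary Test labels, a blank line, then 209 Final Test labels."""
    assert len(preliminary_labels) == 163 and len(final_labels) == 209
    assert all(l in ("S", "N") for l in preliminary_labels + final_labels)
    with open(path, "w") as f:
        f.write("\n".join(preliminary_labels))
        f.write("\n\n")                     # empty line separating the two blocks
        f.write("\n".join(final_labels))
        f.write("\n")
```

The assertions guard against the most common formatting mistakes: wrong counts, wrong label alphabet, or a missing separator line.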
The following statistics are routinely used by the QSAR community with regard to unbalanced binary datasets [1-3]:
- Sensitivity = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- Youden Index = Sensitivity + Specificity − 1
- Matthews Correlation Coefficient (MCC) = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
The meaning of symbols used above is as follows:
- TP indicates the number of true positives: molecules for which both measured and predicted values are “S”.
- TN indicates the number of true negatives: molecules for which both measured and predicted values are “N”.
- FP indicates the number of false positives: molecules for which the predicted value is “S”, but measured value is “N”.
- FN indicates the number of false negatives: molecules for which the predicted value is “N”, but measured value is “S”.
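Given lists of measured and predicted labels, the four counts can be tallied directly; a minimal sketch (function name ours):

```python
def confusion_counts(measured, predicted):
    """Count TP, TN, FP, FN, treating "S" as positive and "N" as negative."""
    tp = sum(m == "S" and p == "S" for m, p in zip(measured, predicted))
    tn = sum(m == "N" and p == "N" for m, p in zip(measured, predicted))
    fp = sum(m == "N" and p == "S" for m, p in zip(measured, predicted))
    fn = sum(m == "S" and p == "N" for m, p in zip(measured, predicted))
    return tp, tn, fp, fn
```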
In general, no single statistic provides a good measure of a model's performance. All four indices should be as high as possible for the Final Test Data predictions. We will be looking, in particular, for models with high Sensitivity and high Specificity, as well as a good balance between these two statistics, to determine the winning model. For example, the two models Sensitivity = 0.8, Specificity = 0.8 and Sensitivity = 0.6, Specificity = 1.0 have the same Youden Index of 0.6, but the first model is balanced and therefore preferred over the second one. The absolute difference |Sensitivity − Specificity| is a reasonable measure of imbalance. Following this idea and skipping constants, one can design a Balanced Youden Index as:

Balanced Youden Index = Sensitivity + Specificity − |Sensitivity − Specificity|
Mathematically, the above equation can be transformed into a simpler form (after removing the constant factor of 2):

Balanced Youden Index = min(Sensitivity, Specificity)
This new index, the minimum of Sensitivity and Specificity, will be used to rank results on the Leaderboard and in the final scoring. Simulations Plus, however, reserves the right to adjust the scoring procedure depending on the distribution of final results. In the case of models of equivalent quality, the winner will be decided by the date and time of final submission: earlier submissions take precedence. The participating group that creates the winning model will be awarded a 1000 USD prize funded by Simulations Plus.
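The ranking metric, min(Sensitivity, Specificity), is easy to compute locally for any trial split. A self-contained sketch (function name ours; "S" is treated as positive):

```python
def balanced_youden(measured, predicted):
    """Score = min(Sensitivity, Specificity) for "S"/"N" label lists."""
    tp = sum(m == "S" and p == "S" for m, p in zip(measured, predicted))
    tn = sum(m == "N" and p == "N" for m, p in zip(measured, predicted))
    fp = sum(m == "N" and p == "S" for m, p in zip(measured, predicted))
    fn = sum(m == "S" and p == "N" for m, p in zip(measured, predicted))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return min(sensitivity, specificity)
```

Note how the metric penalizes the trivial "predict everything N" strategy on a 4:1 dataset: Specificity would be 1.0 but Sensitivity 0.0, so the score is 0.0.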
For comparison, TunedIT has generated a "baseline solution" using a simple 1-nearest-neighbor classifier, without any feature selection or scaling; in it, the Preliminary Test results are followed by the Final Test results.

Please direct all questions to the Forum.

References
- Wikipedia, Matthews Correlation Coefficient, 2010, http://en.wikipedia.org/wiki/Matthews_correlation_coefficient
- Wikipedia, Youden Index, 2010, http://en.wikipedia.org/wiki/Youden_index
- Wikipedia, Sensitivity and Specificity, 2010, http://en.wikipedia.org/wiki/Sensitivity_(tests)