e-LICO multi-omics prediction challenge with background knowledge on Obstructive Nephropathy

Status Closed
Type Scientific
Start 2010-09-15 00:00:00 CET
End 2010-12-19 23:59:59 CET
Prize 2,500€

Registration is required.

Task

Data Description
Measurement Data: The measurement data span three different levels: miRNA, proteins, and metabolites. miRNA is measured using specific pan-miRNA arrays. Protein expression levels are measured with two different technologies: antibody arrays that target specific known proteins, and liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS), which also yields proteins. Some of the proteins identified by LC-MS/MS will overlap with those targeted by the antibody arrays, while others will be different. The metabolomic profile is measured using capillary electrophoresis coupled to mass spectrometry (CE-MS). All data sources are preprocessed to extract the appropriate features; the extracted data and features are delivered to the participants in standard tabular form, where rows correspond to samples and columns correspond to measurements.

Measurement profiles could be acquired for only a limited number of samples, 34, due to the costs, sample availability, and turnaround time of such studies. Of these 34 samples, only 30 are released for the model construction phase. The remaining 4 are released at a later stage, after the completion of the model construction phase, and are used to evaluate the models created by each participant. We stress that such small sample sizes are very common in biology. Technological advances and wider use of the testing techniques can reduce costs, but they cannot remove the invasive nature of sample acquisition techniques such as biopsies. The small sample size problem will therefore remain for the foreseeable future. A more detailed description of the available data is given in the table below.

[Table: detailed description of the available data]

A common phenomenon in biological experiments is that it is not possible to obtain complete measurements for all samples. This is also the case here. In the training data we have two distinct groups of samples. The first group consists of 20 samples for which we have the miRNA and metabolomics measurements. The second group consists of 10 samples for which we have the proteomics measurements obtained by the two different measurement technologies. There is no overlap between the individuals comprising these two groups. The testing data are of the same form as those of the first group, i.e. they have miRNA and metabolomics measurements. The reason for this discrepancy is simply that the proteomics technologies require 10 to 20 times more biological material. Since the samples are urine samples taken from newborns, the available volumes were quite limited. Each sample in the set measured using proteomics technologies was therefore obtained from multiple (1-4) patients.

In a standard data mining scenario the second group of 10 samples, which has only proteomics measurements, would be considered useless. This is not the case here, because the attributes of the different sources (miRNA, proteomics, and metabolomics) are not independent, and these dependencies are given as part of the background knowledge described below. We can capitalize on this information. For example, models constructed on the miRNA and metabolomics sources should be compatible with, and agree with, models constructed on the proteomics level. We can thus use the information extracted from the latter to improve the quality of the former.

All the measurement data were preprocessed using standard techniques implemented in various Bioconductor packages, and the features were rescaled by applying a log2 transformation.

Available Background Knowledge: We also provide background knowledge, extracted from various biological databases, on the potential links between miRNAs and proteins, the links between proteins and the metabolome, and the links between proteins measured using the two different technologies. This is mainly a description of the dependencies and interactions between the attributes of the different biological levels, as extracted from the different biological databases. The figure below describes how these connections are made and with the help of which database. Where possible, we associate the miRNA attributes with the proteins whose expression levels are measured by the antibody arrays and LC-MS/MS. We do this by first passing to the mRNAs that these miRNAs target, with the help of the miRBase and TargetScan databases. These mRNAs are then translated to their corresponding proteins and associated with the attributes (proteins) of the antibody array and LC-MS/MS datasets. Proteins that appear in the LC-MS/MS and antibody array datasets are associated with some of the measured metabolites with the help of the KEGG and HMDB databases.

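[Figure: links between the miRNA, protein, and metabolite levels, and the databases used to establish them]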

It is important to note that the above biological databases reflect the current state of the domain knowledge and are constantly updated by biologists. Moreover, the raw data preprocessing methods we used to extract the appropriate features in the different levels are statistical in nature. This in turn might render the provided background information incomplete and noisy.

Each mapping corresponds to a pair of -omics data levels (e.g. "level1" and "level2") and is stored in a file named "level1_2_level2.txt". This file contains indexes of attributes in the two levels, and each line maps one attribute in level1 to a set of attributes in level2. The syntax of these files is the following:
level1.attrid.1: level2.attrid.11,...,level2.attrid.1n
...
level1.attrid.k: level2.attrid.k1,...,level2.attrid.kl
...

In the above format we assume that the "Sample" column in each of the files with the -omics data (e.g. "level1" and "level2") has an index of 0 and the first biological measurement has an index of 1.
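
The mapping format above can be read with a few lines of Python. The following is only a minimal sketch, not code supplied by the organizers: the function names `parse_mapping` and `indices_to_names`, the example text, and the column names `p1` to `p7` are hypothetical, and it assumes that attribute ids are plain integers, as the word "indexes" suggests.

```python
from typing import Dict, List, Sequence


def parse_mapping(text: str) -> Dict[int, List[int]]:
    """Parse lines of the form 'k: a,b,c' into {k: [a, b, c]}.

    ASSUMPTION: attribute ids are integer column indexes.
    """
    mapping: Dict[int, List[int]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        source, targets = line.split(":", 1)
        mapping[int(source)] = [int(t) for t in targets.split(",") if t.strip()]
    return mapping


def indices_to_names(indices: Sequence[int], header: Sequence[str]) -> List[str]:
    """Translate indexes to column names.

    header[0] is the "Sample" column, so index 1 is the first measurement.
    """
    return [header[i] for i in indices]


# Hypothetical content of a "level1_2_level2.txt" file.
example = "1: 3,7\n2: 5\n"
links = parse_mapping(example)

# Hypothetical header row of the level2 data file.
header_level2 = ["Sample", "p1", "p2", "p3", "p4", "p5", "p6", "p7"]
names = indices_to_names(links[1], header_level2)
```

Because the "Sample" column occupies index 0, the indexes can be used directly as positions in the header row of the corresponding data file, without any offset.
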
Use of Other Background Knowledge: Participants are allowed to use other sources of external background knowledge provided that they are publicly available.

Other Data: We also provide patients' age measured in days.

Target Variables: The severity of the disease is given by the degree of obstruction, which is measured in two ways:
  • the pelvic diameter (PD) (mm): the more the urine accumulates within the pelvis of the kidney, the more the pelvis dilates and the diameter of the pelvis increases,
  • the differential renal function (DRF) (%): when obstruction of the kidney is too severe, the function of the obstructed kidney decreases. The function of both the obstructed and the non-obstructed kidney is measured and we then calculate the DRF. The larger the DRF value, the more the obstructed kidney function is impaired.
The provided target values are scaled to a [0,1] interval.

As already mentioned, each sample measured using proteomics technologies was obtained from several patients (as proteomics technologies require 10 to 20 times more biological substance). For these samples we provide the target variables for each of the patients.

Training and holdout data
Training data consists of several files which correspond to different -omics measurements (in the CE-MS-metabo_Data, LC-MSMS_Data, ProteinArray_Data and miRNA_Data subdirectories), the files with mappings (in the mappings subdirectory) and one file with the target variables together with the age of patients in days (in ON_challenge_ClinicalData).

Holdout data contains data for 4 samples and consists of a file with the metabolomic profiles (in the CE-MS-metabo_Data subdirectory) and miRNA expression values (miRNA_Data).

Data mining tasks
We define three different predictive tasks:
  1. Predict PD
  2. Predict DRF
  3. Jointly Predict PD and DRF
Evaluation Strategy
Our evaluation measure is based on the sum of squared errors for each of the tasks described above:

[Image: definitions of the three error components err_1, err_2 and err_3]

where the superscript p denotes the predicted values. The final evaluation score is computed by subtracting the average of the above three error components from a baseline. The baseline is the performance obtained by a default predictor that always outputs the averages of PD and DRF computed on the training set, i.e.:
score = BASELINE - (err_1 + err_2 + err_3)/3

In other words, the evaluation score reflects the enhancement over the baseline.
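
The score can be sketched in Python as follows. This is an illustration under stated assumptions, not the organizers' evaluation code. The exact definitions of the three error components are not reproduced in this text, so the sketch assumes that err_1 is the sum of squared errors on PD, err_2 the sum of squared errors on DRF, and err_3 (the joint task) their sum. It also assumes that BASELINE is the same averaged error computed for the training-set means. All function and parameter names are hypothetical.

```python
from typing import Sequence


def sse(true: Sequence[float], pred: Sequence[float]) -> float:
    """Sum of squared errors between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(true, pred))


def mean_error(pd_true, drf_true, pd_pred, drf_pred) -> float:
    """Average of the three error components."""
    err_1 = sse(pd_true, pd_pred)    # task 1: PD
    err_2 = sse(drf_true, drf_pred)  # task 2: DRF
    err_3 = err_1 + err_2            # ASSUMPTION: joint error is the sum of both
    return (err_1 + err_2 + err_3) / 3


def score(pd_true, drf_true, pd_pred, drf_pred,
          pd_train_mean: float, drf_train_mean: float) -> float:
    """score = BASELINE - (err_1 + err_2 + err_3) / 3."""
    n = len(pd_true)
    # ASSUMPTION: the baseline predicts the training-set means for every sample.
    baseline = mean_error(pd_true, drf_true,
                          [pd_train_mean] * n, [drf_train_mean] * n)
    return baseline - mean_error(pd_true, drf_true, pd_pred, drf_pred)


# Made-up target values on the [0, 1] scale for 4 holdout samples.
pd_true = [0.0, 0.5, 1.0, 0.5]
drf_true = [0.25, 0.25, 0.75, 0.75]

perfect = score(pd_true, drf_true, pd_true, drf_true, 0.5, 0.5)
default = score(pd_true, drf_true, [0.5] * 4, [0.5] * 4, 0.5, 0.5)
```

Under these assumptions, a submission that simply repeats the training means scores zero, a positive score indicates an improvement over the baseline, and a negative score indicates a model that does worse than the default predictor.
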

Given that the models created by the participants are applied to a very limited number of examples, only 4, the submitted solutions will be evaluated, and their final scores published on the leaderboard, only after the challenge has finished. Participants may submit as many solutions as they want, but only the last one will be evaluated. Submitted solutions are also checked on submission to determine whether they fulfill the syntax requirement (i.e. a comma-delimited csv file with 4 rows and 2 columns); if there is no parsing error, the "Preliminary Result" column in the leaderboard should contain 0.0000.

Submission format

Prediction results for the holdout data should be submitted as a text file in csv format (with the comma as the delimiter). This file should contain 4 rows, one per sample, and 2 columns, which correspond to "Predicted PD" and "Predicted DRF", respectively. The file should not contain row or column names. The order of rows should be the same as the order of samples in the holdout data.
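
A file in this format can be produced and checked with Python's standard csv module. The sketch below is illustrative only: the function names `format_submission` and `check_submission` and the prediction values are hypothetical, and the check mirrors only the syntax requirement stated above, not the organizers' actual parser.

```python
import csv
import io
from typing import Sequence


def format_submission(pd_pred: Sequence[float], drf_pred: Sequence[float]) -> str:
    """Return csv text with 4 rows and 2 columns, without any header."""
    if len(pd_pred) != 4 or len(drf_pred) != 4:
        raise ValueError("expected exactly 4 predictions per target")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for pd_value, drf_value in zip(pd_pred, drf_pred):
        writer.writerow([pd_value, drf_value])  # column order: PD, then DRF
    return buffer.getvalue()


def check_submission(text: str) -> bool:
    """Check that the text has 4 rows of 2 numeric, comma-separated values."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) != 4:
        return False
    for row in rows:
        if len(row) != 2:
            return False
        try:
            float(row[0])
            float(row[1])
        except ValueError:
            return False  # a header row or other non-numeric cell
    return True


# Made-up predictions, listed in the order of the holdout samples.
submission = format_submission([0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8])
is_valid = check_submission(submission)
```

The resulting text can be written to a file with `open(path, "w", newline="")`, which avoids extra blank lines on platforms that translate line endings.
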

An example of a solution file can be found here.


Good luck!

Alexandros Kalousis, Julie Klein, Joost Schanstra and Adam Woznica
The Organizers
Copyright © 2008-2013 by TunedIT