e-LICO multi-omics prediction challenge with background knowledge on Obstructive Nephropathy
We present a biological data-mining problem that poses a number of significant challenges:
Authors of the winning solutions will be awarded prizes sponsored by Rapid-I, which supports the well-known data mining suite RapidMiner, and by the European Commission through the e-LICO EU project (2009-2012), which is building a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences.
Participants are expected to prepare a paper, maximum 8 pages, describing their approach. We plan to have a number of selected papers considered for publication in a special issue of a journal (to be announced soon).
With the advent of high-throughput technologies in biology, one of the typical analysis problems is the study of biological mechanisms based on the profiling of specimens, i.e. samples from some population related to the mechanism under study. This profiling increasingly spans different biological levels, e.g. genomics, proteomics, metabolomics, using different measurement techniques. Typical goals of such studies are the extraction of models that are predictive of some outcome, diagnostic of some condition, or simply provide a better understanding of the biological mechanism. These settings bring to the fore a number of significant challenges for the knowledge discovery and data mining process.
One such challenge is the high dimensionality of the spaces in which the samples or instances are described, typically hundreds or thousands of features. This condition is very often further aggravated by the limited availability of samples. The latter can be due to ethical reasons, e.g. biological samples that can only be acquired by intrusive techniques, or cost reasons, e.g. measurements that are expensive to acquire in large numbers. This combination is usually referred to as the High Dimensionality Small Sample Size problem.
Another challenge is the meaningful combination of data measurements that come from different biological levels, such as genomics and proteomics, and different measurement techniques. Obviously, the simplest approach is to combine all different sources in a single tabular view where instances are described by heterogeneous attributes, i.e. attributes that come from different biological levels. More elaborate approaches might consider the combination of the different biological levels in a meaningful manner in the data mining process; examples of such approaches are the various flavors of multiple kernel learning, or even model combination and ensemble learning techniques over the different levels.
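To make the two options above concrete, the following is a minimal sketch, not the challenge's reference method. It contrasts the single tabular view (concatenating the attributes of every level) with a fixed-weight sum of per-level linear kernels. The weighted sum is only a simple stand-in for multiple kernel learning, which would learn the weights from data. The view names, the toy values, and the function names are illustrative assumptions.

```python
from typing import Dict, List

Matrix = List[List[float]]


def concatenate_views(views: Dict[str, Matrix]) -> Matrix:
    """Single tabular view: join each sample's attributes across all levels."""
    names = sorted(views)
    n_samples = len(views[names[0]])
    if any(len(views[name]) != n_samples for name in names):
        raise ValueError("all views must describe the same samples")
    return [
        [value for name in names for value in views[name][i]]
        for i in range(n_samples)
    ]


def linear_kernel(X: Matrix) -> Matrix:
    """Gram matrix of pairwise dot products between samples."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def combine_kernels(views: Dict[str, Matrix], weights: Dict[str, float]) -> Matrix:
    """Weighted sum of one linear kernel per level (weights fixed, not learned)."""
    names = sorted(views)
    kernels = {name: linear_kernel(views[name]) for name in names}
    n = len(kernels[names[0]])
    return [
        [sum(weights[name] * kernels[name][i][j] for name in names) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: two samples measured at two biological levels.
toy_views = {
    "mirna": [[1.0, 0.0], [0.0, 1.0]],
    "protein": [[2.0], [3.0]],
}
tabular = concatenate_views(toy_views)
combined = combine_kernels(toy_views, {"mirna": 0.5, "protein": 0.5})
```

The combined Gram matrix could then be passed to any kernel method that accepts a precomputed kernel. Both functions assume that every level is measured on the same samples in the same order.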
A third challenge comes from the fact that the profile measurements represent a snapshot of the underlying biological mechanism. As a result, there is a high degree of dependency and interaction both between attributes of the same source (co-regulated genes are a typical example) and between attributes of different sources, e.g. a gene that encodes a protein whose level of expression controls the level of expression of a number of other proteins or metabolites. If one follows a very simplistic mining approach, with no pre-processing and the single tabular view described above, then the learning algorithms used in the mining process should be able to deal with strong feature interactions and redundancies and, in the ideal case, uncover them. Techniques that follow the multiple kernel learning or ensemble learning paradigms over the different levels might have trouble with this kind of data, because under these paradigms it is more difficult to account for the attribute dependencies and interactions that are present here. The learning problem might become easier, and the quality of the results might improve, if one is able to model and exploit the attributes' interactions and dependencies during the learning process.
Finally, it is very often impossible to obtain complete information and measurements for all the samples used in a given study. The main reason is that there is not enough biological material to perform all the measurements. A closely related problem arises when we want to combine measurements taken by different research teams on disjoint samples that were nevertheless collected to study the same disease. For example, one group collects proteomics data for a given disease on a pool of samples, and another group collects metabolomics data for the same disease on its own pool of samples. Standard data mining approaches break down here since there is no way to connect the different data. However, in biology we do have connections between the different sources, as described in the previous paragraph, and we can exploit them in an effort to use all the available data in the modeling process.
Biology is a scientific field that has been marked by an explosion in the availability of computerized knowledge resources. Of particular relevance to the type of analysis that we are interested in here is knowledge of interactions and dependencies between different biological compounds from the same or different biological levels. Typical examples of such knowledge representation are biological pathways that model interactions between biological compounds. Given such knowledge, the challenge is to use and/or devise data mining approaches and learning algorithms that are able to accommodate it in order to improve the quality of the results. Examples of such approaches could be attribute pre-processing steps that exploit the attributes' dependencies, and the use of existing learning paradigms that are appropriate for the described context, such as graph-based learning algorithms and Bayesian networks.
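As one illustration of an attribute pre-processing step that exploits known dependencies, the sketch below averages the measurements of compounds that belong to the same pathway. This turns many correlated attributes into a few pathway-level scores. It is only one possible choice among many, and the pathway memberships, compound names, and values are hypothetical.

```python
from typing import Dict, List


def pathway_scores(
    sample: Dict[str, float], pathways: Dict[str, List[str]]
) -> Dict[str, float]:
    """Average the measured members of each pathway into one derived attribute.

    Pathways with no measured member in this sample are skipped.
    """
    scores: Dict[str, float] = {}
    for pathway, members in pathways.items():
        measured = [sample[m] for m in members if m in sample]
        if measured:
            scores[pathway] = sum(measured) / len(measured)
    return scores


# Hypothetical sample and pathway memberships (not taken from the challenge data).
example_sample = {"protA": 1.0, "protB": 3.0, "metX": 10.0}
example_pathways = {
    "pathway_1": ["protA", "protB"],
    "pathway_2": ["metX", "metY"],  # metY was not measured in this sample
    "pathway_3": ["metZ"],          # no measured member, so no score
}
example_scores = pathway_scores(example_sample, example_pathways)
```

Because a pathway score uses whichever members happen to be measured, the same derived attribute can be computed for samples profiled at different levels. This is one way to relate otherwise disjoint measurements.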
The biological application that provides the data mining problem is Obstructive Nephropathy (ON). This pathology is characterized by the presence of an obstacle in the urinary tract, e.g. stenosis or abnormal implantation of the ureter in the kidney; see the figure below. The urine cannot flow properly and accumulates within the kidney, leading to progressive alterations of the renal parenchyma, development of renal fibrosis, and loss of renal function. ON is the leading cause of end-stage renal disease in children, which is treated by dialysis or transplantation. Thus, it is important to understand the pathological mechanisms involved in the progression of this nephropathy. To that end, we examine the urinary miRNA, protein, and metabolite profiles of newborns. The goal of the study is to construct diagnostic models that accurately connect the biological profile to the severity of the disease.
As already mentioned, the different views are interrelated; the figure below gives a rough description of the dependencies. More precisely, miRNAs control the level of repression of mRNAs, thus controlling the level of expression of the proteins into which the mRNAs are translated. The expression level of these proteins then determines the levels of different metabolites. These interactions have been extracted from different biological databases, such as TargetScan, miRBase, UniProt, KEGG and HMDB, and are part of the background knowledge associated with the problem.
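The dependency chain described above can be represented as a directed graph. The following is a minimal sketch under two assumptions: the miRNA-to-mRNA-to-protein steps are collapsed into a single miRNA-to-protein edge, and all identifiers are invented for illustration. It does not reflect the format in which the challenge distributes its background knowledge.

```python
from collections import deque
from typing import Dict, List, Set


def downstream(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every compound reachable from `start` (breadth-first search)."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


# Hypothetical interactions: miRNA -> protein -> metabolite.
interactions = {
    "miR-a": ["prot-1", "prot-2"],
    "prot-1": ["met-x"],
    "prot-2": ["met-x", "met-y"],
}
affected = downstream(interactions, "miR-a")
```

Such a traversal gives, for each miRNA, the proteins and metabolites it may influence. A learning algorithm could use this, for example, to group or link attributes across the three views.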
We have secured prize sponsorship from Rapid-I, which supports the well-known data mining suite RapidMiner, and from the European Commission through the e-LICO EU project (2009-2012), which is building a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences.
The amount of sponsoring is: