Research - Documentation
TunedIT is an integrated system for the sharing, evaluation and comparison of machine-learning (ML) and data-mining (DM) algorithms. Its aim is to help researchers and users evaluate learning methods in reproducible experiments and to enable valid comparison of different algorithms. TunedIT also serves as a place where researchers may share their implementations and datasets with others.
Designing a new machine-learning or data-mining algorithm is a challenging task. An algorithm cannot be considered valuable unless its performance is verified experimentally on real-world datasets. Experiments must be repeatable, so that other researchers can validate the results. Unfortunately, in ML/DM repeatability is very hard to achieve: reproducing someone else's experiments is a highly complex, time-consuming and error-prone task. In the end, when the final results turn out to be different than expected, it is completely unclear whether the difference invalidates the claims of the experiment's original author, or should rather be attributed to:
Usually, it is not possible to resolve this issue. The problem lies in the nature of ML/DM algorithms...
In classical algorithmics, algorithms either work correctly or not at all; they cannot be "partially" correct. Correctness is a binary feature: either the algorithm satisfies the required specification or it does not. And if not, it is always possible to find a single "counterexample" or "witness" - a particular combination of input values - which proves the incorrectness of the algorithm. For instance, if we implemented quicksort and the implementation contained a bug, we could notice for some particular input data that the output is incorrect, since the generated sequence would be improperly sorted.
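The quicksort scenario can be made concrete with a toy sketch. Below, a partition-based sort deliberately "forgets" to recurse into the right partition (a typical kind of bug); a single witness input then proves the implementation incorrect. All names here are invented for illustration.

```java
public class WitnessDemo {
    // A deliberately buggy quicksort: it never recurses into the right
    // partition, so many inputs come out unsorted.
    static void buggySort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        buggySort(a, lo, i - 1);   // left partition: handled
        // buggySort(a, i + 1, hi); // right partition: the "forgotten" bug
    }

    static boolean isSorted(int[] a) {
        for (int k = 1; k < a.length; k++)
            if (a[k - 1] > a[k]) return false;
        return true;
    }

    public static void main(String[] args) {
        // A single witness input suffices to prove the algorithm incorrect:
        int[] witness = {3, 1, 4, 1, 5, 9, 2, 6};
        buggySort(witness, 0, witness.length - 1);
        System.out.println(isSorted(witness)); // prints false
    }
}
```

No analogous witness exists for an ML/DM classifier, which is exactly the difficulty described next.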
In ML/DM, there is no such thing as an "incorrect algorithm". An algorithm can be "more" or "less" correct, but never "incorrect". We all assume as a basic axiom of ML/DM that algorithms may occasionally make mistakes - it is simply impossible to design an ML/DM algorithm that is always right when tested on real-world data. Thus, wrong answers for some input samples do not invalidate the whole algorithm. If we had a classifier trained to recognize hand-written digits and passed it an image of "7" but the answer was "1", we would not start looking for implementation bugs, but rather presume that the input pattern was vague or atypical.
If the algorithm is always "correct", we have no indications of implementation bugs. There are no "witnesses" that would clearly prove incorrectness and point in the direction where bugs are hidden. Even if the experimenter suspects that something is wrong, there are no clues about where to start the investigation. For these reasons, reimplementing and reproducing someone else's experiments is practically impossible. The researcher can never be sure whether the experiment is reproduced correctly, with all important details done in the same way as originally, without implementation bugs or mistakes in the experimental procedure.
If experiments are not repeatable, verification of experimental results is impossible, so it is easier to design a new algorithm than to verify the results of an existing one. In consequence, there are thousands of competing algorithms for every type of ML/DM problem, but no general consensus about their actual quality, or about the strengths and weaknesses of each of them. This makes the quest for better ones difficult, if not blind. A cogent illustration of these paradoxes can be found in Empiricism Is Not a Matter of Faith (Pedersen, 2008).
Here comes TunedIT. With its creation we want to give the ML/DM community tools that will help conduct reproducible research and obtain meaningful results, leading to the formulation of generally accepted conclusions.
We want to make experiments fully repeatable through their automation with TunedTester - automation going side by side with flexibility and extensibility, provided by TunedTester's plug-in architecture and its ability to handle entirely new evaluation procedures, designed for new types of tasks and algorithms.
We are creating a collaboration environment for researchers, where a general consensus about the performance of different algorithms can arise. The central point of this environment is the Knowledge Base (KB), where all researchers can submit experimental results and together build a rich and comprehensive database of performance profiles of different algorithms. The results stored in KB are repeatable and verifiable by everyone. KB is coupled with a public Repository of ML/DM resources: algorithms, datasets, evaluation methods and others. The Repository ensures interpretability of the results collected in KB and fosters the exchange of data, implementations and ideas among researchers.
Finally, with the development of these tools, we want to facilitate the design of even more advanced and effective algorithms, able to solve numerous practical problems unsolvable today.
TunedIT builds upon previous efforts of the scientific community to facilitate experimentation and collaboration in ML&DM. In particular, it employs and extends the ideas that laid the foundations of:
TunedIT combines the strengths of these systems to deliver a comprehensive, extensible and easy-to-use platform for ML&DM research.
TunedIT platform is composed of three complementary tools:
Repository is a database of ML&DM-related files - resources. It is located on the TunedIT server and is accessible to all registered users - they can view and download resources, as well as upload new ones. The role of Repository in TunedIT is three-fold:
Repository has a structure similar to a local file system: it contains a hierarchy of folders, which in turn contain files - resources. Upon registration, every user is assigned a home folder in Repository's root folder, with its name being the same as the user's login. The user has full access to their home folder, where they can upload and delete files, create subfolders and manage access rights for resources. All resources uploaded by users have unique names (access paths in Repository) and can be used in TunedTester in exactly the same way as preexisting resources.
Every file or folder in Repository is either public (by default) or private. All users can view and download public resources. Private files are visible only to the owner, while to other users they appear as if they did not exist - they cannot be viewed or downloaded, and their results do not show up on the KB page. Private folders cannot be viewed by other users, although the subfolders and files contained in them can, provided that they are public themselves. In other words, the property of being private does not propagate from a folder to the files and subfolders contained inside.
TunedTester (TT) is a Java application for automated evaluation of algorithms, according to a test specification provided by the user. A single run of evaluation is called a test or experiment and corresponds to a triple of resources from Repository:
The evaluation procedure is not hard-wired into TunedTester but is a part of the test configuration, just like the algorithm and dataset themselves. Every user can implement new evaluation procedures to handle new kinds of algorithms, data types, quality measures or data mining tasks. In this way, TunedTester provides not only full automation of experiments, but also a high level of flexibility and extensibility.
TT runs locally on the user's computer. All resources necessary to set up a test are automatically downloaded from Repository. If requested, TT can submit test results to Knowledge Base, where they can later be analysed with KB's convenient web interface.
Resources for TunedTester
All TunedIT resources are either files, like
Typically, datasets have the form of files, while evaluation procedures and algorithms have the form of Java classes. For datasets and algorithms this is not a strict rule, though.
To be executable by TunedTester, evaluation procedure must be a subclass of
located in TunedIT/core.jar file in Repository.
Data file formats and algorithm APIs that are most commonly used in TunedIT and are supported by standard evaluation procedures include:
It is also possible for a dataset to be represented by a Java class that exposes methods returning data samples on request. This is a way to overcome the problem of custom file formats: if a given dataset is stored in an atypical file format, one can put it into a JAR file as a Java resource and prepare a wrapper class that reads the data and returns samples in a common representation, for example as instances of Debellor's
A test specification is a formal description, for TunedTester, of how the test should be set up. It is a combination of three identifiers (TunedIT resource names) of the evaluation procedure, algorithm and dataset that will be employed in the test:
Test specification = Evaluation procedure + Algorithm + Dataset
A TunedIT resource name is the full access path to the resource in Repository, as it appears on the Repository page. It does not include a leading slash "/". For example, the name of the file containing the Iris data and located in the UCI folder is:
Java classes contained in JARs are also treated as resources, although they do not show up on Repository pages. The TunedIT name of a Java class is composed of the containing JAR's name, followed by a colon ":" and the full (package-qualified) name of the class.
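As a small illustration of this naming scheme, the snippet below composes and splits such a class-resource name; the JAR path and class name used here are hypothetical, not actual Repository entries.

```java
public class ClassResourceName {
    // A TunedIT class-resource name: "<jar path in Repository>:<full class name>"
    static String compose(String jarPath, String className) {
        return jarPath + ":" + className;
    }

    public static void main(String[] args) {
        String name = compose("John_Smith/Classifiers/MyAlg.jar",
                              "com.example.MyAlgorithm");
        // Split back into the JAR path and the package-qualified class name.
        String[] parts = name.split(":", 2);
        System.out.println(parts[0]); // John_Smith/Classifiers/MyAlg.jar
        System.out.println(parts[1]); // com.example.MyAlgorithm
    }
}
```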
For instance, ClassificationTT70 class contained in
Note that resource names are case-sensitive.
Many algorithms expose parameters that can be set by the user to control and modify their behavior. Currently, a test specification does not include parameter values, so the algorithm is expected to apply its default values. Users who want to test an algorithm with non-default parameters should write a wrapper class which internally invokes the algorithm with the parameters set to the desired non-default values. The values must be hard-wired in the wrapper class, so that the wrapper itself does not expose any parameters.
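The wrapper pattern can be sketched as follows. PrunedTree here is a made-up stand-in for any algorithm with a tunable parameter; the class names and the parameter are invented for this sketch, not part of TunedIT or any real library.

```java
// Hypothetical algorithm exposing one tunable parameter.
class PrunedTree {
    private double pruningConfidence = 0.25; // default value

    public void setPruningConfidence(double c) { pruningConfidence = c; }
    public double getPruningConfidence() { return pruningConfidence; }
}

// The wrapper hard-wires a non-default value in its constructor and exposes
// no parameters of its own, so the test is fully determined by the class
// name alone and remains reproducible.
public class PrunedTreeC10 extends PrunedTree {
    public PrunedTreeC10() {
        setPruningConfidence(0.10); // hard-wired non-default value
    }
}
```

Uploading such a wrapper (packed in a JAR) makes the parameterized variant testable under its own resource name.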
Users of TunedTester may safely execute tests of any algorithms present in Repository, even if the code cannot be fully trusted. TunedTester exploits advanced features of the Java Security Architecture to assure that the code executed during tests does not perform any harmful operations, like deleting files on disk or connecting through the network. Code downloaded from Repository executes in a sandbox which blocks the code's ability to interact with the system environment. This is achieved through the use of a dedicated Java class loader and custom security policies. Similar mechanisms are used in web browsers to protect the system from potentially malicious applets found on websites.
Communication between TunedTester and the TunedIT server is efficient thanks to the cache directory, which keeps local copies of resources from Repository. When a resource is needed for the first time and must be downloaded from the server, its copy is saved in the cache. In subsequent tests, when the resource is needed again, the copy is used instead. In this way, resources are downloaded from Repository only once. TunedTester detects if a resource has been updated in Repository and downloads the newest version in that case. Any changes introduced to the local copies of resources are also detected, so it is not possible to run a test with corrupted or intentionally faked resources.
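One common way to detect both server-side updates and local tampering is checksum validation, sketched below. This is only an illustration of the idea, assuming the server can report an up-to-date checksum for each resource; it is not TunedIT's actual implementation.

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class ResourceCache {
    private final Map<String, byte[]> localCopies = new HashMap<>();

    // Hex-encoded SHA-1 digest of a resource's bytes.
    static String sha1(byte[] data) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (byte b : MessageDigest.getInstance("SHA-1").digest(data))
            sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Returns the cached copy only if its checksum still matches the one
    // reported by the server; otherwise the caller must (re)download.
    public byte[] getIfValid(String name, String serverChecksum) throws Exception {
        byte[] copy = localCopies.get(name);
        if (copy != null && sha1(copy).equals(serverChecksum)) return copy;
        return null; // missing, outdated, or tampered with locally
    }

    public void store(String name, byte[] content) {
        localCopies.put(name, content);
    }
}
```

A stale server checksum (new version uploaded) and a modified local copy both fail the same comparison, so one check covers both cases.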
TunedTester may be started in a special challenge mode, used to evaluate solutions submitted to a challenge. In this mode, TT repeatedly queries TunedIT for new submissions, then downloads and evaluates them. It runs as a background process and does not require user interaction. Challenge mode is activated by passing the challenge name as an argument to the command-line option. In challenge mode, the GUI is not available and TT reports its current operations to the console.
It is possible to run more than one instance of TT for a given challenge in parallel. This is particularly useful when evaluation of a single solution is time-consuming, e.g., lasts more than an hour. With parallel execution, the queue of pending tests becomes shorter.
Different instances of TT running in parallel are independent of each other. The organizer may start a new instance or stop a given one at any time. Job scheduling is coordinated by TunedIT server. The instances may run on the same machine or on different ones. When running several instances on a single machine, take into account that sharing of hardware resources (CPU time, memory limit) may lead to variable evaluation conditions for different tests.
Knowledge Base (KB) is a database of test results generated by TunedTester. It is located on TunedIT server.
To guarantee that the results collected in KB are always consistent with the contents of Repository, and that Repository can indeed serve as a context for interpretation of the results, KB is automatically cleaned of all outdated results related to the old version of a resource whenever a new version of that resource is uploaded. Thus, there is no way to insert results into KB that are inconsistent with the contents of Repository.
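The invalidation rule can be sketched as follows. Version tracking is simplified to an integer counter per resource, and a result remembers the versions of the resources it was produced with; this is an illustration of the consistency rule only, not TunedIT's actual schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KnowledgeBaseSketch {
    private final Map<String, Integer> resourceVersions = new HashMap<>();
    private final List<Map<String, Integer>> results = new ArrayList<>();

    // Uploading a resource bumps its version and purges every stored result
    // that references an older version of that resource.
    public void upload(String resource) {
        resourceVersions.merge(resource, 1, Integer::sum);
        results.removeIf(r -> r.containsKey(resource)
                && !r.get(resource).equals(resourceVersions.get(resource)));
    }

    // A submitted result snapshots the current version of each resource used.
    public void submitResult(String... resources) {
        Map<String, Integer> snapshot = new HashMap<>();
        for (String res : resources)
            snapshot.put(res, resourceVersions.getOrDefault(res, 0));
        results.add(snapshot);
    }

    public int resultCount() { return results.size(); }
}
```

Because purging happens on every upload, no stored result can ever reference a resource version that is no longer in the repository.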
Aggregated vs atomic results
An atomic result is the result of a single test executed by TunedTester. It is possible to execute many tests with the same specification and log all their results in KB; thus, there can be many atomic results in KB which correspond to the same specification. Note that these results will usually differ from one another, because most tests include nondeterministic factors. For instance, the ClassificationTT70 and RegressionTT70 evaluation procedures split data randomly into training and test parts, which yields different splits in every trial and usually results in different outcomes of the tests. Algorithms may also employ randomness: for example, neural networks perform random initialization of weights at the beginning of learning.
An aggregated result is the aggregation (arithmetic mean, standard deviation, etc.) of all atomic results from KB related to a given test specification. There can be only one aggregated result for a given specification. Aggregated results are the ones presented on the Knowledge Base page. Currently, users of TunedIT do not have direct access to atomic results.
If tests of a given specification are fully deterministic, they will always produce the same outcome, and thus the aggregated result (mean) will be the same as all atomic results, with a standard deviation equal to zero. The presence of nondeterminism in tests is highly desirable, as it allows one to obtain broader knowledge about the tested algorithm (a non-zero deviation measures how reliably and repeatably the algorithm behaves) and a more reliable estimation of its expected quality (the mean of multiple atomic results that differ from one another).
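The aggregation described above amounts to computing the mean and standard deviation over the atomic results of one specification. A minimal sketch, with made-up accuracy values:

```java
public class Aggregate {
    // Mean of atomic results: estimates the expected quality of the algorithm.
    static double mean(double[] results) {
        double sum = 0;
        for (double r : results) sum += r;
        return sum / results.length;
    }

    // Population standard deviation: measures how repeatably the algorithm
    // behaves across nondeterministic runs (zero iff fully deterministic).
    static double stdDev(double[] results) {
        double m = mean(results), sq = 0;
        for (double r : results) sq += (r - m) * (r - m);
        return Math.sqrt(sq / results.length);
    }

    public static void main(String[] args) {
        // Five atomic accuracy results of the same test specification.
        double[] atomic = {0.81, 0.84, 0.79, 0.86, 0.80};
        System.out.println(mean(atomic));   // the aggregated mean result
        System.out.println(stdDev(atomic)); // spread across the five runs
    }
}
```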
Security issues. Validity of results
The user may assume that the results generated by others and collected in KB are valid, in the sense that if the user ran the same tests themselves, they would obtain the same expected results. In other words, results in KB can be trusted even if their authors - unknown users of TunedIT - cannot be trusted. This is possible thanks to numerous security measures built into Repository, TunedTester and KB, which ensure that KB contents cannot be polluted either by accidental mistakes or by intentional fakery of any user.
Every file and folder in Repository may have an associated description, visible on Repository page of a given resource. Description can be modified only by the owner.
TunedTester can be downloaded at this page.
TunedTester runs tests of algorithms according to the test specifications that you provide. A specification is composed of the TunedIT resource names of an evaluation procedure, a dataset and an algorithm that should be used to set up the test. It is possible to give several test specifications at once, by listing a number of datasets and/or algorithms in the text areas of the TunedTester window. In that case, TunedTester will run tests for all possible combinations of the given items.
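The expansion into all combinations can be sketched as a simple cross product; the resource names below are hypothetical placeholders, not actual Repository entries.

```java
import java.util.ArrayList;
import java.util.List;

public class SpecExpansion {
    // One test specification per (algorithm, dataset) pair, all sharing the
    // same evaluation procedure.
    static List<String> expand(String evaluation, List<String> algorithms,
                               List<String> datasets) {
        List<String> specs = new ArrayList<>();
        for (String alg : algorithms)
            for (String data : datasets)
                specs.add(evaluation + " + " + alg + " + " + data);
        return specs;
    }

    public static void main(String[] args) {
        List<String> specs = expand("ClassificationTT70",
                List.of("AlgA", "AlgB"),
                List.of("data1.arff", "data2.arff", "data3.arff"));
        System.out.println(specs.size()); // 2 algorithms x 3 datasets = 6 tests
    }
}
```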
In order to download necessary resources from Repository or send results to Knowledge Base, you must be authenticated by the TunedIT server. For this purpose, you must enter your username and password in the TunedTester window before starting test execution.
TunedTester creates a cache folder in the local file system to keep copies of resources downloaded from Repository. This folder may become large at some point and require manual cleaning. To do this, simply remove the folder with all its contents - it will be automatically recreated, empty, upon the next execution of TunedTester. The cache folder is named tunedit-cache and is located in the user's home directory.
Knowledge Base page
The KB page shows aggregated results of tests collected in KB. In the Filters section you can specify which results you want to view, by defining a pattern that must be matched by the test specifications of the results. The pattern is built as a conjunction of patterns for each part of the test specification: the name of the algorithm, dataset and evaluation procedure. An empty pattern will match all possible names. After the filters are defined, press "Show Results" to download the matching results from the TunedIT server. Please be patient, as this operation may take a couple of seconds. Once downloaded, the results are presented in the Results section, where you can manipulate them and change how they are presented without downloading them again.
Important: the exact meaning of "Mean Result" depends on which evaluation procedure was used. The result value can be interpreted either as a gain or as a loss, so for some evaluation procedures it is the bigger value that indicates higher quality of the algorithm, while for others it is the lower one. For instance, ClassificationTT70 measures the classification accuracy of an algorithm, interpreted as a gain, while RegressionTT70 calculates the Root Mean Squared Error (RMSE), interpreted as a loss. These differences must be taken into account when analysing test results. To find out how the results should be interpreted for a given evaluation procedure, it is best to go to its Repository page and read the description.
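The gain/loss distinction matters whenever results are compared programmatically, as in this small sketch (the numeric values are made up for illustration):

```java
public class ResultComparison {
    // Pick the best mean result: the comparison direction depends on whether
    // the evaluation procedure reports a gain (higher is better) or a loss
    // (lower is better).
    static double best(double[] means, boolean isGain) {
        double best = means[0];
        for (double m : means)
            best = isGain ? Math.max(best, m) : Math.min(best, m);
        return best;
    }

    public static void main(String[] args) {
        double[] accuracies = {0.91, 0.87, 0.94}; // gain, e.g. ClassificationTT70
        double[] rmse = {1.8, 2.3, 1.2};          // loss, e.g. RegressionTT70
        System.out.println(best(accuracies, true)); // prints 0.94 (higher wins)
        System.out.println(best(rmse, false));      // prints 1.2 (lower wins)
    }
}
```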
If "exact match" check box is on, pattern matching is case-sensitive. Please watch carefully for the case of letters.
The chart presents mean results of all algorithms that were tested on the selected dataset using the selected evaluation procedure. You can choose another dataset and evaluation procedure using drop-down lists located above the chart. If you place the mouse over a bar on the chart, a tooltip will show up in the upper-left corner of the window, displaying detailed information about the selected test.
Meaning of columns of the result table:
Names of evaluation procedures, algorithms and datasets are hyperlinks which lead to Repository pages of the resources, so you may click the name and see all the details of a given resource.
You can sort result tables by any column, in ascending or descending order, by clicking on the header of a chosen column.
You can download the results as CSV files for off-line analysis, by clicking on the [download as CSV] link located right above the result table.
In the following examples we assume that there is a user 'John_Smith' registered in TunedIT whose password is 'pass'.
Example 1 - run test with TunedTester
The screenshot below shows how to evaluate the J48 algorithm (decision tree induction) from Weka with TunedTester. We use here TunedTester's default evaluation procedure, ClassificationTT70. The test will be repeated 5 times on each of two datasets, audiology.arff and iris.arff from UCI. Remember that in TunedTester you must give the full names of algorithms, evaluation procedures and datasets, including their paths in Repository, as well as full package names for Java classes.
Example 2 - a classification algorithm suitable for ClassificationTT70 evaluation procedure
TunedTester's default evaluation procedure is ClassificationTT70 (samples are randomly shuffled before being split, with a 70/30 ratio, into train and test sets). It was designed to support three kinds of classifier interfaces - those defined in the Debellor, Rseslib and Weka libraries. We will show how to write an algorithm which may then be evaluated by the ClassificationTT70 procedure.
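Before tying the classifier to any particular library, the core majority-vote logic it implements can be sketched in plain Java. The class and method names below are ours, not those of Debellor, Rseslib or Weka; each library-specific version later in this section wraps essentially this logic in the respective API.

```java
import java.util.HashMap;
import java.util.Map;

// Framework-agnostic sketch of a majority classifier: training counts label
// frequencies; classification always returns the most frequent label.
public class MajorityLogic {
    private String majorityLabel;

    public void train(String[] labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels)
            counts.merge(label, 1, Integer::sum);
        majorityLabel = counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    // Every sample gets the majority label, regardless of its attributes.
    public String classify(Object sample) {
        return majorityLabel;
    }
}
```

Such a classifier is a useful baseline: any learning algorithm worth its name should beat it on most datasets.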
Debellor is an open-source extensible data mining framework which provides a common architecture for data processing algorithms of various types. The algorithms can be combined together to build data processing networks of large complexity. The unique feature of Debellor is data streaming, which enables efficient processing of large volumes of data. Data streaming is essential for providing scalability of the algorithms. See www.debellor.org for more details.
We will do everything step by step, but if you find yourself running out of patience, a slightly modified version of the example below is available and ready for evaluation in the repository - see MajorityClassifier_debellor.jar in the Examples folder.
To be able to compile our Debellor-based classifier, we will need a copy of the library [Debellor library download link]
Let's quote some important fragments of the Debellor's
A source code of our simple classifier:
Copy the above code and save it as a MajorityClassifier.java file. Place it in the same directory as the already downloaded debellor<version>.jar file. Then compile and pack the classifier into a JAR archive:
If everything went well, you should have a DebellorMajorityClassifier.jar file in the current directory.
Now you can upload the classifier into the repository, e.g. into the John_Smith/Classifiers folder:
Evaluate its accuracy on some datasets using the TunedTester GUI, referring to our classifier (if you followed our example) by John_Smith/Classifiers/DebellorMajorityClassifier.jar:MajorityClassifier
Or using command line:
And finally, you can inspect its results in the context of other algorithms' results:
As previously, the prepared source code may be downloaded from the repository's Examples folder.
To be able to compile a Rseslib-based classifier, we will need a copy of the library from the repository - [Rseslib library download link]
Let's quote Rseslib's Classifier interface:
We will implement the constructor and the classify method only:
Copy the above code and save it as a MajorityClassifier.java file. Place it in the same directory as the already downloaded rseslib<version>.jar file. Then compile and pack the classifier into a JAR archive:
If everything went well, you should have a RseslibMajorityClassifier.jar file in the current directory.
As previously, upload the classifier's JAR file into the repository's John_Smith/Classifiers folder and evaluate its accuracy on some datasets using either the TunedTester GUI or the command line - referring to the classifier by John_Smith/Classifiers/RseslibMajorityClassifier.jar:MajorityClassifier
The prepared source code may be downloaded from the repository's Examples folder.
To be able to compile a Weka-based classifier, we will need a copy of the library from the repository - [Weka library download link]
Let's quote some important fragments from Weka's Classifier class:
We will implement the buildClassifier and classifyInstance methods:
Copy the above code and save it as a MajorityClassifier.java file. Place it in the same directory as the already downloaded weka<version>.jar file. Then compile and pack the classifier into a JAR archive:
If everything went well, you should have a WekaMajorityClassifier.jar file in the current directory.
As previously, upload the classifier's JAR file into the repository's John_Smith/Classifiers folder and evaluate its accuracy on some datasets using either the TunedTester GUI or the command line - referring to the classifier by John_Smith/Classifiers/WekaMajorityClassifier.jar:MajorityClassifier
Example 3 - writing an evaluation procedure
Will appear soon...
Frequently Asked Questions (FAQ)
Q: If I use my private resource in a test and submit result to KB, will other users see the result on KB page?
A: No. To see the result the user must have access rights to all the resources used in a given test: algorithm, dataset and evaluation procedure.
Q: What if an error occurs during test and "Send results to Knowledge Base" option is checked? Is the error sent to KB? Is it included in results shown in KB page?
A: Errors caused by the tested algorithm are sent to KB. Other errors - those caused by the evaluation procedure or the testing environment (like problems with the network connection) - are not. Currently, errors submitted to KB are not included in the results shown on the KB page.
Q: What programming language should I use to implement new algorithms and evaluation procedures?
Q: What API should my algorithm implement to be suitable for TunedTester?
A: It depends on evaluation procedures that will be used for this algorithm.
The preferred way is to use the API of Debellor and implement the algorithm as a subclass of
Q: In what data format should I save my dataset so that TunedTester can use it?
A: It depends on the evaluation procedures that will be used. Currently, ARFF is the preferred format and should be understood by most evaluation procedures, including ClassificationTT70 and RegressionTT70.
Q: I receive OutOfMemory errors when running TunedTester.
A: This may occur if data used in tests are too large to fit in memory.
Try to increase the amount of memory available to TunedTester:
See also the discussion forum to view and post questions and answers.