Solubility data

Preparation of datasets and algorithms in a form suitable for TunedTester

Postby jcbradley » Wed Oct 14, 2009 10:49 am

Marcin
Thanks for your feedback on FriendFeed about not having to use ARFF format for TunedIT. The live data are currently in this format
http://spreadsheets.google.com/ccc?key= ... udnEmRD1aQ
Can you work with this?

Jean-Claude

--
Jean-Claude Bradley, Ph. D.
E-Learning Coordinator for the College of Arts and Sciences
Associate Professor of Chemistry
Drexel University

http://usefulchem.blogspot.com
http://drexel-coas-elearning.blogspot.com
http://drexel-coas-talks-mp3-podcast.blogspot.com/
http://friendfeed.com/jcbradley
jcbradley
 
Posts: 7
Joined: Tue Oct 13, 2009 6:19 pm

Re: Solubility data

Postby Marcin » Wed Oct 14, 2009 10:51 am

Jean,
Very nice data. As I understand, you are creating predictive models from them, right? Can you tell me:

- which column is the OUTPUT of the model - "concentration (M)"?
- which columns are the INPUT to the model - surely "solute" and "solvent", but do you use some more elementary features describing these two chemicals? like their physical properties? (I can see some columns on the right, but they're mostly empty)
- do you have some models already created? are they publicly available? are they implemented in some programming language?

As to ARFF, this is a very simple text format. I could help you with converting your data to ARFF - it's 15 minutes of work. I'm just not sure which columns should comprise the input.

Regards
Marcin Wojnarski
TunedIT
Marcin
 
Posts: 115
Joined: Fri Oct 09, 2009 6:45 pm

Re: Solubility data

Postby jcbradley » Wed Oct 14, 2009 10:54 am

Marcin - that spreadsheet does not have any modeling - it is just the experimentally measured solubility values. Andrew Lang has just posted about his model using molecular descriptors (from the CDK) of the solutes and solvents to predict solubility:
http://onschallenge.wikispaces.com/SolubilityModel003
Do you think it is possible to host the models as well as the data on your system?
Jean-Claude

Re: Solubility data

Postby Marcin » Wed Oct 14, 2009 10:56 am

Jean,

I've searched the net for the CDK Descriptor Calculator which you mentioned on the wiki, and along the way I've learnt a bit of cheminformatics :-) You're doing really interesting things with all these different molecules. It looks so familiar - train/test datasets, cross-validation, feature selection, regression and so on - all the things that we machine-learners love the most :-)

Yes, you can definitely host your data and models on TunedIT. I'm trying to figure out the best path. There are 2 things:

1. DATASET - must be saved in some common format. ARFF is a pretty good choice - very simple and understood by many existing software packages, like CDK Descriptor Calculator GUI: "ARFF output format is supported" (http://rguha.net/code/java/cdkdesc.html)

The question is what data to put inside as inputs to the model.
You have basically 2 choices:

(a) solute SMILES & solvent SMILES - as they are now. See the attached file - I've prepared an example arff to give you an idea how it looks

(b) descriptor values calculated by CDK - then instead of 2 columns of type 'string' you would have a number (10? 100?) of columns of type 'numeric' in the ARFF

I recommend (b) to start with - model implementation will be easier (no need to include CDK) and you could test existing general-purpose machine-learning algorithms for creating new models.

2. MODEL - should be implemented in Java if you want to evaluate it in an automated fashion with TunedTester. It's about 20 lines of Java code for the model described on the wiki. I'll give you more details when you have a dataset file ready.
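
As a rough illustration of option (b), the sketch below writes a tiny ARFF file in which every descriptor is a numeric attribute and the measured concentration is the last column (Weka tools commonly treat the last attribute as the output by default). It is written in Python only for brevity. The relation name, the attribute names and all values are made up for illustration and are not taken from the ONS dataset.

```python
def to_arff(relation, attributes, rows):
    """Return ARFF text with every attribute declared as numeric."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in attributes]
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(repr(float(v)) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical descriptor names and made-up values, for illustration only.
attributes = ["solute_desc1", "solvent_desc1", "concentration_M"]
rows = [(1.5, 0.25, 0.8), (2.0, 0.25, 0.1)]
arff_text = to_arff("solubility_example", attributes, rows)
print(arff_text)
```

A real file would have one column per CDK descriptor for the solute and for the solvent, so the attribute list would be much longer, but the layout stays the same.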

Regards,
Marcin
Attachments
solubilities.arff
example arff file with solubility data
(1.22 KiB) Downloaded 1519 times

Re: Solubility data

Postby jcbradley » Wed Oct 14, 2009 10:58 am

Marcin,
That sounds like a good way to start.
Andy - I think your starting model had about 80 descriptors?
I think we could get a few people to try their hand at coming up with alternative models based on the same dataset.
Eventually it would be nice to have data automatically update but this would be a useful first step.

Jean-Claude

Re: Solubility data

Postby onsc » Wed Oct 14, 2009 11:36 am

Marcin,

Please find attached the solubility data in arff format - descriptors included. This is the data as of September 17. It has more descriptors than in the model but not as many as are in the CDK.

If needed, the descriptors can be pulled one at a time for each SMILES using a webservice, e.g.:
http://toposome.chemistry.drexel.edu:66 ... ptor/CCC=O
(view source if you don't see anything)
but I think it is easier just to include them in the arff file as you suggested.

-Andy
Attachments
ONSData.arff
solubility data as of September 17
(111.93 KiB) Downloaded 1405 times
onsc
 
Posts: 6
Joined: Fri Oct 09, 2009 6:45 pm

Re: Solubility data

Postby Marcin » Wed Oct 14, 2009 11:38 am

Andrew,

Please make some fixes to the arff:

1) add "numeric" at the end of "@attribute solute_xlogp"

2) drop the "string" columns altogether - when you have numeric descriptors they are not necessary and would cause problems for many standard algorithms and for TunedTester (if you insist on keeping them we could cope with that, but it's easier to drop them now)

When you do this, just upload the file to TunedIT Repository.
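
The two fixes above can also be scripted. The following is a minimal sketch in Python, not the procedure actually used for ONSData.arff: it assumes attribute names contain no spaces, that values contain no commas outside single quotes, and that an attribute declared without a type should be numeric. Apart from solute_xlogp, the attribute names and all values in the sample are made up.

```python
import csv


def clean_arff(text):
    """Drop string attributes and default untyped attributes to numeric."""
    header, keep, data_rows, in_data = [], [], [], False
    for line in text.splitlines():
        stripped = line.strip()
        if in_data:
            # Collect data rows, skipping blank lines and % comments.
            if stripped and not stripped.startswith("%"):
                data_rows.append(stripped)
        elif stripped.lower().startswith("@attribute"):
            parts = stripped.split()
            if len(parts) == 2:  # type is missing: assume numeric
                parts.append("numeric")
            is_string = parts[-1].lower() == "string"
            keep.append(not is_string)
            if not is_string:
                header.append(" ".join(parts))
        else:
            header.append(line)
            in_data = stripped.lower() == "@data"
    out = list(header)
    # Keep only the values whose column survived the header filtering.
    for row in csv.reader(data_rows, quotechar="'"):
        out.append(",".join(v for v, k in zip(row, keep) if k))
    return "\n".join(out) + "\n"


# Made-up sample; only the name solute_xlogp comes from the thread.
sample = """@relation ons
@attribute solute string
@attribute solvent string
@attribute solute_xlogp
@attribute concentration numeric
@data
'CCC=O','CO',0.59,1.2
"""
cleaned = clean_arff(sample)
print(cleaned)
```

For a one-off fix, editing the header by hand and deleting the two columns in a spreadsheet is just as good; a script only pays off once the file is regenerated regularly.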

-Marcin

Re: Solubility data

Postby jcbradley » Wed Oct 14, 2009 11:39 am

Thanks, Andy. Marcin - yes, if we could incorporate the CDK as a web service it would make building models much more flexible, and the community we expect to help us build models is extremely familiar with that data source.
But for now, yes, let's just do it like this.

Re: Solubility data

Postby onsc » Wed Oct 14, 2009 11:39 am

I made the changes and uploaded to the repository:
http://tunedit.org/repo/onsc/ONSDataNumeric.arff

Re: Solubility data

Postby Marcin » Wed Oct 14, 2009 11:40 am

You could include CDK as a JAR file, combined with the code of your model - I've checked the CDK homepage and it seems that CDK is distributed as a single JAR, under a very liberal open source licence - so it should be no problem to include it in other code.

A web service is a handy tool, but it cannot be accessed from automated tests. This is because TunedTester creates a protection boundary ("sandbox") around the tested algorithm, which blocks any access to the system environment, such as the local disk or network connections. The sandbox ensures that: (1) any user can safely test any algorithm, even if the algorithm doesn't come from a trusted source; (2) the results sent to the Knowledge Base are reproducible - they depend only on the code stored in the Repository and not on external services, which can change their behavior at any time.

Anyway, this is not needed now that the dataset already contains the calculated descriptors.

-Marcin

Re: Solubility data

Postby Marcin » Wed Oct 14, 2009 11:42 am

Great. This arff is good. Weka algorithms can work with it. What next? Your model?

-Marcin

Re: Solubility data

Postby onsc » Wed Oct 14, 2009 11:44 am

Marcin,

How do I create the model? Hopefully it will be as easy as the arff.

-Andy

Re: Solubility data

Postby Marcin » Wed Oct 14, 2009 11:46 am

Andy,

The model must be implemented in Java. Do you know Java? What programming environment do you use? Eclipse? Something else?

Currently, there are 3 ways to implement an algorithm suitable for TunedTester: in the Debellor architecture, in Rseslib's, or in Weka's. So first we have to choose one of them. In the long term Debellor is the best, since it's integrated with TunedIT to the largest extent and it's the most flexible, but for ordinary types of algorithms and data (like simple numeric), I'd recommend Weka, for simplicity.

Take a look at Example 2 in docs:
http://tunedit.org/docs#examples

- how to write a classifier. The Weka case is at the end. Your classifier will look a bit different, but it will be similarly simple.

Marcin

Re: Solubility data

Postby onsc » Wed Oct 14, 2009 11:47 am

Marcin,

The data is still the same but the model is going through a refinement stage:
http://onschallenge.wikispaces.com/SolubilityModel003

I'm hoping it will be improved even further at some point by someone with more modeling experience.

-Andy

Re: Solubility data

Postby Marcin » Wed Oct 14, 2009 11:48 am

Andy,

I've run a couple of tests on your data using general-purpose ML algorithms from Weka (incl. Linear Regression). You may see the results here:

http://tunedit.org/results?e=&d=onsc&a=&n=

"Mean Result" is the Root Mean Squared Error of the solubility predictions, averaged over different splits of the dataset into training/test parts. Hover the mouse pointer over a chart bar to see detailed info about a given result.

Note that these results are statistically valid (non-biased), because the models are trained on a random subset (70%) of the full dataset, and evaluated on the rest (30%) of samples. So evaluation and training are done on separate parts of data.
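
To make the evaluation scheme concrete, here is a small Python sketch of "RMSE averaged over random 70/30 splits". It is not TunedTester's code: the number of repeats, the seed and the trivial mean-value "model" are assumptions chosen to keep the example self-contained, and the target values are made up.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mean_rmse(targets, train_fraction=0.7, repeats=5, seed=0):
    """Average test-set RMSE of a mean-value baseline over random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = list(targets)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        if cut == 0 or cut == len(shuffled):
            raise ValueError("not enough samples for a train/test split")
        train, test = shuffled[:cut], shuffled[cut:]
        # The "model" is fitted on the training part only ...
        baseline = sum(train) / len(train)
        # ... and scored on samples it has never seen.
        scores.append(rmse(test, [baseline] * len(test)))
    return sum(scores) / len(scores)


# Made-up solubility values (M), for illustration only.
example_targets = [0.1, 0.4, 0.35, 1.2, 0.8, 0.05, 0.6, 0.9, 0.3, 0.7]
print(mean_rmse(example_targets))
```

A real test would swap the mean-value baseline for a regression algorithm trained on the descriptor columns; the split-train-score loop stays the same.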

Regards,
Marcin

Re: Solubility data

Postby onsc » Wed Oct 14, 2009 11:49 am

Marcin,

Thanks!!!

I tried to download the csv in the raw results tab and I am getting an error. Anyway, I've been spending hours making up my own genetic algorithms in Mathematica to find descriptors, but it looks like there are several sophisticated linear regression algorithms built into TunedIT?! This would be awesome!

How do I find the coefficients and the descriptors used in the models? I thought they would be in the csv file.

-Andy

Re: Solubility data

Postby jcbradley » Wed Oct 14, 2009 11:51 am

Marcin
This certainly looks like it can be useful for our research. In line with what Andy is asking, is there an easy way to use the models predictively? For example, Andy set up this page for predicting solubilities of any solute in any solvent:
http://showme.physics.drexel.edu/onsc/m ... olvent.php

We would like to be able to do that with any of the models housed on TunedIT

Jean-Claude

Re: Solubility data

Postby Marcin » Wed Oct 14, 2009 11:52 am

Hello,

These algorithms (and many more) come from the Weka library. No models are generated and stored, because the test consists of: (i) generating the model on the training subset (70%) of the full data, (ii) evaluating it on the test subset (30%) and (iii) discarding the model (!). So there's no model at the end, only a quality measure of the algorithm that generates models.

(It is possible to evaluate a particular model, but you have to prepare this model as Java code beforehand and upload it to the Repository.)

How to create a working model? Using Weka application:

http://www.cs.waikato.ac.nz/ml/weka/ind ... ading.html

Weka has a GUI to do this (Explorer). You can train the model with any available algorithm, on the whole dataset, and then save it to disk. It can be used later on, in any other application, but I cannot tell you now how to do this - presumably a piece of Java code might be necessary.

Marcin

Re: Solubility data

Postby Marcin » Tue Oct 20, 2009 2:57 pm

Jean and Andy,

I'd like to make sure I understand how your predictive model (http://onschallenge.wikispaces.com/SolubilityModel003) works. How is the prediction calculated? As I understand it, this table describes a linear regression, with the 1st column containing the variables and the "Estimate" column containing the coefficients - am I right? So the equation for the predicted value would be something like:
Code:
   prediction = -133.454 - 31.4476 * (1/AMR) + 36.6713 * (1/Kier1) + ...

Is this correct?

Re: Solubility data

Postby onsc » Tue Oct 20, 2009 3:35 pm

Right Marcin. The first column is the variable and the second, the coefficient, exactly as you say. -Andy
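
The confirmed form of the model (an intercept plus coefficients multiplied by reciprocal descriptor values) can be sketched as follows. This is a Python illustration only, not the Java implementation TunedTester would need, and it covers just the two terms quoted in the post above; the remaining terms of the table are elided there and are deliberately not filled in here. The descriptor values in the example call are made up.

```python
INTERCEPT = -133.454

# Coefficients of the reciprocal descriptors quoted in the thread.
# The rest of the table is elided in the post ("...") and is omitted here.
COEFFICIENTS = {"AMR": -31.4476, "Kier1": 36.6713}


def predict(descriptors):
    """Evaluate intercept + sum(coef * 1/descriptor) over the terms above."""
    total = INTERCEPT
    for name, coef in COEFFICIENTS.items():
        value = descriptors[name]
        if value == 0:
            raise ValueError(f"descriptor {name} is zero; 1/{name} is undefined")
        total += coef * (1.0 / value)
    return total


# Made-up descriptor values, for illustration only.
partial = predict({"AMR": 2.0, "Kier1": 4.0})
print(partial)
```

Because only two of the terms are included, the number printed is a partial sum and not a solubility prediction; adding the remaining rows of the table to COEFFICIENTS would complete it.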