Repository /Data/Text-wc/ohscal.wc.arff

5.0 MB
2010-04-08 11:09:40


Multi-class (1-of-n) text dataset donated by George Forman.

Description of original data sets:

I'd like to offer this jar containing 19 multi-class (1-of-n) text
datasets, whose word count feature vectors have already been extracted.
I thought it'd be good to
post at the UCI repository and the WEKA datasets site, if you are
interested. It's 14MB compressed.

The problems come from LA Times, TREC, OHSUMED, etc. and the data were
originally converted to word counts by

Han, E. and Karypis, G. Centroid-Based Document Classification:
Analysis & Experimental Results. In Proc. of the 4th European Conf. on
the Principles of Data Mining and Knowledge Discovery (PKDD): 424-431,

I have found them quite useful for studies, e.g.

G. Forman & Ira Cohen. Learning from Little: Comparison of Classifiers
Given Little Training. ECML'04. Hewlett-Packard Labs TR HPL-2004-19R1.

G. Forman. A Pitfall and Solution in Multi-Class Feature Selection for
Text Classification. ICML'04. HPL-2004-86.

G. Forman. An Extensive Empirical Study of Feature Selection Metrics
for Text Classification. Special Issue on Variable and Feature
Selection, Journal of Machine Learning Research, 3(Mar):1289-1305, 2003.
HPL-2002-147R1. (Their web-site has a subset of these datasets, but it
only includes binary features: word occurs 1 or more times.)

George Forman
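
The note above distinguishes word count features from binary features (word occurs 1 or more times). As a minimal sketch of that difference, assuming a naive whitespace tokenizer and a made-up sentence (neither comes from the datasets, and the helper names `word_counts` and `binarize` are hypothetical), the two representations could be computed like this:

```python
from collections import Counter


def word_counts(text):
    """Return a word -> count mapping for one document.

    Assumption: lowercase + whitespace split; the real datasets were
    tokenized by their original authors, possibly differently.
    """
    return Counter(text.lower().split())


def binarize(counts):
    """Collapse counts to 1 for every word that occurs one or more times."""
    return {word: 1 for word, n in counts.items() if n >= 1}


# Made-up example text, not taken from any of the 19 datasets.
doc = "heart rate and heart rhythm"
counts = word_counts(doc)
binary = binarize(counts)

print(counts["heart"], binary["heart"])  # 2 1
```

The binary form keeps the same vocabulary but discards how often each word occurs, which is the information the word count vectors retain.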

