Datasets for single-label text categorization

Here you can find the datasets for single-label text categorization that I used in my PhD work. This is a copy of the page at IST.

This page makes available some files containing the terms I obtained by pre-processing some well-known datasets used for text categorization.

I did not create the datasets. I am simply making available already processed versions of them, for three main reasons:

    • To allow an easier comparison among different algorithms. Many papers in this area use these datasets but report slightly different numbers of terms for each of them. By having exactly the same terms, the comparisons made using these files will be more reliable.
    • To ease the work of people starting out in this field. Because these files contain less information than the original ones, they can have a simpler format and thus will be easier to process. The most common pre-processing steps are also provided.
    • To provide single-label datasets, which some of the original datasets were not.

I make them available here on the same terms as they were originally available, which is basically for research purposes. If you want to use them for any other purpose, please ask for permission from the original creator. You can reach their homepages by following the links next to each one of them.

If you use the datasets that I provide here, please cite my PhD thesis, where I describe the datasets in section 2.8. Ana Cardoso-Cachopo, Improving Methods for Single-label Text Categorization, PhD Thesis, October, 2007.

@Misc{2007:phd-Ana-Cardoso-Cachopo,   
  author = {Ana Cardoso-Cachopo},
  title = {{Improving Methods for Single-label Text Categorization}},
  howpublished = {PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa},
  year = 2007} 

All the files mentioned below are available individually and in one zip file (48 Mb) from this folder in Google Drive.

20 Newsgroups

I downloaded the 20Newsgroups dataset from Jason Rennie's page and used the "bydate" version, because it already had a standard train/test split. This dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

Although already cleaned-up, this dataset still had several attachments, many PGP keys and some duplicates.

After removing them, along with the messages that became empty as a result, the distribution of train and test messages for each newsgroup was the following:

Reuters 21578

I downloaded the Reuters-21578 dataset from David Lewis' page and used the standard "ModApté" train/test split. These documents appeared on the Reuters newswire in 1987 and were manually classified by personnel from Reuters Ltd.

Because the class distribution for these documents is very skewed, two sub-collections are usually considered for text categorization tasks:

    • R10 The set of the 10 classes with the highest number of positive training examples.
    • R90 The set of the 90 classes with at least one positive training and testing example.

Moreover, many of these documents are classified as having no topic at all or as having more than one topic. You can see the distribution of the documents per number of topics in the following table, where # train docs and # test docs refer to the ModApté split and # other refers to documents that were not considered in this split:

As the goal of this page is to consider single-labeled datasets, all the documents with fewer than one or more than one topic were eliminated. This includes one test document that was labeled as "trade" twice in the original collection (thank you, Austin Brockmeier, for pointing this out). As a result, some of the classes in R10 and R90 were left with no train or test documents.

Considering only the documents with a single topic and the classes which still have at least one train and one test example, we have 8 of the 10 most frequent classes and 52 of the original 90.
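The filtering just described can be sketched as follows. This is only an illustration, not the code used to build the files: the `(split, topics)` tuple layout and the function name `single_label_subset` are assumptions made for the example, and the sample documents are made up.

```python
from typing import List, Tuple

# Assumed layout: (split, topics), with split being "train" or "test".
Doc = Tuple[str, List[str]]


def single_label_subset(docs: List[Doc]) -> List[Tuple[str, str]]:
    """Keep single-topic documents whose class has train and test examples."""
    # Keep only documents with exactly one topic. A document labeled
    # "trade" twice has two entries in its topic list, so it is dropped too.
    single = [(split, topics[0]) for split, topics in docs if len(topics) == 1]

    # A class survives only if it still has at least one train
    # and at least one test document.
    train_classes = {label for split, label in single if split == "train"}
    test_classes = {label for split, label in single if split == "test"}
    kept = train_classes & test_classes

    return [(split, label) for split, label in single if label in kept]


# Made-up documents, not taken from Reuters-21578.
example_docs = [
    ("train", ["earn"]),
    ("test", ["earn"]),
    ("train", ["grain", "wheat"]),  # more than one topic: dropped
    ("train", ["corn"]),            # class has no test document: dropped
    ("test", ["trade", "trade"]),   # same topic twice: dropped
    ("test", []),                   # no topic: dropped
]
filtered = single_label_subset(example_docs)
```

In this toy example only the two "earn" documents survive.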

Following Sebastiani's convention, we will call these sets R8 and R52. Note that from R10 to R8 the classes corn and wheat, which are intimately related to the class grain, disappeared, and this last class lost many of its documents.

The distribution of documents per class is the following for R8 and R52:

Cade

The documents in the Cade12 dataset correspond to a subset of web pages extracted from the CADÊ Web Directory, which points to Brazilian web pages classified by human experts. This directory was available at Cade's Homepage, in Brazilian Portuguese.

A pre-processed version of this dataset was made available to me by Marco Cristo, from Universidade Federal de Minas Gerais, in Brazil. This dataset is part of project Gerindo.

Because there is no standard train/test split for this dataset, and in order to be consistent with the previous ones, I randomly chose two thirds of the documents for training and the remaining third for testing.
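A random two-thirds/one-third split of this kind could look like the sketch below. It is not the procedure that produced the distributed files: the seed, the rounding of the cut point, and the absence of per-class stratification are all assumptions, so it will not reproduce the split provided here.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_two_thirds(docs: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the documents and return (train, test) as roughly 2/3 and 1/3."""
    shuffled = list(docs)
    # A dedicated Random instance keeps the shuffle repeatable for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3  # assumption: round the train size down
    return shuffled[:cut], shuffled[cut:]


# Nine placeholder "documents" stand in for the real ones.
train, test = split_two_thirds(range(9))
```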

For this particular split, the distribution of documents per class is the following:

WebKB

The documents in the WebKB dataset are webpages collected by the World Wide Knowledge Base (Web->KB) project of the CMU text learning group, and were downloaded from The 4 Universities Data Set Homepage. These pages were collected from computer science departments of various universities in 1997 and manually classified into seven different classes: student, faculty, staff, department, course, project, and other.

The class other is a collection of pages that were not deemed the "main page" representing an instance of the previous six classes. For example, a particular faculty member may be represented by a home page, a publications list, a vitae and several research interests pages. Only the faculty member's home page was placed in the faculty class. The publications list, vitae and research interests pages were all placed in the other category.

For each class, the collection contains pages from four universities: Cornell, Texas, Washington, Wisconsin, and other miscellaneous pages collected from other universities.

I discarded the classes department and staff because there were only a few pages from each university. I also discarded the class other because the pages in this class were very different from one another.

Because there is no standard train/test split for this dataset, and in order to be consistent with the previous ones, I randomly chose two thirds of the documents for training and the remaining third for testing.

For this particular split, the distribution of documents per class is the following:

The files

All the files mentioned below are available individually and in one zip file (48 Mb) from this folder in Google Drive.

File description

All of these are text files containing one document per line.

Each document is composed of its class and its terms.

Each document is represented by a "word" representing the document's class, a TAB character and then a sequence of "words" delimited by spaces, representing the terms contained in the document.
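A minimal sketch of reading this format is shown below. The function name `parse_documents` and the sample lines are made up for illustration; with a real file you would pass an open file object instead of the list.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_documents(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Parse lines of the form '<class>TAB<term> <term> ...'."""
    documents = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines, should there be any
        # Everything before the first TAB is the class, the rest are the terms.
        label, _, text = line.partition("\t")
        documents.append((label, text.split()))
    return documents


# Made-up lines in the described format, not taken from the real files.
sample = ["earn\tcompany profit rose\n", "trade\texport tariff talks\n"]
docs = parse_documents(sample)

# Number of documents per class, useful for checking the class distribution.
class_counts = Counter(label for label, _ in docs)
```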

Pre-processing

To obtain the present files from the original datasets (except for Cade12), I applied the following pre-processing steps:

    1. all-terms Obtained from the original datasets by applying the following transformations:
        1. Substitute TAB, NEWLINE and RETURN characters by SPACE.
        2. Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
        3. Turn all letters to lowercase.
        4. Substitute multiple SPACES by a single SPACE.
        5. The title/subject of each document is simply added in the beginning of the document's text.
    2. no-short Obtained from the previous file, by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
    3. no-stop Obtained from the previous file, by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
    4. stemmed Obtained from the previous file, by applying Porter's Stemmer to the remaining words. Information about stemming can be found here.
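The first three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that produced the files: "letters" is taken to mean ASCII letters, leading and trailing spaces are trimmed, and `EXAMPLE_STOPWORDS` is a tiny placeholder rather than the 524-word SMART list. The stemming step is left out because it needs a Porter stemmer implementation.

```python
import re
from typing import Iterable, List

# Placeholder only: the real SMART list has 524 entries and is not reproduced here.
EXAMPLE_STOPWORDS = {"the", "and", "was", "with"}


def all_terms(title: str, body: str) -> str:
    """Apply the 'all-terms' transformations to one document."""
    text = f"{title} {body}"                # title goes in front of the text
    # Replacing every non-letter with a space also covers TAB, NEWLINE and RETURN.
    text = re.sub(r"[^A-Za-z]", " ", text)  # assumption: ASCII letters only
    text = text.lower()
    # Collapse runs of spaces; trimming the ends is an assumption.
    return re.sub(r" +", " ", text).strip()


def no_short(words: Iterable[str]) -> List[str]:
    """Remove words that are less than 3 characters long."""
    return [w for w in words if len(w) >= 3]


def no_stop(words: Iterable[str], stopwords=EXAMPLE_STOPWORDS) -> List[str]:
    """Remove stopwords (here a placeholder set, not the SMART list)."""
    return [w for w in words if w not in stopwords]


# Made-up document for illustration.
cleaned = all_terms("Re: Hello", "He saw the report\tat 3pm.\n")
words = no_stop(no_short(cleaned.split()))
```

Here `cleaned` is "re hello he saw the report at pm" and `words` ends up as hello, saw, report.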

Some results

Just to give an idea of the relative hardness of each dataset, I have determined the accuracy that some of the most common classification methods achieve with them. As usual, tfidf term weighting is used to represent document vectors, and they were normalized to unit length. The stemmed train and test sets were used for each dataset.
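For readers new to the representation, here is a small sketch of tfidf weighting followed by normalization to length one. The exact tfidf variant used for the results is not stated on this page, so the formula below (raw term frequency times log(N/df)) is an assumption, as are the function name and the toy documents.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_unit_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse tfidf vector per document, scaled to length 1."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: raw term frequency times log(N / df).
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:  # an all-zero vector is left as it is
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


# Toy documents, already split into terms.
vectors = tfidf_unit_vectors([["oil", "price", "oil"], ["price", "rise"]])
```

In this toy case "price" occurs in every document, so its weight is zero and each vector reduces to a single term with weight one.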

The "dumb classifier" is included as a baseline. It ignores the query and always gives as the predicted class the most frequent class in the training set.
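A baseline of this kind takes only a few lines. The sketch below uses made-up labels, and the function names are chosen for the example; ties between equally frequent classes are broken by whichever class appears first in the training labels, which is an assumption.

```python
from collections import Counter
from typing import Callable, List, Sequence


def train_dumb_classifier(train_labels: Sequence[str]) -> Callable[[object], str]:
    """Return a classifier that always predicts the most frequent training class."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    # The query document is ignored on purpose.
    return lambda _doc: most_common_label


def accuracy(classifier: Callable[[object], str],
             test_docs: Sequence[object],
             test_labels: Sequence[str]) -> float:
    """Fraction of test documents whose predicted class is the true class."""
    correct = sum(classifier(d) == y for d, y in zip(test_docs, test_labels))
    return correct / len(test_labels)


# Made-up labels for illustration.
dumb = train_dumb_classifier(["earn", "earn", "acq"])
test_docs: List[List[str]] = [["a"], ["b"], ["c"], ["d"]]
baseline_accuracy = accuracy(dumb, test_docs, ["earn", "acq", "earn", "earn"])
```

The accuracy of this baseline equals the share of the test set that belongs to the most frequent training class.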

Note that, because R8, R52, and WebKB are very skewed, the dumb classifier has a "reasonable" performance for these datasets. It is also worth noting that, while for R8, R52, 20Ng, and WebKB it is possible to find good classifiers, that is, classifiers that achieve a high accuracy, for Cade12 the best we can get does not reach 58% accuracy, even with some of the best classifiers available.