Datasets for single-label text categorization

Here you can find the Datasets for single-label text categorization that I used in my PhD work. This is a copy of the page at IST.

This page makes available some files containing the terms I obtained by pre-processing some well-known datasets used for text categorization.

I did not create the datasets. I am simply making available already processed versions of them, for three main reasons:

  • To allow an easier comparison among different algorithms. Many papers in this area use these datasets but report slightly different numbers of terms for each of them. By having exactly the same terms, the comparisons made using these files will be more reliable.
  • To ease the work of people starting out in this field. Because these files contain less information than the original ones, they can have a simpler format and thus will be easier to process. The most common pre-processing steps are also provided.
  • To provide single-label versions of datasets that were originally multi-label.

I make them available here under the same terms as they were originally available, which is basically for research purposes. If you want to use them for any other purpose, please ask permission from the original creators; you can reach their homepages by following the links next to each dataset.

If you use the datasets that I provide here, please cite my PhD thesis, where I describe the datasets in section 2.8. Ana Cardoso-Cachopo, Improving Methods for Single-label Text Categorization, PhD Thesis, October, 2007.


@Misc{2007:phd-Ana-Cardoso-Cachopo,
  author =	 {Ana Cardoso-Cachopo},
  title =	 {{Improving Methods for Single-label Text
                  Categorization}},
  howpublished = {PhD Thesis, Instituto Superior Tecnico, Universidade
                  Tecnica de Lisboa},
  year =	 2007
}

All the files mentioned below are available in one zip file (48 Mb).

20 Newsgroups

I downloaded the 20Newsgroups dataset from Jason Rennie's page and used the "bydate" version, because it already had a standard train/test split. This dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

Although already cleaned-up, this dataset still had several attachments, many PGP keys and some duplicates.

After removing them, along with the messages that became empty as a result, the distribution of train and test messages was the following for each newsgroup:

20 Newsgroups
Class # train docs # test docs Total # docs
alt.atheism 480 319 799
comp.graphics 584 389 973
comp.os.ms-windows.misc 572 394 966
comp.sys.ibm.pc.hardware 590 392 982
comp.sys.mac.hardware 578 385 963
comp.windows.x 593 392 985
misc.forsale 585 390 975
rec.autos 594 395 989
rec.motorcycles 598 398 996
rec.sport.baseball 597 397 994
rec.sport.hockey 600 399 999
sci.crypt 595 396 991
sci.electronics 591 393 984
sci.med 594 396 990
sci.space 593 394 987
soc.religion.christian 598 398 996
talk.politics.guns 545 364 909
talk.politics.mideast 564 376 940
talk.politics.misc 465 310 775
talk.religion.misc 377 251 628
Total 11293 7528 18821

Reuters 21578

I downloaded the Reuters-21578 dataset from David Lewis' page and used the standard "modApté" train/test split. These documents appeared on the Reuters newswire in 1987 and were manually classified by personnel from Reuters Ltd.

Because the class distribution for these documents is very skewed, two sub-collections are usually considered for text categorization tasks (see this paper):

  • R10 The set of the 10 classes with the highest number of positive training examples.
  • R90 The set of the 90 classes with at least one positive training and testing example.

Moreover, many of these documents are classified as having no topic at all, or as having more than one topic. The following table shows the distribution of documents per number of topics, where # train docs and # test docs refer to the ModApté split and # other refers to documents that were not considered in this split:

Reuters 21578
# Topics # train docs # test docs # other Total # docs
0 1828 280 8103 10211
1 6552 2581 361 9494
2 890 309 135 1334
3 191 64 55 310
4 62 32 10 104
5 39 14 8 61
6 21 6 3 30
7 7 4 0 11
8 4 2 0 6
9 4 2 0 6
10 3 1 0 4
11 0 1 1 2
12 1 1 0 2
13 0 0 0 0
14 0 2 0 2
15 0 0 0 0
16 1 0 0 1

As the goal of this page is to provide single-label datasets, all documents with zero topics or with more than one topic were eliminated. As a result, some of the classes in R10 and R90 were left with no train or test documents.

Considering only the documents with a single topic and the classes which still have at least one train and one test example, we have 8 of the 10 most frequent classes and 52 of the original 90.
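
The filtering just described can be expressed compactly. Below is an illustrative sketch, not the original script; it assumes each document is given as a (topics, split) pair, with split being "train" or "test":

```python
def single_label_classes(docs):
    """Keep documents with exactly one topic, then return the classes
    that still have at least one train and one test document.

    docs: iterable of (topics, split) pairs, with split in {"train", "test"}.
    """
    train_classes, test_classes = set(), set()
    for topics, split in docs:
        if len(topics) != 1:  # drop zero-topic and multi-topic documents
            continue
        target = train_classes if split == "train" else test_classes
        target.add(topics[0])
    # a class survives only if it appears on both sides of the split
    return train_classes & test_classes
```

Applying this criterion to the single-topic Reuters-21578 documents is what reduces R10 to 8 classes and R90 to 52.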

Following Sebastiani's convention, we will call these sets R8 and R52. Note that going from R10 to R8, the classes corn and wheat, which are intimately related to the class grain, disappeared, and grain itself lost many of its documents.

The distribution of documents per class is the following for R8 and R52:

R8
Class # train docs # test docs Total # docs
acq 1596 696 2292
crude 253 121 374
earn 2840 1083 3923
grain 41 10 51
interest 190 81 271
money-fx 206 87 293
ship 108 36 144
trade 251 75 326
Total 5485 2189 7674
R52
Class # train docs # test docs Total # docs
acq 1596 696 2292
alum 31 19 50
bop 22 9 31
carcass 6 5 11
cocoa 46 15 61
coffee 90 22 112
copper 31 13 44
cotton 15 9 24
cpi 54 17 71
cpu 3 1 4
crude 253 121 374
dlr 3 3 6
earn 2840 1083 3923
fuel 4 7 11
gas 10 8 18
gnp 58 15 73
gold 70 20 90
grain 41 10 51
heat 6 4 10
housing 15 2 17
income 7 4 11
instal-debt 5 1 6
interest 190 81 271
ipi 33 11 44
iron-steel 26 12 38
jet 2 1 3
jobs 37 12 49
lead 4 4 8
lei 11 3 14
livestock 13 5 18
lumber 7 4 11
meal-feed 6 1 7
money-fx 206 87 293
money-supply 123 28 151
nat-gas 24 12 36
nickel 3 1 4
orange 13 9 22
pet-chem 13 6 19
platinum 1 2 3
potato 2 3 5
reserves 37 12 49
retail 19 1 20
rubber 31 9 40
ship 108 36 144
strategic-metal 9 6 15
sugar 97 25 122
tea 2 3 5
tin 17 10 27
trade 251 75 326
veg-oil 19 11 30
wpi 14 9 23
zinc 8 5 13
Total 6532 2568 9100

Cade

The documents in the Cade12 correspond to a subset of web pages extracted from the CADÊ Web Directory, which points to Brazilian web pages classified by human experts. This directory is available at Cade's Homepage, in Brazilian Portuguese.

A pre-processed version of this dataset was made available to me by Marco Cristo, from Universidade Federal de Minas Gerais, in Brazil. This dataset is part of project Gerindo.

Because there is no standard train/test split for this dataset, and in order to be consistent with the previous ones, I randomly chose two thirds of the documents for training and the remaining third for testing.
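
A random two-thirds/one-third split like this can be sketched as follows. The fixed seed below is arbitrary and only illustrates how to make such a split repeatable; it is not the one behind the published files:

```python
import random

def train_test_split(docs, train_fraction=2/3, seed=42):
    """Shuffle the documents and split them into train and test portions."""
    docs = list(docs)
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    rng.shuffle(docs)
    cut = round(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]
```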

For this particular split, the distribution of documents per class is the following:

Cade12
Class # train docs # test docs Total # docs
01--servicos 5627 2846 8473
02--sociedade 4935 2428 7363
03--lazer 3698 1892 5590
04--informatica 2983 1536 4519
05--saude 2118 1053 3171
06--educacao 1912 944 2856
07--internet 1585 796 2381
08--cultura 1494 643 2137
09--esportes 1277 630 1907
10--noticias 701 381 1082
11--ciencias 569 310 879
12--compras-online 423 202 625
Total 27322 13661 40983

WebKB

The documents in the WebKB are webpages collected by the World Wide Knowledge Base (Web->Kb) project of the CMU text learning group, and were downloaded from The 4 Universities Data Set Homepage. These pages were collected from computer science departments of various universities in 1997, manually classified into seven different classes: student, faculty, staff, department, course, project, and other.

The class other is a collection of pages that were not deemed the ``main page'' representing an instance of the previous six classes. For example, a particular faculty member may be represented by a home page, a publications list, a vitae and several research-interests pages. Only the faculty member's home page was placed in the faculty class; the publications list, vitae and research-interests pages were all placed in the other category.

For each class, the collection contains pages from four universities (Cornell, Texas, Washington, and Wisconsin), plus miscellaneous pages collected from other universities.

I discarded the classes department and staff because there were only a few pages from each university, and the class other because its pages were too heterogeneous.

Because there is no standard train/test split for this dataset, and in order to be consistent with the previous ones, I randomly chose two thirds of the documents for training and the remaining third for testing.

For this particular split, the distribution of documents per class is the following:

WebKB
Class # train docs # test docs Total # docs
project 336 168 504
course 620 310 930
faculty 750 374 1124
student 1097 544 1641
Total 2803 1396 4199

The files


All the files mentioned below are available in one zip file (48 Mb).

20 Newsgroups (train: 11293 docs; test: 2528+ docs — see table above: 7528 docs)

            Train                              Test
all-terms   20ng-train-all-terms (15.91 Mb)    20ng-test-all-terms (10.31 Mb)
no-short    20ng-train-no-short (14.06 Mb)     20ng-test-no-short (9.12 Mb)
no-stop     20ng-train-no-stop (10.59 Mb)      20ng-test-no-stop (6.86 Mb)
stemmed     20ng-train-stemmed (9.46 Mb)       20ng-test-stemmed (6.13 Mb)

Reuters-21578 R8 (train: 5485 docs; test: 2189 docs)

            Train                              Test
all-terms   r8-train-all-terms (3.20 Mb)       r8-test-all-terms (1.14 Mb)
no-short    r8-train-no-short (2.90 Mb)        r8-test-no-short (1.03 Mb)
no-stop     r8-train-no-stop (2.42 Mb)         r8-test-no-stop (0.86 Mb)
stemmed     r8-train-stemmed (2.13 Mb)         r8-test-stemmed (0.76 Mb)

Reuters-21578 R52 (train: 6532 docs; test: 2568 docs)

            Train                              Test
all-terms   r52-train-all-terms (4.08 Mb)      r52-test-all-terms (1.45 Mb)
no-short    r52-train-no-short (3.71 Mb)       r52-test-no-short (1.32 Mb)
no-stop     r52-train-no-stop (3.08 Mb)        r52-test-no-stop (1.09 Mb)
stemmed     r52-train-stemmed (2.71 Mb)        r52-test-stemmed (0.96 Mb)

Cade12 (train: 27322 docs; test: 13661 docs)

stemmed     cade-train-stemmed (24.50 Mb)      cade-test-stemmed (11.65 Mb)

WebKB (train: 2803 docs; test: 1396 docs)

stemmed     webkb-train-stemmed (2.40 Mb)      webkb-test-stemmed (1.20 Mb)

File description

All of these are text files containing one document per line.

Each document is represented by a "word" naming the document's class, a TAB character, and then a sequence of "words" delimited by spaces, representing the terms contained in the document.
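
Given this format, loading one of the files takes only a few lines. A minimal sketch in Python (the helper name and the file path in the usage note are just examples):

```python
def load_dataset(path):
    """Load a file with one document per line: 'class<TAB>term term ...'.

    Returns two parallel lists: class labels and per-document term lists.
    """
    labels, docs = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            label, _, text = line.partition("\t")
            labels.append(label)
            docs.append(text.split())
    return labels, docs
```

For example, load_dataset("r8-train-stemmed.txt") would return 5485 labels and 5485 term lists.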

Pre-processing

To obtain the present files from the original datasets (except for Cade12, which I received already pre-processed), I applied the following pre-processing steps:

  1. all-terms Obtained from the original datasets by applying the following transformations:
    1. Substitute TAB, NEWLINE and RETURN characters by SPACE.
    2. Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
    3. Turn all letters to lowercase.
    4. Substitute multiple SPACES by a single SPACE.
    5. The title/subject of each document is simply added at the beginning of the document's text.
  2. no-short Obtained from the previous file, by removing words that are fewer than 3 characters long. For example, removing "he" but keeping "him".
  3. no-stop Obtained from the previous file, by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
  4. stemmed Obtained from the previous file, by applying Porter's Stemmer to the remaining words. Information about stemming can be found here.
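
The steps above can be sketched as follows. This is an illustrative reimplementation, not the exact scripts I used: the stopword set here is a tiny sample standing in for the 524 SMART stopwords, and the stemming step is omitted (Porter's algorithm is available, for instance, as nltk.stem.PorterStemmer in NLTK).

```python
import re

# A tiny sample of stopwords, standing in for the 524 SMART stopwords.
SAMPLE_STOPWORDS = {"the", "and", "for", "with", "this", "that", "are", "was"}

def all_terms(text):
    """Steps 1-4: whitespace to SPACE, keep only letters, lowercase, squeeze."""
    text = re.sub(r"[\t\n\r]", " ", text)        # 1. TAB/NEWLINE/RETURN -> SPACE
    text = re.sub(r"[^a-zA-Z]", " ", text)       # 2. keep only letters
    text = text.lower()                          # 3. lowercase
    return re.sub(r" +", " ", text).strip()      # 4. squeeze multiple SPACES

def no_short(text):
    """Remove words fewer than 3 characters long."""
    return " ".join(w for w in text.split() if len(w) >= 3)

def no_stop(text, stopwords=SAMPLE_STOPWORDS):
    """Remove stopwords from an already-cleaned text."""
    return " ".join(w for w in text.split() if w not in stopwords)
```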

Some results

Just to give an idea of the relative hardness of each dataset, I determined the accuracy that some of the most common classification methods achieve on them. As usual, tf-idf term weighting was used to represent the document vectors, which were normalized to unit length. The stemmed train and test sets were used for each dataset.
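
As a concrete illustration of that representation, here is a minimal pure-Python sketch of tf-idf weighting with unit-length normalization. This is one common variant of tf-idf; libraries such as scikit-learn implement smoothed variants that differ in detail, and this is not necessarily the exact formula behind the results below.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of term lists. Returns one unit-length tf-idf dict per doc."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:  # normalize to unit length
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors
```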

The "dumb classifier" is included as a baseline: it ignores the document and always predicts the most frequent class in the training set.
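
A sketch of this baseline and of the accuracy measure:

```python
from collections import Counter

def dumb_classifier(train_labels):
    """Return a classifier that always predicts the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _doc: majority

def accuracy(classifier, test_docs, test_labels):
    """Fraction of test documents whose predicted class matches the true one."""
    hits = sum(classifier(d) == y for d, y in zip(test_docs, test_labels))
    return hits / len(test_labels)
```

For R8, for example, the most frequent training class is earn, and 1083 of the 2189 test documents belong to it, giving 1083/2189 ≈ 0.4947, the value in the table below.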

Accuracy Values
Classification Method R8 R52 20Ng Cade12 WebKB
Dumb classifier 0.4947 0.4217 0.0530 0.2083 0.3897
Vector Method 0.7889 0.7687 0.7240 0.4142 0.6447
kNN (k = 10) 0.8524 0.8322 0.7593 0.5120 0.7256
Centroid (Normalized Sum) 0.9356 0.8717 0.7885 0.5148 0.8266
Naive Bayes 0.9607 0.8692 0.8103 0.5727 0.8352
SVM (Linear Kernel) 0.9698 0.9377 0.8284 0.5284 0.8582

Note that, because R8, R52, and WebKB are very skewed, the dumb classifier has a ``reasonable'' performance on these datasets. Also, while for R8, R52, 20Ng, and WebKB it is possible to find good classifiers, that is, classifiers that achieve high accuracy, for Cade12 the best we can get does not reach 58% accuracy, even with some of the best classifiers available.

Attachments (all uploaded by Ana Cardoso Cachopo, 30 Apr 2015):

20ng-test-all-terms.txt (10555k)
20ng-test-no-short.txt (9335k)
20ng-test-no-stop.txt (7028k)
20ng-test-stemmed.txt (6275k)
20ng-train-all-terms.txt (16292k)
20ng-train-no-short.txt (14395k)
20ng-train-no-stop.txt (10847k)
20ng-train-stemmed.txt (9684k)
cade-test-stemmed.txt (11932k)
mini20-test.txt (816k)
mini20-train.txt (1534k)
r52-test-all-terms.txt (1487k)
r52-test-no-short.txt (1347k)
r52-test-no-stop.txt (1120k)
r52-test-stemmed.txt (988k)
r52-train-all-terms.txt (4181k)
r52-train-no-short.txt (3798k)
r52-train-no-stop.txt (3152k)
r52-train-stemmed.txt (2775k)
r8-test-all-terms.txt (1167k)
r8-test-no-short.txt (1055k)
r8-test-no-stop.txt (884k)
r8-test-stemmed.txt (780k)
r8-train-all-terms.txt (3276k)
r8-train-no-short.txt (2972k)
r8-train-no-stop.txt (2478k)
r8-train-stemmed.txt (2185k)
webkb-test-stemmed.txt (1271k)
webkb-train-stemmed.txt (2491k)