[JZ15a] reports two types of experiments using RCV1: single-label categorization with 55 classes (Table 2 and Fig.6) and multi-label categorization with 103 classes (Table 4).
Single-label categorization (55 classes):
The single-label experiments used the 55 second-level topics of the RCV1-v2 label assignment in [LYRL04] and excluded documents with multiple labels; see Section 3.4 of [JZ15a] for details. The training sets and the test set used in the single-label experiments are as follows.
Name | Usage | Date of articles | Number of articles |
---|---|---|---|
1m | Training set in Table 2; 3rd largest training set in Fig.6 | Aug'96 | 15,564 |
test | Test set in Table 2 and Fig.6 | July'97 | 49,838 |
2m | 2nd largest training set in Fig.6 | Aug'96-Sept'96 | 53,690 |
3m | Largest training set in Fig.6 | Aug'96-Oct'96 | 100,899 |
The smaller training sets used in Fig.6 (r02k, r03k, r04k, r05k, and r10k, named according to their size) were randomly chosen subsets of the Aug'96 articles.
We cannot provide the text files of these sets for copyright reasons. Instead, we provide the article IDs and labels (one of the 55 second-level topics according to the RCV1-v2 labeling [LYRL04]) of these sets. This archive (click here to download) contains files whose names are of the form rcv1_name.{id|lvl2}, where name is one of the names above, such as "1m" or "r02k". An article ID, e.g., 771567, appears in the XML file on the RCV1 CD (obtained from NIST) as in <newsitem itemid="771567" ... >. It is also used as part of the filename, as in "771567newsML.xml".
[JZ15b] also reports two types of experiments using RCV1: single-label categorization with 55 classes (Table 3 and Fig.3-4) and multi-label categorization with 103 classes (Table 6).
Single-label categorization (55 classes):
The 55-class single-label categorization experiments reported in [JZ16] used the same data as the single-label experiments in [JZ15b].
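As a rough illustration of how the article ID and label files described for the [JZ15a] single-label sets might be used, the Python sketch below pairs the entries of rcv1_name.id and rcv1_name.lvl2 and derives the corresponding XML filename on the RCV1 CD. It assumes that each file holds one entry per line and that the two files are line-aligned; this layout is not stated above and should be checked against the archive. The function names (load_split, xml_filename, itemid_from_xml) are hypothetical helpers, not part of any released tool.

```python
import re
from pathlib import Path

# Matches the itemid attribute of a <newsitem ...> tag, e.g. itemid="771567".
ITEMID_RE = re.compile(r'<newsitem\b[^>]*\bitemid="(\d+)"')


def read_lines(path):
    """Return the non-empty, stripped lines of a text file."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def load_split(archive_dir, name):
    """Pair article IDs with labels for one set, e.g. name="1m" or "r02k".

    ASSUMPTION: rcv1_<name>.id and rcv1_<name>.lvl2 each contain one entry
    per line, in the same order.
    """
    base = Path(archive_dir)
    ids = read_lines(base / f"rcv1_{name}.id")
    labels = read_lines(base / f"rcv1_{name}.lvl2")
    if len(ids) != len(labels):
        raise ValueError(f"{name}: {len(ids)} IDs but {len(labels)} labels")
    return list(zip(ids, labels))


def xml_filename(article_id):
    """Build the filename used on the RCV1 CD, e.g. 771567newsML.xml."""
    return f"{article_id}newsML.xml"


def itemid_from_xml(xml_text):
    """Extract the article ID from the <newsitem itemid="..."> tag, or None."""
    match = ITEMID_RE.search(xml_text)
    return match.group(1) if match else None
```

With the archive unpacked into some directory, load_split(directory, "1m") would return (article ID, label) pairs, and xml_filename applied to each ID gives the file to look up on the CD. The length check is there to catch a misaligned pair of files early.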