CONTEXT v3: Convolutional neural networks and LSTM for text categorization in C++ on GPU

Updated: June 1, 2017.    Latest version: CONTEXT v3.00 (May 26, 2016).    Next version: CONTEXT v4.00 (late July 2017)
CONTEXT provides (and will provide) implementations of the following types of neural networks for text categorization: supervised CNNs [JZ15a], semi-supervised CNNs [JZ15b], supervised and semi-supervised LSTMs [JZ16a], and deep pyramid CNNs [JZ17].

Looking for a tool?

NOTE1: The code runs only on a GPU (graphics processing unit); your system must have a GPU to run it.
NOTE2: The code was tested on Linux with gcc. In principle, it should compile and run on other systems (e.g., Windows) as well, provided that the prerequisites are installed (see README for details), but there is no guarantee. Some Windows users have kindly reported that it works on their Windows systems.

Download

Documentation

The documentation is available in PDF and HTML.

Getting started

  1. Download the code, extract the files, and read README.
  2. Go to the top directory and build the executables by entering make.
  3. Go to sample/ and enter ./sample.sh, which trains and tests a CNN on small data to confirm the installation.

    NOTE: If the compute capability of your GPU card is higher than 3.5 (e.g., the Maxwell or Pascal architecture), sample.sh will probably result in either an error (e.g., invalid device function) or an error rate much worse than 0.1725. If this happens, modify makefile to include the compute capability of your GPU (e.g., if it is 5.2, add -gencode arch=compute_52,code=sm_52), rebuild the executables by entering make clean and then make, and retry sample.sh. The compute capability of your card is listed in Wikipedia's CUDA article; it can also be found by entering gpuDevice in MATLAB.

  4. To get started with ...
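As a sketch of the makefile change described in the note under step 3, assuming a GPU of compute capability 5.2 (adjust both numbers for your card; the variable name NVCCFLAGS is illustrative and may differ in the actual makefile):

```makefile
# Hypothetical fragment of the CONTEXT makefile: append a -gencode pair
# matching your GPU's compute capability (here 5.2) to the nvcc flags.
NVCCFLAGS += -gencode arch=compute_52,code=sm_52
```

After editing, rebuild with make clean followed by make, then rerun sample.sh.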

To reproduce the experiments in the papers

Looking for a baseline?    Note that if a supervised CNN of [JZ15a] outperforms, say, a CNN using word vectors pre-trained on additional unlabeled data, this is rather surprising (though possible), since the word-vector CNN is empowered by a large amount of extra information from the additional unlabeled data while the supervised CNN is not. A more 'apples-to-apples' comparison would be the CNN with pre-trained word vectors vs. a semi-supervised CNN as in [JZ15b] or [JZ17].

Data Source

The data files in the code/data archives were derived from the Large Movie Review Dataset (IMDB) [MDPHN11] and Amazon reviews [ML13].

License

This program is free software released under the GNU General Public License v3.

References

[JZ17] Rie Johnson and Tong Zhang. Deep pyramid convolutional neural networks for text categorization. To appear in ACL 2017.
[JZ16b] Rie Johnson and Tong Zhang. Convolutional neural networks for text categorization: shallow word-level vs. deep character-level. arXiv:1609.00718, 2016.
[JZ16a] Rie Johnson and Tong Zhang. Supervised and semi-supervised text categorization using LSTM for region embeddings. ICML 2016.
[JZ15b] Rie Johnson and Tong Zhang. Semi-supervised convolutional neural networks for text categorization via region embedding. NIPS 2015.
[JZ15a] Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convolutional neural networks. NAACL-HLT 2015.
[ML13] Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys 2013.
[MDPHN11] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. ACL 2011.