Welcome to caver’s doc!

Caver is a multilabel text classification toolkit based on Pytorch.

Raising a torch to explore the cave.

Why we need this?

Multilabel text classification is widely used in text-based recommendation system, anti-spam, sentiment analysis and so on. Some basic machine learning methods like SVM, Naive bayes can help us to find the category of text, but sentences usually belong to multi labels which is hard to extract all of them. Now we can use deep learning models to deal with it.

Features

  • Train text classifier

  • Offer CNN, LSTM, HAN, SWEN etc.

  • Get the feature vector of text

  • Get the most possible labels for text

  • Ensemble (soft voting)

Version

  • Python 3.5

  • Pytorch 0.4

Tutorail

Before train the model, you need to prepare text data in the format like:

__label__animal __label__sports The quick brown fox jumps over the lazy dog

Each line of text file cantains a list of labels and words separated by space. Each label is start with ‘__label__’.

If you use Chinese, please segment the sentence into words or single word first, depends on which format you want.

__label__animal __label__sports 狐狸  狗子   跳山羊

Train

>>> from caver import Trainer
>>> t = Trainer(
>>>     'CNN',
>>>     'data_path',
>>>     ...... # kwargs will update the default value in config
>>> )
>>> t.train()

Classify

>>> from caver import Caver
>>> cnn = Caver('CNN', 'CNN_model.pth', 'data_path')

# predict
>>> cnn.predict('The quick brown fox jumps over the lazy dog')
[1.0, 0.33, 0.15, 0.002, ......]

# get top label
>>> cnn.get_top_label('The quick brown fox jumps over the lazy dog')
[['__label__animal', '__label__sports', ......],
[1.0, 0.35, ......]]

# ensemble
>>> from caver import Ensemble
>>> lstm = Caver('LSTM', 'LSTM_model.pth', 'data_path')
>>> model = Ensemble([cnn, lstm])

>>> model.predict('The quick brown fox jumps over the lazy dog', 'log')
>>> model.get_top_label('The quick brown fox jumps over the lazy dog', 'avg')

Indices and tables