Language Model

Word Language Model

Reference: Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018

The key features used to reproduce the results for the pre-trained models are listed in the following table.

The dataset used for training the models is wikitext-2.

| Model | awd_lstm_lm_1150_wikitext-2 | awd_lstm_lm_600_wikitext-2 | standard_lstm_lm_1500_wikitext-2 | standard_lstm_lm_650_wikitext-2 | standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Mode | LSTM | LSTM | LSTM | LSTM | LSTM |
| Num_layers | 3 | 3 | 2 | 2 | 2 |
| Embed size | 400 | 200 | 1500 | 650 | 200 |
| Hidden size | 1150 | 600 | 1500 | 650 | 200 |
| Dropout | 0.4 | 0.2 | 0.65 | 0.5 | 0.2 |
| Dropout_h | 0.2 | 0.1 | 0 | 0 | 0 |
| Dropout_i | 0.65 | 0.3 | 0 | 0 | 0 |
| Dropout_e | 0.1 | 0.05 | 0 | 0 | 0 |
| Weight_drop | 0.5 | 0.2 | 0 | 0 | 0 |
| Val PPL | 71.78 | 80.11 | 86.28 | 91.30 | 108.17 |
| Test PPL | 68.55 | 76.14 | 81.99 | 85.82 | 102.49 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | log | log | log | log | log |

For all of the above model settings, we set Tied = True and NTASGD = True.

[1] awd_lstm_lm_1150_wikitext-2 (Val PPL 71.78 Test PPL 68.55 )

$ python word_language_model.py --gpu 0 --tied --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save awd_lstm_lm_1150_wikitext-2

[2] awd_lstm_lm_600_wikitext-2 (Val PPL 80.11 Test PPL 76.14)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 600 --epochs 750 --dropout 0.2 --dropout_h 0.1 --dropout_i 0.3 --dropout_e 0.05 --weight_drop 0.2 --tied --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save awd_lstm_lm_600_wikitext-2

[3] standard_lstm_lm_1500_wikitext-2 (Val PPL 86.28 Test PPL 81.99)

$ python word_language_model.py --gpu 0 --emsize 1500 --nhid 1500 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.65 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save standard_lstm_lm_1500_wikitext-2

[4] standard_lstm_lm_650_wikitext-2 (Val PPL 91.30 Test PPL 85.82)

$ python word_language_model.py --gpu 0 --emsize 650 --nhid 650 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.5 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save standard_lstm_lm_650_wikitext-2

[5] standard_lstm_lm_200_wikitext-2 (Val PPL 108.17 Test PPL 102.49)

$ python word_language_model.py --gpu 0 --emsize 200 --nhid 200 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.2 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save standard_lstm_lm_200_wikitext-2
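The same checkpoints can also be loaded for inference. The snippet below is a minimal sketch, assuming the models are published in the GluonNLP model zoo under the names in the table above (without the -wikitext-2 suffix) and accessible through the gluonnlp.model.get_model API; adjust the model name and context as needed.

import mxnet as mx
import gluonnlp as nlp
# Assumption: the pre-trained weights are available under this model-zoo name.
model, vocab = nlp.model.get_model('awd_lstm_lm_1150', dataset_name='wikitext-2', pretrained=True, ctx=mx.cpu())
sentence = ['the', 'quick', 'brown', 'fox']
# RNN language models expect inputs of shape (sequence_length, batch_size).
data = mx.nd.array(vocab[sentence]).reshape(-1, 1)
hidden = model.begin_state(func=mx.nd.zeros, batch_size=1, ctx=mx.cpu())
output, hidden = model(data, hidden)
print(output.shape)  # (sequence_length, batch_size, vocab_size) logits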

Cache Language Model

Reference: Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017

The key features used to reproduce the results based on the corresponding pre-trained models are listed in the following table.

The dataset used for training the models is wikitext-2.

| Model | cache_awd_lstm_lm_1150_wikitext-2 | cache_awd_lstm_lm_600_wikitext-2 | cache_standard_lstm_lm_1500_wikitext-2 | cache_standard_lstm_lm_650_wikitext-2 | cache_standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Pre-trained setting | Refer to: awd_lstm_lm_1150_wikitext-2 | Refer to: awd_lstm_lm_600_wikitext-2 | Refer to: standard_lstm_lm_1500_wikitext-2 | Refer to: standard_lstm_lm_650_wikitext-2 | Refer to: standard_lstm_lm_200_wikitext-2 |
| Val PPL | 58.18 | 64.09 | 73.19 | 69.27 | 81.68 |
| Test PPL | 56.08 | 61.62 | 70.91 | 66.39 | 77.83 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | log | log | log | log | log |

For all of the above model settings, we set lambdas = 0.1279, theta = 0.662, window = 2000, and bptt = 2000.

[1] cache_awd_lstm_lm_1150_wikitext-2 (Val PPL 58.18 Test PPL 56.08)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_1150

[2] cache_awd_lstm_lm_600_wikitext-2 (Val PPL 64.09 Test PPL 61.62)

$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_600

[3] cache_standard_lstm_lm_1500_wikitext-2 (Val PPL 73.19 Test PPL 70.91)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_1500

[4] cache_standard_lstm_lm_650_wikitext-2 (Val PPL 69.27 Test PPL 66.39)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_650

[5] cache_standard_lstm_lm_200_wikitext-2 (Val PPL 81.68 Test PPL 77.83)

$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_200
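For reference, the lambdas, theta, and window hyperparameters above correspond to the continuous-cache interpolation of Grave et al. (2017). The following is a simplified numpy sketch of that interpolation only; the function and variable names are made up for illustration and it is not the exact code in cache_language_model.py.

import numpy as np
def cache_interpolate(p_vocab, hidden, cache_hiddens, cache_words, vocab_size, lambdas=0.1279, theta=0.662):
    # Unnormalized cache scores: similarity between the current hidden state
    # and each of the `window` most recent hidden states, scaled by theta.
    scores = np.exp(theta * cache_hiddens.dot(hidden))
    p_cache = np.zeros(vocab_size)
    for score, word in zip(scores, cache_words):
        p_cache[word] += score  # mass goes to the word that followed each cached state
    p_cache /= max(p_cache.sum(), 1e-12)
    # Linear interpolation between the model's softmax distribution and the cache.
    return (1.0 - lambdas) * p_vocab + lambdas * p_cache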

Large Scale Word Language Model

Reference: Jozefowicz, Rafal, et al. “Exploring the limits of language modeling”. arXiv preprint arXiv:1602.02410 (2016).

The key features used to reproduce the results for the pre-trained model are listed in the following table.

The dataset used for training the model is Google's 1 Billion Word dataset.

| Model | LSTM-2048-512 |
|---|---|
| Mode | LSTMP |
| Num layers | 1 |
| Embed size | 512 |
| Hidden size | 2048 |
| Projection size | 512 |
| Dropout | 0.1 |
| Learning rate | 0.2 |
| Num samples | 8192 |
| Batch size | 128 |
| Gradient clip | 10.0 |
| Test perplexity | 43.62 |
| Num epochs | 50 |
| Training logs | log |
| Evaluation logs | log |

[1] LSTM-2048-512 (Test PPL 43.62)

$ python large_word_language_model.py --gpus 0,1,2,3 --clip=10
$ python large_word_language_model.py --gpus 4 --eval-only --batch-size=1
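The Num samples row above refers to the sampled softmax used to cope with the very large vocabulary of the 1 Billion Word dataset (Jozefowicz et al., 2016). The snippet below is only a rough numpy sketch of the idea, with hypothetical names; the actual training uses a log-uniform candidate sampler and corrects the logits by the sampling probabilities.

import numpy as np
def sampled_softmax_loss(hidden, target, weight, bias, num_samples=8192, rng=np.random.default_rng(0)):
    vocab_size = weight.shape[0]
    # Draw negative classes uniformly (a real implementation samples from a
    # log-uniform / Zipfian distribution and corrects for it).
    negatives = rng.integers(0, vocab_size, size=num_samples)
    classes = np.concatenate(([target], negatives))
    # Compute logits only for the true class (index 0) and the sampled negatives.
    logits = weight[classes].dot(hidden) + bias[classes]
    logits -= logits.max()  # numerical stability
    # Cross-entropy over the reduced class set.
    return -(logits[0] - np.log(np.exp(logits).sum()))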

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Reference: Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” arXiv preprint arXiv:1906.08237 (2019).

The following pre-trained XLNet models are available from the get_model API:

- xlnet_cased_l12_h768_a12
- xlnet_cased_l24_h1024_a16

Both are provided with dataset_name = 126gb, where 126gb refers to the 126 GB training dataset used by the XLNet paper authors.

import mxnet as mx
import gluonnlp as nlp
from transformer import get_model, XLNetTokenizer
# Load the pre-trained 12-layer XLNet model together with its vocabulary and tokenizer.
model, vocab, tokenizer = get_model('xlnet_cased_l12_h768_a12', dataset_name='126gb', use_decoder=True)
# Tokenize the input sentence and map the tokens to vocabulary indices (batch size 1).
indices = mx.nd.array([vocab.to_indices(tokenizer('Hello world'))])
token_types = mx.nd.ones_like(indices)
# Initialize the segment-level recurrence memory and run a forward pass.
mems = model.begin_mems(batch_size=1, mem_len=500, context=indices.context)
output, new_mems = model(indices, token_types, mems)

Sentence Classification

GluonNLP provides the following example script to fine-tune sentence classification with a pre-trained XLNet model.

Results using xlnet_12_768_12:

| Task Name | Metrics | Results on Dev Set | log | command |
|---|---|---|---|---|
| CoLA | Matthew Corr. | 59.33 | log | command |
| SST-2 | Accuracy | 94.61 | log | command |
| MRPC | Accuracy/F1 | 89.22/92.20 | log | command |
| STS-B | Pearson Corr. | 89.34 | log | command |
| QQP | Accuracy | 91.31 | log | command |
| MNLI | Accuracy(m/mm) | 87.19/86.45 | log | command |
| QNLI | Accuracy | 88 | log | command |
| RTE | Accuracy | 75.09 | log | command |

Results using xlnet_24_1024_16 (we followed the hyperparameters reported by the paper authors):

| Task Name | Metrics | Results on Dev Set | log | command |
|---|---|---|---|---|
| CoLA | Matthew Corr. | 67 | log | command |
| SST-2 | Accuracy | 94 | log | command |
| MRPC | Accuracy/F1 | 90.2/93 | log | command |
| STS-B | Pearson Corr. | 91.37 | log | command |
| QQP | Accuracy | 91.94 | log | command |
| MNLI | Accuracy(m/mm) | 89.93/89.91 | log | command |
| RTE | Accuracy | 84.12 | log | command |

Question Answering on SQuAD

| Dataset | Model | EM / F1 | Log | Command | Prediction |
|---|---|---|---|---|---|
| SQuAD 1.1 | xlnet_12_768_12 | 85.50 / 91.77 | log | command | predictions.json |
| SQuAD 1.1 | xlnet_24_1024_16 | 89.08 / 94.52 | log | command | predictions.json |
| SQuAD 2.0 | xlnet_12_768_12 | 80.47 / 83.22 | log | command | predictions.json, null_odds.json |
| SQuAD 2.0 | xlnet_24_1024_16 | 86.08 / 86.69 | log | command | predictions.json, null_odds.json |

For xlnet_24_1024_16, we used the hyperparameters reported by the paper authors.

To get the scores on the dev data, you need to download the evaluation script (evaluate-v2.0.py). You can either put the script in the same folder as run_squad.py so that our script runs it automatically, or run it manually. To run the evaluation script yourself, use the following commands:

SQuAD 1.1:

$ python evaluate-v2.0.py dev-v1.1.json predictions.json

SQuAD 2.0:

$ python evaluate-v2.0.py dev-v2.0.json predictions.json --na-prob-file null_odds.json