Language Model
Word Language Model
Reference: Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018
The key features used to reproduce the results for the pre-trained models are listed in the following table.
The dataset used for training the models is wikitext-2.
| Model | awd_lstm_lm_1150_wikitext-2 | awd_lstm_lm_600_wikitext-2 | standard_lstm_lm_1500_wikitext-2 | standard_lstm_lm_650_wikitext-2 | standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Mode | LSTM | LSTM | LSTM | LSTM | LSTM |
| Num layers | 3 | 3 | 2 | 2 | 2 |
| Embed size | 400 | 200 | 1500 | 650 | 200 |
| Hidden size | 1150 | 600 | 1500 | 650 | 200 |
| Dropout | 0.4 | 0.2 | 0.65 | 0.5 | 0.2 |
| Dropout_h | 0.2 | 0.1 | 0 | 0 | 0 |
| Dropout_i | 0.65 | 0.3 | 0 | 0 | 0 |
| Dropout_e | 0.1 | 0.05 | 0 | 0 | 0 |
| Weight_drop | 0.5 | 0.2 | 0 | 0 | 0 |
| Val PPL | 71.78 | 80.11 | 86.28 | 91.30 | 108.17 |
| Test PPL | 68.55 | 76.14 | 81.99 | 85.82 | 102.49 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | | | | | |
For all of the above model settings, we set Tied = True and NTASGD = True.
[1] awd_lstm_lm_1150_wikitext-2 (Val PPL 71.78 Test PPL 68.55)
$ python word_language_model.py --gpu 0 --tied --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save awd_lstm_lm_1150_wikitext-2
[2] awd_lstm_lm_600_wikitext-2 (Val PPL 80.11 Test PPL 76.14)
$ python word_language_model.py --gpu 0 --emsize 200 --nhid 600 --epochs 750 --dropout 0.2 --dropout_h 0.1 --dropout_i 0.3 --dropout_e 0.05 --weight_drop 0.2 --tied --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save awd_lstm_lm_600_wikitext-2
[3] standard_lstm_lm_1500_wikitext-2 (Val PPL 86.28 Test PPL 81.99)
$ python word_language_model.py --gpu 0 --emsize 1500 --nhid 1500 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.65 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save standard_lstm_lm_1500_wikitext-2
[4] standard_lstm_lm_650_wikitext-2 (Val PPL 91.30 Test PPL 85.82)
$ python word_language_model.py --gpu 0 --emsize 650 --nhid 650 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.5 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save standard_lstm_lm_650_wikitext-2
[5] standard_lstm_lm_200_wikitext-2 (Val PPL 108.17 Test PPL 102.49)
$ python word_language_model.py --gpu 0 --emsize 200 --nhid 200 --nlayers 2 --lr 20 --epochs 750 --batch_size 20 --bptt 35 --dropout 0.2 --dropout_h 0 --dropout_i 0 --dropout_e 0 --weight_drop 0 --tied --wd 0 --alpha 0 --beta 0 --ntasgd --lr_update_interval 30 --lr_update_factor 0.1 --save standard_lstm_lm_200_wikitext-2
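The checkpoints produced by the commands above correspond to the pre-trained models in the gluonnlp model zoo. The snippet below is a minimal sketch of loading one of them, assuming the standard nlp.model.get_model API and the zoo name awd_lstm_lm_1150 (the other variants follow the same pattern).
import mxnet as mx
import gluonnlp as nlp

# Load the pre-trained AWD-LSTM checkpoint (3 layers, 400-d embeddings,
# 1150-d hidden states, as in the first column of the table above).
model, vocab = nlp.model.get_model('awd_lstm_lm_1150',
                                   dataset_name='wikitext-2',
                                   pretrained=True,
                                   ctx=mx.cpu())
print(model)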
Cache Language Model
Reference: Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017
The key features used to reproduce the results with the corresponding pre-trained models are listed in the following table.
The dataset used for training the models is wikitext-2.
| Model | cache_awd_lstm_lm_1150_wikitext-2 | cache_awd_lstm_lm_600_wikitext-2 | cache_standard_lstm_lm_1500_wikitext-2 | cache_standard_lstm_lm_650_wikitext-2 | cache_standard_lstm_lm_200_wikitext-2 |
|---|---|---|---|---|---|
| Pre-trained setting | Refer to: awd_lstm_lm_1150_wikitext-2 | Refer to: awd_lstm_lm_600_wikitext-2 | Refer to: standard_lstm_lm_1500_wikitext-2 | Refer to: standard_lstm_lm_650_wikitext-2 | Refer to: standard_lstm_lm_200_wikitext-2 |
| Val PPL | 58.18 | 64.09 | 73.19 | 69.27 | 81.68 |
| Test PPL | 56.08 | 61.62 | 70.91 | 66.39 | 77.83 |
| Command | [1] | [2] | [3] | [4] | [5] |
| Training logs | | | | | |
For all of the above model settings, we set lambdas = 0.1279, theta = 0.662, window = 2000, and bptt = 2000.
[1] cache_awd_lstm_lm_1150_wikitext-2 (Val PPL 58.18 Test PPL 56.08)
$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_1150
[2] cache_awd_lstm_lm_600_wikitext-2 (Val PPL 64.09 Test PPL 61.62)
$ python cache_language_model.py --gpus 0 --model_name awd_lstm_lm_600
[3] cache_standard_lstm_lm_1500_wikitext-2 (Val PPL 73.19 Test PPL 70.91)
$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_1500
[4] cache_standard_lstm_lm_650_wikitext-2 (Val PPL 69.27 Test PPL 66.39)
$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_650
[5] cache_standard_lstm_lm_200_wikitext-2 (Val PPL 81.68 Test PPL 77.83)
$ python cache_language_model.py --gpus 0 --model_name standard_lstm_lm_200
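The cache model itself involves no further training: at evaluation time it interpolates the pre-trained model's softmax with a distribution built from the last window hidden states, following Grave et al. (2017). The NumPy sketch below is illustrative only (it is not the cache_language_model.py implementation) and uses the lambdas and theta values listed above.
import numpy as np

def cache_interpolate(p_vocab, h_t, cache_h, cache_words,
                      theta=0.662, lambdas=0.1279):
    # p_vocab: softmax over the vocabulary from the pre-trained LM, shape (V,)
    # h_t: current hidden state, shape (d,)
    # cache_h: hidden states from the last `window` time steps, shape (window, d)
    # cache_words: word ids observed after each cached state, shape (window,)
    scores = np.exp(theta * cache_h.dot(h_t))    # similarity to cached states
    p_cache = np.zeros_like(p_vocab)
    np.add.at(p_cache, cache_words, scores)      # accumulate scores per word id
    p_cache /= p_cache.sum()
    # Linear interpolation between the parametric LM and the cache distribution.
    return (1.0 - lambdas) * p_vocab + lambdas * p_cache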
Large Scale Word Language Model
Reference: Jozefowicz, Rafal, et al. “Exploring the limits of language modeling”. arXiv preprint arXiv:1602.02410 (2016).
The key features used to reproduce the results for the pre-trained model are listed in the following table.
The dataset used for training the model is Google's 1 Billion Word dataset.
| Model | LSTM-2048-512 |
|---|---|
| Mode | LSTMP |
| Num layers | 1 |
| Embed size | 512 |
| Hidden size | 2048 |
| Projection size | 512 |
| Dropout | 0.1 |
| Learning rate | 0.2 |
| Num samples | 8192 |
| Batch size | 128 |
| Gradient clip | 10.0 |
| Test perplexity | 43.62 |
| Num epochs | 50 |
| Training logs | |
| Evaluation logs | |
[1] LSTM-2048-512 (Test PPL 43.62)
$ python large_word_language_model.py --gpus 0,1,2,3 --clip=10
$ python large_word_language_model.py --gpus 4 --eval-only --batch-size=1
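The "Num samples = 8192" row refers to the number of candidates used by the sampled softmax, which avoids normalizing over the full vocabulary at every step. The sketch below only illustrates the idea; the actual sampler and correction used by large_word_language_model.py may differ, and the log-uniform proposal here is an assumption.
import numpy as np

def sampled_softmax_loss(h, W, b, true_id, num_samples=8192, rng=None):
    # h: hidden state, shape (d,); W: output weights, shape (V, d); b: bias, shape (V,)
    rng = rng or np.random.default_rng()
    V = W.shape[0]
    # Log-uniform (Zipfian) proposal over word ids 0..V-1 (sums to 1).
    q = (np.log(np.arange(V) + 2) - np.log(np.arange(V) + 1)) / np.log(V + 1)
    sampled = rng.choice(V, size=num_samples, replace=False, p=q)
    ids = np.concatenate(([true_id], sampled))
    # Logits for the candidate set only, corrected by -log q so that the
    # sampled objective approximates the full softmax.
    logits = W[ids].dot(h) + b[ids] - np.log(q[ids])
    logits -= logits.max()                               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                                 # true word is at index 0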
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Reference: Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” arXiv preprint arXiv:1906.08237 (2019).
The following pre-trained XLNet models are available from the get_model API:
|  | xlnet_cased_l12_h768_a12 | xlnet_cased_l24_h1024_a16 |
|---|---|---|
| 126gb | ✓ | ✓ |
where 126gb refers to the 126 GB training dataset used by the XLNet paper authors.
import mxnet as mx
import gluonnlp as nlp
from transformer import get_model, XLNetTokenizer

# Load the 12-layer XLNet model together with its vocabulary and tokenizer.
model, vocab, tokenizer = get_model('xlnet_cased_l12_h768_a12', dataset_name='126gb', use_decoder=True)
# Tokenize a sentence and map the tokens to vocabulary indices (batch size 1).
indices = mx.nd.array([vocab.to_indices(tokenizer('Hello world'))])
token_types = mx.nd.ones_like(indices)
# Initialize an empty memory of length 500 and run a forward pass;
# new_mems caches hidden states that can be reused for the next segment.
mems = model.begin_mems(batch_size=1, mem_len=500, context=indices.context)
output, new_mems = model(indices, token_types, mems)
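The memory returned by the forward pass can be fed back in for the next segment, which is how the model attends to context beyond a single input. The lines below are an illustrative continuation of the snippet above (the second sentence is arbitrary), reusing the names already defined.
# Encode a second segment while attending to the cached states of the first.
next_indices = mx.nd.array([vocab.to_indices(tokenizer('How are you'))])
next_token_types = mx.nd.ones_like(next_indices)
next_output, next_mems = model(next_indices, next_token_types, new_mems)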
Sentence Classification
GluonNLP provides the following example script to fine-tune sentence classification with a pre-trained XLNet model.
Results using xlnet_12_768_12:
| Task Name | Metrics | Results on Dev Set | log | command |
|---|---|---|---|---|
| CoLA | Matthews Corr. | 59.33 | | |
| SST-2 | Accuracy | 94.61 | | |
| MRPC | Accuracy/F1 | 89.22/92.20 | | |
| STS-B | Pearson Corr. | 89.34 | | |
| QQP | Accuracy | 91.31 | | |
| MNLI | Accuracy (m/mm) | 87.19/86.45 | | |
| QNLI | Accuracy | 88 | | |
| RTE | Accuracy | 75.09 | | |
Results using xlnet_24_1024_16, following the hyperparameters reported by the paper authors:
| Task Name | Metrics | Results on Dev Set | log | command |
|---|---|---|---|---|
| CoLA | Matthews Corr. | 67 | | |
| SST-2 | Accuracy | 94 | | |
| MRPC | Accuracy/F1 | 90.2/93 | | |
| STS-B | Pearson Corr. | 91.37 | | |
| QQP | Accuracy | 91.94 | | |
| MNLI | Accuracy (m/mm) | 89.93/89.91 | | |
| RTE | Accuracy | 84.12 | | |
Question Answering on SQuAD
| Dataset | SQuAD 1.1 | SQuAD 1.1 | SQuAD 2.0 | SQuAD 2.0 |
|---|---|---|---|---|
| Model | xlnet_12_768_12 | xlnet_24_1024_16 | xlnet_12_768_12 | xlnet_24_1024_16 |
| EM / F1 | 85.50 / 91.77 | 89.08 / 94.52 | 80.47 / 83.22 | 86.08 / 86.69 |
| Log | | | | |
| Command | | | | |
| Prediction | | | | |
For xlnet_24_1024_16, we used the hyperparameters reported by the paper authors.
To get the score on the dev data, you need to download the evaluation script (evaluate-v2.0.py). You can either put the script in the same folder as run_squad.py so that our script runs it automatically, or run it manually yourself with the following commands:
SQuAD 1.1:
$ python evaluate-v2.0.py dev-v2.0.json predictions.json
SQuAD 2.0:
$ python evaluate-v2.0.py dev-v2.0.json predictions.json --na-prob-file null_odds.json