LSTM-based Language Models¶
A statistical language model is simply a probability distribution over sequences of words or characters [1]. In this tutorial, we’ll restrict our attention to word-based language models. Given a reliable language model, we can answer questions like which among the following strings are we more likely to encounter?
“On Monday, Mr. Lamar’s ‘DAMN.’ took home an even more elusive honor, one that may never have even seemed within reach: the Pulitzer Prize.”
“Frog zealot flagged xylophone the bean wallaby anaphylaxis extraneous porpoise into deleterious carrot banana apricot.”
Even if we’ve never seen either of these sentences in our entire lives, and even though no rapper has previously been awarded a Pulitzer Prize, we wouldn’t be shocked to see the first sentence in the New York Times. By comparison, we can all agree that the second sentence, consisting of incoherent babble, is comparatively unlikely. A statistical language model can assign precise probabilities to each of these and other strings of words.
Given a large corpus of text, we can estimate (or, in this case, train) a language model \(\hat{p}(x_1, ..., x_n)\). And given such a model, we can sample strings \(\mathbf{x} \sim \hat{p}(x_1, ..., x_n)\), generating new strings according to their estimated probability. Among other useful applications, we can use language models to score candidate transcriptions from speech recognition models, giving preference to sentences that seem more probable (at the expense of those deemed anomalous).
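As a rough illustration of scoring candidate transcriptions, the sketch below ranks two candidates by their log-probability under a toy bigram table. The table `BIGRAM_PROB`, the `FLOOR` value, and the helper `log_prob` are all hypothetical and stand in for a trained language model; none of them come from GluonNLP.

```python
import math

# Hypothetical bigram probabilities p(next | previous); not estimated from real data.
BIGRAM_PROB = {
    ("recognize", "speech"): 0.4,
    ("wreck", "a"): 0.05,
    ("a", "nice"): 0.1,
    ("nice", "beach"): 0.02,
}
FLOOR = 1e-4  # assumed probability for word pairs missing from the table


def log_prob(words):
    """Sum of log-probabilities over consecutive word pairs."""
    return sum(math.log(BIGRAM_PROB.get(pair, FLOOR))
               for pair in zip(words, words[1:]))


candidates = [
    ["recognize", "speech"],
    ["wreck", "a", "nice", "beach"],
]
# Prefer the candidate that the (toy) language model finds more probable.
best = max(candidates, key=log_prob)
print(" ".join(best))
```

A real system would typically combine this score with the acoustic model's score rather than use it alone.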
Recurrent neural networks (RNNs) are a widely used approach to language modeling. In this notebook, we will go through an example of using GluonNLP to:
implement a typical LSTM language model architecture
train the language model on a corpus of real data
bring in your own dataset for training
grab off-the-shelf pre-trained language models (e.g., the AWD LSTM language model) using GluonNLP.
What is a language model (LM)?¶
The standard approach to language modeling consists of training a model that, given a trailing window of text, predicts the next word in the sequence. When we train the model, we feed in the inputs \(x_1, x_2, ..., x_n\) and try at each time step to predict the corresponding next word \(x_2, ..., x_{n+1}\). To generate text from a language model, we can iteratively predict the next word, and then feed this word as an input to the model at the subsequent time step. The image included below demonstrates this idea.
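The generation loop described above can be sketched with a toy bigram table standing in for the LSTM. Everything here (`BIGRAMS`, `next_word_distribution`, `generate`) is hypothetical and exists only to show the feed-the-prediction-back-in pattern; it is not GluonNLP code.

```python
import random

# Hypothetical toy "model": next-word probabilities given only the previous word.
BIGRAMS = {
    "<bos>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eos>": 1.0},
}


def next_word_distribution(prev_word):
    """Return the toy model's distribution over the next word."""
    return BIGRAMS[prev_word]


def generate(max_len=10, seed=0):
    """Sample words one at a time, feeding each sample back in as the next input."""
    rng = random.Random(seed)
    words = ["<bos>"]
    while len(words) <= max_len and words[-1] != "<eos>":
        dist = next_word_distribution(words[-1])
        choices, weights = zip(*dist.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return words[1:]  # drop the artificial start token


sentence = generate()
print(" ".join(sentence))
```

With an LSTM, the only structural difference is that the model also carries a hidden state from one step to the next instead of conditioning on the previous word alone.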
Train your own language model¶
Now let’s go through the step-by-step process of training your own language model using GluonNLP.
Preparation¶
We’ll start by taking care of our basic dependencies and setting up our environment.
Firstly, we import the required modules for GluonNLP and the LM.
In [1]:
import warnings
warnings.filterwarnings('ignore')
import glob
import time
import math
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.utils import download
import gluonnlp as nlp
nlp.utils.check_version('0.7.0')
Then we set up the environment for GluonNLP.
Note that in the following code, num_gpus should be changed to match the number of NVIDIA GPUs available on the target machine.
In [2]:
num_gpus = 1
context = [mx.gpu(i) for i in range(num_gpus)] if num_gpus else [mx.cpu()]
log_interval = 200
Next, we set up the hyperparameters for the LM we are using.
Note that BPTT stands for “back propagation through time,” and LR stands for learning rate. A link to more information on truncated BPTT can be found here.
In [3]:
batch_size = 20 * len(context)
lr = 20
epochs = 3
bptt = 35
grad_clip = 0.25
Loading the dataset¶
Now, we load the dataset, extract the vocabulary, numericalize, and batchify in order to perform truncated BPTT.
In [4]:
dataset_name = 'wikitext-2'
# Load the dataset
train_dataset, val_dataset, test_dataset = [
    nlp.data.WikiText2(
        segment=segment, bos=None, eos='<eos>', skip_empty=False)
    for segment in ['train', 'val', 'test']
]
# Extract the vocabulary and numericalize with "Counter"
vocab = nlp.Vocab(
    nlp.data.Counter(train_dataset), padding_token=None, bos_token=None)
# Batchify for BPTT
bptt_batchify = nlp.data.batchify.CorpusBPTTBatchify(
    vocab, bptt, batch_size, last_batch='discard')
train_data, val_data, test_data = [
    bptt_batchify(x) for x in [train_dataset, val_dataset, test_dataset]
]
Downloading /root/.mxnet/datasets/wikitext-2/wikitext-2-v1.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/wikitext-2/wikitext-2-v1.zip...
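To make the batchify step more concrete, here is a minimal pure-Python sketch of one common way to lay out a token stream for truncated BPTT: split the stream into `batch_size` contiguous columns, then cut windows of at most `bptt` time steps whose targets are the inputs shifted by one. The function `bptt_batches` is hypothetical and simplified; in particular it keeps the final short window, so it does not reproduce `CorpusBPTTBatchify` with `last_batch='discard'` exactly.

```python
def bptt_batches(tokens, batch_size, bptt):
    """Yield (data, target) windows laid out as [time][batch]."""
    # Drop the tail so the stream divides evenly into batch_size columns.
    col_len = len(tokens) // batch_size
    columns = [tokens[b * col_len:(b + 1) * col_len] for b in range(batch_size)]
    # Stop one step early: every input position needs a following target.
    for start in range(0, col_len - 1, bptt):
        end = min(start + bptt, col_len - 1)
        data = [[col[t] for col in columns] for t in range(start, end)]
        target = [[col[t + 1] for col in columns] for t in range(start, end)]
        yield data, target


tokens = list(range(12))  # stand-in for numericalized words
batches = list(bptt_batches(tokens, batch_size=2, bptt=3))
for data, target in batches:
    print(data, target)
```

Because each column is a contiguous slice of the corpus, the hidden state at the end of one window is a sensible starting state for the next window, which is what makes truncated BPTT work.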
We then load the pre-defined language model architecture as follows:
In [5]:
model_name = 'standard_lstm_lm_200'
model, vocab = nlp.model.get_model(model_name, vocab=vocab, dataset_name=None)
print(model)
print(vocab)
# Initialize the model
model.initialize(mx.init.Xavier(), ctx=context)
# Initialize the trainer and optimizer and specify some hyperparameters
trainer = gluon.Trainer(model.collect_params(), 'sgd', {
    'learning_rate': lr,
    'momentum': 0,
    'wd': 0
})
# Specify the loss function, in this case, cross-entropy with softmax.
loss = gluon.loss.SoftmaxCrossEntropyLoss()
StandardRNN(
  (decoder): HybridSequential(
    (0): Dense(200 -> 33278, linear)
  )
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
    (1): Dropout(p = 0.2, axes=())
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")
Training the LM¶
Now that everything is ready, we can start training the model.
We first define a helper function for detaching the gradients on specific states for easier truncated BPTT.
In [6]:
def detach(hidden):
    if isinstance(hidden, (tuple, list)):
        hidden = [detach(i) for i in hidden]
    else:
        hidden = hidden.detach()
    return hidden
And then a helper evaluation function.
In [7]:
# Note that ctx is short for context
def evaluate(model, data_source, batch_size, ctx):
    total_L = 0.0
    ntotal = 0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    for i, (data, target) in enumerate(data_source):
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        output, hidden = model(data, hidden)
        hidden = detach(hidden)
        L = loss(output.reshape(-3, -1), target.reshape(-1))
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal
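The evaluation helper returns the mean per-token cross-entropy, and the logs in this notebook report perplexity as the exponential of that value. The following standalone sketch shows the relationship; the helper names and the example probabilities are hypothetical.

```python
import math


def mean_cross_entropy(probs_of_true_words):
    """Average negative log-probability assigned to the observed next words."""
    return -sum(math.log(p) for p in probs_of_true_words) / len(probs_of_true_words)


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_loss)


# Hypothetical probabilities a model assigned to four observed next words.
probs = [0.25, 0.25, 0.25, 0.25]
loss_value = mean_cross_entropy(probs)
ppl = perplexity(loss_value)
print(loss_value, ppl)
```

A model that always spreads its probability uniformly over four words has a perplexity of about 4, which is why perplexity is often read as an effective branching factor.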
The main training loop¶
Our loss function will be the standard cross-entropy loss function used for multi-class classification, applied at each time step to compare the model’s predictions to the true next word in the sequence. We can calculate gradients with respect to our parameters using truncated BPTT. In this case, we’ll back propagate for \(35\) time steps, updating our weights with stochastic gradient descent and a learning rate of \(20\); these correspond to the hyperparameters that we specified earlier in the notebook.
In [8]:
# Function for actually training the model
def train(model, train_data, val_data, test_data, epochs, lr):
    best_val = float("Inf")
    start_train_time = time.time()
    parameters = model.collect_params().values()
    for epoch in range(epochs):
        total_L = 0.0
        start_epoch_time = time.time()
        start_log_interval_time = time.time()
        hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
                   for ctx in context]
        for i, (data, target) in enumerate(train_data):
            data_list = gluon.utils.split_and_load(data, context,
                                                   batch_axis=1, even_split=True)
            target_list = gluon.utils.split_and_load(target, context,
                                                     batch_axis=1, even_split=True)
            hiddens = detach(hiddens)
            L = 0
            Ls = []
            with autograd.record():
                for j, (X, y, h) in enumerate(zip(data_list, target_list, hiddens)):
                    output, h = model(X, h)
                    batch_L = loss(output.reshape(-3, -1), y.reshape(-1,))
                    L = L + batch_L.as_in_context(context[0]) / (len(context) * X.size)
                    Ls.append(batch_L / (len(context) * X.size))
                    hiddens[j] = h
            L.backward()
            grads = [p.grad(x.context) for p in parameters for x in data_list]
            gluon.utils.clip_global_norm(grads, grad_clip)
            trainer.step(1)
            total_L += sum([mx.nd.sum(l).asscalar() for l in Ls])
            if i % log_interval == 0 and i > 0:
                cur_L = total_L / log_interval
                print('[Epoch %d Batch %d/%d] loss %.2f, ppl %.2f, '
                      'throughput %.2f samples/s'%(
                    epoch, i, len(train_data), cur_L, math.exp(cur_L),
                    batch_size * log_interval / (time.time() - start_log_interval_time)))
                total_L = 0.0
                start_log_interval_time = time.time()
        mx.nd.waitall()
        print('[Epoch %d] throughput %.2f samples/s'%(
            epoch, len(train_data)*batch_size / (time.time() - start_epoch_time)))
        val_L = evaluate(model, val_data, batch_size, context[0])
        print('[Epoch %d] time cost %.2fs, valid loss %.2f, valid ppl %.2f'%(
            epoch, time.time()-start_epoch_time, val_L, math.exp(val_L)))
        if val_L < best_val:
            best_val = val_L
            test_L = evaluate(model, test_data, batch_size, context[0])
            model.save_parameters('{}_{}-{}.params'.format(model_name, dataset_name, epoch))
            print('test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
        else:
            lr = lr*0.25
            print('Learning rate now %f'%(lr))
            trainer.set_learning_rate(lr)
    print('Total training throughput %.2f samples/s'%(
        (batch_size * len(train_data) * epochs) /
        (time.time() - start_train_time)))
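The training loop clips gradients by their global norm before each update. The idea can be sketched in plain Python as follows. The function `clip_by_global_norm` is a hypothetical illustration, not the MXNet implementation, and it ignores numerical-stability details that a library routine would handle.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients in place so their joint L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in grads:
            for k in range(len(grad)):
                grad[k] *= scale
    return total_norm


# Two hypothetical parameter gradients whose joint norm is 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
norm_before = clip_by_global_norm(grads, max_norm=0.25)
norm_after = math.sqrt(sum(g * g for grad in grads for g in grad))
print(norm_before, norm_after)
```

Scaling every gradient by the same factor preserves the update direction while bounding its size, which helps keep training stable with a learning rate as large as the one used here.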
We can now perform the actual training:
In [9]:
train(model, train_data, val_data, test_data, epochs, lr)
[Epoch 0 Batch 200/2983] loss 7.65, ppl 2108.10, throughput 922.40 samples/s
[Epoch 0 Batch 400/2983] loss 6.75, ppl 850.65, throughput 948.10 samples/s
[Epoch 0 Batch 600/2983] loss 6.34, ppl 569.25, throughput 950.86 samples/s
[Epoch 0 Batch 800/2983] loss 6.17, ppl 480.58, throughput 945.40 samples/s
[Epoch 0 Batch 1000/2983] loss 6.04, ppl 419.41, throughput 948.46 samples/s
[Epoch 0 Batch 1200/2983] loss 5.96, ppl 387.28, throughput 946.61 samples/s
[Epoch 0 Batch 1400/2983] loss 5.86, ppl 350.48, throughput 944.73 samples/s
[Epoch 0 Batch 1600/2983] loss 5.87, ppl 354.95, throughput 943.93 samples/s
[Epoch 0 Batch 1800/2983] loss 5.71, ppl 302.71, throughput 944.44 samples/s
[Epoch 0 Batch 2000/2983] loss 5.68, ppl 292.84, throughput 931.68 samples/s
[Epoch 0 Batch 2200/2983] loss 5.58, ppl 264.01, throughput 940.27 samples/s
[Epoch 0 Batch 2400/2983] loss 5.58, ppl 266.39, throughput 935.37 samples/s
[Epoch 0 Batch 2600/2983] loss 5.58, ppl 264.05, throughput 928.80 samples/s
[Epoch 0 Batch 2800/2983] loss 5.46, ppl 235.42, throughput 933.48 samples/s
[Epoch 0] throughput 939.66 samples/s
[Epoch 0] time cost 65.78s, valid loss 5.48, valid ppl 240.96
test loss 5.40, test ppl 221.04
[Epoch 1 Batch 200/2983] loss 5.47, ppl 237.74, throughput 922.20 samples/s
[Epoch 1 Batch 400/2983] loss 5.45, ppl 233.01, throughput 925.84 samples/s
[Epoch 1 Batch 600/2983] loss 5.29, ppl 198.96, throughput 923.23 samples/s
[Epoch 1 Batch 800/2983] loss 5.30, ppl 200.44, throughput 924.81 samples/s
[Epoch 1 Batch 1000/2983] loss 5.27, ppl 195.00, throughput 920.38 samples/s
[Epoch 1 Batch 1200/2983] loss 5.26, ppl 193.40, throughput 915.88 samples/s
[Epoch 1 Batch 1400/2983] loss 5.27, ppl 193.78, throughput 917.12 samples/s
[Epoch 1 Batch 1600/2983] loss 5.33, ppl 205.68, throughput 919.20 samples/s
[Epoch 1 Batch 1800/2983] loss 5.20, ppl 181.67, throughput 917.34 samples/s
[Epoch 1 Batch 2000/2983] loss 5.21, ppl 183.15, throughput 918.87 samples/s
[Epoch 1 Batch 2200/2983] loss 5.12, ppl 166.87, throughput 914.60 samples/s
[Epoch 1 Batch 2400/2983] loss 5.16, ppl 173.73, throughput 915.47 samples/s
[Epoch 1 Batch 2600/2983] loss 5.17, ppl 175.74, throughput 912.38 samples/s
[Epoch 1 Batch 2800/2983] loss 5.09, ppl 162.08, throughput 901.59 samples/s
[Epoch 1] throughput 917.35 samples/s
[Epoch 1] time cost 67.36s, valid loss 5.17, valid ppl 175.52
test loss 5.10, test ppl 163.60
[Epoch 2 Batch 200/2983] loss 5.14, ppl 170.42, throughput 903.73 samples/s
[Epoch 2 Batch 400/2983] loss 5.15, ppl 172.21, throughput 903.58 samples/s
[Epoch 2 Batch 600/2983] loss 4.98, ppl 145.56, throughput 901.61 samples/s
[Epoch 2 Batch 800/2983] loss 5.02, ppl 151.87, throughput 904.62 samples/s
[Epoch 2 Batch 1000/2983] loss 5.01, ppl 150.26, throughput 905.79 samples/s
[Epoch 2 Batch 1200/2983] loss 5.02, ppl 150.70, throughput 901.39 samples/s
[Epoch 2 Batch 1400/2983] loss 5.04, ppl 154.26, throughput 901.08 samples/s
[Epoch 2 Batch 1600/2983] loss 5.11, ppl 165.26, throughput 901.12 samples/s
[Epoch 2 Batch 1800/2983] loss 4.99, ppl 146.97, throughput 898.49 samples/s
[Epoch 2 Batch 2000/2983] loss 5.02, ppl 150.66, throughput 898.83 samples/s
[Epoch 2 Batch 2200/2983] loss 4.92, ppl 136.79, throughput 894.16 samples/s
[Epoch 2 Batch 2400/2983] loss 4.97, ppl 143.65, throughput 896.35 samples/s
[Epoch 2 Batch 2600/2983] loss 4.99, ppl 147.06, throughput 895.51 samples/s
[Epoch 2 Batch 2800/2983] loss 4.91, ppl 135.76, throughput 889.29 samples/s
[Epoch 2] throughput 899.42 samples/s
[Epoch 2] time cost 68.68s, valid loss 5.05, valid ppl 155.81
test loss 4.98, test ppl 144.79
Total training throughput 852.75 samples/s
Using your own dataset¶
When we train a language model, we fit to the statistics of a given dataset. While many papers focus on a few standard datasets, such as WikiText or the Penn Treebank, that’s just to provide a standard benchmark for the purpose of comparing models against one another. In general, for any given use case, you’ll want to train your own language model using a dataset of your own choice. Here, for demonstration, we’ll grab some .txt files corresponding to Sherlock Holmes novels.
We first download the new dataset.
In [10]:
TRAIN_PATH = "./sherlockholmes.train.txt"
VALID_PATH = "./sherlockholmes.valid.txt"
TEST_PATH = "./sherlockholmes.test.txt"
PREDICT_PATH = "./tinyshakespeare/input.txt"
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.train.txt",
    TRAIN_PATH,
    sha1_hash="d65a52baaf32df613d4942e0254c81cff37da5e8")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.valid.txt",
    VALID_PATH,
    sha1_hash="71133db736a0ff6d5f024bb64b4a0672b31fc6b3")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.test.txt",
    TEST_PATH,
    sha1_hash="b7ccc4778fd3296c515a3c21ed79e9c2ee249f70")
download(
    "https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tinyshakespeare/input.txt",
    PREDICT_PATH,
    sha1_hash="04486597058d11dcc2c556b1d0433891eb639d2e")
print(glob.glob("sherlockholmes.*.txt"))
Downloading ./sherlockholmes.train.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.train.txt...
Downloading ./sherlockholmes.valid.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.valid.txt...
Downloading ./sherlockholmes.test.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/sherlockholmes/sherlockholmes.test.txt...
Downloading ./tinyshakespeare/input.txt from https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/tinyshakespeare/input.txt...
['sherlockholmes.valid.txt', 'sherlockholmes.train.txt', 'sherlockholmes.test.txt']
Then we specify the tokenizer as well as batchify the dataset.
In [11]:
import nltk
moses_tokenizer = nlp.data.SacreMosesTokenizer()
sherlockholmes_datasets = [
    nlp.data.CorpusDataset(
        'sherlockholmes.{}.txt'.format(name),
        sample_splitter=nltk.tokenize.sent_tokenize,
        tokenizer=moses_tokenizer,
        flatten=True,
        eos='<eos>') for name in ['train', 'valid', 'test']
]
sherlockholmes_train_data, sherlockholmes_val_data, sherlockholmes_test_data = [
    bptt_batchify(dataset) for dataset in sherlockholmes_datasets
]
We first evaluate the model we trained on WikiText-2 to see how well it does on the new dataset.
In [12]:
sherlockholmes_L = evaluate(model, sherlockholmes_val_data, batch_size,
                            context[0])
print('Best validation loss %.2f, val ppl %.2f' %
      (sherlockholmes_L, math.exp(sherlockholmes_L)))
Best validation loss 4.77, val ppl 117.60
Or we have the option of training the model on the new dataset with just one line of code.
In [13]:
train(
    model,
    sherlockholmes_train_data,  # Your training data; tokenizing and batchifying your own data is left as an exercise for the reader
    sherlockholmes_val_data,
    sherlockholmes_test_data,  # Your test data, again left as an exercise for the reader
    epochs=3,
    lr=20)
[Epoch 0] throughput 892.88 samples/s
[Epoch 0] time cost 3.86s, valid loss 3.00, valid ppl 20.08
test loss 2.93, test ppl 18.80
[Epoch 1] throughput 890.10 samples/s
[Epoch 1] time cost 3.88s, valid loss 3.13, valid ppl 22.92
Learning rate now 5.000000
[Epoch 2] throughput 882.91 samples/s
[Epoch 2] time cost 3.91s, valid loss 2.74, valid ppl 15.41
test loss 2.70, test ppl 14.89
Total training throughput 736.79 samples/s
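The log above shows the learning rate dropping to 5 after epoch 1, when the validation loss rose from 3.00 to 3.13. The sketch below replays that annealing rule on the three validation losses from the log. The function `anneal` is a hypothetical restatement of the rule inside `train`, not a GluonNLP API.

```python
def anneal(val_losses, lr=20.0, factor=0.25):
    """Multiply lr by `factor` whenever validation loss fails to beat the best so far."""
    best = float("inf")
    history = []
    for val in val_losses:
        if val < best:
            best = val
        else:
            lr *= factor
        history.append(lr)
    return history


# Validation losses taken from the three epochs logged above.
lr_after_each_epoch = anneal([3.00, 3.13, 2.74])
print(lr_after_each_epoch)
```

The rate stays at 5 after the last epoch because the validation loss improved again, so no further reduction was triggered.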
Using a pre-trained AWD LSTM language model¶
The AWD LSTM language model was a state-of-the-art RNN language model when it was introduced [1]. Its main technique is to apply weight dropout to the recurrent hidden-to-hidden matrices, which prevents overfitting on the recurrent connections.
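As a rough sketch of weight dropout, the snippet below zeroes individual entries of a hidden-to-hidden weight matrix at random, so the same masked matrix would be reused at every time step of a forward pass. The function `weight_drop`, the matrix `W_hh`, and the inverted-dropout rescaling by 1 / (1 - p) are assumptions of this illustration and may differ from how GluonNLP implements it.

```python
import random


def weight_drop(weight, p, rng):
    """Return a copy of `weight` with each entry zeroed with probability p.

    Surviving entries are scaled by 1 / (1 - p) (an assumption of this sketch).
    """
    keep = 1.0 - p
    return [[w / keep if rng.random() < keep else 0.0 for w in row]
            for row in weight]


rng = random.Random(0)
# Hypothetical 4x4 hidden-to-hidden weight matrix.
W_hh = [[0.1 * (i + j + 1) for j in range(4)] for i in range(4)]
W_dropped = weight_drop(W_hh, p=0.5, rng=rng)
for row in W_dropped:
    print(row)
```

Unlike ordinary dropout on activations, the mask here is applied to the weights themselves, so the recurrent computation can still use an unmodified, optimized LSTM kernel.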
Load the vocabulary and the pre-trained model¶
In [14]:
awd_model_name = 'awd_lstm_lm_1150'
awd_model, vocab = nlp.model.get_model(
    awd_model_name,
    vocab=vocab,
    dataset_name=dataset_name,
    pretrained=True,
    ctx=context[0])
print(awd_model)
print(vocab)
Vocab file is not found. Downloading.
Downloading /root/.mxnet/models/3963101239443680508/3963101239443680508_wikitext-2-be36dc52.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/wikitext-2-be36dc52.zip...
Downloading /root/.mxnet/models/awd_lstm_lm_1150_wikitext-2-f9562ed0.zip2338dea9-cb82-4ae9-a9dc-cbc743bf1c9c from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/awd_lstm_lm_1150_wikitext-2-f9562ed0.zip...
AWDRNN(
  (decoder): HybridSequential(
    (0): Dense(400 -> 33278, linear)
  )
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): HybridSequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")
Evaluate the pre-trained model on the validation and test datasets¶
In [15]:
val_L = evaluate(awd_model, val_data, batch_size, context[0])
test_L = evaluate(awd_model, test_data, batch_size, context[0])
print('Best validation loss %.2f, val ppl %.2f' % (val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f' % (test_L, math.exp(test_L)))
Best validation loss 4.23, val ppl 68.80
Best test loss 4.19, test ppl 65.73
Using a cache LSTM LM¶
The cache LSTM language model [2] adds a cache-like memory to neural network language models. It can be used in conjunction with the aforementioned AWD LSTM language model or other LSTM models. It uses the stored hidden outputs to define a probability distribution over the words in the cache, and it is applied only at inference time, so it can improve the results of an already trained model without retraining.
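The cache mechanism from [2] can be sketched in a few lines: each cached hidden state votes for the word that followed it, weighted by the exponential of its scaled dot product with the current hidden state, and the resulting cache distribution is linearly interpolated with the model's vocabulary distribution. The helper names, the toy hidden vectors, and the toy vocabulary distribution below are hypothetical; the assumption that `theta` and `lam` play the same roles as the `theta` and `lambdas` hyperparameters used in this notebook is also ours.

```python
import math


def cache_distribution(h_t, history, theta):
    """history: list of (hidden_vector, next_word) pairs stored in the cache."""
    scores = {}
    for h_i, word in history:
        dot = sum(a * b for a, b in zip(h_t, h_i))
        scores[word] = scores.get(word, 0.0) + math.exp(theta * dot)
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


def interpolate(p_vocab, p_cache, lam):
    """Mix the two distributions: (1 - lam) * p_vocab + lam * p_cache."""
    words = set(p_vocab) | set(p_cache)
    return {w: (1 - lam) * p_vocab.get(w, 0.0) + lam * p_cache.get(w, 0.0)
            for w in words}


# Toy cache: two past hidden states and the words that followed them.
history = [([1.0, 0.0], "holmes"), ([0.0, 1.0], "watson")]
p_cache = cache_distribution([1.0, 0.0], history, theta=0.662)
p_vocab = {"holmes": 0.2, "watson": 0.3, "the": 0.5}
p_final = interpolate(p_vocab, p_cache, lam=0.1279)
print(p_final)
```

Because the current hidden state resembles the one that preceded "holmes", the cache raises that word's probability above what the vocabulary distribution alone assigned.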
Load the pre-trained model and define the hyperparameters¶
In [16]:
window = 2
theta = 0.662
lambdas = 0.1279
bptt = 2000
cache_model = nlp.model.train.get_cache_model(name=awd_model_name,
                                              dataset_name=dataset_name,
                                              window=window,
                                              theta=theta,
                                              lambdas=lambdas,
                                              ctx=context[0])
print(cache_model)
CacheCell(
  (lm_model): AWDRNN(
    (decoder): HybridSequential(
      (0): Dense(400 -> 33278, linear)
    )
    (embedding): HybridSequential(
      (0): Embedding(33278 -> 400, float32)
      (1): Dropout(p = 0.65, axes=(0,))
    )
    (encoder): HybridSequential(
      (0): LSTM(400 -> 1150, TNC)
      (1): LSTM(1150 -> 1150, TNC)
      (2): LSTM(1150 -> 400, TNC)
    )
  )
)
Define specific get_batch and evaluation helper functions for the cache model¶
Note that these helper functions are similar to the ones we defined above, with small changes to slice batches manually and to pass the cache state through the model.
In [17]:
val_test_batch_size = 1
val_test_batchify = nlp.data.batchify.CorpusBatchify(vocab, val_test_batch_size)
val_data = val_test_batchify(val_dataset)
test_data = val_test_batchify(test_dataset)
In [18]:
def get_batch(data_source, i, seq_len=None):
    seq_len = min(seq_len if seq_len else bptt, len(data_source) - 1 - i)
    data = data_source[i:i + seq_len]
    target = data_source[i + 1:i + 1 + seq_len]
    return data, target
In [19]:
def evaluate_cache(model, data_source, batch_size, ctx):
    total_L = 0.0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    next_word_history = None
    cache_history = None
    for i in range(0, len(data_source) - 1, bptt):
        if i > 0:
            print('Batch %d, ppl %f' % (i, math.exp(total_L / i)))
        # Return after the first window, so only the first bptt tokens are scored
        if i == bptt:
            return total_L / i
        data, target = get_batch(data_source, i)
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        L = 0
        outs, next_word_history, cache_history, hidden = model(
            data, target, next_word_history, cache_history, hidden)
        for out in outs:
            L += (-mx.nd.log(out)).asscalar()
        total_L += L / data.shape[1]
        hidden = detach(hidden)
    return total_L / len(data_source)
Evaluate the pre-trained model on the validation and test datasets¶
In [20]:
val_L = evaluate_cache(cache_model, val_data, val_test_batch_size, context[0])
test_L = evaluate_cache(cache_model, test_data, val_test_batch_size, context[0])
print('Best validation loss %.2f, val ppl %.2f'%(val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
Batch 2000, ppl 60.767825
Batch 2000, ppl 67.390510
Best validation loss 4.11, val ppl 60.77
Best test loss 4.21, test ppl 67.39
References¶
[1] Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018
[2] Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017