LSTM-based Language Models¶
A statistical language model is simply a probability distribution over sequences of words or characters [1]. In this tutorial, we’ll restrict our attention to word-based language models. Given a reliable language model, we can answer questions such as: which of the following strings are we more likely to encounter?
“On Monday, Mr. Lamar’s ‘DAMN.’ took home an even more elusive honor, one that may never have even seemed within reach: the Pulitzer Prize.”
“Frog zealot flagged xylophone the bean wallaby anaphylaxis extraneous porpoise into deleterious carrot banana apricot.”
Even if we’ve never seen either of these sentences in our entire lives, and even though no rapper has previously been awarded a Pulitzer Prize, we wouldn’t be shocked to see the first sentence in the New York Times. By comparison, we can all agree that the second sentence, consisting of incoherent babble, is far less likely. A statistical language model can assign precise probabilities to each of these and other strings of words.
Given a large corpus of text, we can estimate (or, in this case, train) a language model \(\hat{p}(x_1, ..., x_n)\). And given such a model, we can sample strings \(\mathbf{x} \sim \hat{p}(x_1, ..., x_n)\), generating new strings according to their estimated probability. Among other useful applications, we can use language models to score candidate transcriptions from speech recognition models, giving a preference to sentences that seem more probable (at the expense of those deemed anomalous).
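To make estimating, scoring, and sampling concrete, here is a minimal sketch that uses a toy bigram model rather than a neural network. The corpus, the add-alpha smoothing, and all function names are assumptions chosen for illustration; they are not part of GluonNLP.

```python
import math
import random
from collections import Counter, defaultdict

# Toy corpus; a real model would be estimated from millions of tokens.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog",
]

BOS, EOS = "<bos>", "<eos>"

# Count bigrams: how often each word follows each context word.
bigram_counts = defaultdict(Counter)
vocab = {EOS}
for sentence in corpus:
    tokens = [BOS] + sentence.split() + [EOS]
    vocab.update(tokens[1:])
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1


def prob(word, prev, alpha=0.1):
    """Add-alpha smoothed estimate of p(word | prev)."""
    counts = bigram_counts[prev]
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))


def log_prob(sentence):
    """Chain rule: log p(x_1..x_n) = sum_t log p(x_t | x_{t-1})."""
    tokens = [BOS] + sentence.split() + [EOS]
    return sum(math.log(prob(w, p)) for p, w in zip(tokens, tokens[1:]))


def sample(rng, max_len=20):
    """Draw a sentence word by word from the estimated distribution."""
    words, prev = [], BOS
    candidates = sorted(vocab)
    for _ in range(max_len):
        weights = [prob(w, prev) for w in candidates]
        prev = rng.choices(candidates, weights=weights)[0]
        if prev == EOS:
            break
        words.append(prev)
    return " ".join(words)


plausible = log_prob("the cat sat on the rug")
babble = log_prob("rug the on sat cat the")
print(plausible, babble)
print(sample(random.Random(0)))
```

The first sentence gets a higher log-probability than the scrambled one because every one of its bigrams occurs in the corpus, which is the same preference a speech recognizer would exploit when ranking candidate transcriptions.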
These days, recurrent neural networks (RNNs) are the preferred method for language models. In this notebook, we will go through an example of using GluonNLP to:
implement a typical LSTM language model architecture
train the language model on a corpus of real data
bring in your own dataset for training
grab off-the-shelf pre-trained state-of-the-art language models (e.g., the AWD LSTM language model) using GluonNLP.
What is a language model (LM)?¶
The standard approach to language modeling consists of training a model that, given a trailing window of text, predicts the next word in the sequence. When we train the model, we feed in the inputs \(x_1, ..., x_n\) and try at each time step to predict the corresponding next word \(x_2, ..., x_{n+1}\). To generate text from a language model, we can iteratively predict the next word and then feed this word as an input to the model at the subsequent time step. The image included below demonstrates this idea.
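The shift between inputs and targets, and the feed-back loop used for generation, can be sketched in a few lines of plain Python. The lookup table below is a hypothetical stand-in for a trained model; a real LSTM would condition on the whole history through its hidden state.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]

# Inputs are all tokens but the last; targets are the same sequence
# shifted left by one position, so targets[t] is the word after inputs[t].
inputs = tokens[:-1]
targets = tokens[1:]
for x, y in zip(inputs, targets):
    print(f"{x!r:>8} -> {y!r}")


def generate(next_word, first_word, max_len=10):
    """Feed each predicted word back in as the next input."""
    out = [first_word]
    while len(out) < max_len and out[-1] != "<eos>":
        out.append(next_word(out))
    return out


# Hypothetical stand-in for a trained model: a lookup keyed on the most
# recent word only.
table = {"the": "cat", "cat": "sat", "sat": "<eos>"}
print(generate(lambda history: table[history[-1]], "the"))
```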
Train your own language model¶
Now let’s go through the step-by-step process of training your own language model using GluonNLP.
We’ll start by taking care of our basic dependencies and setting up our environment.
Firstly, we import the required modules for GluonNLP and the LM.
import warnings
import glob
import time
import math
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.utils import download
import gluonnlp as nlp
Then we set up the environment for GluonNLP.
Please note that we should change num_gpus according to how many NVIDIA GPUs are available on the target machine in the following code.
num_gpus = 1
context = [mx.gpu(i) for i in range(num_gpus)] if num_gpus else [mx.cpu()]
log_interval = 200
Next we set up the hyperparameters for the LM we are using.
Note that BPTT stands for “backpropagation through time,” and LR stands for learning rate. Truncated BPTT backpropagates through only a fixed number of time steps (bptt below) instead of the entire sequence.
batch_size = 20 * len(context)
lr = 20
epochs = 3
bptt = 35
grad_clip = 0.25
Loading the dataset¶
Now, we load the dataset, extract the vocabulary, numericalize, and batchify in order to perform truncated BPTT.
dataset_name = 'wikitext-2'
# Load the dataset
train_dataset, val_dataset, test_dataset = [
    nlp.data.WikiText2(
        segment=segment, bos=None, eos='<eos>', skip_empty=False)
    for segment in ['train', 'val', 'test']
]
# Extract the vocabulary and numericalize with "Counter"
vocab = nlp.Vocab(
    nlp.data.Counter(train_dataset), padding_token=None, bos_token=None)
# Batchify for BPTT
bptt_batchify = nlp.data.batchify.CorpusBPTTBatchify(
    vocab, bptt, batch_size, last_batch='discard')
train_data, val_data, test_data = [
    bptt_batchify(x) for x in [train_dataset, val_dataset, test_dataset]
]
Downloading /root/.mxnet/datasets/wikitext-2/
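To picture what batchifying for truncated BPTT does, the sketch below lays a flat token stream out as parallel columns and cuts it into fixed-length windows whose targets are shifted by one step. It is a simplified pure-Python illustration under our own assumptions, not GluonNLP's actual implementation, and the function names are hypothetical.

```python
def batchify(stream, batch_size):
    """Split a flat token stream into batch_size parallel columns.

    Returns a list of rows with layout (time, batch); trailing tokens
    that do not fill a complete column are dropped.
    """
    steps = len(stream) // batch_size
    columns = [stream[b * steps:(b + 1) * steps] for b in range(batch_size)]
    return [[col[t] for col in columns] for t in range(steps)]


def bptt_windows(rows, bptt):
    """Yield (data, target) windows of `bptt` steps, discarding a short tail."""
    for i in range(0, len(rows) - 1, bptt):
        if i + bptt + 1 > len(rows):
            break  # mirrors a "discard the last incomplete batch" policy
        yield rows[i:i + bptt], rows[i + 1:i + 1 + bptt]


stream = list(range(26))          # token ids 0..25 standing in for words
rows = batchify(stream, batch_size=2)
for data, target in bptt_windows(rows, bptt=4):
    print(data, "->", target)
```

Each column is a contiguous slice of the corpus, which is why the hidden state can be carried over from one window to the next during training.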
And then we load the pre-defined language model architecture as follows:
model_name = 'standard_lstm_lm_200'
model, vocab = nlp.model.get_model(model_name, vocab=vocab, dataset_name=None)
print(model)
print(vocab)
# Initialize the model
model.initialize(mx.init.Xavier(), ctx=context)
# Initialize the trainer and optimizer and specify some hyperparameters
trainer = gluon.Trainer(model.collect_params(), 'sgd', {
    'learning_rate': lr,
    'momentum': 0,
    'wd': 0
})
# Specify the loss function, in this case, cross-entropy with softmax.
loss = gluon.loss.SoftmaxCrossEntropyLoss()
(decoder): HybridSequential(
(0): Dense(200 -> 33278, linear)
(embedding): HybridSequential(
(0): Embedding(33278 -> 200, float32)
(1): Dropout(p = 0.2, axes=())
(encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")
Training the LM¶
Now that everything is ready, we can start training the model.
We first define a helper function for detaching the gradients on specific states for easier truncated BPTT.
def detach(hidden):
    # Recurse into nested state containers; detach the arrays at the leaves
    if isinstance(hidden, (tuple, list)):
        hidden = [detach(i) for i in hidden]
    else:
        hidden = hidden.detach()
    return hidden
And then a helper evaluation function.
# Note that ctx is short for context
def evaluate(model, data_source, batch_size, ctx):
    total_L = 0.0
    ntotal = 0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    for i, (data, target) in enumerate(data_source):
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        output, hidden = model(data, hidden)
        hidden = detach(hidden)
        L = loss(output.reshape(-3, -1), target.reshape(-1))
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal
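The evaluation helper returns the mean per-token cross-entropy, and the perplexity figures reported in this notebook are simply its exponential. As a small self-contained check of that relationship (the helper names here are ours, not GluonNLP's):

```python
import math


def cross_entropy(probs_of_true_words):
    """Mean negative log-likelihood (in nats) over the predicted tokens."""
    return -sum(math.log(p) for p in probs_of_true_words) / len(probs_of_true_words)


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_loss)


# A model that spreads its mass uniformly over a 33278-word vocabulary
# assigns probability 1/33278 to every true word.
vocab_size = 33278
uniform_loss = cross_entropy([1.0 / vocab_size] * 5)
print(perplexity(uniform_loss))

# A model that always gives the true word probability 0.5 is as uncertain
# as a fair coin flip per token.
coin_loss = cross_entropy([0.5] * 4)
print(perplexity(coin_loss))
```

Perplexity can therefore be read as the effective number of equally likely choices the model faces at each step, so a uniform guess over the vocabulary scores the vocabulary size.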
The main training loop¶
Our loss function will be the standard cross-entropy loss function used for multi-class classification, applied at each time step to compare the model’s predictions to the true next word in the sequence. We can calculate gradients with respect to our parameters using truncated BPTT. In this case, we’ll backpropagate for \(35\) time steps, updating our weights with stochastic gradient descent and a learning rate of \(20\); these correspond to the hyperparameters that we specified earlier in the notebook.
# Function for actually training the model
def train(model, train_data, val_data, test_data, epochs, lr):
    best_val = float("Inf")
    start_train_time = time.time()
    parameters = model.collect_params().values()
    for epoch in range(epochs):
        total_L = 0.0
        start_epoch_time = time.time()
        start_log_interval_time = time.time()
        hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
                   for ctx in context]
        for i, (data, target) in enumerate(train_data):
            data_list = gluon.utils.split_and_load(data, context,
                                                   batch_axis=1, even_split=True)
            target_list = gluon.utils.split_and_load(target, context,
                                                     batch_axis=1, even_split=True)
            # Cut the graph at the window boundary (truncated BPTT)
            hiddens = detach(hiddens)
            L = 0
            Ls = []
            with autograd.record():
                for j, (X, y, h) in enumerate(zip(data_list, target_list, hiddens)):
                    output, h = model(X, h)
                    batch_L = loss(output.reshape(-3, -1), y.reshape(-1,))
                    L = L + batch_L.as_in_context(context[0]) / (len(context) * X.size)
                    Ls.append(batch_L / (len(context) * X.size))
                    hiddens[j] = h
            # Backpropagate, clip the gradients, and update the weights
            L.backward()
            grads = [p.grad(x.context) for p in parameters for x in data_list]
            gluon.utils.clip_global_norm(grads, grad_clip)
            trainer.step(1)
            total_L += sum([mx.nd.sum(l).asscalar() for l in Ls])
            if i % log_interval == 0 and i > 0:
                cur_L = total_L / log_interval
                print('[Epoch %d Batch %d/%d] loss %.2f, ppl %.2f, '
                      'throughput %.2f samples/s'%(
                    epoch, i, len(train_data), cur_L, math.exp(cur_L),
                    batch_size * log_interval / (time.time() - start_log_interval_time)))
                total_L = 0.0
                start_log_interval_time = time.time()
        # Wait for all asynchronous operations before timing the epoch
        mx.nd.waitall()
        print('[Epoch %d] throughput %.2f samples/s'%(
            epoch, len(train_data)*batch_size / (time.time() - start_epoch_time)))
        val_L = evaluate(model, val_data, batch_size, context[0])
        print('[Epoch %d] time cost %.2fs, valid loss %.2f, valid ppl %.2f'%(
            epoch, time.time()-start_epoch_time, val_L, math.exp(val_L)))
        if val_L < best_val:
            best_val = val_L
            test_L = evaluate(model, test_data, batch_size, context[0])
            model.save_parameters('{}_{}-{}.params'.format(model_name, dataset_name, epoch))
            print('test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
        else:
            # No improvement on the validation set: anneal the learning rate
            lr = lr*0.25
            print('Learning rate now %f'%(lr))
            trainer.set_learning_rate(lr)
    print('Total training throughput %.2f samples/s'%(
        (batch_size * len(train_data) * epochs) /
        (time.time() - start_train_time)))
We can now actually perform the training:
train(model, train_data, val_data, test_data, epochs, lr)
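The training loop clips gradients by their global norm (with grad_clip set to 0.25 earlier) before each update, which keeps a large learning rate such as 20 from blowing up the weights. The idea can be sketched in plain Python as follows; this is an illustrative re-implementation under our own assumptions, operating on lists of floats, and is not the MXNet function itself.

```python
import math


def clip_global_norm(grads, max_norm):
    """Rescale all gradients in place so their joint L2 norm is <= max_norm.

    `grads` is a list of lists of floats, one inner list per parameter.
    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in grads:
            for k in range(len(grad)):
                grad[k] *= scale
    return total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]       # joint norm is 5.0
norm_before = clip_global_norm(grads, max_norm=0.25)
norm_after = math.sqrt(sum(g * g for grad in grads for g in grad))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.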
Using your own dataset¶
When we train a language model, we fit to the statistics of a given dataset. While many papers focus on a few standard datasets, such as WikiText or the Penn Treebank, that’s just to provide a standard benchmark for the purpose of comparing models against one another. In general, for any given use case, you’ll want to train your own language model using a dataset of your own choice. Here, for demonstration, we’ll grab some .txt files corresponding to Sherlock Holmes novels.
We first download the new dataset.
TRAIN_PATH = "./sherlockholmes.train.txt"
VALID_PATH = "./sherlockholmes.valid.txt"
TEST_PATH = "./sherlockholmes.test.txt"
PREDICT_PATH = "./tinyshakespeare/input.txt"
Then we specify the tokenizer as well as batchify the dataset.
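import nltk
# Assumed reconstruction of the elided calls (Moses tokenizer, NLTK sentence splitter)
moses_tokenizer = nlp.data.SacreMosesTokenizer()
sherlockholmes_datasets = [
    nlp.data.CorpusDataset('sherlockholmes.{}.txt'.format(name),
                           sample_splitter=nltk.tokenize.sent_tokenize,
                           tokenizer=moses_tokenizer, flatten=True,
                           eos='<eos>') for name in ['train', 'valid', 'test']
]
sherlockholmes_train_data, sherlockholmes_val_data, sherlockholmes_test_data = [
    bptt_batchify(dataset) for dataset in sherlockholmes_datasets
]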
We set up the evaluation to see whether our previous model, trained on the other dataset, does well on the new dataset.
sherlockholmes_L = evaluate(model, sherlockholmes_val_data, batch_size,
                            context[0])
print('Best validation loss %.2f, test ppl %.2f' %
      (sherlockholmes_L, math.exp(sherlockholmes_L)))
Best validation loss 4.77, test ppl 117.60
Or we have the option of training the model on the new dataset with just one line of code.
train(model,
      sherlockholmes_train_data, # This is your input training data, we leave batchifying and tokenizing as an exercise for the reader
      sherlockholmes_val_data, sherlockholmes_test_data, # This would be your test data, again left as an exercise for the reader
      epochs=epochs, lr=lr)
Using a pre-trained AWD LSTM language model¶
The AWD LSTM language model [1] was the state-of-the-art RNN language model at the time of its publication. Its main technique is weight dropout: DropConnect is applied to the recurrent hidden-to-hidden weight matrices to prevent overfitting on the recurrent connections.
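To illustrate the weight-dropout idea, the sketch below zeroes individual entries of a small hidden-to-hidden matrix and reuses the same dropped matrix at every time step. It is a simplified illustration under our own assumptions (plain Python lists, no gates or inputs, hypothetical function names), not the code GluonNLP uses.

```python
import random


def drop_connect(weight, drop_prob, rng):
    """Zero individual weights with probability drop_prob and rescale the rest.

    `weight` is a matrix given as a list of rows. Surviving entries are
    divided by (1 - drop_prob) so the expected value of each weight is kept.
    """
    keep = 1.0 - drop_prob
    return [[w / keep if rng.random() < keep else 0.0 for w in row]
            for row in weight]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


rng = random.Random(0)
w_hh = [[0.5, -0.2, 0.1],
        [0.3, 0.8, -0.4],
        [-0.6, 0.1, 0.7]]

# One mask is drawn per forward pass and reused at every time step, so the
# same recurrent connections stay dropped across the whole sequence.
w_hh_dropped = drop_connect(w_hh, drop_prob=0.5, rng=rng)
h = [1.0, 1.0, 1.0]
for _ in range(3):
    h = matvec(w_hh_dropped, h)   # recurrent part only; inputs and gates omitted
print(w_hh_dropped)
```

Dropping weights rather than activations is what distinguishes this from the ordinary dropout applied to the embedding output in the model summaries shown in this notebook.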
Load the vocabulary and the pre-trained model¶
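awd_model_name = 'awd_lstm_lm_1150'
awd_model, vocab = nlp.model.get_model(
    awd_model_name, dataset_name=dataset_name, pretrained=True, ctx=context[0])
print(awd_model, vocab, sep='\n')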
Vocab file is not found. Downloading.
Downloading /root/.mxnet/models/3963101239443680508/
Downloading /root/.mxnet/models/awd_lstm_lm_1150_wikitext-2-f9562ed0.zip2338dea9-cb82-4ae9-a9dc-cbc743bf1c9c
(decoder): HybridSequential(
(0): Dense(400 -> 33278, linear)
(embedding): HybridSequential(
(0): Embedding(33278 -> 400, float32)
(1): Dropout(p = 0.65, axes=(0,))
(encoder): HybridSequential(
(0): LSTM(400 -> 1150, TNC)
(1): LSTM(1150 -> 1150, TNC)
(2): LSTM(1150 -> 400, TNC)
Vocab(size=33278, unk="<unk>", reserved="['<eos>']")
Evaluate the pre-trained model on the validation and test datasets¶
val_L = evaluate(awd_model, val_data, batch_size, context[0])
test_L = evaluate(awd_model, test_data, batch_size, context[0])
print('Best validation loss %.2f, val ppl %.2f' % (val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f' % (test_L, math.exp(test_L)))
Best validation loss 4.23, val ppl 68.80
Best test loss 4.19, test ppl 65.73
Using a cache LSTM LM¶
The cache LSTM language model [2] adds a cache-like memory to neural network language models. It can be used in conjunction with the aforementioned AWD LSTM language model or other LSTM models. It uses the stored hidden states to define a probability distribution over the words in the cache, which is then combined with the model’s usual next-word distribution. Because the cache is applied only at inference time, it can improve a pre-trained model’s results without any additional training.
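As a rough sketch of the mechanism, following the formulation in [2], the cache scores each previously seen word by the similarity between the current hidden state and the hidden state that preceded that word, and the result is linearly interpolated with the base model's distribution. The two-dimensional hidden states, the toy vocabulary, and the function names below are assumptions for illustration only; this is not GluonNLP's implementation.

```python
import math


def cache_distribution(query, history, theta):
    """p_cache(w) proportional to sum_i exp(theta * <query, h_i>) [w == w_i].

    `history` holds (hidden_state, next_word) pairs seen so far.
    """
    scores = {}
    for hidden, word in history:
        dot = sum(q * h for q, h in zip(query, hidden))
        scores[word] = scores.get(word, 0.0) + math.exp(theta * dot)
    total = sum(scores.values())
    return {word: s / total for word, s in scores.items()}


def mix(p_vocab, p_cache, lam):
    """Linear interpolation: (1 - lam) * p_vocab + lam * p_cache."""
    words = set(p_vocab) | set(p_cache)
    return {w: (1 - lam) * p_vocab.get(w, 0.0) + lam * p_cache.get(w, 0.0)
            for w in words}


# Hypothetical 2-d hidden states paired with the word that followed each.
history = [([1.0, 0.0], "holmes"), ([0.0, 1.0], "london"), ([0.9, 0.1], "holmes")]
p_vocab = {"holmes": 0.01, "london": 0.02, "the": 0.97}

p_cache = cache_distribution(query=[1.0, 0.0], history=history, theta=0.662)
p_final = mix(p_vocab, p_cache, lam=0.1279)
print(p_final)
```

In this toy example the word "holmes" is rare under the base model but receives a much larger share after mixing, because the current hidden state resembles the states that preceded it earlier in the text. Here theta controls how sharply the cache favors similar states and lam controls how much weight the cache gets overall.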
Load the pre-trained model and define the hyperparameters¶
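window = 2
theta = 0.662
lambdas = 0.1279
bptt = 2000
cache_model = nlp.model.train.get_cache_model(name=awd_model_name,
                                             dataset_name=dataset_name,
                                             window=window, theta=theta,
                                             lambdas=lambdas, ctx=context[0])
print(cache_model)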
(lm_model): AWDRNN(
(decoder): HybridSequential(
(0): Dense(400 -> 33278, linear)
(embedding): HybridSequential(
(0): Embedding(33278 -> 400, float32)
(1): Dropout(p = 0.65, axes=(0,))
(encoder): HybridSequential(
(0): LSTM(400 -> 1150, TNC)
(1): LSTM(1150 -> 1150, TNC)
(2): LSTM(1150 -> 400, TNC)
Define specific get_batch and evaluation helper functions for the cache model¶
Note that these helper functions are similar to the ones we defined above, with small changes to accommodate the cache model.
val_test_batch_size = 1
val_test_batchify = nlp.data.batchify.CorpusBatchify(vocab, val_test_batch_size)
val_data = val_test_batchify(val_dataset)
test_data = val_test_batchify(test_dataset)
def get_batch(data_source, i, seq_len=None):
    # Slice a window of inputs and the same window shifted by one as targets
    seq_len = min(seq_len if seq_len else bptt, len(data_source) - 1 - i)
    data = data_source[i:i + seq_len]
    target = data_source[i + 1:i + 1 + seq_len]
    return data, target
def evaluate_cache(model, data_source, batch_size, ctx):
    total_L = 0.0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    next_word_history = None
    cache_history = None
    for i in range(0, len(data_source) - 1, bptt):
        if i > 0:
            print('Batch %d, ppl %f' % (i, math.exp(total_L / i)))
        # Stop after the first window of bptt tokens to keep the demo short
        if i == bptt:
            return total_L / i
        data, target = get_batch(data_source, i)
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        L = 0
        outs, next_word_history, cache_history, hidden = model(
            data, target, next_word_history, cache_history, hidden)
        # Each out is the probability assigned to the true next word
        for out in outs:
            L += (-mx.nd.log(out)).asscalar()
        total_L += L / data.shape[1]
        hidden = detach(hidden)
    return total_L / len(data_source)
Evaluate the pre-trained model on the validation and test datasets¶
val_L = evaluate_cache(cache_model, val_data, val_test_batch_size, context[0])
test_L = evaluate_cache(cache_model, test_data, val_test_batch_size, context[0])
print('Best validation loss %.2f, val ppl %.2f'%(val_L, math.exp(val_L)))
print('Best test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
Batch 2000, ppl 60.767825
Batch 2000, ppl 67.390510
Best validation loss 4.11, val ppl 60.77
Best test loss 4.21, test ppl 67.39
[1] Merity, S., et al. “Regularizing and optimizing LSTM language models”. ICLR 2018
[2] Grave, E., et al. “Improving neural language models with a continuous cache”. ICLR 2017