gluonnlp.model.train

GluonNLP Toolkit supplies training-mode versions of models whose behavior differs between training and inference, e.g., the number and type of outputs from the forward pass differ.

Language Modeling

AWDRNN

AWD language model by Salesforce.

StandardRNN

Standard RNN language model.

CacheCell

Cache language model.

get_cache_model

Returns a cache model using a pre-trained language model.

BigRNN

Big language model with LSTMP and importance sampling.

Word Embeddings

EmbeddingModel

Abstract base class for embedding models for training.

CSREmbeddingModel

A trainable embedding model.

FasttextEmbeddingModel

FastText embedding model.

API Reference

NLP training model.

class gluonnlp.model.train.AWDRNN(mode, vocab_size, embed_size=400, hidden_size=1150, num_layers=3, tie_weights=True, dropout=0.4, weight_drop=0.5, drop_h=0.2, drop_i=0.65, drop_e=0.1, **kwargs)[source]

AWD language model by Salesforce.

Reference: https://github.com/salesforce/awd-lstm-lm

License: BSD 3-Clause

Parameters
  • mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.

  • vocab_size (int) – Size of the input vocabulary.

  • embed_size (int) – Dimension of embedding vectors.

  • hidden_size (int) – Number of hidden units for RNN.

  • num_layers (int) – Number of RNN layers.

  • tie_weights (bool, default True) – Whether to tie the weight matrices of the output dense layer and the input embedding layer.

  • dropout (float) – Dropout rate to use for encoder output.

  • weight_drop (float) – Dropout rate to use on encoder h2h weights.

  • drop_h (float) – Dropout rate to use on the output of intermediate layers of the encoder.

  • drop_i (float) – Dropout rate to use on the output of the embedding.

  • drop_e (float) – Dropout rate to use on the embedding layer.

hybrid_forward(F, inputs, begin_state=None)[source]

Implements the forward computation used by both the AWD language model and the cache model.

Parameters
  • inputs (NDArray or Symbol) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.

  • begin_state (list) – initial recurrent state tensors; the list has length num_layers, and each initial state has shape (1, batch_size, num_hidden).

Returns

  • out (NDArray or Symbol) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.

  • out_states (list) – output recurrent state tensors; the list has length num_layers, and each state has shape (1, batch_size, num_hidden).

  • encoded_raw (list) – the outputs of the model’s encoder layers; the list has length num_layers, and each output has shape (sequence_length, batch_size, num_hidden).

  • encoded_dropped (list) – the outputs of the model’s encoder layers after dropout; the list has length num_layers, and each output has shape (sequence_length, batch_size, num_hidden).
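
Example: a minimal sketch of a train-mode forward pass. The vocabulary size, batch shape, and context below are illustrative, and the sketch assumes the model exposes the usual begin_state(batch_size=..., func=..., ctx=...) helper for creating initial states.

    import mxnet as mx
    import gluonnlp as nlp

    # Illustrative hyperparameters; any vocabulary size works.
    model = nlp.model.train.AWDRNN(mode='lstm', vocab_size=10000,
                                   embed_size=400, hidden_size=1150, num_layers=3)
    model.initialize(mx.init.Xavier(), ctx=mx.cpu())

    seq_len, batch_size = 35, 16
    # Token indices with layout "TNC": (sequence_length, batch_size).
    inputs = mx.nd.random.uniform(0, 10000, shape=(seq_len, batch_size)).floor()
    hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros, ctx=mx.cpu())

    # Train-mode models return the extra per-layer encoder outputs documented above.
    out, out_states, encoded_raw, encoded_dropped = model(inputs, hidden)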

class gluonnlp.model.train.StandardRNN(mode, vocab_size, embed_size, hidden_size, num_layers, dropout=0.5, tie_weights=False, **kwargs)[source]

Standard RNN language model.

Parameters
  • mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.

  • vocab_size (int) – Size of the input vocabulary.

  • embed_size (int) – Dimension of embedding vectors.

  • hidden_size (int) – Number of hidden units for RNN.

  • num_layers (int) – Number of RNN layers.

  • dropout (float) – Dropout rate to use for encoder output.

  • tie_weights (bool, default False) – Whether to tie the weight matrices of output dense layer and input embedding layer.

hybrid_forward(F, inputs, begin_state=None)[source]

Defines the forward computation. Arguments can be either NDArray or Symbol.

Parameters
  • inputs (NDArray or Symbol) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.

  • begin_state (list) – initial recurrent state tensors; the list has length num_layers-1, and each initial state has shape (num_layers, batch_size, num_hidden).

Returns

  • out (NDArray or Symbol) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.

  • out_states (list) – output recurrent state tensors; the list has length num_layers-1, and each state has shape (num_layers, batch_size, num_hidden).

  • encoded_raw (list) – a list containing the last output of the model’s encoder; this output has shape (sequence_length, batch_size, num_hidden).

  • encoded_dropped (list) – a list containing the last output of the model’s encoder after dropout; this output has shape (sequence_length, batch_size, num_hidden).
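
Example: the same kind of sketch for StandardRNN, again with illustrative sizes and the assumed begin_state helper. Note that tying weights typically requires embed_size to equal hidden_size.

    import mxnet as mx
    import gluonnlp as nlp

    model = nlp.model.train.StandardRNN(mode='lstm', vocab_size=10000, embed_size=200,
                                        hidden_size=200, num_layers=2, tie_weights=True)
    model.initialize(mx.init.Xavier(), ctx=mx.cpu())

    inputs = mx.nd.random.uniform(0, 10000, shape=(35, 16)).floor()
    hidden = model.begin_state(batch_size=16, func=mx.nd.zeros, ctx=mx.cpu())
    out, out_states, encoded_raw, encoded_dropped = model(inputs, hidden)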

class gluonnlp.model.train.BigRNN(vocab_size, embed_size, hidden_size, num_layers, projection_size, num_sampled, embed_dropout=0.0, encode_dropout=0.0, sparse_weight=True, sparse_grad=True, **kwargs)[source]

Big language model with LSTMP and importance sampling.

Reference: https://github.com/rafaljozefowicz/lm

License: MIT

Parameters
  • vocab_size (int) – Size of the input vocabulary.

  • embed_size (int) – Dimension of embedding vectors.

  • hidden_size (int) – Number of hidden units for LSTMP.

  • num_layers (int) – Number of LSTMP layers.

  • projection_size (int) – Number of projection units for LSTMP.

  • num_sampled (int) – Number of sampled classes for the decoder.

  • embed_dropout (float) – Dropout rate to use for embedding output.

  • encode_dropout (float) – Dropout rate to use for encoder output.

  • sparse_weight (bool) – Whether to use RowSparseNDArray for weights of input and output embeddings.

  • sparse_grad (bool) – Whether to use RowSparseNDArray for the gradients w.r.t. weights of input and output embeddings.

  • Note – If sparse_grad is set to True, the gradient w.r.t. the weights of the input and output embeddings will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy_update is turned on for these optimizers, which may perform differently from standard updates. For more details, please check the Optimization API at: https://mxnet.incubator.apache.org/api/python/optimization/optimization.html

  • Note – If sparse_weight is set to True, the parameters in the embedding block and the decoder block will be stored in row_sparse format, which helps reduce memory consumption and communication overhead during multi-GPU training. However, sparse parameters cannot be shared with other blocks, nor can a block containing sparse parameters be hybridized.

forward(inputs, label, begin_state, sampled_values)[source]

Defines the forward computation.

Parameters
  • inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.

  • label (NDArray) – input label tensor with shape (sequence_length, batch_size) when layout is “TNC”.

  • begin_state (list) – initial recurrent state tensors; the list has length num_layers*2. For each layer, the two initial states have shape (batch_size, num_hidden) and (batch_size, num_projection).

  • sampled_values (list) – a list of three tensors for sampled_classes with shape (num_samples,), expected_count_sampled with shape (num_samples,), and expected_count_true with shape (sequence_length, batch_size).

Returns

  • out (NDArray) – output tensor with shape (sequence_length, batch_size, 1+num_samples) when layout is “TNC”.

  • out_states (list) – output recurrent state tensors; the list has length num_layers*2. For each layer, the two output states have shape (batch_size, num_hidden) and (batch_size, num_projection).

  • new_target (NDArray) – output tensor with shape (sequence_length, batch_size) when layout is “TNC”.
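
Example: a sketch of a single BigRNN forward pass built only from the shapes documented above. The sampled_values tensors here are dummies; in real training they would come from a candidate sampler (e.g. a log-uniform sampler over the vocabulary), and the begin_state helper is assumed.

    import mxnet as mx
    import gluonnlp as nlp

    vocab_size, num_sampled = 10000, 100
    seq_len, batch_size = 20, 8

    model = nlp.model.train.BigRNN(vocab_size=vocab_size, embed_size=128, hidden_size=512,
                                   num_layers=1, projection_size=128, num_sampled=num_sampled)
    model.initialize(mx.init.Xavier(), ctx=mx.cpu())

    inputs = mx.nd.random.uniform(0, vocab_size, shape=(seq_len, batch_size)).floor()
    label = mx.nd.random.uniform(0, vocab_size, shape=(seq_len, batch_size)).floor()
    hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros, ctx=mx.cpu())

    # Dummy importance-sampling values shaped as documented: sampled_classes (num_sampled,),
    # expected_count_sampled (num_sampled,), expected_count_true (seq_len, batch_size).
    sampled_values = [mx.nd.random.uniform(0, vocab_size, shape=(num_sampled,)).floor(),
                      mx.nd.ones((num_sampled,)),
                      mx.nd.ones((seq_len, batch_size))]

    out, out_states, new_target = model(inputs, label, hidden, sampled_values)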

class gluonnlp.model.train.CacheCell(lm_model, vocab_size, window, theta, lambdas, **kwargs)[source]

Cache language model.

We implement the neural cache language model proposed in the following work:

@article{grave2016improving,
title={Improving neural language models with a continuous cache},
author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas},
journal={ICLR},
year={2017}
}
Parameters
  • lm_model (gluonnlp.model.StandardRNN or gluonnlp.model.AWDRNN) – The underlying language model to wrap with a cache. Options are gluonnlp.model.StandardRNN and gluonnlp.model.AWDRNN.

  • vocab_size (int) – Size of the input vocabulary.

  • window (int) – Size of cache window

  • theta (float) –

    The scalar that controls the flatness of the cache distribution used to predict the next word, as shown below:

    \[p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w=x_{i+1}} \exp(\theta {h_t}^T h_i)\]

    where \(p_{cache}\) is the cache distribution, \(\mathbb{1}\) is the indicator function, and \(h_i\) is the output of timestep i.

  • lambdas (float) –

    Linear interpolation coefficient between the cache distribution and the vocabulary distribution, computed as below:

    \[p = (1 - \lambda) p_{vocab} + \lambda p_{cache}\]

    where \(p_{vocab}\) is the vocabulary distribution and \(p_{cache}\) is the cache distribution.

begin_state(*args, **kwargs)[source]

Initialize the hidden states.

hybrid_forward(F, inputs, target, next_word_history, cache_history, begin_state=None)[source]

Defines the forward computation for cache cell. Arguments can be either NDArray or Symbol.

Parameters
  • inputs (NDArray or Symbol) – The input data

  • target (NDArray or Symbol) – The label

  • next_word_history (NDArray or Symbol) – The next word in memory

  • cache_history (NDArray or Symbol) – The hidden state in cache history

  • begin_state (list of NDArray or Symbol, optional) – The begin states.

Returns

  • out (NDArray or Symbol) – The linear interpolation of the cache language model with the regular word-level language model

  • next_word_history (NDArray or Symbol) – The next words to be kept in the memory for look up (size is equal to the window size)

  • cache_history (NDArray or Symbol) – The hidden states to be kept in the memory for look up (size is equal to the window size)

load_parameters(filename, ctx=cpu(0))[source]

Load parameters from file.

Parameters
  • filename (str) – Path to parameter file.

  • ctx (Context or list of Context, default cpu()) – Context(s) to initialize loaded parameters on.

save_parameters(filename, deduplicate=False)[source]

Save parameters to file.

Parameters
  • filename (str) – Path to file.

  • deduplicate (bool, default False) – If True, save shared parameters only once. Otherwise, if a Block contains multiple sub-blocks that share parameters, each of the shared parameters will be separately saved for every sub-block.

class gluonnlp.model.train.EmbeddingModel(prefix=None, params=None)[source]

Abstract base class for embedding models for training.

An embedding model is a Gluon block with additional __contains__ and __getitem__ support for computing embeddings given a string or list of strings. See the documentation of __contains__ and __getitem__ for details.

class gluonnlp.model.train.CSREmbeddingModel(token_to_idx, output_dim, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)[source]

A trainable embedding model.

This class is a simple wrapper around the mxnet.gluon.nn.Embedding. It trains independent embedding vectors for every token. It implements the gluonnlp.model.train.EmbeddingModel interface which provides convenient helper methods.

Parameters
  • token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization len(token_to_idx) is used to specify the size of the embedding matrix.

  • output_dim (int) – Dimension of the dense embedding.

  • weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings matrix.

  • sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.

  • dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding

hybrid_forward(F, words, weight)[source]

Compute embedding of words in batch.

Parameters

words (mx.nd.NDArray) – Array of token indices.
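
Example: a sketch of looking up training-time embeddings through the EmbeddingModel interface. The toy vocabulary and embedding size are made up for illustration.

    import mxnet as mx
    import gluonnlp as nlp

    token_to_idx = {'hello': 0, 'world': 1}
    model = nlp.model.train.CSREmbeddingModel(token_to_idx, output_dim=5)
    model.initialize(mx.init.Uniform(), ctx=mx.cpu())

    # __contains__ and __getitem__ come from the EmbeddingModel base class.
    assert 'hello' in model
    vec = model['hello']  # embedding vector for 'hello'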

class gluonnlp.model.train.FasttextEmbeddingModel(token_to_idx, subword_function, output_dim, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)[source]

FastText embedding model.

The FasttextEmbeddingModel combines a word level embedding matrix and a subword level embedding matrix. It implements the gluonnlp.model.train.EmbeddingModel interface which provides convenient functions.

Parameters
  • token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization len(token_to_idx) is used to specify the size of the word embedding matrix.

  • subword_function (gluonnlp.vocab.SubwordFunction) – The subword function used to obtain the subword indices during training this model. The subword_function is used for __getitem__ and __contains__. For initialization len(subword_function) is used to specify the size of the subword embedding matrix.

  • output_dim (int) – Dimension of embeddings.

  • weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings and subword embeddings matrix.

  • sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.

  • dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding

hybrid_forward(F, words, weight)[source]

Compute embedding of words in batch.

Parameters

words (mxnet.ndarray.sparse.CSRNDArray) – Sparse array containing weights for every word and subword index. Output is the weighted sum of word and subword embeddings.

classmethod load_fasttext_format(path, ctx=cpu(0), **kwargs)[source]

Create an instance of the class and load weights.

Load the weights from the fastText binary format created by https://github.com/facebookresearch/fastText

Parameters
  • path (str) – Path to the .bin model file.

  • ctx (mx.Context, default mx.cpu()) – Context to initialize the weights on.

  • kwargs (dict) – Keyword arguments are passed to the class initializer.
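
Example: a sketch of loading pre-trained fastText weights. The file name below is a placeholder for a .bin model downloaded from the fastText project (https://fasttext.cc/); substitute your own path.

    import gluonnlp as nlp

    model = nlp.model.train.FasttextEmbeddingModel.load_fasttext_format('wiki.simple.bin')
    # Subword units allow lookups even for tokens outside the original vocabulary,
    # provided their subwords are covered.
    vec = model['hello']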

gluonnlp.model.train.get_cache_model(name, dataset_name='wikitext-2', window=2000, theta=0.6, lambdas=0.2, ctx=cpu(0), **kwargs)[source]

Returns a cache model using a pre-trained language model.

We implement the neural cache language model proposed in the following work:

@article{grave2016improving,
title={Improving neural language models with a continuous cache},
author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas},
journal={ICLR},
year={2017}
}
Parameters
  • name (str) – Name of the cache language model.

  • dataset_name (str or None, default 'wikitext-2') – The dataset name on which the pre-trained model is trained. Options are ‘wikitext-2’. If specified, then the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required; it is used to specify the size of the embedding weights and is returned directly.

  • window (int) – Size of cache window

  • theta (float) –

    The scalar that controls the flatness of the cache distribution used to predict the next word, as shown below:

    \[p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w=x_{i+1}} \exp(\theta {h_t}^T h_i)\]

    where \(p_{cache}\) is the cache distribution, \(\mathbb{1}\) is the indicator function, and \(h_i\) is the output of timestep i.

  • lambdas (float) –

    Linear interpolation coefficient between the cache distribution and the vocabulary distribution, computed as below:

    \[p = (1 - \lambda) p_{vocab} + \lambda p_{cache}\]

    where \(p_{vocab}\) is the vocabulary distribution and \(p_{cache}\) is the cache distribution.

  • vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.

  • pretrained (bool, default False) – Whether to load the pre-trained weights for model.

  • ctx (Context, default CPU) – The context in which to load the pre-trained weights.

  • root (str, default '~/.mxnet/models') – Location for keeping the pre-trained model parameters.

Returns

The model.

Return type

Block
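
Example: a sketch of constructing a cache model from a pre-trained language model. The model name and the window/theta/lambdas values are illustrative; 'awd_lstm_lm_1150' is assumed to be one of the available pre-trained language model names, and the begin_state helper signature is assumed as well.

    import mxnet as mx
    import gluonnlp as nlp

    cache_cell = nlp.model.train.get_cache_model(name='awd_lstm_lm_1150',
                                                 dataset_name='wikitext-2',
                                                 window=2000, theta=0.6,
                                                 lambdas=0.2, pretrained=True,
                                                 ctx=mx.cpu())
    # Initial hidden state for the wrapped language model.
    hidden = cache_cell.begin_state(batch_size=1, func=mx.nd.zeros, ctx=mx.cpu())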