gluonnlp.model.train

GluonNLP Toolkit supplies training-mode versions of models whose behavior differs between training and inference, e.g., the number and type of outputs from the forward pass differ.

Language Modeling

AWDRNN

AWD language model by Salesforce.

StandardRNN

Standard RNN language model.

CacheCell

Cache language model.

get_cache_model

Returns a cache model using a pre-trained language model.

BigRNN

Big language model with LSTMP and importance sampling.

Word Embeddings

EmbeddingModel

Abstract base class for embedding models for training.

CSREmbeddingModel

A trainable embedding model.

FasttextEmbeddingModel

FastText embedding model.

API Reference

NLP training model.

class gluonnlp.model.train.AWDRNN(mode, vocab_size, embed_size=400, hidden_size=1150, num_layers=3, tie_weights=True, dropout=0.4, weight_drop=0.5, drop_h=0.2, drop_i=0.65, drop_e=0.1, **kwargs)[source]

AWD language model by Salesforce.

Reference: https://github.com/salesforce/awd-lstm-lm

License: BSD 3-Clause

Parameters
  • mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.

  • vocab_size (int) – Size of the input vocabulary.

  • embed_size (int) – Dimension of embedding vectors.

  • hidden_size (int) – Number of hidden units for RNN.

  • num_layers (int) – Number of RNN layers.

  • tie_weights (bool, default True) – Whether to tie the weight matrices of the output dense layer and the input embedding layer.

  • dropout (float) – Dropout rate to use for encoder output.

  • weight_drop (float) – Dropout rate to use on encoder h2h weights.

  • drop_h (float) – Dropout rate to use on the output of intermediate layers of the encoder.

  • drop_i (float) – Dropout rate to use on the output of the embedding.

  • drop_e (float) – Dropout rate to use on the embedding layer.

hybrid_forward(F, inputs, begin_state=None)[source]

Implements the forward computation used by both the AWD language model and the cache model.

Parameters
  • inputs (NDArray or Symbol) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.

  • begin_state (list) – initial recurrent state tensors; the list has length num_layers, and each initial state has shape (1, batch_size, num_hidden).

Returns

  • out (NDArray or Symbol) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.

  • out_states (list) – output recurrent state tensors; the list has length num_layers, and each state has shape (1, batch_size, num_hidden).

  • encoded_raw (list) – the outputs of the model’s encoder layers; the list has length num_layers, and each output has shape (sequence_length, batch_size, num_hidden).

  • encoded_dropped (list) – the outputs of the model’s encoder layers after dropout; the list has length num_layers, and each output has shape (sequence_length, batch_size, num_hidden).
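
Example: a minimal sketch of a train-mode forward pass. The vocabulary size, batch shape, and context below are illustrative, and the sketch assumes the model exposes the usual begin_state(batch_size=..., func=..., ctx=...) helper for creating initial states.

    import mxnet as mx
    import gluonnlp as nlp

    # Illustrative hyperparameters; any vocabulary size works.
    model = nlp.model.train.AWDRNN(mode='lstm', vocab_size=10000,
                                   embed_size=400, hidden_size=1150, num_layers=3)
    model.initialize(mx.init.Xavier(), ctx=mx.cpu())

    seq_len, batch_size = 35, 16
    # Token indices with layout "TNC": (sequence_length, batch_size).
    inputs = mx.nd.random.uniform(0, 10000, shape=(seq_len, batch_size)).floor()
    hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros, ctx=mx.cpu())

    # Train-mode models return the extra per-layer encoder outputs documented above.
    out, out_states, encoded_raw, encoded_dropped = model(inputs, hidden)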

class gluonnlp.model.train.StandardRNN(mode, vocab_size, embed_size, hidden_size, num_layers, dropout=0.5, tie_weights=False, **kwargs)[source]

Standard RNN language model.

Parameters
  • mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.

  • vocab_size (int) – Size of the input vocabulary.

  • embed_size (int) – Dimension of embedding vectors.

  • hidden_size (int) – Number of hidden units for RNN.

  • num_layers (int) – Number of RNN layers.

  • dropout (float) – Dropout rate to use for encoder output.

  • tie_weights (bool, default False) – Whether to tie the weight matrices of output dense layer and input embedding layer.

hybrid_forward(F, inputs, begin_state=None)[source]

Defines the forward computation. Arguments can be either NDArray or Symbol.

Parameters
  • inputs (NDArray or Symbol) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.

  • begin_state (list) – initial recurrent state tensors; the list has length num_layers-1, and each initial state has shape (num_layers, batch_size, num_hidden).

Returns

  • out (NDArray or Symbol) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.

  • out_states (list) – output recurrent state tensors; the list has length num_layers-1, and each state has shape (num_layers, batch_size, num_hidden).

  • encoded_raw (list) – a list containing the last output of the model’s encoder; this output has shape (sequence_length, batch_size, num_hidden).

  • encoded_dropped (list) – a list containing the last output of the model’s encoder after dropout; this output has shape (sequence_length, batch_size, num_hidden).
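
Example: the same kind of sketch for StandardRNN, again with illustrative sizes and the assumed begin_state helper. Note that tying weights typically requires embed_size to equal hidden_size.

    import mxnet as mx
    import gluonnlp as nlp

    model = nlp.model.train.StandardRNN(mode='lstm', vocab_size=10000, embed_size=200,
                                        hidden_size=200, num_layers=2, tie_weights=True)
    model.initialize(mx.init.Xavier(), ctx=mx.cpu())

    inputs = mx.nd.random.uniform(0, 10000, shape=(35, 16)).floor()
    hidden = model.begin_state(batch_size=16, func=mx.nd.zeros, ctx=mx.cpu())
    out, out_states, encoded_raw, encoded_dropped = model(inputs, hidden)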

class gluonnlp.model.train.BigRNN(vocab_size, embed_size, hidden_size, num_layers, projection_size, num_sampled, embed_dropout=0.0, encode_dropout=0.0, sparse_weight=True, sparse_grad=True, **kwargs)[source]

Big language model with LSTMP and importance sampling.

Reference: https://github.com/rafaljozefowicz/lm

License: MIT

Parameters
  • vocab_size (int) – Size of the input vocabulary.

  • embed_size (int) – Dimension of embedding vectors.

  • hidden_size (int) – Number of hidden units for LSTMP.

  • num_layers (int) – Number of LSTMP layers.

  • projection_size (int) – Number of projection units for LSTMP.

  • num_sampled (int) – Number of sampled classes for the decoder.

  • embed_dropout (float) – Dropout rate to use for embedding output.

  • encode_dropout (float) – Dropout rate to use for encoder output.

  • sparse_weight (bool) – Whether to use RowSparseNDArray for weights of input and output embeddings.

  • sparse_grad (bool) – Whether to use RowSparseNDArray for the gradients w.r.t. weights of input and output embeddings.

  • Note – If sparse_grad is set to True, the gradient w.r.t. the weights of the input and output embeddings will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy_update is turned on for these optimizers, which may perform differently from standard updates. For more details, please check the Optimization API at: https://mxnet.incubator.apache.org/api/python/optimization/optimization.html

  • Note – If sparse_weight is set to True, the parameters in the embedding block and the decoder block will be stored in row_sparse format, which helps reduce memory consumption and communication overhead during multi-GPU training. However, sparse parameters cannot be shared with other blocks, nor can a block containing sparse parameters be hybridized.

forward(inputs, label, begin_state, sampled_values)[source]

Defines the forward computation.

Parameters
  • inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.

  • label (NDArray) – input label tensor with shape (sequence_length, batch_size) when layout is “TNC”.

  • begin_state (list) – initial recurrent state tensors; the list has length num_layers*2. For each layer, the two initial states have shape (batch_size, num_hidden) and (batch_size, num_projection).

  • sampled_values (list) – a list of three tensors for sampled_classes with shape (num_samples,), expected_count_sampled with shape (num_samples,), and expected_count_true with shape (sequence_length, batch_size).

Returns

  • out (NDArray) – output tensor with shape (sequence_length, batch_size, 1+num_samples) when layout is “TNC”.

  • out_states (list) – output recurrent state tensors; the list has length num_layers*2. For each layer, the two output states have shape (batch_size, num_hidden) and (batch_size, num_projection).

  • new_target (NDArray) – output tensor with shape (sequence_length, batch_size) when layout is “TNC”.
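
Example: a sketch of a single BigRNN forward pass built only from the shapes documented above. The sampled_values tensors here are dummies; in real training they would come from a candidate sampler (e.g. a log-uniform sampler over the vocabulary), and the begin_state helper is assumed.

    import mxnet as mx
    import gluonnlp as nlp

    vocab_size, num_sampled = 10000, 100
    seq_len, batch_size = 20, 8

    model = nlp.model.train.BigRNN(vocab_size=vocab_size, embed_size=128, hidden_size=512,
                                   num_layers=1, projection_size=128, num_sampled=num_sampled)
    model.initialize(mx.init.Xavier(), ctx=mx.cpu())

    inputs = mx.nd.random.uniform(0, vocab_size, shape=(seq_len, batch_size)).floor()
    label = mx.nd.random.uniform(0, vocab_size, shape=(seq_len, batch_size)).floor()
    hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros, ctx=mx.cpu())

    # Dummy importance-sampling values shaped as documented: sampled_classes (num_sampled,),
    # expected_count_sampled (num_sampled,), expected_count_true (seq_len, batch_size).
    sampled_values = [mx.nd.random.uniform(0, vocab_size, shape=(num_sampled,)).floor(),
                      mx.nd.ones((num_sampled,)),
                      mx.nd.ones((seq_len, batch_size))]

    out, out_states, new_target = model(inputs, label, hidden, sampled_values)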

class gluonnlp.model.train.CacheCell(lm_model, vocab_size, window, theta, lambdas, **kwargs)[source]

Cache language model.

We implement the neural cache language model proposed in the following work:

@article{grave2016improving,
title={Improving neural language models with a continuous cache},
author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas},
journal={ICLR},
year={2017}
}
Parameters
  • lm_model (gluonnlp.model.StandardRNN or gluonnlp.model.AWDRNN) – The underlying language model to wrap with a cache. Options are gluonnlp.model.StandardRNN and gluonnlp.model.AWDRNN.

  • vocab_size (int) – Size of the input vocabulary.

  • window (int) – Size of cache window

  • theta (float) –

    The scalar that controls the flatness of the cache distribution used to predict the next word, as shown below:

    \[p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w=x_{i+1}} \exp(\theta {h_t}^T h_i)\]

    where \(p_{cache}\) is the cache distribution, \(\mathbb{1}\) is the indicator function, and \(h_i\) is the output of timestep i.

  • lambdas (float) –

    Linear interpolation coefficient between the cache distribution and the vocabulary distribution, computed as below:

    \[p = (1 - \lambda) p_{vocab} + \lambda p_{cache}\]

    where \(p_{vocab}\) is the vocabulary distribution and \(p_{cache}\) is the cache distribution.

begin_state(*args, **kwargs)[source]

Initialize the hidden states.

hybrid_forward(F, inputs, target, next_word_history, cache_history, begin_state=None)[source]

Defines the forward computation for cache cell. Arguments can be either NDArray or Symbol.

Parameters
  • inputs (NDArray or Symbol) – The input data

  • target (NDArray or Symbol) – The label

  • next_word_history (NDArray or Symbol) – The next word in memory

  • cache_history (NDArray or Symbol) – The hidden state in cache history

  • begin_state (list of NDArray or Symbol, optional) – The begin states.

Returns

  • out (NDArray or Symbol) – The linear interpolation of the cache language model with the regular word-level language model

  • next_word_history (NDArray or Symbol) – The next words to be kept in the memory for look up (size is equal to the window size)

  • cache_history (NDArray or Symbol) – The hidden states to be kept in the memory for look up (size is equal to the window size)

load_parameters(filename, ctx=cpu(0))[source]

Load parameters from file.

Parameters
  • filename (str) – Path to parameter file.

  • ctx (Context or list of Context, default cpu()) – Context(s) to initialize loaded parameters on.

save_parameters(filename, deduplicate=False)[source]

Save parameters to file.

Parameters
  • filename (str) – Path to file.

  • deduplicate (bool, default False) – If True, save shared parameters only once. Otherwise, if a Block contains multiple sub-blocks that share parameters, each of the shared parameters will be separately saved for every sub-block.

class gluonnlp.model.train.EmbeddingModel(prefix=None, params=None)[source]

Abstract base class for embedding models for training.

An embedding model is a Gluon block with additional __contains__ and __getitem__ support for computing embeddings given a string or list of strings. See the documentation of __contains__ and __getitem__ for details.

class gluonnlp.model.train.CSREmbeddingModel(token_to_idx, output_dim, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)[source]

A trainable embedding model.

This class is a simple wrapper around the mxnet.gluon.nn.Embedding. It trains independent embedding vectors for every token. It implements the gluonnlp.model.train.EmbeddingModel interface which provides convenient helper methods.

Parameters
  • token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization len(token_to_idx) is used to specify the size of the embedding matrix.

  • output_dim (int) – Dimension of the dense embedding.

  • weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings matrix.

  • sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.

  • dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding

hybrid_forward(F, words, weight)[source]

Compute embedding of words in batch.

Parameters

words (mx.nd.NDArray) – Array of token indices.
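
Example: a sketch of looking up training-time embeddings through the EmbeddingModel interface. The toy vocabulary and embedding size are made up for illustration.

    import mxnet as mx
    import gluonnlp as nlp

    token_to_idx = {'hello': 0, 'world': 1}
    model = nlp.model.train.CSREmbeddingModel(token_to_idx, output_dim=5)
    model.initialize(mx.init.Uniform(), ctx=mx.cpu())

    # __contains__ and __getitem__ come from the EmbeddingModel base class.
    assert 'hello' in model
    vec = model['hello']  # embedding vector for 'hello'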

class gluonnlp.model.train.FasttextEmbeddingModel(token_to_idx, subword_function, output_dim, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)[source]

FastText embedding model.

The FasttextEmbeddingModel combines a word level embedding matrix and a subword level embedding matrix. It implements the gluonnlp.model.train.EmbeddingModel interface which provides convenient functions.

Parameters
  • token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization len(token_to_idx) is used to specify the size of the word embedding matrix.

  • subword_function (gluonnlp.vocab.SubwordFunction) – The subword function used to obtain the subword indices during training this model. The subword_function is used for __getitem__ and __contains__. For initialization len(subword_function) is used to specify the size of the subword embedding matrix.

  • output_dim (int) – Dimension of embeddings.

  • weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings and subword embeddings matrix.

  • sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.

  • dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding

hybrid_forward(F, words, weight)[source]

Compute embedding of words in batch.

Parameters

words (mxnet.ndarray.sparse.CSRNDArray) – Sparse array containing weights for every word and subword index. Output is the weighted sum of word and subword embeddings.

classmethod load_fasttext_format(path, ctx=cpu(0), **kwargs)[source]

Create an instance of the class and load weights.

Load the weights from the fastText binary format created by https://github.com/facebookresearch/fastText

Parameters
  • path (str) – Path to the .bin model file.

  • ctx (mx.Context, default mx.cpu()) – Context to initialize the weights on.

  • kwargs (dict) – Keyword arguments are passed to the class initializer.
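
Example: a sketch of loading pre-trained fastText weights. The file name below is a placeholder for a .bin model downloaded from the fastText project (https://fasttext.cc/); substitute your own path.

    import gluonnlp as nlp

    model = nlp.model.train.FasttextEmbeddingModel.load_fasttext_format('wiki.simple.bin')
    # Subword units allow lookups even for tokens outside the original vocabulary,
    # provided their subwords are covered.
    vec = model['hello']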

gluonnlp.model.train.get_cache_model(name, dataset_name='wikitext-2', window=2000, theta=0.6, lambdas=0.2, ctx=cpu(0), **kwargs)[source]

Returns a cache model using a pre-trained language model.

We implement the neural cache language model proposed in the following work:

@article{grave2016improving,
title={Improving neural language models with a continuous cache},
author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas},
journal={ICLR},
year={2017}
}
Parameters
  • name (str) – Name of the cache language model.

  • dataset_name (str or None, default 'wikitext-2') – The dataset name on which the pre-trained model is trained. Options are ‘wikitext-2’. If specified, then the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required; it is used to specify the size of the embedding weights and is returned directly.

  • window (int) – Size of cache window

  • theta (float) –

    The scalar that controls the flatness of the cache distribution used to predict the next word, as shown below:

    \[p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w=x_{i+1}} \exp(\theta {h_t}^T h_i)\]

    where \(p_{cache}\) is the cache distribution, \(\mathbb{1}\) is the indicator function, and \(h_i\) is the output of timestep i.

  • lambdas (float) –

    Linear interpolation coefficient between the cache distribution and the vocabulary distribution, computed as below:

    \[p = (1 - \lambda) p_{vocab} + \lambda p_{cache}\]

    where \(p_{vocab}\) is the vocabulary distribution and \(p_{cache}\) is the cache distribution.

  • vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.

  • pretrained (bool, default False) – Whether to load the pre-trained weights for model.

  • ctx (Context, default CPU) – The context in which to load the pre-trained weights.

  • root (str, default '~/.mxnet/models') – Location for keeping the pre-trained model parameters.

Returns

The model.

Return type

Block
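
Example: a sketch of constructing a cache model from a pre-trained language model. The model name and the window/theta/lambdas values are illustrative; 'awd_lstm_lm_1150' is assumed to be one of the available pre-trained language model names, and the begin_state helper signature is assumed as well.

    import mxnet as mx
    import gluonnlp as nlp

    cache_cell = nlp.model.train.get_cache_model(name='awd_lstm_lm_1150',
                                                 dataset_name='wikitext-2',
                                                 window=2000, theta=0.6,
                                                 lambdas=0.2, pretrained=True,
                                                 ctx=mx.cpu())
    # Initial hidden state for the wrapped language model.
    hidden = cache_cell.begin_state(batch_size=1, func=mx.nd.zeros, ctx=mx.cpu())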