gluonnlp.model.train
GluonNLP Toolkit supplies train-mode versions of these models because the corresponding models behave differently during training and inference; for example, the number and type of outputs returned by the forward pass differ.
Language Modeling
AWDRNN: AWD language model by Salesforce.
StandardRNN: Standard RNN language model.
CacheCell: Cache language model.
get_cache_model: Returns a cache model using a pre-trained language model.
BigRNN: Big language model with LSTMP and importance sampling.
Word Embeddings
EmbeddingModel: Abstract base class for embedding models for training.
CSREmbeddingModel: A trainable embedding model.
FasttextEmbeddingModel: FastText embedding model.
API Reference
NLP training model.
class gluonnlp.model.train.AWDRNN(mode, vocab_size, embed_size=400, hidden_size=1150, num_layers=3, tie_weights=True, dropout=0.4, weight_drop=0.5, drop_h=0.2, drop_i=0.65, drop_e=0.1, **kwargs)
AWD language model by Salesforce.
Reference: https://github.com/salesforce/awd-lstm-lm
License: BSD 3-Clause
- Parameters
mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
vocab_size (int) – Size of the input vocabulary.
embed_size (int) – Dimension of embedding vectors.
hidden_size (int) – Number of hidden units for RNN.
num_layers (int) – Number of RNN layers.
tie_weights (bool, default True) – Whether to tie the weight matrices of the output dense layer and the input embedding layer.
dropout (float) – Dropout rate to use for encoder output.
weight_drop (float) – Dropout rate to use on encoder h2h weights.
drop_h (float) – Dropout rate to use on the output of intermediate layers of the encoder.
drop_i (float) – Dropout rate to use on the output of the embedding.
drop_e (float) – Dropout rate to use on the embedding layer.
hybrid_forward(F, inputs, begin_state=None)
Implements the forward computation that the AWD language model and the cache model use.
- Parameters
inputs (NDArray or Symbol) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
begin_state (list) – initial recurrent state tensors with length equal to num_layers; each initial state has shape (1, batch_size, num_hidden).
- Returns
out (NDArray or Symbol) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
out_states (list) – output recurrent state tensors with length equal to num_layers; each state has shape (1, batch_size, num_hidden).
encoded_raw (list) – The list of outputs of the model’s encoder with length equal to num_layers; each output has shape (sequence_length, batch_size, num_hidden).
encoded_dropped (list) – The list of outputs of the model’s encoder with dropout applied, with length equal to num_layers; each dropped output has shape (sequence_length, batch_size, num_hidden).
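The following is a minimal usage sketch (not part of the original documentation): it constructs a train-mode AWDRNN with its default sizes and runs one forward pass on random token indices. It assumes that passing begin_state=None lets the model create a zero initial state, as the default argument suggests.

    import mxnet as mx
    import gluonnlp as nlp

    vocab_size, seq_len, batch_size = 1000, 35, 4
    model = nlp.model.train.AWDRNN(mode='lstm', vocab_size=vocab_size)
    model.initialize(mx.init.Xavier(), ctx=mx.cpu())

    # token indices with layout "TNC": (sequence_length, batch_size)
    inputs = mx.nd.random.randint(0, vocab_size, shape=(seq_len, batch_size)).astype('float32')
    out, out_states, encoded_raw, encoded_dropped = model(inputs)
    print(out.shape)          # (sequence_length, batch_size, ...) decoded output
    print(len(encoded_raw))   # one entry per RNN layer (3 by default)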
class gluonnlp.model.train.StandardRNN(mode, vocab_size, embed_size, hidden_size, num_layers, dropout=0.5, tie_weights=False, **kwargs)
Standard RNN language model.
- Parameters
mode (str) – The type of RNN to use. Options are ‘lstm’, ‘gru’, ‘rnn_tanh’, ‘rnn_relu’.
vocab_size (int) – Size of the input vocabulary.
embed_size (int) – Dimension of embedding vectors.
hidden_size (int) – Number of hidden units for RNN.
num_layers (int) – Number of RNN layers.
dropout (float) – Dropout rate to use for encoder output.
tie_weights (bool, default False) – Whether to tie the weight matrices of output dense layer and input embedding layer.
hybrid_forward(F, inputs, begin_state=None)
Defines the forward computation. Arguments can be either NDArray or Symbol.
- Parameters
inputs (NDArray or Symbol) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
begin_state (list) – initial recurrent state tensors; each has shape (num_layers, batch_size, num_hidden).
- Returns
out (NDArray or Symbol) – output tensor with shape (sequence_length, batch_size, input_size) when layout is “TNC”.
out_states (list) – output recurrent state tensors; each has shape (num_layers, batch_size, num_hidden).
encoded_raw (list) – The list containing the last output of the model’s encoder; it has shape (sequence_length, batch_size, num_hidden).
encoded_dropped (list) – The list containing the last output of the model’s encoder with dropout applied; it has shape (sequence_length, batch_size, num_hidden).
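As with AWDRNN above, a brief forward-pass sketch under the same assumption about begin_state=None; here all sizes must be given explicitly.

    import mxnet as mx
    import gluonnlp as nlp

    model = nlp.model.train.StandardRNN(mode='lstm', vocab_size=1000, embed_size=200,
                                        hidden_size=200, num_layers=2)
    model.initialize(mx.init.Xavier())
    inputs = mx.nd.random.randint(0, 1000, shape=(35, 4)).astype('float32')
    out, out_states, encoded_raw, encoded_dropped = model(inputs)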
class gluonnlp.model.train.BigRNN(vocab_size, embed_size, hidden_size, num_layers, projection_size, num_sampled, embed_dropout=0.0, encode_dropout=0.0, sparse_weight=True, sparse_grad=True, **kwargs)
Big language model with LSTMP and importance sampling.
Reference: https://github.com/rafaljozefowicz/lm
License: MIT
- Parameters
vocab_size (int) – Size of the input vocabulary.
embed_size (int) – Dimension of embedding vectors.
hidden_size (int) – Number of hidden units for LSTMP.
num_layers (int) – Number of LSTMP layers.
projection_size (int) – Number of projection units for LSTMP.
num_sampled (int) – Number of sampled classes for the decoder.
embed_dropout (float) – Dropout rate to use for embedding output.
encode_dropout (float) – Dropout rate to use for encoder output.
sparse_weight (bool) – Whether to use RowSparseNDArray for weights of input and output embeddings.
sparse_grad (bool) – Whether to use RowSparseNDArray for the gradients w.r.t. weights of input and output embeddings.
Note: If sparse_grad is set to True, the gradients w.r.t. the input and output embeddings will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy_update is turned on for these optimizers, which may perform differently from standard updates. For more details, please check the Optimization API at: https://mxnet.incubator.apache.org/api/python/optimization/optimization.html
Note: If sparse_weight is set to True, the parameters in the embedding block and decoder block will be stored in row_sparse format, which helps reduce memory consumption and communication overhead during multi-GPU training. However, sparse parameters cannot be shared with other blocks, nor can a block containing sparse parameters be hybridized.
forward(inputs, label, begin_state, sampled_values)
Defines the forward computation.
- Parameters
inputs (NDArray) – input tensor with shape (sequence_length, batch_size) when layout is “TNC”.
label (NDArray) – target tensor with shape (sequence_length, batch_size) when layout is “TNC”.
begin_state (list) – initial recurrent state tensors with length equal to num_layers*2. For each layer the two initial states have shape (batch_size, num_hidden) and (batch_size, num_projection).
sampled_values (list) – a list of three tensors for sampled_classes with shape (num_samples,), expected_count_sampled with shape (num_samples,), and expected_count_true with shape (sequence_length, batch_size).
- Returns
out (NDArray) – output tensor with shape (sequence_length, batch_size, 1+num_samples) when layout is “TNC”.
out_states (list) – output recurrent state tensors with length equal to num_layers*2. For each layer the two output states have shape (batch_size, num_hidden) and (batch_size, num_projection).
new_target (NDArray) – output tensor with shape (sequence_length, batch_size) when layout is “TNC”.
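A shape-only sketch (not from the original documentation) of how these arguments fit together. It assumes the train-mode BigRNN exposes a begin_state helper like other Gluon RNN blocks; if it does not, build the list of num_layers*2 zero tensors with the shapes given above. The dummy sampled_values simply mirror the documented shapes; in real training they come from a candidate sampler over the vocabulary.

    import mxnet as mx
    import gluonnlp as nlp

    vocab_size, embed_size, hidden_size = 1000, 32, 64
    num_layers, projection_size, num_sampled = 2, 16, 50
    seq_len, batch_size = 10, 4

    model = nlp.model.train.BigRNN(vocab_size, embed_size, hidden_size,
                                   num_layers, projection_size, num_sampled)
    model.initialize(mx.init.Xavier())

    inputs = mx.nd.random.randint(0, vocab_size, shape=(seq_len, batch_size)).astype('float32')
    label = mx.nd.random.randint(0, vocab_size, shape=(seq_len, batch_size)).astype('float32')
    begin_state = model.begin_state(batch_size=batch_size, func=mx.nd.zeros)  # assumed helper
    sampled_values = [mx.nd.arange(num_sampled),         # sampled_classes, shape (num_samples,)
                      mx.nd.ones((num_sampled,)),        # expected_count_sampled, shape (num_samples,)
                      mx.nd.ones((seq_len, batch_size))] # expected_count_true, shape (seq_len, batch)
    out, out_states, new_target = model(inputs, label, begin_state, sampled_values)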
class gluonnlp.model.train.CacheCell(lm_model, vocab_size, window, theta, lambdas, **kwargs)
Cache language model.
We implement the neural cache language model proposed in the following work:
@article{grave2016improving, title={Improving neural language models with a continuous cache}, author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas}, journal={ICLR}, year={2017} }
- Parameters
lm_model (gluonnlp.model.StandardRNN or gluonnlp.model.AWDRNN) – The type of RNN to use. Options are ‘gluonnlp.model.StandardRNN’, ‘gluonnlp.model.AWDRNN’.
vocab_size (int) – Size of the input vocabulary.
window (int) – Size of the cache window.
theta (float) –
The scalar that controls the flatness of the cache distribution used to predict the next word, as shown below:
\[p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w=x_{i+1}} \exp(\theta {h_t}^T h_i)\]where \(p_{cache}\) is the cache distribution, \(\mathbb{1}\) is the identity function, and \(h_i\) is the output of timestep i.
lambdas (float) –
The linear interpolation coefficient between the vocabulary distribution and the cache distribution:
\[p = (1 - \lambda) p_{vocab} + \lambda p_{cache}\]where \(p_{vocab}\) is the vocabulary distribution and \(p_{cache}\) is the cache distribution.
hybrid_forward(F, inputs, target, next_word_history, cache_history, begin_state=None)
Defines the forward computation for the cache cell. Arguments can be either NDArray or Symbol.
- Parameters
inputs (NDArray or Symbol) – The input data
target (NDArray or Symbol) – The label
next_word_history (NDArray or Symbol) – The next word in memory
cache_history (NDArray or Symbol) – The hidden state in cache history
begin_state (list of NDArray or Symbol, optional) – The begin states.
- Returns
out (NDArray or Symbol) – The linear interpolation of the cache language model with the regular word-level language model
next_word_history (NDArray or Symbol) – The next words to be kept in the memory for look up (size is equal to the window size)
cache_history (NDArray or Symbol) – The hidden states to be kept in the memory for look up (size is equal to the window size)
load_parameters(filename, ctx=cpu(0))
Load parameters from file.
- Parameters
filename (str) – Path to parameter file.
ctx (Context or list of Context, default cpu()) – Context(s) to initialize loaded parameters on.
save_parameters(filename, deduplicate=False)
Save parameters to file.
- Parameters
filename (str) – Path to file.
deduplicate (bool, default False) – If True, save shared parameters only once. Otherwise, if a Block contains multiple sub-blocks that share parameters, each of the shared parameters will be saved separately for every sub-block.
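A small usage sketch of the two methods above; cache_cell stands for any CacheCell instance (for example one returned by get_cache_model further below), and the file name is only illustrative.

    import mxnet as mx

    cache_cell.save_parameters('awd_lstm_cache.params')                # write parameters to disk
    cache_cell.load_parameters('awd_lstm_cache.params', ctx=mx.cpu())  # restore them later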
class gluonnlp.model.train.EmbeddingModel(prefix=None, params=None)
Abstract base class for embedding models for training.
An embedding model is a Gluon block with additional __contains__ and __getitem__ support for computing embeddings given a string or list of strings. See the documentation of __contains__ and __getitem__ for details.
class gluonnlp.model.train.CSREmbeddingModel(token_to_idx, output_dim, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)
A trainable embedding model.
This class is a simple wrapper around the mxnet.gluon.nn.Embedding. It trains independent embedding vectors for every token. It implements the gluonnlp.model.train.EmbeddingModel interface which provides convenient helper methods.
- Parameters
token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization len(token_to_idx) is used to specify the size of the embedding matrix.
output_dim (int) – Dimension of the dense embedding.
weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings matrix.
sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.
dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding
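A minimal sketch, relying on the __contains__/__getitem__ behaviour described for EmbeddingModel above, of creating a small trainable embedding and querying it by token string; the toy vocabulary is only illustrative.

    import mxnet as mx
    import gluonnlp as nlp

    token_to_idx = {'hello': 0, 'world': 1}
    model = nlp.model.train.CSREmbeddingModel(token_to_idx, output_dim=5)
    model.initialize(mx.init.Uniform())

    print('hello' in model)   # True: the token is part of the vocabulary
    vec = model['hello']      # embedding vector for the token 'hello'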
class gluonnlp.model.train.FasttextEmbeddingModel(token_to_idx, subword_function, output_dim, weight_initializer=None, sparse_grad=True, dtype='float32', **kwargs)
FastText embedding model.
The FasttextEmbeddingModel combines a word level embedding matrix and a subword level embedding matrix. It implements the gluonnlp.model.train.EmbeddingModel interface which provides convenient functions.
- Parameters
token_to_idx (dict) – token_to_idx mapping of the vocabulary that this model is to be trained with. token_to_idx is used for __getitem__ and __contains__. For initialization len(token_to_idx) is used to specify the size of the word embedding matrix.
subword_function (gluonnlp.vocab.SubwordFunction) – The subword function used to obtain the subword indices during training of this model. The subword_function is used for __getitem__ and __contains__. For initialization len(subword_function) is used to specify the size of the subword embedding matrix.
output_dim (int) – Dimension of embeddings.
weight_initializer (mxnet.initializer.Initializer, optional) – Initializer for the embeddings and subword embeddings matrix.
sparse_grad (bool, default True) – Specifies mxnet.gluon.nn.Embedding sparse_grad argument.
dtype (str, default 'float32') – dtype argument passed to gluon.nn.Embedding
hybrid_forward(F, words, weight)
Compute the embeddings of the words in a batch.
- Parameters
- Parameters
words (mxnet.ndarray.sparse.CSRNDArray) – Sparse array containing weights for every word and subword index. Output is the weighted sum of word and subword embeddings.
classmethod load_fasttext_format(path, ctx=cpu(0), **kwargs)
Create an instance of the class and load weights.
Load the weights from the fastText binary format created by https://github.com/facebookresearch/fastText
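A loading sketch: the path to the .bin file is hypothetical and must point to a model in the fastText binary format mentioned above; because the model is subword-based, it can also embed words outside the loaded vocabulary.

    import gluonnlp as nlp

    model = nlp.model.train.FasttextEmbeddingModel.load_fasttext_format('wiki.simple.bin')
    print('hello' in model)   # subword-based models can cover out-of-vocabulary words too
    vec = model['hello']      # word vector combined with its subword vectors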
gluonnlp.model.train.get_cache_model(name, dataset_name='wikitext-2', window=2000, theta=0.6, lambdas=0.2, ctx=cpu(0), **kwargs)
Returns a cache model using a pre-trained language model.
We implement the neural cache language model proposed in the following work:
@article{grave2016improving, title={Improving neural language models with a continuous cache}, author={Grave, Edouard and Joulin, Armand and Usunier, Nicolas}, journal={ICLR}, year={2017} }
- Parameters
name (str) – Name of the cache language model.
dataset_name (str or None, default 'wikitext-2') – The dataset name on which the pre-trained model is trained. Options are ‘wikitext-2’. If specified, then the returned vocabulary is extracted from the training set of the dataset. If None, then vocab is required for specifying the embedding weight size, and is returned directly.
window (int) – Size of the cache window.
theta (float) –
The scalar that controls the flatness of the cache distribution used to predict the next word, as shown below:
\[p_{cache} \propto \sum_{i=1}^{t-1} \mathbb{1}_{w=x_{i+1}} \exp(\theta {h_t}^T h_i)\]where \(p_{cache}\) is the cache distribution, \(\mathbb{1}\) is the identity function, and \(h_i\) is the output of timestep i.
lambdas (float) –
The linear interpolation coefficient between the vocabulary distribution and the cache distribution:
\[p = (1 - \lambda) p_{vocab} + \lambda p_{cache}\]where \(p_{vocab}\) is the vocabulary distribution and \(p_{cache}\) is the cache distribution.
vocab (gluonnlp.Vocab or None, default None) – Vocabulary object to be used with the language model. Required when dataset_name is not specified.
pretrained (bool, default False) – Whether to load the pre-trained weights for model.
ctx (Context, default CPU) – The context in which to load the pre-trained weights.
root (str, default '~/.mxnet/models') – Location for keeping the pre-trained model parameters.
- Returns
The model.
- Return type
Block
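A usage sketch: the model name 'awd_lstm_lm_1150' is assumed to be one of the pre-trained language models shipped with GluonNLP; pretrained=True downloads the corresponding weights.

    import mxnet as mx
    import gluonnlp as nlp

    cache_model = nlp.model.train.get_cache_model(name='awd_lstm_lm_1150',
                                                  dataset_name='wikitext-2',
                                                  window=2000, theta=0.6, lambdas=0.2,
                                                  pretrained=True, ctx=mx.cpu())
    # cache_model is a CacheCell (see above); its forward pass interpolates the cache
    # distribution with the base language model's word-level distribution.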