gluonnlp.vocab¶
This page describes the gluonnlp.Vocab class for text data numericalization and the subword functionality provided in gluonnlp.vocab.
Vocabulary¶
The vocabulary builds indices for text tokens and can be attached with token embeddings. The input counter, whose keys are the candidate tokens, may be obtained via gluonnlp.data.count_tokens().
Vocab – Indexing and embedding attachment for text tokens.
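As a quick illustration, the following minimal sketch builds a vocabulary from a token counter; the toy token list is made up for the example.

>>> import gluonnlp
>>> # Count token frequencies in a toy corpus.
>>> tokens = ['hello', 'world', 'hello', 'nice', 'world', 'hi', 'world']
>>> counter = gluonnlp.data.count_tokens(tokens)
>>> # Build the vocabulary; special tokens such as '<unk>' and '<pad>' are added automatically.
>>> vocab = gluonnlp.Vocab(counter)
>>> # Look up indices for tokens; an unknown token maps to the index of '<unk>'.
>>> vocab[['hello', 'world']]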
Subword functionality¶
When using a vocabulary of fixed size, out-of-vocabulary words may be encountered. However, words are composed of characters, allowing intelligent fallbacks for out-of-vocabulary words based on subword units such as the characters or n-grams in a word. gluonnlp.vocab.SubwordFunction provides an API to map words to their subword units (a short usage sketch follows the summary below). gluonnlp.model.train contains models that make use of subword information to compute word embeddings.
SubwordFunction – A SubwordFunction maps words to lists of subword indices.
ByteSubwords – Map words to a list of bytes.
NGramHashes – Map words to a list of hashes in a restricted domain.
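As a hedged sketch of this API, the following lists the registered subword functions and applies one to a few words; that 'ByteSubwords' is the registered name is an assumption based on the class name.

>>> import gluonnlp
>>> # List the names of all registered subword functions.
>>> gluonnlp.vocab.list_subword_functions()
>>> # Create a subword function by name; keyword arguments are forwarded to the class.
>>> subword_fn = gluonnlp.vocab.create_subword_function('ByteSubwords')
>>> # A SubwordFunction is callable: it maps each word to a list of subword indices.
>>> subword_fn(['hello', 'world'])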
ELMo Character-level Vocabulary¶
In the original ELMo pre-trained models, the character-level vocabulary relies on UTF-8 encoding in a specific setting. We provide the following vocabulary class to stay consistent with the ELMo pre-trained models.
ELMoCharVocab – ELMo special character vocabulary.
BERT Vocabulary¶
The vocabulary for BERT, inherited from gluonnlp.Vocab, provides some additional special tokens for ease of use.
BERTVocab – Specialization of gluonnlp.Vocab for BERT models.
API Reference¶
-
class gluonnlp.Vocab(counter=None, max_size=None, min_freq=1, unknown_token='<unk>', deprecated_padding_token=<object object>, deprecated_bos_token=<object object>, deprecated_eos_token=<object object>, reserved_tokens=None, token_to_idx=None, *, padding_token='<pad>', bos_token='<bos>', eos_token='<eos>', **kwargs)[source]¶
Indexing and embedding attachment for text tokens.
- Parameters
counter (Optional[Counter]) – Counts text token frequencies in the text data. Its keys will be indexed according to frequency thresholds such as max_size and min_freq. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.
max_size (Optional[int]) – The maximum possible number of the most frequent tokens in the keys of counter that can be indexed. Note that this argument does not count any token from reserved_tokens. If several keys of counter have the same frequency and indexing all of them would exceed this value, they are indexed one by one according to their __cmp__() order until the threshold is met. If this argument is None or larger than its largest possible value restricted by counter and reserved_tokens, it has no effect.
min_freq (int) – The minimum frequency required for a token in the keys of counter to be indexed.
unknown_token (Hashable) – The representation for any unknown token. If unknown_token is not None, looking up any token that is not part of the vocabulary and thus considered unknown will return the index of unknown_token. If None, looking up an unknown token will result in KeyError.
reserved_tokens (Optional[List[Hashable]]) – A list specifying additional tokens to be added to the vocabulary. reserved_tokens must not contain the value of unknown_token or duplicate tokens, nor may it contain special tokens specified via keyword arguments.
token_to_idx (Optional[Dict[Hashable, int]]) – If not None, specifies the indices of tokens to be used by the vocabulary. Each token in token_to_idx must be part of the Vocab, and each index can only be associated with a single token. token_to_idx is not required to contain a mapping for all tokens. For example, it is valid to only set the unknown_token index to 10 (instead of the default of 0) with token_to_idx = {'<unk>': 10}, assuming that there are at least 10 tokens in the vocabulary.
**kwargs – Keyword arguments of the format xxx_token can be used to specify further special tokens that will be exposed as an attribute of the vocabulary and associated with an index. For example, passing mask_token='<mask>' as an additional keyword argument when constructing a vocabulary v leads to v.mask_token exposing the value of the special token: '<mask>'. If the specified token is not part of the vocabulary, it will be added, just as if it had been listed in the reserved_tokens argument. The specified tokens are listed together with reserved tokens in the reserved_tokens attribute of the vocabulary object.
deprecated_padding_token (Hashable) – The representation for the padding token. Default: '<pad>'. Specifying padding_token as a positional argument is deprecated and support will be removed; specify it as a keyword argument instead (see the documentation of **kwargs above).
deprecated_bos_token (Hashable) – The representation for the beginning-of-sequence token. Default: '<bos>'. Specifying bos_token as a positional argument is deprecated and support will be removed; specify it as a keyword argument instead (see the documentation of **kwargs above).
deprecated_eos_token (Hashable) – The representation for the end-of-sequence token. Default: '<eos>'. Specifying eos_token as a positional argument is deprecated and support will be removed; specify it as a keyword argument instead (see the documentation of **kwargs above).
- Variables
embedding (instance of gluonnlp.embedding.TokenEmbedding) – The embedding of the indexed tokens.
idx_to_token (list of strs) – A list of indexed tokens where the list indices and the token indices are aligned.
reserved_tokens (list of strs or None) – A list of reserved tokens that will always be indexed.
token_to_idx (dict mapping str to int) – A dict mapping each token to its index integer.
unknown_token (hashable object or None) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
padding_token (hashable object or None) – The representation for the padding token.
bos_token (hashable object or None) – The representation for the beginning-of-sentence token.
eos_token (hashable object or None) – The representation for the end-of-sentence token.
Examples
>>> text_data = ['hello', 'world', 'hello', 'nice', 'world', 'hi', 'world']
>>> counter = gluonnlp.data.count_tokens(text_data)
>>> my_vocab = gluonnlp.Vocab(counter)
>>> fasttext = gluonnlp.embedding.create('fasttext', source='wiki.simple')
-etc-
>>> my_vocab.set_embedding(fasttext)
>>> my_vocab.embedding[['hello', 'world']][:, :5]
[[ 0.39567   0.21454  -0.035389 -0.24299  -0.095645]
 [ 0.10444  -0.10858   0.27212   0.13299  -0.33165 ]]
<NDArray 2x5 @cpu(0)>
>>> my_vocab[['hello', 'world']]
[5, 4]
>>> input_dim, output_dim = my_vocab.embedding.idx_to_vec.shape
>>> layer = gluon.nn.Embedding(input_dim, output_dim)
>>> layer.initialize()
>>> layer.weight.set_data(my_vocab.embedding.idx_to_vec)
>>> layer(mx.nd.array([5, 4]))[:, :5]
[[ 0.39567   0.21454  -0.035389 -0.24299  -0.095645]
 [ 0.10444  -0.10858   0.27212   0.13299  -0.33165 ]]
<NDArray 2x5 @cpu(0)>
>>> glove = gluonnlp.embedding.create('glove', source='glove.6B.50d')
-etc-
>>> my_vocab.set_embedding(glove)
>>> my_vocab.embedding[['hello', 'world']][:, :5]
[[-0.38497   0.80092   0.064106 -0.28355  -0.026759]
 [-0.41486   0.71848  -0.3045    0.87445   0.22441 ]]
<NDArray 2x5 @cpu(0)>
Extra keyword arguments of the format xxx_token are used to expose specified tokens as attributes.
>>> my_vocab2 = gluonnlp.Vocab(counter, special_token='hi')
>>> my_vocab2.special_token
'hi'
With the token_to_idx argument the order of the Vocab’s index can be adapted. For example, Vocab assigns the index 0 to the unknown_token by default. With the token_to_idx argument, the default can be overwritten. Here we assign index 3 to the unknown token representation <unk>.
>>> tok2idx = {'<unk>': 3}
>>> my_vocab3 = gluonnlp.Vocab(counter, token_to_idx=tok2idx)
>>> my_vocab3.unknown_token
'<unk>'
>>> my_vocab3[my_vocab3.unknown_token]
3
>>> my_vocab[my_vocab.unknown_token]
0
-
set_embedding(*embeddings)[source]¶
Attaches one or more embeddings to the indexed text tokens.
- Parameters
embeddings (None or tuple of gluonnlp.embedding.TokenEmbedding instances) – The embedding(s) to be attached to the indexed tokens. If a tuple of multiple embeddings is provided, their embedding vectors are concatenated for the same token.
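Because multiple embeddings are concatenated per token, attaching two sources yields vectors whose dimensionality is the sum of both. A hedged sketch, reusing the fastText and GloVe sources from the examples above (300 + 50 dimensions, assuming the standard sizes of these sources):

>>> counter = gluonnlp.data.count_tokens(['hello', 'world', 'hello'])
>>> vocab = gluonnlp.Vocab(counter)
>>> fasttext = gluonnlp.embedding.create('fasttext', source='wiki.simple')
>>> glove = gluonnlp.embedding.create('glove', source='glove.6B.50d')
>>> # Each token's fastText and GloVe vectors are concatenated into one row.
>>> vocab.set_embedding(fasttext, glove)
>>> vocab.embedding.idx_to_vec.shape  # second dimension expected to be 300 + 50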
-
class gluonnlp.vocab.SubwordFunction[source]¶
A SubwordFunction maps words to lists of subword indices.
This class is abstract and is meant to be subclassed. Use gluonnlp.vocab.list_subword_functions to list all available subword functions.
A SubwordFunction object is callable and returns a list of ndarrays of subword indices for the given words in a call.
-
class gluonnlp.vocab.ByteSubwords(encoding='utf-8')[source]¶
Map words to a list of bytes.
- Parameters
encoding (str, default 'utf-8') – Encoding to use for obtaining bytes.
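For illustration, a minimal sketch of calling ByteSubwords; the byte values in the comment follow from UTF-8 encoding of 'hello'.

>>> from gluonnlp.vocab import ByteSubwords
>>> subwords = ByteSubwords(encoding='utf-8')
>>> # Each word is mapped to the list of its encoded byte values;
>>> # 'hello' encodes to [104, 101, 108, 108, 111] under UTF-8.
>>> subwords(['hello'])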
-
class gluonnlp.vocab.NGramHashes(num_subwords, ngrams=(3, 4, 5, 6), special_tokens=None)[source]¶
Map words to a list of hashes in a restricted domain.
The hash function is the same as in https://github.com/facebookresearch/fastText
- Parameters
num_subwords (int) – Size of the target set for the hash function.
ngrams (list of int, default [3, 4, 5, 6]) – The n-gram lengths for which to compute hashes.
special_tokens (set of str, default None) – Set of words for which not to look up subwords.
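A hedged usage sketch; the exact hash values depend on the fastText hashing scheme, so no output is shown.

>>> from gluonnlp.vocab import NGramHashes
>>> # Hash all 3- to 6-character n-grams of each word into a domain of 500000 buckets.
>>> ngram_hashes = NGramHashes(num_subwords=500000, ngrams=(3, 4, 5, 6))
>>> # Returns one array of subword hash indices per input word.
>>> ngram_hashes(['hello'])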
-
gluonnlp.vocab.create_subword_function(subword_function_name, **kwargs)[source]¶
Creates an instance of a subword function.
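For example, the following creates a subword function by its registered name; that the name matches the class name 'NGramHashes' is an assumption.

>>> from gluonnlp.vocab import create_subword_function
>>> # Equivalent to constructing NGramHashes directly; kwargs are forwarded.
>>> subword_fn = create_subword_function('NGramHashes', num_subwords=500000)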
-
class gluonnlp.vocab.ELMoCharVocab(bos_token='<bos>', eos_token='<eos>')[source]¶
ELMo special character vocabulary.
The vocab aims to map individual tokens to sequences of character ids, compatible with ELMo. To be consistent with previously trained models, we include it here.
Specifically, char ids 0-255 come from UTF-8 encoding bytes. Ids 256 and above are reserved for special tokens.
- Parameters
bos_token (hashable object or None, default '<bos>') – The representation for the beginning-of-sentence token.
eos_token (hashable object or None, default '<eos>') – The representation for the end-of-sentence token.
- Variables
max_word_length (50) – The maximum number of characters in a word is 50 in ELMo.
bos_id (256) – The index of the beginning-of-sentence character is 256 in ELMo.
eos_id (257) – The index of the end-of-sentence character is 257 in ELMo.
bow_id (258) – The index of the beginning-of-word character is 258 in ELMo.
eow_id (259) – The index of the end-of-word character is 259 in ELMo.
pad_id (260) – The index of the padding character is 260 in ELMo.
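As a hedged sketch of the lookup behavior: indexing the vocab with a token is expected to yield a fixed-length sequence of character ids built from the word's UTF-8 bytes, framed by bow_id/eow_id and padded with pad_id; the exact id layout (including any offset applied for model compatibility) should be checked against the implementation.

>>> from gluonnlp.vocab import ELMoCharVocab
>>> vocab = ELMoCharVocab()
>>> # Each token maps to max_word_length (50) character ids.
>>> char_ids = vocab['hello']
>>> len(char_ids)  # expected: 50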
-
class gluonnlp.vocab.BERTVocab(counter=None, max_size=None, min_freq=1, unknown_token='[UNK]', padding_token='[PAD]', bos_token=None, eos_token=None, mask_token='[MASK]', sep_token='[SEP]', cls_token='[CLS]', reserved_tokens=None, token_to_idx=None)[source]¶
Specialization of gluonnlp.Vocab for BERT models.
BERTVocab changes the default token representations of the unknown and other special tokens of gluonnlp.Vocab and adds convenience parameters to specify the mask, sep and cls tokens typically used by BERT models.
- Parameters
counter (Counter or None, default None) – Counts text token frequencies in the text data. Its keys will be indexed according to frequency thresholds such as max_size and min_freq. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.
max_size (None or int, default None) – The maximum possible number of the most frequent tokens in the keys of counter that can be indexed. Note that this argument does not count any token from reserved_tokens. If several keys of counter have the same frequency and indexing all of them would exceed this value, they are indexed one by one according to their __cmp__() order until the threshold is met. If this argument is None or larger than its largest possible value restricted by counter and reserved_tokens, it has no effect.
min_freq (int, default 1) – The minimum frequency required for a token in the keys of counter to be indexed.
unknown_token (hashable object or None, default '[UNK]') – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation. If None, looking up an unknown token will result in KeyError.
padding_token (hashable object or None, default '[PAD]') – The representation for the special token of padding token.
bos_token (hashable object or None, default None) – The representation for the special token of beginning-of-sequence token.
eos_token (hashable object or None, default None) – The representation for the special token of end-of-sequence token.
mask_token (hashable object or None, default '[MASK]') – The representation for the special token of mask token for BERT.
sep_token (hashable object or None, default '[SEP]') – A token used to separate sentence pairs for BERT.
cls_token (hashable object or None, default '[CLS]') – Classification symbol for BERT.
reserved_tokens (list of hashable objects or None, default None) – A list specifying additional tokens to be added to the vocabulary. reserved_tokens cannot contain unknown_token or duplicate reserved tokens. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples of hashable types are str, int, and tuple.
token_to_idx (dict mapping tokens (hashable objects) to int or None, default None) – Optionally specifies the indices of tokens to be used by the vocabulary. Each token in token_to_idx must be part of the Vocab, and each index can only be associated with a single token. token_to_idx is not required to contain a mapping for all tokens. For example, it is valid to only set the unknown_token index to 10 (instead of the default of 0) with token_to_idx = {'[UNK]': 10}.
- Variables
embedding (instance of gluonnlp.embedding.TokenEmbedding) – The embedding of the indexed tokens.
idx_to_token (list of strs) – A list of indexed tokens where the list indices and the token indices are aligned.
reserved_tokens (list of strs or None) – A list of reserved tokens that will always be indexed.
token_to_idx (dict mapping str to int) – A dict mapping each token to its index integer.
unknown_token (hashable object or None, default '[UNK]') – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
padding_token (hashable object or None, default '[PAD]') – The representation for padding token.
bos_token (hashable object or None, default None) – The representation for beginning-of-sentence token.
eos_token (hashable object or None, default None) – The representation for end-of-sentence token.
mask_token (hashable object or None, default '[MASK]') – The representation for the special token of mask token for BERT.
sep_token (hashable object or None, default '[SEP]') – A token used to separate sentence pairs for BERT.
cls_token (hashable object or None, default '[CLS]') – Classification symbol for BERT.
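To illustrate the defaults, a minimal sketch constructing a BERTVocab from a counter and reading back its special tokens:

>>> import gluonnlp
>>> from gluonnlp.vocab import BERTVocab
>>> counter = gluonnlp.data.count_tokens(['hello', 'world', 'hello'])
>>> vocab = BERTVocab(counter)
>>> # BERT-style defaults replace the plain Vocab defaults.
>>> vocab.unknown_token
'[UNK]'
>>> vocab.mask_token, vocab.sep_token, vocab.cls_token
('[MASK]', '[SEP]', '[CLS]')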
-
classmethod from_sentencepiece(path, mask_token='[MASK]', sep_token='[SEP]', cls_token='[CLS]', unknown_token=None, padding_token=None, bos_token=None, eos_token=None, reserved_tokens=None)[source]¶
BERTVocab from a pre-trained sentencepiece tokenizer.
- Parameters
path (str) – Path to the pre-trained subword tokenization model.
mask_token (hashable object or None, default '[MASK]') – The representation for the special token of mask token for BERT.
sep_token (hashable object or None, default '[SEP]') – A token used to separate sentence pairs for BERT.
cls_token (hashable object or None, default '[CLS]') – Classification symbol for BERT.
unknown_token (hashable object or None, default None) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation. If set to None, it is set to the token corresponding to the unk_id() in the loaded sentencepiece model.
padding_token (hashable object or None, default None) – The representation for the padding token.
bos_token (hashable object or None, default None) – The representation for the beginning-of-sentence token. If set to None, it is set to the token corresponding to the bos_id() in the loaded sentencepiece model.
eos_token (hashable object or None, default None) – The representation for the end-of-sentence token. If set to None, it is set to the token corresponding to the eos_id() in the loaded sentencepiece model.
reserved_tokens (list of strs or None, optional) – A list of reserved tokens that will always be indexed.
- Returns
A vocabulary instance built from the sentencepiece model.
- Return type
BERTVocab
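A hedged usage sketch; the model path below is hypothetical.

>>> from gluonnlp.vocab import BERTVocab
>>> # Load a vocabulary from a pre-trained sentencepiece model file
>>> # ('tokenizer.model' is a hypothetical path for illustration).
>>> vocab = BERTVocab.from_sentencepiece('tokenizer.model')
>>> vocab.cls_token, vocab.sep_token
('[CLS]', '[SEP]')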