Vocabulary and Embedding API
This note shows how to build a vocabulary by indexing tokens and how to use pre-trained word embeddings.
All the code in this document assumes that the following modules and packages are imported.
>>> from mxnet import gluon, nd
>>> import gluonnlp as nlp
Indexing words and using pre-trained word embeddings
As a common use case, let us index words, attach pre-trained word
embeddings to them, and use these embeddings in mxnet.gluon in just a few
lines of code.
To begin with, suppose that we have a simple tokenized data set as a list of strings. We can count the word frequencies in the data set.
>>> text_data = ['hello', 'world', 'hello', 'nice', 'world', 'hi', 'world']
>>> counter = nlp.data.count_tokens(text_data)
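For the toy data above, the returned Counter behaves like a standard collections.Counter, so individual frequencies can be checked directly (a quick illustration, not part of the original example):
>>> counter['world']
3
>>> counter['hello']
2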
The returned Counter stores key-value pairs whose keys are words and
whose values are word frequencies, which allows us to filter out infrequent
words. To build indices for all the keys in the Counter, we create a
Vocab instance with the Counter as its argument.
>>> my_vocab = nlp.Vocab(counter)
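By default, Vocab reserves special tokens (the unknown, padding, and begin- and end-of-sequence tokens) before the data tokens, and indexes the data tokens by decreasing frequency. A quick sanity check of what to expect, assuming the four default special tokens are reserved (the indices match the lookups shown further below):
>>> len(my_vocab)
8
>>> my_vocab['world'], my_vocab['hello']
(4, 5)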
To attach word embeddings to the indexed words in my_vocab, let us
create a fastText word embedding instance by specifying the embedding
name fasttext and the pre-trained file (source) name wiki.simple.
>>> fasttext = nlp.embedding.create('fasttext', source='wiki.simple')
This automatically downloads the corresponding embedding file from a public repository;
by default, the file is stored under ~/.mxnet/embedding/.
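If you are unsure which pre-trained files are available for an embedding, the available source names can be listed with nlp.embedding.list_sources; a small check (the exact list depends on the installed gluonnlp version):
>>> sources = nlp.embedding.list_sources('fasttext')
>>> 'wiki.simple' in sources
True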
Next, we attach the word embedding fasttext to the indexed words in
my_vocab.
>>> my_vocab.set_embedding(fasttext)
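After attaching, every token in my_vocab has a row in the embedding matrix. A quick sanity check, assuming the 300-dimensional wiki.simple vectors and the 8-token vocabulary built above:
>>> my_vocab.embedding.idx_to_vec.shape
(8, 300)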
Now we are ready to access the fastText word embedding vectors for
indexed words, such as ‘hello’ and ‘world’.
>>> my_vocab.embedding[['hello', 'world']]
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>
To demonstrate how to use pre-trained word embeddings with mxnet.gluon models,
let us first obtain the indices of the words ‘hello’ and ‘world’.
>>> my_vocab[['hello', 'world']]
[5, 4]
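The mapping also works in the reverse direction via Vocab.to_tokens:
>>> my_vocab.to_tokens([5, 4])
['hello', 'world']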
We can obtain the vector representations of the words ‘hello’ and
‘world’ by feeding their indices (5 and 4) to an mxnet.gluon.nn.Embedding
layer whose weight is set to the matrix my_vocab.embedding.idx_to_vec.
>>> input_dim, output_dim = my_vocab.embedding.idx_to_vec.shape
>>> layer = gluon.nn.Embedding(input_dim, output_dim)
>>> layer.initialize()
>>> layer.weight.set_data(my_vocab.embedding.idx_to_vec)
>>> layer(nd.array([5, 4]))
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>
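In a gluon model, this embedding layer is typically just the first block. A minimal sketch of wiring it into a tiny network (the Dense head and the optional freezing of the pre-trained weights via grad_req are illustrative choices, not part of the original example):
>>> net = gluon.nn.Sequential()
>>> with net.name_scope():
...     net.add(layer)              # pre-trained embedding lookup, initialized above
...     net.add(gluon.nn.Dense(2))  # e.g. a small two-class classifier head
>>> net[1].initialize()             # only the Dense block still needs initialization
>>> layer.weight.grad_req = 'null'  # optional: keep the pre-trained vectors fixed
>>> net(nd.array([[5, 4]])).shape   # one sequence of two word indices
(1, 2)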