Vocabulary and Embedding API
This note illustrates how to write simple code to index tokens to form a vocabulary, and to utilize pre-trained word embeddings.
All the code demonstrated in this document assumes that the following modules or packages are imported.
>>> from mxnet import gluon, nd
>>> import gluonnlp as nlp
Indexing words and using pre-trained word embeddings
As a common use case, let us index words, attach pre-trained word embeddings to them, and use such embeddings in mxnet.gluon, all in just a few lines of code.
To begin with, suppose that we have a simple text data set in the form of a list of string tokens. We can count word frequencies in the data set.
>>> text_data = ['hello', 'world', 'hello', 'nice', 'world', 'hi', 'world']
>>> counter = nlp.data.count_tokens(text_data)
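The returned counter behaves like a standard collections.Counter, so we can inspect individual counts directly. For example, on this toy data set we would expect:

>>> counter['world']
3
>>> counter['hello']
2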
The obtained Counter has key-value pairs whose keys are words and whose values are word frequencies. This allows us, for example, to filter out infrequent words. Suppose that we want to build indices for all the keys in the Counter. We can construct a Vocab instance by passing the Counter as its argument.
>>> my_vocab = nlp.Vocab(counter)
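By default, Vocab reserves the first few indices for special tokens (such as the unknown and padding tokens) and then assigns indices to the counted words in order of decreasing frequency. As a quick sanity check, assuming the default special tokens, we would expect something like:

>>> len(my_vocab)
8
>>> my_vocab['world']
4
>>> my_vocab.idx_to_token[4]
'world'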
To attach word embeddings to the indexed words in my_vocab, let us go on to create a fastText word embedding instance by specifying the embedding name fasttext and the pre-trained file name wiki.simple.
>>> fasttext = nlp.embedding.create('fasttext', source='wiki.simple')
This automatically downloads the corresponding embedding file from a public repository; by default, the file is stored under ~/.mxnet/embedding/.
Next, we can attach the word embedding fasttext to the indexed words in my_vocab.
>>> my_vocab.set_embedding(fasttext)
Now we are ready to access the fastText word embedding vectors for the indexed words, such as ‘hello’ and ‘world’.
>>> my_vocab.embedding[['hello', 'world']]
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>
To demonstrate how to use pre-trained word embeddings with mxnet.gluon models, let us first obtain the indices of the words ‘hello’ and ‘world’.
>>> my_vocab[['hello', 'world']]
[5, 4]
We can obtain the vector representations for the words ‘hello’ and ‘world’ by feeding their indices (5 and 4) into an mxnet.gluon.nn.Embedding layer whose weight is set to the matrix my_vocab.embedding.idx_to_vec.
>>> input_dim, output_dim = my_vocab.embedding.idx_to_vec.shape
>>> layer = gluon.nn.Embedding(input_dim, output_dim)
>>> layer.initialize()
>>> layer.weight.set_data(my_vocab.embedding.idx_to_vec)
>>> layer(nd.array([5, 4]))
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>
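In a full model, one often wants to keep such pre-trained embeddings fixed during training. A minimal sketch of how this could be done with the Gluon parameter API (an optional step, shown here only as an illustration) is:

>>> layer.weight.grad_req = 'null'  # skip gradient computation so the pre-trained weights stay fixed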