gluonnlp.embedding¶
GluonNLP Toolkit provides tools for working with embeddings.
This page describes the gluonnlp APIs for text embedding: loading pre-trained embedding vectors for text tokens, storing them in the mxnet.ndarray.NDArray format, and utilities for the intrinsic evaluation of text embeddings.
Pre-trained Embeddings¶
register – Registers a new token embedding.
create – Creates an instance of token embedding.
list_sources – Get valid token embedding names and their pre-trained file names.
TokenEmbedding – Token embedding base class.
GloVe – The GloVe word embedding.
FastText – The fastText word embedding.
Word2Vec – The Word2Vec word embedding.
Intrinsic evaluation¶
register – Registers a new word embedding evaluation function.
create – Creates an instance of a registered word embedding evaluation function.
list_evaluation_functions – Get valid word embedding evaluation function names.
WordEmbeddingSimilarityFunction – Base class for word embedding similarity functions.
WordEmbeddingAnalogyFunction – Base class for word embedding analogy functions.
CosineSimilarity – Computes the cosine similarity.
ThreeCosAdd – The 3CosAdd analogy function.
ThreeCosMul – The 3CosMul analogy function.
WordEmbeddingSimilarity – Word embeddings similarity task evaluator.
WordEmbeddingAnalogy – Word embeddings analogy task evaluator.
API Reference¶
Word embeddings.
gluonnlp.embedding.register(embedding_cls)[source]¶
Registers a new token embedding.
Once an embedding is registered, we can create an instance of this embedding with create().
Examples
>>> @gluonnlp.embedding.register
... class MyTextEmbed(gluonnlp.embedding.TokenEmbedding):
...     def __init__(self, source='my_pretrain_file'):
...         pass
>>> embed = gluonnlp.embedding.create('MyTextEmbed')
>>> print(type(embed))
<class 'gluonnlp.embedding.token_embedding.MyTextEmbed'>
gluonnlp.embedding.create(embedding_name, **kwargs)[source]¶
Creates an instance of token embedding.
Creates a token embedding instance by loading embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText. To get all the valid embedding_name and source, use gluonnlp.embedding.list_sources().
- Parameters
embedding_name (str) – The token embedding name (case-insensitive).
kwargs (dict) – All other keyword arguments are passed to the initializer of the token embedding class. For example, create(embedding_name='fasttext', source='wiki.simple', load_ngrams=True) will return FastText(source='wiki.simple', load_ngrams=True) (see the usage sketch below).
- Returns
A token embedding instance that loads embedding vectors from an externally hosted pre-trained token embedding file.
- Return type
An instance of gluonnlp.embedding.TokenEmbedding
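A minimal usage sketch (illustrative; the pre-trained file is downloaded and cached on first use, and the shape shown assumes the 300-dimensional wiki.simple vectors):
>>> import gluonnlp as nlp
>>> embedding = nlp.embedding.create('fasttext', source='wiki.simple')
>>> embedding['hello'].shape  # a 1-D vector for a single token
(300,)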
gluonnlp.embedding.list_sources(embedding_name=None)[source]¶
Get valid token embedding names and their pre-trained file names.
To load token embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, one should use gluonnlp.embedding.create(embedding_name, source). This method returns all the valid names of source for the specified embedding_name. If embedding_name is set to None, this method returns all the valid names of embedding_name with their associated source.
- Parameters
embedding_name (str or None, default None) – The pre-trained token embedding name.
- Returns
A list of all the valid pre-trained token embedding file names (source) for the specified token embedding name (embedding_name). If the text embedding name is set to None, returns a dict mapping each valid token embedding name to a list of valid pre-trained files (source). They can be plugged into gluonnlp.embedding.create(embedding_name, source).
- Return type
list of strs, or dict mapping str to list of strs
class gluonnlp.embedding.TokenEmbedding(unknown_token='<unk>', init_unknown_vec=<function zeros>, allow_extend=False, unknown_lookup=None, idx_to_token=None, idx_to_vec=None)[source]¶
Token embedding base class.
To load token embeddings from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, use gluonnlp.embedding.create(). To get all the available embedding_name and source, use gluonnlp.embedding.list_sources().
Alternatively, to load embedding vectors from a custom pre-trained token embedding file, use gluonnlp.embedding.TokenEmbedding.from_file().
If unknown_token is None, looking up unknown tokens results in a KeyError. Otherwise, for every unknown token, if its representation self.unknown_token is encountered in the pre-trained token embedding file, index 0 of self.idx_to_vec maps to the pre-trained token embedding vector loaded from the file; otherwise, index 0 of self.idx_to_vec maps to the token embedding vector initialized by init_unknown_vec.
If a token is encountered multiple times in the pre-trained token embedding file, only the first-encountered token embedding vector will be loaded and the rest will be skipped.
- Parameters
unknown_token (hashable object or None, default '<unk>') – Any unknown token will be replaced by unknown_token and consequently will be indexed as the same representation.
init_unknown_vec (callback, default nd.zeros) – The callback used to initialize the embedding vector for the unknown token. Only used if unknown_token is not None and idx_to_token is not None and does not contain unknown_token.
allow_extend (bool, default False) – If True, embedding vectors for previously unknown words can be added via token_embedding[tokens] = vecs. If False, only vectors for known tokens can be updated.
unknown_lookup (object subscriptable with list of tokens returning nd.NDarray, default None) – If not None, the TokenEmbedding obtains embeddings for unknown tokens automatically from unknown_lookup[unknown_tokens]. For example, in a FastText model, embeddings for unknown tokens can be computed from the subword information.
idx_to_token (list of str or None, default None) – If not None, a list of tokens for which the idx_to_vec argument provides embeddings. The list indices and the indices of idx_to_vec must be aligned. If idx_to_token is not None, idx_to_vec must not be None either. If idx_to_token is None, an empty TokenEmbedding object is created. If allow_extend is True, tokens and their embeddings can be added to the TokenEmbedding at a later stage.
idx_to_vec (mxnet.ndarray.NDArray or None, default None) – If not None, a NDArray containing embeddings for the tokens specified in idx_to_token. The first dimension of idx_to_vec must be aligned with idx_to_token. If idx_to_vec is not None, idx_to_token must not be None either. If idx_to_vec is None, an empty TokenEmbedding object is created. If allow_extend is True, tokens and their embeddings can be added to the TokenEmbedding at a later stage. No copy of the idx_to_vec array is made as long as unknown_token is None or an embedding for unknown_token is specified in idx_to_vec.
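A minimal sketch of constructing a TokenEmbedding directly from a token list and a vector matrix (unknown_token=None keeps the example simple; looking up unknown tokens then raises a KeyError):
>>> import mxnet as mx
>>> import gluonnlp as nlp
>>> embed = nlp.embedding.TokenEmbedding(
...     idx_to_token=['hello', 'world'],
...     idx_to_vec=mx.nd.array([[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]]),
...     unknown_token=None)
>>> embed['hello'].shape             # a single token gives a 1-D vector
(3,)
>>> embed[['hello', 'world']].shape  # a list of tokens gives a 2-D matrix
(2, 3)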
__getitem__(tokens)[source]¶
Looks up embedding vectors of text tokens.
- Parameters
tokens (str or list of strs) – A token or a list of tokens.
- Returns
The embedding vector(s) of the token(s). According to numpy conventions, if tokens is a string, returns a 1-D NDArray (vector); if tokens is a list of strings, returns a 2-D NDArray (matrix) of shape=(len(tokens), vec_len).
- Return type
mxnet.ndarray.NDArray
__setitem__(tokens, new_embedding)[source]¶
Updates embedding vectors for tokens.
If self.allow_extend is True, vectors for previously unknown tokens can be introduced.
- Parameters
tokens (hashable object or a list or tuple of hashable objects) – A token or a list of tokens whose embedding vectors are to be updated.
new_embedding (mxnet.ndarray.NDArray) – An NDArray to be assigned to the embedding vectors of tokens. Its length must be equal to the number of tokens and its width must be equal to the embedding dimension of the glossary. If tokens is a singleton, it must be 1-D or 2-D. If tokens is a list of multiple strings, it must be 2-D.
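A minimal sketch of extending and updating vectors through __setitem__, assuming an empty TokenEmbedding created with allow_extend=True can be filled this way (as described above):
>>> import mxnet as mx
>>> import gluonnlp as nlp
>>> embed = nlp.embedding.TokenEmbedding(unknown_token=None, allow_extend=True)
>>> embed['hello'] = mx.nd.array([0.1, 0.2, 0.3])  # introduces a previously unknown token
>>> embed['hello'] = mx.nd.array([0.5, 0.5, 0.5])  # updates the existing vector
>>> embed['hello'].shape
(3,)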
property allow_extend¶
Allow extension of the TokenEmbedding with new tokens.
If True, TokenEmbedding[tokens] = vec can introduce new tokens that were previously unknown. New indices will be assigned to the newly introduced tokens. If False, only known tokens can be updated.
- Returns
Extension of the TokenEmbedding is allowed.
- Return type
bool
static deserialize(file_path, **kwargs)[source]¶
Create a new TokenEmbedding from a serialized one.
TokenEmbedding is serialized by converting the list of tokens, the array of word embeddings and other metadata to numpy arrays, saving all in a single (optionally compressed) Zipfile. See https://docs.scipy.org/doc/numpy-1.14.2/neps/npy-format.html for more information on the format.
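A minimal round-trip sketch (the file name is hypothetical; serialize writes a Numpy zip file that deserialize loads back):
>>> import mxnet as mx
>>> import gluonnlp as nlp
>>> embed = nlp.embedding.TokenEmbedding(
...     idx_to_token=['hello', 'world'],
...     idx_to_vec=mx.nd.array([[0.1, 0.2], [0.3, 0.4]]),
...     unknown_token=None)
>>> embed.serialize('my_embedding.npz')
>>> restored = nlp.embedding.TokenEmbedding.deserialize('my_embedding.npz')
>>> restored['world'].shape
(2,)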
static from_file(file_path, elem_delim=' ', encoding='utf8', **kwargs)[source]¶
Creates a user-defined token embedding from a pre-trained embedding file.
This is to load embedding vectors from a user-defined pre-trained token embedding file. For example, if elem_delim = ' ', the expected format of a custom pre-trained token embedding file may look like:
'hello 0.1 0.2 0.3 0.4 0.5\nworld 1.1 1.2 1.3 1.4 1.5\n'
where embedding vectors of words hello and world are [0.1, 0.2, 0.3, 0.4, 0.5] and [1.1, 1.2, 1.3, 1.4, 1.5] respectively.
- Parameters
file_path (str) – The path to the user-defined pre-trained token embedding file.
elem_delim (str, default ' ') – The delimiter for splitting a token and every embedding vector element value on the same line of the custom pre-trained token embedding file.
encoding (str, default 'utf8') – The encoding scheme for reading the custom pre-trained token embedding file.
kwargs (dict) – All other keyword arguments are passed to the TokenEmbedding initializer.
- Returns
The user-defined token embedding instance.
- Return type
instance of gluonnlp.embedding.TokenEmbedding
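A minimal sketch using the file format described above (the file name is hypothetical):
>>> import gluonnlp as nlp
>>> with open('my_pretrain_file.txt', 'w', encoding='utf8') as f:
...     _ = f.write('hello 0.1 0.2 0.3 0.4 0.5\nworld 1.1 1.2 1.3 1.4 1.5\n')
>>> embed = nlp.embedding.TokenEmbedding.from_file('my_pretrain_file.txt')
>>> embed['world'].shape
(5,)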
property idx_to_token¶
Index to token mapping.
- Returns
A list of indexed tokens where the list indices and the token indices are aligned.
- Return type
list of str
property idx_to_vec¶
Index to vector mapping.
- Returns
For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
- Return type
mxnet.ndarray.NDArray
serialize(file_path, compress=True)[source]¶
Serializes the TokenEmbedding to a file specified by file_path.
TokenEmbedding is serialized by converting the list of tokens, the array of word embeddings and other metadata to numpy arrays, saving all in a single (optionally compressed) Zipfile. See https://docs.scipy.org/doc/numpy-1.14.2/neps/npy-format.html for more information on the format.
- Parameters
file_path (str) – The path to the file where the TokenEmbedding is serialized.
compress (bool, default True) – Whether to compress the saved Zipfile.
property token_to_idx¶
Token to index mapping.
- Returns
A dictionary of tokens with their corresponding index numbers; inverse vocab.
- Return type
dict of str to int
property unknown_lookup¶
Vector lookup for unknown tokens.
If not None, unknown_lookup[tokens] is automatically called for any unknown tokens.
class gluonnlp.embedding.GloVe(source='glove.6B.50d', embedding_root='$MXNET_HOME/embedding', **kwargs)[source]¶
The GloVe word embedding.
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. (Source from https://nlp.stanford.edu/projects/glove/)
Reference:
GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. https://nlp.stanford.edu/pubs/glove.pdf
Website: https://nlp.stanford.edu/projects/glove/
To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://nlp.stanford.edu/projects/glove/
License for pre-trained embedding: https://opendatacommons.org/licenses/pddl/
Available sources
>>> import gluonnlp as nlp
>>> sorted(nlp.embedding.list_sources('GloVe'))
['glove.42B.300d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d', 'glove.6B.50d', 'glove.840B.300d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d']
- Parameters
source (str, default 'glove.6B.50d') – The name of the pre-trained token embedding file.
embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to ‘~/.mxnet’.
kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.
- Variables
idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
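A minimal usage sketch (the vectors are downloaded and cached on first use; the shapes follow from the 50-dimensional source):
>>> import gluonnlp as nlp
>>> glove = nlp.embedding.GloVe(source='glove.6B.50d')
>>> glove['beautiful'].shape
(50,)
>>> glove[['hello', 'world']].shape
(2, 50)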
class gluonnlp.embedding.FastText(source='wiki.simple', embedding_root='$MXNET_HOME/embedding', load_ngrams=False, ctx=cpu(0), **kwargs)[source]¶
The fastText word embedding.
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. (Source from https://fasttext.cc/)
References:
Enriching Word Vectors with Subword Information. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. https://arxiv.org/abs/1607.04606
Bag of Tricks for Efficient Text Classification. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. https://arxiv.org/abs/1607.01759
FastText.zip: Compressing text classification models. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, and Tomas Mikolov. https://arxiv.org/abs/1612.03651
For the 'wiki.multi' embeddings: Word Translation Without Parallel Data. Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. https://arxiv.org/abs/1710.04087
Website: https://fasttext.cc/
To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md
License for pre-trained embedding: https://creativecommons.org/licenses/by-sa/3.0/
Available sources
>>> import gluonnlp as nlp >>> sorted(nlp.embedding.list_sources('FastText')) ['cc.af.300', 'cc.als.300', 'cc.am.300', 'cc.an.300', 'cc.ar.300', 'cc.arz.300', 'cc.as.300', 'cc.ast.300', 'cc.az.300', 'cc.azb.300', 'cc.ba.300', 'cc.bar.300', 'cc.bcl.300', 'cc.be.300', 'cc.bg.300', 'cc.bh.300', 'cc.bn.300', 'cc.bo.300', 'cc.bpy.300', 'cc.br.300', 'cc.bs.300', 'cc.ca.300', 'cc.ce.300', 'cc.ceb.300', 'cc.ckb.300', 'cc.co.300', 'cc.cs.300', 'cc.cv.300', 'cc.cy.300', 'cc.da.300', 'cc.de.300', 'cc.diq.300', 'cc.dv.300', 'cc.el.300', 'cc.eml.300', 'cc.en.300', 'cc.eo.300', 'cc.es.300', 'cc.et.300', 'cc.eu.300', 'cc.fa.300', 'cc.fi.300', 'cc.fr.300', 'cc.frr.300', 'cc.fy.300', 'cc.ga.300', 'cc.gd.300', 'cc.gl.300', 'cc.gom.300', 'cc.gu.300', 'cc.gv.300', 'cc.he.300', 'cc.hi.300', 'cc.hif.300', 'cc.hr.300', 'cc.hsb.300', 'cc.ht.300', 'cc.hu.300', 'cc.hy.300', 'cc.ia.300', 'cc.id.300', 'cc.ilo.300', 'cc.io.300', 'cc.is.300', 'cc.it.300', 'cc.ja.300', 'cc.jv.300', 'cc.ka.300', 'cc.kk.300', 'cc.km.300', 'cc.kn.300', 'cc.ko.300', 'cc.ku.300', 'cc.ky.300', 'cc.la.300', 'cc.lb.300', 'cc.li.300', 'cc.lmo.300', 'cc.lt.300', 'cc.lv.300', 'cc.mai.300', 'cc.mg.300', 'cc.mhr.300', 'cc.min.300', 'cc.mk.300', 'cc.ml.300', 'cc.mn.300', 'cc.mr.300', 'cc.mrj.300', 'cc.ms.300', 'cc.mt.300', 'cc.mwl.300', 'cc.my.300', 'cc.myv.300', 'cc.mzn.300', 'cc.nah.300', 'cc.nap.300', 'cc.nds.300', 'cc.ne.300', 'cc.new.300', 'cc.nl.300', 'cc.nn.300', 'cc.no.300', 'cc.nso.300', 'cc.oc.300', 'cc.or.300', 'cc.os.300', 'cc.pa.300', 'cc.pam.300', 'cc.pfl.300', 'cc.pl.300', 'cc.pms.300', 'cc.pnb.300', 'cc.ps.300', 'cc.pt.300', 'cc.qu.300', 'cc.rm.300', 'cc.ro.300', 'cc.ru.300', 'cc.sa.300', 'cc.sah.300', 'cc.sc.300', 'cc.scn.300', 'cc.sco.300', 'cc.sd.300', 'cc.sh.300', 'cc.si.300', 'cc.sk.300', 'cc.sl.300', 'cc.so.300', 'cc.sq.300', 'cc.sr.300', 'cc.su.300', 'cc.sv.300', 'cc.sw.300', 'cc.ta.300', 'cc.te.300', 'cc.tg.300', 'cc.th.300', 'cc.tk.300', 'cc.tl.300', 'cc.tr.300', 'cc.tt.300', 'cc.ug.300', 'cc.uk.300', 'cc.ur.300', 'cc.uz.300', 'cc.vec.300', 'cc.vi.300', 'cc.vls.300', 'cc.vo.300', 'cc.wa.300', 'cc.war.300', 'cc.xmf.300', 'cc.yi.300', 'cc.yo.300', 'cc.zea.300', 'cc.zh.300', 'crawl-300d-2M', 'crawl-300d-2M-subword', 'wiki-news-300d-1M', 'wiki-news-300d-1M-subword', 'wiki.aa', 'wiki.ab', 'wiki.ace', 'wiki.ady', 'wiki.af', 'wiki.ak', 'wiki.als', 'wiki.am', 'wiki.an', 'wiki.ang', 'wiki.ar', 'wiki.arc', 'wiki.arz', 'wiki.as', 'wiki.ast', 'wiki.av', 'wiki.ay', 'wiki.az', 'wiki.azb', 'wiki.ba', 'wiki.bar', 'wiki.bat_smg', 'wiki.bcl', 'wiki.be', 'wiki.bg', 'wiki.bh', 'wiki.bi', 'wiki.bjn', 'wiki.bm', 'wiki.bn', 'wiki.bo', 'wiki.bpy', 'wiki.br', 'wiki.bs', 'wiki.bug', 'wiki.bxr', 'wiki.ca', 'wiki.cbk_zam', 'wiki.cdo', 'wiki.ce', 'wiki.ceb', 'wiki.ch', 'wiki.cho', 'wiki.chr', 'wiki.chy', 'wiki.ckb', 'wiki.co', 'wiki.cr', 'wiki.crh', 'wiki.cs', 'wiki.csb', 'wiki.cu', 'wiki.cv', 'wiki.cy', 'wiki.da', 'wiki.de', 'wiki.diq', 'wiki.dsb', 'wiki.dv', 'wiki.dz', 'wiki.ee', 'wiki.el', 'wiki.eml', 'wiki.en', 'wiki.eo', 'wiki.es', 'wiki.et', 'wiki.eu', 'wiki.ext', 'wiki.fa', 'wiki.ff', 'wiki.fi', 'wiki.fiu_vro', 'wiki.fj', 'wiki.fo', 'wiki.fr', 'wiki.frp', 'wiki.frr', 'wiki.fur', 'wiki.fy', 'wiki.ga', 'wiki.gag', 'wiki.gan', 'wiki.gd', 'wiki.gl', 'wiki.glk', 'wiki.gn', 'wiki.gom', 'wiki.got', 'wiki.gu', 'wiki.gv', 'wiki.ha', 'wiki.hak', 'wiki.haw', 'wiki.he', 'wiki.hi', 'wiki.hif', 'wiki.ho', 'wiki.hr', 'wiki.hsb', 'wiki.ht', 'wiki.hu', 'wiki.hy', 'wiki.hz', 'wiki.ia', 'wiki.id', 'wiki.ie', 'wiki.ig', 'wiki.ii', 'wiki.ik', 'wiki.ilo', 'wiki.io', 
'wiki.is', 'wiki.it', 'wiki.iu', 'wiki.ja', 'wiki.jam', 'wiki.jbo', 'wiki.jv', 'wiki.ka', 'wiki.kaa', 'wiki.kab', 'wiki.kbd', 'wiki.kg', 'wiki.ki', 'wiki.kj', 'wiki.kk', 'wiki.kl', 'wiki.km', 'wiki.kn', 'wiki.ko', 'wiki.koi', 'wiki.kr', 'wiki.krc', 'wiki.ks', 'wiki.ksh', 'wiki.ku', 'wiki.kv', 'wiki.kw', 'wiki.ky', 'wiki.la', 'wiki.lad', 'wiki.lb', 'wiki.lbe', 'wiki.lez', 'wiki.lg', 'wiki.li', 'wiki.lij', 'wiki.lmo', 'wiki.ln', 'wiki.lo', 'wiki.lrc', 'wiki.lt', 'wiki.ltg', 'wiki.lv', 'wiki.mai', 'wiki.map_bms', 'wiki.mdf', 'wiki.mg', 'wiki.mh', 'wiki.mhr', 'wiki.mi', 'wiki.min', 'wiki.mk', 'wiki.ml', 'wiki.mn', 'wiki.mo', 'wiki.mr', 'wiki.mrj', 'wiki.ms', 'wiki.mt', 'wiki.multi.ar', 'wiki.multi.bg', 'wiki.multi.ca', 'wiki.multi.cs', 'wiki.multi.da', 'wiki.multi.de', 'wiki.multi.el', 'wiki.multi.en', 'wiki.multi.es', 'wiki.multi.et', 'wiki.multi.fi', 'wiki.multi.fr', 'wiki.multi.he', 'wiki.multi.hr', 'wiki.multi.hu', 'wiki.multi.id', 'wiki.multi.it', 'wiki.multi.mk', 'wiki.multi.nl', 'wiki.multi.no', 'wiki.multi.pl', 'wiki.multi.pt', 'wiki.multi.ro', 'wiki.multi.ru', 'wiki.multi.sk', 'wiki.multi.sl', 'wiki.multi.sv', 'wiki.multi.tr', 'wiki.multi.uk', 'wiki.multi.vi', 'wiki.mus', 'wiki.mwl', 'wiki.my', 'wiki.myv', 'wiki.mzn', 'wiki.na', 'wiki.nah', 'wiki.nap', 'wiki.nds', 'wiki.nds_nl', 'wiki.ne', 'wiki.new', 'wiki.ng', 'wiki.nl', 'wiki.nn', 'wiki.no', 'wiki.nov', 'wiki.nrm', 'wiki.nso', 'wiki.nv', 'wiki.ny', 'wiki.oc', 'wiki.olo', 'wiki.om', 'wiki.or', 'wiki.os', 'wiki.pa', 'wiki.pag', 'wiki.pam', 'wiki.pap', 'wiki.pcd', 'wiki.pdc', 'wiki.pfl', 'wiki.pi', 'wiki.pih', 'wiki.pl', 'wiki.pms', 'wiki.pnb', 'wiki.pnt', 'wiki.ps', 'wiki.pt', 'wiki.qu', 'wiki.rm', 'wiki.rmy', 'wiki.rn', 'wiki.ro', 'wiki.roa_rup', 'wiki.roa_tara', 'wiki.ru', 'wiki.rue', 'wiki.rw', 'wiki.sa', 'wiki.sah', 'wiki.sc', 'wiki.scn', 'wiki.sco', 'wiki.sd', 'wiki.se', 'wiki.sg', 'wiki.sh', 'wiki.si', 'wiki.simple', 'wiki.sk', 'wiki.sl', 'wiki.sm', 'wiki.sn', 'wiki.so', 'wiki.sq', 'wiki.sr', 'wiki.srn', 'wiki.ss', 'wiki.st', 'wiki.stq', 'wiki.su', 'wiki.sv', 'wiki.sw', 'wiki.szl', 'wiki.ta', 'wiki.tcy', 'wiki.te', 'wiki.tet', 'wiki.tg', 'wiki.th', 'wiki.ti', 'wiki.tk', 'wiki.tl', 'wiki.tn', 'wiki.to', 'wiki.tpi', 'wiki.tr', 'wiki.ts', 'wiki.tt', 'wiki.tum', 'wiki.tw', 'wiki.ty', 'wiki.tyv', 'wiki.udm', 'wiki.ug', 'wiki.uk', 'wiki.ur', 'wiki.uz', 'wiki.ve', 'wiki.vec', 'wiki.vep', 'wiki.vi', 'wiki.vls', 'wiki.vo', 'wiki.wa', 'wiki.war', 'wiki.wo', 'wiki.wuu', 'wiki.xal', 'wiki.xh', 'wiki.xmf', 'wiki.yi', 'wiki.yo', 'wiki.za', 'wiki.zea', 'wiki.zh', 'wiki.zh_classical', 'wiki.zh_min_nan', 'wiki.zh_yue', 'wiki.zu']
- Parameters
source (str, default 'wiki.simple') – The name of the pre-trained token embedding file.
embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to ‘~/.mxnet’.
load_ngrams (bool, default False) – Load vectors for ngrams so that computing vectors for OOV words is possible. This is disabled by default as it requires downloading an additional 2GB file containing the vectors for ngrams. Note that facebookresearch did not publish ngram vectors for all their models. If load_ngrams is True but no ngram vectors are available for the chosen source, a RuntimeError is raised. The ngram vectors are passed to the resulting TokenEmbedding as unknown_lookup (see the usage sketch below).
ctx (mx.Context, default mxnet.cpu()) – The context in which to load the FasttextEmbeddingModel used for ngram vectors. This parameter is ignored if load_ngrams is False.
kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.
- Variables
idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
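A minimal usage sketch with load_ngrams=True, so vectors for out-of-vocabulary words are computed from subword ngrams via unknown_lookup (files are downloaded and cached on first use; the shapes assume the 300-dimensional wiki.simple vectors):
>>> import gluonnlp as nlp
>>> fasttext = nlp.embedding.FastText(source='wiki.simple', load_ngrams=True)
>>> fasttext['hello'].shape      # known token
(300,)
>>> fasttext['hellooooo'].shape  # out-of-vocabulary token, composed from subword ngrams
(300,)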
class gluonnlp.embedding.Word2Vec(source='GoogleNews-vectors-negative300', embedding_root='$MXNET_HOME/embedding', encoding='utf8', **kwargs)[source]¶
The Word2Vec word embedding.
Word2Vec is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed with continuous bag-of-words or skip-gram architecture for computing vector representations of words.
References:
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
Website: https://code.google.com/archive/p/word2vec/
License for pre-trained embedding: Unspecified
Available sources
>>> import gluonnlp as nlp
>>> sorted(nlp.embedding.list_sources('Word2Vec'))
['GoogleNews-vectors-negative300', 'freebase-vectors-skipgram1000', 'freebase-vectors-skipgram1000-en']
- Parameters
source (str, default 'GoogleNews-vectors-negative300') – The name of the pre-trained token embedding file. A binary pre-trained file outside the source list can also be used with this constructor by passing its path, which must end with the .bin file extension (see the usage sketch below).
embedding_root (str, default '$MXNET_HOME/embedding') – The root directory for storing embedding-related files. MXNET_HOME defaults to ‘~/.mxnet’.
kwargs – All other keyword arguments are passed to gluonnlp.embedding.TokenEmbedding.
- Variables
idx_to_vec (mxnet.ndarray.NDArray) – For all the indexed tokens in this embedding, this NDArray maps each token’s index to an embedding vector.
unknown_token (hashable object) – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
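A minimal usage sketch (the GoogleNews vectors are 300-dimensional; a local path ending in .bin could be passed as source instead, as noted above):
>>> import gluonnlp as nlp
>>> w2v = nlp.embedding.Word2Vec(source='GoogleNews-vectors-negative300')
>>> w2v['king'].shape
(300,)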
Models for intrinsic and extrinsic word embedding evaluation
gluonnlp.embedding.evaluation.register(class_)[source]¶
Registers a new word embedding evaluation function.
Once registered, we can create an instance with create().
Examples
>>> @gluonnlp.embedding.evaluation.register
... class MySimilarityFunction(gluonnlp.embedding.evaluation.WordEmbeddingSimilarityFunction):
...     def __init__(self, eps=1e-10):
...         pass
>>> similarity_function = gluonnlp.embedding.evaluation.create('similarity',
...                                                            'MySimilarityFunction')
>>> print(type(similarity_function))
<class 'gluonnlp.embedding.evaluation.MySimilarityFunction'>
>>> @gluonnlp.embedding.evaluation.register
... class MyAnalogyFunction(gluonnlp.embedding.evaluation.WordEmbeddingAnalogyFunction):
...     def __init__(self, k=1, eps=1E-10):
...         pass
>>> analogy_function = gluonnlp.embedding.evaluation.create('analogy', 'MyAnalogyFunction')
>>> print(type(analogy_function))
<class 'gluonnlp.embedding.evaluation.MyAnalogyFunction'>
gluonnlp.embedding.evaluation.create(kind, name, **kwargs)[source]¶
Creates an instance of a registered word embedding evaluation function.
- Parameters
kind (['similarity', 'analogy']) – The kind of evaluation function to create: 'similarity' or 'analogy'.
name (str) – The evaluation function name (case-insensitive).
- Returns
An instance of gluonnlp.embedding.evaluation.WordEmbeddingAnalogyFunction or gluonnlp.embedding.evaluation.WordEmbeddingSimilarityFunction – An instance of the specified evaluation function.
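A minimal sketch creating one of the registered functions by kind and name:
>>> import gluonnlp as nlp
>>> sim_fn = nlp.embedding.evaluation.create('similarity', 'CosineSimilarity')
>>> print(type(sim_fn))
<class 'gluonnlp.embedding.evaluation.CosineSimilarity'>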
gluonnlp.embedding.evaluation.list_evaluation_functions(kind=None)[source]¶
Get valid word embedding evaluation function names.
- Parameters
kind (['similarity', 'analogy', None]) – Return only valid names for similarity, analogy or both kinds of functions.
- Returns
A list of all the valid evaluation function names for the specified kind. If kind is set to None, returns a dict mapping each valid name to its respective output list. The valid names can be plugged in gluonnlp.model.word_evaluation_model.create(name).
- Return type
dict or list
class gluonnlp.embedding.evaluation.WordEmbeddingSimilarityFunction(prefix=None, params=None)[source]¶
Base class for word embedding similarity functions.
class gluonnlp.embedding.evaluation.WordEmbeddingAnalogyFunction(prefix=None, params=None)[source]¶
Base class for word embedding analogy functions.
- Parameters
idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
k (int, default 1) – Number of analogies to predict per input triple.
eps (float, optional, default=1e-10) – A small constant for numerical stability.
class gluonnlp.embedding.evaluation.CosineSimilarity(eps=1e-10, **kwargs)[source]¶
Computes the cosine similarity.
- Parameters
eps (float, optional, default=1e-10) – A small constant for numerical stability.
hybrid_forward(F, x, y)[source]¶
Compute the cosine similarity between two batches of vectors.
The cosine similarity is the dot product between the L2 normalized vectors.
- Parameters
x (Symbol or NDArray) –
y (Symbol or NDArray) –
- Returns
similarity – The similarity computed by WordEmbeddingSimilarity.similarity_function.
- Return type
Symbol or NDArray
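A minimal sketch (outputs are indicative): the block compares two batches of vectors elementwise, returning one similarity per row:
>>> import mxnet as mx
>>> import gluonnlp as nlp
>>> cos = nlp.embedding.evaluation.CosineSimilarity()
>>> cos.initialize()
>>> x = mx.nd.array([[1.0, 0.0], [1.0, 0.0]])
>>> y = mx.nd.array([[1.0, 0.0], [0.0, 1.0]])
>>> sims = cos(x, y)  # approximately [1., 0.]: identical pair, then orthogonal pair
>>> sims.shape
(2,)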
class gluonnlp.embedding.evaluation.ThreeCosMul(idx_to_vec, k=1, eps=1e-10, exclude_question_words=True, **kwargs)[source]¶
The 3CosMul analogy function.
The 3CosMul analogy function is defined as
\[\arg\max_{b^* \in V}\frac{\cos(b^*, b) \cos(b^*, a)}{\cos(b^*, a^*) + \epsilon}\]
See the following paper for more details:
Levy, O., & Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In R. Morante, & W. Yih, Proceedings of the Eighteenth Conference on Computational Natural Language Learning, CoNLL 2014, Baltimore, Maryland, USA, June 26-27, 2014 (pp. 171–180). : ACL.
- Parameters
idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
k (int, default 1) – Number of analogies to predict per input triple.
exclude_question_words (bool, default True) – Exclude the 3 question words from being a valid answer.
eps (float, optional, default=1e-10) – A small constant for numerical stability.
hybrid_forward(F, words1, words2, words3, weight)[source]¶
Compute ThreeCosMul for given question words.
- Parameters
words1 (Symbol or NDArray) – Question words at first position. Shape (batch_size, )
words2 (Symbol or NDArray) – Question words at second position. Shape (batch_size, )
words3 (Symbol or NDArray) – Question words at third position. Shape (batch_size, )
- Returns
Predicted answer words. Shape (batch_size, k).
- Return type
Symbol or NDArray
class gluonnlp.embedding.evaluation.ThreeCosAdd(idx_to_vec, normalize=True, k=1, eps=1e-10, exclude_question_words=True, **kwargs)[source]¶
The 3CosAdd analogy function.
The 3CosAdd analogy function is defined as
\[\arg\max_{b^* \in V}[\cos(b^*, b - a + a^*)]\]
See the following paper for more details:
Levy, O., & Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In R. Morante, & W. Yih, Proceedings of the Eighteenth Conference on Computational Natural Language Learning, CoNLL 2014, Baltimore, Maryland, USA, June 26-27, 2014 (pp. 171–180). : ACL.
- Parameters
idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
normalize (bool, default True) – Normalize all word embeddings before computing the analogy.
k (int, default 1) – Number of analogies to predict per input triple.
exclude_question_words (bool, default True) – Exclude the 3 question words from being a valid answer.
eps (float, optional, default=1e-10) – A small constant for numerical stability.
hybrid_forward(F, words1, words2, words3, weight)[source]¶
Compute ThreeCosAdd for given question words.
- Parameters
words1 (Symbol or NDArray) – Question words at first position. Shape (batch_size, )
words2 (Symbol or NDArray) – Question words at second position. Shape (batch_size, )
words3 (Symbol or NDArray) – Question words at third position. Shape (batch_size, )
- Returns
Predicted answer words. Shape (batch_size, k).
- Return type
Symbol or NDArray
class gluonnlp.embedding.evaluation.WordEmbeddingSimilarity(idx_to_vec, similarity_function='CosineSimilarity', eps=1e-10, **kwargs)[source]¶
Word embeddings similarity task evaluator.
- Parameters
idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
similarity_function (str, default 'CosineSimilarity') – Name of a registered WordEmbeddingSimilarityFunction.
eps (float, optional, default=1e-10) – A small constant for numerical stability.
hybrid_forward(F, words1, words2, weight)[source]¶
Predict the similarity of words1 and words2.
- Parameters
words1 (Symbol or NDArray) – The indices of the words we wish to compare to the words in words2.
words2 (Symbol or NDArray) – The indices of the words we wish to compare to the words in words1.
- Returns
similarity – The similarity computed by WordEmbeddingSimilarity.similarity_function.
- Return type
Symbol or NDArray
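A minimal sketch scoring two word pairs with pre-trained GloVe vectors (downloaded on first use; word indices are obtained from token_to_idx, and the output shape is indicative):
>>> import mxnet as mx
>>> import gluonnlp as nlp
>>> glove = nlp.embedding.GloVe(source='glove.6B.50d')
>>> evaluator = nlp.embedding.evaluation.WordEmbeddingSimilarity(
...     idx_to_vec=glove.idx_to_vec, similarity_function='CosineSimilarity')
>>> evaluator.initialize()
>>> words1 = mx.nd.array([glove.token_to_idx['cat'], glove.token_to_idx['car']])
>>> words2 = mx.nd.array([glove.token_to_idx['dog'], glove.token_to_idx['truck']])
>>> evaluator(words1, words2).shape  # one similarity score per pair
(2,)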
class gluonnlp.embedding.evaluation.WordEmbeddingAnalogy(idx_to_vec, analogy_function='ThreeCosMul', k=1, exclude_question_words=True, **kwargs)[source]¶
Word embeddings analogy task evaluator.
- Parameters
idx_to_vec (mxnet.ndarray.NDArray) – Embedding matrix.
analogy_function (str, default 'ThreeCosMul') – Name of a registered WordEmbeddingAnalogyFunction.
k (int, default 1) – Number of analogies to predict per input triple.
exclude_question_words (bool, default True) – Exclude the 3 question words from being a valid answer.
hybrid_forward(F, words1, words2, words3)[source]¶
Compute analogies for given question words.
- Parameters
words1 (Symbol or NDArray) – Word indices of first question words. Shape (batch_size, ).
words2 (Symbol or NDArray) – Word indices of second question words. Shape (batch_size, ).
words3 (Symbol or NDArray) – Word indices of third question words. Shape (batch_size, ).
- Returns
predicted_indices – Indices of predicted analogies of shape (batch_size, k)
- Return type
Symbol or NDArray
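A minimal sketch answering "man is to king as woman is to ?" with 3CosMul over pre-trained GloVe vectors (downloaded on first use; the predicted answer is typically 'queen', though not guaranteed):
>>> import mxnet as mx
>>> import gluonnlp as nlp
>>> glove = nlp.embedding.GloVe(source='glove.6B.50d')
>>> evaluator = nlp.embedding.evaluation.WordEmbeddingAnalogy(
...     idx_to_vec=glove.idx_to_vec, analogy_function='ThreeCosMul', k=1)
>>> evaluator.initialize()
>>> words1 = mx.nd.array([glove.token_to_idx['man']])
>>> words2 = mx.nd.array([glove.token_to_idx['king']])
>>> words3 = mx.nd.array([glove.token_to_idx['woman']])
>>> pred = evaluator(words1, words2, words3)  # indices of predicted answers, shape (1, k)
>>> answer = glove.idx_to_token[int(pred[0][0].asscalar())]  # typically 'queen'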