Evaluating Pre-trained Word Embeddings

Word embeddings can be evaluated on intrinsic and extrinsic tasks. gluonnlp facilitates working with both by providing common datasets and helpful abstractions. In this notebook we show how to evaluate embeddings on the intrinsic similarity and analogy tasks.

The GloVe and fastText word embeddings used in this tutorial are from the following sources:

Let us first import the following packages.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import mxnet as mx
import gluonnlp as nlp

Intrinsic evaluation

In industry, word embeddings are mainly interesting for the performance gains they bring to downstream tasks, but evaluating directly on those tasks can be expensive and infeasible when experimenting with a large number of embeddings. Evaluation of word embeddings on such downstream tasks is called extrinsic evaluation.

Intrinsic evaluation tasks, in contrast, aim to judge the quality of word embeddings directly.

Word Similarity and Relatedness Task

Word embeddings should capture the relationship between words in natural language. In the Word Similarity and Relatedness Task, word embeddings are evaluated by comparing word similarity scores computed for a pair of words with human labels for the similarity or relatedness of that pair.
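To make the scoring concrete, here is a minimal sketch of the cosine similarity that is typically used as the model's predicted score for a word pair. It uses plain numpy and two made-up 4-dimensional vectors for illustration, not the actual pre-trained embeddings loaded later in this notebook:

import numpy as np

# Hypothetical low-dimensional vectors for the pair ("computer", "keyboard");
# real pre-trained embeddings typically have 50-300 dimensions.
vec_computer = np.array([0.2, -0.5, 0.1, 0.7])
vec_keyboard = np.array([0.3, -0.4, 0.0, 0.6])

# Cosine similarity: dot product of the L2-normalized vectors.
cos_sim = vec_computer.dot(vec_keyboard) / (
    np.linalg.norm(vec_computer) * np.linalg.norm(vec_keyboard))
print(cos_sim)

The predicted scores for many pairs are then compared against the human annotations via rank correlation, as done further below.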

gluonnlp includes a number of common datasets for the Word Similarity and Relatedness Task. The included datasets are listed in the API documentation. We use several of them in the evaluation example below.

We first show a few samples from the WordSim353 dataset, to get a feeling for the dataset structure.

In [2]:
wordsim353 = nlp.data.WordSim353()
for i in range(15):
    print(*wordsim353[i])
computer keyboard 7.62
Jerusalem Israel 8.46
planet galaxy 8.11
canyon landscape 7.53
OPEC country 5.63
day summer 3.94
day dawn 7.53
country citizen 7.31
planet people 5.75
environment ecology 8.81
Maradona football 8.62
OPEC oil 8.59
money bank 8.5
computer software 8.5
law lawyer 8.38

Evaluation: Loading the embeddings

To evaluate word embeddings on the WordSim353 dataset, we first load pretrained embeddings and construct a vocabulary object. Here we load the fastText word embeddings created from the crawl-300d-2M source. As they are quite large, executing the following cell may take a minute or two.

In [3]:
embedding = nlp.embedding.create('fasttext', source='crawl-300d-2M')
In [4]:
counter = nlp.data.utils.Counter(w for wpair in wordsim353 for w in wpair[:2])
vocab = nlp.vocab.Vocab(counter)
vocab.set_embedding(embedding)

We then replace the words in the WordSim353 dataset with indices from the vocabulary.

In [5]:
wordsim353_coded = [[vocab[d[0]], vocab[d[1]], d[2]] for d in wordsim353]
words1, words2, scores = zip(*wordsim353_coded)

Evaluation: Running the task

The gluonnlp toolkit contains helpers for evaluating word embeddings on the word similarity and relatedness task.

In the following we create a WordEmbeddingSimilarity block, which predicts similarity scores for word pairs given an embedding matrix.

In [6]:
# context = mx.cpu()  # Replace this with mx.gpu(0) if you have a GPU
context = mx.gpu(0)  # Replace this with mx.cpu() if you do not have a GPU


evaluator = nlp.embedding.evaluation.WordEmbeddingSimilarity(
    idx_to_vec=vocab.embedding.idx_to_vec,
    similarity_function="CosineSimilarity")
evaluator.initialize(ctx=context)
evaluator.hybridize()

The similarities can be predicted by passing the two arrays of words through the evaluator. The i-th word in words1 will be compared with the i-th word in words2.

In [7]:
pred_similarity = evaluator(
    mx.nd.array(words1, ctx=context), mx.nd.array(words2, ctx=context))
print(pred_similarity[:10])

[0.4934404  0.69630307 0.5902223  0.31201977 0.16985895 0.3822252
 0.42938995 0.36722115 0.22559652 0.51560944]
<NDArray 10 @gpu(0)>

We can evaluate the predicted similarities, and thereby the word embeddings, by computing the Spearman rank correlation between the predicted similarities and the ground-truth human similarity scores from the dataset:

In [8]:
import numpy as np
from scipy import stats

sr = stats.spearmanr(pred_similarity.asnumpy(), np.array(scores))
print('Spearman rank correlation on {}: {}'.format(wordsim353.__class__.__name__,
                                                   sr.correlation.round(3)))
Spearman rank correlation on WordSim353: 0.792

Word Analogy Task

In the Word Analogy Task, word embeddings are evaluated by inferring an analogous word D that is related to a given word C in the same way as a given pair of words A, B are related to each other.
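As a rough illustration of how such a prediction can be made, here is a self-contained numpy sketch of the common 3CosAdd approach: D is chosen as the word whose vector is most cosine-similar to vec(B) - vec(A) + vec(C), excluding the three query words. The tiny embedding matrix below is made up for illustration and is not part of gluonnlp:

import numpy as np

# Tiny hypothetical embedding matrix: one 3-dimensional vector per word.
vocab_words = ['king', 'queen', 'man', 'woman']
emb = np.array([[0.9, 0.8, 0.1],   # king
                [0.9, 0.1, 0.8],   # queen
                [0.5, 0.9, 0.2],   # man
                [0.5, 0.2, 0.9]])  # woman
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

# Analogy question: man : woman :: king : ?
a, b, c = (vocab_words.index(w) for w in ('man', 'woman', 'king'))
query = emb[b] - emb[a] + emb[c]                  # 3CosAdd query vector
scores = emb.dot(query / np.linalg.norm(query))   # cosine similarity to all words
scores[[a, b, c]] = -np.inf                       # exclude the question words
print(vocab_words[int(scores.argmax())])          # -> 'queen'

The WordEmbeddingAnalogy evaluator used below applies this idea (and the related 3CosMul variant) over the full embedding matrix.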

gluonnlp includes a number of common datasets for the Word Analogy Task. The included datasets are listed in the API documentation. In this notebook we use the GoogleAnalogyTestSet dataset.

In [9]:
google_analogy = nlp.data.GoogleAnalogyTestSet()

We first demonstrate the structure of the dataset by printing a few examples.

In [10]:
sample = []
print(('Printing every 1000th analogy question '
       'from the {} questions '
       'in the Google Analogy Test Set:').format(len(google_analogy)))
print('')
for i in range(0, 19544, 1000):
    print(*google_analogy[i])
    sample.append(google_analogy[i])
Printing every 1000th analogy question from the 19544 questions in the Google Analogy Test Set:

athens greece baghdad iraq
baku azerbaijan dushanbe tajikistan
dublin ireland kathmandu nepal
lusaka zambia tehran iran
rome italy windhoek namibia
zagreb croatia astana kazakhstan
philadelphia pennsylvania tampa florida
wichita kansas shreveport louisiana
shreveport louisiana oxnard california
complete completely lucky luckily
comfortable uncomfortable clear unclear
good better high higher
young younger tight tighter
weak weakest bright brightest
slow slowing describe describing
ireland irish greece greek
feeding fed sitting sat
slowing slowed decreasing decreased
finger fingers onion onions
play plays sing sings
In [11]:
words1, words2, words3, words4 = list(zip(*sample))

We again construct a vocabulary object from the loaded pretrained embeddings. To speed up computation, we restrict ourselves here to the 300,000 most frequent words in the vocabulary.

In [12]:
counter = nlp.data.utils.Counter(embedding.idx_to_token[:300000])
vocab = nlp.vocab.Vocab(counter)
vocab.set_embedding(embedding)

We then discard all analogy questions that contain words not in the frequent-word subset selected above.

In [13]:
google_analogy_subset = [
    d for d in google_analogy if (d[0] in vocab and d[1] in vocab
    and d[2] in vocab and d[3] in vocab)
]
print('Dropped {} pairs from {} as they were OOV.'.format(
    len(google_analogy) - len(google_analogy_subset),
    len(google_analogy)))
Dropped 5108 pairs from 19544 as they were OOV.
In [14]:
google_analogy_coded = [[vocab[d[0]], vocab[d[1]], vocab[d[2]], vocab[d[3]]]
                 for d in google_analogy_subset]
google_analogy_coded_batched = mx.gluon.data.DataLoader(
    google_analogy_coded, batch_size=64)
In [15]:
evaluator = nlp.embedding.evaluation.WordEmbeddingAnalogy(
    idx_to_vec=vocab.embedding.idx_to_vec,
    exclude_question_words=True,
    analogy_function="ThreeCosMul")
evaluator.initialize(ctx=context)
evaluator.hybridize()

To show a visual progressbar, make sure the progressbar2 package is installed. You can remove the # from the cell below to install it.

In [16]:
#! pip install --user progressbar2
In [17]:
try:
    import progressbar
except ImportError:
    progressbar = None

acc = mx.metric.Accuracy()

if progressbar is not None:
    google_analogy_coded_batched = progressbar.progressbar(google_analogy_coded_batched)
for batch in google_analogy_coded_batched:
    batch = batch.as_in_context(context)
    words1, words2, words3, words4 = (batch[:, 0], batch[:, 1],
                                      batch[:, 2], batch[:, 3])
    pred_idxs = evaluator(words1, words2, words3)
    acc.update(pred_idxs[:, 0], words4.astype(np.float32))

print('Accuracy on %s: %s' % (google_analogy.__class__.__name__, acc.get()[1].round(3)))
100% (226 of 226) |######################| Elapsed Time: 0:00:32 Time:  0:00:32
Accuracy on GoogleAnalogyTestSet: 0.794

Aggregated Results on all datasets

We have precomputed the results on the similarity and analogy tasks on all respective datasets and all pretrained embeddings (targeted at English) included in the Gluon NLP toolkit. If you are interested in reproducing the results, please run the run_all.sh bash script in the scripts/word_embedding_evaluation folder. That folder also contains a notebook with extended, unaggregated results that detail the performance of the different embeddings on each category in the datasets.

We first load the CSV file containing the results and define a highlighter function that will help us highlight the best-performing method per dataset.

In [18]:
import pandas as pd
pd.options.display.max_rows = 999
pd.options.display.precision = 3

df = pd.read_table("../../../scripts/word_embedding_evaluation/results-vocablimit.csv",
                   header=None, names=[
                       "evaluation_type", "dataset", "kwargs", "embedding_name",
                       "embedding_source", "evaluation", "value", "num_samples"
                   ])

Similarity task

We then select the results from the similarity task and generate a table. To keep this page concise, we report the mean value over all datasets. Please see the extended results notebook at the Scripts page for detailed results.

In [19]:
dfs = df[~df["dataset"].isin(["BiggerAnalogyTestSet", "GoogleAnalogyTestSet"])].drop(["evaluation_type", "evaluation", "num_samples"], axis=1)
dfs = dfs[dfs["embedding_source"].isin([
    "glove.42B.300d",
    "glove.6B.100d",
    "glove.6B.200d",
    "glove.6B.300d",
    "glove.6B.50d",
    "glove.840B.300d",
    "glove.twitter.27B.100d",
    "glove.twitter.27B.200d",
    "glove.twitter.27B.25d",
    "glove.twitter.27B.50d",
    "wiki.en",
    "wiki.simple",
    "crawl-300d-2M",
    "wiki-news-300d-1M",
    "wiki-news-300d-1M-subword"
])]

dfs = dfs.groupby(["embedding_name", "embedding_source"]).mean()
dfs.sort_values(by='value', ascending=False)
Out[19]:
value
embedding_name embedding_source
fasttext crawl-300d-2M 0.690
wiki-news-300d-1M-subword 0.658
wiki-news-300d-1M 0.649
glove glove.840B.300d 0.629
fasttext wiki.en 0.569
glove glove.42B.300d 0.520
glove.6B.300d 0.518
glove.6B.200d 0.495
fasttext wiki.simple 0.476
glove glove.6B.100d 0.464
glove.6B.50d 0.432
glove.twitter.27B.200d 0.373
glove.twitter.27B.100d 0.356
glove.twitter.27B.50d 0.323
glove.twitter.27B.25d 0.253

Analogy task

For the analogy task, we report the aggregate results per category type in the datasets.

Note that the analogy task is an open-vocabulary task: given a query of 3 words, we ask the model to select a 4th word from the whole vocabulary. Different pre-trained embeddings have vocabularies of different sizes. In general, embeddings pretrained on more tokens (indicated by a larger number before the 'B' in the embedding source name) have larger vocabularies. While training on more tokens improves embedding quality, the larger vocabulary also makes the analogy task harder.

In this experiment, all results are reported after restricting the vocabulary to the 300k most frequent tokens. Questions containing out-of-vocabulary words are ignored.
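For reference, the full vocabulary size of a loaded pre-trained embedding can be inspected via its idx_to_token attribute; for example, for the fastText embedding loaded earlier in this notebook:

# Number of tokens in the pre-trained embedding's vocabulary
# (about 2 million for the crawl-300d-2M fastText source).
print(len(embedding.idx_to_token))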

Google Analogy Test Set

We first display the results on the Google Analogy Test Set.

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR).

The Google Analogy Test Set contains the following categories. All analogy questions per category follow the pattern specified by the category name. We group them into semantic and syntactic analogy questions.

In [20]:
import json
pd.Series(df[df["dataset"] == "GoogleAnalogyTestSet"]["kwargs"].unique()).apply(
    json.loads).apply(lambda x: x['category'])
Out[20]:
0        capital-common-countries
1                   capital-world
2                        currency
3                   city-in-state
4                          family
5       gram1-adjective-to-adverb
6                  gram2-opposite
7               gram3-comparative
8               gram4-superlative
9        gram5-present-participle
10    gram6-nationality-adjective
11               gram7-past-tense
12                   gram8-plural
13             gram9-plural-verbs
dtype: object

We first load the results from the output of the word_embedding_evaluation.py script.

In [21]:
dfa_google = df[df["dataset"] == "GoogleAnalogyTestSet"].drop(
    ["evaluation_type", "num_samples", "dataset"], axis=1)
dfa_google = dfa_google[dfa_google["embedding_source"].isin([
    "glove.42B.300d",
    "glove.6B.100d",
    "glove.6B.200d",
    "glove.6B.300d",
    "glove.6B.50d",
    "glove.840B.300d",
    "glove.twitter.27B.100d",
    "glove.twitter.27B.200d",
    "glove.twitter.27B.25d",
    "glove.twitter.27B.50d",
    "wiki.en",
    "wiki.simple",
    "crawl-300d-2M",
    "wiki-news-300d-1M",
    "wiki-news-300d-1M-subword",
])]
dfa_google["category"] = dfa_google["kwargs"].apply(json.loads).apply(lambda x: str(x['category']))
dfa_google.drop("kwargs", axis=1, inplace=True)

groups = dfa_google["category"].apply(lambda x: "syntactic" if x.startswith("gram") else "semantic")
dfa_google_aggregate = dfa_google.drop("category", axis=1)
dfa_google_aggregate["group"] = groups
google_aggregate = dfa_google_aggregate.groupby(["group", "embedding_name", "embedding_source", "evaluation"]).mean()
google_aggregate = google_aggregate.sort_values(by='value', ascending=False).sort_index(level=[0], sort_remaining=False)

Syntactic

We first present aggregate results over syntactic analogy questions.

In [22]:
google_aggregate.loc["syntactic"]
Out[22]:
value
embedding_name embedding_source evaluation
fasttext wiki-news-300d-1M-subword threecosmul 0.871
threecosadd 0.863
wiki-news-300d-1M threecosmul 0.809
threecosadd 0.794
crawl-300d-2M threecosmul 0.787
threecosadd 0.764
glove glove.840B.300d threecosmul 0.728
fasttext wiki.en threecosmul 0.724
glove glove.42B.300d threecosmul 0.702
fasttext wiki.en threecosadd 0.701
glove glove.840B.300d threecosadd 0.700
glove.42B.300d threecosadd 0.670
glove.6B.300d threecosmul 0.654
threecosadd 0.634
glove.6B.200d threecosadd 0.625
threecosmul 0.622
fasttext wiki.simple threecosmul 0.596
glove glove.6B.100d threecosadd 0.579
fasttext wiki.simple threecosadd 0.552
glove glove.6B.100d threecosmul 0.545
glove.twitter.27B.200d threecosmul 0.536
threecosadd 0.529
glove.twitter.27B.100d threecosadd 0.467
threecosmul 0.436
glove.6B.50d threecosadd 0.405
threecosmul 0.322
glove.twitter.27B.50d threecosadd 0.319
threecosmul 0.271
glove.twitter.27B.25d threecosadd 0.135
threecosmul 0.102

Semantic

We then present aggregate results over semantic analogy questions.

In [23]:
google_aggregate.loc["semantic"]
Out[23]:
value
embedding_name embedding_source evaluation
glove glove.42B.300d threecosmul 0.751
threecosadd 0.747
glove.6B.300d threecosmul 0.712
fasttext wiki.en threecosmul 0.711
glove glove.6B.300d threecosadd 0.708
fasttext wiki.en threecosadd 0.703
glove glove.6B.200d threecosadd 0.684
threecosmul 0.676
glove.6B.100d threecosadd 0.619
threecosmul 0.589
glove.840B.300d threecosmul 0.580
threecosadd 0.574
fasttext crawl-300d-2M threecosmul 0.569
threecosadd 0.560
glove glove.6B.50d threecosadd 0.481
glove.twitter.27B.200d threecosadd 0.439
threecosmul 0.427
fasttext wiki-news-300d-1M threecosmul 0.404
threecosadd 0.401
glove glove.6B.50d threecosmul 0.400
fasttext wiki-news-300d-1M-subword threecosmul 0.349
threecosadd 0.348
glove glove.twitter.27B.100d threecosadd 0.324
threecosmul 0.293
fasttext wiki.simple threecosmul 0.261
threecosadd 0.205
glove glove.twitter.27B.50d threecosadd 0.188
threecosmul 0.155
glove.twitter.27B.25d threecosadd 0.108
threecosmul 0.080

Bigger Analogy Test Set

We then display the results on the Bigger Analogy Test Set (BATS).

  • Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In Proceedings of the NAACL-HLT SRW (pp. 47–54). San Diego, California, June 12-17, 2016: ACL. Retrieved from https://www.aclweb.org/anthology/N/N16/N16-2002.pdf

Unlike the Google Analogy Test Set, BATS is balanced across 4 types of relations (inflectional morphology, derivational morphology, lexicographic semantics, encyclopedic semantics).

We first load the results for the BATS dataset:

In [24]:
dfa_bats = df[df["dataset"] == "BiggerAnalogyTestSet"].drop(
    ["evaluation_type", "num_samples", "dataset"], axis=1)
dfa_bats = dfa_bats[dfa_bats["embedding_source"].isin([
    "glove.42B.300d",
    "glove.6B.100d",
    "glove.6B.200d",
    "glove.6B.300d",
    "glove.6B.50d",
    "glove.840B.300d",
    "glove.twitter.27B.100d",
    "glove.twitter.27B.200d",
    "glove.twitter.27B.25d",
    "glove.twitter.27B.50d",
    "wiki.en",
    "wiki.simple",
    "crawl-300d-2M",
    "wiki-news-300d-1M",
    "wiki-news-300d-1M-subword",
])]
dfa_bats["category"] = dfa_bats["kwargs"].apply(json.loads).apply(lambda x: str(x['category']))
dfa_bats.drop("kwargs", axis=1, inplace=True)

groups = dfa_bats["category"].str[0].apply(lambda x: {
    'I':'Inflectional morphology',
    'D':'Derivational morphology',
    'L':'Lexicographic semantics',
    'E':'Encyclopedic semantics'}[x])
dfa_bats_aggregate = dfa_bats.drop("category", axis=1)
dfa_bats_aggregate["group"] = groups
bats_aggregate = dfa_bats_aggregate.groupby(
    ["group", "embedding_name", "embedding_source", "evaluation"]).mean()
bats_aggregate = bats_aggregate.sort_values(
    by='value', ascending=False).sort_index(level=[0], sort_remaining=False)

For BATS we present the results aggregated over all categories grouped by the respective 4 types of relations (inflectional morphology, derivational morphology, lexicographic semantics, encyclopedic semantics):

Inflectional morphology

In [25]:
bats_aggregate.loc["Inflectional morphology"]
Out[25]:
value
embedding_name embedding_source evaluation
fasttext wiki-news-300d-1M-subword threecosmul 0.923
threecosadd 0.917
wiki-news-300d-1M threecosmul 0.856
threecosadd 0.847
crawl-300d-2M threecosmul 0.835
threecosadd 0.799
glove glove.840B.300d threecosmul 0.768
threecosadd 0.760
glove.42B.300d threecosmul 0.674
fasttext wiki.en threecosmul 0.643
glove glove.42B.300d threecosadd 0.630
glove.6B.300d threecosmul 0.627
fasttext wiki.en threecosadd 0.601
glove glove.6B.200d threecosmul 0.598
glove.6B.300d threecosadd 0.593
glove.6B.200d threecosadd 0.591
glove.6B.100d threecosadd 0.574
threecosmul 0.552
fasttext wiki.simple threecosmul 0.494
threecosadd 0.433
glove glove.twitter.27B.200d threecosmul 0.431
threecosadd 0.425
glove.twitter.27B.100d threecosadd 0.394
glove.6B.50d threecosadd 0.391
glove.twitter.27B.100d threecosmul 0.362
glove.6B.50d threecosmul 0.311
glove.twitter.27B.50d threecosadd 0.282
threecosmul 0.232
glove.twitter.27B.25d threecosadd 0.135
threecosmul 0.098

Derivational morphology

In [26]:
bats_aggregate.loc["Derivational morphology"]
Out[26]:
value
embedding_name embedding_source evaluation
fasttext wiki-news-300d-1M-subword threecosmul 0.414
threecosadd 0.356
wiki-news-300d-1M threecosmul 0.307
crawl-300d-2M threecosmul 0.278
wiki.simple threecosmul 0.268
wiki-news-300d-1M threecosadd 0.248
wiki.simple threecosadd 0.228
wiki.en threecosmul 0.212
crawl-300d-2M threecosadd 0.193
wiki.en threecosadd 0.179
glove glove.42B.300d threecosmul 0.146
threecosadd 0.118
glove.6B.300d threecosmul 0.087
threecosadd 0.079
glove.6B.200d threecosadd 0.078
glove.6B.100d threecosadd 0.077
glove.6B.200d threecosmul 0.076
glove.6B.100d threecosmul 0.063
glove.6B.50d threecosadd 0.047
glove.twitter.27B.200d threecosadd 0.037
threecosmul 0.034
glove.twitter.27B.100d threecosadd 0.026
glove.6B.50d threecosmul 0.023
glove.twitter.27B.100d threecosmul 0.019
glove.twitter.27B.50d threecosadd 0.016
threecosmul 0.007
glove.twitter.27B.25d threecosadd 0.005
threecosmul 0.002

Lexicographic semantics

In [27]:
bats_aggregate.loc["Lexicographic semantics"]
Out[27]:
value
embedding_name embedding_source evaluation
fasttext wiki-news-300d-1M threecosadd 0.087
wiki-news-300d-1M-subword threecosadd 0.087
threecosmul 0.087
wiki-news-300d-1M threecosmul 0.087
crawl-300d-2M threecosmul 0.065
glove glove.6B.300d threecosadd 0.063
fasttext crawl-300d-2M threecosadd 0.062
glove glove.6B.200d threecosadd 0.061
glove.6B.100d threecosadd 0.059
glove.twitter.27B.200d threecosadd 0.056
glove.6B.300d threecosmul 0.051
fasttext wiki.en threecosadd 0.051
threecosmul 0.048
glove glove.twitter.27B.200d threecosmul 0.045
glove.6B.200d threecosmul 0.045
glove.twitter.27B.100d threecosadd 0.042
glove.6B.100d threecosmul 0.037
glove.6B.50d threecosadd 0.034
glove.twitter.27B.100d threecosmul 0.027
fasttext wiki.simple threecosadd 0.024
glove glove.twitter.27B.50d threecosadd 0.023
fasttext wiki.simple threecosmul 0.022
glove glove.twitter.27B.50d threecosmul 0.014
glove.6B.50d threecosmul 0.014
glove.twitter.27B.25d threecosadd 0.009
threecosmul 0.005

Encyclopedic semantics

In [28]:
bats_aggregate.loc["Encyclopedic semantics"]
Out[28]:
value
embedding_name embedding_source evaluation
glove glove.42B.300d threecosadd 0.272
fasttext wiki.en threecosmul 0.256
glove glove.42B.300d threecosmul 0.254
glove.6B.300d threecosadd 0.242
threecosmul 0.240
fasttext wiki.en threecosadd 0.236
glove glove.6B.200d threecosadd 0.230
threecosmul 0.214
glove.6B.100d threecosadd 0.198
fasttext crawl-300d-2M threecosmul 0.177
threecosadd 0.166
glove glove.6B.100d threecosmul 0.164
glove.twitter.27B.200d threecosadd 0.142
fasttext wiki-news-300d-1M threecosmul 0.139
glove glove.6B.50d threecosadd 0.135
fasttext wiki-news-300d-1M threecosadd 0.131
glove glove.twitter.27B.200d threecosmul 0.128
fasttext wiki-news-300d-1M-subword threecosmul 0.116
threecosadd 0.114
glove glove.twitter.27B.100d threecosadd 0.101
fasttext wiki.simple threecosmul 0.099
glove glove.6B.50d threecosmul 0.090
fasttext wiki.simple threecosadd 0.077
glove glove.twitter.27B.100d threecosmul 0.076
glove.twitter.27B.50d threecosadd 0.054
threecosmul 0.035
glove.twitter.27B.25d threecosadd 0.028
threecosmul 0.017