gluonnlp.data

GluonNLP Toolkit provides tools for building efficient data pipelines for NLP tasks.

Public Datasets

Popular datasets for NLP tasks are provided in gluonnlp. By default, all built-in datasets are automatically downloaded from a public repo and stored in ~/.mxnet/datasets/.

Language modeling

WikiText is a popular language modeling dataset from Salesforce. It is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Google 1 Billion Words is a popular language modeling dataset. It is a collection of over 0.8 billion tokens extracted from the WMT11 website. The dataset is available under the Apache License.

WikiText2

WikiText-2 word-level dataset for language modeling, from Salesforce research.

WikiText103

WikiText-103 word-level dataset for language modeling, from Salesforce research.

WikiText2Raw

WikiText-2 character-level dataset for language modeling

WikiText103Raw

WikiText-103 character-level dataset for language modeling

GBWStream

1-Billion-Word word-level dataset for language modeling, from Google.

Text Classification

IMDB is a popular dataset for binary sentiment classification. It provides a set of 25,000 highly polar movie reviews for training, 25,000 for testing, and additional unlabeled data.

MR is a movie-review data set of 10,662 sentences labeled with respect to their overall sentiment polarity (positive or negative).

SST-1 is an extension of the MR data set. However, training/test splits are provided and labels are fine-grained (very positive, positive, neutral, negative, very negative). The training and test data sets have 237,107 and 2,210 sentences respectively.

SST-2 is the same as SST-1 with neutral sentences removed and only binary sentiment polarity considered: very positive is treated as positive, and very negative is treated as negative.

SUBJ is a subjectivity data set for sentiment analysis. Sentences are labeled with respect to their subjectivity status (subjective or objective).

TREC is a question data set for question classification. Questions are labeled with respect to their question type.

CR is customer reviews of various products (cameras, MP3s etc.). Sentences are labeled with respect to their overall sentiment polarities (positive or negative).

MPQA is an opinion polarity detection subtask. Sentences are labeled with respect to their overall sentiment polarities (positive or negative).

IMDB

IMDB reviews for sentiment analysis.

MR

Movie reviews for sentiment analysis.

SST_1

Stanford Sentiment Treebank: an extension of the MR data set.

SST_2

Stanford Sentiment Treebank: an extension of the MR data set.

SUBJ

Subjectivity dataset for sentiment analysis.

TREC

Question dataset for question classification.

CR

Customer reviews of various products (cameras, MP3s etc.).

MPQA

Opinion polarity detection subtask of the MPQA dataset.

Word Embedding Evaluation Datasets

There are a number of commonly used datasets for intrinsic evaluation of word embeddings.

The similarity-based evaluation datasets include:

WordSim353

WordSim353 dataset.

MEN

MEN dataset for word-similarity and relatedness.

RadinskyMTurk

MTurk dataset for word-similarity and relatedness by Radinsky et al.

RareWords

Rare words dataset for word-similarity and relatedness.

SimLex999

SimLex999 dataset for word-similarity.

SimVerb3500

SimVerb3500 dataset for word-similarity.

SemEval17Task2

SemEval17Task2 dataset for word-similarity.

BakerVerb143

Verb143 dataset.

YangPowersVerb130

Verb-130 dataset.

Analogy-based evaluation datasets include:

GoogleAnalogyTestSet

Google analogy test set

BiggerAnalogyTestSet

Bigger analogy test set

CoNLL Datasets

The CoNLL datasets are from a series of annual competitions held at the top tier conference of the same name. The conference is organized by SIGNLL.

These datasets include data for the shared tasks, such as part-of-speech (POS) tagging, chunking, named entity recognition (NER), semantic role labeling (SRL), etc.

We provide built-in support for CoNLL 2000 – 2002 and 2004, as well as the Universal Dependencies dataset, which is used in the 2017 and 2018 competitions.

CoNLL2000

CoNLL2000 Part-of-speech (POS) tagging and chunking joint task dataset.

CoNLL2001

CoNLL2001 Clause Identification dataset.

CoNLL2002

CoNLL2002 Named Entity Recognition (NER) task dataset.

CoNLL2004

CoNLL2004 Semantic Role Labeling (SRL) task dataset.

UniversalDependencies21

Universal dependencies tree banks.

Machine Translation Datasets

IWSLT2015

Preprocessed IWSLT English-Vietnamese Translation Dataset.

WMT2014

Translation Corpus of the WMT2014 Evaluation Campaign.

WMT2014BPE

Preprocessed Translation Corpus of the WMT2014 Evaluation Campaign.

WMT2016

Translation Corpus of the WMT2016 Evaluation Campaign.

WMT2016BPE

Preprocessed Translation Corpus of the WMT2016 Evaluation Campaign.

Intent Classification and Slot Labeling

ATISDataset

Airline Travel Information System dataset from MS CNTK.

SNIPSDataset

Snips Natural Language Understanding Benchmark dataset.

Question Answering

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

SQuAD

Stanford Question Answering Dataset (SQuAD) - reading comprehension dataset.

GLUE Benchmark

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

GlueCoLA

The Corpus of Linguistic Acceptability (Warstadt et al., 2018) consists of English acceptability judgments drawn from books and journal articles on linguistic theory.

GlueSST2

The Stanford Sentiment Treebank (Socher et al., 2013) consists of sentences from movie reviews and human annotations of their sentiment.

GlueSTSB

The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data.

GlueQQP

The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora.

GlueRTE

The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (RTE1, RTE2, RTE3, and RTE5).

GlueMNLI

The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations.

GlueQNLI

The Question-answering NLI dataset converted from Stanford Question Answering Dataset (Rajpurkar et al.

GlueWNLI

The Winograd NLI dataset converted from the dataset in Winograd Schema Challenge (Levesque et al., 2011).

GlueMRPC

The Microsoft Research Paraphrase Corpus dataset.

SuperGLUE Benchmark

The SuperGLUE Benchmark is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks.

SuperGlueRTE

The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (RTE1, RTE2, RTE3 and RTE5).

SuperGlueCB

The CommitmentBank (CB) is a corpus of short texts in which at least one sentence contains an embedded clause.

SuperGlueWSC

The Winograd Schema Challenge (WSC) is a co-reference resolution dataset.

SuperGlueWiC

The Word-in-Context (WiC) is a word sense disambiguation dataset cast as binary classification of sentence pairs.

SuperGlueCOPA

The Choice of Plausible Alternatives (COPA) is a causal reasoning dataset.

SuperGlueMultiRC

Multi-Sentence Reading Comprehension (MultiRC) is a QA dataset.

SuperGlueBoolQ

Boolean Questions (BoolQ) is a QA dataset where each example consists of a short passage and a yes/no question about it.

SuperGlueReCoRD

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a multiple-choice QA dataset.

SuperGlueAXb

The Broadcoverage Diagnostics (AX-b) is a diagnostics dataset labeled closely to the schema of MultiNLI.

SuperGlueAXg

The Winogender Schema Diagnostics (AX-g) is a diagnostics dataset labeled closely to the schema of MultiNLI.

Datasets

Dataset API for processing common text formats. The following classes can be used or subclassed to load custom datasets.

TextLineDataset

Dataset that comprises lines in a file.

CorpusDataset

Common text dataset that reads a whole corpus based on provided sample splitter and word tokenizer.

TSVDataset

Common tab separated text dataset that reads text fields based on provided sample splitter and field separator.

DataStreams

DataStream API for streaming and processing common text formats. The following classes can be used or subclassed to stream large custom data.

DataStream

Abstract Data Stream Interface.

SimpleDataStream

SimpleDataStream wraps iterables to expose the DataStream API.

DatasetStream

Abstract Dataset Stream Interface.

SimpleDatasetStream

A simple stream of Datasets.

PrefetchingStream

Prefetch a DataStream in a separate Thread or Process.

Transforms

Text data transformation functions. They can be used for processing text sequences in conjunction with the Dataset.transform method.

ClipSequence

Clip the sequence to have length no more than length.

PadSequence

Pad the sequence.

SacreMosesTokenizer

Apply the Moses Tokenizer implemented in sacremoses.

SpacyTokenizer

Apply the Spacy Tokenizer.

SacreMosesDetokenizer

Apply the Moses Detokenizer implemented in sacremoses.

BERTTokenizer

End-to-end tokenization for BERT models.

BERTSentenceTransform

BERT style data transformation.

Samplers

Samplers determine how to iterate through datasets. The samplers and batch samplers below can help iterate through sequence data.

SortedSampler

Sort the samples based on the sort key and then sample sequentially.

FixedBucketSampler

Assign each data sample to a fixed bucket based on its length.

SortedBucketSampler

Batches are sampled from sorted buckets of data.

SplitSampler

Split the dataset into num_parts parts and randomly sample from the part with index part_index.

The FixedBucketSampler uses the following bucket scheme classes to generate bucket keys.

ConstWidthBucket

Buckets with constant width.

LinearWidthBucket

Buckets with linearly increasing width: \(w_i = \alpha * i + 1\) for all \(i \geq 1\).

ExpWidthBucket

Buckets with exponentially increasing width: \(w_i = bucket\_len\_step * w_{i-1}\) for all \(i \geq 2\).
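As a rough illustration of the two width formulas above, the following pure-Python sketch evaluates them directly. The helper names, the chosen alpha, and the initial width w_1 of the exponential scheme are assumptions made for this example only; they are not part of the gluonnlp API, and the bucket scheme classes themselves may derive their parameters differently.

```python
def linear_widths(alpha, num_buckets):
    """Hypothetical helper: widths w_i = alpha * i + 1 for i = 1..num_buckets."""
    return [alpha * i + 1 for i in range(1, num_buckets + 1)]


def exp_widths(first_width, bucket_len_step, num_buckets):
    """Hypothetical helper: w_1 = first_width, w_i = bucket_len_step * w_{i-1}."""
    if num_buckets < 1:
        return []
    widths = [first_width]
    for _ in range(num_buckets - 1):
        widths.append(bucket_len_step * widths[-1])
    return widths


linear = linear_widths(alpha=2, num_buckets=4)
exponential = exp_widths(first_width=2, bucket_len_step=2, num_buckets=4)
print(linear)       # [3, 5, 7, 9]
print(exponential)  # [2, 4, 8, 16]
```

With linearly or exponentially growing widths, short sequences fall into narrow buckets while long sequences share wider ones.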

DataLoaders

DataLoaders load data from a dataset and return mini-batches of data.

ShardedDataLoader

Loads data from a dataset and returns mini-batches of data.

DatasetLoader

Loads data from a list of datasets and returns mini-batches of data.

Utilities

Miscellaneous utility classes and functions for processing text and sequence data.

Counter

Counter class for keeping token frequencies.

count_tokens

Counts tokens in the specified string.

concat_sequence

Concatenate sequences of tokens into a single flattened list of tokens.

slice_sequence

Slice a flat sequence of tokens into sequences of tokens, with each inner sequence’s length equal to the specified length, taking into account the requested sequence overlap.

train_valid_split

Split the dataset into training and validation sets.

register

Registers a dataset with segment specific hyperparameters.

create

Creates an instance of a registered dataset.

list_datasets

Get valid datasets and registered parameters.

API Reference

This module includes common utilities such as data readers and counter.

gluonnlp.data.get_tokenizer(model_name, dataset_name, vocab=None, root='/var/lib/jenkins/.mxnet/data', **kwargs)[source]

Returns a pre-defined tokenizer by name.

Parameters
  • model_name (str) – Options include ‘bert_24_1024_16’, ‘bert_12_768_12’, ‘roberta_12_768_12’, ‘roberta_24_1024_16’ and ‘ernie_12_768_12’

  • dataset_name (str) – For model_name bert_24_1024_16 or bert_12_768_12, the supported datasets are ‘book_corpus_wiki_en_cased’ and ‘book_corpus_wiki_en_uncased’. For model_name bert_12_768_12, ‘wiki_cn_cased’, ‘wiki_multilingual_uncased’, ‘wiki_multilingual_cased’, ‘scibert_scivocab_uncased’, ‘scibert_scivocab_cased’, ‘scibert_basevocab_uncased’, ‘scibert_basevocab_cased’, ‘biobert_v1.0_pmc’, ‘biobert_v1.0_pubmed’, ‘biobert_v1.0_pubmed_pmc’, ‘biobert_v1.1_pubmed’, ‘clinicalbert’ and ‘kobert_news_wiki_ko_cased’ are additionally supported. For model_name roberta_12_768_12 or roberta_24_1024_16, ‘openwebtext_ccnews_stories_books_cased’ is supported. For model_name ernie_12_768_12, ‘baidu_ernie_uncased’ is supported.

  • vocab (gluonnlp.vocab.BERTVocab or None, default None) – Vocabulary for the dataset. Must be provided if tokenizer is based on vocab.

  • root (str, default '$MXNET_HOME/models' with MXNET_HOME defaults to '~/.mxnet') – Location for keeping the model parameters.

Returns

gluonnlp.data.BERTTokenizer or gluonnlp.data.GPT2BPETokenizer or gluonnlp.data.SentencepieceTokenizer

Examples

>>> model_name = 'bert_12_768_12'
>>> dataset_name = 'book_corpus_wiki_en_uncased'
>>> _, vocab = gluonnlp.model.get_model(model_name,
...                                     dataset_name=dataset_name,
...                                     pretrained=False, root='./model')
-etc-
>>> tokenizer = gluonnlp.data.get_tokenizer(model_name, dataset_name, vocab)
>>> tokenizer('Habit is second nature.')
['habit', 'is', 'second', 'nature', '.']
class gluonnlp.data.Counter(**kwds)[source]

Counter class for keeping token frequencies.

discard(min_freq, unknown_token)[source]

Discards tokens with frequency below min_freq and represents them as unknown_token.

Parameters
  • min_freq (int) – Tokens whose frequency is under min_freq are counted as unknown_token in the Counter returned.

  • unknown_token (str) – The representation for any unknown token.

Returns

The Counter instance.

Examples

>>> a = gluonnlp.data.Counter({'a': 10, 'b': 1, 'c': 1})
>>> a.discard(3, '<unk>')
Counter({'a': 10, '<unk>': 2})
gluonnlp.data.count_tokens(tokens, to_lower=False, counter=None)[source]

Counts tokens in the specified string.

For token_delim=’(td)’ and seq_delim=’(sd)’, a specified string of two sequences of tokens may look like:

(td)token1(td)token2(td)token3(td)(sd)(td)token4(td)token5(td)(sd)
Parameters
  • tokens (list of str) – A source list of tokens.

  • to_lower (bool, default False) – Whether to convert the source tokens to lower case.

  • counter (Counter or None, default None) – The Counter instance to be updated with the counts of tokens. If None, return a new Counter instance counting tokens from tokens.

Returns

The counter Counter instance after being updated with the counts of tokens. If counter is None, a new Counter instance counting the tokens in tokens is returned.

Examples

>>> import re
>>> source_str = ' Life is great ! \n life is good . \n'
>>> source_str_tokens = filter(None, re.split(' |\n', source_str))
>>> counter = gluonnlp.data.count_tokens(source_str_tokens)
>>> sorted(counter.items())
[('!', 1), ('.', 1), ('Life', 1), ('good', 1), ('great', 1), ('is', 2), ('life', 1)]
gluonnlp.data.concat_sequence(sequences)[source]

Concatenate sequences of tokens into a single flattened list of tokens.

Parameters

sequences (list of list of object) – Sequences of tokens, each of which is an iterable of tokens.

Returns

Flattened list of tokens.
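The flattening described above can be pictured with a short pure-Python sketch. The function name concat_sequence_sketch is hypothetical and only mirrors the documented behavior; it is not the library's implementation.

```python
from itertools import chain


def concat_sequence_sketch(sequences):
    """Flatten an iterable of token sequences into one list (illustrative only)."""
    return list(chain.from_iterable(sequences))


flat = concat_sequence_sketch([['a', 'b'], ['c'], [], ['d', 'e']])
print(flat)  # ['a', 'b', 'c', 'd', 'e']
```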

gluonnlp.data.slice_sequence(sequence, length, pad_last=False, pad_val='<pad>', overlap=0)[source]

Slice a flat sequence of tokens into sequences of tokens, with each inner sequence’s length equal to the specified length, taking into account the requested sequence overlap.

Parameters
  • sequence (list of object) – A flat list of tokens.

  • length (int) – The length of each of the samples.

  • pad_last (bool, default False) – Whether to pad the last sequence when its length doesn’t align. If the last sequence’s length doesn’t align and pad_last is False, it will be dropped.

  • pad_val (object, default '<pad>') – The padding value to use when the padding of the last sequence is enabled. In general, the type of pad_val should be the same as the tokens.

  • overlap (int, default 0) – The extra number of items in current sample that should overlap with the next sample.

Returns

List of list of tokens, with the length of each inner list equal to length.
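To make the interaction of length, overlap and pad_last more concrete, here is a minimal pure-Python sketch of the behavior described above. The function slice_sequence_sketch is a hypothetical re-implementation written for this example; the library's handling of edge cases may differ.

```python
def slice_sequence_sketch(sequence, length, pad_last=False,
                          pad_val='<pad>', overlap=0):
    """Illustrative sliding-window slicing; not the gluonnlp implementation."""
    if length <= overlap:
        raise ValueError('length must be larger than overlap')
    step = length - overlap  # consecutive windows share `overlap` tokens
    samples = []
    start = 0
    while start + length <= len(sequence):
        samples.append(list(sequence[start:start + length]))
        start += step
    # Tokens before `covered` already appear in one of the full windows.
    covered = start + overlap if samples else 0
    if pad_last and covered < len(sequence):
        tail = list(sequence[start:])
        samples.append(tail + [pad_val] * (length - len(tail)))
    return samples


tokens = list('abcdefg')
dropped = slice_sequence_sketch(tokens, 3)
padded = slice_sequence_sketch(tokens, 3, pad_last=True)
overlapped = slice_sequence_sketch(tokens, 3, overlap=1)
print(dropped)     # [['a', 'b', 'c'], ['d', 'e', 'f']]
print(padded)      # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', '<pad>', '<pad>']]
print(overlapped)  # [['a', 'b', 'c'], ['c', 'd', 'e'], ['e', 'f', 'g']]
```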

gluonnlp.data.train_valid_split(dataset, valid_ratio=0.05, stratify=None)[source]

Split the dataset into training and validation sets.

Parameters
  • dataset (list) – A list of training samples.

  • valid_ratio (float, default 0.05) – Proportion of training samples to use for the validation set. Range: [0, 1].

  • stratify (list, default None) – If not None, data is split in a stratified fashion, using the contents of stratify as class labels.

Returns

  • train (SimpleDataset)

  • valid (SimpleDataset)
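The non-stratified case can be sketched in plain Python as follows. The function name, the seed argument, and the use of plain lists instead of SimpleDataset are assumptions made for this example; how the library shuffles and rounds is not shown here.

```python
import random


def train_valid_split_sketch(dataset, valid_ratio=0.05, seed=0):
    """Illustrative random split; `seed` is a hypothetical extra argument."""
    if not 0.0 <= valid_ratio <= 1.0:
        raise ValueError('valid_ratio must be in [0, 1]')
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)  # shuffle a copy of the index list
    num_valid = int(len(dataset) * valid_ratio)
    valid = [dataset[i] for i in indices[:num_valid]]
    train = [dataset[i] for i in indices[num_valid:]]
    return train, valid


samples = list(range(10))
train, valid = train_valid_split_sketch(samples, valid_ratio=0.2)
print(len(train), len(valid))  # 8 2
```

Every sample ends up in exactly one of the two parts, so the parts together always cover the original dataset.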

gluonnlp.data.line_splitter(s)[source]

Split a string at newlines.

Parameters

s (str) – The string to be split

Returns

List of strings. Obtained by calling s.splitlines().

Return type

List[str]

gluonnlp.data.whitespace_splitter(s)[source]

Split a string at whitespace (space, tab, newline, return, formfeed).

Parameters

s (str) – The string to be split

Returns

List of strings. Obtained by calling s.split().

Return type

List[str]

class gluonnlp.data.Splitter(separator=None)[source]

Split a string based on a separator.

Parameters

separator (str) – The separator based on which string is split.

__call__(s)[source]

Split a string based on the separator.

Parameters

s (str) – The string to be split

Returns

List of strings. Obtained by calling s.split(separator).

Return type

List[str]
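Since the three splitters above are documented as thin wrappers around s.splitlines(), s.split() and s.split(separator), their results can be previewed with the built-in str methods alone. The sample strings below are made up for illustration.

```python
text = 'Life is great !\nlife is good .'

# line_splitter is documented to call s.splitlines()
lines = text.splitlines()

# whitespace_splitter is documented to call s.split()
words = text.split()

# Splitter('\t') is documented to call s.split('\t')
fields = 'a\tb c\td'.split('\t')

print(lines)   # ['Life is great !', 'life is good .']
print(words)   # ['Life', 'is', 'great', '!', 'life', 'is', 'good', '.']
print(fields)  # ['a', 'b c', 'd']
```

Note that splitting on an explicit separator keeps 'b c' together, whereas whitespace splitting would break it apart.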

class gluonnlp.data.ClipSequence(length)[source]

Clip the sequence to have length no more than length.

Parameters

length (int) – Maximum length of the sequence

Examples

>>> datasets = gluon.data.SimpleDataset([[1, 3, 5, 7], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8]])
>>> list(datasets.transform(gluonnlp.data.ClipSequence(4)))
[[1, 3, 5, 7], [1, 2, 3], [1, 2, 3, 4]]
>>> datasets = gluon.data.SimpleDataset([np.array([[1, 3], [5, 7], [7, 5], [3, 1]]),
...                                      np.array([[1, 2], [3, 4], [5, 6],
...                                                [6, 5], [4, 3], [2, 1]]),
...                                      np.array([[2, 4], [4, 2]])])
>>> list(datasets.transform(gluonnlp.data.ClipSequence(3)))
[array([[1, 3],
       [5, 7],
       [7, 5]]), array([[1, 2],
       [3, 4],
       [5, 6]]), array([[2, 4],
       [4, 2]])]
__call__(sample)[source]

Call self as a function.

class gluonnlp.data.PadSequence(length, pad_val=0, clip=True)[source]

Pad the sequence.

Pad the sequence to the given length by inserting pad_val. If clip is set, a sequence longer than length will be clipped.

Parameters
  • length (int) – The maximum length to pad/clip the sequence

  • pad_val (number) – The pad value. Default 0

  • clip (bool) – Whether to clip sequences longer than length. Default True

Examples

>>> datasets = gluon.data.SimpleDataset([[1, 3, 5, 7], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8]])
>>> list(datasets.transform(gluonnlp.data.PadSequence(6)))
[[1, 3, 5, 7, 0, 0], [1, 2, 3, 0, 0, 0], [1, 2, 3, 4, 5, 6]]
>>> list(datasets.transform(gluonnlp.data.PadSequence(6, clip=False)))
[[1, 3, 5, 7, 0, 0], [1, 2, 3, 0, 0, 0], [1, 2, 3, 4, 5, 6, 7, 8]]
>>> list(datasets.transform(gluonnlp.data.PadSequence(6, pad_val=-1, clip=False)))
[[1, 3, 5, 7, -1, -1], [1, 2, 3, -1, -1, -1], [1, 2, 3, 4, 5, 6, 7, 8]]
__call__(sample)[source]
Parameters

sample (list of number or mx.nd.NDArray or np.ndarray) –

Returns

ret

Return type

list of number or mx.nd.NDArray or np.ndarray

class gluonnlp.data.SacreMosesTokenizer[source]

Apply the Moses Tokenizer implemented in sacremoses.

Users of this class are required to install sacremoses. For example, one can use pip install sacremoses.

Note

sacremoses carries an LGPL 2.1+ license.

Examples

>>> tokenizer = gluonnlp.data.SacreMosesTokenizer()
>>> tokenizer('Gluon NLP toolkit provides a suite of text processing tools.')
['Gluon', 'NLP', 'toolkit', 'provides', 'a', 'suite', 'of', 'text', 'processing', 'tools', '.']
>>> tokenizer('Das Gluon NLP-Toolkit stellt eine Reihe von Textverarbeitungstools '
...           'zur Verfügung.')
['Das', 'Gluon', 'NLP-Toolkit', 'stellt', 'eine', 'Reihe', 'von', 'Textverarbeitungstools', 'zur', 'Verfügung', '.']
__call__(sample, return_str=False)[source]

Tokenize a sample.

Parameters
  • sample (str) – The sentence to tokenize

  • return_str (bool) – If True, return a single string; if False, return a list of tokens.

Returns

ret – List of tokens or tokenized text

Return type

list of strs or str

class gluonnlp.data.SpacyTokenizer(lang='en_core_web_sm')[source]

Apply the Spacy Tokenizer.

Users of this class are required to install spaCy and download corresponding NLP models, such as python -m spacy download en.

Only spacy>=2.0.0 is supported.

Parameters

lang (str) – The language to tokenize. Default is ‘en_core_web_sm’, i.e., English. You may refer to https://spacy.io/usage/models for supported languages.

Examples

>>> tokenizer = gluonnlp.data.SpacyTokenizer()
>>> tokenizer('Gluon NLP toolkit provides a suite of text processing tools.')
['Gluon', 'NLP', 'toolkit', 'provides', 'a', 'suite', 'of', 'text', 'processing', 'tools', '.']
>>> tokenizer = gluonnlp.data.SpacyTokenizer('de')
>>> tokenizer('Das Gluon NLP-Toolkit stellt eine Reihe von Textverarbeitungstools'
...           ' zur Verfügung.')
['Das', 'Gluon', 'NLP-Toolkit', 'stellt', 'eine', 'Reihe', 'von', 'Textverarbeitungstools', 'zur', 'Verfügung', '.']
__call__(sample)[source]
Parameters

sample (str) – The sentence to tokenize

Returns

ret – List of tokens

Return type

list of strs

class gluonnlp.data.SacreMosesDetokenizer(return_str=True)[source]

Apply the Moses Detokenizer implemented in sacremoses.

Users of this class are required to install sacremoses. For example, one can use pip install sacremoses.

Note

sacremoses carries an LGPL 2.1+ license.

Parameters

return_str (bool, default True) – If True, return a single string; if False, return a list of words.

Examples

>>> detokenizer = gluonnlp.data.SacreMosesDetokenizer()
>>> detokenizer(['Gluon', 'NLP', 'toolkit', 'provides', 'a', 'suite', 'of',
...              'text', 'processing', 'tools', '.'], return_str=True)
'Gluon NLP toolkit provides a suite of text processing tools.'
>>> detokenizer(['Das', 'Gluon','NLP-Toolkit','stellt','eine','Reihe','von',
...              'Textverarbeitungstools','zur','Verfügung','.'], return_str=True)
'Das Gluon NLP-Toolkit stellt eine Reihe von Textverarbeitungstools zur Verfügung.'
__call__(sample, return_str=None)[source]
Parameters
  • sample (List[str]) – The sentence to detokenize

  • return_str (Optional[bool]) – If True, return a single string; if False, return a list of words; if None, use the constructor setting.

Returns

ret – List of words or detokenized text

Return type

list of strs or str

class gluonnlp.data.JiebaTokenizer[source]

Apply the jieba Tokenizer.

Users of this class are required to install jieba.

Parameters

lang (str) – The language to tokenize. Default is “zh”, i.e., Chinese.

Examples

>>> tokenizer = gluonnlp.data.JiebaTokenizer()
>>> tokenizer('我来到北京清华大学')
['我', '来到', '北京', '清华大学']
>>> tokenizer('小明硕士毕业于中国科学院计算所,后在日本京都大学深造')
['小明', '硕士', '毕业', '于', '中国科学院', '计算所', ',', '后', '在', '日本京都大学', '深造']
__call__(sample)[source]
Parameters

sample (str) – The Chinese sentence to tokenize. It is best not to input sentences in other languages, since this class is mainly used for Chinese word segmentation.

Returns

ret – List of tokens

Return type

list of strs

class gluonnlp.data.NLTKStanfordSegmenter(segmenter_root='/var/lib/jenkins/.mxnet/stanford-segmenter', slf4j_root='/var/lib/jenkins/.mxnet/slf4j', java_class='edu.stanford.nlp.ie.crf.CRFClassifier')[source]

Apply the Stanford Chinese Word Segmenter implemented in NLTK.

Users of this class are required to install Java and NLTK and to download the Stanford Word Segmenter.

Parameters
  • segmenter_root (str, default '$MXNET_HOME/stanford-segmenter') – Path to folder for storing stanford segmenter. MXNET_HOME defaults to ‘~/.mxnet’.

  • slf4j_root (str, default '$MXNET_HOME/slf4j') – Path to folder for storing slf4j. MXNET_HOME defaults to ‘~/.mxnet’.

  • java_class (str, default 'edu.stanford.nlp.ie.crf.CRFClassifier') – The learning algorithm used for segmentation

Examples

>>> tokenizer = gluonnlp.data.NLTKStanfordSegmenter() 
>>> tokenizer('我来到北京清华大学') 
['我', '来到', '北京', '清华大学']
>>> tokenizer('小明硕士毕业于中国科学院计算所,后在日本京都大学深造') 
['小明', '硕士', '毕业', '于', '中国科学院', '计算所', ',', '后', '在', '日本京都大学', '深造']
__call__(sample)[source]
Parameters

sample (str) – The Chinese sentence to tokenize. It is best not to input sentences in other languages, since this class is mainly used for Chinese word segmentation.

Returns

ret – List of tokens

Return type

list of strs

class gluonnlp.data.SentencepieceTokenizer(path, num_best=0, alpha=1.0)[source]

Apply the Sentencepiece Tokenizer, which supports subword tokenization such as BPE.

Users of this class are required to install sentencepiece. For example, one can use pip install sentencepiece.

Parameters
  • path (str) – Path to the pre-trained subword tokenization model.

  • num_best (int, default 0) – A scalar for sampling subwords. If num_best is 0 or 1, no sampling is performed. If num_best > 1, then samples from the num_best results. If num_best < 0, then num_best is assumed to be infinite and samples are drawn from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.

  • alpha (float, default 1.0) – A scalar for a smoothing parameter. Inverse temperature for probability rescaling.

Examples

>>> url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-0690baed.bpe'
>>> f = gluon.utils.download(url)
-etc-
>>> tokenizer = gluonnlp.data.SentencepieceTokenizer(f)
>>> detokenizer = gluonnlp.data.SentencepieceDetokenizer(f)
>>> sentence = 'This is a very awesome, life-changing sentence.'
>>> tokenizer(sentence)
['▁This', '▁is', '▁a', '▁very', '▁awesome', ',', '▁life', '-', 'ch', 'anging', '▁sentence', '.']
>>> detokenizer(tokenizer(sentence))
'This is a very awesome, life-changing sentence.'
>>> os.remove('test-0690baed.bpe')
__call__(sample)[source]
Parameters

sample (str) – The string to tokenize.

Returns

ret – List of tokens

Return type

list of strs

class gluonnlp.data.SentencepieceDetokenizer(path)[source]

Apply the Sentencepiece detokenizer, which supports recombining subwords such as BPE.

Users of this class are required to install sentencepiece. For example, one can use pip install sentencepiece.

Parameters

path (str) – Path to the pre-trained subword tokenization model.

Examples

>>> url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-0690baed.bpe'
>>> f = gluon.utils.download(url)
-etc-
>>> tokenizer = gluonnlp.data.SentencepieceTokenizer(f)
>>> detokenizer = gluonnlp.data.SentencepieceDetokenizer(f)
>>> sentence = 'This is a very awesome, life-changing sentence.'
>>> tokenizer(sentence)
['▁This', '▁is', '▁a', '▁very', '▁awesome', ',', '▁life', '-', 'ch', 'anging', '▁sentence', '.']
>>> detokenizer(tokenizer(sentence))
'This is a very awesome, life-changing sentence.'
>>> os.remove('test-0690baed.bpe')
__call__(sample)[source]
Parameters

sample (list(str)) – The sentence to detokenize

Returns

ret – Detokenized text

Return type

str

class gluonnlp.data.BERTBasicTokenizer(lower=True)[source]

Runs basic tokenization.

Performs invalid character removal (e.g. control characters) and whitespace cleanup, tokenizes CJK characters, splits punctuation on a piece of text, and, if lower is true, strips accents and converts the text to lower case.

Parameters

lower (bool, default True) – Whether to strip accents and convert the text to lower case.

Examples

>>> tokenizer = gluonnlp.data.BERTBasicTokenizer(lower=True)
>>> tokenizer(' \tHeLLo!how  \n Are yoU?  ')
['hello', '!', 'how', 'are', 'you', '?']
>>> tokenizer = gluonnlp.data.BERTBasicTokenizer(lower=False)
>>> tokenizer(' \tHeLLo!how  \n Are yoU?  ')
['HeLLo', '!', 'how', 'Are', 'yoU', '?']
__call__(sample)[source]
Parameters

sample (str) – The string to tokenize. Must be unicode.

Returns

ret – List of tokens

Return type

list of strs

class gluonnlp.data.BERTTokenizer(vocab, lower=True, max_input_chars_per_word=200, lru_cache_size=None)[source]

End-to-end tokenization for BERT models.

Parameters
  • vocab (Vocab) – Vocabulary for the corpus.

  • lower (bool) – Whether to strip accents and convert the text to lower case. When using a pre-trained BERT model, set lower to False for the cased model and to True otherwise.

  • max_input_chars_per_word (int) –

  • lru_cache_size (Optional[int]) – Maximum size of a least-recently-used cache to speed up tokenization. Use size of 2**20 for example.

Examples

>>> _, vocab = gluonnlp.model.bert_12_768_12(dataset_name='wiki_multilingual_uncased',
...                                          pretrained=False, root='./model')
-etc-
>>> tokenizer = gluonnlp.data.BERTTokenizer(vocab=vocab)
>>> tokenizer('gluonnlp: 使NLP变得简单。')
['gl', '##uo', '##nn', '##lp', ':', '使', 'nl', '##p', '变', '得', '简', '单', '。']
__call__(sample)[source]
Parameters

sample (str) – The string to tokenize.

Returns

ret – List of tokens

Return type

list of strs

convert_tokens_to_ids(tokens)[source]

Converts a sequence of tokens into ids using the vocab.

static is_first_subword(token)[source]

Check if a token is the beginning of subwords.

Parameters

token (str) – The input token.

Returns

ret – True if the token is the beginning of a series of wordpieces.

Return type

bool

Examples

>>> _, vocab = gluonnlp.model.bert_12_768_12(dataset_name='wiki_multilingual_uncased',
...                                          pretrained=False, root='./bert_tokenizer')
-etc-
>>> tokenizer = gluonnlp.data.BERTTokenizer(vocab=vocab)
>>> tokenizer('gluonnlp: 使NLP变得简单。')
['gl', '##uo', '##nn', '##lp', ':', '使', 'nl', '##p', '变', '得', '简', '单', '。']
>>> tokenizer.is_first_subword('gl')
True
>>> tokenizer.is_first_subword('##uo')
False
class gluonnlp.data.BERTSentenceTransform(tokenizer, max_seq_length, vocab=None, pad=True, pair=True)[source]

BERT style data transformation.

Parameters
  • tokenizer (BERTTokenizer) – Tokenizer for the sentences.

  • max_seq_length (int) – Maximum sequence length of the sentences.

  • vocab (Vocab) – The vocabulary which has cls_token and sep_token registered. If vocab.cls_token is not present, vocab.bos_token is used instead. If vocab.sep_token is not present, vocab.eos_token is used instead.

  • pad (bool, default True) – Whether to pad the sentences to maximum length.

  • pair (bool, default True) – Whether to transform sentences or sentence pairs.

__call__(line)[source]

Perform transformation for sequence pairs or single sequences.

The transformation is processed in the following steps:

  • tokenize the input sequences

  • insert [CLS], [SEP] as necessary

  • generate type ids to indicate whether a token belongs to the first sequence or the second sequence

  • generate valid length

For sequence pairs, the input is a tuple of 2 strings: text_a, text_b.

Inputs:

text_a: ‘is this jacksonville ?’
text_b: ‘no it is not .’

Tokenization:

text_a: ‘is this jack ##son ##ville ?’
text_b: ‘no it is not .’

Processed:

tokens: ‘[CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]’
type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
valid_length: 14

For single sequences, the input is a tuple of a single string: text_a.

Inputs:

text_a: ‘the dog is hairy .’

Tokenization:

text_a: ‘the dog is hairy .’

Processed:

tokens: ‘[CLS] the dog is hairy . [SEP]’
type_ids: 0 0 0 0 0 0 0
valid_length: 7

If vocab.cls_token and vocab.sep_token are not present, vocab.bos_token and vocab.eos_token are used instead.
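The token layout described above can be reproduced with a small pure-Python sketch that starts from already tokenized input. The function bert_pair_sketch and its arguments are hypothetical and written for this illustration; the sketch skips tokenization, truncation to max_seq_length, padding, and the conversion of tokens to ids that the transform also performs.

```python
def bert_pair_sketch(tokens_a, tokens_b=None,
                     cls_token='[CLS]', sep_token='[SEP]'):
    """Illustrative [CLS]/[SEP] insertion and type-id generation."""
    tokens = [cls_token] + list(tokens_a) + [sep_token]
    type_ids = [0] * len(tokens)  # first segment, including [CLS] and [SEP]
    if tokens_b is not None:
        second = list(tokens_b) + [sep_token]
        tokens += second
        type_ids += [1] * len(second)  # second segment and its [SEP]
    return tokens, type_ids, len(tokens)


tokens, type_ids, valid_length = bert_pair_sketch(
    ['is', 'this', 'jack', '##son', '##ville', '?'],
    ['no', 'it', 'is', 'not', '.'])
print(' '.join(tokens))
print(type_ids)      # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(valid_length)  # 14

single_tokens, single_type_ids, single_length = bert_pair_sketch(
    ['the', 'dog', 'is', 'hairy', '.'])
print(single_length)  # 7
```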

Parameters

line (tuple of str) – Input strings. For sequence pairs, the input is a tuple of 2 strings: (text_a, text_b). For single sequences, the input is a tuple containing a single string: (text_a,).

Returns

  • np.array (input token ids in ‘int32’, shape (batch_size, seq_length))

  • np.array (valid length in ‘int32’, shape (batch_size,))

  • np.array (input token type ids in ‘int32’, shape (batch_size, seq_length))
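
The steps listed for `__call__` can be sketched in plain Python. This is only an illustration that works on already-tokenized input and returns lists instead of arrays; the function name `bert_pair_transform` and the `'[PAD]'` padding token are assumptions made for this example, and truncation of over-long inputs is left out.

```python
def bert_pair_transform(tokens_a, tokens_b=None, max_seq_length=16,
                        cls='[CLS]', sep='[SEP]', pad='[PAD]'):
    """Hypothetical helper: build tokens, type ids and valid length."""
    # First segment: [CLS] tokens_a [SEP], all with type id 0.
    tokens = [cls] + list(tokens_a) + [sep]
    type_ids = [0] * len(tokens)
    # Optional second segment: tokens_b [SEP], all with type id 1.
    if tokens_b is not None:
        tokens += list(tokens_b) + [sep]
        type_ids += [1] * (len(tokens_b) + 1)
    valid_length = len(tokens)
    # Pad up to max_seq_length (no truncation in this sketch).
    padding = max(max_seq_length - valid_length, 0)
    tokens += [pad] * padding
    type_ids += [0] * padding
    return tokens, type_ids, valid_length


tokens, type_ids, valid_length = bert_pair_transform(
    'is this jack ##son ##ville ?'.split(), 'no it is not .'.split())
print(valid_length)   # 14
print(type_ids[:valid_length])
```

The printed valid length agrees with the sequence-pair example above.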

class gluonnlp.data.BERTSPTokenizer(path, vocab, num_best=0, alpha=1.0, lower=True, max_input_chars_per_word=200)[source]

End-to-end SentencePiece tokenization for BERT models.

It works best with BERTSentenceTransform().

Note

BERTSPTokenizer depends on the sentencepiece library. For multi-processing with BERTSPTokenizer, making an extra copy of the BERTSPTokenizer instance is recommended before using it.

Parameters
  • path (str) – Path to the pre-trained subword tokenization model.

  • vocab (gluonnlp.Vocab) – Vocabulary for the corpus.

  • num_best (int, default 0) – A scalar for sampling subwords. If num_best is 0 or 1, no sampling is performed. If num_best > 1, samples from the num_best results. If num_best < 0, num_best is assumed to be infinite and samples are drawn from all hypotheses (the lattice) using the forward-filtering-and-backward-sampling algorithm.

  • alpha (float) – A scalar for a smoothing parameter. Inverse temperature for probability rescaling.

  • lower (bool, default True) – Whether to strip accents and convert the text to lower case. With a pre-trained BERT model, set lower to False for the cased model and to True otherwise.

  • max_input_chars_per_word (int, default 200) –

Examples

>>> url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-682b5d15.bpe'
>>> f = gluon.utils.download(url)
-etc-
>>> bert_vocab = gluonnlp.vocab.BERTVocab.from_sentencepiece(f)
>>> sp_tokenizer = BERTSPTokenizer(f, bert_vocab, lower=True)
>>> sentence = 'Better is to bow than break.'
>>> sp_tokenizer(sentence)
['▁better', '▁is', '▁to', '▁b', 'ow', '▁than', '▁brea', 'k', '▁', '.']
>>> os.remove('test-682b5d15.bpe')
__call__(sample)[source]
Parameters

sample (str) – The string to tokenize.

Returns

ret – List of tokens

Return type

list of strs

convert_tokens_to_ids(tokens)[source]

Converts a sequence of tokens into ids using the vocab.

static is_first_subword(token)[source]

Check if a string token is the beginning of a word (the first of a series of subwords), rather than a subword following a previous subword.

Parameters

token (str) – The input token.

Returns

ret – True if the token is the beginning of a series of subwords, False otherwise.

Return type

bool

Examples

>>> url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-682b5d15.bpe'
>>> f = gluon.utils.download(url)
-etc-
>>> bert_vocab = gluonnlp.vocab.BERTVocab.from_sentencepiece(f)
>>> sp_tokenizer = BERTSPTokenizer(f, bert_vocab, lower=True)
>>> sp_tokenizer('Better is to bow than break.')
['▁better', '▁is', '▁to', '▁b', 'ow', '▁than', '▁brea', 'k', '▁', '.']
>>> sp_tokenizer.is_first_subword('▁better')
True
>>> sp_tokenizer.is_first_subword('ow')
False
>>> os.remove('test-682b5d15.bpe')
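
The two `is_first_subword` examples in this section suggest simple prefix rules: WordPiece continuation pieces start with `##`, while SentencePiece word starts carry the `▁` marker. A minimal sketch under that assumption follows; the function names are hypothetical and this is not the library's source.

```python
def is_first_wordpiece(token):
    # Assumption: WordPiece marks continuation pieces with a leading '##'.
    return not token.startswith('##')


def is_first_sentencepiece(token):
    # Assumption: SentencePiece marks the start of a word with U+2581.
    return token.startswith('\u2581')


print(is_first_wordpiece('gl'), is_first_wordpiece('##uo'))
print(is_first_sentencepiece('\u2581better'), is_first_sentencepiece('ow'))
```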
class gluonnlp.data.GPT2BPETokenizer(root='/var/lib/jenkins/.mxnet/models')[source]

BPE tokenizer used in OpenAI GPT-2 model.

Parameters

root (str, default '$MXNET_HOME/models') – Location for keeping the BPE rank file. MXNET_HOME defaults to ‘~/.mxnet’.

__call__(sample)[source]
Parameters

sample (str) –

Returns

ret

Return type

list(str)

get_bpe_subword(token)[source]

Encode the word token into BPE subwords

Parameters

token (str) –

Returns

chars

Return type

list(str)

class gluonnlp.data.GPT2BPEDetokenizer[source]

BPE detokenizer used in OpenAI GPT-2 model.

__call__(sample)[source]
Parameters

sample (list(str)) –

Returns

ret

Return type

str

class gluonnlp.data.ConstWidthBucket[source]

Buckets with constant width.

__call__(max_lengths, min_lengths, num_buckets)[source]

This function generates bucket keys such that all the buckets have the same width.

Parameters
  • max_lengths (int or list of int) – Maximum of lengths of sequences.

  • min_lengths (int or list of int) – Minimum of lengths of sequences.

  • num_buckets (int) – Number of buckets

Returns

bucket_keys – A list including the keys of the buckets.

Return type

list of int
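
As a rough illustration of constant-width bucketing, the sketch below derives one width from the length range and steps down from the maximum length. It handles scalar lengths only, and the name `const_width_keys` and the exact rounding are assumptions of this example rather than the library's code.

```python
def const_width_keys(max_length, min_length, num_buckets):
    """Hypothetical helper: bucket keys spaced by one constant width."""
    # Every bucket spans the same number of lengths (at least 1).
    width = max((1 + max_length - min_length) // num_buckets, 1)
    # Walk down from the maximum length, never going below the minimum.
    keys = [max(max_length - i * width, min_length)
            for i in range(num_buckets)]
    # Drop duplicates that appear when the range is narrower than num_buckets.
    return sorted(set(keys))


print(const_width_keys(100, 1, 10))
```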

class gluonnlp.data.LinearWidthBucket[source]

Buckets with linearly increasing width: \(w_i = \alpha * i + 1\) for all \(i \geq 1\).

__call__(max_lengths, min_lengths, num_buckets)[source]

This function generates bucket keys with linearly increasing bucket width.

Parameters
  • max_lengths (int or list of int) – Maximum of lengths of sequences.

  • min_lengths (int or list of int) – Minimum of lengths of sequences.

  • num_buckets (int) – Number of buckets

Returns

bucket_keys – A list including the keys of the buckets.

Return type

list of int
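
One way to read the width rule \(w_i = \alpha * i + 1\) is to choose \(\alpha\) so that the widths of all `num_buckets` buckets sum to `max_lengths - min_lengths`, which gives \(\alpha = 2 (L - n) / (n (n + 1))\) for a range \(L\) and \(n\) buckets. The sketch below follows that derivation for scalar lengths; it is an assumption about how the keys could be computed, and `linear_width_keys` is a hypothetical name.

```python
def linear_width_keys(max_length, min_length, num_buckets):
    """Hypothetical helper: keys for widths w_i = alpha * i + 1."""
    n = num_buckets
    total = max_length - min_length
    # Solve sum_{i=1..n} (alpha * i + 1) = total for alpha.
    alpha = 2.0 * (total - n) / (n * (n + 1))
    # The k-th key is the minimum plus the first k widths.
    keys = [int(round(min_length + alpha * k * (k + 1) / 2.0 + k))
            for k in range(1, n + 1)]
    keys[-1] = max_length  # guard against rounding drift on the last key
    return keys


print(linear_width_keys(100, 0, 4))   # [11, 31, 61, 100]
```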

class gluonnlp.data.ExpWidthBucket(bucket_len_step=1.1)[source]

Buckets with exponentially increasing width: \(w_i = bucket\_len\_step * w_{i-1}\) for all \(i \geq 2\).

Parameters

bucket_len_step (float, default 1.1) – This is the increasing factor for the bucket width.

__call__(max_lengths, min_lengths, num_buckets)[source]

This function generates bucket keys with exponentially increasing bucket width.

Parameters
  • max_lengths (int or list of int) – Maximum of lengths of sequences.

  • min_lengths (int or list of int) – Minimum of lengths of sequences.

  • num_buckets (int) – Number of buckets

Returns

bucket_keys – A list including the keys of the buckets.

Return type

list of int
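
The exponential rule \(w_i = bucket\_len\_step * w_{i-1}\) makes the widths a geometric series. A minimal sketch, assuming the first width is chosen so that all widths sum to the length range and that `bucket_len_step` is not 1, could look like this (the helper name is hypothetical):

```python
def exp_width_keys(max_length, min_length, num_buckets, bucket_len_step=1.1):
    """Hypothetical helper: keys for geometrically growing widths."""
    n, step = num_buckets, bucket_len_step
    total = max_length - min_length
    # Geometric series: w_1 * (step**n - 1) / (step - 1) = total.
    first_width = total * (step - 1) / (step ** n - 1)
    widths = [first_width * step ** i for i in range(n)]
    keys, edge = [], float(min_length)
    for width in widths:
        edge += width
        keys.append(int(round(edge)))
    keys[-1] = max_length  # guard against rounding drift on the last key
    return keys, widths


keys, widths = exp_width_keys(100, 0, 5, bucket_len_step=2.0)
print(keys)   # [3, 10, 23, 48, 100]
```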

class gluonnlp.data.SortedSampler(sort_keys, reverse=True)[source]

Sort the samples based on the sort key and then sample sequentially.

Parameters
  • sort_keys (list-like object) – List of the sort keys.

  • reverse (bool, default True) – Whether to sort by descending order.
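
Conceptually, the sampler yields dataset indices ordered by their sort key. The following sketch shows that ordering in plain Python; `sorted_indices` is a hypothetical name used only for this illustration.

```python
def sorted_indices(sort_keys, reverse=True):
    """Hypothetical helper: dataset indices ordered by their sort key."""
    return sorted(range(len(sort_keys)),
                  key=lambda i: sort_keys[i], reverse=reverse)


lengths = [5, 12, 3, 9]
print(sorted_indices(lengths))   # [1, 3, 0, 2]
```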

class gluonnlp.data.FixedBucketSampler(lengths, batch_size, num_buckets=10, bucket_keys=None, ratio=0, shuffle=False, use_average_length=False, num_shards=0, bucket_scheme=<gluonnlp.data.sampler.ConstWidthBucket object>)[source]

Assign each data sample to a fixed bucket based on its length. The bucket keys are either given or generated from the input sequence lengths.

Parameters
  • lengths (list of int or list of tuple/list of int) – The length of the sequences in the input data sample.

  • batch_size (int) – The batch size of the sampler.

  • num_buckets (int or None, default 10) – The number of buckets. This will not be used if bucket_keys is set.

  • bucket_keys (None or list of int or list of tuple, default None) – The keys that will be used to create the buckets. It should usually be the lengths of the sequences. If it is None, the bucket_keys will be generated based on the maximum lengths of the data.

  • ratio (float, default 0) –

    Ratio to scale up the batch size of smaller buckets. Assume the \(i\) th key is \(K_i\) , the default batch size is \(B\) , the ratio to scale the batch size is \(\alpha\) and the batch size corresponds to the \(i\) th bucket is \(B_i\) . We have:

    \[B_i = \max(\alpha B \times \frac{\max_j sum(K_j)}{sum(K_i)}, B)\]

    Thus, setting this to a value larger than 0, like 0.5, will scale up the batch size of the smaller buckets.

  • shuffle (bool, default False) – Whether to shuffle the batches.

  • use_average_length (bool, default False) – If False, each batch contains batch_size sequences and the number of sequence elements varies. If True, each batch contains batch_size elements and the number of sequences varies; in this case, the ratio option is ignored.

  • num_shards (int, default 0) – If num_shards > 0, the sampled batch is split into num_shards smaller batches. The output will have structure of list(list(int)). If num_shards = 0, the output will have structure of list(int). This is useful in multi-gpu training and can potentially reduce the number of paddings. In general, it is set to the number of gpus.

  • bucket_scheme (BucketScheme, default ConstWidthBucket) –

    The scheme used to generate bucket keys. Supported schemes:

    ConstWidthBucket: all the buckets have the same width.

    LinearWidthBucket: the width of the ith bucket follows \(w_i = \alpha * i + 1\).

    ExpWidthBucket: the width of the ith bucket follows \(w_i = bucket\_len\_step * w_{i-1}\).

Examples

>>> lengths = [np.random.randint(1, 100) for _ in range(1000)]
>>> sampler = gluonnlp.data.FixedBucketSampler(lengths, 8, ratio=0.5)
>>> print(sampler.stats())
FixedBucketSampler:
-etc-
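
To see what the ratio parameter does, the batch-size formula \(B_i = \max(\alpha B \times \frac{\max_j sum(K_j)}{sum(K_i)}, B)\) can be evaluated by hand. The sketch below does this for scalar bucket keys, where sum(K) is just K; truncating the result to an integer is an assumption of this example, and `scaled_batch_sizes` is a hypothetical name.

```python
def scaled_batch_sizes(bucket_keys, batch_size, ratio):
    """Hypothetical helper: B_i = max(ratio * B * max_j K_j / K_i, B)."""
    largest = max(bucket_keys)
    return [int(max(ratio * batch_size * largest / key, batch_size))
            for key in bucket_keys]


# The bucket with the shortest sequences gets a larger batch.
print(scaled_batch_sizes([10, 20, 40], batch_size=8, ratio=0.5))   # [16, 8, 8]
```

With ratio=0 every bucket keeps the default batch size, since the first argument of the max is then 0.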
stats()[source]

Return a string representing the statistics of the bucketing sampler.

Returns

ret – String representing the statistics of the buckets.

Return type

str

class gluonnlp.data.SortedBucketSampler(sort_keys, batch_size, mult=100, reverse=True, shuffle=False)[source]

Batches are sampled from sorted buckets of data.

First, the data is partitioned into buckets of batch_size * mult elements each. The samples inside each bucket are then sorted based on sort_keys and batched.

Parameters
  • sort_keys (list-like object) – The keys to sort the samples.

  • batch_size (int) – Batch size of the sampler.

  • mult (int or float, default 100) – The multiplier to determine the bucket size. Each bucket will have size mult * batch_size.

  • reverse (bool, default True) – Whether to sort in descending order.

  • shuffle (bool, default False) – Whether to shuffle the data.

Examples

>>> lengths = [np.random.randint(1, 1000) for _ in range(1000)]
>>> sampler = gluonnlp.data.SortedBucketSampler(lengths, 16)
>>> # The sequence lengths within the batch will be sorted
>>> for i, indices in enumerate(sampler):
...     if i == 0:
...         print([lengths[ele] for ele in indices])
[-etc-]
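
The partition, sort, and batch procedure described for this sampler can be sketched as follows. Shuffling is left out, and `sorted_bucket_batches` is a hypothetical name for this illustration, not the library's implementation.

```python
def sorted_bucket_batches(sort_keys, batch_size, mult=100, reverse=True):
    """Hypothetical helper: sort inside fixed-size buckets, then batch."""
    bucket_size = int(batch_size * mult)
    indices = list(range(len(sort_keys)))
    batches = []
    for start in range(0, len(indices), bucket_size):
        # Sort only the samples that fall into the current bucket.
        bucket = sorted(indices[start:start + bucket_size],
                        key=lambda i: sort_keys[i], reverse=reverse)
        # Cut the sorted bucket into consecutive batches.
        for b in range(0, len(bucket), batch_size):
            batches.append(bucket[b:b + batch_size])
    return batches


keys = [3, 1, 4, 1, 5, 9, 2, 6]
print(sorted_bucket_batches(keys, batch_size=2, mult=2))
# [[2, 0], [1, 3], [5, 7], [4, 6]]
```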
class gluonnlp.data.SplitSampler(length, num_parts=1, part_index=0, even_size=False, repeat=1, shuffle=True)[source]

Split the dataset into num_parts parts and randomly sample from the part with index part_index.

The data is randomly shuffled at each iteration within each partition.

Parameters
  • length (int) – Number of examples in the dataset

  • num_parts (int, default 1) – Number of partitions which the data is split into

  • part_index (int, default 0) – The index of the part to read from

  • even_size (bool, default False) – If the number of samples is not even across all partitions, sample a few extra samples for the ones with fewer samples.

  • repeat (int, default 1) – The number of times that items are repeated.

  • shuffle (bool, default True) – Whether or not to shuffle the items.
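
A rough sketch of the partitioning idea: each worker computes the index range of its own part and shuffles only within it. How leftover examples are distributed when length is not divisible by num_parts is an assumption of this example (the first parts take one extra each), the even_size and repeat options are omitted, and `split_indices` is a hypothetical name.

```python
import random


def split_indices(length, num_parts, part_index, shuffle=True, seed=None):
    """Hypothetical helper: indices belonging to one partition."""
    part_len, remainder = divmod(length, num_parts)
    # Assumption: the first `remainder` parts each take one extra example.
    start = part_len * part_index + min(part_index, remainder)
    end = start + part_len + (1 if part_index < remainder else 0)
    indices = list(range(start, end))
    if shuffle:
        # Shuffle only within this partition.
        random.Random(seed).shuffle(indices)
    return indices


print([split_indices(10, 3, i, shuffle=False) for i in range(3)])
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```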

class gluonnlp.data.TextLineDataset(filename, encoding='utf8')[source]

Dataset that comprises lines in a file. Each line will be stripped.

Parameters
  • filename (str) – Path to the input text file.

  • encoding (str, default 'utf8') – File encoding format.

class gluonnlp.data.CorpusDataset(filename, encoding='utf8', flatten=False, skip_empty=True, sample_splitter=<function line_splitter>, tokenizer=<function whitespace_splitter>, bos=None, eos=None)[source]

Common text dataset that reads a whole corpus based on provided sample splitter and word tokenizer.

The returned dataset includes samples, each of which can either be a list of tokens if tokenizer is specified, or otherwise a single string segment produced by the sample_splitter.

Parameters
  • filename (str or list of str) – Path to the input text file or list of paths to the input text files.

  • encoding (str, default 'utf8') – File encoding format.

  • flatten (bool, default False) – Whether to return all samples as flattened tokens. If True, each sample is a token.

  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.

  • sample_splitter (function, default str.splitlines) – A function that splits the dataset string into samples.

  • tokenizer (function or None, default str.split) – A function that splits each sample string into list of tokens. If None, raw samples are returned according to sample_splitter.

  • bos (str or None, default None) – The token to add at the beginning of each sequence. If None, or if tokenizer is not specified, then nothing is added.

  • eos (str or None, default None) – The token to add at the end of each sequence. If None, or if tokenizer is not specified, then nothing is added.
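
How the options interact is easiest to see on a small in-memory string. The sketch below mimics the documented behavior of sample_splitter (line splitting), tokenizer, bos, eos, skip_empty and flatten; it reads from a string rather than a file, and `read_corpus` is a hypothetical helper, not the library's implementation.

```python
def read_corpus(text, flatten=False, skip_empty=True,
                tokenizer=str.split, bos=None, eos=None):
    """Hypothetical helper mirroring the documented options."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if tokenizer is None:
            # Raw string samples: bos and eos are not added.
            if line or not skip_empty:
                samples.append(line)
            continue
        tokens = tokenizer(line)
        if not tokens and skip_empty:
            continue
        if bos is not None:
            tokens = [bos] + tokens
        if eos is not None:
            tokens = tokens + [eos]
        samples.append(tokens)
    if flatten and tokenizer is not None:
        # Each sample becomes a single token.
        return [token for sample in samples for token in sample]
    return samples


text = '= Homarus gammarus =\n\nA lobster .\n'
print(read_corpus(text, eos='<eos>'))
print(read_corpus(text, bos='<bos>', skip_empty=False))
```

With skip_empty=False the empty middle line yields a sample containing only the bos token.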

class gluonnlp.data.ConcatDataset(datasets)[source]

Dataset that concatenates a list of datasets.

Parameters

datasets (list) – List of datasets.

class gluonnlp.data.TSVDataset(filename, encoding='utf8', sample_splitter=<function line_splitter>, field_separator=<gluonnlp.data.utils.Splitter object>, num_discard_samples=0, field_indices=None, allow_missing=False)[source]

Common tab separated text dataset that reads text fields based on provided sample splitter and field separator.

The returned dataset includes samples, each of which can either be a list of text fields if field_separator is specified, or otherwise a single string segment produced by the sample_splitter.

Example:

# assume `test.tsv` contains the following content:
# Id    FirstName       LastName
# a     Jiheng  Jiang
# b     Laoban  Zha
# discard the first line and select the 0th and 2nd fields
dataset = gluonnlp.data.TSVDataset('test.tsv', num_discard_samples=1, field_indices=[0, 2])
assert dataset[0] == ['a', 'Jiang']
assert dataset[1] == ['b', 'Zha']
Parameters
  • filename (str or list of str) – Path to the input text file or list of paths to the input text files.

  • encoding (str, default 'utf8') – File encoding format.

  • sample_splitter (function, default str.splitlines) – A function that splits the dataset string into samples.

  • field_separator (function or None, default Splitter('\t')) – A function that splits each sample string into list of text fields. If None, raw samples are returned according to sample_splitter.

  • num_discard_samples (int, default 0) – Number of samples discarded at the head of the first file.

  • field_indices (list of int or None, default None) – If set, for each sample, only fields with provided indices are selected as the output. Otherwise all fields are returned.

  • allow_missing (bool, default False) – If set to True, no exception will be thrown if the number of fields is smaller than the maximum field index provided.
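
The example above can be reproduced without a file by applying the same options to an in-memory string. This is only a sketch of the documented semantics; `read_tsv` is a hypothetical helper, and raising `ValueError` for a missing field is an assumption of this example.

```python
def read_tsv(text, num_discard_samples=0, field_indices=None,
             allow_missing=False):
    """Hypothetical helper mirroring the documented options."""
    rows = []
    for line in text.splitlines()[num_discard_samples:]:
        fields = line.split('\t')
        if field_indices is not None:
            if not allow_missing and max(field_indices) >= len(fields):
                # Assumption: the exact exception type is illustrative.
                raise ValueError('missing field in sample: %r' % line)
            fields = [fields[i] for i in field_indices if i < len(fields)]
        rows.append(fields)
    return rows


content = 'Id\tFirstName\tLastName\na\tJiheng\tJiang\nb\tLaoban\tZha\n'
rows = read_tsv(content, num_discard_samples=1, field_indices=[0, 2])
print(rows)   # [['a', 'Jiang'], ['b', 'Zha']]
```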

class gluonnlp.data.NumpyDataset(filename, **kwargs)[source]

A dataset wrapping over a Numpy binary (.npy, .npz) file.

If the file is a .npy file, then a single numpy array is loaded. If the file is a .npz file with multiple arrays, then a list of numpy arrays is loaded, ordered by their key in the archive.

Sparse matrix is not yet supported.

Parameters
  • filename (str) – Path to the .npy or .npz file.

  • kwargs – Keyword arguments are passed to np.load.

Properties
  • keys (list of str or None) – The list of keys loaded from the .npz file.

get_field(field)[source]

Return the dataset corresponding to the provided key.

Example:

a = np.ones((2,2))
b = np.zeros((2,2))
np.savez('data.npz', a=a, b=b)
dataset = NumpyDataset('data.npz')
data_a = dataset.get_field('a')
data_b = dataset.get_field('b')

Parameters

field (str) – The name of the field to retrieve.

class gluonnlp.data.GBWStream(segment='train', skip_empty=True, bos=None, eos='<eos>', root='/var/lib/jenkins/.mxnet/datasets/gbw')[source]

1-Billion-Word word-level dataset for language modeling, from Google.

The GBWStream iterates over CorpusDatasets(flatten=False).

Source http://www.statmt.org/lm-benchmark

License: Apache

Parameters
  • segment ({'train', 'test'}, default 'train') – Dataset segment.

  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.

  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.

  • eos (str or None, default '<eos>') – The token to add at the end of each sentence. If None, nothing is added.

  • root (str, default '$MXNET_HOME/datasets/gbw') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

class gluonnlp.data.Text8(root='/var/lib/jenkins/.mxnet/datasets/text8', segment='train', max_sentence_length=10000)[source]

Text8 corpus

http://mattmahoney.net/dc/textdata.html

Part of the test data for the Large Text Compression Benchmark http://mattmahoney.net/dc/text.html. The first 10**8 bytes of the cleaned English Wikipedia dump on Mar. 3, 2006.

License: https://en.wikipedia.org/wiki/Wikipedia:Copyrights

Parameters

root (str, default '$MXNET_HOME/datasets/text8') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

class gluonnlp.data.Fil9(root='/var/lib/jenkins/.mxnet/datasets/fil9', segment='train', max_sentence_length=None)[source]

Fil9 corpus

http://mattmahoney.net/dc/textdata.html

Part of the test data for the Large Text Compression Benchmark http://mattmahoney.net/dc/text.html. The first 10**9 bytes of the English Wikipedia dump on Mar. 3, 2006.

License: https://en.wikipedia.org/wiki/Wikipedia:Copyrights

Parameters

root (str, default '$MXNET_HOME/datasets/fil9') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

class gluonnlp.data.Enwik8(root='/var/lib/jenkins/.mxnet/datasets/enwik8', segment='train')[source]

Enwik8 corpus

http://mattmahoney.net/dc/textdata.html

Part of the test data for the Large Text Compression Benchmark http://mattmahoney.net/dc/text.html. The first 10**8 bytes of the English Wikipedia dump on Mar. 3, 2006.

License: https://en.wikipedia.org/wiki/Wikipedia:Copyrights

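Parameters

root (str, default '$MXNET_HOME/datasets/enwik8') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
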
class gluonnlp.data.WikiText2(segment='train', flatten=True, skip_empty=True, tokenizer=<function WikiText2.<lambda>>, bos=None, eos='<eos>', root='/var/lib/jenkins/.mxnet/datasets/wikitext-2', **kwargs)[source]

WikiText-2 word-level dataset for language modeling, from Salesforce research.

WikiText2 is implemented as CorpusDataset with the default flatten=True.

From https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

License: Creative Commons Attribution-ShareAlike

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • flatten (bool, default True) – Whether to return all samples as flattened tokens. If True, each sample is a token.

  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.

  • tokenizer (function, default str.split) – A function that splits each sample string into list of tokens.

  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.

  • eos (str or None, default '<eos>') – The token to add at the end of each sentence. If None, nothing is added.

  • root (str, default '$MXNET_HOME/datasets/wikitext-2') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> wikitext2 = gluonnlp.data.WikiText2('val', root='./datasets/wikitext2')
-etc-
>>> len(wikitext2)
216347
>>> wikitext2[0]
'='
>>> wikitext2 = gluonnlp.data.WikiText2('val', flatten=False,
...                                     root='./datasets/wikitext2')
>>> len(wikitext2)
2461
>>> wikitext2[0]
['=', 'Homarus', 'gammarus', '=', '<eos>']
>>> wikitext2 = gluonnlp.data.WikiText2('val', flatten=False, bos='<bos>', eos=None,
...                                     root='./datasets/wikitext2')
>>> wikitext2[0]
['<bos>', '=', 'Homarus', 'gammarus', '=']
>>> wikitext2 = gluonnlp.data.WikiText2('val', flatten=False, bos='<bos>', eos=None,
...                                     skip_empty=False, root='./datasets/wikitext2')
>>> len(wikitext2)
3760
>>> wikitext2[0]
['<bos>']
class gluonnlp.data.WikiText103(segment='train', flatten=True, skip_empty=True, tokenizer=<function WikiText103.<lambda>>, bos=None, eos='<eos>', root='/var/lib/jenkins/.mxnet/datasets/wikitext-103', **kwargs)[source]

WikiText-103 word-level dataset for language modeling, from Salesforce research.

WikiText103 is implemented as CorpusDataset with the default flatten=True.

From https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

License: Creative Commons Attribution-ShareAlike

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • flatten (bool, default True) – Whether to return all samples as flattened tokens. If True, each sample is a token.

  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.

  • tokenizer (function, default str.split) – A function that splits each sample string into list of tokens.

  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.

  • eos (str or None, default '<eos>') – The token to add at the end of each sentence. If None, nothing is added.

  • root (str, default '$MXNET_HOME/datasets/wikitext-103') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> wikitext103 = gluonnlp.data.WikiText103('val', root='./datasets/wikitext103')
-etc-
>>> len(wikitext103)
216347
>>> wikitext103[0]
'='
>>> wikitext103 = gluonnlp.data.WikiText103('val', flatten=False,
...                                         root='./datasets/wikitext103')
>>> len(wikitext103)
2461
>>> wikitext103[0]
['=', 'Homarus', 'gammarus', '=', '<eos>']
>>> wikitext103 = gluonnlp.data.WikiText103('val', flatten=False, bos='<bos>', eos=None,
...                                         root='./datasets/wikitext103')
>>> wikitext103[0]
['<bos>', '=', 'Homarus', 'gammarus', '=']
>>> wikitext103 = gluonnlp.data.WikiText103('val', flatten=False, bos='<bos>', eos=None,
...                                         skip_empty=False, root='./datasets/wikitext103')
>>> len(wikitext103)
3760
>>> wikitext103[0]
['<bos>']
class gluonnlp.data.WikiText2Raw(segment='train', flatten=True, skip_empty=True, bos=None, eos=None, tokenizer=<function WikiText2Raw.<lambda>>, root='/var/lib/jenkins/.mxnet/datasets/wikitext-2', **kwargs)[source]

WikiText-2 character-level dataset for language modeling

WikiText2Raw is implemented as CorpusDataset with the default flatten=True.

From Salesforce research: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

License: Creative Commons Attribution-ShareAlike

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • flatten (bool, default True) – Whether to return all samples as flattened tokens. If True, each sample is a token.

  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.

  • tokenizer (function, default s.encode('utf-8')) – A function that splits each sample string into list of tokens. The tokenizer can also be used to convert everything to lowercase, e.g. with tokenizer=lambda s: s.lower().encode('utf-8')

  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.

  • eos (str or None, default None) – The token to add at the end of each sentence. If None, nothing is added.

  • root (str, default '$MXNET_HOME/datasets/wikitext-2') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> wikitext2 = gluonnlp.data.WikiText2Raw('val', root='./datasets/wikitext2')
-etc-
>>> len(wikitext2)
1136862
>>> wikitext2[0]
61
>>> type(wikitext2[0])
<class 'int'>
>>> wikitext2 = gluonnlp.data.WikiText2Raw('val', flatten=False,
...                                        tokenizer=None, root='./datasets/wikitext2')
>>> len(wikitext2)
2461
>>> wikitext2[0]
'= Homarus gammarus ='
>>> wikitext2 = gluonnlp.data.WikiText2Raw('val', flatten=False, bos='<bos>', eos=None,
...                                        tokenizer=lambda s: s.split(),
...                                        root='./datasets/wikitext2')
>>> wikitext2[0]
['<bos>', '=', 'Homarus', 'gammarus', '=']
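
The integer samples returned by the raw datasets follow from the default tokenizer, which encodes each line to UTF-8 bytes. The snippet below uses only the Python standard library to show how the first line of the validation set maps to byte values and back.

```python
line = '= Homarus gammarus ='

# Iterating over a bytes object yields integers, one per byte.
byte_ids = list(line.encode('utf-8'))
print(byte_ids[:4])   # [61, 32, 72, 111]

# The original text is recovered by decoding the same bytes.
print(bytes(byte_ids).decode('utf-8'))
```

The value 61 is the byte for '=', which is why the flattened raw datasets in the examples start with 61.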
class gluonnlp.data.WikiText103Raw(segment='train', flatten=True, skip_empty=True, tokenizer=<function WikiText103Raw.<lambda>>, bos=None, eos=None, root='/var/lib/jenkins/.mxnet/datasets/wikitext-103', **kwargs)[source]

WikiText-103 character-level dataset for language modeling

WikiText103Raw is implemented as CorpusDataset with the default flatten=True.

From Salesforce research: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

License: Creative Commons Attribution-ShareAlike

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • flatten (bool, default True) – Whether to return all samples as flattened tokens. If True, each sample is a token.

  • skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.

  • tokenizer (function, default s.encode('utf-8')) – A function that splits each sample string into list of tokens. The tokenizer can also be used to convert everything to lowercase, e.g. with tokenizer=lambda s: s.lower().encode('utf-8')

  • bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.

  • eos (str or None, default None) – The token to add at the end of each sentence. If None, nothing is added.

  • root (str, default '$MXNET_HOME/datasets/wikitext-103') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> wikitext103 = gluonnlp.data.WikiText103Raw('val', root='./datasets/wikitext103')
-etc-
>>> len(wikitext103)
1136862
>>> wikitext103[0]
61
>>> wikitext103 = gluonnlp.data.WikiText103Raw('val', flatten=False,
...                                            root='./datasets/wikitext103')
>>> len(wikitext103)
2461
>>> wikitext103[0]
[61, 32, 72, 111, 109, 97, 114, 117, 115, 32, 103, 97, 109, 109, 97, 114, 117, 115, 32, 61]
>>> wikitext103 = gluonnlp.data.WikiText103Raw('val', flatten=False, tokenizer=None,
...                                            root='./datasets/wikitext103')
>>> wikitext103[0]
'= Homarus gammarus ='
class gluonnlp.data.IMDB(segment='train', root='/var/lib/jenkins/.mxnet/datasets/imdb')[source]

IMDB reviews for sentiment analysis.

From http://ai.stanford.edu/~amaas/data/sentiment/

Positive classes have label values in [7, 10]. Negative classes have label values in [1, 4]. All samples in unsupervised set have labels with value 0.

Parameters
  • segment (str, default 'train') – Dataset segment. Options are ‘train’, ‘test’, and ‘unsup’ for unsupervised.

  • root (str, default '$MXNET_HOME/datasets/imdb') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> imdb = gluonnlp.data.IMDB('test', root='./datasets/imdb')
-etc-
>>> len(imdb)
25000
>>> len(imdb[0])
2
>>> type(imdb[0][0]), type(imdb[0][1])
(<class 'str'>, <class 'int'>)
>>> imdb[0][0][:75]
'I went and saw this movie last night after being coaxed to by a few friends'
>>> imdb[0][1]
10
>>> imdb = gluonnlp.data.IMDB('unsup', root='./datasets/imdb')
-etc-
>>> len(imdb)
50000
>>> len(imdb[0])
2
>>> type(imdb[0][0]), type(imdb[0][1])
(<class 'str'>, <class 'int'>)
>>> imdb[0][0][:70]
'I admit, the great majority of films released before say 1933 are just'
>>> imdb[0][1]
0
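
Because the labels are review scores rather than class ids, binary classification needs a mapping step. A minimal sketch based on the score ranges stated above follows; `binarize_imdb_label` is a hypothetical helper and not part of the dataset class.

```python
def binarize_imdb_label(score):
    """Hypothetical helper: map a review score to a binary class."""
    if 7 <= score <= 10:
        return 1      # positive review
    if 1 <= score <= 4:
        return 0      # negative review
    return None       # e.g. score 0 in the unsupervised segment


print(binarize_imdb_label(10), binarize_imdb_label(2), binarize_imdb_label(0))
```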
class gluonnlp.data.MR(root='/var/lib/jenkins/.mxnet/datasets/mr')[source]

Movie reviews for sentiment analysis.

From https://www.cs.cornell.edu/people/pabo/movie-review-data/

Positive class has label value 1. Negative class has label value 0.

Parameters

root (str, default '$MXNET_HOME/datasets/mr') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> mr = gluonnlp.data.MR(root='./datasets/mr')
-etc-
>>> len(mr)
10662
>>> len(mr[3])
2
>>> type(mr[3][0]), type(mr[3][1])
(<class 'str'>, <class 'int'>)
>>> mr[3][0][:55]
'if you sometimes like to go to the movies to have fun ,'
>>> mr[3][1]
1
class gluonnlp.data.TREC(segment='train', root='/var/lib/jenkins/.mxnet/datasets/trec')[source]

Question dataset for question classification.

From http://cogcomp.cs.illinois.edu/Data/QA/QC/

Class labels are (http://cogcomp.org/Data/QA/QC/definition.html):
  • DESCRIPTION: 0

  • ENTITY: 1

  • ABBREVIATION: 2

  • HUMAN: 3

  • LOCATION: 4

  • NUMERIC: 5

The first space-separated token in the text of each sample is the fine-grained label.

Parameters
  • segment (str, default 'train') – Dataset segment. Options are ‘train’ and ‘test’.

  • root (str, default '$MXNET_HOME/datasets/trec') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> trec = gluonnlp.data.TREC('test', root='./datasets/trec')
-etc-
>>> len(trec)
500
>>> len(trec[0])
2
>>> type(trec[0][0]), type(trec[0][1])
(<class 'str'>, <class 'int'>)
>>> trec[0][0]
'How far is it from Denver to Aspen ?'
>>> (trec[0][1], trec[0][0].split()[0])
(5, 'How')
class gluonnlp.data.SUBJ(root='/var/lib/jenkins/.mxnet/datasets/subj')[source]

Subjectivity dataset for sentiment analysis.

Positive class has label value 1. Negative class has label value 0.

Parameters

root (str, default '$MXNET_HOME/datasets/subj') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> subj = gluonnlp.data.SUBJ(root='./datasets/subj')
-etc-
>>> len(subj)
10000
>>> len(subj[0])
2
>>> type(subj[0][0]), type(subj[0][1])
(<class 'str'>, <class 'int'>)
>>> subj[0][0][:60]
'its impressive images of crematorium chimney fires and stack'
>>> subj[0][1]
1
class gluonnlp.data.SST_1(segment='train', root='/var/lib/jenkins/.mxnet/datasets/sst-1')[source]

Stanford Sentiment Treebank: an extension of the MR data set. However, train/dev/test splits are provided and labels are fine-grained (very positive, positive, neutral, negative, very negative).

From http://nlp.stanford.edu/sentiment/

Class labels are:
  • very positive: 4

  • positive: 3

  • neutral: 2

  • negative: 1

  • very negative: 0

Parameters
  • segment (str, default 'train') – Dataset segment. Options are ‘train’ and ‘test’.

  • root (str, default '$MXNET_HOME/datasets/sst-1') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> sst_1 = gluonnlp.data.SST_1('test', root='./datasets/sst_1')
-etc-
>>> len(sst_1)
2210
>>> len(sst_1[0])
2
>>> type(sst_1[0][0]), type(sst_1[0][1])
(<class 'str'>, <class 'int'>)
>>> sst_1[0][0][:73]
'no movement , no yuks , not much of anything .'
>>> sst_1[0][1]
1
class gluonnlp.data.SST_2(segment='train', root='/var/lib/jenkins/.mxnet/datasets/sst-2')[source]

Stanford Sentiment Treebank: an extension of the MR data set. Same as the SST-1 data set except that neutral reviews are removed and labels are binary (positive, negative).

From http://nlp.stanford.edu/sentiment/

Positive class has label value 1. Negative class has label value 0.

Parameters
  • segment (str, default 'train') – Dataset segment. Options are ‘train’ and ‘test’.

  • root (str, default '$MXNET_HOME/datasets/sst-2') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> sst_2 = gluonnlp.data.SST_2('test', root='./datasets/sst_2')
-etc-
>>> len(sst_2)
1821
>>> len(sst_2[0])
2
>>> type(sst_2[0][0]), type(sst_2[0][1])
(<class 'str'>, <class 'int'>)
>>> sst_2[0][0][:65]
'no movement , no yuks , not much of anything .'
>>> sst_2[0][1]
0
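
The relation between the SST-1 and SST-2 labels can be written as a small mapping: neutral samples are dropped, the two positive classes collapse to 1 and the two negative classes to 0. The helper name `sst1_to_sst2` is hypothetical and only illustrates the description above.

```python
def sst1_to_sst2(label):
    """Hypothetical helper: SST-1 label (0..4) to SST-2 label."""
    if label == 2:
        return None               # neutral samples are removed
    return 1 if label > 2 else 0  # 3, 4 are positive; 0, 1 are negative


print([sst1_to_sst2(label) for label in range(5)])   # [0, 0, None, 1, 1]
```

This is consistent with the examples: the first test sentence has label 1 (negative) in SST-1 and label 0 in SST-2.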
class gluonnlp.data.CR(root='/var/lib/jenkins/.mxnet/datasets/cr')[source]

Customer reviews of various products (cameras, MP3s etc.). The task is to predict positive/negative reviews.

Positive class has label value 1. Negative class has label value 0.

Parameters

root (str, default '$MXNET_HOME/datasets/cr') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> cr = gluonnlp.data.CR(root='./datasets/cr')
-etc-
>>> len(cr)
3775
>>> len(cr[3])
2
>>> type(cr[3][0]), type(cr[3][1])
(<class 'str'>, <class 'int'>)
>>> cr[3][0][:55]
'i know the saying is " you get what you pay for " but a'
>>> cr[3][1]
0
class gluonnlp.data.MPQA(root='/var/lib/jenkins/.mxnet/datasets/mpqa')[source]

Opinion polarity detection subtask of the MPQA dataset.

From http://www.cs.pitt.edu/mpqa/

Positive class has label value 1. Negative class has label value 0.

Parameters

root (str, default '$MXNET_HOME/datasets/mpqa') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> mpqa = gluonnlp.data.MPQA(root='./datasets/mpqa')
-etc-
>>> len(mpqa)
10606
>>> len(mpqa[3])
2
>>> type(mpqa[3][0]), type(mpqa[3][1])
(<class 'str'>, <class 'int'>)
>>> mpqa[3][0][:25]
'many years of decay'
>>> mpqa[3][1]
0
class gluonnlp.data.WordSimilarityEvaluationDataset(root)[source]

Base class for word similarity or relatedness task datasets.

Inheriting classes are assumed to implement datasets of the form [‘word1’, ‘word2’, score] where score is a numerical similarity or relatedness score with respect to ‘word1’ and ‘word2’.
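One common way to use samples of this form is to compare the gold scores against similarities predicted by word vectors. The sketch below assumes Spearman rank correlation as the metric and cosine similarity as the predictor; neither is prescribed by this class, and the toy dataset and vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Invented toy data in the ['word1', 'word2', score] form.
dataset = [['cat', 'dog', 8.0], ['cat', 'car', 2.0], ['dog', 'car', 1.5]]
vectors = {'cat': [1.0, 0.1], 'dog': [0.9, 0.3], 'car': [0.1, 1.0]}

gold = [score for _, _, score in dataset]
predicted = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in dataset]
correlation = spearman(gold, predicted)
```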

class gluonnlp.data.WordAnalogyEvaluationDataset(root)[source]

Base class for word analogy task datasets.

Inheriting classes are assumed to implement datasets of the form [‘word1’, ‘word2’, ‘word3’, ‘word4’] or [‘word1’, [ ‘word2a’, ‘word2b’, … ], ‘word3’, [ ‘word4a’, ‘word4b’, … ]].

class gluonnlp.data.WordSim353(segment='all', root='/var/lib/jenkins/.mxnet/datasets/wordsim353')[source]

WordSim353 dataset.

The dataset was collected by Finkelstein et al. (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/). Agirre et al. proposed to split the collection into two datasets, one focused on measuring similarity, and the other one on relatedness (http://alfonseca.org/eng/research/wordsim353.html).

  • Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: the concept revisited. ACM Trans. Inf. Syst., 20(1), 116–131. https://dl.acm.org/citation.cfm?id=372094

  • Agirre, E., Alfonseca, E., Hall, K. B., Kravalova, J., Pasca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and wordnet-based approaches. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA (pp. 19–27). The Association for Computational Linguistics.

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 10 (very much related or identical words).

Parameters
  • segment (str) – ‘relatedness’, ‘similarity’ or ‘all’

  • root (str, default '$MXNET_HOME/datasets/wordsim353') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> wordsim353 = gluonnlp.data.WordSim353('similarity', root='./datasets/wordsim353')
-etc-
>>> len(wordsim353)
203
>>> wordsim353[0]
['Arafat', 'Jackson', 2.5]
class gluonnlp.data.MEN(segment='dev', root='/var/lib/jenkins/.mxnet/datasets/men')[source]

MEN dataset for word-similarity and relatedness.

The dataset was collected by Bruni et al. (https://staff.fnwi.uva.nl/e.bruni/MEN).

  • Bruni, E., Boleda, G., Baroni, M., & Nam-Khanh Tran (2012). Distributional semantics in technicolor. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 1: Long Papers (pp. 136–145). The Association for Computer Linguistics.

License: Creative Commons Attribution 2.0 Generic (CC BY 2.0)

Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 50 (very much related or identical words).

Parameters
  • root (str, default '$MXNET_HOME/datasets/men') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • segment (str, default 'dev') – Dataset segment. Options are ‘train’, ‘dev’, ‘test’.

Examples

>>> men = gluonnlp.data.MEN('test', root='./datasets/men')
-etc-
>>> len(men)
1000
>>> men[0]
['display', 'pond', 10.0]
class gluonnlp.data.RadinskyMTurk(root='/var/lib/jenkins/.mxnet/datasets/radinskymturk')[source]

MTurk dataset for word-similarity and relatedness by Radinsky et al.

  • Radinsky, K., Agichtein, E., Gabrilovich, E., & Markovitch, S. (2011). A word at a time: computing word relatedness using temporal semantic analysis. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, & R. Kumar, Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 - April 1, 2011 (pp. 337–346). ACM.

License: Unspecified

Each sample consists of a pair of words, and a score with scale from 1 (totally unrelated words) to 5 (very much related or identical words).

Parameters

root (str, default '$MXNET_HOME/datasets/radinskymturk') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> radinskymturk = gluonnlp.data.RadinskyMTurk(root='./datasets/radinskymturk')
-etc-
>>> len(radinskymturk)
287
>>> radinskymturk[0]
['episcopal', 'russia', 2.75]
class gluonnlp.data.RareWords(root='/var/lib/jenkins/.mxnet/datasets/rarewords')[source]

Rare words dataset for word similarity and relatedness.

  • Luong, T., Socher, R., & Manning, C. D. (2013). Better word representations with recursive neural networks for morphology. In J. Hockenmaier, & S. Riedel, Proceedings of the Seventeenth Conference on Computational Natural Language Learning, CoNLL 2013, Sofia, Bulgaria, August 8-9, 2013 (pp. 104–113). ACL.

License: Unspecified

Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 10 (very much related or identical words).

Parameters

root (str, default '$MXNET_HOME/datasets/rarewords') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> rarewords = gluonnlp.data.RareWords(root='./datasets/rarewords')
-etc-
>>> len(rarewords)
2034
>>> rarewords[0]
['squishing', 'squirt', 5.88]
class gluonnlp.data.SimLex999(root='/var/lib/jenkins/.mxnet/datasets/simlex999')[source]

SimLex999 dataset for word similarity.

  • Hill, F., Reichart, R., & Korhonen, A. (2015). Simlex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665–695. https://arxiv.org/abs/1408.3456

License: Unspecified

Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 10 (very much related or identical words).

Parameters

root (str, default '$MXNET_HOME/datasets/simlex999') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> simlex999 = gluonnlp.data.SimLex999(root='./datasets/simlex999')
-etc-
>>> len(simlex999)
999
>>> simlex999[0]
['old', 'new', 1.58]
class gluonnlp.data.SimVerb3500(segment='full', root='/var/lib/jenkins/.mxnet/datasets/simverb3500')[source]

SimVerb3500 dataset for word similarity.

  • Gerz, D., Vulić, I., Hill, F., Reichart, R., & Korhonen, A. (2016). SimVerb-3500: a large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).

License: Unspecified

Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 10 (very much related or identical words).

Parameters

root (str, default '$MXNET_HOME/datasets/simverb3500') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> simverb3500 = gluonnlp.data.SimVerb3500(root='./datasets/simverb3500') 
-etc-
>>> len(simverb3500) 
3500
>>> simverb3500[0] 
['take', 'remove', 6.81]
class gluonnlp.data.SemEval17Task2(segment='trial', language='en', root='/var/lib/jenkins/.mxnet/datasets/semeval17task2')[source]

SemEval17Task2 dataset for word-similarity.

The dataset was collected by Finkelstein et al. (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/). Agirre et al. proposed to split the collection into two datasets, one focused on measuring similarity, and the other one on relatedness (http://alfonseca.org/eng/research/wordsim353.html).

  • Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: the concept revisited. ACM Trans. Inf. Syst., 20(1), 116–131. https://dl.acm.org/citation.cfm?id=372094

  • Agirre, E., Alfonseca, E., Hall, K. B., Kravalova, J., Pasca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and wordnet-based approaches. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA (pp. 19–27). The Association for Computational Linguistics.

License: Unspecified

Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 5 (very much related or identical words).

Parameters
  • root (str, default '$MXNET_HOME/datasets/semeval17task2') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • segment (str, default 'trial') – Dataset segment. Options are ‘trial’, ‘test’.

  • language (str, default 'en') – Dataset language.

Examples

>>> semeval17task2 = gluonnlp.data.SemEval17Task2()  
-etc-
>>> len(semeval17task2)  
18
>>> semeval17task2[0]  
['sunset', 'string', 0.05]
class gluonnlp.data.BakerVerb143(root='/var/lib/jenkins/.mxnet/datasets/verb143')[source]

Verb143 dataset.

  • Baker, S., Reichart, R., & Korhonen, A. (2014). An unsupervised model for instance level subcategorization acquisition. In A. Moschitti, B. Pang, & W. Daelemans, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL (pp. 278–289). ACL.

144 pairs of verbs annotated by 10 annotators following the WS-353 guidelines.

License: Unspecified

Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 1 (very much related or identical words).

Parameters

root (str, default '$MXNET_HOME/datasets/verb143') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> bakerverb143 = gluonnlp.data.BakerVerb143(root='./datasets/bakerverb143') 
-etc-
>>> len(bakerverb143) 
144
>>> bakerverb143[0] 
['happen', 'say', 0.19]
class gluonnlp.data.YangPowersVerb130(root='~/.mxnet/datasets/verb130')[source]

Verb-130 dataset.

  • Yang, D., & Powers, D. M. (2006). Verb similarity on the taxonomy of wordnet. In The Third International WordNet Conference: GWC 2006

License: Unspecified

Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 4 (very much related or identical words).

Parameters

root (str, default '$MXNET_HOME/datasets/verb130') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> yangpowersverb130 = gluonnlp.data.YangPowersVerb130(root='./datasets/yangpowersverb130')
>>> len(yangpowersverb130)
130
>>> yangpowersverb130[0]
['brag', 'boast', 4.0]
class gluonnlp.data.GoogleAnalogyTestSet(group=None, category=None, lowercase=True, root='/var/lib/jenkins/.mxnet/datasets/google_analogy')[source]

Google analogy test set

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR).

License: Unspecified

Each sample consists of two analogical pairs of words.

Parameters
  • group ({'syntactic', 'semantic'} or None, default None) – The subset for the specified type of analogy. None for the complete dataset.

  • category (str or None, default None) – The subset for the specified category of analogy. None for the complete dataset.

  • lowercase (boolean, default True) – Whether to convert words to lowercase.

  • root (str, default '$MXNET_HOME/datasets/google_analogy') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> googleanalogytestset = gluonnlp.data.GoogleAnalogyTestSet(
...     root='./datasets/googleanalogytestset')
-etc-
>>> len(googleanalogytestset)
19544
>>> googleanalogytestset[0]
['athens', 'greece', 'baghdad', 'iraq']
>>> googleanalogytestset = gluonnlp.data.GoogleAnalogyTestSet(
...     'syntactic', root='./datasets/googleanalogytestset')
>>> googleanalogytestset[0]
['amazing', 'amazingly', 'apparent', 'apparently']
>>> googleanalogytestset = gluonnlp.data.GoogleAnalogyTestSet(
...     'syntactic', 'gram8-plural', root='./datasets/googleanalogytestset')
>>> googleanalogytestset[0]
['banana', 'bananas', 'bird', 'birds']
class gluonnlp.data.BiggerAnalogyTestSet(category=None, form_analogy_pairs=True, drop_alternative_solutions=True, root='/var/lib/jenkins/.mxnet/datasets/bigger_analogy')[source]

Bigger analogy test set

  • Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In Proceedings of the NAACL-HLT SRW (pp. 47–54). San Diego, California, June 12-17, 2016: ACL. Retrieved from https://www.aclweb.org/anthology/N/N16/N16-2002.pdf

License: Unspecified

Each sample consists of two analogical pairs of words.

Parameters

root (str, default '$MXNET_HOME/datasets/bigger_analogy') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> biggeranalogytestset = gluonnlp.data.BiggerAnalogyTestSet(
...     root='./datasets/biggeranalogytestset')
-etc-
>>> len(biggeranalogytestset)
98000
>>> biggeranalogytestset[0]
['arm', 'armless', 'art', 'artless']
class gluonnlp.data.DataStream[source]

Abstract Data Stream Interface.

DataStreams are useful to avoid loading big datasets into memory. A DataStream is an iterable object (it implements the __iter__ function). Whenever an iteration over the DataStream is requested (e.g. in a for loop or by calling iter(datastream)), a new iterator over all samples in the DataStream is returned. DataStreams can be lazily transformed by calling transform(), which returns a DataStream over the transformed samples.

__iter__()[source]

Return an iterator over all elements of the DataStream.

This method returns a new iterator object that can iterate over all the objects in the DataStream.

Returns

An object implementing the Python iterator protocol.

Return type

iterator

transform(fn)[source]

Transform a DataStream lazily.

Returns

The data stream that lazily transforms the data while streaming.

Return type

DataStream
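The lazy behaviour described for transform() can be illustrated with a small pure-Python sketch. The classes `SketchStream` and `_LazySketchStream` are hypothetical stand-ins written for this example and are not the gluonnlp implementation; the point is only that the transform function runs during iteration, not when transform() is called.

```python
class SketchStream:
    """Illustrative stand-in for a DataStream (not the gluonnlp class)."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        # A fresh iterator is created for every iteration request.
        return iter(self._iterable)

    def transform(self, fn):
        # Nothing is computed here; fn runs only while iterating.
        return _LazySketchStream(self, fn)


class _LazySketchStream(SketchStream):
    """Stream that applies fn to each sample of another stream on demand."""

    def __init__(self, stream, fn):
        self._stream = stream
        self._fn = fn

    def __iter__(self):
        return (self._fn(sample) for sample in self._stream)


calls = []


def tokenize(line):
    calls.append(line)  # record when the transform actually runs
    return line.split()


stream = SketchStream(['hello world', 'lazy streams'])
tokens = stream.transform(tokenize)
calls_before_iteration = len(calls)  # still 0: nothing has been transformed
first_pass = list(tokens)
second_pass = list(tokens)  # each iteration re-runs the transform
```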

class gluonnlp.data.SimpleDataStream(iterable)[source]

SimpleDataStream wraps iterables to expose the DataStream API.

Unlike the iterable itself, the SimpleDataStream exposes the DataStream API and allows lazy transformation of the iterable.

__iter__()[source]

Return an iterator over all elements of the DataStream.

This method returns a new iterator object that can iterate over all the objects in the DataStream.

Returns

An object implementing the Python iterator protocol.

Return type

iterator

class gluonnlp.data.DatasetStream[source]

Abstract Dataset Stream Interface.

A DatasetStream is a DataStream where each sample is a mxnet.gluon.data.Dataset. An iteration over a DatasetStream iterates over mxnet.gluon.data.Dataset objects, each representing a chunk or shard of a larger dataset.

Iterating over sizeable chunks of a dataset can be helpful to speed up preprocessing as the overhead of preprocessing each sample individually is reduced (this is similar to the idea of using batches for training a model).

__iter__()[source]

Return an iterator over all elements of the DataStream.

This method returns a new iterator object that can iterate over all the objects in the DataStream.

Returns

An object implementing the Python iterator protocol.

Return type

iterator

class gluonnlp.data.SimpleDatasetStream(dataset, file_pattern, file_sampler='random', **kwargs)[source]

A simple stream of Datasets.

The SimpleDatasetStream is created from multiple files matching the provided file_pattern. One file is read at a time and a corresponding Dataset is returned. The Dataset is created based on the file and the kwargs passed to SimpleDatasetStream.

Parameters
  • dataset (class) – The class for which to create an object for every file. kwargs are passed to this class.

  • file_pattern (str) – Path to the input text files.

  • file_sampler (str or gluon.data.Sampler, defaults to 'random') –

    The sampler used to sample a file. The following string values are supported:

    • ‘sequential’: SequentialSampler

    • ’random’: RandomSampler

  • kwargs – All other keyword arguments are passed to the dataset constructor.
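The parameters above can be illustrated with a pure-Python sketch that yields one dataset object per matching file. `FileStreamSketch` and `LinesDataset` are hypothetical names for this example only, and expanding file_pattern with glob is an assumption of the sketch rather than a statement about the library's internals.

```python
import glob
import os
import random
import tempfile


class LinesDataset:
    """Hypothetical per-file dataset: one sample per line."""

    def __init__(self, filename, lowercase=False):
        with open(filename, encoding='utf-8') as f:
            lines = [line.rstrip('\n') for line in f]
        self._samples = [line.lower() for line in lines] if lowercase else lines

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, idx):
        return self._samples[idx]


class FileStreamSketch:
    """Yield one dataset per file matching file_pattern (illustration only)."""

    def __init__(self, dataset, file_pattern, file_sampler='random', **kwargs):
        if file_sampler not in ('sequential', 'random'):
            raise ValueError('file_sampler must be "sequential" or "random"')
        self._dataset = dataset
        self._file_pattern = file_pattern
        self._file_sampler = file_sampler
        self._kwargs = kwargs

    def __iter__(self):
        filenames = sorted(glob.glob(self._file_pattern))
        if self._file_sampler == 'random':
            random.shuffle(filenames)
        for filename in filenames:
            # kwargs are forwarded to the dataset constructor.
            yield self._dataset(filename, **self._kwargs)


# Write two small files, then stream them in sequential order.
tmpdir = tempfile.mkdtemp()
for name, text in [('a.txt', 'One\nTwo\n'), ('b.txt', 'Three\n')]:
    with open(os.path.join(tmpdir, name), 'w', encoding='utf-8') as f:
        f.write(text)

stream = FileStreamSketch(LinesDataset, os.path.join(tmpdir, '*.txt'),
                          file_sampler='sequential', lowercase=True)
shards = [[ds[i] for i in range(len(ds))] for ds in stream]
```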

__iter__()[source]

Return an iterator over all elements of the DataStream.

This method returns a new iterator object that can iterate over all the objects in the DataStream.

Returns

An object implementing the Python iterator protocol.

Return type

iterator

class gluonnlp.data.PrefetchingStream(stream, num_prefetch=1, worker_type='thread')[source]

Prefetch a DataStream in a separate Thread or Process.

This iterator creates another thread or process to perform iter_next and stores the fetched data in memory. This can speed up data reading at the cost of higher memory usage.

The python, numpy and mxnet random states in the launched Thread or Process will be initialized randomly based on the next 32 bit integer in the python, numpy and mxnet random generator of the caller respectively (random.getrandbits(32), numpy.random.randint(0, 2**32), int(mx.nd.random.uniform(0, 2**32).asscalar())).

Parameters
  • stream (DataStream) – Source stream.

  • num_prefetch (int, default 1) – Number of elements to prefetch from the stream. Must be greater than 0.

  • worker_type ('thread' or 'process', default 'thread') – Use a separate Python Thread or Process to prefetch.
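The idea of prefetching in a background thread can be sketched with the standard library alone. `PrefetchSketch` is a hypothetical class written for this illustration; it covers only the thread case, does not reproduce the random-state seeding described above, and does not propagate exceptions raised in the worker.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the source stream


class PrefetchSketch:
    """Prefetch up to num_prefetch samples in a background thread."""

    def __init__(self, stream, num_prefetch=1):
        if num_prefetch < 1:
            raise ValueError('num_prefetch must be greater than 0')
        self._stream = stream
        self._num_prefetch = num_prefetch

    def __iter__(self):
        buffer = queue.Queue(maxsize=self._num_prefetch)

        def worker():
            for sample in self._stream:
                buffer.put(sample)  # blocks while the buffer is full
            buffer.put(_END)

        threading.Thread(target=worker, daemon=True).start()
        while True:
            sample = buffer.get()  # blocks until the worker provides data
            if sample is _END:
                return
            yield sample


prefetched = list(PrefetchSketch(range(5), num_prefetch=2))
```

The bounded queue is what trades memory for speed: a larger num_prefetch lets the worker run further ahead of the consumer but keeps more samples in memory.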

__iter__()[source]

Return an iterator over all elements of the DataStream.

This method returns a new iterator object that can iterate over all the objects in the DataStream.

Returns

An object implementing the Python iterator protocol.

Return type

iterator

class gluonnlp.data.CoNLL2000(segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2000')[source]

CoNLL2000 Part-of-speech (POS) tagging and chunking joint task dataset.

Each sample has three fields: word, POS tag, chunk label.

From https://www.clips.uantwerpen.be/conll2000/chunking/
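The per-field layout of a sample can be illustrated by parsing a few lines of column-formatted text. The sketch assumes the raw data has one token per line with whitespace-separated columns and a blank line between sentences; `read_conll_columns` is a hypothetical helper, not a gluonnlp function.

```python
def read_conll_columns(text):
    """Group token lines into per-sentence samples, one list per field."""
    samples, rows = [], []
    for line in text.splitlines():
        if line.strip():
            rows.append(line.split())
        elif rows:
            # A blank line closes the sentence; transpose rows to columns.
            samples.append([list(column) for column in zip(*rows)])
            rows = []
    if rows:  # the last sentence may not be followed by a blank line
        samples.append([list(column) for column in zip(*rows)])
    return samples


raw = """SHEARSON NNP B-NP
LEHMAN NNP I-NP
HUTTON NNP I-NP
Inc NNP I-NP
. . O

He PRP B-NP
reckons VBZ B-VP
"""
samples = read_conll_columns(raw)
words, pos_tags, chunk_labels = samples[0]
```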

Parameters
  • segment ({'train', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/conll2000') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> conll = gluonnlp.data.CoNLL2000('test', root='./datasets/conll2000')
-etc-
>>> len(conll)
2012
>>> len(conll[0])
3
>>> conll[8][0]
['SHEARSON', 'LEHMAN', 'HUTTON', 'Inc', '.']
>>> conll[8][1]
['NNP', 'NNP', 'NNP', 'NNP', '.']
>>> conll[8][2]
['B-NP', 'I-NP', 'I-NP', 'I-NP', 'O']
class gluonnlp.data.CoNLL2001(part, segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2001')[source]

CoNLL2001 Clause Identification dataset.

Each sample has four fields: word, POS tag, chunk label, clause tag.

From https://www.clips.uantwerpen.be/conll2001/clauses/

Parameters
  • part (int, {1, 2, 3}) – Part number of the dataset.

  • segment ({'train', 'testa', 'testb'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/conll2001') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> conll = gluonnlp.data.CoNLL2001(1, 'testa', root='./datasets/conll2001')
-etc-
>>> len(conll)
2012
>>> len(conll[0])
4
>>> conll[8][0]
['SHEARSON', 'LEHMAN', 'HUTTON', 'Inc', '.']
>>> conll[8][1]
['NNP', 'NNP', 'NNP', 'NNP', '.']
>>> conll[8][2]
['B-NP', 'I-NP', 'I-NP', 'I-NP', 'O']
>>> conll[8][3]
['X', 'X', 'X', 'X', 'X']
class gluonnlp.data.CoNLL2002(lang, segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2002')[source]

CoNLL2002 Named Entity Recognition (NER) task dataset.

For ‘esp’, each sample has two fields: word, NER label.

For ‘ned’, each sample has three fields: word, POS tag, NER label.

From https://www.clips.uantwerpen.be/conll2002/ner/

Parameters
  • lang (str, {'esp', 'ned'}) – Dataset language.

  • segment ({'train', 'testa', 'testb'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/conll2002') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> conll = gluonnlp.data.CoNLL2002('esp', 'testa', root='./datasets/conll2002')
-etc-
>>> len(conll)
1915
>>> len(conll[0])
2
>>> conll[0][0]
['Sao', 'Paulo', '(', 'Brasil', ')', ',', '23', 'may', '(', 'EFECOM', ')', '.']
>>> conll[0][1]
['B-LOC', 'I-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']
class gluonnlp.data.CoNLL2004(segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2004')[source]

CoNLL2004 Semantic Role Labeling (SRL) task dataset.

Each sample has six or more fields: word, POS tag, chunk label, clause tag, NER label, target verbs, and sense labels (of variable number per sample).

From http://www.cs.upc.edu/~srlconll/st04/st04.html

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/conll2004') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> conll = gluonnlp.data.CoNLL2004('dev', root='./datasets/conll2004')
-etc-
>>> len(conll)
2012
>>> len(conll[8])
6
>>> conll[8][0]
['SHEARSON', 'LEHMAN', 'HUTTON', 'Inc', '.']
>>> conll[8][1]
['NNP', 'NNP', 'NNP', 'NNP', '.']
>>> conll[8][2]
['B-NP', 'I-NP', 'I-NP', 'I-NP', 'O']
>>> conll[8][3]
['*', '*', '*', '*', '*']
>>> conll[8][4]
['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O']
>>> conll[8][5]
['-', '-', '-', '-', '-']
class gluonnlp.data.UniversalDependencies21(lang='en', segment='train', root='/var/lib/jenkins/.mxnet/datasets/ud2.1')[source]

Universal dependencies tree banks.

Each sample has 8 or more fields as described in http://universaldependencies.org/docs/format.html

From https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2515

Parameters
  • lang (str, default 'en') – Dataset language.

  • segment (str, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/ud2.1') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> ud = gluonnlp.data.UniversalDependencies21('en', 'dev', root='./datasets/ud21')
-etc-
>>> len(ud)
2002
>>> len(ud[0])
10
>>> ud[0][0]
['1', '2', '3', '4', '5', '6', '7']
>>> ud[0][1]
['From', 'the', 'AP', 'comes', 'this', 'story', ':']
>>> ud[0][2]
['from', 'the', 'AP', 'come', 'this', 'story', ':']
>>> ud[0][3]
['ADP', 'DET', 'PROPN', 'VERB', 'DET', 'NOUN', 'PUNCT']
>>> ud[0][4]
['IN', 'DT', 'NNP', 'VBZ', 'DT', 'NN', ':']
>>> ud[0][5][:3]
['_', 'Definite=Def|PronType=Art', 'Number=Sing']
>>> ud[0][6]
['3', '3', '4', '0', '6', '4', '4']
>>> ud[0][7]
['case', 'det', 'obl', 'root', 'det', 'nsubj', 'punct']
>>> ud[0][8]
['3:case', '3:det', '4:obl', '0:root', '6:det', '4:nsubj', '4:punct']
>>> ud[0][9]
['_', '_', '_', '_', '_', '_', '_']
class gluonnlp.data.IWSLT2015(segment='train', src_lang='en', tgt_lang='vi', root='/var/lib/jenkins/.mxnet/datasets/iwslt2015')[source]

Preprocessed IWSLT English-Vietnamese Translation Dataset.

We use the preprocessed version provided in https://nlp.stanford.edu/projects/nmt/

Parameters
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘val’, ‘test’ or their combinations.

  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘vi’.

  • tgt_lang (str, default 'vi') – The target language. Options for source and target languages are ‘en’ <-> ‘vi’.

  • root (str, default '$MXNET_HOME/datasets/iwslt2015') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

class gluonnlp.data.WMT2014(segment='train', src_lang='en', tgt_lang='de', full=False, root='/var/lib/jenkins/.mxnet/datasets/wmt2014')[source]

Translation Corpus of the WMT2014 Evaluation Campaign.

http://www.statmt.org/wmt14/translation-task.html

Parameters
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2009’, ‘newstest2010’, ‘newstest2011’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’ or their combinations

  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’.

  • tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’.

  • full (bool, default False) – If False, the “filtered test sets” are used; if True, the “cleaned test sets” are used.

  • root (str, default '$MXNET_HOME/datasets/wmt2014') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

class gluonnlp.data.WMT2014BPE(segment='train', src_lang='en', tgt_lang='de', full=False, root='/var/lib/jenkins/.mxnet/datasets/wmt2014')[source]

Preprocessed Translation Corpus of the WMT2014 Evaluation Campaign.

We preprocess the dataset by adapting https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh

Parameters
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2009’, ‘newstest2010’, ‘newstest2011’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’ or their combinations

  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’.

  • tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’.

  • full (bool, default False) – By default, we use the test dataset in http://statmt.org/wmt14/test-filtered.tgz. When full is True, we use the test dataset in http://statmt.org/wmt14/test-full.tgz

  • root (str, default '$MXNET_HOME/datasets/wmt2014') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

class gluonnlp.data.WMT2016(segment='train', src_lang='en', tgt_lang='de', root='/var/lib/jenkins/.mxnet/datasets/wmt2016')[source]

Translation Corpus of the WMT2016 Evaluation Campaign.

Parameters
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’, ‘newstest2015’, ‘newstest2016’ or their combinations

  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’.

  • tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’.

  • root (str, default '$MXNET_HOME/datasets/wmt2016') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

class gluonnlp.data.WMT2016BPE(segment='train', src_lang='en', tgt_lang='de', root='/var/lib/jenkins/.mxnet/datasets/wmt2016')[source]

Preprocessed Translation Corpus of the WMT2016 Evaluation Campaign.

We use the preprocessing script in https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh

Parameters
  • segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’, ‘newstest2015’, ‘newstest2016’ or their combinations

  • src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’.

  • tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’.

  • root (str, default '$MXNET_HOME/datasets/wmt2016') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

gluonnlp.data.register(class_=None, **kwargs)[source]

Registers a dataset with segment-specific hyperparameters.

When passing keyword arguments to register, they are checked to be valid keyword arguments for the registered Dataset class constructor and are saved in the registry. Registered keyword arguments can be retrieved with the list_datasets function.

All arguments that result in creation of separate datasets should be registered. Examples are datasets divided in different segments or categories, or datasets containing multiple languages.

Once registered, an instance can be created by calling create() with the class name.

Parameters

**kwargs (list or tuple of allowed argument values) – For each keyword argument, its value must be a list or tuple of the allowed argument values.

Examples

>>> @gluonnlp.data.register(segment=['train', 'test', 'dev'])
... class MyDataset(gluon.data.Dataset):
...     def __init__(self, segment='train'):
...         pass
>>> my_dataset = gluonnlp.data.create('MyDataset')
>>> print(type(my_dataset))
<class 'gluonnlp.data.registry.MyDataset'>
gluonnlp.data.create(name, **kwargs)[source]

Creates an instance of a registered dataset.

Parameters

name (str) – The dataset name (case-insensitive).

Returns

An instance of the registered dataset class, created with the given keyword arguments.

gluonnlp.data.list_datasets(name=None)[source]

Get valid datasets and registered parameters.

Parameters

name (str or None, default None) – Return names and registered parameters of registered datasets. If name is specified, only registered parameters of the respective dataset are returned.

Returns

A dict of all the valid keyword parameter names for the specified dataset. If name is set to None, returns a dict mapping each valid name to its respective keyword parameter dict. The valid names can be passed to gluonnlp.data.create(name).

Return type

dict
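The interplay of register, create and list_datasets described above can be sketched in plain Python. This is an illustrative reimplementation under simplifying assumptions, not gluonnlp's actual code; the `*_sketch` functions, the `_REGISTRY` dict and `ToyDataset` are hypothetical names introduced for the example.

```python
import inspect

_REGISTRY = {}  # lowercased class name -> (class, allowed argument values)


def register_sketch(class_=None, **kwargs):
    """Register a class together with the allowed values of its arguments."""

    def _register(cls):
        params = inspect.signature(cls.__init__).parameters
        for name, values in kwargs.items():
            # Registered keywords must be real constructor arguments.
            if name not in params:
                raise ValueError('%s is not a constructor argument of %s'
                                 % (name, cls.__name__))
            if not isinstance(values, (list, tuple)):
                raise TypeError('allowed values for %s must be a list or tuple'
                                % name)
        _REGISTRY[cls.__name__.lower()] = (cls, kwargs)
        return cls

    # Support both @register_sketch and @register_sketch(segment=[...]).
    return _register(class_) if class_ is not None else _register


def create_sketch(name, **kwargs):
    """Create an instance of a registered class (case-insensitive name)."""
    cls, _ = _REGISTRY[name.lower()]
    return cls(**kwargs)


def list_datasets_sketch(name=None):
    """Return registered parameters for one dataset, or for all of them."""
    if name is None:
        return {key: allowed for key, (_, allowed) in _REGISTRY.items()}
    return _REGISTRY[name.lower()][1]


@register_sketch(segment=['train', 'test', 'dev'])
class ToyDataset:
    def __init__(self, segment='train'):
        self.segment = segment


toy = create_sketch('toydataset', segment='dev')
allowed = list_datasets_sketch('ToyDataset')
```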

class gluonnlp.data.SQuAD(segment='train', version='1.1', root='/var/lib/jenkins/.mxnet/datasets/squad')[source]

Stanford Question Answering Dataset (SQuAD) - reading comprehension dataset.

From https://rajpurkar.github.io/SQuAD-explorer/

License: Creative Commons BY-SA 4.0

The original data format is json, which has multiple contexts (a context is a paragraph of text from which questions are drawn). For each context there are multiple questions, and for each of these questions there are multiple (usually 3) answers.

This class loads the json and flattens it to a table view. Each record is a single question. Since there is more than one question per context in the original dataset, some records share the same context. The number of records in the dataset is equal to the number of questions in the json file.

The format of each record of the dataset is as follows:

  • record_index: An index of the record, generated on the fly (0 … to # of last question)

  • question_id: Question Id. It is a string and taken from the original json file as-is

  • question: Question text, taken from the original json file as-is

  • context: Context text. Will be the same for questions from the same context

  • answer_list: All answers for this question. Stored as python list

  • start_indices: All answers’ starting indices. Stored as python list. The position in this list is the same as the position of an answer in answer_list

  • is_impossible: Whether the question is unanswerable. Present only if version is ‘2.0’, since SQuAD 2.0 contains some unanswerable questions.
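The flattening described above can be sketched on a tiny in-memory document that follows the SQuAD json layout. `flatten_squad` is a hypothetical helper written for this illustration, not the loader used by this class, and the toy article is invented.

```python
def flatten_squad(data, version='1.1'):
    """Flatten SQuAD-style json into one record (tuple) per question."""
    records = []
    for article in data['data']:
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                record = (
                    len(records),                              # record_index
                    qa['id'],                                  # question_id
                    qa['question'],                            # question
                    context,                                   # context
                    [a['text'] for a in qa['answers']],        # answer_list
                    [a['answer_start'] for a in qa['answers']],  # start_indices
                )
                if version == '2.0':
                    # Assumption: treat a missing flag as answerable.
                    record += (qa.get('is_impossible', False),)
                records.append(record)
    return records


# Invented toy document with two questions sharing one context.
toy = {'data': [{'paragraphs': [{
    'context': 'Paris is the capital of France.',
    'qas': [
        {'id': 'q1', 'question': 'What is the capital of France?',
         'answers': [{'text': 'Paris', 'answer_start': 0}]},
        {'id': 'q2', 'question': 'Which country has Paris as capital?',
         'answers': [{'text': 'France', 'answer_start': 24}]},
    ]}]}]}

records = flatten_squad(toy)
records_v2 = flatten_squad(toy, version='2.0')
```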

Parameters
  • segment (str, default 'train') – Dataset segment. Options are ‘train’ and ‘dev’.

  • version (str, default '1.1') – Dataset version. Options are ‘1.1’ and ‘2.0’.

  • root (str, default '~/.mxnet/datasets/squad') – Path to temp folder for storing data.

Examples

>>> squad = gluonnlp.data.SQuAD('dev', '1.1', root='./datasets/squad')
-etc-
>>> len(squad)
10570
>>> len(squad[0])
6
>>> tuple(type(squad[0][i]) for i in range(6))
(<class 'int'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'list'>, <class 'list'>)
>>> squad[0][0]
0
>>> squad[0][1]
'56be4db0acb8001400a502ec'
>>> squad[0][2]
'Which NFL team represented the AFC at Super Bowl 50?'
>>> squad[0][3][:70]
'Super Bowl 50 was an American football game to determine the champion '
>>> squad[0][4]
['Denver Broncos', 'Denver Broncos', 'Denver Broncos']
>>> squad[0][5]
[177, 177, 177]
>>> squad2 = gluonnlp.data.SQuAD('dev', '2.0', root='./datasets/squad')
-etc-
>>> len(squad2)
11873
>>> len(squad2[0])
7
>>> type(squad2[0][6])
<class 'bool'>
>>> squad2[0][6]
False
class gluonnlp.data.ShardedDataLoader(dataset, batch_size=None, shuffle=False, sampler=None, last_batch=None, batch_sampler=None, batchify_fn=None, num_workers=0, pin_memory=False, prefetch=None, thread_pool=False)[source]

Loads data from a dataset and returns mini-batches of data.

Parameters
  • dataset (Dataset) – Source dataset. Note that numpy and mxnet arrays can be directly used as a Dataset.

  • batch_size (int) – Size of mini-batch.

  • shuffle (bool) – Whether to shuffle the samples.

  • sampler (Sampler) – The sampler to use. Either specify sampler or shuffle, not both.

  • last_batch ({'keep', 'discard', 'rollover'}) –

    How to handle the last batch if batch_size does not evenly divide len(dataset).

    keep - A batch with fewer samples than previous batches is returned.

    discard - The last batch is discarded if it is incomplete.

    rollover - The remaining samples are rolled over to the next epoch.

  • batch_sampler (Sampler) – A sampler that returns mini-batches. Do not specify batch_size, shuffle, sampler, and last_batch if batch_sampler is specified.

  • batchify_fn (callable) –

    Callback function to allow users to specify how to merge samples into a batch. Defaults to default_batchify_fn:

    def default_batchify_fn(data):
        if isinstance(data[0], nd.NDArray):
            return nd.stack(*data)
        elif isinstance(data[0], tuple):
            data = zip(*data)
            return [default_batchify_fn(i) for i in data]
        else:
            data = np.asarray(data)
            return nd.array(data, dtype=data.dtype)
    

  • num_workers (int, default 0) – The number of multiprocessing workers to use for data preprocessing. num_workers > 0 is not supported on Windows yet.

  • pin_memory (boolean, default False) – If True, the dataloader will copy NDArrays into pinned memory before returning them. Copying from CPU pinned memory to GPU is faster than from normal CPU memory.

  • prefetch (int, default is num_workers * 2) – The number of batches to prefetch. Only takes effect if num_workers > 0. If prefetch > 0, worker processes are allowed to prefetch that many batches before data is requested from the iterator. A larger value gives smoother start-up performance but consumes more shared memory. A smaller value may defeat the purpose of using multiple worker processes; in that case, try reducing num_workers.

  • thread_pool (bool, default False) – If True, use a thread pool instead of a multiprocessing pool. A thread pool avoids shared memory usage. If the DataLoader is mostly IO-bound or the GIL is not a significant bottleneck, the thread pool version may perform better than multiprocessing.
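
The three last_batch options can be illustrated with a small pure-Python sketch over sample indices. It is not the loader's implementation: make_batches and its carry_over argument are hypothetical names, and prepending the rolled-over samples to the next epoch is an assumption of this sketch.

```python
def make_batches(num_samples, batch_size, last_batch='keep', carry_over=()):
    """Group sample indices into batches (hypothetical helper).

    Returns (batches, leftover); leftover is non-empty only for 'rollover'.
    """
    if last_batch not in ('keep', 'discard', 'rollover'):
        raise ValueError("last_batch must be 'keep', 'discard' or 'rollover'")
    # Samples rolled over from the previous epoch come first (assumption).
    indices = list(carry_over) + list(range(num_samples))
    batches = [indices[i:i + batch_size]
               for i in range(0, len(indices), batch_size)]
    leftover = []
    if batches and len(batches[-1]) < batch_size:
        if last_batch == 'discard':
            batches.pop()              # drop the incomplete batch
        elif last_batch == 'rollover':
            leftover = batches.pop()   # keep it for the next epoch
    return batches, leftover


keep, _ = make_batches(10, 4, 'keep')
discard, _ = make_batches(10, 4, 'discard')
first_epoch, rest = make_batches(10, 4, 'rollover')
second_epoch, rest2 = make_batches(10, 4, 'rollover', carry_over=rest)

print(keep[-1])           # [8, 9]
print(len(discard))       # 2
print(rest)               # [8, 9]
print(second_epoch[0])    # [8, 9, 0, 1]
```

With 10 samples and a batch size of 4, 'keep' yields three batches, 'discard' yields two, and 'rollover' yields two in the first epoch and three full batches in the second.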

class gluonnlp.data.UnigramCandidateSampler(weights, dtype='float32')[source]

Unigram Candidate Sampler

Draw random samples from a unigram distribution with specified weights using the alias method.

Parameters
  • weights (mx.nd.NDArray) – Unnormalized class probabilities. Samples are drawn and returned on the same context as weights.context.

  • dtype (str or np.dtype, default 'float32') – Data type of the candidates. Make sure that the dtype precision is large enough to represent the size of your weights array precisely. For example, float32 cannot distinguish 2**24 from 2**24 + 1.
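
The float32 limit mentioned for dtype can be checked with the standard library alone. The helper as_float32 is a hypothetical name used only for this illustration; it round-trips a number through IEEE 754 single precision.

```python
import struct


def as_float32(value):
    """Round-trip a number through IEEE 754 single precision."""
    return struct.unpack('<f', struct.pack('<f', float(value)))[0]


print(as_float32(2**24))        # 16777216.0
print(as_float32(2**24 + 1))    # 16777216.0, the index 2**24 + 1 is lost
print(as_float32(2**24 - 1))    # 16777215.0, still exact
```

Candidate indices above 2**24 therefore need a wider dtype to remain distinguishable.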

hybrid_forward(F, candidates_like, prob, alias)[source]

Draw samples from uniform distribution and return sampled candidates.

Parameters

candidates_like (mxnet.nd.NDArray or mxnet.sym.Symbol) – Specifies the shape of the candidates to be sampled.

Returns

samples – The sampled candidates of shape candidates_like.shape. Candidates are sampled based on the weights specified on creation of the UnigramCandidateSampler.

Return type

mxnet.nd.NDArray or mxnet.sym.Symbol
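
The alias method named in the class description can be sketched in pure Python. This is a generic version of the technique (Vose's construction), not the sampler's own code, which operates on NDArrays; build_alias_table, sample_alias and table_distribution are hypothetical names. The prob and alias lists play the same role as the prob and alias arguments of hybrid_forward.

```python
import random


def build_alias_table(weights):
    """Build (prob, alias) tables from unnormalized weights."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]   # mean of scaled is 1.0
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l        # bucket s is topped up by l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                     # remaining buckets are full
        prob[i], alias[i] = 1.0, i
    return prob, alias


def sample_alias(prob, alias, rng=random):
    """Draw one index: pick a bucket uniformly, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def table_distribution(prob, alias):
    """Recover the distribution that the tables encode (for checking)."""
    n = len(prob)
    dist = [0.0] * n
    for i in range(n):
        dist[i] += prob[i] / n
        dist[alias[i]] += (1.0 - prob[i]) / n
    return dist


prob, alias = build_alias_table([1, 1, 2, 4])
print(table_distribution(prob, alias))   # [0.125, 0.125, 0.25, 0.5]
```

Building the tables takes linear time, and each draw then needs only one uniform integer and one uniform float, which is why the method suits drawing many negative candidates.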

class gluonnlp.data.ATISDataset(segment='train', root='/var/lib/jenkins/.mxnet/datasets/atis')[source]

Airline Travel Information System dataset from MS CNTK.

From https://github.com/Microsoft/CNTK/tree/master/Examples/LanguageUnderstanding/ATIS/Data

License: Unspecified

Each sample has three fields: tokens, slot labels, intent label.

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/atis') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> atis = gluonnlp.data.ATISDataset(root='./datasets/atis')
-etc-
>>> len(atis)
4478
>>> len(atis[0])
3
>>> len(atis[0][0])
10
>>> atis[0][0]
['i', 'want', 'to', 'fly', 'from', 'baltimore', 'to', 'dallas', 'round', 'trip']
>>> len(atis[0][1])
10
>>> atis[0][1][:8]
['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-toloc.city_name']
>>> atis[0][2]
array([10], dtype=int32)
class gluonnlp.data.SNIPSDataset(segment='train', root='/var/lib/jenkins/.mxnet/datasets/snips')[source]

Snips Natural Language Understanding Benchmark dataset.

Coucke et al. (2018). Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces. https://arxiv.org/abs/1805.10190

From https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines

License: Unspecified

Each sample has three fields: tokens, slot labels, intent label.

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/snips') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> snips = gluonnlp.data.SNIPSDataset(root='./datasets/snips')
-etc-
>>> len(snips)
13084
>>> len(snips[0])
3
>>> len(snips[1][0])
8
>>> snips[1][0]
['put', 'United', 'Abominations', 'onto', 'my', 'rare', 'groove', 'playlist']
>>> len(snips[1][1])
8
>>> snips[1][1][:5]
['O', 'B-entity_name', 'I-entity_name', 'O', 'B-playlist_owner']
>>> snips[1][2]
array([0], dtype=int32)
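
The slot labels shown in the ATIS and SNIPS examples look like B-/I-/O tags, where B- opens a slot, I- continues it, and O marks tokens outside any slot. Assuming that convention, the following sketch groups tokens into slots. The helper bio_to_spans is hypothetical and not part of gluonnlp; the inputs are the token and label prefixes printed in the examples above.

```python
def bio_to_spans(tokens, labels):
    """Group tokens into (slot_type, text) pairs from B-/I-/O labels."""
    spans = []
    slot_type, slot_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith('B-'):
            if slot_tokens:
                spans.append((slot_type, ' '.join(slot_tokens)))
            slot_type, slot_tokens = label[2:], [token]
        elif label.startswith('I-') and slot_type == label[2:]:
            slot_tokens.append(token)
        else:
            # 'O', or an I- tag that does not continue the open slot.
            if slot_tokens:
                spans.append((slot_type, ' '.join(slot_tokens)))
            slot_type, slot_tokens = None, []
    if slot_tokens:
        spans.append((slot_type, ' '.join(slot_tokens)))
    return spans


snips_tokens = ['put', 'United', 'Abominations', 'onto', 'my']
snips_labels = ['O', 'B-entity_name', 'I-entity_name', 'O', 'B-playlist_owner']
atis_tokens = ['i', 'want', 'to', 'fly', 'from', 'baltimore', 'to', 'dallas']
atis_labels = ['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O',
               'B-toloc.city_name']

print(bio_to_spans(snips_tokens, snips_labels))
print(bio_to_spans(atis_tokens, atis_labels))
```
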
class gluonnlp.data.GlueCoLA(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_cola', return_all_fields=False)[source]

The Corpus of Linguistic Acceptability (Warstadt et al., 2018) consists of English acceptability judgments drawn from books and journal articles on linguistic theory.

Each example is a sequence of words annotated with whether it is a grammatical English sentence.

From https://gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/glue_cola') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> cola_dev = gluonnlp.data.GlueCoLA('dev', root='./datasets/cola')
-etc-
>>> len(cola_dev)
1043
>>> len(cola_dev[0])
2
>>> cola_dev[0]
['The sailors rode the breeze clear of the rocks.', '1']
>>> cola_test = gluonnlp.data.GlueCoLA('test', root='./datasets/cola')
-etc-
>>> len(cola_test)
1063
>>> len(cola_test[0])
1
>>> cola_test[0]
['Bill whistled past the house.']
class gluonnlp.data.GlueSST2(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_sst', return_all_fields=False)[source]

The Stanford Sentiment Treebank (Socher et al., 2013) consists of sentences from movie reviews and human annotations of their sentiment.

From https://gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/glue_sst') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> sst_dev = gluonnlp.data.GlueSST2('dev', root='./datasets/sst')
-etc-
>>> len(sst_dev)
872
>>> len(sst_dev[0])
2
>>> sst_dev[0]
["it 's a charming and often affecting journey . ", '1']
>>> sst_test = gluonnlp.data.GlueSST2('test', root='./datasets/sst')
-etc-
>>> len(sst_test)
1821
>>> len(sst_test[0])
1
>>> sst_test[0]
['uneasy mishmash of styles and genres .']
class gluonnlp.data.GlueSTSB(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_stsb', return_all_fields=False)[source]

The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data.

Each pair is human-annotated with a similarity score from 0 to 5.

From https://gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/glue_stsb') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> stsb_dev = gluonnlp.data.GlueSTSB('dev', root='./datasets/stsb')
-etc-
>>> len(stsb_dev)
1500
>>> len(stsb_dev[0])
3
>>> stsb_dev[0]
['A man with a hard hat is dancing.', 'A man wearing a hard hat is dancing.', '5.000']
>>> stsb_test = gluonnlp.data.GlueSTSB('test', root='./datasets/stsb')
-etc-
>>> len(stsb_test)
1379
>>> len(stsb_test[0])
2
>>> stsb_test[0]
['A girl is styling her hair.', 'A girl is brushing her hair.']
class gluonnlp.data.GlueQQP(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_qqp', return_all_fields=False)[source]

The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora.

From https://gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/glue_qqp') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> import warnings
>>> with warnings.catch_warnings():
...     # Ignore warnings triggered by invalid entries in GlueQQP dev set
...     warnings.simplefilter("ignore")
...     qqp_dev = gluonnlp.data.GlueQQP('dev', root='./datasets/qqp')
-etc-
>>> len(qqp_dev)
40430
>>> len(qqp_dev[0])
3
>>> qqp_dev[0]
['Why are African-Americans so beautiful?', 'Why are hispanics so beautiful?', '0']
>>> qqp_test = gluonnlp.data.GlueQQP('test', root='./datasets/qqp')
-etc-
>>> len(qqp_test)
390965
>>> len(qqp_test[3])
2
>>> qqp_test[3]
['Is it safe to invest in social trade biz?', 'Is social trade geniune?']
class gluonnlp.data.GlueRTE(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_rte', return_all_fields=False)[source]

The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (RTE1, RTE2, RTE3, and RTE5).

From https://gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/glue_rte') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> rte_dev = gluonnlp.data.GlueRTE('dev', root='./datasets/rte')
-etc-
>>> len(rte_dev)
277
>>> len(rte_dev[0])
3
>>> rte_dev[0]
['Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.', 'Christopher Reeve had an accident.', 'not_entailment']
>>> rte_test = gluonnlp.data.GlueRTE('test', root='./datasets/rte')
-etc-
>>> len(rte_test)
3000
>>> len(rte_test[16])
2
>>> rte_test[16]
['United failed to progress beyond the group stages of the Champions League and trail in the Premiership title race, sparking rumours over its future.', 'United won the Champions League.']
class gluonnlp.data.GlueMNLI(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_mnli', return_all_fields=False)[source]

The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations.

From https://gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'dev_matched', 'dev_mismatched', 'test_matched', 'test_mismatched'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/glue_mnli') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> mnli_dev = gluonnlp.data.GlueMNLI('dev_matched', root='./datasets/mnli')
-etc-
>>> len(mnli_dev)
9815
>>> len(mnli_dev[0])
3
>>> mnli_dev[0]
['The new rights are nice enough', 'Everyone really likes the newest benefits ', 'neutral']
>>> mnli_test = gluonnlp.data.GlueMNLI('test_matched', root='./datasets/mnli')
-etc-
>>> len(mnli_test)
9796
>>> len(mnli_test[0])
2
>>> mnli_test[0]
['Hierbas, ans seco, ans dulce, and frigola are just a few names worth keeping a look-out for.', 'Hierbas is a name worth looking out for.']
class gluonnlp.data.GlueQNLI(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_qnli', return_all_fields=False)[source]

The Question-answering NLI dataset converted from Stanford Question Answering Dataset (Rajpurkar et al. 2016).

From https://gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/glue_qnli') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> qnli_dev = gluonnlp.data.GlueQNLI('dev', root='./datasets/qnli')
-etc-
>>> len(qnli_dev)
5732
>>> len(qnli_dev[0])
3
>>> qnli_dev[0]
['Which NFL team represented the AFC at Super Bowl 50?', 'The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title.', 'entailment']
>>> qnli_test = gluonnlp.data.GlueQNLI('test', root='./datasets/qnli')
-etc-
>>> len(qnli_test)
5740
>>> len(qnli_test[0])
2
>>> qnli_test[0]
['What seldom used term of a unit of force equal to 1000 pound s of force?', 'Other arcane units of force include the sthène, which is equivalent to 1000 N, and the kip, which is equivalent to 1000 lbf.']
class gluonnlp.data.GlueWNLI(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_wnli', return_all_fields=False)[source]

The Winograd NLI dataset converted from the dataset in Winograd Schema Challenge (Levesque et al., 2011).

From https://gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/glue_wnli') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> wnli_dev = gluonnlp.data.GlueWNLI('dev', root='./datasets/wnli')
-etc-
>>> len(wnli_dev)
71
>>> len(wnli_dev[0])
3
>>> wnli_dev[0]
['The drain is clogged with hair. It has to be cleaned.', 'The hair has to be cleaned.', '0']
>>> wnli_test = gluonnlp.data.GlueWNLI('test', root='./datasets/wnli')
-etc-
>>> len(wnli_test)
146
>>> len(wnli_test[0])
2
>>> wnli_test[0]
['Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight.', 'Horses ran away when Maude and Dora came in sight.']
class gluonnlp.data.GlueMRPC(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_mrpc')[source]

The Microsoft Research Paraphrase Corpus dataset.

From https://gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/glue_mrpc') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> mrpc_dev = gluonnlp.data.GlueMRPC('dev', root='./datasets/mrpc')
-etc-
>>> len(mrpc_dev)
408
>>> len(mrpc_dev[0])
3
>>> mrpc_dev[0]
["He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .", '" The foodservice pie business does not fit our long-term growth strategy .', '1']
>>> mrpc_test = gluonnlp.data.GlueMRPC('test', root='./datasets/mrpc')
-etc-
>>> len(mrpc_test)
1725
>>> len(mrpc_test[0])
2
>>> mrpc_test[0]
["PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So .", 'Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .']
class gluonnlp.data.SuperGlueRTE(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_rte')[source]

The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (RTE1, RTE2, RTE3 and RTE5).

From https://super.gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • root (str, default "$MXNET_HOME/datasets/superglue_rte") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> rte_val = gluonnlp.data.SuperGlueRTE('val', root='./datasets/rte')
-etc-
>>> len(rte_val)
277
>>> sorted(rte_val[0].items())
[('hypothesis', 'Christopher Reeve had an accident.'), ('idx', 0), ('label', 'not_entailment'), ('premise', 'Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.')]
>>> rte_test = gluonnlp.data.SuperGlueRTE('test', root='./datasets/rte')
-etc-
>>> len(rte_test)
3000
>>> sorted(rte_test[0].items())
[('hypothesis', 'Shukla is related to Mangla.'), ('idx', 0), ('premise', "Mangla was summoned after Madhumita's sister Nidhi Shukla, who was the first witness in the case.")]
class gluonnlp.data.SuperGlueCB(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_cb')[source]

The CommitmentBank (CB) is a corpus of short texts in which at least one sentence contains an embedded clause.

From https://super.gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • root (str, default "$MXNET_HOME/datasets/superglue_cb") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> cb_val = gluonnlp.data.SuperGlueCB('val', root='./datasets/cb')
-etc-
>>> len(cb_val)
56
>>> sorted(cb_val[0].items())
[('hypothesis', 'Valence was helping'), ('idx', 0), ('label', 'contradiction'), ('premise', "Valence the void-brain, Valence the virtuous valet. Why couldn't the figger choose his own portion of titanic anatomy to shaft? Did he think he was helping?")]
>>> cb_test = gluonnlp.data.SuperGlueCB('test', root='./datasets/cb')
-etc-
>>> len(cb_test)
250
>>> sorted(cb_test[0].items())
[('hypothesis', 'Polly was not an experienced ocean sailor'), ('idx', 0), ('premise', 'Polly had to think quickly. They were still close enough to shore for him to return her to the police if she admitted she was not an experienced ocean sailor.')]
class gluonnlp.data.SuperGlueWSC(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_wsc')[source]

The Winograd Schema Challenge (WSC) is a co-reference resolution dataset. (Levesque et al., 2012)

From https://super.gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • root (str, default "$MXNET_HOME/datasets/superglue_wsc") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> wsc_val = gluonnlp.data.SuperGlueWSC('val', root='./datasets/wsc')
-etc-
>>> len(wsc_val)
104
>>> sorted(wsc_val[5].items())
[('idx', 5), ('label', True), ('target', OrderedDict([('span2_index', 9), ('span1_index', 6), ('span1_text', 'The table'), ('span2_text', 'it')])), ('text', 'The large ball crashed right through the table because it was made of styrofoam.')]
>>> wsc_test = gluonnlp.data.SuperGlueWSC('test', root='./datasets/wsc')
-etc-
>>> len(wsc_test)
146
>>> sorted(wsc_test[16].items())
[('idx', 16), ('target', OrderedDict([('span1_text', 'life'), ('span1_index', 1), ('span2_text', 'it'), ('span2_index', 21)])), ('text', 'Your life is yours and yours alone, and if the pain outweighs the benefit, you should have the option to end it .')]
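
In the two records printed above, span1_index and span2_index match positions in the whitespace-split text. Assuming that holds in general (it is only verified here for these two samples), the spans can be looked up as follows. The helper span_tokens is hypothetical.

```python
def span_tokens(text, index, span_text):
    """Return the whitespace tokens that a span index points at."""
    tokens = text.split()
    width = len(span_text.split())
    return tokens[index:index + width]


val_text = ('The large ball crashed right through the table '
            'because it was made of styrofoam.')
test_text = ('Your life is yours and yours alone, and if the pain outweighs '
             'the benefit, you should have the option to end it .')

print(span_tokens(val_text, 6, 'The table'))   # ['the', 'table']
print(span_tokens(val_text, 9, 'it'))          # ['it']
print(span_tokens(test_text, 1, 'life'))       # ['life']
print(span_tokens(test_text, 21, 'it'))        # ['it']
```

Note that span1_text in the validation record is capitalized ('The table') while the text has 'the table', so comparisons should ignore case.
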
class gluonnlp.data.SuperGlueWiC(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_wic')[source]

The Word-in-Context (WiC) is a word sense disambiguation dataset cast as binary classification of sentence pairs. (Pilehvar and Camacho-Collados, 2019)

From https://super.gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • root (str, default "$MXNET_HOME/datasets/superglue_wic") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> wic_val = gluonnlp.data.SuperGlueWiC('val', root='./datasets/wic')
-etc-
>>> len(wic_val)
638
>>> sorted(wic_val[3].items())
[('end1', 31), ('end2', 35), ('idx', 3), ('label', True), ('sentence1', 'She gave her hair a quick brush.'), ('sentence2', 'The dentist recommended two brushes a day.'), ('start1', 26), ('start2', 28), ('version', 1.1), ('word', 'brush')]
>>> wic_test = gluonnlp.data.SuperGlueWiC('test', root='./datasets/wic')
-etc-
>>> len(wic_test)
1400
>>> sorted(wic_test[0].items())
[('end1', 46), ('end2', 22), ('idx', 0), ('sentence1', 'The smell of fried onions makes my mouth water.'), ('sentence2', 'His eyes were watering.'), ('start1', 41), ('start2', 14), ('version', 1.1), ('word', 'water')]
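
In the records printed above, start1/end1 and start2/end2 behave as character offsets into sentence1 and sentence2, with the end offset exclusive. A short sketch under that assumption, using the two sample records (target_words is a hypothetical helper):

```python
def target_words(record):
    """Slice the target word out of both sentences by character offsets."""
    first = record['sentence1'][record['start1']:record['end1']]
    second = record['sentence2'][record['start2']:record['end2']]
    return first, second


val_record = {
    'word': 'brush',
    'sentence1': 'She gave her hair a quick brush.',
    'sentence2': 'The dentist recommended two brushes a day.',
    'start1': 26, 'end1': 31, 'start2': 28, 'end2': 35,
}
test_record = {
    'word': 'water',
    'sentence1': 'The smell of fried onions makes my mouth water.',
    'sentence2': 'His eyes were watering.',
    'start1': 41, 'end1': 46, 'start2': 14, 'end2': 22,
}

print(target_words(val_record))    # ('brush', 'brushes')
print(target_words(test_record))   # ('water', 'watering')
```

The sliced strings are inflected forms, so they need not equal the 'word' field exactly.
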
class gluonnlp.data.SuperGlueCOPA(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_copa')[source]

The Choice of Plausible Alternatives (COPA) is a causal reasoning dataset. (Roemmele et al., 2011)

From https://super.gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • root (str, default "$MXNET_HOME/datasets/superglue_copa") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> copa_val = gluonnlp.data.SuperGlueCOPA('val', root='./datasets/copa')
-etc-
>>> len(copa_val)
100
>>> sorted(copa_val[0].items())
[('choice1', 'The toilet filled with water.'), ('choice2', 'Water flowed from the spout.'), ('idx', 0), ('label', 1), ('premise', 'The man turned on the faucet.'), ('question', 'effect')]
>>> copa_test = gluonnlp.data.SuperGlueCOPA('test', root='./datasets/copa')
-etc-
>>> len(copa_test)
500
>>> sorted(copa_test[0].items())
[('choice1', 'It was fragile.'), ('choice2', 'It was small.'), ('idx', 0), ('premise', 'The item was packaged in bubble wrap.'), ('question', 'cause')]
class gluonnlp.data.SuperGlueMultiRC(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_multirc')[source]

Multi-Sentence Reading Comprehension (MultiRC) is a QA dataset. (Khashabi et al., 2018)

From https://super.gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • root (str, default "$MXNET_HOME/datasets/superglue_multirc") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> multirc_val = gluonnlp.data.SuperGlueMultiRC('val', root='./datasets/multirc')
-etc-
>>> len(multirc_val)
83
>>> sorted(multirc_val[0].keys())
['questions', 'text']
>>> len(multirc_val[0]['text'])
12
>>> len(multirc_val[0]['questions'])
13
>>> sorted(multirc_val[0]['questions'][0].keys())
['answers', 'idx', 'multisent', 'question', 'sentences_used']
>>> multirc_test = gluonnlp.data.SuperGlueMultiRC('test', root='./datasets/multirc')
-etc-
>>> len(multirc_test)
166
>>> sorted(multirc_test[0].keys())
['questions', 'text']
>>> len(multirc_test[0]['text'])
14
>>> len(multirc_test[0]['questions'])
14
>>> sorted(multirc_test[0]['questions'][0].keys())
['answers', 'idx', 'multisent', 'question', 'sentences_used']
class gluonnlp.data.SuperGlueBoolQ(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_boolq')[source]

Boolean Questions (BoolQ) is a QA dataset where each example consists of a short passage and a yes/no question about it.

From https://super.gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • root (str, default "$MXNET_HOME/datasets/superglue_boolq") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> boolq_val = gluonnlp.data.SuperGlueBoolQ('val', root='./datasets/boolq')
-etc-
>>> len(boolq_val)
3270
>>> sorted(boolq_val[0].keys())
['idx', 'label', 'passage', 'question']
>>> boolq_test = gluonnlp.data.SuperGlueBoolQ('test', root='./datasets/boolq')
-etc-
>>> len(boolq_test)
3245
>>> sorted(boolq_test[0].keys())
['idx', 'passage', 'question']
class gluonnlp.data.SuperGlueReCoRD(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_record')[source]

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a multiple-choice QA dataset.

From https://super.gluebenchmark.com/tasks

Parameters
  • segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.

  • root (str, default "$MXNET_HOME/datasets/superglue_record") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> record_val = gluonnlp.data.SuperGlueReCoRD('val', root='./datasets/record')
-etc-
>>> len(record_val)
7481
>>> sorted(record_val[0].keys())
['idx', 'passage', 'qas', 'source']
>>> record_test = gluonnlp.data.SuperGlueReCoRD('test', root='./datasets/record')
-etc-
>>> len(record_test)
7484
>>> sorted(record_test[0].keys())
['idx', 'passage', 'qas', 'source']
class gluonnlp.data.SuperGlueAXb(root='/var/lib/jenkins/.mxnet/datasets/superglue_ax_b')[source]

The Broadcoverage Diagnostics (AX-b) is a diagnostics dataset labeled closely to the schema of MultiNLI.

From https://super.gluebenchmark.com/tasks

Parameters

root (str, default "$MXNET_HOME/datasets/superglue_ax_b") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> ax_b = gluonnlp.data.SuperGlueAXb(root='./datasets/ax_b')
-etc-
>>> len(ax_b)
1104
>>> sorted(ax_b[0].items())
[('idx', '0'), ('label', 'not_entailment'), ('logic', 'Negation'), ('sentence1', 'The cat sat on the mat.'), ('sentence2', 'The cat did not sit on the mat.')]
class gluonnlp.data.SuperGlueAXg(root='/var/lib/jenkins/.mxnet/datasets/superglue_ax_g')[source]

The Winogender Schema Diagnostics (AX-g) is a diagnostics dataset labeled closely to the schema of MultiNLI.

From https://super.gluebenchmark.com/tasks

Parameters

root (str, default "$MXNET_HOME/datasets/superglue_ax_g") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

Examples

>>> ax_g = gluonnlp.data.SuperGlueAXg(root='./datasets/ax_g')
-etc-
>>> len(ax_g)
356
>>> sorted(ax_g[0].items())
[('hypothesis', 'The accountant sought help filing taxes.'), ('idx', 0), ('label', 'not_entailment'), ('pair_id', 551646), ('premise', 'The taxpayer met with the accountant to get help filing his taxes.')]
class gluonnlp.data.MRPCTask[source]

The MRPC task on GlueBenchmark.

Examples

>>> MRPC = MRPCTask()
>>> MRPC.class_labels
['0', '1']
>>> type(MRPC.metrics.get_metric(0))
<class 'mxnet.metric.Accuracy'>
>>> type(MRPC.metrics.get_metric(1))
<class 'mxnet.metric.F1'>
>>> MRPC.dataset_train()[0]
-etc-
'train'
>>> len(MRPC.dataset_train()[1])
3668
>>> MRPC.dataset_dev()[0]
'dev'
>>> len(MRPC.dataset_dev()[1])
408
>>> MRPC.dataset_test()[0]
-etc-
'test'
>>> len(MRPC.dataset_test()[1])
1725
get_dataset(segment='train')[source]

Get the corresponding dataset for MRPC.

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘train’, ‘dev’, ‘test’.

class gluonnlp.data.QQPTask[source]

The Quora Question Pairs task on GlueBenchmark.

Examples

>>> QQP = QQPTask()
>>> QQP.class_labels
['0', '1']
>>> type(QQP.metrics.get_metric(0))
<class 'mxnet.metric.Accuracy'>
>>> type(QQP.metrics.get_metric(1))
<class 'mxnet.metric.F1'>
>>> import warnings
>>> with warnings.catch_warnings():
...     # Ignore warnings triggered by invalid entries in GlueQQP set
...     warnings.simplefilter("ignore")
...     QQP.dataset_train()[0]
-etc-
'train'
>>> QQP.dataset_test()[0]
-etc-
'test'
>>> len(QQP.dataset_test()[1])
390965
get_dataset(segment='train')[source]

Get the corresponding dataset for QQP.

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘train’, ‘dev’, ‘test’.

class gluonnlp.data.QNLITask[source]

The SQuAD NLI task on GlueBenchmark.

Examples

>>> QNLI = QNLITask()
>>> QNLI.class_labels
['not_entailment', 'entailment']
>>> type(QNLI.metrics)
<class 'mxnet.metric.Accuracy'>
>>> QNLI.dataset_train()[0]
-etc-
'train'
>>> len(QNLI.dataset_train()[1])
108436
>>> QNLI.dataset_dev()[0]
-etc-
'dev'
>>> len(QNLI.dataset_dev()[1])
5732
>>> QNLI.dataset_test()[0]
-etc-
'test'
>>> len(QNLI.dataset_test()[1])
5740
get_dataset(segment='train')[source]

Get the corresponding dataset for QNLI.

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘train’, ‘dev’, ‘test’.

class gluonnlp.data.RTETask[source]

The Recognizing Textual Entailment task on GlueBenchmark.

Examples

>>> RTE = RTETask()
>>> RTE.class_labels
['not_entailment', 'entailment']
>>> type(RTE.metrics)
<class 'mxnet.metric.Accuracy'>
>>> RTE.dataset_train()[0]
-etc-
'train'
>>> len(RTE.dataset_train()[1])
2490
>>> RTE.dataset_dev()[0]
-etc-
'dev'
>>> len(RTE.dataset_dev()[1])
277
>>> RTE.dataset_test()[0]
-etc-
'test'
>>> len(RTE.dataset_test()[1])
3000
get_dataset(segment='train')[source]

Get the corresponding dataset for RTE.

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘train’, ‘dev’, ‘test’.

class gluonnlp.data.STSBTask[source]

The Semantic Textual Similarity Benchmark task on GlueBenchmark.

Examples

>>> STSB = STSBTask()
>>> STSB.class_labels
>>> type(STSB.metrics)
<class 'mxnet.metric.PearsonCorrelation'>
>>> STSB.dataset_train()[0]
-etc-
'train'
>>> len(STSB.dataset_train()[1])
5749
>>> STSB.dataset_dev()[0]
-etc-
'dev'
>>> len(STSB.dataset_dev()[1])
1500
>>> STSB.dataset_test()[0]
-etc-
'test'
>>> len(STSB.dataset_test()[1])
1379
get_dataset(segment='train')[source]

Get the corresponding dataset for STSB.

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘train’, ‘dev’, ‘test’.

class gluonnlp.data.CoLATask[source]

The Warstadt acceptability task on GlueBenchmark.

Examples

>>> CoLA = CoLATask()
>>> CoLA.class_labels
['0', '1']
>>> type(CoLA.metrics)
<class 'mxnet.metric.MCC'>
>>> CoLA.dataset_train()[0]
-etc-
'train'
>>> len(CoLA.dataset_train()[1])
8551
>>> CoLA.dataset_dev()[0]
-etc-
'dev'
>>> len(CoLA.dataset_dev()[1])
1043
>>> CoLA.dataset_test()[0]
-etc-
'test'
>>> len(CoLA.dataset_test()[1])
1063
get_dataset(segment='train')[source]

Get the corresponding dataset for CoLA.

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘train’, ‘dev’, ‘test’.

class gluonnlp.data.MNLITask[source]

The Multi-Genre Natural Language Inference task on GlueBenchmark.

Examples

>>> MNLI = MNLITask()
>>> MNLI.class_labels
['neutral', 'entailment', 'contradiction']
>>> type(MNLI.metrics)
<class 'mxnet.metric.Accuracy'>
>>> MNLI.dataset_train()[0]
-etc-
'train'
>>> len(MNLI.dataset_train()[1])
392702
>>> MNLI.dataset_dev()[0][0]
-etc-
'dev_matched'
>>> len(MNLI.dataset_dev()[0][1])
9815
>>> MNLI.dataset_dev()[1][0]
'dev_mismatched'
>>> len(MNLI.dataset_dev()[1][1])
9832
>>> MNLI.dataset_test()[0][0]
-etc-
'test_matched'
>>> len(MNLI.dataset_test()[0][1])
9796
>>> MNLI.dataset_test()[1][0]
'test_mismatched'
>>> len(MNLI.dataset_test()[1][1])
9847
dataset_dev()[source]

Get the dev segment of the dataset for the task.

Returns

list of TSVDataset

Return type

the dataset of the dev segment.

dataset_test()[source]

Get the test segment of the dataset for the task.

Returns

list of TSVDataset

Return type

the dataset of the test segment.

get_dataset(segment='train')[source]

Get the corresponding dataset for MNLI

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘dev_matched’, ‘dev_mismatched’, ‘test_matched’, ‘test_mismatched’, ‘train’

class gluonnlp.data.WNLITask[source]

The Winograd NLI task on GlueBenchmark.

Examples

>>> WNLI = WNLITask()
>>> WNLI.class_labels
['0', '1']
>>> type(WNLI.metrics)
<class 'mxnet.metric.Accuracy'>
>>> WNLI.dataset_train()[0]
-etc-
'train'
>>> len(WNLI.dataset_train()[1])
635
>>> WNLI.dataset_dev()[0]
-etc-
'dev'
>>> len(WNLI.dataset_dev()[1])
71
>>> WNLI.dataset_test()[0]
-etc-
'test'
>>> len(WNLI.dataset_test()[1])
146
get_dataset(segment='train')[source]

Get the corresponding dataset for WNLI

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘dev’, ‘test’, ‘train’

class gluonnlp.data.SSTTask[source]

The Stanford Sentiment Treebank task on GlueBenchmark.

Examples

>>> SST = SSTTask()
>>> SST.class_labels
['0', '1']
>>> type(SST.metrics)
<class 'mxnet.metric.Accuracy'>
>>> SST.dataset_train()[0]
-etc-
'train'
>>> len(SST.dataset_train()[1])
67349
>>> SST.dataset_dev()[0]
-etc-
'dev'
>>> len(SST.dataset_dev()[1])
872
>>> SST.dataset_test()[0]
-etc-
'test'
>>> len(SST.dataset_test()[1])
1821
get_dataset(segment='train')[source]

Get the corresponding dataset for SST

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘train’, ‘dev’, ‘test’.

class gluonnlp.data.XNLITask[source]

The XNLI task using the dataset released by Baidu <https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE>.

Examples

>>> XNLI = XNLITask()
>>> XNLI.class_labels
['neutral', 'entailment', 'contradiction']
>>> type(XNLI.metrics)
<class 'mxnet.metric.Accuracy'>
>>> XNLI.dataset_train()[0]
'train'
>>> len(XNLI.dataset_train()[1])
392702
>>> XNLI.dataset_dev()[0]
'dev'
>>> len(XNLI.dataset_dev()[1])
2490
>>> XNLI.dataset_test()[0]
'test'
>>> len(XNLI.dataset_test()[1])
5010
get_dataset(segment='train')[source]

Get the corresponding dataset for XNLI.

Parameters

segment (str, default 'train') – Dataset segments. Options are ‘dev’, ‘test’, ‘train’

gluonnlp.data.get_task(task)[source]

Returns a pre-defined GLUE task by name.

Parameters

task (str) – Options include ‘MRPC’, ‘QNLI’, ‘RTE’, ‘STS-B’, ‘CoLA’, ‘MNLI’, ‘WNLI’, ‘SST’, ‘XNLI’, ‘LCQMC’, ‘ChnSentiCorp’

Returns

Return type

GlueTask

class gluonnlp.data.BaiduErnieXNLI(segment='train', root='~/.mxnet/datasets/baidu_ernie_data', return_all_fields=False)[source]

The XNLI dataset redistributed by Baidu <https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE>.

Original from: Conneau, Alexis, et al. “Xnli: Evaluating cross-lingual sentence representations.” arXiv preprint arXiv:1809.05053 (2018). https://github.com/facebookresearch/XNLI

Licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. License details: https://creativecommons.org/licenses/by-nc/4.0/

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/baidu_ernie_task_data') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> xnli_dev = BaiduErnieXNLI('dev', root='./datasets/baidu_ernie_task_data/')
>>> len(xnli_dev)
2490
>>> len(xnli_dev[0])
3
>>> xnli_dev[0]
['他说,妈妈,我回来了。', '校车把他放下后,他立即给他妈妈打了电话。', 'neutral']
>>> xnli_test = BaiduErnieXNLI('test', root='./datasets/baidu_ernie_task_data/')
>>> len(xnli_test)
5010
>>> len(xnli_test[0])
2
>>> xnli_test[0]
['嗯,我根本没想过,但是我很沮丧,最后我又和他说话了。', '我还没有和他再次谈论。']
class gluonnlp.data.BaiduErnieLCQMC(file_path, segment='train', return_all_fields=False)[source]

The LCQMC dataset, originally from: Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, Buzhou Tang. LCQMC: A Large-scale Chinese Question Matching Corpus. COLING 2018.

No license granted. You can request a private license via http://icrc.hitsz.edu.cn/LCQMC_Application_Form.pdf. The code expects the dataset format that was redistributed by Baidu in the ERNIE repo. (Baidu no longer hosts this version.)

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • file_path (str) – Path to the downloaded dataset file.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

class gluonnlp.data.BaiduErnieChnSentiCorp(segment='train', root='~/.mxnet/datasets/baidu_ernie_data', return_all_fields=False)[source]

The ChnSentiCorp dataset redistributed by Baidu <https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE>.

Original from Tan Songbo (Chinese Academy of Sciences, tansongbo@software.ict.ac.cn).

Parameters
  • segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.

  • root (str, default '$MXNET_HOME/datasets/baidu_ernie_task_data') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.

  • return_all_fields (bool, default False) – Return all fields available in the dataset.

Examples

>>> chnsenticorp_dev = BaiduErnieChnSentiCorp('dev', root='./datasets/baidu_ernie_task_data/')
>>> len(chnsenticorp_dev)
1200
>>> len(chnsenticorp_dev[0])
2
>>> chnsenticorp_dev[2]
['商品的不足暂时还没发现,京东的订单处理速度实在.......周二就打包完成,周五才发货...', '0']
>>> chnsenticorp_test = BaiduErnieChnSentiCorp('test', root='./datasets/baidu_ernie_task_data/')
>>> len(chnsenticorp_test)
1200
>>> len(chnsenticorp_test[0])
1
>>> chnsenticorp_test[0]
['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般']
class gluonnlp.data.DatasetLoader(file_patterns, file_sampler, dataset_fn=None, batch_sampler_fn=None, dataset_params=None, batch_sampler_params=None, batchify_fn=None, num_dataset_workers=0, num_batch_workers=0, pin_memory=False, circle_length=1, dataset_prefetch=None, batch_prefetch=None, dataset_cached=False, num_max_dataset_cached=0)[source]

Loads data from a list of datasets and returns mini-batches of data.

One dataset is loaded at a time.

Parameters
  • file_patterns (str) – Path to the input text files.

  • file_sampler (str or gluon.data.Sampler, defaults to 'random') –

    The sampler used to sample a file. The following string values are supported:

    • ’sequential’: SequentialSampler

    • ’random’: RandomSampler

  • dataset_fn (DatasetFn, callable) – Callable object to generate a gluon.data.Dataset given a url.

  • batch_sampler_fn (SamplerFn, callable) – Callable object to generate a gluon.data.sampler.Sampler given a dataset.

  • dataset_params (dict, default is None) – Dictionary of parameters passed to dataset_fn.

  • batch_sampler_params (dict, default is None) – Dictionary of parameters passed to batch_sampler_fn.

  • batchify_fn (callable) –

    Callback function to allow users to specify how to merge samples into a batch. Defaults to default_batchify_fn:

    def default_batchify_fn(data):
        if isinstance(data[0], nd.NDArray):
            return nd.stack(*data)
        elif isinstance(data[0], tuple):
            data = zip(*data)
            return [default_batchify_fn(i) for i in data]
        else:
            data = np.asarray(data)
            return nd.array(data, dtype=data.dtype)
    

  • num_dataset_workers (int) – Number of worker processes for dataset creation.

  • num_batch_workers (int) – Number of worker processes for batch creation.

  • pin_memory (boolean, default False) – If True, the dataloader will copy NDArrays into pinned memory before returning them. Copying from CPU pinned memory to GPU is faster than from normal CPU memory. At the same time, it increases GPU memory usage.

  • circle_length (int, default is 1) – The number of files to be read at the same time. When circle_length is larger than 1, circle_length files are merged.

  • dataset_prefetch (int, default is num_dataset_workers) – The number of datasets to prefetch. Only takes effect if num_dataset_workers > 0. If dataset_prefetch > 0, worker processes prefetch that many datasets before data is requested from the iterators. A larger value gives smoother bootstrapping performance but consumes more memory. A smaller value may defeat the purpose of using multiple worker processes; try reducing num_dataset_workers in that case.

  • batch_prefetch (int, default is num_batch_workers * 2) – The number of batches to prefetch. Only takes effect if num_batch_workers > 0. If batch_prefetch > 0, worker processes prefetch that many batches before data is requested from the iterators. A larger value gives smoother bootstrapping performance but consumes more shared memory. A smaller value may defeat the purpose of using multiple worker processes; try reducing num_batch_workers in that case.

  • dataset_cached (bool, default is False) – Whether to cache the last processed dataset. Each processed dataset can be cached only once. When no new processed dataset is available to fetch, a cached processed dataset is popped instead.

  • num_max_dataset_cached (int, default is 0) – Maximum number of cached datasets. Only valid if dataset_cached is True.

gluonnlp.data.truncate_seqs_equal(sequences, max_len)[source]

Truncate a list of sequences as equally as possible so that their total length does not exceed max_len.

Parameters
  • sequences (list of list of object) – Sequences of tokens, each of which is an iterable of tokens.

  • max_len (int) – Max length to be truncated to.

Returns

list

Return type

list of truncated sequences, in the original order

Examples

>>> seqs = [[1, 2, 3], [4, 5, 6]]
>>> truncate_seqs_equal(seqs, 6)
[[1, 2, 3], [4, 5, 6]]
>>> seqs = [[1, 2, 3], [4, 5, 6]]
>>> truncate_seqs_equal(seqs, 4)
[[1, 2], [4, 5]]
>>> seqs = [[1, 2, 3], [4, 5, 6]]
>>> truncate_seqs_equal(seqs, 3)
[[1, 2], [4]]
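The truncation behavior shown in the examples above can be approximated in plain Python. This is a minimal sketch, not the gluonnlp implementation: the name truncate_seqs_equal_sketch is hypothetical, and the tie-breaking rule (trim the later of two equally long sequences first) is an assumption chosen because it reproduces the documented outputs.

```python
def truncate_seqs_equal_sketch(sequences, max_len):
    """Illustrative re-implementation; not the gluonnlp source."""
    lengths = [len(seq) for seq in sequences]
    # Drop one token at a time from the currently longest sequence.
    # Ties go to the later sequence, so earlier sequences keep more tokens
    # (an assumption that matches the documented examples).
    while sum(lengths) > max_len:
        longest = max(range(len(lengths)), key=lambda i: (lengths[i], i))
        lengths[longest] -= 1
    return [list(seq[:n]) for seq, n in zip(sequences, lengths)]


seqs = [[1, 2, 3], [4, 5, 6]]
print(truncate_seqs_equal_sketch(seqs, 6))  # [[1, 2, 3], [4, 5, 6]]
print(truncate_seqs_equal_sketch(seqs, 4))  # [[1, 2], [4, 5]]
print(truncate_seqs_equal_sketch(seqs, 3))  # [[1, 2], [4]]
```

Sequences that are already short are left alone: only the longest ones lose tokens, which is what "equally" means here.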
gluonnlp.data.concat_sequences(seqs, separators, seq_mask=0, separator_mask=1)[source]

Concatenate sequences in a list into a single sequence, using specified separators.

Example 1:

    seqs: [['is', 'this', 'jacksonville', '?'], ['no', 'it', 'is', 'not', '.']]
    separators: [[SEP], [SEP], [CLS]]
    seq_mask: 0
    separator_mask: 1

    Returns:
    tokens:      is this jacksonville ? [SEP] no it is not . [SEP] [CLS]
    segment_ids: 0 0 0 0 0 1 1 1 1 1 1 2
    p_mask:      0 0 0 0 1 0 0 0 0 0 1 1

Example 2: separator_mask can also be a list.

    seqs: [['is', 'this', 'jacksonville', '?'], ['no', 'it', 'is', 'not', '.']]
    separators: [[SEP], [SEP], [CLS]]
    seq_mask: 0
    separator_mask: [[1], [1], [0]]

    Returns:
    tokens:      is this jacksonville ? [SEP] no it is not . [SEP] [CLS]
    segment_ids: 0 0 0 0 0 1 1 1 1 1 1 2
    p_mask:      0 0 0 0 1 0 0 0 0 0 1 0

Example 3: seq_mask can also be a list.

    seqs: [['is', 'this', 'jacksonville', '?'], ['no', 'it', 'is', 'not', '.']]
    separators: [[SEP], [SEP], [CLS]]
    seq_mask: [[1, 1, 1, 1], [0, 0, 0, 0, 0]]
    separator_mask: [[1], [1], [0]]

    Returns:
    tokens:      is this jacksonville ? [SEP] no it is not . [SEP] [CLS]
    segment_ids: 0 0 0 0 0 1 1 1 1 1 1 2
    p_mask:      1 1 1 1 1 0 0 0 0 0 1 0

Parameters
  • seqs (list of list of object) – sequences to be concatenated

  • separators (list of list of object) – The special tokens to separate sequences.

  • seq_mask (int or list of list of int) – A single mask value for all sequence items or a list of values for each item in sequences

  • separator_mask (int or list of list of int) – A single mask value for all separators or a list of values for each separator

Returns

  • np.array (input token ids in ‘int32’, shape (batch_size, seq_length))

  • np.array (segment ids in ‘int32’, shape (batch_size, seq_length))

  • np.array (mask for special tokens)
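The concatenation described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the gluonnlp source: the name concat_sequences_sketch is hypothetical, it returns plain lists rather than arrays, and it assumes that a separator with no matching sequence (such as the trailing [CLS]) gets its own segment id.

```python
from itertools import zip_longest


def concat_sequences_sketch(seqs, separators, seq_mask=0, separator_mask=1):
    """Illustrative re-implementation; not the gluonnlp source."""
    # Expand scalar masks into one mask value per token / per separator.
    if isinstance(seq_mask, int):
        seq_mask = [[seq_mask] * len(seq) for seq in seqs]
    if isinstance(separator_mask, int):
        separator_mask = [[separator_mask] * len(sep) for sep in separators]

    tokens, segment_ids, p_mask = [], [], []
    # zip_longest pads the shorter inputs with [] so that extra separators
    # (with no sequence in front of them) are still emitted.
    groups = zip_longest(seqs, separators, seq_mask, separator_mask, fillvalue=[])
    for segment, (seq, sep, s_mask, sep_mask) in enumerate(groups):
        tokens += list(seq) + list(sep)
        segment_ids += [segment] * (len(seq) + len(sep))
        p_mask += list(s_mask) + list(sep_mask)
    return tokens, segment_ids, p_mask


seqs = [['is', 'this', 'jacksonville', '?'], ['no', 'it', 'is', 'not', '.']]
separators = [['[SEP]'], ['[SEP]'], ['[CLS]']]
tokens, segment_ids, p_mask = concat_sequences_sketch(seqs, separators)
print(' '.join(tokens))  # is this jacksonville ? [SEP] no it is not . [SEP] [CLS]
print(segment_ids)       # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2]
print(p_mask)            # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
```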

gluonnlp.data.tokenize_and_align_positions(origin_text, start_position, end_position, tokenizer)[source]

Tokenize the text and align the origin positions to the corresponding position.

Parameters
  • origin_text (list) – list of tokens to be tokenized.

  • start_position (int) – Start position in the origin_text

  • end_position (int) – End position in the origin_text

  • tokenizer (callable function, e.g., BERTTokenizer.) –

Returns

  • int (Aligned start position)

  • int (Aligned end position)

  • list (tokenized text)

  • list (map from the origin index to the tokenized sequence index)

  • list (map from tokenized sequence index to the origin index)

Examples

>>> from gluonnlp.vocab import BERTVocab
>>> from gluonnlp.data import count_tokens, BERTTokenizer
>>> origin_text = ['is', 'this', 'jacksonville', '?']
>>> vocab_tokens = ['is', 'this', 'jack', '##son', '##ville', '?']
>>> bert_vocab = BERTVocab(count_tokens(vocab_tokens))
>>> tokenizer = BERTTokenizer(vocab=bert_vocab)
>>> out = tokenize_and_align_positions(origin_text, 0, 2, tokenizer)
>>> out[0] # start_position
0
>>> out[1] # end_position
4
>>> out[2] # tokenized_text
['is', 'this', 'jack', '##son', '##ville', '?']
>>> out[3] # orig_to_tok_index
[0, 1, 2, 5]
>>> out[4] # tok_to_orig_index
[0, 1, 2, 2, 2, 3]
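The alignment logic behind the example above can be sketched without gluonnlp installed. This is a minimal sketch, not the library source: align_positions_sketch and toy_tokenizer are hypothetical names, and the toy tokenizer is a stand-in that only knows how to split 'jacksonville'.

```python
def align_positions_sketch(origin_text, start_position, end_position, tokenizer):
    """Illustrative re-implementation; not the gluonnlp source."""
    tokenized_text, orig_to_tok, tok_to_orig = [], [], []
    for i, token in enumerate(origin_text):
        # Record where the first sub-token of this original token lands.
        orig_to_tok.append(len(tokenized_text))
        sub_tokens = tokenizer(token)
        tokenized_text.extend(sub_tokens)
        tok_to_orig.extend([i] * len(sub_tokens))

    new_start = orig_to_tok[start_position]
    # The aligned end is the last sub-token of the original end token.
    if end_position + 1 < len(origin_text):
        new_end = orig_to_tok[end_position + 1] - 1
    else:
        new_end = len(tokenized_text) - 1
    return new_start, new_end, tokenized_text, orig_to_tok, tok_to_orig


# Hypothetical stand-in for a WordPiece tokenizer.
TOY_PIECES = {'jacksonville': ['jack', '##son', '##ville']}


def toy_tokenizer(token):
    return TOY_PIECES.get(token, [token])


out = align_positions_sketch(['is', 'this', 'jacksonville', '?'], 0, 2, toy_tokenizer)
print(out[0], out[1])  # 0 4
print(out[3])          # [0, 1, 2, 5]
print(out[4])          # [0, 1, 2, 2, 2, 3]
```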
gluonnlp.data.get_doc_spans(full_doc, max_length, doc_stride)[source]

Obtain document spans by sliding a window across the document

Parameters
  • full_doc (list) – The original document text

  • max_length (int) – Maximum size of a doc span

  • doc_stride (int) – Step of sliding window

Returns

  • list (a list of processed doc spans)

  • list (a list of start/end index of each doc span)
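The sliding window can be sketched as follows. This is a minimal sketch, not the gluonnlp source: get_doc_spans_sketch is a hypothetical name, and treating the end index of each span as exclusive is an assumption.

```python
def get_doc_spans_sketch(full_doc, max_length, doc_stride):
    """Illustrative re-implementation; not the gluonnlp source."""
    if max_length <= 0 or doc_stride <= 0:
        raise ValueError('max_length and doc_stride must be positive')
    spans, indices = [], []
    start = 0
    while start < len(full_doc):
        end = min(start + max_length, len(full_doc))
        spans.append(full_doc[start:end])
        indices.append((start, end))  # end is exclusive (an assumption)
        if end == len(full_doc):
            break  # the window has reached the end of the document
        start += min(end - start, doc_stride)
    return spans, indices


doc = 'the man went to the store and bought a gallon of milk'.split()
spans, indices = get_doc_spans_sketch(doc, max_length=5, doc_stride=3)
print(indices)              # [(0, 5), (3, 8), (6, 11), (9, 12)]
print(' '.join(spans[1]))   # to the store and bought
```

A doc_stride smaller than max_length makes consecutive spans overlap, so a token near a span boundary also appears with more context in a neighboring span.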

gluonnlp.data.align_position2doc_spans(positions, doc_spans_indices, offset=0, default_value=-1, all_in_span=True)[source]

Align original positions to the corresponding document span positions

Parameters
  • positions (list or int) – A single or a list of positions to be aligned

  • doc_spans_indices (list or tuple) – Contains the start/end position of the doc_spans. Typically, (start_position, end_position)

  • offset (int) – Offset of aligned positions. When the doc spans are appended after a question text, the new positions should be shifted by len(question_text).

  • default_value (int) – The default value to return if the positions are not in the doc span.

  • all_in_span (bool) – If set to True, then as long as one position is out of span, all positions would be set to default_value.

Returns

list

Return type

a list of aligned positions

Examples

>>> positions = [2, 6]
>>> doc_span_indices = [1, 4]
>>> align_position2doc_spans(positions, doc_span_indices, default_value=-2)
[-2, -2]
>>> align_position2doc_spans(positions, doc_span_indices, default_value=-2, all_in_span=False)
[1, -2]
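The effect of offset and all_in_span can be seen in a plain-Python sketch. This is an illustration under stated assumptions, not the gluonnlp source: align_positions_to_span_sketch is a hypothetical name, and the span end index is treated as inclusive.

```python
def align_positions_to_span_sketch(positions, span, offset=0,
                                   default_value=-1, all_in_span=True):
    """Illustrative re-implementation; not the gluonnlp source."""
    if not isinstance(positions, list):
        positions = [positions]
    start, end = span  # end is treated as inclusive (an assumption)
    inside = [start <= p <= end for p in positions]
    # With all_in_span, one out-of-span position invalidates all of them.
    if all_in_span and not all(inside):
        return [default_value] * len(positions)
    return [p - start + offset if ok else default_value
            for p, ok in zip(positions, inside)]


print(align_positions_to_span_sketch([2, 6], [1, 4], default_value=-2))
# [-2, -2]
print(align_positions_to_span_sketch([2, 6], [1, 4], default_value=-2,
                                     all_in_span=False))
# [1, -2]
# Shift by 10, e.g. when the span follows a 10-token question.
print(align_positions_to_span_sketch([2, 3], [1, 4], offset=10))
# [11, 12]
```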
gluonnlp.data.improve_answer_span(doc_tokens, input_start, input_end, tokenizer, orig_answer_text)[source]

Returns tokenized answer spans that better match the annotated answer.

The SQuAD annotations are character based. We first project them to whitespace-tokenized words. But then after WordPiece tokenization, we can often find a “better match”. For example:

Question: What year was John Smith born?
Context: The leader was John Smith (1895-1943).
Answer: 1895

The original whitespace-tokenized answer will be “(1895-1943).”. However, after tokenization, our tokens will be “( 1895 - 1943 ) .”, so we can match the exact answer, 1895.

However, this is not always possible. Consider the following:

Question: What country is the top exporter of electronics?
Context: The Japanese electronics industry is the largest in the world.
Answer: Japan

In this case, the annotator chose “Japan” as a character sub-span of the word “Japanese”. Since our WordPiece tokenizer does not split “Japanese”, we just use “Japanese” as the annotation. This is fairly rare in SQuAD, but does happen.

Parameters
  • doc_tokens (list) – A list of doc tokens

  • input_start (int) – start position of the answer

  • input_end (int) – end position of the answer

  • tokenizer (callable function) –

  • orig_answer_text (str) – origin answer text.

Returns

tuple

Return type

a tuple of improved start position and end position
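The search for a better match can be sketched as a scan over sub-spans of the input span. This is a minimal sketch, not the gluonnlp source: improve_answer_span_sketch and toy_tokenizer are hypothetical names, and the regex tokenizer is only a stand-in for a WordPiece tokenizer.

```python
import re


def improve_answer_span_sketch(doc_tokens, input_start, input_end,
                               tokenizer, orig_answer_text):
    """Illustrative re-implementation; not the gluonnlp source."""
    target = ' '.join(tokenizer(orig_answer_text))
    # Try every sub-span of [input_start, input_end], widest first for
    # each start, and return the first one whose text equals the answer.
    for new_start in range(input_start, input_end + 1):
        for new_end in range(input_end, new_start - 1, -1):
            if ' '.join(doc_tokens[new_start:new_end + 1]) == target:
                return new_start, new_end
    # No exact match: keep the original span.
    return input_start, input_end


def toy_tokenizer(text):
    # Hypothetical tokenizer: split into words and single punctuation marks.
    return re.findall(r'\w+|[^\w\s]', text)


doc_tokens = toy_tokenizer('The leader was John Smith (1895-1943).')
# Tokens 5..10 cover the whitespace word "(1895-1943)."
print(improve_answer_span_sketch(doc_tokens, 5, 10, toy_tokenizer, '1895'))
# (6, 6)

doc_tokens = toy_tokenizer('The Japanese electronics industry is the largest in the world.')
print(improve_answer_span_sketch(doc_tokens, 1, 1, toy_tokenizer, 'Japan'))
# (1, 1)
```

In the second call no sub-span spells "Japan", so the span falls back to the whole token "Japanese".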

gluonnlp.data.check_is_max_context(doc_spans, cur_span_index, position)[source]

Check if this is the ‘max context’ doc span for the token.

Because of the sliding window approach taken to scoring documents, a single token can appear in multiple doc spans. E.g.

Doc: the man went to the store and bought a gallon of milk
Span A: the man went to the
Span B: to the store and bought
Span C: and bought a gallon of …

Now the word ‘bought’ will have two scores from spans B and C. We only want to consider the score with “maximum context”, which we define as the minimum of its left and right context (the sum of left and right context will always be the same, of course).

In the example the maximum context for ‘bought’ would be span C since it has 1 left context and 3 right context, while span B has 4 left context and 0 right context.

Note that position is the absolute position in the origin text.

Parameters
  • doc_spans (list) – A list of doc spans

  • cur_span_index (int) – The index of doc span to be checked in doc_spans.

  • position (int) – Position of the token to be checked.

Returns

bool

Return type

True if the token has ‘max context’.
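The "maximum context" rule can be sketched in a few lines. This is an illustration under stated assumptions, not the gluonnlp source: is_max_context_sketch is a hypothetical name, each doc span is represented as a (start, end) pair with an exclusive end, and ties are resolved in favor of the earliest span.

```python
def is_max_context_sketch(doc_spans, cur_span_index, position):
    """Illustrative re-implementation; not the gluonnlp source."""
    best_score, best_index = None, None
    for index, (start, end) in enumerate(doc_spans):  # end is exclusive
        if not start <= position < end:
            continue  # the token does not appear in this span
        left_context = position - start
        right_context = end - 1 - position
        score = min(left_context, right_context)
        if best_score is None or score > best_score:
            best_score, best_index = score, index
    return cur_span_index == best_index


# Spans A, B and C over
# "the man went to the store and bought a gallon of milk".
spans = [(0, 5), (3, 8), (6, 11)]
bought = 7  # absolute position of 'bought' in the document
print(is_max_context_sketch(spans, 1, bought))  # False (span B: min(4, 0) = 0)
print(is_max_context_sketch(spans, 2, bought))  # True  (span C: min(1, 3) = 1)
```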

gluonnlp.data.convert_squad_examples(record, is_training)[source]

Read a single entry of gluonnlp.data.SQuAD and convert it to an example.

Parameters
  • record (list) – An entry of gluonnlp.data.SQuAD

  • is_training (bool) – If the example is used for training, then a rough start/end position will be generated

Returns

SquadExample

Return type

An instance of SquadExample

gluonnlp.data.convert_index(index_map, pos, M=None, is_start=True)[source]

Convert a token index to the corresponding index in the original text. Works best together with lcs_match().

Parameters
  • index_map (list of int) – Typically, a map from original indices to converted indices

  • pos (int) – The origin index to be converted.

  • M (int) – The maximum index.

  • is_start (bool) – True if pos is a start position.

Returns

int

Return type

the converted index regarding index_map

gluonnlp.data.lcs_match(max_dist, seq1, seq2, max_seq_length=1024, lower=False)[source]

Longest common subsequence (LCS) match.

Unlike standard LCS, this version is optimized for the case where the mismatch between sentence pieces and the original text is small, so only tokens within max_dist of each other are compared.

Parameters
  • max_dist (int) – The max distance between tokens to be considered.

  • seq1 (list) – The first sequence to be matched.

  • seq2 (list) – The second sequence to be matched.

  • lower (bool) – Whether to match lower-cased tokens.

Returns

  • numpyArray (Token-wise lcs matrix f. Shape of (max(len(seq1), 1024), max(len(seq2), 1024)))

  • Map (The dp path in matrix f.) – g[(i, j)] == 2 if token_i in seq1 matches token_j in seq2. g[(i, j)] == 1 if token_i in seq1 matches token_{j-1} in seq2. g[(i, j)] == 0 if token_{i-1} in seq1 matches token_j in seq2.
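The banded dynamic program and the meaning of the path map g can be sketched in plain Python. This is an illustrative sketch, not the gluonnlp source: banded_lcs_sketch and backtrace_matches are hypothetical names, f is stored in a dict instead of a NumPy array, max_seq_length is omitted, and the band is assumed to include both j = i - max_dist and j = i + max_dist.

```python
from collections import defaultdict


def banded_lcs_sketch(seq1, seq2, max_dist, lower=False):
    """Illustrative banded LCS; not the gluonnlp source."""
    if lower:
        seq1 = [t.lower() for t in seq1]
        seq2 = [t.lower() for t in seq2]
    f = defaultdict(int)  # f[(i, j)]: LCS length of seq1[:i+1] and seq2[:j+1]
    g = {}                # g[(i, j)]: which neighbor the value came from
    for i in range(len(seq1)):
        # Only visit cells within max_dist of the diagonal.
        for j in range(max(0, i - max_dist), min(len(seq2), i + max_dist + 1)):
            if i > 0:
                f[(i, j)], g[(i, j)] = f[(i - 1, j)], 0  # carried from (i-1, j)
            if j > 0 and f[(i, j - 1)] > f[(i, j)]:
                f[(i, j)], g[(i, j)] = f[(i, j - 1)], 1  # carried from (i, j-1)
            diagonal = f[(i - 1, j - 1)] if i > 0 and j > 0 else 0
            if seq1[i] == seq2[j] and diagonal + 1 > f[(i, j)]:
                f[(i, j)], g[(i, j)] = diagonal + 1, 2   # token_i matches token_j
    return f, g


def backtrace_matches(g, i, j):
    """Follow g back from (i, j) and collect the matched index pairs."""
    pairs = []
    while i >= 0 and j >= 0 and (i, j) in g:
        move = g[(i, j)]
        if move == 2:
            pairs.append((i, j))
            i, j = i - 1, j - 1
        elif move == 1:
            j -= 1
        else:
            i -= 1
    return pairs[::-1]


f, g = banded_lcs_sketch(list('abcd'), list('abd'), max_dist=2)
print(f[(3, 2)])                   # 3
print(backtrace_matches(g, 3, 2))  # [(0, 0), (1, 1), (3, 2)]
```

Restricting j to a band around i keeps the cost proportional to len(seq1) * max_dist rather than len(seq1) * len(seq2), which is the optimization the description refers to.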