gluonnlp.data¶
GluonNLP Toolkit provides tools for building efficient data pipelines for NLP tasks.
Public Datasets¶
Popular datasets for NLP tasks are provided in gluonnlp. By default, all built-in datasets are automatically downloaded from a public repository and stored in ~/.mxnet/datasets/.
Language modeling¶
WikiText is a popular language modeling dataset from Salesforce. It is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
Google 1 Billion Words is a popular language modeling dataset. It is a collection of over 0.8 billion tokens extracted from the WMT11 website. The dataset is available under the Apache License.
WikiText-2 word-level dataset for language modeling, from Salesforce research.
WikiText-103 word-level dataset for language modeling, from Salesforce research.
WikiText-2 character-level dataset for language modeling
WikiText-103 character-level dataset for language modeling
1-Billion-Word word-level dataset for language modeling, from Google.
Text Classification¶
IMDB is a popular dataset for binary sentiment classification. It provides a set of 25,000 highly polar movie reviews for training, 25,000 for testing, and additional unlabeled data.
MR is a movie-review data set of 10,662 sentences labeled with respect to their overall sentiment polarity (positive or negative).
SST-1 is an extension of the MR data set. However, training/test splits are provided and labels are fine-grained (very positive, positive, neutral, negative, very negative). The training and test data sets have 237,107 and 2,210 sentences respectively.
SST-2 is the same as SST-1 with neutral sentences removed and only binary sentiment polarity considered: very positive is treated as positive, and very negative is treated as negative.
SUBJ is a Subjectivity data set for sentiment analysis. Sentences are labeled with respect to their subjectivity status (subjective or objective).
TREC is a question data set for question classification. Questions are labeled with respect to their question type.
CR is customer reviews of various products (cameras, MP3s etc.). Sentences are labeled with respect to their overall sentiment polarities (positive or negative).
MPQA is an opinion polarity detection subtask. Sentences are labeled with respect to their overall sentiment polarities (positive or negative).
IMDB reviews for sentiment analysis.
Movie reviews for sentiment analysis.
Stanford Sentiment Treebank: an extension of the MR data set.
Stanford Sentiment Treebank: an extension of the MR data set.
Subjectivity dataset for sentiment analysis.
Question dataset for question classification.
Customer reviews of various products (cameras, MP3s etc.).
Opinion polarity detection subtask of the MPQA dataset.
Word Embedding Evaluation Datasets¶
There are a number of commonly used datasets for intrinsic evaluation of word embeddings.
The similarity-based evaluation datasets include:
WordSim353 dataset.
MEN dataset for word-similarity and relatedness.
MTurk dataset for word-similarity and relatedness by Radinsky et al.
Rare words dataset for word-similarity and relatedness.
SimLex999 dataset for word-similarity.
SimVerb3500 dataset for word-similarity.
SemEval17Task2 dataset for word-similarity.
Verb143 dataset.
Verb-130 dataset.
Analogy-based evaluation datasets include:
Google analogy test set
Bigger analogy test set
CoNLL Datasets¶
The CoNLL datasets are from a series of annual competitions held at the top-tier conference of the same name. The conference is organized by SIGNLL.
These datasets include data for the shared tasks, such as part-of-speech (POS) tagging, chunking, named entity recognition (NER), semantic role labeling (SRL), etc.
We provide built-in support for CoNLL 2000 – 2002, 2004, as well as the Universal Dependencies dataset which was used in the 2017 and 2018 competitions.
CoNLL2000 Part-of-speech (POS) tagging and chunking joint task dataset.
CoNLL2001 Clause Identification dataset.
CoNLL2002 Named Entity Recognition (NER) task dataset.
CoNLL2004 Semantic Role Labeling (SRL) task dataset.
Universal dependencies tree banks.
Machine Translation Datasets¶
Preprocessed IWSLT English-Vietnamese Translation Dataset.
Translation Corpus of the WMT2014 Evaluation Campaign.
Preprocessed Translation Corpus of the WMT2014 Evaluation Campaign.
Translation Corpus of the WMT2016 Evaluation Campaign.
Preprocessed Translation Corpus of the WMT2016 Evaluation Campaign.
Intent Classification and Slot Labeling¶
Airline Travel Information System dataset from MS CNTK.
Snips Natural Language Understanding Benchmark dataset.
Question Answering¶
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Stanford Question Answering Dataset (SQuAD) - reading comprehension dataset.
GLUE Benchmark¶
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
The Corpus of Linguistic Acceptability (Warstadt et al., 2018) consists of English acceptability judgments drawn from books and journal articles on linguistic theory.
The Stanford Sentiment Treebank (Socher et al., 2013) consists of sentences from movie reviews and human annotations of their sentiment.
The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data.
The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora.
The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (RTE1, RTE2, RTE3, and RTE5).
The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations.
The Question-answering NLI dataset converted from Stanford Question Answering Dataset (Rajpurkar et al.
The Winograd NLI dataset converted from the dataset in Winograd Schema Challenge (Levesque et al., 2011).
The Microsoft Research Paraphrase Corpus dataset.
SuperGLUE Benchmark¶
The SuperGLUE Benchmark is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks.
The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (RTE1, RTE2, RTE3 and RTE5).
The CommitmentBank (CB) is a corpus of short texts in which at least one sentence contains an embedded clause.
The Winograd Schema Challenge (WSC) is a co-reference resolution dataset.
The Word-in-Context (WiC) is a word sense disambiguation dataset cast as binary classification of sentence pairs.
The Choice of Plausible Alternatives (COPA) is a causal reasoning dataset.
Multi-Sentence Reading Comprehension (MultiRC) is a QA dataset.
Boolean Questions (BoolQ) is a QA dataset where each example consists of a short passage and a yes/no question about it.
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a multiple-choice QA dataset.
The Broadcoverage Diagnostics (AX-b) is a diagnostics dataset labeled closely to the schema of MultiNLI.
The Winogender Schema Diagnostics (AX-g) is a diagnostics dataset labeled closely to the schema of MultiNLI.
Datasets¶
Dataset API for processing common text formats. The following classes can be used or subclassed to load custom datasets.
Dataset that comprises lines in a file.
Common text dataset that reads a whole corpus based on provided sample splitter and word tokenizer.
Common tab separated text dataset that reads text fields based on provided sample splitter and field separator.
DataStreams¶
DataStream API for streaming and processing common text formats. The following classes can be used or subclassed to stream large custom data.
Abstract Data Stream Interface.
SimpleDataStream wraps iterables to expose the DataStream API.
Abstract Dataset Stream Interface.
A simple stream of Datasets.
Prefetch a DataStream in a separate Thread or Process.
Transforms¶
Text data transformation functions. They can be used for processing text sequences in conjunction with the Dataset.transform method.
Clip the sequence to have length no more than length.
Pad the sequence.
Apply the Moses Tokenizer implemented in sacremoses.
Apply the Spacy Tokenizer.
Apply the Moses Detokenizer implemented in sacremoses.
End-to-end tokenization for BERT models.
BERT style data transformation.
Samplers¶
Samplers determine how to iterate through datasets. The samplers and batch samplers below can help iterate through sequence data.
Sort the samples based on the sort key and then sample sequentially.
Assign each data sample to a fixed bucket based on its length.
Batches are sampled from sorted buckets of data.
Split the dataset into num_parts parts and randomly sample from the part with index part_index.
The FixedBucketSampler uses the following bucket scheme classes to generate bucket keys.
Buckets with constant width.
Buckets with linearly increasing width: \(w_i = \alpha * i + 1\) for all \(i \geq 1\).
Buckets with exponentially increasing width: \(w_i = bucket\_len\_step * w_{i-1}\) for all \(i \geq 2\).
DataLoaders¶
DataLoaders load data from a dataset and return mini-batches of data.
Loads data from a dataset and returns mini-batches of data.
Loads data from a list of datasets and returns mini-batches of data.
Utilities¶
Miscellaneous utility classes and functions for processing text and sequence data.
Counter class for keeping token frequencies.
Counts tokens in the specified string.
Concatenate sequences of tokens into a single flattened list of tokens.
Slice a flat sequence of tokens into sequences of tokens, with each inner sequence’s length equal to the specified length, taking into account the requested sequence overlap.
Split the dataset into training and validation sets.
Registers a dataset with segment specific hyperparameters.
Creates an instance of a registered dataset.
Get valid datasets and registered parameters.
API Reference¶
This module includes common utilities such as data readers and counter.
-
gluonnlp.data.get_tokenizer(model_name, dataset_name, vocab=None, root='~/.mxnet/data', **kwargs)
Returns a pre-defined tokenizer by name.
- Parameters
model_name (str) – Options include ‘bert_24_1024_16’, ‘bert_12_768_12’, ‘roberta_12_768_12’, ‘roberta_24_1024_16’ and ‘ernie_12_768_12’
dataset_name (str) – For model_name bert_24_1024_16 and bert_12_768_12, the supported datasets are ‘book_corpus_wiki_en_cased’ and ‘book_corpus_wiki_en_uncased’. For model_name bert_12_768_12, ‘wiki_cn_cased’, ‘wiki_multilingual_uncased’, ‘wiki_multilingual_cased’, ‘scibert_scivocab_uncased’, ‘scibert_scivocab_cased’, ‘scibert_basevocab_uncased’, ‘scibert_basevocab_cased’, ‘biobert_v1.0_pmc’, ‘biobert_v1.0_pubmed’, ‘biobert_v1.0_pubmed_pmc’, ‘biobert_v1.1_pubmed’, ‘clinicalbert’ and ‘kobert_news_wiki_ko_cased’ are additionally supported. For model_name roberta_12_768_12 and roberta_24_1024_16, ‘openwebtext_ccnews_stories_books_cased’ is supported. For model_name ernie_12_768_12, ‘baidu_ernie_uncased’ is supported.
vocab (gluonnlp.vocab.BERTVocab or None, default None) – Vocabulary for the dataset. Must be provided if tokenizer is based on vocab.
root (str, default '$MXNET_HOME/models' with MXNET_HOME defaults to '~/.mxnet') – Location for keeping the model parameters.
- Returns
gluonnlp.data.BERTTokenizer or gluonnlp.data.GPT2BPETokenizer or gluonnlp.data.SentencepieceTokenizer
Examples
>>> model_name = 'bert_12_768_12'
>>> dataset_name = 'book_corpus_wiki_en_uncased'
>>> _, vocab = gluonnlp.model.get_model(model_name,
...                                     dataset_name=dataset_name,
...                                     pretrained=False, root='./model')
-etc-
>>> tokenizer = gluonnlp.data.get_tokenizer(model_name, dataset_name, vocab)
>>> tokenizer('Habit is second nature.')
['habit', 'is', 'second', 'nature', '.']
-
class gluonnlp.data.Counter(**kwds)
Counter class for keeping token frequencies.
-
discard(min_freq, unknown_token)
Discards tokens with frequency below min_freq and represents them as unknown_token.
- Parameters
- Returns
- Return type
The Counter instance.
Examples
>>> a = gluonnlp.data.Counter({'a': 10, 'b': 1, 'c': 1})
>>> a.discard(3, '<unk>')
Counter({'a': 10, '<unk>': 2})
-
-
gluonnlp.data.count_tokens(tokens, to_lower=False, counter=None)
Counts tokens in the specified string.
For token_delim=’(td)’ and seq_delim=’(sd)’, a specified string of two sequences of tokens may look like:
(td)token1(td)token2(td)token3(td)(sd)(td)token4(td)token5(td)(sd)
- Parameters
tokens (list of str) – A source list of tokens.
to_lower (bool, default False) – Whether to convert the tokens to lower case.
counter (Counter or None, default None) – The Counter instance to be updated with the counts of tokens. If None, return a new Counter instance counting tokens from tokens.
- Returns
The counter Counter instance after being updated with the token counts of tokens. If counter is None, return a new Counter instance counting tokens from tokens.
Examples
>>> import re
>>> source_str = ' Life is great ! \n life is good . \n'
>>> source_str_tokens = filter(None, re.split(' |\n', source_str))
>>> counter = gluonnlp.data.count_tokens(source_str_tokens)
>>> sorted(counter.items())
[('!', 1), ('.', 1), ('Life', 1), ('good', 1), ('great', 1), ('is', 2), ('life', 1)]
-
gluonnlp.data.concat_sequence(sequences)
Concatenate sequences of tokens into a single flattened list of tokens.
- Parameters
sequences (list of list of object) – Sequences of tokens, each of which is an iterable of tokens.
- Returns
- Return type
Flattened list of tokens.
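The flattening described above can be illustrated with a short pure-Python sketch. The helper name `concat_sequence_sketch` is hypothetical and this is not the library's implementation; it only mirrors the documented behavior of turning a list of token sequences into one flat list.

```python
from itertools import chain


def concat_sequence_sketch(sequences):
    """Flatten an iterable of token sequences into one list (illustrative only)."""
    return list(chain.from_iterable(sequences))


# Empty inner sequences contribute nothing to the result.
flat = concat_sequence_sketch([['a', 'b'], ['c'], [], ['d', 'e']])
print(flat)
```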
-
gluonnlp.data.slice_sequence(sequence, length, pad_last=False, pad_val='<pad>', overlap=0)
Slice a flat sequence of tokens into sequences of tokens, with each inner sequence’s length equal to the specified length, taking into account the requested sequence overlap.
- Parameters
sequence (list of object) – A flat list of tokens.
length (int) – The length of each of the samples.
pad_last (bool, default False) – Whether to pad the last sequence when its length doesn’t align. If the last sequence’s length doesn’t align and pad_last is False, it will be dropped.
pad_val (object, default '<pad>') – The padding value to use when the padding of the last sequence is enabled. In general, the type of pad_val should be the same as the tokens.
overlap (int, default 0) – The extra number of items in current sample that should overlap with the next sample.
- Returns
- Return type
List of list of tokens, with the length of each inner list equal to length.
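To make the interaction of length, overlap and pad_last concrete, here is a minimal pure-Python sketch of the slicing behavior described above. The function `slice_sequence_sketch` is a hypothetical stand-in written for illustration; it assumes that consecutive samples start `length - overlap` items apart, and the library's own edge-case handling may differ.

```python
def slice_sequence_sketch(sequence, length, pad_last=False, pad_val='<pad>', overlap=0):
    """Slice a flat token list into fixed-length samples (illustrative only)."""
    if length <= overlap:
        raise ValueError('length must be larger than overlap')
    step = length - overlap          # distance between the starts of two samples
    seq = list(sequence)
    if pad_last and seq:
        if len(seq) < length:
            # Too short for even one sample: pad up to one full sample.
            seq += [pad_val] * (length - len(seq))
        else:
            residual = (len(seq) - length) % step
            if residual:
                # Pad so that the trailing items form one more full sample.
                seq += [pad_val] * (step - residual)
    if len(seq) < length:
        return []
    num_samples = (len(seq) - length) // step + 1
    return [seq[i * step:i * step + length] for i in range(num_samples)]


tokens = list(range(7))
dropped = slice_sequence_sketch(tokens, 3)                 # trailing item is dropped
padded = slice_sequence_sketch(tokens, 3, pad_last=True)   # trailing item is padded
overlapped = slice_sequence_sketch(tokens, 3, overlap=1)   # samples share one item
```

Without padding, the last item does not fill a full sample and is dropped; with pad_last=True it is kept and padded; with overlap=1 each sample repeats the last item of the previous one.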
-
gluonnlp.data.train_valid_split(dataset, valid_ratio=0.05, stratify=None)
Split the dataset into training and validation sets.
- Parameters
- Returns
train (SimpleDataset)
valid (SimpleDataset)
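As a rough sketch of what a random train/validation split involves, the following pure-Python example shuffles the sample indices and cuts them at valid_ratio. The helper `train_valid_split_sketch` and its `seed` argument are assumptions made for illustration; the sketch returns plain lists rather than SimpleDataset objects and does not cover the stratify option.

```python
import random


def train_valid_split_sketch(samples, valid_ratio=0.05, seed=0):
    """Randomly split samples into (train, valid) lists (illustrative only)."""
    if not 0.0 <= valid_ratio <= 1.0:
        raise ValueError('valid_ratio must be between 0 and 1')
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    num_valid = int(len(samples) * valid_ratio)
    valid = [samples[i] for i in indices[:num_valid]]
    train = [samples[i] for i in indices[num_valid:]]
    return train, valid


train, valid = train_valid_split_sketch(list(range(10)), valid_ratio=0.2)
print(len(train), len(valid))
```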
-
gluonnlp.data.whitespace_splitter(s)
Split a string at whitespace (space, tab, newline, return, formfeed).
-
class gluonnlp.data.Splitter(separator=None)
Split a string based on a separator.
- Parameters
separator (str) – The separator based on which string is split.
-
class gluonnlp.data.ClipSequence(length)
Clip the sequence to have length no more than length.
- Parameters
length (int) – Maximum length of the sequence
Examples
>>> datasets = gluon.data.SimpleDataset([[1, 3, 5, 7], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8]])
>>> list(datasets.transform(gluonnlp.data.ClipSequence(4)))
[[1, 3, 5, 7], [1, 2, 3], [1, 2, 3, 4]]
>>> datasets = gluon.data.SimpleDataset([np.array([[1, 3], [5, 7], [7, 5], [3, 1]]),
...                                      np.array([[1, 2], [3, 4], [5, 6],
...                                                [6, 5], [4, 3], [2, 1]]),
...                                      np.array([[2, 4], [4, 2]])])
>>> list(datasets.transform(gluonnlp.data.ClipSequence(3)))
[array([[1, 3], [5, 7], [7, 5]]), array([[1, 2], [3, 4], [5, 6]]), array([[2, 4], [4, 2]])]
-
class gluonnlp.data.PadSequence(length, pad_val=0, clip=True)
Pad the sequence.
Pad the sequence to the given length by inserting pad_val. If clip is set, sequences longer than length will be clipped.
- Parameters
Examples
>>> datasets = gluon.data.SimpleDataset([[1, 3, 5, 7], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8]])
>>> list(datasets.transform(gluonnlp.data.PadSequence(6)))
[[1, 3, 5, 7, 0, 0], [1, 2, 3, 0, 0, 0], [1, 2, 3, 4, 5, 6]]
>>> list(datasets.transform(gluonnlp.data.PadSequence(6, clip=False)))
[[1, 3, 5, 7, 0, 0], [1, 2, 3, 0, 0, 0], [1, 2, 3, 4, 5, 6, 7, 8]]
>>> list(datasets.transform(gluonnlp.data.PadSequence(6, pad_val=-1, clip=False)))
[[1, 3, 5, 7, -1, -1], [1, 2, 3, -1, -1, -1], [1, 2, 3, 4, 5, 6, 7, 8]]
-
class gluonnlp.data.SacreMosesTokenizer
Apply the Moses Tokenizer implemented in sacremoses.
Users of this class are required to install sacremoses. For example, one can use pip install sacremoses.
Note
sacremoses carries an LGPL 2.1+ license.
Examples
>>> tokenizer = gluonnlp.data.SacreMosesTokenizer()
>>> tokenizer('Gluon NLP toolkit provides a suite of text processing tools.')
['Gluon', 'NLP', 'toolkit', 'provides', 'a', 'suite', 'of', 'text', 'processing', 'tools', '.']
>>> tokenizer('Das Gluon NLP-Toolkit stellt eine Reihe von Textverarbeitungstools '
...           'zur Verfügung.')
['Das', 'Gluon', 'NLP-Toolkit', 'stellt', 'eine', 'Reihe', 'von', 'Textverarbeitungstools', 'zur', 'Verfügung', '.']
-
class gluonnlp.data.SpacyTokenizer(lang='en_core_web_sm')
Apply the Spacy Tokenizer.
Users of this class are required to install spaCy and download corresponding NLP models, such as python -m spacy download en. Only spacy>=2.0.0 is supported.
- Parameters
lang (str) – The language to tokenize. Default is ‘en_core_web_sm’, i.e., English. You may refer to https://spacy.io/usage/models for supported languages.
Examples
>>> tokenizer = gluonnlp.data.SpacyTokenizer()
>>> tokenizer('Gluon NLP toolkit provides a suite of text processing tools.')
['Gluon', 'NLP', 'toolkit', 'provides', 'a', 'suite', 'of', 'text', 'processing', 'tools', '.']
>>> tokenizer = gluonnlp.data.SpacyTokenizer('de')
>>> tokenizer('Das Gluon NLP-Toolkit stellt eine Reihe von Textverarbeitungstools'
...           ' zur Verfügung.')
['Das', 'Gluon', 'NLP-Toolkit', 'stellt', 'eine', 'Reihe', 'von', 'Textverarbeitungstools', 'zur', 'Verfügung', '.']
-
class gluonnlp.data.SacreMosesDetokenizer(return_str=True)
Apply the Moses Detokenizer implemented in sacremoses.
Users of this class are required to install sacremoses. For example, one can use pip install sacremoses.
Note
sacremoses carries an LGPL 2.1+ license.
- Parameters
return_str (bool, default True) – True: return a single string. False: return a list of words.
Examples
>>> detokenizer = gluonnlp.data.SacreMosesDetokenizer()
>>> detokenizer(['Gluon', 'NLP', 'toolkit', 'provides', 'a', 'suite', 'of',
...              'text', 'processing', 'tools', '.'], return_str=True)
'Gluon NLP toolkit provides a suite of text processing tools.'
>>> detokenizer(['Das', 'Gluon','NLP-Toolkit','stellt','eine','Reihe','von',
...              'Textverarbeitungstools','zur','Verfügung','.'], return_str=True)
'Das Gluon NLP-Toolkit stellt eine Reihe von Textverarbeitungstools zur Verfügung.'
-
class gluonnlp.data.JiebaTokenizer
Apply the jieba Tokenizer.
Users of this class are required to install jieba.
- Parameters
lang (str) – The language to tokenize. Default is “zh”, i.e., Chinese.
Examples
>>> tokenizer = gluonnlp.data.JiebaTokenizer()
>>> tokenizer('我来到北京清华大学')
['我', '来到', '北京', '清华大学']
>>> tokenizer('小明硕士毕业于中国科学院计算所,后在日本京都大学深造')
['小明', '硕士', '毕业', '于', '中国科学院', '计算所', ',', '后', '在', '日本京都大学', '深造']
-
class gluonnlp.data.NLTKStanfordSegmenter(segmenter_root='~/.mxnet/stanford-segmenter', slf4j_root='~/.mxnet/slf4j', java_class='edu.stanford.nlp.ie.crf.CRFClassifier')
Apply the Stanford Chinese Word Segmenter implemented in NLTK.
Users of this class are required to install Java and NLTK, and to download the Stanford Word Segmenter.
- Parameters
segmenter_root (str, default '$MXNET_HOME/stanford-segmenter') – Path to the folder for storing the Stanford segmenter. MXNET_HOME defaults to ‘~/.mxnet’.
slf4j_root (str, default '$MXNET_HOME/slf4j') – Path to the folder for storing slf4j. MXNET_HOME defaults to ‘~/.mxnet’.
java_class (str, default 'edu.stanford.nlp.ie.crf.CRFClassifier') – The learning algorithm used for segmentation
Examples
>>> tokenizer = gluonnlp.data.NLTKStanfordSegmenter()
>>> tokenizer('我来到北京清华大学')
['我', '来到', '北京', '清华大学']
>>> tokenizer('小明硕士毕业于中国科学院计算所,后在日本京都大学深造')
['小明', '硕士', '毕业', '于', '中国科学院', '计算所', ',', '后', '在', '日本京都大学', '深造']
-
class gluonnlp.data.SentencepieceTokenizer(path, num_best=0, alpha=1.0)
Apply the Sentencepiece Tokenizer, which supports subword tokenization such as BPE.
Users of this class are required to install sentencepiece. For example, one can use pip install sentencepiece.
- Parameters
path (str) – Path to the pre-trained subword tokenization model.
num_best (int, default 0) – A scalar for sampling subwords. If num_best is 0 or 1, no sampling is performed. If num_best > 1, then samples from the num_best results. If num_best < 0, then assume that num_best is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
alpha (float, default 1.0) – A scalar for a smoothing parameter. Inverse temperature for probability rescaling.
Examples
>>> url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-0690baed.bpe'
>>> f = gluon.utils.download(url)
-etc-
>>> tokenizer = gluonnlp.data.SentencepieceTokenizer(f)
>>> detokenizer = gluonnlp.data.SentencepieceDetokenizer(f)
>>> sentence = 'This is a very awesome, life-changing sentence.'
>>> tokenizer(sentence)
['▁This', '▁is', '▁a', '▁very', '▁awesome', ',', '▁life', '-', 'ch', 'anging', '▁sentence', '.']
>>> detokenizer(tokenizer(sentence))
'This is a very awesome, life-changing sentence.'
>>> os.remove('test-0690baed.bpe')
-
class gluonnlp.data.SentencepieceDetokenizer(path)
Apply the Sentencepiece detokenizer, which supports recombining subwords such as BPE.
Users of this class are required to install sentencepiece. For example, one can use pip install sentencepiece.
- Parameters
path (str) – Path to the pre-trained subword tokenization model.
Examples
>>> url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-0690baed.bpe'
>>> f = gluon.utils.download(url)
-etc-
>>> tokenizer = gluonnlp.data.SentencepieceTokenizer(f)
>>> detokenizer = gluonnlp.data.SentencepieceDetokenizer(f)
>>> sentence = 'This is a very awesome, life-changing sentence.'
>>> tokenizer(sentence)
['▁This', '▁is', '▁a', '▁very', '▁awesome', ',', '▁life', '-', 'ch', 'anging', '▁sentence', '.']
>>> detokenizer(tokenizer(sentence))
'This is a very awesome, life-changing sentence.'
>>> os.remove('test-0690baed.bpe')
-
class gluonnlp.data.BERTBasicTokenizer(lower=True)
Runs basic tokenization.
Removes invalid characters (e.g. control characters) and cleans up whitespace, tokenizes CJK characters, splits punctuation, and, if lower is True, strips accents and converts the text to lower case.
- Parameters
lower (bool, default True) – Whether to strip accents and convert the text to lower case.
Examples
>>> tokenizer = gluonnlp.data.BERTBasicTokenizer(lower=True)
>>> tokenizer(' \tHeLLo!how \n Are yoU? ')
['hello', '!', 'how', 'are', 'you', '?']
>>> tokenizer = gluonnlp.data.BERTBasicTokenizer(lower=False)
>>> tokenizer(' \tHeLLo!how \n Are yoU? ')
['HeLLo', '!', 'how', 'Are', 'yoU', '?']
-
class gluonnlp.data.BERTTokenizer(vocab, lower=True, max_input_chars_per_word=200, lru_cache_size=None)
End-to-end tokenization for BERT models.
- Parameters
vocab (Vocab) – Vocabulary for the corpus.
lower (bool) – Whether to strip accents and convert the text to lower case. If you use a pre-trained BERT model, set lower to False for the cased model and to True otherwise.
max_input_chars_per_word (int) –
lru_cache_size (Optional[int]) – Maximum size of a least-recently-used cache to speed up tokenization. Use size of 2**20 for example.
Examples
>>> _, vocab = gluonnlp.model.bert_12_768_12(dataset_name='wiki_multilingual_uncased',
...                                          pretrained=False, root='./model')
-etc-
>>> tokenizer = gluonnlp.data.BERTTokenizer(vocab=vocab)
>>> tokenizer('gluonnlp: 使NLP变得简单。')
['gl', '##uo', '##nn', '##lp', ':', '使', 'nl', '##p', '变', '得', '简', '单', '。']
-
__call__(sample)
- Parameters
sample (str) – The string to tokenize.
- Returns
ret – List of tokens
- Return type
list of strs
-
static is_first_subword(token)
Check if a token is the beginning of subwords.
- Parameters
token (str) – The input token.
- Returns
ret
- Return type
True if the token is the beginning of a series of wordpieces.
Examples
>>> _, vocab = gluonnlp.model.bert_12_768_12(dataset_name='wiki_multilingual_uncased',
...                                          pretrained=False, root='./bert_tokenizer')
-etc-
>>> tokenizer = gluonnlp.data.BERTTokenizer(vocab=vocab)
>>> tokenizer('gluonnlp: 使NLP变得简单。')
['gl', '##uo', '##nn', '##lp', ':', '使', 'nl', '##p', '变', '得', '简', '单', '。']
>>> tokenizer.is_first_subword('gl')
True
>>> tokenizer.is_first_subword('##uo')
False
-
class gluonnlp.data.BERTSentenceTransform(tokenizer, max_seq_length, vocab=None, pad=True, pair=True)
BERT style data transformation.
- Parameters
tokenizer (BERTTokenizer) – Tokenizer for the sentences.
max_seq_length (int) – Maximum sequence length of the sentences.
vocab (Vocab) – The vocabulary which has cls_token and sep_token registered. If vocab.cls_token is not present, vocab.bos_token is used instead. If vocab.sep_token is not present, vocab.eos_token is used instead.
pad (bool, default True) – Whether to pad the sentences to maximum length.
pair (bool, default True) – Whether to transform sentences or sentence pairs.
-
__call__(line)
Perform transformation for sequence pairs or single sequences.
The transformation is processed in the following steps:
- tokenize the input sequences
- insert [CLS], [SEP] as necessary
- generate type ids to indicate whether a token belongs to the first sequence or the second sequence
- generate valid length
For sequence pairs, the input is a tuple of 2 strings: text_a, text_b.
- Inputs:
text_a: ‘is this jacksonville ?’
text_b: ‘no it is not .’
- Tokenization:
text_a: ‘is this jack ##son ##ville ?’
text_b: ‘no it is not .’
- Processed:
tokens: ‘[CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]’
type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
valid_length: 14
For single sequences, the input is a tuple of a single string: text_a.
- Inputs:
text_a: ‘the dog is hairy .’
- Tokenization:
text_a: ‘the dog is hairy .’
- Processed:
tokens: ‘[CLS] the dog is hairy . [SEP]’
type_ids: 0 0 0 0 0 0 0
valid_length: 7
If vocab.cls_token and vocab.sep_token are not present, vocab.bos_token and vocab.eos_token are used instead.
- Parameters
line (tuple of str) – Input strings. For sequence pairs, the input is a tuple of 2 strings: (text_a, text_b). For single sequences, the input is a tuple of single string: (text_a,).
- Returns
np.array (input token ids in ‘int32’, shape (batch_size, seq_length))
np.array (valid length in ‘int32’, shape (batch_size,))
np.array (input token type ids in ‘int32’, shape (batch_size, seq_length))
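The steps of the transformation can be sketched in plain Python, starting from already tokenized input. The helper `bert_pair_sketch` is hypothetical and only mirrors the documented steps (insert [CLS] and [SEP], generate type ids, compute the valid length); it deliberately leaves out tokenization, truncation to max_seq_length, padding, and the conversion of tokens to vocabulary ids.

```python
def bert_pair_sketch(tokens_a, tokens_b=None, cls_token='[CLS]', sep_token='[SEP]'):
    """Assemble tokens, type ids and valid length for one sample (illustrative only)."""
    # First segment: [CLS] + text_a + [SEP], all with type id 0.
    tokens = [cls_token] + list(tokens_a) + [sep_token]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        # Second segment: text_b + [SEP], all with type id 1.
        tokens += list(tokens_b) + [sep_token]
        type_ids += [1] * (len(tokens_b) + 1)
    valid_length = len(tokens)
    return tokens, type_ids, valid_length


tokens, type_ids, valid_length = bert_pair_sketch(
    'is this jack ##son ##ville ?'.split(),
    'no it is not .'.split())
print(' '.join(tokens))
print(type_ids, valid_length)
```

This reproduces the sequence-pair example above: 14 tokens, eight type ids of 0 followed by six of 1.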
-
class gluonnlp.data.BERTSPTokenizer(path, vocab, num_best=0, alpha=1.0, lower=True, max_input_chars_per_word=200)
End-to-end SentencePiece tokenization for BERT models.
It works best with BERTSentenceTransform().
Note
BERTSPTokenizer depends on the sentencepiece library. For multi-processing with BERTSPTokenizer, making an extra copy of the BERTSPTokenizer instance is recommended before using it.
- Parameters
path (str) – Path to the pre-trained subword tokenization model.
vocab (gluonnlp.Vocab) – Vocabulary for the corpus.
num_best (int, default 0) – A scalar for sampling subwords. If num_best is 0 or 1, no sampling is performed. If num_best > 1, then samples from the num_best results. If num_best < 0, then assume that num_best is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
alpha (float) – A scalar for a smoothing parameter. Inverse temperature for probability rescaling.
lower (bool, default True) – Whether the text strips accents and convert to lower case. If you use the BERT pre-training model, lower is set to False when using the cased model, otherwise it is set to True.
max_input_chars_per_word (int, default 200) –
Examples
>>> url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-682b5d15.bpe'
>>> f = gluon.utils.download(url)
-etc-
>>> bert_vocab = gluonnlp.vocab.BERTVocab.from_sentencepiece(f)
>>> sp_tokenizer = gluonnlp.data.BERTSPTokenizer(f, bert_vocab, lower=True)
>>> sentence = 'Better is to bow than break.'
>>> sp_tokenizer(sentence)
['▁better', '▁is', '▁to', '▁b', 'ow', '▁than', '▁brea', 'k', '▁', '.']
>>> os.remove('test-682b5d15.bpe')
-
__call__(sample)
- Parameters
sample (str) – The string to tokenize.
- Returns
ret – List of tokens
- Return type
list of strs
-
static is_first_subword(token)
Check if a string token is the beginning of a word, rather than a subword following a previous subword.
- Parameters
token (str) – The input token.
- Returns
ret
- Return type
True if the token is the beginning of a series of subwords.
Examples
>>> url = 'http://repo.mxnet.io/gluon/dataset/vocab/test-682b5d15.bpe'
>>> f = gluon.utils.download(url)
-etc-
>>> bert_vocab = gluonnlp.vocab.BERTVocab.from_sentencepiece(f)
>>> sp_tokenizer = gluonnlp.data.BERTSPTokenizer(f, bert_vocab, lower=True)
>>> sp_tokenizer('Better is to bow than break.')
['▁better', '▁is', '▁to', '▁b', 'ow', '▁than', '▁brea', 'k', '▁', '.']
>>> sp_tokenizer.is_first_subword('▁better')
True
>>> sp_tokenizer.is_first_subword('ow')
False
>>> os.remove('test-682b5d15.bpe')
-
class gluonnlp.data.GPT2BPETokenizer(root='~/.mxnet/models')
BPE tokenizer used in OpenAI GPT-2 model.
- Parameters
root (str, default '$MXNET_HOME/models') – Location for keeping the BPE rank file. MXNET_HOME defaults to ‘~/.mxnet’.
-
class gluonnlp.data.ConstWidthBucket
Buckets with constant width.
-
class gluonnlp.data.LinearWidthBucket
Buckets with linearly increasing width: \(w_i = \alpha * i + 1\) for all \(i \geq 1\).
-
class gluonnlp.data.ExpWidthBucket(bucket_len_step=1.1)
Buckets with exponentially increasing width: \(w_i = bucket\_len\_step * w_{i-1}\) for all \(i \geq 2\).
- Parameters
bucket_len_step (float, default 1.1) – This is the increasing factor for the bucket width.
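The two width formulas above can be evaluated directly. The sketch below only computes the sequences \(w_i\) from the stated recurrences; the function names, the choice of \(\alpha\), and the first exponential width are assumptions for illustration, and the way the library derives \(\alpha\) or rounds widths to integer bucket keys is not covered here.

```python
def linear_widths(alpha, num_buckets):
    """Widths w_i = alpha * i + 1 for i = 1..num_buckets (illustrative only)."""
    return [alpha * i + 1 for i in range(1, num_buckets + 1)]


def exp_widths(first_width, bucket_len_step, num_buckets):
    """Widths w_i = bucket_len_step * w_{i-1}, starting from an assumed first width."""
    widths = [first_width]
    for _ in range(num_buckets - 1):
        widths.append(bucket_len_step * widths[-1])
    return widths


linear = linear_widths(alpha=2, num_buckets=4)
exponential = exp_widths(first_width=1.0, bucket_len_step=2.0, num_buckets=4)
print(linear)
print(exponential)
```

Linear widths grow by a constant amount per bucket, whereas exponential widths grow by a constant factor, so the exponential scheme spends fewer buckets on long sequences.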
-
class gluonnlp.data.SortedSampler(sort_keys, reverse=True)
Sort the samples based on the sort key and then sample sequentially.
- Parameters
sort_keys (list-like object) – List of the sort keys.
reverse (bool, default True) – Whether to sort by descending order.
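Conceptually, sampling in sorted order amounts to iterating over the sample indices ordered by their sort keys. The following sketch shows that idea with a hypothetical helper, `sorted_indices`; it is not the sampler class itself.

```python
def sorted_indices(sort_keys, reverse=True):
    """Return sample indices ordered by their sort key (illustrative only)."""
    return sorted(range(len(sort_keys)), key=lambda i: sort_keys[i], reverse=reverse)


lengths = [3, 10, 7]                                  # e.g. sequence lengths used as sort keys
descending = sorted_indices(lengths)                  # longest sample first
ascending = sorted_indices(lengths, reverse=False)    # shortest sample first
print(descending, ascending)
```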
-
class gluonnlp.data.FixedBucketSampler(lengths, batch_size, num_buckets=10, bucket_keys=None, ratio=0, shuffle=False, use_average_length=False, num_shards=0, bucket_scheme=<gluonnlp.data.sampler.ConstWidthBucket object>)
Assign each data sample to a fixed bucket based on its length. The bucket keys are either given or generated from the input sequence lengths.
- Parameters
lengths (list of int or list of tuple/list of int) – The length of the sequences in the input data sample.
batch_size (int) – The batch size of the sampler.
num_buckets (int or None, default 10) – The number of buckets. This will not be used if bucket_keys is set.
bucket_keys (None or list of int or list of tuple, default None) – The keys that will be used to create the buckets. It should usually be the lengths of the sequences. If it is None, the bucket_keys will be generated based on the maximum lengths of the data.
ratio (float, default 0) –
Ratio to scale up the batch size of smaller buckets. Assume the \(i\)-th key is \(K_i\), the default batch size is \(B\), the ratio to scale the batch size is \(\alpha\), and the batch size corresponding to the \(i\)-th bucket is \(B_i\). We have:
\[B_i = \max\left(\alpha B \times \frac{\max_j \mathrm{sum}(K_j)}{\mathrm{sum}(K_i)}, B\right)\]
Thus, setting this to a value larger than 0, such as 0.5, scales up the batch size of the smaller buckets.
shuffle (bool, default False) – Whether to shuffle the batches.
use_average_length (bool, default False) – False: each batch contains batch_size sequences, number of sequence elements varies. True: each batch contains batch_size elements, number of sequences varies. In this case, ratio option is ignored.
num_shards (int, default 0) – If num_shards > 0, the sampled batch is split into num_shards smaller batches. The output will have structure of list(list(int)). If num_shards = 0, the output will have structure of list(int). This is useful in multi-gpu training and can potentially reduce the number of paddings. In general, it is set to the number of gpus.
bucket_scheme (BucketScheme, default ConstWidthBucket) – The scheme used to generate bucket keys. Supported schemes:
ConstWidthBucket: all the buckets have the same width.
LinearWidthBucket: the width of the \(i\)-th bucket follows \(w_i = \alpha * i + 1\).
ExpWidthBucket: the width of the \(i\)-th bucket follows \(w_i = bucket\_len\_step * w_{i-1}\).
Examples
>>> lengths = [np.random.randint(1, 100) for _ in range(1000)]
>>> sampler = gluonnlp.data.FixedBucketSampler(lengths, 8, ratio=0.5)
>>> print(sampler.stats())
FixedBucketSampler:
-etc-
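As a worked instance of the ratio formula given for the `ratio` parameter, the sketch below evaluates \(B_i\) for scalar bucket keys, in which case sum(K) is simply K. The function name and the truncation to an integer are assumptions made for illustration. The library may round differently.

```python
def scaled_batch_sizes(bucket_keys, batch_size, ratio):
    """B_i = max(ratio * B * max_j K_j / K_i, B) for scalar bucket keys."""
    max_key = max(bucket_keys)
    return [max(int(ratio * batch_size * max_key / key), batch_size)
            for key in bucket_keys]


# Buckets for sequences of up to 10, 20, 40 and 80 tokens, base batch size 8.
print(scaled_batch_sizes([10, 20, 40, 80], batch_size=8, ratio=0.5))
# [32, 16, 8, 8]
print(scaled_batch_sizes([10, 20, 40, 80], batch_size=8, ratio=0))
# [8, 8, 8, 8]
```

The bucket with the shortest sequences gets the largest batch, so every batch carries a more even number of tokens.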
-
class
gluonnlp.data.
SortedBucketSampler
(sort_keys, batch_size, mult=100, reverse=True, shuffle=False)[source]¶ Batches are sampled from sorted buckets of data.
First, the data is partitioned into buckets of batch_size * mult elements each. The samples inside each bucket are then sorted based on sort_keys and batched.
- Parameters
sort_keys (list-like object) – The keys to sort the samples.
batch_size (int) – Batch size of the sampler.
mult (int or float, default 100) – The multiplier to determine the bucket size. Each bucket will have size mult * batch_size.
reverse (bool, default True) – Whether to sort in descending order.
shuffle (bool, default False) – Whether to shuffle the data.
Examples
>>> lengths = [np.random.randint(1, 1000) for _ in range(1000)]
>>> sampler = gluonnlp.data.SortedBucketSampler(lengths, 16)
>>> # The sequence lengths within the batch will be sorted
>>> for i, indices in enumerate(sampler):
...     if i == 0:
...         print([lengths[ele] for ele in indices])
[-etc-]
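The partition, sort and batch steps described for SortedBucketSampler can be sketched in plain Python as follows. This is an illustration with shuffle left off, and `sorted_bucket_batches` is a hypothetical name, not the library's implementation.

```python
def sorted_bucket_batches(sort_keys, batch_size, mult=100, reverse=True):
    """Partition indices into buckets, sort each bucket, then cut batches."""
    bucket_size = int(batch_size * mult)
    indices = list(range(len(sort_keys)))
    batches = []
    for start in range(0, len(indices), bucket_size):
        # Sort only within the current bucket, never across buckets.
        bucket = sorted(indices[start:start + bucket_size],
                        key=lambda i: sort_keys[i], reverse=reverse)
        for offset in range(0, len(bucket), batch_size):
            batches.append(bucket[offset:offset + batch_size])
    return batches


lengths = [7, 2, 9, 4, 1, 8, 3, 6]
batches = sorted_bucket_batches(lengths, batch_size=2, mult=2)
for batch in batches:
    print([lengths[i] for i in batch])
```

With a bucket size of 4, the first four samples are sorted among themselves, then the last four. Sequences of similar length land in the same batch without a global sort.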
-
class
gluonnlp.data.
SplitSampler
(length, num_parts=1, part_index=0, even_size=False, repeat=1, shuffle=True)[source]¶ Split the dataset into num_parts parts and randomly sample from the part with index part_index.
The data is randomly shuffled at each iteration within each partition.
- Parameters
length (int) – Number of examples in the dataset
num_parts (int, default 1) – Number of partitions which the data is split into
part_index (int, default 0) – The index of the part to read from
even_size (bool, default False) – If the number of samples is not even across all partitions, sample a few extra samples for the ones with fewer samples.
repeat (int, default 1) – The number of times that items are repeated.
shuffle (bool, default True) – Whether or not to shuffle the items.
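One way to picture the partitioning is a contiguous split in which earlier parts absorb the remainder. The sketch below makes that assumption explicitly. It ignores even_size and repeat, and both function names and the `seed` argument are hypothetical.

```python
import random


def split_range(length, num_parts=1, part_index=0):
    """Assumed contiguous split: the first `remainder` parts get one extra item."""
    part_len, remainder = divmod(length, num_parts)
    start = part_len * part_index + min(part_index, remainder)
    end = start + part_len + (1 if part_index < remainder else 0)
    return start, end


def sample_part(length, num_parts=1, part_index=0, shuffle=True, seed=None):
    """Return the indices of one part, optionally shuffled within the part."""
    start, end = split_range(length, num_parts, part_index)
    indices = list(range(start, end))
    if shuffle:
        random.Random(seed).shuffle(indices)
    return indices


print([split_range(10, 3, k) for k in range(3)])  # [(0, 4), (4, 7), (7, 10)]
print(sample_part(10, num_parts=3, part_index=1, seed=0))
```

In distributed training, each worker would typically pass its own rank as part_index so that workers read disjoint parts of the data.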
-
class
gluonnlp.data.
TextLineDataset
(filename, encoding='utf8')[source]¶ Dataset that comprises lines in a file. Each line will be stripped.
-
class
gluonnlp.data.
CorpusDataset
(filename, encoding='utf8', flatten=False, skip_empty=True, sample_splitter=<function line_splitter>, tokenizer=<function whitespace_splitter>, bos=None, eos=None)[source]¶ Common text dataset that reads a whole corpus based on provided sample splitter and word tokenizer.
The returned dataset includes samples, each of which can either be a list of tokens if tokenizer is specified, or otherwise a single string segment produced by the sample_splitter.
- Parameters
filename (str or list of str) – Path to the input text file or list of paths to the input text files.
encoding (str, default 'utf8') – File encoding format.
flatten (bool, default False) – Whether to return all samples as flattened tokens. If True, each sample is a token.
skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
sample_splitter (function, default str.splitlines) – A function that splits the dataset string into samples.
tokenizer (function or None, default str.split) – A function that splits each sample string into list of tokens. If None, raw samples are returned according to sample_splitter.
bos (str or None, default None) – The token to add at the beginning of each sequence. If None, or if tokenizer is not specified, then nothing is added.
eos (str or None, default None) – The token to add at the end of each sequence. If None, or if tokenizer is not specified, then nothing is added.
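The interplay of the parameters above can be sketched on an in-memory string. The function below is a simplified stand-in written for illustration, not the CorpusDataset code. It assumes that samples are stripped and that bos and eos are applied only when a tokenizer is given, as the parameter descriptions state.

```python
def read_corpus(text, flatten=False, skip_empty=True,
                sample_splitter=str.splitlines, tokenizer=str.split,
                bos=None, eos=None):
    """Split text into samples, tokenize them, and add bos/eos markers."""
    samples = [s.strip() for s in sample_splitter(text)]
    if skip_empty:
        samples = [s for s in samples if s]
    if tokenizer is None:
        return samples  # raw string segments, no bos/eos
    result = []
    for sample in samples:
        tokens = list(tokenizer(sample))
        if bos is not None:
            tokens.insert(0, bos)
        if eos is not None:
            tokens.append(eos)
        result.append(tokens)
    if flatten:
        return [token for tokens in result for token in tokens]
    return result


text = "= Title =\n\nfirst line of text"
print(read_corpus(text, eos='<eos>'))
print(read_corpus(text, flatten=True, eos='<eos>')[:4])
print(read_corpus(text, skip_empty=False, bos='<bos>')[1])  # ['<bos>']
```

The last call shows why skip_empty=False changes the dataset length: the blank line becomes a sample that holds only the bos token.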
-
class
gluonnlp.data.
ConcatDataset
(datasets)[source]¶ Dataset that concatenates a list of datasets.
- Parameters
datasets (list) – List of datasets.
-
class
gluonnlp.data.
TSVDataset
(filename, encoding='utf8', sample_splitter=<function line_splitter>, field_separator=<gluonnlp.data.utils.Splitter object>, num_discard_samples=0, field_indices=None, allow_missing=False)[source]¶ Common tab separated text dataset that reads text fields based on provided sample splitter and field separator.
The returned dataset includes samples, each of which can either be a list of text fields if field_separator is specified, or otherwise a single string segment produced by the sample_splitter.
Example:
# assume `test.tsv` contains the following content:
# Id FirstName LastName
# a Jiheng Jiang
# b Laoban Zha
# discard the first line and select the 0th and 2nd fields
dataset = data.TSVDataset('test.tsv', num_discard_samples=1, field_indices=[0, 2])
assert dataset[0] == ['a', 'Jiang']
assert dataset[1] == ['b', 'Zha']
- Parameters
filename (str or list of str) – Path to the input text file or list of paths to the input text files.
encoding (str, default 'utf8') – File encoding format.
sample_splitter (function, default str.splitlines) – A function that splits the dataset string into samples.
field_separator (function or None, default Splitter('\t')) – A function that splits each sample string into list of text fields. If None, raw samples are returned according to sample_splitter.
num_discard_samples (int, default 0) – Number of samples discarded at the head of the first file.
field_indices (list of int or None, default None) – If set, for each sample, only fields with provided indices are selected as the output. Otherwise all fields are returned.
allow_missing (bool, default False) – If set to True, no exception will be thrown if the number of fields is smaller than the maximum field index provided.
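The field selection described above can be sketched on an in-memory string. `read_tsv` is a hypothetical helper written for illustration. In particular, silently skipping absent fields when allow_missing is True is an assumption about the behaviour, not something the description confirms.

```python
def read_tsv(text, num_discard_samples=0, field_indices=None,
             allow_missing=False, separator='\t'):
    """Split lines into fields and keep only the requested field indices."""
    rows = []
    for line in text.splitlines()[num_discard_samples:]:
        fields = line.split(separator)
        if field_indices is not None:
            needed = max(field_indices, default=-1)
            if not allow_missing and needed >= len(fields):
                raise ValueError('sample has too few fields: %r' % line)
            # Assumed behaviour: indices beyond the sample are skipped.
            fields = [fields[i] for i in field_indices if i < len(fields)]
        rows.append(fields)
    return rows


content = "Id\tFirstName\tLastName\na\tJiheng\tJiang\nb\tLaoban\tZha"
rows = read_tsv(content, num_discard_samples=1, field_indices=[0, 2])
print(rows)  # [['a', 'Jiang'], ['b', 'Zha']]
```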
-
class
gluonnlp.data.
NumpyDataset
(filename, **kwargs)[source]¶ A dataset wrapping over a Numpy binary (.npy, .npz) file.
If the file is a .npy file, then a single numpy array is loaded. If the file is a .npz file with multiple arrays, then a list of numpy arrays is loaded, ordered by their key in the archive.
Sparse matrices are not yet supported.
- Parameters
-
get_field
(field)[source]¶ Return the dataset that corresponds to the provided key.
- Example::
a = np.ones((2,2))
b = np.zeros((2,2))
np.savez('data.npz', a=a, b=b)
dataset = NumpyDataset('data.npz')
data_a = dataset.get_field('a')
data_b = dataset.get_field('b')
- Parameters
field (str) – The name of the field to retrieve.
-
class
gluonnlp.data.
GBWStream
(segment='train', skip_empty=True, bos=None, eos='<eos>', root='/var/lib/jenkins/.mxnet/datasets/gbw')[source]¶ 1-Billion-Word word-level dataset for language modeling, from Google.
The GBWStream iterates over CorpusDatasets(flatten=False).
Source http://www.statmt.org/lm-benchmark
License: Apache
- Parameters
segment ({'train', 'test'}, default 'train') – Dataset segment.
skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
eos (str or None, default '<eos>') – The token to add at the end of each sentence. If None, nothing is added.
root (str, default '$MXNET_HOME/datasets/gbw') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
-
class
gluonnlp.data.
Text8
(root='/var/lib/jenkins/.mxnet/datasets/text8', segment='train', max_sentence_length=10000)[source]¶ Text8 corpus
http://mattmahoney.net/dc/textdata.html
Part of the test data for the Large Text Compression Benchmark http://mattmahoney.net/dc/text.html. The first 10**8 bytes of the cleaned English Wikipedia dump on Mar. 3, 2006.
License: https://en.wikipedia.org/wiki/Wikipedia:Copyrights
- Parameters
root (str, default '$MXNET_HOME/datasets/text8') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
-
class
gluonnlp.data.
Fil9
(root='/var/lib/jenkins/.mxnet/datasets/fil9', segment='train', max_sentence_length=None)[source]¶ Fil9 corpus
http://mattmahoney.net/dc/textdata.html
Part of the test data for the Large Text Compression Benchmark http://mattmahoney.net/dc/text.html. The first 10**9 bytes of the English Wikipedia dump on Mar. 3, 2006.
License: https://en.wikipedia.org/wiki/Wikipedia:Copyrights
- Parameters
root (str, default '$MXNET_HOME/datasets/fil9') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
-
class
gluonnlp.data.
Enwik8
(root='/var/lib/jenkins/.mxnet/datasets/enwik8', segment='train')[source]¶ Enwik8 corpus
http://mattmahoney.net/dc/textdata.html
Part of the test data for the Large Text Compression Benchmark http://mattmahoney.net/dc/text.html. The first 10**8 bytes of the English Wikipedia dump on Mar. 3, 2006.
License: https://en.wikipedia.org/wiki/Wikipedia:Copyrights
- Parameters
root (str, default '$MXNET_HOME/datasets/enwik8') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
segment (str) – The train, test, valid, trainraw, testraw and validraw segments, preprocessed with https://github.com/salesforce/awd-lstm-lm/blob/master/data/enwik8/prep_enwik8.py, are provided.
-
class
gluonnlp.data.
WikiText2
(segment='train', flatten=True, skip_empty=True, tokenizer=<function WikiText2.<lambda>>, bos=None, eos='<eos>', root='/var/lib/jenkins/.mxnet/datasets/wikitext-2', **kwargs)[source]¶ WikiText-2 word-level dataset for language modeling, from Salesforce research.
WikiText2 is implemented as CorpusDataset with the default flatten=True.
From https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
License: Creative Commons Attribution-ShareAlike
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
flatten (bool, default True) – Whether to return all samples as flattened tokens. If True, each sample is a token.
skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
tokenizer (function, default str.split) – A function that splits each sample string into list of tokens.
bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
eos (str or None, default '<eos>') – The token to add at the end of each sentence. If None, nothing is added.
root (str, default '$MXNET_HOME/datasets/wikitext-2') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> wikitext2 = gluonnlp.data.WikiText2('val', root='./datasets/wikitext2')
-etc-
>>> len(wikitext2)
216347
>>> wikitext2[0]
'='
>>> wikitext2 = gluonnlp.data.WikiText2('val', flatten=False,
...                                     root='./datasets/wikitext2')
>>> len(wikitext2)
2461
>>> wikitext2[0]
['=', 'Homarus', 'gammarus', '=', '<eos>']
>>> wikitext2 = gluonnlp.data.WikiText2('val', flatten=False, bos='<bos>', eos=None,
...                                     root='./datasets/wikitext2')
>>> wikitext2[0]
['<bos>', '=', 'Homarus', 'gammarus', '=']
>>> wikitext2 = gluonnlp.data.WikiText2('val', flatten=False, bos='<bos>', eos=None,
...                                     skip_empty=False, root='./datasets/wikitext2')
>>> len(wikitext2)
3760
>>> wikitext2[0]
['<bos>']
-
class
gluonnlp.data.
WikiText103
(segment='train', flatten=True, skip_empty=True, tokenizer=<function WikiText103.<lambda>>, bos=None, eos='<eos>', root='/var/lib/jenkins/.mxnet/datasets/wikitext-103', **kwargs)[source]¶ WikiText-103 word-level dataset for language modeling, from Salesforce research.
WikiText103 is implemented as CorpusDataset with the default flatten=True.
From https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
License: Creative Commons Attribution-ShareAlike
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
flatten (bool, default True) – Whether to return all samples as flattened tokens. If True, each sample is a token.
skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
tokenizer (function, default str.split) – A function that splits each sample string into list of tokens.
bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
eos (str or None, default '<eos>') – The token to add at the end of each sentence. If None, nothing is added.
root (str, default '$MXNET_HOME/datasets/wikitext-103') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> wikitext103 = gluonnlp.data.WikiText103('val', root='./datasets/wikitext103')
-etc-
>>> len(wikitext103)
216347
>>> wikitext103[0]
'='
>>> wikitext103 = gluonnlp.data.WikiText103('val', flatten=False,
...                                         root='./datasets/wikitext103')
>>> len(wikitext103)
2461
>>> wikitext103[0]
['=', 'Homarus', 'gammarus', '=', '<eos>']
>>> wikitext103 = gluonnlp.data.WikiText103('val', flatten=False, bos='<bos>', eos=None,
...                                         root='./datasets/wikitext103')
>>> wikitext103[0]
['<bos>', '=', 'Homarus', 'gammarus', '=']
>>> wikitext103 = gluonnlp.data.WikiText103('val', flatten=False, bos='<bos>', eos=None,
...                                         skip_empty=False, root='./datasets/wikitext103')
>>> len(wikitext103)
3760
>>> wikitext103[0]
['<bos>']
-
class
gluonnlp.data.
WikiText2Raw
(segment='train', flatten=True, skip_empty=True, bos=None, eos=None, tokenizer=<function WikiText2Raw.<lambda>>, root='/var/lib/jenkins/.mxnet/datasets/wikitext-2', **kwargs)[source]¶ WikiText-2 character-level dataset for language modeling
WikiText2Raw is implemented as CorpusDataset with the default flatten=True.
From Salesforce research: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
License: Creative Commons Attribution-ShareAlike
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
flatten (bool, default True) – Whether to return all samples as flattened tokens. If True, each sample is a token.
skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
tokenizer (function, default s.encode('utf-8')) – A function that splits each sample string into list of tokens. The tokenizer can also be used to convert everything to lowercase. E.g. with tokenizer=lambda s: s.lower().encode(‘utf-8’)
bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
eos (str or None, default None) – The token to add at the end of each sentence. If None, nothing is added.
root (str, default '$MXNET_HOME/datasets/wikitext-2') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> wikitext2 = gluonnlp.data.WikiText2Raw('val', root='./datasets/wikitext2')
-etc-
>>> len(wikitext2)
1136862
>>> wikitext2[0]
61
>>> type(wikitext2[0])
<class 'int'>
>>> wikitext2 = gluonnlp.data.WikiText2Raw('val', flatten=False,
...                                        tokenizer=None, root='./datasets/wikitext2')
>>> len(wikitext2)
2461
>>> wikitext2[0]
'= Homarus gammarus ='
>>> wikitext2 = gluonnlp.data.WikiText2Raw('val', flatten=False, bos='<bos>', eos=None,
...                                        tokenizer=lambda s: s.split(),
...                                        root='./datasets/wikitext2')
>>> wikitext2[0]
['<bos>', '=', 'Homarus', 'gammarus', '=']
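The integer tokens in the example follow from the default tokenizer: encoding a string as UTF-8 yields bytes, and iterating over bytes in Python 3 produces integers. A short standalone sketch using only the standard library (the function names are illustrative):

```python
def byte_tokenizer(sample):
    """Default-style tokenizer: one token per UTF-8 byte."""
    return sample.encode('utf-8')


def lowercase_byte_tokenizer(sample):
    """Variant that lowercases before encoding, as suggested for `tokenizer`."""
    return sample.lower().encode('utf-8')


line = '= Homarus gammarus ='
tokens = list(byte_tokenizer(line))
print(tokens[0], type(tokens[0]))        # 61 <class 'int'>
print(bytes(tokens).decode('utf-8'))     # = Homarus gammarus =
print(chr(list(lowercase_byte_tokenizer(line))[2]))  # h
```

Decoding with bytes(tokens).decode('utf-8') recovers the original text, which is handy when inspecting model output.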
-
class
gluonnlp.data.
WikiText103Raw
(segment='train', flatten=True, skip_empty=True, tokenizer=<function WikiText103Raw.<lambda>>, bos=None, eos=None, root='/var/lib/jenkins/.mxnet/datasets/wikitext-103', **kwargs)[source]¶ WikiText-103 character-level dataset for language modeling
WikiText103Raw is implemented as CorpusDataset with the default flatten=True.
From Salesforce research: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
License: Creative Commons Attribution-ShareAlike
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
flatten (bool, default True) – Whether to return all samples as flattened tokens. If True, each sample is a token.
skip_empty (bool, default True) – Whether to skip the empty samples produced from sample_splitters. If False, bos and eos will be added in empty samples.
tokenizer (function, default s.encode('utf-8')) – A function that splits each sample string into list of tokens. The tokenizer can also be used to convert everything to lowercase. E.g. with tokenizer=lambda s: s.lower().encode(‘utf-8’)
bos (str or None, default None) – The token to add at the beginning of each sentence. If None, nothing is added.
eos (str or None, default None) – The token to add at the end of each sentence. If None, nothing is added.
root (str, default '$MXNET_HOME/datasets/wikitext-103') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> wikitext103 = gluonnlp.data.WikiText103Raw('val', root='./datasets/wikitext103')
-etc-
>>> len(wikitext103)
1136862
>>> wikitext103[0]
61
>>> wikitext103 = gluonnlp.data.WikiText103Raw('val', flatten=False,
...                                            root='./datasets/wikitext103')
>>> len(wikitext103)
2461
>>> wikitext103[0]
[61, 32, 72, 111, 109, 97, 114, 117, 115, 32, 103, 97, 109, 109, 97, 114, 117, 115, 32, 61]
>>> wikitext103 = gluonnlp.data.WikiText103Raw('val', flatten=False, tokenizer=None,
...                                            root='./datasets/wikitext103')
>>> wikitext103[0]
'= Homarus gammarus ='
-
class
gluonnlp.data.
IMDB
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/imdb')[source]¶ IMDB reviews for sentiment analysis.
From http://ai.stanford.edu/~amaas/data/sentiment/
Positive classes have label values in [7, 10]. Negative classes have label values in [1, 4]. All samples in unsupervised set have labels with value 0.
- Parameters
Examples
>>> imdb = gluonnlp.data.IMDB('test', root='./datasets/imdb')
-etc-
>>> len(imdb)
25000
>>> len(imdb[0])
2
>>> type(imdb[0][0]), type(imdb[0][1])
(<class 'str'>, <class 'int'>)
>>> imdb[0][0][:75]
'I went and saw this movie last night after being coaxed to by a few friends'
>>> imdb[0][1]
10
>>> imdb = gluonnlp.data.IMDB('unsup', root='./datasets/imdb')
-etc-
>>> len(imdb)
50000
>>> len(imdb[0])
2
>>> type(imdb[0][0]), type(imdb[0][1])
(<class 'str'>, <class 'int'>)
>>> imdb[0][0][:70]
'I admit, the great majority of films released before say 1933 are just'
>>> imdb[0][1]
0
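Because the labels are ratings rather than class ids, a binary classifier needs a mapping step. The sketch below applies the ranges stated above. The function name and the choice of 1, 0 and None as targets are assumptions for illustration.

```python
def imdb_polarity(rating):
    """Map an IMDB rating to 1 (positive), 0 (negative) or None (unlabeled)."""
    if 7 <= rating <= 10:
        return 1
    if 1 <= rating <= 4:
        return 0
    if rating == 0:
        return None  # sample from the unsupervised set
    raise ValueError('unexpected rating: %r' % rating)


# Toy (text, rating) pairs in the same layout as the dataset samples.
samples = [('great film', 10), ('dull plot', 3), ('no label', 0)]
print([(text, imdb_polarity(rating)) for text, rating in samples])
```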
-
class
gluonnlp.data.
MR
(root='/var/lib/jenkins/.mxnet/datasets/mr')[source]¶ Movie reviews for sentiment analysis.
From https://www.cs.cornell.edu/people/pabo/movie-review-data/
Positive class has label value 1. Negative class has label value 0.
- Parameters
root (str, default '$MXNET_HOME/datasets/mr') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> mr = gluonnlp.data.MR(root='./datasets/mr')
-etc-
>>> len(mr)
10662
>>> len(mr[3])
2
>>> type(mr[3][0]), type(mr[3][1])
(<class 'str'>, <class 'int'>)
>>> mr[3][0][:55]
'if you sometimes like to go to the movies to have fun ,'
>>> mr[3][1]
1
-
class
gluonnlp.data.
TREC
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/trec')[source]¶ Question dataset for question classification.
From http://cogcomp.cs.illinois.edu/Data/QA/QC/
- Class labels are (http://cogcomp.org/Data/QA/QC/definition.html):
DESCRIPTION: 0
ENTITY: 1
ABBREVIATION: 2
HUMAN: 3
LOCATION: 4
NUMERIC: 5
The first space-separated token in the text of each sample is the fine-grained label.
- Parameters
Examples
>>> trec = gluonnlp.data.TREC('test', root='./datasets/trec')
-etc-
>>> len(trec)
500
>>> len(trec[0])
2
>>> type(trec[0][0]), type(trec[0][1])
(<class 'str'>, <class 'int'>)
>>> trec[0][0]
'How far is it from Denver to Aspen ?'
>>> (trec[0][1], trec[0][0].split()[0])
(5, 'How')
-
class
gluonnlp.data.
SUBJ
(root='/var/lib/jenkins/.mxnet/datasets/subj')[source]¶ Subjectivity dataset for sentiment analysis.
Positive class has label value 1. Negative class has label value 0.
- Parameters
root (str, default '$MXNET_HOME/datasets/subj') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> subj = gluonnlp.data.SUBJ(root='./datasets/subj')
-etc-
>>> len(subj)
10000
>>> len(subj[0])
2
>>> type(subj[0][0]), type(subj[0][1])
(<class 'str'>, <class 'int'>)
>>> subj[0][0][:60]
'its impressive images of crematorium chimney fires and stack'
>>> subj[0][1]
1
-
class
gluonnlp.data.
SST_1
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/sst-1')[source]¶ Stanford Sentiment Treebank: an extension of the MR data set. However, train/dev/test splits are provided and labels are fine-grained (very positive, positive, neutral, negative, very negative).
From http://nlp.stanford.edu/sentiment/
- Class labels are:
very positive: 4
positive: 3
neutral: 2
negative: 1
very negative: 0
- Parameters
Examples
>>> sst_1 = gluonnlp.data.SST_1('test', root='./datasets/sst_1')
-etc-
>>> len(sst_1)
2210
>>> len(sst_1[0])
2
>>> type(sst_1[0][0]), type(sst_1[0][1])
(<class 'str'>, <class 'int'>)
>>> sst_1[0][0][:73]
'no movement , no yuks , not much of anything .'
>>> sst_1[0][1]
1
-
class
gluonnlp.data.
SST_2
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/sst-2')[source]¶ Stanford Sentiment Treebank: an extension of the MR data set. Same as the SST-1 data set except that neutral reviews are removed and labels are binary (positive, negative).
From http://nlp.stanford.edu/sentiment/
Positive class has label value 1. Negative class has label value 0.
- Parameters
Examples
>>> sst_2 = gluonnlp.data.SST_2('test', root='./datasets/sst_2')
-etc-
>>> len(sst_2)
1821
>>> len(sst_2[0])
2
>>> type(sst_2[0][0]), type(sst_2[0][1])
(<class 'str'>, <class 'int'>)
>>> sst_2[0][0][:65]
'no movement , no yuks , not much of anything .'
>>> sst_2[0][1]
0
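The relationship between the SST-1 and SST-2 label sets can be written as a small mapping: neutral samples are dropped, and the two positive and two negative grades collapse into one class each. The sketch below illustrates that mapping on toy data. SST_2 is distributed as its own dataset, so this is not how the library builds it.

```python
def sst1_to_binary(samples):
    """Collapse SST-1 labels (0..4) into binary labels, dropping neutral (2)."""
    binary = []
    for text, label in samples:
        if label == 2:
            continue  # neutral samples are removed
        binary.append((text, 1 if label > 2 else 0))
    return binary


# Hypothetical (text, label) pairs using the SST-1 label values.
toy = [('superb', 4), ('fine', 3), ('okay', 2), ('weak', 1), ('awful', 0)]
print(sst1_to_binary(toy))
# [('superb', 1), ('fine', 1), ('weak', 0), ('awful', 0)]
```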
-
class
gluonnlp.data.
CR
(root='/var/lib/jenkins/.mxnet/datasets/cr')[source]¶ Customer reviews of various products (cameras, MP3s etc.). The task is to predict positive/negative reviews.
Positive class has label value 1. Negative class has label value 0.
- Parameters
root (str, default '$MXNET_HOME/datasets/cr') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> cr = gluonnlp.data.CR(root='./datasets/cr')
-etc-
>>> len(cr)
3775
>>> len(cr[3])
2
>>> type(cr[3][0]), type(cr[3][1])
(<class 'str'>, <class 'int'>)
>>> cr[3][0][:55]
'i know the saying is " you get what you pay for " but a'
>>> cr[3][1]
0
-
class
gluonnlp.data.
MPQA
(root='/var/lib/jenkins/.mxnet/datasets/mpqa')[source]¶ Opinion polarity detection subtask of the MPQA dataset.
From http://www.cs.pitt.edu/mpqa/
Positive class has label value 1. Negative class has label value 0.
- Parameters
root (str, default '$MXNET_HOME/datasets/mpqa') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> mpqa = gluonnlp.data.MPQA(root='./datasets/mpqa')
-etc-
>>> len(mpqa)
10606
>>> len(mpqa[3])
2
>>> type(mpqa[3][0]), type(mpqa[3][1])
(<class 'str'>, <class 'int'>)
>>> mpqa[3][0][:25]
'many years of decay'
>>> mpqa[3][1]
0
-
class
gluonnlp.data.
WordSimilarityEvaluationDataset
(root)[source]¶ Base class for word similarity or relatedness task datasets.
Inheriting classes are assumed to implement datasets of the form [‘word1’, ‘word2’, score] where score is a numerical similarity or relatedness score with respect to ‘word1’ and ‘word2’.
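Samples of the form ['word1', 'word2', score] are commonly used by comparing a model's similarity for each pair with the human score through a rank correlation. The sketch below shows that idea with cosine similarity and Spearman's coefficient in plain Python. The embeddings and samples are made-up toy values, and the rank helper assumes there are no tied values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical 2-d embeddings and similarity samples.
vectors = {'cat': (1.0, 0.1), 'dog': (0.9, 0.2),
           'car': (0.1, 1.0), 'bus': (0.3, 0.8)}
samples = [['cat', 'dog', 9.0], ['car', 'bus', 8.5], ['cat', 'car', 1.5]]

predicted = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in samples]
gold = [score for _, _, score in samples]
rho = spearman(predicted, gold)
print(rho)
```

Because the correlation only depends on ranks, the different score scales of the datasets below (0 to 10, 0 to 50, 1 to 5 and so on) do not affect the result.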
-
class
gluonnlp.data.
WordAnalogyEvaluationDataset
(root)[source]¶ Base class for word analogy task datasets.
Inheriting classes are assumed to implement datasets of the form [‘word1’, ‘word2’, ‘word3’, ‘word4’] or [‘word1’, [ ‘word2a’, ‘word2b’, … ], ‘word3’, [ ‘word4a’, ‘word4b’, … ]].
-
class
gluonnlp.data.
WordSim353
(segment='all', root='/var/lib/jenkins/.mxnet/datasets/wordsim353')[source]¶ WordSim353 dataset.
The dataset was collected by Finkelstein et al. (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/). Agirre et al. proposed to split the collection into two datasets, one focused on measuring similarity, and the other one on relatedness (http://alfonseca.org/eng/research/wordsim353.html).
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: the concept revisited. ACM Trans. Inf. Syst., 20(1), 116–131. https://dl.acm.org/citation.cfm?id=372094
Agirre, E., Alfonseca, E., Hall, K. B., Kravalova, J., Pasca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and wordnet-based approaches. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA (pp. 19–27). The Association for Computational Linguistics.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 10 (very much related or identical words).
- Parameters
Examples
>>> wordsim353 = gluonnlp.data.WordSim353('similarity', root='./datasets/wordsim353')
-etc-
>>> len(wordsim353)
203
>>> wordsim353[0]
['Arafat', 'Jackson', 2.5]
-
class
gluonnlp.data.
MEN
(segment='dev', root='/var/lib/jenkins/.mxnet/datasets/men')[source]¶ MEN dataset for word-similarity and relatedness.
The dataset was collected by Bruni et al. (https://staff.fnwi.uva.nl/e.bruni/MEN).
Bruni, E., Boleda, G., Baroni, M., & Nam-Khanh Tran (2012). Distributional semantics in technicolor. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 1: Long Papers (pp. 136–145). The Association for Computer Linguistics.
License: Creative Commons Attribution 2.0 Generic (CC BY 2.0)
Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 50 (very much related or identical words).
- Parameters
Examples
>>> men = gluonnlp.data.MEN('test', root='./datasets/men')
-etc-
>>> len(men)
1000
>>> men[0]
['display', 'pond', 10.0]
-
class
gluonnlp.data.
RadinskyMTurk
(root='/var/lib/jenkins/.mxnet/datasets/radinskymturk')[source]¶ MTurk dataset for word-similarity and relatedness by Radinsky et al.
Radinsky, K., Agichtein, E., Gabrilovich, E., & Markovitch, S. (2011). A word at a time: computing word relatedness using temporal semantic analysis. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, & R. Kumar, Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 - April 1, 2011 (pp. 337–346). ACM.
License: Unspecified
Each sample consists of a pair of words, and a score with scale from 1 (totally unrelated words) to 5 (very much related or identical words).
- Parameters
root (str, default '$MXNET_HOME/datasets/radinskymturk') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> radinskymturk = gluonnlp.data.RadinskyMTurk(root='./datasets/radinskymturk')
-etc-
>>> len(radinskymturk)
287
>>> radinskymturk[0]
['episcopal', 'russia', 2.75]
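The similarity datasets in this section use different score scales (for example 1 to 5 here and 0 to 50 for MEN), so pooling or comparing raw scores across datasets calls for a common range. A minimal min-max rescaling sketch follows. The helper is hypothetical and not part of gluonnlp.

```python
def rescale(score, low, high):
    """Map a score from the scale [low, high] onto [0, 1]."""
    return (score - low) / (high - low)


print(rescale(2.75, low=1, high=5))    # RadinskyMTurk sample -> 0.4375
print(rescale(10.0, low=0, high=50))   # MEN sample -> 0.2
```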
-
class
gluonnlp.data.
RareWords
(root='/var/lib/jenkins/.mxnet/datasets/rarewords')[source]¶ Rare words dataset for word similarity and relatedness.
Luong, T., Socher, R., & Manning, C. D. (2013). Better word representations with recursive neural networks for morphology. In J. Hockenmaier, & S. Riedel, Proceedings of the Seventeenth Conference on Computational Natural Language Learning, CoNLL 2013, Sofia, Bulgaria, August 8-9, 2013 (pp. 104–113). ACL.
License: Unspecified
Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 10 (very much related or identical words).
- Parameters
root (str, default '$MXNET_HOME/datasets/rarewords') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> rarewords = gluonnlp.data.RareWords(root='./datasets/rarewords')
-etc-
>>> len(rarewords)
2034
>>> rarewords[0]
['squishing', 'squirt', 5.88]
-
class
gluonnlp.data.
SimLex999
(root='/var/lib/jenkins/.mxnet/datasets/simlex999')[source]¶ SimLex999 dataset for word similarity.
Hill, F., Reichart, R., & Korhonen, A. (2015). Simlex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665–695. https://arxiv.org/abs/1408.3456
License: Unspecified
Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 10 (very much related or identical words).
- Parameters
root (str, default '$MXNET_HOME/datasets/simlex999') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> simlex999 = gluonnlp.data.SimLex999(root='./datasets/simlex999')
-etc-
>>> len(simlex999)
999
>>> simlex999[0]
['old', 'new', 1.58]
-
class
gluonnlp.data.
SimVerb3500
(segment='full', root='/var/lib/jenkins/.mxnet/datasets/simverb3500')[source]¶ SimVerb3500 dataset for word similarity.
Hill, F., Reichart, R., & Korhonen, A. (2015). Simlex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665–695. https://arxiv.org/abs/1408.3456
License: Unspecified
Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 10 (very much related or identical words).
- Parameters
root (str, default '$MXNET_HOME/datasets/simverb3500') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> simverb3500 = gluonnlp.data.SimVerb3500(root='./datasets/simverb3500')
-etc-
>>> len(simverb3500)
3500
>>> simverb3500[0]
['take', 'remove', 6.81]
-
class
gluonnlp.data.
SemEval17Task2
(segment='trial', language='en', root='/var/lib/jenkins/.mxnet/datasets/semeval17task2')[source]¶ SemEval17Task2 dataset for word-similarity.
The dataset was collected by Finkelstein et al. (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/). Agirre et al. proposed to split the collection into two datasets, one focused on measuring similarity, and the other one on relatedness (http://alfonseca.org/eng/research/wordsim353.html).
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: the concept revisited. ACM Trans. Inf. Syst., 20(1), 116–131. https://dl.acm.org/citation.cfm?id=372094
Agirre, E., Alfonseca, E., Hall, K. B., Kravalova, J., Pasca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and wordnet-based approaches. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA (pp. 19–27). The Association for Computational Linguistics.
License: Unspecified
Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 5 (very much related or identical words).
- Parameters
Examples
>>> semeval17task2 = gluonnlp.data.SemEval17Task2()
-etc-
>>> len(semeval17task2)
18
>>> semeval17task2[0]
['sunset', 'string', 0.05]
-
class
gluonnlp.data.
BakerVerb143
(root='/var/lib/jenkins/.mxnet/datasets/verb143')[source]¶ Verb143 dataset.
Baker, S., Reichart, R., & Korhonen, A. (2014). An unsupervised model for instance level subcategorization acquisition. In A. Moschitti, B. Pang, & W. Daelemans, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL (pp. 278–289). ACL.
144 pairs of verbs annotated by 10 annotators following the WS-353 guidelines.
License: Unspecified
Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 1 (very much related or identical words).
- Parameters
root (str, default '$MXNET_HOME/datasets/verb143') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> bakerverb143 = gluonnlp.data.BakerVerb143(root='./datasets/bakerverb143')
-etc-
>>> len(bakerverb143)
144
>>> bakerverb143[0]
['happen', 'say', 0.19]
-
class
gluonnlp.data.
YangPowersVerb130
(root='~/.mxnet/datasets/verb130')[source]¶ Verb-130 dataset.
Yang, D., & Powers, D. M. (2006). Verb similarity on the taxonomy of wordnet. In The Third International WordNet Conference: GWC 2006
License: Unspecified
Each sample consists of a pair of words, and a score with scale from 0 (totally unrelated words) to 4 (very much related or identical words).
- Parameters
root (str, default '$MXNET_HOME/datasets/verb130') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> yangpowersverb130 = gluonnlp.data.YangPowersVerb130(root='./datasets/yangpowersverb130')
>>> len(yangpowersverb130)
130
>>> yangpowersverb130[0]
['brag', 'boast', 4.0]
-
class
gluonnlp.data.
GoogleAnalogyTestSet
(group=None, category=None, lowercase=True, root='/var/lib/jenkins/.mxnet/datasets/google_analogy')[source]¶ Google analogy test set
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR).
License: Unspecified
Each sample consists of two analogical pairs of words.
- Parameters
group ({'syntactic', 'semantic'} or None, default None) – The subset for the specified type of analogy. None for the complete dataset.
category (str or None, default None) – The subset for the specified category of analogy. None for the complete dataset.
lowercase (boolean, default True) – Whether to convert words to lowercase.
root (str, default '$MXNET_HOME/datasets/google_analogy') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> googleanalogytestset = gluonnlp.data.GoogleAnalogyTestSet(
...     root='./datasets/googleanalogytestset')
-etc-
>>> len(googleanalogytestset)
19544
>>> googleanalogytestset[0]
['athens', 'greece', 'baghdad', 'iraq']
>>> googleanalogytestset = gluonnlp.data.GoogleAnalogyTestSet(
...     'syntactic', root='./datasets/googleanalogytestset')
>>> googleanalogytestset[0]
['amazing', 'amazingly', 'apparent', 'apparently']
>>> googleanalogytestset = gluonnlp.data.GoogleAnalogyTestSet(
...     'syntactic', 'gram8-plural', root='./datasets/googleanalogytestset')
>>> googleanalogytestset[0]
['banana', 'bananas', 'bird', 'birds']
-
class
gluonnlp.data.
BiggerAnalogyTestSet
(category=None, form_analogy_pairs=True, drop_alternative_solutions=True, root='/var/lib/jenkins/.mxnet/datasets/bigger_analogy')[source]¶ Bigger analogy test set
Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In Proceedings of the NAACL-HLT SRW (pp. 47–54). San Diego, California, June 12-17, 2016: ACL. Retrieved from https://www.aclweb.org/anthology/N/N16/N16-2002.pdf
License: Unspecified
Each sample consists of two analogical pairs of words.
- Parameters
root (str, default '$MXNET_HOME/datasets/bigger_analogy') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> biggeranalogytestset = gluonnlp.data.BiggerAnalogyTestSet(
...     root='./datasets/biggeranalogytestset')
-etc-
>>> len(biggeranalogytestset)
98000
>>> biggeranalogytestset[0]
['arm', 'armless', 'art', 'artless']
-
class
gluonnlp.data.
DataStream
[source]¶ Abstract Data Stream Interface.
DataStreams are useful to avoid loading big datasets into memory. A DataStream is an iterable object (it implements the __iter__ function). Whenever an iteration over the DataStream is requested (e.g. in a for loop or by calling iter(datastream)), a new iterator over all samples in the DataStream is returned. DataStreams can be lazily transformed by calling transform(), which returns a DataStream over the transformed samples.
-
__iter__
()[source]¶ Return an iterator over all elements of the DataStream.
This method returns a new iterator object that can iterate over all the objects in the DataStream.
- Returns
An object implementing the Python iterator protocol.
- Return type
iterator
-
-
class
gluonnlp.data.
SimpleDataStream
(iterable)[source]¶ SimpleDataStream wraps iterables to expose the DataStream API.
Unlike the iterable itself, the SimpleDataStream exposes the DataStream API and allows lazy transformation of the iterable.
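The idea of a lazily transformed stream can be illustrated without the library. The sketch below is plain Python and is not gluonnlp's implementation; the names LazyStream and wrap are hypothetical and chosen only for this illustration. It shows the two properties described above: each iteration request yields a fresh iterator, and transform() does no work until samples are actually consumed.

```python
class LazyStream:
    """Hypothetical stand-in for a DataStream-like object (illustration only)."""

    def __init__(self, make_iterator):
        # make_iterator is a zero-argument callable returning a new iterator.
        self._make_iterator = make_iterator

    def __iter__(self):
        # Every iteration request gets its own fresh iterator.
        return self._make_iterator()

    def transform(self, fn):
        # No sample is touched here; fn runs only while the result is iterated.
        return LazyStream(lambda: (fn(sample) for sample in self))


def wrap(iterable):
    """Expose any re-iterable object through the LazyStream interface."""
    return LazyStream(lambda: iter(iterable))


lines = wrap(["hello world", "lazy streams"])
tokens = lines.transform(str.split)
for sample in tokens:
    print(sample)
```

Because the transformation is stored as a function rather than applied eagerly, chaining several transform() calls costs nothing until the first sample is requested.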
-
class
gluonnlp.data.
DatasetStream
[source]¶ Abstract Dataset Stream Interface.
A DatasetStream is a DataStream where each sample is a mxnet.gluon.data.Dataset. An iteration over a DatasetStream iterates over mxnet.gluon.data.Dataset objects, each representing a chunk or shard of a larger dataset.
Iterating over sizeable chunks of a dataset can be helpful to speed up preprocessing as the overhead of preprocessing each sample individually is reduced (this is similar to the idea of using batches for training a model).
-
class
gluonnlp.data.
SimpleDatasetStream
(dataset, file_pattern, file_sampler='random', **kwargs)[source]¶ A simple stream of Datasets.
The SimpleDatasetStream is created from multiple files based on the provided file_pattern. One file is read at a time and a corresponding Dataset is returned. The Dataset is created based on the file and the kwargs passed to SimpleDatasetStream.
- Parameters
dataset (class) – The class for which to create an object for every file. kwargs are passed to this class.
file_pattern (str) – Path to the input text files.
file_sampler (str or gluon.data.Sampler, defaults to 'random') –
The sampler used to sample a file. The following string values are supported:
’sequential’: SequentialSampler
’random’: RandomSampler
kwargs – All other keyword arguments are passed to the dataset constructor.
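The behaviour described above (one dataset object per matching file, files visited in sequential or random order, extra keyword arguments forwarded to the dataset constructor) can be sketched in plain Python. This is an illustration under assumptions, not the library's code: LineDataset and iterate_file_datasets are hypothetical names, and the file pattern is assumed to be a glob pattern.

```python
import glob
import random


class LineDataset:
    """Hypothetical dataset holding the lines of a single text file."""

    def __init__(self, filename, lowercase=False):
        with open(filename, encoding="utf-8") as f:
            lines = [line.rstrip("\n") for line in f]
        self.samples = [l.lower() for l in lines] if lowercase else lines

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


def iterate_file_datasets(dataset_cls, file_pattern, file_sampler="random", **kwargs):
    """Yield one dataset_cls instance per file matching file_pattern."""
    files = sorted(glob.glob(file_pattern))
    if file_sampler == "random":
        order = random.sample(files, len(files))  # shuffled copy
    elif file_sampler == "sequential":
        order = files
    else:
        raise ValueError("file_sampler must be 'random' or 'sequential'")
    for filename in order:
        # Only one file is loaded at a time; kwargs go to the dataset constructor.
        yield dataset_cls(filename, **kwargs)
```

A call such as iterate_file_datasets(LineDataset, './shards/*.txt', 'sequential', lowercase=True) would then produce one lowercase dataset per shard, where the path is a placeholder.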
-
class
gluonnlp.data.
PrefetchingStream
(stream, num_prefetch=1, worker_type='thread')[source]¶ Prefetch a DataStream in a separate Thread or Process.
This iterator will create another thread or process to perform iter_next and then store the data in memory. It potentially accelerates the data read, at the cost of more memory usage.
The Python, NumPy and MXNet random states in the launched Thread or Process will be initialized randomly based on the next 32 bit integer in the Python, NumPy and MXNet random generator of the caller respectively (random.getrandbits(32), numpy.random.randint(0, 2**32), int(mx.nd.random.uniform(0, 2**32).asscalar())).
- Parameters
stream (DataStream) – Source stream.
num_prefetch (int, default 1) – Number of elements to prefetch from the stream. Must be greater than 0.
worker_type ('thread' or 'process', default 'thread') – Use a separate Python Thread or Process to prefetch.
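The thread-based variant of this idea can be sketched with the standard library alone. The class below is a hypothetical illustration (ThreadPrefetcher is not a gluonnlp name) and deliberately omits the random-state seeding and the process-based worker described above. A bounded queue of size num_prefetch holds the items fetched ahead of the consumer.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the source stream


class ThreadPrefetcher:
    """Hypothetical sketch: read a stream ahead of the consumer in a thread."""

    def __init__(self, stream, num_prefetch=1):
        if num_prefetch < 1:
            raise ValueError("num_prefetch must be greater than 0")
        self._stream = stream
        self._num_prefetch = num_prefetch

    def __iter__(self):
        # At most num_prefetch items wait in the buffer at any time.
        buffer = queue.Queue(maxsize=self._num_prefetch)

        def worker():
            try:
                for item in self._stream:
                    buffer.put(item)  # blocks while the buffer is full
            finally:
                buffer.put(_END)

        threading.Thread(target=worker, daemon=True).start()
        return self._drain(buffer)

    @staticmethod
    def _drain(buffer):
        while True:
            item = buffer.get()
            if item is _END:
                return
            yield item


for batch in ThreadPrefetcher(range(3), num_prefetch=2):
    print(batch)
```

One limitation of this sketch: if the consumer stops early, the worker stays blocked on the full buffer until the program exits, which is why the thread is marked as a daemon.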
-
class
gluonnlp.data.
CoNLL2000
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2000')[source]¶ CoNLL2000 Part-of-speech (POS) tagging and chunking joint task dataset.
Each sample has three fields: word, POS tag, chunk label.
From https://www.clips.uantwerpen.be/conll2000/chunking/
- Parameters
segment ({'train', 'test'}, default 'train') – Dataset segment.
root (str, default '$MXNET_HOME/datasets/conll2000') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> conll = gluonnlp.data.CoNLL2000('test', root='./datasets/conll2000')
-etc-
>>> len(conll)
2012
>>> len(conll[0])
3
>>> conll[8][0]
['SHEARSON', 'LEHMAN', 'HUTTON', 'Inc', '.']
>>> conll[8][1]
['NNP', 'NNP', 'NNP', 'NNP', '.']
>>> conll[8][2]
['B-NP', 'I-NP', 'I-NP', 'I-NP', 'O']
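The per-field layout of a sample (one list of words, one list of POS tags, one list of chunk labels) follows from transposing the column-oriented CoNLL text format. The sketch below assumes the usual layout of one token per line with whitespace-separated columns and blank lines between sentences; read_conll_columns is a hypothetical helper, not part of gluonnlp, and the sample text is reconstructed from the example output above.

```python
def read_conll_columns(text):
    """Turn CoNLL-style column text into per-sentence lists of fields."""
    samples, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(line.split())
        elif rows:
            # Blank line: close the sentence and transpose rows into fields.
            samples.append([list(col) for col in zip(*rows)])
            rows = []
    if rows:  # last sentence when the text does not end with a blank line
        samples.append([list(col) for col in zip(*rows)])
    return samples


sample_text = """\
SHEARSON NNP B-NP
LEHMAN NNP I-NP
HUTTON NNP I-NP
Inc NNP I-NP
. . O
"""

words, pos_tags, chunk_labels = read_conll_columns(sample_text)[0]
print(words)
print(pos_tags)
print(chunk_labels)
```

The same transposition applies to the other CoNLL datasets in this section; only the number of columns differs.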
-
class
gluonnlp.data.
CoNLL2001
(part, segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2001')[source]¶ CoNLL2001 Clause Identification dataset.
Each sample has four fields: word, POS tag, chunk label, clause tag.
From https://www.clips.uantwerpen.be/conll2001/clauses/
- Parameters
Examples
>>> conll = gluonnlp.data.CoNLL2001(1, 'testa', root='./datasets/conll2001')
-etc-
>>> len(conll)
2012
>>> len(conll[0])
4
>>> conll[8][0]
['SHEARSON', 'LEHMAN', 'HUTTON', 'Inc', '.']
>>> conll[8][1]
['NNP', 'NNP', 'NNP', 'NNP', '.']
>>> conll[8][2]
['B-NP', 'I-NP', 'I-NP', 'I-NP', 'O']
>>> conll[8][3]
['X', 'X', 'X', 'X', 'X']
-
class
gluonnlp.data.
CoNLL2002
(lang, segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2002')[source]¶ CoNLL2002 Named Entity Recognition (NER) task dataset.
For ‘esp’, each sample has two fields: word, NER label.
For ‘ned’, each sample has three fields: word, POS tag, NER label.
From https://www.clips.uantwerpen.be/conll2002/ner/
- Parameters
Examples
>>> conll = gluonnlp.data.CoNLL2002('esp', 'testa', root='./datasets/conll2002')
-etc-
>>> len(conll)
1915
>>> len(conll[0])
2
>>> conll[0][0]
['Sao', 'Paulo', '(', 'Brasil', ')', ',', '23', 'may', '(', 'EFECOM', ')', '.']
>>> conll[0][1]
['B-LOC', 'I-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']
-
class
gluonnlp.data.
CoNLL2004
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/conll2004')[source]¶ CoNLL2004 Semantic Role Labeling (SRL) task dataset.
Each sample has six or more fields: word, POS tag, chunk label, clause tag, NER label, target verbs, and sense labels (of variable number per sample).
From http://www.cs.upc.edu/~srlconll/st04/st04.html
- Parameters
segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.
root (str, default '$MXNET_HOME/datasets/conll2004') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> conll = gluonnlp.data.CoNLL2004('dev', root='./datasets/conll2004')
-etc-
>>> len(conll)
2012
>>> len(conll[8])
6
>>> conll[8][0]
['SHEARSON', 'LEHMAN', 'HUTTON', 'Inc', '.']
>>> conll[8][1]
['NNP', 'NNP', 'NNP', 'NNP', '.']
>>> conll[8][2]
['B-NP', 'I-NP', 'I-NP', 'I-NP', 'O']
>>> conll[8][3]
['*', '*', '*', '*', '*']
>>> conll[8][4]
['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O']
>>> conll[8][5]
['-', '-', '-', '-', '-']
-
class
gluonnlp.data.
UniversalDependencies21
(lang='en', segment='train', root='/var/lib/jenkins/.mxnet/datasets/ud2.1')[source]¶ Universal dependencies tree banks.
Each sample has 8 or more fields as described in http://universaldependencies.org/docs/format.html
From https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2515
- Parameters
Examples
>>> ud = gluonnlp.data.UniversalDependencies21('en', 'dev', root='./datasets/ud21')
-etc-
>>> len(ud)
2002
>>> len(ud[0])
10
>>> ud[0][0]
['1', '2', '3', '4', '5', '6', '7']
>>> ud[0][1]
['From', 'the', 'AP', 'comes', 'this', 'story', ':']
>>> ud[0][2]
['from', 'the', 'AP', 'come', 'this', 'story', ':']
>>> ud[0][3]
['ADP', 'DET', 'PROPN', 'VERB', 'DET', 'NOUN', 'PUNCT']
>>> ud[0][4]
['IN', 'DT', 'NNP', 'VBZ', 'DT', 'NN', ':']
>>> ud[0][5][:3]
['_', 'Definite=Def|PronType=Art', 'Number=Sing']
>>> ud[0][6]
['3', '3', '4', '0', '6', '4', '4']
>>> ud[0][7]
['case', 'det', 'obl', 'root', 'det', 'nsubj', 'punct']
>>> ud[0][8]
['3:case', '3:det', '4:obl', '0:root', '6:det', '4:nsubj', '4:punct']
>>> ud[0][9]
['_', '_', '_', '_', '_', '_', '_']
-
class
gluonnlp.data.
IWSLT2015
(segment='train', src_lang='en', tgt_lang='vi', root='/var/lib/jenkins/.mxnet/datasets/iwslt2015')[source]¶ Preprocessed IWSLT English-Vietnamese Translation Dataset.
We use the preprocessed version provided in https://nlp.stanford.edu/projects/nmt/
- Parameters
segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘val’, ‘test’ or their combinations.
src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘vi’
tgt_lang (str, default 'vi') – The target language. Options for source and target languages are ‘en’ <-> ‘vi’
root (str, default '$MXNET_HOME/datasets/iwslt2015') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
-
class
gluonnlp.data.
WMT2014
(segment='train', src_lang='en', tgt_lang='de', full=False, root='/var/lib/jenkins/.mxnet/datasets/wmt2014')[source]¶ Translation Corpus of the WMT2014 Evaluation Campaign.
http://www.statmt.org/wmt14/translation-task.html
- Parameters
segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2009’, ‘newstest2010’, ‘newstest2011’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’ or their combinations
src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’
tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’
full (bool, default False) – By default, the “filtered test sets” are used; if full is True, the “cleaned test sets” are used.
root (str, default '$MXNET_HOME/datasets/wmt2014') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
-
class
gluonnlp.data.
WMT2014BPE
(segment='train', src_lang='en', tgt_lang='de', full=False, root='/var/lib/jenkins/.mxnet/datasets/wmt2014')[source]¶ Preprocessed Translation Corpus of the WMT2014 Evaluation Campaign.
We preprocess the dataset by adapting https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh
- Parameters
segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2009’, ‘newstest2010’, ‘newstest2011’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’ or their combinations
src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’
tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’
full (bool, default False) – By default, the test dataset in http://statmt.org/wmt14/test-filtered.tgz is used. When full is True, the test dataset in http://statmt.org/wmt14/test-full.tgz is used.
root (str, default '$MXNET_HOME/datasets/wmt2014') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
-
class
gluonnlp.data.
WMT2016
(segment='train', src_lang='en', tgt_lang='de', root='/var/lib/jenkins/.mxnet/datasets/wmt2016')[source]¶ Translation Corpus of the WMT2016 Evaluation Campaign.
- Parameters
segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’, ‘newstest2015’, ‘newstest2016’ or their combinations
src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’
tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’
root (str, default '$MXNET_HOME/datasets/wmt2016') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
-
class
gluonnlp.data.
WMT2016BPE
(segment='train', src_lang='en', tgt_lang='de', root='/var/lib/jenkins/.mxnet/datasets/wmt2016')[source]¶ Preprocessed Translation Corpus of the WMT2016 Evaluation Campaign.
We use the preprocessing script in https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh
- Parameters
segment (str or list of str, default 'train') – Dataset segment. Options are ‘train’, ‘newstest2012’, ‘newstest2013’, ‘newstest2014’, ‘newstest2015’, ‘newstest2016’ or their combinations
src_lang (str, default 'en') – The source language. Options for source and target languages are ‘en’ <-> ‘de’
tgt_lang (str, default 'de') – The target language. Options for source and target languages are ‘en’ <-> ‘de’
root (str, default '$MXNET_HOME/datasets/wmt2016') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
-
gluonnlp.data.
register
(class_=None, **kwargs)[source]¶ Registers a dataset with segment specific hyperparameters.
When passing keyword arguments to register, they are checked to be valid keyword arguments for the registered Dataset class constructor and are saved in the registry. Registered keyword arguments can be retrieved with the list_datasets function.
All arguments that result in creation of separate datasets should be registered. Examples are datasets divided in different segments or categories, or datasets containing multiple languages.
Once registered, an instance can be created by calling create() with the class name.
- Parameters
**kwargs (list or tuple of allowed argument values) – For each keyword argument, its value must be a list or tuple of the allowed argument values.
Examples
>>> @gluonnlp.data.register(segment=['train', 'test', 'dev'])
... class MyDataset(gluon.data.Dataset):
...     def __init__(self, segment='train'):
...         pass
>>> my_dataset = gluonnlp.data.create('MyDataset')
>>> print(type(my_dataset))
<class 'gluonnlp.data.registry.MyDataset'>
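How a registry of this kind fits together can be shown with a small, self-contained sketch. It is not gluonnlp's registry code: the module-level _REGISTRY dict and the simplified functions below are assumptions made for illustration, reusing the names register, create and list_datasets only to mirror the documented behaviour (keyword validation against the constructor, case-insensitive lookup, retrieval of registered parameters).

```python
import inspect

_REGISTRY = {}  # hypothetical store: lowercase class name -> (class, allowed kwargs)


def register(class_=None, **kwargs):
    """Register a class together with the allowed values of its keyword arguments."""

    def _register(cls):
        params = inspect.signature(cls.__init__).parameters
        for name, allowed in kwargs.items():
            if name not in params:
                raise ValueError(f"{name!r} is not an argument of {cls.__name__}")
            if not isinstance(allowed, (list, tuple)):
                raise ValueError(f"allowed values for {name!r} must be a list or tuple")
        _REGISTRY[cls.__name__.lower()] = (cls, dict(kwargs))
        return cls

    # Support both @register and @register(segment=[...]).
    return _register(class_) if class_ is not None else _register


def create(name, **kwargs):
    """Instantiate a registered class by its (case-insensitive) name."""
    cls, _ = _REGISTRY[name.lower()]
    return cls(**kwargs)


def list_datasets(name=None):
    """Return registered parameters for one dataset, or for all of them."""
    if name is None:
        return {n: params for n, (_, params) in _REGISTRY.items()}
    return _REGISTRY[name.lower()][1]


@register(segment=['train', 'test', 'dev'])
class MyDataset:
    def __init__(self, segment='train'):
        self.segment = segment


print(type(create('MyDataset', segment='dev')).__name__)
print(list_datasets('MyDataset'))
```

Registering the allowed values alongside the class is what lets tooling enumerate every segment or language variant of a dataset without instantiating it.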
-
gluonnlp.data.
create
(name, **kwargs)[source]¶ Creates an instance of a registered dataset.
- Parameters
name (str) – The dataset name (case-insensitive).
- Returns
An instance of mxnet.gluon.data.Dataset constructed with the keyword arguments passed to the create function.
-
gluonnlp.data.
list_datasets
(name=None)[source]¶ Get valid datasets and registered parameters.
- Parameters
name (str or None, default None) – Return names and registered parameters of registered datasets. If name is specified, only registered parameters of the respective dataset are returned.
- Returns
A dict of all the valid keyword parameter names for the specified dataset. If name is set to None, returns a dict mapping each valid name to its respective keyword parameter dict. The valid names can be plugged in gluonnlp.data.create(name).
- Return type
dict
-
class
gluonnlp.data.
SQuAD
(segment='train', version='1.1', root='/var/lib/jenkins/.mxnet/datasets/squad')[source]¶ Stanford Question Answering Dataset (SQuAD) - reading comprehension dataset.
From https://rajpurkar.github.io/SQuAD-explorer/
License: CreativeCommons BY-SA 4.0
The original data format is json, which has multiple contexts (a context is a paragraph of text from which questions are drawn). For each context there are multiple questions, and for each of these questions there are multiple (usually 3) answers.
This class loads the json and flattens it to a table view. Each record is a single question. Since there is more than one question per context in the original dataset, some records share the same context. The number of records in the dataset equals the number of questions in the json file.
Each record of the dataset has the following fields:
record_index: An index of the record, generated on the fly (from 0 to the index of the last question)
question_id: Question Id. It is a string and taken from the original json file as-is
question: Question text, taken from the original json file as-is
context: Context text. Will be the same for questions from the same context
answer_list: All answers for this question. Stored as python list
start_indices: All answers’ starting indices. Stored as python list. The position in this list is the same as the position of an answer in answer_list
is_impossible: Whether the question is unanswerable. Present only if version is ‘2.0’; SQuAD 2.0 contains some unanswerable questions.
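The flattening described above can be sketched as a pair of nested loops over the SQuAD json structure (data, paragraphs, qas, answers). The function below is an illustration, not the class's actual code: flatten_squad is a hypothetical name, and the toy input is made up for the example rather than taken from the dataset.

```python
def flatten_squad(raw, version="1.1"):
    """Flatten SQuAD-style json (already parsed into dicts) into one record per question."""
    records = []
    for article in raw["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                record = [
                    len(records),                               # record_index
                    qa["id"],                                   # question_id
                    qa["question"],                             # question
                    context,                                    # context (shared)
                    [a["text"] for a in qa["answers"]],         # answer_list
                    [a["answer_start"] for a in qa["answers"]], # start_indices
                ]
                if version == "2.0":
                    record.append(qa.get("is_impossible", False))
                records.append(tuple(record))
    return records


# Made-up toy input in the SQuAD layout, for illustration only.
toy = {"data": [{"title": "toy", "paragraphs": [{
    "context": "The sky is blue. Grass is green.",
    "qas": [
        {"id": "q1", "question": "What color is the sky?",
         "answers": [{"text": "blue", "answer_start": 11}]},
        {"id": "q2", "question": "What color is grass?",
         "answers": [{"text": "green", "answer_start": 26}]},
    ],
}]}]}

for record in flatten_squad(toy):
    print(record[0], record[2], record[4], record[5])
```

Both toy records carry the same context string, which mirrors how records from one paragraph share their context in the flattened view.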
- Parameters
Examples
>>> squad = gluonnlp.data.SQuAD('dev', '1.1', root='./datasets/squad')
-etc-
>>> len(squad)
10570
>>> len(squad[0])
6
>>> tuple(type(squad[0][i]) for i in range(6))
(<class 'int'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'list'>, <class 'list'>)
>>> squad[0][0]
0
>>> squad[0][1]
'56be4db0acb8001400a502ec'
>>> squad[0][2]
'Which NFL team represented the AFC at Super Bowl 50?'
>>> squad[0][3][:70]
'Super Bowl 50 was an American football game to determine the champion '
>>> squad[0][4]
['Denver Broncos', 'Denver Broncos', 'Denver Broncos']
>>> squad[0][5]
[177, 177, 177]
>>> squad2 = gluonnlp.data.SQuAD('dev', '2.0', root='./datasets/squad')
-etc-
>>> len(squad2)
11873
>>> len(squad2[0])
7
>>> type(squad2[0][6])
<class 'bool'>
>>> squad2[0][6]
False
-
class
gluonnlp.data.
ShardedDataLoader
(dataset, batch_size=None, shuffle=False, sampler=None, last_batch=None, batch_sampler=None, batchify_fn=None, num_workers=0, pin_memory=False, prefetch=None, thread_pool=False)[source]¶ Loads data from a dataset and returns mini-batches of data.
- Parameters
dataset (Dataset) – Source dataset. Note that numpy and mxnet arrays can be directly used as a Dataset.
batch_size (int) – Size of mini-batch.
shuffle (bool) – Whether to shuffle the samples.
sampler (Sampler) – The sampler to use. Either specify sampler or shuffle, not both.
last_batch ({'keep', 'discard', 'rollover'}) –
How to handle the last batch if batch_size does not evenly divide len(dataset).
keep - A batch with fewer samples than previous batches is returned.
discard - The last batch is discarded if it is incomplete.
rollover - The remaining samples are rolled over to the next epoch.
batch_sampler (Sampler) – A sampler that returns mini-batches. Do not specify batch_size, shuffle, sampler, and last_batch if batch_sampler is specified.
batchify_fn (callable) –
Callback function to allow users to specify how to merge samples into a batch. Defaults to default_batchify_fn:
def default_batchify_fn(data):
    if isinstance(data[0], nd.NDArray):
        return nd.stack(*data)
    elif isinstance(data[0], tuple):
        data = zip(*data)
        return [default_batchify_fn(i) for i in data]
    else:
        data = np.asarray(data)
        return nd.array(data, dtype=data.dtype)
num_workers (int, default 0) – The number of multiprocessing workers to use for data preprocessing. num_workers > 0 is not supported on Windows yet.
pin_memory (boolean, default False) – If True, the dataloader will copy NDArrays into pinned memory before returning them. Copying from CPU pinned memory to GPU is faster than from normal CPU memory.
prefetch (int, default is num_workers * 2) – The number of batches to prefetch; only takes effect if num_workers > 0. If prefetch > 0, worker processes are allowed to prefetch that many batches before data is requested from the iterator. A larger value gives smoother start-up performance but consumes more shared memory. A smaller value may defeat the purpose of using multiple worker processes; consider reducing num_workers in that case.
thread_pool (bool, default False) – If True, use a thread pool instead of a multiprocessing pool. A thread pool avoids shared memory usage. If the DataLoader is I/O bound or the GIL is not a bottleneck, the thread pool version may achieve better performance than multiprocessing.
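The three last_batch policies listed in the parameters above differ only in what happens to an incomplete final batch. The following plain-Python sketch makes that concrete; batch_indices is a hypothetical helper written for this illustration and says nothing about how the loader is implemented internally.

```python
def batch_indices(indices, batch_size, last_batch="keep", carry=()):
    """Split indices into batches; return (batches, leftover for the next epoch)."""
    if last_batch not in ("keep", "discard", "rollover"):
        raise ValueError("last_batch must be 'keep', 'discard' or 'rollover'")
    # Samples rolled over from the previous epoch come first.
    pool = list(carry) + list(indices)
    batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
    leftover = []
    if batches and len(batches[-1]) < batch_size:
        if last_batch == "discard":
            batches.pop()             # drop the incomplete batch
        elif last_batch == "rollover":
            leftover = batches.pop()  # save it for the next epoch
        # 'keep' leaves the short batch in place
    return batches, leftover


for policy in ("keep", "discard", "rollover"):
    print(policy, batch_indices(range(5), 2, policy))
```

With rollover, the returned leftover would be passed as carry when batching the next epoch, so no sample is lost across epochs.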
-
class
gluonnlp.data.
UnigramCandidateSampler
(weights, dtype='float32')[source]¶ Unigram Candidate Sampler
Draw random samples from a unigram distribution with specified weights using the alias method.
- Parameters
weights (mx.nd.NDArray) – Unnormalized class probabilities. Samples are drawn and returned on the same context as weights.context.
dtype (str or np.dtype, default 'float32') – Data type of the candidates. Make sure that the dtype precision is large enough to represent the size of your weights array precisely. For example, float32 can not distinguish 2**24 from 2**24 + 1.
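The float32 precision limit mentioned for dtype can be checked with the standard library alone. The helper to_float32 below is a hypothetical name introduced for this check; it round-trips a number through IEEE 754 single precision using struct.

```python
import struct


def to_float32(value):
    """Round-trip a number through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


# float32 has a 24-bit significand, so 2**24 + 1 is not representable
# and collapses onto 2**24.
print(to_float32(2**24) == to_float32(2**24 + 1))
```

In practice this means that candidate indices above 2**24 can silently collide when stored as float32, so a wider dtype is needed for very large weights arrays.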
-
hybrid_forward
(F, candidates_like, prob, alias)[source]¶ Draw samples from uniform distribution and return sampled candidates.
- Parameters
candidates_like (mxnet.nd.NDArray or mxnet.sym.Symbol) – This input specifies the shape of the candidates to be sampled.
- Returns
samples – The sampled candidates of shape candidates_like.shape. Candidates are sampled based on the weights specified on creation of the UnigramCandidateSampler.
- Return type
mxnet.nd.NDArray or mxnet.sym.Symbol
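The alias method referred to above precomputes two tables, a probability table and an alias table, so that each draw needs only one uniform index and one uniform number. The sketch below is a plain-Python version of the standard construction (often attributed to Vose) and is not the sampler's MXNet implementation; build_alias_table and draw are hypothetical names.

```python
import random


def build_alias_table(weights):
    """Build (prob, alias) tables from unnormalized weights."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]  # mean of scaled values is 1
    prob, alias = [1.0] * n, list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]   # keep s with this probability
        alias[s] = l          # otherwise return l
        scaled[l] = scaled[l] + scaled[s] - 1.0  # l donated mass to slot s
        (small if scaled[l] < 1.0 else large).append(l)
    # Slots still listed in small or large keep prob 1.0 and alias themselves.
    return prob, alias


def draw(prob, alias, rng=random):
    """Draw one index: pick a slot uniformly, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


prob, alias = build_alias_table([1, 2, 3, 4])
counts = [0] * 4
for _ in range(10000):
    counts[draw(prob, alias)] += 1
print(counts)
```

Table construction is linear in the number of classes and each draw is constant time, which is why this approach suits sampling negatives from a large vocabulary.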
-
class
gluonnlp.data.
ATISDataset
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/atis')[source]¶ Airline Travel Information System dataset from MS CNTK.
From https://github.com/Microsoft/CNTK/tree/master/Examples/LanguageUnderstanding/ATIS/Data
License: Unspecified
Each sample has three fields: tokens, slot labels, intent label.
- Parameters
segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.
root (str, default '$MXNET_HOME/datasets/atis') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> atis = gluonnlp.data.ATISDataset(root='./datasets/atis')
-etc-
>>> len(atis)
4478
>>> len(atis[0])
3
>>> len(atis[0][0])
10
>>> atis[0][0]
['i', 'want', 'to', 'fly', 'from', 'baltimore', 'to', 'dallas', 'round', 'trip']
>>> len(atis[0][1])
10
>>> atis[0][1][:8]
['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-toloc.city_name']
>>> atis[0][2]
array([10], dtype=int32)
-
class
gluonnlp.data.
SNIPSDataset
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/snips')[source]¶ Snips Natural Language Understanding Benchmark dataset.
Coucke et al. (2018). Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces. https://arxiv.org/abs/1805.10190
From https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines
License: Unspecified
Each sample has three fields: tokens, slot labels, intent label.
- Parameters
segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.
root (str, default '$MXNET_HOME/datasets/snips') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> snips = gluonnlp.data.SNIPSDataset(root='./datasets/snips')
-etc-
>>> len(snips)
13084
>>> len(snips[0])
3
>>> len(snips[1][0])
8
>>> snips[1][0]
['put', 'United', 'Abominations', 'onto', 'my', 'rare', 'groove', 'playlist']
>>> len(snips[1][1])
8
>>> snips[1][1][:5]
['O', 'B-entity_name', 'I-entity_name', 'O', 'B-playlist_owner']
>>> snips[1][2]
array([0], dtype=int32)
-
class
gluonnlp.data.
GlueCoLA
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_cola', return_all_fields=False)[source]¶ The Corpus of Linguistic Acceptability (Warstadt et al., 2018) consists of English acceptability judgments drawn from books and journal articles on linguistic theory.
Each example is a sequence of words annotated with whether it is a grammatical English sentence.
From https://gluebenchmark.com/tasks
- Parameters
Examples
>>> cola_dev = gluonnlp.data.GlueCoLA('dev', root='./datasets/cola')
-etc-
>>> len(cola_dev)
1043
>>> len(cola_dev[0])
2
>>> cola_dev[0]
['The sailors rode the breeze clear of the rocks.', '1']
>>> cola_test = gluonnlp.data.GlueCoLA('test', root='./datasets/cola')
-etc-
>>> len(cola_test)
1063
>>> len(cola_test[0])
1
>>> cola_test[0]
['Bill whistled past the house.']
-
class
gluonnlp.data.
GlueSST2
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_sst', return_all_fields=False)[source]¶ The Stanford Sentiment Treebank (Socher et al., 2013) consists of sentences from movie reviews and human annotations of their sentiment.
From https://gluebenchmark.com/tasks
- Parameters
Examples
>>> sst_dev = gluonnlp.data.GlueSST2('dev', root='./datasets/sst')
-etc-
>>> len(sst_dev)
872
>>> len(sst_dev[0])
2
>>> sst_dev[0]
["it 's a charming and often affecting journey . ", '1']
>>> sst_test = gluonnlp.data.GlueSST2('test', root='./datasets/sst')
-etc-
>>> len(sst_test)
1821
>>> len(sst_test[0])
1
>>> sst_test[0]
['uneasy mishmash of styles and genres .']
-
class
gluonnlp.data.
GlueSTSB
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_stsb', return_all_fields=False)[source]¶ The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data.
Each pair is human-annotated with a similarity score from 1 to 5.
From https://gluebenchmark.com/tasks
- Parameters
Examples
>>> stsb_dev = gluonnlp.data.GlueSTSB('dev', root='./datasets/stsb')
-etc-
>>> len(stsb_dev)
1500
>>> len(stsb_dev[0])
3
>>> stsb_dev[0]
['A man with a hard hat is dancing.', 'A man wearing a hard hat is dancing.', '5.000']
>>> stsb_test = gluonnlp.data.GlueSTSB('test', root='./datasets/stsb')
-etc-
>>> len(stsb_test)
1379
>>> len(stsb_test[0])
2
>>> stsb_test[0]
['A girl is styling her hair.', 'A girl is brushing her hair.']
-
class
gluonnlp.data.
GlueQQP
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_qqp', return_all_fields=False)[source]¶ The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora.
From https://gluebenchmark.com/tasks
- Parameters
Examples
>>> import warnings
>>> with warnings.catch_warnings():
...     # Ignore warnings triggered by invalid entries in GlueQQP dev set
...     warnings.simplefilter("ignore")
...     qqp_dev = gluonnlp.data.GlueQQP('dev', root='./datasets/qqp')
-etc-
>>> len(qqp_dev)
40430
>>> len(qqp_dev[0])
3
>>> qqp_dev[0]
['Why are African-Americans so beautiful?', 'Why are hispanics so beautiful?', '0']
>>> qqp_test = gluonnlp.data.GlueQQP('test', root='./datasets/qqp')
-etc-
>>> len(qqp_test)
390965
>>> len(qqp_test[3])
2
>>> qqp_test[3]
['Is it safe to invest in social trade biz?', 'Is social trade geniune?']
-
class
gluonnlp.data.
GlueRTE
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_rte', return_all_fields=False)[source]¶ The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (RTE1, RTE2, RTE3, and RTE5).
From https://gluebenchmark.com/tasks
- Parameters
Examples
>>> rte_dev = gluonnlp.data.GlueRTE('dev', root='./datasets/rte')
-etc-
>>> len(rte_dev)
277
>>> len(rte_dev[0])
3
>>> rte_dev[0]
['Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.', 'Christopher Reeve had an accident.', 'not_entailment']
>>> rte_test = gluonnlp.data.GlueRTE('test', root='./datasets/rte')
-etc-
>>> len(rte_test)
3000
>>> len(rte_test[16])
2
>>> rte_test[16]
['United failed to progress beyond the group stages of the Champions League and trail in the Premiership title race, sparking rumours over its future.', 'United won the Champions League.']
-
class
gluonnlp.data.
GlueMNLI
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_mnli', return_all_fields=False)[source]¶ The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a crowdsourced collection of sentence pairs with textual entailment annotations.
From https://gluebenchmark.com/tasks
- Parameters
segment ({'train', 'dev_matched', 'dev_mismatched', 'test_matched', 'test_mismatched'}, default 'train') – Dataset segment.
root (str, default '$MXNET_HOME/datasets/glue_mnli') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
return_all_fields (bool, default False) – Return all fields available in the dataset.
Examples
>>> mnli_dev = gluonnlp.data.GlueMNLI('dev_matched', root='./datasets/mnli')
-etc-
>>> len(mnli_dev)
9815
>>> len(mnli_dev[0])
3
>>> mnli_dev[0]
['The new rights are nice enough', 'Everyone really likes the newest benefits ', 'neutral']
>>> mnli_test = gluonnlp.data.GlueMNLI('test_matched', root='./datasets/mnli')
-etc-
>>> len(mnli_test)
9796
>>> len(mnli_test[0])
2
>>> mnli_test[0]
['Hierbas, ans seco, ans dulce, and frigola are just a few names worth keeping a look-out for.', 'Hierbas is a name worth looking out for.']
-
class
gluonnlp.data.
GlueQNLI
(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_qnli', return_all_fields=False)[source]¶ The Question-answering NLI dataset converted from Stanford Question Answering Dataset (Rajpurkar et al. 2016).
From https://gluebenchmark.com/tasks
- Parameters
segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.
root (str, default '$MXNET_HOME/datasets/glue_qnli') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
return_all_fields (bool, default False) – Return all fields available in the dataset.
Examples
>>> qnli_dev = gluonnlp.data.GlueQNLI('dev', root='./datasets/qnli')
-etc-
>>> len(qnli_dev)
5732
>>> len(qnli_dev[0])
3
>>> qnli_dev[0]
['Which NFL team represented the AFC at Super Bowl 50?', 'The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title.', 'entailment']
>>> qnli_test = gluonnlp.data.GlueQNLI('test', root='./datasets/qnli')
-etc-
>>> len(qnli_test)
5740
>>> len(qnli_test[0])
2
>>> qnli_test[0]
['What seldom used term of a unit of force equal to 1000 pound s of force?', 'Other arcane units of force include the sthène, which is equivalent to 1000 N, and the kip, which is equivalent to 1000 lbf.']
- class gluonnlp.data.GlueWNLI(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_wnli', return_all_fields=False)[source]¶
The Winograd NLI dataset converted from the dataset in Winograd Schema Challenge (Levesque et al., 2011).
From https://gluebenchmark.com/tasks
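- Parameters
segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.
root (str, default '$MXNET_HOME/datasets/glue_wnli') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
return_all_fields (bool, default False) – Return all fields available in the dataset.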
Examples
>>> wnli_dev = gluonnlp.data.GlueWNLI('dev', root='./datasets/wnli')
-etc-
>>> len(wnli_dev)
71
>>> len(wnli_dev[0])
3
>>> wnli_dev[0]
['The drain is clogged with hair. It has to be cleaned.', 'The hair has to be cleaned.', '0']
>>> wnli_test = gluonnlp.data.GlueWNLI('test', root='./datasets/wnli')
-etc-
>>> len(wnli_test)
146
>>> len(wnli_test[0])
2
>>> wnli_test[0]
['Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight.', 'Horses ran away when Maude and Dora came in sight.']
- class gluonnlp.data.GlueMRPC(segment='train', root='/var/lib/jenkins/.mxnet/datasets/glue_mrpc')[source]¶
The Microsoft Research Paraphrase Corpus dataset.
From https://gluebenchmark.com/tasks
- Parameters
segment ({'train', 'dev', 'test'}, default 'train') – Dataset segment.
root (str, default '$MXNET_HOME/datasets/glue_mrpc') – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> mrpc_dev = gluonnlp.data.GlueMRPC('dev', root='./datasets/mrpc')
-etc-
>>> len(mrpc_dev)
408
>>> len(mrpc_dev[0])
3
>>> mrpc_dev[0]
["He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .", '" The foodservice pie business does not fit our long-term growth strategy .', '1']
>>> mrpc_test = gluonnlp.data.GlueMRPC('test', root='./datasets/mrpc')
-etc-
>>> len(mrpc_test)
1725
>>> len(mrpc_test[0])
2
>>> mrpc_test[0]
["PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So .", 'Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .']
- class gluonnlp.data.SuperGlueRTE(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_rte')[source]¶
The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (RTE1, RTE2, RTE3 and RTE5).
From https://super.gluebenchmark.com/tasks
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
root (str, default "$MXNET_HOME/datasets/superglue_rte") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> rte_val = gluonnlp.data.SuperGlueRTE('val', root='./datasets/rte')
-etc-
>>> len(rte_val)
277
>>> sorted(rte_val[0].items())
[('hypothesis', 'Christopher Reeve had an accident.'), ('idx', 0), ('label', 'not_entailment'), ('premise', 'Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.')]
>>> rte_test = gluonnlp.data.SuperGlueRTE('test', root='./datasets/rte')
-etc-
>>> len(rte_test)
3000
>>> sorted(rte_test[0].items())
[('hypothesis', 'Shukla is related to Mangla.'), ('idx', 0), ('premise', "Mangla was summoned after Madhumita's sister Nidhi Shukla, who was the first witness in the case.")]
- class gluonnlp.data.SuperGlueCB(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_cb')[source]¶
The CommitmentBank (CB) is a corpus of short texts in which at least one sentence contains an embedded clause.
From https://super.gluebenchmark.com/tasks
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
root (str, default "$MXNET_HOME/datasets/superglue_cb") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> cb_val = gluonnlp.data.SuperGlueCB('val', root='./datasets/cb')
-etc-
>>> len(cb_val)
56
>>> sorted(cb_val[0].items())
[('hypothesis', 'Valence was helping'), ('idx', 0), ('label', 'contradiction'), ('premise', "Valence the void-brain, Valence the virtuous valet. Why couldn't the figger choose his own portion of titanic anatomy to shaft? Did he think he was helping?")]
>>> cb_test = gluonnlp.data.SuperGlueCB('test', root='./datasets/cb')
-etc-
>>> len(cb_test)
250
>>> sorted(cb_test[0].items())
[('hypothesis', 'Polly was not an experienced ocean sailor'), ('idx', 0), ('premise', 'Polly had to think quickly. They were still close enough to shore for him to return her to the police if she admitted she was not an experienced ocean sailor.')]
- class gluonnlp.data.SuperGlueWSC(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_wsc')[source]¶
The Winograd Schema Challenge (WSC) is a co-reference resolution dataset. (Levesque et al., 2012)
From https://super.gluebenchmark.com/tasks
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
root (str, default "$MXNET_HOME/datasets/superglue_wsc") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> wsc_val = gluonnlp.data.SuperGlueWSC('val', root='./datasets/wsc')
-etc-
>>> len(wsc_val)
104
>>> sorted(wsc_val[5].items())
[('idx', 5), ('label', True), ('target', OrderedDict([('span2_index', 9), ('span1_index', 6), ('span1_text', 'The table'), ('span2_text', 'it')])), ('text', 'The large ball crashed right through the table because it was made of styrofoam.')]
>>> wsc_test = gluonnlp.data.SuperGlueWSC('test', root='./datasets/wsc')
-etc-
>>> len(wsc_test)
146
>>> sorted(wsc_test[16].items())
[('idx', 16), ('target', OrderedDict([('span1_text', 'life'), ('span1_index', 1), ('span2_text', 'it'), ('span2_index', 21)])), ('text', 'Your life is yours and yours alone, and if the pain outweighs the benefit, you should have the option to end it .')]
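In the two records shown above, span1_index and span2_index line up with positions in the whitespace-split text, and the span text can differ from the passage in letter case ('The table' versus 'the table'). The sketch below checks that alignment on a plain dict copied from the output above. The helper `span_matches` is hypothetical, and treating the indices as whitespace-token offsets is an assumption inferred from these two examples only.

```python
def span_matches(text, span_index, span_text):
    """Check that span_text starts at whitespace-token position span_index.

    The comparison ignores letter case because the span text in the data
    may be capitalized differently from the passage.
    """
    tokens = text.split()
    span_tokens = span_text.split()
    window = tokens[span_index:span_index + len(span_tokens)]
    return [t.lower() for t in window] == [t.lower() for t in span_tokens]


# Record copied from the doctest output above (plain dict instead of OrderedDict).
example = {
    'text': 'The large ball crashed right through the table because it was '
            'made of styrofoam.',
    'target': {'span1_index': 6, 'span1_text': 'The table',
               'span2_index': 9, 'span2_text': 'it'},
}

target = example['target']
print(span_matches(example['text'], target['span1_index'], target['span1_text']))
print(span_matches(example['text'], target['span2_index'], target['span2_text']))
```

A span whose last token carries attached punctuation in the passage (for example 'alone,') would not match under this exact comparison, so a real preprocessing step would likely need to strip punctuation first.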
- class gluonnlp.data.SuperGlueWiC(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_wic')[source]¶
The Word-in-Context (WiC) is a word sense disambiguation dataset cast as binary classification of sentence pairs. (Pilehvar and Camacho-Collados, 2019)
From https://super.gluebenchmark.com/tasks
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
root (str, default "$MXNET_HOME/datasets/superglue_wic") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> wic_val = gluonnlp.data.SuperGlueWiC('val', root='./datasets/wic')
-etc-
>>> len(wic_val)
638
>>> sorted(wic_val[3].items())
[('end1', 31), ('end2', 35), ('idx', 3), ('label', True), ('sentence1', 'She gave her hair a quick brush.'), ('sentence2', 'The dentist recommended two brushes a day.'), ('start1', 26), ('start2', 28), ('version', 1.1), ('word', 'brush')]
>>> wic_test = gluonnlp.data.SuperGlueWiC('test', root='./datasets/wic')
-etc-
>>> len(wic_test)
1400
>>> sorted(wic_test[0].items())
[('end1', 46), ('end2', 22), ('idx', 0), ('sentence1', 'The smell of fried onions makes my mouth water.'), ('sentence2', 'His eyes were watering.'), ('start1', 41), ('start2', 14), ('version', 1.1), ('word', 'water')]
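In the records above, the start and end fields behave like character offsets into the corresponding sentence, with the end offset exclusive. The following sketch slices out the target word forms from a plain dict copied from the output above. The helper `target_forms` is hypothetical, and the offset interpretation is inferred from the examples shown rather than from a documented guarantee.

```python
def target_forms(example):
    """Return the surface forms of the target word in both sentences."""
    form1 = example['sentence1'][example['start1']:example['end1']]
    form2 = example['sentence2'][example['start2']:example['end2']]
    return form1, form2


# Record copied from the doctest output above.
example = {
    'word': 'brush',
    'sentence1': 'She gave her hair a quick brush.',
    'sentence2': 'The dentist recommended two brushes a day.',
    'start1': 26, 'end1': 31,
    'start2': 28, 'end2': 35,
    'label': True,
}

print(target_forms(example))
```

The second form is the inflected 'brushes', which is why the offsets are needed in addition to the lemma stored under 'word'.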
- class gluonnlp.data.SuperGlueCOPA(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_copa')[source]¶
The Choice of Plausible Alternatives (COPA) is a causal reasoning dataset. (Roemmele et al., 2011)
From https://super.gluebenchmark.com/tasks
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
root (str, default "$MXNET_HOME/datasets/superglue_copa") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> copa_val = gluonnlp.data.SuperGlueCOPA('val', root='./datasets/copa')
-etc-
>>> len(copa_val)
100
>>> sorted(copa_val[0].items())
[('choice1', 'The toilet filled with water.'), ('choice2', 'Water flowed from the spout.'), ('idx', 0), ('label', 1), ('premise', 'The man turned on the faucet.'), ('question', 'effect')]
>>> copa_test = gluonnlp.data.SuperGlueCOPA('test', root='./datasets/copa')
-etc-
>>> len(copa_test)
500
>>> sorted(copa_test[0].items())
[('choice1', 'It was fragile.'), ('choice2', 'It was small.'), ('idx', 0), ('premise', 'The item was packaged in bubble wrap.'), ('question', 'cause')]
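The validation record above pairs label 1 with the plausible effect stored under choice2, which suggests the label is a zero-based index over (choice1, choice2). The sketch below resolves the answer text under that assumption, using plain dicts copied from the output above. The helper `resolve_answer` is hypothetical and not part of gluonnlp.

```python
def resolve_answer(example):
    """Return the text of the labeled choice, or None for unlabeled examples.

    Assumes the label is a zero-based index: 0 -> choice1, 1 -> choice2.
    """
    if 'label' not in example:
        return None
    choices = (example['choice1'], example['choice2'])
    return choices[example['label']]


# Records copied from the doctest output above.
val_example = {
    'premise': 'The man turned on the faucet.',
    'question': 'effect',
    'choice1': 'The toilet filled with water.',
    'choice2': 'Water flowed from the spout.',
    'label': 1,
    'idx': 0,
}
test_example = {
    'premise': 'The item was packaged in bubble wrap.',
    'question': 'cause',
    'choice1': 'It was fragile.',
    'choice2': 'It was small.',
    'idx': 0,
}

print(resolve_answer(val_example))
print(resolve_answer(test_example))
```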
- class gluonnlp.data.SuperGlueMultiRC(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_multirc')[source]¶
Multi-Sentence Reading Comprehension (MultiRC) is a QA dataset. (Khashabi et al., 2018)
From https://super.gluebenchmark.com/tasks
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
root (str, default "$MXNET_HOME/datasets/superglue_multirc") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> multirc_val = gluonnlp.data.SuperGlueMultiRC('val', root='./datasets/multirc')
-etc-
>>> len(multirc_val)
83
>>> sorted(multirc_val[0].keys())
['questions', 'text']
>>> len(multirc_val[0]['text'])
12
>>> len(multirc_val[0]['questions'])
13
>>> sorted(multirc_val[0]['questions'][0].keys())
['answers', 'idx', 'multisent', 'question', 'sentences_used']
>>> multirc_test = gluonnlp.data.SuperGlueMultiRC('test', root='./datasets/multirc')
-etc-
>>> len(multirc_test)
166
>>> sorted(multirc_test[0].keys())
['questions', 'text']
>>> len(multirc_test[0]['text'])
14
>>> len(multirc_test[0]['questions'])
14
>>> sorted(multirc_test[0]['questions'][0].keys())
['answers', 'idx', 'multisent', 'question', 'sentences_used']
- class gluonnlp.data.SuperGlueBoolQ(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_boolq')[source]¶
Boolean Questions (BoolQ) is a QA dataset where each example consists of a short passage and a yes/no question about it.
From https://super.gluebenchmark.com/tasks
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
root (str, default "$MXNET_HOME/datasets/superglue_boolq") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> boolq_val = gluonnlp.data.SuperGlueBoolQ('val', root='./datasets/boolq')
-etc-
>>> len(boolq_val)
3270
>>> sorted(boolq_val[0].keys())
['idx', 'label', 'passage', 'question']
>>> boolq_test = gluonnlp.data.SuperGlueBoolQ('test', root='./datasets/boolq')
-etc-
>>> len(boolq_test)
3245
>>> sorted(boolq_test[0].keys())
['idx', 'passage', 'question']
- class gluonnlp.data.SuperGlueReCoRD(segment='train', root='/var/lib/jenkins/.mxnet/datasets/superglue_record')[source]¶
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a multiple-choice QA dataset.
From https://super.gluebenchmark.com/tasks
- Parameters
segment ({'train', 'val', 'test'}, default 'train') – Dataset segment.
root (str, default "$MXNET_HOME/datasets/superglue_record") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> record_val = gluonnlp.data.SuperGlueReCoRD('val', root='./datasets/record')
-etc-
>>> len(record_val)
7481
>>> sorted(record_val[0].keys())
['idx', 'passage', 'qas', 'source']
>>> record_test = gluonnlp.data.SuperGlueReCoRD('test', root='./datasets/record')
-etc-
>>> len(record_test)
7484
>>> sorted(record_test[0].keys())
['idx', 'passage', 'qas', 'source']
- class gluonnlp.data.SuperGlueAXb(root='/var/lib/jenkins/.mxnet/datasets/superglue_ax_b')[source]¶
The Broadcoverage Diagnostics (AX-b) is a diagnostics dataset labeled closely to the schema of MultiNLI.
From https://super.gluebenchmark.com/tasks
- Parameters
root (str, default "$MXNET_HOME/datasets/superglue_ax_b") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> ax_b = gluonnlp.data.SuperGlueAXb(root='./datasets/ax_b')
-etc-
>>> len(ax_b)
1104
>>> sorted(ax_b[0].items())
[('idx', '0'), ('label', 'not_entailment'), ('logic', 'Negation'), ('sentence1', 'The cat sat on the mat.'), ('sentence2', 'The cat did not sit on the mat.')]
- class gluonnlp.data.SuperGlueAXg(root='/var/lib/jenkins/.mxnet/datasets/superglue_ax_g')[source]¶
The Winogender Schema Diagnostics (AX-g) is a diagnostics dataset labeled closely to the schema of MultiNLI.
From https://super.gluebenchmark.com/tasks
- Parameters
root (str, default "$MXNET_HOME/datasets/superglue_ax_g") – Path to temp folder for storing data. MXNET_HOME defaults to ‘~/.mxnet’.
Examples
>>> ax_g = gluonnlp.data.SuperGlueAXg(root='./datasets/ax_g')
-etc-
>>> len(ax_g)
356
>>> sorted(ax_g[0].items())
[('hypothesis', 'The accountant sought help filing taxes.'), ('idx', 0), ('label', 'not_entailment'), ('pair_id', 551646), ('premise', 'The taxpayer met with the accountant to get help filing his taxes.')]
- class gluonnlp.data.DatasetLoader(file_patterns, file_sampler, dataset_fn=None, batch_sampler_fn=None, dataset_params=None, batch_sampler_params=None, batchify_fn=None, num_dataset_workers=0, num_batch_workers=0, pin_memory=False, circle_length=1, dataset_prefetch=None, batch_prefetch=None, dataset_cached=False, num_max_dataset_cached=0)[source]¶
Loads data from a list of datasets and returns mini-batches of data.
One dataset is loaded at a time.
- Parameters
file_patterns (str) – Path to the input text files.
file_sampler (str or gluon.data.Sampler, defaults to 'random') – The sampler used to sample a file. The following string values are supported:
'sequential': SequentialSampler
'random': RandomSampler
dataset_fn (DatasetFn, callable) – Callable object to generate a gluon.data.Dataset given a url.
batch_sampler_fn (SamplerFn, callable) – Callable object to generate a gluon.data.sampler.Sampler given a dataset.
dataset_params (dict, default is None) – Dictionary of parameters passed to dataset_fn.
batch_sampler_params (dict, default is None) – Dictionary of parameters passed to batch_sampler_fn.
batchify_fn (callable) – Callback function to allow users to specify how to merge samples into a batch. Defaults to default_batchify_fn:
    def default_batchify_fn(data):
        if isinstance(data[0], nd.NDArray):
            return nd.stack(*data)
        elif isinstance(data[0], tuple):
            data = zip(*data)
            return [default_batchify_fn(i) for i in data]
        else:
            data = np.asarray(data)
            return nd.array(data, dtype=data.dtype)
num_dataset_workers (int) – Number of worker processes for dataset creation.
num_batch_workers (int) – Number of worker processes for batch creation.
pin_memory (boolean, default False) – If True, the dataloader will copy NDArrays into pinned memory before returning them. Copying from CPU pinned memory to GPU is faster than from normal CPU memory. At the same time, it increases GPU memory.
circle_length (int, default is 1) – The number of files to be read at the same time. When circle_length is larger than 1, we merge circle_length number of files.
dataset_prefetch (int, default is num_dataset_workers) – The number of datasets to prefetch. It only takes effect if num_dataset_workers > 0. If dataset_prefetch > 0, worker processes are allowed to prefetch that many datasets before data is acquired from the iterators. A larger value gives smoother bootstrapping performance but consumes more memory. A smaller value may forfeit the purpose of using multiple worker processes; try reducing num_dataset_workers in this case. Defaults to num_dataset_workers.
batch_prefetch (int, default is num_batch_workers * 2) – The number of batches to prefetch. It only takes effect if num_batch_workers > 0. If batch_prefetch > 0, worker processes are allowed to prefetch that many batches before data is acquired from the iterators. A larger value gives smoother bootstrapping performance but consumes more shared memory. A smaller value may forfeit the purpose of using multiple worker processes; try reducing num_batch_workers in this case. Defaults to num_batch_workers * 2.
dataset_cached (bool, default is False) – Whether to cache the last processed dataset. Each processed dataset can be cached only once. When no new processed dataset is available to fetch, a cached processed dataset is popped instead.
num_max_dataset_cached (int, default is 0) – Maximum number of cached datasets. It is valid only if dataset_cached is True.
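The batchify_fn parameter above accepts any callable that merges a list of samples into one batch. As a rough illustration of that contract, the sketch below pads variable-length token-id lists to a common length and collects the labels. It uses plain Python lists so that it runs without MXNet; a callable passed to DatasetLoader would normally return NDArrays instead. The name `pad_batchify`, the `pad_value` argument, and the (token_ids, label) sample layout are assumptions made for this example.

```python
def pad_batchify(samples, pad_value=0):
    """Merge (token_ids, label) samples into a padded batch.

    Returns a pair: a list of equal-length token-id lists and a list of labels.
    """
    # Transpose the samples, as default_batchify_fn does for tuple samples.
    token_ids, labels = zip(*samples)
    max_len = max(len(ids) for ids in token_ids)
    padded = [list(ids) + [pad_value] * (max_len - len(ids))
              for ids in token_ids]
    return padded, list(labels)


samples = [([1, 2, 3], 0), ([4], 1), ([5, 6], 0)]
batch_ids, batch_labels = pad_batchify(samples)
print(batch_ids)
print(batch_labels)
```

Padding inside the batchify step keeps each batch only as wide as its longest sample, which is one common reason to replace the default function when samples vary in length.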