gluonnlp.data.batchify

Batchify functions can be used to transform a dataset into mini-batches that can be processed efficiently.

Batch Loaders

Stack

Stack the input data samples to construct the batch.

Pad

Return a callable that pads and stacks data.

List

Simply forward the list of input data.

Tuple

Wrap multiple batchify functions together.

NamedTuple

Wrap multiple batchify functions together and apply them to merge inputs from a namedtuple.

Dict

Wrap multiple batchify functions together and apply them to merge inputs from a dict.

Language Modeling

CorpusBatchify

Transform the dataset into N independent sequences, where N is the batch size.

CorpusBPTTBatchify

Transform the dataset into batches of numericalized samples such that, for each sample, the recurrent states from the last batch connect with the current batch.

StreamBPTTBatchify

Transform a Stream of CorpusDataset to BPTT batches.

Embedding Training

EmbeddingCenterContextBatchify

Helper to create batches of center and context words.

API Reference

Batchify helpers.

class gluonnlp.data.batchify.Stack(dtype=None)[source]

Stack the input data samples to construct the batch.

The N input samples must have the same shape/length and will be stacked to construct a batch.

Parameters

dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.

Examples

>>> import mxnet as mx
>>> import numpy as np
>>> import gluonnlp.data.batchify as bf
>>> # Stack multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6, 8]
>>> c = [8, 9, 1, 2]
>>> bf.Stack()([a, b, c])

[[1 2 3 4]
 [4 5 6 8]
 [8 9 1 2]]
<NDArray 3x4 @cpu_shared(0)>
>>> # Stack multiple numpy.ndarrays
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> bf.Stack()([a, b])

[[[1 2 3 4]
  [5 6 7 8]]

 [[5 6 7 8]
  [1 2 3 4]]]
<NDArray 2x2x4 @cpu_shared(0)>
>>> # Stack multiple NDArrays
>>> a = mx.nd.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = mx.nd.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> bf.Stack()([a, b])

[[[1. 2. 3. 4.]
  [5. 6. 7. 8.]]

 [[5. 6. 7. 8.]
  [1. 2. 3. 4.]]]
<NDArray 2x2x4 @cpu_shared(0)>
__call__(data)[source]

Batchify the input data.

Parameters

data (list) – The input data samples

Returns

batch_data

Return type

NDArray

class gluonnlp.data.batchify.Pad(axis=0, pad_val=None, ret_length=False, dtype=None, round_to=None)[source]

Return a callable that pads and stacks data.

Parameters
  • axis (int, default 0) – The axis along which to pad the arrays. The arrays will be padded to the largest size along axis. For example, assume the input arrays have shapes (10, 8, 5), (6, 8, 5), (3, 8, 5) and the axis is 0. Each input will be padded to (10, 8, 5) and then stacked to form the final output, which has shape (3, 10, 8, 5).

  • pad_val (float or int, default 0) – The padding value. The signature default of None is treated as 0.

  • ret_length (bool, default False) – Whether to return the valid length in the output.

  • dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.

  • round_to (int, default None) – If specified, the padded dimension will be rounded up to a multiple of this argument.

Examples

>>> import numpy as np
>>> import gluonnlp.data.batchify as bf
>>> # Inputs are multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> bf.Pad(pad_val=0)([a, b, c])

[[1. 2. 3. 4.]
 [4. 5. 6. 0.]
 [8. 2. 0. 0.]]
<NDArray 3x4 @cpu_shared(0)>
>>> # Also output the lengths
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> batch, length = bf.Pad(pad_val=0, ret_length=True)([a, b, c])
>>> batch

[[1. 2. 3. 4.]
 [4. 5. 6. 0.]
 [8. 2. 0. 0.]]
<NDArray 3x4 @cpu_shared(0)>
>>> length

[4 3 2]
<NDArray 3 @cpu_shared(0)>
>>> # Inputs are multiple ndarrays
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 8], [1, 2]])
>>> bf.Pad(axis=1, pad_val=-1)([a, b])

[[[ 1  2  3  4]
  [ 5  6  7  8]]

 [[ 5  8 -1 -1]
  [ 1  2 -1 -1]]]
<NDArray 2x2x4 @cpu_shared(0)>
__call__(data)[source]

Batchify the input data.

The input can be a list of numpy.ndarray, a list of numbers, or a list of mxnet.nd.NDArray. Passing mxnet.nd.NDArray is discouraged, as each array needs to be converted to numpy for efficient padding.

The arrays will be padded to the largest dimension at axis and then stacked to form the final output. In addition, the function will output the original dimensions at the axis if ret_length is turned on.

Parameters

data (List[np.ndarray] or List[List[dtype]] or List[mx.nd.NDArray]) – List of samples to pad and stack.

Returns

  • batch_data (NDArray) – Data in the minibatch. Shape is (N, …)

  • valid_length (NDArray, optional) – The sequences' original lengths at the padded axis. Shape is (N,). This will only be returned if ret_length is True.
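
The padding rule, including round_to, can be illustrated without MXNet. The following is a minimal pure-Python sketch for 1-D lists only; pad_and_stack is a hypothetical helper written for this illustration, not part of gluonnlp, and it assumes that round_to rounds the padded length up.

```python
def pad_and_stack(samples, pad_val=0, round_to=None):
    """Pad 1-D lists to a common length; return (batch, valid_lengths).

    Hypothetical helper for illustration only, not the gluonnlp code.
    """
    lengths = [len(s) for s in samples]
    target = max(lengths)
    if round_to is not None:
        # Assumption: the padded length is rounded UP to a multiple of round_to.
        target = ((target + round_to - 1) // round_to) * round_to
    batch = [list(s) + [pad_val] * (target - len(s)) for s in samples]
    return batch, lengths


batch, lengths = pad_and_stack([[1, 2, 3, 4], [4, 5, 6], [8, 2]], pad_val=0)
print(batch)    # [[1, 2, 3, 4], [4, 5, 6, 0], [8, 2, 0, 0]]
print(lengths)  # [4, 3, 2]

# With round_to=8 the longest sample (length 4) is padded to length 8.
rounded, _ = pad_and_stack([[1, 2, 3, 4], [8, 2]], pad_val=-1, round_to=8)
print(rounded[1])  # [8, 2, -1, -1, -1, -1, -1, -1]
```

Rounding the padded dimension to a fixed multiple limits the number of distinct batch shapes a model sees.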

class gluonnlp.data.batchify.Tuple(fn, *args)[source]

Wrap multiple batchify functions together. The input functions will be applied to the corresponding input fields.

Each data sample should be a list or tuple containing multiple attributes. The i-th batchify function stored in Tuple will be applied to the i-th attribute. For example, each data sample is (nd_data, label). You can wrap two batchify functions using Tuple(DataBatchify, LabelBatchify) to batchify nd_data and label correspondingly.

Parameters
  • fn (list or tuple or callable) – The batchify functions to wrap.

  • *args (tuple of callable) – The additional batchify functions to wrap.

Examples

>>> import gluonnlp.data.batchify as bf
>>> a = ([1, 2, 3, 4], 0)
>>> b = ([5, 7], 1)
>>> c = ([1, 2, 3, 4, 5, 6, 7], 0)
>>> f1, f2 = bf.Tuple(bf.Pad(pad_val=0), bf.Stack())([a, b])
>>> f1

[[1. 2. 3. 4.]
 [5. 7. 0. 0.]]
<NDArray 2x4 @cpu_shared(0)>
>>> f2

[0 1]
<NDArray 2 @cpu_shared(0)>
__call__(data)[source]

Batchify the input data.

Parameters

data (list) – The samples to batchify. Each sample should contain N attributes.

Returns

ret – A tuple of length N. Contains the batchified result of each attribute in the input.

Return type

tuple
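
The field-wise dispatch described above amounts to transposing the samples and applying one function per column. The sketch below shows the idea in plain Python; batchify_fields, pad_lists and identity are hypothetical names used only for this illustration and are not the library's implementation.

```python
def batchify_fields(fns, samples):
    """Apply fns[i] to the list of i-th attributes taken from every sample."""
    for sample in samples:
        if len(sample) != len(fns):
            raise ValueError("each sample needs one attribute per batchify function")
    # zip(*samples) turns [(x0, y0), (x1, y1)] into [(x0, x1), (y0, y1)].
    columns = zip(*samples)
    return tuple(fn(list(col)) for fn, col in zip(fns, columns))


def pad_lists(seqs):
    """Stand-in for a padding batchify function (pads with 0)."""
    width = max(len(s) for s in seqs)
    return [list(s) + [0] * (width - len(s)) for s in seqs]


def identity(items):
    """Stand-in for a batchify function that forwards its input."""
    return items


data, labels = batchify_fields([pad_lists, identity],
                               [([1, 2, 3, 4], 0), ([5, 7], 1)])
print(data)    # [[1, 2, 3, 4], [5, 7, 0, 0]]
print(labels)  # [0, 1]
```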

class gluonnlp.data.batchify.List[source]

Simply forward the list of input data.

This is particularly useful when the Dataset contains textual data and is used in conjunction with the Tuple batchify function.

Examples

>>> import gluonnlp.data.batchify as bf
>>> a = ([1, 2, 3, 4], "I am using MXNet")
>>> b = ([5, 7, 2, 5], "Gluon rocks!")
>>> c = ([1, 2, 3, 4], "Batchification!")
>>> _, l = bf.Tuple(bf.Stack(), bf.List())([a, b, c])
>>> l
['I am using MXNet', 'Gluon rocks!', 'Batchification!']
__call__(data)[source]
Parameters

data (List[~T]) – The list of samples

Returns

ret – The input list.

Return type

list

class gluonnlp.data.batchify.NamedTuple(container, fn_info)[source]

Wrap multiple batchify functions together and apply them to merge inputs from a namedtuple.

The generated batch samples are stored as a namedtuple with the same structure.

Each data sample should be a namedtuple. The i-th batchify function stored in NamedTuple will be applied to the i-th attribute of the namedtuple data. For example, each data sample is Sample(data=nd_data, label=nd_label). You can wrap two batchify functions using NamedTuple(Sample, {'data': DataBatchify, 'label': LabelBatchify}) to batchify nd_data and nd_label correspondingly. The result will be stored as a Sample object, and you can access the data and label via sample.data and sample.label, respectively.

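Parameters
  • container (namedtuple class) – The namedtuple type used to construct the output.

  • fn_info (dict or list or tuple) – The batchify functions to wrap, given either as a mapping from field name to function or as a sequence in field order.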

Examples

>>> from gluonnlp.data.batchify import NamedTuple, Pad, Stack
>>> from collections import namedtuple
>>> SampleData = namedtuple('SampleData', ['data', 'label'])
>>> a = SampleData([1, 2, 3, 4], 0)
>>> b = SampleData([5, 7], 1)
>>> c = SampleData([1, 2, 3, 4, 5, 6, 7], 0)
>>> batchify_fn = NamedTuple(SampleData, {'data': Pad(pad_val=0), 'label': Stack()})
>>> sample = batchify_fn([a, b, c])
>>> sample
SampleData(data=
[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>, label=
[0 1 0]
<NDArray 3 @cpu_shared(0)>)
>>> sample.data

[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>
>>> # Alternatively, pass the batchify functions as a list
>>> batchify_fn = NamedTuple(SampleData, [Pad(pad_val=0), Stack()])
>>> batchify_fn([a, b, c])
SampleData(data=
[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>, label=
[0 1 0]
<NDArray 3 @cpu_shared(0)>)
__call__(data)[source]

Batchify the input data.

Parameters

data (List of NamedTuple) – The samples to batchify. Each sample should be a NamedTuple.

Returns

ret – A namedtuple of length N. Contains the batchified result of each attribute in the input.

Return type

NamedTuple

class gluonnlp.data.batchify.Dict(fn_dict)[source]

Wrap multiple batchify functions together and apply them to merge inputs from a dict.

The generated batch samples are stored as a dict with the same keywords.

Each data sample should be a dict, and the function registered under a key is applied to the values stored under that key in the samples. For example, each data sample is {'data': nd_data, 'label': nd_label}. You can use Dict({'data': DataBatchify, 'label': LabelBatchify}) to batchify nd_data and nd_label.

Parameters

fn_dict (Dict[AnyStr, Callable]) – A dictionary that maps each key to its batchify function.

Examples

>>> from gluonnlp.data.batchify import Dict, Pad, Stack
>>> a = {'data': [1, 2, 3, 4], 'label': 0}
>>> b = {'data': [5, 7], 'label': 1}
>>> c = {'data': [1, 2, 3, 4, 5, 6, 7], 'label': 0}
>>> batchify_fn = Dict({'data': Pad(pad_val=0), 'label': Stack()})
>>> sample = batchify_fn([a, b, c])
>>> sample['data']

[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>
>>> sample['label']

[0 1 0]
<NDArray 3 @cpu_shared(0)>
__call__(data)[source]
Parameters

data (List[Dict[~KT, ~VT]]) – The samples to batchify. Each sample should be a dictionary

Returns

ret – The resulting dictionary that stores the merged samples.

Return type

dict

class gluonnlp.data.batchify.CorpusBatchify(vocab, batch_size)[source]

Transform the dataset into N independent sequences, where N is the batch size.

Parameters
  • vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.

  • batch_size (int) – The number of samples in each batch.

__call__(data)[source]

Batchify a dataset.

Parameters

data (mxnet.gluon.data.Dataset) – A flat dataset to be batchified.

Returns

An NDArray of shape (len(data) // N, N), where N is the batch_size, wrapped in a mxnet.gluon.data.SimpleDataset. Excess tokens that do not fill a complete row are discarded.

Return type

mxnet.gluon.data.Dataset
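
The reshaping can be pictured with plain lists. The sketch below assumes that each of the N independent sequences is a contiguous chunk of the corpus and ends up as one column of the result; corpus_batchify is a hypothetical helper for illustration and works on already numericalized tokens, not on a gluonnlp.Vocab.

```python
def corpus_batchify(token_ids, batch_size):
    """Arrange a flat token list as rows of shape (len // N, N)."""
    sample_len = len(token_ids) // batch_size
    # Tokens that do not fit into batch_size equal-length chunks are dropped.
    usable = token_ids[:sample_len * batch_size]
    # Assumption: column j holds the j-th contiguous chunk of the corpus.
    columns = [usable[j * sample_len:(j + 1) * sample_len]
               for j in range(batch_size)]
    # Transpose so that each row contains one time step for all N sequences.
    return [list(row) for row in zip(*columns)]


rows = corpus_batchify(list(range(10)), batch_size=3)
print(rows)  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]; token 9 is discarded
```

Reading down a column recovers one uninterrupted stretch of the corpus, which is what lets each position in the batch be treated as an independent sequence.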

class gluonnlp.data.batchify.CorpusBPTTBatchify(vocab, seq_len, batch_size, last_batch='keep')[source]

Transform the dataset into batches of numericalized samples such that, for each sample, the recurrent states from the last batch connect with the current batch.

Each sample is of shape (seq_len, batch_size). When last_batch='keep', the first dimension of the last sample may be shorter than seq_len.

Parameters
  • vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.

  • seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).

  • batch_size (int) – The number of samples in each batch.

  • last_batch ({'keep', 'discard'}) –

    How to handle the last batch if the remaining length is less than seq_len.

    • keep: A batch with fewer samples than previous batches is returned. vocab.padding_token is used to pad the last batch based on batch size.

    • discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).

__call__(corpus)[source]

Batchify a dataset.

Parameters

corpus (mxnet.gluon.data.Dataset) – A flat dataset to be batchified.

Returns

Batches of numericalized samples such that the recurrent states from the last batch connect with the current batch for each sample. Each element of the Dataset is a tuple of size 2, specifying the data and label for BPTT respectively. Both items have the same shape (seq_len, batch_size).

Return type

mxnet.gluon.data.Dataset
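
How the data and label arrays relate, and why consecutive batches connect, can be seen in a simplified sketch. It works on already numericalized tokens, covers only the case where incomplete trailing batches are dropped, and assumes the same column layout as the CorpusBatchify description; bptt_batches is a hypothetical helper, not the library code.

```python
def bptt_batches(token_ids, seq_len, batch_size):
    """Return a list of (data, target) pairs, each of shape (seq_len, batch_size)."""
    sample_len = len(token_ids) // batch_size
    # Assumption: column j is the j-th contiguous chunk of the corpus.
    columns = [token_ids[j * sample_len:(j + 1) * sample_len]
               for j in range(batch_size)]
    batches = []
    start = 0
    # Each batch needs seq_len inputs plus one extra token for the shifted target.
    while start + seq_len + 1 <= sample_len:
        data = [[col[start + t] for col in columns] for t in range(seq_len)]
        target = [[col[start + t + 1] for col in columns] for t in range(seq_len)]
        batches.append((data, target))
        start += seq_len
    return batches


batches = bptt_batches(list(range(14)), seq_len=3, batch_size=2)
first_data, first_target = batches[0]
second_data, _ = batches[1]
print(first_data)      # [[0, 7], [1, 8], [2, 9]]
print(first_target)    # [[1, 8], [2, 9], [3, 10]]
print(second_data[0])  # [3, 10], the same tokens as the last target row above
```

Because the second batch starts exactly where the first one's targets end, the hidden state produced at the end of one batch is the correct starting state for the next.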

class gluonnlp.data.batchify.StreamBPTTBatchify(vocab, seq_len, batch_size, sampler='random', last_batch='keep')[source]

Transform a Stream of CorpusDataset to BPTT batches.

The corpus is transformed into batches of numericalized samples such that, for each sample, the recurrent states from the last batch connect with the current batch.

Each sample is of shape (seq_len, batch_size).

For example, the following 4 sequences:

a b c d <eos>
e f g h i j <eos>
k l m n <eos>
o <eos>

will generate 2 batches with seq_len = 5, batch_size = 2 as follows (transposed):

batch_0.data.T:

a b c d <eos>
e f g h i

batch_0.target.T:

b c d <eos> k
f g h i j

batch_1.data.T:

k l m n <eos>
j <eos> o <eos> <padding>

batch_1.target.T:

l m n <eos> <padding>
<eos> o <eos> <padding> <padding>
Parameters
  • vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.

  • seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).

  • batch_size (int) – The number of samples in each batch.

  • sampler (str, {'sequential', 'random'}, defaults to 'random') –

    The sampler used to sample texts within a file.

    • 'sequential': SequentialSampler

    • 'random': RandomSampler

  • last_batch ({'keep', 'discard'}) –

    How to handle the last batch if the remaining length is less than seq_len.

    • keep: A batch with fewer samples than previous batches is returned.

    • discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).

__call__(corpus)[source]

Batchify a stream.

Parameters

corpus (nlp.data.DatasetStream) – A stream of un-flattened CorpusDataset.

Returns

Batches of numericalized samples such that the recurrent states from the last batch connect with the current batch for each sample. Each element of the stream is a tuple of data and label arrays for BPTT, each of shape (seq_len, batch_size).

Return type

nlp.data.DataStream
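
The four-sequence example in the class description can be reproduced with a short pure-Python sketch. It assumes that sequences are dealt to the batch rows round-robin and in order (so no random sampling), which matches the illustrated layout but is not confirmed to be how the library assigns them. stream_bptt is a hypothetical helper, and it returns the transposed (batch-major) view used in the example.

```python
from itertools import chain


def stream_bptt(sequences, seq_len, batch_size, pad="<padding>"):
    """Return (data, target) pairs in the transposed, batch-major layout."""
    # Assumption: sequence k is appended to row k % batch_size.
    rows = [list(chain.from_iterable(sequences[r::batch_size]))
            for r in range(batch_size)]
    longest = max(len(row) for row in rows)
    batches = []
    for start in range(0, longest, seq_len):
        data, target = [], []
        for row in rows:
            d = row[start:start + seq_len]
            t = row[start + 1:start + seq_len + 1]  # inputs shifted by one token
            data.append(d + [pad] * (seq_len - len(d)))
            target.append(t + [pad] * (seq_len - len(t)))
        batches.append((data, target))
    return batches


sequences = [
    "a b c d <eos>".split(),
    "e f g h i j <eos>".split(),
    "k l m n <eos>".split(),
    "o <eos>".split(),
]
for data, target in stream_bptt(sequences, seq_len=5, batch_size=2):
    print(data)
    print(target)
```

Running it yields two batches whose rows match batch_0 and batch_1 in the example, including the padding at the end of the second batch.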

class gluonnlp.data.batchify.EmbeddingCenterContextBatchify(batch_size, window_size=5, reduce_window_size_randomly=True, shuffle=True, cbow=False, weight_dtype='float32', index_dtype='int64')[source]

Helper to create batches of center and contexts words.

Batches are created lazily on an optionally shuffled version of the Dataset. To create batches from some corpus, first create an EmbeddingCenterContextBatchify object and then call it with the corpus. Please see the documentation of __call__ for more details.

Parameters
  • batch_size (int) – Maximum size of batches returned. Actual batch returned can be smaller when running out of samples.

  • window_size (int, default 5) – The maximum number of context elements to consider left and right of each center element. Fewer elements may be considered if there are not enough elements to the left or right of the center element or if a reduced window size was drawn.

  • reduce_window_size_randomly (bool, default True) – If True, randomly draw a reduced window size for every center element uniformly from [1, window_size].

  • shuffle (bool, default True) – If True, shuffle the sentences before lazily generating batches.

  • cbow (bool, default False) – Enable CBOW mode. In CBOW mode, each row of the returned context contains multiple entries, one for each context word. If cbow is False (default), there is a separate row for each context word. The context_data array always contains weights for the context words equal to 1 over the number of context words in the given row of the context array.

  • weight_dtype (numpy.dtype, default numpy.float32) – Data type for data array of sparse COO context representation.

  • index_dtype (numpy.dtype, default numpy.int64) – Data type for the index arrays of the sparse COO context representation.

__call__(corpus)[source]

Batchify a dataset.

Parameters

corpus (list of sentences) – List of sentences. Any list containing, for example, integers or strings can be a sentence. Context samples do not cross sentence boundaries.

Returns

Each element of the DataStream is a tuple of 2 elements (center, context). center is a numpy.ndarray of shape (batch_size, ). context is a tuple of 3 numpy.ndarray, representing a sparse COO array (data, row, col). The center and context arrays contain the center and corresponding context words respectively. A sparse representation is used for context as the number of context words for one center word varies based on the randomly chosen context window size and sentence boundaries. The returned center and col arrays are of the same dtype as the sentence elements.

Return type

DataStream
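
The relationship between center words, the (data, row, col) triple and the cbow flag can be sketched in plain Python. The sketch below uses a fixed window with no random reduction, no shuffling and no splitting into batches, and returns lists rather than numpy arrays; center_context is a hypothetical helper written for this illustration, not the library implementation.

```python
def center_context(sentences, window_size=2, cbow=False):
    """Return (centers, (data, row, col)) for all sentences at once."""
    centers, data, row, col = [], [], [], []
    for sentence in sentences:
        for i, center in enumerate(sentence):
            lo = max(0, i - window_size)
            hi = min(len(sentence), i + window_size + 1)
            # Contexts never cross the sentence boundary.
            contexts = [sentence[j] for j in range(lo, hi) if j != i]
            if not contexts:
                continue  # a one-token sentence has nothing to pair with
            if cbow:
                # One row per center; weights in that row sum to 1.
                r = len(centers)
                centers.append(center)
                for c in contexts:
                    data.append(1.0 / len(contexts))
                    row.append(r)
                    col.append(c)
            else:
                # Skip-gram style: one row per (center, context) pair.
                for c in contexts:
                    row.append(len(centers))
                    centers.append(center)
                    data.append(1.0)
                    col.append(c)
    return centers, (data, row, col)


centers, (data, row, col) = center_context([[0, 1, 2]], window_size=1, cbow=True)
print(centers)  # [0, 1, 2]
print(data)     # [1.0, 0.5, 0.5, 1.0]
print(row)      # [0, 1, 1, 2]
print(col)      # [1, 0, 2, 1]
```

In both modes the weights equal 1 over the number of context words in the row, so every skip-gram row, which holds a single context, has weight 1.0.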