gluonnlp.data.batchify

Batchify functions can be used to transform a dataset into mini-batches that can be processed efficiently.

Batch Loaders

`Stack`	Stack the input data samples to construct the batch.
`Pad`	Return a callable that pads and stacks data.
`List`	Simply forward the list of input data.
`Tuple`	Wrap multiple batchify functions together.
`NamedTuple`	Wrap multiple batchify functions together and apply them to merge inputs from a namedtuple.
`Dict`	Wrap multiple batchify functions together and apply them to merge inputs from a dict.

Language Modeling

`CorpusBatchify`	Transform the dataset into N independent sequences, where N is the batch size.
`CorpusBPTTBatchify`	Transform the dataset into batches of numericalized samples such that, for each sample, the recurrent states from the last batch connect with the current batch.
`StreamBPTTBatchify`	Transform a Stream of CorpusDataset to BPTT batches.

Embedding Training

`EmbeddingCenterContextBatchify`	Helper to create batches of center and context words.

API Reference

Batchify helpers.

class gluonnlp.data.batchify.Stack(dtype=None)

Stack the input data samples to construct the batch.

The N input samples must have the same shape/length and will be stacked to construct a batch.

Parameters: dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.

Examples

>>> import mxnet as mx
>>> import numpy as np
>>> import gluonnlp.data.batchify as bf
>>> # Stack multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6, 8]
>>> c = [8, 9, 1, 2]
>>> bf.Stack()([a, b, c])

[[1 2 3 4]
 [4 5 6 8]
 [8 9 1 2]]
<NDArray 3x4 @cpu_shared(0)>
>>> # Stack multiple numpy.ndarrays
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> bf.Stack()([a, b])

[[[1 2 3 4]
  [5 6 7 8]]

 [[5 6 7 8]
  [1 2 3 4]]]
<NDArray 2x2x4 @cpu_shared(0)>
>>> # Stack multiple NDArrays
>>> a = mx.nd.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = mx.nd.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> bf.Stack()([a, b])

[[[1. 2. 3. 4.]
  [5. 6. 7. 8.]]

 [[5. 6. 7. 8.]
  [1. 2. 3. 4.]]]
<NDArray 2x2x4 @cpu_shared(0)>

__call__(data)

Batchify the input data

Parameters: data (list) – The input data samples
Returns: batch_data
Return type: NDArray

class gluonnlp.data.batchify.Pad(axis=0, pad_val=None, ret_length=False, dtype=None, round_to=None)

Return a callable that pads and stacks data.

Parameters

axis (int, default 0) – The axis to pad the arrays. The arrays will be padded to the largest dimension at axis. For example, assume the input arrays have shape (10, 8, 5), (6, 8, 5), (3, 8, 5) and the axis is 0. Each input will be padded into (10, 8, 5) and then stacked to form the final output, which has shape (3, 10, 8, 5).
pad_val (float or int, default 0) – The padding value.
ret_length (bool, default False) – Whether to return the valid length in the output.
dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.
round_to (int, default None) – If specified, the padded dimension will be rounded to a multiple of this argument.

Examples

>>> import numpy as np
>>> import gluonnlp.data.batchify as bf
>>> # Inputs are multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> bf.Pad(pad_val=0)([a, b, c])

[[1. 2. 3. 4.]
 [4. 5. 6. 0.]
 [8. 2. 0. 0.]]
<NDArray 3x4 @cpu_shared(0)>
>>> # Also output the lengths
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> batch, length = bf.Pad(pad_val=0, ret_length=True)([a, b, c])
>>> batch

[[1. 2. 3. 4.]
 [4. 5. 6. 0.]
 [8. 2. 0. 0.]]
<NDArray 3x4 @cpu_shared(0)>
>>> length

[4 3 2]
<NDArray 3 @cpu_shared(0)>
>>> # Inputs are multiple ndarrays
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 8], [1, 2]])
>>> bf.Pad(axis=1, pad_val=-1)([a, b])

[[[ 1  2  3  4]
  [ 5  6  7  8]]

 [[ 5  8 -1 -1]
  [ 1  2 -1 -1]]]
<NDArray 2x2x4 @cpu_shared(0)>

__call__(data)

Batchify the input data.

The input can be a list of numpy.ndarray, a list of numbers, or a list of mxnet.nd.NDArray. Passing mxnet.nd.NDArray is discouraged, as each array needs to be converted to numpy for efficient padding.

The arrays will be padded to the largest dimension at axis and then stacked to form the final output. In addition, the function will output the original dimensions at the axis if ret_length is turned on.

Parameters

data (List[np.ndarray] or List[List[dtype]] or List[mx.nd.NDArray]) – List of samples to pad and stack.

Returns

batch_data (NDArray) – Data in the minibatch. Shape is (N, …)
valid_length (NDArray, optional) – The sequences’ original lengths at the padded axis. Shape is (N,). This will only be returned if ret_length is True.
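
The padding rule described for Pad can be illustrated without MXNet. The following is a minimal pure-Python sketch for 1-D inputs, not the library's implementation: the helper name pad_and_stack is hypothetical, and rounding the padded length up to the next multiple of round_to is an assumption based on the parameter description.

```python
def pad_and_stack(samples, pad_val=0, round_to=None):
    """Pad 1-D lists to a common length and return (batch, valid_lengths)."""
    valid_lengths = [len(sample) for sample in samples]
    target_len = max(valid_lengths)
    if round_to is not None:
        # Assumption: the padded length is rounded up to a multiple of round_to.
        target_len = -(-target_len // round_to) * round_to
    batch = [list(sample) + [pad_val] * (target_len - len(sample))
             for sample in samples]
    return batch, valid_lengths


# Same inputs as the Pad examples: the longest sample sets the padded length.
batch, lengths = pad_and_stack([[1, 2, 3, 4], [4, 5, 6], [8, 2]], pad_val=0)
# With round_to=8 every row is padded to length 8 instead of 4.
rounded, _ = pad_and_stack([[1, 2, 3, 4], [4, 5, 6], [8, 2]], round_to=8)
```

Returning the original lengths alongside the padded batch mirrors what ret_length=True provides, so downstream code can mask out the padding.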

class gluonnlp.data.batchify.Tuple(fn, *args)

Wrap multiple batchify functions together. The input functions will be applied to the corresponding input fields.

Each data sample should be a list or tuple containing multiple attributes. The i-th batchify function stored in `Tuple` will be applied to the i-th attribute. For example, if each data sample is (nd_data, label), you can wrap two batchify functions using `Tuple(DataBatchify, LabelBatchify)` to batchify nd_data and label respectively.

Parameters

fn (list or tuple or callable) – The batchify functions to wrap.
*args (tuple of callable) – The additional batchify functions to wrap.

Examples

>>> import gluonnlp.data.batchify as bf
>>> a = ([1, 2, 3, 4], 0)
>>> b = ([5, 7], 1)
>>> c = ([1, 2, 3, 4, 5, 6, 7], 0)
>>> f1, f2 = bf.Tuple(bf.Pad(pad_val=0), bf.Stack())([a, b])
>>> f1

[[1. 2. 3. 4.]
 [5. 7. 0. 0.]]
<NDArray 2x4 @cpu_shared(0)>
>>> f2

[0 1]
<NDArray 2 @cpu_shared(0)>

__call__(data)

Batchify the input data.

Parameters: data (list) – The samples to batchify. Each sample should contain N attributes.
Returns: ret – A tuple of length N. Contains the batchified result of each attribute in the input.
Return type: tuple
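
The per-attribute dispatch that Tuple performs can be sketched in plain Python. This is an illustration of the idea only, not the library's code: batchify_fields and pad_to_longest are hypothetical stand-ins, and the stand-ins return lists rather than NDArrays.

```python
def batchify_fields(fns, samples):
    """Apply the i-th function to the i-th attribute of every sample."""
    if any(len(sample) != len(fns) for sample in samples):
        raise ValueError('every sample needs one attribute per function')
    # zip(*samples) regroups the samples attribute by attribute.
    columns = zip(*samples)
    return tuple(fn(list(column)) for fn, column in zip(fns, columns))


def pad_to_longest(seqs, pad_val=0):
    """Stand-in for a padding batchify function."""
    longest = max(len(seq) for seq in seqs)
    return [list(seq) + [pad_val] * (longest - len(seq)) for seq in seqs]


# The first attribute is padded, the second is simply collected.
data, labels = batchify_fields((pad_to_longest, list),
                               [([1, 2, 3, 4], 0), ([5, 7], 1)])
```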

class gluonnlp.data.batchify.List

Simply forward the list of input data.

This is particularly useful in conjunction with the Tuple batchify function when the Dataset contains textual data.

Examples

>>> import gluonnlp.data.batchify as bf
>>> a = ([1, 2, 3, 4], "I am using MXNet")
>>> b = ([5, 7, 2, 5], "Gluon rocks!")
>>> c = ([1, 2, 3, 4], "Batchification!")
>>> _, l = bf.Tuple(bf.Stack(), bf.List())([a, b, c])
>>> l
['I am using MXNet', 'Gluon rocks!', 'Batchification!']

__call__(data)

Parameters: data (List[~T]) – The list of samples
Returns: ret – The input list
Return type: list

class gluonnlp.data.batchify.NamedTuple(container, fn_info)

Wrap multiple batchify functions together and apply them to merge inputs from a namedtuple.

The generated batch samples are stored as a namedtuple with the same structure.

Each data sample should be a namedtuple. The i-th batchify function stored in `NamedTuple` will be applied to the i-th attribute of the namedtuple data. For example, if each data sample is Sample(data=nd_data, label=nd_label), you can wrap two batchify functions using `NamedTuple(Sample, {'data': DataBatchify, 'label': LabelBatchify})` to batchify nd_data and nd_label respectively. The result will be stored as a Sample object, and you can access the data and label via sample.data and sample.label.

Parameters

container (NamedTuple class) – The object that constructs the NamedTuple.
fn_info (Union[List[Callable], Tuple[Callable], Dict[AnyStr, Callable]]) – The information of the inner batchify functions.

Examples

>>> from gluonnlp.data.batchify import NamedTuple, Pad, Stack
>>> from collections import namedtuple
>>> SampleData = namedtuple('SampleData', ['data', 'label'])
>>> a = SampleData([1, 2, 3, 4], 0)
>>> b = SampleData([5, 7], 1)
>>> c = SampleData([1, 2, 3, 4, 5, 6, 7], 0)
>>> batchify_fn = NamedTuple(SampleData, {'data': Pad(pad_val=0), 'label': Stack()})
>>> sample = batchify_fn([a, b, c])
>>> sample
SampleData(data=
[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>, label=
[0 1 0]
<NDArray 3 @cpu_shared(0)>)
>>> sample.data

[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>
>>> # Alternatively, pass the batchify functions as a list
>>> batchify_fn = NamedTuple(SampleData, [Pad(pad_val=0), Stack()])
>>> batchify_fn([a, b, c])
SampleData(data=
[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>, label=
[0 1 0]
<NDArray 3 @cpu_shared(0)>)

__call__(data)

Batchify the input data.

Parameters: data (List of NamedTuple) – The samples to batchify. Each sample should be a NamedTuple.
Returns: ret – A namedtuple of length N. Contains the batchified result of each attribute in the input.
Return type: NamedTuple

class gluonnlp.data.batchify.Dict(fn_dict)

Wrap multiple batchify functions together and apply them to merge inputs from a dict.

The generated batch samples are stored as a dict with the same keywords.

Each data sample should be a dict, and the function stored under a given key will be applied to the inputs stored under the same key. For example, if each data sample is {'data': nd_data, 'label': nd_label}, you can use Dict({'data': DataBatchify, 'label': LabelBatchify}) to batchify nd_data and nd_label.

Parameters: fn_dict (Dict[AnyStr, Callable]) – A dictionary that contains the key -> batchify function mapping.

Examples

>>> from gluonnlp.data.batchify import Dict, Pad, Stack
>>> a = {'data': [1, 2, 3, 4], 'label': 0}
>>> b = {'data': [5, 7], 'label': 1}
>>> c = {'data': [1, 2, 3, 4, 5, 6, 7], 'label': 0}
>>> batchify_fn = Dict({'data': Pad(pad_val=0), 'label': Stack()})
>>> sample = batchify_fn([a, b, c])
>>> sample['data']

[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>
>>> sample['label']

[0 1 0]
<NDArray 3 @cpu_shared(0)>

__call__(data)

Parameters: data (List[Dict[~KT, ~VT]]) – The samples to batchify. Each sample should be a dictionary.
Returns: ret – The resulting dictionary that stores the merged samples.
Return type: dict
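
The key-wise merge can be pictured with a short pure-Python sketch. It is not the library's implementation: batchify_dicts is a hypothetical helper, and the built-ins list and tuple stand in for real batchify functions such as Pad or Stack.

```python
def batchify_dicts(fn_dict, samples):
    """Apply fn_dict[key] to the values stored under key in every sample."""
    return {key: fn([sample[key] for sample in samples])
            for key, fn in fn_dict.items()}


samples = [{'data': [1, 2], 'label': 0}, {'data': [3], 'label': 1}]
# Each key gets its own merge function; the output keeps the same keys.
merged = batchify_dicts({'data': list, 'label': tuple}, samples)
```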

class gluonnlp.data.batchify.CorpusBatchify(vocab, batch_size)

Transform the dataset into N independent sequences, where N is the batch size.

Parameters

vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
batch_size (int) – The number of samples in each batch.

__call__(data)

Batchify a dataset.

Parameters: data (mxnet.gluon.data.Dataset) – A flat dataset to be batchified.
Returns: An NDArray of shape (len(data) // N, N), where N is the batch_size, wrapped in a mxnet.gluon.data.SimpleDataset. Excess tokens that don’t align along the batches are discarded.
Return type: mxnet.gluon.data.Dataset
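
The documented output shape can be made concrete with a small sketch. It assumes that the N independent sequences are contiguous chunks of the flat corpus laid out as columns; the helper to_columns is hypothetical, and the vocabulary lookup that maps tokens to indices is omitted.

```python
def to_columns(tokens, batch_size):
    """Return rows of shape (len(tokens) // batch_size, batch_size)."""
    steps = len(tokens) // batch_size
    # Tokens beyond steps * batch_size do not fit and are dropped.
    columns = [tokens[j * steps:(j + 1) * steps] for j in range(batch_size)]
    # Row t holds the t-th token of each of the N sequences.
    return [[column[t] for column in columns] for t in range(steps)]


# 11 tokens with batch_size=3 give 3 rows; the trailing 'j' and 'k' are dropped.
rows = to_columns(list('abcdefghijk'), batch_size=3)
```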

class gluonnlp.data.batchify.CorpusBPTTBatchify(vocab, seq_len, batch_size, last_batch='keep')

Transform the dataset into batches of numericalized samples such that, for each sample, the recurrent states from the last batch connect with the current batch.

Each sample is of shape (seq_len, batch_size). When last_batch='keep', the first dimension of the last sample may be shorter than seq_len.

Parameters

vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).
batch_size (int) – The number of samples in each batch.
last_batch ({'keep', 'discard'}) –
How to handle the last batch if the remaining length is less than seq_len.
- keep: A batch with fewer samples than previous batches is returned. vocab.padding_token is used to pad the last batch based on batch size.
- discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).

__call__(corpus)

Batchify a dataset.

Parameters: corpus (mxnet.gluon.data.Dataset) – A flat dataset to be batchified.
Returns: Batches of numericalized samples such that, for each sample, the recurrent states from the last batch connect with the current batch. Each element of the Dataset is a tuple of size 2, specifying the data and label for BPTT respectively. Both items are of the same shape (seq_len, batch_size).
Return type: mxnet.gluon.data.Dataset
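
The (data, label) layout for BPTT can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the library's code: bptt_batches is a hypothetical helper, tokens are used directly instead of vocabulary indices, one token is held back so that every input has a next-token label, and an incomplete final window is dropped (comparable to last_batch='discard').

```python
def bptt_batches(tokens, seq_len, batch_size):
    """Split a flat token list into (data, label) windows for BPTT."""
    # Hold back one token so every input position has a next-token label.
    steps = (len(tokens) - 1) // batch_size
    data_cols = [tokens[j * steps:(j + 1) * steps]
                 for j in range(batch_size)]
    label_cols = [tokens[j * steps + 1:(j + 1) * steps + 1]
                  for j in range(batch_size)]
    batches = []
    # Consecutive windows continue each column where the previous one ended,
    # which is what lets the recurrent state carry over between batches.
    for start in range(0, steps - seq_len + 1, seq_len):
        window = range(start, start + seq_len)
        data = [[col[t] for col in data_cols] for t in window]
        label = [[col[t] for col in label_cols] for t in window]
        batches.append((data, label))
    return batches


batches = bptt_batches(list(range(13)), seq_len=3, batch_size=2)
```

In column 0 the first window ends at token 2 and the second window starts at token 3, so a hidden state kept per column stays aligned with its own sequence.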

class gluonnlp.data.batchify.StreamBPTTBatchify(vocab, seq_len, batch_size, sampler='random', last_batch='keep')

Transform a Stream of CorpusDataset to BPTT batches.

The corpus is transformed into batches of numericalized samples such that, for each sample, the recurrent states from the last batch connect with the current batch.

Each sample is of shape (seq_len, batch_size).

For example, the following 4 sequences:

a b c d <eos>
e f g h i j <eos>
k l m n <eos>
o <eos>

will generate 2 batches with seq_len = 5, batch_size = 2 as follows (transposed):

batch_0.data.T:

a b c d <eos>
e f g h i

batch_0.target.T:

b c d <eos> k
f g h i j

batch_1.data.T:

k l m n <eos>
j <eos> o <eos> <padding>

batch_1.target.T:

l m n <eos> <padding>
<eos> o <eos> <padding> <padding>

Parameters

vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).
batch_size (int) – The number of samples in each batch.
sampler (str, {'sequential', 'random'}, defaults to 'random') –
The sampler used to sample texts within a file.
- 'sequential': SequentialSampler
- 'random': RandomSampler
last_batch ({'keep', 'discard'}) –
How to handle the last batch if the remaining length is less than seq_len.
- keep: A batch with fewer samples than previous batches is returned.
- discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).

__call__(corpus)

Batchify a stream.

Parameters: corpus (nlp.data.DatasetStream) – A stream of un-flattened CorpusDataset.
Returns: Batches of numericalized samples such that, for each sample, the recurrent states from the last batch connect with the current batch. Each element of the stream is a tuple of data and label arrays for BPTT, each of shape (seq_len, batch_size).
Return type: nlp.data.DataStream

class gluonnlp.data.batchify.EmbeddingCenterContextBatchify(batch_size, window_size=5, reduce_window_size_randomly=True, shuffle=True, cbow=False, weight_dtype='float32', index_dtype='int64')

Helper to create batches of center and context words.

Batches are created lazily on an optionally shuffled version of the Dataset. To create batches from some corpus, first create an EmbeddingCenterContextBatchify object and then call it with the corpus. Please see the documentation of __call__ for more details.

Parameters

batch_size (int) – Maximum size of batches returned. Actual batch returned can be smaller when running out of samples.
window_size (int, default 5) – The maximum number of context elements to consider left and right of each center element. Fewer elements may be considered if there are not sufficient elements left / right of the center element or if a reduced window size was drawn.
reduce_window_size_randomly (bool, default True) – If True, randomly draw a reduced window size for every center element uniformly from [1, window_size].
shuffle (bool, default True) – If True, shuffle the sentences before lazily generating batches.
cbow (bool, default False) – Enable CBOW mode. In CBOW mode, the returned context contains multiple entries per row, one for each context word. If cbow is False (default), there is a separate row for each context word. The context data array always contains weights for the context words equal to 1 over the number of context words in the given row of the context array.
weight_dtype (numpy.dtype, default numpy.float32) – Data type for data array of sparse COO context representation.
index_dtype (numpy.dtype, default numpy.int64) –

__call__(corpus)

Batchify a dataset.

Parameters: corpus (list of sentences) – List of sentences. Any list containing, for example, integers or strings can be a sentence. Context samples do not cross sentence boundaries.
Returns: Each element of the DataStream is a tuple of 2 elements (center, context). center is a numpy.ndarray of shape (batch_size, ). context is a tuple of 3 numpy.ndarray, representing a sparse COO array (data, row, col). The center and context arrays contain the center and corresponding context words respectively. A sparse representation is used for context as the number of context words for one center word varies based on the randomly chosen context window size and sentence boundaries. The returned center and col arrays are of the same dtype as the sentence elements.
Return type: DataStream
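
The (center, context) layout described above can be sketched in pure Python. This is an illustration only, not the library's implementation: center_context is a hypothetical helper, it uses plain lists instead of numpy arrays, it keeps the window size fixed (as if reduce_window_size_randomly were False), and it skips shuffling and splitting into batches.

```python
def center_context(sentences, window_size, cbow=False):
    """Return (centers, (data, row, col)) with the context in COO form."""
    centers, data, row, col = [], [], [], []
    for sentence in sentences:
        for i, center in enumerate(sentence):
            # The window is clipped at the sentence boundaries.
            lo = max(0, i - window_size)
            hi = min(len(sentence), i + window_size + 1)
            contexts = [sentence[j] for j in range(lo, hi) if j != i]
            if not contexts:
                continue
            if cbow:
                # One row per center word; its contexts share that row.
                r = len(centers)
                centers.append(center)
                for word in contexts:
                    data.append(1.0 / len(contexts))
                    row.append(r)
                    col.append(word)
            else:
                # One row per (center, context) pair.
                for word in contexts:
                    data.append(1.0)
                    row.append(len(centers))
                    col.append(word)
                    centers.append(center)
    return centers, (data, row, col)


corpus = [[1, 2, 3], [4, 5]]
sg_centers, sg_context = center_context(corpus, window_size=1)
cbow_centers, cbow_context = center_context(corpus, window_size=1, cbow=True)
```

Words 3 and 4 are adjacent in the flattened corpus but never appear as a center and context pair, because the window stops at the end of each sentence.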