gluonnlp.data.batchify
Batchify functions can be used to transform a dataset into mini-batches that can be processed efficiently.
Batch Loaders
Stack | Stack the input data samples to construct the batch.
Pad | Return a callable that pads and stacks data.
List | Simply forward the list of input data.
Tuple | Wrap multiple batchify functions together.
NamedTuple | Wrap multiple batchify functions together and apply them to merge inputs from a namedtuple.
Dict | Wrap multiple batchify functions together and apply them to merge inputs from a dict.
Language Modeling
CorpusBatchify | Transform the dataset into N independent sequences, where N is the batch size.
CorpusBPTTBatchify | Transform the dataset into batches of numericalized samples, such that the recurrent states from the last batch connect with the current batch for each sample.
StreamBPTTBatchify | Transform a Stream of CorpusDataset to BPTT batches.
Embedding Training
EmbeddingCenterContextBatchify | Helper to create batches of center and context words.
API Reference
Batchify helpers.
class gluonnlp.data.batchify.Stack(dtype=None)
Stack the input data samples to construct the batch.
The N input samples must have the same shape/length and will be stacked to construct a batch.
- Parameters
dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.
Examples
>>> import numpy as np
>>> import mxnet as mx
>>> import gluonnlp.data.batchify as bf
>>> # Stack multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6, 8]
>>> c = [8, 9, 1, 2]
>>> bf.Stack()([a, b, c])
[[1 2 3 4]
 [4 5 6 8]
 [8 9 1 2]]
<NDArray 3x4 @cpu_shared(0)>
>>> # Stack multiple numpy.ndarrays
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> bf.Stack()([a, b])
[[[1 2 3 4]
  [5 6 7 8]]

 [[5 6 7 8]
  [1 2 3 4]]]
<NDArray 2x2x4 @cpu_shared(0)>
>>> # Stack multiple NDArrays
>>> a = mx.nd.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = mx.nd.array([[5, 6, 7, 8], [1, 2, 3, 4]])
>>> bf.Stack()([a, b])
[[[1. 2. 3. 4.]
  [5. 6. 7. 8.]]

 [[5. 6. 7. 8.]
  [1. 2. 3. 4.]]]
<NDArray 2x2x4 @cpu_shared(0)>
class gluonnlp.data.batchify.Pad(axis=0, pad_val=None, ret_length=False, dtype=None, round_to=None)
Return a callable that pads and stacks data.
- Parameters
axis (int, default 0) – The axis along which to pad the arrays. The arrays will be padded to the largest dimension at axis. For example, assume the input arrays have shapes (10, 8, 5), (6, 8, 5), (3, 8, 5) and the axis is 0. Each input will be padded to (10, 8, 5) and then stacked to form the final output, which has shape (3, 10, 8, 5).
ret_length (bool, default False) – Whether to return the valid length in the output.
dtype (str or numpy.dtype, default None) – The value type of the output. If it is set to None, the input data type is used.
round_to (int, default None) – If specified, the padded dimension will be rounded to a multiple of this argument.
Examples
>>> import numpy as np
>>> import gluonnlp.data.batchify as bf
>>> # Inputs are multiple lists
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> bf.Pad(pad_val=0)([a, b, c])
[[1. 2. 3. 4.]
 [4. 5. 6. 0.]
 [8. 2. 0. 0.]]
<NDArray 3x4 @cpu_shared(0)>
>>> # Also output the lengths
>>> a = [1, 2, 3, 4]
>>> b = [4, 5, 6]
>>> c = [8, 2]
>>> batch, length = bf.Pad(pad_val=0, ret_length=True)([a, b, c])
>>> batch
[[1. 2. 3. 4.]
 [4. 5. 6. 0.]
 [8. 2. 0. 0.]]
<NDArray 3x4 @cpu_shared(0)>
>>> length
[4 3 2]
<NDArray 3 @cpu_shared(0)>
>>> # Inputs are multiple ndarrays
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
>>> b = np.array([[5, 8], [1, 2]])
>>> bf.Pad(axis=1, pad_val=-1)([a, b])
[[[ 1  2  3  4]
  [ 5  6  7  8]]

 [[ 5  8 -1 -1]
  [ 1  2 -1 -1]]]
<NDArray 2x2x4 @cpu_shared(0)>
__call__(data)
Batchify the input data.
The input can be a list of numpy.ndarray, a list of numbers, or a list of mxnet.nd.NDArray. Passing mxnet.nd.NDArray inputs is discouraged because each array needs to be converted to numpy for efficient padding.
The arrays will be padded to the largest dimension at axis and then stacked to form the final output. In addition, the function will output the original dimensions at the axis if ret_length is turned on.
- Parameters
data (List[np.ndarray] or List[List[dtype]] or List[mx.nd.NDArray]) – List of samples to pad and stack.
- Returns
batch_data (NDArray) – Data in the minibatch. Shape is (N, …)
valid_length (NDArray, optional) – The sequences’ original lengths at the padded axis. Shape is (N,). This is only returned if ret_length is True.
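The pad-then-stack behaviour described above can be approximated in plain NumPy. This is a minimal sketch and not the library's implementation: the helper name pad_and_stack is hypothetical, it returns NumPy arrays rather than NDArrays, and it assumes that round_to rounds the padded length up to the next multiple.

```python
import numpy as np


def pad_and_stack(samples, axis=0, pad_val=0, round_to=None):
    """Hypothetical helper: pad each sample along `axis`, then stack them."""
    arrays = [np.asarray(s) for s in samples]
    lengths = [a.shape[axis] for a in arrays]
    max_len = max(lengths)
    if round_to is not None:
        # Assumption: round the padded length up to a multiple of round_to.
        max_len = round_to * -(-max_len // round_to)
    padded = []
    for a in arrays:
        # Pad only at the end of the chosen axis.
        pad_width = [(0, 0)] * a.ndim
        pad_width[axis] = (0, max_len - a.shape[axis])
        padded.append(np.pad(a, pad_width, mode="constant", constant_values=pad_val))
    # The second return value plays the role of valid_length.
    return np.stack(padded), np.array(lengths)


batch, valid_length = pad_and_stack([[1, 2, 3, 4], [4, 5, 6], [8, 2]], pad_val=0)
```

Here batch has shape (3, 4) and valid_length records the original lengths, mirroring what Pad returns when ret_length is True.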
class gluonnlp.data.batchify.Tuple(fn, *args)
Wrap multiple batchify functions together. The input functions will be applied to the corresponding input fields.
Each data sample should be a list or tuple containing multiple attributes. The i-th batchify function stored in Tuple will be applied to the i-th attribute. For example, if each data sample is (nd_data, label), you can wrap two batchify functions using Tuple(DataBatchify, LabelBatchify) to batchify nd_data and label respectively.
- Parameters
Examples
>>> import gluonnlp.data.batchify as bf
>>> a = ([1, 2, 3, 4], 0)
>>> b = ([5, 7], 1)
>>> c = ([1, 2, 3, 4, 5, 6, 7], 0)
>>> f1, f2 = bf.Tuple(bf.Pad(pad_val=0), bf.Stack())([a, b])
>>> f1
[[1. 2. 3. 4.]
 [5. 7. 0. 0.]]
<NDArray 2x4 @cpu_shared(0)>
>>> f2
[0 1]
<NDArray 2 @cpu_shared(0)>
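The field-wise dispatch that Tuple performs can be pictured with a few lines of plain Python. The sketch below is illustrative only: apply_per_field and pad_lists are hypothetical stand-ins, not part of gluonnlp, and they operate on Python lists instead of NDArrays.

```python
def pad_lists(seqs, pad_val=0):
    """Hypothetical stand-in for a padding batchify function."""
    width = max(len(s) for s in seqs)
    return [list(s) + [pad_val] * (width - len(s)) for s in seqs]


def apply_per_field(fns, samples):
    """Hypothetical helper: apply the i-th function to the i-th field of all samples."""
    columns = list(zip(*samples))  # regroup the samples field by field
    if len(columns) != len(fns):
        raise ValueError("number of fields and number of functions must match")
    return tuple(fn(list(col)) for fn, col in zip(fns, columns))


samples = [([1, 2, 3, 4], 0), ([5, 7], 1)]
tokens, labels = apply_per_field((pad_lists, list), samples)
```

The first function only ever sees the token lists and the second only the labels, which is the property that lets different fields use different batching strategies.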
class gluonnlp.data.batchify.List
Simply forward the list of input data.
This is particularly useful when the Dataset contains textual data, and in conjunction with the Tuple batchify function.
Examples
>>> import gluonnlp.data.batchify as bf
>>> a = ([1, 2, 3, 4], "I am using MXNet")
>>> b = ([5, 7, 2, 5], "Gluon rocks!")
>>> c = ([1, 2, 3, 4], "Batchification!")
>>> _, l = bf.Tuple(bf.Stack(), bf.List())([a, b, c])
>>> l
['I am using MXNet', 'Gluon rocks!', 'Batchification!']
class gluonnlp.data.batchify.NamedTuple(container, fn_info)
Wrap multiple batchify functions together and apply them to merge inputs from a namedtuple.
The generated batch samples are stored as a namedtuple with the same structure.
Each data sample should be a namedtuple. The i-th batchify function stored in NamedTuple will be applied to the i-th attribute of the namedtuple data. For example, if each data sample is Sample(data=nd_data, label=nd_label), you can wrap two batchify functions using NamedTuple(Sample, {'data': DataBatchify, 'label': LabelBatchify}) to batchify nd_data and nd_label respectively. The result is stored as a Sample object, and you can access the data and label via sample.data and sample.label.
- Parameters
Examples
>>> from gluonnlp.data.batchify import NamedTuple, Pad, Stack
>>> from collections import namedtuple
>>> SampleData = namedtuple('SampleData', ['data', 'label'])
>>> a = SampleData([1, 2, 3, 4], 0)
>>> b = SampleData([5, 7], 1)
>>> c = SampleData([1, 2, 3, 4, 5, 6, 7], 0)
>>> batchify_fn = NamedTuple(SampleData, {'data': Pad(pad_val=0), 'label': Stack()})
>>> sample = batchify_fn([a, b, c])
>>> sample
SampleData(data=
[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>, label=
[0 1 0]
<NDArray 3 @cpu_shared(0)>)
>>> sample.data
[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>
>>> # The batchify functions can also be passed as a list
>>> batchify_fn = NamedTuple(SampleData, [Pad(pad_val=0), Stack()])
>>> batchify_fn([a, b, c])
SampleData(data=
[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>, label=
[0 1 0]
<NDArray 3 @cpu_shared(0)>)
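As the examples show, the batchify functions may be given either as a dict keyed by field name or as a list in field order. A rough sketch of how both forms could be resolved is given below; batchify_namedtuples and pad_lists are hypothetical names used for illustration, and the real class may validate its arguments differently.

```python
from collections import namedtuple


def pad_lists(seqs, pad_val=0):
    """Hypothetical stand-in for a padding batchify function."""
    width = max(len(s) for s in seqs)
    return [list(s) + [pad_val] * (width - len(s)) for s in seqs]


def batchify_namedtuples(container, fn_info, samples):
    """Hypothetical helper: merge namedtuple samples into one namedtuple batch."""
    if isinstance(fn_info, dict):
        # Order the functions by the namedtuple's declared field order.
        fns = [fn_info[field] for field in container._fields]
    else:
        fns = list(fn_info)
    columns = zip(*samples)  # one column per namedtuple field
    return container(*[fn(list(col)) for fn, col in zip(fns, columns)])


Sample = namedtuple("Sample", ["data", "label"])
samples = [Sample([1, 2, 3, 4], 0), Sample([5, 7], 1)]
by_name = batchify_namedtuples(Sample, {"label": list, "data": pad_lists}, samples)
by_position = batchify_namedtuples(Sample, [pad_lists, list], samples)
```

Both calls produce the same Sample batch, because the dict form is reordered to match the field order before the functions are applied.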
class gluonnlp.data.batchify.Dict(fn_dict)
Wrap multiple batchify functions together and apply them to merge inputs from a dict.
The generated batch samples are stored as a dict with the same keywords.
Each data sample should be a dict, and the function stored under a given key is applied to the inputs stored under that key. For example, if each data sample is {'data': nd_data, 'label': nd_label}, you can use Dict({'data': DataBatchify, 'label': LabelBatchify}) to batchify nd_data and nd_label.
- Parameters
fn_dict (Dict[AnyStr, Callable]) – A dictionary that contains the key -> batchify function mapping.
Examples
>>> from gluonnlp.data.batchify import Dict, Pad, Stack
>>> a = {'data': [1, 2, 3, 4], 'label': 0}
>>> b = {'data': [5, 7], 'label': 1}
>>> c = {'data': [1, 2, 3, 4, 5, 6, 7], 'label': 0}
>>> batchify_fn = Dict({'data': Pad(pad_val=0), 'label': Stack()})
>>> sample = batchify_fn([a, b, c])
>>> sample['data']
[[1. 2. 3. 4. 0. 0. 0.]
 [5. 7. 0. 0. 0. 0. 0.]
 [1. 2. 3. 4. 5. 6. 7.]]
<NDArray 3x7 @cpu_shared(0)>
>>> sample['label']
[0 1 0]
<NDArray 3 @cpu_shared(0)>
class gluonnlp.data.batchify.CorpusBatchify(vocab, batch_size)
Transform the dataset into N independent sequences, where N is the batch size.
- Parameters
vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
batch_size (int) – The number of samples in each batch.
__call__(data)
Batchify a dataset.
- Parameters
data (mxnet.gluon.data.Dataset) – A flat dataset to be batchified.
- Returns
An NDArray of shape (len(data) // N, N), where N is the batch_size, wrapped in a mxnet.gluon.data.SimpleDataset. Excess tokens that do not fill a complete row are discarded.
- Return type
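The (len(data) // N, N) layout described above can be illustrated with NumPy. This is a sketch under stated assumptions, not the library code: to_independent_sequences is a hypothetical helper, the tokens are assumed to be already numericalized (the vocabulary lookup is omitted), and the result is a plain array rather than a SimpleDataset.

```python
import numpy as np


def to_independent_sequences(token_ids, batch_size):
    """Hypothetical helper: split a flat corpus into batch_size parallel columns."""
    token_ids = np.asarray(token_ids)
    seq_len = len(token_ids) // batch_size
    # Drop trailing tokens that do not fill a complete row.
    trimmed = token_ids[: seq_len * batch_size]
    # After the transpose, each column is one contiguous slice of the corpus.
    return trimmed.reshape(batch_size, seq_len).T


columns = to_independent_sequences(range(11), batch_size=2)
```

With 11 tokens and a batch size of 2, the result has shape (5, 2): column 0 holds tokens 0 to 4, column 1 holds tokens 5 to 9, and token 10 is dropped.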
class gluonnlp.data.batchify.CorpusBPTTBatchify(vocab, seq_len, batch_size, last_batch='keep')
Transform the dataset into batches of numericalized samples, such that the recurrent states from the last batch connect with the current batch for each sample.
Each sample is of shape (seq_len, batch_size). When last_batch='keep', the first dimension of the last sample may be shorter than seq_len.
- Parameters
vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).
batch_size (int) – The number of samples in each batch.
last_batch ({'keep', 'discard'}) –
How to handle the last batch if the remaining length is less than seq_len.
keep: A batch with fewer samples than previous batches is returned. vocab.padding_token is used to pad the last batch based on batch size.
discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).
__call__(corpus)
Batchify a dataset.
- Parameters
corpus (mxnet.gluon.data.Dataset) – A flat dataset to be batchified.
- Returns
Batches of numericalized samples such that the recurrent states from the last batch connect with the current batch for each sample. Each element of the Dataset is a tuple of size 2, specifying the data and label for BPTT respectively. Both items have the same shape (seq_len, batch_size).
- Return type
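The relationship between the data and label arrays can be sketched as a one-step shift over a (T, batch_size) array. The code below is a simplified illustration, not the library's implementation: bptt_batches is a hypothetical helper, it expects already numericalized input, and it does not model the padding with vocab.padding_token that the 'keep' mode describes.

```python
import numpy as np


def bptt_batches(columns, seq_len, last_batch="keep"):
    """Hypothetical helper: yield (data, target) pairs from a (T, batch_size) array."""
    columns = np.asarray(columns)
    total = columns.shape[0] - 1  # the final row has no next-token target
    for start in range(0, total, seq_len):
        length = min(seq_len, total - start)
        if length < seq_len and last_batch == "discard":
            break
        data = columns[start : start + length]
        # The target is the data shifted forward by one time step.
        target = columns[start + 1 : start + 1 + length]
        yield data, target


columns = np.arange(14).reshape(2, 7).T  # 7 time steps, batch_size 2
pairs = list(bptt_batches(columns, seq_len=3))
```

Because consecutive batches cover consecutive time steps of the same columns, the hidden state left after one batch is the natural starting state for the next one.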
class gluonnlp.data.batchify.StreamBPTTBatchify(vocab, seq_len, batch_size, sampler='random', last_batch='keep')
Transform a Stream of CorpusDataset to BPTT batches.
The corpus is transformed into batches of numericalized samples, such that the recurrent states from the last batch connect with the current batch for each sample.
Each sample is of shape (seq_len, batch_size).
For example, the following 4 sequences:
a b c d <eos>
e f g h i j <eos>
k l m n <eos>
o <eos>
will generate 2 batches with seq_len = 5, batch_size = 2 as follows (transposed):
batch_0.data.T:
a b c d <eos>
e f g h i
batch_0.target.T:
b c d <eos> k
f g h i j
batch_1.data.T:
k l m n <eos>
j <eos> o <eos> <padding>
batch_1.target.T:
l m n <eos> <padding>
<eos> o <eos> <padding> <padding>
- Parameters
vocab (gluonnlp.Vocab) – The vocabulary to use for numericalizing the dataset. Each token will be mapped to the index according to the vocabulary.
seq_len (int) – The length of each of the samples for truncated back-propagation-through-time (TBPTT).
batch_size (int) – The number of samples in each batch.
sampler (str, {'sequential', 'random'}, defaults to 'random') –
The sampler used to sample texts within a file.
’sequential’: SequentialSampler
’random’: RandomSampler
last_batch ({'keep', 'discard'}) –
How to handle the last batch if the remaining length is less than seq_len.
keep: A batch with fewer samples than previous batches is returned.
discard: The last batch is discarded if it’s smaller than (seq_len, batch_size).
__call__(corpus)
Batchify a stream.
- Parameters
corpus (nlp.data.DatasetStream) – A stream of un-flattened CorpusDataset.
- Returns
Batches of numericalized samples such that the recurrent states from the last batch connect with the current batch for each sample. Each element of the Dataset is a tuple of data and label arrays for BPTT. Both have shape (seq_len, batch_size).
- Return type
nlp.data.DataStream
class gluonnlp.data.batchify.EmbeddingCenterContextBatchify(batch_size, window_size=5, reduce_window_size_randomly=True, shuffle=True, cbow=False, weight_dtype='float32', index_dtype='int64')
Helper to create batches of center and context words.
Batches are created lazily on an optionally shuffled version of the Dataset. To create batches from a corpus, first create an EmbeddingCenterContextBatchify object and then call it with the corpus. Please see the documentation of __call__ for more details.
- Parameters
batch_size (int) – Maximum size of batches returned. Actual batch returned can be smaller when running out of samples.
window_size (int, default 5) – The maximum number of context elements to consider left and right of each center element. Fewer elements may be considered if there are not enough elements left or right of the center element, or if a reduced window size was drawn.
reduce_window_size_randomly (bool, default True) – If True, randomly draw a reduced window size for every center element uniformly from [1, window_size].
shuffle (bool, default True) – If True, shuffle the sentences before lazily generating batches.
cbow (bool, default False) – Enable CBOW mode. In CBOW mode, the returned context contains multiple entries per row, one for each context word. If cbow is False (default), there is a separate row for each context word. The context_data array always contains weights for the context words equal to 1 over the number of context words in the given row of the context array.
weight_dtype (numpy.dtype, default numpy.float32) – Data type for data array of sparse COO context representation.
index_dtype (numpy.dtype, default numpy.int64) – Data type for the row and col arrays of the sparse COO context representation.
__call__(corpus)
Batchify a dataset.
- Parameters
corpus (list of sentences) – List of sentences. Any list containing, for example, integers or strings can be a sentence. Context samples do not cross sentence boundaries.
- Returns
Each element of the DataStream is a tuple of 2 elements (center, context). center is a numpy.ndarray of shape (batch_size, ). context is a tuple of 3 numpy.ndarray, representing a sparse COO array (data, row, col). The center and context arrays contain the center and corresponding context words respectively. A sparse representation is used for context as the number of context words for one center word varies based on the randomly chosen context window size and sentence boundaries. The returned center and col arrays are of the same dtype as the sentence elements.
- Return type
DataStream
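The (center, context) layout with a sparse COO context can be made concrete with a small NumPy sketch. It is illustrative only and makes several simplifying assumptions: center_context_coo is a hypothetical helper, the window size is fixed rather than randomly reduced, sentences are not shuffled, and everything is returned as one batch instead of a lazily generated stream.

```python
import numpy as np


def center_context_coo(sentences, window_size=2, cbow=False):
    """Hypothetical helper: build center words and a COO (data, row, col) context."""
    centers, data, rows, cols = [], [], [], []
    for sentence in sentences:
        for i, center in enumerate(sentence):
            # Clip the window at the sentence boundaries.
            lo = max(0, i - window_size)
            hi = min(len(sentence), i + window_size + 1)
            context = [sentence[j] for j in range(lo, hi) if j != i]
            if not context:
                continue
            if cbow:
                # One row per center word; the weights average its context words.
                row = len(centers)
                centers.append(center)
                rows.extend([row] * len(context))
                cols.extend(context)
                data.extend([1.0 / len(context)] * len(context))
            else:
                # One row per (center, context) pair, each with weight 1.
                for word in context:
                    rows.append(len(centers))
                    centers.append(center)
                    cols.append(word)
                    data.append(1.0)
    context_coo = (np.array(data, dtype="float32"), np.array(rows), np.array(cols))
    return np.array(centers), context_coo


sentences = [[0, 1, 2], [3, 4]]
center, (weights, row, col) = center_context_coo(sentences, window_size=1, cbow=True)
```

In this CBOW example the middle word of the first sentence gets two context entries with weight 0.5 each, while words at a sentence edge get a single entry with weight 1, so no context crosses from one sentence into the next.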