Using ImageDataGenerator with your own generator

I have a large dataset that will not fit in memory, and it has multiple inputs, so that's why I created my own generator. But when I then wanted to augment my data using ImageDataGenerator, I faced a problem: I don't know how to combine both generators.
What I have done till now is :
def data_gen(batch_size=None, nb_epochs=None, sess=None):
    dataset = tf.data.TFRecordDataset(training_filenames)
    dataset = dataset.map(_parse_function_all)
    dataset = dataset.shuffle(buffer_size=1000 + 4 * batch_size)
    dataset = dataset.batch(batch_size).repeat()
    iterator = dataset.make_initializable_iterator()
    next_element = iterator.get_next()
    for i in range(nb_epochs):
        sess.run(iterator.initializer)
        while True:
            try:
                next_val = sess.run(next_element)
                images_a = next_val[0][:, 0]
                images_b = next_val[0][:, 1]
                labels = next_val[1]
                yield [images_a, images_b], labels
            except tf.errors.OutOfRangeError:
                break
mymodel = Model(input=[input_a, input_b], output=out)
mymodel.compile(loss=loss_both_equal, optimizer=rms, metrics=['accuracy', auc_roc])

data_gen_1 = data_gen(batch_size=batch_size, nb_epochs=10, sess=sess)
mymodel.fit_generator(generator=data_gen_1, epochs=epochs,
                      steps_per_epoch=335,
                      callbacks=[tensorboard, alphaChanger])
So if I want to do some augmentation using ImageDataGenerator, how can I combine my own generator with it?
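One possible approach (a sketch, not tested: it assumes the batches yielded by data_gen come out as NumPy arrays of shape (batch, H, W, C), and it uses ImageDataGenerator only as a per-image augmenter via its random_transform method) is to wrap the existing generator:

from keras.preprocessing.image import ImageDataGenerator
import numpy as np

augmenter = ImageDataGenerator(rotation_range=15, horizontal_flip=True)

def augmented_data_gen(batch_size=None, nb_epochs=None, sess=None):
    # wrap the existing generator and augment each image before yielding
    for [images_a, images_b], labels in data_gen(batch_size, nb_epochs, sess):
        images_a = np.stack([augmenter.random_transform(img) for img in images_a])
        images_b = np.stack([augmenter.random_transform(img) for img in images_b])
        yield [images_a, images_b], labels

The wrapped generator could then be passed to fit_generator in place of data_gen_1.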

How to add an additional output node during training for Pytorch?

I am making a class-incremental learning multi-label classifier. Here the model first trains with 7 labels. After training, another dataset emerges that contains the same labels except one more. I want to automatically add an extra node to the trained network and continue training on this new dataset. How can I do this?
class FeedForewardNN(nn.Module):
    def __init__(self, input_size, h1_size=264, h2_size=128, num_services=8):
        super().__init__()
        self.input_size = input_size
        self.lin1 = nn.Linear(input_size, h1_size)
        self.lin2 = nn.Linear(h1_size, h2_size)
        self.lin3 = nn.Linear(h2_size, num_services)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.lin1(x)
        x = self.relu(x)
        x = self.lin2(x)
        x = self.relu(x)
        x = self.lin3(x)
        x = self.sigmoid(x)
        return x
This is the architecture of the feedforward neural network.
First, I train on the dataset with only 7 classes.
# Create NN
input_size = len(x_columns)
net1 = FeedForewardNN(input_size, num_services=7)
alpha = 0.001

# Define optimizer
optimizer = optim.Adam(net1.parameters(), lr=alpha)
criterion = nn.BCELoss()
running_loss = 0

# Training loop
loss_list = []
auc_list = []
for i in range(len(train_data_x)):
    optimizer.zero_grad()
    outputs = net1(train_data_x[i])
    loss = criterion(outputs, train_data_y[i])
    loss.backward()
    optimizer.step()
However, I then want to add one additional output node, define the new weights while maintaining the old trained weights, and train on this new dataset.
I suggest replacing the layer with a new one of the desired shape, and then partially assigning its parameter values from the old one, as follows:
def increaseClassifier(m: torch.nn.Linear):
    old_shape = m.weight.shape
    m2 = nn.Linear(old_shape[1], old_shape[0] + 1)
    # keep the trained rows/entries; the new layer's random init fills the extra one
    m2.weight = nn.parameter.Parameter(torch.cat((m.weight, m2.weight[0:1])))
    m2.bias = nn.parameter.Parameter(torch.cat((m.bias, m2.bias[0:1])))
    return m2

class FeedForewardNN(nn.Module):
    ...

    def incrHere(self):
        self.lin3 = increaseClassifier(self.lin3)
UPD:
Can you explain how these additional weights that come with the new output node are initialized?
The initial weights for the new output come from the creation of the new layer: the layer constructor makes new parameters with some random initialization. We then replace part of them with the trained weights, and the remaining part is ready for new training.
m2.weight = nn.parameter.Parameter( torch.cat( (m.weight, m2.weight[0:1]) ) )
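A hypothetical usage sketch (one caveat worth labeling explicitly: after incrHere() replaces lin3, an optimizer created earlier still references the old layer's parameters, so it should be rebuilt):

net1.incrHere()  # lin3 now has one more output; the trained rows are preserved
optimizer = optim.Adam(net1.parameters(), lr=alpha)  # rebuild so the new row is trained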

What is the proper way to create training, validation and test set in pytorch or change the transform of an already created data set?

I noticed that something as standard as getting a validation set in PyTorch is not as common as one would expect, and not obviously available in the PyTorch library.
I found two websites that do it their own way:
- https://gist.github.com/MattKleinsmith/5226a94bad5dd12ed0b871aed98cb123
- https://www.geeksforgeeks.org/training-neural-networks-with-validation-using-pytorch/
but they have their problems: the second one forces the train & validation sets to have the same transforms, and the first one splits with respect to the data loader, which, afaik, is then impossible to hand easily to a distributed data loader.
If that is not the way to do it, then what is the proper way to create train, val and test sets?
The solution I found is to create three datasets from the beginning, each with the transforms you want. Then you have three dataset objects, and you give each one to torch.utils.data.Subset(train_dataset, train_indices). The crux is essentially this:
# load the dataset
path_to_data_set: str = str(Path(path_to_data_set).expanduser())
train_dataset = datasets.MNIST(root=path_to_data_set, train=True,
                               download=True, transform=train_transform)
val_dataset = datasets.MNIST(root=path_to_data_set, train=True,
                             download=True, transform=val_transform)
indices = list(range(len(train_dataset)))
train_indices, val_indices = split_inidices(indices, test_size=val_size, random_state=seed, shuffle=shuffle)
train_dataset = torch.utils.data.Subset(train_dataset, train_indices)
val_dataset = torch.utils.data.Subset(val_dataset, val_indices)
train_loader, val_loader = get_serial_or_distributed_dataloaders(train_dataset,
                                                                 val_dataset,
                                                                 batch_size,
                                                                 batch_size_eval,
                                                                 rank,
                                                                 world_size,
                                                                 merge,
                                                                 num_workers,
                                                                 pin_memory)
Then you can create whatever dataloaders you want later. This way you don't have to change the transform after the fact.
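If you don't need the distributed helper, a plain pair of loaders over the two Subsets would look like this (a minimal sketch; the batch sizes are placeholders):

from torch.utils.data import DataLoader

# each Subset keeps the transform of the dataset instance it wraps
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)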
full code:
"""
# - data augmentation
Current belief is that augmenting the validation set should be fine, especially if you want to actually encourage
generalization, since it makes the val set harder and it allows you to make the val split percentage slightly lower,
since your validation set has effectively been increased in size.
For reproducibility of other work, especially for scientific pursuits rather than "let's beat the state of the art" - to make
it easier to compare results, use what they use. E.g. it seems only augmenting the train set is the common thing,
especially when I looked at the augmentation strategies in mini-imagenet and mnist.
Test set augmentation helps mostly to make the test set harder (so acc should go down) - but it also increases variance
since the data size was increased. If you are reporting results, most likely augmenting the data set is a good idea
- especially if you are going to compute test set errors when comparing accuracy with previous work.
Also, the way CI intervals are computed with t_p * std_n / sqrt(n) means that the avg test error will be smaller, so
you are fine in general.
Default code I see doesn't augment the test set, so I most likely won't either.
ref:
- https://stats.stackexchange.com/questions/320800/data-augmentation-on-training-set-only/320967#320967
- https://arxiv.org/abs/1809.01442, https://stats.stackexchange.com/a/390470/28986
# - pin_memory
For data loading, passing pin_memory=True to a DataLoader will automatically put the fetched data tensors in pinned
memory, and thus enables faster data transfer to CUDA-enabled GPUs. Note on pinning:
This is an advanced tip. If you overuse pinned memory, it can cause serious problems when running low on RAM, and
you should be aware that pinning is often an expensive operation. Thus, I will leave its default as False.
ref:
- on pin_memory: https://pytorch.org/docs/stable/data.html
"""
from typing import Callable, Optional, Union
import numpy as np
import torch
from numpy.random import RandomState
from torch.utils.data import Dataset, SubsetRandomSampler, random_split, DataLoader, RandomSampler
def get_train_val_split_random_sampler(
        train_dataset: Dataset,
        val_dataset: Dataset,
        val_size: float = 0.2,
        batch_size: int = 128,
        batch_size_eval: int = 64,
        num_workers: int = 4,
        pin_memory: bool = False,
        # random_seed: Optional[int] = None,
) -> tuple[DataLoader, DataLoader]:
    """
    Note:
        - this will use different transforms for val and train if the objects you pass have different transforms.
        - note train_dataset, val_dataset will often be the same data set object but different instances with different
          transforms for each data set.
    Recommended use:
        - this one is recommended when you want the train & val sets to have different transforms e.g. when doing scientific
          work - instead of beating benchmark work - and the train, val sets have different transforms.
    ref:
        - https://gist.github.com/MattKleinsmith/5226a94bad5dd12ed0b871aed98cb123
    """
    assert 0 <= val_size <= 1.0, f"Error: {val_size} valid_size should be in the range [0, 1]."
    num_train = len(train_dataset)
    indices = list(range(num_train))
    split_idx = int(np.floor(val_size * num_train))
    # I don't think this is needed since later the sampler randomly samples data from a given list
    # if shuffle == True:
    #     np.random.seed(random_seed)
    #     np.random.shuffle(indices)
    # the first split_idx indices (a val_size fraction) go to the val set, the rest to the train set
    valid_idx, train_idx = indices[:split_idx], indices[split_idx:]
    assert len(train_idx) != 0 and len(valid_idx) != 0
    # Samples elements randomly from a given list of indices, without replacement.
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)
    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=batch_size, sampler=train_sampler,
                                               num_workers=num_workers, pin_memory=pin_memory)
    valid_loader = torch.utils.data.DataLoader(val_dataset,
                                               batch_size=batch_size_eval, sampler=valid_sampler,
                                               num_workers=num_workers, pin_memory=pin_memory)
    return train_loader, valid_loader
def get_train_val_split_with_split(
        train_dataset: Dataset,
        train_val_split: list[int],  # e.g. [50_000, 10_000] for mnist
        batch_size: int = 128,
        batch_size_eval: int = 64,
        num_workers: int = 4,
        pin_memory: bool = False,
) -> tuple[DataLoader, DataLoader]:
    """
    Note:
        - this will have the train and val sets use the same transform.
    ref:
        - https://gist.github.com/MattKleinsmith/5226a94bad5dd12ed0b871aed98cb123
        - change transform: https://discuss.pytorch.org/t/changing-transforms-after-creating-a-dataset/64929/4
    """
    train_dataset, valid_dataset = random_split(train_dataset, train_val_split)
    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=batch_size, num_workers=num_workers, pin_memory=pin_memory)
    valid_loader = torch.utils.data.DataLoader(valid_dataset,
                                               batch_size=batch_size_eval, num_workers=num_workers,
                                               pin_memory=pin_memory)
    return train_loader, valid_loader
def get_serial_or_distributed_dataloaders(train_dataset: Dataset,
                                          val_dataset: Dataset,
                                          batch_size: int = 128,
                                          batch_size_eval: int = 64,
                                          rank: int = -1,
                                          world_size: int = 1,
                                          merge: Optional[Callable] = None,
                                          num_workers: int = -1,  # -1 means it's running serially
                                          pin_memory: bool = False,
                                          ):
    """
    """
    from uutils.torch_uu.distributed import is_running_serially
    if is_running_serially(rank):
        train_sampler = RandomSampler(train_dataset)
        val_sampler = RandomSampler(val_dataset)
        num_workers = 4 if num_workers == -1 else num_workers
    else:
        assert (batch_size >= world_size), f'Each worker must get at least one data point, so batch_size >= world_size, ' \
                                           f'but got: batch_size={batch_size}, world_size={world_size}'
        from torch.utils.data import DistributedSampler
        # note: shuffle = True by default
        train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
        val_sampler = DistributedSampler(val_dataset, num_replicas=world_size, rank=rank)
        # set the input num_workers but for ddp 0 is recommended afaik, todo - check
        num_workers = 0 if num_workers == -1 else num_workers
    # get dist dataloaders
    train_loader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              sampler=train_sampler,
                              collate_fn=merge,
                              num_workers=num_workers,
                              pin_memory=pin_memory)
    val_loader = DataLoader(val_dataset,
                            batch_size=batch_size_eval,
                            sampler=val_sampler,
                            collate_fn=merge,
                            num_workers=num_workers,
                            pin_memory=pin_memory)
    # return dataloaders
    # dataloaders = {'train': train_dataloader, 'val': val_dataloader, 'test': test_dataloader}
    # iter(train_dataloader)  # if this fails it's likely you're running in pycharm and need to set num_workers to 0
    return train_loader, val_loader
def split_inidices(indices: list,
                   test_size: Optional[float] = None,
                   random_state: Optional[Union[int, RandomState]] = None,
                   shuffle: bool = False,  # false for reproducibility, and any split is as good as any other.
                   ) -> tuple[list[int], list[int]]:
    import sklearn.model_selection
    # - api: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    train_indices, val_indices = sklearn.model_selection.train_test_split(indices, test_size=test_size,
                                                                          random_state=random_state,
                                                                          shuffle=shuffle)
    return train_indices, val_indices
# - visualization help
"""
Inspired from:
- https://gist.github.com/MattKleinsmith/5226a94bad5dd12ed0b871aed98cb123
- https://www.geeksforgeeks.org/training-neural-networks-with-validation-using-pytorch/
"""
from argparse import Namespace
from pathlib import Path
from typing import Optional, Callable
import numpy as np
import torch
from torch.utils.data import random_split, DataLoader
from torchvision import datasets
from torchvision.transforms import transforms
from uutils.torch_uu.dataloaders.common import split_inidices, \
get_serial_or_distributed_dataloaders
NORMALIZE_MNIST = transforms.Normalize((0.1307,), (0.3081,)) # MNIST
def get_train_valid_test_data_loader_helper_for_mnist(args: Namespace) -> dict:
    train_kwargs = {'path_to_data_set': args.path_to_data_set,
                    'batch_size': args.batch_size,
                    'batch_size_eval': args.batch_size_eval,
                    'augment_train': args.augment_train,
                    'augment_val': args.augment_val,
                    'num_workers': args.num_workers,
                    'pin_memory': args.pin_memory,
                    'rank': args.rank,
                    'world_size': args.world_size,
                    'merge': None
                    }
    test_kwargs = {'path_to_data_set': args.path_to_data_set,
                   'batch_size_eval': args.batch_size_eval,
                   'augment_test': args.augment_train,
                   'num_workers': args.num_workers,
                   'pin_memory': args.pin_memory,
                   'rank': args.rank,
                   'world_size': args.world_size,
                   'merge': None
                   }
    train_loader, val_loader = get_train_valid_loader(**train_kwargs)
    test_loader: DataLoader = get_test_loader(**test_kwargs)
    dataloaders: dict = {'train': train_loader, 'val': val_loader, 'test': test_loader}
    return dataloaders
def get_transform(augment: bool):
    if augment:
        transform = transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            NORMALIZE_MNIST
        ])
    else:
        transform = transforms.Compose([
            transforms.ToTensor(),
            NORMALIZE_MNIST
        ])
    return transform
def get_train_valid_loader(path_to_data_set: Path,
                           batch_size: int = 128,
                           batch_size_eval: int = 64,
                           seed: Optional[int] = None,
                           augment_train: bool = True,
                           augment_val: bool = False,
                           val_size: Optional[float] = 0.2,
                           shuffle: bool = False,  # false for reproducibility, and any split is as good as any other.
                           num_workers: int = -1,
                           pin_memory: bool = False,
                           rank: int = -1,
                           world_size: int = 1,
                           merge: Optional[Callable] = None,
                           ) -> tuple[DataLoader, DataLoader]:
    """
    Utility function for loading and returning train and valid
    multi-process iterators over the MNIST dataset.
    If using CUDA, num_workers should be set to 1 and pin_memory to True.
    """
    # define transforms
    train_transform = get_transform(augment_train)
    val_transform = get_transform(augment_val)
    # load the dataset
    path_to_data_set: str = str(Path(path_to_data_set).expanduser())
    train_dataset = datasets.MNIST(root=path_to_data_set, train=True,
                                   download=True, transform=train_transform)
    val_dataset = datasets.MNIST(root=path_to_data_set, train=True,
                                 download=True, transform=val_transform)
    indices = list(range(len(train_dataset)))
    train_indices, val_indices = split_inidices(indices, test_size=val_size, random_state=seed, shuffle=shuffle)
    train_dataset = torch.utils.data.Subset(train_dataset, train_indices)
    val_dataset = torch.utils.data.Subset(val_dataset, val_indices)
    train_loader, val_loader = get_serial_or_distributed_dataloaders(train_dataset,
                                                                     val_dataset,
                                                                     batch_size,
                                                                     batch_size_eval,
                                                                     rank,
                                                                     world_size,
                                                                     merge,
                                                                     num_workers,
                                                                     pin_memory)
    return train_loader, val_loader
def get_test_loader(path_to_data_set,
                    batch_size_eval: int = 64,
                    shuffle: bool = True,
                    augment_test: bool = False,
                    num_workers: int = -1,
                    pin_memory=False,
                    rank: int = -1,
                    world_size: int = 1,
                    merge: Optional[Callable] = None,
                    ) -> DataLoader:
    """
    Utility function for loading and returning a multi-process
    test iterator over the MNIST dataset.
    If using CUDA, num_workers should be set to 1 and pin_memory to True.
    Params
    ------
    - path_to_data_set: path directory to the dataset.
    - batch_size_eval: how many samples per batch to load.
    - shuffle: whether to shuffle the dataset after every epoch.
    - num_workers: number of subprocesses to use when loading the dataset.
    - pin_memory: whether to copy tensors into CUDA pinned memory. Set it to
      True if using GPU.
    Returns
    -------
    - data_loader: test set iterator.
    Note:
        - it knows it's the test set since train=False in the body when creating the data set.
    """
    # define transform
    test_transform = get_transform(augment_test)
    # load the dataset
    path_to_data_set: str = str(Path(path_to_data_set).expanduser())
    test_dataset = datasets.MNIST(root=path_to_data_set,
                                  train=False,  # ensures it's the test set
                                  download=True,
                                  transform=test_transform)
    _, test_loader = get_serial_or_distributed_dataloaders(test_dataset,
                                                           test_dataset,
                                                           batch_size_eval,
                                                           batch_size_eval,
                                                           rank,
                                                           world_size,
                                                           merge,
                                                           num_workers,
                                                           pin_memory)
    return test_loader
The repo this came from, with a permanent GitHub link: https://github.com/brando90/ultimate-utils/blob/ef2217c07b43aa5354f7b6f8f1761c5f16017874/ultimate-utils-proj-src/uutils/torch_uu/dataloaders/mnist.py#L22
related:
https://discuss.pytorch.org/t/changing-transformation-applied-to-data-during-training/15671/14
https://discuss.pytorch.org/t/changing-transforms-after-creating-a-dataset/64929/7
https://discuss.pytorch.org/t/apply-different-transform-data-augmentation-to-train-and-validation/63580/13

PyTorch-lightning models running out of Memory after 1st epoch

I saw a Kaggle kernel on PyTorch and ran it with the same img_size, batch_size, etc., and then created a PyTorch Lightning kernel with the exact same values, but my Lightning model runs out of memory after about 1.5 epochs (each epoch contains 8750 steps) on the first fold, whereas the native PyTorch model runs for the whole 5 folds. Is there any way to improve the code or release memory? I could have tried deleting the models or doing some garbage collection, but if it doesn't complete even the first fold, I can't delete the models and things.
def run_fold(fold):
    df_train = train[train['fold'] != fold]
    df_valid = train[train['fold'] == fold]
    train_dataset = G2NetDataset(df_train, get_train_aug())
    valid_dataset = G2NetDataset(df_valid, get_test_aug())
    train_dl = DataLoader(train_dataset,
                          batch_size=config.batch_size,
                          num_workers=config.num_workers,
                          shuffle=True,
                          drop_last=True,
                          pin_memory=True)
    valid_dl = DataLoader(valid_dataset,
                          batch_size=config.batch_size,
                          num_workers=config.num_workers,
                          shuffle=False,
                          drop_last=False,
                          pin_memory=True)
    model = Classifier()
    logger = pl.loggers.WandbLogger(project='G2Net', name=f'fold: {fold}')
    trainer = pl.Trainer(gpus=1,
                         max_epochs=config.epochs,
                         fast_dev_run=config.debug,
                         logger=logger,
                         log_every_n_steps=10)
    trainer.fit(model, train_dl, valid_dl)
    result = trainer.test(test_dataloaders=valid_dl)
    wandb.run.finish()
    return result

def main():
    if config.train:
        results = []
        for fold in range(config.n_fold):
            result = run_fold(fold)
            results.append(result)
        return results

results = main()
I cannot say much without looking at your model class, but a couple of possible issues I encountered were metric and loss evaluation for logging.
For example, stuff like
pl.metrics.Accuracy(compute_on_step=False)
requires an explicit call to .compute()
def training_epoch_end(self, outputs):
    loss = sum([out['loss'] for out in outputs]) / len(outputs)
    self.log_dict({'train_loss': loss.detach(),
                   'train_accuracy': self.train_metric.compute()})
at the epoch end.
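For context, a minimal sketch of where such a metric could live; the names here (Classifier, self.train_metric, the placeholder backbone) are assumptions chosen to match the snippets above:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class Classifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(128, 2)  # placeholder model
        # accumulates state across the epoch; compute() is called in training_epoch_end
        self.train_metric = pl.metrics.Accuracy(compute_on_step=False)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.backbone(x)
        loss = F.cross_entropy(logits, y)
        self.train_metric(logits.softmax(dim=-1), y)  # updates state only; returns None here
        return {'loss': loss}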

Why model's loss is always revolving around 1 in every epoch?

During training, the loss of my model revolves around 1; it is not converging.
I tried various optimizers but it still shows the same pattern. I am using Keras with the TensorFlow backend. What could be the possible reasons? Any help or reference link would be appreciated.
Here is my model:
def model_vgg19():
    vgg_model = VGG19(weights="imagenet", include_top=False, input_shape=(128, 128, 3))
    for layer in vgg_model.layers[:10]:
        layer.trainable = False
    intermediate_layer_outputs = get_layers_output_by_name(vgg_model, ["block1_pool", "block2_pool", "block3_pool", "block4_pool"])
    convnet_output = GlobalAveragePooling2D()(vgg_model.output)
    for layer_name, output in intermediate_layer_outputs.items():
        output = GlobalAveragePooling2D()(output)
        convnet_output = concatenate([convnet_output, output])
    convnet_output = Dense(2048, activation='relu')(convnet_output)
    convnet_output = Dropout(0.6)(convnet_output)
    convnet_output = Dense(2048, activation='relu')(convnet_output)
    convnet_output = Lambda(lambda x: K.l2_normalize(x, axis=1))(convnet_output)
    final_model = Model(inputs=[vgg_model.input], outputs=convnet_output)
    return final_model

model = model_vgg19()
Here is my loss function:
def hinge_loss(y_true, y_pred):
    y_pred = K.clip(y_pred, _EPSILON, 1.0 - _EPSILON)
    loss = tf.convert_to_tensor(0, dtype=tf.float32)
    g = tf.constant(1.0, shape=[1], dtype=tf.float32)
    for i in range(0, batch_size, 3):
        try:
            q_embedding = y_pred[i + 0]
            p_embedding = y_pred[i + 1]
            n_embedding = y_pred[i + 2]
            D_q_p = K.sqrt(K.sum((q_embedding - p_embedding) ** 2))
            D_q_n = K.sqrt(K.sum((q_embedding - n_embedding) ** 2))
            loss = (loss + g + D_q_p - D_q_n)
        except:
            continue
    loss = loss / (batch_size / 3)
    zero = tf.constant(0.0, shape=[1], dtype=tf.float32)
    return tf.maximum(loss, zero)
What is definitely a problem is that you shuffle your data and then try to learn triplets out of it.
As you can see here: https://keras.io/models/model/, model.fit shuffles your data in each epoch, which makes your triplet setup obsolete. Try setting the shuffle parameter to False and see what happens; there might be different errors as well.
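A minimal illustration of that suggestion (hedged: it assumes x_train is already ordered in consecutive (query, positive, negative) triplets, and y_train is a dummy target, since hinge_loss only uses y_pred):

model.compile(loss=hinge_loss, optimizer='adam')
model.fit(x_train, y_train, batch_size=batch_size, epochs=10, shuffle=False)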

How do I get CSV files into an Estimator in Tensorflow 1.6

I am new to TensorFlow (and this is my first question on Stack Overflow).
As a learning tool, I am trying to do something simple. (Four days later I am still confused.)
I have one CSV file with 36 columns (3500 records) with 0s and 1s.
I am envisioning this file as a flattened 6x6 matrix.
I have another CSV file with 1 column of ground truth, 0 or 1 (3500 records), which indicates whether at least 4 of the 6 elements on the 6x6 matrix's diagonal are 1s.
I am not sure I have processed the CSV files correctly.
I am confused as to how to create the features dictionary and labels, and how they fit into the DNNClassifier.
I am using TensorFlow 1.6, Python 3.6
Below is the small amount of code I have so far.
import tensorflow as tf
import os

def x_map(line):
    rDefaults = [[] for cl in range(36)]
    x_row = tf.decode_csv(line, record_defaults=rDefaults)
    return x_row

def y_map(line):
    line = tf.string_to_number(line, out_type=tf.int32)
    y_row = tf.one_hot(line, depth=2)
    return y_row

x_path_file = os.path.join('D:', 'Diag', '6x6_train.csv')
y_path_file = os.path.join('D:', 'Diag', 'HasDiag_train.csv')

filenames = [x_path_file]
x_dataset = tf.data.TextLineDataset(filenames)
x_dataset = x_dataset.map(x_map)
x_dataset = x_dataset.batch(1)
x_iter = x_dataset.make_one_shot_iterator()
x_next_el = x_iter.get_next()

filenames = [y_path_file]
y_dataset = tf.data.TextLineDataset(filenames)
y_dataset = y_dataset.map(y_map)
y_dataset = y_dataset.batch(1)
y_iter = y_dataset.make_one_shot_iterator()
y_next_el = y_iter.get_next()

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    x_el = (sess.run(x_next_el))
    y_el = (sess.run(y_next_el))
The output for x_el is:
(array([1.], dtype=float32), array([1.], dtype=float32), array([1.], dtype=float32), array([1.], dtype=float32), array([1.], dtype=float32), array([0.] ... it goes on...
The output for y_el is:
[[1. 0.]]
You're pretty much there for a minimal working model. The main issue I see is that tf.decode_csv returns a tuple of tensors, whereas I expect you want a single tensor with all values. Easy fix:
x_row = tf.stack(tf.decode_csv(line, record_defaults=rDefaults))
That should work... but it fails to take advantage of many of the awesome things the tf.data.Dataset API has to offer, like shuffling and parallel threading. For example, if you shuffle each dataset, those shuffling operations won't be consistent, because you've created two separate datasets and manipulated them independently. If you instead create them, zip them together, and then manipulate the combined dataset, those manipulations will be consistent.
Try something along these lines:
def get_inputs(
        count=None, shuffle=True, buffer_size=1000, batch_size=32,
        num_parallel_calls=8, x_paths=[x_path_file], y_paths=[y_path_file]):
    """
    Get x, y inputs.

    Args:
        count: number of epochs. None indicates infinite epochs.
        shuffle: whether or not to shuffle the dataset
        buffer_size: used in shuffle
        batch_size: size of batch. See outputs below
        num_parallel_calls: used in map. Note if > 1, intra-batch ordering
            will be shuffled
        x_paths: list of paths to x-value files.
        y_paths: list of paths to y-value files.

    Returns:
        x: (batch_size, 6, 6) tensor
        y: (batch_size, 2) tensor of 1-hot labels
    """

    def x_map(line):
        rDefaults = [[] for cl in range(n_dims**2)]
        x_row = tf.stack(tf.decode_csv(line, record_defaults=rDefaults))
        return x_row

    def y_map(line):
        line = tf.string_to_number(line, out_type=tf.int32)
        y_row = tf.one_hot(line, depth=2)
        return y_row

    def xy_map(x, y):
        return x_map(x), y_map(y)

    x_ds = tf.data.TextLineDataset(x_paths)
    y_ds = tf.data.TextLineDataset(y_paths)
    combined = tf.data.Dataset.zip((x_ds, y_ds))
    combined = combined.repeat(count=count)
    if shuffle:
        combined = combined.shuffle(buffer_size)
    combined = combined.map(xy_map, num_parallel_calls=num_parallel_calls)
    combined = combined.batch(batch_size)
    x, y = combined.make_one_shot_iterator().get_next()
    return x, y
To experiment/debug,
x, y = get_inputs()
with tf.Session() as sess:
    xv, yv = sess.run((x, y))
    print(xv.shape, yv.shape)
For use in an estimator, pass the function itself.
estimator.train(get_inputs, max_steps=10000)

def get_eval_inputs():
    return get_inputs(
        count=1, shuffle=False,
        x_paths=[x_eval_paths],
        y_paths=[y_eval_paths])

estimator.evaluate(get_eval_inputs)