Pytorch: Overfitting on a small batch: Debugging - deep-learning

I am building a multi-class image classifier.
There is a debugging trick to overfit on a single batch to check if there any deeper bugs in the program.
How to design the code in a way that can do it in a much portable format?
One arduous and a not smart way is to build a holdout train/test folder for a small batch where test class consists of 2 distribution - seen data and unseen data and if the model is performing better on seen data and poorly on unseen data, then we can conclude that our network doesn't have any deeper structural bug.
But, this does not seems like a smart and a portable way, and have to do it with every problem.
Currently, I have a dataset class where I am partitioning the data in train/dev/test in the below way -
def split_equal_into_val_test(csv_file=None, stratify_colname='y',
frac_train=0.6, frac_val=0.15, frac_test=0.25,
):
"""
Split a Pandas dataframe into three subsets (train, val, and test).
Following fractional ratios provided by the user, where val and
test set have the same number of each classes while train set have
the remaining number of left classes
Parameters
----------
csv_file : Input data csv file to be passed
stratify_colname : str
The name of the column that will be used for stratification. Usually
this column would be for the label.
frac_train : float
frac_val : float
frac_test : float
The ratios with which the dataframe will be split into train, val, and
test data. The values should be expressed as float fractions and should
sum to 1.0.
random_state : int, None, or RandomStateInstance
Value to be passed to train_test_split().
Returns
-------
df_train, df_val, df_test :
Dataframes containing the three splits.
"""
df = pd.read_csv(csv_file).iloc[:, 1:]
if frac_train + frac_val + frac_test != 1.0:
raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
(frac_train, frac_val, frac_test))
if stratify_colname not in df.columns:
raise ValueError('%s is not a column in the dataframe' %
(stratify_colname))
df_input = df
no_of_classes = 4
sfact = int((0.1*len(df))/no_of_classes)
# Shuffling the data frame
df_input = df_input.sample(frac=1)
df_temp_1 = df_input[df_input['labels'] == 1][:sfact]
df_temp_2 = df_input[df_input['labels'] == 2][:sfact]
df_temp_3 = df_input[df_input['labels'] == 3][:sfact]
df_temp_4 = df_input[df_input['labels'] == 4][:sfact]
dev_test_df = pd.concat([df_temp_1, df_temp_2, df_temp_3, df_temp_4])
dev_test_y = dev_test_df['labels']
# Split the temp dataframe into val and test dataframes.
df_val, df_test, dev_Y, test_Y = train_test_split(
dev_test_df, dev_test_y,
stratify=dev_test_y,
test_size=0.5,
)
df_train = df[~df['img'].isin(dev_test_df['img'])]
assert len(df_input) == len(df_train) + len(df_val) + len(df_test)
return df_train, df_val, df_test
def train_val_to_ids(train, val, test, stratify_columns='labels'): # noqa
"""
Convert the stratified dataset in the form of dictionary : partition['train] and labels.
To generate the parallel code according to https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel
Parameters
-----------
csv_file : Input data csv file to be passed
stratify_columns : The label column
Returns
-----------
partition, labels:
partition dictionary containing train and validation ids and label dictionary containing ids and their labels # noqa
"""
train_list, val_list, test_list = train['img'].to_list(), val['img'].to_list(), test['img'].to_list() # noqa
partition = {"train_set": train_list,
"val_set": val_list,
}
labels = dict(zip(train.img, train.labels))
labels.update(dict(zip(val.img, val.labels)))
return partition, labels
P.S - I know about the Pytorch lightning and know that they have an overfitting feature which can be used easily but I don't want to move to PyTorch lightning.

I don't know how portable it will be, but a trick that I use is to modify the __len__ function in the Dataset.
If I modified it from
def __len__(self):
return len(self.data_list)
to
def __len__(self):
return 20
It will only output the first 20 elements in the dataset (regardless of shuffle). You only need to change one line of code and the rest should work just fine so I think it's pretty neat.

Related

Transforming a Dataset for Multi-Label Text Classification

I am conducting some experiments on multi-label classification via deep learning models.
But I face a problem with the dataset.
I use Keras,TensorFlow 2.0, numpy,pandas.
I have a dataset in the form:
Dataset in the form that I have it
To apply multi-label classification(6 labels) I need my dataset to be in this form:
Dataset in the form that I need it
How is it possible to achieve this? Are there any functions making this transformation easier?
Try:
comments_df[['abusive','hateful','offensive','disrespectful','fearful','normal']] = comments_df['sentiment'].str.split('_', -1, expand=True)
This gives me an error:
ValueError: Columns must be same length as key
Regarding the DL model I will use, it's bi-LSTM, but it doesn't have anything to do with the question per-se.
Try this:
df = pd.get_dummies(data = df, columns = ['sentiment'])
I found this to work (not optimal solution):
"""
Creating a column for each of the target labels with sentiment's column data.
"""
def split_sentiment_outputs(output_label, sentiment_col="sentiment"):
comments_df[output_label] = comments_df[sentiment_col].str.split('_')
"""
Transform column's data to categorical.
"""
def transform_data_for_multilabel(output_label):
row = comments_df[output_label]
for index, row in row.items():
# print("Index:", index)
# print("length:", len(row))
# print("content:", row)
# print("--------------")
z = 0
while z < len(row):
if row[z] == output_label:
comments_df.at[index, output_label] = 1
break
else:
comments_df.at[index, output_label] = 0
z = z + 1
# Applying Data Transformation
output_labels = ["abusive", "hateful", "offensive", "disrespectful", "fearful", "normal"]
for i in range(MAX_OUT):
split_sentiment_outputs(output_labels[i])
for i in range(MAX_OUT):
transform_data_for_multilabel(output_labels[i])

How to estimate the parameters of a mixture model in OpenTURNS?

I would like to estimate the parameters of a mixture model of normal distributions in OpenTURNS (that is, the distribution of a weighted sum of Gaussian random variables). OpenTURNS can create such a mixture, but it cannot estimate its parameters. Moreover, I need to create the mixture as an OpenTURNS distribution in order to propagate uncertainty through a function.
For example, I know how to create a mixture of two normal distributions:
import openturns as ot
mu1 = 1.0
sigma1 = 0.5
mu2 = 3.0
sigma2 = 2.0
weights = [0.3, 0.7]
n1 = ot.Normal(mu1, sigma1)
n2 = ot.Normal(mu2, sigma2)
m = ot.Mixture([n1, n2], weights)
In this example, I would like to estimate mu1, sigma1, mu2, sigma2 on a given sample. In order to create a working example, it is easy to generate a sample by simulation.
s = m.getSample(100)
You can rely on scikit-learn's GaussianMixture to estimate the parameters and then use them to define a Mixture model in OpenTURNS.
The script hereafter contains a Python class MixtureFactory that estimates the parameters of a scikitlearn GaussianMixture and outputs an OpenTURNS Mixture distribution:
from sklearn.mixture import GaussianMixture
from sklearn.utils.validation import check_is_fitted
import openturns as ot
import numpy as np
class MixtureFactory(GaussianMixture):
"""
Representation of a Gaussian mixture model probability distribution.
This class allows to estimate the parameters of a Gaussian mixture
distribution using scikit algorithms & provides openturns Mixture object.
Read more in scikit learn user guide & openturns theory.
Parameters:
-----------
n_components : int, defaults to 1.
The number of mixture components.
covariance_type : {'full' (default), 'tied', 'diag', 'spherical'}
String describing the type of covariance parameters to use.
Must be one of:
'full'
each component has its own general covariance matrix
'tied'
all components share the same general covariance matrix
'diag'
each component has its own diagonal covariance matrix
'spherical'
each component has its own single variance
tol : float, defaults to 1e-3.
The convergence threshold. EM iterations will stop when the
lower bound average gain is below this threshold.
reg_covar : float, defaults to 1e-6.
Non-negative regularization added to the diagonal of covariance.
Allows to assure that the covariance matrices are all positive.
max_iter : int, defaults to 100.
The number of EM iterations to perform.
n_init : int, defaults to 1.
The number of initializations to perform. The best results are kept.
init_params : {'kmeans', 'random'}, defaults to 'kmeans'.
The method used to initialize the weights, the means and the
precisions.
Must be one of::
'kmeans' : responsibilities are initialized using kmeans.
'random' : responsibilities are initialized randomly.
weights_init : array-like, shape (n_components, ), optional
The user-provided initial weights, defaults to None.
If it None, weights are initialized using the `init_params` method.
means_init : array-like, shape (n_components, n_features), optional
The user-provided initial means, defaults to None,
If it None, means are initialized using the `init_params` method.
precisions_init : array-like, optional.
The user-provided initial precisions (inverse of the covariance
matrices), defaults to None.
If it None, precisions are initialized using the 'init_params' method.
The shape depends on 'covariance_type'::
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
warm_start : bool, default to False.
If 'warm_start' is True, the solution of the last fitting is used as
initialization for the next call of fit(). This can speed up
convergence when fit is called several times on similar problems.
In that case, 'n_init' is ignored and only a single initialization
occurs upon the first call.
See :term:`the Glossary <warm_start>`.
verbose : int, default to 0.
Enable verbose output. If 1 then it prints the current
initialization and each iteration step. If greater than 1 then
it prints also the log probability and the time needed
for each step.
verbose_interval : int, default to 10.
Number of iteration done before the next print.
"""
def __init__(self, n_components=2, covariance_type='full', tol=1e-6,
reg_covar=1e-6, max_iter=1000, n_init=1, init_params='kmeans',
weights_init=None, means_init=None, precisions_init=None,
random_state=41, warm_start=False,
verbose=0, verbose_interval=10):
super().__init__(n_components, covariance_type, tol, reg_covar,
max_iter, n_init, init_params, weights_init, means_init,
precisions_init, random_state, warm_start, verbose, verbose_interval)
def fit(self, X):
"""
Fit the mixture model parameters.
EM algorithm is applied here to estimate the model parameters and build a
Mixture distribution (see openturns mixture).
The method fits the model ``n_init`` times and sets the parameters with
which the model has the largest likelihood or lower bound. Within each
trial, the method iterates between E-step and M-step for ``max_iter``
times until the change of likelihood or lower bound is less than
``tol``, otherwise, a ``ConvergenceWarning`` is raised.
If ``warm_start`` is ``True``, then ``n_init`` is ignored and a single
initialization is performed upon the first call. Upon consecutive
calls, training starts where it left off.
Parameters
----------
X : array-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row
corresponds to a single data point.
Returns
-------
"""
data = np.array(X)
# Evaluate the model parameters.
super().fit(data)
# openturns mixture
# n_components ==> weight of size n_components
weights = self.weights_
n_components = len(weights)
# Create ot distribution
collection = n_components * [0]
# Covariance matrices
cov = self.covariances_
mu = self.means_
# means : n_components x n_features
n_components, n_features = mu.shape
# Following the type of covariance, we define the collection of gaussians
# Spherical : C_k = Identity * sigma_k
if self.covariance_type is 'spherical':
c = ot.CorrelationMatrix(n_features)
for l in range(n_components):
sigma = np.sqrt(cov[l])
collection[l] = ot.Normal(list(mu[l]), [ sigma ] * n_features , c)
elif self.covariance_type is 'diag' :
for l in range(n_components):
c = ot.CovarianceMatrix(n_features)
for i in range(n_features):
c[i,i] = cov[l, i]
collection[l] = ot.Normal(list(mu[l]), c)
elif self.covariance_type == 'tied':
# Same covariance for all clusters
c = ot.CovarianceMatrix(n_features)
for i in range(n_features):
for j in range(0, i+1):
c[i,j] = cov[i,j]
# Define the collection with the same covariance
for l in range(n_components):
collection[l] = ot.Normal(list(mu[l]), c)
else:
n_features = cov.shape[1]
for l in range(n_components):
c = ot.CovarianceMatrix(n_features)
for i in range(n_features):
for j in range(0, i+1):
c[i,j] = cov[l][i,j]
collection[l] = ot.Normal(list(mu[l]), c)
self._mixture = ot.Mixture(collection, weights)
return self
def get_mixture(self):
"""
Returns the Mixture object
"""
check_is_fitted(self)
return self._mixture
if __name__ == "__main__":
mu1 = 1.0
sigma1 = 0.5
mu2 = 3.0
sigma2 = 2.0
weights = [0.3, 0.7]
n1 = ot.Normal(mu1, sigma1)
n2 = ot.Normal(mu2, sigma2)
m = ot.Mixture([n1, n2], weights)
x = m.getSample(1000)
est_dist = MixtureFactory(random_state=1)
est_dist.fit(x)
print(est_dist.get_mixture())
I have actually tried this method and it works perfectly. On top of that the fit of the model through the SciKit GMM and the ulterior adjustment thanks to OpenTurns are very fast. I recommend future users to test several numbers of components and covariance matrix structures, as it will not take a lot of time and can substantially improve the goodness of fit to the data.
Thanks for the answer.
Here is a pure OpenTURNS solution. It is probably slower than the scikit-learn-based method, but it is more generic: you could use it to estimate the parameters of any mixture model, not necessarily a mixture of normal distributions.
The idea is to retrieve the log-likelihood function from the Mixture object and minimize it.
In the following, let us assume that s is the sample we want to fit the mixture on.
First, we need to build the mixture we want to estimate the parameters of. We can specify any valid set of parameters, it does not matter. In your example, you want a mixture of 2 normal distributions.
mixture = ot.Mixture([ot.Normal()]*2, [0.5]*2)
There is a small hurdle. All weights sum to 1, thus one of them is determined by the others: the solver must not be allowed to freely set it. The order of the parameters of an OpenTURNS Mixture is as follows:
weight of the first distribution;
parameters of the first distribution;
weight of the second distribution;
parameters of the second distribution:
...
You can view all parameters with mixture.getParameter() and their names with mixture.getParameterDescription(). The following is a helper function that:
takes as input the list containing of all mixture parameters except the weight of its first distribution;
outputs a Point containing all parameters including the weight of the first distribution.
def full(params):
"""
Point of all mixture parameters from a list that omits the first weight.
"""
params = ot.Point(params)
aux_mixture = ot.Mixture(mixture)
dist_number = aux_mixture.getDistributionCollection().getSize()
index = aux_mixture.getDistributionCollection()[0].getParameter().getSize()
list_weights = []
for num in range(1, dist_number):
list_weights.append(params[index])
index += 1 + aux_mixture.getDistributionCollection()[num].getParameter().getSize()
complementary_weight = ot.Point([abs(1.0 - sum(list_weights))])
complementary_weight.add(params)
return complementary_weight
The next function computes the opposite of the log-likelihood of a given list of parameters (except the first weight).
For the sake of numerical stability, it divides this value by the number of observations.
We will minimize this function in order to find the Maximum Likelihood Estimate.
def minus_log_pdf(params):
"""
- log-likelihood of a list of parameters excepting the first weight
divided by the number of observations
"""
aux_mixture = ot.Mixture(mixture)
full_params = full(params)
try:
aux_mixture.setParameter(full_params)
except TypeError:
# case where the proposed parameters are invalid:
# return a huge value
return [ot.SpecFunc.LogMaxScalar]
res = - aux_mixture.computeLogPDF(s).computeMean()
return res
To use OpenTURNS optimization facilities, we need to turn this function into a PythonFunction object.
OT_minus_log_pdf = ot.PythonFunction(mixture.getParameter().getSize()-1, 1, minus_log_pdf)
Cobyla is usually good at likelihood optimization.
problem = ot.OptimizationProblem(OT_minus_log_pdf)
algo = ot.Cobyla(problem)
In order to decrease chances of Cobyla being stuck on a local minimum, we are going to use MultiStart. We pick a starting set of parameters and randomly change the weights. The following helper function makes it easy:
def random_weights(params, nb):
"""
List of nb Points representing mixture parameters with randomly varying weights.
"""
aux_mixture = ot.Mixture(mixture)
full_params = full(params)
aux_mixture.setParameter(full_params)
list_params = []
for num in range(nb):
dirichlet = ot.Dirichlet([1.0] * aux_mixture.getDistributionCollection().getSize()).getRealization()
dirichlet.add(1.0 - sum(dirichlet))
aux_mixture.setWeights(dirichlet)
list_params.append(aux_mixture.getParameter()[1:])
return list_params
We pick 10 starting points and increase the number of maximum evaluations of the log-likelihood from 100 (by default) to 10000.
init = mixture.getParameter()[1:]
starting_points = random_weights(init, 10)
algo_multistart = ot.MultiStart(algo, starting_points)
algo_multistart.setMaximumEvaluationNumber(10000)
Let's run the solver and retrieve the result.
algo_multistart.run()
result = algo_multistart.getResult()
All that remains is to set the mixture's parameters to the optimal value.
We must not forget to add the first weight back!
optimal_parameters = result.getOptimalPoint()
mixture.setParameter(full(optimal_parameters))
Below is an alternative.
The first step creates a new GaussianMixture class, derived from PythonDistribution. The key point is to implement the computeLogPDF method and the set/getParameters methods. Notice that this parametrization of a mixture only has one single weight w.
class GaussianMixture(ot.PythonDistribution):
def __init__(self, mu1 = -5.0, sigma1 = 1.0, \
mu2 = 5.0, sigma2 = 1.0, \
w = 0.5):
super(GaussianMixture, self).__init__(1)
if w < 0.0 or w > 1.0:
raise ValueError('The weight is not in [0, 1]. w=%s.' % (w))
self.mu1 = mu2
self.sigma1 = sigma1
self.mu2 = mu2
self.sigma2 = sigma2
self.w = w
collDist = [ot.Normal(mu1, sigma1), ot.Normal(mu2, sigma2)]
weight = [w, 1.0 - w]
self.distribution = ot.Mixture(collDist, weight)
def computeCDF(self, x):
p = self.distribution.computeCDF(x)
return p
def computePDF(self, x):
p = self.distribution.computePDF(x)
return p
def computeQuantile(self, prob, tail = False):
quantile = self.distribution.computeQuantile(prob, tail)
return quantile
def getSample(self, size):
X = self.distribution.getSample(size)
return X
def getParameter(self):
parameter = ot.Point([self.mu1, self.sigma1, \
self.mu2, self.sigma2, \
self.w])
return parameter
def setParameter(self, parameter):
[mu1, sigma1, mu2, sigma2, w] = parameter
self.__init__(mu1, sigma1, mu2, sigma2, w)
return parameter
def computeLogPDF(self, sample):
logpdf = self.distribution.computeLogPDF(sample)
return logpdf
In order to create the distribution, we use the Distribution class:
gm = ot.Distribution(GaussianMixture())
Estimating the parameters of this distribution is straightforward with MaximumLikelihoodFactory. However, we must set the bounds, because sigma cannot be negative and that w is in (0, 1).
factory = ot.MaximumLikelihoodFactory(gm)
lowerBound = [0.0, 1.e-6, 0.0, 1.e-6, 0.01]
upperBound = [0.0, 0.0, 0.0, 0.0, 0.99]
finiteLowerBound = [False, True, False, True, True]
finiteUpperBound = [False, False, False, False, True]
bounds = ot.Interval(lowerBound, upperBound, finiteLowerBound, finiteUpperBound)
factory.setOptimizationBounds(bounds)
Then we configure the optimization solver.
solver = factory.getOptimizationAlgorithm()
startingPoint = [-4.0, 1.0, 7.0, 1.5, 0.3]
solver.setStartingPoint(startingPoint)
factory.setOptimizationAlgorithm(solver)
Estimating the parameters is based on the build method.
distribution = factory.build(sample)
There are two limitations with this implementation.
First, it is not as fast as it should be, because of a limitation of the PythonDistribution (see https://github.com/openturns/openturns/issues/1391).
Estimating the parameters may be difficult, because the problem may have local optimas that cannot be retrieved with the default algorithm in MaximumLikelihoodFactory. This kind of task is generally done with the EM algorithm.

How to load all 5 batches of CIFAR10 in a single data structure as on MNIST in PyTorch?

With Mnist I have a single file with the labels and a single file for the train, so I simply do:
self.data = datasets.MNIST(root='./data', train=True, download=True)
Basically I create a set of labels (from 0-9) and save the i-th position of the image in the data structure, to create my custom tasks:
def make_tasks (self):
        self.task_to_examples = {} #task 0-9
        self.all_tasks = set (self.data.train_labels.numpy ())
        for i, digit in enumerate (self.data.train_labels.numpy ()):
            if str(digit) not in self.task_to_examples:
                self.task_to_examples[str(digit)] = []
            self.task_to_examples[str(digits)].append(i)
I don't understand how to do the same thing using CIFAR10 because it is divided into 5 batches, I would like all the data in a single structure.
If your desired structure is {"class_id": [indices of the samples]}, then for CIFAR10 you can do something like this:
import numpy as np
import torchvision
# set root accordingly
cifar = torchvision.datasets.CIFAR10(root=".", train=True, download=True)
task_to_examples = {
str(task_id): np.where(cifar.targets == task_id)[0].tolist()
for task_id in np.unique(cifar.targets)
}

How to handle large JSON file in Pytorch?

I am working on a time series problem. Different training time series data is stored in a large JSON file with the size of 30GB. In tensorflow I know how to use TF records. Is there a similar way in pytorch?
I suppose IterableDataset (docs) is what you need, because:
you probably want to traverse files without random access;
number of samples in jsons is not pre-computed.
I've made a minimal usage example with an assumption that every line of dataset file is a json itself, but you can change the logic.
import json
from torch.utils.data import DataLoader, IterableDataset
class JsonDataset(IterableDataset):
def __init__(self, files):
self.files = files
def __iter__(self):
for json_file in self.files:
with open(json_file) as f:
for sample_line in f:
sample = json.loads(sample_line)
yield sample['x'], sample['time'], ...
...
dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)
for batch in dataloader:
y = model(batch)
Generally, you do not need to change/overload the default data.Dataloader.
What you should look into is how to create a custom data.Dataset.
Once you have your own Dataset that knows how to extract item-by-item from the json file, you feed it do the "vanilla" data.Dataloader and all the batching/multi-processing etc, is done for you based on your dataset provided.
If, for example, you have a folder with several json files, each containing several examples, you can have a Dataset that looks like:
import bisect
class MyJsonsDataset(data.Dataset):
def __init__(self, jfolder):
super(MyJsonsDataset, self).__init__()
self.filenames = [] # keep track of the jfiles you need to load
self.cumulative_sizes = [0] # keep track of number of examples viewed so far
# this is not actually python code - just pseudo code for you to follow
for each jsonfile in jfolder:
self.filenames.append(jsonfile)
l = number of examples in jsonfile
self.cumulative_sizes.append(self.cumulative_sizes[-1] + l)
# discard the first element
self.cumulative_sizes.pop(0)
def __len__(self):
return self.cumulative_sizes[-1]
def __getitem__(self, idx):
# first you need to know wich of the files holds the idx example
jfile_idx = bisect.bisect_right(self.cumulative_sizes, idx)
if jfile_idx == 0:
sample_idx = idx
else:
sample_idx = idx - self.cumulative_sizes[jfile_idx - 1]
# now you need to retrieve the `sample_idx` example from self.filenames[jfile_idx]
return retrieved_example

How do I get CSV files into an Estimator in Tensorflow 1.6

I am new to tensorflow (and my first question in StackOverflow)
As a learning tool, I am trying to do something simple. (4 days later I am still confused)
I have one CSV file with 36 columns (3500 records) with 0s and 1s.
I am envisioning this file as a flattened 6x6 matrix.
I have another CSV file with 1 columnn of ground truth 0 or 1 (3500 records) which indicates if at least 4 of the 6 of elements in the 6x6 matrix's diagonal are 1's.
I am not sure I have processed the CSV files correctly.
I am confused as to how I create the features dictionary and Labels and how that fits into the DNNClassifier
I am using TensorFlow 1.6, Python 3.6
Below is the small amount of code I have so far.
import tensorflow as tf
import os
def x_map(line):
rDefaults = [[] for cl in range(36)]
x_row = tf.decode_csv(line, record_defaults=rDefaults)
return x_row
def y_map(line):
line = tf.string_to_number(line, out_type=tf.int32)
y_row = tf.one_hot(line, depth=2)
return y_row
x_path_file = os.path.join('D:', 'Diag', '6x6_train.csv')
y_path_file = os.path.join('D:', 'Diag', 'HasDiag_train.csv')
filenames = [x_path_file]
x_dataset = tf.data.TextLineDataset(filenames)
x_dataset = x_dataset.map(x_map)
x_dataset = x_dataset.batch(1)
x_iter = x_dataset.make_one_shot_iterator()
x_next_el = x_iter.get_next()
filenames = [y_path_file]
y_dataset = tf.data.TextLineDataset(filenames)
y_dataset = y_dataset.map(y_map)
y_dataset = y_dataset.batch(1)
y_iter = y_dataset.make_one_shot_iterator()
y_next_el = y_iter.get_next()
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
x_el = (sess.run(x_next_el))
y_el = (sess.run(y_next_el))
The output for x_el is:
(array([1.], dtype=float32), array([1.], dtype=float32), array([1.], dtype=float32), array([1.], dtype=float32), array([1.], dtype=float32), array([0.] ... it goes on...
The output for y_el is:
[[1. 0.]]
You're pretty much there for a minimal working model. The main issue I see is that tf.decode_csv returns a tuple of tensors, where as I expect you want a single tensor with all values. Easy fix:
x_row = tf.stack(tf.decode_csv(line, record_defaults=rDefaults))
That should work... but it fails to take advantage of many of the awesome things the tf.data.Dataset API has to offer, like shuffling, parallel threading etc. For example, if you shuffle each dataset, those shuffling operations won't be consistent. This is because you've created two separate datasets and manipulated them independently. If you create them independently, zip them together then manipulate, those manipulations will be consistent.
Try something along these lines:
def get_inputs(
count=None, shuffle=True, buffer_size=1000, batch_size=32,
num_parallel_calls=8, x_paths=[x_path_file], y_paths=[y_path_file]):
"""
Get x, y inputs.
Args:
count: number of epochs. None indicates infinite epochs.
shuffle: whether or not to shuffle the dataset
buffer_size: used in shuffle
batch_size: size of batch. See outputs below
num_parallel_calls: used in map. Note if > 1, intra-batch ordering
will be shuffled
x_paths: list of paths to x-value files.
y_paths: list of paths to y-value files.
Returns:
x: (batch_size, 6, 6) tensor
y: (batch_size, 2) tensor of 1-hot labels
"""
def x_map(line):
rDefaults = [[] for cl in range(n_dims**2)]
x_row = tf.stack(tf.decode_csv(line, record_defaults=rDefaults))
return x_row
def y_map(line):
line = tf.string_to_number(line, out_type=tf.int32)
y_row = tf.one_hot(line, depth=2)
return y_row
def xy_map(x, y):
return x_map(x), y_map(y)
x_ds = tf.data.TextLineDataset(x_paths)
y_ds = tf.data.TextLineDataset(y_paths)
combined = tf.data.Dataset.zip((x_ds, y_ds))
combined = combined.repeat(count=count)
if shuffle:
combined = combined.shuffle(buffer_size)
combined = combined.map(xy_map, num_parallel_calls=num_parallel_calls)
combined = combined.batch(batch_size)
x, y = combined.make_one_shot_iterator().get_next()
return x, y
To experiment/debug,
x, y = get_inputs()
with tf.Session() as sess:
xv, yv = sess.run((x, y))
print(xv.shape, yv.shape)
For use in an estimator, pass the function itself.
estimator.train(get_inputs, max_steps=10000)
def get_eval_inputs():
return get_inputs(
count=1, shuffle=False
x_paths=[x_eval_paths],
y_paths=[y_eval_paths])
estimator.eval(get_eval_inputs)