Training loss is NaN using image segmentation on TPU using TFRecords - deep-learning

I am a beginner trying to work with TPUs using TensorFlow in Kaggle kernels. I previously trained a U-Net model on a GPU, and now I am trying to run it on a TPU. I wrote the dataset images and masks into TFRecords, and the TFRecord pipeline returns an image and a mask. When I train on the TPU, the loss is always NaN, even though the accuracy metric looks normal. Since this is the same model and loss I used on the GPU, I am guessing the problem is in the TFRecords or in the dataset loading.
The code for loading the data is given below:
def decode_image(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float32) / 255.0   # convert image to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE, 3])  # explicit size needed for TPU
    return image

def decode_image_mask(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float64) / 255.0   # convert image to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE, 3])  # explicit size needed for TPU
    image = tf.image.rgb_to_grayscale(image)
    image = tf.math.round(image)                 # round to a binary {0, 1} mask
    return image
def read_tfrecord(example):
    TFREC_FORMAT = {
        "image": tf.io.FixedLenFeature([], tf.string),  # tf.string means bytestring
        "mask": tf.io.FixedLenFeature([], tf.string),   # shape [] means single element
    }
    example = tf.io.parse_single_example(example, TFREC_FORMAT)
    image = decode_image(example['image'])
    mask = decode_image_mask(example['mask'])
    return image, mask
def load_dataset(filenames, ordered=False):
    # Read from TFRecords. For optimal performance, read from multiple files at once and
    # disregard data order. Order does not matter since we will be shuffling the data anyway.
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False  # disable order, increase speed
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)  # automatically interleaves reads from multiple files
    dataset = dataset.with_options(ignore_order)  # uses data as soon as it streams in, rather than in its original order
    dataset = dataset.map(read_tfrecord, num_parallel_calls=AUTO)
    return dataset
def get_training_dataset():
    dataset = load_dataset(TRAINING_FILENAMES)
    dataset = dataset.repeat()  # the training dataset must repeat for several epochs
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    dataset = dataset.prefetch(AUTO)  # prefetch next batch while training (autotune prefetch buffer size)
    return dataset

def get_validation_dataset(ordered=False):
    dataset = load_dataset(VALIDATION_FILENAMES, ordered=ordered)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.cache()
    dataset = dataset.prefetch(AUTO)  # prefetch next batch while training (autotune prefetch buffer size)
    return dataset
def count_data_items(filenames):
    # the number of data items is written in the name of the .tfrec files,
    # e.g. flowers00-230.tfrec = 230 data items
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)
So, what am I doing wrong?

It turns out the problem was that I was unbatching the data and re-batching it to 20 so I could properly view the images and masks in matplotlib, and this was corrupting how data was being fed to the model, hence the NaN loss. Making a separate copy of the dataset and using that copy for visualization, while sending the original to training, solved the problem.
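For reference, a minimal sketch of that fix (the variable names, the STEPS_PER_EPOCH/EPOCHS constants, and the 20-image plotting batch are just for illustration):
# A separate copy of the pipeline, used only for plotting; re-batching this
# copy leaves the dataset object passed to model.fit() untouched.
viz_dataset = get_training_dataset().unbatch().batch(20)
images, masks = next(iter(viz_dataset))

# The original, untouched pipeline goes to training.
model.fit(get_training_dataset(), steps_per_epoch=STEPS_PER_EPOCH, epochs=EPOCHS)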

Related

How can I extract labels from results in YOLOv5?

Is there any way to extract a detected label, like person or cat or dog, that is printed by the results.print() function? I want the detected labels to be saved in an array for later use. I am using the YOLOv5 model here.
import cv2
import numpy as np

# 'model' is assumed to be a YOLOv5 model loaded earlier (e.g. via torch.hub)
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    # Make detections
    results = model(frame)
    results.print()
    # Show the boxes and predictions
    cv2.imshow('YOLO', np.squeeze(results.render()))
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
The printed output of results.print() was like this:
image 1/1: 480x640 1 person
Speed: 7.0ms pre-process, 80.6ms inference, 3.5ms NMS per image at shape (1, 3, 480, 640)
From this output, I want to extract the person label and store it in an array.
This might not be the optimal solution, but here's an approach that I used for a personal project:
lst = []
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    # Make detections
    results = model(frame)
    cv2.imshow('YOLO', np.squeeze(results.render()))
    df = results.pandas().xyxy[0]
    for i in df['name']:  # the 'name' column holds the labels
        lst.append(i)
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
I used results.pandas().xyxy[0] to get the results as a DataFrame and then appended the labels to a list.
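For reference, a hedged sketch of what that DataFrame holds (the column names follow the YOLOv5 repo; the example labels are made up):
df = results.pandas().xyxy[0]
# Typical columns: xmin, ymin, xmax, ymax, confidence, class, name
print(df[['name', 'confidence']])
labels = df['name'].tolist()  # e.g. ['person', 'dog']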
Assuming you use YOLOv5 with PyTorch, please see this link. It details how to interpret the results as JSON objects and also explains the structure.
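As a quick sketch of that JSON route, the same DataFrame can be serialized with pandas' standard to_json method:
import json

# Each record becomes a dict like {'xmin': ..., 'confidence': ..., 'name': 'person'}
records = json.loads(results.pandas().xyxy[0].to_json(orient='records'))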

Modify Random Forest regression to predict multiple values in the future using past data

I am using Random Forest regression on power-vs-time data from an experiment that ran for a certain duration. Using that data, I want to predict the trend of power in the future, with time as the input. The code that has been implemented is below.
# Loading the excel dataset
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Cleaned total data.xlsx',
                   header=None, names=["active_power", "current", "voltage"],
                   usecols="A:C", skiprows=[i for i in range(1)])
df = df.dropna()
The dataset consists of approximately 30 hours of power-vs-time values.
Next, a Random Forest regressor is fitted on the training data. The R2 score achieved on the test data is 0.87.
# Creating X and y
X = np.array(series[['time_h']]).reshape(-1, 1)
y = np.array(series['active_power'])

# Splitting dataset into training and testing
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.15, random_state=1)

# Creating Random Forest model and fitting it on training data
forest = RandomForestRegressor(n_estimators=128, criterion='mse', random_state=1, n_jobs=-1)
forest_fit = forest.fit(X_train2, y_train2)

# Saving the model and checking the R2 score on test data
filename = 'random_forest.sav'
joblib.dump(forest, filename)
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test2, y_test2)
print(result)
For future prediction, an array of time values spanning 400 hours has been created to use as input to the model, since power needs to be predicted for that duration.
# Creating a time array for future which will be used as input for future predictions
future_time2 = np.arange(len(series)*15)
future_time2 = future_time2*0.25/360
columns = ['time_hour']
dataframe = pd.DataFrame(data = future_time2, columns = columns)
future_times = dataframe[41006:].to_numpy()
future_times
When predictions are made for the future, the model outputs only a constant value over the entire 400-hour duration. The prediction code is below.
# Predicting power for future
future_pred = loaded_model.predict(future_times)
future_pred
Could someone please suggest why the model predicts the same value for the entire duration, and how to modify the code so that I get a trend of predictions with reasonable values rather than a single value?
Thank you.
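One well-known property of tree ensembles that would explain this behavior: Random Forests cannot extrapolate beyond the range of their training inputs, so every time value past the training window lands in the same leaves and receives the same prediction. A minimal sketch demonstrating this (assuming scikit-learn; the data here is synthetic):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic rising trend over "times" 0..30.
X_train = np.arange(0, 30, 0.1).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + np.random.normal(0, 1, len(X_train))

forest = RandomForestRegressor(n_estimators=128, random_state=1)
forest.fit(X_train, y_train)

# Inputs far outside the training range all fall into the same leaves,
# so the predictions collapse to a single constant value.
X_future = np.arange(100, 500, 50).reshape(-1, 1)
print(forest.predict(X_future))  # (nearly) identical values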

Tensor shape mismatch error in PyTorch on MNIST dataset, but no error on synthetic data

I am trying to implement a deep learning paper (https://github.com/kiankd/corel2019) and I am getting a weird error when supplying real data (MNIST) to it, but no error when using the same synthetic data as the authors used.
The error happens in this function:
def get_armask(shape, labels, device=None):
    mask = torch.zeros(shape).to(device)
    arr = torch.arange(0, shape[0]).long().to(device)
    mask[arr, labels] = -1.
    return mask
More specifically, this line:
mask[arr, labels] = -1.
The error is:
RuntimeError: The shape of the mask [500] at index 0 does not match the shape of the indexed tensor [500, 10] at index 1
The weird thing is that if I use the synthetic data, there is no error and it works perfectly. If I print out the shapes, I get the following (both with synthetic data and with MNIST):
mask torch.Size([500, 10])
arr torch.Size([500])
labels torch.Size([500])
The code used to generate the synthetic data is the following:
X_data = (torch.rand(N_samples, D_input) * 10.).to(device)
labels = torch.LongTensor([i % N_classes for i in range(N_samples)]).to(device)
While the code to load MNIST is this:
train_images = mnist.train_images()
X_data_all = train_images.reshape((train_images.shape[0], train_images.shape[1] * train_images.shape[2]))
X_data = torch.tensor(X_data_all[:500,:]).to(device)
X_data = X_data.type(torch.FloatTensor)
labels = torch.tensor(mnist.train_labels()[:500]).to(device)
get_armask is used in the following way:
def forward(self, predictions, labels):
    mask = get_armask(predictions.shape, labels, device=self.device)
    # make the attractor and repulsor, mask them!
    attraction_tensor = mask * predictions
    repulsion_tensor = (mask + 1) * predictions
    # now, apply the special cosine-COREL rules, taking the argmax and squaring the repulsion
    repulsion_tensor, _ = repulsion_tensor.max(dim=1)
    repulsion_tensor = repulsion_tensor ** 2
    return arloss(attraction_tensor, repulsion_tensor, self.lam)
The actual error seems to be different from what the error message says, but I have no idea where to look. I tried a few things, like changing the learning rate and normalizing the MNIST data to be in more or less the same range as the synthetic data, but nothing seems to work.
Any suggestions? Thanks a lot in advance!
After exchanging some emails with the author of the paper, we figured out the problem: the labels were of type Byte instead of Long, which caused the error. The error message is very misleading; the actual problem has nothing to do with the sizes.
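Based on that description, the fix presumably amounts to a one-line cast when loading the MNIST labels (a sketch, not verified against the paper's code):
# Cast the uint8 (Byte) labels to int64 (Long) so they are valid index tensors.
labels = torch.tensor(mnist.train_labels()[:500]).long().to(device)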

How to calculate MobileNet FLOPs in Keras

run_meta = tf.RunMetadata()
with tf.Session(graph=tf.Graph()) as sess:
    K.set_session(sess)
    with tf.device('/cpu:0'):
        base_model = MobileNet(alpha=1, weights=None,
                               input_tensor=tf.placeholder('float32', shape=(1, 224, 224, 3)))
    opts = tf.profiler.ProfileOptionBuilder.float_operation()
    flops = tf.profiler.profile(sess.graph, run_meta=run_meta, cmd='op', options=opts)
    opts = tf.profiler.ProfileOptionBuilder.trainable_variables_parameter()
    params = tf.profiler.profile(sess.graph, run_meta=run_meta, cmd='op', options=opts)
    print("{:,} --- {:,}".format(flops.total_float_ops, params.total_parameters))
When I run the above code, I get the result below:
1,137,481,704 --- 4,253,864
This is different from the FLOPs described in the paper.
MobileNet: https://arxiv.org/pdf/1704.04861.pdf
ShuffleNet: https://arxiv.org/pdf/1707.01083.pdf
How can I calculate the exact FLOPs described in the paper?
tl;dr You've actually got the right answer! You are simply comparing FLOPs with multiply-accumulates (from the paper) and therefore need to divide by two.
If you're using Keras, then the code you listed is slightly over-complicating things.
Let model be any compiled Keras model. We can compute the FLOPs of the model with the following code.
import tensorflow as tf
import keras.backend as K

def get_flops():
    run_meta = tf.RunMetadata()
    opts = tf.profiler.ProfileOptionBuilder.float_operation()
    # We use the Keras session graph in the call to the profiler.
    flops = tf.profiler.profile(graph=K.get_session().graph,
                                run_meta=run_meta, cmd='op', options=opts)
    return flops.total_float_ops  # the "flops" of the model

# .... Define your model here ....
# You need to have compiled your model before calling this.
print(get_flops())
However, when I look at my own example (not MobileNet) that I ran on my computer, the printed total_float_ops was 2115, and I got the following when I simply printed the flops variable:
[...]
Mul 1.06k float_ops (100.00%, 49.98%)
Add 1.06k float_ops (50.02%, 49.93%)
Sub 2 float_ops (0.09%, 0.09%)
It's pretty clear that the total_float_ops property takes multiplication, addition, and subtraction into account.
I then looked back at the MobileNet example and, after a brief look through the paper, identified which model in the paper's table corresponds to the default Keras implementation, based on the number of parameters: the first model in the table matches your parameter count (4,253,864), and its Mult-Adds are approximately half of your FLOPs result. So you have the correct answer; you were just mistaking FLOPs for Mult-Adds (aka multiply-accumulates or MACs).
If you want to compute the number of MACs, you simply divide the result of the above code by two.
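As a quick sanity check on the numbers in the question (569M is the Mult-Adds figure the MobileNet paper reports for the full-width 224 model):
total_float_ops = 1137481704   # profiler output from the question
mult_adds = total_float_ops / 2
print(mult_adds)               # ~568.7M, closely matching the paper's 569M Mult-Adds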
Important Notes
Keep the following in mind if you are trying to run the code sample:
The code sample was written in 2018 and doesn't work with TensorFlow version 2. See driedler's answer below for a complete example that is compatible with TensorFlow 2.
The code sample was originally meant to be run once on a compiled model. For a better example of using it in a way that has no side effects (and can therefore be run multiple times on the same model), see ch271828n's answer below.
This is working for me in TF-2.1:
import tensorflow as tf

def get_flops(model_h5_path):
    session = tf.compat.v1.Session()
    graph = tf.compat.v1.get_default_graph()
    with graph.as_default():
        with session.as_default():
            model = tf.keras.models.load_model(model_h5_path)
            run_meta = tf.compat.v1.RunMetadata()
            opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
            # Optional: save printed results to file
            # flops_log_path = os.path.join(tempfile.gettempdir(), 'tf_flops_log.txt')
            # opts['output'] = 'file:outfile={}'.format(flops_log_path)
            # We use the Keras session graph in the call to the profiler.
            flops = tf.compat.v1.profiler.profile(graph=graph,
                                                  run_meta=run_meta, cmd='op', options=opts)
    return flops.total_float_ops
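Hypothetical usage, where 'my_model.h5' is a placeholder path to a saved Keras model:
print(get_flops('my_model.h5'))  # 'my_model.h5' is a placeholder path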
The above solutions cannot be run twice, otherwise the FLOPs will accumulate! (In other words, the second time you run it, you will get output = flops_of_1st_call + flops_of_2nd_call.) The following code calls reset_default_graph to avoid this.
import tensorflow as tf
import keras

def get_flops():
    session = tf.compat.v1.Session()
    graph = tf.compat.v1.get_default_graph()
    with graph.as_default():
        with session.as_default():
            model = keras.applications.mobilenet.MobileNet(
                alpha=1, weights=None,
                input_tensor=tf.compat.v1.placeholder('float32', shape=(1, 224, 224, 3)))
            run_meta = tf.compat.v1.RunMetadata()
            opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
            # Optional: save printed results to file
            # flops_log_path = os.path.join(tempfile.gettempdir(), 'tf_flops_log.txt')
            # opts['output'] = 'file:outfile={}'.format(flops_log_path)
            # We use the Keras session graph in the call to the profiler.
            flops = tf.compat.v1.profiler.profile(graph=graph,
                                                  run_meta=run_meta, cmd='op', options=opts)
    # Reset the graph so repeated calls don't accumulate FLOPs.
    tf.compat.v1.reset_default_graph()
    return flops.total_float_ops
Modified from driedler's answer, thanks!
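A quick way to confirm the reset works (the exact number depends on your TF build):
print(get_flops())  # first call
print(get_flops())  # second call returns the same number, no accumulation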
You can also call model.summary() on any Keras model, but note that it reports the number of parameters, not FLOPs.

Defining a seed value in BrainScript for CNTK sequential machine learning models

This is with respect to CNTK BrainScript. I went through [1] to figure out whether there is an option to specify the random seed value, but I couldn't find one. (There is an option to set the random seed parameter through the ParameterTensor() function, but with that approach I might have to initialize all the LSTM weights explicitly and separately, defining separate weights for the input gate, forget gate, etc., instead of using the model sequence below.) Is there any other option for setting the random seed value that preserves the following RNN layer sequence?
nn_Train = {
    action = train
    BrainScriptNetworkBuilder = {
        model = Sequential (
            RecurrentLSTMLayer {$stateDim$, usePeepholes = true} :
            DenseLayer {$labelDim$, bias = false}
        )
        z = model (inputs)

        inputs = Input($inputDim$)  # features
        labels = Input($labelDim$)

        # loss and metric
        ce = SquareError(labels, z)

        # node assignment
        featureNodes = (inputs)
        labelNodes = (labels)
        criterionNodes = (ce)
        evaluationNodes = (ce)
        outputNodes = (z)
    }
}
[1] https://github.com/microsoft/cntk/wiki/Parameters-And-Constants#random-initialization
Unfortunately, there isn't a global random seed option for parameters. However, you can modify the cntk.core.bs file next to cntk.exe, where all the layers are defined, to support a random seed for the layers you want.