I've trained a caffemodel with GoogLeNet. During testing I get very high accuracy:
I0122 06:00:54.384351 2039975936 solver.cpp:409] Test net output #0: loss1/loss1 = 0.433825 (* 0.3 = 0.130148 loss)
I0122 06:00:54.385201 2039975936 solver.cpp:409] Test net output #1: loss1/top-1 = 0.8764
I0122 06:00:54.385234 2039975936 solver.cpp:409] Test net output #2: loss1/top-5 = 0.969
I0122 06:00:54.385243 2039975936 solver.cpp:409] Test net output #3: loss2/loss1 = 0.327197 (* 0.3 = 0.0981591 loss)
I0122 06:00:54.385251 2039975936 solver.cpp:409] Test net output #4: loss2/top-1 = 0.8918
I0122 06:00:54.385256 2039975936 solver.cpp:409] Test net output #5: loss2/top-5 = 0.984601
I0122 06:00:54.385262 2039975936 solver.cpp:409] Test net output #6: loss3/loss3 = 0.304042 (* 1 = 0.304042 loss)
I0122 06:00:54.385268 2039975936 solver.cpp:409] Test net output #7: loss3/top-1 = 0.9228
I0122 06:00:54.385273 2039975936 solver.cpp:409] Test net output #8: loss3/top-5 = 0.9768
Now I have a Python classifier which looks like this:
import caffe
import numpy as np

net = caffe.Classifier(MODEL_FILE, PRETRAINED,
                       mean=np.load('train_image_mean.npy').mean(1).mean(1),
                       channel_swap=(2, 1, 0),
                       raw_scale=255,
                       image_dims=(256, 256))
I ran the classifier over all my validation data. The accuracy is very high, but I get "nan" probability values for some input images. What's the reason for that? What does "nan" mean? Is it "I did not recognize any class"?
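For reference, here is a minimal sketch of how such NaN outputs can be located, assuming the classifier object above (assigned to net) and a hypothetical list of validation image paths called image_files:

import numpy as np
import caffe

# net is the caffe.Classifier constructed above;
# image_files is a hypothetical list of validation image paths.
for path in image_files:
    image = caffe.io.load_image(path)
    probs = net.predict([image])[0]    # class-probability vector for this image
    if np.isnan(probs).any():
        print("NaN output for", path)  # inspect these inputs separately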
Edit:
This question is not a duplicate, since it refers to classification and not to training.
Thank you.
This is the training and development cell for a multi-label classification task using RoBERTa (a BERT variant). The first part is training and the second part is development (validation). train_dataloader is my training dataset and dev_dataloader is my development dataset. My question is: why is the training loss decreasing step by step while the accuracy doesn't increase much? In practice, the accuracy increases up to iteration 4, but the training loss keeps decreasing until the last epoch (iteration). Is this OK, or is there a problem?
from tqdm import trange
import torch
from torch.nn import BCEWithLogitsLoss
from sklearn.metrics import jaccard_score

# model, optimizer, device, num_labels, train_dataloader and dev_dataloader
# are defined earlier (not shown here).

train_loss_set = []
iterate = 4

for _ in trange(iterate, desc="Iterate"):

    # ---------------------------- Training ----------------------------
    model.train()
    train_loss = 0
    nu_train_examples, nu_train_steps = 0, 0

    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        batch_input_ids, batch_input_mask, batch_labels = batch

        optimizer.zero_grad()
        output = model(batch_input_ids, attention_mask=batch_input_mask)
        logits = output[0]

        loss_function = BCEWithLogitsLoss()
        loss = loss_function(logits.view(-1, num_labels),
                             batch_labels.type_as(logits).view(-1, num_labels))
        train_loss_set.append(loss.item())

        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        nu_train_examples += batch_input_ids.size(0)
        nu_train_steps += 1

    print("Train loss: {}".format(train_loss / nu_train_steps))

    # ---------------------- Development (validation) ------------------
    model.eval()
    logits_pred, true_labels, pred_labels, tokenized_texts = [], [], [], []

    # Predict
    for i, batch in enumerate(dev_dataloader):
        batch = tuple(t.to(device) for t in batch)
        batch_input_ids, batch_input_mask, batch_labels = batch

        with torch.no_grad():
            out = model(batch_input_ids, attention_mask=batch_input_mask)
            batch_logit_pred = out[0]
            pred_label = torch.sigmoid(batch_logit_pred)

            batch_logit_pred = batch_logit_pred.detach().cpu().numpy()
            pred_label = pred_label.to('cpu').numpy()
            batch_labels = batch_labels.to('cpu').numpy()

        tokenized_texts.append(batch_input_ids)
        logits_pred.append(batch_logit_pred)
        true_labels.append(batch_labels)
        pred_labels.append(pred_label)

    # Flatten the per-batch lists
    pred_labels = [item for sublist in pred_labels for item in sublist]
    true_labels = [item for sublist in true_labels for item in sublist]

    # Binarize predictions and compute the sample-averaged Jaccard score
    threshold = 0.4
    pred_bools = [pl > threshold for pl in pred_labels]
    true_bools = [tl == 1 for tl in true_labels]
    print("Accuracy is: ", jaccard_score(true_bools, pred_bools, average='samples'))

torch.save(model.state_dict(), 'bert_model')
and the outputs:
Iterate: 0%| | 0/10 [00:00<?, ?it/s]
Train loss: 0.4024542534684801
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Jaccard is ill-defined and being set to 0.0 in samples with no true or predicted labels. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Accuracy is: 0.5806403013182674
Iterate: 10%|█ | 1/10 [03:21<30:14, 201.64s/it]
Train loss: 0.2972540049911379
Accuracy is: 0.6091337099811676
Iterate: 20%|██ | 2/10 [06:49<27:07, 203.49s/it]
Train loss: 0.26178574864264137
Accuracy is: 0.608361581920904
Iterate: 30%|███ | 3/10 [10:17<23:53, 204.78s/it]
Train loss: 0.23612180122962365
Accuracy is: 0.6096717783158462
Iterate: 40%|████ | 4/10 [13:44<20:33, 205.66s/it]
Train loss: 0.21416303515434265
Accuracy is: 0.6046892655367231
Iterate: 50%|█████ | 5/10 [17:12<17:11, 206.27s/it]
Train loss: 0.1929110718982203
Accuracy is: 0.6030885122410546
Iterate: 60%|██████ | 6/10 [20:40<13:46, 206.74s/it]
Train loss: 0.17280191068465894
Accuracy is: 0.6003766478342749
Iterate: 70%|███████ | 7/10 [24:08<10:21, 207.04s/it]
Train loss: 0.1517329115446631
Accuracy is: 0.5864783427495291
Iterate: 80%|████████ | 8/10 [27:35<06:54, 207.23s/it]
Train loss: 0.12957811209705325
Accuracy is: 0.5818832391713747
Iterate: 90%|█████████ | 9/10 [31:03<03:27, 207.39s/it]
Train loss: 0.11256680189521162
Accuracy is: 0.5796045197740114
Iterate: 100%|██████████| 10/10 [34:31<00:00, 207.14s/it]
The training loss is decreasing because your model gradually learns your training set. The evaluation accuracy reflects how well the model has learned the general features of your training set and, therefore, how well it predicts unseen data. So, if the loss is decreasing, your model is still learning. However, it may have learned information that is too specific to the training set, which means it is overfitting: it fits the training data "too well" and is unable to make correct predictions on unseen data, because the evaluation data can be slightly different. That is why the evaluation accuracy is no longer increasing.
This could be an explanation.
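If overfitting is indeed the cause, one common mitigation is to keep the weights from the iteration with the best development score instead of the last one. A minimal sketch, assuming the model and the training/evaluation code from the question; run_one_iteration is a hypothetical callable that wraps one training pass plus the dev evaluation and returns the Jaccard score printed as "Accuracy is: ...":

import copy
import torch

def train_with_best_checkpoint(model, num_iterations, run_one_iteration):
    # run_one_iteration is a hypothetical wrapper around the training + dev
    # code from the question; it must return the dev Jaccard score.
    best_score = float('-inf')
    best_state = None
    for _ in range(num_iterations):
        dev_score = run_one_iteration()
        if dev_score > best_score:
            best_score = dev_score
            best_state = copy.deepcopy(model.state_dict())
    # Restore and save the best-performing weights, not the last (possibly overfit) ones.
    model.load_state_dict(best_state)
    torch.save(model.state_dict(), 'bert_model')
    return best_score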
I'm using a DNN to fit these data, and I use softmax to classify them into 2 classes; each sample has a dimensionality of 4040. Can someone with experience tell me what's wrong with my net?
It is strange that my initial loss is 7.6 and my initial error is 0.5524, and basically they don't change any more.
for train, test in kfold.split(data_pro, valence_labels):
    model = keras.Sequential()
    model.add(keras.layers.Dense(5000, activation='relu', input_shape=(4040,)))
    model.add(keras.layers.Dropout(rate=0.25))
    model.add(keras.layers.Dense(500, activation='relu'))
    model.add(keras.layers.Dropout(rate=0.5))
    model.add(keras.layers.Dense(1000, activation='relu'))
    model.add(keras.layers.Dropout(rate=0.5))
    model.add(keras.layers.Dense(2, activation='softmax'))
    model.add(keras.layers.Dropout(rate=0.5))

    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.0001, rho=0.9),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    print('------------------------------------------------------------------------')
    print(f'Training for fold {fold_no} ...')

    log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

    # Fit data to model
    history = model.fit(data_pro[train], valence_labels[train],
                        batch_size=128,
                        epochs=50,
                        verbose=1,
                        callbacks=[tensorboard_callback])

    # Generate generalization metrics
    scores = model.evaluate(data_pro[test], valence_labels[test], verbose=0)
    print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
    acc_per_fold.append(scores[1] * 100)
    loss_per_fold.append(scores[0])

    # Increase fold number
    fold_no = fold_no + 1

# == Provide average scores ==
print('------------------------------------------------------------------------')
print('Score per fold')
for i in range(0, len(acc_per_fold)):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}%')
print('------------------------------------------------------------------------')
print('Average scores for all folds:')
print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})')
print(f'> Loss: {np.mean(loss_per_fold)}')
print('------------------------------------------------------------------------')
You shouldn't add Dropout after the final Dense layer; delete the model.add(keras.layers.Dropout(rate=0.5)) that follows it.
I also think your code may raise an error, because your labels have dimension 1 while your final Dense layer has 2 units. Change model.add(keras.layers.Dense(2, activation='softmax')) to model.add(keras.layers.Dense(1, activation='sigmoid')).
Read this to learn TensorFlow.
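Putting both suggestions together, the model definition would look roughly like this (a sketch only; the layer sizes are kept from the question):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(5000, activation='relu', input_shape=(4040,)),
    keras.layers.Dropout(rate=0.25),
    keras.layers.Dense(500, activation='relu'),
    keras.layers.Dropout(rate=0.5),
    keras.layers.Dense(1000, activation='relu'),
    keras.layers.Dropout(rate=0.5),
    keras.layers.Dense(1, activation='sigmoid'),  # single sigmoid unit, no Dropout after it
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.0001, rho=0.9),
              loss='binary_crossentropy',
              metrics=['accuracy'])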
Update 1:
Change
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.00001, momentum=0.9, nesterov=True),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])
to
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])
And change
accAll = []
for epoch in range(1, 50):
    model.fit(train_data, train_labels,
              batch_size=50, epochs=5,
              validation_data=(val_data, val_labels))
    val_loss, val_Accuracy = model.evaluate(val_data, val_labels, batch_size=1)
    accAll.append(val_Accuracy)
to
accAll = model.fit(
    train_data, train_labels,
    batch_size=50, epochs=20,
    validation_data=(val_data, val_labels)
)
# accAll is now a Keras History object; accAll.history['val_accuracy']
# holds the per-epoch validation accuracy.
I am training a model and I got this plot. It is trained on audio (about 70K clips of around 5-10 seconds each) and no augmentation is being done. I have tried the following to avoid overfitting:
Reduce the complexity of the model by reducing the number of GRU cells and hidden dimensions.
Add dropout in each layer (see the sketch after this list).
I have tried with a larger dataset.
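For reference, dropout between stacked recurrent layers can be expressed directly in PyTorch. This is only a generic sketch, since the model definition from the question is not shown, and all sizes here are placeholders:

import torch.nn as nn

# Hypothetical sizes; the actual model from the question is not shown.
acoustic_encoder = nn.GRU(
    input_size=128,    # feature dimension per time step
    hidden_size=256,   # reduced hidden dimension
    num_layers=3,      # fewer stacked GRU layers
    dropout=0.3,       # dropout applied between the stacked GRU layers
    batch_first=True,
)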
What I am not sure about is whether my calculation of the training loss and validation loss is correct. It looks something like this; I am using drop_last=True and the CTC loss criterion.
train_data_len = len(train_loader.dataset)
valid_data_len = len(valid_loader.dataset)
epoch_train_loss = 0
epoch_val_loss = 0
train_losses = []
valid_losses = []

model.train()
for e in range(n_epochs):
    t0 = time.time()
    # batch loop
    running_loss = 0.0
    for batch_idx, _data in enumerate(train_loader, 1):
        # Calculate output ...
        # bla bla
        loss = criterion(output, labels.float(), input_lengths, label_lengths)
        loss.backward()
        optimizer.step()
        scheduler.step()
        # loss stats
        running_loss += loss.item() * specs.size(0)
    t_t = time.time() - t0

    ######################
    # validate the model #
    ######################
    with torch.no_grad():
        model.eval()
        tv = time.time()
        running_val_loss = 0.0
        for batch_idx_v, _data in enumerate(valid_loader, 1):
            # bla, bla
            val_loss = criterion(output, labels.float(), input_lengths, label_lengths)
            running_val_loss += val_loss.item() * specs.size(0)
    print("Epoch {}: Training took {:.2f} [s]\tValidation took: {:.2f} [s]\n".format(e+1, t_t, time.time() - tv))

    epoch_train_loss = running_loss / train_data_len
    epoch_val_loss = running_val_loss / valid_data_len
    train_losses.append(epoch_train_loss)
    valid_losses.append(epoch_val_loss)

    print('Epoch: {} Losses\tTraining Loss: {:.6f}\tValidation Loss: {:.6f}'.format(
        e+1, epoch_train_loss, epoch_val_loss))
    model.train()
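For comparison, here is a sketch of an averaging that does not depend on drop_last: it divides by the number of samples actually iterated over instead of len(train_loader.dataset), which also counts the samples discarded from the final incomplete batch. The variable names follow the loop above (loss and specs come from the elided forward pass), and n_train_seen is a new counter:

running_loss = 0.0
n_train_seen = 0
for batch_idx, _data in enumerate(train_loader, 1):
    # ... forward pass, CTC loss, backward and optimizer/scheduler steps as above ...
    running_loss += loss.item() * specs.size(0)
    n_train_seen += specs.size(0)

epoch_train_loss = running_loss / n_train_seen  # instead of dividing by train_data_len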
The Model
I am currently working on a stack of LSTMs and trying to solve a regression problem. The architecture of the model is as below:
comp_lstm = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(units=128),
    tf.keras.layers.Dense(units=64),
    tf.keras.layers.Dense(units=32),
    tf.keras.layers.Dense(units=1)
])

comp_lstm.compile(optimizer='adam', loss='mae')
When I train the model, it shows some good loss and val_loss figures:
Epoch 6/20
200/200 [==============================] - 463s 2s/step - loss: 1.3793 - val_loss: 1.3578
Epoch 7/20
200/200 [==============================] - 461s 2s/step - loss: 1.3791 - val_loss: 1.3602
Now I run the code below to check the output:
idx = np.random.randint(len(val_X))
sample_X, sample_y = [[val_X[idx,:]]], [[val_y[idx]]]
test = tf.data.Dataset.from_tensor_slices(([sample_X], [sample_y]))
prediction = comp_lstm.predict(test)
print(f'The actual value was {sample_y} and the model predicted {prediction}')
And the output is:
The actual value was [[21.3]] and the model predicted [[2.7479606]]
The next few times I ran it, I got these values:
The actual value was [[23.1]] and the model predicted [[0.8445232]]
The actual value was [[21.2]] and the model predicted [[2.5449793]]
The actual value was [[22.5]] and the model predicted [[1.2662419]]
I am not sure why this is working out the way that it is. The val_loss is super low, but the output is wildly different.
The Data Wrangling
The data wrangling in order to get train_X and val_X etc. is shown below:
hist2 = 128
features2 = np.array(list(map(list,[df["scaled_temp"].shift(x) for x in range(1, hist2+1)]))).T.tolist()
df_feat2 = pd.DataFrame([pd.Series(x) for x in features2], index = df.index)
df_trans2 = df.join(df_feat2).drop(columns=['scaled_temp']).iloc[hist2:]
df_trans2 = df_trans2.sample(frac=1)
target = df_trans2['T (degC)'].values
feat2 = df_trans2.drop(columns = ['T (degC)']).values
The shape of feat2 is (44435, 128), while the shape of target is (44435,)
The column df["scaled_temp"] (which has been scaled with a standard scaler) is shown below:
Date Time
2020-04-23T21:14:07.546476Z -0.377905
2020-04-23T21:17:32.406111Z -0.377905
2020-04-23T21:17:52.670373Z -0.377905
2020-04-23T21:18:55.010392Z -0.377905
2020-04-23T21:19:57.327291Z -0.377905
...
2020-06-08T09:13:06.718934Z -0.889968
2020-06-08T09:14:09.170193Z -0.889968
2020-06-08T09:15:11.634954Z -0.889968
2020-06-08T09:16:14.087139Z -0.889968
2020-06-08T09:17:16.549216Z -0.889968
Name: scaled_temp, Length: 44563, dtype: float64
The column df['T (degC)'] is shown below:
Date Time
2020-05-09T07:30:30.621001Z 24.0
2020-05-11T15:56:30.856851Z 21.3
2020-05-27T05:02:09.407266Z 28.3
2020-05-02T09:33:03.219329Z 20.5
2020-05-31T03:20:04.326902Z 22.4
...
2020-05-31T01:47:45.982819Z 23.1
2020-05-27T08:03:21.456607Z 27.2
2020-05-04T21:58:36.652251Z 20.9
2020-05-17T18:42:39.681050Z 22.5
2020-05-04T22:07:58.350329Z 21.1
Name: T (degC), Length: 44435, dtype: float64
The dataset creation process is as below:
train_X, val_X = feat2[:int(feat2.shape[0]*0.95), :], feat2[int(feat2.shape[0]*0.95):, :]
train_y, val_y = target[:int(target.shape[0]*0.95)], target[int(target.shape[0]*0.95):]
train = tf.data.Dataset.from_tensor_slices(([train_X], [train_y])).batch(BATCH_SIZE).repeat()
val = tf.data.Dataset.from_tensor_slices(([val_X], [val_y])).batch(BATCH_SIZE).repeat()
So I am not sure as to why this is happening.
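A quick way to sanity-check what these pipelines actually feed the model is to inspect their element specs and the shape of one concrete batch. This is a small sketch, reusing the train and val datasets defined above:

# Inspect the structure and static shapes the datasets advertise.
print(train.element_spec)
print(val.element_spec)

# Pull a single batch and look at the concrete tensor shapes.
for x_batch, y_batch in val.take(1):
    print(x_batch.shape, y_batch.shape)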
My current Caffe output looks like this:
Iteration 1000, Testing net (#0)
Test net output #0: accuracy_1 = 0.337018
Test net output #1: accuracy_2 = 0.3397
Test net output #2: accuracy_3 = 0.360761
Test net output #3: loss_1 = 2.08132 (* 1 = 2.08132 loss)
Test net output #4: loss_2 = 2.03755 (* 1 = 2.03755 loss)
Test net output #5: loss_3 = 1.91984 (* 1 = 1.91984 loss)
Iteration 1000, loss = 3.87841
Train net output #0: loss_1 = 1.26657 (* 1 = 1.26657 loss)
Train net output #1: loss_2 = 1.40096 (* 1 = 1.40096 loss)
Train net output #2: loss_3 = 1.21088 (* 1 = 1.21088 loss)
The training iteration prints out the overall weighted loss (i.e. "loss = 3.87841"), while the testing iteration simply says "Testing net (#0)". How do I get the testing iteration to also print out the overall weighted loss? Thank you!
I don't think that is your weighted loss; it is the training loss smoothed (averaged) over the last average_loss iterations.
You may want to check the average_loss field here:
https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp#L190-L221
Share your loss and accuracy layers to receive a more detailed answer.