Why does my training loss first decrease and then increase regularly in each epoch? - deep-learning

In each epoch, my training loss first decreases (compared with the loss at the end of the previous epoch) but then increases until the end of the epoch. Below is an example:
Epoch [29][250/940] lr: 3.298e-04, top1_acc: 0.4903, top5_acc: 0.7323, loss: 2.2106, grad_norm: 1.8732
Epoch [29][500/940] lr: 3.298e-04, top1_acc: 0.4844, top5_acc: 0.7267, loss: 2.2330, grad_norm: 1.8848
Epoch [29][750/940] lr: 3.298e-04, top1_acc: 0.4850, top5_acc: 0.7247, loss: 2.2491, grad_norm: 1.8910
Epoch [30][250/940] lr: 3.055e-04, top1_acc: 0.4985, top5_acc: 0.7384, loss: 2.1676, grad_norm: 1.9173
Epoch [30][500/940] lr: 3.055e-04, top1_acc: 0.4971, top5_acc: 0.7356, loss: 2.1739, grad_norm: 1.9099
Epoch [30][750/940] lr: 3.055e-04, top1_acc: 0.4962, top5_acc: 0.7341, loss: 2.1861, grad_norm: 1.9269
Epoch [31][250/940] lr: 2.816e-04, top1_acc: 0.5119, top5_acc: 0.7454, loss: 2.1087, grad_norm: 1.9586
Epoch [31][500/940] lr: 2.816e-04, top1_acc: 0.5071, top5_acc: 0.7435, loss: 2.1329, grad_norm: 1.9674
Epoch [31][750/940] lr: 2.816e-04, top1_acc: 0.5076, top5_acc: 0.7436, loss: 2.1296, grad_norm: 1.9738
The training loss curve (image not included here) shows the same pattern.
The decrease-then-increase pattern of the training loss within each epoch is very stable, but it only appears in the later part of training, so it should not be a problem with the code.
Can anyone explain this?

Related

Train loss is decreasing, but accuracy remains the same

This is the train and development cell for a multi-label classification task using RoBERTa (BERT). The first part is training and the second part is development (validation). train_dataloader is my train dataset and dev_dataloader is the development dataset. My question is: why is the train loss decreasing step by step, but the accuracy doesn't increase much? In practice, the accuracy increases until iteration 4, but the train loss keeps decreasing until the last epoch (iteration). Is this OK, or could there be a problem?
train_loss_set = []
iterate = 4
for _ in trange(iterate, desc="Iterate"):
    model.train()
    train_loss = 0
    nu_train_examples, nu_train_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        batch_input_ids, batch_input_mask, batch_labels = batch
        optimizer.zero_grad()
        output = model(batch_input_ids, attention_mask=batch_input_mask)
        logits = output[0]
        loss_function = BCEWithLogitsLoss()
        loss = loss_function(logits.view(-1, num_labels), batch_labels.type_as(logits).view(-1, num_labels))
        train_loss_set.append(loss.item())
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        nu_train_examples += batch_input_ids.size(0)
        nu_train_steps += 1
    print("Train loss: {}".format(train_loss / nu_train_steps))
    ###############################################################################
    model.eval()
    logits_pred, true_labels, pred_labels, tokenized_texts = [], [], [], []
    # Predict
    for i, batch in enumerate(dev_dataloader):
        batch = tuple(t.to(device) for t in batch)
        batch_input_ids, batch_input_mask, batch_labels = batch
        with torch.no_grad():
            out = model(batch_input_ids, attention_mask=batch_input_mask)
            batch_logit_pred = out[0]
            pred_label = torch.sigmoid(batch_logit_pred)
        batch_logit_pred = batch_logit_pred.detach().cpu().numpy()
        pred_label = pred_label.to('cpu').numpy()
        batch_labels = batch_labels.to('cpu').numpy()
        tokenized_texts.append(batch_input_ids)
        logits_pred.append(batch_logit_pred)
        true_labels.append(batch_labels)
        pred_labels.append(pred_label)
    pred_labels = [item for sublist in pred_labels for item in sublist]
    true_labels = [item for sublist in true_labels for item in sublist]
    threshold = 0.4
    pred_bools = [pl > threshold for pl in pred_labels]
    true_bools = [tl == 1 for tl in true_labels]
    print("Accuracy is: ", jaccard_score(true_bools, pred_bools, average='samples'))
    torch.save(model.state_dict(), 'bert_model')
and the outputs:
Iterate: 0%| | 0/10 [00:00<?, ?it/s]
Train loss: 0.4024542534684801
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Jaccard is ill-defined and being set to 0.0 in samples with no true or predicted labels. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Accuracy is: 0.5806403013182674
Iterate: 10%|█ | 1/10 [03:21<30:14, 201.64s/it]
Train loss: 0.2972540049911379
Accuracy is: 0.6091337099811676
Iterate: 20%|██ | 2/10 [06:49<27:07, 203.49s/it]
Train loss: 0.26178574864264137
Accuracy is: 0.608361581920904
Iterate: 30%|███ | 3/10 [10:17<23:53, 204.78s/it]
Train loss: 0.23612180122962365
Accuracy is: 0.6096717783158462
Iterate: 40%|████ | 4/10 [13:44<20:33, 205.66s/it]
Train loss: 0.21416303515434265
Accuracy is: 0.6046892655367231
Iterate: 50%|█████ | 5/10 [17:12<17:11, 206.27s/it]
Train loss: 0.1929110718982203
Accuracy is: 0.6030885122410546
Iterate: 60%|██████ | 6/10 [20:40<13:46, 206.74s/it]
Train loss: 0.17280191068465894
Accuracy is: 0.6003766478342749
Iterate: 70%|███████ | 7/10 [24:08<10:21, 207.04s/it]
Train loss: 0.1517329115446631
Accuracy is: 0.5864783427495291
Iterate: 80%|████████ | 8/10 [27:35<06:54, 207.23s/it]
Train loss: 0.12957811209705325
Accuracy is: 0.5818832391713747
Iterate: 90%|█████████ | 9/10 [31:03<03:27, 207.39s/it]
Train loss: 0.11256680189521162
Accuracy is: 0.5796045197740114
Iterate: 100%|██████████| 10/10 [34:31<00:00, 207.14s/it]
The training loss is decreasing because your model gradually learns your training set. The evaluation accuracy reflects how well the model has learned the general features of your training set and how well it predicts unseen data. So, if the loss is decreasing, your model is learning. Perhaps it has learned information that is too specific to the training set and is, in fact, overfitting. This means that it fits the training data "too well" and is unable to make correct predictions on unseen data, because the test data may be a little different. That is why the evaluation accuracy is no longer increasing.
This could be an explanation.
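If overfitting is indeed the explanation, a common next step is early stopping: keep the checkpoint with the best validation score and stop once it no longer improves, rather than always running a fixed number of iterations. A minimal sketch (the train_one_epoch and evaluate_jaccard helpers are hypothetical wrappers around the training and validation loops from the question, not code from the original post):
best_score, patience, bad_epochs = 0.0, 2, 0
for epoch in range(iterate):
    train_one_epoch(model, train_dataloader, optimizer)   # hypothetical helper: the training loop above
    score = evaluate_jaccard(model, dev_dataloader)       # hypothetical helper: the validation loop above
    if score > best_score:
        best_score, bad_epochs = score, 0
        torch.save(model.state_dict(), 'bert_model_best')  # keep only the best-scoring checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation score stopped improving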

The output of my regression NN with LSTMs is wrong even with low val_loss

The Model
I am currently working on a stack of LSTMs and trying to solve a regression problem. The architecture of the model is as below:
comp_lstm = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(units=128),
    tf.keras.layers.Dense(units=64),
    tf.keras.layers.Dense(units=32),
    tf.keras.layers.Dense(units=1)
])
comp_lstm.compile(optimizer='adam', loss='mae')
When I train the model, it shows some good loss and val_loss figures:
Epoch 6/20
200/200 [==============================] - 463s 2s/step - loss: 1.3793 - val_loss: 1.3578
Epoch 7/20
200/200 [==============================] - 461s 2s/step - loss: 1.3791 - val_loss: 1.3602
Now I run the code below to check the output:
idx = np.random.randint(len(val_X))
sample_X, sample_y = [[val_X[idx,:]]], [[val_y[idx]]]
test = tf.data.Dataset.from_tensor_slices(([sample_X], [sample_y]))
prediction = comp_lstm.predict(test)
print(f'The actual value was {sample_y} and the model predicted {prediction}')
And the output is:
The actual value was [[21.3]] and the model predicted [[2.7479606]]
The next few times I ran it, I got these values:
The actual value was [[23.1]] and the model predicted [[0.8445232]]
The actual value was [[21.2]] and the model predicted [[2.5449793]]
The actual value was [[22.5]] and the model predicted [[1.2662419]]
I am not sure why this is working out the way that it is. The val_loss is super low, but the output is wildly different.
The Data Wrangling
The data wrangling in order to get train_X and val_X etc. is shown below:
hist2 = 128
features2 = np.array(list(map(list,[df["scaled_temp"].shift(x) for x in range(1, hist2+1)]))).T.tolist()
df_feat2 = pd.DataFrame([pd.Series(x) for x in features2], index = df.index)
df_trans2 = df.join(df_feat2).drop(columns=['scaled_temp']).iloc[hist2:]
df_trans2 = df_trans2.sample(frac=1)
target = df_trans2['T (degC)'].values
feat2 = df_trans2.drop(columns = ['T (degC)']).values
The shape of feat2 is (44435, 128), while the shape of target is (44435,)
The column df["scaled_temp"] (which has been scaled with a standard scaler) is shown below:
Date Time
2020-04-23T21:14:07.546476Z -0.377905
2020-04-23T21:17:32.406111Z -0.377905
2020-04-23T21:17:52.670373Z -0.377905
2020-04-23T21:18:55.010392Z -0.377905
2020-04-23T21:19:57.327291Z -0.377905
...
2020-06-08T09:13:06.718934Z -0.889968
2020-06-08T09:14:09.170193Z -0.889968
2020-06-08T09:15:11.634954Z -0.889968
2020-06-08T09:16:14.087139Z -0.889968
2020-06-08T09:17:16.549216Z -0.889968
Name: scaled_temp, Length: 44563, dtype: float64
The column df['T (degC)'] is shown below:
Date Time
2020-05-09T07:30:30.621001Z 24.0
2020-05-11T15:56:30.856851Z 21.3
2020-05-27T05:02:09.407266Z 28.3
2020-05-02T09:33:03.219329Z 20.5
2020-05-31T03:20:04.326902Z 22.4
...
2020-05-31T01:47:45.982819Z 23.1
2020-05-27T08:03:21.456607Z 27.2
2020-05-04T21:58:36.652251Z 20.9
2020-05-17T18:42:39.681050Z 22.5
2020-05-04T22:07:58.350329Z 21.1
Name: T (degC), Length: 44435, dtype: float64
The dataset creation process is as below:
train_X, val_X = feat2[:int(feat2.shape[0]*0.95), :], feat2[int(feat2.shape[0]*0.95):, :]
train_y, val_y = target[:int(target.shape[0]*0.95)], target[int(target.shape[0]*0.95):]
train = tf.data.Dataset.from_tensor_slices(([train_X], [train_y])).batch(BATCH_SIZE).repeat()
val = tf.data.Dataset.from_tensor_slices(([val_X], [val_y])).batch(BATCH_SIZE).repeat()
So I am not sure as to why this is happening.
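One thing worth checking is what the tf.data pipeline actually yields, because wrapping the arrays in an extra list changes the element shapes. A small self-contained sketch (stand-in arrays with hypothetical sizes, not the original data) that only prints the resulting element specs:
import numpy as np
import tensorflow as tf
X = np.random.rand(1000, 128).astype(np.float32)  # stand-in for feat2 rows
y = np.random.rand(1000).astype(np.float32)       # stand-in for target
# as in the question: the whole split is wrapped in an extra list
ds_wrapped = tf.data.Dataset.from_tensor_slices(([X], [y])).batch(32)
# without the extra list: one (row, target) pair per element
ds_per_row = tf.data.Dataset.from_tensor_slices((X, y)).batch(32)
print(ds_wrapped.element_spec)  # shapes (None, 1000, 128) and (None, 1000): one element holding the whole split
print(ds_per_row.element_spec)  # shapes (None, 128) and (None,): one element per sample
Comparing these specs with the shape of the single sample passed to comp_lstm.predict may help explain why the reported val_loss and the individual predictions disagree.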

Low validation accuracy in parallel DenseNet

I've taken the code from https://github.com/flyyufelix/cnn_finetune and remodeled it so that there are now two DenseNet-121 models in parallel, with the layers after each model's last global average pooling layer removed.
Both models were joined together like this:
print("Begin model 1")
model = densenet121_model(img_rows=img_rows, img_cols=img_cols, color_type=channel, num_classes=num_classes)
print("Begin model 2")
model2 = densenet121_nw_model(img_rows=img_rows, img_cols=img_cols, color_type=channel, num_classes=num_classes)
mergedOut = Add()([model.output,model2.output])
#mergedOut = Flatten()(mergedOut)
mergedOut = Dense(num_classes, name='cmb_fc6')(mergedOut)
mergedOut = Activation('softmax', name='cmb_prob')(mergedOut)
newModel = Model([model.input,model2.input], mergedOut)
adam = Adam(lr=1e-3, decay=1e-6, amsgrad=True)
newModel.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
# Start Fine-tuning
newModel.fit([X_train, X_train], Y_train,
             batch_size=batch_size,
             nb_epoch=nb_epoch,
             shuffle=True,
             verbose=1,
             validation_data=([X_valid, X_valid], Y_valid))
The first model has its layers frozen, and the one in parallel is supposed to learn additional features on top of the first model to improve accuracy.
However, even at 100 epochs, the training accuracy is almost 100% while the validation accuracy floats around 9%.
I'm not quite sure what the reason could be or how to fix it, considering I've already changed the optimizer from SGD (same concept: two DenseNets with the first pre-trained on ImageNet and the second starting without weights; same results) to Adam (two DenseNets, both pre-trained on ImageNet).
Epoch 101/1000
1000/1000 [==============================] - 1678s 2s/step - loss: 0.0550 - acc: 0.9820 - val_loss: 12.9906 - val_acc: 0.0900
Epoch 102/1000
1000/1000 [==============================] - 1703s 2s/step - loss: 0.0567 - acc: 0.9880 - val_loss: 12.9804 - val_acc: 0.1100
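For reference, the freezing of the first model mentioned above is not shown in the snippet; a hypothetical sketch of how it is typically done in Keras, reusing the variable names from the code above (an assumption, not the original code):
# hypothetical sketch: freeze every layer of the first (pre-trained) DenseNet branch
for layer in model.layers:
    layer.trainable = False
# leave the second branch trainable so it can learn additional features
for layer in model2.layers:
    layer.trainable = True
# trainability changes only take effect after the merged model is (re)compiled
newModel.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])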

Why is my accuracy high from the beginning of training?

I am training a neural network to recognize some attributes of .png pictures. What I get when I start training is something like this, and the accuracy keeps increasing until the end of the epoch:
32/4817 [..............................] - ETA: 167s - loss: 0.6756 - acc: 0.5
64/4817 [..............................] - ETA: 152s - loss: 0.6214 - acc: 0.7
96/4817 [..............................] - ETA: 145s - loss: 0.6169 - acc: 0.7
128/4817 [.............................] - ETA: 142s - loss: 0.5972 - acc: 0.7
160/4817 [.............................] - ETA: 140s - loss: 0.5734 - acc: 0.7
192/4817 [>............................] - ETA: 138s - loss: 0.5604 - acc: 0.7
224/4817 [>............................] - ETA: 137s - loss: 0.5427 - acc: 0.7
256/4817 [>............................] - ETA: 135s - loss: 0.5160 - acc: 0.7
288/4817 [>............................] - ETA: 134s - loss: 0.5492 - acc: 0.7
320/4817 [>............................] - ETA: 133s - loss: 0.5574 - acc: 0.7
352/4817 [=>...........................] - ETA: 131s - loss: 0.5559 - acc: 0.7
384/4817 [=>...........................] - ETA: 129s - loss: 0.5550 - acc: 0.7
416/4817 [=>...........................] - ETA: 128s - loss: 0.5504 - acc: 0.7
448/4817 [=>...........................] - ETA: 127s - loss: 0.5417 - acc: 0.7
480/4817 [=>...........................] - ETA: 126s - loss: 0.5425 - acc: 0.7
My question is: why is the starting accuracy so high? I would expect it to start at something around 0.1 and then increase during learning.
Also, at the end I get:
('Test loss:', 0.42451223436727564)
('Test accuracy:', 0.82572614112830256)
Is that test loss too big?
This is my network:
input_shape = x_train[0].shape
print(input_shape)
model = Sequential()
stoplearn = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0,
                                          patience=0, verbose=0, mode='auto')
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=20,
          verbose=1,
          validation_data=(x_test, y_test),
          callbacks=[stoplearn])
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
It is written in Python using Keras.
You classify your data into two classes (since your output layer is of size 2), so accuracy of 0.5 is not high. In fact, it means that your network behaves randomly, which is what you expect at the beginning. Regarding the loss, there is no absolute answer for that. Your test accuracy seems not bad, and you could try to play with some of the parameters (for example, taking smaller size for the fully connected layer) to see whether you can improve it.
You have two classes. Random choice will lead to 50% accuracy. This is what you get in the beginning. Hence your result is expected.
The reason why it jumps directly to 70% accuracy could be that your problem is simple.
If you want to double-check it, you could
use other classifiers,
check how many examples are used to calculate accuracy,
serialize the trained classifier and manually feed it with new examples and check their results
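Related to the 50% random-choice baseline above, a quick class-balance check makes the comparison concrete; a minimal sketch, assuming y_train is the one-hot NumPy label array passed to model.fit above:
class_counts = y_train.sum(axis=0)                    # samples per class from the one-hot labels
majority_baseline = class_counts.max() / class_counts.sum()
print("class counts:", class_counts)
print("majority-class accuracy baseline:", majority_baseline)
If one class dominates, an accuracy well above 0.5 is reachable before the network has learned anything useful, which is worth ruling out.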

Keras: batch training for multiple large datasets

This question regards the common problem of training on multiple large files in Keras which are jointly too large to fit in GPU memory.
I am using Keras 1.0.5 and I would like a solution that does not require 1.0.6.
One way to do this was described by fchollet here and here:
# Create generator that yields (current features X, current labels y)
def BatchGenerator(files):
    for file in files:
        current_data = pickle.load(open(file, "rb"))
        X_train = current_data[:, :-1]
        y_train = current_data[:, -1]
        yield (X_train, y_train)

# train model on each dataset
for epoch in range(n_epochs):
    for (X_train, y_train) in BatchGenerator(files):
        model.fit(X_train, y_train, batch_size=32, nb_epoch=1)
However, I fear that the state of the model is not saved; rather, the model seems to be reinitialized not only between epochs but also between datasets. Each "Epoch 1/1" below represents training on a different dataset:
~~~~~ Epoch 0 ~~~~~~
Epoch 1/1
295806/295806 [==============================] - 13s - loss: 15.7517
Epoch 1/1
407890/407890 [==============================] - 19s - loss: 15.8036
Epoch 1/1
383188/383188 [==============================] - 19s - loss: 15.8130
~~~~~ Epoch 1 ~~~~~~
Epoch 1/1
295806/295806 [==============================] - 14s - loss: 15.7517
Epoch 1/1
407890/407890 [==============================] - 20s - loss: 15.8036
Epoch 1/1
383188/383188 [==============================] - 15s - loss: 15.8130
I am aware that one can use model.fit_generator, but as the method above was repeatedly suggested as a way of batch training, I would like to know what I am doing wrong.
Thanks for your help,
Max
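One direct way to check the reinitialization fear is to call fit twice on the same small arrays and compare the reported losses; a minimal sketch (not from the original post, assuming X_train and y_train hold one of the loaded datasets): if the model were reinitialized by each call, both calls would report a similarly high loss, whereas a persistent model should improve on the second call.
X_small, y_small = X_train[:1000], y_train[:1000]
h1 = model.fit(X_small, y_small, batch_size=32, nb_epoch=1)
h2 = model.fit(X_small, y_small, batch_size=32, nb_epoch=1)
print("first pass loss: ", h1.history['loss'])
print("second pass loss:", h2.history['loss'])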
It has been a while since I faced that problem, but I remember that I used Keras's functionality to provide data through Python generators, i.e. model = Sequential(); model.fit_generator(...).
An example code snippet (should be self-explanatory):
def generate_batches(files, batch_size):
    counter = 0
    while True:
        fname = files[counter]
        print(fname)
        counter = (counter + 1) % len(files)
        data_bundle = pickle.load(open(fname, "rb"))
        X_train = data_bundle[0].astype(np.float32)
        y_train = data_bundle[1].astype(np.float32)
        y_train = y_train.flatten()
        for cbatch in range(0, X_train.shape[0], batch_size):
            yield (X_train[cbatch:(cbatch + batch_size), :, :], y_train[cbatch:(cbatch + batch_size)])

model = Sequential()
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
train_files = [train_bundle_loc + "bundle_" + cb.__str__() for cb in range(nb_train_bundles)]
gen = generate_batches(files=train_files, batch_size=batch_size)
history = model.fit_generator(gen, samples_per_epoch=samples_per_epoch, nb_epoch=num_epoch, verbose=1, class_weight=class_weights)