I am trying to train a DNN to fit a random function (i.e., targets drawn from a normal distribution), but for now the network doesn't learn anything and the loss is stuck. Is it even possible, or am I just wasting my time?
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense
import numpy as np
import matplotlib.pyplot as plt
n_hidden_units = 25
num_lay = 10
learning_rate = 0.01
batch_size = 1000
epochs = 1000
save_freq_epoches = 500000
num_of_xs = 2
inputs_train = np.random.randn(batch_size*10,num_of_xs)*1
outputs_train = np.random.randn(batch_size*10,1)#np.sum(inputs_train,axis=1)#
inputs_train = tf.convert_to_tensor(inputs_train)
outputs_train = tf.convert_to_tensor((outputs_train-outputs_train.min())/(outputs_train.max()-outputs_train.min()))
kernel_init = keras.initializers.RandomUniform(-0.25, 0.25)
inputs = Input(num_of_xs)
x = Dense(n_hidden_units, kernel_initializer=kernel_init, activation='relu', )(inputs)
for _ in range(num_lay):
    x = Dense(n_hidden_units, kernel_initializer=kernel_init, activation='relu')(x)
outputs = Dense(1, kernel_initializer=kernel_init, activation='linear')(x)
model = Model(inputs=inputs, outputs=outputs)
optimizer1 = keras.optimizers.Adam(beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0,
                                   amsgrad=True, learning_rate=learning_rate)
model.compile(loss='mse', optimizer=optimizer1, metrics=None)
model.fit(inputs_train, outputs_train, batch_size=batch_size, epochs=epochs, shuffle=False,
          verbose=2)
plt.plot(outputs_train,'ob')
plt.plot(model(inputs_train),'*r')
plt.show()
For now the predictions (in red) are nowhere near the target labels (in blue).
If you are using a validation split, you can't. Otherwise you can, but it will be hard, since good pipelines include regularization techniques that try to prevent exactly this from happening.
Your target distribution is given by
np.random.randn(batch_size*10,1)
Then normalized to:
(outputs_train-outputs_train.min())/(outputs_train.max()-outputs_train.min())
As you can see, your targets are completely independent from your variable x! So, if you have to predict the value (y) for a previously unseen value (x), there is literally nothing you can do better than simply predicting the mean value for y.
In other words, your target distribution is a flat line y = avg + noise.
Your question is then: can the network predict this extra noise? Well, no, that's why we call it noise, because it is the random deviations from the pattern that are completely unrelated to the input info that we feed the network.
BUT.
If you do NOT use validation (that is, you are interested in the prediction error with respect to the {x, y} pairs that you see during training) then the network will learn noise, up to its full prediction capacity (the more complex the network, the more it can adapt to complex noise). This is precisely what we call overfitting, and it is a BAD thing!
Normally we want models to predict something like "y = x * 2 + 3", whereas learning noise is more like learning a dictionary of unrelated predictions: "{x1: 2.93432, x2: -0.00324, ...}"
Because overfitting is bad (it is bad because it makes predictions for unseen validation data worse, which means our models are worse in new data), pipelines have built-in techniques to fight the natural tendency of neural networks to do this. Such techniques include data augmentation (common in images), early stopping, dropout, and so on.
If you REALLY need to overfit to your data, you will need to deactivate any such techniques, and train for as long as you can (which is normally not something we want to do!).
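To make the two regimes concrete, here is a minimal NumPy sketch, assuming targets that are independent of the inputs. The sample sizes, the seed, and the nearest-neighbour "memorizer" (a stand-in for a fully overfitted network) are illustrative choices of mine, not part of the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs and targets are drawn independently, as in the question.
x_train = rng.standard_normal((1000, 2))
y_train = rng.standard_normal(1000)
x_test = rng.standard_normal((1000, 2))
y_test = rng.standard_normal(1000)


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


# Predictor 1: ignore x and always output the training mean.
mean_pred = y_train.mean()
mse_mean_train = mse(mean_pred, y_train)
mse_mean_test = mse(mean_pred, y_test)


# Predictor 2: a "dictionary" that returns the target of the nearest
# training point (hypothetical stand-in for an overfitted network).
def memorizer(x_query):
    sq_dist = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[sq_dist.argmin(axis=1)]


mse_memo_train = mse(memorizer(x_train), y_train)
mse_memo_test = mse(memorizer(x_test), y_test)

print("mean predictor  train/test MSE:", mse_mean_train, mse_mean_test)
print("memorizer       train/test MSE:", mse_memo_train, mse_memo_test)
```

The memorizer reproduces the training targets exactly, yet on unseen points it does worse than the constant mean prediction, which is the overfitting behaviour described above.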
First, I want to thank anyone who considers reading this question, and I apologize if my question is naive and for my poor English.
So currently I'm working on a recommendation system problem, and my approach is matrix factorization with implicit feedback using BPR (arXiv:1205.2618). I discovered that training my model (BPRMF) with a large batch size (4096 in this case) resulted in a worse BPR loss than training with a smaller batch size (1024). [Training log for a few epochs]
I noted that a larger batch size gives faster training, since it utilizes the GPU more efficiently, but the higher loss is something I'm not willing to trade for it. As far as I know, a large batch gives the gradient descent step more information to take a better step, so it should help with convergence, and the usual problems with large batch sizes are memory and resources, not the loss.
I have done some research on this and saw that large batch training can result in poor generalization (and here is another reference), but in my case the loss is already worse during training.
My best guess is that with a large batch size, taking the mean of the loss scales the gradient flowing to each user and item embedding by a (1 / batch size) coefficient, making it hard to escape local minima while training. Is that the answer in this case? (However, I saw that recent studies have shown that local minima are not necessarily bad, so ...)
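The 1 / batch size effect in this guess can be checked on paper. Below is a small NumPy sketch of the gradient of the mean-reduced BPR term with respect to one gathered user row, under the assumption that the row appears only once in the batch. The function name, embedding size, and random data are hypothetical; this is not the model code from the question.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bpr_grad_wrt_user_rows(u, i_pos, i_neg):
    """Gradient of -mean(log sigmoid(u.i_pos - u.i_neg)) w.r.t. each row of u."""
    diff = i_pos - i_neg
    x = np.einsum("ij,ij->i", u, diff)
    batch_size = u.shape[0]
    # d/dx [-log sigmoid(x)] = -sigmoid(-x); the mean adds a 1/batch_size factor.
    return (-sigmoid(-x) / batch_size)[:, None] * diff


rng = np.random.default_rng(0)
dim = 8  # illustrative embedding size
u = 0.1 * rng.standard_normal((4096, dim))
i_pos = 0.1 * rng.standard_normal((4096, dim))
i_neg = 0.1 * rng.standard_normal((4096, dim))

# The same first triple, once inside a batch of 4096 and once inside a batch of 1024.
grad_large = bpr_grad_wrt_user_rows(u, i_pos, i_neg)
grad_small = bpr_grad_wrt_user_rows(u[:1024], i_pos[:1024], i_neg[:1024])

ratio = np.linalg.norm(grad_small[0]) / np.linalg.norm(grad_large[0])
print("gradient norm ratio (batch 1024 vs 4096):", ratio)
```

Under these assumptions the per-row gradient of the mean-reduced term shrinks by the batch-size ratio, while the gradient of a sum-reduced L2 term on that row (2 * l2 * u) does not depend on the batch size at all. Whether this explains the observed loss also depends on the optimizer and learning rate, which the question does not state; adaptive optimizers such as Adam largely normalize the gradient scale.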
I would really appreciate any help understanding why a large batch size ends up with these anomalous results.
Side note: this might be another stupid question, but as you can see in the code below, the l2 loss is not normalized by batch size, so I expected it to at least double or quadruple when I multiply the batch size by 4, but that does not seem to be the case in the log above.
Here is my code
from typing import Tuple
import torch
from torch.nn.parameter import Parameter
import torch.nn.functional as F
from .PretrainedModel import PretrainedModel
class BPRMFModel(PretrainedModel):
    def __init__(self, n_users: int, n_items: int, u_embed: int, l2: float,
                 dataset: str, u_i_pretrained_dir, use_pretrained=0, **kwargs) -> None:
        super().__init__(n_users=n_users, n_items=n_items, u_embed=u_embed, dataset=dataset,
                         u_i_pretrained_dir=u_i_pretrained_dir, use_pretrained=use_pretrained,
                         **kwargs)
        self.l2 = l2
        self.reset_parameters()
        self.items_e = Parameter(self._items_e)
        self.users_e = Parameter(self._users_e)
    def forward(self, u: torch.Tensor, i: torch.Tensor) -> torch.Tensor:
        u = F.embedding(u, self.users_e)
        i = F.embedding(i, self.items_e)
        return torch.matmul(u, i.T)
    def CF_loss(self, u: torch.Tensor, i_pos: torch.Tensor, i_neg: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # u, i_pos, i_neg shape is [batch_size,]
        u = F.embedding(u, self.users_e)
        i_pos = F.embedding(i_pos, self.items_e)
        i_neg = F.embedding(i_neg, self.items_e)
        pos_scores = torch.einsum("ij,ij->i", u, i_pos)
        neg_scores = torch.einsum("ij,ij->i", u, i_neg)
        # loss = torch.mean(
        #     F.softplus(-(pos_scores - neg_scores))
        # )
        loss = torch.neg(
            torch.mean(
                F.logsigmoid(pos_scores - neg_scores)
            )
        )
        l2_loss = (
            u.pow(2).sum() +
            i_pos.pow(2).sum() +
            i_neg.pow(2).sum()
        )
        return loss, self.l2 * l2_loss
    def get_users_rating_for_each_items(self, u: torch.Tensor, i: torch.Tensor) -> torch.Tensor:
        return self(u, i)
    def save_pretrained(self):
        self._items_e = self.items_e.data
        self._users_e = self.users_e.data
        return super().save_pretrained()
PretrainedModel is just a base class that helps me save and load the model weights.
I really appreciate anyone who has borne with me to the end.
I'm interested in fine-tuning a Mask-RCNN model that I'm using for instance segmentation. Currently I have trained the model for 6 epochs and the various Mask-RCNN losses are as follows:
The reason I'm stopping is that the COCO evaluation metrics seem to have dipped in the last epoch:
I know this is a far reaching question, but I'm looking to gain some intuition of how to understand which parameters are going to be the most impactful in improving the evaluation metrics. I understand there are three places to consider:
Should I be looking at batch size, learning rate, and momentum? I currently use an SGD optimizer with a learning rate of 1e-4 and a batch size of 2.
Should I be looking at using more training data or adding augmentation? I don't currently use any augmentation, and my dataset is already pretty large (40K images).
Should I be looking at the specific MaskRCNN parameters?
I think I'll likely be asked to be more specific about what I want to improve, so let me say that I would like to improve the recall of the individual masks. The model is performing well but doesn't quite capture the full extent of what I would like it to. I'm also leaving out details of the specific learning problem, as I'd like to gain intuition for how to approach this in general.
A couple of notes:
6 epochs are far too few for the network to converge, even with a pre-trained network, especially one as big as resnet50. I think you need at least 50 epochs. With a pre-trained resnet18 I started to get good results after 30 epochs, resnet34 needed 10-20 more, and your resnet50 with a 40k-image training set definitely needs more than 6 epochs;
definitely use a pre-trained network;
in my experience, I failed to get the results I wanted with SGD. I switched to AdamW with a ReduceLROnPlateau scheduler. The network converges quite fast at first, reaching 50-60% AP by epoch 7 or 8, but it climbs to 80-85% only after 50-60 epochs, with very small improvements from epoch to epoch, and only if the LR is small enough. You are probably familiar with the notion of gradient descent. I think of it this way: the more augmentation you have, the more your "hill" is covered with "boulders" that you have to get around, and that is only possible if you control the LR. Additionally, AdamW helps with overfitting.
This is how I do it. For networks with higher input resolution (your input images are scaled on input by the net itself), I use higher LR.
init_lr = 0.00005
weight_decay = init_lr * 100
optimizer = torch.optim.AdamW(params, lr=init_lr, weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, verbose=True, patience=3, factor=0.75)
for epoch in range(epochs):
    # train for one epoch, printing every 10 iterations
    metric_logger = train_one_epoch(model, optimizer, train_loader, scaler, device,
                                    epoch, print_freq=10)
    scheduler.step(metric_logger.loss.global_avg)
    optimizer.param_groups[0]["weight_decay"] = optimizer.param_groups[0]["lr"] * 100
    # scheduler.step()
    # evaluate on the test dataset
    evaluate(model, test_loader, device=device)
print("[INFO] serializing model to '{}' ...".format(args["model"]))
save_and_print_size_of_model(model, args["model"], script=False)
Find an LR and weight decay such that training drives the LR down to a very small value, like 1/10 of your initial LR, by the end of training. If you hit a plateau too often, the scheduler quickly brings the LR to very small values and the network learns nothing for the rest of the epochs.
Your plots indicate that your LR is too high at some point in training: the network stops improving and then AP goes down. You need constant improvements, even small ones. The longer the network trains, the more subtle the details it learns about your domain and the smaller the learning rate needs to be. IMHO, a constant LR will not allow that.
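To get a feel for how quickly a plateau scheduler can exhaust the LR, here is a simplified pure-Python re-implementation of the "reduce on plateau" rule with the same factor=0.75 and patience=3 as above. It is only a sketch of the idea, not PyTorch's code: it ignores the threshold, cooldown, and min_lr options, and the function name is hypothetical.

```python
def simulate_plateau_lr(losses, init_lr, factor=0.75, patience=3):
    """Return the LR in effect after each epoch for a given loss sequence."""
    lr = init_lr
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Reduce only once the loss has failed to improve for more than `patience` epochs.
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


init_lr = 0.00005

# Worst case: the loss never improves after the first epoch.
flat_history = simulate_plateau_lr([1.0] * 40, init_lr)

# Best case: the loss improves every epoch, so the LR is never touched.
improving_history = simulate_plateau_lr([1.0 / (e + 1) for e in range(40)], init_lr)

print("flat loss, final LR / initial LR:", flat_history[-1] / init_lr)
print("improving loss, final LR / initial LR:", improving_history[-1] / init_lr)
```

With a completely flat loss the LR drops below 1/10 of its initial value within 40 epochs, which is the "learns nothing for the rest of the epochs" failure mode; a real run should sit somewhere between the two extremes.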
anchor generator settings. Here is how I initialize the network.
def get_maskrcnn_resnet_model(name, num_classes, pretrained, res='normal'):
    print('Using maskrcnn with {} backbone...'.format(name))
    backbone = resnet_fpn_backbone(name, pretrained=pretrained, trainable_layers=5)
    sizes = ((4,), (8,), (16,), (32,), (64,))
    aspect_ratios = ((0.25, 0.5, 1.0, 2.0, 4.0),) * len(sizes)
    anchor_generator = AnchorGenerator(
        sizes=sizes, aspect_ratios=aspect_ratios
    )
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'],
                                                    output_size=7, sampling_ratio=2)
    default_min_size = 800
    default_max_size = 1333
    if res == 'low':
        min_size = int(default_min_size / 1.25)
        max_size = int(default_max_size / 1.25)
    elif res == 'normal':
        min_size = default_min_size
        max_size = default_max_size
    elif res == 'high':
        min_size = int(default_min_size * 1.25)
        max_size = int(default_max_size * 1.25)
    else:
        raise ValueError('Invalid res={} param'.format(res))
    model = MaskRCNN(backbone, min_size=min_size, max_size=max_size, num_classes=num_classes,
                     rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)
    model.roi_heads.detections_per_img = 512
    return model
I need to find small objects, which is why I use these anchor parameters.
class imbalance. If you have only your object and background, there is no problem. If you have more classes, make sure that your split (e.g., 80% for training and 20% for testing) is applied more or less evenly to every class used in your particular training.
Good luck!
Simple and short question. I have a network (Unet) which performs image segmentation. I want the logits as the output to feed into the cross entropy loss (using pytorch). Currently my final layer looks as so:
class Logits(nn.Sequential):
    def __init__(self,
                 in_channels,
                 n_class
                 ):
        super(Logits, self).__init__()
        # fully connected layer outputting the prediction layers for each of my classes
        self.conv = self.add_module('conv_out',
                                    nn.Conv2d(in_channels,
                                              n_class,
                                              kernel_size = 1
                                              )
                                    )
        self.activ = self.add_module('sigmoid_out',
                                     nn.Sigmoid()
                                     )
Is it correct to use the sigmoid activation function here? Does this give me logits?
When people talk about "logits" they usually refer to the "raw" n_class-dimensional output vector. For multi-class classification (n_class > 2) you want to convert the n_class-dimensional vector of raw "logits" into a n_class-dim probability vector.
That is, you want prob = f(logits) with prob_i >= 0 for all n_class entries, and that sum(prob)=1.
The most straightforward way of doing that in a differentiable way is to use the Softmax function:
prob_i = softmax(logits)_i = exp(logits_i) / sum_j exp(logits_j)
It is easy to see that the output of softmax is indeed a n_class-dim probability vector (I leave it to you as a short exercise).
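For readers who prefer to check the exercise numerically, here is a small NumPy sketch of the softmax formula above. The example logits are arbitrary, and subtracting the maximum before exponentiating is a standard stability trick that does not change the result.

```python
import numpy as np


def softmax(logits):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


logits = np.array([2.0, -1.0, 0.5])  # arbitrary raw scores for n_class = 3
prob = softmax(logits)

print("probabilities:", prob)
print("sum:", prob.sum())

# Up to a shared additive constant, the logits are the log-probabilities.
log_prob_gap = np.log(prob) - np.log(prob)[0]
logit_gap = logits - logits[0]
```

The last two lines show that differences between logits equal differences between log-probabilities, so the logits determine the probabilities up to a common shift.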
BTW, this is why the raw predictions are called "logits" because they are kind of "log" of the output predicted probabilities.
Now, it is customary not to explicitly compute the softmax on top of a classification network and defer its computation to the loss function, e.g. nn.CrossEntropyLoss that internally computes the softmax and requires the raw logits as inputs, rather than the normalized probabilities. This is done mainly for numerical stability.
Therefore, if you are training a multi-class classification network with nn.CrossEntropyLoss you do not need to worry at all about the final activation and simply output the raw logits from your final conv/linear layer.
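The numerical stability argument can be seen in a few lines of NumPy. This is a sketch of the log-sum-exp trick for a single sample, written for illustration; it is not PyTorch's implementation, and the function names are my own.

```python
import numpy as np


def cross_entropy_from_logits(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = np.max(logits)
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))
    return log_sum_exp - logits[target]


def cross_entropy_naive(logits, target):
    """Same quantity, but via explicit probabilities (overflows for large logits)."""
    prob = np.exp(logits) / np.sum(np.exp(logits))
    return -np.log(prob[target])


small = np.array([2.0, -1.0, 0.5])
large = np.array([1000.0, 0.0, -1000.0])

stable_small = cross_entropy_from_logits(small, 0)
naive_small = cross_entropy_naive(small, 0)

stable_large = cross_entropy_from_logits(large, 0)
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = cross_entropy_naive(large, 0)  # exp(1000) overflows

print("small logits:", stable_small, naive_small)
print("large logits:", stable_large, naive_large)
```

For moderate logits both versions agree; for very large logits only the version that works directly on the logits stays finite, which is one reason the loss takes raw logits as input.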
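Most importantly, do not put an nn.Sigmoid() activation in front of nn.CrossEntropyLoss: the loss expects raw logits, and squashing them into (0, 1) first limits how confident the internal softmax can become and tends to saturate the gradients, which will mess up your training.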
As far as I understand, you are working on a multi-label classification task where a single input can have several labels, hence your usage of nn.Sigmoid (vs nn.Softmax for multi-class classification).
There is a loss function that combines nn.Sigmoid and nn.BCELoss: nn.BCEWithLogitsLoss. Its input is a vector of logits whose length is the number of classes, and the target has the same shape: a multi-hot encoding, with 1s for the active classes.
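As an illustration of what "sigmoid plus binary cross entropy on logits" computes, here is a NumPy sketch using the usual numerically stable rewrite. It is written for illustration and is not PyTorch's source; the example logits and the multi-hot target are made up.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross entropy computed directly from logits.

    Uses the identity  max(x, 0) - x * t + log(1 + exp(-|x|)),
    which equals -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))].
    """
    loss = np.maximum(logits, 0.0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(np.mean(loss))


def bce_naive(logits, targets):
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(np.mean(-(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))))


# One sample with four classes; classes 0 and 2 are active (multi-hot target).
logits = np.array([1.5, -0.3, 0.8, -2.0])
targets = np.array([1.0, 0.0, 1.0, 0.0])

loss_stable = bce_with_logits(logits, targets)
loss_naive = bce_naive(logits, targets)

# Extreme but correct predictions: the stable form stays finite.
loss_extreme = bce_with_logits(np.array([1000.0, -1000.0]), np.array([1.0, 0.0]))

print(loss_stable, loss_naive, loss_extreme)
```

Note that the logits and the targets have the same shape, as described above, and that each class is scored independently of the others.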
When using a Keras LSTM to predict on time series data I've been getting errors when I'm trying to train the model using a batch size of 50, while then trying to predict on the same model using a batch size of 1 (ie just predicting the next value).
Why am I not able to train and fit the model with multiple batches at once, and then use that model to predict with anything other than the same batch size? It doesn't seem to make sense, but I could easily be missing something here.
Edit: this is the model. batch_size is 50, sl is sequence length, which is set at 20 currently.
model = Sequential()
model.add(LSTM(1, batch_input_shape=(batch_size, 1, sl), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, verbose=2)
here is the line for predicting on the training set for RMSE
# make predictions
trainPredict = model.predict(trainX, batch_size=batch_size)
here is the actual prediction of unseen time steps
for i in range(test_len):
    print('Prediction %s: ' % str(pred_count))
    next_pred_res = np.reshape(next_pred, (next_pred.shape[1], 1, next_pred.shape[0]))
    # make predictions
    forecastPredict = model.predict(next_pred_res, batch_size=1)
    forecastPredictInv = scaler.inverse_transform(forecastPredict)
    forecasts.append(forecastPredictInv)
    next_pred = next_pred[1:]
    next_pred = np.concatenate([next_pred, forecastPredict])
    pred_count += 1
The issue is with the line:
forecastPredict = model.predict(next_pred_res, batch_size=batch_size)
The error when batch_size here is set to 1 is:
ValueError: Cannot feed value of shape (1, 1, 2) for Tensor 'lstm_1_input:0', which has shape '(10, 1, 2)' which is the same error that throws when batch_size here is set to 50 like the other batch sizes as well.
The total error is:
forecastPredict = model.predict(next_pred_res, batch_size=1)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/keras/models.py", line 899, in predict
return self.model.predict(x, batch_size=batch_size, verbose=verbose)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/keras/engine/training.py", line 1573, in predict
batch_size=batch_size, verbose=verbose)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/keras/engine/training.py", line 1203, in _predict_loop
batch_outs = f(ins_batch)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2103, in __call__
feed_dict=feed_dict)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 944, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 1, 2) for Tensor 'lstm_1_input:0', which has shape '(10, 1, 2)'
Edit: Once I set the model to stateful=False then I am able to use different batch sizes for fitting/training and prediction. What is the reason for this?
Unfortunately, what you want to do is impossible with Keras. I also struggled with this problem for a long time, and the only way is to go down the rabbit hole and work with Tensorflow directly to do LSTM rolling prediction.
First, to be clear on terminology, batch_size usually means the number of sequences that are trained together, and num_steps means how many time steps are trained together. When you say batch_size=1 and "just predicting the next value", I think you mean predicting with num_steps=1.
Otherwise, it should be possible to train and predict with batch_size=50, meaning you are training on 50 sequences and making 50 predictions at every time step, one for each sequence (meaning training/prediction num_steps=1).
However, I think what you mean is that you want to use a stateful LSTM to train with num_steps=50 and do prediction with num_steps=1. Theoretically this makes sense and should be possible, and it is possible with Tensorflow, just not with Keras.
The problem: Keras requires an explicit batch size for stateful RNN. You must specify batch_input_shape (batch_size, num_steps, features).
The reason: Keras must allocate a fixed-size hidden state vector in the computation graph with shape (batch_size, num_units) in order to persist the values between training batches. On the other hand, when stateful=False, the hidden state vector can be initialized dynamically with zeroes at the beginning of each batch so it does not need to be a fixed size. More details here: http://philipperemy.github.io/keras-stateful-lstm/
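The reason above can be illustrated without Keras. The toy NumPy cell below (a hypothetical class written for this illustration, not Keras code) keeps its hidden state in a buffer of shape (batch_size, num_units) between calls, so an input with a different batch size no longer lines up with the stored state.

```python
import numpy as np


class TinyStatefulRNN:
    """Toy tanh RNN cell that keeps its hidden state between calls."""

    def __init__(self, batch_size, n_features, n_units, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = 0.1 * rng.standard_normal((n_features, n_units))
        self.w_rec = 0.1 * rng.standard_normal((n_units, n_units))
        # The persistent state is what pins the batch size.
        self.state = np.zeros((batch_size, n_units))

    def step(self, x):
        if x.shape[0] != self.state.shape[0]:
            raise ValueError(
                f"got a batch of {x.shape[0]}, but the stored state has "
                f"{self.state.shape[0]} rows"
            )
        self.state = np.tanh(x @ self.w_in + self.state @ self.w_rec)
        return self.state


rnn = TinyStatefulRNN(batch_size=50, n_features=20, n_units=4)
rnn.step(np.ones((50, 20)))  # fine: matches the stored state

try:
    rnn.step(np.ones((1, 20)))  # a single sequence does not fit the 50-row state
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(mismatch_error)
```

A stateless cell would instead create a fresh zero state sized to whatever batch arrives, which is why stateful=False accepts any batch size.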
Possible work around: Train and predict with num_steps=1. Example: https://github.com/keras-team/keras/blob/master/examples/lstm_stateful.py. This might or might not work at all for your problem as the gradient for back propagation will be computed on only one time step. See: https://github.com/fchollet/keras/issues/3669
My solution: use Tensorflow: In Tensorflow you can train with batch_size=50, num_steps=100, then do predictions with batch_size=1, num_steps=1. This is possible by creating a different model graph for training and prediction sharing the same RNN weight matrices. See this example for next-character prediction: https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/model.py#L11 and blog post http://karpathy.github.io/2015/05/21/rnn-effectiveness/. Note that one graph can still only work with one specified batch_size, but you can setup multiple model graphs sharing weights in Tensorflow.
Sadly what you wish for is impossible because you specify the batch_size when you define the model...
However, I found a simple way around this problem: create 2 models! The first is used for training and the second for predictions, and have them share weights:
train_model = Sequential([Input(batch_input_shape=(batch_size,...),
<continue specifying your model>])
predict_model = Sequential([Input(batch_input_shape=(1,...),
<continue specifying exact same model>])
train_model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
predict_model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
Now you can use any batch size you want. After you fit your train_model, just save its weights and load them into the predict_model:
train_model.save_weights('lstm_model.h5')
predict_model.load_weights('lstm_model.h5')
Notice that you only want to save and load the weights, not the whole model (which includes the architecture, optimizer, etc.). This way you get the trained weights but can feed inputs with a batch size of 1.
More on saving and loading Keras models:
https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
Note that you need to install h5py to use save_weights.
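The two-model trick works because the weight matrices of a recurrent layer do not depend on the batch size; only the state and the input do. A minimal NumPy sketch of that fact, using a toy tanh RNN with random stand-in weights (all names here are hypothetical and unrelated to the Keras API):

```python
import numpy as np


def init_weights(n_features, n_units, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "w_in": 0.1 * rng.standard_normal((n_features, n_units)),
        "w_rec": 0.1 * rng.standard_normal((n_units, n_units)),
    }


def run_rnn(weights, sequences):
    """sequences: (batch, steps, features) -> final hidden state (batch, units)."""
    batch = sequences.shape[0]
    state = np.zeros((batch, weights["w_rec"].shape[0]))
    for t in range(sequences.shape[1]):
        state = np.tanh(sequences[:, t] @ weights["w_in"] + state @ weights["w_rec"])
    return state


weights = init_weights(n_features=3, n_units=4)  # stands in for trained weights
rng = np.random.default_rng(1)
batch_of_50 = rng.standard_normal((50, 5, 3))

out_batch = run_rnn(weights, batch_of_50)       # "training-size" batch
out_single = run_rnn(weights, batch_of_50[:1])  # same weights, batch of 1

print(np.allclose(out_batch[0], out_single[0]))
```

The same weights give the same result for the first sequence whether it is processed alone or inside a batch of 50, so copying weights between two otherwise identical models with different batch sizes is safe.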
Another easy workaround is:
def create_model(batch_size):
    model = Sequential()
    model.add(LSTM(1, batch_input_shape=(batch_size, 1, sl), stateful=True))
    model.add(Dense(1))
    return model
model_train = create_model(batch_size=50)
model_train.compile(loss='mean_squared_error', optimizer='adam')
model_train.fit(trainX, trainY, epochs=epochs, batch_size=batch_size)
model_predict = create_model(batch_size=1)
weights = model_train.get_weights()
model_predict.set_weights(weights)
The best solution to this problem is "Copy Weights". It can be really helpful if you want to train & predict with your LSTM model with different batch sizes.
For example, once you have trained your model with 'n' batch size as shown below:
# configure network
n_batch = len(X)
n_epoch = 1000
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
And now you want to predict for fewer samples than your batch size, e.g. n=1.
What you can do is copy the weights of your fitted model into a new LSTM model with the same architecture and the batch size set to 1.
# re-define the batch size
n_batch = 1
# re-define model
new_model = Sequential()
new_model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
new_model.add(Dense(1))
# copy weights
old_weights = model.get_weights()
new_model.set_weights(old_weights)
Now you can easily predict and train LSTMs with different batch sizes.
For more information please read: https://machinelearningmastery.com/use-different-batch-sizes-training-predicting-python-keras/
I found the article below helpful (and fully in line with the answers above). The section "Solution 3: Copy Weights" worked for me:
How to use Different Batch Sizes when Training and Predicting with LSTMs, by Jason Brownlee
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit network
for i in range(n_epoch):
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
    model.reset_states()
# re-define the batch size
n_batch = 1
# re-define model
new_model = Sequential()
new_model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
new_model.add(Dense(1))
# copy weights
old_weights = model.get_weights()
new_model.set_weights(old_weights)
# compile model
new_model.compile(loss='mean_squared_error', optimizer='adam')
I had the same problem and resolved it.
Alternatively, you can save your weights and, when you test, rebuild the model with the same architecture and a batch size of 1, then load the weights, as below:
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(1, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.load_weights("w.h5")
It works well. I hope this is helpful for you.
If you don't have access to the code that created the model or if you just don't want your prediction/validation code to depend on your model creation and training code there is another way:
You could create a new model from a modified version of the loaded model's config like this:
loaded_model = tf.keras.models.load_model('model_file.h5')
config = loaded_model.get_config()
old_batch_input_shape = config['layers'][0]['config']['batch_input_shape']
config['layers'][0]['config']['batch_input_shape'] = (new_batch_size,) + tuple(old_batch_input_shape[1:])
new_model = loaded_model.__class__.from_config(config)
new_model.set_weights(loaded_model.get_weights())
This works well for me in a situation where I have several different models with stateful RNN layers working together in a graph network but trained separately in different networks, which leads to different batch sizes. It allows me to experiment with the model structures and training batches without needing to change anything in my validation script.
I am trying to implement discriminant condition codes in Keras as proposed in
Xue, Shaofei, et al., "Fast adaptation of deep neural network based
on discriminant codes for speech recognition."
The main idea is you encode each condition as an input parameter and let the network learn dependency between the condition and the feature-label mapping. On a new dataset instead of adapting the entire network you just tune these weights using backprop. For example say my network looks like this
X ---->|----|
|DNN |----> Y
Z --- >|----|
X: features Y: labels Z:condition codes
Now given a pretrained DNN, and X',Y' on a new dataset I am trying to estimate the Z' using backprop that will minimize prediction error on Y'. The math seems straightforward except I am not sure how to implement this in keras without having access to the backprop itself.
For instance, can I add an Input() layer with trainable=True with all other layers set to trainable= False. Can backprop in keras update more than just layer weights? Or is there a way to hack keras layers to do this?
Any suggestions welcome.
thanks
I figured out how to do this (exactly) in Keras by looking at fchollet's post here
Using the Keras backend, I was able to compute the gradient of my loss w.r.t. Z directly and used it to drive the update.
Code below:
import keras.backend as K
import numpy as np
model.summary() #Pretrained model
loss = K.categorical_crossentropy(Y, Y_out)
grads = K.gradients(loss, Z)
grads /= (K.sqrt(K.mean(K.square(grads)))+ 1e-5)
iterate = K.function([X,Z],[loss,grads])
step = 0.1
Z_adapt = Z_in.copy()
for i in range(100):
    loss_val, grads_val = iterate([X_in, Z_adapt])
    Z_adapt -= grads_val[0] * step
    print("iter:", i, np.mean(loss_val))
print("Before:")
print(model.evaluate([X_in, Z_in], Y_out))
print("After:")
print(model.evaluate([X_in, Z_adapt], Y_out))
X, Y, Z are nodes in the model graph. Z_in is an initial value for Z'; I set it to an average value from the training set. Z_adapt is the value after 100 iterations of gradient descent and should give you a better result.
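The same idea, gradient descent on the input Z while every weight stays frozen, can be sketched in plain NumPy. This toy version assumes a linear "pretrained" model with random stand-in weights, a squared-error loss, and a single code vector shared by all samples; it drops the gradient normalization used above, and all names and sizes are illustrative rather than taken from the paper or the Keras code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_feat, n_code, n_out = 64, 5, 3, 4

# Frozen "pretrained" weights (random stand-ins for illustration).
w_x = rng.standard_normal((n_feat, n_out))
w_z = rng.standard_normal((n_code, n_out))

# New dataset generated with an unknown condition code z_true.
x_new = rng.standard_normal((n_samples, n_feat))
z_true = rng.standard_normal(n_code)
y_new = x_new @ w_x + z_true @ w_z


def loss_and_grad(z):
    """Mean squared error and its gradient w.r.t. the code z (weights fixed)."""
    residual = x_new @ w_x + z @ w_z - y_new
    loss = np.mean(residual ** 2)
    grad_z = 2.0 * residual.sum(axis=0) @ w_z.T / residual.size
    return loss, grad_z


z_adapt = np.zeros(n_code)  # initial guess for the code
initial_loss, _ = loss_and_grad(z_adapt)

step = 0.05
for _ in range(200):
    _, grad = loss_and_grad(z_adapt)
    z_adapt -= step * grad  # only z is updated; w_x and w_z never change

final_loss, _ = loss_and_grad(z_adapt)
print("loss before:", initial_loss, "after:", final_loss)
```

Only z_adapt changes during the loop, which mirrors the role of Z_adapt in the Keras snippet: the network is treated as a fixed function and the condition code is the only free parameter.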
Assume that the size of Z is m x n. Then you can first define an input layer of size m * n x 1. The input will be an m * n x 1 vector of ones. You can define a dense layer containing m * n neurons and set trainable = True for it. The response of this layer will give you a flattened version of Z. Reshape it appropriately and give it as input to the rest of the network that can be appended ahead of this.
Keep in mind that if the size of Z is too large, then network may not be able to learn a dense layer of that many neurons. In that case, maybe you need to put additional constraints or look into convolutional layers. However, convolutional layers will put some constraints on Z.