I am wondering how to go about setting up ONLY a test phase in Caffe for an LMDB file. I have already trained my model, everything seems good, my loss has decreased, and the output I am getting on images loaded in one by one also seem good.
Now I would like to see how my model performs on a separate LMDB test set, but seem to be unable to do so successfully. It would not be ideal for me to do a loop by loading images one at a time since my loss function is already defined in caffe and this would require me to redefine it.
this is what I have so far, but the results of this dont make sense; when I compare the loss I have from the train set to the loss I get from this, they don't match (orders of magnitude apart). Does anyone have any idea what my problem could be?
caffe.set_device(0)
caffe.set_mode_gpu()
net = caffe.Net('/home/jeremy/Desktop/caffestuff/JP_Kitti/all_proto/mirror_shuffle/deploy_JP.prototxt','/home/jeremy/Desktop/caffestuff/JP_Kitti/all_proto/mirror_shuffle/snapshot_iter_10000.caffemodel',caffe.TEST)
solver = None # ignore this workaround for lmdb data (can't instantiate two solvers on the same data)
solver = caffe.SGDSolver('/home/jeremy/Desktop/caffestuff/JP_Kitti/all_proto/mirror_shuffle/lenet_auto_solverJP_test.prototxt')
niter = 100
test_loss = zeros(niter)
count = 0
for it in range(niter):
solver.test_nets[0].forward() # SGD by Caffe
# store the test loss
test_loss[count] = solver.test_nets[0].blobs['loss']
print(solver.test_nets[0].blobs['loss'].data)
count = count+1
See my answer here. Do not forget to subtract the mean, otherwise you'll get low accuracy. The link to the code, posted above, takes care of that.
Related
PPO model doesn't iterate through the whole dataframe .. its basically repeating the first step many times (10,000 in this example) ?
In this case, DF's shape is (5476, 28) and each step's obs shape is: (60, 28).. I dont see that its iterating through the whole DF.
# df shape - (5476, 28)
env = MyRLEnv(df)
model = PPO("MlpPolicy", env, verbose=4)
model.learn(total_timesteps=10000)
MyRLEnv:
self.action_space = spaces.Discrete(4)
self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(60, 28) , dtype=np.float64)
Thanks!
I also got stuck few days ago on something similar to this , but after inspecting deeply I found that the learn method actually runs the environment for n times , now this n is equal to total_timesteps/size_of_df , this in your case would be nearly 10000/5476times which is almost equal to 1.8 , so this 1.8 means , the algorithm would reset the environment at the beginning, then run the step method for the entire dataframe and reset the environment again and run the step method for only 80% of the data in dataframe. So, when the PPO Algorithm stops you see only 80% of the dataframe is being ran .
The Actor Critic Algorithms runs the environment numerous number of times to improve it's efficiency, so that is the reason it is usually suggested that in order to get better results we should keep the value of total_timesteps fairly high , so that it can run it on the same data for quite some times to learn better.
Example:
Say my total_timesteps = 10000 and len(df) = 5000,
then in that case it would run for, n = total_timesteps/len(df) = 2 Full scans of the entire dataframe .
I'm trying to implement a simple GAN in Pytorch. The following training code works:
for epoch in range(max_epochs): # loop over the dataset multiple times
print(f'epoch: {epoch}')
running_loss = 0.0
for batch_idx,(data,_) in enumerate(data_gen_fn):
# data preparation
real_data = data
input_shape = real_data.shape
inputs_generator = torch.randn(*input_shape).detach()
# generator forward
fake_data = generator(inputs_generator).detach()
# discriminator forward
optimizer_generator.zero_grad()
optimizer_discriminator.zero_grad()
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)
predictions = torch.cat((predictions_on_real,
predictions_on_fake), dim=0)
#########################################################
# loss discriminator
labels_real_fake = torch.tensor([1]*batch_size + [0]*batch_size)
loss_discriminator_batch = criterion_discriminator(predictions,
labels_real_fake)
# update discriminator
loss_discriminator_batch.backward()
optimizer_discriminator.step()
# generator
# zero the parameter gradients
optimizer_discriminator.zero_grad()
optimizer_generator.zero_grad()
fake_data = generator(inputs_generator) # make again fake data but without detaching
predictions_on_fake = discriminator(fake_data) # D(G(encoding))
# loss generator
labels_fake = torch.tensor([1]*batch_size)
loss_generator_batch = criterion_generator(predictions_on_fake,
labels_fake)
loss_generator_batch.backward() # dL(D(G(encoding)))/dW_{G,D}
optimizer_generator.step()
If I plot the generated images for each iteration, I see that the generated images look like the real ones, so the training procedure seems to work well.
However, if I try to change the code in the ALERT CODE part , i.e., instead of:
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)
predictions = torch.cat((predictions_on_real,
predictions_on_fake), dim=0)
#########################################################
I use the following:
#################### ALERT CODE #######################
predictions = discriminator(torch.cat( (real_data, fake_data), dim=0))
#######################################################
That is conceptually the same (in a nutshell, instead of doing two different forward on the discriminator, the former on the real, the latter on the fake data, and finally concatenate the results, with the new code I first concatenate real and fake data, and finally I make just one forward pass on the concatenated data.
However, this code version does not work, that is the generated images seems to be always random noise.
Any explanation to this behavior?
Why do we different results?
Supplying inputs in either the same batch, or separate batches, can make a difference if the model includes dependencies between different elements of the batch. By far the most common source in current deep learning models is batch normalization. As you mentioned, the discriminator does include batchnorm, so this is likely the reason for different behaviors. Here is an example. Using single numbers and a batch size of 4:
features = [1., 2., 5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 3.5, std 2.0615528128088303
>>>normalized features [-1.21267813 -0.72760688 0.72760688 1.21267813]
Now we split the batch into two parts. First part:
features = [1., 2.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 1.5, std 0.5
>>>normalized features [-1. 1.]
Second part:
features = [5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 5.5, std 0.5
>>>normalized features [-1. 1.]
As we can see, in the split-batch version, the two batches are normalized to the exact same numbers, even though the inputs are very different. In the joint-batch version, on the other hand, the larger numbers are still larger than the smaller ones as they are normalized using the same statistics.
Why does this matter?
With deep learning, it's always hard to say, and especially with GANs and their complex training dynamics. A possible explanation is that, as we can see in the example above, the separate batches result in more similar features after normalization even if the original inputs are quite different. This may help early in training, as the generator tends to output "garbage" which has very different statistics from real data.
With a joint batch, these differing statistics make it easy for the discriminator to tell the real and generated data apart, and we end up in a situation where the discriminator "overpowers" the generator.
By using separate batches, however, the different normalizations result in the generated and real data to look more similar, which makes the task less trivial for the discriminator and allows the generator to learn.
So I am involved in a project that involves feeding a combination of text embeddings and image vectors into a DNN to arrive at the result. Now for the word embedding part, I am using TFHUB's Electra while for the image part I am using a NASNet Mobile network.
However, the issue I am facing is that while running the word embedding part, using the code shown below, the code just keeps running nonstop. It has been over 2 hours now and my training dataset has just 14900 rows of tweets.
Note - The input to the function is just a list of 14900 tweets.
tfhub_handle_encoder="https://tfhub.dev/google/electra_small/2"
tfhub_handle_preprocess="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3
# Load Models
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
bert_model = hub.KerasLayer(tfhub_handle_encoder)
def get_text_embedding(text):
preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='Preprocessor')
encoder_inputs = preprocessing_layer(text) encoder =
hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='Embeddings') outputs =
encoder(encoder_inputs) text_repr = outputs['pooled_output'] text_repr =
tf.keras.layers.Dense(128, activation='relu')(text_repr)
return text_repr
text_repr = get_text_embedding(train_text)
Is there a faster way to get text representation using these models?
Thanks for the help!
The operation performed in the code is quadratic in its nature. While I managed to execute your snippet with 10000 samples within a few minutes, a 14900 long input ran out of memory on 32GB RAM runtime. Is it possible that your runtime is experiencing swapping?
It is not clear what is the snippet trying to achieve. Do you intend to train model? In such case you can define the text_input as an Input layer and use fit to train. Here is an example: https://www.tensorflow.org/text/tutorials/classify_text_with_bert#define_your_model
Below is the code to configure TrainingArguments consumed from the HuggingFace transformers library to finetune the GPT2 language model.
training_args = TrainingArguments(
output_dir="./gpt2-language-model", #The output directory
num_train_epochs=100, # number of training epochs
per_device_train_batch_size=8, # batch size for training #32, 10
per_device_eval_batch_size=8, # batch size for evaluation #64, 10
save_steps=100, # after # steps model is saved
warmup_steps=500,# number of warmup steps for learning rate scheduler
prediction_loss_only=True,
metric_for_best_model = "eval_loss",
load_best_model_at_end = True,
evaluation_strategy="epoch",
learning_rate=0.00004, # learning rate
)
early_stop_callback = EarlyStoppingCallback(early_stopping_patience = 3)
trainer = Trainer(
model=gpt2_model,
args=training_args,
data_collator=data_collator,
train_dataset=train_dataset,
eval_dataset=test_dataset,
callbacks = [early_stop_callback],
)
The number of epochs as 100 and learning_rate as 0.00004 and also the early_stopping is configured with the patience value as 3.
The model ran for 5/100 epochs and noticed that the difference in loss_value is negligible. The latest checkpoint is saved as checkpoint-latest.
Now Can I modify the learning_rate may be to 0.01 from 0.00004 and resume the training from the latest saved checkpoint - checkpoint-latest? Doing that will be efficient?
Or to train with the new learning_rate value should I start the training from the beginning?
No, you don't have to restart your training.
Changing the learning rate is like changing how big a step your model take in the direction determined by your loss function.
You can also think of it as transfer learning where the model has some experience (no matter how little or irrelevant) and the weights are in a state most likely better than a randomly initialised one.
As a matter of fact, changing the learning rate mid-training is considered an art in deep learning and you should change it if you have a very very good reason to do it.
You would probably want to write down when (why, what, etc) you did it if you or someone else wants to "reproduce" the result of your model.
Pytorch provides several methods to adjust the learning_rate: torch.optim.lr_scheduler.
Check the docs for usage https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
I'm studying on a deep learning(supervised-learning) to estimate depth images from monocular images.
And the dataset currently uses KITTI data. RGB images (input image) are used KITTI Raw data, and data from the following link is used for ground-truth.
In the process of learning a model by designing a simple encoder-decoder network, the result is not so good, so various attempts are being made.
While searching for various methods, I found that groundtruth only learns valid areas by masking because there are many invalid areas, i.e., values that cannot be used, as shown in the image below.
So, I learned through masking, but I am curious about why this result keeps coming out.
and this is my training part of code.
How can i fix this problem.
for epoch in range(num_epoch):
model.train() ### train ###
for batch_idx, samples in enumerate(tqdm(train_loader)):
x_train = samples['RGB'].to(device)
y_train = samples['groundtruth'].to(device)
pred_depth = model.forward(x_train)
valid_mask = y_train != 0 #### Here is masking
valid_gt_depth = y_train[valid_mask]
valid_pred_depth = pred_depth[valid_mask]
loss = loss_RMSE(valid_pred_depth, valid_gt_depth)
As far as I can understand, you are trying to estimate depth from an RGB image as input. This is an ill-posed problem since the same input image can project to multiple plausible depth values. You would need to integrate certain techniques to estimate accurate depth from RGB images instead of simply taking an L1 or L2 loss between an RGB image and its corresponding depth image.
I would suggest you to go through some papers in estimating depth from single images such as: Depth Map Prediction from a Single Image using a Multi-Scale Deep Network where they use a network to first estimate the global structure of the given image and then use a second network that refines the local scene information. Instead of taking a simple RMSE loss, as you did, they use a scale-invariant error function in which the relationship between points is measured.