How to time each layer (including Python layers) with the pycaffe interface - caffe

Can someone share with me how to use the pycaffe interface to time each layer, including a customized Python layer?
I have observed that there are timing operations in the Caffe source file tools/caffe.cpp:
forward_timer.Start();
for (int i = 0; i < layers.size(); ++i) {
  timer.Start();
  layers[i]->Forward(bottom_vecs[i], top_vecs[i]);
  forward_time_per_layer[i] += timer.MicroSeconds();
}
but I don't know how to use it from the pycaffe interface.
The official documentation says:
Benchmarking: caffe time benchmarks model execution layer-by-layer
through timing and synchronization. This is useful to check system
performance and measure relative execution times for models.
caffe time -model examples/mnist/lenet_train_test.prototxt -weights examples/mnist/lenet_iter_10000.caffemodel -gpu 0 -iterations 10
but this is used from the Caffe command line. I'm actually running Faster R-CNN, which contains Python layer implementations and is driven by Python scripts, so I don't know how to apply it there.
With the command caffe time -model xxx.prototxt -weights xxx.caffemodel, it cannot find my Python layers. How do I set the path so Caffe can find them?
EDIT: after adding the Python layer's file path to PYTHONPATH, it can find the module, but an error occurs that does not happen when I run the same network through the Python interface:
F0119 22:24:39.621578 9314 net.cpp:141] Check failed: param_size <= num_param_blobs (0 vs. -2) Too many params specified for layer proposal
*** Check failure stack trace: ***
# 0x7efd9bde9daa (unknown)
# 0x7efd9bde9ce4 (unknown)
# 0x7efd9bde96e6 (unknown)
# 0x7efd9bdec687 (unknown)
# 0x7efd9c4555b6 caffe::Net<>::Init()
# 0x7efd9c45667b caffe::Net<>::Net()
# 0x408c0c time()
# 0x405bec main
# 0x7efd9a799f45 (unknown)
# 0x4064f3 (unknown)
# (nil) (unknown)
Aborted

Just add timing lines to ForwardFromTo() in net.cpp.
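If you would rather stay in Python than patch net.cpp, a rough per-layer timing loop is also possible through pycaffe. This is a minimal sketch, assuming a deploy-style network; the file names are placeholders, and net._forward is simply the raw Python binding to ForwardFromTo():
import time
import caffe

# Placeholder paths: substitute your own prototxt/caffemodel.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# net._forward(i, j) runs Net::ForwardFromTo(i, j), so stepping through the
# net one layer index at a time gives a rough per-layer wall-clock time.
# Python layers are timed like any other; GPU times are only approximate
# unless you add explicit synchronization.
for i, name in enumerate(net._layer_names):
    t0 = time.time()
    net._forward(i, i)
    print('{:<25s} {:8.3f} ms'.format(name, (time.time() - t0) * 1000.0))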


RISC-V: Illegal instruction exception when switching to supervisor mode

When setting the mstatus.mpp field to switch to supervisor mode, I'm getting an illegal instruction exception when calling mret. I'm testing this in qemu-system-riscv64 version 6.1 with the riscv64-softmmu system.
I recently upgraded from QEMU 5.0 to 6.1. Prior to this upgrade my code worked. I can't see anything relevant in the changelog. I'm assuming that there's a problem in my code that the newer version simply doesn't tolerate.
Here is a snippet of assembly that shows what's happening (unrelated boot code removed):
.setup_hart:
    csrw satp, zero          # Disable address translation.
    li   t0, (1 << 11)       # Supervisor mode.
    csrw mstatus, t0
    csrw mie, zero           # Disable interrupts.
    la   sp, __stack_top     # Setup stack pointer.
    la   t0, asm_trap_vector
    csrw mtvec, t0
    la   t0, kernel_main     # Jump to kernel_main on trap return.
    csrw mepc, t0
    la   ra, cpu_halt        # If we return from main, halt.
    mret
If I set the mstatus.mpp field to 0b11 for machine mode, I can get to kernel_main without any problem.
Here's the output from QEMU showing the exception information:
riscv_cpu_do_interrupt: hart:0, async:0, cause:0000000000000002, epc:0x000000008000006c, tval:0x0000000000000000, desc=illegal_instruction
mepc points to the address of the mret instruction where the exception occurs.
I've tested that the machine supports supervisor mode by writing and retrieving the value in mstatus.mpp successfully.
Is there something obvious I'm missing? My code seems very similar to the few examples I can find online, such as https://osblog.stephenmarz.com/ch3.2.html. Any help would be greatly appreciated.
The issue turned out to be RISC-V's Physical Memory Protection (PMP). QEMU will raise an illegal instruction exception when executing an MRET instruction if no PMP rules have been defined. Adding a PMP entry resolved the issue.
This was confusing, as this behaviour is not specified in the Privileged Architecture manual's section on mret.

How to prevent my reward sum received during evaluation runs from repeating in intervals when using RLlib?

I am using Ray 1.3.0 (for RLlib) in combination with SUMO version 1.9.2 to simulate a multi-agent scenario. I have configured RLlib to use a single PPO network that is commonly updated/used by all N agents. My evaluation settings look like this:
# === Evaluation Settings ===
# Evaluate with every `evaluation_interval` training iterations.
# The evaluation stats will be reported under the "evaluation" metric key.
# Note that evaluation is currently not parallelized, and that for Ape-X
# metrics are already only reported for the lowest epsilon workers.
"evaluation_interval": 20,
# Number of episodes to run per evaluation period. If using multiple
# evaluation workers, we will run at least this many episodes total.
"evaluation_num_episodes": 10,
# Whether to run evaluation in parallel to a Trainer.train() call
# using threading. Default=False.
# E.g. evaluation_interval=2 -> For every other training iteration,
# the Trainer.train() and Trainer.evaluate() calls run in parallel.
# Note: This is experimental. Possible pitfalls could be race conditions
# for weight synching at the beginning of the evaluation loop.
"evaluation_parallel_to_training": False,
# Internal flag that is set to True for evaluation workers.
"in_evaluation": True,
# Typical usage is to pass extra args to evaluation env creator
# and to disable exploration by computing deterministic actions.
# IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
# policy, even if this is a stochastic one. Setting "explore=False" here
# will result in the evaluation workers not using this optimal policy!
"evaluation_config": {
# Example: overriding env_config, exploration, etc:
"lr": 0, # To prevent any kind of learning during evaluation
"explore": True # As required by PPO (read IMPORTANT NOTE above)
},
# Number of parallel workers to use for evaluation. Note that this is set
# to zero by default, which means evaluation will be run in the trainer
# process (only if evaluation_interval is not None). If you increase this,
# it will increase the Ray resource usage of the trainer since evaluation
# workers are created separately from rollout workers (used to sample data
# for training).
"evaluation_num_workers": 1,
# Customize the evaluation method. This must be a function of signature
# (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
# Trainer.evaluate() method to see the default implementation. The
# trainer guarantees all eval workers have the latest policy state before
# this function is called.
"custom_eval_function": None,
What happens is that every 20 iterations (each iteration collecting "X" training samples), there is an evaluation run of at least 10 episodes. The rewards received by all N agents are summed over these episodes, and that total is reported as the reward sum for that particular evaluation run. Over time, I notice that the reward sums follow a pattern that repeats over the same interval of evaluation runs, and the learning goes nowhere.
UPDATE (23/06/2021)
Unfortunately, I did not have TensorBoard activated for that particular run, but from the mean rewards collected during the evaluations (which happen every 20 iterations, 10 episodes each) it is clear that there is a repeating pattern, as shown in the annotated plot below:
The 20 agents in the scenario should be learning to avoid colliding, but instead they stagnate at a certain policy and end up showing the exact same reward sequence during evaluation.
Is this a characteristic of how I have configured the evaluation aspect, or should I be checking something else? I would be grateful if anyone could advise or point me in the right direction.
Thank you.
Step 1: I noticed that when I stopped the run at some point for any reason and then restarted it from the saved checkpoint, most graphs on TensorBoard (including rewards) charted out the line in EXACTLY the same fashion all over again, which made it look like the sequence was repeating.
Step 2: This led me to believe that there was something wrong with my checkpoints. I compared the weights across checkpoints in a loop and, voila, they were all the same! Not a single change! So either something was wrong with the saving/restoring of checkpoints, or the weights were never being updated. After a bit of playing around I found that checkpointing was not the problem, which meant my weights were simply not being updated!
Step 3: I sifted through my training configuration to see if something there was preventing the network from learning, and I noticed I had set the "multiagent" configuration option "policies_to_train" to a policy that did not exist. Unfortunately, this either did not throw a warning/error, or it did and I completely missed it.
Solution step: By setting the multiagent "policies_to_train" configuration option correctly, it started to work!
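For reference, here is a minimal sketch of a consistent "multiagent" block with a single shared PPO policy, in the style of Ray 1.3.0; the policy name and the observation/action spaces are placeholders for your own SUMO setup, and config is your existing trainer config dict. The key point is that "policies_to_train" must name an entry that actually exists in "policies":
from gym.spaces import Box, Discrete

# Placeholder spaces; use those of your own environment.
obs_space = Box(-1.0, 1.0, (10,))
act_space = Discrete(3)

config["multiagent"] = {
    # One shared policy; the tuple is (policy_cls_or_None, obs_space, act_space, config).
    "policies": {
        "shared_policy": (None, obs_space, act_space, {}),
    },
    # Map every agent ID to the shared policy.
    "policy_mapping_fn": lambda agent_id: "shared_policy",
    # Must match a key in "policies" above, otherwise no weights are ever updated.
    "policies_to_train": ["shared_policy"],
}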
Could it be that due to the multi-agent dynamics, your policy is chasing its tail? How many policies do you have? Are they competing/collaborating/neutral to each other?
Note that multi-agent training can be very unstable, and seeing these fluctuations is quite normal as the different policies get updated and then have to face different "env" dynamics because of that (env = env + all other policies, which appear as part of the env as well).

What does local rank mean in distributed deep learning?

https://github.com/huggingface/transformers/blob/master/examples/run_glue.py
I want to adapt this script to do text classification on my data. The computer for this task is a single machine with two graphics cards. So this involves a kind of "distributed" training with the term local_rank in the script above, especially when local_rank equals 0 or -1, as in line 83.
After reading some material on distributed computation, I guess that local_rank is like an ID for a machine, and 0 may mean this machine is the "main" or "head" in the computation. But what is -1?
Q: But what is -1?
Usually, this is used to disable the distributed setting. Indeed, as you can see here:
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
and here:
if args.local_rank != -1:
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                      output_device=args.local_rank,
                                                      find_unused_parameters=True)
setting local_rank to -1 has this effect.
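For context, the value itself normally comes from the launcher: torch.distributed.launch starts one process per GPU and passes a --local_rank argument to each of them, while a plain single-process run leaves it at the default of -1 and therefore takes the non-distributed path. A minimal sketch of that pattern (mirroring what run_glue.py does):
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank == -1:
    # Single process: use one GPU (or DataParallel over both) on this machine.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
else:
    # One process per GPU: bind this process to its local GPU and join the group.
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend="nccl")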
I want to add something to @Berriel's answer. Since you have two GPUs and not a distributed setup with a node structure, you do not need distributed methods like DistributedSampler. Hugging Face uses -1 to disable the distributed setting in its training mechanisms.
Check out the following code from the Hugging Face training_args.py script. As you can see, if there is a distributed training mechanism, self.local_rank gets changed.
def _setup_devices(self) -> "torch.device":
    logger.info("PyTorch: setting up devices")
    if self.no_cuda:
        device = torch.device("cpu")
        self._n_gpu = 0
    elif is_torch_tpu_available():
        device = xm.xla_device()
        self._n_gpu = 0
    elif is_sagemaker_distributed_available():
        import smdistributed.dataparallel.torch.distributed as dist
        dist.init_process_group()
        self.local_rank = dist.get_local_rank()
        device = torch.device("cuda", self.local_rank)
        self._n_gpu = 1
    elif self.local_rank == -1:
        # if n_gpu is > 1 we'll use nn.DataParallel.
        # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`
        # Explicitly set CUDA to the first (index 0) CUDA device, otherwise `set_device` will
        # trigger an error that a device index is missing. Index 0 takes into account the
        # GPUs available in the environment, so `CUDA_VISIBLE_DEVICES=1,2` with `cuda:0`
        # will use the first GPU in that env, i.e. GPU#1
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        # Sometimes the line in the postinit has not been run before we end up here, so just checking we're not at
        # the default value.
        self._n_gpu = torch.cuda.device_count()

How to print/return the softmax score in Keras during training?

Question: How do I print/return the softmax layer for a multiclass problem using Keras?
my motivation: it is important for visualization/debugging.
it is important to do this in the 'training' setting, so batch normalization and dropout must behave as they do at training time.
it should be efficient. Calling a vanilla model.predict() every now and then is less desirable because the model I am using is heavy, and those are extra forward passes. The most desirable option is a way to simply display the original network output that was already computed during training.
it is ok to assume that TensorFlow is used as the backend.
Thank you.
You can get the outputs of any layer by using: model.layers[index].output
For all layers use this:
import numpy as np
from keras import backend as K

inp = model.input                                   # input placeholder
outputs = [layer.output for layer in model.layers]  # all layer outputs
functor = K.function([inp] + [K.learning_phase()], outputs)  # evaluation function

# Testing (input_shape is the shape of a single sample for your model)
test = np.random.random(input_shape)[np.newaxis, ...]
layer_outs = functor([test, 1.])  # learning_phase=1 -> training-mode dropout/BN
print(layer_outs)
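If you want this to happen automatically while model.fit() runs, one option is a small callback around the same kind of K.function. A sketch, assuming the last layer is the softmax; probe_batch, x_train and y_train are placeholders for your own data. It still costs one extra forward pass per epoch, but only on the small probe batch:
from keras import backend as K
from keras.callbacks import LambdaCallback

# Evaluate only the final (softmax) layer, in training mode (learning_phase=1).
softmax_fn = K.function([model.input, K.learning_phase()],
                        [model.layers[-1].output])

print_softmax = LambdaCallback(
    on_epoch_end=lambda epoch, logs: print(epoch, softmax_fn([probe_batch, 1.])[0]))

model.fit(x_train, y_train, epochs=10, callbacks=[print_softmax])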

How to take a network snapshot in Caffe on demand?

I want to be able to take a network snapshot on demand (say, on some condition) as training is going on. Is there a way to do this with Caffe?
For example with callbacks in Python:
import caffe

def OnStart():
    pass  # both callbacks must be defined anyway

def OnGradientsReady():
    global solver
    if solver.iter == 17:
        solver.snapshot()

solver = caffe.get_solver("mnist/lenet_solver_t1.prototxt")
solver.add_callback(OnStart, OnGradientsReady)
solver.solve()
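If the condition should depend on training progress rather than a fixed iteration, the callback can inspect the solver's net before deciding. A small sketch, assuming your network has a top blob literally named 'loss' (adjust the name and threshold to your prototxt):
def OnGradientsReady():
    global solver
    # 'loss' is an assumed blob name; use whatever your prototxt defines.
    current_loss = float(solver.net.blobs['loss'].data)
    if current_loss < 0.01:
        solver.snapshot()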