What is the cause of the low CPU utilization in RLlib PPO? What does 'cpu_util_percent' measure?

I implemented multi-agent PPO in RLlib with a custom environment. It learns and works well, except that training is slow. I suspect an underutilized CPU may be the cause, so I want to know what ray/tune/perf/cpu_util_percent measures. Does it measure only the rollout workers, or is it averaged over the learner as well? And what could cause it to be so low? (All my runs average about 13% CPU usage.)
Run on GCP:
Ray 2.0
Python 3.9
PyTorch 1.12
head: n1-standard-8 with 1 V100 GPU
2 workers: c2-standard-60
num_workers: 120 # rollout workers, not machines (num_workers == num_rollout_workers)
num_envs_per_worker: 1
num_cpus_for_driver: 8
num_gpus: 1
num_cpus_per_worker: 1
num_gpus_per_worker: 0
train_batch_size: 12000
sgd_minibatch_size: 3000
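For reference, the settings above correspond to an old-style RLlib config dict roughly like the sketch below (a reconstruction, not the actual code: the environment name, the ray.init address, and the stop criterion are placeholders, and the multi-agent policy mapping is omitted):

import ray
from ray import tune

ray.init(address="auto")  # connect to the existing GCP cluster

config = {
    "env": "my_custom_env",      # placeholder for the (unshared) custom multi-agent env
    "framework": "torch",
    "num_workers": 120,          # rollout workers, not machines
    "num_envs_per_worker": 1,
    "num_cpus_for_driver": 8,
    "num_gpus": 1,               # learner GPU on the head node
    "num_cpus_per_worker": 1,
    "num_gpus_per_worker": 0,
    "train_batch_size": 12000,
    "sgd_minibatch_size": 3000,
    # "multiagent": {...} policy mapping omitted
}

tune.run("PPO", config=config, stop={"training_iteration": 100})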
I tried a smaller batch size (4096) together with fewer workers (10), and also a larger batch size (480000); all runs gave 10-20% CPU usage.
I cannot share the code.

Related

Pytorch DirectML computational inconsistency

I am trying to train a DQN on the OpenAI LunarLander environment. I included an argument parser to control which device is used in each run (CPU or GPU computing, via PyTorch's to("cpu") or to("dml")).
Here is my code:
# Put the networks on either CPU or DML, e.g. .to("cpu") for CPU or .to("dml") for Microsoft DirectML GPU computing.
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
However, pytorch-directml does not yet support some methods, such as .gather(), .max(), MSE_Loss(), etc. That is why I need to move the data from the GPU to the CPU, do those computations, calculate the loss, and then put everything back on the GPU for the next steps. See below.
Q_targets_next = self.Q_target(next_states.to("cpu")).detach().max(1)[0].unsqueeze(1).to("cpu") # Max predicted Q values for the next states, from the target network
Q_targets = (rewards.to("cpu") + self.args.gamma * Q_targets_next.to("cpu") * (1 - dones.to("cpu"))) # TD targets from the Bellman equation
Q_expected = self.Q(states).contiguous().to("cpu").gather(1, actions.to("cpu")) # Q values of the taken actions, from the local network
# Calculate loss (on CPU)
loss = F.mse_loss(Q_expected, Q_targets)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Put the networks back to DML
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
The strange thing is this:
The code runs without errors; when I run it with args.device = "cpu" it works perfectly, but when I run the exact same code with args.device = "dml" the results are terrible and the network does not learn anything.
I noticed that on every iteration the CPU and GPU results differ by a tiny amount (~1e-5), but over many iterations this accumulates until the GPU and CPU results are almost completely different.
What am I missing here? Is there something I need to pay attention to when moving tensors between CPU and GPU? Should I make them contiguous()? Or is this simply a bug in the pytorch-directml library?
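One way to pin down the per-step drift described above is to run the same network with identical weights and inputs on both devices and track the worst-case difference. A minimal sketch (the "dml" device string follows the question; model and inputs are whatever network and batches you want to probe):

import copy
import torch

def device_drift(model, inputs, device="dml"):
    # Compare forward passes on the CPU vs. `device` using identical weights.
    cpu_model = copy.deepcopy(model).to("cpu")
    dev_model = copy.deepcopy(model).to(device)
    drifts = []
    with torch.no_grad():
        for x in inputs:
            out_cpu = cpu_model(x.to("cpu"))
            out_dev = dev_model(x.to(device)).to("cpu")
            drifts.append((out_cpu - out_dev).abs().max().item())
    return drifts  # a steady ~1e-5 is ordinary float divergence; growth over steps suggests a backend issue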

Faster RCNN (Caffe) joint learning: out of memory with 5376 MB dedicated memory (changed batch sizes and number of RPN proposals), what else?

I have worked with the alternating-optimization MATLAB code before; currently I am trying to get joint learning running. I am able to run the test demo on my Tesla 2070 GPU. For training, I have set all the batch sizes to 1:
__C.TRAIN.IMS_PER_BATCH = 1
__C.TRAIN.BATCH_SIZE = 1
__C.TRAIN.RPN_BATCHSIZE = 1
(I updated the yaml to 1 as well, since it overrides these values.)
But I still hit the out-of-memory error: error == cudaSuccess (2 vs. 0).
I have also tried lowering the number of RPN proposals (the original values are below):
train:
# Number of top scoring boxes to keep before applying NMS to RPN proposals
__C.TRAIN.RPN_PRE_NMS_TOP_N = 12000
# Number of top scoring boxes to keep after applying NMS to RPN proposals
__C.TRAIN.RPN_POST_NMS_TOP_N = 2000
test:
# Number of top scoring boxes to keep before applying NMS to RPN proposals
__C.TEST.RPN_PRE_NMS_TOP_N = 6000
# Number of top scoring boxes to keep after applying NMS to RPN proposals
__C.TEST.RPN_POST_NMS_TOP_N = 300
I tried values as low as pre: 100, post: 50 as a sanity check.
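A minimal sketch of how such overrides can be applied in code, assuming the standard py-faster-rcnn layout where the config is exposed as fast_rcnn.config.cfg (the exact module path may differ in your fork):

from fast_rcnn.config import cfg  # the EasyDict behind the __C.* settings above

cfg.TRAIN.IMS_PER_BATCH = 1
cfg.TRAIN.BATCH_SIZE = 1
cfg.TRAIN.RPN_BATCHSIZE = 1
cfg.TRAIN.RPN_PRE_NMS_TOP_N = 100   # sanity-check values from above
cfg.TRAIN.RPN_POST_NMS_TOP_N = 50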
And I still cannot run without the out-of-memory problem. What am I missing here? My Tesla has 5376 MB of dedicated memory and I use it only for this job (I have a separate GPU for my screen). I am positive I read from one of the authors himself that 5376 MB should be enough.
Thanks.

Training time on GeForce GTX Titan X with CUDA 7.5

I'm running the Caffe library on a GeForce GTX Titan X with CUDA 7.5 (Ubuntu 14). I'm not sure whether Caffe is properly configured for my setup. My dataset consists of 256 x 256 pixel images (3 channels), with 100000 training / 10000 test samples. For the very first test I'm using AlexNet with new_height=256, new_width=256, crop_size=227. Running 1000 training iterations on one Titan X with batch_size=256 takes about 17 minutes... Isn't that too slow for this hardware?
Any help and advice is kindly appreciated!
Running 1000 iterations on batches of 256 images:
(256 height * 256 width * 3 channels * 256 batch size * 1000 iterations) bytes / ((1024 * 1024) bytes/MB * (17 * 60) seconds) ≈ 47 MB/s of image data processed.
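A quick sanity check of that arithmetic:

# Rough data rate implied by the numbers above.
bytes_read = 256 * 256 * 3 * 256 * 1000      # height * width * channels * batch * iterations
seconds = 17 * 60
print(bytes_read / (1024 * 1024) / seconds)  # ~47 MB/s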
The following may improve the performance:
If the original images are of a higher resolution, preprocess them down to 256x256 first, which greatly reduces the number of pixel reads from the hard disk.
Compile Caffe with the cuDNN flag; this may give around a 30% speed improvement.
Create an LMDB dataset of the input set and use the LMDB data for training.
Use an SSD instead of a SATA hard disk.
No, it is not. Check out this link for Caffe performance and hardware configuration.

Why does the Cuda runtime reserve 80 GiB virtual memory upon initialization?

I was profiling my CUDA 4 program, and it turned out that at some stage the running process used over 80 GiB of virtual memory, which was a lot more than I expected.
After examining how the memory map evolved over time and comparing it against the line of code being executed, it turned out that the virtual memory usage jumped to over 80 GiB after these simple instructions:
int deviceCount;
cudaGetDeviceCount(&deviceCount);
if (deviceCount == 0) {
perror("No devices supporting CUDA");
}
Clearly, this is the first CUDA call, so this is where the runtime gets initialized. Afterwards, the memory map looks like this (truncated):
Address Kbytes RSS Dirty Mode Mapping
0000000000400000 89796 14716 0 r-x-- prg
0000000005db1000 12 12 8 rw--- prg
0000000005db4000 80 76 76 rw--- [ anon ]
0000000007343000 39192 37492 37492 rw--- [ anon ]
0000000200000000 4608 0 0 ----- [ anon ]
0000000200480000 1536 1536 1536 rw--- [ anon ]
0000000200600000 83879936 0 0 ----- [ anon ]
Note the huge area (about 80 GiB) mapped into the virtual address space in the last line.
Okay, it's maybe not a big problem, since reserving/allocating memory in Linux doesn't do much unless you actually write to that memory. But it's really annoying because, for example, MPI jobs have to be submitted with the maximum amount of vmem the job may use, and 80 GiB is then just a lower bound for CUDA jobs - one has to add everything else on top.
I can imagine that it has to do with the so-called scratch space that CUDA maintains: a kind of memory pool for kernel code that can dynamically grow and shrink. But that is speculation, and that space is allocated in device memory anyway.
Any insights?
This has nothing to do with scratch space; it is the result of the addressing system that allows unified addressing and peer-to-peer access between the host and multiple GPUs. The CUDA driver registers all of the GPU memory plus the host memory in a single virtual address space, using the kernel's virtual memory system. It isn't actually memory consumption per se; it is just a "trick" to map all the available address spaces into one linear virtual space for unified addressing.
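If you want to confirm that this is address-space reservation rather than resident memory, you can compare the process's virtual size against its RSS, e.g. with pmap as above or with a small psutil sketch (the PID is a placeholder for your running CUDA program):

import psutil

pid = 12345  # placeholder: PID of the CUDA process
mem = psutil.Process(pid).memory_info()
print(f"virtual: {mem.vms / 2**30:.1f} GiB, resident: {mem.rss / 2**30:.1f} GiB")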

Why are overlapping data transfers in CUDA slower than expected?

When I run the simpleMultiCopy sample from the SDK (4.0) on a Tesla C2050, I get the following results:
[simpleMultiCopy] starting...
[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)
> Device name: Tesla C2050
> CUDA Capability 2.0 hardware with 14 multi-processors
> scale_factor = 1.00
> array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000)
Measured timings (throughput):
Memcpy host to device : 2.725792 ms (6.154988 GB/s)
Memcpy device to host : 2.723360 ms (6.160484 GB/s)
Kernel : 0.611264 ms (274.467599 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 6.060416 ms
Compute can overlap with one transfer: 5.449152 ms
Compute can overlap with both data transfers: 2.725792 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 6.113555 ms
Avg. time when overlapped using 4 streams : 4.308822 ms
Avg. speedup gained (serialized - overlapped) : 1.804733 ms
Measured throughput:
Fully serialized execution : 5.488530 GB/s
Overlapped using 4 streams : 7.787379 GB/s
[simpleMultiCopy] test results...
PASSED
This shows that the expected runtime is 2.7 ms, while it actually takes 4.3 ms. What exactly causes this discrepancy? (I've also posted this question at http://forums.developer.nvidia.com/devforum/discussion/comment/8976.)
The first kernel launch cannot start until the first memcpy is completed, and the last memcpy cannot start until the last kernel launch is completed. So, there is "overhang" that introduces some of the overhead you are observing. You can decrease the size of the "overhang" by increasing the number of streams, but the streams' inter-engine synchronization incurs its own overhead.
It's important to note that overlapping compute and transfer doesn't always benefit a given workload - in addition to the overhead issues described above, the workload itself has to spend comparable amounts of time on compute and on data transfer for the overlap to pay off. Due to Amdahl's Law, the potential speedup of 2x or 3x falls off as the workload becomes either transfer-bound or compute-bound.
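A rough back-of-envelope model of that overhang (my own approximation, not the SDK's formula): take the steady state as limited by the slowest engine and add roughly one chunk of each remaining stage for pipeline fill and drain.

# Measured stage times from the output above, in ms.
h2d, kernel, d2h = 2.726, 0.611, 2.723
n_streams = 4

bottleneck = max(h2d, kernel, d2h)                          # steady-state limit (both copy engines busy)
fill_drain = (h2d + kernel + d2h - bottleneck) / n_streams  # one chunk of the remaining stages
print(bottleneck + fill_drain)                              # ~3.6 ms

That already sits well above the ideal 2.7 ms, and the remaining gap to the measured 4.3 ms is consistent with the per-stream inter-engine synchronization overhead described above.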