Training time on GeForce GTX Titan X with CUDA 7.5 - caffe

I'm running the Caffe library on a GeForce GTX Titan X with CUDA 7.5 (Ubuntu 14). I'm not sure whether Caffe is properly configured for my setup. My dataset consists of 256 x 256 pixel images (3 channels), with 100,000 training and 10,000 test samples. For a first test I'm using AlexNet with new_height=256, new_width=256, crop_size=227. Running 1000 training iterations on one Titan X with batch_size=256 takes about 17 minutes. Isn't that too slow for this hardware?
Any help and advice is kindly appreciated!

Running 1000 iterations with a batch of 256 images works out to:
(256 height * 256 width * 3 channels * 256 batch size * 1000 iterations) bytes / ((1024 * 1024) bytes/MB * (17 * 60) seconds) = roughly 47 MB/s of input data throughput.
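The same estimate as a quick arithmetic check (assuming one byte per pixel per channel, as above):

    # Rough input-data throughput: 256x256x3 images, batch_size=256, 1000 iterations,
    # assuming one byte per pixel per channel, processed in 17 minutes.
    bytes_per_image = 256 * 256 * 3
    total_bytes = bytes_per_image * 256 * 1000
    seconds = 17 * 60
    throughput_mb_per_s = total_bytes / (1024 * 1024) / seconds
    print(round(throughput_mb_per_s))  # -> 47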
The following may improve performance:
If the original images are at a higher resolution, preprocess them down to 256x256; this greatly reduces the number of pixels read from the hard disk.
Compile Caffe with the cuDNN flag (USE_CUDNN := 1 in Makefile.config). This can give roughly a 30% speedup.
Create an LMDB database from the input set and train from the LMDB data.
Use an SSD instead of a SATA hard disk.

No, it is not too slow. Check out this link for Caffe performance and hardware configuration benchmarks.

Related

What is the cause of the low CPU utilization in rllib PPO? What does 'cpu_util_percent' measure?

I implemented multi-agent PPO in RLlib with a custom environment. It learns and works well except for the speed. I wonder whether an underutilized CPU might be causing the issue, so I want to know what ray/tune/perf/cpu_util_percent measures. Does it measure only the rollout workers, or is it averaged over the learner as well? And what might be the cause? (All my runs give an average of 13% CPU usage.)
run on gcp
ray 2.0
python3.9
torch1.12
head: n1-standard-8 with 1 v100 gpu
2 workers: c2-standard-60
num_workers: 120 # this worker != machine, num_workers = num_rollout_workers
num_envs_per_worker: 1
num_cpus_for_driver: 8
num_gpus: 1
num_cpus_per_worker: 1
num_gpus_per_worker: 0
train_batch_size: 12000
sgd_minibatch_size: 3000
I tried a smaller batch size (4096) with fewer workers (10), and a larger batch_size (480000); all resulted in 10-20% CPU usage.
I cannot share the code, but a minimal sketch of the configuration is shown below.
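For reference, a minimal dict-style sketch of the setup above (hypothetical: the environment name is a placeholder and the real training script differs):

    from ray import tune

    config = {
        "env": "my_multiagent_env",   # placeholder for the registered custom env
        "framework": "torch",
        "num_workers": 120,           # rollout workers, not machines
        "num_envs_per_worker": 1,
        "num_cpus_for_driver": 8,
        "num_gpus": 1,
        "num_cpus_per_worker": 1,
        "num_gpus_per_worker": 0,
        "train_batch_size": 12000,
        "sgd_minibatch_size": 3000,
    }

    tune.run("PPO", config=config)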

What is the correct value of 'rescale_grad' in case of multi-GPU machine?

My batch size is 512 and I have 8 GPUs.
Should I define:
rescale_grad = 1. / 512 or rescale_grad = 1. / (8 * 512)?
Thanks!
The batch size is tied to the whole machine (all GPUs together), not to each individual GPU. Quote (from here):
Workload Partitioning
By default, MXNet partitions a data batch evenly among the available
GPUs. Assume a batch size b and assume there are k GPUs, then in one
iteration each GPU will perform forward and backward on b/k examples.
The gradients are then summed over all GPUs before updating the model.
In your case b is 512, therefore you should use rescale_grad = 1. / 512.
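A minimal sketch, assuming the standard MXNet optimizer API (the learning rate is just an example value):

    import mxnet as mx

    batch_size = 512  # total batch size per iteration, split across the 8 GPUs

    # rescale_grad normalizes the gradient that is summed over all GPUs,
    # so it uses the full batch size, not batch_size * num_gpus.
    opt = mx.optimizer.SGD(learning_rate=0.01, rescale_grad=1.0 / batch_size)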

why the difference in cuda cores between nvidia control panel and device query?

Q1: Why is the information I get from the Nvidia control panel -> System Information different from the output of the deviceQuery example in the CUDA SDK?
system information:
cuda cores 384 cores
memory data rate 1800MHz
device query output:
cuda cores = 3 MP x 192 SP/MP = 576 cuda cores
memory clock rate 900MHz
Q2: How can I calculate the GFLOPS of my GPU from the deviceQuery data?
The most commonly used formula I found is the one mentioned here, which requires the number of mul-add units and the number of mul units, which I don't know:
Max GFLOPS = cores x SIMDs x ([mul-add] x 2 + [mul] x 1) x clock speed
Q1: It tells you right there, just above that line:
MapSMtoCores for SM 5.0 is undefined. Default to use 192 Cores/SM
Maxwell, the architecture behind the GeForce 840M, uses 128 "cores" per "SMM". The older deviceQuery sample does not know this and falls back to 192 cores/SM, giving 3 x 192 = 576 instead of the correct 3 x 128 = 384. (The memory figures differ for a similar reporting reason: 900 MHz is the physical DDR clock, while 1800 MHz is the effective double data rate.)
Q2: "Cores" * frequency * 2 (because each core can do one multiply+add, i.e. two floating-point operations, per clock)

Why are overlapping data transfers in CUDA slower than expected?

When I run the simpleMultiCopy in the SDK (4.0) on the Tesla C2050 I get the following results:
[simpleMultiCopy] starting...
[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)
> Device name: Tesla C2050
> CUDA Capability 2.0 hardware with 14 multi-processors
> scale_factor = 1.00
> array_size = 4194304
Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
(compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000))
Measured timings (throughput):
Memcpy host to device : 2.725792 ms (6.154988 GB/s)
Memcpy device to host : 2.723360 ms (6.160484 GB/s)
Kernel : 0.611264 ms (274.467599 GB/s)
Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 6.060416 ms
Compute can overlap with one transfer: 5.449152 ms
Compute can overlap with both data transfers: 2.725792 ms
Average measured timings over 10 repetitions:
Avg. time when execution fully serialized : 6.113555 ms
Avg. time when overlapped using 4 streams : 4.308822 ms
Avg. speedup gained (serialized - overlapped) : 1.804733 ms
Measured throughput:
Fully serialized execution : 5.488530 GB/s
Overlapped using 4 streams : 7.787379 GB/s
[simpleMultiCopy] test results...
PASSED
This shows that the expected runtime is 2.7 ms, while it actually takes 4.3 ms. What exactly causes this discrepancy? (I've also posted this question at http://forums.developer.nvidia.com/devforum/discussion/comment/8976.)
The first kernel launch cannot start until the first memcpy is completed, and the last memcpy cannot start until the last kernel launch is completed. So, there is "overhang" that introduces some of the overhead you are observing. You can decrease the size of the "overhang" by increasing the number of streams, but the streams' inter-engine synchronization incurs its own overhead.
It's important to note that overlapping compute and transfer doesn't always benefit a given workload - in addition to the overhead issues described above, the workload itself has to spend roughly equal amounts of time doing compute and data transfer to get the full benefit. Due to Amdahl's Law, the potential speedup of 2x or 3x falls off as the workload becomes either transfer-bound or compute-bound.
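A back-of-the-envelope model of that overhang, using the single-copy timings measured above and assuming the C2050's two copy engines let host-to-device and device-to-host copies run concurrently:

    # Measured totals from the simpleMultiCopy output, split over 4 streams.
    h2d_ms, kernel_ms, d2h_ms, n_streams = 2.725792, 0.611264, 2.723360, 4

    # The H2D engine stays busy for the full ~2.73 ms; the last chunk's kernel and
    # its D2H copy can only start after that, so roughly one chunk of kernel time
    # plus one chunk of D2H time cannot be hidden.
    estimate_ms = h2d_ms + kernel_ms / n_streams + d2h_ms / n_streams
    print(f"{estimate_ms:.2f} ms")  # ~3.56 ms

The measured 4.31 ms sits between this estimate and the fully serialized 6.11 ms; the remaining gap is the inter-engine synchronization overhead mentioned above.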

cuda: effective Bandwidth in the sdk example of Reduction

In reduction.pdf the reduction method is introduced through 7 steps, operating on 16777216 elements. In the first step the effective bandwidth is 2.083 GB/s; how does that 2.083 GB/s come out? And how does the second step's bandwidth of 4.854 GB/s come out?
The bandwidth figures are calculated as the number of bytes in the reduction input data divided by the kernel execution time (note: the input is 2^22 integers, i.e. 2^22 x 4 = 16777216 bytes). The calculation is clearly shown on page 10 of the pdf that ships in the SDK in reduction/doc.
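A quick sketch of that arithmetic, using the bandwidth figures quoted in the question to back out the corresponding kernel times (the times are derived here, not copied from the PDF):

    # Effective bandwidth = bytes of input data / kernel execution time.
    n_elements = 1 << 22        # 4M 32-bit integers
    n_bytes = n_elements * 4    # 16777216 bytes

    for step, gb_per_s in [("step 1", 2.083), ("step 2", 4.854)]:
        time_ms = n_bytes / (gb_per_s * 1e9) * 1e3
        print(f"{step}: {n_bytes} bytes / {time_ms:.3f} ms = {gb_per_s} GB/s")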