GPU Utilization - cuda

I have been using the NVML library to get the graphics and memory utilization values for
the Rodinia benchmark suite. I observe that, at different frequencies, the utilization of the same application shows different values. From the wiki link http://en.wikipedia.org/wiki/CPU_usage it seems the metric does not take into account the various stalls (memory, branch, etc.). What exactly is this utilization measuring during a time interval? And why does its value vary with frequency?
Thanks

The definition of the utilization rates is given in the nvml documentation, p90:
8.12 nvmlUtilization_t Struct Reference
#include <nvml.h>
Data Fields
• unsigned int gpu
Percent of time over the past second during which one or more kernels was executing on the GPU.
• unsigned int memory
Percent of time over the past second during which global (device) memory was being read or written.
The utilization rates for a given workload will likely vary if you change the application clocks (I assume that is what you mean by frequency).
For example, if the GPU core clock runs faster, the workload may take less time to complete, so kernels are executing for a smaller fraction of each one-second sampling window and the reported utilization changes accordingly.
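For reference, a minimal sketch of querying these counters yourself through NVML (device 0 assumed, error checking omitted; link with -lnvidia-ml):
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlUtilization_t util;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetUtilizationRates(dev, &util);
    // util.gpu: percent of the last sample period during which at least one kernel was running
    // util.memory: percent of the last sample period during which device memory was read or written
    printf("gpu %u%%  memory %u%%\n", util.gpu, util.memory);
    nvmlShutdown();
    return 0;
}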

Related

Optimising Monte-Carlo algorithm | Reduce operation on GPU & Eigenvalues problem | Many-body problem

This problem resembles a typical many-body problem, but with some extra calculations.
I am working on a generalized Metropolis Monte-Carlo algorithm for modeling a large number of arbitrary quantum systems (magnetic ions, for example) that interact classically with each other. But that does not actually matter for the question.
There are more than 100000 interacting objects, each described by a coordinate r_i and a set of parameters s_i describing its current state.
These can be represented in C++/CUDA as float4 and float4 vectors.
To update the system with the Monte-Carlo method for such systems, we need to randomly sample one object from the whole set; calculate the interaction function f(r_j - r_i, s_j) for it; substitute the result into a matrix and find its eigenvectors, from one of which a new state will be calculated.
The interaction is additive as usual, i.e. the total interaction will be the sum between all pairs.
Formally this can be decomposed into steps:
Generate a random number i
Calculate the interaction function for all possible pairs f(r_j - r_i, s_j)
Sum it. The result will be a vector F
Multiply it by some tensor and add another one, h = h + dot(F,t). Some basic linear algebra stuff.
Find the eigenvectors and eigenvalues, and, based on some simple algorithm, choose one vector V_k and write it back to the array s_j of all objects' states.
The big question is which parts of this can be computed in CUDA kernels.
I am quite new to CUDA programming. So far I have ended up with the following algorithm:
// a good random generator
std::uniform_int_distribution<std::mt19937::result_type> random_sampler(0, N-1);

for (int i = 0; i < a_lot; ++i) {
    // sample the number of an object
    nextObject = random_sampler(rng);
    // call kernel to calculate the interaction and sum it up by threads;
    // it also writes the new state back to the d_s array
    CUDACalcAndReduce<THREADS><<<blocksPerGrid, THREADS>>>(d_r, d_s, d_sum, newState, nextObject, previousObject, N);
    // copy the sum
    cudaMemcpy(buf, d_sum, sizeof(float)*4*blocksPerGrid, cudaMemcpyDeviceToHost);
    // manually reduce the rest of the sum
    total = buf[0];
    for (int b = 1; b < blocksPerGrid; ++b) {
        total += buf[b];
    }
    // find eigenvalues etc. and determine the new state of the object
    // (just linear algebra with complex numbers)
    newState = calcNewState(total);
    // the new state will be written by the CUDA kernel on the next iteration
    // remember the previous number of the object
    previousObject = nextObject;
}
The problem is the continuous data transfer between CPU and GPU; the actual number of bytes is blocksPerGrid*4*sizeof(float), which is sometimes just a few bytes. I optimized the CUDA code following the guide from NVIDIA, and now it is limited by the bus speed between CPU and GPU. I guess switching to pinned memory will not help much, since the number of transferred bytes is so low.
I used NVIDIA Visual Profiler and it shows the following:
most of the time is wasted transferring the data to the CPU. As one can see from the inset, the transfer speed is 57.143 MB/s and the size is only 64 B!
The question is: is it worth moving the logic of the eigenvalue algorithm into a CUDA kernel?
Then there would be no data transfer between CPU and GPU at all. The problem with this algorithm is that only one object can be updated per iteration, which means I could run the eigensolver on only one CUDA core. ;( Would that be so much slower than my CPU that it would eliminate the advantage of keeping the data in GPU RAM?
The matrix size for the eigensolver does not exceed 10x10 complex numbers. I've heard that cuBLAS can be called from CUDA kernels without involving CPU functions, but I am not sure how that is implemented.
UPD-1
As was mentioned in the comment section:
For each iteration we need to diagonalize only one 10x10 complex Hermitian matrix, which depends on the total calculated interaction function f. In general we are not allowed to compute a new sum of f before we have updated the state of the sampled object based on the eigenvectors and eigenvalues of that 10x10 matrix.
Due to the stochastic nature of the Monte-Carlo approach, we need all 10 eigenvectors to pick a new state for the sampled object.
However, the double-buffering idea suggested in the comments could work if we calculate the total sum of f for the next, j-th, iteration without the contribution of the i-th sampled object and then add it later. I need to test it carefully in action...
UPD-2
The specs are:
CPU: 4-core Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
GPU: GTX 960
Quite outdated, but I might get access to a better system. However, switching to a GTX 1660 SUPER did not affect the performance, which suggests that the PCI bus is the bottleneck ;)
The question is: is it worth moving the logic of the eigenvalue algorithm into a CUDA kernel?
It depends on the system. Old CPU + new GPU? Both new? Both old?
Generally, a single CUDA thread is a lot slower than a single CPU thread, because the CUDA compiler does not vectorize its loops while the host C++ compiler does. So you need to use 10-100 CUDA threads to make the comparison fair.
For the optimizations:
According to the image, the algorithm currently loses 1 microsecond per step to the serial part. 1 microsecond is not much compared to the usual kernel-launch latency from the CPU, but it is significant when the GPU launches the kernel itself (dynamic parallelism).
The CUDA graph feature lets the overall algorithm re-launch every step (kernel) automatically and complete quicker if the steps are not CPU-dependent; see the stream-capture sketch after this list. It is mainly intended for "graph"-like workloads where one kernel leads to multiple kernels that later join in another kernel, and so on.
The CUDA dynamic-parallelism feature lets a kernel's CUDA threads launch new kernels. This has much better timings than launching from the CPU because it does not wait for synchronization between the driver and the host.
The sampling part's copying could be done in chunks of 100-1000 elements at a time, consumed by the CUDA side over 100-1000 steps, if all parts run in CUDA.
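A minimal stream-capture sketch of the CUDA-graph idea (the reuse of the poster's kernel here is only for illustration; note that kernel arguments are baked into the graph at capture time, so per-step values such as the sampled index would have to live in device memory instead of being passed by value):
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaGraph_t graph;
cudaGraphExec_t graphExec;

// record one Monte-Carlo step into a graph
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
CUDACalcAndReduce<THREADS><<<blocksPerGrid, THREADS, 0, stream>>>(d_r, d_s, d_sum, newState, nextObject, previousObject, N);
// ... any further kernels belonging to the step are captured here ...
cudaStreamEndCapture(stream, &graph);
// CUDA 10/11 signature; CUDA 12 replaces the last three arguments with a single flags parameter
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);

// replaying the instantiated graph is cheaper than re-launching the kernels one by one
for (int step = 0; step < a_lot; ++step) {
    cudaGraphLaunch(graphExec, stream);
}
cudaStreamSynchronize(stream);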
If I were to write it, I would do it like this:
launch a loop kernel (only 1 CUDA thread) that is parent
start loop in the kernel
do real (child) kernel-launching within the loop
since every iteration has a serial dependency, it should sync before continuing to the next iteration.
end the parent after 100-1000 sized chunk is complete and get new random data from CPU
When the parent kernel ends, it shows up in the profiler as a single long kernel launch, with none of the CPU-based inefficiencies.
On top of the time saved by not synchronizing so often, there would be consistent performance between the 10x10 matrix part and the other kernel part, because they always run on the same hardware rather than on a separate CPU and GPU.
Since random-number generation is always an input to the system, it can at least be double-buffered to hide the CPU-to-GPU copy latency behind the computation (see the sketch below). If I recall correctly, generating the random numbers is much cheaper than sending the data over the PCIe bridge, so this would mostly hide the slowness of the transfer.
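A sketch of what that double-buffering could look like (CHUNK, the buffer names and the two streams are illustrative choices, not from the original code):
const int CHUNK = 1000;                  // Monte-Carlo steps covered by one upload
int *h_idx[2], *d_idx[2];
cudaStream_t copyStream, computeStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&computeStream);
for (int b = 0; b < 2; ++b) {
    cudaMallocHost(&h_idx[b], CHUNK * sizeof(int));   // pinned, so the async copy can overlap
    cudaMalloc(&d_idx[b], CHUNK * sizeof(int));
}

int cur = 0;
// fill and upload the first chunk of sampled indices
for (int k = 0; k < CHUNK; ++k) h_idx[cur][k] = random_sampler(rng);
cudaMemcpyAsync(d_idx[cur], h_idx[cur], CHUNK * sizeof(int), cudaMemcpyHostToDevice, copyStream);
cudaStreamSynchronize(copyStream);

for (int chunk = 0; chunk < a_lot / CHUNK; ++chunk) {
    int next = 1 - cur;
    // generate and upload the next chunk while the current one is being consumed
    for (int k = 0; k < CHUNK; ++k) h_idx[next][k] = random_sampler(rng);
    cudaMemcpyAsync(d_idx[next], h_idx[next], CHUNK * sizeof(int), cudaMemcpyHostToDevice, copyStream);
    // ... launch the kernels for these CHUNK steps on computeStream, reading d_idx[cur] ...
    cudaStreamSynchronize(computeStream);
    cudaStreamSynchronize(copyStream);
    cur = next;
}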
If this is a massively parallel experiment, like running the executable N times, you can still launch around 10 instances of the executable at once and let them keep the GPU busy with good efficiency. This is not practical if too much memory is required per instance. Most GPUs, except ancient ones, can run tens of kernels in parallel if no single kernel can fully occupy all of the GPU's resources.

Why is the cpu faster than the gpu for small inputs?

I have experienced that the CPU executes faster than the GPU for small input sizes. Why is this? Preparation, data transfer or what?
For example for the kernel and CPU function(CUDA code):
__global__ void squareGPU(float* d_in, float* d_out, unsigned int N) {
    unsigned int lid = threadIdx.x;
    unsigned int gid = blockIdx.x*blockDim.x+lid;
    if(gid < N) {
        d_out[gid] = d_in[gid]*d_in[gid];
    }
}

void squareCPU(float* d_in, float* d_out, unsigned int N) {
    for(unsigned int i = 0; i < N; i++) {
        d_out[i] = d_in[i]*d_in[i];
    }
}
Running these functions 100 times on an array of 5000 32-bit floats, I get the following using a small test program
Size of array:
5000
Block size:
256
You chose N=5000 and block size: 256
Total time for GPU: 403 microseconds (0.40ms)
Total time for CPU: 137 microseconds (0.14ms)
Increasing the size of the array to 1000000, I get:
Size of array:
1000000
Block size:
256
You chose N=1000000 and block size: 256
Total time for GPU: 1777 microseconds (1.78ms)
Total time for CPU: 48339 microseconds (48.34ms)
I am not including time used to transfer data between host and device(and vice versa), in fact, here is the relevant part of my testing procedure:
gettimeofday(&t_start, NULL);
for(int i = 0; i < 100; i++) {
    squareGPU<<<num_blocks, block_size>>>(d_in, d_out, N);
}
cudaDeviceSynchronize();
gettimeofday(&t_end, NULL);
After choosing a block size, I compute the number of blocks relative to the array size: unsigned int num_blocks = ((array_size + (block_size-1)) / block_size);
Answering the general question of CPU vs. GPU performance comparison is fairly complicated, and generally involves consideration of at least 3 or 4 different factors that I can think of. However you've simplified the problem somewhat by isolating your measurement to the actual calculations, as opposed to the data transfer, or the "complete operation".
In this case, there are probably at least 2 things to consider:
Kernel launch overhead - Launching a kernel on a GPU carries an "approximately" fixed cost overhead, usually in the range of 5 to 50 microseconds per kernel launch. This means that if you size the amount of work such that your CPU can do it in less than that amount of time, there is no way the GPU can be faster. Even above that level, there is a linear function which describes that overhead model (a rough sketch of it follows this list), which you can work out if you wish, to compare CPU vs. GPU performance in the presence of a fixed cost overhead. When comparing small test cases this is an important factor to consider; however, my guess is that because most of your test case timings are well above 50 microseconds, we can safely "ignore" this factor, as an approximation.
The actual performance/capability of the actual CPU vs. the actual GPU. This is generally hard to model, depends on the specific hardware you are using, and you haven't provided that info. However we can make some observations anyway, and some conjecture, expanding on this in the next section, based on the data you have provided.
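A rough sketch of that fixed-overhead model (the symbols r_cpu, r_gpu and t_launch are mine, not from the question): if r_cpu and r_gpu are the per-element processing rates and t_launch is the per-launch overhead, then approximately
T_cpu(N) = N / r_cpu
T_gpu(N) = t_launch + N / r_gpu
so, assuming r_gpu > r_cpu, the GPU only wins once N exceeds roughly t_launch / (1/r_cpu - 1/r_gpu).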
Your two cases involve a total amount of work described by N, considering N=5000 and N=1000000. Building a little chart:
N | CPU time (µs) | GPU time (µs)
5000 | 137 | 403
1000000 | 48339 | 1777
So we see that in the case of the CPU, when the work increased by a factor of 200, the execution time increased by a factor of ~352, whereas in the GPU case, the execution time increased by a factor of ~4.5. We'll need to explain both of these "non-linearities" in order to have a reasonable guess as to what is going on.
Effects of cache - because you are running your test cases 100 times, the caches could have an effect. In the CPU case, this is my only guess as to why you are not seeing a linear relationship. I would guess that at the very small size you are in some CPU "inner" cache, with 40KB of data "in view". Going to the larger size, you have 8MB of data in view, and although this probably fits in the "outer" cache on your CPU, it's possible it may not, and even if it does, the outer cache may yield slower overall performance than the inner cache. I would guess this is the reason for the CPU appearing to get worse as the data gets larger: your CPU is being affected non-linearly, in a negative way, by the larger data set. In the GPU case, the outer cache is at most 6MB (unless you are running on an Ampere GPU), so your larger data set does not fit completely in the outer cache.
Effects of machine saturation - both the CPU and GPU can be fully "loaded" or partially loaded, depending on the workload. In the CPU case, I am guessing you are not using any multi-threading, therefore your CPU code is restricted to a single core. (And, your CPU almost certainly has multiple cores available.) Your single threaded code will approximately "saturate" i.e. keep that single core "busy". However the GPU has many cores, and I would guess that your smaller test case (which will work out to 5000 threads) will only partially saturate your GPU. What I mean is that some of the GPU thread processing resources will be idle in the smaller case (unless you happen to be running on the smallest of GPUs). 5000 threads is only about enough to keep 2 GPU SMs busy, so if your GPU has more than 2 SMs, some of its resource is idle during the smaller test case, whereas your million-thread larger test case is enough to saturate i.e. keep all thread processing resources busy, on any current CUDA GPU. The effect of this is that while the CPU doesn't benefit at all from a larger test case (you should consider using multi-threading), your GPU is likely benefitting. The larger test case allows your GPU to do more work in the same amount of time that the smaller test case is taking. Therefore the GPU benefits non-linearly in a positive way, from the larger workload.
The GPU is also better able to mitigate the effects of missing in the outer cache, when it is given a large enough workload. This is called the latency-hiding effect of the GPU in the presence of a "large" parallel workload, and the CPU doesn't have (or doesn't have as much of) a corresponding mechanism. So depending on your exact CPU and GPU, this could be an additional factor. I don't intend to give a full tutorial on latency-hiding here, but the concept is based partially on the item 2 above, so you may gather the general idea/benefit from that.

CUDA Lookup Table vs. Algorithm

I know this can be tested, but I am interested in the theory: on paper, which should be faster?
I'm trying to work out what would be theoretically faster, a random look-up from a table in shared memory (so bank conflicts possible) vs an algorithm with say, 'n' fp multiplications.
Best case scenario is the shared memory look-up has no bank conflicts and so takes 20-40 clock cycles, worst case is 32 bank conflicts and 640-1280 clock cycles. The multiplications will be 'n' * cycles per instruction. Is this proper reasoning?
Do the fp multiplications each take 1 cycle? 5 cycles? At which point, as a number of multiplications, does it make sense to use a shared memory look-up table?
The multiplications will be 'n' x cycles per instruction. Is this proper reasoning?
When doing 'n' fp multiplications, the cores are kept busy with those operations. It's probably not just 'mult' instructions; there will be other ones like 'mov' in between as well, so it might be n*3 instructions in total. When you fetch a cached value from shared memory, the (20-40) * 5 (a guess at the average maximum bank conflicts) = ~150 clocks are cycles during which the cores are free to do other things. If the kernel is compute bound (limited), then using shared memory might be more efficient. If the kernel has limited shared memory, or using more shared memory would result in fewer in-flight warps, then re-calculating the value would be faster.
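A minimal sketch of the shared-memory table variant (the table size, the names and the surrounding kernel are made up for illustration):
#define LUT_SIZE 256

__global__ void lutKernel(const float *g_lut, const unsigned char *indices, float *out, int n)
{
    __shared__ float lut[LUT_SIZE];

    // cooperatively stage the table into shared memory once per block
    for (int i = threadIdx.x; i < LUT_SIZE; i += blockDim.x)
        lut[i] = g_lut[i];
    __syncthreads();

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = lut[indices[gid]];   // random access, so bank conflicts are possible
}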
Do the fp multiplications each take 1 cycle? 5 cycles?
When I wrote this it was 6 cycles but that was 7 years ago. It might (or might not) be faster now. This is only for a particular core though and not the entire SM.
At which point, as a number of multiplications, does it make sense to use a shared memory look-up table?
It's really hard to say. There are a lot of variables here like GPU generation, what the rest of the kernel is doing, the setup time for the shared memory, etc.
A problem with generating random numbers inside a kernel is also the additional register requirements. This might slow down the rest of the kernel, because the higher register usage can reduce occupancy.
Another solution (again depending on the problem) would be to use a GPU RNG to fill a global memory array with random numbers and have your kernel read from that (see the sketch below). It would take 300-500 clock cycles per access, but there would not be any bank conflicts. Also, with Pascal (not released yet) there will be HBM2, which will likely lower the global memory access time even further.
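A sketch of that pre-generation step using the cuRAND host API (the generator type, seed and count are placeholders; link with -lcurand):
#include <curand.h>

float *d_rand;
size_t count = 1 << 20;
cudaMalloc(&d_rand, count * sizeof(float));

curandGenerator_t gen;
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
curandGenerateUniform(gen, d_rand, count);   // fills d_rand with uniform floats in (0,1]
curandDestroyGenerator(gen);
// kernels can then read d_rand from global memory instead of computing random numbers in-kernel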
Hope this helps. Hopefully some other experts can chime in and give you better information.

Increase utilization of GPU when using Mathematica CUDADot?

I've recently started using Mathematica's CUDALink with a GT430 and am using CUDADot to multiply a 150000x1038 matrix (encs) by a 1038x1 matrix (probe). Both encs and probe are registered with the memory manager:
mmEncs = CUDAMemoryLoad[encs];
mmProbe = CUDAMemoryLoad[probe];
I figured that a dot product of these would max out the GT430, so I tested with the following:
For[i = 0, i < 10, i++,
CUDADot[mmEncs, mmProbe];
]
While it runs, I use MSI's "Afterburner" utility to monitor GPU usage. The following screenshot shows the result:
There's a distinct peak for each CUDADot operation and, overall, I'd say this picture indicates that I'm utilizing less than 1/4 of GPU capacity. Two questions:
Q1: Why do peaks max out at 50%? Seems low.
Q2: Why are there such significant periods of inactivity between peaks?
Thanks in advance for any hints! I have no clue w.r.t. Q1 but maybe Q2 is because of unintended memory transfers between host and device?
Additional info since original posting: CUDAInformation[] reports "Core Count -> 64" but NVIDIA Control Panel reports "CUDA Cores: 96". Is there any chance that CUDALink will under-utilize the GT430 if it's operating on the false assumption that it has 64 cores?
I am going to preface this answer by noting that I have no idea what "MSI Afterburner" is really measuring, or at what frequency it is sampling that quantity which it measures, and I don't believe you do either. That means we don't know what either the units of x or y axis in your screenshot are. This makes any quantification of performance pretty much impossible.
1. Why do peaks max out at 50%? Seems low.
I don't believe you can say it "seems low" if you don't know what it is really measuring. If, for example, it measures instruction throughput, it could be that the Mathematica dot kernel is memory bandwidth limited on your device. That means the throughput bottleneck of the code would be memory bandwidth, rather than SM instruction throughput. If you were to plot memory throughput, you would see 100%. I would expect a gemv operation to be memory bandwidth bound, so this result is probably not too surprising.
2. Why are there such significant periods of inactivity between peaks?
The CUDA API has device and host side latency. On a WDDM platform (so Windows Vista, 7, 8, and whatever server versions are derived from them), this host side latency is rather high and the CUDA driver does batching of operations to help amortise that latency. This batching can lead to "gaps" or "pauses" in GPU operations. I think that is what you are seeing here. NVIDIA have a dedicated compute driver (TCC) for Tesla cards on the Windows platform to overcome these limitations.
A much better way to evaluate the performance of this operation would be to time the loop yourself, compute an average time per call, calculate the operation count (a dot product has a known lower bound you can work out from the dimensions of the matrix and vector), and compute a FLOP/s value. You can compare that to the specifications of your GPU to see how well or badly it is performing.
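As a worked example using the dimensions from the question (the operation count is a lower bound, not a measured figure): the 150000x1038 by 1038x1 product needs at least 2 * 150000 * 1038 ≈ 3.1e8 floating-point operations per call, so dividing that by your measured average time per call gives the FLOP/s figure to compare against the GT430's specifications.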

Increasing block size decreases performance

In my CUDA code, if I increase blocksizeX and blocksizeY it actually takes more time (therefore I run it at 1x1). Also, a chunk of my execution time (for example 7 out of 9 s) is taken by just the call to the kernel. In fact I am quite amazed that even if I comment out the entire kernel the time is almost the same. Any suggestions where and how to optimize?
P.S. I have edited this post with my actual code. I am downsampling an image, so every 4 neighboring pixels (for example pixels 1,2 from row 1 and 1,2 from row 2) give one output pixel. I get an effective bandwidth of 0.5 GB/s compared to a theoretical maximum of 86.4 GB/s. The time I use is the difference between calling the kernel with instructions and calling an empty kernel.
It looks pretty bad to me right now, but I can't figure out what I am doing wrong.
__global__ void streamkernel(int *r_d, int *g_d, int *b_d, int height, int width, int *f_r, int *f_g, int *f_b) {
    int id = blockIdx.x*blockDim.x*blockDim.y + threadIdx.y*blockDim.x + threadIdx.x + blockIdx.y*gridDim.x*blockDim.x*blockDim.y;
    int number = 2*(id%(width/2)) + (id/(width/2))*width*2;
    if (id < height*width/4)
    {
        f_r[id] = (r_d[number]+r_d[number+1]+r_d[number+width]+r_d[number+width+1])/4;
        f_g[id] = (g_d[number]+g_d[number+1]+g_d[number+width]+g_d[number+width+1])/4;
        f_b[id] = (b_d[number]+b_d[number+1]+b_d[number+width]+b_d[number+width+1])/4;
    }
}
Try looking up the matrix multiplication example in CUDA SDK examples for how to use shared memory.
The problem with your current kernel is that it's doing 4 global memory reads and 1 global memory write for each 3 additions and 1 division. Each global memory access costs roughly 400 cycles. This means you're spending the vast majority of time doing memory access (what GPUs are bad at) rather than compute (what GPUs are good at).
Shared memory in effect allows you to cache this so that amortized, you get roughly 1 read and 1 write at each pixel for 3 additions and 1 division. That is still not doing so great on the CGMA ratio (compute to global memory access ratio, the holy grail of GPU computing).
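A sketch of what staging the reads through shared memory could look like for one channel (TILE, the single-channel signature, and the assumption that width and height are multiples of 2*TILE are illustrative choices, not from the original post):
#define TILE 16   // each block produces a TILE x TILE tile of output pixels

__global__ void downsampleShared(const int *in, int *out, int height, int width)
{
    // input tile is 2*TILE x 2*TILE; assumes width and height are multiples of 2*TILE
    __shared__ int tile[2 * TILE][2 * TILE];

    // cooperative, coalesced load: consecutive threads read consecutive input elements
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    for (int i = tid; i < 2 * TILE * 2 * TILE; i += blockDim.x * blockDim.y) {
        int ly = i / (2 * TILE);
        int lx = i % (2 * TILE);
        int gy = blockIdx.y * 2 * TILE + ly;
        int gx = blockIdx.x * 2 * TILE + lx;
        tile[ly][lx] = in[gy * width + gx];
    }
    __syncthreads();

    // each thread averages its 2x2 group out of shared memory
    int outX = blockIdx.x * TILE + threadIdx.x;
    int outY = blockIdx.y * TILE + threadIdx.y;
    int sum = tile[2 * threadIdx.y][2 * threadIdx.x]
            + tile[2 * threadIdx.y][2 * threadIdx.x + 1]
            + tile[2 * threadIdx.y + 1][2 * threadIdx.x]
            + tile[2 * threadIdx.y + 1][2 * threadIdx.x + 1];
    out[outY * (width / 2) + outX] = sum / 4;
}
// launch with dim3 block(TILE, TILE) and dim3 grid(width / (2*TILE), height / (2*TILE))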
Overall, I think for a simple kernel like this, a CPU implementation is likely going to be faster given the overhead of transferring data across the PCI-E bus.
You're forgetting the fact that one multiprocessor can execute up to 8 blocks simultaneously and the maximum performance is reached exactly then. However there are many factors that limit the number of blocks that can exist in parallel (incomplete list):
Maximum amount of shared memory per multiprocessor limits the number of blocks if #blocks * shared memory per block would be > total shared memory.
Maximum number of threads per multiprocessor limits the number of blocks if #blocks * #threads / block would be > max total #threads.
...
You should try to find a kernel execution configuration that causes exactly 8 blocks to be run on one multiprocessor. This will almost always yield the highest performance, even if the occupancy is not 1.0! From this point on you can iteratively make changes that reduce the number of blocks executed per MP but increase the occupancy of your kernel, and see if the performance increases.
The NVIDIA occupancy calculator (Excel sheet) will be of great help.
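As a more recent, programmatic alternative to the spreadsheet (a sketch; the kernel and the 16x16 block size are just the ones from this question):
int maxBlocksPerSM = 0;
int blockSize = 16 * 16;   // threads per block for a 16x16 configuration
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, streamkernel, blockSize, 0 /* dynamic shared memory */);
printf("resident blocks per multiprocessor: %d\n", maxBlocksPerSM);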