I have the following problem: I want to measure gst_efficiency and gld_efficiency for my CUDA application using nvprof. The documentation distributed with CUDA 5.0 tells me to generate these using the following formulas for devices with compute capability 2.0-3.0:
gld_efficiency = 100 * gld_requested_throughput / gld_throughput
gst_efficiency = 100 * gst_requested_throughput / gst_throughput
For the required metrics the following formulas are given:
gld_throughput = ((128 * global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gst_throughput = ((l2_subp0_write_requests + l2_subp1_write_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gld_requested_throughput = (gld_inst_8bit + 2 * gld_inst_16bit + 4 * gld_inst_32bit + 8 * gld_inst_64bit + 16 * gld_inst_128bit) / gputime
gst_requested_throughput = (gst_inst_8bit + 2 * gst_inst_16bit + 4 * gst_inst_32bit + 8 * gst_inst_64bit + 16 * gst_inst_128bit) / gputime
Since no formulas are given for the metrics used, I assume these are events which can be counted by nvprof. But some of the events don't seem to be available on my GTX 460 (I also tried a GTX 560 Ti). I pasted the output of nvprof --query-events.
Any ideas what's going wrong or what I'm misinterpreting?
EDIT:
I don't want to use the CUDA Visual Profiler, since I'm trying to analyse my application for different parameters. I therefore want to run nvprof with multiple parameter configurations, recording multiple events (each one in its own run), and then output the data in tables. I have this automated already and working for other metrics (e.g. instructions issued) and want to do the same for load and store efficiency. This is why I'm not interested in solutions involving nvvp. By the way, for my application nvvp fails to calculate the metrics required for store efficiency, so it doesn't help me at all in this case.
I'm glad somebody had the same issue :) I was trying to do the very same thing and couldn't use the Visual Profiler, because I wanted to profile like 6000 different kernels.
The formulas on the NVIDIA site are poorly documented - actually the variables can be:
a) events
b) other Metrics
c) different variables dependent on the GPU you have
However, a LOT of the metrics there either have typos or are named a bit differently in nvprof than they are on the site. Also, the variables are not tagged, so you can't tell just by looking whether they are (a), (b) or (c). I used a script to grep them and then had to fix the results by hand. Here is what I found:
1) "l1_local/global_ld/st_hit/miss"
These have "load"/"store" in nvprof instead of "ld"/"st" on site.
2) "l2_ ...whatever... _requests"
These have "sector_queries" in nvprof instead of "requests".
3) "local_load/store_hit/miss"
These have "l1_" in additionally in the profiler - "l1_local/global_load/store_hit/miss"
4) "tex0_cache_misses"
This one has "sector" in it in the profiler - "tex0_cache_sector_misses"
5) "tex_cache_sector_queries"
Missing "0" - so "tex0_cache_sector_queries" in the nvprof.
Finally, the variables:
1) "#SM"
The number of streaming multiprocessors. Get via cudaDeviceProp.
2) "gputime"
Obviously, the execution time on GPU.
3) "warp_size"
The warp size of your GPU; again, get it via cudaDeviceProp.
4) "max_warps_per_sm"
The number of blocks executable on an SM * #SM * warps per block, I guess.
5) "elapsed_cycles"
I found this:
https://devtalk.nvidia.com/default/topic/518827/computeprof-34-active-cycles-34-counter-34-active-cycles-34-value-doesn-39-t-make-sense-to-/
But I'm still not entirely sure I understand it.
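For reference, once the event counts are collected, computing the efficiency is just arithmetic: gputime appears in both numerator and denominator and cancels out, so the metric reduces to a ratio of bytes. Here is a minimal sketch of that last step in plain C++ (the variable names are the nvprof event names from the mapping above; the values are hypothetical placeholders for whatever your runs record):

#include <cstdio>

int main() {
    // Hypothetical event counts, as recorded by nvprof for one kernel run.
    double gld_inst_8bit = 0, gld_inst_16bit = 0, gld_inst_32bit = 1e6,
           gld_inst_64bit = 0, gld_inst_128bit = 0;
    double l1_global_load_hit = 2e4;
    double l2_subp0_read_sector_queries = 1.5e4, l2_subp1_read_sector_queries = 1.5e4;
    double l1_local_load_miss = 0;

    // Requested bytes: numerator of gld_requested_throughput, without gputime.
    double requested_bytes = gld_inst_8bit + 2 * gld_inst_16bit + 4 * gld_inst_32bit
                           + 8 * gld_inst_64bit + 16 * gld_inst_128bit;

    // Actual bytes moved: numerator of gld_throughput, without gputime.
    double actual_bytes = 128 * l1_global_load_hit
                        + 32 * (l2_subp0_read_sector_queries + l2_subp1_read_sector_queries)
                        - 128 * l1_local_load_miss;

    // gputime cancels, so the efficiency is a plain ratio of bytes.
    double gld_efficiency = 100.0 * requested_bytes / actual_bytes;
    printf("gld_efficiency = %.2f %%\n", gld_efficiency);
    return 0;
}

gst_efficiency follows the same pattern with the store events.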
Hopefully this helps you and some other people who encounter the same problem :)
Here is a description of one of the states in my state machine. What I would like to do is to go to the next state after the for loops.
is(s_multiplier){
when(ready){state := s_ready}
// Initialization of C memory to 0
for(i <- 0 to matrixSize - 1){
for(j <- 0 to matrixSize - 1){
memC.write(i + j, 0.asSInt((2 * cellSize).W))
}
}
// Objective 1 : Multiplication for the 128X128
// Objective 2 : Multiplication for the n.m and m.p size parameters given
for(i <- 0 to matrixSize - 1){
for(j <- 0 to matrixSize - 1){
sum := 0.asSInt(cellSize.W)
for(k <- 0 to matrixSize - 1){
sum = sum + memA.read(i * matrixSize + k, true.B) * memB.read(k * matrixSize + j, true.B)
}
memC.write(i * matrixSize + j, sum)
}
}
ready := true.B
}
I just created a boolean variable ready that I set to true after the loops. But as everything is supposed to be executed in parallel, I don't think my code is correct :/
There is a fundamental difference between writing software algorithms and using Chisel to construct the hardware necessary to perform equivalent calculations.
Before discussing the matrix multiplication, consider (as a simpler example) your memory initialization loop. The way you have done it makes sense for software, but for hardware, every time the inner body of the loop is executed, the hardware necessary to initialize that memory cell is added to the hardware graph. That means you have created the wires necessary to initialize 16384 memory locations all at the same time. That's a lot of wires. Not only that, it would require a memory that has 16384 write ports (you probably can't find one). Your hardware would initialize all this memory in one clock cycle, which is good, but it devotes an enormous number of gates to doing so.
Typically one would initialize memory over a number of clock cycles, thereby reducing the amount of hardware required.
Similarly, in the matrix multiplication section you are generating all the hardware necessary to compute a matrix multiplication in one clock cycle. This is great for performance, but this approach requires 2,097,152 hardware multipliers plus a further large number of adders. Every * and + operation in the inner loop generates hardware. The number of gates required to multiply two 32-bit numbers is roughly 1024.
The way to go about this is to figure out how to break the problem down into stages. Maybe this would be a module that can multiply one row by one column and sum the total. You would then use registers to work your way through the matrix, keeping track of the row and column in order to compute the value at every point in the result matrix. To reduce the number of hardware elements, you instead perform the calculation over multiple clock cycles, keeping state information about the progress of the calculation (indices into the rows and columns) in registers or in memory.
There are a lot of ways to try to optimize a function like this, and Chisel is a great language for experimenting and testing out tactics.
Maybe you want to make the memory very wide to accommodate getting multiple cell values at once.
Maybe you will unroll your loop a bit more to compute multiple cell values at once by having more than one cell calculator.
Clever iteration strategies can optimize your memory accesses for both reading and writing.
The point is that writing hardware is not necessarily harder than writing software (and Chisel helps there), but it is pretty different in approach.
I would recommend you spend a little more time with the Chisel Bootcamp. The section on sorting in the 2.3_control_flow page is pretty similar to the discussion above. You can write a one-cycle sorter, but the size of the hardware to do it grows rapidly; in practice it is necessary to break the problem into pieces and spread the calculation over multiple cycles.
Good luck.
I am using CUDA with compute capability 1.2. I am running my CUDA code with each element of a matrix calculated by the addition of two other matrices. I am calculating the value of one element with one thread. I want to know: is it possible to use 2 threads to calculate a single value? If it is possible, can anyone please tell me how to use 2 different threads of the same block to calculate the single value?
If you need to calculate
q = m2[i][k] + m2[(k+1)][j] + p1[(i-1)]*p1[k]*p1[j];
by two cores, use a wider variable and fewer iterations. With int2:
__shared__ int2 m2[N][N],p1[N],q;
it could use two cores but not two threads. If you insist on two threads:
qThread1 = m2[i][k] + m2[(k+1)][j] //in a kernel
...
...
...
qThread2 = p1[(i-1)]*p1[k]*p1[j] //in another kernel
Then you simply add them into q with another thread. Synchronizations, kernel launch overheads and cache utilization can decrease performance, as can the reduced instruction-level parallelism. Maybe kernel occupancy increases, but I'm not sure it outweighs the negatives above.
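If you really want two threads of the same block cooperating on a single output value, a common pattern is to let each thread compute one partial term into shared memory, synchronize, and have one thread combine them. Below is a minimal sketch of that idea, not a drop-in solution: m2 and p1 follow the expression above but are assumed to be flattened row-major arrays of width N, and the launch configuration (one block of two threads) is hypothetical.

// Two threads of one block cooperate on a single q value.
__global__ void two_thread_q(const int *m2, const int *p1, int *q,
                             int i, int j, int k, int N)
{
    __shared__ int partial[2];

    if (threadIdx.x == 0)           // thread 0: the addition part
        partial[0] = m2[i * N + k] + m2[(k + 1) * N + j];
    else if (threadIdx.x == 1)      // thread 1: the multiplication part
        partial[1] = p1[i - 1] * p1[k] * p1[j];

    __syncthreads();                // both partial results must be visible

    if (threadIdx.x == 0)           // one thread combines them into q
        *q = partial[0] + partial[1];
}

// launched as: two_thread_q<<<1, 2>>>(d_m2, d_p1, d_q, i, j, k, N);

Whether this pays off is doubtful: the __syncthreads() and the second thread idling for most of the kernel usually cost more than the two independent operations save, which is exactly the trade-off described above.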
I want to count the number of threads executed, cumulatively, over the whole kernel execution. Is there a native counter for this, or is there any other method to do that? I know that keeping a global variable and incrementing it from each thread would not work, since a variable in global memory does not guarantee synchronized access by the threads.
There are numerous ways to measure thread-level execution efficiency. This answer provides a list of different collection mechanisms. Robert Crovella's answer provides a manual instrumentation method that allows for accurate collection of information. A similar technique can be used to collect divergence information in the kernel.
Number of Threads Launched for Execution (static)
gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z
Number of Threads Launched
gridDim.x * gridDim.y * gridDim.z * ROUNDUP((blockDim.x * blockDim.y * blockDim.z), WARP_SIZE)
This number includes threads that are inactive for the lifetime of the warp.
This can be collected using the PM counter threads_launched.
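To make the difference between the two numbers concrete, here is the host-side arithmetic as a small sketch (the launch configuration is hypothetical; compile with nvcc so dim3 is available):

#include <cstdio>
#include <cuda_runtime.h>

// Round n up to the next multiple of the warp size (32 on current GPUs).
static unsigned int round_up(unsigned int n, unsigned int warp_size)
{
    return ((n + warp_size - 1) / warp_size) * warp_size;
}

int main()
{
    dim3 grid(20, 1, 1), block(100, 1, 1);   // hypothetical launch configuration
    const unsigned int WARP_SIZE = 32;

    unsigned int blocks  = grid.x * grid.y * grid.z;
    unsigned int threads = block.x * block.y * block.z;

    unsigned int static_count   = blocks * threads;                        // 2000
    unsigned int launched_count = blocks * round_up(threads, WARP_SIZE);   // 2560

    printf("static: %u, launched (including inactive lanes): %u\n",
           static_count, launched_count);
    return 0;
}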
Warp Instructions Executed
The counter inst_executed counts the number of warp instructions executed/retired.
Warp Instructions Issued
The counter inst_issued counts the number of instructions issued. inst_issued >= inst_executed. Some instructions will be issued multiple times per instruction executed in order to handle dispatch to narrow execution units or in order to handle address divergence in shared memory and L1 operations.
Thread Instructions Executed
The counter thread_inst_executed counts the number of thread instructions executed. The metric avg_threads_executed_per_instruction can be derived as thread_inst_executed / inst_executed. The maximum value of this ratio is WARP_SIZE.
Not Predicated Off Threads Instructions Executed
Compute capability 2.0 and above devices use instruction predication to disable write-back for threads in a warp as a performance optimization for short sequences of divergent instructions.
The counter not_predicated_off_thread_inst_executed counts the number of thread instructions executed whose write-back was not predicated off. This counter is only available on compute capability 3.0 and above devices.
not_predicated_off_thread_inst_executed <= thread_inst_executed <= WARP_SIZE * inst_executed
This relationship will be off slightly on some chips due to small bugs in thread_inst_executed and not_predicated_off_thread_inst_executed counters.
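Putting these relationships together, the derived per-warp metrics are simple ratios of the counters. A small sketch with hypothetical counter values:

#include <cstdio>

int main()
{
    const double WARP_SIZE = 32.0;

    // Hypothetical counter values as collected by the profiler.
    double inst_executed = 1.0e6;
    double thread_inst_executed = 28.5e6;
    double not_predicated_off_thread_inst_executed = 27.0e6;   // CC 3.0+ only

    // Average active threads per executed warp instruction (at most WARP_SIZE).
    double avg_threads_per_inst = thread_inst_executed / inst_executed;

    // Fraction of thread instructions whose write-back was not predicated off.
    double not_pred_off_fraction =
        not_predicated_off_thread_inst_executed / thread_inst_executed;

    printf("avg threads per instruction: %.2f (of %.0f)\n",
           avg_threads_per_inst, WARP_SIZE);
    printf("not-predicated-off fraction: %.2f\n", not_pred_off_fraction);
    return 0;
}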
Profilers
Nsight Visual Studio Edition 2.x supports collecting the aforementioned counters.
Nsight VSE 3.0 supports a new Instruction Count experiment that can collect per SASS instruction statistics and show the data in table form or next to high level source, PTX, or SASS code. The information is rolled up from SASS to high level source. The quality of the roll up depends on the ability of the compiler to output high quality symbol information. It is recommended that you always look at both source and SASS at the same time. This experiment can collect the following per instruction statistics:
a. inst_executed
b. thread_inst_executed (or active mask)
c. not_predicated_off_thread_inst_executed (active predicate mask)
d. histogram of active_mask
e. histogram of predicate_mask
Visual Profiler 5.0 can accurately collect the aforementioned SM counters. nvprof can collect and show the per-SM details. Visual Profiler 5.x does not support collection of the per-instruction statistics available in Nsight VSE 3.0. Older versions of the Visual Profiler and the CUDA command line profiler can collect many of the aforementioned counters, but the results may not be as accurate as with the 5.0 and later versions of the tools.
Maybe something like this:
__global__ void mykernel(int *current_thread_count, ...){
atomicAdd(current_thread_count, 1);
// the rest of your kernel code
}
int main() {
int tally, *dev_tally;
cudaMalloc((void **)&dev_tally, sizeof(int));
tally = 0;
cudaMemcpy(dev_tally, &tally, sizeof(int), cudaMemcpyHostToDevice);
....
// set up block and grid dimensions, etc.
dim3 grid(...);
dim3 block(...);
mykernel<<<grid, block>>>(dev_tally, ...);
cudaMemcpy(&tally, dev_tally, sizeof(int), cudaMemcpyDeviceToHost);
printf("total number of threads that executed was: %d\n", tally);
....
return 0;
}
You can read more about atomic functions here
Part of the reason for the confusion expressed by many in the comments is that when mykernel is complete (assuming it ran successfully), everyone expects tally to end up with a value equal to grid.x*grid.y*grid.z*block.x*block.y*block.z
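A quick way to check that expectation is to compare tally against the product, right after the final cudaMemcpy in the snippet above (this fragment reuses grid, block and tally from that code):

    int expected = grid.x * grid.y * grid.z * block.x * block.y * block.z;
    if (tally != expected)
        printf("only %d of %d expected threads were counted\n", tally, expected);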
I don't think there is a way to count the number of threads that take a specific branch path. For example, for a histogram (i.e. counting the pixels of each color), it would be nice to have the following:
for (i=0; i<256; i++) // 256 colors, 1 pixel = 1 thread
if (threadidx.x == i)
Histogramme[i] = CUDA_NbActiveThreadsInBranch() // Threads having i as color
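In practice you can get something close to this yourself with atomics, and (for a single branch of interest) with warp vote intrinsics so that only one thread per warp has to issue the atomic. The sketch below assumes one pixel per thread as in the pseudo-code above; pixel_color, flag and the launch configurations are hypothetical, and __activemask() is the CUDA 9+ intrinsic (older toolkits would use __ballot() instead):

// Count how many threads take each "branch" (here: each colour bin).
__global__ void count_branch_threads(const unsigned char *pixel_color,
                                     unsigned int *Histogramme, int n_pixels)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_pixels) return;

    int color = pixel_color[idx];          // decides which branch this thread takes
    atomicAdd(&Histogramme[color], 1u);    // one count per thread that reached here
}

// Variant for a single branch of interest: count the active lanes of each warp
// with a vote, so only one lane per warp issues an atomic (assumes a 1D block).
__global__ void count_one_branch(const int *flag, unsigned int *counter, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    if (flag[idx]) {                                    // the branch of interest
        unsigned int active = __activemask();           // lanes that got here
        if ((threadIdx.x & 31) == __ffs(active) - 1)    // elected lowest lane
            atomicAdd(counter, (unsigned int)__popc(active));
    }
}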
I have to port a pre-existing "host-only" backpropagation implementation to CUDA. I think the nature of the algorithm doesn't matter here, so I won't give much explanation about the way it works. What I think does matter, though, is that it uses 3-dimensional arrays, all three dimensions of which are dynamically allocated.
I use VS2010 with CUDA 5.0, and my device has compute capability 2.1. The original host-only code can be downloaded here
→ http://files.getwebb.org/view-cre62u4d.html
Main points of the code:
patterns from adult.data are loaded into memory, using the Data structure, present in “pattern.h”.
several multi-dimensional arrays are allocated
the algorithm is run over the patterns, using the arrays allocated just before.
If you want to try to run the code, don't forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use 2 layers, 5 neurons, and a learning rate of 0.00001. As you can see, this works perfectly: the "MSE" is improving. For those who have no clue what this algorithm does, let's simply say that it learns how to predict a target value based on 14 variables present in the patterns. The "MSE" decreases, meaning that the algorithm makes fewer mistakes after each "epoch".
I spent a really long time trying to run this code on the device, and I'm still unsuccessful. My last attempt was to simply copy the code that initializes the arrays and runs the algorithm into one big kernel, which failed again. This code can be downloaded there
→ http://files.getwebb.org/view-cre62u4c.html
To be precise, here are the differences with the original host-only code:
f() and fder(), which are used by the algorithm, become device functions.
parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
the "w" array is initialized using a fixed value (0.5), not rand() anymore
a Data structure is allocated in device's memory, and the data are sent to device's memory after they have been loaded from adult.data into host's memory
I think I did the minimal amount of modifications needed to make the code run in a kernel. The "kernel_check_learningData" kernel shows some information about the patterns loaded in the device's memory, proving that the following code, which sends the patterns from the host to the device, did work:
Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers into devices Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.
I see two possibilities:
the code sending the patterns into the device's memory is bugged, despite the fact that it seems to work properly, and provokes a bug much further on, when the forward phase begins.
the CUDA API is not behaving like it should!
I've been desperately searching for my mistake for a very long time, so I wondered whether the community could provide me with some help.
Thanks.
Here's the problem in your code, and why it works in 64 bit machine mode but not 32 bit machine mode.
In your backpropagation kernel, in the forward path, you have a sequence of code like this:
/*
* for layer = 0
*/
for (i = 0; i < N[0]; i++) { // for all neurons i of layer 0
a[0][i] = x[ data->n * pat + i]; // a[0][i] = input i
}
In 32 bit machine mode (Win32 project, --machine 32 is being passed to nvcc), the failure occurs on the iteration i=7 when the write of a[0][7] occurs; this write is out of bounds. At this point, a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.
By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:
cuda-memcheck test_bp
assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.
So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:
a[0] = (double *)malloc( N[0] * sizeof(double *) );
^ oops!!
a[0][] is intended to hold double data but you're allocating pointer storage.
As it turns out, in a 64 bit machine the two types of storage are the same size, so it ends up working. But in a 32-bit machine, a double pointer is 4 bytes whereas double data is 8 bytes. So, in a 32-bit machine, when we index through this array taking data strides of 8 bytes, we eventually run off the end of the array.
Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:
a[layer] = (double *)malloc( N[layer] * sizeof(double) );
which is correct. I see that the original "host-only" code seems to contain this error as well, so there may be a latent defect in that code too.
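So the immediate one-line fix implied above is to allocate double storage for layer 0 in the same way:

a[0] = (double *)malloc( N[0] * sizeof(double) );   // double data, not pointer storage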
You will still need to address the kernel running time in some fashion to avoid the Windows TDR event, if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.
I have some CUDA code that nvcc (well, technically ptxas) likes to take upwards of 10 minutes to compile. While it isn't small, it certainly isn't huge (~5000 lines).
The delay seems to come and go between CUDA version updates, but previously it only took a minute or so instead of 10.
When I used the -v option, it seemed to get stuck after displaying the following:
ptxas --key="09ae2a85bb2d44b6" -arch=sm_13 "/tmp/tmpxft_00002ab1_00000000-2_trip3dgpu_kernel.ptx" -o "/tmp/tmpxft_00002ab1_00000000-9_trip3dgpu_kernel.sm_13.cubin"
The kernel does have a fairly large parameter list, and a structure with a good number of pointers is passed around, but I do know that there was at least one point in time at which very nearly the same code compiled in only a couple of seconds.
I am running 64 bit Ubuntu 9.04 if it helps.
Any ideas?
I had a similar problem - without optimization, compilation failed because it ran out of registers, and with optimizations it took nearly half an hour. My kernel had expressions like:
t1itern[II(i,j)] = (1.0 - overr) * t1itero[II(i,j)] + overr * (rhs[IJ(i-1,j-1)].rhs1 - abiter[IJ(i-1,j-1)].as * t1itern[II(i,j - 1)] - abiter[IJ(i-1,j-1)].ase * t1itero[II(i + 1,j - 1)] - abiter[IJ(i-1,j-1)].ae * t1itern[II(i + 1,j)] - abiter[IJ(i-1,j-1)].ane * t1itero[II(i + 1,j + 1)] - abiter[IJ(i-1,j-1)].an * t1itern[II(i,j + 1)] - abiter[IJ(i-1,j-1)].anw * t1itero[II(i - 1,j + 1)] - abiter[IJ(i-1,j-1)].aw * t1itern[II(i - 1,j)] - abiter[IJ(i-1,j-1)].asw * t1itero[II(i - 1,j - 1)] - rhs[IJ(i-1,j-1)].aads * t2itern[II(i,j - 1)] - rhs[IJ(i-1,j-1)].aadn * t2itern[II(i,j + 1)] - rhs[IJ(i-1,j-1)].aade * t2itern[II(i + 1,j)] - rhs[IJ(i-1,j-1)].aadw * t2itern[II(i - 1,j)] - rhs[IJ(i-1,j-1)].aadc * t2itero[II(i,j)]) / abiter[IJ(i-1,j-1)].ac;
and when I rewrote them:
tt1 = lrhs.rhs1;
tt1 = tt1 - labiter.as * t1itern[II(1,j - 1)];
tt1 = tt1 - labiter.ase * t1itern[II(2,j - 1)];
tt1 = tt1 - labiter.ae * t1itern[II(2,j)];
//etc
it significantly reduced compilation time and register usage.
You should note that there is a limit on the size of the parameter list that can be passed to a function, currently 256 bytes (see section B.1.4 of the CUDA Programming Guide). Has the function changed at all?
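If your arguments are bundled into a single struct passed by value, a quick compile-time sanity check against that limit is easy to add using C++11's static_assert (KernelParams here is a hypothetical stand-in for whatever you actually pass):

// Hypothetical parameter block passed by value to the kernel.
struct KernelParams {
    float *a, *b, *c;
    int    n, m, p;
    double scale;
};
static_assert(sizeof(KernelParams) <= 256,
              "kernel parameter block exceeds the 256-byte limit (sec. B.1.4)");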
There is also a limit of 2 million PTX instructions per kernel, but you shouldn't be approaching that ;-)
What version of the toolkit are you using? The 3.0 beta is available if you are a registered developer which is a major update. If you still have the problem you should contact NVIDIA, they will need to be able to reproduce the problem of course.
Setting -maxrregcount 64 on the compile line helps, since it causes the register allocator to spill to lmem (local memory) earlier.