I called the cublas_Sgemm_v2 function 10236 times with the first matrix non-transposed and the second transposed. However, in the nvprof results, I saw three items produced from that function call. The (m, n, k) values passed to the function call are (588, 588, 20).
These are the items listed in the nvprof results:
Time(%) Time Calls Avg Min Max Name
12.32% 494.86ms 10236 48.344us 47.649us 49.888us sgemm_sm35_ldg_nt_128x8x128x16x16
8.64% 346.91ms 10236 33.890us 32.352us 35.488us sgemm_sm35_ldg_nt_64x16x128x8x32
8.11% 325.63ms 10236 31.811us 31.360us 32.512us sgemm_sm35_ldg_nt_128x16x64x16x16
Is this expected, and why is that? Can someone explain what the values in the function names such as sgemm_sm35_ldg_nt_128x8x128x16x16 mean?
I also have other calls to cublas_Sgemm_v2 with different transpose settings, and I only see one item per function call.
UPDATE:
As @Marco13 asked, here are more results:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (588, 100, 588)
20.84% 548.30ms 7984 68.675us 58.977us 81.474us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (588, 100, 588)
12.95% 340.71ms 7984 42.674us 21.856us 64.514us sgemm_sm35_ldg_nn_64x16x64x16x16
All the following resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (588, 588, 100)
9.81% 258.15ms 3992 64.666us 61.601us 68.642us sgemm_sm35_ldg_nt_128x8x128x16x16
6.84% 179.90ms 3992 45.064us 40.097us 49.505us sgemm_sm35_ldg_nt_64x16x128x8x32
6.33% 166.51ms 3992 41.709us 38.304us 61.185us sgemm_sm35_ldg_nt_128x16x64x16x16
Another run with 588 changed to 288:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (288, 100, 288)
22.01% 269.11ms 7984 33.706us 30.273us 39.232us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (288, 100, 288)
14.79% 180.78ms 7984 22.642us 18.752us 26.752us sgemm_sm35_ldg_nn_64x16x64x16x16
Resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (288, 288, 100)
7.43% 90.886ms 3992 22.766us 19.936us 25.024us sgemm_sm35_ldg_nt_64x16x64x16x16
From the last three lines it looks like certain transposition types can be more efficient than others, and certain matrix sizes are more economical in terms of computation time relative to matrix size. What is the guideline for ensuring economical computation?
UPDATE 2:
For the case of (m, n, k) = (588, 100, 588) above, I manually transposed the matrix before calling the sgemm function, then there is only one item in the nvprof result. The time it take is only a little less than the sum of the two items in the above table. So there is no much performance gain from doing so.
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
31.65% 810.59ms 15968 50.763us 21.505us 72.098us sgemm_sm35_ldg_nn_64x16x64x16x16
Sorry, not an answer - but slightly too long for a comment:
Concerning the edit, about the influence of the "transpose" state: Transposing a matrix might cause an access pattern that is worse in terms of memory coalescing. A quick web search brings up some results about this ( https://devtalk.nvidia.com/default/topic/528450/cuda-programming-and-performance/cublas-related-question/post/3734986/#3734986 ), but with a slightly different setup than yours:
DGEMM performance on a K20c
args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13280010 sec GFLOPS=1034.93
args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13872910 sec GFLOPS=990.7
args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.12521601 sec GFLOPS=1097.61
args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13652611 sec GFLOPS=1006.69
In this case, the differences do not seem worth the hassle of changing the matrix storage (e.g. from column-major to row-major, to avoid transposing the matrix), because all patterns seem to run with a similar speed. But your mileage may vary - particularly, the difference in your tests between (t,n) and (n,n) is very large (548ms vs 340ms), which I found quite surprising. If you have the choice to easily switch between various representations of the matrix, then a benchmark covering all four cases may be worthwhile.
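For what it's worth, such a benchmark could look roughly like the sketch below (untested; the function name, the handle, the device buffers d_A/d_B/d_C and the size parameter are placeholders, and the matrices are assumed square so that the same leading dimension works for all four combinations):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

// Rough sketch: time cublasSgemm for all four transpose combinations.
// d_A, d_B, d_C are assumed to be device buffers of size*size floats.
void benchmarkTransposeModes(cublasHandle_t handle,
                             const float* d_A, const float* d_B, float* d_C,
                             int size)
{
    const float alpha = 1.0f, beta = 0.0f;
    const cublasOperation_t ops[2] = { CUBLAS_OP_N, CUBLAS_OP_T };

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int ta = 0; ta < 2; ++ta) {
        for (int tb = 0; tb < 2; ++tb) {
            cudaEventRecord(start);
            cublasSgemm(handle, ops[ta], ops[tb],
                        size, size, size,
                        &alpha, d_A, size, d_B, size,
                        &beta,  d_C, size);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("ta=%c tb=%c : %8.3f ms\n",
                   ta ? 'T' : 'N', tb ? 'T' : 'N', ms);
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

In a real measurement one would add a warm-up call and average over several repetitions.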
In any case, regarding your question about the functions that are called there: The CUBLAS code for the sgemm function in CUBLAS 1.1 was already full of unrolled loops and contained 80 (!) versions of the sgemm function for different cases, assembled using a #define-hell. It has to be assumed that this has become even more unreadable in the newer CUBLAS versions, where the newer compute capabilities have to be taken into account - and the function names that you found there indicate that this is indeed the case:
sgemm_sm35_ldg_nt_64x16x128x8x32
sm35 : Runs on a device with compute capability 3.5
ldg : ? Non-texture-memory version ? (CUBLAS 1.1 contained functions called sgemm_main_tex_* which worked on texture memory, and functions sgemm_main_gld_* which worked on normal, global memory)
nt : First matrix is Not transposed, second one is Transposed
64x16x128x8x32 - Probably related to tile sizes, maybe shared memory etc...
Still, I think it's surprising that a single call to sgemm causes three of these internal functions to be called. But as mentioned in the comment: I assume that they try to handle the "main" part of the matrix with a specialized, efficient version, and "border tiles" with one that is capable of doing range checks and/or cope with warps that are not full. (Not very precise, just to be suggestive: A matrix of size 288x288 could be handled by an efficient core for matrices of size 256x256, and two calls for the remaining 32x288 and 288x32 entries).
But all this is also the reason why I guess there can hardly be a general guideline concerning the matrix sizes: The "best" matrix size in terms of computation time over matrix size will at least depend on
the hardware version (compute capability) of the target system
the transposing-flags
the CUBLAS version
EDIT Concerning the comment: One could imagine that there should be a considerable difference between the transposed and the non-transposed processing. When multiplying two matrices
a00 a01 a02 b00 b01 b02
a10 a11 a12 * b10 b11 b12
a20 a21 a22 b20 b21 b22
Then the first element of the result will be
a00 * b00 + a01 * b10 + a02 * b20
(which simply is the dot product of the first row of a and the first column of b). For this computation one has to read consecutive values from a. But the values that are read from b are not consecutive (assuming row-major storage). Instead, they are "the first value in each row". One could think that this would have a negative impact on memory coalescing. But for sure, the NVIDIA engineers have tried hard to avoid any negative impact here, and the implementation of sgemm in CUBLAS is far, far away from "a parallel version of the naive 3-nested-loops implementation" where this access pattern would have such an obvious drawback.
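For illustration only, here is that naive 3-nested-loops version as a plain C sketch (row-major storage assumed, function name made up; this is emphatically not what CUBLAS does), just to make the access pattern visible:

// Naive C = A * B for n x n matrices in row-major storage (illustration only).
// The inner loop reads a[i*n + k] with stride 1 (consecutive addresses),
// but b[k*n + j] with stride n, i.e. "the first value in each row" as
// described above.
void naive_sgemm(const float* a, const float* b, float* c, int n)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}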
Related
This is a very simple question about the CUBLAS library to which I strangely couldn't find an answer in the documentation or elsewhere.
I am using a rather old version of CUBLAS (10.2), but it should not matter. I use cublasSgemm to multiply two 32-bit float matrices A * B and put the result in matrix C:
stat = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, nRows, k, nCols, alpha, A, nRows, B, k, beta, C, nRows);
Is it possible to make CUBLAS accumulate the result in C? That is, if C already contains some data, it would not be overwritten but instead accumulated with the multiplication result.
This can be used, for example, when memory is limited and one needs to shrink the sizes of the input matrices if they are too big and multiply several times. However, I couldn't see such an option in cublasSgemm.
Is it possible to make CUBLAS accumulate the result in C? That is, if C already contains some data, it would not be overwritten but instead accumulated with the multiplication result.
Yes, cublasSgemm does exactly that. Referring to the documentation:
This function performs the matrix-matrix multiplication
C = α op(A) op(B) + β C
The β C term is the accumulation part of the formula.
If you set beta to zero, then the previous contents of C will not be accumulated.
If you set beta to 1, then the previous contents of C will be added to the multiplication (AxB) result.
If you set beta to some other value, a scaled (multiplied) version of the previous contents of C will be added.
Note that all of this functionality was defined/specified as part of the netlib BLAS description; it should behave the same in other BLAS libraries and is not unique or specific to CUBLAS.
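For illustration, a call that accumulates into C could look roughly like this sketch, reusing the names from the call in the question and assuming the default host pointer mode for alpha and beta:

// Sketch: accumulate op(A) * op(B) on top of the existing contents of C.
// beta = 1 keeps the previous contents of C and adds the product to them;
// beta = 0 would overwrite C instead.
const float alpha = 1.0f;
const float beta  = 1.0f;
stat = cublasSgemm(handle,
                   CUBLAS_OP_N, CUBLAS_OP_T,
                   nRows, k, nCols,
                   &alpha,
                   A, nRows,   // lda
                   B, k,       // ldb
                   &beta,
                   C, nRows);  // ldc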
Here is a description of one of the states in my state machine. What I would like to do is to go to the next state after the for loops.
is(s_multiplier){
  when(ready){state := s_ready}
  // Initialization of C memory to 0
  for(i <- 0 to matrixSize - 1){
    for(j <- 0 to matrixSize - 1){
      memC.write(i * matrixSize + j, 0.asSInt((2 * cellSize).W))
    }
  }
  // Objective 1 : Multiplication for the 128X128
  // Objective 2 : Multiplication for the n.m and m.p size parameters given
  for(i <- 0 to matrixSize - 1){
    for(j <- 0 to matrixSize - 1){
      sum := 0.asSInt(cellSize.W)
      for(k <- 0 to matrixSize - 1){
        sum = sum + memA.read(i * matrixSize + k, true.B) * memB.read(k * matrixSize + j, true.B)
      }
      memC.write(i * matrixSize + j, sum)
    }
  }
  ready := true.B
}
I just created a boolean variable ready that I set to true after the loops. But as everything is supposed to be executed in parallel, I don't think that my code is correct :/
There is a fundamental difference between writing software algorithms and using Chisel to construct the hardware necessary to perform equivalent calculations.
Before discussing the matrix multiplication, consider (as a simpler example) your memory initialization loop. The way you have done it makes sense, but for hardware, every time the inner body of the loop is executed, the hardware necessary to initialize that memory cell is added to the hardware graph. That means you have created the wires necessary to initialize 16384 memory locations all at the same time. That's a lot of wires. Not only that, it would require a memory with 16384 write ports (you probably can't find one). Your hardware would initialize all this memory in one clock cycle, which is good, but it devotes an enormous number of gates to doing so.
Typically one would initialize memory over a number of clock cycles, thereby reducing the amount of hardware required.
Similarly, in the matrix multiplication section you are generating all the hardware necessary to compute a matrix multiplication in one clock cycle. This is great for performance, but this approach requires 2,097,152 hardware multipliers plus a further large number of adders. Every * and + operation in the inner loop generates hardware. The number of gates required to multiply two 32-bit numbers is roughly 1024.
The way to go about this is to figure out a way of breaking the problem down into stages. Maybe this would be a module that can multiply one row by one column and sum the total. You would then need to use registers to work your way through the matrix, keeping track of the rows and columns in order to compute the value at every point in the result matrix. To reduce the number of hardware elements, you instead perform the calculation over multiple clock cycles, keeping state information (indices of the rows and columns) on the progress of the calculation in registers or in memory.
There are a lot of ways to try to optimize a function like this, and Chisel is a great language for experimenting and testing out tactics.
Maybe you want to make the memory very wide to accommodate getting multiple cell values at once.
Maybe you will unroll your loop a bit more to compute multiple cell values at once by having more than one cell calculator.
Clever iteration strategies can optimize your memory accesses for both reading and writing.
The point is that writing hardware is not necessarily harder than writing software (and Chisel helps there), but it is pretty different in its approach.
I would recommend you spend a little more time with the Chisel bootcamp. The 2.3_control_flow page's section on sorting is pretty similar to the discussion above. You can write a one-cycle sorter, but the size of the hardware to do it grows rapidly; in practice it is necessary to break the problem into pieces and spread the calculation over multiple cycles.
Good luck.
Effectively what I'm looking for is a function f(x) that outputs into a pre-defined range. Calling f(f(x)) should be valid as well. The function should be cyclical, so calling f(f(...(x))) where the number of calls is equal to the size of the range should give you the original number, and f(x) should not be time-dependent and should always give the same output.
While I can see that taking a list of all possible values and shuffling it would give me something close to what I want, I'd much prefer it if I could simply plug values into the function one at a time so that I do not have to compute the entire range all at once.
I've looked into Minimal Perfect Hash Functions but haven't been able to find one that doesn't use external libraries. I'm okay with using them, but would prefer to not do so.
If an actual range is necessary to help answer my question, I don't think it would need to be bigger than [0, 2^24-1], but the starting and ending values don't matter too much.
You might want to take a look at a Linear Congruential Generator. You should be looking at a full-period generator (say, m = 2^24), which means the parameters must satisfy the Hull-Dobell theorem.
Calling f(f(x)) should be valid as well.
should work
the number of calls is equal to the size of the range should give you the original number
Yes, for an LCG with parameters satisfying the Hull-Dobell theorem you'll get the full period covered exactly once, and after m calls you'll be back where you started.
The period of such an LCG is exactly equal to m.
should not be time dependent and will always give the same output
An LCG is an O(1) algorithm and it is 100% reproducible.
An LCG is reversible as well, via the extended Euclidean algorithm; check Reversible pseudo-random sequence generator for details.
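A minimal sketch (the constants here are just one example pair satisfying the Hull-Dobell conditions for m = 2^24; any valid a, c will do):

#include <stdint.h>

// Full-period LCG over [0, 2^24 - 1].
// Hull-Dobell for m = 2^24: c must be coprime to m (i.e. odd), a - 1 must be
// divisible by every prime factor of m (i.e. by 2), and by 4 since 4 divides m.
// a = 1664525 and c = 1013904223 satisfy all of this.
static inline uint32_t f(uint32_t x)
{
    return (1664525u * x + 1013904223u) & 0xFFFFFFu;   // mod 2^24
}

Applying f exactly 2^24 times to any starting value returns that value, and every value in [0, 2^24 - 1] is visited exactly once along the way.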
Minimal perfect hash functions are overkill; all you've asked for is a function f that is:
bijective, and
"cyclical" (ie fN=f)
For a permutation to be cyclical in that way, its order must divide N (or be N but in a way that's just a special case of dividing N). Which in turn means the LCM of the orders of the sub-cycles must divide N. One way to do that is to just have one "sub"-cycle of order N. For power of two N, it's also really easy to have lots of small cycles of some other power-of-two order. General permutations do not necessarily satisfy the cycle-requirement, of course they are bijective but the LCM of the orders of the sub-cycles may exceed N.
In the following I will leave all reduction modulo N implicit. Without loss of generality I will assume the range starts at 0 and goes up to N-1, where N is the size of the range.
The only thing I can immediately think of for general N is f(x) = x + c where gcd(c, N) == 1. The GCD condition ensures there is only one cycle, which necessarily has order N.
For power-of-two N I have more inspiration:
f(x) = cx where c is odd. Bijective because gcd(c, N) == 1, so c has a modular multiplicative inverse. Also c^N = 1, because φ(N) = N/2 (since N is a power of two), so c^φ(N) = 1 (Euler's theorem) and φ(N) divides N.
f(x) = x XOR c where c < N. Trivially bijective and trivially cycles with a period of 2, which divides N.
f(x) = clmul(x, c) where c is odd and clmul is carry-less multiplication. Bijective because any odd c has a carry-less multiplicative inverse. It has some power-of-two cycle length (less than N), so it divides N. I don't know why though. This is a weird one, but it has decent special cases such as x ^ (x << k). By symmetry, the "mirrored" version also works, e.g. x ^ (x >> k).
f(x) = x >>> k where >>> is bit-rotation. Obviously bijective, and f^N(x) = x >>> Nk, where Nk mod N = 0, so it rotates all the way back to the unrotated position regardless of what k is.
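A quick sketch of some of these constructions (constants chosen arbitrarily, function names made up, N = 2^24 as in the question):

#include <stdint.h>

#define N_MASK 0xFFFFFFu   /* N = 2^24, as in the question */

/* f(x) = x + c mod N with c odd (so gcd(c, N) = 1): a single cycle of length N. */
static inline uint32_t f_add(uint32_t x) { return (x + 12345u) & N_MASK; }

/* f(x) = c * x mod N with c odd: c is invertible mod N and c^N = 1. */
static inline uint32_t f_mul(uint32_t x) { return (0x9E3779B1u * x) & N_MASK; }

/* f(x) = x XOR c with c < N: applying it twice gives back x, and 2 divides N. */
static inline uint32_t f_xor(uint32_t x) { return (x ^ 0xABCDEu) & N_MASK; }

For each of these, applying the function N = 2^24 times returns the starting value.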
I have a CUDA program whose kernel basically does the following.
I provide a list of n points in cartesian coordinates e.g. (x_i,y_i) in a plane of dimension dim_x * dim_y. I invoke the kernel accordingly.
For every point (x_p, y_p) on this plane I calculate, by a formula, the time it would take for each of those n points to reach it, given that those n points are moving with a certain velocity.
I order those times in increasing order t_0, t_1, ..., t_n, where the precision of t_i is set to 1 decimal place, i.e. if t'_i = 2.3453 then I would only use t_i = 2.3.
Assuming the times are generated from a normal distribution, I simulate the 3 quickest times to find the percentage of the time each of those 3 points arrives first. Hence suppose prob_0 = 0.76, prob_1 = 0.20 and prob_2 = 0.04 by a random experiment. Since t_0 arrives first most often amongst the three, I also return the original index (before sorting of the times) of that point. Say idx_0 = 5 (an integer).
Hence for every point on this plane I get a pair (prob,idx).
Suppose n/2 of those points are of one kind and the rest are of the other. A sample generated image looks as follows.
Especially when the precision of the time was set to 1, I noticed that the number of unique 3-tuples of times (t_0, t_1, t_2) was just 2.5% of the total number of data points, i.e. the number of points on the plane. This meant that most of the time the kernel was uselessly simulating when it could just use the values from previous simulations. Hence I could use a dictionary with a 3-tuple of times as key and idx and prob as value. Since, as far as I know and have tested, the STL can't be accessed inside a kernel, I constructed an array of floats of size 201000000. This choice was made by experimentation, since none of the top 3 times exceeded 20 seconds. Hence t_0 could take any value from {0.0, 0.1, 0.2, ..., 20.0}, thus having 201 choices. I could construct a key for such a dictionary like the following:
Key = t_0 * 10^6 + t_1 * 10^3 + t_2
As far as the value is concerned, I could make it (prob + idx). Since idx is an integer and 0.0 <= prob <= 1.0, I could retrieve both of those values later by
prob = dict[key] - floor(dict[key])
idx = floor(dict[key])
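For concreteness, a hypothetical encode/decode pair could look like this (the helper names are made up, and it assumes the one-decimal times are first scaled to integer tenths, e.g. 2.3 becomes 23, which keeps the largest key at 200200200 and thus below the array size of 201000000):

// Hypothetical helpers; the 10^6 / 10^3 factors follow the key scheme above.
__device__ size_t make_key(float t0, float t1, float t2)
{
    // Scale one-decimal times to integer tenths (0 .. 200).
    size_t b0 = (size_t)(t0 * 10.0f + 0.5f);
    size_t b1 = (size_t)(t1 * 10.0f + 0.5f);
    size_t b2 = (size_t)(t2 * 10.0f + 0.5f);
    return b0 * 1000000 + b1 * 1000 + b2;   // max 200200200 < 201000000
}

__device__ void unpack(float value, float* prob, int* idx)
{
    *idx  = (int)floorf(value);       // integer part holds the index
    *prob = value - floorf(value);    // fractional part holds the probability
}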
So now my kernel looks like the following
__global__ void my_kernel(float* points, float* dict, float* p, float* i, size_t width, ...){
    unsigned int col = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int row = blockIdx.x*blockDim.x + threadIdx.x;
    float prob, idx;
    size_t key;
    // Calculate the time taken for each of the points to reach this point on the plane
    // Order the times in increasing order t_0, t_1, ..., t_n
    // Calculate key = t_0 * 10^6 + t_1 * 10^3 + t_2
    if(dict[key] > 0.0f){
        prob = dict[key] - floor(dict[key]);
        idx  = floor(dict[key]);
    }
    else{
        // Simulate and find prob and idx
        dict[key] = (prob + idx);
    }
    p[row*width + col] = prob;
    i[row*width + col] = idx;
}
The result is quite similar to the original program for most points, but for some it is wrong.
I am quite sure that this is due to a race condition. Notice that dict was initialized with all zeroes. The basic idea would be to make the data structure "read many, write once" in a particular location of the dict.
I am aware that there might be much more optimized ways of solving this problem rather than allocating so much memory. Please let me know if that is the case. But I would really like to understand why this particular solution is failing. In particular, I would like to know how to use atomicAdd in this setting. I have failed to use it.
Unless your simulation in the else branch is very long (~100s of floating-point operations), a lookup table in global memory is likely to be slower than running the computation. Global memory access is very expensive!
In any case, there is no way to save time by "skipping work" using conditional branching. The Single Instruction, Multiple Thread architecture of a GPU means that the instructions for both sides of the branch will be executed serially, unless all of the threads in a warp follow the same branch.
edit:
The fact that you are seeing a performance increase as a result of introducing the conditional branch, and that you didn't have any problems with deadlock, suggests that all the threads in each warp are always taking the same branch. I suspect that once dict starts getting populated, the performance increase will go away.
Perhaps I have misunderstood something, but if you want to calculate the probability of an event x, assuming a normal distribution and given the mean mu and standard deviation sigma, there is no need to generate a load of random numbers and approximate a Gaussian curve. You can directly calculate the probability:
p = exp(-((x - mu) * (x - mu) / (2.0f * sigma * sigma))) /
(sigma * sqrt(2.0f * M_PI));
I have a working detection and tracking process (pixel image in rows and columns) which does not give perfectly repeatable results, because its use of atomicAdd means that data points can be accumulated in different orders, leading to round-off errors in the calculation of centroids and other track statistics.
In the main there are few clashes for the atomicAdd, so most results are identical. However, for verification and validation I need to be able to make the atomicAdd add these clashing data points in a consistent order, such that, say, thread 3 will beat thread 10 when both want to use the atomicAdd to add a pixel on the row N that they are processing.
Is there a mechanism that allows the atomicAdd to be deterministic in its thread order, or have I missed something?
Check out "Fast Reproducible Atomic Summations" paper from Berkeley.
http://www.eecs.berkeley.edu/~hdnguyen/public/papers/ARITH21_Fast_Sum.pdf
But basically you could try something like finding the sum of the absolute values alongside your original sum, multiplying it by O(N^2), and then subtracting it from and adding it back to your original sum (sum = (sum - sumAbs * N^2) + sumAbs * N^2) to cancel out the lowest bits (which are the non-deterministic ones). As you can see, the upper bound grows proportionally to N^2, so the lower the N (the number of elements in the sum), the better your error bound.
You could also try Kahan summation in conjunction with the above to reduce the error bound.
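For reference, a minimal sketch of Kahan (compensated) summation in its serial form; it reduces the round-off of each partial sum, but it does not by itself make the order of the atomic additions deterministic:

// Kahan (compensated) summation: the running compensation c captures the
// low-order bits that would otherwise be lost when adding each term.
float kahan_sum(const float* data, int n)
{
    float sum = 0.0f;
    float c = 0.0f;                 // running compensation for lost low-order bits
    for (int i = 0; i < n; ++i) {
        float y = data[i] - c;
        float t = sum + y;          // low-order bits of y are lost here...
        c = (t - sum) - y;          // ...and recovered here for the next iteration
        sum = t;
    }
    return sum;
}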