What does the __grid_constant__ parameter qualifier do? - cuda

Several weeks ago, NVIDIA's Stephen Jones gave a GTC talk named "CUDA: New features and beyond", in which he presented an upcoming feature of CUDA v11.7: a qualifier/decorator for kernel parameters named __grid_constant__. I didn't understand the explanation of what that's supposed to mean.
Specifically, how does a __grid_constant__ int x differ from a plain int x? Aren't they both just read by threads from constant memory?
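For context, here is a minimal sketch of the two signatures being compared; the struct and kernel names are made up for illustration, and the annotated form is the CUDA 11.7 syntax from the talk:
struct Params { float coeffs[64]; };

// Ordinary by-value kernel parameter:
__global__ void kernel_plain(const Params p) { /* ... */ }

// The same parameter with the __grid_constant__ annotation:
__global__ void kernel_grid_const(const __grid_constant__ Params p) { /* ... */ }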

Related

Generating Uniform Double random numbers on device in CUDA

I would like to generate uniform random numbers on the device, to be used inside of a device function. Each thread should generate a different uniform random number. I have this code, but I get a segmentation fault.
int main()
{
    curandStateMtgp32 *devMTGPStates;
    mtgp32_kernel_params *devKernelParams;

    cudaMalloc((void **)&devMTGPStates, NUM_THREADS * NUM_BLOCKS * sizeof(curandStateMtgp32));
    cudaMalloc((void **)&devKernelParams, sizeof(mtgp32_kernel_params));

    curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams);
    curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213,
                                devKernelParams, NUM_BLOCKS * NUM_THREADS, 1234);

    doHenry<<<NUM_BLOCKS, NUM_THREADS>>>(devMTGPStates);
}
Inside my global function doHenry, evaluated on the device, I put:
double rand1 = curand_uniform_double(&state[threadIdx.x+NUM_THREADS*blockIdx.x]);
Is this the best way to generate a random number per thread? I don't understand what the devKernelParams is doing, but I know I need one state per thread, right?
I think you're getting the seg fault on this line:
curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213, devKernelParams,NUM_BLOCKS*NUM_THREADS, 1234);
I believe the reason for the seg fault is that you have exceeded 200 for the n parameter, to which you are passing NUM_BLOCKS*NUM_THREADS. I tried a version of your code, and I was able to reproduce the seg fault at around n=540.
The MT generator has a limitation on the number of states it can set up when using pre-generated kernel parameters (mtgp32dc_params_fast_11213). You may wish to read the relevant section of the documentation ("Bit Generation with the MTGP32 generator").
I'm not really an expert on CURAND, but other generators (such as XORWOW) don't have this type of limitation, so if you want to generate a large amount of independent thread state easily, consider one of the other generators. With the particular approach you have outlined, the MTGP32 generator seems to be limited to about 200*256 threads of independent generation. Contrary to what I said in the comments (which is true for other generator types), one MTGP32 state appears to be sufficient for a block of up to 256 threads, and the example given in the documentation (refer to the second example) uses that type of state generation and threadblock hierarchy.
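To make the per-block arrangement concrete, here is a minimal sketch of that documentation approach: one MTGP32 state per block, up to 256 threads per block, and at most 200 blocks with the pre-generated parameters. The kernel name, sizes, and output buffer are made up for illustration.
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>
#include <curand_mtgp32dc_p_11213.h>

#define NUM_BLOCKS  64
#define NUM_THREADS 256

__global__ void generate(curandStateMtgp32 *state, double *out)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // All threads in a block share the block's state; cuRAND handles the
    // per-thread sub-sequences internally for MTGP32.
    out[id] = curand_uniform_double(&state[blockIdx.x]);
}

int main()
{
    curandStateMtgp32 *devStates;
    mtgp32_kernel_params *devKernelParams;
    double *devResults;

    cudaMalloc((void **)&devStates, NUM_BLOCKS * sizeof(curandStateMtgp32));
    cudaMalloc((void **)&devKernelParams, sizeof(mtgp32_kernel_params));
    cudaMalloc((void **)&devResults, NUM_BLOCKS * NUM_THREADS * sizeof(double));

    curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams);
    // One state per block, not one per thread, so n stays well below 200.
    curandMakeMTGP32KernelState(devStates, mtgp32dc_params_fast_11213,
                                devKernelParams, NUM_BLOCKS, 1234);

    generate<<<NUM_BLOCKS, NUM_THREADS>>>(devStates, devResults);
    cudaDeviceSynchronize();
    return 0;
}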

CUDA5.0 Samples AdvancedQuickSort

I am reading the CUDA 5.0 samples (AdvancedQuickSort) now. However, I cannot fully understand this sample because of the following code:
// Now compute my own personal offset within this. I need to know how many
// threads with a lane ID less than mine are going to write to the same buffer
// as me. We can use popc to implement a single-operation warp scan in this case.
unsigned lane_mask_lt;
asm( "mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt) );
unsigned int my_mask = greater ? gt_mask : lt_mask;
unsigned int my_offset = __popc(my_mask & lane_mask_lt);
which is in the __global__ void qsort_warp function, and in particular this assembly language in the code. Can anyone help me by explaining the meaning of this assembly language?
%lanemask_lt is a special, read-only register in PTX assembly which is initialized with a 32-bit mask with bits set in positions less than the thread’s lane number in the warp. The inline PTX you have posted is simply reading the value of that register and storing it in a variable where it can be used in the subsequent C++ code you posted.
Every version of the CUDA toolkit ships with a PTX assembly language reference guide you can use to look up things like this.
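For what it's worth, the same value can be computed without inline PTX; a minimal sketch, assuming a 1-D thread block:
// lane_mask_lt gets a bit set for every lane whose ID is lower than the
// calling thread's lane ID within its warp.
unsigned int lane_id      = threadIdx.x & 31;     // lane number within the warp
unsigned int lane_mask_lt = (1u << lane_id) - 1;  // bits [0, lane_id) set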

Generalized Hough Transform in CUDA - How can I speed up the binning process?

Like the title says, I'm working on a little personal research project into parallel computer vision techniques. Using CUDA, I am trying to implement a GPGPU version of the Hough transform. The only problem that I've encountered is during the voting process. I'm calling atomicAdd() to prevent multiple simultaneous write operations, and I don't seem to be gaining much performance. I've searched the web, but haven't found any way to noticeably enhance the performance of the voting process.
Any help you could provide regarding the voting process would be greatly appreciated.
I'm not familiar with the Hough transform, so posting some pseudocode could help here. But if you are interested in voting, you might consider using the CUDA vote intrinsic instructions to accelerate this.
Note this requires 2.0 or later compute capability (Fermi or later).
If you are looking to count the number of threads in a block for which a specific condition is true, you can just use __syncthreads_count().
bool condition = ...; // compute the condition
int blockCount = __syncthreads_count(condition); // must be in non-divergent code
If you are looking to count the number of threads in a grid for which the condition is true, you can then do an atomicAdd:
bool condition = ...; // compute the condition
int blockCount = __syncthreads_count(condition); // must be in non-divergent code
if (threadIdx.x == 0)
    atomicAdd(totalCount, blockCount); // totalCount points to a global counter; add once per block
If you need to count the number of threads in a group smaller than a block for which the condition is true, you can use __ballot() and __popc() (population count).
// get the count of threads within each warp for which the condition is true
bool condition = ...; // compute the condition in each thread
int warpCount = __popc(__ballot(condition)); // see the CUDA programming guide for details
Hope this helps.
Not long ago I used these voting approaches myself; in the end, atomicAdd turned out to be even faster in both scenarios.
This link is very useful: warp-filtering
And this one was my solved problem: Write data only from selected lanes in a Warp using Shuffle + ballot + popc
Aren't you looking for a critical section?

CUDA cublas<t>gbmv understanding

I recently wanted to perform a simple CUDA matrix-vector multiplication. I found a suitable function in the cublas library: cublas<t>gbmv. Here is the official documentation.
But it is actually very poor, so I didn't manage to understand what the kl and ku parameters mean. Moreover, I have no idea what stride is (it must also be provided).
There is a brief explanation of these parameters (Page 37), but it looks like I need to know something else.
A search on the internet doesn't turn up much useful information on this question, mostly references to different versions of the documentation.
So I have several questions to GPU/CUDA/cublas gurus:
How do I find more understandable docs or guides about using cublas?
If you know how to use this particular function, could you explain to me how to use it?
Maybe the cublas library is somewhat out of the ordinary and everyone uses something more popular, better documented, and so on?
Thanks a lot.
So BLAS (Basic Linear Algebra Subprograms) generally is an API to, as the name says, basic linear algebra routines. It includes vector-vector operations (level 1 blas routines), matrix-vector operations (level 2) and matrix-matrix operations (level 3). There is a "reference" BLAS available that implements everything correctly, but most of the time you'd use an optimized implementation for your architecture. cuBLAS is an implementation for CUDA.
The BLAS API was so successful as an API that describes the basic operations that it's become very widely adopted. However, (a) the names are incredibly cryptic because of architectural limitations of the day (this was 1979, and the API was defined using names of 8 characters or less to ensure it could widely compile), and (b) it is successful because it's quite general, and so even the simplest function calls require a lot of extraneous arguments.
Because it's so widespread, it's often assumed that if you're doing numerical linear algebra, you already know the general gist of the API, so implementation manuals often leave out important details, and I think that's what you're running into.
The Level 2 and 3 routines generally have function names of the form TMMOO, where T is the numerical type of the matrix/vector (S/D for single/double precision real, C/Z for single/double precision complex), MM is the matrix type (GE for general, e.g. just a dense matrix you can't say anything else about; GB for a general banded matrix; SY for symmetric matrices; etc.), and OO is the operation.
This all seems slightly ridiculous now, but it worked and works relatively well -- you quickly learn to scan these for familiar operations so that SGEMV is a single-precision general-matrix times vector multiplication (which is probably what you want, not SGBMV), DGEMM is double-precision matrix-matrix multiply, etc. But it does take some practice.
So if you look at the cublas sgemv instructions, or in the documentation of the original, you can step through the argument list. First, the basic operation is
This function performs the matrix-vector multiplication
y = α op(A) x + β y
where A is an m × n matrix stored in column-major format, x and y
are vectors, and α and β are scalars.
where op(A) can be A, A^T (transpose), or A^H (conjugate transpose). So if you just want y = Ax, as is the common case, then α = 1, β = 0, and transa == CUBLAS_OP_N.
incx is the stride between consecutive elements in x; there are lots of situations where this would come in handy, but if x is just a simple 1D array containing the vector, then the stride would be 1.
And that's about all you need for SGEMV.
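Putting that together, here is a minimal sketch of y = Ax with cublasSgemv using the cuBLAS v2 API, assuming d_A (an m × n column-major matrix), d_x (n elements), and d_y (m elements) are already allocated and filled on the device:
#include <cublas_v2.h>

void gemv_example(const float *d_A, const float *d_x, float *d_y, int m, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;  // α = 1
    const float beta  = 0.0f;  // β = 0, so y is overwritten rather than accumulated

    // CUBLAS_OP_N: op(A) = A (no transpose). The leading dimension lda is m
    // for a tightly packed column-major matrix. incx/incy are the strides
    // between consecutive elements of x and y.
    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, d_A, m,
                d_x, 1,
                &beta, d_y, 1);

    cublasDestroy(handle);
}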

CONFLICT_FREE_OFFSET macro used in the parallel prefix algorithm from GPU Gems 3

First of all, here is the link to the algorithm:
GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA.
In order to avoid bank conflicts, padding is added to the shared memory array every NUM_BANKS (i.e., 32 for devices of compute capability 2.x) elements. This is done by (as in Figure 39-5):
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
ai += ai/NUM_BANKS;
bi += bi/NUM_BANKS;
temp[bi] += temp[ai];
I don't understand how ai/NUM_BANKS is equivalent to the macro:
#define NUM_BANKS 16
#define LOG_NUM_BANKS 4
#define CONFLICT_FREE_OFFSET(n) \
((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
Isn't it equal to
n >> LOG_NUM_BANKS
Any help is appreciated. Thanks
I wrote that code and co-wrote the article, and I request that you use the article only for learning about scan algorithms, and do not use the code in it. It was written when CUDA was new, and I was new to CUDA. If you use a modern implementation of scan in CUDA you don't need any bank conflict avoidance.
If you want to do scans the easy way, use thrust::inclusive_scan or thrust::exclusive_scan.
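For example, a minimal sketch with Thrust (the input data here is made up):
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    thrust::device_vector<int> data(8, 1);   // 1 1 1 1 1 1 1 1
    thrust::device_vector<int> result(8);

    // Exclusive scan: result[i] = sum of data[0..i-1]  ->  0 1 2 3 4 5 6 7
    thrust::exclusive_scan(data.begin(), data.end(), result.begin());
    return 0;
}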
If you really want to implement a scan, refer to more recent articles such as this one [1]. Or, for a real opus with faster code that will require a bit more study, this one [2]. Or read Sean Baxter's tutorial (though it doesn't include citations of the seminal work on the scan algorithm).
[1] Shubhabrata Sengupta, Mark Harris, Michael Garland, and John D. Owens. "Efficient Parallel Scan Algorithms for many-core GPUs". In Jakub Kurzak, David A. Bader, and Jack Dongarra, editors, Scientific Computing with Multicore and Accelerators, Chapman & Hall/CRC Computational Science, chapter 19, pages 413–442. Taylor & Francis, January 2011.
[2] Merrill, D. and Grimshaw, A. Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia. Dec. 2009.