CUDA5.0 Samples AdvancedQuickSort - cuda

I am reading the CUDA 5.0 samples (AdvancedQuickSort) now. However, I cannot understand this sample totally due to following codes:
// Now compute my own personal offset within this. I need to know how many
// threads with a lane ID less than mine are going to write to the same buffer
// as me. We can use popc to implement a single-operation warp scan in this case.
unsigned lane_mask_lt;
asm( "mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt) );
unsigned int my_mask = greater ? gt_mask : lt_mask;
unsigned int my_offset = __popc(my_mask & lane_mask_lt);
which is in the __global__ void qsort_warp function, especially for this assemble language in the codes. Can anyone help me to explain the meaning of this assemble language?

%lanemask_lt is a special, read-only register in PTX assembly which is initialized with a 32-bit mask with bits set in positions less than the thread’s lane number in the warp. The inline PTX you have posted is simply reading the value of that register and storing it in a variable where it can be used in the subsequent C++ code you posted.
Every version of the CUDA toolkit ships with a PTX assembly lanugage reference guide you can use to look up things like this.

Related

Counting registers/thread in Cuda kernel

The nSight profiler tells me that the following kernel uses 52 registers per thread:
//Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
float dt, float currTime,
float *dev_voles, float *dev_weasels,
curandStateMtgp32 *state)
{
__shared__ float dev_params[9];
__shared__ int BuYeSimStep[4];
if(threadIdx.x < 4)
{
BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
}
if(threadIdx.x < 9){
dev_params[threadIdx.x] = params[threadIdx.x];
}
__syncthreads();
float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
float oldVole = currVole;
float oldWeas = currWeas;
int jj;
if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
{
int dayIndex = 0;
/* Not declaring any new variable from here on, just doing arithmetics.
....... */
If each register has 4 bytes I don't understand how we get to 52 registers, even
assuming that the arrays params[9] and ctrl_params[4] end up in registers (in which
case using shared memory as I did doesn't make sense). I would
like to increase occupancy, but I don't get why I'm using so many registers.
Any ideas?
It's generally difficult to look at C code and predict the register usage from it. The compiler may aggressively optimize code by increasing register usage, perhaps to save an instruction here or there. You seem to be making an assumption that register usage can be predicted from your C code variable allocations, and while there is some connection between the two, you cannot assume register usage can be computed directly from C code variable allocations.
Since you haven't provided your code, nobody can actually help with the register usage. If you want to better understand the register usage, you will need to look at the PTX code directly. To do this, compile your code using nvcc with the -ptx switch, and inspect the resultant .ptx file directly. To do this you may wish to refer to the PTX documentation as well as the nvcc documentation to look at the various compiler options.
You haven't provided your code, so it's not really possible to make any direct suggestions, but you may be able to reduce register usage by reducing constant usage, reducing or refactoring arithmetic usage, switching from double to float, and I'm sure there are many other suggestions as well. Register usage will also be affected if you are passing the -G switch to the compiler.
You can limit the compiler's usage of registers per thread by passing the -maxrregcount switch to nvcc with an appropriate parameter, such as -maxrregcount 20 which will instruct the compiler to limit itself to 20 registers per thread. This tactic may not give good results, however, or you may need to tune the parameter to a value which doesn't sacrifice too much performance. However you may find an optimum choice which doesn't sacrifice too much basic performance but allows you to improve occupancy. If you constrain the compiler too much, it will begin to spill it's needed register usage to local memory, which will generally reduce performance.
You should also be aware that you can pass -Xptxas -v to nvcc which will give useful output about the compiler's register usage and other related data (spilling, etc.) at compile time.
If you want to increase the occupancy, a direct way is using compiler flag: maxregcount to restrict the usage of registers, but it may suffer a performance loss because some registers will be spilled to local memory, which is very slow.
I suggest you debug your code with Eclipse Nsight.
Create a breakpoint at the first line of your kernel and step to there.
In Debug Perspective, inside the CUDA Thread, you have the current stack trace. Right-click on the stack and click on "Instruction Stepping Mode". The window "Disassembly" will open your kernel PTX Assembly. You can continue stepping in your kernel to track the correlation of your source code and the assembly. So you can discover which register is used for.

How to declare a struct with a dynamic array inside it in device

How to declare a struct in device that a member of it, is an array and then dynamically allocated memory for this. for example in below code, compiler said: error : calling a __host__ function("malloc") from a __global__ function("kernel_ScoreMatrix") is not allowed. is there another way for perform this action?
Type ofdev_size_idx_threads is int* and value of it, sent to kernel and used for allocate memory.
struct struct_matrix
{
int *idx_threads_x;
int *idx_threads_y;
int thread_diag_length;
int idx_length;
};
struct struct_matrix matrix[BLOCK_SIZE_Y];
matrix->idx_threads_x= (int *) malloc ((*(dev_size_idx_threads) * sizeof(int) ));
From device code, dynamic memory allocations (malloc and new) are supported only with devices of cc2.0 and greater. If you have a cc2.0 device or greater, and you pass an appropriate flag to nvcc (such as -arch=sm_20) you should not see this error. Note that if you are passing multiple compilation targets (sm_10, sm_20, etc.), if even one of the targets does not meet the cc2.0+ requirement, you will see this error.
If you have a cc1.x device, you will need to perform these types of allocations from the host (e.g. using cudaMalloc) and pass appropriate pointers to your kernel.
If you choose that route (allocating from the host), you may also be interested in my answer to questions like this one.
EDIT: responding to questions below:
In visual studio (2008 express, should be similar for other versions), you can set the compilation target as follows: open project, select Project...Properties, select Configuration Properties...CUDA Runtime API...GPU Now, on the right hand pane, you will see entries like GPU Architecture (1) (and (2) etc.) These are drop-downs that you can click on and select the target(s) you want to compile for. If your GPU is sm_21 I would select that for (1) and leave the others blank, or select compatible versions like sm_20.
To see worked examples, please follow the link I gave above. A couple worked examples are linked from my answer here as well as a description of how it is done.

OpenCL version of cudaMemcpyToSymbol & optimization

Can someone tell me OpenCl version of cudaMemcpyToSymbol for copying __constant to device and getting back to host?
Or usual clenquewritebuffer(...) will do the job ?
Could not find much help in forum. Actually a few lines of demo will suffice.
Also shall I expect same kind of optimization in opencl as that of CUDA using constant cache?
Thanks
I have seen people use cudaMemcpyToSymbol() for setting up constants in the kernel and the compiler could take advantage of those constants when optimizing the code. If one was to setup a memory buffer in openCL to pass such constants to the kernel then the compiler could not use them to optimize the code.
Instead the solution I found is to replace the cudaMemcpyToSymbol() with a print to a string that defines the symbol for the compiler. The compiler can take definitions in the form of -D FOO=bar for setting the symbol FOO to the value bar.
Not sure about OpenCL.Net, but in plain OpenCL: yes, clenquewritebuffer is enough (just remember to create buffer with CL_MEM_READ_ONLY flag set).
Here is a demo from Nvidia GPU Computing SDK (OpenCL/src/oclQuasirandomGenerator/oclQuasirandomGenerator.cpp):
c_Table[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, QRNG_DIMENSIONS * QRNG_RESOLUTION * sizeof(unsigned int),
NULL, &ciErr);
ciErr |= clEnqueueWriteBuffer(cqCommandQueue[i], c_Table[i], CL_TRUE, 0,
QRNG_DIMENSIONS * QRNG_RESOLUTION * sizeof(unsigned int), tableCPU, 0, NULL, NULL);
Constant memory in CUDA and in OpenCL are exactly the same, and provide the same type of optimization. That is, if you use nVidia GPU. On ATI GPUs, it should act similarly. And I doubt that constant memory would give you any benefit over global when run on CPU.

clock() in opencl

I know that there is function clock() in CUDA where you can put in kernel code and query the GPU time. But I wonder if such a thing exists in OpenCL? Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's tool kit).
There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.
Just for others coming her for help: Short introduction to profiling kernel runtime with OpenCL
Enable profiling mode:
cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);
Profiling kernel:
cl_event prof_event;
clEnqueueNDRangeKernel(cmdQueue, kernel, 1 , 0, globalWorkSize, NULL, 0, NULL, &prof_event);
Read profiling data in:
cl_ulong ev_start_time=(cl_ulong)0;
cl_ulong ev_end_time=(cl_ulong)0;
clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);
Calculate kernel execution time:
float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // in usec
Profiling of individual work-items / work-goups is NOT possible yet.
You can set globalWorkSize = localWorkSize for profiling. Then you have only one workgroup.
Btw: Profiling of a single work-item (some work-items) isn't very helpful. With only some work-items you won't be able to hide memory latencies and the overhead leading to not meaningful measurements.
Try this (Only work with NVidia OpenCL of course) :
uint clock_time()
{
uint clock_time;
asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
return clock_time;
}
The NVIDIA OpenCL SDK has an example Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in PTX: Parallel Thread Execution ISA manual. You should be able to replace the %%laneid with %%clock.
I have never tested this with OpenCL but use it in CUDA.
Please be warned that the compiler may reorder or remove the register read.
On NVIDIA you can use the following:
typedef unsigned long uint64_t; // if you haven't done so earlier
inline uint64_t n_nv_Clock()
{
uint64_t n_clock;
asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
return n_clock;
}
The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.
Note that this returns clocks, not nanoseconds. You need to query for device clock frequency (using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0))). Also note that on older devices there are two frequencies (or three if you count the memory frequency which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.
With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).
Now, 10 years later after the question was posted I did some tests on NVidia. I tried running the answers given by user 'Spectral' and 'the swine'. Answer given by 'Spectral' does not work. I always got same invalid values returned by clock_time function.
uint clock_time()
{
uint clock_time;
asm("mov.u32 %0, %%clock;" : "=r"(clock_time)); // this is wrong
return clock_time;
}
After subtracting start and end time I got zero.
So had a look at the PTX assembly which in PyOpenCL you can get this way:
kernel_string = """
your OpenCL code
"""
prg = cl.Program(ctx, kernel_string).build()
print(prg.binaries[0].decode())
It turned out that the clock command was optimized away! So there was no '%clock' instruction in the printed assembly.
Looking into Nvidia's PTX documentation I found the following:
'Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon, e.g.:'
So the function that actually work is this:
uint clock_time()
{
uint clock_time;
asm volatile ("mov.u32 %0, %%clock;" : "=r"(clock_time) :: "memory");
return clock_time;
}
The assembly contained lines like:
// inline asm
mov.u32 %r13, %clock;
// inline asm
The version given by 'the swine' also works.

Generating random numbers from a gaussian distribution in CUDA

I've searched a lot over the internet to find a way to generate random numbers on my CUDA device, within a kernel. The numbers must come from a gaussian distribution.
The best thing I found was from NVIDIA itself. It is the Wallace algorithm, that uses a uniform distribution to build a gaussian one. But the code samples they give lack explanation and I really need to understand how the algorithm goes, especially on the device. For example, they give:
__device__ void generateRandomNumbers_wallace(
unsigned seed, // Initialization seed
float *chi2Corrections, // Set of correction values
float *globalPool, // Input random number pool
float *output // Output random numbers
unsigned tid=threadIdx.x;
// Load global pool into shared memory.
unsigned offset = __mul24(POOL_SIZE, blockIdx.x);
for( int i = 0; i < 4; i++ )
pool[tid+THREADS*i] = globalPool[offset+TOTAL_THREADS*i+tid];
__syncthreads();
const unsigned lcg_a=241;
const unsigned lcg_c=59;
const unsigned lcg_m=256;
const unsigned mod_mask = lcg_m-1;
seed=(seed+tid)&mod_mask ;
// Loop generating outputs repeatedly
for( int loop = 0; loop < OUTPUTS_PER_RUN; loop++ )
{
Transform();
unsigned intermediate_address;
i_a = __mul24(loop,8*TOTAL_THREADS)+8*THREADS *
blockIdx.x + threadIdx.x;
float chi2CorrAndScale=chi2Corrections[
blockIdx.x * OUTPUTS_PER_RUN + loop];
for( i = 0; i < 4; i++ )
output[i_a + i*THREADS]=chi2CorrAndScale*pool[tid+THREADS*i];
}
First of all, many of the variables declared aren't even used in the function! And I really don't get what the "8" is for in the second loop. I understand the "4" in the other loops have something to do with the 4x4 orthogonal matrix block, am I right? Could anyone give me a better idea of what is going on here?
Anyway, does anyone have any good code samples I could use? Or does anyone have another way of generating random gaussian numbers in a CUDA kernel? Code samples will be much appreciated.
Thanks!
You could use CURAND, which is included with the CUDA Toolkit (version 3.2 and later). It'd be far simpler!
A few notes on the code you posted:
The Wallace generator transforms Gaussian to Gaussian (i.e. not Uniform to Gaussian)
CUDA code has two implicit variables: blockIdx and threadIdx - these define the block index and thread index with a block, see the CUDA Programming Guide for more information
The code uses __mul24, on sm_20 and later this is actually slower than "ordinary" 32-bit multiplication so I would avoid it (even on older architectures for simplicity)
The Box-Muller method is also good.
The Fast Walsh Hadamard transform is done by patterns of addition and subtraction. Hence the central limit theorem applies. An array of uniform random numbers that undergoes a Walsh Hadamard transformation will have a Gaussian/Normal distribution. There are some slight technical details about that. The algorithm was not discovered by Wallace. It was first published in Servo Magazine around 1993/1994 by myself.
I have code about the Walsh Hadamard transform at www.code.google.com/p/lemontree
Regards,
Sean O'Connor