I have a CUDA program that should perform a reduction on double-precision data, and I am following Julien Demouth's slides named "Shuffle: Tips and Tricks".
The shuffle function is below:
/* for shuffle of double-precision values */
__device__ __inline__ double shfl(double x, int lane)
{
    int warpSize = 32;
    // Split the double number into 2 32b registers.
    int lo, hi;
    asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
    // Shuffle the two 32b registers.
    lo = __shfl_xor(lo,lane,warpSize);
    hi = __shfl_xor(hi,lane,warpSize);
    // Recreate the 64b number.
    asm volatile("mov.b64 %0,{%1,%2};":"=d"(x):"r"(lo),"r"(hi));
    return x;
}
At present, I get the errors below when compiling the program:
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 71; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 271; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 287; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 302; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 317; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 332; error : Arguments mismatch for instruction 'mov'
ptxas fatal : Ptx assembly aborted due to errors
make: *** [csr_double] error 255
Could someone give some advice?
There is a syntax error in the inline assembly instruction that loads the double argument into 32-bit registers. This:
asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
should be:
asm volatile("mov.b64 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
Using a "d" (i.e. a 64-bit floating-point register) as the source of a 32-bit load is illegal (and a mov.b32 makes no sense here; the code must move 64 bits into two 32-bit registers).
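For reference, the complete helper with only that one instruction corrected might look like the sketch below (still using the pre-CUDA 9 __shfl_xor intrinsic, everything else unchanged):

/* Sketch: shuffle for double-precision values, with the corrected mov.b64 unpack */
__device__ __inline__ double shfl(double x, int lane)
{
    // Split the 64-bit double into two 32-bit registers.
    int lo, hi;
    asm volatile("mov.b64 {%0,%1}, %2;" : "=r"(lo), "=r"(hi) : "d"(x));
    // Shuffle the two 32-bit halves independently.
    lo = __shfl_xor(lo, lane);
    hi = __shfl_xor(hi, lane);
    // Recreate the 64-bit number.
    asm volatile("mov.b64 %0, {%1,%2};" : "=d"(x) : "r"(lo), "r"(hi));
    return x;
}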
As of CUDA 9.0, __shfl, __shfl_up, __shfl_down and __shfl_xor have been deprecated.
The newly introduced functions __shfl_sync, __shfl_up_sync, __shfl_down_sync and __shfl_xor_sync have the following prototypes:
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
where T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float or double.
You no longer need to create your own shuffle functions for double-precision arithmetic.
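With the double overload available, a warp-level reduction for double can be written directly with the new intrinsics. A minimal sketch (assuming all 32 lanes of the warp are active, hence the full 0xffffffff mask):

__device__ double warp_reduce_sum(double val)
{
    // Butterfly (XOR) reduction across the warp; every lane ends up with the sum.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_xor_sync(0xffffffff, val, offset);
    return val;
}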
Suppose I have two __device__ CUDA functions, each having the following local variable:
__shared__ int a[123];
and another function (say it's my kernel, i.e. a __global__ function), with:
extern __shared__ int b[];
Is this explicitly allowed/forbidden by nVIDIA? (I don't see it in the programming guide section B.2.3 on __shared__.) Do the sizes all count together towards the shared memory limit, or is it only the maximum possibly in use at a single time? Or some other rule?
This can be considered a follow-up question to this one.
Shared memory is split into two parts: statically allocated and dynamically allocated. The first part is calculated during compilation, and each declaration is an actual allocation; activating ptxas info during compilation illustrates it here:
ptxas info : Used 22 registers, 384 bytes smem, 48 bytes cmem[0]
Here we have 384 bytes, which is 3 arrays of 32 ints (see the sample code below).
Since Kepler you may pass a pointer to shared memory to another function, allowing a device sub-function to access another shared memory declaration.
Then comes the dynamically allocated shared memory, whose reserved size is declared at kernel launch.
Here is an example of various uses in a couple of functions. Note the pointer value of each shared memory region.
#include <cstdio>

// Each device function has its own statically allocated shared array.
__device__ void dev1()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x;
    if (threadIdx.x == 0)
        printf("dev1 : %p\n", (void*)a);
}

__device__ void dev2()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x * 5;
    if (threadIdx.x == 0)
        printf("dev2 : %p\n", (void*)a);
}

__global__ void kernel(int* res, int* res2)
{
    __shared__ int a[32];          // static allocation local to the kernel
    extern __shared__ int b[];     // dynamic allocation, sized at launch
    a[threadIdx.x] = 0;
    b[threadIdx.x] = threadIdx.x * 3;
    dev1();
    __syncthreads();
    dev2();
    __syncthreads();
    res[threadIdx.x] = a[threadIdx.x];
    res2[threadIdx.x] = b[threadIdx.x];
    if (threadIdx.x == 0)
        printf("global a : %p\n", (void*)a);
    if (threadIdx.x == 0)
        printf("global b : %p\n", (void*)b);
}

int main()
{
    int* dres;
    int* dres2;
    cudaMalloc((void**)&dres, 32 * sizeof(int));
    cudaMalloc((void**)&dres2, 32 * sizeof(int));
    // The third launch parameter reserves the dynamically allocated shared memory.
    kernel<<<1, 32, 32 * sizeof(int)>>>(dres, dres2);
    int hres[32];
    int hres2[32];
    cudaMemcpy(hres, dres, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hres2, dres2, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int k = 0; k < 32; ++k)
        printf("%d -- %d \n", hres[k], hres2[k]);
    return 0;
}
Compiling this code with ptxas info enabled reports 384 bytes of smem: one allocation for the kernel's a array, a second for dev1's a array, and a third for dev2's a array, totalling 3*32*sizeof(int) = 384 bytes.
When the kernel is run with dynamic shared memory equal to 32*sizeof(int), the pointer b starts right after these three arrays.
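If you want to check at runtime how much static shared memory the kernel actually reserves (and therefore how much of the per-block limit is left for the dynamic part), cudaFuncGetAttributes reports it. A minimal sketch, meant to be dropped into main() of the example above:

// Query the static shared memory footprint that ptxas assigned to the kernel,
// and the per-block shared memory limit of device 0.
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, kernel);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

printf("static smem per block : %zu bytes\n", attr.sharedSizeBytes);   // matches the 384 bytes from ptxas
printf("smem limit per block  : %zu bytes\n", prop.sharedMemPerBlock);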
EDIT:
The PTX file generated by this code holds the declarations of the statically-defined shared memory,
.shared .align 4 .b8 _ZZ4dev1vE1a[128];
.shared .align 4 .b8 _ZZ4dev2vE1a[128];
.extern .shared .align 4 .b8 b[];
except for the entry point, where the array is declared in the body of the function:
// _ZZ6kernelPiS_E1a has been demoted
The shared state space is defined in the PTX documentation here:
The shared (.shared) state space is a per-CTA region of memory for threads in a CTA to share data. An address in shared memory can be read and written by any thread in a CTA. Use ld.shared and st.shared to access shared variables.
However, it gives no detail on the runtime behavior. The programming guide mentions the topic here, with no further detail on mixing the two.
During PTX compilation, the compiler may know the amount of shared memory that is statically allocated; there might be some supplemental magic. Looking at the SASS, the first instructions use SR_LMEMHIOFF:
IADD32I R1, R1, -0x8;
S2R R0, SR_LMEMHIOFF;
ISETP.GE.U32.AND P0, PT, R1, R0, PT;
and calling the functions in reverse order assigns different addresses to the statically-allocated shared memory arrays (this looks very much like a form of stack allocation).
I believe the ptxas compiler calculates all the shared memory it might need in the worst case, when all the methods may be called (when one of the methods is not used and function pointers are involved, the address of b does not change, and the unallocated shared memory region is never accessed).
Finally, as einpoklum suggests in a comment, this is experimental and not part of a norm/API definition.
I'm working with CUDA as a beginner and am trying to build some pre-written code.
The compiler gives an error for every atomic operation that the code contains, for example:
__global__ void MarkEdgesUV(unsigned int *d_edge_flag, unsigned long long int *d_appended_uvw,
                            unsigned int *d_size, int no_of_edges)
{
    unsigned int tid = blockIdx.x * MAX_THREADS_PER_BLOCK + threadIdx.x;
    if (tid < no_of_edges)
    {
        if (tid > 0)
        {
            unsigned long long int test = INF;
            test = test << NO_OF_BITS_MOVED_FOR_VERTEX_IDS;
            test |= INF;
            unsigned long long int test1 = d_appended_uvw[tid]   >> (64 - (NO_OF_BITS_MOVED_FOR_VERTEX_IDS + NO_OF_BITS_MOVED_FOR_VERTEX_IDS));
            unsigned long long int test2 = d_appended_uvw[tid-1] >> (64 - (NO_OF_BITS_MOVED_FOR_VERTEX_IDS + NO_OF_BITS_MOVED_FOR_VERTEX_IDS));
            if (test1 > test2)
                d_edge_flag[tid] = 1;
            if (test1 == test)
                atomicMin(d_size, tid); // also records the last element in the array, i.e. the size of the new edge list
        }
        else
            d_edge_flag[tid] = 1;
    }
}
gives the error:
error: identifier "atomicMin" is undefined
This happens to be a very reliable code; I also checked, and the usage of atomics seems correct. Please explain why the error has occurred.
I guess you are compiling with nvcc only (defaulting to sm_10), without specifying the minimum needed compute capability. In fact, atomicMin() on 32-bit words in global memory was introduced with devices of CC 1.1 (compute capability 1.1), as you can see in this table.
Try compiling with
nvcc -arch sm_11 ...
This will let the atomicMin() function be recognized. However, it is best to compile for the actual architecture of your GPU.
EDIT: since the card is a Tesla C2075,
nvcc -arch sm_20 ...
should work as well.
EDIT: as requested by the OP, adding a reference:
You can find a very detailed explanation of the meaning of -arch and -code options here.
In particular it's reported explicitly that both -arch and -code options can be omitted.
nvcc x.cu
is shorthand for
nvcc x.cu -arch=compute_10 -code=sm_10,compute_10
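If you need one binary that runs on several GPU generations, you can also combine multiple -gencode clauses into a fat binary. A sketch (the file name and the chosen architectures are illustrative):

# One nvcc invocation covering both a CC 1.1 and a CC 2.0 target.
nvcc mykernel.cu -o mykernel \
    -gencode arch=compute_11,code=sm_11 \
    -gencode arch=compute_20,code=sm_20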
Let's say I have this __device__ function:
__device__ unsigned char* dev_kernel(unsigned char* array_sh, int params){
return array_sh + params;
}
And within the __global__ kernel I use it in this way:
uarray = dev_kernel (uarray, params);
Where uarray is an array located in shared memory.
But when I use cuda-gdb to see the address of uarray within the __global__ kernel I get:
(#generic unsigned char * #shared) 0x1000010 "z\377*"
And within the __device__ function I get:
(unsigned char * #generic) 0x1000010 <Error reading address 0x1000010: Operation not permitted>
Despite the error, the program is running OK (maybe it is some limitation of cuda-gdb).
So, I want to know: within the __device__ function, is uarray still in shared memory? I changed the array from global to shared memory and the time is almost the same (with shared memory the time is a little worse).
So, I want to know: within the __device__ function, is uarray still in shared memory?
Yes, when you pass a pointer to shared memory to a device function this way, it still points to the same place in shared memory.
In response to the follow-up questions, which are perplexing me, I elected to show a simple example:
$ cat t249.cu
#include <stdio.h>
#define SSIZE 256


__device__ unsigned char* dev_kernel(unsigned char* array_sh, int params){
  return array_sh + params;
}

__global__ void mykernel(){
  __shared__ unsigned char myshared[SSIZE];
  __shared__ unsigned char *u_array;
  for (int i = 0; i< SSIZE; i++)
    myshared[i] = (unsigned char) i;
  unsigned char *loc = dev_kernel(myshared, 5);
  u_array = loc;
  printf("val = %d\n", *loc);
  printf("val = %d\n", *u_array);
}

int main(){
  mykernel<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -arch=sm_20 -g -G -o t249 t249.cu
$ cuda-gdb ./t249
NVIDIA (R) CUDA Debugger
5.5 release
....
Reading symbols from /home/user2/misc/t249...done.
(cuda-gdb) break mykernel
Breakpoint 1 at 0x4025dc: file t249.cu, line 9.
(cuda-gdb) run
Starting program: /home/user2/misc/t249
[Thread debugging using libthread_db enabled]
Breakpoint 1, mykernel () at t249.cu:9
9 __global__ void mykernel(){
(cuda-gdb) break 14
Breakpoint 2 at 0x4025e1: file t249.cu, line 14.
(cuda-gdb) continue
Continuing.
[New Thread 0x7ffff725a700 (LWP 26184)]
[Context Create of context 0x67e360 on Device 0]
[Launch of CUDA Kernel 0 (mykernel<<<(1,1,1),(1,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 2, warp 0, lane 0]
Breakpoint 1, mykernel<<<(1,1,1),(1,1,1)>>> () at t249.cu:12
12 for (int i = 0; i< SSIZE; i++)
(cuda-gdb) continue
Continuing.
Breakpoint 2, mykernel<<<(1,1,1),(1,1,1)>>> () at t249.cu:14
14 unsigned char *loc = dev_kernel(myshared, 5);
(cuda-gdb) print &(myshared[0])
$1 = (#shared unsigned char *) 0x8 ""
^
|
cuda-gdb is telling you that this pointer is defined in a __shared__ statement, and therefore its storage is implicit and it is unmodifiable.
(cuda-gdb) print &(u_array)
$2 = (#generic unsigned char * #shared *) 0x0
^ ^
| u_array is stored in shared memory.
u_array is a generic pointer, meaning it can point to anything.
(cuda-gdb) step
dev_kernel(unsigned char * #generic, int) (array_sh=0x1000008 "", params=5)
at t249.cu:6
6 return array_sh + params;
(cuda-gdb) print array_sh
$3 = (#generic unsigned char * #register) 0x1000008 ""
^ ^
| array_sh is stored in a register.
array_sh is a generic pointer, it can point to anything.
(cuda-gdb) print u_array
No symbol "u_array" in current context.
(note that I can't access u_array from inside the __device__ function, so I don't understand your comment there.)
(cuda-gdb) step
mykernel<<<(1,1,1),(1,1,1)>>> () at t249.cu:15
15 u_array = loc;
(cuda-gdb) step
16 printf("val = %d\n", *loc);
(cuda-gdb) print u_array
$4 = (
#generic unsigned char * #shared) 0x100000d ......
^ ^
| u_array is stored in shared memory
u_array is a generic pointer, it can point to anything
(cuda-gdb)
Although you haven't provided it, I am assuming your definition of u_array is similar to mine, based on the cuda-gdb output you are getting.
Note that the indicators like #shared are not telling you what kind of memory a pointer is pointing to, they are telling you either what kind of pointer it is (defined implicitly in a __shared__ statement) or else where it is stored (in shared memory).
If this doesn't sort out your questions, please provide a complete example, along with complete cuda-gdb session output, just as I have.
The CUDA manual specifies the number of 32-bit registers per multiprocessor. Does that mean that:
A double variable takes two registers?
A pointer variable takes two registers? It has to be more than one register on a Fermi card with 6 GB of memory, right?
If the answer to question 2 is yes, it must be better to use fewer pointer variables and more int indices.
E.g., this kernel code:
float* p1; // two regs
float* p2 = p1 + 1000; // two regs
int i; // one reg
for ( i = 0; i < n; i++ )
{
CODE THAT USES p1[i] and p2[i]
}
theoretically requires more registers than this kernel code:
float* p1; // two regs
int i; // one reg
int j; // one reg
for ( i = 0, j = 1000; i < n; i++, j++ )
{
CODE THAT USES p1[i] and p1[j]
}
The short answers to your three questions are:
Yes.
Yes, if the code is compiled for a 64-bit host operating system. Device pointer size always matches host application pointer size in CUDA.
No.
To expand on point 3, consider the following two simple memory copy kernels:
__global__
void debunk(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);
    for(int j=0; j<n; j++) {
        out[i+j] = in[i+j];
    }
}

__global__
void debunk2(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);
    float *x = in + i;
    float *y = out + i;
    for(int j=0; j<n; j++, x++, y++) {
        *y = *x;
    }
}
By your reckoning, debunk must use fewer registers because it has only two local integer variables, whereas debunk2 has two additional pointers. And yet, when I compile them using the CUDA 5 release toolchain:
$ nvcc -m64 -arch=sm_20 -c -Xptxas="-v" pointer_size.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6debunkPfS_i' for 'sm_20'
ptxas info : Function properties for _Z6debunkPfS_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]
ptxas info : Compiling entry function '_Z7debunk2PfS_i' for 'sm_20'
ptxas info : Function properties for _Z7debunk2PfS_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]
They compile to the exact same register count. And if you disassemble the toolchain output you will see that apart from the setup code, the final instruction streams are almost identical. There are a number of reasons for this, but it basically comes down to two simple rules:
Trying to determine the register count from C code (or even PTX assembler) is mostly futile
Trying to second guess a very sophisticated compiler and assembler is also mostly futile.
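If register pressure ever does become a real problem, the supported levers are the __launch_bounds__ qualifier and the -maxrregcount compiler flag rather than source-level tricks. A sketch (the bounds values are illustrative, not a recommendation):

// Hint the compiler about the intended launch configuration instead of
// hand-tuning register usage in the C source.
__global__ void
__launch_bounds__(256, 4)   // at most 256 threads per block, aim for 4 resident blocks per SM
debunk(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x * blockDim.x);
    for (int j = 0; j < n; j++)
        out[i + j] = in[i + j];
}

// or cap registers globally on the command line:
//   nvcc -m64 -arch=sm_20 -maxrregcount=32 -c pointer_size.cu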