Shift operator in CUDA [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 9 years ago.
Given this CUDA code, I am trying to perform bit-shifting operations, but the values they produce are zero, which should not be happening. Does anyone know how to fix this issue? Am I missing a CUDA header include?
Code
__device__ unsigned int FI( unsigned int in_data, unsigned int subkey,
    unsigned int *KLi1, unsigned int *KLi2, unsigned int *KOi1, unsigned int *KOi2,
    unsigned int *KOi3, unsigned int *KIi1, unsigned int *KIi2, unsigned int *KIi3) {
    unsigned int nine, seven;
    unsigned int S7[128] = {};
    unsigned int S9[512] = {};

    nine  = (in_data >> 7);
    seven = (in_data & 0x7F);
    /* Now run the various operations */
    nine  = (unsigned int)(S9[nine] ^ seven);
    seven = (unsigned int)(S7[seven] ^ (nine & 0x7F));
    seven ^= (subkey >> 9);
    nine  ^= (subkey & 0x1FF);
    nine  = (unsigned int)(S9[nine] ^ seven);
    seven = (unsigned int)(S7[seven] ^ (nine & 0x7F));
    in_data = (unsigned int)((seven << 9) + nine);
    return( in_data );
}
Breakpoint Analysis
Here is an example of a code snippet that shifts an unsigned int 7 places to the right. When I run my executable under cuda-gdb and break at that instruction, I observe that the value after the shift is still zero when it shouldn't be. When I evaluate the same operation directly at the cuda-gdb command prompt, I get a non-zero value. Any suggestions or hints?
The variables nine and seven should be non-zero based on the value of in_data.
nine = (in_data>>7);
seven = (in_data&0x7F);
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
Breakpoint 1, FI (KLi1=0x3fffae0, KLi2=0x3fffb00, KOi1=0x3fffb20, KOi2=0x3fffb40, KOi3=0x3fffb60,
KIi1=0x3fffb80, KIi2=0x3fffba0, KIi3=0x3fffbc0, in_data=461, subkey=0) at kasumiOp.cu:61
61 nine = (in_data>>7);
(cuda-gdb) p in_data
$1 = 461
(cuda-gdb) step
62 seven = (in_data&0x7F);
(cuda-gdb) p nine
$2 = 0
(cuda-gdb) step
65 nine = (unsigned int)(S9[nine] ^ seven);
(cuda-gdb) p seven
$3 = 0
(cuda-gdb) p 461 >> 7
$4 = 3
(cuda-gdb) cuda thread
thread (1,0,0)
(cuda-gdb) p 561 & 0x7f
$5 = 49
(cuda-gdb) p 461 & 0x7f
$6 = 77
So in_data does hold a value. I will try a trivial example and see if I can reproduce the same behaviour.
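Something like this minimal standalone sketch is what I have in mind (the kernel name and the single-thread launch are just assumptions to isolate the shift):
#include <cstdio>

// Trivial repro sketch: perform the same shift/mask on a known value and print it.
// Device-side printf requires compute capability 2.0 or newer.
__global__ void shift_test(unsigned int in_data) {
    unsigned int nine  = in_data >> 7;    // expect 3 when in_data == 461
    unsigned int seven = in_data & 0x7F;  // expect 77 when in_data == 461
    printf("nine = %u, seven = %u\n", nine, seven);
}

int main() {
    shift_test<<<1, 1>>>(461u);
    cudaDeviceSynchronize();
    return 0;
}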

With the limited information provided (no code) I might take a guess:
the CUDA-GDB documentation states:
The GDB print command has been extended to decipher the location of any program variable and can be used to display the contents of any CUDA program variable including:
* data allocated via cudaMalloc()
* data that resides in various GPU memory regions, such as shared, local, and global memory
* special CUDA runtime variables, such as threadIdx
If in_data refers to a particular memory area, then it might be that you're dealing with a memory pointer instead of the real data.
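As a purely hypothetical sketch (these function names are made up, not from your code): if the parameter were really a device pointer rather than a plain value, the shift would have to be applied to the dereferenced data, not to the parameter itself.
// Hypothetical illustration only; shift_value / shift_from_pointer are made-up names.
__device__ unsigned int shift_value(unsigned int in_data) {
    return in_data >> 7;            // shifts the value itself
}
__device__ unsigned int shift_from_pointer(const unsigned int *in_data) {
    return (*in_data) >> 7;         // dereference first, then shift the stored value
}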
Just my two cents, though.

Related

Can anyone tell me why my CUDA C code is returning my array Z to be wholly zero? (again - but with different code this time) [duplicate]

Here is my code:
int threadNum = BLOCKDIM/8;
dim3 dimBlock(threadNum,threadNum);
int blocks1 = nWidth/threadNum + (nWidth%threadNum == 0 ? 0 : 1);
int blocks2 = nHeight/threadNum + (nHeight%threadNum == 0 ? 0 : 1);
dim3 dimGrid;
dimGrid.x = blocks1;
dimGrid.y = blocks2;
// dim3 numThreads2(BLOCKDIM);
// dim3 numBlocks2(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1) );
perform_scaling<<<dimGrid,dimBlock>>>(imageDevice,imageDevice_new,min,max,nWidth, nHeight);
cudaError_t err = cudaGetLastError();
cudasafe(err,"Kernel2");
This is the launch of my second kernel, and it is fully independent in terms of data usage. BLOCKDIM is 512, nWidth and nHeight are 512 too, and cudasafe simply prints the string message corresponding to the error code. This section of the code gives a configuration error just after the kernel call.
What might cause this error? Any ideas?
This type of error message frequently refers to the launch configuration parameters (grid/threadblock dimensions in this case, could also be shared memory, etc. in other cases). When you see a message like this it's a good idea just to print out your actual config parameters before launching the kernel, to see if you've made any mistakes.
You said BLOCKDIM = 512. You have threadNum = BLOCKDIM/8 so threadNum = 64. Your threadblock configuration is:
dim3 dimBlock(threadNum,threadNum);
So you are asking to launch blocks of 64 x 64 threads, that is 4096 threads per block. That won't work on any generation of CUDA devices. All current CUDA devices are limited to a maximum of 1024 threads per block, which is the product of the 3 block dimensions.
Maximum dimensions are listed in table 14 of the CUDA programming guide, and also available via the deviceQuery CUDA sample code.
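For example, a quick sanity check along these lines (a sketch, not your exact code; the 16x16 block is just an assumed value) both prints the configuration and keeps the block within the 1024-thread limit:
// Sketch: print the launch configuration and keep the block at 16 x 16 = 256 threads.
int threadNum = 16;
dim3 dimBlock(threadNum, threadNum);
dim3 dimGrid((nWidth  + threadNum - 1) / threadNum,    // round up to cover every pixel
             (nHeight + threadNum - 1) / threadNum);

printf("grid = (%u,%u,%u)  block = (%u,%u,%u)\n",
       dimGrid.x, dimGrid.y, dimGrid.z, dimBlock.x, dimBlock.y, dimBlock.z);

perform_scaling<<<dimGrid,dimBlock>>>(imageDevice, imageDevice_new, min, max, nWidth, nHeight);
cudasafe(cudaGetLastError(), "Kernel2");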
Just to add to the previous answers: you can also query the maximum number of threads allowed from within your code, so it can run on other devices without hard-coding the number of threads you will use:
struct cudaDeviceProp properties;
cudaGetDeviceProperties(&properties, device);
cout << "using " << properties.multiProcessorCount << " multiprocessors" << endl;
cout << "max threads per block: " << properties.maxThreadsPerBlock << endl;
cout << "max threads per processor: " << properties.maxThreadsPerMultiProcessor << endl;

C to MIPS translate

Here I have been given an exam question that I partly solved, but I do not understand it completely.
Why is volatile used here? And I think the missing expression must be switches >> 8.
When it comes to the translation I have some difficulty.
Eight switches are memory mapped to the memory address 0xabab0020, where the
least significant bit (index 0) represents switch number 1 and the bit with index 7
represents switch number 8. A bit value 1 indicates that the switch is on and
0 means that it is off. Write down the missing C code expression, such that the
while loop exits if the toggle switch number 8 is off.
volatile int * switches = (volatile int *) 0xabab0020;
volatile int * leds = (volatile int *) 0xabab0040;
while(/* MISSING C CODE EXPRESSION */){
    *leds = (*switches >> 4) & 1;
}
Translate the complete C code above into MIPS assembly code, including the missing C code expression. You are not allowed to use pseudo instructions.
Without volatile, your code can legally be interpreted by the compiler as:
int * switches = (volatile int *) 0xabab0020;
int * leds = (volatile int *) 0xabab0040;
*leds = (*switches >> 4) & 1;
while(/* MISSING C CODE EXPRESSION */){
}
The volatile qualifier is an indication to the C compiler that the data at addresses switches and leds can be changed by another agent in the system. Without the volatile qualifier, the compiler would be allowed to optimize references to these variables away.
The problem description says the loop should run while bit 7 of *switches is set, i.e. while ((*switches & 0x80) != 0) (note the parentheses: != binds more tightly than &, so *switches & 0x80 != 0 would not test the intended bit).
Translating the code is left as an exercise for the reader.
volatile int * switches = (volatile int *) 0xabab0020;
volatile int * leds = (volatile int *) 0xabab0040;
while((*switches >> 7) & 1){   /* switch number 8 is bit index 7 */
    *leds = (*switches >> 4) & 1;
}
To MIPS:
lui  $s0, 0xabab          # load the upper half of the switches address
ori  $s0, $s0, 0x0020
lui  $s1, 0xabab          # load the upper half of the leds address
ori  $s1, $s1, 0x0040
while:
    lw   $t0, 0($s0)      # read the switches
    srl  $t0, $t0, 7      # only bit index 7 (switch number 8) is important
    andi $t0, $t0, 1      # clear the other bits, keep the LSB
    beq  $t0, $0, done    # exit the loop if switch 8 is off
    lw   $t1, 0($s0)      # read the switches again
    srl  $t1, $t1, 4      # bit index 4 drives the led
    andi $t1, $t1, 1
    sw   $t1, 0($s1)      # write it to the leds
    j    while
done:

Can CUDA branch divergence help me in this case?

This is little more than a thought experiment right now, but I want to check my understanding of the CUDA execution model. Consider the following case:
I am running on a GPU with poor double-precision performance (a non-Tesla card).
I have a kernel that needs to calculate a value using double precision. That value is a constant for the rest of the runtime of the kernel, and it is also constant across a warp.
Is something like the following pseudocode advantageous?
// value that we use later in the kernel; this is constant across all threads
// in a warp
int constant_value;
// check to see if this is the first thread in a warp
enum { warp_size = 32 };
if (!(threadIdx.x & (warp_size - 1)))
{
    // only do the double-precision math in one thread
    constant_value = (int) round(double_precision_calculation());
}
// broadcast constant_value to all threads in the warp
constant_value = __shfl(constant_value, 0);
// go on to use constant_value as needed later in the kernel
The reason why I considered doing this is my (possibly wrong) understanding of how double-precision resources are made available on each multiprocessor. From what I understand, there are simply 1/32 as many double-precision ALUs as single-precision ones on recent Geforce cards. Does this mean that if the other threads in a warp diverge, I can work around this lack of resources, and still get decent performance, as long as the double-precision values that I want can be broadcast to all threads in a warp?
Does this mean that if the other threads in a warp diverge, I can work around this lack of resources, and still get decent performance, as long as the double-precision values that I want can be broadcast to all threads in a warp?
No, you can't.
An instruction issue always occurs at the warp level, even in a warp-diverged scenario. Since it is issued at the warp level, it will require/use/schedule enough execution resources for the warp, even for inactive threads.
Therefore a computation done on only one thread will still use the same resources/scheduling slot as a computation done on all 32 threads in the warp.
For example, a floating point multiply will require 32 instances of usage of a floating point ALU. The exact scheduling of this will vary based on the specific GPU, but you cannot reduce the 32 instance usage to a lower number through warp divergence or any other mechanism.
Based on a question in the comments, here's a worked example on CUDA 7.5, Fedora 20, GT640 (GK208 - has 1/24 ratio of DP to SP units):
$ cat t1241.cu
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
const int nTPB = 32;
const int nBLK = 1;
const int rows = 1048576;
const int nSD = 128;
typedef double mytype;
template <bool use_warp>
__global__ void mpy_k(const mytype * in, mytype * out){
  __shared__ mytype sdata[nTPB*nSD];
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  mytype accum = in[idx];
#pragma unroll 128
  for (int i = 0; i < rows; i++)
    if (use_warp)
      accum += accum*sdata[threadIdx.x+(i&(nSD-1))*nTPB];
    else
      if (threadIdx.x == 0)
        accum += accum*sdata[threadIdx.x+(i&(nSD-1))*nTPB];
  out[idx] = accum;
}
int main(){
  mytype *din, *dout;
  cudaMalloc(&din, nTPB*nBLK*rows*sizeof(mytype));
  cudaMalloc(&dout, nTPB*nBLK*sizeof(mytype));
  cudaMemset(din, 0, nTPB*nBLK*rows*sizeof(mytype));
  cudaMemset(dout, 0, nTPB*nBLK*sizeof(mytype));
  mpy_k<true><<<nBLK, nTPB>>>(din, dout); // warm-up
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  mpy_k<true><<<nBLK, nTPB>>>(din, dout);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("full warp elapsed time: %f\n", dt/(float)USECPSEC);
  mpy_k<false><<<nBLK, nTPB>>>(din, dout); // warm-up
  cudaDeviceSynchronize();
  dt = dtime_usec(0);
  mpy_k<false><<<nBLK, nTPB>>>(din, dout);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("one thread elapsed time: %f\n", dt/(float)USECPSEC);
  cudaError_t res = cudaGetLastError();
  if (res != cudaSuccess) printf("CUDA runtime failure %s\n", cudaGetErrorString(res));
  return 0;
}
$ nvcc -arch=sm_35 -o t1241 t1241.cu
$ CUDA_VISIBLE_DEVICES="1" ./t1241
full warp elapsed time: 0.034346
one thread elapsed time: 0.049174
$
It is not faster to use just one thread in the warp for the floating-point multiply; in this test the one-thread case is actually slower.

Shared memory address passed to device function is still shared memory?

Let's say I have this __device__ function:
__device__ unsigned char* dev_kernel(unsigned char* array_sh, int params){
    return array_sh + params;
}
And within the __global__ kernel I use it in this way:
uarray = dev_kernel (uarray, params);
Where uarray is an array located in shared memory.
But when I use cuda-gdb to see the address of uarray within the __global__ kernel I get:
(#generic unsigned char * #shared) 0x1000010 "z\377*"
And within the __device__ function I get:
(unsigned char * #generic) 0x1000010 <Error reading address 0x1000010: Operation not permitted>
Despite the error, the program is running OK (maybe it is some limitation of cuda-gdb).
So I want to know: within the __device__ function, is uarray still in shared memory? I'm changing the array from global to shared memory and the time is almost the same (with shared memory the time is a little worse).
So I want to know: within the __device__ function, is uarray still in shared memory?
Yes, when you pass a pointer to shared memory to a device function this way, it still points to the same place in shared memory.
In response to the follow-up questions, which are perplexing me, I elected to show a simple example:
$ cat t249.cu
#include <stdio.h>
#define SSIZE 256
__device__ unsigned char* dev_kernel(unsigned char* array_sh, int params){
  return array_sh + params;
}
__global__ void mykernel(){
  __shared__ unsigned char myshared[SSIZE];
  __shared__ unsigned char *u_array;
  for (int i = 0; i < SSIZE; i++)
    myshared[i] = (unsigned char) i;
  unsigned char *loc = dev_kernel(myshared, 5);
  u_array = loc;
  printf("val = %d\n", *loc);
  printf("val = %d\n", *u_array);
}
int main(){
  mykernel<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -arch=sm_20 -g -G -o t249 t249.cu
$ cuda-gdb ./t249
NVIDIA (R) CUDA Debugger
5.5 release
....
Reading symbols from /home/user2/misc/t249...done.
(cuda-gdb) break mykernel
Breakpoint 1 at 0x4025dc: file t249.cu, line 9.
(cuda-gdb) run
Starting program: /home/user2/misc/t249
[Thread debugging using libthread_db enabled]
Breakpoint 1, mykernel () at t249.cu:9
9 __global__ void mykernel(){
(cuda-gdb) break 14
Breakpoint 2 at 0x4025e1: file t249.cu, line 14.
(cuda-gdb) continue
Continuing.
[New Thread 0x7ffff725a700 (LWP 26184)]
[Context Create of context 0x67e360 on Device 0]
[Launch of CUDA Kernel 0 (mykernel<<<(1,1,1),(1,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 2, warp 0, lane 0]
Breakpoint 1, mykernel<<<(1,1,1),(1,1,1)>>> () at t249.cu:12
12 for (int i = 0; i< SSIZE; i++)
(cuda-gdb) continue
Continuing.
Breakpoint 2, mykernel<<<(1,1,1),(1,1,1)>>> () at t249.cu:14
14 unsigned char *loc = dev_kernel(myshared, 5);
(cuda-gdb) print &(myshared[0])
$1 = (#shared unsigned char *) 0x8 ""
      ^
      |
      cuda-gdb is telling you that this pointer is defined in a __shared__ statement, and therefore its storage is implicit and it is unmodifiable.
(cuda-gdb) print &(u_array)
$2 = (#generic unsigned char * #shared *) 0x0
      ^                        ^
      |                        u_array is stored in shared memory.
      u_array is a generic pointer, meaning it can point to anything.
(cuda-gdb) step
dev_kernel(unsigned char * #generic, int) (array_sh=0x1000008 "", params=5)
at t249.cu:6
6 return array_sh + params;
(cuda-gdb) print array_sh
$3 = (#generic unsigned char * #register) 0x1000008 ""
      ^                        ^
      |                        array_sh is stored in a register.
      array_sh is a generic pointer, it can point to anything.
(cuda-gdb) print u_array
No symbol "u_array" in current context.
(note that I can't access u_array from inside the __device__ function, so I don't understand your comment there.)
(cuda-gdb) step
mykernel<<<(1,1,1),(1,1,1)>>> () at t249.cu:15
15 u_array = loc;
(cuda-gdb) step
16 printf("val = %d\n", *loc);
(cuda-gdb) print u_array
$4 = (#generic unsigned char * #shared) 0x100000d ......
      ^                        ^
      |                        u_array is stored in shared memory.
      u_array is a generic pointer, it can point to anything.
(cuda-gdb)
Although you haven't provided it, I am assuming your definition of u_array is similar to mine, based on the cuda-gdb output you are getting.
Note that the indicators like #shared are not telling you what kind of memory a pointer is pointing to, they are telling you either what kind of pointer it is (defined implicitly in a __shared__ statement) or else where it is stored (in shared memory).
If this doesn't sort out your questions, please provide a complete example, along with complete cuda-gdb session output, just as I have.

CUDA kernel call in a simple sample

It's the first parallel code example from CUDA by Example.
Can anyone explain the kernel call <<< N , 1 >>> to me?
This is the code, with the important points:
#define N 10
__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // this thread handles the data at its thread id
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    // allocate the memory on the GPU
    // fill the arrays 'a' and 'b' on the CPU
    // copy the arrays 'a' and 'b' to the GPU
    add<<<N,1>>>( dev_a, dev_b, dev_c );
    // copy the array 'c' back from the GPU to the CPU
    // display the results
    // free the memory allocated on the GPU
    return 0;
}
Why does it use <<< N , 1 >>>, meaning N blocks with 1 thread in each block? We could instead write <<< 1 , N >>> and use 1 block with N threads in it, which seems as though it would be better optimized.
For this little example, there is no particular reason (as Bart already told you in the comments). But for a larger, more realistic example you should always keep in mind that the number of threads per block is limited. That is, if you use N = 10000, you could not use <<<1,N>>> anymore, but <<<N,1>>> would still work.
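For the large-N case, a common pattern (a hedged sketch, not taken from the book) is to launch many fixed-size blocks and compute a global index inside the kernel:
#define N 100000
#define THREADS_PER_BLOCK 256

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index across all blocks
    if (tid < N)                                       // guard: the last block may be only partially used
        c[tid] = a[tid] + b[tid];
}

// enough blocks to cover N elements, rounding up:
// add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>( dev_a, dev_b, dev_c );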