CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error - cuda

I'm using inline PTX ld.shared to load data from shared memory:
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory
float Csub = 0;
As[TY][TX] = A[a + wA * TY + TX]; //load data from global memory to shared memory
__syncthreads();
float t;
asm("ld.shared.f32 %0, [%1];" :"=f"(t) : "r"((int)&As[TY][k])); //load data from shared memory into t
Csub += t;
__syncthreads();
But runs into error:
CUDA error at C:/ProgramData/NVIDIA Corporation/CUDA Samples/v11.2/0_Simple/matrixMul_mine/matrixMul.cu:196 code=700(cudaErrorIllegalAddress) "cudaStreamSynchronize(stream)"
I dump SASS and found the LDS happens even earlier than LDG and the 2 bar.sync. It seems that the compiler lose the track of data dependency;
So my question is:
Is there anything wrong in my inline PTX that leads to cudaErrorIllegalAddress?
Does inline PTX disturb data dependency?

Yichen is right.
There are two types of addressing: ld. or ld.statespace.
If ld. only, the address should be a generic address. The generic address, to my understanding(limited), is the CUDA-C pointer value, like &As[TY][k] in your code.
If ld.statespace, the address should be the address in the state space.
I think if you use ld.f32 instead of ld.shared.f32, your code should be okay. BTW, I don't think you can use the generic address in 32-bit data width, which can truncate the generic address into a wrong value.
Or you can convert the generic address to the shared space address. Here is the cutlass's conversion code:
".reg .u32 smem_ptr32;\n\t"
".reg .u64 smem_ptr64; cvta.to.shared.u64 smem_ptr64, %1; cvt.u32.u64 smem_ptr32, smem_ptr64; \n\t"
then use smem_ptr32 instead of [%1]
"ld.shared.f32 %0, [smem_ptr32];"
As PTX ISA says, this address can be either 32-bit or 64-bit. I think it's not necessary to convert the 64-bit ptr to a 32-bit ptr. Using smem_ptr64 shall be okay.
Here is what the shared memory address looks like:
CUDA-C pointer(Generic):1526743433216 + 1024
smem_ptr64(Shared space):0 + 1024

Related

How to handle a device variable using OpenACC

I'm trying to optimize my code which I accelerated using basically OpenACC only.
Is it a good approach to insert CUDA such as in the example that follows?
In this case, u_device and v_device are used by the device only. Using cudaMalloc assures me that I allocate memory on the device memory and not on the host memory too.
int size = NVAR * sizeof(double);
// Declare pointers that will point to the memory allocated on the device.
double* v_device;
double* u_device;
// Allocate memory on the device
cudaMalloc(&v_device, size);
cudaMalloc(&u_device, size);
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
#pragma acc loop
for (nv = 0; nv < NVAR; nv++) v_device[nv] = V[nv][k][j][i];
PrimToCons (v_device, u_device);
#pragma acc loop
for (nv = 0; nv < NVAR; nv++) U[k][j][i][nv] = u_device[nv];
}
cudaFree(u_device);
cudaFree(v_device);
Before I would have used OpenACC and written something like this:
double* v_device = (double*)malloc(size);
double* u_device = (double*)malloc(size);
#pragma acc enter data create(u_device[:size],v[:device])
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
...
}
#pragma acc exit data delete(u_device[:size],v[:device])
Is there a way with OpenACC to avoid host memory allocation?
Another doubt I have regarding cudaMalloc is the possibility to put the routine inside the kernel, in order to make the arrays private:
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
double* v_device;
double* u_device;
// Allocate memory on the device
cudaMalloc(&v_device, size);
cudaMalloc(&u_device, size);
.
.
.
cudaFree(u_device);
cudaFree(v_device);
}
Writing in this way I get the error:
182, Accelerator restriction: call to 'cudaMalloc' with no acc routine information
Is there a way with OpenACC to avoid host memory allocation?
You can use cudaMalloc, but for pure OpenACC, you'd use "acc_malloc" and "acc_free". For example: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/acc_malloc.c
Note the use the "deviceptr" clause which indicates that the pointer is a device pointer. Though here, you're wanting to privatize these arrays so you can keep the private.
I've never used a device pointer in a private clause, but just tried and it seems to work. Which make sense since all the compiler really needs is the size and type of the private array to make the private copies. In this case since it's on the gang loop, the compiler will attempt to put the private arrays in shared memory, assuming they aren't too big to fit. I'd recommend using the triplet notation for the array, i.e. "private(v_device[:NVAR],...) so the compiler will know the size.
Though I'm not sure there's much of an advantage to using device arrays here. The device memory you're allocating isn't going to be used taking up space on the device. Device memory is often much smaller than host memory, so if you do need to waste space, probably better this be on the host. Plus having to use acc_malloc or cudaMalloc limits portability of the code. Not that there isn't cases where using device only memory is beneficial, I just don't think it is for this case.
Note you can call "malloc" within device code, but it's not recommended. Malloc's get serialized causing performance issues, but also the default heap is relatively small which can lead to heap overflows. Granted, this can be increased by either calling cudaDeviceLimits or via the environment variable "NV_ACC_CUDA_HEAPSIZE".

Can I obtain the amount of allocated dynamic shared memory from within a kernel?

On the host side, I can save the amount of dynamic shared memory I intend to launch a kernel with, and use it. I can even pass that as an argument to the kernel. But - is there a way to get it directly from device code, without help from the host side? That is, have the code for a kernel determine, as it runs, how much dynamic shared memory it has available?
Yes, there's a special register holding that value. named %dynamic_smem_size. You can obtain this register's value in your CUDA C/C++ code by wrapping some inline PTX with a getter function:
__device__ unsigned dynamic_smem_size()
{
unsigned ret;
asm volatile ("mov.u32 %0, %dynamic_smem_size;" : "=r"(ret));
return ret;
}
You can similarly obtain the total size of allocated shared memory (static + dynamic) from the register %total_smem_size.

CUDA shared memory under the hood questions

I have several questions regarding to CUDA shared memory.
First, as mentioned in this post, shared memory may declare in two different ways:
Either dynamically shared memory allocated, like the following
// Lunch the kernel
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
This may use inside a kernel as mention:
extern __shared__ int s[];
Or static shared memory, which can use in kernel call like the following:
__shared__ int s[64];
Both are use for different reasons, however which one is better and why ?
Second, I'm running a multi blocks, 256 threads per block kernel. I'm using static shared memory in global and device kernels, both of them uses shared memory. An example is given:
__global__ void startKernel(float* p_d_array)
{
__shared double matA[3*3];
float a1 =0 ;
float a2 = 0;
float a3 = 0;
float b = p_d_array[threadidx.x];
a1 += reduce( b, threadidx.x);
a2 += reduce( b, threadidx.x);
a3 += reduce( b, threadidx.x);
// continue...
}
__device__ reduce ( float data , unsigned int tid)
{
__shared__ float data[256];
// do reduce ...
}
I'd like to know how the shared memory is allocated in such case. I presume each block receive its own shared memory.
What's happening when block # 0 goes into reduce function?
Does the shared memory is allocated in advance to the function call?
I call three different reduce device function, in such case, theoretically in block # 0 , threads # [0,127] may still execute ("delayed due hard word") on the first reduce call, while threads # [128,255] may operate on the second reduce call. In this case, I'd like to know if both reduce function are using the same shared memory?
Even though if they are called from two different function calls ?
On the other hand, Is that possible that a single block may allocated 3*256*sizeof(float) shared memory for both functions calls? That's seems superfluous in CUDA manners, but I still want to know how CUDA operates in such case.
Third, is that possible to gain higher performance in shared memory due to compiler optimization using
const float* p_shared ;
or restrict keyword after the data assignment section?
AFAIR, there is little difference whether you request shared memory "dynamically" or "statically" - in either case it's just a kernel launch parameter be it set by your code or by code generated by the compiler.
Re: 2nd, compiler will sum the shared memory requirement from the kernel function and functions called by kernel.

Declaring Variables in a CUDA kernel

Say you declare a new variable in a CUDA kernel and then use it in multiple threads, like:
__global__ void kernel(float* delt, float* deltb) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
float a;
a = delt[i] + deltb[i];
a += 1;
}
and the kernel call looks something like below, with multiple threads and blocks:
int threads = 200;
uint3 blocks = make_uint3(200,1,1);
kernel<<<blocks,threads>>>(d_delt, d_deltb);
Is "a" stored on the stack?
Is a new "a" created for each thread when they are initialized?
Or will each thread independently access "a" at an unknown time, potentially messing up the algorithm?
Any variable (scalar or array) declared inside a kernel function, without an extern specifier, is local to each thread, that is each thread has its own "copy" of that variable, no data race among threads will occur!
Compiler chooses whether local variables will reside on registers or in local memory (actually global memory), depending on transformations and optimizations performed by the compiler.
Further details on which variables go on local memory can be found in the NVIDIA CUDA user guide, chapter 5.3.2.2
None of the above. The CUDA compiler is smart enough and aggressive enough with optimisations that it can detect that a is unused and the complete code can be optimised away.You can confirm this by compiling the kernel with -Xptxas=-v as an option and look at the resource count, which should be basically no registers and no local memory or heap.
In a less trivial example, a would probably be stored in a per thread register, or in per thread local memory, which is off-die DRAM.

is there a better and a faster way to copy from CPU memory to GPU using thrust?

Recently I have been using thrust a lot. I have noticed that in order to use thrust, one must always copy the data from the cpu memory to the gpu memory.
Let's see the following example :
int foo(int *foo)
{
host_vector<int> m(foo, foo+ 100000);
device_vector<int> s = m;
}
I'm not quite sure how the host_vector constructor works, but it seems like I'm copying the initial data, coming from *foo, twice - once to the host_vector when it is initialized, and another time when device_vector is initialized. Is there a better way of copying from cpu to gpu without making an intermediate data copies? I know I can use device_ptras a wrapper, but that still doesn't fix my problem.
thanks!
One of device_vector's constructors takes a range of elements specified by two iterators. It's smart enough to understand the raw pointer in your example, so you can construct a device_vector directly and avoid the temporary host_vector:
void my_function_taking_host_ptr(int *raw_ptr, size_t n)
{
// device_vector assumes raw_ptrs point to system memory
thrust::device_vector<int> vec(raw_ptr, raw_ptr + n);
...
}
If your raw pointer points to CUDA memory, introduce a device_ptr:
void my_function_taking_cuda_ptr(int *raw_ptr, size_t n)
{
// wrap raw_ptr before passing to device_vector
thrust::device_ptr<int> d_ptr(raw_ptr);
thrust::device_vector<int> vec(d_ptr, d_ptr + n);
...
}
Using a device_ptr doesn't allocate any storage; it just encodes the location of the pointer in the type system.