In CUDA, what instruction is used to load data from global memory to shared memory?

I am currently studying CUDA and have learned that there is global memory and shared memory.
I have checked the CUDA documentation and found that GPUs can access shared memory and global memory using the ld.shared/st.shared and ld.global/st.global instructions, respectively.
What I am curious about is: what instruction is used to load data from global memory into shared memory?
It would be great if someone could let me know.
Thanks!
__global__ void my_function(int* global_mem)
{
    __shared__ int shared_mem[10];
    for (int i = 0; i < 10; i++) {
        shared_mem[i] = global_mem[i]; // What instruction is used for this load operation?
    }
}

In the case of
__shared__ float smem[2];
smem[0] = global_memory[0];
the operation is (in SASS):
LDG Rx, [Ry]
STS [Rz], Rx
To expand a bit more, read https://forums.developer.nvidia.com/t/whats-different-between-ld-and-ldg-load-from-generic-memory-vs-load-from-global-memory/40856/2
Summary:

    instruction | meaning
    ------------+---------------------------------------------------------
    LDS         | load from shared space
    LDC         | load from constant space
    LDG         | load from global space
    LD          | generic load - space deduced from the supplied address
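As a rough illustration (a hypothetical kernel, not from the original answer), each access below is expected to compile to the corresponding SASS load/store; exact code generation depends on the architecture and compiler version, and constant accesses in particular are often folded into a c[bank][offset] operand rather than an explicit LDC:

__constant__ int const_val;

__global__ void load_examples(const int* global_in, int* global_out)
{
    // launch with 32 threads per block
    __shared__ int smem[32];
    smem[threadIdx.x] = global_in[threadIdx.x];  // LDG (global load) + STS (shared store)
    __syncthreads();
    int s = smem[31 - threadIdx.x];              // LDS (shared load)
    global_out[threadIdx.x] = s + const_val;     // constant read + STG (global store)
}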

With NVIDIA's Ampere microarchitecture, pipelining functionality was introduced to improve, among other things, the performance of copying from global to shared memory. Thus, we no longer need two instructions per element loaded, which kept the thread busier than it needed to be. Instead, you could write something like this:
#define NO_ZFILL 0

// ...

for (int i = 0; i < 10; i++) {
    __pipeline_memcpy_async(&shared_mem[i], &global_mem[i], sizeof(int), NO_ZFILL);
}
__pipeline_commit();
__pipeline_wait_prior(0); // wait for the first committed batch of pipeline ops
And the resulting PTX code looks like this:
{
ld.param.u64 %rd1, [my_function(int*)_param_0];
mov.u32 %r1, my_function(int*)::shared_mem;
cp.async.ca.shared.global [%r1], [%rd1], 4, 4;
add.s64 %rd2, %rd1, 4;
add.s32 %r2, %r1, 4;
cp.async.ca.shared.global [%r2], [%rd2], 4, 4;
add.s64 %rd3, %rd1, 8;
add.s32 %r3, %r1, 8;
cp.async.ca.shared.global [%r3], [%rd3], 4, 4;
add.s64 %rd4, %rd1, 12;
add.s32 %r4, %r1, 12;
cp.async.ca.shared.global [%r4], [%rd4], 4, 4;
add.s64 %rd5, %rd1, 16;
add.s32 %r5, %r1, 16;
cp.async.ca.shared.global [%r5], [%rd5], 4, 4;
add.s64 %rd6, %rd1, 20;
add.s32 %r6, %r1, 20;
cp.async.ca.shared.global [%r6], [%rd6], 4, 4;
add.s64 %rd7, %rd1, 24;
add.s32 %r7, %r1, 24;
cp.async.ca.shared.global [%r7], [%rd7], 4, 4;
add.s64 %rd8, %rd1, 28;
add.s32 %r8, %r1, 28;
cp.async.ca.shared.global [%r8], [%rd8], 4, 4;
add.s64 %rd9, %rd1, 32;
add.s32 %r9, %r1, 32;
cp.async.ca.shared.global [%r9], [%rd9], 4, 4;
add.s64 %rd10, %rd1, 36;
add.s32 %r10, %r1, 36;
cp.async.ca.shared.global [%r10], [%rd10], 4, 4;
cp.async.commit_group;
cp.async.wait_group 0;
ret;
}
Notes about the PTX:
The key instructions are those beginning with cp.async; the add instructions are address computations.
Compiled with target virtual architecture sm_80.
The compiler has unrolled the loop (although it didn't have to).
This still needs to be compiled further into actual assembly instructions.
For more details, see Section B.27.3 Pipeline Primitives in the CUDA Programming Guide.
There is a fancier, but more opaque, way of doing this using the "cooperative groups" C++ interface bundled with the CUDA Toolkit.
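For reference, a minimal sketch of that cooperative-groups form, based on the memcpy_async / wait collectives in the cooperative_groups headers (treat the exact overload as something to check against your CUDA version):

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void my_function(int* global_mem)
{
    __shared__ int shared_mem[10];
    cg::thread_block block = cg::this_thread_block();

    // All threads in the block cooperate in the copy.
    cg::memcpy_async(block, shared_mem, global_mem, sizeof(int) * 10);
    cg::wait(block);   // block until the asynchronous copy has completed

    // ... use shared_mem ...
}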

As Kryrene already said, loading data from global memory to shared memory typically requires two instructions.
However, GPUs since the Ampere architecture (CC >= 8.0) also support loading data directly from global memory into shared memory with a single instruction which issues an asynchronous copy. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#async_data_operations
In PTX this could be cp.async.ca.shared.global; in SASS it could be LDGSTS.
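A minimal sketch of that asynchronous copy through the libcu++ cuda::pipeline / cuda::memcpy_async interface described in the linked section (the API usage here is my reading of that section, so verify it against your CUDA version; on pre-Ampere hardware the copy falls back to an ordinary load/store pair):

#include <cuda/pipeline>

__global__ void my_function(int* global_mem)
{
    __shared__ int shared_mem[10];
    auto pipe = cuda::make_pipeline();   // single-thread-scope pipeline

    pipe.producer_acquire();
    for (int i = 0; i < 10; i++) {
        cuda::memcpy_async(&shared_mem[i], &global_mem[i], sizeof(int), pipe);
    }
    pipe.producer_commit();
    pipe.consumer_wait();      // copies in the committed stage have completed

    // ... use shared_mem ...

    pipe.consumer_release();
}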

Related

Why does PTX shows 32 bit load operation for a 128 bit struct assignment?

I defined a custom 128-bit struct like this:
typedef struct dtype {
    int val;
    int temp2;
    int temp3;
    int temp4;
} dtype;
Then I performed an assignment:
dtype temp = h_a[i]; // where h_a is dtype*
I was expecting a 128-bit load, but instead the PTX shows what appears to be a 32-bit load operation:
mul.wide.s32 %rd4, %r18, 16;
add.s64 %rd5, %rd1, %rd4;
ld.global.u32 %r17, [%rd5];
Shouldn't it appear like ld.global.v4.u32 %r17, [%rd5];
Where am I going wrong?
The compiler will only emit vectorized load or store instructions if the memory is guaranteed to be aligned to the size of the type, and all the elements of the type are used (otherwise the vector instruction will be optimized away to a scalar instruction to save bandwidth).
If you do this:
struct dtype {
    int val;
    int temp2;
    int temp3;
    int temp4;
};

struct __align__(16) adtype {
    int val;
    int temp2;
    int temp3;
    int temp4;
};

__global__
void kernel(adtype* x, dtype* y)
{
    adtype lx = x[threadIdx.x];
    dtype ly;

    ly.val = lx.temp4;
    ly.temp2 = lx.temp3;
    ly.temp3 = lx.val;
    ly.temp4 = lx.temp2;

    y[threadIdx.x] = ly;
}
you should get something like this:
.visible .entry _Z6kernelP6adtypeP5dtype(
.param .u64 _Z6kernelP6adtypeP5dtype_param_0,
.param .u64 _Z6kernelP6adtypeP5dtype_param_1
)
{
ld.param.u64 %rd1, [_Z6kernelP6adtypeP5dtype_param_0];
ld.param.u64 %rd2, [_Z6kernelP6adtypeP5dtype_param_1];
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
mov.u32 %r1, %tid.x;
mul.wide.u32 %rd5, %r1, 16;
add.s64 %rd6, %rd4, %rd5;
ld.global.v4.u32 {%r2, %r3, %r4, %r5}, [%rd6];
add.s64 %rd7, %rd3, %rd5;
st.global.u32 [%rd7], %r5;
st.global.u32 [%rd7+4], %r4;
st.global.u32 [%rd7+8], %r2;
st.global.u32 [%rd7+12], %r3;
ret;
}
Here you can clearly see the vectorized load for the aligned type, and the non-vectorized store for the non-aligned type. If the kernel is changed so that the store is to the aligned version:
__global__
void kernel(adtype* x, dtype* y)
{
    dtype ly = y[threadIdx.x];
    adtype lx;

    lx.val = ly.temp4;
    lx.temp2 = ly.temp3;
    lx.temp3 = ly.val;
    lx.temp4 = ly.temp2;

    x[threadIdx.x] = lx;
}
you will get this:
.visible .entry _Z6kernelP6adtypeP5dtype(
.param .u64 _Z6kernelP6adtypeP5dtype_param_0,
.param .u64 _Z6kernelP6adtypeP5dtype_param_1
)
{
ld.param.u64 %rd1, [_Z6kernelP6adtypeP5dtype_param_0];
ld.param.u64 %rd2, [_Z6kernelP6adtypeP5dtype_param_1];
cvta.to.global.u64 %rd3, %rd1;
cvta.to.global.u64 %rd4, %rd2;
mov.u32 %r1, %tid.x;
mul.wide.u32 %rd5, %r1, 16;
add.s64 %rd6, %rd4, %rd5;
add.s64 %rd7, %rd3, %rd5;
ld.global.u32 %r2, [%rd6+12];
ld.global.u32 %r3, [%rd6+8];
ld.global.u32 %r4, [%rd6+4];
ld.global.u32 %r5, [%rd6];
st.global.v4.u32 [%rd7], {%r2, %r3, %r5, %r4};
ret;
}
Now the aligned type is stored with a vectorized instruction.
[ All code compiled for sm_53 using the default Godbolt toolchain (10.2) ]
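As an aside (not part of the original answer), another common way to get the 128-bit access is to go through the built-in int4 vector type, which carries the required 16-byte alignment by definition. A minimal sketch, assuming the pointers really are 16-byte aligned (as cudaMalloc allocations are):

__global__ void kernel_vec(const dtype* x, dtype* y)
{
    // Reinterpret as the 16-byte aligned built-in vector type.
    int4 v = reinterpret_cast<const int4*>(x)[threadIdx.x];   // ld.global.v4.u32 expected
    reinterpret_cast<int4*>(y)[threadIdx.x] = v;              // st.global.v4.u32 expected
}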
I am adding an additional point in case anyone happens to be facing the same issue.
{
    dtype temp = h_a[i];   // loading only the data actually needed
    sum.val += temp.val;
}
I followed the steps given in the above answer, but I was not getting a 128-bit load, even though the approach above is absolutely correct.
The thing is that the compiler saw that, out of the 4 fields of the struct, I was using only 1 field in an addition. So it very smartly loaded only the chunk I needed, and no matter what I tried, I always got a 32-bit load.
{
    dtype temp = h_a[i];
    sum.val += temp.val;
    sum.temp2 += temp.temp2;
    sum.temp3 += temp.temp3;
    sum.temp4 += temp.temp4;
}
A small change: now I am using all the fields, so the compiler loads all the fields!
With that, using the approach given in the above answer with __align__(16), I got the expected 128-bit load.
Although this may be very obvious to many people, I am not a veteran coder; I only write code in certain places to work out my projects. This was seriously insightful for me, and I hope someone else benefits from it as well!
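To make this concrete, a minimal self-contained sketch of the working case (hypothetical kernel, reusing the __align__(16) struct from the answer above): because all four fields feed into the sums, the compiler has no reason to shrink the 128-bit load.

struct __align__(16) adtype {
    int val;
    int temp2;
    int temp3;
    int temp4;
};

__global__ void sum_kernel(const adtype* h_a, adtype* out, int n)
{
    adtype sum = {0, 0, 0, 0};
    for (int i = 0; i < n; ++i) {
        adtype temp = h_a[i];        // ld.global.v4.u32 expected: every field is used below
        sum.val   += temp.val;
        sum.temp2 += temp.temp2;
        sum.temp3 += temp.temp3;
        sum.temp4 += temp.temp4;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}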

Does PTX actually have a 64-bit warp shuffle instruction?

I noticed in the docs for __shfl_sync() and its relatives that 64-bit datatypes (long, double) are supported.
Does this mean that hardware/PTX natively supports 64-bit warp shuffles, or are these broken down into a pair of 32-bit shuffles when the code is compiled?
Currently, there is no 64-bit shuffle instruction in PTX. The basic register unit in all current CUDA GPUs is 32 bits. 64-bit quantities do not get corresponding 64-bit registers; they occupy a pair of 32-bit registers instead. The warp shuffle operation at the machine level operates on 32-bit registers.
The compiler processes 64-bit operands to the shfl intrinsics for CUDA C++ by emitting 2 PTX (or SASS) instructions. This is readily discoverable/confirmable using the CUDA binary utilities.
Example:
$ cat t45.cu
typedef double mt;
__global__ void k(mt *d){
mt x = d[threadIdx.x];
x = __shfl_sync(0xFFFFFFFF, x, threadIdx.x+1);
d[threadIdx.x] = x;
}
$ nvcc -c t45.cu
$ cuobjdump -ptx t45.o
Fatbin elf code:
================
arch = sm_30
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
Fatbin ptx code:
================
arch = sm_30
code version = [6,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
.version 6.2
.target sm_30
.address_size 64
.visible .entry _Z1kPd(
.param .u64 _Z1kPd_param_0
)
{
.reg .pred %p<3>;
.reg .b32 %r<9>;
.reg .f64 %fd<3>;
.reg .b64 %rd<5>;
ld.param.u64 %rd1, [_Z1kPd_param_0];
cvta.to.global.u64 %rd2, %rd1;
mov.u32 %r5, %tid.x;
mul.wide.u32 %rd3, %r5, 8;
add.s64 %rd4, %rd2, %rd3;
ld.global.f64 %fd1, [%rd4];
add.s32 %r6, %r5, 1;
mov.b64 {%r1,%r2}, %fd1;
mov.u32 %r7, 31;
mov.u32 %r8, -1;
shfl.sync.idx.b32 %r4|%p1, %r2, %r6, %r7, %r8;
shfl.sync.idx.b32 %r3|%p2, %r1, %r6, %r7, %r8;
mov.b64 %fd2, {%r3,%r4};
st.global.f64 [%rd4], %fd2;
ret;
}
$
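To see the same splitting at the source level, here is a minimal sketch (a hypothetical helper, not from the answer) that does by hand what the compiler does for a double: shuffle the two 32-bit halves separately and reassemble them.

__device__ double shfl_sync_double(unsigned mask, double x, int srcLane)
{
    int lo = __double2loint(x);             // lower 32 bits of the double
    int hi = __double2hiint(x);             // upper 32 bits of the double
    lo = __shfl_sync(mask, lo, srcLane);    // one 32-bit shuffle per half
    hi = __shfl_sync(mask, hi, srcLane);
    return __hiloint2double(hi, lo);        // reassemble the 64-bit value
}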

purposely causing bank conflicts for shared memory on CUDA device

It is a mystery to me how shared memory on CUDA devices works. I was curious to count the threads having access to the same shared memory. For this I wrote a simple program:
#include <cuda_runtime.h>
#include <stdio.h>

#define nblc 13
#define nthr 1024

//------------------------#device--------------------
__device__ int inwarpD[nblc];

__global__ void kernel(){
    __shared__ int mywarp;

    mywarp = 0;
    for (int i = 0; i < 5; i++) mywarp += (10000 * threadIdx.x + 1);
    __syncthreads();

    inwarpD[blockIdx.x] = mywarp;
}
//------------------------#host-----------------------
int main(int argc, char **argv){
    int inwarpH[nblc];
    cudaSetDevice(2);
    kernel<<<nblc, nthr>>>();
    cudaMemcpyFromSymbol(inwarpH, inwarpD, nblc*sizeof(int), 0, cudaMemcpyDeviceToHost);
    for (int i = 0; i < nblc; i++) printf("%i : %i\n", i, inwarpH[i]);
}
and ran it on a K80 GPU. Since several threads have access to the same shared memory variable, I was expecting this variable to be updated 5*nthr times, albeit not in the same cycle, because of bank conflicts. However, the output indicates that the mywarp shared variable was updated only 5 times. In each block a different thread accomplished this task:
0 : 35150005
1 : 38350005
2 : 44750005
3 : 38350005
4 : 51150005
5 : 38350005
6 : 38350005
7 : 38350005
8 : 51150005
9 : 44750005
10 : 51150005
11 : 38350005
12 : 38350005
Instead, I was expecting
523776*10000+5*1024=5237765120
for each block. Can someone kindly explain to me where my understanding of shared memory fails? I would also like to know how it would be possible for all threads in one block to access (update) the same shared variable. I know it is not possible in the same MP cycle; serialisation is fine for me because it is going to be a rare event.
Let's walk through the PTX that it generates.
//Declare some registers
.reg .s32 %r<5>;
.reg .s64 %rd<4>;
// demoted variable
.shared .align 4 .u32 _Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp;
//load tid in register r1
mov.u32 %r1, %tid.x;
// compute tid*50000+5 and store in r2
mad.lo.s32 %r2, %r1, 50000, 5;
//store result in shared memory
st.shared.u32 [_Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp], %r2;
// synchronize
bar.sync 0;
//load from shared memory and store in r3
ld.shared.u32 %r3, [_Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp];
mov.u32 %r4, %ctaid.x;
mul.wide.u32 %rd1, %r4, 4;
mov.u64 %rd2, inwarpD;
add.s64 %rd3, %rd2, %rd1;
//store r3 in global memory
st.global.u32 [%rd3], %r3;
ret;
So basically
for (int i = 0; i < 5; i++)
    mywarp += (10000 * threadIdx.x + 1);
is being optimized down to
mywarp = 50000 * threadIdx.x + 5;
so you're not experiencing a bank conflict. You are experiencing a race condition.
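To answer the second part of the question, a minimal sketch (same kernel shape as in the question) of how every thread in a block can update the same shared variable: atomicAdd serializes the conflicting updates, which the question says is acceptable. Note that the sum the question expects (5237765120) overflows a 32-bit int, so a wider type would be needed to actually see that value.

__global__ void kernel(){
    __shared__ int mywarp;

    if (threadIdx.x == 0) mywarp = 0;    // initialize once per block
    __syncthreads();

    for (int i = 0; i < 5; i++)
        atomicAdd(&mywarp, 10000 * threadIdx.x + 1);   // serialized, race-free updates

    __syncthreads();
    if (threadIdx.x == 0)
        inwarpD[blockIdx.x] = mywarp;    // inwarpD as declared in the question
}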

Is there an equivalent to memcpy() that works inside a CUDA kernel?

I'm trying to break apart and reshape the structure of an array asynchronously using a CUDA kernel. memcpy() doesn't work inside the kernel, and neither does cudaMemcpy()*; I'm at a loss.
Can anyone tell me the preferred method for copying memory from within the CUDA kernel?
It is worth noting, cudaMemcpy(void *to, void *from, size, cudaMemcpyDeviceToDevice) will NOT work for what I am trying to do, because it can only be called from outside of the kernel and does not execute asynchronously.
Yes, there is an equivalent to memcpy that works inside CUDA kernels. It is called memcpy. As an example:
__global__ void kernel(int **in, int **out, int len, int N)
{
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    for (; idx < N; idx += gridDim.x*blockDim.x)
        memcpy(out[idx], in[idx], sizeof(int)*len);
}
which compiles without error like this:
$ nvcc -Xptxas="-v" -arch=sm_20 -c memcpy.cu
ptxas info : Compiling entry function '_Z6kernelPPiS0_ii' for 'sm_20'
ptxas info : Function properties for _Z6kernelPPiS0_ii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 11 registers, 48 bytes cmem[0]
and emits PTX:
.version 3.0
.target sm_20
.address_size 32
.file 1 "/tmp/tmpxft_00000407_00000000-9_memcpy.cpp3.i"
.file 2 "memcpy.cu"
.file 3 "/usr/local/cuda/nvvm/ci_include.h"
.entry _Z6kernelPPiS0_ii(
.param .u32 _Z6kernelPPiS0_ii_param_0,
.param .u32 _Z6kernelPPiS0_ii_param_1,
.param .u32 _Z6kernelPPiS0_ii_param_2,
.param .u32 _Z6kernelPPiS0_ii_param_3
)
{
.reg .pred %p<4>;
.reg .s32 %r<32>;
.reg .s16 %rc<2>;
ld.param.u32 %r15, [_Z6kernelPPiS0_ii_param_0];
ld.param.u32 %r16, [_Z6kernelPPiS0_ii_param_1];
ld.param.u32 %r2, [_Z6kernelPPiS0_ii_param_3];
cvta.to.global.u32 %r3, %r15;
cvta.to.global.u32 %r4, %r16;
.loc 2 4 1
mov.u32 %r5, %ntid.x;
mov.u32 %r17, %ctaid.x;
mov.u32 %r18, %tid.x;
mad.lo.s32 %r30, %r5, %r17, %r18;
.loc 2 6 1
setp.ge.s32 %p1, %r30, %r2;
#%p1 bra BB0_5;
ld.param.u32 %r26, [_Z6kernelPPiS0_ii_param_2];
shl.b32 %r7, %r26, 2;
.loc 2 6 54
mov.u32 %r19, %nctaid.x;
.loc 2 4 1
mov.u32 %r29, %ntid.x;
.loc 2 6 54
mul.lo.s32 %r8, %r29, %r19;
BB0_2:
.loc 2 7 1
shl.b32 %r21, %r30, 2;
add.s32 %r22, %r4, %r21;
ld.global.u32 %r11, [%r22];
add.s32 %r23, %r3, %r21;
ld.global.u32 %r10, [%r23];
mov.u32 %r31, 0;
BB0_3:
add.s32 %r24, %r10, %r31;
ld.u8 %rc1, [%r24];
add.s32 %r25, %r11, %r31;
st.u8 [%r25], %rc1;
add.s32 %r31, %r31, 1;
setp.lt.u32 %p2, %r31, %r7;
#%p2 bra BB0_3;
.loc 2 6 54
add.s32 %r30, %r8, %r30;
ld.param.u32 %r27, [_Z6kernelPPiS0_ii_param_3];
.loc 2 6 1
setp.lt.s32 %p3, %r30, %r27;
#%p3 bra BB0_2;
BB0_5:
.loc 2 9 2
ret;
}
The code block at BB0_3 is a byte-sized memcpy loop emitted automagically by the compiler. It might not be a great idea to use it from a performance point of view, but it is fully supported (and has been for a long time on all architectures).
Edited four years later to add that since the device side runtime API was released as part of the CUDA 6 release cycle, it is also possible to directly call something like
cudaMemcpyAsync(void *to, void *from, size, cudaMemcpyDeviceToDevice)
in device code for all architectures which support it (Compute Capability 3.5 and newer hardware using separate compilation and device linking).
In my testing the best answer is to write your own looping copy routine. In my case:
__device__
void devCpyCplx(const thrust::complex<float> *in, thrust::complex<float> *out, int len){
    // Casting for improved loads and stores
    for (int i = 0; i < len/2; ++i) {
        ((float4*) out)[i] = ((float4*) in)[i];
    }
    if (len%2) {
        ((float2*) out)[len-1] = ((float2*) in)[len-1];
    }
}
memcpy works in a kernel but it may be much slower. cudaMemcpyAsync from the host is a valid option.
I needed to partition 800 contiguous vectors of ~33,000 length into 16,500-length pieces in a different buffer, using 1,600 copy calls. Timing with nvvp:
memcpy in kernel: 140 ms
cudaMemcpy DtoD on host: 34 ms
loop copy in kernel: 8.6 ms
@talonmies reports that memcpy copies byte by byte, which is inefficient for loads and stores. I'm still targeting compute 3.0, so I can't test cudaMemcpy on the device.
Edit: Tested on a newer device. The device runtime cudaMemcpyAsync(out, in, bytes, cudaMemcpyDeviceToDevice, 0) is comparable to a good copy loop and better than a bad copy loop. Note that using the device runtime API may require compile changes (sm >= 3.5, separate compilation). Refer to the programming guide and the nvcc docs for compiling.
In short: device memcpy is bad, host cudaMemcpyAsync is okay, and device cudaMemcpyAsync is good.
cudaMemcpy() does indeed run asynchronously but you're right, it can't be executed from within a kernel.
Is the new shape of the array determined based on some calculation? Then you would typically run the same number of threads as there are entries in your array. Each thread would run a calculation to determine the source and destination of a single entry in the array and then copy it there with a single assignment (dst[i] = src[j]). If the new shape of the array is not based on calculations, it might be more efficient to run a series of cudaMemcpy() calls with cudaMemcpyDeviceToDevice from the host.
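As a sketch of that per-thread assignment approach (a hypothetical kernel using a transpose-style reshape as the example index calculation):

__global__ void transpose_reshape(const float* src, float* dst, int n_rows, int n_cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per entry
    if (i < n_rows * n_cols) {
        int row = i / n_cols;
        int col = i % n_cols;
        dst[col * n_rows + row] = src[i];            // single assignment per thread
    }
}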

cuda kernel - registers

float4 variables defined in the kernel should be stored in registers!? I made a simple test. In the first kernel I use registers to optimize memory traffic; in the second I read directly from global memory.
__global__ void kernel(float4 *arg1, float4 *arg2, float4 *arg3)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;

    float4 temp1 = arg2[x];
    float4 temp2 = arg3[x];

    // some computations using temp1 and temp2

    arg2[x] = temp1;
    arg3[x] = temp2;
    arg1[x] = make_float4(temp1.x, temp1.y, temp1.z, temp1.w);
}

__global__ void kernel(float4 *arg1, float4 *arg2, float4 *arg3)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;

    // some computations using a direct access to global memory,
    // for example arg2[x].x

    arg1[x] = make_float4(arg2[x].x, arg2[x].y, arg2[x].z, arg2[x].w);
}
The first kernel is 9-10% faster. The difference is not so big. When can using registers bring more benefits?
Firstly, you can't say what will and won't be in registers solely based on C code. That is certainly not the source of the performance difference between the two codes. In fact, both kernels use registers for the float4 variables, and the code they compile to is almost identical.
First kernel:
ld.param.u64 %rd3, [__cudaparm__Z7kernel0P6float4S0_S0__arg2];
add.u64 %rd4, %rd3, %rd2;
ld.global.v4.f32 {%f1,%f2,%f3,%f4}, [%rd4+0];
.loc 16 21 0
ld.param.u64 %rd5, [__cudaparm__Z7kernel0P6float4S0_S0__arg3];
add.u64 %rd6, %rd5, %rd2;
ld.global.v4.f32 {%f5,%f6,%f7,%f8}, [%rd6+0];
st.global.v4.f32 [%rd4+0], {%f1,%f2,%f3,%f4};
st.global.v4.f32 [%rd6+0], {%f5,%f6,%f7,%f8};
.loc 16 24 0
ld.param.u64 %rd7, [__cudaparm__Z7kernel0P6float4S0_S0__arg1];
add.u64 %rd8, %rd7, %rd2;
st.global.v4.f32 [%rd8+0], {%f1,%f2,%f3,%f4};
second kernel:
ld.param.u64 %rd3, [__cudaparm__Z7kernel1P6float4S0_S0__arg2];
add.u64 %rd4, %rd3, %rd2;
ld.global.v4.f32 {%f1,%f2,%f3,%f4}, [%rd4+0];
ld.param.u64 %rd5, [__cudaparm__Z7kernel1P6float4S0_S0__arg1];
add.u64 %rd6, %rd5, %rd2;
st.global.v4.f32 [%rd6+0], {%f1,%f2,%f3,%f4};
If there really is a performance difference between them, it is probably because the first kernel has more opportunity for instruction-level parallelism than the second. But that is just a wild guess, without knowing much more about how the two were benchmarked.