Does PTX actually have a 64-bit warp shuffle instruction?

Does PTX actually have a 64-bit warp shuffle instruction? - cuda

I noticed that in the docs for __shfl_sync() and it's relatives that 64-bit datatypes (long, double) are supported.
Does this mean that hardware/PTX natively supports 64-bit warp shuffles, or are these broken down into a pair of 32-bit shuffles when the code is compiled?

Currently, there is no 64-bit shuffle instruction in PTX. The basic register unit in all current CUDA GPUs are 32-bits. 64-bit quantities have no corresponding 64-bit registers, but instead occupy a pair of 32-bit registers. The warp shuffle operation at the machine level operates on 32-bit registers.
The compiler processes 64-bit operands to the shfl intrinsics for CUDA C++ by emitting 2 PTX (or SASS) instructions. This is readily discoverable/confirmable using the CUDA binary utilities.
Example:
$ cat t45.cu
typedef double mt;
__global__ void k(mt *d){
mt x = d[threadIdx.x];
x = __shfl_sync(0xFFFFFFFF, x, threadIdx.x+1);
d[threadIdx.x] = x;
}
$ nvcc -c t45.cu
$ cuobjdump -ptx t45.o
Fatbin elf code:
================
arch = sm_30
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
Fatbin ptx code:
================
arch = sm_30
code version = [6,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
.version 6.2
.target sm_30
.address_size 64
.visible .entry _Z1kPd(
.param .u64 _Z1kPd_param_0
)
{
.reg .pred %p<3>;
.reg .b32 %r<9>;
.reg .f64 %fd<3>;
.reg .b64 %rd<5>;
ld.param.u64 %rd1, [_Z1kPd_param_0];
cvta.to.global.u64 %rd2, %rd1;
mov.u32 %r5, %tid.x;
mul.wide.u32 %rd3, %r5, 8;
add.s64 %rd4, %rd2, %rd3;
ld.global.f64 %fd1, [%rd4];
add.s32 %r6, %r5, 1;
mov.b64 {%r1,%r2}, %fd1;
mov.u32 %r7, 31;
mov.u32 %r8, -1;
shfl.sync.idx.b32 %r4|%p1, %r2, %r6, %r7, %r8;
shfl.sync.idx.b32 %r3|%p2, %r1, %r6, %r7, %r8;
mov.b64 %fd2, {%r3,%r4};
st.global.f64 [%rd4], %fd2;
ret;
}
$

Related

In CUDA, what instruction is used to load data from global memory to shared memory?

I am currently studying CUDA and learned that there are global memory and shared memory.
I have checked the CUDA document and found that GPUs can access shared memory and global memory using ld.shared/st.shared and ld.global/st.global instructions, respectively.
What I am curious about is what instruction is used to load data from global memory to shared memory?
It would be great if someone could let me know.
Thanks!
__global__ void my_function(int* global_mem)
{
__shared__ int shared_mem[10];
for(int i = 0; i < 10; i++) {
shared_mem[i] = global_mem[i]; // What instrcuton is used for this load operation?
}
}

In the case of
__shared__ float smem[2];
smem[0] = global_memory[0];
Then the operation is (in SASS)
LDG Rx, [Ry]
STS [Rz], Rx
To expand a bit more, read https://forums.developer.nvidia.com/t/whats-different-between-ld-and-ldg-load-from-generic-memory-vs-load-from-global-memory/40856/2
Summary:
instruction
meaning
LDS
load from shared space
LDC
load from constant space
LDG
load from global space
LD
generic load - space deduced from the supplied address

With NVIDIA's Ampere microarchitecture, pipelining functionality was introduced to improve, among other things, the performance of copying from global to shared memory. Thus, we no longer need two instructions per element loaded, which keep the thread busier than it needs to be. Instead, you could write something like this:
#define NO_ZFILL 0
// ...
for(int i = 0; i < 10; i++) {
__pipeline_memcpy_async(&shared_mem[i], &global_mem[i], sizeof(int), NO_ZFILL);
}
__pipeline_commit();
__pipeline_wait_prior(0); // wait for the first commited batch of pipeline ops
And the resulting PTX code looks like this:
{
ld.param.u64 %rd1, [my_function(int*)_param_0];
mov.u32 %r1, my_function(int*)::shared_mem;
cp.async.ca.shared.global [%r1], [%rd1], 4, 4;
add.s64 %rd2, %rd1, 4;
add.s32 %r2, %r1, 4;
cp.async.ca.shared.global [%r2], [%rd2], 4, 4;
add.s64 %rd3, %rd1, 8;
add.s32 %r3, %r1, 8;
cp.async.ca.shared.global [%r3], [%rd3], 4, 4;
add.s64 %rd4, %rd1, 12;
add.s32 %r4, %r1, 12;
cp.async.ca.shared.global [%r4], [%rd4], 4, 4;
add.s64 %rd5, %rd1, 16;
add.s32 %r5, %r1, 16;
cp.async.ca.shared.global [%r5], [%rd5], 4, 4;
add.s64 %rd6, %rd1, 20;
add.s32 %r6, %r1, 20;
cp.async.ca.shared.global [%r6], [%rd6], 4, 4;
add.s64 %rd7, %rd1, 24;
add.s32 %r7, %r1, 24;
cp.async.ca.shared.global [%r7], [%rd7], 4, 4;
add.s64 %rd8, %rd1, 28;
add.s32 %r8, %r1, 28;
cp.async.ca.shared.global [%r8], [%rd8], 4, 4;
add.s64 %rd9, %rd1, 32;
add.s32 %r9, %r1, 32;
cp.async.ca.shared.global [%r9], [%rd9], 4, 4;
add.s64 %rd10, %rd1, 36;
add.s32 %r10, %r1, 36;
cp.async.ca.shared.global [%r10], [%rd10], 4, 4;
cp.async.commit_group;
cp.async.wait_group 0;
ret;
}
Notes about the PTX:
The key instructions are those beginning with cp.async, and the the add's are address computations.
Compiled with target virtual architecture sm_80.
The compiler has unrolled the loop (although it didn't have to).
This still needs to be compiled further into actual assembly instructions.
For more details, see Section B.27.3 Pipeline Primitives in the CUDA Programming Guide.
There is a fancier, but more opaque, way of doing this using the "cooperative groups" C++ interface bundled

As Kryrene already said, loading data from global memory to shared memory typically requires two instructions.
However, GPUs since the Ampere architecture (CC >= 8.0) also support loading data directly from global memory into shared memory with a single instruction which issues an asynchronous copy. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#async_data_operations
In ptx this could be cp.async.ca.shared.global , in SASS it could be LDGSTS

Atomic counter for critical section not using atomic bandwidth according to profiler

Even though there are first-thread per-warp atomic access to a shared variable, profiler shows zero bandwidth for atomics:
Minimal reproduction example I could do here:
#include <stdio.h>
#include <cuda_runtime.h>
#define criticalSection(T, ...) {\
__shared__ int ctrBlock; \
if(threadIdx.x==0) \
ctrBlock=0; \
__syncthreads(); \
while(atomicAdd(&ctrBlock,0)<(blockDim.x/32)) \
{ \
if( atomicAdd(&ctrBlock,0) == (threadIdx.x/32) ) \
{ \
int ctr=0; \
while(ctr<32) \
{ \
if( ctr == (threadIdx.x&31) ) \
{ \
{ \
T,##__VA_ARGS__; \
} \
} \
ctr++; \
__syncwarp(); \
} \
if((threadIdx.x&31) == 0)atomicAdd(&ctrBlock,1); \
} \
__syncthreads(); \
} \
}
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
// instead of if(i==0) C[0]=0.0f; initialization
if(i==blockDim.x*blockIdx.x)
C[blockDim.x*blockIdx.x]=0.0f;
__syncthreads();
criticalSection({
if (i < numElements)
{
C[blockDim.x*blockIdx.x] += A[i] + B[i];
}
});
}
int main(void)
{
int numElements = 50000;
size_t size = numElements * sizeof(float);
float *h_A = (float *)malloc(size);
float *h_B = (float *)malloc(size);
float *h_C = (float *)malloc(size);
for (int i = 0; i < numElements; ++i)
{
h_A[i] = i;
h_B[i] = 2*i;
}
float *d_A = NULL;
cudaMalloc((void **)&d_A, size);
float *d_B = NULL;
cudaMalloc((void **)&d_B, size);
float *d_C = NULL;
cudaMalloc((void **)&d_C, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
printf("%g\n",h_C[0]);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
return 0;
}
it correctly outputs the sum of (1 to 255)*3 result(at every starting element per block) everytime it runs.
Question: why would profiler show it is not using atomic bandwidth even though it correctly works?
Kernel completes (196 blocks, 256 threads per block) under 2.4 milliseconds on a 192-core Kepler GPU. Is GPU collecting atomics and converting them to something more efficient at each synchronization point?
It doesn't give any error, I removed error checking for readability.
Changing C array element addition to:
((volatile float *) C)[blockDim.x*blockIdx.x] += A[i] + B[i];
does not change the behavior nor the result.
Using CUDA toolkit 9.2 and driver v396, Ubuntu 16.04, Quadro K420.
Here are compiling commands:
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o vectorAdd.o -c vectorAdd.cu
nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o vectorAdd vectorAdd.o
Ptx output of cuobjdump(sass was more than 50k characters):
.visible .entry _Z9vectorAddPKfS0_Pfi(
.param .u64 _Z9vectorAddPKfS0_Pfi_param_0,
.param .u64 _Z9vectorAddPKfS0_Pfi_param_1,
.param .u64 _Z9vectorAddPKfS0_Pfi_param_2,
.param .u32 _Z9vectorAddPKfS0_Pfi_param_3
)
{
.reg .pred %p<32>;
.reg .f32 %f<41>;
.reg .b32 %r<35>;
.reg .b64 %rd<12>;
.shared .align 4 .u32 _ZZ9vectorAddPKfS0_PfiE8ctrBlock;
ld.param.u64 %rd5, [_Z9vectorAddPKfS0_Pfi_param_0];
ld.param.u64 %rd6, [_Z9vectorAddPKfS0_Pfi_param_1];
ld.param.u64 %rd7, [_Z9vectorAddPKfS0_Pfi_param_2];
ld.param.u32 %r13, [_Z9vectorAddPKfS0_Pfi_param_3];
cvta.to.global.u64 %rd1, %rd7;
mov.u32 %r14, %ctaid.x;
mov.u32 %r1, %ntid.x;
mul.lo.s32 %r2, %r14, %r1;
mov.u32 %r3, %tid.x;
add.s32 %r4, %r2, %r3;
setp.ne.s32 %p8, %r4, 0;
#%p8 bra BB0_2;
mov.u32 %r15, 0;
st.global.u32 [%rd1], %r15;
BB0_2:
bar.sync 0;
setp.ne.s32 %p9, %r3, 0;
#%p9 bra BB0_4;
mov.u32 %r16, 0;
st.shared.u32 [_ZZ9vectorAddPKfS0_PfiE8ctrBlock], %r16;
BB0_4:
bar.sync 0;
mov.u32 %r17, _ZZ9vectorAddPKfS0_PfiE8ctrBlock;
atom.shared.add.u32 %r18, [%r17], 0;
shr.u32 %r5, %r1, 5;
setp.ge.u32 %p10, %r18, %r5;
#%p10 bra BB0_27;
shr.u32 %r6, %r3, 5;
and.b32 %r7, %r3, 31;
cvta.to.global.u64 %rd8, %rd5;
mul.wide.s32 %rd9, %r4, 4;
add.s64 %rd2, %rd8, %rd9;
cvta.to.global.u64 %rd10, %rd6;
add.s64 %rd3, %rd10, %rd9;
mul.wide.u32 %rd11, %r2, 4;
add.s64 %rd4, %rd1, %rd11;
neg.s32 %r8, %r7;
BB0_6:
atom.shared.add.u32 %r21, [%r17], 0;
mov.u32 %r34, 0;
setp.ne.s32 %p11, %r21, %r6;
mov.u32 %r33, %r8;
#%p11 bra BB0_26;
BB0_7:
setp.eq.s32 %p12, %r33, 0;
setp.lt.s32 %p13, %r4, %r13;
and.pred %p14, %p12, %p13;
#!%p14 bra BB0_9;
bra.uni BB0_8;
BB0_8:
ld.global.f32 %f1, [%rd2];
ld.global.f32 %f2, [%rd3];
add.f32 %f3, %f1, %f2;
ld.volatile.global.f32 %f4, [%rd4];
add.f32 %f5, %f4, %f3;
st.volatile.global.f32 [%rd4], %f5;
BB0_9:
bar.warp.sync -1;
add.s32 %r22, %r34, 1;
setp.eq.s32 %p15, %r22, %r7;
and.pred %p16, %p15, %p13;
#!%p16 bra BB0_11;
bra.uni BB0_10;
BB0_10:
ld.global.f32 %f6, [%rd2];
ld.global.f32 %f7, [%rd3];
add.f32 %f8, %f6, %f7;
ld.volatile.global.f32 %f9, [%rd4];
add.f32 %f10, %f9, %f8;
st.volatile.global.f32 [%rd4], %f10;
BB0_11:
bar.warp.sync -1;
add.s32 %r23, %r34, 2;
setp.eq.s32 %p17, %r23, %r7;
and.pred %p18, %p17, %p13;
#!%p18 bra BB0_13;
bra.uni BB0_12;
BB0_12:
ld.global.f32 %f11, [%rd2];
ld.global.f32 %f12, [%rd3];
add.f32 %f13, %f11, %f12;
ld.volatile.global.f32 %f14, [%rd4];
add.f32 %f15, %f14, %f13;
st.volatile.global.f32 [%rd4], %f15;
BB0_13:
bar.warp.sync -1;
add.s32 %r24, %r34, 3;
setp.eq.s32 %p19, %r24, %r7;
and.pred %p20, %p19, %p13;
#!%p20 bra BB0_15;
bra.uni BB0_14;
BB0_14:
ld.global.f32 %f16, [%rd2];
ld.global.f32 %f17, [%rd3];
add.f32 %f18, %f16, %f17;
ld.volatile.global.f32 %f19, [%rd4];
add.f32 %f20, %f19, %f18;
st.volatile.global.f32 [%rd4], %f20;
BB0_15:
bar.warp.sync -1;
add.s32 %r25, %r34, 4;
setp.eq.s32 %p21, %r25, %r7;
and.pred %p22, %p21, %p13;
#!%p22 bra BB0_17;
bra.uni BB0_16;
BB0_16:
ld.global.f32 %f21, [%rd2];
ld.global.f32 %f22, [%rd3];
add.f32 %f23, %f21, %f22;
ld.volatile.global.f32 %f24, [%rd4];
add.f32 %f25, %f24, %f23;
st.volatile.global.f32 [%rd4], %f25;
BB0_17:
bar.warp.sync -1;
add.s32 %r26, %r34, 5;
setp.eq.s32 %p23, %r26, %r7;
and.pred %p24, %p23, %p13;
#!%p24 bra BB0_19;
bra.uni BB0_18;
BB0_18:
ld.global.f32 %f26, [%rd2];
ld.global.f32 %f27, [%rd3];
add.f32 %f28, %f26, %f27;
ld.volatile.global.f32 %f29, [%rd4];
add.f32 %f30, %f29, %f28;
st.volatile.global.f32 [%rd4], %f30;
BB0_19:
bar.warp.sync -1;
add.s32 %r27, %r34, 6;
setp.eq.s32 %p25, %r27, %r7;
and.pred %p26, %p25, %p13;
#!%p26 bra BB0_21;
bra.uni BB0_20;
BB0_20:
ld.global.f32 %f31, [%rd2];
ld.global.f32 %f32, [%rd3];
add.f32 %f33, %f31, %f32;
ld.volatile.global.f32 %f34, [%rd4];
add.f32 %f35, %f34, %f33;
st.volatile.global.f32 [%rd4], %f35;
BB0_21:
bar.warp.sync -1;
add.s32 %r28, %r34, 7;
setp.eq.s32 %p27, %r28, %r7;
and.pred %p28, %p27, %p13;
#!%p28 bra BB0_23;
bra.uni BB0_22;
BB0_22:
ld.global.f32 %f36, [%rd2];
ld.global.f32 %f37, [%rd3];
add.f32 %f38, %f36, %f37;
ld.volatile.global.f32 %f39, [%rd4];
add.f32 %f40, %f39, %f38;
st.volatile.global.f32 [%rd4], %f40;
BB0_23:
add.s32 %r34, %r34, 8;
bar.warp.sync -1;
add.s32 %r33, %r33, 8;
setp.ne.s32 %p29, %r34, 32;
#%p29 bra BB0_7;
setp.ne.s32 %p30, %r7, 0;
#%p30 bra BB0_26;
atom.shared.add.u32 %r30, [%r17], 1;
BB0_26:
bar.sync 0;
atom.shared.add.u32 %r32, [%r17], 0;
setp.lt.u32 %p31, %r32, %r5;
#%p31 bra BB0_6;
BB0_27:
ret;
}

There are at least 2 things to be aware of here.
Let's observe that your program is using atomics on shared memory locations. Also, you indicated that you are compiling for (and when profiling, running on) a Kepler architecture GPU.
On Kepler, shared memory atomics are emulated via a software sequence. This won't be visible when inspecting the PTX code, as the conversion to the emulation sequence is done by ptxas, the tool that converts PTX to SASS code for execution on the target device.
Since you are targetting and running on Kepler, the SASS includes no shared memory atomic instructions (instead, shared atomics are emulated with a loop that uses special hardware locks, and for example you can see LDSLK, a load-from-shared-with-lock instruction, in your SASS code).
Since your code has no actual atomic instructions (on Kepler), it is not generating any atomic traffic that is trackable by the profiler.
If you want to verify this, use the cuobjdump tool on your compiled binary. I recommend compiling only for the Kepler target architecture you will actually use for this sort of binary analysis. Here's an example:
$ nvcc -o t324 t324.cu -arch=sm_30
$ cuobjdump -sass ./t324 |grep ATOM
$ nvcc -o t324 t324.cu -arch=sm_50
$ cuobjdump -sass ./t324 |grep ATOM
/*00e8*/ #P2 ATOMS.ADD R6, [RZ], RZ ; /* 0xec0000000ff2ff06 */
/*01b8*/ #P0 ATOMS.ADD R12, [RZ], RZ ; /* 0xec0000000ff0ff0c */
/*10f8*/ #P0 ATOMS.ADD RZ, [RZ], R12 ; /* 0xec00000000c0ffff */
/*1138*/ #P0 ATOMS.ADD R10, [RZ], RZ ; /* 0xec0000000ff0ff0a */
$
As indicated above, on Maxwell and beyond, there is a native shared memory atomic instruction available (e.g. ATOMS) in SASS code. Therefore if you compile your code for a maxwell architecture or beyond, you will see actual atomic instructions in the SASS.
However, I'm not sure if or how this will be represented in the visual profiler. I suspect shared atomic reporting may be limited. This is discoverable by reviewing the available metrics and observing that for architectures of 5.0 and higher, most of the atomic metrics are specifically for global atomics, and the only metric I can find pertaining to shared atomics is:
inst_executed_shared_atomics Warp level shared instructions for atom and atom CAS Multi-context
I'm not sure that is sufficient to compute bandwidth or utilization, so I'm not sure the visual profiler intends to report much in the way of shared atomic usage, even on 5.0+ architectures. You're welcome to try it out of course.
As an aside, I would usually think that this sort of construct implies a logical defect in the code:
int i = blockDim.x * blockIdx.x + threadIdx.x;
if(i==0)
C[0]=0.0f;
__syncthreads();
But it's not relevant to this particular inquiry, and I'm not sure of the intent of your code anyway. Keep in mind that CUDA specifies no order of block execution.

purposely causing bank conflicts for shared memory on CUDA device

It is a mystery for me how shared memory on cuda devices work. I was curious to count threads having access to the same shared memory. For this I wrote a simple program
#include <cuda_runtime.h>
#include <stdio.h>
#define nblc 13
#define nthr 1024
//------------------------#device--------------------
__device__ int inwarpD[nblc];
__global__ void kernel(){
__shared__ int mywarp;
mywarp=0;
for (int i=0;i<5;i++) mywarp += (10000*threadIdx.x+1);
__syncthreads();
inwarpD[blockIdx.x]=mywarp;
}
//------------------------#host-----------------------
int main(int argc, char **argv){
int inwarpH[nblc];
cudaSetDevice(2);
kernel<<<nblc, nthr>>>();
cudaMemcpyFromSymbol(inwarpH, inwarpD, nblc*sizeof(int), 0, cudaMemcpyDeviceToHost);
for (int i=0;i<nblc;i++) printf("%i : %i\n",i, inwarpH[i]);
}
and ran it on K80 GPU. Since several threads are having access to the same shared memory variable I was expecting that this variable will be updated 5*nthr times, albeit not at the same cycle because of the bank conflict. However, the output indicates that mywarp shared variable was updated only 5 times. For each blocks different threads accomplished this task:
0 : 35150005
1 : 38350005
2 : 44750005
3 : 38350005
4 : 51150005
5 : 38350005
6 : 38350005
7 : 38350005
8 : 51150005
9 : 44750005
10 : 51150005
11 : 38350005
12 : 38350005
Instead, I was expecting
523776*10000+5*1024=5237765120
for each block. Can someone kindly explain me where my understanding of shared memory fails. I would like also to know how would it be possible that all threads in one block access (update) the same shared variable. I know it is not possible at the same MP cycle. Serialisation is fine for me because it is going to be a rare event.

Lets walk through the ptx that it generates.
//Declare some registers
.reg .s32 %r<5>;
.reg .s64 %rd<4>;
// demoted variable
.shared .align 4 .u32 _Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp;
//load tid in register r1
mov.u32 %r1, %tid.x;
//multiple tid*5000+5 and store in r2
mad.lo.s32 %r2, %r1, 50000, 5;
//store result in shared memory
st.shared.u32 [_Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp], %r2;
///synchronize
bar.sync 0;
//load from shared memory and store in r3
ld.shared.u32 %r3, [_Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp];
mov.u32 %r4, %ctaid.x;
mul.wide.u32 %rd1, %r4, 4;
mov.u64 %rd2, inwarpD;
add.s64 %rd3, %rd2, %rd1;
//store r3 in global memory
st.global.u32 [%rd3], %r3;
ret;
So basically
for (int i=0;i<5;i++)
mywarp += (10000*threadIdx.x+1);
is being optimized down to
mywarp=50000*threadIdx.x+5
so you're not experiencing a bank-conflict. You are experiencing a race-condition.

Is there an equivalent to memcpy() that works inside a CUDA kernel?

I'm trying to break apart and reshape the structure of an array asynchronously using the CUDA kernel. memcpy() doesn't work inside the kernel, and neither does cudaMemcpy()*; I'm at a loss.
Can anyone tell me the preferred method for copying memory from within the CUDA kernel?
It is worth noting, cudaMemcpy(void *to, void *from, size, cudaMemcpyDeviceToDevice) will NOT work for what I am trying to do, because it can only be called from outside of the kernel and does not execute asynchronously.

Yes, there is an equivalent to memcpy that works inside cuda kernels. It is called memcpy. As an example:
__global__ void kernel(int **in, int **out, int len, int N)
{
int idx = threadIdx.x + blockIdx.x*blockDim.x;
for(; idx<N; idx+=gridDim.x*blockDim.x)
memcpy(out[idx], in[idx], sizeof(int)*len);
}
which compiles without error like this:
$ nvcc -Xptxas="-v" -arch=sm_20 -c memcpy.cu
ptxas info : Compiling entry function '_Z6kernelPPiS0_ii' for 'sm_20'
ptxas info : Function properties for _Z6kernelPPiS0_ii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 11 registers, 48 bytes cmem[0]
and emits PTX:
.version 3.0
.target sm_20
.address_size 32
.file 1 "/tmp/tmpxft_00000407_00000000-9_memcpy.cpp3.i"
.file 2 "memcpy.cu"
.file 3 "/usr/local/cuda/nvvm/ci_include.h"
.entry _Z6kernelPPiS0_ii(
.param .u32 _Z6kernelPPiS0_ii_param_0,
.param .u32 _Z6kernelPPiS0_ii_param_1,
.param .u32 _Z6kernelPPiS0_ii_param_2,
.param .u32 _Z6kernelPPiS0_ii_param_3
)
{
.reg .pred %p<4>;
.reg .s32 %r<32>;
.reg .s16 %rc<2>;
ld.param.u32 %r15, [_Z6kernelPPiS0_ii_param_0];
ld.param.u32 %r16, [_Z6kernelPPiS0_ii_param_1];
ld.param.u32 %r2, [_Z6kernelPPiS0_ii_param_3];
cvta.to.global.u32 %r3, %r15;
cvta.to.global.u32 %r4, %r16;
.loc 2 4 1
mov.u32 %r5, %ntid.x;
mov.u32 %r17, %ctaid.x;
mov.u32 %r18, %tid.x;
mad.lo.s32 %r30, %r5, %r17, %r18;
.loc 2 6 1
setp.ge.s32 %p1, %r30, %r2;
#%p1 bra BB0_5;
ld.param.u32 %r26, [_Z6kernelPPiS0_ii_param_2];
shl.b32 %r7, %r26, 2;
.loc 2 6 54
mov.u32 %r19, %nctaid.x;
.loc 2 4 1
mov.u32 %r29, %ntid.x;
.loc 2 6 54
mul.lo.s32 %r8, %r29, %r19;
BB0_2:
.loc 2 7 1
shl.b32 %r21, %r30, 2;
add.s32 %r22, %r4, %r21;
ld.global.u32 %r11, [%r22];
add.s32 %r23, %r3, %r21;
ld.global.u32 %r10, [%r23];
mov.u32 %r31, 0;
BB0_3:
add.s32 %r24, %r10, %r31;
ld.u8 %rc1, [%r24];
add.s32 %r25, %r11, %r31;
st.u8 [%r25], %rc1;
add.s32 %r31, %r31, 1;
setp.lt.u32 %p2, %r31, %r7;
#%p2 bra BB0_3;
.loc 2 6 54
add.s32 %r30, %r8, %r30;
ld.param.u32 %r27, [_Z6kernelPPiS0_ii_param_3];
.loc 2 6 1
setp.lt.s32 %p3, %r30, %r27;
#%p3 bra BB0_2;
BB0_5:
.loc 2 9 2
ret;
}
The code block at BB0_3 is a byte sized memcpy loop emitted automagically by the compiler. It might not be a great idea from a performance point-of-view to use it, but it is fully supported (and has been for a long time on all architectures).
Edited four years later to add that since the device side runtime API was released as part of the CUDA 6 release cycle, it is also possible to directly call something like
cudaMemcpyAsync(void *to, void *from, size, cudaMemcpyDeviceToDevice)
in device code for all architectures which support it (Compute Capability 3.5 and newer hardware using separate compilation and device linking).

In my testing the best answer is to write your own looping copy routine. In my case:
__device__
void devCpyCplx(const thrust::complex<float> *in, thrust::complex<float> *out, int len){
// Casting for improved loads and stores
for (int i=0; i<len/2; ++i) {
((float4*) out)[i] = ((float4*) out)[i];
}
if (len%2) {
((float2*) out)[len-1] = ((float2*) in)[len-1];
}
}
memcpy works in a kernel but it may be much slower. cudaMemcpyAsync from the host is a valid option.
I needed to partition 800 contiguous vectors of ~33,000 length to 16,500 length in different buffer with 1,600 copy calls. Timing with nvvp:
memcpy in kernel: 140 ms
cudaMemcpy DtoD on host: 34 ms
loop copy in kernel: 8.6 ms
#talonmies reports that memcpy copies byte by byte which is inefficient with loads and stores. I'm targeting compute 3.0 still so I can't test cudaMemcpy on device.
Edit: Tested on a newer device. Device runtime cudaMemcpyAsync(out, in, bytes, cudaMemcpyDeviceToDevice, 0) is comparable to a good copy loop and better than a bad copy loop. Note using the device runtime api may require compile changes (sm>=3.5, separate compilation). Refer to programming guide and nvcc docs for compiling.
Device memcpy bad. Host cudaMemcpyAsync okay. Device cudaMemcpyAsync good.

cudaMemcpy() does indeed run asynchronously but you're right, it can't be executed from within a kernel.
Is the new shape of the array determined based on some calculation? Then, you would typically run the same number of threads as there are entries in your array. Each thread would run a calculation to determine the source and destination of a single entry in the array and then copy it there with a single assignment. (dst[i] = src[j]). If the new shape of the array is not based on calculations, it might be more efficient to run a series of cudaMemcpy() with cudaMemCpyDeviceToDevice from the host.

cuda kernel - registers

Float4 variables defined in the kernel should be store in registers !? I made a simple test. In the first kernel I use registers to optimize a memory traffic, in the second I read directly from a global memory.
__global__ void kernel(float4 *arg1, float4 *arg2, float4 *arg3)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
float4 temp1 = arg2[x];
float4 temp2 = arg3[x];
//some computations using temp1 and temp2
arg2[x] = temp1;
arg3[x] = temp2;
arg1[x] = make_float4(temp1.x, temp1.y, temp1.z, temp1.w);
}
__global__ void kernel(float4 *arg1, float4 *arg2, float4 *arg3)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
//some computations using a direct access to global memory
//for example arg2[x].x
arg1[x] = make_float4(arg2[x].x, arg2[x].y, arg2[x].z, arg2[x].w);
}
The first kernel is 9-10% faster. The difference in not so big. When using registers can bring more benefits ?

Firstly, you can't say what will and won't be in registers solely based on C code. That is certainly not the the source of the performance difference between the two codes. In fact, both kernels use registers for the float4 variables, and the code they compile to is almost identical.
First kernel:
ld.param.u64 %rd3, [__cudaparm__Z7kernel0P6float4S0_S0__arg2];
add.u64 %rd4, %rd3, %rd2;
ld.global.v4.f32 {%f1,%f2,%f3,%f4}, [%rd4+0];
.loc 16 21 0
ld.param.u64 %rd5, [__cudaparm__Z7kernel0P6float4S0_S0__arg3];
add.u64 %rd6, %rd5, %rd2;
ld.global.v4.f32 {%f5,%f6,%f7,%f8}, [%rd6+0];
st.global.v4.f32 [%rd4+0], {%f1,%f2,%f3,%f4};
st.global.v4.f32 [%rd6+0], {%f5,%f6,%f7,%f8};
.loc 16 24 0
ld.param.u64 %rd7, [__cudaparm__Z7kernel0P6float4S0_S0__arg1];
add.u64 %rd8, %rd7, %rd2;
st.global.v4.f32 [%rd8+0], {%f1,%f2,%f3,%f4};
second kernel:
ld.param.u64 %rd3, [__cudaparm__Z7kernel1P6float4S0_S0__arg2];
add.u64 %rd4, %rd3, %rd2;
ld.global.v4.f32 {%f1,%f2,%f3,%f4}, [%rd4+0];
ld.param.u64 %rd5, [__cudaparm__Z7kernel1P6float4S0_S0__arg1];
add.u64 %rd6, %rd5, %rd2;
st.global.v4.f32 [%rd6+0], {%f1,%f2,%f3,%f4};
If there really is a performance difference between them, it is probably that the first kernel has more opportunity for instruction level parallelism than the second. But that is just a wild guess, without knowing much more about how the benchmarking of the two was done.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008