How is 2D Shared Memory arranged in CUDA - cuda

I've always worked with linear shared memory (load, store, access neighbours), but I made a simple 2D test to study bank conflicts, and the results have confused me.
The following code reads data from a one-dimensional global memory array into shared memory and copies it back from shared memory to global memory.
__global__ void update(int* gIn, int* gOut, int w) {
    // shared memory space
    __shared__ int shData[16][16];
    // map from threadIdx/BlockIdx to data position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // calculate the global id into the one dimensional array
    int gid = x + y * w;
    // load shared memory
    shData[threadIdx.x][threadIdx.y] = gIn[gid];
    // synchronize threads not really needed but keep it for convenience
    __syncthreads();
    // write data back to global memory
    gOut[gid] = shData[threadIdx.x][threadIdx.y];
}
The visual profiler reported conflicts in shared memory. The following code avoids those conflicts (only the differences are shown):
// load shared memory
shData[threadIdx.y][threadIdx.x] = gIn[gid];
// write data back to global memory
gOut[gid] = shData[threadIdx.y][threadIdx.x];
This behavior confuses me because Programming Massively Parallel Processors: A Hands-on Approach states:
matrix elements in C and CUDA are placed into the linearly addressed locations according to the row major convention. That is, the elements of row 0 of a matrix are first placed in order into consecutive locations.
Is this related to the shared memory arrangement, or to the thread indexes? Am I missing something?
The kernel configuration is as follows:
// kernel configuration
dim3 dimBlock = dim3 ( 16, 16, 1 );
dim3 dimGrid = dim3 ( 64, 64 );
// Launching a grid of 64x64 blocks with 16x16 threads -> 1048576 threads
update<<<dimGrid, dimBlock>>>(d_input, d_output, 1024);
Thanks in advance.

Yes, shared memory is arranged in row-major order as you expected. So your [16][16] array is stored row wise, something like this:
           bank0 .... bank15
row  0  [    0 ....   15 ]
     1  [   16 ....   31 ]
     2  [   32 ....   47 ]
     3  [   48 ....   63 ]
     4  [   64 ....   79 ]
     5  [   80 ....   95 ]
     6  [   96 ....  111 ]
     7  [  112 ....  127 ]
     8  [  128 ....  143 ]
     9  [  144 ....  159 ]
    10  [  160 ....  175 ]
    11  [  176 ....  191 ]
    12  [  192 ....  207 ]
    13  [  208 ....  223 ]
    14  [  224 ....  239 ]
    15  [  240 ....  255 ]
           col 0 .... col 15
Because there are sixteen 32-bit shared memory banks on pre-Fermi hardware, every integer entry in each column maps onto one shared memory bank. So how does that interact with your choice of indexing scheme?
The thing to keep in mind is that threads within a block are numbered in the equivalent of column major order (technically the x dimension of the structure is the fastest varying, followed by y, followed by z). So when you use this indexing scheme:
shData[threadIdx.x][threadIdx.y]
threads within a half-warp will be reading from the same column, which implies reading from the same shared memory bank, and bank conflicts will occur. When you use the opposite scheme:
shData[threadIdx.y][threadIdx.x]
threads within the same half-warp will be reading from the same row, which implies reading from each of the 16 different shared memory banks, so no conflicts occur.
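To make the argument above concrete, here is a small host-side sketch (plain C++, not a kernel) that models which bank each thread of a half-warp hits under the two indexing schemes. It assumes the pre-Fermi parameters mentioned above (16 banks of 32-bit words, conflicts evaluated per half-warp) and the 16x16 block from the question. The names conflictDegree, kBanks, kHalfWarp and kTile are made up for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumed pre-Fermi parameters taken from the answer above:
// 16 banks of 32-bit words, conflicts evaluated per half-warp of 16 threads.
constexpr int kBanks = 16;
constexpr int kHalfWarp = 16;
constexpr int kTile = 16;  // shData[16][16] with blockDim = (16, 16)

// Worst-case number of threads in one half-warp that hit the same bank.
// 1 means conflict-free, 16 means the access is fully serialized.
//   transposed == false models shData[threadIdx.y][threadIdx.x]
//   transposed == true  models shData[threadIdx.x][threadIdx.y]
int conflictDegree(int halfWarp, bool transposed) {
    assert(halfWarp >= 0 && halfWarp < kTile * kTile / kHalfWarp);
    std::array<int, kBanks> hits{};
    for (int lane = 0; lane < kHalfWarp; ++lane) {
        // Linear thread id within the block: x varies fastest, then y.
        const int linear = halfWarp * kHalfWarp + lane;
        const int tx = linear % kTile;
        const int ty = linear / kTile;
        // Row-major word offset of the accessed element inside shData.
        const int word = transposed ? tx * kTile + ty : ty * kTile + tx;
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, conflictDegree(h, false) comes out as 1 for every half-warp h, while conflictDegree(h, true) comes out as 16, which matches what the profiler reported for the two versions of the kernel.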

Related

Can anyone tell me why my CUDA C code is returning my array Z to be wholly zero? (again - but with different code this time) [duplicate]

Here is my code:
int threadNum = BLOCKDIM/8;
dim3 dimBlock(threadNum,threadNum);
int blocks1 = nWidth/threadNum + (nWidth%threadNum == 0 ? 0 : 1);
int blocks2 = nHeight/threadNum + (nHeight%threadNum == 0 ? 0 : 1);
dim3 dimGrid;
dimGrid.x = blocks1;
dimGrid.y = blocks2;
// dim3 numThreads2(BLOCKDIM);
// dim3 numBlocks2(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1) );
perform_scaling<<<dimGrid,dimBlock>>>(imageDevice,imageDevice_new,min,max,nWidth, nHeight);
cudaError_t err = cudaGetLastError();
cudasafe(err,"Kernel2");
This is the launch of my second kernel, which is fully independent in terms of data usage. BLOCKDIM is 512, nWidth and nHeight are 512 too, and cudasafe simply prints the string message corresponding to the error code. This section of the code gives a configuration error just after the kernel call.
What might cause this error? Any ideas?
This type of error message frequently refers to the launch configuration parameters (grid/threadblock dimensions in this case, could also be shared memory, etc. in other cases). When you see a message like this it's a good idea just to print out your actual config parameters before launching the kernel, to see if you've made any mistakes.
You said BLOCKDIM = 512. You have threadNum = BLOCKDIM/8 so threadNum = 64. Your threadblock configuration is:
dim3 dimBlock(threadNum,threadNum);
So you are asking to launch blocks of 64 x 64 threads, that is 4096 threads per block. That won't work on any generation of CUDA devices. All current CUDA devices are limited to a maximum of 1024 threads per block, which is the product of the 3 block dimensions.
Maximum dimensions are listed in table 14 of the CUDA programming guide, and also available via the deviceQuery CUDA sample code.
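As a rough sketch of the "print your config before launching" advice, the check can be done on the host before the kernel call. This is plain C++ so it builds without the toolkit: Dim3 is a hypothetical stand-in for CUDA's dim3, and the limit is passed in by the caller (1024 is the figure quoted above; in real code it would come from the device properties). It only checks the product of the dimensions, not the per-dimension maxima.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for CUDA's dim3 so the sketch builds without nvcc.
struct Dim3 {
    unsigned x, y, z;
};

// Threads per block is the product of the three block dimensions.
unsigned long long threadsPerBlock(Dim3 block) {
    return 1ULL * block.x * block.y * block.z;
}

// maxThreads is supplied by the caller (assumption: 1024, as quoted above).
bool blockSizeIsValid(Dim3 block, unsigned long long maxThreads) {
    assert(maxThreads > 0);
    const unsigned long long n = threadsPerBlock(block);
    return n > 0 && n <= maxThreads;
}

// Print the launch configuration so mistakes are visible before the launch.
void printLaunchConfig(Dim3 grid, Dim3 block) {
    std::printf("grid=(%u,%u,%u) block=(%u,%u,%u) threads/block=%llu\n",
                grid.x, grid.y, grid.z, block.x, block.y, block.z,
                threadsPerBlock(block));
}
```

With the values from the question (BLOCKDIM/8 = 64), blockSizeIsValid(Dim3{64, 64, 1}, 1024) is false because 64 * 64 = 4096, whereas a 32 x 32 block would pass.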
Just to add to the previous answers: you can also query the device limits in your code, so it can run on other devices without hard-coding the number of threads you will use:
struct cudaDeviceProp properties;
cudaGetDeviceProperties(&properties, device);
cout<<"using "<<properties.multiProcessorCount<<" multiprocessors"<<endl;
cout<<"max threads per processor: "<<properties.maxThreadsPerMultiProcessor<<endl;
cout<<"max threads per block: "<<properties.maxThreadsPerBlock<<endl;

nvprof events "fb_subp0_read_sectors" and "fb_subp1_read_sectors" do not report correct results

I tried to count the number of DRAM (global memory) accesses for a simple vector add kernel.
__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int blockStartIndex = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex = threadStartIndex + ( N * blockDim.x );
    int i;
    for( i=threadStartIndex; i<threadEndIndex; i+=blockDim.x ){
        C[i] = A[i] + B[i];
    }
}
Grid Size = 180
Block size = 128
size of array = 180 * 128 * N floats where N is input parameter (elements per thread)
when N = 1, size of array = 180 * 128 * 1 floats = 90KB
All arrays A, B and C should be read from DRAM.
Therefore theoretically,
DRAM writes (C) = 2880 (32 byte accesses)
DRAM reads (A,B) = 2880 + 2880 = 5760 (32 byte accesses)
But when I used nvprof
DRAM writes = fb_subp0_write_sectors + fb_subp1_write_sectors = 1440 + 1440 = 2880 (32 byte accesses)
DRAM reads = fb_subp0_read_sectors + fb_subp1_read_sectors = 23 + 7 = 30 (32 byte accesses)
Now this is the problem. Theoretically there should be 5760 DRAM reads, but nvprof reports only 30, which looks impossible to me. Furthermore, if you double the size of the vector (N = 2), the reported DRAM reads still remain at 30.
It would be great if someone could shed some light.
I have disabled the L1 cache by using compiler option "-Xptxas -dlcm=cg"
Thanks,
Waruna
If you call cudaMemcpy before the kernel launch to copy the source buffers from host to device, the source buffers end up in the L2 cache. The kernel's reads then do not miss in L2, so you see a lower count for (fb_subp0_read_sectors + fb_subp1_read_sectors).
If you comment out cudaMemcpy before the kernel launch, you will see that the event values of fb_subp0_read_sectors and fb_subp1_read_sectors include the values you are expecting.
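For reference, the expected counts quoted in the question (2880 sectors per array) can be reproduced with a few lines of host-side arithmetic. This is only a sketch: it assumes 4-byte floats and 32-byte sectors as in the question, and the helper names arrayBytes and sectorsFor are made up.

```cpp
#include <cassert>
#include <cstddef>

// Sector size used by the fb_subp*_sectors events, per the question.
constexpr std::size_t kSectorBytes = 32;

// Size in bytes of one array of gridSize * blockSize * elemsPerThread floats.
std::size_t arrayBytes(std::size_t gridSize, std::size_t blockSize,
                       std::size_t elemsPerThread) {
    assert(gridSize > 0 && blockSize > 0 && elemsPerThread > 0);
    return gridSize * blockSize * elemsPerThread * sizeof(float);
}

// Number of 32-byte sectors needed to cover the given number of bytes.
std::size_t sectorsFor(std::size_t bytes) {
    return (bytes + kSectorBytes - 1) / kSectorBytes;
}
```

For N = 1 this gives 180 * 128 * 4 = 92160 bytes (90 KB) per array, so 2880 sectors for the writes to C and 2 * 2880 = 5760 for the reads of A and B if nothing is served from L2.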

About the number of registers allocated per SM in CUDA

First Question.
The CUDA C Programming Guide says the following:
The same on-chip memory is used for both L1 and shared memory: It can
be configured as 48 KB of shared memory and 16 KB of L1 cache or as 16
KB of shared memory and 48 KB of L1 cache
But the device query shows "Total number of registers available per block: 32768".
I use a GTX 580 (CC 2.0).
The guide says the default cache size is 16 KB, but 32768 registers means 32768*4 bytes = 131072 bytes = 128 KB. I don't know which is correct.
Second Question.
I set up the launch as below:
dim3 grid(32, 32); //blocks in a grid
dim3 block(16, 16); //threads in a block
kernel<<<grid,block>>>(...);
Then the number of threads per block is 256, so we need 256*N registers per block,
where N is the number of registers needed per thread.
(256*N)*blocks is the number of registers per SM (a count, not bytes).
So, if the default size is 16 KB and threads/SM is at its maximum (1536, from "Maximum number of threads per multiprocessor: 1536"), then N can't exceed 2:
16KB/4Bytes = 4096 registers, 4096/1536 = 2.66666...
With the larger 48 KB cache, N can't exceed 8:
48KB/4Bytes = 12288 registers, 12288/1536 = 8
Is that true? I'm really confused.
My almost complete code is below.
I thought the kernel would be optimal with a 16x16 block dimension,
but with 8x8 it is faster than, or similar to, 16x16.
I don't know why.
The number of registers per thread is 16, and the shared memory is 80+16 bytes.
I asked the same question before, but I couldn't get an exact solution:
The result of an experiment different from CUDA Occupancy Calculator
#define WIDTH 512
#define HEIGHT 512
#define TILE_WIDTH 8
#define TILE_HEIGHT 8
#define CHANNELS 3
#define DEVICENUM 1
#define HEIGHTs HEIGHT/DEVICENUM
__global__ void PRINT_POLYGON( unsigned char *IMAGEin, int *MEMin, char a, char b, char c){
int Col = blockIdx.y*blockDim.y+ threadIdx.y; //Col is y coordinate
int Row = blockIdx.x*blockDim.x+ threadIdx.x; //Row is x coordinate
int tid_in_block = threadIdx.x + threadIdx.y*blockDim.x;
int bid_in_grid = blockIdx.x + blockIdx.y*gridDim.x;
int threads_per_block = blockDim.x * blockDim.y;
int tid_in_grid = tid_in_block + threads_per_block * bid_in_grid;
float result_a, result_b;
__shared__ int M[15];
for(int k = 0; k < 5; k++){
M[k] = MEMin[a*5+k];
M[k+5] = MEMin[b*5+k];
M[k+10] = MEMin[c*5+k];
}
int result_a_up = (M[11]-M[1])*(Row-M[0]) - (M[10]-M[0])*(Col-M[1]);
int result_b_up = (M[6] -M[1])*(M[0]-Row) - (M[5] -M[0])*(M[1]-Col);
int result_down = (M[11]-M[1])*(M[5]-M[0]) - (M[6]-M[1])*(M[10]-M[0]);
result_a = (float)result_a_up / (float)result_down;
result_b = (float)result_b_up / (float)result_down;
if((0 <= result_a && result_a <=1) && ((0 <= result_b && result_b <= 1)) && ((0 <= (result_a+result_b) && (result_a+result_b) <= 1))){
IMAGEin[tid_in_grid*CHANNELS] += M[2] + (M[7]-M[2])*result_a + (M[12]-M[2])*result_b; //Red Channel
IMAGEin[tid_in_grid*CHANNELS+1] += M[3] + (M[8]-M[3])*result_a + (M[13]-M[3])*result_b; //Green Channel
IMAGEin[tid_in_grid*CHANNELS+2] += M[4] + (M[9]-M[4])*result_a + (M[14]-M[4])*result_b; //Blue Channel
}
}
struct DataStruct {
int deviceID;
unsigned char IMAGE_SEG[WIDTH*HEIGHTs*CHANNELS];
};
void* routine( void *pvoidData ) {
DataStruct *data = (DataStruct*)pvoidData;
unsigned char *dev_IMAGE;
int *dev_MEM;
unsigned char *IMAGE_SEG = data->IMAGE_SEG;
HANDLE_ERROR(cudaSetDevice(5));
//initialize array
memset(IMAGE_SEG, 0, WIDTH*HEIGHTs*CHANNELS);
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
printf("Device %d Starting..\n", data->deviceID);
//Evaluate Time
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );
cudaEventRecord(start, 0);
HANDLE_ERROR( cudaMalloc( (void **)&dev_MEM, sizeof(int)*35) );
HANDLE_ERROR( cudaMalloc( (void **)&dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS) );
cudaMemcpy(dev_MEM, MEM, sizeof(int)*35, cudaMemcpyHostToDevice);
cudaMemset(dev_IMAGE, 0, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS);
dim3 grid(WIDTH/TILE_WIDTH, HEIGHTs/TILE_HEIGHT); //blocks in a grid
dim3 block(TILE_WIDTH, TILE_HEIGHT); //threads in a block
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 1, 2);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 2, 3);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 3, 4);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 4, 5);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 3, 2, 4);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 2, 6, 4);
HANDLE_ERROR( cudaMemcpy( IMAGE_SEG, dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS, cudaMemcpyDeviceToHost ) );
HANDLE_ERROR( cudaFree( dev_MEM ) );
HANDLE_ERROR( cudaFree( dev_IMAGE ) );
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime( &elapsed_time_ms[data->deviceID], start, stop );
cudaEventDestroy(start);
cudaEventDestroy(stop);
elapsed_time_ms[DEVICENUM] += elapsed_time_ms[data->deviceID];
printf("Device %d Complete!\n", data->deviceID);
return 0;
}
The blockDim 8x8 is faster than 16x16 due to the increase in address divergence in your memory access when you increase the block size.
Metrics collected on GTX480 with 15 SMs.
metric                           16x16      8x8
duration                         161µs      114µs
issued_ipc                       1.24       1.31
executed_ipc                     .88        .59
serialization                    54.61%     28.74%
The number of instruction replays clues us in that we likely have bad memory access patterns.
achieved occupancy               88.32%     30.76%
0 warp schedulers issues         8.81%      7.98%
1 warp schedulers issues         2.36%      29.54%
2 warp schedulers issues         88.83%     52.44%
16x16 appears to keep the warp scheduler busy. However, it is keeping the schedulers busy re-issuing instructions.
l1 global load trans             524,407    332,007
l1 global store trans            401,224    209,139
l1 global load trans/request     3.56       2.25
l1 global store trans/request    16.33      8.51
The first priority is to reduce transactions per request. The Nsight VSE source view can display memory statistics per instruction. The primary issue in your kernel is the interleaved U8 loads and stores for IMAGEin[] += value. At 16x16 this results in 16.3 store transactions per request, but only 8.5 for the 8x8 configuration.
Changing
IMAGEin[(i*HEIGHTs+j)*CHANNELS] += ...
to be consecutive increases performance of 16x16 by 3x. I imagine increasing channels to 4 and handling packing in the kernel will improve cache performance and memory throughput.
If you fix the number of memory transactions per request you will then likely have to look at execution dependencies and try to increase your ILP.
It is faster with a block size of 8x8 because it is a smaller multiple of 32. As is visible in the picture below, there are 32 CUDA cores bound together, with two different warp schedulers that actually schedule the same thing, so the same instruction is executed on these 32 cores in each execution cycle.
To clarify: in the first case (8x8) each block is made of two warps (64 threads), so it finishes within only two execution cycles. With 16x16 as your block size, each block takes 8 warps (256 threads) and therefore 4 times as many execution cycles, resulting in slower execution overall.
However, filling an SM with more warps is better in some cases, when memory access is high and each warp is likely to go into a memory stall (i.e. getting its operands from memory), then it will be replaced with another warp until the memory operation gets completed. Therefore resulting in more occupancy of the SM.
You should of course throw in the number of blocks per SM and number of SMs total in your calculations, for example, assigning more than 8 blocks to a single SM might reduce its occupancy, but probably in your case, you are not facing these issues, because 256 is generally a better number than 64, since it will balance your blocks among SMs whereas using 64 threads will result in more blocks getting executed in the same SM.
EDIT: This answer is based on my speculation; for a more scientific approach, see Greg Smith's answer.
The register pool is separate from shared memory/cache, down to the very bottom of the architecture!
Registers are made of flip-flops, while the L1 cache is probably SRAM.
Just to get an idea, look at the picture below which represents FERMI architecture, then update your question to further specify the problem you are facing.
As a note, you can see how many registers and how much shared memory (smem) your functions use by passing the option --ptxas-options=-v to nvcc.
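Since the register file is its own resource, the per-SM arithmetic in the question can be redone against the register count rather than the L1/shared memory size. The sketch below is host-side C++ under assumed compute capability 2.0 limits (32768 registers, 1536 resident threads and 8 resident blocks per SM). It ignores register allocation granularity and shared memory, so treat the results as a rough upper bound. SmLimits, kFermiCc20 and the function names are invented for this example.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SM limits for compute capability 2.0 (e.g. GTX 480/580).
struct SmLimits {
    int registers;   // 32-bit registers per SM
    int maxThreads;  // resident threads per SM
    int maxBlocks;   // resident blocks per SM
};

constexpr SmLimits kFermiCc20{32768, 1536, 8};

// Size of the register file in bytes (4 bytes per 32-bit register).
long registerFileBytes(SmLimits sm) {
    return 4L * sm.registers;
}

// Blocks that can be resident on one SM, limited by registers, threads
// and the block cap. Allocation granularity is deliberately ignored.
int residentBlocks(int threadsPerBlock, int regsPerThread, SmLimits sm) {
    assert(threadsPerBlock > 0 && regsPerThread > 0);
    const int byRegisters = sm.registers / (threadsPerBlock * regsPerThread);
    const int byThreads = sm.maxThreads / threadsPerBlock;
    return std::min({byRegisters, byThreads, sm.maxBlocks});
}

// Fraction of the SM's thread slots that those resident blocks fill.
double theoreticalOccupancy(int threadsPerBlock, int regsPerThread, SmLimits sm) {
    const int blocks = residentBlocks(threadsPerBlock, regsPerThread, sm);
    return static_cast<double>(blocks * threadsPerBlock) / sm.maxThreads;
}
```

With 16 registers per thread, a 16x16 block (256 threads) is limited by the thread count to 6 blocks per SM, which fills all 1536 thread slots. An 8x8 block (64 threads) is limited by the 8-block cap to 512 threads, about one third of the slots. The register file itself works out to 32768 * 4 = 131072 bytes, independent of the 16 KB/48 KB L1 split.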

CUDA - Memory Limit - Vector Summation

I'm trying to learn CUDA, and the following code works for N <= 16384 but fails for greater values (the summation check at the end of the code fails; c values are always 0 for indices i >= 16384).
#include<iostream>
#include"cuda_runtime.h"
#include"../cuda_be/book.h"
#define N (16384)
__global__ void add(int *a,int *b,int *c)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if(tid<N)
{
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
int main()
{
int a[N],b[N],c[N];
int *dev_a,*dev_b,*dev_c;
//allocate mem on gpu
HANDLE_ERROR(cudaMalloc((void**)&dev_a,N*sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_b,N*sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_c,N*sizeof(int)));
for(int i=0;i<N;i++)
{
a[i] = -i;
b[i] = i*i;
}
HANDLE_ERROR(cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_b,b,N*sizeof(int),cudaMemcpyHostToDevice));
system("PAUSE");
add<<<128,128>>>(dev_a,dev_b,dev_c);
//copy the array 'c' back from the gpu to the cpu
HANDLE_ERROR( cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost));
system("PAUSE");
bool success = true;
for(int i=0;i<N;i++)
{
if((a[i] + b[i]) != c[i])
{
printf("Error in %d: %d + %d != %d\n",i,a[i],b[i],c[i]);
system("PAUSE");
success = false;
}
}
if(success) printf("We did it!\n");
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
I think it's a shared-memory-related problem, but I can't come up with a good explanation (possibly from lack of knowledge). Could you provide an explanation and a workaround so it runs for values of N greater than 16384? Here are the specs for my GPU:
General Info for device 0
Name: GeForce 9600M GT
Compute capability: 1.1
Clock rate: 1250000
Device copy overlap : Enabled
Kernel Execution timeout : Enabled
Mem info for device 0
Total global mem: 536870912
Total const mem: 65536
Max mem pitch: 2147483647
Texture Alignment 256
MP info about device 0
Multiproccessor count: 4
Shared mem per mp: 16384
Registers per mp: 8192
Threads in warp: 32
Max threads per block: 512
Max thread dimensions: (512,512,64)
Max grid dimensions: (65535,65535,1)
You probably intended to write
while(tid<N)
not
if(tid<N)
You aren't running out of shared memory; your vector arrays are copied into your device's global memory. As you can see, this has far more space available than the 196608 bytes (16384*4*3) you need.
The reason for your problem is that each thread performs only one addition, so with this structure the maximum size of your vectors is the product of the block and thread parameters in your kernel launch, as tera has pointed out. By correcting
if(tid<N)
to
while(tid<N)
in your code, each thread will perform its addition on multiple indexes and the whole array will be considered.
For more information about the memory hierarchy and the various different places memory can sit, you should read sections 2.3 and 5.3 of the CUDA_C_Programming_Guide.pdf provided with the CUDA toolkit.
Hope that helps.
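To see the difference between the two versions without a GPU, here is a host-side C++ model of the kernel's indexing. It is only a sketch: each simulated thread starts at its global index and, in the while variant, keeps stepping by the total number of launched threads. simulateAdd and countProcessed are made-up names.

```cpp
#include <cassert>
#include <vector>

// Host-side model of the add kernel's indexing. Returns, for each element,
// how many times it was processed.
std::vector<int> simulateAdd(int n, int blocks, int threadsPerBlock, bool useLoop) {
    assert(n >= 0 && blocks > 0 && threadsPerBlock > 0);
    std::vector<int> touched(n, 0);
    const int stride = blocks * threadsPerBlock;  // blockDim.x * gridDim.x
    for (int t = 0; t < stride; ++t) {            // t is the thread's starting tid
        int tid = t;
        if (useLoop) {
            while (tid < n) {  // corrected version: stride through the array
                ++touched[tid];
                tid += stride;
            }
        } else {
            if (tid < n) {     // original version: at most one element per thread
                ++touched[tid];
            }
        }
    }
    return touched;
}

// Number of elements that were processed at least once.
int countProcessed(const std::vector<int>& touched) {
    int count = 0;
    for (int hits : touched) {
        if (hits > 0) ++count;
    }
    return count;
}
```

With a <<<128,128>>> launch and N = 33 * 1024, the if variant processes only the first 16384 elements, while the while variant processes all 33792, each exactly once.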
The same code appears in CUDA by Example, but there N is defined differently:
#define N (33 * 1024) //value defined in CUDA by Example
With the if(tid<N) version, N can't be 33 * 1024 unless you also change the number of blocks and the number of threads per block, because:
add<<<128,128>>>(dev_a,dev_b,dev_c); //16384 threads
(128 * 128) < (33 * 1024), so the elements beyond index 16383 are never computed.

how to optimize matrix multiplication using OpenACC?

I am learning OpenACC (with PGI's compiler) and trying to optimize a matrix multiplication example. The fastest implementation I have come up with so far is the following:
void matrix_mul(float *restrict r, float *a, float *b, int N, int accelerate){
    #pragma acc data copyin (a[0: N * N ], b[0: N * N]) copyout (r [0: N * N ]) if(accelerate)
    {
        #pragma acc region if(accelerate)
        {
            #pragma acc loop independent vector(32)
            for (int j = 0; j < N; j ++)
            {
                #pragma acc loop independent vector(32)
                for (int i = 0; i < N ; i ++ )
                {
                    float sum = 0;
                    for (int k = 0; k < N ; k ++ ) {
                        sum += a [ i + k*N ] * b [ k + j * N ];
                    }
                    r[i + j * N ] = sum ;
                }
            }
        }
    }
}
This results in thread blocks of size 32x32 threads and gives me the best performance so far.
Here are the benchmarks:
Matrix multiplication (1500x1500):
GPU: Geforce GT650 M, 64-bit Linux
Data sz : 1500
Unaccelerated:
matrix_mul() time : 5873.255333 msec
Accelerated:
matrix_mul() time : 420.414700 msec
Data size : 1750 x 1750
matrix_mul() time : 876.271200 msec
Data size : 2000 x 2000
matrix_mul() time : 1147.783400 msec
Data size : 2250 x 2250
matrix_mul() time : 1863.458100 msec
Data size : 2500 x 2500
matrix_mul() time : 2516.493200 msec
Unfortunately I realized that the generated CUDA code is quite primitive (e.g. it does not even use shared memory) and hence cannot compete with a hand-optimized CUDA program. As a reference implementation I took the Arrayfire library, with the following results:
Arrayfire 1500 x 1500 matrix mul
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1932 MB free (2048 MB total)
af: 0.03166 seconds
Arrayfire 1750 x 1750 matrix mul
af: 0.05042 seconds
Arrayfire 2000 x 2000 matrix mul
af: 0.07493 seconds
Arrayfire 2250 x 2250 matrix mul
af: 0.10786 seconds
Arrayfire 2500 x 2500 matrix mul
af: 0.14795 seconds
I wonder if there are any suggestions for getting better performance from OpenACC?
Perhaps my choice of directives is not right?
You're getting roughly a 14x speedup, which is pretty good for PGI's compiler in my experience.
First off, are you compiling with -Minfo? That will give you a lot of feedback from the compiler regarding optimization choices.
You are using a 32x32 thread block, but in my experience 16x16 thread blocks tend to get better performance. If you omit the vector(32) clauses, what scheduling does the compiler choose?
Declaring a and b with restrict might let the compiler generate better code.
Just by looking at your code, I'm not sure that shared memory would help performance. Shared memory only helps improve performance if your code can store and reuse values there instead of going to global memory. In this case you're not reusing any part of a or b after reading it.
It's also worth noting that I've had bad experiences with PGI's compiler when it comes to shared memory usage. It will sometimes do funny stuff and cache the wrong values (seems to mostly happen if you iterate a loop backward), generating wrong results. I actually have to compile my current application using the undocumented -ta=nvidia,nocache option to get it to work correctly, by bypassing shared memory usage altogether.