I am learning OpenACC (with PGI's compiler) and trying to optimize a matrix multiplication example. The fastest implementation I have come up with so far is the following:
void matrix_mul(float *restrict r, float *a, float *b, int N, int accelerate)
{
#pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
    {
#pragma acc region if(accelerate)
        {
#pragma acc loop independent vector(32)
            for (int j = 0; j < N; j++)
            {
#pragma acc loop independent vector(32)
                for (int i = 0; i < N; i++)
                {
                    float sum = 0;
                    for (int k = 0; k < N; k++) {
                        sum += a[i + k*N] * b[k + j*N];
                    }
                    r[i + j*N] = sum;
                }
            }
        }
    }
}
This results in 32x32 thread blocks and gives me the best performance so far.
Here are the benchmarks:
Matrix multiplication (1500x1500):
GPU: GeForce GT 650M, 64-bit Linux
Data size : 1500 x 1500
Unaccelerated:
matrix_mul() time : 5873.255333 msec
Accelerated:
matrix_mul() time : 420.414700 msec
Data size : 1750 x 1750
matrix_mul() time : 876.271200 msec
Data size : 2000 x 2000
matrix_mul() time : 1147.783400 msec
Data size : 2250 x 2250
matrix_mul() time : 1863.458100 msec
Data size : 2500 x 2500
matrix_mul() time : 2516.493200 msec
Unfortunately I realized that the generated CUDA code is quite primitive (e.g. it does not even use shared memory) and hence cannot compete with a hand-optimized CUDA program. As a reference implementation I took the ArrayFire library, with the following results:
Arrayfire 1500 x 1500 matrix mul
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1932 MB free (2048 MB total)
af: 0.03166 seconds
Arrayfire 1750 x 1750 matrix mul
af: 0.05042 seconds
Arrayfire 2000 x 2000 matrix mul
af: 0.07493 seconds
Arrayfire 2250 x 2250 matrix mul
af: 0.10786 seconds
Arrayfire 2500 x 2500 matrix mul
af: 0.14795 seconds
I wonder if there are any suggestions on how to get better performance from OpenACC?
Perhaps my choice of directives is not right?
You're getting right at a 14x speedup, which is pretty good for PGI's compiler in my experience.
First off, are you compiling with -Minfo? That will give you a lot of feedback from the compiler regarding optimization choices.
You are using a 32x32 thread block, but in my experience 16x16 thread blocks tend to get better performance. If you omit the vector(32) clauses, what scheduling does the compiler choose?
Declaring a and b with restrict might let the compiler generate better code.
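For concreteness, here is a sketch of what those two tweaks look like together (illustrative only: restrict added to a and b, and vector(16) in place of vector(32); whether a 16x16 schedule actually beats 32x32 on your GT 650M is something -Minfo plus a quick benchmark will tell you):
void matrix_mul(float *restrict r, float *restrict a, float *restrict b, int N, int accelerate)
{
#pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
    {
#pragma acc region if(accelerate)
        {
#pragma acc loop independent vector(16)
            for (int j = 0; j < N; j++)
            {
#pragma acc loop independent vector(16)
                for (int i = 0; i < N; i++)
                {
                    float sum = 0;
                    for (int k = 0; k < N; k++)
                        sum += a[i + k*N] * b[k + j*N];
                    r[i + j*N] = sum;
                }
            }
        }
    }
}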
Just by looking at your code, I'm not sure that shared memory would help performance. Shared memory only helps improve performance if your code can store and reuse values there instead of going to global memory. In this case you're not reusing any part of a or b after reading it.
It's also worth noting that I've had bad experiences with PGI's compiler when it comes to shared memory usage. It will sometimes do funny stuff and cache the wrong values (seems to mostly happen if you iterate a loop backward), generating wrong results. I actually have to compile my current application using the undocumented -ta=nvidia,nocache option to get it to work correctly, by bypassing shared memory usage altogether.
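For reference, the compile lines involved look roughly like this (the file and output names are placeholders, not taken from the question):
pgcc -acc -ta=nvidia -Minfo -o matmul matrix_mul.c
pgcc -acc -ta=nvidia,nocache -Minfo -o matmul matrix_mul.c
The first prints the compiler's scheduling and data-movement feedback; the second adds the nocache workaround described above.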
I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way, to get the best performance you should make your accesses coalesced and aligned. My GPU has Compute Capability 6.1 ("Pascal"), 48 KiB of shared memory per thread block, and 2 GiB of DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized in 32 banks, so that 32 threads from the same block can each simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch it with one block of 32 threads, and statically allocate 48 KiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with an offset of 48 KiB / 32 banks / sizeof(double), which equals 192:
__shared__ double cache[6144];

__global__ void kernel(double *buf_out, double a, double b, double c)
{
    for(...)
    {
        // Perform calculation on shared memory
        cache[threadIdx.x * 192] = ...
    }
    // Write result to global memory
    buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset, together with cache being a double array, makes sure that each thread accesses the first element of a different bank at the same time. I haven't gotten around to modifying and testing the code yet, but is this the right way to align accesses for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
buf[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
buf[tid] += cos(temp * (Y * y + Z * z));
}
}
}
int main(void)
{
double *dev_mem;
double *buf = NULL;
cudaError_t cudaStatus;
unsigned int screen_elems = 1000;
if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
{
printf("Could not allocate memory...");
return -1;
}
memset(buf, 0, screen_elems * sizeof(double));
if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
{
printf("cudaMalloc failed with code %u", cudaStatus);
return cudaStatus;
}
kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
cudaDeviceSynchronize();
if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
{
printf("cudaMemcpy failed with code %u", cudaStatus);
return cudaStatus;
}
cudaFree(dev_mem);
cudaDeviceReset();
free(buf);
return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    double z, y, y_max;
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    // Make sure the per-thread accumulator is zero before we begin
    cache[tid] = 0;
    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            cache[tid] += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, in your "MWE Added" kernel - that's probably what you should do.
If your threads were to share information - then the pattern of this sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, it is much less likely that shared memory will help you - as you always have to read from global memory at least once and write to shared memory at least once to have your data in shared memory.
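To make that concrete, here is a sketch of the "MWE added" kernel with the accumulator kept in a local variable (which the compiler will normally place in a register) instead of shared memory; the signature and loops are unchanged from the question:
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    double acc = 0;   // per-thread accumulator, held in a register
    for (double z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        double y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (double y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            acc += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = acc;   // one write to global memory per thread
}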
I just started learning CUDA, and I am confused by one point. For the sake of argument, imagine that I had several hundred buoys in the ocean. Imagine that they broadcast a std::vector of readings intermittently, once every few milliseconds. The vector might be 5 readings, or 10 readings, etc., depending on the conditions in the ocean at that time. There is no way to tell when the event will fire; it is not deterministic.
Imagine that I had the idea that I could predict the temperature by gathering all this information in real time, but that the predictor had to first sort all std::vectors on temperature across all buoys. My question is this: do I have to copy the entire data set back to the GPU every time a single buoy fires an event? Since the other buoys' data has not changed, can I leave that data on the GPU, update just what has changed, and ask the kernel to rerun the prediction?
If yes, what is the [thrust pseudo]-code that would do this? Is this best done with streams, events, and pinned memory? What is the limit on how fast I can update the GPU with real-time data?
I was told that this sort of problem is not well suited to GPU and better in FPGA.
A basic sequence could be like this.
Setup phase (initial sort):
Gather an initial set of vectors from each buoy.
Create a parallel set of vectors, one for each buoy, of length equal to the initial length of the buoy vector, and populated by the buoy index:
b1: 1.5 1.7 2.2 2.3 2.6
i1: 1 1 1 1 1
b2: 2.4 2.5 2.6
i2: 2 2 2
b3: 2.8
i3: 3
Concatenate all vectors into a single buoy-temp-vector and buoy-index-vector:
b: 1.5 1.7 2.2 2.3 2.6 2.4 2.5 2.6 2.8
i: 1 1 1 1 1 2 2 2 3
Sort-by-key:
b: 1.5 1.7 2.2 2.3 2.4 2.5 2.6 2.6 2.8
i: 1 1 1 1 2 2 1 2 3
The setup phase is complete. The update phase is executed whenever a buoy update is received. Suppose buoy 2 sends an update:
b2: 2.5 2.7 2.9 3.0
Do thrust::remove_if on the buoy vector, if the corresponding index vector position holds the updated buoy number (2 in this case). Repeat the remove_if on the index vector using the same rule:
b: 1.5 1.7 2.2 2.3 2.6 2.8
i: 1 1 1 1 1 3
Generate the corresponding index vector for the buoy to be updated, and copy both vectors (buoy 2 temp-value and index vectors) to the device:
b2: 2.5 2.7 2.9 3.0
i2: 2 2 2 2
Do thrust::merge_by_key to merge in the newly received update from buoy 2:
b: 1.5 1.7 2.2 2.3 2.5 2.6 2.7 2.8 2.9 3.0
i: 1 1 1 1 2 1 2 3 2 2
The only data that has to be copied to the device on an update cycle is the actual buoy data to be updated. Note that with some work, the setup phase could be eliminated, and the initial assembly of the vectors could merely be seen as "updates" from each buoy into initially-empty buoy value and buoy index vectors. But for description, it's easier to visualize with a setup phase, I think. The above description doesn't explicitly point out the various vector sizings and resizings needed, but this can be accomplished using the same methods one would use on std::vector. Vector resizing may be "costly" on the GPU, just as it can be "costly" on the CPU (if a resize to larger triggers a new allocation and copy...), but this could also be eliminated if a max number of buoys is known and a max number of elements per update is known. In that case, we could allocate our overall buoy value and buoy index vectors to be the maximum necessary sizes.
Here is a fully-worked example following the above outline. As a placeholder, I have included a dummy prediction_kernel call, showing where you could insert your specialized prediction code, operating on the sorted data.
#include <stdio.h>
#include <stdlib.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/merge.h>
#include <sys/time.h>
#include <time.h>
#define N_BUOYS 1024
#define N_MAX_UPDATE 1024
#define T_RANGE 100
#define N_UPDATES_TEST 1000
struct equal_func{
const int idx;
equal_func(int _idx) : idx(_idx) {}
__host__ __device__
bool operator()(int test_val) {
return (test_val == idx);
}
};
__device__ float dev_result[N_UPDATES_TEST];
// dummy "prediction" kernel
__global__ void prediction_kernel(const float *data, int iter, size_t d_size){
int idx=threadIdx.x+blockDim.x*blockIdx.x;
if (idx == 0) dev_result[iter] = data[d_size/2];
}
void create_vec(unsigned int id, thrust::host_vector<float> &data, thrust::host_vector<int> &idx){
size_t mysize = rand()%N_MAX_UPDATE;
data.resize(mysize);
idx.resize(mysize);
for (int i = 0; i < mysize; i++){
data[i] = ((float)rand()/(float)RAND_MAX)*(float)T_RANGE;
idx[i] = id;}
thrust::sort(data.begin(), data.end());
}
int main(){
timeval t1, t2;
int pp = 0;
// ping-pong processing vectors
thrust::device_vector<float> buoy_data[2];
buoy_data[0].resize(N_BUOYS*N_MAX_UPDATE);
buoy_data[1].resize(N_BUOYS*N_MAX_UPDATE);
thrust::device_vector<int> buoy_idx[2];
buoy_idx[0].resize(N_BUOYS*N_MAX_UPDATE);
buoy_idx[1].resize(N_BUOYS*N_MAX_UPDATE);
// vectors for initial buoy data
thrust::host_vector<float> h_buoy_data[N_BUOYS];
thrust::host_vector<int> h_buoy_idx[N_BUOYS];
//SETUP
// populate initial data
int lidx=0;
for (int i = 0; i < N_BUOYS; i++){
create_vec(i, h_buoy_data[i], h_buoy_idx[i]);
thrust::copy(h_buoy_data[i].begin(), h_buoy_data[i].end(), &(buoy_data[pp][lidx]));
thrust::copy(h_buoy_idx[i].begin(), h_buoy_idx[i].end(), &(buoy_idx[pp][lidx]));
lidx+= h_buoy_data[i].size();}
// sort initial data
thrust::sort_by_key(&(buoy_data[pp][0]), &(buoy_data[pp][lidx]), &(buoy_idx[pp][0]));
//UPDATE CYCLE
gettimeofday(&t1, NULL);
for (int i = 0; i < N_UPDATES_TEST; i++){
unsigned int vec_to_update = rand()%N_BUOYS;
int nidx = lidx - h_buoy_data[vec_to_update].size();
create_vec(vec_to_update, h_buoy_data[vec_to_update], h_buoy_idx[vec_to_update]);
thrust::remove_if(&(buoy_data[pp][0]), &(buoy_data[pp][lidx]), buoy_idx[pp].begin(), equal_func(vec_to_update));
thrust::remove_if(&(buoy_idx[pp][0]), &(buoy_idx[pp][lidx]), equal_func(vec_to_update));
lidx = nidx + h_buoy_data[vec_to_update].size();
thrust::device_vector<float> temp_data = h_buoy_data[vec_to_update];
thrust::device_vector<int> temp_idx = h_buoy_idx[vec_to_update];
int ppn = (pp == 0)?1:0;
thrust::merge_by_key(&(buoy_data[pp][0]), &(buoy_data[pp][nidx]), temp_data.begin(), temp_data.end(), buoy_idx[pp].begin(), temp_idx.begin(), buoy_data[ppn].begin(), buoy_idx[ppn].begin() );
pp = ppn; // update ping-pong buffer index
prediction_kernel<<<1,1>>>(thrust::raw_pointer_cast(buoy_data[pp].data()), i, lidx);
}
gettimeofday(&t2, NULL);
unsigned int tdiff_us = ((t2.tv_sec*1000000)+t2.tv_usec) - ((t1.tv_sec*1000000)+t1.tv_usec);
printf("Completed %d updates in %f sec\n", N_UPDATES_TEST, (float)tdiff_us/(float)1000000);
// float *temps = (float *)malloc(N_UPDATES_TEST*sizeof(float));
// cudaMemcpyFromSymbol(temps, dev_result, N_UPDATES_TEST*sizeof(float));
// for (int i = 0; i < 100; i++) printf("temp %d: %f\n", i, temps[i]);
return 0;
}
Using CUDA 6, on Linux, on a Quadro 5000 GPU, 1000 "updates" require about 2 seconds. The majority of the time is spent in the calls to thrust::remove_if and thrust::merge_by_key. I suppose for worst-case real-time estimation, you would want to try to time the worst-case update, which might be something like receiving a longest-possible update.
I'm trying to learn CUDA, and the following code works OK for values N <= 16384 but fails for greater values (the summation check at the end of the code fails; c values are always 0 for index values i >= 16384).
#include <iostream>
#include "cuda_runtime.h"
#include "../cuda_be/book.h"

#define N (16384)

__global__ void add(int *a, int *b, int *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

int main()
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate mem on gpu
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, N * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, N * sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, N * sizeof(int)));

    for (int i = 0; i < N; i++)
    {
        a[i] = -i;
        b[i] = i * i;
    }

    HANDLE_ERROR(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice));
    system("PAUSE");

    add<<<128, 128>>>(dev_a, dev_b, dev_c);

    // copy the array 'c' back from the gpu to the cpu
    HANDLE_ERROR(cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost));
    system("PAUSE");

    bool success = true;
    for (int i = 0; i < N; i++)
    {
        if ((a[i] + b[i]) != c[i])
        {
            printf("Error in %d: %d + %d != %d\n", i, a[i], b[i], c[i]);
            system("PAUSE");
            success = false;
        }
    }
    if (success) printf("We did it!\n");

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
I think it's a shared-memory-related problem, but I can't come up with a good explanation (possible lack of knowledge). Could you provide an explanation and a workaround to run for values of N greater than 16384? Here are the specs for my GPU:
General Info for device 0
Name: GeForce 9600M GT
Compute capability: 1.1
Clock rate: 1250000
Device copy overlap : Enabled
Kernel Execution timeout : Enabled
Mem info for device 0
Total global mem: 536870912
Total const mem: 65536
Max mem pitch: 2147483647
Texture Alignment 256
MP info about device 0
Multiproccessor count: 4
Shared mem per mp: 16384
Registers per mp: 8192
Threads in warp: 32
Max threads per block: 512
Max thread dimensions: (512,512,64)
Max grid dimensions: (65535,65535,1)
You probably intended to write
while(tid<N)
not
if(tid<N)
You aren't running out of shared memory; your vector arrays are being copied into your device's global memory, which as you can see has far more space available than the 196608 bytes (16384*4*3) you need.
The reason for your problem is that you are only performing one addition operation per thread, so with this structure the maximum size your vectors can be is the product of the block and thread parameters in your kernel launch, as tera has pointed out. By correcting
if(tid<N)
to
while(tid<N)
in your code, each thread will perform its addition on multiple indexes and the whole array will be considered.
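For reference, a sketch of the corrected kernel (it is the question's kernel with only the if changed to a while, assuming the same #define N):
__global__ void add(int *a, int *b, int *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N)                        // grid-stride loop: keep going until the whole array is covered
    {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;     // advance by the total number of launched threads
    }
}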
For more information about the memory hierarchy and the various places memory can sit, you should read sections 2.3 and 5.3 of the CUDA_C_Programming_Guide.pdf provided with the CUDA toolkit.
Hope that helps.
If N is:
#define N (33 * 1024) //value defined in Cuda by Examples
I found the same code in CUDA by Example, but the value of N was different. I think the value of N can't be 33 * 1024 with these launch parameters; the number of blocks and the number of threads per block must be changed, because:
add<<<128,128>>>(dev_a,dev_b,dev_c); //16384 threads
(128 * 128) < (33 * 1024), so not all of the 33 * 1024 elements are ever processed.
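As an illustration (a sketch, not the book's code): if you keep the if(tid < N) guard, one fix is to round the grid size up so that blocks * threads covers N:
int threadsPerBlock = 128;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // 264 blocks when N = 33 * 1024
add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);
Alternatively, the while(tid < N) version shown above works with the original <<<128,128>>> launch, because each thread then processes several elements.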
I am running some image processing operations on the GPU and I need the histogram of the output.
I have written and tested the processing kernels. I have also tested the histogram kernel separately on samples of the output pictures. They both work fine, but when I put all of them in one loop I get nothing.
This is my histogram kernel:
__global__ void histogram(int n, uchar* color, uchar* mask, int* bucket, int ch, int W, int bin)
{
    unsigned int X = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int Y = blockIdx.y * blockDim.y + threadIdx.y;
    int l = (256 % bin == 0) ? 256 / bin : 256 / bin + 1;
    int c;
    if (X + Y * W < n && mask[X + Y * W])
    {
        c = color[(X + Y * W) * 3] / bin;
        atomicAdd(&bucket[c], 1);
        c = color[(X + Y * W) * 3 + 1] / bin;
        atomicAdd(&bucket[c + l], 1);
        c = color[(X + Y * W) * 3 + 2] / bin;
        atomicAdd(&bucket[c + l * 2], 1);
    }
}
It updates the histogram vectors for red, green, and blue (l is the length of each vector).
When I comment out the atomicAdds it produces the output again, but of course not the histogram.
Why don't they work together?
Edit:
This is the loop:
cudaMemcpy(frame_in_gpu,frame_in.data, W*H*3*sizeof(uchar),cudaMemcpyHostToDevice);
cuda_process(frame_in_gpu, frame_out_gpu, W, H, dimGrid,dimBlock);
cuda_histogram(W*H, frame_in_gpu, mask_gpu, hist, 3, W, bin, dimg_histogram, dimb_histogram);
Then I copy the output to host memory and write it to a video.
These are C functions that only call their kernels with the dimGrid and dimBlock that are given as inputs. Also:
dim3 dimBlock(32,32);
dim3 dimGrid(W/32,H/32);
dim3 dimb_Histogram(16,16);
dim3 dimg_Histogram(W/16,H/16);
I changed this for the histogram because it worked better. Does it matter?
Edit2:
I am using the -arch=sm_11 option for compilation; I just read about it somewhere. Could anyone tell me how I should choose it?
Perhaps you should try to compile without the -arch=sm_11 flag.
SM 1.1 is the first architecture that supported atomic operations on global memory, while your GPU supports SM 2.0. Hence there is no reason to compile for SM 1.1 unless you need backward compatibility.
One possible issue could be that SM 1.1 does not support atomic operations on 64-bit ints in global memory. So I would suggest you recompile the code without the -arch option, or use -arch=sm_20 if you like.
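For example (the file names here are just placeholders):
nvcc -arch=sm_20 -o histo main.cu
or simply drop the -arch flag and let the toolkit use its default target.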
I've always worked with linear shared memory (load, store, access neighbours), but I've made a simple 2D test to study bank conflicts, and the results have confused me.
The following code reads data from a one-dimensional global memory array into shared memory and copies it back from shared memory to global memory.
__global__ void update(int* gIn, int* gOut, int w) {
    // shared memory space
    __shared__ int shData[16][16];
    // map from threadIdx/BlockIdx to data position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // calculate the global id into the one dimensional array
    int gid = x + y * w;

    // load shared memory
    shData[threadIdx.x][threadIdx.y] = gIn[gid];
    // synchronize threads not really needed but keep it for convenience
    __syncthreads();
    // write data back to global memory
    gOut[gid] = shData[threadIdx.x][threadIdx.y];
}
The Visual Profiler reported bank conflicts in shared memory. The following code avoids those conflicts (only the differences are shown):
// load shared memory
shData[threadIdx.y][threadIdx.x] = gIn[gid];
// write data back to global memory
gOut[gid] = shData[threadIdx.y][threadIdx.x];
This behavior has confused me, because in Programming Massively Parallel Processors: A Hands-on Approach we can read:
matrix elements in C and CUDA are placed into the linearly addressed locations according to the row major convention. That is, the elements of row 0 of a matrix are first placed in order into consecutive locations.
Is this related to the shared memory arrangement, or to the thread indexes? Am I missing something?
The kernel configuration is as follow:
// kernel configuration
dim3 dimBlock = dim3 ( 16, 16, 1 );
dim3 dimGrid = dim3 ( 64, 64 );
// Launching a grid of 64x64 blocks with 16x16 threads -> 1048576 threads
update<<<dimGrid, dimBlock>>>(d_input, d_output, 1024);
Thanks in advance.
Yes, shared memory is arranged in row-major order as you expected. So your [16][16] array is stored row-wise, something like this:
bank0 .... bank15
row 0 [ 0 .... 15 ]
1 [ 16 .... 31 ]
2 [ 32 .... 47 ]
3 [ 48 .... 63 ]
4 [ 64 .... 79 ]
5 [ 80 .... 95 ]
6 [ 96 .... 111 ]
7 [ 112 .... 127 ]
8 [ 128 .... 143 ]
9 [ 144 .... 159 ]
10 [ 160 .... 175 ]
11 [ 176 .... 191 ]
12 [ 192 .... 207 ]
13 [ 208 .... 223 ]
14 [ 224 .... 239 ]
15 [ 240 .... 255 ]
col 0 .... col 15
Because there are 16 32-bit shared memory banks on pre-Fermi hardware, every integer entry in each column maps onto one shared memory bank. So how does that interact with your choice of indexing scheme?
The thing to keep in mind is that threads within a block are numbered in the equivalent of column major order (technically the x dimension of the structure is the fastest varying, followed by y, followed by z). So when you use this indexing scheme:
shData[threadIdx.x][threadIdx.y]
threads within a half-warp will be reading from the same column, which implies reading from the same shared memory bank, and bank conflicts will occur. When you use the opposite scheme:
shData[threadIdx.y][threadIdx.x]
threads within the same half-warp will be reading from the same row, which implies reading from each of the 16 different shared memory banks, so no conflicts occur.
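To make that concrete, here is a small host-side sketch (assuming the pre-Fermi mapping described above: 16 banks, each one 4-byte word wide) that prints which bank each thread of a half-warp touches under the two indexing schemes:
#include <stdio.h>

int main(void)
{
    const int BANKS = 16;                  // pre-Fermi: 16 banks, each 32 bits wide
    for (int t = 0; t < 16; t++)           // a half-warp: threadIdx.x = t, threadIdx.y fixed at 0
    {
        int word_xy = t * 16 + 0;          // word index for shData[threadIdx.x][threadIdx.y]
        int word_yx = 0 * 16 + t;          // word index for shData[threadIdx.y][threadIdx.x]
        printf("thread %2d: [x][y] -> bank %2d, [y][x] -> bank %2d\n",
               t, word_xy % BANKS, word_yx % BANKS);
    }
    return 0;
}
Every thread of the half-warp lands in bank 0 with the first scheme (a 16-way conflict), while the second scheme spreads the half-warp across all 16 banks, which matches what the profiler reported.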