CUDA: Shared memory allocation with overlapping borders - cuda

Is there an easy way (google hasn't delivered...) to allocate per-block shared memory regions from a single input array such that there can be an overlap?
The simple example is string searching: say I want to dice up the input text and have each thread in each block search for a pattern starting from text[thread_id], but I want the data assigned to each block to overlap by the pattern length, so that matches that fall across a border are still found.
I.e. the total memory size allocated to shared memory on each block is
(blocksize+patternlength)*sizeof(char)
I'm probably missing something simple and am currently diving through the CUDA guide, but would appreciate some guidance.
UPDATE: I suspect some people have misunderstood my question (or I mis-explained it).
Say I have a dataset QWERTYUIOP, and I want to search for a 3-character match, and I dice up the dataset (arbitrarily) into 4's for each thread block: QWER TYUI OPxx
This is simple enough to accomplish, but the algorithm fails if the 3-character pattern being searched for is actually IOP.
In this case, what I want is for each block to have in shared memory:
QWERTY TYUIOP OPxxxx
I.e. each block gets assigned blocksize+patternlength-1 characters, so no memory border issues occur.
Hope that explains things better.
Since @jmilloy is being persistent... :P
//VERSION 1: Simple
__global__ void gpuSearchSimple(char *T, int lenT, char *P, int lenP, int *pFound)
{
int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (T[startIndex+i] != P[i]) fMatch = 0;
}
if (fMatch) atomicMin(pFound, startIndex);
}
//VERSION 2: Texture
__global__ void gpuSearchTexture(int lenT, int lenP, int *pFound)
{
int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (tex1Dfetch(texT,startIndex+i) != tex1Dfetch(texP,i)) fMatch = 0;
}
if (fMatch) atomicMin(pFound, startIndex);
}
//Version 3: Shared
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
extern __shared__ char shaP[];
for (int i=0;threadIdx.x+i<lenP; i+=blockDim.x){
shaP[threadIdx.x+i]= tex1Dfetch(texP,threadIdx.x+i);
}
__syncthreads();
//At this point shaP is populated with the pattern
int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
// only continue if an earlier instance hasn't already been found
int fMatch = 1;
for (int i=0; i < lenP; i++)
{
if (tex1Dfetch(texT,startIndex+i) != shaP[i]) fMatch = 0;
}
if (fMatch) atomicMin(pFound, startIndex);
}
What I would like to have done is to put the text into shared memory chunks, as described in the rest of the question, instead of keeping the text in texture memory for the later versions.

I am not sure that the question makes all that much sense. You can dynamically size a shared memory allocation at runtime like this:
__global__ void kernel()
{
extern __shared__ int buffer[];
....
}
kernel<<< gridsize, blocksize, buffersize >>>();
but the contents of the buffer are undefined at the beginning of the kernel. You will have to devise a scheme in the kernel to load from global memory with the overlap that you want to ensure that your pattern matching will work as you want it to.
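One possible loading scheme is sketched below. It makes a few assumptions that are not in the question: the text and pattern sit in plain global memory rather than in textures, the names searchTiled and findFirst are made up for illustration, and the result slot is pre-set to lenT to mean "not found". Each block stages blockDim.x + lenP - 1 characters, so a match that straddles a block border is still seen by the block that owns its first character.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical kernel: stage the pattern plus this block's tile of text
// (blockDim.x characters and a halo of lenP - 1) in dynamic shared memory,
// then let every thread test one start position.
__global__ void searchTiled(const char *T, int lenT,
                            const char *P, int lenP, int *pFound)
{
    extern __shared__ char smem[];
    char *shaP = smem;            // lenP characters
    char *shaT = smem + lenP;     // blockDim.x + lenP - 1 characters

    int blockStart = blockIdx.x * blockDim.x;
    int tileLen = blockDim.x + lenP - 1;

    // Strided loops, so the copies work for any pattern length and block size.
    for (int i = threadIdx.x; i < lenP; i += blockDim.x)
        shaP[i] = P[i];
    for (int i = threadIdx.x; i < tileLen; i += blockDim.x) {
        int g = blockStart + i;
        shaT[i] = (g < lenT) ? T[g] : '\0';   // pad past the end of the text
    }
    __syncthreads();

    int start = blockStart + threadIdx.x;
    if (start + lenP > lenT) return;          // a match here would run off the text

    bool match = true;
    for (int i = 0; i < lenP && match; i++)
        match = (shaT[threadIdx.x + i] == shaP[i]);
    if (match) atomicMin(pFound, start);
}

// Hypothetical host helper: returns the index of the first match, or -1.
int findFirst(const char *text, const char *pattern, int blockSize = 4)
{
    int lenT = (int)strlen(text), lenP = (int)strlen(pattern);
    if (lenP == 0 || lenP > lenT) return -1;

    char *dT, *dP;
    int *dFound, found = lenT;                // lenT stands for "not found"
    cudaMalloc(&dT, lenT);
    cudaMalloc(&dP, lenP);
    cudaMalloc(&dFound, sizeof(int));
    cudaMemcpy(dT, text, lenT, cudaMemcpyHostToDevice);
    cudaMemcpy(dP, pattern, lenP, cudaMemcpyHostToDevice);
    cudaMemcpy(dFound, &found, sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (lenT + blockSize - 1) / blockSize;
    size_t smemBytes = lenP + (blockSize + lenP - 1);   // pattern + tile with halo
    searchTiled<<<blocks, blockSize, smemBytes>>>(dT, lenT, dP, lenP, dFound);

    cudaMemcpy(&found, dFound, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dT); cudaFree(dP); cudaFree(dFound);
    return found == lenT ? -1 : found;
}
```

The default block size of 4 only mirrors the QWERTYUIOP example; in practice you would pick a multiple of the warp size. Error checking on the CUDA calls is left out to keep the sketch short.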

No. Shared memory is shared between threads in a block, and is ONLY accessible to the block it is assigned to. You cannot have shared memory that is available to two different blocks.
As far as I know, shared memory actually resides on the multiprocessors, and a thread can only access the shared memory from the multiprocessor that it is running on. So this is a physical limitation. (I guess if two blocks reside on one mp, a thread from one block may be able to unpredictably access the shared memory that was allocated to the other block).
Remember that you need to explicitly copy the data from global memory to shared memory. It is a simple matter to copy overlapping regions of the string to non-overlapping shared memory.
I think getting your data where you need it is the majority of the work required in developing CUDA programs. My guidance is that you start with a version that solves the problem without using any shared memory first. In order for that to work, you will solve your overlapping problem and the shared memory implementation will be easy!
edit 2
after answer was marked as correct
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
    //one dynamic shared buffer, split by hand into pattern and text tile
    extern __shared__ char shared[];
    char* shaP = &shared[0];
    char* shaT = &shared[lenP];
    //copy pattern into shaP in parallel (assumes lenP <= blockDim.x)
    if(threadIdx.x < lenP)
        shaP[threadIdx.x] = tex1Dfetch(texP,threadIdx.x);
    //determine texT start and length for this block
    int blockStartIndex = blockIdx.x * blockDim.x;
    int lenS = blockDim.x + lenP - 1; //characters of text staged per block
    //copy text into shaT in parallel
    shaT[threadIdx.x] = tex1Dfetch(texT,blockStartIndex + threadIdx.x);
    //the first lenP-1 threads also fetch the overlap beyond the block
    if(blockDim.x + threadIdx.x < lenS)
        shaT[blockDim.x + threadIdx.x] = tex1Dfetch(texT,blockStartIndex + blockDim.x + threadIdx.x);
    __syncthreads();
    //We have one copy of the pattern in shaP, shared by every thread in the block
    //We have the necessary portion of the text (with overlaps) in shaT
    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (shaT[threadIdx.x+i] != shaP[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, blockStartIndex + threadIdx.x);
}
key notes:
we only need one copy of the pattern in shared memory per block - they can all use it
shared memory needed per block is lenP + lenS (where lenS is your blocksize + patternlength - 1)
the kernel assumes that gridDim.x * blockDim.x = lenT (the same as version 1)
we can copy into shared memory in parallel (no need for for loops if you have enough threads)

Overlapping shared memory is not a good idea: the threads would have to synchronize each time they want to access the same address in shared memory (although on architectures >= 2.0 this has been mitigated).
The simplest idea that comes to my mind is to duplicate the portion of the text that you want to be overlapped.
Instead of reading from the global memory in exact chunks:
AAAA BBBB CCCC DDDD EEEE
Read with overlapping:
AAAA BBBB CCCC CCCC DDDD EEEEE

Related

Implementing Max Reduce in Cuda

I've been learning Cuda and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel
__global__ void max_reduce(const float* const d_array,
float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (gid == 0)
*d_max = shared[tid];
}
I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.
To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same in the kernel but only the min reduce matches up.
Is there something obvious I'm missing/doing wrong?
Your main conclusion in your deleted answer was correct: the kernel you have posted doesn't comprehend the fact that at the end of that kernel execution, you have done a good deal of the overall reduction, but the results are not quite complete. The results of each block must be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLT_MAX; // 1 (FLT_MAX comes from <cfloat>)
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]); // 2
__syncthreads();
}
// what to do now?
// option 1: save block result and launch another kernel
if (tid == 0)
d_max[blockIdx.x] = shared[tid]; // 3
// option 2: use atomics
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
1. As Pavan indicated, you need to initialize your shared memory array. The last block launched may not be a "full" block, if gridDim.x*blockDim.x is greater than elements.
2. Note that in this line, even though we are checking that the thread operating (gid) is less than elements, when we add s to tid for indexing into the shared memory we can still index outside of the legitimate values copied into shared memory, in the last block. Therefore we need the shared memory initialization indicated in note 1.
3. As you already discovered, your last line was not correct. Each block produces its own result, and we must combine them somehow. One method you might consider if the number of blocks launched is small (more on this later) is to use atomics. Normally we steer people away from using atomics since they are "costly" in terms of execution time. However, the other option we are faced with is saving the block result in global memory, finishing the kernel, and then possibly launching another kernel to combine the individual block results. If I have launched a large number of blocks initially (say more than 1024) then if I follow this methodology I might end up launching two additional kernels. Thus the consideration of atomics. As indicated, there is no native atomicMax function for floats, but as indicated in the documentation, you can use atomicCAS to generate any arbitrary atomic function, and I have provided an example of that in atomicMaxf which provides an atomic max for float.
But is running 1024 or more atomic functions (one per block) the best way? Probably not.
When launching kernels of threadblocks, we really only need to launch enough threadblocks to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit from a machine utilization standpoint to launch thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this, it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLT_MAX; // FLT_MAX comes from <cfloat>
while (gid < elements) {
shared[tid] = max(shared[tid], d_array[gid]);
gid += gridDim.x*blockDim.x;
}
__syncthreads();
gid = (blockDim.x * blockIdx.x) + tid; // 1
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements). Instead we are launching a fixed number of blocks (say, 128; you can modify this number to find out what runs fastest), and letting each threadblock (and thus the entire grid) loop through memory, computing partial max operations on each element in shared memory. Then, in the line marked with comment 1, we must reset the gid variable to its initial value. This is actually unnecessary, and the block reduction loop code can be further simplified, if we guarantee that the size of the grid (gridDim.x*blockDim.x) is less than elements, which is not difficult to do at kernel launch.
Note that when using this atomic method, it's necessary to initialize the result (*d_max in this case) to an appropriate value, like -FLT_MAX.
Again, we normally steer people away from atomic usage, but in this case, it's worth considering if we carefully manage it, and it allows us to save the overhead of an additional kernel launch.
For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper which is available with the relevant CUDA sample.
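To make the launch side of the fixed-grid approach concrete, here is a sketch of how the host code might look. The helper name gpuMax, the choice of 256 threads and the 8-blocks-per-SM figure (taken from the rule of thumb above) are assumptions for illustration, and the kernel is a lightly restated version of the one above that accumulates in a register before the shared-memory reduction.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

__device__ float atomicMaxf(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    while (val > __int_as_float(old)) {
        assumed = old;
        old = atomicCAS(address_as_int, assumed, __float_as_int(val));
    }
    return __int_as_float(old);
}

__global__ void max_reduce(const float *d_array, float *d_max, size_t elements)
{
    extern __shared__ float shared[];
    int tid = threadIdx.x;

    // Grid-stride loop: each thread folds many elements into a private max.
    float best = -FLT_MAX;
    for (size_t gid = blockDim.x * blockIdx.x + tid; gid < elements;
         gid += (size_t)gridDim.x * blockDim.x)
        best = fmaxf(best, d_array[gid]);
    shared[tid] = best;
    __syncthreads();

    // Block reduction; blockDim.x is assumed to be a power of two.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shared[tid] = fmaxf(shared[tid], shared[tid + s]);
        __syncthreads();
    }
    if (tid == 0) atomicMaxf(d_max, shared[0]);   // one atomic per block
}

// Hypothetical host helper: copies the data over and returns its maximum.
float gpuMax(const float *h_array, size_t elements)
{
    const int threads = 256;
    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    int blocks = (numSMs > 0) ? 8 * numSMs : 1;   // enough to keep the GPU busy

    float *d_array, *d_max, result = -FLT_MAX;
    cudaMalloc(&d_array, elements * sizeof(float));
    cudaMalloc(&d_max, sizeof(float));
    cudaMemcpy(d_array, h_array, elements * sizeof(float), cudaMemcpyHostToDevice);
    // The atomic version needs the result initialised before the launch.
    cudaMemcpy(d_max, &result, sizeof(float), cudaMemcpyHostToDevice);

    max_reduce<<<blocks, threads, threads * sizeof(float)>>>(d_array, d_max, elements);

    cudaMemcpy(&result, d_max, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_array); cudaFree(d_max);
    return result;
}
```

The third launch parameter sizes the extern __shared__ array, one float per thread. Error checking is omitted for brevity.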
Here's one that appears naive but isn't. This won't generalize to other functions like sum(), but it works great for min() and max().
__device__ const float float_min = -3.402e+38;
__global__ void maxKernel(float* d_data)
{
// compute max over all threads, store max in d_data[0]
int i = threadIdx.x;
__shared__ float max_value;
if (i == 0) max_value = float_min;
float v = d_data[i];
__syncthreads();
while (max_value < v) max_value = v;
__syncthreads();
if (i == 0) d_data[0] = max_value;
}
Yup, that's right, only syncing once after initialization and once before writing the result. Damn the race conditions! Full speed ahead!
Before you tell me it won't work, please give it a try first. I have tested thoroughly and it works every time on a variety of arbitrary kernel sizes. It turns out that the race condition doesn't matter in this case because the while loop resolves it.
It works significantly faster than a conventional reduction. Another surprise is that the average number of passes for a kernel size of 32 is 4. Yup, that's (log(n)-1), which seems counterintuitive. It's because the race condition gives an opportunity for good luck. This bonus comes in addition to removing the overhead of the conventional reduction.
With larger n, there is no way to avoid at least one iteration per warp, but that iteration only involves one compare operation which is usually immediately false across the warp when max_value is on the high end of the distribution. You could modify it to use multiple SM's, but that would greatly increase the total workload and add a communication cost, so not likely to help.
For terseness I've omitted the size and output arguments. Size is simply the number of threads (which could be 137 or whatever you like). Output is returned in d_data[0].
I've uploaded the working file here: https://github.com/kenseehart/YAMR

CUDA: writing to shared memory increases kernel execution time a lot

I am trying to reduce a 65536-element array (calculate the sum of its elements) with the help of CUDA. The kernel looks like the following (please ignore the *dev_distanceFloats and index arguments for now):
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
}
and it is launched as one block with 256 threads:
kernel_calcSum <<<1, 256 >>>(dev_spFloats1, dev_distanceFloats, index);
So far, so good: each of the 256 threads takes 256 elements from global memory and calculates their sum in the local variable mySum. Kernel execution time is about 45 milliseconds.
The next step is to introduce shared memory among those 256 threads in the block (to calculate the sum of the mySum values), so the kernel becomes:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
__shared__ float sdata[256];
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
sdata[tid] = mySum;
}
I just added a write to shared memory, but execution time increases from 45 milliseconds to 258 milliseconds (I am checking this with the NVIDIA Visual Profiler 5.0.0).
I realize that there are 8 bank conflicts for each thread when writing to the sdata variable (I am on a GTX670, which has compute capability 3.0 with 32 banks). As an experiment, I tried reducing the number of threads to 32 when launching the kernel, but the time is still 258 milliseconds.
Question 1: why does writing to shared memory take so long in my case?
Question 2: is there any tool which shows in detail a kind of "execution plan" (timings for memory access, conflicts, etc.)?
Thanks for your suggestions.
Update:
Playing with the kernel, I set sdata to some constant for each thread:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
__shared__ float sdata[256];
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
sdata[tid] = 111;
}
and timings are back to 48 millisec.
So, changing
sdata[tid] = mySum;
to
sdata[tid] = 111;
made the difference.
Is this a compiler optimization (maybe it just removed this line?), or does copying from local memory (a register?) to shared memory take long for some reason?
Both of your kernels do not do anything, because they do not write out results to memory that would still be accessible after the kernel finishes.
In the first case, the compiler is clever enough to notice this and optimize away the whole calculation.
In the second case where shared memory is involved, the compiler does not notice this as the flow of information through shared memory would be harder to track. It thus leaves the calculation in.
Pass in a pointer to global memory (as you already do) and write out results via this pointer.
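As a rough sketch of what that could look like: the kernel below finishes the sum in shared memory and stores it through dev_distanceFloats, so the compiler can no longer discard the work. Two things here are assumptions rather than part of the question: the host helper sum65536 is invented for illustration, and the index d[tid + e * 256] is used so that every thread covers a distinct set of 256 elements (the posted d[tid + e] makes neighbouring threads re-read overlapping elements).

```cuda
#include <cuda_runtime.h>

// One block of 256 threads sums 256 * 256 = 65536 floats.
__global__ void kernel_calcSum(const float *d, float *dev_distanceFloats, int index)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;

    float mySum = 0.0f;
    for (int e = 0; e < 256; e++)
        mySum += d[tid + e * 256];    // assumed layout: stride of 256 between reads
    sdata[tid] = mySum;
    __syncthreads();

    // Tree reduction of the 256 per-thread sums.
    for (int s = 128; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // The write to global memory is the observable result of the kernel.
    if (tid == 0) dev_distanceFloats[index] = sdata[0];
}

// Hypothetical host helper: sums exactly 65536 floats on the GPU.
float sum65536(const float *h_data)
{
    const int n = 256 * 256;
    float *d_data, *d_out, result = 0.0f;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    kernel_calcSum<<<1, 256>>>(d_data, d_out, 0);

    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_out);
    return result;
}
```

With the result written out, timings of the variants become comparable, because none of them can be optimized away.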
Shared memory is not the right thing for this. What you need are warp atomic operations, to sum up within a warp, then communicate the intermediary results between warps. There's example code demonstrating this shipping with CUDA.
Summing up elements is one of those tasks where massive parallelization won't help much, and the GPU can in fact be outperformed by a CPU.

Implementing Neural Network using CUDA

I am trying to create a Neural Network using CUDA:
My kernel looks like :
__global__ void feedForward(float *input, float *output, float **weight) {
//Here the threadId uniquely identifies weight in a neuron
int weightIndex = threadIdx.x;
//Here the blockId uniquely identifies a neuron
int neuronIndex = blockIdx.x;
if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex]
* input[weightIndex];
}
While copying the output back to host, I'm getting an error
Error unspecified launch failure at line xx
At line xx :
CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));
Am I doing something wrong here?
Is it because of how I'm using both the block index as well as thread index to reference the weight matrix.
Or does the problem lie elsewhere ?
I'm allocating the weight matrix as follows:
cudaMallocPitch((void**)&d_Weight, &pitch_W,input_size,NO_OF_NEURONS);
My kernel call is:
feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);
After that I call:
cudaThreadSynchronize();
I am new to programming with CUDA.
Any help would be appreciated.
Thanks
There is a problem in the code that computes the output. Though it won't produce the error described, it will produce incorrect results.
int neuronIndex = blockIdx.x;
if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
We can see that all threads in a single block are writing concurrently into one memory cell, so undefined results are to be expected. To avoid this, I suggest reducing all values within a block in shared memory and performing a single write to global memory. Something like this:
__global__ void feedForward(float *input, float *output, float **weight) {
    int weightIndex = threadIdx.x;
    int neuronIndex = blockIdx.x;
    __shared__ float out_reduce[NO_OF_WEIGHTS];
    out_reduce[weightIndex] =
        (weightIndex<NO_OF_WEIGHTS && neuronIndex<NO_OF_NEURONS) ?
        weight[neuronIndex][weightIndex] * input[weightIndex]
        : 0.0f;
    __syncthreads();
    // tree reduction; assumes NO_OF_WEIGHTS is a power of two.
    // Start at half the array size so that weightIndex + s stays in bounds.
    for (int s = NO_OF_WEIGHTS / 2; s > 0 ; s >>= 1)
    {
        if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
        __syncthreads();
    }
    if (weightIndex == 0) output[neuronIndex] += out_reduce[weightIndex];
}
It turned out that I had to rewrite half of your small kernel to help with the reduction code...
I build a very simple MLP network using CUDA. You can find my code over here if it may interest you: https://github.com/PirosB3/CudaNeuralNetworks/
For any questions, just shoot!
Daniel
You're using cudaMallocPitch, but don't show how the variables are initialized; I'd be willing to bet this is where your error stems from. cudaMallocPitch is rather tricky; the 3rd parameter should be in bytes, while the 4th parameter is not. i.e.
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
Is your variable input_size in bytes? If not, then you might be allocating too little memory (i.e. you'll think you're requesting 64 elements, but instead you'll be getting 64 bytes), and as such you'll be accessing memory out of range in your kernel. In my experience, an "unspecified launch failure" error usually means I have a segfault.
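For reference, here is a sketch of how a pitched weight matrix might be allocated, filled and indexed. It is not the asker's code: the names rowDot and feedForwardHost are invented, and for simplicity it uses one thread per neuron instead of one block per neuron. The point it illustrates is that cudaMallocPitch returns a single flat allocation, so the kernel takes a plain float* plus the pitch in bytes and computes each row's address itself, rather than dereferencing a float**.

```cuda
#include <cuda_runtime.h>

// weight is one pitched allocation, not an array of row pointers,
// so each row is located with the pitch (which is in bytes).
__global__ void rowDot(const float *input, float *output,
                       const float *weight, size_t pitch,
                       int numNeurons, int numWeights)
{
    int neuron = blockIdx.x * blockDim.x + threadIdx.x;
    if (neuron >= numNeurons) return;

    const float *row = (const float *)((const char *)weight + neuron * pitch);
    float acc = 0.0f;
    for (int w = 0; w < numWeights; w++)
        acc += row[w] * input[w];
    output[neuron] = acc;
}

// Hypothetical host helper: h_weight is a dense row-major
// numNeurons x numWeights matrix on the host.
void feedForwardHost(const float *h_weight, const float *h_input,
                     float *h_output, int numNeurons, int numWeights)
{
    float *d_weight, *d_input, *d_output;
    size_t pitch;
    size_t rowBytes = numWeights * sizeof(float);   // width is given in BYTES

    cudaMallocPitch((void **)&d_weight, &pitch, rowBytes, numNeurons);
    cudaMalloc(&d_input, rowBytes);
    cudaMalloc(&d_output, numNeurons * sizeof(float));

    // cudaMemcpy2D handles the different row strides on host and device.
    cudaMemcpy2D(d_weight, pitch, h_weight, rowBytes,
                 rowBytes, numNeurons, cudaMemcpyHostToDevice);
    cudaMemcpy(d_input, h_input, rowBytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (numNeurons + threads - 1) / threads;
    rowDot<<<blocks, threads>>>(d_input, d_output, d_weight, pitch,
                                numNeurons, numWeights);

    cudaMemcpy(h_output, d_output, numNeurons * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_weight); cudaFree(d_input); cudaFree(d_output);
}
```

Checking the return value of every call (the question already has a CUDA_CHECK_RETURN macro) would show which call actually fails.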

CUDA Dot Product

I'm trying to implement the classic dot-product kernel for double-precision arrays, with atomic computation of the final sum across the various blocks. I used the atomicAdd for double precision as given on page 116 of the programming guide. I'm probably doing something wrong. The partial sums across the threads in every block are computed correctly, but afterwards the atomic operation doesn't seem to be working properly, since every time I run my kernel with the same data I receive different results. I'll be grateful if somebody could spot the mistake or provide an alternative solution!
Here is my kernel:
__global__ void cuda_dot_kernel(int *n,double *a, double *b, double *dot_res)
{
__shared__ double cache[threadsPerBlock]; //thread shared memory
int global_tid=threadIdx.x + blockIdx.x * blockDim.x;
int i=0,cacheIndex=0;
double temp = 0;
cacheIndex = threadIdx.x;
while (global_tid < (*n)) {
temp += a[global_tid] * b[global_tid];
global_tid += blockDim.x * gridDim.x;
}
cache[cacheIndex] = temp;
__syncthreads();
for (i=blockDim.x/2; i>0; i>>=1) {
if (threadIdx.x < i) {
cache[threadIdx.x] += cache[threadIdx.x + i];
}
__syncthreads();
}
__syncthreads();
if (cacheIndex==0) {
*dot_res=cuda_atomicAdd(dot_res,cache[0]);
}
}
And here is my device function atomicAdd:
__device__ double cuda_atomicAdd(double *address, double val)
{
double assumed,old=*address;
do {
assumed=old;
old= __longlong_as_double(atomicCAS((unsigned long long int*)address,
__double_as_longlong(assumed),
__double_as_longlong(val+assumed)));
}while (assumed!=old);
return old;
}
Getting a reduction right using ad hoc CUDA code can be tricky, so here's an alternative solution using a Thrust algorithm, which is included with the CUDA Toolkit:
#include <thrust/inner_product.h>
#include <thrust/device_ptr.h>
double do_dot_product(int n, double *a, double *b)
{
// wrap raw pointers to device memory with device_ptr
thrust::device_ptr<double> d_a(a), d_b(b);
// inner_product implements a mathematical dot product
return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}
You are using the cuda_atomicAdd function incorrectly. This section of your kernel:
if (cacheIndex==0) {
*dot_res=cuda_atomicAdd(dot_res,cache[0]);
}
is the culprit. Here, you atomically add to dot_res, then non-atomically set dot_res with the result it returns. The return value of this function is the previous value of the location being atomically updated, and it is supplied for "information" or local use by the caller only. You don't assign it to the location you are atomically updating; that completely defeats the purpose of using atomic memory access in the first place. Do something like this instead:
if (cacheIndex==0) {
double result=cuda_atomicAdd(dot_res,cache[0]);
}
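Putting that fix in context, a complete version might look like the sketch below. Several details are assumptions and not from the question: n is passed by value, the host helper dotProduct is invented, the cap of 32 blocks is arbitrary, and the CAS loop compares the raw 64-bit patterns so that a NaN cannot keep it spinning. The accumulator is zeroed before the launch, since every block adds into it.

```cuda
#include <cuda_runtime.h>

const int threadsPerBlock = 256;   // must be a power of two for the reduction

__device__ double cuda_atomicAdd(double *address, double val)
{
    unsigned long long int *addr = (unsigned long long int *)address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);       // integer compare of the bit patterns
    return __longlong_as_double(old);
}

__global__ void cuda_dot_kernel(int n, const double *a, const double *b,
                                double *dot_res)
{
    __shared__ double cache[threadsPerBlock];
    int global_tid = threadIdx.x + blockIdx.x * blockDim.x;

    double temp = 0.0;
    while (global_tid < n) {        // grid-stride loop over the inputs
        temp += a[global_tid] * b[global_tid];
        global_tid += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = temp;
    __syncthreads();

    for (int i = blockDim.x / 2; i > 0; i >>= 1) {
        if (threadIdx.x < i) cache[threadIdx.x] += cache[threadIdx.x + i];
        __syncthreads();
    }

    // Add this block's partial sum; the return value is deliberately ignored.
    if (threadIdx.x == 0) cuda_atomicAdd(dot_res, cache[0]);
}

// Hypothetical host helper: returns the dot product of two host arrays.
double dotProduct(const double *h_a, const double *h_b, int n)
{
    double *d_a, *d_b, *d_res, result = 0.0;
    size_t bytes = n * sizeof(double);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_res, sizeof(double));
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_res, 0, sizeof(double));   // all-zero bits are 0.0

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (blocks > 32) blocks = 32;           // arbitrary cap; the loop covers the rest
    cuda_dot_kernel<<<blocks, threadsPerBlock>>>(n, d_a, d_b, d_res);

    cudaMemcpy(&result, d_res, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_res);
    return result;
}
```

Because the order in which blocks add their partial sums is not fixed, results on general floating-point data can still differ in the last bits from run to run; that is rounding, not a race.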
I did not check your code in that much depth, but here is some advice.
I would only advise using Thrust if you only use your GPU for such generic tasks; if a more complex problem arises, people who have relied on it have no idea how to program the GPU efficiently in parallel.
Start a new parallel reduction kernel to sum up the dot product.
Since the data is already on the device, you won't see a decrease in performance from starting a new kernel.
Your kernel does not seem to scale across the maximum number of possible blocks on the newest GPUs. If it did, and your kernel had to calculate the dot product of millions of values, performance would decrease dramatically because of the serialized atomic operation.
Beginner mistake: are your input data and shared memory accesses range-checked? Or are you sure the input data length is always a multiple of your block size? Otherwise you will read garbage. Most of my wrong results were due to this fault.
Optimise your parallel reduction; see my thesis or Mark Harris's optimisation slides.
Untested, I just wrote it down in Notepad:
/*
* @param inCount_s unsigned long long int Length of both input arrays
* @param inValuesA_g double* First value array
* @param inValuesB_g double* Second value array
* @param outDots_g double* Output dots of each block, length equals the number of blocks
*/
__global__ void dotProduct(const unsigned long long int inCount_s,
const double* inValuesA_g,
const double* inValuesB_g,
double* outDots_g)
{
//get unique block index in a possible 3D Grid
const unsigned long long int blockId = blockIdx.x //1D
+ blockIdx.y * gridDim.x //2D
+ gridDim.x * gridDim.y * blockIdx.z; //3D
//block dimension uses only x-coordinate
const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;
/*
* shared value pair products array, where BLOCK_SIZE is a power of 2
*
* To improve performance, increase its size by a multiple of BLOCK_SIZE, so that each thread loads more than 1 element!
* (outDots_g length decreases by the same factor, and you need to range check and initialize memory)
* -> see Harris's GPU optimisation / parallel reduction slides for more information.
*/
__shared__ double dots_s[BLOCK_SIZE];
/*
* initialize shared memory array and calculate dot product of two values,
* shared memory always needs to be initialized, it's never 0 by default, else garbage is read later!
*/
if(tId < inCount_s)
dots_s[threadIdx.x] = inValuesA_g[tId] * inValuesB_g[tId];
else
dots_s[threadIdx.x] = 0;
__syncthreads();
//do parallel reduction on shared memory array to sum up values
reductionAdd(dots_s, dots_s[0]); //see my thesis link
//output value: one partial dot product per block
if(threadIdx.x == 0)
outDots_g[blockId] = dots_s[0];
//start new parallel reduction kernel to sum up outDots_g!
}
Edit: removed unnecessary points.

cuda shared memory overwrite?

I am trying to write a parallel prefix scan on cuda by following this tutorial -
I am trying the work-inefficient 'double buffered one' as explained in the tutorial.
This is what I have:
// double buffered naive.
// d = number of iterations, N - size, and input.
__global__ void prefixsum(int* in, int d, int N)
{
//get the block index
int idx = blockIdx.x*blockDim.x + threadIdx.x;
// allocate shared memory
extern __shared__ int temp_in[], temp_out[];
// copy data to it.
temp_in[idx] = in[idx];
temp_out[idx] = 0;
// block until all threads copy
__syncthreads();
int i = 1;
for (i; i<=d; i++)
{
if (idx < N+1 && idx >= (int)pow(2.0f,(float)i-1))
{
// copy new result to temp_out
temp_out[idx] += temp_in[idx - (int)pow(2.0f,(float)i-1)] + temp_in[idx];
}
else
{
// if the element is to remain unchanged, copy the same thing
temp_out[idx] = temp_in[idx];
}
// block until all theads do this
__syncthreads();
// copy the result to temp_in for next iteration
temp_in[idx] = temp_out[idx];
// wait for all threads to do so
__syncthreads();
}
//finally copy everything back to global memory
in[idx] = temp_in[idx];
}
Can you point out what's wrong with this? I have written comments for what I think should happen.
This is the kernel invocation -
prefixsum<<<dimGrid,dimBlock>>>(d_arr, log(SIZE)/log(2), N);
This is the grid and block allocations:
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
The problem is that I don't get the correct output for any input that's more than 8 elements long.
I see two problems in your code
Problem 1: extern shared memory
Agh.... I hate extern __shared__ memory. The problem is that the compiler does not know how big the arrays are. As a result, they both point to the same piece of memory!
So, in your case: temp_in[5] and temp_out[5] refer to the same word in shared memory.
If you really want the extern __shared__ memory, you can manually offset the second array, for example something like this:
size_t size = .... //the size of your array
extern __shared__ int memory[];
int* temp_in=memory;
int* temp_out=memory+size;
Problem 2: Shared array index
Shared memory is private for each block. That is, temp[0] in one block can be different than temp[0] in another block. However, you index it by blockIdx.x*blockDim.x + threadIdx.x as if the temp arrays were shared between the blocks.
Instead, you should most likely index your temp arrays just by threadIdx.x.
Of course, the in array is global, and you index that one correctly with idx.
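Both fixes together could look roughly like the sketch below. It is a simplified single-block version rather than the asker's kernel: it assumes the whole input fits in one block (n at most 1024), swaps the two buffers by exchanging pointers instead of copying back, and uses integer shifts in place of pow. The helper name scanOnGpu is made up.

```cuda
#include <cuda_runtime.h>

// Inclusive prefix sum of n elements using a single block of n threads.
__global__ void prefixsum(int *data, int n)
{
    // One extern array, split by hand into the two buffers.
    extern __shared__ int memory[];          // holds 2 * blockDim.x ints
    int *temp_in = memory;
    int *temp_out = memory + blockDim.x;

    int tid = threadIdx.x;                   // shared arrays are indexed per block
    temp_in[tid] = (tid < n) ? data[tid] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset <<= 1) {
        // Read only from temp_in, write only to temp_out.
        temp_out[tid] = (tid >= offset) ? temp_in[tid] + temp_in[tid - offset]
                                        : temp_in[tid];
        __syncthreads();
        // Swap buffer roles; the pointers are thread-local copies.
        int *swap = temp_in; temp_in = temp_out; temp_out = swap;
    }

    if (tid < n) data[tid] = temp_in[tid];
}

// Hypothetical host helper: scans h_data in place, for 1 <= n <= 1024.
void scanOnGpu(int *h_data, int n)
{
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    size_t smemBytes = 2 * n * sizeof(int);  // room for both buffers
    prefixsum<<<1, n, smemBytes>>>(d_data, n);

    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```

Note that the launch has to pass the shared memory size as the third parameter; the invocation in the question omits it, so the extern array would have no storage at all. Scanning across several blocks needs an extra pass to combine the per-block totals, which this sketch does not attempt.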