I have written a CUDA kernel to process an image. But depending on the output of the processed image, I have to call the kernel again, to re-tune the image.
For example, let us consider an image having 9 pixels
1 2 3
4 5 6
7 8 9
Suppose that, depending on its neighboring values, the value 9 changes to 10. Since the value has changed, I have to re-process the new image, with the same kernel.
1 2 3
4 5 6
7 8 10
I have already written the algorithm to process the image in a single iteration. The way I'm planning to implement the iterations in CUDA is the following:
__global__ void process_image_GPU(unsigned int *d_input, unsigned int *d_output, int dataH, int dataW, unsigned int *val) {
__shared__ unsigned int sh_map[TOTAL_WIDTH][TOTAL_WIDTH];
// Do processing
// If during processing, anywhere any thread changes the value of the image call
{ atomicAdd(val, 1); }
}
int main(int argc, char *argv[]) {
// Allocate d_input, d_output and call cudaMemcpy
unsigned int *x, *val;
x = (unsigned int *)malloc(sizeof(unsigned int));
x[0] = 0;
cudaMalloc((void **)&val, sizeof(unsigned int));
cudaMemcpy((void *)val, (void *)x, sizeof(unsigned int), cudaMemcpyHostToDevice);
process_image_GPU<<<dimGrid, dimBlock>>>(d_input, d_output, rows, cols, val);
cudaMemcpy((void *)x, (void *)val, sizeof(unsigned int), cudaMemcpyDeviceToHost);
if(x != 0)
// Call the kernel again
}
Is it the only way to do this? Is there any other efficient way to implement the same?
Thanks a lot for your time.
I hazard an answer, despite the almost vanishing information you provided. Hope it helps.
From what you have said, you have already set up an updating rule for your pixels, based on the value of the adjacent pixels. Let x^(k)_ij the value of the pixel number ij at iteration k and let
x^(k+1)_ij = f(x^(k)_(i-1)j, x^(k)_ij, x^(k)_(i+1)j, x^(k)_i(j-1), x^(k)_i(j+1))
I'm assuming the typical stencil-based updating rule, but of course other rules would be possible.
At this point, you have to set up a stopping rule, namely, a rule that indicates if your algorithm has reached convergence. For example, you could evaluate the norm of the difference between the two images at steps k+1 and k.
Once formulated the problem in this way, I would say that you have the following two possibilities:
Rouy-Tourin-like scheme: all the computational pixels are updated in a brute-force way "simultaneously" until convergence is reached;
Fast sweeping method: the computational grid is swept (selective update) along a prefixed number of directions until convergence is reached;
Depending on the kind of problem you are dealing with, I would say that you have the additionl possibility:
Fast iterative method: the computational pixels are selectively updated with the aid of a heap structure.
All the above methods have been compared, for the solution of the eikonal equation, here.
Of course, you will need to show converngence of the above computational schemes for the particular problem of our interest.
Related
I have a fixed array populated with some values and I am trying to perform convolution of this array with a spike in frequency domain. Spike means all of the values inside the array is zero except at one place e.g a=[0,0,1,0,0,]
I have to create this spike approximately 1 million times .. the value 1 being placed at different index everytime...
float *spike = (float *) malloc(sizeof(float)*len);
memset(spike,0,sizeof(float)*len);
void compute_spike(float *spike, int ind)
{
spike[ind] = 1.0;
}
How can I create cufft complex type spike array on GPU efficiently ? You can also assume that I have an array of 1 million indices .. what is the best strategy to perform this convolution ? Should I create this spike on host and then move and do fft, convolve and ifft ? or should i create it on the fly on GPU how ?
Given the large bandwidth differences between the PCI-e bus and GPU memory, it makes much more sense to perform the whole construction in GPU memory. I would suggest fusing the memset operation and the spike assignment into a single kernel, something like
template<typename T>
__global__
void compute_spike(T* gpu_spike, int index, int N, T val)
{
int tid = threadIdx.x + blockDim.x * blockIdx.x;
int stride = blockDim.x * gridDim.x;
for(; tid < N; tid += stride) gpu_spike[tid] = (tid == index) ? val : T(0);
}
[Note: code written in browser, never compiled or run, use a own risk]
This uses the grid-stride loop design pattern, you can read more about it at the blog link. Note that you code uses float, but your text mentions "cufft complex type" so I have presented the code as a template. Modify it as you see fit. This should be close in performance to a cudaMemset call, but reduces latency by fusing everything together
__global__ void sum(const float * __restrict__ indata, float * __restrict__ outdata) {
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
// --- Specialize BlockReduce for type float.
typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT;
// --- Allocate temporary storage in shared memory
__shared__ typename BlockReduceT::TempStorage temp_storage;
float result;
if(tid < N) result = BlockReduceT(temp_storage).Sum(indata[tid]);
// --- Update block reduction value
if(threadIdx.x == 0) outdata[blockIdx.x] = result;
return;
}
I have tested the reduction sum(as shown in above code snippet) with cuda cub successfully, I want to perform the inner product of two vectors based on this code. But I have some confusions about it:
We need two input vectors for the inner_product, need I to conduct a component-wise multiplication of this two input vectors before the reduction sum on the resulting new vector.
In the code examples of the cuda cub, the dimension of input vectors is equal to the blocknumber*threadnumber. what if we have a very large vector.
Yes, with cub, and assuming your vectors were stored separately (i.e. not interleaved), you would need to do an element-wise multiplication first. On the other hand, thrust transform_reduce could handle it in a single function call.
blocknumber*threadnumber should give you all the range you need. on a cc3.0 or higher GPU, blocknumber (i.e. gridDim.x) can range up to 2^31-1 and threadnumber (i.e. blockDim.x) can range up to 1024. This gives you the possibility to handle 2^40 elements. If each element is 4 bytes, this would constitute (i.e. require) 2^42 bytes. That is about 4TB (or double that if you are considering 2 input vectors), which is much larger than any GPU memory currently. So you will run out of GPU memory space before you run out of grid dimension.
Note that what you are showing is cub::BlockReduce. However if you are doing a vector dot product of two large vectors, you might want to use cub::DeviceReduce instead.
I've been learning Cuda and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel
__global__ void max_reduce(const float* const d_array,
float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (gid == 0)
*d_max = shared[tid];
}
I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.
To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same in the kernel but only the min reduce matches up.
Is there something obvious I'm missing/doing wrong?
Your main conclusion in your deleted answer was correct: the kernel you have posted doesn't comprehend the fact that at the end of that kernel execution, you have done a good deal of the overall reduction, but the results are not quite complete. The results of each block must be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX; // 1
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]); // 2
__syncthreads();
}
// what to do now?
// option 1: save block result and launch another kernel
if (tid == 0)
d_max[blockIdx.x] = shared[tid]; // 3
// option 2: use atomics
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
As Pavan indicated, you need to initialize your shared memory array. The last block launched may not be a "full" block, if gridDim.x*blockDim.x is greater than elements.
Note that in this line, even though we are checking that the thread operating (gid) is less than elements, when we add s to gid for indexing into the shared memory we can still index outside of the legitimate values copied into shared memory, in the last block. Therefore we need the shared memory initialization indicated in note 1.
As you already discovered, your last line was not correct. Each block produces it's own result, and we must combine them somehow. One method you might consider if the number of blocks launched is small (more on this later) is to use atomics. Normally we steer people away from using atomics since they are "costly" in terms of execution time. However, the other option we are faced with is saving the block result in global memory, finishing the kernel, and then possibly launching another kernel to combine the individual block results. If I have launched a large number of blocks initially (say more than 1024) then if I follow this methodology I might end up launching two additional kernels. Thus the consideration of atomics. As indicated, there is no native atomicMax function for floats, but as indicated in the documentation, you can use atomicCAS to generate any arbitrary atomic function, and I have provided an example of that in atomicMaxf which provides an atomic max for float.
But is running 1024 or more atomic functions (one per block) the best way? Probably not.
When launching kernels of threadblocks, we really only need to launch enough threadblocks to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit from a machine utilization standpoint to launch thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this, it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX;
while (gid < elements) {
shared[tid] = max(shared[tid], d_array[gid]);
gid += gridDim.x*blockDim.x;
}
__syncthreads();
gid = (blockDim.x * blockIdx.x) + tid; // 1
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements). Instead we are launching a fixed number of blocks (say, 128, you can modify this number to find out what runs fastest), and letting each threadblock (and thus the entire grid) loop through memory, computing partial max operations on each element in shared memory. Then, in the line marked with comment 1, we must re-set the gid variable to it's initial value. This is actually unnecessary and the block reduction loop code can be further simplified if we guarantee that the size of the grid (gridDim.x*blockDim.x) is less than elements, which is not difficult to do at kernel launch.
Note that when using this atomic method, it's necessary to initialize the result (*d_max in this case) to an appropriate value, like -FLOAT_MAX.
Again, we normally steer people way from atomic usage, but in this case, it's worth considering if we carefully manage it, and it allows us to save the overhead of an additional kernel launch.
For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper which is available with the relevant CUDA sample.
Here's one that appears naive but isn't. This won't generalize to other functions like sum(), but it works great for min() and max().
__device__ const float float_min = -3.402e+38;
__global__ void maxKernel(float* d_data)
{
// compute max over all threads, store max in d_data[0]
int i = threadIdx.x;
__shared__ float max_value;
if (i == 0) max_value = float_min;
float v = d_data[i];
__syncthreads();
while (max_value < v) max_value = v;
__syncthreads();
if (i == 0) d_data[0] = max_value;
}
Yup, that's right, only syncing once after initialization and once before writing the result. Damn the race conditions! Full speed ahead!
Before you tell me it won't work, please give it a try first. I have tested thoroughly and it works every time on a variety of arbitrary kernel sizes. It turns out that the race condition doesn't matter in this case because the while loop resolves it.
It works significantly faster than a conventional reduction. Another surprise is that the average number of passes for a kernel size of 32 is 4. Yup, that's (log(n)-1), which seems counterintuitive. It's because the race condition gives an opportunity for good luck. This bonus comes in addition to removing the overhead of the conventional reduction.
With larger n, there is no way to avoid at least one iteration per warp, but that iteration only involves one compare operation which is usually immediately false across the warp when max_value is on the high end of the distribution. You could modify it to use multiple SM's, but that would greatly increase the total workload and add a communication cost, so not likely to help.
For terseness I've omitted the size and output arguments. Size is simply the number of threads (which could be 137 or whatever you like). Output is returned in d_data[0].
I've uploaded the working file here: https://github.com/kenseehart/YAMR
I am writing my first CUDA application and am writing all the kernels my self for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D, I first allocate enough space on the device for X and X_transpose*X. I copy X to the device, my kernel takes two inputs, the X matrix, then the space to write the X_transpose * X result.
Using the profiler the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my matlab execution(just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get much better speedups than matlab using CUDA.
The nvidia visual profiler is unable to find any issues with my kernel, I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
//find location in output matrix
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int row = threadIdx.y;
int col = threadIdx.x;
Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
float Cvalue = 0;
for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
//Get sub-matrix
Matrix Xsub = GetSubMatrix(X, m, blockCol);
Matrix XTsub = GetSubMatrix(X, m, blockRow);
__shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];
//Xs[row][col] = GetElement(Xsub, row, col);
//XTs[row][col] = GetElement(XTsub, col, row);
Xs[row][col] = *(float*)((char*)Xsub.data + row*Xsub.pitch) + col;
XTs[col][row] = *(float*)((char*)XTsub.data + row*XTsub.pitch) + col;
__syncthreads();
for(int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += Xs[e][row] * XTs[col][e];
__syncthreads();
}
//write the result to the XTX matrix
//SetElement(XTXsub, row, col, Cvalue);
((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
matrixLocation location;
unsigned int width; //width of matrix(# cols)
unsigned int height; //height of matrix(# rows)
unsigned int paddedWidth; //zero padded width
unsigned int paddedHeight; //zero padded height
float* data; //pointer to linear array of data elements
size_t pitch; //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
size_t size; //total number of elements in the matrix
size_t paddedSize; //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running the on a Kepler card, GTX 670 4GB.
Smaller block size like 16x16 or 8x8 may be faster. This slides also demos larger non-square size of block/shared mem may be faster for particular matrix size.
For shared mem allocation, add a dumy element on the leading dimension by using [BLOCK_SIZE][BLOCK_SIZE+1] to avoid the bank conflict.
Try to unroll the inner for loop by using #pragma unroll
On the other hand, You probably won't be much faster than matlab GPU code for large enough A'*A. Since the performance bottleneck of matlab is the invoking overhead rather than the kernel performance.
The cuBLAS routine culas_gemm() may have highest performance for matrix multiplication. You could compare yours with it.
MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's a open source project. You may also get some ideas from their code.
A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working examplate (with C++ host code, I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
// Simple blockwise reduction using voting primitives
// increments kchanged is any thread in the block
// returned tchanged != 0
tchanged = __any(tchanged != 0);
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal it needs to run again to the host. The kernel runs until each entry in the input has been incremented to above a threshold value. At the end of each threads processing, it participates in a warp vote, after which one thread from each warp loads the vote result to shared memory. One thread reduces the warp result and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".