Making CUB blockradixsort on-chip entirely? - cuda

I am reading the CUB documentations and examples:
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
// Specialize BlockRadixSort for 128 threads owning 4 integer items each
typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
// Allocate shared memory for BlockRadixSort
__shared__ typename BlockRadixSort::TempStorage temp_storage;
// Obtain a segment of consecutive items that are blocked across threads
int thread_keys[4];
...
// Collectively sort the keys
BlockRadixSort(temp_storage).Sort(thread_keys);
...
}
In the example, each thread has 4 keys. It looks like 'thread_keys' will be allocated in global local memory. If I only has 1 key per thread, could I declare"int thread_key;" and make this variable in register only?
BlockRadixSort(temp_storage).Sort() is taking a pointer to the key as parameter. Does it mean that the keys have to be in global memory?
I would like to use this code but I want each thread to hold one key in register and keep it on-chip in register/shared memory after they are sorted.
Thanks in advance!

You can do this using shared memory (which will keep it "on-chip"). I'm not sure I know how to do it using strictly registers without de-constructing the BlockRadixSort object.
Here's an example code that uses shared memory to hold the initial data to be sorted, and the final sorted results. This sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, with the exception of the data synthesis and debug printouts:
#include <cub/cub.cuh>
#include <stdio.h>
#define nTPB 32
#define ELEMS_PER_THREAD 1
// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER THREAD integers)
__global__ void BlockSortKernel()
{
__shared__ int my_val[nTPB*ELEMS_PER_THREAD];
using namespace cub;
// Specialize BlockRadixSort collective types
typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
// Allocate shared memory for collectives
__shared__ typename my_block_sort::TempStorage sort_temp_stg;
// need to extend synthetic data for ELEMS_PER_THREAD > 1
my_val[threadIdx.x*ELEMS_PER_THREAD] = (threadIdx.x + 5)%nTPB; // synth data
__syncthreads();
printf("thread %d data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);
// Collectively sort the keys
my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
__syncthreads();
printf("thread %d sorted data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);
}
int main(){
BlockSortKernel<<<1,nTPB>>>();
cudaDeviceSynchronize();
}
This seems to work correctly for me, in this case I happened to be using RHEL 5.5/gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).
The strange/ugly static casting is needed as far as I can tell, because the CUB Sort is expecting a reference to an array of length equal to the customization parameter ITEMS_PER_THREAD(i.e. ELEMS_PER_THREAD):
__device__ __forceinline__ void Sort(
Key (&keys)[ITEMS_PER_THREAD],
int begin_bit = 0,
int end_bit = sizeof(Key) * 8)
{ ...

Related

Delete a cudaMalloc allocated memory inside kernel

I want to delete an array allocated by cudaMalloc inside a kernel using delete[]; but memory checker shows access violation, array is kept in memory and kernel continues to execute.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
__global__ void kernel(int *a)
{
int *b = new int[10];
delete[] b; // no violation
delete[] a; // Memory Checker detects access violation.
}
int main()
{
int *d_a;
cudaMalloc(&d_a, 10 * sizeof(int));
kernel<<<1, 1>>>(d_a);
return 0;
}
What is the difference between memory allocated by cudaMalloc and new in device code?
Is it possible to delete memory allocated by cudaMalloc in device code?
Thanks
cudaMalloc in host code and new (or malloc) in device code allocate out of logically separate areas. The two areas are not generally interoperable from an API standpoint.
no
You may wish to read the documentation. The description given there for in-kernel malloc and free generally applies to in-kernel new and delete as well.

External call of a class method in a kernel

I have a class FPlan that has a number of methods such as permute and packing.
__host__ __device__ void Perturb_action(FPlan *dfp){
dfp->perturb();
dfp->packing();
}
__global__ void Vector_Perturb(FPlan **dfp, int n){
int i=threadIx.x;
if(i<n) Perturb_action(dfp[i]);
}
in main:
FPlan **fp_vec;
fp_vec=(FPlan**)malloc(VEC_SIZE*sizeof(FPlan*));
//initialize the vec
for(int i=0; i<VEC_SIZE;i++)
fp_vec[i]=&fp;
//fp of type FPlan that is initialized
int v_sz=sizeof(fp_vec);
double test=fp_vec[0]->getCost();
printf("the cost before perturb %f\n"test);
FPlan **value;
cudaMalloc(&value,v_sz);
cudaMemcpy(value,&fp_vec,v_sz,cudaMemcpyHostToDevice);
//call kernel
dim3 threadsPerBlock(VEC_SIZE);
dim3 numBlocks(1);
Vector_Perturb<<<numBlocks,threadsPerBlock>>> (value,VEC_SIZE);
cudaMemcpy(fp_vec,value,v_sz,cudaMemcpyDeviceToHost);
test=fp_vec[0]->getCost();
printf("the cost after perturb %f\n"test);
test=fp_vec[1]->getCost();
printf("the cost after perturb %f\n"test);
I am getting before permute for fp_vec[0] printf the cost 0.8.
After permute for fp_vec[0] the value inf and for fp_vec[1] the value 0.8.
The expected output after the permutation should be something like fp_vec[0] = 0.7 and fp_vec[1] = 0.9. I want to apply these permutations to an array of type FPlan.
What am I missing? Is calling an external function supported in CUDA?
This seems to be a common problem these days:
Consider the following code:
#include <stdio.h>
#include <stdlib.h>
int main() {
int* arr = (int*) malloc(100);
printf("sizeof(arr) = %i", sizeof(arr));
return 0;
}
what is the expected ouptut? 100? no its 4 (at least on a 32 bit machine). sizeof() returns the size of the type of a variable not the allocated size of an array.
int v_sz=sizeof(fp_vec);
double test=fp_vec[0]->getCost();
printf("the cost before perturb %f\n"test);
FPlan **value;
cudaMalloc(&value,v_sz);
cudaMemcpy(value,&fp_vec,v_sz,cudaMemcpyHostToDevice);
You are allocating 4 (or 8) bytes on the device and copy 4 (or 8) bytes. The result is undefined (and maybe every time garbage).
Besides that, you shold do proper error checking of your CUDA calls.
Have a look: What is the canonical way to check for errors using the CUDA runtime API?

use of constant in cuda is not accessed in the kernel

in the cuda code ,I am trying to use a structure and constant structure object and the value is assigned to constant object using cudaMemcpyToSymbol but this constant values are not accessed . I know the actual use of constant is not this way as each thread needs to access different values and cannot take advantage of memory broadcast to half warp but here in some situation I need this way
#include <iostream>
#include <stdio.h>
#include <cuda.h>
using namespace std;
struct CDistance
{
int Magnitude;
int Direction;
};
__constant__ CDistance *c_daSTLDistance;
__global__ static void CalcSTLDistance_Kernel(CDistance *m_daSTLDistance)
{
int ID = threadIdx.x;
m_daSTLDistance[ID].Magnitude = m_daSTLDistance[ID].Magnitude + c_daSTLDistance[ID].Magnitude ;
m_daSTLDistance[ID].Direction = 2 ;
}
// main routine that executes on the host
int main(void)
{
CDistance *m_haSTLDistance,*m_daSTLDistance;
m_haSTLDistance = new CDistance[10];
for(int i=0;i<10;i++)
{
m_haSTLDistance[i].Magnitude=3;
m_haSTLDistance[i].Direction=2;
}
//m_haSTLDistance =(CDistance*)malloc(100 * sizeof(CDistance));
cudaMalloc((void**)&m_daSTLDistance,sizeof(CDistance)*10);
cudaMemcpy(m_daSTLDistance, m_haSTLDistance,sizeof(CDistance)*10, cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(c_daSTLDistance, m_haSTLDistance, sizeof(m_daSTLDistance)*10);
CalcSTLDistance_Kernel<<< 1, 100 >>> (m_daSTLDistance);
cudaMemcpy(m_haSTLDistance, m_daSTLDistance, sizeof(CDistance)*10, cudaMemcpyDeviceToHost);
for (int i=0;i<10;i++){
cout<<m_haSTLDistance[i].Magnitude<<endl;
}
free(m_haSTLDistance);
cudaFree(m_daSTLDistance);
}
here in the output, the constant c_daSTLDistance[ID].Magnitude is not accessed in the kernel and the statically assigned value 3 is obtained whereas I want this device value 3 is added to constant value and total 6 is returned.
while looking in to the cuda-memcheck it says error in read operation with memory out of bound
Your code doesn't work because of an uninitialised pointer/buffer overflow problem around the use of c_daSTLDistance. It is illegal to do this:
__constant__ CDistance *c_daSTLDistance;
....
cudaMemcpyToSymbol(c_daSTLDistance, m_haSTLDistance, sizeof(m_daSTLDistance)*10);
No memory was every allocated or a valid value set for c_daSTLDistance.
Further, note that all constant memory variables must be statically defined, and there is no ability to dynamically allocate constant memory at runtime. Therefore, what you are attempting to do can't be made to work. Also note that on all but the very oldest of CUDA devices, kernel arguments are stored in constant memory. So if you had a trivially small array of constant structures, it would be far easier and simpler to pass them by value to the kernel. The compiler and runtime will automagically place them in constant memory for you without any explicit host API calls.

how to get cufftcomplex magnitude and phase fast

i have a cufftcomplex data block which is the result from cuda fft(R2C). i know the data is save as a structure with a real number followed by image number. now i want to get the amplitude=sqrt(R*R+I*I), and phase=arctan(I/R) of each complex element by a fast way(not for loop). Is there any good way to do that? or any library could do that?
Since cufftExecR2C operates on data that is on the GPU, the results are already on the GPU, (before you copy them back to the host, if you are doing that.)
It should be straightforward to write your own cuda kernel to accomplish this. The amplitude you're describing is the value returned by cuCabs or cuCabsf in cuComplex.h header file. By looking at the functions in that header file, you should be able to figure out how to write your own that computes the phase angle. You'll note that cufftComplex is just a typedef of cuComplex.
let's say your cufftExecR2C call left some results of type cufftComplex in array data of size sz. Your kernel might look like this:
#include <math.h>
#include <cuComplex.h>
#include <cufft.h>
#define nTPB 256 // threads per block for kernel
#define sz 100000 // or whatever your output data size is from the FFT
...
__host__ __device__ float carg(const cuComplex& z) {return atan2(cuCimagf(z), cuCrealf(z));} // polar angle
__global__ void magphase(cufftComplex *data, float *mag, float *phase, int dsz){
int idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < dsz){
mag[idx] = cuCabsf(data[idx]);
phase[idx] = carg(data[idx]);
}
}
...
int main(){
...
/* Use the CUFFT plan to transform the signal in place. */
/* Your code might be something like this already: */
if (cufftExecR2C(plan, (cufftReal*)data, data) != CUFFT_SUCCESS){
fprintf(stderr, "CUFFT error: ExecR2C Forward failed");
return;
}
/* then you might add: */
float *h_mag, *h_phase, *d_mag, *d_phase;
// malloc your h_ arrays using host malloc first, then...
cudaMalloc((void **)&d_mag, sz*sizeof(float));
cudaMalloc((void **)&d_phase, sz*sizeof(float));
magphase<<<(sz+nTPB-1)/nTPB, nTPB>>>(data, d_mag, d_phase, sz);
cudaMemcpy(h_mag, d_mag, sz*sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_phase, d_phase, sz*sizeof(float), cudaMemcpyDeviceToHost);
You can also do this using thrust by creating functors for the magnitude and phase functions, and passing these functors along with data, mag and phase to thrust::transform.
I'm sure you can probably do it with CUBLAS as well, using a combination of vector add and vector multiply operations.
This question/answer may be of interest as well. I lifted my phase function carg from there.

Efficient method to check for matrix stability in CUDA

A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working examplate (with C++ host code, I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
// Simple blockwise reduction using voting primitives
// increments kchanged is any thread in the block
// returned tchanged != 0
tchanged = __any(tchanged != 0);
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal it needs to run again to the host. The kernel runs until each entry in the input has been incremented to above a threshold value. At the end of each threads processing, it participates in a warp vote, after which one thread from each warp loads the vote result to shared memory. One thread reduces the warp result and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".