Periodic discrepancies between cuFFT and FFTW complex-to-real transformations (CUDA)

For my thesis, I have to optimize a special MPI Navier-Stokes solver with CUDA. The original program uses FFTW to solve several PDEs. In detail, several upper-triangular matrices are Fourier transformed in two dimensions, but handled as one-dimensional arrays. At the moment I'm struggling with the following parts of the original code (N is always set to 64):
Original:
//Does the complex-to-real in-place FFT and normalizes
void fftC2R(double complex *arr) {
    fftw_execute_dft_c2r(plan_c2r, (fftw_complex*)arr, (double*)arr);
    //Currently ignored: normalization
    /* for(int i=0; i<N*(N/2+1); i++)
        arr[i] /= (double complex)sqrt((double complex)(N*N)); */
}
void doTimeStepETDRK2_nonlin_original() {
    //calc velocity
    ux[0] = 0;
    uy[0] = 0;
    for(int i=1; i<N*(N/2+1); i++) {
        ux[i] = I*kvec[1][i]*qvec[i] / kvec[2][i];
        uy[i] = -I*kvec[0][i]*qvec[i] / kvec[2][i];
    }
    fftC2R(ux);
    fftC2R(uy);
    //do some stuff here...
    //...
    return;
}
where ux and uy are allocated as double complex arrays:
ux = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));
uy = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));
The FFT plan is created as:
plan_c2r = fftw_plan_dft_c2r_2d(N, N, (fftw_complex*)qvec, (double*)qvec, FFTW_ESTIMATE);
where qvec is allocated the same way as ux and uy and also has type double complex.
Here are the relevant parts of the CUDA code:
NN2_VecSetZero_and_init<<<block_size, grid_size>>>();
cudaSafeCall(cudaDeviceSynchronize());
cudaSafeCall(cudaGetLastError());

int err = (int)cufftExecZ2D(cu_plan_c2r, (cufftDoubleComplex*)sym_ux, (cufftDoubleReal*)sym_ux);
if (err != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return;
}

err = (int)cufftExecZ2D(cu_plan_c2r, (cufftDoubleComplex*)sym_uy, (cufftDoubleReal*)sym_uy);
if (err != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return;
}

//do some stuff here...
//...
return;
where sym_ux and sym_uy are allocated as:
cudaMalloc((void**)&sym_ux, N*(N/2+1)*sizeof(cufftDoubleComplex));
cudaMalloc((void**)&sym_uy, N*(N/2+1)*sizeof(cufftDoubleComplex));
The initialization of the relevant cuFFT parts looks like this:
if (cufftPlan2d(&cu_plan_c2r, N, N, CUFFT_Z2D) != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return -1;
}
if (cufftPlan2d(&cu_plan_r2c, N, N, CUFFT_D2Z) != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return -1;
}
if (cufftSetCompatibilityMode(cu_plan_c2r, CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return -1;
}
if (cufftSetCompatibilityMode(cu_plan_r2c, CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS) {
    exit(EXIT_FAILURE);
    return -1;
}
So I use full FFTW compatibility, and I call every function following the FFTW calling pattern.
When I run both versions, I get almost equal results for ux and uy (sym_ux and sym_uy). But at periodic positions in the arrays, cuFFT seems to ignore those elements where FFTW sets the real part to zero and computes the imaginary part (the arrays are too large to show here). The stride at which this occurs is N/2+1. So I believe I haven't completely understood the padding conventions used by cuFFT versus FFTW.
I can rule out any earlier discrepancies between these arrays up to the point where the cuFFT executions are called, so none of the other arrays in the code above are relevant here.
My question is: am I too optimistic in keeping almost 100% of the FFTW calling style? Do I have to rearrange my arrays before the FFTs? The cuFFT documentation says I would have to resize the data input and output arrays, but how can I do that when I'm running in-place transformations? I'd rather not stray too far from the original code, and I don't want to add extra copy instructions for each FFT call, because memory is limited and the arrays should stay on the GPU and be processed there for as long as possible.
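For reference, my understanding of the in-place layout (a sketch based on the FFTW documentation, which I assume the cuFFT FFTW-padding compatibility mode is meant to mimic): an in-place C2R transform of size N x N writes its real output into rows of 2*(N/2+1) doubles, where the last two doubles of each row are padding for even N. Indexing that respects the padding would look roughly like this (host-side illustration only; real_host is a hypothetical copy of one of the buffers):
double *real_host = (double*)ux_host;                 /* hypothetical host copy of ux */
for (int row = 0; row < N; row++) {
    for (int col = 0; col < N; col++) {
        double v = real_host[row * 2*(N/2+1) + col];  /* skip the 2 trailing padded doubles per row */
        /* ...use v... */
    }
}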
I'm thankful for every hint, critical statement or idea!
My configuration:
Compiler: gcc 4.6 (C99 Standard)
MPI-Package: mvapich2-1.5.1p1 (shouldn't play a role because of reduced single processing debug mode)
CUDA-Version: 4.2
GPU: CUDA-arch-compute_20 (NVIDIA GeForce GTX 570)
FFTW 3.3

Once, when I had to work with cuFFT, the only solution I got to work was to use a complex-to-complex plan ("cu_plan_c2c") exclusively; there are easy transformations between real and complex arrays:
- fill the imaginary part with 0 to emulate cu_plan_r2c
- use atan2 (not atan) on the complex result to emulate cu_plan_c2r
Sorry for not pointing you to a better solution, but this is how I ended up solving the problem. I hope you don't run into any hard trouble with low memory on the CPU side...
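For the first of those two steps, a minimal zero-fill sketch (the kernel and variable names are illustrative, not from the code above):
__global__ void realToComplex(const double *real_in, cufftDoubleComplex *cplx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cplx[i].x = real_in[i]; // real part copied
        cplx[i].y = 0.0;        // imaginary part zeroed
    }
}
// ...followed by e.g. cufftExecZ2Z(cu_plan_c2c, cplx, cplx, CUFFT_FORWARD);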

Related

Is there proper CUDA atomicLoad function?

I've run into the issue that the CUDA atomic API does not have an atomicLoad function.
After searching on Stack Overflow, I found the following implementation of a CUDA atomicLoad.
But it looks like this function fails to work in the following example:
#include <cassert>
#include <iostream>
#include <cuda_runtime_api.h>

template <typename T>
__device__ T atomicLoad(const T* addr) {
    const volatile T* vaddr = addr; // To bypass cache
    __threadfence();                // for seq_cst loads. Remove for acquire semantics.
    const T value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

__global__ void initAtomic(unsigned& count, const unsigned initValue) {
    count = initValue;
}

__global__ void addVerify(unsigned& count, const unsigned biasAtomicValue) {
    atomicAdd(&count, 1);
    // NOTE: When the following while loop is uncommented, addVerify gets stuck;
    // it cannot read the last proper value of count
    // while (atomicLoad(&count) != (1024 * 1024 + biasAtomicValue)) {
    //     printf("count = %u\n", atomicLoad(&count));
    // }
}

int main() {
    std::cout << "Hello, CUDA atomics!" << std::endl;
    const auto atomicSize = sizeof(unsigned);

    unsigned* datomic = nullptr;
    cudaMalloc(&datomic, atomicSize);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    constexpr unsigned biasAtomicValue = 11;
    initAtomic<<<1, 1, 0, stream>>>(*datomic, biasAtomicValue);
    addVerify<<<1024, 1024, 0, stream>>>(*datomic, biasAtomicValue);
    cudaStreamSynchronize(stream);

    unsigned countHost = 0;
    cudaMemcpyAsync(&countHost, datomic, atomicSize, cudaMemcpyDeviceToHost, stream);
    assert(countHost == 1024 * 1024 + biasAtomicValue);

    cudaStreamDestroy(stream);
    return 0;
}
If you uncomment the section with atomicLoad, the application gets stuck ...
Maybe I missed something? Is there a proper way to load a variable that is modified atomically?
P.S.: I know there is the cuda::atomic implementation, but this API is not supported by my hardware.
Since warps work in lockstep (at least on older architectures), if you put a conditional wait on one thread and the producer on another thread in the same warp, the warp can get stuck in the wait if it starts/executes first. Possibly only the newest architectures, which have independent thread scheduling, can handle this, so you should query the major/minor version of the CUDA architecture before running this. Volta and onwards are OK.
Also, you are launching 1 million threads and waiting on all of them at once. The GPU may not have enough execution ports/pipelines to keep 1 million threads in flight; it might only work on a GPU with 64k CUDA pipelines (assuming 16 threads in flight per pipeline). Instead of waiting on millions of threads, spawn sub-kernels from the main kernel when the condition occurs; dynamic parallelism is the key feature here. You should also check the minimum major/minor CUDA version needed for dynamic parallelism, in case someone is using an ancient NVIDIA card.
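A minimal sketch of such a capability check (device 0 assumed):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
bool hasIndependentScheduling = (prop.major >= 7);                             // Volta and onwards
bool hasDynamicParallelism   = (prop.major > 3) || (prop.major == 3 && prop.minor >= 5);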
atomicAdd returns the old value at the target address. If you meant to call a third kernel only once, and only after the condition is met, then you can simply check that returned value with an "if" before starting the dynamic-parallelism launch.
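A sketch of that idea inside the producer kernel (the follow-up kernel and the threshold are illustrative names):
unsigned old = atomicAdd(&count, 1u);
if (old + 1 == threshold) {
    // exactly one thread observes the crossing and launches the follow-up work
    followUpKernel<<<1, 64>>>();   // dynamic parallelism, requires compute capability >= 3.5
}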
You are also printing a million times; that is not good for performance, and it may take some time before the text appears in the console output if you have a slow CPU/RAM.
Lastly, you can optimize the performance of the atomic operations by running them on shared memory first and going to a global atomic only once per block. This defeats the point of the condition if there are more threads than the condition value (assuming an increment of 1 each time), so it may not be applicable to all algorithms.
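A minimal sketch of that per-block aggregation (names are illustrative):
__global__ void blockAggregatedAdd(unsigned *globalCount)
{
    __shared__ unsigned blockCount;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    atomicAdd(&blockCount, 1u);                 // cheap shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount);     // one global atomic per block
}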

Can I parallelize my code or is it not worth it? [duplicate]

This question lacked details, so I decided to create another question instead of editing this one. The new question is here: Can I parallelize my code or is it not worth it?
I have a program running in CUDA, where one piece of the code runs within a loop (serialized, as you can see below). This piece of code is a search within an array that contains addresses and/or NULL pointers. All the threads execute the code below.
while (i < n) {
    if (array[i] != NULL) {
        return array[i];
    }
    i++;
}
return NULL;
where n is the size of array, and array is in shared memory. I'm only interested in the first address that differs from NULL (first match).
The whole code (I've posted only a piece; the whole code is big) runs fast, but the "heart" of the code (i.e., the part that is repeated the most) is serialized, as you can see. I want to know whether I can parallelize this part (the search) with some optimized algorithm.
Like I said, the program is already in CUDA (and the array is on the device), so there will be no memory transfers between host and device.
My problem is: n is not big. It will hardly ever be greater than 8.
I've tried to parallelize it, but my "new" code took more time than the code above.
I was studying reduction and min operations, but I found that they are only useful when n is big.
So, any tips? Can I parallelize it efficiently, i.e., with low overhead?
Keeping things simple, one of the major limiting factors of GPGPU code is memory management. In most computers copying memory to the device (GPU) is a slow process.
As illustrated by http://www.ncsa.illinois.edu/~kindr/papers/ppac09_paper.pdf:
"The key requirement for obtaining effective
acceleration from GPU subroutine libraries is minimization of
I/O between the host and the GPU."
This is because I/O operations between host and device are SLOW!
Tying this back to your problem, it doesn't really make sense to run this on the GPU, since the amount of data you mention is so small. You would spend more time running the memcpy routines than it would take to run on the CPU in the first place, especially since you mention that you are only interested in the first match.
One common misconception is that 'if I run it on the GPU, it has more cores so it will run faster', and this just isn't the case.
When deciding whether it is worth porting to CUDA or OpenCL, you must think about whether the process is inherently parallel: are you processing very large amounts of data, etc.?
Since you say the array is a shared memory resource, the result of this search is the same for each thread of a block. This means a first and simple optimization would be to only let a single thread do the search. This will free all but the first warp of the block from doing any work (they still need to wait for the result, yet don't have to waste any computing resources):
__shared__ void *result;
if(tid == 0)
{
    result = NULL;  // __shared__ variables cannot have initializers, so thread 0 sets the default
    for(unsigned int i=0; i<n; ++i)
    {
        if (array[i] != NULL)
        {
            result = array[i];
            break;
        }
    }
}
__syncthreads();
return result;
A step further would then be to let the threads perform the search in parallel as a classic intra-block reduction. If you can guarantee n to always be <= 64, you can do this in a single warp and don't need any synchronization during the search (except for the complete synchronization at the end, of course).
for(unsigned int i=n/2; i>32; i>>=1)
{
    if(tid < i && !array[tid])
        array[tid] = array[tid+i];
    __syncthreads();
}

if(tid < 32)
{
    if(n > 32 && !array[tid]) array[tid] = array[tid+32];
    if(n > 16 && !array[tid]) array[tid] = array[tid+16];
    if(n >  8 && !array[tid]) array[tid] = array[tid+8];
    if(n >  4 && !array[tid]) array[tid] = array[tid+4];
    if(n >  2 && !array[tid]) array[tid] = array[tid+2];
    if(n >  1 && !array[tid]) array[tid] = array[tid+1];
}
__syncthreads();
return array[0];
Of course the example assumes n to be a power of two (and the array to be padded with NULLs accordingly), but feel free to tune it to your needs and optimize this further.
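Since the question asks for the first non-NULL entry, a warp-vote variant is another option for n <= 32. This is not part of the reduction above; it assumes __ballot (compute capability 2.0+, shown here in its pre-CUDA-9 form) and the same tid/n/array names:
__shared__ void *result;
if(tid < 32)
{
    unsigned mask = __ballot(tid < n && array[tid] != NULL);  // one bit per lane with a hit
    if(tid == 0)
        result = mask ? array[__ffs(mask) - 1] : NULL;        // lowest set bit = first match
}
__syncthreads();
return result;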


Thrust - accessing neighbors

I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?
The function call operator of the functor for the 'copy_if' basically looks like:
__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        //etc.
    }
    return mark;
}
Currently I use a work-around by first launching a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements. After that, I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.
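For reference, a rough sketch of that two-pass work-around (kernel and variable names are illustrative, and only two of the neighbor checks are shown):
// pass 1: a plain CUDA kernel writes a 0/1 mark per element, with easy neighbor access
__global__ void markElements(const float *xi, int *marks, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int idx = y * width + x;
    bool mark = xi[idx] < 0.0f;
    if (mark && x > 0         && xi[idx - 1] > 1.0f) mark = false;  // left neighbor
    if (mark && x < width - 1 && xi[idx + 1] > 1.0f) mark = false;  // right neighbor
    // ...remaining neighbor checks...
    marks[idx] = mark ? 1 : 0;
}

// pass 2: copy_if over the element indices, using the marks as a stencil
struct IsMarked {
    __host__ __device__ bool operator()(int m) const { return m != 0; }
};
// thrust::copy_if(thrust::counting_iterator<int>(0),
//                 thrust::counting_iterator<int>(width * height),
//                 marks_begin,                 // iterator over the marks written in pass 1
//                 indicesEmptied.begin(), IsMarked());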
I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but when compiling it, it gives me "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know, I'm not trying to access memory in an unaligned fashion. Does anybody know what's going on and/or how to fix this?
struct IsEmpty2 {
    float* xi;

    IsEmpty2(float* pXi) { xi = pXi; }

    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            //etc.
        }
        return mark;
    }
};

thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));
@phoad: you're right about the shared mem; it struck me after I had already posted my reply, and I subsequently thought the cache would probably help me. But you beat me with your quick response. The if-statement, however, is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have a negligible impact on performance.
Tuples only support 10 values, so that would mean I would need tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass on this option (also from a code-readability standpoint). I tried your suggestion of directly using threadIdx.x etc. in the device function, but Thrust doesn't like that. I seem to be getting some unexplainable results, and sometimes I end up with a Thrust error. The following program, for example, generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":
struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;
    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());
    return 0;
}
The same applies to printing blockIdx.x; printing blockDim.x, however, generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around.

CUDA/Thrust double pointer problem (vector of pointers)

Hey all, I am using CUDA and the Thrust library. I am running into a problem when I try to access, from a CUDA kernel, a double pointer loaded with a thrust::device_vector of type Object* (vector of pointers) from the host. When compiling with 'nvcc -o thrust main.cpp cukernel.cu' I receive the warning 'Warning: Cannot tell what pointer points to, assuming global memory space' and a launch error upon attempting to run the program.
I have read the NVIDIA forums and the solution seems to be 'Don't use double pointers in a CUDA kernel'. I am not looking to collapse the double pointer into a 1D pointer before sending it to the kernel... Has anyone found a solution to this problem? The required code is below, thanks in advance!
--------------------------
main.cpp
--------------------------
Sphere * parseSphere(int i)
{
    Sphere * s = new Sphere();
    s->a = 1+i;
    s->b = 2+i;
    s->c = 3+i;
    return s;
}

int main( int argc, char** argv ) {
    int i;
    thrust::host_vector<Sphere *> spheres_h;
    thrust::host_vector<Sphere> spheres_resh(NUM_OBJECTS);

    //initialize spheres_h
    for(i=0; i<NUM_OBJECTS; i++){
        Sphere * sphere = parseSphere(i);
        spheres_h.push_back(sphere);
    }

    //initialize spheres_resh
    for(i=0; i<NUM_OBJECTS; i++){
        spheres_resh[i].a = 1;
        spheres_resh[i].b = 1;
        spheres_resh[i].c = 1;
    }

    thrust::device_vector<Sphere *> spheres_dv = spheres_h;
    thrust::device_vector<Sphere> spheres_resv = spheres_resh;
    Sphere ** spheres_d = thrust::raw_pointer_cast(&spheres_dv[0]);
    Sphere * spheres_res = thrust::raw_pointer_cast(&spheres_resv[0]);

    kernelBegin(spheres_d, spheres_res, NUM_OBJECTS);

    thrust::copy(spheres_dv.begin(), spheres_dv.end(), spheres_h.begin());
    thrust::copy(spheres_resv.begin(), spheres_resv.end(), spheres_resh.begin());

    bool result = true;
    for(i=0; i<NUM_OBJECTS; i++){
        result &= (spheres_resh[i].a == i+1);
        result &= (spheres_resh[i].b == i+2);
        result &= (spheres_resh[i].c == i+3);
    }

    if(result){
        cout << "Data GOOD!" << endl;
    }else{
        cout << "Data BAD!" << endl;
    }

    return 0;
}
--------------------------
cukernel.cu
--------------------------
__global__ void deviceBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
    int index = threadIdx.x + blockIdx.x*blockDim.x;

    spheres_res[index].a = (*(spheres_d+index))->a; //causes warning/launch error
    spheres_res[index].b = (*(spheres_d+index))->b;
    spheres_res[index].c = (*(spheres_d+index))->c;
}

void kernelBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
    int threads = 512;                      //per block
    int grids = ((num_objects)/threads)+1;  //blocks per grid

    deviceBegin<<<grids,threads>>>(spheres_d, spheres_res, num_objects);
}
The basic problem here is that device vector spheres_dv contains host pointers. Thrust cannot do "deep copying" or pointer translation between the GPU and host CPU address spaces. So when you copy spheres_h to GPU memory, you are winding up with a GPU array of host pointers. Indirection of host pointers on the GPU is illegal - they are pointers in the wrong memory address space, thus you are getting the GPU equivalent of a segfault inside the kernel.
The solution is going to involve replacing your parseSphere function with something that performs memory allocation on the GPU, rather than the current parseSphere, which allocates each new structure in host memory. If you had a Fermi GPU (which it appears you do not) and are using CUDA 3.2 or 4.0, then one approach would be to turn parseSphere into a kernel. The C++ new operator is supported in device code, so structure creation would occur in device memory. You would need to modify the definition of Sphere so that the constructor is defined as a __device__ function for this approach to work.
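A rough sketch of that first approach (assuming Sphere's constructor has been marked __device__; the kernel name is illustrative):
__global__ void parseSphereKernel(Sphere ** spheres_d, int num_objects)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < num_objects) {
        Sphere * s = new Sphere();   // allocated on the device heap (sm_20+, CUDA 3.2/4.0)
        s->a = 1+i;
        s->b = 2+i;
        s->c = 3+i;
        spheres_d[i] = s;            // store a *device* pointer in the device array
    }
}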
The alternative approach involves creating a host array holding device pointers, then copying that array to device memory. You can see an example of that in this answer. Note that declaring a thrust::device_vector containing thrust::device_vector probably won't work, so you will likely need to build this array of device pointers using the underlying CUDA API calls.
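A minimal sketch of that alternative, using plain CUDA API calls (array and variable names are illustrative):
Sphere * h_ptrs[NUM_OBJECTS];                       // host array that will hold device pointers
for (int i = 0; i < NUM_OBJECTS; i++) {
    Sphere * hs = parseSphere(i);                   // host-side object from the original helper
    cudaMalloc((void**)&h_ptrs[i], sizeof(Sphere)); // device copy of it
    cudaMemcpy(h_ptrs[i], hs, sizeof(Sphere), cudaMemcpyHostToDevice);
    delete hs;
}
Sphere ** spheres_d;
cudaMalloc((void**)&spheres_d, NUM_OBJECTS * sizeof(Sphere *));
cudaMemcpy(spheres_d, h_ptrs, NUM_OBJECTS * sizeof(Sphere *), cudaMemcpyHostToDevice);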
You should also note that I haven't mentioned the reverse copy operation, which is equally difficult to do.
The bottom line is that thrust (and C++ STL containers, for that matter) really are not intended to hold pointers. They are intended to hold values, abstracting away pointer indirection and direct memory access through the use of iterators and underlying algorithms which the user isn't supposed to see. Furthermore, the "deep copy" problem is the main reason why the wise people on the NVIDIA forums counsel against multiple levels of pointers in GPU code. It greatly complicates code, and it runs slower on the GPU as well.