I've been struggling all day trying to make a basic CUFFT example work properly. However, I run into a small problem that I cannot identify. Basically, I have a linear 2D array vx with x and y coordinates. Then I just compute a forward and then an inverse CUFFT (in place), that simple. Then I copy back the array vx, normalize it by NX*NY, and display it.
#define NX 32
#define NY 32
#define LX (2*M_PI)
#define LY (2*M_PI)
float *x = new float[NX*NY];
float *y = new float[NX*NY];
float *vx = new float[NX*NY];
for(int j = 0; j < NY; j++){
    for(int i = 0; i < NX; i++){
        x[j*NX + i] = i * LX/NX;
        y[j*NX + i] = j * LY/NY;
        vx[j*NX + i] = cos(x[j*NX + i]);
    }
}
float *d_vx;
CUDA_CHECK(cudaMalloc(&d_vx, NX*NY*sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_vx, vx, NX*NY*sizeof(float), cudaMemcpyHostToDevice));
cufftHandle planr2c;
cufftHandle planc2r;
CUFFT_CHECK(cufftPlan2d(&planr2c, NY, NX, CUFFT_R2C));
CUFFT_CHECK(cufftPlan2d(&planc2r, NY, NX, CUFFT_C2R));
CUFFT_CHECK(cufftSetCompatibilityMode(planr2c, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftSetCompatibilityMode(planc2r, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftExecR2C(planr2c, (cufftReal *)d_vx, (cufftComplex *)d_vx));
CUFFT_CHECK(cufftExecC2R(planc2r, (cufftComplex *)d_vx, (cufftReal *)d_vx));
CUDA_CHECK(cudaMemcpy(vx, d_vx, NX*NY*sizeof(cufftReal), cudaMemcpyDeviceToHost));
for (int j = 0; j < NY; j++){
    for (int i = 0; i < NX; i++){
        printf("%.3f ", vx[j*NX + i]/(NX*NY));
    }
    printf("\n");
}
When vx is defined as cos(x) or sin(x) it works fine, but when using sin(y) or cos(y) it gives me back the correct function (sin or cos) but with half the amplitude (that is, oscillating between 0.5 and -0.5 instead of between 1 and -1)! Note that using sin(2*y) or cos(2*y) (or sin(4*y), cos(4*y), ...) works fine. Any idea?
The problem here is that the output of an in-place real-to-complex transform is a complex type whose size is larger than that of the input real data (roughly twice as large). You haven't allocated enough memory to hold the intermediate complex results of the real-to-complex transform. Quoting from the documentation:
cufftExecR2C() (cufftExecD2Z()) executes a single-precision (double-precision) real-to-complex, implicitly forward, CUFFT transform plan. CUFFT uses as input data the GPU memory pointed to by the idata parameter. This function stores the nonredundant Fourier coefficients in the odata array. Pointers to idata and odata are both required to be aligned to cufftComplex data type in single-precision transforms and cufftDoubleComplex data type in double-precision transforms.
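To make this concrete for the example above (an NY x NX transform with NX = NY = 32): the non-redundant R2C output occupies NY*(NX/2+1) complex values, which is already more than the NX*NY floats that were allocated. The variable names below are purely illustrative:
size_t realBytes    = NX * NY * sizeof(cufftReal);            // 32*32*4 = 4096 bytes of real input
size_t complexBytes = NY * (NX/2 + 1) * sizeof(cufftComplex); // 32*17*8 = 4352 bytes of complex output
// complexBytes > realBytes, so the original NX*NY*sizeof(float) buffer cannot hold the R2C result in place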
The solution is either to allocate a second device buffer to hold the intermediate result, or to enlarge the in-place allocation so it is large enough to hold the complex data. So the core transform code changes to something like:
float *d_vx;
CUDA_CHECK(cudaMalloc(&d_vx, NX*NY*sizeof(cufftComplex)));
CUDA_CHECK(cudaMemcpy(d_vx, vx, NX*NY*sizeof(cufftReal), cudaMemcpyHostToDevice)); // only the real input data is copied
cufftHandle planr2c;
cufftHandle planc2r;
CUFFT_CHECK(cufftPlan2d(&planr2c, NY, NX, CUFFT_R2C));
CUFFT_CHECK(cufftPlan2d(&planc2r, NY, NX, CUFFT_C2R));
CUFFT_CHECK(cufftSetCompatibilityMode(planr2c, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftSetCompatibilityMode(planc2r, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftExecR2C(planr2c, (cufftReal *)d_vx, d_vx));
CUFFT_CHECK(cufftExecC2R(planc2r, d_vx, (cufftReal *)d_vx));
CUDA_CHECK(cudaMemcpy(vx, d_vx, NX*NY*sizeof(cufftReal), cudaMemcpyDeviceToHost)); // the C2R result is NX*NY reals again
[disclaimer: written in browser, never compiled or tested, use at own risk]
Note that the device allocation is now deliberately larger than the host array; only the NX*NY real values are transferred in each direction, so the host code from the question can stay as it is.
As a final comment, would it have been that hard to add the additional 8 or 10 lines required to turn what you posted into a compilable, runnable example that someone trying to help you could work with?
I'm new to CUDA and I'm trying to implement a kernel to calculate the energy of my Metropolis Monte Carlo simulation.
I'll put the serial version of this function here:
float calc_energy(struct frame frm, float L, float rc){
    int i,j;
    float E=0, rij, dx, dy, dz;
    for(i=0; i<frm.natm; i++)
    {
        for(j=i+1; j<frm.natm; j++)
        {
            dx = fabs(frm.conf[j][0] - frm.conf[i][0]);
            dy = fabs(frm.conf[j][1] - frm.conf[i][1]);
            dz = fabs(frm.conf[j][2] - frm.conf[i][2]);
            dx = dx - round(dx/L)*L;
            dy = dy - round(dy/L)*L;
            dz = dz - round(dz/L)*L;
            /*rij*/
            rij = sqrt(dx*dx + dy*dy + dz*dz);
            if (rij <= rc)
            {
                E = E + (4*((1/pow(rij,12))-(1/pow(rij,6))));
            }
        }
    }
    return E;
}
Then I tried to parallelize this using CUDA. This is my idea:
__global__ void calc_energy(frame* s, float L, float rc)
{
    extern __shared__ float E;
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    int j = blockDim.y*blockIdx.y + threadIdx.y;
    float rij, dx, dy, dz;
    dx = fabs(s->conf[j][0] - s->conf[i][0]);
    dy = fabs(s->conf[j][1] - s->conf[i][1]);
    dz = fabs(s->conf[j][2] - s->conf[i][2]);
    dx = dx - round(dx/L)*L;
    dy = dy - round(dy/L)*L;
    dz = dz - round(dz/L)*L;
    rij = sqrt(dx*dx + dy*dy + dz*dz);
    if (rij <= rc)
    {
        E += (4*((1/pow(rij,12))-(1/pow(rij,6)))); //<- here is the big problem
    }
}
My main question is how to sum the variable E from each thread and return it to the host? I intend to use as many threads and blocks as possible.
Obviously, part of the code is missing where the variable E is accumulated.
I have read a few things about reduction methods, but I would like to know if this is necessary here.
I call the kernel using the following code:
calc_energy<<<dimGrid,dimBlock>>>(d_state, 100, 5);
EDIT:
I understood that I needed to use reduction methods; CUB works great for me.
Continuing with the implementation of the code, I realized that I have a new problem, perhaps because of my lack of knowledge in this area.
In my nested loop, the variable frm.natm can reach values on the order of 10^5. Thinking of my GPU (a GTX 750 Ti), the number of threads per block is 1024 and the number of blocks per grid is 1024. If I understood correctly, the maximum number of runs in a kernel is 1024 x 1024 = 1048576 (actually less than that).
So if I need to do 10^5 x 10^5 = 10^10 calculations in my nested loop, what would be the best way to organize the algorithm? Would choosing a fixed number (one that fits my GPU) and splitting the calculations be a good idea?
My main question is how to sum the variable E from each thread and return it to the host?
You will need to sum each thread's calculation at the block level first, using some form of block-wise parallel reduction (I recommend the CUB block-wise reduction implementation for that).
Once each block has a partial sum from its threads, the block sums need to be combined. This can be done atomically by one thread from each block, by a second kernel call (with one block), or on the host. How and where you will use the final sum will determine which of those options is the most optimal for your application.
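To illustrate the idea, here is a minimal sketch (not compiled) that maps one thread to each atom i instead of a 2D (i, j) grid; d_conf (a flattened natm x 3 coordinate array) and d_E (a single float initialised to zero) are hypothetical names, and the block size is fixed at 256:
#include <cub/cub.cuh>

__global__ void calc_energy_kernel(const float *d_conf, int natm, float L, float rc, float *d_E)
{
    // CUB block-wide reduction over the 256 threads of this block
    typedef cub::BlockReduce<float, 256> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread accumulates the energy of the pairs (i, j>i) it owns
    float myE = 0.0f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < natm) {
        for (int j = i + 1; j < natm; j++) {
            float dx = fabsf(d_conf[3*j+0] - d_conf[3*i+0]); dx -= roundf(dx/L)*L;
            float dy = fabsf(d_conf[3*j+1] - d_conf[3*i+1]); dy -= roundf(dy/L)*L;
            float dz = fabsf(d_conf[3*j+2] - d_conf[3*i+2]); dz -= roundf(dz/L)*L;
            float rij = sqrtf(dx*dx + dy*dy + dz*dz);
            if (rij <= rc)
                myE += 4.0f*(1.0f/powf(rij,12.0f) - 1.0f/powf(rij,6.0f));
        }
    }

    // Reduce the per-thread energies to one value per block
    float blockE = BlockReduce(temp_storage).Sum(myE);

    // One thread per block combines the block sums (here: atomically)
    if (threadIdx.x == 0) atomicAdd(d_E, blockE);
}
Alternatively, if each partial energy is written out to a device array instead, the final sum can be done with a single Thrust reduction; the following generic example shows the pattern: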
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
int main(void)
{
    // Generate random data on the host
    thrust::host_vector<int> h_vec(100);
    std::generate(h_vec.begin(), h_vec.end(), rand);
    // Transfer to the device and sum with a single reduction
    thrust::device_vector<int> d_vec = h_vec;
    int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
    std::cout << x << std::endl;
    return 0;
}
In a previous post here, I asked how to calculate the sum of an array with reduction. Now I have a new problem: with a larger image my result is not correct, and it changes every time I run.
I tested with a 96*96 image-size array sample.
First time result: 28169.046875
Second time result: 28169.048828
Expected result: 28169.031250
Here is my code:
#include <stdio.h>
#include <cuda.h>
__global__ void calculate_threshold_kernel(float * input, float * output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    __shared__ float partialSum[256];
    partialSum[t] = input[idx];
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
        __syncthreads();
    }
    if (t == 0)
    {
        atomicAdd(output,partialSum[0]);
    }
}
int main( void )
{
    float *d_array, *d_output,*h_input, *h_output;
    int img_height = 96;
    int img_width = 96;
    int input_elements = img_height * img_width;
    h_input = (float*) malloc(sizeof(float) * input_elements);
    cudaMalloc((void**)&d_output, sizeof(float));
    cudaMemset(d_output, 0, sizeof(float));
    h_output = (float*)malloc(sizeof(float));
    cudaMalloc((void**)&d_array, input_elements*sizeof(float));
    float array[] = {[array sample]};
    for (int i = 0; i < input_elements; i++)
    {
        h_input[i] = array[i];
    }
    cudaMemcpy(d_array, h_input, input_elements*sizeof(float), cudaMemcpyHostToDevice);
    dim3 blocksize(256);
    dim3 gridsize(input_elements/blocksize.x);
    calculate_threshold_kernel<<<gridsize,blocksize>>>(d_array, d_output);
    cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Sum from GPU = %f\n", *h_output);
    return 0;
}
While the answer from Kangshiyin is correct about floating-point accuracy and floating-point addition being non-associative, he is not correct about the reason behind the results differing from one run to the other.
Floating-point addition is not associative, which means that operations performed in a different order can return different results. For example, (((a+b)+c)+d) may be slightly different from ((a+b)+(c+d)) for certain values of a, b, c and d. But both of these results should not vary from run to run.
Your results vary between runs because atomicAdd makes the order of the additions different each time. Using double also does not guarantee that the results will be the same between different runs.
There are ways to implement parallel reduction without atomicAdd as the final step (e.g. use a second kernel launch to add the partial sums from the first launch) which can provide consistent (yet slightly different from the CPU) results.
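A minimal sketch of that two-pass idea, reusing the shared-memory reduction from the question but writing one partial sum per block instead of using atomicAdd (the name d_partial and the single-block second launch are assumptions, sized for this 96*96 example):
// Pass 1: each block writes its partial sum to d_partial[blockIdx.x], no atomics involved
__global__ void block_sums(const float *input, float *d_partial)
{
    __shared__ float partialSum[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    partialSum[t] = input[idx];
    __syncthreads();
    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
        __syncthreads();
    }
    if (t == 0)
        d_partial[blockIdx.x] = partialSum[0];
}

// Pass 2: one thread adds the per-block sums in a fixed order, so the result is reproducible
__global__ void final_sum(const float *d_partial, int numBlocks, float *output)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        float sum = 0.0f;
        for (int i = 0; i < numBlocks; i++)
            sum += d_partial[i];
        *output = sum;
    }
}
With 96*96 inputs and 256 threads per block there are only 36 partial sums, so a single thread in the second launch is enough, and the order of the additions is now the same on every run.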
float has limited precision, up to about 7 decimal digits, as explained here:
https://en.wikipedia.org/wiki/Floating_point#Accuracy_problems
The result changes because operations on float are not associative and you are using a parallel reduction with atomicAdd(), which cannot keep the order of the additions fixed.
You could use double instead if you want a more accurate result.
The data in the 3D matrix was generated by layers (from top to bottom), and I want to multiply that data with a 2D matrix B, but instead of taking each layer I need to take a vector from layer 1, a vector from layer 2, and so on.
Currently what I am doing is copying those vectors from the 3D matrix into a 2D matrix tmpA, then multiplying by B (using CUBLAS) and storing the result in tmpB, to finally copy it back, row by row, to where it belongs in the 3D matrix C.
Overall, my whole app runs at least twice as fast as the CPU version, but it seems to me that those memory copies (even though they are made device-to-device) are not good for performance at all.
What would be a better way to do this computation? I was thinking about rearranging the data before multiplying, so as to avoid the memory copies.
The 3D matrix A and C and the 2D matrix B are already in GPU's memory.
EDIT
Let M, N, P be the dimensions of the 3D matrix A, stored in row-major order in a linear array in device memory. My code looks like this:
cudaMalloc((void**)&d_tmpIn, sizeof(double)*M*P);
cudaMalloc((void**)&d_tmpOut, sizeof(double)*M*P);
cudaMalloc((void**)&d_C, sizeof(double)*M*N*P);
for (int iN = 0; iN < N; iN++)
{
    double *dst = d_tmpIn;
    for (int iM = 0; iM < M; iM++)
    {
        cudaMemcpy(dst, &(d_A[iN*P+0+iM*N*P]), sizeof(double)*P, cudaMemcpyDeviceToDevice);
        dst += P;
    }
    cublasDgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, P, M, M, &alpha, d_tmpIn, P, d_B, M, &beta, d_tmpOut, P);
    double *src = d_tmpOut;
    for (int iM = 0; iM < M; iM++)
    {
        cudaMemcpy(&(d_C[iN*P+0+iM*N*P]), src, sizeof(double)*P, cudaMemcpyDeviceToDevice);
        src += P;
    }
}
Hope this helps.
You don't need to do memory copies! The BLAS and LAPACK APIs were created in such a way that you can specify the starting point, the stride length, the length of the leading dimensions and so on.
This way you can use the 3D arrays A and C as is, but call cublasDgemm by using the correct parameters.
In your case (if I understand the code correctly), each matrix should be P x M and you have N of them, while the 3D array is arranged as P x N x M. So, without allocating memory for d_tmpIn and d_tmpOut, you could do something like this: the number of rows of A is P and the number of columns is M, but the leading dimension (lda) should be given as N * P. The same goes for C.
int lda = N * P;
int ldc = N * P;
for (int iN = 0; iN < N; iN++)
{
    double *d_tmpIn = d_A + iN * P;
    double *d_tmpOut = d_C + iN * P;
    cublasSetStream(cublasHandle, streams[iN]); // Optional
    cublasDgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                P, M, M, &alpha, d_tmpIn, lda, d_B, M, &beta, d_tmpOut, ldc);
}
You could also create N streams and run each cuBLAS call in a separate stream. Note that this is only going to be useful if M and P are small enough (i.e. the GPU is not yet saturated computationally).
EDIT If you do plan to go ahead with streams, try to create them once at the beginning of the program and reuse them. Do not create and destroy streams in the same loop as the Dgemm calls; that increases the overhead.
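A minimal sketch of that create-once/reuse pattern (streams is a hypothetical array name; the handle, matrices and scalars are assumed to be set up already):
// Create the streams once, up front
cudaStream_t *streams = new cudaStream_t[N];
for (int iN = 0; iN < N; iN++)
    cudaStreamCreate(&streams[iN]);

// Reuse them for every pass over the N sub-matrices
for (int iN = 0; iN < N; iN++)
{
    cublasSetStream(cublasHandle, streams[iN]);
    cublasDgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                P, M, M, &alpha, d_A + iN * P, lda, d_B, M, &beta, d_C + iN * P, ldc);
}

// Destroy them once, when all the work is done
cudaDeviceSynchronize();
for (int iN = 0; iN < N; iN++)
    cudaStreamDestroy(streams[iN]);
delete[] streams;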
I tried searching for libraries on Google for numerical integration on CUDA but couldn't find any.
1) I want to ask, are there any libraries available to perform integration (of a function) on CUDA?
2) If I write my own code in CUDA, e.g. implementing Romberg integration, how should I proceed? Suppose I have a function, say f(x); do I need to calculate the integrals of this function for different intervals, e.g. 0.0 - 0.1, ..., 0.2 - 0.3, ..., 1.3 - 2.3? How do I calculate all of them in parallel?
In my mind, the strategy is that if I have to perform, e.g., 1000 integrations, I generate 1000 threads, and each thread calculates the trapezoids as well as the error estimates. But when I also want to calculate the trapezoids of one integration interval in parallel, alongside the other integrals, I don't have any idea how to approach this programmatically.
As noticed by Tera in his comment above, from the point of view of parallel programming integration is basically a reduction, so a very simple way to implement integration in CUDA is to exploit the primitives of the Thrust library (see also my answer to Simpson's method to integrate real valued functions with CUDA).
Below is a simple example implementing the Romberg integration method with Thrust primitives. It is a "direct" translation of the corresponding Matlab code available at this site, so this example also shows how "simply" some Matlab codes can be ported to CUDA with Thrust.
#include <cstdio>
#include <cmath>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#define pi_f 3.14159265358979f // Greek pi in single precision
struct sin_functor
{
    __host__ __device__
    float operator()(float x) const
    {
        return sin(2.f*pi_f*x);
    }
};
int main(void)
{
    int M = 5;                       // --- Maximum number of Romberg iterations
    float a = 0.f;                   // --- Lower integration limit
    float b = .5f;                   // --- Upper integration limit
    float hmin = (b-a)/pow(2.f,M-1); // --- Minimum integration step size
    // --- Define the matrix for Romberg approximations and initialize to 1.f
    thrust::host_vector<float> R(M*M,1.f);
    for (int k=0; k<M; k++) {
        float h = pow(2.f,k-1)*hmin; // --- Step size for the k-th row of the Romberg matrix
        // --- Define integration nodes
        int N = (int)((b - a)/h) + 1;
        thrust::device_vector<float> d_x(N);
        thrust::sequence(d_x.begin(), d_x.end(), a, h);
        // --- Calculate function values
        thrust::device_vector<float> d_y(N);
        thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), sin_functor());
        // --- Calculate integral
        R[k*M] = (.5f*h) * (d_y[0] + 2.f*thrust::reduce(d_y.begin() + 1, d_y.begin() + N - 1, 0.0f) + d_y[N-1]);
    }
    // --- Compute the k-th column of the Romberg matrix
    for (int k=1; k<M; k++) {
        // --- The matrix of Romberg approximations is triangular, so column k has only M-k valid entries
        for (int kk=0; kk<(M-k); kk++) {
            // --- See the Romberg integration algorithm
            R[kk*M+k] = R[kk*M+k-1] + (R[kk*M+k-1] - R[(kk+1)*M+k-1])/(pow(4.f,k)-1.f);
        }
    }
    // --- Define the vector Rnum for numerical approximations
    thrust::host_vector<float> Rnum(M);
    thrust::copy(R.begin(), R.begin() + M, Rnum.begin());
    for (int i=0; i<M; i++) printf("%i %f\n",i,Rnum[i]);
    getchar();
    return 0;
}
I am writing my first CUDA application and am writing all the kernels myself for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D. I first allocate enough space on the device for X and for X_transpose * X. I copy X to the device; my kernel takes two inputs, the X matrix and the space to write the X_transpose * X result.
Using the profiler the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my matlab execution (just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get much better speedups than matlab using CUDA.
The nvidia visual profiler is unable to find any issues with my kernel, so I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
    //find location in output matrix
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    int row = threadIdx.y;
    int col = threadIdx.x;
    Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
    float Cvalue = 0;
    for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
        //Get sub-matrix
        Matrix Xsub = GetSubMatrix(X, m, blockCol);
        Matrix XTsub = GetSubMatrix(X, m, blockRow);
        __shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];
        //Xs[row][col] = GetElement(Xsub, row, col);
        //XTs[row][col] = GetElement(XTsub, col, row);
        Xs[row][col]  = ((float*)((char*)Xsub.data  + row*Xsub.pitch))[col];
        XTs[col][row] = ((float*)((char*)XTsub.data + row*XTsub.pitch))[col];
        __syncthreads();
        for(int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += Xs[e][row] * XTs[col][e];
        __syncthreads();
    }
    //write the result to the XTX matrix
    //SetElement(XTXsub, row, col, Cvalue);
    ((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
    matrixLocation location;
    unsigned int width;        //width of matrix(# cols)
    unsigned int height;       //height of matrix(# rows)
    unsigned int paddedWidth;  //zero padded width
    unsigned int paddedHeight; //zero padded height
    float* data;               //pointer to linear array of data elements
    size_t pitch;              //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
    size_t size;               //total number of elements in the matrix
    size_t paddedSize;         //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running this on a Kepler card, a GTX 670 4GB.
A smaller block size like 16x16 or 8x8 may be faster. These slides also demonstrate that a larger, non-square block/shared-memory size may be faster for particular matrix sizes.
For the shared memory allocation, add a dummy element on the leading dimension, i.e. use [BLOCK_SIZE][BLOCK_SIZE+1], to avoid bank conflicts.
Try to unroll the inner for loop with #pragma unroll.
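Putting those last two suggestions together, the tile declarations and the inner product in your kernel would look roughly like this (a sketch of the changed lines only, not compiled):
// Padded tiles: the extra column avoids shared-memory bank conflicts
__shared__ float Xs [BLOCK_SIZE][BLOCK_SIZE + 1];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE + 1];

// Unrolled inner product over the tile
#pragma unroll
for (int e = 0; e < BLOCK_SIZE; ++e)
    Cvalue += Xs[e][row] * XTs[col][e];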
On the other hand, you probably won't be much faster than matlab's GPU code for a large enough A'*A, since the performance bottleneck of matlab is the invocation overhead rather than the kernel performance.
The cuBLAS routine cublasSgemm() may have the highest performance for matrix multiplication. You could compare yours with it.
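For reference, X'*X can be expressed as a single GEMM call by passing CUBLAS_OP_T for the first operand. The sketch below assumes a cuBLAS handle and a column-major X stored on the device as d_X with H rows and W columns, with d_XTX a W x W output buffer (all of these names are assumptions):
// XTX (W x W) = X^T (W x H) * X (H x W), all matrices column-major
const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle,
            CUBLAS_OP_T, CUBLAS_OP_N,
            W, W, H,        // m, n, k
            &alpha,
            d_X, H,         // A = X, lda = H
            d_X, H,         // B = X, ldb = H
            &beta,
            d_XTX, W);      // C = X^T * X, ldc = W
Since X'*X is symmetric, cublasSsyrk() can compute just one triangle of the result and may be faster still.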
The MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's an open-source project; you may also get some ideas from their code.