Calling a __host__ __device__ function inside a __global__ function causes overhead - cuda

This is a follow-up question to this thread.
My __global__ function contains only a single call to the API Geoditic2ECEF(GPS gps), yet that global function takes 35 ms to execute. However, if I write the body of __host__ __device__ Geoditic2ECEF(GPS gps) directly inside the __global__ function rather than calling it as an API, the __global__ function takes only 2 ms to execute. It seems that calling a __host__ __device__ API inside a __global__ function causes a mysterious overhead.
This is the ptxas output when I use the API:
ptxas info : Compiling entry function '_Z16cudaCalcDistanceP7RayInfoPK4GPS3PK6float6PK9ObjStatusPKfSB_SB_fiiiiii' for 'sm_52'
ptxas info : Function properties for _Z16cudaCalcDistanceP7RayInfoPK4GPS3PK6float6PK9ObjStatusPKfSB_SB_fiiiiii 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 404 bytes cmem[0]
This is the ptxas output when I don't use the API:
ptxas info : Compiling entry function '_Z16cudaCalcDistanceP7RayInfoPK4GPS3PK6float6PK9ObjStatusPKfSB_SB_fiiiiii' for 'sm_52'
ptxas info : Function properties for _Z16cudaCalcDistanceP7RayInfoPK4GPS3PK6float6PK9ObjStatusPKfSB_SB_fiiiiii 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 2 registers, 404 bytes cmem[0]
The only difference is that the API version used 9 registers while the non-API version used 2 registers. What can I deduce from this information?
In the file utils.cu, I defined the following structs and API:
struct GPS {
    float latitude;
    float longtitude;
    float height;
};

struct Coordinate
{
    __host__ __device__ Coordinate(float x_ = 0, float y_ = 0, float z_ = 0)
    {
        x = x_;
        y = y_;
        z = z_;
    }
    __host__ __device__ float norm()
    {
        return sqrtf(x * x + y * y + z * z);
    }
    float x;
    float y;
    float z;
};

__host__ __device__ Coordinate Geoditic2ECEF(GPS gps)
{
    Coordinate result;
    float a = 6378137;
    float b = 6356752;
    float f = (a - b) / a;
    float e_sq = f * (2 - f);
    float lambda = gps.latitude / 180 * M_PI;
    float phi = gps.longtitude / 180 * M_PI;
    float N = a / sqrtf(1 - e_sq * sinf(lambda) * sinf(lambda));
    result.x = (gps.height + N) * cosf(lambda) * cosf(phi);
    result.y = (gps.height + N) * cosf(lambda) * sinf(phi);
    result.z = (gps.height + (1 - e_sq) * N) * sinf(lambda);
    return result;
}
In main.cu, I have the following functions:
__global__ void cudaCalcDistance(GPS* missile_cur,
                                 int num_faces, int num_partialPix)
{
    int partialPixIdx = threadIdx.x + IMUL(blockIdx.x, blockDim.x);
    int faceIdx = threadIdx.y + IMUL(blockIdx.y, blockDim.y);
    if(faceIdx < num_faces && partialPixIdx < num_partialPix)
    {
        Coordinate missile_pos;
        // API version
        missile_pos = Geoditic2ECEF(*missile_cur);
        // non-API version
        // float a = 6378137;
        // float b = 6356752;
        // float f = (a - b) / a;
        // float e_sq = f * (2 - f);
        // float lambda = missile_cur->latitude / 180 * M_PI;
        // float phi = missile_cur->longtitude / 180 * M_PI;
        // float N = a / sqrtf(1 - e_sq * sinf(lambda) * sinf(lambda));
        // missile_pos.x = (missile_cur->height + N) * cosf(lambda) * cosf(phi);
        // missile_pos.y = (missile_cur->height + N) * cosf(lambda) * sinf(phi);
        // missile_pos.z = (missile_cur->height + (1 - e_sq) * N) * sinf(lambda);
    }
}
void calcDistance(GPS *data)
{
    int num_partialPix = 10000;
    int num_surfaces = 4000;
    dim3 blockDim(16, 16);
    dim3 gridDim(ceil((float)num_partialPix / blockDim.x),
                 ceil((float)num_surfaces / blockDim.y));
    cudaCalcDistance<<<gridDim, blockDim>>>(data, num_surfaces, num_partialPix);
    gpuErrChk(cudaDeviceSynchronize());
}
int main()
{
    GPS data = {11, 120, 32};
    GPS *d_data;
    gpuErrChk(cudaMallocManaged((void**)&d_data, sizeof(GPS)));
    gpuErrChk(cudaMemcpy(d_data, &data, sizeof(GPS), cudaMemcpyHostToDevice));
    calcDistance(d_data);
    gpuErrChk(cudaFree(d_data));
}

You don't seem to have asked a question that I can see, so I will assume your question is something like "what is this mysterious overhead and what are my options to mitigate it?"
When the call to a __device__ function is in a different compilation unit than the definition of that function, the compiler cannot inline that function (generally).
This can have a variety of performance impacts:
The call instruction itself creates some overhead.
The function call ABI reserves registers; this creates register pressure, which may affect code performance.
The compiler may have to pass additional function parameters via the stack rather than in registers, which adds further overhead.
The compiler cannot (generally) optimize across the function call boundary.
All of these can create performance impacts to varying degrees, and you can find other questions here on the cuda tag which mention these.
The most common solutions I know of are:
Move the definition of the function to the same compilation unit as the calling environment (and, if possible, remove -rdc=true or -dc from the compilation command line).
In recent CUDA versions, make use of link-time optimization.
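As a rough sketch of these two options (the header name is illustrative, not from the question): the first moves the definition into a header marked __forceinline__ so that every .cu file calling it can inline it; the second keeps separate compilation and relies on device link-time optimization (available since CUDA 11.2 via the -dlto flag).

// geoditic.cuh : illustrative header, so the definition is visible in main.cu's translation unit
#pragma once
__forceinline__ __host__ __device__ Coordinate Geoditic2ECEF(GPS gps)
{
    /* body exactly as defined in utils.cu above */
}

// main.cu
#include "geoditic.cuh"   // the compiler can now inline the call inside cudaCalcDistance

// Alternative, keeping utils.cu separate (device link-time optimization):
//   nvcc -dc -dlto utils.cu main.cu
//   nvcc -dlto utils.o main.o -o app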

Related

CUDA dynamic parallelism is computing sequentially

I need to write an application that computes some matrices from other matrices. In general, it sums outer products of the rows of an initial matrix E and multiplies them by some numbers calculated from v and t, for each t in a given range. I am a newbie to CUDA, so there might be some very wrong ideas in the implementation. Here is my code, with some explanation in the comments:
#include <cupy/complex.cuh>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
const int BLOCK_SIZE = 16;
const int DIM_SIZE = 16;
const double d_hbar=1.0545718176461565e-34;
extern "C"
struct sum_functor { //sum functor for thrust::transform, sums an array of matrices
int N;
complex<float> *arr;
complex<float> *result;
__host__ __device__ sum_functor(int _N, complex<float>* _arr, complex<float>* _result) : N(_N), arr(_arr), result(_result) {};
__host__ __device__ complex<float> operator()(int n){
complex<float> sum = result[n];
for (int i = 0; i < BLOCK_SIZE; i++) {
sum += arr[N * N * i + n];
}
return sum;
}
};
extern "C" //outer product multiplied by given exp and rho
__global__ void outer(const complex<float>* E, int size, complex<float>* blockResult,
int m, int n, complex<float> exp, complex<float> rho) {
int col = blockIdx.y*blockDim.y+threadIdx.y;
int row = blockIdx.x*blockDim.x+threadIdx.x;
if (row < size && col < size) {
blockResult[row * size + col] = exp * rho * E[m * size + row] * E[n * size + col];
}
}
//compute constants and launch outer product kernels
//outer products are stored in blockResult, i.e. 16 matrices one after another
extern "C"
__global__ void calcBlock(const complex<float>* v, const complex<float>* E, int size, double t,
int rhoind, complex<float>* blockResult, int blockInd) {
int i = threadIdx.x;
int j = i + blockInd;
int m = j / size;
int n = j % size;
if (m < size && n < size) {
const complex<float>hbar(d_hbar);
complex<float> exp = thrust::exp(complex<float>(0, -1)*(v[m] - v[n]) * complex<float>(t)/hbar);
complex<float> rho = E[m * size + rhoind] * E[n * size + rhoind];
dim3 dimGrid((size - 1)/DIM_SIZE + 1, (size - 1) / DIM_SIZE + 1, 1);
dim3 dimBlock(DIM_SIZE, DIM_SIZE, 1);
outer<<<dimGrid, dimBlock>>>(E, size, blockResult + i * size * size, m, n, exp, rho);
}
}
//launch the block calculation, then sum all the matrices in the block and add them to the result
//repeat block by block until all size*size matrices in total are summed
extern "C"
__global__ void calcSum(const complex<float>* v, const complex<float>* E, int size, double t, int ind,
int rhoind, complex<float>* blockResult, complex<float>* result, int* resultIndexes) {
for (int i = 0; i < size * size; i += BLOCK_SIZE) {
calcBlock<<<1, BLOCK_SIZE>>>(v, E, size, t, rhoind, blockResult, i);
cudaDeviceSynchronize();
thrust::transform(thrust::device, resultIndexes,
resultIndexes + size * size,
result + ind * size * size, sum_functor(size, blockResult, result + ind * size * size));
}
}
//launch calcSum in parallel for every t in range
extern "C"
__global__ void eigenMethod(const complex<float>* v, const complex<float>* E, int size, const double* t, int t_size,
int rhoind, complex<float>* blockResult, complex<float>* result, int* resultIndexes) {
int i = threadIdx.x;
if (i < t_size) {
calcSum<<<1, 1>>>(v, E, size, t[i], i, rhoind, blockResult + i * BLOCK_SIZE * size * size, result, resultIndexes);
}
}
//main is simplified because I am using CuPy
int main() {
*Calculate E(size * size), v(size)*
*t is vector of size t_size*
*Initialize blockResult(t_size*BLOCK_SIZE*size*size)*
*resultIndexes(size*size) is enumerate from 0 to size * size)*
*result(t_size*size*size) filled with zeros*
eigenMethod<<<1, t_size>>>(v, E, size, t, t_size, 0, blockResult, result, resultIndexes);
}
The overall idea might be strange and stupid, but it is working. The problem I've encountered, though, is that all calcSum kernels called from eigenMethod are scheduled one after another.
The calcSum function and everything above it works fast enough for the purposes for which it was created. The main problem arises when I try to call multiple of these from the eigenMethod function. I have tried benchmarking it and got a linear dependence between runtime and the number of calls. For example, the eigenMethod function with t_size = 32 runs almost two times faster than with t_size = 64.
Also, I have tried profiling it, but did not get the information that I wanted, since Nsight Systems does not support CDP (CUDA Dynamic Parallelism) according to the topics I saw. I think that accessing the same part of global memory (arrays E and v are the same pointers for all functions I call) might be a problem. As a hotfix, I have created individual copies for every calcSum function, but it did not help. Is there a way to compute multiple calcSum kernels in parallel? The benchmark results are listed below (matrix size is 128x128):
t_size    time, s
1         0.32
4         1.036
8         1.9
16        3.6
By not specifying a stream, you are using the default stream, which is shared among threads of the same block. Therefore all launches of calcSum go into the same stream and have to be executed one after another. This can be fixed by using explicit streams instead.
"[A]ccessing the same part of global memory" from multiple kernels has nothing to do with this. As long as the kernels are only reading from the same locations and not writing to them, this is not problematic at all. Writing to the same locations would cause race conditions and therefore potentially non-deterministic output, but the CUDA runtime cannot detect this and won't "sequentialize" your kernels to get around it.
As discussed in the comments, I don't think CDP is needed here at all; it is potentially expensive and, in this form, not future-proof, so performance will most probably not be ideal.
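A minimal sketch of the explicit-stream variant (my illustration, reusing the question's eigenMethod signature): each thread creates its own non-blocking stream for its child launch, cudaStreamCreateWithFlags with cudaStreamNonBlocking being the form available in device code.

__global__ void eigenMethod(const complex<float>* v, const complex<float>* E, int size, const double* t, int t_size,
                            int rhoind, complex<float>* blockResult, complex<float>* result, int* resultIndexes) {
    int i = threadIdx.x;
    if (i < t_size) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking); // per-thread stream instead of the shared default stream
        calcSum<<<1, 1, 0, s>>>(v, E, size, t[i], i, rhoind,
                                blockResult + i * BLOCK_SIZE * size * size, result, resultIndexes);
        cudaStreamDestroy(s); // released once the child work has completed
    }
}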

Wrong sizes for block reduction in CUDA?

I was checking out this sum_reduction.cu example and tutorial and noticed that it doesn't work for certain problem sizes, e.g. it works with problem size n=2000 but not with n=3000. Apparently it always works with problem sizes that are a multiple of the block size, but neither the tutorial nor the example code states so. The question is: does this reduction algorithm only work for certain problem sizes? In the example they chose N=256k, which is even, a power of two, and also a multiple of the block size 512.
For self-containment, I paste the most important bits of (a template version of) the code here:
template<typename T>
__global__ void kernelSum(const T* __restrict__ input, T* __restrict__ per_block_results, const size_t n) {
extern __shared__ T sdata[];
size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
// load input into __shared__ memory
T x = 0.0;
if (tid < n) {
x = input[tid];
}
sdata[threadIdx.x] = x;
__syncthreads();
// contiguous range pattern
for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
if(threadIdx.x < offset) {
// add a partial sum upstream to our own
sdata[threadIdx.x] += sdata[threadIdx.x + offset];
}
// wait until all threads in the block have
// updated their partial sums
__syncthreads();
}
// thread 0 writes the final result
if(threadIdx.x == 0) {
per_block_results[blockIdx.x] = sdata[0];
}
}
and to invoke the kernel:
// launch one kernel to compute, per-block, a partial sum
block_sum<double> <<<num_blocks,block_size,block_size * sizeof(double)>>>(d_input, d_partial_sums_and_total, num_elements);
// launch a single block to compute the sum of the partial sums
block_sum<double> <<<1,num_blocks,num_blocks * sizeof(double)>>>(d_partial_sums_and_total, d_partial_sums_and_total + num_blocks, num_blocks);
To my understanding, when the problem size doesn't fill the last block, the statement T x = 0.0; ensures that the extra elements are zeroed out, so it should work, but it doesn't?
UPDATE: I am sorry the float/double thing was a typo while preparing the question and not the real problem.
The code you have posted is not consistent, as your templated kernel
is called kernelSum but you are invoking something called
block_sum.
Furthermore, I don't believe your usage of the templated kernel
function could possibly be correct as written:
block_sum<double> <<<num_blocks,block_size,block_size * sizeof(float)>>>(d_input, d_partial_sums_and_total, num_elements);
(the template type, double, and the type used to size the dynamic shared memory, float, are required to match)
The kernel template is being instantiated with type double. Therefore it is expecting enough shared memory to store block_size double quantities, based on this line:
extern __shared__ T sdata[];
But you are only passing half of the required storage:
block_size * sizeof(float)
I believe that's going to give you unexpected results.
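For reference, the matching launch (the same line used in the complete program further below) sizes the dynamic shared memory with sizeof(double):

block_sum<double> <<<num_blocks, block_size, block_size * sizeof(double)>>>(d_input, d_partial_sums_and_total, num_elements);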
The reduction as written does expect that the block
dimension is a power of 2, due to this loop:
// contiguous range pattern
for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
This is not likely to be an issue on the first kernel call, because you are probably choosing a power of two for the number of threads per block (block_size):
block_sum<double> <<<num_blocks,block_size,...
However, for the second kernel call, this will depend on whether num_blocks is a power of two, which depends on your grid calculations, which you haven't shown:
block_sum<double> <<<1,num_blocks,...
Finally, the first kernel launch will fail if num_blocks exceeds the limit for your device. This may happen for very large data sets but probably not for size 3000, and it depends on your grid calculations which you haven't shown.
The third item above (the power-of-2 requirement) is difficult to satisfy on the fly for arbitrary vector sizes. Therefore I would suggest an alternate reduction strategy to handle arbitrarily sized vectors. For this I would suggest that you study the CUDA reduction sample code and presentation.
Here's a complete program, mostly based on the code you have shown, that has the above issues addressed, and seems to work for me for a size of 3000:
#include <stdio.h>
#include <stdlib.h>
#define DSIZE 3000
#define nTPB 256
template<typename T>
__global__ void block_sum(const T* __restrict__ input, T* __restrict__ per_block_results, const size_t n) {
extern __shared__ T sdata[];
size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
// load input into __shared__ memory
T x = 0.0;
if (tid < n) {
x = input[tid];
}
sdata[threadIdx.x] = x;
__syncthreads();
// contiguous range pattern
for(int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
if(threadIdx.x < offset) {
// add a partial sum upstream to our own
sdata[threadIdx.x] += sdata[threadIdx.x + offset];
}
// wait until all threads in the block have
// updated their partial sums
__syncthreads();
}
// thread 0 writes the final result
if(threadIdx.x == 0) {
per_block_results[blockIdx.x] = sdata[0];
}
}
int main(){
double *d_input, *d_partial_sums_and_total, *h_input, *h_partial_sums_and_total;
int num_elements=DSIZE;
int block_size = nTPB;
int num_blocks = (num_elements + block_size -1)/block_size;
// bump num_blocks up to the next power of 2
int done = 0;
int test_val = 1;
while (!done){
if (test_val >= num_blocks){
num_blocks = test_val;
done = 1;}
else test_val *= 2;
if (test_val > 65535) {printf("blocks failure\n"); exit(1);}
}
h_input = (double *)malloc(num_elements * sizeof(double));
h_partial_sums_and_total = (double *)malloc((num_blocks+1)*sizeof(double));
cudaMalloc((void **)&d_input, num_elements * sizeof(double));
cudaMalloc((void **)&d_partial_sums_and_total, (num_blocks+1)*sizeof(double));
double h_result = 0.0;
for (int i = 0; i < num_elements; i++) {
h_input[i] = rand()/(double)RAND_MAX;
h_result += h_input[i];}
cudaMemcpy(d_input, h_input, num_elements*sizeof(double), cudaMemcpyHostToDevice);
cudaMemset(d_partial_sums_and_total, 0, (num_blocks+1)*sizeof(double));
// launch one kernel to compute, per-block, a partial sum
block_sum<double> <<<num_blocks,block_size,block_size * sizeof(double)>>>(d_input, d_partial_sums_and_total, num_elements);
// launch a single block to compute the sum of the partial sums
block_sum<double> <<<1,num_blocks,num_blocks * sizeof(double)>>>(d_partial_sums_and_total, d_partial_sums_and_total + num_blocks, num_blocks);
cudaMemcpy(h_partial_sums_and_total, d_partial_sums_and_total, (num_blocks+1)*sizeof(double), cudaMemcpyDeviceToHost);
printf("host result = %lf\n", h_result);
printf("device result = %lf\n", h_partial_sums_and_total[num_blocks]);
}
For brevity/readability, I have dispensed with error checking in the above code. When having difficulty with CUDA code, you should always do proper CUDA error checking.
Also, in the future, you will make it easier for others to help you if you post a complete code to demonstrate what you are doing, as I have done above.
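For illustration, a minimal version of such error checking might look like this (gpuErrChk is just the macro name used in the first question above; any wrapper of this shape works):

#include <cstdio>
#include <cstdlib>

// Wrap every runtime API call, and check the kernel launch afterwards.
#define gpuErrChk(call)                                                     \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                    cudaGetErrorString(err), __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// usage:
// gpuErrChk(cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice));
// block_sum<double><<<num_blocks, block_size, shmem>>>(d_input, d_out, n);
// gpuErrChk(cudaGetLastError());        // catches launch configuration errors
// gpuErrChk(cudaDeviceSynchronize());   // catches asynchronous execution errors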

Shared Memory of Objects in CUDA and libc++abi.dylib error

I have the following problem (keep in mind that I am fairly new to programming with CUDA).
I have a class called vec3f that is just like the float3 data type but with overloaded operators and other vector functions. These functions are prefixed with __device__ __host__. Then, in my kernel I do a nested for loop over block_x and block_y indices and do something like:
//set up shared memory block
extern __shared__ vec3f share[];
vec3f *sh_pos = share;
vec3f *sh_velocity = &sh_pos[blockDim.x*blockDim.y];
sh_pos[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].position();
sh_velocity[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].velocity();
__syncthreads();
In the above code, oldParticles is a pointer to a class called particle that is being passed to the kernel. oldParticles is actually the underlying pointer of a thrust::device_vector (I'm not sure if this has something to do with it). Everything compiles okay, but when I run I get the error:
libc++abi.dylib: terminate called throwing an exception
Abort trap: 6
Thanks for the replies. I think the error had to do with me not allocating room for the arguments being passed to my kernel. Doing the following in my host code fixed this error,
particle* particle_ptrs[2];
particle_ptrs[0] = thrust::raw_pointer_cast(&d_old_particles[0]);
particle_ptrs[1] = thrust::raw_pointer_cast(&d_new_particles[0]);
CUDA_SAFE_CALL( cudaMalloc( (void**)&particle_ptrs[0], max_particles * sizeof(particle) ) );
CUDA_SAFE_CALL( cudaMalloc( (void**)&particle_ptrs[1], max_particles * sizeof(particle) ) );
The kernel call is then,
force_kernel<<< grid,block,sharedMemSize >>>(particle_ptrs[0],particle_ptrs[1],time_step);
The issue that I am having now seems to be that I can't get data copied back to the host from the device. I think this has to do with me not being familiar with thrust.
I'm doing a series of copies as follows:
//make a host vector assume this is initialized
thrust::host_vector<particle> h_particles;
thrust::device_vector<particle> d_old_particles, d_new_particles;
d_old_particles = h_particles;
//launch kernel as shown above
//with thrust vectors having been casted into their underlying pointers
//particle_ptrs[1] gets modified, and so shouldn't d_new_particles?
//copy back
h_particles = d_new_particles;
So I guess my question is: can I modify a thrust device vector in a kernel (in this case particle_ptrs[0]), save the modification to another thrust device vector in the kernel (in this case particle_ptrs[1]), and then, once I exit from the kernel, copy that to a host vector?
I still can't get this to work. I made a shorter example where I am having the same problem,
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include "vec3f.h"
const int BLOCK_SIZE = 8;
const int max_particles = 64;
const float dt = 0.01;
using namespace std;
//particle class
class particle {
public:
particle() :
_velocity(vec3f(0,0,0)), _position(vec3f(0,0,0)), _density(0.0) {
};
particle(const vec3f& pos, const vec3f& vel) :
_position(pos), _velocity(vel), _density(0.0) {
};
vec3f _velocity;
vec3f _position;
float _density;
};
//forward declaration of kernel func
__global__ void kernel_func(particle* old_parts, particle* new_parts, float dt);
//global thrust vectors
thrust::host_vector<particle> h_parts;
thrust::device_vector<particle> old_parts, new_parts;
particle* particle_ptrs[2];
int main() {
//load host vector
for (int i =0; i<max_particles; i++) {
h_parts.push_back(particle(vec3f(0.5,0.5,0.5),vec3f(10,10,10)));
}
particle_ptrs[0] = thrust::raw_pointer_cast(&old_parts[0]);
particle_ptrs[1] = thrust::raw_pointer_cast(&new_parts[0]);
cudaMalloc( (void**)&particle_ptrs[0], max_particles * sizeof(particle) );
cudaMalloc( (void**)&particle_ptrs[1], max_particles * sizeof(particle) );
//copy host particles to old device particles...
old_parts = h_parts;
//kernel block and grid dimensions
dim3 block(BLOCK_SIZE,BLOCK_SIZE,1);
dim3 grid(int(sqrt(float(max_particles) / (float(block.x*block.y)))), int(sqrt(float(max_particles) / (float(block.x*block.y)))), 1);
kernel_func<<<block,grid>>>(particle_ptrs[0],particle_ptrs[1],dt);
//copy new device particles back to host particles
h_parts = new_parts;
for (int i =0; i<max_particles; i++) {
particle temp1 = h_parts[i];
cout << temp1._position << endl;
}
//delete thrust device vectors
old_parts.clear();
old_parts.shrink_to_fit();
new_parts.clear();
new_parts.shrink_to_fit();
return 0;
}
//kernel function
__global__ void kernel_func(particle* old_parts, particle* new_parts, float dt) {
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
//get array position for 2d grid...
unsigned int arr_pos = y*blockDim.x*gridDim.x + x;
new_parts[arr_pos]._velocity = old_parts[arr_pos]._velocity * 10.0 * dt;
new_parts[arr_pos]._position = old_parts[arr_pos]._position * 10.0 * dt;
new_parts[arr_pos]._density = old_parts[arr_pos]._density * 10.0 * dt;
}
So the host vector has an initial position of (0.5,0.5,0.5) for all 64 particles. Then the kernel attempts to multiply that by 10 to give (5,5,5) as the position for all particles. But I don't see this when I "cout" the data. It is still just (0.5,0.5,0.5). Is there a problem with how I am allocating memory? Is there a problem with the lines:
//copy new device particles back to host particles
h_parts = new_parts;
What could be the issue? Thank you.
There are various problems with the code you have posted.
You have your block and grid variables reversed in your kernel invocation; grid comes first.
You should be doing CUDA error checking on your kernel launches and runtime API calls.
Your method of allocating storage using cudaMalloc on a pointer which has been raw-cast from an empty device vector is not sensible. The vector container has no knowledge that you did this "under the hood." Instead, you can directly allocate storage for the device vector when you instantiate it, like:
thrust::device_vector<particle> old_parts(max_particles), new_parts(max_particles);
You say you're expecting 5,5,5, but your kernel is multiplying by 10 and then by dt which is 0.01, so I believe the correct output is 0.05, 0.05, 0.05
Your grid computation (int(sqrt...)), for an arbitrary max_particles either is not guaranteed to produce enough blocks (if casting a float to int truncates or rounds down) or will produce extra blocks (if it rounds up). The round down case is bad. We should handle that by using a ceil function or another grid computation method. The round up case (which is what ceil will do) is OK, but we need to handle the fact that the grid may launch extra blocks/threads. We do that with a thread check in the kernel. There were other problems with the grid computation as well. We want to take the square root of max_particles, then divide it by the block dimension in a particular direction, to get the grid dimension in that direction.
Here's some code that I've modified with these changes in mind; it seems to produce the correct output (0.05, 0.05, 0.05). Note that I had to make some other changes because I don't have your "vec3f.h" header file handy, so I used float3 instead.
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <vector_functions.h>
const int BLOCK_SIZE = 8;
const int max_particles = 64;
const float dt = 0.01;
using namespace std;
//particle class
class particle {
public:
particle() :
_velocity(make_float3(0,0,0)), _position(make_float3(0,0,0)), _density(0.0)
{
};
particle(const float3& pos, const float3& vel) :
_position(pos), _velocity(vel), _density(0.0)
{
};
float3 _velocity;
float3 _position;
float _density;
};
//forward declaration of kernel func
__global__ void kernel_func(particle* old_parts, particle* new_parts, float dt);
int main() {
//global thrust vectors
thrust::host_vector<particle> h_parts;
particle* particle_ptrs[2];
//load host vector
for (int i =0; i<max_particles; i++) {
h_parts.push_back(particle(make_float3(0.5,0.5,0.5),make_float3(10,10,10)));
}
//copy host particles to old device particles...
thrust::device_vector<particle> old_parts = h_parts;
thrust::device_vector<particle> new_parts(max_particles);
particle_ptrs[0] = thrust::raw_pointer_cast(&old_parts[0]);
particle_ptrs[1] = thrust::raw_pointer_cast(&new_parts[0]);
//kernel block and grid dimensions
dim3 block(BLOCK_SIZE,BLOCK_SIZE,1);
dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x))), (int)ceil(sqrt(float(max_particles)) / (float(block.y))), 1);
cout << "grid x: " << grid.x << " grid y: " << grid.y << endl;
kernel_func<<<grid,block>>>(particle_ptrs[0],particle_ptrs[1],dt);
//copy new device particles back to host particles
cudaDeviceSynchronize();
h_parts = new_parts;
for (int i =0; i<max_particles; i++) {
particle temp1 = h_parts[i];
cout << temp1._position.x << "," << temp1._position.y << "," << temp1._position.z << endl;
}
//delete thrust device vectors
old_parts.clear();
old_parts.shrink_to_fit();
new_parts.clear();
new_parts.shrink_to_fit();
return 0;
}
//kernel function
__global__ void kernel_func(particle* old_parts, particle* new_parts, float dt) {
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
//get array position for 2d grid...
unsigned int arr_pos = y*blockDim.x*gridDim.x + x;
if (arr_pos < max_particles) {
new_parts[arr_pos]._velocity.x = old_parts[arr_pos]._velocity.x * 10.0 * dt;
new_parts[arr_pos]._velocity.y = old_parts[arr_pos]._velocity.y * 10.0 * dt;
new_parts[arr_pos]._velocity.z = old_parts[arr_pos]._velocity.z * 10.0 * dt;
new_parts[arr_pos]._position.x = old_parts[arr_pos]._position.x * 10.0 * dt;
new_parts[arr_pos]._position.y = old_parts[arr_pos]._position.y * 10.0 * dt;
new_parts[arr_pos]._position.z = old_parts[arr_pos]._position.z * 10.0 * dt;
new_parts[arr_pos]._density = old_parts[arr_pos]._density * 10.0 * dt;
}
}
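One further note on the allocation point above (a sketch; data() and raw_pointer_cast are standard Thrust, the launch line is only illustrative): the raw pointer is only meaningful once the device_vector actually owns storage, so size the vector first and cast afterwards.

thrust::device_vector<particle> new_parts(max_particles);       // allocates max_particles elements
particle *raw_new = thrust::raw_pointer_cast(new_parts.data()); // valid: points at that storage
kernel_func<<<grid, block>>>(thrust::raw_pointer_cast(old_parts.data()), raw_new, dt);
// h_parts = new_parts;   (this assignment is what performs the device-to-host copy)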

Using CUDA Shared Memory to Improve Global Access Patterns

I have the following kernel to get the magnitude of a bunch of vectors:
__global__ void norm_v1(double *in, double *out, int n)
{
    const uint i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        double x = in[3*i], y = in[3*i+1], z = in[3*i+2];
        out[i] = sqrt(x*x + y*y + z*z);
    }
}
However, due to the packing of in as [x0,y0,z0,...,xn,yn,zn], it performs poorly, with the profiler indicating a 32% global load efficiency. Repacking the data as [x0, x1, ..., xn, y0, y1, ..., yn, z0, z1, ..., zn] improves things greatly (with the offsets for x, y, and z changing accordingly). Runtime is down and efficiency is up to 100%.
However, this packing is simply not practical for my application. I therefore wish to investigate the use of shared memory. My idea is for each thread in a block to copy three values (blockDim.x apart) from global memory -- yielding coalesced access. Under the assumption of a maximum blockDim.x = 256 I came up with:
#define BLOCKDIM 256

__global__ void norm_v2(double *in, double *out, int n)
{
    __shared__ double invec[3*BLOCKDIM];
    const uint i = blockIdx.x * blockDim.x + threadIdx.x;
    invec[0*BLOCKDIM + threadIdx.x] = in[0*BLOCKDIM+i];
    invec[1*BLOCKDIM + threadIdx.x] = in[1*BLOCKDIM+i];
    invec[2*BLOCKDIM + threadIdx.x] = in[2*BLOCKDIM+i];
    __syncthreads();
    if (i < n)
    {
        double x = invec[3*threadIdx.x];
        double y = invec[3*threadIdx.x+1];
        double z = invec[3*threadIdx.x+2];
        out[i] = sqrt(x*x + y*y + z*z);
    }
}
However, this is clearly deficient when n % blockDim.x != 0, requires knowing the maximum blockDim in advance, and generates incorrect results for out[i > 255] when tested with n = 1024. How should I best remedy this?
I think this can solve the out[i > 255] problem:
__shared__ double invec[3*BLOCKDIM];
const uint blockStart = blockIdx.x * blockDim.x;
invec[0*blockDim.x + threadIdx.x] = in[blockStart*3 + 0*blockDim.x + threadIdx.x];
invec[1*blockDim.x + threadIdx.x] = in[blockStart*3 + 1*blockDim.x + threadIdx.x];
invec[2*blockDim.x + threadIdx.x] = in[blockStart*3 + 2*blockDim.x + threadIdx.x];
__syncthreads();
double x = invec[3*threadIdx.x];
double y = invec[3*threadIdx.x+1];
double z = invec[3*threadIdx.x+2];
out[blockStart + threadIdx.x] = sqrt(x*x + y*y + z*z);
As for n % blockDim.x != 0 I would suggest padding the input/output arrays with 0 to match the requirement.
If you dislike the BLOCKDIM macro, explore using extern __shared__ double shArr[] and then passing a third parameter to the kernel launch configuration:
norm_v2<<<gridSize,blockSize,dynShMem>>>(...)
dynShMem is the dynamic shared memory size (in bytes). This is an extra shared memory pool whose size is specified at run time, and it is where all extern __shared__ variables are placed.
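A minimal sketch of that dynamic-shared-memory variant (my illustration; like the snippet above, it still assumes n is a multiple of blockDim.x, per the padding suggestion):

__global__ void norm_v2_dyn(double *in, double *out, int n)
{
    extern __shared__ double invec[];   // sized at launch: 3 * blockDim.x doubles
    const uint blockStart = blockIdx.x * blockDim.x;
    invec[0*blockDim.x + threadIdx.x] = in[blockStart*3 + 0*blockDim.x + threadIdx.x];
    invec[1*blockDim.x + threadIdx.x] = in[blockStart*3 + 1*blockDim.x + threadIdx.x];
    invec[2*blockDim.x + threadIdx.x] = in[blockStart*3 + 2*blockDim.x + threadIdx.x];
    __syncthreads();
    double x = invec[3*threadIdx.x];
    double y = invec[3*threadIdx.x+1];
    double z = invec[3*threadIdx.x+2];
    out[blockStart + threadIdx.x] = sqrt(x*x + y*y + z*z);
}
// launch: norm_v2_dyn<<<gridSize, blockSize, 3 * blockSize * sizeof(double)>>>(in, out, n);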
What GPU are you using? Fermi or Kepler might help your original code with their L1 caching.
If you don't want to pad your in array, or you end up doing a similar trick somewhere else, you may want to consider implementing a device-side memcopy, something like this:
template <typename T>
__device__ void memCopy(T* destination, T* source, size_t numElements) {
    //assuming sizeof(T) is a multiple of sizeof(int)
    //assuming a one-dimensional kernel (only threadIdx.x and blockDim.x matter)
    size_t totalSize = numElements*sizeof(T)/sizeof(int);
    int* intDest = (int*)destination;
    int* intSrc = (int*)source;
    for (size_t i = threadIdx.x; i < totalSize; i += blockDim.x) {
        intDest[i] = intSrc[i];
    }
    __syncthreads();
}
It basically treats any array as an array of ints and copies the data from one location to another. You may want to replace the underlying int type with double or long long int if you work with 64-bit types only.
Then you can replace the copying lines with:
memCopy(invec, in + blockStart*3, 3*min(blockDim.x, n - blockStart));
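Putting the pieces together, a possible (untested) sketch of the whole kernel. Note the factor of 3 in the element count (each vector contributes three doubles) and the bounds check for the final, partially filled block:

__global__ void norm_v3(double *in, double *out, int n)
{
    extern __shared__ double invec[];                    // 3 * blockDim.x doubles, see launch below
    const uint blockStart = blockIdx.x * blockDim.x;
    const uint elems = min(blockDim.x, n - blockStart);  // vectors handled by this block
    memCopy(invec, in + blockStart*3, 3*elems);          // cooperative, coalesced staging
    if (threadIdx.x < elems)
    {
        double x = invec[3*threadIdx.x];
        double y = invec[3*threadIdx.x+1];
        double z = invec[3*threadIdx.x+2];
        out[blockStart + threadIdx.x] = sqrt(x*x + y*y + z*z);
    }
}
// launch: norm_v3<<<(n + blockSize - 1)/blockSize, blockSize, 3 * blockSize * sizeof(double)>>>(in, out, n);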

2D kernel calling and launch parameters for non-square matrix

I am attempting to port the following (simplified) nested loop to a CUDA 2D kernel. The sizes of NgS and NgO will increase with larger data sets; for now I just want to get this kernel to output the correct results for all values:
// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
int NgS = 1859;
int NgO = 900;
// 1D flattened matrices have been initialized as:
Radio_cpu = new double [NgS*NgO];
Result_cpu = new double [NgS*NgO];
// ignoring the part where they are filled w/ data
for (m=0; m<NgO; m++) {
for (n=0; n<NgS; n++) {
Result_cpu[idx(n,m,NgO)] = k0*Radio_cpu[idx(n,m,NgO)];
}
}
The examples I have come across usually deal with square loops, and I have been unable to get the correct output for all the GPU array indices compared to the CPU version. Here is the host code calling the kernel:
dim3 dimBlock(16, 16);
dim3 dimGrid;
dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;
// Result_gpu and Radio_gpu are allocated versions of the CPU variables on GPU
trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, Radio_gpu, Result_gpu);
Here is the kernel:
__global__ void trans(int NgO, int NgS,
double k0, double * Radio, double * Result) {
int n = blockIdx.x * blockDim.x + threadIdx.x;
int m = blockIdx.y * blockDim.y + threadIdx.y;
if(n > NgS || m > NgO) return;
// map the two 2D indices to a single linear, 1D index
int grid_width = gridDim.x * blockDim.x;
int idxxx = m + (n * grid_width);
Result[idxxx] = k0 * Radio[idxxx];
}
With the current code, I proceeded to compare the Result_cpu variable with Result_gpu variable once copied back. When I cycle through the values I get:
// matches from NgS = 0...913
Result_gpu[NgS = 913][NgO = 0]: -56887.2
Result_cpu[Ngs = 913][NgO = 0]: -56887.2
// mismatches from NgS = 914...1858
Result_gpu[NgS = 914][NgO = 0]: -12.2352
Result_cpu[NgS = 914][NgO = 0]: 79448.6
This pattern is the same regardless of the value of NgO. I have been trying to figure out where I made a mistake by looking at various examples for a few hours and trying out changes, but so far this scheme has worked, minus the obvious issue at hand, whereas the others have caused kernel invocation errors or left the GPU array uninitialized for all values. Since I clearly cannot see the mistake, I'd really appreciate it if someone could point me in the right direction towards a fix. I'm pretty sure it's right under my nose and I can't see it.
In case it matters, I'm testing this code on a Kepler card, compiling using MSVC 2010, CUDA 4.2 and 304.79 driver and have compiled the code with both arch=compute_20,code=sm_20 and arch=compute_30,code=compute_30 flags with no difference.
@vaca_loca: I tested the following kernel (it works for me also with non-square block dimensions):
__global__ void trans(int NgO, int NgS,
double k0, double * Radio, double * Result) {
int n = blockIdx.x * blockDim.x + threadIdx.x;
int m = blockIdx.y * blockDim.y + threadIdx.y;
if(n >= NgO || m >= NgS) return;
int ofs = m * NgO + n;
Result[ofs] = k0 * Radio[ofs];
}
void test() {
int NgS = 1859, NgO = 900;
int data_sz = NgS * NgO, bytes = data_sz * sizeof(double);
cudaSetDevice(0);
double *Radio_cpu = new double [data_sz*3],
*Result_cpu = Radio_cpu + data_sz,
*Result_gpu = Result_cpu + data_sz;
double k0 = -1.7961233;
srand48(time(NULL));
int i, j, n, m;
for(m=0; m<NgO; m++) {
for (n=0; n<NgS; n++) {
Radio_cpu[m + n*NgO] = lrand48() % 234234;
Result_cpu[m + n*NgO] = k0*Radio_cpu[m + n*NgO];
}
}
double *g_Radio, *g_Result;
cudaMalloc((void **)&g_Radio, bytes * 2);
g_Result = g_Radio + data_sz;
cudaMemcpy(g_Radio, Radio_cpu, bytes, cudaMemcpyHostToDevice);
dim3 dimBlock(16, 16);
dim3 dimGrid;
dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;
trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, g_Radio, g_Result);
cudaMemcpy(Result_gpu, g_Result, bytes, cudaMemcpyDeviceToHost);
for(m=0; m<NgO; m++) {
for (n=0; n<NgS; n++) {
double c1 = Result_cpu[m + n*NgO],
c2 = Result_gpu[m + n*NgO];
if(std::abs(c1-c2) > 1e-4)
printf("(%d;%d): %.7f %.7f\n", n, m, c1, c2);
}
}
cudaFree(g_Radio);
delete []Radio_cpu;
}
Though, in my opinion, accessing data from global memory using quads might not be very cache-friendly, since the access stride is pretty large. You might consider using 2D textures instead if it's critical for your algorithm to access data with 2D locality.
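For completeness, a rough sketch of the texture-object route (my illustration, not part of the original answer; it assumes float data, since double would have to go through an int2 texture and __hiloint2double, and it assumes the array was allocated with cudaMallocPitch so the pitch meets the texture alignment requirement):

cudaTextureObject_t makeTex2D(float *d_data, size_t pitchBytes, int width, int height)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = d_data;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    resDesc.res.pitch2D.width = width;              // in elements
    resDesc.res.pitch2D.height = height;
    resDesc.res.pitch2D.pitchInBytes = pitchBytes;  // as returned by cudaMallocPitch

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

__global__ void trans_tex(int NgO, int NgS, float k0, cudaTextureObject_t radioTex, float *Result)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    if (n >= NgO || m >= NgS) return;
    Result[m * NgO + n] = k0 * tex2D<float>(radioTex, n, m);   // x = column, y = row
}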