CUDA and Monte Carlo with local behavior defined

I have a question about a strange behavior in CUDA.
I am currently developing a Monte Carlo simulation of particle trajectories and I am doing the following thing.
The position p(n) of my particle at a given date t(n) depends on the position p(n-1) of my particle at the previous date t(n-1). Indeed, let's say the value v(n) is computed from the value p(n-1). Here is a simplified example of my code:
__device__ inline double calculateStep( double drift, double vol, double dt, double randomWalk, double S_t){
    return exp((drift - vol*vol*0.5)*dt + randomWalk*vol*sqrt(dt))*S_t;
}

__device__ double doSomethingWith(double v_n, ....) {
    ...
    return v_n*exp(t)*S;
}
__global__ void myMCsimulation( double* matrice, double * randomWalk, int nbreSimulation, int nPaths, double drift, ......) {
    double dt = T/nPaths;
    unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int stride = blockDim.x*gridDim.x;
    unsigned int index = tid;
    double mydt = (index - nbreSimulation)/nbreSimulation*dt + dt;
    for ( index = tid; index < nbreSimulation*nPaths; index += stride) {
        if (index >= nbreSimulation)
        {
            double v_n = DoSomethingWith(drift, dt, matrice[index - nbreSimulation]);
            matrice[index] = matrice[index - nbreSimulation] * calculateStep(drift, v_n, dt, randomWalk[index]);
        }
        ...
    }
The last code line:
matrice[index] = matrice[index - nbreSimulation] * calculateStep(drift, v_n, dt, randomWalk[index]);
enables me to fill in only the second row of the matrix matrice. I don't know why.
When I change the code line to:
matrice[index] = DoSomethingWith(drift, dt, matrice[index - nbreSimulation]);
my matrix is filled in correctly and all my values are changed; I am then able to get back matrice[index - nbreSimulation].
I think this is a concurrent access issue, but I am not sure. I tried __syncthreads() but it did not work.
Could someone please help on this point?
Many thanks

I have changed my code to the following and now it works perfectly.
if (index < nbreSimulation) {
    matrice[index] = S0;
    for (workingCol = 1; workingCol < nPaths; workingCol++) {
        previousMove = index;
        index = index + nbreSimulation;
        ................
        matrice[index] = calculateStep(drift, vol_int[index], dt, randomWalk[index], matrice[previousMove]);
    }
}
}

I have tried the following thing:
I declared a shared variable (an array of doubles) which contains the value computed at each iteration:
__shared__ double mat[];
......
for ( index = tid; index < nbreSimulation*nPaths; index += stride) {
.....
mat[index] = computedValue;
......
}
Without success. Does anyone see the issue?
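For reference, here is a minimal self-contained sketch of the working one-thread-per-path pattern described above (the names S0 and vol and the exact step formula are assumptions pieced together from the snippets in this question, not the original code):
__global__ void mcPathKernel(double *matrice, const double *randomWalk,
                             int nbreSimulation, int nPaths,
                             double S0, double drift, double vol, double dt)
{
    // Hypothetical sketch: one thread per simulation path, stepping forward in time.
    // Each step only reads the value this same thread wrote in the previous iteration,
    // so there is no dependence on how far other threads have progressed.
    unsigned int sim = threadIdx.x + blockDim.x * blockIdx.x;
    if (sim >= nbreSimulation) return;

    double S = S0;
    matrice[sim] = S;                               // row 0: initial value
    for (int step = 1; step < nPaths; ++step) {
        int idx = step * nbreSimulation + sim;      // row = time step, column = simulation
        S = exp((drift - 0.5 * vol * vol) * dt
                + vol * sqrt(dt) * randomWalk[idx]) * S;
        matrice[idx] = S;
    }
}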


CUDA - Malloc inside kernel ( compute_50,sm_50 )

I had a problem while running a program with the CUDA Memory Checker.
In other threads on Stack Overflow, the main problem with using malloc inside a kernel was that "compute_50,sm_50" was not set properly. Here the code compiles, so this is not the problem.
The problem is now solved, but I don't understand why the new code solved it.
My question is: why is it working now?
Old code:
__device__ unsigned int draw_active_levels(curandState * localState, const int num_levels_max){
    unsigned int return_value = 0;
    float draw;
    draw = curand_uniform(localState);
    int num_active_levels = floorf(draw * (num_levels_max - 1)) + 1;
    double * arrLevelWeights = (double*) malloc((num_levels_max+1) * sizeof(double));
    arrLevelWeights[num_levels_max] = 0.0; //<-------- Error on this line
    double level_weights = 1.0 / num_levels_max;
    for(int i=0; i<num_levels_max; i++){
        arrLevelWeights[i] = level_weights;
    }
    //...
    //do some operations using arrLevelWeights
    //..
    free(arrLevelWeights);
    return return_value;
}
Error with old code:
Memory Checker detected 2 access violations.
error = access violation on store (global memory)
gridid = 198
blockIdx = {1,0,0}
threadIdx = {29,0,0}
address = 0x00000020
accessSize = 8
New code:
I just added a few lines to check if malloc returned a null pointer.
__device__ unsigned int draw_active_levels(curandState * localState, const int num_levels_max){
    unsigned int return_value = 0;
    float draw;
    draw = curand_uniform(localState);
    int num_active_levels = floorf(draw * (num_levels_max - 1)) + 1;
    double * arrLevelWeights;
    arrLevelWeights = (double*) malloc((num_levels_max+1) * sizeof(double));
    if(arrLevelWeights == NULL){
        printf("Error while dynamically allocating memory on device.\n"); //<--- this line is never called (I put a breakpoint on it)
    }
    arrLevelWeights[num_levels_max] = 0.0; //<------- Error disappeared!
    double level_weights = 1.0 / num_levels_max;
    for(int i=0; i<num_levels_max; i++){
        arrLevelWeights[i] = level_weights;
    }
    //...
    //do some operations using arrLevelWeights
    //..
    free(arrLevelWeights);
    return return_value;
}
If malloc returns NULL, it simply means that you've run out of device heap space which has, by default, a size of 8 MB. I'm not sure how adding a line that is never executed fixes the problem, though.
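If the 8 MB default heap is indeed the limit being hit, it can be raised from the host before the first kernel that uses device malloc is launched; a minimal sketch (the 64 MB figure is only an example):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Raise the device-side malloc heap from its 8 MB default to 64 MB.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    if (err != cudaSuccess)
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));

    // Read the limit back to confirm it was applied.
    size_t heapSize = 0;
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("Device malloc heap size: %zu bytes\n", heapSize);
    return 0;
}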
As you said in a comment, you ran out of heap space because of a missing free somewhere else in your code, which is why I suggest you use RAII (with your own smart pointer class) for memory allocations to avoid this kind of problem in the future.
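A minimal sketch of what such a device-side RAII guard could look like (an illustration of the suggestion, with hypothetical names, not code from the question):
// Hypothetical device-side RAII guard: the allocation is released automatically
// when the object goes out of scope, even on early returns.
struct DeviceBuffer
{
    double *ptr;
    __device__ explicit DeviceBuffer(size_t n) : ptr((double *) malloc(n * sizeof(double))) {}
    __device__ ~DeviceBuffer() { if (ptr) free(ptr); }
    __device__ double &operator[](size_t i) { return ptr[i]; }
};

__device__ double sum_weights(int num_levels_max)
{
    DeviceBuffer w(num_levels_max + 1);   // freed automatically when w leaves scope
    if (w.ptr == NULL) return 0.0;        // device heap exhausted
    double s = 0.0;
    for (int i = 0; i <= num_levels_max; ++i) {
        w[i] = 1.0 / num_levels_max;
        s += w[i];
    }
    return s;
}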

How does make_int2 work?

In the CUDA C Programming Guide there is a small paragraph about built-in vector types. It says that these structures have 4 components which are accessible in a specific way, i.e. .x, .y, .z, .w. What are the 4 components? Can someone give an example?
Moreover, it says that with the line int2 make_int2(int x, int y); it constructs a vector with values x and y. Does each of these variables have 4 components?
The reason I am trying to understand these things is because I am studying the following code:
/*1*/ int ny = num_ofElements_y_ofmyMatrix;
/*2*/ int nx = num_ofElements_x_ofmyMatrix;
/*3*/ int2 matrix_index_2d = make_int2( ( blockIdx.x * blockDim.x ) + threadIdx.x, ( blockIdx.y * blockDim.y ) + threadIdx.y );
/*4*/ int matrix_index_1d = ( nx * matrix_index_2d.y ) + matrix_index_2d.x;
/*5*/ if ( matrix_index_2d.x < nx && matrix_index_2d.y < ny )
/*6*/ {
/*7*/ float r = myMatrix[ matrix_index_1d ];
/*8*/ }
How does the indexing in lines 3 and 4 work? Subsequently, how does the access to the matrix myMatrix work?
UPDATE:
As far as the code snippet is concerned, usually when I am accessing an array I use the following:
col = blockDim.x*blockIdx.x + threadIdx.x;
row = blockDim.y*blockIdx.y + threadIdx.y;
if (col < NUMCOLS && row < NUMROWS){...}
in order to have row-major access to an array in C++, e.g. myMatrix[row*NUMCOLS + col].
What is the connection with the type of indexing used in lines 3 and 4?
Not all the CUDA built-in vector types have 4 components. For example, double2 has 2 double components. double2 is indeed defined as
struct __device_builtin__ __builtin_align__(16) double2
{
    double x, y;
};
and can be used to deal with complex, double precision numbers. According to the definition above, you can use a declaration like
double2 foo;
and then initialize the two x and y components as
foo.x = 1.;
foo.y = 3.4;
As another example, float4 has 4 float components and can be used to deal with four-vectors in Minkowski space.
In your example, int2 has 2 integer components and the instruction
int2 foo_int = make_int2(3,1);
constructs a foo_int struct of type int2 and initializes the x and y components to 3 and 1 respectively.
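To tie this back to the indexing asked about in the UPDATE, here is a hedged sketch showing that the int2 simply packs the usual (col, row) pair into one variable (the kernel and parameter names are assumptions, not code from the guide):
__global__ void readMatrix(const float *myMatrix, int nx, int ny)
{
    // .x plays the role of "col", .y the role of "row"
    int2 matrix_index_2d = make_int2(blockIdx.x * blockDim.x + threadIdx.x,
                                     blockIdx.y * blockDim.y + threadIdx.y);
    if (matrix_index_2d.x < nx && matrix_index_2d.y < ny) {
        // Row-major flattening: row * (elements per row) + col,
        // exactly the same as myMatrix[row * NUMCOLS + col]
        int matrix_index_1d = matrix_index_2d.y * nx + matrix_index_2d.x;
        float r = myMatrix[matrix_index_1d];
        (void) r;   // placeholder for real work
    }
}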

kernel using AoS is faster than using SoA

I have two versions of a kernel that performs the same task (filling a linked cell list). The difference between the two kernels is the data type used to store the particle positions: the first uses a plain float array (4 floats per particle, to allow 128-bit reads/writes), and the second uses an array of vec3f structures (a structure which holds 3 floats).
Doing some tests using nvprof, I found that the second kernel (which uses vec3f) ran faster than the first one:
Time(%)     Time  Calls      Avg      Min     Max  Name
  42.88   37.26s      2   18.63s  23.97us  37.26s  adentu_grid_cuda_filling_kernel(int*, int*, int*, float*, int, _vec3f, _vec3f, _vec3i)
  11.00    3.93s      2    1.97s  25.00us   3.93s  adentu_grid_cuda_filling_kernel(int*, int*, int*, _vec3f*, int, _vec3f, _vec3f, _vec3i)
The tests are done trying to fill a linked cell list using 256 and 512000 particles.
My question is: what happened here? I assumed the float array would give better memory access due to coalescing, versus the vec3f structure array which has unaligned memory. Did I misunderstand something?
These are the kernels. The first kernel:
__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellnAtoms,
                                                 float *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};

    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
            i = linked[i];
    }
    atomicAdd (&cellnAtoms[c], 1);
}
And this is the second kernel:
__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellNAtoms,
                                                 vec3f *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = pos[idx];

    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
            i = linked[i];
    }
    atomicAdd (&cellNAtoms[c], 1);
}
This is the vec3f structure:
typedef struct _vec3f {float x, y, z;} vec3f;
This is not an example of AoS vs. SoA. Let's look at the important lines of code and the data structures implicit in them.
Your first, "SoA" or "slow" case:
vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};
                            ^                    ^                    ^
                            |                    |                    |
                      These values are stored in *adjacent* memory locations
So an individual thread is accessing, successively, pos[idx*4] plus the 2 locations right after it. This is how a structure gets stored! What you're calling a structure of arrays is in fact an array of structures, given the way it is stored in memory. To have a valid "SoA" case, your code would need to look something like this:
vec3f _pos = (vec3f){(float)pos1[idx], (float)pos2[idx], (float)pos3[idx]};
                            ^
                            |
    Adjacent threads will read adjacent values for pos1, pos2, and pos3,
    leading to *coalesced* access.
Your "AoS" or "fast" doesn't really have a different storage format.
To my mind, both of your approaches are actually AoS; the only difference is that the first approach is AoS with a structure of four elements while the second approach only uses three elements. This is why your second solution is preferable.
If you really want to have SoA in your first solution you would have to organize the pos array as follows:
vec3f _pos = (vec3f){(float)pos[idx], (float)pos[N + idx], (float)pos[2 * N + idx]};
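To make the distinction concrete, here is a hedged sketch of what a genuinely SoA version of the coordinate loading and cell computation could look like (pos_x, pos_y, pos_z and the cellOfAtom output are hypothetical names, and the linked-list part is omitted):
__global__ void filling_kernel_soa(const float *pos_x, const float *pos_y,
                                   const float *pos_z, int nAtoms,
                                   vec3f origin, vec3f h, vec3i nCell,
                                   int *cellOfAtom)
{
    // SoA layout: x, y and z live in three separate arrays, so adjacent threads
    // read adjacent floats from each array (coalesced access).
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3f _pos;
    _pos.x = pos_x[idx];
    _pos.y = pos_y[idx];
    _pos.z = pos_z[idx];

    vec3i cell;
    cell.x = floorf((_pos.x - origin.x) / h.x);
    cell.y = floorf((_pos.y - origin.y) / h.y);
    cell.z = floorf((_pos.z - origin.z) / h.z);

    cellOfAtom[idx] = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;
}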

cuda kernel warning : expression has no effect

I'm new to CUDA programming and trying my luck with a particle-in-cell code. The first problem is to build a particle mover, but when I try to compile this code I get error messages like this:
error : expression must have integral or enum type / warning : expression has no effect.
My code:
__global__ void kernel(int* x, int* x_1, int* E_x, int* t, int* m)
{
    int idx = 0;
    if (idx < N)
        // move particles
        x_1[idx] = (E_x[idx] / m[1]) * t[1] * t[1] + x[idx];
}

kernel<<1,1>>( dev_x , dev_x_1, dev_E_x , dev_t, dev_m );
The integers are defined as follows:
int x[N], x_1[N], v_x[N], v_y[N], v_z[N], E_x[N], m[1], t[1];
int *dev_x, *dev_v_x, *dev_x_1, *dev_v_y, *dev_v_z, *dev_E_x, *dev_m, *dev_t;
One problem is that you are using double-chevron syntax instead of the proper triple-chevron syntax for your kernel launch parameters. With only two chevrons, the compiler parses kernel<<1,1>>(...) as an ordinary expression built from shift and comma operators applied to the kernel name and its arguments, which is why you get "expression must have integral or enum type" and "expression has no effect". Instead of this:
kernel<<1,1>>( dev_x , dev_x_1, dev_E_x , dev_t, dev_m );
Do this:
kernel<<<1,1>>>( dev_x , dev_x_1, dev_E_x , dev_t, dev_m );
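For completeness, a hedged sketch of what a full launch covering all N particles might look like (the block size of 256, the extra N parameter and the error check are illustrative additions, not part of the answer):
__global__ void kernel(int* x, int* x_1, int* E_x, int* t, int* m, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;              // one thread per particle
    if (idx < N)
        x_1[idx] = (E_x[idx] / m[0]) * t[0] * t[0] + x[idx];      // m and t hold one element, so index 0
}

// Host side: enough blocks of 256 threads to cover N particles.
int threads = 256;
int blocks  = (N + threads - 1) / threads;
kernel<<<blocks, threads>>>(dev_x, dev_x_1, dev_E_x, dev_t, dev_m, N);
cudaError_t launchErr = cudaGetLastError();   // catch launch/configuration errors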

How to gather rows from a matrix by indices list using CUDA Thrust

This is seemingly a simple problem but I just can’t figure out an elegant way to do this with CUDA Thrust.
I have a two-dimensional NxM matrix and a vector of desired row indices of size L that is a subset of all rows (i.e. L < N) and is not regular (basically an irregular list like 7, 11, 13, 205, etc.). The matrix is stored by rows in a Thrust device vector. The array of indices is a device vector as well.
Here are my two questions:
What is the most efficient way to copy the desired rows from the original NxM matrix forming a new matrix LxM?
Is it possible to create an iterator for the original NxM matrix that would dereference to only elements that belong to the desired rows?
Thank you very much for your help.
What you are asking about seems like a pretty straightforward stream compaction problem, and there isn't any particular problem doing it with Thrust, but there are a couple of twists. In order to select the rows to copy, you need a stencil or key that the stream compaction algorithm can use. That needs to be constructed by a search or select operation using your list of rows to copy.
One example procedure to do this would go something like this:
Construct an iterator which returns the row number of any entry in the input matrix. Thrust has a very useful counting_iterator and transform_iterator which can be combined to do this.
Perform a search of that row number iterator to find which entries match the list of rows to copy. thrust::binary_search can be used for this. The search yields the stencil for the stream compaction operation.
Use thrust::copy_if to perform the stream compaction on the input matrix with the stencil.
It sounds like a lot of work and intermediate steps, but the counting and transformation iterators don't actually produce any intermediate device vectors. The only intermediate storage required is the stencil array, which can be a boolean (so m*n bytes).
A full example in code:
#include <thrust/copy.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>
struct div_functor : public thrust::unary_function<int,int>
{
int m;
div_functor(int _m) : m(_m) {};
__host__ __device__
int operator()(int x) const
{
return x / m;
}
};
struct is_true
{
__host__ __device__
bool operator()(bool x) { return x; }
};
int main(void)
{
// dimensions of the problem
const int m=20, n=5, l=4;
// Counting iterator for generating sequential indices
// Sample matrix containing 0...(m*n)
thrust::counting_iterator<float> indices(0.f);
thrust::device_vector<float> in_matrix(m*n);
thrust::copy(indices, indices+(m*n), in_matrix.begin());
// device vector contain rows to select
thrust::device_vector<int> select(l);
select[0] = 1;
select[1] = 4;
select[2] = 9;
select[3] = 16;
// construct device iterator supplying row numbers via a functor
typedef thrust::counting_iterator<int> counter;
typedef thrust::transform_iterator<div_functor, counter> rowIterator;
rowIterator rows_begin = thrust::make_transform_iterator(thrust::make_counting_iterator(0), div_functor(n));
rowIterator rows_end = rows_begin + (m*n);
// construct a stencil array which indicates which entries will be copied
thrust::device_vector<bool> docopy(m*n);
thrust::binary_search(select.begin(), select.end(), rows_begin, rows_end, docopy.begin());
// use stream compaction on the matrix with the stencil array
thrust::device_vector<float> out_matrix(l*n);
thrust::copy_if(in_matrix.begin(), in_matrix.end(), docopy.begin(), out_matrix.begin(), is_true());
for(int i=0; i<(l*n); i++) {
float val = out_matrix[i];
printf("%i %f\n", i, val);
}
}
(usual disclaimer: use at your own risk)
About the only comment I would make is that the predicate to the copy_if call feels a bit redundant given we have already a binary stencil that could be used directly, but there doesn't seem to be a variant of the compaction algorithms which can operate on a binary stencil directly. Similarly, I could not think of a sensible way to use the list of rows directly in the stream compaction call. There might well be a more efficient way to do this with thrust, but this should at least get you started.
From your comment, it seems that space is tight and the additional memory overhead of the binary search and stencil creation is prohibitive for your application. In that case I would follow the advice I offered in a comment to Roger Dahl's answer, and use a custom copy kernel instead. Thrust device vectors can be cast to a pointer you can pass directly to a kernel (thrust::raw_pointer_cast), so it need not interfere with your existing thrust code. I would suggest using a block of threads per row to copy, that allows coalescing of reads and writes and should perform a lot better than using thrust::copy for each row. A very simple implementation might look something like this (reusing most of my thrust example):
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>
__global__
void rowcopykernel(const float *in, float *out, const int *list, const int m, const int n, const int l)
{
__shared__ const float * inrowp;
__shared__ float * outrowp;
if (threadIdx.x == 0) {
inrowp = (blockIdx.x < l) ? in + (n*list[blockIdx.x]) : 0;
outrowp = out + (n*blockIdx.x);
}
__syncthreads();
for(int i=threadIdx.x; (inrowp != 0) && (i<n); i+=blockDim.x) {
*(outrowp+i) = *(inrowp+i);
}
}
int main(void)
{
// dimensions of the problem
const int m=20, n=5, l=4;
// Sample matrix containing 0...(m*n)
thrust::counting_iterator<float> indices(0.f);
thrust::device_vector<float> in_matrix(m*n);
thrust::copy(indices, indices+(m*n), in_matrix.begin());
// device vector contain rows to select
thrust::device_vector<int> select(l);
select[0] = 1;
select[1] = 4;
select[2] = 9;
select[3] = 16;
// Output matrix
thrust::device_vector<float> out_matrix(l*n);
// raw pointer to thrust vectors
int * selp = thrust::raw_pointer_cast(&select[0]);
float * inp = thrust::raw_pointer_cast(&in_matrix[0]);
float * outp = thrust::raw_pointer_cast(&out_matrix[0]);
dim3 blockdim = dim3(128);
dim3 griddim = dim3(l);
rowcopykernel<<<griddim,blockdim>>>(inp, outp, selp, m, n, l);
for(int i=0; i<(l*n); i++) {
float val = out_matrix[i];
printf("%i %f\n", i, val);
}
}
(standard disclaimer: use at your own risk).
The execution parameter selection could be made fancier, but otherwise that should be about all that is required. If your rows are very small, you might want to investigate using a warp per row rather than a block (so one block copies several rows). If you have more than 65535 output rows, then you will need to either use a 2D grid or modify the code to have each block do multiple rows. But, as with the Thrust-based solution above, this should get you started.
If you are not fixed on Thrust, check out ArrayFire:
Surprisingly, unlike Thrust, this library has native support for subscript indexing, so your problem can be solved in just a few lines of code:
const int N = 7, M = 5;
float L_host[] = {3, 6, 4, 1};
int szL = sizeof(L_host) / sizeof(float);
// generate random NxM matrix with cuComplex data
array A = randu(N, M, c32);
// array used to index rows
array L(szL, 1, L_host);
print(A);
print(L);
array B = A(L,span); // copy selected rows of A
print(B);
and the results:
A =
0.7402 + 0.9210i 0.6814 + 0.2920i 0.5786 + 0.5538i 0.2133 + 0.4131i 0.7305 + 0.9400i
0.0390 + 0.9690i 0.3194 + 0.8109i 0.3557 + 0.7229i 0.0328 + 0.5360i 0.8432 + 0.6116i
0.9251 + 0.4464i 0.1541 + 0.4452i 0.2783 + 0.6192i 0.7214 + 0.3546i 0.2674 + 0.0208i
0.6673 + 0.1099i 0.2080 + 0.6110i 0.5876 + 0.3750i 0.2527 + 0.9847i 0.8331 + 0.7218i
0.4702 + 0.5132i 0.3073 + 0.4156i 0.2405 + 0.4148i 0.9200 + 0.1872i 0.6087 + 0.6301i
0.7762 + 0.2948i 0.2343 + 0.8793i 0.0937 + 0.6326i 0.1820 + 0.5984i 0.5298 + 0.8127i
0.7140 + 0.3585i 0.6462 + 0.9264i 0.2849 + 0.7793i 0.7082 + 0.0421i 0.0593 + 0.4797i
L = (row indices)
3.0000
6.0000
4.0000
1.0000
B =
0.6673 + 0.1099i 0.2080 + 0.6110i 0.5876 + 0.3750i 0.2527 + 0.9847i 0.8331 + 0.7218i
0.7140 + 0.3585i 0.6462 + 0.9264i 0.2849 + 0.7793i 0.7082 + 0.0421i 0.0593 + 0.4797i
0.4702 + 0.5132i 0.3073 + 0.4156i 0.2405 + 0.4148i 0.9200 + 0.1872i 0.6087 + 0.6301i
0.0390 + 0.9690i 0.3194 + 0.8109i 0.3557 + 0.7229i 0.0328 + 0.5360i 0.8432 + 0.6116i
It also works pretty fast. I tested this with an array of cuComplex of size 2000 x 2000 using the following code:
float *g_data = 0, *g_data2 = 0;
int g_N = 2000, g_M = 2000, // matrix of size g_N x g_M
g_L = 400; // copy g_L rows
void af_test()
{
array A(g_N, g_M, (cuComplex *)g_data, afDevicePointer);
array L(g_L, 1, g_data2, afDevicePointer);
array B = (A(L, span));
std::cout << "sz: " << B.elements() << "\n";
}
int main()
{
// input matrix N x M of cuComplex
array in = randu(g_N, g_M, c32);
g_data = (float *)in.device< cuComplex >();
// generate unique row indices
array in2 = setunique(floor(randu(g_L) * g_N));
print(in2);
g_data2 = in2.device<float>();
const int N_ITERS = 30;
try {
info();
af::sync();
timer::tic();
for(int i = 0; i < N_ITERS; i++) {
af_test();
}
af::sync();
printf("af: %.5f seconds\n", timer::toc() / N_ITERS);
} catch (af::exception& e) {
fprintf(stderr, "%s\n", e.what());
}
in.unlock();
in2.unlock();
}
I don't think there is a way to do this with Thrust but, because the operation will be memory bound, it should be easy to write a kernel that performs this operation at maximum possible performance. Simply create the same number of threads as there are indices in the vector. Have each thread calculate the source and destination addresses for one row and then use memcpy() to copy the row.
You may also want to carefully consider whether it is possible to set up subsequent processing steps to access the rows in place, thereby avoiding the entire, expensive "compacting" operation, which only shuffles memory around. Even if addressing the rows becomes slightly more complicated (an extra memory lookup and multiply, maybe), overall performance may be much better.
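A hedged sketch of the per-row copy kernel described in this answer (the row-major layout and parameter names follow the earlier examples; they are assumptions, not the asker's code):
__global__ void gather_rows(const float *in, float *out,
                            const int *rowList, int n, int l)
{
    // One thread per selected row: thread r copies the n floats of row rowList[r]
    // from the NxM input into row r of the LxM output.
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < l)
        memcpy(out + r * n, in + rowList[r] * n, n * sizeof(float));
}
As the answer notes, this per-row memcpy keeps the operation memory bound; for wide rows, splitting each row across the threads of a block (as in the earlier rowcopykernel) will coalesce better.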