CUDA kernel warning: expression has no effect

I'm new to CUDA programming and trying my luck with a Particle-in-Cell code. The first problem is building a particle mover. When I try to compile this code I get error messages like this:
error : expression must have integral or enum type / warning : expression has no effect.
My code:
__global__ void kernel(int* x, int* x_1, int* E_x, int* t, int* m)
{
    int idx = 0;
    if (idx < N)
        // move particles
        x_1[idx] = (E_x[idx] / m[1]) * t[1] * t[1] + x[idx];
}
kernel<<1,1>>( dev_x , dev_x_1, dev_E_x , dev_t, dev_m );
The integers are defined as follows:
int x[N], x_1[N], v_x[N], v_y[N], v_z[N], E_x[N], m[1], t[1];
int *dev_x, *dev_v_x, *dev_x_1, *dev_v_y, *dev_v_z, *dev_E_x, *dev_m, *dev_t;

One problem is that you are using double-chevron syntax instead of the proper triple-chevron syntax in your kernel launch parameters. Instead of this:
kernel<<1,1>>( dev_x , dev_x_1, dev_E_x , dev_t, dev_m );
Do this:
kernel<<<1,1>>>( dev_x , dev_x_1, dev_E_x , dev_t, dev_m );
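As an aside, note that with idx fixed at 0 the kernel only ever moves one particle. A minimal sketch of the usual pattern (assuming the launch covers all N particles; the grid and block sizes below are illustrative) would compute the index from the thread coordinates, and index m and t with 0, since they are declared with a single element:
__global__ void kernel(int* x, int* x_1, int* E_x, int* t, int* m)
{
    // one thread per particle: derive a unique index from the launch configuration
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N)
        // move particles (m and t hold one element each, so index 0 is the valid one)
        x_1[idx] = (E_x[idx] / m[0]) * t[0] * t[0] + x[idx];
}

kernel<<<(N + 255) / 256, 256>>>( dev_x , dev_x_1, dev_E_x , dev_t, dev_m );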

Related

"device-function-maxrregcount" message while compiling cuda code

I am trying to write code which performs multiple vector dot products inside the kernel. I'm using the cublasSdot function from the cuBLAS library to perform the vector dot products. This is my code:
using namespace std;

__global__ void ker(float * a, float * c, long long result_size, int n, int m)
{
    float *sum;
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    float *out1, *out2;
    int k;
    if (id < result_size)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        out1 = a + id*m;
        for (k = 0; k < n; k++)
        {
            out2 = a + k*m;
            cublasSdot(handle, m, out1, 1, out2, 1, sum);
            c[id*n + k] = *sum;
        }
    }
}
int main()
{
    int n = 70000, m = 100;
    long result_size = n;
    result_size *= n;
    float *dev_data, *dev_result;
    float *data = new float[n*m];
    float *result = new float[result_size];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
        {
            data[i*m + j] = rand();
        }
    cudaMalloc((void**)&dev_data, sizeof(float)*m*n);
    cudaMalloc((void**)&dev_result, sizeof(float)*result_size);
    cudaMemcpy(dev_data, data, sizeof(float)*m*n, cudaMemcpyHostToDevice);
    int block_size = 1024;
    int grid_size = ceil((float)result_size/(float)block_size);
    ker<<<grid_size, block_size>>>(dev_data, dev_result, result_size, n, m);
    cudaDeviceSynchronize();
    cudaMemcpy(result, dev_result, sizeof(float)*result_size, cudaMemcpyDeviceToHost);
    return 0;
}
I have included the cublas_v2 header and used the following command to compile the code:
nvcc -lcublas_device -arch=sm_35 -rdc=true askstack.cu -o askstack
But I got the following message:
ptxas info : 'device-function-maxrregcount' is a BETA feature
Can anyone please let me know what should I do regarding this message?
This message is informational, as said by talonmies.
The --maxrregcount option of nvcc is used to specify a limit on the number of registers that can be used by a kernel and all the device functions it calls:
If a kernel is limited to a certain number of registers with the launch_bounds attribute or the --maxrregcount option, then all functions that the kernel calls must not use more than that number of registers; if they exceed the limit, then a link error will be given.
See : NVCC Doc : 6.5.1. Object Compatibility
It seems that device-function-maxrregcount is used to override this value for device functions only. So, you can allow a different maximum number of registers for kernels and for device functions:
For device functions, this option overrides the value specified by --maxrregcount.
Source : The CUDA Handbook
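For instance, if you wanted to cap register usage yourself, you could pass the flag on the compile line shown above (the value 32 here is purely illustrative):
nvcc -lcublas_device -arch=sm_35 -rdc=true --maxrregcount=32 askstack.cu -o askstack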

CUDA cudaMemcpy2D not giving expected results [duplicate]

How do I initialize a device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize the values to anything other than 0. The code for cudaMemset looks like below, where value is initialized to 5:
cudaMemset(devPtr, value, number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset ( void * devPtr, int value, size_t count )
Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr, number_bytes);
const int value = 5;
cudaMemset(devPtr, value, number_bytes);
what you are asking is for each byte of devPtr to be set to 5. If devPtr were an array of integers, the result would be that each integer word has the value 84215045. This is probably not what you had in mind.
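(Each of the four bytes of a 32-bit int becomes 0x05, giving the bit pattern 0x05050505, and 5 * 0x01010101 = 5 * 16843009 = 84215045.)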
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as:
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different to what cudaMemset does anyway; using that API call results in a kernel launch which is not too different to what I posted above.
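For example, a minimal usage sketch (the grid and block sizes here are illustrative, not tuned):
int *devPtr;
const size_t nwords = number_bytes / sizeof(int);
cudaMalloc((void **)&devPtr, number_bytes);
initKernel<int><<<256, 256>>>(devPtr, 5, nwords); // every int becomes 5, not every byte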
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing but for 16-bit and 32-bit word types. If you need to set 64-bit or larger types (so doubles or vector types), your best option is to use your own kernel.
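For the example above, the driver API version would look something like this (a sketch; it assumes a context is current and devPtr holds nwords 32-bit words):
cuMemsetD32((CUdeviceptr)devPtr, 5, nwords); // each 32-bit word becomes 5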
I also needed a solution to this question and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid blocks with for(; tidx < nwords; tidx += stride), nor, for that matter, the kernel invocation and the counter-intuitive word sizes.
Therefore I created a much simpler monolithic generic kernel and customized it with strides, i.e. you may use it to initialize a matrix in multiple ways, e.g. set rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid*incx < n) {
        a[tid*incx] = value;
    }
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
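A hypothetical call (BLOCK_SIZE is assumed to be defined elsewhere, e.g. 256): with incx = 1 this fills a length-n vector, while a larger stride touches only every incx-th element:
deviceInitializeArray<float>(d_vec, 1.0f, n, 1);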

CUDA and Monte Carlo with local behavior defined

I have a question about a strange behavior in CUDA.
I am currently developing a Monte Carlo simulation of particle trajectories, and I am doing the following thing.
The position p(n) of my particle at a given date t(n) depends on the position p(n-1) of my particle at the previous date t(n-1). Indeed, let's say the value v(n) is computed from the value p(n-1). Here is a simplified example of my code:
__device__ inline double calculateStep(double drift, double vol, double dt, double randomWalk, double S_t) {
    return exp((drift - vol*vol*0.5)*dt + randomWalk*vol*sqrt(dt))*S_t;
}

__device__ double doSomethingWith(double v_n, ...) {
    ...
    return v_n*exp(t)*S;
}

__global__ void myMCsimulation(double* matrice, double* randomWalk, int nbreSimulation, int nPaths, double drift, ...) {
    double dt = T/nPaths;
    unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int stride = blockDim.x*gridDim.x;
    unsigned int index = tid;
    double mydt = (index - nbreSimulation)/nbreSimulation*dt + dt;
    for (index = tid; index < nbreSimulation*nPaths; index += stride) {
        if (index >= nbreSimulation)
        {
            double v_n = doSomethingWith(drift, dt, matrice[index - nbreSimulation]);
            matrice[index] = matrice[index - nbreSimulation] * calculateStep(drift, v_n, dt, randomWalk[index]);
        }
        ...
    }
The last code line:
matrice[index] = matrice[index - nbreSimulation] * calculateStep(drift, v_n, dt, randomWalk[index]);
only fills in the second row of the matrix matrice, and I don't know why.
When I change the code line to:
matrice[index] = doSomethingWith(drift, dt, matrice[index - nbreSimulation]);
my matrix is filled in correctly and all my values are changed; I am then able to get back matrice[index - nbreSimulation].
I think this is a concurrent access, but I am not sure. I tried __syncthreads() but it did not work.
Could someone please help on this point?
Many thanks
I have changed my code to the following and now it works perfectly. Each thread now handles a single simulation and steps through the time dimension sequentially, so every value is read by the same thread that wrote it:
if (index < nbreSimulation) {
    matrice[index] = S0;
    for (workingCol = 1; workingCol < nPaths; workingCol++) {
        previousMove = index;
        index = index + nbreSimulation;
        ................
        matrice[index] = calculateStep(drift, vol_int[index], dt, randomWalk[index], matrice[previousMove]);
    }
}
I have also tried the following thing:
I declared a shared variable (an array of doubles) which contains the value computed at each iteration:
__shared__ double mat[];
......
for (index = tid; index < nbreSimulation*nPaths; index += stride) {
    .....
    mat[index] = computedValue;
    ......
}
Without success. Does anyone see the issue?

CUDA different threads per block for different functions

I am making a CUDA program and am stuck at a problem. I have two functions:
__global__ void cal_freq_pl(float *, char *, char *, int *, int *)
__global__ void cal_sum_vfreq_pl(float *, float *, char *, char *, int *)
I call the first function like this:
cal_freq_pl<<<M,512>>>( ... );
M is a number around 15, so I'm not worried about it. 512 is the maximum number of threads per block on my GPU. This works fine and gives the expected output for all M*512 values.
But when I call the 2nd function in a similar way:
cal_sum_vfreq_pl<<<M,512>>>( ... );
it does not work. After debugging the crap out of that function, I finally found out that it only runs with these dimensions: cal_sum_vfreq_pl<<<M,384>>>( ... );, which is 128 less than 512. It shows no error with 512, but gives an incorrect result.
I currently only have access to compute capability 1.0 hardware, an NVIDIA Quadro FX4600 card, on a 64-bit Windows machine.
I have no idea why such behavior should happen. I am positively sure that the 1st function runs with 512 threads and the 2nd only runs with 384 (or fewer).
Can someone please suggest some possible solution?
Thanks in advance...
EDIT:
Here is the kernel code:
__global__ void cal_sum_vfreq_pl(float *freq, float *v_freq_vectors, char *wstrings, char *vstrings, int *k){
    int index = threadIdx.x;
    int m = blockIdx.x;
    int block_dim = blockDim.x;
    int kv = *k; int vv = kv-1; int wv = kv-2;
    int woffset = index*wv;
    int no_vstrings = pow_pl(4, vv);
    float temppp = 0;
    char wI[20], Iw[20]; int Iwi, wIi;
    for (int i = 0; i < wv; i++) Iw[i+1] = wI[i] = wstrings[woffset + i];
    for (int l = 0; l < 4; l++){
        Iw[0] = get_nucleotide_pl(l);
        wI[vv-1] = get_nucleotide_pl(l);
        Iwi = binary_search_pl(vstrings, Iw, vv);
        wIi = binary_search_pl(vstrings, wI, vv);
        temppp = temppp + v_freq_vectors[m*no_vstrings + Iwi] + v_freq_vectors[m*no_vstrings + wIi];
    }
    freq[index + m*block_dim] = 0.5*temppp;
}
It seems your second kernel uses a lot of registers. You cannot always reach the maximum threads per block, due to hardware resource limits such as the number of registers per block.
CUDA provides a tool to help calculate the proper number of threads per block:
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
You can also find this .xls file in your CUDA installation dir.
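You can see the per-thread register usage by asking ptxas to be verbose when compiling:
nvcc -Xptxas -v ...
As an illustration (the register count here is assumed, not measured): compute capability 1.0 devices have 8192 registers per block. If the second kernel needed, say, 21 registers per thread, then 512 threads would require 512 * 21 = 10752 registers, exceeding the limit, while 384 threads would require 384 * 21 = 8064, which just fits. That would match the behavior you observed.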

How to get the index inside thrust::for_each

I am trying to use thrust::for_each to assign certain values to a device vector.
Here is the code:
const uint N = 222222;

struct assign_functor
{
    template <typename Tuple>
    __device__
    void operator()(Tuple t)
    {
        uint x = threadIdx.x + blockIdx.x * blockDim.x;
        uint y = threadIdx.y + blockIdx.y * blockDim.y;
        uint offset = x + y * blockDim.x * gridDim.x;
        thrust::get<0>(t) = offset;
    }
};

int main(int argc, char** argv)
{
    thrust::device_vector <float> d_float_vec(N);
    thrust::for_each(
        thrust::make_zip_iterator(
            thrust::make_tuple(d_float_vec.begin())
        ),
        thrust::make_zip_iterator(
            thrust::make_tuple(d_float_vec.end())
        ),
        assign_functor()
    );
    std::cout<<d_float_vec[10]<<" "<<d_float_vec[N-2];
}
The output of d_float_vec[N-2] is supposed to be 222220, but it turns out to be 1036. What's wrong with my code?
I know I could use thrust::sequence to assign sequential values to the vector. I just want to know how to get the real index inside a thrust::for_each functor. Thanks!
As noted in comments, your approach is never likely to work because you have assumed a number of things about the way thrust::for_each works internally which are probably not true, including:
You are implicitly assuming that for_each uses a single thread to process each input element. This is almost certainly not the case; it is much more likely that thrust will process multiple elements per thread during the operation.
You are also assuming that execution happens in order, so that the Nth thread processes the Nth array element. That may not be the case, and execution may occur in an order which cannot be known a priori.
You are assuming for_each processes the whole input data set in a single kernel launch.
Thrust algorithms should be treated as black boxes whose internal operations are undefined and no knowledge of them is required to implement user defined functors. In your example, if you require a sequential index inside a functor, pass a counting iterator. One way to re-write your example would be like this:
#include "thrust/device_vector.h"
#include "thrust/for_each.h"
#include "thrust/tuple.h"
#include "thrust/iterator/counting_iterator.h"
typedef unsigned int uint;
const uint N = 222222;
struct assign_functor
{
template <typename Tuple>
__device__
void operator()(Tuple t)
{
thrust::get<1>(t) = (float)thrust::get<0>(t);
}
};
int main(int argc, char** argv)
{
thrust::device_vector <float> d_float_vec(N);
thrust::counting_iterator<uint> first(0);
thrust::counting_iterator<uint> last = first + N;
thrust::for_each(
thrust::make_zip_iterator(
thrust::make_tuple(first, d_float_vec.begin())
),
thrust::make_zip_iterator(
thrust::make_tuple(last, d_float_vec.end())
),
assign_functor()
);
std::cout<<d_float_vec[10]<<" "<<d_float_vec[N-2]<<std::endl;
}
Here the counting iterator gets passed in a tuple along with the data array, allowing the functor access to a sequential index which corresponds to the data array entry it is dealing with.
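As the question itself notes, for this particular fill pattern a single call to thrust::sequence would achieve the same result without a functor:
thrust::sequence(d_float_vec.begin(), d_float_vec.end()); // fills with 0, 1, 2, ...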