I'm writing my first CUDA program, and it's actually a rewrite of some C++ code. It deals with a lot of vector maths, so I use the float4 datatype, which provides exactly what I need. However, the old code contains a lot of
float *vec;
vec = new float[4];
for(int i=0; i<4; i++) vec[i] = ...;
With float4, all I can do is write a separate line for each of .x, .y, .z and .w, which I find a bit annoying. Is there no way of accessing float4 elements in a similar fashion, i.e.
float4 vec;
for(int i=0; i<4; i++) vec[i] = ...;
Unfortunately I couldn't find any hints on the internet.
Thanks in advance.
You could use a union, e.g.
typedef union {
    float4 vec;
    float a[4];
} U4;

U4 u;
for (int i = 0; i < 4; ++i) u.a[i] = ...;
For your arrays of float4 you would just change the underlying type to U4.
Note: technically it's undefined behaviour in C++ to write to one member of a union and then read from another, but it works in practice with CUDA's compilers, and portability is less of a concern here since the code is CUDA-specific anyway.
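For example, a minimal sketch of the idea applied to an array (the array length 10 here is just a placeholder):

U4 *vecs = new U4[10];
for (int n = 0; n < 10; ++n)
    for (int i = 0; i < 4; ++i)
        vecs[n].a[i] = 0.0f;   // index component-wise, like a plain float array
float4 v = vecs[0].vec;        // read back as a float4 where the vector type is needed
delete [] vecs;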
Probably not safe, but here is the easiest way.
float *vec;
vec = new float[4];
for(int i=0; i<4; i++) vec[i] = ...;
float4 vec4 = *(float4 *)vec;
Or you can flip this
float4 vec4;
float *vec = (float *)&vec4; // Do not free this pointer
for(int i=0; i<4; i++) vec[i] = ...;
EDIT
The only way to store values from an array directly into a float4 is with an initializer, like this
float4 vec4 = {val[0], val[1], val[2], val[3]};
so if you have an array of float4s, you can do something like the following
float4 *vec4 = new float4[10];
float *vec = new float[4];
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 4; j++) vec[j] = j;
    vec4[i] = make_float4(vec[0], vec[1], vec[2], vec[3]);   // compound literals are C-only; use make_float4 in C++
}
delete [] vec;
delete [] vec4;
Other than this, I can't think of an easier way.
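One reason the casts above are risky is alignment: float4 loads assume a 16-byte boundary, and new float[4] does not guarantee one. A sketch that at least fixes the alignment issue, assuming a C++11 compiler (the strict-aliasing caveat from the union answer still applies):

alignas(16) float vec[4];                        // 16-byte aligned, matching float4's requirement
for (int i = 0; i < 4; i++) vec[i] = ...;
float4 vec4 = *reinterpret_cast<float4 *>(vec);  // alignment is now safe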
I have a requirement where I want to parallelize the following using CUDA thrust.
std::vector<float> a, b, c; // size of each is (size.x * size.y * size.z), kind of a 3D array.
What I am trying to do is this:
a[i] = 0 if b[i] < 0
a[i] = c[i] if b[i] > 0
This is the host code.
for (int i = 0; i < size.x; i++)
    for (int j = 0; j < size.y; j++)
        for (int z = 0; z < size.z; z++) {
            a.data[get_idx(i, j, z)] = (b.data[get_idx(i, j, z)] < 0) ?
                                       0 : c.data[get_idx(i, j, z)];
        }
get_idx() just converts the loop indices to array indices.
What I want is the equivalent Thrust API call that does this. I have the thrust::device_vector counterparts ready, with the values of the corresponding a, b, c already copied to the device:
thrust::device_vector<float> dev_a, dev_b, dev_c;
What I have tried is thrust::for_each, but I am unable to find a way to assign dev_c[i] to dev_a[i].
I would love a nudge in the right direction, maybe which Thrust API is the most suitable. Thanks in advance.
After doing some more digging around, I found the correct Thrust API:
thrust::replace_copy_if
It is an overload of replace_copy_if that takes a 'stencil' sequence as an extra input, which acts as the condition deciding whether each value is copied or replaced.
In my case, 'b' is the stencil.
The following code works now.
struct is_less_than_zero
{
    __host__ __device__ bool operator()(float x) const
    {
        return x < 0;
    }
};

is_less_than_zero pred{};
thrust::replace_copy_if(thrust::device, c.begin(), c.end(),
                        b.begin(), a.begin(), pred, 0);
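For reference, a self-contained sketch of how the call fits together (the vector size and fill step are placeholders):

#include <thrust/device_vector.h>
#include <thrust/replace.h>
#include <thrust/execution_policy.h>

struct is_less_than_zero
{
    __host__ __device__ bool operator()(float x) const { return x < 0.0f; }
};

int main()
{
    const int n = 8;
    thrust::device_vector<float> a(n), b(n), c(n);
    // ... fill b (the stencil) and c (the source values) here ...
    thrust::replace_copy_if(thrust::device, c.begin(), c.end(),
                            b.begin(),   // stencil: where pred(b[i]) is true...
                            a.begin(),   // ...write 0.0f, otherwise copy c[i]
                            is_less_than_zero{}, 0.0f);
    return 0;
}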
I have p.ntp test particles, and every i-th particle has Cartesian coordinates tp.rh[i].x, tp.rh[i].y, tp.rh[i].z. Within this set I need to find CLUSTERS. That means I am looking for particles closer to the i-th particle than hill2 (tp.D_rel < hill2). The number of such members is stored in N_conv.
I use the cycle for (int i = 0; i < p.ntp; i++), which goes through the data set. For each i-th particle I calculate the squared distances tp.D_rel[idx] relative to the other members of the set. Then I use the first thread (idx == 0) to count the cases which satisfy my condition. At the end, if there is more than one positive case (N_conv > 1), I need to write out all the particles forming a possible cluster together (triplets, ...).
My code works well only in cases where i < blockDim.x. Why? Is there a general way to find clusters in a data set, but write out only triplets and larger?
Note: I know that some cases will be found twice.
__global__ void check_conv_system(double t, struct s_tp tp, struct s_mp mp, struct s_param p, double *time_step)
{
    const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
    const uint tid = threadIdx.x;
    const uint idx = bid * blockDim.x + tid;

    double hill2 = 1.0e+6;

    __shared__ double D[200];
    __shared__ int ID1[200];
    __shared__ int ID2[200];

    if (idx >= p.ntp) return;

    int N_conv;
    for (int i = 0; i < p.ntp; i++)
    {
        tp.D_rel[idx] = (double)((tp.rh[i].x - tp.rh[idx].x)*(tp.rh[i].x - tp.rh[idx].x) +
                                 (tp.rh[i].y - tp.rh[idx].y)*(tp.rh[i].y - tp.rh[idx].y) +
                                 (tp.rh[i].z - tp.rh[idx].z)*(tp.rh[i].z - tp.rh[idx].z));
        __syncthreads();

        N_conv = 0;
        if (idx == 0)
        {
            for (int n = 0; n < p.ntp; n++) {
                if ((tp.D_rel[n] < hill2) && (i != n)) {
                    N_conv = N_conv + 1;
                    D[N_conv] = tp.D_rel[n];
                    ID1[N_conv] = i;
                    ID2[N_conv] = n;
                }
            }
            if (N_conv > 0) {
                for (int k = 1; k < N_conv; k++) {
                    printf("%lf %lf %d %d \n", t/365.2422, D[k], ID1[k], ID2[k]);
                }
            }
        } // end idx == 0
    } // end for cycle over i
}
As RobertCrovella mentioned, without an MCVE it is hard to tell.
However, the tp.D_rel array is written to with the idx index and read back after a __syncthreads() with the full-range index n. Note that __syncthreads() only synchronizes threads within a block, not across the whole device. As a result, some threads/blocks will read data that has not been computed yet, hence the failure.
You want to rework your code so that values computed by one block do not depend on another.
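A minimal sketch of one way to restructure it: let each thread own one particle idx and loop over all partners itself, so no cross-block synchronization is needed. The output buffers d_pairs and d_npairs (assumed preallocated and large enough) are my addition; the field names follow the question:

__global__ void find_pairs(struct s_tp tp, struct s_param p, int2 *d_pairs, int *d_npairs)
{
    const unsigned int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx >= p.ntp) return;
    const double hill2 = 1.0e+6;

    for (int n = 0; n < p.ntp; n++) {
        if (n == (int)idx) continue;
        double dx = tp.rh[n].x - tp.rh[idx].x;
        double dy = tp.rh[n].y - tp.rh[idx].y;
        double dz = tp.rh[n].z - tp.rh[idx].z;
        double D_rel = dx*dx + dy*dy + dz*dz;
        if (D_rel < hill2) {
            int slot = atomicAdd(d_npairs, 1);  // grid-wide counter instead of __syncthreads()
            d_pairs[slot] = make_int2(idx, n);  // record the pair; group pairs into clusters on the host
        }
    }
}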
Is there a way to determine the number of CUDA streams during program execution rather than at compile time, just like allocating with the "new" operator? (By "stream" I mean a CUDA stream, i.e. a queue of operations, not a block of threads.)
Edit 1
(In response to the last comment) Say
for (int i = 0; i < nstreams; ++i) {
    (some serial code here, not related to kernels or CUDA memory copies);
    someKernel<<<xx, yy, 0, stream[i]>>>(param list);
}
Without the serial code, the kernels should execute concurrently, if my understanding is correct?
But will the kernels still execute concurrently given the serial code, which could be parallelized over i (i.e. in OpenMP fashion, if taken out)? Will it affect the concurrency?
Yes, the number of streams can be determined at runtime.
int num_streams;
// ... set num_streams at runtime
cudaStream_t streams[num_streams];   // a variable-length array: a compiler extension, not standard C++
for (int i = 0; i < num_streams; i++)
    cudaStreamCreate(&(streams[i]));
The following constructs also work:
int num_streams;
// ... set num_streams
cudaStream_t *streams = (cudaStream_t *)malloc(num_streams*sizeof(cudaStream_t));
for (int i = 0; i < num_streams; i++)
    cudaStreamCreate(&(streams[i]));
or:
int num_streams;
// ... set num_streams
cudaStream_t *streams = new cudaStream_t[num_streams];
for (int i = 0; i < num_streams; i++)
    cudaStreamCreate(&(streams[i]));
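Once created, the streams are used and cleaned up in the usual way, e.g. (someKernel and the launch dimensions xx, yy are the placeholders from the question):

for (int i = 0; i < num_streams; i++)
    someKernel<<<xx, yy, 0, streams[i]>>>(/* param list */);

cudaDeviceSynchronize();                 // wait for work in all streams to finish

for (int i = 0; i < num_streams; i++)
    cudaStreamDestroy(streams[i]);
delete [] streams;                       // for the new[] variant; use free() for the malloc one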
I am trying to work with 3D arrays in CUDA (200x200x100).
The moment I change my z dimension (model_num) from 4 to 5, I get a segmentation fault. Why, and how can I fix it?
const int nrcells = 200;
const int nphicells = 200;
const int model_num = 5; //So far, 4 is the maximum model_num that works. At 5 and after, there is a segmentation fault
__global__ void kernel(float* mgridb)
{
    // with the launch configuration below, threadIdx.x spans nphicells,
    // blockIdx.x spans nrcells and blockIdx.y spans model_num
    const int tx = threadIdx.x;
    const int ty = blockIdx.x;
    const int tz = blockIdx.y;
    const unsigned long long int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    if (tx >= 0 && tx < nphicells && ty >= 0 && ty < nrcells && tz >= 0 && tz < model_num) {
        // Do stuff with mgridb[i]
    }
}
int main (void)
{
    unsigned long long int size_matrices = nphicells*nrcells*model_num;
    unsigned long long int mem_size_matrices = sizeof(float) * size_matrices;

    float *h_mgridb = (float *)malloc(mem_size_matrices);
    float mgridb[nphicells][nrcells][model_num];
    for (int k = 0; k < model_num; k++) {
        for (int j = 0; j < nrcells; j++) {
            for (int i = 0; i < nphicells; i++) {
                mgridb[i][j][k] = 0;
            }
        }
    }

    float *d_mgridb;
    cudaMalloc( (void**)&d_mgridb, mem_size_matrices );
    cudaMemcpy(d_mgridb, h_mgridb, mem_size_matrices, cudaMemcpyHostToDevice);

    int threads = nphicells;
    uint3 blocks = make_uint3(nrcells, model_num, 1);
    kernel<<<blocks,threads>>>(d_mgridb);

    cudaMemcpy( h_mgridb, d_mgridb, mem_size_matrices, cudaMemcpyDeviceToHost);
    cudaFree(d_mgridb);
    return 0;
}
This is getting stored on the stack:
float mgridb[nphicells][nrcells][model_num];
Your stack space is limited. When you exceed the amount you can store on the stack, you get a seg fault, either at the point of allocation or as soon as you try to access it.
Use malloc instead. That allocates heap storage, which has much higher limits.
None of the above has anything to do with CUDA. Furthermore, it's not unique or specific to "3D" arrays. Any large stack-based allocation (e.g. a 1D array) is going to have the same trouble.
You may also have to adjust how you access the array, but it's not difficult to handle a flattened array using pointer indexing.
Your code actually looks strange, because you create an appropriately sized array h_mgridb using malloc and then copy it to the device (into d_mgridb). It's not clear what purpose mgridb serves in your code. h_mgridb and mgridb are not the same.
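Concretely, a sketch of replacing the stack array with a heap allocation plus flattened indexing, reusing the question's dimensions:

float *mgridb = (float *)malloc(sizeof(float) * nphicells * nrcells * model_num);
for (int i = 0; i < nphicells; i++)
    for (int j = 0; j < nrcells; j++)
        for (int k = 0; k < model_num; k++)
            mgridb[(i * nrcells + j) * model_num + k] = 0;  // the flattened equivalent of [i][j][k]
// ... use mgridb, copy it to the device, etc. ...
free(mgridb);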
I have the following nested for loop and I would like to port it to CUDA to run on a GPU:
int current = 0;
int ptr = 0;
for (int i = 0; i < Nbeams; i++) {
    for (int j = 0; j < NbeamletsPerbeam[i]; j++) {
        current = j + ptr;   // flat index of this beamlet
        for (int k = 0; k < Nmax; k++) {
            ......
        }
    }
    ptr += NbeamletsPerbeam[i];   // advance past this beam's beamlets
}
I would be very happy if anybody has an idea of how this can be done. We are talking about Nbeams = 5, with NbeamletsPerbeam around 200 each.
This is what I currently have, but I am not sure it is right...
for (int i = blockIdx.x; i < d_params->Nbeams; i += gridDim.x) {
    for (int j = threadIdx.y; j < d_beamletsPerBeam[i]; j += blockDim.y) {
        currentBeamlet = j + ptr;
        for (int ivoxel = threadIdx.x; ivoxel < totalVoxels; ivoxel += blockDim.x) {
I would suggest this idea, but you might need to make some minor modifications based on your code.
dim3 blocks(NoOfThreads, 1);   // threads per block
dim3 grid(Nbeams, 1);          // one block per beam
kernel<<<grid, blocks>>>();

__global__ void kernel()
{
    // offset of this beam's first beamlet: the sum of all previous beams' counts
    int ptr = 0;
    for (int b = 0; b < blockIdx.x; b++)
        ptr += NbeamletsPerbeam[b];

    // number of thread-sized chunks needed to cover this beam's beamlets
    int noOfChunks = (NbeamletsPerbeam[blockIdx.x] + blockDim.x - 1) / blockDim.x;
    for (int j = 0; j < noOfChunks; j++) {
        int beamlet = threadIdx.x + j * blockDim.x;   // use threads and compute...
        if (beamlet < NbeamletsPerbeam[blockIdx.x]) {
            int current = beamlet + ptr;
            for (int k = 0; k < Nmax; k++) {
                ......
            }
        }
    }
}
This should do the trick and give you better parallelization.