In my CUDA code I am trying to use a structure and a constant structure object. The values are assigned to the constant object using cudaMemcpyToSymbol, but those constant values are never read in the kernel. I know this is not the usual use of constant memory, since each thread needs to access a different value and cannot take advantage of the broadcast to the half-warp, but in this particular situation I need it this way.
#include <iostream>
#include <stdio.h>
#include <cuda.h>

using namespace std;

struct CDistance
{
    int Magnitude;
    int Direction;
};

__constant__ CDistance *c_daSTLDistance;

__global__ static void CalcSTLDistance_Kernel(CDistance *m_daSTLDistance)
{
    int ID = threadIdx.x;
    m_daSTLDistance[ID].Magnitude = m_daSTLDistance[ID].Magnitude + c_daSTLDistance[ID].Magnitude;
    m_daSTLDistance[ID].Direction = 2;
}

// main routine that executes on the host
int main(void)
{
    CDistance *m_haSTLDistance, *m_daSTLDistance;
    m_haSTLDistance = new CDistance[10];
    for(int i = 0; i < 10; i++)
    {
        m_haSTLDistance[i].Magnitude = 3;
        m_haSTLDistance[i].Direction = 2;
    }
    //m_haSTLDistance = (CDistance*)malloc(100 * sizeof(CDistance));
    cudaMalloc((void**)&m_daSTLDistance, sizeof(CDistance)*10);
    cudaMemcpy(m_daSTLDistance, m_haSTLDistance, sizeof(CDistance)*10, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_daSTLDistance, m_haSTLDistance, sizeof(m_daSTLDistance)*10);

    CalcSTLDistance_Kernel<<< 1, 100 >>> (m_daSTLDistance);

    cudaMemcpy(m_haSTLDistance, m_daSTLDistance, sizeof(CDistance)*10, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++){
        cout << m_haSTLDistance[i].Magnitude << endl;
    }
    free(m_haSTLDistance);
    cudaFree(m_daSTLDistance);
}
In the output, c_daSTLDistance[ID].Magnitude is never read in the kernel and the initially assigned value 3 is returned, whereas I want the device value 3 to be added to the constant value so that a total of 6 is returned.
Running the program under cuda-memcheck reports an out-of-bounds read error.
Your code doesn't work because of an uninitialised pointer/buffer overflow problem around the use of c_daSTLDistance. It is illegal to do this:
__constant__ CDistance *c_daSTLDistance;
....
cudaMemcpyToSymbol(c_daSTLDistance, m_haSTLDistance, sizeof(m_daSTLDistance)*10);
No memory was ever allocated, nor was a valid value ever set, for c_daSTLDistance.
Further, note that all constant memory variables must be statically defined, and there is no ability to dynamically allocate constant memory at runtime. Therefore, what you are attempting to do can't be made to work. Also note that on all but the very oldest of CUDA devices, kernel arguments are stored in constant memory. So if you had a trivially small array of constant structures, it would be far easier and simpler to pass them by value to the kernel. The compiler and runtime will automagically place them in constant memory for you without any explicit host API calls.
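For illustration, here is a minimal, untested sketch of the statically defined approach applied to the code above; the fixed size of 10 is taken from the question, and the launch is reduced to one thread per element:
__constant__ CDistance c_daSTLDistance[10];   // statically sized constant array, no allocation needed

__global__ static void CalcSTLDistance_Kernel(CDistance *m_daSTLDistance)
{
    int ID = threadIdx.x;
    // constant values are now read from the statically defined array
    m_daSTLDistance[ID].Magnitude += c_daSTLDistance[ID].Magnitude;
    m_daSTLDistance[ID].Direction = 2;
}

// host side: copy directly into the constant array, then launch only as many threads as there are elements
cudaMemcpyToSymbol(c_daSTLDistance, m_haSTLDistance, 10 * sizeof(CDistance));
CalcSTLDistance_Kernel<<< 1, 10 >>>(m_daSTLDistance);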
I have a class FPlan that has a number of methods such as permute and packing.
__host__ __device__ void Perturb_action(FPlan *dfp){
    dfp->perturb();
    dfp->packing();
}

__global__ void Vector_Perturb(FPlan **dfp, int n){
    int i = threadIdx.x;
    if(i < n) Perturb_action(dfp[i]);
}
in main:
FPlan **fp_vec;
fp_vec = (FPlan**)malloc(VEC_SIZE*sizeof(FPlan*));

//initialize the vec
for(int i = 0; i < VEC_SIZE; i++)
    fp_vec[i] = &fp;
//fp of type FPlan that is initialized

int v_sz = sizeof(fp_vec);
double test = fp_vec[0]->getCost();
printf("the cost before perturb %f\n", test);

FPlan **value;
cudaMalloc(&value, v_sz);
cudaMemcpy(value, &fp_vec, v_sz, cudaMemcpyHostToDevice);

//call kernel
dim3 threadsPerBlock(VEC_SIZE);
dim3 numBlocks(1);
Vector_Perturb<<<numBlocks,threadsPerBlock>>>(value, VEC_SIZE);

cudaMemcpy(fp_vec, value, v_sz, cudaMemcpyDeviceToHost);

test = fp_vec[0]->getCost();
printf("the cost after perturb %f\n", test);
test = fp_vec[1]->getCost();
printf("the cost after perturb %f\n", test);
Before the perturbation, the printf for fp_vec[0] reports a cost of 0.8.
After the perturbation, fp_vec[0] reports inf and fp_vec[1] reports 0.8.
The expected output after the perturbation would be something like fp_vec[0] = 0.7 and fp_vec[1] = 0.9. I want to apply these perturbations to an array of FPlan objects.
What am I missing? Is calling an external function supported in CUDA?
This seems to be a common problem these days:
Consider the following code:
#include <stdio.h>
#include <stdlib.h>

int main() {
    int* arr = (int*) malloc(100);
    printf("sizeof(arr) = %zu\n", sizeof(arr));   // %zu is the correct format for size_t
    return 0;
}
What is the expected output? 100? No, it is 4 (at least on a 32-bit machine). sizeof() returns the size of a variable's type, not the allocated size of an array.
int v_sz = sizeof(fp_vec);
double test = fp_vec[0]->getCost();
printf("the cost before perturb %f\n", test);

FPlan **value;
cudaMalloc(&value, v_sz);
cudaMemcpy(value, &fp_vec, v_sz, cudaMemcpyHostToDevice);
You are allocating 4 (or 8) bytes on the device and copying 4 (or 8) bytes. The result is undefined (and quite possibly garbage every time).
Besides that, you should do proper error checking of your CUDA calls.
Have a look: What is the canonical way to check for errors using the CUDA runtime API?
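As a minimal sketch of what the allocation and copy could look like with the byte count computed explicitly, plus a simple error-check wrapper (the checkCuda macro below is only an illustration, not part of the CUDA API, and this only addresses the sizing and error-checking issues discussed above):
#define checkCuda(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) \
        printf("CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
} while (0)

size_t v_bytes = VEC_SIZE * sizeof(FPlan*);   // explicit byte count instead of sizeof(fp_vec)
FPlan **value;
checkCuda(cudaMalloc(&value, v_bytes));
checkCuda(cudaMemcpy(value, fp_vec, v_bytes, cudaMemcpyHostToDevice));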
I want to write this program in CUDA.
In "main.cpp":
struct Center{
    double *Data;
    int dimension;
};
typedef struct Center Center;
// I allocate a pointer to N Center elements with cudaMalloc, as follows
....
#include "kernel.cu"
....
Center *V_dev;
int M = 100, n = 4;
cudaStatus = cudaMalloc((void**)&V_dev, M*sizeof(Center));
Init<<<1,M>>>(V_dev, M, n); // I always know the dimension n before calling
My "kernel.cu" file is something like this
#include "cuda_runtime.h"
#include"device_launch_parameters.h"
... //other include headers to allow my .cu file to know the Center type definition
__global__ void Init(Center *V, int N, int dimension){
    V[threadIdx.x].dimension = dimension;
    V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
    for(int i = 0; i < dimension; i++)
        V[threadIdx.x].Data[i] = 0; // for the value, it can be any kind of operation returning a float that I want to be able to put here
}
I'm on Visual Studio 2008 with CUDA 5.0. When I build my project, I get this error:
error: calling a __host__ function("malloc") from a __global__ function("Init") is not allowed
How can I do this? (I know that malloc and other CPU memory allocation functions are not allowed for device memory.)
malloc is allowed in device code but you have to be compiling for a cc2.0 or greater target GPU.
Adjust your VS project settings to remove any GPU device settings like compute_10,sm_10 and replace it with compute_20,sm_20 or higher to match your GPU. (And, to run that code, your GPU needs to be cc2.0 or higher.)
You need the compiler parameter -arch=sm_20 and a GPU which supports it.
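For illustration, a minimal sketch of in-kernel malloc/free on a cc 2.0 or later device; the kernel name and sizes here are made up, and it assumes compilation with nvcc -arch=sm_20 as described above:
#include <stdio.h>

__global__ void Init_demo(int dimension)
{
    // device-side heap allocation requires compute capability 2.0 or higher
    double *data = (double*)malloc(dimension * sizeof(double));
    if (data == NULL) return;          // in-kernel malloc returns NULL on failure
    for (int i = 0; i < dimension; i++)
        data[i] = 0.0;
    free(data);                        // memory allocated in device code is freed in device code
}

int main()
{
    Init_demo<<<1, 100>>>(4);
    cudaDeviceSynchronize();
    return 0;
}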
There are similar questions to what I'm about to ask, but I feel like none of them get at the heart of what I'm really looking for. What I have now is a CUDA method that requires defining two arrays in shared memory. The size of the arrays is given by a variable that is read into the program after execution starts, so I cannot use that variable to define the array sizes, because the size of a shared array must be known at compile time. I do not want to do something like __shared__ double arr1[1000], because typing in the size by hand is useless to me since it will change depending on the input. In the same vein, I cannot use #define to create a constant for the size.
Now I can follow an example similar to what is in the manual (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared) such as
extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 = (int*)&array1[64];
}
But this still runs into an issue. From what I've read, a dynamically allocated shared array always begins at the start of the shared memory region. That means I need to shift my second array over by the size of the first array, as they appear to do in this example. But the size of the first array depends on user input.
Another question (Cuda Shared Memory array variable) has a similar issue, and they were told to create a single array that would act as the array for both arrays and simply adjust the indices to properly match the arrays. While this does seem to do what I want, it looks very messy. Is there any way around this so that I can still maintain two independent arrays, each with sizes that are defined as input by the user?
When using dynamic shared memory with CUDA, there is one and only one pointer passed to the kernel, which defines the start of the requested/allocated area in bytes:
extern __shared__ char array[];
There is no way to handle it differently. However this does not prevent you from having two user-sized arrays. Here's a worked example:
$ cat t501.cu
#include <stdio.h>

__global__ void my_kernel(unsigned arr1_sz, unsigned arr2_sz){
    extern __shared__ char array[];
    double *my_ddata = (double *)array;
    char *my_cdata = arr1_sz*sizeof(double) + array;

    for (int i = 0; i < arr1_sz; i++) my_ddata[i] = (double) i*1.1f;
    for (int i = 0; i < arr2_sz; i++) my_cdata[i] = (char) i;
    printf("at offset %d, arr1: %lf, arr2: %d\n", 10, my_ddata[10], (int)my_cdata[10]);
}

int main(){
    unsigned double_array_size = 256;
    unsigned char_array_size = 128;
    unsigned shared_mem_size = (double_array_size*sizeof(double)) + (char_array_size*sizeof(char));
    my_kernel<<<1,1, shared_mem_size>>>(256, 128);
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -arch=sm_20 -o t501 t501.cu
$ cuda-memcheck ./t501
========= CUDA-MEMCHECK
at offset 10, arr1: 11.000000, arr2: 10
========= ERROR SUMMARY: 0 errors
$
If you have a random arrangement of arrays of mixed data types, you'll want to either manually align your array starting points (and request enough shared memory) or else use alignment directives (and be sure to request enough shared memory), or use structures to help with alignment.
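As an illustration of the manual alignment option, here is a hypothetical variant of the kernel above with the char array placed first; the rounding of the offset up to a multiple of sizeof(double) is one assumed way to do it, not part of the answer itself:
__global__ void my_kernel2(unsigned arr1_sz, unsigned arr2_sz){
    extern __shared__ char array[];
    char *my_cdata = array;                    // char array placed first this time
    // round the char array's size up to a multiple of sizeof(double) so the doubles are aligned
    size_t offset = (arr1_sz + sizeof(double) - 1) & ~(sizeof(double) - 1);
    double *my_ddata = (double *)(array + offset);
    for (int i = 0; i < arr1_sz; i++) my_cdata[i] = (char) i;
    for (int i = 0; i < arr2_sz; i++) my_ddata[i] = (double) i*1.1f;
}
The host would then need to request offset + arr2_sz*sizeof(double) bytes of dynamic shared memory, with the offset computed the same way.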
I am reading the CUB documentations and examples:
#include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;

    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...

    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
}
In the example, each thread has 4 keys. It looks like thread_keys will be allocated in local memory (which is backed by off-chip global memory). If I only have 1 key per thread, could I declare int thread_key; and keep this variable in a register only?
BlockRadixSort(temp_storage).Sort() takes a pointer to the keys as a parameter. Does that mean the keys have to be in global memory?
I would like to use this code, but I want each thread to hold one key in a register and keep it on-chip in registers/shared memory after the keys are sorted.
Thanks in advance!
You can do this using shared memory (which will keep it "on-chip"). I'm not sure I know how to do it using strictly registers without de-constructing the BlockRadixSort object.
Here's an example code that uses shared memory to hold the initial data to be sorted, and the final sorted results. This sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, with the exception of the data synthesis and debug printouts:
#include <cub/cub.cuh>
#include <stdio.h>

#define nTPB 32
#define ELEMS_PER_THREAD 1

// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
__global__ void BlockSortKernel()
{
    __shared__ int my_val[nTPB*ELEMS_PER_THREAD];
    using namespace cub;
    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ELEMS_PER_THREAD] = (threadIdx.x + 5)%nTPB;   // synth data
    __syncthreads();
    printf("thread %d data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);

    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
    __syncthreads();

    printf("thread %d sorted data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);
}

int main(){
    BlockSortKernel<<<1,nTPB>>>();
    cudaDeviceSynchronize();
}
This seems to work correctly for me; in this case I happened to be using RHEL 5.5 / gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).
The strange/ugly static casting is needed as far as I can tell, because the CUB Sort is expecting a reference to an array of length equal to the customization parameter ITEMS_PER_THREAD (i.e. ELEMS_PER_THREAD):
__device__ __forceinline__ void Sort(
    Key (&keys)[ITEMS_PER_THREAD],
    int begin_bit = 0,
    int end_bit = sizeof(Key) * 8)
{ ...
I need some help with CUDA global memory. In my project I must declare a global array, to avoid sending the array on every kernel call.
EDIT:
My application may call the kernel more than 1,000 times, and on every call I send it an array larger than [1000 X 1000], so I think the transfers are what make my app slow. So I need to declare a global array for the GPU. My questions are:
1. How do I declare a global array?
2. How do I initialize a global array from the CPU before the kernel call?
Thanks in advance
Your edited question is confusing because you say you are sending your kernel an array of size 1000 x 1000 but you want to know how to do this using a global array. The only way I know of to send this much data to a kernel is to use a global array, so you are probably already doing this with an array in global memory.
Nevertheless, there are 2 methods, at least, to create and initialize an array in global memory:
1. Statically, using __device__ and cudaMemcpyToSymbol, for example:
#define SIZE 100

__device__ int A[SIZE];
...
int main(){
    int myA[SIZE];
    for (int i = 0; i < SIZE; i++) myA[i] = 5;
    cudaMemcpyToSymbol(A, myA, SIZE*sizeof(int));
    ...
    (kernel calls, etc.)
}
(device variable reference, cudaMemcpyToSymbol reference)
2. Dynamically, using cudaMalloc and cudaMemcpy:
#define SIZE 100
...
int main(){
    int myA[SIZE];
    int *A;
    for (int i = 0; i < SIZE; i++) myA[i] = 5;
    cudaMalloc((void **)&A, SIZE*sizeof(int));
    cudaMemcpy(A, myA, SIZE*sizeof(int), cudaMemcpyHostToDevice);
    ...
    (kernel calls, etc.)
}
(cudaMalloc reference, cudaMemcpy reference)
For clarity I'm omitting error checking, which you should do on all CUDA calls and kernel calls.
If I understand this question correctly (it is a bit unclear), you want to use a global array and send it to the device on every kernel call. This bad practice leads to high latency, because on every kernel call you need to transfer your data to the device. In my experience, such a practice led to negative speed-up.
An optimal way would be to use what I call the flip-flop technique. The way you do it is:
Declare two arrays on the device, d_arr1 and d_arr2.
Copy the data host -> device into one of the arrays.
Pass pointers to d_arr1 and d_arr2 as the kernel's parameters.
Process the data inside the kernel.
In subsequent kernel calls you exchange the pointers you are passing as parameters.
This way you avoid transferring the data on every kernel call. You transfer only at the beginning and at the end of your host loop.
int a;
for(a = 0; a < 1000; a++)
{
    if (a % 2 == 0)
        // call the kernel as kernel<<<...>>>(pointer_a, pointer_b)
    else
        // call the kernel as kernel<<<...>>>(pointer_b, pointer_a)
}