I need some help with CUDA global memory. In my project I need to declare a global array so that I don't have to send the array on every kernel call.
EDIT:
My application can call the kernel more than 1,000 times, and on every call I send it an array larger than 1000 x 1000. I think this transfer is what makes my app slow, so I want to declare a global array on the GPU. My questions are:
1. How do I declare a global array?
2. How do I initialize the global array from the CPU before the kernel call?
Thanks in advance
Your edited question is confusing because you say you are sending your kernel an array of size 1000 x 1000 but you want to know how to do this using a global array. The only way I know of to send this much data to a kernel is to use a global array, so you are probably already doing this with an array in global memory.
Nevertheless, there are 2 methods, at least, to create and initialize an array in global memory:
1. Statically, using __device__ and cudaMemcpyToSymbol, for example:
#define SIZE 100
__device__ int A[SIZE];
...
int main(){
int myA[SIZE];
for (int i=0; i< SIZE; i++) myA[i] = 5;
cudaMemcpyToSymbol(A, myA, SIZE*sizeof(int));
...
(kernel calls, etc.)
}
(device variable reference, cudaMemcpyToSymbol reference)
2. Dynamically, using cudaMalloc and cudaMemcpy:
#define SIZE 100
...
int main(){
int myA[SIZE];
int *A;
for (int i=0; i< SIZE; i++) myA[i] = 5;
cudaMalloc((void **)&A, SIZE*sizeof(int));
cudaMemcpy(A, myA, SIZE*sizeof(int), cudaMemcpyHostToDevice);
...
(kernel calls, etc.)
}
(cudaMalloc reference, cudaMemcpy reference)
For clarity I'm omitting error checking, which you should do on all CUDA API calls and kernel launches.
If I understand this question correctly (it is somewhat unclear), you want to use a global array and send it to the device on every kernel call. This is bad practice: it leads to high latency because every kernel call requires transferring your data to the device. In my experience such practice has led to negative speed-up.
A better way is to use what I call the flip-flop technique. It works like this:
Declare two arrays on the device, d_arr1 and d_arr2.
Copy the data host -> device into one of the arrays.
Pass pointers to d_arr1 and d_arr2 as kernel parameters.
Process the data in the kernel, reading from one array and writing to the other.
In subsequent kernel calls, exchange the pointers you pass as parameters.
This way you avoid transferring the data on every kernel call. You transfer only at the beginning and at the end of your host loop.
int a;
for (a = 0; a < 1000; a++)
{
    // alternate the roles of the two device arrays on each iteration
    if (a % 2 == 0) {
        // call to the kernel(pointer_a, pointer_b)
    } else {
        // call to the kernel(pointer_b, pointer_a)
    }
}
I changed the way I allocate host memory from method 1 to method 2, as shown in my code below. The code compiles and runs without any error.
I just wonder whether method 2 is a proper way to allocate memory for a pointer to pointer, and whether it has any side effects.
#define TESTSIZE 10
#define DIGITSIZE 5
//Method 1
int **ra;
ra = (int**)malloc(TESTSIZE * sizeof(int*)); // array of row pointers, so size by int*
for(int i = 0; i < TESTSIZE; i++){
ra[i] = (int *)malloc(DIGITSIZE * sizeof(int));
}
//Method 2
int **ra;
cudaMallocHost((void**)&ra, TESTSIZE * sizeof(int*)); // array of row pointers, so size by int*
for(int i = 0; i < TESTSIZE; i++){
cudaMallocHost((void**)&ra[i], DIGITSIZE * sizeof(int));
}
Both of them work fine. However, cudaMallocHost and malloc differ: cudaMallocHost allocates pinned (page-locked) memory, so under the hood the OS does something similar to malloc plus some extra work to pin the pages. This means that cudaMallocHost generally takes longer.
That being said, if you repeatedly want to cudaMemcpy from a single buffer then cudaMallocHost may benefit in the long run since it's quicker to transfer data from pinned memory.
Also, you are required to use pinned memory to overlap data transfer/computations with streams.
In my CUDA code I am trying to use a structure and a constant structure object. The value is assigned to the constant object using cudaMemcpyToSymbol, but the constant values are not accessed. I know this is not the intended use of constant memory, since each thread accesses a different value and cannot take advantage of the memory broadcast to a half-warp, but in some situations I need it this way.
#include <iostream>
#include <stdio.h>
#include <cuda.h>
using namespace std;
struct CDistance
{
int Magnitude;
int Direction;
};
__constant__ CDistance *c_daSTLDistance;
__global__ static void CalcSTLDistance_Kernel(CDistance *m_daSTLDistance)
{
int ID = threadIdx.x;
m_daSTLDistance[ID].Magnitude = m_daSTLDistance[ID].Magnitude + c_daSTLDistance[ID].Magnitude ;
m_daSTLDistance[ID].Direction = 2 ;
}
// main routine that executes on the host
int main(void)
{
CDistance *m_haSTLDistance,*m_daSTLDistance;
m_haSTLDistance = new CDistance[10];
for(int i=0;i<10;i++)
{
m_haSTLDistance[i].Magnitude=3;
m_haSTLDistance[i].Direction=2;
}
//m_haSTLDistance =(CDistance*)malloc(100 * sizeof(CDistance));
cudaMalloc((void**)&m_daSTLDistance,sizeof(CDistance)*10);
cudaMemcpy(m_daSTLDistance, m_haSTLDistance,sizeof(CDistance)*10, cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(c_daSTLDistance, m_haSTLDistance, sizeof(m_daSTLDistance)*10);
CalcSTLDistance_Kernel<<< 1, 100 >>> (m_daSTLDistance);
cudaMemcpy(m_haSTLDistance, m_daSTLDistance, sizeof(CDistance)*10, cudaMemcpyDeviceToHost);
for (int i=0;i<10;i++){
cout<<m_haSTLDistance[i].Magnitude<<endl;
}
free(m_haSTLDistance);
cudaFree(m_daSTLDistance);
}
In the output, the constant c_daSTLDistance[ID].Magnitude is not accessed in the kernel and the statically assigned value 3 is obtained, whereas I want the device value 3 to be added to the constant value so that a total of 6 is returned.
When I run the program under cuda-memcheck, it reports an out-of-bounds memory read.
Your code doesn't work because of an uninitialised pointer/buffer overflow problem around the use of c_daSTLDistance. It is illegal to do this:
__constant__ CDistance *c_daSTLDistance;
....
cudaMemcpyToSymbol(c_daSTLDistance, m_haSTLDistance, sizeof(m_daSTLDistance)*10);
No memory was ever allocated, nor was a valid value ever set, for c_daSTLDistance.
Further, note that all constant memory variables must be statically defined, and there is no ability to dynamically allocate constant memory at runtime. Therefore, what you are attempting to do can't be made to work. Also note that on all but the very oldest of CUDA devices, kernel arguments are stored in constant memory. So if you had a trivially small array of constant structures, it would be far easier and simpler to pass them by value to the kernel. The compiler and runtime will automagically place them in constant memory for you without any explicit host API calls.
I have a problem accessing and assigning to a variable of type cusp array1d from a device/global kernel. The attached code gives this error:
alay.cu(8): warning: address of a host variable "p1" cannot be directly taken in a device function
alay.cu(8): error: calling a __host__ function("thrust::detail::vector_base<float, thrust::device_malloc_allocator<float> > ::operator []") from a __global__ function("func") is not allowed
Code Below
#include <cusp/blas.h>
cusp::array1d<float, cusp::device_memory> p1(10,3);
__global__ void func()
{
p1[blockIdx.x]=p1[blockIdx.x]+blockIdx.x*5;
}
int main()
{
func<<<10,1>>>();
return 0;
}
CUSP matrices and arrays (and the Thrust containers they are built with) are intended for host use only. You cannot directly use them in GPU code.
The canonical way to populate a CUSP sparse matrix would be to construct it in host memory and then copy it across to device memory using the copy constructor, so your trivial example becomes this:
cusp::array1d<float, cusp::host_memory> p1(10);
for(int i=0; i<10; i++) p1[i] = 4.f;
cusp::array1d<float, cusp::device_memory> p2(p1); // copy constructor: data now on device
If you want to manipulate a sparse matrix in device code, you will need to have a kernel specifically for whichever format you are interested in, and pass pointers to each of the device arrays holding the matrix data as arguments to that kernel. There is good Doxygen source annotation for all of the sparse types included in the CUSP distribution.
Your edit still doesn't present anything which couldn't be done on the host without a kernel, viz:
cusp::array1d<float, cusp::host_memory> p1(10, 3.f);
for(int i=0; i<10; i++) p1[i] += (i * 5.f);
cusp::array1d<float, cusp::device_memory> p2(p1); // copy constructor: data now on device
I'm implementing my kernel in a multithreaded host program, where every host thread calls the kernel.
I have a problem with the use of constant memory. Some parameters are placed in constant memory, but they are different for every thread.
I built a sample where the problem occurs, too.
This is the kernel
__global__ void Kernel( int *aiOutput, int Length )
{
int id = threadIdx.x + blockIdx.x * blockDim.x;
int iValue = 0;
// bound check
if( id < Length )
{
if( id % 3 == 0 )
iValue = c_iaCoeff[2];
else if( id % 2 == 0 )
iValue = c_iaCoeff[1];
else
iValue = c_iaCoeff[0];
aiOutput[id] = iValue;
}
__syncthreads();
}
And a pthread is calling this function.
void* WrapperCopy( void* params )
{
// choose cuda device to perform on
CUDA_CHECK_RETURN( cudaSetDevice( 0 ) );
// cast of params
SParams *_params = (SParams*)params;
// copy coefficients to constant memory
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff, _params->h_piCoeff, 3*sizeof(int) ) );
// loop kernel
for( int i=0; i<100; i++ )
{
// perform kernel
Kernel<<< BLOCKCOUNT, BLOCKSIZE >>>( _params->d_piArray, _params->iLength );
}
// copy data back from gpu
CUDA_CHECK_RETURN( cudaMemcpy(
_params->h_piArray, _params->d_piArray, BLOCKSIZE*BLOCKCOUNT*sizeof(int), cudaMemcpyDeviceToHost ) );
return NULL;
}
Constant memory is declared as this.
__constant__ int c_iaCoeff[ 3 ];
Every host thread has different values in h_piCoeff and copies them to constant memory.
Now every pthread call gives the same results, because all of them see the same values in c_iaCoeff.
I think the problem is how constant memory works and that it has to be declared in a context: in the sample there is only one c_iaCoeff declared for all calling pthreads, and the kernels launched by the pthreads get the values of the last cudaMemcpyToSymbol. Is that right?
I have now tried to change my constant memory into a two-dimensional array.
The second dimension holds the values as before, and the first is the index of the pthread in use.
__constant__ int c_iaCoeff2[ THREADS ][ 3 ];
In the kernels it is used in this way.
iValue = c_iaCoeff2[iTId][2];
But I don't know whether constant memory can be used in this way. Can it?
I also get an error when I try to copy data to the constant memory.
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff[_params->iTId], _params->h_piCoeff, 3*sizeof(int) ) );
In general, is it possible to use constant memory as a two-dimensional array, and if so, where is my mistake?
Yes, you should be able to use constant memory in the way you want to, but the cudaMemcpyToSymbol copy operation you are using is incorrect. The first argument to the call is a symbol, and the API does a lookup in the runtime symbol table to get the address of the constant memory symbol you request. So an address can't be passed to the call (although your code is actually passing an initialised host value to the call, why that is I will leave as an exercise to the reader).
What you may have missed is the optional fourth argument in the call, which is an offset into the memory pointed to by the symbol you request. So you should be able to do something like:
cudaMemcpyToSymbol( c_iaCoeff, // symbol to lookup
_params->h_piCoeff, // source location
3*sizeof(int), // number of bytes to copy
(3*_params->iTId)*sizeof(int) // Offset in bytes
);
[standard disclaimer: written in browser, untested. use at own risk]
The last argument is the offset in bytes from the start of the symbol. Your 2D array will be laid out in row major order, so you need to use the pitch of the rows multiplied by the row index as an offset for each copy operation.
A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working example (with C++ host code, since I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
// Simple blockwise reduction using voting primitives
// increments kchanged if any thread in the block
// returned tchanged != 0
// __any_sync (CUDA 9+) replaces the deprecated __any; all 32 lanes participate
tchanged = __any_sync(0xffffffff, tchanged != 0);
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal to the host that it needs to run again. The kernel runs until each entry in the input has been incremented to at least a threshold value. At the end of each thread's processing, it participates in a warp vote, after which one thread from each warp stores the vote result to shared memory. One thread reduces the warp results and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires.
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global__ void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".