What is this warning and how can I fix it?
warning: ‘cudaError_t cudaMemcpyToArray(cudaArray_t, size_t, size_t, const void*, size_t, cudaMemcpyKind)’ is deprecated [-Wdeprecated-declarations]
Deprecated means its use is no longer recommended, and support for it may be dropped as early as the next CUDA release.
A description of what to do about it is given here.
For a typical usage where you are copying an entire allocation from host to device, and the source (host) allocation is a flat (unpitched) allocation of width w elements by height h rows, perhaps something like this:
cudaMemcpyToArray(dst, 0, 0, src, h*w*sizeof(src[0]), cudaMemcpyHostToDevice);
You would replace it with:
cudaMemcpy2DToArray(dst, 0, 0, src, w*sizeof(src[0]), w*sizeof(src[0]), h, cudaMemcpyHostToDevice);
The replacement API (cudaMemcpy2DToArray) is documented here.
Note that in the example I have given, if you have no sense of a "2D" allocation consisting of rows and columns, but instead have a single flat allocation of (let's say) w elements, you can simply set h=1 in the formulation above.
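As a concrete, minimal sketch of the migration (the cudaArray creation, the dimensions w and h, and the host buffer h_data are illustrative, not from the original question):

#include <cuda_runtime.h>

int main() {
    const size_t w = 64, h = 32;             // illustrative dimensions
    float *h_data = new float[w * h]();      // flat (unpitched) host allocation

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t dst;
    cudaMallocArray(&dst, &desc, w, h);

    // deprecated call:
    //   cudaMemcpyToArray(dst, 0, 0, h_data, h*w*sizeof(float), cudaMemcpyHostToDevice);
    // replacement: source pitch and copy width are both w*sizeof(float), height is h rows
    cudaMemcpy2DToArray(dst, 0, 0, h_data, w*sizeof(float), w*sizeof(float), h,
                        cudaMemcpyHostToDevice);

    cudaFreeArray(dst);
    delete[] h_data;
    return 0;
}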
I'm confused about copying arrays to constant memory.
According to the programming guide, there's at least one way to allocate constant memory and use it to store an array of values. This is called static memory allocation:
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));
According to the programming guide again, we can use:
__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));
It looks like dynamic constant memory allocation is used here, but I'm not sure about it. Also, no __constant__ qualifier is used.
So here are some questions:
Is this pointer stored in constant memory?
Is the memory it points to stored in constant memory too?
Is this pointer constant, i.e. is it not allowed to change the pointer from device or host code? And is changing the values of the array prohibited or not? If changing the values of the array is allowed, does that mean constant memory is not used to store these values?
The developer can declare up to 64K of constant memory at file scope. In SM 1.0, the constant memory used by the toolchain (e.g. to hold compile-time constants) was separate and distinct from the constant memory available to developers, and I don't think this has changed since. The driver dynamically manages switching between different views of constant memory as it launches kernels that reside in different compilation units. Although you cannot allocate constant memory dynamically, this pattern suffices because the 64K limit is not system-wide; it applies per compilation unit.
Use the first pattern cited in your question: statically declare the constant data and update it with cudaMemcpyToSymbol before launching kernels that reference it. In the second pattern, only reads of the pointer itself will go through constant memory. Reads using the pointer will be serviced by the normal L1/L2 cache hierarchy.
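A minimal sketch of that first pattern (the 256-element size matches the question; the kernel, its scaling factor, and the output buffer are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float constData[256];        // statically declared constant memory

__global__ void scale(float *out) {
    int i = threadIdx.x;
    out[i] = 2.0f * constData[i];         // reads are served through the constant cache
}

int main() {
    float h_data[256];
    for (int i = 0; i < 256; ++i) h_data[i] = (float)i;

    // update the constant data before launching kernels that reference it
    cudaMemcpyToSymbol(constData, h_data, sizeof(h_data));

    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    scale<<<1, 256>>>(d_out);

    float h_out[256];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("h_out[5] = %f\n", h_out[5]);  // expect 10.0
    cudaFree(d_out);
    return 0;
}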
I have a GeForce 620M and my code is:
int threadsPerBlock = 256;
int blocksPerGrid = Number_AA_GPU / threadsPerBlock;
for(it=0;it<Number_repeatGPU;it++)
{
    Kernel_Update<<<blocksPerGrid,threadsPerBlock>>>(A, B, C, D, rand(), rand());
}
I get:
invalid configuration argument.
What could be the reason?
The kernel configuration arguments are the arguments between the <<<...>>> symbols.
Your GeForce 620M is a compute capability 2.1 device.
A compute capability 2.1 device is limited to 65535 blocks in the x-dimension of the grid, which is what you are passing for the blocks-per-grid parameter (the first of the two configuration arguments).
Since the other parameter you are passing (256, threadsPerBlock) is definitely in-bounds, I conclude that your first parameter is out of bounds:
int blocksPerGrid = Number_AA_GPU / threadsPerBlock;
i.e. Number_AA_GPU is either too large (any value of 65536*256 or more makes blocksPerGrid exceed 65535), or it is too small (any value less than 256, including zero, makes blocksPerGrid zero due to integer division), or it is negative.
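A minimal sketch of a guarded launch (the value of Number_AA_GPU, the reduced kernel signature, and the error check are illustrative; the 65535 limit applies to the x grid dimension on cc 2.x devices):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void Kernel_Update(int a, int b) { /* ... */ }   // stub, parameters trimmed for brevity

int main() {
    int Number_AA_GPU = 1 << 20;                 // illustrative value
    int threadsPerBlock = 256;
    // round up so every element gets a thread, and avoid a zero-sized grid
    int blocksPerGrid = (Number_AA_GPU + threadsPerBlock - 1) / threadsPerBlock;

    if (blocksPerGrid < 1 || blocksPerGrid > 65535) {
        printf("grid size %d is out of range for a cc 2.1 device\n", blocksPerGrid);
        return 1;
    }

    Kernel_Update<<<blocksPerGrid, threadsPerBlock>>>(0, 0);
    cudaError_t err = cudaGetLastError();        // would report "invalid configuration argument"
    if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));
    return 0;
}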
In the future, you can write more easily decipherable questions if you provide a complete example. In this case, telling us what Number_AA_GPU is could make my answer more definite.
When I use float atomicAdd(float *address, float val) to add a float value smaller than approx. 1e-39 to 0, the addition does not work, and the value at address remains 0.
Here is the simplest code:
__device__ float test[6] = {0};
__global__ void testKernel() {
    float addit = sinf(1e-20);
    atomicAdd(&test[0], addit);
    test[1] += addit;
    addit = sinf(1e-37);
    atomicAdd(&test[2], addit);
    test[3] += addit;
    addit = sinf(1e-40);
    atomicAdd(&test[4], addit);
    test[5] += addit;
}
When I run the code above as testKernel<<<1, 1>>>(); and stop with the debugger I see:
test 0x42697800
[0] 9.9999997e-21
[1] 9.9999997e-21
[2] 9.9999999e-38
[3] 9.9999999e-38
[4] 0
[5] 9.9999461e-41
Notice the difference between test[4] and test[5]. Both did the same thing, yet the simple addition worked, and the atomic one did nothing at all.
What am I missing here?
Update: System info: CUDA 5.5.20, NVidia Titan card, Driver 331.82, Windows 7x64, Nsight 3.2.1.13309.
atomicAdd is a special instruction that does not necessarily obey the same flush and rounding behavior that you get when you specify, for example, -ftz=true or -ftz=false for other floating-point operations (e.g. an ordinary fp add).
As documented in the PTX ISA manual:
The floating-point operation .add is a single-precision, 32-bit operation. atom.add.f32 rounds to nearest even and flushes subnormal inputs and results to sign-preserving zero.
So even though an ordinary floating-point add will not flush denormals to zero if you specify -ftz=false (which I believe is the default for nvcc), the floating-point atomic add to global memory always flushes to zero.
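For completeness, here is a sketch of a host-side harness (assumed, not part of the original post) that reproduces the observation without the debugger by copying the __device__ array back with cudaMemcpyFromSymbol:

#include <cstdio>
#include <cuda_runtime.h>

__device__ float test[6] = {0};

__global__ void testKernel() {
    float addit = sinf(1e-20);
    atomicAdd(&test[0], addit);
    test[1] += addit;
    addit = sinf(1e-37);
    atomicAdd(&test[2], addit);
    test[3] += addit;
    addit = sinf(1e-40);
    atomicAdd(&test[4], addit);
    test[5] += addit;
}

int main() {
    testKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    float h_test[6];
    cudaMemcpyFromSymbol(h_test, test, sizeof(h_test));   // device-to-host is the default direction
    for (int i = 0; i < 6; ++i)
        printf("test[%d] = %g\n", i, h_test[i]);          // per the question, test[4] prints 0 while test[5] does not
    return 0;
}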
This works fine:
a_size=FindSizeAtrunTime();
Kernel<<< gridDim, blockDim, a_size >>>(count);
But this shows an error:
__global__ void Kernel(int count_a, int count_b)
{
    a_size=FindSizeAtrunTime();
    __shared__ int a[a_size];
}
error: expression must have a constant value
In both cases the size is being determined at runtime. So why is the first case OK, and not the second?
The second is illegal on two levels.
Firstly, C++98 (from which CUDA is mostly derived) doesn't allow statically declared, dynamically sized arrays. The language doesn't permit it, and so neither does CUDA.
Secondly, and more importantly, the size of dynamic shared memory allocations must be known before a kernel can be launched. The GPU must know how much shared memory to reserve before a block can be scheduled. That isn't possible in your second example, whereas it is in your first example.
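A minimal sketch of the working pattern, where the dynamically sized array is declared extern __shared__ inside the kernel and its byte size is supplied as the third launch configuration argument (the kernel body and sizes are illustrative):

#include <cuda_runtime.h>

__global__ void Kernel(int count) {
    extern __shared__ int a[];                    // size is not known at compile time
    if (threadIdx.x < count)
        a[threadIdx.x] = threadIdx.x;
    __syncthreads();
    // ... use a[] ...
}

int main() {
    int count = 128;
    size_t a_size = count * sizeof(int);          // determined at run time
    Kernel<<<1, 128, a_size>>>(count);            // third argument = bytes of dynamic shared memory
    cudaDeviceSynchronize();
    return 0;
}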
Reduction in CUDA has utterly baffled me! First off, both this tutorial by Mark Harris and this one by Mike Giles make use of the declaration extern __shared__ temp[]. The keyword extern is used in C when a declaration is made but the allocation takes place elsewhere (e.g. in another C file, in general). What is the relevance of extern here? Why don't we use:
__shared__ float temp[N/2];
for instance? Or why don't we declare temp to be a global variable, e.g.
#define N 1024
__shared__ float temp[N/2];
__global__ void sum(float *sum, float *data){ ... }
int main(){
    ...
    sum<<<M,L>>>(sum, data);
}
I have yet another question: how many blocks and threads per block should one use to invoke the summation kernel? I tried this example (based on this).
Note: You can find information about my devices here.
The answer to the first question is that CUDA supports dynamic shared memory allocation at runtime (see this SO question and the documentation for more details). Declaring shared memory with extern tells the compiler that the shared memory size will be determined at kernel launch, passed in bytes as an argument to the <<< >>> syntax (or equivalently via an API function), something like:
sum<<< gridsize, blocksize, sharedmem_size >>>(....);
The answer to the second question is that you would normally launch the number of blocks which will completely fill all the streaming multiprocessors on your GPU. Most sensibly written reduction kernels will accumulate many values per thread and then perform a shared memory reduction. The reduction requires that the number of threads per block be a power of two, which usually gives you 32, 64, 128, 256, or 512 (or 1024 if you have a Fermi or Kepler GPU). It is a very finite search space; just benchmark to see what works best on your hardware. You can find a more general discussion about block and grid sizing here and here.
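To make that concrete, here is a minimal sketch of a reduction kernel in this style (not the tutorial code; the sizes, names, and grid-stride accumulation loop are illustrative, and the block size is assumed to be a power of two):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void sum_kernel(const float *data, float *block_sums, int n) {
    extern __shared__ float temp[];              // sized at launch: blockDim.x floats
    int tid = threadIdx.x;

    // each thread accumulates many values with a grid-stride loop
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        acc += data[i];
    temp[tid] = acc;
    __syncthreads();

    // shared-memory tree reduction; requires blockDim.x to be a power of two
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) temp[tid] += temp[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = temp[0];
}

int main() {
    const int n = 1 << 20, blocksize = 256, gridsize = 64;   // illustrative sizes
    float *h_data = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data, *d_block_sums;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_block_sums, gridsize * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // third launch argument = dynamic shared memory size in bytes
    sum_kernel<<<gridsize, blocksize, blocksize * sizeof(float)>>>(d_data, d_block_sums, n);

    float h_block_sums[gridsize], total = 0.0f;
    cudaMemcpy(h_block_sums, d_block_sums, gridsize * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < gridsize; ++i) total += h_block_sums[i];  // finish the sum on the host
    printf("sum = %f (expected %d)\n", total, n);

    cudaFree(d_data); cudaFree(d_block_sums); free(h_data);
    return 0;
}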