Host float constant usage in a kernel in CUDA - cuda

I am using CUDA 5.0. I noticed that the compiler will allow me to use host-declared int constants within kernels. However, it refuses to compile any kernels that use host-declared float constants. Does anyone know the reason for this seeming discrepancy?
For example, the following code runs just fine as is, but it will not compile if the final line in the kernel is uncommented.
#include <cstdio>
#include <cuda_runtime.h>
static int __constant__ DEV_INT_CONSTANT = 1;
static float __constant__ DEV_FLOAT_CONSTANT = 2.0f;
static int const HST_INT_CONSTANT = 3;
static float const HST_FLOAT_CONSTANT = 4.0f;
__global__ void uselessKernel(float * val)
{
*val = 0.0f;
// Use device int and float constants
*val += DEV_INT_CONSTANT;
*val += DEV_FLOAT_CONSTANT;
// Use host int and float constants
*val += HST_INT_CONSTANT;
//*val += HST_FLOAT_CONSTANT; // won't compile if uncommented
}
int main(void)
{
float * d_val;
cudaMalloc((void **)&d_val, sizeof(float));
uselessKernel<<<1, 1>>>(d_val);
cudaFree(d_val);
}

Adding a const number in the device code is OK, but adding a number stored on the host memory in the device code is NOT.
Every reference of the static const int in your code can be replaced with the value 3 by the compiler/optimizer when the addr of that variable is never referenced. In this case, it is like #define HST_INT_CONSTANT 3, and no host memory is allocated for this variable.
But for float var, the host memory is always allocated even it is of static const float. Since the kernel can not access the host memory directly, your code with static const float won't be compiled.
For C/C++, int can be optimized more aggressively than float.
You code runs when the comment is ON can be seen as a bug of CUDA C I think. The static const int is a host side thing, and should not be accessible to the device directly.

Related

variable declared in .cuh file is undefined in .cu [duplicate]

I am using CUDA 5.0. I noticed that the compiler will allow me to use host-declared int constants within kernels. However, it refuses to compile any kernels that use host-declared float constants. Does anyone know the reason for this seeming discrepancy?
For example, the following code runs just fine as is, but it will not compile if the final line in the kernel is uncommented.
#include <cstdio>
#include <cuda_runtime.h>
static int __constant__ DEV_INT_CONSTANT = 1;
static float __constant__ DEV_FLOAT_CONSTANT = 2.0f;
static int const HST_INT_CONSTANT = 3;
static float const HST_FLOAT_CONSTANT = 4.0f;
__global__ void uselessKernel(float * val)
{
*val = 0.0f;
// Use device int and float constants
*val += DEV_INT_CONSTANT;
*val += DEV_FLOAT_CONSTANT;
// Use host int and float constants
*val += HST_INT_CONSTANT;
//*val += HST_FLOAT_CONSTANT; // won't compile if uncommented
}
int main(void)
{
float * d_val;
cudaMalloc((void **)&d_val, sizeof(float));
uselessKernel<<<1, 1>>>(d_val);
cudaFree(d_val);
}
Adding a const number in the device code is OK, but adding a number stored on the host memory in the device code is NOT.
Every reference of the static const int in your code can be replaced with the value 3 by the compiler/optimizer when the addr of that variable is never referenced. In this case, it is like #define HST_INT_CONSTANT 3, and no host memory is allocated for this variable.
But for float var, the host memory is always allocated even it is of static const float. Since the kernel can not access the host memory directly, your code with static const float won't be compiled.
For C/C++, int can be optimized more aggressively than float.
You code runs when the comment is ON can be seen as a bug of CUDA C I think. The static const int is a host side thing, and should not be accessible to the device directly.

External call of a class method in a kernel

I have a class FPlan that has a number of methods such as permute and packing.
__host__ __device__ void Perturb_action(FPlan *dfp){
dfp->perturb();
dfp->packing();
}
__global__ void Vector_Perturb(FPlan **dfp, int n){
int i=threadIx.x;
if(i<n) Perturb_action(dfp[i]);
}
in main:
FPlan **fp_vec;
fp_vec=(FPlan**)malloc(VEC_SIZE*sizeof(FPlan*));
//initialize the vec
for(int i=0; i<VEC_SIZE;i++)
fp_vec[i]=&fp;
//fp of type FPlan that is initialized
int v_sz=sizeof(fp_vec);
double test=fp_vec[0]->getCost();
printf("the cost before perturb %f\n"test);
FPlan **value;
cudaMalloc(&value,v_sz);
cudaMemcpy(value,&fp_vec,v_sz,cudaMemcpyHostToDevice);
//call kernel
dim3 threadsPerBlock(VEC_SIZE);
dim3 numBlocks(1);
Vector_Perturb<<<numBlocks,threadsPerBlock>>> (value,VEC_SIZE);
cudaMemcpy(fp_vec,value,v_sz,cudaMemcpyDeviceToHost);
test=fp_vec[0]->getCost();
printf("the cost after perturb %f\n"test);
test=fp_vec[1]->getCost();
printf("the cost after perturb %f\n"test);
I am getting before permute for fp_vec[0] printf the cost 0.8.
After permute for fp_vec[0] the value inf and for fp_vec[1] the value 0.8.
The expected output after the permutation should be something like fp_vec[0] = 0.7 and fp_vec[1] = 0.9. I want to apply these permutations to an array of type FPlan.
What am I missing? Is calling an external function supported in CUDA?
This seems to be a common problem these days:
Consider the following code:
#include <stdio.h>
#include <stdlib.h>
int main() {
int* arr = (int*) malloc(100);
printf("sizeof(arr) = %i", sizeof(arr));
return 0;
}
what is the expected ouptut? 100? no its 4 (at least on a 32 bit machine). sizeof() returns the size of the type of a variable not the allocated size of an array.
int v_sz=sizeof(fp_vec);
double test=fp_vec[0]->getCost();
printf("the cost before perturb %f\n"test);
FPlan **value;
cudaMalloc(&value,v_sz);
cudaMemcpy(value,&fp_vec,v_sz,cudaMemcpyHostToDevice);
You are allocating 4 (or 8) bytes on the device and copy 4 (or 8) bytes. The result is undefined (and maybe every time garbage).
Besides that, you shold do proper error checking of your CUDA calls.
Have a look: What is the canonical way to check for errors using the CUDA runtime API?

Using constant in CUDA kernel: works with int, fails to compile with float [duplicate]

I am using CUDA 5.0. I noticed that the compiler will allow me to use host-declared int constants within kernels. However, it refuses to compile any kernels that use host-declared float constants. Does anyone know the reason for this seeming discrepancy?
For example, the following code runs just fine as is, but it will not compile if the final line in the kernel is uncommented.
#include <cstdio>
#include <cuda_runtime.h>
static int __constant__ DEV_INT_CONSTANT = 1;
static float __constant__ DEV_FLOAT_CONSTANT = 2.0f;
static int const HST_INT_CONSTANT = 3;
static float const HST_FLOAT_CONSTANT = 4.0f;
__global__ void uselessKernel(float * val)
{
*val = 0.0f;
// Use device int and float constants
*val += DEV_INT_CONSTANT;
*val += DEV_FLOAT_CONSTANT;
// Use host int and float constants
*val += HST_INT_CONSTANT;
//*val += HST_FLOAT_CONSTANT; // won't compile if uncommented
}
int main(void)
{
float * d_val;
cudaMalloc((void **)&d_val, sizeof(float));
uselessKernel<<<1, 1>>>(d_val);
cudaFree(d_val);
}
Adding a const number in the device code is OK, but adding a number stored on the host memory in the device code is NOT.
Every reference of the static const int in your code can be replaced with the value 3 by the compiler/optimizer when the addr of that variable is never referenced. In this case, it is like #define HST_INT_CONSTANT 3, and no host memory is allocated for this variable.
But for float var, the host memory is always allocated even it is of static const float. Since the kernel can not access the host memory directly, your code with static const float won't be compiled.
For C/C++, int can be optimized more aggressively than float.
You code runs when the comment is ON can be seen as a bug of CUDA C I think. The static const int is a host side thing, and should not be accessible to the device directly.

CUDA shared memory issue in outputs depending on extern declaration and size of array

If I am experimenting with shared memory in CUDA and I do not understand its behaviour in this bit of code.
I have a pretty basic kernel:
__global__ void sum( int* input, int* output, int size){
int tid = threadIdx.x+blockDim.x*blockIdx.x +
blockDim.x*gridDim.x*blockIdx.y;
extern __shared__ int sdata[];
sdata[tid] = input[tid];
__syncthreads();
output[tid] = input[tid];
}
And the output is 0 for all output[]. However, if I comment out sdata[tid] = input[tid];, then the output is fine and equal input[].
What am I doing wrong here? Am I missing something?
[UPDATE]
Well, if I remove the tag extern and give a size to the shared array, it seems to work fine. Any ideas why?
[UPDATE]
The way that I am invoking the kernel is from c++ code, so I needed to wrap it to be invoked from the main code.
kernel.cu contains the kernel itself plus the wrapper function:
void wrapper(int dBlock, int dThread, int* input, int* output, int size){
sum<<<dBlock,dThread>>>(input, output, size);
}
callerfunction.cpp contains c++ code and the function that invokes the wrapper.
If you use the extern qualifier you need to pass the size of the shared memory when launching the kernel.
kernel<<< blocks, threads, size>>>(...)
The size parameter is the size of shared memory in Bytes.

Using std::vector in CUDA device code

The question is that: is there a way to use the class "vector" in Cuda kernels? When I try I get the following error:
error : calling a host function("std::vector<int, std::allocator<int> > ::push_back") from a __device__/__global__ function not allowed
So there a way to use a vector in global section?
I recently tried the following:
create a new Cuda project
go to properties of the project
open Cuda C/C++
go to Device
change the value in "Code Generation" to be set to this value:
compute_20,sm_20
........ after that I was able to use the printf standard library function in my Cuda kernel.
is there a way to use the standard library class vector in the way printf is supported in kernel code? This is an example of using printf in kernel code:
// this code only to count the 3s in an array using Cuda
//private_count is an array to hold every thread's result separately
__global__ void countKernel(int *a, int length, int* private_count)
{
printf("%d\n",threadIdx.x); //it's print the thread id and it's working
// vector<int> y;
//y.push_back(0); is there a possibility to do this?
unsigned int offset = threadIdx.x * length;
int i = offset;
for( ; i < offset + length; i++)
{
if(a[i] == 3)
{
private_count[threadIdx.x]++;
printf("%d ",a[i]);
}
}
}
You can't use the STL in CUDA, but you may be able to use the Thrust library to do what you want. Otherwise just copy the contents of the vector to the device and operate on it normally.
In the cuda library thrust, you can use thrust::device_vector<classT> to define a vector on device, and the data transfer between host STL vector and device vector is very straightforward. you can refer to this useful link:http://docs.nvidia.com/cuda/thrust/index.html to find some useful examples.
you can't use std::vector in device code, you should use array instead.
I think you can implement a device vector by youself, because CUDA supports dynamic memory alloction in device codes. Operator new/delete are also supported. Here is an extremely simple prototype of device vector in CUDA, but it does work. It hasn't been tested sufficiently.
template<typename T>
class LocalVector
{
private:
T* m_begin;
T* m_end;
size_t capacity;
size_t length;
__device__ void expand() {
capacity *= 2;
size_t tempLength = (m_end - m_begin);
T* tempBegin = new T[capacity];
memcpy(tempBegin, m_begin, tempLength * sizeof(T));
delete[] m_begin;
m_begin = tempBegin;
m_end = m_begin + tempLength;
length = static_cast<size_t>(m_end - m_begin);
}
public:
__device__ explicit LocalVector() : length(0), capacity(16) {
m_begin = new T[capacity];
m_end = m_begin;
}
__device__ T& operator[] (unsigned int index) {
return *(m_begin + index);//*(begin+index)
}
__device__ T* begin() {
return m_begin;
}
__device__ T* end() {
return m_end;
}
__device__ ~LocalVector()
{
delete[] m_begin;
m_begin = nullptr;
}
__device__ void add(T t) {
if ((m_end - m_begin) >= capacity) {
expand();
}
new (m_end) T(t);
m_end++;
length++;
}
__device__ T pop() {
T endElement = (*m_end);
delete m_end;
m_end--;
return endElement;
}
__device__ size_t getSize() {
return length;
}
};
You can't use std::vector in device-side code. Why?
It's not marked to allow this
The "formal" reason is that, to use code in your device-side function or kernel, that code itself has to be in a __device__ function; and the code in the standard library, including, std::vector is not. (There's an exception for constexpr code; and in C++20, std::vector does have constexpr methods, but CUDA does not support C++20 at the moment, plus, that constexprness is effectively limited.)
You probably don't really want to
The std::vector class uses allocators to obtain more memory when it needs to grow the storage for the vectors you create or add into. By default (i.e. if you use std::vector<T> for some T) - that allocation is on the heap. While this could be adapted to the GPU - it would be quite slow, and incredibly slow if each "CUDA thread" would dynamically allocate its own memory.
#Now, you could say "But I don't want to allocate memory, I just want to read from the vector!" - well, in that case, you don't need a vector per se. Just copy the data to some on-device buffer, and either pass a pointer and a size, or use a CUDA-capable span, like in cuda-kat. Another option, though a bit "heavier", is to use the [NVIDIA thrust library]'s 3 "device vector" class. Under the hood, it's quite different from the standard library vector though.