I've ported a CUDA project from Linux to Windows (basically just added a few defines and typedefs in the header file). I'm using Visual Studio 2008 and the CUDA runtime API custom build rules from the SDK. The code is C, not C++ (and I'm compiling with /TC, not /TP).
I'm having scope issues that I didn't have on Linux: global variables in my header file aren't shared between the .c files and .cu files.
I've created a simplified project, and here is all of the code:
main.h:
#ifndef MAIN_H
#define MAIN_H
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
cudaEvent_t cudaEventStart;
#if defined __cplusplus
extern "C" void func(void);
#else
extern void func(void);
#endif
#endif
main.c:
#include "main.h"
int main(void)
{
int iDevice = 0;
cudaSetDevice(iDevice);
cudaFree(0);
cudaGetDevice(&iDevice);
printf("device: %d\n", iDevice);
cudaEventCreate(&cudaEventStart);
printf("create event: %d\n", (int) cudaEventStart);
func();
cudaEventDestroy(cudaEventStart);
printf("destroy event: %d\n", (int) cudaEventStart);
return cudaThreadExit();
}
kernel.cu:
#include "main.h"
void func()
{
printf("event in cu: %d\n", (int) cudaEventStart);
}
output:
device: 0
create event: 44199920
event in cu: 0
destroy event: 44199920
Any ideas about what I am doing wrong here? How do I need to change my setup so that it works in Visual Studio? Ideally, I'd like a setup that works on multiple platforms.
CUDA 3.2, GTX 480, 64-bit Win7, 263.06 general
What you are trying to do would not work even without CUDA -- try renaming kernel.cu to kernel.c and recompiling. You will get a linker error because cudaEventStart will be multiply defined: once in each compilation unit (.c file) that includes the header. The fix in plain C is to declare the variable extern in the header and define it in only one compilation unit.
It compiles under CUDA because CUDA does not have a linker, and therefore code in compilation units compiled by nvcc (.cu files) cannot reference symbols in other compilation units. (CUDA also doesn't currently support static global variables.) A future version of CUDA will have a linker, but the current one does not.
What is happening is that each compilation unit gets its own, non-conflicting instance of cudaEventStart.
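For reference, the standard C pattern mentioned above looks like this (a minimal sketch of the two files):
/* main.h -- declaration only; no storage is allocated in the header */
extern cudaEvent_t cudaEventStart;
/* main.c -- exactly one definition in the whole program */
cudaEvent_t cudaEventStart;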
What you can do instead is get rid of the global variable (make it a local variable in main()), add a cudaEvent_t parameter to the functions that need to use the event, and then pass the event around.
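A minimal sketch of that refactoring (the changed signature is my assumption, not code from the post):
main.h (sketch; same includes as before):
#if defined __cplusplus
extern "C" void func(cudaEvent_t ev);
#else
extern void func(cudaEvent_t ev);
#endif
main.c (sketch):
#include "main.h"
int main(void)
{
    cudaEvent_t cudaEventStart;       /* local to main() now, not global */
    cudaEventCreate(&cudaEventStart);
    func(cudaEventStart);             /* hand the event to the .cu code explicitly */
    cudaEventDestroy(cudaEventStart);
    return cudaThreadExit();
}
kernel.cu (sketch):
#include "main.h"
extern "C" void func(cudaEvent_t ev)
{
    printf("event in cu: %d\n", (int) ev);
}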
BTW, in your second post, you have circular #includes...
I modified my simplified example (with success) by including the .cu file in the header and removing the forward declaration of the .cu file's function.
main.h:
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include "kernel.cu"
cudaEvent_t cudaEventStart;
main.c:
#include "main.h"
int main(void)
{
int iDevice = 0;
cudaSetDevice(iDevice);
cudaFree(0);
cudaGetDevice(&iDevice);
printf("device: %d\n", iDevice);
cudaEventCreate(&cudaEventStart);
printf("create event: %d\n", (int) cudaEventStart);
func();
cudaEventDestroy(cudaEventStart);
printf("destroy event: %d\n", (int) cudaEventStart);
return cudaThreadExit();
}
kernel.cu:
#ifndef KERNEL_CU
#define KERNEL_CU
#include "main.h"
void func(void);
void func()
{
printf("event in cu: %d\n", (int) cudaEventStart);
}
#endif
output:
device: 0
create event: 42784024
event in cu: 42784024
destroy event: 42784024
About to see if it works in my real project, and whether the solution is portable back to Linux.
Related
I'm copying data to a specific CUDA symbol. My CUDA version is 10.1 and the GPU is a Tesla K80. I compiled the code with VS2017, generating code for compute_35 & sm_35. When I wrote my code like this,
<.h>
#include <cuda_runtime.h>
__device__ __constant__ float scoreRatio;
<.cpp>
const float ScoreRatio;
cudaErr=cudaMemcpyToSymbol(&scoreRatio,&ScoreRatio,sizeof(ScoreRatio));
printf("%d: %s.\n",cudaErr,cudaGetErorString(cudaErr));
it compiled well but I got cudaErrorInvalidSymbol when I ran the program:
13: Invalid device symbol
If I modified my code like this,
<.h>
#include <cuda_runtime.h>
__device__ __constant__ float scoreRatio;
<.cpp>
const float ScoreRatio;
cudaErr=cudaMemcpyToSymbol(scoreRatio,&ScoreRatio,sizeof(ScoreRatio));
then the compile would fail due to an incompatible parameter type, since the first argument is a float while the function expects a const void*. Here is the function declaration I found in cuda_runtime_api.h:
extern __host__ cudaError_t CUDARTAPI cudaMemcpyToSymbol(const void *symbol, const void *src, size_t count, size_t offset __dv(0), enum cudaMemcpyKind kind __dv(cudaMemcpyHostToDevice));
Could anyone please give some advice? Much appreciated.
This:
<.h>
#include <cuda_runtime.h>
__device__ __constant__ float scoreRatio;
<.cpp>
const float ScoreRatio;
cudaErr=cudaMemcpyToSymbol(&scoreRatio,&ScoreRatio,sizeof(ScoreRatio));
printf("%d: %s.\n",cudaErr,cudaGetErorString(cudaErr));
is illegal/wrong in two ways. You must use nvcc to compile the code using a device code aware trajectory, and the first argument of the cudaMemcpyToSymbol call is incorrect. If you simply rename your .cpp source file to have a .cu file extension and change the contents so that it looks like this:
<.cu>
#include <.h>
....
const float ScoreRatio;
cudaErr=cudaMemcpyToSymbol(scoreRatio, &ScoreRatio, sizeof(ScoreRatio));
printf("%d: %s.\n", cudaErr, cudaGetErorString(cudaErr));
it will both compile and run correctly. See here for an explanation of why it is necessary to change the first argument of the cudaMemcpyToSymbol call.
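For context, here is a minimal self-contained sketch of the corrected pattern (the file name and initializer value are placeholders):
// symbol.cu -- build with nvcc so the symbol is registered with the runtime
#include <cstdio>
#include <cuda_runtime.h>
__device__ __constant__ float scoreRatio;
int main()
{
    const float ScoreRatio = 0.5f;  /* placeholder host value */
    /* pass the symbol itself, not its host address; nvcc's C++ overload accepts the variable */
    cudaError_t cudaErr = cudaMemcpyToSymbol(scoreRatio, &ScoreRatio, sizeof(ScoreRatio));
    printf("%d: %s.\n", (int)cudaErr, cudaGetErrorString(cudaErr));
    return 0;
}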
I'm relatively new to CUDA programming and can't find a solution to my problem.
I'm trying to have a shared library, let's call it func.so, that defines a device function
__device__ void hello(){ printf("hello"); }
I then want to be able to access that library via dlopen, and use that function in my program. I tried something along the following lines:
func.cu
#include <stdio.h>
typedef void(*pFCN)();
__device__ void dhello(){
printf("hello\n")
}
__device__ pFCN ptest = dhello;
pFCN h_pFCN;
extern "C" pFCN getpointer(){
cudaMemcpyFromSymbol(&h_pFCN, ptest, sizeof(pFCN));
return h_pFCN;
}
main.cu
#include <dlfcn.h>
#include <stdio.h>
typedef void (*fcn)();
typedef fcn (*retpt)();
retpt hfcnpt;
fcn hfcn;
__device__ fcn dfcn;
__global__ void foo(){
(*dfcn)();
}
int main() {
void * m_handle = dlopen("gputest.so", RTLD_NOW);
hfcnpt = (retpt) dlsym( m_handle, "getpointer");
hfcn = (*hfcnpt)();
cudaMemcpyToSymbol(dfcn, &hfcn, sizeof(fcn), 0, cudaMemcpyHostToDevice);
foo<<<1,1>>>();
cudaThreadSynchronize();
return 0;
}
But this way I get the following error when debugging with cuda-gdb:
CUDA Exception: Warp Illegal Instruction
Program received signal CUDA_EXCEPTION_4, Warp Illegal Instruction.
0x0000000000806b30 in dtest () at func.cu:5
I appreciate any help you all can give me! :)
Calling a __device__ function in one compilation unit from device code in another compilation unit requires nvcc's separate compilation with device linking.
However, such usage with libraries only works with static libraries.
Therefore, if the target __device__ function is in the .so library and the calling code is outside of it, your approach cannot work with the current nvcc toolchain.
The only "workarounds" I can suggest would be to put the desired target function in a static library, or else put both caller and target inside the same .so library. There are a number of questions/answers on the cuda tag which give examples of these alternate approaches.
Thrust automatically selects the GPU backend when I provide an algorithm with iterators from thrust::device_vector, since the vector's data lives on the GPU. However, when I only provide thrust::counting_iterator parameters to an algorithm, how can I select which backend it executes on?
In the following invocation of thrust::find, there are no device_vector iterator arguments, so how does Thrust choose which backend (CPU, OMP, TBB, CUDA) to use?
How can I control on which backend this algorithm executes without using thrust::device_vector<> in this code?
thrust::counting_iterator<uint64_t> first(i);
thrust::counting_iterator<uint64_t> last = first + step_size;
auto iter = thrust::find(
thrust::make_transform_iterator(first, functor),
thrust::make_transform_iterator(last, functor),
true);
UPDATE 23.01.14 (MSVS 2012, CUDA 5.5, Thrust 1.7): compiles successfully!
#include <iostream>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/find.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
struct is_odd : public thrust::unary_function<uint64_t, bool> {
__host__ __device__ bool operator()(uint64_t const& x) {
return x & 1;
}
};
int main() {
thrust::counting_iterator<uint64_t> first(0);
thrust::counting_iterator<uint64_t> last = first + 100;
auto iter = thrust::find(thrust::device,
thrust::make_transform_iterator(first, is_odd()),
thrust::make_transform_iterator(last, is_odd()),
true);
int bbb; std::cin >> bbb;
return 0;
}
Sometimes where a Thrust algorithm executes can be ambiguous, as in your counting_iterator example, because its associated "backend system" is thrust::any_system_tag (a counting_iterator can be dereferenced anywhere because it is not backed by data). In situations like this, Thrust will use the device backend. By default, this will be CUDA. However, you can explicitly control how execution happens in a couple of ways.
You can either explicitly specify the system through the template parameter as in ngimel's answer, or you can provide the thrust::device execution policy as the first argument to thrust::find in your example:
#include <thrust/execution_policy.h>
...
thrust::counting_iterator<uint64_t> first(i);
thrust::counting_iterator<uint64_t> last = first + step_size;
auto iter = thrust::find(thrust::device,
thrust::make_transform_iterator(first, functor),
thrust::make_transform_iterator(last, functor),
true);
This technique requires Thrust 1.7 or better.
You have to specify the System template parameter when instantiating the counting_iterator:
typedef thrust::device_system_tag System;
thrust::counting_iterator<uint64_t,System> first(i);
If you are using the current version of Thrust, please follow the approach Jared Hoberock mentioned. But if you might be using older versions (the system you work on might have an old version of CUDA), then the example below might help.
#include <thrust/version.h>
#if THRUST_MINOR_VERSION > 6
#include <thrust/execution_policy.h>
#elif THRUST_MINOR_VERSION == 6
#include <thrust/iterator/retag.h>
#else
#endif
...
#if THRUST_MINOR_VERSION > 6
total =
thrust::transform_reduce(
thrust::host
, thrust::counting_iterator<unsigned int>(0)
, thrust::counting_iterator<unsigned int>(N)
, AFunctor(), 0, thrust::plus<unsigned int>());
#elif THRUST_MINOR_VERSION == 6
total =
thrust::transform_reduce(
thrust::retag<thrust::host_system_tag>(thrust::counting_iterator<unsigned int>(0))
, thrust::retag<thrust::host_system_tag>(thrust::counting_iterator<unsigned int>(N))
, AFunctor(), 0, thrust::plus<unsigned int>());
#else
total =
thrust::transform_reduce(
thrust::counting_iterator<unsigned int, thrust::host_space_tag>(0)
, thrust::counting_iterator<unsigned int, thrust::host_space_tag>(N)
, AFunctor(), 0, thrust::plus<unsigned int>());
#endif
See also: Thrust: How to directly control where an algorithm invocation executes?
I have a CUDA application where I am trying to use constant memory. When I write the kernel in the same file as the main function, the data in constant memory is recognized inside the kernel. But if I declare the kernel function in some other file, the constant memory reads as 0 and the operation does not work properly. I am providing simple dummy code which should explain the problem more easily. The program has a 48x48 matrix divided into 16x16 blocks, filled with random numbers from 1 to 50. Inside the kernel I add the numbers stored in constant memory to each row in a block. The code is given below:
Header File:
#include <windows.h>
#include <dos.h>
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cutil.h>
#include <curand.h>
#include <curand_kernel.h>
__constant__ int test_cons[16];
__global__ void test_kernel_1(int *,int *);
Main Program :
int main(int argc,char *argv[])
{ int *mat,*dev_mat,*res,*dev_res;
int i,j;
int test[16 ] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
cudaMemcpyToSymbol(test_cons,test,16*sizeof(int));
mat = (int *)malloc(48*48*sizeof(int));
res = (int *)malloc(48*48*sizeof(int));
memset(res,0,48*48*sizeof(int));
srand(time(NULL));
for(i=0;i<48;i++)
{ for(j=0;j<48;j++)
{ mat[i*48+j] = rand()%(50-1)+1;
printf("%d\t",mat[i*48+j] );
}
printf("\n");
}
cudaMalloc((void **)&dev_mat,48*48*sizeof(int));
cudaMemcpy(dev_mat,mat,48*48*sizeof(int),cudaMemcpyHostToDevice);
cudaMalloc((void **)&dev_res,48*48*sizeof(int));
dim3 gridDim(48/16,48/16,1);
dim3 blockDim(16,16,1);
test_kernel_1<<< gridDim,blockDim>>>(dev_mat,dev_res);
cudaMemcpy(res,dev_res,48*48*sizeof(int),cudaMemcpyDeviceToHost);
printf("\n\n\n\n");
for(i=0;i<48;i++)
{ for(j=0;j<48;j++)
{ printf("%d\t",res[i*48+j] );
}
printf("\n");
}
cudaFree(dev_mat);
cudaFree(dev_res);
free(mat);
free(res);
exit(0);
}
Kernel Function :
__global__ void test_kernel_1(int *dev_mat,int* dev_res)
{
int row = blockIdx.y*blockDim.y+threadIdx.y;
int col = blockIdx.x*blockDim.x +threadIdx.x;
dev_res[row*48+col] = dev_mat[row*48+col] + test_cons[threadIdx.x];
}
Now, when I declare the kernel function inside the main program file along with the main program, the constant memory values are correct; but if the kernel is in a different file, the test_cons[threadIdx.x] values become 0.
I came across this link, which discusses more or less the same problem, but I don't fully understand it. It would be very helpful if someone could tell me why this is happening and what I need to do to avoid it. Any sort of help would be highly appreciated. Thanks.
I just recently answered a similar question here
CUDA can handle code that references device code (entry points) or symbols in other files, but it requires separate compilation with device linking (as described and linked in the answer above). (And separate compilation/linking requires compute capability 2.0 or greater.)
So if you modify the link steps you can have your __constant__ variable in a given file, and reference it from a different file.
If not (if you don't specify separate compilation and device linking), then the device code that references the __constant__ variable, the host code that references the __constant__ variable, and the definition/declaration of the variable itself, all need to be in the same file.
So this:
__constant__ int test_cons[16];
This:
cudaMemcpyToSymbol(test_cons,test,16*sizeof(int));
And this:
dev_res[row*48+col] = dev_mat[row*48+col] + test_cons[threadIdx.x];
all need to be in the same file.
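(For completeness: with separate compilation and device linking enabled, e.g. nvcc -rdc=true, a split like the following sketch becomes possible. This is my illustration, not part of the question's original build.)
// constants.cu -- owns the definition
__constant__ int test_cons[16];
// kernel.cu -- references the symbol from another compilation unit
extern __constant__ int test_cons[16];
__global__ void test_kernel_1(int *dev_mat, int *dev_res)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    dev_res[row*48+col] = dev_mat[row*48+col] + test_cons[threadIdx.x];
}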
The above answer is totally acceptable; I am adding this since the user was not able to get it working. You can accept the above answer, this is just for your reference.
Kernel.cu file:
#include <stdio.h>
__constant__ int test_cons[16];
void copymemory (int *test)
{
cudaMemcpyToSymbol(test_cons,test,16*sizeof(int));
}
__global__ void test_kernel_1(int *dev_mat,int* dev_res)
{
int row = blockIdx.y*blockDim.y+threadIdx.y;
int col = blockIdx.x*blockDim.x +threadIdx.x;
if (threadIdx.x ==0)
{
printf ("testcons[0] is %d\n", test_cons[threadIdx.x]) ;
}
dev_res[row*48+col] = dev_mat[row*48+col] + test_cons[threadIdx.x];
}
simple.cu file
#include <stdio.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <curand.h>
#include <curand_kernel.h>
void copymemory (int *temp) ;
__global__ void test_kernel_1(int *,int *);
int main(int argc,char *argv[])
{
int *mat,*dev_mat,*res,*dev_res;
int i,j;
int test[16 ] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
mat = (int *)malloc(48*48*sizeof(int));
res = (int *)malloc(48*48*sizeof(int));
memset(res,0,48*48*sizeof(int));
copymemory (test) ;
srand(time(NULL));
for(i=0;i<48;i++)
{
for(j=0;j<48;j++)
{
mat[i*48+j] = rand()%(50-1)+1;
//printf("%d\t",mat[i*48+j] );
}
//printf("\n");
}
cudaMalloc((void **)&dev_mat,48*48*sizeof(int));
cudaMemcpy(dev_mat,mat,48*48*sizeof(int),cudaMemcpyHostToDevice);
cudaMalloc((void **)&dev_res,48*48*sizeof(int));
dim3 gridDim(48/16,48/16,1);
dim3 blockDim(16,16,1);
test_kernel_1<<< gridDim,blockDim>>>(dev_mat,dev_res);
cudaMemcpy(res,dev_res,48*48*sizeof(int),cudaMemcpyDeviceToHost);
for(i=0;i<48;i++)
{
for(j=0;j<48;j++)
{
// printf("%d\t",res[i*48+j] );
}
//printf("\n");
}
cudaFree(dev_mat);
cudaFree(dev_res);
free(mat);
free(res);
exit(0);
}
I have commented out your printf calls. The printf in the kernel prints the value 1. I also tested by changing the value of test[0] in the main function, and it works perfectly.
I want to use CUDA 5.0 linking to write re-usable CUDA objects. I've set up this simple test, but my kernel fails silently (it runs without error or exception and outputs junk).
My simple test (below) allocates an array of integers to CUDA device memory. The CUDA kernel should populate the array with sequential entries (0,1,2,....,9). The device array is copied to CPU memory and output to the console.
Currently, this code outputs "0,0,0,0,0,0,0,0,0," instead of the desired "0,1,2,3,4,5,6,7,8,9,". It is compiled using VS2010 and CUDA 5.0 (with compute_35 and sm_35 set). Running on Win7-64-bit with a GeForce 580.
In Test.h:
class Test
{
public:
Test();
~Test();
void Run();
private:
int* cuArray;
};
In Test.cu:
#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include "Test.h"
#define ARRAY_LEN 10
__global__ void kernel(int *p)
{
int elemID = blockIdx.x * blockDim.x + threadIdx.x;
p[elemID] = elemID;
}
Test::Test()
{
cudaMalloc(&cuArray, ARRAY_LEN * sizeof(int));
}
Test::~Test()
{
cudaFree(cuArray);
}
void Test::Run()
{
kernel<<<1,ARRAY_LEN>>>(cuArray);
// Copy the array contents to CPU-accessible memory
int cpuArray[ARRAY_LEN];
cudaMemcpy(static_cast<void*>(cpuArray), static_cast<void*>(cuArray), ARRAY_LEN * sizeof(int), cudaMemcpyDeviceToHost);
// Write the array contents to console
for (int i = 0; i < ARRAY_LEN; ++i)
printf("%d,", cpuArray[i]);
printf("\n");
}
In main.cpp:
#include <iostream>
#include "Test.h"
int main()
{
Test t;
t.Run();
}
I've experimented with the DECLs (__device__ __host__) as suggested by @harrism, but to no effect.
Can anyone suggest how to make this work? (The code works when it isn't inside a class.)
The device you are using is a GTX 580, whose compute capability is 2.0. If you compile the code for any architecture greater than 2.0, the kernel will not run on your device and the output will be garbage. Compile the code for compute capability 2.0 or lower and it will run fine.
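As an aside, a launch-error check in Test::Run() would have surfaced the failure instead of printing junk; a minimal sketch:
kernel<<<1,ARRAY_LEN>>>(cuArray);
cudaError_t err = cudaGetLastError(); // reports e.g. "invalid device function" on an architecture mismatch
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));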