I have Ada code that must use CUDA without going through the Ada binding, so I wrote an interface that lets the Ada program call C code. Now I want to compile it.
How can I tell gprbuild to compile .cu files with nvcc instead of gcc? If that's not possible, should I generate the objects with nvcc and then link them with the Ada code? How would you do it?
EDIT: Using the link given by Simon Wright, I made this gpr file:
project Cuda_Interface is
   for Languages use ("Ada", "Cuda");
   for Source_Dirs use ("src");
   for Object_Dir use "obj";
   for Exec_Dir use ".";
   for Main use ("cuda_interface.adb");
   for Create_Missing_Dirs use "True";

   package Naming is
      for Body_Suffix ("Cuda") use ".cu";
      for Spec_Suffix ("Cuda") use ".cuh";
   end Naming;

   package Compiler is
      for Driver ("Cuda") use "nvcc";
      for Leading_Required_Switches ("Cuda") use ("-c");
   end Compiler;

   package Linker is
      for Default_Switches ("Ada") use ("-L/usr/local/cuda/lib64", "-lcuda", "-lcudart", "-lm");
   end Linker;
end Cuda_Interface;
The compilation works well but the linker returns this error:
/usr/bin/ld: cuda_interface.o: in function `_ada_cuda_interface':
cuda_interface.adb:(.text+0x3a5): undefined reference to `inter_add_two'
collect2: error: ld returned 1 exit status
gprbuild: link of cuda_interface.adb failed
cuda_interface.adb:
with Ada.Text_IO; use Ada.Text_IO;

procedure Cuda_Interface is
   type Index is range 1 .. 5;
   type Element_Type is new Natural;
   type Array_Type is array (Index) of Element_Type;

   procedure Inter_Add_Two (Arr : in out Array_Type; Length : Index)
     with
       Import        => True,
       Convention    => C,
       External_Name => "inter_add_two";

   A : Array_Type := (1, 2, 3, 4, 5);
begin
   for I in Index loop
      Put_Line ("Value at "
                & Index'Image (I)
                & " is "
                & Element_Type'Image (A (I)));
   end loop;
   New_Line;

   Inter_Add_Two (A, Index'Last);

   for I in Index loop
      Put_Line ("Value at "
                & Index'Image (I)
                & " is "
                & Element_Type'Image (A (I)));
   end loop;
end Cuda_Interface;
kernel.cuh
#ifndef __KERNEL_CUH__
#define __KERNEL_CUH__

#include <cuda.h>

__global__ void kernel_add_two(unsigned int *a, unsigned int length);
void inter_add_two(unsigned int *a, unsigned int length);

#endif // __KERNEL_CUH__
kernel.cu
#include "kernel.cuh"
#include <math.h>
#define THREADS_PER_BLOCK (1024)
__global__ void kernel_add_two(unsigned int *a, unsigned int length)
{
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < length) a[tid] += 2;
}
void inter_add_two(unsigned int *a, unsigned int length)
{
unsigned int block_number = ceil(((float)length) / THREADS_PER_BLOCK);
unsigned int *d_a;
cudaMalloc((void**)&d_a, sizeof(unsigned int) * length);
cudaMemcpy(d_a, a, sizeof(unsigned int) * length, cudaMemcpyHostToDevice);
kernel_add_two<<<block_number, THREADS_PER_BLOCK>>>(d_a, length);
cudaMemcpy(a, d_a, sizeof(unsigned int) * length, cudaMemcpyDeviceToHost);
cudaFree(d_a);
}
Thanks to the comments, I successfully compiled and ran an Ada program calling C code which in turn calls CUDA code. The key change was wrapping the declarations in extern "C", so that nvcc emits unmangled symbol names matching the External_Name used on the Ada side. These are the files I edited:
kernel.cuh
#ifndef __KERNEL_CUH__
#define __KERNEL_CUH__

#include <cuda.h>

void *__gxx_personality_v0;

extern "C"
{
    __global__ void kernel_add_two(unsigned int *a, unsigned int length);
    void inter_add_two(unsigned int *a, unsigned int length);
}

#endif // __KERNEL_CUH__
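The extern "C" block is the important change here: without it, nvcc compiles the header as C++ and mangles the name of inter_add_two, so the symbol requested by the Ada External_Name never appears in the object file. A quick way to confirm the exported name (assuming binutils is installed; the object file lands in obj/ per the project file) is:
$ nm obj/kernel.o | grep inter_add_two
which should now list an unmangled inter_add_two entry.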
cuda_interface.gpr
project Cuda_Interface is
   for Languages use ("Ada", "Cuda");
   for Source_Dirs use ("src");
   for Object_Dir use "obj";
   for Exec_Dir use ".";
   for Main use ("cuda_interface.adb");
   for Create_Missing_Dirs use "True";

   package Naming is
      for Body_Suffix ("Cuda") use ".cu";
      for Spec_Suffix ("Cuda") use ".cuh";
   end Naming;

   package Compiler is
      for Driver ("Cuda") use "nvcc";
      for Leading_Required_Switches ("Cuda") use ("-c");
   end Compiler;

   package Linker is
      for Default_Switches ("Ada") use ("-L/usr/local/cuda/lib64", "-lcuda", "-lcudart", "-lcudadevrt", "-lm");
   end Linker;
end Cuda_Interface;
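With this project file in place, the build is driven as usual (assuming gprbuild and nvcc are both on the PATH):
$ gprbuild -P cuda_interface.gpr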
Related
I need my kernel to communicate with the host. I tried to use a global counter (better approaches are welcome), but the following code always prints 0. What am I doing wrong? (I tried both the commented and uncommented variants.)
#include <stdio.h>
#include <cuda_runtime.h>

//__device__ int count[1] = {0};
__device__ int count = 0;

__global__ void inc() {
    //count[0]++;
    atomicAdd(&count, 1);
}

int main(void) {
    inc<<<1,10>>>();
    cudaDeviceSynchronize();
    //int *c;
    int c;
    cudaMemcpyFromSymbol(&c, count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", c);
    return 0;
}
Any time you are having trouble with CUDA code, I strongly encourage you to use proper CUDA error checking and run your code with cuda-memcheck before asking others for help. Even if you don't understand the error output, providing it in your question will be useful for those trying to help you.
If you had done so, you would have received a report that cudaMemcpyFromSymbol is throwing an invalid argument error.
If you study the documentation for that function, you will see that the 4th parameter is not the direction parameter but the offset parameter. So passing cudaMemcpyDeviceToHost as the offset is incorrect. Since cudaMemcpyFromSymbol is always a device-to-host transfer, the direction argument is redundant, and since it has a default value, it can simply be omitted. Your code works correctly for me once that argument is eliminated:
$ cat t1414.cu
#include <stdio.h>
#include <cuda_runtime.h>

//__device__ int count[1] = {0};
__device__ int count = 0;

__global__ void inc() {
    //count[0]++;
    atomicAdd(&count, 1);
}

int main(void) {
    inc<<<1,10>>>();
    cudaDeviceSynchronize();
    //int *c;
    int c;
    cudaMemcpyFromSymbol(&c, count, sizeof(int));
    printf("%d\n", c);
    return 0;
}
$ nvcc -o t1414 t1414.cu
$ cuda-memcheck ./t1414
========= CUDA-MEMCHECK
10
========= ERROR SUMMARY: 0 errors
$
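For reference, the "proper CUDA error checking" mentioned above usually amounts to a small helper like the following (a minimal sketch; the macro name cudaCheck is my own choice, not part of the CUDA API):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with file/line context on any CUDA runtime error.
#define cudaCheck(call)                                                \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Runtime calls are wrapped directly, e.g.
//   cudaCheck(cudaMemcpyFromSymbol(&c, count, sizeof(int)));
// Kernel launches return nothing, so check them after the fact:
//   inc<<<1,10>>>();
//   cudaCheck(cudaGetLastError());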
#include <algorithm>
#include <vector>

template <typename Dtype>
__global__ void R_D_CUT(const int n, Dtype* r, Dtype* d,
                        Dtype cur_r_max, Dtype cur_r_min, Dtype cur_d_max, Dtype cur_d_min) {
    CUDA_KERNEL_LOOP(index, n) {
        r[index] = __min(cur_r_max, __max(r[index], cur_r_min));
        d[index] = __min(cur_d_max, __max(d[index], cur_d_min));
    }
}
The code above works fine on Windows. However, it does not build on Ubuntu because of the __min and __max functions (MSVC-specific macros). I tried to fix it by replacing __min with std::min<Dtype> and __max with std::max<Dtype>:
template <typename Dtype>
__global__ void R_D_CUT(const int n, Dtype* r, Dtype* d,
                        Dtype cur_r_max, Dtype cur_r_min, Dtype cur_d_max, Dtype cur_d_min) {
    CUDA_KERNEL_LOOP(index, n) {
        r[index] = std::min<Dtype>(cur_r_max, std::max<Dtype>(r[index], cur_r_min));
        d[index] = std::min<Dtype>(cur_d_max, std::max<Dtype>(d[index], cur_d_min));
    }
}
However, when I recompile, I get these errors:
_layer.cu(7): error: calling a __host__ function("std::min<float> ") from a __global__ function("caffe::R_D_CUT<float> ") is not allowed
_layer.cu(7): error: calling a __host__ function("std::max<float> ") from a __global__ function("caffe::R_D_CUT<float> ") is not allowed
_layer_layer.cu(8): error: calling a __host__ function("std::min<float> ") from a __global__ function("caffe::R_D_CUT<float> ") is not allowed
_layer_layer.cu(8): error: calling a __host__ function("std::max<float> ") from a __global__ function("caffe::R_D_CUT<float> ") is not allowed
_layer_layer.cu(7): error: calling a __host__ function("std::min<double> ") from a __global__ function("caffe::R_D_CUT<double> ") is not allowed
_layer_layer.cu(7): error: calling a __host__ function("std::max<double> ") from a __global__ function("caffe::R_D_CUT<double> ") is not allowed
_layer_layer.cu(8): error: calling a __host__ function("std::min<double> ") from a __global__ function("caffe::R_D_CUT<double> ") is not allowed
_layer_layer.cu(8): error: calling a __host__ function("std::max<double> ") from a __global__ function("caffe::R_D_CUT<double> ") is not allowed
Could you help me to fix it? Thanks
Generally speaking, functionality associated with std:: is not available in CUDA device code (__global__ or __device__ functions).
Instead, for many math functions, NVIDIA provides a CUDA math library.
For this case, as #njuffa points out, CUDA provides templated/overloaded versions of min and max. So you should just be able to use min() or max() in device code, assuming the type usage corresponds to one of the available templated/overloaded types. Also, you should:
#include <math.h>
Here is a simple worked example showing usage of min() for both float and double type:
$ cat t381.cu
#include <math.h>
#include <stdio.h>

template <typename T>
__global__ void mymin(T d1, T d2){
    printf("min is :%f\n", min(d1,d2));
}

int main(){
    mymin<<<1,1>>>(1.0, 2.0);
    mymin<<<1,1>>>(3.0f, 4.0f);
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_52 -o t381 t381.cu
$ ./t381
min is :1.000000
min is :3.000000
$
Note that the available overloaded options even include some integer types.
Adding to #RobertCrovella's answer: If you want something which behaves more like std::max, you can use this templated wrapper over CUDA's math library:
#define __df__ __device__ __forceinline__
template <typename T> __df__ T maximum(T x, T y);
template <> __df__ int maximum<int >(int x, int y) { return max(x,y); }
template <> __df__ unsigned int maximum<unsigned >(unsigned int x, unsigned int y) { return umax(x,y); }
template <> __df__ long maximum<long >(long x, long y) { return llmax(x,y); }
template <> __df__ unsigned long maximum<unsigned long >(unsigned long x, unsigned long y) { return ullmax(x,y); }
template <> __df__ long long maximum<long long >(long long x, long long y) { return llmax(x,y); }
template <> __df__ unsigned long long maximum<unsigned long long>(unsigned long long x, unsigned long long y) { return ullmax(x,y); }
template <> __df__ float maximum<float >(float x, float y) { return fmaxf(x,y); }
template <> __df__ double maximum<double >(double x, double y) { return fmax(x,y); }
#undef __df__
(see here for a more complete set of these wrappers.)
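Usage on the device side then mirrors std::max on the host. For illustration, a hypothetical kernel (my own example, assuming the wrapper above is in scope):

// Raise every element of data to at least the floor value lo,
// calling maximum<T> where host code would call std::max.
template <typename T>
__global__ void clamp_below(T *data, int n, T lo) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = maximum<T>(data[i], lo);
}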
I made a DLL in Visual C++ to compute the modulus of an array of complex numbers in CUDA. The array is of type cufftComplex. I then called the DLL from LabVIEW to check the accuracy of the result, and I'm receiving an incorrect result. Could anyone tell me what is wrong with the following code, please? I think there may be something wrong with my kernel function (the way I am retrieving the cufftComplex data may be incorrect).
#include <math.h>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

extern "C" __declspec(dllexport) void Modulus(cufftComplex *digits, float *result);

__global__ void ModulusComputation(cufftComplex *a, int N, float *temp)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
    {
        temp[idx] = sqrt((a[idx].x * a[idx].x) + (a[idx].y * a[idx].y));
    }
}

void Modulus(cufftComplex *digits, float *result)
{
    #define N 1024
    cufftComplex *d_data;
    float *temp;
    size_t size = sizeof(cufftComplex)*N;

    cudaMalloc((void**)&d_data, size);
    cudaMalloc((void**)&temp, sizeof(float)*N);
    cudaMemcpy(d_data, digits, size, cudaMemcpyHostToDevice);

    int blockSize = 16;
    int nBlocks = N/blockSize;
    if (N % blockSize != 0)
        nBlocks++;

    ModulusComputation <<< nBlocks, blockSize >>> (d_data, N, temp);

    cudaMemcpy(result, temp, size, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(temp);
}
In the final cudaMemcpy in your code, you have:
cudaMemcpy(result, temp, size, cudaMemcpyDeviceToHost);
It should be:
cudaMemcpy(result, temp, sizeof(float)*N, cudaMemcpyDeviceToHost);
If you had included error checking for your cuda calls, you would have seen this cuda call (as originally written) throw an error.
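For instance, a checked version of that final copy reports the problem immediately (a sketch using the variable names from the question; fprintf needs <cstdio>):

// Correct size, plus a status check that would have flagged the original call.
cudaError_t err = cudaMemcpy(result, temp, sizeof(float) * N, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));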
There are other comments that could be made. For example, your block size (16) should be an integral multiple of 32. But this does not prevent proper operation.
After the kernel call, when copying back the result, you are using size as the memory size. The third argument of cudaMemcpy should be N * sizeof(float).
I have the following code which I am trying to compile using nvcc.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>

int main(void)
{
    size_t n = 100;
    size_t i;
    int *hostData;
    unsigned int *devData;

    hostData = (int *)calloc(n, sizeof(int));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
    curandSetPseudoRandomGeneratorSeed(gen, 12345);

    cudaMalloc((void **)&devData, n * sizeof(int));
    curandGenerate(gen, devData, n);
    cudaMemcpy(hostData, devData, n * sizeof(int), cudaMemcpyDeviceToHost);

    for(i = 0; i < n; i++)
    {
        printf("%d ", hostData[i]);
    }
    printf("\n");

    curandDestroyGenerator(gen);
    cudaFree(devData);
    free(hostData);
    return 0;
}
This is the output I receive:
$ nvcc -o RNG7 RNG7.cu
/tmp/tmpxft_00005da4_00000000-13_RNG7.o: In function `main':
tmpxft_00005da4_00000000-1_RNG7.cudafe1.cpp:(.text+0x6c): undefined reference to `curandCreateGenerator'
tmpxft_00005da4_00000000-1_RNG7.cudafe1.cpp:(.text+0x7a): undefined reference to `curandSetPseudoRandomGeneratorSeed'
tmpxft_00005da4_00000000-1_RNG7.cudafe1.cpp:(.text+0xa0): undefined reference to `curandGenerate'
tmpxft_00005da4_00000000-1_RNG7.cudafe1.cpp:(.text+0x107): undefined reference to `curandDestroyGenerator'
collect2: ld returned 1 exit status
My initial guess is that for some reason the CURAND Library is not properly installed or that it cannot find the curand.h header file.
Please let me know what I should look for or how to solve my problem.
Thanks!
#Wilo Maldonado: just use the linker flag -lcurand, and
additionally -L/path/to/cuda/libs if you do not have it already.
The problem is not the header file, otherwise you would have got a compile error. You have a linker error. You will need to tell your linker where to find the object or library file that contains those functions.
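For example, with a default CUDA install under /usr/local/cuda (adjust the library path to your system):
$ nvcc -o RNG7 RNG7.cu -lcurand -L/usr/local/cuda/lib64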
I am trying to compile this code using the CUDA Compiler:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>

int main(void)
{
    size_t n = 100;
    size_t i;
    int *hostData;
    unsigned int *devData;

    hostData = (int *)calloc(n, sizeof(int));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 12345);

    cudaMalloc((void **)&devData, n * sizeof(int));
    curandGenerate(gen, devData, n);
    cudaMemcpy(hostData, devData, n * sizeof(int), cudaMemcpyDeviceToHost);

    for(i = 0; i < n; i++)
    {
        printf("%d ", hostData[i]);
    }
    printf("\n");

    curandDestroyGenerator(gen);
    cudaFree(devData);
    free(hostData);
    return 0;
}
By using this command:
nvcc -o RNG RNG7.cu
This is the output I receive:
[root@client2 CUDA]$ nvcc -o RNG7 RNG7.cu
/tmp/tmpxft_00001ed1_00000000-13_RNG7.o: In function `main':
tmpxft_00001ed1_00000000-1_RNG7.cudafe1.cpp:(.text+0x6c): undefined reference to `curandCreateGenerator'
tmpxft_00001ed1_00000000-1_RNG7.cudafe1.cpp:(.text+0x7a): undefined reference to `curandSetPseudoRandomGeneratorSeed'
tmpxft_00001ed1_00000000-1_RNG7.cudafe1.cpp:(.text+0xa0): undefined reference to `curandGenerate'
tmpxft_00001ed1_00000000-1_RNG7.cudafe1.cpp:(.text+0x107): undefined reference to `curandDestroyGenerator'
collect2: ld returned 1 exit status
In another discussion it was stated that this could be a linker problem, and that I need to manually link the library in the compiler command so the functions used in my code are resolved.
I have no idea how to achieve this; can someone please help?
Thanks!
Use the following options:
nvcc -o RNG7 RNG7.cu -lcurand -Xlinker=-rpath,/usr/local/cuda/lib
It will work like a charm: -lcurand resolves the undefined references at link time, and the -rpath linker option embeds the CUDA library directory so libcurand is also found at run time.