Incorrect results with CUB ReduceByKey when specifying gencode

In one of my projects, I'm seeing some incorrect results when using CUB's
DeviceReduce::ReduceByKey. However, using the same inputs/outputs with thrust::reduce_by_key produces the expected results.
#include "cub/cub.cuh"
#include <vector>
#include <iostream>
#include <cuda.h>
struct AddFunctor {
__host__ __device__ __forceinline__
float operator()(const float & a, const float & b) const {
return a + b;
}
} reduction_op;
int main() {
int n = 7680;
std::vector < uint64_t > keys_h(n);
for (int i = 0; i < 4000; i++) keys_h[i] = 1;
for (int i = 4000; i < 5000; i++) keys_h[i] = 2;
for (int i = 5000; i < 7680; i++) keys_h[i] = 3;
uint64_t * keys;
cudaMalloc(&keys, sizeof(uint64_t) * n);
cudaMemcpy(keys, &keys_h[0], sizeof(uint64_t) * n, cudaMemcpyDefault);
uint64_t * unique_keys;
cudaMalloc(&unique_keys, sizeof(uint64_t) * n);
std::vector < float > values_h(n);
for (int i = 0; i < n; i++) values_h[i] = 1.0;
float * values;
cudaMalloc(&values, sizeof(float) * n);
cudaMemcpy(values, &values_h[0], sizeof(float) * n, cudaMemcpyDefault);
float * aggregates;
cudaMalloc(&aggregates, sizeof(float) * n);
int * remaining;
cudaMalloc(&remaining, sizeof(int));
size_t size = 0;
void * buffer = NULL;
cub::DeviceReduce::ReduceByKey(
buffer,
size,
keys,
unique_keys,
values,
aggregates,
remaining,
reduction_op,
n);
cudaMalloc(&buffer, sizeof(char) * size);
cub::DeviceReduce::ReduceByKey(
buffer,
size,
keys,
unique_keys,
values,
aggregates,
remaining,
reduction_op,
n);
int remaining_h;
cudaMemcpy(&remaining_h, remaining, sizeof(int), cudaMemcpyDefault);
std::vector < float > aggregates_h(remaining_h);
cudaMemcpy(&aggregates_h[0], aggregates, sizeof(float) * remaining_h, cudaMemcpyDefault);
for (int i = 0; i < remaining_h; i++) {
std::cout << i << ", " << aggregates_h[i] << std::endl;
}
cudaFree(buffer);
cudaFree(keys);
cudaFree(unique_keys);
cudaFree(values);
cudaFree(aggregates);
cudaFree(remaining);
}
When I include "-gencode arch=compute_35,code=sm_35" (for a Kepler GTX Titan), it produces the wrong results, but when I leave these flags out entirely, it works.
$ nvcc cub_test.cu
$ ./a.out
0, 4000
1, 1000
2, 2680
$ nvcc cub_test.cu -gencode arch=compute_35,code=sm_35
$ ./a.out
0, 4000
1, 1000
2, 768
I use a handful of other CUB calls without issue; only this one misbehaves. I've also tried running this code on a GTX 1080 Ti (with compute_61, sm_61) and see the same behavior.
Is the right solution to omit these compiler flags?
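For reference, here is a minimal sketch of the thrust::reduce_by_key equivalent that gives me correct results on the same device buffers (thrust::plus<float>() stands in for the AddFunctor above, and the function name reduce_with_thrust is just for illustration):

#include <thrust/reduce.h>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>
#include <cstdint>

void reduce_with_thrust(uint64_t *keys, float *values,
                        uint64_t *unique_keys, float *aggregates, int n) {
    // Wrap the raw device pointers so thrust dispatches to the device.
    thrust::device_ptr<uint64_t> keys_d(keys), unique_keys_d(unique_keys);
    thrust::device_ptr<float> values_d(values), aggregates_d(aggregates);
    // The returned pair of end iterators plays the role of "remaining"
    // (the number of unique keys) in the CUB version.
    thrust::reduce_by_key(keys_d, keys_d + n, values_d,
                          unique_keys_d, aggregates_d,
                          thrust::equal_to<uint64_t>(),
                          thrust::plus<float>());
}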
Tried on one machine with:
cuda 8.0
ubuntu 16.04
gcc 5.4.0
cub 1.6.4
Kepler GTX Titan (compute capability 3.5)
and another with:
cuda 8.0
ubuntu 16.04
gcc 5.4.0
cub 1.6.4
Pascal GTX 1080 Ti (compute capability 6.1)

Sounds like you should file a bug report at the CUB repository issues page.
Edit: I can reproduce this issue:
[joeuser@myhost:/tmp]$ nvcc -I/opt/cub -o a a.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
[joeuser@myhost:/tmp]$ ./a
0, 4000
1, 1000
2, 2680
[joeuser@myhost:/tmp]$ nvcc -I/opt/cub -o a a.cu -gencode arch=compute_30,code=sm_30
[joeuser@myhost:/tmp]$ ./a
0, 4000
1, 1000
2, 512
Relevant info:
CUDA: 8.0.61
nVIDIA driver: 375.39
Distribution: GNU/Linux Mint 18.1
Linux kernel: 4.4.0
GCC: 5.4.0-6ubuntu1~16.04.4
cub: 1.6.4
GPU: GTX 650 Ti (Compute Capability 3.0)


Cuda - nvcc - No kernel image is available for execution on the device. What is the problem?

I'm trying to use nvcc with the most simple example, but it doesn't work correctly. I'm compiling and executing the example from https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/, but my server can't execute the __global__ function. I rewrote the code to get an error message, and I receive the following message:
"no kernel image is available for execution on the device"
My GPU is a Quadro 6000 and the cuda version is 9.0.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    y[i] = 10.0; //a*x[i] + y[i];
}

int main(int argc, char *argv[])
{
    int N = 120;
    int nDevices;
    float *x, *y, *d_x, *d_y;
    cudaError_t err = cudaGetDeviceCount(&nDevices);
    if (err != cudaSuccess)
        printf("%s\n", cudaGetErrorString(err));
    else
        printf("Number of devices %d\n", nDevices);
    x = (float*)malloc(N*sizeof(float));
    y = (float*)malloc(N*sizeof(float));
    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
    // Perform SAXPY on 1M elements
    saxpy<<<1, 1>>>(N, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();
    err = cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
    printf("%s\n", cudaGetErrorString(err));
    cudaError_t errSync = cudaGetLastError();
    cudaError_t errAsync = cudaDeviceSynchronize();
    if (errSync != cudaSuccess)
        printf("Sync kernel error: %s\n", cudaGetErrorString(errSync));
    if (errAsync != cudaSuccess)
        printf("Async kernel error: %s\n", cudaGetErrorString(errAsync));
    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y);
}
Execution command
bash-4.1$ nvcc -o sapx simples_cuda.cu
bash-4.1$ ./sapx
Number of devices 1
no error
Sync kernel error: no kernel image is available for execution on the device
GPUs of compute capability less than 2.0 are only supported by CUDA toolkits of version 6.5 and older.
GPUs of compute capability less than 3.0 (but greater than or equal to 2.0) are only supported by CUDA toolkits of version 8.0 and older.
Your Quadro 6000 is a compute capability 2.0 GPU. This can be determined programmatically with the deviceQuery CUDA sample code, or via a Google search. It is not supported by CUDA 9.0.
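For example, a minimal sketch of such a programmatic check (just the compute capability part of what deviceQuery reports):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major and prop.minor are the two digits of the compute capability
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}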
You should add the compute capability of your video card as a parameter to the nvcc compiler. In my case (Windows / Visual Studio 2017) I set this in the Code Generation field. As #einpoklum answered before, add the gencode parameter like this: -gencode arch=compute_${COMPUTE_CAPABILITY},code=compute_${SM_CAPABILITY}, where ${COMPUTE_CAPABILITY} and ${SM_CAPABILITY} belong to the following pairs (you can add them all, as VS2017 does):
{COMPUTE_CAPABILITY},{SM_CAPABILITY}
compute_35,sm_35
compute_37,sm_37
compute_50,sm_50
compute_52,sm_52
compute_60,sm_60
compute_61,sm_61
compute_70,sm_70
compute_75,sm_75
compute_80,sm_80
D:\Program Files\nVidia\CUDA Samples\MySamples\IntroToCUDA_1\IntroToCUDA_1>"D:\Program Files\nVidia\GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_35,code=\"sm_35,compute_35\" -gencode=arch=compute_37,code=\"sm_37,compute_37\" -gencode=arch=compute_50,code=\"sm_50,compute_50\" -gencode=arch=compute_52,code=\"sm_52,compute_52\" -gencode=arch=compute_60,code=\"sm_60,compute_60\" -gencode=arch=compute_61,code=\"sm_61,compute_61\" -gencode=arch=compute_70,code=\"sm_70,compute_70\" -gencode=arch=compute_75,code=\"sm_75,compute_75\" -gencode=arch=compute_80,code=\"sm_80,compute_80\" --use-local-env -ccbin "D:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64" -x cu -I"D:\Program Files\nVidia\GPU Computing Toolkit\CUDA\v11.0\include" -I"D:\Program Files\nVidia\GPU Computing Toolkit\CUDA\v11.0\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -g -D_DEBUG -D_CONSOLE -D_UNICODE -DUNICODE -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64\Debug\vc141.pdb /FS /Zi /RTC1 /MDd " -o x64\Debug\IntroToCUDA_1.cu.obj "D:\Program Files\nVidia\CUDA Samples\MySamples\IntroToCUDA_1\IntroToCUDA_1\IntroToCUDA_1.cu"
You can check the CC of your video card with the deviceQuery example found in the CUDA Samples SDK.
Adding to #RobertCrovella's answer:
When compiling with nvcc, you should always set appropriate flags to generate binary kernel images for the microarchitecture / compute capability you intend to run on. For example: -gencode arch=compute_${COMPUTE_CAPABILITY},code=compute_${COMPUTE_CAPABILITY},
with, say, COMPUTE_CAPABILITY=61.
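For instance, a hypothetical invocation for this question's files with COMPUTE_CAPABILITY=61 would be:
nvcc -gencode arch=compute_61,code=compute_61 -o sapx simples_cuda.cu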
Read nvcc --help for more information on these flags (although, to be honest, it's a bit of a murky subject).

-ta=tesla:managed:cuda8 but cuMemAllocManaged returned error 2: Out of memory

I'm new to OpenACC. I like it very much so far, as I'm familiar with OpenMP.
I have two 1080 Ti cards, each with 9 GB, and 128 GB of RAM. I'm trying a very basic test: allocate an array, initialize it, then sum it up in parallel. This works for 8 GB, but when I increase to 10 GB I get an out-of-memory error. My understanding was that with the unified memory of Pascal (which these cards are) and CUDA 8, I could allocate an array larger than the GPU's memory, and the hardware would page in and page out on demand.
Here's my full C code test :
$ cat firstAcc.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 10

int main()
{
    float *a;
    size_t n = GB*1024*1024*1024/sizeof(float);
    size_t s = n * sizeof(float);
    a = (float *)malloc(s);
    if (!a) { printf("Failed to malloc.\n"); return 1; }
    printf("Initializing ... ");
    for (int i = 0; i < n; ++i) {
        a[i] = 0.1f;
    }
    printf("done\n");
    float sum=0.0;
    #pragma acc loop reduction (+:sum)
    for (int i = 0; i < n; ++i) {
        sum+=a[i];
    }
    printf("Sum is %f\n", sum);
    free(a);
    return 0;
}
As per the "Enable Unified Memory" section of this article I compile it with :
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo firstAcc.c
main:
20, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
28, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
I need to understand those messages, but for now I don't think they are relevant. Then I run it:
$ ./a.out
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted (core dumped)
This works fine if I change GB to 8. I expected 10 GB to work (despite the GPU card having 9 GB), thanks to the Pascal 1080 Ti and CUDA 8.
Have I misunderstood something, or what am I doing wrong? Thanks in advance.
$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
$ cat /usr/local/cuda-8.0/version.txt
CUDA Version 8.0.61
Besides what Bob mentioned, I made a few more fixes.
First, you're not actually generating an OpenACC compute region, since you only have a "#pragma acc loop" directive. This should be "#pragma acc parallel loop". You can see this in the compiler feedback messages, where it's only showing host code optimizations.
Second, the "i" index should be declared as a "long"; otherwise, you'll overflow the index.
Finally, you need to add "cc60" to your target accelerator options to tell the compiler to target a Pascal-based GPU.
% cat mi.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 20ULL

int main()
{
    float *a;
    size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
    size_t s = n * sizeof(float);
    printf("n = %lu, s = %lu\n", n, s);
    a = (float *)malloc(s);
    if (!a) { printf("Failed to malloc.\n"); return 1; }
    printf("Initializing ... ");
    for (int i = 0; i < n; ++i) {
        a[i] = 0.1f;
    }
    printf("done\n");
    double sum=0.0;
    #pragma acc parallel loop reduction (+:sum)
    for (long i = 0; i < n; ++i) {
        sum+=a[i];
    }
    printf("Sum is %f\n", sum);
    free(a);
    return 0;
}
% pgcc -fast -acc -ta=tesla:managed,cuda8.0,cc60 -Minfo=accel mi.c
main:
21, Accelerator kernel generated
Generating Tesla code
21, Generating reduction(+:sum)
22, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
21, Generating implicit copyin(a[:5368709120])
% ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
I believe a problem is here:
size_t n = GB*1024*1024*1024/sizeof(float);
when I compile that line of code with g++, I get a warning about integer overflow. For some reason the PGI compiler is not warning, but the same badness is occurring under the hood. After the declarations of s and n, if I add a printout like this:
size_t n = GB*1024*1024*1024/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s); // add this line
and compile with PGI 17.04, and run (on a P100, with 16GB) I get output like this:
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 4611686017890516992, s = 18446744071562067968
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted
$
so it's evident that n and s are not what you intended.
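To spell out the arithmetic (a sketch, assuming a 32-bit int, a 64-bit size_t, and the usual two's-complement wraparound; signed overflow is formally undefined behavior):

// GB*1024*1024*1024 is evaluated entirely in int arithmetic:
//   10 * 2^30 = 10737418240, which wraps to -2147483648 (INT_MIN)
// The division by sizeof(float) then converts that int to size_t:
//   (size_t)(-2147483648)    = 18446744071562067968   <- the s printed above
//   18446744071562067968 / 4 = 4611686017890516992    <- the n printed above
#include <stdio.h>
int main(void) {
    size_t n = 10*1024*1024*1024/sizeof(float); // overflows before the division
    printf("n = %zu\n", n);                     // 4611686017890516992
    return 0;
}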
We can fix this by marking all of those constants with ULL, and then things seem to work correctly for me:
$ cat m1.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 20ULL

int main()
{
    float *a;
    size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
    size_t s = n * sizeof(float);
    printf("n = %lu, s = %lu\n", n, s);
    a = (float *)malloc(s);
    if (!a) { printf("Failed to malloc.\n"); return 1; }
    printf("Initializing ... ");
    for (int i = 0; i < n; ++i) {
        a[i] = 0.1f;
    }
    printf("done\n");
    double sum=0.0;
    #pragma acc loop reduction (+:sum)
    for (int i = 0; i < n; ++i) {
        sum+=a[i];
    }
    printf("Sum is %f\n", sum);
    free(a);
    return 0;
}
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
$
Note that I've made another change above as well. I changed the sum accumulation variable from float to double. This is necessary to preserve somewhat "sensible" results when doing a very large reduction across very small quantities.
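A minimal host-side sketch of the effect: once the running float total is large enough that 0.1f is less than half an ulp of it, further additions are rounded away entirely.

#include <stdio.h>
int main(void) {
    float fsum = 0.0f;
    double dsum = 0.0;
    for (long i = 0; i < 100000000; ++i) {
        fsum += 0.1f;  // rounds to a no-op once fsum reaches about 2^21
        dsum += 0.1f;
    }
    printf("float sum:  %f\n", fsum); // plateaus around 2097152
    printf("double sum: %f\n", dsum); // ~10000000.15 (0.1f is not exactly 0.1)
    return 0;
}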
And, as #MatColgrove pointed out in his answer, I missed a few other things as well.

Multi-gpu CUDA Thrust

I have CUDA C++ code that uses Thrust and currently works properly on a single GPU. I'd now like to modify it for multi-GPU. I have a host function that includes a number of Thrust calls that sort, copy, calculate differences, etc. on device arrays. I want each GPU to run this sequence of Thrust calls on its own (independent) set of arrays at the same time. I've read that Thrust functions that return values are synchronous, but can I use OpenMP to have each host thread call a function (with Thrust calls) that runs on a separate GPU?
For example (coded in browser):
#pragma omp parallel for
for (int dev = 0; dev < Ndev; dev++) {
    cudaSetDevice(dev);
    runthrustfunctions(dev);
}

void runthrustfunctions(int dev) {
    /* lots of Thrust functions running on device arrays stored on the corresponding GPU */
    // for example, this is just a few of the lines:
    thrust::device_ptr<double> pos_ptr = thrust::device_pointer_cast(particle[dev].pos);
    thrust::device_ptr<int> list_ptr = thrust::device_pointer_cast(particle[dev].list);
    thrust::sequence(list_ptr, list_ptr + length);
    thrust::sort_by_key(pos_ptr, pos_ptr + length, list_ptr);
    thrust::device_vector<double> temp(length);
    thrust::gather(list_ptr, list_ptr + length, pos_ptr, temp.begin());
    thrust::copy(temp.begin(), temp.end(), pos_ptr);
}
I think I also need the structure particle[0] to be stored on GPU 0, particle[1] on GPU 1, etc., and my guess is that this is not possible. An option might be to use a switch with separate code for each GPU case.
I'd like to know if this is a correct approach or if there is a better way?
Thanks
Yes, you can combine thrust and OpenMP.
Here's a complete worked example with results:
$ cat t340.cu
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <time.h>
#include <sys/time.h>

#define DSIZE 200000000

using namespace std;

int main(int argc, char *argv[])
{
    timeval t1, t2;
    int num_gpus = 0;   // number of CUDA GPUs
    printf("%s Starting...\n\n", argv[0]);
    // determine the number of CUDA capable GPUs
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus < 1)
    {
        printf("no CUDA capable devices were detected\n");
        return 1;
    }
    // display CPU and GPU configuration
    printf("number of host CPUs:\t%d\n", omp_get_num_procs());
    printf("number of CUDA devices:\t%d\n", num_gpus);
    for (int i = 0; i < num_gpus; i++)
    {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, i);
        printf("   %d: %s\n", i, dprop.name);
    }
    printf("initialize data\n");
    // initialize data
    typedef thrust::device_vector<int> dvec;
    typedef dvec *p_dvec;
    std::vector<p_dvec> dvecs;
    for (unsigned int i = 0; i < num_gpus; i++) {
        cudaSetDevice(i);
        p_dvec temp = new dvec(DSIZE);
        dvecs.push_back(temp);
    }
    thrust::host_vector<int> data(DSIZE);
    thrust::generate(data.begin(), data.end(), rand);
    // copy data
    for (unsigned int i = 0; i < num_gpus; i++) {
        cudaSetDevice(i);
        thrust::copy(data.begin(), data.end(), (*(dvecs[i])).begin());
    }
    printf("start sort\n");
    gettimeofday(&t1, NULL);
    // run as many CPU threads as there are CUDA devices
    omp_set_num_threads(num_gpus); // create as many CPU threads as there are CUDA devices
    #pragma omp parallel
    {
        unsigned int cpu_thread_id = omp_get_thread_num();
        cudaSetDevice(cpu_thread_id);
        thrust::sort((*(dvecs[cpu_thread_id])).begin(), (*(dvecs[cpu_thread_id])).end());
        cudaDeviceSynchronize();
    }
    gettimeofday(&t2, NULL);
    printf("finished\n");
    unsigned long et = ((t2.tv_sec * 1000000) + t2.tv_usec) - ((t1.tv_sec * 1000000) + t1.tv_usec);
    if (cudaSuccess != cudaGetLastError())
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    printf("sort time = %fs\n", (float)et/(float)(1000000));
    // check results
    thrust::host_vector<int> result(DSIZE);
    thrust::sort(data.begin(), data.end());
    for (int i = 0; i < num_gpus; i++)
    {
        cudaSetDevice(i);
        thrust::copy((*(dvecs[i])).begin(), (*(dvecs[i])).end(), result.begin());
        for (int j = 0; j < DSIZE; j++)
            if (data[j] != result[j]) { printf("mismatch on device %d at index %d, host: %d, device: %d\n", i, j, data[j], result[j]); return 1; }
    }
    printf("Success\n");
    return 0;
}
$ nvcc -Xcompiler -fopenmp -O3 -arch=sm_20 -o t340 t340.cu -lgomp
$ CUDA_VISIBLE_DEVICES="0" ./t340
./t340 Starting...
number of host CPUs: 12
number of CUDA devices: 1
0: Tesla M2050
initialize data
start sort
finished
sort time = 0.398922s
Success
$ ./t340
./t340 Starting...
number of host CPUs: 12
number of CUDA devices: 4
0: Tesla M2050
1: Tesla M2070
2: Tesla M2050
3: Tesla M2070
initialize data
start sort
finished
sort time = 0.460058s
Success
$
We can see that when I restrict the program to using a single device, the sort operation takes about 0.4 seconds. Then when I allow it to use all 4 devices (repeating the same sort on all 4 devices), the overall operation only takes 0.46 seconds, even though we're doing 4 times as much work.
For this particular case, I happened to be using CUDA 5.0 with thrust v1.7 and gcc 4.4.6 (RHEL 6.2).

CUDA Makefile Include Error

I'm attempting to write a basic matrix multiplication program using CUDA and C. The code itself doesn't really do anything right now, but should at least compile. After some research on the issue, I've determined that the issue is failure to include CUDA header files, indicating an issue with my Makefile. I'm extremely inexperienced with CUDA (and C for that matter), so any help would be greatly appreciated.
Output on command: make matrixMult1
c99 -I. -I/usr/local/cuda/include -c matrixMult1.c -o matrixMult1.o
matrixMult1.c: In function 'main':
matrixMult1.c:77: warning: implicit declaration of function 'cudaMalloc'
matrixMult1.c:82: warning: implicit declaration of function 'cudaMemcpy'
matrixMult1.c:83: error: 'cudaMemcpyHostToDevice' undeclared (first use in this
function)
matrixMult1.c:83: error: (Each undeclared identifier is reported only once
matrixMult1.c:83: error: for each function it appears in.)
matrixMult1.c:106: warning: implicit declaration of function 'cudaFree'
make: *** [matrixMult1.o] Error 1
Makefile:
GCC = c99
CUDA_INSTALL_PATH := /usr/local/cuda
INCLUDES := -I. -I$(CUDA_INSTALL_PATH)/include
CUDA_LIBS := -L$(CUDA_INSTALL_PATH)/lib -lcudart

matrixMult1.o: matrixMult1.c
	$(GCC) $(INCLUDES) -c matrixMult1.c -o $@

matrixMult1: matrixMult1.o
	$(GCC) -o $@ matrixMult1.o $(CUDA_LIBS)
C Program:
//********************************************************************
// matrixMult1.c
//
// A basic matrix multiplication program.
//********************************************************************
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include "cuda.h"

#define WA 3
#define HA 3
#define WB 3
#define HB WA
#define WC WB
#define HC HA

void initMatrix(float * matrix, int numIndices);

//*************
// Main Program
//*************
int main(int argc, char** argv) {
    /* Set random seed */
    srand(2013);

    /* Compute memory sizes for matrices A, B, and C */
    unsigned int sizeA = WA * HA;
    unsigned int sizeB = WB * HB;
    unsigned int sizeC = WC * HC;
    unsigned int memoryA = sizeof(float) * sizeA;
    unsigned int memoryB = sizeof(float) * sizeB;
    unsigned int memoryC = sizeof(float) * sizeC;

    /* Allocate memory for matrices A, B, and C */
    float * matrixA = (float *) malloc(memoryA);
    float * matrixB = (float *) malloc(memoryB);
    float * matrixC = (float *) malloc(memoryC);

    /* Initialize matrices A and B */
    initMatrix(matrixA, sizeA);
    initMatrix(matrixB, sizeB);

    /* Print matrix A */
    printf("\nMatrix A:\n");
    for (int i = 0; i < sizeA; i++) {
        printf("%f ", matrixA[i]);
        if (((i + 1) % WA) == 0) {
            printf("\n");
        } else {
            printf(" | ");
        }
    }

    /* Print matrix B */
    printf("\nMatrix B:\n");
    for (int i = 0; i < sizeB; i++) {
        printf("%f ", matrixB[i]);
        if (((i + 1) % WA) == 0) {
            printf("\n");
        } else {
            printf(" | ");
        }
    }

    /* Allocate device memory */
    float* deviceMemA;
    float* deviceMemB;
    float* deviceMemC;
    cudaMalloc((void**) &deviceMemA, memoryA);
    cudaMalloc((void**) &deviceMemB, memoryB);
    cudaMalloc((void**) &deviceMemC, memoryC);

    /* Copy host memory to device */
    cudaMemcpy(deviceMemA, matrixA, memoryA, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceMemB, matrixB, memoryB, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceMemC, matrixC, memoryC, cudaMemcpyHostToDevice);

    /* Print matrix C */
    printf("\nMatrix C:\n");
    for (int i = 0; i < sizeC; i++) {
        printf("%f ", matrixC[i]);
        if (((i + 1) % WC) == 0) {
            printf("\n");
        } else {
            printf(" | ");
        }
    }
    printf("\n");

    /* Free up memory */
    free(matrixA);
    free(matrixB);
    free(matrixC);
    cudaFree(deviceMemA);
    cudaFree(deviceMemB);
    cudaFree(deviceMemC);
}

//--------------------------------------------------------------------
// initMatrix - Assigns a random float value to each indice of the
//              matrix.
//
// PRE:  matrix is a pointer to a block of bytes in memory; numIndices
//       is the number of indicies in the matrix being instantiated.
// POST: Each index of the matrix has been instantiated with a random
//       float value.
//--------------------------------------------------------------------
void initMatrix(float * matrix, int numIndices) {
    /*
       Loop through the block of bytes, assigning a random float
       for each index of the matrix
    */
    for (int i = 0; i < numIndices; ++i) {
        /* Assign a random float between 0 and 1 at this byte */
        matrix[i] = rand() / (float)RAND_MAX;
    }
}
CUDA programs need to be compiled with nvcc. While your program does not yet contain any CUDA kernels, I believe that is what you want to achieve.
Rename your file from matrixMult1.c to matrixMult1.cu, remove the #include "cuda.h" line (programs compiled with nvcc don't need any CUDA-specific includes), and compile with nvcc instead of gcc (e.g., by setting GCC = nvcc at the beginning of the Makefile).
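Untested, but a minimal sketch of what the Makefile reduces to once nvcc does the work (nvcc knows its own include and library paths, so the INCLUDES and CUDA_LIBS variables disappear):

NVCC = nvcc

matrixMult1: matrixMult1.cu
	$(NVCC) -o $@ matrixMult1.cu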
Two problems here:
You were not including the appropriate header into your code (which you fixed)
Your Makefile is, in fact, broken. It should look something like:
GCC = c99
CUDA_INSTALL_PATH := /usr/local/cuda
INCLUDES := -I. -I$(CUDA_INSTALL_PATH)/include
CUDA_LIBS := -L$(CUDA_INSTALL_PATH)/lib -lcudart

matrixMult1.o: matrixMult1.c
	$(GCC) $(INCLUDES) -c matrixMult1.c -o $@

matrixMult1: matrixMult1.o
	$(GCC) -o $@ matrixMult1.o $(CUDA_LIBS)
[Disclaimer: not tested, use at own risk]
The current problem is that the include path was only specified at the linkage phase of the build.
Note that these changes also preempt the missing-symbol errors you would otherwise get during linkage from not linking against the CUDA runtime library. Note that, depending on whether you are using a 32- or 64-bit host OS, you may need to change the library path to $(CUDA_INSTALL_PATH)/lib64 for the linkage to work correctly.
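For example, on a 64-bit Linux host that line would become something like:
CUDA_LIBS := -L$(CUDA_INSTALL_PATH)/lib64 -lcudart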

Link .ll files generated by compiling .cu file with clang

I am compiling the following code using clang with:
clang++ -std=c++11 -emit-llvm -c -S $1 --cuda-gpu-arch=sm_30
This generates the files vectoradd-cuda-nvptx64-nvidia-cuda-sm_30.ll and vectoradd.ll. The goal is to run some LLVM analysis passes on the kernel, which may instrument it, so I would like to link the post-analysis IR into an executable, but I am not sure how. When I try to link the .ll files with llvm-link, I get the error Linking globals named '_Z9vectoraddPiS_S_i': symbol multiply defined!. I am not really sure how to achieve this, so any help is appreciated.
#include <vector> // needed for std::vector below

#define THREADS_PER_BLOCK 512

__global__ void vectoradd(int *A, int *B, int *C, int N) {
    int gi = threadIdx.x + blockIdx.x * blockDim.x;
    if (gi < N) {
        C[gi] = A[gi] + B[gi];
    }
}

int main(int argc, char **argv) {
    int N = 10000, *d_A, *d_B, *d_C;
    /// allocate host memory
    std::vector<int> A(N);
    std::vector<int> B(N);
    std::vector<int> C(N);
    /// allocate device memory
    cudaMalloc((void **) &d_A, N * sizeof(int));
    cudaMalloc((void **) &d_B, N * sizeof(int));
    cudaMalloc((void **) &d_C, N * sizeof(int));
    /// populate host data
    for (size_t i = 0; i < N; ++i) {
        A[i] = i; B[i] = i;
    }
    /// copy to device
    cudaMemcpy(d_A, &A[0], N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, &B[0], N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 block(THREADS_PER_BLOCK, 1, 1);
    dim3 grid((N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, 1, 1);
    vectoradd<<<grid, block>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();
    cudaMemcpy(&C[0], d_C, N * sizeof(int), cudaMemcpyDeviceToHost);
    return 0;
}
The CUDA compilation trajectory in Clang is rather complicated (as it is in the NVIDIA toolchain), and what you are trying to do cannot work. The LLVM IR from each branch of the compilation process must remain separate until directly linkable objects are available. As a result, there are many intermediate steps which you will need to perform manually.
LLVM IR code for the GPU must first be compiled to PTX, and then assembled into a binary payload which can be linked against host object files.
So in your example, you first do something like:
clang++ -std=c++11 -emit-llvm -c -S test.cu --cuda-gpu-arch=sm_52
which emits two llvm IR files test-cuda-nvptx64-nvidia-cuda-sm_52.ll and test.ll. The GPU code then needs to be compiled to PTX (see more about the nvptx backend here):
llc -mcpu=sm_52 test-cuda-nvptx64-nvidia-cuda-sm_52.ll -o test.ptx
Now the PTX code can be assembled into an ELF file which can later be linked by nvcc (or the host linker, with a couple of additional steps) in the normal way:
ptxas --gpu-name=sm_52 test.ptx -o test.ptx.o
fatbinary --cuda -64 --create test.fatbin --image=profile=sm_52,file=test.ptx.o
For the host code you do something like
llc test.ll
clang -m64 -c test.s
to produce assembler output from the LLVM IR and then assemble that to an object file.
Now, with a fatbin file containing the compiled CUDA code and an object file containing the compiled host code, you can perform linkage. I have not been able to test linking a host object file with a fatbinary using clang; that is something you will need to work out yourself. It will be instructive to study both the verbose output of clang during a CUDA compilation call and the nvcc documentation to get a better feel for how the device code build system works.