Can't get simple CUDA program to work - cuda

I'm trying the "hello world" program of CUDA programming: adding two vectors together. Here's the program I have tried:
#include <cuda.h>
#include <stdio.h>
#define SIZE 10
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
float A[SIZE], B[SIZE], C[SIZE];
float *devPtrA, *devPtrB, *devPtrC;
size_t memsize= SIZE * sizeof(float);
for (int i=0; i< SIZE; i++) {
A[i] = i;
B[i] = i;
}
cudaMalloc(&devPtrA, memsize);
cudaMalloc(&devPtrB, memsize);
cudaMalloc(&devPtrC, memsize);
cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);
vecAdd<<<1, SIZE>>>(devPtrA, devPtrB, devPtrC);
cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);
for (int i=0; i<SIZE; i++)
printf("C[%d]: %f + %f => %f\n",i,A[i],B[i],C[i]);
cudaFree(devPtrA);
cudaFree(devPtrB);
cudaFree(devPtrC);
}
Compiled with:
nvcc cuda.cu
Output is this:
C[0]: 0.000000 + 0.000000 => 0.000000
C[1]: 1.000000 + 1.000000 => 0.000000
C[2]: 2.000000 + 2.000000 => 0.000000
C[3]: 3.000000 + 3.000000 => 0.000000
C[4]: 4.000000 + 4.000000 => 0.000000
C[5]: 5.000000 + 5.000000 => 0.000000
C[6]: 6.000000 + 6.000000 => 0.000000
C[7]: 7.000000 + 7.000000 => 0.000000
C[8]: 8.000000 + 8.000000 => 366987238703104.000000
C[9]: 9.000000 + 9.000000 => 0.000000
Every time I run it, I get a different answer for C[8], but the results for all the other elements are always 0.000000.
The Ubuntu 11.04 system a 64-bit Xeon server with 4 cores running the latest NVIDIA drivers (downloaded on Oct 4, 2012). The card is an EVGA GeForce GT 430 with 96 cores and 1GB of RAM.
What should I do to figure out what's going on?

It seems that your drivers are not initialized, but not checking the cuda return codes is always a bad practice, you should avoid that. Here is simple function + Macro that you can use for cuda calls(quoted from Cuda by Example):
static void HandleError( cudaError_t err,
const char *file,
int line ) {
if (err != cudaSuccess) {
printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
file, line );
exit( EXIT_FAILURE );
}
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
Now start calling your functions like:
HANDLE_ERROR(cudaMemcpy(...));

Most likely cause: the NVIDIA drivers weren't loaded. On a headless Linux system, X Windows isn't running, so the drivers aren't loaded at boot time.
Run nvidia-smi -a as root to load them and get a confirmation in the form of a report.
Although the drivers are now loaded, they still need to be initialized every time a CUDA program is run. Put the drivers into persistent mode with nvidia-smi -pm 1 so they remain initialized all the time. Add this to a boot script (such as rc.local) so it happens at every boot.

Related

Can I update or change a cuda kernel without stopping the process? [duplicate]

I have an application which generates CUDA C++ source code, compiles it into PTX at runtime using NVRTC, and then creates CUDA modules from it using the CUDA driver API.
If I debug this application using cuda-gdb, it displays the kernel (where an error occured) in the backtrace, but does not show the line number.
I export the generated source code into a file, and give the directory to cuda-gdb using the --directory option. I also tried passing its file name to nvrtcCreateProgram() (name argument). I use the compile options --device-debug and --generate-line-info with NVRTC.
Is there a way to let cuda-gdb know the location of the generated source code file, and display the line number information in its backtrace?
For those who may not be familiar with nvrtc, it is a CUDA facility that allows runtime-compilation of CUDA C++ device code. As a result, device code generated at runtime (including modifications) can be used on a CUDA GPU. There is documentation for nvrtc and there are various CUDA sample codes for nvrtc, most or all of which have _nvrtc in the file name.
I was able to do kernel source-level debugging on a nvrtc-generated kernel with cuda-gdb as follows:
start with vectorAdd_nvrtc sample code
modify the compileFileToPTX routine (provided by nvrtc_helper.h) to add the --device-debug switch during the compile-cu-to-ptx step.
modify the loadPTX routine (provided by nvrtc_helper.h) to add the CU_JIT_GENERATE_DEBUG_INFO option (set to 1) for the cuModuleLoadDataEx load/JIT PTX-to-binary step.
compile the main function (vectorAdd.cpp) with -g option.
Here is a complete test case/session. I'm only showing the vectorAdd.cpp file from the project because that is the only file I modified. Other project file(s) are identical to what is in the sample project:
$ cat vectorAdd.cpp
/**
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
/**
* Vector addition: C = A + B.
*
* This sample is a very basic sample that implements element by element
* vector addition. It is the same as the sample illustrating Chapter 2
* of the programming guide with some additions like error checking.
*/
#include <stdio.h>
#include <cmath>
// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda.h>
#include <cuda_runtime.h>
// helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <nvrtc_helper.h>
#include <iostream>
#include <fstream>
/**
* Host main routine
*/
void my_compileFileToPTX(char *filename, int argc, char **argv, char **ptxResult,
size_t *ptxResultSize, int requiresCGheaders) {
std::ifstream inputFile(filename,
std::ios::in | std::ios::binary | std::ios::ate);
if (!inputFile.is_open()) {
std::cerr << "\nerror: unable to open " << filename << " for reading!\n";
exit(1);
}
std::streampos pos = inputFile.tellg();
size_t inputSize = (size_t)pos;
char *memBlock = new char[inputSize + 1];
inputFile.seekg(0, std::ios::beg);
inputFile.read(memBlock, inputSize);
inputFile.close();
memBlock[inputSize] = '\x0';
int numCompileOptions = 0;
char *compileParams[2];
std::string compileOptions;
if (requiresCGheaders) {
char HeaderNames[256];
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#else
snprintf(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#endif
compileOptions = "--include-path=";
std::string path = sdkFindFilePath(HeaderNames, argv[0]);
if (!path.empty()) {
std::size_t found = path.find(HeaderNames);
path.erase(found);
} else {
printf(
"\nCooperativeGroups headers not found, please install it in %s "
"sample directory..\n Exiting..\n",
argv[0]);
}
compileOptions += path.c_str();
compileParams[0] = reinterpret_cast<char *>(
malloc(sizeof(char) * (compileOptions.length() + 1)));
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(compileParams[0], sizeof(char) * (compileOptions.length() + 1),
"%s", compileOptions.c_str());
#else
snprintf(compileParams[0], compileOptions.size(), "%s",
compileOptions.c_str());
#endif
numCompileOptions++;
}
compileOptions = "--device-debug ";
compileParams[numCompileOptions] = reinterpret_cast<char *>(malloc(sizeof(char) * (compileOptions.length() + 1)));
snprintf(compileParams[numCompileOptions], compileOptions.size(), "%s", compileOptions.c_str());
numCompileOptions++;
// compile
nvrtcProgram prog;
NVRTC_SAFE_CALL("nvrtcCreateProgram",
nvrtcCreateProgram(&prog, memBlock, filename, 0, NULL, NULL));
nvrtcResult res = nvrtcCompileProgram(prog, numCompileOptions, compileParams);
// dump log
size_t logSize;
NVRTC_SAFE_CALL("nvrtcGetProgramLogSize",
nvrtcGetProgramLogSize(prog, &logSize));
char *log = reinterpret_cast<char *>(malloc(sizeof(char) * logSize + 1));
NVRTC_SAFE_CALL("nvrtcGetProgramLog", nvrtcGetProgramLog(prog, log));
log[logSize] = '\x0';
if (strlen(log) >= 2) {
std::cerr << "\n compilation log ---\n";
std::cerr << log;
std::cerr << "\n end log ---\n";
}
free(log);
NVRTC_SAFE_CALL("nvrtcCompileProgram", res);
// fetch PTX
size_t ptxSize;
NVRTC_SAFE_CALL("nvrtcGetPTXSize", nvrtcGetPTXSize(prog, &ptxSize));
char *ptx = reinterpret_cast<char *>(malloc(sizeof(char) * ptxSize));
NVRTC_SAFE_CALL("nvrtcGetPTX", nvrtcGetPTX(prog, ptx));
NVRTC_SAFE_CALL("nvrtcDestroyProgram", nvrtcDestroyProgram(&prog));
*ptxResult = ptx;
*ptxResultSize = ptxSize;
#ifdef DUMP_PTX
std::ofstream my_f;
my_f.open("vectorAdd.ptx");
for (int i = 0; i < ptxSize; i++)
my_f << ptx[i];
my_f.close();
#endif
if (requiresCGheaders) free(compileParams[0]);
}
CUmodule my_loadPTX(char *ptx, int argc, char **argv) {
CUmodule module;
CUcontext context;
int major = 0, minor = 0;
char deviceName[256];
// Picks the best CUDA device available
CUdevice cuDevice = findCudaDeviceDRV(argc, (const char **)argv);
// get compute capabilities and the devicename
checkCudaErrors(cuDeviceGetAttribute(
&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice));
checkCudaErrors(cuDeviceGetAttribute(
&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice));
checkCudaErrors(cuDeviceGetName(deviceName, 256, cuDevice));
printf("> GPU Device has SM %d.%d compute capability\n", major, minor);
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGet(&cuDevice, 0));
checkCudaErrors(cuCtxCreate(&context, 0, cuDevice));
CUjit_option opt[1];
opt[0] = CU_JIT_GENERATE_DEBUG_INFO;
void **vals = new void *[1];
vals[0] = (void *)(size_t)1;
checkCudaErrors(cuModuleLoadDataEx(&module, ptx, 1, opt, vals));
free(ptx);
return module;
}
int main(int argc, char **argv) {
char *ptx, *kernel_file;
size_t ptxSize;
kernel_file = sdkFindFilePath("vectorAdd_kernel.cu", argv[0]);
my_compileFileToPTX(kernel_file, argc, argv, &ptx, &ptxSize, 0);
CUmodule module = my_loadPTX(ptx, argc, argv);
CUfunction kernel_addr;
checkCudaErrors(cuModuleGetFunction(&kernel_addr, module, "vectorAdd"));
// Print the vector length to be used, and compute its size
int numElements = 50000;
size_t size = numElements * sizeof(float);
printf("[Vector addition of %d elements]\n", numElements);
// Allocate the host input vector A
float *h_A = reinterpret_cast<float *>(malloc(size));
// Allocate the host input vector B
float *h_B = reinterpret_cast<float *>(malloc(size));
// Allocate the host output vector C
float *h_C = reinterpret_cast<float *>(malloc(size));
// Verify that allocations succeeded
if (h_A == NULL || h_B == NULL || h_C == NULL) {
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
// Initialize the host input vectors
for (int i = 0; i < numElements; ++i) {
h_A[i] = rand() / static_cast<float>(RAND_MAX);
h_B[i] = rand() / static_cast<float>(RAND_MAX);
}
// Allocate the device input vector A
CUdeviceptr d_A;
checkCudaErrors(cuMemAlloc(&d_A, size));
// Allocate the device input vector B
CUdeviceptr d_B;
checkCudaErrors(cuMemAlloc(&d_B, size));
// Allocate the device output vector C
CUdeviceptr d_C;
checkCudaErrors(cuMemAlloc(&d_C, size));
// Copy the host input vectors A and B in host memory to the device input
// vectors in device memory
printf("Copy input data from the host memory to the CUDA device\n");
checkCudaErrors(cuMemcpyHtoD(d_A, h_A, size));
checkCudaErrors(cuMemcpyHtoD(d_B, h_B, size));
// Launch the Vector Add CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
threadsPerBlock);
dim3 cudaBlockSize(threadsPerBlock, 1, 1);
dim3 cudaGridSize(blocksPerGrid, 1, 1);
void *arr[] = {reinterpret_cast<void *>(&d_A), reinterpret_cast<void *>(&d_B),
reinterpret_cast<void *>(&d_C),
reinterpret_cast<void *>(&numElements)};
checkCudaErrors(cuLaunchKernel(kernel_addr, cudaGridSize.x, cudaGridSize.y,
cudaGridSize.z, /* grid dim */
cudaBlockSize.x, cudaBlockSize.y,
cudaBlockSize.z, /* block dim */
0, 0, /* shared mem, stream */
&arr[0], /* arguments */
0));
checkCudaErrors(cuCtxSynchronize());
// Copy the device result vector in device memory to the host result vector
// in host memory.
printf("Copy output data from the CUDA device to the host memory\n");
checkCudaErrors(cuMemcpyDtoH(h_C, d_C, size));
// Verify that the result vector is correct
for (int i = 0; i < numElements; ++i) {
if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
fprintf(stderr, "Result verification failed at element %d!\n", i);
exit(EXIT_FAILURE);
}
}
printf("Test PASSED\n");
// Free device global memory
checkCudaErrors(cuMemFree(d_A));
checkCudaErrors(cuMemFree(d_B));
checkCudaErrors(cuMemFree(d_C));
// Free host memory
free(h_A);
free(h_B);
free(h_C);
printf("Done\n");
return 0;
}
$ nvcc -g -I/usr/local/cuda/samples/common/inc -o test vectorAdd.cpp -lnvrtc -lcuda
$ cuda-gdb ./test
NVIDIA (R) CUDA Debugger
10.0 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test...done.
(cuda-gdb) break vectorAdd
Function "vectorAdd" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (vectorAdd) pending.
(cuda-gdb) r
Starting program: /home/user2/misc/junk/vectorAdd_nvrtc/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffedc00700 (LWP 16789)]
> Using CUDA Device [1]: Tesla K40m
> GPU Device has SM 3.5 compute capability
[New Thread 0x7fffed3ff700 (LWP 16790)]
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "test" hit Breakpoint 1, vectorAdd<<<(196,1,1),(256,1,1)>>> (A=0x7fffce800000, B=0x7fffce830e00, C=0x7fffce861c00, numElements=50000) at ./vectorAdd_kernel.cu:21
21 int i = blockDim.x * blockIdx.x + threadIdx.x;
(cuda-gdb) step
23 if (i < numElements) {
(cuda-gdb) step
24 C[i] = A[i] + B[i];
(cuda-gdb) step
26 }
(cuda-gdb) quit
A debugging session is active.
Inferior 1 [process 16777] will be killed.
Quit anyway? (y or n) y
$

Using CUDA-gdb with NVRTC

I have an application which generates CUDA C++ source code, compiles it into PTX at runtime using NVRTC, and then creates CUDA modules from it using the CUDA driver API.
If I debug this application using cuda-gdb, it displays the kernel (where an error occured) in the backtrace, but does not show the line number.
I export the generated source code into a file, and give the directory to cuda-gdb using the --directory option. I also tried passing its file name to nvrtcCreateProgram() (name argument). I use the compile options --device-debug and --generate-line-info with NVRTC.
Is there a way to let cuda-gdb know the location of the generated source code file, and display the line number information in its backtrace?
For those who may not be familiar with nvrtc, it is a CUDA facility that allows runtime-compilation of CUDA C++ device code. As a result, device code generated at runtime (including modifications) can be used on a CUDA GPU. There is documentation for nvrtc and there are various CUDA sample codes for nvrtc, most or all of which have _nvrtc in the file name.
I was able to do kernel source-level debugging on a nvrtc-generated kernel with cuda-gdb as follows:
start with vectorAdd_nvrtc sample code
modify the compileFileToPTX routine (provided by nvrtc_helper.h) to add the --device-debug switch during the compile-cu-to-ptx step.
modify the loadPTX routine (provided by nvrtc_helper.h) to add the CU_JIT_GENERATE_DEBUG_INFO option (set to 1) for the cuModuleLoadDataEx load/JIT PTX-to-binary step.
compile the main function (vectorAdd.cpp) with -g option.
Here is a complete test case/session. I'm only showing the vectorAdd.cpp file from the project because that is the only file I modified. Other project file(s) are identical to what is in the sample project:
$ cat vectorAdd.cpp
/**
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
/**
* Vector addition: C = A + B.
*
* This sample is a very basic sample that implements element by element
* vector addition. It is the same as the sample illustrating Chapter 2
* of the programming guide with some additions like error checking.
*/
#include <stdio.h>
#include <cmath>
// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda.h>
#include <cuda_runtime.h>
// helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <nvrtc_helper.h>
#include <iostream>
#include <fstream>
/**
* Host main routine
*/
void my_compileFileToPTX(char *filename, int argc, char **argv, char **ptxResult,
size_t *ptxResultSize, int requiresCGheaders) {
std::ifstream inputFile(filename,
std::ios::in | std::ios::binary | std::ios::ate);
if (!inputFile.is_open()) {
std::cerr << "\nerror: unable to open " << filename << " for reading!\n";
exit(1);
}
std::streampos pos = inputFile.tellg();
size_t inputSize = (size_t)pos;
char *memBlock = new char[inputSize + 1];
inputFile.seekg(0, std::ios::beg);
inputFile.read(memBlock, inputSize);
inputFile.close();
memBlock[inputSize] = '\x0';
int numCompileOptions = 0;
char *compileParams[2];
std::string compileOptions;
if (requiresCGheaders) {
char HeaderNames[256];
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#else
snprintf(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#endif
compileOptions = "--include-path=";
std::string path = sdkFindFilePath(HeaderNames, argv[0]);
if (!path.empty()) {
std::size_t found = path.find(HeaderNames);
path.erase(found);
} else {
printf(
"\nCooperativeGroups headers not found, please install it in %s "
"sample directory..\n Exiting..\n",
argv[0]);
}
compileOptions += path.c_str();
compileParams[0] = reinterpret_cast<char *>(
malloc(sizeof(char) * (compileOptions.length() + 1)));
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(compileParams[0], sizeof(char) * (compileOptions.length() + 1),
"%s", compileOptions.c_str());
#else
snprintf(compileParams[0], compileOptions.size(), "%s",
compileOptions.c_str());
#endif
numCompileOptions++;
}
compileOptions = "--device-debug ";
compileParams[numCompileOptions] = reinterpret_cast<char *>(malloc(sizeof(char) * (compileOptions.length() + 1)));
snprintf(compileParams[numCompileOptions], compileOptions.size(), "%s", compileOptions.c_str());
numCompileOptions++;
// compile
nvrtcProgram prog;
NVRTC_SAFE_CALL("nvrtcCreateProgram",
nvrtcCreateProgram(&prog, memBlock, filename, 0, NULL, NULL));
nvrtcResult res = nvrtcCompileProgram(prog, numCompileOptions, compileParams);
// dump log
size_t logSize;
NVRTC_SAFE_CALL("nvrtcGetProgramLogSize",
nvrtcGetProgramLogSize(prog, &logSize));
char *log = reinterpret_cast<char *>(malloc(sizeof(char) * logSize + 1));
NVRTC_SAFE_CALL("nvrtcGetProgramLog", nvrtcGetProgramLog(prog, log));
log[logSize] = '\x0';
if (strlen(log) >= 2) {
std::cerr << "\n compilation log ---\n";
std::cerr << log;
std::cerr << "\n end log ---\n";
}
free(log);
NVRTC_SAFE_CALL("nvrtcCompileProgram", res);
// fetch PTX
size_t ptxSize;
NVRTC_SAFE_CALL("nvrtcGetPTXSize", nvrtcGetPTXSize(prog, &ptxSize));
char *ptx = reinterpret_cast<char *>(malloc(sizeof(char) * ptxSize));
NVRTC_SAFE_CALL("nvrtcGetPTX", nvrtcGetPTX(prog, ptx));
NVRTC_SAFE_CALL("nvrtcDestroyProgram", nvrtcDestroyProgram(&prog));
*ptxResult = ptx;
*ptxResultSize = ptxSize;
#ifdef DUMP_PTX
std::ofstream my_f;
my_f.open("vectorAdd.ptx");
for (int i = 0; i < ptxSize; i++)
my_f << ptx[i];
my_f.close();
#endif
if (requiresCGheaders) free(compileParams[0]);
}
CUmodule my_loadPTX(char *ptx, int argc, char **argv) {
CUmodule module;
CUcontext context;
int major = 0, minor = 0;
char deviceName[256];
// Picks the best CUDA device available
CUdevice cuDevice = findCudaDeviceDRV(argc, (const char **)argv);
// get compute capabilities and the devicename
checkCudaErrors(cuDeviceGetAttribute(
&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice));
checkCudaErrors(cuDeviceGetAttribute(
&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice));
checkCudaErrors(cuDeviceGetName(deviceName, 256, cuDevice));
printf("> GPU Device has SM %d.%d compute capability\n", major, minor);
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGet(&cuDevice, 0));
checkCudaErrors(cuCtxCreate(&context, 0, cuDevice));
CUjit_option opt[1];
opt[0] = CU_JIT_GENERATE_DEBUG_INFO;
void **vals = new void *[1];
vals[0] = (void *)(size_t)1;
checkCudaErrors(cuModuleLoadDataEx(&module, ptx, 1, opt, vals));
free(ptx);
return module;
}
int main(int argc, char **argv) {
char *ptx, *kernel_file;
size_t ptxSize;
kernel_file = sdkFindFilePath("vectorAdd_kernel.cu", argv[0]);
my_compileFileToPTX(kernel_file, argc, argv, &ptx, &ptxSize, 0);
CUmodule module = my_loadPTX(ptx, argc, argv);
CUfunction kernel_addr;
checkCudaErrors(cuModuleGetFunction(&kernel_addr, module, "vectorAdd"));
// Print the vector length to be used, and compute its size
int numElements = 50000;
size_t size = numElements * sizeof(float);
printf("[Vector addition of %d elements]\n", numElements);
// Allocate the host input vector A
float *h_A = reinterpret_cast<float *>(malloc(size));
// Allocate the host input vector B
float *h_B = reinterpret_cast<float *>(malloc(size));
// Allocate the host output vector C
float *h_C = reinterpret_cast<float *>(malloc(size));
// Verify that allocations succeeded
if (h_A == NULL || h_B == NULL || h_C == NULL) {
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
// Initialize the host input vectors
for (int i = 0; i < numElements; ++i) {
h_A[i] = rand() / static_cast<float>(RAND_MAX);
h_B[i] = rand() / static_cast<float>(RAND_MAX);
}
// Allocate the device input vector A
CUdeviceptr d_A;
checkCudaErrors(cuMemAlloc(&d_A, size));
// Allocate the device input vector B
CUdeviceptr d_B;
checkCudaErrors(cuMemAlloc(&d_B, size));
// Allocate the device output vector C
CUdeviceptr d_C;
checkCudaErrors(cuMemAlloc(&d_C, size));
// Copy the host input vectors A and B in host memory to the device input
// vectors in device memory
printf("Copy input data from the host memory to the CUDA device\n");
checkCudaErrors(cuMemcpyHtoD(d_A, h_A, size));
checkCudaErrors(cuMemcpyHtoD(d_B, h_B, size));
// Launch the Vector Add CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
threadsPerBlock);
dim3 cudaBlockSize(threadsPerBlock, 1, 1);
dim3 cudaGridSize(blocksPerGrid, 1, 1);
void *arr[] = {reinterpret_cast<void *>(&d_A), reinterpret_cast<void *>(&d_B),
reinterpret_cast<void *>(&d_C),
reinterpret_cast<void *>(&numElements)};
checkCudaErrors(cuLaunchKernel(kernel_addr, cudaGridSize.x, cudaGridSize.y,
cudaGridSize.z, /* grid dim */
cudaBlockSize.x, cudaBlockSize.y,
cudaBlockSize.z, /* block dim */
0, 0, /* shared mem, stream */
&arr[0], /* arguments */
0));
checkCudaErrors(cuCtxSynchronize());
// Copy the device result vector in device memory to the host result vector
// in host memory.
printf("Copy output data from the CUDA device to the host memory\n");
checkCudaErrors(cuMemcpyDtoH(h_C, d_C, size));
// Verify that the result vector is correct
for (int i = 0; i < numElements; ++i) {
if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
fprintf(stderr, "Result verification failed at element %d!\n", i);
exit(EXIT_FAILURE);
}
}
printf("Test PASSED\n");
// Free device global memory
checkCudaErrors(cuMemFree(d_A));
checkCudaErrors(cuMemFree(d_B));
checkCudaErrors(cuMemFree(d_C));
// Free host memory
free(h_A);
free(h_B);
free(h_C);
printf("Done\n");
return 0;
}
$ nvcc -g -I/usr/local/cuda/samples/common/inc -o test vectorAdd.cpp -lnvrtc -lcuda
$ cuda-gdb ./test
NVIDIA (R) CUDA Debugger
10.0 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test...done.
(cuda-gdb) break vectorAdd
Function "vectorAdd" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (vectorAdd) pending.
(cuda-gdb) r
Starting program: /home/user2/misc/junk/vectorAdd_nvrtc/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffedc00700 (LWP 16789)]
> Using CUDA Device [1]: Tesla K40m
> GPU Device has SM 3.5 compute capability
[New Thread 0x7fffed3ff700 (LWP 16790)]
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "test" hit Breakpoint 1, vectorAdd<<<(196,1,1),(256,1,1)>>> (A=0x7fffce800000, B=0x7fffce830e00, C=0x7fffce861c00, numElements=50000) at ./vectorAdd_kernel.cu:21
21 int i = blockDim.x * blockIdx.x + threadIdx.x;
(cuda-gdb) step
23 if (i < numElements) {
(cuda-gdb) step
24 C[i] = A[i] + B[i];
(cuda-gdb) step
26 }
(cuda-gdb) quit
A debugging session is active.
Inferior 1 [process 16777] will be killed.
Quit anyway? (y or n) y
$

Cuda hello_world.cu compiles but wrongly prints "Hello Hello"

I have found the following hello world program for CUDA:
#include <stdio.h>
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
const int N = 16;
const int blocksize = 16;
__global__
void hello(char *a, int *b)
{
a[threadIdx.x] += b[threadIdx.x];
}
int main()
{
char a[N] = "Hello \0\0\0\0\0\0";
int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
char *ad;
int *bd;
const int csize = N*sizeof(char);
const int isize = N*sizeof(int);
printf("%s", a);
cudaMalloc( (void**)&ad, csize );
cudaMalloc( (void**)&bd, isize );
cudaCheckErrors("cudaMalloc fail");
cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );
cudaCheckErrors("cudaMemcpy H2D fail");
dim3 dimBlock( blocksize, 1 );
dim3 dimGrid( 1, 1 );
hello<<<dimGrid, dimBlock>>>(ad, bd);
cudaCheckErrors("Kernel fail");
cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
cudaCheckErrors("cudaMemcpy D2H/Kernel fail");
cudaFree( ad );
cudaFree( bd );
printf("%s\n", a);
return EXIT_SUCCESS;
}
I compile it successfully with nvcc hello_world.cu -o hello, but when I run cuda-memcheck ./hello , I get:
========= CUDA-MEMCHECK
Fatal error: cudaMalloc fail (unknown error at hello_world.cu:39)
*** FAILED - ABORTING
Hello ========= ERROR SUMMARY: 0 errors
I'm a CUDA newbie, my questions are:
1) what's going on under the hood?
2) how can I fix it?
I'm running Ubuntu 13.04, x86_64, Cuda 5.5, without root access.
the upper output of nvidia-smi is:
+------------------------------------------------------+
| NVIDIA-SMI 337.19 Driver Version: 337.19 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 N/A | N/A |
| 26% 37C N/A N/A / N/A | 53MiB / 6143MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
When I run deviceQuery, I get:
../../bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL
And when I run deviceQueryDrv, I get:
../../bin/x86_64/linux/release/deviceQueryDrv Starting...
CUDA Device Query (Driver API) statically linked version
cuInit(0) returned 999
-> CUDA_ERROR_UNKNOWN
Result = FAIL
When I run:
#include <cublas_v2.h>
#include <cstdio>
int main()
{
int res;
cublasHandle_t handle;
res = cublasCreate(&handle);
switch(res) {
case CUBLAS_STATUS_SUCCESS:
printf("the initialization succeeded\n");
break;
case CUBLAS_STATUS_NOT_INITIALIZED:
printf("the CUDA Runtime initialization failed\n");
break;
case CUBLAS_STATUS_ALLOC_FAILED:
printf("the resources could not be allocated\n");
break;
}
return 0;
}
I get the CUDA Runtime initialization failed.
I have had a problem exactly like yours just now. Running a sample was returning an "Unknown Error" and printing "Hello Hello ", and cublasCreate was returning CUBLAS_STATUS_NOT_INITIALIZED.
I found the answer here: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#post-installation-actions
If a CUDA-capable device and the CUDA Driver are installed but deviceQuery reports that no CUDA-capable devices are present, this likely means that the /dev/nvidia* files are missing or have the wrong permissions.
Indeed, my /dev/nvidia* files were owned by root.
I just ran
chown user.user /dev/nvidia*
(where user is my local user) and all the errors disappeared.
EDIT: later I had a similar issue on another machine, and this solution did not work, because one of the devices in /dev/nvidia was missing. What did work is just executing a sample code once as a sudo. After that one extra device appeared in /dev/nvidia*, and sample code started working without sudo.

Simple adding of two int's in Cuda, result always the same

I'm starting my journary to learn Cuda. I am playing with some hello world type cuda code but its not working, and I'm not sure why.
The code is very simple, take two ints and add them on the GPU and return the result, but no matter what I change the numbers to I get the same result(If math worked that way I would have done alot better in the subject than I actually did).
Here's the sample code:
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>
__global__ void add( int a, int b, int *c ) {
*c = a + b;
}
extern "C"
void runCudaPart();
// Main cuda function
void runCudaPart() {
int c;
int *dev_c;
cudaMalloc( (void**)&dev_c, sizeof(int) );
add<<<1,1>>>( 1, 4, dev_c );
cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
printf( "1 + 4 = %d\n", c );
cudaFree( dev_c );
}
The output seems a bit off: 1 + 4 = -1065287167
I'm working on setting up my environment and just wanted to know if there was a problem with the code otherwise its probably my environment.
Update: I tried to add some code to show the error but I don't get an output but the number changes(is it outputing error codes instead of answers? Even if I don't do any work in the kernal other than assign a variable I still get simlair results).
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>
__global__ void add( int a, int b, int *c ) {
//*c = a + b;
*c = 5;
}
extern "C"
void runCudaPart();
// Main cuda function
void runCudaPart() {
int c;
int *dev_c;
cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
add<<<1,1>>>( 1, 4, dev_c );
cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
printf( "1 + 4 = %d\n", c );
cudaFree( dev_c );
}
Code appears to be fine, maybe its related to my setup. Its been a nightmare to get Cuda installed on OSX lion but I thought it worked as the examples in the SDK seemed to be fine. The steps I took so far are go to the Nvida website and download the latest mac releases for the driver, toolkit and SDK. I then added export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH and 'PATH=/usr/local/cuda/bin:$PATH` I did a deviceQuery and it passed with the following info about my system:
[deviceQuery] starting...
/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors x ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce 320M
[deviceQuery] test results...
PASSED
UPDATE: what's really weird is even if I remove all the work in the kernel I stil get a result for c? I have reinstalled cuda and used make on the examples and all of them pass.
Basically there are two problems here:
You are not compiling the kernel for the correct architecture (gleaned from comments)
Your code contains imperfect error checking which is missing the point when the runtime error is occurring, leading to mysterious and unexplained symptoms.
In the runtime API, most context related actions are performed "lazily". When you launch a kernel for the first time, the runtime API will invoke code to intelligently find a suitable CUBIN image from inside the fat binary image emitted by the toolchain for the target hardware and load it into the context. This can also include JIT recompilation of PTX for a backwards compatible architecture, but not the other way around. So if you had a kernel compiled for a compute capability 1.2 device and you run it on a compute capability 2.0 device, the driver can JIT compile the PTX 1.x code it contains for the newer architecture. But the reverse doesn't work. So in your example, the runtime API will generate an error because it cannot find a usable binary image in the CUDA fatbinary image embedded in the executable. The error message is pretty cryptic, but you will get an error (see this question for a bit more information).
If your code contained error checking like this:
cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
add<<<1,1>>>( 1, 4, dev_c );
if (cudaPeekAtLastError() != cudaSuccess) {
printf("The error is %s", cudaGetErrorString(cudaGetLastError()));
}
cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
the extra error checking after the kernel launch should catch the runtime API error generated by the kernel load/launch failure.
#include <stdio.h>
#include <conio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
__global__ void Addition(int *a,int *b,int *c)
{
*c = *a + *b;
}
int main()
{
int a,b,c;
int *dev_a,*dev_b,*dev_c;
int size = sizeof(int);
cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);
a=5,b=6;
cudaMemcpy(dev_a, &a,sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, &b,sizeof(int), cudaMemcpyHostToDevice);
Addition<<< 1,1 >>>(dev_a,dev_b,dev_c);
cudaMemcpy(&c, dev_c,size, cudaMemcpyDeviceToHost);
cudaFree(&dev_a);
cudaFree(&dev_b);
cudaFree(&dev_c);
printf("%d\n", c);
getch();
return 0;
}

cuda programming problem

I'm very new to cuda .I'm using cuda on my ubuntu 10.04 in device emulation mode.
I write a code to compute the square of array which is following :
#include <stdio.h>
#include <cuda.h>
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x + threadIdx.x;
if (idx<=N)
a[idx] = a[idx] * a[idx];
}
int main(void)
{
float *a_h, *a_d;
const int N = 10;
size_t size = N * sizeof(float);
a_h = (float *)malloc(size);
cudaMalloc((void **) &a_d, size);
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
square_array <<< 1,10>>> (a_d, N);
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf(" %f\n", a_h[i]);
free(a_h);
cudaFree(a_d);
return 0;
}
When I run this code it show no problem it give me proper output.
Now my problem is that when i use <<<2,5>>> or<<<5,2>>> the result is same . what is happening on gpu ?
All I understand is that I just launch cuda kernel with 5 blocks containing 2 thread.
Can anyone explain me how Gpu handle this or implement the launch(kernel call)?
Now my real problem is that when i call the kernel with <<<1,10>>> It is ok . It shows the perfect result.
but when i call the kernel with <<<1,5>> the result is following:
0.000000
1.000000
4.000000
9.000000
16.000000
5.000000
6.000000
7.000000
8.000000
9.000000
similarly when i reduce or increase the second parameter in kernel call it show different result for example when i change it to <<1,4>> it shows following result:
0.000000
1.000000
4.000000
9.000000
4.000000
5.000000
6.000000
7.000000
8.000000
9.000000
Why this result is coming ?
Can any body explain the working of kernel launch call ?
what is blockdim type variable contain ?
Please help me to understand the concept of kernel call launching and working ?
I searched the programming guide but they didn't explain it very well.
The calculation of idx in your kernel code is incorrect. If you change it to:
int idx = blockDim.x * blockIdx.x + threadIdx.x;
You might find the results a little easier to understand.
EDIT: For any given kernel launch
square_array<<<gridDim,blockDim>>>(...)
in the GPU, the automatic variable blockDim will contain the x,y, and z components of the blockDim argument passed in the host side kernel launch. Similarly gridDim will contain the x and y components of the gridDim argument passed in the launch.
Apart from what talonmies has said, you may need to do the following to have better performance in real world applications.
if (idx < N) {
tmp = a[idx];
a[idx] = tmp * tmp;
}
The way kernels are invoked in CUDA is like so:
kernel<<<numBlocks,numThreads>>>(Kernel arguments);
This means that there will be numBlocks blocks with numThreads threads running in each block. For example, if you call
kernel<<<1,5>>>(Kernel args);
then 1 block will run with 5 threads running in parallel. and if you call
kernel<<<2,5>>>(Kernel args);
then there well be 2 blocks with 5 threads running in each. Unless you alter your device code, the maximum dimension of the array that you are "squaring" is the product numBlocks*numThreads. This explains why not all of the values in your original array were squared.
I suggest you read through the CUDA_C_Programming_Guide.pdf that comes with the CUDA toolkit.