Determining dimGrid and dimBlock sizes in CUDA - cuda

First off, I'm fairly new to CUDA programming so I apologize for such a simple question. I have researched the best way to determine dimGrid and dimBlock in my GPU kernel call and for some reason I'm not quite getting it to work.
On my home PC, I have a GeForce GTX 580 (Compute Capability 2.0). 1024 threads per block etc. I can get my code to run properly on this PC. My gpu is populating the distance array of size 988*988. Here is part of the code below:
#define SIZE 988
__global__ void createDistanceTable(double *d_distances, double *d_coordinates)
{
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if(row < SIZE && col < SIZE)
d_distances[row * SIZE + col] =
acos(__sinf(d_coordinates[row * 2 + 0])*
__sinf(d_coordinates[col * 2 + 0])+__cosf(d_coordinates[row * 2 + 0])*
__cosf(d_coordinates[col * 2 + 0])*__cosf(d_coordinates[col * 2 + 1]-
d_coordinates[row * 2 + 1]))*6371;
}
Kernel call in main:
dim3 dimBlock(32,32,1);
dim3 dimGrid(32,32,1);
createDistanceTable<<<dimGrid, dimBlock>>>(d_distances, d_coordinates);
My issue is I simply have not found a way to get this code to run properly on my laptop. My laptop's GPU is a GeForce 9600M GT (Compute Capability 1.1). 512 threads per block etc. I would greatly appreciate any guidance in helping my understand how I should approach the dimBlock and dimGrid for my kernel call on my laptop. Thanks for any advice!

Several things were wrong in your code.
Using double-precision on CC < 1.3.
The size of your thread blocks (as you said, CC <= 1.3 means 512 threads max per block, you used 1024 threads per block). I guess you could use __CUDA_ARCH__ if you do need some multi-architecture code.
No error checking or memory checking (cuda-memcheck). You may allocate more memory than you have, or use more threads/blocks than your GPU can handle, and you will not detect it.
Consider the following example based on your code (I am using float instead of double):
#include <cuda.h>
#include <stdio.h> // printf
#define SIZE 988
#define GRID_SIZE 32
#define BLOCK_SIZE 16 // set to 16 instead of 32 for instance
#define CUDA_CHECK_ERROR() __cuda_check_errors(__FILE__, __LINE__)
#define CUDA_SAFE_CALL(err) __cuda_safe_call(err, __FILE__, __LINE__)
// See: http://codeyarns.com/2011/03/02/how-to-do-error-checking-in-cuda/
inline void
__cuda_check_errors (const char *filename, const int line_number)
{
cudaError err = cudaDeviceSynchronize ();
if (err != cudaSuccess)
{
printf ("CUDA error %i at %s:%i: %s\n",
err, filename, line_number, cudaGetErrorString (err));
exit (-1);
}
}
inline void
__cuda_safe_call (cudaError err, const char *filename, const int line_number)
{
if (err != cudaSuccess)
{
printf ("CUDA error %i at %s:%i: %s\n",
err, filename, line_number, cudaGetErrorString (err));
exit (-1);
}
}
__global__ void
createDistanceTable (float *d_distances, float *d_coordinates)
{
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (row < SIZE && col < SIZE)
d_distances[row * SIZE + col] =
acos (__sinf (d_coordinates[row * 2 + 0]) *
__sinf (d_coordinates[col * 2 + 0]) +
__cosf (d_coordinates[row * 2 + 0]) *
__cosf (d_coordinates[col * 2 + 0]) *
__cosf (d_coordinates[col * 2 + 1] -
d_coordinates[row * 2 + 1])) * 6371;
}
int
main ()
{
float *d_distances;
float *d_coordinates;
CUDA_SAFE_CALL (cudaMalloc (&d_distances, SIZE * SIZE * sizeof (float)));
CUDA_SAFE_CALL (cudaMalloc (&d_coordinates, SIZE * SIZE * sizeof (float)));
dim3 dimGrid (GRID_SIZE, GRID_SIZE);
dim3 dimBlock (BLOCK_SIZE, BLOCK_SIZE);
createDistanceTable <<< dimGrid, dimBlock >>> (d_distances, d_coordinates);
CUDA_CHECK_ERROR ();
CUDA_SAFE_CALL (cudaFree (d_distances));
CUDA_SAFE_CALL (cudaFree (d_coordinates));
}
Compilation command (change architecture accordingly):
nvcc prog.cu -g -G -lineinfo -gencode arch=compute_11,code=sm_11 -o prog
With 32x32 block on CC 2.0 or 16x16 on CC 1.1:
cuda-memcheck ./prog
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
With 33x33 block on CC 2.0 or 32x32 block on CC 1.1:
cuda-memcheck ./prog
========= CUDA-MEMCHECK
========= Program hit error 9 on CUDA API call to cudaLaunch
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/nvidia-current-updates/libcuda.so [0x26a230]
========= Host Frame:/opt/cuda/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
========= Host Frame:./prog [0xc76]
========= Host Frame:./prog [0xa99]
========= Host Frame:./prog [0xac4]
========= Host Frame:./prog [0x9d1]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
========= Host Frame:./prog [0x859]
=========
========= ERROR SUMMARY: 1 error
Error 9:
/**
* This indicates that a kernel launch is requesting resources that can
* never be satisfied by the current device. Requesting more shared memory
* per block than the device supports will trigger this error, as will
* requesting too many threads or blocks. See ::cudaDeviceProp for more
* device limitations.
*/ cudaErrorInvalidConfiguration = 9,

Related

Can I update or change a cuda kernel without stopping the process? [duplicate]

I have an application which generates CUDA C++ source code, compiles it into PTX at runtime using NVRTC, and then creates CUDA modules from it using the CUDA driver API.
If I debug this application using cuda-gdb, it displays the kernel (where an error occured) in the backtrace, but does not show the line number.
I export the generated source code into a file, and give the directory to cuda-gdb using the --directory option. I also tried passing its file name to nvrtcCreateProgram() (name argument). I use the compile options --device-debug and --generate-line-info with NVRTC.
Is there a way to let cuda-gdb know the location of the generated source code file, and display the line number information in its backtrace?
For those who may not be familiar with nvrtc, it is a CUDA facility that allows runtime-compilation of CUDA C++ device code. As a result, device code generated at runtime (including modifications) can be used on a CUDA GPU. There is documentation for nvrtc and there are various CUDA sample codes for nvrtc, most or all of which have _nvrtc in the file name.
I was able to do kernel source-level debugging on a nvrtc-generated kernel with cuda-gdb as follows:
start with vectorAdd_nvrtc sample code
modify the compileFileToPTX routine (provided by nvrtc_helper.h) to add the --device-debug switch during the compile-cu-to-ptx step.
modify the loadPTX routine (provided by nvrtc_helper.h) to add the CU_JIT_GENERATE_DEBUG_INFO option (set to 1) for the cuModuleLoadDataEx load/JIT PTX-to-binary step.
compile the main function (vectorAdd.cpp) with -g option.
Here is a complete test case/session. I'm only showing the vectorAdd.cpp file from the project because that is the only file I modified. Other project file(s) are identical to what is in the sample project:
$ cat vectorAdd.cpp
/**
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
/**
* Vector addition: C = A + B.
*
* This sample is a very basic sample that implements element by element
* vector addition. It is the same as the sample illustrating Chapter 2
* of the programming guide with some additions like error checking.
*/
#include <stdio.h>
#include <cmath>
// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda.h>
#include <cuda_runtime.h>
// helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <nvrtc_helper.h>
#include <iostream>
#include <fstream>
/**
* Host main routine
*/
void my_compileFileToPTX(char *filename, int argc, char **argv, char **ptxResult,
size_t *ptxResultSize, int requiresCGheaders) {
std::ifstream inputFile(filename,
std::ios::in | std::ios::binary | std::ios::ate);
if (!inputFile.is_open()) {
std::cerr << "\nerror: unable to open " << filename << " for reading!\n";
exit(1);
}
std::streampos pos = inputFile.tellg();
size_t inputSize = (size_t)pos;
char *memBlock = new char[inputSize + 1];
inputFile.seekg(0, std::ios::beg);
inputFile.read(memBlock, inputSize);
inputFile.close();
memBlock[inputSize] = '\x0';
int numCompileOptions = 0;
char *compileParams[2];
std::string compileOptions;
if (requiresCGheaders) {
char HeaderNames[256];
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#else
snprintf(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#endif
compileOptions = "--include-path=";
std::string path = sdkFindFilePath(HeaderNames, argv[0]);
if (!path.empty()) {
std::size_t found = path.find(HeaderNames);
path.erase(found);
} else {
printf(
"\nCooperativeGroups headers not found, please install it in %s "
"sample directory..\n Exiting..\n",
argv[0]);
}
compileOptions += path.c_str();
compileParams[0] = reinterpret_cast<char *>(
malloc(sizeof(char) * (compileOptions.length() + 1)));
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(compileParams[0], sizeof(char) * (compileOptions.length() + 1),
"%s", compileOptions.c_str());
#else
snprintf(compileParams[0], compileOptions.size(), "%s",
compileOptions.c_str());
#endif
numCompileOptions++;
}
compileOptions = "--device-debug ";
compileParams[numCompileOptions] = reinterpret_cast<char *>(malloc(sizeof(char) * (compileOptions.length() + 1)));
snprintf(compileParams[numCompileOptions], compileOptions.size(), "%s", compileOptions.c_str());
numCompileOptions++;
// compile
nvrtcProgram prog;
NVRTC_SAFE_CALL("nvrtcCreateProgram",
nvrtcCreateProgram(&prog, memBlock, filename, 0, NULL, NULL));
nvrtcResult res = nvrtcCompileProgram(prog, numCompileOptions, compileParams);
// dump log
size_t logSize;
NVRTC_SAFE_CALL("nvrtcGetProgramLogSize",
nvrtcGetProgramLogSize(prog, &logSize));
char *log = reinterpret_cast<char *>(malloc(sizeof(char) * logSize + 1));
NVRTC_SAFE_CALL("nvrtcGetProgramLog", nvrtcGetProgramLog(prog, log));
log[logSize] = '\x0';
if (strlen(log) >= 2) {
std::cerr << "\n compilation log ---\n";
std::cerr << log;
std::cerr << "\n end log ---\n";
}
free(log);
NVRTC_SAFE_CALL("nvrtcCompileProgram", res);
// fetch PTX
size_t ptxSize;
NVRTC_SAFE_CALL("nvrtcGetPTXSize", nvrtcGetPTXSize(prog, &ptxSize));
char *ptx = reinterpret_cast<char *>(malloc(sizeof(char) * ptxSize));
NVRTC_SAFE_CALL("nvrtcGetPTX", nvrtcGetPTX(prog, ptx));
NVRTC_SAFE_CALL("nvrtcDestroyProgram", nvrtcDestroyProgram(&prog));
*ptxResult = ptx;
*ptxResultSize = ptxSize;
#ifdef DUMP_PTX
std::ofstream my_f;
my_f.open("vectorAdd.ptx");
for (int i = 0; i < ptxSize; i++)
my_f << ptx[i];
my_f.close();
#endif
if (requiresCGheaders) free(compileParams[0]);
}
CUmodule my_loadPTX(char *ptx, int argc, char **argv) {
CUmodule module;
CUcontext context;
int major = 0, minor = 0;
char deviceName[256];
// Picks the best CUDA device available
CUdevice cuDevice = findCudaDeviceDRV(argc, (const char **)argv);
// get compute capabilities and the devicename
checkCudaErrors(cuDeviceGetAttribute(
&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice));
checkCudaErrors(cuDeviceGetAttribute(
&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice));
checkCudaErrors(cuDeviceGetName(deviceName, 256, cuDevice));
printf("> GPU Device has SM %d.%d compute capability\n", major, minor);
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGet(&cuDevice, 0));
checkCudaErrors(cuCtxCreate(&context, 0, cuDevice));
CUjit_option opt[1];
opt[0] = CU_JIT_GENERATE_DEBUG_INFO;
void **vals = new void *[1];
vals[0] = (void *)(size_t)1;
checkCudaErrors(cuModuleLoadDataEx(&module, ptx, 1, opt, vals));
free(ptx);
return module;
}
int main(int argc, char **argv) {
char *ptx, *kernel_file;
size_t ptxSize;
kernel_file = sdkFindFilePath("vectorAdd_kernel.cu", argv[0]);
my_compileFileToPTX(kernel_file, argc, argv, &ptx, &ptxSize, 0);
CUmodule module = my_loadPTX(ptx, argc, argv);
CUfunction kernel_addr;
checkCudaErrors(cuModuleGetFunction(&kernel_addr, module, "vectorAdd"));
// Print the vector length to be used, and compute its size
int numElements = 50000;
size_t size = numElements * sizeof(float);
printf("[Vector addition of %d elements]\n", numElements);
// Allocate the host input vector A
float *h_A = reinterpret_cast<float *>(malloc(size));
// Allocate the host input vector B
float *h_B = reinterpret_cast<float *>(malloc(size));
// Allocate the host output vector C
float *h_C = reinterpret_cast<float *>(malloc(size));
// Verify that allocations succeeded
if (h_A == NULL || h_B == NULL || h_C == NULL) {
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
// Initialize the host input vectors
for (int i = 0; i < numElements; ++i) {
h_A[i] = rand() / static_cast<float>(RAND_MAX);
h_B[i] = rand() / static_cast<float>(RAND_MAX);
}
// Allocate the device input vector A
CUdeviceptr d_A;
checkCudaErrors(cuMemAlloc(&d_A, size));
// Allocate the device input vector B
CUdeviceptr d_B;
checkCudaErrors(cuMemAlloc(&d_B, size));
// Allocate the device output vector C
CUdeviceptr d_C;
checkCudaErrors(cuMemAlloc(&d_C, size));
// Copy the host input vectors A and B in host memory to the device input
// vectors in device memory
printf("Copy input data from the host memory to the CUDA device\n");
checkCudaErrors(cuMemcpyHtoD(d_A, h_A, size));
checkCudaErrors(cuMemcpyHtoD(d_B, h_B, size));
// Launch the Vector Add CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
threadsPerBlock);
dim3 cudaBlockSize(threadsPerBlock, 1, 1);
dim3 cudaGridSize(blocksPerGrid, 1, 1);
void *arr[] = {reinterpret_cast<void *>(&d_A), reinterpret_cast<void *>(&d_B),
reinterpret_cast<void *>(&d_C),
reinterpret_cast<void *>(&numElements)};
checkCudaErrors(cuLaunchKernel(kernel_addr, cudaGridSize.x, cudaGridSize.y,
cudaGridSize.z, /* grid dim */
cudaBlockSize.x, cudaBlockSize.y,
cudaBlockSize.z, /* block dim */
0, 0, /* shared mem, stream */
&arr[0], /* arguments */
0));
checkCudaErrors(cuCtxSynchronize());
// Copy the device result vector in device memory to the host result vector
// in host memory.
printf("Copy output data from the CUDA device to the host memory\n");
checkCudaErrors(cuMemcpyDtoH(h_C, d_C, size));
// Verify that the result vector is correct
for (int i = 0; i < numElements; ++i) {
if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
fprintf(stderr, "Result verification failed at element %d!\n", i);
exit(EXIT_FAILURE);
}
}
printf("Test PASSED\n");
// Free device global memory
checkCudaErrors(cuMemFree(d_A));
checkCudaErrors(cuMemFree(d_B));
checkCudaErrors(cuMemFree(d_C));
// Free host memory
free(h_A);
free(h_B);
free(h_C);
printf("Done\n");
return 0;
}
$ nvcc -g -I/usr/local/cuda/samples/common/inc -o test vectorAdd.cpp -lnvrtc -lcuda
$ cuda-gdb ./test
NVIDIA (R) CUDA Debugger
10.0 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test...done.
(cuda-gdb) break vectorAdd
Function "vectorAdd" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (vectorAdd) pending.
(cuda-gdb) r
Starting program: /home/user2/misc/junk/vectorAdd_nvrtc/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffedc00700 (LWP 16789)]
> Using CUDA Device [1]: Tesla K40m
> GPU Device has SM 3.5 compute capability
[New Thread 0x7fffed3ff700 (LWP 16790)]
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "test" hit Breakpoint 1, vectorAdd<<<(196,1,1),(256,1,1)>>> (A=0x7fffce800000, B=0x7fffce830e00, C=0x7fffce861c00, numElements=50000) at ./vectorAdd_kernel.cu:21
21 int i = blockDim.x * blockIdx.x + threadIdx.x;
(cuda-gdb) step
23 if (i < numElements) {
(cuda-gdb) step
24 C[i] = A[i] + B[i];
(cuda-gdb) step
26 }
(cuda-gdb) quit
A debugging session is active.
Inferior 1 [process 16777] will be killed.
Quit anyway? (y or n) y
$

Using CUDA-gdb with NVRTC

I have an application which generates CUDA C++ source code, compiles it into PTX at runtime using NVRTC, and then creates CUDA modules from it using the CUDA driver API.
If I debug this application using cuda-gdb, it displays the kernel (where an error occured) in the backtrace, but does not show the line number.
I export the generated source code into a file, and give the directory to cuda-gdb using the --directory option. I also tried passing its file name to nvrtcCreateProgram() (name argument). I use the compile options --device-debug and --generate-line-info with NVRTC.
Is there a way to let cuda-gdb know the location of the generated source code file, and display the line number information in its backtrace?
For those who may not be familiar with nvrtc, it is a CUDA facility that allows runtime-compilation of CUDA C++ device code. As a result, device code generated at runtime (including modifications) can be used on a CUDA GPU. There is documentation for nvrtc and there are various CUDA sample codes for nvrtc, most or all of which have _nvrtc in the file name.
I was able to do kernel source-level debugging on a nvrtc-generated kernel with cuda-gdb as follows:
start with vectorAdd_nvrtc sample code
modify the compileFileToPTX routine (provided by nvrtc_helper.h) to add the --device-debug switch during the compile-cu-to-ptx step.
modify the loadPTX routine (provided by nvrtc_helper.h) to add the CU_JIT_GENERATE_DEBUG_INFO option (set to 1) for the cuModuleLoadDataEx load/JIT PTX-to-binary step.
compile the main function (vectorAdd.cpp) with -g option.
Here is a complete test case/session. I'm only showing the vectorAdd.cpp file from the project because that is the only file I modified. Other project file(s) are identical to what is in the sample project:
$ cat vectorAdd.cpp
/**
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
/**
* Vector addition: C = A + B.
*
* This sample is a very basic sample that implements element by element
* vector addition. It is the same as the sample illustrating Chapter 2
* of the programming guide with some additions like error checking.
*/
#include <stdio.h>
#include <cmath>
// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda.h>
#include <cuda_runtime.h>
// helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <nvrtc_helper.h>
#include <iostream>
#include <fstream>
/**
* Host main routine
*/
void my_compileFileToPTX(char *filename, int argc, char **argv, char **ptxResult,
size_t *ptxResultSize, int requiresCGheaders) {
std::ifstream inputFile(filename,
std::ios::in | std::ios::binary | std::ios::ate);
if (!inputFile.is_open()) {
std::cerr << "\nerror: unable to open " << filename << " for reading!\n";
exit(1);
}
std::streampos pos = inputFile.tellg();
size_t inputSize = (size_t)pos;
char *memBlock = new char[inputSize + 1];
inputFile.seekg(0, std::ios::beg);
inputFile.read(memBlock, inputSize);
inputFile.close();
memBlock[inputSize] = '\x0';
int numCompileOptions = 0;
char *compileParams[2];
std::string compileOptions;
if (requiresCGheaders) {
char HeaderNames[256];
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#else
snprintf(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#endif
compileOptions = "--include-path=";
std::string path = sdkFindFilePath(HeaderNames, argv[0]);
if (!path.empty()) {
std::size_t found = path.find(HeaderNames);
path.erase(found);
} else {
printf(
"\nCooperativeGroups headers not found, please install it in %s "
"sample directory..\n Exiting..\n",
argv[0]);
}
compileOptions += path.c_str();
compileParams[0] = reinterpret_cast<char *>(
malloc(sizeof(char) * (compileOptions.length() + 1)));
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(compileParams[0], sizeof(char) * (compileOptions.length() + 1),
"%s", compileOptions.c_str());
#else
snprintf(compileParams[0], compileOptions.size(), "%s",
compileOptions.c_str());
#endif
numCompileOptions++;
}
compileOptions = "--device-debug ";
compileParams[numCompileOptions] = reinterpret_cast<char *>(malloc(sizeof(char) * (compileOptions.length() + 1)));
snprintf(compileParams[numCompileOptions], compileOptions.size(), "%s", compileOptions.c_str());
numCompileOptions++;
// compile
nvrtcProgram prog;
NVRTC_SAFE_CALL("nvrtcCreateProgram",
nvrtcCreateProgram(&prog, memBlock, filename, 0, NULL, NULL));
nvrtcResult res = nvrtcCompileProgram(prog, numCompileOptions, compileParams);
// dump log
size_t logSize;
NVRTC_SAFE_CALL("nvrtcGetProgramLogSize",
nvrtcGetProgramLogSize(prog, &logSize));
char *log = reinterpret_cast<char *>(malloc(sizeof(char) * logSize + 1));
NVRTC_SAFE_CALL("nvrtcGetProgramLog", nvrtcGetProgramLog(prog, log));
log[logSize] = '\x0';
if (strlen(log) >= 2) {
std::cerr << "\n compilation log ---\n";
std::cerr << log;
std::cerr << "\n end log ---\n";
}
free(log);
NVRTC_SAFE_CALL("nvrtcCompileProgram", res);
// fetch PTX
size_t ptxSize;
NVRTC_SAFE_CALL("nvrtcGetPTXSize", nvrtcGetPTXSize(prog, &ptxSize));
char *ptx = reinterpret_cast<char *>(malloc(sizeof(char) * ptxSize));
NVRTC_SAFE_CALL("nvrtcGetPTX", nvrtcGetPTX(prog, ptx));
NVRTC_SAFE_CALL("nvrtcDestroyProgram", nvrtcDestroyProgram(&prog));
*ptxResult = ptx;
*ptxResultSize = ptxSize;
#ifdef DUMP_PTX
std::ofstream my_f;
my_f.open("vectorAdd.ptx");
for (int i = 0; i < ptxSize; i++)
my_f << ptx[i];
my_f.close();
#endif
if (requiresCGheaders) free(compileParams[0]);
}
CUmodule my_loadPTX(char *ptx, int argc, char **argv) {
CUmodule module;
CUcontext context;
int major = 0, minor = 0;
char deviceName[256];
// Picks the best CUDA device available
CUdevice cuDevice = findCudaDeviceDRV(argc, (const char **)argv);
// get compute capabilities and the devicename
checkCudaErrors(cuDeviceGetAttribute(
&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice));
checkCudaErrors(cuDeviceGetAttribute(
&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice));
checkCudaErrors(cuDeviceGetName(deviceName, 256, cuDevice));
printf("> GPU Device has SM %d.%d compute capability\n", major, minor);
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGet(&cuDevice, 0));
checkCudaErrors(cuCtxCreate(&context, 0, cuDevice));
CUjit_option opt[1];
opt[0] = CU_JIT_GENERATE_DEBUG_INFO;
void **vals = new void *[1];
vals[0] = (void *)(size_t)1;
checkCudaErrors(cuModuleLoadDataEx(&module, ptx, 1, opt, vals));
free(ptx);
return module;
}
int main(int argc, char **argv) {
char *ptx, *kernel_file;
size_t ptxSize;
kernel_file = sdkFindFilePath("vectorAdd_kernel.cu", argv[0]);
my_compileFileToPTX(kernel_file, argc, argv, &ptx, &ptxSize, 0);
CUmodule module = my_loadPTX(ptx, argc, argv);
CUfunction kernel_addr;
checkCudaErrors(cuModuleGetFunction(&kernel_addr, module, "vectorAdd"));
// Print the vector length to be used, and compute its size
int numElements = 50000;
size_t size = numElements * sizeof(float);
printf("[Vector addition of %d elements]\n", numElements);
// Allocate the host input vector A
float *h_A = reinterpret_cast<float *>(malloc(size));
// Allocate the host input vector B
float *h_B = reinterpret_cast<float *>(malloc(size));
// Allocate the host output vector C
float *h_C = reinterpret_cast<float *>(malloc(size));
// Verify that allocations succeeded
if (h_A == NULL || h_B == NULL || h_C == NULL) {
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
// Initialize the host input vectors
for (int i = 0; i < numElements; ++i) {
h_A[i] = rand() / static_cast<float>(RAND_MAX);
h_B[i] = rand() / static_cast<float>(RAND_MAX);
}
// Allocate the device input vector A
CUdeviceptr d_A;
checkCudaErrors(cuMemAlloc(&d_A, size));
// Allocate the device input vector B
CUdeviceptr d_B;
checkCudaErrors(cuMemAlloc(&d_B, size));
// Allocate the device output vector C
CUdeviceptr d_C;
checkCudaErrors(cuMemAlloc(&d_C, size));
// Copy the host input vectors A and B in host memory to the device input
// vectors in device memory
printf("Copy input data from the host memory to the CUDA device\n");
checkCudaErrors(cuMemcpyHtoD(d_A, h_A, size));
checkCudaErrors(cuMemcpyHtoD(d_B, h_B, size));
// Launch the Vector Add CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
threadsPerBlock);
dim3 cudaBlockSize(threadsPerBlock, 1, 1);
dim3 cudaGridSize(blocksPerGrid, 1, 1);
void *arr[] = {reinterpret_cast<void *>(&d_A), reinterpret_cast<void *>(&d_B),
reinterpret_cast<void *>(&d_C),
reinterpret_cast<void *>(&numElements)};
checkCudaErrors(cuLaunchKernel(kernel_addr, cudaGridSize.x, cudaGridSize.y,
cudaGridSize.z, /* grid dim */
cudaBlockSize.x, cudaBlockSize.y,
cudaBlockSize.z, /* block dim */
0, 0, /* shared mem, stream */
&arr[0], /* arguments */
0));
checkCudaErrors(cuCtxSynchronize());
// Copy the device result vector in device memory to the host result vector
// in host memory.
printf("Copy output data from the CUDA device to the host memory\n");
checkCudaErrors(cuMemcpyDtoH(h_C, d_C, size));
// Verify that the result vector is correct
for (int i = 0; i < numElements; ++i) {
if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
fprintf(stderr, "Result verification failed at element %d!\n", i);
exit(EXIT_FAILURE);
}
}
printf("Test PASSED\n");
// Free device global memory
checkCudaErrors(cuMemFree(d_A));
checkCudaErrors(cuMemFree(d_B));
checkCudaErrors(cuMemFree(d_C));
// Free host memory
free(h_A);
free(h_B);
free(h_C);
printf("Done\n");
return 0;
}
$ nvcc -g -I/usr/local/cuda/samples/common/inc -o test vectorAdd.cpp -lnvrtc -lcuda
$ cuda-gdb ./test
NVIDIA (R) CUDA Debugger
10.0 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test...done.
(cuda-gdb) break vectorAdd
Function "vectorAdd" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (vectorAdd) pending.
(cuda-gdb) r
Starting program: /home/user2/misc/junk/vectorAdd_nvrtc/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffedc00700 (LWP 16789)]
> Using CUDA Device [1]: Tesla K40m
> GPU Device has SM 3.5 compute capability
[New Thread 0x7fffed3ff700 (LWP 16790)]
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "test" hit Breakpoint 1, vectorAdd<<<(196,1,1),(256,1,1)>>> (A=0x7fffce800000, B=0x7fffce830e00, C=0x7fffce861c00, numElements=50000) at ./vectorAdd_kernel.cu:21
21 int i = blockDim.x * blockIdx.x + threadIdx.x;
(cuda-gdb) step
23 if (i < numElements) {
(cuda-gdb) step
24 C[i] = A[i] + B[i];
(cuda-gdb) step
26 }
(cuda-gdb) quit
A debugging session is active.
Inferior 1 [process 16777] will be killed.
Quit anyway? (y or n) y
$

Error with 'cuda-memcheck' in cuda 8.0

It is strange that when I do not add cuda-memcheck before ./main, the program runs without any warning or error message, however, when I add it, it will have error message like following.
========= Invalid __global__ write of size 8
========= at 0x00000120 in initCurand(curandStateXORWOW*, unsigned long)
========= by thread (9,0,0) in block (3,0,0)
========= Address 0x5005413b0 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x204115]
========= Host Frame:./main [0x18e11]
========= Host Frame:./main [0x369b3]
========= Host Frame:./main [0x3403]
========= Host Frame:./main [0x308c]
========= Host Frame:./main [0x30b7]
========= Host Frame:./main [0x2ebb]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
Here is my functions, a brief introduction on the code, I try to generate a random numbers and save them to a device variable weights, then use this vector to sample from discrete numbers.
#include<iostream>
#include<curand.h>
#include<curand_kernel.h>
#include<time.h>
using namespace std;
#define num 100
__device__ float weights[num];
// function to define seed
__global__ void initCurand(curandState *state, unsigned long seed){
int idx = threadIdx.x + blockIdx.x * blockDim.x;
curand_init(seed, idx, 0, &state[idx]);
}
__device__ void sampling(float *weight, float max_weight, int *index, curandState *state){
int j;
float u;
do{
j = (int)(curand_uniform(state) * (num + 0.999999));
u = curand_uniform(state); //sample from uniform distribution;
}while( u > weight[j]/max_weight);
*index = j;
}
__global__ void test(int *dev_sample, curandState *state){
int idx = threadIdx.x + blockIdx.x * blockDim.x;\
// generate random numbers from uniform distribution and save them to weights
weights[idx] = curand_uniform(&state[idx]);
// run sampling function, in which, weights is an input for the function on each thread
sampling(weights, 1, dev_sample+idx, &state[idx]);
}
int main(){
// define the seed of random generator
curandState *devState;
cudaMalloc((void**)&devState, num*sizeof(curandState));
int *h_sample;
h_sample = (int*) malloc(num*sizeof(int));
int *d_sample;
cudaMalloc((void**)&d_sample, num*sizeof(float));
initCurand<<<(int)num/32 + 1, 32>>>(devState, 1);
test<<<(int)num/32 + 1, 32>>>(d_sample, devState);
cudaMemcpy(h_sample, d_sample, num*sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < num; ++i)
{
cout << *(h_sample + i) << endl;
}
//free memory
cudaFree(devState);
free(h_sample);
cudaFree(d_sample);
return 0;
}
Just start to learn cuda, if the methods to access the global memory is incorrect, please help me with that. Thanks
This is launching "extra" threads:
initCurand<<<(int)num/32 + 1, 32>>>(devState, 1);
num is 100, so the above config will launch 4 blocks of 32 threads each, i.e. 128 threads. But you are only allocating space for 100 curandState here:
cudaMalloc((void**)&devState, num*sizeof(curandState));
So your initCurand kernel will have some threads (idx = 100-127) that are attempting to initialize some curandState that you haven't allocated. As a result when you run cuda-memcheck which does fairly rigorous out-of-bounds checking, an error is reported.
One possible solution would be to modify your initCurand kernel as follows:
__global__ void initCurand(curandState *state, unsigned long seed, int num){
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < num)
curand_init(seed, idx, 0, &state[idx]);
}
This will prevent any out-of-bounds threads from doing anything. Note that you will need to modify the kernel call to pass num to it. Also, it appears to me you have a similar problem in your test kernel. You may want to do something similar to fix it there. This is a common construct in CUDA kernels, I call it a "thread check". You can find other questions here on the SO tag discussing this same concept.

CUDA - Same algorithm works on CPU but not on GPU

I am currently working on my first project in CUDA and I ran into something odd, that must be inherent to CUDA and that I don't understand or have overlooked. The same algorithm - the exact same one really, it involves no parallel work - works on the CPU but not on the GPU.
Let me explain in more detail. I am doing thresholding using Otsu's method duplicates computation but reduces transfer time. Short story long, this function:
__device__ double computeThreshold(unsigned int* histogram, int* nbPixels){
double sum = 0;
for (int i = 0; i < 256; i++){
sum += i*histogram[i];
}
int sumB = 0, wB = 0, wF = 0;
double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0;
for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){
wB += histogram[j];
if (wB != 0) {
wF = *nbPixels - wB;
if (wF != 0){
sumB += j*histogram[i];
mB = sumB / wB;
mF = (sum - sumB) / wF;
between = wB * wF *(mB - mF) *(mB - mF);
if (max < 2.0){
threshold1 = j;
if (between > max){
threshold2 = j;
}
max = between;
}
}
}
}
return (threshold1 + threshold2) / 2.0;
}
This works as expected for an image size (ie number of pixels) not too big but fails otherwise; interestingly, even if I don't use histogram and nbPixels in the function and replace all their occurrences by a constant, it still fails - even if I remove the arguments from the function. (What I mean by fail is that the first operation after the call to the kernel returns an unspecified launch failure.)
EDIT 3: Ok, there was a small mistake due to copy/paste errors in what I provided before to test. Now this compiles and allows to reproduce the error:
__device__ double computeThreshold(unsigned int* histogram, long int* nbPixels){
double sum = 0;
for (int i = 0; i < 256; i++){
sum += i*histogram[i];
}
int sumB = 0, wB = 0, wF = 0;
double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0;
for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){
wB += histogram[j];
if (wB != 0) {
wF = *nbPixels - wB;
if (wF != 0){
sumB += j*histogram[j];
mB = sumB / wB;
mF = (sum - sumB) / wF;
between = wB * wF *(mB - mF) *(mB - mF);
if (max < 2.0){
threshold1 = j;
if (between > max){
threshold2 = j;
}
max = between;
}
}
}
}
return (threshold1 + threshold2) / 2.0;
}
__global__ void imageKernel(unsigned int* image, unsigned int* histogram, long int* nbPixels, double* t_threshold){
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
if (i >= *nbPixels) return;
double threshold = computeThreshold(histogram, nbPixels);
unsigned int pixel = image[i];
if (pixel >= threshold){
pixel = 255;
} else {
pixel = 0;
}
image[i] = pixel;
*t_threshold = threshold;
}
int main(){
unsigned int histogram[256] = { 0 };
const int width = 2048 * 4096;
const int height = 1;
unsigned int* myimage;
myimage = new unsigned int[width*height];
for (int i = 0; i < width*height; i++){
myimage[i] = i % 256;
histogram[i % 256]++;
}
const int threadPerBlock = 256;
const int nbBlock = ceil((double)(width*height) / threadPerBlock);
unsigned int* partial_histograms = new unsigned int[256 * nbBlock];
dim3 dimBlock(threadPerBlock, 1);
dim3 dimGrid(nbBlock, 1);
unsigned int* dev_image;
unsigned int* dev_histogram;
unsigned int* dev_partial_histograms;
double* dev_threshold;
double x = 0;
double* threshold = &x;
long int* nbPixels;
long int nb = width*height;
nbPixels = &(nb);
long int* dev_nbPixels;
cudaSetDevice(0);
cudaMalloc((void**)&dev_image, sizeof(unsigned int)*width*height);
cudaMalloc((void**)&dev_histogram, sizeof(unsigned int)* 256);
cudaMalloc((void**)&dev_partial_histograms, sizeof(unsigned int)* 256 * nbBlock);
cudaMalloc((void**)&dev_threshold, sizeof(double));
cudaMalloc((void**)&dev_nbPixels, sizeof(long int));
cudaMemcpy(dev_image, myimage, sizeof(unsigned int)*width*height, cudaMemcpyHostToDevice);
cudaMemcpy(dev_histogram, histogram, sizeof(unsigned int)* 256, cudaMemcpyHostToDevice);
cudaMemcpy(dev_nbPixels, nbPixels, sizeof(long int), cudaMemcpyHostToDevice);
imageKernel<<<dimGrid, dimBlock>>>(dev_image, dev_histogram, dev_nbPixels, dev_threshold);
cudaMemcpy(histogram, dev_histogram, sizeof(unsigned int)* 256, cudaMemcpyDeviceToHost);
cudaMemcpy(partial_histograms, dev_partial_histograms, sizeof(unsigned int)* 256 * nbBlock, cudaMemcpyDeviceToHost);
cudaMemcpy(threshold, dev_threshold, sizeof(double), cudaMemcpyDeviceToHost);
cudaDeviceReset();
return 0;
}
EDIT 4: the characteristics of my GPU
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 750M"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147483648 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1085 MHz (1.09 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536),
3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Mo
del)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simu
ltaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Versi
on = 7.5, NumDevs = 1, Device0 = GeForce GT 750M
Result = PASS
EDIT 5: I ran cuda-memcheck again and this time, it did output an error message. I don't know why it didn't the first time, I must have done something wrong again. I hope you will pardon me those hesitations and wastes of time. Here is the output message:
========= CUDA-MEMCHECK
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launc
h failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb780
2) [0xdb1e2]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc764]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
========= Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk +
0x22) [0x13d2]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x3
4) [0x15454]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launc
h failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb780
2) [0xdb1e2]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc788]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
========= Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk +
0x22) [0x13d2]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x3
4) [0x15454]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launc
h failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb780
2) [0xdb1e2]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc7a6]
========= Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
========= Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk +
0x22) [0x13d2]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x3
4) [0x15454]
=========
========= ERROR SUMMARY: 3 errors
Not very telling though, is it ?
Ok, turns out it wasn't an error of my side but Windows deciding that 2s was enough and that it needed to reset the GPU - stopping there my computation. Thanks a lot to #RobertCrovella, without whom I would never have found this out. And thanks to everyone who tried to answer as well.
So after providing a compileable example (was it really so hard?), I can't reproduce any errors with this code (64 bit linux, compute 3.0 device, CUDA 7.0 release version):
$ nvcc -arch=sm_30 -Xptxas="-v" histogram.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z11imageKernelPjS_PlPd' for 'sm_30'
ptxas info : Function properties for _Z11imageKernelPjS_PlPd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 352 bytes cmem[0], 16 bytes cmem[2]
$ for i in `seq 1 20`;
> do
> cuda-memcheck ./a.out
> done
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
So if you can reproduce a runtime error doing as I have done, your environment/hardware/toolkit version are subtly different in some way from mine. But in any case the code itself works, and you have a platform specific issue I can't reproduce.

Simple adding of two int's in Cuda, result always the same

I'm starting my journary to learn Cuda. I am playing with some hello world type cuda code but its not working, and I'm not sure why.
The code is very simple, take two ints and add them on the GPU and return the result, but no matter what I change the numbers to I get the same result(If math worked that way I would have done alot better in the subject than I actually did).
Here's the sample code:
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>
__global__ void add( int a, int b, int *c ) {
*c = a + b;
}
extern "C"
void runCudaPart();
// Main cuda function
void runCudaPart() {
int c;
int *dev_c;
cudaMalloc( (void**)&dev_c, sizeof(int) );
add<<<1,1>>>( 1, 4, dev_c );
cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
printf( "1 + 4 = %d\n", c );
cudaFree( dev_c );
}
The output seems a bit off: 1 + 4 = -1065287167
I'm working on setting up my environment and just wanted to know if there was a problem with the code otherwise its probably my environment.
Update: I tried to add some code to show the error but I don't get an output but the number changes(is it outputing error codes instead of answers? Even if I don't do any work in the kernal other than assign a variable I still get simlair results).
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>
__global__ void add( int a, int b, int *c ) {
//*c = a + b;
*c = 5;
}
extern "C"
void runCudaPart();
// Main cuda function
void runCudaPart() {
int c;
int *dev_c;
cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
add<<<1,1>>>( 1, 4, dev_c );
cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
printf( "1 + 4 = %d\n", c );
cudaFree( dev_c );
}
Code appears to be fine, maybe its related to my setup. Its been a nightmare to get Cuda installed on OSX lion but I thought it worked as the examples in the SDK seemed to be fine. The steps I took so far are go to the Nvida website and download the latest mac releases for the driver, toolkit and SDK. I then added export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH and 'PATH=/usr/local/cuda/bin:$PATH` I did a deviceQuery and it passed with the following info about my system:
[deviceQuery] starting...
/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors x ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce 320M
[deviceQuery] test results...
PASSED
UPDATE: what's really weird is even if I remove all the work in the kernel I stil get a result for c? I have reinstalled cuda and used make on the examples and all of them pass.
Basically there are two problems here:
You are not compiling the kernel for the correct architecture (gleaned from comments)
Your code contains imperfect error checking which is missing the point when the runtime error is occurring, leading to mysterious and unexplained symptoms.
In the runtime API, most context related actions are performed "lazily". When you launch a kernel for the first time, the runtime API will invoke code to intelligently find a suitable CUBIN image from inside the fat binary image emitted by the toolchain for the target hardware and load it into the context. This can also include JIT recompilation of PTX for a backwards compatible architecture, but not the other way around. So if you had a kernel compiled for a compute capability 1.2 device and you run it on a compute capability 2.0 device, the driver can JIT compile the PTX 1.x code it contains for the newer architecture. But the reverse doesn't work. So in your example, the runtime API will generate an error because it cannot find a usable binary image in the CUDA fatbinary image embedded in the executable. The error message is pretty cryptic, but you will get an error (see this question for a bit more information).
If your code contained error checking like this:
cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
add<<<1,1>>>( 1, 4, dev_c );
if (cudaPeekAtLastError() != cudaSuccess) {
printf("The error is %s", cudaGetErrorString(cudaGetLastError()));
}
cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
the extra error checking after the kernel launch should catch the runtime API error generated by the kernel load/launch failure.
#include <stdio.h>
#include <conio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
__global__ void Addition(int *a,int *b,int *c)
{
*c = *a + *b;
}
int main()
{
int a,b,c;
int *dev_a,*dev_b,*dev_c;
int size = sizeof(int);
cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);
a=5,b=6;
cudaMemcpy(dev_a, &a,sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, &b,sizeof(int), cudaMemcpyHostToDevice);
Addition<<< 1,1 >>>(dev_a,dev_b,dev_c);
cudaMemcpy(&c, dev_c,size, cudaMemcpyDeviceToHost);
cudaFree(&dev_a);
cudaFree(&dev_b);
cudaFree(&dev_c);
printf("%d\n", c);
getch();
return 0;
}