Simple adding of two int's in Cuda, result always the same - cuda

I'm starting my journary to learn Cuda. I am playing with some hello world type cuda code but its not working, and I'm not sure why.
The code is very simple, take two ints and add them on the GPU and return the result, but no matter what I change the numbers to I get the same result(If math worked that way I would have done alot better in the subject than I actually did).
Here's the sample code:
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>
__global__ void add( int a, int b, int *c ) {
*c = a + b;
}
extern "C"
void runCudaPart();
// Main cuda function
void runCudaPart() {
int c;
int *dev_c;
cudaMalloc( (void**)&dev_c, sizeof(int) );
add<<<1,1>>>( 1, 4, dev_c );
cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
printf( "1 + 4 = %d\n", c );
cudaFree( dev_c );
}
The output seems a bit off: 1 + 4 = -1065287167
I'm working on setting up my environment and just wanted to know if there was a problem with the code otherwise its probably my environment.
Update: I tried to add some code to show the error but I don't get an output but the number changes(is it outputing error codes instead of answers? Even if I don't do any work in the kernal other than assign a variable I still get simlair results).
// CUDA-C includes
#include <cuda.h>
#include <stdio.h>
__global__ void add( int a, int b, int *c ) {
//*c = a + b;
*c = 5;
}
extern "C"
void runCudaPart();
// Main cuda function
void runCudaPart() {
int c;
int *dev_c;
cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
add<<<1,1>>>( 1, 4, dev_c );
cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
printf( "1 + 4 = %d\n", c );
cudaFree( dev_c );
}
Code appears to be fine, maybe its related to my setup. Its been a nightmare to get Cuda installed on OSX lion but I thought it worked as the examples in the SDK seemed to be fine. The steps I took so far are go to the Nvida website and download the latest mac releases for the driver, toolkit and SDK. I then added export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH and 'PATH=/usr/local/cuda/bin:$PATH` I did a deviceQuery and it passed with the following info about my system:
[deviceQuery] starting...
/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors x ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce 320M
[deviceQuery] test results...
PASSED
UPDATE: what's really weird is even if I remove all the work in the kernel I stil get a result for c? I have reinstalled cuda and used make on the examples and all of them pass.

Basically there are two problems here:
You are not compiling the kernel for the correct architecture (gleaned from comments)
Your code contains imperfect error checking which is missing the point when the runtime error is occurring, leading to mysterious and unexplained symptoms.
In the runtime API, most context related actions are performed "lazily". When you launch a kernel for the first time, the runtime API will invoke code to intelligently find a suitable CUBIN image from inside the fat binary image emitted by the toolchain for the target hardware and load it into the context. This can also include JIT recompilation of PTX for a backwards compatible architecture, but not the other way around. So if you had a kernel compiled for a compute capability 1.2 device and you run it on a compute capability 2.0 device, the driver can JIT compile the PTX 1.x code it contains for the newer architecture. But the reverse doesn't work. So in your example, the runtime API will generate an error because it cannot find a usable binary image in the CUDA fatbinary image embedded in the executable. The error message is pretty cryptic, but you will get an error (see this question for a bit more information).
If your code contained error checking like this:
cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
add<<<1,1>>>( 1, 4, dev_c );
if (cudaPeekAtLastError() != cudaSuccess) {
printf("The error is %s", cudaGetErrorString(cudaGetLastError()));
}
cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
printf("The error is %s", cudaGetErrorString(err));
}
the extra error checking after the kernel launch should catch the runtime API error generated by the kernel load/launch failure.

#include <stdio.h>
#include <conio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
__global__ void Addition(int *a,int *b,int *c)
{
*c = *a + *b;
}
int main()
{
int a,b,c;
int *dev_a,*dev_b,*dev_c;
int size = sizeof(int);
cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);
a=5,b=6;
cudaMemcpy(dev_a, &a,sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, &b,sizeof(int), cudaMemcpyHostToDevice);
Addition<<< 1,1 >>>(dev_a,dev_b,dev_c);
cudaMemcpy(&c, dev_c,size, cudaMemcpyDeviceToHost);
cudaFree(&dev_a);
cudaFree(&dev_b);
cudaFree(&dev_c);
printf("%d\n", c);
getch();
return 0;
}

Related

Nvidia CUDA Error: no kernel image is available for execution on the device

I have an NVidia GeForce GTX 770 and would like to use its CUDA capabilities for a project I am working on. My machine is running windows 10 64bit.
I have followed the provided CUDA Toolkit installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/.
Once the drivers were installed I opened the samples solution (using Visual Studio 2019) and built the deviceQuery and bandwidthTest samples. Here is the output:
deviceQuery:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.3\bin\win64\Debug\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce GTX 770"
CUDA Driver Version / Runtime Version 11.3 / 11.3
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147483648 bytes)
(008) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1137 MHz (1.14 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.3, NumDevs = 1
Result = PASS
Bandwidth:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA GeForce GTX 770
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 3.1
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 3.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 161.7
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
However, when I try to run any other sample, for example the starter code that is provided with the CUDA 11.3 runtime template:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);
__global__ void addKernel(int* c, const int* a, const int* b) {
int i = threadIdx.x;
c[i] = a[i] + b[i];
}
int main() {
const int arraySize = 5;
const int a[arraySize] = { 1, 2, 3, 4, 5 };
const int b[arraySize] = { 10, 20, 30, 40, 50 };
int c[arraySize] = { 0 };
// Add vectors in parallel.
cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "addWithCuda failed!");
return 1;
}
printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n", c[0], c[1], c[2], c[3], c[4]);
// cudaDeviceReset must be called before exiting in order for profiling and
// tracing tools such as Nsight and Visual Profiler to show complete traces.
cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceReset failed!");
return 1;
}
return 0;
}
// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int* c, const int* a, const int* b, unsigned int size) {
int* dev_a = 0;
int* dev_b = 0;
int* dev_c = 0;
cudaError_t cudaStatus;
// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
goto Error;
}
// Allocate GPU buffers for three vectors (two input, one output) .
cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
// Launch a kernel on the GPU with one thread for each element.
addKernel << <1, size >> > (dev_c, dev_a, dev_b);
// Check for any errors launching the kernel
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
goto Error;
}
// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
goto Error;
}
// Copy output vector from GPU buffer to host memory.
cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
Error:
cudaFree(dev_c);
cudaFree(dev_a);
cudaFree(dev_b);
return cudaStatus;
}
I get the following error:
addKernel launch failed: no kernel image is available for execution on the device
addWithCuda failed!
From this table: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#support-hardware__table-hardware-support you can see that my GPU's compute capability version (3.0) is in fact compatible with the installed driver (465.19.01+), so why can't I run any code other than the query and bandwidth tests?
Your GTX770 GPU is a "Kepler" architecture compute capability 3.0 device. These devices were deprecated during the CUDA 10 release cycle and support for them dropped from CUDA 11.0 onwards
The CUDA 10.2 release is the last toolkit with support for compute 3.0 devices. You will not be able to make CUDA 11.0 or newer work with your GPU. The query and bandwidth tests use APIs which don't attempt to run code on your GPU, that is why they work where any other example will not work.
I had a similar problem. I have a Geforce 940 MX card on my laptop which has Cuda capability as 5.0 with CUDA driver 11.7.
The way I solved it was include the compute_50,sm_50 in the field at Properties > CUDA C/C++ > Device > Code Generation. Hope this helps.
I have a GeForce 940 MX too, however, in my case I am using KUbuntu 22.04, and I have solved the problem adding the support for the platform in the compilation command:
nvcc TestCUDA.cu -o testcu.bin --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50,sm_52
After that, the code is working fine. However, it is essential using the code to handle errors to determine what is happening during the compilation. In my case I didn't include the cudaPeekAtLastError() and not error was showing.
Below are the supported sm variations and sample cards from that generation (source: Medium - Matching SM architectures (CUDA arch and CUDA gencode) for various NVIDIA cards
Supported on CUDA 7 and later
Fermi (CUDA 3.2 until CUDA 8) (deprecated from CUDA 9):
SM20 or SM_20, compute_30 — Older cards such as GeForce 400, 500, 600, GT-630
Kepler (CUDA 5 and later):
SM30 or SM_30, compute_30 — Kepler architecture (generic — Tesla K40/K80, GeForce 700, GT-730)
Adds support for unified memory programming
SM35 or SM_35, compute_35 — More specific Tesla K40
Adds support for dynamic parallelism. Shows no real benefit over SM30 in my experience.
SM37 or SM_37, compute_37 — More specific Tesla K80
Adds a few more registers. Shows no real benefit over SM30 in my experience
Maxwell (CUDA 6 and later):
SM50 or SM_50, compute_50 — Tesla/Quadro M series
SM52 or SM_52, compute_52 — Quadro M6000 , GeForce 900, GTX-970, GTX-980, GTX Titan X
SM53 or SM_53, compute_53 — Tegra (Jetson) TX1 / Tegra X1
Pascal (CUDA 8 and later)
SM60 or SM_60, compute_60 — Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
SM61 or SM_61, compute_61 — GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030, Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
SM62 or SM_62, compute_62 — Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2
Volta (CUDA 9 and later)
SM70 or SM_70, compute_70 — DGX-1 with Volta, Tesla V100, GTX 1180 (GV104), Titan V, Quadro GV100
SM72 or SM_72, compute_72 — Jetson AGX Xavier
Turing (CUDA 10 and later)
SM75 or SM_75, compute_75 — GTX Turing — GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000

Where is the boundary of start and end of CPU launch and GPU launch of Nvidia Profiling NVPROF?

What is the definition of start and end of kernel launch in the CPU and GPU (yellow block)? Where is the boundary between them?
Please notice that the start, end, and duration of those yellow blocks in CPU and GPU are different.Why CPU invocation of vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); takes that long time?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
// Get our global thread ID
int id = blockIdx.x*blockDim.x+threadIdx.x;
//printf("id = %d \n", id);
// Make sure we do not go out of bounds
if (id < n)
c[id] = a[id] + b[id];
}
int main( int argc, char* argv[] )
{
// Size of vectors
int n = 1000000;
// Host input vectors
double *h_a;
double *h_b;
//Host output vector
double *h_c;
// Device input vectors
double *d_a;
double *d_b;
//Device output vector
double *d_c;
// Size, in bytes, of each vector
size_t bytes = n*sizeof(double);
// Allocate memory for each vector on host
h_a = (double*)malloc(bytes);
h_b = (double*)malloc(bytes);
h_c = (double*)malloc(bytes);
// Allocate memory for each vector on GPU
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
int i;
// Initialize vectors on host
for( i = 0; i < n; i++ ) {
h_a[i] = sin(i)*sin(i);
h_b[i] = cos(i)*cos(i);
}
// Copy host vectors to device
cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
int blockSize, gridSize;
// Number of threads in each thread block
blockSize = 1024;
// Number of thread blocks in grid
gridSize = (int)ceil((float)n/blockSize);
// Execute the kernel
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
// Copy array back to host
cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
// Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0;
for(i=0; i<n; i++)
sum += h_c[i];
printf("final result: %f\n", sum/n);
// Release device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
// Release host memory
free(h_a);
free(h_b);
free(h_c);
return 0;
}
CPU yellow block:
GPU yellow block:
Note that you mention NVPROF but the pictures you are showing are from nvvp - the visual profiler. nvprof is the command-line profiler
GPU Kernel launches are asynchronous. That means that the CPU thread launches the kernel but does not wait for the kernel to complete. In fact, the CPU activity is actually placing the kernel in a launch queue - the actual execution of the kernel may be delayed if anything else is happening on the GPU.
So there is no defined relationship between the CPU (API) activity, and the GPU activity with respect to time, except that the CPU kernel launch must obviously precede (at least slightly) the GPU kernel execution.
The CPU (API) yellow block represents the duration of time that the CPU thread spends in a library call into the CUDA Runtime library, to launch the kernel (i.e. place it in the launch queue). This library call activity usually has some time overhead associated with it, in the range of 5-50 microseconds. The start of this period is marked by the start of the call into the library. The end of this period is marked by the time at which the library returns control to your code (i.e. your next line of code after the kernel launch).
The GPU yellow block represents the actual time period during which the kernel was executing on the GPU. The start and end of this yellow block are marked by the start and end of kernel activity on the GPU. The duration here is a function of what the code in your kernel is doing, and how long it takes.
I don't think the exact reason why a GPU kernel launch takes ~5-50 microseconds of CPU time is documented or explained anywhere in an authoritative fashion, and it is a closed source library, so you will need to acknowledge that overhead as something you have little control over. If you design kernels that run for a long time and do a lot of work, this overhead can become insignificant.

Issue with using Multi GPU NVIDIA

I'm learning how to use multi GPU for my CUDA application. I tried out a simple program which successfully ran on a system having two Tesla C2070. But when I tried to run the same program on a different system having a Tesla K40c and a Tesla C2070, it shows a segmentation fault. What might be the problem? I'm sure that there is no problem with the code. Is there any settings to be done in the environment? I have attached my code here for your reference.
#include <stdio.h>
#include "device_launch_parameters.h"
#include "cuda_runtime_api.h"
__global__ void testA(int *a)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
a[i] = a[i] * 2;
}
int main()
{
int *ai, *bi, *ao, *bo;
int iter;
cudaStream_t streamA, streamB;
cudaSetDevice(0);
cudaStreamCreate(&streamA);
cudaMalloc((void**)&ao, 10 * sizeof(int));
cudaHostAlloc((void**)&ai, 10 * sizeof(int), cudaHostAllocMapped);
for(iter=0; iter<10; iter++)
{
ai[iter] = iter+1;
}
cudaSetDevice(1);
cudaStreamCreate(&streamB);
cudaMalloc((void**)&bo, 10 * sizeof(int));
cudaHostAlloc((void**)&bi, 10 * sizeof(int), cudaHostAllocMapped);
for(iter=0; iter<10; iter++)
{
bi[iter] = iter+11;
}
cudaSetDevice(0);
cudaMemcpyAsync(ao, ai, 10 * sizeof(int), cudaMemcpyHostToDevice, streamA);
testA<<<1, 10, 0, streamA>>>(ao);
cudaMemcpyAsync(ai, ao, 10 * sizeof(int), cudaMemcpyDeviceToHost, streamA);
cudaSetDevice(1);
cudaMemcpyAsync(bo, bi, 10 * sizeof(int), cudaMemcpyHostToDevice, streamB);
testA<<<1, 10, 0, streamB>>>(bo);
cudaMemcpyAsync(bi, bo, 10 * sizeof(int), cudaMemcpyDeviceToHost, streamB);
cudaSetDevice(0);
cudaStreamSynchronize(streamA);
cudaSetDevice(1);
cudaStreamSynchronize(streamB);
printf("%d %d %d %d %d\n",ai[0],ai[1],ai[2],ai[3],ai[4]);
printf("%d %d %d %d %d\n",bi[0],bi[1],bi[2],bi[3],bi[4]);
return 0;
}
The segmentation fault occurs when bi array is initialized inside the for loop, which means the memory is not allocated for bi.
With the new information you've provided based on the error checking, the problem you were having was due to the ECC error.
When a GPU has a double-bit ECC error detected in the current session, it is no longer usable for compute activities until either:
the GPU is reset (e.g. via system reboot, or via driver unload/reload, or manually via nvidia-smi, etc.),
(or)
ECC is disabled (which usually also may require a system reboot or gpu reset)
You can review ECC status of your GPUs with the nvidia-smi command. You probably already know which GPU was reporting the ECC error, since you disabled ECC, but in case not, based on your initial report it would be the one that was associated with the cudaSetDevice(1); command, which probably should have been the Tesla C2070 (i.e. not the K40).

CUDA 5.0 - cudaGetDeviceProperties strange grid size or a bug in my code?

I already posted my question on NVIDIA dev forums, but there are no definitive answers yet.
I'm just starting to learn CUDA and was really surprised that, contrary to what I found on the Internet, my card (GeForce GTX 660M) supports some insane grid sizes (2147483647 x 65535 x 65535). Please take a look at the following results I'm getting from deviceQuery.exe provided with the toolkit:
c:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 660M"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors x (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 660M
I was curious enough to write a simple program testing if it's possible to use more than 65535 blocks in the first dimension of the grid, but it doesn't work confirming what I found on the Internet (or, to be more precise, does work fine for 65535 blocks and doesn't for 65536).
My program is extremely simple and basically just adds two vectors. This is the source code:
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>
#pragma comment(lib, "cudart")
typedef struct
{
float *content;
const unsigned int size;
} pjVector_t;
__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
resultVector[index] = firstVector[index] + secondVector[index];
}
int main(void)
{
//const unsigned int vectorLength = 67107840; // 1024 * 65535 - works fine
const unsigned int vectorLength = 67108864; // 1024 * 65536 - doesn't work
const unsigned int vectorSize = sizeof(float) * vectorLength;
int threads = 0;
unsigned int blocks = 0;
cudaDeviceProp deviceProperties;
cudaError_t error;
pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
float *d_firstVector;
float *d_secondVector;
float *d_resultVector;
cudaMalloc((void **)&d_firstVector, vectorSize);
cudaMalloc((void **)&d_secondVector, vectorSize);
cudaMalloc((void **)&d_resultVector, vectorSize);
cudaGetDeviceProperties(&deviceProperties, 0);
threads = deviceProperties.maxThreadsPerBlock;
blocks = (unsigned int)ceil(vectorLength / (double)threads);
for (unsigned int i = 0; i < vectorLength; i++)
{
firstVector.content[i] = 1.0f;
secondVector.content[i] = 2.0f;
}
cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);
AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
error = cudaPeekAtLastError();
cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);
for (unsigned int i = 0; i < vectorLength; i++)
{
if(resultVector.content[i] != 3.0f)
{
free(firstVector.content);
free(secondVector.content);
free(resultVector.content);
cudaFree(d_firstVector);
cudaFree(d_secondVector);
cudaFree(d_resultVector);
cudaDeviceReset();
printf("Error under index: %i\n", i);
return 0;
}
}
free(firstVector.content);
free(secondVector.content);
free(resultVector.content);
cudaFree(d_firstVector);
cudaFree(d_secondVector);
cudaFree(d_resultVector);
cudaDeviceReset();
printf("Everything ok!\n");
return 0;
}
When I run it from Visual Studio in debug mode (the bigger vector), the last cudaMemcpy always fills my resultVector with seemingly random data (very close to 0 if it matters) so that the result doesn't pass the final validation. When I try to profile it with Visual Profiler, it returns following error message:
2 events, 0 metrics and 0 source-level metrics were not associated with the kernels and will not be displayed
As a result, profiler measures only cudaMalloc and cudaMemcpy operations and doesn't even show the kernel execution.
I'm not sure if I'm checking cuda erros right, so please let me know if it can be done better. cudaPeekAtLastError() placed just after my kernel launch returns cudaErrorInvalidValue(11) error when the bigger vector is used and cudaSuccess(0) for every other call (cudaMalloc and cudaMemcpy). When I run my program with the smaller vector, all cuda functions and my kernel launch return no errors (cudaSuccess(0)) and it works just fine.
So my question is: is cudaGetDeviceProperties returning rubbish grid size values or am I doing something wrong?
If you want to run a kernel using the larger grid size support offered by the Kepler architecture, you must compile you code for that architecture. So change you build flags to sepcific sm_30 as the target architecture. Otherwise the compiler will build for compute 1.0 targets.
The underlying reason for the launch failure is that the driver will attempt to recompile the compute 1.0 code for your Kepler card, but in doing so it enforces the execution grid limits dictated by the source architecture, ie. two dimensional grids with 65535 x 65535 maximum blocks per grid.

cudaMemcpy is too slow on Tesla C2075

I'm currently working on a server with 2 cuda capable GPU's: Quadro 400 and Tesla C2075. I made a simple vector addition test program. My problem is that while Tesla C2075 GPU is supposed to be more powerful than Quadro 400, it takes it more time to do the job. I found that cudaMemcpy takes up most of the execution time and it works slower on a more powerful gpu.
Here's the source:
void get_matrix(float* arr1,float* arr2,int N1,int N2)
{
int Nx,Ny;
int n_blocks,n_threads;
int dev=0; // 1
float time;
size_t size;
clock_t start,end;
cudaSetDevice(dev);
cudaDeviceProp deviceProp;
start = clock();
cudaGetDeviceProperties(&deviceProp, dev);
Nx=N1;
Ny=N2;
n_threads=256;
n_blocks=(Nx*Ny+n_threads-1)/n_threads;
size=Nx*Ny*sizeof(float);
cudaMalloc((void**)&d_A,size);
cudaMalloc((void**)&d_B,size);
cudaMemcpy(d_A, arr1, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, arr2, size, cudaMemcpyHostToDevice);
vector_add<<<n_blocks,n_threads>>>(d_A,d_B,size);
cudaMemcpy(arr1, d_A, size, cudaMemcpyDeviceToHost);
printf("Running device %s \n",deviceProp.name);
end = clock();
time=float(end-start)/float(CLOCKS_PER_SEC);
printf("time = %e\n",time);
}
int main()
{
int const nx = 20000,ny = nx;
static float a[nx*ny],b[nx*ny];
for(int i=0;i<nx;i++)
{
for(int j=0;j<ny;j++)
{
a[j+ny*i]=j+10*i;
b[j+ny*i]=-(j+10*i);
}
}
get_matrix(a,b,nx,ny);
return 0;
}
The output is:
Running device Quadro 400
time = 1.100000e-01
Running device Tesla C2075
time = 1.050000e+00
And my questions are:
Should I modify the code depending on what GPU I am going to use?
Is there any connection between the number of blocks, threads per block specified in the code and the number of multiprocessors, cores per multiprocessor available on a GPU?
I'm running Linux Open Suse 11.2. The source code is compiled using the nvcc compiler (version 4.2).
Thanks for your help!
Try to invoke get_matrix(a,b,nx,ny) twice and take the second timing result. First time calling to CUDA API will create the cuda context. It often takes a long time.
Please refer to this section in CUDA C Best Practice Guide for how to determine the block size and grid size.