CUDA: Unable to use grid with maxGridSize

While optimizing my code to get the most out of my CUDA card, I ran into the following problem.
Even though every source of information says that on compute capability 2.x the grid can be up to (65535, 65535, 65535) in size, I am unable to launch a grid bigger than (65535, 8192, 1).
The example code below shows that even with a block size of (1,1,1) and an empty kernel, the launch fails with the error "code=4(cudaErrorLaunchFailure)" when the grid is bigger than the size mentioned above.
OS: Win10Pro
HW: GTS 450
SDK: CUDA 8.0, VS2013 Community Edition (used via the nvcc -ccbin option)
The test code:
#include <helper_cuda.h>

__global__ void KernelTest()
{}

int main()
{
    int cudaDevice = 0;
    int driverVersion = 0, runtimeVersion = 0;
    int deviceCount = 0;
    cudaError_t error_id = cudaGetDeviceCount(&deviceCount);
    if (error_id != cudaSuccess)
    {
        printf("cudaGetDeviceCount returned %d\n-> %s\n", (int)error_id, cudaGetErrorString(error_id));
        printf("Result = FAIL\n");
        exit(EXIT_FAILURE);
    }
    // This function call returns 0 if there are no CUDA capable devices.
    if (deviceCount == 0)
    {
        printf("There are no available device(s) that support CUDA\n");
    }
    else
    {
        printf("Detected %d CUDA Capable device(s)\n", deviceCount);
    }
    cudaSetDevice(cudaDevice);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, cudaDevice);
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    printf(" CUDA Driver Version / Runtime Version %d.%d / %d.%d\n", driverVersion/1000, (driverVersion%100)/10, runtimeVersion/1000, (runtimeVersion%100)/10);
    printf(" CUDA Capability Major/Minor version number: %d.%d\n", deviceProp.major, deviceProp.minor);
    char msg[256];
    …
    //Code from deviceQuery
    …
    const char *sComputeMode[] =
    {
        "Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)",
        "Exclusive (only one host thread in one process is able to use ::cudaSetDevice() with this device)",
        "Prohibited (no host thread can use ::cudaSetDevice() with this device)",
        "Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device)",
        "Unknown",
        NULL
    };
    printf(" Compute Mode:\n");
    printf(" < %s >\n", sComputeMode[deviceProp.computeMode]);

    //dim3 gridtest(deviceProp.maxGridSize[0]-1, deviceProp.maxGridSize[1]-1, deviceProp.maxGridSize[2]-1);
    dim3 gridtest(deviceProp.maxGridSize[0], 1, 1);
    dim3 blocktest(1);
    KernelTest<<<gridtest, blocktest>>>();
    cudaDeviceSynchronize();
    checkCudaErrors(cudaPeekAtLastError());

    dim3 gridtest2(deviceProp.maxGridSize[0]/2, 2, 1);
    KernelTest<<<gridtest2, blocktest>>>();
    cudaDeviceSynchronize();
    checkCudaErrors(cudaPeekAtLastError());

    dim3 gridtest3(deviceProp.maxGridSize[0]/4, 4, 1);
    KernelTest<<<gridtest3, blocktest>>>();
    cudaDeviceSynchronize();
    checkCudaErrors(cudaPeekAtLastError());

    dim3 gridtest4(deviceProp.maxGridSize[0], 2, 1);
    KernelTest<<<gridtest4, blocktest>>>();
    cudaDeviceSynchronize();
    checkCudaErrors(cudaPeekAtLastError());

    dim3 gridtest5(deviceProp.maxGridSize[0], 4, 1);
    KernelTest<<<gridtest5, blocktest>>>();
    cudaDeviceSynchronize();
    checkCudaErrors(cudaPeekAtLastError());

    dim3 gridtest6(deviceProp.maxGridSize[0], (deviceProp.maxGridSize[1]+1)/16, 1); // 4096
    KernelTest<<<gridtest6, blocktest>>>();
    cudaDeviceSynchronize();
    checkCudaErrors(cudaPeekAtLastError());

    dim3 gridtest7(deviceProp.maxGridSize[0], (deviceProp.maxGridSize[1]+1)/8, 1); // 8192
    KernelTest<<<gridtest7, blocktest>>>();
    cudaDeviceSynchronize();
    checkCudaErrors(cudaPeekAtLastError());

    dim3 gridtest8(deviceProp.maxGridSize[0], (deviceProp.maxGridSize[1]+1)/4, 1); // 16384 - causes error
    KernelTest<<<gridtest8, blocktest>>>();
    cudaDeviceSynchronize();
    checkCudaErrors(cudaPeekAtLastError());

    // dim3 gridtest9(deviceProp.maxGridSize[0], deviceProp.maxGridSize[1], 1);
    // KernelTest<<<gridtest9,blocktest>>>();
    // cudaDeviceSynchronize();
    // checkCudaErrors(cudaPeekAtLastError());

    cudaDeviceReset();
}
Output of deviceQuery part:
CUDA Driver Version / Runtime Version 9.1 / 8.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 4) Multiprocessors, ( 48) CUDA Cores/MP: 192 CUDA Cores
GPU Max Clock rate: 1566 MHz (1.57 GHz)
Memory Clock rate: 1804 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

The key piece of information in your question is this:
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Because you are using a WDDM device, the display driver imposes a limit on how much wall clock time a kernel can consume. If you exceed it, the driver kills your kernel.
That is what is happening here (cudaErrorLaunchFailure confirms it). Scheduling huge numbers of blocks isn't free, and even a null kernel can take many seconds to complete if you schedule enough of them. In your case this is exacerbated by the small, old GPU you are using, which can only run 32 blocks simultaneously, so the driver has to make a very large number of block-scheduling trips to finish a launch when you request between several hundred million and a billion blocks in a single kernel.
For reference, here is the profiler output on a non-display GPU hosted on a Linux system, which has a much larger number of total resident blocks than your GPU (416 vs 32):
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Device Context Stream Name
235.86ms 139.29us (65535 1 1) (1 1 1) 8 0B 0B GeForce GTX 970 1 7 KernelTest(void) [106]
236.03ms 138.49us (32767 2 1) (1 1 1) 8 0B 0B GeForce GTX 970 1 7 KernelTest(void) [109]
236.19ms 138.46us (16383 4 1) (1 1 1) 8 0B 0B GeForce GTX 970 1 7 KernelTest(void) [112]
236.35ms 275.58us (65535 2 1) (1 1 1) 8 0B 0B GeForce GTX 970 1 7 KernelTest(void) [115]
236.65ms 550.09us (65535 4 1) (1 1 1) 8 0B 0B GeForce GTX 970 1 7 KernelTest(void) [118]
237.22ms 504.49ms (65535 4096 1) (1 1 1) 8 0B 0B GeForce GTX 970 1 7 KernelTest(void) [121]
741.79ms 924.72ms (65535 8192 1) (1 1 1) 8 0B 0B GeForce GTX 970 1 7 KernelTest(void) [124]
1.66659s 1.84941s (65535 16384 1) (1 1 1) 8 0B 0B GeForce GTX 970 1 7 KernelTest(void) [127]
You can see that the 65535 x 16384 case takes about 1.8 seconds to run. On your GPU that will take much longer. Hopefully you will also conclude from this that launching very large numbers of blocks is not an optimization in itself, because block scheduling is not zero cost.
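As a rough sketch of one way to cope with this (my illustration, not code from your question), you can query cudaDeviceProp::kernelExecTimeoutEnabled and, when a watchdog is active, split a huge logical grid across several smaller launches with a synchronize in between. The 4096-row chunk size and the empty kernel are arbitrary choices; a real kernel would also need the row offset passed in as an argument:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void KernelTest() {}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Non-zero when the GPU drives a display (WDDM / X) and a run time watchdog applies.
    printf("Run time limit on kernels: %s\n", prop.kernelExecTimeoutEnabled ? "Yes" : "No");

    const unsigned int totalRows = 16384;   // the grid.y we actually want
    const unsigned int rowsPerLaunch = prop.kernelExecTimeoutEnabled ? 4096 : totalRows;

    for (unsigned int row = 0; row < totalRows; row += rowsPerLaunch)
    {
        // Each launch stays well under the watchdog; a real kernel would take
        // 'row' as an offset so the chunks cover different parts of the work.
        dim3 grid(65535, rowsPerLaunch, 1);
        KernelTest<<<grid, dim3(1, 1, 1)>>>();
        cudaError_t err = cudaDeviceSynchronize();   // watchdog failures surface here
        if (err != cudaSuccess)
        {
            printf("chunk starting at row %u failed: %s\n", row, cudaGetErrorString(err));
            return 1;
        }
    }
    printf("all chunks completed\n");
    return 0;
}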

Related

CUDA 8 Unified Memory on Pascal Titan X / GP102

This article says that CUDA 8 improved Unified Memory support on Pascal GPUs so that "on supporting platforms, memory allocated with the default OS allocator (e.g. ‘malloc’ or ‘new’) can be accessed from both GPU code and CPU code using the same pointer".
I was excited about this and wrote a small test program to see whether my system supports it:
#include <stdio.h>

#define CUDA_CHECK( call ) {\
    cudaError_t code = ( call );\
    if ( code != cudaSuccess ) {\
        const char* msg = cudaGetErrorString( code );\
        printf( "%s #%d: %s\n", __FILE__, __LINE__, msg );\
    }\
}

#define N 10

__global__
void test_unified_memory( int* input, int* output )
{
    output[ threadIdx.x ] = input[ threadIdx.x ] * 2;
}

int main()
{
    int* input = (int*) malloc( N * sizeof( int ) );
    int* output = (int*) malloc( N * sizeof( int ) );
    for ( int i = 0; i < N; ++i ) input[ i ] = i;
    test_unified_memory <<< 1, N >>>( input, output );
    CUDA_CHECK( cudaDeviceSynchronize() );
    for ( int i = 0; i < N; ++i ) printf( "%d, ", output[ i ] );
    free( input );
    free( output );
}
But it didn't work.
I am wondering what "supporting platforms" means. Here are my system configurations:
$uname -r
3.10.0-327.el7.x86_64
$nvidia-smi
Tue Jan 10 14:46:11 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 0000:01:00.0 Off | N/A |
| 36% 61C P0 88W / 250W | 2MiB / 12189MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$deviceQuery
NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "TITAN X (Pascal)"
CUDA Driver Version / Runtime Version 8.0 / 7.5
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 12189 MBytes (12781551616 bytes)
MapSMtoCores for SM 6.1 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 6.1 is undefined. Default to use 128 Cores/SM
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1531 MHz (1.53 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = TITAN X (Pascal)
Result = PASS
The answer may simply be that the Titan X / GP102 does not support this feature, but I could not find any information or documentation on this. Could anyone please let me know whether or not it is supported on my configuration, and point me to a reference for this information? Thank you.
As talonmies suggested in the comments, it may be related to the host OS. What, then, are the requirements on the host, and how do I check or fix them?
It appears that this new unified memory feature requires an experimental Linux kernel patch which is not yet integrated into any mainline kernel trees. It should be regarded as a future feature rather than something which can be used now.
EDIT: as noted in the comments, you are also using CUDA 7.5; irrespective of the host kernel issue, you would need CUDA 8 for this feature.
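If what you need right now is a single pointer usable from both host and device on this Titan X, the managed allocator already provides that without the experimental kernel patch. Below is a minimal sketch (mine, not part of the original question or answer) that should work with CUDA 7.5 or 8.0 on any compute capability 3.0+ device:
#include <stdio.h>
#include <cuda_runtime.h>

#define N 10

__global__ void test_managed_memory( int* input, int* output )
{
    output[ threadIdx.x ] = input[ threadIdx.x ] * 2;
}

int main()
{
    int *input = NULL, *output = NULL;
    // cudaMallocManaged returns a pointer that is valid on both host and device.
    cudaMallocManaged( &input,  N * sizeof( int ) );
    cudaMallocManaged( &output, N * sizeof( int ) );

    for ( int i = 0; i < N; ++i ) input[ i ] = i;

    test_managed_memory <<< 1, N >>>( input, output );
    cudaDeviceSynchronize();   // required before the host reads the results

    for ( int i = 0; i < N; ++i ) printf( "%d, ", output[ i ] );
    printf( "\n" );

    cudaFree( input );
    cudaFree( output );
    return 0;
}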

Cuda failed to run on blocks but ok on grids

Here is part of my CUDA code from cs344.
It is a task to convert a picture from RGB to grayscale.
The code runs OK as shown below, but when I use the lines in the comments instead, it fails.
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    //int x = threadIdx.x ;
    //int y = threadIdx.y ;
    int x = blockIdx.x ;
    int y = blockIdx.y ;
    if (x < numCols && y < numRows)
    {
        uchar4 rgba = rgbaImage[y * numCols + x] ;
        float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
        greyImage[y * numCols + x] = channelSum;
    }
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage,
                            uchar4 * const d_rgbaImage, unsigned char* const d_greyImage,
                            size_t numRows, size_t numCols)
{
    // const dim3 blockSize(numCols, numRows, 1); //TODO
    // const dim3 gridSize(1, 1, 1);              //TODO
    const dim3 blockSize(1, 1, 1);                //TODO
    const dim3 gridSize(numCols, numRows, 1);     //TODO
    std::cout << numCols << " " << numRows << std::endl ; // numCols=557 numRows=313
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}
The code fails to run when I use the commented ones, with errors:
CUDA error at: /home/yc/cuda_prj/cs344_bak/Problem Sets/Problem Set 1/student_func.cu:90
invalid configuration argument cudaGetLastError()
Here is my deviceQuery report:
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8110 MBytes (8504279040 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1823 MHz (1.82 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Can anyone tell me why?
Robert posted the answer that solves the problem:
The total number of threads per block is the product of the block dimensions. If your dimensions are numCols=557 and numRows=313, then their product is over 170,000. The limit on the total threads per block on your device is 1024, and it is right there in your deviceQuery output: Maximum number of threads per block: 1024
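The usual fix (a sketch of the standard approach, not code taken from the course materials) is to use modest 2D thread blocks, let the grid cover the image, and build the pixel coordinates from both blockIdx and threadIdx:
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    // Global pixel coordinates from both block and thread indices.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < numCols && y < numRows)
    {
        uchar4 rgba = rgbaImage[y * numCols + x];
        greyImage[y * numCols + x] = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
    }
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage,
                            uchar4 * const d_rgbaImage, unsigned char* const d_greyImage,
                            size_t numRows, size_t numCols)
{
    // 16 x 16 = 256 threads per block, comfortably below the 1024-thread limit.
    const dim3 blockSize(16, 16, 1);
    // Round the grid up so every pixel is covered; the bounds check in the
    // kernel discards the surplus threads on the right and bottom edges.
    const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                        (numRows + blockSize.y - 1) / blockSize.y, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}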

Unexplained CUDA Crash

I have a dedicated compute GPU in my computer (not used for display). Its properties are:
Device 0: "Tesla C2050"
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818244608 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1500 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
I am trying to run the following simple program on it (copy an array to the device):
#include <cuda.h>
#include <curand_kernel.h>

#define N 252000

int main( void ) {
    int a[N];
    int *dev_a;
    cudaSetDevice(0);
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    for (long i = 0; i < N; i++) {
        a[i] = 1;
    }
    cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice ); // **Crashes here**
    cudaFree( dev_a );
    cudaDeviceReset();
    return 0;
}
If N = 251000 the program works. But if N = 252000 the program crashes at cudaMemcpy(). Any idea why this might be happening?
Congratulations, you've just found the limit of your stack size:
int a[N];
Allocate your host array dynamically instead:
int *a = (int *)malloc(N*sizeof(int));
This allocates from the heap, which is far less constrained. If you search on SO you will find many questions like this one that explain stack vs. heap allocations and their limitations.
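Putting that together, a minimal corrected version of the test program might look like this (a sketch; the error print on the cudaMemcpy is optional but makes its status visible):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 252000

int main( void ) {
    // Heap allocation: not limited by the host thread's stack (often only 1-8 MB).
    int *a = (int *)malloc( N * sizeof(int) );
    int *dev_a = NULL;

    cudaSetDevice(0);
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );

    for (long i = 0; i < N; i++) {
        a[i] = 1;
    }

    cudaError_t err = cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    printf( "cudaMemcpy: %s\n", cudaGetErrorString(err) );

    cudaFree( dev_a );
    free( a );
    cudaDeviceReset();
    return 0;
}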

CUDA atomicAdd() with long long int

Any time I try to use atomicAdd with anything other than (int*, int) I get this error:
error: no instance of overloaded function "atomicAdd" matches the argument list
But I need to use a larger data type than int. Is there any workaround here?
Device Query:
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 680"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4095 MBytes (4294246400 bytes)
( 8) Multiprocessors x (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Clock rate: 1084 MHz (1.08 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 680
My guess is wrong compile flags: if you want atomicAdd() on anything wider than int, you need to compile for sm_12 or higher.
As stated by Robert Crovella, the unsigned long long int variant is supported, but long long int is not.
I used the code from: Beginner CUDA - Simple var increment not working
#include <iostream>
using namespace std;

__global__ void inc(unsigned long long int *foo) {
    atomicAdd(foo, 1);
}

int main() {
    unsigned long long int count = 0, *cuda_count;
    cudaMalloc((void**)&cuda_count, sizeof(unsigned long long int));
    cudaMemcpy(cuda_count, &count, sizeof(unsigned long long int), cudaMemcpyHostToDevice);
    cout << "count: " << count << '\n';
    inc <<< 100, 25 >>> (cuda_count);
    cudaMemcpy(&count, cuda_count, sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
    cudaFree(cuda_count);
    cout << "count: " << count << '\n';
    return 0;
}
Compiled on Linux with: nvcc -gencode arch=compute_12,code=sm_12 -o add add.cu
Result:
count: 0
count: 2500
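If you really do need a signed long long accumulator, a common workaround (my sketch, not from the answers above) is to reinterpret the pointer as unsigned long long int*: two's-complement addition produces the same bit pattern for signed and unsigned operands, so the unsigned overload does the right thing. Compile with the same sm_12-or-higher flags:
#include <iostream>
using namespace std;

__device__ long long int atomicAddSigned(long long int *address, long long int val) {
    // Reuse the unsigned long long overload; the bit pattern of the sum is identical.
    return (long long int)atomicAdd((unsigned long long int *)address,
                                    (unsigned long long int)val);
}

__global__ void dec(long long int *foo) {
    atomicAddSigned(foo, -1);   // negative increments work too
}

int main() {
    long long int count = 0, *cuda_count;
    cudaMalloc((void**)&cuda_count, sizeof(long long int));
    cudaMemcpy(cuda_count, &count, sizeof(long long int), cudaMemcpyHostToDevice);
    dec <<< 100, 25 >>> (cuda_count);
    cudaMemcpy(&count, cuda_count, sizeof(long long int), cudaMemcpyDeviceToHost);
    cudaFree(cuda_count);
    cout << "count: " << count << '\n';   // expect -2500
    return 0;
}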

CUDA 5.0 - cudaGetDeviceProperties strange grid size or a bug in my code?

I already posted my question on NVIDIA dev forums, but there are no definitive answers yet.
I'm just starting to learn CUDA and was really surprised that, contrary to what I found on the Internet, my card (GeForce GTX 660M) supports some insane grid sizes (2147483647 x 65535 x 65535). Please take a look at the following results I'm getting from deviceQuery.exe provided with the toolkit:
c:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 660M"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors x (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 660M
I was curious enough to write a simple program to test whether it is possible to use more than 65535 blocks in the first dimension of the grid, and it doesn't work, confirming what I found on the Internet (or, to be more precise, it works fine for 65535 blocks and fails for 65536).
My program is extremely simple and basically just adds two vectors. This is the source code:
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>

#pragma comment(lib, "cudart")

typedef struct
{
    float *content;
    const unsigned int size;
} pjVector_t;

__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    resultVector[index] = firstVector[index] + secondVector[index];
}

int main(void)
{
    //const unsigned int vectorLength = 67107840; // 1024 * 65535 - works fine
    const unsigned int vectorLength = 67108864;   // 1024 * 65536 - doesn't work
    const unsigned int vectorSize = sizeof(float) * vectorLength;
    int threads = 0;
    unsigned int blocks = 0;
    cudaDeviceProp deviceProperties;
    cudaError_t error;

    pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
    pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
    pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };

    float *d_firstVector;
    float *d_secondVector;
    float *d_resultVector;

    cudaMalloc((void **)&d_firstVector, vectorSize);
    cudaMalloc((void **)&d_secondVector, vectorSize);
    cudaMalloc((void **)&d_resultVector, vectorSize);

    cudaGetDeviceProperties(&deviceProperties, 0);
    threads = deviceProperties.maxThreadsPerBlock;
    blocks = (unsigned int)ceil(vectorLength / (double)threads);

    for (unsigned int i = 0; i < vectorLength; i++)
    {
        firstVector.content[i] = 1.0f;
        secondVector.content[i] = 2.0f;
    }

    cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);

    AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
    error = cudaPeekAtLastError();

    cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);

    for (unsigned int i = 0; i < vectorLength; i++)
    {
        if (resultVector.content[i] != 3.0f)
        {
            free(firstVector.content);
            free(secondVector.content);
            free(resultVector.content);
            cudaFree(d_firstVector);
            cudaFree(d_secondVector);
            cudaFree(d_resultVector);
            cudaDeviceReset();
            printf("Error under index: %i\n", i);
            return 0;
        }
    }

    free(firstVector.content);
    free(secondVector.content);
    free(resultVector.content);
    cudaFree(d_firstVector);
    cudaFree(d_secondVector);
    cudaFree(d_resultVector);
    cudaDeviceReset();

    printf("Everything ok!\n");
    return 0;
}
When I run it from Visual Studio in debug mode (with the bigger vector), the last cudaMemcpy always fills my resultVector with seemingly random data (very close to 0, if that matters), so the result doesn't pass the final validation. When I try to profile it with Visual Profiler, it returns the following error message:
2 events, 0 metrics and 0 source-level metrics were not associated with the kernels and will not be displayed
As a result, the profiler measures only the cudaMalloc and cudaMemcpy operations and doesn't even show the kernel execution.
I'm not sure if I'm checking CUDA errors correctly, so please let me know if it can be done better. cudaPeekAtLastError() placed just after my kernel launch returns the cudaErrorInvalidValue(11) error when the bigger vector is used, and cudaSuccess(0) for every other call (cudaMalloc and cudaMemcpy). When I run my program with the smaller vector, all CUDA functions and my kernel launch return no errors (cudaSuccess(0)) and it works just fine.
So my question is: is cudaGetDeviceProperties returning rubbish grid size values or am I doing something wrong?
If you want to run a kernel using the larger grid sizes offered by the Kepler architecture, you must compile your code for that architecture. So change your build flags to specify sm_30 as the target architecture; otherwise the compiler builds for a compute 1.0 target.
The underlying reason for the launch failure is that the driver will attempt to recompile the compute 1.0 code for your Kepler card, but in doing so it enforces the execution grid limits dictated by the source architecture, i.e. two-dimensional grids with at most 65535 x 65535 blocks per grid.
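For example, building from the command line for a compute capability 3.0 device looks roughly like this (the file name is just a placeholder):
nvcc -gencode arch=compute_30,code=sm_30 -o add_vectors add_vectors.cu
In Visual Studio the corresponding setting should be the Code Generation entry under the project's CUDA C/C++ device properties, set to compute_30,sm_30.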