What is the difference between mapped memory and managed memory? - cuda

I have been reading about the various approaches to memory management offered by CUDA, and I'm struggling to understand the difference between mapped memory:
int *foo;
std::size_t size = 32;
cudaHostAlloc(&foo, size, cudaHostAllocMapped);
...and managed memory:
int *foo;
std::size_t size = 32;
cudaMallocManaged(&foo, size);
They both appear to implicitly transfer memory between the host and device. cudaMallocManaged seems to be the newer API, and it uses the so-called "Unified Memory" system. That said, cudaHostAlloc seems to share many of these properties on 64-bit systems thanks to the unified virtual address space.
The documentation hints at a few other differences, but I am not confident that inferring behavior from what is left undocumented will give me a correct understanding of how these two functions differ (e.g. I don't believe it is explicitly stated that cudaMallocManaged's host memory is page-locked, but I suspect that it is).
They also correspond to different functions in the driver API (cuMemHostAlloc and cuMemAllocManaged), which I think is a good indicator that their behavior differs in some meaningful way.
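A minimal sketch of how the two allocations are typically handed to a kernel (the touch kernel and the cudaSetDeviceFlags call are illustrative additions, not from the question, and error checking is omitted):
#include <cstdio>

__global__ void touch(int *p) { p[0] += 1; }

int main() {
    // Older runtimes require this before the context is created in order to
    // map pinned allocations; with UVA it is effectively on by default.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t size = 32;

    // Mapped (zero-copy) pinned memory: stays in host RAM, the GPU accesses it
    // over PCIe. cudaHostGetDevicePointer returns the device-side alias, which
    // under UVA is the same address as the host pointer.
    int *foo = nullptr;
    cudaHostAlloc((void**)&foo, size, cudaHostAllocMapped);
    foo[0] = 0;
    int *foo_dev = nullptr;
    cudaHostGetDevicePointer((void**)&foo_dev, foo, 0);
    touch<<<1, 1>>>(foo_dev);   // on a 64-bit UVA system, passing foo also works
    cudaDeviceSynchronize();

    // Managed memory: one pointer valid on host and device; pages migrate on
    // demand to whichever processor touches them.
    int *bar = nullptr;
    cudaMallocManaged((void**)&bar, size);
    bar[0] = 0;
    touch<<<1, 1>>>(bar);
    cudaDeviceSynchronize();

    printf("foo[0] = %d, bar[0] = %d\n", foo[0], bar[0]);

    cudaFreeHost(foo);
    cudaFree(bar);
    return 0;
}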

I think the main difference is the paging/page-fault mechanism.
Mapped pinned memory acts the same as ordinary device memory from the kernel's point of view: if one byte of pinned memory is requested, one byte is transparently transferred to the GPU via the PCIe bus. (Maybe the driver merges requests to contiguous memory locations; I do not know.)
On the other hand, managed memory has an access granularity of memory pages. If the page containing the requested byte is not present on the device, not only the single byte but the whole page (4096 bytes on many systems) is migrated to the GPU from its current location, which can be host memory or the device memory of another GPU.
The following program tries to show the different behaviours.
256 MB are allocated, which is equivalent to 64 * 1024 pages of size 4096 bytes.
Then, in a kernel, each thread accesses the first byte of its page, i.e. every 4096th byte. The time is measured for pinned memory, managed memory, and normal device memory.
#include <iostream>
#include <cassert>

__global__
void kernel(char* __restrict__ data, int pagesize, int numpages){
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < numpages){
        data[tid * pagesize] += 1;
    }
}

int main(){
    const int pagesize = 4096;
    const int numpages = 1024 * 64;
    const int bytes = pagesize * numpages;

    cudaError_t status = cudaSuccess;
    float elapsed = 0.0f;
    const int iterations = 5;

    char* devicedata;
    status = cudaMalloc(&devicedata, bytes);
    assert(status == cudaSuccess);

    char* pinneddata;
    status = cudaMallocHost(&pinneddata, bytes);
    assert(status == cudaSuccess);

    char* manageddata;
    status = cudaMallocManaged(&manageddata, bytes);
    assert(status == cudaSuccess);

    status = cudaMemPrefetchAsync(manageddata, bytes, cudaCpuDeviceId);
    //status = cudaMemPrefetchAsync(manageddata, bytes, 0);
    assert(status == cudaSuccess);

    cudaEvent_t event1, event2;
    cudaEventCreate(&event1);
    cudaEventCreate(&event2);

    for(int iteration = 0; iteration < iterations; iteration++){
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(pinneddata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);

        float bandwidth = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "pinned: " << elapsed << ", throughput " << bandwidth << " MB/s" << "\n";
    }

    for(int iteration = 0; iteration < iterations; iteration++){
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(manageddata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);

        float bandwidth = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "managed: " << elapsed << ", throughput " << bandwidth << " MB/s" << "\n";

        status = cudaMemPrefetchAsync(manageddata, bytes, cudaCpuDeviceId);
        assert(status == cudaSuccess);
    }

    for(int iteration = 0; iteration < iterations; iteration++){
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(devicedata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);

        float bandwidth = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "device: " << elapsed << ", throughput " << bandwidth << " MB/s" << "\n";
    }

    cudaFreeHost(pinneddata);
    cudaFree(manageddata);
    cudaFree(devicedata);
    cudaEventDestroy(event1);
    cudaEventDestroy(event2);
}
When the managed memory is prefetched to the host, the following times are observed:
pinned: 1.4577 ms, throughput 42.8759 MB/s
pinned: 1.4927 ms, throughput 41.8703 MB/s
pinned: 1.44947 ms, throughput 43.1192 MB/s
pinned: 1.44371 ms, throughput 43.2912 MB/s
pinned: 1.4496 ms, throughput 43.1153 MB/s
managed: 40.3646 ms, throughput 1.54839 MB/s
managed: 35.8052 ms, throughput 1.74555 MB/s
managed: 36.7788 ms, throughput 1.69935 MB/s
managed: 37.3166 ms, throughput 1.67486 MB/s
managed: 35.3378 ms, throughput 1.76864 MB/s
device: 0.052256 ms, throughput 1196.03 MB/s
device: 0.061312 ms, throughput 1019.38 MB/s
device: 0.060736 ms, throughput 1029.04 MB/s
device: 0.060096 ms, throughput 1040 MB/s
device: 0.060352 ms, throughput 1035.59 MB/s
nvprof confirms that in the case of managed memory, all 256 MB are transferred to the device.
==27443== Unified Memory profiling result:
Device "TITAN Xp (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
6734 38.928KB 4.0000KB 776.00KB 256.0000MB 29.95677ms Host To Device
When we remove the prefetching within the loop, the migrated pages remain on the GPU, which improves access time to the level of normal device memory.
pinned: 1.46848 ms, throughput 42.561 MB/s
pinned: 1.50842 ms, throughput 41.4342 MB/s
pinned: 1.44285 ms, throughput 43.3171 MB/s
pinned: 1.45802 ms, throughput 42.8665 MB/s
pinned: 1.4431 ms, throughput 43.3094 MB/s
managed: 41.9972 ms, throughput 1.4882 MB/s <--- need to migrate pages
managed: 0.047584 ms, throughput 1313.47 MB/s <--- pages already present on GPU
managed: 0.059552 ms, throughput 1049.5 MB/s
managed: 0.057248 ms, throughput 1091.74 MB/s
managed: 0.062336 ms, throughput 1002.63 MB/s
device: 0.06176 ms, throughput 1011.98 MB/s
device: 0.062592 ms, throughput 998.53 MB/s
device: 0.062176 ms, throughput 1005.21 MB/s
device: 0.06128 ms, throughput 1019.91 MB/s
device: 0.063008 ms, throughput 991.937 MB/s
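For comparison, the commented-out prefetch in the program above moves the managed pages to the GPU before the kernel launch, so even the first iteration should run at roughly device-memory speed. A sketch, assuming device 0 and the default stream, reusing manageddata, bytes, numpages, and pagesize from the program above:
    // Prefetch the managed allocation to GPU 0 so the kernel does not page-fault.
    status = cudaMemPrefetchAsync(manageddata, bytes, 0 /* device id */, 0 /* stream */);
    assert(status == cudaSuccess);
    kernel<<<numpages / 256, 256>>>(manageddata, pagesize, numpages);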

Related

Cuda failed to run on blocks but ok on grids

Here is part of my CUDA code from cs344. The task is to convert an RGB picture to grayscale. The code runs fine as shown, but when I use the lines in the comments instead, it fails.
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    //int x = threadIdx.x ;
    //int y = threadIdx.y ;
    int x = blockIdx.x ;
    int y = blockIdx.y ;
    if (x < numCols && y < numRows)
    {
        uchar4 rgba = rgbaImage[y*numCols+x] ;
        float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
        greyImage[y*numCols+x] = channelSum;
    }
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage,
                            uchar4 * const d_rgbaImage, unsigned char* const d_greyImage,
                            size_t numRows, size_t numCols)
{
    // const dim3 blockSize(numCols, numRows, 1); //TODO
    // const dim3 gridSize(1, 1, 1);              //TODO
    const dim3 blockSize(1, 1, 1);                //TODO
    const dim3 gridSize(numCols, numRows, 1);     //TODO
    std::cout << numCols << " " << numRows << std::endl; // numCols=557 numRows=313
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}
The code fails to run when I use the commented-out lines, with this error:
CUDA error at: /home/yc/cuda_prj/cs344_bak/Problem Sets/Problem Set 1/student_func.cu:90
invalid configuration argument cudaGetLastError()
Here is my deviceQuery report:
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8110 MBytes (8504279040 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1823 MHz (1.82 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Can anyone tell me why?
Robert posts the answer that solves the problem:
The total threads per block is the product of the dimensions. If your dimensions are numCols=557 and numRows=313, then their product is over 150,000. The limit on total threads per block on your device is 1024, and it is in the deviceQuery output here: Maximum number of threads per block: 1024
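One way to apply this, sketched below (not from the original answer; the 16x16 block size is an arbitrary choice under the 1024-thread limit), is to use a 2D grid of 2D blocks and combine block and thread indices in the kernel:
    // Launch configuration: 256 threads per block, enough blocks to cover the image.
    const dim3 blockSize(16, 16, 1);
    const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                        (numRows + blockSize.y - 1) / blockSize.y,
                        1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);

    // Inside the kernel, the pixel coordinates then become:
    //   int x = blockIdx.x * blockDim.x + threadIdx.x;
    //   int y = blockIdx.y * blockDim.y + threadIdx.y;
    // with the existing bounds check guarding partial blocks at the image edges.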

FFT is slower on Jetson TK1?

I have written a CUDA program for Synthetic Aperture Radar image processing. The significant portion of the computation involves finding FFTs and iFFTs, and I have used the cuFFT library for it. I ran my CUDA code on a Jetson TK1 and on a laptop with a GT635M (Fermi), and I find it is three times slower on the Jetson. This is because the FFTs take more time and show lower GFLOPS/s on the Jetson. The GFLOPS/s performance of the kernels I wrote is nearly the same on both the Jetson and the Fermi GT635M; it is the FFTs that are slow on the Jetson.
The other profiler parameters I observed are:
The Issued Control Flow Instructions, Texture Cache Transactions, Local Memory Store Throughput (bytes/sec), and Local Memory Store Transactions Per Request are high on the Jetson, while the Requested Global Load Throughput (bytes/sec) and Global Load Transactions are high on the Fermi GT635M.
Jetson
GPU Clock Rate: 852 Mhz
Mem Clock Rate: 924 Mhz
Fermi GT635M
GPU Clock Rate: 950 Mhz
Mem Clock Rate: 900 Mhz
Both of them have nearly the same clock frequencies. Then why do the FFTs take more time on the Jetson and show poor GFLOPS/s?
To see the performance of the FFTs, I wrote a simple CUDA program which does a 1D FFT on a matrix of size 2048 * 4912. The data here is contiguous, not strided. The time taken and GFLOPS/s for them are:
Jetson
3.251 GFLOPS/s Duration: 1.393 sec
Fermi GT635M
47.1 GFLOPS/s Duration: 0.211 sec
#include <stdio.h>
#include <cstdlib>
#include <cufft.h>
#include <stdlib.h>
#include <math.h>
#include "cuda_runtime_api.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"

int main()
{
    int numLines = 2048, nValid = 4912;
    int iter1, iter2, index = 0;
    cufftComplex *devData, *hostData;

    hostData = (cufftComplex*)malloc(sizeof(cufftComplex) * numLines * nValid);
    for(iter1 = 0; iter1 < 2048; iter1++)
    {
        for(iter2 = 0; iter2 < 4912; iter2++)
        {
            index = iter1*4912 + iter2;
            hostData[index].x = iter1+1;
            hostData[index].y = iter2+1;
        }
    }

    cudaMalloc((void**)&devData, sizeof(cufftComplex) * numLines * nValid);
    cudaMemcpy(devData, hostData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyHostToDevice);

    // ----------------------------
    cufftHandle plan;
    cufftPlan1d(&plan, 4912, CUFFT_C2C, 2048);
    cufftExecC2C(plan, (cufftComplex *)devData, (cufftComplex *)devData, CUFFT_FORWARD);
    cufftDestroy(plan);
    // ----------------------------

    cudaMemcpy(hostData, devData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyDeviceToHost);

    for(iter1 = 0; iter1 < 5; iter1++)
    {
        for(iter2 = 0; iter2 < 5; iter2++)
        {
            index = iter1*4912 + iter2;
            printf("%lf+i%lf \n", hostData[index].x, hostData[index].y);
        }
        printf("\n");
    }

    cudaDeviceReset();
    return 0;
}
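The program above does not show how the duration was measured; a sketch of timing the batched transform with CUDA events might look like the fragment below (plan and devData are the variables from the program above, and this is an assumption about the methodology, not the original code):
    // Time the batched forward FFT with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cufftExecC2C(plan, devData, devData, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Dividing a FLOP estimate for the whole batch by this time gives GFLOPS/s.
    printf("duration: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);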
This could probably be because you are using the LP (low power) CPU cluster.
Check out this document to enable all 4 main ARM cores (the HP cluster) to take advantage of Hyper-Q.
I faced a similar issue. After activating the main HP cluster I get good performance (from 3 GFLOPS (LP) to 160 GFLOPS (HP)).
My blind guess is that, though the TK1 has a more modern core, the memory bandwidth available to the 144 cores of your 635M is significantly higher than that of the Tegra.
Furthermore, CUDA is always a bit picky about warp/thread/grid sizes, so it's perfectly possible that the cuFFT algorithms were optimized for the local storage sizes of Fermi and don't work as well with Kepler.

CUDA atomicAdd() with long long int

Any time I try to use atomicAdd with anything other than (int*, int) I get this error:
error: no instance of overloaded function "atomicAdd" matches the argument list
But I need to use a larger data type than int. Is there any workaround here?
Device Query:
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 680"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4095 MBytes (4294246400 bytes)
( 8) Multiprocessors x (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Clock rate: 1084 MHz (1.08 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 680
My guess is wrong compile flags. If you want anything other than int, you should be compiling for sm_12 or higher.
As stated by Robert Crovella, unsigned long long int is supported, but long long int is not.
I used the code from: Beginner CUDA - Simple var increment not working
#include <iostream>
using namespace std;

__global__ void inc(unsigned long long int *foo) {
    atomicAdd(foo, 1);
}

int main() {
    unsigned long long int count = 0, *cuda_count;
    cudaMalloc((void**)&cuda_count, sizeof(unsigned long long int));
    cudaMemcpy(cuda_count, &count, sizeof(unsigned long long int), cudaMemcpyHostToDevice);
    cout << "count: " << count << '\n';
    inc<<<100, 25>>>(cuda_count);
    cudaMemcpy(&count, cuda_count, sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
    cudaFree(cuda_count);
    cout << "count: " << count << '\n';
    return 0;
}
Compiled from Linux: nvcc -gencode arch=compute_12,code=sm_12 -o add add.cu
Result:
count: 0
count: 2500
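As an aside (a workaround not mentioned in the original answer), a signed long long accumulator can reuse the unsigned long long overload, since two's-complement addition produces the same bit pattern. A minimal sketch:
    // Sketch: atomicAdd for signed long long via the unsigned long long overload.
    __device__ long long atomicAddSigned(long long *address, long long val)
    {
        return (long long)atomicAdd((unsigned long long *)address,
                                    (unsigned long long)val);
    }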

CUDA 5.0 - cudaGetDeviceProperties strange grid size or a bug in my code?

I already posted my question on NVIDIA dev forums, but there are no definitive answers yet.
I'm just starting to learn CUDA and was really surprised that, contrary to what I found on the Internet, my card (GeForce GTX 660M) supports some insane grid sizes (2147483647 x 65535 x 65535). Please take a look at the following results I'm getting from deviceQuery.exe provided with the toolkit:
c:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 660M"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors x (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 660M
I was curious enough to write a simple program to test whether it's possible to use more than 65535 blocks in the first dimension of the grid, but it doesn't work, confirming what I found on the Internet (or, to be more precise, it works fine for 65535 blocks and doesn't for 65536).
My program is extremely simple and basically just adds two vectors. This is the source code:
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>

#pragma comment(lib, "cudart")

typedef struct
{
    float *content;
    const unsigned int size;
} pjVector_t;

__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    resultVector[index] = firstVector[index] + secondVector[index];
}

int main(void)
{
    //const unsigned int vectorLength = 67107840; // 1024 * 65535 - works fine
    const unsigned int vectorLength = 67108864;   // 1024 * 65536 - doesn't work
    const unsigned int vectorSize = sizeof(float) * vectorLength;
    int threads = 0;
    unsigned int blocks = 0;
    cudaDeviceProp deviceProperties;
    cudaError_t error;

    pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
    pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
    pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };

    float *d_firstVector;
    float *d_secondVector;
    float *d_resultVector;

    cudaMalloc((void **)&d_firstVector, vectorSize);
    cudaMalloc((void **)&d_secondVector, vectorSize);
    cudaMalloc((void **)&d_resultVector, vectorSize);

    cudaGetDeviceProperties(&deviceProperties, 0);
    threads = deviceProperties.maxThreadsPerBlock;
    blocks = (unsigned int)ceil(vectorLength / (double)threads);

    for (unsigned int i = 0; i < vectorLength; i++)
    {
        firstVector.content[i] = 1.0f;
        secondVector.content[i] = 2.0f;
    }

    cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);

    AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
    error = cudaPeekAtLastError();

    cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);

    for (unsigned int i = 0; i < vectorLength; i++)
    {
        if (resultVector.content[i] != 3.0f)
        {
            free(firstVector.content);
            free(secondVector.content);
            free(resultVector.content);
            cudaFree(d_firstVector);
            cudaFree(d_secondVector);
            cudaFree(d_resultVector);
            cudaDeviceReset();
            printf("Error under index: %i\n", i);
            return 0;
        }
    }

    free(firstVector.content);
    free(secondVector.content);
    free(resultVector.content);
    cudaFree(d_firstVector);
    cudaFree(d_secondVector);
    cudaFree(d_resultVector);
    cudaDeviceReset();

    printf("Everything ok!\n");
    return 0;
}
When I run it from Visual Studio in debug mode (with the bigger vector), the last cudaMemcpy always fills my resultVector with seemingly random data (very close to 0, if it matters), so the result doesn't pass the final validation. When I try to profile it with Visual Profiler, it returns the following error message:
2 events, 0 metrics and 0 source-level metrics were not associated with the kernels and will not be displayed
As a result, the profiler measures only the cudaMalloc and cudaMemcpy operations and doesn't even show the kernel execution.
I'm not sure if I'm checking CUDA errors correctly, so please let me know if it can be done better. cudaPeekAtLastError() placed just after my kernel launch returns the cudaErrorInvalidValue(11) error when the bigger vector is used, and cudaSuccess(0) for every other call (cudaMalloc and cudaMemcpy). When I run my program with the smaller vector, all CUDA functions and my kernel launch return no errors (cudaSuccess(0)) and it works just fine.
So my question is: is cudaGetDeviceProperties returning rubbish grid size values or am I doing something wrong?
If you want to run a kernel using the larger grid size support offered by the Kepler architecture, you must compile your code for that architecture. So change your build flags to specify sm_30 as the target architecture. Otherwise the compiler will build for compute 1.0 targets.
The underlying reason for the launch failure is that the driver will attempt to recompile the compute 1.0 code for your Kepler card, but in doing so it enforces the execution grid limits dictated by the source architecture, i.e. two-dimensional grids with at most 65535 x 65535 blocks per grid.
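Using the same -gencode syntax shown earlier on this page, the build command might look like this (the file names are placeholders): nvcc -gencode arch=compute_30,code=sm_30 -o add_vectors add_vectors.cu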

CUDA - Memory Limit - Vector Summation

I'm trying to learn CUDA and the following code works OK for values of N <= 16384, but fails for greater values (the summation check at the end of the code fails; c values are always 0 for index values i >= 16384).
#include <iostream>
#include "cuda_runtime.h"
#include "../cuda_be/book.h"

#define N (16384)

__global__ void add(int *a, int *b, int *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if(tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

int main()
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    //allocate mem on gpu
    HANDLE_ERROR(cudaMalloc((void**)&dev_a, N*sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_b, N*sizeof(int)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_c, N*sizeof(int)));

    for(int i = 0; i < N; i++)
    {
        a[i] = -i;
        b[i] = i*i;
    }

    HANDLE_ERROR(cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice));
    system("PAUSE");

    add<<<128,128>>>(dev_a, dev_b, dev_c);

    //copy the array 'c' back from the gpu to the cpu
    HANDLE_ERROR(cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost));
    system("PAUSE");

    bool success = true;
    for(int i = 0; i < N; i++)
    {
        if((a[i] + b[i]) != c[i])
        {
            printf("Error in %d: %d + %d != %d\n", i, a[i], b[i], c[i]);
            system("PAUSE");
            success = false;
        }
    }
    if(success) printf("We did it!\n");

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
I think it's a shared-memory-related problem, but I can't come up with a good explanation (possibly a lack of knowledge). Could you provide me an explanation and a workaround to run for values of N greater than 16384? Here are the specs for my GPU:
General Info for device 0
Name: GeForce 9600M GT
Compute capability: 1.1
Clock rate: 1250000
Device copy overlap : Enabled
Kernel Execution timeout : Enabled
Mem info for device 0
Total global mem: 536870912
Total const mem: 65536
Max mem pitch: 2147483647
Texture Alignment 256
MP info about device 0
Multiproccessor count: 4
Shared mem per mp: 16384
Registers per mp: 8192
Threads in warp: 32
Max threads per block: 512
Max thread dimensions: (512,512,64)
Max grid dimensions: (65535,65535,1)
You probably intended to write
while(tid<N)
not
if(tid<N)
You aren't running out of shared memory; your vector arrays are copied into your device's global memory, which, as you can see, has far more space available than the 196608 bytes (16384*4*3) you need.
The reason for your problem is that you are only performing one addition operation per thread, so with this structure the maximum length your vectors can have is the product of the block and thread parameters in your kernel launch, as tera has pointed out. By correcting
if(tid<N)
to
while(tid<N)
in your code, each thread will perform its addition on multiple indices and the whole array will be covered.
For more information about the memory hierarchy and the various different places memory can sit, you should read sections 2.3 and 5.3 of the CUDA_C_Programming_Guide.pdf provided with the CUDA toolkit.
Hope that helps.
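Putting the correction together, the kernel becomes a grid-stride loop; a sketch of just the corrected kernel (the rest of the program is unchanged):
    __global__ void add(int *a, int *b, int *c)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while(tid < N)                       // loop instead of a single check
        {
            c[tid] = a[tid] + b[tid];
            tid += blockDim.x * gridDim.x;   // stride by the total thread count
        }
    }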
If N is:
#define N (33 * 1024) // value defined in CUDA by Example
The same code appears in CUDA by Example, but with a different value of N. I think that N can't be 33 * 1024 with these launch parameters; the number of blocks and the number of threads per block must be changed, because
add<<<128,128>>>(dev_a,dev_b,dev_c); // 16384 threads
(128 * 128) < (33 * 1024), so it fails.