FFT is slower on Jetson TK1?

I have written a CUDA program for Synthetic Aperture Radar image processing. A significant portion of the computation involves FFTs and inverse FFTs, for which I use the cuFFT library. I ran my CUDA code on a Jetson TK1 and on a laptop with a GT635M (Fermi), and it is three times slower on the Jetson. This is because the FFTs take more time and show lower GFLOPS/s on the Jetson. The GFLOPS/s of the kernels I wrote myself are nearly the same on both the Jetson and the Fermi GT635M; it is the FFTs that are slow on the Jetson.
The other profiler metrics I observed are:
The Issued Control Flow Instructions, Texture Cache Transactions, Local Memory Store Throughput (bytes/sec), and Local Memory Store Transactions Per Request are higher on the Jetson, while the Requested Global Load Throughput (bytes/sec) and Global Load Transactions are higher on the Fermi GT635M.
Jetson
GPU Clock Rate: 852 MHz
Mem Clock Rate: 924 MHz
Fermi GT635M
GPU Clock Rate: 950 MHz
Mem Clock Rate: 900 MHz
Both of them have nearly the same clock frequencies. Then why do the FFTs take more time on the Jetson and show poor GFLOPS/s?
To see the performance of the FFTs, I wrote a simple CUDA program that does a 1D FFT on a matrix of size 2048 x 4912. The data here is contiguous, not strided. The time taken and GFLOPS/s are:
Jetson
3.251 GFLOPS/s Duration: 1.393 sec
Fermi GT635M
47.1 GFLOPS/s Duration: 0.211 sec
#include <stdio.h>
#include <cstdlib>
#include <cufft.h>
#include <stdlib.h>
#include <math.h>
#include "cuda_runtime_api.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"

int main()
{
    int numLines = 2048, nValid = 4912;
    int iter1, iter2, index = 0;
    cufftComplex *devData, *hostData;

    hostData = (cufftComplex*)malloc(sizeof(cufftComplex) * numLines * nValid);
    for (iter1 = 0; iter1 < 2048; iter1++)
    {
        for (iter2 = 0; iter2 < 4912; iter2++)
        {
            index = iter1*4912 + iter2;
            hostData[index].x = iter1 + 1;
            hostData[index].y = iter2 + 1;
        }
    }

    cudaMalloc((void**)&devData, sizeof(cufftComplex) * numLines * nValid);
    cudaMemcpy(devData, hostData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyHostToDevice);

    // ----------------------------
    cufftHandle plan;
    cufftPlan1d(&plan, 4912, CUFFT_C2C, 2048);
    cufftExecC2C(plan, (cufftComplex *)devData, (cufftComplex *)devData, CUFFT_FORWARD);
    cufftDestroy(plan);
    // ----------------------------

    cudaMemcpy(hostData, devData, sizeof(cufftComplex) * numLines * nValid, cudaMemcpyDeviceToHost);
    for (iter1 = 0; iter1 < 5; iter1++)
    {
        for (iter2 = 0; iter2 < 5; iter2++)
        {
            index = iter1*4912 + iter2;
            printf("%lf+i%lf \n", hostData[index].x, hostData[index].y);
        }
        printf("\n");
    }

    cudaDeviceReset();
    return 0;
}

This could probably be because you are using the LP (low-power) CPU core.
Check out this document to enable all four main ARM cores (the HP cluster) to take advantage of Hyper-Q.
I faced a similar issue. After activating the main HP cluster I get good performance (from 3 GFLOPS (LP) to 160 GFLOPS (HP)).

My blind guess is that, even though the TK1 has a more modern core, the memory bandwidth dedicated to the 144 cores of your 635M is significantly higher than that of the Tegra.
Furthermore, CUDA is always a bit picky about warp/thread/grid sizes, so it's perfectly possible that the cuFFT algorithms were optimized for the local storage sizes of Fermi GPUs and don't work as well with Keplers.
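Whichever explanation holds, it helps to isolate the transform itself when comparing the two boards. Below is a minimal timing sketch (my own, not from the original posts) that reuses the 4912-point, 2048-batch configuration from the question, but creates the plan and runs a warm-up transform first, so that only cufftExecC2C is measured with CUDA events:

#include <stdio.h>
#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int fftSize = 4912, batch = 2048;
    const size_t bytes = sizeof(cufftComplex) * fftSize * batch;

    cufftComplex *devData;
    cudaMalloc((void **)&devData, bytes);
    cudaMemset(devData, 0, bytes);             // contents don't matter for timing

    // Plan creation happens once, outside the timed region.
    cufftHandle plan;
    cufftPlan1d(&plan, fftSize, CUFFT_C2C, batch);

    // Warm-up transform so one-time initialisation doesn't skew the measurement.
    cufftExecC2C(plan, devData, devData, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cufftExecC2C(plan, devData, devData, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cufftExecC2C (%d-point C2C x %d batches): %.3f ms\n", fftSize, batch, ms);

    cufftDestroy(plan);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(devData);
    cudaDeviceReset();
    return 0;
}

If the gap between the two boards persists with this measurement, it points at the FFT kernels themselves (and the memory bandwidth behind them) rather than at plan creation or transfer overhead.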

Related

Where is the boundary of start and end of CPU launch and GPU launch of Nvidia Profiling NVPROF?

What is the definition of the start and end of a kernel launch on the CPU and GPU (the yellow blocks)? Where is the boundary between them?
Please notice that the start, end, and duration of those yellow blocks on the CPU and GPU are different. Why does the CPU invocation of vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); take that long?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    //printf("id = %d \n", id);

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 1000000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
CPU yellow block:
GPU yellow block:
Note that you mention nvprof, but the pictures you are showing are from nvvp, the visual profiler; nvprof is the command-line profiler.
GPU Kernel launches are asynchronous. That means that the CPU thread launches the kernel but does not wait for the kernel to complete. In fact, the CPU activity is actually placing the kernel in a launch queue - the actual execution of the kernel may be delayed if anything else is happening on the GPU.
So there is no defined relationship between the CPU (API) activity, and the GPU activity with respect to time, except that the CPU kernel launch must obviously precede (at least slightly) the GPU kernel execution.
The CPU (API) yellow block represents the duration of time that the CPU thread spends in a library call into the CUDA Runtime library, to launch the kernel (i.e. place it in the launch queue). This library call activity usually has some time overhead associated with it, in the range of 5-50 microseconds. The start of this period is marked by the start of the call into the library. The end of this period is marked by the time at which the library returns control to your code (i.e. your next line of code after the kernel launch).
The GPU yellow block represents the actual time period during which the kernel was executing on the GPU. The start and end of this yellow block are marked by the start and end of kernel activity on the GPU. The duration here is a function of what the code in your kernel is doing, and how long it takes.
I don't think the exact reason why a GPU kernel launch takes ~5-50 microseconds of CPU time is documented or explained anywhere in an authoritative fashion, and it is a closed source library, so you will need to acknowledge that overhead as something you have little control over. If you design kernels that run for a long time and do a lot of work, this overhead can become insignificant.
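To see this asynchrony for yourself, you can time the launch call on the host while timing the kernel itself with CUDA events; the host-side number should stay in the tens-of-microseconds range no matter how long the kernel runs. A rough sketch (my own, reusing the vecAdd kernel from the question, with the input vectors simply zero-initialised on the device):

#include <chrono>
#include <stdio.h>
#include <cuda_runtime.h>

// Same kernel as in the question.
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        c[id] = a[id] + b[id];
}

int main()
{
    const int n = 1000000;
    const size_t bytes = n * sizeof(double);

    double *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemset(d_a, 0, bytes);                 // contents don't matter here
    cudaMemset(d_b, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blockSize = 1024;
    int gridSize = (n + blockSize - 1) / blockSize;

    cudaEventRecord(start, 0);

    // Host-side duration: how long the launch call itself takes to return
    // (this corresponds to the CPU yellow block).
    auto t0 = std::chrono::high_resolution_clock::now();
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
    auto t1 = std::chrono::high_resolution_clock::now();

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    // Device-side duration: the kernel's execution time on the GPU
    // (this corresponds to the GPU yellow block).
    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);

    double launchUs = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("launch call returned after %.1f us; kernel ran for %.3f ms\n", launchUs, kernelMs);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}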

Why does increasing the number of blocks in CUDA increase the time?

My understanding is that in CUDA, increasing the number of blocks will not increase the time, since they are executed in parallel; but in my code, if I double the number of blocks, the time doubles as well.
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#define num_of_blocks 500
#define num_of_threads 512

__constant__ double y = 1.1;

// set seed for random number generator
__global__ void initcuRand(curandState* globalState, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &globalState[idx]);
}

// kernel function for SIR
__global__ void test(curandState* globalState, double *dev_data){
    // global thread id
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // local thread id
    int lidx = threadIdx.x;

    // create shared memory to store seeds
    __shared__ curandState localState[num_of_threads];
    // shared memory to store samples
    __shared__ double sample[num_of_threads];

    // copy global seed to local
    localState[lidx] = globalState[idx];
    __syncthreads();

    sample[lidx] = y + curand_normal_double(&localState[lidx]);

    if(lidx == 0){
        // save the first sample to dev_data;
        dev_data[blockIdx.x] = sample[0];
    }

    globalState[idx] = localState[lidx];
}

int main(){
    // create random number seeds;
    curandState *globalState;
    cudaMalloc((void**)&globalState, num_of_blocks*num_of_threads*sizeof(curandState));
    initcuRand<<<num_of_blocks, num_of_threads>>>(globalState, 1);

    double *dev_data;
    cudaMalloc((double**)&dev_data, num_of_blocks*sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Start record
    cudaEventRecord(start, 0);

    test<<<num_of_blocks, num_of_threads>>>(globalState, dev_data);

    // Stop event
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!

    // Clean up:
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    std::cout << "Time elapsed: " << elapsedTime << std::endl;

    cudaFree(dev_data);
    cudaFree(globalState);
    return 0;
}
The test results are:
number of blocks: 500, Time elapsed: 0.39136
number of blocks: 1000, Time elapsed: 0.618656
So what is the reason the time increases? Is it because I access constant memory, or because I copy data from shared memory to global memory? Are there some ways to optimise it?
While the number of blocks being able to run in parallel can be large, it is still finite due to limited on-chip resources. If the number of blocks requested in a kernel launch exceeds that limit, any further blocks have to wait for earlier blocks to finish and free up their resources.
One limited resource is shared memory, of which your kernel uses 28 kilobytes. CUDA 8.0 compatible Nvidia GPUs offer between 48 and 112 kilobytes of shared memory per streaming multiprocessor (SM), so that the maximum number of blocks running at any one time is between 1× and 3× the number of SMs on your GPU.
Other limited resources are registers and various per-warp resources in the scheduler. The CUDA occupancy calculator is a convenient Excel spreadsheet (also works with OpenOffice/LibreOffice) that shows you how these resources limit the number of blocks per SM for a specific kernel. Compile the kernel adding the option --ptxas-options="-v" to the nvcc command line, locate the line saying "ptxas info : Used XX registers, YY bytes smem, zz bytes cmem[0], ww bytes cmem[2]", and enter XX, YY, the number of threads per block you are trying to launch, and the compute capability of your GPU into the spreadsheet. It will then show the maximum number of blocks that can run in parallel on one SM.
You don't mention the GPU you have been running the test on, so I'll use a GTX 980 as an example. It has 16 SMs with 96 KB of shared memory each, so at most 16×3 = 48 blocks can run in parallel. Had you not used shared memory, the maximum number of resident warps would have limited the number of blocks per SM to 4, allowing 64 blocks to run in parallel.
On any currently existing Nvidia GPU, your example requires at least about a dozen waves of blocks executing sequentially, explaining why doubling the number of blocks will also about double the runtime.
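If you prefer to query these limits programmatically rather than through the spreadsheet, the runtime API (CUDA 6.5 and later) provides cudaOccupancyMaxActiveBlocksPerMultiprocessor. A sketch along those lines, using a hypothetical stand-in kernel (smemHeavyKernel) with roughly the same 28 KB static shared-memory footprint as the kernel in the question:

#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel with roughly the same static shared-memory
// footprint (~28 KB per 512-thread block) as the kernel in the question.
__global__ void smemHeavyKernel(double *out)
{
    __shared__ unsigned char buf[28 * 1024];
    buf[threadIdx.x] = (unsigned char)threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        out[blockIdx.x] = (double)buf[0];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;
    // The last argument is the *dynamic* shared-memory size (0 here); the
    // static shared memory is read from the kernel's attributes automatically.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, smemHeavyKernel, 512, 0);

    printf("%s: %d SMs, %d resident block(s) per SM -> at most %d blocks in flight\n",
           prop.name, prop.multiProcessorCount, blocksPerSM,
           blocksPerSM * prop.multiProcessorCount);
    return 0;
}

Passing 0 for the last argument is appropriate here because the kernel only uses statically declared shared memory.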

CUDA atomicAdd() with long long int

Any time I try to use atomicAdd with anything other than (*int, int) I get this error:
error: no instance of overloaded function "atomicAdd" matches the argument list
But I need to use a larger data type than int. Is there any workaround here?
Device Query:
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 680"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4095 MBytes (4294246400 bytes)
( 8) Multiprocessors x (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Clock rate: 1084 MHz (1.08 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 680
My guess is wrong compile flags. If you're using anything other than int, you should be compiling for sm_12 or higher.
As stated by Robert Crovella, the unsigned long long int variant is supported, but long long int is not.
I used the code from: Beginner CUDA - Simple var increment not working
#include <iostream>
using namespace std;

__global__ void inc(unsigned long long int *foo) {
    atomicAdd(foo, 1);
}

int main() {
    unsigned long long int count = 0, *cuda_count;
    cudaMalloc((void**)&cuda_count, sizeof(unsigned long long int));
    cudaMemcpy(cuda_count, &count, sizeof(unsigned long long int), cudaMemcpyHostToDevice);
    cout << "count: " << count << '\n';
    inc <<< 100, 25 >>> (cuda_count);
    cudaMemcpy(&count, cuda_count, sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
    cudaFree(cuda_count);
    cout << "count: " << count << '\n';
    return 0;
}
Compiled on Linux with: nvcc -gencode arch=compute_12,code=sm_12 -o add add.cu
Result:
count: 0
count: 2500
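If you specifically need a signed long long int accumulator, a common workaround (not an official overload; it relies on two's-complement arithmetic, where signed and unsigned addition are bit-identical) is to reinterpret the address as unsigned long long int. A sketch with a hypothetical helper named atomicAddSigned, again compiled for sm_12 or higher:

#include <iostream>

// Hypothetical helper, not an official overload: signed and unsigned addition
// are bit-identical in two's complement, so the address can be reinterpreted.
__device__ long long int atomicAddSigned(long long int *address, long long int val)
{
    return (long long int)atomicAdd((unsigned long long int *)address,
                                    (unsigned long long int)val);
}

__global__ void inc(long long int *foo)
{
    atomicAddSigned(foo, 1);
}

int main()
{
    long long int count = 0, *cuda_count;
    cudaMalloc((void **)&cuda_count, sizeof(long long int));
    cudaMemcpy(cuda_count, &count, sizeof(long long int), cudaMemcpyHostToDevice);
    inc<<<100, 25>>>(cuda_count);
    cudaMemcpy(&count, cuda_count, sizeof(long long int), cudaMemcpyDeviceToHost);
    cudaFree(cuda_count);
    std::cout << "count: " << count << '\n';   // expected: 2500
    return 0;
}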

CUDA 5.0 - cudaGetDeviceProperties strange grid size or a bug in my code?

I already posted my question on the NVIDIA dev forums, but there are no definitive answers yet.
I'm just starting to learn CUDA and was really surprised that, contrary to what I found on the Internet, my card (GeForce GTX 660M) supports some insanely large grid sizes (2147483647 x 65535 x 65535). Please take a look at the following results I'm getting from deviceQuery.exe provided with the toolkit:
c:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 660M"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors x (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 660M
I was curious enough to write a simple program to test whether it's possible to use more than 65535 blocks in the first dimension of the grid, and it doesn't work, which confirms what I found on the Internet (or, to be more precise, it works fine for 65535 blocks and fails for 65536).
My program is extremely simple and basically just adds two vectors. This is the source code:
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>

#pragma comment(lib, "cudart")

typedef struct
{
    float *content;
    const unsigned int size;
} pjVector_t;

__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    resultVector[index] = firstVector[index] + secondVector[index];
}

int main(void)
{
    //const unsigned int vectorLength = 67107840; // 1024 * 65535 - works fine
    const unsigned int vectorLength = 67108864; // 1024 * 65536 - doesn't work
    const unsigned int vectorSize = sizeof(float) * vectorLength;
    int threads = 0;
    unsigned int blocks = 0;

    cudaDeviceProp deviceProperties;
    cudaError_t error;

    pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
    pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
    pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };

    float *d_firstVector;
    float *d_secondVector;
    float *d_resultVector;

    cudaMalloc((void **)&d_firstVector, vectorSize);
    cudaMalloc((void **)&d_secondVector, vectorSize);
    cudaMalloc((void **)&d_resultVector, vectorSize);

    cudaGetDeviceProperties(&deviceProperties, 0);

    threads = deviceProperties.maxThreadsPerBlock;
    blocks = (unsigned int)ceil(vectorLength / (double)threads);

    for (unsigned int i = 0; i < vectorLength; i++)
    {
        firstVector.content[i] = 1.0f;
        secondVector.content[i] = 2.0f;
    }

    cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);

    AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
    error = cudaPeekAtLastError();

    cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);

    for (unsigned int i = 0; i < vectorLength; i++)
    {
        if (resultVector.content[i] != 3.0f)
        {
            free(firstVector.content);
            free(secondVector.content);
            free(resultVector.content);

            cudaFree(d_firstVector);
            cudaFree(d_secondVector);
            cudaFree(d_resultVector);
            cudaDeviceReset();

            printf("Error under index: %i\n", i);
            return 0;
        }
    }

    free(firstVector.content);
    free(secondVector.content);
    free(resultVector.content);

    cudaFree(d_firstVector);
    cudaFree(d_secondVector);
    cudaFree(d_resultVector);
    cudaDeviceReset();

    printf("Everything ok!\n");
    return 0;
}
When I run it from Visual Studio in debug mode (with the bigger vector), the last cudaMemcpy always fills my resultVector with seemingly random data (very close to 0, if that matters), so the result doesn't pass the final validation. When I try to profile it with the Visual Profiler, it returns the following error message:
2 events, 0 metrics and 0 source-level metrics were not associated with the kernels and will not be displayed
As a result, the profiler measures only the cudaMalloc and cudaMemcpy operations and doesn't even show the kernel execution.
I'm not sure if I'm checking CUDA errors correctly, so please let me know if it can be done better. cudaPeekAtLastError() placed just after my kernel launch returns the cudaErrorInvalidValue(11) error when the bigger vector is used, and cudaSuccess(0) for every other call (cudaMalloc and cudaMemcpy). When I run my program with the smaller vector, all CUDA functions and my kernel launch return no errors (cudaSuccess(0)) and it works just fine.
So my question is: is cudaGetDeviceProperties returning rubbish grid size values or am I doing something wrong?
If you want to run a kernel using the larger grid sizes supported by the Kepler architecture, you must compile your code for that architecture. So change your build flags to specify sm_30 as the target architecture; otherwise the compiler will build for compute 1.0 targets.
The underlying reason for the launch failure is that the driver will attempt to recompile the compute 1.0 code for your Kepler card, but in doing so it enforces the execution grid limits dictated by the source architecture, i.e. two-dimensional grids with at most 65535 x 65535 blocks per grid.
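As a side note (my own addition, not part of the original answer): if for some reason you must keep building for an architecture with the 65535-block grid limit, a grid-stride loop lets a fixed-size grid cover a vector of any length. A sketch with a hypothetical kernel named AddVectorsStride:

#include <stdio.h>
#include <cuda_runtime.h>

// Grid-stride loop: each thread walks the vector in steps of the total thread
// count, so a grid capped at 65535 blocks still covers any vector length.
__global__ void AddVectorsStride(const float *a, const float *b, float *c, unsigned int n)
{
    unsigned int stride = gridDim.x * blockDim.x;
    for (unsigned int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const unsigned int vectorLength = 67108864;            // 1024 * 65536, as in the question
    const size_t vectorSize = vectorLength * sizeof(float);

    float *d_firstVector, *d_secondVector, *d_resultVector;
    cudaMalloc((void **)&d_firstVector, vectorSize);
    cudaMalloc((void **)&d_secondVector, vectorSize);
    cudaMalloc((void **)&d_resultVector, vectorSize);
    // (host initialisation and copies omitted; they are the same as in the question)

    unsigned int threads = 256;
    unsigned int blocks = (vectorLength + threads - 1) / threads;
    if (blocks > 65535)
        blocks = 65535;                                     // stay within the compute 1.x grid limit

    AddVectorsStride<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector, vectorLength);
    printf("launch status: %s\n", cudaGetErrorString(cudaPeekAtLastError()));

    cudaFree(d_firstVector);
    cudaFree(d_secondVector);
    cudaFree(d_resultVector);
    cudaDeviceReset();
    return 0;
}

Compiling for sm_30 and using the full one-thread-per-element grid, as the answer recommends, is still the simpler route on a Kepler card.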

cudaMemcpy is too slow on Tesla C2075

I'm currently working on a server with two CUDA-capable GPUs: a Quadro 400 and a Tesla C2075. I made a simple vector-addition test program. My problem is that while the Tesla C2075 is supposed to be more powerful than the Quadro 400, it takes more time to do the job. I found that cudaMemcpy takes up most of the execution time, and it is slower on the more powerful GPU.
Here's the source:
void get_matrix(float* arr1, float* arr2, int N1, int N2)
{
    int Nx, Ny;
    int n_blocks, n_threads;
    int dev = 0; // 1
    float time;
    size_t size;
    clock_t start, end;

    cudaSetDevice(dev);
    cudaDeviceProp deviceProp;
    start = clock();
    cudaGetDeviceProperties(&deviceProp, dev);

    Nx = N1;
    Ny = N2;
    n_threads = 256;
    n_blocks = (Nx*Ny + n_threads - 1)/n_threads;
    size = Nx*Ny*sizeof(float);

    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_A, arr1, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, arr2, size, cudaMemcpyHostToDevice);

    vector_add<<<n_blocks, n_threads>>>(d_A, d_B, size);

    cudaMemcpy(arr1, d_A, size, cudaMemcpyDeviceToHost);

    printf("Running device %s \n", deviceProp.name);
    end = clock();
    time = float(end - start)/float(CLOCKS_PER_SEC);
    printf("time = %e\n", time);
}

int main()
{
    int const nx = 20000, ny = nx;
    static float a[nx*ny], b[nx*ny];

    for(int i = 0; i < nx; i++)
    {
        for(int j = 0; j < ny; j++)
        {
            a[j + ny*i] = j + 10*i;
            b[j + ny*i] = -(j + 10*i);
        }
    }

    get_matrix(a, b, nx, ny);
    return 0;
}
The output is:
Running device Quadro 400
time = 1.100000e-01
Running device Tesla C2075
time = 1.050000e+00
And my questions are:
Should I modify the code depending on what GPU I am going to use?
Is there any connection between the number of blocks, threads per block specified in the code and the number of multiprocessors, cores per multiprocessor available on a GPU?
I'm running Linux openSUSE 11.2. The source code is compiled with nvcc (version 4.2).
Thanks for your help!
Try invoking get_matrix(a,b,nx,ny) twice and take the second timing result. The first call into the CUDA API creates the CUDA context, and that often takes a long time.
Please refer to this section in the CUDA C Best Practices Guide for how to determine the block size and grid size.
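As a sketch of that suggestion (my own code, not from the answer): you can force context creation up front, for example with cudaFree(0), and then time the transfer with CUDA events so that only GPU-side activity is measured. The buffer size and variable names below are illustrative:

#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 64 * 1024 * 1024;            // illustrative size: 256 MB of floats
    const size_t size = N * sizeof(float);

    cudaSetDevice(0);
    cudaFree(0);                               // force context creation before timing anything

    float *h_A = (float *)malloc(size);
    memset(h_A, 0, size);
    float *d_A;
    cudaMalloc((void **)&d_A, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D copy of %.0f MB took %.3f ms (%.2f GB/s)\n",
           size / 1.0e6, ms, (size / 1.0e9) / (ms / 1.0e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A);
    free(h_A);
    return 0;
}

With context creation taken out of the measurement, the two cards' copy times should reflect their actual PCIe transfer rates rather than one-time setup cost.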