CUDA invalid configuration argument

I am trying to launch a kernel with some parameters that I believe are valid, but I am receiving the "invalid configuration argument" error.
I am setting the sizes like this:
dim3 BlockDim = dim3(128, 1, 1);
dim3 GridDim = dim3(321, 320, 1);
and then launching my kernel
kernel<<<BlockDim,GridDim>>>();
My understanding is that this should be fine.
From device query I get:
Device 0: "Tesla C1060"
CUDA Driver Version / Runtime Version 6.0 / 5.5
CUDA Capability Major/Minor version number: 1.3
Total amount of global memory: 4096 MBytes (4294770688 bytes)
(30) Multiprocessors x ( 8) CUDA Cores/MP: 240 CUDA Cores
GPU Clock rate: 1296 MHz (1.30 GHz)
Memory Clock rate: 800 Mhz
Memory Bus Width: 512-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 66 / 0
Compute Mode:
Am I missing something here?
A few more tests I have run:
Works:
dim3 BlockDim = dim3(128, 1, 1);
dim3 GridDim = dim3(200, 1, 1);
Does not work:
dim3 BlockDim = dim3(128, 1, 1);
dim3 GridDim = dim3(30001, 1, 1);

Figured it out.
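I had the block and grid dimensions reversed in my kernel launch: the grid size comes first and the block size second. With the arguments swapped, GridDim was being used as the block size, so the (200, 1, 1) test stayed under the limit of 512 threads per block while (30001, 1, 1) and (321, 320, 1) exceeded it.
It should have been: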
kernel<<<GridDim,BlockDim>>>();
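A minimal sketch of how the swap can be caught at the launch site, assuming an empty placeholder kernel (the real kernel body is not shown above). The helper name `launchWith` is hypothetical; it launches with the given configuration and returns whatever `cudaGetLastError()` reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; the real kernel body is not shown in the question.
__global__ void kernel() {}

// Hypothetical helper: launches the placeholder kernel and returns the
// launch status. Configuration errors are reported by cudaGetLastError().
cudaError_t launchWith(dim3 grid, dim3 block) {
    kernel<<<grid, block>>>();  // order is <<<grid, block>>>
    return cudaGetLastError();
}

int main() {
    dim3 BlockDim(128, 1, 1);
    dim3 GridDim(321, 320, 1);

    cudaError_t gridFirst  = launchWith(GridDim, BlockDim);  // intended order
    cudaError_t blockFirst = launchWith(BlockDim, GridDim);  // swapped, as in the question

    std::printf("grid first:  %s\n", cudaGetErrorString(gridFirst));
    std::printf("block first: %s\n", cudaGetErrorString(blockFirst));

    cudaDeviceSynchronize();
    return 0;
}
```

Checking the status right after every launch means a rejected configuration is reported at the line that caused it instead of surfacing later.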

Related

CUDA 2D kernel failed to execute due to large block size

Why can't I use the maximum values listed under "Max dimension size of a thread block (x,y,z): (1024, 1024, 64)"? If I use (1024, 1024) it doesn't work, but when I use (32, 32) or (1, 1024) etc. it works. Is it about shared memory?
Here is my result from deviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 3 CUDA Capable device(s)
Device 0: "Tesla M2070"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes (5636554752 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1566 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Tesla M2070"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes (5636554752 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1566 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 20 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 2: "Tesla M2070"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes (5636554752 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1566 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 17 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla M2070 (GPU0) -> Tesla M2070 (GPU1) : No
> Peer access from Tesla M2070 (GPU0) -> Tesla M2070 (GPU2) : No
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU1) : No
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU2) : Yes
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU0) : No
> Peer access from Tesla M2070 (GPU1) -> Tesla M2070 (GPU1) : No
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU0) : No
> Peer access from Tesla M2070 (GPU2) -> Tesla M2070 (GPU1) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 3, Device0 = Tesla M2070, Device1 = Tesla M2070, Device2 = Tesla M2070
Result = PASS
Why I can't use max of Max dimension size of a thread block (x,y,z): (1024, 1024, 64)?
Because each one of those is an individual limit for that dimension. There is an additional overall limit also indicated in your deviceQuery printout:
Maximum number of threads per block: 1024
A threadblock is up to a 3-dimensional structure, so the total number of threads in a block is equal to the product of the individual dimensions that you choose. This product must also be less than or equal to 1024 (and greater than 0). This is just another hardware limit of the device.
Is it about shared memory?
The above is unrelated to any usage of shared memory. (Your code doesn't appear to be using shared memory anyway.)
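The two limits described above can be checked together before launching. This is a rough sketch, not an official API: `blockFits` is a hypothetical helper that reads the limits from `cudaDeviceProp` and tests the three block shapes mentioned in the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when `block` respects both the per-dimension
// limits and the overall threads-per-block limit of the device.
bool blockFits(dim3 block, const cudaDeviceProp& p) {
    if (block.x == 0 || block.y == 0 || block.z == 0) return false;
    bool perDimensionOk =
        block.x <= static_cast<unsigned>(p.maxThreadsDim[0]) &&
        block.y <= static_cast<unsigned>(p.maxThreadsDim[1]) &&
        block.z <= static_cast<unsigned>(p.maxThreadsDim[2]);
    // The product of the dimensions is the number of threads in the block.
    unsigned long long total = 1ULL * block.x * block.y * block.z;
    return perDimensionOk &&
           total <= static_cast<unsigned long long>(p.maxThreadsPerBlock);
}

int main() {
    cudaDeviceProp p;
    if (cudaGetDeviceProperties(&p, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }
    const dim3 shapes[] = {dim3(1024, 1024, 1), dim3(32, 32, 1), dim3(1, 1024, 1)};
    for (const dim3& b : shapes) {
        std::printf("(%u, %u, %u): %s\n", b.x, b.y, b.z,
                    blockFits(b, p) ? "fits" : "too large");
    }
    return 0;
}
```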

How to maximise the use of the GPU without having blocks waiting to be scheduled?

A device query on my TITAN Xp shows that I have 30 multiprocessors with a maximum of 2048 threads per multiprocessor. Is it correct to think that the maximum number of threads that can physically execute on the hardware at the same time is 30 * 2048? That is, will a kernel configuration like the following exploit this?
kernel<<<60, 1024>>>(...);
I'd really like to physically have the maximum number of blocks executing while avoiding having blocks waiting to be scheduled. Here's the full output of device query:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "TITAN Xp"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 12190 MBytes (12781682688 bytes)
(30) Multiprocessors, (128) CUDA Cores/MP: 3840 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5705 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1, Device0 = TITAN Xp
Result = PASS
Yes, your conclusion is correct. The maximum number of threads that can be "in-flight" is 2048 * # of SMs for all GPUs supported by CUDA 9 or CUDA 9.1. (Fermi GPUs, supported by CUDA 8, are a bit lower at 1536 * # of SMs)
This is an upper bound, and the specifics of your kernel (resource utilization) may mean that fewer than this number can actually be "resident" or "in flight". This is in the general topic of GPU occupancy. CUDA includes an occupancy calculator spreadsheet and also a programmatic occupancy API to help determine this, for your specific kernel.
The usual kernel strategy for having a limited number of threads (e.g. 60 * 1024 in your case) handle an arbitrary data set size is to use a construct called a grid-stride loop.
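As an illustration of both points, here is a sketch that sizes the grid with the occupancy API and processes the data with a grid-stride loop. The kernel `scale` and the helper `runScale` are made-up examples, not code from the question; the block size of 1024 matches the launch in the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical example kernel. Each thread starts at its global index and
// advances by the total number of launched threads (the grid stride).
__global__ void scale(float* data, int n, float factor) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Scales n ones by 2 on the GPU; returns true if every result equals 2.
bool runScale(int n) {
    const int blockSize = 1024;
    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    // How many blocks of this kernel can be resident on one SM at once.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scale, blockSize, 0);

    int gridSize = numSMs * blocksPerSM;  // every launched block is resident
    if (gridSize <= 0) return false;

    std::vector<float> host(n, 1.0f);
    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<gridSize, blockSize>>>(dev, n, 2.0f);
    bool ok = cudaGetLastError() == cudaSuccess;
    ok = ok && cudaMemcpy(host.data(), dev, n * sizeof(float),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);

    for (float v : host) ok = ok && (v == 2.0f);
    std::printf("grid = %d blocks of %d threads\n", gridSize, blockSize);
    return ok;
}

int main() {
    return runScale(1 << 20) ? 0 : 1;
}
```

Because the loop covers any `n`, the grid size can be chosen purely from what the device can keep resident rather than from the data size.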

My CUDA code crashes when using unified memory

I'm a newbie in CUDA programming. I have a problem with this code (it was written by my teacher):
#include <stdio.h>
#define THREAD_PER_BLOCK 128
__global__ void add(int *a,const int N){
int index=threadIdx.x+blockIdx.x*blockDim.x;
if (index<N)
a[index] = a[index]+10;
}
int main( void ){
int *a;
// managed
int i;
int N=1024;
int size = N * sizeof( int );
cudaMallocManaged( &a, size );
for(i=0; i<N; i++) {
a[i]=i;
}
add<<< N/THREAD_PER_BLOCK, THREAD_PER_BLOCK >>>( a,N);
cudaDeviceSynchronize();
for (int i=0; i<10; i++){
printf("%d %d\n", i, a[i]);
}
cudaFree( a );
return 0;
}
I've detected a segfault in the first for-loop, and I have no idea why the program crashes. My operating system is Ubuntu 14.04 and this is the output of deviceQuery:
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 820M"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1985 MBytes (2081095680 bytes)
( 2) Multiprocessors, ( 48) CUDA Cores/MP: 96 CUDA Cores
GPU Max Clock rate: 1550 MHz (1.55 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 131072 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce 820M
Result = PASS
The problem here is that your GPU is a Fermi GPU (compute capability 2.x):
Device 0: "GeForce 820M"
...
CUDA Capability Major/Minor version number: 2.1
^^^
and unified memory (for cudaMallocManaged) requires a compute capability 3.0 or higher GPU.
Any time you are having trouble with a CUDA code, it's good practice to use proper CUDA error checking before asking others for help. Even if you don't understand the error output, it will be useful to others trying to help you. In this case you would have gotten a concise error message that says that the cudaMallocManaged function is not supported.
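One common way to add such checking is a small macro wrapped around every runtime call. The sketch below applies it to the code from the question; `CUDA_CHECK` is a hypothetical name, not something the CUDA toolkit provides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: prints the failing call and the CUDA error
// string, then exits. The name CUDA_CHECK is an assumption.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,  \
                         __LINE__, #call, cudaGetErrorString(err_));  \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

#define THREAD_PER_BLOCK 128

__global__ void add(int* a, const int N) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < N) a[index] = a[index] + 10;
}

int main() {
    int* a = nullptr;
    const int N = 1024;

    // On a device without managed memory support this call fails, and the
    // macro reports it here instead of the host loop crashing later.
    CUDA_CHECK(cudaMallocManaged(&a, N * sizeof(int)));

    for (int i = 0; i < N; i++) a[i] = i;

    add<<<N / THREAD_PER_BLOCK, THREAD_PER_BLOCK>>>(a, N);
    CUDA_CHECK(cudaGetLastError());       // launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // errors during execution

    for (int i = 0; i < 10; i++) std::printf("%d %d\n", i, a[i]);

    CUDA_CHECK(cudaFree(a));
    return 0;
}
```

Without the check, a failed `cudaMallocManaged` leaves `a` unusable, which is consistent with the segfault seen in the first host loop.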

Number of threads in GeForce GTX 560Ti

I ran the deviceQuery and got the following result
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 560 Ti"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
(8) Multiprocessors x ( 48) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1701 MHz (1.70 GHz)
Memory Clock rate: 2052 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D= (2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 560 Ti
My understanding is that I can create a maximum of 65535 x 65535 x 65535 blocks with 1024 threads per block. Does that mean I can have at most 65535 x 65535 x 65535 x 1024 threads? If not, what is the maximum number of threads I can have?
Can anyone clarify this?
Your understanding is correct. You can theoretically launch 65535 x 65535 x 65535 x 1024 threads, but due to resource constraints you may not be able to hit the maximum.
That product is the limit on the size of a single launch, not on how many threads execute at once. You have 8 MPs and a maximum of 1536 threads per MP, so at most 8 * 1536 = 12288 threads can be resident on the GPU at any one time; blocks beyond that wait until resident blocks finish.
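The two figures discussed in these answers can be computed from the device properties. A small sketch follows; the helper names are hypothetical, and the launch limit is held in a `double` because on newer devices the product does not fit in 64 bits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on threads in a single launch
// (grid limits times threads per block).
double maxLaunchThreads(const cudaDeviceProp& p) {
    return static_cast<double>(p.maxGridSize[0]) * p.maxGridSize[1] *
           p.maxGridSize[2] * p.maxThreadsPerBlock;
}

// Hypothetical helper: upper bound on threads resident on the GPU at once.
long long maxResidentThreads(const cudaDeviceProp& p) {
    return static_cast<long long>(p.multiProcessorCount) *
           p.maxThreadsPerMultiProcessor;
}

int main() {
    cudaDeviceProp p;
    if (cudaGetDeviceProperties(&p, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }
    std::printf("launch limit:   %.0f threads\n", maxLaunchThreads(p));
    std::printf("resident limit: %lld threads\n", maxResidentThreads(p));
    return 0;
}
```

For the GTX 560 Ti values shown above, the resident limit works out to the 8 * 1536 = 12288 mentioned in the answer.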

cuda threads and blocks

I posted this on the NVIDIA forums, I thought I would get a few more eyes to help.
I'm having trouble expanding my code to handle multiple cases. I have been developing with the most common case in mind; now it's time for testing and I need to ensure that it all works for the different cases. Currently my kernel is executed within a loop (there are reasons why we aren't doing one kernel call to do the whole thing) to calculate a value across the row of a matrix. The most common case is 512 columns by 512 rows. I need to consider matrices of size 512 x 512, 1024 x 512, 512 x 1024, and other combinations, but the largest will be a 1024 x 1024 matrix. I have been using a rather simple kernel call:
launchKernel<<<1,512>>>(................)
This kernel works fine for the common 512 x 512 and 512 x 1024 (column, row respectively) cases, but not for the 1024 x 512 case, which requires 1024 threads to execute. In my naivety I have been trying different versions of the simple kernel call to launch 1024 threads:
launchKernel<<<2,512>>>(................) // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???
I believe my problem has something to do with my lack of understanding of threads and blocks.
Here is the output of deviceQuery; as you can see, I can have a maximum of 1024 threads per block:
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "Tesla C2050"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818572288 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock Speed: 1.15 GHz
Memory Clock rate: 1500.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 40 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro 600"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 2) Multiprocessors x (48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock Speed: 1.28 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 15 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600
I am using only the Tesla C2050 device
Here is a stripped out version of my kernel, so you have an idea of what it is doing.
#define twoPi 6.283185307179586
#define speed_of_light 3.0E8
#define MaxSize 999
__global__ void calcRx4CPP4
(
const float *array1,
const double *array2,
const float scalar1,
const float scalar2,
const float scalar3,
const float scalar4,
const float scalar5,
const float scalar6,
const int scalar7,
const int scalar8,
float *outputArray1,
float *outputArray2)
{
float scalar9;
int idx;
double scalar10;
double scalar11;
float sumReal, sumImag;
float real, imag;
float coeff1, coeff2, coeff3, coeff4;
sumReal = 0.0;
sumImag = 0.0;
// kk loop 1 .. 512 (scalar7)
idx = (blockIdx.x * blockDim.x) + threadIdx.x;
/* Declare the shared memory parameters */
__shared__ float SharedArray1[MaxSize];
__shared__ double SharedArray2[MaxSize];
/* populate the arrays on shared memory */
SharedArray1[idx] = array1[idx]; // first 512 elements
SharedArray2[idx] = array2[idx];
if (idx+blockDim.x < MaxSize){
SharedArray1[idx+blockDim.x] = array1[idx+blockDim.x];
SharedArray2[idx+blockDim.x] = array2[idx+blockDim.x];
}
__syncthreads();
// input scalars used here.
scalar10 = ...;
scalar11 = ...;
for (int kk = 0; kk < scalar8; kk++)
{
/* some calculations */
// SharedArray1, SharedArray2 and scalar9 used here
sumReal = ...;
sumImag = ...;
}
/* calculation of the exponential of a complex number */
real = ...;
imag = ...;
coeff1 = (sumReal * real);
coeff2 = (sumReal * imag);
coeff3 = (sumImag * real);
coeff4 = (sumImag * imag);
outputArray1[idx] = (coeff1 - coeff4);
outputArray2[idx] = (coeff2 + coeff3);
}
Because my max threads per block is 1024, I thought I would be able to continue to use the simple kernel launch, am I wrong?
How do I successfully launch each kernel with 1024 threads?
You don't want to vary the number of threads per block. You should get the optimal number of threads per block for your kernel by using the CUDA Occupancy Calculator. After you have that number, you simply launch the number of blocks that are required to get the total number of threads that you need. If the number of threads that you need for a given case is not always a multiple of the threads per block, you add code in the top of your kernel to abort the unneeded threads. (if () return;). Then, you pass in the dimensions of your matrix either with extra parameters to the kernel or by using x and y grid dimensions, depending on which information is required in your kernel (I haven't studied it).
My guess is that the reason you're having trouble with 1024 threads is that, even though your GPU supports that many threads in a block, there is another limiting factor to the number of threads you can have in each block based on resource usage in your kernel. The limiting factor can be shared memory or register usage. The Occupancy Calculator will tell you which, though that information is only important if you want to optimize your kernel.
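The launch pattern described in this answer could look roughly like the following. The kernel `fillIndex`, the helpers `blocksFor` and `runFill`, and the block size of 256 are placeholders chosen for illustration; the block size to use for the real kernel would come from the occupancy analysis.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel: writes each element's own index.
__global__ void fillIndex(int* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // surplus threads in the last block do nothing
    out[idx] = idx;
}

// Number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Runs the kernel for n elements and verifies the result on the host.
bool runFill(int n, int threadsPerBlock) {
    int* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return false;

    fillIndex<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(dev, n);

    std::vector<int> host(n, -1);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(host.data(), dev, n * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);

    for (int i = 0; i < n; ++i) ok = ok && (host[i] == i);
    return ok;
}

int main() {
    const int threadsPerBlock = 256;  // placeholder; tune for the real kernel
    const int sizes[] = {512, 1024, 1000};
    for (int n : sizes) {
        std::printf("n = %d, blocks = %d, %s\n", n, blocksFor(n, threadsPerBlock),
                    runFill(n, threadsPerBlock) ? "ok" : "failed");
    }
    return 0;
}
```

One caveat with the early `return`: if the kernel later calls `__syncthreads()`, it is safer to wrap the work in `if (idx < n) { ... }` so that every thread in the block still reaches the barrier.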
If you use one block with 1024 threads, you will have problems because MaxSize is only 999: threads 999 through 1023 index past the end of the shared arrays, which results in wrong data.
Let's simulate it for the last thread, #1023:
__shared__ float SharedArray1[999];
__shared__ double SharedArray2[999];
/* populate the arrays on shared memory */
SharedArray1[1023] = array1[1023];
SharedArray2[1023] = array2[1023];
if (2047 < MaxSize)
{
SharedArray1[2047] = array1[2047];
SharedArray2[2047] = array2[2047];
}
__syncthreads();
If you now use all those elements in your calculation, this should not work.
(Your calculation code is not shown, so this is an assumption.)
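One way to avoid the out-of-bounds writes, sketched under the assumption that every block needs the same first MaxSize input elements: index the shared arrays with threadIdx.x (shared memory is per block, so the global index is out of range as soon as a second block is launched) and let the threads stride over the arrays. The kernel `stagedSum` and its arithmetic are stand-ins, since the real calculation is not shown.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define MaxSize 999

// Stand-in kernel: stages both inputs in shared memory, then combines them.
__global__ void stagedSum(const float* array1, const double* array2,
                          float* out, int n) {
    __shared__ float  s1[MaxSize];
    __shared__ double s2[MaxSize];

    // Stride by blockDim.x so any block size fills exactly MaxSize elements
    // and no thread writes past the end of the shared arrays.
    for (int i = threadIdx.x; i < MaxSize; i += blockDim.x) {
        s1[i] = array1[i];
        s2[i] = array2[i];
    }
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int k = idx % MaxSize;  // placeholder use of the staged data
        out[idx] = s1[k] + static_cast<float>(s2[k]);
    }
}

// Runs the kernel with the given configuration and checks every output.
bool runStaged(int blocks, int threads) {
    const int n = blocks * threads;
    std::vector<float> h1(MaxSize);
    std::vector<double> h2(MaxSize);
    for (int i = 0; i < MaxSize; ++i) { h1[i] = float(i); h2[i] = 2.0 * i; }

    float *d1 = nullptr, *dOut = nullptr;
    double* d2 = nullptr;
    bool ok = cudaMalloc(&d1, MaxSize * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&d2, MaxSize * sizeof(double)) == cudaSuccess &&
              cudaMalloc(&dOut, n * sizeof(float)) == cudaSuccess;
    std::vector<float> result(n, -1.0f);
    if (ok) {
        cudaMemcpy(d1, h1.data(), MaxSize * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d2, h2.data(), MaxSize * sizeof(double), cudaMemcpyHostToDevice);
        stagedSum<<<blocks, threads>>>(d1, d2, dOut, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(result.data(), dOut, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d1); cudaFree(d2); cudaFree(dOut);
    for (int i = 0; ok && i < n; ++i) ok = (result[i] == 3.0f * (i % MaxSize));
    return ok;
}

int main() {
    std::printf("<<<2, 512>>>:  %s\n", runStaged(2, 512) ? "ok" : "failed");
    std::printf("<<<1, 1024>>>: %s\n", runStaged(1, 1024) ? "ok" : "failed");
    return 0;
}
```

With this staging loop both launch shapes from the question, two blocks of 512 threads or one block of 1024, see the same fully populated shared arrays.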