Why does memory prefetch have no impact when transferring from device to host? (CUDA)

I have the following setup:
constexpr uint32_t N{512};
constexpr uint32_t DATA_SIZE{sizeof(float) * N * N};

__managed__ float ma[N * N];
__managed__ float mb[N * N];
__managed__ float mc[N * N];

__global__ void kernel()
{
    for (uint32_t i{0}; i < N * N; ++i)
    {
        mc[i] = ma[i] + mb[i];
    }
}

int main(int argc, char *[])
{
    for (uint32_t i{0}; i < N * N; ++i)
    {
        ma[i] = 1.0f;
        mb[i] = 2.0f;
    }

    int deviceId{};
    gpuErrchk(cudaGetDevice(&deviceId));
    gpuErrchk(cudaMemPrefetchAsync(ma, DATA_SIZE, deviceId, nullptr));
    gpuErrchk(cudaMemPrefetchAsync(mb, DATA_SIZE, deviceId, nullptr));

    kernel<<<1, 1>>>();
    gpuErrchk(cudaPeekAtLastError());

    gpuErrchk(cudaMemPrefetchAsync(mc, DATA_SIZE, cudaCpuDeviceId, nullptr));
    gpuErrchk(cudaDeviceSynchronize());

    float result{0.0f};
    for (uint32_t i{0}; i < N * N; ++i)
    {
        result += mc[i];
    }
    return static_cast<int>(result);
}
I compile the code with -O3 optimization. Profiling it with nvprof ./test gives me the following (only the memory part):
==29300== Unified Memory profiling result:
Device "Quadro P1000 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 1.0000MB 1.0000MB 1.0000MB 2.000000MB 164.9620us Host To Device
20 153.60KB 4.0000KB 1.0000MB 3.000000MB 266.0500us Device To Host
19 - - - - 551.9440us Gpu page fault groups
Total CPU Page faults: 9
The first line (HtoD) is straightforward: there were two prefetches, one each for the ma and mb arrays, 1 MB each.
The second line is strange for two reasons:
The prefetch was ignored (well, not completely, more on this later).
The total size of the data is 3 MB, even though the array is only 1 MB and only 1 MB was requested in cudaMemPrefetchAsync.
If I run the same code with prefetching commented out I have the following results:
==30051== Unified Memory profiling result:
Device "Quadro P1000 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
20 102.40KB 4.0000KB 508.00KB 2.000000MB 189.9230us Host To Device
29 105.93KB 4.0000KB 512.00KB 3.000000MB 278.4960us Device To Host
24 - - - - 1.311533ms Gpu page fault groups
Total CPU Page faults: 14
As seen in the tables, prefetching does affect the number of transfers: HtoD drops from 20 to 2 and DtoH from 29 to 20. It also has an impact on performance, but that impact is not major, especially when compared with a third variation of the same code, where I use cudaMalloc instead of managed memory:
Type Time(%) Time Calls Avg Min Max Name
0.00% 164.80us 2 82.401us 82.209us 82.593us [CUDA memcpy HtoD]
0.00% 81.665us 1 81.665us 81.665us 81.665us [CUDA memcpy DtoH]
I am running on an NVIDIA Quadro P1000 laptop, Ubuntu 18.04, CUDA 11.8.
To summarize, here are my questions:
Why does prefetching the memory to the host have almost no impact (29 migrations vs. 20 migrations)?
Why is more memory transferred to the host than requested (3 MB instead of the requested 1 MB)?
Why, even with prefetching, is managed memory an order of magnitude slower than device memory allocated with cudaMalloc?

As Robert Crovella mentioned in a comment, the behavior is caused by the fact that the initial location of managed memory is unspecified.
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
In my case, the memory was allocated on the GPU. That explains both the 3 MB transferred from the device to the host and the number of transfers. If I remove the initialization loop, the number of HtoD transfers becomes zero (despite the two calls to cudaMemPrefetchAsync) and DtoH becomes one.
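For reference, here is a minimal sketch (my addition, not part of the original answer) of making the data placement explicit on both sides of the kernel launch, which is the usual way to avoid on-demand migrations with managed memory. It reuses the __managed__ arrays, DATA_SIZE and gpuErrchk from the question; the cudaMemAdvise hints are optional.

// Sketch only: same globals (ma, mb, mc, DATA_SIZE) and gpuErrchk as above.
int deviceId{};
gpuErrchk(cudaGetDevice(&deviceId));

// Optional hints: tell the driver where each array should preferably live.
gpuErrchk(cudaMemAdvise(ma, DATA_SIZE, cudaMemAdviseSetPreferredLocation, deviceId));
gpuErrchk(cudaMemAdvise(mb, DATA_SIZE, cudaMemAdviseSetPreferredLocation, deviceId));
gpuErrchk(cudaMemAdvise(mc, DATA_SIZE, cudaMemAdviseSetPreferredLocation, deviceId));

// Prefetch everything the kernel touches, including the output array mc,
// so the kernel does not fault pages in on first access.
gpuErrchk(cudaMemPrefetchAsync(ma, DATA_SIZE, deviceId, nullptr));
gpuErrchk(cudaMemPrefetchAsync(mb, DATA_SIZE, deviceId, nullptr));
gpuErrchk(cudaMemPrefetchAsync(mc, DATA_SIZE, deviceId, nullptr));

kernel<<<1, 1>>>();
gpuErrchk(cudaPeekAtLastError());

// Bring the result back before reading it on the CPU.
gpuErrchk(cudaMemPrefetchAsync(mc, DATA_SIZE, cudaCpuDeviceId, nullptr));
gpuErrchk(cudaDeviceSynchronize());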

Related

CUDA: Write directly from device to host pinned memory without sacrificing throughput?

In CUDA, is it possible to write directly to host (pinned) memory from a device kernel?
In my current setup, I first write to device DRAM and then copy from DRAM into host pinned memory.
I'm wondering if I can just write directly to host memory (i.e. use one step instead of two) without sacrificing throughput.
From what I understand, unified memory isn't the answer - guides mention that it's slower (perhaps because of its paging semantics?).
But I haven't tried it, so perhaps I'm mistaken - maybe there's an option to force everything to reside in host pinned memory?
There are numerous questions here under the cuda SO tag about how to use pinned memory for "zero-copy" operations. Here is one example; you can find many more.
If you only have to write to each output point once, and your writes are/would be nicely coalesced, then there should not be a major performance difference between the costs of:
writing to device memory and then cudaMemcpy D->H after the kernel
writing directly to host-pinned memory
You will still need a cudaDeviceSynchronize() after the kernel call, before accessing the data on the host, to ensure consistency.
Differences on the order of ~10 microseconds are still possible due to CUDA operation overheads.
It should be possible to demonstrate that bulk transfer of data using direct read/writes to pinned memory from kernel code will achieve approximately the same bandwidth as what you would get with a cudaMemcpy transfer.
As an aside, the "paging semantics" of unified memory may be worked around but again a well optimized code in any of these 3 scenarios is not likely to show marked perf or duration differences.
Responding to comments: my use of "approximately" above is probably a stretch. Here's a kernel that writes 4GB of data in less than half a second on a PCIe Gen2 system:
$ cat t2138.cu
template <typename T>
__global__ void k(T *d, size_t n){
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i+=gridDim.x*blockDim.x)
    d[i] = 0;
}

int main(){
  int *d;
  size_t n = 1048576*1024;
  cudaHostAlloc(&d, sizeof(d[0])*n, cudaHostAllocDefault);
  k<<<160, 1024>>>(d, n);
  k<<<160, 1024>>>(d, n);
  cudaDeviceSynchronize();
  int *d1;
  cudaMalloc(&d1, sizeof(d[0])*n);
  cudaMemcpy(d, d1, sizeof(d[0])*n, cudaMemcpyDeviceToHost);
}
$ nvcc -o t2138 t2138.cu
$ compute-sanitizer ./t2138
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$ nvprof ./t2138
==21201== NVPROF is profiling process 21201, command: ./t2138
==21201== Profiling application: ./t2138
==21201== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 72.48% 889.00ms 2 444.50ms 439.93ms 449.07ms void k<int>(int*, unsigned long)
27.52% 337.47ms 1 337.47ms 337.47ms 337.47ms [CUDA memcpy DtoH]
API calls: 60.27% 1.88067s 1 1.88067s 1.88067s 1.88067s cudaHostAlloc
28.49% 889.01ms 1 889.01ms 889.01ms 889.01ms cudaDeviceSynchronize
10.82% 337.55ms 1 337.55ms 337.55ms 337.55ms cudaMemcpy
0.17% 5.1520ms 1 5.1520ms 5.1520ms 5.1520ms cudaMalloc
0.15% 4.6178ms 4 1.1544ms 594.35us 2.8265ms cuDeviceTotalMem
0.09% 2.6876ms 404 6.6520us 327ns 286.07us cuDeviceGetAttribute
0.01% 416.39us 4 104.10us 59.830us 232.21us cuDeviceGetName
0.00% 151.42us 2 75.710us 13.663us 137.76us cudaLaunchKernel
0.00% 21.172us 4 5.2930us 3.0730us 8.5010us cuDeviceGetPCIBusId
0.00% 9.5270us 8 1.1900us 428ns 4.5250us cuDeviceGet
0.00% 3.3090us 4 827ns 650ns 1.2230us cuDeviceGetUuid
0.00% 3.1080us 3 1.0360us 485ns 1.7180us cuDeviceGetCount
$
4GB/0.44s = 9GB/s
4GB/0.34s = 11.75GB/s (typical for PCIE Gen2 to pinned memory)
We can see that contrary to my previous statement, the transfer of data using in-kernel copying to a pinned allocation does seem to be slower (about 33% slower in my test case) than using a bulk copy (cudaMemcpy DtoH to a pinned allocation). However this isn't quite an apples-to-apples comparison, because the kernel itself would still have to write the 4GB of data to the device allocation to make the comparison to cudaMemcpy be sensible. The speed of this operation will depend on the GPU device memory bandwidth, which varies by GPU of course. So 33% higher is probably "too high" of an estimate of the comparison. But if your GPU has lots of memory bandwidth, this estimate will be pretty close. (On my V100, writing 4GB to device memory only takes ~7ms).
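For completeness, here is a hedged sketch (my addition, not Robert's code) of the more apples-to-apples path described above: the kernel first writes its 4GB to a device allocation, and the result is then bulk-copied to the pinned host buffer, so the figure to compare against the direct-to-pinned kernel is the device-write time plus the cudaMemcpy time.

// Sketch only: reuses the k<T> kernel from t2138.cu above.
int *h, *d;
size_t n = 1048576*1024;                                      // 4GB of int
cudaHostAlloc(&h, sizeof(h[0])*n, cudaHostAllocDefault);      // pinned destination
cudaMalloc(&d, sizeof(d[0])*n);                               // device staging buffer
k<<<160, 1024>>>(d, n);                                       // write to device memory (device-bandwidth bound)
cudaDeviceSynchronize();
cudaMemcpy(h, d, sizeof(h[0])*n, cudaMemcpyDeviceToHost);     // bulk copy (PCIe bound)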

Where does the shared memory of non-resident threadblocks go?

I am trying to understand how shared memory works when blocks use a lot of it.
My GPU (RTX 2080 Ti) has 48 KB of shared memory per SM, and the same per threadblock. In my example below I have 2 blocks forced onto the same SM, each using the full 48 KB of memory. I force both blocks to communicate before finishing, but since they can't run in parallel, this should be a deadlock. The program, however, does terminate, whether I run 2 blocks or 1000.
Is this because block 1 is paused once it runs into the deadlock and switched out for block 2? If so, where does the 48 KB of data from block 1 go while block 2 is active? Is it stored in global memory?
Kernel:
__global__ void testKernel(uint8_t* globalmem_message_buffer, int n) {
    const uint32_t size = 48000;
    __shared__ uint8_t data[size];
    for (int i = 0; i < size; i++)
        data[i] = 1;
    globalmem_message_buffer[blockIdx.x] = 1;
    while (globalmem_message_buffer[(blockIdx.x + 1) % n] == 0) {}
    printf("ID: %d\n", blockIdx.x);
}
Host code:
int n = 2; // Still works with n=1000
cudaStream_t astream;
cudaStreamCreate(&astream);

uint8_t* globalmem_message_buffer;
cudaMallocManaged(&globalmem_message_buffer, sizeof(uint8_t) * n);
for (int i = 0; i < n; i++) globalmem_message_buffer[i] = 0;
cudaDeviceSynchronize();

testKernel<<<n, 1, 0, astream>>>(globalmem_message_buffer, n);
Edit: Changed "threadIdx" to "blockIdx"
My GPU (RTX 2080 Ti) has 48 KB of shared memory per SM, and the same per threadblock. In my example below I have 2 blocks forced onto the same SM, each using the full 48 KB of memory.
That wouldn't happen. The general premise here is flawed. The GPU block scheduler only deposits a block on a SM when there are free resources sufficient to support that block.
An SM with 48KB of shared memory, that already has a block resident on it that uses 48KB of shared memory, will not get any new blocks of that type deposited on it, until the existing/resident block "retires" and releases the resources it is using.
Therefore, in the normal CUDA scheduling model, the only way a block can be non-resident is if it has not yet been scheduled on an SM. In that case, it uses no resources while it waits in the queue.
The exceptions to this would be in the case of CUDA preemption. This mechanism is not well documented, but would occur for example at the point of a context switch. In such a case, the entire threadblock state is somehow removed from the SM and stored somewhere else. However preemption is not applicable in the case where we are analyzing the behavior of a single kernel launch.
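As a hedged illustration of that premise (my addition, not part of the original answer), the occupancy API reports how many blocks of a given kernel can be resident on one SM; for a kernel that statically uses 48000 bytes of shared memory on a part with 48 KB of shared memory per SM, the result is 1. The kernel below is a hypothetical stand-in with the same shared-memory footprint as the one in the question.

#include <cstdint>
#include <cstdio>

// Hypothetical stand-in kernel with a 48000-byte static shared allocation.
__global__ void bigSharedKernel(uint8_t *out)
{
    __shared__ uint8_t data[48000];
    data[threadIdx.x] = 1;
    out[blockIdx.x] = data[threadIdx.x];
}

int main()
{
    int blocksPerSM = 0;
    // Block size 1 (as in the question), no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, bigSharedKernel, 1, 0);
    printf("resident blocks per SM: %d\n", blocksPerSM);  // expected: 1 on a 48 KB/SM GPU
    return 0;
}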
You haven't provided a complete code example; however, for the n=2 case, your claim that these two blocks will somehow be deposited on the same SM simply isn't true.
For the n=1000 case, your code only requires that a single location in memory be set to 1:
while (globalmem_message_buffer[(threadIdx.x + 1) % n] == 0) {}
threadIdx.x for your code is always 0, since you are launching threadblocks of only 1 thread:
testKernel << <n, 1, 0, astream >> > (globalmem_message_buffer, n);
Therefore the index generated here is always 1 (for n greater than or equal to 2). All threadblocks are checking location 1. Therefore, when the threadblock whose blockIdx.x is 1 executes, all threadblocks in the grid will be "unblocked", because they are all testing the same location. In short, your code may not be doing what you think it is or intended. Even if you had each threadblock check the location of another threadblock, we can imagine a sequence of threadblock deposits that would satisfy this without requiring all n threadblocks to be simultaneously resident, so I don't think that would prove anything either. (There is no specified order for the block deposit sequence.)

Maximum number of resident blocks per SM?

It seems that there is a maximum number of resident blocks allowed per SM. But while other "hard" limits are easily found (via, for example, cudaGetDeviceProperties), a maximum number of resident blocks doesn't seem to be widely documented.
In the following sample code, I configure the kernel with one thread per block. To test the hypothesis that this GPU (a P100) has a maximum of 32 resident blocks per SM, I create a grid of 56*32 blocks (56 = number of SMs on the P100). Each block takes 1 second to process (via a "sleep" routine), so if I have configured the kernel correctly, the code should take about 1 second. The timing results confirm this. Configuring with 32*56+1 blocks takes 2 seconds, suggesting that 32 blocks per SM is the maximum allowed.
What I wonder is, why isn't this limit made more widely available? For example, it doesn't show up in cudaGetDeviceProperties. Where can I find this limit for various GPUs? Or maybe this isn't a real limit, but is derived from other hard limits?
I am running CUDA 10.1
#include <stdio.h>
#include <sys/time.h>

double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp,NULL);
    return (double) tp.tv_sec + (double)tp.tv_usec*1e-6;
}

#define CLOCK_RATE 1328500 /* Modify from below */

__device__ void sleep(float t) {
    clock_t t0 = clock64();
    clock_t t1 = t0;
    while ((t1 - t0)/(CLOCK_RATE*1000.0f) < t)
        t1 = clock64();
}

__global__ void mykernel() {
    sleep(1.0);
}

int main(int argc, char* argv[]) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int mp = prop.multiProcessorCount;
    //clock_t clock_rate = prop.clockRate;

    int num_blocks = atoi(argv[1]);

    dim3 block(1);
    dim3 grid(num_blocks); /* N blocks */

    double start = cpuSecond();
    mykernel<<<grid,block>>>();
    cudaDeviceSynchronize();
    double etime = cpuSecond() - start;

    printf("mp %10d\n",mp);
    printf("blocks/SM %10.2f\n",num_blocks/((double)mp));
    printf("time %10.2f\n",etime);

    cudaDeviceReset();
}
Results :
% srun -p gpuq sm_short 1792
mp 56
blocks/SM 32.00
time 1.16
% srun -p gpuq sm_short 1793
mp 56
blocks/SM 32.02
time 2.16
% srun -p gpuq sm_short 3584
mp 56
blocks/SM 64.00
time 2.16
% srun -p gpuq sm_short 3585
mp 56
blocks/SM 64.02
time 3.16
Yes, there is a limit to the number of blocks per SM. The maximum number of blocks that can be contained in an SM refers to the maximum number of blocks that can be active at a given time. Blocks can be organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension, but the SM of your GPU will only be able to accommodate a certain number of blocks. This limit is linked in two ways to the compute capability of your GPU.
Hardware limit stated by CUDA.
Each GPU allows a maximum number of blocks per SM, regardless of the number of threads each block contains and the amount of resources it uses. For example, a GPU with compute capability 2.0 has a limit of 8 blocks/SM, while one with compute capability 7.0 has a limit of 32 blocks/SM. This is the best number of active blocks per SM that you can achieve: let's call it MAX_BLOCKS.
Limit derived from the amount of resources used by each block.
A block is made up of threads, and each thread uses a certain number of registers: the more registers it uses, the greater the number of resources used by the block that contains it. Similarly, the amount of shared memory assigned to a block increases the amount of resources the block needs. Once a certain value is exceeded, the resources needed per block become so large that the SM will not be able to host as many blocks as MAX_BLOCKS allows: this means that the amount of resources needed by each block limits the maximum number of active blocks per SM.
How do I find these boundaries?
NVIDIA thought about that too. The CUDA Occupancy Calculator spreadsheet, available on their site, lets you look up the hardware limits grouped by compute capability. You can also enter the amount of resources used by your blocks (number of threads, registers per thread, bytes of shared memory) and get graphs and important information about the number of active blocks.
The first tab of the linked file allows you to calculate the actual use of the SM based on the resources used. If you want to know how many registers per thread you are using, add the -Xptxas -v option so the compiler reports register usage when it assembles the PTX.
In the last tab of the file you will find the hardware limits grouped by compute capability.
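As an aside not covered by the answer: CUDA 11.0 and later (so not the asker's CUDA 10.1) also expose the hardware limit as a device attribute, and the occupancy API reports the per-kernel value. A hedged sketch, meant to be dropped into main() of the question's code after cudaGetDeviceProperties():

// Sketch only; 'mykernel' is the 1-second sleep kernel defined above.
int hwLimit = 0;
// Hardware blocks-per-SM limit (attribute added in CUDA 11.0).
cudaDeviceGetAttribute(&hwLimit, cudaDevAttrMaxBlocksPerMultiprocessor, 0);

int kernelLimit = 0;
// Per-kernel value at 1 thread per block, accounting for register/shared-memory usage.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&kernelLimit, mykernel, 1, 0);

printf("hardware limit: %d blocks/SM, this kernel: %d blocks/SM\n", hwLimit, kernelLimit);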

Global load transaction count with coalesced memory access

I've created a simple kernel to test coalesced memory access by observing the transaction counts on an NVIDIA GTX 980 card. The kernel is:
__global__
void copy_coalesced(float * d_in, float * d_out)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    d_out[tid] = d_in[tid];
}
When I run this with the following kernel configurations
#define BLOCKSIZE 32

int data_size = 10240; // always a multiple of BLOCKSIZE
int gridSize = data_size / BLOCKSIZE;

copy_coalesced<<<gridSize, BLOCKSIZE>>>(d_in, d_out);
Since the data access in the kernel is fully coalesced, and since the data type is float (4 bytes), the expected number of load/store transactions can be computed as follows:
Load Transaction Size = 32 bytes
Number of floats that can be loaded per transaction = 32 bytes / 4 bytes = 8
Number of transactions needed to load 10240 floats = 10240/8 = 1280 transactions
The same amount of transactions are expected for writing the data as well.
But when observing the nvprof metrics, following was the results
gld_transactions 2560
gst_transactions 1280
gld_transactions_per_request 8.0
gst_transactions_per_request 4.0
I cannot figure out why it takes twice the number of transactions needed for loading the data. Yet both the load and store efficiency metrics report 100%.
What am I missing here?
I reproduced your results on Linux:
1 gld_transactions Global Load Transactions 2560
1 gst_transactions Global Store Transactions 1280
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 1280
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 1280
However, on Windows using NSIGHT Visual Studio edition, I get values that appear to be better:
You may want to contact NVIDIA as it could simply be a display issue in nvprof.
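For reference (my addition, not part of the answer), metrics like the ones above are collected with an nvprof invocation along these lines, where ./copy_test stands in for whatever the compiled test binary is called:

$ nvprof --metrics gld_transactions,gst_transactions,l2_tex_read_transactions,l2_tex_write_transactions ./copy_test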

CUDA threads and blocks

I posted this on the NVIDIA forums; I thought I would get a few more eyes to help.
I'm having trouble trying to expand my code to handle multiple cases. I have been developing with the most common case in mind; now it's time for testing and I need to ensure that it all works for the different cases. Currently my kernel is executed within a loop (there are reasons why we aren't doing one kernel call to do the whole thing) to calculate a value across the row of a matrix. The most common case is 512 columns by 512 rows. I need to consider matrices of size 512 x 512, 1024 x 512, 512 x 1024, and other combinations, but the largest will be a 1024 x 1024 matrix. I have been using a rather simple kernel call:
launchKernel<<<1,512>>>(................)
This kernel works fine for the common 512 x 512 and 512 x 1024 (columns, rows respectively) cases, but not for the 1024 x 512 case. This case requires 1024 threads to execute. In my naivety I have been trying different versions of the simple kernel call to launch 1024 threads.
launchKernel<<<2,512>>>(................) // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???
I believe my problem has something to do with my lack of understanding of threads and blocks.
Here is the output of deviceQuery; as you can see, I can have a max of 1024 threads per block.
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "Tesla C2050"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818572288 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock Speed: 1.15 GHz
Memory Clock rate: 1500.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 40 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro 600"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 2) Multiprocessors x (48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock Speed: 1.28 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 15 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600
I am using only the Tesla C2050 device
Here is a stripped out version of my kernel, so you have an idea of what it is doing.
#define twoPi 6.283185307179586
#define speed_of_light 3.0E8
#define MaxSize 999

__global__ void calcRx4CPP4
(
    const float *array1,
    const double *array2,
    const float scalar1,
    const float scalar2,
    const float scalar3,
    const float scalar4,
    const float scalar5,
    const float scalar6,
    const int scalar7,
    const int scalar8,
    float *outputArray1,
    float *outputArray2)
{
    float scalar9;
    int idx;
    double scalar10;
    double scalar11;
    float sumReal, sumImag;
    float real, imag;
    float coeff1, coeff2, coeff3, coeff4;

    sumReal = 0.0;
    sumImag = 0.0;

    // kk loop 1 .. 512 (scalar7)
    idx = (blockIdx.x * blockDim.x) + threadIdx.x;

    /* Declare the shared memory parameters */
    __shared__ float SharedArray1[MaxSize];
    __shared__ double SharedArray2[MaxSize];

    /* populate the arrays on shared memory */
    SharedArray1[idx] = array1[idx]; // first 512 elements
    SharedArray2[idx] = array2[idx];
    if (idx+blockDim.x < MaxSize){
        SharedArray1[idx+blockDim.x] = array1[idx+blockDim.x];
        SharedArray2[idx+blockDim.x] = array2[idx+blockDim.x];
    }
    __syncthreads();

    // input scalars used here.
    scalar10 = ...;
    scalar11 = ...;

    for (int kk = 0; kk < scalar8; kk++)
    {
        /* some calculations */
        // SharedArray1, SharedArray2 and scalar9 used here
        sumReal = ...;
        sumImag = ...;
    }

    /* calculation of the exponential of a complex number */
    real = ...;
    imag = ...;
    coeff1 = (sumReal * real);
    coeff2 = (sumReal * imag);
    coeff3 = (sumImag * real);
    coeff4 = (sumImag * imag);

    outputArray1[idx] = (coeff1 - coeff4);
    outputArray2[idx] = (coeff2 + coeff3);
}
Because my max threads per block is 1024, I thought I would be able to continue to use the simple kernel launch; am I wrong?
How do I successfully launch each kernel with 1024 threads?
You don't want to vary the number of threads per block. You should get the optimal number of threads per block for your kernel by using the CUDA Occupancy Calculator. After you have that number, you simply launch the number of blocks required to get the total number of threads that you need. If the number of threads that you need for a given case is not always a multiple of the threads per block, you add code at the top of your kernel to abort the unneeded threads (if (...) return;); see the sketch below. Then you pass in the dimensions of your matrix either with extra parameters to the kernel or by using x and y grid dimensions, depending on which information is required in your kernel (I haven't studied it).
My guess is that the reason you're having trouble with 1024 threads is that, even though your GPU supports that many threads in a block, there is another limiting factor to the number of threads you can have in each block based on resource usage in your kernel. The limiting factor can be shared memory or register usage. The Occupancy Calculator will tell you which, though that information is only important if you want to optimize your kernel.
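Here is a hedged sketch of that launch pattern, with hypothetical names (launchKernel's real parameter list isn't shown in the question, so only the block/grid sizing and the guard are the point):

// Hypothetical sketch of the sizing and guard pattern.
__global__ void launchKernel(/* ..., */ int totalThreads)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= totalThreads) return;      // abort the unneeded threads
    // ... per-element work indexed by idx ...
}

void launchForRow(int totalThreads)       // e.g. 512 or 1024 columns
{
    const int blockSize = 256;            // example value; pick via the occupancy calculator
    const int numBlocks = (totalThreads + blockSize - 1) / blockSize;  // round up
    launchKernel<<<numBlocks, blockSize>>>(/* ..., */ totalThreads);
}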
If you use one block with 1024 threads you will have a problem, since MaxSize is only 999, resulting in wrong data.
Let's simulate it for the last thread, #1023:
__shared__ float SharedArray1[999];
__shared__ double SharedArray2[999];

/* populate the arrays on shared memory */
SharedArray1[1023] = array1[1023];
SharedArray2[1023] = array2[1023];
if (2047 < MaxSize)
{
    SharedArray1[2047] = array1[2047];
    SharedArray2[2047] = array2[2047];
}
__syncthreads();
If you now use all those elements in your calculation, it cannot work correctly: thread #1023 already writes past the end of the 999-element shared arrays.
(Your calculation code is not shown, so this is an assumption.)
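A hedged sketch of one way to make the shared-memory staging safe when blockDim.x can be 1024 (using the same names as the question's kernel): bound every shared write by MaxSize.

/* populate the arrays in shared memory, never writing past MaxSize */
if (idx < MaxSize) {
    SharedArray1[idx] = array1[idx];
    SharedArray2[idx] = array2[idx];
}
if (idx + blockDim.x < MaxSize) {
    SharedArray1[idx + blockDim.x] = array1[idx + blockDim.x];
    SharedArray2[idx + blockDim.x] = array2[idx + blockDim.x];
}
__syncthreads();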