I have exactly the same problem as described in the post:
CUDA Error on cudaBindTexture2D
I even have the following error:
error 18: invalid texture reference." and also experienced "wouldn't
throw the error on cudaMalloc, but only on cudaBindTexture
Unfortunately, the poster (Anton Roth) answered his own question in a manner that was a bit too cryptic for someone such as myself who is just starting out with CUDA:
The answer was in the comments, I used a sm that my GPU wasn't
compatible to.
The "not compatible with GPU" makes sense since the sample program FluidsGL (called "Fluids (OpenGL Version)" in NVIDIA CUDA Samples Browser) fails on my laptop, but works fine on my desktop at work. Unfortunately, I still don't know what "in the comments" was referring it, or how to even check for GPU SM compatibilities.
Here is the code that seems to be causing the issue:
#define DIM 512
In main:
setupTexture(DIM, DIM);
bindTexture();
In fluidsGL_kernels.cu:
texture<float2, 2> texref;
static cudaArray *array = NULL;
void setupTexture(int x, int y)
{
// Wrap mode appears to be the new default
texref.filterMode = cudaFilterModeLinear;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
cudaMallocArray(&array, &desc, y, x);
getLastCudaError("cudaMalloc failed");
}
void bindTexture(void)
{
cudaBindTextureToArray(texref, array);//this function itself doesn't throw the error but error 18 is caught by the function below
getLastCudaError("cudaBindTexture failed");
}
Hardware information
Here is the output of deviceQuery:
Device 0: "GeForce 9800M GS"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 8) Multiprocessors x ( 8) CUDA Cores/MP: 64 CUDA Cores
GPU Clock rate: 1325 MHz (1.32 GHz)
Memory Clock rate: 799 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D
=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192)
x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Mo
del)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 8 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simu
ltaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Versi
on = 5.0, NumDevs = 1, Device0 = GeForce 9800M GS
I know my GPU is kind of old, but it still runs most of the examples pretty well.
You need to compile your code for the proper architecture (as explained in the post you linked).
Since you have a CC 1.1 device, use the following nvcc compilation options:
-gencode arch=compute_11,code=sm_11
The default Visual Studio project or Makefile may not compile for the proper architectures, so always make sure that it does.
For Visual Studio, refer to this answer: https://stackoverflow.com/a/14413360/1043187
For a Makefile, it depends. The CUDA SDK samples often have a GENCODE_FLAGS variable that you can modify.
Related
I can't get CUDA profiling tools from being working. My laptop Asus has two video cards. One integrated (Intel) and another one, Nvidia GTX 960M.
I suspected that the visual profiler is using the integrated video card, so I changed the default video card for this specific application, under the “Nvidia Control Panel” and “Manager 3d Settings->Program Settings” to use the “High-Performance NVidia Processor”.
Nothing changed. Running the Visual Profiler, in the “Overall GPU usage” tab, I get “No GPU devices in Session”, which means I far as I can understand that no GPU’s were used.
Also, I noticed that the Nvidia display icon in the notification area is not reporting any applications that are using the video card.
What seems to be the problem here? How can I enable also the Nvidia GPU for both Visual Profiler and the command line nvprof.exe application? It seems neither Nsight works for me.
The code I am testing is the following:
#include<stdio.h>
#include<iostream>
#include<stdlib.h>
#include<string.h>
#define NUM_THREADS 256
#define IMG_SIZE 1048576
struct Coefficients_SOA {
int r;
int b;
int g;
int hue;
int saturation;
int maxVal;
int minVal;
int finalVal;
};
__global__
void complicatedCalculation(Coefficients_SOA* data)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
int grayscale = (data[i].r + data[i].g + data[i].b)/data[i].maxVal;
int hue_sat = data[i].hue * data[i].saturation / data[i].minVal;
data[i].finalVal = grayscale*hue_sat;
}
void complicatedCalculation()
{
Coefficients_SOA* d_x;
cudaMalloc(&d_x, IMG_SIZE*sizeof(Coefficients_SOA));
int num_blocks = IMG_SIZE/NUM_THREADS;
complicatedCalculation<<<num_blocks,NUM_THREADS>>>(d_x);
cudaFree(d_x);
}
int main(int argc, char*argv[])
{
complicatedCalculation();
return 0;
}
Best Regards,
PS: I installed CUDA Version 11 under win10/64bit
Also, I verified the CUDA installation according to https://docs.nvidia.com/cuda/pdf/CUDA_Installation_Guide_Windows.pdf
I attach the DeviceQuery and BandwidthTest CUDA sample programs for your convenience.
deviceQuery Sample report
D:\Program Files\nVidia\CUDA Samples\v11.0\bin\win64\Release>deviceQuery
deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 960M"
CUDA Driver Version / Runtime Version 11.0 / 11.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 4096 MBytes (4294967296 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1176 MHz (1.18 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 4 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS
BandwidthTest Sample report
D:\Program Files\nVidia\CUDA Samples\v11.0\bin\win64\Release>bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 960M
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 12.2
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 11.8
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 68.9
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Problem solved. As a beginner in the CUDA world, I didn't know that I should add the parameter gencode to compile my CUDA files under the command line (Visual Studio of CUDA SDK sample projects are already had these parameters that's why I had GPU activity).
So, the full parameter list under the command line should be like this for my maxwell architecture with CUDA Capability Major/Minor version number 5.0.
nvcc -run -m64 -gencode arch=compute_50,code=sm_50 -o aos_soa.exe aos_soa.cu
Unfortunately, my 1st book "Learn CUDA Programming", from Packt Publishing, on page 49 refers that I should compile ONLY with the following parameters, besides the fact that in the source code files is contains a "Makefile" that includes all of the parameters above (only for linux, so I ignored it).
$ nvcc -o aos_soa ./aos_soa.cu
Now I can see my GPU statistics under nvprof.
nvprof aos_soa.exe
==18308== NVPROF is profiling process 18308, command: aos_soa.exe
==18308== Profiling application: aos_soa.exe
==18308== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.1421ms 1 1.1421ms 1.1421ms 1.1421ms complicatedCalculation(Coefficients_SOA*)
API calls: 83.40% 226.57ms 1 226.57ms 226.57ms 226.57ms cudaMalloc
15.90% 43.183ms 1 43.183ms 43.183ms 43.183ms cuDevicePrimaryCtxRelease
0.58% 1.5790ms 1 1.5790ms 1.5790ms 1.5790ms cudaFree
0.07% 198.40us 1 198.40us 198.40us 198.40us cuModuleUnload
0.03% 70.100us 1 70.100us 70.100us 70.100us cudaLaunchKernel
0.01% 26.800us 1 26.800us 26.800us 26.800us cuDeviceTotalMem
0.01% 20.200us 101 200ns 100ns 3.3000us cuDeviceGetAttribute
0.00% 11.600us 1 11.600us 11.600us 11.600us cuDeviceGetPCIBusId
0.00% 1.4000us 3 466ns 200ns 700ns cuDeviceGetCount
0.00% 1.4000us 2 700ns 200ns 1.2000us cuDeviceGet
0.00% 600ns 1 600ns 600ns 600ns cuDeviceGetName
0.00% 400ns 1 400ns 400ns 400ns cuDeviceGetLuid
0.00% 300ns 1 300ns 300ns 300ns cuDeviceGetUuid
A device query on my Titan-XP shows that I have 30 multiprocessors with a maximum number of 2048 threads per multiprocessor. Is it correct to think that the maximum number of threads that can simultaneously be executed physically on the hardware is 30 * 2048? I.e: will a kernel configuration like the following exploit this?
kernel<<<60, 1024>>>(...);
I'd really like to physically have the maximum number of blocks executing while avoiding having blocks waiting to be scheduled. Here's the full output of device query:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "TITAN Xp"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 12190 MBytes (12781682688 bytes)
(30) Multiprocessors, (128) CUDA Cores/MP: 3840 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5705 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1, Device0 = TITAN Xp
Result = PASS
Yes, your conclusion is correct. The maximum number of threads that can be "in-flight" is 2048 * # of SMs for all GPUs supported by CUDA 9 or CUDA 9.1. (Fermi GPUs, supported by CUDA 8, are a bit lower at 1536 * # of SMs)
This is an upper bound, and the specifics of your kernel (resource utilization) may mean that fewer than this number can actually be "resident" or "in flight". This is in the general topic of GPU occupancy. CUDA includes an occupancy calculator spreadsheet and also a programmatic occupancy API to help determine this, for your specific kernel.
The usual kernel strategy to have a limited number of threads (e.g. 60 * 1024 in your case) handle an arbitrary data set size is to use some form of a construct called a grid striding loop.
I'm newbie in cuda programmation. I have a problem with this code (it was written by my teacher):
#include <stdio.h>
#define THREAD_PER_BLOCK 128
__global__ void add(int *a,const int N){
int index=threadIdx.x+blockIdx.x*blockDim.x;
if (index<N)
a[index] = a[index]+10;
}
int main( void ){
int *a;
// managed
int i;
int N=1024;
int size = N * sizeof( int );
cudaMallocManaged( &a, size );
for(i=0; i<N; i++) {
a[i]=i;
}
add<<< N/THREAD_PER_BLOCK, THREAD_PER_BLOCK >>>( a,N);
cudaDeviceSynchronize();
for (int i=0; i<10; i++){
printf("%d %d\n", i, a[i]);
}
cudaFree( a );
return 0;
}
I've detected a seg fault on fist for-loop, I have no idea of why the program crashes. My operative system is Ubuntu 14.04 and this is the output of querydevice:
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 820M"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1985 MBytes (2081095680 bytes)
( 2) Multiprocessors, ( 48) CUDA Cores/MP: 96 CUDA Cores
GPU Max Clock rate: 1550 MHz (1.55 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 131072 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce 820M
Result = PASS
The problem here is that your GPU is a Fermi GPU (compute capability 2.x):
Device 0: "GeForce 820M"
...
CUDA Capability Major/Minor version number: 2.1
^^^
and unified memory (for cudaMallocManaged) requires a compute capability 3.0 or higher GPU.
Any time you are having trouble with a CUDA code, it's good practice to use proper CUDA error checking before asking others for help. Even if you don't understand the error output, it will be useful to others trying to help you. In this case you would have gotten a concise error message that says that the cudaMallocManaged function is not supported.
Comments / Notes
Can I have more thread blocks than the maximum number of CUDA cores?
How does warp size relate to what I am doing?
Begin
I am running a cuda program using the following code to launch cuda kernels:
cuda_kernel_func<<<960, 1>>> (... arguments ...)
I thought this would be the limit of what I would be allowed to do, as I have a GTX670MX graphics processor on a laptop, which according to Nvidia's website has 960 CUDA cores.
So I tried changing 960 to 961 assuming that the program would crash. It did not...
What's going on here?
This is the output of deviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 670MX"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3072 MBytes (3221028864 bytes)
( 5) Multiprocessors, (192) CUDA Cores/MP: 960 CUDA Cores
GPU Max Clock rate: 601 MHz (0.60 GHz)
Memory Clock rate: 1400 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 393216 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 670MX
Result = PASS
I am not sure how to interpret this information. It says here "960 CUDA cores", but then "2048 threads per multiprocessor" and "1024 threads per block".
I am slightly confused about what these things mean, and therefore what the limitations of the cuda_kernel_func<<<..., ...>>> arguments are. (And how to get the maximum performance out of my device.)
I guess you could also interpret this question as "What do all the statistics about my device mean." For example, what actually is a CUDA core / thread / multiprocessor / texture dimension size?
It didn't crash because the number of 'CUDA cores' has nothing to do with the number of blocks. Not all blocks necessarily execute in parallel. CUDA just schedules some of your blocks after others, returning after all block executions have taken place.
You see, NVIDIA is misstating the number of cores in its GPUs, so as to make for a simpler comparison with single-threaded non-vectorized CPU execution. Your GPU actually has 6 cores in the proper sense of the word; but each of these can execute a lot of instructions in parallel on a lot of data. Note that the bona-fide cores on a Kepler GPUs are called "SMx"es (and are described here briefly).
So:
[Number of actual cores] x [max number of instructions a single core can execute in parallel] = [Number of "CUDA cores"]
e.g. 6 x 160 = 960 for your card.
Even this is a rough description of things, and what happens within an SMx doesn't always allow us to execute 160 instructions in parallel per cycle. For example, when each block has only 1 thread, that number goes down by a factor of 32 (!)
Thus even if you use 960 rather than 961 blocks, your execution isn't as well-parallelized as you would hope. And - you should really use more threads per block to utilize a GPU's capabilities for parallel execution. More importantly, you should find a good book on CUDA/GPU programming.
A simpler answer:
Blocks do not all execute at the same time. Some blocks may finish before others have even started. The GPU takes does X amount of blocks at a time, finishes those, grabs more blocks, and continues until all blocks are finished.
Aside: This is why thread_sync only synchronizes threads within a block, and not within the whole kernel.
I posted this on the NVIDIA forums, I thought I would get a few more eyes to help.
I'm having trouble trying to expand my code out to perform with multiple cases. I have been developing with the most common case in mind, now its time for testing and i need to ensure that it all works for the different cases. Currently my kernel is executed within a loop (there are reasons why we aren't doing one kernel call to do the whole thing.) to calculate a value across the row of a matrix. The most common case is 512 columns by 512 rows. I need to consider matricies of the size 512 x 512, 1024 x 512, 512 x 1024, and other combinations, but the largest will be a 1024 x 1024 matrix. I have been using a rather simple kernel call:
launchKernel<<<1,512>>>(................)
This kernel works fine for the common 512x512 and 512 x 1024 (column, row respectively) case, but not for the 1024 x 512 case. This case requires 1024 threads to execute. In my naivety i have been trying different versions of the simple kernel call to launch 1024 threads.
launchKernel<<<2,512>>>(................) // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???
I beleive my problem has something to do with my lack of understanding of the threads and blocks
Here is the output of deviceQuery, as you can see i can have a max of 1024 threads
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "Tesla C2050"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818572288 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock Speed: 1.15 GHz
Memory Clock rate: 1500.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 40 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro 600"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 2) Multiprocessors x (48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock Speed: 1.28 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 15 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600
I am using only the Tesla C2050 device
Here is a stripped out version of my kernel, so you have an idea of what it is doing.
#define twoPi 6.283185307179586
#define speed_of_light 3.0E8
#define MaxSize 999
__global__ void calcRx4CPP4
(
const float *array1,
const double *array2,
const float scalar1,
const float scalar2,
const float scalar3,
const float scalar4,
const float scalar5,
const float scalar6,
const int scalar7,
const int scalar8,
float *outputArray1,
float *outputArray2)
{
float scalar9;
int idx;
double scalar10;
double scalar11;
float sumReal, sumImag;
float real, imag;
float coeff1, coeff2, coeff3, coeff4;
sumReal = 0.0;
sumImag = 0.0;
// kk loop 1 .. 512 (scalar7)
idx = (blockIdx.x * blockDim.x) + threadIdx.x;
/* Declare the shared memory parameters */
__shared__ float SharedArray1[MaxSize];
__shared__ double SharedArray2[MaxSize];
/* populate the arrays on shared memory */
SharedArray1[idx] = array1[idx]; // first 512 elements
SharedArray2[idx] = array2[idx];
if (idx+blockDim.x < MaxSize){
SharedArray1[idx+blockDim.x] = array1[idx+blockDim.x];
SharedArray2[idx+blockDim.x] = array2[idx+blockDim.x];
}
__syncthreads();
// input scalars used here.
scalar10 = ...;
scalar11 = ...;
for (int kk = 0; kk < scalar8; kk++)
{
/* some calculations */
// SharedArray1, SharedArray2 and scalar9 used here
sumReal = ...;
sumImag = ...;
}
/* calculation of the exponential of a complex number */
real = ...;
imag = ...;
coeff1 = (sumReal * real);
coeff2 = (sumReal * imag);
coeff3 = (sumImag * real);
coeff4 = (sumImag * imag);
outputArray1[idx] = (coeff1 - coeff4);
outputArray2[idx] = (coeff2 + coeff3);
}
Because my max threads per block is 1024, I thought I would be able to continue to use the simple kernel launch, am I wrong?
How do I successfully launch each kernel with 1024 threads?
You don't want to vary the number of threads per block. You should get the optimal number of threads per block for your kernel by using the CUDA Occupancy Calculator. After you have that number, you simply launch the number of blocks that are required to get the total number of threads that you need. If the number of threads that you need for a given case is not always a multiple of the threads per block, you add code in the top of your kernel to abort the unneeded threads. (if () return;). Then, you pass in the dimensions of your matrix either with extra parameters to the kernel or by using x and y grid dimensions, depending on which information is required in your kernel (I haven't studied it).
My guess is that the reason you're having trouble with 1024 threads is that, even though your GPU supports that many threads in a block, there is another limiting factor to the number of threads you can have in each block based on resource usage in your kernel. The limiting factor can be shared memory or register usage. The Occupancy Calculator will tell you which, though that information is only important if you want to optimize your kernel.
If you use one block with 1024 threads you will have problems since MaxSize is only 999 resulting in wrong data.
Lets simulate it for last thread #1023
__shared__ float SharedArray1[999];
__shared__ double SharedArray2[999];
/* populate the arrays on shared memory */
SharedArray1[1023] = array1[1023];
SharedArray2[1023] = array2[1023];
if (2047 < MaxSize)
{
SharedArray1[2047] = array1[2047];
SharedArray2[2047] = array2[2047];
}
__syncthreads();
If you now use all those elements in your calculation this should not work.
(Your calculation code is not shown so its an assumption)