I have GeForce 620M and my code is:
int threadsPerBlock = 256;
int blocksPerGrid = Number_AA_GPU / threadsPerBlock;
for(it=0;it<Number_repeatGPU;it++)
{
Kernel_Update<<<blocksPerGrid,threadsPerBlock>>>(A, B, C, D, rand(), rand());
}
I get:
invalid configuration argument.
What could be the reason?
The kernel configuration arguments are the arguments between the <<<...>>> symbols.
Your GeForce 620M is a compute capability 2.1 device.
A compute capability 2.1 device is limited to 65535 when you pass a 1-dimensional parameter for the blocks per grid parameter (the first of the two arguments you are passing.)
Since the other parameter you are passing (256, threadsPerBlock) is definitely in-bounds, I conclude that your first parameter is out of bounds:
int blocksPerGrid = Number_AA_GPU / threadsPerBlock;
i.e. Number_AA_GPU is either greater than 65535*256 (greater than or equal to 65536*256 would trigger a failure), or it is zero (actually Number_AA_GPU less than 256 would fail, due to integer division), or it is negative.
In the future, you can write more easily decipherable questions if you provide a complete example. In this case, telling us what Number_AA_GPU is could make my answer more definite.
Related
http://us.hardware.info/reviews/5419/nvidia-geforce-gtx-titan-z-sli-review-incl-tones-tizair-system
says that "GTX Titan-Z" has 5760 Shader units. Also here is written that "GTX Titan-Z" has 2x GK110 GPU.
CUDA exp() expf() and __expf() mentiones that it is possible to calculate exponent in cuda.
Let's say I have array of 500 000 000 ( five hundred millions ) of doubles. I want to calculate exponents of each of value in array. Who knows what to expect: 5760 shader units will be able to calculate exp, or this task can be done only with two GK110 GPU? Difference in perfomance is drastical, so I need to be sure, that if I rewrite my app with CUDA, then it will not work slower.
In other words, can I make 5760 threads to calculate 500 000 000 exponents?
GTX Titan Z is a dual GPU device. Each of the two GK110 GPUs on the card is attached via a 384-bit memory interface to its own 6 GB of high-speed memory. The theoretical bandwidth of each memory is 336 GB/sec. The particular GK110 variant used in the GTX Titan Z is comprised of fifteen clusters of execution units called SMX. Each SMX in turn is comprised of 192 single-precision floating-point units, 64 double-precision floating point units, and various other units.
Each double-precision unit in GK110 can execute one FMA (fused multiply-add), or one FMUL, or one FADD per clock cycle. At a base clock of 705 MHz, the maximum total number of DP operations that can be executed by each of the GK110 GPUs on Titan Z per second is therefore 705e6 * 15 * 64 = 676.8e9. Assuming all operations are FMAs, that equates to 1.3536 double-precision TFLOPS. Since the card uses two GPUs, the total DP performance of a GTX Titan Z is thus 2.7072 TFLOPS.
Like CPUs, GPUs provide general-purpose computation via various integer and floating-point units. GPUs also provide special function units (called MUFU = multifunction unit on GK110) that can compute rough single-precision approximations to some frequently used functions such as reciprocal, reciprocal square root, sine, cosine, exponential base 2, and logarithm based 2. As far as exponentiation is concerned, the standard single-precision math function exp2f() is the only function that maps more or less directly to a MUFU instruction (MUFU.EX2). Depending on compilation mode, there is a thin wrapper around this hardware instruction since the hardware does not support denormal operands in the special function units.
All other exponentiaton in CUDA is performed via software subroutines. The standard single-precision function expf() is a fairly heavy-weight wrapper around the hardware's exp2 capability. The double-precision exp() function is a pure software routine based on minimax polynomial approximation. The complete source code for it is visible in the CUDA header file math_functions_dbl_ptx3.h (in CUDA 6.5, DP exp() code starts at line 1706 in that file). As you can see, the computation involves primarily double-precision floating-point operations, as well as integer and some single-precision floating-point operations. You can also look at the machine code by disassembling a binary executable that calls exp() with cuobjdump --dump-sass.
In terms of performance, in CUDA 6.5 the double precision exp() function has a throughput on the order of 25e9 function calls per second on a Tesla K20 (1.170 DP TFLOPS). Since each call to DP exp() consumes an 8-byte source operand and produces an 8-byte result, this equates to roughly 400 GB/sec of memory bandwidth. Since each GK110 on a Titan Z provides about 15% more performance than the GK110 on a Tesla K20, the throughput and bandwidth requirements increase accordingly. Since the required bandwidth exceeds the theoretical memory bandwidth of the GPU, code that simply applies DP exp() to an array will be completely bound by memory bandwidth.
The number of functional units in the GPU and the number of threads executing has no relationship with the number of array elements that can be processed, but can have an impact on the performance of such processing. The mapping of array elements to threads can be freely chosen by the programmer. The number of array elements that can be processed in one go is a function of the size of the GPU's memory. Note that not all of the raw memory on the device is available for user code as the CUDA software stack needs some memory for its own use, typically around 100 MB or so. An exemplary mapping for applying DP exp() to an array is shown in this code snippet:
__global__ void exp_kernel (const double * __restrict__ src,
double * __restrict__ dst, int len)
{
int stride = gridDim.x * blockDim.x;
int tid = blockDim.x * blockIdx.x + threadIdx.x;
for (int i = tid; i < len; i += stride) {
dst[i] = exp (src[i]);
}
}
#define ARRAY_LENGTH (500000000)
#define THREADS_PER_BLOCK (256)
int main (void) {
// ...
int len = ARRAY_LENGTH;
dim3 dimBlock(THREADS_PER_BLOCK);
int threadBlocks = (len + (dimBlock.x - 1)) / dimBlock.x;
if (threadBlocks > 65520) threadBlocks = 65520;
dim3 dimGrid(threadBlocks);
double *d_a = 0, *d_b = 0;
cudaMalloc((void**)&d_a, sizeof(d_a[0]), len);
cudaMalloc((void**)&d_b, sizeof(d_b[0]), len);
// ...
exp_kernel<<<dimGrid,dimBlock>>>(d_a, d_b, len);
// ...
}
I'm trying to implement sum of absolute differences in CUDA for a homework assignment, but am having trouble getting correct results.
I am given a Blocksize that represents X and Y size (in pixels) of a square portion of the images I am given to compare. I am also given two images in YUV format. Below are the portions of the program I have to implement: the kernel that calculates the SAD and the setup for the size of the grid/blocks of threads. The rest of the program is provided, and can be assumed to be correct.
Here I'm getting the x and y index of the current thread and using those to get the pixel in the image arrays I'm dealing with in the current thread. Then I calculate the absolute difference, wait for all the threads to finish calculating that, then if the current thread is within the block in the image we care about the absolute difference is added to the sum in global memory with an atomicAdd to avoid a collision during write.
__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock, uint32* cuda_SAD, uint32 cuda_Blocksize)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
int id = idx * cuda_Blocksize + idy;
int AD = abs( cuda_curBlock[id] - cuda_refBlock[id] );
__syncthreads();
if( idx < cuda_Blocksize && idy < cuda_Blocksize ) {
atomicAdd( cuda_SAD, AD );
}
}
And this is how I'm setting up the grid and blocks for the kernel:
int grid_sizeX = Blocksize/2;
int grid_sizeY = Blocksize/2;
int block_sizeX = Blocksize/4;
int block_sizeY = Blocksize/4;
dim3 blocksInGrid(grid_sizeX, grid_sizeY);
dim3 threadsInBlock(block_sizeX, block_sizeY);
The given program calculates the SAD on the CPU as well and compares our result from the GPU with that one to check for correctness. Valid block sizes within the image are from 1-1000. My solution above is getting correct results from 10-91, but anything above 91 just returns 0 for the sum. What am I doing wrong?
Your grid and block size settings looks odd.
Usually we use the settings for image pixels similar as follows.
int imageROISize=1000;
dim3 threadInBlock(16,16);
dim3 blocksInGrid((imageROISize+15)/16, (imageROISize+15)/16);
You could refer to the following section in cuda programming guide for more information on how to distribute workloads to CUDA threads.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
You really should show all the code and identify the GPU you are running on. At least the portion that calls the kernel and allocates data for GPU use.
Are you doing proper cuda error
checking on all cuda API calls and kernel calls?
Probably your kernel is not running at all because your
threadsInBlock parameter is exceeding 512 threads total. You indicate that at Blocksize = 92 and above, things are not working. Let's do the math:
92/4 = 23 threads in X and Y dimensions
23 * 23 = 529 total threads requested per threadblock
529 exceeds 512 which is the limit for cc 1.x devices, so I'm guessing you're running on a cc 1.x device, and therefore your kernel launch is failing, so your kernel is not running, and so you get no computed results (i.e. 0). Note that at 91/4 = 22 threads in X and Y dimensions, you are requesting 484 total threads which does not exceed the 512 limit for cc 1.x devices.
If you were doing proper cuda error checking, the error report would have focused your attention on the cuda kernel launch failing due to incorrect launch parameters.
I used x & y for calculating cells of a matrix in device.
when I used more than 32 for lenA & lenB, the breakpoint (in int x= threadIdx.x; in device code) can't work and output isn't correct.
in host code:
int lenA=52;
int lenB=52;
dim3 threadsPerBlock(lenA, lenB);
dim3 numBlocks(lenA / threadsPerBlock.x, lenB / threadsPerBlock.y);
kernel_matrix<<<numBlocks,threadsPerBlock>>>(dev_A, dev_B);
in device code:
int x= threadIdx.x;
int y= threadIdx.y;
...
Your threadsPerBlock dim3 variable must satisfy the requirements for the compute capability that you are targetting.
CC 1.x devices can handle up to 512 threads per block
CC 2.0 - 8.6 devices can handle up to 1024 threads per block.
Your dim3 variable at (32,32) is specifying 1024 (=32x32) threads per block. When you exceed that you are getting a kernel launch fail.
If you did cuda error checking on your kernel launch, you would see the error.
Since the kernel doesn't actually launch with this type of error, any breakpoints set in the kernel code also won't be hit.
Additional notes:
You won't get any compilation error for threads per block, regardless of what you do. It doesn't work that way. The compiler doesn't check that.
If you do proper CUDA error checking you will get a runtime error report, and even if you don't do proper CUDA error checking, your kernel will not actually run with that sort of error.
I am interested in knowing how cublasSgemm/clAmdBlasSgemm routines are mapped on GPU while calculating matrix multiplication (C = A * B).
Assume the dimensions of input Matrix ::A_rows = 6144;
A_cols = 12288; B_rows = 12288; B_cols = 15360;
and dimensions of resultant matrix :: C_rows = 6144; C_cols = 15360;
Assume i have initialized the input matrices on host and i copied the matrix data into device memory. After that i am calling following cuBlas or clAmdBlas routines to do matrix multiplication on GPU.
void cublasSgemm (char transa, char transb, int m, int n, int k, float alpha, const float *A, int lda, const float *B, int ldb, float beta, float *C, int ldc);
where m = A_rows; and
n = B_cols;
So my doubts are:
1. ) How these routines are implemented on GPU ?
2. ) Does m and n values mapped on one compute unit (SM)? If No, then what can be maximum value for m and n ?
3. ) Do we have control of threads/Blocks ?
For the host side CUBLAS API (note that I have no idea why you would assume that clAmdBlasSgemm would be the same), the short answer to your questions are as follows:
Modern CUBLAS is closed source. There are code bases like Magma which you could look at to at least get a feel for how CUBLAS might be implemented. You can also run CUBLAS code in one of the NVIDIA supplied profilers to see what it does on the GPU. But the point is that you don't need to know how it works. There is an API and some very thorough documentation. That is all you need to know.
You example problem requires roughly 1.2Gb of memory. If you have a GPU with that much memory, and either enough computational capacity to avoid the display driver watchdog timer, or a compute dedicated GPU, it will work. Memory and the display driver time limitations (where applicable) are the only limitations.
No.
Note that there is also a CUBLAS device API for K20 Kepler devices, and the answers I provided above do not apply to that library.
Before going any further you must read the papers of Volkov and Demmel, have a look here: http://www.cs.berkeley.edu/~volkov/ see his article regarding SGEMM. The answers are there since 2008.
I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between, -128 and 127.) I read these into a char4 array, and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unqualified unspecified launch failure, or the same message along with User Stack Overflow or Breakpoint Hit or Invalid __global__ write of size 8 at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.
#define N 32768
int main(int argc, char *argv[])
{
char4 *pc4Buf_h = NULL;
char4 *pc4Buf_d = NULL;
float2 *pf2InX_d = NULL;
float2 *pf2InY_d = NULL;
dim3 dimBCopy(1, 1, 1);
dim3 dimGCopy(1, 1);
...
/* i do check for errors in the actual code */
pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
(void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
(void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
(void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
...
dimBCopy.x = 1024; /* number of threads in a block, for my GPU */
dimGCopy.x = N / 1024;
CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
pf2InX_d,
pf2InY_d);
...
}
__global__ void CopyDataForFFT(char4 *pc4Data,
float2 *pf2FFTInX,
float2 *pf2FFTInY)
{
int i = (blockIdx.x * blockDim.x) + threadIdx.x;
pf2FFTInX[i].x = (float) pc4Data[i].x;
pf2FFTInX[i].y = (float) pc4Data[i].y;
pf2FFTInY[i].x = (float) pc4Data[i].z;
pf2FFTInY[i].y = (float) pc4Data[i].w;
return;
}
One other thing I noticed in my program is that if I comment out any two char-to-float assignment statements in my kernel, there's no memory error. One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one from the first two (pf2FFTInX), and another from the second two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and uses 5 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc. but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and memory errors disappeared for N = 32768. But at N = 65536, I still get errors. <-- This didn't last long. Still getting errors.
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.
The problem was a bad GPU card (see the comments). [I'm Adding this answer to remove the question from the unanswered list and make it more useful.]