http://us.hardware.info/reviews/5419/nvidia-geforce-gtx-titan-z-sli-review-incl-tones-tizair-system
says that "GTX Titan-Z" has 5760 Shader units. Also here is written that "GTX Titan-Z" has 2x GK110 GPU.
CUDA exp() expf() and __expf() mentiones that it is possible to calculate exponent in cuda.
Let's say I have array of 500 000 000 ( five hundred millions ) of doubles. I want to calculate exponents of each of value in array. Who knows what to expect: 5760 shader units will be able to calculate exp, or this task can be done only with two GK110 GPU? Difference in perfomance is drastical, so I need to be sure, that if I rewrite my app with CUDA, then it will not work slower.
In other words, can I make 5760 threads to calculate 500 000 000 exponents?
GTX Titan Z is a dual GPU device. Each of the two GK110 GPUs on the card is attached via a 384-bit memory interface to its own 6 GB of high-speed memory. The theoretical bandwidth of each memory is 336 GB/sec. The particular GK110 variant used in the GTX Titan Z is comprised of fifteen clusters of execution units called SMX. Each SMX in turn is comprised of 192 single-precision floating-point units, 64 double-precision floating point units, and various other units.
Each double-precision unit in GK110 can execute one FMA (fused multiply-add), or one FMUL, or one FADD per clock cycle. At a base clock of 705 MHz, the maximum total number of DP operations that can be executed by each of the GK110 GPUs on Titan Z per second is therefore 705e6 * 15 * 64 = 676.8e9. Assuming all operations are FMAs, that equates to 1.3536 double-precision TFLOPS. Since the card uses two GPUs, the total DP performance of a GTX Titan Z is thus 2.7072 TFLOPS.
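As a minimal sketch of the arithmetic above, the peak DP rate can also be computed at run time from the device properties; the figure of 64 DP units per SMX is assumed here, because it is specific to GK110 and is not reported by the CUDA runtime:
#include <cstdio>
#include <cuda_runtime.h>

int main (void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // first of the two GPUs on a Titan Z
    const int dpUnitsPerSM = 64;                // GK110-specific assumption
    double clockHz = prop.clockRate * 1000.0;   // clockRate is reported in kHz
    double dpOpsPerSec = clockHz * prop.multiProcessorCount * dpUnitsPerSM;
    printf("theoretical peak DP: %.4f TFLOPS\n", 2.0 * dpOpsPerSec / 1e12); // FMA = 2 FLOPs
    return 0;
}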
Like CPUs, GPUs provide general-purpose computation via various integer and floating-point units. GPUs also provide special function units (called MUFU = multifunction unit on GK110) that can compute rough single-precision approximations to some frequently used functions such as reciprocal, reciprocal square root, sine, cosine, exponential base 2, and logarithm base 2. As far as exponentiation is concerned, the standard single-precision math function exp2f() is the only function that maps more or less directly to a MUFU instruction (MUFU.EX2). Depending on compilation mode, there is a thin wrapper around this hardware instruction since the hardware does not support denormal operands in the special function units.
All other exponentiation in CUDA is performed via software subroutines. The standard single-precision function expf() is a fairly heavy-weight wrapper around the hardware's exp2 capability. The double-precision exp() function is a pure software routine based on minimax polynomial approximation. The complete source code for it is visible in the CUDA header file math_functions_dbl_ptx3.h (in CUDA 6.5, DP exp() code starts at line 1706 in that file). As you can see, the computation involves primarily double-precision floating-point operations, as well as integer and some single-precision floating-point operations. You can also look at the machine code by disassembling a binary executable that calls exp() with cuobjdump --dump-sass.
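For illustration, here is a minimal kernel sketch exercising the different routes just described; disassembling the resulting binary with cuobjdump --dump-sass shows which of them compile to a MUFU.EX2 instruction and which expand into software routines (the kernel name and array layout are just placeholders):
__global__ void exp_variants (const float * __restrict__ xf,
                              const double * __restrict__ xd,
                              float * __restrict__ rf,
                              double * __restrict__ rd)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    rf[3*i+0] = exp2f (xf[i]);   // thin wrapper around MUFU.EX2
    rf[3*i+1] = expf (xf[i]);    // heavier wrapper around the exp2 hardware
    rf[3*i+2] = __expf (xf[i]);  // fast, less accurate intrinsic
    rd[i]     = exp (xd[i]);     // pure software double-precision routine
}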
In terms of performance, in CUDA 6.5 the double precision exp() function has a throughput on the order of 25e9 function calls per second on a Tesla K20 (1.170 DP TFLOPS). Since each call to DP exp() consumes an 8-byte source operand and produces an 8-byte result, this equates to roughly 400 GB/sec of memory bandwidth. Since each GK110 on a Titan Z provides about 15% more performance than the GK110 on a Tesla K20, the throughput and bandwidth requirements increase accordingly. Since the required bandwidth exceeds the theoretical memory bandwidth of the GPU, code that simply applies DP exp() to an array will be completely bound by memory bandwidth.
The number of functional units in the GPU and the number of threads executing has no relationship with the number of array elements that can be processed, but can have an impact on the performance of such processing. The mapping of array elements to threads can be freely chosen by the programmer. The number of array elements that can be processed in one go is a function of the size of the GPU's memory. Note that not all of the raw memory on the device is available for user code as the CUDA software stack needs some memory for its own use, typically around 100 MB or so. An exemplary mapping for applying DP exp() to an array is shown in this code snippet:
__global__ void exp_kernel (const double * __restrict__ src,
                            double * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = exp (src[i]);
    }
}
#define ARRAY_LENGTH      (500000000)
#define THREADS_PER_BLOCK (256)

int main (void) {
    // ...
    int len = ARRAY_LENGTH;
    dim3 dimBlock(THREADS_PER_BLOCK);
    int threadBlocks = (len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    double *d_a = 0, *d_b = 0;
    cudaMalloc((void**)&d_a, sizeof(d_a[0]) * len);  // allocate source array
    cudaMalloc((void**)&d_b, sizeof(d_b[0]) * len);  // allocate result array
    // ...
    exp_kernel<<<dimGrid,dimBlock>>>(d_a, d_b, len);
    // ...
}
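As a usage sketch (continuing inside main() above), the kernel can be timed with CUDA events to check the bandwidth-bound behavior described earlier; the figures printed are of course specific to the device the code runs on:
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    exp_kernel<<<dimGrid,dimBlock>>>(d_a, d_b, len);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes = 2.0 * sizeof(double) * (double)len;   // one read plus one write per element
    printf("exp_kernel: %.3f ms, effective bandwidth %.1f GB/sec\n", ms, bytes / (ms * 1e6));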
Related
I am wondering what the effect of NumBlocks and ThreadsPerBlock is on this simple matrix multiplication routine.
__global__ void wmma_matrix_mult(half *a, half *b, half *out) {
    // Declare the fragments
    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, M, N, K, half> c_frag;

    // Initialize the output to zero
    wmma::fill_fragment(c_frag, 0.0f);

    // Load the inputs
    wmma::load_matrix_sync(a_frag, a, N);
    wmma::load_matrix_sync(b_frag, b, N);

    // Perform the matrix multiplication
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Store the output
    wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}
Calling
`wmma_matrix_mult<<<1, 1>>>`: Incorrect
`wmma_matrix_mult<<<1, 2>>>`: Incorrect
`wmma_matrix_mult<<<1, 4>>>`: Incorrect
`wmma_matrix_mult<<<1, 8>>>`: Incorrect
`wmma_matrix_mult<<<1, 16>>>`: Incorrect
`wmma_matrix_mult<<<1, 32>>>`: Correct
Why does the number of threads per block even matter if every thread is doing the same execution? As you can see, I am not doing anything with threadIdx.x inside the kernel.
Tensor core operations happen at the warp level. The w in wmma signifies that. Referring to the documentation:
This requires co-operation from all threads in a warp.
Each tensorcore unit can accept one matrix multiply operation (i.e. wmma::mma_sync), from a warp, per clock cycle.
This means that a full warp (32 threads) must be available and participating, for the operation to make any sense (i.e. to be legal). All of the wmma:: operations are collective ops, which means that an entire warp is expected to be executing them, and is necessary for correct usage.
If you have multiple warps participating (e.g. a threadblock size of 64, or 128, etc.), you are effectively asking for multiple operations to be done, just like any other CUDA code.
Like any other CUDA code, launching an operation with multiple blocks is just a way to scale up the work being done, and of course is necessary if you want to engage the resources of a GPU that has multiple SMs. Since tensorcore units are a per-SM resource, this would be necessary to witness a CUDA GPU delivering anything approaching its full rated throughput for tensorcore ops.
Why does the number of threads per block even matter if every thread is doing the same execution?
Every thread is not doing the same thing. The wmma:: collective ops are hiding code under the hood that is specializing thread behavior according to which warp lane it belongs to. For example, the thread in warp lane 0 will select different elements of the fragment to associate with (i.e. load, store) than any thread in any other warp lane.
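A minimal launch sketch, assuming M = N = K = 16 (a fragment shape supported by wmma for half precision) and compilation for a tensor-core-capable architecture (e.g. -arch=sm_70), might look like this; a single warp of 32 threads collectively computes one 16x16x16 product:
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

const int M = 16, N = 16, K = 16;

// ... wmma_matrix_mult kernel exactly as posted in the question ...

int main (void)
{
    half *a = 0, *b = 0, *out = 0;
    cudaMalloc((void**)&a,   M * K * sizeof(half));
    cudaMalloc((void**)&b,   K * N * sizeof(half));
    cudaMalloc((void**)&out, M * N * sizeof(half));
    // ... fill a and b with input data ...
    wmma_matrix_mult<<<1, 32>>>(a, b, out);   // exactly one full warp per block
    cudaDeviceSynchronize();
    return 0;
}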
Say I have a block of size 1024 and assume my GPU has 192 CUDA cores.
How does CUDA handle __syncthreads() in kernels when the number of CUDA cores is lower than the block size?
__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[1024];
    int t = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];
}
How does 'tr' remain in local memory?
I think you are mixing up a few things.
First of all, a GPU having 192 CUDA cores is the total core count. Each block, however, maps to a single Streaming Multiprocessor (SM), which may have a lower core count (depending on the GPU generation).
Let us assume that you own a Pascal GPU which has 64 cores per SM and that you have 3 SMs.
A single block maps to a single SM. So you will have 64 cores handling 1024 threads concurrently. Such an SM has enough registers to hold all the necessary data for 1024 threads, but it has only 64 cores which quickly swap which threads they are handling.
This way all the local data, e.g. tr, can remain in memory.
Now, because of this quick swapping and concurrent execution, it may happen -- completely by accident -- that some threads get ahead of others. If you want to ensure that at certain point all threads are at the same spot, you use __syncthreads(). All that function does is to instruct the scheduler to properly assign work to the CUDA cores so that they all are at that spot in program at some moment.
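As a usage sketch of the kernel from the question (host-side boilerplate assumed), the 1024-thread block is launched as-is; the hardware schedules the block's 32 warps onto however many cores the SM actually has:
int main (void)
{
    const int n = 1024;
    int h[1024], *d = 0;
    for (int i = 0; i < n; i++) h[i] = i;
    cudaMalloc((void**)&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    staticReverse<<<1, n>>>(d, n);   // block size may exceed the per-SM core count
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    return 0;
}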
I was writing a program which includes a CUDA kernel. I found that if you use #define OPERATOR * one thread will use 11 registers, but if you use #define OPERATOR / (the division operator) one thread will use 52 registers! What's wrong? I must decrease the register count (I don't want to set maxregcount)! How can I decrease the number of registers when I'm using the division operator in a CUDA kernel?
#include <stdio.h>
#include <stdlib.h>

#define GRID_SIZE  1
#define BLOCK_SIZE 1
#define OPERATOR   /

__global__ void kernel(double* array){
    for (int curEl = 0; curEl < BLOCK_SIZE; ++curEl){
        array[curEl] = array[curEl] OPERATOR 10;
    }
}

int main(void) {
    double *devPtr = NULL, *data = (double*)malloc(sizeof(double) * BLOCK_SIZE);
    cudaFuncAttributes cudaFuncAttr;
    cudaFuncGetAttributes(&cudaFuncAttr, kernel);
    for (int curElem = 0; curElem < BLOCK_SIZE; ++curElem){
        data[curElem] = curElem;
    }
    cudaMalloc(&devPtr, sizeof(double) * BLOCK_SIZE);
    cudaMemcpy(devPtr, data, sizeof(double) * BLOCK_SIZE, cudaMemcpyHostToDevice);
    kernel<<<1, BLOCK_SIZE>>>(devPtr);
    printf("1 thread needs %d regs\n", cudaFuncAttr.numRegs);
    return 0;
}
The increase in register use when switching from a double-precision multiplication to a double-precision division in kernel computation is due to the fact that double-precision multiplication is a built-in hardware instruction, while double-precision division is a sizable called software subroutine (that is, a function call of sorts). This is easily verified by inspection of the generated machine code (SASS) with cuobjdump --dump-sass.
The reason that double-precision divisions (and in fact all divisions, including single-precision division and integer division) are emulated either by inline code or called subroutines is due to the fact that the GPU hardware has no direct support for division operations, in order to keep the individual computational cores ("CUDA cores") as simple and as small as possible, which ultimately leads to higher peak performance for a given size chip. It likely also improves the efficiency of the cores as measured by the GFLOPS/watt metric.
For release builds, the typical increase in register use caused by the introduction of double-precision division is around 26 registers. These additional registers are needed to store intermediate variables in the division computation, where each double-precision temporary variable requires two 32-bit registers.
As Marco13 points out in a comment above, it may be possible to manually replace division by multiplication with the reciprocal. However, this causes slight numerical differences in most cases, which is why the CUDA compiler does not apply this transformation automatically.
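A sketch of that manual transformation, applied to the kernel from the question (the results may differ from the exact division in the last bits, which is precisely why the compiler does not do this by itself):
__global__ void kernel_recip(double* array){
    const double recip10 = 1.0 / 10.0;            // evaluated at compile time
    for (int curEl = 0; curEl < BLOCK_SIZE; ++curEl){
        array[curEl] = array[curEl] * recip10;    // a single DMUL instead of the division subroutine
    }
}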
Generally speaking, register use can be controlled with compilation-unit granularity through the -maxrregcount nvcc compiler flag, or with per-function granularity using the __launch_bounds__ function attribute. However, forcing lower register use by more than a few registers below the level determined by the compiler frequently leads to register spilling in the generated code, which usually has a negative impact on kernel performance.
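For completeness, a sketch of the per-function mechanism; the values 256 and 4 are purely illustrative and tell the compiler to assume at most 256 threads per block and to aim for at least 4 resident blocks per multiprocessor, which indirectly caps the number of registers available per thread:
__global__ void __launch_bounds__(256, 4) kernel_bounded(double* array){
    for (int curEl = 0; curEl < BLOCK_SIZE; ++curEl){
        array[curEl] = array[curEl] / 10;
    }
}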
I had a CUDA program in which kernel registers were limiting the maximum theoretical achieved occupancy to 50%. So I decided to use shared memory instead of registers for those variables that were constant across the threads of a block and were almost read-only throughout the kernel run. I cannot provide source code here; what I did was conceptually like this:
My initial program:
__global__ void GPU_Kernel (...) {
    __shared__ int sharedData[N]; // N: maximum amount that doesn't limit maximum occupancy
    int r_1 = A; // except for this first initialization, these registers don't change anymore
    int r_2 = B;
    ...
    int r_m = Y;
    ... // rest of kernel
}
I changed the above program to:
__global__ void GPU_Kernel (...) {
    __shared__ int sharedData[N-m];
    __shared__ int r_1, r_2, ..., r_m;
    if ( threadIdx.x == 0 ) {
        r_1 = A;
        r_2 = B;
        ...
        r_m = Y; // last of them
    }
    __syncthreads();
    ... // rest of kernel
}
Now the threads of the warps inside a block perform broadcast reads to access the newly created shared memory variables. At the same time, the threads no longer use so many registers that achieved occupancy is limited.
The second program has a maximum theoretical achieved occupancy of 100%. In actual runs, the average achieved occupancy for the first program was ~48% and for the second one it is around ~80%. But the issue is that the net speed-up is only around 5% to 10%, much less than what I was anticipating considering the improved occupancy. Why isn't this correlation linear?
Considering the image below from an Nvidia whitepaper, what I've been thinking is that when achieved occupancy is 50%, for example, half of the SMX (in newer architectures) cores are idle at a time, because the excessive resources requested elsewhere stop them from being active. Is my understanding flawed? Or is it incomplete to explain the above phenomenon? Or is it the cost of the added __syncthreads() and of the shared memory accesses?
Why isn't this correlation linear?
If you are already memory bandwidth bound or compute bound, and either one of those bounds is near the theoretical performance of the device, improving occupancy may not help much. Improving occupancy usually helps when neither of these is the limiter to performance for your code (i.e. you are not at or near peak memory bandwidth utilization or peak compute). Since you haven't provided any code or any metrics for your program, nobody can tell you why it didn't speed up more. The profiling tools can help you find the limiters to performance.
You might be interested in a couple webinars:
CUDA Optimization: Identifying Performance Limiters by Dr Paulius Micikevicius
CUDA Warps and Occupancy Considerations + Live with Dr Justin Luitjens, NVIDIA
In particular, review slide 10 from the second webinar.
Please bear with me; I don't know English well.
My Computing environment is
CPU : Intel Xeon x5690 3.46Ghz * 2EA
OS : CentOS 5.8
VGA : Nvidia Geforce GTX580 (CC is 2.0)
I have already read the documentation about "coalesced memory access" in the CUDA C Programming Guide.
But I can't apply it in my case.
I have a 32x32 grid of blocks and 16x16 threads per block.
That corresponds to the following code:
dim3 grid(32, 32);
dim3 block(16,16);
kernel<<<grid, block>>>(...);
Then, how can I make use of coalesced memory access?
I used the following code in the kernel:
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i*512+j] = ...;
I used the constant 512 because the total number of threads is 512x512 (grid size x block size).
But I saw "Low Global Memory Store Efficiency [9.7% avg, for kernels accounting for 100% of compute]" in the Visual Profiler.
The helper says to use coalesced memory access.
But I cannot figure out what indexing scheme I should use for the memory.
For more detailed code, see: The result of an experiment different from CUDA Occupancy Calculator
Coalescing memory loads and stores in CUDA is a pretty straightforward concept - threads in the same warp need to load or store from/into suitably aligned, consecutive words in memory.
The warp size is 32 in CUDA, and warps are formed from threads within the same block, ordered so that the x dimension of threadIdx.{xyz} varies the fastest, the y the next fastest, and the z the slowest (functionally this is the same as column major ordering in arrays).
The code you have posted isn't achieving coalesced memory stores because threads within the same warp are storing with a pitch of 512 words, not within the required 32 consecutive words.
A simple hack to improve coalescing would be to address the memory in column major order, so:
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i+512*j] = ...;
A more general approach on a 2D block and grid to achieve coalescing in the spirit of what you showed in the question would be like this:
int tid_in_block      = threadIdx.x + threadIdx.y * blockDim.x;
int bid_in_grid       = blockIdx.x + blockIdx.y * gridDim.x;
int threads_per_block = blockDim.x * blockDim.y;
int tid_in_grid       = tid_in_block + threads_per_block * bid_in_grid;

global_memory[tid_in_grid] = ...;
The most appropriate solution will depend on other details of the code and data which you have not described.
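For illustration, a minimal sketch combining the column-major indexing fix with the launch configuration from the question (a 32x32 grid of 16x16 blocks covering a 512x512 array; the element type float and the stored value are assumptions):
__global__ void coalesced_store(float *global_memory, int width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // fastest-varying index within a warp
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    global_memory[i + width * j] = 1.0f;             // consecutive threads store to consecutive words
}

// host side:
// dim3 grid(32, 32);
// dim3 block(16, 16);
// coalesced_store<<<grid, block>>>(d_mem, 512);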