According to the Kepler architecture whitepaper, an SMX has 192 CUDA cores and 64 Double Precision Units (DPUs). A K20Xm has 14 SMXs, for a total of 2688 cores, which means that only the CUDA cores are counted. What exactly are the DPUs used for, and how is their usage related to the cores?
My thoughts:
a) The CUDA cores can't do double precision operations and only the DPUs can. Therefore, the CUDA cores are free for other stuff while the DPUs are busy.
b) The CUDA cores somehow need a double precision unit to do double precision operations, therefore only 128 of the 192 CUDA cores are available for other stuff.
Cheers
Andi
The double precision units are actually separate hardware floating point units that do double precision arithmetic. They are independent of the "cuda cores", which, roughly speaking, could be considered to be the single-precision units.
So for single precision arithmetic, the throughput can be computed based on the "cuda cores" or single precision units. For double precision arithmetic, the throughput must be computed based on the double precision units.
In a Kepler K20 SMX, the ratio of double-precision units to single precision units is 1:3. Therefore the throughput for each type of arithmetic follows the same ratio. By "arithmetic" I mean here floating point multiply or floating point add.
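As a rough worked example (a sketch only: the unit counts per SMX and the SMX count are the ones quoted above for the K20Xm, the 732 MHz base clock is my assumption, and an FMA is counted as two floating-point operations):

    #include <stdio.h>

    int main(void)
    {
        const double clock_hz   = 732e6; // assumed base clock of the K20Xm
        const int    num_smx    = 14;    // SMX count from the question
        const int    sp_per_smx = 192;   // single-precision units per SMX
        const int    dp_per_smx = 64;    // double-precision units per SMX

        double sp_peak = clock_hz * num_smx * sp_per_smx * 2.0; // FMA = 2 ops
        double dp_peak = clock_hz * num_smx * dp_per_smx * 2.0;

        printf("SP peak: %.2f TFLOPS\n", sp_peak / 1e12);
        printf("DP peak: %.2f TFLOPS\n", dp_peak / 1e12);
        printf("SP:DP ratio: %.0f:1\n", sp_peak / dp_peak);
        return 0;
    }

With these assumptions the program prints roughly 3.94 SP TFLOPS and 1.31 DP TFLOPS, i.e. the 3:1 ratio described above.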
Related
http://us.hardware.info/reviews/5419/nvidia-geforce-gtx-titan-z-sli-review-incl-tones-tizair-system
says that "GTX Titan-Z" has 5760 Shader units. Also here is written that "GTX Titan-Z" has 2x GK110 GPU.
CUDA exp() expf() and __expf() mentiones that it is possible to calculate exponent in cuda.
Let's say I have array of 500 000 000 ( five hundred millions ) of doubles. I want to calculate exponents of each of value in array. Who knows what to expect: 5760 shader units will be able to calculate exp, or this task can be done only with two GK110 GPU? Difference in perfomance is drastical, so I need to be sure, that if I rewrite my app with CUDA, then it will not work slower.
In other words, can I make 5760 threads to calculate 500 000 000 exponents?
GTX Titan Z is a dual GPU device. Each of the two GK110 GPUs on the card is attached via a 384-bit memory interface to its own 6 GB of high-speed memory. The theoretical bandwidth of each memory is 336 GB/sec. The particular GK110 variant used in the GTX Titan Z is comprised of fifteen clusters of execution units called SMX. Each SMX in turn is comprised of 192 single-precision floating-point units, 64 double-precision floating point units, and various other units.
Each double-precision unit in GK110 can execute one FMA (fused multiply-add), or one FMUL, or one FADD per clock cycle. At a base clock of 705 MHz, the maximum total number of DP operations that can be executed by each of the GK110 GPUs on Titan Z per second is therefore 705e6 * 15 * 64 = 676.8e9. Assuming all operations are FMAs, that equates to 1.3536 double-precision TFLOPS. Since the card uses two GPUs, the total DP performance of a GTX Titan Z is thus 2.7072 TFLOPS.
Like CPUs, GPUs provide general-purpose computation via various integer and floating-point units. GPUs also provide special function units (called MUFU = multifunction unit on GK110) that can compute rough single-precision approximations to some frequently used functions such as reciprocal, reciprocal square root, sine, cosine, exponential base 2, and logarithm base 2. As far as exponentiation is concerned, the standard single-precision math function exp2f() is the only function that maps more or less directly to a MUFU instruction (MUFU.EX2). Depending on compilation mode, there is a thin wrapper around this hardware instruction since the hardware does not support denormal operands in the special function units.
All other exponentiation in CUDA is performed via software subroutines. The standard single-precision function expf() is a fairly heavy-weight wrapper around the hardware's exp2 capability. The double-precision exp() function is a pure software routine based on minimax polynomial approximation. The complete source code for it is visible in the CUDA header file math_functions_dbl_ptx3.h (in CUDA 6.5, the DP exp() code starts at line 1706 in that file). As you can see, the computation involves primarily double-precision floating-point operations, as well as integer and some single-precision floating-point operations. You can also look at the machine code by disassembling a binary executable that calls exp() with cuobjdump --dump-sass.
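To make the different flavors concrete, here is a minimal kernel sketch of my own (not from the original post) that uses the intrinsic and the standard math functions side by side; compiling with nvcc's -use_fast_math option substitutes the __expf() intrinsic for calls to expf():

    // __expf() (and exp2f()) are backed by the MUFU special function unit,
    // while expf() is a more accurate software wrapper and the
    // double-precision exp() is a pure software routine.
    __global__ void exp_flavors(const float* xf, const double* xd,
                                float* approx, float* accurate,
                                double* dbl, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) {
            approx[i]   = __expf(xf[i]);   // fast, low-accuracy intrinsic
            accurate[i] = expf(xf[i]);     // standard single-precision function
            dbl[i]      = exp(xd[i]);      // double-precision software routine
        }
    }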
In terms of performance, in CUDA 6.5 the double precision exp() function has a throughput on the order of 25e9 function calls per second on a Tesla K20 (1.170 DP TFLOPS). Since each call to DP exp() consumes an 8-byte source operand and produces an 8-byte result, this equates to roughly 400 GB/sec of memory bandwidth. Since each GK110 on a Titan Z provides about 15% more performance than the GK110 on a Tesla K20, the throughput and bandwidth requirements increase accordingly. Since the required bandwidth exceeds the theoretical memory bandwidth of the GPU, code that simply applies DP exp() to an array will be completely bound by memory bandwidth.
The number of functional units in the GPU and the number of threads executing has no relationship with the number of array elements that can be processed, but can have an impact on the performance of such processing. The mapping of array elements to threads can be freely chosen by the programmer. The number of array elements that can be processed in one go is a function of the size of the GPU's memory. Note that not all of the raw memory on the device is available for user code as the CUDA software stack needs some memory for its own use, typically around 100 MB or so. An exemplary mapping for applying DP exp() to an array is shown in this code snippet:
__global__ void exp_kernel (const double * __restrict__ src,
                            double * __restrict__ dst, int len)
{
    // grid-stride loop: each thread handles every stride-th element, so a
    // fixed number of threads can process an array of any length
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = exp (src[i]);
    }
}

#define ARRAY_LENGTH      (500000000)
#define THREADS_PER_BLOCK (256)

int main (void) {
    // ...
    int len = ARRAY_LENGTH;
    dim3 dimBlock(THREADS_PER_BLOCK);
    int threadBlocks = (len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;   // cap the grid size
    dim3 dimGrid(threadBlocks);
    double *d_a = 0, *d_b = 0;
    cudaMalloc((void**)&d_a, sizeof(d_a[0]) * len);
    cudaMalloc((void**)&d_b, sizeof(d_b[0]) * len);
    // ...
    exp_kernel<<<dimGrid,dimBlock>>>(d_a, d_b, len);
    // ...
}
I was writing a program which includes a CUDA kernel. I found that if you use #define OPERATOR * one thread uses 11 registers, but if you use #define OPERATOR / (the division operator) one thread uses 52 registers! What is wrong? I must
decrease the register count (I don't want to set maxrregcount)! How can I decrease the number of registers when using the division operator in a CUDA kernel?
#include <stdio.h>
#include <stdlib.h>

#define GRID_SIZE  1
#define BLOCK_SIZE 1
#define OPERATOR   /

__global__ void kernel(double* array){
    for (int curEl = 0; curEl < BLOCK_SIZE; ++curEl){
        array[curEl] = array[curEl] OPERATOR 10;
    }
}

int main(void) {
    double *devPtr = NULL, *data = (double*)malloc(sizeof(double) * BLOCK_SIZE);
    cudaFuncAttributes cudaFuncAttr;
    cudaFuncGetAttributes(&cudaFuncAttr, kernel);
    for (int curElem = 0; curElem < BLOCK_SIZE; ++curElem){
        data[curElem] = curElem;
    }
    cudaMalloc(&devPtr, sizeof(double) * BLOCK_SIZE);
    cudaMemcpy(devPtr, data, sizeof(double) * BLOCK_SIZE, cudaMemcpyHostToDevice);
    kernel<<<1, BLOCK_SIZE>>>(devPtr);
    printf("1 thread needs %d regs\n", cudaFuncAttr.numRegs);
    return 0;
}
The increase in register use when switching from double-precision multiplication to double-precision division in the kernel computation is due to the fact that double-precision multiplication is a built-in hardware instruction, while double-precision division is emulated by a sizable software subroutine that is called (that is, a function call of sorts). This is easily verified by inspecting the generated machine code (SASS) with cuobjdump --dump-sass.
The reason that double-precision divisions (and in fact all divisions, including single-precision division and integer division) are emulated either by inline code or called subroutines is due to the fact that the GPU hardware has no direct support for division operations, in order to keep the individual computational cores ("CUDA cores") as simple and as small as possible, which ultimately leads to higher peak performance for a given size chip. It likely also improves the efficiency of the cores as measured by the GFLOPS/watt metric.
For release builds, the typical increase in register use caused by the introduction of double-precision division is around 26 registers. These additional registers are needed to store intermediate variables in the division computation, where each double-precision temporary variable requires two 32-bit registers.
As Marco13 points out in a comment above, it may be possible to manually replace division by multiplication with the reciprocal. However, this causes slight numerical differences in most cases, which is why the CUDA compiler does not apply this transformation automatically.
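As an illustration only, here is that manual transformation applied to the kernel from the question (a sketch; the reciprocal of 10 is not exactly representable in binary floating point, so results can differ slightly from true division):

    __global__ void kernel(double* array){
        const double recip = 1.0 / 10.0;   // reciprocal folded to a constant at compile time
        for (int curEl = 0; curEl < BLOCK_SIZE; ++curEl){
            array[curEl] = array[curEl] * recip;   // multiply instead of divide
        }
    }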
Generally speaking, register use can be controlled with compilation-unit granularity through the -maxrregcount nvcc compiler flag, or with per-function granularity using the __launch_bounds__ function attribute. However, forcing lower register use by more than a few registers below the level determined by the compiler frequently leads to register spilling in the generated code, which usually has a negative impact on kernel performance.
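A minimal sketch of both mechanisms, again using the kernel from the question (the values 32, 256, and 2 below are placeholders for illustration, not recommendations):

    // Compilation-unit granularity: cap register use for every kernel in the
    // file with a compiler flag, e.g.
    //     nvcc -maxrregcount=32 app.cu
    //
    // Per-function granularity: __launch_bounds__(maxThreadsPerBlock,
    // minBlocksPerMultiprocessor) tells the compiler which launch
    // configuration to optimize for, which indirectly bounds register use.
    __global__ void __launch_bounds__(256, 2) kernel(double* array){
        for (int curEl = 0; curEl < BLOCK_SIZE; ++curEl){
            array[curEl] = array[curEl] OPERATOR 10;
        }
    }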
I am a bit confused about how many scalar channels (I mean "GPU SIMD width" x "GPU SIMD cores")
a GPU has, for example my own GPU, an NVIDIA GeForce GT 610.
It has 48 shader processors (I hope each such processor has a separate SIMD word
for processing), and some say the most common (?) GPU SIMD width is 32
floats/ints. So is my calculation right that it has 48 x 32 = 1536 scalar
channels? (I mean, when all shader processors are at work, 1536 floats can be processed in one step.)
The GT610 is a cc 2.1 GPU with a single SM. That SM contains 48 CUDA cores (=shader processors). Each CUDA core is capable of producing one single precision scalar result per clock cycle. Each CUDA core does not have a separate SIMD path to process a SIMD word. It processes one scalar element per clock cycle.
It has 48 scalar channels. 48 floats can be processed in one step, i.e. in one clock cycle.
The SIMT vector width of GT610 is 32, just as it is on all CUDA GPUs -- this is the "warp size". This means when a CUDA instruction is issued, it will be executed across 32 threads per instruction issue.
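If in doubt, the relevant quantities can be queried at run time. Here is a minimal sketch using the CUDA runtime API (note that the number of CUDA cores per multiprocessor is not reported by the API; it has to be looked up per compute capability):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("device name        : %s\n", prop.name);
        printf("compute capability : %d.%d\n", prop.major, prop.minor);
        printf("multiprocessors    : %d\n", prop.multiProcessorCount);
        printf("warp size          : %d\n", prop.warpSize);
        return 0;
    }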
This is what I found: "Cores perform only single-precision floating-point arithmetics. There is 1 double-precision floating-point unit."
Is this true for all compute capabilities (versions)?
Single and double precision floating-point accuracy and performance have continuously evolved and differ between compute capabilities.
http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
section 5.4.1, table 5-1.
I am researching a possible GPU-based teraflop computing machine...
The benchmark to be used will be LINPACK.
Now here's the problem: going through the LINPACK documentation, it says that it calculates in full precision and not in double precision, and for some machines full precision can be single precision. Can someone please shed some light on the difference, as this will dictate whether I should go for the GTX 590s or the Tesla 2070s.
I think the term "full precision" was chosen to cover both IEEE-754 double precision (this is what is used on the GPUs mentioned) and the "single precision" format of old Cray vector computers, which sported 1 sign bit, 15 exponent bits, and 48 mantissa bits, providing a larger range but slightly less precision than IEEE-754 double precision. Here is documentation for the floating-point format used on the Cray-1:
http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html#p3-20
Concerning NVIDIA's official HPL version 0.8 (that's what we use to benchmark our hybrid machines):
It will run only on Teslas (it works only if your GPU has more than 2 GiB of memory, which, as far as I know, is true only for Tesla)
It uses double precision, so another point for using Teslas, since double arithmetic performance is limited on mainstream GPUs.
BTW: achieving at least 50% efficiency on a 6-node machine (2 GPUs per node) is considered barely possible.