In the CUDA C Programming Guide Version 5, Appendix E.2 (Linear Filtering), it is stated that:
In this filtering mode, which is only available for floating-point
textures, the value returned by the texture fetch is...
The emphasized phrase ("floating-point textures") is confusing me. Does floating point refer to the texel type only, or also to the return type? For example, I declare three textures as follows.
texture<float,cudaTextureType2D> tex32f;
texture<unsigned char, cudaTextureType2D, cudaReadModeNormalizedFloat> tex8u;
texture<unsigned short, cudaTextureType2D, cudaReadModeNormalizedFloat> tex16u;
Is linear filtering available for tex32f only, or also for tex8u and tex16u?
It means that linear filtering is available only when the "read mode" of the texture is cudaReadModeNormalizedFloat, i.e. integer types (such as u8) get promoted to floating point values in the range [0.0, 1.0] (for unsigned integers) or [-1.0, 1.0] (for signed integers).
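As a quick illustration (a sketch, not from the original answer), with the normalized-float read mode linear filtering can be enabled on the 8-bit texture just as on the 32-bit one; this reuses the tex8u declaration from the question:

texture<unsigned char, cudaTextureType2D, cudaReadModeNormalizedFloat> tex8u;

void enableFiltering()
{
    tex8u.filterMode = cudaFilterModeLinear;  // hardware bilinear interpolation
    tex8u.normalized = false;                 // unnormalized texture coordinates
}

__global__ void fetch(float *out, float x, float y)
{
    // The u8 texels are promoted to floats in [0,1] on read, then interpolated.
    out[0] = tex2D(tex8u, x, y);
}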
I am reading some tensor core material and related code on simple GEMM. I have two questions:
1. When using a tensor core for D = A*B + C, it multiplies two 4x4 fp16 matrices and adds the fp32 product matrix to an fp32 accumulator. Why does multiplying two fp16 inputs A*B produce an fp32 result?
2. In the code example, why are the scale factors alpha and beta needed? In the example they are set to 2.0f.
Code snippet from the NVIDIA blog:
// scale the accumulator and the existing value of C: D = alpha*A*B + beta*C
for (int i = 0; i < c_frag.num_elements; i++) {
    c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
}
The Tensor Core designers in this case chose to provide an FP32 accumulate option so that the results of many multiply-accumulate steps could be represented with both greater precision (more mantissa bits) and greater range (more exponent bits). This was considered valuable for the overall computational problems they wanted to support, including HPC and AI calculations. The product of two FP16 numbers might not be representable in FP16, whereas every such product is exactly representable in FP32.
The scale factors alpha and beta are provided so that the GEMM operation offered here corresponds directly to the well-known BLAS GEMM operation (C = alpha*A*B + beta*C), which is widely used in numerical computation. This allows developers to more easily use the Tensor Core capability as a drop-in for a commonly used calculation paradigm in existing numerical codes. It is the same reason the cuBLAS GEMM implementation provides these adjustable parameters.
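For context, here is a minimal sketch of how the snippet above fits into a complete WMMA tile computation, assuming 16x16x16 tiles and column-major fragments (a sketch modeled on, but not identical to, the blog example):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of D = alpha*A*B + beta*C.
__global__ void wmma_gemm_tile(const half *a, const half *b, const float *c, float *d,
                               float alpha, float beta)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                 // FP32 accumulator
    wmma::load_matrix_sync(a_frag, a, 16);               // 16x16 FP16 tile of A
    wmma::load_matrix_sync(b_frag, b, 16);               // 16x16 FP16 tile of B
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // acc = A*B + acc, accumulated in FP32

    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_col_major);
    for (int i = 0; i < c_frag.num_elements; i++)
        c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];  // BLAS-style scaling
    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_col_major);
}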
OpenCL has a built-in function named native_recip:
gentype native_recip(gentype x);
native_recip computes reciprocal over an implementation-defined range. The maximum error is implementation-defined.
The vector versions of the math functions operate component-wise. The description is per-component.
The built-in math functions are not affected by the prevailing rounding mode in the calling environment, and always return the same value as they would if called with the round to nearest even rounding mode.
Is there an equivalent to this function in CUDA?
As noted in the comments, it's __frcp_rn() for floats and __drcp_rn() for doubles; for vector types (e.g. float4) you would write a small wrapper that applies __frcp_rn()/__drcp_rn() elementwise.
Note: "rcp" is short for "reciprocal" and "rn" is for the rounding mode "round to nearest even".
So, I want to divide me some 32-bit unsigned integers on a GPU, and I don't care about getting an exact result. In fact, let's be lenient and suppose I'm willing to accept a multiplicative error factor of up to 2, i.e. if q = x/y I'm willing to accept anything between 0.5*q and 2*q.
I haven't yet measured anything, but it seems to me that something like this (CUDA code) should be useful:
__device__ unsigned cheap_approximate_division(unsigned dividend, unsigned divisor)
{
    // __clz() counts leading zeros, so 31 - __clz(x) approximates floor(log2(x));
    // the quotient is then approximated by the corresponding power of two.
    return 1u << (__clz(divisor) - __clz(dividend));
}
it uses the "find first (bit) set" integer intrinsic as a cheap base-2-logarithm function.
Note: I could make this non-32-bit-specific, but then I'd have to complicate the code with templates, e.g. wrap __clz() in a templated function that dispatches to __clzll() for 64-bit types.
Questions:
Is there a better method for such approximate division, in terms of clock cycles? Perhaps with slightly different constraints?
If I want better accuracy, should I stay with integers or should I just go through floating-point arithmetic?
Going via floating point gives you a much more precise result, slightly lower instruction count on most architectures, and potentially a higher throughput:
__device__ unsigned cheap_approximate_division(unsigned dividend, unsigned divisor)
{
    // __fdividef() is the fast, approximate single-precision division intrinsic.
    return (unsigned)(__fdividef(dividend, divisor) /*+0.5f*/ );
}
The +0.5f in the comment indicates that you can also turn the float->int conversion into proper rounding at essentially no cost other than slightly higher energy consumption (it turns an fmul into an fmad, with the constant coming straight from the constant cache). Rounding would take you further away from the exact (truncated) integer result, though.
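For reference, a sketch of the rounded variant described above (the helper name is mine, not from the original answer):

__device__ unsigned cheap_approximate_division_rounded(unsigned dividend, unsigned divisor)
{
    // Adding 0.5f before the float->unsigned conversion rounds to nearest
    // instead of truncating; the fmul becomes an fmad.
    return (unsigned)(__fdividef((float)dividend, (float)divisor) + 0.5f);
}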
Can the QR algorithm (https://en.wikipedia.org/wiki/QR_algorithm) find repeated eigenvalues? I.e., does it support the case where not all N eigenvalues of a real N x N matrix are distinct?
How can the QR algorithm be extended to support finding complex eigenvalues?
In principle, yes. It will work if the repeated eigenvalues are semisimple, i.e., if their algebraic and geometric multiplicities are the same.
If a multiple eigenvalue occurs in a Jordan block of size s, then the unavoidable floating point error during the iteration will almost surely perturb it into a star-shaped cluster of eigenvalues with relative error of size mu^(1/s), where mu is the machine epsilon of the floating point data type.
The reason this happens is that on the irreducible invariant subspace corresponding to a Jordan block of size s, the characteristic polynomial of the restriction of the linear operator to this subspace is (λ-λ[j])^s. During the computation this gets perturbed to (λ-λ[j])^s + μ·q(λ), which to first approximation has roots close to λ[j] + μ^(1/s)·z[k], where z[k] denotes the s roots of 0 = z^s + q(λ[j]). The perturbation function q itself is essentially random, it accumulates floating point truncation errors and depends on details of the method.
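As a concrete illustration of the mu^(1/s) effect (my example, not from the original answer): for s = 2, a perturbation of size ε in a single entry of a 2x2 Jordan block already splits the double eigenvalue by ±√ε:

\[
J_\varepsilon =
\begin{pmatrix}
\lambda & 1 \\
\varepsilon & \lambda
\end{pmatrix},
\qquad
\det(J_\varepsilon - x I) = (x - \lambda)^2 - \varepsilon,
\qquad
x = \lambda \pm \sqrt{\varepsilon}.
\]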
Can I take advantage of CUDA texture filtering when using a 16-bit float texture type? I have already tested a 32-bit float texture in a CUDA 3D array and filtering works fine. CUDA doesn't seem to support unsigned short texture interpolation directly, which would be perfect for me, as it occupies less memory space.
I'm thinking about this solution; correct me if I'm wrong:
1. convert my unsigned short data to 16-bit floats in the range [0;1] (how?);
2. malloc a 3D array with the cudaCreateChannelDescHalf() channel descriptor;
3. bind a texture of the unsigned short data to that array;
4. send it to GPU memory, into the 3D array;
5. in the kernel, use the tex3D() function to get the values.
See answer below...
Again I'm answering my own question. Next time I'll try to dig deeper before posting here...
I think the problem was with the texture declaration:
texture<unsigned short, cudaTextureType3D, cudaReadModeNormalizedFloat> tex;
Filtering, as I see, is supported only if the returned value is of float type, which can be forced with cudaReadModeNormalizedFloat as above. Then tex3D() returns an interpolated float value in [0;1].
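For completeness, a minimal sketch of the whole setup using the texture reference API as in the declaration above (the 3D cudaArray allocation and copy are assumed to have been done already):

texture<unsigned short, cudaTextureType3D, cudaReadModeNormalizedFloat> tex;

__global__ void sample(float *out, float x, float y, float z)
{
    // Hardware trilinear interpolation; u16 texels are normalized to [0;1] on read.
    out[0] = tex3D(tex, x, y, z);
}

void bindVolume(cudaArray *volumeArray)  // assumed already allocated and filled
{
    tex.filterMode     = cudaFilterModeLinear;  // enable linear filtering
    tex.normalized     = false;                 // unnormalized coordinates
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.addressMode[2] = cudaAddressModeClamp;
    cudaBindTextureToArray(tex, volumeArray);
}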