Nvidia Tesla T4 tensor core benchmark [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 2 years ago.
I am using the code given here to find out the TFLOPS of mixed-precision ops on an Nvidia Tesla T4. Its theoretical value is given as 65 TFLOPS; however, the code reports about 10 TFLOPS. Is there any explanation for this discrepancy?

This might be more of an extended comment, but hear me out ...
As pointed out in the comments, the CUDA Samples are not meant as performance measuring tools.
The second benchmark you provided does not actually use Tensor Cores, just normal multiply-add instructions executed on the FP32 or FP64 cores:
for (int i = 0; i < compute_iterations; i++) {
    tmps[j] = mad(tmps[j], tmps[j], seed);  // one multiply-add per iteration: tmps[j] = tmps[j] * tmps[j] + seed
}
On a Turing T4 this gives me, for single-precision operations, a peak of 7.97 TFLOPS, very close to the theoretical limit of 8.1 TFLOPS.
For half-precision operations I get 16.09 TFLOPS, as expected roughly double the single-precision figure.
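For reference, a stripped-down FMA-throughput kernel along those lines might look like the following. This is my own sketch, not the benchmark's actual code; the launch configuration and timing boilerplate are left out.
__global__ void fma_peak(float *out, float seed, int compute_iterations) {
    float tmp = threadIdx.x * 1e-6f;                   // per-thread starting value
    for (int i = 0; i < compute_iterations; i++) {
        tmp = fmaf(tmp, tmp, seed);                    // one fused multiply-add = 2 FLOPs
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = tmp;  // store so the compiler cannot remove the loop
}
The achieved rate is then 2 * compute_iterations * total_threads / elapsed_seconds, which is how figures like the 7.97 TFLOPS above are obtained.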
Now, on to Tensor cores. As the previously mentioned benchmark does not use them, let's look for something that does.
CUTLASS (https://github.com/NVIDIA/cutlass) is a high performance Matrix-Matrix Multiplication library from NVIDIA.
They ship a profiler application that covers all of the included kernels. If you run it on a T4, you should get output like this:
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_h1688gemm_256x128_32x2_nt_align8
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Passed
Arguments: --gemm_kind=universal --m=1024 --n=1024 --k=1024 --A=f16:column --B=f16:row --C=f16:column --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f16 --cta_m=256 --cta_n=128 \
--cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \
--max_cc=1024
Bytes: 6291456 bytes
FLOPs: 2149580800 flops
Runtime: 0.0640419 ms
Memory: 91.4928 GiB/s
Math: 33565.2 GFLOP/s
As you can see, we are now actually using Tensor Cores and half-precision operations, with a performance of 33.5 TFLOPS. This might not be the full 65 TFLOPS, but for an application you can use in the real world, that is pretty good.
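If you want to see Tensor Cores driven without pulling in all of CUTLASS, here is a minimal sketch using CUDA's WMMA API, which is the building block such kernels sit on. It has one warp multiply a single 16x16x16 half-precision tile; the tile shape, layouts and pointer names are illustrative only, and there is no tiling or performance tuning whatsoever.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const half *A, const half *B, float *C) {
    // Fragments live in registers and are owned collectively by the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // C = 0
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on the Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
Compile with -arch=sm_75 for the T4. Getting anywhere near the 33 TFLOPS shown above still requires the tiling, shared-memory staging and pipelining that CUTLASS implements on top of this primitive.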

Related

What is the Mathematical formula for sparse categorical cross entropy loss? [closed]

Closed. This question is not about programming or software development. It is not currently accepting answers.
Closed 6 months ago.
Can anyone help me with the mathematics of the sparse categorical cross-entropy loss function? I have searched for a derivation or mathematical explanation but couldn't find any.
I know this is not the right place to ask a question like this, but I am helpless.
It is just the cross-entropy loss. The "sparse" refers to the representation it expects, for efficiency reasons: e.g. in Keras the label provided is expected to be an integer i*, the index for which target[i*] = 1.
CE(target, pred) = -1/n SUM_k [ SUM_i target_ki log pred_ki ]
and since we have a sparse target, we have
sparse-CE(int_target, pred) = -1/n SUM_k [ log pred_k{int_target_k} ]
So instead of summing over the label dimension we just index, since we know all the remaining entries are 0 anyway.
Overall, as long as each target is exactly one class, we have:
CE(target, pred) = CE(onehot(int_target), pred) = sparse-CE(int_target, pred)
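As a concrete, made-up three-class example with pred = (0.1, 0.2, 0.7) and the true class at index 2 (zero-based), both forms give the same number:
CE((0,0,1), (0.1,0.2,0.7)) = -log 0.7 = sparse-CE(2, (0.1,0.2,0.7)) ≈ 0.357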
The only reason for this distinction is efficiency. For regular classification with ~10-100 classes it does not really matter, but imagine word-level language models where we have thousands of classes.

CUDA float addition gives wrong answer (compared to CPU float ops) [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I am new to CUDA. I was using CUDA to compute the dot product of float vectors and I came across a floating-point addition issue. In essence, the following is the simple kernel. I'm compiling with -arch=sm_50.
The basic idea is for thread 0 to add up the values of vector a.
__global__ void temp(float *a, float *b, float *c) {
    // Only one thread does the work: thread 0 of block (0, 0) sums all of a.
    if (0 == threadIdx.x && blockIdx.x == 0 && blockIdx.y == 0) {
        float xx = 0.0f;
        for (int i = 0; i < LENGTH; i++) {   // LENGTH is 1000 in the tests below
            xx += a[i];
        }
        *c = xx;
    }
}
When I initialize 'a' with 1000 elements of 1.0 I get the desired result of 1000.00,
but when I initialize 'a' with 1.1 I expect 1100.00xx; instead, I am getting 1099.989014. The CPU implementation simply yields 1100.000024.
I am trying to understand what the issue is here! :-(
I even tried counting the number of 1.1 elements in the a vector, and that yields 1000, as expected. I also used atomicAdd and still got the same result.
would be very grateful if someone could help me out here!
best
EDIT:
The biggest concern here is the disparity between the CPU result and the GPU result! I understand floats can be off by a few decimal places, but the GPU error is very significant! :-(
It is not possible to represent 1.1 exactly in IEEE-754 floating-point representation. As @RobertCrovella mentioned in his comment, the computation performed on the CPU does not use the same IEEE-754 settings as the GPU one.
Indeed, 1.1 in single-precision floating point is stored as 0x3F8CCCCD, which is 1.10000002384185. When summing 1000 such elements, the last bits get lost in rounding: one bit for the first addition, two bits after four additions, and so on, up to 10 bits after 1000. Depending on the rounding, you effectively truncate those 10 bits for the last half of the operations, hence end up summing something close to 0x3F8CCC00, which is 1.09997558.
The result from CUDA divided by 1000 is 0x3F8CCC71, which is consistent with a calculation carried out entirely in 32 bits.
When compiling on the CPU, depending on optimization flags, you may be using fast math, which uses the internal register precision. If vector registers are not used, that can mean the x87 FPU, which has 80 bits of precision. In that case the computation reads 1.1 as a float (1.10000002384185), adds it 1000 times at the higher precision, hence loses no bits to rounding, giving 1100.00002384185, and displays 1100.000024, its round-to-nearest representation.
Depending on compilation flags, the actual equivalent computation on the CPU may require enforcing 32-bit floating-point arithmetic, which can be done with the addss instruction of the SSE2 instruction set, for example.
You can also play with the /fp: option (MSVC) or -mfpmath (GCC) and inspect the generated instructions; the x87 fadd instruction is the 80-bit precision addition.
All of this has nothing to do with GPU floating-point precision. It is rather a misunderstanding of the IEEE-754 standard and of the legacy x87 FPU behaviour.
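To see the accumulator-precision effect in isolation, here is a small host-side sketch (mine, not from the question) that sums 1.1f a thousand times with a 32-bit and with a 64-bit accumulator:
#include <cstdio>

int main() {
    float  f_acc = 0.0f;   // 32-bit accumulator, like the GPU kernel
    double d_acc = 0.0;    // wider accumulator, similar to the x87 / constant-folded CPU behaviour
    for (int i = 0; i < 1000; i++) {
        f_acc += 1.1f;     // 1.1f is really 1.10000002384185791...
        d_acc += 1.1f;     // same 32-bit operand, accumulated in double
    }
    printf("float  accumulator: %f\n", f_acc);   // roughly 1099.989, matching the GPU result
    printf("double accumulator: %f\n", d_acc);   // roughly 1100.000024, matching the CPU build
    return 0;
}
Accumulating in double (or using Kahan summation) in the CUDA kernel would also give 1100.000024; the 1099.989014 value is simply the correctly rounded result for a strict 32-bit accumulator with this left-to-right summation order.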

Multiply two matrices in CUDA C [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have two matrices, A (32*32) and B (32*n), where n comes from the input and is between 2,000 and 2,000,000.
I have two kinds of inputs: one contains integers between 0 and 255, and the other contains only 0s and 1s. This multiplication is in a loop that iterates 3000 times. B (32*n) comes from the input and is constant over all of the iterations, but A (32*32) can change in each iteration.
// read B from file
// read A from file
double D[3000];
for (int i = 0; i < 3000; i++)
{
    C = multiply(A, B);
    // D[i] = mean of all elements in C
    // build A from B using D[i] (a really complicated sequential process with lots of ifs and switches)
}
What is the fastest way to do this?
thank you.
Nobody here is going to write code for you; that is not what Stack Overflow is intended for. However, there appear to be a number of characteristics of the problem which you should be looking to exploit to improve the performance of your code:
Recognise that because one of the matrices only contains 0 or 1 and you are performing this in integer, what you are describing as matrix multiplication is really a large number of independent sparse sums
Recognise that because the next operation is to compute an average, you don't actually have to store the intermediate dot products and could directly perform a reduction on partial results of the matrix row summation
There are probably parallel primitives in the thrust library which you could use for prototyping, and an optimal hand-written kernel would aim to fuse the first and most of the second part of the operation into a single kernel.
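A rough sketch of such a fused kernel follows, under the assumption that B really is 0/1 and stored row-major as 32 x n; all names are hypothetical, and the host still divides the accumulated total by 32.0 * n to obtain D[i]:
__global__ void fused_multiply_mean(const int *A,            // 32 x 32, row-major
                                    const unsigned char *B,  // 32 x n,  row-major, entries 0 or 1
                                    unsigned long long *total,
                                    int n) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;           // one thread per column of B
    if (c >= n) return;
    unsigned long long col_sum = 0;
    for (int k = 0; k < 32; k++) {
        if (B[k * n + c]) {                                  // B is 0/1: a sparse sum, not a multiplication
            int a_col_k = 0;                                 // sum of column k of A
            for (int r = 0; r < 32; r++) a_col_k += A[r * 32 + k];
            col_sum += (unsigned long long)a_col_k;
        }
    }
    atomicAdd(total, col_sum);                               // reduce straight to the grand total; C is never stored
}
Since A only changes between outer iterations, the 32 column sums of A could be computed once per iteration instead of inside the loop; the point of the sketch is only that the multiplication and the mean can be fused so the 32 x n product never has to be materialised.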

Vectorization or sum as matrix operations [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
Let there be the usual definition of the gradient descent cost function, with the hypothesis function defined as in standard multivariate linear regression. What I've come up with is
theta = theta - alpha * 1/m * ([theta', -1]*[X';y']*X)';
h_theta = 1/(2*m)* (X*theta - y)'*(X*theta-y);
(Octave notation: ' means matrix transpose, [A, n] means adding a new column with scalar value n to matrix A, and [A; B] means appending matrix B to matrix A row-wise)
It's doing its job correctly as far as I can tell (the plots look OK), however I have a strong feeling that it's unnecessarily complicated.
How can I write it with as few matrix operations as possible (and no element-wise operations, of course)?
I don't think that is unnecessarily complicated, and instead this is what you want. Matrix operations are good because you don't have to loop over elements yourself or do element-wise operations. I remember taking a course online and my solution seems pretty similar.
The way you have it is the most efficient way of doing it, as it is fully vectorized. It could also be done with a for loop over the summation and so on, however that is very inefficient in terms of processing power.
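For reference (my notation, not from the answer), the fully vectorized update and cost that these Octave expressions implement are usually written as
\theta := \theta - \frac{\alpha}{m} X^{\top} (X\theta - y), \qquad J(\theta) = \frac{1}{2m} (X\theta - y)^{\top} (X\theta - y)
which is exactly what ([theta', -1]*[X';y']*X)' expands to, since [theta', -1]*[X';y'] = (X*theta - y)'.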

What's the shortest code to cause a stack overflow? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion.
Closed 11 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
To commemorate the public launch of Stack Overflow, what's the shortest code to cause a stack overflow? Any language welcome.
ETA: Just to be clear on this question, seeing as I'm an occasional Scheme user: tail-call "recursion" is really iteration, and any solution which can be converted to an iterative solution relatively trivially by a decent compiler won't be counted. :-P
ETA2: I've now selected a “best answer”; see this post for rationale. Thanks to everyone who contributed! :-)
Read this line, and do what it says twice.
All these answers and no Befunge? I'd wager a fair amount that it's the shortest solution of them all:
1
Not kidding. Try it yourself: http://www.quirkster.com/iano/js/befunge.html
EDIT: I guess I need to explain this one. The 1 operand pushes a 1 onto Befunge's internal stack and the lack of anything else puts it in a loop under the rules of the language.
Using the interpreter provided, you will eventually--and I mean eventually--hit a point where the Javascript array that represents the Befunge stack becomes too large for the browser to reallocate. If you had a simple Befunge interpreter with a smaller and bounded stack--as is the case with most of the languages below--this program would cause a more noticeable overflow faster.
You could also try this in C#.net
throw new StackOverflowException();
Nemerle:
This crashes the compiler with a StackOverflowException:
def o(){[o()]}
My current best (in x86 assembly) is:
push eax
jmp short $-1
which results in 3 bytes of object code (50 EB FD). For 16-bit code, this is also possible:
call $
which also results in 3 bytes (E8 FD FF).
PIC18
The PIC18 answer given by TK results in the following instructions (binary):
overflow
PUSH
0000 0000 0000 0101
CALL overflow
1110 1100 0000 0000
0000 0000 0000 0000
However, CALL alone will perform a stack overflow:
CALL $
1110 1100 0000 0000
0000 0000 0000 0000
Smaller, faster PIC18
But RCALL (relative call) is smaller still (not global memory, so no need for the extra 2 bytes):
RCALL $
1101 1000 0000 0000
So the smallest on the PIC18 is a single instruction, 16 bits (two bytes). This takes 2 instruction cycles per loop. At 4 clock cycles per instruction cycle, that is 8 clock cycles per loop. The PIC18 has a 31-level stack, so on the 32nd loop it will overflow the stack, after 256 clock cycles. At 64 MHz, you would overflow the stack in 4 microseconds and 2 bytes.
PIC16F5x (even smaller and faster)
However, the PIC16F5x series uses 12 bit instructions:
CALL $
1001 0000 0000
Again, two instruction cycles per loop and 4 clocks per instruction cycle, so 8 clock cycles per loop.
However, the PIC16F5x has a two-level stack, so on the third loop it would overflow, after 24 clock cycles. At 20 MHz, it would overflow in 1.2 microseconds and 1.5 bytes.
Intel 4004
The Intel 4004 has an 8 bit call subroutine instruction:
CALL $
0101 0000
For the curious, that corresponds to ASCII 'P'. With a 3-level stack, it takes 24 clock cycles to overflow, for a total of 32.4 microseconds and one byte. (Unless you overclock your 4004 - come on, you know you want to.)
That is as small as the Befunge answer, but much, much faster than the Befunge code running in current interpreters.
C#:
public int Foo { get { return Foo; } }
Hoot overflow!
// v___v
let rec f o = f(o);(o)
// ['---']
// -"---"-
Every task needs the right tool. Meet the SO Overflow language, optimized to produce stack overflows:
so
TeX:
\def~{~.}~
Results in:
! TeX capacity exceeded, sorry [input stack size=5000].
~->~
.
~->~
.
~->~
.
~->~
.
~->~
.
~->~
.
...
<*> \def~{~.}~
LaTeX:
\end\end
Results in:
! TeX capacity exceeded, sorry [input stack size=5000].
\end #1->\csname end#1
\endcsname \@checkend {#1}\expandafter \endgroup \if@e...
<*> \end\end
Z-80 assembler -- at memory location 0x0000:
rst 00
one byte -- 0xC7 -- endless loop of pushing the current PC to the stack and jumping to address 0x0000.
In English:
recursion = n. See recursion.
Another PHP Example:
<?
require(__FILE__);
How about the following in BASIC:
10 GOSUB 10
(I don't have a BASIC interpreter I'm afraid so that's a guess).
I loved Cody's answer heaps, so here is my similar contribution, in C++:
template <int i>
class Overflow {
typedef typename Overflow<i + 1>::type type;
};
typedef Overflow<0>::type Kaboom;
Not a code golf entry by any means, but still, anything for a meta stack overflow! :-P
Here's my C contribution, weighing in at 18 characters:
void o(){o();o();}
This is a lot harder to tail-call optimise! :-P
Using a Windows batch file named "s.bat":
call s
Javascript
To trim a few more characters, and to get ourselves kicked out of more software shops, let's go with:
eval(i='eval(i)');
Groovy:
main()
$ groovy stack.groovy:
Caught: java.lang.StackOverflowError
at stack.main(stack.groovy)
at stack.run(stack.groovy:1)
...
Please tell me what the acronym "GNU" stands for.
Person JeffAtwood;
Person JoelSpolsky;
JeffAtwood.TalkTo(JoelSpolsky);
Here's hoping for no tail recursion!
C - It's not the shortest, but it's recursion-free. It's also not portable: it crashes on Solaris, but some alloca() implementations might return an error here (or call malloc()). The call to printf() is necessary.
#include <stdio.h>
#include <alloca.h>
#include <sys/resource.h>

int main(int argc, char *argv[]) {
    struct rlimit rl = {0};
    getrlimit(RLIMIT_STACK, &rl);
    (void) alloca(rl.rlim_cur);
    printf("Goodbye, world\n");
    return 0;
}
perl in 12 chars:
$_=sub{&$_};&$_
bash in 10 chars (the space in the function is important):
i(){ i;};i
try and put more than 4 patties on a single burger. stack overflow.
Python:
so=lambda:so();so()
Alternatively:
def so():so()
so()
And if Python optimized tail calls...:
o=lambda:map(o,o());o()
I'm selecting the “best answer” after this post. But first, I'd like to acknowledge some very original contributions:
aku's ones. Each one explores a new and original way of causing stack overflow. The idea of doing f(x) ⇒ f(f(x)) is one I'll explore in my next entry, below. :-)
Cody's one that gave the Nemerle compiler a stack overflow.
And (a bit grudgingly), GateKiller's one about throwing a stack overflow exception. :-P
Much as I love the above, the challenge is about doing code golf, and to be fair to respondents, I have to award “best answer” to the shortest code, which is the Befunge entry; I don't believe anybody will be able to beat that (although Konrad has certainly tried), so congrats Patrick!
Seeing the large number of stack-overflow-by-recursion solutions, I'm surprised that nobody has (as of this writing) brought up the Y combinator (see Dick Gabriel's essay, The Why of Y, for a primer). I have a recursive solution that uses the Y combinator, as well as aku's f(f(x)) approach. :-)
((Y (lambda (f) (lambda (x) (f (f x))))) #f)
Here's another interesting one from Scheme:
((lambda (x) (x x)) (lambda (x) (x x)))
Java
Slightly shorter version of the Java solution.
class X{public static void main(String[]a){main(a);}}
xor esp, esp
ret
3 bytes:
label:
pusha
jmp label
Update
According to the (old?) Intel(?) documentation, this is also 3 bytes:
label:
call label