What is the mathematical formula for sparse categorical cross entropy loss?

Can anyone help me with the mathematics of the sparse categorical cross entropy loss function? I have searched for a derivation and a mathematical explanation but couldn't find any.
I know this is not the right place to ask a question like this, but I am helpless.

It is just cross entropy loss. The "sparse" refers to the representation it expects, for efficiency reasons. E.g. in Keras the label provided is expected to be an integer i*, the index for which target[i*] = 1.
CE(target, pred) = -1/n SUM_k SUM_i target_{k,i} * log(pred_{k,i})
and since we have a sparse target, we have
sparse-CE(int_target, pred) = -1/n SUM_k log(pred_{k, int_target_k})
So instead of summing over the label dimension we just index into it, since we know all the remaining entries are 0 anyway.
And overall, as long as each target is exactly one class, we have:
CE(target, pred) = CE(onehot(int_target), pred) = sparse-CE(int_target, pred)
The only reason for this distinction is efficiency. For regular classification with ~10-100 classes it does not really matter, but imagine word-level language models where we have thousands of classes.
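As a quick illustration (a minimal NumPy sketch, not the Keras implementation; the names and toy values below are made up), the dense and sparse forms of the loss agree when the integer labels are one-hot encoded:
import numpy as np

def cross_entropy(onehot_targets, preds):
    # CE = -1/n * SUM_k SUM_i target_{k,i} * log(pred_{k,i})
    return -np.mean(np.sum(onehot_targets * np.log(preds), axis=1))

def sparse_cross_entropy(int_targets, preds):
    # Index the predicted probability of the true class instead of summing.
    n = len(int_targets)
    return -np.mean(np.log(preds[np.arange(n), int_targets]))

# Toy example: 3 samples, 4 classes.
preds = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
int_targets = np.array([0, 1, 3])
onehot = np.eye(4)[int_targets]

assert np.isclose(cross_entropy(onehot, preds),
                  sparse_cross_entropy(int_targets, preds))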

Related

Prove that not all ternary linear codes contain the all zero codeword

I have to prove that not all ternary linear codes contain the all zero codeword.
Some term explanations:
A linear code is a code in which the sum or difference of any two codewords is also a codeword. In other words, for all x, y in C, x +/- y is also in C.
A ternary codeword in this exercise is a codeword whose symbols are 0, 1 or 2, e.g. 1011, 1012, 012, ...
An all zero codeword is something like 0000, 000, 00000, ...
My attempt:
The addition of codewords must be done mod 3, which means there are three allowable symbols in this mod-3 world: 0, 1 and 2. For any ternary linear code, there are cases when all the additions of any two codewords yield non-zero codewords.
I am not sure about the proof. I have an idea that it must deal with the mod-3 arithmetic, but I am not sure how to explain it.
Thank you so much!
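For concreteness, here is a small Python sketch of the mod-3 closure condition described above (the candidate code is a made-up example, and this is only an illustration of the definition, not a proof):
from itertools import product

def is_linear_mod3(code):
    # Check closure of a set of ternary words under componentwise +/- mod 3.
    for x, y in product(code, repeat=2):
        if tuple((a + b) % 3 for a, b in zip(x, y)) not in code:
            return False
        if tuple((a - b) % 3 for a, b in zip(x, y)) not in code:
            return False
    return True

# A made-up candidate set of ternary words of length 3.
candidate = {(0, 0, 0), (1, 1, 1), (2, 2, 2)}
print(is_linear_mod3(candidate))   # True: closed under +/- mod 3
print((0, 0, 0) in candidate)      # whether it contains the all-zero word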

Shaping theorem for MDPs

I need help with understanding the shaping theorem for MDPs. Here's the relevant paper: https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf. It basically says that a Markov decision process with some reward function R(s, a, s') on transitions between states and actions has the same optimal policy as a different Markov decision process whose reward is defined as R'(s, a, s') = R(s, a, s') + gamma*f(s') - f(s), where gamma is the time discount rate.
I understand the proof, but it seems like a trivial case where it breaks down is when R(s, a, s') = 0 for all states and actions, and the agent is faced with the path A -> s -> B versus A -> r -> t -> B. With the original Markov process we get an expected value of 0 for both paths, so both paths are optimal. But with the potential added to each transition we get gamma^2*f(B) - f(A) for the first path and gamma^3*f(B) - f(A) for the second. So if gamma < 1 and 0 < f(B), f(A), then the second path is no longer optimal.
Am I misunderstanding the theorem, or am I making some other mistake?
You are missing the assumption that for every terminal state s_T and starting state s_0 we have f(s_T) = f(s_0) = 0. (Note that in the paper there is an assumption that after the terminal state there is always a new starting state, so the potential "wraps around".)
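A small numeric sketch of the telescoping sum behind both the question and this answer (gamma and the potentials f below are made-up values, just to show where the f(terminal) = 0 assumption matters):
# Shaped reward for a transition s -> s' with zero base reward:
#   R'(s, a, s') = gamma * f(s') - f(s)
# The discounted return along a path s_0 -> s_1 -> ... -> s_T telescopes to
#   gamma^T * f(s_T) - f(s_0).

def shaped_return(path, f, gamma):
    return sum(gamma**t * (gamma * f(path[t + 1]) - f(path[t]))
               for t in range(len(path) - 1))

gamma = 0.9
potentials = {"A": 1.0, "s": 2.0, "r": 0.5, "t": 3.0, "B": 4.0}  # made-up values
f = potentials.get

print(shaped_return(["A", "s", "B"], f, gamma))       # gamma^2 * f(B) - f(A)
print(shaped_return(["A", "r", "t", "B"], f, gamma))  # gamma^3 * f(B) - f(A)

# With the terminal-state assumption f(B) = 0, both paths give -f(A),
# so the shaping term no longer distinguishes them.
potentials["B"] = 0.0
print(shaped_return(["A", "s", "B"], f, gamma))       # -f(A)
print(shaped_return(["A", "r", "t", "B"], f, gamma))  # -f(A)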

Nvidia Tesla T4 tensor core benchmark

I am using the code given here to find out the TFLOPS of mixed-precision ops on an Nvidia Tesla T4. Its theoretical value is given as 65 TFLOPS; however, the code reports about 10 TFLOPS. Is there any explanation that can justify this?
This might be more of an extended comment, but hear me out ...
As pointed out in the comments, the CUDA Samples are not meant as performance measuring tools.
The second benchmark you provided does not actually use tensor cores, but just normal multiply-add instructions executed on the FP32 or FP64 cores.
for (int i = 0; i < compute_iterations; i++) {
    tmps[j] = mad(tmps[j], tmps[j], seed);  // multiply-add on the regular FP32/FP64 ALUs, not on tensor cores
}
On a Turing T4 this gives me a peak of 7.97 TFLOPS for single-precision operations, very close to the theoretical limit of 8.1 TFLOPS.
For half-precision operations I get 16.09 TFLOPS, as expected about double the single-precision performance.
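For reference, the 8.1 TFLOPS figure can be reproduced from the card's published specs; a quick back-of-the-envelope sketch (the core count and boost clock below are the commonly quoted T4 numbers, so treat them as assumptions for your particular card):
cuda_cores = 2560         # FP32 CUDA cores on a Tesla T4 (published spec)
boost_clock_hz = 1.59e9   # ~1590 MHz boost clock (published spec)
flops_per_fma = 2         # one fused multiply-add counts as two FLOPs

fp32_peak_tflops = cuda_cores * flops_per_fma * boost_clock_hz / 1e12
print(fp32_peak_tflops)   # ~8.1 TFLOPS, matching the theoretical limit above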
Now, on to Tensor cores. As the previously mentioned benchmark does not use them, let's look for something that does.
CUTLASS (https://github.com/NVIDIA/cutlass) is a high-performance matrix-matrix multiplication library from NVIDIA.
They ship a profiling application covering all of the kernels they provide. If you run this on a T4, you should get output like this:
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_h1688gemm_256x128_32x2_nt_align8
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Passed
Arguments: --gemm_kind=universal --m=1024 --n=1024 --k=1024 --A=f16:column --B=f16:row --C=f16:column --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f16 --cta_m=256 --cta_n=128 \
--cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \
--max_cc=1024
Bytes: 6291456 bytes
FLOPs: 2149580800 flops
Runtime: 0.0640419 ms
Memory: 91.4928 GiB/s
Math: 33565.2 GFLOP/s
As you can see, we are now actually using the tensor cores and half-precision operations, with a performance of 33.5 TFLOPS. Now, this might not be the full 65 TFLOPS, but for an application you can use in the real world, that is pretty good.
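As a sanity check, the 33.5 TFLOPS figure follows directly from the FLOP count and runtime in the profiler output above:
flops = 2149580800           # "FLOPs" reported by the CUTLASS profiler
runtime_s = 0.0640419e-3     # "Runtime" of 0.0640419 ms, in seconds

print(flops / runtime_s / 1e9)   # ~33565 GFLOP/s, i.e. ~33.5 TFLOPS
# Note: for a 1024x1024x1024 GEMM this is roughly 2*m*n*k = 2 * 1024**3 FLOPs.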

Vectorization or sum as matrix operations

Let there be the following definition of the gradient descent cost function
J(theta) = 1/(2*m) * SUM_i (h_theta(x^(i)) - y^(i))^2
with the hypothesis function defined as
h_theta(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n = theta' * x   (with x_0 = 1)
what I've come up with for multivariate linear regression is
theta = theta - alpha * 1/m * ([theta', -1]*[X';y']*X)';
h_theta = 1/(2*m)* (X*theta - y)'*(X*theta-y);
(Octave notation: ' means matrix transpose, [A, n] means appending a column with scalar value n to matrix A, [A; B] means appending matrix B to matrix A row-wise)
It's doing its job correctly as far as I can tell (the plots look OK), however I have a strong feeling that it's unnecessarily complicated.
How can I write it with as few matrix operations as possible (and no element-wise operations, of course)?
I don't think it is unnecessarily complicated; instead, this is what you want. Matrix operations are good because you don't have to loop over elements yourself or do element-wise operations. I remember taking a course online, and my solution seems pretty similar.
The way you have it is the most efficient way of doing it, as it is fully vectorized. It could be done with a for loop over the summation and so on; however, that is very inefficient in terms of processing power.
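For comparison, here is a common NumPy formulation of the same vectorized update and cost (a sketch under the usual conventions: X contains a leading column of ones for the intercept; the names are illustrative, not from the question):
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One vectorized gradient descent step for linear regression.

    X: (m, n) design matrix with a leading column of ones for the intercept,
    y: (m,) targets, theta: (n,) parameters, alpha: learning rate.
    """
    m = len(y)
    grad = X.T @ (X @ theta - y) / m   # gradient of the squared-error cost
    return theta - alpha * grad

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)**2), as in the question."""
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m)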

Working example of Principal Component Analysis?

Are there any examples available that give a hands-on example of Principal Component Analysis on a dataset? I am reading articles discussing only theory and am really looking for something that will show me how to use PCA and then interpret the results and transform the original dataset into the new dataset. Any suggestions please?
If you know Python, here is a short hands-on example:
import numpy as np
import scipy.linalg
import matplotlib.pyplot as plt

# Generate correlated data from uncorrelated data.
# Each column of X is a 3-dimensional feature vector.
Z = np.random.randn(3, 1000)
C = np.random.randn(3, 3)
X = C @ Z

# Visualize the correlation among the features.
plt.scatter(X[0, :], X[1, :])
plt.scatter(X[0, :], X[2, :])
plt.scatter(X[1, :], X[2, :])

# Perform PCA. It can be shown that the principal components of the
# matrix X are equivalent to the left singular vectors of X, which are
# equivalent to the eigenvectors of X X^T (up to indeterminacy in sign).
U, S, Vh = scipy.linalg.svd(X)
W, Q = scipy.linalg.eig(X @ X.T)
print(U)
print(Q)

# Project the original features onto the eigenspace.
Y = U.T @ X

# Visualize the absence of correlation among the projected features.
plt.scatter(Y[0, :], Y[1, :])
plt.scatter(Y[1, :], Y[2, :])
plt.scatter(Y[0, :], Y[2, :])
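As a quick numeric check, continuing from the snippet above, the covariance matrix of the projected data Y should be approximately diagonal, since the projection decorrelates the features:
# Off-diagonal entries should be close to zero compared to the diagonal.
print(np.round(np.cov(Y), 2))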
You can check http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html. SVD and LSA are approaches very similar to PCA; all are dimensionality-reduction methods. The only difference is in how the basis is computed.
Since you're asking for available hands-on examples, here is an interactive demo to play with.