regarding one code segment in computing log_sum_exp - deep-learning

In this tutorial on using PyTorch to implement a BiLSTM-CRF, the author implements the following function. Specifically, I don't quite understand what max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1]) is trying to do, or which mathematical formula it corresponds to.
# Compute log sum exp in a numerically stable way for the forward algorithm
def log_sum_exp(vec):
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))

Looking at the code, it seems like vec has a shape of (1, n).
Now we can follow the code line by line:
max_score = vec[0, argmax(vec)]
vec[0, argmax(vec)] is just a fancy way of taking the maximal value of vec: argmax(vec) gives the index of the maximum, and indexing row 0 at that position picks it out. So, max_score is (as the name suggests) the maximal value of vec.
max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
Next, we want to subtract max_score from each of the elements of vec.
To do so the code creates a vector of the same shape as vec with all elements equal to max_score.
First, max_score is reshaped to have two dimensions using the view command, then the reshaped 2D tensor is "stretched" to have length n using the expand command.
Finally, the log sum exp is computed robustly:
return max_score + \
    torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
The validity of this computation rests on the identity log(sum_i exp(x_i)) = m + log(sum_i exp(x_i - m)), which holds for any constant m (here m = max_score).
The rationale behind it is that exp(x) can "explode" (overflow) for large x; therefore, for numerical stability, it is best to subtract the maximal value before taking the exp.
As a side note, I think a slightly more elegant way to do the same computation, taking advantage of broadcasting, would be
max_score, _ = vec.max(dim=1, keepdim=True) # take max along second dimension
lse = max_score + torch.log(torch.sum(torch.exp(vec - max_score), dim=1))
return lse
Also note that log sum exp is already implemented by pytorch: torch.logsumexp.
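As a quick sanity check (my own sketch, not from the tutorial), the manual computation agrees with torch.logsumexp and stays finite even when a naive log(sum(exp(...))) overflows:

import torch

vec = torch.tensor([[1000.0, 1001.0, 999.0]])    # large values: exp() overflows in float32

naive = torch.log(torch.sum(torch.exp(vec)))      # tensor(inf) -- exp(1000.) overflows
max_score = vec.max()
stable = max_score + torch.log(torch.sum(torch.exp(vec - max_score)))
builtin = torch.logsumexp(vec, dim=1)

print(stable)    # tensor(1001.4076)
print(builtin)   # tensor([1001.4076])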

Related

Backpropagation on Two Layered Networks

I have been following the CS231n lectures from Stanford and trying to complete the assignments on my own, sharing my solutions both on GitHub and my blog. But I'm having a hard time understanding how to model backpropagation. I mean, I can code modular forward and backward passes, but what bothers me is the case where I have the model below: a two-layered neural network.
Let's assume that our loss function here is a softmax loss. In my modular softmax_loss() function I am calculating the loss and the gradient with respect to the scores (dSoft = dL/dY). After that, when I follow the graph backwards, say for b2, db2 would be equal to dSoft*1, and dW2 would be equal to dSoft*dX2 (the outputs of the ReLU gate). What's the chain rule here? Why isn't dSoft equal to 1? Because dL/dL would be 1?
The softmax loss function outputs a single number (the loss) given an input x.
What dSoft means is that you're computing the derivative of the softmax loss with respect to its input x. Then, to calculate the derivative with respect to W of the last layer, you use the chain rule, i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b, where x_prev is the input to the last node. Therefore dx/dW is just x_prev and dx/db is just 1, which means that dL/dW (or simply dW) is dsoftmax/dx * x_prev, and dL/db (or simply db) is dsoftmax/dx * 1. Note that dsoftmax/dx here is the dSoft we defined earlier.
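As a minimal NumPy sketch (shapes and names are my own assumptions, not the CS231n starter code), the last-layer gradients for a batch look like this:

import numpy as np

# hypothetical sizes: N examples, H hidden units, C classes
N, H, C = 4, 5, 3
x_prev = np.random.randn(N, H)     # outputs of the ReLU gate (the dX2 in the question)
W2 = np.random.randn(H, C)
b2 = np.random.randn(C)
scores = x_prev @ W2 + b2          # x = W*x_prev + b, done for the whole batch

# suppose softmax_loss() already returned dSoft = dL/dscores, shape (N, C)
dSoft = np.random.randn(N, C)

# chain rule through the affine layer:
dW2 = x_prev.T @ dSoft             # dL/dW2 = x_prev^T * dSoft   (shape H x C)
db2 = dSoft.sum(axis=0)            # dL/db2 = dSoft * 1, summed over the batch
dx_prev = dSoft @ W2.T             # gradient that flows back into the ReLU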

Third partial derivative approximations on uniform grid

I have uniform grid and have to calculate third partial derivative approximations at nodes.
So far I have only found approximations for the second order.
Could someone point me to, or explain, a way to build formulas for third-order partial derivatives?
Particularly, I have to calculate fxxx(x,y), fxxy(x,y), fyyy(x,y) and fyyx(x,y).
Many thanks.
Let's say that f[i,j] is the value at node (i,j), and h is the size of space step. You already know how to calculate second order derivatives of f, for example
fxx[i,j] = (f[i+1,j]-2*f[i,j]+f[i-1,j])/h^2
fyy[i,j] = (f[i,j+1]-2*f[i,j]+f[i,j-1])/h^2
fxy[i,j] = (f[i+1,j+1]-f[i+1,j-1]-f[i-1,j+1]+f[i-1,j-1])/(4*h^2)
These are second-order accurate, that is, the error is on the order of h^2. To maintain this accuracy, take one more derivative using the symmetric (central) difference rule, like
gx[i,j] = (g[i+1,j]-g[i-1,j])/(2*h)
This results in:
fxxx[i,j] = ((f[i+2,j]-2*f[i+1,j]+f[i,j])-(f[i,j]-2*f[i-1,j]+f[i-2,j]))/(2*h^3)
(which simplifies to fxxx[i,j] = (f[i+2,j]-2*f[i+1,j]+2*f[i-1,j]-f[i-2,j])/(2*h^3)), and similarly for the other derivatives:
fxxy[i,j] = ((f[i+1,j+1]-2*f[i,j+1]+f[i-1,j+1])-(f[i+1,j-1]-2*f[i,j-1]+f[i-1,j-1]))/(2*h^3)
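Here is a small NumPy sketch (my own illustration on a hypothetical grid) that applies these stencils at an interior node and checks them against a function with known third derivatives:

import numpy as np

h = 0.1
x = np.arange(0.0, 2.0 + h/2, h)
y = np.arange(0.0, 2.0 + h/2, h)
X, Y = np.meshgrid(x, y, indexing="ij")
f = np.sin(X) * np.cos(Y)          # exact: fxxx = -cos(x)*cos(y), fxxy = sin(x)*sin(y)

i, j = 5, 5                        # any interior node at least 2 steps from the boundary
fxxx = ((f[i+2, j] - 2*f[i+1, j] + f[i, j])
        - (f[i, j] - 2*f[i-1, j] + f[i-2, j])) / (2*h**3)
fxxy = ((f[i+1, j+1] - 2*f[i, j+1] + f[i-1, j+1])
        - (f[i+1, j-1] - 2*f[i, j-1] + f[i-1, j-1])) / (2*h**3)

print(fxxx, -np.cos(x[i]) * np.cos(y[j]))   # should agree to about h^2
print(fxxy, np.sin(x[i]) * np.sin(y[j]))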

Function to dampen a value

I have a list of documents each having a relevance score for a search query. I need older documents to have their relevance score dampened, to try to introduce their date in the ranking process. I already tried fiddling with functions such as 1/(1+date_difference), but the reciprocal function is too discriminating for close recent dates.
I was thinking maybe a mathematical function with range (0..1) and domain (0..x) to scale their score, where the x-axis is the age of a document. It's best to explain what I further need from the function with an image.
Decaying behavior is often modeled well by an exponential function (many decaying processes in nature also follow it). You would use two positive parameters A and B and get
y(x) = A exp(-B x)
Since you want a y-range of [0,1], set A=1. Larger B gives a faster decay.
If a simple 1/(1+x) decreases too quickly too soon, a sigmoid such as the mirrored logistic function 1/(1+e^x) or the (complementary) error function might be better suited to your purpose. Let the current date sit somewhere in the negative numbers for such a function, and you get a value that stays near its maximum for some configurable time and then decreases towards a base value.
log((x+1)-age_of_document)
Where the base of the logarithm is (x+1). Note the x is as per your diagram and is the "threshold". If the age of the document is greater than x the score goes negative. Multiply by the maximum possible score to introduce scaling.
E.g. with domain (0,10) and a maximum score of 10: 10*log(11 - age)/log(11)
A bit late, but as thiton says, you might want to use a sigmoid function instead, since it has a "floor" value for your long tail data points. E.g.:
0.8/(1+5^(x-3)) + 0.2 - You can adjust the constants 5 and 3 to control the slope of the curve. The 0.2 is where the floor will be.
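Here is a small Python sketch (my own illustration; the constants are arbitrary) of the three shapes suggested above, each intended to be multiplied into the relevance score:

import math

def exp_decay(age, b=0.3):
    # A * exp(-B * age) with A = 1: starts at 1 and decays towards 0
    return math.exp(-b * age)

def log_decay(age, threshold=10.0):
    # log base (threshold + 1): 1 at age 0, 0 at age == threshold, negative beyond
    return math.log(threshold + 1 - age) / math.log(threshold + 1)

def sigmoid_floor(age, floor=0.2, slope=5.0, midpoint=3.0):
    # decreasing sigmoid with a floor, like 0.8/(1 + 5^(x-3)) + 0.2
    return (1 - floor) / (1 + slope ** (age - midpoint)) + floor

for age in range(11):
    print(age, round(exp_decay(age), 3), round(log_decay(age), 3), round(sigmoid_floor(age), 3))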

How to interpret the result from KissFFT's kiss_fftr (FFT for a real signal) function

I'm using KissFFT's real function to transform some real audio signals. I'm confused, since I input a real signal with nfft samples, but the result is nfft/2+1 complex frequency bins.
From KissFFT's README:
The real (i.e. not complex) optimization code only works for even length ffts. It does two half-length FFTs in parallel (packed into real&imag), and then combines them via twiddling. The result is nfft/2+1 complex frequency bins from DC to Nyquist.
So I have no concrete knowledge of how to interpret the result. My assumption is the data is packed like r[0]i[0]r[1]i[1]...r[nfft/2]i[nfft/2], where r[0] would be DC, i[0] is the first frequency bin, r[1] the second, and so on. Is this the case?
Yes. The reason kiss_fftr produces only Nfft/2+1 bins is that the DFT of a real signal is conjugate-symmetric. The coefficients corresponding to negative frequencies (-pi:0, or pi:2pi, whichever way you like to think about it) are the conjugates of the coefficients from [0:pi).
Note the out[0] and out[Nfft/2] bins (DC and Nyquist) have zero in the imaginary part. I've seen some libraries pack these two real parts together in the first complex, but I view that as a breach of contract that leads to difficult-to-diagnose, nearly-right bugs.
Tip: If you are using float for your data type (the default), you can cast the output array to float complex* (C99) or std::complex<float>* (C++); the memory layout of the kiss_fft_cpx struct is compatible. The reason kiss_fft doesn't use these types by default is that it works with other types besides float and double, and on older ANSI C compilers that lack these features.
Here's a contrived example (assuming c99 compiler and type==float)
#include <stdlib.h>
#include <complex.h>
#include "kiss_fftr.h"

float get_nth_bin_phase(const float * in, int nfft, int whichbin)
{
    kiss_fftr_cfg st = kiss_fftr_alloc(nfft, 0, 0, 0);
    float complex * out = malloc(sizeof(float complex) * (nfft/2 + 1));
    float ph;
    kiss_fftr(st, in, (kiss_fft_cpx*)out);
    whichbin %= nfft;
    if (whichbin <= nfft/2)
        ph = cargf(out[whichbin]);               /* bins 0..nfft/2 are stored directly */
    else
        ph = cargf(conjf(out[nfft - whichbin])); /* negative frequencies are the conjugates */
    free(out);
    kiss_fft_free(st);
    return ph;
}
r[1] and i[1] of the fftr result constitute a complex vector. Together they give you a magnitude (the square root of the sum of the squares of the two components) and a phase (via atan2()) of the first frequency bin.
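As an aside (my own illustration, not part of the original answers), NumPy's real FFT uses the same nfft/2+1 layout, so you can sanity-check your interpretation against it:

import numpy as np

nfft = 8
signal = np.random.randn(nfft)

bins = np.fft.rfft(signal)            # nfft/2 + 1 complex bins, from DC to Nyquist
print(len(bins))                      # 5
print(bins[0].imag, bins[-1].imag)    # both ~0 for a real, even-length input

k = 1                                 # first non-DC bin
magnitude = np.abs(bins[k])           # sqrt(re^2 + im^2)
phase = np.angle(bins[k])             # atan2(im, re)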

Compare large sets of weighted tag clouds?

I have thousands of large sets of tag cloud data; I can retrieve a weighted tag cloud for each set with a simple select/group statement, for example:
SELECT tag, COUNT( * ) AS weight
FROM tags
WHERE set_id = $set_id
GROUP BY tag
ORDER BY COUNT( * ) DESC
What I'd like to know is this: what is the best way to compare weighted tag clouds and find the other sets that are most similar, taking the weight (the number of occurrences within the set) into account and possibly even computing a comparison score, all in one somewhat efficient statement?
I found the web lacking quality literature on the topic; I thought the question was somewhat broadly relevant, so I tried to abstract my example to keep it generally applicable.
First you need to normalize every tag cloud like you would a vector, treating each tag cloud as an n-dimensional vector in which every dimension represents a word and its value represents the weight of that word.
You can do this by calculating the norm (or magnitude) of every cloud, that is, the square root of the sum of all the weights squared:
m = sqrt( w1*w1 + w2*w2 + ... + wn*wn)
Then you generate your normalized tag cloud by dividing each weight by the norm of the cloud.
After this you can easily calculate the similarity using a scalar (dot) product between the clouds: just multiply the corresponding components of each pair and add them all together. E.g.:
v1 = { a: 0.12, b: 0.31, c: 0.17, e: 0.11 }
v2 = { a: 0.21, b: 0.11, d: 0.08, e: 0.28 }
similarity = v1.a*v2.a + v1.b*v2.b + 0 + 0 + v1.e*v2.e
If a vector has a tag that the other one doesn't, then that specific product is simply 0.
This similarity lies in the range [0,1]: 0 means no correlation, while 1 means the (normalized) clouds are identical.
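A small Python sketch (my own illustration; the tag weights are made up) of the normalize-then-dot-product comparison described above:

import math

def normalize(cloud):
    # divide each weight by the Euclidean norm of the cloud
    norm = math.sqrt(sum(w * w for w in cloud.values()))
    return {tag: w / norm for tag, w in cloud.items()}

def similarity(cloud_a, cloud_b):
    # dot product of the normalized clouds; tags missing from one cloud contribute 0
    a, b = normalize(cloud_a), normalize(cloud_b)
    return sum(weight * b.get(tag, 0.0) for tag, weight in a.items())

set_1 = {"python": 12, "fft": 3, "numpy": 7}
set_2 = {"python": 9, "numpy": 5, "sql": 2}
print(similarity(set_1, set_2))       # in [0, 1]; higher means more similar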