Efficient loss on softmax output in Keras? - deep-learning

When the number of classes is large (e.g., the number of words in a language model), computing the softmax cross-entropy loss is very expensive. There seem to be no efficient options among the built-in losses. Is there a way to compute this efficiently in Keras, or should I implement it myself, and if so, how?
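A common workaround for very large output vocabularies is sampled softmax, which during training evaluates the loss only on the true class plus a small random sample of negative classes, then falls back to the full softmax at inference time. Below is a minimal sketch using TensorFlow's tf.nn.sampled_softmax_loss; the sizes (vocab_size, hidden_dim, num_sampled) are placeholders, and the output projection is kept as explicit variables so it can be handed to the loss function instead of being wrapped in a Dense layer.

import tensorflow as tf

vocab_size = 50000   # number of output classes (placeholder)
hidden_dim = 256     # size of the final hidden state fed to the output layer (placeholder)
num_sampled = 64     # number of negative classes sampled per batch

# Output projection owned explicitly so it can be passed to the sampled loss.
output_w = tf.Variable(tf.random.normal([vocab_size, hidden_dim], stddev=0.05))
output_b = tf.Variable(tf.zeros([vocab_size]))

def sampled_loss(hidden, labels):
    # hidden: [batch, hidden_dim] activations from the last LSTM/Dense layer
    # labels: [batch] integer class ids
    labels = tf.reshape(labels, [-1, 1])
    return tf.reduce_mean(
        tf.nn.sampled_softmax_loss(
            weights=output_w,
            biases=output_b,
            labels=labels,
            inputs=hidden,
            num_sampled=num_sampled,
            num_classes=vocab_size,
        )
    )

def full_logits(hidden):
    # At evaluation/inference time, compute the full logits as usual.
    return tf.matmul(hidden, output_w, transpose_b=True) + output_b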

Related

Doesn't the introduction of polynomial features lead to increased collinearity?

I was going through linear and logistic regression in ISLR, and in both cases I found that one of the approaches used to increase the flexibility of the model was polynomial features: using both X and X^2 as features and then applying the regression models as usual, treating X and X^2 as independent features (in sklearn, not the polynomial fit of statsmodels). Doesn't that increase the collinearity among the features, though? How does it affect model performance?
To summarize my thoughts regarding this:
First, X and X^2 have substantial correlation, no doubt.
Second, I wrote a blog post demonstrating that, at least in linear regression, collinearity among features does not affect the model's fit score, though it does make the model less interpretable by increasing coefficient uncertainty.
So does the second point have anything to do with this, given that model performance is measured by the fit score?
Multi-collinearity isn't always a hindrance; it depends on the data. If your model isn't giving you good results (high accuracy or low loss), you remove outliers or highly correlated features to improve it, but if everything is hunky-dory, you don't need to bother with them.
The same goes for polynomial regression: yes, it adds multi-collinearity by introducing x^2, x^3, ... features into your model.
To overcome that, you can use orthogonal polynomial regression, which uses polynomial basis functions that are orthogonal to each other.
But it will still introduce higher-degree polynomials, which can become unstable at the boundaries of your data space.
To overcome this issue, you can use regression splines, which divide the range of the data into separate portions and fit linear or low-degree polynomial functions to each portion. The points where the divisions occur are called knots, and the functions used to model each piece/bin are known as piecewise functions. These piecewise functions come with smoothness constraints: if, say, you fit cubic (degree-3) pieces, the fitted function is typically required to be twice continuously differentiable at the knots.
Such a piecewise polynomial of degree m with m-1 continuous derivatives is called a Spline.
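As a small illustration of the point about orthogonal polynomials (a sketch with made-up data, using numpy), raw X and X^2 are strongly correlated, but orthogonalizing the centered polynomial basis removes that correlation while spanning the same column space, so the fitted values of the regression are unchanged:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)

# Raw polynomial features: x and x^2 are strongly correlated on this range
X_raw = np.column_stack([x, x**2])
print(np.corrcoef(X_raw, rowvar=False)[0, 1])   # roughly 0.97

# Orthogonalize the centered basis with a QR decomposition; the new columns
# are uncorrelated but span the same space, so OLS fitted values are identical.
Q, _ = np.linalg.qr(X_raw - X_raw.mean(axis=0))
print(np.corrcoef(Q, rowvar=False)[0, 1])       # roughly 0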

Using OLS regression on binary outcome variable

I have previously been told that -- for reasons that make complete sense -- one shouldn't run OLS regressions when the outcome variable is binary (i.e. yes/no, true/false, win/loss, etc). However, I often read papers in economics/other social sciences in which researchers run OLS regressions on binary variables and interpret the coefficients just like they would for a continuous outcome variable. A few questions about this:
Why do they not run a logistic regression? Is there any disadvantage/limitation to using logit models? In economics, for example, I very often see papers using OLS regression for a binary variable rather than logit. Can logit only be used in certain situations?
In general, when can one run an OLS regression on ordinal data? If I have a variable that captures "number of times in a week survey respondent does X", can I - in any circumstance - use it as a dependent variable in a linear regression? I often see this being done in literature as well, even though we're always told in introductory statistics/econometrics that outcome variables in an OLS regression should be continuous.
Applying OLS to a binary outcome is known as the Linear Probability Model (LPM). Compared to a logistic model, the LPM has advantages in implementation and interpretation that make it an appealing option for researchers conducting impact analysis. In the LPM, the parameters directly represent mean marginal effects, whereas in logistic regression the parameters represent log odds ratios; to obtain mean marginal effects from a logistic regression, we need to calculate the derivative for every data point and then average those derivatives. Since logistic regression and the LPM usually yield the same expected average impact estimate [1], many researchers prefer the LPM for estimating treatment impacts.
In general, yes, we can definitely apply OLS to an ordinal outcome. As in the binary case, applying OLS to a binary or ordinal outcome violates some of the assumptions of OLS. However, many econometricians consider the practical effect of violating these assumptions to be minor, and the simplicity of interpreting an OLS model to outweigh the technical correctness of an ordered logit or probit model, especially when the ordinal outcome looks quasi-normal.
Reference:
[1] Deke, J. (2014). Using the linear probability model to estimate impacts on binary outcomes in randomized controlled trials. Mathematica Policy Research.
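As a hedged illustration of the marginal-effects point (simulated data; statsmodels is just an assumed choice of library), the LPM slope and the average marginal effect from a logistic regression typically come out very close:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=n)                        # single covariate
p = 1 / (1 + np.exp(-(-0.5 + 0.8 * x)))       # true logistic probabilities
y = rng.binomial(1, p)                        # binary outcome

X = sm.add_constant(x)

# Linear Probability Model: the OLS slope is directly a mean marginal effect.
# Robust standard errors are used because LPM errors are heteroskedastic.
lpm = sm.OLS(y, X).fit(cov_type="HC1")
print("LPM slope:", lpm.params[1])

# Logistic regression: the coefficient is a log odds ratio; averaging the
# per-observation derivatives gives the mean marginal effect.
logit = sm.Logit(y, X).fit(disp=0)
print("Logit average marginal effect:", logit.get_margeff(at="overall").margeff[0])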

Convolutional filter applied to Fourier-transformed signal

I understand that the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms (the convolution theorem). What I wonder is whether there are known cases where a convolution can be meaningfully applied to a Fourier-transformed signal (e.g. a time series, or an image) in the frequency domain to act as a filter, instead of multiplication by a square matrix. Also, are there known applications of filters that increase the size of the time domain, i.e. where the matrix in the frequency domain is rectangular, and an inverse FT is then applied to go back to the time domain? In particular, I'm interested in known examples of such methods in deep learning.
As you say, the Fourier transform of a convolution of two signals is the pointwise product of their Fourier transforms. This works in both directions: the convolution of two Fourier-transformed signals is (up to normalization) the Fourier transform of the pointwise product of the two time series.
You do have to define "convolution" suitably - for discrete Fourier transforms, the convolution is a circular convolution.
There are definitely meaningful uses for doing a pointwise block multiply in the time domain (for example, applying a data window to a signal before converting to the frequency domain, or modulating a carrier), so you can say that it is meaningful to do the convolution in the frequency domain. But it is unlikely to be efficient, compared to just doing the operation in the time domain.
Note that a LOT of effort has been spent over the years on optimizing Fourier transforms, precisely because it is more efficient to do a block multiply in the frequency domain (which is O(n)) than a convolution in the time domain (which is O(n^2)). Since the Fourier transform is O(n log(n)), the combination forwardTransform-blockMultiply-inverseTransform is usually faster than doing a convolution. This is still true in the other direction, so if you start with frequency data, an inverseTransform-blockMultiply-forwardTransform will usually be faster than doing a convolution in the frequency domain. And, of course, you usually already have the original time data somewhere, so the block multiply in the time domain would then be even faster.
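As a concrete check of the convolution theorem behind this (a sketch with random signals in numpy), circular convolution computed directly agrees with the transform-multiply-inverse-transform route:

import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.normal(size=n)
h = rng.normal(size=n)

# Direct circular convolution, straight from the definition: O(n^2)
direct = np.array([sum(x[m] * h[(k - m) % n] for m in range(n)) for k in range(n)])

# FFT route: forward transform, pointwise multiply, inverse transform: O(n log n)
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

print(np.allclose(direct, via_fft))  # True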
Unfortunately, I don't know of applications that increase the size of the time domain off the top of my head. And I don't know anything about deep learning, so I can't help you there.

Predicting rare events and their strength with LSTM autoencoder

I'm currently creating an LSTM to predict rare events. I've seen this paper, which suggests first training an autoencoder LSTM to extract features, and then feeding the embeddings into a second LSTM that makes the actual prediction. According to them, the autoencoder extracts features (this is usually true) which are then useful for the prediction layers.
In my case, I need to predict whether or not there will be an extreme event (this is the most important thing) and then how strong it is going to be. Following their advice, I've created the model, but instead of adding one LSTM from embeddings to predictions I add two: one for the binary prediction (it is, or it is not), ending with a sigmoid layer, and a second one for predicting how strong the event will be. So I have three losses: the reconstruction loss (MSE), the prediction loss (MSE), and the binary loss (binary cross-entropy).
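For reference, a minimal sketch of how such a three-headed model could be wired up in the Keras functional API (the layer sizes, sequence length T, and feature count F are placeholders, not values from the paper):

from tensorflow.keras import layers, Model

T, F = 30, 1          # sequence length and number of features (placeholders)

inputs = layers.Input(shape=(T, F))

# Encoder LSTM: produces the embedding sequence shared by all three heads
embeddings = layers.LSTM(32, return_sequences=True, name="encoder")(inputs)

# Head 1: reconstruction (the autoencoder part), trained with MSE
reconstruction = layers.TimeDistributed(layers.Dense(F), name="reconstruction")(embeddings)

# Head 2: binary "extreme event or not", trained with binary cross-entropy
event = layers.LSTM(16)(embeddings)
event = layers.Dense(1, activation="sigmoid", name="event")(event)

# Head 3: event strength, trained with MSE
strength = layers.LSTM(16)(embeddings)
strength = layers.Dense(1, name="strength")(strength)

model = Model(inputs, [reconstruction, event, strength])
model.compile(
    optimizer="adam",
    loss={"reconstruction": "mse",
          "event": "binary_crossentropy",
          "strength": "mse"},
)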
The thing is that I'm not sure it is learning anything... the binary loss stays at 0.5, and even the reconstruction loss is not very good. And of course, the awkward part is that the time series is mostly zeros, with some values between 1 and 10, so MSE is definitely not a good metric here.
What do you think about this approach?
Is this a good architecture for predicting rare events? If not, which one would be better?
Should I add some CNN or fully connected layers on top of the embeddings, before the two LSTMs, to extract 1D patterns from the embeddings, or to make the prediction directly?
Should there be just one prediction LSTM, using only an MSE loss?
Would it be a good idea to multiply the two predictions, to force the predicted days without the event to coincide in both cases?
Thanks,

Loss function in LSTM neural network

I do not understand what is being minimized in these networks.
Can someone please explain what is going on mathematically when the loss gets smaller in an LSTM network?
model.compile(loss='categorical_crossentropy', optimizer='adam')
From the Keras documentation, categorical_crossentropy is just the multiclass log loss; the math and theoretical explanation for log loss are covered here.
Basically, the LSTM is assigning labels to words (or characters, depending on your model), and optimizing the model by penalizing incorrect labels in word (or character) sequences. The model takes an input word or character vector and tries to guess the next "best" word, based on training examples. Categorical cross-entropy is a quantitative way of measuring how good the guess is. As the model iterates over the training set, it makes fewer mistakes in guessing the next best word (or character).
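As a small numeric illustration (made-up numbers): for one-hot targets y and predicted probabilities p, categorical cross-entropy is the average over samples of -sum_c y_c * log(p_c), and this is exactly the quantity that gets smaller as the guesses improve.

import numpy as np

# One-hot targets and predicted probabilities for a 3-class toy example
y_true = np.array([[0, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.6, 0.3, 0.1]])

# Categorical cross-entropy: only the probability assigned to the true class matters
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
print(loss)  # about 0.367; pushing the true-class probabilities toward 1 drives it toward 0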