Can LSTM train for regression with different numbers of features in each sample? - deep-learning

In my problem, each training and testing sample has a different number of features. For example, the training samples are as follows:
There are four features in sample1: x1, x2, x3, x4, y1
There are two features in sample2: x6, x3, y2
There are three features in sample3: x8, x1, x5, y3
x denotes a feature, y the target.
Can these samples be used to train an LSTM for regression and make predictions?

Consider the following scenario: you have a (way too small) dataset of 6 sample sequences of lengths {1, 2, 3, 4, 5, 6} and you want to train your LSTM (or, more generally, an RNN) with a minibatch size of 3 (you feed 3 sequences at a time at every training step); that is, you have 2 batches per epoch.
Let's say that, due to randomization, the first batch ended up being constructed from sequences of lengths {2, 1, 5}:
batch 1
----------
2 | xx
1 | x
5 | xxxxx
and the next batch from sequences of lengths {6, 3, 4}:
batch 2
----------
6 | xxxxxx
3 | xxx
4 | xxxx
What people typically do is pad the sample sequences up to the longest sequence in the minibatch (not necessarily the longest sequence overall) and stack the sequences, one on top of another, into a nice matrix that can be fed into the RNN. Let's say your features consist of real numbers, so it is not unreasonable to pad with zeros:
batch 1
----------
2 | xx000
1 | x0000
5 | xxxxx
(batch * length = 3 * 5)
(sequence length 5)
batch 2
----------
6 | xxxxxx
3 | xxx000
4 | xxxx00
(batch * length = 3 * 6)
(sequence length 6)
This way, for the first batch your RNN only runs for the necessary number of steps (5), saving some compute. For the second batch it has to go up to the longest sequence (6).
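The batch construction described above can be sketched in plain Python (hypothetical one-feature sequences; zero as the pad value):

```python
def pad_batch(sequences, pad_value=0.0):
    """Pad each sequence up to the longest one *in this minibatch*."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]  # real lengths, kept for masking later
    return padded, lengths

batch = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0, 7.0, 8.0]]  # lengths {2, 1, 5}
padded, lengths = pad_batch(batch)
# padded is a 3 x 5 matrix (batch * length), lengths == [2, 1, 5]
```

Keeping the real lengths alongside the padded matrix is the key detail: the masking described below needs them.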
The padding value is chosen arbitrarily; it usually should not influence anything unless you have bugs. Trying some bogus values, like Inf or NaN, may help during debugging and verification.
Importantly, when using padding like that, there are some other things to do for the model to work correctly. If you are using backpropagation, you should exclude the padded positions from both the output computation and the gradient computation (deep learning frameworks will do that for you). If you are training a supervised model, the labels should typically also be padded, and the padding should not enter the loss calculation.
For example, say you calculate cross-entropy for the entire batch (padding included). To get a correct loss, the bogus cross-entropy values that correspond to padding should be masked with zeros, then each sequence should be summed independently and divided by its real length. That is, the averaging should be performed without taking padding into account (in my example this is guaranteed by the neutrality of zero with respect to addition). The same rule applies to regression losses and metrics such as accuracy, MAE, etc. (if you average the padding in, your metrics will also be wrong).
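The masked-loss averaging described above can be sketched in pure Python (hypothetical per-step loss values; zero as the neutral mask value):

```python
def masked_mean_loss(per_step_losses, lengths):
    """Average per-timestep losses per sequence, ignoring padded positions.

    per_step_losses: batch x max_len matrix of losses (padding positions included).
    lengths: real (unpadded) length of each sequence.
    """
    means = []
    for losses, n in zip(per_step_losses, lengths):
        masked = [l if i < n else 0.0 for i, l in enumerate(losses)]  # zero out padding
        means.append(sum(masked) / n)  # divide by the *real* length, not max_len
    return means

# a batch of 2 sequences padded to length 4; real lengths are 2 and 4;
# the 9.9 values stand in for bogus losses computed at padded positions
loss = masked_mean_loss([[1.0, 3.0, 9.9, 9.9], [1.0, 1.0, 1.0, 1.0]], [2, 4])
# loss == [2.0, 1.0]: the bogus values at padded positions do not leak in
```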
To save even more compute, people sometimes construct batches such that the sequences in a batch have roughly the same length (or even exactly the same, if the dataset allows). This may introduce some undesired effects, though, since long and short sequences are then never in the same batch.
To conclude, padding is a powerful tool, and if you are attentive it lets you run RNNs very efficiently with batching and dynamic sequence lengths.

Yes. Your input_size for the LSTM layer should be the maximum among all input sizes, and you fill the spare cells with zeros:
max(input_size) = 5
input array = [x1, x2, x3]
And you transform it this way:
[x1, x2, x3] -> [x1, x2, x3, 0, 0]
This approach is fairly common and does not show any large negative influence on prediction accuracy.
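A minimal Python sketch of that transformation (the values and the max input size of 5 are taken from the example above):

```python
def pad_features(xs, max_input_size=5, pad_value=0.0):
    """Pad a feature vector with zeros up to the fixed input_size of the layer."""
    return xs + [pad_value] * (max_input_size - len(xs))

padded = pad_features([1.0, 2.0, 3.0])
# padded == [1.0, 2.0, 3.0, 0.0, 0.0]
```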

Related

LC-3 algorithm for converting ASCII strings to Binary Values

Figure 10.4 provides an algorithm for converting ASCII strings to binary values. Suppose the decimal number is arbitrarily long. Rather than store a table of 10 values for the thousands-place digit, another table for the 10 ten-thousands-place digit, and so on, design an algorithm to do the conversion without resorting to any tables whatsoever.
I have attached pictures of figure 10.4. I am not looking for an answer to the problem, but rather can someone please explain this problem and perhaps give some direction on how to go about creating the algorithm?
Figure 10.4
Figure 10.4 second image
I am unsure as to what it means by tables and do not know where to start really.
The tables are those global, initialized arrays: one called Lookup10 holding 10, 20, 30, 40, ..., and another called Lookup100 holding 100, 200, 300, 400...
You can ignore the tables: as per the assignment instructions, you're supposed to find a different way to accomplish this anyway.  Or, you can run that code in a simulator, or mentally, to understand how it works.
The bottom line is that while LC-3 can do anything (it is Turing complete), it can't do much in any one instruction.  For arithmetic & logic, it can do ADD, NOT, AND.  That's pretty much it!  But that's enough: note that modern hardware does everything with only one logic gate, namely NAND, which is a binary operator (so NAND is directly available; NOT by feeding NAND the same operand on both inputs; AND by applying NOT after NAND; OR by applying NOT to both inputs first and then NAND; etc.).
For example, LC-3 cannot multiply, divide, take a modulus, or right shift directly: each of those operations takes many instructions and, in the general case, some looping construct.  Multiplication can be done by repeated addition, and division/modulus by repeated subtraction.  These are very inefficient for larger operands; there are much more efficient algorithms, but they are also substantially more complex, so they greatly increase program complexity beyond the repeated-operation approach.
That subroutine goes backwards through the user's input string.  It takes a string length count in R1 as a parameter supplied by the caller (not shown).  It looks at the last character of the input and converts it from an ASCII character to a binary number.
(We would commonly do that conversion from ASCII character to numeric value using subtraction: moving the character values from the ASCII digit range 0x30..0x39 to numeric values in the range 0..9.  Here they do it with masking, which also works.  The subtraction approach integrates better with error detection (checking for invalid digit characters, which is not done here), whereas the masking approach is simpler for LC-3.)
The subroutine then obtains the 2nd-to-last digit (moving backwards through the user's input string), converting it to binary using the mask approach.  That yields a number between 0 and 9, which is used as an index into the first table, Lookup10.  The value obtained from the table at that index is basically the index × 10, so this is a ×10 table.  The same approach is used for the third digit (the first one, or the last going backwards), except it uses the 2nd table, which is a ×100 table.
The standard approach for string-to-binary conversion is called atoi (search for it), standing for ASCII to integer.  It moves forwards through the string, and for every new digit it multiplies the value computed so far by 10 before adding the new digit's numeric value.
So, if the string is 456: first it obtains 4; then, because there is another digit, 4 × 10 = 40, then + 5 for 45; then × 10 for 450, then + 6 for 456.
The advantage of this approach is that it can handle any number of digits (up to overflow).  The disadvantage, of course, is that it requires multiplication, which is a complication on LC-3.
Multiplication where one operand is the constant 10 is fairly easy even in LC-3's limited capabilities, and can be done with simple addition without looping.  Basically:
n × 10 = n + n + n + n + n + n + n + n + n + n
and LC-3 can do those 9 additions in just 9 instructions.  Still, we can also observe that:
n × 10 = n × 8 + n × 2
and also that:
n × 10 = (n × 4 + n) × 2     (which is n × 5 × 2)
which can be done in just 4 instructions on LC-3 (and none of these needs looping)!
So, if you want to do this approach, you'll have to figure out how to go forwards through the string instead of backwards as the given table version does, and, how to multiply by 10 (use any one of the above suggestions).
There are other approaches as well if you study atoi.  You could keep the backwards approach, but then you have to multiply by 10, by 100, by 1000: a different factor for each successive digit.  That might be done by repeated addition, or with a count of how many times to multiply by 10, e.g. n × 1000 = n × 10 × 10 × 10.
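For illustration, here is the forward atoi approach sketched in Python, using the mask trick for the digit conversion and the shift-add decomposition n × 10 = (n × 4 + n) × 2; each step here corresponds to a handful of LC-3 instructions (a left shift is an ADD of a register to itself):

```python
def atoi(s):
    """Forward string-to-integer: multiply the running value by 10
    before adding each new digit's numeric value."""
    n = 0
    for ch in s:
        digit = ord(ch) & 0x0F      # mask approach: '0'..'9' -> 0..9
        n = ((n << 2) + n) << 1     # n * 10 as (n*4 + n) * 2, no loop needed
        n = n + digit
    return n

atoi("456")   # -> 456
```

Note there is no error checking here, matching the figure's code; a real version would verify each character is in '0'..'9' before masking.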

FFT output for a signal with 2 cosf() cycles

I am transforming a signal using the ZeroFFT library. The results I get from it, are not what I would intuitively expect.
As a test, I feed the FFT algorithm a buffer that contains two full cycles of cosine, sampled over 512 samples and fed as int16_t values. What I expected to get back is 256 amplitudes, with the values [ 0, 4095, 0, 0, ..., 0 ].
Instead, this is the result:
2 2052 4086 2053 0 2 2 2 1 2 2 2 4 4 3 4...
And it gets weirder! If I feed it the same signal but shifted (so sinf() over 0..4*pi instead of the cosf() function), I get a completely different result: 4 10 2 16 2 4 4 4 2 2 2 3 2 4 3 4
This raises the questions:
1. Don't a sine signal and a cosine signal with the same period contain exactly the same frequencies?
2. If I feed it a buffer with exactly 2 cycles of cosine, wouldn't the Fourier transform result in all zeros, except for 1 frequency?
I generate my test signal as:
static void setup_reference(void)
{
    for (int i = 0; i < CAPTURESZ; ++i)
    {
        const float phase = 2 * 3.14159f * i / 256.0f;
        reference_wave[i] = (int16_t) (cosf(phase) * 0x7fff);
    }
}
And call the ZeroFFT function as:
ZeroFFT(reference_wave, CAPTURESZ);
Note: the ZeroFFT docs state that a Hanning window is applied.
Windowing causes some spectral leakage. Including the window function, the wave now looks like this:
If I feed it a buffer with exactly 2 cycles of cosine, wouldn't the Fourier transform result in all zeros, except for 1 frequency?
Yes, if you do it without windowing. Actually two frequencies: both the positive frequency that you expect and the equivalent negative frequency, though not all FFT functions include the negative frequencies in their output (for real input, the result is Hermitian-symmetric, so there is no extra information in the negative frequencies). For practical reasons, since neither the input signal nor the FFT calculation is exact, you may not get exactly zero everywhere else either, but it should be close; that is mainly a concern for floating-point output.
By the way, by this I don't mean that windowing is bad, but in this special case (perfectly periodic input) it didn't work out in your favour.
As for the sine wave, the magnitudes of the result should be similar (within reason; exactness shouldn't be expected), but the comments on the FFT function you used mention:
The complex portion is discarded, and the real values are returned.
While phase shifts would not change the magnitudes much, they do change the phases of the results, and therefore also their real components.
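To illustrate both points, here is a small pure-Python DFT (a sketch with no windowing and no library FFT, not the ZeroFFT implementation): a 2-cycle cosine and a 2-cycle sine both peak at bin 2 with the same magnitude, and every other bin is numerically ~0.

```python
import cmath
import math

def dft_mag(x):
    """Magnitudes of the first half of the DFT of a real signal."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N // 2)]

N = 64
cos2 = [math.cos(2 * math.pi * 2 * n / N) for n in range(N)]  # two full cosine cycles
sin2 = [math.sin(2 * math.pi * 2 * n / N) for n in range(N)]  # same frequency, phase-shifted

mc = dft_mag(cos2)
ms = dft_mag(sin2)
# both spectra peak at bin 2 with magnitude N/2 = 32; all other bins are ~0
```

Applying a Hann window before the transform (as ZeroFFT does) would spread that single peak across the neighbouring bins, which is exactly the leakage pattern in the question.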

Question on the kernel dimensions for convolutions on mel filter bank features

I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters.
The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the 'depth' dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional: number of 25 ms windows (T) × 80 (features for each window) × 3 (features, delta features and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions, and I do not understand why. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) × height (frequency) × color channels (features, delta features, delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) × b (filter height) × 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it was just a misunderstanding on my side: the authors are using 32 kernels of shape 3×3 (each spanning the full input depth), which results (after two layers with 2×2 striding) in an output of shape T/4 × 20 × 32, where T stands for the time dimension.
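For anyone checking the arithmetic, a tiny Python sketch of the output-shape computation (assuming padding of 1 on each side, which the paper does not state explicitly; T = 100 is a hypothetical number of frames):

```python
def conv_out(size, kernel=3, stride=2, pad=1):
    """Output length of one convolution dimension for a 3x3 kernel, stride 2."""
    return (size + 2 * pad - kernel) // stride + 1

T, F = 100, 80           # hypothetical time steps, 80 mel channels
for _ in range(2):       # two stacked 3x3 conv layers, each strided 2x2
    T, F = conv_out(T), conv_out(F)
# T == 25 (time downsampled by 4), F == 20, plus 32 channels from the 32 kernels
```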

Why is alpha set to 15 in NLTK - VADER?

I am trying to understand what the VADER does for analysis of sentences.
Why is the hyperparameter alpha set to 15 here? I understand that it is unstable when left unbounded, but why 15?
import math

def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score / math.sqrt((score * score) + alpha)
    return norm_score
VADER's normalization equation is score / sqrt(score^2 + alpha), which maps any score into the open interval (-1, 1).
I have read the VADER research paper here: http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
Unfortunately, I could not find any reason why this formula and the value 15 for alpha were chosen. However, the experiments and the graph show that as x (the sum of the sentiments' scores) grows, the normalized value approaches -1 or 1: as the number of scored words grows, the score saturates towards -1 or 1. This means VADER works better with short documents or tweets than with long documents.
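A quick numeric check of the normalize function above shows this saturation:

```python
import math

def normalize(score, alpha=15):
    # VADER's normalization: maps any score into the open interval (-1, 1)
    return score / math.sqrt(score * score + alpha)

normalize(1)    # 0.25 exactly (1 / sqrt(16))
normalize(5)    # ~0.79
normalize(50)   # ~0.997: larger sums of sentiment scores saturate towards 1
```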

When to use logarithmic variables

I want to run a regression. My dependent variable Y is a score of discrete numbers from 0 to 12. I have 6 independent variables. Four of them are percentages between 0.4% and 40%. The other two are not in percent: one measures years and ranges from 1.8 to 67; the other measures size and ranges from 0.008 to 5,117 (very high).
I don't want to use winsorizing, which is why I decided to use logarithms. My question now: is it good to log the two independent variables that are not in percent, because they have high values?
That means I would use: reg Y x1 x2 x3 x4 logx5 logx6
The reason I would like to use log is to deal with outliers. I don't really have outliers, but I think it's better to use log for comparison reasons.
Or should I only use log when the distribution of some variables is skewed? Meaning that log leads to lower skewness and is thus used appropriately.
Thank you in advance.
Lukas