Quoting from the paper "Regularizing and Optimizing LSTM Language Models":
Given a fixed sequence length that is used to break a dataset into fixed length batches, the data set is not efficiently used. To illustrate this, imagine being given 100 elements
to perform backpropagation through with a fixed backpropagation through time (BPTT) window of 10. Any element divisible by 10 will never have any elements to backprop into, no matter how many times you may traverse the data set. Indeed, the backpropagation window that each element
receives is equal to i mod 10 where i is the element’s index. This is data inefficient, preventing 1/10 of the data set from ever being able to improve itself in a recurrent fashion, and resulting in 8/10 of the remaining elements receiving only a partial backpropagation window compared to the full possible backpropagation window of length 10.
I could not get the intuition behind the statements above.
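Here is the arithmetic as I understand it mechanically (a toy sketch that just restates the quote's mod-10 claim; the loop and variable names are mine):

    # With fixed-length batching and a BPTT window of 10, the data is
    # split into chunks [0..9], [10..19], ...; gradients only flow back
    # to the start of the current chunk, so element i can backprop
    # through at most i mod 10 earlier elements.
    SEQ_LEN = 10
    for i in range(12):
        print(f"element {i:2d} -> backprop window {i % SEQ_LEN}")
    # Every element with i % 10 == 0 gets a window of 0, so 1/10 of the
    # data never improves "in a recurrent fashion"; the rest receive a
    # partial window (at most 9) rather than the full 10.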
For a simulation study I am working on, we are trying to test an algorithm that aims to identify specific culprit factors that predict a binary outcome of interest from a large mixture of possible exposures that are mostly unrelated to the outcome. To test this algorithm, I am trying to simulate the following data:
A binary dependent variable
A set of, say, 1000 variables, most binary and some continuous, that are not associated with the outcome (that is, are completely independent from the binary dependent variable, but that can still be correlated with one another).
A group of 10 or so binary variables which will be associated with the dependent variable. I will determine a priori the magnitude of their correlation with the binary dependent variable, as well as their frequency in the data.
Generating a random set of binary variables is easy. But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?
Thank you!
"But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?"
With statistical sampling you can't ensure anything; you can only adjust the acceptable risk. Finding an acceptable level of risk may be harder than many people think.
Spurious correlations are a very real phenomenon. Real independent observations will often contain correlations, and if you want to actually test how your algorithm will perform in reality, then your tests should produce such phenomena in a manner similar to the real world: you should be generating independent candidate factors and allowing spurious correlations to occur.
If you are performing ~1000 independent tests of candidate factors at a risk level of α = 0.05, you can expect about 50 truly unassociated factors to come out significant by chance alone. To avoid this, you need to tighten your testing threshold using something along the lines of a Bonferroni correction: with 1000 simultaneous tests, each individual test is run at α/1000. Recall that statistical discriminating power is based on the standard error, which is inversely proportional to the square root of the sample size; the tighter threshold raises the critical value from roughly 2 to roughly 4 standard errors, so holding power constant requires a sample several times larger than a single test for significance would.
So in summary, I'd say that you shouldn't attempt to ensure a lack of correlation; it's going to occur in the real world. You can mitigate the risk of non-predictive factors being included due to spurious correlation by generating large amounts of data. In practice some non-predictors will leak through unless you can obtain enough data, so I'd suggest that your testing address their rate of occurrence as a function of the number of candidate factors and the sample size.
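A quick simulation illustrates both effects (the sample size, seed, and normal-approximation cutoff are my own choices, not from the question):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, p = 1000, 1000                    # observations, candidate factors
    y = rng.integers(0, 2, size=n)       # binary outcome
    X = rng.integers(0, 2, size=(n, p))  # binary factors, independent of y

    # Point-biserial correlation of each factor with the outcome.
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])

    # Under independence, |r| > ~1.96/sqrt(n) is "significant" at alpha = 0.05.
    print("spurious hits:", np.sum(np.abs(r) > 1.96 / np.sqrt(n)))  # ~50

    # Bonferroni: divide alpha by the number of tests.
    crit = norm.ppf(1 - 0.05 / (2 * p)) / np.sqrt(n)
    print("after Bonferroni:", np.sum(np.abs(r) > crit))            # ~0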
Background
I'm trying to convert an algorithm from sequential to parallel, but I am stuck.
Point and Figure Charts
I am creating point and figure charts.
Decreasing
While the stock is going down, add an O every time it breaks through the floor.
Increasing
While the stock is going up, add an X every time it breaks through the ceiling.
Reversal
If the stock reverses direction but the change is less than a reversal threshold (3 units), do nothing. If the change is greater than the reversal threshold, start a new column (X or O).
Sequential vs Parallel
Sequentially, this is pretty straightforward: I keep variables for the floor and ceiling, and if the current price breaks through the floor or ceiling, or changes by more than the reversal threshold, I take the appropriate action.
My question is: is there a way to find these reversal points in parallel? I'm fairly new to thinking in parallel, so I'm sorry if this is trivial. I am trying to do this in CUDA, but I have been stuck for weeks. I have tried using the finite difference algorithms from NVIDIA. These produce local maxima/minima, but not the reversal points. Small fluctuations produce numerous relative maxima/minima, but most of them are trivial because the change is not greater than the reversal size.
My question is: is there a way to find these reversal points in parallel?
One possible approach:
use thrust::unique to remove periods where the price is numerically constant
use thrust::adjacent_difference to produce 1st difference data
use thrust::adjacent_difference on the 1st difference data to get the 2nd difference data, i.e. the points where there is a change in the sign of the slope.
use these points of change in sign of slope to identify separate regions of data - build a key vector from these (e.g. with a prefix sum). This key vector segments the price data into "runs" where the price change is in a particular direction.
use thrust::exclusive_scan_by_key on the 1st difference data, to produce the net change of the run
Wherever the net change of a run exceeds the threshold, flag it as a "reversal".
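Here is a CPU sketch of the same pipeline, with numpy standing in for thrust (the prices and threshold are made up; the per-run sum plays the role of the scan-by-key):

    import numpy as np

    prices = np.array([10., 11, 12, 12, 11, 12, 13, 15, 11, 10, 9])
    REVERSAL = 3.0

    keep = np.insert(np.diff(prices) != 0, 0, True)  # thrust::unique analogue
    p = prices[keep]
    d1 = np.diff(p)                                  # 1st difference
    flips = np.sign(d1[1:]) != np.sign(d1[:-1])      # 2nd-difference sign changes
    key = np.concatenate(([0], np.cumsum(flips)))    # run id per step (prefix sum)

    for k in np.unique(key):                         # net change per run
        net = d1[key == k].sum()                     # (scan-by-key analogue)
        if abs(net) >= REVERSAL:
            print(f"run {k}: net change {net:+.1f} -> reversal candidate")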
Your description of what constitutes a reversal may also be slightly unclear. The above method would not flag a reversal on certain data patterns that you might classify as a reversal. I suspect you are looking beyond a single run as I have defined it here. If that is the case, there may be a method to address that as well - with more steps.
The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit gets its own weight, so the number of input units (features) determines the number of parameters to learn. When processing e.g. image data, the number of input units is the same for all training examples (usually a constant width x height x RGB channels).
However, sequential input data like sentences can come in highly varying lengths, so the number of input units, and hence parameters, would differ depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it makes sure that the model always has the same input size regardless of the sequence length, because it is specified in terms of a transition from one state to another. It is thus possible to use the same transition function with the same weights (input-to-hidden, hidden-to-output, hidden-to-hidden) at every time step. The big advantage is that it allows generalization to sequence lengths that did not appear in the training set.
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual Keras code example I looked at for LSTMs, the sentences were first padded to equal lengths. Doesn't this wash away the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is a reduction of the number of parameters that the model has to learn. This is the whole purpose of using an RNN.
If you learned a different network for each time step and fed the output of the first model to the second and so on, you would end up with a regular feed-forward network. For 20 time steps, you would have 20 models to learn. In convolutional nets, parameters are shared by the convolutional filters because we can assume that similar interesting patterns occur in different regions of the picture (for example, a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that similar patterns occur at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same thing, but the 'I ate an apple' part occurs at different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you'd have to learn it for every time step at which it could occur in your model.
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So it has to remember which word came at which time step; 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
Padding
As for padding the sequences: the main purpose is not directly to let the model predict sequences of varying length; as you said, that is achieved by parameter sharing. Padding is used for efficient training, specifically to keep the number of computational graphs built during training small. Without padding, we have two options for training:
We unroll the model for each training sample. So when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do backpropagation through the 7 time steps, and update the parameters. This seems intuitive in theory, but in practice it is inefficient, because TensorFlow's computational graphs don't allow recurrence; they are feed-forward.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths, this means 30 different graphs during training, so for large models this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before starting training. When you have both very short and very long sequence lengths (5 and 100, for example), you can use bucketing and padding: you pad the sequences to different bucket lengths, for example [5, 20, 50, 100], and create one computational graph per bucket. The advantage is that you don't have to pad a sequence of length 5 up to 100, which would waste a lot of time on "learning" the 95 padding tokens in there.
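A minimal sketch of padding with buckets (the bucket edges come from the example above; the toy sequences and the use of Keras's pad_sequences are my assumptions about your setup):

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sequences = [[3, 7], [1, 2, 3, 4, 5, 6, 7, 8], [4] * 42, [5] * 80]  # toy data
    buckets = [5, 20, 50, 100]

    batches = {b: [] for b in buckets}
    for seq in sequences:
        b = next(b for b in buckets if len(seq) <= b)  # smallest bucket that fits
        batches[b].append(seq)

    # Pad within each bucket, so a length-5 sequence is never padded to 100;
    # one computational graph per bucket instead of one per sequence length.
    for b, seqs in batches.items():
        if seqs:
            arr = pad_sequences(seqs, maxlen=b, padding='post')
            print(f"bucket {b}: batch shape {arr.shape}")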
Extracting screenshots from RAM dumps
Some classic security/hacking challenges involve analyzing a dump of the physical RAM of a system. volatility does a great job of extracting useful information, including a wireframe view of the windows displayed at the time (using the screenshot command). But I would like to go further and find the actual content of the windows.
So I'd like to reformulate the problem as finding raw images (think matrix of pixels) in a large file. If I can do this, I hope to find the content of the windows, at least partially.
My idea was to rely on the fact that a row of pixels is similar to the next one. If I find a large enough number of lines of the same size, then I let the user fiddle around with an interactive tool and see if it decodes to something interesting.
For this, I would compute a kind of spectrogram. More precisely, a heat map where the shade shows how likely it is for block of data #x to be part of an image whose rows are y bytes wide, with x and y the axes of the spectrogram. Then I'd just have to look for horizontal lines in it. (See the examples below.)
The problem I have right now is to find a method to compute that "kind of spectrogram" accurately and quickly. As an order of magnitude, I would like to be able to find images of width 2048 in RGBA (8192 bytes per row) in a 4GB file in a few minutes. That means processing a few tens of MB per second.
I tried using FFT and autocorrelation, but they do not show the kind of accuracy I'm after.
The problem with FFT
Since finding the length of a mostly repeating pattern looks like finding a frequency, I tried using a Fourier transform with 1 byte = 1 sample and plotting the absolute value of the spectrum.
But the main problem is the period resolution. Since I'm interested in the period of the signal (the byte length of a row of pixels), I want to plot the spectrogram with period length on the y axis, not frequency. But the discrete Fourier transform computes the frequencies at multiples of 1/n (for n data points), which gives a very low resolution for long periods and higher-than-needed resolution for short periods.
Here is a spectrogram computed with this method on a 144x90 RGB BMP file. We expect a peak at a period of 432 bytes (144 pixels x 3 bytes per pixel). The window size for the FFT was 4320 bytes.
And the segment plot of the first block of data.
I calculated that to distinguish between periods k and k+1, I need a window size of roughly k² (bin j of an n-point DFT corresponds to period n/j, so around period k adjacent bins are about k²/n apart in period, and resolving a difference of 1 requires n ≈ k²). So for 8192-byte rows, that makes the FFT window tens of megabytes wide, which would be way too slow.
So the FFT computes a lot of information I don't need, and not enough of the information I do need. Still, given a reasonable window size, it usually shows a sharp peak at about the right period.
The problem with autocorrelation
The other method I tried is to use a kind of discrete autocorrelation to plot the spectrogram.
More exactly, I compute the cross-correlation between a block of data and the first half of it, and only for the offsets where the small block lies fully inside the large block. The size of the large block has to be twice the maximum period to plot.
Here is an example of spectrogram computed with this method on the same image as before.
And the segment plot of the autocorrelation of the first block of data.
Although it produces just the right amount of data, the value of the autocorrelation changes slowly, so it does not make a sharp peak at the right period.
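For reference, a toy version of the computation I describe (the dump file name and block sizes are placeholders):

    import numpy as np

    def autocorr_row(block, max_period):
        # Cross-correlate a block with its first half, only at offsets
        # where the half stays fully inside the block.
        block = block.astype(np.float64)
        half = block[:max_period]
        return np.array([np.dot(half, block[k:k + max_period])
                         for k in range(max_period)])

    data = np.fromfile("dump.bin", dtype=np.uint8)   # placeholder file
    max_period = 8192                                # e.g. 2048 px RGBA rows
    row = autocorr_row(data[:2 * max_period], max_period)
    # One such row per block of data gives the heat map; the peak offset
    # in each row is the candidate byte width -- but as noted, it is flat.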
Question
Is there a way to get both a sharp peak at the correct period and enough precision for long periods? Either by tweaking the aforementioned algorithms or by using a completely different one.
I can't judge much about the FFT part. From the title ("Extracting screenshots from RAM dumps") it seems you are trying to solve a bigger problem, and FFT is only a part of it, so let me answer the parts where I have some knowledge.
analyze the RAM dump of a system
This sounds much like physical RAM. If an application takes a screenshot, that screenshot is in virtual RAM. This results in two problems:
a) the screenshot may be incomplete, because parts of it are paged out to disk
b) you need to perform a physical address to virtual address mapping in order to bring the bytes of the screenshot into correct order
find raw images from the memory dump
To me it is unclear what counts as a "raw image". Any application storing an image will use an image file format; storing only the raw pixel data would make the screenshot useless.
In order to perform an FFT on the data, you should probably know whether it uses 24 or 32 bits per pixel.
I hope to find a screenshot or the content of the current windows
This would require an application that has taken a screenshot. You can of course hope; I can't judge the likelihood of that.
rely on the fact that a row of pixels is similar to the next one
You probably hope to find some black on white text. For that, the assumption may be ok. If the user is viewing his holiday pictures, this may be different.
Also note that many values in a PC are 32-bit (integers, floats) and 0x00000000 is a very common value, so your algorithm may falsely detect such regions.
images of width 2048
Is this just a guess? Or would you finally brute-force all common screen sizes?
in RGBA
Why RGBA? A screenshot typically does not have transparency.
With all of the above, I wonder whether it wouldn't be more efficient to search for image signatures like JPEG, BMP or PNG headers in the dump and then analyze those headers and simply get the picture from the metadata.
Note that this has been done before; e.g., WinDbg has some commands in the ext debugger extension, which is loaded by default:
!findgifs
!findjpegs
!findjpgs
!findpngs
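A rough sketch of that signature search (the dump file name is a placeholder; the magic bytes are the standard ones, and the headers would still need parsing to recover dimensions and pixel data):

    # Scan a dump for common image-format magic bytes.
    SIGNATURES = {
        b"\x89PNG\r\n\x1a\n": "PNG",
        b"\xff\xd8\xff": "JPEG",
        b"GIF87a": "GIF",
        b"GIF89a": "GIF",
        b"BM": "BMP",  # only two bytes: expect many false positives
    }

    with open("memory.dmp", "rb") as f:  # placeholder file name
        data = f.read()

    for magic, name in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            print(f"{name} candidate at offset {pos:#x}")
            pos = data.find(magic, pos + 1)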
I want to analyze some audio and decompose it as best as I can into sine waves. I have never used the FFT before and am just doing some initial reading about the concepts and available libraries, like FFTW and KissFFT.
I'm confused on this point... it sounds like the DFT/FFT will give you the sine amplitudes only at certain frequencies, multiples of a base frequency. For example, if I have audio sampled at the usual 44100 Hz, and I pick a chunk of say 256 samples, then that chunk could fit one cycle of 44100/256 ≈ 172 Hz, and the DFT will give me the sine amplitudes at 172, 172*2, 172*3, etc. Is that correct? How do you then find the strength at other frequencies? I'd like to see a spectrum all the way from 20 Hz to about 15 kHz, at about 1 Hz increments.
Fourier decomposition allows you to take any function of time and describe it as a sum of sine waves, each with a different amplitude and frequency. If you want to approach this problem using the DFT, however, you need to make sure you have sufficient resolution in the frequency domain to distinguish between different frequencies. Once you have that, you can determine which frequencies are dominant in the signal and create a signal consisting of multiple sine waves at those frequencies. You are correct in saying that with a sampling frequency of 44.1 kHz and only 256 samples, the lowest frequency you will be able to detect in those 256 samples is about 172 Hz.
OBTAIN SUFFICIENT RESOLUTION IN THE FREQUENCY DOMAIN:
Amplitude values "only at certain frequencies, multiples of a base frequency" is true of Fourier series decomposition, NOT of the DFT, which has a frequency resolution of a certain increment. The frequency resolution of the DFT is determined by the sampling rate and the number of samples of the time-domain signal used to calculate it. Reducing the frequency spacing gives you a better ability to distinguish between two frequencies close together, and this can be done in two ways:
Decrease the sampling rate, although this moves the periodic repetitions in frequency closer together (remember the Nyquist theorem here).
Increase the number of samples used to calculate the DFT. If only the 256 samples are available, one can perform "zero padding", where zero-valued samples are appended to the end of the data, but this has some effects that need to be considered (see the sketch after this list).
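A small sketch of the zero-padding option (the sample rate and test tone are made up):

    import numpy as np

    fs = 44100
    x = np.sin(2 * np.pi * 440 * np.arange(256) / fs)  # 440 Hz tone, 256 samples

    plain = np.abs(np.fft.rfft(x))                # bins every fs/256  ~ 172 Hz
    padded = np.abs(np.fft.rfft(x, n=8 * 256))    # bins every fs/2048 ~ 21.5 Hz

    freqs = np.fft.rfftfreq(8 * 256, d=1 / fs)
    print("peak near", freqs[np.argmax(padded)], "Hz")
    # Zero padding interpolates the spectrum (denser bins) but adds no
    # true resolution: two tones closer than ~fs/256 still smear together.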
HOW TO COME TO A CONCLUSION:
If you plot the frequency content of different audio signals in individual graphs, you will find that the amplitudes differ a bit. This is because the individual signals are not identical in sound, and there is always noise inherent in any signal (from the surroundings and the hardware itself). Therefore, what you want to do is take the average of two or more DFTs to remove noise and get a more accurate representation of the frequency content. Depending on your application, this may not be possible if the sound you are capturing changes noticeably over time (for example, speech or music). Averaging is thus only useful if all the signals to be averaged are pretty much equal in sound (individual separate recordings of "the same thing"). Just to clarify: from, for example, four time-domain signals, you create four frequency-domain signals (using a DFT method), and then average those four into a single averaged frequency-domain signal. This will remove noise and give you a better representation of which frequencies are present in your audio.
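For instance, the averaging step might look like this (the recording file names are placeholders, and the takes are assumed to be equally long):

    import numpy as np

    # Four separate recordings of "the same thing" (placeholder files),
    # each with the same number of samples.
    takes = [np.load(f"take_{i}.npy") for i in range(4)]

    spectra = [np.abs(np.fft.rfft(x)) for x in takes]  # one DFT per take
    avg = np.mean(spectra, axis=0)  # noise averages down; the frequencies
                                    # common to all takes remain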
AN ALTERNATIVE SOLUTION:
If you know that your signal is supposed to contain a certain number of dominant frequencies (not too many), and these are the only ones you are interested in, then I would recommend Pisarenko harmonic decomposition (PHD) or multiple signal classification (MUSIC, nice abbreviation!) to find these frequencies and their corresponding amplitude values. This is computationally less intensive than the DFT. For example, if you KNOW the signal contains 3 dominant frequencies, Pisarenko will return the frequency values for those three. Keep in mind, though, that the DFT reveals much more information, allowing you to come to more conclusions.
Your initial assumption is incorrect. An FFT/DFT will not give you amplitudes only at certain discrete frequencies. Those discrete frequencies are only the centers of bins; each bin constitutes a narrow-band filter with a main lobe of non-zero bandwidth, roughly one or two times the FFT bin separation, depending on the window (rectangular, von Hann, etc.) applied before the FFT. Thus the amplitude of spectral content between bin centers will still show up, but spread across multiple FFT result bins.
If the separation of key signals is large enough and the noise level is low enough, then you can interpolate the FFT results to examine frequencies between bin centers. You may need to use a high quality interpolator, such as a Sinc kernel.
If your signal separation is smaller or the noise level is higher, then you may need a longer window of data to feed a longer FFT to gather sufficient resolution information. An FFT window of length 256 at a 44.1 kHz sample rate is almost certainly too short to gather sufficient information about spectral content below a few hundred Hz, if those are among the frequencies you would like to examine, as they can't be separated cleanly from a DC bias (bin 0).
Unfortunately, there's a degree of uncertainty in identifying the frequencies in a fixed sample of a signal. If you use a short FFT, then there's no way to tell the difference between frequencies over a fairly wide range. If you use a long FFT to get higher resolution in the frequency domain, then you can't detect frequency changes as quickly. This is inherent in the math.
Off the top of my head: 1 Hz increments require about one second of data, i.e. roughly a 44100-point FFT at a 44.1 kHz sample rate, so you'd get a frequency plot about once per second. The 15 kHz range then comes for free, since the Nyquist limit at 44.1 kHz is 22.05 kHz.
You may also be interested in the Short-time Fourier transform. It doesn't solve the fundamental trade-off problem but in practice may get you what you want.
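For instance, a minimal STFT sketch with scipy (the input array is a placeholder; nperseg = fs targets ~1 Hz bins at the cost of roughly one spectrum per second before overlap):

    import numpy as np
    from scipy.signal import stft

    fs = 44100
    x = np.random.randn(5 * fs)             # placeholder: 5 s of audio

    # nperseg = fs gives ~1 Hz bin spacing; shorter windows trade
    # frequency resolution for a faster update rate.
    f, t, Zxx = stft(x, fs=fs, nperseg=fs)
    print(Zxx.shape)                         # (frequency bins, time frames)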