Integrating Non-Observation Frame Data with Different Dimensionality in Reinforcement Learning - reinforcement-learning

I am trying to understand a conceptual approach to integrating data that does not share the dimensionality of the observation frames into a stack of those frames.
Example Frame: [1, 2, 3]
Example extra data: [a, b]
Currently, I am approaching this as follows, with the example of 3 frames (rows) representing temporal observation data over 3 time periods, and a 4th frame (row) representing non-temporal data for which only the most recent observed values are needed.
Example:
[
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[a, b, NaN]
]
The a and b are the added data, and the NaN is just a value added to match the dimensions of the existing data. Would there be differences (all input welcomed) between using NaN and an outlier value like -1 that would never be observed by the other measures?
One possible alternative would be to structure the observation data as such:
[
[1, 2, 3, a, b],
[4, 5, 6, a-1, b-1],
[7, 8, 9, a-2, b-3],
]
It seems this would be a noticeable increase in resources, and in my context the measures a and b can be universally understood as "bigger is better" or "smaller is better" without context from the other data values.
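One caveat worth noting: NaN propagates through any dense layer and poisons the activations, so a sentinel value outside the observed range (or the padded-row layout) is generally the safer choice. Both layouts can be sketched with NumPy (the values for a and b below are hypothetical):

```python
import numpy as np

frames = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]], dtype=float)
extra = np.array([10.0, 20.0])  # hypothetical values for a and b

# Option 1: pad the extra row with a sentinel to match the frame width.
SENTINEL = -1.0  # any value outside the normal observation range
extra_row = np.full(frames.shape[1], SENTINEL)
extra_row[:extra.size] = extra
stacked = np.vstack([frames, extra_row])   # shape (4, 3)

# Option 2: append the extra values as columns to every frame.
tiled = np.tile(extra, (frames.shape[0], 1))
widened = np.hstack([frames, tiled])       # shape (3, 5)
```

Option 2 repeats a and b per time step, which costs memory but keeps every row homogeneous; option 1 is cheaper but requires the network to learn that the last row has different semantics.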

Related

Why do we "pack" the sequences in PyTorch?

I was trying to replicate How to use packing for variable-length sequence inputs for rnn but I guess I first need to understand why we need to "pack" the sequence.
I understand why we "pad" them but why is "packing" (via pack_padded_sequence) necessary?
I have stumbled upon this problem too and below is what I figured out.
When training an RNN (LSTM, GRU, or vanilla RNN), it is difficult to batch variable-length sequences. For example: if the lengths of the sequences in a batch of size 8 are [4,6,8,5,4,3,7,8], you would pad all of them to length 8, resulting in 8 sequences of length 8. You would end up doing 64 computations (8x8), when only 45 were needed. Moreover, if you wanted to do something fancy like using a bidirectional RNN, it would be harder to do batch computations just by padding, and you might end up doing more computations than required.
Instead, PyTorch allows us to pack the sequence. Internally, a packed sequence is a tuple of two lists. One contains the elements of the sequences, interleaved by time step (see the example below); the other contains the batch size at each step, i.e. the number of sequences still active. This is helpful for recovering the actual sequences, as well as for telling the RNN the batch size at each time step. This was pointed out by @Aerin. The packed sequence can be passed to the RNN, and it will internally optimize the computations.
I might have been unclear at some points, so let me know and I can add more explanations.
Here's a code example:
import torch

a = [torch.tensor([1, 2, 3]), torch.tensor([3, 4])]
b = torch.nn.utils.rnn.pad_sequence(a, batch_first=True)
>>>>
tensor([[1, 2, 3],
        [3, 4, 0]])
torch.nn.utils.rnn.pack_padded_sequence(b, batch_first=True, lengths=[3, 2])
>>>>
PackedSequence(data=tensor([1, 3, 2, 4, 3]), batch_sizes=tensor([2, 2, 1]))
Here are some visual explanations¹ that might help to develop better intuition for the functionality of pack_padded_sequence().
TL;DR: It is performed primarily to save compute. Consequently, the time required for training neural network models is also (drastically) reduced, especially when carried out on very large (a.k.a. web-scale) datasets.
Let's assume we have 6 sequences (of variable lengths) in total. You can also consider this number 6 as the batch_size hyperparameter. (The batch_size will vary depending on the length of the sequence (cf. Fig.2 below))
Now, we want to pass these sequences to some recurrent neural network architecture(s). To do so, we have to pad all of the sequences (typically with 0s) in our batch to the maximum sequence length in our batch (max(sequence_lengths)), which in the below figure is 9.
So, the data preparation work should be complete by now, right? Not really, because there is still one pressing problem: how much compute we have to do compared with the computations actually required.
For the sake of understanding, let's also assume that we will matrix multiply the above padded_batch_of_sequences of shape (6, 9) with a weight matrix W of shape (9, 3).
Thus, we will have to perform 6x9 = 54 multiplication and 6x8 = 48 addition (nrows x (ncols - 1)) operations, only to throw away most of the computed results, since they would be 0s (where we have pads). The compute actually required is as follows:
 9-mult  8-add
 8-mult  7-add
 6-mult  5-add
 4-mult  3-add
 3-mult  2-add
 2-mult  1-add
---------------
32-mult 26-add
------------------------------
#savings: 22-mult & 22-add ops
(54-32)   (48-26)
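The arithmetic above is easy to reproduce as a quick sanity check of the toy example's numbers:

```python
# Per-sequence lengths from the toy example above; 6 sequences padded to length 9.
lengths = [9, 8, 6, 4, 3, 2]
padded_mults = len(lengths) * max(lengths)         # 6 * 9 = 54
padded_adds = len(lengths) * (max(lengths) - 1)    # 6 * 8 = 48
actual_mults = sum(lengths)                        # 32
actual_adds = sum(l - 1 for l in lengths)          # 26
print(padded_mults - actual_mults, padded_adds - actual_adds)  # 22 22
```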
That's a LOT of savings, even for this very simple (toy) example. You can now imagine how much compute (eventually: cost, energy, time, carbon emissions, etc.) can be saved using pack_padded_sequence() for large tensors with millions of entries, with a million+ systems all over the world doing that, again and again.
The functionality of pack_padded_sequence() can be understood from the figure below, with the help of the used color-coding:
As a result of using pack_padded_sequence(), we will get a tuple of tensors containing (i) the flattened (along axis 1, in the above figure) sequences, and (ii) the corresponding batch sizes, tensor([6,6,5,4,3,3,2,2,1]) for the above example.
The data tensor (i.e. the flattened sequences) could then be passed to objective functions such as CrossEntropy for loss calculations.
¹ Image credits to @sgrvinod
The above answers addressed the question of why very well. I just want to add an example to better understand the use of pack_padded_sequence.
Let's take an example
Note: pack_padded_sequence requires the sequences in the batch to be sorted in descending order of length. In the example below, the batch is already sorted to reduce clutter. Visit this gist link for the full implementation.
First, we create a batch of 2 sequences with different sequence lengths, as below. We have 7 elements in the batch in total.
Each sequence has an embedding size of 2.
The first sequence has the length: 5
The second sequence has the length: 2
import torch

seq_batch = [torch.tensor([[1, 1],
                           [2, 2],
                           [3, 3],
                           [4, 4],
                           [5, 5]]),
             torch.tensor([[10, 10],
                           [20, 20]])]
seq_lens = [5, 2]
We pad seq_batch to get a batch of sequences of equal length 5 (the max length in the batch). Now the new batch has 10 elements in total.
# pad the seq_batch
padded_seq_batch = torch.nn.utils.rnn.pad_sequence(seq_batch, batch_first=True)
"""
>>>padded_seq_batch
tensor([[[ 1, 1],
[ 2, 2],
[ 3, 3],
[ 4, 4],
[ 5, 5]],
[[10, 10],
[20, 20],
[ 0, 0],
[ 0, 0],
[ 0, 0]]])
"""
Then, we pack the padded_seq_batch. It returns a tuple of two tensors:
The first is data, containing all the elements of the sequence batch.
The second is batch_sizes, which tells how the elements relate to each other across the steps.
# pack the padded_seq_batch
packed_seq_batch = torch.nn.utils.rnn.pack_padded_sequence(padded_seq_batch, lengths=seq_lens, batch_first=True)
"""
>>> packed_seq_batch
PackedSequence(
data=tensor([[ 1, 1],
[10, 10],
[ 2, 2],
[20, 20],
[ 3, 3],
[ 4, 4],
[ 5, 5]]),
batch_sizes=tensor([2, 2, 1, 1, 1]))
"""
Now, we pass the packed_seq_batch to the recurrent modules in PyTorch, such as RNN or LSTM. This requires only 5 + 2 = 7 computations in the recurrent module.
lstm = torch.nn.LSTM(input_size=2, hidden_size=3, batch_first=True)
output, (hn, cn) = lstm(packed_seq_batch.float())  # pass a float tensor instead of a long tensor
"""
>>> output # PackedSequence
PackedSequence(data=tensor(
[[-3.6256e-02, 1.5403e-01, 1.6556e-02],
[-6.3486e-05, 4.0227e-03, 1.2513e-01],
[-5.3134e-02, 1.6058e-01, 2.0192e-01],
[-4.3123e-05, 2.3017e-05, 1.4112e-01],
[-5.9372e-02, 1.0934e-01, 4.1991e-01],
[-6.0768e-02, 7.0689e-02, 5.9374e-01],
[-6.0125e-02, 4.6476e-02, 7.1243e-01]], grad_fn=<CatBackward>), batch_sizes=tensor([2, 2, 1, 1, 1]))
>>>hn
tensor([[[-6.0125e-02, 4.6476e-02, 7.1243e-01],
[-4.3123e-05, 2.3017e-05, 1.4112e-01]]], grad_fn=<StackBackward>),
>>>cn
tensor([[[-1.8826e-01, 5.8109e-02, 1.2209e+00],
[-2.2475e-04, 2.3041e-05, 1.4254e-01]]], grad_fn=<StackBackward>)))
"""
We need to convert output back to the padded batch of output:
padded_output, output_lens = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True, total_length=5)
"""
>>> padded_output
tensor([[[-3.6256e-02, 1.5403e-01, 1.6556e-02],
[-5.3134e-02, 1.6058e-01, 2.0192e-01],
[-5.9372e-02, 1.0934e-01, 4.1991e-01],
[-6.0768e-02, 7.0689e-02, 5.9374e-01],
[-6.0125e-02, 4.6476e-02, 7.1243e-01]],
[[-6.3486e-05, 4.0227e-03, 1.2513e-01],
[-4.3123e-05, 2.3017e-05, 1.4112e-01],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00]]],
grad_fn=<TransposeBackward0>)
>>> output_lens
tensor([5, 2])
"""
Compare this effort with the standard way.
In the standard way, we only need to pass padded_seq_batch to the lstm module. However, it requires 10 computations, several of them spent on padding elements, which is computationally inefficient.
Note that this does not lead to inaccurate representations, but it needs much more logic to extract the correct representations.
For an LSTM (or any recurrent module) with only the forward direction, if we would like to extract the hidden vector of the last step as the representation of a sequence, we have to pick up the hidden vectors from the T(th) step, where T is the length of the input. Picking up the representation at the last (padded) position would be incorrect. Note that T differs for the different inputs in the batch.
For a bidirectional LSTM (or any recurrent module), it is even more cumbersome, as one would have to maintain two RNN modules, one working on input padded at the beginning and one on input padded at the end, and finally extract and concatenate the hidden vectors as explained above.
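For the forward-only case, picking the T(th) hidden vector out of a padded output can be done with gather over the sequence lengths. A minimal sketch on a toy padded output (the tensor values below are made up, not taken from the LSTM above):

```python
import torch

# Toy padded output: batch = 2, T = 3, hidden size = 2;
# the second sequence has true length 1.
padded = torch.tensor([[[1., 1.], [2., 2.], [3., 3.]],
                       [[4., 4.], [0., 0.], [0., 0.]]])
lengths = torch.tensor([3, 1])

# Index of the last valid time step per sequence, expanded to the hidden
# dimension so it can be used with gather along the time axis (dim=1).
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, padded.size(2))
last_hidden = padded.gather(1, idx).squeeze(1)  # shape (2, 2)
print(last_hidden)  # rows [3., 3.] and [4., 4.]
```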
Let's see the difference:
# The standard approach: using padding batch for recurrent modules
output, (hn, cn) = lstm(padded_seq_batch.float())
"""
>>> output
tensor([[[-3.6256e-02, 1.5403e-01, 1.6556e-02],
[-5.3134e-02, 1.6058e-01, 2.0192e-01],
[-5.9372e-02, 1.0934e-01, 4.1991e-01],
[-6.0768e-02, 7.0689e-02, 5.9374e-01],
[-6.0125e-02, 4.6476e-02, 7.1243e-01]],
[[-6.3486e-05, 4.0227e-03, 1.2513e-01],
[-4.3123e-05, 2.3017e-05, 1.4112e-01],
[-4.1217e-02, 1.0726e-01, -1.2697e-01],
[-7.7770e-02, 1.5477e-01, -2.2911e-01],
[-9.9957e-02, 1.7440e-01, -2.7972e-01]]],
grad_fn= < TransposeBackward0 >)
>>> hn
tensor([[[-0.0601, 0.0465, 0.7124],
[-0.1000, 0.1744, -0.2797]]], grad_fn= < StackBackward >),
>>> cn
tensor([[[-0.1883, 0.0581, 1.2209],
[-0.2531, 0.3600, -0.4141]]], grad_fn= < StackBackward >))
"""
The above results show that hn and cn differ between the two approaches, while the outputs differ in the values at the padding positions.
Adding to Umang's answer, I found this important to note.
The first item in the tuple returned by pack_padded_sequence is the data (tensor) -- a tensor containing the packed sequence. The second item is a tensor of integers holding the batch size at each sequence step.
What's important here, though, is that the second item (batch sizes) represents the number of elements at each sequence step in the batch, not the varying sequence lengths passed to pack_padded_sequence.
For instance, given the data abc and x, the PackedSequence would contain the data axbc with batch_sizes=[2,1,1].
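The abc/x example can be reproduced numerically by encoding a, b, c, x as 1, 2, 3, 4 (a quick sketch):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# "abc" -> [1, 2, 3], "x" -> [4]
seqs = [torch.tensor([1, 2, 3]), torch.tensor([4])]
padded = pad_sequence(seqs, batch_first=True)  # [[1, 2, 3], [4, 0, 0]]
packed = pack_padded_sequence(padded, lengths=[3, 1], batch_first=True)
print(packed.data)         # tensor([1, 4, 2, 3])  i.e. "axbc"
print(packed.batch_sizes)  # tensor([2, 1, 1])
```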
I used pack padded sequence as follows.
packed_embedded = nn.utils.rnn.pack_padded_sequence(seq, text_lengths)
packed_output, hidden = self.rnn(packed_embedded)
where text_lengths are the lengths of the individual sequences before padding, and the sequences are sorted in decreasing order of length within a given batch.
you can check out an example here.
And we do packing so that the RNN doesn't see the unwanted padded indices while processing the sequence, which would affect the overall performance.

The result of numpy.convolve is not as expected

I am new to Deep Learning. I think I've got the point of Understanding NumPy's Convolve.
I tried this in numpy
np.convolve([3, 4], [1, 1, 5, 5], 'valid')
the output is
array([ 7, 19, 35])
According to the link the second element of the output should be 23.
[3 4]
[1 1 5 5]
= 3 * 1 + 4 * 5 = 23
It seems that the second element (19) is wrong in my case, though I have no idea how or why. Any responses will be appreciated.
I think you are confusing it with the convolution implementation in neural networks, which is actually cross-correlation. However, if you refer to the mathematical definition of convolution, you will see that the second function has to be time-reversed (or mirrored). Also, note that numpy swaps the arguments if the second one is bigger (as in your case). So the result you get is obtained as follows:
[1*4 + 3*1, 1*4 + 3*5, 5*4 + 3*5]
In case you want numpy to perform calculations as you did, you should use:
np.correlate([3, 4], [1, 1, 5, 5], 'valid')
Here is a useful illustration of convolution and cross-correlation:
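The reversal is easy to verify by hand in NumPy: sliding the kernel as-is gives the [7, 23, 35] the question expected, while flipping it first reproduces np.convolve (a quick sketch; slide is a helper defined here, not a NumPy function):

```python
import numpy as np

a, v = np.array([3, 4]), np.array([1, 1, 5, 5])

def slide(kernel, signal):
    """Position-wise dot products of kernel against each window of signal."""
    return [int(kernel @ signal[i:i + len(kernel)])
            for i in range(len(signal) - len(kernel) + 1)]

print(slide(a, v))                 # [7, 23, 35]  <- what the question computed
print(slide(a[::-1], v))           # [7, 19, 35]  <- kernel reversed first
print(np.convolve(a, v, 'valid'))  # [ 7 19 35]
```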
The reason is that numpy reverses the shorter array; here [3, 4] becomes [4, 3]. This is done because of the definition of convolution (you can find more information in the Definition section of the Wikipedia article: https://en.wikipedia.org/wiki/Convolution).
So, in fact, np.convolve([3, 4], [1, 1, 5, 5], 'valid')
computes:
[4 3]
[1 1 5 5]
= 4 * 1 + 3 * 5 = 19
:)

Not sure why I'm failing test cases?

I'm building a function that replaces a limited number of old values with new values in a list (xs) and returns the result as a new list (new_xs). I am failing most test cases. I have provided two examples of expected output and two examples of failing test cases.
Example:
Limit=None (means replace all old values): xs=[1,2,1,3,1,4,1,5,1], old=1, new=2 --> new_xs=[2,2,2,3,2,4,2,5,2]
Limit= 0 or negative means do not alter anything in the list
Limit=2 means only replace two old values and leave rest untouched.
Here is my non-working code:
def replace(xs, old, new, limit=None):
    new_xs = []
    replacements = 0
    for num in xs:
        if num == old and (limit is None or replacements < limit):
            new_xs.append(new)
            replacements += 1
        else:
            new_xs.append(num)
    return new_xs
Still fails 6 tests:
AssertionError: [] != None
AssertionError: Lists differ: [9, 2, 9, 3, 9, 4, 9, 5, 9] != [-10, 2, -10, 3, -10, 4, -10, 5, -10]
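For what it's worth, the posted function does reproduce the examples stated in the question, which suggests the failing tests exercise behavior not described here (perhaps mutating xs in place, or a different return contract) — worth re-reading the assignment's exact specification. A quick check against the question's own examples, reusing the posted replace:

```python
def replace(xs, old, new, limit=None):
    """Replace up to `limit` occurrences of `old` with `new` (all if None)."""
    new_xs = []
    replacements = 0
    for num in xs:
        if num == old and (limit is None or replacements < limit):
            new_xs.append(new)
            replacements += 1
        else:
            new_xs.append(num)
    return new_xs

xs = [1, 2, 1, 3, 1, 4, 1, 5, 1]
print(replace(xs, 1, 2))           # [2, 2, 2, 3, 2, 4, 2, 5, 2]
print(replace(xs, 1, 2, limit=0))  # unchanged
print(replace(xs, 1, 2, limit=2))  # [2, 2, 2, 3, 1, 4, 1, 5, 1]
```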

Integer CSV Compression Algorithm

I did surface-level research about the existence of an algorithm that compresses comma-separated integers; however, I did not find anything relevant.
My goal is to compress large amounts of structured comma-separated integers whose value ranges are known. Is there a known algorithm to do such a thing? If not, where would be a good place to read about some relevant areas of interest that would get me started on developing such an algorithm? Of course, the algorithm has to be reversible and lossless, so that I can uncompress the compressed data to retrieve the CSV values.
The data structure is an array of three values, first number's domain is from 0 to 4, second is from 0 to 6, third is from 0 to n where n is not a large number. This structure is repeated to create data which is in a two dimensional array.
Using standard compression algorithms such as gzip or bzip2 on structured data does not yield optimum compression efficiency, therefore constructing a case specific algorithm did the trick.
The data structure is shown below with an example.
// cell: a data structure, array of three numbers
// digits[0]: { 0, 1, 2, 3, 4 }
// digits[1]: { 0, 1, 2, 3 }
// digits[2]: { 0, 1, 2, ..., n } n is not an absurdly large number
// Below it is reused in a multi-dimensional array.
var cells = [
[ [3, 0, 1], [4, 2, 4], [3, 0, 2], [4, 1, 3] ],
[ [4, 2, 3], [3, 0, 3], [4, 3, 3], [1, 1, 0] ],
[ [3, 3, 0], [2, 3, 1], [2, 2, 5], [0, 2, 4] ],
[ [2, 1, 0], [3, 0, 0], [0, 2, 3], [1, 0, 0] ]
];
I did various tests on this data structure (excluding the white-spaces as string) using standard compression algorithms:
gz compressed from 171 to 88 bytes
bzip2 compressed from 171 to 87 bytes
deflate compressed from 171 to 76 bytes
The algorithm I constructed compresses the data down to 33 bytes and works up to n = 192. So, on a case-specific basis, I was able to compress my data with more than double the efficiency of standard text compression algorithms.
The way I achieved such compression is by mapping all the possible combinations a cell can hold to integers. If you want to investigate this concept, it's known as combinatorics in mathematics. I then converted the base-10 integer to a higher base for string representation.
Since I am aiming for human usability (the compressed code will be typed), I used base 62, which I represented as {[0-9], [a-z], [A-Z]} for 0 to 61 respectively. I padded each cell to two digits when converted to base 62. This allows for 62*62 (3844) different cell combinations.
Finally, I added a base-62 digit at the beginning of the compressed string which represents the number of columns. When decompressing, the y size is used to deduce the x size from the string's length. Thus the data can be correctly decompressed with no loss.
The compressed string of the above example looks like this:
var compressed = compress(cells); // "4n0w1H071c111h160i0B0O1s170308110"
I have provided an explanation of my method to solve my problem to help other facing a similar problem. I have not provided my code for obscurity reasons.
TL;DR
To compress structured data:
Represent discrete object as an integer
Encode the base 10 integer to a higher base
Repeat for all objects
Append number of rows or columns to the compressed string
To decompress structured data:
Read the rows or columns and deduce the other from the string length
Reverse steps 1 and 2 in compression
Repeat for all objects
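Those steps can be sketched in Python. The exact cell-to-integer mapping below is an assumption (the answer deliberately does not show its code); it uses a mixed-radix encoding over the stated digit ranges, two base-62 digits per cell, and a leading digit for the number of rows:

```python
# Base-62 digits: 0-9, a-z, A-Z map to 0..61.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
N = 192  # exclusive bound for the third digit: 5 * 4 * 192 = 3840 <= 62 * 62

def cell_to_int(cell):
    a, b, c = cell              # a in 0..4, b in 0..3, c in 0..N-1
    return (a * 4 + b) * N + c  # mixed-radix encoding of one cell

def int_to_cell(i):
    ab, c = divmod(i, N)
    a, b = divmod(ab, 4)
    return [a, b, c]

def to_b62(i):                  # two base-62 digits per cell
    return ALPHABET[i // 62] + ALPHABET[i % 62]

def from_b62(s):
    return ALPHABET.index(s[0]) * 62 + ALPHABET.index(s[1])

def compress(cells):
    # Leading digit stores the row count; then two digits per cell, row-major.
    return ALPHABET[len(cells)] + "".join(
        to_b62(cell_to_int(c)) for row in cells for c in row)

def decompress(s):
    rows = ALPHABET.index(s[0])
    flat = [int_to_cell(from_b62(s[i:i + 2])) for i in range(1, len(s), 2)]
    cols = len(flat) // rows
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

cells = [
    [[3, 0, 1], [4, 2, 4], [3, 0, 2], [4, 1, 3]],
    [[4, 2, 3], [3, 0, 3], [4, 3, 3], [1, 1, 0]],
    [[3, 3, 0], [2, 3, 1], [2, 2, 5], [0, 2, 4]],
    [[2, 1, 0], [3, 0, 0], [0, 2, 3], [1, 0, 0]],
]
```

With this scheme the 4x4 example compresses to 1 + 16 * 2 = 33 characters, matching the 33 bytes reported above (the exact digits differ, since the mapping here is only a guess at the author's scheme).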
Unless there's some specific structure to your list that you're not divulging and that might drastically help compression, standard lossless compression algorithms such as gzip or bzip2 should handle a string of numbers just fine.
Libraries for such common algorithms should be ubiquitously available for pretty much all languages and platforms.

PostgreSQL group order array totally wrong

This is the correct ordered array with MySQL:
[
[1330210800000, 1],
[1330297200000, 6],
[1330383600000, 10],
[1330470000000, 2],
[1330556400000, 5],
[1330815600000, 9],
[1331593200000, 2],
[1331852400000, 4],
[1331938800000, 8],
[1332111600000, 8],
[1332198000000, 4],
[1332284400000, 8],
[1332370800000, 3],
[1332630000000, 2]
]
But with PostgreSQL the array is:
[
[1330588800000, 5],
[1332399600000, 3],
[1330848000000, 9],
[1330416000000, 10],
[1331622000000, 2],
[1330329600000, 6],
[1330502400000, 2],
[1332140400000, 8],
[1332313200000, 8],
[1330243200000, 1],
[1332226800000, 4],
[1331967600000, 8],
[1332658800000, 2],
[1331881200000, 4]
]
With PostgreSQL, the order is wrong, and the dates and the klik counts are different:
This is the query in my controller:
#kliks = Klik.count( :group => "DATE( created_at )" )
.map{|k, v| [(Time.parse(k).to_i * 1000), v] }
You haven't specified any particular order in your query so the database is free to return your results in any order it wants. Apparently MySQL is ordering the results as a side effect of its GROUP BY but PostgreSQL won't necessarily do that. So your first "bug" is just an incorrect assumption on your part. If you want the database to do the sorting then you want something like:
Klik.count(:group => 'date(created_at)', :order => :date_created_at)
If you throw out the * 1000 and sort the integer timestamps:
1330210800, 1, MySQL
1330243200, 1, PostgreSQL
1330297200, 6, MySQL
1330329600, 6, PostgreSQL
1330383600, 10, MySQL
1330416000, 10, PostgreSQL
...
You'll see that they do actually line up quite nicely and the integer timestamps differ by 32400s (AKA 9 hours) or 28800s (AKA 8 hours or 9 hours with a DST adjustment) in each MySQL/PostgreSQL pair. Presumably you're including a time zone (with DST) in one of your conversions while the other is left in UTC.
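The constant offset is easy to check from the first few pairs listed above:

```python
# First three paired timestamps from the MySQL and PostgreSQL lists above.
mysql_ts = [1330210800, 1330297200, 1330383600]
postgresql_ts = [1330243200, 1330329600, 1330416000]
diffs = [p - m for m, p in zip(mysql_ts, postgresql_ts)]
print(diffs)             # [32400, 32400, 32400]
print(diffs[0] // 3600)  # 9 -> a 9-hour time zone offset
```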
You are really missing the order clause. By default, database servers return groups in "random" order. The rule is: when you need a fixed order, always use ORDER BY (in Rails, it's :order).