How to create a discrete distribution in OpenTURNS?

I have a sample of real values which contains independent realizations of a discrete random variable and I want to create the distribution which fits this data.
sample = [2.0, 2.0, 1.0, 1.0, 2.0, 3.0, 1.0, 2.0, 2.0, 1.0]
The UserDefined distribution seems to be designed for this purpose, but it requires computing the weight of each point from its frequency in the sample:
import openturns as ot
distribution = ot.UserDefined(points, weights)
But we have to compute the points and weights first. I did this with NumPy's unique function, but it feels like a limitation of the UserDefined class. Is there a simpler way?
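For reference, the manual route with NumPy looks roughly like this (a sketch; it assumes the weights are simply the relative frequencies of the distinct values):
import numpy as np
import openturns as ot
data = np.array([2.0, 2.0, 1.0, 1.0, 2.0, 3.0, 1.0, 2.0, 2.0, 1.0])
points, counts = np.unique(data, return_counts=True)  # distinct values and their counts
weights = counts / len(data)                          # relative frequencies
distribution = ot.UserDefined(
    ot.Sample([[float(p)] for p in points]),          # points as a 1-D Sample
    [float(w) for w in weights],                      # weights as a plain list
)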

The UserDefinedFactory class creates a UserDefined distribution by estimating the points and weights from the sample. The build method takes the sample as input and returns the ot.UserDefined object that fits the data.
import openturns as ot
sample = ot.Sample([[2.0], [2.0], [1.0], [1.0], [2.0], [3.0], [1.0], [2.0], [2.0], [1.0]])
factory = ot.UserDefinedFactory()
distribution = factory.build(sample)
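A quick sanity check on the fitted distribution (method names taken from the OpenTURNS Distribution interface; adjust if your version differs):
print(distribution.getSupport())       # the distinct points found in the sample
print(distribution.computePDF([2.0]))  # estimated probability of the value 2.0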

Related

1D Sequence Classification

I am working on a long-sequence (~60,000 timesteps) classification task with a continuous input domain. The input has the shape (B, L, C), where B is the batch size, L is the sequence length (i.e. timesteps) and C is the number of features, each of which is continuous (i.e. values like 0.6, 0.2, 0.5, 1.3, etc.).
Since the sequence is very long, I can't directly apply an RNN or Transformer encoder layer without exceeding memory limits. Some proposed methods use several CNN layers to "downsample" the sequence length before feeding it into an RNN; a successful example is the CNN-LSTM model. By introducing several convolutional blocks, each followed by max pooling, it is possible to "downsample" the sequence length by a given factor. The downsampled sequence might then be, say, 60 timesteps long, which is much more manageable for an LSTM model.
Does it make sense to directly substitute the LSTM model with a Transformer encoder? I have read that the transformer attention mechanism can complement the LSTM layers and be used in succession.
There also exist many Transformer variants and other architectures designed for handling long sequences; recent examples include Performer, Linformer, Reformer, Nyströmformer, BigBird, FNet, S4, and CDIL-CNN. Does there exist a library, similar to torchvision, for using these models in PyTorch without copy-pasting large amounts of code from the respective repositories?
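To make the downsample-then-attend idea concrete, here is a minimal PyTorch sketch (class and parameter names are illustrative, not from the question; it assumes a recent PyTorch with batch_first transformer layers):
import torch
import torch.nn as nn

class DownsampleThenAttend(nn.Module):
    # Conv1d + max-pooling blocks shrink the sequence before a Transformer encoder,
    # mirroring the CNN-LSTM downsampling idea described above.
    def __init__(self, in_channels, d_model=128, n_classes=2):
        super().__init__()
        # each block halves the length; stack more blocks to reach a manageable length
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, d_model, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                    # x: (B, L, C)
        x = self.cnn(x.transpose(1, 2))      # Conv1d wants (B, C, L); output (B, d_model, L/8)
        x = self.encoder(x.transpose(1, 2))  # self-attention over the shortened sequence
        return self.head(x.mean(dim=1))      # average over time, then classify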

One Hot Encoding dimension - Model Complexity

I will explain my problem:
I have around 50,000 samples, each one described by a list of codes representing "events".
The number of unique codes is around 800.
The maximum number of codes a sample can have is around 600.
I want to represent each sample using one-hot encoding. If we pad the samples that have fewer codes, the representation is an 800x600 matrix.
Feeding this representation to a network means flattening each matrix into a vector of size 800x600 (460,000 values).
In the end the dataset would consist of 50,000 vectors of size 460,000.
Now, I have two considerations:
How is it possible to handle a dataset of that size? (I tried data generators to build the representation on the fly, but they are really slow.)
Having a vector of size 460,000 as input for each sample means that the complexity of my model (the number of parameters to learn) is extremely high (around 15,000,000 in my case), so I would need a huge dataset to train the model properly. Wouldn't I?
Why don't you use the conventional approach from NLP?
These events can be represented, as you say, by an embedding matrix.
Then you can model the chains of events using an LSTM (or GRU, RNN, or bidirectional LSTM); the difference from a conventional feed-forward network is that the same module is repeated N times.
So your input is not really of size 460,000; internally, an event A indirectly helps you learn about an event B, because the LSTM has a module that repeats itself for each event in the chain.
You have an example here:
https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras
Broadly speaking, what I would do is the following (in Keras pseudo-code):
Detect the total number of distinct events and build a unique list:
unique_events = list(set([event_0, ..., event_n]))
You can translate a sequence of events into indices with:
seq_events_idx = list(map(unique_events.index, seq_events))
Add the necessary padding to each sequence:
sequences_pad = pad_sequences(sequences, maxlen=max_seq)
Then you can use an Embedding layer to map each event to a vector of whatever dimension you choose:
input_ = Input(shape=(max_seq,), dtype='int32')
embedding = Embedding(len(unique_events),
                      dimensions,
                      input_length=max_seq,
                      trainable=True)(input_)
Then define the architecture of your LSTM, for example:
lstm = LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)(embedding)
Add the Dense layer producing the result you want:
out = Dense(10, activation='softmax')(lstm)
I think that this type of model can help you and give better results.
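Putting the pieces together, a minimal runnable version of the above might look like this (Keras functional API; all_events, sequences, max_seq, dimensions and num_classes are placeholders for your own data, and return_sequences is dropped so the LSTM emits a single vector per sequence):
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense
from keras.preprocessing.sequence import pad_sequences

unique_events = sorted(set(all_events))                          # all distinct event codes
sequences_idx = [[unique_events.index(e) for e in seq] for seq in sequences]
sequences_pad = pad_sequences(sequences_idx, maxlen=max_seq)

input_ = Input(shape=(max_seq,), dtype='int32')
embedding = Embedding(len(unique_events), dimensions, input_length=max_seq)(input_)
lstm = LSTM(128, dropout=0.2, recurrent_dropout=0.2)(embedding)  # one vector per sequence
out = Dense(num_classes, activation='softmax')(lstm)

model = Model(inputs=input_, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])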

Why the cost function and the last activation function are bound in MXNet?

When we define a deep learning model, we do the following steps:
Specify how the output should be calculated based on the input and the model's parameters.
Specify a cost (loss) function.
Search for the model's parameters by minimizing the cost function.
It looks to me as if in MXNet the first two steps are bound together. For example, here is how I define a linear transformation:
# declare a symbolic variable for the model's input
inp = mx.sym.Variable(name = 'inp')
# define how output should be determined by the input
out = mx.sym.FullyConnected(inp, name = 'out', num_hidden = 2)
# specify input and model's parameters
x = mx.nd.array(np.ones(shape = (5,3)))
w = mx.nd.array(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
b = mx.nd.array(np.array([7.0, 8.0]))
# calculate output based on the input and parameters
p = out.bind(ctx = mx.cpu(), args = {'inp':x, 'out_weight':w, 'out_bias':b})
print(p.forward()[0].asnumpy())
Now, if I want to add a SoftMax transformation on top of it, I need to do the following:
# define the cost function
target = mx.sym.Variable(name = 'target')
cost = mx.symbol.SoftmaxOutput(out, target, name='softmax')
y = mx.nd.array(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
c = cost.bind(ctx = mx.cpu(), args = {'inp':x, 'out_weight':w, 'out_bias':b, 'target':y})
print(c.forward()[0].asnumpy())
What I do not understand is why we need to create the symbolic variable target. We would need it only if we wanted to calculate the cost, but so far we just calculate the output based on the input (a linear transformation followed by SoftMax).
Moreover, we need to provide a numerical value for target to get the output calculated. So it looks like it is required but not used (the provided value of target does not change the value of the output).
Finally, we can use the cost object to define a model which we can fit as soon as we have data. But what about the cost function? It has to be specified, yet it is not, explicitly. Basically, it looks like I am forced to use a specific cost function just because I use SoftMax. But why?
ADDED
For a more statistical / mathematical point of view, check here. The current question is more pragmatic / programmatic in nature; it is basically: how to decouple the output nonlinearity and the cost function in MXNet? For example, I might want to do a linear transformation and then find the model's parameters by minimizing the absolute deviation instead of the squared one.
You can use mx.sym.softmax() if you only want softmax. mx.sym.SoftmaxOutput() contains efficient code for calculating gradient of cross-entropy (negative log loss), which is the most common loss used with softmax. If you want to use your own loss, just use softmax and add a loss on top during training. I should note that you could also replace the SoftmaxOutput layer with a simple softmax during inference if you really want to.
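For illustration, a sketch of that decoupling in the symbolic API might look like this (it assumes the MakeLoss operator is available; the absolute-deviation loss is just an example, not something the answer prescribes):
# plain softmax on top of the linear output, with no loss attached
prob = mx.sym.softmax(out, name='prob')
target = mx.sym.Variable(name='target')
# wrap an arbitrary expression as the training loss, here absolute deviation
loss = mx.sym.MakeLoss(mx.sym.sum(mx.sym.abs(prob - target), axis=1), name='abs_loss')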

Keras model gives wrong predictions of only 1 class

Background
I used Python and Keras to implement the model of [1].
The model's structure is described in Fig.3 of this paper:
[1] Zheng, Y.: Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks, 2014
Problem
The trained model gives predictions of only 1 class out of 4 classes. For example, [3,3,3,...,3] (= all 3's)
My code at Github
Run main_q02.py
The model is defined in mcdcnn_3.py
Utility functions are defined in utils.py and PAMAP2Utils.py
Dataset Download
The code requires only two files:
PAMAP2_Dataset/Protocol/subject101.dat
PAMAP2_Dataset/Protocol/subject102.dat
About the dataset
The dataset classes are NOT balanced (computed over all 7 subjects):
class 0: 28.76%
class 1: 36.18%
class 2: 18.42%
class 3: 16.64%
Do some classes dominate? Classes 0 and 1 together account for around 65% of all samples.
Additional details
Operating system: Ubuntu 14.04 LTS
Version of python packages:
Theano (0.8.2)
Keras (1.1.0)
numpy (1.13.0)
pandas (0.20.2)
Details of the model (from the paper):
"separate multivariate time series into univariate ones and perform feature learning on each univariate series individually." [1]
"adopt sigmoid function in all activation layers" [1]
"utilize average pooling without overlapping" [1]
use stochastic gradient descent (SGD) for learning
parameters: momentum = 0.9, decay = 0.0005, learning rate = 0.01
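For reference, the listed SGD settings might be wired into Keras 1.x roughly like this (a sketch, not the poster's actual code; see mcdcnn_3.py in the repository for that):
from keras.optimizers import SGD

sgd = SGD(lr=0.01, momentum=0.9, decay=0.0005)
# 'model' stands for the network defined in mcdcnn_3.py
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])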

Why does this binary classification accuracy calculation work?

I have begun to play around with Keras and have noticed that many of the examples do not use the built-in Keras accuracy metrics but rather their own accuracy function, through which they run the y_pred values against y_true.
The function is used for computing the accuracy of binary classification [0,1], but I do not understand why it works, as even on a small example I can see it gives an incorrect result.
The function in the Keras example code is:
def compute_accuracy(predictions, labels):
    return labels[predictions.ravel() < 0.5].mean()
However, if we use the example of
labels = [0,1,1,1]
predictions = [0.2, 0.6, 0.7, 0.3]
we can see that the classifier only got one of the predictions wrong (the sample with prediction 0.3 has true label 1, but a 0.5 threshold would classify it as 0). However, in this case the above accuracy function says that the accuracy is 0.5 and not 0.75. Furthermore, when using the binary accuracy metric within Keras, I get completely different accuracy results. I think I am misunderstanding something.
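A quick NumPy check reproduces both numbers quoted above (0.5 from the example function, 0.75 from a conventional 0.5-threshold accuracy):
import numpy as np

labels = np.array([0, 1, 1, 1])
predictions = np.array([0.2, 0.6, 0.7, 0.3])

# what the example function computes: mean label over predictions below 0.5
print(labels[predictions.ravel() < 0.5].mean())             # 0.5

# conventional accuracy with a 0.5 threshold (prediction >= 0.5 -> class 1)
print(((predictions >= 0.5).astype(int) == labels).mean())  # 0.75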