I am trying to create a model for the following problem:

id   input (diagnoses)   elapsed_days   output (medication)
1    [2,3,4]             0              [3,4]
1    [4,5,6]             7              [1]
1    [2,3]               56             [6,3]
2    [6,5,9,10]          0              [5,3,1]
Rather than a single label per id, there are labels (sets of medication codes) at each time period.
I am thinking that my architecture would be [input] -> [embedding for diagnoses] -> [append normalized elapsed days to embeddings]
-> [LSTM] -> [FFNs] -> [labels over time].
I am familiar with how to set this up if there were a single label per id. Given there are labels for each row (i.e. multiple per id), should I be passing the hidden state of the LSTM at each time step through the FFN and then assigning the labels? I would really appreciate it if somebody could point me to a reference/blog/github/anything for this kind of problem, or suggest an alternative approach here.
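For concreteness, here is a rough sketch of what I have in mind (all layer sizes are made up, and EmbeddingBag is just one convenient way to pool the diagnosis codes of a visit):

import torch
from torch import nn

# Sketch: embed each visit's diagnosis codes, append the normalized
# elapsed days, run an LSTM over the visits, and predict one set of
# medication logits per visit.
class VisitModel(nn.Module):
    def __init__(self, n_diag=100, n_med=10, emb_dim=16, hidden=32):
        super().__init__()
        self.embed = nn.EmbeddingBag(n_diag, emb_dim, mode="mean")
        self.lstm = nn.LSTM(emb_dim + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_med)

    def forward(self, codes, offsets, elapsed):
        # codes/offsets: flattened diagnosis codes (EmbeddingBag format);
        # elapsed: (num_visits,) normalized elapsed days.
        e = self.embed(codes, offsets)                   # (num_visits, emb_dim)
        x = torch.cat([e, elapsed.unsqueeze(1)], dim=1)  # append elapsed days
        h, _ = self.lstm(x.unsqueeze(0))                 # one patient = one sequence
        return self.head(h)                              # (1, num_visits, n_med)

# Patient 1 from the table above:
codes = torch.tensor([2, 3, 4, 4, 5, 6, 2, 3])   # visits [2,3,4], [4,5,6], [2,3]
offsets = torch.tensor([0, 3, 6])
elapsed = torch.tensor([0.0, 7.0, 56.0]) / 56.0
logits = VisitModel()(codes, offsets, elapsed)   # shape (1, 3, 10)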
Assuming that [6,3] is equal to [3, 6] (i.e. the order of the output codes does not matter):
You can use a Sigmoid activation with the Binary Cross-Entropy loss function (the nn.BCELoss class) instead of Softmax with Cross-Entropy (the nn.CrossEntropyLoss class).
But the ground truth can then no longer be integer class indices, as it would be when using nn.CrossEntropyLoss; you need to make it a multi-hot (one-hot-like) encoding instead. For example, if the desired output is [6, 3] and the output layer has 10 nodes, the y_true has to be [0, 0, 0, 1, 0, 0, 1, 0, 0, 0].
Depending on how you implement your data generator, this is one way to do it:
import torch

output = [3, 6]               # medication codes for this time step
out_tensor = torch.zeros(10)  # one slot per possible code
out_tensor[output] = 1        # gives [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
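Putting it together, a minimal sketch of the per-time-step loss computation (the hidden size of 32 and the three time steps are made up for this example):

import torch
from torch import nn

h = torch.randn(3, 32)                       # e.g. LSTM outputs for 3 visits
probs = torch.sigmoid(nn.Linear(32, 10)(h))  # independent per-code probabilities

y_true = torch.zeros(3, 10)                  # multi-hot targets per visit
y_true[0, [3, 4]] = 1                        # visit 1: medications [3, 4]
y_true[1, [1]] = 1                           # visit 2: medication [1]
y_true[2, [6, 3]] = 1                        # visit 3: medications [6, 3]

loss = nn.BCELoss()(probs, y_true)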
But if [6,3] is not equal to [3, 6], then more information about the problem is needed.
It is very common to use the softmax function to convert an array of values into an array of probabilities. In general, the function amplifies the probability of the larger values in the array.
However, this function is not scale invariant. Let us consider an example:
If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The output has most of its weight where the '4' was in the original input. That is, softmax highlights the largest values and suppresses values which are significantly below the maximum. However, if the input were [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3] (which sums to 1.6), the softmax would be [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]. This shows that for values between 0 and 1, softmax in fact de-emphasizes the maximum value (note that 0.169 is not only less than 0.475, it is also less than the initial proportion of 0.4/1.6 = 0.25).
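For reference, the numbers above can be reproduced directly (scipy.special.softmax is used here just for convenience):

import numpy as np
from scipy.special import softmax

print(np.round(softmax([1, 2, 3, 4, 1, 2, 3]), 3))
# [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
print(np.round(softmax([0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3]), 3))
# [0.125 0.138 0.153 0.169 0.125 0.138 0.153]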
I would need a function that amplifies the differences between the values in an array, emphasizing the greatest values, and that is not so affected by the scale of the numbers in the array.
Can you suggest some function with these properties?
As Robert suggested in the comment, you can use temperature. Here is a toy realization in Python using numpy:
import numpy as np

def softmax(preds):
    exp_preds = np.exp(preds)
    sum_preds = np.sum(exp_preds)
    return exp_preds / sum_preds

def softmax_with_temperature(preds, temperature=0.5):
    # Equivalent to normalizing preds ** (1 / temperature), which is why
    # the result is scale invariant; preds must be strictly positive.
    preds = np.log(preds) / temperature
    preds = np.exp(preds)
    sum_preds = np.sum(preds)
    return preds / sum_preds

def check_softmax_scalability():
    base_preds = np.asarray([1, 2, 3, 4, 1, 2, 3], dtype="float64")
    for i in range(1, 3):
        print('logits: ', base_preds * i,
              '\nsoftmax: ', softmax(base_preds * i),
              '\nwith temperature: ', softmax_with_temperature(base_preds * i))
Calling check_softmax_scalability() would return:
logits: [1. 2. 3. 4. 1. 2. 3.]
softmax: [0.02364054 0.06426166 0.1746813 0.474833 0.02364054 0.06426166
0.1746813 ]
with temperature: [0.02272727 0.09090909 0.20454545 0.36363636 0.02272727 0.09090909
0.20454545]
logits: [2. 4. 6. 8. 2. 4. 6.]
softmax: [0.00188892 0.01395733 0.10313151 0.76204449 0.00188892 0.01395733
0.10313151]
with temperature: [0.02272727 0.09090909 0.20454545 0.36363636 0.02272727 0.09090909
0.20454545]
But the scale invariance comes with a cost: as you increase temperature, the output values will come closer to each other. Increase it too much, and you will have an output that looks like a uniform distribution. In your case, you should pick a low value for temperature to emphasize the maximum value.
You can read more about how temperature works here.
I want to use a linear, fully-connected layer as one of the input layers in my network. The input has shape (batch_size, in_channels, num_samples). It is based on the Tacotron paper (https://arxiv.org/pdf/1703.10135.pdf), specifically the Encoder prenet part.
It feels to me as if Chainer and PyTorch have different implementations of the Linear layer - are they really performing the same operations or am I misunderstanding something?
In PyTorch, the behavior of the Linear layer follows the documentation: https://pytorch.org/docs/0.3.1/nn.html#torch.nn.Linear
according to which the shapes of the input and output data are as follows:
Input: (N,∗,in_features) where * means any number of additional dimensions
Output: (N,∗,out_features) where all but the last dimension are the same shape as the input.
Now, let's try creating a linear layer in pytorch and performing the operation. I want an output with 8 channels, and the input data will have 3 channels.
import numpy as np
import torch
from torch import nn
linear_layer_pytorch = nn.Linear(3, 8)
Let's create some dummy input data of shape (1, 4, 3) - (batch_size, num_samples, in_channels):
data = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype=np.float32).reshape(1, 4, 3)
data_pytorch = torch.from_numpy(data)
and finally, perform the operation:
results_pytorch = linear_layer_pytorch(data_pytorch)
results_pytorch.shape
the shape of the output is as follows: Out[27]: torch.Size([1, 4, 8])
Taking a look at the source of the PyTorch implementation:
def linear(input, weight, bias=None):
# type: (Tensor, Tensor, Optional[Tensor]) -> Tensor
r"""
Applies a linear transformation to the incoming data: :math:`y = xA^T + b`.
Shape:
- Input: :math:`(N, *, in\_features)` where `*` means any number of
additional dimensions
- Weight: :math:`(out\_features, in\_features)`
- Bias: :math:`(out\_features)`
- Output: :math:`(N, *, out\_features)`
"""
if input.dim() == 2 and bias is not None:
# fused op is marginally faster
ret = torch.addmm(bias, input, weight.t())
else:
output = input.matmul(weight.t())
if bias is not None:
output += bias
ret = output
return ret
It transposes the weight matrix that is passed to it, broadcasts it along the batch_size axis, and performs a matrix multiplication. Having in mind how a linear layer works, I imagine it as 8 nodes, each connected through a weighted synapse to every channel of an input sample; thus in my case it has 3*8 weights. And that is exactly the shape I see in the debugger: (8, 3).
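For reference, this can be checked directly on the layer created above:

print(linear_layer_pytorch.weight.shape)  # torch.Size([8, 3])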
Now, let's jump to Chainer. Chainer's Linear layer documentation is available here: https://docs.chainer.org/en/stable/reference/generated/chainer.links.Linear.html#chainer.links.Linear. According to it, the Linear layer wraps the function linear, which flattens the input along the non-batch dimensions, and the shape of its weight matrix is (output_size, flattened_input_size).
import chainer
linear_layer_chainer = chainer.links.Linear(8)
results_chainer = linear_layer_chainer(data)
results_chainer.shape
Out[21]: (1, 8)
Creating the layer as linear_layer_chainer = chainer.links.Linear(3, 8) and calling it causes a size mismatch. So in the case of Chainer, I have gotten totally different results: this time around I have a weight matrix of shape (8, 12), and my results have a shape of (1, 8). So now, here is my question: since the results are clearly different, and both the weight matrices and the outputs have different shapes, how can I make them equivalent, and what should be the desired output? In the PyTorch implementation of Tacotron it seems that the PyTorch approach is used as is (https://github.com/mozilla/TTS/blob/master/layers/tacotron.py - the Prenet). If that is the case, how can I make Chainer produce the same results? (I have to implement this in Chainer.) I will be grateful for any insight; sorry that the post has gotten this long.
Chainer's Linear link (a bit frustratingly) does not apply the transformation to the last axis; it flattens all of the non-batch axes instead. You need to tell it how many batch axes there are via the n_batch_axes argument (see the documentation), which is 2 in your case:
# data.shape == (1, 4, 3)
results_chainer = linear_layer_chainer(data, n_batch_axes=2)
# 2 batch axes (1,4) means you apply linear to (..., 3)
# results_chainer.shape == (1, 4, 8)
You can also use l(data, n_batch_axes=len(data.shape)-1) to always apply the transformation to the last dimension, which is the default behaviour in PyTorch, Keras, etc.
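As a sanity check (not from the original post), one can copy the PyTorch layer's parameters into the Chainer link and confirm the two now agree:

import numpy as np
import torch
from torch import nn
import chainer

data = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                dtype=np.float32).reshape(1, 4, 3)

linear_pt = nn.Linear(3, 8)
linear_ch = chainer.links.Linear(3, 8)

# Both frameworks store the weight as (out_features, in_features),
# so the parameters can be copied over directly.
linear_ch.W.data[...] = linear_pt.weight.detach().numpy()
linear_ch.b.data[...] = linear_pt.bias.detach().numpy()

out_pt = linear_pt(torch.from_numpy(data)).detach().numpy()
out_ch = linear_ch(data, n_batch_axes=2).array

print(np.allclose(out_pt, out_ch, atol=1e-6))  # expected: True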
(my code is written in Java but the question is agnostic; I'm just looking for an algorithm idea)
So here's the problem: I made a method that simply finds the median of a data set (given in the form of an array). Here's the implementation:
public static double getMedian(int[] numset) {
    ArrayList<Integer> anumset = new ArrayList<Integer>();
    for (int num : numset) {
        anumset.add(num);
    }
    anumset.sort(null);
    int n = anumset.size();
    if (n % 2 == 1) {
        // Odd count: the middle element is the median.
        return anumset.get(n / 2);
    } else {
        // Even count: average the two middle elements; the 2.0 forces
        // floating-point division.
        return (anumset.get(n / 2 - 1) + anumset.get(n / 2)) / 2.0;
    }
}
A teacher in the school that I go to then challenged me to write a method to find the median again, but without using any data structures. This includes anything that can hold more than one value, so that includes Strings, any forms of arrays, etc. I spent a long while trying to even conceive of an idea, and I was stumped. Any ideas?
The usual algorithm for the task is Hoare's Select algorithm. This is pretty much like a quicksort, except that in quicksort you recursively sort both halves after partitioning, but for select you only do a recursive call in the partition that contains the item of interest.
For example, let's consider an input like this in which we're going to find the fourth element:
[ 7, 1, 17, 21, 3, 12, 0, 5 ]
We'll arbitrarily use the first element (7) as our pivot. We initially split it like this (with the pivot marked with a *):
[ 1, 3, 0, 5 ] *7 [ 17, 21, 12 ]
We're looking for the fourth element, and 7 is the fifth element, so we then partition (only) the left side. We'll again use the first element as our pivot, giving (using { and } to mark the parts of the input we're now ignoring):
[ 0 ] 1 [ 3, 5 ] { 7, 17, 21, 12 }
1 has ended up as the second element, so we need to partition the items to its right (3 and 5):
{0, 1} 3 [5] {7, 17, 21, 12}
Using 3 as the pivot element, we end up with nothing to the left and 5 to the right. 3 is the third element, so we need to look to its right. That's only one element, so that element (5) is our median.
By ignoring the unused side, this reduces the complexity from the O(n log n) of sorting to only O(n) [though I'm abusing the notation a bit; in this case we're dealing with expected behavior, not the worst case that big-O normally describes].
There's also the median-of-medians pivot-selection algorithm if you want to guarantee good behavior (at the expense of being somewhat slower on average). That gives guaranteed O(n) worst-case complexity.
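For concreteness, here is a minimal sketch of the select algorithm in Python (the question is in Java, but as noted it is language-agnostic); it partitions in place and keeps only a few scalar indices, so no auxiliary data structures are needed:

def quickselect(a, k):
    """Return the k-th smallest element (0-based) of a, partitioning in place."""
    lo, hi = 0, len(a) - 1
    while True:
        # Lomuto partition with a[hi] as the pivot.
        pivot = a[hi]
        store = lo
        for i in range(lo, hi):
            if a[i] < pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi] = a[hi], a[store]
        if store == k:      # pivot landed at its final sorted position k
            return a[store]
        elif store < k:     # answer lies to the right of the pivot
            lo = store + 1
        else:               # answer lies to the left of the pivot
            hi = store - 1

print(quickselect([7, 1, 17, 21, 3, 12, 0, 5], 3))  # fourth-smallest -> 5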
Sort the array in place. Take the element in the middle of the array as you're already doing. No additional storage needed.
That'll take n log n time or so in Java. Best possible time is linear (you've got to inspect every element at least once to ensure you get the right answer). For pedagogical purposes, the additional complexity reduction isn't worthwhile.
If you can't modify the array in place, you have to trade significant additional time complexity to avoid using additional storage proportional to half the input's size. (If you're willing to accept approximations, that's not the case.)
Some not very efficient ideas:
For each value in the array, make a pass through the array counting the number of values lower than the current value. If that count is "half" the length of the array, you have the median. O(n^2) (Requires some thought to figure out how to handle duplicates of the median value.)
You can improve the performance somewhat by keeping track of the min and max values so far. For example, if you've already determined that 50 is too high to be the median, then you can skip the counting pass through the array for every value that's greater than or equal to 50. Similarly, if you've already determined that 25 is too low, you can skip the counting pass for every value that's less than or equal to 25.
In C++:
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

int Median(const std::vector<int> &values) {
  assert(!values.empty());
  const std::size_t half = values.size() / 2;
  // Track the tightest bounds established so far, so candidates already
  // known to be too high or too low can be skipped.
  int min = *std::min_element(values.begin(), values.end());
  int max = *std::max_element(values.begin(), values.end());
  for (auto candidate : values) {
    if (min <= candidate && candidate <= max) {
      const std::size_t count =
          std::count_if(values.begin(), values.end(),
                        [&](int x) { return x < candidate; });
      if (count == half) return candidate;
      else if (count > half) max = candidate;  // candidate is too high
      else min = candidate;                    // candidate is too low
    }
  }
  return min + (max - min) / 2;  // fallback when no exact match (e.g. duplicates)
}
Terrible performance, but it uses no data structures and does not modify the input array.
I'm working in MATLAB and I need to define a matrix-valued function that depends on several variables.
For example, I have these vectors:
t=[1,2,3,4,5,6,7,8,9,10]
y=[1,2,3,4,5,6,7,8,9,10]
These can contain any real numbers and have any length (the same length for t and y; I call it NumData).
I have a function that depends on some parameters P1, P2, ..., P5. What I want to do is to form a matrix (NumData x 5) that depends on p, a vector of parameters.
I don't know how to proceed. I thought of defining a matrix:
Matrix = ones(NumData,NumParameters)
But when I try to assign, for example
Matrix(1,3) = p(1)+3*p(2)
I got an error.
I tried to define:
Matrix(1,3)=@(p) p(1)+3*p(2)
But it's useless...
I tried to define the matrix in code, like this:
J=@(p) [1 1 1 exp(-p(5)) -p(4)*exp(-p(5))
1 2 4 exp(-2*p(5)) -p(4)*exp(-2*p(5))
1 3 9 exp(-3*p(5)) -p(4)*exp(-3*p(5))
1 4 16 exp(-4*p(5)) -p(4)*exp(-4*p(5))
1 5 25 exp(-5*p(5)) -p(4)*exp(-5*p(5))]
but it isn't good because it is hard-coded for a specific case...
My main goal is to form J from the vector t, with J depending on the parameter vector p, so that I can later evaluate, for example,
A = J([1,2,1,2,2])
and then factorize A as QR.
Do you have any suggestions? Or am I asking too much of MATLAB?
I'm not 100% sure of what you are trying to do, but let me give you some examples of things that will work, in the hopes that it can help you a bit.
p=[1 2 3 4 5];
M=zeros(2,3);   % preallocate (immediately overwritten by the assignment below)
M=[p(1) p(2) p(5); p(3)/p(2) p(5)^p(2) exp(p(3))]
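If the goal is a J built from t that works for any NumData, here is a sketch written in NumPy for illustration (the column pattern is assumed from the hard-coded J in the question; translating it back to a MATLAB anonymous function is direct):

import numpy as np

# Build the NumData x 5 matrix as a function of the parameter vector p,
# using the whole t vector at once so it works for any length.
t = np.arange(1, 11, dtype=float)

def J(p):
    return np.column_stack([
        np.ones_like(t),            # column 1: constant
        t,                          # column 2: t
        t**2,                       # column 3: t^2
        np.exp(-p[4] * t),          # column 4: exp(-p5*t)
        -p[3] * np.exp(-p[4] * t),  # column 5: -p4*exp(-p5*t)
    ])

A = J([1, 2, 1, 2, 2])   # evaluate at a parameter vector
Q, R = np.linalg.qr(A)   # then factorize as QR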
I searched for this question, but found answers that weren't specific enough.
I'm cleaning up old code and I'm trying to make sure that the following is relatively clean, and hoping that it won't bite me on the rear later on.
My question is about passing a function through a function. Look at the "y" part of the following plot statement. The goo(df)[[1]](x) thing works, but am I asking for trouble in any way? If so, is there a cleaner way?
Also, if the goo() function is called many many times, for instance in a Monte Carlo analysis, will this load up R's internals or possibly cause some type of environment issues?
Edit (02/21/2011) --- The following code is just an example. The real function "goo" has a lot of code before it gets to the approxfun() statement.
#Build a dataframe
df <- data.frame(a=c(1, 2, 3, 4, 5), b=c(4, 3, 1, 2, 6))
#Build a function that passes a function
goo <- function(inp.df) {
out.fun <- approxfun(x=inp.df$a, y=inp.df$b, yright=max(inp.df$b), method="linear", f=1)
list(out.fun, inp.df$a[5], inp.df$b[5])
}
#Set up the plot range
x <- seq(1, 4.3, 0.01)
#Plot the function
plot(x, goo(df)[[1]](x), type="l", xlim=c(0, goo(df)[[2]]), ylim=c(0, goo(df)[[3]]), lwd=2, col="red")
grid()
goo(df)
[[1]]
function (v)
.C("R_approxfun", as.double(x), as.double(y), as.integer(n),
xout = as.double(v), as.integer(length(v)), as.integer(method),
as.double(yleft), as.double(yright), as.double(f), NAOK = TRUE,
PACKAGE = "stats")$xout
<environment: 0219d56c>
[[2]]
[1] 5
[[3]]
[1] 6
It's hard to give you specific recommendations without knowing exactly what your code is, but here are a few things to consider:
Is it really necessary to include pieces of goo's input data in its return value? In other words, can you make goo a straightforward factory that just returns a function? In your example, at least, the plot function already has all the data it needs to determine the limits.
If this is not possible, then stay with this pattern, but give the elements of goo's return value descriptive names so that at least it's easy to see what's going on when you reference them. (E.g., goo(df)$approx(x).) If this structure is used widely in your code, consider making it an S3 class.
Finally, don't invoke goo(df) multiple times in the plot function, just to get different elements out. When you do that, you literally call goo every time, which as you said will execute a lot of code. Also, each invocation will have its own environment with a copy of the input data (although R will be smart enough to reduce the copying to a certain extent and use the same physical instance of df.) Instead, call goo once, assign its value to a variable, and reference that variable subsequently.
I would remove a level of function handling and keep the input data out of the function generation. Then you can keep your function outside of goo and call approxfun only once.
It also generalizes to an input dataframe of any size, not just one with 5 rows.
#Build a dataframe
df <- data.frame(a=c(1, 2, 3, 4, 5), b=c(4, 3, 1, 2, 6))
#Build a function
fun <- approxfun(x = df$a, y = df$b, yright=max(df$b), method="linear", f = 1)
#Set up the plot range
x <- seq(1, 4.3, 0.01)
#Plot the function
plot(x, fun(x), type="l", xlim=c(0, max(df$a)), ylim=c(0, max(df$b)), lwd=2, col="red")
That might not be quite what you need ultimately, but it does remove a level of complexity and gives a cleaner starting point.
This might not be better in a big Monte Carlo simulation, but for simpler situations, it might be clearer to include the x and y ranges as attributes of the output from the created function instead of in a list with the created function. This way goo is a more straightforward factory, like Davor mentions. You could also make the result from your function an object (here using S3) so that it can be plotted more simply.
goo <- function(inp.df) {
  out.fun <- approxfun(x=inp.df$a, y=inp.df$b, yright=max(inp.df$b),
                       method="linear", f=1)
  xmax <- inp.df$a[5]
  ymax <- inp.df$b[5]
  function(x) {
    # Name the argument explicitly so data.frame(x=x, ...) refers to it
    # rather than to whatever global variable happens to be called x.
    structure(data.frame(x=x, y=out.fun(x)),
              limits=list(x=xmax, y=ymax),
              class=c("goo","data.frame"))
  }
}
plot.goo <- function(x, xlab="x", ylab="approx",
xlim=c(0, attr(x, "limits")$x),
ylim=c(0, attr(x, "limits")$y),
lwd=2, col="red", ...) {
plot(x$x, x$y, type="l", xlab=xlab, ylab=ylab,
xlim=xlim, ylim=ylim, lwd=lwd, col=col, ...)
}
Then to make the function for a data frame, you'd do:
df <- data.frame(a=c(1, 2, 3, 4, 5), b=c(4, 3, 1, 2, 6))
goodf <- goo(df)
And to use it on a vector, you'd do:
x <- seq(1, 4.3, 0.01)
goodfx <- goodf(x)
plot(goodfx)