I have a neural network with a shared embedding. More specifically, one input is pairs of words [(a1, b1), (a2, b2)] and the embedding is shared across positions, i.e., a1 and a2 use the same embedding matrix, and b1 and b2 use the same embedding matrix. This means that in a single iteration a batch such as [(a1, b1), (a1, b2)] can occur, so the embedding for a1 will be updated twice with the same value in a single pass. Will this cause any problems during learning? How can this be solved?
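For reference, a minimal sketch of the setup described above, assuming PyTorch (the class name, vocabulary sizes, and embedding dimension are made up for illustration):

```python
import torch
import torch.nn as nn

class PairModel(nn.Module):
    def __init__(self, vocab_a, vocab_b, dim):
        super().__init__()
        # One embedding matrix shared by every "a" position,
        # another shared by every "b" position.
        self.emb_a = nn.Embedding(vocab_a, dim)
        self.emb_b = nn.Embedding(vocab_b, dim)
        self.out = nn.Linear(4 * dim, 1)

    def forward(self, pairs):
        # pairs: LongTensor of shape (batch, 2, 2) holding [(a1, b1), (a2, b2)]
        a = self.emb_a(pairs[..., 0])   # a1 and a2 both go through emb_a
        b = self.emb_b(pairs[..., 1])   # b1 and b2 both go through emb_b
        return self.out(torch.cat([a, b], dim=-1).flatten(1))

# If a1 == a2 in a sample, the gradient for that embedding row simply
# accumulates the contributions from both positions during backprop.
model = PairModel(vocab_a=1000, vocab_b=1000, dim=16)
pairs = torch.randint(0, 1000, (8, 2, 2))
scores = model(pairs)   # shape (8, 1)
```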
If the inputs to my model are x, y, z and my outputs are continuous variables a, b, c, d, I can obviously use the model to predict the vector [a, b, c, d] from [x, y, z].
However, what happens if I find myself in a situation where I have, say, the value of b as a prior before inference? Can I run the network in a manner such that I am predicting [a, c, d] based on [x, y, z, b]?
Real-world example: I have an image with pixel locations (x, y) and pixel values (R, G, B), and I build a neural implicit model to predict pixel values from pixel locations. Now suppose I have the green values for some pixels as well as their locations; can I use these green values as a prior with my original network to get an improved result? Note that I am not interested in training a new network on this data.
In mathematical terms: I have a network f(x, y, z) -> (a, b, c, d); how can I perform f(x, y, z | b) -> (a, c, d)?
I have not tried much here; I am thinking of maybe passing the prior value back through the network, but I am somewhat lost.
After reading the research paper on batchnorm and its various descriptions in forums, I am still not clear on how the basic computations are performed. The core of my question is: a vector is normalized with respect to the set to which it belongs; we can thus normalize the vectors input to layer 1 using the batch selected from the training set. Each input vector to the next layer needs to be normalized with respect to the set to which it belongs, but how do we get hold of that set?
More precisely, let
N = batch size;
Bi = set of vectors Xij (j = 1..N) whose normalized values will be input to layer i of the network;
BN = batch normalization function;
BN(Xij, Bi) = normalized version of the j-th vector, Xij, with respect to the set Bi.
BN(X1j, B1), j = 1..N, can be calculated because we know B1. These are the inputs to layer 1.
We need BN(X2j, B2), j = 1..N, to input to layer 2, but we do not have B2 readily available. My question is how to get B2, B3, etc.
We could process each BN(X1j, B1), j = 1..N, through layer 1 and remember the outputs as X2j (that collection will be B2), then calculate BN(X2j, B2) for each j by normalizing with respect to B2 and input the results to layer 2, and so on. So the forward pass would consist of many such steps. For simplicity, I have ignored the scale and shift step, as that is not relevant to my question.
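For concreteness, here is the procedure I have in mind as a minimal sketch, assuming PyTorch (the layer sizes are arbitrary and, as above, the scale and shift step is ignored):

```python
import torch
import torch.nn as nn

def bn(batch, eps=1e-5):
    # Normalize each component over the batch it belongs to (scale/shift omitted).
    return (batch - batch.mean(dim=0)) / (batch.std(dim=0) + eps)

layer1 = nn.Linear(8, 16)
layer2 = nn.Linear(16, 4)

B1 = torch.randn(32, 8)      # the batch of vectors X1j that feed layer 1
H1 = layer1(bn(B1))          # BN(X1j, B1) passed through layer 1
B2 = H1                      # the collected outputs X2j form B2
H2 = layer2(bn(B2))          # BN(X2j, B2) passed through layer 2, and so on
```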
Being new to this topic, I would appreciate an expert opinion on it.
Batch Norm first subtracts the mean to center the output around zero. Then it divides by the standard deviation to scale the output to unit variance.
This means that each vector is normalized with respect to the batch it belongs to. Assume you run your x through your first layer; let's call its output y1. Batch normalization is then applied to y1 in the form of (y1 - y1.mean()) / y1.std(). The 'set' you're talking about is just y1, the batch of outputs of that layer.
All of this of course ignores the versions with a learnable scale and shift (gamma and beta).
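To make this concrete, a rough sketch assuming PyTorch (during training, nn.BatchNorm1d uses exactly the statistics of the current batch; affine=False drops the gamma/beta step mentioned above):

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 5)
bn = nn.BatchNorm1d(5, affine=False)   # affine=False: no gamma/beta

x = torch.randn(64, 10)                # one batch of inputs
y1 = layer(x)                          # output of the first layer

# BatchNorm in training mode normalizes y1 against its own batch statistics.
manual = (y1 - y1.mean(dim=0)) / torch.sqrt(y1.var(dim=0, unbiased=False) + bn.eps)
print(torch.allclose(bn(y1), manual, atol=1e-5))   # True
```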
Suppose we have a dataset X (a 2D array), and we divide it into batches X_1, ..., X_k.
Then for each batch we do normalization, and then we multiply each i-th component of a batch element by a parameter gamma_i and add beta_i to it.
A batch normalization layer can be repeated several times in a network, and I haven't found anything about how it is implemented deeper in the network.
In the later BN layers, do we use the same division into batches as at the beginning (using the same rows of X as in the first BN layer), just adding new gamma and beta parameters, or do we do it from scratch for every layer's input?
I hope my question is clear.
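For reference, a minimal sketch of the transform described above, assuming PyTorch (the shapes and values are arbitrary):

```python
import torch

def batch_norm(X_b, gamma, beta, eps=1e-5):
    # Normalize every i-th component over the batch, then scale by gamma_i
    # and shift by beta_i.
    x_hat = (X_b - X_b.mean(dim=0)) / torch.sqrt(X_b.var(dim=0, unbiased=False) + eps)
    return gamma * x_hat + beta

X_1 = torch.randn(16, 4)                   # one batch taken from the dataset X
gamma = torch.ones(4, requires_grad=True)  # learnable, one per component
beta = torch.zeros(4, requires_grad=True)
out = batch_norm(X_1, gamma, beta)
```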
Referring to the original paper on CycleGAN, I am confused about these lines:
The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y. However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way – there are infinitely many mappings G that will induce the same distribution over ŷ.
I understand there are two sets of images and there is no pairing between them. So when the generator takes an image, let's say x from set X, as input and tries to translate it into an image similar to the images in set Y, my question is: there are many images present in set Y, so which y will our x be translated into? There are so many options available in set Y. Is that what is pointed out in the lines of the paper I quoted above? And is this the reason we use the cyclic loss, to overcome this problem and to create some type of pairing between any two random images by converting x to y and then converting y back to x?
The image x won't be translated to a concrete image y but rather into the "style" of the domain Y. The input is fed to the generator, which tries to produce a sample from the desired distribution (the other domain); the generated image then goes to the discriminator, which tries to predict whether the sample is from the actual distribution or was produced by the generator. This is just the normal GAN workflow.
If I understand it correctly, in the lines you quoted the authors explain the problem that arises with the adversarial loss alone. They state it again here:
Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as target domains Y and X respectively. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, an adversarial loss alone cannot guarantee that the learned function can map an individual input x_i to a desired output y_i.
This is one of the reasons for introducing the concept of cycle consistency: to produce meaningful mappings and to reduce the space of possible mapping functions (it can be viewed as a form of regularization). The idea is not to create a pairing between two random images that are already in the dataset (the dataset stays unpaired), but to make sure that if you map a real image from domain X to domain Y and then back again, you get the original image back.
Cycle consistency encourages the generators to avoid unnecessary changes and thus to generate images that share structural similarity with their inputs; it also discourages excessive hallucination and mode collapse.
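To make this concrete, here is a minimal sketch of the cycle-consistency term, assuming PyTorch (G maps X to Y and F maps Y to X, as in the paper; lam is the weighting hyperparameter):

```python
import torch.nn.functional as nnf

def cycle_consistency_loss(G, F, real_x, real_y, lam=10.0):
    # Forward cycle: x -> G(x) -> F(G(x)) should reconstruct x.
    forward_cycle = nnf.l1_loss(F(G(real_x)), real_x)
    # Backward cycle: y -> F(y) -> G(F(y)) should reconstruct y.
    backward_cycle = nnf.l1_loss(G(F(real_y)), real_y)
    # Added to the adversarial losses of both generators, weighted by lam.
    return lam * (forward_cycle + backward_cycle)
```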
I hope that answers your questions.
I'm using a 1D CNN on temporal data. Let's say that I have two features A and B. The ratio between A and B (i.e. A/B) is important; let's call this feature C. I'm wondering if I need to explicitly calculate and include feature C, or whether the CNN can theoretically infer feature C from the given features A and B?
I understand that in deep learning, it's best to exclude highly-correlated features (such as feature C), but I don't understand why.
The short answer is no: using the standard DNN layers will not automatically capture this A/B relationship, because standard layers like Conv/Dense only perform matrix multiplication (plus bias) operations.
To simplify the discussion, let us assume that your input feature is two-dimensional, where the first dimension is A and the second is B. Applying a Conv layer to this feature simply learns a weight vector w and a bias b:
y = w * [f_A, f_B] + b = w_A * f_A + w_B * f_B + b
As you can see, there is no way for this linear representation to mimic or even approximate the ratio between A and B.
You don't have to use feature C in the same way as features A and B. Instead, it may be a better idea to keep feature C as a separate input, because its dynamic range may be very different from those of A and B. This means that you can have a multiple-input network, where each input has its own feature-extraction layers and the resulting features from both inputs are concatenated to predict your target.
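A rough sketch of such a multiple-input network, assuming PyTorch (the layer sizes, sequence length, and pooling choices are made up for illustration):

```python
import torch
import torch.nn as nn

class TwoBranchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Branch 1: raw features A and B as two input channels of a 1D CNN.
        self.branch_ab = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        # Branch 2: the precomputed ratio C = A / B as its own input channel.
        self.branch_c = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16 + 8, 1)   # concatenated features predict the target

    def forward(self, ab, c):
        # ab: (batch, 2, seq_len), c: (batch, 1, seq_len)
        return self.head(torch.cat([self.branch_ab(ab), self.branch_c(c)], dim=1))

model = TwoBranchCNN()
out = model(torch.randn(4, 2, 128), torch.randn(4, 1, 128))   # shape (4, 1)
```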