How to call a function batchwise in PyTorch

I have a function defined and would like to know whether it can be applied batchwise. For instance,
    def function():
        # some processing here
        return x
    def forward():
        encode = self._encoding(embedded_premises, premises_lengths)
encode will be a 3D tensor of shape (batch size, seq_length, hidden size), so I want to apply function() batchwise and get x back batchwise as well.
Is there any way to do this other than looping over all batches?

If your function is built from PyTorch operations, which it most likely is if you want it to work with autograd, it can work batchwise. Most PyTorch operations treat the first dimension as the batch dimension, or can be made to (for instance convolutions and linear layers). Sometimes it takes more effort to express an operation so that it is both correct and fast, but in general PyTorch is built on the assumption that operations will be applied to batched data, and it makes this as simple as reasonably possible. If you have a more specific example of your function, please post it.
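As an illustration of what "batchwise" can look like, here is a minimal sketch that assumes the function is a length-aware mean over the sequence dimension. The names masked_mean and masked_mean_loop are hypothetical, and the tensor shapes are only stand-ins for the (batch size, seq_length, hidden size) output described in the question.

```python
import torch

def masked_mean(encoded: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Average each sequence over its valid timesteps, for the whole batch at once.

    encoded: (batch, seq_len, hidden); lengths: (batch,) number of valid steps.
    """
    _, seq_len, _ = encoded.shape
    # positions is (1, seq_len) and lengths is (batch, 1), so the comparison
    # broadcasts to a (batch, seq_len) mask of valid timesteps.
    positions = torch.arange(seq_len, device=encoded.device).unsqueeze(0)
    mask = (positions < lengths.unsqueeze(1)).to(encoded.dtype)
    summed = (encoded * mask.unsqueeze(-1)).sum(dim=1)
    return summed / lengths.unsqueeze(1).to(encoded.dtype)

def masked_mean_loop(encoded: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Reference version that loops over the batch, for comparison only."""
    rows = [encoded[i, : int(lengths[i])].mean(dim=0) for i in range(encoded.shape[0])]
    return torch.stack(rows)

torch.manual_seed(0)
encoded = torch.randn(4, 7, 16, requires_grad=True)  # stand-in for the encoder output
lengths = torch.tensor([7, 3, 5, 1])                 # stand-in for premises_lengths
pooled = masked_mean(encoded, lengths)               # shape (4, 16), no Python loop
```

Because the batched version uses only differentiable tensor operations, gradients flow back to encoded in the same way as in the looped version.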

Related

Can HuggingFace `Trainer` be customised for curriculum learning?

I have been looking for certain features in the HuggingFace transformer Trainer object (in particular Seq2SeqTrainer) and would like to know whether they exist and if so, how to implement them, or whether I would have to write my own training loop to enable them.
I want to apply curriculum learning to my training strategy and to evaluate the model at regular intervals, and would therefore like to be able to do the following:
1. Choose the order in which the model sees training samples at each epoch. It seems that the data passed to the train_dataset argument are automatically shuffled by some internal code, and even if I managed to stop that, I would still need to pass differently ordered data at different epochs, as I may want to train on easy samples for a few epochs and then on a random shuffle of all the data for later epochs.
2. Run custom evaluation at integer multiples of a fixed number of steps. The standard compute_metrics argument of the Trainer takes a function to which the predictions and labels are passed*, and the user decides how to compute the metrics from these. However, I'd like a finer level of control, for example using a different maximum sequence length for the tokenizer during evaluation than during training, which would require putting explicit evaluation code inside compute_metrics that needs access to the trained model and to the data on disk.
Can these two points be achieved by using the Trainer on a multi-GPU machine, or would I have to write my own training loop?
*The function often looks something like this, and I'm not sure it would work with the Trainer if it doesn't have this form:
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        ...
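You can pass a custom function for computing metrics through the compute_metrics argument of the Trainer (it is an argument of the Trainer itself rather than a field of the training arguments). How often evaluation runs is controlled from the training arguments, for example with eval_steps.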
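A minimal sketch of such a function is given here, assuming a classification-style setup in which the predictions are either class ids or logits with a trailing class dimension. The accuracy metric and the handling of -100 as an ignored label are illustrative assumptions; a seq2seq task would more likely decode the predictions with the tokenizer before scoring them.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return a dict of metric names to values from (predictions, labels)."""
    predictions, labels = eval_pred
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    # Assumption: if predictions carry an extra trailing axis, they are logits.
    if predictions.ndim == labels.ndim + 1:
        predictions = predictions.argmax(axis=-1)
    # Assumption: positions labelled -100 are padding and should be ignored.
    valid = labels != -100
    accuracy = float((predictions[valid] == labels[valid]).mean())
    return {"accuracy": accuracy}

# Stand-alone check with made-up logits and labels.
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = np.array([1, 0, 0])
metrics = compute_metrics((logits, labels))
```

The function would then be handed over as Trainer(..., compute_metrics=compute_metrics). It only receives predictions and labels, so anything that needs the model or raw data would have to live elsewhere.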

Standard error of absorbed fixed effect // Run regression with noninteger factor variable

I have a regression that I can run, for example, as
    reghdfe y, a(x1_est=x1 x2_est=x2)
which stores the estimated coefficients in x1_est and x2_est. The issue is that absorb() does not give me standard errors for these coefficients. If I understand correctly, no postestimation command of reghdfe lets me retrieve them.
Luckily, I only care about the standard errors for x1. So I could instead run
    reg y i.x1, a(x2)
and inspect _se[x1]. Unfortunately, x1 has so many distinct levels that it cannot be stored as an integer; it has to be a double. The regression above therefore fails with "x1: factor variables may not contain noninteger values".
What could be another approach to get standard errors for x1?
With a large number of fixed effects, Stata's default approaches won't work. One option is to bootstrap the fixed effects and compute standard errors from the bootstrap draws. The catch, again, is that there are so many fixed effects that the standard bootstrapping methods won't work (they cannot return such a large matrix from each replication).
Essentially, to bootstrap the fixed effects, one would repeat the following for a large number of iterations:
1. preserve
2. bsample
3. Run the regression: reghdfe y, a(x1_est=x1 x2_est=x2)
4. Store x1_est in a .dta file
5. restore
After the loop is done, append all the .dta files and compute the standard errors.

In DQN, how to perform gradient descent when each record in experience buffer corresponds to only one action?

The DQN algorithm is given below.
[Figure: DQN algorithm pseudocode, with a link to its source]
At the gradient descent line, there's something I don't quite understand.
For example, if I have 8 actions, then the output Q is a vector of 8 components, right?
But for each record in D, the return y_i is only a scalar with respect to a given action. How can I perform gradient descent on (y_i - Q)^2 ? I think it's not guaranteed that within a minibatch I have all actions' returns for a state.
You need to compute the loss only on the Q-value of the action that was actually taken. In your example, suppose that for a given row in your mini-batch the action is 3. You then take the corresponding target, y_3, and the loss is (Q(s,3) - y_3)^2; in effect, the loss for all other actions is set to zero. You can implement this with gather_nd in TensorFlow, or by building a one-hot encoding of the actions and multiplying that one-hot vector by the Q-value vector. With a one-hot vector you can write:
    action_input = tf.placeholder("float", [None, action_len])
    QValue_batch = tf.reduce_sum(tf.multiply(T_Q_value, action_input), axis=1)
where action_input is fed with np.eye(action_len)[your_action] (e.g. your_action = 3). The same can be done with gather_nd:
https://www.tensorflow.org/api_docs/python/tf/gather_nd
I hope this resolves your confusion.
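The same idea can be sketched in PyTorch rather than TensorFlow, under the assumption that random tensors stand in for the network output and the targets y_i. The sketch selects the Q-value of the taken action both by indexing (gather) and by one-hot masking, then confirms through the gradient that only the taken actions contribute to the loss.

```python
import torch

torch.manual_seed(0)
batch_size, n_actions = 5, 8

# Stand-in for the network output Q(s, .) on a mini-batch of states.
q_values = torch.randn(batch_size, n_actions, requires_grad=True)
actions = torch.tensor([3, 0, 7, 3, 2])   # action stored in each replay record
targets = torch.randn(batch_size)         # stand-in for the targets y_i

# Option 1: pick Q(s, a) for the taken action by indexing.
q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

# Option 2: multiply by a one-hot mask and sum over the action axis.
one_hot = torch.nn.functional.one_hot(actions, n_actions).to(q_values.dtype)
q_taken_masked = (q_values * one_hot).sum(dim=1)

# Squared error on the taken actions only; other actions never enter the loss.
loss = ((targets - q_taken) ** 2).mean()
loss.backward()
```

After backward(), q_values.grad is zero everywhere except at the one entry per row that corresponds to the taken action, which is why a mini-batch does not need returns for every action of a state.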

Understanding when to use a Python list in PyTorch

As this thread discusses, you cannot use a Python list to hold your sub-modules (for example your layers); otherwise PyTorch will not update the parameters of the sub-modules inside the list. Instead you should wrap your sub-modules in nn.ModuleList to make sure their parameters get updated. However, I have also seen code like the following, where the author uses a Python list to accumulate the loss and then calls loss.backward() to do the update (in the REINFORCE algorithm from RL). Here is the code:
    policy_loss = []
    for log_prob in self.controller.log_probability_slected_action_list:
        policy_loss.append(- log_prob * (average_reward - b))
    self.optimizer.zero_grad()
    final_policy_loss = (torch.cat(policy_loss).sum()) * gamma
    final_policy_loss.backward()
    self.optimizer.step()
Why does using a list this way work for updating the parameters of the modules, while the first case does not? I am very confused now. If I change the code above to policy_loss = nn.ModuleList([]), it throws an exception saying that a float tensor is not a sub-module.
You are misunderstanding what Modules are. A Module stores parameters and defines an implementation of the forward pass. Sub-modules have to be registered (for example through nn.ModuleList) so that parameters() can find their parameters and hand them to the optimizer.
You're allowed to perform arbitrary computation with tensors and parameters, producing new tensors. Modules need not be aware of those tensors, and you're free to keep them in Python lists, because autograd tracks them through the computation graph rather than through the container that holds them. backward() has to be called on a scalar tensor, hence the sum over the concatenation. These tensors are losses, not parameters, so they should be neither attributes of a Module nor wrapped in a ModuleList.
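The distinction can be checked with a small sketch. The class names PlainList and Registered and the layer sizes are hypothetical; the squared-output loss is only a stand-in for the policy loss in the question.

```python
import torch
from torch import nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: these layers are NOT registered as sub-modules.
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 2)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer, so parameters() can find them.
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

n_plain = len(list(PlainList().parameters()))        # nothing for an optimizer to see
n_registered = len(list(Registered().parameters()))  # weight and bias of both layers

model = Registered()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Loss tensors are ordinary tensors, so a plain Python list is fine for them.
torch.manual_seed(0)
losses = []
for _ in range(3):
    out = model(torch.randn(1, 4))
    losses.append(out.pow(2).sum())   # stand-in for a per-step loss term

optimizer.zero_grad()
total = torch.stack(losses).sum()     # reduce to a scalar before backward()
total.backward()
optimizer.step()
```

Here torch.stack is used because each loss term is a zero-dimensional tensor; torch.cat, as in the question, applies when the terms have at least one dimension.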

What should be the loss function for a classification problem in PyTorch if sigmoid is used in the output layer

I am trying to implement a model for a binary classification problem. Up to now, I have been using the softmax function at the output layer together with torch.nn.NLLLoss to calculate the loss. However, I now want to use the sigmoid function instead of softmax at the output layer. If I do that, should I also change the loss function (to BCELoss or binary_cross_entropy), or can I still use torch.nn.NLLLoss?
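With a single sigmoid output you can only do binary classification, not multi-class classification. The reason is that the sigmoid function always returns a value between 0 and 1, so one can, for instance, threshold that value at 0.5 and assign the input to one of two classes accordingly.
Regarding the objective function: NLLLoss is the negative log-likelihood loss, so it simply fits the data distribution, and as an objective it is not a problem as long as that is what you want to achieve during training. In practice, however, torch.nn.NLLLoss expects log-probabilities over all classes (for example the output of log_softmax) together with class indices as targets, so it does not accept a single sigmoid output directly. For a single sigmoid output the matching loss is BCELoss, which is the negative log-likelihood of a Bernoulli output; BCEWithLogitsLoss applies the sigmoid internally to raw scores and is numerically more stable.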
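The relationship between the two losses can be checked numerically. The following is a minimal sketch in which random scores and made-up targets stand in for a real model and dataset: it computes BCELoss on the sigmoid output, BCEWithLogitsLoss on the raw scores, and NLLLoss on the two-class log-probabilities built from the same sigmoid output.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(6)                               # raw single-output scores
targets = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])

# Sigmoid output with binary cross-entropy.
probs = torch.sigmoid(logits)
bce = nn.BCELoss()(probs, targets)

# Same loss computed from the raw scores (sigmoid applied internally).
bce_logits = nn.BCEWithLogitsLoss()(logits, targets)

# NLLLoss needs (N, C) log-probabilities and integer class indices, so the
# single probability p is expanded to the two-class vector [1 - p, p].
log_probs = torch.log(torch.stack([1 - probs, probs], dim=1))
nll = nn.NLLLoss()(log_probs, targets.long())
```

All three values agree up to floating-point error, which shows that switching to sigmoid changes the form of the loss call rather than the underlying objective.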