https://github.com/huggingface/transformers/blob/master/examples/run_glue.py
I want to adapt this script to do text classification on my data. The computer for this task is a single machine with two graphics cards, so this involves a kind of "distributed" training via the local_rank argument in the script above, especially where local_rank equals 0 or -1, as in line 83.
After reading some material on distributed computation, I gather that local_rank is like an ID for a machine, and that 0 may mean this machine is the "main" or "head" of the computation. But what is -1?
Q: But what is -1?
Usually, this is used to disable the distributed setting. Indeed, as you can see here:
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
and here:
if args.local_rank != -1:
    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[args.local_rank],
                                                      output_device=args.local_rank,
                                                      find_unused_parameters=True)
setting local_rank to -1 has this effect.
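To make the toggle concrete, here is a minimal, self-contained sketch of the pattern run_glue.py follows (the dummy dataset and the argument parsing are illustrative stand-ins, not code from the script):

import argparse
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)  # -1 means: no distributed training
args = parser.parse_args()

train_dataset = TensorDataset(torch.randn(100, 8))  # dummy data for illustration

if args.local_rank == -1:
    # single process: plain random sampling; several GPUs are covered by nn.DataParallel
    train_sampler = RandomSampler(train_dataset)
else:
    # one process per GPU (e.g. spawned by torch.distributed.launch, which sets --local_rank)
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)
    train_sampler = DistributedSampler(train_dataset)

train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=16)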
I want to add something to @Berriel's answer. Since you have two GPUs, not a distributed machine with a node structure, you do not need distributed methods like DistributedSampler. Hugging Face uses -1 to disable distributed settings in its training mechanisms.
Check out the following code from the Hugging Face training_args.py script. As you can see, when there is a distributed training mechanism, self.local_rank gets changed.
def _setup_devices(self) -> "torch.device":
    logger.info("PyTorch: setting up devices")
    if self.no_cuda:
        device = torch.device("cpu")
        self._n_gpu = 0
    elif is_torch_tpu_available():
        device = xm.xla_device()
        self._n_gpu = 0
    elif is_sagemaker_distributed_available():
        import smdistributed.dataparallel.torch.distributed as dist
        dist.init_process_group()
        self.local_rank = dist.get_local_rank()
        device = torch.device("cuda", self.local_rank)
        self._n_gpu = 1
    elif self.local_rank == -1:
        # if n_gpu is > 1 we'll use nn.DataParallel.
        # If you only want to use a specific subset of GPUs use `CUDA_VISIBLE_DEVICES=0`
        # Explicitly set CUDA to the first (index 0) CUDA device, otherwise `set_device` will
        # trigger an error that a device index is missing. Index 0 takes into account the
        # GPUs available in the environment, so `CUDA_VISIBLE_DEVICES=1,2` with `cuda:0`
        # will use the first GPU in that env, i.e. GPU#1
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        # Sometimes the line in the postinit has not been run before we end up here,
        # so just checking we're not at the default value.
        self._n_gpu = torch.cuda.device_count()
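For the setup in the question (one machine, two GPUs), this means you can either run run_glue.py directly, so that local_rank stays at -1 and both GPUs are used through nn.DataParallel, or launch one process per GPU with something like python -m torch.distributed.launch --nproc_per_node 2 run_glue.py ..., which passes a distinct --local_rank to each process (this launcher invocation is the generic PyTorch one, not something specific to this script).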
I use torch.nn.Embedding to embed my model's categorical input features; however, I run into problems when I set the max_norm parameter to something other than None.
There is a note on the PyTorch docs page that explains how to use the max_norm parameter with the following example:
n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=True)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight.clone() @ W.t()  # weight must be cloned for this to be differentiable
b = embedding(idx) @ W.t()  # modifies weight in-place
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()
I can't easily understand this example from the docs. What is the purpose of having both a and b, and why is out defined as out = (a.unsqueeze(0) + b.unsqueeze(1))?
Do we need to first clone the entire embedding tensor, as in a, and then look up the embeddings for our desired indices, as in b? And how are a and b supposed to be added together?
In my code, I don't have W explicitly; I am assuming that W is representative of the weights applied by the torch.nn.Linear layers. So I just need to prepare the input (which includes the embeddings for the categorical features) that goes into my network.
I greatly appreciate any instructions on this, as understanding this example would help me adapt my code accordingly.
Because W in the line computing a requires gradients, embedding.weight must be saved in order to compute those gradients in the backward pass. However, in the line computing b, executing embedding(idx) scales embedding.weight by max_norm, in place. So, without the clone in line a, embedding.weight would be modified when line b is executed, changing what was saved for the backward pass that updates W. Hence the requirement to clone embedding.weight: to save it before it gets scaled in line b.
If you don't use embedding.weight outside of the normal forward pass, you don't need to worry about all this.
If you get an error, post it (and your code).
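If it helps, the in-place renorm described above is easy to observe directly. A minimal sketch (the sizes are arbitrary):

import torch
import torch.nn as nn

emb = nn.Embedding(3, 5, max_norm=1.0)
idx = torch.tensor([1, 2])

with torch.no_grad():
    print(emb.weight.norm(dim=1))  # some rows will typically exceed 1.0
    emb(idx)                       # the forward pass renormalizes rows 1 and 2 in place
    print(emb.weight.norm(dim=1))  # those rows now have norm <= 1.0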
I am trying to train a DQN on the OpenAI LunarLander environment. I included an argument parser to control which device I use in different runs (CPU or GPU computing, with PyTorch's to("cpu") or to("dml") calls).
Here is my code:
# Put the networks on either the CPU or the DML device,
# e.g. .to("cpu") for CPU, .to("dml") for Microsoft DirectML GPU computing.
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
However, pytorch-directml does not yet support some methods, such as .gather(), .max(), and F.mse_loss(). That is why I need to move the data from the GPU to the CPU, do those computations, calculate the loss, and then put everything back on the GPU for the remaining steps. See below:
Q_targets_next = self.Q_target(next_states.to("cpu")).detach().max(1)[0].unsqueeze(1).to("cpu")  # next-state values from the target network
Q_targets = (rewards.to("cpu") + self.args.gamma * Q_targets_next.to("cpu") * (1 - dones.to("cpu")))  # TD targets from the Bellman equation
Q_expected = self.Q(states).contiguous().to("cpu").gather(1, actions.to("cpu"))  # expected values from the local network
# Calculate loss (on CPU)
loss = F.mse_loss(Q_expected, Q_targets)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Put the networks back to DML
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
The strange thing is this:
The code is bug-free; when I run it with args.device = "cpu" it works perfectly. However, when I run the exact same code with args.device = "dml", it performs terribly and the network does not learn anything.
I noticed that on every iteration the CPU and GPU results differ by just a little (around 1e-5), but over many iterations this compounds into a huge difference, and the CPU and GPU results end up almost completely different.
What am I missing here? Is there something I need to pay attention to when moving matrices between the CPU and the GPU? Should I make them contiguous()? Or is this simply a bug in the pytorch-directml library?
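For what it's worth, the per-step discrepancy described above can be quantified by running the same forward pass on both devices; this is only a sketch with a made-up network and batch, and it assumes the "dml" device string behaves as in the code above:

import torch
import torch.nn as nn

# made-up network and batch, just to measure the CPU-vs-DML gap
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
x = torch.randn(32, 8)

with torch.no_grad():
    out_cpu = net(x)
    out_dml = net.to("dml")(x.to("dml")).to("cpu")

print((out_cpu - out_dml).abs().max())  # gaps around 1e-5 compound over many training steps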
I constructed several glmer.nb models with different combinations of random intercepts, and for one of the models (nested random intercepts, with the lowest AICc), I consistently get "iteration limit reached" without the usual "Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, : ...".
Here's what I know:
it is a warning (judging by the colour), but it is not labelled as such
you can also have that warning with GLMs and LMERs
Here's what I don't know:
does it mean the model is invalid?
what causes that issue?
what could I do to resolve that issue?
Here's what I searched:
https://stats.stackexchange.com/questions/67287/very-large-theta-values-using-glm-nb-in-r-alternative-approaches (no explanation as to the why and how)
GLMM FAQ: no mention
I am not the only one regularly running into this or similar problems: Using glmer.nb(), the error message (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate is returned
https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached/40664
Here's what would be highly appreciated:
A more informative warning message: did the model converge? What caused this? What can one do to fix it? Where can one read more about this (a link to the GLMM FAQ, brms-style)?
This is a general question. I did not provide reproducible code because an answer that is generalisable would be most useful.
library(lme4)
dd <- data.frame(f = factor(rep(1:20, each = 20)))
dd$y <- simulate(~ 1 + (1 | f), family = "poisson",
                 newdata = dd,
                 newparams = list(beta = 1, theta = 1),
                 seed = 101)[[1]]
m1 <- glmer.nb(y ~ 1 + (1 | f), data = dd)
Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :
  iteration limit reached
It's a bit hard to tell, but this warning occurs in MASS::theta.ml(), which is called to get an initial estimate of the dispersion parameter. (If you set options(error = recover, warn = 2), warnings will be converted to errors and errors will dump you into a debugger, where you can see the sequence of calls that were active when the warning/error occurred).
This generally occurs when the data (specifically, the conditional distribution of the data) is actually equidispersed (variance == mean) or underdispersed (i.e. variance < mean), which can't be achieved by a negative binomial distribution. If you run getME(m1, "glmer.nb.theta") you'll generally get a very large value (in this case it's 62376), which indicates where the optimizer gave up while it was trying to send the dispersion parameter to infinity.
You can:
ignore the warning (the negative binomial isn't a good choice, but the model is effectively converging to a Poisson solution anyway).
revert to a Poisson model (the CV question you link to does say "a Poisson model might be a better choice")
People often worry less about underdispersion than overdispersion (because underdispersion makes results of a Poisson model conservative), but if you want to take underdispersion into account you can fit your model with a conditional distribution that allows underdispersion as well as overdispersion (not directly possible within lme4, but see here)
PS the "iteration limit reached without convergence" warning in one of your linked answers, from nlminb within lme, is a completely different issue (except that both situations involve some form of iterative solution scheme with a set maximum number of iterations ...)
I am using Ray 1.3.0 (for RLlib) with a combination of SUMO version 1.9.2 for the simulation of a multi-agent scenario. I have configured RLlib to use a single PPO network that is commonly updated/used by all N agents. My evaluation settings look like this:
# === Evaluation Settings ===
# Evaluate with every `evaluation_interval` training iterations.
# The evaluation stats will be reported under the "evaluation" metric key.
# Note that evaluation is currently not parallelized, and that for Ape-X
# metrics are already only reported for the lowest epsilon workers.
"evaluation_interval": 20,
# Number of episodes to run per evaluation period. If using multiple
# evaluation workers, we will run at least this many episodes total.
"evaluation_num_episodes": 10,
# Whether to run evaluation in parallel to a Trainer.train() call
# using threading. Default=False.
# E.g. evaluation_interval=2 -> For every other training iteration,
# the Trainer.train() and Trainer.evaluate() calls run in parallel.
# Note: This is experimental. Possible pitfalls could be race conditions
# for weight synching at the beginning of the evaluation loop.
"evaluation_parallel_to_training": False,
# Internal flag that is set to True for evaluation workers.
"in_evaluation": True,
# Typical usage is to pass extra args to evaluation env creator
# and to disable exploration by computing deterministic actions.
# IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
# policy, even if this is a stochastic one. Setting "explore=False" here
# will result in the evaluation workers not using this optimal policy!
"evaluation_config": {
# Example: overriding env_config, exploration, etc:
"lr": 0, # To prevent any kind of learning during evaluation
"explore": True # As required by PPO (read IMPORTANT NOTE above)
},
# Number of parallel workers to use for evaluation. Note that this is set
# to zero by default, which means evaluation will be run in the trainer
# process (only if evaluation_interval is not None). If you increase this,
# it will increase the Ray resource usage of the trainer since evaluation
# workers are created separately from rollout workers (used to sample data
# for training).
"evaluation_num_workers": 1,
# Customize the evaluation method. This must be a function of signature
# (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
# Trainer.evaluate() method to see the default implementation. The
# trainer guarantees all eval workers have the latest policy state before
# this function is called.
"custom_eval_function": None,
What happens is that every 20 iterations (each iteration collecting "X" training samples), there is an evaluation run of at least 10 episodes. The rewards received by all N agents are summed over these episodes, and that sum is recorded as the reward for that particular evaluation run. Over time, I notice that the reward sums follow a pattern that repeats over the same interval of evaluation runs, continuously, and the learning goes nowhere.
UPDATE (23/06/2021)
Unfortunately, I did not have TensorBoard activated for that particular run, but from the mean rewards collected during evaluations (which happen every 20 iterations, over 10 episodes each), it is clear that there is a repeating pattern, as shown in the annotated plot below:
The 20 agents in the scenario should be learning to avoid colliding, but instead they somehow stagnate at a certain policy and end up showing the exact same reward sequence during evaluation.
Is this a characteristic of how I have configured the evaluation aspect, or should I be checking something else? I would be grateful if anyone could advise or point me in the right direction.
Thank you.
Step 1: I noticed that when I stopped the run at some point for any reason and then restarted it from the saved checkpoint, most graphs on TensorBoard (including rewards) charted out EXACTLY the same line all over again, which made it look like the sequence was repeating.
Step 2: This led me to believe that there was something wrong with my checkpoints. I compared the weights across checkpoints using a loop and, voilà, they were all the same! Not a single change! So either something was wrong with the saving/restoring of checkpoints (which, after a bit of playing around, I found was not the case), or my weights were simply not being updated.
Step 3: I sifted through my training configuration to see if something there was preventing the network from learning, and I noticed that I had set the "multiagent" configuration option "policies_to_train" to a policy that did not exist. Unfortunately, this either did not throw a warning/error, or it did and I completely missed it.
Solution step: once I set the multiagent "policies_to_train" configuration option correctly, it started to work!
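For anyone hitting the same thing, the relevant part of the config looks roughly like this; the policy ID "shared_ppo" and the spaces are made up for illustration, and the key point is simply that "policies_to_train" must reference an ID that actually exists under "policies":

from gym import spaces

# hypothetical observation/action spaces, for illustration only
obs_space = spaces.Box(low=-1.0, high=1.0, shape=(4,))
act_space = spaces.Discrete(3)

config = {
    "multiagent": {
        # one PPO policy shared by all N agents; "shared_ppo" is a made-up ID
        "policies": {
            "shared_ppo": (None, obs_space, act_space, {}),
        },
        # map every agent to the shared policy
        "policy_mapping_fn": lambda agent_id: "shared_ppo",
        # the bug was listing a policy ID here that did not exist;
        # it must match a key of "policies" above
        "policies_to_train": ["shared_ppo"],
    },
}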
Could it be that due to the multi-agent dynamics, your policy is chasing its tail? How many policies do you have? Are they competing/collaborating/neutral to each other?
Note that multi-agent training can be very unstable, and seeing these fluctuations is quite normal, as the different policies get updated and then have to face different "env" dynamics because of that (env = env + all other policies, which appear as part of the env as well).
I want to perform inference (i.e. semantic segmentation) on a very large satellite image without splitting it into pieces. I have access to 4 GPUs (each having 15 GBs of memory) and was wondering if it is possible to somehow use all the memory of these GPUs combined (i.e. 60 GB) for inference on the image in PyTorch?
You are looking for the model-parallel mode of operation.
Basically, you can assign different parts of your model to different GPUs, and then you have to take care of the "bookkeeping".
This solution is very model- and task-specific; therefore, there are no "generic" wrappers for it (as opposed to data parallelism).
For example:
class MyModelParallelNetwork(nn.Module):
    def __init__(self, ...):
        super().__init__()
        # define the network
        self.part_one = ...    # some nn.Module
        self.part_two = ...    # additional nn.Module
        self.part_three = ...
        self.part_four = ...
        # important part - "send" the different parts to different GPUs
        self.part_one.to(torch.device('cuda:0'))
        self.part_two.to(torch.device('cuda:1'))
        self.part_three.to(torch.device('cuda:2'))
        self.part_four.to(torch.device('cuda:3'))

    def forward(self, x):
        # forward through the model parts, moving activations between GPUs
        # (note the device strings must be 'cuda:N', not 'gpu:N')
        p1 = self.part_one(x.to(torch.device('cuda:0')))
        p2 = self.part_two(p1.to(torch.device('cuda:1')))
        p3 = self.part_three(p2.to(torch.device('cuda:2')))
        y = self.part_four(p3.to(torch.device('cuda:3')))
        return y  # result is on the cuda:3 device
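A hypothetical usage sketch (image_tensor stands in for your preprocessed satellite image): the final activations live on cuda:3, so move the result back to the CPU when you are done:

net = MyModelParallelNetwork(...)   # placeholder constructor arguments
net.eval()

with torch.no_grad():               # inference only, no autograd buffers
    seg_logits = net(image_tensor)  # the input can start on the CPU

seg_logits = seg_logits.cpu()       # bring the cuda:3 result back for post-processing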