States from multiple parameters as inputs to a Deep Q Network

How do I pass states from multiple parameters as inputs to a DQN network?
For example, in a manufacturing environment, jobs have multiple tasks and there are m machines, so the integrated task-machine state space used as input would look like
{
(s_1, s_2, ..., s_n)_t;
(s_1, s_2, ..., s_n)_m
}
where t is the index of a task (t1, t2, etc.), m is the index of a machine, and the output is the set of machines {m_1, m_2, ..., m_n}.
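One common approach (shown below as a minimal PyTorch sketch, not taken from the question; the sizes and layer widths are assumptions) is to flatten the task states and machine states, concatenate them into a single input vector, and let the network output one Q-value per machine:

import torch
import torch.nn as nn

n_tasks, n_machines, n_state_features = 5, 3, 4   # assumed sizes for illustration

class SchedulingDQN(nn.Module):
    def __init__(self):
        super().__init__()
        input_dim = (n_tasks + n_machines) * n_state_features
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_machines),   # one Q-value per machine m_1..m_n
        )

    def forward(self, task_states, machine_states):
        # task_states: (batch, n_tasks, n_state_features)
        # machine_states: (batch, n_machines, n_state_features)
        x = torch.cat([task_states.flatten(1), machine_states.flatten(1)], dim=1)
        return self.net(x)   # Q-values over the machines

The chosen machine is then the argmax over the Q-values; if the number of tasks varies between jobs, padding (or an embedding/attention scheme) would be needed instead of plain concatenation.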

Can HuggingFace `Trainer` be customised for curriculum learning?

I have been looking for certain features in the HuggingFace transformers Trainer object (in particular Seq2SeqTrainer) and would like to know whether they exist and, if so, how to implement them, or whether I would have to write my own training loop to enable them.
I am looking to apply Curriculum Learning to my training strategy, as well as to evaluate the model at regular intervals, and would therefore like to enable the following:
1. choose in which order the model sees training samples at each epoch (it seems that the data passed to the train_dataset argument are automatically shuffled by some internal code, and even if I managed to stop that, I would still need to pass differently ordered data at different epochs, as I may want to start training the model on easy samples for a few epochs and then pass a random shuffle of all data for later epochs);
2. run custom evaluation at integer multiples of a fixed number of steps. The standard compute_metrics argument of the Trainer takes a function to which the predictions and labels are passed*, and the user can decide how to generate the metrics from these. However, I'd like a finer level of control, for example changing the maximum sequence length for the tokenizer when doing the evaluation, as opposed to when doing training, which would require including some explicit evaluation code inside compute_metrics that needs to access the trained model and the data from disk.
Can these two points be achieved by using the Trainer on a multi-GPU machine, or would I have to write my own training loop?
*The function often looks something like this, and I'm not sure it would work with the Trainer if it doesn't have this configuration:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    ...
You can pass a custom compute_metrics function to the Trainer and control how often it runs via the evaluation strategy and step interval in the training arguments.
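For reference, a minimal sketch (the model and dataset variables are placeholders assumed to be defined elsewhere) of passing a custom compute_metrics to the Trainer and evaluating every fixed number of steps via the training arguments:

import numpy as np
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # placeholder metric; real code would decode/post-process the predictions first
    return {"accuracy": float(np.mean(predictions == labels))}

args = Seq2SeqTrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",   # evaluate at regular step intervals
    eval_steps=500,                # every 500 optimisation steps
)

trainer = Seq2SeqTrainer(
    model=model,                     # assumed to exist
    args=args,
    train_dataset=train_dataset,     # assumed to exist
    eval_dataset=eval_dataset,       # assumed to exist
    compute_metrics=compute_metrics,
)

For the curriculum-ordering part, one possible (untested) route is to subclass Trainer and override get_train_dataloader, or to swap train_dataset between epochs via a TrainerCallback, since the Trainer shuffles the training data by default.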

How to use nlmer from lme4 with non-Normal data

I am testing some non-linear models with nlmer from the lme4 library. But this function assumes normality while my data clearly follows a Gamma distribution.
fit <- lme4::nlmer(y ~ nfun(x, b0, b1, b2) ~ (b0 | id),
                   data = df,
                   start = start.df, REML = T)
summary(fit)
Is there a way of adding a family argument, as in glmer from lme4, or any other tips for testing groups within non-linear models when the data are not Gaussian?
The options for fitting non-linear generalized mixed models in (pure) R are slim to nonexistent.
the brms package offers some scope for nonlinear fitting
you can build a nonlinear model in TMB (but you have to do your own programming in an extension of C++)
other front-end packages like rethinking, greta, rjags, etc. also allow you to build nonlinear mixed effects models — but all (?) in a Bayesian/MCMC framework
On the other hand, Gamma-based models can often be adequately substituted by log-Normal-based models (i.e. log(y) ~ nfun(x, b0, b1, b2)); while the tail shapes of these distributions are very different, the variance-mean relationship for the standard parameterizations is the same, and the results of a (log-link) Gamma GLMM and the corresponding log-Gaussian model are often very similar.
Or you could use a fixed effect for the groups (if you have enough data) and use bbmle::mle2 with the parameters argument.

How to prevent my reward sum received during evaluation runs from repeating in intervals when using RLlib?

I am using Ray 1.3.0 (for RLlib) in combination with SUMO version 1.9.2 for the simulation of a multi-agent scenario. I have configured RLlib to use a single PPO network that is commonly updated/used by all N agents. My evaluation settings look like this:
# === Evaluation Settings ===
# Evaluate with every `evaluation_interval` training iterations.
# The evaluation stats will be reported under the "evaluation" metric key.
# Note that evaluation is currently not parallelized, and that for Ape-X
# metrics are already only reported for the lowest epsilon workers.
"evaluation_interval": 20,
# Number of episodes to run per evaluation period. If using multiple
# evaluation workers, we will run at least this many episodes total.
"evaluation_num_episodes": 10,
# Whether to run evaluation in parallel to a Trainer.train() call
# using threading. Default=False.
# E.g. evaluation_interval=2 -> For every other training iteration,
# the Trainer.train() and Trainer.evaluate() calls run in parallel.
# Note: This is experimental. Possible pitfalls could be race conditions
# for weight synching at the beginning of the evaluation loop.
"evaluation_parallel_to_training": False,
# Internal flag that is set to True for evaluation workers.
"in_evaluation": True,
# Typical usage is to pass extra args to evaluation env creator
# and to disable exploration by computing deterministic actions.
# IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
# policy, even if this is a stochastic one. Setting "explore=False" here
# will result in the evaluation workers not using this optimal policy!
"evaluation_config": {
# Example: overriding env_config, exploration, etc:
"lr": 0, # To prevent any kind of learning during evaluation
"explore": True # As required by PPO (read IMPORTANT NOTE above)
},
# Number of parallel workers to use for evaluation. Note that this is set
# to zero by default, which means evaluation will be run in the trainer
# process (only if evaluation_interval is not None). If you increase this,
# it will increase the Ray resource usage of the trainer since evaluation
# workers are created separately from rollout workers (used to sample data
# for training).
"evaluation_num_workers": 1,
# Customize the evaluation method. This must be a function of signature
# (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
# Trainer.evaluate() method to see the default implementation. The
# trainer guarantees all eval workers have the latest policy state before
# this function is called.
"custom_eval_function": None,
What happens is that every 20 iterations (each iteration collecting "X" training samples), there is an evaluation run of at least 10 episodes. The rewards received by all N agents are summed over these episodes, and that total is set as the reward sum for that particular evaluation run. Over time, I notice a pattern in the reward sums that keeps repeating over the same interval of evaluation runs, and the learning goes nowhere.
UPDATE (23/06/2021)
Unfortunately, I did not have TensorBoard activated for that particular run but from the mean rewards that were collected during evaluations (that happens every 20 iterations) of 10 episodes each, it is clear that there is a repeating pattern as shown in the annotated plot below:
The 20 agents in the scenario should be learning to avoid colliding, but instead they somehow stagnate at a certain policy and end up showing the exact same reward sequence during evaluation.
Is this a characteristic of how I have configured the evaluation aspect, or should I be checking something else? I would be grateful if anyone could advise or point me in the right direction.
Thank you.
Step 1: I noticed that when I stopped the run at some point for any reason and then restarted it from the saved checkpoint, most graphs on TensorBoard (including rewards) charted out the line in EXACTLY the same fashion all over again, which made it look like the sequence was repeating.
Step 2: This led me to believe that there was something wrong with my checkpoints. I compared the weights across checkpoints in a loop and, voila, they were all the same! Not a single change! So either there was something wrong with the saving/restoring of checkpoints (which, after a bit of playing around, I found was not the case), or my weights were simply not being updated.
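For anyone who wants to run the same check, here is a minimal sketch (checkpoint paths, env name and policy ID are hypothetical) of comparing policy weights between two restored checkpoints with Ray 1.3:

import numpy as np
from ray.rllib.agents.ppo import PPOTrainer

trainer_a = PPOTrainer(config=config, env="my_sumo_env")
trainer_a.restore("checkpoints/checkpoint_000020/checkpoint-20")
trainer_b = PPOTrainer(config=config, env="my_sumo_env")
trainer_b.restore("checkpoints/checkpoint_000040/checkpoint-40")

weights_a = trainer_a.get_policy("shared_policy").get_weights()
weights_b = trainer_b.get_policy("shared_policy").get_weights()

# identical weights across checkpoints means the policy is never updated
if all(np.allclose(weights_a[k], weights_b[k]) for k in weights_a):
    print("all weights identical -> the policy is not being trained")
else:
    print("weights differ -> training is updating the policy")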
Step 3: I sifted through my training configuration to see if something there was preventing the network from learning, and noticed that I had set the "policies_to_train" option of my "multiagent" configuration to a policy that did not exist. Unfortunately, this either did not throw a warning/error, or it did and I completely missed it.
Solution step: By setting the multiagent "policies_to_train" configuration option correctly, it started to work!
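In other words, "policies_to_train" has to name a policy ID that actually exists in the "policies" dict. A minimal sketch (policy ID, observation/action spaces and mapping are hypothetical) for a single shared PPO policy:

config["multiagent"] = {
    "policies": {
        # one PPO policy shared by all N agents
        "shared_policy": (None, obs_space, act_space, {}),
    },
    "policy_mapping_fn": lambda agent_id: "shared_policy",
    # must reference an existing policy ID, otherwise no weights are ever updated
    "policies_to_train": ["shared_policy"],
}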
Could it be that, due to the multi-agent dynamics, your policy is chasing its tail? How many policies do you have? Are they competing/collaborating/neutral to each other?
Note that multi-agent training can be very unstable, and seeing these fluctuations is quite normal, as the different policies get updated and then have to face different "env" dynamics because of that (env = env + all other policies, which appear as part of the env as well).

How can we define an RNN/LSTM neural network with multiple outputs for the input at time "t"?

I am trying to construct an RNN to predict the probability of a player playing the match, along with the runs scored and wickets taken by the player. I would use an LSTM so that performance in the current match would influence the player's future selection.
Architecture summary:
Input features: Match details - Venue, teams involved, team batting first
Input samples: Player roster of both teams.
Output:
Discrete: Binary: Did the player play.
Discrete: Wickets taken.
Continuous: Runs scored.
Continuous: Balls bowled.
Question:
Most often an RNN uses "Softmax" or "MSE" in the final layers to process "a" from the LSTM, providing only a single variable "Y" as output. But here there are four dependent variables (2 discrete and 2 continuous). Is it possible to stitch together all four as output variables?
If yes, how do we handle the mix of continuous and discrete outputs in the loss function?
(Though the output "a" from the LSTM has multiple features and carries the information to the next time step, we need multiple features at the output for training against the ground truth.)
You just do it. Without more detail on the software (if any) in use, it is hard to give more detail.
The output of the LSTM unit at every time step is one of the hidden layers of your network.
You can then feed it into 4 output layers:
1. sigmoid
2. I'd mess around with this a bit. Maybe 4x sigmoid (4 wickets to an innings, right?) or ReLU.
3, 4. linear (squaring is also an option, or ReLU).
For training purposes, your loss function is the sum of your 4 individual losses.
If they were all MSE, you could concatenate your 4 outputs before calculating the loss.
But since the first is cross-entropy (for a decision sigmoid), you would calculate them separately and sum.
You can still concatenate them afterwards to have an output vector.
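As an illustration, a minimal Keras sketch (layer sizes, feature count and head activations are assumptions) of four output heads sharing one LSTM, with binary cross-entropy for the "played" head and MSE for the rest, summed into a single training loss:

import tensorflow as tf
from tensorflow.keras import layers

n_timesteps, n_features = 10, 16          # assumed: 10 past matches, 16 features each

inputs = tf.keras.Input(shape=(n_timesteps, n_features))
a = layers.LSTM(64)(inputs)               # hidden state "a" of the last time step

played  = layers.Dense(1, activation="sigmoid", name="played")(a)   # binary
wickets = layers.Dense(1, activation="relu",    name="wickets")(a)  # non-negative count
runs    = layers.Dense(1, activation="linear",  name="runs")(a)     # continuous
balls   = layers.Dense(1, activation="linear",  name="balls")(a)    # continuous

model = tf.keras.Model(inputs, [played, wickets, runs, balls])
model.compile(
    optimizer="adam",
    loss={"played": "binary_crossentropy",
          "wickets": "mse", "runs": "mse", "balls": "mse"},
    # Keras minimises the (optionally weighted) sum of the four losses
    loss_weights={"played": 1.0, "wickets": 1.0, "runs": 1.0, "balls": 1.0},
)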

h2o deep learning different results per run

I use H2O deep learning from Python on data with 2 balanced classes, "0" and "1", and adjusted the parameters to be as follows:
prostate_dl = H2ODeepLearningEstimator(
    activation="Tanh",
    hidden=[50, 50, 50],
    distribution="multinomial",
    score_interval=10,
    epochs=1000,
    input_dropout_ratio=0.2,
    adaptive_rate=True,
    rho=0.998,
    epsilon=1e-8)
prostate_dl.train(
    x=x,
    y=y,
    training_frame=train,
    validation_frame=test)
Each time the program runs it gives a different confusion matrix and different accuracy results. Can anyone explain that? How can the results be reliable?
Also, all of the runs give the majority prediction as class "1", not "0". Is there any suggestion?
This question has already been answered here, but you need to set reproducible=TRUE when you initialize the H2ODeepLearningEstimator in Python (or in h2o.deeplearning() in R).
Even after setting reproducible=TRUE, the H2O Deep Learning results are only reproducible when using a single core; in other words, when h2o.init(nthreads = 1). The reasons behind this are outlined here.
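As a minimal sketch (the frames and column names are assumed to exist already), the reproducible single-threaded setup in Python would look roughly like this:

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init(nthreads=1)                 # single core: required for reproducibility

model = H2ODeepLearningEstimator(
    activation="Tanh",
    hidden=[50, 50, 50],
    epochs=1000,
    reproducible=True,               # deterministic (but slower) training
    seed=1234)                       # fix the seed as well

model.train(x=x, y=y, training_frame=train, validation_frame=test)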
Also, per the H2O Deep Learning user guide:
Does each Mapper task work on a separate neural-net model that is combined during reduction, or is each Mapper manipulating a shared object that’s persistent across nodes?
Neither; there's one model per compute node, so multiple Mappers/threads share one model, which is why H2O is not reproducible unless a small dataset is used and force_load_balance=F or reproducible=T, which effectively rebalances to a single chunk and leads to only one thread to launch a map(). The current behavior is simple model averaging; between-node model averaging via "Elastic Averaging" is currently in progress.