Ray Tune: How do schedulers and search algorithms interact? - deep-learning

It seems to me that the natural way to integrate Hyperband with a Bayesian optimization search is to have the search algorithm determine each bracket and have the Hyperband scheduler run the bracket; that is, the Bayesian optimization search runs only once per bracket. Looking at Tune's source code for this, it's not clear to me whether the library applies this strategy or not.
In particular, I want to know how Tune handles the hand-off between the search algorithm and the trial scheduler. For instance, how does this work if I call SkOptSearch and AsyncHyperBandScheduler (or HyperBandScheduler) together as follows:
sk_search = SkOptSearch(
    optimizer,
    ['group', 'dimensions', 'normalize', 'sampling_weights',
     'batch_size', 'lr_adam', 'loss_weight'],
    max_concurrent=4,
    reward_attr="neg_loss",
    points_to_evaluate=current_params)

hyperband = AsyncHyperBandScheduler(
    time_attr="training_iteration",
    reward_attr="neg_loss",
    max_t=50,
    grace_period=5,
    reduction_factor=2,
    brackets=5)

run(Trainable_Dense,
    name='hp_search_0',
    stop={"training_iteration": 9999,
          "neg_loss": -0.2},
    num_samples=75,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    local_dir='./tune_save',
    checkpoint_freq=5,
    search_alg=sk_search,
    scheduler=hyperband,
    verbose=2,
    resume=False,
    reuse_actors=True)
Based on the source code linked above and the source code here, it seems to me that sk_search would return groups of up to 4 trials at a time, whereas hyperband should be querying the sk_search algorithm for a full bracket's worth of trials at a time.

There is now a Bayesian Optimization HyperBand (BOHB) implementation in Tune - https://ray.readthedocs.io/en/latest/tune-searchalg.html#bohb.
For standard search algorithms and schedulers, the search algorithm currently only sees the result of a trial once that trial has completed.
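For reference, a minimal sketch of wiring up that BOHB combination, assuming a Ray release where TuneBOHB and HyperBandForBOHB are available under these import paths (exact paths and arguments vary between Ray versions), and reusing Trainable_Dense from the question:

import ConfigSpace as CS
from ray import tune
from ray.tune.schedulers.hb_bohb import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

# BOHB samples from a ConfigSpace search space rather than a skopt Optimizer
config_space = CS.ConfigurationSpace()
config_space.add_hyperparameter(
    CS.UniformFloatHyperparameter("lr_adam", lower=1e-5, upper=1e-2))

bohb_search = TuneBOHB(config_space, max_concurrent=4,
                       metric="neg_loss", mode="max")
bohb_hyperband = HyperBandForBOHB(time_attr="training_iteration",
                                  metric="neg_loss", mode="max",
                                  max_t=50)

tune.run(Trainable_Dense,
         search_alg=bohb_search,
         scheduler=bohb_hyperband,
         num_samples=75)

In this pairing the scheduler and searcher are designed to work together, so the bracket logic can inform which configurations get suggested, unlike the standard combination where the searcher only learns from completed trials.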

Related

Can HuggingFace `Trainer` be customised for curriculum learning?

I have been looking for certain features in the HuggingFace transformer Trainer object (in particular Seq2SeqTrainer) and would like to know whether they exist and if so, how to implement them, or whether I would have to write my own training loop to enable them.
I am looking to apply Curriculum Learning to my training strategy, as well as to evaluate the model at regular intervals, and therefore would like to enable the following:
choose the order in which the model sees training samples at each epoch (it seems the data passed to the train_dataset argument is automatically shuffled by some internal code, and even if I managed to stop that, I would still need to pass differently ordered data at different epochs, since I may want to start training on easy samples for a few epochs and then switch to a random shuffle of all data for later epochs)
run custom evaluation at integer multiples of a fixed number of steps. The standard compute_metrics argument of the Trainer takes a function to which the predictions and labels are passed*, and the user can decide how to generate the metrics from these. However, I'd like a finer level of control, for example changing the maximum sequence length for the tokenizer during evaluation, as opposed to during training, which would require including some explicit evaluation code inside compute_metrics that accesses the trained model and the data from disk.
Can these two points be achieved by using the Trainer on a multi-GPU machine, or would I have to write my own training loop?
*The function often looks something like this, and I'm not sure it would work with the Trainer if it doesn't have this signature:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    ...
You can pass a custom compute_metrics function to the Trainer.
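For the curriculum part, one possible sketch (not an official Trainer feature; names such as CurriculumSampler and difficulty_scores are made up for illustration) is to override get_train_dataloader with a sampler whose ordering changes from epoch to epoch. The DataLoader re-iterates its sampler at every epoch, so the sampler can hand out an easy-first order for the first few epochs and random shuffles afterwards:

import random
from torch.utils.data import DataLoader, Sampler
from transformers import Seq2SeqTrainer

class CurriculumSampler(Sampler):
    # Easy-first order for the first warmup_epochs, then a fresh random shuffle
    # each epoch; the epoch counter advances on every __iter__ call.
    def __init__(self, difficulty_scores, warmup_epochs=3):
        self.easy_first = sorted(range(len(difficulty_scores)),
                                 key=lambda i: difficulty_scores[i])
        self.warmup_epochs = warmup_epochs
        self.epoch = 0

    def __len__(self):
        return len(self.easy_first)

    def __iter__(self):
        order = (self.easy_first if self.epoch < self.warmup_epochs
                 else random.sample(self.easy_first, len(self.easy_first)))
        self.epoch += 1
        return iter(order)

class CurriculumTrainer(Seq2SeqTrainer):
    def __init__(self, *args, difficulty_scores=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._sampler = CurriculumSampler(difficulty_scores)

    def get_train_dataloader(self):
        # Bypass the Trainer's default (shuffling) sampler; single-GPU sketch only.
        return DataLoader(self.train_dataset,
                          batch_size=self.args.train_batch_size,
                          sampler=self._sampler,
                          collate_fn=self.data_collator)

For the second point, the evaluation settings in TrainingArguments (an evaluation strategy of "steps" together with eval_steps) already trigger evaluation at fixed step intervals; for finer control, a TrainerCallback can set control.should_evaluate in on_step_end whenever state.global_step reaches the desired multiple, and the same callback could swap tokenizer settings before evaluation. On a multi-GPU machine the stock dataloader uses a distributed sampler, so a custom sampler like the sketch above would need the same care about sharding.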

LSTM Evolution Forecast

I am confused about how LSTM networks work when the forecasting horizon is not fixed, i.e. when I want a prediction at an arbitrary time in the future. In physical terms I would call it the evolution of the system.
Suppose I have a time series $y(t)$ (output) I want to forecast, and some external inputs $u_1(t), u_2(t),\cdots u_N(t)$ on which the series $y(t)$ depends.
It's common to use the lagged value of the output $y(t)$ as an input for the network, so that schematically I have something like this (for simplicity, consider just lag 1 for the output and no lag for the external inputs):
$[y(t-1), u_1(t), u_2(t), \cdots, u_N(t)] \to y(t)$
With this setup, recursive forecasting is forced to use the predicted value from the previous step as input for the next step, so errors propagate and the long-term forecast degrades.
Now, my confusion: I think of an RNN as a (simple) implementation of a state-space model, where I have the inputs, my output, and one or more state variables responsible for the memory of the system. These state variables are hidden and not observed.
So the question: if such variables already account for previous states of the system, why would I need to use the lagged output value as an input to my network/model?
Would getting rid of it make my long-term forecast better, since the error of the forecasted output would no longer be fed back? (I guess there would still be some error propagating through the internal state.)
Thanks!
Please see DeepAR, an LSTM forecaster that predicts more than one step into the future.
"The main contributions of the paper are twofold: (1) we propose an RNN architecture for probabilistic forecasting, incorporating a negative Binomial likelihood for count data as well as special treatment for the case when the magnitudes of the time series vary widely; (2) we demonstrate empirically on several real-world data sets that this model produces accurate probabilistic forecasts across a range of input characteristics, thus showing that modern deep learning-based approaches can effectively address the probabilistic forecasting problem, which is in contrast to common belief in the field and the mixed results [...]"
In this paper, they forecast multiple steps into the future precisely to counter the error propagation you describe.
Forecasting several steps at once gives more accurate predictions further into the future.
Another thing done in the paper is predicting percentiles and interpolating, rather than predicting the value directly, which adds stability and an error estimate.
Disclaimer - I read an older version of this paper.
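To make that concrete, here is a rough PyTorch sketch (not the actual DeepAR code; the layer sizes, quantiles and names are illustrative) of an LSTM that reads past outputs and covariates and emits several quantiles for a whole horizon at once, so no point forecast is ever fed back in:

import torch
import torch.nn as nn

class DirectQuantileLSTM(nn.Module):
    def __init__(self, n_covariates, horizon, quantiles=(0.1, 0.5, 0.9), hidden=64):
        super().__init__()
        self.horizon = horizon
        self.quantiles = quantiles
        self.lstm = nn.LSTM(input_size=1 + n_covariates,
                            hidden_size=hidden, batch_first=True)
        # one output per (future step, quantile), predicted in a single shot
        self.head = nn.Linear(hidden, horizon * len(quantiles))

    def forward(self, y_past, u_past):
        # y_past: (batch, T, 1), u_past: (batch, T, n_covariates)
        x = torch.cat([y_past, u_past], dim=-1)
        _, (h, _) = self.lstm(x)              # final hidden state acts as the "memory"
        out = self.head(h[-1])                # (batch, horizon * n_quantiles)
        return out.view(-1, self.horizon, len(self.quantiles))

def pinball_loss(pred, target, quantiles=(0.1, 0.5, 0.9)):
    # target: (batch, horizon); standard quantile (pinball) loss
    losses = []
    for i, q in enumerate(quantiles):
        err = target - pred[..., i]
        losses.append(torch.max(q * err, (q - 1) * err).mean())
    return torch.stack(losses).mean()

The hidden state carries the memory the question describes, and because the whole horizon is produced in one shot, the forecast error of $y$ is never recycled as an input.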

convergence code 1 (glmer model, lme4 package)

I'm running a glmer model with a three-way interaction, which causes me to receive the following warning:
Warning:
In optwrap(optimizer, devfun, start, rho$lower, control = control, :
convergence code 1 from nlminbwrap
The warning is not there when the 3-way interaction is omitted, so I suspect it has to do with model complexity.
Unfortunately, there is no further information about the nature of the convergence issue in the warning (and none in the model summary either), which makes it hard to resolve. [I've already tried different optimizers and increasing the number of function evaluations.]
Is there any way of finding out what precisely convergence code 1 means? Also, I'm wondering whether it is as serious as when it says Model failed to converge? I've been looking for an answer in the R help pages and in the GLMM FAQs, but can't seem to find any. Any help is much appreciated!
Okay, so I've done some reading here with the hope of being able to help out a fellow linguist. Let's start with the model you specified in the comments:
model = glmer(Correct_or_incorrect ~ (condition | CASE) + condition + sound + syll +
              condition:sound + condition:syll + syll:sound + condition:sound:syll,
              dataMelt,
              control = glmerControl(optimizer = "nlminbwrap"),
              family = binomial)
The warning message itself doesn't say anything useful, but convergence code 1 (at least from bobyqa) used to mean that the maximum number of function evaluations was exceeded. How high did you go with the number of evaluations? I would set it really high and see whether the warning goes away; all you lose is computer time, which I think is a small price to pay for a model that doesn't throw warnings.
You also mentioned that the warning disappears when the 3-way interaction is omitted, and I am inclined to agree that this is about model complexity. If you don't have any specific hypotheses about that interaction, I would leave it out and be done; if you do, there are a few options you haven't mentioned trying yet.
There is a function called allFit() that will fit the model with all available optimizers. This would be a quick and easy way to see if your estimates are roughly the same among all the different optimizers. You run it on an already fitted model, like this:
allFit(model)
There is a good walkthrough of using allFit() and its different arguments here: https://joshua-nugent.github.io/allFit/. That page also has a lot of other potential solutions to your problem.
If you can, I would take advantage of a machine with multiple cores and run allFit() with as many evaluations as you can swing, and see whether any of the optimizers avoid this warning, which is presumably about failing to minimize the objective before the evaluations run out.

how to predict query topics using word-topic matrix?

I'm implementing LDA using Java. I know how the algorithm works. At the end of training (after the given number of iterations) I get two matrices (topic-word and document-topic) that represent the set of input documents.
My problem is that when I input a new document (a query), I want to use these matrices (or any other way) to get the document-topic vector of that query. How would I do that?
Are you using Variational Inference or Gibbs Sampling?
For Gibbs sampling, a typical approach is to add the new document(s) to the inference and update only their own counters, keeping the counters of the documents you used to learn the model fixed.
This is specified in equations 84 and 85 in Parameter Estimation for Text Analysis
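For concreteness, a rough Python sketch of that folding-in step (a Java version would be analogous); here phi is the topic-word matrix learned during training, alpha is the document-topic prior, and only the query's own counters are updated:

import numpy as np

def infer_query_topics(query_word_ids, phi, alpha=0.1, n_iters=100, seed=0):
    # phi: (K topics x V words) topic-word matrix estimated on the training corpus
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    z = rng.integers(K, size=len(query_word_ids))        # random initial assignments
    n_dk = np.bincount(z, minlength=K).astype(float)     # query-local topic counts

    for _ in range(n_iters):
        for i, w in enumerate(query_word_ids):
            n_dk[z[i]] -= 1                               # remove token i's count
            p = phi[:, w] * (n_dk + alpha)                # training counts stay fixed via phi
            z[i] = rng.choice(K, p=p / p.sum())
            n_dk[z[i]] += 1                               # add it back under the new topic

    return (n_dk + alpha) / (n_dk.sum() + K * alpha)      # document-topic vector theta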
I guess there has to be a similar approach in VI LDA.

2D non-polynomial function fitting from the command line

I just wrote a simple Unix command line utility that could be implemented a lot more efficiently. I can measure its performance by just running it on a number of inputs and measuring the time it takes. This will produce a set of pairs of numbers, s t, where s is the input size and t the processing time. In order to determine the performance characteristics of my utility, I need to fit a function through these data points. I can do this manually, but I prefer to be lazy and let a utility do it for me.
Does such a utility exist?
Its input is a sequence of pairs of numbers.
Its output is a formula that expresses the second number as a function of the first, plus an error measure.
A first step would be a utility that does this just for polynomials.
This has been discussed here but it didn't produce a ready-to-use solution.
The next step is to extend the utility to try non-polynomial terms: negative-degree terms (as in y = 1/x) and logarithmic terms (as in y = x log x) will need to be tried as well. One idea for coping with the non-polynomial terms is to surround the polynomial fitting with x and y scale transformations; I don't know whether that will suffice. This question is related but not exactly the same.
As I said, I'm lazy: I'm not looking for ideas on how to write this myself, I'm looking for a reliable result of a project that has already done it for me. Any suggestions?
I believe SAS has this, RS/1 has this, I think Mathematica has this, and Excel and most spreadsheets have a primitive form of it, usually with add-ons available for more advanced forms. There are lots of lab-analysis and statistical-analysis tools with features like this.
Re: command line tools:
SAS, RS/1 and Minitab were all command line tools 20 years ago when I used them. I bet at least one of them still has this capability.
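If a ready-made tool does not turn up, a small SciPy script gets most of the way there. The sketch below (the candidate model list and the error measure are choices of mine, not a standard) reads "s t" pairs from standard input, fits each candidate with curve_fit, and prints the one with the lowest RMSE:

import sys
import numpy as np
from scipy.optimize import curve_fit

# candidate complexity models, including the non-polynomial terms mentioned above
candidates = {
    "a*n + b":        lambda n, a, b: a * n + b,
    "a*n*log(n) + b": lambda n, a, b: a * n * np.log(n) + b,
    "a*n**2 + b":     lambda n, a, b: a * n ** 2 + b,
    "a*log(n) + b":   lambda n, a, b: a * np.log(n) + b,
    "a/n + b":        lambda n, a, b: a / n + b,
}

data = np.loadtxt(sys.stdin)            # two columns: input size s, time t
s, t = data[:, 0], data[:, 1]

best = None
for name, f in candidates.items():
    try:
        params, _ = curve_fit(f, s, t, maxfev=10000)
    except RuntimeError:
        continue                        # this candidate did not converge; skip it
    rmse = np.sqrt(np.mean((f(s, *params) - t) ** 2))
    if best is None or rmse < best[2]:
        best = (name, params, rmse)

name, params, rmse = best
print("best fit: %s with parameters %s, RMSE = %.4g" % (name, params, rmse))

Piped as `./mytool_benchmark | python fit.py`, this gives the kind of formula-plus-error output the question asks for, without hand-fitting anything.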