Plot Learning Curve in RapidMiner: Send Training Set Size to Log

RapidMiner noob question.
I am trying to plot the learning curve for a classifier. Ideally I want to log the classifier's cross-validated performance against the training set size.
My approach is to send the data to the learning curve operator and pass the operator's input data to a Cross Validation operator, whose outputs are then piped to a Log operator. From this point I can easily log the classifier's performance. The problem is that I can't figure out how to get the training set size to send to the Log operator.

One approach is to use the Extract Macro operator, choosing the option number_of_examples. Once you have the macro, use the Provide Macro as Log Value operator so that the Log operator can log the macro's value.
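RapidMiner itself is configured through operators rather than code, but for reference, here is the same measurement sketched in Python with scikit-learn (my own illustrative aside, not part of RapidMiner): it logs cross-validated performance against the absolute training set size, the same two values the Log operator should capture.

from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# learning_curve returns the absolute training set sizes alongside the
# cross-validated scores for each size.
sizes, train_scores, cv_scores = learning_curve(
    DecisionTreeClassifier(), X, y, cv=10,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0])
for n, score in zip(sizes, cv_scores.mean(axis=1)):
    print(n, score)  # training set size vs mean cross-validated score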

Related

Can HuggingFace `Trainer` be customised for curriculum learning?

I have been looking for certain features in the HuggingFace transformers Trainer object (in particular Seq2SeqTrainer) and would like to know whether they exist and, if so, how to enable them, or whether I would have to write my own training loop.
I am looking to apply curriculum learning to my training strategy, as well as to evaluate the model at regular intervals, and therefore would like to enable the following:
1. Choose the order in which the model sees training samples at each epoch. (It seems the data passed to the train_dataset argument are automatically shuffled by some internal code, and even if I managed to stop that, I would still need to pass differently ordered data at different epochs: I may want to start training the model on easy samples for a few epochs and then switch to a random shuffle of all data for later epochs.)
2. Run custom evaluation at integer multiples of a fixed number of steps. The standard compute_metrics argument of the Trainer takes a function to which the predictions and labels are passed*, and the user decides how to generate the metrics from these. However, I'd like a finer level of control, for example changing the maximum sequence length of the tokenizer during evaluation as opposed to training, which would require including explicit evaluation code inside compute_metrics that accesses the trained model and the data from disk.
Can these two points be achieved by using the Trainer on a multi-GPU machine, or would I have to write my own training loop?
*The function often looks something like this, and I'm not sure it would work with the Trainer if it didn't have this signature:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    ...
You can pass a custom compute_metrics function to the Trainer itself, and the training arguments control how often evaluation runs (e.g. evaluation_strategy="steps" together with eval_steps).
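For the two specific points, here is a rough sketch of how far the Trainer can be pushed without a custom loop. The sampler trick relies on the private _get_train_sampler hook, which may differ across transformers versions, and difficulty_order and eval_fn are my own placeholder names.

import torch
from torch.utils.data import Sampler
from transformers import Seq2SeqTrainer, TrainerCallback

class CurriculumSampler(Sampler):
    # Easy-to-hard order for the first few epochs, random shuffle afterwards.
    def __init__(self, difficulty_order, easy_epochs):
        self.difficulty_order = list(difficulty_order)  # indices, easy -> hard
        self.easy_epochs = easy_epochs
        self.epochs_seen = 0

    def __iter__(self):
        # The DataLoader calls __iter__ once per epoch, so epochs can be counted here.
        if self.epochs_seen < self.easy_epochs:
            order = self.difficulty_order
        else:
            order = torch.randperm(len(self.difficulty_order)).tolist()
        self.epochs_seen += 1
        return iter(order)

    def __len__(self):
        return len(self.difficulty_order)

class CurriculumTrainer(Seq2SeqTrainer):
    def __init__(self, difficulty_order, easy_epochs=3, **kwargs):
        super().__init__(**kwargs)
        self.curriculum_sampler = CurriculumSampler(difficulty_order, easy_epochs)

    # Overriding a private hook; check it still exists in your transformers version.
    def _get_train_sampler(self):
        return self.curriculum_sampler

class CustomEvalCallback(TrainerCallback):
    # Point 2: arbitrary evaluation code at a fixed step interval, with full
    # access to the model (so you can re-tokenize with a different max length).
    def __init__(self, eval_fn, every_n_steps=1000):
        self.eval_fn = eval_fn
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, model=None, **kwargs):
        if state.global_step > 0 and state.global_step % self.every_n_steps == 0:
            self.eval_fn(model)  # hypothetical custom evaluation routine

Both pieces run under the usual multi-GPU launchers, though in distributed training each process builds its own sampler, so a DistributedSampler-style variant would be needed there.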

Why are metrics in AllenNLP calculated with tensors? Can I define a metric based on strings?

I'm using AllenNLP in my project, and I'm confused by the Metric abstraction: all of the metrics are calculated with tensors, including BLEU and ROUGE. However, sometimes I want to calculate a metric on strings tokenized by whitespace. The built-in metrics operate at the level of tokens produced by BertTokenizer, which may give a different result because of the difference in tokenization.
Currently I'm converting the tensors to tokens, joining them into a string, and calculating my own metric inside forward. The code works, but I suspect this may not be the right way.
What does not seem to be exactly the AllenNLP way of calculating metrics is returning them from the forward call: the AllenNLP training loop expects the model to expose them through the dedicated get_metrics method.
However, if you mean that you update your metrics in forward and reset them in get_metrics, there seems to be no better way to do it.
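Nothing in the Metric API actually forces tensors: __call__ takes whatever you pass it. Here is a minimal sketch of a string-based metric; the registration name and the simple overlap F1 score are my own assumptions, to be checked against your AllenNLP version.

from allennlp.training.metrics import Metric

@Metric.register("whitespace_f1")  # hypothetical registration name
class WhitespaceF1(Metric):
    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0

    def __call__(self, predicted: str, gold: str) -> None:
        # Compare whitespace tokens rather than BertTokenizer word pieces.
        pred_tokens, gold_tokens = set(predicted.split()), set(gold.split())
        overlap = len(pred_tokens & gold_tokens)
        precision = overlap / max(len(pred_tokens), 1)
        recall = overlap / max(len(gold_tokens), 1)
        self._total += 2 * precision * recall / max(precision + recall, 1e-13)
        self._count += 1

    def get_metric(self, reset: bool = False) -> float:
        value = self._total / max(self._count, 1)
        if reset:
            self._total, self._count = 0.0, 0
        return value

You would still update it in forward (after detokenizing) and surface it from the model's get_metrics(reset), which is exactly the update-in-forward, reset-in-get_metrics pattern described above.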

Is it possible to feed the output back to input in artificial neural network?

I am currently designing an artificial neural network for a problem with a decay curve.
For example, consider building a model to predict the durability of some material. The inputs may include environmental conditions like temperature and humidity.
However, those alone are not adequate for predicting the durability of the material. For such a problem, I think it is better to use the output durability of previous time slots as one of the inputs for predicting the durability of the next time slot.
The trouble is that I do not know how to train a model that feeds its output back into its input, since that input column holds only the initial value before training.
For this case,
Method 1 (fail)
I have tried to fill the predicted output durability of the current row into the input durability of the next row. Nevertheless, doing so prevents the model from computing and updating the gradients via loss.backward(): the gradient function became CopySlices instead of MSELoss when I copied the predicted output into the next row of the input data.
[screenshots: the output fed back to the input, and the gradient function showing the copy operation]
Method 2 "fill the input column with expected output"
In this method, I fill the blank input column with the expected output of the previous row before training the model. Filling the input column with the previous row's expected output is done only for training; for real prediction, I feed the predicted output back to the input. With this setup I did manage to train a model with MSELoss, but it overfits.
Moreover, I do not believe this is the right method, as it uses the expected output as the input no matter how badly the model predicts.
Therefore, I want to ask whether it is possible to feed the output back to the input in a regression problem using an artificial neural network.
I apologize for not uploading code here; I cannot share the full code, as it may be confidential.
It looks like you need an RNN (recurrent neural network). Your Method 2 is essentially teacher forcing: train on the ground-truth previous output, then feed the model's own predictions back in at inference time. An RNN makes the feedback explicit and lets gradients flow through time. This tutorial is pretty helpful for understanding LSTMs: https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
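Here is a minimal PyTorch sketch of the idea (the feature count, shapes, and the DurabilityRNN name are my own assumptions): the previous durability is carried in the LSTM's hidden state rather than copied into the next input row, so loss.backward() works without any CopySlices issue.

import torch
import torch.nn as nn

class DurabilityRNN(nn.Module):
    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, time_slots, n_features), e.g. temperature and humidity
        out, _ = self.lstm(x)      # hidden state carries past durability info
        return self.head(out)      # (batch, time_slots, 1): durability per slot

model = DurabilityRNN(n_features=2)
x = torch.randn(8, 10, 2)          # 8 sequences, 10 time slots, 2 conditions
y = torch.randn(8, 10, 1)          # measured durability per slot
loss = nn.MSELoss()(model(x), y)
loss.backward()                    # gradients flow through all time steps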

how to predict query topics using word-topic matrix?

I'm implementing LDA in Java. I know how the algorithm works. At the end of training (after the given number of iterations) I will have two matrices (topic-word and document-topic) that represent the set of input documents.
My problem is that when I input a new document (query) I want to use these matrices (or any other way) to get the document-topic vector of that query. How would I do that?
Are you using Variational Inference or Gibbs Sampling?
For Gibbs sampling, a typical approach is to add the new document(s) to the inference and update only their own counters, keeping the counters for the documents you used to learn the model constant.
This is specified in equations 84 and 85 of Parameter Estimation for Text Analysis (Heinrich).
I guess there has to be a similar approach in VI LDA.
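A sketch of that fold-in step in Python (the variable names are mine: n_kw and n_k are the topic-word and per-topic counts learned during training, and alpha and beta are the usual Dirichlet hyperparameters):

import numpy as np

def fold_in(query_words, n_kw, n_k, alpha, beta, n_iters=100):
    # Resample topic assignments for the query document only; the training
    # counts n_kw and n_k stay fixed throughout.
    K, V = n_kw.shape
    z = np.random.randint(K, size=len(query_words))    # random initial topics
    n_dk = np.bincount(z, minlength=K).astype(float)   # the query's own counters
    for _ in range(n_iters):
        for i, w in enumerate(query_words):
            n_dk[z[i]] -= 1                            # remove current assignment
            # full conditional in the style of Heinrich's Eq. 84
            p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk + alpha)
            z[i] = np.random.choice(K, p=p / p.sum())
            n_dk[z[i]] += 1
    # Eq. 85-style estimate of the query's document-topic vector
    return (n_dk + alpha) / (len(query_words) + K * alpha)

Since you are working in Java, the same loop translates directly; the only state you need from training is the topic-word count matrix and the per-topic totals.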

2D non-polynomial function fitting from the command line

I just wrote a simple Unix command line utility that could be implemented a lot more efficiently. I can measure its performance by running it on a number of inputs and measuring the time it takes. This produces a set of pairs of numbers (s, t), where s is the input size and t the processing time. In order to determine the performance characteristics of my utility, I need to fit a function through these data points. I can do this manually, but I prefer to be lazy and let a utility do it for me.
Does such a utility exist?
Its input is a sequence of pairs of numbers.
Its output is a formula that expresses how the second number depends on the first, plus an error measure.
One step along the way would be a utility that does this just for polynomials.
This has been discussed here but it didn't produce a ready-to-use solution.
The next step is to extend the utility to try non-polynomial terms: negative-degree terms (as in y = 1/x) and logarithmic terms (as in y = x log x) need to be tried as well. One idea for coping with the non-polynomial terms is simply to surround the polynomial fitting with x and y scale transformations; I don't know whether that will do. This question is related but not exactly the same.
As I said, I'm lazy: I'm not looking for ideas on how to write this myself; I'm looking for the reliable result of a project that has already done it for me. Any suggestions?
I believe that SAS has this, RS/1 has this, and I think Mathematica has this; Excel and most spreadsheets have a primitive form of this, and add-ons are usually available for more advanced forms. Lots of lab analysis and statistical analysis tools have features like this.
Re: command line tools:
SAS, RS/1, and Minitab were all command line tools 20 years ago when I used them. I bet at least one of them still has this capability.
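If you are willing to count a short script as a utility, here is a minimal sketch of the candidate-model idea with SciPy (the candidate set and the RMSE error measure are my own choices, not a standard): it reads "s t" pairs from stdin, fits each hypothesized growth law with curve_fit, and reports the error of each fit.

import sys
import numpy as np
from scipy.optimize import curve_fit

# Candidate growth laws, including the non-polynomial terms mentioned above.
candidates = {
    "a*x + b":        lambda x, a, b: a * x + b,
    "a*x*log(x) + b": lambda x, a, b: a * x * np.log(x) + b,
    "a*x**2 + b":     lambda x, a, b: a * x ** 2 + b,
    "a/x + b":        lambda x, a, b: a / x + b,
}

data = np.loadtxt(sys.stdin)   # one "s t" pair per line
s, t = data[:, 0], data[:, 1]

for name, f in candidates.items():
    params, _ = curve_fit(f, s, t)
    rmse = np.sqrt(np.mean((f(s, *params) - t) ** 2))
    print(f"{name}: params={params}, rmse={rmse:.4g}")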