Does catboost support one-hot encoding? - catboost

I have labels that are one-hot encoded. I would like to use them to train and predict with a catboost classifier. However, when I fit, it gives me an error saying that multiple integer values per row are not allowed for the labels. So does catboost not allow one-hot encoded labels? If not, how can I get catboost to work?

I have found a workaround to this problem. There might be a better solution to this problem, which I would love to hear about.
The workaround is to convert the one-hot encoded labels back to categorical (integer class) values. Of course, most of the time we take our categorical values and convert them to one-hot encoding, so just skip that step.
Then set the loss function to 'MultiClass'. This is the only loss function that catboost (and, I think, most gradient boosting packages) supports for multiclass classification.
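For illustration, here is a minimal sketch of that workaround in Python, with dummy data (the array names X and y_onehot are placeholders): convert the one-hot labels back to integer class indices with argmax, then fit with loss_function='MultiClass'.
import numpy as np
from catboost import CatBoostClassifier

# Dummy data for illustration: 100 samples, 5 features, 3 one-hot encoded classes.
X = np.random.rand(100, 5)
y_onehot = np.eye(3)[np.random.randint(0, 3, size=100)]

# Convert the one-hot labels back to a single integer class per row.
y_labels = np.argmax(y_onehot, axis=1)

model = CatBoostClassifier(loss_function='MultiClass', verbose=False)
model.fit(X, y_labels)
preds = model.predict(X)  # predicted class index for each row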

catboost encodes categorical features automatically internally, so there is no need to do it manually.

Related

How do I know which parameters to use with a pretrained Tokenizer?

I must be missing something ...
I want to use a pretrained model with HuggingFace:
transformer_name = "Geotrend/distilbert-base-fr-cased" # Or whatever model
model = AutoModelForSequenceClassification.from_pretrained(transformer_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(transformer_name)
Now that I have my model and my tokenizer, I need to tokenize my dataset, but I don't know which parameters (padding, truncation, max_length) to use with my Tokenizer.
Some examples just call the tokenizer tokenizer(data), others use truncation only tokenizer(data, truncation=True), and others will use many parameters tokenizer(data, padding=True, truncation=True, return_tensors='pt', max_length=512).
As I am reloading a pretrained tokenizer, I would have loved it to use the same parameters as in the original training process. How do I know which parameters to use?
My understanding is that I always need to truncate my data and leave max_length as None so that my sequence lengths never exceed the model's maximum length. Is that it? Does leaving max_length as None make it fall back to the model's maximum length?
And what should I do with padding? As I am using a Trainer object for training with a DataCollatorWithPadding, should I set padding to False to reduce the memory impact and let the collator pad my batches?
Final question: what should I do if I use a TextClassificationPipeline for inference? Should I specify these parameters (padding, etc.)? Will the pipeline handle it for me?
The choice of whether to use padding and truncation depends on the model you are fine-tuning and on your training process, not on the pretrained tokenizer.
Transformer-based models have a constraint on the number of tokens they can process, so generally yes, that's it. Yes, when max_length is None, the maximum acceptable input length for the model is used (see the docs).
Yes, you should not pad the input sequences if you use DataCollatorWithPadding. There is more about it in this video.
As you already noticed, you have to specify them yourself when you pass your input text to the pipeline.
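For illustration, here is a minimal sketch of one common setup (the tiny in-memory dataset and the output directory are placeholders I made up): truncate at tokenization time with max_length left as None, do not pad there, and let DataCollatorWithPadding pad each batch dynamically inside the Trainer.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

transformer_name = "Geotrend/distilbert-base-fr-cased"
model = AutoModelForSequenceClassification.from_pretrained(transformer_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(transformer_name)

# Placeholder dataset; in practice this would be your own labelled texts.
raw = Dataset.from_dict({"text": ["très bon produit", "article décevant"],
                         "label": [4, 0]})

def tokenize(batch):
    # Truncation only: max_length=None falls back to the model's maximum length,
    # and padding is left to the collator below.
    return tokenizer(batch["text"], truncation=True)

tokenized = raw.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=tokenized,
    data_collator=DataCollatorWithPadding(tokenizer),
)
# trainer.train()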

Using binary outcome variable and predictors with geeglm and gee in R?

When I use glm to perform logistic regression, I first read my binary outcome and predictor variables from a CSV file and convert them to factors (they are read in as ints). But if I do that before using gee or geeglm for clustering, geeglm gives me an error message, and although gee doesn't give me an error, its documentation says that you must use continuous predictors. It seems like I get sensible results if I just read in my binary variables and leave them as ints, without converting them to factors. Is this an OK thing to do, and are there any concerns I need to be aware of? I specify binomial, and both are for generalized linear models, so I'm not sure what's so wrong with passing factors.
Relatedly, in the scenario I'm describing, do I need to specify scale.fix=TRUE?
Thanks in advance!

Why are metrics in AllenNLP calculated with tensors? Can I define a metric based on strings?

I'm using AllenNLP in my project, and I'm confused by the Metric: all of the metrics are calculated with tensors, including BLEU and ROUGE. However, sometimes I may want to calculate a metric on strings tokenized with whitespace. The built-in metrics are calculated at the token level, tokenized by BertTokenizer, and this may give a different result because of the difference in tokenization.
Currently I'm converting the tensors to tokens, joining them into a string, and calculating my custom Metric in forward. The code works for now, but I suspect this may not be the right way.
What does not seem to be exactly the AllenNLP way of calculating metrics is computing their values in the forward call. The AllenNLP training loop provides a special blueprint function, get_metrics, for this.
However, if you mean that you update your metrics in forward and reset them in get_metrics, there seems to be no better way to do it.
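For illustration, here is a minimal standalone sketch (no AllenNLP imports; the class name and the whitespace exact-match scoring are illustrative choices, not a built-in metric) of a string-level metric that mirrors the framework's pattern: update it incrementally, then report and optionally reset it via get_metric. In an AllenNLP Model you would update such a metric in forward (after converting tensors back to strings) and return its value from get_metrics.
from typing import List

class WhitespaceTokenMatch:
    """Fraction of whitespace-tokenized prediction tokens that match the reference."""

    def __init__(self) -> None:
        self._correct = 0
        self._total = 0

    def __call__(self, predictions: List[str], references: List[str]) -> None:
        # Compare whitespace-tokenized predictions and references position by position.
        for pred, ref in zip(predictions, references):
            pred_tokens, ref_tokens = pred.split(), ref.split()
            self._correct += sum(p == r for p, r in zip(pred_tokens, ref_tokens))
            self._total += max(len(pred_tokens), len(ref_tokens))

    def get_metric(self, reset: bool = False) -> float:
        score = self._correct / self._total if self._total else 0.0
        if reset:
            self._correct = 0
            self._total = 0
        return score

metric = WhitespaceTokenMatch()
metric(["the cat sat"], ["the cat sat"])
print(metric.get_metric(reset=True))  # 1.0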

sequence to sequence model using pytorch

I have a dataset (sequence to sequence): each sample input is a sequence of characters (a combination of 20 characters, with a maximum length of 2166), and the output is a sequence of characters (a combination of the three characters G, H, B). For example: OIREDSSSRTTT ----> GGGHHHHBHBBB
I would like to build a simple PyTorch model that works on this type of dataset, i.e. a model that can predict a sequence of classes. I would appreciate any suggestions or links to a simple model that does the same.
Thanks
If the output sequence always has the same length as the input sequence, you might want to use a transformer encoder, because it basically transforms the inputs with attention to the context. You can also try anything that is used for tagging: BiLSTM, BiGRU, etc.
If you want your model to be able to predict sequences of a different length (not necessarily the same as the input length), look at encoder-decoder models, such as the vanilla transformer.
You can start with the sequence tagging model from the PyTorch tutorial: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
As @Ilya Fedorov said, you can move to transformer models for potentially better performance.
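For illustration, here is a minimal sketch of a character-level sequence tagger along those lines (embedding + BiLSTM + per-position linear classifier); NUM_CHARS, NUM_TAGS, and the random batch are placeholders, and the inputs are assumed to be already mapped to integer IDs.
import torch
import torch.nn as nn

NUM_CHARS = 20  # size of the input character vocabulary (assumption)
NUM_TAGS = 3    # output classes: G, H, B

class CharTagger(nn.Module):
    def __init__(self, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(NUM_CHARS, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, NUM_TAGS)

    def forward(self, char_ids):      # char_ids: (batch, seq_len)
        x = self.embed(char_ids)      # (batch, seq_len, emb_dim)
        x, _ = self.lstm(x)           # (batch, seq_len, 2 * hidden_dim)
        return self.out(x)            # (batch, seq_len, NUM_TAGS)

# One training step with cross-entropy over every position, on a fake batch.
model = CharTagger()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

char_ids = torch.randint(0, NUM_CHARS, (8, 50))  # fake input batch
tags = torch.randint(0, NUM_TAGS, (8, 50))       # fake per-position labels

logits = model(char_ids)
loss = criterion(logits.reshape(-1, NUM_TAGS), tags.reshape(-1))
loss.backward()
optimizer.step()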

how to predict query topics using word-topic matrix?

I'm implementing LDA using Java. I know how the algorithm works. At the end of training (the given number of iterations), I will get two matrices (topic-word and document-topic) that represent the set of input documents.
My problem is that when I input a new document (a query), I want to use these matrices (or any other way) to get the document-topic vector of that query. How would I do that?
Are you using Variational Inference or Gibbs Sampling?
For Gibbs sampling, a typical approach is to add the new document(s) to the inference and update only their own counters, keeping the counters for the documents you used to learn the model constant.
This is specified in equations 84 and 85 of Parameter Estimation for Text Analysis.
I guess there has to be a similar approach in VI LDA.
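For illustration, here is a minimal sketch of that fold-in approach in Python (the question mentions Java, but the idea is language-agnostic, and the variable names here are my own): keep the trained topic-word counts fixed, resample only the query document's topic assignments, and read off its document-topic vector.
import numpy as np

def infer_query_topics(query_word_ids, topic_word_counts, topic_totals,
                       alpha=0.1, beta=0.01, iters=100, rng=None):
    # topic_word_counts: (K, V) counts learned at training time, kept fixed.
    # topic_totals: (K,) total word count per topic, also kept fixed.
    rng = rng or np.random.default_rng()
    K, V = topic_word_counts.shape
    doc_topic_counts = np.zeros(K)

    # Random initial topic assignment for each token of the query.
    z = rng.integers(0, K, size=len(query_word_ids))
    for k in z:
        doc_topic_counts[k] += 1

    for _ in range(iters):
        for i, w in enumerate(query_word_ids):
            doc_topic_counts[z[i]] -= 1
            # Sampling weights: (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta),
            # updating only the query's own counters n_dk.
            p = (doc_topic_counts + alpha) * \
                (topic_word_counts[:, w] + beta) / (topic_totals + V * beta)
            z[i] = rng.choice(K, p=p / p.sum())
            doc_topic_counts[z[i]] += 1

    # Normalized document-topic vector for the query.
    return (doc_topic_counts + alpha) / (len(query_word_ids) + K * alpha)

# Example with random dummy counts: 5 topics, vocabulary of 50 words.
counts = np.random.randint(1, 10, size=(5, 50)).astype(float)
theta = infer_query_topics([3, 17, 17, 42], counts, counts.sum(axis=1))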