Using binary outcome variable and predictors with geeglm and gee in R?

When I use glm to perform logistic regression, I first read my binary outcome and predictor variables from a CSV file and convert them to factors (they are read in as ints). But if I do that before using gee or geeglm for clustering, geeglm gives me an error message, and although gee doesn't give an error, its documentation says that predictors must be continuous. I seem to get sensible results if I simply read in my binary variables and leave them as ints rather than converting them to factors. Is this an OK thing to do, and are there any concerns I should be aware of? I specify binomial, and both functions are for generalized linear models, so I'm not sure what's wrong with passing factors.
Relatedly, in the scenario I'm describing, do I need to specify scale.fix=TRUE?
Thanks in advance!

Related

Why are metrics in AllenNLP calculated with tensors? Can I define a metric based on strings?

I'm using AllenNLP in my project, and I'm confused by the Metric class: all of the metrics are calculated with tensors, including BLEU and ROUGE. However, sometimes I want to calculate a metric on strings tokenized by whitespace. The built-in metrics operate at the token level as produced by BertTokenizer, which may give a different result because of the difference in tokenization.
Currently I'm converting the tensors to tokens, joining them into a string, and calculating my custom metric in forward. The code works, but I suspect this may not be the right way.
Computing metrics in the forward call does not seem to be exactly the AllenNLP way of doing it. The AllenNLP training loop provides a special blueprint function, get_metrics, for this purpose.
However, if you mean that you update your metrics in forward and reset them in get_metrics, there seems to be no better way to do it.
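For reference, here is a minimal sketch of that pattern, assuming a made-up string-based metric (only the Metric base class and the Model.get_metrics hook are AllenNLP's; the StringExactMatch class and its field names are hypothetical):

from typing import List

from allennlp.training.metrics import Metric


class StringExactMatch(Metric):
    # Hypothetical metric computed on whitespace-tokenized strings.

    def __init__(self) -> None:
        self._correct = 0
        self._total = 0

    def __call__(self, predictions: List[str], gold: List[str]) -> None:
        # Update running counts from detokenized strings.
        for pred, ref in zip(predictions, gold):
            self._correct += int(pred.split() == ref.split())
            self._total += 1

    def get_metric(self, reset: bool = False) -> float:
        value = self._correct / self._total if self._total else 0.0
        if reset:
            self.reset()
        return value

    def reset(self) -> None:
        self._correct = 0
        self._total = 0

In the model you would update the metric in forward after converting the output tensors to strings, e.g. self._exact_match(predicted_strings, gold_strings), and report it in get_metrics with something like return {"string_exact_match": self._exact_match.get_metric(reset)}.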

Standard error of absorbed fixed effect // Run regression with noninteger factor variable

I have a regression that I can run for example as
reghdfe y, a(x1_est=x1 x2_est=x2)
which will store the estimated coefficients in x1_est and x2_est. Now, the issue is that using absorb() does not allow me to get the standard errors for these coefficients. If I understand it correctly, no postestimation method of reghdfe allows me to retrieve those.
Luckily, I only care about the standard errors of x1. So, I could instead run
reghdfe y i.x1, a(x2)
and inspect _se[x1]. Unfortunately, x1 has so many different levels that it is not possible to store it as an integer; it has to be a double. The previous regression therefore fails with "x1: factor variables may not contain noninteger values".
What could be another approach to get standard errors for x1?
With a large number of fixed effects, Stata's default approaches won't work. One angle is to bootstrap the fixed effects and generate standard errors from the bootstrap distribution. Again, the issue is that there are so many fixed effects that standard bootstrapping methods won't work (they cannot return such a large matrix in each replication).
Essentially, to bootstrap the FE, one would (for a large number of iterations)
preserve
bsample
run the regression: reghdfe y, a(x1_est=x1 x2_est=x2)
Store x1_est in a .dta file
restore
After the loop is done, iteratively append all the .dta files, and compute standard errors.

Does catboost support one-hot encoding?

I have labels that are one-hot encoded. I would like to use them to train and predict with a catboost classifier. However, it is giving me an error when I am fitting, saying that multiple integer values are not allowed per row for the labels. So does catboost not allow one-hot encoding for the labels? If not, how can I get catboost to work?
I have found a workaround to this problem. There might be a better solution to this problem, which I would love to hear about.
The workaround is to convert the one-hot encoded labels back to categorical (integer class) values. Of course, most of the time we take categorical values and convert them to a one-hot encoding, so just skip that step.
Then, set the loss function to 'MultiClass'. This is the only loss function that catboost (and, I think, most gradient boosting packages) supports for multiclass classification.
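For example, a minimal sketch of that workaround (the toy data below is made up; CatBoostClassifier, loss_function='MultiClass', and numpy.argmax are the only real library pieces):

import numpy as np
from catboost import CatBoostClassifier

# Toy data: 6 samples, 3 features, 3 classes (made-up numbers for illustration).
X = np.random.rand(6, 3)
y_onehot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# Undo the one-hot encoding: each row becomes a single class index.
y = np.argmax(y_onehot, axis=1)

model = CatBoostClassifier(loss_function="MultiClass", iterations=50, verbose=False)
model.fit(X, y)
preds = model.predict(X)        # class indices, one per sample
probs = model.predict_proba(X)  # one probability column per class

If the rest of your pipeline expects one-hot output, you can re-encode the predicted class indices afterwards.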
catboost does the categorical encoding automatically internally; there is no need to do it manually.

algorithm to solve related equations

I am working on a project to create a generic equation solver... I envision this taking the form of 25-30 equations saved in a table: variable names along with the operators.
I would then call this table to solve any equation with a missing variable, and it would move the operators and other terms to the other side to isolate the missing variable.
e.g. 2x + 3y = z: if x were the missing variable, I would call the equation with values for y and z and it would rearrange it to solve x = (z - 3y)/2.
Equations could be linear, polynomial, binary (yes/no result)...
I am not sure whether there is a lightweight library available for this or whether it needs to be built from scratch... any pointers or guidance will be appreciated.
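To make the intent concrete, here is roughly the rearrangement I mean, sketched with Python's sympy purely for illustration (I have not committed to any particular library):

from sympy import Eq, symbols, solve

x, y, z = symbols("x y z")

# The example equation: 2x + 3y = z
equation = Eq(2 * x + 3 * y, z)

# Solve symbolically for the missing variable x -> [z/2 - 3*y/2]
x_expr = solve(equation, x)
print(x_expr)

# Plug in known values for y and z, e.g. y = 1, z = 7 -> x = 2
print(x_expr[0].subs({y: 1, z: 7}))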
See Maxima.
I rather like it for my symbolic computation needs.
If such a general black-box algorithm could be made accurate, robust and stable, pigs could fly. Solutions can be nonexistent, multiple, parametrized, etc.
Even for linear equations it gets tricky to do it right.
Your best bet is some form of Newton algorithm, but generally you tailor it to your problem at hand.
EDIT: I didn't see that you wanted something symbolic rather than numerical. That's a different can of worms.

2D non-polynomial function fitting from the command line

I just wrote a simple Unix command line utility that could be implemented a lot more efficiently. I can measure its performance by just running it on a number of inputs and measuring the time it takes. This will produce a set of pairs of numbers, s t, where s is the input size and t the processing time. In order to determine the performance characteristics of my utility, I need to fit a function through these data points. I can do this manually, but I prefer to be lazy and let a utility do it for me.
Does such a utility exist?
Its input is a sequence of pairs of numbers.
Its output is a formula that expresses the second number as a function of the first, plus an error measure.
One step along the way is to have a utility that does this just for polynomials.
This has been discussed here but it didn't produce a ready-to-use solution.
The next step is to extend the utility to try non-polynomial terms: negative-degree polynomials (as in y = 1/x) and logarithmic terms (as in y = x log x) will need to be tried as well. One idea to cope with the non-polynomial terms is to just surround the polynomial fitting with x and y scale transformations. I don't know whether that will do. This question is related but not exactly the same.
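To be clear about what trying these terms would mean, here is a rough numpy sketch of fitting a few candidate terms by least squares (the timing data and the candidate set are made up; this is exactly the kind of thing I would like an existing utility to do for me):

import numpy as np

# Made-up timing data: input size s and processing time t.
s = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
t = np.array([0.8, 1.9, 4.5, 10.2, 23.0, 50.5])

# Candidate models to try; each column of X is one term.
candidates = {
    "t ~ a*s + b": np.column_stack([s, np.ones_like(s)]),
    "t ~ a*s*log(s) + b": np.column_stack([s * np.log(s), np.ones_like(s)]),
    "t ~ a*s^2 + b": np.column_stack([s ** 2, np.ones_like(s)]),
    "t ~ a/s + b": np.column_stack([1.0 / s, np.ones_like(s)]),
}

# Least-squares fit for each candidate; the residual serves as the error measure.
for label, X in candidates.items():
    coef, residual, *_ = np.linalg.lstsq(X, t, rcond=None)
    err = residual[0] if residual.size else 0.0
    print(f"{label}: coefficients={coef}, residual={err:.3g}")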
As I said, I'm lazy: I'm not looking for ideas on how to to write this myself, I'm looking for a reliable result of a project that has already done it for me. Any suggestions?
I believe SAS has this, RS/1 has this, I think Mathematica has this, and Excel and most spreadsheets have a primitive form of this, with add-ons usually available for more advanced forms. There are lots of lab analysis and statistical analysis tools that have features like this.
Re: command-line tools:
SAS, RS/1 and Minitab were all command line tools 20 years ago when I used them. I bet at least one of them still has this capability.