Standard error of absorbed fixed effects // Running a regression with a noninteger factor variable

I have a regression that I can run for example as
reghdfe y, a(x1_est=x1 x2_est=x2)
which will store the estimated coefficients in x1_est and x2_est. Now, the issue is that using absorb() does not allow me to get the standard errors for these coefficients. If I understand it correctly, no postestimation method of reghdfe allows me to retrieve those.
Luckily, I only care about the standard errors of x1. So, I could instead run
reghdfe y i.x1, a(x2)
and inspect _se[x1]. Unfortunately, x1 has so many distinct values that it cannot be stored as an integer; it has to be a double. The previous regression therefore fails with x1: factor variables may not contain noninteger values.
What could be another approach to get standard errors for x1?

With a large number of fixed effects, Stata's default approaches won't work. One angle is to bootstrap the fixed effects and compute standard errors from the bootstrap distribution. Again, the issue is that there are so many fixed effects that standard bootstrapping methods fail (they cannot return such a large coefficient matrix in each replication).
Essentially, to bootstrap the FE, one would, for a large number of iterations, resample, re-estimate, and store the fixed-effect estimates:

forvalues i = 1/200 {
    preserve
    bsample
    quietly reghdfe y, a(x1_est=x1 x2_est=x2)
    keep x1 x1_est
    duplicates drop
    save fe_draw_`i'.dta, replace
    restore
}
After the loop is done, append all the .dta files and compute, for each level of x1, the standard deviation of x1_est across replications: that standard deviation is the bootstrap standard error of the fixed effect.
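If Python is handy, the appending and the standard-error computation can also be done there. A minimal post-processing sketch with pandas, assuming the fe_draw_*.dta files written by the loop above:

import glob
import pandas as pd

# Stack the fixed-effect draws from all bootstrap replications.
draws = pd.concat(
    (pd.read_stata(f) for f in sorted(glob.glob("fe_draw_*.dta"))),
    ignore_index=True,
)
# Bootstrap SE of each fixed effect = standard deviation of its
# estimates across replications.
se = draws.groupby("x1")["x1_est"].std(ddof=1)
print(se.describe())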


LSTM Evolution Forecast

I am confused about how LSTM networks work when forecasting over a horizon that is not fixed: rather, I am looking for a prediction at an arbitrary time in the future. In physical terms, I would call it the evolution of the system.
Suppose I have a time series $y(t)$ (output) I want to forecast, and some external inputs $u_1(t), u_2(t),\cdots u_N(t)$ on which the series $y(t)$ depends.
It's common to use the lagged value of the output $y(t)$ as an input for the network, so that schematically I have something like this (for simplicity, consider just lag 1 for the output and no lags for the external inputs):
$[y(t-1), u_1(t), u_2(t), \cdots, u_N(t)] \to y(t)$
With this setup, recursive forecasting forces one to use the predicted value from the previous step as input to the next step. This propagates errors forward and makes the long-term forecast behave badly.
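To make this concrete, here is a minimal sketch of such a recursive forecaster; the one-step model is a stand-in linear layer and the function name is mine:

import torch

def recursive_forecast(model, y_last, u_future):
    # Feed each prediction back in as the lagged output of the next step.
    preds, y_prev = [], y_last
    for u_t in u_future:                      # one row of exogenous inputs per step
        x_t = torch.cat([y_prev.view(1), u_t.view(-1)])
        y_prev = model(x_t).view(1)           # any error here feeds the next step
        preds.append(y_prev)
    return torch.cat(preds)

model = torch.nn.Linear(1 + 3, 1)             # [y(t-1), u_1..u_3] -> y(t)
forecast = recursive_forecast(model, torch.zeros(1), torch.randn(10, 3))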
Now, my confusion: I think of an RNN as a kind of (simple) implementation of a state space model, where I have the inputs, my output, and one or more state variables responsible for the memory of the system. These state variables are hidden and not observed.
So now the question: if there is this kind of variable already accounting for previous states of the system, why would I need to use the lagged output value as an input to my network/model?
If I get rid of it, would my long-term forecast be better, since I would no longer expect the error of the forecasted output to propagate? (I guess an error in the internal state will propagate anyway.)
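A minimal sketch of that alternative: an LSTM driven only by the exogenous inputs, with the hidden state playing the role of the unobserved state variables (the class name is mine):

import torch
import torch.nn as nn

class StateSpaceLSTM(nn.Module):
    def __init__(self, n_inputs, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, u):          # u: (batch, time, n_inputs); no lagged y
        h, _ = self.lstm(u)        # the hidden state carries the memory
        return self.head(h)        # y_hat(t) for every t

net = StateSpaceLSTM(n_inputs=3)
y_hat = net(torch.randn(2, 50, 3))   # -> (2, 50, 1)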
Thanks !
Please see DeepAR, an LSTM-based forecaster that predicts more than one step into the future.
The main contributions of the paper are twofold: (1) we propose an RNN architecture for probabilistic forecasting, incorporating a negative binomial likelihood for count data as well as special treatment for the case when the magnitudes of the time series vary widely; (2) we demonstrate empirically on several real-world data sets that this model produces accurate probabilistic forecasts across a range of input characteristics, thus showing that modern deep learning-based approaches can effectively address the probabilistic forecasting problem, which is in contrast to common belief in the field and the mixed results [...]
In this paper, they forecast multiple steps into the future, precisely to counteract the error propagation you describe.
Skipping several steps allows one to get more accurate predictions further into the future.
One more thing done in this paper is predicting percentiles and interpolating, rather than predicting the value directly. This adds stability and provides an error assessment.
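Percentile prediction of this kind is usually trained with the quantile ("pinball") loss. A minimal sketch, not code from the paper:

import torch

def pinball_loss(pred, target, q):
    # Penalize under-prediction by q and over-prediction by (1 - q).
    diff = target - pred
    return torch.mean(torch.maximum(q * diff, (q - 1) * diff))

loss = pinball_loss(torch.randn(8), torch.randn(8), q=0.9)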
Disclaimer - I read an older version of this paper.

Why the W_q matrix in torch.nn.MultiheadAttention is square

I am trying to implement nn.MultiheadAttention in my network. According to the docs,
embed_dim  – total dimension of the model.
However, according to the source file,
embed_dim must be divisible by num_heads
and
self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
If I understand properly, this means each head takes only a part of the features of each query, since the matrix is square. Is this a bug in the implementation, or is my understanding wrong?
Each head uses a different part of the projected query vector. You can imagine it as if the query gets split into num_heads vectors that are independently used to compute the scaled dot-product attention. So, each head operates on a different linear combination of the features in queries (and keys and values, too). This linear projection is done using the self.q_proj_weight matrix and the projected queries are passed to F.multi_head_attention_forward function.
In F.multi_head_attention_forward, it is implemented by reshaping and transposing the query vector, so that the independent attentions for individual heads can be computed efficiently by matrix multiplication.
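A minimal sketch of that splitting with small illustrative dimensions (not the actual PyTorch source):

import torch

batch, seq_len, embed_dim, num_heads = 2, 5, 8, 2
head_dim = embed_dim // num_heads            # each head sees embed_dim / num_heads features

q = torch.randn(batch, seq_len, embed_dim)
W_q = torch.randn(embed_dim, embed_dim)      # the square q_proj_weight

q_proj = q @ W_q.T                           # one projection serves all heads
q_heads = q_proj.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
print(q_heads.shape)                         # torch.Size([2, 2, 5, 4])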
The attention head size is a design decision of PyTorch. In theory, you could have a different head size, in which case the projection matrix would have shape embed_dim × (num_heads * head_dim). Some implementations of transformers (such as the C++-based Marian for machine translation, or Hugging Face's Transformers) allow that.

Faster-RCNN bbox/image normalization

I am playing with py-faster-rcnn on a custom dataset (about 3000 images, 7 different classes, including the background), and following these tutorials:
https://github.com/zeyuanxy/fast-rcnn/blob/master/help/train/README.md (Fast-RCNN tutorial)
https://github.com/deboc/py-faster-rcnn/tree/master/help (Faster-RCNN tutorial)
I am using the end2end solution with VGG16 network.
Everything works fine except my results, so I have some questions:
What kind of normalizations are needed on the images and on the bbox annotations?
It is similar to the previous question: There are two config options: BBOX_NORMALIZE_TARGETS and BBOX_NORMALIZE_TARGETS_PRECOMPUTED. Should I calculate the mean and std before the training and use these options for bbox normalization?
I modified the num_output of the cls_score and bbox_pred layers (according to this thread: https://github.com/rbgirshick/py-faster-rcnn/issues/1), but in the end2end solution there are rpn_cls_score and rpn_bbox_pred layers too. Should I modify their num_output too? If so, how do I calculate the number of outputs for 7 classes?
No, you do not need to pre-compute anything. lib/roi_data_layer/roidb.py computes the mean and standard deviation for your dataset if you set BBOX_NORMALIZE_TARGETS_PRECOMPUTED to False; otherwise it uses the default values specified in lib/fast_rcnn/config.py. The RPN is agnostic to the number of classes: it only treats regions that contain any object as positive and everything else as negative, so the RPN layers do not need to change.
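As for the num_output values, the convention from the py-faster-rcnn thread linked in the question works out as follows for 7 classes; this is a sketch of the arithmetic, not code from the repository:

num_classes = 7                          # 6 object classes + background
cls_score_num_output = num_classes       # one score per class -> 7
bbox_pred_num_output = 4 * num_classes   # 4 box-regression targets per class -> 28
# rpn_cls_score and rpn_bbox_pred are class-agnostic (object vs. not object),
# so their num_output values stay as in the original prototxt.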

Mapping Nonlinear Functions By Using Artificial Neural Network

I am dealing with a hard assignment and I don't know where to start. What is the way to solve the following problem? Any help would be appreciated.
f(x)=1/x and x is between 0.1 and 1
The problem asks to train the network using the backpropagation algorithm with one hidden layer.
The training set will have 200 input/output patterns, the test set 100, and the validation set 50.
How can I solve this? Regards.
That sounds much more complicated than it actually is. The network does not know anything about what you actually want to represent with the input and output patterns, so do not worry about that. All you need to do is set up such a network (I assume you know how to do that; otherwise just look around, there are a couple of libraries, and it is even possible to set one up quickly in Excel for testing purposes).
Then just run the training data through the network in a loop. Once the network is more or less stable, store it and start testing.
I assume the representation of the patterns has been defined already? It is one of the most important points, because it determines the quality. The closer the x/y pairs are semantically, the closer their representations have to be - meaning here the delta between x/y pairs, in particular for the small-x/large-y pairs!
Otherwise the network will not "understand" it and you can train forever, since there is no correct representation of the similarity - in this case the delta in x and the delta in y.
For example, the value 7 in binary format is not close at all to the value 8. So if the network never "learned" an 8 because it has never seen one, it will not work well.
So the closer the values, the more similar their representations should be for the network - that's the key.
Tweaking the parameters will then fine-tune your model.
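For concreteness, here is a minimal PyTorch sketch of the whole exercise; the architecture and hyperparameters are my own choices, not part of the assignment. One hidden layer is trained by backpropagation to approximate f(x) = 1/x on [0.1, 1], with the 200/100/50 set sizes from the question:

import torch
import torch.nn as nn

def make_set(n):
    x = 0.1 + 0.9 * torch.rand(n, 1)    # uniform on [0.1, 1)
    return x, 1.0 / x

x_train, y_train = make_set(200)
x_test, y_test = make_set(100)
x_val, y_val = make_set(50)             # for picking the architecture / early stopping

model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()

print("test MSE:", loss_fn(model(x_test), y_test).item())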

2D non-polynomial function fitting from the command line

I just wrote a simple Unix command line utility that could be implemented a lot more efficiently. I can measure its performance by just running it on a number of inputs and measuring the time it takes. This will produce a set of pairs of numbers, s t, where s is the input size and t the processing time. In order to determine the performance characteristics of my utility, I need to fit a function through these data points. I can do this manually, but I prefer to be lazy and let a utility do it for me.
Does such a utility exist?
Its input is a sequence of pairs of numbers.
Its output is a formula that expresses how the second number depends as a function on the first, plus an error measure.
One step along the way would be a utility that does this just for polynomials.
This has been discussed here but it didn't produce a ready-to-use solution.
The next step is to extend the utility to try non-polynomial terms: negative-degree polynomials (as in y = 1/x) and logarithmic terms (as in y = x log x) will need to be tried as well. One idea to cope with the non-polynomial terms is to just surround the polynomial fitting with x and y scale transformations. I don't know whether that will do. This question is related but not exactly the same.
As I said, I'm lazy: I'm not looking for ideas on how to write this myself; I'm looking for a reliable result of a project that has already done it for me. Any suggestions?
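Not a ready-made command-line utility, but the idea can be sketched in a few lines of Python with NumPy/SciPy: fit several candidate complexity terms to the (s, t) pairs and report each one's residual error. The timing data below is made up for illustration:

import numpy as np
from scipy.optimize import curve_fit

s = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])  # input sizes
t = np.array([0.011, 0.024, 0.052, 0.115, 0.250])   # measured times

candidates = {
    "a*n": lambda n, a: a * n,
    "a*n*log(n)": lambda n, a: a * n * np.log(n),
    "a*n**2": lambda n, a: a * n ** 2,
}

for name, f in candidates.items():
    popt, _ = curve_fit(f, s, t)
    rss = np.sum((t - f(s, *popt)) ** 2)             # residual sum of squares
    print(f"{name}: a = {popt[0]:.3e}, residual = {rss:.3e}")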
I believe SAS has this, RS/1 has this, and I think Mathematica has it too. Excel and most spreadsheets have a primitive form of this, and there are usually add-ons available for more advanced forms. There are lots of lab-analysis and statistical-analysis tools that have stuff like this.
Re: command line tools:
SAS, RS/1 and Minitab were all command line tools 20 years ago when I used them. I bet at least one of them still has this capability.