Weak instruments Stata results - regression

I'm using the command ivregress 2sls with clusters (each cluster is a school) and with pweights.
I have one endogenous variable, x1, and 4 instruments. I'm trying to test my model and check that my instruments are not weak.
I used the estat firststage command and I'm not sure how to interpret the result.

You might want to try the user-written command weakivtest, which is available from SSC (ssc install weakivtest) and is an improvement on estat firststage. If your first-stage F statistic rejects the null at the common significance levels (or exceeds the critical values reported by weakivtest), your instruments are strong.
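A minimal sketch of the workflow (the outcome y, instruments z1-z4, and weight variable w are placeholder names; adapt them to your data):

* install once from SSC
ssc install weakivtest
* 2SLS with pweights and school-level clustering
ivregress 2sls y (x1 = z1 z2 z3 z4) [pweight=w], vce(cluster school)
* first-stage diagnostics, then the effective F test with its critical values
estat firststage
weakivtest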

Is it possible to specify "episodes_this_iter" with the ray Tune search algorithm?

I'm new to programming/Ray and have a simple question about which parameters can be specified when using Ray Tune. In particular, the Ray Tune documentation says that all of the auto-filled fields (steps_this_iter, episodes_this_iter, etc.) can be used as stopping conditions or in the Scheduler/Search Algorithm specification.
However, the following only works once I remove the "episodes_this_iter" specification. Does this work only as part of the stopping criteria?
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer  # imports assumed by the snippet

import qsdm  # my own environment module

ray.init()
tune.run(
    PPOTrainer,
    stop={"training_iteration": 1000},
    config={
        "env": qsdm.QSDEnv,
        "env_config": defaultconfig,
        "num_gpus": 0,
        "num_workers": 1,
        "lr": tune.grid_search([0.00005, 0.00001, 0.0001]),
        "episodes_this_iter": 2500,  # removing this line makes it work
    },
)
tune.run() is what fills in those auto-filled fields so that they can be used elsewhere; the stopping criterion is just one of the places where they can be read.
To see why the example doesn't work, consider a simpler analogue: trying to fix episodes_total: 100 in the config.
The trainer itself increments the episode count so that the rest of the system knows how far along it is; these fields are outputs, and trying to set or fix one to a particular value has no effect. The same reasoning applies to the other fields in the list.
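Using such a field as a stopping condition, by contrast, works, because there Tune only reads it. A minimal sketch (CartPole-v0 is just a stand-in environment for illustration):

import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
# stop once the trainer has reported 100 episodes in total;
# episodes_total is read from each result, never written by us
tune.run(
    PPOTrainer,
    config={"env": "CartPole-v0", "num_workers": 1},
    stop={"episodes_total": 100},
)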
As for schedulers and search algorithms, I have no experience with them, but what we would want to do is put those conditions inside the scheduler or search algorithm itself, not in the trainer directly.
Here's an example with Bayesian optimisation search, although I don't know what it would mean to do this:
from ray.tune.suggest.bayesopt import BayesOptSearch

tune.run(
    # ...
    # 10 trials
    num_samples=10,
    search_alg=BayesOptSearch(
        # look for learning rates within this (lower, upper) range:
        {'lr': (0.00001, 0.1)},
        # optimise for this metric:
        metric='episodes_this_iter',  # <------- auto-filled field here
        mode='max',
        utility_kwargs={
            'kind': 'ucb',
            'kappa': 2.5,  # a float, not a string
            'xi': 0.0
        }
    )
)

Ray Tune: How do schedulers and search algorithms interact?

It seems to me that the natural way to integrate HyperBand with a Bayesian optimization search is to have the search algorithm determine each bracket and have the HyperBand scheduler run the bracket; that is to say, the Bayesian optimization search runs only once per bracket. Looking at Tune's source code for this, it's not clear to me whether the Tune library applies this strategy or not.
In particular, I want to know how the Tune library handles the handoff between the search algorithm and the trial scheduler. For instance, how does this work if I call SkOptSearch and AsyncHyperBandScheduler (or HyperBandScheduler) together as follows:
sk_search = SkOptSearch(
    optimizer,
    ['group', 'dimensions', 'normalize', 'sampling_weights',
     'batch_size', 'lr_adam', 'loss_weight'],
    max_concurrent=4,
    reward_attr="neg_loss",
    points_to_evaluate=current_params)

hyperband = AsyncHyperBandScheduler(
    time_attr="training_iteration",
    reward_attr="neg_loss",
    max_t=50,
    grace_period=5,
    reduction_factor=2,
    brackets=5)

run(Trainable_Dense,
    name='hp_search_0',
    stop={"training_iteration": 9999,
          "neg_loss": -0.2},
    num_samples=75,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    local_dir='./tune_save',
    checkpoint_freq=5,
    search_alg=sk_search,
    scheduler=hyperband,
    verbose=2,
    resume=False,
    reuse_actors=True)
Based on the source code linked above and the source code here, it seems to me that sk_search would return groups of up to 4 trials at a time, but hyperband should be querying the sk_search algorithm for N_sizeofbracket trials at a time.
There is now a Bayesian Optimization HyperBand (BOHB) implementation in Tune: https://ray.readthedocs.io/en/latest/tune-searchalg.html#bohb.
For standard search algorithms and schedulers, the search algorithm currently only sees a trial's result once that trial has completed.
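Wired up along the lines of the linked docs, it looks roughly like this; a sketch only (the lr_adam range is invented, Trainable_Dense and neg_loss are taken from the question, and exact signatures may differ across Tune versions):

import ConfigSpace as CS
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

# BOHB searches a ConfigSpace definition rather than a plain dict
config_space = CS.ConfigurationSpace()
config_space.add_hyperparameter(
    CS.UniformFloatHyperparameter("lr_adam", lower=1e-5, upper=1e-1))

bohb_search = TuneBOHB(
    config_space, max_concurrent=4, metric="neg_loss", mode="max")
bohb_hyperband = HyperBandForBOHB(
    time_attr="training_iteration", metric="neg_loss", mode="max", max_t=50)

tune.run(Trainable_Dense,
         search_alg=bohb_search,
         scheduler=bohb_hyperband,
         num_samples=75)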

Trying to reproduce mirbase results locally with BLAST

I am trying to reproduce locally on my computer what I get when running mirbase on their website using BLAST. The 'search sequences' option is mature miRNAs, which I downloaded to my computer and turned into a BLAST database with the command:
./makeblastdb -in /home/marianoavino/Downloads/mature.fa -dbtype 'nucl' -out /home/marianoavino/Downloads/mature
On mirbase I see they use an e-value of 10, which I keep locally as well.
On mirbase, at the end of the analysis, they give you these parameter settings:
Search parameters
Search algorithm: BLASTN
Sequence database: mature
Evalue cutoff: 10
Max alignments: 100
Word size: 4
Match score: +5
Mismatch penalty: -4
and this is the command line I use on my computer for BLAST:
./blastn -db /home/marianoavino/Downloads/mature -evalue 10 -word_size 4 -query /home/marianoavino/Downloads/testinputblast.fasta -task "blastn" -out /home/marianoavino/Downloads/testBLast.out
The results of the two analyses are different, with mirbase finding much more than local BLAST.
Do you have any idea which parameters I should use on the local BLAST command line to match the mirbase parameters listed above and get the same answer?
There can be lots of reasons for different results, including the version of BLAST you are using versus the one they used, the parameters (as you said), and differences in the databases (remember, database size enters the calculation of things like the e-value, so you may end up with different results).
Exact replication of results may be difficult, but the question is: are the differences meaningful? Just because an alignment has some e-value (and 10 is an unusually high cutoff) does not mean it is meaningful. For a given sequence, if the searches yield different numbers of alignments but the same number of high-quality alignments (high bitscore, low e-value, full alignment between query and subject), does it matter?
I would compare the results to see where the differences are, then move forward.
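For what it's worth, the mirbase settings listed above map onto BLAST+ flags roughly like this; -reward/-penalty set the match/mismatch scores and -num_alignments the maximum alignments. Even so, mirbase may post-process its hits, so exact agreement isn't guaranteed:

./blastn -task blastn -db /home/marianoavino/Downloads/mature \
    -query /home/marianoavino/Downloads/testinputblast.fasta \
    -evalue 10 -word_size 4 -reward 5 -penalty -4 \
    -num_alignments 100 -out /home/marianoavino/Downloads/testBLast.out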

In Stata, how can I save estimates and std errors while running multiple regressions?

I am using quite a big data set and would like to estimate the Fama French coefficients for two event windows for each ID. I am using the following code (dummy_reg allocates the observations to the respective event window):
sort ID dummy_reg count
by ID dummy_reg: reg ret_px Mkt_RF SMB HML
Furthermore, I would like to use the coefficients to compute deltas between the event windows; however, I don't know how to include saving/generating new variables in the estimation process.
I have tried the following but it didn't work:
by ID dummy_reg: reg ret_px Mkt_RF SMB HML & egen b_Mkt_RF=_b[Mkt_RF] & egen b_SMB=_b[SMB] & egen b_HML=_b[HML]
In all forums supporting questions on software problems:
Reproducible problems are strongly preferred. You don't give us your dataset or use a publicly available dataset to illustrate the problem. In this case the lack of a reproducible example does not bite hard, as the errors can be identified anyway, but in other problems it can be crucial, so please note this for any future questions.
A report such as "didn't work" is regarded as maximally uninformative. Naturally, you don't understand what is happening, but you should always report exactly what happened, e.g. what error message was produced, which you can just copy and paste.
The single command
by ID dummy_reg: reg ret_px Mkt_RF SMB HML & egen b_Mkt_RF=_b[Mkt_RF] & egen b_SMB=_b[SMB] & egen b_HML=_b[HML]
is syntactically incorrect and/or not what you want in several different senses. It could only arise from some very wild guesses.
1. Your by: prefix is legal given your prior sort, but the regress command following it would run through the distinct groups so defined, and only the last set of regression results would be available in memory afterwards.
2. The logical operator & is for combining numerical arguments into expressions to be evaluated for truth or falsity; there is no sense in which it combines commands to be followed one after the other.
3. The egen calls following would all be quite illegal, as they contain no egen function call.
4. Even if they were legal under #3, the second time each egen command was invoked under by: there would be a problem, as the variable being named would already exist.
5. Even if they were correct under #3 and #4, there would be a problem for you, as at the end of the code the variables so created could only contain the last set of coefficient estimates. Problem #3 could be fixed by using generate rather than egen, with different code too, but problems #4 and #5 would remain.
Fortunately, there is a simple way out of all this. You need the statsby command to save regress results. If you want the coefficient estimates alongside the original dataset, use merge.
Your question names standard errors [of what?] but nothing in your code accesses standard errors; nevertheless they can also be saved using statsby.
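A minimal sketch, assuming your master dataset is saved as maindata.dta (a placeholder name) so it can be reloaded for the merge:

* one observation per ID x dummy_reg group, with coefficients and std. errors
statsby b_Mkt_RF=_b[Mkt_RF] b_SMB=_b[SMB] b_HML=_b[HML] ///
    se_Mkt_RF=_se[Mkt_RF] se_SMB=_se[SMB] se_HML=_se[HML], ///
    by(ID dummy_reg) saving(fits, replace): ///
    regress ret_px Mkt_RF SMB HML

* bring the estimates back alongside the original data
use maindata, clear
merge m:1 ID dummy_reg using fits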
I have focused here on the one piece of code you present. The rest of the question refers to economics details I don't understand. Like almost everyone on Stack Overflow, I am not an economist.

2D non-polynomial function fitting from the command line

I just wrote a simple Unix command line utility that could be implemented a lot more efficiently. I can measure its performance by just running it on a number of inputs and measuring the time it takes. This will produce a set of pairs of numbers, s t, where s is the input size and t the processing time. In order to determine the performance characteristics of my utility, I need to fit a function through these data points. I can do this manually, but I prefer to be lazy and let a utility do it for me.
Does such a utility exist?
Its input is a sequence of pairs of numbers.
Its output is a formula that expresses the second number as a function of the first, plus an error measure.
One step along the way is a utility that does this just for polynomials.
This has been discussed here but it didn't produce a ready-to-use solution.
The next step is to extend the utility to try non-polynomial terms: negative-degree polynomials (as in y = 1/x) and logarithmic terms (as in y = x log x) will need to be tried as well. One idea to cope with the non-polynomial terms is to just surround the polynomial fitting with x and y scale transformations. I don't know whether that will do. This question is related but not exactly the same.
As I said, I'm lazy: I'm not looking for ideas on how to write this myself, I'm looking for a reliable result of a project that has already done it for me. Any suggestions?
I believe that SAS has this, RS/1 has this, I think that Mathematica has this, and Excel and most spreadsheets have a primitive form of this, with add-ons usually available for more advanced forms. There are lots of lab-analysis and statistical-analysis tools that have features like this.
Re: command line tools:
SAS, RS/1 and Minitab were all command line tools 20 years ago when I used them. I bet at least one of them still has this capability.
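If no off-the-shelf tool pans out, the core of what the question asks for (a least-squares fit over a basis that includes non-polynomial terms, plus an error measure) is small enough to script. A sketch in Python with numpy, with an arbitrary, made-up choice of candidate terms:

import sys
import numpy as np

# read "s t" pairs, one per line, from stdin
# (assumes s > 0 so the log terms are defined)
data = np.loadtxt(sys.stdin)
s, t = data[:, 0], data[:, 1]

# candidate basis terms: an arbitrary selection, extend as needed
names = ["1", "log s", "s", "s log s", "s^2"]
columns = [np.ones_like(s), np.log(s), s, s * np.log(s), s ** 2]
A = np.column_stack(columns)

# ordinary least squares; the residual serves as the error measure
coef, residual, rank, _ = np.linalg.lstsq(A, t, rcond=None)
terms = " + ".join("%.4g*(%s)" % (c, n) for c, n in zip(coef, names))
print("t =", terms)
print("sum of squared residuals:",
      residual[0] if residual.size else float("nan"))

Running the fit over several such bases and keeping the one with the lowest residual would approximate the "try non-polynomial terms" step the question describes.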