I am estimating how exports depend on GDP and human capital. With a linear specification, I get this:
Dependent Variable: EXPORTS
Method: Least Squares
Date: 05/23/15 Time: 18:20
Sample: 1960 2011
Included observations: 52
Variable             Coefficient   Std. Error   t-Statistic   Prob.
C                    2.63E+10      1.38E+10     1.911506      0.0618
HC                   -1.36E+10     6.08E+09     -2.233089     0.0301
GDP                  2903680.      192313.2     15.09870      0.0000

R-squared            0.967407      Mean dependent var      1.90E+10
Adjusted R-squared   0.966076      S.D. dependent var      2.22E+10
S.E. of regression   4.08E+09      Akaike info criterion   47.15324
Sum squared resid    8.16E+20      Schwarz criterion       47.26581
Log likelihood       -1222.984     Hannan-Quinn criter.    47.19640
F-statistic          727.1844      Durbin-Watson stat      0.745562
Prob(F-statistic)    0.000000
The sign of the HC coefficient is negative, which contradicts the theory. I have tried logarithmic and exponential forms, but the HC coefficient is still negative.
How can I estimate this relationship correctly?
Thank you in advance.
here is my data
Durbin-Watson stat 0.745562
A Durbin-Watson statistic this far below 2 indicates positive autocorrelation in the residuals of your model.
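As a rough illustration of what the statistic measures, here is a minimal Python sketch. Python, the function names, and the simulated residuals are my assumptions; the original regression was run in EViews. It computes the Durbin-Watson statistic from a residual series and applies the common approximation rho ≈ 1 - DW/2 to the reported value.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared first differences divided by the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def implied_rho(dw):
    """Approximate first-order autocorrelation implied by a DW statistic."""
    return 1.0 - dw / 2.0

# Simulated AR(1) residuals (illustrative only, not the poster's data).
rng = np.random.default_rng(0)
e = np.zeros(500)
for t in range(1, 500):
    e[t] = 0.6 * e[t - 1] + rng.normal()

dw_sim = durbin_watson(e)
rho_reported = implied_rho(0.745562)  # DW value from the regression output

print(f"simulated DW: {dw_sim:.3f}")
print(f"rho implied by the reported DW: {rho_reported:.3f}")
```

A value near 2 would correspond to rho near 0; the reported 0.745562 implies a first-order autocorrelation of roughly 0.63.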
I'm trying to get sentiment labels for comments using a pretrained Hugging Face sentiment-analysis model. It fails with the error "Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512)".
My code is below:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')
data.head()
Output is
Review
0 If you've ever been to Disneyland anywhere you...
1 Its been a while since d last time we visit HK...
2 Thanks God it wasn t too hot or too humid wh...
3 HK Disneyland is a great compact park. Unfortu...
4 the location is not in the city, took around 1...
Followed by
classifier("My name is mark")
Output is
[{'label': 'POSITIVE', 'score': 0.9953688383102417}]
Followed by code
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment
Output is
['POSITIVE']
Appending all rows to an empty list:
text = []
for index, row in data.iterrows():
    text.append(row['Review'])
I'm trying to get the sentiment for all the rows:
sent = []
for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)
The error is :
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
2
3 for i in range(len(data)):
----> 4 sentiment = classifier(data.iloc[i,0])
5 sent.append(sentiment)
11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1914 # remove once script supports set_grad_enabled
1915 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1917
1918
IndexError: index out of range in self
Some of the reviews in the Review column of your data frame are too long. Once they are converted to tokens and passed to the model, they exceed the model's 512-token sequence limit: the embeddings of the model used for the sentiment-analysis task were trained for sequences of at most 512 tokens.
To fix this, you can filter out the long reviews and keep only the shorter ones (token length < 512),
or you can truncate the reviews by passing truncation=True:
sentiment = classifier(data.iloc[i,0], truncation=True)
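Applied to the whole column, the same call might be wrapped as in the sketch below. The helper name `label_reviews` and the stand-in `fake_classifier` are hypothetical and exist only so the snippet runs without downloading a model; in practice you would pass the real `classifier` pipeline from the question.

```python
def label_reviews(classifier, reviews):
    """Return one sentiment label per review, truncating over-long inputs."""
    labels = []
    for text in reviews:
        # truncation=True asks the tokenizer to cut the input to the model's limit
        result = classifier(text, truncation=True)
        labels.append(result[0]["label"])
    return labels

def fake_classifier(text, truncation=False):
    # Hypothetical stand-in that mimics the pipeline's output shape.
    return [{"label": "POSITIVE", "score": 1.0}]

print(label_reviews(fake_classifier, ["great park", "word " * 5000]))
```

With the real pipeline this would be called as `label_reviews(classifier, data['Review'])`. Keep in mind that truncation discards everything past the limit, so the label reflects only the start of a long review.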
If you're tokenizing separately from your classification step, this warning can be output during tokenization itself (as opposed to classification).
In my case, I am using a BERT model, so I have MAX_TOKENS=510 (leaving room for the sequence-start and sequence-end tokens).
token = AutoTokenizer.from_pretrained("your model")
tokens = token.tokenize(
text, max_length=MAX_TOKENS, truncation=True
)
Now, when you run your classifier, the tokens are guaranteed not to exceed the maximum length.
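The arithmetic behind MAX_TOKENS=510 can be sketched in plain Python. This is only an illustration of the token budget, assuming BERT-style `[CLS]` and `[SEP]` markers; the real tokenizer does this internally, and `fit_to_model` is a hypothetical name.

```python
MODEL_MAX = 512                 # sequence limit from the error message
CLS, SEP = "[CLS]", "[SEP]"     # BERT-style start and end markers (assumed)
MAX_TOKENS = MODEL_MAX - 2      # leave room for the two special tokens

def fit_to_model(tokens, max_tokens=MAX_TOKENS):
    """Keep at most max_tokens content tokens, then add the special tokens."""
    return [CLS] + list(tokens[:max_tokens]) + [SEP]

long_review = ["word"] * 651    # same length as the sequence in the error
model_input = fit_to_model(long_review)
print(len(model_input))         # 512: 510 content tokens plus 2 special tokens
```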
I want to output the ATE, std error, and p-value from this code:
teffects aipw (dep_var, logit) (treatment pred1 pred2 pred3)
I used this code:
putexcel set "$root/filename.xlsx", sheet("5") modify
putexcel A1=`e(stat)'
but it says "ate not found." Shouldn't the ate be stored automatically in e(stat)?
e(stat) stores the statistic that is estimated as a string, i.e. "ate" or "pomeans". This doesn't contain the actual point estimate.
The coefficients and standard errors can be accessed after any estimation command with the following syntax: _b[coef], _se[coef] or [eqno]_b[coef]/_b[eqno:coef] and [eqno]_se[coef]/_se[eqno:coef] in the case of multiple equation models.
You can specify the coeflegend option to most estimation commands to see how coefficients are named.
Example:
. webuse cattaneo2
(Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154)
. teffects aipw (bweight prenatal1 mmarried mage fbaby) (mbsmoke mmarried c.mage##c.mage fbaby medu, probit), coeflegend
Iteration 0: EE criterion = 4.629e-21
Iteration 1: EE criterion = 1.939e-25
Treatment-effects estimation Number of obs = 4,642
Estimator : augmented IPW
Outcome model : linear by ML
Treatment model: probit
----------------------------------------------------------------------------------------
bweight | Coef. Legend
-----------------------+----------------------------------------------------------------
ATE |
mbsmoke |
(smoker vs nonsmoker) | -230.9892 _b[ATE:r1vs0.mbsmoke]
-----------------------+----------------------------------------------------------------
POmean |
mbsmoke |
nonsmoker | 3403.355 _b[POmean:0.mbsmoke]
----------------------------------------------------------------------------------------
. di _b[ATE:r1vs0.mbsmoke]
-230.9892
. di _se[ATE:r1vs0.mbsmoke]
26.210565
Any other statistic can be obtained from r(table). Type matrix list r(table) after the estimation command to see its contents. For example, to obtain the p-value:
mat A = r(table)
scalar pval = A[4,1]
di pval
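As a cross-check outside Stata, the test statistic and two-sided p-value can be recomputed from the point estimate and standard error displayed above. This is a minimal Python sketch, assuming the reported test is a z-test based on the standard normal distribution; the function name is hypothetical.

```python
import math

def z_and_pvalue(estimate, std_error):
    """Return the z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and avoids cancellation
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# ATE and standard error from the cattaneo2 example above
z, p = z_and_pvalue(-230.9892, 26.210565)
print(f"z = {z:.2f}, p = {p:.3g}")
```

The p-value is far below any conventional threshold, which Stata would display as 0.000.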
I am running the following in Stata:
eststo: ivregress 2sls y (x=z) control [aw=weight], cluster(cluster) first
esttab using file.tex, b(%9.3f) se(%9.3f) r2(%9.8f) replace
This produces a publication-style table for the second stage.
However, how can I do the same for the first stage? I need the coefficients and R^2.
I am fine with using any command for publication-style output - it doesn't need to be esttab.
I tried ivregress2 but it did not work:
_iv_vce_wrk(): 3001 expected 21 arguments but received 20
<istmt>: - function returned error
You just need to run the first stage separately:
webuse hsng2, clear
eststo clear
regress hsngval pcturban faminc i.region
eststo
ivregress 2sls rent pcturban (hsngval = faminc i.region), first
eststo
Which then produces:
esttab, r2(2) mtitles("First Stage" "Second Stage")
--------------------------------------------
                      (1)             (2)
              First Stage    Second Stage
--------------------------------------------
pcturban            182.2          0.0815
                   (1.58)          (0.27)

faminc              2.731***
                   (4.01)

1.region                0
                      (.)

2.region          -5095.0
                  (-1.24)

3.region          -1778.1
                  (-0.44)

4.region          13413.8**
                   (3.31)

hsngval                           0.00224***
                                   (6.82)

_cons            -18671.9           120.7***
                  (-1.56)          (7.93)
--------------------------------------------
N                      50              50
R-sq                 0.69            0.60
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
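The idea of running the first stage as its own regression can be sketched outside Stata as well. The Python example below uses simulated data and hypothetical names (`ols`, `z`, `x`, `y`); it is not the hsng2 example. It fits the first stage by OLS, reports its R-squared, and then uses the fitted values in the second stage.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit; returns coefficients, fitted values and R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, fitted, r2

# Simulated data: z is the instrument, u an unobserved confounder.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)   # endogenous regressor
y = 2.0 * x + u + rng.normal(size=n)   # true effect of x on y is 2.0

const = np.ones(n)

# First stage: regress the endogenous regressor on the instrument.
first_beta, x_hat, first_r2 = ols(np.column_stack([const, z]), x)

# Second stage: replace x with its first-stage fitted values.
second_beta, _, _ = ols(np.column_stack([const, x_hat]), y)

print(f"first-stage coefficient on z: {first_beta[1]:.3f}, R-squared: {first_r2:.3f}")
print(f"second-stage coefficient on x: {second_beta[1]:.3f}")
```

The manual second stage reproduces the 2SLS point estimate, but standard errors computed from it would not be valid, so a dedicated command such as ivregress should still be used for inference.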
I have a question about using a means separation test on interactions. Is it possible to perform Tukey's test on an interaction term? One of the interaction terms in my model is significant, and I want to know which of its levels differ.
For example my model is
ModelB<- lm(RATING.2~Allele%in%QTL11.Source+Family%in%QTL11.Source+REP)
anova(ModelB)
Response: RATING.2
Df Sum Sq Mean Sq F value Pr(>F)
REP 2 1.301 0.6507 0.9266 0.3993
Allele:QTL11.Source 8 105.021 13.1276 18.6941 < 2.2e-16 ***
QTL11.Source:Family 37 68.644 1.8552 2.6419 6.873e-05 ***
Residuals 100 70.223 0.7022
---
TukeysTestB<- HSD.test(ModelB,"Allele%in%QTL11.Source")
TukeysTestB
NULL
Why am I getting NULL output? Is it not possible to test this?
I get sums of squares and mean squares about 10x higher when I use anova on lmerTest::lmer objects than on lme4::lmer objects. See the R log below. Note the warning that when I attach the lmerTest package, the stats::sigma function overrides the lme4::sigma function; I suspect that this is what leads to the discrepancy. In addition, the anova report now says that it is a Type III anova rather than the expected Type I. Is this a bug in the lmerTest package, or is there something about the Kenward-Roger approximation for degrees of freedom that changes the calculation of Sum Sq and Mean Sq and the anova type, and that I don't understand?
I would append the test file, but it is confidential clinical trial information. If necessary I can see if I can cobble up a test case.
Thanks in advance for any advice you all can provide about this.
> library(lme4)
Loading required package: Matrix
Attaching package: ‘lme4’
The following object is masked from ‘package:stats’:
sigma
> test100 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> library(lmerTest)
Attaching package: ‘lmerTest’
The following object is masked from ‘package:lme4’:
lmer
The following object is masked from ‘package:stats’:
step
Warning message:
replacing previous import ‘lme4::sigma’ by ‘stats::sigma’ when loading
‘lmerTest’
> test200 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> anova(test100)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
prepost 1 3.956 3.956 18.4825
lowhi 1 130.647 130.647 610.3836
prepost:lowhi 1 0.038 0.038 0.1758
> anova(test200, ddf = 'Ken')
Analysis of Variance Table of type III with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
prepost 37.15 37.15 1 308.04 18.68 2.094e-05 ***
lowhi 1207.97 1207.97 1 376.43 607.33 < 2.2e-16 ***
prepost:lowhi 0.35 0.35 1 376.43 0.17 0.676
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Update: Thanks, Ben. I did a little code archaeology on lmerTest to see whether I could work out an explanation for the above anomalies. First, it turns out that lmerTest::lmer just submits the model to lme4::lmer and then relabels the result as a "merModLmerTest" object. The only effect of this is to invoke the versions of summary() and anova() from the lmerTest package rather than the usual defaults from base and stats. (These lmerTest functions are compiled, and I have not yet gone further to look at the C++ code.) lmerTest::summary just adds three columns to the base::summary result, giving df, t value, and Pr. Note that lmerTest::anova, by default, computes a type III anova rather than a type I as in stats::anova, which answers my second question above. That is not a great default if one's model includes interactions. One can request a type I, II, or III anova using the type = 1/2/3 option.
However, there are other surprises when using the lmerTest versions of summary and anova, as shown in the R console output below. I used lmerTest's included sleepstudy data so that this code is reproducible.
a. Note that "sleepstudy" has 180 records (with 3 variables)
b. The summaries of fm1 and fm1a are identical except for the added Fixed effects columns. But note that in the lmerTest::summary the ddfs for the intercept and Days are 1371 and 1281 respectively; odd given that there are only 180 records in "sleepstudy."
c. Just as in my original model above, the lme4 and lmerTest versions of anova give very different values of Sum Sq and Mean Sq (30031 and 446.65 respectively).
d. Interestingly, the lmerTest versions of anova using the Satterthwaite and Kenward-Roger estimates of the DenDF are wildly different (5794080 and 28 respectively). The K-R value seems more reasonable.
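One check on point c can be made directly from the numbers printed in the console output further down. The Python sketch below is plain arithmetic on values copied from that output, with variable names of my own choosing; it shows that the lme4 Mean Sq divided by the residual variance reproduces the reported F value, while the lmerTest row implies a different denominator.

```python
# Values copied from the sleepstudy output in this post
resid_var = 654.94         # Residual variance from summary(fm1)
lme4_mean_sq = 30031       # Mean Sq for Days from anova(fm1)
lmertest_mean_sq = 446.65  # Mean Sq for Days from anova(fm1a, ...)
f_value = 45.853           # F value reported by both anova calls

# lme4: Mean Sq / residual variance gives back the F value (about 45.85)
print(lme4_mean_sq / resid_var)

# lmerTest: the denominator implied by Mean Sq / F is not the residual variance
print(lmertest_mean_sq / f_value)
```

This does not explain how lmerTest scales its sums of squares; it only confirms that both tables agree on F while using different scalings.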
Given the above issues, I am reluctant at this point to depend on lmerTest to give reliable p-values. Based on your (Doug Bates's) R-help posting (https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html), I am now using (and recommending) the method from the posting by Dan Mirman (http://mindingthebrain.blogspot.ch/2014/02/three-ways-to-get-parameter-specific-p.html), shown in the final bit of code below, to estimate a naive t-test p-value (assuming essentially infinite degrees of freedom) and a Kenward-Roger estimate of df (using the R package 'pbkrtest', the same package used by lmerTest). I couldn't find R code to compute the Satterthwaite estimates. The naive t-test p-value is clearly anti-conservative, but the KR estimate is thought to be pretty good. If the two give similar estimates of p, then I think one can feel comfortable with a p-value in the range of [naive t-test, KR estimate].
> library(lme4); library(lmerTest); library(pbkrtest);
dim(sleepstudy)
[1] 180 3
>
> fm1 <- lme4::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
> fm1a <- lmerTest::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
>
> summary(fm1)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.405 6.825 36.84
Days 10.467 1.546 6.77
Correlation of Fixed Effects:
(Intr)
Days -0.138
> summary(fm1a)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 251.405 6.825 1371.100 302.06 <2e-16 ***
Days 10.467 1.546 1281.700 55.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
Days -0.138
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML
criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> anova(fm1)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
Days 1 30031 30031 45.853
> anova(fm1a, ddf = 'Sat', type = 1)
Analysis of Variance Table of type I with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 5794080 45.853 1.275e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
> anova(fm1a, ddf = 'Ken', type = 1)
Analysis of Variance Table of type I with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 27.997 45.853 2.359e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> # t.test
> coefs <- data.frame(coef(summary(fm1)))
> coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value)))
> coefs
Estimate Std..Error t.value p.z
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11
>
> # Kenward-Rogers
> df.KR <- get_ddf_Lb(fm1, fixef(fm1))
> df.KR
[1] 25.89366
> coefs$p.KR <- 2 * (1 - pt(abs(coefs$t.value), df.KR))
> coefs
Estimate Std..Error t.value p.z p.KR
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00 0.0000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11 3.5447e-07
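The two p-values for Days in the session above can be cross-checked with only the Python standard library. This is a sketch under my own assumptions: the function names are hypothetical, and Simpson's rule is my substitute for R's pt(), chosen to avoid a SciPy dependency.

```python
import math

def normal_two_sided_p(t):
    """Naive p-value: treat the t value as a standard normal z."""
    return math.erfc(abs(t) / math.sqrt(2.0))

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom (df may be fractional)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_p(t, df, steps=20000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2.0 * total * h / 3.0  # P(-|t| < T < |t|)
    return 1.0 - central_mass

t_days = 6.771485   # t value for Days from coef(summary(fm1))
df_kr = 25.89366    # Kenward-Roger df returned by get_ddf_Lb

p_z = normal_two_sided_p(t_days)
p_kr = t_two_sided_p(t_days, df_kr)
print(f"naive p = {p_z:.4g}, Kenward-Roger p = {p_kr:.4g}")
```

Both values should agree closely with the p.z and p.KR columns printed by R, with the naive p-value being the smaller of the two.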