Remove outliers with large standardized residuals in Stata - regression

I run a simple regression in Stata for two subsamples and afterwards I want to exclude all observations with standardized residuals larger than 3.0. I tried:
regress y x if subsample_criteria==1
gen st_res1=e(rsta)
regress y x if subsample_criteria==0
gen st_res2=e(rsta)
drop if st_res1 | st_res2 > 3.0
However, the new variables are full of missing values; the standardized residuals are not stored in st_res1 and st_res2.
I am grateful for any hints!

The problem with your code is that Stata does not know what e(rsta) is (and neither do I), so it creates a missing value, which Stata treats as a very large positive number. All missings are greater than 3, so your constraint does not bind.
Ignoring the statistical merits of doing this, here's one way:
sysuse auto, clear
reg price mpg
predict ehat, rstandard
reg price mpg if abs(ehat)<3
Note that I am using the absolute value of the residual, which I think makes more sense here.

First, providing an MCVE is always a good first step (and fairly easy given Stata's sysuse and webuse commands). Now, on to the question.
See help regress postestimation and help predict for the proper syntax for generating new variables with residuals, etc. The syntax is a bit different from the gen command, as you will see below.
Note also that your drop if condition is improperly formatted: as written, it is interpreted as drop if st_res1 != 0 | st_res2 > 3.0. (I also assume you want to drop standardized residuals < -3.0 as well; if that is incorrect, you can remove the abs() function below.)
sysuse auto , clear
replace mpg = 10000 in 1/2
replace mpg = 0.0001 in 70
reg mpg weight if foreign
predict rst_for , rstandard
reg mpg weight if !foreign
predict rst_dom , rstandard
drop if abs(rst_for) > 3.0 | abs(rst_dom) > 3.0
Postscript: Note that you may also consider adding if e(sample) to your predict commands, depending on whether you wish to extrapolate the results of the subsample regression to the entire sample and evaluate all residuals, or whether you only wish to drop observations based on in-sample standardized residuals.

Factor variables may not contain noninteger values

I am attempting to fit a regression model in Stata. My variables are all continuous variables of type float.
regress _gdp all_indexn_c 90_days consistent _incpc all_indexn_c#90_days
all_indexn_c: factor variables may not contain noninteger values
r(452);
How do I fix this issue? I don't have factor variables and I'd like to use float variables in the model.
The problem arises because you're asking for an interaction term, and in that case Stata requires you to flag a variable with non-integer values as continuous by prefixing it with c.; otherwise it is treated as a factor variable. See for example help fvvarlist.
Running this reproducible example shows a similar problem and its fix.
sysuse auto, clear
regress price mpg headroom mpg#headroom
regress price mpg headroom mpg#c.headroom
P.S. Regressions with GDP per capita as the outcome usually work better on a logarithmic scale.

Find marginal effects in multiple equation model with ordered probit - cmp

I am really new to Stata, so my question might be trivial.
I am using package cmp to estimate a bivariate model that goes as follows:
cmp(d_ln_jobs = d_layer) (d_layer = d_tariff), vce(robust) ind($cmp_cont $cmp_probit) nolr quietly difficult
d_layer is an ordered variable that takes the values -4, -3, ..., 4.
How could I obtain the marginal effect of d_tariff on both dependent variables, evaluated at d_tariff's median?
Here is what I've tried:
margins, dydx(d_tariff) at((median)) force
I don't think this is correct since the dy/dx entry in the output is 0, and the header of the output shows:
"Expression: linear prediction, predict()"
Does this last part mean that it would show predicted probabilities rather than marginal effects? Besides, shouldn't I get a value different from 0? In my mind, d_tariff would change d_layer, which would in turn change d_ln_jobs. And why don't I get two values, one showing the marginal effect on d_layer and the other on d_ln_jobs?

Negative coefficient for dummy variable regression analysis

I am interpreting a multiple regression analysis with the dependent variable being rate of return (ROR) on a stock. I have also included a dummy variable that represents the stock being included in or excluded from an index (1 = included, 0 = excluded).
The output showed a negative coefficient (the value is -0.03852), and I'm not quite sure what to make of this. Does this mean that being included in an index has a negative effect on the dependent variable?
Could someone please provide me an "easy" explanation?
Thanks!
It means that when the stock is included (value of 1), you expect the rate of return to be 0.03852 lower, all else equal. More or less, I guess you can call that a negative effect. What you should check is whether this coefficient is significant, i.e., look at its standard error; regression tools usually report that.
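As a made-up illustration (not from the original post; the data and variable names here are invented), this is how you could inspect the dummy coefficient, its standard error, and its significance with Python's statsmodels:

import numpy as np
import statsmodels.api as sm

# Invented data: rate of return regressed on an inclusion dummy.
rng = np.random.default_rng(0)
included = rng.integers(0, 2, size=200)                  # 1 = in index, 0 = out
ror = 0.05 - 0.03852 * included + rng.normal(0, 0.1, 200)

fit = sm.OLS(ror, sm.add_constant(included)).fit()
print(fit.params)    # slope near -0.03852: expected ROR drop when included
print(fit.bse)       # standard error of each coefficient
print(fit.pvalues)   # is the dummy's coefficient significant?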

Temperature Scale in SA

First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's about how the magnitude of the data relates to the scaling of the exponentiation.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less than 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.
Classical SA has 2 parameters: startingTemperature and cooldownSchedule (= what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation, I automatically calculate the cooldownSchedule based on the timeGradient (which is a double going from 0.0 to 1.0 during the solver time). This works well. As a guideline for the startingTemperature, I use the maximum score diff of a single move. For more information, see the docs.
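To make the scale relationship concrete, here is a tiny Python sketch (my own illustration, not from either post) of how the acceptance probability exp(-delta/T) behaves as the temperature T moves relative to delta:

import math

delta = 20.0                          # a typical uphill move from the question
for T in (0.02, 2.0, 20.0, 200.0):
    p = math.exp(-delta / T)
    print("T = %8.2f  ->  accept probability %.3g" % (T, p))

# T << delta: p is essentially 0, so the search is greedy from the start.
# T ~= delta: p ~ exp(-1) ~ 0.37, so uphill moves are accepted regularly.
# T >> delta: p approaches 1, so the search is close to a random walk.

In other words, the temperature has to start on roughly the same order of magnitude as a typical delta, which is why a guideline like "the maximum score diff of a single move" works without magic constants.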

"Reverse" statistics: generating data based on mean and standard deviation

Having a dataset and calculating statistics from it is easy. How about the other way around?
Let's say I know some variable has an average X, standard deviation Y and assume it has normal (Gaussian) distribution. What would be the best way to generate a "random" dataset (of arbitrary size) which will fit the distribution?
EDIT: This kind of develops from this question; I could make something based on that method, but I am wondering if there's a more efficient way to do it.
You can generate standard normal random variables with the Box-Muller method. Then, to transform them to have mean mu and standard deviation sigma, multiply your samples by sigma and add mu. That is, for each z from the standard normal, return mu + sigma*z.
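For concreteness, a minimal Python sketch of that recipe (my own illustration, not from the original answer):

import math
import random

def normal_sample(mu, sigma):
    # Box-Muller: turn two uniform draws into one standard normal draw.
    u1 = 1.0 - random.random()        # in (0, 1], so log(u1) is safe
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z             # shift and scale to N(mu, sigma^2)

data = [normal_sample(4, 2) for _ in range(100000)]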
This is really easy to do in Excel with the norminv() function. Example:
=norminv(rand(), 100, 15)
would generate a value from a normal distribution with mean of 100 and stdev of 15 (human IQs). Drag this formula down a column and you have as many values as you want.
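The same inverse-transform idea works outside Excel as well; for example, in Python (a sketch, assuming SciPy is installed):

import numpy as np
from scipy.stats import norm

u = np.random.rand(10000)             # uniform draws, like rand() in Excel
iqs = norm.ppf(u, loc=100, scale=15)  # normal quantile function, like norminv()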
I found a page where this problem is solved in several programming languages:
http://rosettacode.org/wiki/Random_numbers
There are several methods to generate Gaussian random variables. The standard method is Box-Muller, which was mentioned earlier. A slightly faster alternative is here:
http://en.wikipedia.org/wiki/Ziggurat_algorithm
Here's the wikipedia reference on generating Gaussian variables
http://en.wikipedia.org/wiki/Normal_distribution#Generating_values_from_normal_distribution
I'll give an example using R and the 2nd algorithm in the list here.
X<-4; Y<-2 # mean and std
z <- sapply(rep(0,100000), function(x) (sum(runif(12)) - 6) * Y + X)
plot(density(z))
> mean(z)
[1] 4.002347
> sd(z)
[1] 2.005114
> library(fUtilities)
> skewness(z,method ="moment")
[1] -0.003924771
attr(,"method")
[1] "moment"
> kurtosis(z,method ="moment")
[1] 2.882696
attr(,"method")
[1] "moment"
You could make it a kind of Monte Carlo simulation. Start with a wide random "acceptable range" and generate a few truly random values. Check your statistics and see if the average and variance are off. Adjust the "acceptable range" for the random values and add a few more values. Repeat until you have hit both your requirements and your population sample size.
Just off the top of my head, let me know what you think. :-)
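One way to read that suggestion as code (a rough Python sketch; the adjustment rule is my own guess at what was meant, not the poster's method):

import random
import statistics

def monte_carlo_dataset(target_mean, target_sd, n, tol=0.05, max_tries=100000):
    # A uniform on [c - h, c + h] has mean c and standard deviation h/sqrt(3).
    c, h = target_mean, target_sd * 3 ** 0.5
    for _ in range(max_tries):
        data = [random.uniform(c - h, c + h) for _ in range(n)]
        m, s = statistics.mean(data), statistics.stdev(data)
        if abs(m - target_mean) <= tol and abs(s - target_sd) <= tol:
            return data
        c += target_mean - m          # re-center toward the target mean
        h *= target_sd / s            # widen/narrow toward the target sd
    raise RuntimeError("did not converge; loosen tol or change n")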
The MATLAB function normrnd from the Statistics Toolbox can generate normally distributed random numbers with a given mu and sigma.
It is easy to generate a dataset with a normal distribution (see http://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform ).
Remember that the generated sample will not have an exact N(0,1) distribution! You need to standardize it: subtract the sample mean and then divide by the sample standard deviation. Then you are free to transform this sample to a normal distribution with the given parameters: multiply by the target standard deviation and then add the target mean.
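In Python, for example, that two-step recipe looks like this (a sketch of my own; note it gives the sample exactly the requested mean and standard deviation):

import numpy as np

def exact_normal_sample(mu, sigma, n):
    z = np.random.standard_normal(n)
    z = (z - z.mean()) / z.std()      # standardize: sample mean 0, sample sd 1
    return mu + sigma * z             # rescale to the target parameters

x = exact_normal_sample(100, 15, 1000)
print(x.mean(), x.std())              # 100.0 and 15.0, up to float rounding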
Interestingly, numpy has a built-in function for that:
import numpy as np
def generate_dataset(mean, std, samples):
    dataset = np.random.normal(mean, std, samples)
    return dataset