Stata regression with conditions on dummies and variable values - regression

I'm trying to create a regression that would include a polynomial (let's say 2nd order) of year on a certain interval of year (say 1 to 70) and a number of dummies for certain values of year (say for every year between 45 and 60).
If I didn't have the restriction for dummies, I believe the commands would be:
gen year2=year^2
regress y year year2 i.year if inrange(year,1,70)
(I can't make the dummies manually; there will be more than 15 of them in the end.) Could anybody help me, please?
If I then want to plot the estimated function without the dummies, why do these two commands produce different plots?
twoway function _b[_cons] +_b[year]*x + _b[year2]*x^2, range(1 70)
twoway function _b[_cons] +_b[year]*year + _b[year2]*year^2, range(1 70)
The way I understood it, _b[_cons], _b[year] and _b[year2] retrieve the previously estimated coefficients for the corresponding independent variables, which are then multiplied by those variables. Why do the results differ, then, if x should be the same thing as year in this case?

I am not sure why Pearly is giving you such a hard time. I think this may be what you're looking for, but let me know if it is something different.
One thing to note: I am using a dataset that ships with Stata, which is usually a nice way to make an MCVE, as Nick was saying in your other post.
clear
sysuse gnp96
/* variables: gnp, date (quarterly) */
gen year = year(dofq(date)) // get yearly variable
gen year2=year^2 // get the square of the yearly variable
tab year if inrange(year,1970,1975), gen(yr) // generate dummy variables
// the dummy variables generated have missing values for years not
// in the specified range, so we're going to fill those in
foreach v of varlist yr* {
    replace `v' = 0 if `v' == .
}
// here's your regression
regress gnp year year2 yr* if inrange(year,1967,1990)
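Now, the yr* are your dummy variables, and the * is a wildcard matching all variables named like yr[something].
The inrange() in the tab command sets the years that get dummies (here 1970 to 1975), and the inrange() in the regress command sets the range of years used in the regression (here 1967 to 1990).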
As to your question on using x vs. year: I am only hypothesizing, but I think that when you use x, it is continuous, since Stata isn't looking at your variables but just at the x axis, whereas your year variable is discrete (a bunch of integers), so the plot looks more like a step function. More information can be found with the command help twoway function.

Related

Engineering Equation Solver - Functions

I must calculate a function in EES.
Function: T(t)=((T_surface-T_infinity)*(e^(-bt)))+T_infinity
t is time, and its limits are 1 and 40 seconds. I need to calculate the function at every second from 1 to 40.
How can I write this function in EES?
If I understand the question correctly, no differential equation should be solved, no integration, summation or the like should be carried out. Only a calculation with variation of the variable t is to be accomplished.
There are two possibilities. But first: EES is not case-sensitive, so T and t are the same variable. You should keep one of them and give the other a different name. I prefer 'tau' for time.
Parametric table
Write the equation in the equation window and add any parameters you have. Do not define 'tau'. For example:
T=((T_surface-T_infinity)*(exp(-b*tau)))+T_infinity
T_surface = 80[C]
T_infinity = 20[C]
b=3
Then open the parametric table and add the variables you want to see. At minimum, include T and tau. Expand the table to 40 rows and enter the respective times. Then press the green play button (top left in the table).
Duplicate function:
T_surface = 80[C]
$varinfo tau[] units='s'
b=0.1[1/s]
Duplicate i=1,40
    tau[i] = i*1[s]
    T[i]=((T_surface-T_infinity)*(exp(-b*tau[i])))+T_infinity
End
T_infinity = 20[C]
$varinfo T[] units='C'
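For readers who want to cross-check the table outside EES, here is a minimal Python sketch of the same calculation. It is not EES code; it reuses the values from the Duplicate example (T_surface = 80 C, T_infinity = 20 C, b = 0.1 1/s), and the function name temperature is made up for illustration.

```python
import math

# Values taken from the Duplicate example above
T_surface = 80.0    # surface temperature in C
T_infinity = 20.0   # ambient temperature in C
b = 0.1             # decay constant in 1/s


def temperature(tau):
    """T(tau) = (T_surface - T_infinity) * exp(-b * tau) + T_infinity."""
    return (T_surface - T_infinity) * math.exp(-b * tau) + T_infinity


# One row per second, tau = 1 .. 40, like the 40 rows of the parametric table
table = [(tau, temperature(tau)) for tau in range(1, 41)]

for tau, T in table:
    print(f"{tau:2d} s  {T:7.3f} C")
```

The printed values should match the T[i] column that the Duplicate loop produces, which gives a quick sanity check on units and on the sign of the exponent.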

Store 2 previous array to implement Leapfrog numerical Scheme

In the context of solving advection numerically, I am trying to implement the following recurrence formula in a time loop:
T_i,(j+1) = T_i,(j-1) - s (T_(i+1),j - T_(i-1),j)
As you can see, I need both the value two time levels back (j-1) and the previous value (j) to compute the value at time (j+1).
I don't know how to implement this recurrence formula. Below is my attempt in Python, where u represents the array of values T at each iteration:
l = 1
# Time loop
for i in range(1,nt+1):
    # Leapfrog scheme
    # Store (i-1) value for scheme formula
    if (l < 2):
        atemp = copy(u)
        l = l+1
    elif (l == 2):
        btemp = copy(atemp)
        l = 1
    u[1:nx-1] = btemp[1:nx-1] - cfl*(u[2:nx] - u[0:nx-2])
    t=t+dt
Coefficient cfl is equal to s.
But the simulation results are not quite right, so I think my approach is incorrect.
How can I implement this recurrence? In particular, how do I store the (j-1) value in time so I can inject it into the formula for computing (j+1)?
Update
In the formula, the time index j has to start from j = 1, since it contains the term T_(i,j-1).
So for the first iteration, we have :
T_i,2 = T_i,0 - s (T_(i+1),1 - T_(i-1),1)
Then, if I only use a time loop (and no spatial loop, so that I can't compute dudx[i]=T[i+1]-T[i-1]), how can I compute (T_(i+1),1 - T_(i-1),1), I mean, without precalculating dudx[i] = T_(i+1),1 - T_(i-1),1 ?
That was the trick I tried to implement in my original question. The main problem is that I am required to use only a time loop.
The code would be simpler if I could use a 2D array with elements T[i][j], i for space and j for time, but I am not allowed to use a 2D array in my examination.
There are a few problems I see in your code. First is notation. From the numerical scheme you posted, it looks like you are discretizing time with j and space with i, using central differences in both. But in your code the time loop is written in terms of i, which is confusing. I will use j for space and n for time here.
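Second, look at this line
u[1:nx-1] = btemp[1:nx-1] - cfl*(u[2:nx] - u[0:nx-2])
For the spatial derivative du/dx you need the central difference u[j+1] - u[j-1] at every interior spatial point, evaluated on the solution at the current time n. If u is a NumPy array, the slice difference u[2:nx] - u[0:nx-2] does produce these differences, but the update is only right if u still holds the time-n solution and btemp holds the time n-1 solution. With the atemp/btemp toggling above that is not guaranteed: btemp is not even assigned on the first pass, and afterwards it is refreshed only every other iteration. You need to make sure the spatial derivative and the n-1 term come from the correct time levels.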
Finally, the Leapfrog method which indeed takes into account the n-1 solution is usually implemented by keeping a copy of the previous time step in another variable such as u_prev. So if you use the Leapfrog time scheme plus central difference spatial scheme, in the end you should have something like
u_prev = u_init
u = u_prev
for n in time...:
    u_new = u_prev - cfl*(dudx)
    u_prev = u
    u = u_new
Note that u_new on the LHS is the solution at time n+1, u_prev is the solution at time n-1, and dudx uses u at the current time n. You can compute dudx with
for j in space...:
    dudx[j] = u[j+1]-u[j-1]
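The u_prev / u / u_new rotation described above can be sketched as runnable Python. This is only an illustration under stated assumptions: plain lists instead of arrays, boundary values simply carried over from the current time level, and the same initial profile used for both starting levels (as in the pseudocode). The function names leapfrog_step and leapfrog are made up for this sketch.

```python
def leapfrog_step(u_prev, u, cfl):
    """One step: u_new[j] = u_prev[j] - cfl * (u[j+1] - u[j-1]) at interior points."""
    u_new = list(u)  # assumption: boundary values are carried over unchanged
    for j in range(1, len(u) - 1):
        dudx = u[j + 1] - u[j - 1]       # central difference at time n
        u_new[j] = u_prev[j] - cfl * dudx
    return u_new


def leapfrog(u_init, cfl, nt):
    """Advance nt time steps, keeping only the two previous time levels."""
    u_prev = list(u_init)  # time level n-1
    u = list(u_init)       # time level n (assumption: same as n-1 at start-up)
    for _ in range(nt):
        u_new = leapfrog_step(u_prev, u, cfl)  # time level n+1
        u_prev, u = u, u_new                   # shift the levels for the next step
    return u


print(leapfrog([0.0, 1.0, 2.0, 3.0, 4.0], cfl=0.25, nt=2))
```

If u and u_prev are NumPy arrays, the inner loop can be replaced by the slice expression u[2:] - u[:-2], which keeps the code to a single time loop; the important part is that u_prev and u are shifted on every iteration, not every other one.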

decimal lag values in ACF plot instead of integers lags

I have a monthly time series. When I run the code acf(timeseries), the lags on the x axis show up as decimals instead of integers, as shown in the screenshot:
What is wrong? How could I have lags=c(1,2,3,4,5,6,etc) on the x-axis? I need something like this (photoshopped photo) (excuse the mis-alignment of values with ticks on the x-axis):
Try the Acf function (with an upper-case first letter) in the package "forecast".
Perhaps it's because you are using a non-desirable format for the acf/ccf function.
I faced the same problem and I solved it by changing the input vectors from time-series (ts) to numeric:
[variable]<-as.numeric([variable])
And it worked. I hope it helped.
Since you are using a time series object, you need to pass coredata() of it to the acf and pacf functions:
acf(coredata(your_ts_object))
pacf(coredata(your_ts_object))
This will pass just the numerical values in the time series and won't make a mess, giving you integer lags.
I think you have to get the results from the acf() function and then plot them yourself, like this:
Storing the acf results:
a = acf(ts, plot = FALSE) # ts is a monthly time series (frequency = 12)
Plotting the acf:
plot(a$lag*12,a$acf,xlab="Lag",ylab="ACF",main="",type="h")
Note that you have to multiply the lag by the frequency of your series, in this case 12.
Plotting horizontal lines:
abline(h=c(-0.19,0,0.19),col=c("blue","black","blue"),lty=c(2,1,2))
h : where to plot the lines
col : the colors of the lines
lty : the type of each line
That worked for me; I hope that's what you are looking for.
Because the ts is monthly, the lags are expressed in years, so one monthly lag is 1/12 of a year. The first figure is just a portion of the total ACF (approximately the first 1.5 years). To get the ACF for the full ts, use acf(ts_object, lag.max = the length of your ts_object). E.g., if you have 15 years of monthly data, set lag.max = 12*15.
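The relation between the decimal lags and integer lags is plain arithmetic, and a small Python sketch (not R) may make it concrete. The function sample_acf below is a hypothetical helper that computes the usual sample autocorrelation; the synthetic series and the variable names are assumptions for illustration only.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelation r_k = c_k / c_0 for k = 0 .. max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


frequency = 12  # monthly data: 12 observations per year

# Synthetic 10-year monthly series with a yearly cycle (illustration only)
series = [math.sin(2 * math.pi * t / frequency) + 0.01 * t for t in range(120)]

acf_values = sample_acf(series, max_lag=24)

# Lag k counted in observations corresponds to k / frequency in years,
# which is where decimal lags such as 0.0833 come from.
lags_in_years = [k / frequency for k in range(len(acf_values))]

# Multiplying by the frequency recovers the integer lags 0, 1, 2, ...
lags_in_months = [round(lag * frequency) for lag in lags_in_years]

for k, r in zip(lags_in_months, acf_values):
    print(k, round(r, 3))
```

Multiplying the lag by 12 before plotting, or dropping the time series attributes before calling acf, are both ways of getting the integer axis.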

Testing large input range scenarios with JUnit

I am pondering on how it is best to develop a JUnit test for a function that calculates a number of points and values in time based on a number of inputs. The purpose of the method is to calculate a series of points in time given a series of gradient value pairs, i.e.
Gradient 1 to Value 1, Gradient 2 to Value 2, Gradient 3 to Value 3, and so on...
Given a starting point in time and a starting value, the function calculates the points in time at which each Value is reached (in the gradient value pairs) up until a target value is reached. This is essentially to plot a line on a graph, with the x-axis having date values and the y-axis having numeric values.
The method to test takes the following inputs:
StartTime (Date)
StartValue (Double)
TargetValue (Double)
GradientValuePairs (ArrayList<GradientValuePair>)
EnsurePointEvery5Minutes (Boolean)
Where GradientValuePair is like:
class GradientValuePair {
    Double gradient; // Gradient up to Target
    Double target;
    ...
}
The output from this method is essentially an ArrayList<DatePoint> (a profile), with:
class DatePoint {
    Date date;
    Double value;
    ...
}
The EnsurePointEvery5Minutes parameter basically adds a date point every 5 minutes to the calculated profile, which is then returned by the method.
To verify the result, I will need to check that each date and value matches what is expected, by either:
Iterating through the array alongside an array of the expected values.
Storing minute/second offsets from the StartTime with the expected value in some sort of structure.
Now the difficult part for me is deciding on how to write the TestCase. I want to test a broad/diverse range of inputs so that:
StartTime will cover 30 minutes i.e. in range of 2012-03-08 00:00 to 2012-03-08 00:30.
StartValue will be in the range of 0 to 1000.
TargetValue will be in the range of StartValue to 1000.
GradientValuePairs will require around 10 different arrays to be tested.
EnsurePointEvery5Minutes will be tested with both true and false.
Now given the number of different input sets will be something like:
30 * (0 to 1000 * 0 to 1000 = 500500) * 10 * 2 = 300,300,000 different test input sets per GradientValuePairs input
Or call us crazy for wanting to do this. Maybe the tests are too diverse for this instance.
I am wondering if anybody has any advice for testing such scenarios like this. I can't think of any other way to do this than implement my own algorithm for calculating the output before each call to the method I am testing - then who is to say that the algorithm I implement to test it is correct.
If I understand correctly, you are proposing to test every possible combination of numeric inputs. That is almost never required of unit tests, as it would be essentially equivalent to testing whether the Java math library works for all numbers and all operations. Generally, you try to identify edge conditions and write tests for those. These would include things like zeros, negatives, numeric overflow, and combinations of inputs whose intermediate computations produce the same results. Then, of course, you would also want to test a handful of normal, vanilla cases that are not edge cases.
So short answer: no you should not need to test 300M+ input sets.
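To show how small such a test set can be, here is a rough sketch of enumerating edge-case combinations instead of every value. It is written in Python only to keep it short; the same grid could feed a JUnit parameterized test. All the representative values below are hypothetical and would need to be chosen to match the real edge cases of the method under test.

```python
from itertools import product

# Hypothetical representative values (assumptions, not taken from the question)
start_minutes = [0, 15, 30]            # offsets from 2012-03-08 00:00
start_values = [0.0, 500.0, 1000.0]    # lower bound, middle, upper bound
target_offsets = [0.0, 1.0, 500.0]     # target = start + offset, capped at 1000
pair_sets = range(10)                  # indices of the ~10 GradientValuePairs arrays
ensure_flags = [True, False]           # EnsurePointEvery5Minutes

# (start, target) pairs with start <= target <= 1000, duplicates removed
value_pairs = sorted(
    {(s, min(s + d, 1000.0)) for s in start_values for d in target_offsets}
)

# Every combination of the representative inputs
cases = list(product(start_minutes, value_pairs, pair_sets, ensure_flags))

print(len(value_pairs), "value pairs,", len(cases), "test cases")
```

A few hundred cases like these, each with a hand-checked expected profile for the simplest gradient arrays, are far easier to trust than a second implementation of the algorithm used as an oracle.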

Function to dampen a value

I have a list of documents each having a relevance score for a search query. I need older documents to have their relevance score dampened, to try to introduce their date in the ranking process. I already tried fiddling with functions such as 1/(1+date_difference), but the reciprocal function is too discriminating for close recent dates.
I was thinking maybe a mathematical function with range (0..1) and domain(0..x) to amplify their score, where the x-axis is the age of a document. It's best to explain what I further need from the function by an image:
Decaying behavior is often modeled well by an exponential function (many decay processes in nature also follow it). You would use two positive parameters A and B and get
y(x) = A exp(-B x)
Since you want y to stay within [0,1], set A=1. Larger B gives faster decay; smaller B gives slower decay.
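A minimal Python sketch of this exponential damping, with A fixed to 1, might look like the following. The function names and the sample numbers are made up for illustration; the last lines compare two values of B so the effect of B on the decay speed can be checked directly.

```python
import math


def exp_damping(age, B):
    """Damping factor exp(-B * age): 1 for age 0, approaching 0 as age grows."""
    return math.exp(-B * age)


def damped_score(relevance, age, B):
    """Relevance score multiplied by the age-dependent damping factor."""
    return relevance * exp_damping(age, B)


# Compare two decay rates for a document of age 10 (units are up to you)
for B in (0.1, 0.5):
    print(B, round(exp_damping(10, B), 4))
```

One convenient way to pick B, offered here as a suggestion rather than part of the answer above, is via a half-life: B = ln(2) / half_life makes the factor drop to 0.5 at that age.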
If a simple 1/(1+x) decreases too quickly too soon, a sigmoid function like 1/(1+e^-x) or the error function might be better suited to your purpose. Let the current date be somewhere in the negative numbers for such a function, and you can get a value that is current for some configurable time and then decreases towards a base value.
log((x+1)-age_of_document)
Where the base of the logarithm is (x+1). Note the x is as per your diagram and is the "threshold". If the age of the document is greater than x the score goes negative. Multiply by the maximum possible score to introduce scaling.
E.g. with a threshold of 10 (domain (0,10)) and a maximum score of 10: 10*(log(11-age))/log(11), where age is the age of the document.
A bit late, but as thiton says, you might want to use a sigmoid function instead, since it has a "floor" value for your long tail data points. E.g.:
0.8/(1+5^(x-3)) + 0.2 - You can adjust the constants 5 and 3 to control the slope of the curve. The 0.2 is where the floor will be.
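As a rough sketch, the sigmoid-with-floor formula above can be written as a small Python function. The parameter names base, midpoint and floor are invented here for the constants 5, 3 and 0.2; the defaults reproduce the formula as given.

```python
def sigmoid_floor(age, base=5.0, midpoint=3.0, floor=0.2):
    """(1 - floor) / (1 + base**(age - midpoint)) + floor.

    With the defaults this is 0.8 / (1 + 5**(age - 3)) + 0.2.
    Note: for very large ages the power can overflow a float, so
    clamp age beforehand if your ages are unbounded.
    """
    return (1.0 - floor) / (1.0 + base ** (age - midpoint)) + floor


# Recent documents keep almost full weight; old ones settle at the floor
for age in (0, 1, 2, 3, 4, 5, 10):
    print(age, round(sigmoid_floor(age), 4))
```

A larger base makes the drop around the midpoint steeper, and moving the midpoint shifts the age at which the weight is halfway between 1 and the floor.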