How does Stata treat multiple factor variables in regression?

I have a city-year level dataset, and run the following regression with city fixed effects:
reg y x i.city
I think this is equivalent to generating a dummy variable for each of the 300 cities in the data and running (with city 1 as the base level):
reg y x city2 ... city300
However, I need to include year dummies as well. I get the estimates using:
reg y x i.city i.year
Does anyone know what is going on behind this regression in matrix form? Is it the same as generating one dummy for each year and running the following?
reg y x city2 ... city300 year2 ... year20
The reason I want to do this is to try to code the command from scratch using the matrix operations (X'X)^{-1}(X'y), where X includes the city dummies and year dummies.

What you are using is called corner-point coding for dummy (0,1) variables, where k-1 binary (0,1) variables are used for each factor (categorical variable) with k levels. If you specify that a constant term should not be used:
reg y x i.city i.year, nocon
then sum-to-zero constraint coding will be used for the binary-variable construction, in which there will be a binary variable for both city1 and year1 in the X matrix.
For example, when retinol concentration in the diet (retdiet) is regressed on the male dummy variable alone, the constant term (y-intercept) is the mean of y among females (815), and the coefficient for male is the difference in mean y between males and females. Whereas when both dummy indicators, fem and male, are used and nocon is specified (after a comma), the regression coefficients for fem and male are the mean values of y (retdiet) within each group.
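If you want to build the X matrix yourself, here is a minimal sketch in Python (not Stata) of corner-point coding and the (X'X)^{-1}(X'y) solve the question asks about; the panel dimensions and data are simulated purely for illustration:

import numpy as np

# Toy city-year panel: 300 cities x 20 years (simulated for illustration)
rng = np.random.default_rng(0)
n_city, n_year = 300, 20
city = np.repeat(np.arange(n_city), n_year)
year = np.tile(np.arange(n_year), n_city)
x = rng.normal(size=city.size)
y = 1.5 * x + rng.normal(size=city.size)

# Corner-point coding: a constant plus k-1 dummies per factor (base level dropped)
const = np.ones((city.size, 1))
city_d = (city[:, None] == np.arange(1, n_city)).astype(float)  # city2..city300
year_d = (year[:, None] == np.arange(1, n_year)).astype(float)  # year2..year20
X = np.hstack([const, x[:, None], city_d, year_d])

# OLS via the normal equations, b = (X'X)^{-1} X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b[1])  # coefficient on x; should be close to 1.5

Dropping the first column of each dummy block is exactly what makes city1 and year1 the base levels; with , nocon you would instead keep all the dummy columns and drop const.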

Related

How to group in MySQL a column with multiple choices under only one of them?

I have a table in which a certain column's data is sometimes missing, in which case it has just a dash.
But since there's another column which is unique, the missing data can be deduced from other rows.
In other words, I have something like this:
Unique model | Maker     | Profit
abcd1234     | -         | 56
zgh675       | Company Y | 40
abcd1234     | Company X | 3
zgh675       | -         | 10
abcd1234     | Company X | 1
Which query can I use to automatically arrive at the following (the list of Makers is dynamic, but each model can belong to only one of them):
Unique model | Maker     | Profit
abcd1234     | Company X | 60
zgh675       | Company Y | 50
?
You may aggregate by unique model and then take the MAX value of the maker, along with the sum of the profit:
SELECT model,
       MAX(NULLIF(maker, '-')) AS maker,  -- treat the dash placeholder as NULL
       SUM(profit) AS profit
FROM yourTable
GROUP BY model;
This approach should work assuming that each model has only one distinct real maker; NULLIF turns the dash placeholder into NULL, which MAX then ignores.

Need a different permutation of groups of numbers

I have the numbers from 1 to 36. What I am trying to do is put all these numbers into three groups and work out all the various permutations of the groups.
Each group must contain 12 numbers, from 1 to 36
A number cannot appear in more than one group, per permutation
Here is an example....
Permutation 1
Group 1: 1,2,3,4,5,6,7,8,9,10,11,12
Group 2: 13,14,15,16,17,18,19,20,21,22,23,24
Group 3: 25,26,27,28,29,30,31,32,33,34,35,36
Permutation 2
Group 1: 1,2,3,4,5,6,7,8,9,10,11,13
Group 2: 12,14,15,16,17,18,19,20,21,22,23,24
Group 3: 25,26,27,28,29,30,31,32,33,34,35,36
Permutation 3
Group 1: 1,2,3,4,5,6,7,8,9,10,11,14
Group 2: 12,13,15,16,17,18,19,20,21,22,23,24
Group 3: 25,26,27,28,29,30,31,32,33,34,35,36
Those are three examples; I would expect there to be millions/billions more.
The analysis that follows assumes the order of groups matters - that is, if the numbers were 1, 2, 3 then the grouping [{1},{2},{3}] is distinct from the grouping [{3},{2},{1}] (indeed, there are six distinct groupings when taking from this set of numbers).
In your case, how do we proceed? Well, we must first choose the first group. There are 36 choose 12 ways to do this, or 36!/[(12!)(24!)] = 1,251,677,700 ways. We must then choose the second group. There are 24 choose 12 ways to do this, or 24!/[(12!)(12!)] = 2,704,156 ways. Since the second choice is already conditioned upon the first, we get the total number of ways of taking the three groups by multiplying the two numbers; the total number of ways to choose three equal groups of 12 from a pool of 36 is 3,384,731,762,521,200. If you represented numbers using 8-bit bytes, then storing every list would take at least ~3 petabytes (well, times the size of each list, which would be 36 bytes, so more like ~108 petabytes). This is a lot of data and will take some time to generate and no small amount of disk space to store, so be aware of this.
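As a quick sanity check on that arithmetic, a throwaway Python snippet:

from math import comb

first = comb(36, 12)    # ways to choose group 1: 1,251,677,700
second = comb(24, 12)   # ways to choose group 2 from the remainder: 2,704,156
print(first * second)   # 3384731762521200 ordered groupings in total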
To actually implement this is not so terrible. However, I think you are going to have undue difficulty implementing it in SQL, if it's possible at all. Pure SQL has no operation that returns more than n^2 rows from n input rows (a simple cross join), so producing such huge numbers of results would require a large number of joins. Moreover, it does not strike me as possible to generalize the procedure, since pure SQL has no facility for general recursion and therefore cannot perform a variable number of joins.
You could use a procedural language to generate the groupings and then write the groupings into a database. I don't know whether this is what you are after.
n = 36
group1[1...12] = []
group2[1...12] = []
group3[1...12] = []

function Choose(input[1...n], m, minIndex, group)
    if minIndex + m > n + 1 then
        return
    if m = 0 then
        if group = group1 then
            Choose(input[1...n], 12, 1, group2)
        else if group = group2 then
            group3[1...12] = input[1...12]
            print group1, group2, group3
        return
    for i = minIndex to n do
        group[12 - m + 1] = input[i]
        Choose(input[1...i-1] ++ input[i+1...n], m - 1, i, group)
When you call this like Choose([1...36], 12, 1, group1) what it does is fill in group1 with all possible ordered subsequences of length 12. At that point, m = 0 and group = group1, so the call Choose([?], 12, 1, group2) is made (for every possible choice of group1, hence the ?). That will choose all remaining ordered subsequences of length 12 for group2, at which point again m = 0 and now group = group2. We may now safely assign group3 to the remaining entries (there is only one way to choose group3 after choosing group1 and group2).
We take ordered subsequences only by propagating the index at which to begin looking on the recursive call (minIndex). We take ordered subsequences to avoid generating permutations of the same set of 12 items (since order doesn't matter within a group).
Each recursive call to Choose in the loop passes input with one element removed: precisely that element that just got added to the group under consideration.
We check for minIndex + m > n + 1 and stop the recursion early because, in this case, we have skipped too many items in the input to be able to ever fill up the current group with 12 items (while choosing the subsequence to be ordered).
You will notice I have hard-coded the assumption of 12/36/3 groups right into the logic of the program. This was done for brevity and clarity, not because you can't parameterize it in the input size N and the number of groups k to form. To do this, you'd need to create an array of groups (k groups of size N/k each), then call Choose with N/k instead of 12 and use a select/switch statement instead of if/then/else to determine whether to Choose again or print. But those details can be left as an exercise.
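For reference, here is a minimal runnable sketch of the same idea in Python (the function name and the generator-based design are my own, not part of the question):

from itertools import combinations

def groupings(numbers, k):
    # Yield every way to split `numbers` into k equal-sized groups, where
    # order matters between groups but not within a group. This mirrors the
    # recursive Choose() above, using itertools.combinations per group.
    numbers = tuple(numbers)
    size = len(numbers) // k

    def rec(remaining, acc):
        if len(acc) == k - 1:
            yield (*acc, remaining)   # the last group is forced
            return
        for combo in combinations(remaining, size):
            chosen = set(combo)
            rest = tuple(n for n in remaining if n not in chosen)
            yield from rec(rest, acc + (combo,))

    yield from rec(numbers, ())

# Small demo: 6 numbers into 3 groups of 2 gives 6!/(2!2!2!) = 90 groupings.
# (The full 36-into-3 case yields ~3.4e15 results, so stream them, don't store.)
for g in groupings(range(1, 7), 3):
    print(g)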

Estimating probabilities of probability products

I have a process which generates a binary outcome from a multiplication of two stochastic binary inputs:
Out = X1(P1) * X2(P2) .
X1 and X2 generate a binary output with some probability, P1 and P2, respectively.
I have a set of examples, for which I know Out and P1. I would like to estimate P2. Note: P2 is constant across all examples, but P1 and Out are not.
What is a good way to do this?
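One standard approach, sketched here under the assumption that X1 and X2 are independent Bernoulli draws (so P(Out = 1) = P1 · P2), is to estimate P2 by maximum likelihood; the data below is simulated just to make the snippet self-contained:

import numpy as np
from scipy.optimize import minimize_scalar

# Simulated examples: p1 varies per example, p2 is the single unknown constant.
rng = np.random.default_rng(42)
true_p2 = 0.7
p1 = rng.uniform(0.2, 0.9, size=5000)                         # known per example
out = (rng.random(5000) < p1) & (rng.random(5000) < true_p2)  # observed outcomes

def neg_log_likelihood(p2):
    # Under independence, P(Out = 1 | p1) = p1 * p2.
    q = np.clip(p1 * p2, 1e-12, 1 - 1e-12)
    return -np.sum(np.where(out, np.log(q), np.log(1 - q)))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # maximum-likelihood estimate of P2; should be close to 0.7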

Sort grouped series in SSRS Report

I have a chart with a count of X per year per category, i.e.
X ##### # 2012
X ###### # 2013
Y #############
Y ##
Z ###########
Z #################
How can I apply a sort to the 2013 values (while keeping the grouping within X, Y, Z)? I.e., in this case: Z at the top, X in the middle, and Y at the bottom, based on the 2013 values.
OK, the trick is to name the Category Group and then use a sort expression on that group. If you want to sort by the values in 2013, use:
=Sum(IIf(year(Fields!TradeDate.Value) = 2013,
CDbl(Fields![The field being summed].Value), CDbl(0)),
"[The Category Grouping Name]")
The CDbl is required to ensure Sum operates on consistent data types.
I bet if you add row groups to your report, by Year and by whatever X/Y/Z is, then you can sort your output by year while keeping the X/Y/Z grouping.

Calculating the cost of Block Nested Loop Joins

I am trying to calculate the cost of the (most efficient) block nested loop join in terms of NPDR (number of disk page reads). Suppose you have a query of the form:
SELECT COUNT(*)
FROM county JOIN mcd
ON county.state_code = mcd.state_code
AND county.fips_code = mcd.fips_code
WHERE county.state_code = #NO
where #NO is substituted for a state code on each execution of the query.
I know that I can derive the NPDR using: NPDR(R x S) = |Pages(R)| + Pages(R) / B - 2 * |Pages(S)|
(where the smaller table is used as the outer in order to produce fewer page reads; ergo R = county, S = mcd).
I also know that:
Page size = 2048 bytes
Pointer = 8 bytes
Number of rows in mcd = 35298
Number of rows in county = 3141
Free memory buffer pages B = 100
Pages(X) = (rowsize * numrows) / pagesize
What I am trying to figure out is how the "WHERE county.state_code = #NO" clause affects my cost.
Thanks for your time.
First, a couple of observations regarding the formula you wrote:
I'm not sure why you write "B - 2" instead of "B - 1". From a theoretical perspective, you need only a single buffer page for reading in relation S (you can read it one page at a time).
Make sure you use all the brackets. I would write the formula as:
NPDR(R x S) = |Pages(R)| + |Pages(R)| / (B-2) * |Pages(S)|
All the numbers in the formula would need to be rounded up (but this is nitpicking).
The explanation for the generic BNLJ formula:
You read in as many tuples from the smaller relation (R) as you can keep in memory (B-1 or B-2 pages' worth of tuples).
For each such group of B-2 pages' worth of tuples, you then have to read the whole of relation S (|Pages(S)| pages) to perform the join for that particular chunk of R.
At the end of the join, relation R has been read exactly once and relation S has been read as many times as we filled the memory buffer, namely |Pages(R)| / (B-2) times.
Now the answer:
In your example a selection criterion is applied to relation R (table county in this case). This is the WHERE county.state_code = #NO part of the query. Therefore, the generic formula does not apply directly.
When reading from relation R (i.e., table county in your example), we can discard all the non-qualifying tuples that do not match the selection criterion. Assuming that there are 50 states in the USA and that all states have the same number of counties, only 2% of the tuples in table county qualify on average and need to be stored in memory. This reduces the number of iterations of the inner loop of the join (i.e., the number of times we need to scan relation S / table mcd). The 2% figure is obviously just the expected average and will vary depending on the actual given state.
The formula for your problem therefore becomes:
NPDR(R x S) = |Pages(county)| + |Pages(county)| / (B - 2) * (|counties in state #NO| / |rows in county|) * |Pages(mcd)|
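Plugging in the numbers is then mechanical. A small Python sketch (note the row sizes are not given in the question, so the widths below are placeholders to be replaced with the real record sizes):

import math

PAGE_SIZE = 2048   # bytes
B = 100            # free memory buffer pages

# Placeholder assumptions: actual row widths are not given in the question.
ROWSIZE_COUNTY = 64   # bytes per county row (placeholder)
ROWSIZE_MCD = 64      # bytes per mcd row (placeholder)

def pages(rowsize, numrows):
    return math.ceil(rowsize * numrows / PAGE_SIZE)

p_county = pages(ROWSIZE_COUNTY, 3141)
p_mcd = pages(ROWSIZE_MCD, 35298)

selectivity = 1 / 50  # counties assumed spread evenly over 50 states

npdr = p_county + math.ceil(p_county * selectivity / (B - 2)) * p_mcd
print(npdr)  # one full scan of county, plus one scan of mcd per buffer fill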