How to use esttab to create columns with different cells - output

Suppose that I have this data:
sysuse auto2, clear
For two different samples, I can use the community-contributed command esttab to create a table of means, stored as estimates a and b, with standard deviations in parentheses below the means:
eststo clear
eststo a: estpost summarize trunk weight length turn
keep if inrange(mpg, 12, 20)
eststo b: estpost summarize trunk weight length turn
esttab a b, label cells("mean(fmt(2))" "sd(fmt(2) par)") ///
nonumbers booktabs collabels("a" "b")
I want the produced table to have the two columns above exactly as is, but then to add additional summary statistics (here min and max) corresponding to the b estimates.
For example, I want the third column to be like:
esttab b, label cells("min") ///
nonumbers booktabs collabels("min")
In addition, I would like the fourth column to be as follows:
esttab b, label cells("max") ///
nonumbers booktabs collabels("min")
The problem is that I am not sure how to combine all of this into one table (other than perhaps saving everything to a matrix and running esttab on that).
The reason is that the cells option does not seem to apply to an individual column; it applies to all columns.
Note that if there is a way to do this but it requires dropping the standard deviations, that is fine.
How can I generate the desired output without creating a matrix?

This is the best you can do without creating a matrix:
esttab a b, label cells( (mean(fmt(2)) min max) sd(fmt(2) par) )
-------------------------------------------------------------------------------
                                        (1)                          (2)
                         mean/sd      min      max    mean/sd      min      max
-------------------------------------------------------------------------------
Trunk space (cu. ft.)      13.76     5.00    23.00      16.32     7.00    23.00
                          (4.28)                       (3.28)
Weight (lbs.)            3019.46  1760.00  4840.00    3558.68  2410.00  4840.00
                        (777.19)                     (498.89)
Length (in.)              187.93   142.00   233.00     203.89   173.00   233.00
                         (22.27)                      (13.87)
Turn Circle (ft.)          39.65    31.00    51.00      42.55    36.00    51.00
                          (4.40)                       (3.21)
-------------------------------------------------------------------------------
Observations                  74                           38
-------------------------------------------------------------------------------

Return value of first argument of condition without recalculation

I know I can solve this by using a separate, hidden cell, but I would like a clean solution that uses only one cell and one formula.
For this example I need to get a random number between X.00 and Y.00 (with decimals), no lower than 0.00 and no higher than 9.00. To keep things simple, though, I am only using one condition here, guarding against values below 0.00.
I will use RANDBETWEEN(). The X and Y are supplied by two other cells in the dummy file below: B3 and D3.
The reason why the random number can sometimes be less than 0.00 or above 9.00 is that I need the random result to be within +/-1 of (X+Y)/2.
Also, the X and Y values will vary: sometimes X will be 3.54, sometimes 8.99, and the same goes for Y. When X and Y are both low numbers, like 0.5, RANDBETWEEN() might output a negative number; in that case I need the output to be 0.00. The same goes for high values: if both X and Y are close to 9.00, the result might be something like 9.35, in which case I need the output to be 9.00. But in the formula below I only handle the below-zero case to keep it simple.
So the problem I am unable to resolve is that I need to get the value of the first argument of the IF() condition without recalculating it. If I refer to it in the third argument (for FALSE), it will be recalculated. My formula is:
=IF(
(RANDBETWEEN(
((((B3*100)+(D3*100))/2)-100),
((((B3*100)+(D3*100))/2)+100))
/100)<0,0,
(RANDBETWEEN(
((((B3*100)+(D3*100))/2)-100),
((((B3*100)+(D3*100))/2)+100))
/100))
So this checks whether the first argument is less than 0.00; if it is, it displays 0.00, and if not, it recalculates. That is the problem, because the recalculated value might itself be less than 0.00.
My question is: is there a way to return the value of the first argument of the condition without recalculating RANDBETWEEN() and without using a separate cell?
If not possible I would also welcome any solution using custom GAS functions.
My dummy file:
https://docs.google.com/spreadsheets/d/15YtgUVqDTuC-raMNJiN-YG4j3URaorXdPiwTy_Kb_K0/edit
(Click the checkbox to reset, and click it again to recalculate the random number.)
Wrap your RANDBETWEEN within a MIN and a MAX. The random number is then drawn only once and merely bounded, so there is no recalculation:
=MAX(
MIN(
RANDBETWEEN(((((D3*100)+(B3*100))/2)-100),((((D3*100)+(B3*100))/2)+100))/100,
9
),
0
)*A1
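For intuition, here is the same clamping logic as a Python sketch (x_val and y_val stand in for B3 and D3; the function name is made up):
import random

def clamped_random(x_val, y_val):
    mid = round((x_val + y_val) / 2 * 100)          # midpoint in hundredths, for two decimals
    r = random.randint(mid - 100, mid + 100) / 100  # one draw within +/-1.00 of the midpoint
    return max(min(r, 9), 0)                        # clamp into [0.00, 9.00]; no second draw needed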

Can an LSTM train for regression with a different number of features in each sample?

In my problem, each training and testing sample has a different number of features. For example, the training samples are as follows:
There are four features in sample1: x1, x2, x3, x4, y1
There are two features in sample2: x6, x3, y2
There are three features in sample3: x8, x1, x5, y3
Here x denotes a feature and y the target.
Can these samples be used to train an LSTM for regression and to make predictions?
Consider the following scenario: you have a (way too small) dataset of 6 sample sequences of lengths {1, 2, 3, 4, 5, 6}, and you want to train your LSTM (or, more generally, an RNN) with a minibatch size of 3 (you feed 3 sequences at every training step), that is, you have 2 batches per epoch.
Let's say that, due to randomization, the batch on step 1 ends up being constructed from the sequences of lengths {2, 1, 5}:
batch 1
----------
2 | xx
1 | x
5 | xxxxx
and, the next batch of sequences of length {6, 3, 4}:
batch 2
----------
6 | xxxxxx
3 | xxx
4 | xxxx
What people typically do is pad the sample sequences up to the longest sequence in the minibatch (not necessarily up to the length of the longest sequence overall) and concatenate the sequences, one on top of another, to get a nice matrix that can be fed into the RNN. Let's say your features consist of real numbers; then it is not unreasonable to pad with zeros:
batch 1
----------
2 | xx000
1 | x0000
5 | xxxxx
(batch * max length = 3 * 5)
batch 2
----------
6 | xxxxxx
3 | xxx000
4 | xxxx00
(batch * max length = 3 * 6)
This way, for the first batch your RNN will only run up to the necessary number of steps (5) to save some compute. For the second batch it will have to go up to the longest one (6).
The padding value is chosen arbitrarily. It usually should not influence anything, unless you have bugs. Trying some bogus values, like Inf or NaN, may help you during debugging and verification.
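As a minimal numpy sketch of this in-batch padding (sequence values are made up for illustration):
import numpy as np

# a minibatch of variable-length 1-D feature sequences (hypothetical data)
batch = [np.array([1., 2.]), np.array([1.]), np.array([1., 2., 3., 4., 5.])]
max_len = max(len(s) for s in batch)
# pad each sequence with zeros up to the longest sequence in this batch
padded = np.stack([np.pad(s, (0, max_len - len(s))) for s in batch])
print(padded.shape)  # (3, 5): batch * max length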
Importantly, when using padding like that, there are some other things to do for the model to work correctly. If you are using backpropagation, you should exclude the results of the padding from both the output computation and the gradient computation (deep learning frameworks will do that for you). If you are training a supervised model, labels should typically also be padded, and the padding should not be considered for the loss calculation. For example, say you calculate cross-entropy for the entire batch (with padding). To get a correct loss, the bogus cross-entropy values that correspond to padding should be masked with zeros; then each sequence should be summed independently and divided by its real length. That is, averaging should be performed without taking the padding into account (in my example this is guaranteed by the neutrality of zero with respect to addition). The same rule applies to regression losses and metrics such as accuracy, MAE, etc. (if you average together with the padding, your metrics will be wrong too).
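As a sketch of the masked averaging described above, in numpy (shapes and values are made up):
import numpy as np

losses = np.random.rand(3, 5)                     # per-timestep losses for a padded (3 x 5) batch
lengths = np.array([2, 1, 5])                     # real lengths of the three sequences
mask = np.arange(5)[None, :] < lengths[:, None]   # True at real steps, False at padding
per_seq = (losses * mask).sum(axis=1) / lengths   # sum real steps, divide by real length
batch_loss = per_seq.mean()                       # average over sequences, padding ignored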
To save even more compute, sometimes people construct batches such that sequences in batches have roughly the same length (or even exactly the same, if dataset allows). This may introduce some undesired effects though, as long and short sequences are never in the same batch.
To conclude, padding is a powerful tool and if you are attentive, it allows you to run RNNs very efficiently with batching and dynamic sequence length.
Yes. Your input_size for the LSTM layer should be the maximum among all input sizes, and you replace the spare cells with zeros:
max(input_size) = 5
input array = [x1, x2, x3]
And you transform it this way:
[x1, x2, x3] -> [x1, x2, x3, 0, 0]
This approach is rather common and does not show any big negative influence on prediction accuracy.
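A minimal numpy sketch of that zero-padding transform (names are just for illustration):
import numpy as np

max_input_size = 5
x = np.array([1., 2., 3.])                          # a sample with 3 features
x_padded = np.pad(x, (0, max_input_size - len(x)))  # -> [1., 2., 3., 0., 0.]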

How to batch rename variables in esttab

I am using the community-contributed command esttab with the rename() option.
I have a special situation in which I run multiple regressions where each regression has a coefficient that is from a different (similarly-named) variable, but each corresponds to the same idea.
Here is a (very contrived) toy example:
sysuse auto, clear
rename weight mpg1
rename mpg mpg2
rename turn mpg3
I want to display the results of three regressions, but have only one line for mpg1, mpg2, and mpg3 (instead of each one appearing on a separate line).
One way to accomplish this is to do the following:
eststo clear
eststo: quietly reg price mpg1
eststo: quietly reg price mpg2
eststo: quietly reg price mpg3
esttab, rename(mpg1 mpg mpg2 mpg mpg3 mpg)
Can I rename all of the variables at the same time by doing something such as rename(mpg* mpg)?
If I want to run a large number of regressions, it becomes more advantageous to do this instead of writing them all out by hand.
Stata's rename group command can handle abbreviations and wildcards, but the rename() option of estout cannot. For the latter, you instead need to build the full list of names and store it in a local macro.
Below you can find an improved version of your toy example code:
sysuse auto, clear
eststo clear
rename (weight mpg turn) mpg#, addnumber
forvalues i = 1 / 3 {
    eststo: quietly reg price mpg`i'
    local mpglist `mpglist' mpg`i' mpg
}
esttab, rename(`mpglist')
------------------------------------------------------------
                         (1)             (2)             (3)
                       price           price           price
------------------------------------------------------------
mpg                    2.044***       -238.9***        207.6**
                      (5.42)          (-4.50)          (2.76)
_cons                 -6.707         11253.1***      -2065.0
                     (-0.01)           (9.61)         (-0.69)
------------------------------------------------------------
N                         74              74              74
------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Change in the requirements - store 2 different values in a single database table field

I have a MySQL table with a field which is an unsigned tinyint (max value: 255).
A typical change in the requirements: we would need to create a new field for a bunch of records in that table, but that would be very expensive for the application (lots of changes, a lot of work).
So we are thinking of combining the new value with the old value.
Basically in an unsigned tinyint (max value: 255), we need to store:
an integer that can be 1, 2, 3 or 4
an integer that can span from 1 to 30 (limits included)
The requirement is to get and set the 'combined' value with as simple an algorithm as possible.
How would you do that?
If possible I would like not to use any binary representation.
You could use multiples of 32 to represent 1-4 and add the 1-30 on top.
[1,1] would be 33
[1,2] would be 34
[1,30] would be 62
[2,1] would be 65
[2,30] would be 94
[4,1] would be 129
[4,30] would be 158
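A Python sketch of the corresponding encode/decode arithmetic (function names are made up):
def encode(x, y):
    # x in 1..4, y in 1..30; the maximum is 4*32 + 30 = 158, which fits in a tinyint
    return x * 32 + y

def decode(v):
    # integer division recovers x, the remainder recovers y
    return v // 32, v % 32

assert encode(2, 30) == 94
assert decode(158) == (4, 30)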
This would work and be unambiguous, but in general I really think you should not resort to a hack like this. Add the column and change your code. What will you do with the next requirements change? In the end, your software will be a collection of hacks that can no longer be maintained.
Let's call the two values x and y.
While storing the numbers, perform these steps:
Multiply x by 100.
Add y to the result of step 1.
Store the result of step 2 in the column.
Thus, if x were 3 and y were 15, you would get 315. You can decode that easily: the last two digits give you y, and integer division by 100 gives you x.
But because you have to fit the numbers within 255, you should choose an appropriate multiplier smaller than 100 (though still larger than 30, so the two values never overlap).
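For instance, with a multiplier of 50 (my choice for illustration; any multiplier above 30 that keeps 4*m + 30 within 255 works), a Python sketch:
MULT = 50                   # > 30 so y never overlaps x; 4*50 + 30 = 230 <= 255

def encode(x, y):
    return x * MULT + y

def decode(v):
    return divmod(v, MULT)  # returns (x, y)

assert encode(3, 15) == 165
assert decode(165) == (3, 15)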

The most efficient way to calculate an integral in a dataset range

I have an array of 10 rows by 20 columns. Each column corresponds to a data set that cannot be fitted with any sort of continuous mathematical function (it is a series of numbers derived experimentally). I would like to calculate the integral of each column between row 4 and row 8, then store the results in a new array (20 rows x 1 column).
I have tried using different scipy.integrate functions (e.g. quad, trapz, ...).
The problem is that, from what I understand, scipy.integrate must be applied to functions, and I am not sure how to convert each column of my initial array into a function. As an alternative, I thought of calculating the average of each column between row 4 and row 8, multiplying that number by 4 (i.e. 8 - 4 = 4, the x-interval), and then storing the result in my final 20x1 array. The problem is that I don't know how to calculate the average over a given range of rows. The questions I am asking are:
Which method is more efficient/straightforward?
Can integrals be calculated over a data set like the one that I have described?
How do I calculate the average over a range of rows?
Since you know only the data points, the best choice is to use trapz (the trapezoidal approximation to the integral, based on the data points you know).
You most likely don't want to convert your data sets to functions, and with trapz you don't need to.
So if I understand correctly, you want to do something like this:
import numpy as np
# x-coordinates for the data points
x = np.array([0, 0.4, 1.6, 1.9, 2, 4, 5, 9, 10])
# some random data: 3 data sets sharing the same x-coordinates
y = np.zeros([len(x), 3])
y[:, 0] = 123
y[:, 1] = 1 + x
y[:, 2] = np.cos(x / 5.)
print(y)
# compute approximations for integral(dataset, x=0..10) for datasets i=0,1,2
yi = np.trapz(y, x[:, np.newaxis], axis=0)
# what happens here: x must be an array of the same shape as y (or broadcastable to it);
# newaxis adds a new "virtual" axis to x, in effect saying that the
# x-coordinates are the same for each data set
# approximations of the integrals based on the datasets
# (here we also know the exact values, so print them too)
print(yi[0], 123 * 10)
print(yi[1], 10 + 10 * 10 / 2.)
print(yi[2], np.sin(10. / 5.) * 5.)
To get the sum of the entries in rows 4 to 8 (including both ends) of each column, use
import numpy
a = numpy.arange(200).reshape(10, 20)
a[4:9].sum(axis=0)
(The arange/reshape line is just there to create an example array of the desired shape.)
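Tying this back to the integral question, the same row slice works directly with trapz (a sketch; it assumes unit spacing between rows, so pass an explicit x array otherwise):
import numpy
a = numpy.arange(200).reshape(10, 20)     # example array: 10 rows x 20 columns
integrals = numpy.trapz(a[4:9], axis=0)   # trapezoidal integral of each column over rows 4..8
means = a[4:9].mean(axis=0)               # average of each column over the same rows
print(integrals.shape)                    # (20,): one value per column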