Finding the popular (frequent) number in Tcl

I am essentially looking for the "mode", i.e. the value that appears most frequently in a data set.
Here is my test code in Tcl:
proc mode {list} {
    foreach val $list {dict incr h $val}
    set h [lsort -stride 2 -real -index 1 -decreasing $h]
    return [lindex $h 0]
}
set a [list 0 0 0 0.4 0.4 0.4 0.4 0.4 0.1 0.2 0.4 0.35 0.29 0.19 0.15 0.45 0.39 0.39 0.39 0.39 0.39 0.39 0.39]
set m [mode $a]
puts $m
Is it a good/efficient way for a large dataset?
How can I remove those "0" elements before the mode calculation?

Is it a good/efficient way for a large dataset?
Define "large" and measure!
As others have pointed out, your combination of dict incr / lsort -integer is a solid choice if length were the only factor. The timings below are average microseconds per call, as reported by timerate:
 Length | Method 1 | Method 2 | Ratio (m1 / m2)
 -------+----------+----------+----------------
     46 |      4.4 |      7.9 |      0.55
     92 |      7.2 |     14.6 |      0.49
    184 |     12.7 |     28.2 |      0.45
    368 |     23.8 |     55.1 |      0.43
    736 |     46.2 |    108.5 |      0.43
   1472 |     90.6 |    217.3 |      0.42
   2944 |    180.2 |    428.3 |      0.42
   5888 |    359.8 |    857.4 |      0.42
  11776 |    715.9 |   1704.9 |      0.42
  23552 |   1437.0 |   3408.9 |      0.42
  47104 |   2878.2 |   6855.4 |      0.42
  94208 |   5741.7 |  13664.4 |      0.42
Method 1:
proc mode {list} {
    set h [dict create]
    foreach val $list {dict incr h $val}
    set h [lsort -stride 2 -integer -index 1 -decreasing $h]
    return [lindex $h 0]
}
Method 2:
proc mode2 {list} {
    set maxCount 0
    set mode ""
    foreach val $list {
        dict incr h $val
        set count [dict get $h $val]
        if {$count > $maxCount} {
            set maxCount $count
            set mode $val
        }
    }
    return $mode
}
You mentioned "real" numbers. In many cases, the distribution of levels/bins (unique values) in a collection matters even more than the length. Let's take the worst case in which every measurement point is unique, so the length equals the number of levels/bins:
 Length | Method 1 | Method 2 | Ratio (m1 / m2)
 -------+----------+----------+----------------
     23 |      4.3 |      4.8 |      0.90
     46 |      7.7 |      8.9 |      0.86
     92 |     15.6 |     17.1 |      0.91
    184 |     31.0 |     34.3 |      0.90
    368 |     63.7 |     67.9 |      0.94
    736 |    133.2 |    137.8 |      0.97
   1472 |    300.8 |    300.8 |      1.00
   2944 |    651.5 |    628.0 |      1.04
   5888 |   1560.8 |   1310.8 |      1.19
  11776 |   2886.6 |   2702.7 |      1.07
  23552 |   6408.2 |   5654.6 |      1.13
  47104 |  23331.4 |  19110.5 |      1.22
  94208 |  69697.9 |  55569.2 |      1.25
... then lsort will start becoming the heavy item on your bill. Also, if you want to detect more than one mode (bimodal data etc.), the picture changes. In either case, Method 2 above may become the better candidate for large and heterogeneous data sets, with or without multiple modes.
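If you do need every tied value, a minimal sketch in the spirit of Method 2 (not part of the benchmark above) is to keep all values that share the maximum count:
# Hypothetical helper, not from the original answer: returns every value
# whose count ties the maximum, so bimodal/multimodal data yields all modes.
proc modes {list} {
    set h [dict create]
    foreach val $list {dict incr h $val}
    set maxCount 0
    set result {}
    dict for {val count} $h {
        if {$count > $maxCount} {
            set maxCount $count
            set result [list $val]
        } elseif {$count == $maxCount} {
            lappend result $val
        }
    }
    return $result
}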
This is the driver code for the above measurement tables:
namespace import tcl::unsupported::timerate
timerate -calibrate {}

proc r {} {expr {10+rand()*40}}

# Worst case: (nearly) every value is unique (the second table above)
puts " Length | Method 1 | Method 2 | Ratio (m1 / m2)"
puts " -------+----------+----------+----------------"
set l 23
while {$l <= 100000} {
    set a [list]
    for {set i 0} {$i<$l} {incr i} {lappend a [r]}
    set m1 [lindex [timerate {mode $a} 1000] 0]
    set m2 [lindex [timerate {mode2 $a} 1000] 0]
    set ratio [expr {double($m1) / double($m2)}]
    puts [format " %6d | %8.1f | %8.1f | %9.2f" $l $m1 $m2 $ratio]
    incr l $l
}

# Few unique values: the question's data, doubled repeatedly (the first table above)
puts " Length | Method 1 | Method 2 | Ratio (m1 / m2)"
puts " -------+----------+----------+----------------"
set a [list 0 0 0 0.4 0.4 0.4 0.4 0.4 0.1 0.2 0.4 0.35 0.29 0.19 0.15 0.45 0.39 0.39 0.39 0.39 0.39 0.39 0.39]
while {[llength $a]*2 <= 100000} {
    lappend a {*}$a
    set m1 [lindex [timerate {mode $a} 1000] 0]
    set m2 [lindex [timerate {mode2 $a} 1000] 0]
    set ratio [expr {double($m1) / double($m2)}]
    puts [format " %6d | %8.1f | %8.1f | %9.2f" [llength $a] $m1 $m2 $ratio]
}
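As for the second question (dropping the "0" elements before the mode calculation), a minimal sketch, not covered by the measurements above, is to filter the list first:
# String-wise: keep every element that is not exactly "0"
set a [lsearch -all -inline -not -exact $a 0]

# Or numerically, so values such as 0.0 are dropped as well
set filtered {}
foreach v $a {
    if {$v != 0} {lappend filtered $v}
}
set m [mode $filtered]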

Related

Stata Probit Model Interaction Term Interpretation

For my thesis I am currently investigating the effects of emissions on health on a regional basis. The dependent variable is binary, taking the value 0 if health is good and 1 if health is bad. With the exception of emissions and capita_gdp, every variable is categorical.
Here is an exemplary regression:
probit health i.year i.region##emissions age educ smoker gender urban capita_gdp, robust nofvlabel allbaselevels
Probit regression Number of obs = 67,041
Wald chi2(64) = 5850.28
Prob > chi2 = 0.0000
Log pseudolikelihood = -43026.965 Pseudo R2 = 0.0660
-------------------------------------------------------------------------------------
| Robust
health | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
year |
1 | 0 (base)
2 | -.0236149 .0290446 -0.81 0.416 -.0805412 .0333115
3 | -.0552885 .0343119 -1.61 0.107 -.1225386 .0119615
4 | -.7498958 .0521191 -14.39 0.000 -.8520474 -.6477442
|
region |
1 | 0 (base)
2 | .3424928 .1944582 1.76 0.078 -.0386383 .723624
3 | .6631291 .343445 1.93 0.054 -.0100107 1.336269
4 | 1.005453 .1809361 5.56 0.000 .6508251 1.360081
5 | .5202438 .2705144 1.92 0.054 -.0099547 1.050442
6 | .853456 .2053275 4.16 0.000 .4510215 1.25589
7 | -1.32784 1.329886 -1.00 0.318 -3.934369 1.278688
8 | .2074103 .5587633 0.37 0.710 -.8877457 1.302566
9 | .8778635 1.005655 0.87 0.383 -1.093184 2.848911
10 | .614019 .2058646 2.98 0.003 .2105317 1.017506
11 | 1.103564 .2395228 4.61 0.000 .6341078 1.57302
12 | -.9928198 1.189953 -0.83 0.404 -3.325084 1.339444
13 | .2024027 .3014841 0.67 0.502 -.3884953 .7933008
14 | .8510637 .1966648 4.33 0.000 .4656078 1.23652
15 | -.4685238 1.062594 -0.44 0.659 -2.551171 1.614123
16 | .1222191 .4271317 0.29 0.775 -.7149435 .9593818
17 | 1.777416 .9296525 1.91 0.056 -.0446694 3.599502
18 | .7016812 .3960197 1.77 0.076 -.0745032 1.477866
19 | .2164103 .2324297 0.93 0.352 -.2391436 .6719642
20 | -.8683004 2.079837 -0.42 0.676 -4.944707 3.208106
21 | .6094313 .1969787 3.09 0.002 .2233601 .9955025
22 | .4586692 .2175369 2.11 0.035 .0323048 .8850336
23 | .1376296 .316405 0.43 0.664 -.4825129 .7577721
24 | .8800929 .2139805 4.11 0.000 .4606989 1.299487
25 | .5008748 .181908 2.75 0.006 .1443417 .8574079
26 | .7885192 .2055236 3.84 0.000 .3857004 1.191338
27 | .8370192 .2066431 4.05 0.000 .4320061 1.242032
28 | .0342872 .3383975 0.10 0.919 -.6289597 .697534
|
emissions | .2331187 .0475761 4.90 0.000 .1398713 .3263662
|
region#c.emissions|
1 | 0 (base)
2 | -.1763598 .0473856 -3.72 0.000 -.2692338 -.0834858
3 | .0902526 .3483855 0.26 0.796 -.5925705 .7730757
4 | -.2545669 .0436166 -5.84 0.000 -.3400539 -.1690798
5 | -.1903919 .0525988 -3.62 0.000 -.2934837 -.0873002
6 | -.2595892 .0565328 -4.59 0.000 -.3703914 -.148787
7 | .3660934 .3615611 1.01 0.311 -.3425534 1.07474
8 | -.1810636 .0873587 -2.07 0.038 -.3522836 -.0098436
9 | -.2360667 .2817683 -0.84 0.402 -.7883225 .316189
10 | -.2362498 .0452001 -5.23 0.000 -.3248403 -.1476593
11 | -.2986525 .0606014 -4.93 0.000 -.4174291 -.179876
12 | .4210453 .4355456 0.97 0.334 -.4326084 1.274699
13 | -.1393217 .063414 -2.20 0.028 -.2636109 -.0150324
14 | -.2428271 .0452505 -5.37 0.000 -.3315166 -.1541377
15 | -.1078827 .1281398 -0.84 0.400 -.359032 .1432667
16 | -.1121361 .0991541 -1.13 0.258 -.3064746 .0822024
17 | -.3670531 .1360779 -2.70 0.007 -.6337609 -.1003453
18 | -.241021 .1572069 -1.53 0.125 -.5491408 .0670988
19 | -.2128744 .0452858 -4.70 0.000 -.3016328 -.1241159
20 | .103139 .4313025 0.24 0.811 -.7421983 .9484763
21 | -.217597 .0532092 -4.09 0.000 -.3218851 -.1133089
22 | -.1796928 .0509009 -3.53 0.000 -.2794568 -.0799288
23 | -.1510797 .0529603 -2.85 0.004 -.2548799 -.0472795
24 | -.2589344 .0509662 -5.08 0.000 -.3588264 -.1590425
25 | -.231851 .0448358 -5.17 0.000 -.3197276 -.1439745
26 | -.2411263 .0442314 -5.45 0.000 -.3278182 -.1544344
27 | -.2452313 .0465597 -5.27 0.000 -.3364867 -.153976
28 | -.0563099 .1191566 -0.47 0.637 -.2898525 .1772328
|
age | .1085835 .0049886 21.77 0.000 .098806 .1183609
educ | -.1802489 .0107034 -16.84 0.000 -.2012272 -.1592707
smoker | .080728 .0145963 5.53 0.000 .0521198 .1093362
gender | -.2019473 .0145416 -13.89 0.000 -.2304483 -.1734463
urban | -.1362217 .0112233 -12.14 0.000 -.1582189 -.1142245
capita_gdp | -8.36e-06 .0000194 -0.43 0.667 -.0000464 .0000297
_cons | -.4987429 .1638654 -3.04 0.002 -.8199132 -.1775726
-------------------------------------------------------------------------------------
My question is: how exactly can I interpret the coefficients of emissions and of the interaction region#c.emissions on the dependent variable? To my understanding, the coefficient of emissions applies to region 1 (the base level), and the effect of emissions in region 2 is lower than in region 1 by 0.176?
Correct. Two extra things worth noting:
Interactions work both ways: the interaction coefficient tells you that the emissions effect is 0.176 smaller in region 2, but also that the effect of being in region 2 is 0.176 smaller when emissions are one unit larger. That also means you cannot directly interpret either coefficient involved in the interaction (region and emissions) on its own, as they depend on each other.
Stata has excellent margins and marginsplot commands that calculate for you what the effects are at particular levels of region and/or emissions. They have a bit of a learning curve, but once you get the hang of them you can produce graphs illustrating the interaction effect that are far more informative than a long regression table.
There are many tutorials online on how to use margins, and there is also this presentation by Ben Jann.
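As a hedged sketch (not part of the original answer, and assuming emissions enters the model as continuous, as the region#c.emissions term in the output suggests), the region-specific effect of emissions and a plot of it could be obtained along these lines:
* Average marginal effect of emissions at each level of region
margins region, dydx(emissions)

* Plot the region-specific effects
marginsplot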

Print sum of rows and other row value for each column in awk

I have a csv file structured as the one below:
          |  Taiwan  |       |   US   |
          |  ASUS    |  MSI  |  DELL  |  HP
--------------------------------------------
CPU       |   50     |   49  |   43   |  65
GPU       |   60     |   64  |   75   |  54
HDD       |   75     |   70  |   65   |  46
RAM       |   60     |   79  |   64   |  63
assembled |  235     |  244  |  254   | 269
and I have to use an awk script to print a comparison between the sum of the prices of the individual computer parts (rows 3 to 6) versus the assembled computer price (row 7), also displaying the country each brand comes from. The printed result in the terminal should be something like:
Taiwan Asus 245 235
Taiwan MSI 262 244
US DELL 247 254
US HP 228 269
Here the third column is the sum of the CPU, GPU, HDD and RAM prices, and the fourth column is the assembled price from row 7 for each computer brand.
So far I have been able to sum the individual columns by adapting the solution provided in the post I link below, but I don't know how to display the result in the desired format. Could anyone help me with this? I'm a bit desperate at this point.
Sum all values in each column bash
This is the content of the original csv file represented at the top of this message:
,Taiwan,,US,
,ASUS,MSI,DELL,HP
CPU,50,49,43,65
GPU,60,64,75,54
HDD,75,70,65,46
RAM,60,79,64,63
assembled,235,244,254,269
Thank you very much in advance.
$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
NR == 2 {
    # p[] still holds the country record (line 1); empty country cells
    # are filled from the cell to their left
    for (i=2; i<=NF; i++) {
        corp[i] = (p[i] == "" ? p[i-1] : p[i]) OFS $i
    }
}
NR > 2 {
    # sum the previous record's values: the brand line contributes 0 and
    # the final "assembled" record is never added to the totals
    for (i=2; i<=NF; i++) {
        tot[i] += p[i]
    }
}
{ split($0,p) }   # remember the current record's fields for the next one
END {
    # p[] now holds the "assembled" record
    for (i=2; i<=NF; i++) {
        print corp[i], tot[i], p[i]
    }
}
$ awk -f tst.awk file
Taiwan ASUS 245 235
Taiwan MSI 262 244
US DELL 247 254
US HP 228 269

Regression with all variables without explicitly declaring them

I have a dataset that I would like to run a regression on in Stata. I want to make one of the dummy variables the base, so I use ib1.month1 in the regress command.
Is it possible to include all the other variables in the dataset in my regression without explicitly writing out each one?
You can use the ds command to build the variable list, excluding the dependent variable and the factor variable:
sysuse auto, clear
drop make
ds price foreign, not
regress price ib1.foreign `r(varlist)'
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(10, 58) = 8.66
Model | 345416162 10 34541616.2 Prob > F = 0.0000
Residual | 231380797 58 3989324.09 R-squared = 0.5989
-------------+---------------------------------- Adj R-squared = 0.5297
Total | 576796959 68 8482308.22 Root MSE = 1997.3
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
foreign |
Domestic | -3334.848 957.2253 -3.48 0.001 -5250.943 -1418.754
mpg | -21.80518 77.3599 -0.28 0.779 -176.6578 133.0475
rep78 | 184.7935 331.7921 0.56 0.580 -479.3606 848.9476
headroom | -635.4921 383.0243 -1.66 0.102 -1402.198 131.2142
trunk | 71.49929 95.05012 0.75 0.455 -118.7642 261.7628
weight | 4.521161 1.411926 3.20 0.002 1.694884 7.347438
length | -76.49101 40.40303 -1.89 0.063 -157.3665 4.38444
turn | -114.2777 123.5374 -0.93 0.359 -361.5646 133.0092
displacement | 11.54012 8.378315 1.38 0.174 -5.230896 28.31115
gear_ratio | -318.6479 1124.34 -0.28 0.778 -2569.259 1931.964
_cons | 13124.34 6726.3 1.95 0.056 -339.8103 26588.5
------------------------------------------------------------------------------

How can I specify the base level of a factor variable?

I have data for 2000-2016 and I am trying to estimate the following regression:
xtset id
xtreg lnp i.year i.year#fp, fe vce(robust)
However, when I do this, Stata omits 2008 because of collinearity.
Is there a way to specify which year is omitted?
More generally, you can specify the omitted level of a factor variable (i.e. the base) by using the ib operator (see also help fvvarlist).
Below is a reproducible example using Stata's toy dataset nlswork:
webuse nlswork, clear
xtset idcode
Using 77 as the base year:
xtreg ln_wage ib77.year age, fe vce(robust)
Fixed-effects (within) regression Number of obs = 28,510
Group variable: idcode Number of groups = 4,710
R-sq: Obs per group:
within = 0.1060 min = 1
between = 0.0914 avg = 6.1
overall = 0.0805 max = 15
F(15,4709) = 69.49
corr(u_i, Xb) = 0.0467 Prob > F = 0.0000
(Std. Err. adjusted for 4,710 clusters in idcode)
------------------------------------------------------------------------------
| Robust
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
68 | -.108365 .1111117 -0.98 0.329 -.3261959 .1094659
69 | -.0335029 .0995142 -0.34 0.736 -.2285973 .1615915
70 | -.0604953 .0867605 -0.70 0.486 -.2305866 .1095959
71 | -.0218073 .0742761 -0.29 0.769 -.1674232 .1238087
72 | -.0226893 .0622792 -0.36 0.716 -.1447857 .0994071
73 | -.0203581 .049851 -0.41 0.683 -.1180894 .0773732
75 | -.0305043 .0259707 -1.17 0.240 -.081419 .0204104
78 | .0225868 .0147272 1.53 0.125 -.0062854 .0514591
80 | .0058999 .0381391 0.15 0.877 -.0688706 .0806704
82 | .0006801 .0622403 0.01 0.991 -.1213399 .1227001
83 | .0127622 .074435 0.17 0.864 -.1331653 .1586897
85 | .0381987 .0989316 0.39 0.699 -.1557535 .2321508
87 | .0298993 .1237839 0.24 0.809 -.2127751 .2725736
88 | .0716091 .1397635 0.51 0.608 -.2023927 .345611
|
age | .0125992 .0123091 1.02 0.306 -.0115323 .0367308
_cons | 1.312096 .3453967 3.80 0.000 .6349571 1.989235
-------------+----------------------------------------------------------------
sigma_u | .4058746
sigma_e | .30300411
rho | .64212421 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Using 80 as the base year:
xtreg ln_wage ib80.year age, fe vce(robust)
Fixed-effects (within) regression Number of obs = 28,510
Group variable: idcode Number of groups = 4,710
R-sq: Obs per group:
within = 0.1060 min = 1
between = 0.0914 avg = 6.1
overall = 0.0805 max = 15
F(15,4709) = 69.49
corr(u_i, Xb) = 0.0467 Prob > F = 0.0000
(Std. Err. adjusted for 4,710 clusters in idcode)
------------------------------------------------------------------------------
| Robust
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
year |
68 | -.1142649 .1480678 -0.77 0.440 -.4045471 .1760172
69 | -.0394028 .136462 -0.29 0.773 -.3069323 .2281266
70 | -.0663953 .1237179 -0.54 0.592 -.3089402 .1761497
71 | -.0277072 .1112026 -0.25 0.803 -.2457164 .190302
72 | -.0285892 .0991208 -0.29 0.773 -.2229124 .165734
73 | -.026258 .0866489 -0.30 0.762 -.1961303 .1436142
75 | -.0364042 .0625743 -0.58 0.561 -.1590791 .0862706
77 | -.0058999 .0381391 -0.15 0.877 -.0806704 .0688706
78 | .0166869 .0258678 0.65 0.519 -.0340261 .0673999
82 | -.0052198 .0257713 -0.20 0.840 -.0557437 .0453041
83 | .0068623 .0378166 0.18 0.856 -.0672759 .0810005
85 | .0322987 .0620538 0.52 0.603 -.0893558 .1539533
87 | .0239993 .0868397 0.28 0.782 -.1462471 .1942457
88 | .0657092 .1028815 0.64 0.523 -.1359868 .2674052
|
age | .0125992 .0123091 1.02 0.306 -.0115323 .0367308
_cons | 1.317996 .3824809 3.45 0.001 .5681546 2.067838
-------------+----------------------------------------------------------------
sigma_u | .4058746
sigma_e | .30300411
rho | .64212421 (fraction of variance due to u_i)
------------------------------------------------------------------------------

Constraining slope

I'm a beginner in Stata. I'm trying to run the following regression:
regress logy logI logh logL
but I would like to constrain the slope of logh to be one. Can someone tell me the command for this?
There are at least three ways to do this in Stata.
1) Use constrained linear regression:
. sysuse auto
(1978 Automobile Data)
. constraint 1 mpg = 1
. cnsreg price mpg weight, constraints(1)
Constrained linear regression Number of obs = 74
Root MSE = 2502.5449
( 1) mpg = 1
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | 1 (constrained)
weight | 2.050071 .3768697 5.44 0.000 1.298795 2.801347
_cons | -46.14764 1174.541 -0.04 0.969 -2387.551 2295.256
------------------------------------------------------------------------------
2) Variable transformation (suggested by whuber in a comment):
. gen price2 = price - mpg
. reg price2 weight
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 29.59
Model | 185318670 1 185318670 Prob > F = 0.0000
Residual | 450916627 72 6262730.93 R-squared = 0.2913
-------------+------------------------------ Adj R-squared = 0.2814
Total | 636235297 73 8715552.01 Root MSE = 2502.5
------------------------------------------------------------------------------
price2 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | 2.050071 .3768697 5.44 0.000 1.298795 2.801347
_cons | -46.14764 1174.541 -0.04 0.969 -2387.551 2295.256
------------------------------------------------------------------------------
3) Using a GLM with an offset:
. glm price weight , family(gaussian) link(identity) offset(mpg)
Iteration 0: log likelihood = -683.04238
Iteration 1: log likelihood = -683.04238
Generalized linear models No. of obs = 74
Optimization : ML Residual df = 72
Scale parameter = 6262731
Deviance = 450916626.9 (1/df) Deviance = 6262731
Pearson = 450916626.9 (1/df) Pearson = 6262731
Variance function: V(u) = 1 [Gaussian]
Link function : g(u) = u [Identity]
AIC = 18.51466
Log likelihood = -683.0423847 BIC = 4.51e+08
------------------------------------------------------------------------------
| OIM
price | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | 2.050071 .3768697 5.44 0.000 1.31142 2.788722
_cons | -46.14764 1174.541 -0.04 0.969 -2348.205 2255.909
mpg | 1 (offset)
------------------------------------------------------------------------------
The glm route could also handle the log transformation of your outcome for you if you change the link and family options appropriately.