statsmodels OLS gives parameters despite perfect multicollinearity

Assume the following df:
      ib  c  d1  d2
0   1.14  1   1   0
1   1.00  1   1   0
2   0.71  1   1   0
3   0.60  1   1   0
4   0.66  1   1   0
5   1.00  1   1   0
6   1.26  1   1   0
7   1.29  1   1   0
8   1.52  1   1   0
9   1.31  1   1   0
10  0.89  1   0   1
d1 and d2 are perfectly collinear with the constant (d1 + d2 = c). Now I estimate the following regression model:
import statsmodels.api as sm
reg = sm.OLS(df['ib'], df[['c', 'd1', 'd2']]).fit().summary()
reg
This gives me the following output:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                     ib   R-squared:                       0.087
Model:                            OLS   Adj. R-squared:                 -0.028
Method:                 Least Squares   F-statistic:                    0.7590
Date:                Thu, 17 Nov 2022   Prob (F-statistic):              0.409
Time:                        12:19:34   Log-Likelihood:                -1.5470
No. Observations:                  10   AIC:                             7.094
Df Residuals:                       8   BIC:                             7.699
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c              0.7767      0.111      7.000      0.000       0.521       1.033
d1             0.2433      0.127      1.923      0.091      -0.048       0.535
d2             0.5333      0.213      2.499      0.037       0.041       1.026
==============================================================================
Omnibus:                        0.257   Durbin-Watson:                   0.760
Prob(Omnibus):                  0.879   Jarque-Bera (JB):                0.404
Skew:                           0.043   Prob(JB):                        0.817
Kurtosis:                       2.019   Cond. No.                     8.91e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.34e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
However, including c, d1 and d2 together is the well-known dummy variable trap which, from my understanding, should make it impossible to estimate the model. Why is this not the case here?
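A likely explanation: statsmodels does not solve the normal equations by inverting X'X. By default, OLS.fit() uses method='pinv', i.e. the Moore-Penrose pseudoinverse, which returns the minimum-norm least-squares solution even when the design matrix is singular; that is why you get the eigenvalue warning in note [2] instead of an error. A minimal sketch of the mechanism, rebuilding the data above:
import numpy as np
import statsmodels.api as sm

# rebuild the design matrix from the df above: d1 + d2 = c exactly,
# so X has 3 columns but only rank 2
d1 = np.r_[np.ones(10), 0.0]
X = np.column_stack([np.ones(11), d1, 1.0 - d1])
y = np.array([1.14, 1.0, 0.71, 0.6, 0.66, 1.0, 1.26, 1.29, 1.52, 1.31, 0.89])

print(np.linalg.matrix_rank(X))      # 2: X'X is singular

# the pseudoinverse picks the minimum-norm solution out of the infinitely
# many least-squares solutions, so no error is raised
beta_pinv = np.linalg.pinv(X) @ y
beta_sm = sm.OLS(y, X).fit().params  # fit() defaults to method='pinv'
print(beta_pinv, beta_sm)            # same coefficients
One telltale sign of the minimum-norm solution in your summary: it is orthogonal to the null-space direction (1, -1, -1), so the coefficient on c equals the sum of the d1 and d2 coefficients (0.2433 + 0.5333 ≈ 0.7767).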

Related

ANOVA table in R: F-value does not "match the math"

I was playing around with a simple linear model when I noticed that, in the ANOVA table, the ratio MSreg/MSres does not exactly correspond to the F value. Indeed, the two values are very similar but not the same.
Here is my script:
#quick view of the dataset
> head(my_data)
  Diameter Height
1    0.325  0.080
2    0.320  0.100
3    0.280  0.110
4    0.125  0.040
5    0.400  0.135
6    0.335  0.100
#setting up the lm()
> ls1 <- lm(Diameter~Height, data=my_data)
> anova(ls1)
Analysis of Variance Table

Response: Diameter
          Df  Sum Sq Mean Sq F value    Pr(>F)
Height     1 0.82415 0.82415  602.63 < 2.2e-16 ***
Residuals 98 0.13402 0.00137
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here 0.82415/0.00137 = 601.5693, which is not the F value in the table. Is there a particular reason for that?
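The usual reason is display rounding: R computes the F value from full-precision mean squares, while the printed table rounds Mean Sq to five decimals. Reproducing the arithmetic (in Python, using the Sum Sq values from the table):
# the printed Mean Sq (0.00137) is rounded; R divides by the
# full-precision residual mean square when computing F
ss_reg, ss_res, df_res = 0.82415, 0.13402, 98

ms_res = ss_res / df_res          # 0.0013676..., printed as 0.00137
print(ss_reg / ms_res)            # ~602.65, the table's F up to rounding of Sum Sq
print(ss_reg / round(ms_res, 5))  # 601.5693, the hand calculation above
Inspecting the unrounded entries in R itself (e.g. anova(ls1)$"Mean Sq") confirms this.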

which post-hoc test after welch-anova

I'm doing the statistical evaluation for my master's thesis. The Levene test was significant, so I did a Welch ANOVA, which was also significant. Now I have tried the Games-Howell post hoc test, but it didn't work.
Can anybody send me the exact functions I have to run in R to do the Games-Howell post hoc test and to get some kind of compact letter display showing which treatments are not significantly different from each other? I also wanted to ask whether I did the Welch ANOVA the right way (you can find the R output below).
Here is the output of what I have done so far for the statistical evaluation:
'data.frame': 30 obs. of 3 variables:
 $ Dauer: Factor w/ 6 levels "0","2","4","6",..: 1 2 3 4 5 6 1 2 3 4 ...
 $ WH   : Factor w/ 5 levels "r1","r2","r3",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ TSO2 : num 107 86 98 97 88 95 93 96 96 99 ...

> leveneTest(TSO2 ~ Dauer, data = TSO2R)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)
group  5  3.3491 0.01956 *
      24
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> oneway.test(TSO2 ~ Dauer, data = TSO2R, var.equal = FALSE)  ### Welch ANOVA
One-way analysis of means (not assuming equal variances)
data: TSO2 and Dauer
F = 5.7466, num df = 5.000, denom df = 10.685, p-value = 0.00807
Thank you very much!
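For reference, both steps are also available in Python via the pingouin package (in R, rstatix::games_howell_test() is one commonly used implementation). A sketch on a hypothetical stand-in for the thesis data:
import numpy as np
import pandas as pd
import pingouin as pg

# hypothetical stand-in: 6 levels of Dauer, 5 replicates each
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'Dauer': np.repeat(['0', '2', '4', '6', '8', '10'], 5),
    'TSO2': rng.normal(95, 5, 30).round(),
})

# Welch ANOVA, the analogue of oneway.test(..., var.equal = FALSE)
print(pg.welch_anova(data=df, dv='TSO2', between='Dauer'))

# Games-Howell pairwise comparisons; the pval column shows which
# pairs of treatments differ significantly
print(pg.pairwise_gameshowell(data=df, dv='TSO2', between='Dauer'))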

Where is the "Corr" column for my lmer summary under "Random effects?"

I'm using lmer from the lme4 package to build a model and want to check for correlations between random effects.
First I build a tibble:
library(tibble)

id <- rep(1:6, each = 4)
group <- rep(c("A", "B"), each = 12)
type <- rep(c("pencil", "pencil", "pen", "pen"), times = 6)
color <- rep(c("blue", "red"), times = 12)
dv <- c(-24.3854453, 17.0358639, -15.5174479, 8.6462489, -7.0561166, 3.3524410, 21.6199364, -6.1020999, 13.2464223, 20.3740206, 22.8571793, -6.6159629, 18.7898553, -8.2504319, 17.9571641, 2.9555213, -19.5516738, -0.5845135, 9.6041710, -4.1301420, 4.1740094, -24.2496521, 7.4432948, -0.8290391)
# tibble() keeps dv numeric (as_tibble(cbind(...)) would coerce everything to character)
sample_data <- tibble(id, group, type, color, dv)
Here is my sample_data:
id group type color dv
1 A pencil blue 0.05925979
1 A pencil red 4.60326151
1 A pen blue -20.72000620
1 A pen red -15.27612843
2 A pencil blue -0.68719576
2 A pencil red 16.34200026
2 A pen blue 18.23954687
2 A pen red 21.02837383
3 A pencil blue -22.28695974
3 A pencil red -18.36587259
3 A pen blue -15.13952913
3 A pen red 19.95919637
4 B pencil blue -19.52410412
4 B pencil red -3.25912890
4 B pen blue -12.11669400
4 B pen red 15.93333896
5 B pencil blue -17.93575204
5 B pencil red -8.58879605
5 B pen blue 8.89757943
5 B pen red -13.42995221
6 B pencil blue 12.03769124
6 B pencil red -10.28876053
6 B pen blue 7.69523239
6 B pen red -2.94621122
Now I run my model and summarize it:
library(lmerTest)  # lme4's lmer plus Satterthwaite t-tests

test.model <- lmer(dv ~ 1 + group * type * color + (1 * type * color | id), data = sample_data, REML = FALSE)
summary(test.model)
Here's my output:
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: dv ~ 1 + group * type * color + (1 * type * color | id)
Data: test
AIC BIC logLik deviance df.resid
204.7 216.5 -92.4 184.7 14
Scaled residuals:
Min 1Q Median 3Q Max
-2.16529 -0.45429 0.09296 0.62406 1.62720
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 4.975 2.23
Residual 124.228 11.15
Number of obs: 24, groups: id, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -0.6679 6.5626 23.8937 -0.102 0.9198
groupA -0.6894 9.2809 23.8937 -0.074 0.9414
typepencil -10.3603 9.1005 18.0000 -1.138 0.2699
colorblue 12.3361 9.1005 18.0000 1.356 0.1920
groupA:typepencil 25.3050 12.8700 18.0000 1.966 0.0649 .
groupA:colorblue -1.3256 12.8700 18.0000 -0.103 0.9191
typepencil:colorblue -0.1705 12.8700 18.0000 -0.013 0.9896
groupA:typepencil:colorblue -30.4925 18.2010 18.0000 -1.675 0.1112
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) groupA typpnc colrbl grpA:t grpA:c typpn:
groupA -0.707
typepencil -0.693 0.490
colorblue -0.693 0.490 0.500
grpA:typpnc 0.490 -0.693 -0.707 -0.354
gropA:clrbl 0.490 -0.693 -0.354 -0.707 0.500
typpncl:clr 0.490 -0.347 -0.707 -0.707 0.500 0.500
grpA:typpn: -0.347 0.490 0.500 0.500 -0.707 -0.707 -0.707
I want to check the correlations for random effects, but I don't see the usual "Corr" column (it usually appears next to "Std.Dev." in the output under "Random effects"). Where is it?
I think that the problem stems from the random effects part of your model. You currently have:
(1 * type * color | id)
However, the standard formula is:
(1 + type * color | id)
When I run this, I get an error about the number of observations being less than the number of random effects (the interaction makes the random effects structure too complex for your sample dataset). Using a less complex random effects structure, (1 + type + color | id), I am able to get the Corr column that you are looking for:
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: dv ~ 1 + group * type * color + (1 + type + color | id)
Data: sample_data
AIC BIC logLik deviance df.resid
203.8 221.5 -86.9 173.8 9
Scaled residuals:
Min 1Q Median 3Q Max
-1.5320 -0.7217 0.1363 0.7089 1.3920
Random effects:
Groups Name Variance Std.Dev. Corr
id (Intercept) 130.22 11.411
typepencil 15.49 3.936 0.42
colorred 219.98 14.832 -1.00 -0.37
Residual 41.79 6.464
Number of obs: 24, groups: id, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 9.653 7.572 6.888 1.275 0.24368
groupB 2.015 10.708 6.888 0.188 0.85617
typepencil -15.718 5.747 14.358 -2.735 0.01582 *
colorred -11.010 10.059 7.985 -1.095 0.30562
groupB:typepencil 5.187 8.127 14.358 0.638 0.53333
groupB:colorred -1.326 14.226 7.985 -0.093 0.92805
typepencil:colorred 30.663 7.465 11.996 4.108 0.00145 **
groupB:typepencil:colorred -30.492 10.556 11.996 -2.889 0.01362 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) groupB typpnc colrrd grpB:t grpB:c typpn:
groupB -0.707
typepencil -0.174 0.123
colorred -0.922 0.652 0.117
grpB:typpnc 0.123 -0.174 -0.707 -0.083
gropB:clrrd 0.652 -0.922 -0.083 -0.707 0.117
typpncl:clr 0.246 -0.174 -0.649 -0.371 0.459 0.262
grpB:typpn: -0.174 0.246 0.459 0.262 -0.649 -0.371 -0.707
convergence code: 0
Model failed to converge with max|grad| = 0.00237651 (tol = 0.002, component 1)
I still get a warning about the model failing to converge. This is likely again due to the random effects structure being too complex for your sample dataset: lmer(dv ~ 1 + group * type * color + (1 | id), data = sample_data, REML = FALSE) gives no such warning.
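As an aside for readers working in Python rather than R, statsmodels can fit an analogous random-slope model; the estimated variances and covariances of the random effects (the information behind lmer's Corr column) appear in the random-effects block of the summary. A rough sketch on hypothetical data shaped like sample_data (expect convergence warnings with this little data):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical data with the same layout as sample_data above
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'id': np.repeat(np.arange(6), 4),
    'group': np.repeat(['A', 'B'], 12),
    'type': np.tile(['pencil', 'pencil', 'pen', 'pen'], 6),
    'color': np.tile(['blue', 'red'], 12),
    'dv': rng.normal(0, 10, 24),
})

# random intercept plus random slopes for type and color within id,
# the analogue of (1 + type + color | id)
model = smf.mixedlm('dv ~ group * type * color', df,
                    groups='id', re_formula='~type + color')
print(model.fit().summary())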
Hope that helps!

Custom datetime parsing to combine date and time after reading csv - Pandas

Upon reading a text file, I am presented with an odd format where date and time are contained in separate columns, as follows (the file uses tabs as separators).
temp
room 1
Date Time simulation
Fri, 01/Jan 00:30 11.94
01:30 12
02:30 12.04
03:30 12.06
04:30 12.08
05:30 12.09
06:30 11.99
07:30 12.01
08:30 12.29
09:30 12.46
10:30 12.35
11:30 12.25
12:30 12.19
13:30 12.12
14:30 12.04
15:30 11.96
16:30 11.9
17:30 11.92
18:30 11.87
19:30 11.79
20:30 12
21:30 12.16
22:30 12.27
23:30 12.3
Sat, 02/Jan 00:30 12.25
01:30 12.19
02:30 12.14
03:30 12.11
etc.
I would like to:
parse date and time over two columns ([0], [1]);
shift all timestamps 30 minutes earlier, i.e. replace :30 with :00.
I have used the following code:
from datetime import datetime
import pandas as pd

timeparse = lambda x: datetime.strptime(x.replace(':30', ':00'), '%H:%M')
df = pd.read_csv('Chart_1.txt',
                 sep='\t',
                 skiprows=1,
                 date_parser=timeparse,
                 parse_dates=['Time'],
                 header=1)
This does seem to parse times but not dates (obviously, as this is what I told it to do).
Also, skipping rows is useful for finding the Date and Time headers, but it discards the headers temp and room 1, which I need.
You can use:
import pandas as pd

df = pd.read_csv('Chart_1.txt', sep='\t')

# get the temperature header from the third column into variable temp
temp = df.columns[2]
print(temp)
Dry resultant temperature (°C)

# get aps from the second row, third column
aps = df.iloc[1, 2]
print(aps)
AE4854c_Campshill_openings reduced_communal areas increased openings2.aps

# create mask from the first column - all values containing / are dates
mask = df.iloc[:, 0].str.contains('/', na=False)
# shift one column to the right all rows that do NOT contain dates
df1 = df[~mask].shift(1, axis=1)
# get the rows with dates
df2 = df[mask]
# concat df1 and df2, sort the unsorted index
df = pd.concat([df1, df2]).sort_index()
# build new column names: the first 3 are custom, the rest come from
# the first row, fourth column onward
df.columns = ['date', 'time', 'no name'] + df.iloc[0, 3:].tolist()
# remove the first 2 rows
df = df[2:]
# fill NaN values in column date by forward filling
df.date = df.date.ffill()
# convert the date column to datetime
df.date = pd.to_datetime(df.date, format='%a, %d/%b')
# replace 30 minutes with 00
df.time = df.time.str.replace(':30', ':00')
print(df.head())
date time no name 3F_T09_SE_SW_Bed1 GF_office_S GF_office_W_tea \
2 1900-01-01 00:00 11.94 11.47 14.72 16.66
3 1900-01-01 01:00 12.00 11.63 14.83 16.69
4 1900-01-01 02:00 12.04 11.73 14.85 16.68
5 1900-01-01 03:00 12.06 11.80 14.83 16.65
6 1900-01-01 04:00 12.08 11.84 14.79 16.62
GF_Act.Room GF_Communal areas GF_Reception GF_Ent Lobby ... \
2 17.41 12.74 12.93 10.85 ...
3 17.45 12.74 13.14 11.00 ...
4 17.44 12.71 13.23 11.09 ...
5 17.41 12.68 13.27 11.16 ...
6 17.36 12.65 13.28 11.21 ...
2F_S01_SE_SW_Bedroom 2F_S01_SE_SW_Int Circ 2F_S01_SE_SW_Storage_int circ \
2 12.58 12.17 12.54
3 12.64 12.22 12.49
4 12.68 12.27 12.48
5 12.70 12.30 12.49
6 12.71 12.31 12.51
GF_G01_SE_SW_Bedroom GF_G01_SE_SW_Storage_Bed 3F_T09_SE_SW_Bathroom \
2 14.51 14.61 11.49
3 14.55 14.59 11.50
4 14.56 14.59 11.52
5 14.55 14.58 11.54
6 14.54 14.57 11.56
3F_T09_SE_SW_Circ 3F_T09_SE_SW_Storage_int circ GF_Lounge GF_Cafe
2 11.52 11.38 12.83 12.86
3 11.56 11.35 13.03 13.03
4 11.61 11.36 13.13 13.13
5 11.65 11.39 13.17 13.17
6 11.68 11.42 13.18 13.18
[5 rows x 31 columns]
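To finish the original goal of one combined timestamp, the two columns can then be added together; a small follow-up sketch (the column name datetime is my own addition, and the year stays 1900 because the file carries no year information):
# add the HH:MM time as a timedelta to the parsed date
df['datetime'] = df['date'] + pd.to_timedelta(df['time'] + ':00')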

How to create a scikit learn dataset?

I have an array where the first column contains the classes (as integers) and the rest of the columns are features.
Something like this:
1,0,34,23,2
0,0,21,11,0
3,11,2,11,1
How can I turn this into a scikit-learn compatible dataset, so I can call something like
mydataset = datasets.load_mydataset()?
You can simply use pandas. E.g., if you have copied your dataset to a CSV file such as temp.csv, just label the columns in the file appropriately:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('temp.csv')
In [3]: df
Out[3]:
   Label  f1  f2  f3  f4
0      1   0  34  23   2
1      0   0  21  11   0
2      3  11   2  11   1
In [4]: y_train= df['Label']
In [5]: x_train = df.drop('Label', axis=1)
In [6]: x_train
Out[6]:
   f1  f2  f3  f4
0   0  34  23   2
1   0  21  11   0
2  11   2  11   1
In [7]: y_train
Out[7]:
0    1
1    0
2    3
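If you specifically want the datasets.load_*() calling convention, sklearn's bundled loaders return a sklearn.utils.Bunch, which is easy to mimic. A minimal sketch (load_mydataset and the temp.csv path are just the names used above):
import pandas as pd
from sklearn.utils import Bunch

def load_mydataset(path='temp.csv'):
    # mirror sklearn's bundled loaders: a Bunch exposes .data, .target
    # and .feature_names as attributes
    df = pd.read_csv(path)
    return Bunch(
        data=df.drop('Label', axis=1).to_numpy(),
        target=df['Label'].to_numpy(),
        feature_names=[c for c in df.columns if c != 'Label'],
    )

mydataset = load_mydataset()
X, y = mydataset.data, mydataset.target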