Assume the following df:
ib c d1 d2
0 1.14 1 1 0
1 1.0 1 1 0
2 0.71 1 1 0
3 0.6 1 1 0
4 0.66 1 1 0
5 1.0 1 1 0
6 1.26 1 1 0
7 1.29 1 1 0
8 1.52 1 1 0
9 1.31 1 1 0
10 0.89 1 0 1
d1 and d2 are perfectly collinear with the constant c (d1 + d2 = c in every row). Now I estimate the following regression model:
import statsmodels.api as sm
reg = sm.OLS(df['ib'], df[['c', 'd1', 'd2']]).fit().summary()
reg
This gives me the following output:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: ib R-squared: 0.087
Model: OLS Adj. R-squared: -0.028
Method: Least Squares F-statistic: 0.7590
Date: Thu, 17 Nov 2022 Prob (F-statistic): 0.409
Time: 12:19:34 Log-Likelihood: -1.5470
No. Observations: 10 AIC: 7.094
Df Residuals: 8 BIC: 7.699
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
c 0.7767 0.111 7.000 0.000 0.521 1.033
d1 0.2433 0.127 1.923 0.091 -0.048 0.535
d2 0.5333 0.213 2.499 0.037 0.041 1.026
==============================================================================
Omnibus: 0.257 Durbin-Watson: 0.760
Prob(Omnibus): 0.879 Jarque-Bera (JB): 0.404
Skew: 0.043 Prob(JB): 0.817
Kurtosis: 2.019 Cond. No. 8.91e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.34e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
However, including c, d1 and d2 is the well-known dummy variable trap, which, as I understand it, should make it impossible to estimate the model. Why is this not the case here?
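A minimal sketch of what is going on, under stated assumptions: the design matrix is rebuilt from the table above using all 11 rows (the summary reports 10 observations, so the data actually fitted may differ), and the helper names and the shift of 0.5 are made up for illustration. It checks that X'X is singular, so the normal equations have infinitely many solutions rather than none, and that moving the reported coefficients along the direction (1, -1, -1) leaves every fitted value unchanged. As far as I know, statsmodels' OLS.fit defaults to a pseudoinverse (method="pinv"), which returns one particular solution instead of failing; that would be consistent with the reported c being roughly d1 + d2.

```python
# Design matrix columns: c, d1, d2. In every row d1 + d2 == c.
rows = [(1, 1, 0)] * 10 + [(1, 0, 1)]


def gram(X):
    """Return X'X as a nested list."""
    k = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fitted(X, beta):
    """Fitted values X @ beta."""
    return [sum(x * b for x, b in zip(r, beta)) for r in X]


xtx = gram(rows)
det = det3(xtx)  # zero means X'X cannot be inverted

# Coefficients as printed in the summary above.
beta_reported = (0.7767, 0.2433, 0.5333)

# Hypothetical shift along the null-space direction (1, -1, -1).
shift = 0.5
beta_shifted = (0.7767 + shift, 0.2433 - shift, 0.5333 - shift)

fit_a = fitted(rows, beta_reported)
fit_b = fitted(rows, beta_shifted)
max_gap = max(abs(a - b) for a, b in zip(fit_a, fit_b))

print("det(X'X) =", det)
print("largest difference in fitted values =", max_gap)
```

Both coefficient vectors fit the data equally well, which is why the individual coefficients and their standard errors are not meaningful here even though a table is printed.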
I was playing around with a simple linear model when I noticed that, in the ANOVA table, the ratio MSreg/MSres does not exactly match the F value. The two values are very close, but not the same.
Here is my script:
#quick view of the dataset
> head(my_data)
Diameter Height
1 0.325 0.080
2 0.320 0.100
3 0.280 0.110
4 0.125 0.040
5 0.400 0.135
6 0.335 0.100
#setting up the lm()
> ls1 <- lm(Diameter~Height, data=my_data)
> anova(ls1)
Analysis of Variance Table
Response: Diameter
Df Sum Sq Mean Sq F value Pr(>F)
Height 1 0.82415 0.82415 602.63 < 2.2e-16 ***
Residuals 98 0.13402 0.00137
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here 0.82415/0.00137=601.5693 which is not the F value in the table. Is there a particular reason for that?
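A quick arithmetic check, sketched in Python and using only the numbers printed in the table: the residual mean square is recomputed from Sum Sq and Df instead of being read from the rounded Mean Sq column. The variable names are made up for the example. The premise is that R computes F from unrounded values and only rounds for display, so the small remaining gap would come from Sum Sq itself being printed to five significant digits.

```python
# Values as printed in the ANOVA table above.
ss_reg, df_reg = 0.82415, 1
ss_res, df_res = 0.13402, 98

ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res      # more digits than the printed 0.00137

f_from_rounded = ms_reg / 0.00137   # what the post computed
f_from_sums = ms_reg / ms_res       # closer to the printed F value

print("MS residual:", ms_res)
print("F using rounded Mean Sq:", f_from_rounded)
print("F using Sum Sq / Df:", f_from_sums)
```

The second ratio lands much closer to the printed 602.63 than the first one does.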
I'm doing the statistical evaluation for my master's thesis. The Levene test was significant, so I ran a Welch ANOVA, which was also significant. I then tried the Games-Howell post hoc test, but it didn't work.
Can anybody send me the exact functions I need to run in R to do the Games-Howell post hoc test and to get a compact letter display showing which treatments are not significantly different from each other? I would also like to know whether I ran the Welch ANOVA correctly (the R output is below).
Here is the output of the statistical evaluation so far:
'data.frame': 30 obs. of 3 variables:
$ Dauer: Factor w/ 6 levels "0","2","4","6",..: 1 2 3 4 5 6 1 2 3 4 ...
$ WH : Factor w/ 5 levels "r1","r2","r3",..: 1 1 1 1 1 1 2 2 2 2 ...
$ TSO2 : num 107 86 98 97 88 95 93 96 96 99 ...
> leveneTest(TSO2~Dauer, data=TSO2R)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 3.3491 0.01956 *
24
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> oneway.test(TSO2 ~ Dauer, data = TSO2R, var.equal = FALSE) # Welch ANOVA
One-way analysis of means (not assuming equal variances)
data: TSO2 and Dauer
F = 5.7466, num df = 5.000, denom df = 10.685, p-value = 0.00807
Thank you very much!
I usually find all my answers by searching here, and I've never had to post anything before. However this is a very particular problem and I haven't been able to find an answer. I hope you can help.
I have this table, called "FSR":
Mouse Day Percent.Rewarded Percent.Premature
Y3.5 1 0.72 0.73
Y3.6 1 0.47 0.68
Y3.7 1 0.74 0.71
X7.1 1 0.74 0.79
X7.2 1 0.74 0.80
AA1.1 1 0.91 0.84
AA1.2 1 0.70 0.75
AA1.3 1 0.95 0.85
I want to count the number of times each Mouse ID appears in the column Mouse, which should be easy:
FSRCounts <- table(FSR$Mouse)
So far, so good. This appears to work:
print(FSRCounts)
AA1.1 AA1.2 AA1.3 X7.1 X7.2 Y3.5 Y3.6 Y3.7
1 1 1 1 1 1 1 1
If I want to know how many times a particular mouse has appeared, this also works, no matter which mouse:
FSRCounts["Y3.6"]
Y3.6
1
FSRCounts["AA1.1"]
AA1.1
1
However, for some reason when I use another table, called data, to get the mouse IDs the code doesn't work for every mouse and I can't figure out why.
Here's the table "data". I have deleted columns irrelevant to this question:
Experiment mouse
1 RIGHT_autoshape_gonogo_LMR X6.1
2 LEFT_autoshape_gonogo_LMR X6.2
3 RIGHT_autoshape_gonogo_LMR Y3.1
4 LEFT_autoshape_gonogo_LMR Y3.2
5 RIGHT_autoshape_gonogo_LMR Y3.3
6 LEFT_autoshape_gonogo_LMR Y3.4
7 RIGHT_5sec_reactiontime_gonogo_LMR Y3.5
8 LEFT_5sec_reactiontime_gonogo_LMR Y3.6
9 RIGHT_5sec_reactiontime_gonogo_LMR Y3.7
10 LEFT_5sec_reactiontime_gonogo_LMR X7.1
11 RIGHT_5sec_reactiontime_gonogo_LMR X7.2
12 LEFT_5sec_reactiontime_gonogo_LMR AA1.1
13 RIGHT_5sec_reactiontime_gonogo_LMR AA1.2
14 LEFT_5sec_reactiontime_gonogo_LMR AA1.3
15 RIGHT_autoshape_gonogo_LMR AA1.4
16 LEFT_autoshape_gonogo_LMR AA1.5
17 RIGHT_autoshape_gonogo_LMR AA1.6
18 RIGHT_autoshape_gonogo_LMR Y4.2
19 LEFT_autoshape_gonogo_LMR Y4.3
And here's the code:
for (i in 1:nrow(data)) {
if (grepl("5sec", data[i,"Experiment"])) {
FiveToday <- T
FM <- data[i,"mouse"]
FD <- FSRCounts[FM] + 1
if (is.na(FSRCounts[FM])){
FD <- 1
}
}
}
It works for some mice, but not others. I've added the "print()" lines to show exactly where the code is screwing up.
For example, it works for AA1.2:
> FM <- data[13,"mouse"]
> print("FM:")
[1] "FM:"
> print(FM)
[1] AA1.2
Levels: AA1.1 AA1.2 AA1.3 AA1.4 AA1.5 AA1.6 X6.1 X6.2 X7.1 X7.2 Y3.1 Y3.2 Y3.3 Y3.4 Y3.5 Y3.6 Y3.7 Y4.2 Y4.3
> print("FSR Count Table:")
[1] "FSR Count Table:"
> print(FSRCounts)
AA1.1 AA1.2 AA1.3 X7.1 X7.2 Y3.5 Y3.6 Y3.7
1 1 1 1 1 1 1 1
> print("Count:")
[1] "Count:"
> print(FSRCounts[FM])
AA1.2
1
> FD <- FSRCounts[FM] + 1
> print("FD:")
[1] "FD:"
> print(FD)
AA1.2
2
> if (is.na(FSRCounts[FM])){print("FD is 1")
+ FD <- 1
}
But not for Y3.6:
> FM <- data[8,"mouse"]
> print("FM:")
[1] "FM:"
> print(FM)
[1] Y3.6
Levels: AA1.1 AA1.2 AA1.3 AA1.4 AA1.5 AA1.6 X6.1 X6.2 X7.1 X7.2 Y3.1 Y3.2 Y3.3 Y3.4 Y3.5 Y3.6 Y3.7 Y4.2 Y4.3
> print("FSR Count Table:")
[1] "FSR Count Table:"
> print(FSRCounts)
AA1.1 AA1.2 AA1.3 X7.1 X7.2 Y3.5 Y3.6 Y3.7
1 1 1 1 1 1 1 1
> print("Count:")
[1] "Count:"
> print(FSRCounts[FM])
<NA>
NA
> FD <- FSRCounts[FM] + 1
> print("FD:")
[1] "FD:"
> print(FD)
<NA>
NA
> if (is.na(FSRCounts[FM])){print("FD is 1")
+ FD <- 1
+ }
[1] "FD is 1"
Can anyone help me figure out why this is happening and fix it? Or suggest an alternate way to do the same thing? Thanks for your help!
Dan Hoops
I have a data frame with results for certain instruments, and I want to create a new column which contains the totals of each row. Because I have different numbers of instruments each time I run an analysis on new data, I need a function to dynamically calculate the new column with the Row Total.
To simplify my problem, here’s what my data frame looks like:
Type Value
1 A 10
2 A 15
3 A 20
4 A 25
5 B 30
6 B 40
7 B 50
8 B 60
9 B 70
10 B 80
11 B 90
My goal is to achieve the following:
A B Total
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 70 70
6 80 80
7 90 90
I’ve tried various methods, but this one holds the most promise:
myList <- list(a = c(10, 15, 20, 25), b = c(30, 40, 50, 60, 70, 80, 90))
tmpDF <- data.frame(sapply(myList, '[', 1:max(sapply(myList, length))))
> tmpDF
a b
1 10 30
2 15 40
3 20 50
4 25 60
5 NA 70
6 NA 80
7 NA 90
totalSum <- rowSums(tmpDF)
totalSum <- data.frame(totalSum)
tmpDF <- cbind(tmpDF, totalSum)
> tmpDF
a b totalSum
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 NA 70 NA
6 NA 80 NA
7 NA 90 NA
Even though this approach did succeed in combining two columns of different lengths, the ‘rowSums’ function returns NA for every row that contains a missing value, so the totals for rows 5 to 7 are wrong. Besides that, my original data isn't in a list format, so I can't apply such a 'solution'.
I think I’m overcomplicating this problem, so I was wondering how I can …
Subset data from a data frame on the basis of ‘Type’,
Insert these individual subsets of different lengths into a new data frame,
Add a ‘Total’ column to this data frame which is the correct sum of the individual subsets.
An added complication is that this needs to be done in a function or in an otherwise dynamic way, so that I don’t need to manually subset the dozens of ‘Types’ (A, B, C, and so on) in my data frame.
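The three steps above can be sketched independently of R. The following is a minimal Python illustration of the logic only, using the example values from this post: group the values by type, line the groups up by position while padding the shorter ones, then sum each row while skipping the padding. The variable names are made up for the example, and it assumes rows within a type are matched purely by their order of appearance.

```python
from itertools import zip_longest

# (Type, Value) pairs from the example data frame.
records = [
    ("A", 10), ("A", 15), ("A", 20), ("A", 25),
    ("B", 30), ("B", 40), ("B", 50), ("B", 60),
    ("B", 70), ("B", 80), ("B", 90),
]

# Step 1: split the values by type, keeping their original order.
groups = {}
for kind, value in records:
    groups.setdefault(kind, []).append(value)

# Step 2: align the groups row by row; shorter groups are padded with None.
types = list(groups)
wide = list(zip_longest(*(groups[t] for t in types)))

# Step 3: row totals that ignore the padding.
totals = [sum(v for v in row if v is not None) for row in wide]

print(types + ["Total"])
for row, total in zip(wide, totals):
    print(row, total)
```

Nothing here depends on the number of types, so a third type C would simply add a column.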
Here’s what I have so far, which doesn’t work, but illustrates the lines I’m thinking along:
TotalDf <- function(x){
tmpNumberOfTypes <- c(levels(x$Type))
for( i in tmpNumberOfTypes){
subSetofData <- subset(x, Type = i, select = Value)
if( i == 1) {
totalDf <- subSetOfData }
else{
totalDf <- cbind(totalDf, subSetofData)}
}
return(totalDf)
}
Thanks in advance for any thoughts or ideas on this,
Regards,
EDIT:
Thanks to Joris's comment (see below) I got a push in the right direction. However, when I try to translate his solution to my own data frame, I run into additional problems. His proposed answer works on his example and gives me the following (correct) sums of the values of A and B:
> tmp78 <- tapply(DF$value,DF$id,sum)
> tmp78
1 2 3 4 5 6
6 8 10 12 9 10
> data.frame(tmp78)
tmp78
1 6
2 8
3 10
4 12
5 9
6 10
However, when I try this solution on my data frame, it doesn’t work:
> subSetOfData <- copyOfTradesList[c(1:3,11:13),c(1,10)]
> subSetOfData
Instrument AccountValue
1 JPM 6997
2 JPM 7261
3 JPM 7545
11 KFT 6992
12 KFT 6944
13 KFT 7069
> unlist(sapply(rle(subSetOfData$Instrument)$lengths,function(x) 1:x))
Error in rle(subSetOfData$Instrument) : 'x' must be an atomic vector
> subSetOfData$InstrumentNumeric <- as.numeric(subSetOfData$Instrument)
> unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
> subSetOfData$id <- unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 2L, 3L, 1L, 2L, :
replacement has 3 rows, data has 6
I have the disturbing idea that I’m going around in circles…
Two thoughts:
1) You could use na.rm=TRUE in rowSums.
2) How do you know which value has to go with which? You might add some indexing,
e.g.:
DF <- data.frame(
type=c(rep("A",4),rep("B",6)),
value = 1:10,
stringsAsFactors=F
)
DF$id <- unlist(lapply(rle(DF$type)$lengths,function(x) 1:x))
Now this allows you to easily tapply the sum on the original dataframe
tapply(DF$value,DF$id,sum)
And, more importantly, get your dataframe in the correct form :
> DF
type value id
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 B 5 1
6 B 6 2
7 B 7 3
8 B 8 4
9 B 9 5
10 B 10 6
> library(reshape)
> cast(DF,id~type)
id A B
1 1 1 5
2 2 2 6
3 3 3 7
4 4 4 8
5 5 NA 9
6 6 NA 10
TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B")
, Value = c(10,15,20,25,30,40,50,60,70,80,90)
, stringsAsFactors = FALSE)
# Added Type C for testing
# TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B", "C", "C", "C")
# , Value = c(10,15,20,25,30,40,50,60,70,80,90, 100, 150, 130)
# , stringsAsFactors = FALSE)
lnType <- with(TV, tapply(Value, Type, length))
lnType <- as.integer(lnType)
lnType
id <- unlist(mapply(FUN = rep_len, length.out = lnType, x = list(1:max(lnType))))
(TV <- cbind(id, TV))
require(reshape2)
tvWide <- dcast(TV, id ~ Type, value.var = "Value")
# Alternatively, with base R:
# tvWide <- reshape(data = TV, direction = "wide", timevar = "Type", idvar = "id")
tvWide <- subset(tvWide, select = -id)
# If you want something neat without the NA values, replace them with 0:
# tvWide[is.na(tvWide)] <- 0
tvWide
transform(tvWide, rowSum=rowSums(tvWide, na.rm = TRUE))