Using table() to count occurrences of strings in R results in some, but not all, incorrect outputs

I usually find all my answers by searching here, and I've never had to post anything before. However, this is a very particular problem, and I haven't been able to find an answer. I hope you can help.
I have this table, called "FSR":
Mouse Day Percent.Rewarded Percent.Premature
Y3.5 1 0.72 0.73
Y3.6 1 0.47 0.68
Y3.7 1 0.74 0.71
X7.1 1 0.74 0.79
X7.2 1 0.74 0.80
AA1.1 1 0.91 0.84
AA1.2 1 0.70 0.75
AA1.3 1 0.95 0.85
I want to count the number of times each Mouse ID appears in the column Mouse, which should be easy:
FSRCounts <- table(FSR$Mouse)
So far, so good. This appears to work:
print(FSRCounts)
AA1.1 AA1.2 AA1.3 X7.1 X7.2 Y3.5 Y3.6 Y3.7
1 1 1 1 1 1 1 1
If I want to know how many times a particular mouse has appeared, this also works, no matter which mouse:
FSRCounts["Y3.6"]
Y3.6
1
FSRCounts["AA1.1"]
AA1.1
1
However, for some reason, when I use another table, called "data", to get the mouse IDs, the code doesn't work for every mouse and I can't figure out why.
Here's the table "data". I have deleted columns irrelevant to this question:
Experiment mouse
1 RIGHT_autoshape_gonogo_LMR X6.1
2 LEFT_autoshape_gonogo_LMR X6.2
3 RIGHT_autoshape_gonogo_LMR Y3.1
4 LEFT_autoshape_gonogo_LMR Y3.2
5 RIGHT_autoshape_gonogo_LMR Y3.3
6 LEFT_autoshape_gonogo_LMR Y3.4
7 RIGHT_5sec_reactiontime_gonogo_LMR Y3.5
8 LEFT_5sec_reactiontime_gonogo_LMR Y3.6
9 RIGHT_5sec_reactiontime_gonogo_LMR Y3.7
10 LEFT_5sec_reactiontime_gonogo_LMR X7.1
11 RIGHT_5sec_reactiontime_gonogo_LMR X7.2
12 LEFT_5sec_reactiontime_gonogo_LMR AA1.1
13 RIGHT_5sec_reactiontime_gonogo_LMR AA1.2
14 LEFT_5sec_reactiontime_gonogo_LMR AA1.3
15 RIGHT_autoshape_gonogo_LMR AA1.4
16 LEFT_autoshape_gonogo_LMR AA1.5
17 RIGHT_autoshape_gonogo_LMR AA1.6
18 RIGHT_autoshape_gonogo_LMR Y4.2
19 LEFT_autoshape_gonogo_LMR Y4.3
And here's the code:
for (i in 1:nrow(data)) {
  if (grepl("5sec", data[i, "Experiment"])) {
    FiveToday <- T
    FM <- data[i, "mouse"]
    FD <- FSRCounts[FM] + 1
    if (is.na(FSRCounts[FM])) {
      FD <- 1
    }
  }
}
It works for some mice, but not others. I've added print() lines to show exactly where the code is screwing up.
For example, it works for AA1.2:
> FM <- data[13,"mouse"]
> print("FM:")
[1] "FM:"
> print(FM)
[1] AA1.2
Levels: AA1.1 AA1.2 AA1.3 AA1.4 AA1.5 AA1.6 X6.1 X6.2 X7.1 X7.2 Y3.1 Y3.2 Y3.3 Y3.4 Y3.5 Y3.6 Y3.7 Y4.2 Y4.3
> print("FSR Count Table:")
[1] "FSR Count Table:"
> print(FSRCounts)
AA1.1 AA1.2 AA1.3 X7.1 X7.2 Y3.5 Y3.6 Y3.7
1 1 1 1 1 1 1 1
> print("Count:")
[1] "Count:"
> print(FSRCounts[FM])
AA1.2
1
> FD <- FSRCounts[FM] + 1
> print("FD:")
[1] "FD:"
> print(FD)
AA1.2
2
> if (is.na(FSRCounts[FM])){print("FD is 1")
+ FD <- 1
+ }
But not for Y3.6:
> FM <- data[8,"mouse"]
> print("FM:")
[1] "FM:"
> print(FM)
[1] Y3.6
Levels: AA1.1 AA1.2 AA1.3 AA1.4 AA1.5 AA1.6 X6.1 X6.2 X7.1 X7.2 Y3.1 Y3.2 Y3.3 Y3.4 Y3.5 Y3.6 Y3.7 Y4.2 Y4.3
> print("FSR Count Table:")
[1] "FSR Count Table:"
> print(FSRCounts)
AA1.1 AA1.2 AA1.3 X7.1 X7.2 Y3.5 Y3.6 Y3.7
1 1 1 1 1 1 1 1
> print("Count:")
[1] "Count:"
> print(FSRCounts[FM])
<NA>
NA
> FD <- FSRCounts[FM] + 1
> print("FD:")
[1] "FD:"
> print(FD)
<NA>
NA
> if (is.na(FSRCounts[FM])){print("FD is 1")
+ FD <- 1
+ }
[1] "FD is 1"
Can anyone help me figure out why this is happening and fix it? Or suggest an alternate way to do the same thing? Thanks for your help!
Dan Hoops

Related

statsmodels OLS gives parameters despite perfect multicollinearity

Assume the following df:
ib c d1 d2
0 1.14 1 1 0
1 1.0 1 1 0
2 0.71 1 1 0
3 0.6 1 1 0
4 0.66 1 1 0
5 1.0 1 1 0
6 1.26 1 1 0
7 1.29 1 1 0
8 1.52 1 1 0
9 1.31 1 1 0
10 0.89 1 0 1
d1 and d2 are perfectly collinear. Now I estimate the following regression model:
import statsmodels.api as sm
reg = sm.OLS(df['ib'], df[['c', 'd1', 'd2']]).fit().summary()
reg
This gives me the following output:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: ib R-squared: 0.087
Model: OLS Adj. R-squared: -0.028
Method: Least Squares F-statistic: 0.7590
Date: Thu, 17 Nov 2022 Prob (F-statistic): 0.409
Time: 12:19:34 Log-Likelihood: -1.5470
No. Observations: 10 AIC: 7.094
Df Residuals: 8 BIC: 7.699
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
c 0.7767 0.111 7.000 0.000 0.521 1.033
d1 0.2433 0.127 1.923 0.091 -0.048 0.535
d2 0.5333 0.213 2.499 0.037 0.041 1.026
==============================================================================
Omnibus: 0.257 Durbin-Watson: 0.760
Prob(Omnibus): 0.879 Jarque-Bera (JB): 0.404
Skew: 0.043 Prob(JB): 0.817
Kurtosis: 2.019 Cond. No. 8.91e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.34e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
However, including c, d1 and d2 represents the well-known dummy variable trap which, from my understanding, should make it impossible to estimate the model. Why is this not the case here?
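As an aside (my own sketch, not part of the original question): with a singular design matrix there are infinitely many coefficient vectors with identical fitted values, and different software simply returns one of them. statsmodels' OLS solves the least-squares problem via a pseudoinverse by default, whereas R's lm(), for contrast, drops an aliased column. A rough transcription of the same data into R:
ib <- c(1.14, 1.00, 0.71, 0.60, 0.66, 1.00, 1.26, 1.29, 1.52, 1.31, 0.89)
d1 <- c(rep(1, 10), 0)   # dummy for rows 0-9 of the df above
d2 <- 1 - d1             # perfectly collinear: d1 + d2 equals the constant
fit <- lm(ib ~ d1 + d2)  # the intercept plays the role of c
coef(fit)                # d2 comes back NA: lm() drops the aliased column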

which post-hoc test after welch-anova

I'm doing the statistical evaluation for my master's thesis. The Levene test was significant, so I did the Welch ANOVA, which was also significant. Now I tried the Games-Howell post-hoc test, but it didn't work.
Can anybody help me by sending the exact functions I have to run in R to do the Games-Howell post-hoc test and to get a kind of compact letter display, which shows me which treatments are not significantly different from each other? I also wanted to ask whether I did the Welch ANOVA the right way (you can find the R output below).
Here is the output of what I have done so far for the statistical evaluation:
'data.frame': 30 obs. of 3 variables:
$ Dauer: Factor w/ 6 levels "0","2","4","6",..: 1 2 3 4 5 6 1 2 3 4 ...
$ WH : Factor w/ 5 levels "r1","r2","r3",..: 1 1 1 1 1 1 2 2 2 2 ...
$ TSO2 : num 107 86 98 97 88 95 93 96 96 99 ...
> leveneTest(TSO2~Dauer, data=TSO2R)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 3.3491 0.01956 *
24
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> oneway.test(TSO2 ~ Dauer, data=TSO2R, var.equal = FALSE) ### Welch ANOVA
One-way analysis of means (not assuming equal variances)
data: TSO2 and Dauer
F = 5.7466, num df = 5.000, denom df = 10.685, p-value = 0.00807
Thank you very much!
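One possible direction, added here as a sketch (it is not from the original post, and it assumes the rstatix and multcompView packages plus the TSO2R data frame from the output above): rstatix ships a games_howell_test() function, and multcompLetters() can turn the pairwise adjusted p-values into a compact letter display.
library(rstatix)       # provides games_howell_test()
library(multcompView)  # provides multcompLetters()

gh <- games_howell_test(TSO2R, TSO2 ~ Dauer)  # Games-Howell post-hoc test
gh

# Compact letter display: name each comparison "group1-group2"
p <- setNames(gh$p.adj, paste(gh$group1, gh$group2, sep = "-"))
multcompLetters(p)$Letters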

Custom datetimeparsing to combine date and time after reading csv - Pandas

Upon reading a text file, I am presented with an odd format where date and time are contained in separate columns, as follows (the file uses tabs as separators).
temp
room 1
Date Time simulation
Fri, 01/Jan 00:30 11.94
01:30 12
02:30 12.04
03:30 12.06
04:30 12.08
05:30 12.09
06:30 11.99
07:30 12.01
08:30 12.29
09:30 12.46
10:30 12.35
11:30 12.25
12:30 12.19
13:30 12.12
14:30 12.04
15:30 11.96
16:30 11.9
17:30 11.92
18:30 11.87
19:30 11.79
20:30 12
21:30 12.16
22:30 12.27
23:30 12.3
Sat, 02/Jan 00:30 12.25
01:30 12.19
02:30 12.14
03:30 12.11
etc.
I would like to:
parse date and time over two columns ([0],[1]);
shift all timestamps 30 minutes early, that is, replacing :30 with :00.
I have used the following code:
timeparse = lambda x: pd.datetime.strptime(x.replace(':30', ':00'), '%H:%M')
df = pd.read_csv('Chart_1.txt',
                 sep='\t',
                 skiprows=1,
                 date_parser=timeparse,
                 parse_dates=['Time'],
                 header=1)
This does seem to parse times but not dates (obviously, as this is what I told it to do).
Also, skipping rows is useful for finding the Date and Time headers, but it discards the temp and room 1 headers, which I need.
You can use:
import pandas as pd
df = pd.read_csv('Chart_1.txt', sep='\t')
#get temperature into variable temp from third column
temp = df.columns[2]
print (temp)
Dry resultant temperature (°C)
#get aps into variable aps from second row and third column
aps = df.iloc[1, 2]
print (aps)
AE4854c_Campshill_openings reduced_communal areas increased openings2.aps
#create mask from first column - values containing / are dates
mask = df.iloc[:, 0].str.contains('/', na=False)
#shift all rows that do NOT contain dates one column to the right
df1 = df[~mask].shift(1, axis=1)
#get rows with dates
df2 = df[mask]
#concat df1 and df2, sort unsorted indexes
df = pd.concat([df1, df2]).sort_index()
#create new column names by assignment:
#first 3 are custom, the rest come from the first row (fourth column onward)
df.columns = ['date','time','no name'] + df.iloc[0, 3:].tolist()
#remove first 2 rows
df = df[2:]
#fill NaN values in column date by forward filling
df.date = df.date.ffill()
#convert column to datetime
df.date = pd.to_datetime(df.date, format='%a, %d/%b')
#replace :30 with :00 in the time column
df.time = df.time.str.replace(':30', ':00')
print (df.head())
date time no name 3F_T09_SE_SW_Bed1 GF_office_S GF_office_W_tea \
2 1900-01-01 00:00 11.94 11.47 14.72 16.66
3 1900-01-01 01:00 12.00 11.63 14.83 16.69
4 1900-01-01 02:00 12.04 11.73 14.85 16.68
5 1900-01-01 03:00 12.06 11.80 14.83 16.65
6 1900-01-01 04:00 12.08 11.84 14.79 16.62
GF_Act.Room GF_Communal areas GF_Reception GF_Ent Lobby ... \
2 17.41 12.74 12.93 10.85 ...
3 17.45 12.74 13.14 11.00 ...
4 17.44 12.71 13.23 11.09 ...
5 17.41 12.68 13.27 11.16 ...
6 17.36 12.65 13.28 11.21 ...
2F_S01_SE_SW_Bedroom 2F_S01_SE_SW_Int Circ 2F_S01_SE_SW_Storage_int circ \
2 12.58 12.17 12.54
3 12.64 12.22 12.49
4 12.68 12.27 12.48
5 12.70 12.30 12.49
6 12.71 12.31 12.51
GF_G01_SE_SW_Bedroom GF_G01_SE_SW_Storage_Bed 3F_T09_SE_SW_Bathroom \
2 14.51 14.61 11.49
3 14.55 14.59 11.50
4 14.56 14.59 11.52
5 14.55 14.58 11.54
6 14.54 14.57 11.56
3F_T09_SE_SW_Circ 3F_T09_SE_SW_Storage_int circ GF_Lounge GF_Cafe
2 11.52 11.38 12.83 12.86
3 11.56 11.35 13.03 13.03
4 11.61 11.36 13.13 13.13
5 11.65 11.39 13.17 13.17
6 11.68 11.42 13.18 13.18
[5 rows x 31 columns]

R question about sapply /plyr syntax: how to pass variable values to a function

Is there a way to pass a variable value in ddply/sapply directly to a function without the function(x) notation?
E.g. Instead of:
ddply(bu,.(trial), function (x) print(x$tangle) )
Is there a way to do:
ddply(bu,.(trial), print(tangle) )
I am asking because with many variables this notation becomes very cumbersome.
Thanks!
You can use fn$ in the gsubfn package. Just preface the function in question with fn$ and then you can use a formula notation as shown here:
> library(gsubfn)
>
> # instead of specifying function(x) mean(x) / sd(x)
>
> fn$sapply(iris[-5], ~ mean(x) / sd(x))
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.056602 7.014384 2.128819 1.573438
> library(plyr)
> # instead of specifying function(x) colMeans(x[-5]) / sd(x[-5])
>
> fn$ddply(iris, .(Species), ~ colMeans(x[-5]) / sd(x[-5]))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 14.20183 9.043319 8.418556 2.334285
2 versicolor 11.50006 8.827326 9.065547 6.705345
3 virginica 10.36045 9.221802 10.059890 7.376660
Just add your function parameters in the **ply command. For example:
ddply(my_data, c("var1","var2"), my_function, param1=something, param2=something)
where my_function usually looks like
my_function(x, param1, param2)
Here's a working example of this:
require(plyr)
n = 1000
my_data = data.frame(
  subject = 1:n,
  city = sample(1:4, n, T),
  gender = sample(1:2, n, T),
  income = sample(50:200, n, T)
)
my_function = function(data_in, dv, extra = F) {
  dv = data_in[, dv]
  output = data.frame(mean = mean(dv), sd = sd(dv))
  if (extra) output = cbind(output, data.frame(n = length(dv), se = sd(dv) / sqrt(length(dv))))
  return(output)
}
#with params
ddply(my_data, c("city", "gender"), my_function, dv="income", extra=T)
city gender mean sd n se
1 1 1 127.1158 44.64347 95 4.580324
2 1 2 125.0154 44.83492 130 3.932283
3 2 1 130.3178 41.00359 107 3.963967
4 2 2 128.1608 43.33454 143 3.623816
5 3 1 121.1419 45.02290 148 3.700859
6 3 2 120.1220 45.01031 123 4.058443
7 4 1 126.6769 38.33233 130 3.361968
8 4 2 125.6129 44.46168 124 3.992777
#without params
ddply(my_data, c("city", "gender"), my_function, dv="income", extra=F)
city gender mean sd
1 1 1 127.1158 44.64347
2 1 2 125.0154 44.83492
3 2 1 130.3178 41.00359
4 2 2 128.1608 43.33454
5 3 1 121.1419 45.02290
6 3 2 120.1220 45.01031
7 4 1 126.6769 38.33233
8 4 2 125.6129 44.46168
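As an aside (my addition, not part of the original answers): if the underlying goal is grouped summaries without function(x) boilerplate, the same table can be sketched with dplyr, assuming the my_data frame defined above. Note that summarise can reuse columns created earlier in the same call (se is built from sd and n).
library(dplyr)
my_data %>%
  group_by(city, gender) %>%
  summarise(mean = mean(income),
            sd = sd(income),
            n = n(),
            se = sd / sqrt(n),
            .groups = "drop")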

Subsetting in a function to calculate a row total

I have a data frame with results for certain instruments, and I want to create a new column which contains the totals of each row. Because I have different numbers of instruments each time I run an analysis on new data, I need a function to dynamically calculate the new column with the Row Total.
To simplify my problem, here’s what my data frame looks like:
Type Value
1 A 10
2 A 15
3 A 20
4 A 25
5 B 30
6 B 40
7 B 50
8 B 60
9 B 70
10 B 80
11 B 90
My goal is to achieve the following:
A B Total
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 70 70
6 80 80
7 90 90
I’ve tried various methods, but this way holds the most promise:
myList <- list(a = c(10, 15, 20, 25), b = c(30, 40, 50, 60, 70, 80, 90))
tmpDF <- data.frame(sapply(myList, '[', 1:max(sapply(myList, length))))
> tmpDF
a b
1 10 30
2 15 40
3 20 50
4 25 60
5 NA 70
6 NA 80
7 NA 90
totalSum <- rowSums(tmpDF)
totalSum <- data.frame(totalSum)
tmpDF <- cbind(tmpDF, totalSum)
> tmpDF
a b totalSum
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 NA 70 NA
6 NA 80 NA
7 NA 90 NA
Even though this way succeeded in combining two data frames of different lengths, the ‘rowSums’ function gives the wrong values in this example. Besides that, my original data isn't in a list format, so I can't apply such a ‘solution’.
I think I’m overcomplicating this problem, so I was wondering how I can:
Subset data from a data frame on the basis of ‘Type’,
Insert these individual subsets of different lengths into a new data frame,
Add a ‘Total’ column to this data frame which is the correct sum of the individual subsets.
An added complication is that this needs to be done in a function or in an otherwise dynamic way, so that I don’t need to manually subset the dozens of ‘Types’ (A, B, C, and so on) in my data frame.
Here’s what I have so far, which doesn’t work, but illustrates the lines I’m thinking along:
TotalDf <- function(x){
  tmpNumberOfTypes <- c(levels(x$Type))
  for (i in tmpNumberOfTypes) {
    subSetofData <- subset(x, Type = i, select = Value)
    if (i == 1) {
      totalDf <- subSetOfData
    } else {
      totalDf <- cbind(totalDf, subSetofData)
    }
  }
  return(totalDf)
}
Thanks in advance for any thoughts or ideas on this,
Regards,
EDIT:
Thanks to the comment from Joris (see below) I was pointed in the right direction; however, when trying to translate his solution to my data frame, I ran into additional problems. His proposed answer works, and gives me the following (correct) sum of the values of A and B:
> tmp78 <- tapply(DF$value,DF$id,sum)
> tmp78
1 2 3 4 5 6
6 8 10 12 9 10
> data.frame(tmp78)
tmp78
1 6
2 8
3 10
4 12
5 9
6 10
However, when I try this solution on my data frame, it doesn’t work:
> subSetOfData <- copyOfTradesList[c(1:3,11:13),c(1,10)]
> subSetOfData
Instrument AccountValue
1 JPM 6997
2 JPM 7261
3 JPM 7545
11 KFT 6992
12 KFT 6944
13 KFT 7069
> unlist(sapply(rle(subSetOfData$Instrument)$lengths,function(x) 1:x))
Error in rle(subSetOfData$Instrument) : 'x' must be an atomic vector
> subSetOfData$InstrumentNumeric <- as.numeric(subSetOfData$Instrument)
> unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
> subSetOfData$id <- unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 2L, 3L, 1L, 2L, :
replacement has 3 rows, data has 6
I have the disturbing idea that I’m going around in circles…
Two thoughts:
1) You could use na.rm=TRUE in rowSums.
2) How do you know which value has to go with which? You might add some indexing, e.g.:
DF <- data.frame(
type=c(rep("A",4),rep("B",6)),
value = 1:10,
stringsAsFactors=F
)
DF$id <- unlist(lapply(rle(DF$type)$lengths,function(x) 1:x))
Now this allows you to easily tapply the sum on the original dataframe
tapply(DF$value,DF$id,sum)
And, more importantly, get your dataframe in the correct form :
> DF
type value id
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 B 5 1
6 B 6 2
7 B 7 3
8 B 8 4
9 B 9 5
10 B 10 6
> library(reshape)
> cast(DF,id~type)
id A B
1 1 1 5
2 2 2 6
3 3 3 7
4 4 4 8
5 5 NA 9
6 6 NA 10
TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B")
                 , Value = c(10,15,20,25,30,40,50,60,70,80,90)
                 , stringsAsFactors = FALSE)
# Added Type C for testing
# TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B", "C", "C", "C")
# , Value = c(10,15,20,25,30,40,50,60,70,80,90, 100, 150, 130)
# , stringsAsFactors = FALSE)
lnType <- with(TV, tapply(Value, Type, length))
lnType <- as.integer(lnType)
lnType
id <- unlist(mapply(FUN = rep_len, length.out = lnType, x = list(1:max(lnType))))
(TV <- cbind(id, TV))
require(reshape2)
tvWide <- dcast(TV, id ~ Type)
# Alternatively
# tvWide <- reshape(data = TV, direction = "wide", timevar = "Type", ids = c(id, Type))
tvWide <- subset(tvWide, select = -id)
# If you want something neat without the <NA>s
# for (i in 1:ncol(tvWide)) {
#   for (j in 1:nrow(tvWide)) {
#     if (is.na(tvWide[j, i])) {
#       tvWide[j, i] <- 0
#     }
#   }
# }
# (equivalently: tvWide[is.na(tvWide)] <- 0)
tvWide
transform(tvWide, rowSum=rowSums(tvWide, na.rm = TRUE))
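For completeness, a compact base-R alternative (my own sketch, not from the original answers; it reuses the TV data frame defined above): split Value by Type, pad every group with NA to a common length, and take row sums with na.rm = TRUE.
vals <- split(TV$Value, TV$Type)          # one numeric vector per Type
n <- max(lengths(vals))                   # length of the longest group
pad <- function(v) { length(v) <- n; v }  # extending a vector pads with NA
wide <- data.frame(lapply(vals, pad))
wide$Total <- rowSums(wide, na.rm = TRUE)
wide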