Custom function to create an index of results
I’m trying to create a function which creates an index (starting at 100) and then adjusts this index according to the results of investments. In a nutshell: if the first investment gives a profit of 5%, the index moves to 105; if the second result is -7%, the index then stands at 97.65. When I use the word "index" in this question, I'm not referring to the index function of the zoo package.
Besides creating this index, my goal is also to create a function which can be applied to various subsets of my complete data set (i.e. with the use of sapply and its friends).
Here’s the function which I have so far (data at end of this question):
CalculateIndex <- function(x){
  totalAccount <- accountValueStart
  if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
    indexedValues <- 100 + (100 * ((((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100))
    # Update the account value
    totalAccount <- totalAccount + x$TradeResult.Currency.
  }
  else{ # the value is not the first
    indexedValues <- c(indexedValues,
                       indexedValues[-1] + (indexedValues[-1] * (((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100))
    # Update the account value
    totalAccount <- totalAccount + x$TradeResult.Currency.
  }
  return(indexedValues)
}
In words the function does (read: is intended to do) the following:
If the value is the first, use 100 as a starting point for the index. If the value is not the first, use the previously calculated index value as the starting point for calculating the new index value. Besides this, the function also takes the weight of the individual result (relative to the totalAccount value) into account.
The problem:
Using this CalculateIndex function on the theData data frame gives the following incorrect output:
> CalculateIndex(theData)
[1] 99.97901 99.94180 99.65632 101.88689 100.89309 98.92878 102.02911 100.49159 98.52955 102.02243 98.43655 100.76502 99.34869 100.76401 101.18014 99.75136 97.90130
[18] 100.39935 99.81311 101.34961
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
Edit:
Wow, I already got a downvote, though I thought my question was already too long. Sorry; I thought (and still think) the problem lies inside my loop, so I didn't want to bore you with details, which I expected would only attract fewer answers. A misjudgement on my part.
The problem with the above output from CalculateIndex is that the results are wildly different from Excel. Even though this could result from rounding errors (as Joris mentions below), I doubt it. The R results differ quite a bit from the Excel results:
R output         Excel calculated values
99.9790085700    99.97900857
99.9418035700    99.92081189
99.6563228600    99.57713687
101.8868850000   101.4639947
100.8930864300   102.3570786
98.9287771400    101.2858564
102.0291071400   103.3149664
100.4915864300   103.806556
98.5295542900    102.3361186
102.0224285700   104.3585552
98.4365550000    102.795089
100.7650171400   103.5601228
99.3486857100    102.9087897
100.7640057100   103.6728077
101.1801400000   104.8529634
99.7513600000    104.6043164
97.9013000000    102.5055298
100.3993485700   102.9048999
99.8131085700    102.7179995
101.3496071400   104.0676555
I think it is fair to say that the difference in output isn't an R-versus-Excel issue, but rather an error in my function. So, let's focus on the function.
The manual calculation of the function
The function uses different variables:
Size.Units.: the number of units which are bought at the EntryPrice;
EntryPrice: the price at which the stocks are bought;
TradeResult.Percent.: the percentage gain or loss resulting from the investment;
TradeResult.Currency.: the currency value ($) of the gain or loss resulting from the investment.
These variables are used in the following section of the function:
100 + ( 100 *((((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100))
and
indexedValues[-1] + (indexedValues[-1] *(((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100)
Both formulas are essentially the same, with the difference that the first starts at 100, while the second uses the previous value to calculate the new indexed value.
The formula can be broken down in different steps:
First, x$Size.Units. * x$EntryPrice determines the total position that was taken, in the sense that buying 100 shares at a price of 48.98 gives a position of $4898.
The resulting total position is then divided by the total account size (i.e. totalAccount). This is needed to weight the impact of one position relative to the complete portfolio. For example, if our 100 shares bought at 48.98 drop 10 percent, the calculated index (i.e. the CalculateIndex function) shouldn't drop a full 10%, because of course not all the money in totalAccount is invested in one stock. So, by dividing the total position by totalAccount we get a ratio which tells us how much money is invested. For example, a position with a size of 4898 dollars (on a total account of 14000) results in a total account loss of 3.49% if the stock drops 10% (4898 / 14000 = 0.349857; 0.349857 * 10% = 3.49857%).
This ratio (of invested amount versus total amount) is then multiplied in the formula by x$TradeResult.Percent., to get the percentage impact on the total account (see the calculation example in the previous paragraph).
As a last step, the percentage loss on the total account is applied to the index value (which starts at 100). In this case, the first investment of 100 shares bought at 48.98 dollars lets the index drop from its starting point of 100 to 99.97901, reflecting the losing trade's impact on the total account.
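As a sanity check (a minimal sketch of the steps above, not part of the original function), the first index value can be reproduced directly in R from the numbers given: 100 shares at 48.98, a -0.06% trade result, and the 14000 starting account:

```r
# Reproduce the first index value from the walk-through above.
accountValueStart <- 14000
position <- 100 * 48.98                    # Size.Units. * EntryPrice = 4898
ratio    <- position / accountValueStart   # fraction of the account invested
impact   <- ratio * -0.06                  # percentage impact on the account
100 + 100 * impact / 100                   # first index value
# [1] 99.97901  (99.97900857 with more digits)
```

This matches both the first R value and the first Excel value in the comparison table, so the disagreement only starts from the second trade onwards.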
End of Edit
Stripping the function down and then adding back one part of the formula at a time to uncover the error, I arrived at the following step, where the error seems to reside:
CalculateIndex <- function(x){
  totalAccount <- accountValueStart
  if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
    indexedValues <- totalAccount
    # Update the account value
    totalAccount <- totalAccount + x$TradeResult.Currency.
  }
  else{ # the value is not the first
    indexedValues <- c(indexedValues, totalAccount)
    # Update the account value
    totalAccount <- totalAccount + x$TradeResult.Currency.
  }
  return(indexedValues)
}
> CalculateIndex(theData)
[1] 14000
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
So, it seems that even if I just use the totalAccount variable, it doesn't get updated correctly. This suggests there is some error in the basics of the if/else statement, because it only outputs the first value.
If I remove the else statement from the function, I do get values for each of the rows in theData; however, these are then wrongly calculated. So it seems to me that there is some error in how this function updates the totalAccount variable. I don't see where I made an error, so any suggestion would be highly appreciated. What am I doing wrong?
The Data
Here’s what my data looks like:
> theData
Size.Units. EntryPrice TradeResult.Percent. TradeResult.Currency.
1 100 48.98 -0.06 -3
11 100 32.59 -0.25 -8
12 100 32.51 -1.48 -48
2 100 49.01 5.39 264
13 100 32.99 3.79 125
14 100 34.24 -4.38 -150
3 100 51.65 5.50 284
4 100 48.81 1.41 69
15 100 35.74 -5.76 -206
5 100 49.50 5.72 283
6 100 46.67 -4.69 -219
16 100 33.68 3.18 107
7 100 44.48 -2.05 -91
17 100 32.61 3.28 107
8 100 45.39 3.64 165
9 100 47.04 -0.74 -35
10 100 47.39 -6.20 -294
18 100 33.68 1.66 56
19 100 33.12 -0.79 -26
20 100 32.86 5.75 189
theData <- structure(list(X = c(1L, 11L, 12L, 2L, 13L, 14L, 3L, 4L, 15L,
5L, 6L, 16L, 7L, 17L, 8L, 9L, 10L, 18L, 19L, 20L), Size.Units. = c(100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L), EntryPrice = c(48.98,
32.59, 32.51, 49.01, 32.99, 34.24, 51.65, 48.81, 35.74, 49.5,
46.67, 33.68, 44.48, 32.61, 45.39, 47.04, 47.39, 33.68, 33.12,
32.86), TradeResult.Percent. = c(-0.06, -0.25, -1.48, 5.39, 3.79,
-4.38, 5.5, 1.41, -5.76, 5.72, -4.69, 3.18, -2.05, 3.28, 3.64,
-0.74, -6.2, 1.66, -0.79, 5.75), TradeResult.Currency. = c(-3L,
-8L, -48L, 264L, 125L, -150L, 284L, 69L, -206L, 283L, -219L,
107L, -91L, 107L, 165L, -35L, -294L, 56L, -26L, 189L)), .Names = c("X",
"Size.Units.", "EntryPrice", "TradeResult.Percent.", "TradeResult.Currency."
), class = "data.frame", row.names = c(NA, -20L))
# Set the account start
accountValueStart <- 14000
Your code looks very strange, and it seems you have a lot of misconceptions about R that come from another programming language. Gavin and Gillespie already pointed out why you get the warning. Let me add some tips for far more optimal coding:
[-1] does NOT mean "drop the last one". It means "keep everything but the first value", which also explains why you get erroneous results.
Calculate common things at the beginning, to unclutter your code.
head(x$TradeResult.Currency., n = 1) is the same as x$TradeResult.Currency.[1].
Keep an eye on your vectors. Most of the mistakes in your code come from forgetting you're working with vectors.
If you need a value to be the first in a vector, put that OUTSIDE of any loop you'd use; never add an if-clause for it in the function.
Predefine your vectors/matrices as much as possible; that is a lot faster and gives fewer memory headaches when working with big data.
Vectorization, vectorization, vectorization. Did I mention vectorization?
Learn the use of debug(), debugonce() and browser() to check what your function is doing. Many of your problems could have been solved by checking the objects as they are manipulated within the function.
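As a tiny illustration of the preallocation and vectorization tips (my own example, not the answerer's):

```r
n <- 10

# Bad: growing a vector inside a loop reallocates it on every iteration.
grown <- c()
for (i in 1:n) grown <- c(grown, i^2)

# Better: predefine the vector, then fill it in place.
filled <- vector("numeric", n)
for (i in 1:n) filled[i] <- i^2

# Best: vectorize the whole computation.
vectorized <- (1:n)^2

identical(grown, filled) && identical(filled, vectorized)
# [1] TRUE
```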
This said and taken into account, your function becomes:
CalculateIndex <- function(x, accountValueStart){
  # predefine your vector
  indexedValues <- vector("numeric", nrow(x))
  # get your totalAccount calculated FAST. This is a VECTOR!!!
  totalAccount <- cumsum(c(accountValueStart, x$TradeResult.Currency.))
  # adjust length:
  totalAccount <- totalAccount[-(nrow(x)+1)]
  # only once this calculation. This is a VECTOR!!!!
  totRatio <- 1 + (((x$Size.Units. * x$EntryPrice)/totalAccount) *
                     x$TradeResult.Percent.)/100
  # and now the calculations
  indexedValues[1] <- 100 * totRatio[1]
  for(i in 2:nrow(x)){
    indexedValues[i] <- indexedValues[i-1]*totRatio[i]
  }
  return(indexedValues)
}
and returns
> CalculateIndex(theData,14000)
 [1]  99.97901  99.92081  99.57714 101.46399 102.35708 101.28586 103.31497 103.80656 102.33612 104.35856 102.79509 103.56012
[13] 102.90879 103.67281 104.85296 104.60432 102.50553 102.90490 102.71800 104.06766
So now you do:
invisible(replicate(10,print("I will never forget about vectorization any more!")))
The warning message is coming from this line:
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
It is easy to see why; x$TradeResult.Currency is a vector and thus the comparison with head(x$TradeResult.Currency., n = 1) yields a vector of logicals. (By the way, why not x$TradeResult.Currency[1] instead of the head() call?). if() requires a single logical not a vector of logicals, and that is what the warning is about. ifelse() is useful if you want to do one of two things depending upon a condition that gives a vector of logicals.
In effect, what you are doing is only entering the if() part of the statement and it gets executed once only, because the first element of x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1) is TRUE and R ignores the others.
> if(c(TRUE, FALSE)) {
+ print("Hi")
+ } else {
+ print("Bye")
+ }
[1] "Hi"
Warning message:
In if (c(TRUE, FALSE)) { :
the condition has length > 1 and only the first element will be used
> ifelse(c(TRUE, FALSE), print("Hi"), print("Bye"))
[1] "Hi"
[1] "Bye"
[1] "Hi" "Bye"
As to solving your real problem:
CalculateIndex2 <- function(x, value, start = 100) {
  rowSeq <- seq_len(NROW(x))
  totalAc <- cumsum(c(value, x$TradeResult.Currency.))[rowSeq]
  idx <- numeric(length = nrow(x))
  interm <- (((x$Size.Units. * x$EntryPrice) / totalAc) *
               x$TradeResult.Percent.) / 100
  for(i in rowSeq) {
    idx[i] <- start + (start * interm[i])
    start <- idx[i]
  }
  idx
}
which when used on theData gives:
> CalculateIndex2(theData, 14000)
[1] 99.97901 99.92081 99.57714 101.46399 102.35708 101.28586 103.31497
[8] 103.80656 102.33612 104.35856 102.79509 103.56012 102.90879 103.67281
[15] 104.85296 104.60432 102.50553 102.90490 102.71800 104.06766
What you want is a recursive function (IIRC); the current index is some function of the previous index. These are hard to solve in a vectorised way in R, hence the loop.
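That said, because each new index value is just the previous one multiplied by (1 + interm[i]), this particular recursion can be vectorised with cumprod(). The following is a sketch along the lines of CalculateIndex2 above (CalculateIndex3 is a name I made up, not part of the original answer):

```r
# Vectorised variant: idx[i] = idx[i-1] * (1 + interm[i]),
# so the whole series is start * cumprod(1 + interm).
CalculateIndex3 <- function(x, value, start = 100) {
  # running account value before each trade
  totalAc <- cumsum(c(value, x$TradeResult.Currency.))[seq_len(nrow(x))]
  # per-trade fractional impact on the account
  interm <- ((x$Size.Units. * x$EntryPrice) / totalAc) *
    x$TradeResult.Percent. / 100
  start * cumprod(1 + interm)
}
```

Applied to theData with a starting value of 14000, this should return the same vector as the loop version.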
I'm still slightly confused as to what exactly you want to do, but hopefully the following will be helpful.
Your R script gives the same answers as your Excel function for the first value. You see a difference because R doesn't print out all digits.
> tmp = CalculateIndex(theData)
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
> print(tmp, digits=10)
[1] 99.97900857 99.94180357 99.65632286 101.88688500 100.89308643
<snip>
The reason for the warning message is because x$TradeResult.Currency is a vector that is being compared to a single number.
That warning message is also where your bug lives. In your if statement, you never execute the else part, since only the value of x$TradeResult.Currency is being used. As the warning message states, only the first element of x$TradeResult.Currency is being used.
Related
Troubleshooting a Loop in R
I have this loop in R that is scraping Reddit comments from an API incrementally on an hourly basis (e.g. all comments containing a certain keyword between now and 1 hour ago, 1 hour ago and 2 hours ago, 2 hours ago and 3 hours ago, etc.):

library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000) {
  tryCatch({
    url_i <- paste0(part1, i+1, part2, i, part3)
    r_i <- data.frame(fromJSON(url_i))
    results[[i]] <- r_i
    myvec_i <- sapply(results, NROW)
    print(c(i, sum(myvec_i)))
  }, error = function(e){})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")

In this loop, I added a line of code that prints which iteration the loop is currently on, and the cumulative number of results that the loop has scraped as of the current iteration. I also added a tryCatch() so that, in the worst-case scenario, the loop skips an iteration which produces an error; however, I was not anticipating that to happen very often.

However, I am noticing that this loop is producing errors and often skipping iterations, far more than I had expected. My guess was that between certain times the API might not have recorded any comments, thus not adding any new results to the list, e.g. (the left column is the iteration number, the right column the cumulative results):

  iteration_number cumulative_results
1            13432               5673
2            13433               5673
3            13434               5673

But in my case, can someone please help me understand why this loop is producing so many errors, resulting in so many skipped iterations? Thank you!
Your issue is almost certainly caused by rate limits imposed on your requests by the Pushshift API. When doing scraping tasks like this, the server may track how many requests a client has made within a certain time interval and choose to return an error code (HTTP 429) instead of the requested data. This is called rate limiting and is a way for web sites to limit abuse, charge customers for usage, or both.

Per this discussion about Pushshift on Reddit, it does look like Pushshift imposes a rate limit of 120 requests per minute (also see their /meta endpoint). I was able to confirm that a script like yours will run into rate limiting by changing this line and re-running your script:

}, error = function(e){})

to:

}, error = function(e){ message(e) })

After a while, I got output like:

HTTP error 429

The solution is to slow yourself down in order to stay within this limit. A straightforward way to do this is to add a call to Sys.sleep(1) inside your for loop, where 1 is the number of seconds to pause execution. I modified your script as follows:

library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000) {
  tryCatch({
    Sys.sleep(1) # Changed. Change the value as needed.
    url_i <- paste0(part1, i+1, part2, i, part3)
    r_i <- data.frame(fromJSON(url_i))
    results[[i]] <- r_i
    myvec_i <- sapply(results, NROW)
    print(c(i, sum(myvec_i)))
  }, error = function(e){
    message(e) # Changed. Prints to the console on error.
  })
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")

Note that you may have to use a number larger than 1, and you'll notice that your script takes longer to run.
As the error seems to come randomly, depending on API availability, you could retry, and set a maxattempt number:

library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
maxattempt <- 3
for (i in 1:1000) {
  attempt <- 1
  r_i <- NULL
  while( is.null(r_i) && attempt <= maxattempt ) {
    if (attempt > 1) {print(paste("retry: i =", i))}
    attempt <- attempt + 1
    url_i <- paste0(part1, i+1, part2, i, part3)
    try({
      r_i <- data.frame(fromJSON(url_i))
      results[[i]] <- r_i
      myvec_i <- sapply(results, NROW)
      print(c(i, sum(myvec_i)))
    })
  }
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")

Result:

[1]   1 250
[1]   2 498
[1]   3 748
[1]   4 997
[1]   5 1247
[1]   6 1497
[1]   7 1747
[1]   8 1997
[1]   9 2247
[1]  10 2497
...
[1] 416 101527
[1] 417 101776
Error in open.connection(con, "rb") :
  Timeout was reached: [api.pushshift.io] Operation timed out after 21678399 milliseconds with 0 out of 0 bytes received
[1] "retry: i = 418"
[1] 418 102026
[1] 419 102276
...

In the example above, a retry occurred for i = 418. Note that for i in 1:500 the results already take about 250 MB, meaning you could expect around 25 GB for i in 1:50000: do you have enough RAM?
rjson::fromJSON returns only the first item
I have a sqlite database file with several columns. One of the columns has a JSON dictionary (with two keys) embedded in it. I want to extract the JSON column to a data frame in R that shows each key in a separate column. I tried rjson::fromJSON, but it reads only the first item. Is there a trick that I'm missing?

Here's an example that mimics my problem:

> eg <- as.vector(c("{\"3x\": 20, \"6y\": 23}", "{\"3x\": 60, \"6y\": 50}"))
> fromJSON(eg)
$3x
[1] 20

$6y
[1] 23

The desired output is something like:

# a data frame for both variables
  3x 6y
1 20 23
2 60 50

or,

# a data frame for each variable
  3x
1 20
2 60

  6y
1 23
2 50
What you are looking for is actually a combination of lapply and some application of rbind or related. I'll extend your data a little, just to have more than 2 elements.

eg <- c("{\"3x\": 20, \"6y\": 23}", "{\"3x\": 60, \"6y\": 50}", "{\"3x\": 99, \"6y\": 72}")
library(jsonlite)

Using base R, we can do

do.call(rbind.data.frame, lapply(eg, fromJSON))
#   X3x X6y
# 1  20  23
# 2  60  50
# 3  99  72

You might be tempted to do something like Reduce(rbind, lapply(eg, fromJSON)), but the notable difference is that in the Reduce model, rbind is called "N-1" times, where "N" is the number of elements in eg; this results in a LOT of copying of data, and though it might work alright with small "N", it scales horribly. With the do.call option, rbind is called exactly once.

Notice that the column labels have been R-ized, since data.frame column names should not start with numbers. (It is possible, but generally discouraged.)

If you're confident that all substrings will have exactly the same elements, then you may be good here. If there's a chance that there will be a difference at some point, perhaps

eg <- c(eg, "{\"3x\": 99}")

then you'll notice that the base R solution no longer works by default.

do.call(rbind.data.frame, lapply(eg, fromJSON))
# Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors()) :
#   numbers of columns of arguments do not match

There may be techniques to try to normalize the elements such that you can be assured of matches. However, if you're not averse to a tidyverse package:

library(dplyr)
eg2 <- bind_rows(lapply(eg, fromJSON))
eg2
# # A tibble: 4 × 2
#    `3x`  `6y`
#   <int> <int>
# 1    20    23
# 2    60    50
# 3    99    72
# 4    99    NA

Though you cannot access the columns directly with the dollar-method, you can still use [[ or backticks.

eg2$3x
# Error: unexpected numeric constant in "eg2$3"
eg2[["3x"]]
# [1] 20 60 99 99
eg2$`3x`
# [1] 20 60 99 99
Different Sum Sq and MSS using lme4::lmer and lmerTest::lmer
I get sums of squares and mean sums of squares 10x higher when I use anova on lmerTest::lmer compared to lme4::lmer objects. See the R log below. Note the warning that when I attach the lmerTest package, the stats::sigma function overrides the lme4::sigma function, and I suspect that it is this that leads to the discrepancy. In addition, the anova report now says that it is a Type III anova rather than the expected Type I. Is this a bug in the lmerTest package, or is there something about the use of the Kenward-Roger approximation for degrees of freedom that changes the calculation of SumSq and MSS and the specification of the anova Type that I don't understand?

I would append the test file, but it is confidential clinical trial information. If necessary I can see if I can cobble up a test case. Thanks in advance for any advice you all can provide about this.

> library(lme4)
Loading required package: Matrix

Attaching package: ‘lme4’

The following object is masked from ‘package:stats’:

    sigma

> test100 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
                  REML = F, data = GSIRlong, subset = !is.na(value))
> library(lmerTest)

Attaching package: ‘lmerTest’

The following object is masked from ‘package:lme4’:

    lmer

The following object is masked from ‘package:stats’:

    step

Warning message:
replacing previous import ‘lme4::sigma’ by ‘stats::sigma’ when loading ‘lmerTest’

> test200 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
                  REML = F, data = GSIRlong, subset = !is.na(value))
> anova(test100)
Analysis of Variance Table
              Df  Sum Sq Mean Sq  F value
prepost        1   3.956   3.956  18.4825
lowhi          1 130.647 130.647 610.3836
prepost:lowhi  1   0.038   0.038   0.1758
> anova(test200, ddf = 'Ken')
Analysis of Variance Table of type III with Kenward-Roger approximation for degrees of freedom
               Sum Sq Mean Sq NumDF  DenDF F.value    Pr(>F)
prepost         37.15   37.15     1 308.04   18.68 2.094e-05 ***
lowhi         1207.97 1207.97     1 376.43  607.33 < 2.2e-16 ***
prepost:lowhi    0.35    0.35     1 376.43    0.17     0.676
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Update: Thanks, Ben. I did a little code archeology on lmerTest to see if I could dope out an explanation for the above anomalies. First, it turns out that lmerTest::lmer just submits the model to lme4::lmer and then relabels the result as a "mermodLmerTest" object. The only effect of this is to invoke versions of summary() and anova() from the lmerTest package rather than the usual defaults from base and stats. (These lmerTest functions are compiled, and I have not yet gone farther to look at the C++ code.) lmerTest::summary just adds three columns to the base::summary result, giving df, t value, and Pr.

Note that lmerTest::anova, by default, computes a type III anova rather than a type I as in stats::anova. (This explains my second question above.) Not a great choice if one's model includes interactions. One can request a type I, II, or III anova using the type = 1/2/3 option.

However there are other surprises using the lmerTest versions of summary and anova, as shown in the R console output below. I used lmerTest's included sleepstudy data so that this code should be replicable.

a. Note that "sleepstudy" has 180 records (with 3 variables).
b. The summaries of fm1 and fm1a are identical except for the added fixed-effects columns. But note that in the lmerTest::summary the ddfs for the intercept and Days are 1371 and 1281 respectively; odd, given that there are only 180 records in "sleepstudy".
c. Just as in my original model above, the lme4 and lmerTest versions of anova give very different values of Sum Sq and Mean Sq (30031 and 446.65 respectively).
d. Interestingly, the lmerTest versions of anova using the Satterthwaite and Kenward-Roger estimates of the DenDF are wildly different (5794080 and 28 respectively). The K-R value seems more reasonable.

Given the above issues, I am reluctant at this point to depend on lmerTest to give reliable p-values.
Based on your (Doug Bates's) blog entry (https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html), I am now using (and recommending) the method from the posting by Dan Mirman (http://mindingthebrain.blogspot.ch/2014/02/three-ways-to-get-parameter-specific-p.html) in the final bit of code below to estimate a naive t-test p-value (assuming essentially infinite degrees of freedom) and a Kenward-Roger estimate of df (using the R package 'pbkrtest', the same package used by lmerTest). I couldn't find R code to compute the Satterthwaite estimates. The naive t-test p-value is clearly anti-conservative, but the KR estimate is thought to be pretty good. If the two give similar estimates of p, then I think one can feel comfortable with a p-value in the range [naive t-test, KR estimate].

> library(lme4); library(lmerTest); library(pbkrtest); dim(sleepstudy)
[1] 180   3
>
> fm1  <- lme4::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
> fm1a <- lmerTest::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
>
> summary(fm1)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals:
    Min      1Q  Median      3Q     Max
-3.9536 -0.4634  0.0231  0.4634  5.1793

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.09   24.740
          Days         35.07    5.922   0.07
 Residual             654.94   25.592
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825   36.84
Days          10.467      1.546    6.77

Correlation of Fixed Effects:
     (Intr)
Days -0.138
> summary(fm1a)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of freedom [lmerMod]
Formula: Reaction ~ Days + (Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals:
    Min      1Q  Median      3Q     Max
-3.9536 -0.4634  0.0231  0.4634  5.1793

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.09   24.740
          Days         35.07    5.922   0.07
 Residual             654.94   25.592
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)
(Intercept)  251.405      6.825 1371.100  302.06   <2e-16 ***
Days          10.467      1.546 1281.700   55.52   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
     (Intr)
Days -0.138
Warning message:
In deviance.merMod(object, ...) :
  deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> anova(fm1)
Analysis of Variance Table
     Df Sum Sq Mean Sq F value
Days  1  30031   30031  45.853
> anova(fm1a, ddf = 'Sat', type = 1)
Analysis of Variance Table of type I with Satterthwaite approximation for degrees of freedom
     Sum Sq Mean Sq NumDF   DenDF F.value    Pr(>F)
Days 446.65  446.65     1 5794080  45.853 1.275e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
  deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
> anova(fm1a, ddf = 'Ken', type = 1)
Analysis of Variance Table of type I with Kenward-Roger approximation for degrees of freedom
     Sum Sq Mean Sq NumDF  DenDF F.value    Pr(>F)
Days 446.65  446.65     1 27.997  45.853 2.359e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
  deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> # t.test
> coefs <- data.frame(coef(summary(fm1)))
> coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value)))
> coefs
             Estimate Std..Error   t.value          p.z
(Intercept) 251.40510   6.824556 36.838311 0.000000e+00
Days         10.46729   1.545789  6.771485 1.274669e-11
>
> # Kenward-Roger
> df.KR <- get_ddf_Lb(fm1, fixef(fm1))
> df.KR
[1] 25.89366
> coefs$p.KR <- 2 * (1 - pt(abs(coefs$t.value), df.KR))
> coefs
             Estimate Std..Error   t.value          p.z       p.KR
(Intercept) 251.40510   6.824556 36.838311 0.000000e+00 0.0000e+00
Days         10.46729   1.545789  6.771485 1.274669e-11 3.5447e-07
Calculate cutoff and sensitivity for specific values of specificity?
After calculating several regression models, I want to calculate sensitivity values and the cut-off for pre-specified values of specificity (i.e. 0.99, 0.90, 0.85 and so on) to find the best model. I have created code to calculate sensitivity and specificity for given values of the cut-off (from 0.1 to 0.9), but now I want to use specific values of specificity (i.e., calculate the corresponding cut-off value and sensitivity values), and here I'm stuck.

Suppose I have the following regression model (using the example dataset mtcars):

data(mtcars)
model <- glm(formula = vs ~ wt + disp, data = mtcars, family = binomial)

Here is the code I've used for the calculation of sens and spec for given values of the cut-off:

predvalues <- model$fitted.values

getMisclass <- function(cutoff, p, labels){
  d <- cut(predvalues, breaks = c(-Inf, cutoff, Inf), labels = c("0", "1"))
  print(confusionMatrix(d, mtcars$vs, positive = "1"))
  cat("cutoff", cutoff, ":\n")
  t <- table(d, mtcars$vs)
  print(round(sum(t[c(1,4)])/sum(t), 2))
}

cutoffs <- seq(.1, .9, by = .1)
sapply(cutoffs, getMisclass, p = predval, labels = mtcars$vs)

Can someone help me rewrite this code for the calculation of sensitivity and cut-off scores given a range of specificity values? Is it possible? The values for the cutoff should be

cutoffs <- c(0.99, 0.90, 0.85, 0.80, 0.75)

Thanks a lot!
This is closely related to how ROC curves are calculated: if those are calculated with fine granularity you essentially get a sensitivity and specificity for "every" threshold value. So, what you could do is simply calculate the sensitivities, specificities and corresponding thresholds as if you wanted to obtain a ROC curve...

library(pROC)
myRoc <- roc(predictor = predvalues, response = mtcars$vs)
plot(myRoc)
myRoc$specificities
print(with(myRoc, data.frame(specificities, sensitivities, thresholds)))
#    specificities sensitivities  thresholds
# 1     0.00000000    1.00000000        -Inf
# 2     0.05555556    1.00000000 0.002462809
# 3     0.11111111    1.00000000 0.003577104
# 4     0.16666667    1.00000000 0.004656164
# 5     0.22222222    1.00000000 0.005191974
# 6     0.27777778    1.00000000 0.006171197
# [...]

...and then look up the corresponding sensitivities and thresholds for whichever specificities you are interested in, e.g. as:

cutoffs <- c(0.99, 0.90, 0.85, 0.80, 0.75)
myData <- with(myRoc, data.frame(specificities, sensitivities, thresholds))
library(plyr)
print(laply(cutoffs, function(cutoff)
  myData$sensitivities[which.min(abs(myData$specificities - cutoff))]))
# [1] 0.7857143 0.8571429 0.8571429 0.9285714 0.9285714
What's the correct way to expand a [0,1] interval to [a,b]?
Many random-number generators return floating numbers between 0 and 1. What's the best and correct way to get integers between a and b?
Divide the interval [0,1] into B-A+1 bins. Example, A=2, B=5:

[----+----+----+----]
0   1/4  1/2  3/4   1

Maps to 2, 3, 4, 5.

The problem with the formula

Int(Rnd() * (B-A+1)) + A

is that your Rnd() generation interval is closed on both sides, thus 0 and 1 are both possible outputs and the formula gives 6 when Rnd() is exactly 1. In a real random distribution (not pseudo), the 1 has probability zero. I think it is safe enough to program something like:

r = Rnd()
if r equal 1
    MyInt = B
else
    MyInt = Int(r * (B-A+1)) + A
endif

Edit

Just a quick test in Mathematica. Define our function:

f[a_, b_] := If[(r = RandomReal[]) == 1, b, IntegerPart[r (b - a + 1)] + a]

Build a table with 3 10^5 numbers in [1,100]:

table = SortBy[Tally[Table[f[1, 100], {300000}]], First]

Check minimum and maximum:

In[137]:= {Max[First /@ table], Min[First /@ table]}
Out[137]= {100, 1}

Let's see the distribution:

BarChart[Last /@ SortBy[Tally[Table[f[1, 100], {300000}]], First],
         ChartStyle -> "DarkRainbow"]
X = (Rand() * (B - A)) + A
Another way to look at it, where r is your random number in the range 0 to 1:

(1-r)a + rb

As for your additional requirement of the result being an integer, maybe (apart from using built-in casting) the modulus operator can help you out. Check out this question and the answer: Expand a random range from 1–5 to 1–7
Well, why not just look at how Python does it itself? Read random.py in your installation's lib directory. After gutting it to only support the behavior of random.randint() (which is what you want) and removing all error checks for non-integer or out-of-bounds arguments, you get:

import random

def randint(start, stop):
    width = stop+1 - start
    return start + int(random.random()*width)

Testing:

>>> l = []
>>> for i in range(2000000):
...     l.append(randint(3, 6))
...
>>> l.count(3)
499593
>>> l.count(4)
499359
>>> l.count(5)
501432
>>> l.count(6)
499616
Assuming r_a_b is the desired random number between a and b and r_0_1 is a random number between 0 and 1, the following should work just fine:

r_a_b = (r_0_1 * (b-a)) + a
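To tie this back to R (a sketch of the same idea for the integer case; note that runif() in practice draws on [0,1), so the edge case at exactly 1 discussed above doesn't arise):

```r
# Map uniform draws on [0,1) to integers in [a, b], inclusive.
rand_int <- function(n, a, b) {
  a + floor(runif(n) * (b - a + 1))
}

set.seed(1)
x <- rand_int(10000, 2, 5)
range(x)   # spans 2 .. 5
table(x)   # roughly uniform counts over 2, 3, 4, 5
```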