Function for bootstrap - get sd of statistics of simulations - function

I am trying to write a bootstrap function (for an assignment). The question is as follows:
Compute the bootstrap standard error for:
- mean() and
- median() and
- the top quartile and
- the standard deviation of the price (yes, I want the standard deviation of the standard deviation... If this confuses you, go through the mean example and replace the computation of the mean by the_thing_we_want and realize that the_thing_we_want can be the standard deviation)
- max()
One way to approach this is to define a new function for each. Another is to write a bootstrap_func function that takes an additional argument called fun, and then you call it bootstrap_func(B, v, median) to have the same effect as bootstrap_median. Implement this function bootstrap_func.
Example call to this function: bootstrap_func(1000, vienna_data$price, mean).
This is what I have tried so far:
hotels_price <- read.csv("data_repo/hotels-europe/clean/hotels-europe_price.csv")
hotels_features_vienna <- read.csv("data_repo/hotels-europe/clean/hotels-europe_features.csv") %>%
  filter(city == "Vienna")
hotels_vienna <- left_join(hotels_price, hotels_features_vienna) %>%
  filter(across(.cols = everything(), function(x){!is.na(x)}))
set.seed(7777)
B <- 100 # B is asked to be set to 100 - the number of simulations
v <- hotels_vienna$price
get_sim <- function(v) {
  sample(v, replace = TRUE)
}
fun <- function(get_sim) {
  mean <- function(get_sim) {mean(get_sim)}
  median <- function(get_sim) {mean(get_sim)}
  top_quartile <- function(get_sim) {quantile(get_sim, 0.75)}
  sd <- function(get_sim) {sd(get_sim)}
  max <- function(get_sim) {max(get_sim)}
}
sim_stats <- function(B) {
  replicate(B, fun(get_sim()))
}
bootstrap_func <- function(B, v, fun) {
  sd(sim_stats())
}
bootstrap_func(100, hotels_vienna$price, sd)
When I try to run the last line, R throws the error "Error: C stack usage 15926432 is too close to the limit". I do not know how to fix this and, more importantly, whether my code will produce the desired results. I appreciate any input. TIA.
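A "C stack usage" error usually points to runaway recursion, which can happen when a function named sd or mean is defined so that it calls itself. One way to avoid this is to pass the statistic in as the fun argument and never redefine the built-ins. The following is a minimal sketch of what the assignment describes, not a definitive solution; the prices vector is made-up illustration data standing in for hotels_vienna$price, and top_quartile is a helper name assumed here.

```r
# Bootstrap standard error of an arbitrary statistic `fun`.
# B: number of resamples, v: data vector, fun: function applied to each resample.
bootstrap_func <- function(B, v, fun) {
  # Each replicate resamples v with replacement and evaluates the statistic.
  stats <- replicate(B, fun(sample(v, replace = TRUE)))
  # The bootstrap standard error is the SD of the simulated statistics.
  sd(stats)
}

# Hypothetical helper for the top quartile (75th percentile).
top_quartile <- function(x) quantile(x, 0.75, names = FALSE)

# Made-up prices, used only so the sketch runs without the hotels data.
prices <- c(81, 95, 110, 120, 135, 150, 180, 210, 260, 400)

set.seed(7777)
se_mean   <- bootstrap_func(100, prices, mean)
se_median <- bootstrap_func(100, prices, median)
se_q75    <- bootstrap_func(100, prices, top_quartile)
se_sd     <- bootstrap_func(100, prices, sd)
se_max    <- bootstrap_func(100, prices, max)
```

With the real data, the call from the assignment would be bootstrap_func(100, hotels_vienna$price, sd).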

Related

Apply function (Jaccard similarity) on every row in R (using data from the previous row)

I would like to compute Jaccard similarity on text using R.
I already found a function that computes it, and it works fine when I apply it stand-alone.
I have a dataset with utterances in a conversation. I would like to add a column that contains the Jaccard similarity of each utterance with the immediately preceding one.
This is the function I use:
jaccard <- function(a, b) {
  intersection = length(intersect(a, b))
  union = length(a) + length(b) - intersection
  return (intersection/union)
}
I have already tried multiple things looking like this:
Text$J <- 0
# for every row in DT
for (i in 1:length(Text)) {
  if(i==1) {
    #using NA at first line
    Text[i,2] <- NA
  } else {
    Text$J <- jaccard(Text$Utterance,Text$Utterance[i-1])
  }
}
Is it possible to integrate a function like the above into a 'for every row' loop? So far my attempts have not been successful, but that might be me. What happens in most cases is that it just pastes one Jaccard value into the whole column. Thank you in advance!
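The loop above assigns to the whole column (Text$J) on every pass and loops over length(Text), which is the number of columns of a data frame rather than its rows. A row-wise version could look like the sketch below. It assumes that similarity is measured on sets of lower-cased, whitespace-separated words, and the three utterances are made up for illustration.

```r
# Jaccard similarity on two token vectors, treated as sets.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  intersection <- length(intersect(a, b))
  union <- length(a) + length(b) - intersection
  intersection / union
}

# Made-up conversation data standing in for the real `Text` data frame.
Text <- data.frame(
  Utterance = c("hello how are you", "how are you today", "fine thanks"),
  stringsAsFactors = FALSE
)

# Split every utterance into words once, up front.
tokens <- strsplit(tolower(Text$Utterance), "\\s+")

# The first row has no predecessor, so it stays NA.
Text$J <- NA_real_
for (i in seq_along(tokens)[-1]) {
  # Assign to a single cell, not the whole column.
  Text$J[i] <- jaccard(tokens[[i]], tokens[[i - 1]])
}
```

Indexing with Text$J[i] is what keeps each row's value separate.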

How to get descriptive table for both continuous and categorical variables?

I want to get a descriptive table in HTML format for all variables in a data frame. For continuous variables I need the mean and standard deviation. For categorical variables I need the frequency (absolute count) and the percentage of each category. I also need the count of missing values to be included.
Let's use this data:
data("ToothGrowth")
df<-ToothGrowth
df$len[2]<-NA
df$supp[5]<-NA
I want to get a table in HTML format that looks like this:
----------------------------------------------------------------------
Variables    N (missing)    Mean (SD) / %
----------------------------------------------------------------------
len          59 (1)         18.9 (7.65)
supp
  OJ         30             50%
  VC         29             48.33%
  NA         1              1.67%
dose         60             1.17 (0.629)
I also need to set the number of digits shown after the decimal point.
If you know a better way to display this information in HTML, please provide your solution.
Here's a programmatic way to create separate summary tables for the numeric and factor columns. Note that this doesn't report the NA counts as you requested, but it does ignore NAs when calculating the summary statistics, as you did. It's a starting point, anyway. From here you could combine the tables and format the headers however you want.
If you knit this code within an RMarkdown document with HTML output, kable will automatically generate the HTML table, and the document's CSS will format it nicely with horizontal rules. Note that there's also a booktabs option to kable that makes prettier tables, like the LaTeX booktabs package. Otherwise, see the documentation for knitr::kable for options.
library(dplyr)
library(tidyr)
library(knitr)
data("ToothGrowth")
df<-ToothGrowth
df$len[2]<-NA
df$supp[5]<-NA
numeric_cols <- dplyr::select_if(df, is.numeric) %>%
  gather(key = "variable", value = "value") %>%
  group_by(variable) %>%
  summarize(count = n(),
            mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE))
factor_cols <- dplyr::select_if(df, is.factor) %>%
  gather(key = "variable", value = "value") %>%
  group_by(variable, value) %>%
  summarize(count = n()) %>%
  mutate(p = count / sum(count, na.rm = TRUE))
knitr::kable(numeric_cols)
knitr::kable(factor_cols)
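The missing-value counts that the summary above leaves out could be computed separately in base R and merged in afterwards. A minimal sketch, assuming the same modified ToothGrowth data; the names n_missing, n_valid, digits and mean_sd are illustrative only.

```r
data("ToothGrowth")
df <- ToothGrowth
df$len[2] <- NA
df$supp[5] <- NA

# Per-variable counts of missing and non-missing values.
n_missing <- sapply(df, function(x) sum(is.na(x)))
n_valid   <- sapply(df, function(x) sum(!is.na(x)))

# Number of digits to show after the decimal point (an assumed setting).
digits <- 2

# "Mean (SD)" strings for the numeric columns, ignoring NAs.
num <- df[sapply(df, is.numeric)]
mean_sd <- sapply(num, function(x) {
  paste0(formatC(mean(x, na.rm = TRUE), format = "f", digits = digits),
         " (",
         formatC(sd(x, na.rm = TRUE), format = "f", digits = digits),
         ")")
})
```

The resulting vectors are named by column, so they can be joined to the kable tables by variable name.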
I found the R package table1, which does what I want. Here is the code:
library(table1)
data("ToothGrowth")
df<-ToothGrowth
df$len[2]<-NA
df$supp[5]<-NA
table1(reformulate(colnames(df)), data=df)

How to extract input from a dynamic matrix in R shiny

This is related to an old question about creating a matrix-style input in a shiny app with dynamic dimensions. My goal is to have a matrix of numerical inputs (the dimensions of which are determined by other user inputs), and then pass that matrix to other R commands and print some output from those calculations. I have code that successfully executes everything except that I can only access the user inputs as characters.
Here is an example that sets up the input and just prints a couple cells from the matrix (this works fine, but isn't what I need):
shiny::runApp(list(
  ui = pageWithSidebar(
    headerPanel("test"),
    sidebarPanel(
      numericInput(inputId = "nrow",
                   label = "number of rows",
                   min = 1,
                   max = 20,
                   value = 1),
      numericInput(inputId = "ncol",
                   label = "number of columns",
                   min = 1,
                   max = 20,
                   value = 1)
    ),
    mainPanel(
      tableOutput("value"),
      uiOutput("textoutput"))
  ),
  server = function(input,output){
    isolate({
      output$value <-renderTable({
        num.inputs.col1 <- paste0("<input id='r", 1:input$nrow, "c", 1, "' class='shiny-bound-input' type='number' value='1'>")
        df <- data.frame(num.inputs.col1)
        if (input$ncol >= 2){
          for (i in 2:input$ncol){
            num.inputs.coli <- paste0("<input id='r", 1:input$nrow, "c", i, "' class='shiny-bound-input' type='number' value='1'>")
            df <- cbind(df,num.inputs.coli)
          }
        }
        colnames(df) <- paste0("time ",as.numeric(1:input$ncol))
        df
      }, sanitize.text.function = function(x) x)
    })
    output$textoutput <- renderUI(paste0("Cells [1,1] and [2,2]: ",input$r1c1,",",input$r2c2))
  }
))
However, when I try to do any operation on the inputs in the matrix, such as output$textoutput <- renderUI(as.numeric(paste0(input$r1c1))+as.numeric(paste0(input$r2c2))), I get classic R errors like "$ operator is invalid for atomic vectors". I have tried many combinations of 'as.numeric', 'as.character', etc. to get the values into the correct format. When I check the structure of those input cells, I see that they have an extra 'NULL' attribute that I can't seem to get rid of, but I am unsure whether that is the root of the problem.
In short, how do I extract the plain numbers from that matrix? Or is there a better way to do this in shiny? The only other solution I'm aware of is the rhandsontable package, which I would prefer not to use if there is a reasonable alternative.
Any suggestions would be very appreciated. Thank you!
Edit/solution: replacing renderUI and uiOutput with renderPrint and verbatimTextOutput solves the problem. Thank you for the comment, blondeclover!
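Once the output is rendered with renderPrint, the individual cells can be gathered into a numeric matrix for further calculations. The helper below is a hypothetical sketch, not part of shiny: collect_matrix is an assumed name, it relies on the r<i>c<j> input ids used above, and a plain list stands in for shiny's input object so the example runs outside an app.

```r
# Collect inputs named r<i>c<j> into an nrow x ncol numeric matrix.
# `input` can be shiny's input object or any list-like with [[ ]] access.
collect_matrix <- function(input, nrow, ncol) {
  m <- matrix(NA_real_, nrow = nrow, ncol = ncol)
  for (i in seq_len(nrow)) {
    for (j in seq_len(ncol)) {
      val <- input[[paste0("r", i, "c", j)]]
      # Cells that have not been bound yet come back as NULL; leave them NA.
      if (!is.null(val) && length(val) == 1) {
        m[i, j] <- as.numeric(val)
      }
    }
  }
  m
}

# Stand-in for shiny's `input`; values arrive as character strings.
fake_input <- list(r1c1 = "1", r1c2 = "2", r2c1 = "3", r2c2 = "4")
m <- collect_matrix(fake_input, nrow = 2, ncol = 2)
total <- m[1, 1] + m[2, 2]
```

Inside the server function the call would presumably be collect_matrix(input, input$nrow, input$ncol), wrapped in renderPrint.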

Get column name in apply function

I am trying to make a function that makes a small report for every column in a data frame by using apply. In the report I want to use the name of the column so I have to 'extract' it somehow and that is what my question is about. How do I get the name of the column in my apply function?
Here is a simple example where I want to use the name of the column in the graph title: (for now I just hardcoded the name as 'x')
x <- c(1,1,2,2,2,3)
y <- c(2,3,4,5,4,4)
Tb <- data.frame(x,y)
Dq_Hist <- function(Tab){
  Name <- 'x'
  Ttl <- paste('Variable: ',Name,'')
  hist(Tab,main=Ttl,col=c('grey'),xlab=Name)
}
D <- apply(Tb,MARGIN=2,FUN=Dq_Hist)
Well, if nobody answers, you have to find out yourself... I found out that you can call sapply with a vector of column indices and use the index inside the function. So the solution is:
x <- c(1,1,2,2,2,3)
y <- c(2,3,4,5,4,4)
Tb <- data.frame(x,y)
Dq_Hist <- function(i){
  Name <- colnames(Tb)[i]
  Ttl <- paste('Variable: ',Name,'')
  hist(Tb[,i],main=Ttl,col=c('grey'),xlab=Name)
}
D <- sapply(1:ncol(Tb),Dq_Hist)
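A possible variation on the index approach is to iterate over the column names directly, which avoids the lookup by position and yields a result named by column. A minimal sketch, assuming the same Tb data frame; Dq_Hist_named is an illustrative name.

```r
x <- c(1, 1, 2, 2, 2, 3)
y <- c(2, 3, 4, 5, 4, 4)
Tb <- data.frame(x, y)

# Takes a column name, so the name is available for the title and axis label.
Dq_Hist_named <- function(Name) {
  Ttl <- paste("Variable:", Name)
  hist(Tb[[Name]], main = Ttl, col = "grey", xlab = Name)
}

# simplify = FALSE keeps one histogram object per column, named by column.
D <- sapply(names(Tb), Dq_Hist_named, simplify = FALSE)
```

Here D$x and D$y hold the histogram objects returned by hist.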

Using mapply in a function to detect outliers more than 3sd from the mean returns incorrect values

I wrote a function that is supposed to return for each variable in a dataset a list of values that are more than 3sd apart from their column-mean. Therefore, I used mapply to compare for each single value in the dataset, whether its absolute value is more than three times larger than the sd of the respective variable it belongs to. The function works, however, it gives me a bunch of "outliers" that are not outside of this cutoff. Where might I have gone wrong?
findOutlier <- function(data, cutoff = 3) {
  ## Calculate the sd
  sds <- apply(data, 2, sd, na.rm = TRUE)
  ## Identify the cells with value greater than cutoff * sd
  result <- mapply(function(d, s) {
    which(abs(d) > cutoff * s)
  }, data, sds)
  result
}
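One thing that stands out in the function above is that it compares abs(d), the raw value, against cutoff * s, rather than the distance of the value from its column mean. For columns whose mean is far from zero, that flags ordinary values. A sketch of a mean-centred version follows; find_outliers and the toy data are assumptions made for illustration.

```r
# Flag values more than `cutoff` standard deviations away from the column mean.
find_outliers <- function(data, cutoff = 3) {
  lapply(data, function(col) {
    # Distance from the column mean, not the raw absolute value.
    dev <- abs(col - mean(col, na.rm = TRUE))
    which(dev > cutoff * sd(col, na.rm = TRUE))
  })
}

# Toy data: column a has one extreme value (row 31), column b has none.
toy <- data.frame(
  a = c(rep(c(9, 10, 11), 10), 100),
  b = rep(c(99, 100, 101), length.out = 31)
)

res <- find_outliers(toy)
```

In this toy data every value in column b is around 100 while its standard deviation is below 1, so the uncentred comparison would report all of column b as outliers, whereas the centred one reports none.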