Linear regression: confidence interval, bootstrapping, by group - regression

I am trying to do all 3 of these things in a linear regression at once:
1. bootstrapping
2. by a grouping variable
3. generate confidence intervals
At this time I'm able to do 2 & 3:
df %>%
group_by(group) %>%
group_modify(~ parameters::model_parameters(stats::lm(y~x, data = .x)))
But I'm not able to nest boot or lm.boot into the code above (it fails with Error: 'lm.boot' is not an exported object from 'namespace:stats').
Can anyone advise?
The goal is to have this table, but bootstrapped (the table was attached as an image in the original post).
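One way to get all three at once, sketched in base R only: resample the rows within each group, refit lm on every resample, and take percentile intervals of the coefficients. The toy data frame and the helper name boot_lm are made up for illustration; the column names group, x and y follow the question. This is a minimal sketch of a case-resampling percentile bootstrap, not the only way to bootstrap a regression.

```r
set.seed(42)

# Hypothetical toy data standing in for `df` (columns: group, x, y).
df <- data.frame(group = rep(c("a", "b"), each = 50), x = rnorm(100))
df$y <- ifelse(df$group == "a", 1 + 2 * df$x, -1 + 0.5 * df$x) + rnorm(100)

# Hypothetical helper: percentile bootstrap CIs for lm(y ~ x) on one group.
boot_lm <- function(d, n_boot = 500, level = 0.95) {
  fit <- lm(y ~ x, data = d)
  # Each column of `draws` holds the coefficients from one resample of rows.
  draws <- replicate(n_boot, {
    resampled <- d[sample(nrow(d), replace = TRUE), , drop = FALSE]
    coef(lm(y ~ x, data = resampled))
  })
  alpha <- (1 - level) / 2
  # Rows of `ci` are the two quantiles, columns are the model terms.
  ci <- apply(draws, 1, quantile, probs = c(alpha, 1 - alpha))
  data.frame(
    term     = names(coef(fit)),
    estimate = unname(coef(fit)),
    ci_low   = unname(ci[1, ]),
    ci_high  = unname(ci[2, ])
  )
}

# Apply the helper to every group and stack the results into one table.
by_group <- lapply(split(df, df$group), boot_lm)
result <- do.call(rbind, Map(cbind, group = names(by_group), by_group))
rownames(result) <- NULL
print(result)
```

Because boot_lm takes one group's data frame and returns a data frame, it should also slot into the pipeline from the question as group_modify(~ boot_lm(.x)).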

Related

Created factors with EFA, tried regressing (lm) with control variables - Error message "variable lengths differ"

EFA first-timer here!
I ran an Exploratory Factor Analysis (EFA) on a data set ("df1" = 1320 observations) with 50 variables by creating a subset with relevant variables only that have no missing values ("df2" = 301 observations).
I was able to filter 4 factors (19 variables in total).
Now I would like to take those 4 factors and regress them with control variables.
For instance: Factor 1 (df2$fa1) describes job satisfaction.
I would like to control for age and marital status.
Fa1Regression <- lm(df2$fa1 ~ df1$age + df1$marital)
However I receive the error message:
Error in model.frame.default(formula = df2$fa1 ~ df1$age + :
variable lengths differ (found for 'df1$age')
What can I do to run the regression correctly? Can I delete observations from df1 that are nonexistent in df2 so that the variable lengths are the same?
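It's having a problem using lm to regress a latent factor on other variables. Instead, use the lavaan package, where your model statement would be myModel <- 'fa1 ~ x1 + x2 + x3' (lavaan model syntax refers to variables by column name, so the df2$ prefix is dropped and all variables must be in the data frame passed to the fitting function).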
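As for the error message itself: df2$fa1 has 301 values and df1$age has 1320, and lm needs one row per observation across all variables. A minimal sketch of lining the rows up first, assuming both data frames share a row identifier (the id column and the toy values below are hypothetical, not from the question):

```r
set.seed(1)

# Hypothetical stand-in for df1: the full data set with an `id` column.
df1 <- data.frame(
  id      = 1:20,
  age     = 21:40,
  marital = factor(rep(c("single", "married"), 10))
)

# Hypothetical stand-in for df2: a subset of the rows, plus the factor score.
df2 <- data.frame(id = c(1, 2, 5, 6, 9, 10, 13, 14), fa1 = rnorm(8))

# Keep only the observations present in both, matched on id.
merged <- merge(df2, df1[, c("id", "age", "marital")], by = "id")

# All variables now come from one data frame with matching rows.
Fa1Regression <- lm(fa1 ~ age + marital, data = merged)
summary(Fa1Regression)
```

If df2 was created by subsetting df1 and still contains age and marital, the merge is not even needed: lm(fa1 ~ age + marital, data = df2) would do.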

How to generate scatter plots between the outcome variable and one of the independent variables

plot(x, y) can generate a scatter plot for a simple regression y = a + bx. How can I generate a scatter plot between the outcome variable and one of the independent variables for a multiple regression Y = A + bX1 + cX2 + dX3? Can we run y = a1 + cX2 + dX3 to get the residuals, and then plot residuals ~ X1?
Yes. The syntax doesn't change. Just remember that in R plot() takes the x-axis variable first, so to plot the residuals against a variable the call should be
plot(variable, model$residuals)
where "model" is your linear model.
By the way, if you want to store the residuals in your dataset, then you should type:
db$errors <- model$residuals
where "db" is your dataset.
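Put together, the steps above might look like the following. The simulated data and the coefficient values are made up for illustration; db, model and errors are the names used in the answer.

```r
set.seed(1)
n <- 100

# Hypothetical data with three predictors and one outcome.
db <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
db$Y <- 1 + 2 * db$X1 - db$X2 + 0.5 * db$X3 + rnorm(n)

# Regress Y on the other predictors only, leaving X1 out.
model <- lm(Y ~ X2 + X3, data = db)

# Store the residuals next to the data.
db$errors <- residuals(model)

# Residuals on the y-axis, the left-out predictor on the x-axis.
plot(db$X1, db$errors, xlab = "X1", ylab = "Residuals of Y ~ X2 + X3")
abline(lm(errors ~ X1, data = db))
```

One caveat: when X1 is correlated with X2 or X3, the slope of the fitted line in this plot will generally differ from b in the full model. Residualizing X1 on X2 and X3 as well (an added-variable plot) recovers b exactly.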

Jaccard distance between tweets

I'm currently trying to measure the Jaccard Distance between tweets in a dataset
This is where the dataset is
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
I've tried a few things to measure the distance
This is what I have so far
I saved the linked dataset to a file called Tweets.json
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(file("Tweets.json")),collapse=",")))
Then I converted json_alldata to tweet.features and got rid of the geo column
# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL
These are what the first two tweets look like
> tweet.features$text[1]
[1] "RT #ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT #NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
The first thing I tried was the stringdist function from the stringdist package
install.packages("stringdist")
library(stringdist)
#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")
When I run that, I get
[1] 0.1621622
I'm not sure that's correct, though. A intersection B = 23, and A union B = 25. The Jaccard distance is A intersection B/A union B -- right? So by my calculation, the Jaccard distance should be 0.92?
So I figured I could do it by sets. Simply calculate intersection and union and divide
This is what I tried
# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])
When I try to do intersection, I get this: The output is just list()
Intersection <- intersect(A1, A2)
list()
When I try Union, I get this:
union(A1, A2)
[[1]]
[1] "RT #ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
[[2]]
[1] "RT #NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
This doesn't seem to be grouping the words into a single set.
I figured I'd be able to divide the intersection by the union. But I guess I would need the program to count the number of words in each set, then do the calculations.
Needless to say, I'm a bit stuck and I'm not sure if I'm on the right track.
Any help would be appreciated. Thank you.
intersect and union expect vectors (and as.set is not a base R function). I think you want to compare words, so you can use strsplit; how exactly to split is up to you. An example below:
tweet.features <- list(tweet1="RT #ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
tweet2= "RT #NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")
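jaccard_i <- function(tw1, tw2){
  # split each tweet into tokens on spaces and dots
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  # sizes of the intersection and union of the two token sets
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}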
jaccard_i(tweet.features[[1]], tweet.features[[2]])
$i
[1] 20
$u
[1] 23
$j
[1] 0.8695652
Is this what you want?
Here strsplit splits on every space or dot. You may want to refine the split argument of strsplit and replace " |\\." with something more specific (see ?regex).
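On the numbers in the question: intersection divided by union is the Jaccard similarity (index), and the Jaccard distance is 1 minus that. As far as I know, stringdist's "jaccard" method also compares sets of character q-grams (single characters by default) rather than words, which would explain why it returns a small distance for two nearly identical tweets. A minimal word-level sketch in base R, with a hypothetical helper name and made-up example strings:

```r
# Hypothetical helper: word-level Jaccard similarity and distance.
jaccard_words <- function(a, b) {
  # Lower-case and split on runs of whitespace to get each word set.
  words_a <- unique(strsplit(tolower(a), "[[:space:]]+")[[1]])
  words_b <- unique(strsplit(tolower(b), "[[:space:]]+")[[1]])
  similarity <- length(intersect(words_a, words_b)) /
    length(union(words_a, words_b))
  c(similarity = similarity, distance = 1 - similarity)
}

# 3 shared words (the, cat, on) out of 7 distinct words overall.
jaccard_words("the cat sat on the mat", "the cat lay on the rug")
```

For the example output above (i = 20, u = 23), the similarity is 20/23 and the distance is therefore 3/23, about 0.13.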

R: NaiveBayes incrementally on a large data set

I have a large data set in a MySQL database (at least 11 GB of data). I would like to train a NaiveBayes model on the entire set and then test it against a smaller but also quite large data set (~3 GB).
The second part seems feasible - I assume that I would run the following in a loop:
data_test <- sqlQuery(con, paste("select * from test_data LIMIT 10000", "OFFSET", (i*10000) ))
model_pred <- predict(model, data_test, type="raw")
...and then dump the predictions back to MySQL or a CSV.
How can I, however, train my model incrementally on such a large data set? I noticed in the R documentation of the function (http://www.inside-r.org/packages/cran/e1071/docs/naiveBayes) that there is an additional argument in the predict function, "newdata", which suggests that incremental learning is possible. The predict function, however, will return the predictions and not a new model.
Please provide me with an example of how to incrementally train my model.
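A note on the premise: newdata in predict is just the data to score, so it does not update the model. One possible workaround, sketched here in base R and not a feature of e1071, relies on the fact that a Gaussian naive Bayes model only needs per-class counts, sums and sums of squares, and these can be accumulated chunk by chunk. All function names below are hypothetical, the class labels are assumed to be known up front, and iris stands in for chunks fetched from the database.

```r
# Empty accumulator: one row per class, one column per numeric feature.
nb_init <- function(classes, features) {
  zero <- matrix(0, nrow = length(classes), ncol = length(features),
                 dimnames = list(classes, features))
  list(n = setNames(numeric(length(classes)), classes),
       sum = zero, sumsq = zero)
}

# Fold one chunk of rows into the running per-class statistics.
nb_update <- function(state, chunk, label_col) {
  y <- as.character(chunk[[label_col]])
  x <- as.matrix(chunk[, colnames(state$sum), drop = FALSE])
  for (cl in names(state$n)) {
    rows <- x[!is.na(y) & y == cl, , drop = FALSE]
    state$n[cl]       <- state$n[cl] + nrow(rows)
    state$sum[cl, ]   <- state$sum[cl, ] + colSums(rows)
    state$sumsq[cl, ] <- state$sumsq[cl, ] + colSums(rows^2)
  }
  state
}

# Turn the accumulated statistics into priors, means and standard deviations.
nb_finalize <- function(state) {
  mu <- state$sum / state$n
  v  <- (state$sumsq - state$n * mu^2) / (state$n - 1)
  list(prior = state$n / sum(state$n), mean = mu, sd = sqrt(v))
}

# Demo: three chunks of shuffled iris rows instead of LIMIT/OFFSET queries.
set.seed(1)
dat <- iris[sample(nrow(iris)), ]
chunks <- split(dat, rep(1:3, each = 50))
state <- nb_init(levels(iris$Species), names(iris)[1:4])
for (ch in chunks) state <- nb_update(state, ch, "Species")
model <- nb_finalize(state)
model$mean
```

In the MySQL setting, each chunk would come from a LIMIT/OFFSET query like the one in the prediction loop above. The sum-of-squares formula can lose precision when values are large relative to their spread, so this is a sketch rather than a production-ready trainer. Categorical features would need per-class count tables instead.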

Organizing ranef.mer in ascending or descending order

I'm trying to figure out how to organize the ranef.mer list of random effects from a simple lmer model with only random intercepts and one variable (sex).
fit.b <- lmer(Math ~ 1 + Sex + (1+Sex|SchoolID), data=pisa_com, REML=FALSE)
I've plotted the random effects using qqmath, but I either need to be able to label each of the random effects by their cluster number (in this case, schools), or organize the ranef.mer output.
Solved this last night. The ranef.mer can be coerced into a dataframe.
I fit the model:
fit.b <- lmer(Math ~ 1 + Sex + (1+Sex|SchoolID), data=pisa_com, REML=FALSE)
Then coerced the random effects into a dataframe by selecting the grouping variable (SchoolID)
random.effects <- as.data.frame(ranef(fit.b)$SchoolID)
Then wrote it to a .csv for sorting in Excel
write.csv(random.effects, file="~/folder/file.name.csv")
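The sorting can also be done in R before (or instead of) exporting. A small sketch, using a hand-made data frame as a stand-in for the coerced random effects so that it runs without the model; the values, the school IDs and the second column name are hypothetical, and the real columns depend on how Sex is coded.

```r
# Hypothetical stand-in for as.data.frame(ranef(fit.b)$SchoolID):
# one row per school, with row names holding the school IDs.
random.effects <- data.frame(
  `(Intercept)` = c(2.1, -0.7, 0.4),
  Sex           = c(-0.3, 0.5, 0.1),
  row.names     = c("101", "102", "103"),
  check.names   = FALSE
)

# Ascending and descending order by the random intercept.
# Row names (the school IDs) travel with the rows.
ascending  <- random.effects[order(random.effects[["(Intercept)"]]), ]
descending <- random.effects[order(random.effects[["(Intercept)"]],
                                   decreasing = TRUE), ]
ascending
```

Because the school IDs are kept as row names, the sorted table still says which cluster each effect belongs to.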