mlr - should I use CV in RF model training?

I have a question about the mlr package. After tuning random forest hyperparameters with cross-validation, getLearnerModel(rforest) does not use CV but instead trains on the entire data set as a whole. Is that correct?
library(mlr)

# train task
trainTask <- makeClassifTask(data = trainsample, target = "DIED30", positive = "1")

# random forest learner and tuning setup
rf <- makeLearner("classif.randomForest", predict.type = "prob",
                  par.vals = list(ntree = 1000, mtry = 3))
rf$par.vals <- list(importance = TRUE)
rf_param <- makeParamSet(
  makeDiscreteParam("ntree", values = c(500, 750, 1000, 2000)),
  makeIntegerParam("mtry", lower = 1, upper = 15),
  makeDiscreteParam("nodesize", values = 1:20)
)
rancontrol <- makeTuneControlGrid()
set_cv <- makeResampleDesc("CV", iters = 10L)
rf_tune <- tuneParams(learner = rf, resampling = set_cv, task = trainTask,
                      par.set = rf_param, control = rancontrol, measures = auc)
rf_tune$x
rf.tree <- setHyperPars(rf, par.vals = rf_tune$x)

# train best model
rforest <- train(rf.tree, trainTask)
getLearnerModel(rforest)

# predict
pforest <- predict(rforest, trainTask)
rforest is ultimately trained with the random forest model on the entire data set, rather than with cross-validation. Is there any way to perform the final training with CV as well in mlr?
I'm planning to validate the result on an external dataset. Should I train the model with 10-fold CV before running it on the external dataset (I don't know how), or just use the parameters found in the 10-fold CV hyperparameter search?
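For what it's worth, the closest pattern I've found in the mlr docs is nested resampling via makeTuneWrapper, which (if I understand it correctly) cross-validates the whole tuning-plus-training pipeline to estimate performance, while the final model is still the one trained on all of trainsample with the tuned parameters. A rough sketch (the 5-fold outer loop is just an assumed choice):
# Sketch only: wrap the tuning in an outer CV to get a cross-validated
# performance estimate of the tuned pipeline.
rf.tuned <- makeTuneWrapper(rf, resampling = set_cv, measures = auc,
                            par.set = rf_param, control = rancontrol)
outer_cv <- makeResampleDesc("CV", iters = 5L)   # assumed outer resampling
nested <- resample(rf.tuned, trainTask, outer_cv, measures = auc)
nested$aggr                                      # cross-validated AUC of the whole pipeline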
Thanks in advance for your time.

Related

How can I adjust my pcor model for confounders and do it for many models at one time?

I have a dataset with many columns. The first column is the outcome, Test (the dependent variable, y). Columns 2-32 are confounders. Finally, columns 33-54 are miRNA expression values (the independent variables, x).
I want to compute a partial correlation (to obtain the p-value and estimate) between each independent variable and the dependent variable, adjusting for the confounders. Since my variables don't follow a normal distribution, I want to use the Spearman method.
I don't want to put all of them in the same model; I want separate models, one by one. That is:
Model 1: Test vs miRNA1 by confounders
Model 2: Test vs miRNA2 by confounders
[...]
Model 21: Test vs miRNA21 by confounders
I tried with an auxiliary function, but it doesn't work. Any help? Thanks :)
The script is here:
# data
n <- 10000
nc <- 30
nm <- 20
y <- rnorm(n = n)
X <- matrix(rnorm(n = n * (nc + nm)), ncol = nc + nm)
df <- data.frame(y = y, X)

# variable names
confounders <- colnames(df)[2:31]
mirnas <- colnames(df)[32:51]

# auxiliary regression function
pcor_fun <- function(data, y_col, X_cols) {
  formula <- as.formula(paste(y_col, X_cols))
  pcor <- pcor.test(formula = formula, data = data, method = "spearman")
  pcor_summary <- summary(pcor)$coef
  return(pcor_summary)
}

# unadjusted partial correlations
lm_list1 <- lapply(X = mirnas, FUN = pcor_fun, data = df, y_col = "y")
lm_list1[[1]]

# adjusting for confounders
lm_list2 <- lapply(X = mirnas, FUN = function(x) pcor_fun(data = df, y_col = "y", X_cols = c(confounders, x)))
lm_list2[[1]]
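For reference, here is a sketch of the non-formula interface this likely needs, assuming pcor.test is meant to come from the ppcor package (ppcor::pcor.test takes x, y and a set of conditioning variables z rather than a formula):
# Sketch assuming the intended function is ppcor::pcor.test, which expects
# vectors/data frames (x, y, z) instead of a formula interface.
library(ppcor)
pcor_one <- function(data, y_col, x_col, z_cols) {
  pcor.test(x = data[[y_col]], y = data[[x_col]], z = data[, z_cols],
            method = "spearman")   # returns estimate, p.value, statistic, n
}
res_list <- lapply(mirnas, function(m) pcor_one(df, "y", m, confounders))
names(res_list) <- mirnas
res_list[[1]]                      # y vs the first miRNA column, adjusted for the confounders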

R - MLR - Classifier Calibration - Benchmark Results

I've run a benchmark experiment with nested cross-validation (tuning + performance measurement) for a classification problem and would like to create calibration charts.
If I pass a benchmark result object to generateCalibrationData, what does plotCalibration do? Is it averaging? If so, how?
Does it make sense to have an aggregate = FALSE option to understand variability across folds, as per generateThreshVsPerfData for ROC curves?
In response to @Zach's request for a reproducible example, I (the OP) have edited my original post as follows:
Edit: Reproducible Example
# Practice Data
library("mlr")
library("ROCR")
library(mlbench)
data(BreastCancer)
dim(BreastCancer)
levels(BreastCancer$Class)
head(BreastCancer)
BreastCancer <- BreastCancer[, -c(1, 6, 7)]
BreastCancer$Cl.thickness <- as.factor(unclass(BreastCancer$Cl.thickness))
BreastCancer$Cell.size <- as.factor(unclass(BreastCancer$Cell.size))
BreastCancer$Cell.shape <- as.factor(unclass(BreastCancer$Cell.shape))
BreastCancer$Marg.adhesion <- as.factor(unclass(BreastCancer$Marg.adhesion))
head(BreastCancer)

# Define Nested Cross-Validation Strategy
cv.inner <- makeResampleDesc("CV", iters = 2, stratify = TRUE)
cv.outer <- makeResampleDesc("CV", iters = 6, stratify = TRUE)

# Define Performance Measures
perf.measures <- list(auc, mmce)

# Create Task
bc.task <- makeClassifTask(id = "bc",
                           data = BreastCancer,
                           target = "Class",
                           positive = "malignant")

# Create Tuned KSVM Learner
ksvm <- makeLearner("classif.ksvm", predict.type = "prob")
ksvm.ps <- makeParamSet(makeDiscreteParam("C", values = 2^(-2:2)),
                        makeDiscreteParam("sigma", values = 2^(-2:2)))
ksvm.ctrl <- makeTuneControlGrid()
ksvm.lrn <- makeTuneWrapper(ksvm,
                            resampling = cv.inner,
                            measures = perf.measures,
                            par.set = ksvm.ps,
                            control = ksvm.ctrl,
                            show.info = FALSE)

# Create Tuned Random Forest Learner
rf <- makeLearner("classif.randomForest",
                  predict.type = "prob",
                  fix.factors.prediction = TRUE)
rf.ps <- makeParamSet(makeDiscreteParam("mtry", values = c(2, 3, 5)))
rf.ctrl <- makeTuneControlGrid()
rf.lrn <- makeTuneWrapper(rf,
                          resampling = cv.inner,
                          measures = perf.measures,
                          par.set = rf.ps,
                          control = rf.ctrl,
                          show.info = FALSE)

# Run Cross-Validation Experiments
bc.lrns <- list(ksvm.lrn, rf.lrn)
bc.bmr <- benchmark(learners = bc.lrns,
                    tasks = bc.task,
                    resampling = cv.outer,
                    measures = perf.measures,
                    show.info = FALSE)

# Calibration Charts
bc.cal <- generateCalibrationData(bc.bmr)
plotCalibration(bc.cal)
This produces the following:
[Aggregated calibration plot]
Attempting to un-aggregate leads to:
> bc.cal <- generateCalibrationData(bc.bmr, aggregate = FALSE)
Error in generateCalibrationData(bc.bmr, aggregate = FALSE) :
unused argument (aggregate = FALSE)
>
> sessionInfo()
R version 3.2.3 (2015-12-10)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mlbench_2.1-1 ROCR_1.0-7 gplots_3.0.1 mlr_2.9
[5] stringi_1.1.1 ParamHelpers_1.10 ggplot2_2.1.0 BBmisc_1.10
loaded via a namespace (and not attached):
[1] digest_0.6.9 htmltools_0.3.5 R6_2.2.0 splines_3.2.3
[5] scales_0.4.0 assertthat_0.1 grid_3.2.3 stringr_1.0.0
[9] bitops_1.0-6 checkmate_1.8.2 gdata_2.17.0 survival_2.38-3
[13] munsell_0.4.3 tibble_1.2 randomForest_4.6-12 httpuv_1.3.3
[17] parallelMap_1.3 mime_0.5 DBI_0.5-1 labeling_0.3
[21] chron_2.3-47 shiny_1.0.0 KernSmooth_2.23-15 plyr_1.8.4
[25] data.table_1.9.6 magrittr_1.5 reshape2_1.4.1 kernlab_0.9-25
[29] ggvis_0.4.3 caTools_1.17.1 gtable_0.2.0 colorspace_1.2-6
[33] tools_3.2.3 parallel_3.2.3 dplyr_0.5.0 xtable_1.8-2
[37] gtools_3.5.0 backports_1.0.4 Rcpp_0.12.4
No, plotCalibration doesn't do any averaging, though it can plot a smooth.
If you call generateCalibrationData on a benchmark result object, it will treat each iteration of your resampled predictions as exchangeable and compute the calibration across all resampled predictions that fall into each bin.
Yes, it probably would make sense to have an option to generate an unaggregated calibration data object and be able to plot it. You are welcome to open an issue on GitHub to that effect, but this is going to be low on my priority list, TBH.
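To illustrate that pooling, here is a rough hand-rolled sketch. This is not mlr's internal code; the learner id "classif.ksvm.tuned" and the prob.malignant column name are assumptions based on mlr's default naming for tune wrappers and probability predictions:
# Bin the predicted probabilities from all outer folds together and compute
# the mean prediction and the observed event rate per bin.
pred <- getBMRPredictions(bc.bmr)[["bc"]][["classif.ksvm.tuned"]]
d <- as.data.frame(pred)                    # columns include truth, prob.malignant, iter
d$event <- as.numeric(d$truth == "malignant")
d$bin <- cut(d$prob.malignant, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
aggregate(cbind(mean.predicted = prob.malignant, observed.rate = event) ~ bin,
          data = d, FUN = mean)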

MLR - Regression Benchmark Results - Visualisation

What are the options for visualising the results of a benchmark experiment of regression learners? For instance, generateCalibrationData doesn't accept a benchmark result object derived from a set of regr. learners. I would like something similar to the calibration plots available for classification.
In response to @LarsKotthoff's comment, I (the OP) have edited my original post to provide greater detail as to what functionality I am seeking.
Edit:
I'm looking for actual-vs-predicted calibration-type plots, such as simple scatterplots, or something like the plots that exist under Classifier Calibration. If I'm not mistaken, the following would make sense for regression problems (and seems to be what is done for Classifier Calibration):
decide on a number of buckets to discretize the predictions on the x-axis, say 10 equal-length bins (obviously you could continue with the breaks and groups interface to generateCalibrationData that currently exists)
for each of those 10 bins, calculate the mean "predicted" and plot it (say via a dot) on the x-axis (possibly with some measure of variability) and join the dots across the 10 bins
for each of those 10 bins, calculate the mean "actual" and plot it on the y-axis (possibly with some measure of variability) and join the dots
provide some representation of volume in each bucket (as you've done for Classifier Calibration via "rag/rug" plots)
The basic premise behind my question is: what kind of visualisation can be provided to help interpret an rsq, mae, etc. performance measure? There are many configurations of actual vs. predicted that can lead to the same rsq, mae, etc. (A rough hand-rolled sketch of this binned summary appears after the session info below.)
Once some plot exists, switching aggregation on/off would allow individual resampling results to be examined.
I would hope that the combination:
cal <- generateCalibrationData(bmr)
plotCalibration(cal)
would be available for regression tasks; at present it doesn't seem to be (reproducible example below):
# Practice Data
library("mlr")
library(mlbench)
data(BostonHousing)
dim(BostonHousing)
head(BostonHousing)

# Define Nested Cross-Validation Strategy
cv.inner <- makeResampleDesc("CV", iters = 2)
cv.outer <- makeResampleDesc("CV", iters = 6)

# Define Performance Measures
perf.measures <- list(rsq, mae)

# Create Task
bh.task <- makeRegrTask(id = "bh",
                        data = BostonHousing,
                        target = "medv")

# Create Tuned KSVM Learner
ksvm <- makeLearner("regr.ksvm")
ksvm.ps <- makeParamSet(makeDiscreteParam("C", values = 2^(-2:2)),
                        makeDiscreteParam("sigma", values = 2^(-2:2)))
ksvm.ctrl <- makeTuneControlGrid()
ksvm.lrn <- makeTuneWrapper(ksvm,
                            resampling = cv.inner,
                            measures = perf.measures,
                            par.set = ksvm.ps,
                            control = ksvm.ctrl,
                            show.info = FALSE)

# Create Tuned Random Forest Learner
rf <- makeLearner("regr.randomForest",
                  fix.factors.prediction = TRUE)
rf.ps <- makeParamSet(makeDiscreteParam("mtry", values = c(2, 3, 5)))
rf.ctrl <- makeTuneControlGrid()
rf.lrn <- makeTuneWrapper(rf,
                          resampling = cv.inner,
                          measures = perf.measures,
                          par.set = rf.ps,
                          control = rf.ctrl,
                          show.info = FALSE)

# Run Cross-Validation Experiments
bh.lrns <- list(ksvm.lrn, rf.lrn)
bh.bmr <- benchmark(learners = bh.lrns,
                    tasks = bh.task,
                    resampling = cv.outer,
                    measures = perf.measures,
                    show.info = FALSE)

# Calibration Charts
bh.cal <- generateCalibrationData(bh.bmr)
plotCalibration(bh.cal)
which yields:
> bh.cal <- generateCalibrationData(bh.bmr)
Error in checkPrediction(x, task.type = "classif", predict.type = "prob") :
Prediction must be one of 'classif', but is: 'regr'
> sessionInfo()
R version 3.2.3 (2015-12-10)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mlbench_2.1-1 ROCR_1.0-7 gplots_3.0.1 mlr_2.9
[5] stringi_1.1.1 ParamHelpers_1.10 ggplot2_2.1.0 BBmisc_1.10
loaded via a namespace (and not attached):
[1] digest_0.6.9 htmltools_0.3.5 R6_2.2.0 splines_3.2.3
[5] scales_0.4.0 assertthat_0.1 grid_3.2.3 stringr_1.0.0
[9] bitops_1.0-6 checkmate_1.8.2 gdata_2.17.0 survival_2.38-3
[13] munsell_0.4.3 tibble_1.2 randomForest_4.6-12 httpuv_1.3.3
[17] parallelMap_1.3 mime_0.5 DBI_0.5-1 labeling_0.3
[21] chron_2.3-47 shiny_1.0.0 KernSmooth_2.23-15 plyr_1.8.4
[25] data.table_1.9.6 magrittr_1.5 reshape2_1.4.1 kernlab_0.9-25
[29] ggvis_0.4.3 caTools_1.17.1 gtable_0.2.0 colorspace_1.2-6
[33] tools_3.2.3 parallel_3.2.3 dplyr_0.5.0 xtable_1.8-2
[37] gtools_3.5.0 backports_1.0.4 Rcpp_0.12.4
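As flagged above, here is a rough hand-rolled sketch of the binned actual-vs-predicted summary, computed from the benchmark predictions. The learner id "regr.ksvm.tuned" and the truth/response column names are assumptions based on mlr's default naming:
# Discretise the pooled predictions into 10 bins and compare mean predicted
# with mean actual per bin, plus the volume in each bucket.
pred <- getBMRPredictions(bh.bmr)[["bh"]][["regr.ksvm.tuned"]]
d <- as.data.frame(pred)                                  # columns include truth, response, iter
d$bin <- cut(d$response, breaks = 10)
cal <- aggregate(cbind(mean.predicted = response, mean.actual = truth) ~ bin,
                 data = d, FUN = mean)
cal$n <- as.integer(table(d$bin)[as.character(cal$bin)]) # volume per bucket
plot(cal$mean.predicted, cal$mean.actual, type = "b",
     xlab = "Mean predicted (per bin)", ylab = "Mean actual (per bin)")
abline(0, 1, lty = 2)                                     # perfect-calibration reference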

How to plot a learning curve in R?

I want to plot a learning curve in my application.
A learning curve is a plot of the following variables:
X-axis: number of samples (training set size).
Y-axis: error (RSS / J(theta) / cost function).
It helps in observing whether the model has a high-bias or a high-variance problem.
Is there any package in R which can help in getting this plot?
You can make such a plot using the excellent caret package. The section of the caret documentation on customizing the tuning process will be very helpful.
Also, you can check out the well-written blog posts on R-Bloggers by Joseph Rickert, titled "Why Big Data? Learning Curves" and "Learning from Learning Curves".
UPDATE
I just answered a similar question, "Plot learning curves with caret package and R", and I think that answer will be more useful to you. For convenience, I have reproduced the same answer here on plotting a learning curve with R. I used the popular caret package to train the model and get the RMSE error for the training and test sets.
# load caret for createDataPartition, train, trainControl and postResample
library(caret)

# set seed for reproducibility
set.seed(7)

# randomize mtcars
mtcars <- mtcars[sample(nrow(mtcars)), ]

# split mtcars data into training and test sets
mtcarsIndex <- createDataPartition(mtcars$mpg, p = .625, list = FALSE)
mtcarsTrain <- mtcars[mtcarsIndex, ]
mtcarsTest <- mtcars[-mtcarsIndex, ]

# create empty data frame to hold the learning-curve results
learnCurve <- data.frame(m = integer(21),
                         trainRMSE = numeric(21),
                         cvRMSE = numeric(21))

# test data response feature
testY <- mtcarsTest$mpg

# run algorithms using 10-fold cross-validation with 3 repeats
trainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metric <- "RMSE"

# loop over increasing training set sizes
for (i in 3:21) {
  learnCurve$m[i] <- i
  # train the learning algorithm on the first i training rows
  fit.lm <- train(mpg ~ ., data = mtcarsTrain[1:i, ], method = "lm", metric = metric,
                  preProc = c("center", "scale"), trControl = trainControl)
  learnCurve$trainRMSE[i] <- fit.lm$results$RMSE
  # use the trained model to predict on the test data
  prediction <- predict(fit.lm, newdata = mtcarsTest[, -1])
  rmse <- postResample(prediction, testY)
  learnCurve$cvRMSE[i] <- rmse[1]
}

pdf("LinearRegressionLearningCurve.pdf", width = 7, height = 7, pointsize = 12)
# plot learning curves of training set size vs. error measure
# for the training set and the test set
plot(log(learnCurve$trainRMSE), type = "o", col = "red", xlab = "Training set size",
     ylab = "Error (RMSE)", main = "Linear Model Learning Curve")
lines(log(learnCurve$cvRMSE), type = "o", col = "blue")
legend("topright", c("Train error", "Test error"), lty = c(1, 1), lwd = c(2.5, 2.5),
       col = c("red", "blue"))
dev.off()
The output plot ("Linear Model Learning Curve": training and test RMSE against training set size) is written to LinearRegressionLearningCurve.pdf.
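As an mlr-based alternative, here is a minimal sketch assuming mlr's generateLearningCurveData and plotLearningCurve behave as documented; the feature subset and training percentages are arbitrary choices for illustration:
# Sketch: learning-curve data via mlr rather than a hand-rolled loop.
library(mlr)
lc.task <- makeRegrTask(data = mtcars[, c("mpg", "wt", "hp")], target = "mpg")
lc <- generateLearningCurveData(learners = "regr.lm",
                                task = lc.task,
                                percs = seq(0.2, 1, by = 0.2),   # fractions of the training data
                                measures = rmse,
                                resampling = makeResampleDesc("CV", iters = 5))
plotLearningCurve(lc)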

h2o deeplearning checkpoint model

Folks,
I have a problem when trying to resume h2o deep learning in R from a checkpointed model with a validation frame provided. It says "Validation dataset must be the same as for the checkpointed model", but I believe I am using the same validation dataset. If I leave validation_frame blank, resuming from the checkpoint works fine. I attach my code below:
library(h2o)
localh2o <- h2o.init(nthreads = -1)

train_image.hex <- read.csv("mnist_train.csv", header = FALSE)
train_image.hex[, 785] <- factor(train_image.hex[, 785])
train_image.hex <- as.h2o(train_image.hex)
test_image.hex <- read.csv("mnist_test.csv", header = FALSE)
test_image.hex[, 785] <- factor(test_image.hex[, 785])
test_image.hex <- as.h2o(test_image.hex)

mnist_model <- h2o.deeplearning(x = 1:784, y = 785,
                                training_frame = train_image.hex,
                                validation_frame = test_image.hex,
                                activation = "RectifierWithDropout", hidden = c(500, 1000),
                                input_dropout_ratio = 0.2,
                                hidden_dropout_ratios = c(0.5, 0.5), adaptive_rate = TRUE,
                                rho = 0.98, epsilon = 1e-7,
                                l1 = 1e-8, l2 = 1e-7, max_w2 = 10,
                                epochs = 10, export_weights_and_biases = TRUE,
                                variable_importances = FALSE)
h2o.saveModel(mnist_model, path = "/tmp", force = TRUE)
Then I shut down h2o, quit R, and restart h2o in R to resume training, at which point h2o errors out:
library(h2o)
localh2o <- h2o.init(nthreads = -1)

train_image.hex <- read.csv("mnist_train.csv", header = FALSE)
train_image.hex[, 785] <- factor(train_image.hex[, 785])
train_image.hex <- as.h2o(train_image.hex)
test_image.hex <- read.csv("mnist_test.csv", header = FALSE)
test_image.hex[, 785] <- factor(test_image.hex[, 785])
test_image.hex <- as.h2o(test_image.hex)

startmodel <- h2o.loadModel("/tmp/DeepLearning_model_R_1443812402059_20", localh2o)
mnist_model <- h2o.deeplearning(x = 1:784, y = 785,
                                checkpoint = startmodel@model_id,
                                training_frame = train_image.hex,
                                validation_frame = test_image.hex,
                                activation = "RectifierWithDropout", hidden = c(500, 1000),
                                input_dropout_ratio = 0.2,
                                hidden_dropout_ratios = c(0.5, 0.5), adaptive_rate = TRUE,
                                rho = 0.98, epsilon = 1e-7,
                                l1 = 1e-8, l2 = 1e-7, max_w2 = 10,
                                epochs = 10, export_weights_and_biases = TRUE,
                                variable_importances = FALSE)
Thank you for pointing this out to us. I have added a JIRA, and you can track its progress here: https://0xdata.atlassian.net/browse/PUBDEV-2182
You can expect the problem to be fixed soon.
Thanks!
Avni
Update: please try again using the latest version. This should work now.