How to plot a learning curve in R? - regression

I want to plot a learning curve in my application.
A sample curve image is shown below.
A learning curve is a plot of the following variables:
X-axis: number of samples (training set size).
Y-axis: error (RSS / J(theta) / cost function).
It helps in observing whether our model has a high-bias or a high-variance problem.
Is there any package in R which can help in getting this plot?

You can make such a plot using the excellent caret package. The section on customizing the tuning process will be very helpful.
Also, you can check out the well-written blog posts on R-bloggers by Joseph Rickert. They are titled "Why Big Data? Learning Curves" and "Learning from Learning Curves".
UPDATE
I just answered a similar question, Plot learning curves with caret package and R, and I think that answer will be more useful to you. For convenience's sake, I have reproduced it here: it plots a learning curve in R, using the popular caret package to train the model and to compute the RMSE on the training and test sets.
# load caret for createDataPartition(), train() and postResample()
library(caret)

# set seed for reproducibility
set.seed(7)

# randomize mtcars
mtcars <- mtcars[sample(nrow(mtcars)), ]

# split the mtcars data into training and test sets
mtcarsIndex <- createDataPartition(mtcars$mpg, p = .625, list = FALSE)
mtcarsTrain <- mtcars[mtcarsIndex, ]
mtcarsTest  <- mtcars[-mtcarsIndex, ]

# create an empty data frame, one row per training set size
learnCurve <- data.frame(m = integer(21),
                         trainRMSE = numeric(21),
                         cvRMSE = numeric(21))

# test data response feature
testY <- mtcarsTest$mpg

# run the algorithm using 10-fold cross-validation with 3 repeats
trainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metric <- "RMSE"

# loop over increasing training set sizes
for (i in 3:21) {
    learnCurve$m[i] <- i

    # train the learning algorithm on the first i training examples
    fit.lm <- train(mpg ~ ., data = mtcarsTrain[1:i, ], method = "lm",
                    metric = metric, preProc = c("center", "scale"),
                    trControl = trainControl)
    learnCurve$trainRMSE[i] <- fit.lm$results$RMSE

    # use the trained model to predict on the held-out test data
    prediction <- predict(fit.lm, newdata = mtcarsTest[, -1])
    rmse <- postResample(prediction, testY)
    learnCurve$cvRMSE[i] <- rmse[1]
}
pdf("LinearRegressionLearningCurve.pdf", width = 7, height = 7, pointsize=12)
# plot learning curves of training set size vs. error measure
# for training set and test set
plot(log(learnCurve$trainRMSE),type = "o",col = "red", xlab = "Training set size",
ylab = "Error (RMSE)", main = "Linear Model Learning Curve")
lines(log(learnCurve$cvRMSE), type = "o", col = "blue")
legend('topright', c("Train error", "Test error"), lty = c(1,1), lwd = c(2.5, 2.5),
col = c("red", "blue"))
dev.off()
The output plot is as shown below:

Related

Mixed regression model controlling for Genetic Relatedness

I am trying to perform a regression analysis while controlling for genetic relatedness among individuals (a genetic relatedness matrix). See the code below for an example including identical twins ("1") and siblings ("0.5"); N = 10.
I know I should use a mixed model. However, I am not able to include the genetic relatedness matrix in the model. I have seen that two packages are often used for this ("kinship2" and "pedigreemm").
Here is the code, but I am unable to fit the model.
require("pedigreemm")
require("lme4")
library(kinship2)
library(car)
FAMID <- c(1,1,2,2,3,3,4,4,5,5)
UNIQUEID <- 1:10
Pheno <- c(2,4,5,5,8,10,15,20,0,0)
PRS <- c(0.1,0.5, 1,1, 2,2 ,3,3, -0.5, -0.4)
data <- data.frame(FAMID, UNIQUEID, Pheno, PRS)
# 10 x 10 genetic relatedness matrix (identical twins = 1, siblings = 0.5)
RELMAT <- matrix(c(1,1,0.04,0.03,0.03,0.03,0.03,0.03,0.03,0.03,
                   1,1,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,
                   0.04,0.03,1,1,0.03,0.05,0.03,0.03,0.03,0.03,
                   0.03,0.03,1,1,0.03,0.03,0.03,0.03,0.03,0.03,
                   0.03,0.03,0.03,0.03,1,0.5,0.03,0.03,0.03,0.03,
                   0.03,0.03,0.05,0.03,0.5,1,0.03,0.03,0.03,0.03,
                   0.03,0.03,0.03,0.03,0.03,0.03,1,0.5,0.03,0.03,
                   0.03,0.03,0.03,0.03,0.03,0.03,0.5,1,0.03,0.03,
                   0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,1,0.5,
                   0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.5,1),
                 nrow = 10, ncol = 10, byrow = TRUE)
This is the model that I want to fit:
m1 = lmer(Pheno ~ PRS + (1 | RELMAT), data = data)
Thank you so much in advance.
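For reference, a minimal sketch of one way such a matrix is commonly passed to a mixed model: the lme4qtl package (not mentioned in the question) provides relmatLmer(), which extends lmer() with a relmat argument keyed by the grouping factor. Whether it converges on this tiny toy matrix is another matter:
# a minimal sketch, assuming the lme4qtl package is installed
# (relmatLmer() extends lmer() with a relmat argument that maps a
# grouping factor to a covariance matrix)
library(lme4qtl)

# row/column names must match the levels of the grouping factor;
# RELMAT must also admit a factorisation (positive semi-definite)
rownames(RELMAT) <- colnames(RELMAT) <- data$UNIQUEID

m1 <- relmatLmer(Pheno ~ PRS + (1 | UNIQUEID),
                 data = data,
                 relmat = list(UNIQUEID = RELMAT))
summary(m1)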

MLR - should I use CV in RF model training

I have a question about the mlr package. After tuning random forest hyperparameters with cross-validation, getLearnerModel(rforest) will not use CV but will instead use the entire data set as a whole; is that correct?
# training task
trainTask <- makeClassifTask(data = trainsample, target = "DIED30", positive = "1")

# random forest learner and tuning grid
rf <- makeLearner("classif.randomForest", predict.type = "prob",
                  par.vals = list(ntree = 1000, mtry = 3))
rf$par.vals <- list(importance = TRUE)
rf_param <- makeParamSet(
    makeDiscreteParam("ntree", values = c(500, 750, 1000, 2000)),
    makeIntegerParam("mtry", lower = 1, upper = 15),
    makeDiscreteParam("nodesize", values = 1:20)
)
rancontrol <- makeTuneControlGrid()
set_cv <- makeResampleDesc("CV", iters = 10L)
rf_tune <- tuneParams(learner = rf, resampling = set_cv, task = trainTask,
                      par.set = rf_param, control = rancontrol, measures = auc)
rf_tune$x

# train best model
rf.tree <- setHyperPars(rf, par.vals = rf_tune$x)
rforest <- train(rf.tree, trainTask)
getLearnerModel(rforest)

# predict
pforest <- predict(rforest, trainTask)
rforest is eventually trained with the RF model on the entire data set, rather than with cross-validation.
Is there any way to perform the final training with CV as well in mlr?
I'm planning to validate the result on an external dataset. Should I train the model with 10-fold CV before running it on the external dataset (I don't know how), or just use the parameters found in the 10-fold CV hyperparameter search?
Thanks in advance for your time.
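One sketch of the usual workflow (my own suggestion, not tested on your data): the final model is not trained "with CV"; cross-validation only estimates performance. In mlr you can resample() the tuned configuration for an honest estimate, then refit on all the training data before scoring the external set:
# a minimal sketch, reusing rf.tree, trainTask and set_cv from the question
# 1) cross-validate the tuned configuration to estimate its performance
cv.result <- resample(learner = rf.tree, task = trainTask,
                      resampling = set_cv, measures = auc)
cv.result$aggr  # aggregated AUC over the 10 folds

# 2) refit on the full training data; this single model is the one
#    to apply to the external validation dataset
final.model <- train(rf.tree, trainTask)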

How to extract an adjacency matrix of a giant component of a graph using R?

I would like to extract the adjacency matrix of the giant component of a graph using R.
For example, I can create an Erdős–Rényi G(n,p) graph:
n = 100
p = 1.5/n
g = erdos.renyi.game(n, p)
coords = layout.fruchterman.reingold(g)
plot(g, layout=coords, vertex.size = 3, vertex.label=NA)
# Get the components of an undirected graph
cl = clusters(g)
# How many components?
cl$no
# How big are these (the first row is size, the second is the number of components of that size)?
table(cl$csize)
cl$membership
# Get the giant component
nodes = which(cl$membership == which.max(cl$csize))
# Color in red the nodes in the giant component and in sky blue the rest
V(g)$color = "SkyBlue2"
V(g)[nodes]$color = "red"
plot(g, layout=coords, vertex.size = 3, vertex.label=NA)
Here, I only want to extract the adjacency matrix of those red nodes.
It's easy to get the giant component as a new graph like below and then get the adjacency matrix.
g <- erdos.renyi.game(100, .015, directed = TRUE)
# if you have directed graph, decide if you want
# strongly or weakly connected components
co <- components(g, mode = 'STRONG')
gi <- induced.subgraph(g, which(co$membership == which.max(co$csize)))
# if you want here you can decide if you want values only
# in the upper or lower triangle or both
ad <- get.adjacency(gi)
But you might want to keep the vertex IDs of the original graph. In this case just subset the adjacency matrix:
g <- erdos.renyi.game(100, .015)
co <- components(g)
gi_vids <- which(co$membership == which.max(co$csize))
gi_ad <- get.adjacency(g)[gi_vids, gi_vids]
# you can even add the names of the nodes
# as row and column names.
# generating dummy node names:
V(g)$name <- sapply(
    seq(vcount(g)),
    function(i) {
        paste(letters[ceiling(runif(5) * 26)], collapse = '')
    }
)
rownames(gi_ad) <- V(g)$name[gi_vids]
colnames(gi_ad) <- V(g)$name[gi_vids]

MLR - Regression Benchmark Results - Visualisation

What are the options for visualising the results of a benchmark experiment with regression learners? For instance, generateCalibrationData doesn't accept a benchmark result object derived from a set of regr. learners. I would like something similar to the calibration plots available for classification.
In response to @LarsKotthoff's comment, I (the OP) have edited my original post to provide greater detail about the functionality I am seeking.
Edit:
I'm looking for actual vs. predicted calibration-type plots, such as simple scatterplots or something like the plots that exist under Classifier Calibration. If I'm not mistaken, the following would make sense for regression problems (and seems to be what is done for Classifier Calibration); a rough sketch of this binning follows the list:
decide on a number of buckets to discretise the predictions on the x-axis, say 10 equal-length bins (obviously you could continue with the breaks and groups interface to generateCalibrationData that currently exists)
for each of those 10 bins, calculate the mean "predicted" and plot it (say via a dot) on the x-axis (possibly with some measure of variability), joining the dots across the 10 bins
for each of those 10 bins, calculate the mean "actual" and plot it on the y-axis (possibly with some measure of variability), joining the dots
provide some representation of volume in each bucket (as you've done for Classifier Calibration via "rag/rug" plots)
The basic premise behind my question is what kind of visualisation can be provided to help interpret an rsq, mae, etc. performance measure. There are many configurations of actual vs. predicted that can lead to the same rsq, mae, etc.
Once some plot exists, switching aggregation on/off would allow individual resampling results to be examined.
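For concreteness, here is a hand-rolled sketch of that binning (my own code, not existing mlr functionality). It assumes a data frame preds with numeric columns response (predicted) and truth (actual), e.g. from getBMRPredictions(bmr, as.df = TRUE):
# a sketch of the binned calibration summary described in the list above
binned_calibration <- function(preds, n.bins = 10) {
    bins <- cut(preds$response, breaks = n.bins)  # equal-length buckets
    data.frame(
        mean.predicted = tapply(preds$response, bins, mean),
        mean.actual    = tapply(preds$truth, bins, mean),
        volume         = as.vector(table(bins))    # observations per bucket
    )
}
# e.g. plot(mean.actual ~ mean.predicted, data = binned_calibration(preds), type = "o")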
I would hope that the combination:
cal <- generateCalibrationData(bmr)
plotCalibration(cal)
would be available for regression tasks; at present it doesn't seem to be (reproducible example below):
# Practice Data
library(mlr)
library(mlbench)
data(BostonHousing)
dim(BostonHousing)
head(BostonHousing)

# Define Nested Cross-Validation Strategy
cv.inner <- makeResampleDesc("CV", iters = 2)
cv.outer <- makeResampleDesc("CV", iters = 6)

# Define Performance Measures
perf.measures <- list(rsq, mae)

# Create Task
bh.task <- makeRegrTask(id = "bh",
                        data = BostonHousing,
                        target = "medv")

# Create Tuned KSVM Learner
ksvm <- makeLearner("regr.ksvm")
ksvm.ps <- makeParamSet(makeDiscreteParam("C", values = 2^(-2:2)),
                        makeDiscreteParam("sigma", values = 2^(-2:2)))
ksvm.ctrl <- makeTuneControlGrid()
ksvm.lrn <- makeTuneWrapper(ksvm,
                            resampling = cv.inner,
                            measures = perf.measures,
                            par.set = ksvm.ps,
                            control = ksvm.ctrl,
                            show.info = FALSE)

# Create Tuned Random Forest Learner
rf <- makeLearner("regr.randomForest",
                  fix.factors.prediction = TRUE)
rf.ps <- makeParamSet(makeDiscreteParam("mtry", values = c(2, 3, 5)))
rf.ctrl <- makeTuneControlGrid()
rf.lrn <- makeTuneWrapper(rf,
                          resampling = cv.inner,
                          measures = perf.measures,
                          par.set = rf.ps,
                          control = rf.ctrl,
                          show.info = FALSE)

# Run Cross-Validation Experiments
bh.lrns <- list(ksvm.lrn, rf.lrn)
bh.bmr <- benchmark(learners = bh.lrns,
                    tasks = bh.task,
                    resampling = cv.outer,
                    measures = perf.measures,
                    show.info = FALSE)

# Calibration Charts
bh.cal <- generateCalibrationData(bh.bmr)
plotCalibration(bh.cal)
which yields:
> bh.cal <- generateCalibrationData(bh.bmr)
Error in checkPrediction(x, task.type = "classif", predict.type = "prob") :
Prediction must be one of 'classif', but is: 'regr'
> sessionInfo()
R version 3.2.3 (2015-12-10)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mlbench_2.1-1 ROCR_1.0-7 gplots_3.0.1 mlr_2.9
[5] stringi_1.1.1 ParamHelpers_1.10 ggplot2_2.1.0 BBmisc_1.10
loaded via a namespace (and not attached):
[1] digest_0.6.9 htmltools_0.3.5 R6_2.2.0 splines_3.2.3
[5] scales_0.4.0 assertthat_0.1 grid_3.2.3 stringr_1.0.0
[9] bitops_1.0-6 checkmate_1.8.2 gdata_2.17.0 survival_2.38-3
[13] munsell_0.4.3 tibble_1.2 randomForest_4.6-12 httpuv_1.3.3
[17] parallelMap_1.3 mime_0.5 DBI_0.5-1 labeling_0.3
[21] chron_2.3-47 shiny_1.0.0 KernSmooth_2.23-15 plyr_1.8.4
[25] data.table_1.9.6 magrittr_1.5 reshape2_1.4.1 kernlab_0.9-25
[29] ggvis_0.4.3 caTools_1.17.1 gtable_0.2.0 colorspace_1.2-6
[33] tools_3.2.3 parallel_3.2.3 dplyr_0.5.0 xtable_1.8-2
[37] gtools_3.5.0 backports_1.0.4 Rcpp_0.12.4
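In the meantime, a workaround sketch (my own; it relies only on getBMRPredictions(), which does accept regression benchmark results) to build the actual-vs-predicted scatterplot directly:
# pull the per-learner predictions out of the BenchmarkResult
library(ggplot2)
preds <- getBMRPredictions(bh.bmr, as.df = TRUE)

# actual vs. predicted, one panel per learner, with the identity line
ggplot(preds, aes(x = response, y = truth)) +
    geom_point(alpha = 0.4) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    facet_wrap(~ learner.id) +
    labs(x = "Predicted medv", y = "Actual medv")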

ggplot2 is not printing all the information I need in R

I am trying to replicate the following script: San Francisco Crime Classification.
Here is my code:
library(dplyr)
library(ggmap)
library(ggplot2)
library(readr)
library(rjson)
library(RCurl)
library(RJSONIO)
library(jsonlite)
train <- jsonlite::fromJSON("/home/felipe/Templates/Archivo de prueba/databritanica.json")
counts <- summarise(group_by(train, Crime_type), Counts = length(Crime_type))
#counts <- counts[order(-counts$Crime_type),]

# This removes the "Other Offenses" category
top12 <- train[train$Crime_type %in% counts$Crime_type[c(1, 3:13)], ]

map <- get_map(location = c(lon = -2.747770, lat = 53.389499), zoom = 12, source = "osm")
p <- ggmap(map) +
    geom_point(data = top12, aes(x = Longitude, y = Latitude, color = factor(Crime_type)),
               alpha = 0.05) +
    guides(colour = guide_legend(override.aes = list(alpha = 1.0, size = 6.0),
                                 title = "Type of Crime")) +
    scale_colour_brewer(type = "qual", palette = "Paired") +
    ggtitle("Top Crimes in Britain") +
    theme_light(base_size = 20) +
    theme(axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank())
ggsave("united kingdom_top_crimes_map.png", p, width = 14, height = 10, units = "in")
I am reading the data from a JSON file and trying to plot points over the map according to the data. Each point is a type of crime that has been committed; the location of each point depends on two parameters: longitude and latitude.
What is the problem? The points are not being plotted. The script generates a new map without the points that it is supposed to show.
This is the original map:
And this is the result:
Any ideas?
This is an example of the data contained in the JSON file:
[
{"Month":"2014-05","Longitude":-2.747770,"Latitude":53.389499,"Location":"On or near Cronton Road","LSOA_name":"Halton 001B","Crime_type":"Other theft"},
{"Month":"2014-05","Longitude":-2.799099,"Latitude":53.354676,"Location":"On or near Old Higher Road","LSOA_name":"Halton 008B","Crime_type":"Anti-social behaviour"},
{"Month":"2014-05","Longitude":-2.804451,"Latitude":53.352456,"Location":"On or near Higher Road","LSOA_name":"Halton 008B","Crime_type":"Anti-social behaviour"}
]
Short answer:
Your alpha = 0.05 is making the points practically invisible when plotted on the colorful map background, as mentioned by @aosmith.
Longer answer:
I suggest the following changes to your geom_point:
Increase the alpha to something more reasonable.
Increase the size of the points.
Optionally, change the shape to one with a border and a fill, for better visibility.
The last change requires mapping Crime_type to fill in aes (rather than color), and swapping scale_colour_brewer for scale_fill_brewer.
Example:
# Load required packages
library(dplyr)
library(ggplot2)
library(ggmap)
library(jsonlite)
# Example data provided in question, with one manually entered entry with
# Crime_type = "Other Offenses"
'[
{"Month":"2014-05","Longitude":-2.747770,"Latitude":53.389499,"Location":"On or near Cronton Road","LSOA_name":"Halton 001B","Crime_type":"Other theft"},
{"Month":"2014-05","Longitude":-2.799099,"Latitude":53.354676,"Location":"On or near Old Higher Road","LSOA_name":"Halton 008B","Crime_type":"Anti-social behaviour"},
{"Month":"2014-05","Longitude":-2.804451,"Latitude":53.352456,"Location":"On or near Higher Road","LSOA_name":"Halton 008B","Crime_type":"Anti-social behaviour"},
{"Month":"2014-05","Longitude":-2.81,"Latitude":53.36,"Location":"On or near Higher Road","LSOA_name":"Halton 008B","Crime_type":"Other Offenses"}
]' -> example_json
train <- fromJSON(example_json)
# Process the data, the dplyr way
counts <- train %>%
group_by(Crime_type) %>%
summarise(Counts = length(Crime_type))
# This removes the "Other Offenses" category
top12 <- train %>%
filter(Crime_type != "Other Offenses")
# Get the map
map <- get_map(location=c(lon = -2.747770, lat = 53.389499), zoom=12, source="osm")
# Plotting code
p <- ggmap(map) +
# Changes made to geom_point.
# I increased the alpha and size, and I used a shape that has
# a black border and a fill determined by Crime_type.
geom_point(data=top12, aes(x=Longitude, y=Latitude, fill=factor(Crime_type)),
shape = 21, alpha = 0.75, size = 3.5, color = "black") +
guides(fill = guide_legend(override.aes = list(alpha=1.0, size=6.0),
title="Type of Crime")) +
# Changed scale_color_brewer to scale_fill_brewer
scale_fill_brewer(type="qual", palette="Paired") +
ggtitle("Top Crimes in Britain") +
theme_light(base_size=20) +
theme(axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank())