Univariate Cox regression analysis with multiple covariates - regression

library(survival)  # provides Surv(), coxph(), and the lung data

covariates <- c("age", "sex", "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,
                        function(x) as.formula(paste('Surv(time, status) ~', x)))
univ_models <- lapply(univ_formulas, function(x) coxph(x, data = lung))
# Extract a one-row summary per model
univ_results <- lapply(univ_models,
                       function(x) {
                         x <- summary(x)
                         p.value <- signif(x$wald["pvalue"], digits = 2)
                         wald.test <- signif(x$wald["test"], digits = 2)
                         beta <- signif(x$coef[1], digits = 2)  # coefficient beta
                         HR <- signif(x$coef[2], digits = 2)    # exp(beta)
                         HR.confint.lower <- signif(x$conf.int[, "lower .95"], 2)
                         HR.confint.upper <- signif(x$conf.int[, "upper .95"], 2)
                         HR <- paste0(HR, " (",
                                      HR.confint.lower, "-", HR.confint.upper, ")")
                         res <- c(beta, HR, wald.test, p.value)
                         names(res) <- c("beta", "HR (95% CI for HR)", "wald.test",
                                         "p.value")
                         return(res)
                         # return(exp(cbind(coef(x), confint(x))))
                       })
res <- t(as.data.frame(univ_results, check.names = FALSE))
as.data.frame(res)
Normally I use this code for univariate Cox regression analysis, but I have many genes (>20000) that I want to run as independent variables in univariate Cox regressions, and I am not sure how to run this code without typing out the individual covariates (gene names). All of my gene column names begin with "ENSG..".
Is there an efficient way to do univariate Cox regression on so many genes? Thanks in advance.

There are several ways to make the list of variable names without typing them out. Probably one of the easiest is to use names() to get all of the variable names in the data, then remove time and status from that list (as well as any other variables you don't want to include). For example, for the veteran dataset:
covariates <- names(survival::veteran)
covariates # Look at which names were detected
#> [1] "trt" "celltype" "time" "status" "karno" "diagtime" "age"
#> [8] "prior"
covariates <- covariates[-which(covariates %in% c("time", "status"))]
covariates # Confirm time and status have been removed
#> [1] "trt" "celltype" "karno" "diagtime" "age" "prior"
Created on 2022-08-30 by the reprex package (v2.0.1)
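Equivalently, setdiff() collapses the two steps into one:
covariates <- setdiff(names(survival::veteran), c("time", "status"))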
You could also programmatically create a list of names. For example:
covariates <- sapply(1:10, FUN = function(x) paste0("ENSG.", x))
covariates
#> [1] "ENSG.1" "ENSG.2" "ENSG.3" "ENSG.4" "ENSG.5" "ENSG.6" "ENSG.7"
#> [8] "ENSG.8" "ENSG.9" "ENSG.10"
This approach might be better if the naming is easy to program. If the gene names are irregular it might be more difficult.
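In the OP's situation, though, the stated "ENSG" prefix makes pattern matching the most direct route. A minimal sketch, assuming the expression values plus time and status live together in a data frame called expr.data (a hypothetical name):
covariates <- grep("^ENSG", names(expr.data), value = TRUE)  # all gene columns
univ_formulas <- sapply(covariates,
                        function(x) as.formula(paste('Surv(time, status) ~', x)))
# the rest of the original loop then runs unchanged over all ~20000 genes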
As far as efficiency goes, I don't think much can be done to improve the overall runtime. The bulk of the runtime is the actual coxph() calculations. There are other questions/answers on the site about optimizing R code; if you want to pursue optimization, I'd suggest looking through those and then asking a new question with a reproducible example if you need more help.


rvest html_nodes() returns empty character

I am trying to scrape a website (https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281). In particular, I am trying to scrape all 281 "release dates" (the first being '30-Oct-2006').
To do this, I am using the R package rvest and the SelectorGadget Chrome extension, on macOS 10.15.6.
I attempted the following code:
library(rvest)
library(httr)
library(xml2)
library(dplyr)
link = "https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281"
page = read_html(link)
year = page %>% html_nodes("td:nth-child(4) ul") %>% html_text()
However, this returns `character(0)`.
I used the code td:nth-child(4) ul because this is what SelectorGadget highlighted for each of the 281 release dates. I also tried to "View source page" but could not find these years listed on the source page.
I have read that rvest does not always work depending on the type of website. In this case, what is a possible workaround? Thank you.
This site gets the data from the API call https://genelab-data.ndc.nasa.gov/genelab/data/study/all, which returns JSON data. You can use httr to get the data and parse the JSON:
library(httr)
url <- "https://genelab-data.ndc.nasa.gov/genelab/data/study/all"
output <- content(GET(url), as = "parsed", type = "application/json")
# sort by glds_id
output <- output[order(sapply(output, `[[`, i = "glds_id"))]
# build the data frame row by row
result <- list()
index <- 1
for (t in output[length(output):1]) {
  result[[index]] <- t$metadata
  result[[index]]$accession <- t$accession
  result[[index]]$legacy_accession <- t$legacy_accession
  index <- index + 1
}
df <- do.call(rbind, result)
options(width = 1200)
print(df)
Output sample (without all columns)
accession legacy_accession public_release_date title
[1,] "GLDS329" "GLDS-329" "30-Oct-2006" "Transcription profiling of atm mutant, adm mutant and wild type whole plants and roots of Arabidops" [truncated]
[2,] "GLDS322" "GLDS-322" "27-Aug-2020" "Comparative RNA-Seq transcriptome analyses reveal dynamic time dependent effects of 56Fe, 16O, and " [truncated]
[3,] "GLDS320" "GLDS-320" "18-Sep-2014" "Gamma radiation and HZE treatment of seedlings in Arabidopsis"
[4,] "GLDS319" "GLDS-319" "18-Jul-2018" "Muscle atrophy, osteoporosis prevention in hibernating mammals"
[5,] "GLDS318" "GLDS-318" "01-Dec-2019" "RNA seq of tumors derived from irradiated versus sham hosts transplanted with Trp53 null mammary ti" [truncated]
[6,] "GLDS317" "GLDS-317" "19-Dec-2017" "Galactic cosmic radiation induces stable epigenome alterations relevant to human lung cancer"
[7,] "GLDS311" "GLDS-311" "31-Jul-2020" "Part two: ISS Enterobacteriales"
[8,] "GLDS309" "GLDS-309" "12-Aug-2020" "Comparative Genomic Analysis of Klebsiella Exposed to Various Space Conditions at the International" [truncated]
[9,] "GLDS308" "GLDS-308" "07-Aug-2020" "Differential expression profiles of long non-coding RNAs during the mouse pronucleus stage under no" [truncated]
[10,] "GLDS305" "GLDS-305" "27-Aug-2020" "Transcriptomic responses of Serratia liquefaciens cells grown under simulated Martian conditions of" [truncated]
[11,] "GLDS304" "GLDS-304" "28-Aug-2020" "Global gene expression in response to X rays in mice deficient in Parp1"
[12,] "GLDS303" "GLDS-303" "15-Jun-2020" "ISS Bacillus Genomes"
[13,] "GLDS302" "GLDS-302" "31-May-2020" "ISS Enterobacteriales Genomes"
[14,] "GLDS301" "GLDS-301" "30-Apr-2020" "Eruca sativa Rocket Science RNA-seq"
[15,] "GLDS298" "GLDS-298" "09-May-2020" "Draft Genome Sequences of Sphingomonas sp. Isolated from the International Space Station Genome seq" [truncated]
...........................................................................
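As a side note, jsonlite offers a shorter route for this kind of API response, since its fromJSON() simplifies an array of JSON objects straight into a data frame. A sketch against the same endpoint (the glds_id and accession fields are taken from the output above):
library(jsonlite)
url <- "https://genelab-data.ndc.nasa.gov/genelab/data/study/all"
df <- fromJSON(url)            # array of objects -> data.frame; "metadata" becomes a nested column
df <- df[order(df$glds_id), ]  # same ordering as the loop above
head(df$accession)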

MLR - Regression Benchmark Results - Visualisation

What are the options for visualising the results of a benchmark experiment of regression learners? For instance, generateCalibrationData doesn't accept a benchmark result object derived from a set of regr. learners. I would like something similar to the calibration plots available for classification.
In response to @LarsKotthoff's comment, I (the OP) have edited my original post to provide greater detail about the functionality I am seeking.
Edit:
I'm looking for actual vs predicted calibration type plots such as simple scatterplots or something like the plots that exist under Classifier Calibration. If I'm not mistaken, the following would make sense for regression problems (and seems to be what is done for Classifier Calibration):
decide on a number of buckets to discretize the predictions on the x-axis, say 10 equal-length bins (obviously you could continue with the breaks and groups interface to generateCalibrationData that currently exists)
for each of those 10 bins, calculate the mean "predicted" and plot (say via a dot) on the x-axis (possibly with some measure of variability) and join the dots across the 10 bins
for each of those 10 bins, calculate the mean "actual" and plot on the y-axis (possibly with some measure of variability) and join the dots
provide some representation of volume in each bucket (as you've done for Classifier Calibration via "rag/rug" plots); a rough sketch of this idea follows below
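A minimal sketch of that binning idea, built outside mlr's plotting API. It assumes the pooled predictions data frame from getBMRPredictions(..., as.df = TRUE) has truth and response columns, as it does for regression tasks:
pred <- getBMRPredictions(bh.bmr, as.df = TRUE)  # pooled predictions, one row per case
pred$bin <- cut(pred$response, breaks = 10)      # discretize predictions into 10 bins
agg <- aggregate(cbind(response, truth) ~ bin, data = pred, FUN = mean)
plot(agg$response, agg$truth, type = "b",
     xlab = "Mean predicted (per bin)", ylab = "Mean actual (per bin)")
abline(0, 1, lty = 2)  # the 45-degree line marks perfect calibration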
The basic premise behind my question is: what kind of visualisation can be provided to help interpret an rsq, mae, etc. performance measure? There are many configurations of actual vs predicted that can lead to the same rsq, mae, etc.
Once some plot exists, switching aggregation on/off would allow individual resampling results to be examined.
I would hope that the combination:
cal <- generateCalibrationData(bmr)
plotCalibration(cal)
would be available for regression tasks; at present it doesn't seem to be (reproducible example below):
# Practice Data
library(mlr)
library(mlbench)
data(BostonHousing)
dim(BostonHousing)
head(BostonHousing)
# Define Nested Cross-Validation Strategy
cv.inner <- makeResampleDesc("CV", iters = 2)
cv.outer <- makeResampleDesc("CV", iters = 6)
# Define Performance Measures
perf.measures <- list(rsq, mae)
# Create Task
bh.task <- makeRegrTask(id = "bh",
                        data = BostonHousing,
                        target = "medv")
# Create Tuned KSVM Learner
ksvm <- makeLearner("regr.ksvm")
ksvm.ps <- makeParamSet(makeDiscreteParam("C", values = 2^(-2:2)),
                        makeDiscreteParam("sigma", values = 2^(-2:2)))
ksvm.ctrl <- makeTuneControlGrid()
ksvm.lrn <- makeTuneWrapper(ksvm,
                            resampling = cv.inner,
                            measures = perf.measures,
                            par.set = ksvm.ps,
                            control = ksvm.ctrl,
                            show.info = FALSE)
# Create Tuned Random Forest Learner
rf <- makeLearner("regr.randomForest",
                  fix.factors.prediction = TRUE)
rf.ps <- makeParamSet(makeDiscreteParam("mtry", values = c(2, 3, 5)))
rf.ctrl <- makeTuneControlGrid()
rf.lrn <- makeTuneWrapper(rf,
                          resampling = cv.inner,
                          measures = perf.measures,
                          par.set = rf.ps,
                          control = rf.ctrl,
                          show.info = FALSE)
# Run Cross-Validation Experiments
bh.lrns <- list(ksvm.lrn, rf.lrn)
bh.bmr <- benchmark(learners = bh.lrns,
                    tasks = bh.task,
                    resampling = cv.outer,
                    measures = perf.measures,
                    show.info = FALSE)
# Calibration Charts
bh.cal <- generateCalibrationData(bh.bmr)
plotCalibration(bh.cal)
which yields:
> bh.cal <- generateCalibrationData(bh.bmr)
Error in checkPrediction(x, task.type = "classif", predict.type = "prob") :
Prediction must be one of 'classif', but is: 'regr'
> sessionInfo()
R version 3.2.3 (2015-12-10)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mlbench_2.1-1 ROCR_1.0-7 gplots_3.0.1 mlr_2.9
[5] stringi_1.1.1 ParamHelpers_1.10 ggplot2_2.1.0 BBmisc_1.10
loaded via a namespace (and not attached):
[1] digest_0.6.9 htmltools_0.3.5 R6_2.2.0 splines_3.2.3
[5] scales_0.4.0 assertthat_0.1 grid_3.2.3 stringr_1.0.0
[9] bitops_1.0-6 checkmate_1.8.2 gdata_2.17.0 survival_2.38-3
[13] munsell_0.4.3 tibble_1.2 randomForest_4.6-12 httpuv_1.3.3
[17] parallelMap_1.3 mime_0.5 DBI_0.5-1 labeling_0.3
[21] chron_2.3-47 shiny_1.0.0 KernSmooth_2.23-15 plyr_1.8.4
[25] data.table_1.9.6 magrittr_1.5 reshape2_1.4.1 kernlab_0.9-25
[29] ggvis_0.4.3 caTools_1.17.1 gtable_0.2.0 colorspace_1.2-6
[33] tools_3.2.3 parallel_3.2.3 dplyr_0.5.0 xtable_1.8-2
[37] gtools_3.5.0 backports_1.0.4 Rcpp_0.12.4

json parsing in r

I am loading a heavily nested JSON in R -- the seed data on League of Legends games. Thanks to another question I was able to open it and get a flat data frame (100 x 14167).
library(rjson)  # fromJSON(file = ...) is the rjson interface
library(plyr)
data.json <- fromJSON(file = "data/matches1.json")
data.unlist <- lapply(data.json$matches, unlist)
funct <- function(x) {
  do.call("data.frame", as.list(x))
}
data.match <- rbind.fill(lapply(data.unlist, funct))  # takes ~15 min
data.df <- as.data.frame(data.match)  # avoid naming the object data.frame, which masks the function
However, most columns have the wrong type, and I run into anomalies when converting. Is there a way of converting the columns automatically to characters/factors or numerics? Or is this wishful thinking? :)
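One generic option is base R's type.convert(), which guesses a suitable atomic type for each column. A minimal sketch, assuming the flattened frame data.df from above (whose columns arrive as character or factor after unlist()):
data.df[] <- lapply(data.df, function(col) {
  type.convert(as.character(col), as.is = TRUE)  # numeric where parseable, else character
})
str(data.df[, 1:5])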

Substring in Data Frame R

I have GPS log data like this (each record is a row in a data frame column):
{"mAccuracy":20.0,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":21677339000000,"mExtras":{"networkLocationSource":"cached","networkLocationType":"wifi","noGPSLocation":{"mAccuracy":20.0,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":21677339000000,"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1811956,"mLongitude":126.9104909,"mProvider":"network","mSpeed":0.0,"mTime":1402801381486},"travelState":"stationary"},"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1811956,"mLongitude":126.9104909,"mProvider":"network","mSpeed":0.0,"mTime":1402801381486,"timestamp":1402801665.512}
The problem is that I only need the latitude and longitude values, so I thought I could use substring extraction with sapply() to apply it across the data frame.
But I am not sure this approach is clean, because with substr (e.g. substr("abcdef", 2, 4)) I would need to count how many characters come before "mLatitude" in each string. Can anybody suggest a faster way to process this?
Thank you to @mnel for answering the question; it works, but I still have a problem.
From mnel's answer I've created a function like this:
fgps <- function(x) {
  out <- fromJSON(x)
  c(out$mExtras$noGPSLocation$mLatitude,
    out$mExtras$noGPSLocation$mLongitude)
}
and then this is my data:
gpsdata <- head(dfallgps[,4],2)
[1] "{\"mAccuracy\":23.128,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":76437488000000,\"mExtras\":{\"networkLocationSource\":\"cached\",\"networkLocationType\":\"wifi\",\"noGPSLocation\":{\"mAccuracy\":23.128,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":76437488000000,\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1779956,\"mLongitude\":126.9089661,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402894224187},\"travelState\":\"stationary\"},\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1779956,\"mLongitude\":126.9089661,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402894224187,\"timestamp\":1402894517.425}"
[2] "{\"mAccuracy\":1625.0,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":77069916000000,\"mExtras\":{\"networkLocationSource\":\"cached\",\"networkLocationType\":\"cell\",\"noGPSLocation\":{\"mAccuracy\":1625.0,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":77069916000000,\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1811881,\"mLongitude\":126.9084072,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402894857416},\"travelState\":\"stationary\"},\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1811881,\"mLongitude\":126.9084072,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402894857416,\"timestamp\":1402894857.519}"
When I run sapply(), why does the output still show the full data strings rather than just the result values?
sapply(gpsdata, function(gpsdata) fgps(gpsdata))
{"mAccuracy":23.128,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":76437488000000,"mExtras":{"networkLocationSource":"cached","networkLocationType":"wifi","noGPSLocation":{"mAccuracy":23.128,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":76437488000000,"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1779956,"mLongitude":126.9089661,"mProvider":"network","mSpeed":0.0,"mTime":1402894224187},"travelState":"stationary"},"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1779956,"mLongitude":126.9089661,"mProvider":"network","mSpeed":0.0,"mTime":1402894224187,"timestamp":1402894517.425}
[1,] 35.178
[2,] 126.909
{"mAccuracy":1625.0,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":77069916000000,"mExtras":{"networkLocationSource":"cached","networkLocationType":"cell","noGPSLocation":{"mAccuracy":1625.0,"mAltitude":0.0,"mBearing":0.0,"mElapsedRealtimeNanos":77069916000000,"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1811881,"mLongitude":126.9084072,"mProvider":"network","mSpeed":0.0,"mTime":1402894857416},"travelState":"stationary"},"mHasAccuracy":true,"mHasAltitude":false,"mHasBearing":false,"mHasSpeed":false,"mIsFromMockProvider":false,"mLatitude":35.1811881,"mLongitude":126.9084072,"mProvider":"network","mSpeed":0.0,"mTime":1402894857416,"timestamp":1402894857.519}
[1,] 35.18119
[2,] 126.90841
I want the result to look like:
[1] 35.178 126.909
[2] 35.18119 126.90841
Thank you
It would appear that your data is in JSON format. Therefore, use RJSONIO::fromJSON to read it.
E.g.:
txt <- "{\"mAccuracy\":20.0,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":21677339000000,\"mExtras\":{\"networkLocationSource\":\"cached\",\"networkLocationType\":\"wifi\",\"noGPSLocation\":{\"mAccuracy\":20.0,\"mAltitude\":0.0,\"mBearing\":0.0,\"mElapsedRealtimeNanos\":21677339000000,\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1811956,\"mLongitude\":126.9104909,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402801381486},\"travelState\":\"stationary\"},\"mHasAccuracy\":true,\"mHasAltitude\":false,\"mHasBearing\":false,\"mHasSpeed\":false,\"mIsFromMockProvider\":false,\"mLatitude\":35.1811956,\"mLongitude\":126.9104909,\"mProvider\":\"network\",\"mSpeed\":0.0,\"mTime\":1402801381486,\"timestamp\":1402801665.512}"
Then process:
library(RJSONIO)
out <- fromJSON(txt)
out$mLongitude
#[1] 126.9105
out$mLatitude
#[1] 35.1812
# to process multiple values
tt <- rep(txt, 2)
myData <- lapply(tt, fromJSON)
latlong <- do.call(rbind, lapply(myData, `[`, c('mLatitude', 'mLongitude')))
# or using rbindlist
library(data.table)
latlong <- rbindlist(lapply(myData, `[`, c('mLatitude', 'mLongitude')))
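On the follow-up about the JSON strings being echoed: when sapply() is handed a character vector, it reuses the input strings as the names of the result, which is what prints above each pair of coordinates. Dropping the names gives the compact output the OP wants, e.g.:
latlong <- t(sapply(gpsdata, fgps, USE.NAMES = FALSE))
latlong  # a 2 x 2 matrix: one row per record, columns latitude and longitude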

Is there a way to get a vector with the name of all functions that one could use in R?

I would like to have a call that returns a vector with the names of all functions that I could call in the current R session. Does anybody know how to achieve this?
(I would like to check user-entered variables against this vector. We had some unforeseen problems with users entering, e.g., c as a variable name.)
UPDATE: I would like to get the function names from all packages currently loaded.
SOLUTION (halfway): Based on Joris Meys' tip with lsf.str(), I came up with the following function that returns a sorted vector with all currently available function names:
getFunctionNames <- function() {
  loaded <- (.packages())
  loaded <- paste("package:", loaded, sep = "")
  return(sort(unlist(lapply(loaded, lsf.str))))
}
But see also the comments on Joris Meys' post for even better answers.
I'd use lsf.str() as a start.
e.g. x <- as.character(lsf.str("package:base")) gives you a list of all functions in the base package. You could add all the packages you want to check against; stats and utils come to mind first.
EDIT: Regarding your question about currently loaded packages:
x <- unlist(sapply(search()[-1], function(x) as.character(lsf.str(x))))  # see comments
pkgs <- search()
pkgs <- pkgs[grep("package:",pkgs)]
y <- unlist(sapply(pkgs,lsf.str))
does the trick.
I asked a similar Q on R-Help many moons ago (2007) and Prof. Brian Ripley provided this as a solution:
findfuns <- function(x) {
  if (require(x, character.only = TRUE)) {
    env <- paste("package", x, sep = ":")
    nm <- ls(env, all = TRUE)
    nm[unlist(lapply(nm, function(n) exists(n, where = env,
                                            mode = "function",
                                            inherits = FALSE)))]
  } else character(0)
}
pkgs <- dir(.Library)
z <- lapply(pkgs, findfuns)
names(z) <- pkgs
Z <- sort(unique(unlist(z)))
Which gives output like:
> head(Z)
[1] "^" "-" "-.Date" "-.POSIXt" ":" "::"
This was for finding all the functions in packages specified by object pkgs so you can control which packages are loaded/checked against.
A modified version that works on the currently loaded set of packages would be:
findfuns2 <- function(pkgs) {
  nm <- ls(pkgs, all = TRUE)
  nm <- nm[unlist(lapply(nm, function(n) exists(n, where = pkgs,
                                                mode = "function",
                                                inherits = FALSE)))]  # keep only names that refer to functions
  if (isTRUE(all.equal(length(nm), 0)))
    character(0)
  else
    nm
}
pkgs <- search()
pkgs <- pkgs[grep("package:", pkgs)]
z <- lapply(pkgs, findfuns2)
z <- sort(unique(unlist(z)))
head(z)
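Whichever variant you use, the final check against user input is a simple membership test, e.g.:
"c" %in% z       # TRUE -- the collision mentioned in the question
"my_var" %in% z  # FALSE for a name no loaded package defines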