How to read a few columns of a table using dbReadTable - MySQL

I am using the code below, after creating my connection "mydb" to my MySQL server, to import the data into R, and it works fine.
my_data <- dbReadTable(mydb, "ar_data")
But I don't want to import or read the whole table; I just don't want to read the first 5 columns. How can I do that?

Maybe try dbSendQuery:
library(DBI)
library(RMySQL)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, dbname="mydb", user="username")
dbWriteTable(con, "mtcars", mtcars)
dbReadTable(con, "mtcars") # full table
sql <- paste0("SELECT ", paste(dbListFields(con, "mtcars")[-(1:5)], collapse=","), " FROM mtcars LIMIT 5")
res <- dbSendQuery(con, sql)
dbFetch(res)
#   drat    wt  qsec vs am gear carb
# 1 3.90 2.620 16.46  0  1    4    4
# 2 3.90 2.875 17.02  0  1    4    4
# 3 3.85 2.320 18.61  1  1    4    1
# 4 3.08 3.215 19.44  1  0    3    1
# 5 3.15 3.440 17.02  0  0    3    2
dbClearResult(res)
res <- dbSendQuery(con, 'DROP TABLE mtcars')
dbDisconnect(con)
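As a note, dbGetQuery() wraps the dbSendQuery()/dbFetch()/dbClearResult() sequence, so the same "all but the first 5 columns" selection can be done in one call (run this before the DROP TABLE and dbDisconnect() above):
# one-step equivalent: skip the first 5 columns
cols <- dbListFields(con, "mtcars")[-(1:5)]
sql  <- paste0("SELECT ", paste(cols, collapse = ","), " FROM mtcars")
my_data <- dbGetQuery(con, sql)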


problem with bootMer CI: upper and lower limits are identical

I'm having the hardest time generating confidence intervals for my glmer Poisson model. After following several very helpful tutorials (such as https://drewtyre.rbind.io/classes/nres803/week_12/lab_12/) as well as Stack Overflow posts, I keep getting very strange results: the upper and lower limits of the CI are identical.
Here is a reproducible example containing a response variable called "production," a fixed effect called "Treatment_Num" and a random effect called "Genotype":
df1 <- data.frame(production=c(15,12,10,9,6,8,9,5,3,3,2,1,0,0,0,0), Treatment_Num=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), Genotype=c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2))
#run the glmer model (lme4 needs to be loaded before calling glmer)
library(lme4)
df1_glmer <- glmer(production ~ Treatment_Num + (1|Genotype),
                   data = df1, family = poisson(link = "log"))
#make an empty data set to predict from, that contains the explanatory variables but no response
require(magrittr)
df_empty <- df1 %>%
  tidyr::expand(Treatment_Num, Genotype)
#create new column containing predictions
df_empty$PopPred <- predict(df1_glmer, newdata = df_empty, type="response",re.form = ~0)
#function for bootMer
myFunc_df1_glmer <- function(mm) {
  predict(df1_glmer, newdata = df_empty, type="response", re.form=~0)
}
#run bootMer (nsim = 10 keeps the example fast; use more simulations in practice)
merBoot_df1_glmer <- bootMer(df1_glmer, myFunc_df1_glmer, nsim = 10)
#get confidence intervals out of it
predCL <- t(apply(merBoot_df1_glmer$t, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975)))
#enter lower and upper limits of confidence interval into df_empty
df_empty$lci <- predCL[, 1]
df_empty$uci <- predCL[, 2]
#when viewing df_empty the problem becomes clear: the lci and uci are identical!
df_empty
Any insights you can give me will be much appreciated!
The issue is with the function you created to pass to bootMer(). You wrote:
myFunc_df1_glmer <- function(mm) {
  predict(df1_glmer, newdata = df_empty, type="response", re.form=~0)
}
The argument mm should be a fitted model object derived from the bootstrapped data. However, you don't pass this object to predict(), but rather the original model object. If you change the function to:
myFunc_df1_glmer <- function(mm) {
  predict(mm, newdata = df_empty, type="response", re.form=~0)
  #^^ pass in the object created by bootMer
}
then it works:
> df_empty
# A tibble: 8 x 5
  Treatment_Num Genotype PopPred   lci   uci
          <dbl>    <dbl>   <dbl> <dbl> <dbl>
1             1        1  12.9   9.63 15.7
2             1        2  12.9   9.63 15.7
3             2        1   5.09  3.87  5.89
4             2        2   5.09  3.87  5.89
5             3        1   2.01  1.20  2.46
6             3        2   2.01  1.20  2.46
7             4        1   0.796 0.361 1.14
8             4        2   0.796 0.361 1.14
As an aside -- how many genotypes are in your actual data? If fewer than 5-7, you might do better using a straight-up glm() with genotype as a factor using sum-to-zero contrasts.
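For reference, a minimal sketch of that glm() alternative on the example data (this assumes Genotype is converted to a factor; "contr.sum" requests the sum-to-zero coding):
df1$Genotype <- factor(df1$Genotype)
df1_glm <- glm(production ~ Treatment_Num + Genotype,
               data = df1, family = poisson(link = "log"),
               contrasts = list(Genotype = "contr.sum"))
summary(df1_glm)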

dbGetQuery select all existing columns

I have a vector of columns that I would like to select from the databases. If a column is missing, I want to select all of the columns that do exist. But I am not sure how to specify this in my query.
For example, to select column drat I specify "SELECT drat FROM mtcars". Let's say my column names are drat and colMissing.
The query "SELECT drat, colMissing FROM mtcars" does not work; it fails with Error: no such column: colMissing.
However, I still want drat exported. How can I make sure that all existing columns will be exported, and non-existing ones skipped? In my real data, I have a long vector of column names and many databases, so I want to do it automatically.
Dummy example:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT * FROM mtcars") # select all columns
dbGetQuery(con, "SELECT drat, wt, disp, colMissing FROM mtcars", n = 6) # does not work as contains non existing columns name. How to export only existing ones?
I don't think SQL gives you an easy way to dynamically set the columns to select in this fashion. I think the easiest way to do this type of filtering is to determine the columns to select dynamically and create the query programmatically.
cols <- c("drat", "wt", "disp", "colMissing")
cols_to_select <- intersect(dbListFields(con, "mtcars"), cols)
cols_to_select
# [1] "disp" "drat" "wt"
qry <- paste("select", paste(dbQuoteIdentifier(con, cols_to_select), collapse = ","), "from mtcars")
qry
# [1] "select `disp`,`drat`,`wt` from mtcars"
head(dbGetQuery(con, qry))
# disp drat wt
# 1 160 3.90 2.620
# 2 160 3.90 2.875
# 3 108 3.85 2.320
# 4 258 3.08 3.215
# 5 360 3.15 3.440
# 6 225 2.76 3.460
I'm taking deliberate steps here to mitigate the risk of inadvertent SQL-injection that comes with paste-ing a query together. It is feasible that column names of an existing frame could be rather stupidly-malicious. (And no, I don't think the risk of these names is real, this type of mistake is much more likely to create a syntax error.)
someframe <- data.frame(a=1,b=2)
names(someframe)[1] <- "Robert');DROP TABLE Students;--"
qry <- paste("select", paste(names(someframe), collapse = ","), "from mtcars")
qry
# [1] "select Robert');DROP TABLE Students;--,b from mtcars"
Okay, so that won't work here (despite https://xkcd.com/327/), but ... be careful when forming a query dynamically. dbQuoteIdentifier is one function with the intent of mitigating this risk. With comparison data (e.g., WHERE cyl > 5), it is much better to use parameter-binding (i.e., WHERE cyl > ?); this doesn't work in the SELECT portion, however, so caveat emptor.
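For completeness, a minimal sketch of that parameter-binding style for WHERE-clause values, using the same RSQLite connection as above (the driver binds the value; it is never pasted into the SQL string):
dbGetQuery(con, "SELECT drat, wt FROM mtcars WHERE cyl > ?",
           params = list(5))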
As an aside ... I believe SQL-injection discussions normally focus on the parameters (within the WHERE clause) of the query, not on the fields to be selected. However, it is feasible to make this happen with field names, though it requires knowing the target table name in the injection. (I'm using SQL Server below.)
DBI::dbWriteTable(con, "#r2mt", mtcars[1:2,])
DBI::dbGetQuery(con, "select * from #r2mt")
# row_names mpg cyl disp hp drat wt qsec vs am gear carb
# 1 Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# 2 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
names(someframe)[1] <- 'cyl" from #r2mt;DROP TABLE #r2mt;--'
qry <- paste("select", paste(dQuote(names(someframe)), collapse = ", "), "from #r2mt")
qry
# [1] "select \"cyl\" from #r2mt;DROP TABLE #r2mt;--\", \"b\" from #r2mt"
DBI::dbGetQuery(con, qry)
# cyl
# 1 6
# 2 6
DBI::dbGetQuery(con, "select * from #r2mt")
# Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid object name '#r2mt'. [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Statement(s) could not be prepared.
# <SQL> 'select * from #r2mt'
I should note that while dQuote did not protect against this, dbQuoteIdentifier did:
DBI::dbWriteTable(con, "#r2mt", mtcars[1:2,])
qry <- paste("select", paste(DBI::dbQuoteIdentifier(con, names(someframe)), collapse = ", "), "from #r2mt")
qry
# [1] "select \"cyl\"\" from #r2mt;DROP TABLE #r2mt;--\", \"b\" from #r2mt"
DBI::dbGetQuery(con, "select * from #r2mt")
# row_names mpg cyl disp hp drat wt qsec vs am gear carb
# 1 Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# 2 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
DBI::dbGetQuery(con, qry)
# Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column name 'cyl" from #r2mt;DROP TABLE #r2mt;--'. [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column name 'b'. [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Statement(s) could not be prepared.
# <SQL> 'select "cyl"" from #r2mt;DROP TABLE #r2mt;--", "b" from #r2mt'
Where the clear difference in qry is shown here:
# [1] "select \"cyl\" from #r2mt;DROP TABLE #r2mt;--\", \"b\" from #r2mt"
# [1] "select \"cyl\"\" from #r2mt;DROP TABLE #r2mt;--\", \"b\" from #r2mt"
I was unable to defeat dbQuoteIdentifier in order to stop the escaping of " in this use.

How to parse JSON in a DataFrame column using R

How do I come from here ...
| ID | JSON Request |
==============================================================================
| 1 | {"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]} |
------------------------------------------------------------------------------
| 2 | {"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]} |
------------------------------------------------------------------------------
to here (The requirement is to make a table of JSON in column 2):
| User | P1 | P2 | domains |
============================
| xyz1 | 0 |100 | a1, b1 |
----------------------------
| xyz2 |100 | 0 | a2, b2 |
----------------------------
Here is the code to generate the data.frame:
raw_df <- data.frame(
  id = 1:2,
  json = c(
    '{"user": "xyz2", "weightmap": {"P1":100,"P2":0}, "domains": ["a2","b2"]}',
    '{"user": "xyz1", "weightmap": {"P1":0,"P2":100}, "domains": ["a1","b1"]}'
  ),
  stringsAsFactors = FALSE
)
Here's a tidyverse solution (also using jsonlite) if you're happy to work in a long format (for domains in this case):
library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)
d <- data.frame(
  id = c(1, 2),
  json = c(
    '{"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]}',
    '{"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]}'
  ),
  stringsAsFactors = FALSE
)
d %>%
  mutate(json = map(json, ~ fromJSON(.) %>% as.data.frame())) %>%
  unnest(json)
#>   id user weightmap.P1 weightmap.P2 domains
#> 1  1 xyz1            0          100      a1
#> 2  1 xyz1            0          100      b1
#> 3  2 xyz2          100            0      a2
#> 4  2 xyz2          100            0      b2
mutate() converts the JSON strings into a column of nested data frames; unnest() then expands those nested data frames into regular columns, one row per domain.
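If you want the one-row-per-user layout from the question (domains collapsed into a single string), here is a hedged sketch continuing the same pipeline (summarise()'s .groups argument assumes dplyr >= 1.0):
d %>%
  mutate(json = map(json, ~ fromJSON(.) %>% as.data.frame())) %>%
  unnest(json) %>%
  group_by(id, user, weightmap.P1, weightmap.P2) %>%
  summarise(domains = paste(domains, collapse = ", "), .groups = "drop")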
I would go for the jsonlite package in combination with the usage of mapply, a transformation function and data.table's rbindlist.
# data
raw_df <- data.frame(id = 1:2, json = c('{"user": "xyz2", "weightmap": {"P1":100,"P2":0}, "domains": ["a2","b2"]}', '{"user": "xyz1", "weightmap": {"P1":0,"P2":100}, "domains": ["a1","b1"]}'), stringsAsFactors = FALSE)
# libraries
library(jsonlite)
library(data.table)
# 1) First, make a transformation function that works for a single entry
f <- function(json, id){
  # transform json to list
  tmp <- jsonlite::fromJSON(json)
  # transform list to data.frame
  tmp <- as.data.frame(tmp)
  # add id
  tmp$id <- id
  # return
  return(tmp)
}
# 2) apply it via mapply
json_dfs <- mapply(f, raw_df$json, raw_df$id, SIMPLIFY = FALSE)
# 3) combine the fragments via rbindlist
clean_df <- data.table::rbindlist(json_dfs)
# 4) et voila
clean_df
##    user weightmap.P1 weightmap.P2 domains id
## 1: xyz2          100            0      a2  1
## 2: xyz2          100            0      b2  1
## 3: xyz1            0          100      a1  2
## 4: xyz1            0          100      b1  2
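As a hedged aside: if the JSON entries can differ in their fields (say, different numbers of domains), rbindlist() can pad the missing columns:
# fill = TRUE inserts NA for columns absent from some fragments
clean_df <- data.table::rbindlist(json_dfs, fill = TRUE)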
I could not get the flatten parameter to work as I expected, so I needed to unlist and then "re-list" before rbinding with do.call:
library(jsonlite)
do.call(rbind,
        lapply(raw_df$json,
               function(j) as.list(unlist(fromJSON(j, flatten = TRUE)))))
user weightmap.P1 weightmap.P2 domains1 domains2
[1,] "xyz2" "100" "0" "a2" "b2"
[2,] "xyz1" "0" "100" "a1" "b1"
Admittedly, this will require further processing since it coerces all the lines to character.
library(jsonlite)
library(tidyr)   # for unite(); also re-exports %>%
json <- c(
  '{"user":"xyz1","weightmap": {"P1":0,"P2":100}, "domains":["a1","b1"]}',
  '{"user":"xyz2","weightmap": {"P1":100,"P2":0}, "domains":["a2","b2"]}'
)
json <- lapply(paste0("[", json, "]"),
               function(x) jsonlite::fromJSON(x))
df <- data.frame(matrix(unlist(json), nrow = 2, ncol = 5, byrow = TRUE))
df <- df %>% unite(Domains, X4, X5, sep = ", ")
colnames(df) <- c("user", "P1", "P2", "domains")
head(df)
The output is:
user P1 P2 domains
1 xyz1 0 100 a1, b1
2 xyz2 100 0 a2, b2
Using tidyjson
https://cran.r-project.org/web/packages/tidyjson/vignettes/introduction-to-tidyjson.html
install.packages("tidyjson")
library(tidyjson)
json_as_df <- raw_df$json %>% spread_all
# to retain the other columns (here id), convert the whole data frame instead:
json_as_df <- raw_df %>% as.tbl_json(json.column = "json") %>% spread_all

Web scraping an HTML table using R

I already had some help from Stack Overflow users trying to solve this problem. However, I ran into new trouble:
URL <- "http://karakterstatistik.stads.ku.dk/Histogram/ASOB05038E/Summer-2015"
pg <- read_html(URL)
get_val <- function(x, label) {
xpath <- sprintf(".//table/tr/td[contains(., '%s')][1]/following-sibling::td", label)
html_nodes(x, xpath=xpath) %>%
html_text() %>%
trimws()
}
library("stringr")
trimmed = get_val(pg, "Karakter") %>%
str_replace_all(pattern = "\\n|\\t|\\r" ,
replacement = "")
trimmed
I want to get the exam results for both the retake and the exam, but since both of the headlines for the two tables are the same, R only takes the values from the retake.
To be specific, I would like to get the column "Antal" right next to the grades (12, 10, 7, 4, 02, 00, -3) in both of the tables under the headline Resultater.
Any help would be appreciated a lot! :)
results <- html_nodes(pg, xpath=".//td[@style='width: 50%;' and
                      descendant::h3[contains(text(), 'Resultater')]]/table")
html_table(results[[1]])
## Karakter Antal Antal
## 1 12 11 (9,6 %)
## 2 10 48 (41,7 %)
## 3 7 41 (35,7 %)
## 4 4 4 (3,5 %)
## 5 02 1 (0,9 %)
## 6 00 1 (0,9 %)
## 7 -3 4 (3,5 %)
## 8 Ej mødt 5 (4,3 %)
html_table(results[[2]])
## Karakter Antal Antal
## 1 12 0 (0,0 %)
## 2 10 0 (0,0 %)
## 3 7 1 (9,1 %)
## 4 4 1 (9,1 %)
## 5 02 1 (9,1 %)
## 6 00 1 (9,1 %)
## 7 -3 0 (0,0 %)
## 8 Ej mødt 7 (63,6 %)
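If all you need is the "Antal" count column next to the grades, a minimal sketch working from the parsed tables above (the column positions are assumed from that output):
tbl <- html_table(results[[1]])
antal <- setNames(tbl[[2]], tbl[[1]])  # counts named by grade (12, 10, 7, ...)
antal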

Subsetting in a function to calculate a row total

I have a data frame with results for certain instruments, and I want to create a new column which contains the totals of each row. Because I have different numbers of instruments each time I run an analysis on new data, I need a function to dynamically calculate the new column with the Row Total.
To simplify my problem, here’s what my data frame looks like:
Type Value
1 A 10
2 A 15
3 A 20
4 A 25
5 B 30
6 B 40
7 B 50
8 B 60
9 B 70
10 B 80
11 B 90
My goal is to achieve the following:
   A  B Total
1 10 30    40
2 15 40    55
3 20 50    70
4 25 60    85
5    70    70
6    80    80
7    90    90
I’ve tried various methods, but this way holds the most promise:
myList <- list(a = c(10, 15, 20, 25), b = c(30, 40, 50, 60, 70, 80, 90))
tmpDF <- data.frame(sapply(myList, '[', 1:max(sapply(myList, length))))
> tmpDF
a b
1 10 30
2 15 40
3 20 50
4 25 60
5 NA 70
6 NA 80
7 NA 90
totalSum <- rowSums(tmpDF)
totalSum <- data.frame(totalSum)
tmpDF <- cbind(tmpDF, totalSum)
> tmpDF
a b totalSum
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 NA 70 NA
6 NA 80 NA
7 NA 90 NA
Even though this way succeeded in combining two data frames of different lengths, the rowSums function gives the wrong values in this example (NA whenever a row contains an NA). Besides that, my original data isn't in a list format, so I can't apply such a 'solution'.
I think I’m overcomplicating this problem, so I was wondering how I can:
1) Subset data from a data frame on the basis of ‘Type’,
2) Insert these individual subsets of different lengths into a new data frame,
3) Add a ‘Total’ column to this data frame which is the correct sum of the individual subsets.
An added complication is that this needs to be done in a function or in an otherwise dynamic way, so that I don’t need to manually subset the dozens of ‘Types’ (A, B, C, and so on) in my data frame.
Here’s what I have so far, which doesn’t work, but illustrates the lines I’m thinking along:
TotalDf <- function(x){
  tmpNumberOfTypes <- c(levels(x$Type))
  for (i in tmpNumberOfTypes){
    subSetofData <- subset(x, Type = i, select = Value)
    if (i == 1) {
      totalDf <- subSetofData
    } else {
      totalDf <- cbind(totalDf, subSetofData)
    }
  }
  return(totalDf)
}
Thanks in advance for any thoughts or ideas on this,
Regards,
EDIT:
Thanks to the comment of Joris (see below) I got an end in the right direction, however, when trying to translate his solution to my data frame, I run into additional problems. His proposed answer works, and gives me the following (correct) sum of the values of A and B:
> tmp78 <- tapply(DF$value,DF$id,sum)
> tmp78
1 2 3 4 5 6
6 8 10 12 9 10
> data.frame(tmp78)
tmp78
1 6
2 8
3 10
4 12
5 9
6 10
However, when I try this solution on my data frame, it doesn’t work:
> subSetOfData <- copyOfTradesList[c(1:3,11:13),c(1,10)]
> subSetOfData
Instrument AccountValue
1 JPM 6997
2 JPM 7261
3 JPM 7545
11 KFT 6992
12 KFT 6944
13 KFT 7069
> unlist(sapply(rle(subSetOfData$Instrument)$lengths,function(x) 1:x))
Error in rle(subSetOfData$Instrument) : 'x' must be an atomic vector
> subSetOfData$InstrumentNumeric <- as.numeric(subSetOfData$Instrument)
> unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
> subSetOfData$id <- unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 2L, 3L, 1L, 2L, :
replacement has 3 rows, data has 6
I have the disturbing idea that I’m going around in circles…
Two thoughts:
1) you could use na.rm=TRUE in rowSums
2) How do you know which value has to go with which? You might add some indexing, e.g.:
DF <- data.frame(
  type = c(rep("A",4), rep("B",6)),
  value = 1:10,
  stringsAsFactors = FALSE
)
DF$id <- unlist(lapply(rle(DF$type)$lengths, function(x) 1:x))
Now this allows you to easily tapply the sum on the original dataframe:
tapply(DF$value, DF$id, sum)
And, more importantly, get your dataframe in the correct form:
> DF
type value id
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 B 5 1
6 B 6 2
7 B 7 3
8 B 8 4
9 B 9 5
10 B 10 6
> library(reshape)
> cast(DF,id~type)
id A B
1 1 1 5
2 2 2 6
3 3 3 7
4 4 4 8
5 5 NA 9
6 6 NA 10
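To finish the question's desired output, a sketch adding the Total column to that cast result (assuming the A and B columns shown above):
wide <- cast(DF, id ~ type)
wide$Total <- rowSums(wide[c("A", "B")], na.rm = TRUE)
wide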
TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B"),
                 Value = c(10,15,20,25,30,40,50,60,70,80,90),
                 stringsAsFactors = FALSE)
# Added Type C for testing
# TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B", "C", "C", "C")
# , Value = c(10,15,20,25,30,40,50,60,70,80,90, 100, 150, 130)
# , stringsAsFactors = FALSE)
lnType <- with(TV, tapply(Value, Type, length))
lnType <- as.integer(lnType)
lnType
id <- unlist(mapply(FUN = rep_len, length.out = lnType, x = list(1:max(lnType))))
(TV <- cbind(id, TV))
require(reshape2)
tvWide <- dcast(TV, id ~ Type)
# Alternatively
# tvWide <- reshape(data = TV, direction = "wide", timevar = "Type", ids = c(id, Type))
tvWide <- subset(tvWide, select = -id)
# If you want something neat without the <NA>
# for (j in 1:nrow(tvWide)) {
#   for (i in 1:ncol(tvWide)) {
#     if (is.na(tvWide[j, i])) {
#       tvWide[j, i] <- 0
#     }
#   }
# }
tvWide
transform(tvWide, rowSum=rowSums(tvWide, na.rm = TRUE))
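Since the question asks for something dynamic over arbitrarily many Types, here is a hedged sketch wrapping the same idea in a function (widen_with_total is a hypothetical name; ave() replaces the rle()/mapply() indexing):
widen_with_total <- function(df) {
  # within-group row index; works for any number of Types
  df$id <- ave(df$Value, df$Type, FUN = seq_along)
  wide <- reshape2::dcast(df, id ~ Type, value.var = "Value")
  wide$Total <- rowSums(wide[, -1, drop = FALSE], na.rm = TRUE)
  wide[, -1]
}
widen_with_total(TV[, c("Type", "Value")])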