Write directly to MySQL DB from zipped file - mysql

I have a series of large zipped files that I have been unzipping to load directly into a MySQL database for querying from R.
I'll continue with this example (on x86_64 GNU/Linux):
> write.csv(iris, file = "iris.csv", row.names = FALSE, quote = FALSE)
> system("gzip iris.csv")
> list.files(pattern = "iris")
[1] "iris.csv" "iris.csv.gz"
I currently load the unzipped file in the following way:
> library(RSQLite)
> con <- dbConnect(RSQLite::SQLite(), dbname = "test_db")
> dbWriteTable(con, name = "iris", value = "iris.csv", field.types = list(Sepal.Length = "decimal(6, 2)", Sepal.Width = "decimal(6, 2)", Petal.Length = "decimal(6, 2)", Petal.Width = "decimal(6, 2)", Species = "varchar(15)"), row.names = FALSE)
[1] TRUE
I am wondering if it is possible to do a direct table write to the DB using the zipped file iris.csv.gz?
EDIT:
I am aware of gzfile but to my understanding its usage would necessitate bringing the file into memory before writing to the MySQL DB, something I am looking to avoid (correct me if I'm misunderstanding)

Related

Not saving html interactive file with R

I am trying to design a circos plot using BioCircos R package. BioCircos allows to save the plots as .html interactive files. However, when I run the package using RScript the saved .html file is empty. To save the .html file I used saveWidget option from htmlwidgets package. Is it something wrong with saveWidget option? The code I used follows:
#!/usr/bin/Rscript
######R script for BioCircos test
library(htmlwidgets)
library(BioCircos)
genomes <- list("chra1" = 217471166, "chra2" = 181034961, "chra3" = 153873357, "chra4" = 153961319, "chra5" = 164033575,
"chra6" = 154486312, "chra7" = 133565930, "chra8" = 147241510, "chra9" = 91218944, "chra10" = 52432566, "chrb1" = 843366180, "chrb2" = 842558404, "chrb3" = 707956555, "chrb4" = 635713434, "chrb5" = 567300182,
"chrb6" = 439630435, "chrb7" = 236595445, "chrb8" = 231667822, "chrb9" = 230778867, "chrb10" = 151572763, "chrb11" = 103205957) # custom genome
links_chromosomes_01 <- c("chra1", "chra2", "chra3", "chra4", "chra4", "chra5", "chra6", "chra7", "chra7", "chra8", "chra8", "chra9", "chra10") # Chromosomes on which the links should start
links_chromosomes_02 <- c("chrb2", "chrb3", "chrb1", "chrb9", "chrb10", "chrb4", "chrb5", "chrb6", "chrb1", "chrb8", "chrb3", "chrb7", "chrb6") # Chromosomes on which the links should end
links_pos_01 <- c(115060347, 102611974, 14761160, 128700431, 128681496, 42116205, 58890582, 40356090,
146935315, 136481944, 157464876, 39323393, 84752508, 136164354,
99573657, 102580613,
111139346, 120764772, 90748238, 122164776,
44933176, 18823342,
48771409, 128288229, 150613881, 18509106, 123913217, 51237349,
34237851, 53357604, 78270031,
25306417, 25320614,
94266153,
41447919, 28810876, 2802465,
45583472,
81968637, 27858237, 17263637,
30569409) ### links chra chromosomes
links_pos_02 <- c(410543481, 463189512, 825903588, 353914638, 354135472, 717707494, 643107332, 724899652,
583713545, 558756961, 642015290, 154999098, 340216235, 557731577,
643350872, 655077847,
85356666, 157889318, 226411560, 161566470,
109857786, 25338955,
473876792, 124495704, 46258030, 572314729, 141584107, 426419779,
531245660, 220131772, 353941099,
62422773, 62387030,
116923325,
76544045, 33452274, 7942164,
642047816,
215981114, 39278129, 23302654,
418922633) ### links chrb chromosomes
links_labels <- c("aldh1a3", "amh", "cyp26b1", "dmrt1", "dmrt3", "fgf20", "hhip", "srd5a3",
"amhr2", "dhh", "fgf9", "nr0b1", "rspo1", "wnt1",
"aldh1a2", "cyp19a1",
"lhx9", "pdgfb", "ptch2", "sox10",
"cbln1", "wt1",
"esr1", "foxl2", "gata4", "lrpprc", "serpine2", "srd5a2",
"asns", "ctnnb1", "srd5a1",
"cyp26a1", "cyp26c1",
"wnt4",
"ar", "nr5a1", "ptgds",
"fgf16",
"cxcr4", "pdgfa", "sox8",
"sox9")
tracklist <- BioCircosLinkTrack('myLinkTrack', links_chromosomes_01, links_pos_01,
links_pos_01, links_chromosomes_02, links_pos_02, links_pos_02,
maxRadius = 0.55, labels = links_labels)
#plotting results
plot_chra_chrb <- BioCircos(tracklist, genome = chra_chrb_genomes, genomeFillColor = "RdBu", chrPad = 0.02, displayGenomeBorder = FALSE, genomeLabelTextSize = "10pt", genomeTicksScale = 4e+3,
elementId = "chra_chrb_comp_plot_test.html")
saveWidget(plot_chra_chrb, "chra_chrb_comp_plot_test.html", selfcontained = F, libdir = "lib")
The command line to run this script:
Rscript /path_to/Circle_plot_test.r
I tried to use this script in RStudio (without saveWidget() command), however it took too long to run in my personnel computer and the results was not displayed. However, this could be due to memory usage setup because when I took off some data, the script easily generates the plot in RStudio and I am able to save it. Is there other way to save the .hmtl interactive files in R or am I doing something wrong using htmlwidgets package in my script?
Thanks all in advance for any help and comments.
When you said it took too long to run, that was a sign that something was wrong! You weren't getting anything when you used saveWidget, because there is nothing returned from BioCiros.
I found two things that are a problem. The first one will result in a blank output—you can't use a '.' in the element ID. This ID will be used in the HTML coding.
You were getting huge delays due to the scale you set for genomeTickScale. That scaling value is for a tick mark attribute. I'm not sure why you set it to .004. However, when I comment out that line, it renders immediately. I have no issues with saving the widget, either.
--One other thing, you had chra_chrb_genomes as the object name assigned to the parameter genome in the function BioCircos. I assumed it was the object genome from your question since it was the only unused object.
The only things I changed were in the BioCircos function:
(plot_chra_chrb <- BioCircos(tracklist, genome = genomes, #chra_chrb_genomes,
genomeFillColor = "RdBu",
chrPad = 0.02,
displayGenomeBorder = FALSE,
genomeLabelTextSize = "10pt",
# genomeTicksScale = 4e+3, # problematic
elementId = "chra_chrb_comp_plot_test" # no periods
))

Encoding issues when trying to write data from R to MySQL

My data contains special characters like German umlauts.
p=structure(list(ppl_code = c(992621L, 992381L, 992136L, 991989L,
991898L, 991759L, 991681L, 991593L, 991294L, 991036L, 990934L,
990751L, 990535L, 990411L, 990182L, 989507L), proj_name = c("klo",
"Dalbygda", "Oosterhorn", "Hån", "Yatir", "Montigny la Cour",
"Valle Hermoso", "Acciona Honawad - 120 MW", "Apfeltrang", "RiaBlades",
"General Acha", "Lindau-Böhlitz", "Apfeltrang", "Alcazar Round 2",
"Peckelsheim", "Linnich 3")), .Names = c("ppl_code", "proj_name"
), row.names = 15:30, class = "data.frame")
When I try to write it into MySQL database :
conn <- dbConnect(
drv = RMySQL::MySQL(),
dbname = "mydb",
host = "#####",
username = "#####",
password = "#####")
dbWriteTable(conn, value = p, name = "MyTable",row.names=FALSE)
I'm getting the Encoding error :
could not run statement: Invalid utf8 character string: 'Lindau-B'
I have checked several posts regarding this issue like here and here but they are all general explanation without a clear solution !
can anybody help me with a clear query that could solve this issue ?
You need to announce that UTF-8 is being used.
Tool -> Global Options -> Code -> Saving and put UTF-8
rs <- dbSendQuery(con, 'SET NAMES utf8')

How to use sparklyr spark_write_jdbc to connect to MySql

spark_write_jdbc(members_df,
name = "Mbrs",
options = list(
url = paste0("jdbc:mysql://",mysql_host,":",mysql_port,"/",dbname),
user = mysql_user,
password = mysql_password),
mode = "append")
Results in the following exception:
Error: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
The .jar file is in a folder on the server where RStudio is running, config details below. We're able to access MySql via the RMySql package so MySql is working and accessible.
config$`spark.sparklyr.shell.driver-class-path` <- "/dev/shm/temp/mysql-connector-java-5.1.44-bin.jar"
Even if the question is different, I think the answer still applies also to this:
How to use a predicate while reading from JDBC connection?
Code to connect to JDBC MySQL through sparklyr (I made some slight changes to simplify the code a bit, written by Jake Russ)
library(sparklyr)
library(dplyr)
config <- spark_config()
#config$`sparklyr.shell.driver-class-path` <- "E:\\spark232_hadoop27\\jars\\mysql-connector-java-5.1.47-bin.jar"
#in my case, using RStudio and Sparkly this seemed to be optional
sc <- spark_connect(master = "local")
db_tbl <- spark_read_jdbc(sc,
name = "table_name",
options = list(url = "jdbc:mysql://localhost:3306/schema_name",
user = "root",
password = "password",
dbtable = "table_name"))

How to join table using a pool connection?

I am trying to use pool to join two remote tables (City and Country) using below code:
pool <- dbPool(
drv = RMySQL::MySQL(),
dbname = "shinydemo",
host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
username = "guest",
password = "guest"
)
src_pool(pool) %>%
tbl('City') %>%
left_join('Country', by=c('CountryCode'='Code'))
But this is the error I get when run the code:
Error: x and y don't share the same src.
Set copy = TRUE to copy y into x's source (this may be time consuming).
In addition: Warning message:
In force(expr) : You have a leaked pooled object. Destroying it.
Below a working example of the same query using dplyr:
srccon <- src_mysql(
host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
dbname = "shinydemo",
user = "guest",
password = "guest"
)
tbl(srccon, 'City') %>%
left_join(tbl(srccon, 'Country'), by=c('CountryCode'='Code'))
And another example using pool::dbGetQuery
sql <- "SELECT * FROM City LEFT JOIN Country ON (CountryCode=Code)"
dbGetQuery(pool, sql)
I'm not familiar with pool, but the error message:
Error: x and y don't share the same src.
Set copy = TRUE to copy y into x's source (this may be time consuming).
is due to the fact that one of your tables is being manipulated in-database (I'd guess City) and the other is in memory/not in the database ('Country'), but it could be the other way around. In this instance, x and y are your two tables, and the error message is just telling you they're not bothin the same place. By setting copy = TRUE the tibble in y will be (temporarily) copied into wherever x is stored (e.g. either to the database or into RAM) in order to perform the computation.
You could try either bringing both objects into memory using dplyr::collect() which will mean that the join happens in memory; or ensure that both are on your database (in which case the join will occur there instead).
i.e.
# Keep both in RAM
City <- tbl('City') %>% collect()
Country <- tbl('Country') %>% collect()
joined <- City %>% left_join(Country, by = c('CountryCode' = 'Code))
I would guess that the Warning about a leaked pooled object relates to the same issue (one table is in-memory, one is on the database).

sqlSave fails when tablename is longer than 18 characters

I am currently writing a script that downloads a bunch of .csv's from a FTP server, and then puts each .csv in a MySQL database as its own table.
I download the .csv's from the FTP using RCurl and place all of the .csv's in my working directory. To create tables out of each .csv, I am using the sqlSave function from the RODBC package, where the table name is the same name as the .csv. This works fine whenever a .csv name is less than 18 characters, but when it is greater it fails. And by "fails", I mean R crashes. To track down the bug, I called debug on sqlSave.
I found that there are at least two functions that sqlSave calls that cause R to crash. The first is RODBC:::odbcTableExists, which is a non-visible function. Here is the code for the function:
RODBC:::odbcTableExists
function (channel, tablename, abort = TRUE, forQuery = TRUE,
allowDot = attr(channel, "interpretDot"))
{
if (!odbcValidChannel(channel))
stop("first argument is not an open RODBC channel")
if (length(tablename) != 1)
stop(sQuote(tablename), " should be a name")
tablename <- as.character(tablename)
switch(attr(channel, "case"), nochange = {
}, toupper = tablename <- toupper(tablename), tolower = tablename <- tolower(tablename))
isExcel <- odbcGetInfo(channel)[1L] == "EXCEL"
hasDot <- grepl(".", tablename, fixed = TRUE)
if (allowDot && hasDot) {
parts <- strsplit(tablename, ".", fixed = TRUE)[[1]]
if (length(parts) > 2)
ans <- FALSE
else {
res <- if (attr(channel, "isMySQL"))
sqlTables(channel, catalog = parts[1], tableName = parts[2])
else sqlTables(channel, schema = parts[1], tableName = parts[2])
ans <- is.data.frame(res) && nrow(res) > 0
}
}
else if (!isExcel) {
res <- sqlTables(channel, tableName = tablename)
ans <- is.data.frame(res) && nrow(res) > 0
}
else {
res <- sqlTables(channel)
tables <- stables <- if (is.data.frame(res))
res[, 3]
else ""
if (isExcel) {
tables <- sub("^'(.*)'$", "\\1", tables)
tables <- unique(c(tables, sub("\\$$", "", tables)))
}
ans <- tablename %in% tables
}
if (abort && !ans)
stop(sQuote(tablename), ": table not found on channel")
enc <- attr(channel, "encoding")
if (nchar(enc))
tablename <- iconv(tablename, to = enc)
if (ans && isExcel) {
dbname <- if (tablename %in% stables)
tablename
else paste(tablename, "$", sep = "")
if (forQuery)
paste("[", dbname, "]", sep = "")
else dbname
}
else if (ans) {
if (forQuery && !hasDot)
quoteTabNames(channel, tablename)
else tablename
}
else character(0L)
}
This fails here when the table name over 18 characters in length:
res <- sqlTables(channel, tableName = tablename)
I have fixed it by changing this to:
res <- sqlTables(channel, tablename)
I then reassign the function with the same name (odbcTableExists) in the namespace with this code change using assignInNamepace.
RODBC:::odbcTableExists no longer causes an issue. However, R still crashes when sqlwrite is called from within sqlSave(). I called debug on sqlwrite, and I found that RODBC:::odbcColumns (another non-visible function) causes that to crash when tablenames are too long. Unfortunately, I am not sure how to change RODBC:::odbcColumns to avoid the bug like I did before.
I am using R 2.15.1, and the platform is :x86_64-pc-ming32/x64 (64-bit). I should also note that I am trying to run this on a work computer, but if I run the exact same code on my personal computer, R does not crash (no bug). The work computer runs windows 7 professional, and my home computer runs windows 7 home premium with R 2.14.1.
I love this hack (I too have Windows 7 Professional at R 2.15.1 at work), and it does not crash anymore, but it causes another problem after I replaced that line and used assignInNamespace; also for some reason I had to replace odbcValidChannel with RODBC:::odbcValidChannel and quoteTabNames with RODBC:::quoteTabNames
But when I used sqlSave, I got the following error:
Error in odbcUpdate(channel, query, mydata, coldata[m, ], test = test, :
no parameters, so nothing to update
I don't even use odbcUpdate anywhere in the code, and the RODBC::: sqlSave does not have the odbcUpdate call inside.
Any thoughts?
thank you,
-Alex