I'd like to calculate the fitted values for a model fit in R using glm on data in a MySQL table and pipe the result back into that table, how could I do that?
# preparation
library("RMySQL")
con <- dbConnect(MySQL(), user="####", password="###", host="127.0.0.1", dbname="###")
on.exit(dbDisconnect(con))
# fetching the data and calculating fit
tab <- dbGetQuery(con,"SELECT ID, dep, indep1, indep2 FROM table WHERE !(ISNULL(ID) OR ISNULL(dep) OR ISNULL(indep1) OR ISNULL(indep2))")
regression <- glm(dep ~ indep1 + indep2, family = gaussian, data = tab)
# preparing data for insertion
insert <- data.frame(tab$ID, fitted.values(regression))
colnames(insert) <- c('ID', 'dep')
# table hassle: write the fits to a helper table, join it onto the original table,
# then drop the old tables and rename the combined table back to the original name
if (dbExistsTable(con, '_result')) {
  dbRemoveTable(con, '_result')
}
dbWriteTable(con, '_result', insert, row.names = FALSE)
dbSendQuery(con, 'CREATE TABLE temporary SELECT table.*, _result.dep AS fit FROM table LEFT JOIN _result USING (ID)')
dbRemoveTable(con, 'table')
dbRemoveTable(con, '_result')
dbSendQuery(con, 'RENAME TABLE temporary TO table')
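A hedged alternative sketch, for completeness: instead of dropping and renaming the whole table, you can add a column for the fitted values once and update it in place from the staging table. The column name fit and the UPDATE ... JOIN form are assumptions, not part of the code above.

# one-time: add a column to hold the fitted values (name "fit" is an assumption)
dbSendQuery(con, 'ALTER TABLE table ADD COLUMN fit DOUBLE')
# write the fits to a staging table and update the original table in place
dbWriteTable(con, '_result', insert, row.names = FALSE, overwrite = TRUE)
dbSendQuery(con, 'UPDATE table t JOIN _result r ON t.ID = r.ID SET t.fit = r.dep')
dbRemoveTable(con, '_result')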
I want to append a new table to the original (existing) table without overwriting it. What is the query code for appending the new table?
I have tried the following code.
  ## choose one document from vector of strings
  file <- x[j]
  # read csv for file
  doc <- read.csv(file, sep=";")
  # indicate number of rows for each doc
  n <- nrow(doc)
  # create dataframe for doc
  df <- data.frame(doc_id = numeric(n), doc_name = character(n), doc_number = character(n), stringsAsFactors = FALSE)
  # loop to create df
  for(k in 1:nrow(doc)){
    df$doc_id[k] <- paste0(df$doc_id[k])
    df$doc_name[k] <- paste0(doc$titles[k])
    df$doc_number[k] <- paste0(doc$no[k])
  }
  # query for inserting table to mysql
  query1 = sprintf('INSERT IGNORE INTO my_sql_table VALUES ("%s","%s","%s");', df[i,1],df[i,2],df[i,3])
  # query for appending table to my_sql_table
  query2 = sqlAppendTable(con, "doc", df)
  # execute the queries
  dbExecute(con, query1)
  dbExecute(con, query2)
  print(j)
} ## end of for loop
I also tried the following queries for appending; unfortunately, they didn't work.
INSERT IGNORE INTO my_sql_table VALUES ("%s","%s");
INSERT IGNORE INTO tableBackup(SELECT * FROM my_sql_table);
I expect new_table to be appended to original_table without deleting or overwriting the original table.
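For reference, a minimal sketch of a plain append with DBI, assuming the connection con, the data frame df and the table name my_sql_table from the code above; append = TRUE adds rows without replacing the table, provided the column names of df match the table's columns.

# append the rows of df to the existing table instead of overwriting it
dbWriteTable(con, "my_sql_table", df, append = TRUE, row.names = FALSE)
# with a recent DBI backend, dbAppendTable() does the same thing:
# dbAppendTable(con, "my_sql_table", df)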
I have a SQL database with one column of ID #'s and another column that holds the corresponding info for each ID. I have a vector that contains the ID #'s whose corresponding info I want. How do I query only those specific IDs, along with their corresponding information, and store the result in a table?
I've tried for loops, filtering, and hard-coding it.
con <- dbConnect(RSQLite::SQLite(), "data.db")
df<- tbl(con,"kv")
newish <- data.frame(df)
filter(person %in% IDs) %>%
collect()
After connecting, call up all the IDs in the given vector, extract the corresponding information, and store it in a table.
If I tried a for loop, the table would not print all of the information but only the information for the last ID in the vector. The filtering wouldn't work because it complained that the vector had to be of length one instead of 90,000. The actual result should be a table that contains only the patient ID #'s in my vector and those patients' corresponding information.
This is a SQL problem that you can decompose with R as follows:
# not run
# con <- dbConnect(RSQLite::SQLite(), "data.db")
ids <- paste0("id_", sample(1:100, 10))
id_var <- "id_var" # you can put more ids here with a corresponding data.frame in ids
vars <- paste0("var", 1:10)
db_name <- "mydatabase"
tab_name <- "mytable"
whereq <- paste("where", id_var, "in", paste0("('", paste0(ids, collapse = "', '"), "')") )
rqt <- paste("select", paste(vars, collapse = ", "),
"from", paste0(db_name, ".", tab_name), whereq, ";")
# check the query
rqt
#> [1] "select var1, var2, var3, var4, var5, var6, var7, var8, var9, var10 from mydatabase.mytable where id_var in ('id_99', 'id_1', 'id_72', 'id_86', 'id_65', 'id_59', 'id_67', 'id_4', 'id_82', 'id_2') ;"
# to uncomment
# res <- DBI::dbGetQuery(con, rqt)
# res <- tibble::as_tibble(res) # OR data.table::data.table(res) OR as.data.frame(res)
Created on 2019-06-11 by the reprex package (v0.2.1)
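If pasting tens of thousands of IDs into a single IN (...) clause becomes unwieldy (the question mentions roughly 90,000), a hedged alternative sketch is to push the ID vector into a temporary table and join against it. The table and column names (kv, person) are taken from the question; everything else is an assumption.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "data.db")
# write the lookup vector into a temporary table, then join against it
dbWriteTable(con, "id_lookup", data.frame(person = IDs), temporary = TRUE)
res <- dbGetQuery(con, "SELECT kv.* FROM kv INNER JOIN id_lookup USING (person)")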
Situation
Working with Python 3.7.2
I have read privilege on a MariaDB table with 5M rows on a server.
I have a local text file with 7K integers, one per line.
The integers represent IDXs of the table.
The IDX column of the table is the primary key. (so I suppose it is automatically indexed?)
Problem
I need to select all the rows whose IDX is in the text file.
My effort
Version 1
Make 7K queries, one for each line in the text file. This runs approximately 130 queries per second and takes about 1 minute to complete.
import pymysql
connection = pymysql.connect(....)
with connection.cursor() as cursor:
    query = (
        "SELECT *"
        " FROM TABLE1"
        " WHERE IDX = %(idx)s;"
    )
    all_selected = {}
    with open("idx_list.txt", "r") as f:
        for idx in f:
            idx = idx.strip()
            if idx:
                idx = int(idx)
                parameters = {"idx": idx}
                cursor.execute(query, parameters)
                result = cursor.fetchall()[0]
                all_selected[idx] = result
Version 2
Select the whole table, iterate over the cursor, and cherry-pick rows. The for-loop over .fetchall_unbuffered() covers 30-40K rows per second, and the whole script takes about 3 minutes to complete.
import pymysql
connection = pymysql.connect(....)
with connection.cursor() as cursor:
    query = "SELECT * FROM TABLE1"
    set_of_idx = set()
    with open("idx_list.txt", "r") as f:
        for line in f:
            if line.strip():
                line = int(line.strip())
                set_of_idx.add(line)
    all_selected = {}
    cursor.execute(query)
    for row in cursor.fetchall_unbuffered():
        if row[0] in set_of_idx:
            all_selected[row[0]] = row[1:]
Expected behavior
I need to make the selection faster, because the number of IDXs in the text file will grow to 10K-100K in the future.
I consulted other answers, including this one, but I can't make use of them since I only have read privilege, which makes it impossible to create another table to join with.
So how can I make the selection faster?
A temporary table implementation would look like:
connection = pymysql.connect(...., local_infile=True)
with connection.cursor() as cursor:
    cursor.execute("CREATE TEMPORARY TABLE R (IDX INT PRIMARY KEY)")
    cursor.execute("LOAD DATA LOCAL INFILE 'idx_list.txt' INTO TABLE R")
    cursor.execute("SELECT TABLE1.* FROM TABLE1 JOIN R USING (IDX)")
    ..
    cursor.execute("DROP TEMPORARY TABLE R")
Thanks to the hint (or more than a hint) from @danblack, I was able to achieve the desired result with the following query.
query = (
"SELECT *"
" FROM TABLE1"
" INNER JOIN R"
" ON R.IDX = TABLE1.IDX;"
)
cursor.execute(query)
danblack's SELECT statement didn't work for me, raising an error:
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'IDX' at line 1")
This is probably because of MariaDB's join syntax, so I consulted the MariaDB documentation on joining tables.
And now it selects 7K rows in 0.9 seconds.
Leaving here as an answer just for the sake of completeness, and for future readers.
I want to be able to add a column to an existing table, with its corresponding type.
This is how I tried it:
library("RMySQL")
# Connect to DB
v_db <- dbConnect(MySQL(),
user="USERNAME", password="PASSWORD",
dbname="DBNAME", host="localhost")
on.exit(dbDisconnect(v_db))
#Read in my new data (into R)
newcolumn <- read.csv("test.csv")
newcolumn
id datafornewcolumn
1 4
2 5
3 8
dbq <- dbSendQuery(v_db, "SELECT * FROM `EXISTINGTABLE`")
dbq <- fetch(dbq, n = -1)
dbq
id existing columns
1 ...
2 ...
3 ...
dbWriteTable(v_db, "EXISTINGTABLE", merge(dbq, newcolumn, by="id", all.x=TRUE), row.names=FALSE, overwrite=TRUE)
But with that last statement I overwrite the existing table with the new one, thereby losing all the corresponding variable types.
Then I tried a workaround: write the new data into a new table in SQL and afterwards merge that into EXISTINGTABLE. However, it seems I'm not able to do that correctly:
dbSendQuery(v_db, "create table workaround (id int not null primary key,
newcolumn DECIMAL(3,1))")
#write data into that new empty table called workaround --> works fine
dbWriteTable(v_db, "workaround", neu, row.name=FALSE, append=TRUE)
#something works...
dbSendQuery(v_db, "SELECT * FROM EXISTINGTABLE
LEFT JOIN workaround ON EXISTINGTABLE.id = workaround.id
UNION
SELECT * FROM EXISTINGTABLE
RIGHT JOIN workaround ON EXISTINGTABLE.id = workaround.id")
<MySQLResult:(29344,26,2)>
The result should look like this:
EXISTINGTABLE
id existingcolumns datafornewcolumn
1 ... 4
2 ... 5
3 ... 8
Have you tried modifying your table using SQL with an ALTER TABLE statement?
rs <- dbSendStatement(v_db, "ALTER TABLE table_name ADD COLUMN [...]")
dbHasCompleted(rs)
dbGetRowsAffected(rs)
dbClearResult(rs)
Then you can simply send an UPDATE statement to add the new values.
Try the dbExecute() method first... and look these up in the R console:
?dbExecute or ?dbSendStatement
Check this out too: RMySQL dbWriteTable with field.types
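A hedged sketch of that two-step route, reusing the workaround staging table from the question; the column type and the join column id are assumptions:

# add the new column once, then fill it from the staging table
dbExecute(v_db, "ALTER TABLE EXISTINGTABLE ADD COLUMN datafornewcolumn DECIMAL(3,1)")
dbExecute(v_db, "UPDATE EXISTINGTABLE e JOIN workaround w ON e.id = w.id SET e.datafornewcolumn = w.newcolumn")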
Adding a column to an existing table can be done as follows:
library(dbConnect)  # loads the required packages: RMySQL, DBI, methods, gWidgets
mydb <- dbConnect(MySQL(), user='root', password='newpass', dbname='database_name', host='localhost')
my_table <- dbReadTable(conn=mydb, name='table_name')
my_table$column_name <- NA  # creates a new column named column_name and fills it with NA
head(my_table)  # shows the data
I have a MySQL table I am attempting to access with R using RMySQL.
There are 1690004 rows that should be returned from
dbGetQuery(con, "SELECT * FROM tablename WHERE export_date ='2015-01-29'")
Unfortunately, I receive the following warning messages:
In is(object, Cl) : error while fetching row
In dbGetQuery(con, "SELECT * FROM tablename WHERE export_date ='2015-01-29'", : pending rows
And only receive ~400K rows.
If I break the query into several "fetches" using dbSendQuery, the warning messages start appearing after ~400K rows are received.
Any help would be appreciated.
So, it looks like it was due to a 60 second timeout imposed by my hosting provider (damn Arvixe!). I got around this by "paging/chunking" the output. Because my data has an auto-incrementing primary key, every row returned is in order, allowing me to take the next X rows after each iteration.
To get 1.6M rows I did the following:
library(RMySQL)
con <- MySQLConnect() # mysql connection function
day <- '2015-01-29' # date of interest
numofids <- 50000 # number of rows to include in each 'chunk'
count <- dbGetQuery(con, paste0("SELECT COUNT(*) as count FROM tablename WHERE export_date = '",day,"'"))$count # get the number of rows returned from the table.
dbDisconnect(con)
ns <- seq(0, count - 1, numofids) # sequence of row offsets to work over (MySQL LIMIT offsets are 0-based)
tosave <- data.frame() # data frame to bind results to
# iterate through the table to get the data in 50k-row chunks
for(nextseries in ns){ # for each chunk
  print(nextseries) # print the offset it's on
  con <- MySQLConnect()
  d1 <- dbGetQuery(con, paste0("SELECT * FROM tablename WHERE export_date = '",day,"' LIMIT ", nextseries,",",numofids)) # extract data in chunks of 50k rows
  dbDisconnect(con)
  # bind data to the tosave data frame (the if/else avoids an error when it tries to rbind d1 to an empty data frame on the first pass)
  if(nrow(tosave)>0){
    tosave <- rbind(tosave, d1)
  }else{
    tosave <- d1
  }
}
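A hedged alternative sketch: because the primary key auto-increments, keyset pagination (WHERE id > last seen id ... ORDER BY id) avoids re-scanning past an ever-larger OFFSET on each chunk. It reuses day, numofids and MySQLConnect() from above; the key column name id is an assumption.

last_id <- 0
tosave <- data.frame()
repeat {
  con <- MySQLConnect()
  d1 <- dbGetQuery(con, paste0("SELECT * FROM tablename WHERE export_date = '", day,
                               "' AND id > ", last_id, " ORDER BY id LIMIT ", numofids))
  dbDisconnect(con)
  if (nrow(d1) == 0) break   # no more rows for this date
  tosave <- rbind(tosave, d1)
  last_id <- max(d1$id)
}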