Is there a faster way to upload data from R to MySQL?

I am using the following code to upload a new table into a MySQL database.
library(RMySQL)
library(RODBC)
con <- dbConnect(MySQL(),
                 user = 'user',
                 password = 'pw',
                 host = 'amazonaws.com',
                 dbname = 'db_name')
dbSendQuery(con, "CREATE TABLE table_1 (
  var_1 VARCHAR(50),
  var_2 VARCHAR(50),
  var_3 DOUBLE,
  var_4 DOUBLE);
")
channel <- odbcConnect("db name")
sqlSave(channel, dat = df, tablename = "tb_name", rownames = FALSE, append = TRUE)
The full data set is 68 variables and 5 million rows. It is taking over 90 minutes to upload 50 thousand rows to MySQL. Is there a more efficient way to upload the data to MySQL? I originally tried dbWriteTable(), but this would result in an error message saying the connection to the database was lost.

Consider a CSV export from R for an import into MySQL with LOAD DATA INFILE:
...
write.csv(df, "/path/to/filename.csv", row.names=FALSE)
dbSendQuery(con, "LOAD DATA LOCAL INFILE '/path/to/filename.csv'
                  INTO TABLE mytable
                  FIELDS TERMINATED BY ','
                  ENCLOSED BY '\"'
                  LINES TERMINATED BY '\\n'")

You could try to disable the MySQL query log:
dbSendQuery(con, "SET GLOBAL general_log = 'off'")
I can't tell whether your MySQL user account has the appropriate permissions to do that, or whether it conflicts with your business needs.
Off the top of my head: otherwise you could try sending the data in, say, 1000-row batches, using a for loop in your R script, and perhaps passing verbose = TRUE in your call to sqlSave (see the sketch below).
If you send the data in a single batch, MySQL might try to run the INSERT as a single transaction ("all-or-nothing"); if it fails, it either goes into recovery or simply fails after inserting some arbitrary number of rows.
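For example, only as a sketch (it reuses the channel, df and table name "tb_name" from the question; the 1000-row chunk size is an arbitrary starting point):
chunk_size <- 1000
n <- nrow(df)
for (first in seq(1, n, by = chunk_size)) {
  last <- min(first + chunk_size - 1, n)
  # Append each slice of rows to the existing table, printing the generated SQL.
  sqlSave(channel, dat = df[first:last, ], tablename = "tb_name",
          rownames = FALSE, append = TRUE, verbose = TRUE)
}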

Exporting data from R to MySQL server

df <- data.frame(category = c("A","B","A","D","E"),
                 date = c("5/10/2005","6/10/2005","7/10/2005","8/10/2005","9/10/2005"),
                 col1 = c(1,NA,2,NA,3),
                 col2 = c(1,2,NA,4,5),
                 col3 = c(2,3,NA,NA,4))
I have to insert a data frame created in R into a MySQL server.
I have tried the methods from Efficient way to insert data frame from R to SQL. However, my data also contains NA values, which cause the whole export to fail.
Is there a faster way to upload the data?
dbWriteTable(cn, name = "table_name", value = df, overwrite = TRUE, row.names = FALSE)
The above works, but the upload is very slow.
The method that I have to use is this:
before = Sys.time()
chunksize = 1000000 # arbitrary chunk size
for (i in 1:ceiling(nrow(df)/chunksize)) {
  query = paste0('INSERT INTO dashboard_file_new_rohan_testing (',
                 paste0(colnames(df), collapse = ','), ') VALUES ')
  vals = NULL
  for (j in 1:chunksize) {
    k = (i-1)*chunksize + j
    if (k <= nrow(df)) {
      vals[j] = paste0('(', paste0(df[k,], collapse = ','), ')')
    }
  }
  query = paste0(query, paste0(vals, collapse = ','))
  dbExecute(cn, query)
}
time_chunked = Sys.time() - before
Error Encountered:
Error in .local(conn, statement, ...) :
could not run statement: Unknown column 'NA' in 'field list'
One of the fastest ways to load data into MySQL is its LOAD DATA statement, run from the mysql command line client. You may try first writing your R data frame to a CSV file, then using MySQL's LOAD DATA to load it:
write.csv(df, "output.csv", row.names=FALSE)
Then from your command line, use:
LOAD DATA INFILE 'output.csv' INTO TABLE table_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Note that this assumes the CSV file is already on the same machine as MySQL. If not, and the file is still on your local machine, use LOAD DATA LOCAL INFILE instead.
You may read MYSQL import data from csv using LOAD DATA INFILE for more help with LOAD DATA.
Edit:
To deal with the issue of NA values, which should become NULL in MySQL, you may take the approach of first casting the entire data frame to text and then replacing the NA values with empty strings. LOAD DATA will interpret a missing value in a CSV column as NULL. Consider this:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
df[is.na(df)] <- ""
Then, use write.csv along with LOAD DATA as described above.
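Putting those pieces together, a minimal sketch of the same workflow driven entirely from R might look like this (only a sketch: it assumes an open RMySQL connection such as cn from the question, the table_name target used above, and that the server permits LOAD DATA LOCAL INFILE; write.csv's na = "" argument writes NA as empty fields):
# Write NA as empty fields, then bulk-load the CSV through the existing connection.
write.csv(df, "output.csv", row.names = FALSE, na = "")
dbSendQuery(cn, "LOAD DATA LOCAL INFILE 'output.csv'
                 INTO TABLE table_name
                 FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
                 LINES TERMINATED BY '\\n'
                 IGNORE 1 LINES")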

Values are not inserted into MySQL table using pool.apply_async in python2.7

I am trying to run the following code to populate a table in parallel for a certain application. First, the following function is defined; it is supposed to connect to my DB and execute the SQL command with the values given (to insert into the table).
def dbWriter(sql, rows):
    # load cnf file
    MYSQL_CNF = os.path.abspath('.') + '/mysql.cnf'
    conn = MySQLdb.connect(db='dedupe',
                           charset='utf8',
                           read_default_file=MYSQL_CNF)
    cursor = conn.cursor()
    cursor.executemany(sql, rows)
    conn.commit()
    cursor.close()
    conn.close()
And then there is this piece:
pool = dedupe.backport.Pool(processes=2)
done = False
while not done:
    chunks = (list(itertools.islice(b_data, step)) for step in
              [step_size]*100)
    results = []
    for chunk in chunks:
        print len(chunk)
        results.append(pool.apply_async(dbWriter,
                                        ("INSERT INTO blocking_map VALUES (%s, %s)",
                                         chunk)))
    for r in results:
        r.wait()
    if len(chunk) < step_size:
        done = True
pool.close()
Everything runs and there are no errors. But at the end, my table is empty, meaning the insertions were somehow not successful. I have tried many things to fix this (including adding column names to the insert) after many Google searches and have not been successful. Any suggestions would be appreciated. (I am running the code in Python 2.7 on gcloud (Ubuntu); note that the indentation may be a bit messed up after pasting here.)
Please also note that "chunk" follows exactly the required data format.
Note: this is part of this example.
Please note that the only thing I am changing in the linked example is that I am separating the steps for creating the tables and inserting into them, since I am running my code on the gcloud platform and it enforces GTID standards.
The solution was changing the dbWriter function to:
conn = MySQLdb.connect(host = # host ip,
                       user = # username,
                       passwd = # password,
                       db = 'dedupe')
cursor = conn.cursor()
cursor.executemany(sql, rows)
cursor.close()
conn.commit()
conn.close()

Odoo 9: migrate binary field from db to filestore

In an Odoo 9 custom module, the attachment=True parameter was added to a binary field only later, so from that point on new records are stored in the filesystem storage.
For the older records the binary field was created without attachment=True, so no entry was created for them in the ir.attachment table and nothing was saved to the filesystem.
How can I migrate the old records' binary field values into filesystem storage? How do I create/insert rows in ir_attachment based on the old records' binary field values? Is any script available?
You have to include the PostgreSQL bin path in pg_path in your configuration file. This will restore the filestore that contains the binary fields:
pg_path = D:\fx\upsynth_Postgres\bin
I'm sure that you no longer need a solution to this as you asked 18 months ago, but I have just had the same issue (many gigabytes of binary data in the database) and this question came up on Google so I thought I would share my solution.
When you set attachment=True the binary column will remain in the database, but the system will look in the filestore instead for the data. This left me unable to access the data from the Odoo API so I needed to retrieve the binary data from the database directly, then re-write the binary data to the record using Odoo and then finally drop the column and vacuum the table.
Here is my script, which is inspired by this solution for migrating attachments, but this solution will work for any field in any model and reads the binary data from the database rather than from the Odoo API.
import xmlrpclib
import psycopg2

username = 'your_odoo_username'
pwd = 'your_odoo_password'
url = 'http://ip-address:8069'
dbname = 'database-name'
model = 'model.name'
field = 'field_name'
dbuser = 'postgres_user'
dbpwd = 'postgres_password'
dbhost = 'postgres_host'

conn = psycopg2.connect(database=dbname, user=dbuser, password=dbpwd, host=dbhost, port='5432')
cr = conn.cursor()

# Get the uid
sock_common = xmlrpclib.ServerProxy('%s/xmlrpc/common' % url)
uid = sock_common.login(dbname, username, pwd)
sock = xmlrpclib.ServerProxy('%s/xmlrpc/object' % url)

def migrate_attachment(res_id):
    # 1. get data
    cr.execute("SELECT %s from %s where id=%s" % (field, model.replace('.', '_'), res_id))
    data = cr.fetchall()[0][0]
    # Re-write attachment
    if data:
        data = str(data)
        sock.execute(dbname, uid, pwd, model, 'write', [res_id], {field: str(data)})
        return True
    else:
        return False

# SELECT attachments:
records = sock.execute(dbname, uid, pwd, model, 'search', [])
cnt = len(records)
print cnt

i = 0
for res_id in records:
    att = sock.execute(dbname, uid, pwd, model, 'read', res_id, [field])
    status = migrate_attachment(res_id)
    print 'Migrated ID %s (attachment %s of %s) [Contained data: %s]' % (res_id, i, cnt, status)
    i += 1

cr.close()
print "done ..."
Afterwards, drop the column and vacuum the table in psql.

How to write an entire data frame into a MySQL table in R

I have a data frame containing a column 'Quarter' with values like "16/17 Q1", "16/17 Q2", ... and a column 'Vendor' with values like "a", "b", ... .
I am trying to write this data frame into the database using
query <- paste("INSERT INTO cc_demo (Quarter,Vendor) VALUES(dd$FY_QUARTER,dd$VENDOR.x)")
but it is throwing this error:
Error in .local(conn, statement, ...) :
could not run statement: Unknown column 'dd$FY_QUARTER' in 'field list'
I am new to RMySQL. Please suggest a way to write the entire data frame.
To write a data frame to a MySQL DB you need to:
Create a connection to your database; you need to specify:
MySQL connection
User
Password
Host
Database name
library("RMySQL")
connection <- dbConnect(MySQL(), user = 'root', password = 'password', host = 'localhost', dbname = 'TheDB')
Using the connection, create a table and then export the data to the database:
dbWriteTable(connection, "testTable", testTable)
You can overwrite an existing table like this:
dbWriteTable(connection, "testTable", testTable_2, overwrite=TRUE)
I would advise against writing the SQL query by hand when you can use very handy functions such as dbWriteTable from the RMySQL package. But for the sake of practice, below is an example of how you could go about writing a SQL query that does multiple inserts for a MySQL database:
# Set up a data.frame
dd <- data.frame(Quarter = c("16/17 Q1", "16/17 Q2"), Vendors = c("a","b"))
# Begin the query
sql_qry <- "insert into cc_demo (Quarter,Vendor) VALUES"
# Finish it with
sql_qry <- paste0(sql_qry, paste(sprintf("('%s', '%s')", dd$Quarter, dd$Vendors), collapse = ","))
You should get:
"insert into cc_demo (Quarter,Vendor) VALUES('16/17 Q1', 'a'),('16/17 Q2', 'b')"
You can provide this query to your database connection in order to run it.
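For example (a minimal sketch, assuming the connection object and sql_qry built earlier in this answer):
# Run the multi-row INSERT that was built above.
dbExecute(connection, sql_qry)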
I hope this helps.

How to extract create statements from different tables of MySQL DBs?

I would like to extract all CREATE statements from my 50 MySQL databases via SHOW CREATE TABLE db.table, SHOW CREATE TABLE db1.mytable, SHOW CREATE TABLE db2.sometable, SHOW CREATE TABLE db3.mytable1, and so on. Each of the DBs has some tables inside, e.g. db1 (table, mytable, ...), db2 (table1, sometable), and so on.
To illustrate the DBs with an example query:
SELECT *
FROM db.table1 m
LEFT JOIN db1.sometable o ON m.id = o.id
LEFT JOIN db2.sometables t ON p.id=t.id
LEFT JOIN db3.sometable s ON s.column='john'
library(RMySQL)
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 username = "",
                 password = "",
                 host = "",
                 port = 3306,
                 dbname = mydbname) # when using dbs <- dbGetQuery(con, "SHOW DATABASES") I have to comment out dbname = mydbname to get all DBs
Using dbs <- dbGetQuery(con, "SHOW DATABASES") I can extract all 50 databases on the connection as a character vector. I would like to loop over each DB in dbs and apply SHOW CREATE TABLE to each row/db. I suppose I have to parse each row/db into dbname = mydbname and a dbGetQuery(con, "SHOW CREATE TABLE ...") call, but I just can't figure out how to build the loops.
I tried:
apply(dbs, 1, function(row) {
  dbname <- row[]
  for (i in 1:length(dbname)) {
    create <- dbGetQuery(con, "SHOW CREATE TABLE")
  }
})
But that doesn't seem right. I suppose I have to include con in the loop somehow. Otherwise I'll get:
Error in .local(drv, ...) : object 'dbname' not found
So I tried:
apply(dbs, 1, function(row) {
  dbname <- row[]
  for (i in 1:length(dbname)) {
    con <- dbConnect(RMySQL::MySQL(),
                     username = "",
                     password = "",
                     host = "",
                     port = 3306,
                     dbname = [i])
    create <- dbGetQuery(con, "SHOW CREATE TABLE")
  }
})
I suppose this comes close to the solution, but I am missing something:
dbs <- dbGetQuery(con, "show databases")
library(foreach)
foreach(i = 1:length(dbs)) %dopar% {
  query <- paste("SHOW CREATE TABLE", dbs[i])
  creates <- dbGetQuery(con, query)
}
Consider this approach of importing a data frame of each database (leaving out the system ones, INFORMATION_SCHEMA and MYSQL) and its corresponding tables. Then, run SHOW CREATE TABLE statements. Finally, merge the original data frame with the row-bound data frame of create statements.
Now, the one caveat is tables whose names repeat across databases. To return distinct values for such combinations, aggregate() with the head function is used.
con <- dbConnect(RMySQL::MySQL(),
username = "****", password = "****",
host = "****", port = 3306,
dbname= "****")
dbtbls <- dbGetQuery(con, "SELECT `TABLE_SCHEMA` AS `Database`,
`TABLE_NAME` AS `Table`
FROM `INFORMATION_SCHEMA`.`TABLES`
WHERE `TABLE_TYPE` = 'BASE TABLE'
AND `TABLE_SCHEMA` NOT LIKE '%SCHEMA%'
AND `TABLE_SCHEMA` NOT LIKE '%MYSQL%' ")
# LIST OF SQL STATEMENTS
sql <- paste0("SHOW CREATE TABLE ", dbtbls$Database, ".", dbtbls$Table)
# LIST OF DATAFRAMES
createstmts <- lapply(sql, function(x) dbGetQuery(con, x))
dbDisconnect(con)
# ROW BIND LIST INTO ONE DATAFRAME TO MERGE WITH ORIGINAL
stmtsdf <- do.call(rbind, createstmts)
finaldf <- merge(dbtbls, stmtsdf, by='Table')
# RETURN DISTINCT RECORDS
finaldf <- aggregate(.~Database+Table, finaldf, FUN=head, 1)
mysqldump --no-data
does exactly what you are asking for. (There may be other parameters you want, e.g. to include or exclude CREATE DATABASE statements, etc.)
If the requirement is to subsequently pull the CREATEs into R, then I ask whether this is a one-time task, or a recurring task. For one-time, I would suggest that, overall, the mysqldump approach might be simpler.
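For the one-time case, a rough sketch of driving mysqldump from R (only a sketch: it assumes mysqldump is on the PATH and uses placeholder credentials):
# Dump only the schema (the CREATE statements) and read it back into R.
system2("mysqldump",
        args = c("--no-data", "--all-databases",
                 "--host=your_host", "--user=your_user", "--password=your_pw",
                 "--result-file=schema.sql"))
creates <- readLines("schema.sql")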
First, you can simply use
for (i in 1:length(dbs)) { }
Or you can look into the apply functions, particularly sapply. There you can do the parsing per DB connection string, connect, and get all tables as a list or vector. Then you can loop inside those to get the CREATE TABLE statements.
So it is basically an apply inside an apply; see the rough sketch below.
For a good explanation of apply functions, you can look into http://www.r-bloggers.com/using-apply-sapply-lapply-in-r/
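A rough sketch of that nested approach (assuming a single open connection con whose user can read every database, and skipping the usual system schemas):
# Outer loop over databases, inner loop over tables; collect one CREATE statement per table.
dbs <- dbGetQuery(con, "SHOW DATABASES")$Database
dbs <- setdiff(dbs, c("information_schema", "mysql", "performance_schema", "sys"))
creates <- sapply(dbs, function(db) {
  tbls <- dbGetQuery(con, sprintf("SHOW TABLES FROM `%s`", db))[[1]]
  lapply(tbls, function(tbl) {
    dbGetQuery(con, sprintf("SHOW CREATE TABLE `%s`.`%s`", db, tbl))
  })
}, simplify = FALSE)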