Exporting data from R to MySQL server

df <- data.frame(category = c("A","B","A","D","E"),
                 date = c("5/10/2005","6/10/2005","7/10/2005","8/10/2005","9/10/2005"),
                 col1 = c(1,NA,2,NA,3),
                 col2 = c(1,2,NA,4,5),
                 col3 = c(2,3,NA,NA,4))
I need to insert a data frame created in R into a MySQL server.
I have tried the methods described in Efficient way to insert data frame from R to SQL. However, my data also contains NA values, which make the whole export fail.
Is there a faster way to upload the data?
dbWriteTable(cn,name ="table_name",value = df,overwrite=TRUE, row.names = FALSE)
The above works, but the upload is very slow.
The method that I have to use is this:
before = Sys.time()
chunksize = 1000000 # arbitrary chunk size
for (i in 1:ceiling(nrow(df)/chunksize)) {
  query = paste0('INSERT INTO dashboard_file_new_rohan_testing (',
                 paste0(colnames(df), collapse = ','), ') VALUES ')
  vals = NULL
  for (j in 1:chunksize) {
    k = (i-1)*chunksize + j
    if (k <= nrow(df)) {
      vals[j] = paste0('(', paste0(df[k,], collapse = ','), ')')
    }
  }
  query = paste0(query, paste0(vals, collapse = ','))
  dbExecute(cn, query)
}
time_chunked = Sys.time() - before
Error Encountered:
Error in .local(conn, statement, ...) :
could not run statement: Unknown column 'NA' in 'field list'

One of the fastest ways to load data into MySQL is its LOAD DATA statement. You may try first writing your R data frame to a CSV file, then using LOAD DATA to load it:
write.csv(df, "output.csv", row.names=FALSE)
Then, from the MySQL client, run:
LOAD DATA INFILE 'output.csv' INTO TABLE table_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
(The '\r\n' terminator matches a CSV written on Windows; on Linux or macOS, write.csv ends lines with '\n', so use LINES TERMINATED BY '\n' there.)
Note that this assumes the CSV file is on the same machine as the MySQL server. If the file is still on your local machine, use LOAD DATA LOCAL INFILE instead.
See MySQL import data from csv using LOAD DATA INFILE for more help with LOAD DATA.
Edit:
To deal with the NA values, which should become NULL in MySQL, you may first cast the entire data frame to text and then replace the NA values with empty strings. LOAD DATA will interpret a missing value in a CSV column as NULL. Consider this:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
df[is.na(df)] <- ""
Then, use write.csv along with LOAD DATA as described above.
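Putting the pieces together, here is a minimal sketch of the whole pipeline driven from R (assuming cn is an open DBI/RMySQL connection and table_name already exists with matching columns; the file name output.csv is illustrative):
# Cast every column to character so the NAs can be blanked out
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
df[is.na(df)] <- ""
write.csv(df, "output.csv", row.names = FALSE)
# LOCAL because the CSV sits on the client machine, not the server
dbExecute(cn, "LOAD DATA LOCAL INFILE 'output.csv'
               INTO TABLE table_name
               FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
               LINES TERMINATED BY '\\n'
               IGNORE 1 LINES")
Note that LOAD DATA LOCAL works only when local_infile is enabled on both the client and the server.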

Related

Python 3 psycopg2 COPY from stdin failed: error in .read()

I am trying to apply the code found on this page, in particular the section 'Copy Data from String Iterator', but I am running into an issue with my code.
Since not all lines coming from the generator (here log_lines) can be imported into the PostgreSQL database, I try to keep only the correct lines (here row) using itertools.filterfalse, as in the code block below:
def copy_string_iterator(connection, log_lines) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        log_string_iterator = StringIteratorIO(
            '|'.join(map(clean_csv_value, (
                row['date'],
                row['time'],
                row['cs_uri_query'],
                row['s_contentpath'],
                row['sc_status'],
                row['s_computername'],
                ...
                row['sc_substates'],
                row['s_port'],
                row['cs_version'],
                row['c_protocol'],
                row.update({'cs_cookie': 'x'}),
                row['timetakenms'],
                row['cs_uri_stem'],
            ))) + '\n'
            for row in filterfalse(lambda line: "#" in line.get('date'), log_lines)
        )
        cursor.copy_from(log_string_iterator, 'log_table', sep='|')
When I run this, cursor.copy_from() gives me the following error:
QueryCanceled: COPY from stdin failed: error in .read() call
CONTEXT: COPY log_table, line 112910
I understand why this error happens: in the test file I use, there are only 112909 lines that meet the filterfalse condition. But why does it try to copy line 112910 and throw the error, rather than just stopping?
Since Python doesn't have a coalescing operator, add something like:
(map(clean_csv_value, (
    row['date'] if 'date' in row else None,
    :
    row['cs_uri_stem'] if 'cs_uri_stem' in row else None,
))) + '\n')
for each of your fields, so you can handle any missing fields in the JSON file (row.get('date') is an equivalent, more concise spelling). Of course, the fields should be nullable in the database if you use None; otherwise, replace None with some default value for that field.

How to upload the whole text of a file into one row in MySQL through the terminal

How do I upload the whole text of a text file into a single row of a database? The text gets divided and stored across subsequent rows.
This is the code of my SQL file. The database is named info and contains a table, also named info, with two VARCHAR(3000) columns, des1 and des2:
use info;
INSERT INTO info (des1) VALUES (LOAD_FILE('eng.txt'));
select * from info;
I am getting the following output:
des1 des2
NULL NULL
I expect all of the text in the file to end up in a single row of the database, and this has to be done in the terminal.
There might be a problem with how you form the string that is to be inserted. You can do this easily with the help of a programming language.
Here is a simple solution that would work in Python:
import mysql.connector

mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    passwd="password",
    database="db"
)

def fileReadToString(filename):
    result = ""
    with open(filename, "r") as ins:
        for line in ins:
            result += line
    return result

file = fileReadToString('eng.txt')
mycursor = mydb.cursor()

# name the target column explicitly and pass the value as a one-element tuple
sql = "INSERT INTO info (des1) VALUES (%s)"
val = (file,)
mycursor.execute(sql, val)
mydb.commit()
print(mycursor.rowcount, "done")
LOAD DATA INFILE is for structured data. For unstructured data, use LOAD_FILE():
INSERT INTO info (des1) VALUES (LOAD_FILE('eng.txt'));
Note that LOAD_FILE() returns NULL (which is the output you are seeing) unless the file path is absolute, the file lives on the server host and is readable by the MySQL server, the location is permitted by secure_file_priv, and your user has the FILE privilege.

Is there a faster way to upload data from R to MySQL?

I am using the following code to upload a new table into a MySQL database.
library(RMySQL)
library(RODBC)
con <- dbConnect(MySQL(),
                 user = 'user',
                 password = 'pw',
                 host = 'amazonaws.com',
                 dbname = 'db_name')

dbSendQuery(con, "CREATE TABLE table_1 (
  var_1 VARCHAR(50),
  var_2 VARCHAR(50),
  var_3 DOUBLE,
  var_4 DOUBLE);
")

channel <- odbcConnect("db name")
sqlSave(channel, dat = df, tablename = "tb_name", rownames = FALSE, append = TRUE)
The full data set is 68 variables and 5 million rows. It is taking over 90 minutes to upload just 50 thousand rows to MySQL. Is there a more efficient way to upload the data? I originally tried dbWriteTable(), but it would fail with an error message saying the connection to the database was lost.
Consider a CSV export from R for an import into MySQL with LOAD DATA INFILE:
...
write.csv(df, "/path/to/filename.csv", row.names=FALSE)

dbSendQuery(con, "LOAD DATA LOCAL INFILE '/path/to/filename.csv'
                  INTO TABLE mytable
                  FIELDS TERMINATED by ','
                  ENCLOSED BY '\"'
                  LINES TERMINATED BY '\\n'")
You could also try disabling the MySQL query log:
dbSendQuery(con, "SET GLOBAL general_log = 'off'")
I can't tell whether your MySQL user account has the appropriate permissions to do that, or whether it conflicts with your business needs.
Off the top of my head: otherwise, you could try to send the data in, say, 1000-row batches, using a for loop in your R script, perhaps with the option verbose = TRUE in your call to sqlSave (see the sketch below).
If you send the data in a single batch, MySQL might try to run the INSERT as a single transaction ("all-or-nothing"), and if it fails it goes into recovery or just fails after inserting some random number of rows.
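Here is a minimal sketch of that batching idea, using dbWriteTable (the same loop shape works with sqlSave). It assumes con is the open RMySQL connection from above and the target table tb_name already exists; the chunk size is arbitrary:
chunksize <- 1000
n <- nrow(df)
for (i in seq(1, n, by = chunksize)) {
  # current slice of rows
  chunk <- df[i:min(i + chunksize - 1, n), , drop = FALSE]
  dbWriteTable(con, "tb_name", chunk, append = TRUE, row.names = FALSE)
}
Each call then runs as its own small transaction, so a dropped connection costs you only the current chunk rather than the whole upload.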

Julia - Rewriting a CSV

Complete Julia newbie here.
I'd like to do some processing on a CSV. Something along the lines of:
using CSV
in_file = CSV.Source('/dir/in.csv')
out_file = CSV.Sink('/dir/out.csv')
for line in CSV.eachline(in_file)
replace!(line, "None", "")
CSV.writeline(out_file, line)
end
This is pseudocode; those aren't existing functions.
Idiomatically, should I iterate over 1:CSV.countlines(in_file)? Do a while loop and check for something?
If all you want to do is replace a string in each line, you do not need any CSV parsing utilities. All you do is read the file line by line, replace, and write. So:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"

out = open(outfile, "w+")
for line in readlines(infile)
    newline = replace(line, "None", "")
    write(out, newline)
end
close(out)
This will replicate the pseudocode you have in your question.
If you need to parse and read the CSV field by field, use the readcsv function in Base:
data = readcsv(infile)
typeof(data) # Array{Any,2}
This returns the data in the file as a two-dimensional array. You can process the data any way you want and write it back out using the writecsv function:
for i in 1:size(data, 1)                  # iterate over rows
    data[i, 1] = "This is " * data[i, 1]  # add text to the first column
end
writecsv(outfile, data)
Documentation for these functions:
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.readcsv
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.writecsv

Filtering rows in a csv file with Python

I have a script that opens and modifies a text file. The text file contains personnel info and a lunch account balance. My script removes the quotes from the text file and only writes rows that contain the value D, F, or R in column 8. It writes this filtered data to two files: a CSV import file called lunchimport.csv for a separate program, and a temporary CSV file used for further filtering. The second stage of the script uses the temp CSV file to generate two additional CSV files. One file, negativebal.csv, contains only rows with a negative value in column 14. The other file, lowbal.csv, contains rows with a value between 0 and 5 in column 14.
My issue is that I can't get the script to filter "between" values properly. When I use the code below to write only rows with values in column 14 between 0 and 5, nothing is filtered out. If I use values between 0 and 1.99, it works. With anything greater than 1.99, the code doesn't filter anything:
if row[13] > "0" and row[13] < "1.99":
    lowwriter.writerow([row[0], row[13]])
I have pasted my entire code below. I use a lot of temp files to accomplish my tasks. There is probably a better way, but I'm just interested in getting my filters to work properly.
import os
import csv

infile = open("\\\\comalexsrv\\export\\update.txt", "r")
outfile1 = open("casttemp1.csv", "w")
infile2 = open("casttemp1.csv", "r")
outfile2 = open("casttemp2.csv", "w")
infile3 = open("casttemp2.csv", "r")
outfile3 = open("casttemp3.csv", "w")
infile4 = open("casttemp3.csv", "r")
inowcsv = open(r"F:\zbennett\Lunch_Imports\lunchimport.csv", "w")
negcastcsv = open("\\\\tcdc\\inow_transfer$\\negativebal.csv", "w")
lowcastcsv = open("\\\\tcdc\\inow_transfer$\\lowbal.csv", "w")

# Remove quotes in update.txt, write to outfile1 (casttemp1.csv)
string = infile.read()
outfile1.write(string.replace("\"", ''))

# From infile2 (casttemp1.csv):
#   write rows with D,F,R in column 8 to outfile2 (casttemp2.csv)
#   write rows with D,F,R in column 8 to inowcsv (F:\zbennett\Lunch_Imports\lunchimport.csv)
#   write rows with D,R in column 8 to outfile3 (casttemp3.csv)
tempwriter = csv.writer(outfile2, delimiter=',', lineterminator='\n')
importwriter = csv.writer(inowcsv, delimiter=',', lineterminator='\n')
lowtemp = csv.writer(outfile3, delimiter=',', lineterminator='\n')
for row in csv.reader(infile2, delimiter=','):
    if row[7] == "D":
        tempwriter.writerow(row)
        importwriter.writerow(row)
        lowtemp.writerow(row)
    if row[7] == "F":
        tempwriter.writerow(row)
        importwriter.writerow(row)
    if row[7] == "R":
        tempwriter.writerow(row)
        importwriter.writerow(row)
        lowtemp.writerow(row)

# From infile3 (casttemp2.csv): write columns 1,14 for rows with column 14 less than 0
# to negcastcsv (\\tcdc\inow_transfer$\negativebal.csv)
negwriter = csv.writer(negcastcsv, delimiter=',', lineterminator='\n')
for row in csv.reader(infile3, delimiter=','):
    if row[13] < "0":
        negwriter.writerow([row[0], row[13]])

# From infile4 (casttemp3.csv): write columns 1,14 for rows with column 14 between 0 and 1.99
# to lowcastcsv (\\tcdc\inow_transfer$\lowbal.csv)
lowwriter = csv.writer(lowcastcsv, delimiter=',', lineterminator='\n')
for row in csv.reader(infile4, delimiter=','):
    if row[13] > "0" and row[13] < "1.99":
        lowwriter.writerow([row[0], row[13]])

infile.close()
outfile1.close()
infile2.close()
outfile2.close()
inowcsv.close()
outfile3.close()
infile3.close()
infile4.close()
negcastcsv.close()
lowcastcsv.close()

# Delete the temp files
os.remove("casttemp1.csv")
os.remove("casttemp2.csv")
os.remove("casttemp3.csv")
Comparison is happening using strings, when you probably want numeric comparison:
if 0. < float(row[13]) < 1.99:
    lowwriter.writerow([row[0], row[13]])