I'm using R to insert a data.frame into a MySQL database. The code below successfully inserts 1000 rows at a time. However, that's not practical for a data.frame with tens of thousands of rows. How would you do a bulk insert using R? Is it even possible?
## R and MySQL
library(RMySQL)
### create sql connection object
mydb = dbConnect(MySQL(), dbname="db", user='xxx', password='yyy', host='localhost', unix.sock="/Applications/MAMP/mysql/mysql.sock")
# get data ready for mysql
df = data.format
# chunks
df1 <- df[1:1000,]
df2 <- df[1001:2000,]
df3 <- df[2001:nrow(df),]
## SQL insert for data.frame, limit 1000 rows
dbWriteTable(mydb, "table_name", df1, append=TRUE, row.names=FALSE)
dbWriteTable(mydb, "table_name", df2, append=TRUE, row.names=FALSE)
dbWriteTable(mydb, "table_name", df3, append=TRUE, row.names=FALSE)
For completeness, as the link suggests, write the df to a temp table and insert into the destination table as follows:
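dbWriteTable(mydb, name = 'temp_table', value = df, row.names = FALSE, append = FALSE)
# dbExecute() is the DBI call for statements that return no rows; table_name is the destination table
dbExecute(mydb, "insert into table_name select * from temp_table")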
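The staging-table pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for MySQL so the example runs anywhere, and the two-column schema (id, name) and the helper name bulk_insert_via_staging are made up for the demo.

```python
import sqlite3

def bulk_insert_via_staging(conn, rows):
    """Load rows into a staging table, then copy them to the destination in one statement."""
    cur = conn.cursor()
    # Staging table with the same (hypothetical) columns as the destination table.
    cur.execute("CREATE TEMP TABLE temp_table (id INTEGER, name TEXT)")
    cur.executemany("INSERT INTO temp_table VALUES (?, ?)", rows)
    # A single INSERT ... SELECT moves everything inside the database.
    cur.execute("INSERT INTO table_name SELECT * FROM temp_table")
    cur.execute("DROP TABLE temp_table")
    conn.commit()

# Demo: an in-memory database with an empty destination table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER, name TEXT)")
bulk_insert_via_staging(conn, [(i, f"row{i}") for i in range(2500)])
```

The point of the pattern is that the copy into the destination table happens in one statement on the database side, whatever client library loaded the staging table.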
Fast bulk insert is now supported by the DBI-based ODBC package, see
this example posted by Jim Hester (https://github.com/r-dbi/odbc/issues/34):
library(DBI)
con <- dbConnect(odbc::odbc(), "MySQL")
dbWriteTable(con, "iris", head(iris), append = TRUE, row.names=FALSE)
dbDisconnect(con)
Since RMySQL is also DBI-based you just have to "switch" the DB connection
to use the odbc package (thanks to the standardized DBI interface of R).
Since the RMySQL package "... is being phased out in favor of the new RMariaDB package", according to its web site (https://github.com/r-dbi/RMySQL), you could try switching the driver package to RMariaDB (perhaps it has already implemented a bulk insert feature).
For details see: https://github.com/r-dbi/RMariaDB
If all else fails, you could put it in a loop:
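chunk_size <- 1000
# assumes df has at least one row
for (start in seq(1, nrow(df), by = chunk_size)) {
  end <- min(start + chunk_size - 1, nrow(df))  # the last chunk may be shorter than 1000 rows
  dbWriteTable(mydb, "table_name", df[start:end, ], append = TRUE, row.names = FALSE)
}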
Related
I have a large table in MySQL (about 3million rows with 15 columns) and I'm trying to use some of the data from the table in an R shiny app.
I've been able to get the connection and write a query in R:
library(DBI)
library(dplyr)
cn <- dbConnect(drv = RMySQL::MySQL(),
username = "user",
password = "my_password",
host = "host",
port = 3306
)
query = "SELECT * FROM dbo.locations"
However, when I run dbGetQuery(cn, query) it takes a really long time (I ended up closing RStudio after it became unresponsive).
I also tried
res <- DBI::dbSendQuery (cn, query)
repeat {
df <- DBI::dbFetch (res, n = 1000)
if (nrow (df) == 0) { break }
}
dbClearResult(dbListResults(cn)[[1]])
since this is similar to reading the data in by chunks, but my resulting df has 0 rows for some reason.
Any suggestions on how to get my table in R? Should I even try to read that table into R? From what I understand, R doesn't handle large data very well.
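One thing worth noting about the chunked loop above: df is overwritten on every fetch and the loop only exits once a fetch returns nothing, so the df left at the end is the empty final chunk. Below is a minimal sketch of the accumulate-instead-of-overwrite pattern, written in Python with the built-in sqlite3 module standing in for MySQL. The table contents and the helper name fetch_in_chunks are assumptions for the demo.

```python
import sqlite3

def fetch_in_chunks(conn, query, chunk_size=1000):
    """Fetch a result set chunk by chunk, keeping every chunk instead of overwriting it."""
    cur = conn.cursor()
    cur.execute(query)
    chunks = []
    while True:
        chunk = cur.fetchmany(chunk_size)
        if not chunk:
            # An empty chunk means the result set is exhausted.
            break
        # Accumulate; keeping only the latest chunk would lose all earlier rows.
        chunks.append(chunk)
    cur.close()
    return [row for chunk in chunks for row in chunk]

# Demo data: 2500 rows in a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locations (id INTEGER)")
conn.executemany("INSERT INTO locations VALUES (?)", [(i,) for i in range(2500)])
rows = fetch_in_chunks(conn, "SELECT * FROM locations ORDER BY id")
```

In R the equivalent change would be to append each fetched chunk to a list and combine the list after the loop.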
I use the DBI and RMySQL packages to import a whole table from the database. The code works as expected. Is there a faster way to import the same table multiple times? For example, I import the table, do some calculations, close the R session, and then import the same table again tomorrow. Is there a way to cache that table so that later imports are faster?
The code example (working as expected):
library(RMySQL)
library(DBI)
# connect to database
connection <- function() {
  con <- DBI::dbConnect(RMySQL::MySQL(),
                        host = "91.234.xx.xxx",
                        port = 3306L,
                        dbname = "xxxx",
                        username = "xxxx",
                        password = "xxxx",
                        Trusted_Connection = "True")
}
# import
db <- connection()
vix <- DBI::dbGetQuery(db, 'SELECT * FROM VIX')
invisible(dbDisconnect(db))
I am trying to insert each row from about 2000 csv files into a mysql table. With the following code, I have inserted only one row from just one file. How can I automate the code so that it inserts all rows for each file? The insertions need to be done just once.
import pymysql.cursors
connection = pymysql.connect(host='localhost',
                             user='s',
                             password='n9',
                             db='si',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
try:
    with connection.cursor() as cursor:
        sql = "INSERT INTO `TrainsS` (`No.`, `Name`,`Zone`,`From`,`Delay`,`ETA`,`Location`,`To`) VALUES (%s,%s,%s,%s,%s,%s,%s, %s)"
        cursor.execute(sql, ('03', 'P Exp','SF','HWH', 'none','no arr today','n/a','ND'))
    connection.commit()
finally:
    connection.close()
How about checking this code?
To run this, put all your .csv files in one folder and call os.walk(folder_location) on it to get the locations of all the .csv files. The code then opens them one by one and inserts their rows into the required MySQL database.
import csv
import os
import warnings
import pymysql.cursors
warnings.simplefilter("ignore")
cwd = os.getcwd()
connection = pymysql.connect(host='localhost',
                             user='s',
                             password='n9',
                             db='si',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
# collect the full path of every .csv file under the current folder
files_csv = []
for subdir, dirs, files in os.walk(cwd):
    files_csv += [os.path.join(subdir, fi) for fi in files if fi.endswith(".csv")]
print(files_csv)
sql = "INSERT INTO `TrainsS` (`No.`, `Name`,`Zone`,`From`,`Delay`,`ETA`,`Location`,`To`) VALUES (%s,%s,%s,%s,%s,%s,%s, %s)"
for path in files_csv:
    # assumes the files have no header row and eight fields per line
    with open(path, newline='') as f:
        lis = list(csv.reader(f))  # csv.reader splits each line on commas
    with connection.cursor() as cursor:
        for x in lis:
            # HERE x contains the row data and you can access it individually using x[0], x[1], etc.
            # values stands for the (#CONVERTED VALUES FROM x); convert individual fields here if needed
            values = tuple(x)
            cursor.execute(sql, values)
    connection.commit()
Update -
getting values for (#CONVERTED VALUES FROM x)
columns = ['No.', 'Name', 'Zone', 'From', 'Delay', 'ETA', 'Location', 'To']
# x[i] is the i-th field of the current row; collect one value per column
values = tuple(x[i] for i in range(len(columns)))
# pass the values as parameters so the driver quotes and escapes them,
# rather than concatenating them into the SQL string
command = "INSERT INTO `TrainsS` (`No.`, `Name`,`Zone`,`From`,`Delay`,`ETA`,`Location`,`To`) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)"
cursor.execute(command, values)
#Then commit using connection.commit()
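As a more compact variant of the approach above, the per-row execute calls can be replaced by one executemany call per file, which is part of the Python DB-API and is available on pymysql cursors as well. The sketch below is an illustration, not the answerer's code: the function name load_csv_folder is hypothetical, and it assumes the files have no header row and that every row has as many fields as the INSERT statement has placeholders.

```python
import csv
from pathlib import Path

def load_csv_folder(conn, folder, insert_sql):
    """Insert all rows of every .csv file under `folder`; return the number of rows inserted."""
    total = 0
    cur = conn.cursor()
    for path in sorted(Path(folder).rglob("*.csv")):
        with open(path, newline="") as f:
            # csv.reader handles commas and quoting; empty lines are skipped.
            rows = [tuple(r) for r in csv.reader(f) if r]
        if rows:
            # One batched call per file instead of one execute per row.
            cur.executemany(insert_sql, rows)
        total += len(rows)
    # Commit once, after all files have been processed.
    conn.commit()
    return total
```

With the pymysql connection from the question, it would be called with the same parameterized statement, for example load_csv_folder(connection, os.getcwd(), sql). Any DB-API connection works, as long as insert_sql uses that driver's placeholder style (%s for pymysql, ? for sqlite3).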
import psycopg2
import time
import csv
conn = psycopg2.connect(
    host = "localhost",
    database = "postgres",
    user = "postgres",
    password = "postgres"
)
cur = conn.cursor()
start = time.time()
with open('combined_category_data_100 copy.csv', 'r') as file:
    reader = csv.reader(file)
    # reading the header row skips it and also gives the number of columns
    ncol = len(next(reader))
    # one %s placeholder per column
    placeholders = ", ".join(["%s"] * ncol)
    for row in reader:
        cur.execute(f"insert into data values ({placeholders})", row)
# commit once after all rows have been sent
conn.commit()
print("data entered successfully")
end = time.time()
print(f" time taken is {end - start}")
cur.close()
I have a CSV file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark dataset and taking the first 500 rows from it. The code is as follows:
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set=df.load("s3n://"+this.accessId.replace("\"", "")+":"+this.accessToken.replace("\"", "")+"#"+this.bucketName.replace("\"", "")+"/"+this.filePath.replace("\"", "")+"");
set.take(500);
The whole operation takes 20 to 30 seconds.
Now I am trying the same thing, but instead of a CSV file I am using a MySQL table with 119,000 rows. The MySQL server is on Amazon EC2. The code is as follows:
String url ="jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;
SparkSession spark=StartSpark.getSparkSession();
SQLContext sc = spark.sqlContext();
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set = sc
.read()
.option("url", url)
.option("dbtable", this.tableName)
.option("driver","com.mysql.jdbc.Driver")
.format("jdbc")
.load();
set.take(500);
This takes 5 to 10 minutes.
I am running Spark inside the JVM, using the same configuration in both cases.
I could use partitionColumn, numPartitions, etc., but I don't have any numeric column, and the schema of the table is unknown to me.
My question is not how to reduce the time required (I know that ideally Spark would run on a cluster). What I cannot understand is why there is such a big time difference between the two cases above.
This problem has been covered multiple times on StackOverflow:
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
spark jdbc df limit... what is it doing?
How to use JDBC source to write and read data in (Py)Spark?
and in external sources:
https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads
so just to reiterate - by default DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread and a single executor.
To distribute reads:
use ranges with lowerBound / upperBound:
Dataset<Row> set = sc
    .read()
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver","com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();
use predicates:
Properties properties;
Dataset<Row> set = sc
    .read()
    .jdbc(
        url, this.tableName,
        new String[]{"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
        properties
    );
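To make the two options above more concrete, here is a rough sketch of how a range split turns into one WHERE clause per partition, plus a hash-based alternative for tables with no numeric column. This only approximates the idea and is not Spark's actual code. The function names are hypothetical, and the hash variant assumes MySQL's CRC32() and MOD() functions and a column name of your choosing.

```python
def range_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) into equal strides and return one WHERE clause per partition."""
    if num_partitions <= 1:
        return ["1 = 1"]  # a single partition reads everything
    stride = (upper - lower) // num_partitions
    bounds = [lower + stride * i for i in range(1, num_partitions)]
    clauses = []
    for i in range(num_partitions):
        if i == 0:
            # The first partition also picks up NULLs and values below lowerBound.
            clauses.append(f"{column} < {bounds[0]} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # The last partition is open-ended, so values above upperBound are not lost.
            clauses.append(f"{column} >= {bounds[-1]}")
        else:
            clauses.append(f"{column} >= {bounds[i - 1]} AND {column} < {bounds[i]}")
    return clauses

def hash_predicates(column, num_partitions):
    """Bucket rows by a hash of any column (assumes MySQL's CRC32 and MOD functions)."""
    clauses = [f"MOD(CRC32({column}), {num_partitions}) = {i}" for i in range(num_partitions)]
    # CRC32(NULL) is NULL, so send NULLs to the first bucket explicitly.
    clauses[0] += f" OR {column} IS NULL"
    return clauses

print(range_predicates("foo", 0, 30, 3))
print(hash_predicates("name", 3))
```

Either list can be passed as the predicates array of the jdbc call shown above. Note that the bounds only decide how rows are divided among partitions; they do not filter any rows out.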
Please follow the steps below
1. Download a copy of the JDBC connector for MySQL. I believe you already have one.
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar
2. Create a db-properties.flat file in the format below:
jdbcUrl=jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}
user=<username>
password=<password>
3. Create an empty table first, where you want to load the data.
Invoke the Spark shell with the driver class path:
spark-shell --driver-class-path <your path to mysql jar>
then import all the required package
import java.io.{File, FileInputStream}
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
initiate a hive context or a sql context
val sQLContext = new HiveContext(sc)
import sQLContext.implicits._
import sQLContext.sql
set some of the properties
sQLContext.setConf("hive.exec.dynamic.partition", "true")
sQLContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Load mysql db properties from file
val dbProperties = new Properties()
dbProperties.load(new FileInputStream(new File("your_path_to/db-properties.flat")))
val jdbcurl = dbProperties.getProperty("jdbcUrl")
Create a query to read the data from your table and pass it to the read method of the SQLContext. This is where you can manage your WHERE clause.
val df1 = "(SELECT * FROM your_table_name) as s1"
pass the jdbcurl, select query and db properties to read method
val df2 = sQLContext.read.jdbc(jdbcurl, df1, dbProperties)
write it to your table
df2.write.format("orc").partitionBy("your_partition_column_name").mode(SaveMode.Append).saveAsTable("your_target_table_name")
I can't find a way to load a huge JSON file. I'm trying with the well-known Yelp dataset. It's 3.2 GB and I want to analyse 9 of its 10 columns. I need to skip importing the $text column, which would give me a much lighter file to load, probably about 70% smaller. I don't want to manipulate the file.
I tried many libraries and got stuck. I found a solution for data.frame that uses the pipe function:
df <- read.table(pipe("cut -f1,5,28 myFile.txt"))
from this thread: Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?)
How to do it for JSON? I'd like to do:
json <- read.table(pipe("cut -text yelp_academic_dataset_review.json"))
but this of course throws an error because of the wrong format. Is there any way to do this without parsing the whole file with regex?
EDIT
Structure of one row: (I can't even count them all)
{"review_id":"vT2PALXWX794iUOoSnNXSA","user_id":"uKGWRd4fONB1cXXpU73urg","business_id":"D7FK-xpG4LFIxpMauvUStQ","stars":1,"date":"2016-10-31","text":"some long text here","useful":0,"funny":0,"cool":0,"type":"review"}
SECOND EDIT
Finally, I've created a loop that converts all the data into a csv file, omitting the unwanted column. It's slow, but I got 150 MB (zipped) from 3.2 GB.
# files to process
filepath <- jfile1
fcsv <- "c:/Users/cp/Documents/R DATA/Yelp/tests/reviews.csv"
write.table(x = paste(colns, collapse=","), file = fcsv, quote = F, row.names = F, col.names = F)
con = file(filepath, "r")
while ( TRUE ) {
line = readLines(con, n = 1)
if ( length(line) == 0 ) {
break
}
# regex process
d <- NULL
for (i in rcols) {
pst <- paste(".*\"",colns[i],"\":*(.*?) *,\"",colns[i+1],"\":.*", sep="")
w <- sub(pst, "\\1", line)
d <- cbind(d, noquote(w))
}
# save on the fly
write.table(x = paste(d, collapse = ","), file = fcsv, append = T, quote = F, row.names = F, col.names = F)
}
close(con)
It can be saved to JSON as well. I wonder whether this is the most efficient way, but the other scripts I tested were slow and often had encoding issues.
Try this:
library(jsonlite)
# the file has one JSON object per line, so stream it in rather than calling fromJSON()
df <- stream_in(file('yelp_academic_dataset_review.json'))
Then once it is a dataframe delete the column(s) you don't need.
If you don't want to manipulate the file in advance of importing it, I'm not sure what options you have in R. Alternatively, you could make a copy of the file, then delete the text column with this script, then import the copy to R, then delete the copy.
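For comparison, the line-by-line conversion described in the question can be done with a JSON parser instead of regex, which sidesteps the quoting and encoding problems. The sketch below is one possible way to do it in Python, assuming the file holds one JSON object per line (as in the sample row shown) and is UTF-8 encoded. The function name ndjson_to_csv and the file paths are hypothetical.

```python
import csv
import json

def ndjson_to_csv(src_path, dst_path, drop=("text",)):
    """Stream a one-object-per-line JSON file to CSV, leaving out the keys in `drop`."""
    count = 0
    writer = None
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            for key in drop:
                record.pop(key, None)
            if writer is None:
                # The column order is taken from the first record.
                writer = csv.DictWriter(dst, fieldnames=list(record), extrasaction="ignore")
                writer.writeheader()
            writer.writerow(record)
            count += 1
    return count
```

Only one line is held in memory at a time, so the 3.2 GB input never has to be loaded whole, and the resulting CSV can then be read into R in the usual way.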