I have a data frame with 10 million rows and 5 columns that I want to insert into an existing SQL table. Note that I do not have permission to create a table; I can only insert values into an existing table. I'm currently using RODBCext:
query_ch <- "insert into [blah].[dbo].[blahblah]
(col1, col2, col3, col4, col5)
values (?,?,?,?,?)"
sqlExecute(channel, query_ch, my_data)
This takes way too long (more than 10 hours). Is there a way to accomplish this faster?
TL;DR: LOAD DATA INFILE is one order of magnitude faster than multiple INSERT statements, which are themselves one order of magnitude faster than single INSERT statements.
Below I benchmark the three main strategies for importing data from R into MySQL:
single insert statements, as in the question:
INSERT INTO test (col1,col2,col3) VALUES (1,2,3)
multiple insert statements, formatted like so:
INSERT INTO test (col1,col2,col3) VALUES (1,2,3),(4,5,6),(7,8,9)
load data infile statement, i.e. loading a previously written CSV file into MySQL:
LOAD DATA INFILE 'the_dump.csv' INTO TABLE test
I use RMySQL here, but any other MySQL driver should lead to similar results. The SQL table was instantiated with:
CREATE TABLE `test` (
`col1` double, `col2` double, `col3` double, `col4` double, `col5` double
) ENGINE=MyISAM;
The connection and test data were created in R with:
library(RMySQL)
con = dbConnect(MySQL(),
                user = 'the_user',
                password = 'the_password',
                host = '127.0.0.1',
                dbname = 'test')
n_rows = 1000000 # number of tuples
n_cols = 5 # number of fields
dump = matrix(runif(n_rows*n_cols), ncol=n_cols, nrow=n_rows)
colnames(dump) = paste0('col',1:n_cols)
Benchmarking single insert statements:
before = Sys.time()
for (i in 1:nrow(dump)) {
  query = paste0('INSERT INTO test (', paste0(colnames(dump), collapse = ','), ') VALUES (', paste0(dump[i,], collapse = ','), ');')
  dbExecute(con, query)
}
time_naive = Sys.time() - before
=> this takes about 4 minutes on my computer
Benchmarking multiple insert statements:
before = Sys.time()
chunksize = 10000 # arbitrary chunk size
for (i in 1:ceiling(nrow(dump)/chunksize)) {
  query = paste0('INSERT INTO test (', paste0(colnames(dump), collapse = ','), ') VALUES ')
  vals = NULL
  for (j in 1:chunksize) {
    k = (i-1)*chunksize + j
    if (k <= nrow(dump)) {
      vals[j] = paste0('(', paste0(dump[k,], collapse = ','), ')')
    }
  }
  query = paste0(query, paste0(vals, collapse = ','))
  dbExecute(con, query)
}
time_chunked = Sys.time() - before
=> this takes about 40 seconds on my computer
Benchmarking load data infile statement:
before = Sys.time()
write.table(dump, 'the_dump.csv',
            row.names = F, col.names = F, sep = '\t')
query = "LOAD DATA INFILE 'the_dump.csv' INTO TABLE test"
dbSendStatement(con, query)
time_infile = Sys.time() - before
=> this takes about 4 seconds on my computer
Crafting your SQL query to insert many values at once is the simplest way to improve performance. Transitioning to LOAD DATA INFILE will lead to optimal results. Good performance tips can be found in this page of the MySQL documentation.
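The same chunking idea carries over to other languages. Below is a minimal Python sketch of the "multiple insert statements" strategy, using sqlite3 purely as a self-contained stand-in for a MySQL driver (table and column names are illustrative; with a MySQL driver the connection setup differs but the chunked multi-row INSERT is built the same way):

```python
import sqlite3

def chunked_insert(conn, table, cols, rows, chunksize=10000):
    """Insert rows using one multi-row INSERT statement per chunk."""
    placeholder_row = "(" + ",".join(["?"] * len(cols)) + ")"
    cur = conn.cursor()
    for start in range(0, len(rows), chunksize):
        chunk = rows[start:start + chunksize]
        # One statement with len(chunk) value tuples, e.g. VALUES (?,?,?),(?,?,?),...
        sql = "INSERT INTO %s (%s) VALUES %s" % (
            table, ",".join(cols), ",".join([placeholder_row] * len(chunk)))
        # Flatten the chunk so parameters line up with the placeholders
        cur.execute(sql, [v for row in chunk for v in row])
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (col1 REAL, col2 REAL, col3 REAL)")
rows = [(float(i), float(i + 1), float(i + 2)) for i in range(25)]
chunked_insert(conn, "test", ["col1", "col2", "col3"], rows, chunksize=10)
```

The chunk size bounds the statement length, which matters on MySQL because a single statement cannot exceed max_allowed_packet.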
Related
I'm building a desktop app with PyQt5 that connects to, loads data from, inserts data into, and updates a MySQL database. What I came up with for updating the database and inserting data works, but I feel there should be a much faster way to do it in terms of computation speed. If anyone could help, that would be really helpful. What I have as of now for updating the database is this:
def log_change(self, item):
    # Connected to the itemChanged signal to log any cells which have been changed
    self.changed_items.append([item.row(), item.column()])

def update_db(self):
    # Creating an empty list to remove the duplicated cells from the initial list
    self.changed_items_load = []
    [self.changed_items_load.append(x) for x in self.changed_items if x not in self.changed_items_load]
    # Loop through the changed_items list and remove cells with no values in them
    for db_wa in self.changed_items_load:
        if self.tableWidget.item(db_wa[0], db_wa[1]).text() == "":
            self.changed_items_load.remove(db_wa)
    try:
        mycursor = mydb.cursor()
        # Loop through the list and update the database cell by cell
        for ecr in self.changed_items_load:
            # Table widget column name matches db table column name;
            # self.col_names is a list of the tableWidget columns
            command = "update table1 set `{col_name}` = %s where id=%s;"
            data = (str(self.tableWidget.item(ecr[0], ecr[1]).text()),
                    int(self.tableWidget.item(ecr[0], 0).text()))
            mycursor.execute(command.format(col_name=self.col_names[ecr[1]]), data)
        mydb.commit()
        mycursor.close()
    except OperationalError:
        Msgbox = QMessageBox()
        Msgbox.setText("Error! Connection to database lost!")
        Msgbox.exec()
    except NameError:
        Msgbox = QMessageBox()
        Msgbox.setText("Error! Connect to database!")
        Msgbox.exec()
For inserting data and new rows into the db, I was able to find some info online. But I have been unable to insert multiple rows at once, or to insert a varying number of columns for each row; for example, inserting only 2 columns at row 1, and then 3 columns at row 2.
def insert_db(self):
    # Creating a list of each column
    self.a = [self.tableWidget.item(row, 1).text() for row in range(self.tableWidget.rowCount()) if self.tableWidget.item(row, 1) is not None]
    self.b = [self.tableWidget.item(row, 2).text() for row in range(self.tableWidget.rowCount()) if self.tableWidget.item(row, 2) is not None]
    self.c = [self.tableWidget.item(row, 3).text() for row in range(self.tableWidget.rowCount()) if self.tableWidget.item(row, 3) is not None]
    self.d = [self.tableWidget.item(row, 4).text() for row in range(self.tableWidget.rowCount()) if self.tableWidget.item(row, 4) is not None]
    try:
        mycursor = mydb.cursor()
        mycursor.execute("INSERT INTO table1(Name, Date, Quantity, Comments) VALUES ('%s', '%s', '%s', '%s')"
                         % (''.join(self.a), ''.join(self.b), ''.join(self.c), ''.join(self.d)))
        mydb.commit()
        mycursor.close()
    except OperationalError:
        Msgbox = QMessageBox()
        Msgbox.setText("Error! Connection to database lost!")
        Msgbox.exec()
    except NameError:
        Msgbox = QMessageBox()
        Msgbox.setText("Error! Connect to database!")
        Msgbox.exec()
Help would be appreciated. Thanks.
Like if I want to insert only 2 columns at row 1, and then 3 columns at row 2
No. A given database table has a specific number of columns; that is an integral part of the definition of a "table".
INSERT adds new rows to a table. It is possible to construct a single SQL statement that inserts multiple rows "all at once".
UPDATE modifies one or more rows of a table. The rows are indicated by some condition specified in the Update statement.
Constructing SQL with %s string formatting is risky: it gets in trouble if there are quotes in the string being inserted.
(I hope these comments help you get to the next stage of understanding databases.)
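To make the last point concrete: parameterized placeholders let the driver do the quoting, and a whole batch of rows can be sent with one executemany() call. A minimal sketch using sqlite3 as a stand-in (table and column names are illustrative; MySQL drivers use %s instead of sqlite3's ? placeholder):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, Name TEXT, Quantity INTEGER)")

# Placeholders make the driver escape each value, so quotes inside
# the data cannot break (or hijack) the statement.
rows = [("O'Brien", 3), ('He said "hi"', 5)]
conn.executemany("INSERT INTO table1 (Name, Quantity) VALUES (?, ?)", rows)
conn.commit()
```

Compare this with interpolating the values via %s, which would raise a syntax error on the embedded apostrophe in O'Brien.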
I have a data frame in pyspark like below
df = spark.createDataFrame(
    [('2021-10-01', 'A', 25),
     ('2021-10-02', 'B', 24),
     ('2021-10-03', 'C', 20),
     ('2021-10-04', 'D', 21),
     ('2021-10-05', 'E', 20),
     ('2021-10-06', 'F', 22),
     ('2021-10-07', 'G', 23),
     ('2021-10-08', 'H', 24)],
    ("RUN_DATE", "NAME", "VALUE"))
Now, using this data frame, I want to update a table in MySQL.
# query to run should be similar to this
update_query = "UPDATE DB.TABLE SET DATE = '2021-10-01', VALUE = 25 WHERE NAME = 'A'"
# mysql_conn is a function which I use to connect to `MySql` from `pyspark` and run queries
# Invoking the function
mysql_conn(host, user_name, password, update_query)
Now when I invoke the mysql_conn function with these parameters, the query runs successfully and the record gets updated in the MySQL table.
Now I want to run the update statement for all the records in the data frame.
For each NAME it has to pick the RUN_DATE and VALUE, substitute them into update_query, and trigger mysql_conn.
I think we need a for loop, but I'm not sure how to proceed.
Instead of iterating through the dataframe with a for loop, it would be better to distribute the workload across the partitions using foreachPartition. Moreover, since you are writing a custom query, rather than executing one query per record it is more efficient to execute a single batch operation per partition, reducing round trips, latency, and the number of concurrent connections. E.g.:
def update_db(rows):
    temp_table_query = ""
    for row in rows:
        if len(temp_table_query) > 0:
            temp_table_query = temp_table_query + " UNION ALL "
        temp_table_query = temp_table_query + " SELECT '%s' as RUNDATE, '%s' as NAME, %d as VALUE " % (row.RUN_DATE, row.NAME, row.VALUE)
    if not temp_table_query:
        return  # empty partition: nothing to update
    update_query = """
        UPDATE DBTABLE
        INNER JOIN (
            %s
        ) new_records ON DBTABLE.NAME = new_records.NAME
        SET
            DBTABLE.DATE = new_records.RUNDATE,
            DBTABLE.VALUE = new_records.VALUE
    """ % (temp_table_query)
    mysql_conn(host, user_name, password, update_query)

df.foreachPartition(update_db)
View Demo on how the UPDATE query works
Let me know if this works for you.
Situation
Working with Python 3.7.2
I have read privilege on a MariaDB table with 5M rows on a server.
I have a local text file with 7K integers, one per line.
The integers represent IDXs of the table.
The IDX column of the table is the primary key. (so I suppose it is automatically indexed?)
Problem
I need to select all the rows whose IDX is in the text file.
My effort
Version 1
Make 7K queries, one for each line in the text file. This achieves approximately 130 queries per second, taking about 1 minute to complete.
import pymysql

connection = pymysql.connect(....)
with connection.cursor() as cursor:
    query = (
        "SELECT *"
        " FROM TABLE1"
        " WHERE IDX = %(idx)s;"
    )
    all_selected = {}
    with open("idx_list.txt", "r") as f:
        for idx in f:
            idx = idx.strip()
            if idx:
                idx = int(idx)
                parameters = {"idx": idx}
                cursor.execute(query, parameters)
                result = cursor.fetchall()[0]
                all_selected[idx] = result
Version 2
Select the whole table, iterate over the cursor, and cherry-pick rows. The for-loop over .fetchall_unbuffered() covers 30-40K rows per second, and the whole script takes about 3 minutes to complete.
import pymysql

connection = pymysql.connect(....)
with connection.cursor() as cursor:
    query = "SELECT * FROM TABLE1"
    set_of_idx = set()
    with open("idx_list.txt", "r") as f:
        for line in f:
            if line.strip():
                line = int(line.strip())
                set_of_idx.add(line)
    all_selected = {}
    cursor.execute(query)
    for row in cursor.fetchall_unbuffered():
        if row[0] in set_of_idx:
            all_selected[row[0]] = row[1:]
Expected behavior
I need to select faster, because the number of IDXs in the text file will grow as big as 10K-100K in the future.
I consulted other answers, including this one, but I can't make use of them since I only have read privilege, which makes it impossible to create another table to join with.
So how can I make the selection faster?
A temporary table implementation would look like:
connection = pymysql.connect(...., local_infile=True)
with connection.cursor() as cursor:
    cursor.execute("CREATE TEMPORARY TABLE R (IDX INT PRIMARY KEY)")
    cursor.execute("LOAD DATA LOCAL INFILE 'idx_list.txt' INTO TABLE R")
    cursor.execute("SELECT TABLE1.* FROM TABLE1 JOIN R USING ( IDX )")
    ..
    cursor.execute("DROP TEMPORARY TABLE R")
Thanks to the hint (or more than a hint) from @danblack, I was able to achieve the desired result with the following query.
query = (
    "SELECT *"
    " FROM TABLE1"
    " INNER JOIN R"
    " ON R.IDX = TABLE1.IDX;"
)
cursor.execute(query)
danblack's SELECT statement didn't work for me, raising an error:
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'IDX' at line 1")
This is probably because of MariaDB's join syntax, so I consulted the MariaDB documentation on joining tables.
And now it selects 7K rows in 0.9 seconds.
Leaving here as an answer just for the sake of completeness, and for future readers.
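For readers who cannot use even temporary tables, a middle ground between the two versions above is a chunked WHERE IDX IN (...) query, which needs only read privilege. A sketch with sqlite3 standing in for pymysql (with pymysql the placeholder would be %s rather than ?; table and column names are illustrative):

```python
import sqlite3

def select_by_idx(conn, idxs, chunksize=1000):
    """Fetch rows whose IDX is in `idxs`, batched so no statement grows unboundedly."""
    out = {}
    cur = conn.cursor()
    for start in range(0, len(idxs), chunksize):
        chunk = idxs[start:start + chunksize]
        # One placeholder per ID in this chunk
        sql = "SELECT * FROM TABLE1 WHERE IDX IN (%s)" % ",".join(["?"] * len(chunk))
        cur.execute(sql, chunk)
        for row in cur.fetchall():
            out[row[0]] = row[1:]
    return out

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE1 (IDX INTEGER PRIMARY KEY, VAL TEXT)")
conn.executemany("INSERT INTO TABLE1 VALUES (?, ?)",
                 [(i, "row%d" % i) for i in range(100)])
selected = select_by_idx(conn, [3, 42, 99], chunksize=2)
```

Because IDX is the primary key, each IN lookup is an index probe, so this stays far faster than scanning the whole table.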
I am attempting to do a bulk insert into MySQL using
INSERT INTO TABLE (a, b, c) VALUES (?, ?, ?), (?, ?, ?)
I have the general log on, and see that this works splendidly for most cases. However, when the table has a BLOB column, it doesn't work as well.
I am trying to insert 20 records.
Without the BLOB, I see all 20 records in the same query in the general log, 20 records inserted in the same query.
WITH the BLOB, I see only 2 records per query in the general log, it takes 10 queries in total.
Is this a problem with MySQL or the JDBC driver, or am I missing something else? I would prefer to use a BLOB, as I have data in protobufs.
Here is an example table...
CREATE TABLE my_table (
id CHAR(36) NOT NULL,
name VARCHAR(256) NOT NULL,
data BLOB NOT NULL,
PRIMARY KEY (id)
);
Then, create your batch inserts in code...
val ps = conn.prepareStatement(
  "INSERT INTO my_table(id, name, data) VALUES (?, ?, ?)")

records.grouped(1000).foreach { group =>
  group.foreach { r =>
    ps.setString(1, UUID.randomUUID.toString)
    ps.setString(2, r.name)
    ps.setBlob(3, new MariaDbBlob(r.data))
    ps.addBatch()
  }
  ps.executeBatch()
}
If you run this and inspect the general log, you will see...
"2018-10-12T18:37:55.714825Z 4 Query INSERT INTO my_table(id, name, fqdn, data) VALUES ('b4955537-2450-48c4-9953-e27f3a0fc583', '17-apply-test', _binary '
17-apply-test\"AAAA(?2Pending8?????,J$b4955537-2450-48c4-9953-e27f3a0fc583
1:2:3:4:5:6:7:8Rsystem'), ('480e470c-6d85-4bbc-b718-21d9e80ac7f7', '18-apply-test', _binary '
18-apply-test\"AAAA(?2Pending8?????,J$480e470c-6d85-4bbc-b718-21d9e80ac7f7
1:2:3:4:5:6:7:8Rsystem')
2018-10-12T18:37:55.715489Z 4 Query INSERT INTO my_table(id, name, data) VALUES ('7571a651-0e0b-4e78-bff0-1394070735ce', '19-apply-test', _binary '
19-apply-test\"AAAA(?2Pending8?????,J$7571a651-0e0b-4e78-bff0-1394070735ce
1:2:3:4:5:6:7:8Rsystem'), ('f77ebe28-73d2-4f6b-8fd5-284f0ec2c3f0', '20-apply-test', _binary '
20-apply-test\"AAAA(?2Pending8?????,J$f77ebe28-73d2-4f6b-8fd5-284f0ec2c3f0
As you can see, each INSERT INTO only has 2 records in it.
Now, if you remove the data field from the schema and insert and re-run, you will see the following output (for 10 records)...
"2018-10-12T19:04:24.406567Z 4 Query INSERT INTO my_table(id, name) VALUES ('d323d21e-25ac-40d4-8cff-7ad12f83b8c0', '1-apply-test'), ('f20e37f2-35a4-41e9-8458-de405a44f4d9', '2-apply-test'), ('498f4e96-4bf1-4d69-a6cb-f0e61575ebb4', '3-apply-test'), ('8bf7925d-8f01-494f-8f9f-c5b8c742beae', '4-apply-test'), ('5ea663e7-d9bc-4c9f-a9a2-edbedf3e5415', '5-apply-test'), ('48f535c8-44e6-4f10-9af9-1562081538e5', '6-apply-test'), ('fbf2661f-3a23-4317-ab1f-96978b39fffe', '7-apply-test'), ('3d781e25-3f30-48fd-b22b-91f0db8ba401', '8-apply-test'), ('55ffa950-c941-44dc-a233-ebecfd4413cf', '9-apply-test'), ('6edc6e25-6e70-42b9-8473-6ab68d065d44', '10-apply-test')"
All 10 records are in the same query
I tinkered until I found the fix...
val ps = conn.prepareStatement(
  "INSERT INTO my_table(id, name, data) VALUES (?, ?, ?)")

records.grouped(1000).foreach { group =>
  group.foreach { r =>
    ps.setString(1, UUID.randomUUID.toString)
    ps.setString(2, r.name)
    // ps.setBlob(3, new MariaDbBlob(r.data))
    ps.setBytes(3, r.data)
    ps.addBatch()
  }
  ps.executeBatch()
}
Using PreparedStatement.setBytes instead of using MariaDbBlob seemed to do the trick
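The takeaway generalizes beyond JDBC: bind BLOB columns as raw bytes and let the driver batch them like any other parameter. As a hedged Python analogue of the same schema (sqlite3 stand-in; the MariaDbBlob issue itself is specific to the MariaDB JDBC driver), bytes values batch normally through executemany():

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id TEXT PRIMARY KEY, name TEXT, data BLOB)")

# Raw bytes bind like any other parameter, so BLOB rows batch normally
records = [(str(uuid.uuid4()), "%d-apply-test" % i, bytes([i % 256]))
           for i in range(20)]
conn.executemany("INSERT INTO my_table (id, name, data) VALUES (?, ?, ?)", records)
conn.commit()
```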
I have to update millions of rows in MySQL. I am currently using a for loop to execute queries. To make the updates faster, I want to use executemany() from Python's MySQL Connector, so that I can update in batches using a single query for each batch.
I don't think mysqldb has a way of handling multiple UPDATE queries at one time.
But you can use an INSERT query with an ON DUPLICATE KEY UPDATE clause at the end.
I have written the following example for ease of use and readability.
import MySQLdb

def update_many(data_list=None, mysql_table=None):
    """
    Updates a mysql table with the data provided. If the key is not unique, the
    data will be inserted into the table.

    The dictionaries must all have the same keys due to how the query is built.

    Param:
        data_list (List):
            A list of dictionaries where the keys are the mysql table
            column names, and the values are the update values
        mysql_table (String):
            The mysql table to be updated.
    """
    # Connection and cursor
    conn = MySQLdb.connect('localhost', 'jeff', 'atwood', 'stackoverflow')
    cur = conn.cursor()

    query = ""
    values = []
    for data_dict in data_list:
        if not query:
            # Build the query once, from the keys of the first dictionary
            columns = ', '.join('`{0}`'.format(k) for k in data_dict)
            duplicates = ', '.join('{0}=VALUES({0})'.format(k) for k in data_dict)
            place_holders = ', '.join(['%s'] * len(data_dict))
            query = "INSERT INTO {0} ({1}) VALUES ({2})".format(mysql_table, columns, place_holders)
            query = "{0} ON DUPLICATE KEY UPDATE {1}".format(query, duplicates)
        values.append(list(data_dict.values()))

    try:
        cur.executemany(query, values)
    except MySQLdb.Error as e:
        try:
            print("MySQL Error [%d]: %s" % (e.args[0], e.args[1]))
        except IndexError:
            print("MySQL Error: %s" % str(e))
        conn.rollback()
        return False

    conn.commit()
    cur.close()
    conn.close()
Explanation of the one-liners
columns = ', '.join('`{}`'.format(k) for k in data_dict)
is the same as
column_list = []
for k in data_dict:
    column_list.append('`{}`'.format(k))
columns = ', '.join(column_list)
Here's an example of usage
test_data_list = []
test_data_list.append({'id': 1, 'name': 'Marco', 'articles': 1})
test_data_list.append({'id': 2, 'name': 'Keshaw', 'articles': 8})
test_data_list.append({'id': 3, 'name': 'Wes', 'articles': 0})

update_many(data_list=test_data_list, mysql_table='writers')
Query output (on Python 3.7+, dictionaries iterate in insertion order)
INSERT INTO writers (`id`, `name`, `articles`) VALUES (%s, %s, %s) ON DUPLICATE KEY UPDATE id=VALUES(id), name=VALUES(name), articles=VALUES(articles)
Values output
[[1, 'Marco', 1], [2, 'Keshaw', 8], [3, 'Wes', 0]]
Maybe this can help
How to update multiple rows with single MySQL query in python?
cur.executemany("UPDATE Writers SET Name = %s WHERE Id = %s",
                [("new_value", "3"), ("new_value", "6")])
conn.commit()
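As a self-contained illustration of that executemany() batch-update pattern, here is a runnable sketch using sqlite3 as a stand-in (MySQL drivers use %s placeholders where sqlite3 uses ?; the table and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Writers (Id INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany("INSERT INTO Writers (Id, Name) VALUES (?, ?)",
                 [(3, "old"), (6, "old")])

# One executemany() call sends the whole batch of updates
cur = conn.cursor()
cur.executemany("UPDATE Writers SET Name = ? WHERE Id = ?",
                [("new_value", 3), ("new_value", 6)])
conn.commit()
```

Each parameter tuple is applied to the same statement, so the batch replaces a Python-level for loop of individual execute() calls.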