I have a large query to execute through SQLAlchemy which returns approximately 2.5 million rows. It's connecting to a MySQL database. When I do:
transactions = Transaction.query.all()
It eventually times out after around ten minutes with this error: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
I've tried setting different parameters when doing create_engine like:
create_engine(connect_args={'connect_timeout': 30})
What do I need to change so the query will not time out?
I would also be fine if there is a way to paginate the results and go through them that way.
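(For reference, connect_args is normally passed alongside the database URL when the engine is built; a minimal sketch with placeholder credentials, assuming the pymysql driver. Note that connect_timeout only bounds the initial connection handshake, not how long a running query may take.)
from sqlalchemy import create_engine

# Placeholder URL/credentials; connect_timeout applies to establishing the
# connection, not to query execution time.
engine = create_engine(
    'mysql+pymysql://user:password@localhost/my_db',
    connect_args={'connect_timeout': 30},
)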
Solved by pagination:
page_size = 10000  # get x number of items at a time
step = 0
while True:
    start, stop = page_size * step, page_size * (step + 1)
    transactions = sql_session.query(Transaction).slice(start, stop).all()
    if not transactions:  # .all() returns an empty list, never None
        break
    for t in transactions:
        f.write(str(t))
        f.write('\n')
    if len(transactions) < page_size:
        break
    step += 1
f.close()
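An alternative sketch, if streaming suits you better than manual slicing: SQLAlchemy's Query.yield_per() fetches rows in batches as you iterate, so only a window of rows is held in memory at a time (this assumes the same sql_session and an already-open file handle f):
# Stream rows in batches of 10000 instead of paging with slice().
for t in sql_session.query(Transaction).yield_per(10000):
    f.write(str(t))
    f.write('\n')
f.close()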
Related
I make multiple requests to MySQL (via MySQLdb), and for one specific user I get this error:
MySQLdb._exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query')
This error occurs on the line
cursor.execute('Select * from process WHERE email=%s ORDER BY timestamp DESC LIMIT 20', ("tommy345a@outlook.com",))
But when I do the same query for a different user, there is no problem:
cursor.execute('Select * from process WHERE email=%s ORDER BY timestamp DESC LIMIT 20', ("cafahxxxx@gmail.com",))
The page loads correctly.
More details on the MySQL setup are given below.
#modules used
from flask_mysqldb import MySQL
import MySQLdb.cursors
#setup
app.config['MYSQL_HOST'] = 'localhost'
app.config['MYSQL_USER'] = 'myusername'
app.config['MYSQL_PASSWORD'] = 'mypassword'
app.config['MYSQL_DB'] = 'my_db'
mysql = MySQL(app)
#extract of requests made to db
@app.route('/myhome', methods=['GET', 'POST'])
def home_page():
    email = "tommy345a@outlook.com"
    cursor = mysql.connection.cursor(MySQLdb.cursors.DictCursor)
    cursor.execute('SELECT * FROM mail WHERE email = %s', (email,))
    completed = cursor.fetchone()
    cursor.execute('SELECT sum(transaction) FROM transactions WHERE email=%s', (email,))
    points = cursor.fetchone()
    cursor.execute('Select * from process WHERE email=%s ORDER BY timestamp DESC LIMIT 20', (email,))
    transactions = cursor.fetchall()
I would post the process table from MySQL here so you can see why the issue happens for the first (problem) user and not for the second. (Screenshot of the table structure omitted.)
Also, this might be a packet-size issue, but I had no problems until yesterday (there were more than 11 entries under the user tommy; I deleted 6 rows and it is still not working). If you do think it is due to packet size, please tell me how to solve it without increasing the packet-size limit, because I am on shared hosting and the provider will not let me increase it.
There are multiple potential reasons for "server has gone away" and only one of them is due to data being too large for your max allowed packet size.
https://dev.mysql.com/doc/refman/8.0/en/gone-away.html
If it is caused by the data being too large, keep in mind that each row of result is its own packet. It's not the whole result set that must fit in a packet. If you had 11 rows and you deleted 6 but still get the error, then the row that caused the problem still exists.
You could:
Remove the row that is too large. You might want to change the column data types of your table so that a given row cannot be too large. You showed a screenshot of some data, but I have no idea what data types you use. Hint: use SHOW CREATE TABLE process.
Change max_allowed_packet as a session variable, since you don't have access to change the global variable. Keep in mind that the client must raise its own max allowed packet to match. Read https://dev.mysql.com/doc/refman/8.0/en/packet-too-large.html
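If it helps to confirm whether one oversized row is the culprit, a rough diagnostic sketch like the following compares the largest rows for the failing user against the server's packet limit (this assumes pymysql and a hypothetical large text column named details; SHOW CREATE TABLE process tells you the real column names to plug in):
import pymysql

# Hypothetical credentials; replace with your own.
conn = pymysql.connect(host='localhost', user='myusername',
                       password='mypassword', database='my_db')

with conn.cursor() as cur:
    # The packet limit the server is currently using.
    cur.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
    print(cur.fetchone())

    # Approximate size of the biggest rows for the failing user, using a
    # hypothetical text column `details` -- adjust to your actual schema.
    cur.execute(
        "SELECT LENGTH(details) AS row_bytes FROM process "
        "WHERE email = %s ORDER BY row_bytes DESC LIMIT 5",
        ("tommy345a@outlook.com",),
    )
    for row in cur.fetchall():
        print(row)

conn.close()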
MySQL scenario:
When I execute "SELECT" queries in MySQL using multiple threads I get the following message: "Commands out of sync; you can't run this command now". I found that this is due to the requirement that the results of one query be consumed before another query is issued on the same connection.
C++ example:
void DataProcAsyncWorker::Execute()
{
    std::thread(&DataProcAsyncWorker::Run, this).join();
}
void DataProcAsyncWorker::Run() {
    sql::PreparedStatement *prep_stmt = c->con->prepareStatement(query);
    ...
}
Important:
I can't avoid using a separate thread per query (SELECT, INSERT, etc.), because the module I'm building is integrated with Node.js, which "locks" the thread until the result is obtained. For this reason I need to run the query in the background (on a new thread) and resolve the "promise" with the result obtained from MySQL.
Important:
I keep several connections open (for example, 10), and each SQL call picks one of those connections.
That is:
1. A connection pool that contains 10 established connections, e.g.:
for (int i = 0; i < 10; i++) {
    Com *c = new Com;
    c->id = i;
    c->con = openConnection();
    c->con->setSchema("gateway");
    conns.push_back(c);
}
2. The problem occurs when executing >= 100 SELECT queries per second. Even with the load balanced across the pool, 100 queries per second is high enough that a connection (e.g. conns.at(50)) may be handed out again while its previous result has not yet been consumed.
My question:
A. Does PostgreSQL have this limitation as well?
B. Which SQL server is recommended for a large number of queries per second without having to open new connections, i.e. one where, on a single connection such as conns.at(0), I can execute SELECT commands from 2 simultaneous threads?
Additional:
1. I can create a larger number of connections in the pool, but when I simulate more queries per second than the number of pre-created connections I get the error "Commands out of sync"; the only solution I found was a mutex, which is bad for performance.
I found that PostgreSQL handles this queuing very efficiently: unlike MySQL, where I need to call _free_result, in PostgreSQL I can run multiple queries on the same connection without receiving the error "Commands out of sync".
Note: I did the test using libpqxx (a library for connecting to and querying a PostgreSQL server from C++) and it really worked like a charm, without giving me a headache.
Note: I don't know whether it allows multi-threaded execution or whether execution is serialized on the server side per connection; the only thing I know is that this error does not occur in PostgreSQL.
I have a reasonably large dataset of about 6,000,000 rows x 60 columns that I am trying to insert into a database. I am chunking them, and inserting them 10,000 at a time into a MySQL database using a class I've written and pymysql. The problem is that I occasionally time out the server when writing, so I've modified my executemany call to reconnect on errors. This works fine when I lose the connection once, but if I lose it a second time, I get a pymysql InternalError stating that the lock wait timeout was exceeded. I was wondering how I could modify the following code to catch that and discard the transaction completely before attempting again.
I've tried calling rollback() on the connection, but this causes another InternalError if the connection has been destroyed, because there is no cursor anymore.
Any help would be greatly appreciated (I also don't understand why I am getting the timeouts to begin with, but the data is relatively large.)
import time
import pymysql

class Database:
    def __init__(self, **creds):
        self.conn = None
        self.user = creds['user']
        self.password = creds['password']
        self.host = creds['host']
        self.port = creds['port']
        self.database = creds['database']

    def connect(self, type=None):
        self.conn = pymysql.connect(
            host=self.host,
            user=self.user,
            password=self.password,
            port=self.port,
            database=self.database
        )

    def executemany(self, sql, data):
        while True:
            try:
                with self.conn.cursor() as cursor:
                    cursor.executemany(sql, data)
                    self.conn.commit()
                    break
            except pymysql.err.OperationalError:
                print('Connection error. Reconnecting to database.')
                time.sleep(2)
                self.connect()
                continue
        return cursor
and I am calling it like this:
for index, chunk in enumerate(dataframe_chunker(df), start=1):
    print(f"Writing chunk\t{index}\t{timer():.2f}")
    db.executemany(insert_query, chunk.values.tolist())
Take a look at what MySQL is doing. The lock wait timeouts happen because the inserts cannot proceed until something else finishes, and that something could be your own code.
SELECT * FROM `information_schema`.`innodb_locks`;
Will show the current locks.
SELECT * FROM `information_schema`.`innodb_trx` WHERE trx_id = [lock_trx_id];
Will show the involved transactions
SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST where id = [trx_mysql_thread_id];
Will show the involved connection and may show the query whose lock results in the lock wait timeout. Maybe there is an uncommitted transaction.
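To chain those three lookups together, a rough sketch (assuming MySQL 5.7, where information_schema.innodb_locks still exists, and pymysql with hypothetical credentials):
import pymysql

conn = pymysql.connect(host='localhost', user='user', password='password',
                       database='my_db', cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    # 1. What is currently locked?
    cur.execute("SELECT * FROM `information_schema`.`innodb_locks`")
    for lock in cur.fetchall():
        # 2. Which transaction holds or waits on this lock?
        cur.execute("SELECT * FROM `information_schema`.`innodb_trx` "
                    "WHERE trx_id = %s", (lock["lock_trx_id"],))
        trx = cur.fetchone()
        if trx is None:
            continue
        # 3. Which connection is behind it, and what is it running?
        cur.execute("SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST "
                    "WHERE id = %s", (trx["trx_mysql_thread_id"],))
        print(lock["lock_trx_id"], trx["trx_query"], cur.fetchone())

conn.close()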
It is likely your own code, because of the interaction with your executemany function, which catches exceptions and reconnects to the database. What about the prior connection? Does the lock wait timeout kill the prior connection? That while True is going to be trouble.
For the code calling executemany on the db connection, be more defensive on the try/except with something like:
def executemany(self, sql, data):
    while True:
        try:
            with self.conn.cursor() as cursor:
                cursor.executemany(sql, data)
                self.conn.commit()
                break
        except pymysql.err.OperationalError:
            print('Connection error. Reconnecting to database.')
            # Close the dead connection first so it cannot keep holding locks.
            if self.conn.open:
                self.conn.close()
            time.sleep(2)
            self.connect()
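To also catch the lock wait timeout the question asks about, a hedged extension of the same loop (depending on the pymysql version, MySQL error 1205, the lock wait timeout, surfaces as InternalError or OperationalError, so the sketch checks the numeric error code; rollback() on the still-open connection discards the stuck transaction before retrying):
def executemany(self, sql, data):
    while True:
        try:
            with self.conn.cursor() as cursor:
                cursor.executemany(sql, data)
                self.conn.commit()
                break
        except (pymysql.err.OperationalError, pymysql.err.InternalError) as exc:
            if exc.args[0] == 1205 and self.conn.open:
                # Lock wait timeout: the connection is still alive, so roll
                # back the half-done transaction and retry it.
                print('Lock wait timeout. Rolling back and retrying.')
                self.conn.rollback()
            else:
                # Lost connection: drop it and open a fresh one.
                print('Connection error. Reconnecting to database.')
                if self.conn.open:
                    self.conn.close()
                self.connect()
            time.sleep(2)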
But the real solution here is to not induce lock wait timeouts in the first place, particularly if there are no other database clients.
I insert data with RabbitMQ + Celery + Flask + the SQLAlchemy ORM.
There are 16 Celery workers in total, across two servers.
There are about one million rows of data.
If the data is queued to MQ one row at a time and inserted one row at a time from Celery,
everything is OK.
All the data is inserted.
But when I try to bulk insert using a for loop in Celery with some batch size (about 5000, 1000 and so on), some data goes missing.
I added a lot of logging, but I can't find any error or anything unusual.
(If the number of rows is 100,000 instead of 1 million, it is OK too.)
The simplified logic is:
tasks.py
@celery.task(bind=True, acks_late=False, ignore_result=True, max_retries=None)
def insert_data(self):
    logger.info("START")
    InsertData(somedata..)
    logger.info("END")

@celery.task(bind=True, acks_late=False, ignore_result=True, max_retries=None)
def insert_data_bulk(self):
    logger.info("START")
    for i in range(5000):
        InsertData(somedata..)
    logger.info("END")

def InsertData(data):
    logger.info("Insert START")
    # my_db_engine's options: {'echo': True, 'pool_recycle': 3600, 'pool_size': 10,
    #                          'pool_reset_on_return': 'commit', 'isolation_level': 'AUTOCOMMIT'}
    ss = scoped_session(sessionmaker(autocommit=False, autoflush=False,
                                     bind=my_db_engine))
    t = mymodel(**data)
    ss.add(t)
    ss.commit()
    logger.info("Insert END")
test.py
for i in range(1000000):
    insert_data_one.apply_async()   # make one million messages for MQ

for i in range(200):
    insert_data_bulk.apply_async()  # make 200 messages for MQ
insert_data_one works fine. Its log is:
START
Insert START
Insert END
END
but insert_data_bulk randomly loses some data!
Its log is:
START
Insert START
END    (line 3, "Insert END", is sometimes missing)
or:
START
END    (lines 2 and 3 are missing; I never find them)
The total row count in the database is different every time.
I don't set a Celery timeout.
The database's timeout matches my SQLAlchemy options.
Do you have any ideas about it?
Please give me any hints on what to check ;-(
I use MySQL (MyISAM). The table has over 8M rows, with a primary index on 'id'.
My application shows:
first run: 55 req/sec,
second run: ~120 req/sec,
third run: ~1200 req/sec,
fourth run: ~4500 req/sec,
fifth run: ~9999 req/sec
After restarting mysql-server it starts over at the same low rate.
How can I place the whole index in memory at once, right after the database server starts?
In my.cnf
key_buffer_size=2000M
Code sample:
now = datetime.datetime.now()
cursor = connection.cursor()
for x in xrange(1, 10000):
    id = random.randint(10, 100000)  # random first 10000 records for cache
    cursor.execute("""SELECT num, manufacturer_id
                      FROM product WHERE id=%s LIMIT 1""", [id])
    cursor.fetchone()
td = datetime.datetime.now() - now
sec = td.seconds + td.days * 24 * 3600
print "%.2f operation/sec" % (float(x) / float(sec))
I think two caches are at work here. One is the key (index) cache, and that can be preloaded with LOAD INDEX INTO CACHE.
The other is the query cache, and I think that in your case this is where most of the performance is gained. AFAIK that can't be preloaded with any MySQL command.
What you could do is replay the last N queries that ran before the restart. Those queries would then populate the caches. Or keep a file of some realistic queries to run at startup.
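As a rough sketch of that warm-up idea (assuming pymysql, hypothetical credentials, and a hypothetical warmup.sql file containing one realistic SELECT per line; LOAD INDEX INTO CACHE is the documented statement for preloading a MyISAM index into the key cache sized by key_buffer_size):
import pymysql

# Hypothetical credentials; adjust to your setup.
conn = pymysql.connect(host='localhost', user='user', password='password',
                       database='mydb')

with conn.cursor() as cur:
    # Preload the MyISAM index blocks of `product` into the key cache.
    cur.execute("LOAD INDEX INTO CACHE product")

    # Replay saved queries to warm the query cache (one SELECT per line
    # in the hypothetical warmup.sql).
    with open('warmup.sql') as f:
        for line in f:
            line = line.strip()
            if line:
                cur.execute(line)
                cur.fetchall()

conn.close()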