Actually, I am trying to update one table with multiple processes via pymysql, and each process reads a CSV file split from a huge one in order to promote the speed. But I get the Lock wait timeout exceeded; try restarting transaction exception when I run the script. After searching the posts on this site, I found one post which mentioned that to set or build the built-in LOAD_DATA_INFILE, but no details on it. How can I do it with 'pymysql' to reach my aim?
---------------------------first edit----------------------------------------
Here's the job method:
`def importprogram(path, name):
begin = time.time()
print('begin to import program' + name + ' info.')
# "c:\\sometest.csv"
file = open(path, mode='rb')
csvfile = csv.reader(codecs.iterdecode(file, 'utf-8'))
connection = None
try:
connection = pymysql.connect(host='a host', user='someuser', password='somepsd', db='mydb',
cursorclass=pymysql.cursors.DictCursor)
count = 1
with connection.cursor() as cursor:
sql = '''update sometable set Acolumn='{guid}' where someid='{pid}';'''
next(csvfile, None)
for line in csvfile:
try:
count = count + 1
if ''.join(line).strip():
command = sql.format(guid=line[2], pid=line[1])
cursor.execute(command)
if count % 1000 == 0:
print('program' + name + ' cursor execute', count)
except csv.Error:
print('program csv.Error:', count)
continue
except IndexError:
print('program IndexError:', count)
continue
except StopIteration:
break
except Exception as e:
print('program' + name, str(e))
finally:
connection.commit()
connection.close()
file.close()
print('program' + name + ' info done.time cost:', time.time()-begin)`
And the multi-processing method:
import multiprocessing as mp
def multiproccess():
pool = mp.Pool(3)
results = []
paths = ['C:\\testfile01.csv', 'C:\\testfile02.csv', 'C:\\testfile03.csv']
name = 1
for path in paths:
results.append(pool.apply_async(importprogram, args=(path, str(name))))
name = name + 1
print(result.get() for result in results)
pool.close()
pool.join()
And the main method:
if __name__ == '__main__':
multiproccess()
I am new to Python. How can I make the code or the way itself goes wrong? Should I use only one single process to finish the data reading and importing?
Your issue is that you are exceeding the time allowed for a response to be fetched from the server, so the client is automatically timing out.
In my experience, adjust the wait timeout to something like 6000 seconds, combine into one CSV and just leave the data to import. Also, I would recommend running the query direct from MySQL rather than Python.
The way I usually import CSV data from Python to MySQL is through the INSERT ... VALUES ... method, and I only do so when some kind of manipulation of the data is required (i.e. inserting different rows into different tables).
I like your approach and understand your thinking but in reality there is no need. The benefit to the INSERT ... VALUES ... method is that you won't run into any timeout issue.
Related
I've tried using SqlAlchemy, as well as raw mysql.connector here, but commiting an insert into a SQL database from FastAPI takes forever.
I wanted to make sure it wasn't just my DB, so I tried it on a local script and it ran in a couple seconds.
How can I work with FastAPI to make this query possible?
Thanks!
'''
#router.post('/')
def postStockData(data:List[pydanticModels.StockPrices], raw_db = Depends(get_raw_db)):
cursor = raw_db[0]
cnxn = raw_db[1]
# i = 0
# for row in data:
# if i % 10 == 0:
# print(i)
# db.flush()
# i += 1
# db_pricing = models.StockPricing(**row.dict())
# db.add(db_pricing)
# db.commit()
SQL = "INSERT INTO " + models.StockPricing.__tablename__ + " VALUES (%s, %s, %s)"
print(SQL)
valsToInsert = []
for row in data:
rowD = row.dict()
valsToInsert.append((rowD['date'], rowD['symbol'], rowD['value']))
cursor.executemany(SQL, valsToInsert)
cnxn.commit()
return {'message':'Pricing Updated'}
'''
You are killing performances because you try a "RBAR" approach which is not suitable in RDBMS...
You use a loop and execute an SQL INSERT of only one row...
When the RDBMS is facing a query, the sequence of execution is the following :
does the user that throw the query be authenticate ?
parsing the string to verify the syntax
looking for metadata (tables, columns, datatypes...)
analyzing which operations on tables and columns this user is granted
creating an execution plan to sequences all the operations needed for the query
setting up lock for concurrency
executing the query (inserting only 1 row)
throw back an error or a OK message
Every steps consumes time... and your are all theses steps 100 000 times because of your loop.
Usually when inserting in a table many rows, there just one query to do even if the INSERT concerns 10000000000 rows from a file !
I am trying to run the following code to populate a table in parallel for a certain application. First the following function is defined which is supposed to connect to my db and execute the sql command with the values given (to insert into table).
def dbWriter(sql, rows) :
# load cnf file
MYSQL_CNF = os.path.abspath('.') + '/mysql.cnf'
conn = MySQLdb.connect(db='dedupe',
charset='utf8',
read_default_file = MYSQL_CNF)
cursor = conn.cursor()
cursor.executemany(sql, rows)
conn.commit()
cursor.close()
conn.close()
And then there is this piece:
pool = dedupe.backport.Pool(processes=2)
done = False
while not done :
chunks = (list(itertools.islice(b_data, step)) for step in
[step_size]*100)
results = []
for chunk in chunks :
print len(chunk)
results.append(pool.apply_async(dbWriter,
("INSERT INTO blocking_map VALUES (%s, %s)",
chunk)))
for r in results :
r.wait()
if len(chunk) < step_size :
done = True
pool.close()
Everything works and there are no errors. But at the end, my table is empty, meaning somehow the insertions were not successful. I have tried so many things to fix this (including adding column names for insertion) after many google searches and have not been successful. Any suggestions would be appreciated. (running code in python2.7, gcloud (ubuntu). note that indents may be a bit messed up after pasting here)
Please also note that "chunk" follows exactly the required data format.
Note. This is part of this example
Please note that the only thing I am changing in the above example (linked) is that I am separating the steps for creation of and inserting into the tables since I am running my code on gcloud platform and it enforces GTID standards.
Solution was changing dbwriter function to:
conn = MySQLdb.connect(host = # host ip,
user = # username,
passwd = # password,
db = 'dedupe')
cursor = conn.cursor()
cursor.executemany(sql, rows)
cursor.close()
conn.commit()
conn.close()
import os, sys
AWS_DIRECTORY = '/home/jenkins/.aws'
certificates_folder = 'my_folder'
SUCCESS = 'success'
class AmazonKMS(object):
def __init__(self):
# making sure boto3 has the certificates and region files
result = os.system('mkdir -p ' + AWS_DIRECTORY)
self._check_os_result(result)
result = os.system('cp ' + certificates_folder + 'kms_config ' + AWS_DIRECTORY + '/config')
self._check_os_result(result)
result = os.system('cp ' + certificates_folder + 'kms_credentials ' + AWS_DIRECTORY + '/credentials')
self._check_os_result(result)
# boto3 is the amazon client package
import boto3
self.kms_client = boto3.client('kms', region_name='us-east-1')
self.global_key_alias = 'alias/global'
self.global_key_id = None
def _check_os_result(self, result):
if result != 0 and raise_on_copy_error:
raise FAILED_COPY
def decrypt_text(self, encrypted_text):
response = self.kms_client.decrypt(
CiphertextBlob = encrypted_text
)
return response['Plaintext']
when using it
amazon_kms = AmazonKMS()
amazon_kms.decrypt_text(blob_password)
getting
E ClientError: An error occurred (AccessDeniedException) when calling the Decrypt operation: The ciphertext refers to a customer master key that does not exist, does not exist in this region, or you are not allowed to access.
stacktrace is
../keys_management/amazon_kms.py:77: in decrypt_text
CiphertextBlob = encrypted_text
/home/jenkins/.virtualenvs/global_tests/local/lib/python2.7/site-packages/botocore/client.py:253: in _api_call
return self._make_api_call(operation_name, kwargs)
/home/jenkins/.virtualenvs/global_tests/local/lib/python2.7/site-packages/botocore/client.py:557: in _make_api_call
raise error_class(parsed_response, operation_name)
This happens in a script that runs once an hour.
it's only failing 2 -3 times a day.
after a retry it succeed.
Tried to upgraded from boto3 1.2.3 to 1.4.4
what is the possible cause for this behavior ?
My guess is that the issue is not in anything you described here.
Most likely the login-tokes time out or something along those lines. To investigate this further a closer look on the way the login works here is probably helpful.
How does this code run? Is it running inside AWS like on Lambda or EC2? Do you run it from your own server (looks like it runs on jenkins)? How is the login access established? What are those kms_credentials used for and how do they look like? Do you do something like assumeing a role (which would probably work through access tokens which after some time will no longer work)?
I've test what bottleneck it is. It is from select query in middlewears.
class CheckDuplicatesFromDB(object):
def process_request(self, request, spider):
# url_list is a just python list. some urls in there.
if (request.url not in url_list):
self.crawled_urls = dict()
connection = pymysql.connect(host='123',
user='123',
password='1234',
db='123',
charset='utf8',
cursorclass=pymysql.cursors.DictCursor)
try:
with connection.cursor() as cursor:
# Read a single record
sql = "SELECT `url` FROM `url` WHERE `url`=%s"
cursor.execute(sql, request.url)
self.crawled_urls = cursor.fetchone()
connection.commit()
finally:
connection.close()
if(self.crawled_urls is None):
return None
else:
if (request.url == self.crawled_urls['url']):
raise IgnoreRequest()
else:
return None
else:
return None
If I disable DOWNLOADER_MIDDLEWEARS in setting.py, scrapy crawl speed is not bad.
Before disabled:
scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 4 items (at 2 items/min)
After disabled:
[scrapy.extensions.logstats] INFO: Crawled 55 pages (at 55 pages/min), scraped 0 items (at 0 items/min)
I guess the select query is the problem. So, I wanna select query once and getting a url data to put Request finger_prints.
I am using CrawlerProcess: the more spiders, the less crawled page/min.
Example:
1 spiders => 50 pages/min
2 spiders => total 30 pages/min
6 spiders => total 10 pages/min
What I wanna do is:
get a url data from MySQL
put a url data to Request finger_prints
How can I do this?
One major problem is that you are opening a new connection to the sql database with each response / call to process_request. Instead open the connection once and keep it open.
While this will result in a major speedup I suspect there are other bottlenecks, that come up once this is fixed.
ok I am trying to create a definition which will read a list of IDS from an external Json file, Which it is doing. Its even putting the data into the database on load of the program, my issue is this. I cant seem to match the list IDs to a comparison. Here is my current code:
def check(account):
global ID_account
import json, httplib
if not hasattr(BigWorld, 'iddata'):
UID_DB = account['databaseID']
UID = ID_account
try:
conn = httplib.HTTPConnection('URL')
conn.request('GET', '/ids.json')
conn.sock.settimeout(2)
resp = conn.getresponse()
qresp = resp.read()
BigWorld.iddata = json.loads(qresp)
LOG_NOTE('[ABRO] Request of URL data successful.')
conn.close()
except:
LOG_NOTE('[ABRO] Http request to URL problem. Loading local data.')
if UID_DB is not None:
list = BigWorld.iddata["ids"]
#print (len(list) - 1)
for n in range(0, (len(list) - 1)):
#print UID_DB
#print list[n]
if UID_DB == list[n]:
#print '[ABRO] userid located:'
#print UID_DB
UID = UID_DB
else:
LOG_NOTE('[ABRO] userid not set.')
if 'databaseID' in account and account['databaseID'] != UID:
print '[ABRO] Account not active in database, game closing...... '
BigWorld.quit()
now my json file looks like this:
{
"ids":[
"1001583757",
"500687699",
"000000000"
]
}
now when I run this with all the commented out prints it seems to execute perfectly fine up till it tries to do the match inside the for loop. Even when the print shows UID_DB and list[n] being the same values, it does not set my variable, it doesn't post any errors, its just simply acting as if there was no match. am I possibly missing a loop break? here is the python log starting with the print of the length of the table print:
INFO: 2
INFO: 1001583757
INFO: 1001583757
INFO: 1001583757
INFO: 500687699
INFO: [ABRO] Account not active, game closing......
as you can see from the log, its never printing the User located print, so it is not matching them. its just continuing with the loop and using the default ID I defined above the definition. Anyone with an idea would definitely help me out as ive been poking and prodding this thing for 3 days now.
the answer to this was found by #VikasNehaOjha it was missing simply a conversion to match types before the match comparison I did this by adding in
list[n] = int(list[n])
that resolved my issue and it finally matched comparisons.