Bottleneck in scrapy middlewears MySQL select - mysql

I've test what bottleneck it is. It is from select query in middlewears.
class CheckDuplicatesFromDB(object):
def process_request(self, request, spider):
# url_list is a just python list. some urls in there.
if (request.url not in url_list):
self.crawled_urls = dict()
connection = pymysql.connect(host='123',
user='123',
password='1234',
db='123',
charset='utf8',
cursorclass=pymysql.cursors.DictCursor)
try:
with connection.cursor() as cursor:
# Read a single record
sql = "SELECT `url` FROM `url` WHERE `url`=%s"
cursor.execute(sql, request.url)
self.crawled_urls = cursor.fetchone()
connection.commit()
finally:
connection.close()
if(self.crawled_urls is None):
return None
else:
if (request.url == self.crawled_urls['url']):
raise IgnoreRequest()
else:
return None
else:
return None
If I disable DOWNLOADER_MIDDLEWEARS in setting.py, scrapy crawl speed is not bad.
Before disabled:
scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 4 items (at 2 items/min)
After disabled:
[scrapy.extensions.logstats] INFO: Crawled 55 pages (at 55 pages/min), scraped 0 items (at 0 items/min)
I guess the select query is the problem. So, I wanna select query once and getting a url data to put Request finger_prints.
I am using CrawlerProcess: the more spiders, the less crawled page/min.
Example:
1 spiders => 50 pages/min
2 spiders => total 30 pages/min
6 spiders => total 10 pages/min
What I wanna do is:
get a url data from MySQL
put a url data to Request finger_prints
How can I do this?

One major problem is that you are opening a new connection to the sql database with each response / call to process_request. Instead open the connection once and keep it open.
While this will result in a major speedup I suspect there are other bottlenecks, that come up once this is fixed.

Related

Why does Fast API take upwards of 10 minutes to insert 100,000 rows into a SQL database

I've tried using SqlAlchemy, as well as raw mysql.connector here, but commiting an insert into a SQL database from FastAPI takes forever.
I wanted to make sure it wasn't just my DB, so I tried it on a local script and it ran in a couple seconds.
How can I work with FastAPI to make this query possible?
Thanks!
'''
#router.post('/')
def postStockData(data:List[pydanticModels.StockPrices], raw_db = Depends(get_raw_db)):
cursor = raw_db[0]
cnxn = raw_db[1]
# i = 0
# for row in data:
# if i % 10 == 0:
# print(i)
# db.flush()
# i += 1
# db_pricing = models.StockPricing(**row.dict())
# db.add(db_pricing)
# db.commit()
SQL = "INSERT INTO " + models.StockPricing.__tablename__ + " VALUES (%s, %s, %s)"
print(SQL)
valsToInsert = []
for row in data:
rowD = row.dict()
valsToInsert.append((rowD['date'], rowD['symbol'], rowD['value']))
cursor.executemany(SQL, valsToInsert)
cnxn.commit()
return {'message':'Pricing Updated'}
'''
You are killing performances because you try a "RBAR" approach which is not suitable in RDBMS...
You use a loop and execute an SQL INSERT of only one row...
When the RDBMS is facing a query, the sequence of execution is the following :
does the user that throw the query be authenticate ?
parsing the string to verify the syntax
looking for metadata (tables, columns, datatypes...)
analyzing which operations on tables and columns this user is granted
creating an execution plan to sequences all the operations needed for the query
setting up lock for concurrency
executing the query (inserting only 1 row)
throw back an error or a OK message
Every steps consumes time... and your are all theses steps 100 000 times because of your loop.
Usually when inserting in a table many rows, there just one query to do even if the INSERT concerns 10000000000 rows from a file !

Collecting Tweets using max_id is not working as expected

Iam currently doing a tweet search using Twitter Api. However, taking the tweet id is not working for me.
Here is my code:
searchQuery = '#BLM' # this is what we're searching for
searchQuery = searchQuery + "-filter:retweets"
Geocode="39.8, -95.583068847656, 2500km"
maxTweets = 1000000 # Some arbitrary large number
tweetsPerQry = 100 # this is the max the API permits
fName = 'tweetsBLM.json' # We'll store the tweets in a json file.
sinceId = None
#max_id = -1 # initial search
max_id=1278836959926980609 # the last id of previous search
tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
while tweetCount < maxTweets:
try:
if (max_id <= 0):
if (not sinceId):
new_tweets = api.search(q=searchQuery,lang="en", geocode=Geocode,
count=tweetsPerQry)
else:
new_tweets = api.search(q=searchQuery,lang="en",geocode=Geocode,
count=tweetsPerQry,
since_id=sinceId )
else:
if (not sinceId):
new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
count=tweetsPerQry,
max_id=str(max_id - 1) )
else:
new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
count=tweetsPerQry,
max_id=str(max_id - 1),
since_id=sinceId)
if not new_tweets:
print("No more tweets found")
break
for tweet in new_tweets:
f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
'\n')
tweetCount += len(new_tweets)
print("Downloaded {0} tweets".format(tweetCount))
max_id = new_tweets[-1].id
except tweepy.TweepError as e:
# Just exit if any error
print("some error : " + str(e))
print('exception raised, waiting 15 minutes')
print('(until:', dt.datetime.now() + dt.timedelta(minutes=15), ')')
time.sleep(15*60)
break
print ("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))
This code works perfectly fine. I initially run it and got about 40 000 tweets. Then i took the id of the last tweet of previous/initial search to go back in time. However, i was disappointed to see that there were no tweets anymore. I can not believe that for a second. I must be going wrong somewhere because #BLM has been very active in the last 2/3 months.
Any help is very welcome. Thank you
I may have found the answer. Using Twitter API, it is not possible to get older tweets (7 days old or more). Using max_id to get around this is not possible either.
The only way is to stream and wait for more than 7 days.
Finally, there is also this link that look for older tweets
https://pypi.org/project/GetOldTweets3/ it is an extension of the original Jefferson Henrique's work

Updating one table of MYSQL with multiple processes via pymysql

Actually, I am trying to update one table with multiple processes via pymysql, and each process reads a CSV file split from a huge one in order to promote the speed. But I get the Lock wait timeout exceeded; try restarting transaction exception when I run the script. After searching the posts on this site, I found one post which mentioned that to set or build the built-in LOAD_DATA_INFILE, but no details on it. How can I do it with 'pymysql' to reach my aim?
---------------------------first edit----------------------------------------
Here's the job method:
`def importprogram(path, name):
begin = time.time()
print('begin to import program' + name + ' info.')
# "c:\\sometest.csv"
file = open(path, mode='rb')
csvfile = csv.reader(codecs.iterdecode(file, 'utf-8'))
connection = None
try:
connection = pymysql.connect(host='a host', user='someuser', password='somepsd', db='mydb',
cursorclass=pymysql.cursors.DictCursor)
count = 1
with connection.cursor() as cursor:
sql = '''update sometable set Acolumn='{guid}' where someid='{pid}';'''
next(csvfile, None)
for line in csvfile:
try:
count = count + 1
if ''.join(line).strip():
command = sql.format(guid=line[2], pid=line[1])
cursor.execute(command)
if count % 1000 == 0:
print('program' + name + ' cursor execute', count)
except csv.Error:
print('program csv.Error:', count)
continue
except IndexError:
print('program IndexError:', count)
continue
except StopIteration:
break
except Exception as e:
print('program' + name, str(e))
finally:
connection.commit()
connection.close()
file.close()
print('program' + name + ' info done.time cost:', time.time()-begin)`
And the multi-processing method:
import multiprocessing as mp
def multiproccess():
pool = mp.Pool(3)
results = []
paths = ['C:\\testfile01.csv', 'C:\\testfile02.csv', 'C:\\testfile03.csv']
name = 1
for path in paths:
results.append(pool.apply_async(importprogram, args=(path, str(name))))
name = name + 1
print(result.get() for result in results)
pool.close()
pool.join()
And the main method:
if __name__ == '__main__':
multiproccess()
I am new to Python. How can I make the code or the way itself goes wrong? Should I use only one single process to finish the data reading and importing?
Your issue is that you are exceeding the time allowed for a response to be fetched from the server, so the client is automatically timing out.
In my experience, adjust the wait timeout to something like 6000 seconds, combine into one CSV and just leave the data to import. Also, I would recommend running the query direct from MySQL rather than Python.
The way I usually import CSV data from Python to MySQL is through the INSERT ... VALUES ... method, and I only do so when some kind of manipulation of the data is required (i.e. inserting different rows into different tables).
I like your approach and understand your thinking but in reality there is no need. The benefit to the INSERT ... VALUES ... method is that you won't run into any timeout issue.

Values are not inserted into MySQL table using pool.apply_async in python2.7

I am trying to run the following code to populate a table in parallel for a certain application. First the following function is defined which is supposed to connect to my db and execute the sql command with the values given (to insert into table).
def dbWriter(sql, rows) :
# load cnf file
MYSQL_CNF = os.path.abspath('.') + '/mysql.cnf'
conn = MySQLdb.connect(db='dedupe',
charset='utf8',
read_default_file = MYSQL_CNF)
cursor = conn.cursor()
cursor.executemany(sql, rows)
conn.commit()
cursor.close()
conn.close()
And then there is this piece:
pool = dedupe.backport.Pool(processes=2)
done = False
while not done :
chunks = (list(itertools.islice(b_data, step)) for step in
[step_size]*100)
results = []
for chunk in chunks :
print len(chunk)
results.append(pool.apply_async(dbWriter,
("INSERT INTO blocking_map VALUES (%s, %s)",
chunk)))
for r in results :
r.wait()
if len(chunk) < step_size :
done = True
pool.close()
Everything works and there are no errors. But at the end, my table is empty, meaning somehow the insertions were not successful. I have tried so many things to fix this (including adding column names for insertion) after many google searches and have not been successful. Any suggestions would be appreciated. (running code in python2.7, gcloud (ubuntu). note that indents may be a bit messed up after pasting here)
Please also note that "chunk" follows exactly the required data format.
Note. This is part of this example
Please note that the only thing I am changing in the above example (linked) is that I am separating the steps for creation of and inserting into the tables since I am running my code on gcloud platform and it enforces GTID standards.
Solution was changing dbwriter function to:
conn = MySQLdb.connect(host = # host ip,
user = # username,
passwd = # password,
db = 'dedupe')
cursor = conn.cursor()
cursor.executemany(sql, rows)
cursor.close()
conn.commit()
conn.close()

World of tanks Python list comparison from json

ok I am trying to create a definition which will read a list of IDS from an external Json file, Which it is doing. Its even putting the data into the database on load of the program, my issue is this. I cant seem to match the list IDs to a comparison. Here is my current code:
def check(account):
global ID_account
import json, httplib
if not hasattr(BigWorld, 'iddata'):
UID_DB = account['databaseID']
UID = ID_account
try:
conn = httplib.HTTPConnection('URL')
conn.request('GET', '/ids.json')
conn.sock.settimeout(2)
resp = conn.getresponse()
qresp = resp.read()
BigWorld.iddata = json.loads(qresp)
LOG_NOTE('[ABRO] Request of URL data successful.')
conn.close()
except:
LOG_NOTE('[ABRO] Http request to URL problem. Loading local data.')
if UID_DB is not None:
list = BigWorld.iddata["ids"]
#print (len(list) - 1)
for n in range(0, (len(list) - 1)):
#print UID_DB
#print list[n]
if UID_DB == list[n]:
#print '[ABRO] userid located:'
#print UID_DB
UID = UID_DB
else:
LOG_NOTE('[ABRO] userid not set.')
if 'databaseID' in account and account['databaseID'] != UID:
print '[ABRO] Account not active in database, game closing...... '
BigWorld.quit()
now my json file looks like this:
{
"ids":[
"1001583757",
"500687699",
"000000000"
]
}
now when I run this with all the commented out prints it seems to execute perfectly fine up till it tries to do the match inside the for loop. Even when the print shows UID_DB and list[n] being the same values, it does not set my variable, it doesn't post any errors, its just simply acting as if there was no match. am I possibly missing a loop break? here is the python log starting with the print of the length of the table print:
INFO: 2
INFO: 1001583757
INFO: 1001583757
INFO: 1001583757
INFO: 500687699
INFO: [ABRO] Account not active, game closing......
as you can see from the log, its never printing the User located print, so it is not matching them. its just continuing with the loop and using the default ID I defined above the definition. Anyone with an idea would definitely help me out as ive been poking and prodding this thing for 3 days now.
the answer to this was found by #VikasNehaOjha it was missing simply a conversion to match types before the match comparison I did this by adding in
list[n] = int(list[n])
that resolved my issue and it finally matched comparisons.