Iam currently doing a tweet search using Twitter Api. However, taking the tweet id is not working for me.
Here is my code:
searchQuery = '#BLM' # this is what we're searching for
searchQuery = searchQuery + "-filter:retweets"
Geocode="39.8, -95.583068847656, 2500km"
maxTweets = 1000000 # Some arbitrary large number
tweetsPerQry = 100 # this is the max the API permits
fName = 'tweetsBLM.json' # We'll store the tweets in a json file.
sinceId = None
#max_id = -1 # initial search
max_id=1278836959926980609 # the last id of previous search
tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
while tweetCount < maxTweets:
try:
if (max_id <= 0):
if (not sinceId):
new_tweets = api.search(q=searchQuery,lang="en", geocode=Geocode,
count=tweetsPerQry)
else:
new_tweets = api.search(q=searchQuery,lang="en",geocode=Geocode,
count=tweetsPerQry,
since_id=sinceId )
else:
if (not sinceId):
new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
count=tweetsPerQry,
max_id=str(max_id - 1) )
else:
new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
count=tweetsPerQry,
max_id=str(max_id - 1),
since_id=sinceId)
if not new_tweets:
print("No more tweets found")
break
for tweet in new_tweets:
f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
'\n')
tweetCount += len(new_tweets)
print("Downloaded {0} tweets".format(tweetCount))
max_id = new_tweets[-1].id
except tweepy.TweepError as e:
# Just exit if any error
print("some error : " + str(e))
print('exception raised, waiting 15 minutes')
print('(until:', dt.datetime.now() + dt.timedelta(minutes=15), ')')
time.sleep(15*60)
break
print ("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))
This code works perfectly fine. I initially run it and got about 40 000 tweets. Then i took the id of the last tweet of previous/initial search to go back in time. However, i was disappointed to see that there were no tweets anymore. I can not believe that for a second. I must be going wrong somewhere because #BLM has been very active in the last 2/3 months.
Any help is very welcome. Thank you
I may have found the answer. Using Twitter API, it is not possible to get older tweets (7 days old or more). Using max_id to get around this is not possible either.
The only way is to stream and wait for more than 7 days.
Finally, there is also this link that look for older tweets
https://pypi.org/project/GetOldTweets3/ it is an extension of the original Jefferson Henrique's work
Related
This is my first time using any sort of code. I have been following along with an interactive tutorial and I seem to be stuck at the very first step, trying to import a json file containing info regarding football competition data. It seems fairly straightforward but error message after error message has started to drive me insane.
I am trying to load the data into python in order to follow along with a tutorial (I will leave a link below). I believe I have saved my files and data in the same way as in the tutorial but when I change the file directory and run: import json I get a few different error messages if someone could advise on what I’m doing wrong it would be greatly appreciated. My goal is to load in the data which I have downloaded from GitHub and open the competitions JSON file.
I am also happy to provide any information required to help answer this question.
YouTube video:https://youtu.be/GTtu0t03FMO
error messages:
FileNotFoundError: [Errno 2] No such file or directory: 'Statsbomb/data/competitions.json'
JSONDecodeError:Expecting value
#Load in Statsbomb competition and match data
#This is a library for loading json files.
import json
#Load the competition file
#Got this by searching 'how do I open json in Python'
with open('Statsbomb/data/competitions.json') as f:
competitions = json.load(f)
#Womens World Cup 2019 has competition ID 72
competition_id=72
#Womens World Cup 2019 has competition ID 72
competition_id=72
#Load the list of matches for this competition
with open('Statsbomb/data/matches/'+str(competition_id)+'/30.json') as f:
matches = json.load(f)
#Look inside matches
matches[0]
matches[0]['home_team']
matches[0]['home_team']['home_team_name']
matches[0]['away_team']['away_team_name']
#Print all match results
for match in matches:
home_team_name=match['home_team']['home_team_name']
away_team_name=match['away_team']['away_team_name']
home_score=match['home_score']
away_score=match['away_score']
describe_text = 'The match between ' + home_team_name + ' and ' + away_team_name
result_text = ' finished ' + str(home_score) + ' : ' + str(away_score)
print(describe_text + result_text)
#Now lets find a match we are interested in
home_team_required ="England"
away_team_required ="Sweden"
#Find ID for the match
for match in matches:
home_team_name=match['home_team']['home_team_name']
away_team_name=match['away_team']['away_team_name']
if (home_team_name==home_team_required) and (away_team_name==away_team_required):
match_id_required = match['match_id']
print(home_team_required + ' vs ' + away_team_required + ' has id:' + str(match_id_required))
#Exercise:
#1, Edit the code above to print out the result list for the Mens World cup
#2, Edit the code above to find the ID for England vs. Sweden
#3, Write new code to write out a list of just Sweden's results in the tournament.
with open('Statsbomb/data/matches/'+str(competition_id)+'/30.json') as f:
matches = json.load(f)
try:
with open('Statsbomb/data/matches/'+str(competition_id)+'/3.json') as f:
matches = json.load(f)
Actually, I am trying to update one table with multiple processes via pymysql, and each process reads a CSV file split from a huge one in order to promote the speed. But I get the Lock wait timeout exceeded; try restarting transaction exception when I run the script. After searching the posts on this site, I found one post which mentioned that to set or build the built-in LOAD_DATA_INFILE, but no details on it. How can I do it with 'pymysql' to reach my aim?
---------------------------first edit----------------------------------------
Here's the job method:
`def importprogram(path, name):
begin = time.time()
print('begin to import program' + name + ' info.')
# "c:\\sometest.csv"
file = open(path, mode='rb')
csvfile = csv.reader(codecs.iterdecode(file, 'utf-8'))
connection = None
try:
connection = pymysql.connect(host='a host', user='someuser', password='somepsd', db='mydb',
cursorclass=pymysql.cursors.DictCursor)
count = 1
with connection.cursor() as cursor:
sql = '''update sometable set Acolumn='{guid}' where someid='{pid}';'''
next(csvfile, None)
for line in csvfile:
try:
count = count + 1
if ''.join(line).strip():
command = sql.format(guid=line[2], pid=line[1])
cursor.execute(command)
if count % 1000 == 0:
print('program' + name + ' cursor execute', count)
except csv.Error:
print('program csv.Error:', count)
continue
except IndexError:
print('program IndexError:', count)
continue
except StopIteration:
break
except Exception as e:
print('program' + name, str(e))
finally:
connection.commit()
connection.close()
file.close()
print('program' + name + ' info done.time cost:', time.time()-begin)`
And the multi-processing method:
import multiprocessing as mp
def multiproccess():
pool = mp.Pool(3)
results = []
paths = ['C:\\testfile01.csv', 'C:\\testfile02.csv', 'C:\\testfile03.csv']
name = 1
for path in paths:
results.append(pool.apply_async(importprogram, args=(path, str(name))))
name = name + 1
print(result.get() for result in results)
pool.close()
pool.join()
And the main method:
if __name__ == '__main__':
multiproccess()
I am new to Python. How can I make the code or the way itself goes wrong? Should I use only one single process to finish the data reading and importing?
Your issue is that you are exceeding the time allowed for a response to be fetched from the server, so the client is automatically timing out.
In my experience, adjust the wait timeout to something like 6000 seconds, combine into one CSV and just leave the data to import. Also, I would recommend running the query direct from MySQL rather than Python.
The way I usually import CSV data from Python to MySQL is through the INSERT ... VALUES ... method, and I only do so when some kind of manipulation of the data is required (i.e. inserting different rows into different tables).
I like your approach and understand your thinking but in reality there is no need. The benefit to the INSERT ... VALUES ... method is that you won't run into any timeout issue.
I've test what bottleneck it is. It is from select query in middlewears.
class CheckDuplicatesFromDB(object):
def process_request(self, request, spider):
# url_list is a just python list. some urls in there.
if (request.url not in url_list):
self.crawled_urls = dict()
connection = pymysql.connect(host='123',
user='123',
password='1234',
db='123',
charset='utf8',
cursorclass=pymysql.cursors.DictCursor)
try:
with connection.cursor() as cursor:
# Read a single record
sql = "SELECT `url` FROM `url` WHERE `url`=%s"
cursor.execute(sql, request.url)
self.crawled_urls = cursor.fetchone()
connection.commit()
finally:
connection.close()
if(self.crawled_urls is None):
return None
else:
if (request.url == self.crawled_urls['url']):
raise IgnoreRequest()
else:
return None
else:
return None
If I disable DOWNLOADER_MIDDLEWEARS in setting.py, scrapy crawl speed is not bad.
Before disabled:
scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 4 items (at 2 items/min)
After disabled:
[scrapy.extensions.logstats] INFO: Crawled 55 pages (at 55 pages/min), scraped 0 items (at 0 items/min)
I guess the select query is the problem. So, I wanna select query once and getting a url data to put Request finger_prints.
I am using CrawlerProcess: the more spiders, the less crawled page/min.
Example:
1 spiders => 50 pages/min
2 spiders => total 30 pages/min
6 spiders => total 10 pages/min
What I wanna do is:
get a url data from MySQL
put a url data to Request finger_prints
How can I do this?
One major problem is that you are opening a new connection to the sql database with each response / call to process_request. Instead open the connection once and keep it open.
While this will result in a major speedup I suspect there are other bottlenecks, that come up once this is fixed.
ok I am trying to create a definition which will read a list of IDS from an external Json file, Which it is doing. Its even putting the data into the database on load of the program, my issue is this. I cant seem to match the list IDs to a comparison. Here is my current code:
def check(account):
global ID_account
import json, httplib
if not hasattr(BigWorld, 'iddata'):
UID_DB = account['databaseID']
UID = ID_account
try:
conn = httplib.HTTPConnection('URL')
conn.request('GET', '/ids.json')
conn.sock.settimeout(2)
resp = conn.getresponse()
qresp = resp.read()
BigWorld.iddata = json.loads(qresp)
LOG_NOTE('[ABRO] Request of URL data successful.')
conn.close()
except:
LOG_NOTE('[ABRO] Http request to URL problem. Loading local data.')
if UID_DB is not None:
list = BigWorld.iddata["ids"]
#print (len(list) - 1)
for n in range(0, (len(list) - 1)):
#print UID_DB
#print list[n]
if UID_DB == list[n]:
#print '[ABRO] userid located:'
#print UID_DB
UID = UID_DB
else:
LOG_NOTE('[ABRO] userid not set.')
if 'databaseID' in account and account['databaseID'] != UID:
print '[ABRO] Account not active in database, game closing...... '
BigWorld.quit()
now my json file looks like this:
{
"ids":[
"1001583757",
"500687699",
"000000000"
]
}
now when I run this with all the commented out prints it seems to execute perfectly fine up till it tries to do the match inside the for loop. Even when the print shows UID_DB and list[n] being the same values, it does not set my variable, it doesn't post any errors, its just simply acting as if there was no match. am I possibly missing a loop break? here is the python log starting with the print of the length of the table print:
INFO: 2
INFO: 1001583757
INFO: 1001583757
INFO: 1001583757
INFO: 500687699
INFO: [ABRO] Account not active, game closing......
as you can see from the log, its never printing the User located print, so it is not matching them. its just continuing with the loop and using the default ID I defined above the definition. Anyone with an idea would definitely help me out as ive been poking and prodding this thing for 3 days now.
the answer to this was found by #VikasNehaOjha it was missing simply a conversion to match types before the match comparison I did this by adding in
list[n] = int(list[n])
that resolved my issue and it finally matched comparisons.
I'm trying to get all of a specific user's tweets.
I know there is a limit of retreiving 3600 tweets, so I'm wondering why I can't get more tweets from this line:
https://api.twitter.com/1/statuses/user_timeline.json?include_entities=true&include_rts=true&screen_name=mybringback&count=3600
Does anyone know how to fix this?
The API documentation specifies that the maximum number of statuses that this call will return is 200.
https://dev.twitter.com/docs/api/1/get/statuses/user_timeline
Specifies the number of tweets to try and retrieve, up to a maximum of 200. The value of count is best thought of as a limit to the number of tweets to return because suspended or deleted content is removed after the count has been applied. We include retweets in the count, even if include_rts is not supplied. It is recommended you always send include_rts=1 when using this API method.
Here's something I've used for a project that had to do just that:
import json
import commands
import time
def get_followers(screen_name):
followers_list = []
# start cursor at -1
next_cursor = -1
print("Getting list of followers for user '%s' from Twitter API..." % screen_name)
while next_cursor:
cmd = 'twurl "/1.1/followers/ids.json?cursor=' + str(next_cursor) + \
'&screen_name=' + screen_name + '"'
(status, output) = commands.getstatusoutput(cmd)
# convert json object to dictionary and ensure there are no errors
try:
data = json.loads(output)
if data.get("errors"):
# if we get an inactive account, write error message
if data.get('errors')[0]['message'] in ("Sorry, that page does not exist",
"User has been suspended"):
print("Skipping account %s. It doesn't seem to exist" % screen_name)
break
elif data.get('errors')[0]['message'] == "Rate limit exceeded":
print("\t*** Rate limit exceeded ... waiting 2 minutes ***")
time.sleep(120)
continue
# otherwise, raise an exception with the error
else:
raise Exception("The Twitter call returned errors: %s"
% data.get('errors')[0]['message'])
if data.get('ids'):
print("\t\tFound %s followers for user '%s'" % (len(data['ids']), screen_name))
followers_list += data['ids']
if data.get('next_cursor'):
next_cursor = data['next_cursor']
else:
break
except ValueError:
print("\t****No output - Retrying \t\t%s ****" % output)
return followers_list
screen_name = 'AshwinBalamohan'
followers = get_followers(screen_name)
print("\n\nThe followers for user '%s' are:\n%s" % followers)
In order to get this to work, you'll need to install the Ruby gem 'Twurl', which is available here: https://github.com/marcel/twurl
I found Twurl easier to work with than the other Python Twitter wrappers, so opted to call it from Python. Let me know if you'd like me to walk you through how to install Twurl and the Twitter API keys.