GitHub API search code, missing items in JSON

I've been trying to build a tool that needs to fetch the URLs of all files in a GitHub code search result. For example, when you search GitHub for uber.com api_key, you'll see that there are 381 code results, and I want to get the URLs of all 381 files.
To do that, I learned how to use the GitHub API v3 and wrote the following function:
def fetchItems(search, GITHUB_API):
    items = set()
    response = {"items":[1]}
    pageNumber = 1
    while(response["items"]):
        sleep(3) # trying to avoid rate limit, not successful though :(
        url = "https://api.github.com/search/code"
        params = {
            "q" : search,
            "per_page" : 30, # default value, it can be increased to 100
            "page" : pageNumber
        }
        headers = {
            "Accept" : "application/vnd.github+json",
            "Authorization" : f"Bearer {GITHUB_API}"
        }
        r = requests.get(url=url, headers=headers, params=params, verify=False)
        if r.status_code == 403: # if we exceed the rate limit, sleep until the rate limit resets
            epochReset = int(r.headers["X-Ratelimit-Reset"])
            epochNow = time()
            if epochNow < epochReset:
                sleep((epochReset - epochNow) + 1)
            sleep(1)
            continue
        response = json.loads(r.text)
        for file in response["items"]:
            items.add(file["html_url"])
        pageNumber += 1
    return items
The per_page parameter indicates the number of items that will be returned in each page, and page is the page number :). By increasing the page number in every request, you should be able to get all items, according to my understanding.
However when I opened my database and checked the items that have been written, I saw that there were only 377 files, so 4 of the files are missing.
I checked the db writer function and I'm sure that there is nothing wrong with it. Does the GitHub API return JSON with missing items, or am I doing something wrong?
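One way to narrow this down is to record what each response claims about itself while paging. The sketch below is only an illustration: it assumes each decoded response carries total_count and incomplete_results next to items, and fetch_page is a hypothetical callable standing in for the requests.get call above. Because the URLs go into a set, a file that shows up on two pages is stored only once, so the sketch counts those repeats as well.

```python
def collect_urls(fetch_page, per_page=100):
    """Collect html_url values page by page and report what the API claimed.

    fetch_page(page, per_page) is a hypothetical callable that returns the
    decoded JSON body of one search response as a dict.
    """
    urls = set()
    duplicates = 0        # items whose URL was already seen on an earlier page
    incomplete = False    # True if any page reported incomplete_results
    total_count = None    # last total_count reported by the API
    page = 1
    while True:
        body = fetch_page(page, per_page)
        items = body.get("items", [])
        if not items:
            break
        total_count = body.get("total_count", total_count)
        incomplete = incomplete or bool(body.get("incomplete_results", False))
        for item in items:
            if item["html_url"] in urls:
                duplicates += 1
            urls.add(item["html_url"])
        page += 1
    return {
        "urls": urls,
        "total_count": total_count,
        "incomplete": incomplete,
        "duplicates": duplicates,
    }
```

If len(report["urls"]) + report["duplicates"] matches report["total_count"], the gap comes from repeated items rather than from items the API never returned.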

Related

How to ignore NoneType error in tweet ID statuses_lookup

I am trying to collect tweets with tweepy from a list of tweet ids good_tweet_ids_test, using statuses_lookup.
Since the list is a bit old, some tweets will have been deleted by now. Therefore I ignore errors in the lookup_tweets function, so it does not stop each time.
Here is my code so far:
def lookup_tweets(tweet_IDs, api):
    full_tweets = []
    tweet_count = len(tweet_IDs)
    try:
        for i in range((tweet_count // 100) + 1):
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            full_tweets.extend(
                api.statuses_lookup(tweet_IDs[i * 100:end_loc], tweet_mode='extended')
            )
        return full_tweets
    except:
        pass

results = lookup_tweets(good_tweet_ids_test, api)
temp = json.dumps([status._json for status in results]) #create JSON
newdf = pd.read_json(temp, orient='records')
newdf.to_json('tweepy_tweets.json')
But when I run the temp = json.dumps([status._json for status in results]) line, it gives me the error:
TypeError: 'NoneType' object is not iterable
I do not know how to fix this. I believe the type of some of the statuses is None, because they have been deleted and can therefore not be looked up now. I simply wish for my code to move on to the next status, if the type is None.
EDIT: As has been pointed out, the issue is that results is None. So now I think I need to exclude None values from the full_tweets variable, but I cannot figure out how. Any help?
EDIT2: With further testing I have found out that results is only None when there is a tweet ID that has now been deleted in the batch. If the batch contains only active tweets, it works. So I think I need to figure out how to have my function look up the batch of tweets, and only return those that are not None. Any help on this?
Instead of implicitly returning None when there's an error, you could explicitly return an empty list. That way, the result of lookup_tweets will always be iterable, and the calling code won't have to check its result:
def lookup_tweets(tweet_IDs, api):
    full_tweets = []
    tweet_count = len(tweet_IDs)
    try:
        for i in range((tweet_count // 100) + 1):
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            full_tweets.extend(
                api.statuses_lookup(tweet_IDs[i * 100:end_loc], tweet_mode='extended')
            )
        return full_tweets
    except:
        return [] # Here!
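Note that with the try wrapped around the whole loop, one failing batch still discards every batch fetched before it. A possible variation, sketched here under assumptions, moves the try inside the loop so that only the failing batch is skipped. The lookup argument is a hypothetical stand-in for something like lambda batch: api.statuses_lookup(batch, tweet_mode='extended'), and the function name is made up for illustration.

```python
def lookup_in_batches(ids, lookup, batch_size=100):
    """Call lookup(batch) on consecutive slices of ids.

    A batch that raises is recorded and skipped, so the results of the
    other batches are kept. lookup is a hypothetical callable that takes
    a list of IDs and returns a list of results.
    """
    found = []
    failed_batches = []  # (start index, exception) for each batch that raised
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        try:
            found.extend(lookup(batch))
        except Exception as exc:
            failed_batches.append((start, exc))
    return found, failed_batches
```

Stepping through range(0, len(ids), batch_size) also avoids sending an empty final batch when the number of IDs is an exact multiple of the batch size.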

How to dump all results of a API request when there is a page limit?

I am using an API to pull data from a url, however the API has a pagination limit. It goes like:
Page (default is 1 and it's the page number you want to retrieve)
Per_page (default is 100 and it's the maximum number of results returned in the response(max=500))
I have a script that can get the results of a single page, but I want to automate it. I want to be able to loop through all the pages (with per_page=500) and load the results into a JSON file.
Here is my code that can get 500 results per_page:
import json, pprint
import requests

url = "https://my_api.com/v1/users?per_page=500"
header = {"Authorization": "Bearer <my_api_token>"}
s = requests.Session()
s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>" }
resp = s.get(url, headers=header, verify=False)
raw=resp.json()
for x in raw:
    print(x)
The output is 500 but is there a way to keep going and pull the results starting from where it left off? Or even go by page and get all the data per page until there's no data in a page?
It would be helpful if you presented a sample response from your API.
If the API is equipped properly, each response will carry a Link header with a next relation that leads you to the next page.
You can then keep calling the API recursively with the link given in next. On the last page, there will be no next in the Link header.
resp.links["next"]["url"] will give you the URL to the next page.
For example, the GitHub API has next, last, first, and prev relations.
To put it into code, first, you need to turn your code into functions.
Given that there is a maximum of 500 results per page, it implies you are extracting a list of data of some sort from the API. Often, these data are returned in a list somewhere inside raw.
For now, let's assume you want to extract all elements inside a list at raw.get('data').
import requests

header = {"Authorization": "Bearer <my_api_token>"}
results_per_page = 500

def compose_url():
    return (
        "https://my_api.com/v1/users"
        + "?per_page="
        + str(results_per_page)
        + "&page_number="
        + "1"
    )

def get_result(url=None):
    # Start from the first page unless a next-page URL was passed in
    if url is None:
        url_get = compose_url()
    else:
        url_get = url
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    resp = s.get(url_get, headers=header, verify=False)
    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)
    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list
    # resp.links has no "next" entry on the last page, so default to {}
    if "url" not in resp.links.get("next", {}):
        # We are at the last page, return data
        return data
    # Otherwise, recursively get results from the next url
    return data + get_result(resp.links["next"]["url"])  # concat lists

def main():
    # Driver function
    data = get_result()
    # Then you can print the data or save it to a file

if __name__ == "__main__":
    # Now run the driver function
    main()
However, if there isn't a proper Link header, I see 2 solutions:
(1) recursion and (2) loop.
I'll demonstrate recursion.
As you have mentioned, when there is pagination in API responses, i.e. when there is a limit of maximum number of results per page, there is often a query parameter called page number or start index of some sort to indicate which "page" you are querying, so we'll utilize the page_number parameter in the code.
The logic is:
Given an HTTP response, if there are fewer than 500 results, there are no more pages. Return the results.
If there are 500 results in a given response, there is probably another page, so we advance page_number by 1, recurse (by calling the function itself), and concatenate with the previous results.
import requests

header = {"Authorization": "Bearer <my_api_token>"}
results_per_page = 500

def compose_url(results_per_page, current_page_number):
    return (
        "https://my_api.com/v1/users"
        + "?per_page="
        + str(results_per_page)
        + "&page_number="
        + str(current_page_number)
    )

def get_result(current_page_number):
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    url = compose_url(results_per_page, current_page_number)
    resp = s.get(url, headers=header, verify=False)
    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)
    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list
    # If the length of data is smaller than results_per_page (500 of them),
    # that means there are no more pages
    if len(data) < results_per_page:
        return data
    # Otherwise, advance the page number and do a recursion
    return data + get_result(current_page_number + 1)  # concat lists

def main():
    # Driver function
    data = get_result(1)
    # Then you can print the data or save it to a file

if __name__ == "__main__":
    # Now run the driver function
    main()
If you truly want to store the raw responses, you can. However, you'll still need to check the number of results in a given response. The logic is similar. If a given raw contains 500 results, it means there is probably another page. We advance the page number by 1 and do a recursion.
Let's still assume raw.get('data') is the list whose length is the number of results.
Because JSON/dictionary files cannot be simply concatenated, you can store raw (which is a dictionary) of each page into a list of raws. You can then parse and synthesize the data in whatever way you want.
Use the following get_result function:
def get_result(current_page_number):
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    url = compose_url(results_per_page, current_page_number)
    resp = s.get(url, headers=header, verify=False)
    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)
    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list
    if len(data) == results_per_page:
        return [raw] + get_result(current_page_number + 1)  # concat lists
    return [raw]  # convert raw into a list object on the fly
As for the loop method, the logic is similar to recursion. Essentially, you call the get_result() function a number of times, collect the results, and break early when the latest page contains fewer than 500 results.
If you know the total number of results in advance, you can simply run the loop a predetermined number of times.
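The loop method could be sketched roughly as follows. This is only an illustration under assumptions: fetch_page is a hypothetical callable that returns the list of results for one page (for example, a function that performs the request and returns raw.get("data")), and fetch_all is a made-up name.

```python
def fetch_all(fetch_page, results_per_page=500):
    """Collect the results of every page with a loop instead of recursion.

    fetch_page(page_number) is a hypothetical callable that returns the
    list of results for that page.
    """
    all_results = []
    page_number = 1
    while True:
        data = fetch_page(page_number) or []  # treat None as an empty page
        all_results.extend(data)
        # A short page means there is nothing left to fetch
        if len(data) < results_per_page:
            break
        page_number += 1
    return all_results
```

Unlike the recursive version, the loop does not grow the call stack with the number of pages. The combined list can then be written out in one go with json.dump(all_results, f).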
Do you follow? Do you have any further questions?
(I'm a little confused by what you mean by "load it into a JSON file". Do you mean saving the final raw results into a JSON file? Or are you referring to the .json() method in resp.json()? In that case, you don't need import json to do resp.json(). The .json() method on resp is actually part of the requests module.)
On a bonus point, you can make your HTTP requests asynchronous, but this is slightly beyond the scope of your original question.
P.s. I'm happy to learn what other solutions, perhaps more elegant ones, that people use.

CollectionView.reloadData() outputs cells in incorrect order

I am working on an app that requires a sync to the server after logging in to get all the activities the user has created and saved to the server. Currently, when the user logs in, a getActivity() function makes an API request and returns a response, which is then handled.
Say the user has 4 activities saved on the server in this order (the order is determined by the time the activity was created / saved):
Test
Bob
cvb
Testing
Looking at JSONHandler.getActivityResponse, it appears as though the results are in the correct order. If the request was successful, on the home page where these activities are to be displayed, I currently loop through them like so:
WebAPIHandler.shared.getActivityRequest(completion:
{
    success, results in DispatchQueue.main.async {
        if(success)
        {
            for _ in (results)!
            {
                guard let managedObjectContext = self.managedObjectContext else { return }
                let activity = Activity(context: managedObjectContext)
                activity.name = results![WebAPIHandler.shared.idCount].name
                print("activity name is - \(activity.name)")
                WebAPIHandler.shared.idCount += 1
            }
        }
And the print within the for loop is also outputting in the expected order;
activity name is - Optional("Test")
activity name is - Optional("Bob")
activity name is - Optional("cvb")
activity name is - Optional("Testing")
The CollectionView does then insert new cells, but seemingly in the wrong order. I'm using a carousel layout on the home page, and the 'cvb' object, for example, appears first in the list, while 'Bob' is third. I am using the following:
func controller(_ controller: NSFetchedResultsController<NSFetchRequestResult>, didChange anObject: Any, at indexPath: IndexPath?, for type: NSFetchedResultsChangeType, newIndexPath: IndexPath?)
{
    switch (type)
    {
    case .insert:
        if var indexPath = newIndexPath
        {
            // var itemCount = 0
            // var arrayWithIndexPaths: [IndexPath] = []
            //
            // for _ in 0..<(WebAPIHandler.shared.idCount)
            // {
            //     itemCount += 1
            //
            //     arrayWithIndexPaths.append(IndexPath(item: itemCount - 1, section: 0))
            //     print("itemCount = \(itemCount)")
            // }
            print("Insert object")
            // walkThroughCollectionView.insertItems(at: arrayWithIndexPaths)
            walkThroughCollectionView.reloadData()
        }
You can see why I've tried to use collectionView.insertItems() but that would cause an error stating:
Invalid update: invalid number of items in section 0. The number of items contained in an existing section after the update (4) must be equal to the number of items contained in that section before the update (4), plus or minus the number of items inserted or deleted from that section (4 inserted, 0 deleted)
I saw a lot of other answers mentioning how reloadData() would fix the issue, but I'm real stuck at this point. I've been using swift for several months now, and this has been the first time I'm truly at a loss. What I also realised is that the order displayed in the carousel is also different to a separate viewController which is passed the same data. I just have no idea why the results return in the correct order, but are then displayed in an incorrect order. Is there a way to sort data in the collectionView after calling reloadData() or am I looking at this from the wrong angle?
Any help would be much appreciated, cheers!
The order of the collection view is specified by the sort descriptor(s) of the fetched results controller.
Usually the workflow of inserting a new NSManagedObject is:
Insert the new object into the managed object context.
Save the context. This calls the delegate methods controllerWillChangeContent(_:), controller(_:didChange:at:for:newIndexPath:), etc.
In controller(_:didChange:at:for:newIndexPath:), insert the cell into the collection view with insertItems(at:), nothing else. Do not call reloadData() in this method.

Pyramid Unit Test Sending Parameter

I have a Pyramid web-application that I am trying to unit-test.
In my tests file I have this snippet of code:
anyparam = {"isApple": "True"}

@parameterized.expand([
    ("ParamA", anyparam, 'success')])
def test_(self, name, params, expected):
    request = testing.DummyRequest(params=params)
    request.session['AI'] = ''
    response = dothejob(request)
    self.assertEqual(response['status'], expected, "expected response['status']={0} but response={1}".format(expected, response))
Whereas in my views:
@view_config(route_name='function', renderer='json')
def dothejob(request):
    params = json.loads(request.body)
    value = params.get('isApple')  # true or false.
However, when I'm trying to unit-test it, I am getting this error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Yet when I make the same request as a POST from a web browser, it works perfectly fine.
By doing testing.DummyRequest(params=params) you are only populating request.params, not request.body.
You probably want to do something like:
request = testing.DummyRequest(json_body=params)
Also, you may want to use request.json_body directly in your code instead of json.loads(request.body).

Paging on Google Places API returns status INVALID_REQUEST

I'm using the Google Place API for place search:
https://developers.google.com/places/documentation/search
After the first query of the api, I'm getting the next page by setting the pagetoken. If I wait 2 seconds between requests, it works, but I notice that if I make the next query right after the previous one, it returns the status INVALID_REQUEST.
Is this some kind of rate limiting? I don't see this anywhere in the documentation.
https://developers.google.com/places/usage
Since each request has 20 places, getting a list of 100 results will take over 10 seconds which is a long time for someone to wait using an app.
It is documented, see the documentation
By default, each Nearby Search or Text Search returns up to 20 establishment results per
query; however, each search can return as many as 60 results, split across three pages. If
your search will return more than 20, then the search response will include an additional
value — next_page_token. Pass the value of the next_page_token to the pagetoken parameter of
a new search to see the next set of results. If the next_page_token is null, or is not
returned, then there are no further results. There is a short delay between when a
next_page_token is issued, and when it will become valid. Requesting the next page before it
is available will return an INVALID_REQUEST response. Retrying the request with the same
next_page_token will return the next page of results.
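The retry behaviour described in that passage could be sketched as below. This is a minimal illustration, not the API client itself: fetch is a hypothetical callable that performs the request for a given page token and returns the decoded JSON as a dict with a "status" key, and the attempt count and delay are arbitrary assumptions.

```python
import time

def fetch_next_page(fetch, page_token, max_attempts=5, delay_seconds=2.0,
                    sleep=time.sleep):
    """Retry the same page token until the status is no longer INVALID_REQUEST.

    fetch(page_token) is a hypothetical callable returning the decoded
    response as a dict. sleep is injectable so the wait can be stubbed out.
    """
    for attempt in range(max_attempts):
        response = fetch(page_token)
        if response.get("status") != "INVALID_REQUEST":
            return response
        # The token may not be valid yet: wait, then retry with the same token
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    raise RuntimeError(f"page token still invalid after {max_attempts} attempts")
```

Capping the number of attempts keeps a token that never becomes valid from blocking the caller forever.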
Although I'm not 100% sure this is the cause, I will leave this answer here as it took me around 6 hours to figure this out and may help someone.
As pointed out by geocodezip in his answer, there is a slight delay between the next page token being returned and that page actually being available. So I didn't find any other way to fix it other than making use of some sort of sleep.
But what I did find out is that every request after the first INVALID_REQUEST response was also giving an INVALID_REQUEST response, no matter if I waited 1, 2 or 10 seconds.
I suspect it has something to do with a cached response on Google's side. The solution I found is to append a new, incrementing parameter to the URL so as to make it a "different" URL, thereby requesting an uncached response.
The parameter I used was request_count and for every request made, I incremented it by 1.
For illustration purpose, here's the Python POC code I used (do not copy-paste as this is just a POC snippet and won't work):
# If except raises, it's most likely due to an invalid api key
# The while True is used to keep trying a new key each time
query_result_next_page = None
google_places = GooglePlaces(api_key)
invalid_requests_found = 0
request_count = 0
while True:
    request_count = request_count + 1
    try:
        query_result = google_places.nearby_search(
            lat_lng={'lat': event['latitude'], 'lng': event['longitude']},
            radius=event['radius'],
            pagetoken=query_result_next_page,
            request_count=request_count)
        # If there are additional result pages, lets get it on the next while step
        if query_result.has_next_page_token:
            query_result_next_page = query_result.next_page_token
        else:
            break
    except Exception as e:
        # If the key is over the query limit, try a new one
        if str(e).find('OVER_QUERY_LIMIT') != -1:
            logInfo("Key "+api_key+" unavailable.")
            self.set_unavailable_apikey(api_key)
            api_key = self.get_api_key()
        # Sometimes the Places API doesn't create the next page
        # despite having a next_page_key and throws an INVALID_REQUEST.
        # We should just sleep for a bit and try again.
        elif str(e).find('INVALID_REQUEST') != -1:
            # Maximum of 4 INVALID_REQUEST responses
            invalid_requests_found = invalid_requests_found + 1
            if invalid_requests_found > 4:
                raise e
            time.sleep(1)
            continue
        # If it is another error, different from zero results, raises an exception
        elif str(e).find('ZERO_RESULTS') == -1:
            raise e
        else:
            break
EDIT: Forgot to mention that the GooglePlaces object is from slimkrazy's Google API lib. Unfortunately, I had to tweak the code of the actual lib to accept this new request_count parameter.
I had to replace the nearby_search method for this:
def nearby_search(self, language=lang.ENGLISH, keyword=None, location=None,
                  lat_lng=None, name=None, radius=3200, rankby=ranking.PROMINENCE,
                  sensor=False, type=None, types=[], pagetoken=None, request_count=0):
    """Perform a nearby search using the Google Places API.

    One of either location, lat_lng or pagetoken are required, the rest of
    the keyword arguments are optional.

    keyword arguments:
    keyword  -- A term to be matched against all available fields, including
                but not limited to name, type, and address (default None)
    location -- A human readable location, e.g 'London, England'
                (default None)
    language -- The language code, indicating in which language the
                results should be returned, if possible. (default lang.ENGLISH)
    lat_lng  -- A dict containing the following keys: lat, lng
                (default None)
    name     -- A term to be matched against the names of the Places.
                Results will be restricted to those containing the passed
                name value. (default None)
    radius   -- The radius (in meters) around the location/lat_lng to
                restrict the search to. The maximum is 50000 meters.
                (default 3200)
    rankby   -- Specifies the order in which results are listed :
                ranking.PROMINENCE (default) or ranking.DISTANCE
                (imply no radius argument).
    sensor   -- Indicates whether or not the Place request came from a
                device using a location sensor (default False).
    type     -- Optional type param used to indicate place category.
    types    -- An optional list of types, restricting the results to
                Places (default []). If there is only one item the request
                will be send as type param.
    pagetoken-- Optional parameter to force the search result to return the next
                20 results from a previously run search. Setting this parameter
                will execute a search with the same parameters used previously.
                (default None)
    """
    if location is None and lat_lng is None and pagetoken is None:
        raise ValueError('One of location, lat_lng or pagetoken must be passed in.')
    if rankby == 'distance':
        # As per API docs rankby == distance:
        # One or more of keyword, name, or types is required.
        if keyword is None and types == [] and name is None:
            raise ValueError('When rankby = googleplaces.ranking.DISTANCE, ' +
                             'name, keyword or types kwargs ' +
                             'must be specified.')
    self._sensor = sensor
    radius = (radius if radius <= GooglePlaces.MAXIMUM_SEARCH_RADIUS
              else GooglePlaces.MAXIMUM_SEARCH_RADIUS)
    lat_lng_str = self._generate_lat_lng_string(lat_lng, location)
    self._request_params = {'location': lat_lng_str}
    if rankby == 'prominence':
        self._request_params['radius'] = radius
    else:
        self._request_params['rankby'] = rankby
    if type:
        self._request_params['type'] = type
    elif types:
        if len(types) == 1:
            self._request_params['type'] = types[0]
        elif len(types) > 1:
            self._request_params['types'] = '|'.join(types)
    if keyword is not None:
        self._request_params['keyword'] = keyword
    if name is not None:
        self._request_params['name'] = name
    if pagetoken is not None:
        self._request_params['pagetoken'] = pagetoken
    if language is not None:
        self._request_params['language'] = language
    self._request_params['request_count'] = request_count
    self._add_required_param_keys()
    url, places_response = _fetch_remote_json(
        GooglePlaces.NEARBY_SEARCH_API_URL, self._request_params)
    _validate_response(url, places_response)
    return GooglePlacesSearchResult(self, places_response)