How to dump all results of an API request when there is a page limit? - json

I am using an API to pull data from a URL, but the API has a pagination limit. It works like this:
page (default is 1; it's the page number you want to retrieve)
per_page (default is 100; it's the maximum number of results returned in the response (max = 500))
I have a script with which I can get the results of a single page or a given per_page value, but I want to automate it. I want to loop through all the pages (or 500 results per page) and load everything into a JSON file.
Here is my code that can get 500 results per_page:
import json, pprint
import requests

url = "https://my_api.com/v1/users?per_page=500"
header = {"Authorization": "Bearer <my_api_token>"}

s = requests.Session()
s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
resp = s.get(url, headers=header, verify=False)
raw = resp.json()

for x in raw:
    print(x)
The output is 500 results, but is there a way to keep going and pull the results starting from where it left off? Or even go page by page and get all the data until a page comes back empty?

It would be helpful if you could present a sample response from your API.
If the API is built properly, there will be a next property in a given response that leads you to the next page.
You can then keep calling the API with the link given in next, recursively. On the last page, there will be no next in the Link header.
resp.links["next"]["url"] will give you the URL of the next page.
For example, the GitHub API has next, last, first, and prev properties.
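As a quick illustration of the idea (a minimal sketch reusing the s, header, and url from your snippet; it assumes your API actually populates the Link header):
all_results = []
url_next = url  # start from your original per_page=500 URL
while url_next:
    resp = s.get(url_next, headers=header, verify=False)
    all_results.extend(resp.json())  # or resp.json()["data"], depending on the response shape
    # resp.links has no "next" entry on the last page
    url_next = resp.links.get("next", {}).get("url")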
To put that into a more structured form, first turn your code into functions.
Given that there is a maximum of 500 results per page, you are presumably extracting a list of records of some sort from the API. Often, these records are returned in a list somewhere inside raw.
For now, let's assume you want to extract all elements inside a list at raw.get('data').
import requests

header = {"Authorization": "Bearer <my_api_token>"}
results_per_page = 500


def compose_url():
    return (
        "https://my_api.com/v1/users"
        + "?per_page="
        + str(results_per_page)
        + "&page_number="
        + "1"
    )


def get_result(url=None):
    if url is None:
        url_get = compose_url()
    else:
        url_get = url
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    resp = s.get(url_get, headers=header, verify=False)
    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)
    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list
    if "next" not in resp.links:
        # We are at the last page, return data
        return data
    # Otherwise, recursively get results from the next url
    return data + get_result(resp.links["next"]["url"])  # concat lists


def main():
    # Driver function
    data = get_result()
    # Then you can print the data or save it to a file


if __name__ == "__main__":
    # Now run the driver function
    main()
However, if there isn't a proper Link header, I see two solutions:
(1) recursion and (2) a loop.
I'll demonstrate recursion.
As you have mentioned, when API responses are paginated, i.e. when there is a limit on the maximum number of results per page, there is often a query parameter such as a page number or a start index to indicate which "page" you are querying. So we'll utilize the page_number parameter in the code.
The logic is:
Given an HTTP response, if there are fewer than 500 results, there are no more pages; return the results.
If there are exactly 500 results in a given response, there is probably another page, so we advance page_number by 1, recurse (by calling the function itself), and concatenate with the previous results.
import requests

header = {"Authorization": "Bearer <my_api_token>"}
results_per_page = 500


def compose_url(results_per_page, current_page_number):
    return (
        "https://my_api.com/v1/users"
        + "?per_page="
        + str(results_per_page)
        + "&page_number="
        + str(current_page_number)
    )


def get_result(current_page_number):
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    url = compose_url(results_per_page, current_page_number)
    resp = s.get(url, headers=header, verify=False)
    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)
    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list
    # If the length of data is smaller than results_per_page (500 of them),
    # that means there are no more pages
    if len(data) < results_per_page:
        return data
    # Otherwise, advance the page number and do a recursion
    return data + get_result(current_page_number + 1)  # concat lists


def main():
    # Driver function
    data = get_result(1)
    # Then you can print the data or save it to a file


if __name__ == "__main__":
    # Now run the driver function
    main()
If you truly want to store the raw responses, you can. However, you'll still need to check the number of results in a given response. The logic is similar: if a given raw contains 500 results, there is probably another page, so we advance the page number by 1 and recurse.
Let's still assume raw.get('data') is the list whose length is the number of results.
Because JSON objects/dictionaries cannot simply be concatenated, you can store the raw of each page (which is a dictionary) in a list of raws. You can then parse and synthesize the data in whatever way you want.
Use the following get_result function:
def get_result(current_page_number):
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    url = compose_url(results_per_page, current_page_number)
    resp = s.get(url, headers=header, verify=False)
    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)
    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list
    if len(data) == results_per_page:
        return [raw] + get_result(current_page_number + 1)  # concat lists
    return [raw]  # convert raw into a list object on the fly
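A small follow-up, in case you go this route: once you have the list of raws, you can flatten it back into a single list of results (still assuming each page keeps its records under 'data'):
raws = get_result(1)  # list of per-page response dicts
all_data = [item for raw in raws for item in raw.get("data", [])]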
As for the loop method, the logic is similar to the recursion. Essentially, you call the get_result() function a number of times, collect the results, and break early when the latest page contains fewer than 500 results.
If you know the total number of results in advance, you can simply run the loop a predetermined number of times, as sketched below.
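For completeness, here is a rough loop-based sketch of the same logic (reusing compose_url, header, and results_per_page from above, with the same assumption about raw['data']; untested against your actual API):
def get_all_results():
    # Collect pages until one comes back with fewer than results_per_page records
    all_data = []
    current_page_number = 1
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    while True:
        url = compose_url(results_per_page, current_page_number)
        resp = s.get(url, headers=header, verify=False)
        if resp.status_code != 200:
            raise Exception(resp.status_code)
        data = resp.json().get("data")
        all_data += data
        if len(data) < results_per_page:
            break  # last (possibly partial) page reached
        current_page_number += 1
    return all_data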
Do you follow? Do you have any further questions?
(I'm a little confused by what you mean by "load it into a JSON file". Do you mean saving the final raw results into a JSON file? Or are you referring to the .json() method in resp.json()? In the latter case, you don't need import json to call resp.json(); the .json() method on resp is part of the requests module.)
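If you do mean writing the combined results out to a file, a minimal sketch (the filename users.json is just a placeholder):
import json

data = get_result(1)  # or whichever variant above you end up using
with open("users.json", "w") as f:
    json.dump(data, f, indent=2)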
As a bonus point, you can make your HTTP requests asynchronous, but this is slightly beyond the scope of your original question.
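For example (a rough sketch, not part of your original setup, and only applicable if you somehow know the total page count up front, since otherwise each response has to tell you whether another page exists): the independent page requests could be issued concurrently with a thread pool.
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch_page(page_number):
    # Hypothetical helper reusing compose_url(), header, and results_per_page
    # from above; your proxy and verify settings are omitted for brevity
    resp = requests.get(compose_url(results_per_page, page_number), headers=header)
    resp.raise_for_status()
    return resp.json().get("data", [])


# Assuming, say, 10 pages in total
with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(fetch_page, range(1, 11)))

all_data = [item for page in pages for item in page]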
P.s. I'm happy to learn what other, perhaps more elegant, solutions people use.

Related

Github API search code, missing items in JSON

I've been trying to build a tool that needs to fetch the URLs of all files in a GitHub code search result. For example, when you go here and search for uber.com api_key, you'll see that there are 381 code results, and I want to get the URLs of all 381 of those files.
In order to do that I learned how to use the GitHub API v3 and made the following function:
import json
import requests
from time import sleep, time


def fetchItems(search, GITHUB_API):
    items = set()
    response = {"items": [1]}
    pageNumber = 1
    while response["items"]:
        sleep(3)  # trying to avoid rate limit, not successful though :(
        url = "https://api.github.com/search/code"
        params = {
            "q": search,
            "per_page": 30,  # default value, it can be increased to 100
            "page": pageNumber
        }
        headers = {
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {GITHUB_API}"
        }
        r = requests.get(url=url, headers=headers, params=params, verify=False)
        if r.status_code == 403:  # if we exceed the rate limit, sleep until the rate limit gets reset
            epochReset = int(r.headers["X-Ratelimit-Reset"])
            epochNow = time()
            if epochNow < epochReset:
                sleep((epochReset - epochNow) + 1)
            sleep(1)
            continue
        response = json.loads(r.text)
        for file in response["items"]:
            items.add(file["html_url"])
        pageNumber += 1
    return items
The per_page parameter indicates the number of items returned on each page, and page is the page number :). By increasing the page number in every request, you should be able to get all items, as far as I understand.
However, when I opened my database and checked the items that had been written, I saw that there were only 377 files, so 4 of the files are missing.
Because of my reputation I can't post images, so click here.
I checked the db writer function and I'm sure there is nothing wrong with it. Does the GitHub API return JSON with missing items, or am I doing something wrong?

Count the number of people having a property bounded by two numbers

The following code goes over the 10 pages of JSON returned by a GET request to the URL and checks how many records satisfy the condition that bloodPressureDiastole is between the specified limits. It does the job, but I was wondering if there was a better or cleaner way to achieve this in Python.
import urllib.request
import urllib.parse
import json

baseUrl = 'https://jsonmock.hackerrank.com/api/medical_records?page='
count = 0
for i in range(1, 11):
    url = baseUrl + str(i)
    f = urllib.request.urlopen(url)
    response = f.read().decode('utf-8')
    response = json.loads(response)
    lowerlimit = 110
    upperlimit = 120
    for elem in response['data']:
        bd = elem['vitals']['bloodPressureDiastole']
        if bd >= lowerlimit and bd <= upperlimit:
            count = count + 1
print(count)
You cannot access JSON content through attributes/fields because json.loads gives you a dict object (see the translation scheme here). A dict provides access via the __getitem__ method (dict[key]) rather than __getattr__ (object.field), since keys may be any hashable objects, not only strings. Moreover, even string keys cannot serve as attributes if they start with digits or clash with built-in dictionary methods.
Despite this, you can define your own custom class implementing the desired behaviour for acceptable key names. json.loads has an object_hook argument, into which you may put any callable (function or class) that takes a dict as its sole argument (not only the top-level one but every dict in the JSON, recursively) and returns something as the result. If your JSON matches some template, you can define a class with predefined fields for the JSON content, and even with methods, in order to get a robust Python object as part of your domain logic.
For instance, let's implement access through fields. I get the JSON content from the response.json method of requests, but it takes the same arguments as the json package. The comments in the code contain remarks on how to make your code more Pythonic.
from collections import Counter
from requests import get


class CustomJSON(dict):
    def __getattr__(self, key):
        return self[key]

    def __setattr__(self, key, value):
        self[key] = value


LOWER_LIMIT = 110  # Constants should be in uppercase.
UPPER_LIMIT = 120

base_url = 'https://jsonmock.hackerrank.com/api/medical_records'
# It is better to use special tools for handling URLs
# in order to evade possible exceptions in the future.
# By the way, your option could look clearer with f-strings
# that can put values from variables (not only) in-place:
# url = f'https://jsonmock.hackerrank.com/api/medical_records?page={i}'

counter = Counter(normal_pressure=0)
# It might be left as it was. This option is useful
# in case of additionally counting any other information.

for page_number in range(1, 11):
    records = get(
        base_url, params={"page": page_number}
    ).json(object_hook=CustomJSON)
    # Python has a pile of libraries for handling URL requests & responses.
    # urllib is a standard library rewritten from scratch for Python 3.
    # However, there is a more featured (connection pooling, redirections, proxies,
    # SSL verification &c.) & convenient third-party
    # (this is the only disadvantage) library: urllib3.
    # Based on it, requests provides an easier, more convenient & friendlier way
    # to work with URL requests. So I highly recommend using it
    # unless you are aiming for complex connections & URL processing.
    for elem in records.data:
        if LOWER_LIMIT <= elem.vitals.bloodPressureDiastole <= UPPER_LIMIT:
            counter["normal_pressure"] += 1

print(counter)
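As a quick sanity check of the object_hook behaviour (a standalone snippet, not part of the exercise itself): attribute access works at every nesting level because json.loads applies the hook to each dict it decodes.
import json

doc = json.loads(
    '{"data": [{"vitals": {"bloodPressureDiastole": 115}}]}',
    object_hook=CustomJSON,
)
print(doc.data[0].vitals.bloodPressureDiastole)  # 115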

How to ignore NoneType error in tweet ID statuses_lookup

I am trying to collect tweets with tweepy from a list of tweet IDs, good_tweet_ids_test, using statuses_lookup.
Since the list is a bit old, some tweets will have been deleted by now. Therefore I ignore errors in the lookup_tweets function, so it does not stop each time.
Here is my code so far:
import json

import pandas as pd


def lookup_tweets(tweet_IDs, api):
    full_tweets = []
    tweet_count = len(tweet_IDs)
    try:
        for i in range((tweet_count // 100) + 1):
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            full_tweets.extend(
                api.statuses_lookup(tweet_IDs[i * 100:end_loc], tweet_mode='extended')
            )
        return full_tweets
    except:
        pass


results = lookup_tweets(good_tweet_ids_test, api)
temp = json.dumps([status._json for status in results])  # create JSON
newdf = pd.read_json(temp, orient='records')
newdf.to_json('tweepy_tweets.json')
But when I run the temp = json.dumps([status._json for status in results]) line, it gives me the error:
TypeError: 'NoneType' object is not iterable
I do not know how to fix this. I believe the type of some of the statuses is None, because they have been deleted and can therefore not be looked up now. I simply wish for my code to move on to the next status if the type is None.
EDIT: As has been pointed out, the issue is that results is None. So now I think I need to exclude None values from the full_tweets variable, but I cannot figure out how. Any help?
EDIT 2: With further testing I have found out that results is only None when the batch contains a tweet ID that has since been deleted. If the batch contains only active tweets, it works. So I think I need to figure out how to have my function look up the batch of tweets and only return those that are not None. Any help on this?
Instead of implicitly returning None when there's an error, you could explicitly return an empty list. That way, the result of lookup_tweets will always be iterable, and the calling code won't have to check its result:
def lookup_tweets(tweet_IDs, api):
    full_tweets = []
    tweet_count = len(tweet_IDs)
    try:
        for i in range((tweet_count // 100) + 1):
            # Catch the last group if it is less than 100 tweets
            end_loc = min((i + 1) * 100, tweet_count)
            full_tweets.extend(
                api.statuses_lookup(tweet_IDs[i * 100:end_loc], tweet_mode='extended')
            )
        return full_tweets
    except:
        return []  # Here!
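With that change, the calling code stays as it was and no longer raises the TypeError. If you also want to be defensive about individual entries, you can filter out None values when serializing (the filter below is just a precaution, not something statuses_lookup is documented to require):
results = lookup_tweets(good_tweet_ids_test, api)  # now always a list
temp = json.dumps([status._json for status in results if status is not None])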

Django: return a StreamingHttpResponse on an existing html page

Since it is better to have a single question for each issue, please be patient if this is similar to another part of another question of mine related to the same project.
The situation:
I have an HTML form in which I can set a number. When it is submitted, views.stream_response is called, which passes the value to stream.py; that returns a StreamingHttpResponse, and a "virtual" blank browser page appears (/stream_response/) in which I can see a progressive number every second, up to m:
1
2
3
..
m
stream.py
import time


def streamx(m):
    lista = []
    x = 0
    while len(lista) < m:
        x = x + 1
        time.sleep(1)
        lista.append(x)
        yield "<div>%s</div>\n" % x
        print(lista[-1])
    return (x)
---UPDATE---
views.py
def stream_response(request):
    test = InputNumeroForm()
    if request.method == 'POST':
        test = InputNumeroForm(data=request.POST)
        if test.is_valid():
            m = test.cleaned_data['numero']
            print(test)
            print("m = ", m)
            #resp = StreamingHttpResponse(stream_response_generator(m))
            resp = StreamingHttpResponse(stream.streamx(m))
            return resp
    return render(request, 'homepage/provadata.html', {'user.username': request, 'test': test}, context_instance=RequestContext(request))
urls.py
...
url(r'^homepage/provadata/$', views.provadata),
url(r'^stream_response/$', views.stream_response, name='stream_response'),
...
homepage/provadata.html
<form id="numero" action="/stream_response_bis/" method="POST">
{% csrf_token %}
{{test}}
<input type="submit" value="to view" />
</form>
//{{ris}}
I tried to do a render_to_response to stay on homepage/provadata.html and see the progressive list, but stream.py does not start and I can see only the input number m on the command line.
I tried with THIS suggestion in views.py
def stream_response_generator(m):
    ris = stream.streamx(m)
    yield loader.get_template('homepage/provadata.html').render(Context({'ris': ris}))
(adding {{ris}} to the template and
resp = StreamingHttpResponse(stream_response_generator(m)) in the stream_response function)
but I obtain on the template:
<generator object streamx at 0x0000000004BEB870>
And on the command line it prints the input value, but it no longer passes the parameter to stream.py.
So.. How can I solve this issue?
You can use StreamingHttpResponse to indicate that you want to stream results back, and all the middleware that ships with Django is aware of this and acts accordingly: it does not buffer your content output but sends it straight down the line.
You can disable the ETag middleware using the condition decorator. That will get your response to stream back over HTTP. You can confirm this with a command-line tool like curl, but it probably won't be enough to get your browser to show the response as it streams. To encourage the browser to render the response as it streams, you can push a bunch of whitespace down the pipe to force its buffers to fill. An example follows:
import time

from django.http import StreamingHttpResponse
from django.views.decorators.http import condition


@condition(etag_func=None)
def stream_response(request):
    resp = StreamingHttpResponse(stream_response_generator(), content_type='text/html')
    return resp


def stream_response_generator():
    yield "<html><body>\n"
    for x in range(1, 11):
        yield "<div>%s</div>\n" % x
        yield " " * 1024  # Encourage browser to render incrementally
        time.sleep(1)
    yield "</body></html>\n"

Paging on Google Places API returns status INVALID_REQUEST

I'm using the Google Places API for place search:
https://developers.google.com/places/documentation/search
After the first query of the API, I get the next page by setting the pagetoken. If I wait 2 seconds between requests it works, but I notice that if I make the next query right after the previous one, it returns the status INVALID_REQUEST.
Is this some kind of rate limiting? I don't see this anywhere in the documentation.
https://developers.google.com/places/usage
Since each request has 20 places, getting a list of 100 results will take over 10 seconds which is a long time for someone to wait using an app.
It is documented; see the documentation:
By default, each Nearby Search or Text Search returns up to 20 establishment results per query; however, each search can return as many as 60 results, split across three pages. If your search will return more than 20, then the search response will include an additional value — next_page_token. Pass the value of the next_page_token to the pagetoken parameter of a new search to see the next set of results. If the next_page_token is null, or is not returned, then there are no further results. There is a short delay between when a next_page_token is issued, and when it will become valid. Requesting the next page before it is available will return an INVALID_REQUEST response. Retrying the request with the same next_page_token will return the next page of results.
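In practice that means pausing (or retrying) before using a freshly issued next_page_token. A minimal sketch against the raw web-service endpoint (your API key and location are placeholders):
import time

import requests

SEARCH_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
params = {"key": "<your_api_key>", "location": "40.7128,-74.0060", "radius": 1500}

results = []
while True:
    resp = requests.get(SEARCH_URL, params=params).json()
    results.extend(resp.get("results", []))
    token = resp.get("next_page_token")
    if not token:
        break
    time.sleep(2)  # give the token a moment to become valid before the next call
    params = {"key": "<your_api_key>", "pagetoken": token}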
Although I'm not 100% sure this is the cause, I will leave this answer here, as it took me around 6 hours to figure this out and it may help someone.
As pointed out by geocodezip in his answer, there is a slight delay between the next page token being returned and that page actually becoming available. So I didn't find any other way to fix it than making use of some sort of sleep.
But what I did find out is that every request after the first INVALID_REQUEST response was also giving an INVALID_REQUEST response, no matter whether I waited 1, 2 or 10 seconds.
I suspect it has something to do with a cached response on Google's part. The solution I found is to append a new, incrementing parameter to the URL so as to make it a "different" URL, therefore requesting a cacheless response.
The parameter I used was request_count, and for every request made, I incremented it by 1.
For illustration purposes, here's the Python POC code I used (don't copy-paste it as-is; it's just a POC snippet and won't work on its own):
# If except raises, it's most likely due to an invalid api key
# The while True is used to keep trying a new key each time
query_result_next_page = None
google_places = GooglePlaces(api_key)
invalid_requests_found = 0
request_count = 0
while True:
    request_count = request_count + 1
    try:
        query_result = google_places.nearby_search(
            lat_lng={'lat': event['latitude'], 'lng': event['longitude']},
            radius=event['radius'],
            pagetoken=query_result_next_page,
            request_count=request_count)
        # If there are additional result pages, lets get it on the next while step
        if query_result.has_next_page_token:
            query_result_next_page = query_result.next_page_token
        else:
            break
    except Exception as e:
        # If the key is over the query limit, try a new one
        if str(e).find('OVER_QUERY_LIMIT') != -1:
            logInfo("Key " + api_key + " unavailable.")
            self.set_unavailable_apikey(api_key)
            api_key = self.get_api_key()
        # Sometimes the Places API doesn't create the next page
        # despite having a next_page_key and throws an INVALID_REQUEST.
        # We should just sleep for a bit and try again.
        elif str(e).find('INVALID_REQUEST') != -1:
            # Maximum of 4 INVALID_REQUEST responses
            invalid_requests_found = invalid_requests_found + 1
            if invalid_requests_found > 4:
                raise e
            time.sleep(1)
            continue
        # If it is another error, different from zero results, raises an exception
        elif str(e).find('ZERO_RESULTS') == -1:
            raise e
        else:
            break
EDIT: Forgot to mention that the GooglePlaces object is from slimkrazy's Google API lib. Unfortunately I had to tweak the code of the actual lib to accept this new request_count parameter.
I had to replace the nearby_search method with this:
def nearby_search(self, language=lang.ENGLISH, keyword=None, location=None,
                  lat_lng=None, name=None, radius=3200, rankby=ranking.PROMINENCE,
                  sensor=False, type=None, types=[], pagetoken=None, request_count=0):
    """Perform a nearby search using the Google Places API.

    One of either location, lat_lng or pagetoken are required, the rest of
    the keyword arguments are optional.

    keyword arguments:
    keyword   -- A term to be matched against all available fields, including
                 but not limited to name, type, and address (default None)
    location  -- A human readable location, e.g 'London, England'
                 (default None)
    language  -- The language code, indicating in which language the
                 results should be returned, if possible. (default lang.ENGLISH)
    lat_lng   -- A dict containing the following keys: lat, lng
                 (default None)
    name      -- A term to be matched against the names of the Places.
                 Results will be restricted to those containing the passed
                 name value. (default None)
    radius    -- The radius (in meters) around the location/lat_lng to
                 restrict the search to. The maximum is 50000 meters.
                 (default 3200)
    rankby    -- Specifies the order in which results are listed:
                 ranking.PROMINENCE (default) or ranking.DISTANCE
                 (imply no radius argument).
    sensor    -- Indicates whether or not the Place request came from a
                 device using a location sensor (default False).
    type      -- Optional type param used to indicate place category.
    types     -- An optional list of types, restricting the results to
                 Places (default []). If there is only one item the request
                 will be send as type param.
    pagetoken -- Optional parameter to force the search result to return the next
                 20 results from a previously run search. Setting this parameter
                 will execute a search with the same parameters used previously.
                 (default None)
    """
    if location is None and lat_lng is None and pagetoken is None:
        raise ValueError('One of location, lat_lng or pagetoken must be passed in.')
    if rankby == 'distance':
        # As per API docs rankby == distance:
        # One or more of keyword, name, or types is required.
        if keyword is None and types == [] and name is None:
            raise ValueError('When rankby = googleplaces.ranking.DISTANCE, ' +
                             'name, keyword or types kwargs ' +
                             'must be specified.')
    self._sensor = sensor
    radius = (radius if radius <= GooglePlaces.MAXIMUM_SEARCH_RADIUS
              else GooglePlaces.MAXIMUM_SEARCH_RADIUS)
    lat_lng_str = self._generate_lat_lng_string(lat_lng, location)
    self._request_params = {'location': lat_lng_str}
    if rankby == 'prominence':
        self._request_params['radius'] = radius
    else:
        self._request_params['rankby'] = rankby
    if type:
        self._request_params['type'] = type
    elif types:
        if len(types) == 1:
            self._request_params['type'] = types[0]
        elif len(types) > 1:
            self._request_params['types'] = '|'.join(types)
    if keyword is not None:
        self._request_params['keyword'] = keyword
    if name is not None:
        self._request_params['name'] = name
    if pagetoken is not None:
        self._request_params['pagetoken'] = pagetoken
    if language is not None:
        self._request_params['language'] = language
    self._request_params['request_count'] = request_count
    self._add_required_param_keys()
    url, places_response = _fetch_remote_json(
        GooglePlaces.NEARBY_SEARCH_API_URL, self._request_params)
    _validate_response(url, places_response)
    return GooglePlacesSearchResult(self, places_response)