How to restrict fields returned by stackexchange api, and turn off paging?

I'd like to have a list of just the current titles for all questions on one of the smaller (less than 10,000 questions) stackexchange sites. I tried the interactive utility here: https://api.stackexchange.com/docs/questions. It both reports the result as JSON at the bottom and produces the requesting URL at the top. For example:
https://api.stackexchange.com/2.2/questions?order=desc&sort=activity&tagged=apples&site=cooking
returns this JSON in my browser:
{"items":[{"tags":["apples","crumble"],"owner":{ ...
...
...],"has_more":true,"quota_max":300,"quota_remaining":252}
What is quota? It was 10,000 on one search on one site, but suddenly it's only 300 here.
I won't be doing this very often. What I'd like is the quickest way to edit that (or a similar) URL so I can get a list of all of the titles on a small site. I don't understand how to use paging, and I don't need any of the other fields. I don't care if I get them, but I'm thinking that if I exclude them I can get more items at once.
If I need to script it, python (2.7) is my preferred (only) language.

quota_max is the number of requests your application is allowed per day. 300 is the default for an unregistered application. This used to be mentioned directly on the page describing throttles, but seems to have been removed. Here is historical information describing the default.
To increase this to 10,000, you need to register an application and then authenticate by passing an access token in your script.
To get all titles on a site, you can use a Python library to help:
StackAPI. The answer below will use this library. DISCLAIMER: I wrote this library
Py-StackExchange
SEAPI
StackPy
Assuming you have registered your application and authenticated we can proceed.
First, install StackAPI (documentation):
pip install stackapi
This code will then grab the 10,000 most recent questions (max_pages * page_size) for the site hardwarerecs. Each page costs you one API hit, so the more items per page, the fewer API calls.
from stackapi import StackAPI
SITE = StackAPI('hardwarerecs')
SITE.page_size = 100
SITE.max_pages = 100
# Filter to only get question title and link
filter = '!BHMIbze0EQ*ved8LyoO6rNjkuLgHPR'
questions = SITE.fetch('questions', filter=filter)
In the questions variable is a dictionary that looks very similar to the API output, except that the library did all the paging for you. Your data is in questions['data'] and, in this case, contains a list of dictionaries that look like this:
[
...
{u'link': u'http://hardwarerecs.stackexchange.com/questions/29/sound-board-to-replace-a-gl2200-in-a-house-of-worship-foh-setting',
u'title': u'Sound board to replace a GL2200 in a house-of-worship FOH setting?'},
{ u'link': u'http://hardwarerecs.stackexchange.com/questions/31/passive-gps-tracker-logger',
u'title': u'Passive GPS tracker/logger'}
...
]
This result set is limited to only the title and the link because of the filter we applied. You can find the appropriate filter by adjusting what fields you want in the web UI and copying the filter field.
The hardwarerecs parameter that is passed when creating the SITE object is the first part of the site's domain URL. Alternatively, you can find it by looking at the api_site_parameter for your site in the output of the /sites endpoint.
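If you would rather not install a library, the paging can also be done by hand. The following is a minimal sketch, not the library's implementation: it assumes Python 3 (not the 2.7 mentioned in the question), uses the built-in filter named default rather than a custom one, and relies on the has_more and backoff fields of the API's response wrapper. The function names fetch_page and collect_titles are made up for this example.

```python
import gzip
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.stackexchange.com/2.2/questions"


def fetch_page(site, page, pagesize=100, api_filter="default"):
    """Fetch one page of questions and return the decoded JSON wrapper."""
    query = urlencode({
        "site": site,
        "page": page,
        "pagesize": pagesize,
        "order": "desc",
        "sort": "activity",
        "filter": api_filter,
    })
    with urlopen(API_ROOT + "?" + query) as response:
        raw = response.read()
    # Responses are normally gzip-compressed; detect that by the magic bytes.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


def collect_titles(get_page, max_pages=100):
    """Call get_page(1), get_page(2), ... until has_more is false."""
    titles = []
    for page in range(1, max_pages + 1):
        wrapper = get_page(page)
        titles.extend(item["title"] for item in wrapper.get("items", []))
        if not wrapper.get("has_more"):
            break
        # Honour the server's request to slow down, if it sent one.
        time.sleep(wrapper.get("backoff", 0))
    return titles
```

A call such as collect_titles(lambda page: fetch_page("cooking", page)) would then walk the pages one request at a time; each page still costs one unit of quota.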

Related

Data Studio connector making multiple calls to API when it should only be making 1

I'm finalizing a Data Studio connector and noticing some odd behavior with the number of API calls.
Where I'm expecting to see a single API call, I'm seeing multiple calls.
In my apps script I'm keeping a simple tally which increments by 1 every url fetch and that is giving me the correct number I expect to see with getData().
However, in my API monitoring logs (using Runscope) I'm seeing multiple API requests for the same endpoint, and varying numbers for different endpoints in a single getData() call (they should all be the same).
I can't post the code here (client project) but it's substantially the same framework as the Data Connector code on Google's docs. I have caching and backoff implemented.
Looking for any ideas or if anyone has experienced something similar?
Thanks
Per this reference, GDS will also perform semantic type detection if you aren't explicitly defining this property for your fields. If the query is for semantic type detection, the request will feature sampleExtraction: true.
When Data Studio executes the getData function of a community connector for the purpose of semantic detection, the incoming request will contain a sampleExtraction property which will be set to true.
If the GDS report includes multiple widgets with different dimensions/metrics configuration then GDS might fire multiple getData calls for each of them.
Kind of a late answer but this might help others who are facing the same problem.
The widgets / search filters attached to a graph issue getData calls of their own. If your custom connector retrieves data from third-party services via API calls, and that data does not depend on the request.fields property sent by GDS, then these API calls are multiplied by N+1 (where N = the number of widgets / search filters in your report).
I could not find an official solution for this either, so I invented a workaround using cache.
The graph's request for getData (typically requesting more fields than the Search Filters) will be the only one allowed to query the API Endpoint. Before starting to do so it will store a key in the cache "cache_{hashOfReportParameters}_building" => true.
if (enableCache) {
  cache.putString("cache_{hashOfReportParameters}_building", 'true');
  Logger.log("Cache is being built...");
}
It will retrieve API responses, paginating in a loop, and buffer the results.
Once it has finished, it will delete the cache key "cache_{hashOfReportParameters}_building" and cache the final merged results it buffered inside "cache_{hashOfReportParameters}_final".
Filters also invoke getData, but typically with at most 3 requested fields. The first thing we want to do is make sure they cannot start executing before the primary getData call, so we add a small delay for requests that are likely to be search filters / widgets after the same data set:
if (enableCache) {
  var countRequestedFields = requestedFields.asArray().length;
  Logger.log("Total Requested fields: " + countRequestedFields);
  // Requests with very few fields are most likely search filters.
  if (countRequestedFields <= 3) {
    Logger.log('This seems to be a search filter.');
    Utilities.sleep(1000);
  }
}
After that we compute a hash on all of the moving parts of the report (date range, plus all of the other parameters you have set up that could influence the data retrieved from your API endpoints).
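The original code for this step is not shown. As an illustration only, here is one way such a hash could be computed, sketched in Python rather than Apps Script. The function name report_key and the example parameter names are hypothetical; the idea is simply to serialise the parameters in a stable order before hashing, so that the same report always yields the same key.

```python
import hashlib
import json


def report_key(params):
    """Return a stable hex digest for a dict of report parameters."""
    # sort_keys makes the serialisation independent of insertion order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical report parameters, for illustration only.
digest = report_key({"startDate": "2020-01-01", "endDate": "2020-01-31"})
building_key = "cache_%s_building" % digest
final_key = "cache_%s_final" % digest
```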
Now the best part, as long as the main graph is still building the cache, we make these getData calls wait:
while (cache.getString('cache_{hashOfReportParameters}_building') === 'true') {
  Logger.log('A similar request is already executing, please wait...');
  Utilities.sleep(2000);
}
After this loop we attempt to retrieve the contents of "cache_{hashOfReportParameters}_final". In case that fails, it's always a good idea to have a backup plan, which would be to allow the request to traverse the API again. We have encountered a ~2% error rate retrieving data we cached.
With the cached result (or buffered API responses), you just transform your response as per the schema GDS needs (which differs between graphs and filters).
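The whole flow described above can be summarised in a short sketch. This is not the client code (which cannot be shared) and it is written in Python for illustration, with a plain dict standing in for the cache; the names get_report_rows, fetch_all and is_primary are assumptions made for this example. One deliberate difference: the final result is written before the building flag is removed, so a waiting request never sees the flag cleared while the result is still missing.

```python
import json
import time


def get_report_rows(cache, digest, fetch_all, is_primary, poll_seconds=2.0):
    """Return rows for a report, letting only the primary request hit the API."""
    building_key = "cache_%s_building" % digest
    final_key = "cache_%s_final" % digest

    if is_primary:
        cache[building_key] = "true"
        try:
            rows = fetch_all()  # paginate the API and merge the pages
            cache[final_key] = json.dumps(rows)
        finally:
            # Always clear the flag, even if the API call failed.
            cache.pop(building_key, None)
        return rows

    # Secondary requests (filters, widgets) wait while the cache is built.
    while cache.get(building_key) == "true":
        time.sleep(poll_seconds)

    cached = cache.get(final_key)
    if cached is not None:
        return json.loads(cached)
    # Backup plan: the cached value is missing, so query the API after all.
    return fetch_all()
```

In a real connector the primary and secondary requests run concurrently in separate executions, so the cache must be a shared store rather than an in-process dict.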
As you start implementing this, you'll notice yet another problem: Google Cache is limited to a maximum of 100KB per key. There is, however, no limit on the number of keys you can cache, and fortunately others have encountered similar needs in the past and come up with a smart solution: split the one big chunk you need cached across multiple cache keys, and glue the pieces back together into one object on retrieval.
See: https://github.com/lwbuck01/GASs/blob/b5885e34335d531e00f8d45be4205980d91d976a/EnhancedCacheService/EnhancedCache.gs
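The splitting idea itself is simple. The sketch below is not the linked EnhancedCache code; it is a minimal Python illustration with a dict standing in for the cache, and the key layout (a "_count" key plus numbered chunk keys) is an assumption made for this example. Note that it slices by characters, so the chunk size should be kept well under the byte limit when the text contains multi-byte characters.

```python
CHUNK_SIZE = 90 * 1024  # stay below the 100KB per-key limit


def put_chunked(cache, key, text, chunk_size=CHUNK_SIZE):
    """Store text under several keys, each holding at most chunk_size characters."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    for index, chunk in enumerate(chunks):
        cache["%s_%d" % (key, index)] = chunk
    # Record how many pieces there are so they can be reassembled.
    cache[key + "_count"] = str(len(chunks))


def get_chunked(cache, key):
    """Reassemble the text, or return None if any piece is missing."""
    count = cache.get(key + "_count")
    if count is None:
        return None
    parts = [cache.get("%s_%d" % (key, index)) for index in range(int(count))]
    if any(part is None for part in parts):
        return None  # one chunk expired or was evicted; treat as a miss
    return "".join(parts)
```

Returning None when a single chunk is missing matters in practice, because cache entries can expire independently of each other.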
I cannot share the final solution we have implemented with you as it is too specific to a client - but I hope that this will at least give you a good idea on how to approach the problem.
Caching the full API result is a good idea in general to avoid round trips and server load for no good reason if near-realtime is good enough for your needs.

Amazon: product advertising api pagination top sellers

Is this a limitation of the amazon API?
I would like to pull data similar to this page: amazon.com/Best-Sellers-Home-Improvement-Pumps-Plumbing-Equipment/zgbs/hi/13749581/ref=zg_bs_nav_hi_1_hi
STACKOVERFLOW BREAKS THIS LINK!
I am using:
operation: 'BrowseNodeLookup',
response_group: "BrowseNodeInfo,TopSellers"
The TopSeller response group only returns 10 items and does not respond to ItemPage.
Is there a way to do item lookup without a query using a browse node and sorting by popularity?
The AWS documentation on the BrowseNodeLookup API and the TopSellers response group indicates that it only includes the top 10, and there is no mention of pagination.
The TopSellers response group returns the ASINs and titles of the 10 best sellers within a specified browse node.
However, the results from TopSellers are basically equivalent to the results of an ItemSearch with Sort set to salesrank. Therefore, you can solve pagination requirements as follows:
On initial load (such as a user loading a web page or opening a particular view in a mobile application), issue BrowseNodeLookup and retrieve TopSellers. Populate some portion of the UI with information from the browse node and some other portion of the UI with the TopSellers results.
If the user never goes past the first page, then do nothing more. (There is no need to spend time on an additional service call.)
As the user navigates to subsequent pages, issue ItemSearch with Sort set to salesrank and ItemPage set to the page number. Use these results to update the portion of the web page/view in your application that was previously populated from the browse node TopSellers.
Note that you will still only be able to retrieve up to 10 pages worth of results. This is an ItemSearch API limitation.
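The page-switching logic above can be sketched as follows. This is only an outline in Python: the two callables top_sellers and item_search are hypothetical stand-ins for whatever client code actually issues BrowseNodeLookup and ItemSearch, and a real ItemSearch request needs further parameters (such as a search index and credentials) that the callable is assumed to supply.

```python
MAX_ITEM_PAGE = 10  # ItemSearch does not return pages beyond this


def best_sellers_page(page, browse_node_id, top_sellers, item_search):
    """Return one page of best sellers for a browse node."""
    if page < 1 or page > MAX_ITEM_PAGE:
        raise ValueError("page must be between 1 and %d" % MAX_ITEM_PAGE)
    if page == 1:
        # The first page comes from BrowseNodeLookup with TopSellers.
        return top_sellers(browse_node_id)
    # Later pages come from ItemSearch sorted by sales rank.
    return item_search(BrowseNode=browse_node_id, Sort="salesrank", ItemPage=page)
```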

Instagram Media Endpoint Paging

I'm currently looking at reading out posts and related json data from a given number of Instagram users using the following URL:
https://www.instagram.com/[user-login]/media/
This will only bring back the latest 20 posts. I have done some hunting around and I am unable to see how to form the url to bring back the next 20 results. I've seen some places that have suggested using max_timestamp, but I can't see how to make this work.
For various reasons I do not wish to use the standard Instagram API.
You should use the max_id parameter for pagination.
Example: https://www.instagram.com/[user-login]/media/?max_id=[last-min-id], where [last-min-id] is the smallest id from the previous page. The item with that id is not repeated on the new page.
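A paging loop built on max_id could look roughly like the sketch below. It assumes each item carries an "id" field and that items arrive newest first, so the last item on a page holds the smallest id; the response format is not documented here, so fetch_items is a hypothetical callable that takes a URL and returns the list of items for that page.

```python
def media_url(user_login, max_id=None):
    """Build the media URL, optionally continuing after max_id."""
    url = "https://www.instagram.com/%s/media/" % user_login
    if max_id is not None:
        url += "?max_id=%s" % max_id
    return url


def iter_media(user_login, fetch_items, max_pages=50):
    """Yield items page by page until an empty page comes back."""
    max_id = None
    for _ in range(max_pages):
        items = fetch_items(media_url(user_login, max_id))
        if not items:
            return
        for item in items:
            yield item
        # Assumption: the last item on the page has the smallest id.
        max_id = items[-1]["id"]
```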
The endpoint 'https://www.instagram.com/[user-login]/media/' has been turned off at some point in the last few days; I'm unsure exactly when.
If you are dependent on it, you might want to check it now in your apps.
e.g. https://www.instagram.com/fosterandpartners/media/

page and page_size parameters are ignored for get_groups, get_group_folders, and get_group_users

I'm working on an application that uses the Box v1 "enterprise" APIs for user and group management (the v2 API doesn't have these methods yet). Specifically, I'm enumerating groups and their associated folders and users using get_groups, get_group_folders, and get_group_users.
I have a large number of groups and folders in my organization, and I'm unable to page through the results; I only get 20 items at a time from each of these APIs. I've tried variations on the page and page_size parameters listed in the API docs, but they don't seem to do anything.
Specifically, each of these three requests gives me the same 20 groups back:
https://www.box.net/api/1.0/rest?api_key=XXX&auth_token=YYY&action=get_groups
https://www.box.net/api/1.0/rest?api_key=XXX&auth_token=YYY&action=get_groups&page=2
https://www.box.net/api/1.0/rest?api_key=XXX&auth_token=YYY&action=get_groups&page_size=50
The same goes for get_group_folders and get_group_users.
For optional parameters you do need to format them within params[]. For example when changing the page_size, your request would be:
http://box.net/api/1.0/rest?action=get_groups&api_key=API_KEY&auth_token=AUTH_TOKEN&params[page_size]=VALUE .
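Building such URLs by hand is error-prone, so a small helper may be useful. This is a sketch for Python 3; the function name box_v1_url is made up, and it assumes that every optional parameter (page as well as page_size) is wrapped in params[] the same way.

```python
from urllib.parse import urlencode


def box_v1_url(action, api_key, auth_token, **optional):
    """Build a Box v1 REST URL, wrapping optional parameters in params[...]."""
    query = [
        ("action", action),
        ("api_key", api_key),
        ("auth_token", auth_token),
    ]
    # Sort for a predictable parameter order.
    for name, value in sorted(optional.items()):
        query.append(("params[%s]" % name, value))
    # safe="[]" keeps the brackets readable instead of percent-encoding them.
    return "https://www.box.net/api/1.0/rest?" + urlencode(query, safe="[]")
```

For example, box_v1_url("get_groups", "API_KEY", "AUTH_TOKEN", page_size=50) produces a URL ending in &params[page_size]=50.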

Tweet counter for identi.ca

Is there a way to retrieve the number of times a certain URL was "dented" (shared on identi.ca, status.net and/or the like)?
For twitter there are several services that give this information.
Twitter itself: http://urls.api.twitter.com/1/urls/count.json?url=http://example.com&callback=twttr.receiveCount
Tweetmeme: http://api.tweetmeme.com/url_info.jsonc?url=http://example.com
Topsy: http://otter.topsy.com/stats.js?url=http://example.com&callback=?
I don't need the fancy extra information that Tweetmeme or Topsy deliver, only the amount.
I am aware that this is problematic, given the "distributed" nature of status.net: it will only give a count from one single silo, e.g. identi.ca. However, for me, for now, that would be enough.
Is there such an endpoint that gives me such JSON?
I don't think so. There's a file table in StatusNet databases that holds references to dented URLs (so it wouldn't be hard to count them if you had access to database or could write a plugin -- i.e., you wouldn't have to parse all notices, just lookup the file table), but it's not exposed through the API.
The list of possible API calls for StatusNet is here: http://status.net/wiki/TwitterCompatibleAPI
In addition, there's a proposed Google Summer of Code project on this subject: Social Analytics plugin
