Data Studio connector making multiple calls to API when it should only be making 1 - google-apps-script

I'm finalizing a Data Studio connector and noticing some odd behavior with the number of API calls.
Where I'm expecting to see a single API call, I'm seeing multiple calls.
In my Apps Script I keep a simple tally that increments by 1 on every URL fetch, and it gives me exactly the number I expect to see for getData().
However, in my API monitoring logs (using Runscope) I'm seeing multiple API requests for the same endpoint, and varying numbers of requests for different endpoints within a single getData() call (they should all be the same).
I can't post the code here (client project) but it's substantially the same framework as the Data Connector code in Google's docs. I have caching and backoff implemented.
Looking for any ideas, or has anyone experienced something similar?
Thanks

Per this reference, GDS will also perform semantic type detection if you aren't explicitly defining this property for your fields. If the query is for semantic type detection, the request will feature sampleExtraction: true.
When Data Studio executes the getData function of a community connector for the purpose of semantic detection, the incoming request will contain a sampleExtraction property which will be set to true.
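If you want to handle those sampling calls cheaply, you can branch on that flag early in getData. A minimal sketch (the flag lives at request.scriptParams.sampleExtraction per the docs; buildResponse, getSampleRows and fetchRowsFromApi are hypothetical helpers, not Apps Script APIs):

function getData(request) {
  // Data Studio sets this flag when it only needs sample values
  // for semantic type detection, not the full data set.
  var isSampling = request.scriptParams && request.scriptParams.sampleExtraction;
  if (isSampling) {
    // Serve a small, cheap subset instead of calling the API in full.
    return buildResponse(getSampleRows(request), request);
  }
  return buildResponse(fetchRowsFromApi(request), request);
}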

If the GDS report includes multiple widgets with different dimensions/metrics configurations, then GDS might fire a separate getData call for each of them.

Kind of a late answer but this might help others who are facing the same problem.
The widgets / search filters attached to a graph issue getData calls of their own. If your custom adapter retrieves data via API calls to third-party services, and that data is agnostic to the request.fields property sent by GDS, then those API calls are multiplied by N+1 (where N is the number of widgets / search filters your report implements).
I could not find an official solution for this either, so I invented a workaround using cache.
The graph's getData request (typically requesting more fields than the search filters) will be the only one allowed to query the API endpoint. Before it starts doing so, it stores a lock key in the cache: "cache_{hashOfReportParameters}_building" => true.
if (enableCache) {
  cache.putString("cache_{hashOfReportParameters}_building", 'true');
  Logger.log("Cache is being built...");
}
It will then retrieve the API responses, paginating in a loop, and buffer the results.
Once it has finished, it deletes the lock key "cache_{hashOfReportParameters}_building" and caches the final merged results it buffered inside "cache_{hashOfReportParameters}_final".
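In code, that hand-off could look roughly like this sketch (mergedResults stands in for the buffered API responses; here the lock is released by overwriting it with 'false', which still satisfies the while-loop check shown further down):

if (enableCache) {
  // Publish the merged, buffered API responses for the waiting calls...
  cache.putString("cache_{hashOfReportParameters}_final", JSON.stringify(mergedResults));
  // ...then release the lock so those calls can proceed.
  cache.putString("cache_{hashOfReportParameters}_building", 'false');
  Logger.log("Cache built, final results stored.");
}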
The filters also invoke getData, but typically with only up to 3 requested fields. The first thing we want to do is make sure they cannot start executing before the primary getData call, so we add a small delay for requests that look like search filters / widgets going after the same data set:
if (enableCache) {
  var countRequestedFields = requestedFields.asArray().length;
  Logger.log("Total requested fields: " + countRequestedFields);
  if (countRequestedFields <= 3) {
    Logger.log('This seems to be a search filter.');
    Utilities.sleep(1000);
  }
}
After that we compute a hash over all of the moving parts of the report (the date range, plus any other parameters you have set up that could influence the data retrieved from your API endpoints).
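For instance, a sketch of such a hash in Apps Script (Utilities.computeDigest is a standard Apps Script call; which request properties go into the hash depends on your connector):

function computeReportHash(request) {
  // Concatenate everything that can change the underlying data set.
  var parts = [
    request.dateRange.startDate,
    request.dateRange.endDate,
    JSON.stringify(request.configParams || {})
  ].join('|');
  var digest = Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, parts);
  // Convert the byte array to a hex string usable inside cache keys.
  var hex = '';
  for (var i = 0; i < digest.length; i++) {
    var v = (digest[i] + 256) % 256; // digest bytes come back signed
    hex += (v < 16 ? '0' : '') + v.toString(16);
  }
  return hex;
}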
Now the best part: as long as the main graph is still building the cache, we make these getData calls wait:
while (cache.getString('cache_{hashOfReportParameters}_building') === 'true') {
  Logger.log('A similar request is already executing, please wait...');
  Utilities.sleep(2000);
}
After this loop we attempt to retrieve the contents of "cache_{hashOfReportParameters}_final" -- and in case we fail, it's always a good idea to have a backup plan, which is to let the call traverse the API again. We have encountered a roughly 2% error rate retrieving data we cached.
With the cached result (or the buffered API responses), you then transform the response into the schema GDS needs (which differs between graphs and filters).
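Put together, the tail end of getData could look like this sketch (computeReportHash is the sketch above; fetchAllPagesFromApi and buildRows are hypothetical helpers standing in for your own API traversal and schema mapping):

var reportHash = computeReportHash(request);
var cached = cache.getString('cache_' + reportHash + '_final');
var results;
if (cached !== null) {
  results = JSON.parse(cached);
} else {
  // Backup plan for the ~2% of cache misses: traverse the API again.
  results = fetchAllPagesFromApi(request);
}
// Shape the rows to exactly the fields this graph or filter requested.
return {
  schema: requestedFields.build(),
  rows: buildRows(results, requestedFields)
};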
As you start implementing this, you'll notice yet another problem: the Apps Script cache (CacheService) is limited to 100KB per key. There is, however, no limit on the number of keys you can cache, and fortunately others have encountered similar needs in the past and have come up with a smart solution: split the big chunk you need cached across multiple cache keys, and glue the pieces back together into one object on retrieval.
See: https://github.com/lwbuck01/GASs/blob/b5885e34335d531e00f8d45be4205980d91d976a/EnhancedCacheService/EnhancedCache.gs
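A simplified sketch of the split-and-glue idea, using the plain CacheService put/get API rather than the putString wrapper above (the linked EnhancedCache class does this more robustly; the 90KB chunk size is just a safe margin under the limit):

var CHUNK_SIZE = 90 * 1024; // stay safely under the 100KB per-value limit

function putLargeString(cache, key, value, ttlSeconds) {
  var chunks = Math.ceil(value.length / CHUNK_SIZE);
  cache.put(key + '_chunks', String(chunks), ttlSeconds);
  for (var i = 0; i < chunks; i++) {
    cache.put(key + '_' + i, value.substr(i * CHUNK_SIZE, CHUNK_SIZE), ttlSeconds);
  }
}

function getLargeString(cache, key) {
  var chunks = Number(cache.get(key + '_chunks'));
  if (!chunks) return null;
  var parts = [];
  for (var i = 0; i < chunks; i++) {
    var part = cache.get(key + '_' + i);
    if (part === null) return null; // one piece expired: treat as a full miss
    parts.push(part);
  }
  return parts.join('');
}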
I cannot share the final solution we implemented as it is too specific to a client, but I hope this at least gives you a good idea of how to approach the problem.
Caching the full API result is a good idea in general: it avoids unnecessary round trips and server load, as long as near-realtime data is good enough for your needs.

Related

How to restrict fields returned by stackexchange api, and turn off paging?

I'd like to have a list of just the current titles for all questions on one of the smaller (fewer than 10,000 questions) Stack Exchange sites. I tried the interactive utility here: https://api.stackexchange.com/docs/questions and it both reports the result as JSON at the bottom and produces the request URL at the top. For example:
https://api.stackexchange.com/2.2/questions?order=desc&sort=activity&tagged=apples&site=cooking
returns this JSON in my browser:
{"items":[{"tags":["apples","crumble"],"owner":{ ...
...
...],"has_more":true,"quota_max":300,"quota_remaining":252}
What is quota? It was 10,000 on one search on one site, but suddenly it's only 300 here.
I won't be doing this very often; what I'd like is the quickest way to edit that (or a similar) URL so I can get a list of all of the titles on a small site. I don't understand how to use paging, and I don't need any of the other fields. I don't care if I get them, but I'm thinking that if I exclude them I can get more at once.
If I need to script it, python (2.7) is my preferred (only) language.
quota_max is the number of requests your application is allowed per day. 300 is the default for an unregistered application. This used to be mentioned directly on the page describing throttles, but seems to have been removed. Here is historical information describing the default.
To increase this to 10,000, you need to register an application and then authenticate by passing an access token in your script.
To get all titles on a site, you can use a Python library to help:
StackAPI (the answer below uses this library; disclaimer: I wrote it)
Py-StackExchange
SEAPI
StackPy
Assuming you have registered your application and authenticated, we can proceed.
First, install StackAPI (documentation):
pip install stackapi
This code will then grab the 10,000 most recent questions (max_pages * page_size) for the site hardwarerecs. Each page costs one API hit, so the more items per page, the fewer API calls you make.
from stackapi import StackAPI
SITE = StackAPI('hardwarerecs')
SITE.page_size = 100
SITE.max_pages = 100
# Filter to only get question title and link
filter = '!BHMIbze0EQ*ved8LyoO6rNjkuLgHPR'
questions = SITE.fetch('questions', filter=filter)
The questions variable holds a dictionary that looks very similar to the API output, except that the library did all the paging for you. Your data is in questions['items'] and, in this case, contains a list of dictionaries that look like this:
[
 ...
 {u'link': u'http://hardwarerecs.stackexchange.com/questions/29/sound-board-to-replace-a-gl2200-in-a-house-of-worship-foh-setting',
  u'title': u'Sound board to replace a GL2200 in a house-of-worship FOH setting?'},
 {u'link': u'http://hardwarerecs.stackexchange.com/questions/31/passive-gps-tracker-logger',
  u'title': u'Passive GPS tracker/logger'}
 ...
]
This result set is limited to only the title and the link because of the filter we applied. You can find the appropriate filter by adjusting what fields you want in the web UI and copying the filter field.
The hardwarerecs value passed when creating the SITE object is the first part of the site's domain URL. Alternatively, you can find it as the api_site_parameter for your site at the /sites endpoint.

Storing data in FIWARE Object Storage

I'm building an application that stores files into the FIWARE Object Storage. I don't quite understand the correct way of storing files into it.
The Python code snippet below, taken from the Object Storage - User and Programmers Guide, shows 2 ways of doing it:
def store_text(token, auth, container_name, object_name, object_text):
    headers = {"X-Auth-Token": token}
    # 1. version
    #body = '{"mimetype":"text/plain", "metadata":{}, "value" : "' + object_text + '"}'
    # 2. version
    body = object_text
    url = auth + "/" + container_name + "/" + object_name
    return swift_request('PUT', url, headers, body)
The 1. version confuses me, because when I first looked at the only Node.js module that works with the Object Storage (repo: fiware-object-storage), it seemed to use the 1. version. But the module was making calls to the old (v1.1) API instead of the presumably newest (v2.0) one that the Python example references, so I'm not sure whether that way of doing it is outdated or not.
As I played more with the module, I realised it didn't work and its code was a total mess, so I forked the project and quickly understood that I would need to rewrite it from the ground up, taking the above-mentioned Python example from the usage guide as a reference. Link to my repo.
As of writing this, the only methods that aren't implemented are object storing (PUT) and object fetching (GET).
I had some additional questions about the Object Storage which I sent to fiware-lab-help@lists.fiware.org, but haven't heard anything back, so I'm asking them here.
I haven't got much experience with writing API libraries. Should I worry about the auth token expiring? I presume it isn't necessary to re-authenticate every time we interact with the storage; the authentication should happen once when the server starts up (we create an instance) and the token is kept internally. Should I implement some kind of mechanism that refreshes the token?
Does the tenant id change? From the quote below I presume that getting a tenant is just a one-time deal, and later you can use it in the config to make fewer authentication calls.
A valid token is required to access an object store. This section
describes how to get a valid token assuming an identity management
system compatible with OpenStack Keystone is being used. If the
username, password and tenant details are known, only step 3 is
required. source
During authentication, when fetching tenants, how should I select the "right" one? For now I'm just taking the first one, similar to what the example code does.
Is it true that an object storage container belongs to only a single region?
Use only what you call version 2. Ignore your version 1. It is commented out in the example. It should be removed from the documentation.
(1) The token will be valid for some period of time; this could be an hour or a day, depending on the setup. The period should be specified in the token that is returned by the authentication service. The token needs to be periodically refreshed (see the sketch after this list).
(2) The tenant id does not change.
(3) Typically only one tenant id is returned. It is possible, however, that you were assigned more than one id, in which case you have to pick which one you are currently using. Containers typically belong to a single tenant and are not shared between tenants.
(4) Containers are typically limited to a single region. This may change in the future when multi-region support for a container is added to Swift.
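To illustrate point (1), here is a sketch of a self-refreshing token in Node.js against the OpenStack Keystone v2 token API that the guide describes (config.keystoneHost and the 5-minute safety margin are assumptions; adjust to your identity service):

var https = require('https');

// Authenticate against Keystone v2, then schedule the next refresh
// shortly before the returned token expires.
function refreshToken(config, onToken) {
  var payload = JSON.stringify({
    auth: {
      passwordCredentials: { username: config.user, password: config.password },
      tenantName: config.tenant
    }
  });
  var req = https.request({
    host: config.keystoneHost, // your identity service host (assumption)
    path: '/v2.0/tokens',
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }
  }, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      var token = JSON.parse(body).access.token;
      // Refresh 5 minutes before the expiry reported by Keystone.
      var msLeft = new Date(token.expires).getTime() - Date.now() - 5 * 60 * 1000;
      setTimeout(function () { refreshToken(config, onToken); }, Math.max(msLeft, 0));
      onToken(token.id); // the value to send as X-Auth-Token
    });
  });
  req.end(payload);
}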
Solved my troubles and created the NPM module that works with the FIWARE Object Storage: https://github.com/renarsvilnis/fiware-object-storage-ge

Drive API files.list returning nextPageToken with empty item results

In the last week or so we got a report of a user missing files in the file list in our app. We were a bit confused at first because they said they only had a couple of files that matched our query string, but with a bit of work we were able to reproduce the issue by adding a large number of files to our Google Drive. Previously we had assumed people would have fewer than 100 files and hadn't been doing paging, to avoid multiple files.list requests.
After switching to paging, we noticed that one of our test accounts was sending hundreds and hundreds of files.list requests, and most of the responses did not contain any files but did contain a nextPageToken. I'll update as soon as I can get a screenshot, but the client was sending enough requests to heat the computer up and drain the battery fairly quickly.
We also found that the query itself, even when it matches the same files, can have a drastic effect on the number of requests needed to retrieve our full file list. For example, switching '=' to 'contains' in the query parameter significantly reduces the number of requests made, but we don't see any guarantee that this is a reasonable and generalizable solution.
Is this the intended behavior? Is there anything we can do to reduce the number of requests that we are sending?
We're using the following code to retrieve files created by our app that is causing the issue.
runLoad: function (pageToken)
{
    gapi.client.drive.files.list(
    {
        'maxResults': 999,
        'pageToken': pageToken,
        'q': "trashed=false and mimeType='" + mime + "'"
    }).execute(function (results)
    {
        this.filePageRequests++;
        if (results.error || !results.nextPageToken || this.filePageRequests >= MAX_FILE_PAGE_REQUESTS)
        {
            this.isLoading(false);
        }
        else
        {
            this.runLoad(results.nextPageToken);
        }
    }.bind(this));
}
It is, but probably shouldn't be, the correct behaviour.
It generally occurs when using the drive.file scope. What (I think) is happening is that the API layer is fetching all files, and then removing those that are outside of the current scope/query, and returning the remainder to your client app. In theory, a particular page of files could have no files in-scope, and so the returned array is empty.
As you've seen, it's a horribly inefficient way of doing it, but that seems to be the way it is. You simply have to keep following the next page link until it's null.
As to "Is there anything we can do to reduce the number of requests that we are sending?"
You're already setting max results to 999 which is the obvious step. Just be aware that I have seen this value trigger internal errors (timeouts?) which manifest themselves as 500 errors. You might want to sacrifice efficiency for reliability and stick to the default of 100 which seems to be better tested.
I don't know if the code you posted is your actual code or just a simplified illustration, but you need to make sure you are dealing with 401 errors (auth expiry) and 500 errors (sometimes recoverable with a retry).
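A sketch of what that handling could look like (reauthorize, handleItems and mime are placeholders for your own auth flow, result handling and MIME type; the exponential backoff choice is an assumption, not a Drive API requirement):

var MAX_RETRIES = 3;

function listPage(pageToken, attempt) {
    gapi.client.drive.files.list({
        'maxResults': 100, // the better-tested default-ish page size
        'pageToken': pageToken,
        'q': "trashed=false and mimeType='" + mime + "'"
    }).execute(function (results) {
        if (results.error) {
            if (results.error.code === 401) {
                // Auth expired: re-run the OAuth flow, then retry this page.
                reauthorize(function () { listPage(pageToken, attempt); });
            } else if (results.error.code >= 500 && attempt < MAX_RETRIES) {
                // Transient server error: back off, then retry the same page.
                setTimeout(function () { listPage(pageToken, attempt + 1); },
                           Math.pow(2, attempt) * 1000);
            }
            return;
        }
        handleItems(results.items || []); // may legitimately be empty
        if (results.nextPageToken) {
            listPage(results.nextPageToken, 0);
        }
    });
}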

How to extend AFNetworking 2.0 to perform request combining

I have a UI where the same image URL could be requested by several UIImageViews at varying times. Obviously if a request from one of them has finished then returning the cached version works as expected. However, especially with slower networks, I'd like to be able to piggy-back requests for an image URL onto any currently running/waiting HTTP request for the same URL.
On an HTTP server this is called request combining, and I'd love to do the same in the client: combine the different requests for the same URL into a single request, and then call back separately to each of the callers. The requests for that URL don't necessarily start at the same time.
What's the best way to accomplish this?
I think re-writing UIImageView+AFNetworking might be the easiest way:
check the af_sharedImageRequestOperationQueue to see if it has an operation with the same request
if I do already have an operation in the queue or running then add myself to some list of callbacks/blocks to be called on success/failure
if I don't have the operation, then create it as normal
in setCompletionBlockWithSuccess, call each of the blocks in turn.
Any simpler alternatives?
I encountered a similar problem and decided that your way was the most straightforward. One added bit of complexity is that these downloads require special credentials and so must go through their own operation queue. Here's the code from my UIImageView category to check whether a particular URL is inflight:
NSUInteger foundOperation = [[ConnectionManager sharedConnectionManager].operationQueue.operations indexOfObjectPassingTest:^BOOL(AFHTTPRequestOperation *obj, NSUInteger idx, BOOL *stop) {
    BOOL URLAlreadyInFlight = [obj.request.URL.absoluteString isEqualToString:URL.absoluteString];
    if (URLAlreadyInFlight) {
        NSBlockOperation *updateUIOperation = [NSBlockOperation blockOperationWithBlock:^{
            [[NSOperationQueue mainQueue] addOperationWithBlock:^{
                self.image = [[ImageCache sharedImageCache] cachedImageForURL:URL];
            }];
        }];
        // Makes updating the UI dependent on the completion of the matching operation.
        [updateUIOperation addDependency:obj];
    }
    return URLAlreadyInFlight;
}];
Were you able to come up with a better solution?
EDIT: Well, it looks like my method of updating the UI just can't work, as the operation's completion blocks are run asynchronously, so the operation finishes before the blocks are run. However, I was able to modify the image cache to be able to add callbacks for when certain URLs are cached, which seems to work correctly. So this method will properly detect when certain URLs are in flight and be able to take action with that knowledge.

Asynchronous Ajax call in SCORM API

I am creating a JavaScript API for SCORM 2004 4th Edition. For those who don't know about SCORM, it is basically an API standard that eLearning courses can use to communicate with an LMS (Learning Management System). Now the API has to have the following methods:
Initialize(args)
GetValue(key)
SetValue(key, value)
Terminate(args)
Commit(args)
GetDiagnostic(args)
GetErrorString(args)
GetLastError()
Now Initialize has to be called before anything else, and Terminate must be the last; GetValue/SetValue can be called anywhere in between. What I am doing in the Initialize method is getting some JSON from a web service and storing it in the API (to be used by the GetValue/SetValue methods later). The problem I am running into is that the AJAX call via jQuery is asynchronous, so the Initialize call could return before the JSON is loaded. That means a call to GetValue right after Initialize could cause unexpected issues, because the JSON that GetValue uses isn't there yet. My question is this: what can I do to ensure that the JSON is loaded before the GetValue/SetValue methods are called? I know the simple answer is to make the call synchronous, but that is generally not advised, and it doesn't seem to work for me anyway. Here is my code:
function GetJSON(){
    var success = false;
    $.ajaxSetup({async: false}); // intended to make the call synchronous
    // NOTE: "jsoncallback=?" turns this into a JSONP request, which jQuery
    // performs via script injection and therefore always runs asynchronously;
    // async:false has no effect here, which is why this approach fails.
    $.getJSON("http://www.mydomain.com/webservices/scorm.asmx/SCORMInitialize?learnerID=34&jsoncallback=?",
        function(data){
            bind(data);
            success = true;
        }
    );
    return success;
}

function bind(data){
    this.cmi = eval("(" + data.d + ")");
    $.ajaxSetup({async: true}); // restore asynchronous behaviour
}
Does anyone have any ideas? I would really appreciate it!
You've articulated the problem well. After the SCO calls Initialize, the CMI data needs to be immediately available for the SCO to make subsequent GetValue calls. However, making synchronous AJAX calls isn't advised; if there is a hangup in the request, it can lock up the entire browser until the request returns or times out.

The solution is to pre-load all of the required data before the SCO is loaded. In our SCORM Engine implementation, we preload all of the data (CMI and sequencing) when the player is launched and then use a background process to periodically commit dirty data as the learner progresses through the course. It can get a bit tricky to ensure that all data is properly persisted when dealing with the combinations of possible window launching and exit scenarios, but it's certainly possible.

You will want to avoid any requests to the server from within a SCORM API call, as SCOs will often flood the LMS with big batches of calls. Making server requests within those calls can seriously degrade the learner's experience and place a performance burden on the server.
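A rough sketch of that shape (the endpoints and names are illustrative, not from any particular SCORM player):

var ScormApi = (function () {
    var cmi = null;  // CMI data, preloaded before the SCO frame is created
    var dirty = {};  // values changed since the last background commit

    function flushDirty() {
        var pending = dirty;
        if (Object.keys(pending).length === 0) return;
        dirty = {};
        // Persist in the background; re-queue the batch if the POST fails.
        $.post('/webservices/scorm.asmx/SCORMCommit',
               { learnerID: 34, data: JSON.stringify(pending) })
         .fail(function () { $.extend(dirty, pending); });
    }
    setInterval(flushDirty, 30000); // commit dirty data every 30 seconds

    return {
        preload: function (onReady) {
            // Fetch everything up front, before the SCO can call Initialize,
            // so GetValue never races the network.
            $.getJSON('/webservices/scorm.asmx/SCORMInitialize?learnerID=34',
                function (data) { cmi = data; onReady(); });
        },
        Initialize: function () { return cmi !== null ? 'true' : 'false'; },
        GetValue: function (key) { return cmi[key]; },
        SetValue: function (key, value) { cmi[key] = value; dirty[key] = value; return 'true'; },
        Commit: function () { flushDirty(); return 'true'; },
        Terminate: function () { flushDirty(); return 'true'; }
    };
})();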
Mike
The way we approached this problem was to queue the CMI data in the API when the SCO is launched. We first navigate to a launch page that loads the CMI data into the API's queue, and then the launch page actually launches the SCO. When the SCO calls Initialize, we just move the data into the CMI.