Keep getting 429 (Too Many Requests) throttling errors - onenote

I tried to engage with the API team via Twitter but I've not had a response and development is grinding to a halt here...
In short I keep getting a 429 when developing against the OneNote API, I know that this suggests I'm hitting the API too hard, but I'm not.
At worst I'm doing maybe 1 or 2 requests per minute, manually invoked by me as I develop. Sometimes I'll leave it 10-15 minutes between calls, sometimes this works, sometimes not.
I've been working on a particular problem the last few days.
In my code I make a call to get all notebooks, sections & section groups in a single query (filtered to only return data from certain notebooks)
I then make a second call to get all updated pages for those notebooks. I've been fiddling with the filter string to get this second call working (which I now think I have), but 9 times out of 10 I get the 429 on this second API call.
Is there some way of getting my user account whitelisted please?
FWIW this is my second query (the spaces normally get encoded):
/me/notes/pages?count=true&top=100&expand=parentNotebook,parentSection
&filter=(parentNotebook/id eq '{GUID}' or parentNotebook/id eq '{GUID}' or parentNotebook/id eq '{GUID}') and lastModifiedTime gt 2016-08-05T11:34:09.000Z
This does work as I'd expect, the date clause is working now, but I can only test very occasionally as I get the 429.
Incidentally if I run my second filter through the API console I get a 504 "Proxy request timeout", every time. This has been since I've added the parenthesis around the notebook predicate.
So I'm pretty much unable to continue development, how do I resolve this please?

As a short term workaround, please try the following:
Instead of one query:
/me/notes/pages?count=true&top=100&expand=parentNotebook,parentSection&filter=(parentNotebook/id eq '{GUID}' or parentNotebook/id eq '{GUID}' or parentNotebook/id eq '{GUID}') and lastModifiedTime gt 2016-08-05T11:34:09.000Z
Remove the "count=true" (are you using this?) and leave only one parentNotebookId filter. The results will be by default ordered by LastModifiedTime descending (most recent first).
Perform this query for all the notebooks you're interested in:
/me/notes/pages?top=100&expand=parentNotebook,parentSection&filter=parentNotebook/id eq '{GUID}'

Just a hunch: can you try splitting the second call (for getting updated pages) into separate http requests (one for each notebook id)?
Also if what you want are update notifications, webhooks might be the better way to go.
Lastly apologies for the silence on Twitter.

Unfortunately you're hitting a bad bug in our GET Pages API (when called with filters and additional query params). Effectively, we are doing a crappy job of applying the filters when calling our indexing service (which in turn is throttling us). We've identified this as an on-going problem which starts throttling callers especially under heavy load.
Short-term workaround: we're fiddling with our capacity and upp-ing the throttling limits set by our partner indexing service temporarily while we work on the longer term fix. Hopefully this will result in fewer 429s for you going forward.
As a future-proof resolution, I would also encourage you to look into #Jorge's suggested answer. (remove the count=true query param and filter only on 1 parentNotebookId (no lastModifiedTime filtering)

Related

Set stackdriver alerts for specific error messages

Cannot find a clean way to set Stackdriver alert notifications on errors in cloud functions
I am using a cloud function to process data to cloud data store. There are 2 types of errors that I want to be alerted on:
Technical exceptions which might cause function to 'crash'
Custom errors that we are logging from the cloud function
I have done the below,
Created a log metric searching for specific errors (although this will not work for 'crash' as the error message can be different each time)
Created an alert for this metric in Stackdriver monitoring with parameters as in below code section
This is done as per the answer to the question,
how to create alert per error in stackdriver
For the first trigger of the condition I receive an email. However, on subsequent triggers lets say on the next day, I don't. Also the incident is in 'opened' state.
Resource type: cloud function
Metric:from point 2 above
Aggregation: Aligner: count, Reducer: None, Alignment period: 1m
Configuration: Condition triggers if: Any time series violates, Condition:
is above, Threshold: 0.001, For: 1 min
So I have 3 questions,
Is this the right way to do to satisfy my requirement of creating alerts?
How can I still receive alert notifications for subsequent errors?
How to set the incident to 'resolved' either automatically/ manually?
I was having a similar problem and managed to at least get a mail every time. The "trick" seems to be to use sum instead of count in combination with for most recent value - see the screenshot below.
This causes Stackdriver to send a mail everytime a matching log entry is found and closing the issue a minute later.
Normally, alerts resolve themselves once the alerting policy stops firing. The problem you're having with your alerts not resolving is because your metric only writes non-zero points - if there are no errors, it doesn't write zero. That means that the policy never gets an unambiguous signal that everything is fine, so the alerts just sit there (they'll automatically close after 7 days, but I imagine that's not all that useful for you).
This is a common problem and it's a tricky one to solve. One possibility is to write your policy as a ratio of errors to something non-zero, like request count. As long as the request count is non-zero, the ratio will compute zero if there are no errors, and so an alert on the ratio will automatically resolve. You need to be a bit careful about rounding errors, though - if your request count is high enough, you might potentially miss a single error because the ratio could round to zero.
Aaron Sher, Stackdriver engineer
We got around this issue by having the insertId as a label of the log-based metric we created for every log record we get from the pods running our services.
In the alerting policy, this label helped in two things:
We grouped by it (named as record_id) which served in making each incident unique, so it got reported without waiting for other incidents to get resolved and at the same time it got resolved instantly.
We used it in the documentation of the notification to include a direct link to the issue (log record) itself which was a nice and essential feature to have. https://console.cloud.google.com/logs/viewer?project=MY_PROJECT&advancedFilter=insertId%3D%22${metric.label.record_id}%22
As #Aaron Sher mentioned in his answer, it is a tricky problem. We might have done something not recommended or not efficient, but it works fine and of course we are open for improvement recommendations.

How do deal with bots using the in-site search and overflowing the SQL with too many requests?

What is the best practise to not annoy users with flood limits, but yet block off bots doing automated searches?
What is going on:
I am been more aware of odd search behaviour and I finally had the time, to catch who it is. It is 157.55.39.* also known as Bing. Which is odd, because when _GET['q'] is detected, noindex is added.
Problem however is, that they are slowing down the SQL server, as there is just too many instances of requests coming in.
What I have done so far:
I have implemented searching flood limit. But since I did it with a session-cookie, checking and calculating from the last search timestamp -- bing obviously ignores cookies and continues on.
Worst case scenario is to add reCAPTHA, but I don't want the "Are you human?" tickbox everytime you search. It should appear only, when flood is detected. So basically, the real question is, how to detect too many requests from client to trigger some sort of recaptcha to stop requests..
EDIT #1:
I handled the situation currently, with:
<?
# Get end IP
define('CLIENT_IP', (filter_var(#$_SERVER['HTTP_X_FORWARDED_IP'], FILTER_VALIDATE_IP) ? #$_SERVER['HTTP_X_FORWARDED_IP'] : (filter_var(#$_SERVER['HTTP_X_FORWARDED_FOR'], FILTER_VALIDATE_IP) ? #$_SERVER['HTTP_X_FORWARDED_FOR'] : $_SERVER['REMOTE_ADDR'])));
# Detect BING:
if (substr(CLIENT_IP, 0, strrpos(CLIENT_IP, '.')) == '157.55.39') {
# Tell them not right now:
Header('HTTP/1.1 503 Service Temporarily Unavailable');
# ..and block the request
die();
}
It works. But it seems like another temp solution to a more systematic problem.
I would like to mention, that I still would like search engines, including Bing to index /search.html, just not to actually search there. There is no "latest searches" or anything like that, so its a mystery where they are getting the queries from.
EDIT #2 -- How I solved it
If someone else in the future has these problems, I hope this helps.
First of all, it turns out that Bing has the same URL parameter feature, that Google has. So I was able to tell Bing to ignore URL parameter "q".
Based on the correct answer, I added disallow rows for parameter q to robots.txt:
Disallow: /*?q=*
Disallow: /*?*q=*
I also told inside the bing webmaster console, to not bother us on peak traffic.
Overall, this right away showed positive feedback from server resource usage. I will however, implement overall flood limit for identical queries, specifically where _GET is involved. So in case Bing should ever decide to visit an AJAX call (example: ?action=upvote&postid=1).
Spam is a problem that all website owners struggle to deal with.
And there are a lot of ways to build good protection, starting from very easy ways and finishing with very hard and strong protection mechanisms.
But for you right now I see one simple solution.
Use robots.txt and disallow Bing spider to crawl your search page.
You can do this very easy.
Your robots.txt file would look like:
User-agent: bingbot
Disallow: /search.html?q=
But this will totally block search engine spider from crawling your search results.
If you want just to limit such requests, but not totally block them, try this:
User-agent: bingbot
crawl-delay: 10
This will force Bing to crawl your website pages only every 10 seconds.
But with such delay, it will crawl only 8,640 pages a day (which is very small amount of requests per/day).
If you good with this, then you ok.
But, what if you want manually control this behavior by the server itself, protecting search form not only from web crawlers, but also from hackers?
They could send to your server over 50,000 requests per/hour with the ease.
In this case, I would recommend you 2 solutions.
Firstly, connect CloudFlare to your website, and don't forget to check if your server real IP is still available via services like ViewDNS IP History, cuz many websites with CF protection lack on this (even popular once).
If your active server IP is visible in the history, then you may consider changing it (highly recommended).
Secondly, you could use MemCached to store flood data and detect if a certain IP is querying too much (i.e. 30 q/min).
And if they do, block their opportunity to use perform (via MemCached) for some time.
Of course, this is not the best solution you could use, but it will work and will cost not much for your server.

Scribd API search showing irrelevant answers

When I use the search functionality on the scribd docs API to search for a function, like
http://api.scribd.com/api?method=docs.search&api_key=API_KEY&query=hello+world
It returns irrelevant results, and ones different to the search functionality of the site. This request, for example, returns results about Guitar Hero, World of Warcraft and Virtual Worlds etc. Whereas the site search on https://www.scribd.com/search-documents?query=hello+world gives documents titled "Hello World" as you would expect. Is there a parameter that I can add to the api call that will make it return relevant results?
You may try playing with the simple parameter to see if it makes any difference to your queries. According to the API reference (half of it is inaccessible at the moment) it makes the results the same as for the website:
(optional)This option specifies whether or not to allow advanced search queries (more information). When set to false, the API search behaves the same as the search on Scribd.com. When set to true, the API search allows advanced queries that contain filters such as title:"A Tale of Two Cities". Set to "true" by default.
I tried your query myself, but it still doesn't give adequate results, even though it changes things a bit. But it is still not good enough regardless of the simple option being set to false. Even if you try to run their sample queries 1:1 they are still giving 90% irrelevant results.
Then I found a similar issue being discussed in the following google group thread back in 2011. At the end Jared Friedman (the CTO of Scribd) himself admits that API search and website Search work differently and it is not in their priorities to fix this. In 2014 another developer complained. Seems to me that four years later this is still the case.
I'd suggest contacting Scribd support directly and asking them what is the current status of the docs.search API and if there is some preliminary approval process in place (for example, they may do a background check on accounts and only then provide relevant results, otherwise they return just test results for any query) although I doubt it.

How does Chrome update URL bar completions?

I really enjoy using Chrome's URL bar because it remembers commonly-visited sites and often suggests a good completion based on what I've typed and/or visited before. So, for example, I can type t in the URL bar and Chrome will automatically fill it in with twitter.com, or I can type maps and Chrome will fill in the .google.com. This gives me the convenience of data-driven domain name shortcuts without having to maintain an explicit list.
What I'm wondering, though, is how Chrome determines that an old shortcut should be replaced with a new one. For example, if I visit twitter.com often, then that becomes the completion when I type t. But if I then start visiting twilio.com often enough, then, after some time, Chrome will start to fill that in as the default completion for t. What I can't figure out is how or when that transition takes place. It also seems that there are (at least) two cases involved : one for domain names, and another for path strings, because if I visit a certain full URL often, and then want to get to the root of the same domain, I end up having to type the entire domain name out to get Chrome to ignore the full-URL completion.
If I had to guess, I'd imagine that Chrome stores the things that I type in the URL bar in a trie whose values are the number of times that a particular string has been typed (and/or visited ?). Then I'd imagine it has some sort of exponential decay model for the "counts" in the trie. But this is just a guess. Does anyone know how this updating process happens ?
Well, I ended up finding some answers by having a look at the Chromium source code ; I'd imagine that Chrome itself uses this code without too much modification.
When you type something into the search/URL bar (which is apparently called the "Omnibox"), Chrome starts looking for suggestions and completions that match what you've typed. To do this, there are several "providers" registered with the browser, each of which knows how to make a particular type of suggestion. The URL history provider is one of these.
The querying process is pretty cool, actually. It all happens asynchronously, with particular attention paid to which activity happens in which thread (the main thread being especially important not to block). When the providers find suggestions, they call back to the omnibox, which appears to merge and sort things before updating the UI widget.
History provider
It turns out that URLs in Chrome are stored in at least one, and probably two, sqlite databases (one is on disk, and the second, which I know less about, seems to be in memory).
This comment at the top of HistoryURLProvider explains the lookup process, complete with multithreaded ASCII art !
Sqlite lookup
Basically, typing in the omnibox causes sqlite to run this SQL query for looking up URLs by prefix. The suggestions are ordered by the number of visits to the URL, as well as by the number of times that a URL has been typed.
Interestingly, this is not a trie ! The lookup is indeed based on prefix, but the scoring of those lookups does not appear to be aggregated by prefix, like I'd imagined.
I had a little less success in determining how the scores in the database are updated. This part of the code updates a URL after a visit, but I haven't yet run across where the counts are decremented (if at all ?).
Updating suggestions
What I think is happening regarding the updating of suggestions -- and this is still just a guess right now -- is that the in-memory sqlite database essentially has priority over the on-disk DB, and then whenever Chrome restarts or otherwise flushes the contents of the in-memory DB to disk, the visit and typed counts for each URL get updated at that time. Again, just a guess, but I'll keep looking as I get time.
The code is really nice to read through, actually. I definitely recommend it if you have similar questions about Chrome.

catching Exceeded maximum execution time?

Is there anyway of catching this GAS error: "Exceeded maximum execution time"
I mean catching with try ... catch(e) // so far it's not working for me.
Thanks
As written in the comments to your question, thats not possible. But, however, you can set a flag in scriptDB or properties when execution starts and clear that flag when execution comes to a normal end, so you can find out during the next run wether your script came to a regular end when it was run last time and try to take corrective actions if not.
The answer above is correct; it's not possible. An easy alternative to the workaround that pbhd mentioned would be to simply track the runtime of your script (e.g. comparing results of new Date().getTime() at regular intervals) and run whatever you'd include under your catch statement right before you hit the maximum execution time. The maximum is 6 minutes (reference).
That way, you don't have to catch the error -- you can preempt it.
During normal testing, it is possible to accidentally create an infinite (or very long running) loop that consumes the daily execution time limit 100%.
Even if you realize what you have dome wrong immediately, you cannot immediately re-try with Google scripts for another 24 hours - thus slowing down ongoing development significantly and maybe forcing the developer to do some other work , taking his focus/"attention stream" away from the current problem. This is almost always bad.
My product ("IBM OLIVER CICS test/debug" - see Wikipedia article) solved this problem - and many others - around 37 years ago - by having a time limit on any particular transaction and intercepting the resulting time out, allowing options of:-
continuation or
examine/modify variables
"manual" re-try (for the same time) or
abort.
Google could implement this approach just as easily - by "pausing" if execution time is looking too heavy. I had a similar solution to other resources in OLIVER - such as excessive API calls ("possible macro loop") and excessive memory usage.
It seems it takes an "old timer" like me to provide solutions to problems that have existed "since the beginning of time" (and certainly before PC's were thought of).
Googles current "solution" (i.e. absolute limits) only helps Google keep its own servers from being swamped. It would be easy for them to do what OLIVER did all those years ago. By the way there should be no "IBM" prefix on the Wikipedia article - it was my own product and some clown Wikipedia editor altered it to include the prefix.
(By the way, Google do not prevent other scripts on same s/s running - that maybe only use minimal amounts of extra time ( i.e. Scripts on the same spreadsheet still work) . I tried renaming the original script as an experiment but it was stopped after a very short time with "exceeded execution time" error.
GIZ-A-JOB Google - you know its worth it!