mediawiki-api - iterating through continue to get all results - mediawiki

I'm trying to create a list of all the subcategories in a category, and for all those subcategories, the basic categoryinfo for them. (Number of files, subcategories, etc.)
I'm very close - just getting hung up on handling the continue process.
This gets me the first 100 results:
http://en.wikipedia.org/w/api.php?action=query&format=xml&generator=categorymembers&gcmtitle=Category:Google%20Art%20Project%20works%20by%20artist&gcmlimit=100&gcmprop=ids|title&prop=categoryinfo&continue=
But, there are thousands of subcategories.
The result includes an xml node continue with gcmcontinue and continue attributes.
If I use that in my second request, this gives me the next 100 results:
http://en.wikipedia.org/w/api.php?action=query&format=xml&generator=categorymembers&gcmtitle=Category:Google%20Art%20Project%20works%20by%20artist&gcmlimit=100&gcmprop=ids|title&prop=categoryinfo&continue=gcmcontinue||&gcmcontinue=subcat|4c41555245c380204241525241550a474f4f474c45204152542050524f4a45435420574f524b53204259204c41555245c38020424152524155|38370707
BUT, that's where I'm having the problem. This second set of results no longer has a continue xml node, so I'm not sure how to access the third page and so on.
(As a side note, I'm aware that I'd have to handle sub-sub-categories if I wanted them, but I don't need those. Just the first level is fine.)

James' own answer: So, it helps to make sure you hit "commons.wikimedia.org" instead of "en.wikipedia.org" if you want the results from commons! That was the issue.
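For reference, the continuation loop described in the question can be sketched as below. This is a minimal sketch, not the poster's code: it assumes format=json rather than xml, the function names iter_query_pages and http_fetch are made up for illustration, and the User-Agent string is a placeholder. The idea is simply to merge every key of the returned continue object back into the original parameters until no continue object comes back.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator

# Assumption: the Commons endpoint, as the answer above suggests.
API_URL = "https://commons.wikimedia.org/w/api.php"


def http_fetch(params: Dict[str, str]) -> dict:
    """Hypothetical helper: perform one GET request and decode the JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "continue-example/0.1 (placeholder)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_query_pages(
    fetch: Callable[[Dict[str, str]], dict], params: Dict[str, str]
) -> Iterator[dict]:
    """Yield each raw response, following 'continue' until it is absent."""
    request = dict(params)
    request.setdefault("continue", "")
    while True:
        response = fetch(request)
        yield response
        cont = response.get("continue")
        if not cont:
            break  # no continue object: this was the last batch
        # Copy every continuation key (e.g. gcmcontinue, continue) into
        # the original parameters for the next request.
        request = {**params, **cont}
```

A call might look like iter_query_pages(http_fetch, {"action": "query", "format": "json", "generator": "categorymembers", "gcmtitle": "Category:Google Art Project works by artist", "gcmlimit": "100", "prop": "categoryinfo"}), reading response["query"]["pages"] from each yielded batch.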

Related

Composition in REST and consistency of the inserted data

How do I properly design REST if I have a composition? I have a TestResult entity, which has TestCaseResults entities. Both support the full set of REST methods. The important fact about this (which I believe differs from many examples I found on the web) is that a TestResult is not consistent if it doesn't have all of its TestCaseResults. How do I properly design this in REST?
Let's say I create them as separate but dependent resources: api\testresults\ and api\testresults\1\testcaseresults. When the client wants to create a test result, it needs to POST to api\testresults, then retrieve the URL api\testresults\1\testcaseresults from a link in the response, and POST all of the test case results to it. This means that the test result is not consistent until the client finishes the whole operation. Basically, there is no concept of a transaction here.
Let's say I create only api\testresults resource, and embed an array of test case results inside, like this:
{
  "Name": "Test A",
  "Results": [
    {
      "Measured": "BB",
      ...
    },
    ...
  ]
  ...
}
Then it is easier to insert, but it is still hard to work with. A simple GET to api\testresults\1\ will retrieve the test result with a large number of test case results. A GET to api\testresults\ will retrieve much more! The structure becomes complex. Furthermore, in the real world I have a few entities like TestCaseResults that belong to TestResults, so there will be a few arrays, and each could have 100-200 elements.
I could try to combine the approaches: embed the array, but also provide links to api\testresults\1\testcaseresults and support operations there as well. Maybe on GET api\testresults\1\ I could return the TestResult without its TestCaseResults, with only a link pointing to that resource, but on POST I could accept an embedded array of TestCaseResults (though I'm not sure REST allows different representations for POST and GET). But now there are two ways of inserting information, which is confusing, and I'm still not sure it solves anything.
Your approach with api\testresults\1 and api\testresults\1\testcaseresults seems promising.
As JSON does not have a fixed structure, you can add query parameters to your URL to control whether the results are included or not.
api\testresults\1?with_results=true would mean that your caller wants to see the test case results in addition to the test result.
api\testresults\1\testcaseresults would still return the test case results for your test 1.
If you fear that the number of test case results is too large, you can add pagination parameters, which would be reused in the testcaseresults call.
api\testresults\1?with_results=true&per_page=10 would include only the first 10 results. To get more, use api\testresults\1\testcaseresults?per_page=10&page=2 and so on, as it is the dedicated endpoint.
Cheers
Note: if you want a flexible API that still returns JSON data, you can take a look at GraphQL, the trendy approach.
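On the client side, the per_page/page scheme suggested in this answer could be consumed roughly as follows. This is only a sketch under assumptions the answer does not state: the testcaseresults endpoint is assumed to return a plain JSON array, and a page shorter than per_page is taken to mean there is nothing left. fetch_page is a hypothetical callable wrapping the actual HTTP request.

```python
from typing import Callable, Iterator, List


def iter_test_case_results(
    fetch_page: Callable[[int, int], List[dict]], per_page: int = 10
) -> Iterator[dict]:
    """Yield test case results page by page.

    fetch_page(page, per_page) is a hypothetical function that requests
    .../testcaseresults?per_page=<per_page>&page=<page> and returns the
    decoded list of results for that page.
    """
    page = 1
    while True:
        items = fetch_page(page, per_page)
        yield from items
        # Assumption: a short (or empty) page marks the end of the data.
        if len(items) < per_page:
            break
        page += 1
```

If the API instead reported a total count or a next link, that would be a more reliable stop condition than the page length.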

MediaWiki not returning continue parameter

I'm using this API request:
https://en.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=10000&gscoord=51.540951897949|-0.051086739997922&format=json&gslimit=50&continue=
which delivers 50 results. I want to use the 'continue' parameter to get the next page of results. According to the documentation I should get a continue field back in the results. I don't get any such result so can't get the next page.
Does anyone have any suggestions?
Dave, as @svick says, it seems list=geosearch (which is part of extension:GeoData) does not support continuation; indeed, it returns a "batchcomplete" element to indicate that there are no more results (see in human-readable form).
I think you should either just get the maximum number of results (500 for users, 5000 for bots on Wikipedia), or if that's not satisfactory for your use case (which is?), pipe in at task T78703.
(Or, if you believe it to be a separate issue, report a new bug.)

Apigility GET collection returns only 10 results when content negotiation is set to JSON

This issue is bugging me for some time now. To test it I just installed a fresh Apigility, set the db (PDO:mysql) and added a DB-Connected service. In the table I have 40+ records. When I make a GET collection request the response looks OK (with the default HAL content negotiation). Then I change the content negotiation to JSON. Now when I make a GET collection request my response contains only 10 elements.
So my question is: where do I set/change this limit?
You can set the page size manually, like so:
$paginator = $this->getAlbumTable()->fetchAll(true);
// set the current page to what has been passed in query string, or to 1 if none set
$paginator->setCurrentPageNumber((int) $this->params()->fromQuery('page', 1));
// set the number of items per page to 10
$paginator->setItemCountPerPage(10);
http://framework.zend.com/manual/current/en/tutorials/tutorial.pagination.html
Could you please send the page_size, total_items part at the end of the json output?
it's like:
"page_count": 140002,
"page_size": 25,
"total_items": 3500035,
"page": 1
This is not an ideal fix, because it requires you to go into the source code rather than using the page size given in the UI.
The collection class that is auto generated for you by the DB-Connected style derives off of Zend/Paginator/Paginator. This class defines the $defaultItemCountPerPage static protected member which is defaulted to 10. That's why you're only getting 10 results. If you open up the auto-generated collection class for your entity and add: protected static $defaultItemCountPerPage = 100; in the otherwise empty class, you will see that you now get up to 100 results in the response. You can look at other Paginator class variables and methods that you could replace in your derived class to get your desired behavior.
This is not an ideal solution. I'd prefer that the generated code automatically used the same configured page size that the HalJson strategy uses. Maybe I'll contribute a PR to change that. Or, maybe I'll just use the HalJson approach. It does seem like the better way to go. You should have some limit on how much data you load from the DB at a time, so that you don't end up with an overly long running query or an overly large collection of data to deal with. And, whatever limit you set, what do you do when you hit that limit? With the simple Json method, you can't ever get "page 2" of data. So, if you are going to work with some sizeable amount of data, it might be better to use HalJson and then have some logic on the client side to grab pages of data at a time as needed. The returned JSON structure is a little more complicated, but not terribly so.
I'm probably in the same spot you are -- I'm trying to do a simple little api to play with while keeping everything simple and so I didn't want the client to have to deal with the other stuff in HalJson, but probably better to deal with that complexity and have a smooth way to page through data if you're going to use this with some real set of data. At least, that's the pep talk I'm giving myself right now. :-)
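The client-side logic for grabbing pages from a HalJson collection might look something like this. It is a rough sketch based on general HAL conventions rather than on anything shown in this thread: it assumes each response carries its items under _embedded and a link to the following page under _links.next.href, and fetch is a hypothetical callable that GETs a URL and returns the decoded JSON.

```python
from typing import Callable, Iterator


def iter_hal_items(fetch: Callable[[str], dict], first_url: str) -> Iterator[dict]:
    """Yield every embedded item, following the 'next' link page by page."""
    url = first_url
    while url:
        doc = fetch(url)
        # Assumption: _embedded maps a collection name to a list of items.
        for items in doc.get("_embedded", {}).values():
            yield from items
        # Assumption: the last page simply has no 'next' link.
        url = doc.get("_links", {}).get("next", {}).get("href")
```

The page_count, page_size and total_items fields quoted earlier could be used for a progress indicator, but the loop itself only needs the next link.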

Using $skip with the SharePoint 2013 REST API

Forgive me, I'm very new to using REST.
Currently I'm using SP2013 Odata (_api/web/lists/getbytitle('<list_name>')/items?) to get the contents of a list. The list has 199 items in it so I need to call it twice and each time ask for a different set of items. I figured I could do this by calling:
_api/web/lists/getbytitle('<list_name>')/items?$skip=100&$top=100
each time changing however many I need to skip. The problem is this only ever returns the first 100 items. Is there something I'm doing wrong or is $skip broken in the OData service?
Is there a better way to iterate through REST calls, assuming this way doesn't work or isn't practical?
I'm using the JSON protocol with the Accept header set to application/json;odata=verbose
I suppose the $top=100 isn't really necessary
Edit: I've looked it up more and, I'm not entirely sure of the terms here, but using $skip works fine if you're using the method introduced with SharePoint 2010, i.e., _vti_bin/ListData.svc/<list_name>?$skip=100
Actually, funny enough, the old way doesn't set a 100 item limit on returns. So skip isn't even necessary. But, if you'd like to only return a certain segment of data, you'd have to do something like:
_vti_bin/ListData.svc/<list_name>?$skip=x&$top=(x+y)
where each time through the loop you would have something like x+=y
You can either use the old method which I described above, or check out my answer below for an explanation of how to do this using SP2013 OData
Alright, I've figured it out. $skip isn't a command which is meant to be used at the items? level. It works only at the lists? level. But, there's a way to do this, actually much easier than what I wanted to do.
If you just want all the data
In the returned data, assuming the list you are calling holds more than 100 items, there will be a __next at d/__next (assuming you are using json). This __next (it is a double underscore, keep that in mind. I had a few problems at first because I was trying to get d/_next, which never returned anything) is the right URL to get the next set of items. __next is only present if there is another set of items available to get.
I ended up creating a RequestURL variable which was initially set to the original request, but was changed to d/__next at the end of the loop. The loop then checked that RequestURL was not empty before each iteration.
Forgive my lack of code, I'm using SharePoint Designer 2013 to make this, and the syntax isn't horribly descriptive.
If you'd only like a small set of data
There's probably a few situations where you would only want x amount of rows from your list each time you go through the loop and that's real easy to do as well.
If you just add a $top=x parameter to your request, the __next URL that comes back with the response will give you the next x rows from your list. Eventually, when there are no rows left to return, __next won't be returned with the response.
Don't forget that in order to use __next you need to have a
$skiptoken=Paged=TRUE
in the url as well.
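Since the answer has no code, here is a rough Python equivalent of the RequestURL loop it describes. It is a sketch, not the SharePoint Designer workflow itself: fetch is a hypothetical callable that GETs a URL with the Accept header application/json;odata=verbose and returns the decoded JSON, and the items are assumed to sit under d/results next to d/__next.

```python
from typing import Callable, Iterator


def iter_list_items(fetch: Callable[[str], dict], first_url: str) -> Iterator[dict]:
    """Yield list items, following d/__next until it is no longer returned."""
    request_url = first_url
    while request_url:
        d = fetch(request_url)["d"]
        # Assumption: the rows of this page are under d/results.
        yield from d.get("results", [])
        # __next (double underscore) is only present when more rows remain.
        request_url = d.get("__next")
```

Adding $top=x to first_url would make each page hold x rows, as described above, without any change to the loop.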

Drive API files.list query with 'not' parameter returns empty pages

I'm using the Drive API to list files from a collection which do not contain a certain string in their title.
My query looks something like this:
files().list(q="'xxxxx' in parents and not title contains 'toto'")
In my drive collection, I have 100 files, all contain the string "toto" in their title except for let's say 10 files.
I'm using pagination to retrieve the results 20 at a time, so I'm expecting to get only one page with the 10 files matching my request. Surprisingly, the API returns 5 pages: the first 4 have no results but do carry a next page token, and the files that match my request only come with the fifth page.
I'm still trying some use-cases here, but it seems to have something to do with the "not" operator. It is as if the request were made without it, therefore returning 5 pages, with the results that don't match the request then removed from the response. It's very disturbing for me, as I'm looking for the best performance here, and obviously having to make 5 requests to Drive instead of a single one is not good for me. I'm also noticing that the results don't always come in the last page. I ran the test with another collection: the results show up in the second page, but I still get 3 empty pages after that.
Am I missing something here? Is this kind of behaviour "normal"? I mean, imagine if I had 1000 documents in my collection: having to make 50 requests to find only a few is not what I expect.
I have a similar problem with the files.list API. I tried to retrieve all three folders under the root folder, and I received a result only on the 342nd page. After several hours of research I found some regularity in this strange behavior.
As I understand it, the Drive API works in this way:
1. It picks something like an index that best matches your query.
2. It selects the first 20 records using the index from step 1.
3. It applies your filter, removing records that do not match your query.
4. Whatever remains (possibly nothing) is returned to you with a next page token.
The nextPageToken looks like just an OFFSET for the first record of the next page in the chosen index; maybe it also contains some information about the query or index.
After base64-decoding this token, I found the record number for the next result at the 121st position in the decoded token.
Beforehand, I had built an index of tokens using maxResults=1.
This is crazy, but I have no other explanation for the observed behavior.
This is very convenient for the server, because it does very little work per search. On the other hand, this algorithm requires a lot of requests to paginate through the whole list. But the limit on requests per second solves this problem.
The only thing you can do is paginate and skip empty results. Do not forget about the limit on the number of requests.
Do not try to find errors on your side. This is how Google Drive API works.
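The "paginate and skip empty results" advice could be sketched like this. It is an illustration under assumptions: list_page is a hypothetical callable that performs one files.list request and returns the decoded response, and the response is assumed to hold its files under items and its token under nextPageToken, as in the v2-style query used in the question. The key point is that an empty page does not mean the listing is finished; only a missing nextPageToken does.

```python
from typing import Callable, Iterator, Optional


def iter_matching_files(
    list_page: Callable[..., dict], query: str
) -> Iterator[dict]:
    """Yield every file matching `query`, silently passing over empty pages."""
    token: Optional[str] = None
    while True:
        response = list_page(q=query, pageToken=token)
        # An empty page contributes nothing, but paging must continue.
        yield from response.get("items", [])
        token = response.get("nextPageToken")
        if not token:
            break  # no token: this really was the last page
```

With the Python client library, list_page might be something like lambda **kw: service.files().list(**kw).execute(), where service is an already authorised Drive service object (not shown here). Any rate limiting or backoff between requests would have to be added around that call.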
The contains operator is working as a prefix matcher at the moment. title contains 'toto' will match "totolong" and "toto", but not "blahtoto".