How to get extracts of revisions of wikipedia articles

How to get extracts of revisions of wikipedia articles - mediawiki

I would like to get extracts for old versions (revisions) of wikipedia articles. This question shows how to get the content in a json format. In particular it uses prop=extracts then explaintext= to return the content. I would like to do the same but for a revision (using revid=*) of an article.

This is not possible using only TextExtracts. See RFE T66546 Support revisions, which was declined. In fact, if you try to specify revids instead of titles, then TextExtracts will return the extract(s) for the current revision(s) of the corresponding title(s). For example:
action=query&prop=extracts&exchars=100&explaintext&revids=342428310
has equivalent output to:
action=query&prop=extracts&exchars=100&explaintext&titles=Earth
You would need to figure out how TextExtracts prepares an extract and do the same for the revision of interest.

Related

Extract or generate X-Client-TraceId for header in GET-request

I would like to retrieve some historical stock prices via a REST API from the following site:
https://www.boerse-frankfurt.de/zertifikat/de0007873291-open-end-zertifikat-auf-dow-jones-industrial-average
The response is a JSON.
Basically, the query can be done as follows: An OPTIONS call is sent without parameters and then a GET request with header parameters.
Both calls are sent to the following address:
https://api.boerse-frankfurt.de/v1/data/quote_history_derivatives?isin=DE0007873291&mic=XSC&from=2021-11-12T07%3A00%3A00.000Z&to=2021-11-12T21%3A00%3A00.000Z&offset=0&limit=25
The following two parameters are included in the header:
Client-Date: 2021-11-16T23:02:29.529Z
X-Client-TraceId: d2d6911d81ebbbff7a7549555a2c26d6
And now my question: how do you get the X-Client-TraceId? It looks like a UUID, but it doesn't seem to be one. The value changes with every page view in the browser. But you can't just enter any value.
Many greetings,
Trebor

Since this question was asked, someone has written a blog post about this exact topic. The algorithm detailed there still seems to be in use (as of 2022-03-12).
An excerpt of the relevant parts:
Client-Date
This is the current time, converted to a string with Javascript’s toISOString() function.
[...]
X-Client-TraceId
[...]
salt is a fixed string, in this case w4icATTGtnjAZMbkL3kJwxMfEAKDa3MN. Apparently it appears in the source code as-is so it must be constant.
X-Client-TraceId is the md5 of time + url + salt.
Note: time is the string sent in the Client-Date header.
The blog post has some additional information around the process of reverse engineering this algorithm and the X-Security header.

Using cts query to retrieve collections associated with the given document uri- Marklogic

I need to retrieve the collections to which a given document belongs in Marklogic.
I know xdmp command does that. But I need to use it in cts query to retrieve the data and then filter records from it.
xdmp:document-get-collections("uri of document") can't be run inside cts-query to give appropriate data.
Any idea how can it be done using cts query?
Thanks

A few options come to mind:
Option One: Use cts:values()
cts:values(cts:collection-reference())
If you check out the documentation, you will see that you can also restrict this to certain fragments by passing a query as one of the parameters.
**Update: [11-10-2017]
The comment attached to this asked for a sample of restricting the results of cts:values() to a single document(for practical purposes, I will say fragment == document)
The documentation for cts:values explains this. It is the 4th parameter - a query to restrict the results. Get to know this pattern as it is part of many features of MarkLogic. It is your friend. The query I would use for this problem statement would be a cts:document-query();
An Example:
cts:values(
cts:collection-reference(),
(),
(),
cts:document-query('/path/to/my/document')
)
Full Example:
cts:search(
collection(),
cts:collection-query(
cts:values(
cts:collection-reference(),
(),
(),
cts:document-query('/path/to/my/document')
)
)
)[1 to 10]
Option two: use cts:collection-match()
Need more control over returning just some of the collections from a document, then use cts:colection-match(). Like the first option, you can restrict the results to just some fragments. However, it has the benefit of having an option for a pattern.
Attention:
They both return a sequence - perfect for feeding into other parts of your query. However, under the hood, I believe they work differently. The second option is run against a lexicon. The larger the list of unique collection names and the more complex your pattern match, the longer for resolution. I use collection-match in projects. However, I usually use it when I can limit the possible choices by restricting the results to a smaller number of documents.

You can't do this in a single step. You have to run code first to retrieve collections associated with a document. You can use something like xdmp:document-get-collections for that. You then have to feed that into a cts query that you build dynamically:
let $doc-collections := xdmp:document-get-collections($doc-uri)
return
cts:search(collection(), cts:collection-query($doc-collections))[1 to 10]
HTH!

Are you looking for cts:collection-query()?

Insert two XML files to the same collection:
xquery version "1.0-ml";
xdmp:document-insert("/a.xml", <root><sub1><a>aaa</a></sub1></root>,
map:map() => map:with("collections", ("coll1")));
xdmp:document-insert("/b.xml", <root><sub2><a>aaa</a></sub2></root>,
map:map() => map:with("collections", ("coll1")));
Search the collection:
xquery version "1.0-ml";
let $myColl:= xdmp:document-get-collections("/a.xml")
return
cts:search(/root,
cts:and-query((cts:collection-query($myColl),cts:element-query(xs:QName("a"),"aaa")
)))

Get first result of suggested results with Wikipedia's API

I am trying to use Wikipedia's API, but I can't find a way to get the first result when there are multiple possible ones.
For example if I use this request https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Cimages&titles=test&redirects=1&explaintext=1&imlimit=20 it will return an article that says
Test may refer to: ...
What I want is for it to skip this part and give me directly the results of the first result that "Test may refer to".
Do you know if this is possible or not ?
Thank you for reading :)

Drive API files.list query with 'not' parameter returns empty pages

I'm using the Drive API to list files from a collection which do not contain a certain string in their title.
My query looks something like this:
files().list(q="'xxxxx' in parents and not title contains 'toto'")
In my drive collection, I have 100 files, all contain the string "toto" in their title except for let's say 10 files.
I'm using pagination to retrieve the results 20 by 20, so I'm expecting to get only one page with the 10 files corresponding to my request. Surprisingly, the API returns 5 pages, with the first 4 having no results but with a nextToken page, and the files which are compliant with my request only come with the fifth page.
I'm still trying some use-cases here but it seems that it has something to do with the "not" operator. Like if the request was made without it, therefore returning 5 pages, but the results not corresponding to the request being removed from the response. It's very disturbing for me as I'm looking for the best performance here, and obviously having to make 5 requests to Drive instead of one single is not good for me. I'm also noticing that the results don't always come in the last page. I made the test with another collection, the results show up in the second page, but I still get 3 empty pages after that.
Am I missing something here ? Is this kind of behaviour "normal" ? I mean imagine if I had 1000 documents in my collection, having to make 50 requests to find only a few is not what I expect.

I have similar problem in files.list API. I tried to receive all three folders under root folder. I received result only on 342nd page. After several hours of researching I found some regularity in this strange behavior.
As I understood, the Drive API works in this way:
Detects something like index that best match your query
Selects first 20 records using index from step 1
Applies your filter: removes records that do not match your query
Rest is returned to you (maybe empty) with next page token.
The nextPageToken is looks like just OFFSET for the first record on next page in decided index, maybe it contains some information about query or index.
After base64 decode this token I found appropriate record number for next result in 121st position in decoded token.
Previously I built index of tokens using maxResults=1.
This is crazy, but I have no other explanation for observable behavior.
It is very useful for server because server do a very small work for search. From other side this algorithm must produce a lot of requests for pagenate whole list. But limitation for requests per second solve this problem.
Only You can do is pagenage and skip empty results. Do not forget about limitation of number of requests.
Do not try to find errors on your side. This is how Google Drive API works.

contains operator is working as a prefix matcher at the moment.title contains 'toto' will match "totolong" and "toto", but not "blahtoto".

Proper function prefix notation

Not sure if this is the right place to post this question but:
If I have a list of functions that I am listing in a spec doc, lets say
MyObj_Func1
MyObj_Func2
MyObj_Func3
and so on but I just want to list the core name of the functions (Func1, Func2, Func3) How do I note at the top to say "the functions listed below all start with "MyObj_"?
Something like: MyObj_:: and then ... before the function names in the list?

I would leave the prefixes in the documentation. Though repetitive, it's what I would expect to see (as a c developer).
For example see the GNOME docs for hash table (which I consider to be fairly well done): https://developer.gnome.org/glib/2.31/glib-Hash-Tables.html.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

How to get extracts of revisions of wikipedia articles - mediawiki

I would like to get extracts for old versions (revisions) of wikipedia articles. This question shows how to get the content in a json format. In particular it uses prop=extracts then explaintext= to return the content. I would like to do the same but for a revision (using revid=*) of an article.

Related

Extract or generate X-Client-TraceId for header in GET-request

Using cts query to retrieve collections associated with the given document uri- Marklogic

Get first result of suggested results with Wikipedia's API

Drive API files.list query with 'not' parameter returns empty pages

Proper function prefix notation

Categories

Resources