Getting excerpts from Wikipedia in bulk - mediawiki

Is there a way to get all title/excerpt pairs from Wikipedia? So far I have found two ways:
Download the excerpt dump, but it seems to contain incomplete/invalid excerpts, taken from the first line of each article I suppose.
Request excerpts using the MediaWiki API, but it is extremely slow because you can only get a single excerpt per request (bulk queries do not work for excerpts):
/w/api.php?action=query&format=json&titles=Main%20page&redirects&prop=extracts&explaintext=&exintro=
I would like to get the excerpts as they are generated by the MediaWiki API, without burdening Wikipedia's servers. Is it possible?
P.S. I need the excerpts as plain text. No wikitext or formatting required.
Update: it is possible to get at most 20 excerpts per request via the MediaWiki API by adding &exlimit=20 to the query.
See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bextracts
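For anyone else hitting this, a rough sketch of such a batched request (Python with the requests library; the 20-title cap matches the exlimit limit mentioned above, and continuation/rate limiting are left out for brevity):
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_excerpts(titles):
    """titles: a list of up to 20 page titles."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": 1,        # intro section only
        "explaintext": 1,    # plain text, no HTML
        "exlimit": "20",     # maximum allowed per request
        "titles": "|".join(titles),
        "redirects": 1,
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    return {p["title"]: p.get("extract", "") for p in pages.values()}

print(fetch_excerpts(["Main Page", "MediaWiki"]))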

It is not currently possible. You could look at the Yahoo abstracts in the dumps, which try to do something similar (not very well, though). They are powered by the ActiveAbstract extension.

Related

Can I get a version of a Wikipedia page as of specified date?

I am trying to access old versions of wiki pages using a date instead of "oldid". Usually, to access a version of a wiki page, I have to use the revision id, like this: https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=969106986. Is there a way to access the same page using the date without knowing the ID, if I know for example that there is a version of the page published on "12:44, 23 July 2020"?
In addition to the "main" API (called the action API by MediaWiki developers), you can also use the REST API. It may or may not be enabled at other wikis, but it is available on Wikipedia, which is what matters if you intend to query Wikipedia content.
The revision module of the action API (linked to in amirouche's answer) allows you to get the wikitext of a page. That is the source format used by MediaWiki, and it isn't easy to produce HTML from it; HTML can be easier to analyze (especially if you are doing linguistic analysis, for instance).
If HTML would be better for your use case, you can use the REST API, see https://en.wikipedia.org/api/rest_v1/#/. For instance, if you're interested in English Wikipedia's Main Page as of July 2008, you can use https://en.wikipedia.org/api/rest_v1/page/html/Main_Page/223883415.
The number (223883415) is the revision ID, which you can get through the action API.
However, keep in mind that this re-parses the revision's wikitext into HTML. That means it will not necessarily be exactly what was shown as of the date the revision was saved. For instance, the wikitext can contain conditions on the current date (which is used to automatically update the main page). If you're interested in seeing that, you would need to use archive.org.
You can use the MediaWiki API to get revisions; refer to the documentation at https://www.mediawiki.org/wiki/API:Revisions.
You need to map revision ids to dates. It will be straightforward :).
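As an illustration, here is a rough sketch (Python, using the standard prop=revisions parameters) of mapping a date to a revision id, which you can then plug into an oldid= URL or the REST API call above:
import requests

API = "https://en.wikipedia.org/w/api.php"

def revision_at(title, timestamp):
    """Return the id of the latest revision at or before `timestamp` (ISO 8601)."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp",
        "rvlimit": 1,
        "rvstart": timestamp,  # start listing at this point in time...
        "rvdir": "older",      # ...and go backwards, so the first hit is the revision as of that date
    }
    data = requests.get(API, params=params, timeout=30).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["revid"]

# e.g. the Main Page as of 12:44, 23 July 2020 (UTC)
print(revision_at("Main Page", "2020-07-23T12:44:00Z"))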

"Reverse" JSON Status API

I've been wondering how to fetch the PlayStation server status. They display it on this page:
https://status.playstation.com/en-us/
But PlayStation is known to use APIs instead of PHP database fetches. After looking around in the source code of the site, I found that they have a separate file called /data.json.
https://status.playstation.com/en-us/data.json
The content of this file is the same as the index file (for some reason). They use placeholders like {{endDateTitle}} and {{message}}, but I can't find where they are defined, or whether they are pulled in from a separate file or from a database using PHP.
How can I "reverse" this site and see if there's an API I can use to display the status on my site?
Maybe I did not get the question right, but it seems pretty straightforward.
If you are using Firefox, open the Developer Tools, go to the Network tab, and reload the page.
You can clearly see the requested URL:
https://status.playstation.com/data/statuses/region/SCEA.json
It seems that an empty list as a status means "No problems" (since there are currently no problems, I cannot verify this assumption). That's all.
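For completeness, a rough sketch of fetching that endpoint yourself (Python; the exact JSON layout is an assumption based on the observation above, so inspect the real payload and adjust the keys):
import requests

STATUS_URL = "https://status.playstation.com/data/statuses/region/SCEA.json"

resp = requests.get(STATUS_URL, timeout=10)
resp.raise_for_status()
data = resp.json()

# Assumption: the payload carries a "statuses" list that is empty when
# everything is up; check the actual response and adapt as needed.
statuses = data.get("statuses", [])
if not statuses:
    print("No problems reported")
else:
    for status in statuses:
        print(status)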
The double braces {{}} are used by various HTML templating languages, like Angular, so you'd have to go through the JS code to understand where they get filled in.

PUT page to OneNote via live-sdk

The Beta API provides two different endpoints for getting the content of a OneNote page from Live, one using the ContentUrl and one using the Id. Both work. The GET looks as if it should work in the same way to replace a page by using a PUT instead (it's not documented as doing this; I'm guessing):
PUT https://www.onenote.com/api/beta/pages/{pageId}/content
but I get a 404 if I try this, using the same Id I've just successfully used for a GET. Can one replace an entire page?
Andrew
PUT operations aren't yet supported by the OneNote API. We do support PATCH operations on a page to update specific elements. More information can be found in the OneNote PATCH reference documentation and the OneNote PATCH blog post.
Feel free to add a request for the PUT operation at https://onenote.uservoice.com
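For reference, a rough sketch of what such a PATCH call can look like (Python; the page id and token are placeholders, the beta endpoint mirrors the one in the question, and the full command format is in the PATCH documentation linked above):
import requests

page_id = "{pageId}"      # the page id you already used for the GET
token = "<access token>"  # OAuth token obtained via Live Connect

url = "https://www.onenote.com/api/beta/pages/%s/content" % page_id
commands = [
    {
        "target": "body",    # or a specific element id returned by the GET
        "action": "append",  # PATCH updates elements; it cannot replace the whole page
        "content": "<p>Appended via the API</p>",
    }
]
resp = requests.patch(url, json=commands,
                      headers={"Authorization": "Bearer " + token})
print(resp.status_code)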

Web-scraping only a specific domain

I am trying to make a web scraper that, for this example, scrapes news articles from Reuters.com. I want to get the title and date. I know I will ultimately just have to pull the source code from each address and then parse the HTML using something like JSoup.
My question is: how do I ensure I do this for each news article on Reuters.com? How do I know I have hit all the reuters.com addresses? Are there any APIs that can help me with this?
What you are referring to is called web scraping plus web crawling. What you have to do is visit every link matching some criteria (crawling) and then scrape the content (scraping). I've never used them, but here are two Java frameworks for the job:
http://wiki.apache.org/nutch/NutchTutorial
https://code.google.com/p/crawler4j/
Of course you will have to use jsoup (or similar) for parsing the content after you've collected the URLs.
Update
Check out Sending cookies in request with crawler4j? for a better list of crawlers. Nutch is pretty good, but very complicated if the only thing you want is to crawl one site. crawler4j is very simple, but I don't know whether it supports cookies (and if that matters to you, it's a deal breaker).
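If it helps to see the crawl-then-scrape loop in one place, here is a rough sketch in Python (requests + BeautifulSoup rather than the Java tools above, purely for illustration; the title and date selectors are guesses and will need adjusting to Reuters' actual markup):
import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

START = "https://www.reuters.com/"
seen, queue = {START}, deque([START])

while queue and len(seen) < 100:  # hard cap, just for the sketch
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    soup = BeautifulSoup(html, "html.parser")

    # Scrape: guessed selectors, adjust after inspecting the real pages.
    title = soup.find("h1")
    date = soup.find("meta", attrs={"property": "article:published_time"})
    if title:
        print(title.get_text(strip=True), date["content"] if date else "?")

    # Crawl: follow only links that stay on reuters.com.
    for a in soup.find_all("a", href=True):
        link = urllib.parse.urljoin(url, a["href"]).split("#")[0]
        host = urllib.parse.urlparse(link).netloc
        if (host == "reuters.com" or host.endswith(".reuters.com")) and link not in seen:
            seen.add(link)
            queue.append(link)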
Try this website: http://scrape4me.com/
I was able to generate this URL for the headline: http://scrape4me.com/api?url=http%3A%2F%2Fwww.reuters.com%2F&head=head&elm=&item[][DIV.topStory]=0&ch=ch

How to include the result of an api request in a template?

I'm creating a wiki using MediaWiki for the first time. I would like to automatically include all backlinks to the current page in a template (like a "See also" section). I tried to play with the API, successfully, but I still haven't succeeded in including the useful part of the result in my template.
I have been searching Google and Stack Overflow for days (maybe in the wrong way) but I'm still stuck.
Can somebody help me?
As far as I know, there is no reasonable way to do that. Probably the closest you could get is to write JavaScript code that reacts to the presence of a specific HTML element on the page, makes the API request, and then updates the HTML to include the result.
It's not possible to execute any JavaScript from wikitext, or to use more uncommon HTML. As such, you won't be able to use the MediaWiki API like that.
There are multiple different options you have to achieve something like this though:
You could use the API by including custom JavaScript code in MediaWiki:Common.js. The code there is loaded automatically and can be used to enhance the wiki experience. This obviously requires JavaScript on the client, so it might not be the best option, but at least you could use the API directly. You would also have to add something to figure out where to place the results correctly, though.
A better option would be to use an extension that gives you this output. You can either try to find an extension that already provides this functionality, or write your own that uses the internal MediaWiki API (not the JS one) to access that content.
One extension I can personally recommend that does this (and many other things) is DynamicPageList (full disclosure: I'm somewhat affiliated with that project). It allows you to perform complex page selections.
For example, what you are trying to do is find all pages that link to your page. This can easily be done with DPL like this:
{{ #dpl: linksto = {{FULLPAGENAME}} }}
I wrote a blog post recently showing how to call the API to get the job queue size and display it inside a wiki page. You can read about it at Display MediaWiki job queue size inside your wiki. This solution does require the External Data extension, however. The code looks like:
{{#get_web_data: url={{SERVER}}{{SCRIPTPATH}}/api.php?action=query&meta=siteinfo&siprop=statistics&format=json
| format=JSON
| data=jobs=jobs}}
{{#external_value:jobs}}
You could easily swap in a different API call to get other data. For the specific item you're looking for, poke's answer above is probably better.
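Whichever route you take (Common.js, a custom extension, or External Data), the wiki-side code ultimately wraps the same query; here is a rough sketch of the underlying list=backlinks call (Python, against en.wikipedia.org purely as an example, with continuation left out):
import requests

API = "https://en.wikipedia.org/w/api.php"  # replace with your wiki's api.php

def backlinks(title):
    """Return titles of pages linking to `title` (first batch only)."""
    params = {
        "action": "query",
        "format": "json",
        "list": "backlinks",
        "bltitle": title,
        "bllimit": "max",
    }
    data = requests.get(API, params=params, timeout=30).json()
    return [bl["title"] for bl in data["query"]["backlinks"]]

print(backlinks("Main Page"))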