How to get Wikipedia content as text via the API? - mediawiki

I want to get Wikipedia pages as text.
I looked at the Wikipedia API from here https://en.wikipedia.org/w/api.php which says that in order to get pages as text I need to append this to a page address:
api.php?action=query&meta=siteinfo&siprop=namespaces&format=txt
However, when I try appending this suffix to a normal page's address, the page is not found:
https://en.wikipedia.org/wiki/George_Washington/api.php?action=query&meta=siteinfo&siprop=namespaces&format=txt
Following the instructions from Get Text Content from mediawiki page via API, I tried adding /api.php?action=parse&page=test to the end of the query string. Therefore, I obtained this:
https://en.wikipedia.org/wiki/George_Washington/api.php?action=parse&page=test
However, this doesn't work either.

NB: All these examples are CORS-enabled.
Text only
From the precise title, as seen in the Wikipedia page URL:
https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&explaintext&titles=Sokolsky_Opening&format=json
Search relevant pages by keywords
Get IDs, precise titles/URLs, and a quick text extract:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=max&format=json&exsentences=1&origin=*&exintro=&explaintext=&generator=search&gsrlimit=23&gsrsearch=chess
Wiki page ID
Using the precise title:
https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=pageprops&format=json&titles=Sokolsky_Opening
Full html
By wiki page ID, includes the Wikitext:
https://en.wikipedia.org/w/api.php?action=parse&origin=*&format=json&pageid=100017
Stripped html
Lighter html version, without the Wikitext.
https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&format=json&titles=Sokolsky_Opening
Cross origin:
With CORS requests, it may sometimes take 2 calls to the API, to jump between the page ID and the page title.
In an SSL (https) context, we can use fetch to embed some wiki text anywhere.
Example fetching the remote .json:
fetch("https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&explaintext&format=json&titles=Sokolsky_Opening").then(v => v.json()).then((function(v){
main.innerHTML = v["query"]["pages"]["100017"]["extract"]
})
)
<pre id="main" style="white-space: pre-wrap"></pre>
⚠️ This API has some quirks: pages with heavy content sometimes get truncated, among other things, and rate limiting is possible.
🧘 Good luck. 🜀

You have to use one of these formats: json, jsonfm, none, php, phpfm, rawfm, xml or xmlfm; txt is not a valid format. Also, your API link is wrong; use this:
https://en.wikipedia.org/w/api.php?action=query&titles=George_Washington&prop=revisions&rvprop=content&format=xml
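For instance, a minimal fetch sketch (assuming a browser context, hence origin=* for CORS) that pulls the raw wikitext out of the same query with format=json:
fetch("https://en.wikipedia.org/w/api.php?action=query&origin=*&titles=George_Washington&prop=revisions&rvprop=content&format=json")
  .then(r => r.json())
  .then(data => {
    // format=json keys pages by page ID, so take the first (only) entry
    const page = Object.values(data.query.pages)[0];
    console.log(page.revisions[0]["*"]); // wikitext of the latest revision
  });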

Related

Embed HTML within a URL

Is it possible to embed HTML within a URL and then render that HTML in the browser itself?
In theory, what I'm thinking of works similarly to a URL like the one below:
http://"<h1>Hello World</h1>"
this would show a page with "Hello World" wrapped in a <h1> tag.
Of course, I understand that the above does not work in the real world for a wide range of reasons. Is there, however, a way in which I can encode data within a URL and render that data as HTML within the browser?
I understand that you could easily set up a webserver to do this, but I am interested in a solution which would work natively without any dependencies.
It's a data URL. Data URLs, i.e. URLs prefixed with the data: scheme, allow content creators to embed small files inline in documents. They were formerly known as "data URIs" until that name was retired by the WHATWG.
Data URLs are treated as unique opaque origins by modern browsers, rather than inheriting the origin of the settings object responsible for the navigation.
Syntax:
data:[<mediatype>][;base64],<data>
The HTML, a link whose href is itself a data URL (the inner welcome-page target is just a placeholder):
<a href="data:text/html,<h1>Test</h1><p>This is what a data url does.</p><a href='https://example.com'>Go to welcome page</a>">Test</a>
Note that modern browsers block top-level navigation to data:text/html URLs from links, so the URL may need to be loaded as the src of an iframe instead.
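A data URL can also be built programmatically; a minimal sketch in plain JavaScript (browser context, assuming an <iframe> is on the page):
// Percent-encoded variant
var plain = "data:text/html," + encodeURIComponent("<h1>Hello World</h1>");
// Base64 variant, matching the [;base64] part of the syntax above
var b64 = "data:text/html;base64," + btoa("<h1>Hello World</h1>");
// Render it without navigating the top-level page (see the note above)
document.querySelector("iframe").src = plain;
Both strings render "Hello World" in an <h1>, with no webserver involved.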

Extract content from Wikipedia to Mediawiki

Is there a way to get the intro content from a Wikipedia page onto my MediaWiki page? I was thinking of using Wikipedia's API, but I don't know how to parse the URL on my page, or how to do it with templates. I just want a query that will display the introduction part of a Wikipedia page on my page.
I used the External_Data extension and Wikipedia's API to achieve this.
The API
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=[title of wikipedia page]
How I used it
{{#get_web_data:
url=http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles={{PAGENAME}}
|format=JSON|data=extract=extract}}
How I displayed the extract on pages
{{#external_value:extract}}
However, I still need to figure out how to get only a paragraph from the returned text. I will probably use a parser function.
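If trimming to a sentence or two is enough, the exsentences parameter of the TextExtracts extension (the same one used in the keyword-search example further up) may avoid the parser function entirely; the url= line would become, for example:
url=http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&exsentences=2&titles={{PAGENAME}}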

Parsing Wikipedia page content

I'm looking for a library to parse HTML pages, specifically Wikipedia articles, for example http://en.wikipedia.org/wiki/Railgun. I want to extract the article's text and images (the full-scale or original image, not the thumb).
Is there an HTML parser out there?
I would prefer not to use the MediaWiki API since I can't seem to figure out how to extract an article's text and the full-size images with it.
Thanks, and sorry for my English.
EDIT: I forgot to say that the end result should be valid HTML.
EDIT: I got the JSON string with this: https://en.wikipedia.org/w/api.php?action=parse&pageid=218930&prop=text&format=json so now I need to parse the JSON.
I know that in JavaScript I can do something like this:
var pageHTML = JSON.parse("the json string").parse.text["*"];
Since I know a bit of HTML/JavaScript and Python, how can I make that HTTP request and parse the JSON in Python 3?
I think you should be able to get everything with the web API:
https://www.mediawiki.org/wiki/API:Main_page
https://www.mediawiki.org/wiki/API:Parsing_wikitext
or you could download the whole of Wikipedia:
https://meta.wikimedia.org/wiki/Research:Data
You can get the HTML from the API too; check the info on https://www.mediawiki.org/wiki/Extension:TextExtracts/pt. It's like this example: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=175&titles=hello%20world
If the volume of pages you need is high, you should consider using the public dumps instead.
I made a Node.js module called wikipedia-to-json (written in JavaScript) that parses the HTML in Wikipedia articles and gives you back structured JSON objects that describe the layout of the article in order (titles, paragraphs, images, lists, sub-titles...).
That might be useful if you just want to do quick extractions of text and sections and understand what things look like.
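Whatever the language, the two calls look the same; a sketch in JavaScript (browser, or Node 18+ where fetch is built in), pairing the asker's parse call with imageinfo (iiprop=url) to get the original uploads rather than thumbs:
const api = "https://en.wikipedia.org/w/api.php";
// Article body as valid HTML, as in the asker's JSON.parse one-liner
fetch(api + "?action=parse&pageid=218930&prop=text&format=json&origin=*")
  .then(r => r.json())
  .then(d => console.log(d.parse.text["*"]));
// Full-size image URLs: enumerate the article's files, then ask for each file's URL
fetch(api + "?action=query&generator=images&pageids=218930&gimlimit=max&prop=imageinfo&iiprop=url&format=json&origin=*")
  .then(r => r.json())
  .then(d => {
    const urls = Object.values(d.query.pages)
      .map(p => p.imageinfo && p.imageinfo[0].url)
      .filter(Boolean);
    console.log(urls); // original uploads, not thumbnails
  });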

Wikipedia MediaWiki API Get Template Content via URL Request

I've been going through the docs for the past few hours and simply can't seem to figure this out, though it's probably simple.
I have this link:
http://en.wikipedia.org/w/api.php?format=xml&action=expandtemplates&titles=Arabinose&text={{Chembox%20Elements}}&prop=wikitext
This obviously gives me the schema of the template (Chembox | Chembox Elements in this case).
All I want is to retrieve the molecular formula content/data/value for the given page/title without having to parse the entire wiki content at my end.
I understand that I have prop=wikitext, which will return wikitext in the above example; there's no prop=text option in expandtemplates. I've been back and forth with action=query, expandtemplates, etc. and no joy.
MediaWiki's API won't do the work for you. You'll have to parse your own results.
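One way to do that parsing, sketched in JavaScript: fetch the page's raw wikitext (via prop=revisions, as in the first answer on this page) and pull the field out with a regex. The Formula parameter name below is an assumption; check the actual field name in the Chembox on the page.
fetch("https://en.wikipedia.org/w/api.php?action=query&origin=*&titles=Arabinose&prop=revisions&rvprop=content&format=json")
  .then(r => r.json())
  .then(d => {
    const wikitext = Object.values(d.query.pages)[0].revisions[0]["*"];
    // Naive match on "| Formula = ..."; assumes the value has no nested templates
    const m = wikitext.match(/\|\s*Formula\s*=\s*([^\n|]+)/);
    console.log(m ? m[1].trim() : "not found");
  });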

REST/Ajax deep linking compatibility - Anchor tags vs query string

So I'm working on a web app, and I want to filter search results.
A nice restful implementation might look like this:
1. mysite.com/clothes/men/hats+scarfs
But let's say we want to ajax up the filtering, like the cool kids, and we want to retain deep linking; we might use the anchor tag and parse that with JavaScript to show the correct listings:
2. mysite.com/clothes#/men/hats+scarfs
However, if someone clicks the first link with JS enabled, and then changes filters, we might get:
3. mysite.com/clothes/men/hats+scarfs#/women/shoes
Urk.
Similarly, if someone does not have JS enabled and clicks link 2, JS will not parse the options and the correct listings will not be shown.
Are Ajax deep links and non-Ajax links incompatible? It would seem so, as servers cannot parse the # part of a URL, since it is not sent to the server.
There's a monkeywrench being thrown into this issue by Google: A proposal for making Ajax crawlable. Google is including recommendations for url structure there that may give you ideas for your own application.
Here's the wrapup:
In summary, starting with a stateful URL such as http://example.com/dictionary.html#AJAX, it could be available to both crawlers and users as http://example.com/dictionary.html#!AJAX, which could be crawled as http://example.com/dictionary.html?_escaped_fragment_=AJAX, which in turn would be shown to users and accessed as http://example.com/dictionary.html#!AJAX
View Google's Presentation here (note: google docs presentation)
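On the client side, the stateful fragment is typically read back both on page load and whenever it changes; a sketch (loadListings is a hypothetical stand-in for your Ajax call):
function applyHashState() {
  var state = location.hash.replace(/^#!?/, ""); // "#!AJAX" -> "AJAX"
  if (state) loadListings(state); // hypothetical: fetch and render that state
}
window.addEventListener("load", applyHashState);
window.addEventListener("hashchange", applyHashState);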
In general I think it's useful to simply turn off JavaScript and CSS entirely and browse your website and web application and see what ends up getting exposed. Once you get a sense of what's visible, you will understand what most search engines see and that in turn will show you what is and is not getting spidered.
If you go to mysite.com/clothes/men/hats+scarfs with JavaScript enabled, then your JavaScript should automatically rewrite that to mysite.com/clothes#men/hats+scarfs. When you click on a filter, the filters should be controlled by JavaScript, meaning you'll only change the hash rather than the entire URL (as you're going to have to return false anyway).
The problem you have is non-JS users going to your JS-enabled deep links, as the server can't determine that stuff. Unfortunately, the only thing you can do is take them to mysite.com/clothes and make them start their journey again (as far as I'm aware). You'll need to try to ensure that when people link to the site, they use the hardcoded deep link rather than the hashed deep link.
I don't recommend ever using the query string, as you are sending data back to the server without direct relevance to the prior specified destination. That is a corruptible security hole, as malicious code can be manually added to the query string to cause an XSS or buffer overflow attack at your webserver.
I believe REST was intended to work with absolute URIs without a query string, because then you're specifying only the location of a resource, and it is that location that is descriptive and semantically relevant, in addition to the possibility of the resource being equally relevant. Even if there is no resource at the specified path, you have still instantiated a potentially unique and descriptive location that can be processed accordingly.
Users entering the site via deep links
Nonsensical links (like /clothes/men/hats#women/shoes) can be avoided if you construct your Ajax initialisation code in such a way that users who enter the site on filtered pages (e.g. /clothes/women/shoes) are taken to the /clothes page before any Ajax filtering happens. For example, you might do something like this (using jQuery):
$("a.filter")
.each(function() {
var href = $(this).attr("href").replace("/clothes/", "/clothes#");
$(this).attr("href", href);
})
.click(function() {
update_filter($(this).attr("href").split("#")[1]);
});
Users without JavaScript
As you said in the question, there's no way for the server to know about the URL fragment so filtering would not be applied for users without JavaScript enabled if they were given a link to /clothes#filter.
However, even without filtering, these links could be made more meaningful for non-JS users by using the filter strings as IDs in your /clothes page. To prevent this from messing with the Ajax experience, the IDs would need to be changed (or the elements removed) with JavaScript before the Ajax links were initialised; a sketch of that follows.
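The ID shuffle might look like this, run before the Ajax filter links are initialised (the filter-* naming is hypothetical, standing in for whatever IDs the /clothes page uses):
// Free up the fragment for filter state: the static anchor targets get
// renamed so /clothes#women/shoes no longer scrolls to an element
$("[id^='filter-']").each(function() {
  this.id = "static-" + this.id;
});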
How practical this is depends on how many categories you have and what your /clothes page contains.