Extract content from Wikipedia to Mediawiki - mediawiki

Is there a way to get the intro content from a Wikipedia page onto my MediaWiki page? I was thinking of using Wikipedia's API, but I don't know how to parse the URL on my page, or how to do this with templates. I just want a query that will display the introduction of a Wikipedia page on my page.

I used the External Data extension and Wikipedia's API to achieve this.
The API
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=[title of wikipedia page]
How I used it
{{#get_web_data:
url=http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles={{PAGENAME}}
|format=JSON|data=extract=extract}}
How I displayed the extract on pages
{{#external_value:extract}}
However, I still need to figure out how to get only one paragraph from the returned text. I will probably use a parser function.
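For comparison outside the wiki, here is a minimal Python sketch of the same query that keeps only the first paragraph. It assumes `explaintext` is added to the request so that paragraphs arrive separated by newlines; the function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"


def intro_url(title):
    """Build the extracts query for the plain-text intro of `title`."""
    params = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
        "exintro": "",
        "explaintext": "",  # assumption: plain text instead of HTML
        "titles": title,
    }
    return API + "?" + urllib.parse.urlencode(params)


def first_paragraph(response):
    """Return the first non-empty paragraph of the first page's extract."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for paragraph in page.get("extract", "").split("\n"):
            if paragraph.strip():
                return paragraph.strip()
    return ""


def fetch_first_paragraph(title):
    """Network call: download the intro and keep its first paragraph."""
    request = urllib.request.Request(
        intro_url(title),
        headers={"User-Agent": "intro-sketch/0.1 (example)"},  # hypothetical
    )
    with urllib.request.urlopen(request) as reply:
        return first_paragraph(json.load(reply))
```

Calling `fetch_first_paragraph("Sokolsky Opening")` should return the opening paragraph as a string; inside the wiki itself, the split would still have to be done with a parser function.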

Related

How to get Wikipedia content as text by API?

I want to get Wikipedia pages as text.
I looked at the Wikipedia API at https://en.wikipedia.org/w/api.php, which says that in order to get pages as text I need to append this to a page address:
api.php?action=query&meta=siteinfo&siprop=namespaces&format=txt
However, when I try appending this suffix to a normal page's address, the page is not found:
https://en.wikipedia.org/wiki/George_Washington/api.php?action=query&meta=siteinfo&siprop=namespaces&format=txt
Following the instructions from "Get Text Content from mediawiki page via API", I tried appending /api.php?action=parse&page=test to the page address instead, which gave me:
https://en.wikipedia.org/wiki/George_Washington/api.php?action=parse&page=test
However, this doesn't work either.
NB: All these examples are CORS-enabled.
Text only
From the precise title, as seen in the wikipedia page url:
https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&explaintext&titles=Sokolsky_Opening&format=json
Search relevant pages by keywords
Get the IDs, the precise titles/URLs, and a quick text extract:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=max&format=json&exsentences=1&origin=*&exintro=&explaintext=&generator=search&gsrlimit=23&gsrsearch=chess
Wiki page ID
Using the precise title:
https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=pageprops&format=json&titles=Sokolsky_Opening
Full html
By wiki page ID, includes the Wikitext:
https://en.wikipedia.org/w/api.php?action=parse&origin=*&format=json&pageid=100017
Stripped html
Lighter html version, without the Wikitext.
https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&format=json&titles=Sokolsky_Opening
Cross origin:
With CORS requests, it may sometimes take two calls to the API to go from a page title to its ID, or the reverse.
On a page served over HTTPS, we can use fetch to embed some wiki text anywhere.
Example fetching the remote JSON:
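fetch("https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&explaintext&format=json&titles=Sokolsky_Opening")
  .then(response => response.json())
  .then(data => {
    // Take the first (only) page rather than hard-coding its ID (100017).
    const page = Object.values(data.query.pages)[0];
    // The extract is plain text, so textContent is enough and avoids injecting markup.
    document.getElementById("main").textContent = page.extract;
  });
<pre id="main" style="white-space: pre-wrap"></pre>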
⚠️ This API has some quirks: pages with heavy content are sometimes truncated, and requests may be rate-limited.
🧘 Good luck. 🜀
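The `pages` object in these responses is keyed by page ID, so a client that does not know the ID in advance can simply iterate over it. A minimal Python sketch, assuming the response has already been decoded from JSON into a dict (the function name and the sample data are illustrative):

```python
def extracts_by_title(response):
    """Map each returned page title to its extract, skipping pages without one."""
    pages = response.get("query", {}).get("pages", {})
    return {
        page["title"]: page["extract"]
        for page in pages.values()
        if "extract" in page
    }


# Hand-written sample shaped like a prop=extracts response (not real API output).
sample = {
    "query": {
        "pages": {
            "100017": {"pageid": 100017, "title": "Sokolsky Opening", "extract": "Some text."},
            "-1": {"title": "No such page", "missing": ""},
        }
    }
}
print(extracts_by_title(sample))
```

The same helper works for the keyword search example, where the response contains several pages at once.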
You have to use one of these formats: json, jsonfm, none, php, phpfm, rawfm, xml or xmlfm, so txt is not a valid format. Also, your API link is wrong; use this:
https://en.wikipedia.org/w/api.php?action=query&titles=George_Washington&prop=revisions&rvprop=content&format=xml

parsing wikipedia page content

I'm looking for a library to parse HTML pages, specifically Wikipedia articles such as http://en.wikipedia.org/wiki/Railgun. I want to extract the article's text and images (full-scale or original images, not the thumbnails).
Is there an HTML parser out there?
I would prefer not to use the Wikimedia API, since I can't figure out how to extract an article's text and its full-size images with it.
Thanks and sorry for my english.
EDIT: I forgot to say that the end result should be valid HTML.
EDIT: I got the json string with this: https://en.wikipedia.org/w/api.php?action=parse&pageid=218930&prop=text&format=json so now I need to parse the json.
I know that in javascript I can do something like this:
var pageHTML = JSON.parse("the json string").parse.text["*"];
Since I know a bit of HTML/JavaScript and Python, how can I make that HTTP request and parse the JSON in Python 3?
I think you should be able to get everything with the web API:
https://www.mediawiki.org/wiki/API:Main_page
https://www.mediawiki.org/wiki/API:Parsing_wikitext
Or you could download the whole of Wikipedia:
https://meta.wikimedia.org/wiki/Research:Data
You can get the HTML from the API too. Check the info at https://www.mediawiki.org/wiki/Extension:TextExtracts/pt; a request looks like this example: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=175&titles=hello%20world .
Depending on how many pages you'll need, you should consider using public dumps if the volume of pages is high.
I made a Node.js module called wikipedia-to-json (written in JavaScript) that parses the HTML of Wikipedia articles and gives you back structured JSON objects that describe the layout of the article in order (titles, paragraphs, images, lists, sub-titles...).
That might be useful if you just want to do a quick extraction of text and sections and understand what things look like.
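For the Python 3 part of the question, a minimal sketch using only the standard library might look like the following. It mirrors the JavaScript expression `JSON.parse(...).parse.text["*"]`; the function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def parse_url(page_id):
    """Build the action=parse request for the rendered HTML of a page ID."""
    params = {"action": "parse", "pageid": page_id, "prop": "text", "format": "json"}
    return "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)


def html_from_parse(json_text):
    """Python equivalent of JSON.parse(json_text).parse.text["*"]."""
    return json.loads(json_text)["parse"]["text"]["*"]


def fetch_page_html(page_id):
    """Network call: download the JSON and return the article HTML."""
    request = urllib.request.Request(
        parse_url(page_id),
        headers={"User-Agent": "parse-sketch/0.1 (example)"},  # hypothetical
    )
    with urllib.request.urlopen(request) as reply:
        return html_from_parse(reply.read().decode("utf-8"))
```

`fetch_page_html(218930)` should return the article body as an HTML string, which can then be handed to any HTML parser to pick out text and image links.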

Wikipedia MediaWiki API Get Template Content via URL Request

I've been going through the docs for the past few hours and simply can't figure this out, though it is probably simple.
I have this link:
http://en.wikipedia.org/w/api.php?format=xml&action=expandtemplates&titles=Arabinose&text={{Chembox%20Elements}}&prop=wikitext
This obviously gives me the schema of the Chembox template (Chembox Elements in this case).
All I want is to retrieve the molecular formula content/data/value for the given page/title without having to parse the entire wiki content at my end.
I understand that prop=wikitext returns wikitext in the above example, and there's no prop=text option in expandtemplates. I've been back and forth with action=query, expandtemplates, etc., with no joy.
MediaWiki's API won't do the work for you. You'll have to parse your own results.
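Parsing the result yourself can be as crude as a regular expression over the page's wikitext. The sketch below is only an illustration: the parameter names `C`, `H` and `O` in the sample are assumptions about how a Chembox might carry the formula, and the pattern breaks on values that contain nested templates or piped links.

```python
import re


def template_param(wikitext, name):
    """Return the raw value of `| name = value` in wikitext, or None if absent."""
    # Value runs until the next pipe, closing brace, or end of line.
    pattern = r"\|\s*" + re.escape(name) + r"\s*=[ \t]*([^|}\n]*)"
    match = re.search(pattern, wikitext)
    return match.group(1).strip() if match else None


# Hand-written sample wikitext (not copied from a real article).
sample = "{{Chembox\n| Name = Arabinose\n| C=5 | H=10 | O=5\n}}"
formula = "".join(
    element + template_param(sample, element) for element in ("C", "H", "O")
)
print(formula)
```

For anything beyond a quick lookup, a dedicated wikitext parser would be more robust than a regular expression.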

wikipedia template data api

I want to download the template source used in a Wikipedia page (basically for generating the display text of a key). So I basically want this info:
http://en.wikipedia.org/w/index.php?title=Template:Infobox%20cricketer&action=edit
for Template:Infobox cricketer
I have found an API for Wikipedia called TemplateData:
http://www.mediawiki.org/wiki/Extension:TemplateData
But the example given:
http://en.wikipedia.org/w/api.php?action=templatedata&titles=Template:Stub
does not seem to work.
I think you misunderstood what Extension:TemplateData is for. It's for getting metadata about a template, which only works if that template provides such metadata.
If what you want is the text of the template, you should use prop=revisions&rvprop=content, for example:
http://en.wikipedia.org/w/api.php?action=query&titles=Template:Infobox%20cricketer&prop=revisions&rvprop=content
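Once `format=json` is added to that request, the template source can be picked out of the decoded response. A minimal Python sketch, assuming the legacy layout where the content sits under the `*` key of the revision, with a fallback for requests made with `rvslots=main` (the function name and sample data are illustrative):

```python
def template_source(response):
    """Return the wikitext of the first page's latest revision, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for revision in page.get("revisions", []):
            if "*" in revision:
                # Legacy layout: content directly on the revision.
                return revision["*"]
            # Layout when the request includes rvslots=main.
            main_slot = revision.get("slots", {}).get("main", {})
            if "*" in main_slot:
                return main_slot["*"]
    return None


# Hand-written sample shaped like a prop=revisions response (not real output).
sample = {
    "query": {
        "pages": {
            "123": {
                "title": "Template:Infobox cricketer",
                "revisions": [{"contentformat": "text/x-wiki", "*": "{{Infobox ...}}"}],
            }
        }
    }
}
print(template_source(sample))
```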

Display html articles in a easy to read format

I have looked at the Readability API, which is useful for displaying data in a clean format on an HTML web page. I am passing a URL to http://www.readability.com/read?url= to display the data, but I am first directed to a page where I can choose to view the content using Readability. Is there any way to view the content directly in a neat fashion, without going through that redirect?
Take a look at Readability's API: http://www.readability.com/developers/api
Before you implement your code, you have to create an API Key on their website.