I'm trying to parse the latest Wikisource dump. More specifically, I would like to get all the pages under the Category:Ballads page. For this purpose I downloaded the https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-pages-articles.xml.bz2 dump. In this dump, the relevant page contains everything except the actual links (the list of member pages):
<page>
  <title>Category:Ballads</title>
  <ns>14</ns>
  <id>115796</id>
  <revision>
    <id>4753508</id>
    <parentid>4003780</parentid>
    <timestamp>2014-01-25T16:21:08Z</timestamp>
    <contributor>
      <username>EmausBot</username>
      <id>983607</id>
    </contributor>
    <minor />
    <comment>Bot: Migrating 2 interwiki links, now provided by [[Wikipedia:Wikidata|Wikidata]] on [[d:Q8286819]]</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text bytes="51" xml:space="preserve">[[Category:Song lyrics]]
[[Category:Poems by form]]</text>
    <sha1>43eusqpjj6kaqcp6nl1tcmo4ass36ia</sha1>
  </revision>
</page>
<page>
My question is: how do I get the actual page content and all the links on this page?
Thank you!
You downloaded the wrong dump for this. If you're interested in category membership (categorylinks), you need to download https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-categorylinks.sql.gz, for instance.
If you want XML format, you would need to parse this information yourself, from raw wikitext. For that, you can use https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-pages-meta-current.xml.bz2.
EDIT per comments:
enwikisource-latest-pages-meta-current.xml doesn't contain machine-readable information about categories; it only contains the current page content. You would need to look at the text XML element, which contains the raw wikitext stored in the page. Usually, at the end of the content, it has something like this:
[[Category:American Civil War]]
[[category:American speeches]]
This indicates the page is in the categories "American Civil War" and "American speeches".
If you want parsed info, you would need to work with the .sql file, AFAIK.
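If scanning the raw wikitext is acceptable, a minimal Python 3 sketch (standard library only; the export namespace version below is an assumption, check the root element of your dump) that streams the pages-meta-current dump and prints the titles of pages carrying [[Category:Ballads]] might look like this:
import bz2
import re
import xml.etree.ElementTree as ET

# Namespace of the MediaWiki export format; the version number (0.10 here)
# may differ, so check the root element of your dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Matches [[Category:Ballads]] and [[Category:Ballads|sort key]] in wikitext.
CATEGORY_RE = re.compile(r"\[\[\s*[Cc]ategory\s*:\s*Ballads\s*(\|[^\]]*)?\]\]")

with bz2.open("enwikisource-latest-pages-meta-current.xml.bz2", "rb") as dump:
    for _event, elem in ET.iterparse(dump):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            if CATEGORY_RE.search(text):
                print(title)
            elem.clear()  # free memory while streaming through the dump
Note that this only sees categories written literally in the wikitext; categories added through templates are only available pre-resolved in the categorylinks SQL dump.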
I want to extract the pages mentioned in the infobox and the templates of a page.
E.g. From this page:
https://en.wikipedia.org/wiki/DNA
I want to extract all of the links in the infobox, like: "Genetics", "Introduction to Genetics" etc.
I want to do this by using the SQL dumps, ideally avoiding parsing the XML of whole pages, and I don't want to do it with APIs.
I could not find a way.
While pagelinks does also include the links coming from infoboxes, I cannot find a way to tell them apart from the other links.
I thought templatelinks might have that info, but it does not: I could not find the page IDs of the links that appear in infoboxes.
Where is this information stored?
Or which kind of tables should I look at?
I consulted previous questions:
where can I find the infobox templates used in wiki?
and the MediaWiki reference:
https://www.mediawiki.org/wiki/Manual:Templatelinks_table#Schema_summary
but could not find a solution.
That is a sidebar rather than an infobox: https://en.wikipedia.org/wiki/Template:Genetics_sidebar
I don't think there's a way of doing it other than parsing the content of the template to extract the links, or using the API, e.g. https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Template:Genetics%20sidebar&pllimit=100&plnamespace=0
Something like this should also work but it's not returning any results for me:
SELECT * from pagelinks
where pl_title = 'Genetics_sidebar'
and pl_namespace = 0
and pl_from_namespace = 10
https://quarry.wmcloud.org/query/71442
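For reference, a rough Python sketch of the API route (assuming the requests package; it collects the article-namespace links of the sidebar template and follows the API continuation) might look like:
import requests  # assumption: the requests package is installed

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "links",
    "titles": "Template:Genetics sidebar",
    "plnamespace": 0,      # only targets in the article namespace
    "pllimit": "max",
    "format": "json",
}

links = []
while True:
    data = requests.get(API, params=params).json()
    for page in data["query"]["pages"].values():
        links.extend(link["title"] for link in page.get("links", []))
    if "continue" not in data:
        break
    params.update(data["continue"])  # follow the continuation until done

print(links)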
I'm looking for a library to parse HTML pages, specifically Wikipedia articles, for example: http://en.wikipedia.org/wiki/Railgun. I want to extract the article's text and images (the full-scale or original images, not the thumbnails).
Is there an HTML parser out there?
I would prefer not to use the Wikimedia API since I can't seem to figure out how to extract an article's text and the full-size images with it.
Thanks, and sorry for my English.
EDIT: I forgot to say that the end result should be valid HTML.
EDIT: I got the JSON string with this: https://en.wikipedia.org/w/api.php?action=parse&pageid=218930&prop=text&format=json so now I need to parse the JSON.
I know that in javascript I can do something like this:
var pageHTML = JSON.parse("the json string").parse.text["*"];
Since I know a bit of HTML/JavaScript and Python, how can I make that HTTP request and parse the JSON in Python 3?
I think you should be able to get everything with the web API:
https://www.mediawiki.org/wiki/API:Main_page
https://www.mediawiki.org/wiki/API:Parsing_wikitext
or you could download the whole of Wikipedia:
https://meta.wikimedia.org/wiki/Research:Data
You can get the HTML from the API too; check the info on https://www.mediawiki.org/wiki/Extension:TextExtracts/pt. It works like this example: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=175&titles=hello%20world .
If the volume of pages you need is high, you should consider using the public dumps instead.
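Since the question asks how to do the same in Python 3, a minimal sketch using only the standard library (same pageid and fields as the JavaScript example above) might look like:
import json
import urllib.request

url = ("https://en.wikipedia.org/w/api.php"
       "?action=parse&pageid=218930&prop=text&format=json")
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode("utf-8"))

page_html = data["parse"]["text"]["*"]  # the same field the JavaScript snippet reads
print(page_html[:500])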
I made a Node.js module called wikipedia-to-json (written in JavaScript) that parses the HTML in Wikipedia articles and gives you back structured JSON objects describing the layout of the article in order (titles, paragraphs, images, lists, sub-titles...).
That might be useful if you just want to do a quick extraction of text and sections and get a sense of how things are laid out.
I want to automate a web application, where that application parses an HTML page and pulls the inner text of HTML tags based on some condition; for example, a span tag whose class="spanclass_1":
This is span tag...
The application parses the page and pulls that span based on its particular class.
The main pain point here is that I should not reuse the developer's code to automate that same HTML parsing.
I want to verify that the parsing is done correctly, simply by using the parsed data that is shown in the UI.
Any help would be great.
I appreciate your time reading this.
(Note: the span tag itself is not shown.)
Thanks, buddies.
Not enough details.
Is this HTML page just a file in the local filesystem, or is it a web page on the Internet?
Do you have access to the pages? Can you modify them? If yes, then just add JavaScript to the page that extracts the data and posts it to a server.
If not, then it depends on the language you use to program.
Find a good framework to parse HTML: load the page, parse it, and extract the data. Several situations can arise:
Worst scenario - the page is generated on the client side using JS.
Best scenario - the page is in XHTML mode (you are lucky: any XML parser will help you build a DOM and extract the data).
So-so - the page is plain HTML (try several HTML parsers to find the most suitable one for you).
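If it turns out to be the plain-HTML case, a minimal Python sketch (assuming the requests and beautifulsoup4 packages, and the class name from the question) might look like:
import requests                # assumption: requests is installed
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

html = requests.get("https://example.com/page.html").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")

# Pull the inner text of every <span> with the class mentioned in the question.
for span in soup.find_all("span", class_="spanclass_1"):
    print(span.get_text(strip=True))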
I want to download the template source used in a Wikipedia page (basically for generating the display text of a key). So basically I want this info:
http://en.wikipedia.org/w/index.php?title=Template:Infobox%20cricketer&action=edit
for Template:Infobox cricketer
I have found an API for Wikipedia called TemplateData:
http://www.mediawiki.org/wiki/Extension:TemplateData
But the example given:
http://en.wikipedia.org/w/api.php?action=templatedata&titles=Template:Stub
does not seem to work.
I think you misunderstood what Extension:TemplateData is for. It's for getting metadata about a template, which only works if that template provides that metadata.
If what you want is the text of the template, you should use prop=revisions&rvprop=content, for example:
http://en.wikipedia.org/w/api.php?action=query&titles=Template:Infobox%20cricketer&prop=revisions&rvprop=content
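A rough Python sketch of the same call (assuming the requests package; note the "*" content key is the classic response shape, newer MediaWiki versions prefer adding rvslots=main) could be:
import requests  # assumption: the requests package is installed

API = "http://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "titles": "Template:Infobox cricketer",
    "prop": "revisions",
    "rvprop": "content",
    "format": "json",
}

data = requests.get(API, params=params).json()
page = next(iter(data["query"]["pages"].values()))
wikitext = page["revisions"][0]["*"]  # raw template source ("*" key is deprecated on newer wikis)
print(wikitext[:200])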
I'm completely new to website development and I'm learning things on the fly. I am a Sports Information Director at a small university and need some help spicing up our website. For most sports, we use a program called Stat Crew.
Below is an example of the XML data file:
<player name="Player 1" checkname="Player 1" uni="00" code="00" pos="G" year="SR" gs="4" gp="5" fgm="21" fga="42" fgm3="2" fga3="6" ftm="17" fta="24" tp="61" oreb="8" dreb="8" treb="16" blk="0" stl="10" ast="28" to="20" pf="10" tf="0" min="128" dq="0"/>
My HTML is already in a table format, with each "player" having their own table. What I want to do is display season statistics (updated after every game, of course) without having to update the HTML manually. Can I set up my page to pull certain data from the XML (fgm, fgm3, etc.) and have it automatically go into the designated HTML table?
I've tried to just read the questions on this site that have to do with XML but, to be honest, it's confusing.
Thanks for your help!
You should consider looking into XSL Transformations (XSLT). This will let you take your XML data and generate the HTML view you want, using your existing HTML familiarity.
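As a rough illustration, here is a sketch that applies a tiny XSLT stylesheet to the Stat Crew XML using Python's lxml package; the file name stats.xml and the columns chosen are just assumptions:
from lxml import etree  # assumption: the lxml package is installed

# A minimal stylesheet: one HTML table row per <player>, using a few stat attributes.
XSLT = etree.XML(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <table>
      <tr><th>Player</th><th>FGM</th><th>FGM3</th><th>FTM</th><th>TP</th></tr>
      <xsl:for-each select="//player">
        <tr>
          <td><xsl:value-of select="@name"/></td>
          <td><xsl:value-of select="@fgm"/></td>
          <td><xsl:value-of select="@fgm3"/></td>
          <td><xsl:value-of select="@ftm"/></td>
          <td><xsl:value-of select="@tp"/></td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(XSLT)
stats = etree.parse("stats.xml")  # hypothetical name of the Stat Crew export
print(str(transform(stats)))      # the generated HTML table
The same stylesheet could also be applied in the browser or with a command-line tool such as xsltproc, so the Python part is only one way to run it.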