How to find for the wikipedia links in the infobox templates and other templates, using sql dumps - mediawiki

I want to extract the pages mentioned in the infobox and templates of pages.
E.g. From this page:
https://en.wikipedia.org/wiki/DNA
I want to extract all of the links in the infobox, like: "Genetics", "Introduction to Genetics" etc.
I want to do it, by using the sql dumps, possibly avoiding to parse the xml of whole pages, and I don't want to do it with APIs.
I could not find a way.
While Pagelinks does include also the links of infoboxes, I cannot find a way to exclude them.
I thought Templatelinks may have that info, but it is not: I could not find the pageids of the corresponding links in infoboxes.
Where is this information stored?
Or which kind of tables should I look at?
I consulted previous questions:
where can I find the infobox templates used in wiki?
and Mediawiki reference:
https://www.mediawiki.org/wiki/Manual:Templatelinks_table#Schema_summary
but could not find a solution.

That is a sidebar rather than an infobox: https://en.wikipedia.org/wiki/Template:Genetics_sidebar
I don't think there's a way of doing it other than parsing the content of the template to extract the links or using the API: e.g. https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Template:Genetics%20sidebar&pllimit=100&plnamespace=0
Something like this should also work but it's not returning any results for me:
SELECT * from pagelinks
where pl_title = 'Genetics_sidebar'
and pl_namespace = 0
and pl_from_namespace = 10
https://quarry.wmcloud.org/query/71442

Related

How to best transfer a document to a SAPUI5 framwork?

I'd like to achieve the following and I'm looking for ideas. I have a document and I want to represent/transform this content in/to a nice SAPUI5 framework. My idea is the following: a split app with having the paragraph titles in the master view (plus a search function on top) and the respective content in the detail view.
I'd like to know from you if
a) you might want to share your ideas and hints on alternatives.
b) this can be achieved within one single file (i.e. all the code for the split app and document content in one html) and maybe using pure html code (xml also feasible) - against the background of easily handing a large amount of text available in html.
c) if you happen to have/know a reusable template.
Thanks in advance!
An interesting question. I went through a similar exercise once, re-presenting my site with UI5.
To your questions:
(a) I would think that the approach you suggest is a good one
(b) You can indeed include all the app in a single file, I do that often by using script templates, even with XML Views. You can see some examples in my sapui5bin repository, in particular in the SinglePageExamples folder. Have a look at this html file for example: https://github.com/qmacro/sapui5bin/blob/master/SinglePageExamples/SAP-Inside-Track-Sheffield-2014/end.html
What I would suggest is, rather than intermingle the document content and the app & view definitions, maintain the content of your document separately, for example, in XML or JSON, and use a client side model to load it in and bind the parts to the right places.

Wikipedia MediaWiki API Get Template Content via URL Request

I've been going through the doc's for past few hours and simply can't seem to figure this out though probably simple.
I have this link:
http://en.wikipedia.org/w/api.php?format=xml&action=expandtemplates&titles=Arabinose&text={{Chembox%20Elements}}&prop=wikitext
Which obviously will give me the schema of Template Chembox | Chembox Elements in this case.
All I simply want is to retrieve the Molecular forumla content/data/value for the given page/title without having to parse the entire wiki content at my end.
Understand I have prop=wikitext which will be returning wikitext in the above example, there's no option in expandtemplates for prop=text. I've been back and forth with action=query, expandedtemplates etc and no joy.
MediaWiki's API won't do the work for you. You'll have to parse your own results.

How to get an RSS feed of tweets WITH images?

I'm trying to get an rss feed of a list of tweets with a given hashtag, including the images that may be attached to the tweets.
I've used several different scripts out there, but none include the media_url entity that I believe I need, according to twitter's docs on API entities. They do include other necessary things like author, tweet description, author profile pic, etc.
I've used labnol's script, no luck.
I'm currently using Twitter-RSS-Parser, which doesn't give me an image link either.
I'm not very familiar with any of the actual coding, just trying to piece together other people's findings.
Is there a way to edit either of these scripts to provide a link to the image attached to each tweet, or any other script out there that already does this?
Thanks!
Those labnol scripts will need the following parameter added to them &include_entities=true
That will ensure that Tweets which have photos will have their entity meta data returned.
I ended up using tweedledee (can't find a link anymore!) scripts, which allow for specific queries and output in JSON. From there I was able to format the JSON data as needed.

mediawiki api. how to chose page from response

When I make api query sometimes I have list with few pages. For example
http://en.wikipedia.org/wiki/Ask gives a lot of pages, I need website "Ask.com, a web search engine, formerly Ask Jeeves"
can I make query only for some category ("websites")?
How I can check category for each page in response?
Thanks
There is no trivial way to do what you're asking. You could do something like this:
Get the list of pages the disambiguation page list. You could do this by listing the links on that page (action=query&prop=links).
Get the categories of all the pages from the previous step and use that to decide which one is the one you're looking for. This is not that simple, because Ask.com is not directly in Category:Websites, it's in one of its subcategories.
I have list with few pages, for example http://en.wikipedia.org/wiki/Ask
The problem is that you're not getting a list of pages, you just are getting an ordinary page which is in the disambiguation pages category. To get the list, you need to get the links in that page.
can I make query only for some category ("websites")?
No, mediawiki does not support that.
How I can check category for each page in response?
Use the links property as a title list generator and get the categories of each page in the response. In your case, that would be http://en.wikipedia.org/w/api.php?action=query&titles=Ask&generator=links&prop=categories (don't forget to continue the query).
If you are OK with "full-text search" for "ask",
you can do that like this:
http://en.wikipedia.org/w/api.php?format=json&action=query&generator=search&gsrsearch=ask%20incategory:%22Online%20companies%22&prop=info
As you can see, "search" text is [ask incategory:"Online companies"]
The same solution also can be seen at:
Wikipedia API: how to search for a term in a specific category

Pulling out some text from a giant HTML file using Nokogiri/xpath

I am scraping a website and am trying to pull out certain elements from the HTML. In the sites I am scraping, there are script tags with a bunch of info in them however, there is one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
With some stuff above and below it. Now, this is different for each page source except for obviously the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line, and cutting out just the URL? I would need to use regular expressions I feel as the URLs are dynamic.
The "gsub" method does something similar to what I want to search for, with its ability to use /regex/. But, I am not wanting to replace anything, I just want to find that URL in the source code using a /regex/ and copy it.
According to you comments, this is what you're looking for I guess
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/