Get all Wikipedia Infobox Templates and all Pages using them - mediawiki

Given a Wikipedia page like Wikipedia: Stack Overflow, there is often an infobox (usually at the top right of the page).
DBPedia lists all these attributes as RDF triples. You can see the example at DBPedia: Stack Overflow. There you see the property dbpprop:wikiPageUsesTemplate with the value dbpedia:Template:Infobox_website, which is interesting. I want to know which Wikipedia pages use this template. How can I list all pages that use the Infobox_website template? Preferably with a SPARQL query, but I am open to other easy solutions.
The next thing is a list of all infobox templates. Wikipedia: Category Infobox Templates shows the hierarchy of the desired Wikipedia categories, which looks like what I am seeking. But I want all of these in a machine-readable format, on one page. Maybe DBPedia is the right thing here too? At DBPedia: Category Infobox Templates and DBPedia: INFOBOX I find very little information, but they look promising. How can I use SPARQL to find all infobox types so that I can repeat step 1 for each of them?
You can use this for testing the SPARQL queries: http://dbpedia.org/snorql/
Update 1
I seem to have solved problem number 1: SPARQL: list all pages with Infobox_website
Update 2
Also, this seems to be the query for problem number 2: SPARQL: list all Infoboxes

OK, since I seem to have found a solution (most probably not the best one), I want to share it.
1) This SPARQL query can be used to find all pages that include a specific Infobox type:
SELECT * WHERE {
  ?page dbpedia2:wikiPageUsesTemplate <http://dbpedia.org/resource/Template:Infobox_website> .
  ?page dbpedia2:name ?name .
}
Link at SNORQL
2) This SPARQL query can be used to find all Infobox types:
SELECT DISTINCT ?template WHERE {
  ?page dbpedia2:wikiPageUsesTemplate ?template .
  # str() turns the template IRI into a string, which regex() expects
  FILTER (regex(str(?template), "Infobox")) .
} ORDER BY ?template
Link at SNORQL

The previous answers seem to have stopped working. Only a small change is required to get them working at the new dbpedia query endpoint at http://live.dbpedia.org/sparql though.
To get a list of all of the pages and the templates that they use this query works:
SELECT * WHERE { ?page dbpprop:wikiPageUsesTemplate ?template . }
See results (limited to 100)
If you're looking for a specific template:
SELECT * WHERE {
?page
dbpprop:wikiPageUsesTemplate
<http://dbpedia.org/resource/Template:Infobox_website> .
}
See results
And for my use case I'm interested in the Wikipedia URL rather than the DBPedia page, so I'm using this query:
SELECT ?wikipedia_url WHERE {
?page
dbpprop:wikiPageUsesTemplate
<http://dbpedia.org/resource/Template:Infobox_website> .
?page foaf:isPrimaryTopicOf ?wikipedia_url .
}
See results
I'm also using curl to pull the results into a script:
$ curl -s "http://live.dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fwikipedia_url+WHERE+%7B+%0D%0A%09+%3Fpage+%0D%0A%09+dbpprop%3AwikiPageUsesTemplate+%0D%0A%09+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FTemplate%3AInfobox_website%3E+.+%0D%0A+%3Fpage+foaf%3AisPrimaryTopicOf+%3Fwikipedia_url+.%0D%0A%0D%0A%09%7D&format=text%2Ftab-separated-values" \
| tr -d \" | grep -v "^wikipedia_url$" | head
http://en.wikipedia.org/wiki/U.S._News_&_World_Report
http://en.wikipedia.org/wiki/FriendFinder
http://en.wikipedia.org/wiki/Debkafile
http://en.wikipedia.org/wiki/GTPlanet
http://en.wikipedia.org/wiki/Lithuanian_Wikipedia
http://en.wikipedia.org/wiki/Connexions
http://en.wikipedia.org/wiki/Hypno5ive
http://en.wikipedia.org/wiki/Scoop_(website)
http://en.wikipedia.org/wiki/Bhoomi_(software)
http://en.wikipedia.org/wiki/Brainwashed_(website)
I'm not sure if this gives the full result set though, because it returns 1698 results whereas wmflabs.org seems to suggest there should be 4439.
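The curl pipeline above could also be sketched in Python. This is only a minimal sketch: it assumes the endpoint still accepts the same `default-graph-uri`, `query` and `format` parameters used in the curl call, and the helper names `build_request_url` and `parse_tsv` are made up for illustration.

```python
import csv
import io
from urllib.parse import urlencode

# Endpoint and query taken from the answer above.
ENDPOINT = "http://live.dbpedia.org/sparql"

QUERY = """
SELECT ?wikipedia_url WHERE {
  ?page dbpprop:wikiPageUsesTemplate <http://dbpedia.org/resource/Template:Infobox_website> .
  ?page foaf:isPrimaryTopicOf ?wikipedia_url .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Build the GET URL with the same parameters as the curl call (hypothetical helper)."""
    params = {
        "default-graph-uri": "http://dbpedia.org",
        "query": query,
        "format": "text/tab-separated-values",
    }
    return endpoint + "?" + urlencode(params)


def parse_tsv(body):
    """Return the first column of a TSV response, dropping the header row and quotes."""
    rows = csv.reader(io.StringIO(body), delimiter="\t")
    next(rows, None)  # skip the header line ("wikipedia_url")
    return [row[0] for row in rows if row]
```

To actually run the query you would fetch `build_request_url(QUERY)` with something like `urllib.request.urlopen` and pass the decoded body to `parse_tsv`; this replaces the `tr -d \"` and `grep -v` steps of the shell version.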
For the second part of your question, only a small change is needed from the previous query to get a list of all templates:
SELECT DISTINCT ?template WHERE {
  ?page dbpprop:wikiPageUsesTemplate ?template .
  # str() turns the template IRI into a string, which regex() expects
  FILTER (regex(str(?template), "Infobox")) .
} ORDER BY ?template
See results

You can also use the MediaWiki API's embeddedin query to return a list of all pages that include a given template. You'll want to use a library for accessing the API, though. Which language would you prefer? For Ruby, I'd suggest MediaWiki::Gateway.
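As a rough illustration of the embeddedin approach without a client library, here is a minimal Python sketch using only the standard library. The function names (`fetch_json`, `pages_embedding`) and the User-Agent string are assumptions for this example; the `list=embeddedin`, `eititle`, `einamespace`, `eilimit` and `continue` parameters are those of the MediaWiki action API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    request = Request(
        API_URL + "?" + urlencode(params),
        # Placeholder User-Agent; replace with your own contact details.
        headers={"User-Agent": "infobox-list-example/0.1"},
    )
    with urlopen(request) as response:
        return json.load(response)


def pages_embedding(template, fetch=fetch_json):
    """Yield the titles of all main-namespace pages that transclude `template`."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "einamespace": 0,   # articles only
        "eilimit": "max",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for page in data["query"]["embeddedin"]:
            yield page["title"]
        if "continue" not in data:
            break
        # Merge the continuation tokens into the next request.
        params = {**params, **data["continue"]}
```

Calling `list(pages_embedding("Template:Infobox website"))` would then page through all results. The `fetch` argument is injectable so the pagination logic can be tried without network access.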

Using cts query to retrieve collections associated with the given document uri- Marklogic

I need to retrieve the collections to which a given document belongs in Marklogic.
I know the xdmp command does that, but I need to use it in a cts query to retrieve the data and then filter records from it.
xdmp:document-get-collections("uri of document") can't be run inside a cts query to give the appropriate data.
Any idea how this can be done using a cts query?
Thanks
A few options come to mind:
Option One: Use cts:values()
cts:values(cts:collection-reference())
If you check out the documentation, you will see that you can also restrict this to certain fragments by passing a query as one of the parameters.
Update [11-10-2017]:
The comment attached to this asked for a sample of restricting the results of cts:values() to a single document (for practical purposes, I will say fragment == document).
The documentation for cts:values explains this. It is the 4th parameter: a query to restrict the results. Get to know this pattern, as it is part of many features of MarkLogic. It is your friend. The query I would use for this problem statement would be a cts:document-query().
An Example:
cts:values(
  cts:collection-reference(),
  (),
  (),
  cts:document-query('/path/to/my/document')
)
Full Example:
cts:search(
  collection(),
  cts:collection-query(
    cts:values(
      cts:collection-reference(),
      (),
      (),
      cts:document-query('/path/to/my/document')
    )
  )
)[1 to 10]
Option two: use cts:collection-match()
If you need more control over returning just some of the collections from a document, use cts:collection-match(). Like the first option, you can restrict the results to just some fragments. However, it has the added benefit of accepting a pattern.
Attention:
They both return a sequence, which is perfect for feeding into other parts of your query. However, under the hood, I believe they work differently. The second option is run against a lexicon. The larger the list of unique collection names and the more complex your pattern match, the longer resolution takes. I use collection-match in projects, but usually only when I can limit the possible choices by restricting the results to a smaller number of documents.
You can't do this in a single step. You have to run code first to retrieve collections associated with a document. You can use something like xdmp:document-get-collections for that. You then have to feed that into a cts query that you build dynamically:
let $doc-collections := xdmp:document-get-collections($doc-uri)
return
cts:search(collection(), cts:collection-query($doc-collections))[1 to 10]
HTH!
Are you looking for cts:collection-query()?
Insert two XML files to the same collection:
xquery version "1.0-ml";
xdmp:document-insert("/a.xml", <root><sub1><a>aaa</a></sub1></root>,
map:map() => map:with("collections", ("coll1")));
xdmp:document-insert("/b.xml", <root><sub2><a>aaa</a></sub2></root>,
map:map() => map:with("collections", ("coll1")));
Search the collection:
xquery version "1.0-ml";
let $myColl := xdmp:document-get-collections("/a.xml")
return
  cts:search(/root,
    cts:and-query((
      cts:collection-query($myColl),
      cts:element-query(xs:QName("a"), "aaa")
    )))

Get first result of suggested results with Wikipedia's API

I am trying to use Wikipedia's API, but I can't find a way to get the first result when there are multiple possible ones.
For example if I use this request https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Cimages&titles=test&redirects=1&explaintext=1&imlimit=20 it will return an article that says
Test may refer to: ...
What I want is for it to skip this part and give me directly the results of the first result that "Test may refer to".
Do you know if this is possible or not ?
Thank you for reading :)

In CouchDB how do you take in parameters from REST call

Hi, I'm new to CouchDB. It looks great so far, but I'm really struggling with something that must be simple to do!
I have documents structured as:
{
  "_id": "245431e914ce42e6b2fc6e09cb00184d",
  "_rev": "3-2a69f0325962b93c149204aa3b1fa683",
  "type": "student",
  "studentID": "12345678",
  "Name": "Test",
  "group": "A"
}
I would like to access them with queries such as http://couchIP/student?group=A or something like that. Are views what I need here? I don't understand how to take the parameter from the query in the map functions in views. Example:
function(doc,req) {
if(req.group==='A'){
emit(doc.id, doc.name);
}
}
Is my understanding of how Couch is working wrong or what's my problem here? Thanks in advance, I'm sure this is Couch 101
Already read through http://guide.couchdb.org/ but it didn't really answer the question!
You need views to achieve the desired results.
Define the following map function inside a view of a design document. ( let's name the view "byGroup" and assume this lives in a design document named "_design/students" )
function(doc) {
if(doc.group){
emit(doc.group,null);
}
}
Results can be obtained from the following url
http://couchIP:5984/dbname/_design/students/_view/byGroup?startkey="A"&endkey="A"&include_docs=true
To have friendly url couchdb also provides url rewriting options.
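For illustration, the view URL above could be built from a script along these lines. This is a minimal Python sketch: the helper name `view_url` is made up, and it uses the `key` view parameter, which selects the same rows as setting `startkey` and `endkey` to the same value. Note that view keys must be JSON-encoded, so the string A is sent as "A" with the quotes.

```python
import json
from urllib.parse import urlencode


def view_url(base, db, design_doc, view, key, include_docs=True):
    """Build the URL that queries a CouchDB view for a single key."""
    # CouchDB expects the key as JSON, so a string keeps its double quotes.
    params = {"key": json.dumps(key)}
    if include_docs:
        params["include_docs"] = "true"
    return f"{base}/{db}/_design/{design_doc}/_view/{view}?{urlencode(params)}"


# Host, database and view names are the placeholders used in the answer above.
print(view_url("http://couchIP:5984", "dbname", "students", "byGroup", "A"))
```

The point of the design is that the group is never passed into the map function. The view indexes every document by `doc.group` once, and the request simply picks the rows for the key it wants.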
You need to do some further reading about views and the fact that they return key/value pairs.
It's not clear what you want to return from the view so I'll guess. If you want to return the whole document you'd create a view like:
function (doc) { emit(doc.group, doc) };
This will emit the group name as a key which you can lookup against, the whole doc will be returned as the value when you look it up.
If you want to just have access to the names of those users you want to do something like:
function (doc) { emit(doc.group, doc.Name) };
Your question arises from a misconception about what a view does. Views use map/reduce to generate a representation of your data. You have no control of the output of your view in your query because the view is updated according to changes in your DB documents only.
Using a list is also not a good option. It may seem that you can use knowledge of your request in your list to generate different output depending on the query parameters, but this is wrong: CouchDB uses ETags for caching, which means that most of the time you will get the same answer regardless of your list parameters, since the underlying documents won't have changed. There is a trick to fool CouchDB in this case, which involves using two different alternating users, but I wouldn't even try it. There are surely easier ways to achieve your objective, and you can probably solve your problem by using group as a key in your map function.

The search results in Drupal includes HTML Entities. How can I have a clean output?

How can I get clean HTML output on search result pages? Each time I include special characters like "&" in the search term, I get results with "&" highlighted, yet the output includes the HTML entity. Thus, the results contain &amp;, &quot; etc. Here's a sample screenshot: http://min.us/mt3rOV5zVtOh6
Meanwhile, when I search with "&amp;" included in the search term, the results have clean output.
The piece of code in search-result.tpl.php
http://pastebin.com/zCmMJLNh
I've already tried several decoding functions but no success. Been trying to fix this for days already. The site is using Drupal 6 and the search module has been overridden.
You say "...the search module has been overridden". This could be why the search snippet remains HTML-entity-encoded on output (e.g. HTML escaped by check_plain).
A better fix would be to find the cause in the modification, e.g. a preprocess function that modifies the search snippet (if any).
Alternatively, you could probably run the $snippet through decode_entities,
i.e. print decode_entities($snippet);
This assumes the HTML has already been escaped; if it has not, decoding it can be a security risk.
See also: http://php.net/manual/en/function.html-entity-decode.php
and: http://www.php.net/manual/en/function.htmlspecialchars-decode.php
Well, you could try drupal_html_to_text to convert the snippet into plain text.
The right way is probably to figure out why those results aren't getting converted. Based on your comments it looks like the problem is only when you search specifically for "&". More specifically, it's the regex in the search.module (/modules/search/search.module - line 1188 in 6):
preg_match_all('/ ("([^"]+)"|(?!OR)([^" ]+))/', ' '. $keys, $matches);
It only matches spaces before the keyword (not after). You could modify the $keys here like:
if ($keys == '&') $keys = '&amp;';
Or something like that (of course that means hacking core - meh).
You could also possibly add a form_alter via a module and modify the search form (see this link on how to add the form_alter). Then you could add a custom submit handler which would alter the search term in the form before it is submitted.

Perl::Mechanize: running a simple crawler with a loop [multiple queries]

I am currently working out a way to parse the data of a page: http://www.foundationfinder.ch/
I would love to do it in Perl, and I am just musing about which is the best way to do the job.
I guess I am facing a nice learning curve. ;) This task will give me some nice Perl lessons. At the moment it goes a bit over my head... ;-)
So here is a sample-page:
... and as i thought i can find all 790 resultpages within a certain range between Id= 0 and Id= 100000 i thought, that i can go the way with a loop:
http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=949&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=20011&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=10579&InterfaceLanguage=1&Type=Html
i thought i can go the Perl-Way but i am not very very sure: I was trying to use LWP::UserAgent on the same URLs [see below] with different query arguments, and i am wondering if LWP::UserAgent provides a way for us to loop through the query arguments? I am not sure that LWP::UserAgent has a method for us to do that. Well - i sometimes heard that it is easier to use Mechanize. But is it really easier!?
BTW; But if i am going the PHP way i could do it with Curl - couldnt i!?
Here is my approach: I tried to figure it out, and I dug deeper into the man pages and how-tos. We can have a loop that constructs the URLs and uses Curl repeatedly.
As noted above: here we have some resultpages;
http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html
Alternatively we can add a request_prepare handler that computes and add the query
arguments before we send out the request.
Again: What is aimed: i want to parse the data and afterwards i want to store it in a local MySQL-database
should i define a extern_uid !?
and go like this:
for my $i (0..10000) {
$ua->get('http://www.foundationfinder.ch/ShowDetails.php?Id=', id => 21, extern_uid => $i);
# process reply
}
Well but now i get stuck- i need help - can i do the job like this!?
regards
zero
Don't do it like this. Use HTTP Live Headers (Firefox plugin) or an equivalent to see what the JavaScript does behind the scenes while you select what you need from here to get to that page (with the table).
To get the data from the table, use HTML::TableExtract or HTML::TreeBuilder::XPath if you want to use XPath
If you do want to iterate over the queries, just create another var:
my $url = 'http://www.foundationfinder.ch/ShowDetails.php?Id=' . $q . '&InterfaceLanguage=&Type=Html';
and increment $q as you go. Make sure the page is valid before trying to load it with get.
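The loop itself is language-independent, so here is a minimal sketch of it in Python rather than Perl. The names `detail_url` and `crawl` are made up for this example, and the `fetch` callable is an assumption: it stands for whatever HTTP client you use and should return the page body, or None when the Id does not exist.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.foundationfinder.ch/ShowDetails.php"


def detail_url(record_id, language=1):
    """Build the detail-page URL for one Id, mirroring the sample URLs in the question."""
    query = urlencode({"Id": record_id, "InterfaceLanguage": language, "Type": "Html"})
    return BASE_URL + "?" + query


def crawl(ids, fetch):
    """Yield (id, html) for every Id whose page could be fetched.

    `fetch` is a caller-supplied function (hypothetical here) that returns
    the page body as a string, or None for a missing or invalid page.
    """
    for record_id in ids:
        html = fetch(detail_url(record_id))
        if html:  # skip Ids that have no valid page
            yield record_id, html
```

A run over the whole range would look like `for record_id, html in crawl(range(0, 100001), my_fetch): ...`, with the table extraction and the database insert done inside the loop body. Adding a short pause between requests is advisable when probing that many Ids.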