MediaWiki API: search for pages in a specific namespace, containing a substring (NOT a prefix)

I want to scrape pages from a list of Wikipedia categories for which there isn't a 'mother category'. In this case, dishes: I want to get a list of all the categories like Category:Vegetable Dishes and Category:Italian Dishes, then scrape and tag the pages in them. I know how to search for pages in a known category, but there are hundreds of categories containing the substring "dishes", and it feels like it should be easy to list them.
However, the MediaWiki allcategories module only seems to allow search by prefix or range (the acprefix and acfrom/acto parameters), and while old opensearch documentation describes a substring search, this is no longer supported (see the updated API docs; it also doesn't work when I try it).
This is very doable in Wikipedia's own search interface in the browser, to the point where I think it might be quicker to just scrape the search results, but I wonder if I'm missing something?

Thanks to @Tgr for pointing out that I'd missed the regular search API, which allows a full-text search restricted to a specified namespace, among other things.
The correct query for my case (namespace 14 is the Category namespace) is:
curl "https://en.wikipedia.org/w/api.php?action=query&list=search&srnamespace=14&srsearch=Dishes&format=json"
thanks!
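For completeness, here is the same query as a minimal Python sketch that also follows the API's continuation parameters to collect every matching category title (assuming the requests library; the User-Agent string is a placeholder):

import requests

# Page through list=search hits in the Category namespace (14).
URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "search",
    "srnamespace": 14,   # Category namespace
    "srsearch": "Dishes",
    "srlimit": "max",
    "format": "json",
}
titles = []
while True:
    data = requests.get(URL, params=params,
                        headers={"User-Agent": "dishes-scraper/0.1 (example@example.com)"}).json()
    titles.extend(hit["title"] for hit in data["query"]["search"])
    if "continue" not in data:
        break
    params.update(data["continue"])  # carries sroffset for the next page
print(len(titles), "categories found")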

Related

DNN database search and replace tool

I have a DNN (9.3.x) website with CKEditor, 2sxc etc installed.
Now old URLs need to be changed into new URLs because the domain name changed. Does anyone know a tool for searching and replacing URLs in a DNN database?
I tried "DNN Search and Replace Tool" by Evotiva, but it goes only through native DNN database-tables, leaving 2sxc and other plugin /modules tables untouched.
Besides that, there are data in JSON-format in database-tables of 2sxc, also containing old URLs.
I'm pretty sure that the Evotiva tool can be configured to search and replace in ANY table in the DNN database.
"Easy configuration of the search targets (table/column pairs. Just point and click to add/remove items. The 'Available Targets' can be sorted, filtered, and by default all 'textual' columns of 250 characters or more are included as possible targets."
It's still a text search.
As a comment: you should be trying to use relative URLs and let DNN handle the domain name part.
I believe the Engage F3 module will search Text/HTML modules for replacement strings, and it's open source, so you could potentially extend it to inspect additional tables.
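If no tool reaches the extra tables, the default target selection described above (textual columns of 250 characters or more) can also be scripted by hand. A minimal Python sketch, assuming pyodbc against SQL Server; the connection string and URLs are placeholders, and you should back up the database before running anything like this:

import pyodbc

# Untested sketch: replace OLD with NEW in every textual column wide enough
# to hold a URL. BACK UP THE DATABASE FIRST.
OLD, NEW = "http://old-domain.example", "http://new-domain.example"
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=localhost;DATABASE=DNN;Trusted_Connection=yes")
cur = conn.cursor()
cur.execute("""
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE DATA_TYPE IN ('varchar', 'nvarchar', 'text', 'ntext')
      AND (CHARACTER_MAXIMUM_LENGTH >= 250 OR CHARACTER_MAXIMUM_LENGTH = -1)
""")
for schema, table, column in cur.fetchall():
    cur.execute(
        f"UPDATE [{schema}].[{table}] "
        f"SET [{column}] = REPLACE(CAST([{column}] AS nvarchar(max)), ?, ?) "
        f"WHERE [{column}] LIKE ?",
        OLD, NEW, f"%{OLD}%")
conn.commit()

Because this is still a plain text REPLACE, it covers URLs inside the JSON-format 2sxc columns as well.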

How to search a MediaWiki-based wiki for pages that use the template "underconstruction" and are tagged with "abcd"?

I want to search through a wiki based on the MediaWiki software for a list of pages which use the template "underconstruction" and are tagged with "abcd". How can this be done?
Using the MediaWiki interface
Using the AdvancedSearch extension, like on Wikipedia, allows you to specify advanced search options.
Using the API
A query along these lines should work, using categorymembers as a generator and filtering on the template:
/w/api.php?action=query&format=json&generator=categorymembers&gcmtitle=Category%3AAbcd&prop=templates&tltemplates=Template%3AUnderconstruction
Pages in the result that include a templates entry are the ones transcluding the template.
Alternatively, install the CirrusSearch extension, which replaces the default search backend with Elasticsearch and makes search much faster and more capable.
After that, you can use its built-in search keywords to narrow the results. For example:
incategory:"THE-NAME-HERE"
Use it to search for all pages inside the named category.
hastemplate:"THE-NAME-HERE"
Use it to search for all pages that contain a specific template.
But if you need all pages that are in a specific category AND contain a specific template, combine them:
incategory:"abcd" hastemplate:"underconstruction"
For more search keywords and options, see the CirrusSearch help documentation.
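The combined query also works through the API's list=search module once CirrusSearch is installed. A minimal Python sketch, assuming the requests library and a placeholder wiki endpoint:

import requests

# Find pages in Category:Abcd that transclude Template:Underconstruction.
# Requires CirrusSearch on the target wiki; the endpoint is a placeholder.
resp = requests.get(
    "https://wiki.example.org/w/api.php",
    params={
        "action": "query",
        "list": "search",
        "srsearch": 'incategory:"abcd" hastemplate:"underconstruction"',
        "format": "json",
    },
)
for hit in resp.json()["query"]["search"]:
    print(hit["title"])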

In the Google Drive search API, how to group words into a phrase?

I'm using the Google Drive search API via the Files.list method to search for files.
I have a query like: fullText contains 'battle of hastings'.
I'm getting results that seem to suggest it searches for the individual words rather than the phrase as a whole. I'm not completely clear, though, and I'm relating the API's functionality to what can be done in a Google web search, so please correct me if that comparison is off.
Anyway, I really only want results for the whole phrase, i.e. like surrounding a phrase with double quotes on Google's search website. For example, if you use Google's website to search for "no one will have written this before", it says 'No results found for "no one will have written this before".', but if you don't use double quotes, then you get all sorts of results.
To summarise:
Does the query api search for individual words and only return files with all those words in, even if they're not as a phrase, or in that order?
Is there a way to make it consider the words as a single phrase?
You can check this by using the "Try it" section of Files.list and consulting the search parameter documentation:
fullText - Full text of the file including title, description, and content.
contains - The content of one string is present in the other.
I tested using this
fullText contains 'daimto twitter'
It returned all of the files that contain that exact match.
By using the "Try it" facility, I found that the behaviour is similar to the search UI in Google Drive: you need to surround a phrase that should be treated as one unit with double quotes. The quotes are encoded into the URL like this:
https://www.googleapis.com/drive/v2/files?maxResults=100&q=fullText+contains+'%22flippity+floppity%22'
I'm not sure if the spaces need to be encoded like that, but I tried to emulate it as much as possible.
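Here is the same request as a minimal Python sketch, assuming the requests library and a placeholder OAuth 2.0 access token; the HTTP library takes care of the percent-encoding, so the quotes can be written literally:

import requests

# Phrase search in Drive v2: the inner double quotes group the words into one phrase.
ACCESS_TOKEN = "ya29.placeholder"  # obtain a real token via OAuth 2.0
resp = requests.get(
    "https://www.googleapis.com/drive/v2/files",
    params={"maxResults": 100, "q": "fullText contains '\"flippity floppity\"'"},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
for f in resp.json().get("items", []):
    print(f["title"])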

Extract HTML Tables from Multiple webpages Using R

Hi, I have done thorough research and have come this far. All I am trying to do is extract an HTML table that spans many webpages.
I have to query sec.gov's EDGAR database, and it returns a table with the appropriate number of results (the size and number of pages vary with every query). For example:
Link: http://www.sec.gov/cgi-bin/srch-edgar
Inputs to be given:
Enter a Search string box: form-type=(8-k) AND filing-date=20140523
Start: 2014
End: 2014
How can I do this totally in R without even opening the browser?
Here is what I have done so far.
I tried many packages, and the closest I came was with the RCurl package. But with the getURL function I had to open the browser, run the query in the browser, and paste the resulting URL into getURL. It returned a very long character string containing the URLs that can be looped over to produce the output I want. All this information is in the "center" tag of the output.
Now I do not know how to get those URLs out from the middle of that character string.
Also, this is not what I wanted. I wanted to run a web query directly from R and get the varied HTML table outputs directly into R. Is this possible at all?
Thanks
Meena
Yes, it is possible. You will want to use a combination of the RCurl and XML packages. You will need to programmatically generate the query parameters in the URL (based on the HTML form) and then use getURL() or getURLContent(). Sometimes, the server will expect an HTTP POST, so there is postForm().
To parse the result, look up the XPath language, which the XML package supports via getNodeSet(). The XML package also has readHTMLTable() for parsing an HTML table into a data.frame.
You might want to invest in this book.
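For comparison, the same idea as a minimal Python sketch, assuming the requests and pandas libraries, and assuming the search form submits text, first, and last parameters (check the form's actual field names before relying on this):

import requests
from io import StringIO
import pandas as pd  # pd.read_html needs lxml or html5lib installed

# Build the query URL programmatically and parse every HTML table in the reply.
resp = requests.get(
    "https://www.sec.gov/cgi-bin/srch-edgar",
    params={
        "text": "form-type=(8-k) AND filing-date=20140523",  # the search string
        "first": "2014",  # assumed name of the Start field
        "last": "2014",   # assumed name of the End field
    },
    headers={"User-Agent": "research-script example@example.com"},  # SEC asks for a descriptive UA
)
tables = pd.read_html(StringIO(resp.text))  # list of DataFrames, one per <table>
print(tables[0].head())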

Variable in MediaWiki for the current user

In MediaWiki, you can use a variable ("Magic Word") such as
{{PAGENAME}}
or
{{REVISIONDAY}}
to get specific information related to the current page being viewed. Is there a similar variable (or perhaps a different way) to get the current user who is logged in to the wiki, i.e. something like
{{USERNAME}}
Context: I'm trying to use an #ask query in Semantic MediaWiki to narrow the list of resulting pages to only those the current user has created or edited:
{{#ask: [[Case Reflection:+]] [[Contributing User::{{USERNAME}}]]
| format=template
| template=Case Reflection Form Summary
| link=all
| sort=Last Edited
| order=DESC
| default=You have no case reflections related to this Case Study.}}
There are a bunch of extensions for that, such as GetUserName, MyVariables, and UserInfo. The whole concept of showing usernames is incompatible with page caching, though (you need to parse the page again every time someone looks at it), so it is generally not a good idea.
I was just searching for the same thing, and looking to see if I could do it without extensions. It looks like there's a default feature that allows this, as long as you want the name written statically into a page, not to say "Hello, Username!" (That last case is why it has not been implemented as a standard variable: it causes caching problems.)
Wikimedia feature request T14733 was resolved with:
{{subst:REVISIONUSER}}
{{REVISIONUSER}} on its own will dynamically show the last editor, which is usually not what you want. But if you want, for example, to make a template that includes the user's handle as part of some inserted text, the subst: form should do the job, since it is expanded to the saving user's name at the moment the page is saved. I think in your example above, that would be
[[Contributing User::{{subst:REVISIONUSER}}]]
(I'm not sure whether Semantic MediaWiki will make you escape the substitution; if it does, further instructions are at Manual:Substitution, in the "Multilevel substitution" section.)
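Putting that together, the page containing the query would be saved as something like this (an untested sketch; the substitution happens once, when that page is saved, so the stored name is whoever saved the query page):

{{#ask: [[Case Reflection:+]] [[Contributing User::{{subst:REVISIONUSER}}]]
| format=template
| template=Case Reflection Form Summary
| link=all
| sort=Last Edited
| order=DESC
| default=You have no case reflections related to this Case Study.}}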