Wikipedia API - How to get all links from multiple pages? - mediawiki

I'm new to the wiki API. I have read how to get all links from a specific page and managed to do so, but in my case I need a list of links from many pages, and sending a separate request for each page is inefficient. This is the kind of request I use:
http://en.wikipedia.org/w/api.php?action=query&format=jsonfm&generator=links&gpllimit=500&redirects=true&pageids=1234
I must admit that I don't fully understand what each argument means. So -
How do you chain multiple page IDs in the 'pageids' argument? I guess that's a silly question, but I didn't find any reference for it :\
Can the response point out which page owns each link?
Thanks!

You can just join page IDs (or names, if you use the titles parameter) with |, which is how you make lists in the MediaWiki API in general. I don't think there is a way to find out which link comes from which page, though.
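For example, the request from the question with a few page IDs joined by | (the extra IDs here are made up for illustration):
http://en.wikipedia.org/w/api.php?action=query&format=jsonfm&generator=links&gpllimit=500&redirects=true&pageids=1234|5678|91011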

Related

How do I direct all traffic or searches going to duplicate (or similar URLs) to one URL on our website?

I'll try to keep this as simple as possible, as I don't quite understand how to frame the question entirely correctly myself.
We have a report on our website indicating duplicate meta titles and descriptions, which look almost exactly like the following (I have used an example domain below):
http://example.com/green
https://example.com/green
http://www.example.com/green
https://www.example.com/green
But only one of these actually exists as an HTML file on our server, which is:
https://www.example.com/green
As I understand it, I need to somehow tell Google and other search engines which of these URLs is correct, and this should be done by specifying a 'canonical' link or URL.
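From what I've read, that would mean adding a link tag like the following to the head of each duplicate page, pointing at the page we consider the real one (this is just my understanding, using the example domain above):
<link rel="canonical" href="https://www.example.com/green">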
My problem is that the canonical reference must apparently be added to any duplicate pages that exist, and not to the actual main canonical page. But we don't actually have any other pages beyond the one mentioned just above, so there is nowhere to set these canonical rel references?
I'm sure there must be a simple explanation for this that I am completely missing.
So it turns out that these were duplicate URLs which occur because our website exists as a subdomain of our domain. Any traffic that arrives at example.com (our domain) needs a permanent redirect to https://www.example.com, by way of a redirect rule in the .htaccess file.
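A minimal sketch of such a rule, assuming Apache with mod_rewrite enabled and https://www.example.com as the canonical host (swap in your own domain):

RewriteEngine On
# Redirect anything that is not already https://www.example.com to it, keeping the path
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

With this in place, all four URL variants end up as a single 301 redirect to the www/HTTPS version, which is what search engines treat as the canonical one.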

Category products not found

WHOOPS, OUR BAD... The page you requested was not found, and we have
a fine guess why. If you typed the URL directly, please make sure
the spelling is correct. If you clicked on a link to get here, the
link is outdated.
What can you do?
Have no fear, help is near! There are many ways you can get back on
track with Magento Store.
Go back to the previous page.
Use the search bar at the top of the page to search for your products.
Follow these links to get you back on track!
Store Home | My Account.
I get these errors in Magento. How should I solve this?
Check in the 'URL rewrite' management that the created URLs are correct, specifically in the 'URL requested' column, because sometimes it adds extensions like .htm or .html, or the URL contains some special character.
You might be facing this problem for one of the following reasons:
Your category is not active in the admin.
If it is active, products are not assigned to it.
If products are assigned, indexing is not done.
If indexing is done, please ensure you are using the default category URL.
If the category URL is not the default, please make sure the catalog_url_rewrite indexing ran properly (see the reindex sketch below).
Checking the cases above should resolve your problem.
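If reindexing turns out to be the issue, a minimal sketch of forcing it from the shell, assuming Magento 1.x with the stock shell/indexer.php script (index codes can differ between versions, so check the info output first):

# List the available index codes and their current status
php shell/indexer.php info
php shell/indexer.php status

# Rebuild the Catalog URL Rewrites index (the code is assumed to be catalog_url here)
php shell/indexer.php --reindex catalog_url

# Or rebuild everything
php shell/indexer.php reindexall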

mediawiki-api - links on page & getting fields on those pages

If I have a Wikimedia category such as "Category:Google_Art_Project_works_by_Vincent_van_Gogh", is there an API to retrieve a list of the URLs linked to on that page?
I've tried this, but it doesn't return any links: https://en.wikipedia.org/w/api.php?action=query&titles=Category:Google_Art_Project_works_by_Vincent_van_Gogh&prop=links
(If not, I'll parse the html and obtain them that way.)
Once I have all the URLs linked to, is there an API to retrieve some of the information on the page? (Summary/Artist, Title, Date, Dimensions, Current location, Licensing)
I've tried this, but it doesn't seem to have a way to return that information: https://en.wikipedia.org/w/api.php?action=query&titles=File:Irises-Vincent_van_Gogh.jpg&prop=imageinfo&iiprop=url
is there an API to retrieve a list of the URLs linked to on that page?
I guess you're looking for the categorymembers API, which will list the pages in the selected category.
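For example, querying Commons directly (the limit value here is just an illustration):
https://commons.wikimedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Google_Art_Project_works_by_Vincent_van_Gogh&cmlimit=500&format=json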
I've tried this, but it doesn't return any links: https://en.wikipedia.org/w/api.php?action=query&titles=Category:Google_Art_Project_works_by_Vincent_van_Gogh&prop=links
First, notice that this is a Wikimedia Commons category, so querying en.wikipedia.org returned a missing page. However, even if you query the right project, you will notice that the category description itself does not contain any links.
Once I have all the URLs linked to, is there an API to retrieve some of the information on the page?
You can use the categorymembers query as a generator, then specify the usual properties that you want from each page. However, the metadata you seem to be interested in is not available via the API; you need to parse it out of each image's description text.
Try https://commons.wikimedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category%3aGoogle_Art_Project_works_by_Vincent_van_Gogh&prop=links|imageinfo|revisions&iiprop=timestamp|user|url|size|mime&rvprop=ids|content&rvgeneratexml

Web-scraping only a specific domain

I am trying to make a web scraper that, for this example, scrapes news articles from Reuters.com. I want to get the title and date. I know I will ultimately just have to pull the source code from each address and then parse the HTML using something like JSoup.
My question is: how do I ensure I do this for each news article on Reuters.com? How do I know I have hit all the reuters.com addresses? Are there any APIs that can help me with this?
What you are referring to is called web scraping plus web crawling. What you have to do is visit every link matching some criteria (crawling) and then scrape the content (scraping). I've never used them, but here are two Java frameworks for the job:
http://wiki.apache.org/nutch/NutchTutorial
https://code.google.com/p/crawler4j/
Of course, you will have to use jsoup (or similar) for parsing the content after you've collected the URLs.
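A rough sketch of the scraping side with jsoup, assuming the crawler has already handed you an article URL; the URL and the h1/time selectors below are placeholders I made up, so inspect the real pages to find the right ones:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleScraper {
    public static void main(String[] args) throws Exception {
        // Placeholder URL - in a real run this would come from the crawler
        String url = "http://www.reuters.com/article/example-article-id";

        // Fetch and parse the page
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (example scraper)")
                .get();

        // doc.title() reads the <title> tag; the h1/time selectors are guesses
        // and need to be adapted to the site's actual markup
        String pageTitle = doc.title();
        Element headline = doc.select("h1").first();
        Element date = doc.select("time").first();

        System.out.println("Title:    " + pageTitle);
        System.out.println("Headline: " + (headline != null ? headline.text() : "not found"));
        System.out.println("Date:     " + (date != null ? date.text() : "not found"));
    }
}

Discovering which article URLs to feed into this is the crawler's job, which is where Nutch or crawler4j come in.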
Update
Check out "Sending cookies in request with crawler4j?" for a better list of crawlers. Nutch is pretty good, but very complicated if the only thing you want is to crawl one site. crawler4j is very simple, but I don't know whether it supports cookies (and if that matters to you, it's a deal breaker).
Try this website: http://scrape4me.com/
I was able to generate this URL for the headline: http://scrape4me.com/api?url=http%3A%2F%2Fwww.reuters.com%2F&head=head&elm=&item[][DIV.topStory]=0&ch=ch

How to include the result of an api request in a template?

I'm creating a wiki using MediaWiki for the first time. I would like to automatically include all backlinks of the current page in a template (like a "See also" section). I have played with the API successfully, but I still haven't succeeded in including the useful part of the result in my template.
I have been querying Google and Stack Overflow for days (maybe in the wrong way) but I'm still stuck.
Can somebody help me?
As far as I know, there is no reasonable way to do that. Probably the closest you could get is to write JavaScript code that reacts to the presence of a specific HTML element in the page, makes the API request and then updates the HTML to include the result.
It’s not possible to execute any JavaScript from wiki text, or even to use more uncommon HTML. As such, you won’t be able to use the MediaWiki API like that.
There are multiple different options you have to achieve something like this though:
You could use the API by including custom JavaScript code on MediaWiki:Common.js. The code there is included automatically on every page and can be used to enhance the wiki experience. This obviously requires JavaScript on the client, so it might not be the best option, but at least you could use the API directly. You would have to add something to figure out where to place the results correctly, though.
A better option would be to use an extension that gives you this output. You can either try to find an extension that already provides this functionality, or write your own that uses the internal MediaWiki API (not the JS one) to access that content.
One extension I can personally recommend that does this (and many other things) is DynamicPageList (full disclosure: I’m somewhat affiliated with that project). It allows you to perform complex page selections.
For example, what you are trying to do is find all pages that link to your page. This can easily be done with DPL like this:
{{ #dpl: linksto = {{FULLPAGENAME}} }}
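To get this into a "See also" section, you could wrap the call in a template of your own, for example a hypothetical Template:See also backlinks (the template name and heading are just illustrations):

== See also ==
{{ #dpl: linksto = {{FULLPAGENAME}} }}

Then transclude {{See also backlinks}} at the bottom of each page. {{FULLPAGENAME}} is evaluated on the page where the template ends up, so each page lists its own backlinks.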
I wrote a blog post recently showing how to call the API to get the job queue size and display that inside a wiki page. You can read about it at Display MediaWiki job queue size inside your wiki. This solution does require the External Data extension, however. The code looks like:
{{#get_web_data: url={{SERVER}}{{SCRIPTPATH}}/api.php?action=query&meta=siteinfo&siprop=statistics&format=json
| format=JSON
| data=jobs=jobs}}
{{#external_value:jobs}}
You could easily swap in a different API call to get other data. For the specific item you're looking for, #poke's answer above is probably better.
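For example, an untested sketch of swapping in the backlinks API; I'm assuming here that External Data's JSON format can pick the title values out of the backlinks list, and the parser functions shown (#get_web_data, #for_external_table) are the ones that extension provides:

{{#get_web_data: url={{SERVER}}{{SCRIPTPATH}}/api.php?action=query&list=backlinks&bltitle={{FULLPAGENAMEE}}&bllimit=500&format=json
| format=JSON
| data=title=title}}
{{#for_external_table: * [[{{{title}}}]]
}}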