MediaWiki API - How to determine which portal(s) a page belongs to?

I wish to determine whether a given Wikipedia page belongs to a certain Wikipedia Portal using the MediaWiki API. So far, I have been experimenting with the page properties of the API but I cannot seem to find a way to derive what Portal a given page belongs to.
As an example, at the very bottom of the Wikipedia page for Cake, I can press Show on the Cakes section, and a bunch of links to different cake pages show up. There I can also see that all of these belong to the Food portal. It is that information that I wish to extract from a given page using the MediaWiki API.

As far as I know, there is actually no formal definition of "belonging to a portal" in Wikipedia. As opposed to categories, which are part of the MediaWiki software, portals are custom pages on Wikipedia that aim to make it easier to explore a topic.
Instead of a formal definition, though, you can use a heuristic and determine the connection between the page and some portal based on one of them linking to the other. There are API endpoints for both directions; a small script using the first one is sketched after the examples below:
(Note: 100 is the ID of the "Portal" namespace.)
Which portal pages are linked from the page "Cake" or "Pizza"
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=links&titles=Cake%7CPizza&plnamespace=100
Which portal pages link to the page "Cake" or "Pizza"
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=linkshere&titles=Cake%7CPizza&lhnamespace=100
(though as you can see, many unrelated portals link to "Cake" and none link to "Pizza")
A combined query for both directions
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=links%7Clinkshere&titles=Cake%7CPizza&plnamespace=100&lhnamespace=100
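For instance, a minimal Python sketch of the first query, using the requests library (an assumption about tooling; continuation handling is omitted for brevity):

import requests

API = "https://en.wikipedia.org/w/api.php"

def portals_linked_from(titles):
    # Ask for links from the given pages, restricted to the Portal namespace (100).
    params = {
        "action": "query",
        "format": "json",
        "prop": "links",
        "titles": "|".join(titles),
        "plnamespace": 100,
        "pllimit": "max",
    }
    data = requests.get(API, params=params).json()
    # Pages are keyed by page ID; pages with no portal links have no "links" key.
    return {page["title"]: [l["title"] for l in page.get("links", [])]
            for page in data["query"]["pages"].values()}

print(portals_linked_from(["Cake", "Pizza"]))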

So through some more investigation I found the answer:
I ended up using the revisions property in the API. This allows me to give a series of page titles that I want to investigate and have the wikitext of each page returned to me in JSON format. Then I can just search for lines containing Portal and figure out which portal (if any) the page belongs to.
If anyone is in a similar situation, here is an example query to the API:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Bread|Bubble_tea|Pizza&format=json&redirects&rvprop=content&rvslots=main
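In Python, the same query plus a crude scan of the returned wikitext might look like this (a sketch: continuation and error handling are omitted, and the regex is deliberately rough, catching both {{Portal|...}} templates and [[Portal:...]] links):

import re
import requests

API = "https://en.wikipedia.org/w/api.php"

def portals_mentioned_in(titles):
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": "|".join(titles),
        "redirects": 1,
        "rvprop": "content",
        "rvslots": "main",
    }
    data = requests.get(API, params=params).json()
    result = {}
    for page in data["query"]["pages"].values():
        wikitext = page["revisions"][0]["slots"]["main"]["*"]
        # Capture the portal name from {{Portal|Food}} or [[Portal:Food]].
        result[page["title"]] = sorted(set(re.findall(r"Portal[|:]([^}|\]]+)", wikitext)))
    return result

print(portals_mentioned_in(["Bread", "Bubble tea", "Pizza"]))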

Related

Sitemap for site with few dynamic HTML files and many possible URLs

I'm nearing the end of my first web development project and I'm looking to build a sitemap for our website as part of search engine optimisation. If I understand correctly, a sitemap, when done correctly, is a file that shows a content tree (similar to paths in Windows Explorer) of all the public pages of my website.
For the purpose of my question you're going to need some background information on the site and how it works. The site is about bird migration. A user enters the site on a homepage that holds a search box; he or she is able to search for a species of bird, and if we have data on it, the user can go to a separate page with information on this bird. From there the user can access statistical data about this species. The page is filled with content that we get from a database.
The URL will look something like http://domain.com/searchbird.html?bird=Sedge%20Warbler?lang=1 for the informational page, and http://domain.com/statistics.html?bird=Sedge%20Warbler?lang=1 for the statistical page.
Every bird species uses the same base HTML file (searchbird.html) that is filled with data based on the ?bird= parameter. I have about four HTML files in my webroot (let's call them index.html, searchbird.html, statistics.html, and about.html).
So when I go to create a sitemap using some sort of sitemap generation tool, I get a sitemap that contains those 4 .html files, which is great! Yet I'm missing the 500 bird species that users are going to be able to find.
Is there a way for me to include every possible URL in the sitemap automatically, and how would I go about doing such a thing? I've used HTML, CSS and JavaScript in the past, but I'm only a beginner. If an executable tool exists for this, that'd be great, but my Google searches haven't been successful yet.
You have to generate the list of URLs for your existing pages.
So dig into your data source (database or whatever you use), find all existing bird species, and generate the two URLs per species.
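For example, a minimal Python sketch, where get_species is a placeholder for a query against your own database:

from urllib.parse import quote
from xml.sax.saxutils import escape

def get_species():
    # Placeholder: replace with a query against your bird database.
    return ["Sedge Warbler", "Barn Swallow", "Common Swift"]

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for bird in get_species():
        for page in ("searchbird", "statistics"):
            # Note: & has to be escaped as &amp; inside the XML file.
            url = f"http://domain.com/{page}.html?bird={quote(bird)}&lang=1"
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
    f.write("</urlset>\n")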
Directory for users/bots
It would probably be a good idea (for visitors as well as for bots) to output these links on your website, too. Visitors would have two ways to find a species (search for it or browse the directory), and as most bots don’t use search functions, they wouldn’t be able to find the links on your site otherwise (they would have to use your sitemap, which not all bots do, or they would have to hope to find the links from some other external website).
(If you do this, you could also use a sitemap generator service; but it's usually better to generate it yourself.)
URL design
By the way, you might want to consider changing your URL design to a more human-friendly one. Instead of
http://example.com/searchbird.html?bird=Sedge%20Warbler?lang=1
http://example.com/statistics.html?bird=Sedge%20Warbler?lang=1
you could use something like
http://example.com/en/birds/sedge-warbler
http://example.com/en/birds/sedge-warbler/statistics
where en is the language code for "English" (these are standardized, and users have a chance to understand them, contrary to lang=1), and where http://example.com/en/birds could lead to the page listing all species. For other languages, you would of course ideally translate "birds" and "statistics".
Changing the URL design is possible with URL rewriting.
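For instance, assuming an Apache server (an assumption; other servers have equivalents), a couple of mod_rewrite rules in a .htaccess file could map the friendly URLs onto your existing files; your application would still need to translate a slug like sedge-warbler back to the stored species name:

RewriteEngine On
RewriteRule ^en/birds/([a-z0-9-]+)/statistics$ /statistics.html?bird=$1&lang=1 [L,QSA]
RewriteRule ^en/birds/([a-z0-9-]+)$ /searchbird.html?bird=$1&lang=1 [L,QSA]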
You can use a sitemap generator, for example https://www.xml-sitemaps.com/. You only need to enter your index URL; the service will crawl all links and generate the sitemap automatically.
If you use WordPress, you can use a plugin like https://wordpress.org/plugins/google-sitemap-generator/.
Hope that helps.

MediaWiki - Automatic two-way links between page sections

I want my MediaWiki install to have two classes of pages. (In the users' eyes - the wiki won't have to know the difference.)
I want some pages to be on topics, and others on sources (name of book, video, etc.)
I want to have a topic page "FAA Licenses" like:
==Medical Certificates==
===3rd Class===
Required for student license, and before student solo flights. {{{link/reference/whatever generally around here to Jeppesen Book#pg27-28}}}
And a source page "Jeppesen Book" like:
==pg27-28==
{{{link to FAA Licenses#3rd Class}}}
These source pages will track the source's (book or video) content. I imagine a source page for a book to have page numbers, and for a video to have start and stop times, or section numbers. (The book or video itself won't be on the source pages.)
So, the source pages will really serve two purposes. First, it will be fairly easy to see which parts of the sources have had notes taken and put into the topic pages. (So non-linear note-taking of sources will be easy -- skipping from source to source on topics, rather than digesting an entire source at once.) Second, it will be easy from a topic page to see where to go back to for a more in-depth review.
There are two issues I'm writing about.
(1) I want the workflow to be the user edits the topic page, putting in links to source pages and sections. I want this one user-addition to automatically make the source page link back to this spot. I want the system to handle the two-way-linking, assuming the user won't be perfect.
(2) I want the user to be able to put links in the topic page to source pages and sections that might not exist yet. I'd need those links to show up as red, to indicate they need to be created. But, still, once created, I want the system to handle the two-way-linking, even if there were multiple red links to the same area. (I could see building up quite a few red links, then having an unorganized "purge" of them by creating the missing pages and sections, and don't want to have to search for all the links to the new areas.) Ideally, I'd love for these source pages to be auto-generated -- so pages and sections were made as links were made to them, and automatically deleted (or at least the backlinks removed) as links were removed to them.
I don't think the MediaWiki what links here functionality does the job. I want this to work on a per-section rather than per-page basis. And, I don't want the user to have to add to each section a "what links here tag" -- I want it to be automatic.
The extension Semantic MediaWiki will allow you to get bidirectional linking in a semi-automatic fashion.
https://www.semantic-mediawiki.org/wiki/Help:Link_Template
shows a high-level example.
If you dig deeper into SMW and SemanticForms you'll find how with e.g. SemanticForms you can get a user experience that is close to what you are asking for.
See e.g. http://smw.referata.com/wiki/Discourse_DB and http://www.discoursedb.org/wiki/Main_Page for an application of these principles.
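As a rough sketch of the idea (the property and template names here are made up for illustration): the topic page annotates its reference with a semantic property, for example

[[cites source::Jeppesen Book]]

typically hidden inside a template such as {{source ref|Jeppesen Book|pg27-28}}, and the source page lists its backlinks with an inline query that maintains itself:

{{#ask: [[cites source::{{FULLPAGENAME}}]] | format=ul }}

SMW properties are page-level by default, so true per-section backlinks would need something extra such as subobjects; that is where Semantic Forms can help smooth the workflow.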
I don't think there is an easy way to do that. You could write an extension that provides a parser function for your users to enter, saves the source page + source section + target page + target section in a database when links are updated, and then uses the ParserSectionCreate hook to show links based on that. Or you could create two types of templates and write a bot that keeps them in sync.

How can I use Google to find all unknown web pages that point to one specific known web page?

It's easy to find all child pages of a web page, but it is not trivial to get all parent pages of one. How can I do that using Google?
You can't, and Google cannot help you, as it doesn't index all of the web.
At best it follows links on other pages, or is initiated by someone wanting to have something indexed explicitly.
Create a server. Put an HTML page on it that no other page on your server has a link to. Name the page with some non-guessable UUID in the name.
Google will not find this unless they start to randomly change parts of URLs to test for existing pages (a lengthy process).
Within that page you can have links pointing to other pages. It is a parent page for those pages, is a web page, and will not be found via Google.

Facebook Page -- Dynamic HTML Based on User

I'm looking to create a Facebook Page with dynamic content based on the user visiting the page. For example, if the user has "liked" something relating to "soccer", then it would display a little module specifically for soccer... or if they liked "baseball", then it would display baseball.
I guess my overall question is: "What content does FB allow developers to scrape and use in their code?" I want to utilize this on the Static FBML application.
Thanks in advance!
You may want to check the Open Graph documentation:
http://developers.facebook.com/docs/opengraph/
Graph API: For accessing profile data
http://developers.facebook.com/docs/reference/api/
In order to request and receive extended information about a profile, you need to set up a signed request with the Graph API. This can be done from a custom Facebook app.
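For reference, a signed request is the string sig.payload, where both halves are base64url-encoded and the signature is an HMAC-SHA256 of the payload using your app secret. A minimal Python sketch of the documented decoding procedure (APP_SECRET is a placeholder for your app's secret):

import base64
import hashlib
import hmac
import json

APP_SECRET = b"your-app-secret"  # placeholder: taken from your app settings

def parse_signed_request(signed_request):
    # Format is "<signature>.<payload>", both base64url-encoded without padding.
    sig_b64, payload_b64 = signed_request.split(".", 1)
    pad = lambda s: s + "=" * (-len(s) % 4)  # restore the stripped padding
    sig = base64.urlsafe_b64decode(pad(sig_b64))
    payload = base64.urlsafe_b64decode(pad(payload_b64))
    # The signature is an HMAC-SHA256 over the *encoded* payload string.
    expected = hmac.new(APP_SECRET, payload_b64.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("signed_request signature mismatch")
    return json.loads(payload)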

Is there any way of making JSON data readable by a Google spider?

Is it possible to make JSON data readable by a Google spider?
Say, for instance, that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e. the translation from JSON data to the displayed page is done inside the user's browser; not my choice, just what I've been given to work with: it's an old legacy CGI application, not an actual server-side scripting language.)
My concern here is that the Google spiders will not be able to pick up or directly link to the item in question when a user clicks on it in Google, and users would be presented with an index page full of all the items, rather than being linked directly to the item they clicked on.
Is there any way of "informing" the Google spider, in the JSON, that it should feed the user a different link?
While Google does crawl and index JavaScript in some circumstances, it's still best to serve "normal" (X)HTML content if at all possible. In this case, it would help to know the rest of the site's setup, in particular: is the JSON content just used to create a feed of links to the product pages (with static content), or are all product pages also generated from JSON feeds? If the feed is only used to point to the actual product pages (which are static), then one way to make the product pages discoverable could be to create an HTML sitemap page or some other alternate form of navigation. An XML Sitemap file can also help, but I would recommend not using it as the sole way of making the product pages discoverable.
If all of the content is only accessible through JSON feeds, then I think you will have to make some bigger changes if you want that content to be accessible through search results.
One way to handle it could also be to use the new JavaScript crawling/indexing proposal, which basically would result in a headless browser being set up between your site and Google: http://code.google.com/web/ajaxcrawling/ (whether setting this up or revamping the rest of the site is easier is hard to say :-))
You should make a wrapper page in server-side code around the JSON data, and respond to requests with either the wrapper or the regular version depending on the User-Agent.
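As a minimal sketch of that idea, here is a hypothetical Python CGI wrapper (the feed URL and field names are assumptions); it renders the same JSON feed the browser-side code uses, so crawlers see real HTML rather than an empty shell:

#!/usr/bin/env python3
# Render the JSON feed into plain HTML for clients that don't run JavaScript.
import json
import urllib.request

FEED_URL = "http://example.com/cgi-bin/items.json"  # assumption: your JSON feed

items = json.load(urllib.request.urlopen(FEED_URL))
print("Content-Type: text/html\n")
print("<html><body><ul>")
for item in items:
    # Assumption: each item carries "url" and "name" fields.
    print('<li><a href="{}">{}</a></li>'.format(item["url"], item["name"]))
print("</ul></body></html>")

Note that serving crawlers substantially different content than users can be treated as cloaking, so the wrapper should present the same data the JavaScript version shows.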