How to get English Wikipedia page titles from Arabic page titles - mediawiki

I want to build a dictionary file of Arabic-to-English Wikipedia titles:
"wiki_page_title_in_arabic" : "wiki_page_title_in_english"
I now have a list of all the Arabic wiki page titles; how can I get the corresponding titles in English, bearing in mind the huge size of the dictionary?

You could use the API and query for the language-links property (prop=langlinks), and for ease restrict the results to English (lllang=en). Example: you have the article Stockholm in Arabic: ستوكهولم
Use the query: http://ar.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllang=en&titles=ستوكهولم
The result will give you the English title.
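Since one request per title would be slow for a huge dictionary, the API also lets you batch titles (titles=a|b|c, up to 50 per request for most accounts) and raise the per-request link limit with lllimit=max. A minimal Python sketch of this approach, using only the standard library (the helper names are mine, not part of the API):

```python
import json
import urllib.parse
import urllib.request

API = "https://ar.wikipedia.org/w/api.php"

def build_langlinks_url(arabic_titles, lang="en"):
    """Build a langlinks query URL for a batch of Arabic titles."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "lllang": lang,          # restrict language links to English
        "lllimit": "max",
        "titles": "|".join(arabic_titles),  # batch several titles per request
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def fetch_english_titles(arabic_titles):
    """Return a dict mapping Arabic titles to English titles (network call)."""
    url = build_langlinks_url(arabic_titles)
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    mapping = {}
    # With the default JSON format, pages is a dict keyed by page ID.
    for page in data["query"]["pages"].values():
        for link in page.get("langlinks", []):
            if link["lang"] == "en":
                mapping[page["title"]] = link["*"]
    return mapping
```

Looping over your full title list in chunks of 50 and merging the returned dicts gives you the whole dictionary file in a few thousand requests instead of millions.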

Related

Failing to query OpenLibrary

I want to compose a query to OpenLibrary's RESTful API that does the following:
filters the book list by the first five letters of the title
returns the book's title, author, publication date, description, and a link to the large thumbnail of the cover
So far this is all I've been able to compose with any success:
http://openlibrary.org/query.json?type=/type/edition&authors=/authors/OL1A&covers=&title=&publish_date=&description=
You can cut and paste it into your browser to see the result; OpenLibrary doesn't require an API key.
My main obstacles seem to be:
I can't figure out how to filter the books by the first five letters of the title
I can't figure out how to turn the cover information into a link to the actual thumbnail
Any help?
The API does not let you search by the first five letters of the title directly, but you can write code that consumes the API and applies a regex to the results.
Example:
If the first five letters searched for are Bhānu, the regex would be ^Bhānu (anchored so it matches only the start of the title), which matches a title such as "Bhānumatīra deśa".
link to regex example: https://regex101.com/r/cTVX1Z/4
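That filtering approach can be sketched in Python. Client-side prefix matching covers the first obstacle; for the second, cover IDs from an edition's covers field can be turned into image links via OpenLibrary's separate Covers API (https://covers.openlibrary.org/b/id/{cover-id}-{size}.jpg, with sizes S, M, and L). The edition records below are made up for illustration:

```python
import re

def title_has_prefix(title, prefix):
    """Case-insensitive check that a title starts with the given letters."""
    return bool(re.match("^" + re.escape(prefix), title, re.IGNORECASE))

def cover_url(cover_id, size="L"):
    """Build a cover-image link from a cover ID (OpenLibrary Covers API)."""
    return f"https://covers.openlibrary.org/b/id/{cover_id}-{size}.jpg"

# Hypothetical records as returned by /query.json (fields trimmed):
editions = [
    {"title": "Bhānumatīra deśa", "covers": [240726]},
    {"title": "Moby Dick", "covers": [240727]},
]

matches = [e for e in editions if title_has_prefix(e["title"], "Bhānu")]
links = [cover_url(e["covers"][0]) for e in matches]
```

re.escape keeps letters with diacritics (and any punctuation in the prefix) from being misread as regex metacharacters.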

Retrieve first paragraph from Wikipedia in Chinese

I want to retrieve the first paragraph of Wikipedia articles in the Chinese language. I found an API:
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&rawcontinue=1&format=xml&exintro=&titles=samsung
but it returns data in English.
How can I get data from this API in Chinese language?
Wikipedia is not one site but many. The article Samsung on the English Wikipedia contains no Chinese text; you are probably looking for the corresponding page on the Chinese Wikipedia. Since most, if not all, Wikipedias use the TextExtracts extension that you are calling above, you can simply change the domain and the page title and use the same API call as before:
http://zh.wikipedia.org/w/api.php?action=query&prop=extracts&rawcontinue=1&format=xml&exintro=&titles=%E4%B8%89%E6%98%9F%E9%9B%86%E5%9B%A2
Relevant for Chinese: according to the docs, you should also be able to choose which language variant to fetch (e.g. zh-tw for Taiwan or zh-cn for mainland China), using the exvariant parameter.
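Building the query programmatically keeps the title encoding correct (Chinese characters must be percent-encoded, as in the URL above). A small Python sketch; the helper name is mine, and explaintext is added to get plain text rather than HTML:

```python
import urllib.parse

def extracts_url(domain, title, variant=None):
    """Build a TextExtracts intro-query URL for any Wikipedia domain."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": "",       # only the lead section
        "explaintext": "",   # plain text instead of HTML
        "format": "json",
        "titles": title,     # urlencode percent-encodes non-ASCII titles
    }
    if variant:
        params["exvariant"] = variant  # e.g. "zh-tw" or "zh-cn"
    return f"https://{domain}/w/api.php?" + urllib.parse.urlencode(params)

url = extracts_url("zh.wikipedia.org", "三星集团", variant="zh-tw")
```

The same helper works unchanged for any other language edition by swapping the domain and title.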

Full urls of images of a given page on Wikipedia (only those I see on the page)

I want to extract the full URLs of all the images on the "Google" page on Wikipedia.
I have tried with:
http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
but this way I also get images that are not related to Google, such as:
http://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/commons/f/fe/Crystal_Clear_app_browser.png
How can I extract only the images that I actually see on the Google page?
1. Retrieve the page source code: https://en.wikipedia.org/w/index.php?title=Google&action=raw
2. Scan it for substrings like [[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed; it is no longer in the black bar, but under it as a button.]]
3. Ask the API for all pictures on the page: http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
4. Filter out all URLs except those that match the picture names found in step 2.
Steps 2 and 4 need more explanation.
#2. The regexp /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby's regexps, \b denotes a word boundary, which might be unsupported in the language of your choice. The regexp I proposed will match all the cases that come to my mind: plain links ([[File:something.jpg]]), gallery tags (<gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>), and templates ({{Infobox|pic = File:something.jpg}}). However, it won't match filenames which contain ]. I'm not sure whether those are legal, but if they are, they must be very uncommon, so it should not be a big deal.
If you want to match only constructs like [[File:something.jpg|thumb|description]], the following regexp will work better: /\[\[(File|Image):[^]|]+/
#4. I'd remove from the names all characters that match /[^A-Za-z0-9]/. It's easier than escaping them and, in most cases, it's enough.
Icons are most often attached via templates, in contrast to pictures related to the article subject, which are most often attached directly ([[File:…]]). There are exceptions, though: in some articles pictures are attached with the {{Gallery}} template, and there is also a <gallery> tag which introduces special syntax for galleries. You'll have to tune my solution to your needs, and even then it won't be perfect, but it should be good enough.
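Steps 2 and 4 of the approach above can be sketched in Python. The regex is a direct port of the Ruby one (with ] escaped inside the character class), and the alphanumeric-only normalization means stray }} picked up from templates, or underscores in URLs, don't break the match. The helper names are mine:

```python
import re

def filenames_in_wikitext(wikitext):
    """Step 2: collect File:/Image: names referenced in the article source."""
    return {m.group(0).split(":", 1)[1].strip()
            for m in re.finditer(r"\b(?:File|Image):[^\]|\n\r]+", wikitext)}

def normalize(name):
    """Step 4: keep only alphanumerics (lowercased), so spaces vs.
    underscores and trailing template braces don't matter."""
    return re.sub(r"[^A-Za-z0-9]", "", name).lower()

def filter_image_urls(urls, wikitext):
    """Keep only URLs whose filename matches a name found in the wikitext."""
    wanted = {normalize(n) for n in filenames_in_wikitext(wikitext)}
    return [u for u in urls if normalize(u.rsplit("/", 1)[-1]) in wanted]
```

Feed it the raw source from step 1 and the URL list from step 3, and icons pulled in by navigation templates on other pages fall away because their filenames never appear in the article's own wikitext.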

MediaWiki LocalSettings.php $wgSiteName issue

So I have the wiki exactly as I want it, but the sitename variable always reverts to lowercase...
I assume a special character is needed, like \t or \n in C# for formatting text, but I cannot find anything of the sort.
This is directly from the wiki page describing sitename:
"Site name
The $wgSitename variable holds the name of your wiki setup. This name gets included many times throughout the system, such as via MediaWiki:Pagetitle. For instance, the Wikipedia tagline 'From Wikipedia, the free encyclopedia' makes use of this setting."
It states that Wikipedia uses sitename in its tagline, 'From Wikipedia, the free encyclopedia', which contains multiple capital letters.
Thanks ahead of time.
That text actually comes from the page MediaWiki:Tagline. Edit that page on your wiki and you can make the line say anything you want. You also have to edit MediaWiki:Common.css, as described on that page, to make the tagline show up on pages.

Why isn't this Yahoo Pipe outputting items?

I have a Yahoo! Pipe that attempts to transform an HTML page into RSS, but the resulting feed contains no items. For each entry I've parsed these elements:
link (permalink)
title (HTML title)
description (HTML entry)
guid (segment of the permalink)
Various tutorials led me to add these:
dc:creator ("Doug")
y:id.value (permalink)
y:published (w/ date attributes generated from text like "3 days ago")
If you edit the source and highlight the Pipe Output module, the debugger shows 5 entries with these elements/attributes intact.
What am I missing?
That is vexing! By tweaking it a bit to do "emit results" in the "Loop" operator box I managed to get a feed with 5 items, but it only contained the item.guid for some reason.
Your feed is valid, though (not that that's hard, considering there are no elements), according to http://feedvalidator.org.
I tried removing some of your components but my changes did not help.
By the way, it's crazy that they are killing Yahoo 360 blogs in favor of the feedless Yahoo Profile blogs. Oh, and I like Douglas Crockford too. :-)