Full URLs of images of a given page on Wikipedia (only those I see on the page)

I want to extract the full URLs of all the images on the "Google" page on Wikipedia.
I have tried:
http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
but this way I also get images that are not related to Google, such as:
http://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/commons/f/fe/Crystal_Clear_app_browser.png
How can I extract only the images that I actually see on the Google page?

1. Retrieve the page source code: https://en.wikipedia.org/w/index.php?title=Google&action=raw
2. Scan it for substrings like [[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed; it is no longer in the black bar, instead under it as a button.]]
3. Ask the API for all pictures on the page: http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
4. Filter out all URLs except those that match the picture names found in step 2.
Steps 2 and 4 need more explanation.
#2. The regexp /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby's regexps, \b denotes a word boundary, which might be unsupported in the language of your choice. The regexp I propose will match all the cases that come to my mind: [[File:something.jpg]], gallery tags (<gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>), and templates ({{Infobox|pic = File:something.jpg}}). However, it won't match filenames which contain ]. I'm not sure whether those are legal, but if they are, they must be very uncommon, and it should not be a big deal.
If you want to match only constructs like [[File:something.jpg|thumb|description]], the following regexp will work better: /\[\[(File|Image):[^]|]+/
#4. I'd remove from the names all characters matching /[^A-Za-z0-9]/. It's easier than escaping them and is, in most cases, enough.
Icons are most often attached via templates, in contrast to pictures related to the article's subject, which are most often attached directly ([[File:…]]). There are exceptions, though; for example, in some articles pictures are attached with the {{Gallery}} template. There is also the <gallery> tag, which introduces special syntax for galleries. You will have to tune my solution to your needs, and even then it won't be perfect, but it should be good enough.
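Putting the four steps together, here is a minimal sketch in Ruby (the approach itself is language-agnostic). It assumes the English Wikipedia endpoints quoted above and omits error handling:

```ruby
require "net/http"
require "json"
require "uri"

def fetch(url)
  Net::HTTP.get(URI(url))
end

# Step 1: raw wikitext of the page.
wikitext = fetch("https://en.wikipedia.org/w/index.php?title=Google&action=raw")

# Step 2: file names referenced in the wikitext (the regexp from above,
# with the ] escaped explicitly for clarity).
referenced = wikitext.scan(/\b(?:File|Image):[^\]|\n\r]+/)

# Step 4's normalization: strip everything but letters and digits so that
# spaces vs. underscores don't break the comparison.
normalize = ->(name) { name.gsub(/[^A-Za-z0-9]/, "") }
wanted = referenced.map { |n| normalize.call(n.sub(/\A(?:File|Image):/, "")) }

# Step 3: every image URL the API knows about for the page.
api = "https://en.wikipedia.org/w/api.php?action=query&titles=Google" \
      "&generator=images&gimlimit=50&prop=imageinfo&iiprop=url&format=json"
pages = JSON.parse(fetch(api)).dig("query", "pages") || {}

# Step 4: keep only URLs whose file name was seen in step 2.
urls = pages.values.flat_map { |p| p["imageinfo"] || [] }.map { |i| i["url"] }
puts urls.select { |u| wanted.include?(normalize.call(File.basename(u))) }
```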

Markdown TOC with Special Characters?

I am trying to create a TOC for my Markdown blog.
The methods I found here (Markdown to create pages and table of contents?) do not work for me, because I am naming all of my headers like ```# _</>_ The Setup```: I use CSS to style the `</>` part, giving each header a nice colored icon. If I simply use ```# The Setup```, it works great.
This causes issues whenever I try to use [The Setup](#The-Setup).
I tried a few things, like [The Setup](#_</>_-The-Setup), but I cannot get it to work.
If someone can point me in the right direction I would greatly appreciate it. Also, if anyone has a better way of adding custom icons next to headers, I think that would be the better way to go about it.
As always, thanks in advance.
The general solution is to examine the rendered HTML output to see what the tool converts the special characters to in the heading element's id. Every tool may handle the conversion differently: it could convert special characters to -, convert them to _, or just remove them. Some examples:
<h1 id="_____the-setup">The Setup</h1>
<h1 id="-the-setup">The Setup</h1>
<h1 id="the-setup">The Setup</h1>
Once you have identified the exact id that the tool is using, then you use that value as the heading link in the markdown's table of contents. For example:
[The Setup](#_____the-setup)
Now, the tricky part is that not all Markdown tools will export the rendered HTML; VS Code, for example, does not. The workaround for VS Code is:
Open the markdown preview mode (which renders to html internally).
Open the VS Code Developer Tools (Help > Toggle Developer Tools).
Use DevTools to inspect the element (in this case, the heading element for "The Setup").
I see that VS Code named the id the-setup, so in the markdown's table of contents I write [The Setup](#the-setup). Now the table-of-contents hyperlink works in VS Code. Caveat: it might not work in other Markdown tools if they render a different HTML element id!
Another shortcut, available since VS Code 1.70 (July 2022), is that Markdown link targets can autocomplete header IDs: just type # and it will list the valid IDs.

Wikipedia API - get random page(s)

I'm trying to get a JSON result with a set of random pages from Wikipedia, including their titles, content and images.
I've played around with their API sandbox, and so far the best I've got is this:
https://en.wikipedia.org/w/api.php?action=query&list=random&format=json&rnnamespace=0&rnlimit=10
But this only includes the namespace, id, and title of ten random pages. I would like to get the content and images as well.
Does anyone know how?
Alternatively, I could make do with the title, content, and image URLs of a single random page.
Best I've got here is:
https://en.wikipedia.org/w/api.php?action=query&generator=random&format=json
You're close. generator=random is the right way to go. You can then use various prop values to get the info you want:
Page title is always included.
To get the text, use prop=revisions along with rvprop=content.
To get all images used on the page, use prop=images.
Note that this will often include images you're probably not interested in, like icons and flags. To fix that, you might instead try prop=pageimages, though it doesn't always seem to work. Or you could try using both.
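For instance, a pageimages query for one representative thumbnail per page might look like this (pithumbsize sets the requested thumbnail width):
https://en.wikipedia.org/w/api.php?format=json&action=query&generator=random&grnnamespace=0&prop=pageimages&pithumbsize=500&grnlimit=10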
So, the final query could look like this:
https://en.wikipedia.org/w/api.php?format=json&action=query&generator=random&grnnamespace=0&prop=revisions|images&rvprop=content&grnlimit=10
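As a minimal sketch of consuming that response in Ruby (assuming the default JSON format, where each revision's wikitext sits under the "*" key):

```ruby
require "net/http"
require "json"
require "uri"

url = "https://en.wikipedia.org/w/api.php?format=json&action=query" \
      "&generator=random&grnnamespace=0&prop=revisions|images" \
      "&rvprop=content&grnlimit=10"
data = JSON.parse(Net::HTTP.get(URI(url)))

data["query"]["pages"].each_value do |page|
  title   = page["title"]
  content = page.dig("revisions", 0, "*")            # the raw wikitext
  # prop=images yields only file titles; resolving them to URLs needs an
  # extra imageinfo query (see the first question above).
  images  = (page["images"] || []).map { |i| i["title"] }
  puts "#{title}: #{images.size} image(s), #{content.to_s.size} bytes of wikitext"
end
```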
If you'd rather use their REST API:
curl -X GET "https://en.wikipedia.org/api/rest_v1/page/random/summary"
Documentation
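A minimal Ruby sketch against that endpoint (the summary JSON carries the title, a plain-text extract, and a thumbnail when the page has a lead image):

```ruby
require "net/http"
require "json"
require "uri"

# The random endpoint may answer with a 303 redirect to the concrete
# page's summary, so follow one hop if needed.
uri = URI("https://en.wikipedia.org/api/rest_v1/page/random/summary")
res = Net::HTTP.get_response(uri)
res = Net::HTTP.get_response(URI.join(uri, res["location"])) if res.is_a?(Net::HTTPRedirection)

summary = JSON.parse(res.body)
puts summary["title"]
puts summary["extract"]                  # plain-text summary of the page
puts summary.dig("thumbnail", "source")  # nil if the page has no lead image
```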

Direct link to MediaWiki page section

On my wiki page, I have a section called SubtitleA. Earlier on the page, there is a sentence with a link that jumps to the content of that section.
To be more clear, this is a simple illustration:
To do this, you will need `this` (link to subtitleA).
To do that, you will do another thing..
== SubtitleA ==
this is how you do it....
I found the following solution:
To do this, you will need [http://wikisite.com/pageName#SubtitleA this].
This has been proven correct; however, one of my subtitles contains spaces, brackets, and backslashes, like the following:
== SubtitleA (balabalaA\balabalaB\balabala....) ==
I can no longer use the solution I found because of those spaces... Can anyone provide an alternative solution? Thanks.
To do this, you will need [[pageName#SubtitleA|this]].
Use the exact same format as in the section title.
Anchor encoding is similar to percent-encoding (with a . instead of a %) but not exactly the same (e.g. spaces are collapsed and encoded to _). If you really, really need to do the encoding directly, you can use {{anchorencode:original title}}.
I found the solution:
URL encoding is the key, but not with the standard %xx replacements for special characters. Using .xx instead (e.g. .5C, .28) works in the MediaWiki framework.
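As a rough sketch of that scheme in Ruby (my guess at the rules from the behaviour described above; inside a wiki page, {{anchorencode:...}} remains the authoritative way):

```ruby
# Spaces become underscores; characters outside a conservative safe set
# become .XX, i.e. percent-encoding with "." in place of "%".
def mediawiki_anchor(section_title)
  section_title.strip.tr(" ", "_").gsub(/[^A-Za-z0-9_.:-]/) do |ch|
    ch.each_byte.map { |b| format(".%02X", b) }.join
  end
end

puts mediawiki_anchor("SubtitleA (a\\b)")
# => SubtitleA_.28a.5Cb.29
```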

Pulling out some text from a giant HTML file using Nokogiri/xpath

I am scraping a website and am trying to pull certain elements out of the HTML. On the sites I am scraping, there are script tags with a bunch of info in them; however, there is one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
With some stuff above and below it. This is different for each page source, except for (obviously) the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line and cutting out just the URL? I feel I would need to use regular expressions, as the URLs are dynamic.
Ruby's gsub method does something similar to what I want, with its ability to take a /regex/. But I am not wanting to replace anything; I just want to find that URL in the source code using a /regex/ and copy it.
According to your comments, this is what you're looking for, I guess:
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/
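Since the question is Ruby/Nokogiri, a non-destructive alternative to gsub is String#[] with a regex and a capture group. A minimal sketch (page.html is a hypothetical saved copy of the page; the quoting around 'image' follows the line quoted in the question):

```ruby
require "nokogiri"

html = File.read("page.html")          # hypothetical saved page source
doc  = Nokogiri::HTML(html)

# Collect the text of every <script> tag, then pull out just the URL
# from the 'image':'...' entry. String#[] extracts without replacing.
script_text = doc.xpath("//script").map(&:text).join("\n")
image_url   = script_text[/'image'\s*:\s*'([^']+)'/, 1]

puts image_url
# => http://ut5.example.com/t/231/3_b_643435.jpg
```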

Why isn't this Yahoo Pipe outputting items?

I have a Yahoo! Pipe that attempts to transform an HTML page into RSS, but the resulting feed contains no items. For each entry I've parsed these elements:
link (permalink)
title (HTML title)
description (HTML entry)
guid (segment of the permalink)
Various tutorials led me to add these:
dc:creator ("Doug")
y:id.value (permalink)
y:published (w/ date attributes generated from text like "3 days ago")
If you edit the source and highlight the Pipe Output module, the debugger shows 5 entries with these elements/attributes intact.
What am I missing?
That is vexing! By tweaking it a bit to "emit results" in the "Loop" operator box, I managed to get a feed with 5 items, but each item only contained the guid for some reason.
Your feed is valid according to http://feedvalidator.org, though (not that hard, considering there are no elements).
I tried removing some of your components but my changes did not help.
By the way, it's crazy that they're killing Yahoo 360 blogs in favor of the feedless Yahoo Profile blogs. Oh, and I like Douglas Crockford too. :-)