Wikipedia API - get random page(s) - json

I'm trying to get a JSON result with a set of random pages from Wikipedia, including their titles, content and images.
I've played around with their API sandbox, and so far the best I've got is this:
https://en.wikipedia.org/w/api.php?action=query&list=random&format=json&rnnamespace=0&rnlimit=10
But this only includes the namespace, id, and title of ten random pages. I would like to get the content and images as well.
Does anyone know how?
Alternatively, I could do with the title, content and image URLs of a single random page.
Best I've got here is:
https://en.wikipedia.org/w/api.php?action=query&generator=random&format=json

You're close. generator=random is the right way to go. You can then use various prop values to get the info you want:
Page title is always included.
To get the text, use prop=revisions along with rvprop=content.
To get all images used on the page, use prop=images.
Note that this will often include images you're probably not interested in, like icons and flags. To avoid that, you might try prop=pageimages instead, though it doesn't always seem to work. Or you could use both.
So, the final query could look like this:
https://en.wikipedia.org/w/api.php?format=json&action=query&generator=random&grnnamespace=0&prop=revisions|images&rvprop=content&grnlimit=10
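If it helps, here is a rough TypeScript sketch of calling that query and reading the response (Node 18+ for global fetch; the response shape, in particular revisions[0]['*'] for the wikitext, is my assumption about the default JSON format, so double-check it against what you actually get back):
const params = new URLSearchParams({
  format: "json",
  action: "query",
  generator: "random",
  grnnamespace: "0",
  grnlimit: "10",
  prop: "revisions|images",
  rvprop: "content",
});

async function randomPages() {
  const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
  const data = await res.json();
  // With the default JSON format, pages come back as an object keyed by page id.
  for (const page of Object.values<any>(data.query.pages)) {
    console.log(page.title);                                    // page title
    console.log(page.revisions?.[0]["*"]);                      // wikitext content
    console.log((page.images ?? []).map((i: any) => i.title));  // image file names
  }
}

randomPages();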

If you'd rather use their REST API:
curl -X GET "https://en.wikipedia.org/api/rest_v1/page/random/summary"
Documentation
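The summary response includes the title, a plain-text extract, and a thumbnail, so a small sketch like this covers the single-random-page case (TypeScript; the field names title, extract and thumbnail.source are what the documented summary schema suggests, but treat them as assumptions):
async function randomSummary() {
  const res = await fetch("https://en.wikipedia.org/api/rest_v1/page/random/summary");
  const page = await res.json();
  console.log(page.title);              // page title
  console.log(page.extract);            // plain-text summary of the article
  console.log(page.thumbnail?.source);  // thumbnail URL, if the page has one
}

randomSummary();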

OneNote page hierarchy

Let's say I have a notebook named 'MyNotebook'. This notebook has a section group 'Group1', and 'Group1' has another section group 'Group2'. Inside 'Group2' I have a section 'Section1', which has a page 'Page1'.
If we look at this like a directory structure, the path to the page would be MyNotebook/Group1/Group2/Section1/Page1.
When I try to get the page using the get page API, I only get its immediate parent, i.e. Section1. So if I want to get this complete hierarchy, how can I get that?
What API specifically are you using to get pages?
If you are using GET https://www.onenote.com/api/v1.0/me/notes/pages, this will give you all the pages, though that API has limitations (for example, it is paginated, so it will only give you the most recent 20 pages, and it won't work well if the user has a large number of sections).
https://blogs.msdn.microsoft.com/onenotedev/2017/07/21/a-few-performance-tips-for-using-the-onenote-api/
See the section "When getting all pages for a user, do so for each section separately"
I recommend you make a call like:
GET https://www.onenote.com/api/v1.0/me/notes/Notebooks?$expand=sections,sectionGroups($expand=sections,sectionGroups($levels=max;$expand=sections))
To obtain all the sections, and then make a call like:
GET https://www.onenote.com/api/v1.0/me/notes/sections/{id}/pages
To obtain each section's pages.
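For illustration, here is a rough TypeScript sketch of that two-step approach, reconstructing the full path for each page. The <access-token> placeholder and the property names (value, name, sections, sectionGroups, title) are my assumptions; adjust them to whatever the actual responses contain.
const token = "<access-token>";  // assumption: you already have an OAuth token for the OneNote API
const base = "https://www.onenote.com/api/v1.0/me/notes";
const headers = { Authorization: `Bearer ${token}` };

// Walk a notebook (or section group) and collect every section together with its path.
function collectSections(node: any, path: string[]): { id: string; path: string[] }[] {
  const own = (node.sections ?? []).map((s: any) => ({ id: s.id, path: [...path, s.name] }));
  const nested = (node.sectionGroups ?? []).flatMap((g: any) => collectSections(g, [...path, g.name]));
  return [...own, ...nested];
}

async function allPages() {
  const res = await fetch(
    `${base}/Notebooks?$expand=sections,sectionGroups($expand=sections,sectionGroups($levels=max;$expand=sections))`,
    { headers }
  );
  const notebooks = (await res.json()).value;

  for (const nb of notebooks) {
    for (const sec of collectSections(nb, [nb.name])) {
      const pagesRes = await fetch(`${base}/sections/${sec.id}/pages`, { headers });
      const pages = (await pagesRes.json()).value;
      for (const p of pages) {
        console.log([...sec.path, p.title].join("/")); // e.g. MyNotebook/Group1/Group2/Section1/Page1
      }
    }
  }
}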
In addition to what Jorge said, if you specifically want the upwards hierarchy (and not downwards), you could do:
GET https://www.onenote.com/api/v1.0/me/notes/pages?$expand=parentSection($expand=parentSectionGroup($expand=parentSectionGroup($expand=parentNotebook)))
But as Jorge said, be careful when using the GET pages API, since it has some limitations.

How to expand a JSON response for a specified website, based on the URL

This might be a bit of a convoluted question, but please bear with me:
If I were on a site and wanted to read its comments through the JSON, as with this particular site, how would I expand the response so that I can see more than 10 comments? Currently, the ending of the URL looks like /?content_id=60902841-c238-364c-92f0-68e8b4dce996&_device=full&count=10&sortBy=highestRated&isNext=true&offset=10&pageNumber=1&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1.
I tried changing pageNumber to a higher number and got the same results. I also tried changing &count=10 to &count=50, which doesn't work either.
Thanks!
Each website is a case-by-case situation. Some websites will let you expand the result set at the end of their URL, as in:
https://somesite.com/search&page=1&resultcount=100
where you would change the resultcount parameter to a higher value. Some sites cap it at an arbitrary value, and some don't expose such a parameter at all.
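If you end up experimenting with this programmatically, a small sketch like the following makes it easy to try different values (TypeScript; count and offset are just the parameter names that appear in the URL in the question, and whether the server honors them is entirely up to the site):
async function fetchComments(baseUrl: string, count: number, offset: number) {
  const url = new URL(baseUrl);
  url.searchParams.set("count", String(count));   // from the question's URL
  url.searchParams.set("offset", String(offset)); // ditto
  const res = await fetch(url.toString());
  return res.json();
}

// e.g. try 20 comments at a time instead of 10:
// fetchComments(originalUrl, 20, 20).then(console.log);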

Pulling out some text from a giant HTML file using Nokogiri/xpath

I am scraping a website and am trying to pull out certain elements from the HTML. On the sites I am scraping, there are script tags with a bunch of info in them; however, there is one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
with some stuff above and below it. This is different for each page source, except obviously for the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line and cutting out just the URL? I feel I would need to use regular expressions, as the URLs are dynamic.
The gsub method does something similar to what I want, with its ability to use a /regex/. But I don't want to replace anything; I just want to find that URL in the source code using a /regex/ and copy it.
According to your comments, this is what you're looking for, I guess:
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/
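Applied to the page source, it might look roughly like this (TypeScript; I've tightened the pattern with a capture group so it stops at the closing quote instead of running to the end of the line; the same idea works with Ruby's String#match):
// Sketch: pull just the image URL out of the page source. The capture group
// stops at the closing quote; 'image' is the key shown in the question.
const source = "... 'image':'http://ut5.example.com/t/231/3_b_643435.jpg', ...";

const match = source.match(/'image':'(https?:\/\/[^']+)'/);
const imageUrl = match ? match[1] : null;
console.log(imageUrl); // http://ut5.example.com/t/231/3_b_643435.jpg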

URL Masking in .Net / HTML

I have a website in which I have many categories, many sub-categories within each one, and many products within each of those. Since the URLs are very user-unfriendly (they contain a GUID!!!), I would like to use a method which I think is called URL masking. For example, instead of going to catalogue.aspx?ItemID=12343435323434243534, they would go to notepads.htm, which would somehow display the same thing that catalogue.aspx?ItemID=12343435323434243534 would display.
I know I could do this by creating a file for each category / sub-category (individual products cannot be accessed individually as it is a wholesale site - customers cannot purchase directly from the site). This would be a lot of work as the server would have to update each relevant file whenever a category / sub-category / product visibility changes, or a description changes, a name changes... you get the idea...
I have tried using server-side includes but that doesn't like it when a .aspx file is specified in an html file.
I have also tried using an iframe set to 100% width / height and absolutely positioned left 0 and top 0. This works quite well, but I know there are reasons you should not use this method such as some search engines not coping with it well. I also notice that the title of the "parent" page (notepads.htm) is not the title set in the iframe (logically this is correct - but another issue I need to solve if I go ahead and use this method).
Can anyone suggest another way I could do this, or tell me whether I am going along the right lines by using iframes? Thanks.
Regards,
Richard
PS If this is the wrong name for what I am trying to do then please let me know what it actually is so I can rename / retag it.
Look into URL rewriting. You can create a regular expression and map it to your true URL. For example,
http://mysite.com?product=banana
could map to
http://mysite.com?guid=lakjdsflkajkfj3lj3l4923892&asfd=9234983920894893
I believe you mean URL Rewriting.
IIS 7+ has a rewrite module built in that you can use for this kind of thing.
URL rewriters solve the problem you are describing (when someone requests page A, display page B) in a general way.
But yours is not a general requirement. You seem to have a finite uuid-to-shortname mapping requirement. This is the kind of thing you could or should set up in your app, yourself, rather than inserting a new piece of machinery into your system.
Within a default .aspx page, you'd simply look up the shortname from the URL in a persistent table stored somewhere, and then call Server.Transfer() to the uuid-named page associated with that shortname.
It should be easy to prototype this.

Extracting *relevant* image from a web-page

I have a couple of twitter-powered news aggregation websites. I have been planning to add images from the articles that I find on twitter.
If I download the page and extract images using the <img> tag, I get a bunch of images, not all of them relevant to the article. For example, button images, icons, ads etc. are captured. How do I extract the image accompanying the article? I know there is a solution; the Facebook link sharer does this pretty well.
Mithun
Duplicate of: How to find and extract "main" image in website
Download all images from the page, blacklist all images coming from an ad server, then find some heuristic which will get you the correct image.
I think something like:
Biggest resolution += 5 pts
Biggest filesize += 10 pts
JPEG += 2 pts
Then take the image with the most points and throw the rest away. That probably works for the majority of sites, though it would require some fiddling with the heuristics; a rough sketch of this scoring is below.
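For what it's worth, here is the scoring idea in TypeScript. The weights are the ones suggested above, and the ImageInfo shape is hypothetical, something you'd fill in after downloading each image:
// Sketch of the point-based heuristic described above.
interface ImageInfo {
  url: string;
  width: number;
  height: number;
  byteSize: number;
  fromAdServer: boolean;
}

function pickMainImage(images: ImageInfo[]): ImageInfo | undefined {
  const candidates = images.filter((img) => !img.fromAdServer); // blacklist ad servers
  if (candidates.length === 0) return undefined;

  const biggestArea = Math.max(...candidates.map((i) => i.width * i.height));
  const biggestSize = Math.max(...candidates.map((i) => i.byteSize));

  const score = (img: ImageInfo) =>
    (img.width * img.height === biggestArea ? 5 : 0) +  // biggest resolution += 5 pts
    (img.byteSize === biggestSize ? 10 : 0) +           // biggest filesize += 10 pts
    (/\.jpe?g($|\?)/i.test(img.url) ? 2 : 0);           // JPEG += 2 pts

  // take the image with the most points and throw the rest away
  return [...candidates].sort((a, b) => score(b) - score(a))[0];
}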
It's been a long time. But this may help next time.
You can use this API: https://urlmeta.org/
It's very simple to use and the result is just what you need.
Example of using the API:
<?php
// Article whose main image we want to extract
$url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";
// Call the urlmeta API and decode the JSON response into an associative array
$result = file_get_contents('https://api.urlmeta.org/?url='.$url);
$array = json_decode($result,1);
// The image the service picked for the page
print_r($array['meta']['image']);
?>
And that's the result you needed.
I kind of came up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails.
Say the headline of the page I find is "this is a headline".
I use this as a query to the Google Image API and then extract the first thumbnail I find.
It actually works quite well for the majority of cases. Check it out for yourself at http://cricketfresh.in
Mithun
PS: I think this is a good answer. I will give credit to someone who comes up with a more elegant answer.
I would guess that Facebook has a link extractor for the various sites it supports. Something like id="content" -> img (1st).
I guess I was wrong. It seems that Facebook uses the Open Graph protocol to define which image (og:image) and which metadata to use.
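If a page provides Open Graph tags, grabbing og:image is usually enough. A minimal sketch (TypeScript, using a naive regex rather than a proper HTML parser, and assuming the property attribute comes before content):
// Sketch: fetch a page and pull the og:image meta tag with a naive regex.
// A real implementation should use an HTML parser; this only illustrates
// the Open Graph idea mentioned above.
async function ogImage(pageUrl: string): Promise<string | null> {
  const html = await (await fetch(pageUrl)).text();
  const match = html.match(/<meta[^>]+property=["']og:image["'][^>]+content=["']([^"']+)["']/i);
  return match ? match[1] : null;
}

// e.g. ogImage("https://example.com/article").then(console.log);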