Extracting the *relevant* image from a web page

I have a couple of twitter-powered news aggregation websites. I have been planning to add images from the articles that I find on twitter.
If I download the page and extract images using the <img> tag, I get a bunch of images, not all of them relevant to the article. For example, images of buttons, icons, ads etc. are captured. How do I extract the image accompanying the article? I know there is a solution -- the Facebook link sharer does this pretty well.
Mithun
Duplicate of: How to find and extract "main" image in website

Download all images from the page,
blacklist all images coming from an ad server,
then find some heuristic which will get you the correct image...
I think something like:
Biggest resolution += 5 pts
Biggest filesize += 10 pts
JPEG += 2 pts
Then take the image with the most points and throw the rest away.
That probably works for the majority of sites.
(It would require some fiddling with the heuristics, though; a sketch follows below.)
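A minimal sketch of that scoring idea. The image-record format and the point values are assumptions to tune per site:

# Minimal sketch of the heuristic above. Assumes each image has
# already been downloaded and described as a dict with url, width,
# height and filesize; the point values are starting guesses.
def best_image(images):
    max_area = max(i['width'] * i['height'] for i in images)
    max_size = max(i['filesize'] for i in images)

    def points(img):
        score = 0
        if img['width'] * img['height'] == max_area:
            score += 5    # biggest resolution
        if img['filesize'] == max_size:
            score += 10   # biggest filesize
        if img['url'].lower().endswith(('.jpg', '.jpeg')):
            score += 2    # JPEGs are usually photos, not icons
        return score

    return max(images, key=points)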

It's been a long time, but this may help next time.
You can use this API: https://urlmeta.org/
It's very simple to use, and the result is just what you need.
Example of using the API:
<?php
// Ask the urlmeta.org API to extract metadata for the article URL.
$url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";
$result = file_get_contents('https://api.urlmeta.org/?url=' . $url);
$array = json_decode($result, true);
// The article's lead image is reported under meta -> image.
print_r($array['meta']['image']);
?>
And that's the result you needed.

I came up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails.
Say the headline of the page I find is "this is a headline".
I use this as a query to the Google Image API and then extract the first thumbnail I find.
It actually works quite well for the majority of cases. Check it out for yourself: http://cricketfresh.in
Mithun
ps: I think this is a good answer. I will give credit to someone who comes up with a more elegant answer.
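A rough sketch of the approach described above, assuming the old Google Image Search AJAX API (v1.0, since deprecated); the endpoint and the tbUrl field are assumptions based on that API's JSON format and should be verified:

import json
import urllib
import urllib2

# Query Google Images with the headline and take the first thumbnail.
headline = "this is a headline"
url = ('https://ajax.googleapis.com/ajax/services/search/images?v=1.0&'
       + urllib.urlencode({'q': headline}))
data = json.load(urllib2.urlopen(url))
results = data['responseData']['results']
if results:
    print results[0]['tbUrl']   # thumbnail URL of the first hit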

I would guess that Facebook has a link extractor for the various sites it supports. Something like id="content" -> img (1st).
Guess I was wrong. It seems that Facebook uses the Open Graph Protocol to determine which image (og:image) and which metadata to use.
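For pages that provide it, reading og:image directly is straightforward. A minimal sketch with BeautifulSoup; the article URL is a placeholder:

import urllib2
from BeautifulSoup import BeautifulSoup

# Prefer the page's own Open Graph hint before falling back to
# <img>-tag heuristics. http://example.com/article is a placeholder.
page = urllib2.urlopen("http://example.com/article")
soup = BeautifulSoup(page)
og = (soup.find('meta', property='og:image') or
      soup.find('meta', attrs={'name': 'og:image'}))
if og and og.get('content'):
    print og['content']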

Related

Wikipedia API - get random page(s)

I'm trying to get a JSON result with a set of random pages from Wikipedia, including their titles, content and images.
I've played around with their API sandbox, and so far the best I've got is this:
https://en.wikipedia.org/w/api.php?action=query&list=random&format=json&rnnamespace=0&rnlimit=10
But this only includes the namespace, id, and title of ten random pages. I would like to get the content and images as well.
Does anyone know how?
Alternatively, I could make do with the title, content, and image URLs of a single random page.
The best I've got here is:
https://en.wikipedia.org/w/api.php?action=query&generator=random&format=json
You're close. generator=random is the right way to go. You can then use various prop values to get the info you want:
Page title is always included.
To get the text, use prop=revisions along with rvprop=content.
To get all images used on the page, use prop=images.
Note that this will often include images you're probably not interested in, like icons and flags. To fix that, you might try prop=pageimages instead, though it doesn't always seem to work. Or you could try using both.
So, the final query could look like this:
https://en.wikipedia.org/w/api.php?format=json&action=query&generator=random&grnnamespace=0&prop=revisions|images&rvprop=content&grnlimit=10
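A small sketch of consuming that query from Python; the response layout (a pages dict keyed by page id) is the standard MediaWiki JSON output:

import json
import urllib2

# Fetch ten random main-namespace pages with their wikitext and the
# names of the images they use.
url = ("https://en.wikipedia.org/w/api.php?format=json&action=query"
       "&generator=random&grnnamespace=0&prop=revisions|images"
       "&rvprop=content&grnlimit=10")
data = json.load(urllib2.urlopen(url))
for page in data['query']['pages'].values():
    print page['title']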
If you'd rather use their REST API:
curl -X GET "https://en.wikipedia.org/api/rest_v1/page/random/summary"
(See the REST API documentation for details.)

Multiple open graph tags, which one is prioritized?

I've seen various SO threads about FB open graph image tags such as this: Facebook multiple og:image tags - Which is Default?
These threads are 2.5 years old, so I'm wondering if the rules have been updated. Also, the accepted answer (that the highest-resolution image is the one displayed) seems imperfect. What if the images aren't meant to be the same? For example, how do you have one image for the homepage and a different one on AJAX-loaded pages?
As these rules are used by FB, Reddit, and many other high-traffic sites, this information is obviously very valuable. Thanks!
Since reddit is open-source, you can look and see what its behavior is.
The place you want to look is _find_thumbnail_image(). Right now, this is the code that pertains to Open Graph:
# Allow the content author to specify the thumbnail using the Open
# Graph protocol: http://ogp.me/
og_image = (soup.find('meta', property='og:image') or
            soup.find('meta', attrs={'name': 'og:image'}))
if og_image and og_image['content']:
    return og_image['content'], None
og_image = (soup.find('meta', property='og:image:url') or
            soup.find('meta', attrs={'name': 'og:image:url'}))
if og_image and og_image['content']:
    return og_image['content'], None
So, it'll use whatever Beautiful Soup's find() method returns, which should be the first matching tag.
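You can check that behavior directly. A minimal sketch: given two og:image tags, find() returns the one that appears first in the document:

from BeautifulSoup import BeautifulSoup

html = '''<head>
<meta property="og:image" content="http://example.com/first.png" />
<meta property="og:image" content="http://example.com/second.png" />
</head>'''
soup = BeautifulSoup(html)
# find() stops at the first match in document order.
print soup.find('meta', property='og:image')['content']
# -> http://example.com/first.png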

Showing picture from other site displays weird picture instead

I am making a Facebook app where you can browse your and your friends' likes on a webpage that displays lots of funny pictures. The problem is that when I link to these pictures, they appear as something completely different, like a placeholder or something. It displays correctly if it's cached (I think).
Take a look at this jsfiddle: http://jsfiddle.net/jVBSk/ . If you right-click the image, you get a different filename than the one in the source.
How can I avoid this, so the page displays the correct images?
It seems to have some kind of hot-linking protection on it. This one's not very well made, so it's quite easy to bypass.
<?php
// Fetch the remote image server-side; the image host sees the request
// coming from this server, with no browser Referer header attached.
$file = file_get_contents($_GET['image']);
header("Content-Type: image/jpeg");
// Re-encode the raw bytes and emit them as a JPEG.
$image = imagecreatefromstring($file);
imagejpeg($image);
imagedestroy($image);
?>
Then call the script like this: script.php?image=http%3A%2F%2Fgif.artige.no%2Fstore%2F10%2F10002.jpg
The image URL has to be encoded. This can be done using urlencode() in PHP, or here's an online tool to do it: http://meyerweb.com/eric/tools/dencoder/
So in HTML that'd be something like this: <img src="script.php?image=http%3A%2F%2Fgif.artige.no%2Fstore%2F10%2F10002.jpg" alt="[Image]" />
The website is checking the referrer to see whether it's their domain or not. If it's not, it returns this "do not steal this" image instead. (If anyone can translate the text on it, I'm sure that's what it's saying.)
See http://en.wikipedia.org/wiki/HTTP_referrer and http://en.wikipedia.org/wiki/Referrer_spoofing.
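That second link is the key: if you fetch the image yourself and present the image host's own domain as the referrer, the check passes. A minimal sketch; the Referer value is an assumption about what the host expects:

import urllib2

# Request the image with a forged Referer so the hot-link check
# believes the request came from the host's own pages.
req = urllib2.Request("http://gif.artige.no/store/10/10002.jpg")
req.add_header("Referer", "http://www.artige.no/")
data = urllib2.urlopen(req).read()
open("10002.jpg", "wb").write(data)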
What you are doing is called hot-linking and is frowned upon by many website owners.
The problem is that you are stealing their bandwidth, so as a solution they serve a different image instead of the requested one when the requesting page is not from their own domain.

Script to get movie posters images (from amazon ?) from a list of movies

I would like to create a poster showing the movies I've seen, in this fashion : http://i.stack.imgur.com/2z1js.jpg
I have a text file with the list of movies, I know how to create the poster from a bunch of images (imagemagick...), what I don't know is how to download the images.
How can I automate the task of finding the movie and downloading its poster ?
Searching for the movie and screen-scraping http://www.impawards.com would be one solution, though it's a bit process-intensive and a bit shady (the TOS usually forbids it). You would enter a query, parse the results, go to the poster page, and get the URL of the displayed image.
Other than that, check out the (unofficial) IMDB API, http://www.imdbapi.com/, and the SPARQL endpoint at http://www.linkedmdb.org/ . You could then try to parse the IMDB page as well.
Later edit: found a web service as well; it looks OK, though I don't know about its comprehensiveness/reliability or which services it uses to get the links: http://cpan.uwinnipeg.ca/htdocs/WebService-MoviePosterDB/WebService/MoviePosterDB.html
Hope this helps!
Construct URLs for each movie and then use wget to download the images.
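A sketch combining the two suggestions: read titles from the text file, ask the unofficial IMDB API for each, and download whatever it reports as the poster. The Poster field and the N/A sentinel are assumptions about that service's JSON:

import json
import urllib
import urllib2

# For each title in movies.txt, look the movie up and save its poster.
for line in open('movies.txt'):
    title = line.strip()
    if not title:
        continue
    url = 'http://www.imdbapi.com/?' + urllib.urlencode({'t': title})
    info = json.load(urllib2.urlopen(url))
    poster = info.get('Poster')
    if poster and poster != 'N/A':
        open(title + '.jpg', 'wb').write(urllib2.urlopen(poster).read())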

Parsing a website and getting the info I need

Hi, so I need to retrieve the URL for the first article on a term I search for on nytimes.com.
So if I search for Apple, this link would return the result:
http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse
And you just replace Apple with the term you are searching for.
If you click on that link, you will see that NYTimes asks if you mean Apple Inc.
I want to get the URL for this link and go to it.
Then you will just get a lot of information on Apple Inc.
If you scroll down, you will see the articles related to Apple.
So what I ultimately want is the URL of the first article on this page.
I really do not know how to go about this. Do I use Java, or what do I use? Any help would be greatly appreciated; I would put a bounty on this later, but I need the answer ASAP.
Thanks
EDIT: Can we do this in Java?
You can use Python with the standard urllib2 module to fetch the pages and the great HTML parser BeautifulSoup to obtain the information you need from the pages.
From the documentation of BeautifulSoup, here's sample code that fetches a web page and extracts some info from it:
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
This is a nice and detailed article on the topic.
You certainly can do it in Java. Look at the HttpURLConnection class. Basically, you give it a URL, call the connect function, and you get back an input stream with the contents of the page, i.e. HTML text. You can then process that and parse out whatever information you want.
You're facing two challenges in the project you are describing. The first, and probably the lesser, is figuring out the mechanics of how to connect to a web page and get hold of the text within your program.
The second and probably bigger challenge will be figuring out exactly how to extract the information you want from that text. I'm not clear on the details of your requirements, but you're going to have to sort through a ton of text to find what you're looking for. Without actually looking at the NY Times site at the moment, I'm sure it has all sorts of decorations like pretty pictures, the company logo, and headlines, plus menus, advertisements, and all sorts of other stuff. I sincerely doubt that the NY Times, or almost any other commercial website, returns a search page that includes nothing but a link to the article you are interested in. Somehow your program will have to figure out that the first link is to the "subscribe online" page, the second is to an advertisement, the third is to customer service, the fourth and fifth are additional advertisements, the sixth is to the home page, and so on, until you finally get to the one you're actually interested in. How will you identify the interesting link? There are probably headings or formatting that make it recognizable to a human being, but a human relies on a lot of intuition to screen out the clutter, and that intuition is difficult to reproduce in a program.
Good luck!
You can do this in C# using the HTML Agility Pack, or using LINQ to XML if the site is valid XHTML. EDIT: It isn't valid XHTML; I checked.
The following (tested) code will get the URL of the first search result:
var doc = new HtmlWeb().Load(@"http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse");
var url = HtmlEntity.DeEntitize(doc.DocumentNode.Descendants("ul")
    .First(ul => ul.Attributes["class"] != null
              && ul.Attributes["class"].Value == "results")
    .Descendants("a")
    .First()
    .Attributes["href"].Value);
Note that if their website changes, this code might stop working.