Easily extracting article text from an online publication

In recent versions of Safari, there is a "Reader" button that appears in the address bar on certain web pages. When you click this button, it will give you a text-only version of the article on the page without any ads or content that is not part of the article. I would like to create a web app that does something similar when the user enters the URL for an online article (a New York Times article, for instance).
I am wondering if anyone has any guesses as to whether this feature in Safari is implemented in:
A complex way, e.g. "grepping" through the article and following some algorithm to guess which tags to extract, etc.
A simple way, e.g. accessing some sort of RSS or Atom feed that provides only the article text. From what I can tell, most of these feeds seem to only provide short descriptions of articles and links to them, rather than the full text.
Any thoughts?

It's done in a complex way.
Read through this: How to enable iOS 5 Safari Reader on my website?
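To give a flavor of what "complex" means here: Safari's Reader was reportedly derived from Arc90's Readability script, which scores candidate containers by how much plain paragraph text they hold versus link text, then keeps the winner. Below is a minimal Python sketch of that kind of heuristic; it assumes the beautifulsoup4 package, and the weight is made up for illustration, not taken from Safari:

from bs4 import BeautifulSoup

def extract_article(html):
    # Score each candidate container by how much plain paragraph text it
    # holds versus link text; navigation and ad blocks are link-heavy.
    soup = BeautifulSoup(html, "html.parser")
    best_text, best_score = "", 0
    for node in soup.find_all(["article", "div", "section"]):
        paragraphs = node.find_all("p", recursive=False)
        text = "\n".join(p.get_text(" ", strip=True) for p in paragraphs)
        link_len = sum(len(a.get_text(strip=True))
                       for p in paragraphs for a in p.find_all("a"))
        score = len(text) - 3 * link_len  # weight is arbitrary, for illustration
        if score > best_score:
            best_text, best_score = text, score
    return best_text

Real implementations refine this with tag names, class/id hints ("comment", "sidebar"), and punctuation density, but the core idea is the same text-versus-links scoring.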


Style Google search results like Stack Overflow using microformats

For a while now I've been researching microformats, trying to get my site's information styled differently on Google's results page.
I found some details about microformats at these links:
http://microformats.org/wiki/hcard-authoring#The_Importance_of_Names
http://blog.teamtreehouse.com/add-microformats-magic-to-your-site
http://microformats.org/get-started
Those guides show examples of the rich results this markup can produce.
Now I'm trying to find out whether I can use microformats to force Google to show my site's information on its results page the way it does for Stack Overflow and other popular sites.
Is that even possible?
Thanks in advance...
You can't force Google to show your website and its subpages like the Stack Overflow example you posted. Your search term was stackoverflow, so the information displayed on the results page was far and away the most relevant; that's why it displays like that.
If someone searched for your website by name, you might get a result like that. You'll need to submit an XML sitemap to Google Webmaster Tools, give it time to index, and hope your website name is unique enough.
I guess the main thing is that your website is first on Google's results page for a given search term and the sitemap shows Google what your other pages are.
With respect to microdata: it's really good for giving extra information to search engines. The CSS-Tricks result is a perfect example. You'd need a Google+ profile and, using the microdata, specify that profile as the author.
Again, Webmaster Tools has some great microdata validation tools. You can even load up your page's source code, highlight the text you want to tag, and it'll show you exactly which tags to add and where, so that it works. Link below:
https://www.google.com/webmasters/markup-helper/
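For illustration, this is the general shape of the schema.org microdata that the markup helper generates; the element contents here are entirely hypothetical:

<!-- Hypothetical example: an article marked up with schema.org microdata -->
<article itemscope itemtype="http://schema.org/Article">
  <h1 itemprop="headline">My article title</h1>
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    By <span itemprop="name">Jane Doe</span>
  </span>
  <div itemprop="articleBody">The visible article text goes here.</div>
</article>

The itemscope/itemtype attributes declare what kind of thing the block describes, and each itemprop labels one of its properties, which is what lets search engines pull out the author, headline, and body reliably.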

Hiding login fields from Google search

Background:
My website allows registered users to upload and share videos.
On the main page there is a "Username" and "Password" field so that registered users can log in if they wish.
Also, on each video page, there are text links to choose bitrate and flash player.
(See http://videoflier.com/ and http://videoflier.com/movies/1360488842878341996730 for examples of both, or search Google for "site:videoflier.com" to see what I'm talking about.)
My Problem:
When Google or any other search engine indexes the site, it of course sees the login text and the links for setting the video bitrate (which look like "190 234 [698] 1247 kbps | osflv [jwplayer] flowplayer").
It looks like this:
Cardboard Airplane
videoflier.com/movies/1352509017371554759177
Cardboard Airplane By jesseg 190 234 [698] kbps | osflv jwplayer [flowplayer] This is a model airplane built from cardboard and tape. It was outfitted with remote ...
(Notice how the bitrates and player selections look ugly and waste space.)
My attempts so far to solve this in a clean, tidy manner
(and why I don't like any of them):
Using pictures instead of text: I want my site to be fast and efficient, so I don't want to use pictures for text if I don't have to.
Having a separate page for settings: I want the site to be fast and simple to use.
robots.txt: If the search engines can't read the pages, then they won't know how to find them!
Using CGI to hide stuff from search bots: This is about the best idea I've had, but I don't really want to do a dirty hack, and there seems to be no universal way for my CGI to identify a robot. Google itself uses several different user-agent strings, none of which actually contains the word "robot." Most contain "Googlebot," but not all. And who knows what other search engines use. (See the verification sketch after this question.)
Of course I understand why they use agent strings that look like regular web browsers (Google makes this very claim itself): dishonest folks try to serve search engines completely different content for ad fraud.
But I don't really want to run a continually changing blacklist to try to identify all possible search engines out there. That sounds too much like fighting email spam. And besides, I'm just trying to hide the login and bitrate lists so the search results are easier to read.
JavaScript: JavaScript brings its own problems (browser compatibility issues, accessibility, etc.). I use it when it is the best tool for the job, but I really love pure, clean HTML when I can have it.
In an ideal world: I wish there were an HTML tag like <NOBOT>username: password:</NOBOT>, but as far as I know, nothing like it exists. Ideally, this fictitious tag would also keep the search engines from returning results based on the hidden items. Somebody who puts the word "password" into Google is almost certainly not trying to find my site, and yet Google may return it simply because it has a login field on it.
schema.org? I initially had hopes for schema.org because it allows one to specify the type of data within scopes in the HTML. Unfortunately, as far as I could tell, all of its categories describe things that ARE on the page; it didn't seem to have an "ignore" or "administrative object" option.
Maybe the more roundabout answer is to use schema.org extensively for everything else, so the search engines already know where to get their author, description, and title text from; then maybe they will skip the administrative control links.
Thank you very much,
Jesse Gordon
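One note on the CGI idea above: Google does document a way to verify Googlebot without maintaining a user-agent blacklist: do a reverse DNS lookup of the visiting IP, check that the hostname ends in googlebot.com or google.com, then do a forward lookup to confirm it resolves back to the same IP. A minimal Python sketch of that procedure (the function name is made up; you'd feed it REMOTE_ADDR from your CGI environment):

import socket

def is_verified_googlebot(ip):
    # Reverse DNS: genuine Googlebot IPs resolve to *.googlebot.com
    # or *.google.com hostnames.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP,
    # because anyone can publish a fake reverse (PTR) record.
    try:
        return socket.gethostbyname(host) == ip
    except OSError:
        return False

This only covers Google; other major engines publish similar verification rules for their own crawlers, so it reduces, but does not eliminate, the blacklist problem described above.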
I would think it makes more sense to add a meta description to your pages.
For example:
Cardboard Airplane videoflier.com/movies/1352509017371554759177 Cardboard Airplane By jesseg 190 234 [698] kbps | osflv jwplayer [flowplayer] This is a model airplane built from cardboard and tape. It was outfitted with remote ...
I would just add this to your <head> section.
<meta name="description" content="This is a model airplane built from cardboard and tape. It was outfitted with remote control and servos so it could be flown as an RC glider. It did fly but had a tendency to stall." />

Need to stack subpages on home page of Google Sites — how?

This is a rephrasing of my original question https://stackoverflow.com/questions/14516983/google-sites-trying-to-script-announcements-page-on-steroids:
I've been looking into ways to make subpages of a parent page appear in a grid like "articles" on the home page of my Google Site — like on a Joomla home page and almost like a standard "Announcements" template, except:
The articles should appear in a configurable order, not chronologically (or alphabetically).
The first two articles should be displayed full-width and the ones beneath in two columns.
All articles will contain one or more images, and at least the first one should be displayed.
The timestamp and author of each subpage/article shouldn't be displayed.
At the moment I don't care if everything except the ordering is hardcoded, but ideally there should be a place to input prefs like the number of articles displayed, image size, snippet length, css styling etc.
My progress so far:
I tried using an iframe with an outside-hosted JavaScript (using google.feeds.Feed) that pulls the RSS feed from the "Announcements" template, but I can't configure the order of the articles. One possibility would be to put a number at the beginning of every subpage title and parse it, but that will get messy over time, and the number would also be visible on the standalone article page. Or could the number be hidden with JavaScript? (A sketch of this parsing idea appears after this question.)
I tried making a spreadsheet with a row for each article, with columns "OrderId", "Title", "Content", and "Image", and processing and formatting the data with a Google Apps Script (using createHTML and createImage), but a) there doesn't seem to be a way to get a spreadsheet image to show up inside the web app, and b) these articles are not "real" pages that can be linked to easily from the menus.
This feature would be super-useful for lots of sites, and to me it just seems odd that it isn't a standard gadget (edit: or template). Ideas, anyone?
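For what it's worth, here is a minimal sketch of the title-prefix idea from the question: parse the Announcements RSS feed, pull a leading number off each title, sort by it, and strip it so it never shows on the page. It assumes a standard RSS 2.0 feed, and the "01 Title" naming convention is hypothetical:

import re
import xml.etree.ElementTree as ET

def ordered_articles(rss_xml):
    # Collect (order, title, link, description) for feed items whose
    # titles start with a numeric prefix like "02 Second article".
    items = []
    for item in ET.fromstring(rss_xml).iter("item"):
        title = item.findtext("title", "")
        m = re.match(r"(\d+)\s+(.*)", title)
        if m:
            items.append((int(m.group(1)), m.group(2),
                          item.findtext("link", ""),
                          item.findtext("description", "")))
    # Sort by the numeric prefix, then drop it so it is never displayed.
    return [entry[1:] for entry in sorted(items)]

Run from a gadget or Apps Script equivalent, this gives you the configurable ordering; the standalone article pages would still show the prefix unless it's hidden there by other means.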
I don't know if this is helpful, but I wanted something similar and used the RSS XML announcements feed within a Google Gadget embedded into my Sites page.
Example gadget / site:
http://hosting.gmodules.com/ig/gadgets/file/105840169337292240573/CBC_news_v3_1.xml
http://www.cambridgebridgeclub.org
It is badly written and messy, and I'm sure someone could do better than me, but it seems to work fairly reliably. The XML seems to have all the necessary data for chopping up articles, and I seem to remember it has image URLs as well, so you can play with those too (although that's not implemented in my gadget).
Apologies if I am missing the point. I agree with your feature request: it would be great not to have to get so low-level to implement stuff like this in Sites.

Looking for an existing iGoogle/OpenSocial gadget comparable to iGoogle's default native RSS gadget

iGoogle's standard RSS feed gadget is not OpenSocial, so it can't be embedded in other websites. I am hoping there is an alternative solution already available somewhere.
In iGoogle's gadget list there are other RSS gadgets, but none of them seem as nice as the default one by Google, which is native to iGoogle.
The main difference between the standard gadget and most others is the ability to expand a headline and see more of the article by clicking the arrow next to the story. It also has a clean layout.
It must be a widget/gadget that is OpenSocial compatible.
(I am aware that iGoogle will be closing, that is not relevant to my needs.)
MyYahoo:
Appears to be the best overall alternative to iGoogle in terms of setup speed, "just working", and widgets for average or novice users.
Appears not to import OPML, but does support RSS feeds.
By hand, you can set up the same RSS feeds as in iGoogle or Netvibes.
Supports up to 9 tabs; every tab has an advertisement.
Only supports Yahoo search.
Appears to have no Wikipedia search widget.
Good support for displaying stock portfolio quotes in widgets provided by Yahoo.
Several good Yahoo widgets provided, e.g. TV guide, local movies.
Can have a different theme for each tab, or all the same.
Can use an uploaded image for wallpaper.
Can make backgrounds transparent.
Can set background and text colors for all parts of a theme, separately.
Has a spectrum chooser for colors; cannot type in hex codes.
Very user-friendly interface for configuring themes.
Supports user-created applications (widgets) for MyYahoo tabs; there is a getting-started guide for this.
Try Skim.Me as an alternative to embedding iGoogle's RSS gadget

Crawling data or using an API

How do these sites gather all their data: questionhub, bigresource, thedevsea, developerbay?
Is it legal to show data in a frame, as bigresource does?
#amazed
EDITED: fixed some spelling issues 2011-03-10
How do these sites gather all the data (questionhub, bigresource, ...)?
Here's a very general sketch of what is probably happening in the background at a website like questionhub.com:
1. Spider program (google "spider program" to learn more)
a. Configured to start reading web pages at stackoverflow.com (for example).
b. Run the program so it goes to the home page of stackoverflow.com and starts visiting all the links it finds on those pages.
c. Returns the HTML data from all of those pages.
2. Search index program
Reads the HTML data returned by the spider and creates a search index, storing the words it found AND the URLs those words were found at (see the sketch after this list).
3. User interface web page
Provides a feature-rich user interface so you can search the sites that have been spidered.
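To make the first two pieces concrete, here is a minimal sketch in Python. It assumes the requests and beautifulsoup4 packages; the seed URL and page limit are placeholders:

import re
from collections import defaultdict
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=10):
    # Breadth-first crawl from the seed URL; returns {url: page text}.
    # A real spider would also honor robots.txt (discussed below).
    seen, queue, pages = set(), [seed], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))  # resolve relative links
    return pages

def build_index(pages):
    # Inverted index: word -> set of URLs where that word was found.
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

pages = crawl("https://stackoverflow.com/")  # example seed
index = build_index(pages)

The user-interface piece is then just a search box that looks terms up in the inverted index and renders the matching URLs.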
Is it legal to show data in a frame, as bigresource does?
To be technical, "it all depends" ;-)
Normally, websites want to be visible in Google, so why not in other search engines too? Just as Google displays part of the text that was found when a site was spidered, questionhub.com (and others) have chosen to show more of the text found on the original page, possibly keeping the formatting that was in the original HTML, or changing the formatting to fit their own standard visual styling.
A remote site can request that spiders NOT go through some or all of its web pages by adding rules to a well-known file called robots.txt (a small example follows). Spiders do not have to honor robots.txt, but a vigilant website will track the IP addresses of spiders that do not honor its robots.txt file and then block those IP addresses from looking at anything on the website. You can find plenty of information about robots.txt here on Stack Overflow or by running a query on Google.
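For reference, a minimal robots.txt looks like this; the disallowed path is hypothetical:

# Ask well-behaved spiders to skip the admin area but crawl everything else.
User-agent: *
Disallow: /admin/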
There are several industries (besides Google) built around what you are asking. There are tags on Stack Overflow for search-engine and search; read some of those questions and answers. Lucene/Solr are open-source search engine components. There is a companion open-source spider, but the name eludes me right now. Good luck.
I hope this helps.
P.S. As you appear to be a new user: if you get an answer that helps you, please remember to mark it as accepted, or give it a + (or -) as a useful answer. This goes for your other posts here too ;-)