Google Cloud Vision DOCUMENT_TEXT_DETECTION - always one page?

I have been trying a variety of images (mainly via https://cloud.google.com/vision/docs/drag-and-drop, but also via the API), from pictures of several handwritten pages laid out on a table, to an image of a rendered multi-page PDF, to pictures of books, and it seems like it always returns a single "page" (the response format is TextAnnotation > Page > Block > Paragraph > Word > Symbol). The documentation is pretty thin on what a "page" actually means; it mostly just states the obvious. Could someone with experience clear this up, please?
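For concreteness, here is a minimal sketch of how I walk that hierarchy, assuming the google-cloud-vision Python client (2.x or later) and a placeholder image path; the single "page" is what keeps showing up at the top level:

# Minimal sketch: walk the TextAnnotation hierarchy returned by
# DOCUMENT_TEXT_DETECTION. Assumes the google-cloud-vision Python client
# (2.x+) and valid credentials; "scan.jpg" is a placeholder path.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("scan.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.document_text_detection(image=image)
annotation = response.full_text_annotation

print(f"pages returned: {len(annotation.pages)}")
for page in annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            words = [
                "".join(symbol.text for symbol in word.symbols)
                for word in paragraph.words
            ]
            print(" ".join(words))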

Related

Google Website Translator completely changes site spacing and structure

I embedded Google Website Translator into the website I'm working on when Google Analytics showed me that the majority of website traffic was coming from non-English-speaking locations.
The translation works, and I'm happy that the site will be available to more people. The problem is that, when translated, the structure and formatting of my pages are thrown entirely out the window. For the most part, font colors and sizes are maintained, but tables change width and most line breaks are ignored... this leaves a jumbled mess with very little structure. It can still be read; it just gets even uglier than it already was.
To see for yourself, visit the website at SVFCLV.org and translate into any language you'd like. What's the easiest way to preserve my page structure even when Google translates the page for visitors?

How to hide content from File2HD?

There is a website called file2hd.com which can download any type of content from your website, including audio, movies, links, applications, objects and style sheets. Of course this doesn't work for high-profile websites such as Google, but is there a method I can use to cloak content on my website and prevent this?
E.g. using HTML code, or an .htaccess method?
Answers are appreciated. :)
If you hide something from the software, you also hide it from regular users, unless you have a password-protected part of your website. But even then, those users with passwords will be able to fetch all loaded content - HTML is transparent. And since you didn't say what kind of content you are trying to hide, it's hard to give you a more accurate answer.
One thing you can do, though it works just for certain file types, is to serve just small portions of a file. For example, you have a video on your page and you're fetching 5-second bits of the video from the server every 5 seconds. That way, in order for someone to download the whole thing, they'd have to get all the bits (by watching the whole thing) and then find a way to join the parts... and it's usually just not worth it. Think of Google Maps... Google uses this or a similar technique on a few other products as well.
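To illustrate, a minimal sketch of that idea, assuming a Python/Flask backend and a video that has already been split into numbered segment files (all names and paths here are placeholders, not a production setup):

# Rough sketch: serve a pre-split video as small numbered segments so the
# full file is never exposed at a single URL. Assumes Flask and a
# "segments/" directory containing part_000.mp4, part_001.mp4, ...
import os
from flask import Flask, abort, send_from_directory

app = Flask(__name__)
SEGMENT_DIR = os.path.abspath("segments")

@app.route("/video/<int:index>")
def video_segment(index):
    filename = f"part_{index:03d}.mp4"
    if not os.path.exists(os.path.join(SEGMENT_DIR, filename)):
        abort(404)  # no such segment
    return send_from_directory(SEGMENT_DIR, filename)

if __name__ == "__main__":
    app.run()

The player on the page would then request /video/0, /video/1, ... every few seconds instead of one downloadable file.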

Need to stack subpages on home page of Google Sites — how?

This is a rephrasing of my original question https://stackoverflow.com/questions/14516983/google-sites-trying-to-script-announcements-page-on-steroids:
I've been looking into ways to make subpages of a parent page appear in a grid like "articles" on the home page of my Google Site — like on a Joomla home page and almost like a standard "Announcements" template, except:
The articles should appear in a configurable order, not chronologically (or alphabetically).
The first two articles should be displayed full-width and the ones beneath in two columns.
All articles will contain one or more images, and at least the first one should be displayed.
The timestamp and author of each subpage/article shouldn't be displayed.
At the moment I don't care if everything except the ordering is hardcoded, but ideally there should be a place to input prefs like the number of articles displayed, image size, snippet length, CSS styling, etc.
My progress so far:
I tried using an iframe with an outside-hosted JavaScript (using google.feeds.Feed) that pulls the RSS feed from the "Announcements" template, but I can't configure the order of the articles. One possibility would be to have a number at the beginning of every subpage title and parse it (a rough sketch of this idea appears below), but that will get messy over time, and the number would also be visible on the standalone article page. Or could the number be hidden with JavaScript?
I tried making a spreadsheet with a row for each article, with columns "OrderId", "Title", "Content" and "Image", and processing and formatting the data with a Google Apps Script (using createHTML and createImage), but a) there doesn't seem to be a way to get a spreadsheet image to show up inside the web app, and b) these articles are not "real" pages that can be linked to easily from the menus.
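A rough sketch of the title-prefix ordering idea, outside of Sites, using only the Python standard library (the feed URL is a placeholder, and I'm assuming the Announcements feed is plain RSS with titles like "03 - My article"):

# Rough sketch of the "number prefix in the title" ordering idea.
# Assumes a plain RSS feed whose item titles start with a number,
# e.g. "03 - My article"; standard library only.
import re
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/announcements.xml"  # placeholder URL

with urllib.request.urlopen(FEED_URL) as resp:
    tree = ET.parse(resp)

articles = []
for item in tree.iter("item"):
    title = item.findtext("title", default="")
    match = re.match(r"\s*(\d+)\s*-?\s*(.*)", title)
    if not match:
        continue
    order = int(match.group(1))
    articles.append((order, match.group(2), item.findtext("link", default="")))

# Sort by the numeric prefix, then display only the cleaned-up title.
for order, clean_title, link in sorted(articles):
    print(clean_title, link)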
This feature would be super-useful for lots of sites, and to me it just seems odd that it isn't a standard gadget (edit: or template). Ideas, anyone?
I don't know if this is helpful, but I wanted something similar and used the RSS XML announcements feed within a Google Gadget embedded into my Sites page.
Example gadget / site:
http://hosting.gmodules.com/ig/gadgets/file/105840169337292240573/CBC_news_v3_1.xml
http://www.cambridgebridgeclub.org
It is badly written and messy, and I'm sure someone could do better than me, but it seems to work fairly reliably. The XML seems to have all the necessary data to chop up articles, and I seem to remember it has image URLs as well, so you can play with those (although that's not implemented in my gadget).
Apologies if I am missing the point. I agree with your feature request - it would be great not to have to get so low-level to implement stuff like this in Sites.

Cleaning up HTML from textarea

I have a page with two textareas, which registered users can fill with HTML. The first one uses TinyMCE (so the HTML is cleaned up), but the other one does not, since I expect the code to be inserted as embed codes from other sites (mostly sites that provide maps, e.g. Google Maps, MapMyRace.com, etc.). The problem is that those other sites may provide different tags, not just <embed> or <iframe>, so I can't strip tags, because then I might strip tags that I didn't know other sites provided. I will save the HTML in these two textareas into my database, to be retrieved and displayed as parts of some other pages.
Do you have any suggestions to make this setup more secure? Or should I disallow free input of HTML in the second textarea altogether? (Or... I let the users tick a checkbox saying "I accept full responsibility for the behavior of the code I am inserting"... LOL)
Your opinion is highly appreciated :)
Thanks
The short answer is: free HTML is insecure and must be avoided. Nothing stops your users from creating an iframe that redirects visitors to some harmful page, puts ads on your page, or defaces your site.
My favorite approach to this problem is to allow the user to paste a link (not the "embed on page" iframe code) in a text box. Then I use a regex to identify the pasted link (is it YouTube, Bing Maps, ...) and create the HTML from the pasted link myself, which isn't too complex for most iframe providers. It's much more work for you, and it restricts the APIs you can put on your page, but it's secure.
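For illustration, a minimal sketch of that approach in Python; the URL patterns and iframe markup are simplified placeholders and only one provider is shown:

# Rough sketch of the "paste a link, build the embed yourself" approach.
# The patterns and iframe templates are simplified illustrations only.
import re
from html import escape

EMBED_RULES = [
    # (provider regex, template that receives the captured video id)
    (re.compile(r"^https?://(?:www\.)?youtube\.com/watch\?v=([\w-]{11})"),
     '<iframe src="https://www.youtube.com/embed/{0}" width="560" height="315"></iframe>'),
    (re.compile(r"^https?://(?:www\.)?youtu\.be/([\w-]{11})"),
     '<iframe src="https://www.youtube.com/embed/{0}" width="560" height="315"></iframe>'),
]

def embed_html_for(url):
    """Return iframe HTML for a recognised link, or None otherwise."""
    for pattern, template in EMBED_RULES:
        match = pattern.match(url)
        if match:
            return template.format(escape(match.group(1), quote=True))
    return None  # unrecognised provider: reject instead of echoing raw HTML

print(embed_html_for("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))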
Letting your users use arbitrary HTML is dangerous. You may want to have a blacklist and a whitelist of tags that you disallow and allow (respectively).
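If you do accept raw markup, a whitelist is the safer of the two. A minimal sketch, assuming a Python backend and the third-party bleach library (the allowed tags and attributes below are illustrative only and should be tightened for a real site):

# Rough whitelist sketch using the bleach library (pip install bleach).
# Everything not explicitly listed is stripped.
import bleach

ALLOWED_TAGS = ["p", "a", "b", "i", "em", "strong", "br"]
ALLOWED_ATTRIBUTES = {"a": ["href", "title"]}

def clean_user_html(raw_html):
    return bleach.clean(
        raw_html,
        tags=ALLOWED_TAGS,
        attributes=ALLOWED_ATTRIBUTES,
        strip=True,  # drop disallowed tags instead of escaping them
    )

print(clean_user_html('<p onclick="evil()">Hello <script>alert(1)</script></p>'))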

How do I resolve the content of a webpage?

I'm writing a special crawler-like application that needs to retrieve the main content of various pages. Just to clarify: I need the real "meat" of the page (provided there is one, naturally).
I have tried various approaches:
Many pages have RSS feeds, so I can read the feed and get the page-specific content.
Many pages use "content" meta tags.
In a lot of cases, the object presented in the middle of the screen is the main "content" of the page.
However, these methods don't always work, and I've noticed that Facebook does a mighty fine job of doing just this (when you want to attach a link, they show you the content they've found on the linked page).
So, do you have any tips for an approach I've overlooked?
Thanks!
There really is no standard way for web pages to mark "this is the meat". Most pages don't even want this because it makes stealing their core business easier. So you really have to write a framework which can use per-page rules to locate the content you want.
Well, your question is still a little bit vague. In most cases, a "crawler" is going to just find data on the web in text format and process it for storage, parsing, etc. The "Facebook screenshot" thing is a different beast entirely.
If you're just looking for a web based crawler, there are several libraries that can be used to traverse the DOM of a web page very easily, and can grab content that you're looking for.
If you're using Python, try Beautiful Soup
If you're using Ruby, try hpricot
If you want the entire contents of a webpage for processing at a later date, simply get and store everything underneath the "html" tag.
Here's an Hpricot example to get all the links off a page:
require 'hpricot'
require 'open-uri'

# Fetch the page and parse it with Hpricot
doc = Hpricot(open("http://www.stackoverflow.com"))

# Select every anchor element and print its href attribute
(doc/"a").each do |link|
  puts link.attributes['href']
end
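For the Python route mentioned above, a roughly equivalent sketch with Beautiful Soup (assuming the requests and beautifulsoup4 packages) would be:

# Roughly equivalent Python sketch using Beautiful Soup
# (pip install requests beautifulsoup4).
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.stackoverflow.com").text
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a"):
    print(link.get("href"))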
Edit: If you're going to primarily be grabbing content from the same sites (e.g. the comments section of Reddit, questions from Stack Overflow, Digg links, etc.), you can hardcode their formats so your crawler can say, "OK, I'm on Reddit, get everything with the class of 'thing'." You can also give it a list of default things to look for, such as divs with a class/id of "main", "content", "center", etc.
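As a rough sketch of that per-site-rules-plus-generic-fallbacks idea (again assuming Beautiful Soup; the selectors below are made-up placeholders, not verified against those sites):

# Rough sketch: per-site rules first, then generic "main content" fallbacks.
# The selectors are illustrative placeholders only.
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

SITE_RULES = {
    "www.reddit.com": ["div.thing"],          # hypothetical selector
    "stackoverflow.com": ["div.question"],    # hypothetical selector
}
DEFAULT_RULES = ["#main", "#content", "#center", "div.main", "div.content"]

def extract_main_content(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    selectors = SITE_RULES.get(urlparse(url).netloc, []) + DEFAULT_RULES
    for selector in selectors:
        nodes = soup.select(selector)
        if nodes:
            return "\n".join(node.get_text(" ", strip=True) for node in nodes)
    # Fall back to storing everything under <html> for later processing.
    return soup.get_text(" ", strip=True)

print(extract_main_content("http://www.stackoverflow.com")[:500])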