Semantic Screenshots for Web Browsers

An awful lot of modern web traffic (particularly on social media) consists of screenshots from web browsers. These typically include some formatted text, some layout, and some bitmap/vector graphics.
It's really easy to take and share a screenshot, but it throws away lots of useful information and doesn't transfer well between devices (not to mention being far less amenable to things like screen readers for the blind and fancy data-mining). Of course the ironic part of this is that HTML/SVG is the perfect format for representing such data, and we're not using it even though it's right there.
html2canvas comes close to doing this, but it doesn't properly handle images; see some semi-related discussion here.
My question is this: how can I select a visible area in my browser and save it in a format (ideally HTML) that preserves text and images and produces something roughly similar when rendered separately (so that it could be included as, e.g., a data iframe for sharing)?
I know that this is in general impossible, and that rendering HTML is a complicated task, but I feel like it should be possible to ask the browser something like "what elements are being rendered within these pixel coordinates?".
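For what it's worth, browsers do expose a hit-testing API for exactly this: document.elementsFromPoint(x, y) returns every element rendered at a given CSS-pixel coordinate. A rough sketch driving it from Python with Selenium (the URL and coordinates are placeholders, and a geckodriver install is assumed):

from selenium import webdriver

driver = webdriver.Firefox()        # assumes geckodriver is on the PATH
driver.get("https://example.com/")  # placeholder URL

# Ask the browser which elements are rendered at a given pixel coordinate
# and return their serialized HTML, innermost element first.
x, y = 200, 150                     # placeholder coordinates
rendered = driver.execute_script(
    "return document.elementsFromPoint(arguments[0], arguments[1])"
    ".map(el => el.outerHTML);",
    x, y,
)
for fragment in rendered:
    print(fragment[:80])

driver.quit()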

First:
Right click on page, then click on "Save page as".
Save it with a name that ends with .html (or .webarchive in some scenarios. See which works best for you).
Edit the saved HTML file so it only has the part you want (you can use any text editor; Sublime Text and Atom are usually suggested).
Then:
You can open it in your browser to see what you are up to.
You might also want to find where the page's CSS comes from, save it into the same folder as your HTML file, and link the HTML file to it, so as to preserve the styles.
As far as I understand, you'd want to bring all the CSS inline, or at least into the <head> section of the HTML file, so you can share it as a single file and don't need to keep a separate CSS file linked.
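One way to do that is a small script that folds each locally saved stylesheet into a <style> block in the page; a rough sketch with BeautifulSoup (the file names are placeholders, and only stylesheets saved alongside the page are handled):

from pathlib import Path
from bs4 import BeautifulSoup

page = Path("saved_page.html")                     # placeholder file name
soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")

# Replace each local <link rel="stylesheet"> with an inline <style> block.
for link in soup.find_all("link", rel="stylesheet"):
    css_path = page.parent / link.get("href", "")
    if css_path.is_file():                         # skip stylesheets that weren't saved locally
        style = soup.new_tag("style")
        style.string = css_path.read_text(encoding="utf-8")
        link.replace_with(style)

Path("single_file.html").write_text(str(soup), encoding="utf-8")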

Related

Unable to Copy Rendered HTML Signature Into Gmail

After a fair bit of looking around, the only way I've found to get a signature into Gmail is to copy the rendered HTML signature. Two problems arose: 1) I couldn't actually select my entire signature, and I can't even see what I'm selecting the way it works with regular text and in others' tutorials for Gmail signatures. 2) If I press Ctrl+A in Firefox (Chrome only copies half, even when I use Ctrl+A), I can manage to copy my signature, but if I try to paste it into the signature box, it glitches out and appears static in the top left of that specific Chrome/Firefox tab, like this (edited for privacy reasons):
And if I try to just go for it and email (after saving changes), no signature will be rendered at all. Not too sure what to do at this point, so any suggestions are welcome.
Thanks.
EDIT: This is the HTML I use to render the signature. As a side note, I do replace those placeholder file names with links from an image hosting site. I also add 3 tags around a few of the ""s.
Ultimately I found the solution after playing with various HTML and image options. The problem lies in my use of the <div> tag for the layout of the signature; I should have been using <td>. Using the slice tool in Illustrator generates the HTML with <div> tags, while using ruler guides in Photoshop and saving for web (I used the legacy option) generates <td> tags. I'm going to do a little more digging and see whether using guides in Illustrator still produces <div> tags, but I'm not sure this site is the place to discuss that piece of the problem.
EDIT: By the way, Illustrator just really likes <div>s, so if anyone is looking to do this same thing, use Photoshop's legacy Save for Web mode. It will generate the <td> tags for you.

Tool for Viewing Formatted HTML Source Code in Browser

I'm developing a web scraping tool in Python, and I need to get intimately acquainted with the functions of various HTML tags on certain sites. Unfortunately, the "view source" that Chrome, Firefox, and Safari offer does not output very well formatted HTML source code -- it tends to place a huge number of tags on the same line. Do the browsers offer any plugins that may be able to clean things up a bit, or do I need to get/develop some kind of tool in Python that takes dirty HTML as input and outputs cleanly formatted HTML?
Since I work primarily with Chrome, the best examples I can think of are Code Formatter (Chrome)
This isn't automatic; you have to copy and paste the entire page into the app. Also, the app window is small (this is unalterable, to my knowledge), but it's relatively effective.
...and JavaScript and CSS Beautifier
Much more effective and clean, but as the title suggests, it only works with JS and CSS.
With Firefox you can select (highlight - I am writing for beginners also) the text, and once it is selected, release the left mouse button and right click within the selected area and choose "View selection source." You can then copy the highlighted text and paste it.
My composite example:
View selection source
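If you'd rather do the cleanup inside the Python scraper itself, BeautifulSoup's prettify() re-indents a messy page, one tag per line; a minimal sketch (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/")        # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

# prettify() serializes the parsed tree with consistent indentation
print(soup.prettify())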

View Source on my websites always looks nasty, why isn't the HTML lining up properly?

I do a decent job of formatting my HTML and keeping it clean, but every time I view source there are elements all over the place. I guess that's fine, since it won't make the page load any faster or slower and makes it harder to copy, but it just looks ugly and I wish it didn't.
Why?
View Source in a web browser will show exactly what the server sent to the client. If you're really formatting your HTML nicely and it doesn't look exactly the same on the client, then there's something else in the middle that's making it not line up the same, such as a server-side technology like PHP or ASP.NET which is being used to generate some of the markup.
Also, it's possible you're seeing it differently due to whitespace. If in your development environment you're mixing spaces and tabs with one tab equal to 4 spaces, for example, while in the browser one tab is equal to 8 spaces, then things won't line up right. To fix this, either always use tabs or always use spaces. Most decent IDEs (like Visual Studio) will swap between tabs and spaces automatically for you.
Some browser tools, like Firebug and Chrome's Developer Tools, will show the DOM tree as the browser understands it. This is a translation of the DOM back to HTML and is not likely to be exactly the same as the content the server sent. It is formatted perfectly, though.
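To see that difference for yourself, you can compare the raw response the server sent with the browser's serialized DOM for the same URL; a hedged sketch using requests and Selenium (the URL is a placeholder and a driver binary is assumed):

import requests
from selenium import webdriver

url = "https://example.com/"                       # placeholder URL

raw = requests.get(url).text                       # exactly what the server sent (what View Source shows)

driver = webdriver.Firefox()
driver.get(url)
dom = driver.page_source                           # the DOM serialized back to HTML (what dev tools show)
driver.quit()

print(raw == dom)                                  # usually False: parsing normalizes markup and scripts may change it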
I'm not sure why your HTML is not lined up properly in your browser's View Source. It would be helpful to actually see your HTML.
Some of the common culprits include:
a mixture of tabs and manual spaces for indenting code (if you want things to look pretty, do one or the other).
possibly a mixture of Windows, Unix, and Macintosh line breaks (CR/LF), which can happen if code is edited from multiple computers. I've had issues with this, but I'm not sure if it would cause the issues you're describing; perhaps not. (I'm sure others can comment more knowledgeably on that possibility.)
your site is managed through a CMS that emits terrible HTML
It may be useful for you to look at HTML Tidy. I haven't used it yet, but I've always heard good things.
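For reference, HTML Tidy can also be driven from Python through the pytidylib wrapper (this assumes the tidy library and the pytidylib package are both installed):

from tidylib import tidy_document

dirty = "<p>unclosed paragraph<div><span>messy</div>"
# tidy_document returns the cleaned, indented markup plus a report of what it fixed
clean, report = tidy_document(dirty, options={"indent": 1, "show-warnings": 1})
print(clean)
print(report)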

Do you know a webpage appearance comparator?

I need a tool to compare the design of a website; I don't want to compare only the HTML code, but the rendered design.
Is this even possible? Also, is there any open-source program of this kind?
I have searched Google, but the only candidate I've found so far is HTML Match.
In modern webpages the appearance is controlled by several things: HTML code, CSS styles, and images at least (also JavaScript on some pages). Simple text-based diff programs are not enough, because their output can be irrelevant to the webpage's appearance (e.g. cleaning up CSS can show many differences while the rendered webpage remains the same).
For simpler pages HTML Match mentioned above could do the job. If I have to compare the design of two "complex" pages (including layout, space, image and text changes) I would do a two-step approach:
Run a diff tool on the html sources to highlight the textual content differences. Then I would modify one of the pages to show the same content as the other (in order to make the next step more accurate and 'focused' to show 'real' layout changes). Of course it works only with very similar html.
Load the pages in the same web browser, take screenshots of the rendered output at fixed positions, and compare the images (e.g. with ImageMagick; a sketch follows below). It should show all visual differences in the rendered output.
It is not perfect but should work.
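The second step can be scripted; a minimal sketch shelling out to ImageMagick's compare tool (the screenshot file names are placeholders and both images are assumed to have the same dimensions):

import subprocess

# ImageMagick's "compare" with the AE metric reports the number of differing
# pixels on stderr and writes a visual diff image highlighting them.
result = subprocess.run(
    ["compare", "-metric", "AE", "page_a.png", "page_b.png", "diff.png"],
    capture_output=True,
    text=True,
)
print("differing pixels:", result.stderr.strip())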
[UPDATE] HTML Match seems dead, see this answer for an alternative solution.
Solution: “compare web pages” tool. (“We've been doing it since 1999. It's free.”)
Example output (comparing the pages for TP-Link USB hub models UH700 and UH720):
Under Windows:
http://www.htmlmatch.com/
If you are using KDE, you can use Kompare or KDiff3.
However, if you want to see how your web page looks in different browsers on different operating systems, BrowserShots can be used.
There are these online tools, though they aren't brilliant:
http://www.w3.org/2007/10/htmldiff
http://www.aaronsw.com/2002/diff/
I like the look of daisydiff but have not used it in anger: http://code.google.com/p/daisydiff/
The keyword you're looking for is "diff".
A good program that can show you the differences between two files (html markup or other) would be ExamDiff for windows.
I'm working on one, and I can tell you it's hard; there is nothing on the market. Maybe Google and Bing have something in-house. You can use image comparison tools that identify rectangular regions of changed pixels; this is, for example, part of all modern video compression. But you have to do it separately for different regions of the webpage (the nav bar, the main article, the region filtered by an ad-blocker, etc.), since some of them may change while the page is still considered the same content.
As I said, it's a very complex problem with no exact solution.
The other option is going the non-visual way and comparing the resulting computed styles of each HTML element. You'd have to hack the browser to get access to the layout tree; there is no official API or existing library/program/hack/patch for it.
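A practical middle ground, short of patching the browser, is to read the resolved styles through getComputedStyle from a driver; a rough sketch with Selenium (the URL and the property list are placeholders). Running it against both pages and diffing the two lists approximates a layout comparison:

from selenium import webdriver

driver = webdriver.Firefox()                       # assumes geckodriver is on the PATH
driver.get("https://example.com/")                 # placeholder URL

# Collect a few layout-relevant computed properties for every element.
styles = driver.execute_script("""
    const props = ['display', 'position', 'width', 'height', 'font-size'];
    return Array.from(document.querySelectorAll('*')).map(el => {
        const cs = getComputedStyle(el);
        return [el.tagName].concat(props.map(p => cs.getPropertyValue(p)));
    });
""")
driver.quit()

for row in styles[:10]:
    print(row)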
You can make a visual comparison with Araxis Merge Pro by taking screen output with systems like BrowserStack, Cross Browser, or PhantomJS.

Embed section of HTML from another site?

Is there a way to embed only a section of a website in another HTML page?
Example: I see an answer I want to blog about, so I grab the HTML content, and splat it in somewhere, and show only that, styled like it is on stackoverflow. Basically, I want to blockquote the section of the page with original styling, if that makes sense. Is that something the site itself has to provide, or can I use an iframe and tell it to show only a certain element or something crazy? Open to all options, but I want it to show up as HTML, not as an image (that's really a last resort).
If this is even possible, are there security concerns I need to aware of?
I don't think an image should really be a last resort. You have no control over the HTML/CSS of the source page, so even if you craft a solution (probably by using JavaScript to parse out the desired snippet), there is no guarantee that tomorrow the site won't decide to change its layout.
Even Jeff, who has control over the layout of stackoverflow.com, still prefers to screen-capture the site, rather than pull in the contents live.
Now if your goal was to have the contents auto-update, that would be a different story. But still, unless you use some agreed-upon method of sharing content, such as RSS, your solution would be very fragile.
The concept you are describing is roughly what is called a "purple include" or "transclusion". There is a library out there for it, but it's not exactly actively developed. Here are a couple of Ajaxian articles on it.
I'd recommend using a server side solution with Python; using urllib2 to request the page, then using BeautifulSoup to parse out the bit that you need. BeautifulSoup has a very flexible selection api with which you can craft heuristics for the section you are interested in.
To illustrate:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html is the fetched page source
# Find the anchor text, then print its enclosing element nicely formatted.
text = soup.find(text="Some text on the page that is unlikely to change")
print(text.parent.prettify())
That way if the webmaster later changes the markup on the page, your scraping script should still work.
On the client side, an <iframe> is the only practical option. It is possible to scroll it to the right place, but that might not keep working in the long term, because it's technically close to a clickjacking attack.
There's also cross-site XHR, but it requires opt-in from the destination site, and today it works only in a few of the latest browsers.
Getting the HTML on the server side is easy (every decent web framework can download a page and parse the HTML, and you can use XPath/XSLT or the DOM to extract the bit you want).
Getting the styles, however, is going to be tricky: CSS rules may not work with an HTML fragment taken out of context. You'd have to parse the CSS and extract and transform the rules, or use a browser and read the currentStyle of every node.
Obviously you have to heavily filter the HTML you extract to avoid XSS. It's harder than it seems.
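For the filtering step, a sanitizing library with an explicit allow-list is much safer than hand-rolled regexes; a hedged sketch using the bleach package (the allowed tags here are just an illustrative subset):

import bleach

snippet = '<p onclick="alert(1)">Hello <script>alert(2)</script><b>world</b></p>'

# Keep only an explicit allow-list of tags and attributes; everything else is dropped or escaped.
clean = bleach.clean(
    snippet,
    tags=["p", "b", "i", "a", "blockquote", "pre", "code"],
    attributes={"a": ["href", "title"]},
    strip=True,
)
print(clean)   # only the allow-listed markup survives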
If you don't need to automate this, a good HTML+CSS WYSIWYG editor might be able to extract the content fragment together with its styles.
That sounds like something IE8's Web Slices would be perfect for. However, it's only available in IE8, and the site of origin would have to implement it for you to be able to take advantage of it.