Parse an in-memory DOM?

I have done HTML parsing. I get a URL, and using Nokogiri I can extract components from the HTML. That is fine.
Now I am wondering if the following is possible, or if it just does not make sense at all:
When we look at a browser, there is a rendering engine that parses the HTML/CSS/JS and creates a visual representation of it. I am wondering if it is possible to access that in-memory DOM interpretation. So, for example, when parsing the HTML I can find an <img> that is pretty far from the root element, but when rendered it can appear at the top of the page (because the CSS says it is absolutely positioned). I would like to be able to get that image as it appears in the browser.
Is there any open-source API that would let me access this interpretation of an HTML file, or does what I am saying not make sense at all, because what we see is just visual output that cannot be treated as objects?

It sounds like you're asking for a headless browser – a rendering engine that works for your code instead of a user.
Look at PhantomJS.
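For example, a minimal PhantomJS sketch (the URL, file name and selector below are placeholders) loads a page, lets its CSS and JavaScript run, saves the rendered result as an image, and can query the laid-out DOM from inside the page:
var page = require('webpage').create();
page.open('http://example.com/', function (status) {   // placeholder URL
  if (status !== 'success') { phantom.exit(1); }
  page.render('page.png');                              // screenshot of the page as rendered
  // evaluate() runs inside the page, so layout information is available:
  var top = page.evaluate(function () {
    var img = document.querySelector('img');
    return img ? img.getBoundingClientRect().top : null;
  });
  console.log('first <img> offset from the top of the rendered page: ' + top);
  phantom.exit();
});
You would run it with the phantomjs binary, e.g. phantomjs snapshot.js (the script name is arbitrary).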

Related

Alternative to dangerouslySetInnerHTML

I use React to build a web app that stores documents. Those are created in HTML and are then stored in a database. To show them in the app I load the HTML inside a div by making use of dangerouslySetInnerHTML.
<div dangerouslySetInnerHTML={{__html: this.props.page.content}} />
Even though this works perfectly fine, the name dangerouslySetInnerHTML suggests that this deserves closer attention, so I wonder what exactly can be done while remaining flexible enough to load HTML and make it appear in the web app. I believe that the word dangerous refers to the danger of cross-site scripting, meaning that a script could be injected and execute harmful code.
As a countermeasure I thought of cleaning the HTML code before passing it to the div. One library addressing this is DOMPurify. Another way would be to convert the HTML code from the database directly into React elements using html-react-parser.
Would that be the right approach? Or are there alternatives to dangerouslySetInnerHTML?
Perhaps these parsers can fulfill your need:
html-to-react
html-react-parser
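For illustration, a minimal sketch combining the two ideas from the question and this answer: sanitize with DOMPurify first, then convert the HTML string into React elements with html-react-parser instead of using dangerouslySetInnerHTML. The component name and prop are made up; the content would come from this.props.page.content as in the question:
import DOMPurify from 'dompurify';
import parse from 'html-react-parser';

// Sanitize the stored HTML, then turn it into React elements
// rather than injecting the raw string into the DOM.
function PageContent({ html }) {
  const clean = DOMPurify.sanitize(html);
  return <div>{parse(clean)}</div>;
}

// usage: <PageContent html={this.props.page.content} />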

How can I save a webpage as an image in my rails app?

In my rails app I have a need to save some webpages and display them to the user as images. For example, how would I save www.google.com as an image?
There is a command-line utility called CutyCapt that uses the WebKit rendering engine to render HTML pages into various image formats. Maybe this is for you?
http://cutycapt.sourceforge.net/
Prohibitively difficult to do in pure Ruby, so you'd want to use an external service for this. Browsershots does it, for example, and it looks like they have an api, although I haven't used it myself. Maybe someone else can chime in with alternative but similar services.
You'll also want to read up on delayed_job or something similar, to make sure you're accessing those page images as a background task and that it doesn't interfere with your actual application.
You can't do it easily (and probably can't do it at all).
Each page is just text - HTML data. The view you want to make an image of is a rendered page. The browser renders the page using tons of techniques such as HTML parsing, JavaScript parsing, CSS parsing, font rendering, etc. To make a screenshot of the Google page, you would need to do all that rendering somewhere in memory and then take a screenshot of the rendered page.
That task is almost impossible (though nothing is fully impossible).
If you are really eager to donate tons of time to accomplish that task, these are the steps:
1) Find some open-source rendering engine. Firefox's would do.
2) Find some way to communicate between Ruby on Rails and that engine.
3) Wire it all together and see the results.
However, I see steps 1 and 2 as nearly impossible.
Firefox addon:
https://addons.mozilla.org/en-US/firefox/addon/1146/

Clean HTML using C#

How do I repair malformed HTML using C#? A great answer would be an HTML Agility Pack sample!
I'm scraping a site (for legitimate use). The site's HTML is OK but there are some annoying problems.
One way I could go would be through regular expressions. I used Expression Web to analyse the problems and the regular expressions needed to correct them. So one way would be to use a tool such as RegexBuddy to generate C# code for these regular expressions.
However, the recommended tool for processing malformed HTML in C# is the HTML Agility Pack (HAP). Moreover, I've analysed only a handful of pages and I'm afraid that future pages will contain patterns I've not yet solved, and I would hate to enter the "find the errors in the next few pages and correct them" maintenance business. So, if HAP already has a solid, always-working solution, that would be great. The problem is that, apart from a few mentions here at SO, I could not find any how-to-use documentation for this tool other than the object-by-object API help file.
So - before I spend $ and learning time on RegexBuddy (no free evaluation version), or break my teeth on HAP's API documentation - is there an easy way to do this? An HAP sample would help... :-)
Can you tell me what kind of annoying problems you are having?
You don't need to use regex to clean the HTML, though; HAP will let you access the elements of a malformed HTML document using XPath queries.
Basically, you need to learn XPath to know how to get the HTML elements you want.
It really depends on the kind of HTML you are parsing with HAP, but there are several ways to get the elements:
for example by id or class, or you can even get the element that follows another element containing a given text such as "name:".
You can go to the W3Schools XPath tutorial for a nice introduction to XPath.
What I took from the answers here:
1) If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes.
2) If you are limited to this known site, why not write your scraper to adjust for the problems?
So, if I have to go into maintenance mode, it should be as easy as possible. Therefore, my process is as follows:
I use Webius's SWExplorerAutomation to detect scenes in web pages. The idea is that a scene is a collection of conditions you define for IE. When a web page is loaded, IE tries to see which set of conditions is met (e.g. the page title is "Account Login", and the page contains a "Login" text box and a "Password" text box). If a set of conditions corresponding to a scene is detected, IE reports that the scene has been detected. This model provides an abstraction layer - some changes in the web page translate only into changes in the scene file, saving the code from having to change. Additionally, it shields me from IE's event-driven model: I simply ask for a scene and wait for it to be detected. I'm evaluating this product, but I'm not yet sure I'll use it, mainly because the documentation is terrible. Another alternative is WatiN, and one more reason I haven't yet bought SWEA is this article accusing its author of spamming against WatiN.
Once the web page has been acquired, I use Expression Web to run compatibility checks and identify errors.
I use RegexMagic to remove and correct errors. I really love this tool. Sure, sometimes it makes you murderously angry because it doesn't let you do things that should be really easy, but it's a sweet, sweet tool, and the documentation is amazing.
Finally, after all the errors I know of have been corrected, I use HTML Agility Pack to convert to XHTML - crossing the t's and dotting the i's, so to speak: all lower case, quotes around attributes, and so on.
Hope this helps!
Avi
Regex can't be used for HTML cleaning.
Does http://tidy.sourceforge.net/ help?
If you're scraping a website you don't control, you'll always be in a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes. It doesn't matter if you're using the regex <td color="red">\d+</td> to get the big red number from a page or if you're using a DOM parser to get the 3rd cell in the 2nd row of the table with id "numbers" to get the same value. The regex breaks if the webmaster replaces the color attribute with a class attribute. The DOM parser breaks if the webmaster adds another row to the top of the table.
If you're scraping larger parts of a web page and want to embed them in your own web page, it may be easier to get over your desire for web standards compliance and just let the browser figure out how to display things.
Since you're using Html Agility Pack and know of the problems that occur, and since you are limited to this known site, why not write your scraper to adjust for the problems once you've loaded the HtmlDocument?
i.e.:
If you know that a given element always appears after another one, insert it into the first-child position of that tag, and so on.

Embed section of HTML from another site?

Is there a way to embed only a section of a website in another HTML page?
Example: I see an answer I want to blog about, so I grab the HTML content, and splat it in somewhere, and show only that, styled like it is on stackoverflow. Basically, I want to blockquote the section of the page with original styling, if that makes sense. Is that something the site itself has to provide, or can I use an iframe and tell it to show only a certain element or something crazy? Open to all options, but I want it to show up as HTML, not as an image (that's really a last resort).
If this is even possible, are there security concerns I need to aware of?
I don't think an image should really be a last resort. You have no control over the HTML/CSS of the source page, so even if you craft a solution (probably by using JavaScript to parse out the desired snippet), there is no guarantee that the site won't change its layout tomorrow.
Even Jeff, who has control over the layout of stackoverflow.com, still prefers to screen-capture the site, rather than pull in the contents live.
Now if your goal was to have the contents auto-update, that would be a different story. But still, unless you use some agreed-upon method of sharing content, such as RSS, your solution would be very fragile.
The concept you are describing is roughly what is called a "purple include" or "transclusion". There is a library out there for it, but it's not exactly actively developed. Here are a couple of Ajaxian articles on it.
I'd recommend using a server-side solution with Python: use urllib2 to request the page, then use BeautifulSoup to parse out the bit that you need. BeautifulSoup has a very flexible selection API with which you can craft heuristics for the section you are interested in.
To illustrate:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import, matching the urllib2-era style
soup = BeautifulSoup(html)  # html is the page source fetched with urllib2
text = soup.find(text="Some text on the page that is unlikely to change")
print text.parent.prettify()
That way if the webmaster later changes the markup on the page, your scraping script should still work.
On the client side an <iframe> is the only practical option. It is possible to scroll it to the part you want, but that might not keep working in the long term, because it's technically close to a clickjacking attack.
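As a sketch of that scrolled-iframe option (the element id, URL and pixel offsets below are made-up placeholders, and the framed site can block this with X-Frame-Options): put a large iframe inside a small wrapper with overflow:hidden and shift it by negative offsets so only the region you care about stays visible.
function embedRegion(container, url, offsetLeft, offsetTop, width, height) {
  // Small window onto the framed page: only width x height pixels are visible.
  container.style.position = 'relative';
  container.style.overflow = 'hidden';
  container.style.width = width + 'px';
  container.style.height = height + 'px';

  var frame = document.createElement('iframe');
  frame.src = url;
  frame.style.border = '0';
  frame.style.position = 'absolute';
  frame.style.width = '1200px';            // large enough to contain the target region
  frame.style.height = '3000px';
  frame.style.left = -offsetLeft + 'px';   // shift the framed page so the region lines up
  frame.style.top = -offsetTop + 'px';
  container.appendChild(frame);
}
// e.g. embedRegion(document.getElementById('quote'), 'https://example.com/some/answer', 300, 850, 600, 400);
The offsets have to be tuned by hand and break whenever the source page's layout changes.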
There's also cross-site XHR, but it requires opt-in from the destination site, and today it works only in a few of the latest browsers.
Getting the HTML on the server side is easy (every decent web framework can download a page and parse the HTML, and you can use XPath/XSLT or the DOM to extract the bit you want).
Getting the styles, however, is going to be tricky - CSS rules may not work with an HTML fragment taken out of context. You'd have to parse the CSS and extract and transform the rules, or use a browser and read the currentStyle of every node.
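A minimal browser-side sketch of that last idea (assuming a modern browser with getComputedStyle; the element id is a placeholder): walk the fragment and copy each element's computed style into an inline style attribute, so the extracted HTML keeps most of its look out of context. The output is huge and still imperfect (pseudo-elements, :hover rules and inherited layout context are lost):
function inlineStyles(node) {
  if (node.nodeType !== 1) return;                     // element nodes only
  var computed = window.getComputedStyle(node, null);  // old IE would need node.currentStyle instead
  var css = '';
  for (var i = 0; i < computed.length; i++) {
    var prop = computed[i];
    css += prop + ': ' + computed.getPropertyValue(prop) + '; ';
  }
  node.setAttribute('style', css);
  for (var child = node.firstChild; child; child = child.nextSibling) {
    inlineStyles(child);
  }
}
// e.g. inlineStyles(document.getElementById('post-text')); then serialize the node with .outerHTML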
Obviously you have to heavily filter the HTML you extract to avoid XSS. It's harder than it seems.
If you don't need to automate this, a good HTML+CSS WYSIWYG editor might be able to extract a content fragment together with its styles.
That sounds like something IE8's Web Slices would be perfect for. However, it's only available in IE8, and the site of origin would have to implement it for you to be able to take advantage of it.

Anyone know of a good algorithm for rendering an HTML table to an image?

There is a standard two-pass algorithm mentioned in RFC 1942: http://www.ietf.org/rfc/rfc1942.txt however I haven't seen any good real-world implementations. Anyone know of any? I haven't been able to find anything useful in the Mozilla or WebKit code bases, but I am not entirely sure where to look.
I guess this might actually be a deeper problem, since rendering a table means rendering HTML (the contents of the table cells), but just to keep it simple: a plain-text HTML table as an image. Even an HTML table rendering algorithm that ignores the "as an image" part would do...
If a commercial tool is an option, look at:
HtmlCapture ActiveX Control V2.0 (originally named HtmlSnap)
Some features they claim:
By calling SnapHtmlString(), you can take a snapshot of an HTML string.
Get snapshot images rendered by either Microsoft IE or Mozilla Firefox.
Just by calling SnapUrl() and SaveImage(), you can take a snapshot of a webpage in various image formats, such as BMP, JPG, JPEG, GIF, PNG, TIF, TGA and PCX.
Convert html to vector image format like EMF and WMF.
Self contained ActiveX control with no third party dependencies.
Support custom gdi output of the resulting image.
Support saving resulting image both to file and in memory.
Support saving both full-size web page and thumbnail one.
Take a snapshot of a whole webpage into one image without scrollbars.
Make grayscale or B&W images with efficient algorithms to keep the quality.
Support JPEG compression level, compression method selection of TIFF and GIF.
Support setting color depth in images while keeping the quality of the image as much as possible.
Selectively save activeX, image, java applets, scripts and videos on a web page as you want.
Send custom cookies, http headers, credentials in snapshot requests.
Take snapshots of webpages via a Proxy server.
More than 30 samples written in VC, C#, Delphi, VB, C++ Builder, Java, JScript, Perl, VBScript, ASP, ASP.NET and PHP are provided.
HTML table rendering is non-trivial due to the various ways that the sizes of the cells may be specified, tables nested within tables, etc.
If all you want is the image, a simple solution would be the .NET browser control (which is basically the COM component for IE) plus a screen-capture function.
If you want some source code to manipulate, the Mozilla source should still be available.
I'm not sure if this will meet your constraints or not, but you can try using IE or an IE control with MSHTML and the IHTMLElementRender interface to render the table to a device context.
If you have XHTML, not plain HTML, you should be able to retrieve the content of those cells along with information about the table's structure: colspan, rowspan, etc. Using this information, you can render the table using your own border, padding and margin values.
Things get complex when you also want to render user-defined dimensions. But for retrieving the table data and drawing it, you could use an XML parser. PHP's parser is here: http://ca3.php.net/xml
One tool that comes close is: http://www.terrainformatica.com/htmlayout/main.whtm
This library offers a way to capture rendered HTML to an image; however, it is not open source (but it is free!). Hope it is useful to someone!
Unfortunately my app is cross-platform C/C++ with no MFC or platform dependencies (nightmare!). I'm hoping to find a general-purpose algorithm for table rendering. I think the two-pass option from the RFC comes pretty close, so I'm probably going to just dig in and work against that. I'll be sure to blog about it and post my eventual solution here if I can!
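For reference, a rough sketch of the RFC 1942 two-pass idea (written in JavaScript only for brevity; plain-text cells, width measured in characters, no colspans, and the function name is made up). Pass one computes each column's minimum width (widest word) and maximum width (longest cell content); pass two assigns actual widths from the available width, following the allocation rule the RFC suggests:
function layoutColumns(rows, availableWidth) {
  var cols = rows[0].length;
  var min = new Array(cols).fill(0);   // widest single word per column
  var max = new Array(cols).fill(0);   // longest full cell content per column

  // Pass 1: minimum/maximum content widths.
  rows.forEach(function (row) {
    row.forEach(function (cell, c) {
      var words = String(cell).split(/\s+/);
      var widestWord = Math.max.apply(null, words.map(function (w) { return w.length; }));
      min[c] = Math.max(min[c], widestWord);
      max[c] = Math.max(max[c], String(cell).length);
    });
  });

  // Pass 2: assign actual column widths.
  var minTotal = min.reduce(function (a, b) { return a + b; }, 0);
  var maxTotal = max.reduce(function (a, b) { return a + b; }, 0);
  if (maxTotal <= availableWidth) return max;   // everything fits at its preferred width
  if (minTotal >= availableWidth) return min;   // can't shrink below the minimum; the table overflows
  var ratio = (availableWidth - minTotal) / (maxTotal - minTotal);
  return min.map(function (m, c) { return Math.round(m + (max[c] - m) * ratio); });
}
// e.g. layoutColumns([['Name', 'Description'], ['foo', 'a fairly long description']], 40)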
Take a look at Prince XML - it's a commercial tool to render CSS-styled XML (including XHTML) documents to PDFs. The tool conforms to major W3C standards such as XHTML and CSS 2.1. You can try the free demo version from their homepage!
Since you want an image: it shouldn't be a big problem to programmatically convert the generated PDFs to images.