Ruby gem to quickly validate partial HTML snippets?

I'm making a customized quasi-CMS in Rails, and we'd like to have one field that is editable as an HTML fragment in code (the admin interface will be using CodeMirror on the frontend). When it's presented to the end user, it will just be html_safe'd and inserted into a div. We trust our content editors not to be malicious, but it would be helpful to ensure they're creating valid HTML so they don't break the page, especially since they're relatively new to coding!
As a first attempt, I'm using Hash.from_xml and rescuing exceptions in a custom validator. But is there a better and/or more optimized way (e.g. a gem) to check that it is valid HTML?
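Roughly, that first attempt looks like this (a simplified sketch; the field name content, the wrapping root element, and the generic rescue are illustrative, and the actual exception class depends on the configured XML backend):

validate :content_parses_as_xml

def content_parses_as_xml
  # Hash.from_xml needs a single root node, so wrap the fragment first
  Hash.from_xml("<root>#{content}</root>")
rescue StandardError => e
  errors.add(:content, "is not well-formed: #{e.message}")
end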
Thanks!

You can use the Nokogiri library (and gem) to create a validator in your model. Using Nokogiri on fragments isn't perfect (so you might want to add the ability to override the validator) but it will catch many obvious errors that might break the page.
Example (assuming your model attribute/field is called content):
validate :invalid_html?

def invalid_html?
  # Parse in strict mode; any problems are collected in doc.errors
  doc = Nokogiri::HTML(self.content) do |config|
    config.strict
  end

  if doc.errors.any?
    errors.add(:base, "Custom Error Message")
  end
end

Instead of validation, perhaps it's worth using Nokogiri, which is capable of fixing the markup:
require 'nokogiri'
html = '<div><b>Whoa</i>'
Nokogiri::HTML::DocumentFragment.parse(html).to_html
#=> "<div><b>Whoa</b></div>"

You probably want https://github.com/libc/tidy_ffi or http://apidock.com/rails/v4.0.2/HTML/WhiteListSanitizer (class method sanitize)
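For example, with tidy_ffi the check-and-repair could look roughly like this (a sketch based on the gem's README; exact method names may differ between versions):

require 'tidy_ffi'

tidy = TidyFFI::Tidy.new('<div><b>Whoa</i>')
puts tidy.clean  # repaired markup
puts tidy.errors # diagnostics you could surface in a validator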

I think this may be what you're looking for: be_valid_asset.

Related

Is there a way to check if text string is valid HTML in Rails?

I am writing a simple CMS in Rails 4, and I am storing my articles in the database as text strings that contain HTML code (not necessarily valid).
Anyway, I need a method to check, before saving, whether the text of the article is valid HTML or not (considering that the article is not a full HTML document, but a part of one, without DOCTYPE and other stuff). Something like this: https://validator.w3.org/#validate_by_input+with_options ("Validate HTML fragment"), but working inside my Rails application as a validation method of the model, so that if my markup is wrong, it doesn't save the article and shows an error message instead.
Is there a gem or other method to do this?
If you are looking to validate input for a field in Rails, you can combine three things:
Rails' before_save callback
JS/Ajax for checking input (to avoid page reloading)
Constraints which you define to be valid HTML
In your model you could create a method which checks whether the text inserted into the field is valid. Before the form is saved, it would check that the inserted HTML meets your definition of valid HTML.
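A minimal sketch of that idea (the model, field name, and the Nokogiri-based check are illustrative assumptions, not prescribed above):

class Article < ActiveRecord::Base
  before_save :check_html

  private

  # Parse the field as an HTML fragment and halt the save
  # if the parser reports any errors.
  def check_html
    fragment = Nokogiri::HTML::DocumentFragment.parse(text)
    if fragment.errors.any?
      errors.add(:text, "contains invalid HTML")
      false # returning false halts the callback chain in Rails 4
    end
  end
end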
Hope the above helps.
So I figured out how to achieve this using the w3c_validators gem.
Add gem 'w3c_validators' to your Gemfile and run bundle install.
Then change the model. I've added a custom validation method to validate the HTML, like this:
class Article < ActiveRecord::Base
  validate :valid_html

  def valid_html
    validator = W3CValidators::MarkupValidator.new
    # The W3C service validates complete documents, so wrap the fragment
    html = "<!DOCTYPE html><html><head><title>title</title></head><body>#{text}</body></html>"
    results = validator.validate_text(html)

    if results.errors.length > 0
      results.errors.each do |err|
        errors.add(:text, err.to_s)
      end
    end
  end
end
(I need to wrap my text in html and body tags and add a few more, because I don't store full HTML documents in my DB, only partials.)

Integration Testing HTML Special Characters

Kind of a strange one, but in my views I have a tick (✔) and a cross (×) used as links (in lieu of images). Is there any way of finding these elements and testing them using RSpec and capybara-webkit, or should I try to target, say, the title attribute instead and ignore this route?
My test in question looks like this:
context "casting a vote", js: true do
before do
sign_in user
click_link '✔'
sleep 0.2
end
it { should have_content("Vote cast!") }
end
The failure message I get is (predictably):
Failure/Error: click_link "raw('✔')"
Capybara::ElementNotFound:
Unable to find link "raw('✔')"
Thanks in advance for your help.
Capybara doesn't see the HTML source; it runs against the DOM, which contains the actual characters those entities encode. You must send the raw character as a UTF-8 string containing the code point itself.
The entity behind that tick is &#10004;, and decimal 10004 is code point U+2714, so in Ruby the string is "\u2714". Try that.
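Applied to the spec above (same character, just written as an escape):

click_link "\u2714" # identical to click_link '✔'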

Rails - close html tags when you enter html in a form

Is there any way to close html tags if a user forgets to? E.g. when the user input is:
<b>small</b><i>test
Is there a way in Rails to automatically add the closing </i> tag, so that the HTML that follows won't all be italic?
I used .html_safe to have everything interpreted as HTML, but I would like the <i> terminated too.
Rails doesn't have any built-in capability to do this; however, you have a couple of options:
Nokogiri - easy to install on pretty much all platforms (gem install nokogiri)
Tidy - the second post has the details for using it on Linux and Windows
Using Nokogiri you can simply do:
html = "<b>small</b><i>test"
clean = Nokogiri::HTML::DocumentFragment.parse(html).to_html
# clean = "<b>small</b><i>test</i>"
Rails can't do that for you.
You can use a template system like Haml or Slim so that you don't forget closing tags.
You can also rely on your own editor to close them.
A decent way to do this would be to feed the input to a DOM parser and then have that parser output corrected HTML.
Note that this isn't foolproof, and if the user makes too many mistakes, the parser won't know what to do.
I'd suggest Nokogiri

How to sanitize user generated html code in ruby on rails

I am storing user-generated HTML code in the database, but some of it is broken (missing end tags), and this broken code messes up the rendering of the whole page.
How can I prevent this sort of behaviour with Ruby on Rails?
Thanks
It's not too hard to do this with a proper HTML parser like Nokogiri, which performs the clean-up as part of parsing:
require 'nokogiri'

bad_html = '<div><p><strong>bad</p>'
puts Nokogiri::HTML.fragment(bad_html).to_s
# => <div><p><strong>bad</strong></p></div>
Once parsed properly, you should have fully balanced tags.
My google-fu reveals surprisingly few hits, but here is the top one :)
Valid Well-formed HTML
Try using the h() escape function in your ERB templates. That should do the trick.
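Note that h() escapes the markup so the browser renders it as plain text instead of interpreting it, which sidesteps broken tags rather than repairing them. A quick illustration (the sample string is mine):

require 'erb'

ERB::Util.html_escape("<b>small</b><i>test")
# => "&lt;b&gt;small&lt;/b&gt;&lt;i&gt;test"
# In an ERB template this is what <%= h(user_content) %> does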
Check out Loofah, an HTML sanitization library based on Nokogiri. This will also remove potentially unsafe HTML that could inject malicious script or embed objects on the page. You should also scrub out style blocks, which might mess up the markup on the page.
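A minimal sketch with Loofah (the sample markup is mine; the :prune scrubber removes non-safelisted nodes together with their content, including script and style blocks):

require 'loofah'

html = "<p>hello</p><script>alert('xss')</script><style>p { color: red }</style>"
clean = Loofah.fragment(html).scrub!(:prune).to_s
# => "<p>hello</p>"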

What language/tool should I use for HTML parsing?

I have a couple of websites that I want to extract data from, and based on previous experience, this isn't as easy as it sounds. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tags, etc.).
Considering that I have no constraints regarding the technology, language or tool I can use, what are your suggestions for easily parsing and extracting data from HTML pages? I have tried HTML Agility Pack and BeautifulSoup, and even these tools aren't perfect (HTML Agility Pack is buggy, and BeautifulSoup's parsing engine doesn't work with the pages I am passing to it).
You can use pretty much any language you like; just don't try to parse HTML with regular expressions.
So let me rephrase that and say: you can use any language you like that has an HTML parser, which is pretty much everything invented in the last 15-20 years.
If you're having issues with particular pages I suggest you look into repairing them with HTML Tidy.
I think hpricot (linked by Colin Pickard) is ace. Add scrubyt to the mix and you get a great HTML scraping and browsing interface with the text-matching power of Ruby: http://scrubyt.org/
Here is some example code from http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb:
require 'rubygems'
require 'scrubyt'

# Simple example of scraping basic
# information from a public Twitter
# account.
# Scrubyt.logger = Scrubyt::Logger.new

twitter_data = Scrubyt::Extractor.define do
  fetch 'http://www.twitter.com/scobleizer'

  profile_info '//ul[@class="about vcard entry-author"]' do
    full_name "//li//span[@class='fn']"
    location  "//li//span[@class='adr']"
    website   "//li//a[@class='url']/@href"
    bio       "//li//span[@class='bio']"
  end
end

puts twitter_data.to_xml
If you can use Java, the open-source library jsoup would be a pretty good solution for you.
hpricot may be what you are looking for.
You may try PHP's DOMDocument class. It has a couple of methods for loading HTML content, and I usually make use of this class. My advice is to prepend a DOCTYPE element to the HTML in case it doesn't have one, and to inspect in Firebug the HTML that results after parsing. In some cases, where invalid markup is encountered, DOMDocument does a bit of rearranging of the HTML elements. Also, if there's a meta tag specifying the charset inside the source, be aware that it will be used internally by libxml when parsing the markup. Here's a little example:
$html = file_get_contents('http://example.com');
$dom = new DOMDocument;
// Collect libxml warnings about invalid markup instead of emitting them
$oldValue = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors($oldValue);
echo $dom->saveHTML();
Any language which works with HTML at the DOM level is good.
For Perl, it's the HTML::TreeBuilder module.