Strip text from HTML document using Ruby

There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri both have inner_text methods that remove all the HTML for you quickly and easily.
What I am trying to do is the opposite, remove all the text from an HTML document, leaving just the tags and their attributes.
I considered looping through the document and setting inner_html to nil, but you would really have to do this in reverse, since the first element (the root) has an inner_html of the entire rest of the document. Ideally I'd start at the innermost element and set inner_html to nil while moving up through the ancestors.
Does anyone know a neat little trick for doing this efficiently? I was thinking regexes might do it, though probably not as efficiently as an HTML tokenizer/parser would.

This works too:
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").remove
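For example (a minimal sketch; Nokogiri::HTML wraps fragments in html/body, and the serializer may add newlines of its own):
require 'nokogiri'

doc = Nokogiri::HTML("<div>foo bar</div><p>I like <em>this</em> stuff</p>")
doc.xpath("//text()").remove     # unlink every text node in place
puts doc.at('body').inner_html   # => "<div></div><p><em></em></p>"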

You can scan the string to create an array of "tokens", and then only select those that are HTML tags:
>> some_html
=> "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
>> some_html.scan(/<\/?[^>]+>|[\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+/).select { |t| t =~ /<\/?[^>]+>/ }.join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
Edit:
Or even better, just scan for HTML tags ;)
>> some_html.scan(/<\/?[^>]+>/).join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"

To grab everything not in a tag, you can use Nokogiri like this:
doc.search('//text()').text
Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:
blacklist = ['title', 'script', 'style']
nodelist = doc.search('//text()')
blacklist.each do |tag|
  nodelist -= doc.search('//' + tag + '/text()')
end
nodelist.text
You could also whitelist if you preferred, but that's probably going to be more time-intensive:
whitelist = ['p', 'span', 'strong', 'i', 'b'] #The list goes on and on...
nodelist = Nokogiri::XML::NodeSet.new(doc)
whitelist.each do |tag|
  nodelist += doc.search('//' + tag + '/text()')
end
nodelist.text
You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.
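For what it's worth, the blacklist version collapses into one such expression (same illustrative tag list as above):
nodelist = doc.search("//text()[not(ancestor::title or ancestor::script or ancestor::style)]")
nodelist.text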

I just came up with this, but @andre-r's solution is so much better!
#!/usr/bin/env ruby
require 'nokogiri'
def strip_text(doc)
  Nokogiri(doc).tap { |doc|
    doc.traverse do |node|
      node.content = nil if node.text?
    end
  }.to_s
end
require 'test/unit'
require 'yaml'
class TestHTMLStripping < Test::Unit::TestCase
  def test_that_all_text_gets_stripped_from_the_document
    dirty, clean = YAML.load DATA
    assert_equal clean, strip_text(dirty)
  end
end
__END__
---
- |
  <!DOCTYPE html>
  <html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
  <head>
  <meta http-equiv='Content-type' content='text/html; charset=UTF-8' />
  <title>Test HTML Document</title>
  <meta http-equiv='content-language' content='en' />
  </head>
  <body>
  <h1>Test <abbr title='Hypertext Markup Language'>HTML</abbr> Document</h1>
  <div class='main'>
  <p>
  <strong>Test</strong> <abbr title='Hypertext Markup Language'>HTML</abbr> <em>Document</em>
  </p>
  </div>
  </body>
  </html>
- |
  <!DOCTYPE html>
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title></title>
  <meta http-equiv="content-language" content="en">
  </head>
  <body><h1><abbr title="Hypertext Markup Language"></abbr></h1><div class="main"><p><strong></strong><abbr title="Hypertext Markup Language"></abbr><em></em></p></div></body>
  </html>

Related

Issue parsing HTML using Nokogiri

I have some HTML and wish to get the content under the <body> element. However, no matter what I try, after the HTML is parsed using Nokogiri, everything inside the doctype and <head> becomes part of the <body> element, so when I retrieve <body> I also see the doctype content and the <meta> and <script> tags.
My original HTML is:
<!DOCTYPE html \"about:legacy-compat\">
<html>
<head>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
<title>Some Title</title>
<meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
<link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
<script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
</head>
<body marginwidth=\"6\" marginheight=\"6\" leftmargin=\"6\" topmargin=\"6\">
<div class=\"hello-status\">Hello World</div>
<div valign=\"top\"></div>
</body>
</html>
The solution I am using is:
parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html
What am I getting:
<p>about:legacy-compat"></p>
\n
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
\n
<title>Some title</title>
\n
<meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
\n
<link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
\n<script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
<div class="hello-status">Hello World</div>
\n
<div valign="top">\n\n</div>
What am I expecting:
<div class=\"hello-status\">Hello World</div>
\n
<div valign=\"top\">\n\n</div>
Any idea what's happening here?
I got your example to work by first cleaning up the original HTML. I removed the "about:legacy-compat" from the doctype, which seemed to be messing Nokogiri up:
# clean up the junk in the doctype
my_html.sub!("\"about:legacy-compat\"", "")
# parse and get the body
parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html
# => "\n <div class=\"hello-status\">Hello World</div>\n <div valign=\"top\"></div>\n "
In general, when you're parsing potentially dirty third-party data such as HTML, you should clean it up first so the parser doesn't choke and do unexpected things. You could run the HTML through a linter or "tidy" tool to try to clean it up automatically. When all else fails, you'll have to clean it by hand, as above.
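If the problems are as small as the one above, plain string surgery plus a Nokogiri round trip may be all the "tidying" you need (a sketch reusing my_html from above; the re-parse lets libxml2 normalize whatever else is off):
# strip the known-bad token, then parse and re-serialize
my_html.sub!("\"about:legacy-compat\"", "")
normalized_html = Nokogiri::HTML(my_html).to_html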

CL-WHO HTML generator to file

I'm trying to write generated HTML to a file. I'm using with-html-output-to-string, but I can't figure out how to get the file output working. I'm not sure if I should use a file stream or with-open-file, or how to get the syntax right. I've been messing with this for a day, but the code just doesn't run.
CL-USER> (who:with-html-output-to-string (out nil :prologue t :indent t)
           (:html
            (:head
             (:title "home"))
            (:body
             (:p "Hello cl."))))
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">
<html>
<head>
<title>home
</title>
</head>
<body>
<p>Hello cl.
</p>
</body>
</html>"

Changing the content-type using Nokogiri doesn't work

I want to change the charset in the "http-equiv" content-type tag. Because I'm working with Nokogiri in other parts of my code I'd like to use it for this processing step too.
This is example code:
http_equiv = doc.at('meta[@http-equiv]')
if !http_equiv.nil? && !http_equiv["http-equiv"].nil? && http_equiv["http-equiv"].downcase.eql?("content-type")
  http_equiv["content"] = "text/html; charset=utf-8"
end
content = doc.to_html.encode(Encoding::UTF_8)
The problem is that the input content is always the same as the output content. Nokogiri didn't do anything.
Based on an answer, I created a real-world example, which doesn't work, in contrast to the generated example:
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.spiegel.de/politik/deutschland/hooligans-gegen-salafisten-demo-in-koeln-eskaliert-a-999401.html"))
content_type = doc.at('meta[@http-equiv="Content-Type"]')
content_type['content'] = 'text/html; charset=UTF-8'
puts doc.to_html
I'd do something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<head>
<meta http-equiv="content-type" content="text/html">
</head>
<body>
foo
</body>
</html>
EOT
content_type = doc.at('meta[@http-equiv="content-type"]')
content_type['content'] = 'text/html; charset=UTF-8'
puts doc.to_html
Running that outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=US-ASCII">
</head>
<body>
foo
</body>
</html>
You can also do
content_type['content'] << '; charset=UTF-8'
if you're only appending to the existing value.
It doesn't change the content-type.
It changes the content type in the tag. However, there is more to it, since it seems you don't want to change just the content-type marker, you want to change the encoding of the document itself at output. Once you do that, Nokogiri will also change the meta tag to match:
doc.to_html(encoding: 'UTF-8')
will tell Nokogiri to output the HTML, trying to convert from ISO-8859-1 to UTF-8. There is no guarantee that will occur correctly though, because there are some incompatibilities.
Your original attempt using:
content = doc.to_html.encode(Encoding::UTF_8)
won't work correctly, because of HTML encoding that occurs on special characters. You have to change the character encoding before they are HTML-encoded, which should happen if you use to_html(encoding: 'UTF-8').
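A side-by-side sketch of the two approaches (per the answer above, the meta tag in the first output tracks the encoding passed to to_html):
require 'nokogiri'

doc = Nokogiri::HTML('<meta http-equiv="content-type" content="text/html">')

# Transcode while serializing: entities and the meta tag are written
# for UTF-8.
good = doc.to_html(encoding: 'UTF-8')

# Transcode after serializing: special characters were already
# HTML-encoded for the document's original charset.
bad = doc.to_html.encode(Encoding::UTF_8)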

Stop Nokogiri from adding DOCTYPE and meta tags?

I'm trying to use Nokogiri to convert some template files from one format to another. But it keeps adding tags. I'm trying to prevent it from adding Doctype and meta tags, but can't figure it out. I've tried
@doc = Nokogiri::HTML.parse(r)
but that adds the tags. I've also tried
@doc = Nokogiri::HTML.fragment(r)
as suggested in "How to prevent Nokogiri from adding <DOCTYPE> tags?", but that removes any <html>, <head>, or <body> tags that are in the document.
If it matters, my code for reading the file is:
f = File.read(infile)
r = f.gsub(/<tmpl_var ([^>]*)>/, '{{{\1}}}')
@doc = Nokogiri::HTML.fragment(r)
I need to do a gsub beforehand because I need to replace <tmpl_var> tags which aren't proper HTML and cause more problems.
When using HTML.fragment(r), I do get an htmlParseStartTag: misplaced <html> tag error (as well as similar errors for <body> and <head>).
Is there a way to prevent it from making these additions?
An example conversion:
Before:
<html>
<head>
<script>
var x = "y";
</script>
</head>
<body>
<div>
Stuff
</div>
</body>
</html>
After using Parse:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script>
var x = "y";
</script>
</head>
<body>
<div>
Stuff
</div>
</body>
</html>
After using HTML.fragment or HTML::DocumentFragment.parse:
<script>
var x = "y";
</script>
<div>
Stuff
</div>
In this case, I want it to just output the before section. (In the real script I make a bunch of changes though).
Nokogiri can be told to not add the standard HTML headers. Consider these:
require 'nokogiri'
doc = Nokogiri::HTML('<p>foo</p>')
doc.to_html # => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"
doc = Nokogiri::HTML.fragment('<p>foo</p>')
doc.to_html # => "<p>foo</p>"
tmpl_var is a bad tag name in HTML, as is {{{\1}}}, so asking Nokogiri to try to parse either will result in problems:
doc = Nokogiri::HTML.fragment('<templ_var p1="baz">foo</templ_var>')
doc.errors # => [#<Nokogiri::XML::SyntaxError: Tag templ_var invalid>]
But you can still munge the DOM:
doc.to_html # => "<templ_var p1=\"baz\">foo</templ_var>"
doc.search('templ_var').each { |t| t.name = 'bar'}
doc.to_html # => "<bar p1=\"baz\">foo</bar>"
Or, starting from a fresh fragment:
doc = Nokogiri::HTML.fragment('<div><templ_var p1="baz">foo</templ_var></div>')
doc.to_html # => "<div><templ_var p1=\"baz\">foo</templ_var></div>"
doc.search('templ_var').each { |t| t.replace('{{{\1}}}') }
doc.to_html # => "<div>{{{\\1}}}</div>"
Putting that stuff together, plus a bit of chicanery:
doc = Nokogiri::HTML.fragment('<div><templ_var p1="baz">foo</templ_var></div>')
doc.to_html # => "<div><templ_var p1=\"baz\">foo</templ_var></div>"
doc.search('templ_var').each { |t| t.replace('{{{\1}}}') }
doc.to_html # => "<div>{{{\\1}}}</div>"
header = Nokogiri::XML.fragment('<html><body>')
header.at('body').children = doc
header.to_html # => "<html><body><div>{{{\\1}}}</div></body></html>"
So, I'd go after it with something like that.
Now, why is Nokogiri stripping the <html> tag when parsing a fragment? I don't know. It leaves <body> alone if <head> or <html> is missing:
Nokogiri::HTML.fragment('<p>foo<p>').to_html
# => "<p>foo</p><p></p>"
Nokogiri::HTML.fragment('<body><p>foo<p></body>').to_html
# => "<body>\n<p>foo</p>\n<p></p>\n</body>"
But it gets funky if <head> or <html> exists:
Nokogiri::HTML.fragment('<head><style></style></head><body><p>foo<p></body>').to_html
# => "<style></style><p>foo</p><p></p>"
Nokogiri::HTML.fragment('<html><head><style></style></head><body><p>foo<p></body></html>').to_html
# => "<style></style><p>foo</p><p></p>"
That smells like a bug in Nokogiri to me as I haven't seen anything to document that behavior.
You can get around this by using Nokogiri::XML::DocumentFragment instead of Nokogiri::HTML::DocumentFragment. The XML version won't remove the html, head, or body tags.
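A quick check of that workaround (a sketch; serialization details such as indentation may vary):
require 'nokogiri'

html = '<html><head><title>t</title></head><body><p>foo</p></body></html>'
puts Nokogiri::XML::DocumentFragment.parse(html).to_s
# The html, head and body tags survive the round trip intact.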

Find a C# HTML parser to find all <script> tags and give me the line and position info

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>title</title>
</head>
<body>
I want to get this text
<script>
var test=function()
{}
</script>
</body>
</html>
and the result is:
line: 7,
position: 4,
content:
var test=function()
{}
Have you tried the HTML Agility Pack?
It typically works quite well and gives you a nice, intuitive interface for parsing HTML content.
You should be able to use it something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load("yourfile.html");
foreach (HtmlNode script in doc.DocumentNode.SelectNodes("//script"))
{
    // do something with your script nodes, e.g. read
    // script.Line, script.LinePosition and script.InnerHtml
}