I have an HTML structure like this:
<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>
I know how to get a Nokogiri::XML::NodeSet from this:
dom.xpath("//div")
I now want to filter out any script tag:
dom.xpath("//script")
So I can get something like:
<div>
This is
<p> very</p>
important.
</div>
So that I can call div.text to get:
"This is very important."
I tried recursively and iteratively walking all the child nodes and dropping every node I don't want, but I ran into problems such as too much or too little whitespace. I'm quite sure there's a cleaner, more Ruby-like way.
What would be a good way to do this?
NodeSet contains the remove method which makes it easy to remove whatever matched your selector:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div><p>foo</p><p>bar</p></div>
</body>
</html>
EOT
doc.search('p').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <div></div>
# >> </body>
# >> </html>
Applied to your sample input:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>
EOT
doc.search('script').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <div>
# >> This is
# >> <p> very
# >>
# >> </p>
# >> important.
# >> </div>
# >> </body></html>
At that point the text in the <div> is:
doc.at('div').text # => "\n This is\n very\n \n \n important.\n"
Normalizing that is easy:
doc.at('div').text.gsub(/[\n ]+/,' ').strip # => "This is very important."
1st problem
To remove all the script nodes:
require 'nokogiri'
html = "<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>"
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
p doc.text
#=> "\n This is\n very\n \n \n important.\n"
Thanks to @theTinMan for the tip (calling remove on the whole NodeSet instead of on each Node).
2nd problem
To remove the unneeded whitespace, you can use:
strip to remove whitespace (spaces, tabs, newlines, ...) at the beginning and end of the string
gsub to replace runs of whitespace with a single space
p doc.text.strip.gsub(/[[:space:]]+/,' ')
#=> "This is very important."
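One detail worth knowing: String#strip only trims ASCII whitespace, while [[:space:]] also matches Unicode whitespace such as the non-breaking space that often shows up in HTML. A minimal sketch (squeeze_whitespace is a hypothetical name) that collapses first and strips afterwards, so such characters at the edges are handled too:

```ruby
# Hypothetical helper: collapse every run of whitespace (including Unicode
# whitespace such as U+00A0) into one space, then trim the ends.
def squeeze_whitespace(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

puts squeeze_whitespace("\n This is\n very\n \n \n important.\n")
puts squeeze_whitespace("a\t\tb\u00A0 c")
```

Collapsing before stripping matters only when the text starts or ends with non-ASCII whitespace; for plain spaces and newlines both orders give the same result.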
I have successfully parsed through an HTML document and pulled the elements I want from it using this command:
#!/bin/bash
# ParseHtml.sh
grep -o '<h2 .*>.*</h2>' Path/to/html/report.html | sed 's/\(<h2 .*>\|<\/h2>\)//g' > parseResults.txt
Here is the output from the above parse command:
<h2 id="test-count"><span class="number">1704</span> pass</h2>
<h2 id="fail-count"><span class="number">163</span> failures</h2>
What I am looking to do is take the output from the parsing command and insert and replace it between two <BODY> tags:
<j:jelly xmlns:j="jelly:core" xmlns:bsh="jelly:beanshell" xmlns:st="jelly:stapler" xmlns:d="jelly:define">
<!--This is a comment. Comments are not displayed in the browser-->
<BODY>
<!--Insert and replace any pre-existing HTML --->
</BODY>
</j:jelly>
What command would I use to achieve this? I'm having trouble doing it with sed, and I would like to stick to a method that works within bash. Any help is greatly appreciated.
Here is one of many ways (if I understand the question correctly).
If your template file (let's call it templ.html) contains:
<j:jelly xmlns:j="jelly:core" xmlns:bsh="jelly:beanshell" xmlns:st="jelly:stapler" xmlns:d="jelly:define">
<!--This is a comment. Comments are not displayed in the browser-->
<BODY>
<!--Insert and replace any pre-existing HTML --->
</BODY>
</j:jelly>
and your parseResults.txt contains
<h2 id="test-count"><span class="number">1704</span> pass</h2>
<h2 id="fail-count"><span class="number">163</span> failures</h2>
then the following bash script
#!/bin/bash
template=$(<templ.html)
results=$(<parseResults.txt)
echo "${template//<!--Insert and replace any pre-existing HTML --->/$results}"
produces
<j:jelly xmlns:j="jelly:core" xmlns:bsh="jelly:beanshell" xmlns:st="jelly:stapler" xmlns:d="jelly:define">
<!--This is a comment. Comments are not displayed in the browser-->
<BODY>
<h2 id="test-count"><span class="number">1704</span> pass</h2>
<h2 id="fail-count"><span class="number">163</span> failures</h2>
</BODY>
</j:jelly>
#Begin html parse. Find all h2 tags with class info and strip them out. Find all span tags (ending in R in my case) and strip them out. Finally, write the parsed HTML into a file called "iOS_Unit_Test_Results.txt" (which is essentially just a sequence of numbers)
grep -o '<h2 .*>.*</h2>' $dirpath/something/htmltoparse.html | sed 's/\(<h2 .*>\|<\/h2>\)//g' | grep -o 'r">.*<' | grep -o 'r">.*</span>' | sed 's/r">//' | sed 's/<\/span>//' > ~/.jenkins/email-templates/iOS_Unit_Test_Results.txt
IFS=$'\n'
s=0
#Create variable to write beginning header information.
BEGINNING_HTML="<j:jelly xmlns:j=\"jelly:core\" xmlns:bsh=\"jelly:beanshell\" xmlns:st=\"jelly:stapler\" xmlns:d=\"jelly:define\">\n<!--This is a comment. Comments are not displayed in the browser-->\n<BODY>"
ENDING_HTML="</BODY>\n</j:jelly>"
MODIFY_HTML=""
#Loop through the sequenced number file from the HTML parse above
for number in $(cat ~/.jenkins/email-templates/iOS_Unit_Test_Results.txt); do
#If it's the very first element, then it's the total number of tests.
if [ $s -eq 0 ]; then
MODIFY_HTML="$MODIFY_HTML <h2>Number of Tests: $number </h2>\n"
echo $MODIFY_HTML
fi
#If it's the second element, then it's the number of unit test failures.
if [ $s -eq 1 ]; then
MODIFY_HTML="$MODIFY_HTML <h2>Number of Failures: $number </h2>\n"
echo $MODIFY_HTML
fi
s=$((s+1))
done
#Write html file by concatenating variables.
printf "$BEGINNING_HTML\n$MODIFY_HTML\n$ENDING_HTML" > ~/.jenkins/email-templates/iphoneUnitTest.jelly
I'm trying to clean up some CMS entered HTML that has extraneous paragraph tags and br tags everywhere. The Sanitize gem has proved very useful to do this but I am stuck with a particular issue.
The problem is when there is a br tag directly after the opening paragraph tag or directly before the closing one, e.g.:
<p>
<br />
Some text here
<br />
Some more text
<br />
</p>
I would like to strip out the extraneous first and last br tags, but not the middle one.
I'm very much hoping I can use a sanitize transformer to do this but can't seem to find the right matcher to achieve this.
Any help would be much appreciated.
Here's how to locate the particular <br> nodes that are contained by <p>:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>
<br />
Some text here
<br />
Some more text
<br />
</p>
EOT
doc.search('p > br').map(&:to_html)
# => ["<br>", "<br>", "<br>"]
Once we know we can find them, it's easy to remove specific ones:
br_nodes = doc.search('p > br')
br_nodes.first.remove
br_nodes.last.remove
doc.to_html
# => "<p>\n \n Some text here\n <br>\n Some more text\n \n</p>\n"
Notice that Nokogiri removed them, but their associated Text nodes that are their immediate siblings, containing their "\n" are left behind. A browser will gobble those up and not display the line-ends, but you might be feeling OCD, so here's how to remove those also:
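br_nodes = doc.search('p > br')
[br_nodes.first, br_nodes.last].each do |br|
  sibling = br.next_sibling
  # Drop the neighbor only when it is whitespace-only text; removing it
  # unconditionally would also delete "Some text here" after the first <br>
  sibling.remove if sibling && sibling.text? && sibling.content.strip.empty?
  br.remove
end
doc.to_html
# => the middle <br> and both lines of text are kept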
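# Sanitize transformer that strips <br> tags from the very start of each <p>
initial_linebreak_transformer = lambda { |options|
  node = options[:node]
  if node && node.element? && node.name.downcase == 'p'
    first_child = node.children.first
    # Guard against empty paragraphs, which have no first child
    if first_child && first_child.name.downcase == 'br'
      first_child.unlink
      # Run again on the same node in case several <br> tags lead the paragraph
      initial_linebreak_transformer.call options
    end
  end
}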
Well this is probably kind of a silly question but I'm wondering if there's any way to have the generated markup in Jekyll to preserve the indentation of the Liquid-tag. World doesn't end if it isn't solvable. I'm just curious since I like my code to look tidy, even if compiled. :)
For example I have these two:
base.html:
<body>
    <div id="page">
        {{content}}
    </div>
</body>
index.md:
---
layout: base
---
<div id="recent_articles">
    {% for post in site.posts %}
    <div class="article_puff">
        <img src="/resources/images/fancyi.jpg" alt="" />
        <h2>{{post.title}}</h2>
        <p>{{post.description}}</p>
        Read more
    </div>
    {% endfor %}
</div>
The problem is that the content imported by the {{content}} tag is rendered without the indentation used above.
So instead of
<body>
    <div id="page">
        <div id="recent_articles">
            <div class="article_puff">
                <img src="/resources/images/fancyimage.jpg" alt="" />
                <h2>Gettin' down with responsive web design</h2>
                <p>Everyone's talking about it. Your client wants it. You need to code it.</p>
                Read more
            </div>
        </div>
    </div>
</body>
I get
<body>
    <div id="page">
        <div id="recent_articles">
    <div class="article_puff">
        <img src="/resources/images/fancyimage.jpg" alt="" />
        <h2>Gettin' down with responsive web design</h2>
        <p>Everyone's talking about it. Your client wants it. You need to code it.</p>
        Read more
    </div>
</div>
    </div>
</body>
Seems like only the first line is indented correctly. The rest starts at the beginning of the line... So, multiline liquid-templating import? :)
Using a Liquid Filter
I managed to make this work using a liquid filter. There are a few caveats:
Your input must be clean. I had some curly quotes and non-printable chars that looked like whitespace in a few files (copypasta from Word or some such) and was seeing "Invalid byte sequence in UTF-8" as a Jekyll error.
It could break some things. I was using <i class="icon-file"></i> icons from twitter bootstrap. It replaced the empty tag with <i class="icon-file"/> and bootstrap did not like that. Additionally, it screws up the octopress {% codeblock %}s in my content. I didn't really look into why.
While this will clean the output of a liquid variable such as {{ content }} it does not actually solve the problem in the original post, which is to indent the html in context of the surrounding html. This will provide well formatted html, but as a fragment that will not be indented relative to tags above the fragment. If you want to format everything in context, use the Rake task instead of the filter.
require 'rubygems'
require 'json'
require 'nokogiri'
require 'nokogiri-pretty'
module Jekyll
module PrettyPrintFilter
def pretty_print(input)
#seeing some ASCII-8 come in
input = input.encode("UTF-8")
#Parsing with nokogiri first cleans up some things the XSLT can't handle
content = Nokogiri::HTML::DocumentFragment.parse input
parsed_content = content.to_html
#Unfortunately nokogiri-pretty can't use DocumentFragments...
html = Nokogiri::HTML parsed_content
pretty = html.human
#...so now we need to remove the stuff it added to make valid HTML
output = PrettyPrintFilter.strip_extra_html(pretty)
output
end
def PrettyPrintFilter.strip_extra_html(html)
#type declaration
html = html.sub('<?xml version="1.0" encoding="ISO-8859-1"?>','')
#second <html> tag
first = true
html = html.gsub('<html>') do |match|
if first == true
first = false
next
else
''
end
end
#first </html> tag
html = html.sub('</html>','')
#second <head> tag
first = true
html = html.gsub('<head>') do |match|
if first == true
first = false
next
else
''
end
end
#first </head> tag
html = html.sub('</head>','')
#second <body> tag
first = true
html = html.gsub('<body>') do |match|
if first == true
first = false
next
else
''
end
end
#first </body> tag
html = html.sub('</body>','')
html
end
end
end
Liquid::Template.register_filter(Jekyll::PrettyPrintFilter)
Using a Rake task
I use a task in my rakefile to pretty print the output after the jekyll site has been generated.
require 'nokogiri'
require 'nokogiri-pretty'
desc "Pretty print HTML output from Jekyll"
task :pretty_print do
  #change public to _site or wherever your output goes
  html_files = File.join("**", "public", "**", "*.html")
  Dir.glob html_files do |html_file|
    puts "Cleaning #{html_file}"
    contents = File.read(html_file)
    begin
      #we're gonna parse it as XML so we can apply an XSLT
      html = Nokogiri::XML(contents)
      #the human() method is from nokogiri-pretty. Just an XSL transform on the XML.
      pretty_html = html.human
    rescue StandardError => msg
      puts "Failed to pretty print #{html_file}: #{msg}"
      #skip the write so a failed transform never empties the file
      next
    end
    #Yep, we're overwriting the file. Potentially destructive.
    File.write(html_file, pretty_html)
  end
end
We can accomplish this by writing a custom Liquid filter to tidy the html, and then doing {{content | tidy }} to include the html.
A quick search suggests that the ruby tidy gem may not be maintained but that nokogiri is the way to go. This will of course mean installing the nokogiri gem.
See advice on writing liquid filters, and Jekyll example filters.
An example might look something like this: in _plugins, add a script called tidy-html.rb containing:
require 'nokogiri'
module TextFilter
def tidy(input)
desired = Nokogiri::HTML::DocumentFragment.parse(input).to_html
end
end
Liquid::Template.register_filter(TextFilter)
(Untested)
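The filters above reformat the fragment on its own. To line it up with the surrounding layout, one option is to pad every line after the first, since the first line already inherits the indentation of the {{content}} tag. The following is a minimal sketch in plain Ruby; IndentFilter and indent are hypothetical names, and the width has to be chosen by hand to match the layout.

```ruby
# Hypothetical filter module, not part of Jekyll or Liquid: pads every line
# after the first with `width` spaces so multi-line content lines up with
# wherever its first line was placed.
module IndentFilter
  def indent(input, width)
    padding = ' ' * width.to_i
    first, *rest = input.to_s.lines
    return '' if first.nil?
    # Blank lines are left alone to avoid trailing whitespace
    first + rest.map { |line| line.strip.empty? ? line : padding + line }.join
  end

  extend self  # lets the example call IndentFilter.indent directly
end

puts IndentFilter.indent("<div>\n  <p>hi</p>\n</div>\n", 4)
```

If the module were registered with Liquid::Template.register_filter(IndentFilter), as the other filters are, a layout could presumably use something like {{ content | indent: 8 }}; that usage is an assumption and has not been tested against Jekyll.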
I have this content:
<div class="CodeRay">
<div class="code"><pre>puts <span style="background-color:#fff0f0;color:#D20"><span style="color:#710">"</span><span style="">Hello, world!</span><span style="color:#710">"</span></span></pre></div>
</div>
and I want to add it to a HTML document using Nokogiri:
File.open("frame2.html", "r") do |file|
doc = Nokogiri::HTML.parse(file)
end
doc.at_css("body") = content # this is my content
puts doc.to_html
The content then gets transformed to this:
&lt;div class="CodeRay"&gt;
&lt;div class="code"&gt;&lt;pre&gt;puts &lt;span style="background-color:#fff0f0;color:#D20"&gt;&lt;span style="color:#710"&gt;"&lt;/span&gt;&lt;span style=""&gt;Hello, world!&lt;/span&gt;&lt;span style="color:#710"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
The rest of the HTML file is OK. The question is: why does Nokogiri strip the content? Why does it transform it into HTML entities?
I reformatted your inner HTML to make it a bit more readable as a sample.
Nokogiri isn't stripping anything, it's only encoding the content being added because you're telling it to.
Unless you tell Nokogiri the new text is already HTML it will assume you are adding text, and, since the text contains characters that should be encoded, it encodes it for you.
Here's how to do what you really want:
require "nokogiri"
html = '<div class="CodeRay">
<div class="code">
<pre>puts <span style="background-color:#fff0f0;color:#D20">
<span style="color:#710">"</span>
<span style="">Hello, world!</span>
<span style="color:#710">"</span>
</span>
</pre>
</div>
</div>'
doc = Nokogiri::HTML('<html><body></body></html>')
doc.at('body').inner_html = html
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><div class="CodeRay">
# >> <div class="code">
# >> <pre>puts <span style="background-color:#fff0f0;color:#D20">
# >> <span style="color:#710">"</span>
# >> <span style="">Hello, world!</span>
# >> <span style="color:#710">"</span>
# >> </span>
# >> </pre>
# >> </div>
# >> </div></body></html>
I have an HTML document with links, for example:
<html>
<body>
<ul>
<li>teste1</li>
<li>teste2</li>
<li>teste3</li>
</ul>
</body>
</html>
Using Ruby on Rails, with Nokogiri or some other method, I want to end up with a final document like this:
<html>
<body>
<ul>
<li>teste1</li>
<li>teste2</li>
<li>teste3</li>
</ul>
</body>
</html>
What's the best strategy to achieve this?
If you choose to use Nokogiri, I think this should work:
require 'cgi'
require 'nokogiri'
file_path = "your_page.html"
doc = Nokogiri::HTML(File.read(file_path))
# Only visit anchors that actually carry an href attribute
doc.css("a[href]").each do |link|
  link["href"] = "http://myproxy.com/?url=#{CGI.escape(link["href"])}"
end
# The block form closes (and flushes) the file once the document is written
File.open(file_path, 'w') { |file| doc.write_to(file) }
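The URL rewriting itself can be tried without any HTML. In this sketch, proxied is a hypothetical helper and the prefix is the one used in the answer above; the guard against already rewritten links is an addition so that running the script twice does not wrap a link twice.

```ruby
require 'cgi'

# Prefix taken from the answer above; adjust it to your own proxy.
PROXY_PREFIX = "http://myproxy.com/?url="

# Hypothetical helper: wraps a URL so that it goes through the proxy.
# Links that already point at the proxy are returned unchanged.
def proxied(url)
  return url if url.start_with?(PROXY_PREFIX)
  PROXY_PREFIX + CGI.escape(url)
end

puts proxied("http://example.com/a b?x=1&y=2")
```

CGI.escape encodes the reserved characters and turns spaces into +, which keeps the original query string from being confused with the proxy's own parameters.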
If I'm not mistaken, Rails loads REXML by default; depending on what you're trying to do, you could use that as well.
Here is what I did for replacing images src attributes:
doc = Nokogiri::HTML(html)
doc.xpath("//img").each do |img|
img.attributes["src"].value = Absolute_asset_path(img.attributes["src"].value)
end
doc.to_html # simply use .to_html to convert back to HTML
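Absolute_asset_path is the answerer's own helper and is not shown. A minimal sketch of what such a helper might do, assuming assets are served from a single base URL (ASSET_BASE and absolute_asset_path are hypothetical names, and the host is a placeholder):

```ruby
require 'uri'

# Hypothetical base URL; replace it with wherever your assets are served from.
ASSET_BASE = "http://example.com/assets/"

# Hypothetical stand-in for the helper used above: resolves `src` against
# ASSET_BASE the same way a browser resolves a relative link.
def absolute_asset_path(src)
  URI.join(ASSET_BASE, src).to_s
end

puts absolute_asset_path("img/logo.png")                  # relative to the base directory
puts absolute_asset_path("/img/logo.png")                 # root-relative, replaces the base path
puts absolute_asset_path("http://cdn.example.org/a.png")  # already absolute, left unchanged
```

URI.join raises an error for values that are not valid URIs (for example, paths containing spaces), so real-world src attributes may need escaping first.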