Changing href attributes with Nokogiri and Ruby on Rails

I have an HTML document with links, for example:
<html>
<body>
<ul>
<li>teste1</li>
<li>teste2</li>
<li>teste3</li>
</ul>
</body>
</html>
Using Ruby on Rails, with Nokogiri or some other method, I want to end up with a document like this:
<html>
<body>
<ul>
<li>teste1</li>
<li>teste2</li>
<li>teste3</li>
</ul>
</body>
</html>
What's the best strategy to achieve this?

If you choose to use Nokogiri, I think this should work:
require 'cgi'
require 'nokogiri'

file_path = "your_page.html"
doc = Nokogiri::HTML(File.read(file_path))
# Only touch anchors that actually have an href; a bare <a> would otherwise raise
doc.css("a[href]").each do |link|
  link["href"] = "http://myproxy.com/?url=#{CGI.escape(link["href"])}"
end
# File.write closes the file once the document has been written
File.write(file_path, doc.to_html)
If I'm not mistaken, Rails loads REXML by default; depending on what you're trying to do, you could use that instead.
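To make the rewriting step in the Nokogiri answer more concrete, here is a minimal sketch of what CGI.escape does to a link target before it is appended to the proxy URL. The proxy prefix is the one from the answer; the helper name proxied_url and the sample link are made up for illustration.

```ruby
require 'cgi'

PROXY_PREFIX = "http://myproxy.com/?url="

# Hypothetical helper: wraps an original link target in the proxy URL.
def proxied_url(href)
  "#{PROXY_PREFIX}#{CGI.escape(href)}"
end

original = "http://example.com/page?id=1&lang=en"
rewritten = proxied_url(original)
puts rewritten

# The proxy side can recover the original target by unescaping the parameter.
recovered = CGI.unescape(rewritten.delete_prefix(PROXY_PREFIX))
puts recovered == original
```

Escaping matters because the original link may carry its own ? and & characters, which would otherwise be read as parameters of the proxy URL.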

Here is what I did to replace the src attributes of images:
doc = Nokogiri::HTML(html)
doc.xpath("//img").each do |img|
img.attributes["src"].value = Absolute_asset_path(img.attributes["src"].value)
end
doc.to_html # simply use .to_html to convert back to HTML

Related

Finding a <div> block with an 'id' and 'class' using Nokogiri

How can I search for the following block using Nokogiri:
<div id="live_list_cat_16" class="football-block sport-block" style="display:block;">
</div>
Try this
doc.search('div#foo.bar')
How does this work?
The search and at methods both accept CSS queries.
div#foo finds a div with id foo.
div.bar finds a div with class bar.
div#foo.bar combines the two: a div with id foo and class bar.
You can use #some_id as the CSS selector.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div id="foo" class="bar">text</div>
<div id="foo2" class="bar">more_text</div>
</body>
</html>
EOT
doc.search('#foo').to_html # => "<div id=\"foo\" class=\"bar\">text</div>"
doc.search('div.bar').to_html # => "<div id=\"foo\" class=\"bar\">text</div><div id=\"foo2\" class=\"bar\">more_text</div>"
Remember, a particular ID is only allowed to exist once in the document.

How to remove a node using Nokogiri

I have an HTML structure like this:
<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>
I know how to get a Nokogiri::XML::NodeSet from this:
dom.xpath("//div")
I now want to filter out any script tag:
dom.xpath("//script")
So I can get something like:
<div>
This is
<p> very</p>
important.
</div>
So that I can call div.text to get:
"This is very important."
I tried recursively/iteratively walking all child nodes and filtering out any node I don't want, but I ran into problems like too much or too little whitespace. I'm quite sure there's a nice enough, Ruby-esque way.
What would be a good way to do this?
NodeSet has a remove method, which makes it easy to remove whatever matched your selector:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div><p>foo</p><p>bar</p></div>
</body>
</html>
EOT
doc.search('p').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <div></div>
# >> </body>
# >> </html>
Applied to your sample input:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>
EOT
doc.search('script').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <div>
# >> This is
# >> <p> very
# >>
# >> </p>
# >> important.
# >> </div>
# >> </body></html>
At that point the text in the <div> is:
doc.at('div').text # => "\n This is\n very\n \n \n important.\n"
Normalizing that is easy:
doc.at('div').text.gsub(/[\n ]+/,' ').strip # => "This is very important."
1st problem
To remove all the script nodes :
require 'nokogiri'
html = "<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>"
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
p doc.text
#=> "\n This is\n very\n \n \n important.\n"
Thanks to @theTinMan for his tip (calling remove on one NodeSet instead of each Node).
2nd problem
To remove the unneeded whitespace, you can use:
strip to remove whitespace (spaces, tabs, newlines, ...) at the beginning and end of the string
gsub to replace multiple whitespace characters with a single space
p doc.text.strip.gsub(/[[:space:]]+/,' ')
#=> "This is very important."
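The two answers above use slightly different patterns, /[\n ]+/ and /[[:space:]]+/. A small sketch of how they differ, using the text from the question with one tab added (the tab and the helper names are assumptions made for illustration):

```ruby
SAMPLE = "\n This is\n\tvery\n \n \n important.\n"

# Collapses only newlines and spaces, so a tab survives.
def collapse_narrow(text)
  text.gsub(/[\n ]+/, ' ').strip
end

# [[:space:]] also covers tabs and other whitespace characters.
def collapse_broad(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

p collapse_narrow(SAMPLE)
p collapse_broad(SAMPLE)
# split with no arguments splits on runs of whitespace and ignores
# leading and trailing whitespace, so it gives the same result here.
p SAMPLE.split.join(' ')
```

The broad version and the split/join version both yield "This is very important.", while the narrow version leaves the tab in place.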

Wrap all nodes in a DocumentFragment with a DIV

Given a simple DocumentFragment:
html = "<h1>Three's Company</h1><p>A love triangle.</p>"
doc = Nokogiri::HTML::DocumentFragment.parse html
Is there an elegant way to wrap everything the DocumentFragment holds with a DIV? Please note that I have to do this inside a method which is supposed to return a DocumentFragment instance doc which has been parsed elsewhere. I'd like doc.to_html to look something like this:
<div class="wrapper"><h1>Three's Company</h1><p>A love triangle.</p></div>
Thanks for your hints!
Here is what I found :
require 'nokogiri'
string = "<h1>Three's Company</h1><p>A love triangle.</p>"
doc = Nokogiri::HTML::DocumentFragment.parse "<div class='foo'>"
doc.at(".//div").inner_html = string
puts doc.to_html
output:
<div class="foo">
<h1>Three's Company</h1>
<p>A love triangle.</p>
</div>

Controlling the existence of an attribute

I have a problem with the Slim template engine in a Sinatra project. I have an edit form to be filled when the route is triggered. There is an issue with HTML select option. I need something like this when the edit form is loaded. Notice that Mrs. option is selected:
<select name="person[title]" id="person[title]">
<option value="Mr.">Mr.</option>
<option value="Mrs." selected>Mrs.</option>
</select>
I tried:
option[value="Mrs." "#{person.title == :mrs ? 'selected' : ''}"]
The exception was about an attribute error. Then I tried something like this:
option[value="Mrs." selected="#{person.title == :mrs ? true : false}"]
but then the output was something like this:
<option value="Mrs." selected="false">Mrs.</option>
I guess the string "false" is interpreted as true. That failed. I tried some combinations with round brackets but couldn't get it to work.
How could I set the selected attribute of an option in a select list in Slim?
For an attribute, you can write ruby code after the =, but if the ruby code has spaces in it, you have to put parentheses around the ruby code:
option[value="1" selected=("selected" if @title=="Mrs.")] "Mrs."
See "Ruby attributes" here: http://rdoc.info/gems/slim/frames.
The brackets are optional, so you can also write it like this:
option value="1" selected=("selected" if @title=="Mrs.") "Mrs."
Or, instead of brackets, you can use a different delimiter:
option {value="1" selected=("selected" if @title=="Mrs.")} "Mrs."
Here it is with some code:
slim.slim:
doctype html
html
  head
    title Slim Examples
    meta name="keywords" content="template language"
  body
    h1 Markup examples
    p This example shows you how a basic Slim file looks like.
    select
      option[value="1" selected=("selected" if @title=="Mr.")] "Mr."
      option[value="2" selected=("selected" if @title=="Mrs.")] "Mrs."
Using Slim in a standalone ruby program without rails:
require 'slim'
template = Slim::Template.new(
"slim.slim",
pretty: true #pretty print the html
)
class Person
  attr_accessor :title
  def initialize(title)
    @title = title
  end
end
person = Person.new("Mrs.")
puts template.render(person)
--output:--
<!DOCTYPE html>
<html>
<head>
<title>
Slim Examples
</title>
<meta content="template language" name="keywords" />
</head>
<body>
<h1>
Markup examples
</h1>
<p>
This example shows you how a basic Slim file looks like.
</p>
<select><option value="1">"Mr."</option><option selected="selected" value="2">"Mrs."</option></select>
</body>
</html>
I guess the string "false" is interpreted as true.
Yes. The only things that evaluate to false are false itself and nil. Any number(including 0), any string (including ""), and any array(including []), etc. are all true.
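A quick sketch of both points in plain Ruby: which values count as true, and why the ("selected" if ...) form can leave the attribute out. The helper name selected_for is made up for illustration.

```ruby
# Only false and nil are falsy; everything else counts as true.
samples = [false, nil, 0, "", [], "false"]
samples.each do |value|
  puts "#{value.inspect} is #{value ? 'truthy' : 'falsy'}"
end

# A trailing `if` evaluates to nil when the condition fails, so the
# expression is either the string "selected" or nil, never "false".
def selected_for(title, option)
  "selected" if title == option
end

p selected_for("Mrs.", "Mrs.")
p selected_for("Mrs.", "Mr.")
```

This matches the rendered output above, where the "Mr." option has no selected attribute at all.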
Not pertinent to your problem, but perhaps useful to some future searcher...I guess Slim looks up instance variables in whatever object you pass as an argument to render. So if you want to provide a whole bunch of values for the template, you can write:
require 'slim'
template = Slim::Template.new(
"slim.slim",
pretty: true #pretty print the html
)
class MyVals
  attr_accessor :count, :title, :animals
  def initialize(count, title, animals)
    @count = count
    @title = title
    @animals = animals
  end
end
# %w separates items by whitespace only; commas would become part of the strings
vals = MyVals.new(4, "Sir James III", %w[squirrel monkey cobra])
puts template.render(vals)
slim.slim:
doctype html
html
  head
    title Slim Examples
    meta name="keywords" content="template language"
  body
    p = @count
    p = @title
    p = @animals[-1]
Neither OpenStruct nor Struct work with render() even though they seem like natural candidates.

Indenting generated markup in Jekyll/Ruby

Well this is probably kind of a silly question but I'm wondering if there's any way to have the generated markup in Jekyll to preserve the indentation of the Liquid-tag. World doesn't end if it isn't solvable. I'm just curious since I like my code to look tidy, even if compiled. :)
For example I have these two:
base.html:
<body>
  <div id="page">
    {{content}}
  </div>
</body>
index.md:
---
layout: base
---
<div id="recent_articles">
  {% for post in site.posts %}
  <div class="article_puff">
    <img src="/resources/images/fancyi.jpg" alt="" />
    <h2>{{post.title}}</h2>
    <p>{{post.description}}</p>
    Read more
  </div>
  {% endfor %}
</div>
The problem is that the imported {{content}} tag is rendered without the indentation used above.
So instead of
<body>
  <div id="page">
    <div id="recent_articles">
      <div class="article_puff">
        <img src="/resources/images/fancyimage.jpg" alt="" />
        <h2>Gettin' down with responsive web design</h2>
        <p>Everyone's talking about it. Your client wants it. You need to code it.</p>
        Read more
      </div>
    </div>
  </div>
</body>
I get
<body>
  <div id="page">
    <div id="recent_articles">
  <div class="article_puff">
    <img src="/resources/images/fancyimage.jpg" alt="" />
    <h2>Gettin' down with responsive web design</h2>
    <p>Everyone's talking about it. Your client wants it. You need to code it.</p>
    Read more
  </div>
</div>
  </div>
</body>
Seems like only the first line is indented correctly. The rest starts at the beginning of the line... So, multiline liquid-templating import? :)
Using a Liquid Filter
I managed to make this work using a liquid filter. There are a few caveats:
Your input must be clean. I had some curly quotes and non-printable chars that looked like whitespace in a few files (copypasta from Word or some such) and was seeing "Invalid byte sequence in UTF-8" as a Jekyll error.
It could break some things. I was using <i class="icon-file"></i> icons from twitter bootstrap. It replaced the empty tag with <i class="icon-file"/> and bootstrap did not like that. Additionally, it screws up the octopress {% codeblock %}s in my content. I didn't really look into why.
While this will clean the output of a liquid variable such as {{ content }} it does not actually solve the problem in the original post, which is to indent the html in context of the surrounding html. This will provide well formatted html, but as a fragment that will not be indented relative to tags above the fragment. If you want to format everything in context, use the Rake task instead of the filter.
require 'rubygems'
require 'json'
require 'nokogiri'
require 'nokogiri-pretty'
module Jekyll
module PrettyPrintFilter
def pretty_print(input)
#seeing some ASCII-8 come in
input = input.encode("UTF-8")
#Parsing with nokogiri first cleans up some things the XSLT can't handle
content = Nokogiri::HTML::DocumentFragment.parse input
parsed_content = content.to_html
#Unfortunately nokogiri-pretty can't use DocumentFragments...
html = Nokogiri::HTML parsed_content
pretty = html.human
#...so now we need to remove the stuff it added to make valid HTML
output = PrettyPrintFilter.strip_extra_html(pretty)
output
end
def PrettyPrintFilter.strip_extra_html(html)
#type declaration
html = html.sub('<?xml version="1.0" encoding="ISO-8859-1"?>','')
#second <html> tag
first = true
html = html.gsub('<html>') do |match|
if first == true
first = false
next
else
''
end
end
#first </html> tag
html = html.sub('</html>','')
#second <head> tag
first = true
html = html.gsub('<head>') do |match|
if first == true
first = false
next
else
''
end
end
#first </head> tag
html = html.sub('</head>','')
#second <body> tag
first = true
html = html.gsub('<body>') do |match|
if first == true
first = false
next
else
''
end
end
#first </body> tag
html = html.sub('</body>','')
html
end
end
end
Liquid::Template.register_filter(Jekyll::PrettyPrintFilter)
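One detail of strip_extra_html is worth checking against its comments. String#gsub with a block substitutes the block's return value, and a bare next makes the block return nil, which gsub treats as an empty string. As written, the first match is therefore removed along with the later ones. If the goal is to drop every wrapper tag, that is harmless. If the goal is to keep the first occurrence, a sketch like the following returns the match instead (the helper name remove_all_but_first is hypothetical, not part of the filter):

```ruby
# Demonstrates the behaviour of a bare `next` inside a gsub block.
def remove_with_bare_next(text, tag)
  first = true
  text.gsub(tag) do |match|
    if first
      first = false
      next # block returns nil, so this match is replaced by ""
    else
      ''
    end
  end
end

# Hypothetical alternative: keeps the first occurrence, removes the rest.
def remove_all_but_first(text, tag)
  seen = false
  text.gsub(tag) do |match|
    if seen
      ''
    else
      seen = true
      match # returning the match leaves it in place
    end
  end
end

html = "<html><html><body>x</body>"
p remove_with_bare_next(html, "<html>")
p remove_all_but_first(html, "<html>")
```

The first call prints "<body>x</body>" and the second prints "<html><body>x</body>".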
Using a Rake task
I use a task in my rakefile to pretty print the output after the jekyll site has been generated.
require 'nokogiri'
require 'nokogiri-pretty'
desc "Pretty print HTML output from Jekyll"
task :pretty_print do
  #change public to _site or wherever your output goes
  html_files = File.join("**", "public", "**", "*.html")
  Dir.glob html_files do |html_file|
    puts "Cleaning #{html_file}"
    contents = File.read(html_file)
    begin
      #we're gonna parse it as XML so we can apply an XSLT
      html = Nokogiri::XML(contents)
      #the human() method is from nokogiri-pretty. Just an XSL transform on the XML.
      pretty_html = html.human
    rescue StandardError => msg
      puts "Failed to pretty print #{html_file}: #{msg}"
      #skip the write so a failed file is not overwritten with empty content
      next
    end
    #Yep, we're overwriting the file. Potentially destructive.
    File.write(html_file, pretty_html)
  end
end
We can accomplish this by writing a custom Liquid filter to tidy the html, and then doing {{content | tidy }} to include the html.
A quick search suggests that the ruby tidy gem may not be maintained but that nokogiri is the way to go. This will of course mean installing the nokogiri gem.
See advice on writing liquid filters, and Jekyll example filters.
An example might look something like this: in _plugins, add a script called tidy-html.rb containing:
require 'nokogiri'
module TextFilter
  # Re-serialize the fragment through Nokogiri to tidy the markup
  def tidy(input)
    Nokogiri::HTML::DocumentFragment.parse(input).to_html
  end
end
Liquid::Template.register_filter(TextFilter)
(Untested)