Replacing <a> with Nokogiri - html

I am using Nokogiri to scan a document and remove specific files that are stored as attachments. I want to note however that the value was removed in-line.
Eg.
File Download
Converted to:
File Removed
Here is what I tried:
#doc = Nokogiri::HTML(html).to_html
#doc.search('a').each do |attachment|
attachment.remove
attachment.content = "REMOVED"
# ALSO TRIED:
attachment.content = "REMOVED"
end
The second one does replace the anchor text but keeps the href and the user can still download the value.
How can I replace the anchor value and change it to a < p> with a new string?

Use combination of create_element and replace to achieve that. Find the comments inline below.
html = 'File Download'
dom = Nokogiri::HTML(html) # parse with nokogiri
dom.to_s # original content
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>File Download</body></html>\n"
# scan the dom for hyperlinks
dom.css('a').each do |a|
node = dom.create_element 'p' # create paragraph element
node.inner_html = "REMOVED" # add content you want
a.replace node # replace found link with paragraph
end
dom.to_s # modified html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>REMOVED</p></body></html>\n"
Hope this helps

Related

Creating HTML links from images in :colons: with Ruby

I have a simple HTML document:
<div should-not-be-replaced=":smile:">
Hello :smile:!
</div>
How would I replace the :smile: text with <img src="smile.png">, but keeping the first :smile: unchanged, to get this:
<div should-not-be-replaced=":smile:">
Hello <img src="smile.png">!
</div>
I tried this, but Nokogiri escapes my HTML as plain text:
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
next unless x.text?
x.content = x.text.gsub(':smile:', '<img src="smile.png">')
end
My solution is very similar to Ku's, although I've tried to handle situations where the replaced text could be in the source text multiple times by completely replacing the content text with an HTML Doc Fragment
doc = Nokogiri::HTML::DocumentFragment.parse(DATA.read)
doc.traverse do |x|
next unless x.text?
if x.text.match(%r{:(\w+):})
replace_text = x.text.gsub(%r{:(\w+):}, "<img src='#{$1}.png'>")
x.content = ""
x.add_next_sibling replace_text
end
end
I think this might be what you want, and it also deals with strings between two colons like :something: and produces "something.png" as well.
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
if x.text? && x.content =~ /:\w+:/
x.content = x.content.sub(/:(\w+):/, '')
a = Nokogiri::HTML::DocumentFragment.parse('<a src="'+$1+'.png">')
x.add_next_sibling(a)
end
end
You are making it much too hard, and using traverse which is slow because it forces Nokogiri to walk through every node in the document; In a large page that is costly.
Instead take advantage of selectors to find the specific node(s) you want:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div parm=":smile:">
Hello :smile:!
</div>
EOT
div = doc.at('div[parm=":smile:"]')
div.inner_html = div.text.sub(/:smile:/, '<img src="smile.png">')
puts doc.to_html
Running that results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div parm=":smile:">
Hello <img src="smile.png">!
</div>
</body></html>
I'm using at, which finds the first occurrence. If you need to process more than one then use search. search returns a NodeSet, which is like an array so you'll want to iterate over it. That are innumerable examples of doing so on Stack Overflow and elsewhere.
Do you mean it returns &lt or &gt?
I recommend to wrap CGI#unescape_html method
try,
require 'cgi'
CGI::unescape_html(doc.to_s)

How extract text from a tag using Nokogiri

Example:
<p>http://localhost:3000/replies/279<br><p>
Currently using Nokogiri to grab the href from the <a>:
doc.search('a').each do |node|
href = node.attributes['href'].try(:value)
I need to make sure what's in the text part is what's in the href and I'm not sure how to extract that.
Here are the basics for checking:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p>http://localhost:3000/replies/279<br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => false
Modifying the HTML so the HREF and text match:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p>http://localhost:3000/replies/279<br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => true
at returns only the first node that matches the selector, so if you're looking to check multiple nodes you'll want to use search and iterate over the NodeSet it returns.

Download HTML Text with Ruby

I am trying to create a histogram of the letters (a,b,c,etc..) on a specified web page. I plan to make the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'
# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)
def open(url)
Net::HTTP.get(URI.parse(url))
end
page_content = open('_insert_webpage_here')
page_content.each do |i|
puts i
end
This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found here</body>
Pretending that it was the right page, I don't want the html tags. I'm just trying to get Object Moved and This document may be found here.
Is there any reasonably easy way to do this?
When you require 'open-uri', you don't need to redefine open with Net::HTTP.
require 'open-uri'
page_content = open('http://www.stackoverflow.com').read
histogram = {}
page_content.each_char do |c|
histogram[c] ||= 0
histogram[c] += 1
end
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
See the section "Following Redirection" on the Net::HTTP Documentation here
Stripping html tags without Nokogiri
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615

Variables contains html tags display tags

I have a variable contains html tags when i print it in tpl file. It shows html tags too.
in php file
$content=" < h1>Header< /h1>< p>Content< /p>";
$this->assign("content", $content);
in tpl file
{$content}
displays: < h1>Header< /h1>< p>Content< /p>`
What can i do?
Just don't use spaces in your html tags between < and the tag item, just use this: <h1>.
This is good code for you:
$content = " <h1>Header</h1><p>Content</p>";

How to wrap words in HTML document without attributes and tag names

I have an HTML document that has long words:
<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
I want to word-wrap it without cutting the tags or its attributes:
<div>this is a veeeeeeeeeeeerryyyyy yyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
Also, it's possible that I will not have any HTML tag at all.
I tried Nokogiri, but it inserts a paragraph in tagless input, and wraps the whole response with an HTML document, which is not my intention.
What is the best way to accomplish this?
require "Nokogiri"
class String
def wrap()
doc = Nokogiri::HTML(self)
doc.at("body").traverse do |p|
if p.is_a?(Nokogiri::XML::Text)
input = p.content
p.content = input.scan(/.{1,25}/).join(" ")
end
end
doc.to_s # I want only the wrapped string, without the head/body stuff
end
end
I think using Nokogiri::XML(self) instead of Nokogiri::HTML(self) will help you.
This looks like a starting point for you:
require 'nokogiri'
max_word_length = 30
html = '<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>'
doc = Nokogiri::HTML.fragment(html)
doc.search('text()').each do |n|
n.content = n.content.split(' ').map { |l|
if (l.size > max_word_length)
l = l.scan(/.{1,#{ max_word_length }}/).join("\n")
end
l
}.join(' ')
end
puts doc.to_html
# >> <div>this is a veeeeeeeeeeeerryyyyyyyyloooong
# >> woooord<img src="/fooooooooobaaar.jof">
# >> </div>