How to wrap words in an HTML document without breaking attributes and tag names

I have an HTML document that has long words:
<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
I want to word-wrap it without cutting the tags or their attributes:
<div>this is a veeeeeeeeeeeerryyyyy yyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
Also, it's possible that I will not have any HTML tag at all.
I tried Nokogiri, but it inserts a paragraph in tagless input, and wraps the whole response with an HTML document, which is not my intention.
What is the best way to accomplish this?
require "nokogiri"
class String
  def wrap
    doc = Nokogiri::HTML(self)
    doc.at("body").traverse do |p|
      if p.is_a?(Nokogiri::XML::Text)
        input = p.content
        p.content = input.scan(/.{1,25}/).join(" ")
      end
    end
    doc.to_s # I want only the wrapped string, without the head/body stuff
  end
end

I think using Nokogiri::XML(self) instead of Nokogiri::HTML(self) will help you.

This looks like a starting point for you:
require 'nokogiri'
max_word_length = 30
html = '<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>'
# Parse as a fragment so no html/body wrapper is added
doc = Nokogiri::HTML.fragment(html)
# Only text nodes are touched, so tags and attributes stay intact
doc.search('text()').each do |n|
  n.content = n.content.split(' ').map { |l|
    if l.size > max_word_length
      l = l.scan(/.{1,#{max_word_length}}/).join("\n")
    end
    l
  }.join(' ')
end
puts doc.to_html
# >> <div>this is a veeeeeeeeeeeerryyyyyyyyloooong
# >> woooord<img src="/fooooooooobaaar.jof">
# >> </div>
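The per-word splitting inside the loop can also be pulled out into a plain-string helper, which makes it easy to try without parsing any HTML. This is only a sketch: the method name break_long_words and its separator parameter are made up for illustration and are not part of Nokogiri.

```ruby
# Hypothetical helper: breaks only the words that are longer than max_len.
# Shorter words are left alone and words are re-joined with single spaces.
def break_long_words(text, max_len, separator = "\n")
  text.split(' ').map { |word|
    if word.size > max_len
      word.scan(/.{1,#{max_len}}/).join(separator)
    else
      word
    end
  }.join(' ')
end

puts break_long_words("this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord", 30)
```

Note that split(' ') collapses runs of whitespace and drops leading and trailing whitespace, which is usually harmless for HTML text but does change the string.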

Related

Replacing <a> with Nokogiri

I am using Nokogiri to scan a document and remove links to specific files that are stored as attachments. However, I want to note in-line that the value was removed.
E.g.
<a href="...">File Download</a>
Converted to:
<p>File Removed</p>
Here is what I tried:
@doc = Nokogiri::HTML(html)
@doc.search('a').each do |attachment|
  # First attempt:
  attachment.remove
  attachment.content = "REMOVED"
  # ALSO TRIED:
  attachment.content = "REMOVED"
end
The second one does replace the anchor text but keeps the href and the user can still download the value.
How can I replace the anchor and change it to a <p> with a new string?
Use a combination of create_element and replace to achieve that. See the inline comments below.
html = '<a href="...">File Download</a>' # the original href was not preserved; "..." is a placeholder
dom = Nokogiri::HTML(html) # parse with nokogiri
dom.to_s # original content
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"...\">File Download</a></body></html>\n"
# scan the dom for hyperlinks
dom.css('a').each do |a|
  node = dom.create_element 'p' # create paragraph element
  node.inner_html = "REMOVED" # add content you want
  a.replace node # replace found link with paragraph
end
dom.to_s # modified html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>REMOVED</p></body></html>\n"
Hope this helps

Creating HTML links from images in :colons: with Ruby

I have a simple HTML document:
<div should-not-be-replaced=":smile:">
Hello :smile:!
</div>
How would I replace the :smile: text with <img src="smile.png"> while keeping the :smile: inside the attribute unchanged, to get this:
<div should-not-be-replaced=":smile:">
Hello <img src="smile.png">!
</div>
I tried this, but Nokogiri escapes my HTML as plain text:
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
  next unless x.text?
  x.content = x.text.gsub(':smile:', '<img src="smile.png">')
end
My solution is very similar to Ku's, although I've tried to handle situations where the replaced text could appear in the source text multiple times, by completely replacing the content text with an HTML document fragment:
# DATA.read takes the HTML from the script's __END__ section
doc = Nokogiri::HTML::DocumentFragment.parse(DATA.read)
doc.traverse do |x|
  next unless x.text?
  if x.text.match(%r{:(\w+):})
    # \1 is filled in per match; "#{$1}" would be interpolated once, before gsub runs
    replace_text = x.text.gsub(%r{:(\w+):}, "<img src='\\1.png'>")
    x.content = ""
    x.add_next_sibling replace_text
  end
end
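When the replacement depends on the captured name, it matters how the capture is referenced. Inside a double-quoted replacement string, #{$1} is interpolated once before gsub runs, whereas the backreference \1 or the block form is evaluated for every match. A small plain-Ruby sketch, where the constant and method names are made up for illustration:

```ruby
EMOJI = /:(\w+):/

# Backreference form: \1 is substituted separately for each match.
def emoji_with_backref(text)
  text.gsub(EMOJI, '<img src="\1.png">')
end

# Block form: $1 is read inside the block, once per match.
def emoji_with_block(text)
  text.gsub(EMOJI) { %(<img src="#{$1}.png">) }
end

puts emoji_with_backref("Hello :smile: and :wink:!")
# prints: Hello <img src="smile.png"> and <img src="wink.png">!
```

Both forms give the same result here; the block form is handy when the capture needs further processing before it is inserted.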
I think this might be what you want. It also handles any string between two colons, so :something: produces "something.png":
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
  if x.text? && x.content =~ /:\w+:/
    x.content = x.content.sub(/:(\w+):/, '')
    # Note: the image is added after the whole text node, not at the position of the match
    a = Nokogiri::HTML::DocumentFragment.parse('<img src="'+$1+'.png">')
    x.add_next_sibling(a)
  end
end
You are making it much too hard, and using traverse, which is slow because it forces Nokogiri to walk through every node in the document; in a large page that is costly.
Instead, take advantage of selectors to find the specific node(s) you want:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div parm=":smile:">
Hello :smile:!
</div>
EOT
div = doc.at('div[parm=":smile:"]')
div.inner_html = div.text.sub(/:smile:/, '<img src="smile.png">')
puts doc.to_html
Running that results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div parm=":smile:">
Hello <img src="smile.png">!
</div>
</body></html>
I'm using at, which finds the first occurrence. If you need to process more than one, use search. search returns a NodeSet, which is like an array, so you'll want to iterate over it. There are innumerable examples of doing so on Stack Overflow and elsewhere.
Do you mean it returns &lt; or &gt;?
I recommend wrapping the output in CGI.unescape_html.
Try:
require 'cgi'
CGI::unescape_html(doc.to_s)

How to extract text from a tag using Nokogiri

Example:
<p><a href="...">http://localhost:3000/replies/279</a><br><p>
Currently using Nokogiri to grab the href from the <a>:
doc.search('a').each do |node|
  href = node.attributes['href'].try(:value) # try comes from ActiveSupport
end
I need to make sure what's in the text part is what's in the href and I'm not sure how to extract that.
Here are the basics for checking:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p><a href="...">http://localhost:3000/replies/279</a><br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => false
Modifying the HTML so the HREF and text match:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p><a href="http://localhost:3000/replies/279">http://localhost:3000/replies/279</a><br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => true
at returns only the first node that matches the selector, so if you're looking to check multiple nodes you'll want to use search and iterate over the NodeSet it returns.

Nokogiri help without spaces

I have the following code:
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'cora'
require 'eat'
#require 'timeout'
# URI.open comes from open-uri; Kernel#open no longer opens URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open("http://mobile.bahn.de/bin/mobil/bhftafel.exe/dox?input=Richard-Strauss-Stra%DFe%2C+M%FCnchen%23625127&date=27.01.12&time=20%3A41&productsFilter=1111111111000000&REQTrain_name=&maxJourneys=10&start=Suchen&boardType=Abfahrt&ao=yes"))
doc = doc.xpath('//div').each do |node|
  puts node.content
end
How can I remove the p-tags and spaces?
Here's a guess at what you might want:
require 'nokogiri'
require 'open-uri'
# URI.open comes from open-uri; Kernel#open no longer opens URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open("http://mobile.bahn.de/bin/mobil/bhftafel.exe/dox?input=Richard-Strauss-Stra%DFe%2C+M%FCnchen%23625127&date=27.01.12&time=20%3A41&productsFilter=1111111111000000&REQTrain_name=&maxJourneys=10&start=Suchen&boardType=Abfahrt&ao=yes"))
doc.xpath('//div//p').remove
doc.xpath('//div').each do |node|
  text = node.text.gsub(/\n([ \t]*\n)+/,"\n").gsub(/^\s+|\s+$/,'')
  puts text unless text.empty?
end
This removes all <p> elements from the document and then removes all blank lines and leading and trailing whitespace from the text. In the end, it does not print the text if the result was an empty string.
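The two gsub calls can be tried on a plain string, without fetching the page. A minimal sketch; the helper name strip_blank_lines and the sample text are made up for illustration:

```ruby
# Hypothetical helper wrapping the two substitutions used above.
def strip_blank_lines(text)
  text
    .gsub(/\n([ \t]*\n)+/, "\n") # collapse runs of blank lines into one newline
    .gsub(/^\s+|\s+$/, '')       # trim whitespace at the start and end of each line
end

sample = "  a \n\n \n  b  \n"
puts strip_blank_lines(sample).inspect
```

Because ^ and $ match at every line boundary in Ruby, the second substitution trims each line, not just the string as a whole.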
Edit: To make a variable for the date, wrap the above in a function and use string interpolation to construct your URL. For example:
require 'nokogiri'
require 'open-uri'
def get_data(date)
  date_string = date.strftime('%d.%m.%y') # the URL above uses dots: date=27.01.12
  url = "http://mobile.bahn.de/…more…#{date_string}…more…"
  doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer opens URLs on Ruby 3.0+
  # more code from above
end
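The URL in the question passes the date as date=27.01.12 and the time as time=20%3A41, so both values can be generated from a Time object. The following is only a sketch: date_time_query is a hypothetical helper, and it builds just those two parameters, leaving the rest of the query string out.

```ruby
require 'uri'

# Hypothetical helper: formats the date and time parameters the way
# they appear in the URL from the question.
def date_time_query(time)
  URI.encode_www_form(
    'date' => time.strftime('%d.%m.%y'), # day.month.two-digit year
    'time' => time.strftime('%H:%M')     # the colon is percent-encoded as %3A
  )
end

puts date_time_query(Time.new(2012, 1, 27, 20, 41))
# prints: date=27.01.12&time=20%3A41
```

Letting URI.encode_www_form do the escaping avoids hand-writing sequences such as %3A inside the interpolated string.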

Nokogiri prevent converting entities

require "nokogiri"
def wrap(content)
  doc = Nokogiri::HTML.fragment("<div>" + content + "</div>")
  doc.at("div").traverse do |p|
    if p.is_a?(Nokogiri::XML::Text)
      input = p.content
      p.content = input.scan(/.{1,5}/).join("\u00AD") # U+00AD is the soft hyphen
    end
  end
  doc.at("div").inner_html
end
wrap("aaaaaaaaaa")
gives me
"aaaaa&shy;aaaaa"
instead of
"aaaaa­aaaaa"
How do I get the second result, with the literal soft hyphen character?
Return
doc.at("div").text
instead of
doc.at("div").inner_html
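This, however, strips all HTML from the result. If you need to retain other markup, you can try CGI.unescapeHTML on the serialized HTML, but be aware that it decodes only &amp;, &lt;, &gt;, &quot;, &apos; and numeric character references, so a named entity such as &shy; is left as-is: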
CGI.unescapeHTML(doc.at("div").inner_html)
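Which entities CGI.unescapeHTML decodes is easy to check in isolation, without Nokogiri. The sketch below uses made-up sample strings and a hypothetical helper name; on the Ruby versions I know of, it decodes the basic entities and numeric references but leaves &shy; untouched.

```ruby
require 'cgi'

# Hypothetical helper: returns the unescaped form of three sample strings
# so the different cases can be compared side by side.
def unescape_samples
  {
    basic:   CGI.unescapeHTML("a &lt; b &amp; c"),  # basic named entities
    numeric: CGI.unescapeHTML("aaaaa&#173;aaaaa"),  # numeric reference for U+00AD
    named:   CGI.unescapeHTML("aaaaa&shy;aaaaa")    # named entity outside the basic set
  }
end

unescape_samples.each { |kind, value| puts "#{kind}: #{value.inspect}" }
```

If the serialized HTML contains &shy;, one option under these assumptions is an explicit gsub('&shy;', "\u00AD") on the string instead.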