Creating HTML links from images in :colons: with Ruby

I have a simple HTML document:
<div should-not-be-replaced=":smile:">
Hello :smile:!
</div>
How would I replace the :smile: text with <img src="smile.png"> while leaving the :smile: inside the attribute unchanged, to get this:
<div should-not-be-replaced=":smile:">
Hello <img src="smile.png">!
</div>
I tried this, but Nokogiri escapes my HTML as plain text:
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
  next unless x.text?
  x.content = x.text.gsub(':smile:', '<img src="smile.png">')
end

My solution is very similar to Ku's, but it also handles text where the pattern occurs more than once, by replacing the whole text node content with an HTML document fragment:
doc = Nokogiri::HTML::DocumentFragment.parse(DATA.read)
doc.traverse do |x|
  next unless x.text?
  if x.text.match(%r{:(\w+):})
    # Use the \1 backreference: "#{$1}" would be interpolated once, before gsub runs
    replace_text = x.text.gsub(%r{:(\w+):}, "<img src='\\1.png'>")
    x.content = ""
    x.add_next_sibling replace_text
  end
end
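
One detail worth checking when the name comes from a capture group: a "#{$1}" inside the replacement string is interpolated before gsub runs, so it does not change per match. A minimal sketch of two forms that do pick up each match, using plain strings only (the method names and the sample text are made up for illustration):

```ruby
# Replacement-string form: gsub expands the \1 backreference for every match.
def emoji_to_img(text)
  text.gsub(/:(\w+):/, '<img src="\1.png">')
end

# Block form: the block runs once per match, so $1 is current each time.
def emoji_to_img_block(text)
  text.gsub(/:(\w+):/) { "<img src=\"#{$1}.png\">" }
end

puts emoji_to_img('Hello :smile: and :wink:!')
puts emoji_to_img_block('Hello :smile: and :wink:!')
```

Both forms should give the same markup; the block form is handy if the name needs further processing before it goes into the tag.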

I think this might be what you want. It also handles any string between two colons, so :something: produces "something.png" as well.
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
  if x.text? && x.content =~ /:\w+:/
    x.content = x.content.sub(/:(\w+):/, '') # sub handles only the first match
    img = Nokogiri::HTML::DocumentFragment.parse('<img src="' + $1 + '.png">')
    x.add_next_sibling(img)
  end
end

You are making this much too hard. traverse is slow because it forces Nokogiri to walk through every node in the document, which is costly on a large page.
Instead, take advantage of selectors to find the specific node(s) you want:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div parm=":smile:">
Hello :smile:!
</div>
EOT
div = doc.at('div[parm=":smile:"]')
div.inner_html = div.text.sub(/:smile:/, '<img src="smile.png">')
puts doc.to_html
Running that results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div parm=":smile:">
Hello <img src="smile.png">!
</div>
</body></html>
I'm using at, which finds the first occurrence. If you need to process more than one, use search. search returns a NodeSet, which is like an array, so you'll want to iterate over it. There are innumerable examples of doing so on Stack Overflow and elsewhere.

Do you mean it returns &lt; or &gt;?
I recommend passing the output through the CGI.unescape_html method.
Try:
require 'cgi'
CGI::unescape_html(doc.to_s)
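
To show what that call does in isolation, here is a small sketch using only the standard library. The escaped sample string and the helper name are assumptions for illustration; note that unescaping a whole document also unescapes entities that were meant to stay escaped, so treat this as a workaround rather than a fix.

```ruby
require 'cgi'

# Hypothetical helper: turn escaped entities back into markup characters.
def unescape_markup(escaped)
  CGI.unescape_html(escaped)
end

# Assumed sample of what the escaped text node content looks like.
puts unescape_markup('Hello &lt;img src=&quot;smile.png&quot;&gt;!')
```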

Related

Converting HTML pages with equations to docx

I am trying to convert an html document to docx using pandoc.
pandoc -s Template.html --mathjax -o Test.docx
During the conversion to docx everything goes smoothly except for the equations.
In the HTML file an equation looks like this:
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
\begin{equation}
\log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391}
\end{equation}
</div>
</div>
</div>
</div>
After running the pandoc command the result in the docx document is:
\begin{equation} \log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391} \end{equation}
Do you have any idea how I can overcome this issue?
Thanks
A Lua filter can help here. The code below looks for div elements with a data-mime-type="text/markdown" attribute and, somewhat paradoxically, parses their content as LaTeX. The original div is then replaced with the parse result.
local stringify = pandoc.utils.stringify
function Div (div)
  if div.attributes['mime-type'] == 'text/markdown' then
    return pandoc.read(stringify(div), 'latex').blocks
  end
end
Save the code to a file parse-math.lua and let pandoc use it with the --lua-filter / -L option:
pandoc --lua-filter parse-math.lua ...
As noted in a comment, this gets slightly more complicated if there are other HTML elements with the text/markdown media type. In that case we'll check if the parse result contains only math, and keep the original content otherwise.
local stringify = pandoc.utils.stringify
function Div (div)
  if div.attributes['mime-type'] == 'text/markdown' then
    local result = pandoc.read(stringify(div), 'latex').blocks
    local first = result[1] and result[1].content or {}
    return (#first == 1 and first[1].t == 'Math')
      and result
      or nil
  end
end

Replacing <a> with Nokogiri

I am using Nokogiri to scan a document and remove specific files that are stored as attachments. I want to note however that the value was removed in-line.
Eg.
File Download
Converted to:
File Removed
Here is what I tried:
@doc = Nokogiri::HTML(html).to_html
@doc.search('a').each do |attachment|
  attachment.remove
  attachment.content = "REMOVED"
  # ALSO TRIED:
  attachment.content = "REMOVED"
end
The second one does replace the anchor text but keeps the href and the user can still download the value.
How can I replace the anchor and change it to a <p> with a new string?
Use a combination of create_element and replace to achieve that. See the inline comments below.
html = 'File Download'
dom = Nokogiri::HTML(html) # parse with nokogiri
dom.to_s # original content
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>File Download</body></html>\n"
# scan the dom for hyperlinks
dom.css('a').each do |a|
  node = dom.create_element 'p' # create paragraph element
  node.inner_html = "REMOVED" # add content you want
  a.replace node # replace found link with paragraph
end
dom.to_s # modified html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>REMOVED</p></body></html>\n"
Hope this helps

Issue with ruby parsing

I'm having a slight problem parsing a website with Nokogiri in Ruby.
Here is what the site looks like:
<div id="post_message_111112" class="postcontent">
Hee is text 1
here is another
</div>
<div id="post_message_111111" class="postcontent">
Here is text 2
</div>
Here is my code to parse it
doc = Nokogiri::HTML(open(myNewLink))
myPost = doc.xpath("//div[@class='postcontent']/text()").to_a()
ii=0
while ii!=myPost.length
  puts "#{ii} #{myPost[ii].to_s().strip}"
  ii+=1
end
My problem is the output: because of the new line after "Hee is text 1", to_a splits the first div into two entries, like so:
myPost[0] = hee is text 1
myPost[1] = here is another
myPost[2] = here is text 2
I want each div to be its own message, like:
myPost[0] = hee is text 1 here is another
myPost[1] = here is text 2
How would I solve this? Thanks.
UPDATED
I tried
myPost = doc.xpath("//div[@class='postcontent']/text()").to_a()
myPost.each_with_index do |post, index|
  puts "#{index} #{post.to_s().gsub(/\n/, ' ').strip}"
end
I used post.to_s().gsub because it was complaining that gsub is not a method of post. But I still have the same issue. I know I'm doing it wrong; I'm just wrecking my head.
UPDATE 2
Forgot to say that the new line is <br /> and even with
doc.search('br').each do |n|
  n.replace('')
end
or
doc.search('br').remove
The issue is still there
If you look at the myPost array, you will see that each div is in fact its own message. The first just happens to include a newline-character \n. To replace it with a space, use #gsub(/\n/, ' '). So your loop looks like this:
myPost.each_with_index do |post, index|
  puts "#{index} #{post.to_s.gsub(/\n/, ' ').strip}"
end
Edit:
According to my limited understanding of it, xpath can only find nodes. The child nodes are <br />, so either you have multiple texts between them or you have the div tag included in your search. There sure is a way to join the texts between the <br /> nodes, but I don't know it.
Until you find it, here is something that works:
replace your xpath match with "//div[@class='postcontent']"
adjust your loop to delete the div tags:
myPost.each_with_index do |post, index|
  post = post.to_s
  post.gsub!(/\n/, ' ')
  post.gsub!(/^<div[^>]*>/, '') # delete opening div tag
  post.gsub!(%r|</\s*div[^>]*>|, '') # delete closing div tag
  puts "#{index} #{post.strip}"
end
Here, let me clean that up for you:
doc.search('div.postcontent').each_with_index do |div, i|
  puts "#{i} #{div.text.gsub(/\s+/, ' ').strip}"
end
# 0 Hee is text 1 here is another
# 1 Here is text 2
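
The whitespace handling in that one-liner can be tried on its own, without Nokogiri. A minimal sketch, assuming the div's text comes back with newlines where the line breaks were (the helper name and the sample string are illustrative):

```ruby
# Collapse every run of whitespace (spaces, tabs, newlines) into one space,
# then trim the ends.
def normalize_whitespace(text)
  text.gsub(/\s+/, ' ').strip
end

# Assumed shape of div.text for the first post.
raw = "\nHee is text 1\nhere is another\n"
puts normalize_whitespace(raw)
```

Using \s+ rather than \n also catches tabs and repeated spaces, which is why it is a bit more forgiving than replacing newlines alone.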

Download HTML Text with Ruby

I am trying to create a histogram of the letters (a,b,c,etc..) on a specified web page. I plan to make the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'
# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)
def open(url)
  Net::HTTP.get(URI.parse(url))
end
page_content = open('_insert_webpage_here')
page_content.each do |i|
  puts i
end
This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found here</body>
Pretending that it was the right page, I don't want the html tags. I'm just trying to get Object Moved and This document may be found here.
Is there any reasonably easy way to do this?
When you require 'open-uri', you don't need to redefine open with Net::HTTP.
require 'open-uri'
# On Ruby 3.0 and later, Kernel#open no longer opens URLs; use URI.open instead.
page_content = URI.open('http://www.stackoverflow.com').read
histogram = {}
page_content.each_char do |c|
  histogram[c] ||= 0
  histogram[c] += 1
end
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
See the section "Following Redirection" in the Net::HTTP documentation.
Stripping HTML tags without Nokogiri:
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615
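
Putting the two ideas together, here is a rough sketch that strips tags with the same regular expression and then counts letters in a hash. The method name is hypothetical, and the regex is a crude approximation: it will misbehave on things like a > inside an attribute value, comments, or script blocks.

```ruby
# Count letters (case-insensitive) in the text left after removing tags.
def letter_histogram(html)
  text = html.gsub(/<\/?[^>]*>/, '') # crude tag stripping, as above
  histogram = Hash.new(0)            # missing keys start at 0
  text.downcase.scan(/[a-z]/) { |c| histogram[c] += 1 }
  histogram
end

p letter_histogram('<body><h1>Object Moved</h1>This document may be found here</body>')
```

Hash.new(0) removes the need for the ||= 0 step, since an unseen letter already reads as zero.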

How to wrap words in HTML document without attributes and tag names

I have an HTML document that has long words:
<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
I want to word-wrap it without cutting the tags or its attributes:
<div>this is a veeeeeeeeeeeerryyyyy yyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
Also, it's possible that I will not have any HTML tag at all.
I tried Nokogiri, but it inserts a paragraph in tagless input, and wraps the whole response with an HTML document, which is not my intention.
What is the best way to accomplish this?
require "nokogiri"
class String
  def wrap()
    doc = Nokogiri::HTML(self)
    doc.at("body").traverse do |p|
      if p.is_a?(Nokogiri::XML::Text)
        input = p.content
        p.content = input.scan(/.{1,25}/).join(" ")
      end
    end
    doc.to_s # I want only the wrapped string, without the head/body stuff
  end
end
I think using Nokogiri::XML(self) instead of Nokogiri::HTML(self) will help you.
This looks like a starting point for you:
require 'nokogiri'
max_word_length = 30
html = '<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>'
doc = Nokogiri::HTML.fragment(html)
doc.search('text()').each do |n|
  n.content = n.content.split(' ').map { |l|
    if (l.size > max_word_length)
      l = l.scan(/.{1,#{ max_word_length }}/).join("\n")
    end
    l
  }.join(' ')
end
puts doc.to_html
# >> <div>this is a veeeeeeeeeeeerryyyyyyyyloooong
# >> woooord<img src="/fooooooooobaaar.jof">
# >> </div>
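
For input that has no tags at all, the same splitting logic can be applied to the string directly. A minimal sketch under that assumption (the method name and the default limit are illustrative, and split(' ') collapses repeated whitespace in the input):

```ruby
# Break any word longer than max_word_length into chunks joined by newlines;
# shorter words pass through unchanged.
def break_long_words(text, max_word_length = 30)
  text.split(' ').map { |word|
    if word.size > max_word_length
      word.scan(/.{1,#{max_word_length}}/).join("\n")
    else
      word
    end
  }.join(' ')
end

puts break_long_words('this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord')
```

Swapping "\n" for " " in the join gives the space-separated form shown in the question.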