Issue with ruby parsing - html

I'm having a slight problem parsing a website with Nokogiri in Ruby.
Here is what the site looks like:
<div id="post_message_111112" class="postcontent">
Hee is text 1
here is another
</div>
<div id="post_message_111111" class="postcontent">
Here is text 2
</div>
Here is my code to parse it
doc = Nokogiri::HTML(open(myNewLink))
myPost = doc.xpath("//div[@class='postcontent']/text()").to_a()
ii=0
while ii!=myPost.length
puts "#{ii} #{myPost[ii].to_s().strip}"
ii+=1
end
My problem is with the output: because of the new line after "Hee is text 1", to_a splits the first div into separate entries, like so:
myPost[0] = hee is text 1
myPost[1] = here is another
myPost[2] = here is text 2
I want each div to be its own message, like:
myPost[0] = hee is text 1 here is another
myPost[1] = here is text 2
How would I solve this? Thanks.
UPDATED
I tried
myPost = doc.xpath("//div[@class='postcontent']/text()").to_a()
myPost.each_with_index do |post, index|
puts "#{index} #{post.to_s().gsub(/\n/, ' ').strip}"
end
I used post.to_s().gsub because it complained that gsub is not a method of post. But I still have the same issue. I know I'm doing it wrong; it's just wrecking my head.
UPDATE 2
I forgot to say that the new line is a <br />, and even with
doc.search('br').each do |n|
n.replace('')
end
or
doc.search('br').remove
The issue is still there

If you look at the myPost array, you will see that each div is in fact its own message. The first just happens to include a newline-character \n. To replace it with a space, use #gsub(/\n/, ' '). So your loop looks like this:
myPost.each_with_index do |post, index|
puts "#{index} #{post.to_s.gsub(/\n/, ' ').strip}"
end
Edit:
According to my limited understanding, XPath can only find nodes. The div's child nodes include the <br /> elements, so you either get multiple separate text nodes (the text between them) or you have to include the div itself in your search. There is surely a way to join the texts between the <br /> nodes, but I don't know it.
Until you find it, here is something that works:
replace your xpath match with "//div[@class='postcontent']"
adjust your loop to delete the div tags:
myPost.each_with_index do |post, index|
post = post.to_s
post.gsub!(/\n/, ' ')
post.gsub!(/^<div[^>]*>/, '') # delete opening div tag
post.gsub!(%r|</\s*div[^>]*>|, '') # delete closing div tag
puts "#{index} #{post.strip}"
end
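The point about XPath returning one node per run of text can be checked with a small sketch. This is not the asker's Nokogiri code: it swaps in REXML so it runs without extra gems, and the <root> wrapper and the exact position of the <br /> are assumptions made so that the sample parses as XML.

```ruby
require 'rexml/document'

# Sample markup modelled on the question. The <root> wrapper and the
# placement of <br /> are assumptions, not taken from the real page.
xml = <<~XML
  <root>
    <div id="post_message_111112" class="postcontent">
      Hee is text 1<br />
      here is another
    </div>
    <div id="post_message_111111" class="postcontent">
      Here is text 2
    </div>
  </root>
XML

doc = REXML::Document.new(xml)

# Selecting text() yields one node per run of text, so the <br />
# splits the first div into separate entries.
text_nodes = REXML::XPath.match(doc, "//div[@class='postcontent']/text()")
puts "text nodes: #{text_nodes.length}"

# Selecting the divs themselves and joining each div's own text nodes
# keeps one message per div.
posts = REXML::XPath.match(doc, "//div[@class='postcontent']").map do |div|
  div.texts.map(&:value).join(' ').gsub(/\s+/, ' ').strip
end
posts.each_with_index { |post, i| puts "#{i} #{post}" }
```

The same idea should carry over to Nokogiri: select the div nodes rather than their text() children, then flatten each div's text.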

Here, let me clean that up for you:
doc.search('div.postcontent').each_with_index do |div, i|
puts "#{i} #{div.text.gsub(/\s+/, ' ').strip}"
end
# 0 Hee is text 1 here is another
# 1 Here is text 2

Related

VBA Excel IE automation: locate element by custom tag

I need to pick out an element by a custom HTML tag - i.e., where the custom tag would be "somecustomtag" in the following div element:
<div class="panel-one" somecustomtag="blue">
I just can't remember the syntax. I know it's something like:
Set myElements = IE.Document.getElementsbyTagName("div")
For Each ob in myElements
If ob.subTag("somecustomtag") = "blue" then ' ????????
someStringVariable = ob.innerText
exit for
End If
Next ob
I've used this a dozen times before but can't find it anywhere. What is the proper syntax for .subTag?
In your case, somecustomtag is an attribute. You can get its value with the following code snippet:
ob.getAttribute("somecustomtag")

Creating HTML links from images in :colons: with Ruby

I have a simple HTML document:
<div should-not-be-replaced=":smile:">
Hello :smile:!
</div>
How would I replace the :smile: text with <img src="smile.png">, while leaving the :smile: in the attribute unchanged, to get this:
<div should-not-be-replaced=":smile:">
Hello <img src="smile.png">!
</div>
I tried this, but Nokogiri escapes my HTML as plain text:
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
next unless x.text?
x.content = x.text.gsub(':smile:', '<img src="smile.png">')
end
My solution is very similar to Ku's, although I've tried to handle cases where the text to replace occurs multiple times in the source, by completely replacing the text node's content with an HTML document fragment:
doc = Nokogiri::HTML::DocumentFragment.parse(DATA.read)
doc.traverse do |x|
next unless x.text?
if x.text.match(%r{:(\w+):})
replace_text = x.text.gsub(%r{:(\w+):}, '<img src="\1.png">') # \1 is filled in separately for each match
x.content = ""
x.add_next_sibling replace_text
end
end
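A side note on gsub with capture groups, as a minimal plain-Ruby sketch (the sample string is made up for illustration). Inside a double-quoted replacement string, #{$1} is interpolated once, before gsub runs, so it does not track each match. A \1 backreference in a single-quoted string, or the block form, is resolved per match.

```ruby
# Hypothetical sample text with two different names between colons.
text = 'Hello :smile: and :wink:!'

# Backreference in a single-quoted replacement: resolved for every match.
with_backref = text.gsub(/:(\w+):/, '<img src="\1.png">')

# Block form: $1 is set freshly each time the block is called.
with_block = text.gsub(/:(\w+):/) { %(<img src="#{$1}.png">) }

puts with_backref
puts with_block
# Both print: Hello <img src="smile.png"> and <img src="wink.png">!
```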
I think this might be what you want. It also handles any string between two colons, like :something:, and produces "something.png" as well.
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
if x.text? && x.content =~ /:\w+:/
x.content = x.content.sub(/:(\w+):/, '')
a = Nokogiri::HTML::DocumentFragment.parse('<img src="'+$1+'.png">')
x.add_next_sibling(a)
end
end
You are making it much too hard, and using traverse, which is slow because it forces Nokogiri to walk through every node in the document. In a large page that is costly.
Instead take advantage of selectors to find the specific node(s) you want:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div parm=":smile:">
Hello :smile:!
</div>
EOT
div = doc.at('div[parm=":smile:"]')
div.inner_html = div.text.sub(/:smile:/, '<img src="smile.png">')
puts doc.to_html
Running that results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div parm=":smile:">
Hello <img src="smile.png">!
</div>
</body></html>
I'm using at, which finds the first occurrence. If you need to process more than one, use search. search returns a NodeSet, which is like an array, so you'll want to iterate over it. There are innumerable examples of doing so on Stack Overflow and elsewhere.
Do you mean it returns &lt; or &gt;?
I recommend wrapping the output in CGI's unescape_html method.
Try:
require 'cgi'
CGI::unescape_html(doc.to_s)
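To see what that call does in isolation, here is a minimal sketch using only the standard library; the escaped input string is an assumption standing in for what doc.to_s might return. Note that this only changes the serialized string: it does not turn the escaped text into real nodes inside the parsed document.

```ruby
require 'cgi'

# Hypothetical escaped output, standing in for doc.to_s.
escaped = 'Hello &lt;img src=&quot;smile.png&quot;&gt;!'

# unescapeHTML turns the entities back into literal characters.
unescaped = CGI.unescapeHTML(escaped)
puts unescaped
# prints: Hello <img src="smile.png">!
```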

Regex to detect all characters outside of the <img> tag

I don't have experience in regex. I am just trying to find a way to detect
and delete every character outside of the img tag. In other words I want to
strip a given html code from all text and tags and just keep everything within
the img tags. The result should show just the image tags like that:
<img src="sourcehere">
Is there a way to do this?
UPDATE:
I need specifically a regex that goes in preg_replace.
This is what I have done, but it doesn't work:
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
$buffer = preg_replace('(?i)<(?!img|/img).*?>', '', $buffer);
echo $buffer; /* should output <img src='image.jpg'> but it doesn't */
What are your plans: do you want to log it to a file, just display it in a console, or output it in some other way? This worked for me, but actually 'stringing' it out might take extra work.
This is jQuery. From my understanding, you want to remove everything but the images from your document.
var arr2 = Array.prototype.slice.call( document.images );
jQuery('body').contents().remove();
for(var i = 0; i < arr2.length; i++){
jQuery('body').append(arr2[i])
}
This doesn't need to be some big and fancy regex.
<img[^>]*>
This matches the text "<img", followed by any number of characters other than ">", followed by the closer ">".
Once you have the matches you would just want to write out the matches to a string, or to the document, or however you want to represent them.
EDIT:
To complete what the OP is showing in PHP, you would want to call match instead of replace. You don't really need to replace all of the non-matching sections. You can just keep the results:
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
preg_match_all("/<img[^>]*>/", $buffer, $matchArray); // collect every <img> tag, not just the first
foreach ($matchArray[0] as $match){ // index 0 holds the full matches
echo $match;
}
prints out:
<img src='image.jpg'>
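For comparison, the same "keep the matches instead of replacing everything else" idea can be sketched in Ruby with String#scan. This is an illustration in a different language from the PHP above, and the sample buffer with two images is made up.

```ruby
# Hypothetical markup containing two image tags among other content.
buffer = "<html><head></head><body><p>text</p><img src='a.jpg'>" \
         "<b>x</b><img src='b.png' alt=''></body></html>"

# scan returns every non-overlapping match of the pattern as an array.
img_tags = buffer.scan(/<img[^>]*>/i)

puts img_tags.join
# prints: <img src='a.jpg'><img src='b.png' alt=''>
```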
EDIT:
The problem I am seeing with trying to replace every other tag will be when you have contents between the tags. If you don't care about that, then here is something that works using preg_replace().
$buffer ="<html><head></head><body><img src='image.jpg'></body></html>";
$buffer = preg_replace('/(?i)<\\/*(?!img)[^>]*>/', '', $buffer);
echo $buffer; /* outputs <img src='image.jpg'> */
Use
preg_replace('/<img[^>]*>(*SKIP)(*FAIL)|./si', '', $buffer)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
  <img                     '<img'
--------------------------------------------------------------------------------
  [^>]*                    any character except: '>' (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  >                        '>'
--------------------------------------------------------------------------------
  (*SKIP)(*FAIL)           skips the match
--------------------------------------------------------------------------------
  |                        or
--------------------------------------------------------------------------------
  .                        any character
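
The (*SKIP)(*FAIL) verbs are a PCRE feature. In engines that lack them (Ruby's, as far as I know), a similar effect can be sketched by capturing the part to keep in the first alternative and replacing every match with that capture, which is empty for the single-character fallback. The sample buffer below is an assumption.

```ruby
# Hypothetical markup with text around a single image tag.
buffer = "<html><head></head><body>text <img src='image.jpg'> more</body></html>"

# First alternative: a whole <img ...> tag, captured in group 1.
# Second alternative: any single character (including newlines, via /m).
# Each match is replaced by group 1, which is nil (so "") for the fallback.
only_images = buffer.gsub(/(<img[^>]*>)|./mi) { $1.to_s }

puts only_images
# prints: <img src='image.jpg'>
```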

How to write a transformer for Ruby Sanitize Gem to transform <br> into newlines?

I'm using a wrapper for the Sanitize Gem's clean method to solve some of our issues:
def remove_markup(html_str)
html_str = html_str.gsub(%r{</p>}, "</p>\n") # gsub returns a new string, so keep the result
marked_up = Sanitize.clean html_str
ESCAPE_SEQUENCES.each do |esc_seq, ascii_seq|
marked_up = marked_up.gsub('&' + esc_seq + ';', ascii_seq.chr)
end
marked_up
end
I recently added the gsub line as a quick way to do what I wanted:
insert a newline wherever a paragraph ended.
However, I'm sure this can be accomplished more elegantly with a Sanitize transformer.
Unfortunately, I think I must be misunderstanding a few things. Here is an example of a transformer I wrote for the <br> tag that worked.
s2 = "<p>here is para 1<br> It's a nice paragraph</p><p>Don't forget para 2</p>"
br_to_nl = lambda do |env|
node = env[:node]
node_name = env[:node_name]
return if env[:is_whitelisted] || !node.element?
return unless node_name == 'br'
node.replace "\n"
end
Sanitize.clean s2, :transformers => [br_to_nl]
=> " here is para 1\n It's a nice paragraph Don't forget para 2 "
But I couldn't come up with a solution that would work well for <p> tags.
Should I add a text element to the node as a child? How do I make it show up immediately after the element?
related question (answered) How to use RubyGem Sanitize transformers to sanitize an unordered list into a comma seperated list?

Parse html using Perl works for 2 lines but not multiple

I have written the following Perl script-
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
<a>User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
my $source = "foo";
my @time = "10-14-2011";
my $name = $html->find('a')->as_text;
my $comment = $html->as_text;
my @keywords = map { $_->as_text } $html->find('b');
Which outputs: foo, 10-14-2011, User, 1h User: There are not enough big fish in the lake, big fish
That is perfect and exactly what I wanted from the test HTML, but it only works with the HTML shown above, which I put in for test purposes.
However, the full HTML file has multiple occurrences of 'a' and 'b', for instance, and when printing the results these columns come out blank.
How can I account for multiple values for specific searches?
Without sight of your real HTML it is hard to help, but $html->find returns a list of <a> elements, so you could write something like
foreach my $anchor ($html->find('a')) {
print $anchor->as_text, "\n";
}
But that will find all <a> elements, and it is unlikely that that is what you want. $html->look_down() is far more flexible, and provides for searching by attribute as well as by tag name.
I cannot begin to guess about your problem with comments without seeing what data you are dealing with.
If you need to process each text element independently then you probably need to call the objectify_text method. This turns every text element in the tree into a pseudo element with a ~text tag name and a text attribute, for instance <p>paragraph text</p> would be transformed into <p><~text text="paragraph text" /></p>. These elements can be discovered using $html->find('~text') as normal. Here is some code to demonstrate
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);
<span class=time>1 h </span>
<a>User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML
$html->objectify_text;
print $_->attr('text'), "\n" for $html->find('~text');
OUTPUT
1 h
User
: There are not enough
big
fish
in the lake ;