How to extract the text from an <a> tag using Nokogiri - html

Example:
<p>http://localhost:3000/replies/279<br><p>
Currently using Nokogiri to grab the href from the <a>:
doc.search('a').each do |node|
  href = node.attributes['href'].try(:value)
end
I need to make sure what's in the text part is what's in the href and I'm not sure how to extract that.

Here are the basics for checking:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p>http://localhost:3000/replies/279<br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => false
Modifying the HTML so the HREF and text match:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p><a href="http://localhost:3000/replies/279">http://localhost:3000/replies/279</a><br><p>
EOT
link = doc.at('a')
link['href'] == link.text # => true
at returns only the first node that matches the selector, so if you're looking to check multiple nodes you'll want to use search and iterate over the NodeSet it returns.

Related

Replacing <a> with Nokogiri

I am using Nokogiri to scan a document and remove specific files that are stored as attachments. I want to note however that the value was removed in-line.
Eg.
File Download
Converted to:
File Removed
Here is what I tried:
@doc = Nokogiri::HTML(html).to_html
@doc.search('a').each do |attachment|
  attachment.remove
  attachment.content = "REMOVED"
  # ALSO TRIED:
  attachment.content = "REMOVED"
end
The second one does replace the anchor text but keeps the href and the user can still download the value.
How can I replace the anchor and change it to a <p> containing a new string?
Use a combination of create_element and replace to achieve that. See the inline comments below.
html = 'File Download'
dom = Nokogiri::HTML(html) # parse with nokogiri
dom.to_s # original content
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>File Download</body></html>\n"
# scan the dom for hyperlinks
dom.css('a').each do |a|
  node = dom.create_element 'p' # create paragraph element
  node.inner_html = "REMOVED" # add content you want
  a.replace node # replace found link with paragraph
end
dom.to_s # modified html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>REMOVED</p></body></html>\n"
Hope this helps

Download HTML Text with Ruby

I am trying to create a histogram of the letters (a, b, c, etc.) on a specified web page. I plan to build the histogram itself using a hash. However, I am having a bit of a problem actually getting the HTML.
My current code:
#!/usr/local/bin/ruby
require 'net/http'
require 'open-uri'
# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)
def open(url)
Net::HTTP.get(URI.parse(url))
end
page_content = open('_insert_webpage_here')
page_content.each do |i|
puts i
end
This does a good job of getting the HTML. However, it gets it all. For www.stackoverflow.com it gives me:
<body><h1>Object Moved</h1>This document may be found here</body>
Pretending that it was the right page, I don't want the html tags. I'm just trying to get Object Moved and This document may be found here.
Is there any reasonably easy way to do this?
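When you require 'open-uri', you don't need to redefine open with Net::HTTP; open-uri can fetch the page itself.
require 'open-uri'
# On Ruby 3.0 and later, open-uri is called as URI.open; the bare open no longer accepts URLs.
page_content = URI.open('http://www.stackoverflow.com').read
histogram = Hash.new(0) # missing keys default to 0
page_content.each_char do |c|
  histogram[c] += 1
end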
Note: this does not strip out <tags> within the HTML document, so <html><body>x!</body></html> will have { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said was not available), or some sort of regular expression (such as the one in Dru's answer).
See the "Following Redirection" section of the Net::HTTP documentation.
Stripping html tags without Nokogiri
puts page_content.gsub(/<\/?[^>]*>/, "")
http://codesnippets.joyent.com/posts/show/615
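Putting the two ideas together, here is a rough sketch that strips the tags with the regular expression above and then counts only letters. It uses the "Object Moved" response from the question as a stand-in for a downloaded page, and it assumes the markup is simple enough for a regex to handle (no scripts, comments, or `>` inside attribute values):

```ruby
# Stand-in for the downloaded page; in practice this would be the
# string returned by the HTTP request.
page_content = '<body><h1>Object Moved</h1>This document may be found here</body>'

# Remove anything that looks like a tag (same regex as above).
text = page_content.gsub(/<\/?[^>]*>/, '')

# Count letters only, ignoring case, spaces and punctuation.
histogram = Hash.new(0)
text.downcase.scan(/[a-z]/) { |letter| histogram[letter] += 1 }

histogram.sort.each { |letter, count| puts "#{letter}: #{count}" }
```

Because the hash is created with `Hash.new(0)`, looking up a letter that never occurred returns 0 rather than nil.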

Nokogiri help without spaces

I have the following code:
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'cora'
require 'eat'
#require 'timeout'
doc = Nokogiri::HTML(open("http://mobile.bahn.de/bin/mobil/bhftafel.exe/dox?input=Richard-Strauss-Stra%DFe%2C+M%FCnchen%23625127&date=27.01.12&time=20%3A41&productsFilter=1111111111000000&REQTrain_name=&maxJourneys=10&start=Suchen&boardType=Abfahrt&ao=yes"))
doc = doc.xpath('//div').each do |node|
puts node.content
end
How can I remove the <p> tags and the extra whitespace?
Here's a guess at what you might want:
require 'nokogiri'
require 'open-uri'
# On Ruby 3.0 and later, open-uri is called as URI.open.
doc = Nokogiri::HTML(URI.open("http://mobile.bahn.de/bin/mobil/bhftafel.exe/dox?input=Richard-Strauss-Stra%DFe%2C+M%FCnchen%23625127&date=27.01.12&time=20%3A41&productsFilter=1111111111000000&REQTrain_name=&maxJourneys=10&start=Suchen&boardType=Abfahrt&ao=yes"))
doc.xpath('//div//p').remove
doc.xpath('//div').each do |node|
  text = node.text.gsub(/\n([ \t]*\n)+/,"\n").gsub(/^\s+|\s+$/,'')
  puts text unless text.empty?
end
This removes all <p> elements from the document and then removes all blank lines and leading and trailing whitespace from the text. In the end, it does not print the text if the result was an empty string.
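To see what the two gsub calls do in isolation, here is a small sketch that needs no network access. The `raw` string is a made-up stand-in for what node.text might return; it is not real output from the page:

```ruby
# Hypothetical text with blank lines and stray indentation.
raw = "\n  Abfahrt  \n\n \t\n  20:41 U2\n\n"

# Step 1: collapse each run of blank (or whitespace-only) lines into one newline.
collapsed = raw.gsub(/\n([ \t]*\n)+/, "\n")

# Step 2: trim whitespace at the start and end of every line.
cleaned = collapsed.gsub(/^\s+|\s+$/, '')

puts cleaned unless cleaned.empty?

# A whitespace-only string ends up empty, so it would not be printed.
blank = " \n\t\n ".gsub(/\n([ \t]*\n)+/, "\n").gsub(/^\s+|\s+$/, '')
```

In Ruby, `^` and `$` match at the start and end of each line, not only of the whole string, which is why the second gsub trims every line.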
Edit: To make a variable for the date, wrap the above in a function and use string interpolation to construct your URL. For example:
require 'nokogiri'
require 'open-uri'
def get_data(date)
  date_string = date.strftime('%d.%m.%y') # same format as date=27.01.12 in the URL above
  url = "http://mobile.bahn.de/…more…#{date_string}…more…"
  doc = Nokogiri::HTML(URI.open(url)) # URI.open is required on Ruby 3.0 and later
  # more code from above
end

How to wrap words in HTML document without attributes and tag names

I have an HTML document that has long words:
<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
I want to word-wrap it without cutting the tags or its attributes:
<div>this is a veeeeeeeeeeeerryyyyy yyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>
Also, it's possible that I will not have any HTML tag at all.
I tried Nokogiri, but it inserts a paragraph in tagless input, and wraps the whole response with an HTML document, which is not my intention.
What is the best way to accomplish this?
require "nokogiri"
class String
  def wrap
    doc = Nokogiri::HTML(self)
    doc.at("body").traverse do |p|
      if p.is_a?(Nokogiri::XML::Text)
        input = p.content
        p.content = input.scan(/.{1,25}/).join(" ")
      end
    end
    doc.to_s # I want only the wrapped string, without the head/body stuff
  end
end
I think using Nokogiri::XML(self) instead of Nokogiri::HTML(self) will help you.
This looks like a starting point for you:
require 'nokogiri'
max_word_length = 30
html = '<div>this is a veeeeeeeeeeeerryyyyyyyyloooongwoooord<img src="/fooooooooobaaar.jof" ></div>'
doc = Nokogiri::HTML.fragment(html)
doc.search('text()').each do |n|
  n.content = n.content.split(' ').map { |l|
    if (l.size > max_word_length)
      l = l.scan(/.{1,#{ max_word_length }}/).join("\n")
    end
    l
  }.join(' ')
end
puts doc.to_html
# >> <div>this is a veeeeeeeeeeeerryyyyyyyyloooong
# >> woooord<img src="/fooooooooobaaar.jof">
# >> </div>

How can I extract HTML img tags wrapped in anchors in Perl?

I am parsing HTML to obtain all the hrefs that match a particular URL (let's call it the "target url") and then get the anchor text. I have tried the LinkExtractor, TokenParser, Mechanize, and TreeBuilder modules. For the HTML below:
<a href="target_url">
<img src=somepath/nw.gf alt="Open this result in new window">
</a>
all of them give "Open this result in new window" as the anchor text.
Ideally I would like to see blank value or a string like "image" returned so that I know there was no anchor text but the href still matched the target url (http://www.yahoo.com in this case). Is there a way to get the desired result using other module or Perl regex?
Thanks,
You should post some examples that you tried with "LinkExtractor, TokenParser, Mechanize & TreeBuilder" so that we can help you.
Here is something which works for me in pQuery:
use feature 'say'; # enables say
use pQuery;
my $data = '
<html>
Not yahoo anchor text
<img src="somepath/nw.gif" alt="Open this result in new window"></img>
just text for yahoo
anchor text only<img src="blah" alt="alt text"/>
</html>
';
pQuery( $data )->find( 'a' )->each(
    sub {
        say $_->innerHTML
            if $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
    }
);
# produces:
#
# => <img alt="Open this result in new window" src="somepath/nw.gif"></img>
# => just text for yahoo
# => anchor text only<img /="/" alt="alt text" src="blah"></img>
#
And if you just want the text:
pQuery( $data )->find( 'a' )->each(
    sub {
        return unless $_->getAttribute( 'href' ) eq 'http://www.yahoo.com';
        if ( my $text = pQuery($_)->text ) { say $text }
    }
);
# produces:
#
# => just text for yahoo
# => anchor text only
#
/I3az/
Use a proper parser (like HTML::Parser or HTML::TreeBuilder). Using regular expressions to parse SGML (HTML/XML included) isn't really all that effective because of funny multiline tags and attributes like the one you've run into.
If the HTML you are working with is fairly close to well formed you can usually load it into an XML module that supports HTML and use it to find and extract data from the parts of the document you are interested in.
My method of choice is XML::LibXML and XPath.
use feature 'say'; # enables say
use XML::LibXML;
my $parser = XML::LibXML->new();
my $html = ...;
my $doc = $parser->parse_html_string($html);
# Select every <a> whose href attribute equals the target URL.
my @links = $doc->findnodes('//a[@href = "http://example.com"]');
for my $node (@links) {
    say $node->textContent();
}
The string passed to findnodes is an XPath expression that looks for all 'a' element descendants of $doc that have an href attribute equal to "http://example.com".