How to get a canonical link from HTML head using Nokogiri - html

I'm trying to get the defined canonical link from a webpage using Nokogiri:
<link rel="canonical" href="https://test.com/somepage">
It's the href I'm after.
Whatever I try it doesn't seem to work. This is what I have:
page = Nokogiri::HTML.parse(browser.html)
canon = page.xpath('//canonical/@href')
puts canon
This doesn't return anything, not even an error.

You are trying to get the attribute but that is not how you do it.
You can use this:
page.xpath('//link[@rel="canonical"]/@href')
What it says is: get me a link element anywhere in the document that has a rel attribute that equals "canonical" and when you find that node, get me its href attribute.
The full answer is:
page = Nokogiri::HTML.parse(browser.html)
canon = page.xpath('//link[@rel="canonical"]/@href')
puts canon
What you tried to do is get a node that is called "canonical", not the attribute.

I'm a fan of using CSS selectors over XPath as they're more readable:
require 'nokogiri'
doc = Nokogiri::HTML('<link rel="canonical" href="https://test.com/somepage">')
doc.at('link[rel="canonical"]')['href'] # => "https://test.com/somepage"
There's some confusion about what Nokogiri, and XPath, are returning when accessing a node parameter. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML('<link rel="canonical" href="https://test.com/somepage">')
Here's how I'd do it using CSS:
doc.at('link[rel="canonical"]').class # => Nokogiri::XML::Element
doc.at('link[rel="canonical"]')['href'].class # => String
doc.at('link[rel="canonical"]')['href'] # => "https://test.com/somepage"
XPath, while more powerful, is also capable of making you, or Nokogiri, Ruby or the CPU, do more work.
First, xpath, which is the XPath-specific version of search, returns a NodeSet, not a node or an element. A NodeSet is akin to an array of nodes, which can bite you if you're not aware of what you've got. From the NodeSet documentation:
A NodeSet contains a list of Nokogiri::XML::Node objects. Typically a NodeSet is return as a result of searching a Document via Nokogiri::XML::Searchable#css or Nokogiri::XML::Searchable#xpath
If you are looking for a specific node, or only a single instance of a particular type of node, then use at, or if you want to be picky, use at_css or at_xpath. (Nokogiri can usually figure out what you mean when using at or search but sometimes you have to use the specific method to give Nokogiri a nudge in the right direction.) Using at in the above example shows it returns the node itself, and once you've got the node it's trivial to get the value of any parameter by treating it as a hash.
xpath, search and css all return NodeSets, so, like an array, you need to point to the actual element you want then access the parameter:
doc.xpath('//link[@rel="canonical"]/@href').class # => Nokogiri::XML::NodeSet
doc.xpath('//link[@rel="canonical"]/@href').first.class # => Nokogiri::XML::Attr
doc.xpath('//link[@rel="canonical"]/@href').text # => "https://test.com/somepage"
Notice that '//link[@rel="canonical"]/@href' results in Nokogiri returning an Attr object, not text. You can print that object, and Ruby will stringify it, but it won't behave like a String resulting in errors if you try to treat it like one. For instance:
doc.xpath('//link[@rel="canonical"]/@href').first.downcase # => NoMethodError: undefined method `downcase' for #<Nokogiri::XML::Attr:0x007faace115d20>
Instead, use text or content to get the text value:
doc.at('//link[@rel="canonical"]/@href').class # => Nokogiri::XML::Attr
doc.at('//link[@rel="canonical"]/@href').text # => "https://test.com/somepage"
or get the element itself and then access the parameter like you would a hash:
doc.at('//link[@rel="canonical"]').class # => Nokogiri::XML::Element
doc.at('//link[@rel="canonical"]')['href'] # => "https://test.com/somepage"
either of which will return a String.
Also notice I'm not using @href to return the Attr in this example. I'm only getting the Node itself, then using ['href'] to return the text of the parameter. It's a shorter selector and makes more sense, at least to me, since Nokogiri doesn't have to return an Attr object that you then have to convert using text, or that causes problems when you accidentally treat it as a String.

Related

How can I strip HTML tags from a string in the model before I get to the view

Trying to determine how to strip the HTML tags from a string in Ruby. I need this to be done in the model before I get to the view. So using:
ActionView::Helpers::SanitizeHelper#strip_tags()
won't work. I was looking into using Nokogiri, but can't figure out how to do it.
If I have a string:
description = <a>google</a>
I need it to be converted to plain text without including HTML tags so it would just come out as "google".
Right now I have the following which will take care of HTML entities:
def simple_description
  simple_description = Nokogiri::HTML.parse(self.description)
  simple_description.text
end
You can call the sanitizer directly like this:
Rails::Html::FullSanitizer.new.sanitize('<b>bold</b>')
# => "bold"
There are also other sanitizer classes that may be useful: FullSanitizer, LinkSanitizer, Sanitizer, WhiteListSanitizer.
Nokogiri is a great choice if you don't own the HTML generator and you want to reduce your maintenance load:
require 'nokogiri'
description = '<a>google</a>'
Nokogiri::HTML::DocumentFragment.parse(description).at('a').text
# => "google"
The good thing about a parser vs. using patterns is that the parser continues to work when the tags or format of the document change, whereas patterns get tripped up by those things.
While using a parser is a little slower, it more than makes up for that by the ease of use and reduced maintenance.
The code above breaks down to:
Nokogiri::HTML(description).to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a>google</a></body></html>\n"
Rather than let Nokogiri add the normal HTML headers, I told it to parse only that one node into a document fragment:
Nokogiri::HTML::DocumentFragment.parse(description).to_html
# => "<a>google</a>"
at finds the first occurrence of that node:
Nokogiri::HTML::DocumentFragment.parse(description).at('a').to_html
# => "<a>google</a>"
text finds the text in the node.
Maybe you could use a regular expression in Ruby, like the following:
des = '<a>google</a>'
p des[/<.*>(.*)\<\/.*>/,1]
The result will be "google".
Regular expressions are powerful.
You can customize the pattern to fit your needs.

Loop over all the <dd> tags and extract specific information via Mechanize/Nokogiri

I know the basic things of accessing a website and so (I just started learning yesterday), however I want to extract now. I checked out many tutorials of Mechanize/Nokogiri but each of them had a different way of doing things which made me confused. I want a direct bold way of how to do this:
I have this website: http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green+tea
and I want to extract certain things in a structured way. If I inspect the element of this webpage and go to the body, I see so many <dd>..</dd>'s under the <dl class="dl-horizontal">. Each one of them has an <a> part which contains a href. I would like to extract this href and the bold parts of the text ex <b>green tea</b>.
I created a simple structure:
info = Struct.new(:ObjectID, :SourceID)
Thus each of these <dd> elements will add the bold text to the object ID and the href to the source ID.
This is the start of the code I have, just retrieval no extraction:
agent = Mechanize.new { |agent| agent.user_agent_alias = "Windows Chrome" }
html = agent.get('http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green+tea').body
html_doc = Nokogiri::HTML(html)
The other thing is that I am confused about whether to use Nokogiri directly or through Mechanize. The problem is that there isn't enough documentation provided by Mechanize so I was thinking of using it separately.
For now I would like to know how to loop through these and extract the info.
Here's an example of how you could parse the bold text and href attribute from the anchor elements you describe:
require 'nokogiri'
require 'open-uri'
url = 'http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green%20tea'
doc = Nokogiri::HTML(URI.open(url)) # URI.open: Kernel#open no longer accepts URLs as of Ruby 3.0
doc.xpath('//dd/*/a').each do |a|
  text = a.xpath('.//b').map { |b| b.text.gsub(/\s+/, ' ').strip }
  href = a['href']
  puts "OK: text=#{text.inspect}, href=#{href.inspect}"
end
# OK: text=["Green tea", "many antioxidants"], href="http://www.talbottteas.com/category_s/55.htm"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.skin-care-experts.com/tag/best-skin-care/page/4"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.specialitybrand.com/news/view/207.html"
In a nutshell, this solution uses XPath in two places:
1. Initially to find every a element underneath each dd element.
2. Then to find each b element inside of the a elements found in #1 above.
The final trick is cleaning up the text within the "b" elements into something presentable. Of course, you might want it to look different somehow.

Ruby JSON Multi-Word Strings Rendered Incorrectly in HTML

My multi-word strings aren't being interpreted correctly in the DOM. How can I ensure JSON integrity from server to HTML with multi-word strings?
In the Rails controller, I store a JSON object in a variable. That object is correctly formatted when I check in the server logs. That variable is then passed to a data attribute in the view via erb. However, the generated HTML is incorrect.
# in the controller
@hash = {'subhash' => {'single' => 'word', 'double' => 'two words' } }.to_json
puts @hash.inspect
# output in the server log
=> "{\"subhash\":{\"single\":\"word\",\"double\":\"two words\"}}"
# view.html.erb
<section data-hash=<%= @hash %> ></section>
# generated html, 'double' value is incorrect
<section data-hash="{"subhash":{"single":"word","double":"two" words"}}>
# complete data value is not obtainable in the console
> $('section').data().hash
< "{"subhash":{"single":"word","double":"two"
Update
# adding html_safe does not help
{"subhash" => {"single" => "word", "double" => "two words" } }.to_json.html_safe
# results in
"{"subhash":{"single":"word","double":"two" words"}}
If you push past the browser's interpretation of your "HTML" and look directly at the source you should see something like this:
<section data-hash={&quot;subhash&quot;:{&quot;single&quot;:&quot;word&quot;,&quot;double&quot;:&quot;two words&quot;}}>
At least that's what I get with Rails 4.2. That also gives me the same results that you're seeing when I look at the HTML in a DOM inspector.
The browser is seeing that as:
<section data-hash={"subhash":{"single":"word","double":"two words"}}>
and trying to make sense of it as HTML but that doesn't make sense as HTML so the browser just gets confused.
If you add the outer quotes yourself:
<section data-hash="<%= @hash %>">
                   ^            ^
then you'll get this in the HTML:
<section data-hash="{&quot;subhash&quot;:{&quot;single&quot;:&quot;word&quot;,&quot;double&quot;:&quot;two words&quot;}}">
and the quotes will be interpreted correctly, as the &quot;s will be seen as HTML-encoded content for the data-hash attribute rather than a strange-looking attempt at an HTML attribute.
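The round trip can be checked outside Rails with the standard library alone. This is a sketch, not what Rails does internally: plain ERB does not escape output automatically, so ERB::Util.html_escape stands in for the escaping Rails applies in views, and the regex plus CGI.unescapeHTML is only a rough simulation of how a browser reads a quoted attribute.

```ruby
require 'json'
require 'erb'
require 'cgi'

json = { 'subhash' => { 'single' => 'word', 'double' => 'two words' } }.to_json

# Quoted attribute with an HTML-escaped value, mimicking the fixed view.
template = ERB.new('<section data-hash="<%= ERB::Util.html_escape(json) %>"></section>')
html = template.result(binding)
puts html

# Simulate the browser: read the quoted attribute value and decode the entities.
attribute_value = html[/data-hash="([^"]*)"/, 1]
round_tripped = JSON.parse(CGI.unescapeHTML(attribute_value))
puts round_tripped['subhash']['double'] # prints two words
```

Because every inner quote is written as &quot;, the attribute value contains no raw double quotes, so the space in "two words" no longer ends the attribute early.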

Ruby regex to match content between <ul> tags

I have a script to grab a page and edit it. The page HTML looks something like this:
<p>Title</p>...extra content...<ul><li>Item1</li><li>Item2</li></ul>
There are multiple titles and multiple unordered lists but I want to change each list with a regular expression that can find the list with a certain title and use .sub in Ruby to replace it.
The regex I currently have looks like this:
regex = /<p>Title1?.*<\/ul>/
Now if there are any items below the regex it will match to the last tag and accidentally grab all the lists below it for example if I have this content:
content = "<p>Title1</p><ul><li>Item1</li><li>Item2</li></ul><p>Title2</p><ul><li>Item1</li><li>Item2</li><li>Item3</li></ul>"
and I want to add another list item to the section for Title 1:
content.sub(regex, "<p>Title1</p><ul><li>Item1</li><li>Item2</li><li>NEW_ITEM</li></ul>")
It will delete all items below it. How do I rewrite my regex to select only the first /ul tag to substitute?
"I want to change each list with a regular expression." No you don't. You really do not want to go down this road because it's filled with misery, sorrow, and tears. One day someone will put a list item in your list item.
There are libraries like Nokogiri that make manipulating HTML very easy. There's no excuse to not use something like it:
require 'nokogiri'
html = "<p>Title</p>...extra content...<ul><li>Item1</li><li>Item2</li></ul>"
doc = Nokogiri::HTML(html)
doc.css('ul').children.first.inner_html = 'Replaced Text'
puts doc.to_s
That serves as a simple example for "replace text from first list item". It can be easily adapted to do other things, as the css method takes a simple CSS selector, not unlike jQuery.
Use a non-greedy (lazy) quantifier .*?
See this explanation of Ruby Regexp repetition.
regex = /<p>Title1?.*?<\/ul>/
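The difference between the two patterns can be seen on the sample content from the question. This is a small illustration only; the variable names greedy, lazy, and updated are introduced here for the comparison.

```ruby
content = '<p>Title1</p><ul><li>Item1</li><li>Item2</li></ul>' \
          '<p>Title2</p><ul><li>Item1</li><li>Item2</li><li>Item3</li></ul>'

greedy = /<p>Title1?.*<\/ul>/  # .* runs to the LAST </ul>
lazy   = /<p>Title1?.*?<\/ul>/ # .*? stops at the FIRST </ul>

puts content[greedy] == content # prints true
puts content[lazy]              # prints <p>Title1</p><ul><li>Item1</li><li>Item2</li></ul>

replacement = '<p>Title1</p><ul><li>Item1</li><li>Item2</li><li>NEW_ITEM</li></ul>'
updated = content.sub(lazy, replacement)
puts updated.include?('<p>Title2</p>') # prints true
```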
...it reformats the html with newlines and changes all <br /> to <br>...
That's usually because the wrong method is used when emitting the doc as HTML or XHTML:
doc = Nokogiri::HTML::DocumentFragment.parse('<p>foo<br />bar</p>')
doc.to_xhtml # => "<p>foo<br />bar</p>"
doc.to_html # => "<p>foo<br>bar</p>"
doc = Nokogiri::HTML::DocumentFragment.parse('<p>foo<br>bar</p>')
doc.to_xhtml # => "<p>foo<br />bar</p>"
doc.to_html # => "<p>foo<br>bar</p>"
As for spuriously adding line-ends where they weren't before, I haven't seen that. It's possible to tell Nokogiri to do that if you're modifying the DOM, but from what I've seen, on its own Nokogiri is very benign.

Find and replace entire HTML nodes with Nokogiri

I have HTML that needs to be transformed by replacing some tags with other tags.
I don't know these tags in advance because they come from the database, so Nokogiri's set_attribute or name methods are not suitable for me.
I need to do it in a way like this pseudo-code:
def preprocess_content
  doc = Nokogiri::HTML( self.content )
  doc.css("div.to-replace").each do |div|
    # "get_html_text" will obtain HTML from db. It can be anything, even another tags, tag groups etc.
    div.replace self.get_html_text
  end
  self.content = doc.css("body").first.inner_html
end
I found the Nokogiri::XML::Node#replace method. I think it is the right direction.
This method expects a node_or_tags parameter.
Which method should I use to create a new Node from text and replace the current one with it?
Like this:
doc.css("div.to-replace").each do |div|
  new_node = doc.create_element "span"
  new_node.inner_html = self.get_html_text
  div.replace new_node
end