Find and replace entire HTML nodes with Nokogiri

I have an HTML document that needs to be transformed by replacing some tags with other tags.
I don't know these tags in advance because they come from the database, so Nokogiri's set_attribute and name= methods are not suitable for me.
I need to do something like this pseudo-code:
def preprocess_content
  doc = Nokogiri::HTML( self.content )
  doc.css("div.to-replace").each do |div|
    # "get_html_text" will obtain HTML from db. It can be anything, even other tags, tag groups etc.
    div.replace self.get_html_text
  end
  self.content = doc.css("body").first.inner_html
end
I found the Nokogiri::XML::Node#replace method, and I think it is the right direction.
This method expects a node_or_tags parameter.
Which method should I use to create a new node from text and replace the current one with it?

Like this:
doc.css("div.to-replace").each do |div|
  new_node = doc.create_element "span"
  new_node.inner_html = self.get_html_text
  div.replace new_node
end


Creating a Sphinx code-block, with inline text parsing

I'm trying to create a directive, that will allow me to parse links inside a Sphinx CodeBlock directive. I looked at the ParsedLiteral directive from docutils, which does something like that, only it doesn't do syntax highlighting, like CodeBlock. I tried replacing the part of CodeBlock (in sphinx/directives/code.py), which generates the literal_block:
literal: Element = nodes.literal_block(code, code)
with
text_nodes, messages = self.state.inline_text(code, self.lineno)
literal: Element = nodes.literal_block(code, "", *text_nodes)
which is what the docutils ParsedLiteral directive does, but I of course kept the rest of the Sphinx CodeBlock. This parses the code correctly, but does not apply the correct syntax highlighting, so I'm wondering where the syntax highlighting is taking place, and why it's not taking place in my modified CodeBlock directive.
I'm very confused as to why this is the case and I'm looking for some input from smarter people than me.
Syntax highlights are applied at the translation phase, see sphinx.writers.html.HTMLTranslator.visit_literal_block:
def visit_literal_block(self, node: Element) -> None:
    if node.rawsource != node.astext():  # <<< LOOK HERE
        # most probably a parsed-literal block -- don't highlight
        return super().visit_literal_block(node)
    lang = node.get('language', 'default')
    linenos = node.get('linenos', False)
    # do highlight...
If the node's rawsource is not equal to its text, highlighting is not applied.
In your example, code is obviously not equal to literal.astext(), because the inline markup has been parsed into nodes.
Setting literal.rawsource to literal.astext() restores the syntax highlighting.

Python - Beautifulsoup, differentiate parsed text inside of an html element by using internal tags

So, I'm working on an HTML parser to extract some text data from a list of <div> elements and format it before giving an output. I have a title that I need to set as bold, and a description which I'll leave as it is. I've found myself stuck when I reached this situation:
<div class ="Content">
<Strong>Title:</strong>
description
</div>
As you can see the strings are actually already formatted but I can't seem to find a way to get the tags and the text out together.
What my script does kinda looks like:
article = ""  # all the formatted text is collected here as one long string before output
temp1 = ""
temp2 = ""
result = soup.findAll("div", {"class": "Content"})
if result is not None:
    x = 0
    for i in result.find("strong"):
        if x == 0:
            temp1 = "<strong>" + i.text + "</strong>"
            article += temp1
            x = 1
        else:
            temp2 = i.nextSibling  # I know this is wrong
            article += temp2
            x = 0
print(article)
It actually throws an AttributeError but it's a wrong one since the output is "Did you call find_all() when you meant to call find()?".
I also know I can't just use .nextSibling like that, and I'm literally losing it over something that looks so simple to solve...
What I need to get is: "Title: description"
Thanks in advance for any response.
I'm sorry if I couldn't explain really well what I'm trying to accomplish but that's kind of articulated; I actually need the data to generate a POST request to a CKEditor session so that it adds the text to the html page, but I need the text to be formatted in a certain way before uploading it. In this case I would need to get the element inside the tags and format it in a certain way, then do the same with the description and print them one after the other, for example a request could look like:
http://server/upload.php?desc=<ul>%0D%0A%09<li><strong>Title%26nbsp%3B<%2strong>description<%2li><%2ul>
So that the result is:
Title1: description
So what I need to do is differentiate between the element inside the tag and the one outside it, using the tag itself as a reference.
EDIT
To select the <strong> use:
soup.select_one('div.Content strong')
and then to select its nextSibling:
strong.nextSibling
You may need to strip it to get rid of surrounding whitespace:
strong.nextSibling.strip()
Just in case
You can use ANSI escape sequences to print something in bold, but I am not sure why you would do that. That is something that should be clarified in your question.
Example
from bs4 import BeautifulSoup
html='''
<div class ="Content">
<Strong>Title:</strong>
description
</div>
'''
soup = BeautifulSoup(html,'html.parser')
text = soup.find('div', {'class': 'Content'}).get_text(strip=True).split(':')
print('\033[1m'+text[0]+': \033[0m'+ text[1])
Output
Title: description
You may want to use htql for this. Example:
text="""<div class ="Content">
<Strong>Title:</strong>
description
</div>"""
import htql
ret = htql.query(text, "<div>{ col1=<strong>:tx; col2=<strong>:xx &trim }")
# ret=[('Title:', 'description')]

How to get a canonical link from HTML head using Nokogiri

I'm trying to get the defined canonical link from a webpage using Nokogiri:
<link rel="canonical" href="https://test.com/somepage">
It's the href I'm after.
Whatever I try it doesn't seem to work. This is what I have:
page = Nokogiri::HTML.parse(browser.html)
canon = page.xpath('//canonical/@href')
puts canon
This doesn't return anything, not even an error.
You are trying to get the attribute but that is not how you do it.
You can use this:
page.xpath('//link[@rel="canonical"]/@href')
What it says is: get me a link element anywhere in the document that has a rel attribute that equals "canonical" and when you find that node, get me its href attribute.
The full answer is:
page = Nokogiri::HTML.parse(browser.html)
canon = page.xpath('//link[@rel="canonical"]/@href')
puts canon
What you tried to do is get a node that is called "canonical", not the attribute.
I'm a fan of using CSS selectors over XPath as they're more readable:
require 'nokogiri'
doc = Nokogiri::HTML('<link rel="canonical" href="https://test.com/somepage">')
doc.at('link[rel="canonical"]')['href'] # => "https://test.com/somepage"
There's some confusion about what Nokogiri, and XPath, are returning when accessing a node parameter. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML('<link rel="canonical" href="https://test.com/somepage">')
Here's how I'd do it using CSS:
doc.at('link[rel="canonical"]').class # => Nokogiri::XML::Element
doc.at('link[rel="canonical"]')['href'].class # => String
doc.at('link[rel="canonical"]')['href'] # => "https://test.com/somepage"
XPath, while more powerful, is also capable of making you, or Nokogiri, Ruby or the CPU, do more work.
First, xpath, which is the XPath-specific version of search returns a NodeSet, not a node or an element. A NodeSet is akin to an array of nodes, which can bite you if you're not aware of what you've got. From the NodeSet documentation:
A NodeSet contains a list of Nokogiri::XML::Node objects. Typically a NodeSet is return as a result of searching a Document via Nokogiri::XML::Searchable#css or Nokogiri::XML::Searchable#xpath
If you are looking for a specific node, or only a single instance of a particular type of node, then use at, or if you want to be picky, use at_css or at_xpath. (Nokogiri can usually figure out what you mean when using at or search but sometimes you have to use the specific method to give Nokogiri a nudge in the right direction.) Using at in the above example shows it returns the node itself, and once you've got the node it's trivial to get the value of any parameter by treating it as a hash.
xpath, search and css all return NodeSets, so, like an array, you need to point to the actual element you want then access the parameter:
doc.xpath('//link[@rel="canonical"]/@href').class # => Nokogiri::XML::NodeSet
doc.xpath('//link[@rel="canonical"]/@href').first.class # => Nokogiri::XML::Attr
doc.xpath('//link[@rel="canonical"]/@href').text # => "https://test.com/somepage"
Notice that '//link[@rel="canonical"]/@href' results in Nokogiri returning an Attr object, not text. You can print that object, and Ruby will stringify it, but it won't behave like a String, resulting in errors if you try to treat it like one. For instance:
doc.xpath('//link[@rel="canonical"]/@href').first.downcase # => NoMethodError: undefined method `downcase' for #<Nokogiri::XML::Attr:0x007faace115d20>
Instead, use text or content to get the text value:
doc.at('//link[@rel="canonical"]/@href').class # => Nokogiri::XML::Attr
doc.at('//link[@rel="canonical"]/@href').text # => "https://test.com/somepage"
or get the element itself and then access the parameter like you would a hash:
doc.at('//link[@rel="canonical"]').class # => Nokogiri::XML::Element
doc.at('//link[@rel="canonical"]')['href'] # => "https://test.com/somepage"
either of which will return a String.
Also notice I'm not using @href to return the Attr in this example, I'm only getting the Node itself then using ['href'] to return the text of the parameter. It's a shorter selector and makes more sense, at least to me, since Nokogiri isn't having to return the Attr object which you then have to convert using text or possibly run into problems when you accidentally treat it as a String.

Loop over all the <dd> tags and extract specific information via Mechanize/Nokogiri

I know the basic things of accessing a website and so (I just started learning yesterday), however I want to extract now. I checked out many tutorials of Mechanize/Nokogiri but each of them had a different way of doing things which made me confused. I want a direct bold way of how to do this:
I have this website: http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green+tea
and I want to extract certain things in a structured way. If I inspect the element of this webpage and go to the body, I see many <dd>..</dd> elements under the <dl class="dl-horizontal">. Each one of them has an <a> part which contains an href. I would like to extract this href and the bold parts of the text, e.g. <b>green tea</b>.
I created a simple structure:
info = Struct.new(:ObjectID, :SourceID)
so that for each of these <dd> elements the bold text goes into the object ID and the href into the source ID.
This is the start of the code I have, just retrieval no extraction:
agent = Mechanize.new { |agent| agent.user_agent_alias = "Windows Chrome" }
html = agent.get('http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green+tea').body
html_doc = Nokogiri::HTML(html)
The other thing is that I am confused about whether to use Nokogiri directly or through Mechanize. The problem is that there isn't enough documentation provided by Mechanize so I was thinking of using it separately.
For now I would like to know how to loop through these and extract the info.
Here's an example of how you could parse the bold text and href attribute from the anchor elements you describe:
require 'nokogiri'
require 'open-uri'
url = 'http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green%20tea'
# URI.open is required on current Rubies; a bare open(url) no longer fetches URLs
doc = Nokogiri::HTML(URI.open(url))
doc.xpath('//dd/*/a').each do |a|
  text = a.xpath('.//b').map {|b| b.text.gsub(/\s+/, ' ').strip}
  href = a['href']
  puts "OK: text=#{text.inspect}, href=#{href.inspect}"
end
# OK: text=["Green tea", "many antioxidants"], href="http://www.talbottteas.com/category_s/55.htm"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.skin-care-experts.com/tag/best-skin-care/page/4"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.specialitybrand.com/news/view/207.html"
In a nutshell, this solution uses XPath in two places:
1. Initially to find every a element underneath each dd element.
2. Then to find each b element inside each of the a elements from #1 above.
The final trick is cleaning up the text within the b elements into something presentable. Of course, you might want it to look different.

Ruby regex to match content between <ul> tags

I have a script to grab a page and edit it. The page HTML looks something like this:
<p>Title</p>...extra content...<ul><li>Item1</li><li>Item2</li></ul>
There are multiple titles and multiple unordered lists but I want to change each list with a regular expression that can find the list with a certain title and use .sub in Ruby to replace it.
The regex I currently have looks like this:
regex = /<p>Title1?.*<\/ul>/
Now if there are any items below the regex it will match to the last tag and accidentally grab all the lists below it for example if I have this content:
content = "<p>Title1</p><ul><li>Item1</li><li>Item2</li></ul><p>Title2</p><ul><li>Item1</li><li>Item2</li><li>Item3</li></ul>"
and I want to add another list item to the section for Title 1:
content.sub(regex, "<p>Title1</p><ul><li>Item1</li><li>Item2</li><li>NEW_ITEM</li></ul>")
It will delete all items below it. How do I rewrite my regex to select only the first /ul tag to substitute?
"I want to change each list with a regular expression." No you don't. You really do not want to go down this road because it's filled with misery, sorrow, and tears. One day someone will put a list item in your list item.
There are libraries like Nokogiri that make manipulating HTML very easy. There's no excuse to not use something like it:
require 'nokogiri'
html = "<p>Title</p>...extra content...<ul><li>Item1</li><li>Item2</li></ul>"
doc = Nokogiri::HTML(html)
doc.css('ul').children.first.inner_html = 'Replaced Text'
puts doc.to_s
That serves as a simple example for "replace text from first list item". It can be easily adapted to do other things, as the css method takes a simple CSS selector, not unlike jQuery.
Use a non-greedy (lazy) quantifier .*?
See this explanation of Ruby Regexp repetition.
regex = /<p>Title1?.*?<\/ul>/
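To see the difference, here is a short sketch that runs both patterns against the content string from the question. The variable names are mine; the patterns and the sample data are the ones already shown above.

```ruby
content = "<p>Title1</p><ul><li>Item1</li><li>Item2</li></ul>" \
          "<p>Title2</p><ul><li>Item1</li><li>Item2</li><li>Item3</li></ul>"

greedy = /<p>Title1?.*<\/ul>/   # .* runs on to the LAST </ul>
lazy   = /<p>Title1?.*?<\/ul>/  # .*? stops at the FIRST </ul>

# String#[] with a Regexp returns the matched substring (or nil).
greedy_match = content[greedy]
lazy_match   = content[lazy]

replacement = "<p>Title1</p><ul><li>Item1</li><li>Item2</li>" \
              "<li>NEW_ITEM</li></ul>"
updated = content.sub(lazy, replacement)

puts greedy_match == content  # the greedy match swallows both lists
puts lazy_match
puts updated
```

The lazy version leaves the Title2 list untouched, though it would still break on nested lists, which is the caveat the other answer raises.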
...it reformats the html with newlines and changes all <br /> to <br>...
That's usually because the wrong method is used when emitting the doc as HTML or XHTML:
doc = Nokogiri::HTML::DocumentFragment.parse('<p>foo<br />bar</p>')
doc.to_xhtml # => "<p>foo<br />bar</p>"
doc.to_html # => "<p>foo<br>bar</p>"
doc = Nokogiri::HTML::DocumentFragment.parse('<p>foo<br>bar</p>')
doc.to_xhtml # => "<p>foo<br />bar</p>"
doc.to_html # => "<p>foo<br>bar</p>"
As for spuriously adding line-ends where they weren't before, I haven't seen that. It's possible to tell Nokogiri to do that if you're modifying the DOM, but from what I've seen, on its own Nokogiri is very benign.