Why does Nokogiri strip the content?

I have this content:
<div class="CodeRay">
<div class="code"><pre>puts <span style="background-color:#fff0f0;color:#D20"><span style="color:#710">"</span><span style="">Hello, world!</span><span style="color:#710">"</span></span></pre></div>
</div>
and I want to add it to an HTML document using Nokogiri:
doc = File.open("frame2.html", "r") do |file|
  Nokogiri::HTML.parse(file)
end
doc.at_css("body").content = content # this is my content
puts doc.to_html
The content is then transformed into this:
<div class="CodeRay">
<div class="code"><pre>puts <span style="background-color:#fff0f0;color:#D20"><span style="color:#710">&quot;</span><span style="">Hello, world!</span><span style="color:#710">&quot;</span></span></pre></div>
</div>
The rest of the HTML file is fine. Why does Nokogiri strip the content? Why does it transform it into HTML entities?

I reformatted your inner HTML to make it a bit more readable as a sample.
Nokogiri isn't stripping anything, it's only encoding the content being added because you're telling it to.
Unless you tell Nokogiri the new text is already HTML it will assume you are adding text, and, since the text contains characters that should be encoded, it encodes it for you.
Here's how to do what you really want:
require "nokogiri"
html = '<div class="CodeRay">
<div class="code">
<pre>puts <span style="background-color:#fff0f0;color:#D20">
<span style="color:#710">"</span>
<span style="">Hello, world!</span>
<span style="color:#710">"</span>
</span>
</pre>
</div>
</div>'
doc = Nokogiri::HTML('<html><body></body></html>')
doc.at('body').inner_html = html
puts doc.to_html
>> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>> <html><body><div class="CodeRay">
>> <div class="code">
>> <pre>puts <span style="background-color:#fff0f0;color:#D20">
>> <span style="color:#710">"</span>
>> <span style="">Hello, world!</span>
>> <span style="color:#710">"</span>
>> </span>
>> </pre>
>> </div>
>> </div></body></html>
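The text-versus-markup distinction this answer relies on can be seen without Nokogiri. The following is only a sketch using Ruby's standard CGI.escapeHTML; it is not Nokogiri's own code path, and the names SNIPPET and escape_as_text are made up for illustration. It shows what "treat this string as plain text" means: every character with a special meaning in HTML is replaced by an entity.

```ruby
require "cgi"

# Hypothetical sample: one of the spans from the question.
SNIPPET = '<span style="color:#710">"</span>'

# Treat the string as plain text, so markup characters become entities.
# This illustrates the idea only; Nokogiri does its own serialization.
def escape_as_text(str)
  CGI.escapeHTML(str)
end

puts escape_as_text(SNIPPET)
# The double quotes come out as &quot; and the angle brackets as &lt; / &gt;
```

Assigning through inner_html, as in the answer, skips this step because the string is parsed as markup instead of being stored as text.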

Related

How can I remove html tag with Python?

My code is working, but it also returns the HTML tags. Is there anything I can add to my for loop to strip the HTML?
Here's my code below.
addressNeeded = soup.find("h1", {"style": "font-size: inherit; font-weight: inherit;"})
for x in addressNeeded:
    addressList.append(x)
the outcome is:
['\n', <label class="summary-list__label">
<span itemprop="streetAddress">95 Cooks Drive</span>
</label>, '\n', <span class="summary-list__label summary-list__label--small">
<span itemprop="addressLocality">Westside</span>,
<span itemprop="addressRegion">NY</span>
<span itemprop="postalCode">07663</span>
I thank you in advance!
I believe you should change your print(x) to print(x.string) as suggested in this answer

How to remove a node using Nokogiri

I have an HTML structure like this:
<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>
I know how to get a Nokogiri::XML::NodeSet from this:
dom.xpath("//div")
I now want to filter out any script tag:
dom.xpath("//script")
So I can get something like:
<div>
This is
<p> very</p>
important.
</div>
So that I can call div.text to get:
"This is very important."
I tried recursively/iteratively walking all child nodes and filtering out the ones I don't want, but I ran into problems with too much or too little whitespace. I'm quite sure there's a nice, Ruby-esque way.
What would be a good way to do this?
NodeSet contains the remove method which makes it easy to remove whatever matched your selector:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div><p>foo</p><p>bar</p></div>
</body>
</html>
EOT
doc.search('p').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <div></div>
# >> </body>
# >> </html>
Applied to your sample input:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>
EOT
doc.search('script').remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <div>
# >> This is
# >> <p> very
# >>
# >> </p>
# >> important.
# >> </div>
# >> </body></html>
At that point the text in the <div> is:
doc.at('div').text # => "\n This is\n very\n \n \n important.\n"
Normalizing that is easy:
doc.at('div').text.gsub(/[\n ]+/,' ').strip # => "This is very important."
1st problem
To remove all the script nodes:
require 'nokogiri'
html = "<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>"
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
p doc.text
#=> "\n This is\n very\n \n \n important.\n"
Thanks to @theTinMan for his tip (calling remove on one NodeSet instead of each Node).
2nd problem
To remove the unneeded whitespace, you can use:
strip to remove whitespace (spaces, tabs, newlines, ...) at the beginning and end of the string
gsub to replace multiple whitespace characters with a single space
p doc.text.strip.gsub(/[[:space:]]+/,' ')
#=> "This is very important."
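The two steps can be wrapped in a small helper. This is a minimal sketch that needs only core Ruby; the method name normalize_whitespace is made up for illustration, and the sample string is the text extracted from the div above.

```ruby
# Collapse every run of whitespace into a single space, then trim the ends.
# Collapsing first means leading/trailing runs shrink to one space that
# strip can remove afterwards.
def normalize_whitespace(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

raw = "\n This is\n very\n \n \n important.\n"
puts normalize_whitespace(raw)
# prints: This is very important.
```

The helper can be applied to whatever doc.at('div').text returns once the script nodes have been removed.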

Use regular expressions to add new class to element using search/replace

I want to add a NewClass value to the class attribute and modify the text of the span using find/replace functionality with a pair of regular expressions.
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
I am trying to get the following result using after search/replace:
<span class='customer NewClass' id='phone$1'>Organization</span>
I'm also curious to know whether a single find/replace operation can be used for both tasks.
Regex can do this, but be aware that using regex to change HTML has a lot of edge cases you may not have accounted for.
This regex101 example shows those three <span> elements changed to add NewClass and the contents to be changed to Organization.
Other technologies, however, would be safer. jQuery, for example, could replace them regardless of the order of the attributes:
// "$" is a selector metacharacter, so it is escaped with two backslashes
$("span#phone\\$1").addClass("NewClass");
$("span#phone\\$1").text("Organization");
So just be careful with it, and you should be fine.
EDIT
According to comments on the OP, you want to only change the span containing ID phone$1, so the regex101 link has been updated to reflect this.
EDIT 2
Permalink was too long to fit into a comment, so adding the permalink here. Click on the "Content" tab at the bottom to see the replacement.
You can use a regex like this:
'.*?' id='phone\$1'>.*?<
With substitution string:
'customer NewClass' id='phone\$1'>Organization<
Working demo
PHP code
$re = "/'.*?' id='phone\\$1'>.*?</";
$str = "<div>\n <span class='customer' id='phone\$0'>Home</span>\n<br/>\n <span class='customer' id='phone\$1'>Business</span>\n<br/>\n <span class='customer' id='phone\$2'>Mobile</span>\n</div>";
// The replacement must contain \$1 (backslash, dollar) so that preg_replace
// does not read $1 as a reference to a capture group.
$subst = "'customer NewClass' id='phone\\\$1'>Organization<";
$result = preg_replace($re, $subst, $str);
Result
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer NewClass' id='phone$1'>Organization</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
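For comparison, the same single-pass idea can be sketched in Ruby. This is only an illustration under assumptions: it expects the class attribute to come directly before the id attribute and both to use single quotes, exactly as in the sample, and the names TARGET and retag are invented here. Instead of hard-coding the class value, it captures the existing one and appends NewClass to it.

```ruby
# Group 1: current class value. Group 2: the id attribute up to the closing ">".
# The trailing [^<]*< consumes the old text content of the span.
TARGET = /class='([^']*)'(\s+id='phone\$1'>)[^<]*</

# One substitution that both extends the class list and replaces the text.
# The block form avoids backreference parsing in a replacement string.
def retag(html)
  html.sub(TARGET) do
    m = Regexp.last_match
    "class='#{m[1]} NewClass'#{m[2]}Organization<"
  end
end

puts retag("<span class='customer' id='phone$1'>Business</span>")
# prints: <span class='customer NewClass' id='phone$1'>Organization</span>
```

Spans with other ids are left untouched, but the attribute-order assumption is exactly the kind of edge case that makes a DOM-based approach safer.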
Since your tags include preg_match and preg_replace, I think you are using PHP.
Regex is generally not a good tool for manipulating HTML or XML. See the SO post "RegEx match open tags except XHTML self-contained tags".
In PHP, you can use DOMDocument and DOMXPath with the //span[@id="phone$1"] XPath (get all span tags whose id attribute value equals phone$1):
$html =<<<DATA
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
DATA;
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$sps = $xp->query('//span[@id="phone$1"]');
foreach ($sps as $sp) {
    $sp->setAttribute('class', $sp->getAttribute('class') . ' NewClass');
    $sp->nodeValue = 'Organization';
}
echo $dom->saveHTML();
See IDEONE demo
Result:
<div>
<span class="customer" id="phone$0">Home</span>
<br>
<span class="customer NewClass" id="phone$1">Organization</span>
<br>
<span class="customer" id="phone$2">Mobile</span>
</div>

Scraping with Nokogiri::HTML - Can't get text from XPATH

I'm trying to scrape html with Nokogiri.
This is the html source:
<span id="J_WlAreaInfo" class="wl-areacon">
<span id="J-From">山东济南</span>
至
<span id="J-To">
<span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
全国
<s></s>
</span>
</span>
</span>
I need to get the following text: 山东济南
Checked shortest XPATH with firebug:
//*[@id="J-From"]
Here is my ruby code:
doc = Nokogiri::HTML(open("http://foo.html"), "UTF-8")
area = doc.xpath('//*[@id="J-From"]')
puts area.text
However, it returns nothing.
What am I doing wrong?
However, it returns nothing. What am I doing wrong?
xpath() returns an array containing the matches (it's actually called a NodeSet):
require 'nokogiri'
html = %q{
<span id="J_WlAreaInfo" class="wl-areacon">
<span id="J-From">山东济南</span>
至
<span id="J-To">
<span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
全国
<s></s>
</span>
</span>
</span>
}
doc = Nokogiri::HTML(html)
target_tags = doc.xpath('//*[@id="J-From"]')
target_tags.each do |target_tag|
  puts target_tag.text
end
--output:--
山东济南
Edit: You can also call text() on the NodeSet itself, but it returns the text of every match concatenated together, which I've never found useful. Because there is only one match here, you should still have gotten 山东济南; nothing in your post explains why you didn't.
If you only want a single result from your XPath, i.e. the first match, then you can use at_xpath():
target_tag = doc.at_xpath('//*[@id="J-From"]')
puts target_tag.text
--output:--
山东济南

Replace only raw text in HTML string

I have a string:
html_string =
'<span><span class=\"ip\"></span> Do not stare <span class=\"img\"></span> at the monitor continuously </span>\r\n'
I want to replace the character s in the raw text (not in the html tags) of html_string with <span class="highlighted">s</span>.
The result should be:
'<span><span class=\"ip\"></span> Do not <span class="highlighted">s</span>tare <span class=\"img\"></span> at the monitor continuou<span class="highlighted">s</span>ly </span>\r\n'
What I did is:
html_string.gsub(/s/, '<span class="highlighted">s</span>')
but this replaces all occurrences of the s character regardless of raw text or a tag. I want to replace it skipping html tags and its attributes. How it can be done?
This doesn't pretend to be an ideal answer; it's just meant to point you in the right direction:
require 'nokogiri'
html_string = '<span><span class="ip"></span> Do not stare <span class="img"></span> at the monitor continuously </span>'
doc = Nokogiri::HTML.fragment(html_string)
spans = doc.css('span')
spans.each do |span|
  span.xpath('text()').each do |text|
    if text.content =~ /stare/
      text.content = text.content.sub(/stare/, '<span class="highlighted">s</span>tare')
    end
  end
end
# Assigning to content escapes the new markup, so undo the entities on output
p doc.to_html.gsub(/&lt;/, '<').gsub(/&gt;/, '>')
Which outputs:
#=> "<span><span class=\"ip\"></span> Do not <span class=\"highlighted\">s</span>tare <span class=\"img\"></span> at the monitor continuously </span>"
So here we look through all the spans and check their text nodes for the word stare, then change that content. That's all; the rest is a matter of learning Nokogiri.
That's really simple: parse the HTML, do the replacement in the text nodes, and serialize back to HTML.
Nokogiri seems to be popular for that in Ruby.
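For a short, well-formed fragment like the one in the question, the idea of skipping tags can also be sketched with core Ruby alone. This is a naive illustration, not a replacement for a parser: it assumes every tag looks like <...> with no ">" inside attribute values, and it does not understand comments, script blocks or entities. The method name highlight_outside_tags is invented here.

```ruby
# Match either a whole tag (captured in group 1) or the needle.
# Tags are returned unchanged; a needle found outside a tag gets wrapped.
def highlight_outside_tags(html, needle)
  pattern = /(<[^>]*>)|#{Regexp.escape(needle)}/
  html.gsub(pattern) do |match|
    if Regexp.last_match(1)
      match # a tag: leave it exactly as it was
    else
      %(<span class="highlighted">#{match}</span>)
    end
  end
end

html_string = '<span><span class="ip"></span> Do not stare <span class="img"></span> at the monitor continuously </span>'
puts highlight_outside_tags(html_string, "s")
```

Because the alternation tries the tag branch first, an s inside span or class is consumed as part of its tag and is never wrapped.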