Sanitizing HTML using Nokogiri

I'm trying to clean up some CMS-entered HTML that has extraneous paragraph and br tags everywhere. The Sanitize gem has proved very useful for this, but I am stuck on one particular issue.
The problem arises when a br tag comes directly after an opening paragraph tag or directly before a closing one, e.g.:
<p>
<br />
Some text here
<br />
Some more text
<br />
</p>
I would like to strip out the extraneous first and last br tags, but not the middle one.
I'm very much hoping I can use a Sanitize transformer to do this, but I can't seem to find the right matcher.
Any help would be much appreciated.

Here's how to locate the <br> nodes that are direct children of a <p>:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>
<br />
Some text here
<br />
Some more text
<br />
</p>
EOT
doc.search('p > br').map(&:to_html)
# => ["<br>", "<br>", "<br>"]
Once we know we can find them, it's easy to remove specific ones:
br_nodes = doc.search('p > br')
br_nodes.first.remove
br_nodes.last.remove
doc.to_html
# => "<p>\n \n Some text here\n <br>\n Some more text\n \n</p>\n"
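Notice that Nokogiri removed the <br> nodes but left behind the neighboring Text nodes that held the surrounding line breaks. A browser collapses that whitespace when rendering, but if you want tidy markup, remove the whitespace-only siblings along with each <br>. Check that a sibling is blank before removing it: calling br.next_sibling.remove unconditionally would also delete the "Some text here" text node that follows the first <br>. Starting again from a freshly parsed doc:
br_nodes = doc.search('p > br')
[br_nodes.first, br_nodes.last].each do |br|
  # Only drop neighbors that are whitespace-only Text nodes.
  [br.previous_sibling, br.next_sibling].each do |sibling|
    sibling.remove if sibling && sibling.text? && sibling.content.strip.empty?
  end
  br.remove
end
doc.to_html
# => "<p>\n Some text here\n <br>\n Some more text\n </p>\n"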

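# Sanitize transformer: strips leading <br> elements (and any whitespace-only
# text in front of them) from each <p>.
initial_linebreak_transformer = lambda do |env|
  node = env[:node]
  next unless node && node.element? && node.name.downcase == 'p'

  # Keep unlinking the first child while it is a <br> or blank text.
  while (child = node.children.first) &&
        (child.name.downcase == 'br' || (child.text? && child.content.strip.empty?))
    child.unlink
  end
end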

Related

Dynamically Inject new HTML into a web page and be able to access any new DOM elements that are in the "new" injected HTML

I found a link that suggested injecting a table into a div.
Here is an example of new HTML that I want to inject:
<br />
<br />
<LSz class='LineSpaceDouble'>
Hi, <p class='FIRST_NAME'> </p> <br><br>
Hi <p class='FIRST_NAME'> </p>, my name is <p class='MYNAME'> </p> .
More Text.<br>
</LSz>
<br />
<label for='PBirthDate'>Primary Birthdate:</label>
<input id='PBirthDate' class='input100' type='text' name='PBirthDate' placeholder='mm/dd/yr' size='10'>
<span class='focus-input100'></span>
<br />
Here is my current jQuery code that does the injection:
var S = J;
$(S).appendTo('#Script_Displayed');
J holds the HTML text to be injected, and Script_Displayed is the id of the div.
That works: the text is indeed injected into the web page where the div is located.
My problem comes when I attempt to change a value:
var Z = document.getElementsByClassName('FIRST_NAME');
Z.innerHTML = "Anthony";
The new innerHTML value does not appear on the web page.
What can I do to make these changes visible?

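The function getElementsByClassName returns an HTMLCollection of elements, not a single element. Assigning innerHTML on the collection only sets a property on the collection object and leaves the page untouched, so this won't work: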
Z.innerHTML = "Anthony";
Instead, loop over the collection to assign the innerHTML value to each element in the collection:
var Z = document.getElementsByClassName('FIRST_NAME');
for (let el of Z) {
  el.innerHTML = "Anthony";
}
Hi, <p class='FIRST_NAME'> </p> <br><br>
Hi <p class='FIRST_NAME'> </p>, my name is <p class='MYNAME'> </p> .
More Text.<br>

return html tags in the html text using re

I have HTML text and I just want to determine which HTML tags are present in it.
html_text = '<p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin:0in;margin-bottom:.0001pt">Position Title: Onsite Client Services Associate<br /> Duration: 7 months<br /> Location: Tempe, AZ 85282<br /> <br /> <b><u>Roles and responsibilities</u></b><o:p></o:p></p> <p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin-top:5.0pt;margin-right:0in;margin-bottom:0in;margin-left:.25in; margin-bottom:.0001pt"><span style="font-family:Symbol">·</span><span style="font-size:7.0pt"> </span>Primary function during peak season (July-December) will be an onsite presence at our large client in the Phoenix area. <o:p></o:p></p>'
As a first step I searched the text for each tag individually, like html_text.find('</p>'). Because checking every tag this way takes too long, I tried to use a regex:
re.findall(r'\<\/.>', html_text)
The output of the above is ['</p>', '</b>', '</u>'], but I want the output to be ['</p>', '</span>', '<br />', '</b>', '</u>']. So I modified it to:
re.findall(r'\<\/.*>', html_text)
hoping to get </span>, but instead I get nearly the whole text:
['</u></b><o:p></o:p></p> <p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin-top:5.0pt;margin-right:0in;margin-bottom:0in;margin-left:.25in; margin-bottom:.0001pt"><span style="font-family:Symbol">·</span><span style="font-size:7.0pt"> </span>Primary function during peak season (July-December) will be an onsite presence at our large client in the Phoenix area. <o:p></o:p></p>']
Is there a way to write one expression that covers all tags, or do I have to write a check for every tag? With the above I also couldn't detect <br />.

Finally, after a few trials, I found an answer myself; I'm posting it in case it helps someone. This finds every tag, and with a little cleaning of the results you can work out which tags are present.
re.findall(re.compile("<.*?>"), html_text)
The output is:
['<p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin:0in;margin-bottom:.0001pt">', '<br />', '<br />', '<br />', '<br />', '<b>', '<u>', '</u>', '</b>', '<o:p>', '</o:p>', '</p>', '<p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin-top:5.0pt;margin-right:0in;margin-bottom:0in;margin-left:.25in; margin-bottom:.0001pt">', '<span style="font-family:Symbol">', '</span>', '<span style="font-size:7.0pt">', '</span>', '<o:p>', '</o:p>', '</p>']
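The "cleaning" step mentioned above could look roughly like this: capture only the tag name from each tag and keep the distinct names. This is a minimal sketch, not the poster's code; `sample`, `TAG_NAME`, and `tag_names` are illustrative names, and `sample` is a shortened stand-in for html_text.

```python
import re

# Shortened stand-in for html_text (illustrative, not the original string).
sample = (
    '<p class="x" style="margin:0in">Position Title<br /> Duration<br /> '
    '<b><u>Roles</u></b><o:p></o:p></p>'
)

# "<", an optional "/" for closing tags, then capture the tag name.
# ":" and "-" are allowed so that names like o:p survive.
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9:-]*)")


def tag_names(html):
    """Return the distinct tag names in order of first appearance."""
    seen = []
    for name in TAG_NAME.findall(html):
        name = name.lower()
        if name not in seen:
            seen.append(name)
    return seen


print(tag_names(sample))  # ['p', 'br', 'b', 'u', 'o:p']
```

Opening and closing tags collapse to the same name here, which is usually what "which tags are present" means; drop the optional "/" handling if they need to be told apart.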

As far as I know, regex alone is a fragile tool for this, because an opening tag usually carries attributes. For example,
<span class="text">Some Text </span> has class="text" between the opening <span and the closing >.
A regex match is always one contiguous run of characters, so a single match cannot return <span> from <span class="text">Some Text </span>: it would have to match <span, skip class="text", and then match >. A capture group gets around this.
One solution that comes to mind is the regex (<[^\/\s]+)([^>]+)>. Applied to <span class="text">Some Text </span>, it matches <span class="text"> and captures <span in group 1. You can then append > using string concatenation.
Regex explanation: (<[^\/\s]+) captures < plus the tag name, up to the first whitespace or slash; ([^>]+) consumes the attributes; > ends the tag. Because ([^>]+) needs at least one character, a bare tag such as <b> is not matched cleanly, and a closing tag cannot start a match.
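If a regex turns out to be too brittle, Python's standard library can report the tags directly. The following is a minimal sketch using html.parser; the class name `TagCollector` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes arrive separately in attrs, so no skipping is needed.
        # Self-closing tags such as <br /> are reported here as well.
        self.tags.append(tag)


collector = TagCollector()
collector.feed('<span class="text">Some Text </span><br /><b>bold</b>')
collector.close()
print(collector.tags)  # ['span', 'br', 'b']
```

The parser hands over the tag name and the attributes as separate arguments, so the "match the name, skip the attributes" problem does not arise.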

Delete an HTML element containing a pattern

How can I delete elements (from <span> to </span>) whose text contains PATTERN? The contents of the element should be deleted along with the element.
For example, I want to delete the first <span>...</span> element in the following:
<span><SPAN>some text with
with </SPAN> a PATTERNin it etc</span><span><SPAN>some text
without </SPAN> a thingIn it etc</span>
to produce the following, using sed only:
<span><SPAN>some text
without </SPAN> a thingIn it etc</span>
PS: The solution must not depend on line endings or whole words; it just has to detect any <span>...</span> containing PATTERN.
The production server only allows basic commands such as sed.
I'm currently using the following, but it's ugly and doesn't seem to work:
sed '/<span.*\n.*PATTERN.*<\/span>/d'

If HTML:
perl -MXML::LibXML -e'
   my $parser = XML::LibXML->new();
   my $doc = $parser->parse_html_file($ARGV[0]);
   $_->unbindNode()
      for $doc->findnodes(q{//span[contains(text(), "PATTERN")]});
   binmode(STDOUT);
   print($doc->toString());
' in.html >out.html
If XHTML:
perl -MXML::LibXML -e'
   my $parser = XML::LibXML->new();
   my $doc = $parser->parse_file($ARGV[0]);
   my $xpc = XML::LibXML::XPathContext->new();
   $xpc->registerNs( h => "http://www.w3.org/1999/xhtml" );
   $_->unbindNode()
      for $xpc->findnodes(q{//h:span[contains(text(), "PATTERN")]}, $doc);
   binmode(STDOUT);
   print($doc->toString());
' in.xhtml >out.xhtml
The above both produce the following (with some implied elements vivified):
<span><SPAN>some text
without </SPAN> a thingIn it etc</span>

Replace only raw text in HTML string

I have a string:
html_string =
'<span><span class=\"ip\"></span> Do not stare <span class=\"img\"></span> at the monitor continuously </span>\r\n'
I want to replace the character s in the raw text (not in the html tags) of html_string with <span class="highlighted">s</span>.
The result should be:
'<span><span class=\"ip\"></span> Do not <span class="highlighted">s</span>tare <span class=\"img\"></span> at the monitor continuou<span class="highlighted">s</span>ly </span>\r\n'
What I did is:
html_string.gsub(/s/, '<span class="highlighted">s</span>')
but this replaces all occurrences of the s character regardless of raw text or a tag. I want to replace it skipping html tags and its attributes. How it can be done?

This doesn't pretend to be an ideal answer; it just shows a direction to go in:
require 'nokogiri'
html_string = '<span><span class="ip"></span> Do not stare <span class="img"></span> at the monitor continuously </span>'
doc = Nokogiri::HTML.fragment(html_string)
spans = doc.css('span')
spans.each do |span|
  span.xpath('text()').each do |text|
    if text.content =~ /stare/
      text.content = text.content.sub(/stare/, '<span class="highlighted">s</span>tare')
    end
  end
end
# Assigning to content escapes the inserted markup, so unescape it on output.
p doc.to_html.gsub(/&lt;/, '<').gsub(/&gt;/, '>')
Which output is:
#=> "<span><span class=\"ip\"></span> Do not <span class=\"highlighted\">s</span>tare <span class=\"img\"></span> at the monitor continuously </span>"
Here we look at the text nodes of every span, check each one for the word stare, and rewrite the matching content. This only handles that one word; extending it to every s in the text nodes is the next step, so it is worth learning Nokogiri.

That's really simple: parse the HTML, do the replacement in the text nodes only, and serialize back to HTML.
Nokogiri seems to be popular for that in Ruby.
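The "replace only in the text, never in the tags" idea can also be shown without a parser. Below is a rough sketch in Python rather than Ruby, assuming no attribute value contains a literal > and that the text has no character entities (an entity such as &nbsp; contains an s and would be damaged). The function name `highlight` is made up for this example.

```python
import re


def highlight(html, char="s"):
    """Wrap every `char` found in text (not inside tags) in a highlight span."""
    # The capture group makes re.split keep the tags in the result list.
    parts = re.split(r"(<[^>]*>)", html)
    wrapped = f'<span class="highlighted">{char}</span>'
    # Even indexes hold text between tags, odd indexes hold the tags.
    return "".join(
        part if i % 2 else part.replace(char, wrapped)
        for i, part in enumerate(parts)
    )


html_string = (
    '<span><span class="ip"></span> Do not stare <span class="img"></span>'
    ' at the monitor continuously </span>'
)
print(highlight(html_string))
```

For anything beyond simple, well-formed fragments, walking the text nodes with a real parser as described above remains the safer route.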

Remove <p> tags - Regular Expression (Regex)

I have some HTML and the requirement is to remove only the opening <p> tags from the string.
Example:
input: <p style="display:inline; margin: 40pt;"><span style="font:XXXX;"> Text1 Here</span></p><p style="margin: 50pt"><span style="font:XXXX">Text2 Here</span></p> <p style="display:inline; margin: 40pt;"><span style="font:XXXX;"> Text3 Here</span></p>the string goes on like that
desired output: <span style="font:XXXX;"> Text1 Here</span></p><span style="font:XXXX">Text2 Here</span></p><span style="font:XXXX;"> Text3 Here</span></p>
Is it possible using regex? I have tried some combinations, but none of them work. This is all a single string. Any advice is appreciated.

I'm sure you know the warnings about using regex to match HTML. With those disclaimers, you can do this:
Option 1: Leaving the closing </p> tags
This first option leaves the closing </p> tags in place, which is what your desired output shows. :) Option 2 removes them as well.
PHP
$replaced = preg_replace('~<p[^>]*>~', '', $yourstring);
JavaScript
replaced = yourstring.replace(/<p[^>]*>/g, "");
Python
replaced = re.sub("<p[^>]*>", "", yourstring)
<p matches the beginning of the tag
The negated character class [^>]* matches any run of characters that are not a closing >
> closes the match
We replace the whole match with an empty string
Option 2: Also removing the closing </p> tags
PHP
$replaced = preg_replace('~</?p[^>]*>~', '', $yourstring);
JavaScript
replaced = yourstring.replace(/<\/?p[^>]*>/g, "");
Python
replaced = re.sub("</?p[^>]*>", "", yourstring)
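One caveat with the patterns above: <p[^>]*> also matches any tag whose name merely starts with p, such as <pre> or <param>. A word boundary after the p avoids that. The sketch below shows both options in Python; the sample string and the names `OPENING_P` and `ANY_P` are illustrative assumptions.

```python
import re

html = (
    '<p style="display:inline; margin: 40pt;"><span>Text1</span></p>'
    '<pre>keep me</pre>'
    '<p><span>Text2</span></p>'
)

# \b requires the tag name to end right after "p", so <pre> is left alone.
OPENING_P = re.compile(r"<p\b[^>]*>", re.IGNORECASE)
ANY_P = re.compile(r"</?p\b[^>]*>", re.IGNORECASE)

# Option 1: remove only the opening <p ...> tags.
print(OPENING_P.sub("", html))
# <span>Text1</span></p><pre>keep me</pre><span>Text2</span></p>

# Option 2: remove opening and closing <p> tags.
print(ANY_P.sub("", html))
# <span>Text1</span><pre>keep me</pre><span>Text2</span>
```

The same \b works in the PHP and JavaScript variants, since both regex flavors support word boundaries.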

This is a PCRE expression (the U modifier makes the quantifiers ungreedy):
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>(.*<\/p>)/Ug
Replace each occurrence with $3, or just remove all occurrences of:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>/g
If you want to remove the closing tag as well, replace each occurrence of the following with $3:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>(.*)<\/p>/Ug
