Sanitizing HTML using Nokogiri

I'm trying to clean up some CMS-entered HTML that has extraneous paragraph tags and br tags everywhere. The Sanitize gem has proved very useful for this, but I am stuck on one particular issue.
The problem is when there is a br tag directly after or before a paragraph tag, e.g.
<p>
<br />
Some text here
<br />
Some more text
<br />
</p>
I would like to strip out the extraneous first and last br tags, but not the middle one.
I'm very much hoping I can use a sanitize transformer to do this but can't seem to find the right matcher to achieve this.
Any help would be much appreciated.

Here's how to locate the particular <br> nodes that are contained by <p>:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>
<br />
Some text here
<br />
Some more text
<br />
</p>
EOT
doc.search('p > br').map(&:to_html)
# => ["<br>", "<br>", "<br>"]
Once we know we can find them, it's easy to remove specific ones:
br_nodes = doc.search('p > br')
br_nodes.first.remove
br_nodes.last.remove
doc.to_html
# => "<p>\n \n Some text here\n <br>\n Some more text\n \n</p>\n"
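Notice that Nokogiri removed them, but the Text nodes that were their immediate siblings, containing the "\n" line-ends, are left behind. A browser will collapse that whitespace and not display it, but if you want tidier output you can remove those nodes too. Only remove a sibling when it is whitespace-only, otherwise real text such as "Some text here" is removed with it. Starting again from a freshly parsed doc:
br_nodes = doc.search('p > br')
[br_nodes.first, br_nodes.last].each do |br|
  sibling = br.next_sibling
  # Drop the following Text node only if it holds nothing but whitespace
  sibling.remove if sibling && sibling.text? && sibling.content.strip.empty?
  br.remove
end
doc.to_html
# The first and last <br> are gone; both lines of text and the middle <br> remain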
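For comparison, here is a minimal Python sketch of the same idea using only the standard library's xml.etree.ElementTree. It assumes the fragment is well-formed XML (as the <br /> example above is), and the helper name strip_edge_breaks is made up for illustration. ElementTree keeps the text that follows an element in that element's .tail, so the sketch moves the tail into the parent before removing a leading <br>.

```python
import xml.etree.ElementTree as ET

def strip_edge_breaks(p):
    """Remove <br> children at the very start and very end of element p, in place."""
    # Leading: only whitespace may precede the first child
    while len(p) and p[0].tag == "br" and not (p.text or "").strip():
        first = p[0]
        # Keep the text that followed the <br> by moving it into the parent
        p.text = (p.text or "") + (first.tail or "")
        p.remove(first)
    # Trailing: only whitespace may follow the last child
    while len(p) and p[-1].tag == "br" and not (p[-1].tail or "").strip():
        p.remove(p[-1])
    return p

p = ET.fromstring("<p>\n<br />\nSome text here\n<br />\nSome more text\n<br />\n</p>")
strip_edge_breaks(p)
print(ET.tostring(p, encoding="unicode"))
```

Only the middle <br> should survive, with both lines of text intact. Real-world HTML that is not well-formed would need an HTML parser instead.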

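initial_linebreak_transformer = lambda { |options|
  node = options[:node]
  # Only act on <p> elements (present? comes from ActiveSupport)
  if node.present? && node.element? && node.name.downcase == 'p'
    first_child = node.children.first
    # Guard against an empty <p>, which has no first child
    if first_child && first_child.name.downcase == 'br'
      first_child.unlink
      # Recurse to strip any further leading <br> tags
      initial_linebreak_transformer.call options
    end
  end
}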

Related

Dynamically Inject new HTML into a web page and be able to access any new DOM elements that are in the "new" injected HTML

I found this link that suggested injecting a table into a div.
Here is an example of new HTML that I want to inject:
<br />
<br />
<LSz class='LineSpaceDouble'>
Hi, <p class='FIRST_NAME'> </p> <br><br>
Hi <p class='FIRST_NAME'> </p>, my name is <p class='MYNAME'> </p> .
More Text.<br>
</LSz>
<br />
<label for='PBirthDate'>Primary Birthdate:</label>
<input id='PBirthDate' class='input100' type='text' name='PBirthDate' placeholder='mm/dd/yr' size='10'>
<span class='focus-input100'></span>
<br />
Here is my current jQuery code that does the injection:
var S = J;
$(S).appendTo('#Script_Displayed');
J holds the HTML text to be injected, and Script_Displayed is the id of the div.
That works, in that the text is indeed injected into the web page where the div is located.
My problem is when I attempt to change a value:
var Z = document.getElementsByClassName('FIRST_NAME');
Z.innerHTML = "Anthony";
The new innerHTML value does not appear on the web page.
What can I do to make these changes visible?
The function getElementsByClassName returns a collection of elements, not a single element. So this won't work by default:
Z.innerHTML = "Anthony";
Instead, loop over the collection to assign the innerHTML value to each element in the collection:
var Z = document.getElementsByClassName('FIRST_NAME');
for (let el of Z) {
el.innerHTML = "Anthony";
}

return html tags in the html text using re

I have HTML text and I just want to determine which HTML tags appear in it.
html_text = '<p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin:0in;margin-bottom:.0001pt">Position Title: Onsite Client Services Associate<br /> Duration: 7 months<br /> Location: Tempe, AZ 85282<br /> <br /> <b><u>Roles and responsibilities</u></b><o:p></o:p></p> <p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin-top:5.0pt;margin-right:0in;margin-bottom:0in;margin-left:.25in; margin-bottom:.0001pt"><span style="font-family:Symbol">·</span><span style="font-size:7.0pt"> </span>Primary function during peak season (July-December) will be an onsite presence at our large client in the Phoenix area. <o:p></o:p></p>'
As a first step I was searching the text for each HTML tag separately,
like html_text.find('</p>'). Since checking every tag this way takes too long, I tried using a regex:
re.findall(r'\<\/.>', html_text)
The output of the above is ['</p>', '</b>', '</u>']. But I want the output to be ['</p>','</span>', '<br />', '</b>', '</u>']. So if I modify it to
re.findall(r'\<\/.*>', html_text)
presuming I can get </span>, I get the whole rest of the text instead.
['</u></b><o:p></o:p></p> <p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin-top:5.0pt;margin-right:0in;margin-bottom:0in;margin-left:.25in; margin-bottom:.0001pt"><span style="font-family:Symbol">·</span><span style="font-size:7.0pt"> </span>Primary function during peak season (July-December) will be an onsite presence at our large client in the Phoenix area. <o:p></o:p></p>']
Is there a way to write a single expression that covers all tags, or should I write a separate check for every tag? With the above I couldn't detect <br />.
Finally, after a few trials, I found an answer myself; I'm posting it in case it helps someone. It finds all the tags, and a little cleaning afterwards gives the tag names.
re.findall(re.compile("<.*?>"), html_text)
output is
['<p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin:0in;margin-bottom:.0001pt">', '<br />', '<br />', '<br />', '<br />', '<b>', '<u>', '</u>', '</b>', '<o:p>', '</o:p>', '</p>', '<p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin-top:5.0pt;margin-right:0in;margin-bottom:0in;margin-left:.25in; margin-bottom:.0001pt">', '<span style="font-family:Symbol">', '</span>', '<span style="font-size:7.0pt">', '</span>', '<o:p>', '</o:p>', '</p>']
As far as I know, what you are trying to do won't be fully achievable with just regex.
An opening HTML tag usually carries attributes. For example,
<span class="text">Some Text </span> has class="text" between the opening <span and the closing >.
So, to get just <span> out of <span class="text">Some Text </span>, you have to match <span first, skip class="text", and then match >. A single regex match cannot leave out the middle, since a match is one unbroken run of characters, but a capture group can pick out the part you need.
One solution that comes to mind is the regex (<[^\/\s]+)([^>]+)>. On <span class="text">Some Text </span> it matches the opening tag <span class="text"> and captures <span in group 1. You can then append > using string concatenation.
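If a parser is acceptable, the standard library's html.parser can list the tags without any regex. The following is only a sketch: the class name TagCollector and the helper collect_tags are invented for illustration, and attributes are deliberately dropped so that only the bare tags are reported. Note that the parser lowercases tag names.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record every tag the parser encounters, without attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <br />
        self.tags.append(f"<{tag} />")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")

def collect_tags(html_text):
    parser = TagCollector()
    parser.feed(html_text)
    parser.close()
    return parser.tags

sample = '<p class="a">Title<br /> <b><u>Roles</u></b><o:p></o:p></p>'
print(collect_tags(sample))
```

Wrapping the result in set(...) gives the distinct tags if order and repetition do not matter.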

Delete an HTML element containing a pattern

How can I delete elements (from <span> to </span>) whose text contain PATTERN in it? The contents of the element should be deleted along with the element.
For example, I want to delete the first <span>...</span> element in the following:
<span><SPAN>some text with
with </SPAN> a PATTERNin it etc</span><span><SPAN>some text
without </SPAN> a thingIn it etc</span>
to produce, using SED only :
<span><SPAN>some text
without </SPAN> a thingIn it etc</span>
PS: No help with end of lines or solo words, it must just detect any <span>...</span> and PATTERN.
The production server only allows basic commands such as SED.
I'm currently using the following but it's ugly and doesn't seem to work.
sed '/<span.*\n.*PATTERN.*<\/span>/d'
If HTML:
perl -MXML::LibXML -e'
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_file($ARGV[0]);
$_->unbindNode()
for $doc->findnodes(q{//span[contains(text(), "PATTERN")]});
binmode(STDOUT);
print($doc->toString());
' in.html >out.html
If XHTML:
perl -MXML::LibXML -e'
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($ARGV[0]);
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( h => "http://www.w3.org/1999/xhtml" );
$_->unbindNode()
for $xpc->findnodes(q{//h:span[contains(text(), "PATTERN")]}, $doc);
binmode(STDOUT);
print($doc->toString());
' in.xhtml >out.xhtml
The above both produce the following (with some implied elements vivified):
<span><SPAN>some text
without </SPAN> a thingIn it etc</span>
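Where Python is available rather than Perl, the same parse-then-remove approach can be sketched with the standard library. This assumes the fragment is well-formed XML; the <root> wrapper and the function name drop_spans_containing are made up for the example. Tag names are compared case-insensitively because XML parsing treats <span> and <SPAN> as different tags.

```python
import xml.etree.ElementTree as ET

def drop_spans_containing(markup, pattern):
    """Remove every span whose text, including nested text, contains pattern."""
    # Wrap the fragment so it has a single top-level element
    root = ET.fromstring(f"<root>{markup}</root>")
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag.lower() == "span" and pattern in "".join(child.itertext()):
                # Removing an element also discards the text stored in its .tail
                parent.remove(child)
    # Serialize the children only, leaving out the wrapper
    return (root.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in root
    )

markup = (
    "<span><SPAN>some text with\nwith </SPAN> a PATTERNin it etc</span>"
    "<span><SPAN>some text\nwithout </SPAN> a thingIn it etc</span>"
)
print(drop_spans_containing(markup, "PATTERN"))
```

Unlike the XPath above, which tests the span's own text nodes, this version checks all nested text, so a PATTERN inside the inner <SPAN> would also trigger removal of the outer span.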

Replace only raw text in HTML string

I have a string:
html_string =
'<span><span class=\"ip\"></span> Do not stare <span class=\"img\"></span> at the monitor continuously </span>\r\n'
I want to replace the character s in the raw text (not in the html tags) of html_string with <span class="highlighted">s</span>.
The result should be:
'<span><span class=\"ip\"></span> Do not <span class="highlighted">s</span>tare <span class=\"img\"></span> at the monitor continuou<span class="highlighted">s</span>ly </span>\r\n'
What I did is:
html_string.gsub(/s/, '<span class="highlighted">s</span>')
but this replaces all occurrences of the s character regardless of raw text or a tag. I want to replace it skipping html tags and its attributes. How it can be done?
This is not meant to be an ideal answer, just a pointer in the right direction:
require 'nokogiri'
html_string = '<span><span class="ip"></span> Do not stare <span class="img"></span> at the monitor continuously </span>'
doc = Nokogiri::HTML.fragment(html_string)
spans = doc.css('span')
spans.each do |span|
  # Visit only the text nodes that are direct children of this span
  span.xpath('text()').each do |text|
    if text.content =~ /stare/
      text.content = text.content.sub(/stare/, '<span class="highlighted">s</span>tare')
    end
  end
end
# Assigning to content escapes the markup, so turn &lt; and &gt; back into < and >
p doc.to_html.gsub(/&lt;/, '<').gsub(/&gt;/, '>')
Which output is:
#=> "<span><span class=\"ip\"></span> Do not <span class=\"highlighted\">s</span>tare <span class=\"img\"></span> at the monitor continuously </span>"
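So, here we look through all the spans, check their text nodes for the word stare, and change the content of the ones that match. Note that this only handles stare, not every s in the text. That's all; the rest is a matter of learning Nokogiri.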
That's really simple: parse the html, replace in the text nodes, print to html.
Nokogiri seems to be popular for that in Ruby.
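The parse, replace-in-text-nodes, print pipeline can be sketched in Python with the standard library's html.parser. This is an illustration under assumptions, not a full serializer: the class name TextHighlighter and the helper highlight are invented here, comments and doctypes are dropped, and the contents of script and style elements would be treated as text.

```python
from html.parser import HTMLParser

class TextHighlighter(HTMLParser):
    """Re-emit the markup unchanged, replacing needle only inside text."""

    def __init__(self, needle, replacement):
        # Keep entities such as &amp; as written instead of decoding them
        super().__init__(convert_charrefs=False)
        self.needle = needle
        self.replacement = replacement
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared in the input
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data.replace(self.needle, self.replacement))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def highlight(html_string, needle, replacement):
    parser = TextHighlighter(needle, replacement)
    parser.feed(html_string)
    parser.close()  # flush any text still buffered
    return "".join(parser.out)

html_string = '<span><span class="ip"></span> Do not stare <span class="img"></span> at the monitor continuously </span>'
print(highlight(html_string, "s", '<span class="highlighted">s</span>'))
```

Because tags are copied through verbatim and only handle_data applies the replacement, the s characters in span and class are left alone.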

Remove <p> tags - Regular Expression (Regex)

I have some HTML and the requirement is to remove only starting <p> tags from the string.
Example:
input: <p style="display:inline; margin: 40pt;"><span style="font:XXXX;"> Text1 Here</span></p><p style="margin: 50pt"><span style="font:XXXX">Text2 Here</span></p> <p style="display:inline; margin: 40pt;"><span style="font:XXXX;"> Text3 Here</span></p>the string goes on like that
desired output: <span style="font:XXXX;"> Text1 Here</span></p><span style="font:XXXX">Text2 Here</span></p><span style="font:XXXX;"> Text3 Here</span></p>
Is it possible using Regex? I have tried some combinations but not working. This is all a single string. Any advice appreciated.
I'm sure you know the warnings about using regex to match html. With these disclaimers, you can do this:
Option 1: Leaving the closing </p> tags
This first option leaves the closing </p> tags, but that's what your desired output shows. :) Option 2 will remove them as well.
PHP
$replaced = preg_replace('~<p[^>]*>~', '', $yourstring);
JavaScript
replaced = yourstring.replace(/<p[^>]*>/g, "");
Python
replaced = re.sub("<p[^>]*>", "", yourstring)
<p matches the beginning of the tag
The negative character class [^>]* matches any character that is not a closing >
> closes the match
we replace all this with an empty string
Option 2: Also removing the closing </p> tags
PHP
$replaced = preg_replace('~</?p[^>]*>~', '', $yourstring);
JavaScript
replaced = yourstring.replace(/<\/?p[^>]*>/g, "");
Python
replaced = re.sub("</?p[^>]*>", "", yourstring)
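One caveat with these patterns: <p[^>]*> also matches other tags whose names begin with p, such as <pre> or <param>. A word boundary after the p avoids that. The sketch below wraps both options in a hypothetical helper called strip_p_tags and ignores case, on the assumption that the input may contain <P> as well.

```python
import re

def strip_p_tags(html, closing_too=False):
    """Remove opening <p ...> tags, and optionally closing </p> tags."""
    # \b requires the tag name to end right after "p", so <pre> is left alone
    pattern = r"</?p\b[^>]*>" if closing_too else r"<p\b[^>]*>"
    return re.sub(pattern, "", html, flags=re.IGNORECASE)

sample = '<p style="margin: 40pt;"><span>Text1</span></p><pre>keep</pre>'
print(strip_p_tags(sample))
print(strip_p_tags(sample, closing_too=True))
```

The usual warning still applies: an attribute value containing a literal > would end the match early.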
This is a PCRE expression:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>(.*<\/p>)/Ug
Replace each occurrence with $3 or just remove all occurrences of:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>/g
If you want to remove the closing tag as well, replace each occurrence of the following with $3:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>(.*)<\/p>/Ug