Use regular expressions to add new class to element using search/replace - html

I want to add a NewClass value to the class attribute and modify the text of the span using find/replace functionality with a pair of regular expressions.
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
I am trying to get the following result after the search/replace:
<span class='customer NewClass' id='phone$1'>Organization</span>
I'm also curious to know whether a single find/replace operation can be used for both tasks.

Regex can do this, but be aware that using regex to change HTML comes with many edge cases that you may not have accounted for.
This regex101 example shows those three <span> elements changed to add NewClass and the contents to be changed to Organization.
Other technologies, however, would be safer. jQuery, for example, could replace them regardless of the order of the attributes (the $ in the ID is a selector metacharacter, so it has to be escaped with two backslashes):
$("span#phone\\$1").addClass("NewClass");
$("span#phone\\$1").text("Organization");
So just be careful with it, and you should be fine.
EDIT
According to comments on the OP, you want to only change the span containing ID phone$1, so the regex101 link has been updated to reflect this.
EDIT 2
Permalink was too long to fit into a comment, so adding the permalink here. Click on the "Content" tab at the bottom to see the replacement.

You can use a regex like this:
'.*?' id='phone\$1'>.*?<
With substitution string:
'customer NewClass' id='phone\$1'>Organization<
PHP code
$re = "/'.*?' id='phone\\$1'>.*?</";
$str = "<div>\n <span class='customer' id='phone\$0'>Home</span>\n<br/>\n <span class='customer' id='phone\$1'>Business</span>\n<br/>\n <span class='customer' id='phone\$2'>Mobile</span>\n</div>";
// The replacement must contain an escaped dollar sign (\$1); a bare $1 would be read as a backreference to a non-existent group.
$subst = "'customer NewClass' id='phone\\\$1'>Organization<";
$result = preg_replace($re, $subst, $str);
Result
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer NewClass' id='phone$1'>Organization</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
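On the question of doing both changes in one operation: a single substitution with capture groups can append the class and swap the text at the same time. Below is a minimal Python sketch of that idea. Python's re module stands in for preg_replace here, the variable names are illustrative, and the pattern assumes the attribute order and single quotes used in the sample markup.

```python
import re

html = """<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>"""

# Group 1: opening tag up to the end of the existing class value.
# Group 2: the rest of the opening tag for the span with id phone$1.
# Group 3: the closing tag; the old text between the groups is dropped.
pattern = re.compile(r"(<span class='[^']*)(' id='phone\$1'>)[^<]*(</span>)")

# One substitution appends the class and replaces the text.
result = pattern.sub(r"\g<1> NewClass\g<2>Organization\g<3>", html)
print(result)
```

Because the existing class value is captured instead of rewritten, the same pattern keeps working if the span carries classes other than customer.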

Since your tags include preg_match and preg_replace, I think you are using PHP.
Regex is generally not a good tool for manipulating HTML or XML. See the SO post "RegEx match open tags except XHTML self-contained tags".
In PHP, you can use DOMDocument and DOMXPath with the //span[@id="phone$1"] XPath (get all span tags whose id attribute value equals phone$1):
$html =<<<DATA
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
DATA;
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$sps = $xp->query('//span[@id="phone$1"]');
foreach ($sps as $sp) {
$sp->setAttribute('class', $sp->getAttribute('class') . ' NewClass');
$sp->nodeValue = 'Organization';
}
echo $dom->saveHTML();
Result:
<div>
<span class="customer" id="phone$0">Home</span>
<br>
<span class="customer NewClass" id="phone$1">Organization</span>
<br>
<span class="customer" id="phone$2">Mobile</span>
</div>
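For comparison, the same parse, select and modify flow can be sketched in Python with the standard library. This is only a sketch: it assumes the fragment is well-formed XML (true for the sample, not for HTML in general), and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

markup = """<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>"""

root = ET.fromstring(markup)

# Select by id, independent of attribute order or quoting style.
for span in root.findall(".//span[@id='phone$1']"):
    classes = span.get("class", "").split()
    if "NewClass" not in classes:  # avoid adding the class twice
        classes.append("NewClass")
    span.set("class", " ".join(classes))
    span.text = "Organization"

result = ET.tostring(root, encoding="unicode")
print(result)
```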

Related

Xpath issues selecting <spans> nested in <td>

I'm trying to extract text from a lot of XHTML documents with a program that uses XPath queries to map the text into a structured table. The XHTML document looks like this:
<td class="td-3 c12" valign="top">
<p class="pa-4">
<span class="ca-5">text I would like to select </span>
</p>
</td>
<td class="td-3 c13" valign="top">
<p class="pa-2">
<span class="ca-0">some more text I want to select </span>
</p>
<p class="pa-2">
<span class="ca-0">
<br>
</br>
</span>
</p>
<p class="pa-2">
<span class="ca-5">text and values I don't want to select.</span>
</p>
<p class="pa-2">
<span class="ca-5"> also text and values I don't want to </span>
</p>
</td>
I'm able to select the spans by their class and retrieve the text/values, but they're not unique enough, so I need to filter by the table cell classes as well. For example, I want only the text from the span with class ca-0 that is a descendant of the td with class td-3 c13,
which would be <span class="ca-0">some more text I want to select </span>
I've tried all these combinations:
//xhtml:td[@class="td-3 c13"]/xhtml:span[@class = "ca-0"]
//xhtml:span[@class = "ca-0"] //ancestor::xhtml:td[@class= "td-3 c13"]
//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"]
I'm not sure how much your sample XML reflects your actual XML, but strictly based on your sample (and disregarding the namespace issues you will probably face), the following XPath expression:
//td[contains(@class,"td-3")]/p[1]/span/text()
selects
text I would like to select
some more text I want to select
According to the doc, and to support namespaces, you should write something like this (fn:...):
//*:td[fn:contains(@class,"td-3")]/*:p[1]/*:span
Or with a namespace binding:
node.xpath("//xhtml:td[fn:contains(@class,'td-3')]/xhtml:p[1]/xhtml:span", {"xhtml":"http://example.com/ns"})
This expression should work too (it selects the first span of the first p of each td element):
//*:td/*:p[1]/*:span[1]
Side notes:
Your XPath expressions can be fixed. The span is not a child of the td but a descendant, so use //. The parentheses with [1] keep only the first result.
(//xhtml:td[@class="td-3 c13"]//xhtml:span[@class = "ca-0"])[1]
(//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"])[1]
Or replace // with a predicate []:
(//xhtml:span[@class = "ca-0"][ancestor::xhtml:td[@class= "td-3 c13"]])[1]
Test your XPath with: https://docs.marklogic.com/cts.validIndexPath
The solution is
//td[(@class ="td-3") and (@class = "c13")]/p/span
For some reason it sees the
<td class="td-3 c13">
as separate classes, e.g.
<td class = "td-3" and class = "c13">
so you need to treat them as such.
Thanks to @E.Wiest and @JackFleeting for validating and pointing me in the right direction.
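The idea of treating the class attribute as a list of separate tokens can also be sketched outside the XPath engine. The Python example below is illustrative only: it assumes the standard XHTML namespace URI, wraps a shortened copy of the sample cells in a table/tr that is not in the original, and uses a hypothetical helper named has_classes.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"  # assumed namespace URI
doc = f"""<table xmlns="{XHTML_NS}"><tr>
<td class="td-3 c12" valign="top"><p class="pa-4"><span class="ca-5">text I would like to select </span></p></td>
<td class="td-3 c13" valign="top">
<p class="pa-2"><span class="ca-0">some more text I want to select </span></p>
<p class="pa-2"><span class="ca-0"><br></br></span></p>
<p class="pa-2"><span class="ca-5">text and values I don't want to select.</span></p>
</td>
</tr></table>"""

ns = {"xhtml": XHTML_NS}
root = ET.fromstring(doc)

def has_classes(element, *wanted):
    """True if every wanted token appears in the element's class attribute."""
    return set(wanted) <= set(element.get("class", "").split())

# Keep ca-0 spans that sit anywhere below a td carrying both td-3 and c13,
# skipping spans that hold no visible text.
texts = [
    span.text.strip()
    for td in root.iterfind(".//xhtml:td", ns)
    if has_classes(td, "td-3", "c13")
    for span in td.iterfind(".//xhtml:span[@class='ca-0']", ns)
    if span.text and span.text.strip()
]
print(texts)
```

Matching on tokens rather than on the full attribute string means the order of the classes, or extra classes on the td, does not affect the result.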

HTML::Element not finding all elements

I have this snippet of html:
<li class="result-row" data="2">
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2018-12-04 09:21" title="Tue 04 Dec 09:21:50 AM">Dec 4</time>
Link Text
and this Perl code (not production, so no quality comments are necessary):
my $root = $tree->elementify();
my @rows = $root->look_down('class', 'result-row');
my $item = $rows[0];
say $item->dump;
my $date = $item->look_down('class', 'result-date');
say $date;
my $title = $item->look_down('class', 'result-title hdrlnk');
All outputs are as I expected except $date isn't defined.
When I look at the $item->dump, it looks like the time element doesn't show up in the output. Here's a snippet of the output from $item->dump where I would expect to see a <time...> element. All it shows is the text from the time element.
<li class="result-row" data="2"> #0.1.9.3.2.0
<a class="result-image gallery empty" href="https://localhost/1.html"> #0.1.9.3.2.0.0
<p class="result-info"> #0.1.9.3.2.0.1
<span class="icon icon-star" role="button"> #0.1.9.3.2.0.1.0
" "
<span class="screen-reader-text"> #0.1.9.3.2.0.1.0.1
"favorite this post"
" "
" Dec 4 "
<a class="result-title hdrlnk" data="2" href="https://localhost/1.html"> #0.1.9.3.2.0.1.2
"Link Text..."
" "
...
I've not used HTML::Element before. I rtfmed and didn't see any tag exclusions and I did a search of the package code for tags white/black lists (which wouldn't make sense, but neither does leaving out the time tag).
Does anyone know why the time element is not showing up in the dump and any search for it turns up nothing?
As an fyi, the rest of the code searches and finds elements without issue, it just appears to be the time tag that's missing.
HTML::TreeBuilder does not support HTML5 tags. Consider Mojo::DOM as an alternative that keeps up with the living HTML standard. I can't show how your whole code would look with Mojo::DOM since you've only shown a piece, but the Mojo::DOM equivalent of look_down is find (returns a Mojo::Collection arrayref) or at (returns the first element found or undef), both taking a CSS selector.

RegEx Removing Span tags from HTML

I need some RegEx for removing span tags with a specific class including the end tag but don't want to remove what's in between ...
I do not want to remove any other span tags
I cannot come up with it since I tend to forget the RegEx Tricks :(
I have this
<span class="SpellE">system_user.user_name</span>
<span>This is some text</span>
<Span class="OtherCLass">Some other text</span>
<span class="SpellE">system_user.userid</span>
And I want this result
system_user.user_name
<span>This is some text</span>
<Span class="OtherCLass">Some other text</span>
system_user.userid
Yes I need to tidy up some messy MS Html :)
Thanks in advance
The following regex should match what you want:
<span class="SpellE">(.*?)</span>
It matches a span with class="SpellE" and captures the span's text in group 1. The lazy quantifier stops at the first closing tag, so two such spans on one line are not merged into a single match.
Then you replace each match with group 1.
In JavaScript, you can use it like this (the slash in the closing tag must be escaped inside a regex literal, and the group is referenced as $1 in the replacement string):
var testStr = '<span class="SpellE">system_user.user_name</span>\n'
+ '<span>This is some text</span>\n'
+ '<Span class="OtherCLass">Some other text</span>\n'
+ '<span class="SpellE">system_user.userid</span>\n';
var regex = /<span class="SpellE">(.*?)<\/span>/gi;
var result = testStr.replace(regex, '$1');
Now result should be your wanted output.
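The same capture-and-replace step can be checked quickly in Python. This is a minimal sketch with illustrative variable names; it assumes the class attribute is written exactly as class="SpellE" and that the span contains no nested span.

```python
import re

test_str = (
    '<span class="SpellE">system_user.user_name</span>\n'
    '<span>This is some text</span>\n'
    '<Span class="OtherCLass">Some other text</span>\n'
    '<span class="SpellE">system_user.userid</span>\n'
)

# Lazy group so each match ends at the first closing tag.
pattern = re.compile(r'<span class="SpellE">(.*?)</span>', re.IGNORECASE)

# Replace every match with the text captured in group 1.
result = pattern.sub(r"\1", test_str)
print(result)
```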

Delete an HTML element containing a pattern

How can I delete elements (from <span> to </span>) whose text contain PATTERN in it? The contents of the element should be deleted along with the element.
For example, I want to delete the first <span>...</span> element in the following:
<span><SPAN>some text with
with </SPAN> a PATTERNin it etc</span><span><SPAN>some text
without </SPAN> a thingIn it etc</span>
to produce, using SED only :
<span><SPAN>some text
without </SPAN> a thingIn it etc</span>
PS: No help with end of lines or solo words, it must just detect any <span>...</span> and PATTERN.
The production server only allows basic commands such as sed.
I'm currently using the following but it's ugly and doesn't seem to work.
sed '/<span.*\n.*PATTERN.*<\/span>/d'
If HTML:
perl -MXML::LibXML -e'
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_file($ARGV[0]);
$_->unbindNode()
for $doc->findnodes(q{//span[contains(text(), "PATTERN")]});
binmode(STDOUT);
print($doc->toString());
' in.html >out.html
If XHTML:
perl -MXML::LibXML -e'
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($ARGV[0]);
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( h => "http://www.w3.org/1999/xhtml" );
$_->unbindNode()
for $xpc->findnodes(q{//h:span[contains(text(), "PATTERN")]}, $doc);
binmode(STDOUT);
print($doc->toString());
' in.xhtml >out.xhtml
The above both produce the following (with some implied elements vivified):
<span><SPAN>some text
without </SPAN> a thingIn it etc</span>
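The parser-based approach above can also be sketched in Python with only the standard library. This sketch makes several assumptions not in the original: the input is well-formed XML, it is wrapped in a hypothetical root element, and tag names are matched case-sensitively. It checks all text below each span, not just the direct text nodes.

```python
import xml.etree.ElementTree as ET

markup = """<root><span><SPAN>some text with
with </SPAN> a PATTERNin it etc</span><span><SPAN>some text
without </SPAN> a thingIn it etc</span></root>"""

root = ET.fromstring(markup)

# ElementTree has no parent pointers, so walk parents and inspect children.
for parent in list(root.iter()):
    for child in list(parent):
        # itertext() yields the element's text and all nested text and tails.
        if child.tag == "span" and "PATTERN" in "".join(child.itertext()):
            parent.remove(child)

result = ET.tostring(root, encoding="unicode")
print(result)
```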

Replace only raw text in HTML string

I have a string:
html_string =
'<span><span class=\"ip\"></span> Do not stare <span class=\"img\"></span> at the monitor continuously </span>\r\n'
I want to replace the character s in the raw text (not in the html tags) of html_string with <span class="highlighted">s</span>.
The result should be:
'<span><span class=\"ip\"></span> Do not <span class="highlighted">s</span>tare <span class=\"img\"></span> at the monitor continuou<span class="highlighted">s</span>ly </span>\r\n'
What I did is:
html_string.gsub(/s/, '<span class="highlighted">s</span>')
but this replaces all occurrences of the s character regardless of raw text or a tag. I want to replace it skipping html tags and its attributes. How it can be done?
This doesn't pretend to be an ideal answer; it just points you in a direction:
require 'nokogiri'
html_string = '<span><span class="ip"></span> Do not stare <span class="img"></span> at the monitor continuously </span>'
doc = Nokogiri::HTML.fragment(html_string)
spans = doc.css('span')
spans.each do |span|
span.xpath('text()').each do |text|
if text.content =~ /stare/
text.content = text.content.sub(/stare/, '<span class="highlighted">s</span>tare')
end
end
end
p doc.to_html.gsub(/&lt;/, '<').gsub(/&gt;/, '>')
Which output is:
#=> "<span><span class=\"ip\"></span> Do not <span class=\"highlighted\">s</span>tare <span class=\"img\"></span> at the monitor continuously </span>"
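So here we look through all spans for text nodes containing the word stare, then change their content. Assigning markup to a text node escapes it, which is why the final gsub calls turn the escaped angle brackets back into real ones. That's all; Nokogiri is worth learning.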
That's really simple: parse the html, replace in the text nodes, print to html.
Nokogiri seems to be popular for that in Ruby.
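The parse, replace in text nodes, and print flow can be sketched in Python with the standard html.parser module. This is a minimal sketch, not a full serializer: the class name TextHighlighter is hypothetical, and comments and doctype declarations are silently dropped.

```python
from html.parser import HTMLParser

class TextHighlighter(HTMLParser):
    """Re-emits the markup unchanged, wrapping `needle` only inside text."""

    def __init__(self, needle):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.needle = needle
        self.wrapped = f'<span class="highlighted">{needle}</span>'
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text between tags reaches this method.
        self.out.append(data.replace(self.needle, self.wrapped))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

html_string = ('<span><span class="ip"></span> Do not stare '
               '<span class="img"></span> at the monitor continuously </span>\r\n')

parser = TextHighlighter("s")
parser.feed(html_string)
parser.close()
result = "".join(parser.out)
print(result)
```

Since tags are copied through verbatim, the s characters in tag names and attribute values such as class are left alone.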