Find and replace all occurrences of a string inside an HTML tag in one pass - html

I need a regular expression to search for and replace multiple occurrences of a text string within a delimited section of text.
Let's say there is HTML code with one or more spans that have a certain class. Each span may have none, one or multiple occurrences of the string {abc} inside, e.g.
<p>lorem ipsum dolor <span class="xyz">sid amet{abc}et pluribus {abc} unum{abc} diex
et mon droit</span> you'll never walk alone</p>
Thus I need a regex pair to replace all occurrences of {abc} within <span class="xyz"> with {def} in a single pass.
This is for use in a text editor such as Notepad++ and the like, and needs to be a PCRE/UNIX-style regular expression.
What I have is,
find: (<span class="xyz">)([^<]*)\{abc\}([^<]*<)
replace: \1\2{def}\3
This does work for one occurrence within a span, but when there are more occurrences I have to run the replacement multiple times, in a loop, whereas I need it to be one-pass.
I wonder how I can achieve that. I suppose this is a pretty common case, but somehow I could not find anything similar that addresses the need to be one-pass, with no loops and no code, and I'd like to get an idea of how this could be done in principle.

This seems to work in Notepad++
Find what : (?:<span class="xyz">|\G)[^<]*?\K\{abc\}(?=[^<]*<\/span>)
Replace with : {def}
Search mode : Regular expression
Note that because of the [^<]* there is an assumption that there are no other tags within the span tag.
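For comparison outside the editor, the same one-pass idea can be sketched in Python. The standard re module supports neither \G nor \K, so this sketch swaps in a different technique: it matches each whole span once and does the inner replacement in a callback. The name replace_in_spans and the sample string are made up for illustration, and the same assumption of no nested tags inside the span applies.

```python
import re

# Matches one whole <span class="xyz">...</span> that has no tags inside it.
SPAN_RE = re.compile(r'(<span class="xyz">)([^<]*)(</span>)')

def replace_in_spans(html, old="{abc}", new="{def}"):
    """Replace every `old` inside matching spans in a single re.sub pass."""
    def fix_span(match):
        opening, body, closing = match.groups()
        # Plain string replacement, applied only to the span's contents.
        return opening + body.replace(old, new) + closing
    return SPAN_RE.sub(fix_span, html)

sample = ('<p>{abc} lorem <span class="xyz">sid amet{abc}et pluribus {abc} unum{abc} diex\n'
          'et mon droit</span> you\'ll never {abc} walk alone</p>')
print(replace_in_spans(sample))
```

Occurrences of {abc} outside the span are left alone, because only the captured span body is passed to str.replace.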

Related

How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not been able to build it myself.
Your help in "negating" the above expression is most appreciated.
Assuming there are no < or > characters outside the tags (or that they are entity-encoded), here is an idea using a lookbehind:
(?:(?<=>)|^)[^<]+
See this demo at regex101
(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
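As a rough check of the explanation above, here is the pattern run against the question's sample input with Python's re module (the flavor the question mentions). The variable names are illustrative only.

```python
import re

# A text run starts either right after a '>' or at the start of the string,
# and continues up to (but not including) the next '<'.
TEXT_RE = re.compile(r'(?:(?<=>)|^)[^<]+')

html = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'
pieces = TEXT_RE.findall(html)
print(pieces)  # ['just a small ', '\nexample', ' link', 'to a webpage']
```

The matched pieces keep their original whitespace and line breaks, so any joining or normalizing is left to whatever consumes them.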

Search and replace outer tag in Atom using REGEX

Using Atom, I'm trying to replace the outer tag structure for multiple different texts within a document. I'm using regex for this, but I'm not versed enough in it to come up with my own solution.
HTML to be searched <span class="klass">Any text string</span>
Replace it with <code>Any text string</code>
My REGEX (<?span class="klass">)+[\w]+(<?/span>)
Is there a wildcard to "keep" the [\w] part into the replaced result?
You can use a capture group to capture the text in between the <span> tags during the match, and then use it to build the <code> output you want. Try the following find and replace:
Find:
<span class="klass">(.*?)</span>
Replace:
<code>$1</code>
Here $1 represents the text captured by (.*?) in the search. One other point: we use .*? when capturing between tags as opposed to just .*. The former, .*?, is a "lazy" (non-greedy) dot. This tells the engine to stop matching upon hitting the first closing </span> tag. Without this, the match would be greedy and would consume as much as possible, ending only with the final </span> tag in your text.
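The difference between the lazy and the greedy form can be seen in a small Python sketch. Note that Python's re.sub writes the backreference as \1 rather than $1, and that the sample text is invented for illustration.

```python
import re

text = 'a <span class="klass">one</span> b <span class="klass">two</span> c'

# Lazy: each span is matched separately; \1 is the captured inner text.
lazy = re.sub(r'<span class="klass">(.*?)</span>', r'<code>\1</code>', text)

# Greedy: one match runs from the first opening tag to the last closing tag.
greedy = re.sub(r'<span class="klass">(.*)</span>', r'<code>\1</code>', text)

print(lazy)    # a <code>one</code> b <code>two</code> c
print(greedy)  # a <code>one</span> b <span class="klass">two</code> c
```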

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires every tag to be properly closed.
If you are using PHP, for simplicity, I strongly recommend you to use DOM (Document Object Model) to parse/extract HTML documents. DOM library usually exists in every programming language.
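As a minimal sketch of the parser-based route in Python: the standard library's html.parser is an event-driven tokenizer rather than a full DOM, but it illustrates the same idea of letting a parser separate markup from text. The class and function names (TextExtractor, extract_text) are invented for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments are routed to handle_comment, so they never arrive here.
        if self.skip_depth == 0:
            self.parts.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(extract_text('<p>Hello <b>world</b></p><script>var x = "<p>no</p>";</script>'))
```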
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
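A short Python sketch of the replace-with-empty-string idea, using the question's own pattern, shows both the working case and the script-inside-script failure described above. The names strip_markup, page and nested are illustrative only.

```python
import re

MARKUP_RE = re.compile(
    r'(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)'  # whole script/style blocks
    r'|(?:<!--[\s\S]*?-->)'                            # comments
    r'|(?:<[\s\S]*?>)',                                # any other tag
    re.IGNORECASE,
)

def strip_markup(html):
    """Delete everything the markup pattern matches; what remains is the text."""
    return MARKUP_RE.sub('', html)

page = ('<html><head><style>p{color:red}</style>'
        '<script>var a = "<b>";</script></head>'
        '<body><!-- note --><p>Hello <b>world</b></p></body></html>')
print(strip_markup(page))  # Hello world

# A script that writes another script tag ends the lazy match too early.
nested = '<script>document.write("<script src=x></script>");</script>'
print(strip_markup(nested))  # ");
```

In the second case the first </script> closes the match, so a fragment of JavaScript leaks into the extracted text.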
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
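For readers who don't use VB.Net, roughly the same match-evaluator approach can be sketched in Python with re.sub and a callback. This is a transcription under assumptions, not the original code: the function name replace_in_text is invented, and the text alternative uses + instead of * so that empty matches are skipped.

```python
import re

TOKEN_RE = re.compile(
    r'(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)'  # script/style blocks
    r'|(?:<!--[\s\S]*?-->)'                            # comments
    r'|(?:<[\s\S]*?>)'                                 # any other tag
    r'|(?P<text>[^<>]+)',                              # leftover plain text
    re.IGNORECASE,
)

def replace_in_text(html, old, new):
    """Replace `old` with `new` in plain text only, leaving markup untouched."""
    def evaluator(match):
        text = match.group('text')
        if text is None:
            return match.group(0)  # markup: copy through unchanged
        return text.replace(old, new)
    return TOKEN_RE.sub(evaluator, html)

source = ('<p class="Original word">Original word</p>'
          '<script>var s = "Original word";</script>')
print(replace_in_text(source, "Original word", "Replacement word"))
```

Only the visible paragraph text changes; the attribute value and the script contents are matched by the markup alternatives and returned as they were.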
For Your Information,
Instead of regex, with jQuery it's possible to extract the text alone from an element's HTML markup. For that you can use the following pattern.
$("#elementId").text()
You can refer to this JSFiddle.

Finding a string that is split by multiple html tags

I am using XPath to find a list of strings in an HTML document. The strings appear when you type into a text box, to suggest possible results - in other words, it's auto-complete. The problem is, I'm trying to retrieve the whole list of auto-complete suggestions, but the results are all split up by <strong> tags.
To give a couple examples: I start typing "str" and the HTML will look like this:
<strong>str</strong>ing
But it gets better! If I don't type anything at all, every single character in the auto-complete results will be interrupted with opening and closing strong tags. Like so:
s
<strong></strong>
t
<strong></strong>
r
<strong></strong>
i
<strong></strong>
n
<strong></strong>
g
So, my question is, how do I construct an xpath that retrieves this string, but omits the strong tags?
For reference, the hierarchy of the HTML looks like this:
-div
--ul
---li
----(string I'm looking for)
---li
----(another string I'm looking for)
So my xpath at this point is: //div[@class='class']/ul/li/text(), which will get me the individual parts of the strings.
This XPath expression:
string(PathToYourDiv/ul/li[$n])
evaluates to the string value of the $n-th li child of the ul that is a child of YourDiv. This is the concatenation of all the text-node descendants of this li element -- effectively giving you the complete string you want.
You have just to substitute YourDiv and $n with specific expressions.
Do not use the // abbreviation, because:
Its evaluation can be very slow.
Indexing such an expression with [] is not intuitive and produces surprising results that result in a FAQ.
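The string-value idea behind string() can be tried out in Python. The standard library's xml.etree.ElementTree supports only a subset of XPath and has no string() function, so in this sketch itertext() stands in for it; the wrapping body element and the sample markup are assumptions made up to mirror the question's hierarchy.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<body><div class="class"><ul>'
    '<li><strong>str</strong>ing</li>'
    '<li>s<strong></strong>t<strong></strong>r<strong></strong>'
    'i<strong></strong>n<strong></strong>g</li>'
    '</ul></div></body>'
)

# An explicit path from the root, without the // abbreviation.
items = doc.findall("./div[@class='class']/ul/li")

# itertext() yields every text-node descendant in document order, which is
# what the XPath string value of an element concatenates.
suggestions = ["".join(li.itertext()) for li in items]
print(suggestions)  # ['string', 'string']
```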
That is much less code on the question than people would like to see around here.
But why don't you try a variant like this:
//div[@class='class']/ul/li/strong/text()

Regular expression to delete HTML strings

I am trying to delete part of a string that does not match my pattern. For example, in
<SYNC Start=364><P Class=KRCC>
<Font Color=lightpink>abcd
I would like to delete
<P Class=KRCC><Font Color=lightpink>
How do I do that?
Your question does not indicate that you need (or should use) regular expressions. If you want to remove a fixed string, do traditional search and replace.
Just match 'your pattern' and write that to a file or update the table of a database. That way, you are deleting the rest.
If the HTML you are parsing is valid and always follows a known standard format, you can use non-greedy patterns to remove most of what you don't want.
These samples will have to be modified based on the tool/framework you're using to handle regular expressions. I am not escaping special characters for brevity.
To match any paragraph tags:
<p.*?>(.*?)</p>
You would replace these matches with $1 (or whatever your syntax requires to access groups).
It's important to use non-greedy (?) patterns to avoid accidentally matching two unrelated start/end tags. For example:
<p.*>(.*)</p>
Would behave very differently. In the case of the following example HTML, it would not correctly match two paragraphs:
<p>Lorem ipsum.</p><p>Lorem ipsum.</p>
Instead, it would match "<p>Lorem ipsum.</p><p>" as the first portion, which would result in losing content.
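The behaviour described above can be checked with a few lines of Python. In this sketch an extra capture group is wrapped around the opening-tag part of the greedy pattern, purely to make visible what it swallows.

```python
import re

html = '<p>Lorem ipsum.</p><p>Lorem ipsum.</p>'

# Non-greedy: two separate matches, one per paragraph.
lazy_groups = re.findall(r'<p.*?>(.*?)</p>', html)

# Greedy: a single match; the opening-tag part swallows the first paragraph.
greedy = re.search(r'(<p.*>)(.*)</p>', html)

# Replacing the greedy match with its content group loses one paragraph.
stripped = re.sub(r'<p.*>(.*)</p>', r'\1', html)

print(lazy_groups)      # ['Lorem ipsum.', 'Lorem ipsum.']
print(greedy.group(1))  # <p>Lorem ipsum.</p><p>
print(stripped)         # Lorem ipsum.
```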
If you need to match paragraphs with specific classes, you could use something like this:
<p.*?class="delete".*?>(.*?)</p>
Where things get sticky is when you start working with non-standardized HTML. For example, this is all valid HTML, but the pattern to clean it up would be ugly:
<p>no class</p>
<p class=delete>no quotes</p>
<p class="delete">double quotes</p>
<p class='delete'>single quotes</p>
<p>space in closing tag</p >
<p>no closing tag