Regex: Negate capture group with logical or - html

I'm trying to use regex to filter forbidden HTML tags out of a given string. Yes I know, I'm supposed to use a parser instead but for this specific problem it's faster this way.
The idea is to whitelist every tag which is okay (e.g. <span>, <b>, </br>) and match forbidden ones. So far I came up with the following expression: <\/?(?!(span|b|br)).\>
It works well for single char tags like <a> but stuff like <label> does not work. I'd really appreciate some help, thanks in advance.

This regex will get tags while ignoring the span, br, b opening and closing tags.
It should even ignore those from the white list if they contain attributes.
<\/?(?!(?:span|br|b)(?: [^>]*)?>)[^>\/]*>
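A minimal sketch of how the pattern above behaves, assuming it is run in JavaScript with the g flag. The sample string and the helper name findForbiddenTags are made up for this illustration and are not part of the original answer.

```javascript
// Pattern from the answer above; the g flag is added so every forbidden tag is returned.
const forbiddenTag = /<\/?(?!(?:span|br|b)(?: [^>]*)?>)[^>\/]*>/g;

// Hypothetical sample input, invented for this illustration.
const sample =
  '<span class="x">ok</span><b>bold</b><br><label for="a">no</label><a>x</a><body>';

// Hypothetical helper: returns all tags that are NOT on the whitelist.
function findForbiddenTags(html) {
  return html.match(forbiddenTag) || [];
}

console.log(findForbiddenTags(sample));
// [ '<label for="a">', '</label>', '<a>', '</a>', '<body>' ]
```

Note that <body> is reported even though it starts with the whitelisted name b, because the lookahead requires the name to be followed by a space or the closing >. One limitation worth knowing: since [^>\/]* stops at a slash, a forbidden tag whose attributes contain a / (for example an href with a URL) is not matched by this pattern.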

/<(?!(\/?span|\/?b|\/?br)).*?>/g

Related

Find and exclude html-tags as whole words in negative lookbehind with regex

I basically try to find all paragraphs (in javascript/jquery) in a text, that are not yet wrapped in a set of defined html-tags:
p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe
My current regex (https://regex101.com/r/O4i2hP/1) already matches paragraphs and excludes the defined tags
(.+?(?<![</(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>]$))(\n|$)+/gm
but I just don't get, how to just match whole tags only.
The problem is that, inside the square brackets,
(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)> is a character class, so it matches a single character in the list (p|h123456blockquteimgafr)> (case sensitive).
Thus, as you can see from the example, code that is wrapped in tags such as <strong>TEXT</strong> is also excluded.
I tried different things such as word boundaries \bword\b, but didn't get it working. I hope you can help. Thx
This will do it.
^(?!<(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)+?>.*</\1>).*$
I now found a working approach. The tags should be wrapped in groups rather than in character classes. The following works for me:
(.+?(?<!(<\/)(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)(>)$))(\n|$)+/gm
see also: https://regex101.com/r/DC5msM/1
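To make the difference between the character class and the group concrete, here is a small sketch that runs both patterns from this thread in JavaScript. The sample lines and the helper name findUnwrappedLines are assumptions made for this illustration, and the example relies on an engine with lookbehind support (for example a recent Node.js or Chrome).

```javascript
// Original attempt: the tag names sit inside [...], so they act as a set of single characters.
const withCharClass =
  /(.+?(?<![<\/(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>]$))(\n|$)+/gm;

// Working version: the tag names are wrapped in groups, so whole closing tags are compared.
const withGroups =
  /(.+?(?<!(<\/)(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)(>)$))(\n|$)+/gm;

// Hypothetical sample text, invented for this illustration.
const text = [
  '<p>already wrapped</p>',
  'plain paragraph.',
  '<strong>TEXT</strong>',
  '<h2>heading</h2>',
].join('\n');

// Hypothetical helper: returns the first capture group (the line itself) of every match.
function findUnwrappedLines(pattern, input) {
  return Array.from(input.matchAll(pattern), (m) => m[1]);
}

console.log(findUnwrappedLines(withCharClass, text));
// [ 'plain paragraph.' ]
console.log(findUnwrappedLines(withGroups, text));
// [ 'plain paragraph.', '<strong>TEXT</strong>' ]
```

With the character class, any line ending in one of the listed characters (including every line ending in >) is excluded, which is why the <strong> line disappears. With groups, only lines that end in one of the listed closing tags are excluded.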

How to regex match individual html tag with inner tag when multiple tags exists in text

let's say my text looks like this:
<button class="b1"
(click)="b1()">
<mat-icon>icon</mat-icon>
</button>
<button class="b1"
(click)="b1()">
<mat-icon>othericon</mat-icon>
<span>Some Text</span>
</button>
I'm trying to use regex (Rust based without lookaround... because that's what VSCode uses) to select only buttons that include the span inside them. I've tried this:
<button[\n\s\S]*?>[\n\s\S]*?span[\n\s\S]*?</button>
... but the problem is that it matches from the start of the first button in the file even if it doesn't include a span. I thought the Lazy quantifier would find the shortest match, but it doesn't seem to work that way. See my RegExr http://regexr.com/4cdra for an example. I want it to match over multiple lines which is the reason for the [\n\s\S].
<button[\n\s\S]*?>[\n\s\S]*?</button> ... doing this works well to match just the single tags... however getting it to work with inner tags is where I'm getting stuck.
Thanks!
In general, you should avoid trying to parse HTML using regex. Given that you are doing this from an IDE, you may not have any choice. One trick which can work here is to use a tempered dot to avoid parsing a closing </button> tag:
<button[^>]*>((?!</button>)[\s\S])*<span>[\s\S]*?</button>
Most of the pattern is probably familiar to you. Of note, I use [\s\S] to match across newlines. Also, consider the tempered dot trick:
((?!</button>)[\s\S])*
This uses a negative lookahead to match any character, one at a time, so long as the closing </button> tag is not encountered. This prevents the pattern from crossing tags while trying to find a <span>.
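As a rough check of the tempered dot, the pattern can be tried in JavaScript against the two buttons from the question. This is only a sketch outside the editor: the helper name findButtonsWithSpan is hypothetical, and the forward slashes are escaped because the pattern is written as a JavaScript regex literal.

```javascript
// Pattern from the answer above, written as a JavaScript regex literal with the g flag.
const buttonWithSpan =
  /<button[^>]*>((?!<\/button>)[\s\S])*<span>[\s\S]*?<\/button>/g;

// The markup from the question.
const markup = `<button class="b1"
(click)="b1()">
<mat-icon>icon</mat-icon>
</button>
<button class="b1"
(click)="b1()">
<mat-icon>othericon</mat-icon>
<span>Some Text</span>
</button>`;

// Hypothetical helper: returns every button element that contains a <span>.
function findButtonsWithSpan(html) {
  return html.match(buttonWithSpan) || [];
}

const matches = findButtonsWithSpan(markup);
console.log(matches.length); // 1
console.log(matches[0].includes('othericon')); // true
```

Only the second button is returned: for the first one, the tempered dot stops at its own </button> before any <span> is found, so the attempt fails instead of running on into the next button.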

<in a nutshell> as text not html tag

I have a text: Our process<in a nutshell>
that has an output as:
Our process<in nutshell="" a=""></in>
I didn't even know in is a tag and cannot find on google what it does.
How do I post it as text? And what is <in>?
Thanks!
In HTML:
Our process &lt;in a nutshell&gt;
There is no <in> tag defined in HTML, but browsers and other parsers still treat <in a nutshell> as a tag. It creates an element node in the document tree, representing an unknown element, so it has only a set of general properties. It has no special rendering, and no functionality is associated with it. But you could style it and/or use client-side JavaScript to add functionality to it.
In this case, you didn’t mean to do anything like that, but the tag is still parsed, and in is treated as the element name (tag name) and nutshell and a as attribute names, with attribute values defaulted to the empty string. Since tags are treated as code for starting an element, the tag itself is not rendered. Browsers may imply a closing tag </in> under certain conditions. This explains the “output” presented in the question; it’s really just the fragment of code viewed in a browser’s Developer Tools. The actual rendering in the example case is just the string “Our process”.
To prevent this processing, the “<” character needs to be escaped somehow; &lt; is the best and most common method, so you would write
Our process&lt;in a nutshell>
There is no need to escape the “>”, but you may do so, for symmetry, using &gt;.
Try to replace
< with &lt;
and replace
> with &gt;
Does this give you the expected results?
The browser is interpreting anything in '<>' as a tag.
You need to use the character code to display those symbols as text:
Our process &lt;in a nutshell&gt;
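If the text is produced by a script rather than typed by hand, the same replacement can be done programmatically. The sketch below is one possible way to do it in JavaScript; escapeHtml is a hypothetical helper name, not a built-in, and escaping & as well is an addition not mentioned in the answers above (it keeps existing ampersands from being read as the start of an entity).

```javascript
// Hypothetical helper, not a built-in: turns characters with special meaning
// in HTML into character references so they are displayed as text.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;') // must run first so the entities added below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(escapeHtml('Our process <in a nutshell>'));
// Our process &lt;in a nutshell&gt;
```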

Regex Operator in Validating HTML Tags

I am following Regular Expression.info and see on their samples page an expression to match against HTML tags, as follows:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
What is the semantic effect of the part \b[^>]*? I get that \b is a word boundary, but given what follows it, what is its purpose?
It matches anything extra (if it exists) up until the next occurrence of a ">" (the end of the opening tag). This would match stuff like class="classname" id="idname". However, it would also match any other character you could think of, such as •·°ÁÓ, which may or may not be what you want. As always, a proper HTML parser is the way to go for parsing HTML.
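A short sketch of both parts in JavaScript, assuming the i flag so the uppercase classes also match lowercase tag names. The helper name parseTagPair and the sample strings are made up for this illustration. The second pattern drops \b to show what the word boundary adds: it stops the engine from backtracking the tag name to a shorter prefix and letting [^>]* absorb the rest.

```javascript
// Expression from the question, with the i flag (an assumption) for lowercase tags.
const withBoundary = /<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/i;

// Same expression without \b, only for comparison.
const withoutBoundary = /<([A-Z][A-Z0-9]*)[^>]*>(.*?)<\/\1>/i;

// Hypothetical helper: returns the tag name and inner text of the first matching pair.
function parseTagPair(html) {
  const m = withBoundary.exec(html);
  return m ? { tag: m[1], content: m[2] } : null;
}

// [^>]* skips over the attributes up to the closing ">" of the opening tag.
console.log(parseTagPair('<div class="classname" id="idname">hello</div>'));
// { tag: 'div', content: 'hello' }

// With \b the name must be the whole word "bold", so </b> is not accepted as its closing tag.
console.log(withBoundary.test('<bold>x</b>')); // false

// Without \b the name can shrink to "b" while [^>]* swallows "old".
console.log(withoutBoundary.test('<bold>x</b>')); // true
```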

Search to exclude html tags in MySQL

I need a search query which will exclude text within HTML tags. For example, I need to search for a word called "spa" in my database. There are HTML tags in the database, so the result will contain <span> tags.
I need the search query to check only the words starting with the word "spa" but not within any HTML tag.
Please help.
Regular expressions are always really hard to use on HTML, because there are many rules that apply to it.
You should consider using an HTML parser instead.