HTML Regex Selector - html

I am trying to write a regex for HTML, and I am running into a few minor issues with heading blocks and with the title in the head.
To explain it better:
<h5>Thing</h5> is selected in its entirety, but I only want <h5> and </h5> selected. It is the same with <title>Test</title>: I only want the HTML tags selected, but the regex selects the whole thing.
Here is my regex so far:
/(<\/(\w+)>)|(<(\w+)).+?(?=>)>|(<(\w+))>/ig

Your problem is here: <(\w+).+?(?=>)>
This says:
1. open an angle bracket
2. consume as many word characters as possible (min 1)
3. consume as few characters as possible (min 1)
4. make sure a closing angle bracket follows
5. consume the closing angle bracket
First of all, step 4 is superfluous; you know you will have a closing bracket next, otherwise step 5 will fail to match.
But the bigger problem is step 3. Let's see what happens on <h5>Thing</h5>:
1. <
2. h5 (because > is not a word character)
3. >Thing</h5, because this is the shortest run of characters that is followed by a closing angle bracket (remember, matching 0 characters here is not an option)
4. make sure > comes next
5. >
Anyway, in the simple case, what you want can be done by /<\/?.+?>/. This will break if attributes have values that include a greater than symbol: <div title="a>b">. Avoiding this is possible, but it makes the regexp a bit more complex, kind of like this (but I may have forgotten something):
<\w+(?:\s+\w+(?:=(?:"[^"]*"|'[^']*'|[^'"][^\s>]*)?)?)*\s*>|<\/\w+>
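The behaviour described above can be checked with a short sketch. The three patterns are the ones quoted in this thread; the helper name findTags and the sample strings are only illustrative assumptions.

```javascript
// The asker's pattern, the simple fix, and the attribute-aware variant from the answer.
const original = /(<\/(\w+)>)|(<(\w+)).+?(?=>)>|(<(\w+))>/ig;
const simple = /<\/?.+?>/g;
const robust = /<\w+(?:\s+\w+(?:=(?:"[^"]*"|'[^']*'|[^'"][^\s>]*)?)?)*\s*>|<\/\w+>/g;

// Hypothetical helper: returns every full match of `pattern` in `html`.
function findTags(pattern, html) {
  return Array.from(html.matchAll(pattern), (m) => m[0]);
}

console.log(findTags(original, "<h5>Thing</h5>"));    // [ '<h5>Thing</h5>' ]
console.log(findTags(simple, "<h5>Thing</h5>"));      // [ '<h5>', '</h5>' ]
console.log(findTags(simple, '<div title="a>b">'));   // [ '<div title="a>' ]
console.log(findTags(robust, '<div title="a>b">'));   // [ '<div title="a>b">' ]
```

The third call shows the failure mode mentioned above: the simple pattern stops at the > inside the attribute value, while the longer pattern steps over quoted values.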

Related

RegEx replace only occurrences outside of <h> html tags

I would like to regex replace Plus in the below text, but only when it's not wrapped in a header tag:
<h4 class="Somethingsomething" id="something">Plus plan</h4>The <b>Plus</b> plan starts at $14 per person per month and comes with everything from Basic.
In the above I would like to replace the second "Plus" but not the first.
My regex attempt so far is:
(?!<h\d*>)\bPlus\b(?!<\\h>)
Meaning:
Do not match the following if it sits inside a <h, followed by a digit and 0 or more characters, and ending with a closing <\h>
Capture only if the group "Plus" is surrounded by spaces or white space
However, this captures both occurrences. Can someone point out my mistake and correct this?
I want to use this in VBA, but it should be a general regex question, as far as I understand.
Somewhat related but not addressing my problem in regex
Not relevant, as not RegEx
You can use
\bPlus\b(?![^>]*<\/h\d+>)
To use the match inside the replacement pattern, use the $& backreference in your VBA code.
Details:
\bPlus\b - a whole word Plus
(?![^>]*<\/h\d+>) - a negative lookahead that fails the match if, immediately to the right of the current location, there are
[^>]* - zero or more chars other than >
<\/h - </h string
\d+ - one or more digits
> - a > char.
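As a rough check of this pattern, here is a minimal sketch in JavaScript rather than VBA (the lookahead and the $& replacement token behave the same way there). The replacement word "Premium" and the variable names are made up for illustration.

```javascript
// Sample text from the question.
const text = '<h4 class="Somethingsomething" id="something">Plus plan</h4>' +
  'The <b>Plus</b> plan starts at $14 per person per month and comes with everything from Basic.';

// Match "Plus" only when no closing heading tag follows before the next ">".
const outsideHeading = /\bPlus\b(?![^>]*<\/h\d+>)/g;

// "Premium" is a made-up replacement; "$&" re-inserts the matched text.
const replaced = text.replace(outsideHeading, "Premium");
const wrapped = text.replace(outsideHeading, "[$&]");

console.log(replaced); // heading keeps "Plus plan", body becomes <b>Premium</b>
console.log(wrapped);  // heading keeps "Plus plan", body becomes <b>[Plus]</b>
```

Note that the lookahead only looks as far as the next >, so a heading that contains nested tags after the word (for example <h4>Plus <i>plan</i></h4>) would not be protected by it.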

RegExp to search text inside HTML tags

I'm having some difficulty using a RegExp to search for text between HTML tags. This is for a search function that searches the text of an HTML page without matching characters inside the tags or attributes of the HTML. When a match is found, I surround it with a div and assign it a highlight class to highlight the search words in the HTML page. If the RegExp also matches inside tags or attributes, the HTML code becomes corrupt.
Here is the HTML code:
<html>
<span>assigned</span>
<span>Assigned > to</span>
<span>assigned > to</span>
<div>ticket assigned to</div>
<div id="assigned" class="assignedClass">Ticket being assigned to</div>
</html>
and the current RegExp I've come up with is:
/(?<=(>))assigned(?!\<)(?!>)/gi
which matches if assigned or Assigned is the start of text in a tag, but not on the others. It does a good job of ignoring the attributes and tags but it is not working well if the text does not start with the search string.
Can anyone help me out here? I've been working on this for an hour now but can't find a solution (RegExp noob here...)
UPDATE 2
https://regex101.com/r/ZwXr4Y/1 shows the remaining problem regarding HTML entities and HTML comments.
The problem left when searching is that HTML entities (such as &nbsp;) are not ignored; all text inside HTML entities and comments should be ignored. So when searching for "b" it should not match inside an entity, even if the entity sits correctly between HTML tags.
Update #2
Regex:
(<)(script[^>]*>[^<]*(?:<(?!\/script>)[^<]*)*<\/script>|\/?\b[^<>]+>|!(?:--\s*(?:(?:\[if\s*!IE]>\s*-->)?[^-]*(?:-(?!->)-*[^-]*)*)--|\[CDATA[^\]]*(?:](?!]>)[^\]]*)*]])>)|(e)
Usage:
html.replace(/.../g, function(match, p1, p2, p3) {
    return p3 ? "<div class=\"highlight\">" + p3 + "</div>" : match;
});
Explanation:
As you ran into more situations, I had to modify the regex to cover more cases. This version covers almost all of them. How it works:
Captures all <script> tags and their contents
Captures all CDATA blocks
Captures all HTML tags (opening / closing)
Captures all HTML comments (as well as IE conditional statements)
Captures all targeted strings, defined in the last group, inside the remaining text (here it is (e))
Doing so lets us quickly manipulate our target, e.g. wrap it in tags as shown in the usage section. Performance-wise, I tried to write it so that it performs well.
This regex does not guarantee 100% correct match positions (closer to 99%), but it should give the expected results most of the time and can easily be modified later.
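The idea behind this answer (match everything that must be skipped first, capture the search term last, and only rewrite when the last group is set) can be sketched with a much simpler pattern. This is an assumption-laden reduction, not the answer's full regex: it only skips comments, tags and entities, ignores <script> and CDATA, and the helper name highlight is hypothetical.

```javascript
// Hypothetical helper: wraps occurrences of `term` in a highlight div,
// leaving comments, tags and entities untouched.
function highlight(html, term) {
  // Escape regex metacharacters in the search term.
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Skip alternatives come first; the search term is the only capture group.
  const pattern = new RegExp(
    "<!--[\\s\\S]*?-->|<[^>]*>|&[#\\w]+;|(" + escaped + ")", "gi");
  return html.replace(pattern, (match, target) =>
    target ? '<div class="highlight">' + target + "</div>" : match);
}

console.log(highlight(
  '<div id="assigned" class="assignedClass">Ticket being assigned to</div>',
  "assigned"));
console.log(highlight("a&nbsp;b<!-- b -->", "b"));
```

In the first call only the word in the text is wrapped, while the id and class values are consumed by the tag alternative and returned unchanged. In the second call the "b" inside the entity and the one inside the comment are skipped for the same reason.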
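Try this:
string.match(/<.{1,15}>(.*?)<\/.{1,15}>/g)
The pattern <.{1,15}>(.*?)</.{1,15}> targets anything between an opening and a closing HTML tag:
<any> Content </any>
For example, given
<div> this is the content </div>
group 1 captures "this is the content" (with its surrounding spaces). Note that match() with the g flag returns the whole matches, tags included, so the group has to be read with exec() or matchAll().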

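To read the captured text of the <.{1,15}>(.*?)</.{1,15}> pattern rather than the whole match, something like the following sketch could be used. The variable names and the trim() call are illustrative assumptions.

```javascript
const tagContent = /<.{1,15}>(.*?)<\/.{1,15}>/g;
const sample = "<div> this is the content </div>";

// match() with the g flag returns whole matches, tags included.
const whole = sample.match(tagContent);

// matchAll() exposes capture group 1, i.e. the text between the tags.
const inner = Array.from(sample.matchAll(tagContent), (m) => m[1].trim());

console.log(whole); // [ '<div> this is the content </div>' ]
console.log(inner); // [ 'this is the content' ]
```
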
Regular Expression for HTML attributes

I need to write a regular expression to catch attributes like the following:
class="something_A211"
style="width:380px;margin-top: 20px;"
I have no idea how to write it, can someone help me?
I need this because, in an HTML file, I have to replace them (with Notepad++) with nothing, so that I end up with a clean <tr> or <td> or anything else.
Thank you
You can use a regex like this to capture the content:
((?:class|style)=".*?")
However, if you just want to match and delete the attributes, you can drop the capturing group:
(?:class|style)=".*?"
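A quick sketch of the replacement, written in JavaScript instead of Notepad++ for illustration. The second pattern, with a leading \s+, is an assumed refinement that is not part of the answer; it also removes the space in front of each attribute so that the tag comes out clean.

```javascript
const row = '<tr class="something_A211" style="width:380px;margin-top: 20px;">';

// The answer's pattern removes the attributes but leaves the spaces in front of them.
const stripped = row.replace(/(?:class|style)=".*?"/g, "");

// Assumed refinement: also consume the leading whitespace to get a clean tag.
const clean = row.replace(/\s+(?:class|style)=".*?"/g, "");

console.log(JSON.stringify(stripped)); // "<tr  >"
console.log(JSON.stringify(clean));    // "<tr>"
```
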
For all constructions like something="data", you can use this.
[^\s]*?\=\".*?\"
https://regex101.com/r/oQ5dR0/1
The link shows you what everything does.
To explain it briefly: a non-space character can come before the "=" any number of times, then come the quotes and the data inside them.
The question mark in .*? (any character, any number of times) is needed so that only the minimum number of characters is consumed (instead of running on to some later closing quote further along).

Regex Operator in Validating HTML Tags

I am following Regular-Expressions.info and see on their samples page an expression to match against HTML tags, as follows:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
What is the semantic effect of the part \b[^>]*>? I get that \b is a word boundary, but given what follows it, what is its purpose?
It matches anything extra (if it exists) up until the next occurrence of a ">" (the end of the opening tag). This would capture stuff like class="classname" id="idname". However, it would also capture any character you could think of, such as •·°ÁÓ, which may or may not be what you want. As always, a proper HTML parser is the way to go for parsing HTML.
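One way to see what the \b contributes is to compare the pattern with and without it. The sketch below assumes the case-insensitive flag and uses made-up sample strings. Without \b, the tag-name group can backtrack to a partial name and let [^>]* swallow the rest, so a mismatched closing tag still satisfies the \1 backreference.

```javascript
const withBoundary = /<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/i;
const withoutBoundary = /<([A-Z][A-Z0-9]*)[^>]*>(.*?)<\/\1>/i;

// [^>]* consumes the attributes; group 1 is the tag name, group 2 the content.
const m = '<div class="classname" id="idname">hi</div>'.match(withBoundary);
console.log(m[1], m[2]); // div hi

// <BR> is not closed by </B>, and only the \b version notices.
const mismatched = "<BR>text</B>";
console.log(withBoundary.test(mismatched));    // false
console.log(withoutBoundary.test(mismatched)); // true
```
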

Regex to match text longer than x characters between html tags?

I have the task of migrating THE worst HTML product descriptions you will ever encounter. They consist of a mixture of tables and paragraphs. The majority are not even 100% valid HTML and there are plenty of Microsoft tags courtesy of MS Word. They are littered with inline style tags and most of it relies on the most bonky set of CSS rules you will ever see.
Essentially I have come to the realisation that the only thing of use is the paragraphs of text. I cannot just grab the <p> tags, as sometimes the paragraphs do not use them and sometimes titles or single words have their own <p> tag.
So my question is: can I match text that is longer than x characters between HTML tags?
Ideally it would also ignore <br/> and <br>
Here is a link to an example of the html I am dealing with
Note it is just the description I am processing, not the whole page.
Group 1 of this regex will match n+ chars between tags (n = 100 in this example):
<[^>]+>([^<]{100,})<[^>]+>
Notes:
I have deliberately not matched for a matching closing tag (<([^>]+)>([^<]{100,})<\1>) because of OP's sloppy HTML - a tag is a tag
I have avoided using a lookbehind ((?<=<[^>]+>)) because it would have to match text of arbitrary length, which can cause backtracking problems (some languages, like Java, do not even support that).
Scanning through the site a little, it looks like many of the descriptions fall short of 100 characters. You might try a multi-pass approach, where in the first iteration, you capture all content from the first table following 'div id="tab1"'. From that starting point, it may be easier to identify and eliminate the parts you don't want, rather than extracting the parts you do want.
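The length-threshold pattern from this answer can be tried out with a small sketch. The helper name longTextBetweenTags, the sample HTML and the threshold of 10 are illustrative assumptions; the pattern itself is the one given above, with the minimum length passed in as a parameter.

```javascript
// Hypothetical helper: returns text runs of at least `minLength` characters
// that sit directly between two tags (group 1 of the answer's pattern).
function longTextBetweenTags(html, minLength) {
  const pattern = new RegExp("<[^>]+>([^<]{" + minLength + ",})<[^>]+>", "g");
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

const html = "<p>Short</p><p>This paragraph is long enough.</p>";
console.log(longTextBetweenTags(html, 10)); // [ 'This paragraph is long enough.' ]
```

Because each match consumes its trailing tag, two long text runs separated by a single tag (for example around a <br>) would only yield the first one; replacing the trailing <[^>]+> with a lookahead would be one possible way around that.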