RegEx to grab an entire div with a specific ID? - html

I have some markup which contains the firebug hidden div.
(long story short, the YUI RTE posts content back that includes the hidden Firebug div if that is active)
So, in my posted content I have the extra div which I will remove server side in PHP:
<div firebugversion="1.5.4" style="display: none;" id="_firebugConsole"></div>
I can't seem to get a handle on the regex I would need to write to match this string, bearing in mind that it won't always be that exact string (the version may change).
All help welcome!

Regex is not the best tool for the job, but you can try:
<div firebugversion=[^>]*></div>
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
The * is the zero-or-more repetition. Thus, [^>]* matches a sequence of anything but >.
If you want to target the id specifically, you can try:
<div [^>]*\bid="_firebugConsole"[^>]*></div>
The \b is the word boundary anchor.
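As a rough illustration of the id-based pattern, here is a minimal JavaScript sketch. The helper name stripFirebugDiv and the sample string are made up for this example; the question is about PHP, where the same pattern should carry over to preg_replace.

```javascript
// Pattern from the answer above; the "g" flag removes every occurrence.
const firebugDivPattern = /<div [^>]*\bid="_firebugConsole"[^>]*><\/div>/g;

// Hypothetical helper name, not part of any library.
function stripFirebugDiv(html) {
  return html.replace(firebugDivPattern, "");
}

// The version number differs from the one in the question on purpose.
const posted =
  '<p>Hello</p><div firebugversion="1.6.0" style="display: none;" id="_firebugConsole"></div>';
console.log(stripFirebugDiv(posted)); // <p>Hello</p>
```

Because the pattern keys on the id and not on firebugversion, a change in the version string does not affect the match.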

Match this regex -
<div.*id="_firebugConsole".*?/div>

I would suggest this:
\<div firebugversion="(.+)" style="(.+)" id="(.+)"\>
Then you have three groups:
firebugversion
style
id

This one is a little more complicated, and probably not perfect, but it will:
Match any div containing the attribute firebugversion
Match the firebugversion attribute no matter which order attributes appear in the tag
Match the div, even if it contains content or spacing between it and its closing tag (I've seen the firebug tag with an &nbsp; entity inside it before). Note: it does lazy matching, so it will only match the next closing tag rather than the last one it finds in the document
<(div)\b([^>]*?)(firebugversion)([^>]*?)>(.*?)</div>
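A small JavaScript sketch of how this pattern behaves, assuming a made-up sample where the attributes appear in a different order and the div contains an &nbsp; entity. Note that . does not match newlines by default, so a div spanning several lines would not be matched as written.

```javascript
// Pattern from the answer above, written as a JavaScript regex literal.
const firebugAnyOrder = /<(div)\b([^>]*?)(firebugversion)([^>]*?)>(.*?)<\/div>/g;

// Illustrative input: id comes before firebugversion, and the div is not empty.
const sample =
  '<p>a</p><div id="_firebugConsole" firebugversion="1.5.4">&nbsp;</div><div>keep</div>';

const cleaned = sample.replace(firebugAnyOrder, "");
console.log(cleaned); // <p>a</p><div>keep</div>
```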

Related

Find and exclude html-tags as whole words in negative lookbehind with regex

I basically try to find all paragraphs (in javascript/jquery) in a text, that are not yet wrapped in a set of defined html-tags:
p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe
My current regex (https://regex101.com/r/O4i2hP/1) already matches paragraphs and excludes the defined tags
(.+?(?<![</(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)>]$))(\n|$)+/gm
but I just don't get, how to just match whole tags only.
The problem is:
(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)> matches a single character in the list (p|h123456blockquteimgafr)> (case sensitive)
Thus, as you can see from the example, code that is wrapped in tags such as <strong>TEXT</strong> is also excluded.
I tried different things such as word boundaries \bword\b, but didn't get it working. I hope you can help. Thx
This will do it.
^(?!<(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)+?>.*</\1>).*$
I now found a working approach. The tags should be wrapped in groups rather than in character classes. The following works for me:
(.+?(?<!(<\/)(p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe)(>)$))(\n|$)+/gm
see also: https://regex101.com/r/DC5msM/1
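To make the difference between the character class and the group concrete, here is a simplified JavaScript sketch. It is not the exact regex from the post: it only checks whether a line ends with one of the closing tags, and the sample text is invented.

```javascript
const blockTags = "p|h1|h2|h3|h4|h5|h6|blockquote|img|table|iframe";
const text = "<p>wrapped</p>\n<strong>TEXT</strong>\nplain text.";

// Character class: the lookbehind tests only ONE preceding character,
// so any line ending in ">" (or "p", "e", ...) is rejected.
const withCharClass = new RegExp(`^.+(?<![</(${blockTags})>])$`, "gm");

// Group: the lookbehind tests the whole closing tag.
const withGroup = new RegExp(`^.+(?<!</(?:${blockTags})>)$`, "gm");

console.log(text.match(withCharClass)); // [ 'plain text.' ]
console.log(text.match(withGroup)); // [ '<strong>TEXT</strong>', 'plain text.' ]
```

With the group, the line wrapped in <strong> is no longer excluded, while the line wrapped in <p> still is.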

How to regex match individual html tag with inner tag when multiple tags exists in text

let's say my text looks like this:
<button class="b1"
(click)="b1()">
<mat-icon>icon</mat-icon>
</button>
<button class="b1"
(click)="b1()">
<mat-icon>othericon</mat-icon>
<span>Some Text</span>
</button>
I'm trying to use regex (Rust based without lookaround... because that's what VSCode uses) to select only buttons that include the span inside them. I've tried this:
<button[\n\s\S]*?>[\n\s\S]*?span[\n\s\S]*?</button>
... but the problem is that it matches from the start of the first button in the file even if it doesn't include a span. I thought the Lazy quantifier would find the shortest match, but it doesn't seem to work that way. See my RegExr http://regexr.com/4cdra for an example. I want it to match over multiple lines which is the reason for the [\n\s\S].
<button[\n\s\S]*?>[\n\s\S]*?</button> ... doing this works well to match just the single tags... however getting it to work with inner tags is where I'm getting stuck.
Thanks!
In general, you should avoid trying to parse HTML using regex. Given that you are doing this from an IDE, you may not have any choice. One trick which can work here is to use a tempered dot to avoid parsing a closing </button> tag:
<button[^>]*>((?!</button>)[\s\S])*<span>[\s\S]*?</button>
Most of the pattern is probably familiar to you. Of note, I use [\s\S] to match across newlines. Also, consider the tempered dot trick:
((?!</button>)[\s\S])*
This uses a negative lookahead to match any character, one at a time, so long as the closing </button> tag is not encountered. This prevents the pattern from crossing tags while trying to find a <span>.
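The tempered dot can be tried against the sample from the question with a short JavaScript sketch. This assumes an engine with lookahead support, which JavaScript has; the only change to the pattern is a non-capturing group.

```javascript
// Sample markup from the question.
const markup = `<button class="b1"
(click)="b1()">
<mat-icon>icon</mat-icon>
</button>
<button class="b1"
(click)="b1()">
<mat-icon>othericon</mat-icon>
<span>Some Text</span>
</button>`;

// (?:(?!<\/button>)[\s\S])* consumes one character at a time, but never
// steps onto the start of a closing </button> tag.
const buttonWithSpan =
  /<button[^>]*>(?:(?!<\/button>)[\s\S])*<span>[\s\S]*?<\/button>/g;

const buttonsWithSpan = markup.match(buttonWithSpan) || [];
console.log(buttonsWithSpan.length); // 1
console.log(buttonsWithSpan[0].includes("othericon")); // true
```

Only the second button is returned, because the first one reaches its closing tag before any <span> is found.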

RegExp to search text inside HTML tags

I'm having some difficulty using a RegExp to search for text between HTML tags. This is for a search function that searches text on an HTML page without matching characters inside the tags or attributes of the HTML. When a match is found, I surround it with a div and assign it a highlight class to highlight the search words in the HTML page. If the RegExp also matches inside tags or attributes, the HTML code becomes corrupt.
Here is the HTML code:
<html>
<span>assigned</span>
<span>Assigned > to</span>
<span>assigned > to</span>
<div>ticket assigned to</div>
<div id="assigned" class="assignedClass">Ticket being assigned to</div>
</html>
and the current RegExp I've come up with is:
(?<=(>))assigned(?!\<)(?!>)/gi
which matches if assigned or Assigned is the start of text in a tag, but not on the others. It does a good job of ignoring the attributes and tags but it is not working well if the text does not start with the search string.
Can anyone help me out here? I've been working on this for an hour now but can't find a solution (RegExp noob here..)
UPDATE 2
https://regex101.com/r/ZwXr4Y/1 show the remaining problem regarding HTML entities and HTML comments.
When searching, the remaining problem is that HTML entities are not ignored; all text inside HTML entities and comments should be ignored. So when searching for "b", an entity (for example &nbsp;) should not match even if it is correctly placed between HTML tags.
Update #2
Regex:
(<)(script[^>]*>[^<]*(?:<(?!\/script>)[^<]*)*<\/script>|\/?\b[^<>]+>|!(?:--\s*(?:(?:\[if\s*!IE]>\s*-->)?[^-]*(?:-(?!->)-*[^-]*)*)--|\[CDATA[^\]]*(?:](?!]>)[^\]]*)*]])>)|(e)
Usage:
html.replace(/.../g, function(match, p1, p2, p3) {
return p3 ? "<div class=\"highlight\">" + p3 + "</div>" : match;
})
Explanation:
As you worked through more situations, I had to modify the RegEx to cover more possible cases. This version covers almost all of them. How it works:
Captures all <script> tags and their contents
Captures all CDATA blocks
Captures all HTML tags (opening / closing)
Captures all HTML comments (as well as IE conditional statements)
Captures all targeted strings, defined in the last group, inside the remaining text (here it is (e))
Doing so lets us quickly manipulate our target, e.g. wrap it in tags as shown in the usage section. Performance-wise, I tried to write it so that it performs well.
This RegEx doesn't guarantee correct matches in 100% of cases, but it should give the expected results most of the time and can easily be modified later.
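The core idea (match the things to skip first, capture the target last, and only rewrite when the capture group is set) can be sketched in a much reduced form. This simplification is an assumption for illustration only: it skips plain tags but does not handle <script> contents, comments, CDATA, or entities the way the full pattern does, and the helper names are hypothetical.

```javascript
// Escape regex metacharacters so the search term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: tags are matched first and returned unchanged;
// only matches of the capture group (the search term) are wrapped.
function highlight(html, term) {
  const pattern = new RegExp("<[^>]*>|(" + escapeRegExp(term) + ")", "gi");
  return html.replace(pattern, (match, target) =>
    target ? '<div class="highlight">' + target + "</div>" : match
  );
}

const row = '<div id="assigned" class="assignedClass">Ticket being assigned to</div>';
console.log(highlight(row, "assigned"));
// <div id="assigned" class="assignedClass">Ticket being <div class="highlight">assigned</div> to</div>
```

The id and class values are left alone because the tag alternative consumes them before the search term gets a chance to match.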
try this
string.match(/<.{1,15}>(.*?)<\/.{1,15}>/g)
This means that <.{1,15}>(.*?)</.{1,15}> targets anything between an opening and a closing HTML tag:
<any> Content </any>
For example, given
<div> this is the content </div>
the capture group holds "this is the content" (plus the surrounding spaces).

RegEx match content inside div with specific class

How can I match the contents of all divs that have a specific class. For example:
<div class="column-box-description paddingT05">content</div>
Generally, you shouldn't do this with regex unless you can make strong assumptions about the text you're matching; you'll do better with something that actually parses HTML.
But, if you can make these stronger assumptions, you can use:
<div class="[^"]*?paddingT05[^"]*?">(.*?)<\/div>
The key part is the reluctant quantifier *?, which matches the minimal text possible (i.e. it doesn't greedily eat up the </div>).
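A minimal JavaScript sketch of this pattern, assuming the divs are not nested and each sits on a single line (the sample markup beyond the first div is invented):

```javascript
// Pattern from the answer above, with the g flag to find every matching div.
const paddedDiv = /<div class="[^"]*?paddingT05[^"]*?">(.*?)<\/div>/g;

const page =
  '<div class="column-box-description paddingT05">one</div>' +
  '<div class="other">skip</div>' +
  '<div class="paddingT05 wide">two</div>';

// matchAll exposes the capture group of each match.
const contents = Array.from(page.matchAll(paddedDiv), (m) => m[1]);
console.log(contents); // [ 'one', 'two' ]
```

The reluctant (.*?) stops at the first </div>, so "one" and "two" come back as separate results.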
You can do something like this:
<div.*class\s*=\s*["'].*the_class_you_require_here.*["']\s*>(.*)<\/div>
Replace "the_class_you_require_here" with a class name of your choosing. The div content is in the first group resulting from this expression. You can read about groups here: http://www.regular-expressions.info/brackets.html

In page anchors with spaces in the title

I have a page with some <h1 name="Header Name"> tags. Notice that the name attribute has a space in it. I want to use an anchor so that I can jump to the heading. I thought that if I added %20 to replace the spaces it would work, but no.
I am currently swamped, so would prefer not to edit the source and redeploy with each header having an ID or title with hyphens instead of spaces.
<!-- non-working example of what I want -->
<a href="mypage.html#Header%20Name" />
<h1 title="Header Name">Foo</h1>
I read through the spec here, but couldn't find an answer. I also could not find an answer via Google or SO.
http://www.w3.org/TR/html4/struct/links.html
Is there a way to do this?
That's not how anchor tags work. You can jump to a name of an anchor tag, or an id of any element. You can never jump to a title attribute, unfortunately. You would need to modify the html, which you don't want to do, or include javascript to accomplish what you are asking.
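If JavaScript is an option, one possible sketch looks like this. It assumes the headings carry the text in a title attribute, as in the example from the question; findHeadingByHash is a made-up helper name, and the browser wiring is guarded so the lookup function can also be exercised outside a browser.

```javascript
// Hypothetical helper: pick the heading whose title equals the decoded fragment.
function findHeadingByHash(hash, headings) {
  const wanted = decodeURIComponent(hash.replace(/^#/, ""));
  return headings.find((h) => h.getAttribute("title") === wanted) || null;
}

// Browser-only wiring; skipped when there is no DOM.
if (typeof document !== "undefined") {
  const jump = () => {
    const headings = Array.from(document.querySelectorAll("h1[title]"));
    const target = findHeadingByHash(location.hash, headings);
    if (target) target.scrollIntoView();
  };
  window.addEventListener("hashchange", jump);
  window.addEventListener("DOMContentLoaded", jump);
}
```

With this in place, a link to mypage.html#Header%20Name would scroll to the heading whose title is "Header Name", without changing the markup of the headings.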
Edit: Standards have always said id attributes cannot include spaces, including HTML5, which says this:
HTML5 gets rid of the additional restrictions on the id attribute. The
only requirements left — apart from being unique in the document — are
that the value must contain at least one character (can’t be empty),
and that it can’t contain any space characters.
Here's the W3C spec https://www.w3.org/TR/REC-html40-971218/types.html#h-6.2
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").
No spaces.
Anchors point to name or id attributes. It won't find your title attribute.
Use <h1 id="Header Name"> and it should work, but ideally you shouldn't have spaces there.
The destination anchor should be defined by an id attribute, and it must not contain a space. (HTML5 removes all other constraints but disallows a space and an empty value. Older HTML versions are much more restrictive.) So:
<h1 id="HeaderName">Foo</h1>
Alternatively, you could use an a element with name attribute, but this is clumsy and outdated (and forbidden in HTML5, though it still works, properly used):
<h1><a name="HeaderName">Foo</a></h1>
You can use the name attribute on a few elements only, not e.g. in h1, and only in a elements does it have the meaning of acting as a destination anchor.
The title attribute can be used on any element (at least as per HTML5), but it is merely an advisory title and has nothing to do with linking.
The construct <a href="mypage.html#Header%20Name" /> denotes an empty a element, a usability nightmare, and does not actually work (it is taken just as a start tag of an element), except in XHTML when served with an XML document type, which you are most probably not doing. So just don’t use the “self-closing” syntax except for elements with empty declared content, if at all.
There /is/ a way to have an anchor with a space in its name without violating the HTML specification. Use <a name="my%20name%20with%20spaces"></a>. The specification makes it clear that you can use percent-encoding when using <a> with the name attribute. However, when using id, you should not percent-encode the attribute, and since spaces are disallowed in the id attribute, the only option is to use <a name=...> instead.
More information: https://blog.markasoftware.com/post/spaces-in-anchor-names/