nested html regex crashes vscode - html

I am using the following regex in VS Code for a search and replace. It is meant to match an outer tag plus 3 nested tags.
<tag>(((.|\n)*?)(</tag>)){4}
If I add any character to the end of this regex, VS Code crashes. In my case I was going to specify a tag after that match.
I'm pretty new to regex, so I'm trying to keep it simple.
I know this is a common problem that has something to do with backtracking, and I want to know how to simplify this.

NEVER use (.|\n)*?. It is an unfortunate but widely known pattern that causes so much backtracking that, when the text is long and specific enough, it often ends in catastrophic backtracking, as it does here.
Note that even [\w\W]*? (or [\s\S\r]*?, see Multi-line regular expressions in Visual Studio Code) might already suffice here. It still involves quite a lot of backtracking, but it is much more efficient.
What can usually be used is an unrolled pattern, like
<tag>(?:[^<\r]*(?:<(?!/tag>)[^<]*)*</tag>){4}
Instead of (.|\n)*?, a series of subpatterns is used, arranged so that each one can only match at positions the others cannot.
Details
<tag> - a literal string
(?:[^<\r]*(?:<(?!/tag>)[^<]*)*</tag>){4} - four repetitions of
[^<\r]* - 0 or more chars other than <, including line break chars. In VS Code regex, the \r ensures this: it lets every character class in the pattern that can match newlines actually match them, so \r does not need to be repeated in the next character class.
(?:<(?!/tag>)[^<]*)* - 0 or more repetitions of a < not followed with /tag> and then 0 or more chars other than <.
</tag> - a literal </tag> string.
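To try the unrolled pattern outside VS Code, here is a minimal Python sketch. It assumes Python's re module, where a negated class such as [^<] already matches line breaks, so the VS Code-specific \r is left out. The sample text is made up for illustration.

```python
import re

# Unrolled pattern from the answer, minus the VS Code-only \r trick:
# in Python, [^<] already matches newlines.
UNROLLED = re.compile(r"<tag>(?:[^<]*(?:<(?!/tag>)[^<]*)*</tag>){4}")

# Hypothetical sample: an outer <tag> containing three nested <tag> elements.
sample = "<tag>a\n<tag>b</tag>\n<tag>c</tag>\n<tag>d</tag>\ne</tag>"

match = UNROLLED.search(sample)
matched_text = match.group(0) if match else None
print(matched_text == sample)
```

Each repetition consumes text up to and including one </tag>, so four repetitions cover the three nested closing tags plus the outer one.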
Having said that, you might also be interested in the Emmet: Balance outward command:
A well-known tag balancing: searches for tag or tag's content bounds
from current caret position and selects it. It will expand (outward
balancing) or shrink (inward balancing) selection when called multiple
times. Not every editor supports both inward and outward balancing due
of some implementation issues, most editors have outward balancing
only.
Emmet’s tag balancing is quite unique. Unlike other implementation,
this one will search tag bounds from caret’s position, not the start
of the document. It means you can use tag balancer even in non-HTML
documents.

Related

HTML Minification: Whitespace between element attributes

I'd like to remove more unnecessary bytes from my output, and it seems it's acceptable (in practice) to strip what can add up to quite a lot of whitespace from HTML markup by omitting/collapsing the gaps between DOM element attributes.
Although I've tested and researched (a little in both cases), I'm wondering how safe it would be?
I tested in Chrome (43.0.2357.65 m), IE (11.0.9600.17801), FF (38.0.1) and Safari (5.1.7 (blah-di-blah)) and they didn't seem to mind, and couldn't find anything specific in The Specs about whitespace between attributes.
w3.org's Validator complains, which is a strong indication that this is not safe and shouldn't be expected to work, but (there's always a "but") it's possible the requirement for a space is only strict when no quotes are present (for obvious reasons).
Also (snippy but poignant): their SSL is "out of date" which doesn't inspire confidence in their opinion.
I noted also that someone's HTML compressor could (when enabled) strip quotes around attribute values where those values had no whitespace within them (e.g. id), which implies that at least most if not all HTML parsing is focussed on the text either side of the equals signs (except with booleans of course), and where quotes are in use, they'd be considered the prioritized delimiter.
So, would:
<!DOCTYPE html><html><body>
Yabba Dabba Doo!
</body></html>
▲ that ever go wrong, and if so, under which conditions?
What other reasons could there be to maintain this whitespace in production output (code "readability" is a non issue in this case)?
Update (since finding an answer):
Although I basically answered my own question insofar as there is a specification governing whether there should be a space between attributes, I still wonder if omitting the spaces when using quoted values can be considered practically safe, and would appreciate feedback on this point.
Considering how often spaces may be omitted by accident in production HTML, and that the browsers I tested don't seem to mind when they are, I assume it would be very rare if ever that a browser failed to handle documents with these spaces omitted.
Although it's sensible to follow the specs in pretty much all situations, might this be one time cheating a bit could be acceptable?
After all - if we can magically save several hundred bytes without affecting the quality of the output, why not?
There is a specification (after all)
It turns out I should have looked harder. My bad.
According to these specs:
If an attribute using the empty attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
and
If an attribute using the unquoted attribute syntax is to be followed by another attribute or by the optional U+002F SOLIDUS character (/) allowed in step 6 of the start tag syntax above, then there must be a space character separating the two.
and
If an attribute using the single-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
and
If an attribute using the double-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Which unless I am mistaken (again), means there must always be spaces between attributes.
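If you want to check minified output against that rule, a rough Python sketch could look like the following. The function name and the simplified attribute grammar are my own assumptions, not anything taken from the spec; it walks a single start tag attribute by attribute and reports every attribute that is not preceded by whitespace.

```python
import re

# One attribute: a name, optionally followed by = and a quoted or unquoted value.
# This is a simplified grammar for illustration, not the full HTML syntax.
ATTR = re.compile(r"""([^\s"'<>/=]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'=<>`]+))?""")
TAG_NAME = re.compile(r"<[A-Za-z][^\s/>]*")
SPACES = re.compile(r"\s*")

def attributes_missing_space(start_tag):
    """Return the names of attributes that are not preceded by whitespace."""
    name = TAG_NAME.match(start_tag)
    if name is None:
        return []
    pos = name.end()
    offenders = []
    while True:
        gap = SPACES.match(start_tag, pos)
        attr = ATTR.match(start_tag, gap.end())
        if attr is None:
            break
        if gap.end() == pos:  # no whitespace before this attribute
            offenders.append(attr.group(1))
        pos = attr.end()
    return offenders

print(attributes_missing_space('<div id="a"class="b">'))
```

It only tells you where the markup departs from the spec; it says nothing about whether a given browser will cope.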
You could try online HTML minifiers like http://www.whak.ca/minify/HTML.htm or http://www.scriptcompress.com/minify-HTML.htm (search google for more) and find little things they change for hints to what can be taken out yet still render the HTML code.
On the first link your code:
<!DOCTYPE html><html><body>
Yabba Dabba Doo!
</body></html>
Turns into:
<!DOCTYPE html><html><body>Yabba Dabba Doo!
saving you 18 bytes already...

How do you replace the content of html tags in vim?

For instance, how would I replace <person>Nancy</person> with <person>Henry</person> for all occurrences of <person>*</person> in Vim?
Currently, I have:
%s:/'<person>*<\/person>/<person>Henry<\/person>
But obviously, this is wrong.
For a single substitution, Vim offers the handy cit (change inner tag) command.
For a global substitution, the answer depends on how well-structured your tag (soup) is. HTML / XML have a quite flexible syntax, so you can express the same content in various ways, and it becomes increasingly harder to construct a regular expression that matches them all. (Attempting to catch all cases is futile; see this famous answer.)
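:%s/\v(\<person\>).{-}(\<\/person\>)/\1Henry\2/g
does what you want but yeah, what Ingo said.
\v means "very magic": it's a convenient way to avoid backslashitis for grouping and quantifiers. The catch is that under \v a bare < or > is a word boundary and a bare { starts a quantifier, so the literal angle brackets are escaped here and the non-greedy quantifier is written {-}.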
(something) (or \(something\) without the \v modifier) is a sub-expression. You can have up to nine of them in your search pattern and reuse those capture groups with \1...\9 in your replacement pattern, or even later in the search pattern itself. \0 represents the whole match. Here, the opening tag is referenced as \1 and the closing tag as \2.
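For comparison, the same capture-group idea outside Vim might look like this in Python. This is only a sketch with made-up sample text; .*? plays the role of Vim's non-greedy {-}.

```python
import re

# Hypothetical sample text with two <person> elements.
text = "<person>Nancy</person> and <person>Bob</person>"

# \1 and \2 re-insert the captured opening and closing tags;
# .*? is the non-greedy counterpart of Vim's {-}.
result = re.sub(r"(<person>).*?(</person>)", r"\1Henry\2", text)
print(result)
```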

Looking for good bracket characters for a template engines code blocks

I am looking for a good character pair to use for enclosing template code within a template for the next version of our inhouse template engine.
The current one uses plain {} but this makes the parser very complex to be able to distinguish between real code blocks and random {} chars in the literal text in the template.
I think a dual-char combination like the ones used in ASP.NET or PHP is a better approach, but the question is which character pair I should use, or whether there is some perfect single char that is never used and that's easy to write.
Some criteria that need to be fulfilled:
Cannot be changed by HTMLEncode, the sources will be editable through webbased HTML editors and plain textareas and need to stay the same no matter what editor is used.
Regex will be used to clean code parts after editing in an HTML editor that might have encoded the internal part of the code block like & chars.
Should be reasonably easy to write on both English and Swedish keyboard layouts.
Should be a very rare combination, the template will generate HTML and Text and could include CSS and Javascript literal text with JSON, so any combination that might collide with those is bad unless very rare. That means that {{}} is out as it can occur in JSON.
The code within the code block will contain spaces, underscores, dollar and many more combinations, not only fieldnames but if/while constructs as well.
The parser is generated with Antlr.
I am looking for suggestions and objections to find one or more combinations that would work in as many situations as possible, possibly multiple alternative pairs for different situations.
Template-Toolkit defaults to [% template directives %], which works reasonably well.
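As a rough illustration of the "clean up after an HTML editor" requirement with this delimiter pair, here is a Python sketch. The function name and the sample template are assumptions of mine; it undoes entity encoding inside [% ... %] blocks only and leaves the literal text alone.

```python
import html
import re

# [% ... %] directive blocks, Template-Toolkit style; re.S lets a block span lines.
DIRECTIVE = re.compile(r"\[%(.*?)%\]", re.S)

def unescape_directives(source):
    """Undo HTML entity encoding inside directive blocks only."""
    return DIRECTIVE.sub(
        lambda m: "[%" + html.unescape(m.group(1)) + "%]", source
    )

# Hypothetical template after an HTML editor has encoded & and <.
edited = "<p>Tom &amp; Jerry</p>[% IF a &amp;&amp; b &lt; 3 %]yes[% END %]"
cleaned = unescape_directives(edited)
print(cleaned)
```

Neither [ nor % is touched by HTML encoding, which is what keeps the delimiters themselves stable across editors.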

Is it actually possible to parse freeform HTML with a regular expression?

Now, before you prepare to write a speech about the perils of parsing HTML with regex: I already know them. This is more a question of curiosity than something I want answered for practical use.
Basically, given a file of HTML in some random, but perfectly valid format, can you parse out the content of <p> tags using a half-sane number of regular expressions? (and also pretending that <p> tags can not be nested or some other minor limitation)
It's certainly possible to extract all the text between {insert character sequence 1 here} and {insert character sequence 2 here} with regular expressions, so long as those sequences aren't overlapping. For example:
/(?<={insert character sequence 1 here}).*?(?={insert character sequence 2 here})/
Of course, it's terribly brittle and will break horribly if what you're running it on is even slightly malformed, or contains either character sequence outside the context where it's meaningful, or any number of other ways. If you oversimplify the problem, then yes you can get away with an oversimplified solution.
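A concrete version of that pattern for <p> tags, as a Python sketch under the question's own simplifications (no nesting, and the sample markup is made up):

```python
import re

# Lookbehind and lookahead keep the delimiters out of the match;
# re.S lets . cross line breaks.
P_CONTENT = re.compile(r"(?<=<p>).*?(?=</p>)", re.S)

# Hypothetical, well-formed input.
page = "<body><p>first</p><div><p>second\nline</p></div></body>"
paragraphs = P_CONTENT.findall(page)
print(paragraphs)
```

It already shows the brittleness: a tag written as <p class="x"> would be missed, because the lookbehind only matches the bare <p>.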
Yes, under restrictions like valid HTML and non-nesting, you can use regular expressions for certain uses.
It depends on what limitations you'd consider minor. XHTML, for one obvious example, is somewhat more amenable to simple parsing. A great deal depends on whether you're thinking in terms of parsing existing HTML, or generating new HTML that could be parsed relatively easily. For the former case, I'd say the restrictions were major -- i.e., you'd need to know a great deal about the specific HTML in question to parse it. For the latter case, I'd say the restrictions were fairly trivial -- i.e., would only involve how you write the HTML, but would not affect what you could express in HTML.

Regex to match attributes in HTML?

I have a txt file which actually is a html source of some webpage.
Inside that txt file there are various strings preceded by a "title=" tag.
e.g.
<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>
I am interested in getting the text Connectivity Framework extracted and written to a separate file.
There are many such tags, each with different text after title=, as in title='some text here which i need to extract '
I want to extract all such instances of the text from the html source/txt file and write them to a separate txt file. The text can contain lower case letters, upper case letters and numbers only. The length of each text string (in characters) will vary.
I am using PowerGrep for Windows. PowerGrep allows me to search a text file with regular expression input.
I tried using the search as
title='[a-zA-Z0-9]
It shows the correct matches, but it matches only the first character of each string and writes only that first character to the second txt file, not the whole string.
I want the whole string to be matched and written to the second file.
What is the correct regular expression, or the right way to do this, using PowerGrep?
-AD.
I'm just not sure how many times the question of regular expression parsing of HTML files has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.
The difficulties are:
In HTML attributes can have single-quotes, double-quotes or even no quotes;
Similar strings can appear in the HTML document itself;
You have to handle correct escaping; and
Malformed HTML (decent parsers are extremely robust to common errors).
So if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100%.
HTML parsers exist for a reason. Use them.
I'm not familiar with PowerGrep, however, your regex is incomplete. Try this:
title='[a-zA-Z0-9 ]*'
or better yet:
title='([^']*)'
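I can't test in PowerGrep, but the difference between the original pattern and this one is easy to see in Python (a sketch using the line from the question):

```python
import re

line = "<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>"

# No repetition: the class matches exactly one character after the quote.
one_char = re.findall(r"title='[a-zA-Z0-9]", line)

# Negated class with * plus a capture group: everything up to the closing quote.
values = re.findall(r"title='([^']*)'", line)

print(one_char)
print(values)
```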
The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.
The square brackets indicate a character class - meaning that the regex will match any character within those brackets. However, like everything else, it will only match it once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".
By adding repetition, one can get that character class to match repeatedly. The easiest way to do this is by adding an asterisk after it (which will match 0 or more occurrences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row as it can until it hits a character that is not in that character class (in your case, the space character, since you didn't include that in your brackets).
Regexes though can be pretty complex to describe the syntax accurately - what if someone put a non-alphanumeric character such as an ampersand within the attribute? You could try to capture all input between the quotes by making the character set "anything except a quote character", so "'[^']*'" would usually do the right thing. Often you need to bear in mind escaping as well (e.g. with a string 'Mary\'s lamb' you do actually want to capture the apostrophe in the middle so a simple "everything but apostrophes" character set won't cut it) though thankfully this is not an issue with XML/HTML according to the specs.
Still, if there is an existing library available that will do the extraction for you, this is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.
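As one example of the library route, here is a minimal sketch using Python's standard html.parser module. The class and function names are made up for illustration, and PowerGrep itself is not involved.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collects the value of every title attribute it sees."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)

def extract_titles(markup):
    collector = TitleCollector()
    collector.feed(markup)
    collector.close()
    return collector.titles

html_source = (
    "<div id='UWTDivDomains_5_6_2_2' title='Connectivity Framework'>"
    '<span title="Other">x</span></div>'
)
print(extract_titles(html_source))
```

The parser deals with single quotes, double quotes and unquoted values for you, which are exactly the cases the regex answers have to enumerate by hand.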
I would use this regular expression to get the title attribute values
<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)
Note that this regex matches the attribute value expression with quotes. So you have to remove them if needed.
Here's the regex you need
title='([a-zA-Z0-9]+)'
but if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.
Try this instead:
title=\'[a-zA-Z0-9]*\'