PowerShell Regex - match a string that does not include a different string

The goal is to prepare an HTML file to be transformed to Markdown using PowerShell.
The PowerShell script includes these lines:
-replace '<pre.*?>(.*?)</pre>', '`$1`'`
-replace '<code.*?>(.*?)</code>', '`<b>$1</b>`'`
Sometimes the HTML includes text <pre><code>text</code></pre> text. Sometimes it only includes text <code>/text</code> text.
Because Markdown interprets text surrounded by single backticks (`) to be "code" for stylistic purposes, I want the PowerShell search/replace to:
If <pre>...</pre> is present, replace <pre>...</pre> with backticks, not <code>...</code>.
If <pre>...</pre> is absent, replace <code>...</code> with backticks.
(If I am going about it all wrong, I would be grateful to know.)
I am going in the wrong direction, because no Regex I have tried is working.
^(?!.*?[</pre>]).*$<code.*?>(.*?)</code> (no matches)
^((?!</pre>$).)*<code.*?>(.*?)</code> (matches even when </pre> is present)
^(?!</pre>$).*<code.*?>(.*?)</code> (matches even when </pre> is present)
Etc.
Can anyone point me in the right direction? Thank you for any help.
(I know there are tools that transform HTML to Markdown automatically and am using one - this is just a unique preparatory step based on irregularities in our specific output.)

@'
...
... <pre><code>bingo</code></pre> ...
... <code>bongo</code> ...
...
'@ -replace '(?s)(?:(?:<pre>\s*)?<code>)(.*?)(?:</code>(?:\s*</pre>)?)', '`$1`'
Note: For brevity and simplicity, I'm assuming that the opening <pre> and <code> tags contain neither attributes nor whitespace before their closing >, and, similarly, that the closing tags contain no whitespace before their closing >. It is variability like this that makes it generally preferable to use a dedicated HTML parser rather than regular expressions.
The above yields:
...
... `bingo` ...
... `bongo` ...
...
(?s) is the SingleLine inline regex option that makes . match newlines too (in case the value to enclose in `...` spans multiple lines - though note that in later Markdown rendering those newlines may be lost).
(?:...) constructs are non-capturing subexpressions, useful for subexpressions that are needed for logical reasons, without needing what they match to be referenced later.
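For readers who want to try the same idea outside PowerShell, here is a minimal Python sketch of the pattern above. The function name and the sample string are made up for illustration, and it relies on the same assumption as the note above: the tags carry no attributes or extra whitespace.

```python
import re

# Same pattern as the PowerShell answer: an optional <pre> wrapper
# around <code>...</code>, with the inner text captured in group 1.
PATTERN = re.compile(r"(?s)(?:<pre>\s*)?<code>(.*?)</code>(?:\s*</pre>)?")


def code_to_backticks(html: str) -> str:
    """Replace <code>...</code>, with or without a surrounding <pre>, by `...`."""
    return PATTERN.sub(r"`\1`", html)


sample = "... <pre><code>bingo</code></pre> ...\n... <code>bongo</code> ..."
print(code_to_backticks(sample))
```

Because the <pre> wrapper is optional on both sides, a single pass covers both cases and no negative lookahead is needed.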

Related

How do I type html in a markdown file without it rendering?

I want to type the following sentence in a markdown file: she says <h1> is large. I can do it in StackOverflow with three backticks around h1, but this doesn't work for a .md file. I've also tried a single backtick, single quote, double quote, hashtags, spacing, <code>h1</code> and everything else I could think of. Is there a way to do this?

You can escape the < characters by replacing them with &lt;, which is the HTML escape sequence for <. Your sentence would then be:
she says &lt;h1> is large
As a side note, the original Markdown "spec" has the following to say:
However, inside Markdown code spans and blocks, angle brackets and ampersands are always encoded automatically. This makes it easy to use Markdown to write about HTML code. (As opposed to raw HTML, which is a terrible format for writing about HTML syntax, because every single < and & in your example code needs to be escaped.)
...which means that, if you're still getting tags when putting them in backticks, whatever renderer you're using isn't "compliant" (to the extent that one can be compliant with that document), and you might want to file a bug.
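If the text is generated by a script rather than typed by hand, the escaping can be automated. The following is a small Python sketch using the standard library's html.escape; the variable names are illustrative only.

```python
import html

sentence = "she says <h1> is large"

# quote=False leaves quotation marks alone; <, > and & are still escaped.
escaped = html.escape(sentence, quote=False)
print(escaped)
```

This escapes both angle brackets, which is slightly more than the answer above does by hand, but it renders the same way.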

Generally, you can surround the code in single backticks to automatically escape the characters. Otherwise, just use the HTML escapes &lt; for < and &gt; for >.
i.e.
she says &lt;h1&gt; is large or she says `<h1>` is large

A backslash (\) can be used to escape < and >.
Ex: she says \<h1\> is large

Regex that matches any HTML tag with the content inside

I'd like to use Regex to match HTML tag "head" and text inside them so I can delete them easily. I'm using a find and replace tool that is utilizing regex syntax and it really works great in replacing multiple files at once.
I tried doing a lot of syntax but I always fail.
http://regex101.com/r/aZ6pN5/2
Can anyone help, please?

Replace .* in your regex with [\S\s]*? so that it also matches line breaks. You can't use the s (DOTALL) modifier in JavaScript engines that predate ES2018.
<head.*?>([\s\S]*?)<\/head>
[\s\S]*? does a non-greedy match of zero or more whitespace or non-whitespace characters, that is, any characters including line breaks.
OR
To replace the contents of head tag.
(<head\b[^<>]*>)[\s\S]*?(<\/head>)
Replacement string:
$1stringyouwant$2
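As a quick check of the second pattern, here is a Python sketch. The sample page is invented, and \g<1> is simply Python's spelling of $1 in a replacement string.

```python
import re

# Keep the opening and closing head tags (groups 1 and 2) and
# replace everything between them, line breaks included.
HEAD_BODY = re.compile(r"(<head\b[^<>]*>)[\s\S]*?(</head>)")

page = '<html><head id="h">\n<title>Old</title>\n</head><body>Hi</body></html>'
result = HEAD_BODY.sub(r"\g<1>stringyouwant\g<2>", page)
print(result)
```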

Sublime Text regex to find and replace whitespace between two xml or html tags?

I'm using Sublime Text and I need to come up with a regex that will find the whitespaces between a certain opening and closing tag and replace them with commas.
Example: Replace white space in
<tags>This is an example</tags>
so it becomes
<tags>This,is,an,example</tags>
Thanks!

You have just to use a simple regex like:
\s+
And replace it with a comma.

This will find instances of
<tags>...</tags>
with whitespace between the tags
(<tags>\S+)\W(.+</tags>)
This will replace the first whitespace with a comma
\1,\2
Open Find and Replace [OS X Cmd+Opt+F :: Windows Ctrl+H]
Use the two values above to find and replace and use the 'Replace All' option. Repeat until all the whitespaces are converted to commas.
The best answer is probably a quick script but this will get you there fairly fast without needing to do any coding.

You can replace any one or more whitespace chunks in between two tags using a single regular expression:
(?s)(?:\G(?!\A)|<tags>(?=.*?</tags>))(?:(?!</?tags>).)*?\K\s+
Details:
(?s) - a DOTALL inline modifier, makes . match line breaks
(?:\G(?!\A)|<tags>(?=.*?</tags>)) - either the end of the previous successful match (\G(?!\A)) or (|) <tags> substring that is immediately followed with any zero or more chars, as few as possible and then </tags> (see (?=.*?</tags>))
(?:(?!</?tags>).)*? - any char that does not start a <tags> or </tags> substring, zero or more occurrences but as few as possible
\K - match reset operator
\s+ - one or more whitespaces (NOTE: use \s if each whitespace must be replaced).
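Python's built-in re module supports neither \G nor \K, so the pattern above cannot be used there as-is. A different technique gives the same result in a script: match each whole <tags>...</tags> block and rewrite its contents in a callback. This is only a sketch; the function name and the tag parameter are hypothetical.

```python
import re


def commas_between_tags(text: str, tag: str = "tags") -> str:
    """Replace each run of whitespace inside <tag>...</tag> with a comma."""
    name = re.escape(tag)
    block = re.compile(rf"(<{name}>)(.*?)(</{name}>)", re.DOTALL)

    def fix(match: re.Match) -> str:
        # Only the text between the tags (group 2) is rewritten.
        inner = re.sub(r"\s+", ",", match.group(2))
        return match.group(1) + inner + match.group(3)

    return block.sub(fix, text)


print(commas_between_tags("a b <tags>This is an example</tags> c d"))
```

Whitespace outside the tags is left untouched, which is the point of restricting the replacement to the captured block.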

Regular expression to match empty HTML tags that may contain embedded JSTL?

I'm trying to construct a regular expression to look for empty html tags that may have embedded JSTL. I'm using Perl for my matching.
So far I can match any empty html tag that does not contain JSTL with the following:
/<\w+\b(?!:)[^<]*?>\s*<\/\w+/si
The \b(?!:) will avoid matching an opening JSTL tag but that doesn't address whether JSTL may be within the HTML tag itself (which is allowable). I only want to know if this HTML tag has no children (only whitespace or empty). So I'm looking for a pattern that would match both the following:
<div id="my-id">
</div>
<div class="<c:out var="${my.property}" />"></div>
Currently the first div matches. The second does not. Is it doable? I tried several variations using lookahead assertions, and I'm starting to think it's not. However, I can't say for certain or articulate why it's not.
Edit: I'm not writing something to interpret the code, and I'm not interested in using a parser. I'm writing a script to point out potential issues/oversights. And at this point, I'm curious, too, to see if there is something clever with lookaheads or lookbehinds that I may be missing. If it bothers you that I'm trying to "solve" a problem this way, don't think of it as looking for a solution. To me it's more of a challenge now, and an opportunity to learn more about regular expressions.
Also, if it helps, you can assume that the html is xhtml strict.

Try
<(\w+)(?:\s+\w+="[^"]+(?:"\$[^"]+"[^"]+)?")*>\s*</\1>
A short explanation:
< # match a '<'
(\w+) # match one or more a-z, A-Z, 0-9 or '_' and store it in group 1
(?: # open non-matching-group 1
\s+ # match one or more white space characters
\w+ # match one or more a-z, A-Z, 0-9 or '_'
=" # match '="'
[^"]+ # match one or more characters other than '"'
(?: # open non-matching-group 2
"\$ # match '"$'
[^"]+ # match one or more characters other than '"'
" # match '"'
[^"]+ # match one or more characters other than '"'
)? # close non-matching-group 2, and make it optional
" # match '"'
)* # close non-matching-group 1, and make repeat itself zero or more times
> # match '>'
\s* # match zero or more white space characters
</\1> # match '</X>' where `X` is what is captured in group 1
This works for both your examples, but I am sure someone can construct HTML that you want to match but that will not be matched by the regex. But after reading your 'edit', it seems you are aware of that.
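The pattern is not Perl-specific, so it can be checked against the two examples from the question in other engines as well. Here is a Python sketch of such a check; the third, non-empty sample is an added assumption to show a non-match.

```python
import re

# The pattern from the answer above, unchanged.
EMPTY_TAG = re.compile(r'<(\w+)(?:\s+\w+="[^"]+(?:"\$[^"]+"[^"]+)?")*>\s*</\1>')

plain_empty = '<div id="my-id">\n</div>'
jstl_empty = '<div class="<c:out var="${my.property}" />"></div>'
not_empty = '<div id="x">text</div>'  # added sample, not from the question

for sample in (plain_empty, jstl_empty, not_empty):
    print(bool(EMPTY_TAG.search(sample)), sample.replace("\n", "\\n"))
```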

It's not a good idea to use regexes for HTML as there are many constructs that cannot be matched by most regex systems. Also much HTML (as opposed to XHTML) has many difficult constructs. Suggest you use an HTML parser. (This has been frequently addressed on SO and the universal answer is don't use regex.)

Using an HTML parser doesn't mean you're interpreting or running the content: it means you are transforming it from a string of characters into a nested object. HTML is not regular, so regular expressions aren't the best solution to this problem.
See the docs for HTML::TreeBuilder as a good place to start. Other good resources include HTML::Parser and of course this site. :)
Edit: I'll pretend that your question has nothing to do with HTML and is just an interesting regex puzzle, and as such will ponder it... ...[still thinking.. edit coming] (puzzle abandoned in the face of a really awesome solution presented above)

If you assume that your input is valid XML, as you say, my tool of choice would be XML::Twig.

Based on what I have read, I believe (?: is a non-capturing group, not a non-matching group, so the comments on the regex should be changed.
A non-matching group would be (?!

Regex to match all HTML tags except <p> and </p>

I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\??(?!p).+?>
But this still matches with the closing </p> tag. Any hint on how to match with the closing tag as well?
Note, this is being performed on xhtml.

If you insist on using a regex, something like this will work in most cases:
# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
s{
< # opening angled bracket
(?>/?) # ratchet past optional /
(?:
[^pP] # non-p tag
| # ...or...
[pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
)
[^>]* # everything until closing angled bracket
> # closing angled bracket
}{}gx; # replace with nothing, globally
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
use strict;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new('/some/file.html')
  or die "Could not open /some/file.html - $!";
while(my $t = $parser->get_token)
{
  # Skip start or end tags that are not "p" tags
  next if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
  # Print everything else normally (see HTML::TokeParser docs for explanation)
  if($t->[0] eq 'T')
  {
    print $t->[1];
  }
  else
  {
    print $t->[-1];
  }
}
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
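The same token-filtering approach can be sketched in other languages. Below is a rough Python analogue built on the standard library's html.parser; it is not a port of the Perl module, the class and function names are invented for illustration, and unlike the Perl loop it silently drops comments and declarations.

```python
from html.parser import HTMLParser


class KeepOnlyP(HTMLParser):
    """Collect text and <p> tags; drop every other tag."""

    def __init__(self):
        # convert_charrefs=False so entities are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Re-emit the tag exactly as written, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag == "p":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            self.out.append("</p>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_all_but_p(markup: str) -> str:
    parser = KeepOnlyP()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(strip_all_but_p('<div><p class="x">hi &amp; bye</p><pre>code</pre></div>'))
```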

In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).
For example, this:
<HTML /
<HEAD /
<TITLE / > /
<P / >
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
<html>
<head>
<title>
>
</title>
</head>
<body>
<p>
>
</p>
</body>
</html>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.

I came up with this:
<(?!\/?p(?=>|\s.*>))\/?.*?>
x/
< # Match open angle bracket
(?! # Negative lookahead (Not matching and not consuming)
\/? # 0 or 1 /
p # p
(?= # Positive lookahead (Matching and not consuming)
> # > - No attributes
| # or
\s # whitespace
.* # anything up to
> # close angle brackets - with attributes
) # close positive lookahead
) # close negative lookahead
# if we have got this far then we don't match
# a p tag or closing p tag
# with or without attributes
\/? # optional close tag symbol (/)
.*? # and anything up to
> # first closing tag
/
This will now deal with p tags with or without attributes and the closing p tags, but will match pre and similar tags, with or without attributes.
It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.

Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the likes)... but, a regex to match HTML tags that aren't <p></p>:
(<[^pP].*?>|</[^pP]>)
Verbose:
(
< # < opening tag
[^pP].*? # a non-p character, then non-greedy anything
> # > closing tag
| # ....or....
</ # </
[^pP] # a non-p tag
> # >
)

I used Xetius's regex and it works fine, except for some Flex-generated tags which can be : with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working:
<(?!\/?p(?=>|\s?.*>))\/?.*?>
I'm using it to clear tags from Flex-generated HTML text, so I also added more excepted tags:
<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>

Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;
print $replaced . "\n";
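The trick here is to capture what should be kept in group 1 and let the second alternative match everything else, so the replacement is group 1 when it took part in the match and nothing otherwise. Here is a Python sketch of the same idea using the same pattern and subject; the lambda stands in for Perl's /e flag. Note that, as written, the pattern also keeps other tags whose names start with p, such as <pre>.

```python
import re

# Group 1 captures <p ...> and </p>; the second alternative matches any other tag.
TAGS = re.compile(r"(</?p[^>]*>)|<[^>]*>")

subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>'

# Keep group 1 when it took part in the match, otherwise replace with nothing.
replaced = TAGS.sub(lambda m: m.group(1) or "", subject)
print(replaced)
```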
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...

Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure perl must have some off-the-shelf libraries for manipulating HTML.
Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.
So as I say, I don't really think regexps are the right tool for the job.

Since HTML is not a regular language
HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.

Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:
/<\/?[^p][^>]*>/
EDIT:
But that won't match a <pre> or <param> tag, unfortunately.
This, perhaps?
/<\/?(?!p>|p )[^>]+>/
That should cover <p> tags that have attributes, too.

You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.

The original regex can be made to work with very little effort:
<(?>/?)(?!p).+?>
The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
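The backtracking behaviour described here can be reproduced in Python. The sketch below does not use an atomic group, which Python's re module only supports from version 3.11 on; as an alternative with the same effect, it moves the optional slash inside the lookahead. The pattern names are illustrative.

```python
import re

# Naive version: when (?!p) fails after the slash, /? backtracks to
# match nothing, and (?!p) then succeeds because it sees "/" instead.
naive = re.compile(r"</?(?!p).+?>")

# Alternative to the atomic group: put the optional slash inside the
# lookahead so the assertion is always checked right after "<".
anchored = re.compile(r"<(?!/?p).+?>")

for tag in ("<p>", "</p>", "<b>", "</b>"):
    print(tag, bool(naive.search(tag)), bool(anchored.search(tag)))
```

Like the original (?!p), both versions also skip tags such as <pre>, because only the first letter is tested.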
(That said I agree that generally parsing HTML with regexes is not the way to go).

Try this, it should work:
/<\/?([^p](\s.+?)?|..+?)>/
Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).
/EDIT: I've added the ability to handle attributes in p tags.

This works for me because all the solutions above failed for other html tags starting with p such as param pre progress, etc. It also takes care of the html attributes too.
~(<\/?[^>]*(?<!<\/p|p)>)~ig

You should probably also remove any attributes on the <p> tag, since someone bad could do something like:
<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>
The easiest way to do this is to use the regex people suggest here to search for <p> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.
