Regex conditional - html

How would I write a RegEx to:
Find a match where the first instance of a > character is before the first instance of a < character.
(I am looking for bad HTML where the closing > initially in a line has no opening <.)

It's a pretty bad idea to try to parse HTML with regex, or even to try to detect broken HTML with a regex.
What happens when there is a line break so that the > character is the first character on the line, for example (valid HTML)?
You might get some mileage from reading the answers to this question also: RegEx match open tags except XHTML self-contained tags

Would this work?
string =~ /^[^<]*>/
This should start at the beginning of the line, look for all characters that aren't an open '<' and then match if it finds a close '>' tag.

^[^<>]*>
if you need the corresponding < as well,
^[^<>]*>[^<]*<
If there is a possibility of tags before the first >,
^[^<>]*(?:<[^<>]+>[^<>]*)*>
Note that it can give false positives, e.g.
<!-- > -->
is valid HTML, but the regex will flag it.
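As a rough check of the three patterns above, here is a small Python sketch (Python is my choice here, and the sample lines are made up) that runs each pattern against a few lines, including the comment that triggers a false positive:

```python
import re

# The three patterns from the answer above, compiled as-is.
STRAY_GT = re.compile(r'^[^<>]*>')                                # '>' before any '<'
STRAY_GT_THEN_LT = re.compile(r'^[^<>]*>[^<]*<')                  # ...followed later by a '<'
STRAY_GT_AFTER_TAGS = re.compile(r'^[^<>]*(?:<[^<>]+>[^<>]*)*>')  # complete tags may come first


def flagged(pattern, line):
    """Return True if the pattern reports a stray '>' on this line."""
    return pattern.search(line) is not None


# Hypothetical sample lines, not taken from the question.
samples = [
    'class="x">text</p>',       # '>' with no opening '<' before it
    '<p>fine</p>',              # well-formed
    '<b>bold</b> oops > here',  # stray '>' after complete tags
    '<!-- > -->',               # valid comment, but flagged by the third pattern
]

for line in samples:
    print(repr(line),
          flagged(STRAY_GT, line),
          flagged(STRAY_GT_THEN_LT, line),
          flagged(STRAY_GT_AFTER_TAGS, line))
```

For multi-line input you would either loop over lines as above or compile with re.MULTILINE so that ^ matches at every line start.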

Related

Using grep in BBEdit

I'd like BBEdit to search some HTML and match every paragraph tag that contains a text string like "myText".
This sort of works but often matches beyond the closing ">" of the tag.
<p.*myText[^>]*>
As I understand it, this should match the opening angle bracket-"p", then any number of characters until it finds "myText", then any number of characters that are NOT ">" until it finds the closing ">". What's wrong?
Use <p\s[^>]*myText[^>]*> – from comment by Wiktor Stribiżew.
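To see why the original pattern overshoots, here is a minimal Python sketch. BBEdit's grep is PCRE-based, but these particular constructs behave the same way in Python's re module; the input string is a made-up example.

```python
import re

# Hypothetical input: two paragraph tags, only the second mentions "myText".
html = '<p class="a">one</p><p id="myText">two</p>'

# Original pattern: the greedy ".*" is free to cross ">" characters,
# so the match starts at the first "<p" and runs into the second tag.
greedy = re.search(r'<p.*myText[^>]*>', html).group(0)

# Suggested pattern: "[^>]*" cannot leave the tag it started in.
bounded = re.search(r'<p\s[^>]*myText[^>]*>', html).group(0)

print(greedy)
print(bounded)
```

The \s after <p also keeps the pattern from matching tags such as <pre> or <param>.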

How to avoid <> in HTML?

I would like to paste the phrase
"<car>"
into my HTML code so that the word "car" appears between < and >. Some of the text will contain
"<car>"
and this is not an HTML element. The problem is that the parser treats it as HTML syntax. How can I avoid that? Is there some notation it needs to be wrapped in?
replace < by &lt; and > by &gt;
Live on JSFiddle.
< and > are special characters, more special characters in HTML you can find here.
More about HTML entities you can find here.
use &gt; for > and &lt; for <
&lt;car&gt;
You need to use character entities for special characters like these.
CODE:
<p>"&lt;car&gt;"</p>
OUTPUT:
"<car>"
&lt; = < less than
&gt; = > greater than
The same applies for XML too. Take a look here, special characters for HTML.
If you really want LESS THAN SIGN “<” to appear visibly in page content, write it as &lt;, so that it will not be treated as starting a tag. Ref.: 5.3.2 Character entity references in HTML 4.01.
So you would write
&lt;car>
If you like, you can write “>” as &gt; for symmetry, but there is no need to.
But if you really want to put something in angle brackets, e.g. using a mathematical notation, rather than a markup notation (as in HTML and XML), consider using U+27E8 MATHEMATICAL LEFT ANGLE BRACKET “⟨” and U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET “⟩”. They cause no problems to HTML markup, as they are not markup-significant. If you don’t know how to type them in your authoring environment, you can use character references for them:
&#x27e8;car&#x27e9;
This would result in ⟨car⟩, though as always with less common special characters, you would need to consider character (font) problems.
You can use the "greater than" and "less than" entities:
&lt;car&gt;
The W3C, the organization responsible for setting web standards, has some pretty good documentation on HTML entities. They consist of an ampersand followed by an entity name followed by a semicolon (&name;) or an ampersand followed by a pound sign followed by an entity number followed by a semicolon (&#number;). The link I provided has a table of common HTML entities.
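If the text is being generated by a program rather than typed by hand, the escaping can be done for you. A minimal sketch using Python's standard html module (the choice of Python is mine, not part of the question):

```python
import html

# Escape markup-significant characters so "<car>" shows up as literal text.
# quote=False leaves quotation marks alone; only &, < and > are replaced.
escaped = html.escape('<car>', quote=False)
print(escaped)

# Character references decode back to the characters they stand for.
print(html.unescape('&lt;car&gt;'))
print(html.unescape('&#x27e8;car&#x27e9;'))
```

Most template engines apply the same escaping automatically when inserting text into a page.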

Regular expression to match html tags

Just wanted to know if this is the right way to write a regular expression for an opening HTML tag <strong>: /<strong[^>]*/i?
What I am trying to do is have a pattern in place for HTML tags and then use it to match any HTML document.
Thanks in advance!
Close.
It would be like this for the opening tag:
/<strong[^>]*?>/i
Keep in mind that using Regex on HTML which involves tags nested within themselves can get very messy.
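Here is how that pattern might look in Python, as a sketch only. The \b is my own addition (it is not in the answer above) so that <stronger> or similar names are not picked up, and the sample document is made up.

```python
import re

# "\b" is an addition of mine: it stops "<strong" from also matching "<stronger>".
OPEN_STRONG = re.compile(r'<strong\b[^>]*>', re.IGNORECASE)

# Hypothetical sample document.
doc = '<p><STRONG class="x">a</STRONG> <strong>b</strong> <stronger>c</stronger></p>'

print(OPEN_STRONG.findall(doc))
```

Because [^>] can never consume a >, the lazy *? in the answer and a plain greedy * match exactly the same text here.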
OK. As I understand it, you want to match any string between the "<" and ">" symbols, for example <codekaro>.
To do so you can use:
^[\<][A-Za-z]*[\>]$
Here, ^ anchors the match at the start of the string,
[\<] matches one occurrence of the < symbol (the \ escapes it, although < needs no escaping inside a character class),
[A-Za-z]* matches zero or more ASCII letters, so tag names with digits (such as h1) and tags with attributes will not match,
[\>] matches one occurrence of the > symbol, escaped the same way,
$ anchors the match at the end of the string.
I encourage you to use this link for regex tutorial and this link to check results of regular expression.
Hope this will help you..!!
Happy learning..!!

Regex Operator in Validating HTML Tags

I am following Regular Expression.info and see on their samples page an expression to match against HTML tags, as follows:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
What is the semantic effect of the part \b[^>]*? I get that \b is a word boundary, but given what follows it, what is its purpose?
It matches anything extra (if it exists) up until the next occurrence of a ">" (the end of the opening tag). This would match stuff like class="classname" id="idname". However, it would also match any other character you could think of, such as •·°ÁÓ, which may or may not be what you want. As always, a proper HTML parser is the way to go for parsing HTML.
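As for the \b itself, my reading (not stated in the answer above) is that it stops the captured tag name from shrinking through backtracking so that [^>]* can swallow the rest of the name. A small Python sketch with made-up inputs, comparing the pattern with and without \b:

```python
import re

# Pattern quoted in the question, written with its leading "<", case-insensitive.
WITH_B = re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.IGNORECASE)
# Same pattern with "\b" removed, for comparison.
WITHOUT_B = re.compile(r'<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>', re.IGNORECASE)

good = '<b class="x">bold</b>'
bad = '<boo>bold</b>'   # mismatched tags: should not match

print(WITH_B.search(good).groups())
print(WITH_B.search(bad))
print(WITHOUT_B.search(bad).groups())
```

Without \b, the engine backtracks the tag name from "boo" down to "b", lets [^>]* consume "oo", and then happily pairs it with </b>.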

Regex to match all HTML tags except <p> and </p>

I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\??(?!p).+?>
But this still matches with the closing </p> tag. Any hint on how to match with the closing tag as well?
Note, this is being performed on xhtml.
If you insist on using a regex, something like this will work in most cases:
# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
s{
  <               # opening angled bracket
  (?>/?)          # ratchet past optional /
  (?:
    [^pP]         # non-p tag
    |             # ...or...
    [pP][^\s>/]   # longer tag that begins with p (e.g., <pre>)
  )
  [^>]*           # everything until closing angled bracket
  >               # closing angled bracket
}{}gx;            # replace with nothing, globally
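For comparison, here is a rough Python version of the same idea. Python's re module only gained atomic groups in version 3.11, so this sketch swaps in a different trick: it excludes "/" from the first character class, which also keeps </p> from slipping through. The function name and sample input are made up for illustration.

```python
import re

# Variant of the regex above without the atomic group "(?>/?)": excluding "/"
# from the first character class keeps "</p>" from slipping through.
NON_P_TAG = re.compile(r'</?(?:[^pP/]|[pP][^\s>/])[^>]*>')


def strip_non_p_tags(html):
    """Remove every tag except <p ...> and </p>; text content is kept."""
    return NON_P_TAG.sub('', html)


# Hypothetical sample input.
sample = '<div><p class="x">Hi <b>there</b></p><pre>code</pre></div>'
print(strip_non_p_tags(sample))
```

The same limits apply as with the Perl version: a > inside an attribute value or a comment will confuse it.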
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
use strict;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new('/some/file.html')
  or die "Could not open /some/file.html - $!";
while(my $t = $parser->get_token)
{
  # Skip start or end tags that are not "p" tags
  next if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
  # Print everything else normally (see HTML::TokeParser docs for explanation)
  if($t->[0] eq 'T')
  {
    print $t->[1];
  }
  else
  {
    print $t->[-1];
  }
}
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
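The same parser-based approach carries over to other languages. As an illustration only, here is a sketch with Python's standard html.parser module; the class and function names are made up, and unlike the Perl version it silently drops comments and declarations instead of printing them.

```python
from html.parser import HTMLParser


class KeepOnlyP(HTMLParser):
    """Re-emit a document, dropping every tag except <p> and </p>."""

    def __init__(self):
        # convert_charrefs=False so entities reach the handlers below unchanged
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':                      # tag names arrive lower-cased
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'p':
            self.out.append('</p>')

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f'&{name};')

    def handle_charref(self, name):
        self.out.append(f'&#{name};')


def keep_only_p(html):
    parser = KeepOnlyP()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


print(keep_only_p('<div><p class="x">Hi <b>there</b> &amp; you</p><pre>code</pre></div>'))
```

Collecting the output in a list rather than printing directly makes the destination configurable, as suggested above.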
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).
For example, this:
<HTML /
<HEAD /
<TITLE / > /
<P / >
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
<html>
<head>
<title>
>
</title>
</head>
<body>
<p>
>
</p>
</body>
</html>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.
I came up with this:
<(?!\/?p(?=>|\s.*>))\/?.*?>
x/
< # Match open angle bracket
(?! # Negative lookahead (Not matching and not consuming)
\/? # 0 or 1 /
p # p
(?= # Positive lookahead (Matching and not consuming)
> # > - No attributes
| # or
\s # whitespace
.* # anything up to
> # close angle brackets - with attributes
) # close positive lookahead
) # close negative lookahead
# if we have got this far then we don't match
# a p tag or closing p tag
# with or without attributes
\/? # optional close tag symbol (/)
.*? # and anything up to
> # first closing tag
/
This will now deal with p tags with or without attributes and the closing p tags, but will match pre and similar tags, with or without attributes.
It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
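A quick way to try this regex is a few lines of Python, whose re module accepts the pattern unchanged. The sample string is made up.

```python
import re

# The regex from the answer above, unchanged.
NOT_P = re.compile(r'<(?!\/?p(?=>|\s.*>))\/?.*?>')

# Hypothetical sample: p tags with and without attributes, plus other tags.
sample = 'a <p>x</p> <pre>y</pre> <p class="c">z</p> <b>w</b>'
print(NOT_P.sub('', sample))
```

Note that "." does not match a newline by default, so a tag whose attributes span several lines would need the DOTALL flag or a [^>] class instead.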
Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the likes)... but, a regex to match HTML tags that aren't <p></p>:
(<[^pP].*?>|</[^pP]>)
Verbose:
(
< # < opening tag
[^pP].*? # a non-p character, then non-greedy anything
> # > closing tag
| # ....or....
</ # </
[^pP] # a non-p tag
> # >
)
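Running the pattern as posted through Python on a made-up sample suggests two gaps worth knowing about: the first alternative accepts "/" as the non-p character, so </p> is matched too, and any tag whose name merely starts with p (such as <pre>) is left alone.

```python
import re

# The regex from the answer above, unchanged.
PATTERN = re.compile(r'(<[^pP].*?>|</[^pP]>)')

# Hypothetical sample input.
sample = '<div><p>text</p><pre>code</pre></div>'
print(PATTERN.sub('', sample))
```

So treat this one as a starting point rather than a finished filter.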
I used Xetius's regex and it works fine, except for some Flex-generated tags which can be : with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working:
<(?!\/?p(?=>|\s?.*>))\/?.*?>
I'm using it to clear tags from Flex-generated HTML text, so I also added more excluded tags:
<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;
print $replaced . "\n";
See this live demo
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
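The capture-what-you-keep trick is not specific to Perl. A minimal Python sketch of the same technique, using the sample string from the Perl snippet (the function name is made up):

```python
import re

# Same idea as the Perl snippet: capture what should survive, match the rest.
TAGS = re.compile(r'(</?p[^>]*>)|<[^>]*>')


def keep_p_tags(subject):
    # Group 1 is set only when the first alternative (a p tag) matched.
    return TAGS.sub(lambda m: m.group(1) or '', subject)


subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>'
print(keep_p_tags(subject))
```

As written, the first alternative also preserves tags such as <pre>, because p[^>]* accepts any name starting with p; adding \b after the p would tighten that.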
Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure Perl must have some off-the-shelf libraries for manipulating HTML.
Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of Perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that, as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.
So as I say, I don't really think regexps are the right tool for the job.
"Since HTML is not a regular language"
HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.
Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:
/<\/?[^p][^>]*>/
EDIT:
But that won't match a <pre> or <param> tag, unfortunately.
This, perhaps?
/<\/?(?!p>|p )[^>]+>/
That should cover <p> tags that have attributes, too.
One thing you do not need to allow for is whitespace before the "p": < p> (with a space right after the <) is not a tag, and HTML parsers treat that < as literal text. Whitespace is only allowed after the tag name, as in <p >.
The original regex can be made to work with very little effort:
<(?>/?)(?!p).+?>
The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
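The backtracking problem is easy to reproduce. In this Python sketch I avoid the atomic group (Python's re only supports it from 3.11) and instead move the lookahead in front of the optional slash, which is a different fix with the same effect on this sample. The input is made up.

```python
import re

# "/?" can backtrack to match nothing, letting "(?!p)" look at the "/" instead.
NAIVE = re.compile(r'</?(?!p).+?>')
# Lookahead placed before the optional slash, so it always sees the tag name.
ANCHORED = re.compile(r'<(?!/?p)/?.+?>')

# Hypothetical sample input.
sample = '<div><p>x</p></div>'
print(NAIVE.findall(sample))
print(ANCHORED.findall(sample))
```

Like the original, both versions also skip every other tag whose name starts with p, such as <pre>.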
(That said I agree that generally parsing HTML with regexes is not the way to go).
Try this, it should work:
/<\/?([^p](\s.+?)?|..+?)>/
Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).
/EDIT: I've added the ability to handle attributes in p tags.
This works for me where all the solutions above failed on other HTML tags starting with p, such as param, pre, progress, etc. It also takes care of the HTML attributes.
~(<\/?[^>]*(?<!<\/p|p)>)~ig
You should probably also remove any attributes on the <p> tag, since someone bad could do something like:
<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>
The easiest way to do this is to use the regex people suggest here to search for <p> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.
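A sketch of that replacement in Python, assuming attribute values never contain a literal > (if they might, a real parser is the safer route). The pattern and function name are my own, not taken from the answers above.

```python
import re

# Matches "<p>" as well as "<p ...attributes...>", but not "<pre>" or "</p>".
P_OPEN = re.compile(r'<p(?:\s[^>]*)?>', re.IGNORECASE)


def strip_p_attributes(html):
    """Replace every opening p tag with a bare <p>."""
    return P_OPEN.sub('<p>', html)


risky = '''<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>'''
print(strip_p_attributes(risky))
```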