Regex selects first to last instead of just first - html

I'm trying to use String#sub! in Ruby and it substitutes far too much.
This is the regex I'm using. You can see it matching too much here: http://rubular.com/r/IUav4KEFWH
<rb>.+<\/rb>
It selects from the first <rb> to the last </rb>, and I want it to select just the first pair.
Is there another version of sub I'm not aware of, or a better way to substitute?
It would be easy to turn off multi-line mode and put the tags on separate lines, but I don't want to sacrifice multi-line matching.

Your regex is too greedy:
<rb>.+<\/rb>
Make it non-greedy using:
<rb>.+?<\/rb>

It matches from the first <rb> tag up until the very last </rb> tag because + is a greedy operator meaning it will match as much as it can and still allow the remainder of the regular expression to match.
You want to use +? for a non-greedy match meaning "one or more — preferably as few as possible".
<rb>.+?</rb>
Note: For extracting data from HTML, a parser is recommended over regular expressions.
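For the regex route, here is a minimal Ruby sketch of the greedy versus non-greedy behaviour. The sample string is made up for illustration (the asker's real input is not shown), and the m flag is kept so that . still matches newlines:

```ruby
# Hypothetical sample input, not the asker's data.
text = "<rb>one</rb>\nfiller\n<rb>two</rb>"

# Greedy: .+ runs to the end of the input and backtracks to the LAST </rb>.
greedy = text[/<rb>.+<\/rb>/m]

# Lazy: .+? expands one character at a time and stops at the FIRST </rb>.
lazy = text[/<rb>.+?<\/rb>/m]

# sub (like sub!) replaces only the first match, so the second pair survives.
replaced = text.sub(/<rb>.+?<\/rb>/m, "[first]")

puts greedy.inspect
puts lazy.inspect
puts replaced.inspect
```

With the lazy quantifier, sub! should touch only the first pair even with multi-line matching left on.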

You can try this variant:
<rb>(?>(?!<\/rb>).)*+<\/rb>
Or if you want:
<rb>[^<]+<\/rb>
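Note the difference between .*? and [^<]+: the character class cannot cross any <, so it only matches an <rb> element that contains no nested tags, while the lazy dot accepts anything up to the first </rb>.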
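A small Ruby sketch comparing the three patterns from these answers. The input is a made-up string whose first <rb> element contains a nested tag, chosen only to make the difference visible:

```ruby
# Hypothetical input: the first <rb> element contains a nested <b> tag.
text = "<rb>a<b>x</b>c</rb><rb>d</rb>"

# Lazy dot: anything up to the first </rb>.
lazy = text[/<rb>.+?<\/rb>/]

# Atomic group + possessive quantifier: consume characters one at a time,
# as long as </rb> does not start at the current position.
tempered = text[/<rb>(?>(?!<\/rb>).)*+<\/rb>/]

# Negated class: the content may not contain any '<' at all.
negated = text[/<rb>[^<]+<\/rb>/]

puts lazy
puts tempered
puts negated
```

The negated class skips the first element because of the nested tag and matches the second one instead.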

Related

How to edit this html lexer rule?

I want to edit this HTML lexer rule and need help with the regular expression.
TAG_NAME refers to any HTML attribute name, for example required, class, or id.
I want to change it so that it does not accept this exact name: 'az-'.
I think this needs a change to the regular expression. I looked it up, but I couldn't combine what I found online with the way these rules are written.
As a first attempt I removed the '-' from Tag_NameChar, but then the lexer no longer recognized attributes like 'data-target'.
This snippet is for the rule:
and this one shows how the attributes are recognized.
ANTLR does not support lookahead syntax like some regex engines do, so there's no easy way to exclude certain matches from within the regex. It's always possible to rewrite a regular expression to exclude a given string (regular expressions are closed under negation and intersection), but it usually ends up quite painful. In your case, you'd end up with something following the logic of "a tag name can either have less than 3 characters, more than 3 characters, or it could have three characters where the first isn't an 'a', the second isn't a 'z' or the last isn't a '-'".
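To make that case analysis concrete, here is a sketch of the lookahead-free rewrite, written as a Ruby regex only because it is easy to test there. It assumes, purely for illustration, that tag-name characters are limited to a-z and '-'; the real TAG_NAME rule allows a much larger alphabet, and the variable names are made up:

```ruby
# Assumption for illustration only: the alphabet is a-z plus '-'.
any      = '[a-z\-]'  # any allowed character
not_a    = '[b-z\-]'  # allowed, but not 'a'
not_z    = '[a-y\-]'  # allowed, but not 'z'
not_dash = '[a-z]'    # allowed, but not '-'

# Everything over that alphabet except the exact string "az-".
not_az_dash = /\A(?:
    #{any}{1,2}              # fewer than three characters
  | #{any}{4,}               # more than three characters
  | #{not_a}#{any}#{any}     # three characters, first is not a
  | #{any}#{not_z}#{any}     # three characters, second is not z
  | #{any}#{any}#{not_dash}  # three characters, third is not a hyphen
)\z/x

%w[az- az az-x data-target bz- ay- azz].each do |name|
  puts "#{name}: #{not_az_dash.match?(name)}"
end
```

Even with a three-symbol exclusion and a tiny alphabet the pattern needs five alternatives, which is why the options below are usually preferable.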
The less painful, but also less cross-language, solution is to use a predicate that returns false if the text of the tag name equals az-. So something like {!getText().equals("az-")}? depending on the language.
If you're okay with introducing an additional lexer rule, you may also introduce a rule INVALID_TAG_NAME (or whatever you want to call it) that matches exactly az- and that's defined before TAG_NAME. That way any tag that's named exactly az- will produce an INVALID_TAG_NAME token instead of a TAG_NAME token.
Depending on your requirements, you could also leave the grammar unchanged altogether and simply produce an error when you see a tag named az- when you traverse the tree in a listener or visitor.

Regex matching Google Cache url (matching entire href parameter when it contains a word)

Disclaimer: I know that html and regex should not stand together, but this is an exceptional case.
I need to parse Google Search results and extract cache urls. I have this in the page:
<a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:
gsNKb7ku3ewJ:somedata&ei=MyIIUtrZAcPX7AaVzIHwDg&ved=0CB8QIDAC&usg
=AFQjCNGcnWfdzQiTKwyAMmI-M-xzxII5Ag">Cached</a>
I tried simple stuff like: href=[\'"]?([^\'" >]+) but it is not what I need. I want to extract a single parameter (q) from the href. I need to get:
http://webcache.googleusercontent.com/search%3Fq%3Dcache:gsNKb7ku3ewJ:somedata
So everything between "url?q=" and the first "&", but only when that text contains the word "webcache".
If your language supports positive look-behinds:
(?<=q=).*?(?=[&"])
Otherwise match group \1 with this expression:
(?:q=)(.*?)(?=[&"])
Explanation:
.*? is the body of our expression. Just match everything, but don't be greedy!
(?<=q=) is a positive look-behind, which says "q=" should come before the match
(?=[&"]) is a positive look ahead, which says "either & or a quote should come after the match"
Because we make it not greedy with the ?, it'll stop at the first quote or ampersand. Otherwise it'd match all of the way to the closing quote.
Use a look behind before, and a look ahead at the end to assert the surrounding text, and include the keyword in the regex:
(?<=url\?q=)[^&]*webcache[^&]*(?=&)
Using [^&]* ensures that the keyword occurs before an & - within the target string.
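A quick Ruby check of both approaches. The href is the one from the question, joined onto a single line on the assumption that the line breaks in the post are just wrapping; the second link is a made-up non-cache URL:

```ruby
# The anchor from the question, joined onto one line (assumed to be wrapped in the post).
html = '<a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:' \
       'gsNKb7ku3ewJ:somedata&ei=MyIIUtrZAcPX7AaVzIHwDg&ved=0CB8QIDAC&usg' \
       '=AFQjCNGcnWfdzQiTKwyAMmI-M-xzxII5Ag">Cached</a>'

# Hypothetical non-cache link, to show that the keyword filter rejects it.
other = '<a href="/url?q=http://example.com/page&ei=abc">Example</a>'

# Lookbehind + keyword + lookahead: only q values that contain "webcache".
with_keyword = /(?<=url\?q=)[^&]*webcache[^&]*(?=&)/
cache_url = html[with_keyword]
skipped   = other[with_keyword]

# Capture-group variant for engines without lookbehind: group 1 holds the value.
captured = html[/(?:q=)(.*?)(?=[&"])/, 1]

puts cache_url
puts skipped.inspect
puts captured
```

The capture-group variant has no keyword check, so it would also return the q value of the non-cache link.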

Regex all uppercase with special characters

I have a regex '^[A0-Z9]+$' that works until it reaches strings with 'special' characters like a period or dash.
List:
UPPER
lower
UPPER lower
lower UPPER
TEST
test
UPPER2.2-1
UPPER2
Gives:
UPPER
TEST
UPPER2
How do I get the regex to ignore non-alphanumeric characters as well, so that it also includes UPPER2.2-1?
I have a link here to show it 'real-time': http://www.rubular.com/r/ev23M7G1O3
This is for MySQL REGEX
EDIT: I didn't specify I wanted all non-alphanumeric characters (including spaces), but with the help of others here it led me to this: '^[A-Z-0-9[:punct:][:space:]]+$' is there anything wrong with this?
Try
'^[A-Z0-9.-]+$'
You just need to add the special characters to the character class, optionally escaping them.
Additionally, if you choose not to escape the -, be aware that it should be placed at the start or the end of the character class so that it cannot be interpreted as delimiting a range.
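The same character class can be tried against the list from the question. This sketch uses Ruby rather than MySQL, so the anchors are written as \A and \z (in Ruby, ^ and $ match at every line, not only at the ends of the string):

```ruby
# The list from the question.
names = ["UPPER", "lower", "UPPER lower", "lower UPPER",
         "TEST", "test", "UPPER2.2-1", "UPPER2"]

# The hyphen sits last in the class, so it is a literal, not a range.
pattern = /\A[A-Z0-9.-]+\z/

matches = names.grep(pattern)
puts matches
```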
To your updated question, if you want all non-whitespace, try using a character class such as:
^[^ ]+$
which will match everything except for a space.
If instead what you wanted is all non-whitespace and non-lowercase, you likely will want to use:
^[^ a-z]+$
The 'trick' used here is adding a caret symbol after the opening [ of the character class. This indicates that we want the negation of the match.
Following the pattern, we can also apply this 'trick' to get everything but lowercase letters like this:
^[^a-z]+$
I'm not really sure which of the 3 above you want, but if nothing else, this ought to serve as a good example of what you can do with character classes.
I believe you are looking for (one?) uppercase-word match, where word is pretty much anything.
^[^a-z\s]+$
...or if you want to allow more words with spaces, then probably just
^[^a-z]+$
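A short Ruby sketch of how the two negated classes differ, using made-up sample strings and \A / \z as whole-string anchors:

```ruby
# Hypothetical samples chosen to show the effect of excluding whitespace.
samples = ["UPPER", "UPPER lower", "UPPER 2.2-1", "UPPER2.2-1", "test"]

# No lowercase letters and no whitespace: a single "word".
single_word = samples.grep(/\A[^a-z\s]+\z/)

# No lowercase letters, whitespace allowed: several "words".
several_words = samples.grep(/\A[^a-z]+\z/)

puts single_word.inspect
puts several_words.inspect
```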
You just need to put in the . and -. In theory you don't need to escape them, because they are inside the brackets, but I like to escape anyway to remind myself for the cases where I have to.
'^[A-Z0-9\.\-]+$'
Try regular expression as below:
'^[A-Z0-9\\.\\-]+$'

RegEx to return 'href' attribute of 'link' tags only?

I'm trying to craft a regex that returns only the hrefs of <link> tags.
Why does this regex return all hrefs, including those of <a> tags?
(?<=<link\s+.*?)href\s*=\s*[\'\"][^\'\"]+
<link rel="stylesheet" rev="stylesheet" href="idlecore-tidied.css?T_2_5_0_228" media="screen">
<a href="anotherurl">Slash Boxes</a>
Either
/(?<=<link\b[^<>]*?)\bhref\s*=\s*(?:"[^"]*"|'[^']*'|\S+)/
or
/<link\b[^<>]*?\b(href\s*=\s*(?:"[^"]*"|'[^']*'|\S+))/
The main difference is [^<>]*? instead of .*?. This is because you don't want it to continue the search into other tags.
Avoid lookbehind for such a simple case: just match what you need, and capture what you want to get.
I got good results with <link\s+[^>]*(href\s*=\s*(['"]).*?\2) in The Regex Coach with s and g options.
/(?<=<link\s+.*?)href\s*=\s*[\'\"][^\'\"]+[^>]*>/
I'm a little shaky on the lookbehind myself, so I left that in there. This regex, though:
/(<link\s+.*?)href\s*=\s*[\'\"][^\'\"]+[^>]*>/
...works in my JavaScript test.
(?<=<link\s+.*?)href\s*=\s*[\'\"][^\'\"]+
works with Expresso (I think Expresso runs on the .NET regex engine). You could even refine this a bit more to match the closing ' or ":
(?<=<link\s+.*?)href\s*=\s*([\'\"])[^\'\"]+(\1)
Perhaps your regex-engine doesn't work with lookbehind assertions. A workaround would be
(?:<link\s+.*?)(href\s*=\s*([\'\"])[^\'\"]+(\2))
Your match will then be in the captured group 1.
What regex flavor are you using? Perl, for one, doesn't support variable-length lookbehind. Where that's an option, I'd choose (edited to implement the very good idea from MizardX):
(?<=<link\b[^<>]*?)href\s*=\s*(['"])(?:(?!\1).)+\1
as a first approximation. That way the choice of quote character (' or ") will be matched.
The same for a language without support for (variable-length) lookbehind:
(?:<link\b[^<>]*?)(href\s*=\s*(['"])(?:(?!\2).)+\2)
\1 will contain your match.
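The capture-group variant can be tried in Ruby, whose regex engine also rejects variable-length lookbehind. A minimal sketch using the two lines of HTML from the question:

```ruby
html = <<~HTML
  <link rel="stylesheet" rev="stylesheet" href="idlecore-tidied.css?T_2_5_0_228" media="screen">
  <a href="anotherurl">Slash Boxes</a>
HTML

# Group 1 is the whole href attribute, group 2 the quote character in use.
# [^<>]*? keeps the search inside the <link ...> tag itself.
link_href = /(?:<link\b[^<>]*?)(href\s*=\s*(['"])(?:(?!\2).)+\2)/

# scan returns one [group1, group2] pair per match; keep only group 1.
hrefs = html.scan(link_href).map(&:first)
puts hrefs
```

The href of the <a> tag is not returned, because the match has to start at <link and cannot cross a < or >.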

Variable order regex syntax

Is there a way to indicate that two or more regex phrases can occur in any order? For instance, XML attributes can be written in any order. Say that I have the following XML:
<a href="home.php" class="link" title="Home">Home</a>
<a href="home.php" title="Home" class="link">Home</a>
How would I write a match that checks the class and title and works for both cases? I'm mainly looking for the syntax that allows me to check in any order, not just matching the class and title as I can do that. Is there any way besides just including both combinations and connecting them with a '|'?
Edit: My preference would be to do it in a single regex, as I'm building it programmatically and also unit testing it.
No, I believe the best way to do it with a single RE is exactly as you describe. Unfortunately, it'll get very messy when your XML can have 5 different attributes, giving you a large number of different REs to check.
On the other hand, I wouldn't be doing this with an RE at all since they're not meant to be programming languages. What's wrong with the old fashioned approach of using an XML processing library?
If you're required to use an RE, this answer probably won't help much, but I believe in using the right tools for the job.
Have you considered xpath? (where attribute order doesn't matter)
//a[@class and @title]
Will select both <a> nodes as valid matches. The only caveat being that the input must be xhtml (well formed xml).
You can create a lookahead for each of the attributes and plug them into a regex for the whole tag. For example, the regex for the tag could be
<a\b[^<>]*>
If you're using this on XML you'll probably need something more elaborate. By itself, this base regex will match a tag with zero or more attributes. Then you add a lookahead for each of the attributes you want to match:
(?=[^<>]*\s+class="link")
(?=[^<>]*\s+title="Home")
The [^<>]* lets it scan ahead for the attribute, but won't let it look beyond the closing angle bracket. Matching the leading whitespace here in the lookahead serves two purposes: it's more flexible than matching it in the base regex, and it ensures that we're matching a whole attribute name. Combining them we get:
<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>
Of course, I've made some simplifying assumptions for the sake of clarity. I didn't allow for whitespace around the equals signs, for single-quotes or no quotes around the attribute values, or for angle brackets in the attribute values (which I hear is legal, but I've never seen it done). Plugging those leaks (if you need to) will make the regex uglier, but won't require changes to the basic structure.
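A minimal Ruby sketch of the combined lookahead regex, run against the two attribute orders plus a made-up third tag that lacks the title attribute:

```ruby
# One lookahead per required attribute; their order in the tag does not matter.
either_order = %r{<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>}

class_first = '<a href="home.php" class="link" title="Home">Home</a>'
title_first = '<a href="home.php" title="Home" class="link">Home</a>'
# Hypothetical counter-example: the title attribute is missing.
no_title    = '<a href="home.php" class="link">Home</a>'

[class_first, title_first, no_title].each do |tag|
  puts "#{either_order.match?(tag)}  #{tag}"
end
```

Adding a third required attribute means adding one more lookahead, rather than enumerating every ordering.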
You could use named groups to pull the attributes out of the tag. Run the regex and then loop over the groups doing whatever tests that you need.
Something like this (untested, using .net regex syntax with the \w for word characters and \s for whitespace):
<a ((?<key>\w+)\s?=\s?['"](?<value>\w+)['"])+ />
The easiest way would be to write a regex that picks up the <a .... > part, and then write two more regexes to pull out the class and the title. Although you could probably do it with a single regex, it would be very complicated, and probably a lot more error prone.
With a single regex you would need something like
<a[^>]*((class="([^"]*)")|(title="([^"]*)"))?((title="([^"]*)")|(class="([^"]*)"))?[^>]*>
Which is just a first hand guess without checking to see if it's even valid. Much easier to just divide and conquer the problem.
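The divide-and-conquer idea might look like this in Ruby: one regex picks out the tag, a second collects its attributes, and the order check disappears because a Hash does not care about order. The helper name is made up, and the sketch assumes every attribute value is quoted and contains no quote characters:

```ruby
# Hypothetical helper: returns the attributes of the first <a ...> tag as a
# Hash, or nil if the input contains no such tag.
# Simplifying assumption: all values are quoted and contain no quotes.
def anchor_attributes(html)
  tag = html[/<a\b[^<>]*>/]
  return nil unless tag

  # Each scan result is a [name, value] pair, so to_h builds the Hash directly.
  tag.scan(/([\w\-]+)\s*=\s*["']([^"']*)["']/).to_h
end

first  = anchor_attributes('<a href="home.php" class="link" title="Home">Home</a>')
second = anchor_attributes('<a href="home.php" title="Home" class="link">Home</a>')

puts first.inspect
puts first["class"] == "link" && first["title"] == "Home"
puts first == second
```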
A first ad hoc solution might be the following.
((class|title)="[^"]*?" *)+
This is far from perfect because it allows every attribute to occur more than once. I could imagine that this might be solvable with assertions. But if you just want to extract the attributes, this might already be sufficient.
If you want to match a permutation of a set of elements, you could use a combination of back references and zero-width negative lookahead.
Say you want to match any one of these six lines:
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-def-789-abc-0AB
You can do this with the following regex:
/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/
The back references (\1, \2) let you refer to your previous matches, and the zero-width negative lookahead ((?!...)) lets you negate a positional match, saying: don't match if the contained pattern matches at this position. Combining the two makes sure that your match is a legitimate permutation of the given elements, with each possibility occurring only once.
So, for example, in ruby:
input = <<LINES
123-abc-456-abc-789-abc-0AB
123-abc-456-abc-789-def-0AB
123-abc-456-abc-789-ghi-0AB
123-abc-456-def-789-abc-0AB
123-abc-456-def-789-def-0AB
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-abc-0AB
123-abc-456-ghi-789-def-0AB
123-abc-456-ghi-789-ghi-0AB
123-def-456-abc-789-abc-0AB
123-def-456-abc-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-def-789-abc-0AB
123-def-456-def-789-def-0AB
123-def-456-def-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-def-456-ghi-789-def-0AB
123-def-456-ghi-789-ghi-0AB
123-ghi-456-abc-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-abc-789-ghi-0AB
123-ghi-456-def-789-abc-0AB
123-ghi-456-def-789-def-0AB
123-ghi-456-def-789-ghi-0AB
123-ghi-456-ghi-789-abc-0AB
123-ghi-456-ghi-789-def-0AB
123-ghi-456-ghi-789-ghi-0AB
LINES
# outputs only the permutations
# (String is not Enumerable since Ruby 1.9, so split the input into lines before grepping)
puts input.lines.grep(/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/)
For a permutation of five elements, it would be:
/1-(abc|def|ghi|jkl|mno)-
2-(?!\1)(abc|def|ghi|jkl|mno)-
3-(?!\1|\2)(abc|def|ghi|jkl|mno)-
4-(?!\1|\2|\3)(abc|def|ghi|jkl|mno)-
5-(?!\1|\2|\3|\4)(abc|def|ghi|jkl|mno)-6/x
For your example, the regex would be
/<a href="home.php" (class="link"|title="Home") (?!\1)(class="link"|title="Home")>Home<\/a>/
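Since the question mentions building the regex programmatically, here is a sketch of generating that pattern for any number of alternatives. The helper name is made up. It assumes its groups are the first capture groups in the final regex, that there are fewer than ten of them (\1 to \9), and that no alternative is a prefix of another (the (?!\1) guard only checks what follows the current position):

```ruby
# Hypothetical helper: builds the source for "each item exactly once, in any
# order", with `separator` between consecutive items.
def permutation_source(items, separator)
  group = "(" + items.map { |item| Regexp.escape(item) }.join("|") + ")"
  slots = items.each_index.map do |i|
    # Slot i must differ from every earlier capture: (?!\1|\2|...).
    seen  = (1..i).map { |n| "\\#{n}" }.join("|")
    guard = seen.empty? ? "" : "(?!#{seen})"
    guard + group
  end
  slots.join(Regexp.escape(separator))
end

attrs   = ['class="link"', 'title="Home"']
pattern = Regexp.new('<a href="home\.php" ' + permutation_source(attrs, " ") + '>Home</a>')

puts pattern.source
puts pattern.match?('<a href="home.php" class="link" title="Home">Home</a>')
puts pattern.match?('<a href="home.php" title="Home" class="link">Home</a>')
puts pattern.match?('<a href="home.php" class="link" class="link">Home</a>')
```

The same helper covers the three-element case above, for example permutation_source(%w[abc def ghi], "-").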