Regex to extract text from inside an HTML tag - html

I know this has been asked at least a thousand times but I can't find a proper regex that will match a name in this string here:
<td><div id="topbarUserName">Donald</div></td>
I want to get the name 'Donald' and the regex that's the closest is >[a-zA-Z0-9]+ but the result is >Donald.
I'm coding in PureBasic (its syntax is similar to that of Basic), which uses the PCRE library for regular expressions.
Can anyone help?

Josh's pattern will work if you only make use of the numbered group, not the whole match. If you have to use the whole match, use something like (?<=>)(\w+?)(?=<)
Either way, regex is widely known to not be good for parsing HTML.
Explanation:
(?<=) is used to check if something appears before the current item.
\w+? will match any "word" character, one or more times, but as few times as possible: it stops as soon as the rest of the pattern matches. In this situation the ? could have been left out.
(?=) is used to check if something appears after the current item.
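As a rough illustration of the two options above (whole match with lookarounds versus reading a numbered group), here is a minimal sketch in Python. Python's re module is not PCRE, so this is only an approximation of how PureBasic would behave; the variable names are made up for the example.

```python
import re

html = '<td><div id="topbarUserName">Donald</div></td>'

# Lookbehind (?<=>) requires a '>' just before the match and lookahead (?=<)
# requires a '<' just after it; neither character is part of the match.
whole_match = re.search(r'(?<=>)\w+(?=<)', html)

# Alternative: keep the delimiters in the pattern and read capture group 1.
grouped = re.search(r'>(\w+)<', html)

print(whole_match.group(0))  # the name only
print(grouped.group(0))      # includes the surrounding '>' and '<'
print(grouped.group(1))      # the name only
```

Either way the name comes out without the angle brackets; the lookaround form is only needed when the tool gives you nothing but the whole match.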

Try this
It should capture anything that is a letter / number
>([\w]+)<
Also I'm not exactly sure what your project limitations are, but it would be much easier to do something like this
$('#topbarUserName').text();
in jQuery instead of using a regex.

>([a-zA-Z]+) should do the trick. Remember to get the grouping right.

Why not do it with plain old basic string functions?
pos.i = FindString(HTMLstring.s, "topbarUserName")
If pos > 0 ; FindString returns 0 when the id is not found
  a.i = pos + 16 ; skip "topbarUserName" (14 chars) plus the closing "> (2 chars)
  b.i = FindString(HTMLstring, "<", a)
  If b > 0
    Donald.s = Mid(HTMLstring, a, b - a)
  EndIf
EndIf
Debug Donald
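The same idea, sketched in Python for readers who don't have PureBasic at hand. The helper name text_after_id is hypothetical, and the sketch assumes the id attribute is the last thing before the closing > of the tag, as in the question's string.

```python
def text_after_id(html, element_id):
    """Return the text between id="element_id"> and the next '<', or None."""
    marker = element_id + '">'
    start = html.find(marker)
    if start == -1:          # id not present
        return None
    start += len(marker)     # first character after the opening tag
    end = html.find("<", start)
    if end == -1:            # no closing tag follows
        return None
    return html[start:end]

name = text_after_id('<td><div id="topbarUserName">Donald</div></td>',
                     "topbarUserName")
print(name)
```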

Related

Regex find two characters in order, between others, ignoring punctuation

I'm trying to filter using regex in MySQL.
The field is a text field and I want to find all that match 'MD' or similar ('M.D.', 'M. D.', 'DDS, M.D.' etc.).
I do not want to accept those that contain M and D as a part of another acronym (e.g., 'DMD'). However 'DMD, M.D.' I would want to find.
Apologies if this is a simple task - I read through some regex tutorials and couldn't figure this out! Thanks.
Update:
With help from the suggestions I arrived at the following solution:
(\s|^)M\.?\s*D\.?
which works for all of my cases. The quotes in my question were to indicate it was a string; they are not a part of the string.
You can use a regex like this:
\b(M\.?\s*D\.?|D\.?\s*D\.?\s*S\.?)
If I have understood your requirement:
'([^'.]*[ ,]*M[. ]*D[. ]*)'
This looks for MD preceded by a space, a comma, or ', with the letters separated by zero or more dots and spaces, and followed by '.
It matches everything between the '' marks.
test: https://regex101.com/r/oV2kV8/2
In the end I found this solution works:
(\s|^)M\.?\s*D\.?(\s|$)
This allows for the 'MD' to be at the start or after another credential and to have spaces or periods or nothing between the letters.
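A quick way to sanity-check the final pattern against the cases listed in the question. This is a sketch in Python rather than MySQL, on the assumption that both engines treat this particular pattern the same way; the sample list is taken from the question.

```python
import re

md_pattern = re.compile(r'(\s|^)M\.?\s*D\.?(\s|$)')

samples = ["MD", "M.D.", "M. D.", "DDS, M.D.", "DMD", "DMD, M.D."]

# True where the credential field contains a standalone MD
results = {s: bool(md_pattern.search(s)) for s in samples}

for sample, matched in results.items():
    print(f"{sample!r}: {matched}")
```

Only 'DMD' should be rejected, because its M is preceded by a letter rather than by whitespace or the start of the string.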

How do I check if values between html tags are blank or empty using regular expressions in notepad plus plus

I'm conducting a mass search of files in Notepad++ and I need to determine if there are no values between a set of tags (i.e. <author></author>).
".*?" will search for 0 or more characters (well, most), which is fine. But I'm looking for a set of tags with at least one character between them.
".+?" is similar to the above and does work in notepad++.
I tried the following, which was unsuccessful:
<author>.{0}?</author>
Thank you for any help.
Since you look for something that doesn't exist you don't have to make it that complicated. Simply searching for <author></author> would do the trick, wouldn't it? If you want to include space-characters as "nothing" you could modify it to the following:
<author>\s*?</author>
Output:
<author></author> Match
<author> </author> Match
<author>something</author> No match
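The three cases above can be reproduced outside Notepad++ with a short Python sketch. The pattern is the one from this answer, with the lazy ? dropped since it makes no difference here; the sample lines are the ones shown in the output.

```python
import re

empty_author = re.compile(r'<author>\s*</author>')

lines = [
    "<author></author>",
    "<author> </author>",
    "<author>something</author>",
]

# Maps each line to True when the tag pair is empty or whitespace-only
is_empty = {line: bool(empty_author.search(line)) for line in lines}

for line, empty in is_empty.items():
    print(line, "Match" if empty else "No match")
```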
I don't understand why you are using the "?" operator; ".+" should yield the result you need.

Regex for Encoded HTML

I'd like to create a regex that will match an opening <a> tag containing an href attribute only:
<a href="doesntmatter.com">
It should match the above, but not match when other attributes are added:
<a href="doesntmatter.com" onmouseover="alert('Do something evil with Javascript')">
Normally that would be pretty easy, but the HTML is encoded. So encoding both of the above, I need the regex to match this:
&lt;a href=&quot;doesntmatter.com&quot; &gt;
But not match this:
&lt;a href=&quot;doesntmatter.com&quot; onmouseover=&quot;alert('do something evil with javascript.')&quot; &gt;
Assume all encoded HTML is "valid" (no weird malformed XSS trickery) and assume that we don't need to follow any HTML sanitization best practices. I just need the simplest regex that will match A) above but not B).
Thanks!
The initial regular expression that comes to mind is /<a href=".*?">/; a lazy expression (.*?) can be used to match the string between the quotes. However, as pointed out in the comments, because the regular expression is anchored by a >, it'll match the invalid tag as well, because a match is still made.
In order to get around this problem, you can use atomic grouping. Atomic grouping tells the regular expression engine, "once you have found a match for this group, accept it" -- this will solve the problem of the regex going back and matching the second string after not finding a > at the end of the href. The regular expression with an atomic group would look like:
/<a (?>href=".*?")>/
Which would look like the following when replacing the characters with their HTML entities:
/&lt;a (?>href=&quot;.*?&quot;)&gt;/
Hey! I had to do a similar thing recently. I recommend decoding the html first then attempt to grab the info you want. Here's my solution in C#:
// Requires: using System.Text.RegularExpressions;
private string getAnchor(string data)
{
    // Verbatim string (@"..."): backslashes are literal and "" stands for one quote
    string pattern = @"<a.*?href=[""'](?<href>.*?)[""'].*?>(?<text>.*?)</a>";
    Regex myRegex = new Regex(pattern, RegexOptions.Multiline);
    string anchor = "";
    MatchCollection matches = myRegex.Matches(data);
    foreach (Match match in matches)
    {
        anchor += match.Groups["href"].Value.Trim() + "," + match.Groups["text"].Value.Trim();
    }
    return anchor;
}
I hope that helps!
I don't see how matching one is different from the other? You're just looking for exactly what you just wrote, making the portion that is doesntmatter.com the part you capture. I guess matching for anything until &quot; (not "?) can present a problem, but you do it like this in regex:
(?:(?!&quot;).)*
It essentially means:
Match the following group 0 or more times
Fail match if the following string is "&quot;"
Match any character (except new line unless DOTALL is specified)
The complete regular expression would be:
/&lt;a href=&quot;(?>(?:[^&]+|(?!&quot;).)*)&quot;&gt;/s
This is more efficient than using a non-greedy expression.
Credit to Daniel Vandersluis for reminding me of the atomic group! It fits nicely here for the sake of optimization (this pattern can never match if it has to backtrack.)
I also threw in an additional [^&]+ group to avoid repeating the negative look-ahead so many times.
Alternatively, one could use a possessive quantifier, which essentially does the same thing (your regex engine might not support it):
/&lt;a href=&quot;(?:[^&]+|(?!&quot;).)*+&quot;&gt;/s
As you can see it's slightly shorter.
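To see the difference between the lazy pattern and the negative-lookahead pattern on entity-encoded input, here is a minimal Python sketch. It assumes the input uses &lt;, &gt; and &quot; for the angle brackets and quotes, and the two sample tags are shortened stand-ins for the ones in the question. Atomic groups and possessive quantifiers are left out on purpose, because older Python versions do not support them, and so is the [^&]+ shortcut, which can backtrack heavily without them.

```python
import re

ok_tag = '&lt;a href=&quot;doesntmatter.com&quot;&gt;'
bad_tag = ('&lt;a href=&quot;doesntmatter.com&quot; '
           'onmouseover=&quot;alert(1)&quot;&gt;')

# Lazy version: .*? is allowed to grow past the first &quot; until it finds
# a &quot; that is directly followed by &gt;, so the bad tag matches as well.
lazy = re.compile(r'&lt;a href=&quot;.*?&quot;&gt;')

# Lookahead version: each consumed character must not start a &quot; entity,
# so the href value ends at the first &quot; and &gt; must follow immediately.
tempered = re.compile(r'&lt;a href=&quot;(?:(?!&quot;).)*&quot;&gt;', re.DOTALL)

print(bool(lazy.search(ok_tag)), bool(lazy.search(bad_tag)))
print(bool(tempered.search(ok_tag)), bool(tempered.search(bad_tag)))
```

The lazy pattern accepts both tags, while the lookahead pattern accepts only the tag with a lone href attribute.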

How can I retrieve a collection of values from nested HTML-like elements using RegExp?

I have a problem creating a regular expression for the following task:
Suppose we have HTML-like text of the kind:
<x>...<y>a</y>...<y>b</y>...</x>
I want to get a collection of values inside <y></y> tags located inside a given <x> tag, so the result of the above example would be a collection of two elements ["a","b"].
Additionally, we know that:
<y> tags cannot be enclosed in other <y> tags
... can include any text or other tags.
How can I achieve this with RegExp?
This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.
I'm taking your word on this:
"y" tags cannot be enclosed in other "y" tags
input looks like: <x>...<y>a</y>...<y>b</y>...</x>
and the fact that everything else is also not nested and correctly formatted. (Disclaimer: If it is not, it's not my fault.)
First, find the contents of any X tags with a loop over the matches of this:
<x[^>]*>(.*?)</x>
Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:
<y[^>]*>(.*?)</y>
Pseudo-code:
input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re = "<x[^>]*>(.*?)</x>"
y_re = "<y[^>]*>(.*?)</y>"
for each x_match in input.match_all(x_re)
    for each y_match in x_match.group(1).value.match_all(y_re)
        print y_match.group(1).value
    next y_match
next x_match
Pseudo-output:
a
b
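The pseudo-code above translates fairly directly to Python. This is a sketch under the same assumptions (no nesting, well-formed input); the function name y_values_inside_x is made up for the example.

```python
import re

X_RE = re.compile(r'<x[^>]*>(.*?)</x>', re.DOTALL)
Y_RE = re.compile(r'<y[^>]*>(.*?)</y>', re.DOTALL)

def y_values_inside_x(text):
    """Collect the contents of every <y> element found inside an <x> element."""
    values = []
    for x_body in X_RE.findall(text):          # outer pass: contents of each <x>
        values.extend(Y_RE.findall(x_body))    # inner pass: each <y> within it
    return values

result = y_values_inside_x("<x>...<y>a</y>...<y>b</y>...</x>")
print(result)
```

A <y> element outside any <x> is ignored, because the inner pattern only ever sees the text captured by the outer one.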
Further clarification in the comments revealed that there is an arbitrary amount of Y elements within any X element. This means there can be no single regex that matches them and extracts their contents.
Short and simple: Use XPath :)
It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:
String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find())
{
System.out.println(m.group(1));
}
Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.
I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGS tools (RegexBuddy/PowerGrep/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.
So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.
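For comparison, the same single-regex idea can be sketched in Python. This is an adaptation, not the Java pattern verbatim: the possessive quantifiers are dropped (older Python versions lack them) and the scanning group consumes one character at a time so that its two alternatives never overlap.

```python
import re

text = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>"

# After an opening <y>, the lookahead scans forward over any character that
# is not '<', or over a '<' that does not begin <x or </x, and then insists
# on finding </x>. That only succeeds when we are currently inside an <x>.
inside_x = re.compile(
    r'<y[^>]*>'
    r'(?=(?:[^<]|<(?!/?x\b))*</x>)'
    r'(.*?)</y>',
    re.DOTALL,
)

found = inside_x.findall(text)
print(found)
```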

Variable order regex syntax

Is there a way to indicate that two or more regex phrases can occur in any order? For instance, XML attributes can be written in any order. Say that I have the following XML:
<a href="home.php" class="link" title="Home">Home</a>
<a href="home.php" title="Home" class="link">Home</a>
How would I write a match that checks the class and title and works for both cases? I'm mainly looking for the syntax that allows me to check in any order, not just matching the class and title as I can do that. Is there any way besides just including both combinations and connecting them with a '|'?
Edit: My preference would be to do it in a single regex as I'm building it programmatically and also unit testing it.
No, I believe the best way to do it with a single RE is exactly as you describe. Unfortunately, it'll get very messy when your XML can have 5 different attributes, giving you a large number of different REs to check.
On the other hand, I wouldn't be doing this with an RE at all since they're not meant to be programming languages. What's wrong with the old fashioned approach of using an XML processing library?
If you're required to use an RE, this answer probably won't help much, but I believe in using the right tools for the job.
Have you considered xpath? (where attribute order doesn't matter)
//a[@class and @title]
Will select both <a> nodes as valid matches. The only caveat being that the input must be xhtml (well formed xml).
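A small sketch of the attribute-order-independent lookup using Python's standard xml.etree.ElementTree. Note that ElementTree implements only a subset of XPath and does not support the and operator, so the sketch chains two predicates instead, which selects the same elements here. The wrapping <div> and the third link are invented for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<div>'
    '<a href="home.php" class="link" title="Home">Home</a>'
    '<a href="home.php" title="Home" class="link">Home</a>'
    '<a href="other.php" class="link">Other</a>'
    '</div>'
)

# Chained predicates: both attributes must be present with these values,
# regardless of the order in which they were written in the source.
matches = doc.findall(".//a[@class='link'][@title='Home']")

texts = [a.text for a in matches]
print(texts)
```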
You can create a lookahead for each of the attributes and plug them into a regex for the whole tag. For example, the regex for the tag could be
<a\b[^<>]*>
If you're using this on XML you'll probably need something more elaborate. By itself, this base regex will match a tag with zero or more attributes. Then you add a lookahead for each of the attributes you want to match:
(?=[^<>]*\s+class="link")
(?=[^<>]*\s+title="Home")
The [^<>]* lets it scan ahead for the attribute, but won't let it look beyond the closing angle bracket. Matching the leading whitespace here in the lookahead serves two purposes: it's more flexible than matching it in the base regex, and it ensures that we're matching a whole attribute name. Combining them we get:
<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>
Of course, I've made some simplifying assumptions for the sake of clarity. I didn't allow for whitespace around the equals signs, for single-quotes or no quotes around the attribute values, or for angle brackets in the attribute values (which I hear is legal, but I've never seen it done). Plugging those leaks (if you need to) will make the regex uglier, but won't require changes to the basic structure.
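Here is the combined lookahead pattern exercised in Python, as a sketch under the same simplifying assumptions (double-quoted values, no whitespace around =). The sample tags are illustrative; the third one lacks a title and should be rejected.

```python
import re

tag_re = re.compile(
    r'<a\b'
    r'(?=[^<>]*\s+class="link")'   # a class attribute somewhere in this tag
    r'(?=[^<>]*\s+title="Home")'   # a title attribute somewhere in this tag
    r'[^<>]+>[^<>]+</a>'
)

class_first = '<a href="home.php" class="link" title="Home">Home</a>'
title_first = '<a href="home.php" title="Home" class="link">Home</a>'
no_title = '<a href="home.php" class="link">Home</a>'

outcomes = [bool(tag_re.search(s)) for s in (class_first, title_first, no_title)]
print(outcomes)
```

Each lookahead starts from the same position right after <a, which is why the order of the attributes does not matter.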
You could use named groups to pull the attributes out of the tag. Run the regex and then loop over the groups doing whatever tests that you need.
Something like this (untested, using .net regex syntax with the \w for word characters and \s for whitespace):
<a ((?<key>\w+)\s?=\s?['"](?<value>\w+)['"])+ />
The easiest way would be to write a regex that picks up the <a .... > part, and then write two more regexes to pull out the class and the title. Although you could probably do it with a single regex, it would be very complicated, and probably a lot more error prone.
With a single regex you would need something like
<a[^>]*((class="([^"]*)")|(title="([^"]*)"))?((title="([^"]*)")|(class="([^"]*)"))?[^>]*>
Which is just a first hand guess without checking to see if it's even valid. Much easier to just divide and conquer the problem.
A first ad hoc solution might be to do the following.
((class|title)="[^"]*?" *)+
This is far from perfect because it allows every attribute to occur more than once. I could imagine that this might be solvable with assertions. But if you just want to extract the attributes this might already be sufficient.
If you want to match a permutation of a set of elements, you could use a combination of back references and zero-width negative forward matching (negative lookahead).
Say you want to match any one of these six lines:
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-def-789-abc-0AB
You can do this with the following regex:
/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/
The back references (\1, \2) let you refer to your previous matches, and the zero-width negative forward matching ((?!...)) lets you negate a positional match, saying "don't match if the contained pattern matches at this position". Combining the two makes sure that your match is a legitimate permutation of the given elements, with each possibility occurring only once.
So, for example, in ruby:
input = <<LINES
123-abc-456-abc-789-abc-0AB
123-abc-456-abc-789-def-0AB
123-abc-456-abc-789-ghi-0AB
123-abc-456-def-789-abc-0AB
123-abc-456-def-789-def-0AB
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-abc-0AB
123-abc-456-ghi-789-def-0AB
123-abc-456-ghi-789-ghi-0AB
123-def-456-abc-789-abc-0AB
123-def-456-abc-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-def-789-abc-0AB
123-def-456-def-789-def-0AB
123-def-456-def-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-def-456-ghi-789-def-0AB
123-def-456-ghi-789-ghi-0AB
123-ghi-456-abc-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-abc-789-ghi-0AB
123-ghi-456-def-789-abc-0AB
123-ghi-456-def-789-def-0AB
123-ghi-456-def-789-ghi-0AB
123-ghi-456-ghi-789-abc-0AB
123-ghi-456-ghi-789-def-0AB
123-ghi-456-ghi-789-ghi-0AB
LINES
# outputs only the permutations
puts input.lines.grep(/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/)
For a permutation of five elements, it would be:
/1-(abc|def|ghi|jkl|mno)-
2-(?!\1)(abc|def|ghi|jkl|mno)-
3-(?!\1|\2)(abc|def|ghi|jkl|mno)-
4-(?!\1|\2|\3)(abc|def|ghi|jkl|mno)-
5-(?!\1|\2|\3|\4)(abc|def|ghi|jkl|mno)-6/x
For your example, the regex would be
/<a href="home.php" (class="link"|title="Home") (?!\1)(class="link"|title="Home")>Home<\/a>/
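Since the question mentions building the regex programmatically, the back-reference pattern can be generated for any number of elements. The following is a sketch in Python; permutation_regex and its parameters are hypothetical names, and it assumes that no element is a prefix of another (otherwise the (?!\1) guards could reject valid input).

```python
import re
from itertools import product

def permutation_regex(items, separators):
    """Build a regex matching the items in any order, each exactly once.

    separators holds len(items) + 1 literal strings: the text before the
    first slot, between consecutive slots, and after the last slot.
    """
    alternatives = "|".join(re.escape(item) for item in items)
    parts = [re.escape(separators[0])]
    for slot in range(len(items)):
        if slot:
            # Forbid repeating anything captured by the earlier groups
            earlier = "|".join("\\" + str(g) for g in range(1, slot + 1))
            parts.append("(?!" + earlier + ")")
        parts.append("(" + alternatives + ")")
        parts.append(re.escape(separators[slot + 1]))
    return re.compile("".join(parts))

three = permutation_regex(["abc", "def", "ghi"],
                          ["123-", "-456-", "-789-", "-0AB"])
candidates = ["123-%s-456-%s-789-%s-0AB" % combo
              for combo in product(["abc", "def", "ghi"], repeat=3)]
accepted = [c for c in candidates if three.fullmatch(c)]
print(len(candidates), len(accepted))

link = permutation_regex(['class="link"', 'title="Home"'],
                         ['<a href="home.php" ', ' ', '>Home</a>'])
```

The same builder covers the attribute case from the question: link accepts the tag with class and title in either order, but not a tag that repeats one of them.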