regex to ignore duplicate matches - html

I'm using an application to search this website that I don't have control of right this moment and was wondering if there is a way to ignore duplicate matches using only regex.
Right now I wrote this to get matches for the image source in the pages source code
uses this to retrieve srcs
<span> <img id="imgProduct.*? src="/(.*?)" alt="
from this
<span> <img id="imgProduct_1" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want1.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_2" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want2.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_3" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want3.jpg" alt="woohee"> </span>
the only problem is that the exact same code listed above is duplicated way lower in the source. Is there a way to ignore or delete the duplicates using only regex?

Your pattern's not very good; it's way too specific to your exact source code as it currently exists. As #Truth commented, if that changes, you'll break your pattern. I'd recommend something more like this:
<img[^>]*src=['"]([^'"]*)['"]
That will match the contents of any src attribute inside any <img> tag, no matter how much your source code changes.
To prevent duplicates with regex, you'll need lookahead, and this is likely to be very slow. I do not recommend using regex for this. This is just to show that you could, if you had to. The pattern you would need is something like this (I tested this using Notepad++'s regex search, which is based on PCRE and more robust than JavaScript's, but I'm reasonably sure that JavaScript's regex parser can handle this).
<img[^>]*src=['"]([^'"]*)['"](?!(?:.|\s)*<img[^>]*src=['"]\1['"])
You'll then get a match for the last instance of every src.
The Breakdown
For illustration, here's how the pattern works:
<img[^>]*src=['"]([^'"]*)['"]
This makes sure that we are inside a <img> tag when src comes up, and then makes sure we match only what is inside the quotes (which can be either single or double quotes; since neither is a legal character in a filename anyway we don't have to worry about mixing quote types or escaped quotes).
(?!
(?:
.
|
\s
)*
<img[^>]*src=['"]\1['"]
)
The (?! starts a negative lookahead: we are requiring that the following pattern cannot be matched after this point.
Then (?:.|\s)* matches any character or any whitespace. This is because JavaScript's . will not match a newline, while \s will. Mostly, I was lazy and didn't want to write out a pattern for any possible line ending, so I just used \s. The *, of course, means we can have any number of these. That means that the following (still part of the negative lookahead) cannot be found anywhere in the rest of the file. The (?: instead of ( means that this parenthetical isn't going to be remembered for backreferences.
That bit is <img[^>]*src=['"]\1['"]. This is very similar to the initial pattern, but instead of capturing the src with ([^'"]*), we're referencing the previously-captured src with \1.
Thus the pattern is saying "match any src in an img that does not have any img with the same src anywhere in the rest of the file," which means you only get the last instance of each src and no duplicates.
If you want to remove all instances of any img whose src appears more than once, I think you're out of luck, by the way. JavaScript does not support lookbehind, and the overwhelming majority of regex engines that do wouldn't allow such a complicated lookbehind anyway.

I wouldn't work too hard to make them unique, just do that in the PHP following the preg match with array_unique:
$pattern = '~<span> <img id="imgProduct.*? src="/(.*?)" alt="~is';
$match = preg_match_all($pattern, $html, $matches);
if ($match)
{
$matches = array_unique($matches[1]);
}
If you are using JavaScript, then you'd need to use another function instead of array_unique, check PHPJS:
http://phpjs.org/functions/array_unique:346

Related

How to set RegEx global flag in HTML pattern attribute of the input element?

I want to client validate a form input (username + password) before sending it to the server (php).
Therefore I applied the pattern attribute in the input tag.
I came up with a RegEx expression that does the job on the server side:
(preg_match_all('/^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}$/', $_POST['username']) == 0)
thereby the global flag is set using preg_match_all (instead of preg_match).
Now I wanted to implement the same RegEx in my pattern attribute in the HTML form.
HTML standard defines that RegEx in the pattern attribute follows RegEx in JavaScript, which devides the expression into "pattern, flags" divided by a comma. I would translate that into HTML like this:
pattern="^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}$,g"
That doesn't work.
All JavaScript RegEx validators I have found enclose the pattern into slashes:
/^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}$/
and say, that the global flag would be behind the last slash:
pattern="/^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}$/g"
That doesn't work either.
Mozilla also states in their developer guide (I also read it elsewhere):
No forward slashes should be specified around the pattern text.
So, how can I get the global flag into the pattern attribute of the input element?
There are a couple of facts you should be aware when using pattern attribute regex:
There is no need to use g flag, the whole string must match the regex, and the regex check will only be performed once, a single match is enough
There is no need wrapping the pattern with regex delimiters, and if you add slashes at the start and end, they will be treated as literal slashes making part of the regex pattern, and in 99.9% of cases that would ruin the regex
You do not even need ^ and $ anchors as the pattern regex must match the entire string input. In fact, the pattern is automatically enclosed with ^(?: and )$, so if you use pattern="^\d+$" (just a quick example), the final regex (in Chrome, e.g.) will look like /^(?:^\d+$)$/u, which looks rather redundant.
So, all you need is
pattern="^[a-zA-Z0-9. _äöüßÄÖÜ#-]{1,50}"
// Or even
pattern="^[\w. äöüßÄÖÜ#-]{1,50}"
Note that [A-Za-z0-9_] = \w in JavaScript regex.

Search and replace outer tag in Atom using REGEX

Using Atom, I'm trying to replace the outer tag structure for multiple different texts within a document. Also using REGEX, which I'm not versed enough to come up with my own solution
HTML to be searched <span class="klass">Any text string</span>
Replace it with <code>Any text string</code>
My REGEX (<?span class="klass">)+[\w]+(<?/span>)
Is there a wildcard to "keep" the [\w] part into the replaced result?
You can use a capture group to capture the text in between the <span> tags during the match, and then use it to build the <code> output you want. Try the following find and replace:
Find:
<span class="klass">(.*?)</span>
Replace:
<code>$1</code>
Here $1 represents the quantity (.*?) which we captured in the search. One other point, we use .*? when capturing between tags as opposed to just .*. The former .*? is a "lazy" or tempered dot. This tells the engine to stop matching upon hitting the first closing </span> tag. Without this, the match would be greedy and would consume as much as possible, ending only with the final </span> tag in your text.

Using Regex to find "<img .../>" and "<script ...> </script>" in HTML string

I am trying to use Regular Expressions for the first time to search for images and scripts in webpages in Scala. The expressions I've come up with are
Images:
/(<img\S+\s+\/>)+/
Scripts:
/(<script\s+\S+><\/script>)+/
I don't really know anything about HTML code or using Regex so I'm not sure what I need in order to specify that it should match <img .../> where the ... could be any amount of characters or whitespace. This is just a small part of a programming assignment I'm writing in Scala and we have to use Regex.
A regex like <img[^>]*> would match <img..........>.
A regex like <script.*?</script> would match a single <script...>...</script> instance. The ? is necessary to prevent it from matching everything from the first <script...> tag to the last </script> tag.
(Feel free to add back in the capturing ( )'s, the \ escapes, and surround with the regex delimiting / / tokens. I removed them to focus on the regular expressions themselves, without the leaning toothpick syndrome and other noise.)
While these are better than the ones you proposed, they will still break in many circumstances. RegEx is not designed to parse HTML.
<script>
<!-- This "</script>" doesn't end the script, but fools the RegEx -->
</script>

Regex matching Google Cache url (matching entire href parameter when it contains a word)

Disclaimer: I know that html and regex should not stand together, but this is an exceptional case.
I need to parse Google Search results and extract cache urls. I have this in the page:
<a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:
gsNKb7ku3ewJ:somedata&ei=MyIIUtrZAcPX7AaVzIHwDg&ved=0CB8QIDAC&usg
=AFQjCNGcnWfdzQiTKwyAMmI-M-xzxII5Ag">Cached</a>
I tried simple stuff like: href=[\'"]?([^\'" >]+) but it is not what I need. I want to extract a single parameter (q) from the href. I need to get:
http://webcache.googleusercontent.com/search%3Fq%3Dcache:gsNKb7ku3ewJ:somedata
So everything between "url?q=" and first "&", when the contents contain word "webcache" in it.
If your language supports positive look-behinds:
(?<=q=).*?(?=[&"])
Otherwise match group \1 with this expression:
(?:q=)(.*?)(?=[&"])
Explanation:
.*? is the body of our expression. Just match everything, but don't be greedy!
(?<=q=) is a positive look-behind, which says "q=" should come before the match
(?=[&"]) is a positive look ahead, which says "either & or a quote should come after the match"
Because we make it not greedy with the ?, it'll stop at the first quote or ampersand. Otherwise it'd match all of the way to the closing quote.
Use a look behind before, and a look ahead at the end to assert the surrounding text, and include the keyword in the regex:
(?<=url\?q=)[^&]*webcache[^&]*(?=&)
Using [^&]* ensures that the keyword occurs before an & - within the target string.

Variable order regex syntax

Is there a way to indicate that two or more regex phrases can occur in any order? For instance, XML attributes can be written in any order. Say that I have the following XML:
Home
Home
How would I write a match that checks the class and title and works for both cases? I'm mainly looking for the syntax that allows me to check in any order, not just matching the class and title as I can do that. Is there any way besides just including both combinations and connecting them with a '|'?
Edit: My preference would be to do it in a single regex as I'm building it programatically and also unit testing it.
No, I believe the best way to do it with a single RE is exactly as you describe. Unfortunately, it'll get very messy when your XML can have 5 different attributes, giving you a large number of different REs to check.
On the other hand, I wouldn't be doing this with an RE at all since they're not meant to be programming languages. What's wrong with the old fashioned approach of using an XML processing library?
If you're required to use an RE, this answer probably won't help much, but I believe in using the right tools for the job.
Have you considered xpath? (where attribute order doesn't matter)
//a[#class and #title]
Will select both <a> nodes as valid matches. The only caveat being that the input must be xhtml (well formed xml).
You can create a lookahead for each of the attributes and plug them into a regex for the whole tag. For example, the regex for the tag could be
<a\b[^<>]*>
If you're using this on XML you'll probably need something more elaborate. By itself, this base regex will match a tag with zero or more attributes. Then you add a lookhead for each of the attributes you want to match:
(?=[^<>]*\s+class="link")
(?=[^<>]*\s+title="Home")
The [^<>]* lets it scan ahead for the attribute, but won't let it look beyond the closing angle bracket. Matching the leading whitespace here in the lookahead serves two purposes: it's more flexible than matching it in the base regex, and it ensure that we're matching a whole attribute name. Combining them we get:
<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>
Of course, I've made some simplifying assumptions for the sake of clarity. I didn't allow for whitespace around the equals signs, for single-quotes or no quotes around the attribute values, or for angle brackets in the attribute values (which I hear is legal, but I've never seen it done). Plugging those leaks (if you need to) will make the regex uglier, but won't require changes to the basic structure.
You could use named groups to pull the attributes out of the tag. Run the regex and then loop over the groups doing whatever tests that you need.
Something like this (untested, using .net regex syntax with the \w for word characters and \s for whitespace):
<a ((?<key>\w+)\s?=\s?['"](?<value>\w+)['"])+ />
The easiest way would be to write a regex that picks up the <a .... > part, and then write two more regexes to pull out the class and the title. Although you could probably do it with a single regex, it would be very complicated, and probably a lot more error prone.
With a single regex you would need something like
<a[^>]*((class="([^"]*)")|(title="([^"]*)"))?((title="([^"]*)")|(class="([^"]*)"))?[^>]*>
Which is just a first hand guess without checking to see if it's even valid. Much easier to just divide and conquer the problem.
An first ad hoc solution might be to do the following.
((class|title)="[^"]*?" *)+
This is far from perfect because it allows every attribute to occur more than once. I could imagine that this might be solveable with assertions. But if you just want to extract the attributes this might already be sufficent.
If you want to match a permutation of a set of elements, you could use a combination of back references and zero-width
negative forward matching.
Say you want to match any one of these six lines:
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-def-789-abc-0AB
You can do this with the following regex:
/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/
The back references (\1, \2), let you refer to your previous matches, and the zero
width forward matching ((?!...) ) lets you negate a positional match, saying don't match if the
contained matches at this position. Combining the two makes sure that your match is a legit permutation
of the given elements, with each possibility only occuring once.
So, for example, in ruby:
input = <<LINES
123-abc-456-abc-789-abc-0AB
123-abc-456-abc-789-def-0AB
123-abc-456-abc-789-ghi-0AB
123-abc-456-def-789-abc-0AB
123-abc-456-def-789-def-0AB
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-abc-0AB
123-abc-456-ghi-789-def-0AB
123-abc-456-ghi-789-ghi-0AB
123-def-456-abc-789-abc-0AB
123-def-456-abc-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-def-789-abc-0AB
123-def-456-def-789-def-0AB
123-def-456-def-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-def-456-ghi-789-def-0AB
123-def-456-ghi-789-ghi-0AB
123-ghi-456-abc-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-abc-789-ghi-0AB
123-ghi-456-def-789-abc-0AB
123-ghi-456-def-789-def-0AB
123-ghi-456-def-789-ghi-0AB
123-ghi-456-ghi-789-abc-0AB
123-ghi-456-ghi-789-def-0AB
123-ghi-456-ghi-789-ghi-0AB
LINES
# outputs only the permutations
puts input.grep(/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/)
For a permutation of five elements, it would be:
/1-(abc|def|ghi|jkl|mno)-
2-(?!\1)(abc|def|ghi|jkl|mno)-
3-(?!\1|\2)(abc|def|ghi|jkl|mno)-
4-(?!\1|\2|\3)(abc|def|ghi|jkl|mno)-
5-(?!\1|\2|\3|\4)(abc|def|ghi|jkl|mno)-6/x
For your example, the regex would be
/<a href="home.php" (class="link"|title="Home") (?!\1)(class="link"|title="Home")>Home<\/a>/