Regex to match style=' ' - html

I'm using a series of regex patterns to remove HTML elements from my code. I also need to remove the style="{stuff}" attributes that are present in the file.
At the moment I have style.*?, which matches only the word style. However, I thought that adding .*? to the regex would also match zero to unlimited characters after the style declaration.
I also have style={0,1}"{0,1}.*?"{0,1} which matches:
style=""
style="
style
But it does not match style="something". Again, in this regex I would expect the .*? to match everything between the first " and the second ", but this is not the case. What do I need to change in this regex so that it matches all of the following:
style="font-family:"Open Sans", Arial, sans-serif;background-color:rgb(255, 255, 255);display:inline !important;"
style=""
style="something"
style

The pattern style.*? matches only the word style because nothing follows the non-greedy part, so .*? matches as few characters as possible, which is zero.
You could use an optional group and a negated character class:
\bstyle(?:="[^"]*")?
In parts
\bstyle Word boundary, match style
(?: Non capturing group
=" Match = and opening "
[^"]* Match any char except " 0+ times using a negated character class
" Match closing "
)? Close group and make it optional
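As a quick check, the pattern above can be run against the sample strings from the question. This is a minimal sketch in Python; the choice of Python's re module and the names STYLE_ATTR, samples and matched are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern from the answer: "style", optionally followed by ="...".
STYLE_ATTR = re.compile(r'\bstyle(?:="[^"]*")?')

samples = ['style=""', 'style="something"', 'style']

# Each sample should be matched in full by the pattern.
matched = [STYLE_ATTR.fullmatch(s) is not None for s in samples]
print(matched)

# A value containing unescaped inner double quotes is only matched
# up to the first inner quote, because [^"]* cannot cross a ".
nested = STYLE_ATTR.match('style="font-family:"Open Sans", Arial;"').group(0)
print(nested)
```

Note that [^"]* stops at the first inner double quote, so the font-family sample from the question is only matched up to that quote.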
If you want to match single or double quotes together with the matching closing quote, so that a mismatched pair such as style="' is not matched, you could use a capturing group (["']) with a backreference \1 to what was captured in group 1. The value is matched with a lazy .*? so that it stops at the first closing quote of the same kind:
\bstyle(?:=(["']).*?\1)?

Here's what I cooked up. It uses positive lookbehind (?<=...) and lookahead (?=...) to ensure that the found match is inside an HTML tag:
(?<=<[a-zA-Z][^<>]*?)\sstyle(?:="[^"]*")?(?=[\s>])(?=[^<>]*>)
It will match any whitespace before the "style", so that a removal of all matches goes from <a stuff="..." style="width:18px;" href="someurl"> to <a stuff="..." href="someurl"> without leaving a double space behind where it was removed.
Note that some regex engines (like Python's re module) don't support variable-length lookbehind. This can be solved by changing the first and last parts, the lookbehind and the final lookahead, into capture groups instead, thereby capturing the whole HTML tag. Then you replace the match with $1$2 instead of an empty string, which puts back everything that was matched except the style="..." part.
The resulting regex for that would be:
(<[a-zA-Z][^<>]*?)\sstyle(?:="[^"]*")?(?=[\s>])([^<>]*>)
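Since Python is the engine mentioned above, here is a minimal sketch of the capture-group variant in Python. The names TAG_STYLE and strip_style are hypothetical, and \1\2 is Python's spelling of the $1$2 replacement; treat it as an illustration under those assumptions rather than a complete HTML cleaner.

```python
import re

# Capture-group variant from the answer; no lookbehind is needed.
TAG_STYLE = re.compile(
    r'(<[a-zA-Z][^<>]*?)\sstyle(?:="[^"]*")?(?=[\s>])([^<>]*>)'
)

def strip_style(html: str) -> str:
    """Remove a style attribute from inside a tag, keeping the rest of the tag."""
    # Group 1 is the tag up to the attribute, group 2 is the remainder.
    return TAG_STYLE.sub(r'\1\2', html)

before = '<a stuff="..." style="width:18px;" href="someurl">'
after = strip_style(before)
print(after)

# The word "style" in running text, outside a tag, is left alone.
mixed = strip_style('a style guide <p style="x">')
print(mixed)
```

Each substitution consumes the whole tag, so a tag that carries more than one style attribute would need the substitution to be repeated.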

Related

RegEx replace only occurrences outside of <h> html tags

I would like to regex replace Plus in the below text, but only when it's not wrapped in a header tag:
<h4 class="Somethingsomething" id="something">Plus plan</h4>The <b>Plus</b> plan starts at $14 per person per month and comes with everything from Basic.
In the above I would like to replace the second "Plus" but not the first.
My regex attempt so far is:
(?!<h\d*>)\bPlus\b(?!<\\h>)
Meaning:
Do not match the following if it comes after <h, one digit and zero or more characters, and ends with a closing <\h>
Match "Plus" only when it is surrounded by spaces or other white space
However - this captures both occurrences. Can someone point out my mistake and correct this?
I want to use this in VBA but should be a general regex question, as far as I understand.
You can use
\bPlus\b(?![^>]*<\/h\d+>)
To use the match inside the replacement pattern, use the $& backreference in your VBA code.
Details:
\bPlus\b - a whole word Plus
(?![^>]*<\/h\d+>) - a negative lookahead that fails the match if, immediately to the right of the current location, there are
[^>]* - zero or more chars other than >
<\/h - </h string
\d+ - one or more digits
> - a > char.
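To see the lookahead in action, the pattern can be tried on the sample from the question. This is a rough sketch in Python rather than VBA; the replacement word "Premium" and the variable names are arbitrary choices for the example.

```python
import re

# Pattern from the answer: whole word "Plus", unless a closing </hN>
# tag follows before any ">" character.
NOT_IN_HEADING = re.compile(r'\bPlus\b(?![^>]*</h\d+>)')

text = ('<h4 class="Somethingsomething" id="something">Plus plan</h4>'
        'The <b>Plus</b> plan starts at $14 per person per month '
        'and comes with everything from Basic.')

# "Premium" is an arbitrary replacement chosen for this example.
result = NOT_IN_HEADING.sub('Premium', text)
print(result)
```

Only the occurrence inside <b> is replaced; the one inside the <h4> heading is skipped because " plan</h4>" follows it without an intervening ">".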

Sublime Text regex to find and replace whitespace between two xml or html tags?

I'm using Sublime Text and I need to come up with a regex that will find the whitespaces between a certain opening and closing tag and replace them with commas.
Example: Replace white space in
<tags>This is an example</tags>
so it becomes
<tags>This,is,an,example</tags>
Thanks!
You just have to use a simple regex like:
\s+
And replace it with a comma.
This will find instances of
<tags>...</tags>
with whitespace between the tags
(<tags>\S+)\s(.+</tags>)
This will replace the first whitespace with a comma
\1,\2
Open Find and Replace [OS X Cmd+Opt+F :: Windows Ctrl+H]
Use the two values above to find and replace and use the 'Replace All' option. Repeat until all the whitespaces are converted to commas.
The best answer is probably a quick script but this will get you there fairly fast without needing to do any coding.
You can replace any one or more whitespace chunks in between two tags using a single regular expression:
(?s)(?:\G(?!\A)|<tags>(?=.*?</tags>))(?:(?!</?tags>).)*?\K\s+
Details:
(?s) - a DOTALL inline modifier, makes . match line breaks
(?:\G(?!\A)|<tags>(?=.*?</tags>)) - either the end of the previous successful match (\G(?!\A)) or (|) <tags> substring that is immediately followed with any zero or more chars, as few as possible and then </tags> (see (?=.*?</tags>))
(?:(?!</?tags>).)*? - any char that does not start a <tags> or </tags> substrings, zero or more occurrences but as few as possible
\K - match reset operator
\s+ - one or more whitespaces (NOTE: use \s if each whitespace must be replaced).
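Outside Sublime Text, the same result can be reached without \G and \K, which Python's re module does not support. The sketch below swaps in a different technique: it matches each <tags>...</tags> element and rewrites the whitespace inside it with a callback. The function name commas_inside_tags and the sample string are assumptions for illustration.

```python
import re

# Match one <tags>...</tags> element; DOTALL lets the content span lines.
TAGS_BLOCK = re.compile(r'(<tags>)(.*?)(</tags>)', re.DOTALL)

def commas_inside_tags(text: str) -> str:
    """Replace each whitespace run between <tags> and </tags> with a comma."""
    def fix(m):
        # Only the element content (group 2) is rewritten.
        return m.group(1) + re.sub(r'\s+', ',', m.group(2)) + m.group(3)
    return TAGS_BLOCK.sub(fix, text)

sample = 'keep this <tags>This is an example</tags> and this'
converted = commas_inside_tags(sample)
print(converted)
```

Whitespace outside the element is left untouched, which is the same guarantee the \G-based pattern gives in a single pass.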

Regular Expression to match certain src and href in HTML file

I am building a script to scan through HTML files and replace all 'src' and 'href' attributes under certain conditions. Here is the regex I have right now - (href|src)=["|'](.*?)["|'].
What I am not sure about is how to expand the (.*?) so that it excludes values that contain mailto: or https://, or that do not contain http://www.google.co.uk, for example.
The basic idea of this script is to replace all assets not covered by SSL and put them under an SSL secured URL.
Does anyone know how this can be achieved?
Many thanks.
Here is your expression with a number of tweaks for improved syntax:
(?:href|src)=(["'])(?!mailto|https).*?\1
I assume you don't need to capture the href or src into their own capture group, so a non-capturing group will do: (?:
We remove the | from the character class for the opening quotes as it does not mean OR
We capture the opening quote into Group 1 with (["']), which enables us to ensure that the closing quote is the same type by using the back-reference \1. Otherwise your expression would match src="http://google.com' (double quote and single quote = unbalanced)
Note the change in parentheses in what follows. The negative lookahead does not need to be part of a capture Group.
The lazy dot star .*? presumably does not need to be in a capture group
The \1 refers to capture Group 1, that is to say the content of the first capturing parentheses, i.e., either a single or a double quote, ensuring that we match the same kind of quote at the beginning and at the end.
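The points above can be checked with a short Python sketch. The candidate strings, including the example.com address, are made up for the test, and the names INSECURE_URL, candidates and hits are hypothetical.

```python
import re

# Pattern from the answer: the quote is captured in group 1 and reused
# as \1; the negative lookahead rejects mailto and https values.
INSECURE_URL = re.compile(r'''(?:href|src)=(["'])(?!mailto|https).*?\1''')

candidates = [
    'src="http://google.co.uk"',
    "src='/css/test.css'",
    'href="https://google.co.uk"',
    'href="mailto:someone@example.com"',
    '''src="http://google.com' ''',   # unbalanced quotes
]

# Keep only the attributes that the pattern matches.
hits = [c for c in candidates if INSECURE_URL.search(c)]
print(hits)
```

The https and mailto values are rejected by the lookahead, and the unbalanced one is rejected because \1 requires the same quote character that opened the value.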
OK after a little bit more research I found the answer to this. My regex has been expanded to the below.
(href|src)=["|']((?!mailto|https).*?)["|']. Examples below -
src="http://google.co.uk" > match
src='http://google.co.uk' > match
src="/css/test.css" > match
src='/css/test.css' > match
src="css/test.css" > match
src='css/test.css' > match
src="https://google.co.uk" > no match
src='https://google.co.uk' > no match
src="mailto:test@google.co.uk" > no match
src='mailto:test@google.co.uk' > no match

Regex matching Google Cache url (matching entire href parameter when it contains a word)

Disclaimer: I know that html and regex should not stand together, but this is an exceptional case.
I need to parse Google Search results and extract cache urls. I have this in the page:
<a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:
gsNKb7ku3ewJ:somedata&ei=MyIIUtrZAcPX7AaVzIHwDg&ved=0CB8QIDAC&usg
=AFQjCNGcnWfdzQiTKwyAMmI-M-xzxII5Ag">Cached</a>
I tried simple stuff like: href=[\'"]?([^\'" >]+) but it is not what I need. I want to extract a single parameter (q) from the href. I need to get:
http://webcache.googleusercontent.com/search%3Fq%3Dcache:gsNKb7ku3ewJ:somedata
So everything between "url?q=" and first "&", when the contents contain word "webcache" in it.
If your language supports positive look-behinds:
(?<=q=).*?(?=[&"])
Otherwise match group \1 with this expression:
(?:q=)(.*?)(?=[&"])
Explanation:
.*? is the body of our expression. Just match everything, but don't be greedy!
(?<=q=) is a positive look-behind, which says "q=" should come before the match
(?=[&"]) is a positive look ahead, which says "either & or a quote should come after the match"
Because we make it not greedy with the ?, it'll stop at the first quote or ampersand. Otherwise it'd match all of the way to the closing quote.
Use a look behind before, and a look ahead at the end to assert the surrounding text, and include the keyword in the regex:
(?<=url\?q=)[^&]*webcache[^&]*(?=&)
Using [^&]* ensures that the keyword occurs before an & - within the target string.
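Here is a small Python sketch of that last pattern applied to the link from the question. It assumes the href is really on one line (the line breaks in the question are treated as display wrapping), and the variable names are illustrative only.

```python
import re

# Fixed-width lookbehind for "url?q=", keyword "webcache" somewhere
# before the first "&", and a lookahead for that "&".
CACHE_URL = re.compile(r'(?<=url\?q=)[^&]*webcache[^&]*(?=&)')

html = ('<a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:'
        'gsNKb7ku3ewJ:somedata&ei=MyIIUtrZAcPX7AaVzIHwDg&ved=0CB8QIDAC&usg'
        '=AFQjCNGcnWfdzQiTKwyAMmI-M-xzxII5Ag">Cached</a>')

m = CACHE_URL.search(html)
cache_url = m.group(0) if m else None
print(cache_url)

# A link without the keyword is not matched at all.
other = CACHE_URL.search('<a href="/url?q=http://example.com/&ei=1">x</a>')
print(other)
```

Because the lookbehind has a fixed width, this form also works in engines that reject variable-length lookbehind.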

regex to ignore duplicate matches

I'm using an application to search this website that I don't have control of right this moment and was wondering if there is a way to ignore duplicate matches using only regex.
Right now I use this pattern to retrieve the image srcs from the page's source code:
<span> <img id="imgProduct.*? src="/(.*?)" alt="
from this
<span> <img id="imgProduct_1" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want1.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_2" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want2.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_3" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want3.jpg" alt="woohee"> </span>
the only problem is that the exact same code listed above is duplicated way lower in the source. Is there a way to ignore or delete the duplicates using only regex?
Your pattern's not very good; it's way too specific to your exact source code as it currently exists. As @Truth commented, if that changes, you'll break your pattern. I'd recommend something more like this:
<img[^>]*src=['"]([^'"]*)['"]
That will match the contents of any src attribute inside any <img> tag, no matter how much your source code changes.
To prevent duplicates with regex, you'll need lookahead, and this is likely to be very slow. I do not recommend using regex for this. This is just to show that you could, if you had to. The pattern you would need is something like this (I tested this using Notepad++'s regex search, which uses the Boost regex library and is more robust than JavaScript's, but I'm reasonably sure that JavaScript's regex parser can handle this).
<img[^>]*src=['"]([^'"]*)['"](?!(?:.|\s)*<img[^>]*src=['"]\1['"])
You'll then get a match for the last instance of every src.
The Breakdown
For illustration, here's how the pattern works:
<img[^>]*src=['"]([^'"]*)['"]
This makes sure that we are inside a <img> tag when src comes up, and then makes sure we match only what is inside the quotes (which can be either single or double quotes; since neither is a legal character in a filename anyway we don't have to worry about mixing quote types or escaped quotes).
(?!
(?:
.
|
\s
)*
<img[^>]*src=['"]\1['"]
)
The (?! starts a negative lookahead: we are requiring that the following pattern cannot be matched after this point.
Then (?:.|\s)* matches any character or any whitespace. This is because JavaScript's . will not match a newline, while \s will. Mostly, I was lazy and didn't want to write out a pattern for any possible line ending, so I just used \s. The *, of course, means we can have any number of these. That means that the following (still part of the negative lookahead) cannot be found anywhere in the rest of the file. The (?: instead of ( means that this parenthetical isn't going to be remembered for backreferences.
That bit is <img[^>]*src=['"]\1['"]. This is very similar to the initial pattern, but instead of capturing the src with ([^'"]*), we're referencing the previously-captured src with \1.
Thus the pattern is saying "match any src in an img that does not have any img with the same src anywhere in the rest of the file," which means you only get the last instance of each src and no duplicates.
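The "last instance only" behaviour can be tried out with a short Python sketch. One deliberate change, an assumption on my part rather than the answer's pattern: (?:.|\s)* is replaced by .* under re.DOTALL, which matches the same text but avoids the redundant backtracking caused by . and \s both matching a space. The sample page is shortened from the question.

```python
import re

# Match an img src only if no later img has the same src.
LAST_SRC = re.compile(
    r'''<img[^>]*src=['"]([^'"]*)['"](?!.*<img[^>]*src=['"]\1['"])''',
    re.DOTALL,
)

block = (
    '<span> <img id="imgProduct_1" class="SmPrdImg selected"\n'
    'onclick="(some javascript);" src="the_src_I_want1.jpg" alt="woohee"> </span>\n'
    '<span> <img id="imgProduct_2" class="SmPrdImg selected"\n'
    'onclick="(some javascript);" src="the_src_I_want2.jpg" alt="woohee"> </span>\n'
)
# The same markup appears twice, as described in the question.
page = block + '<p>other markup</p>\n' + block

unique_srcs = LAST_SRC.findall(page)
print(unique_srcs)
```

The matches come from the second copy of the markup, since the images in the first copy each have a later duplicate.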
If you want to remove all instances of any img whose src appears more than once, I think you're out of luck, by the way. JavaScript engines before ES2018 do not support lookbehind, and most regex engines that do support it wouldn't allow such a complicated, variable-length lookbehind anyway.
I wouldn't work too hard to make them unique, just do that in the PHP following the preg match with array_unique:
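// Collect every src first, then drop duplicates in PHP rather than in the regex.
$pattern = '~<span> <img id="imgProduct.*? src="/(.*?)" alt="~is';
$match = preg_match_all($pattern, $html, $matches);
if ($match)
{
    // $matches[1] holds the captured src values (group 1).
    $matches = array_unique($matches[1]);
}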
If you are using JavaScript, then you'd need to use another function instead of array_unique, check PHPJS:
http://phpjs.org/functions/array_unique:346
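For comparison, the same "match everything, then de-duplicate in code" idea might look like this in Python. This is a sketch under assumptions: it uses the more general img pattern from the other answer rather than the one in the question, and dedupe_srcs and the sample markup are hypothetical.

```python
import re

# General pattern: the src value of any <img> tag.
IMG_SRC = re.compile(r'''<img[^>]*src=['"]([^'"]*)['"]''')

def dedupe_srcs(html: str) -> list:
    """Return each src once, in order of first appearance."""
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return list(dict.fromkeys(IMG_SRC.findall(html)))

page = ('<img id="a" src="one.jpg" alt="">'
        '<img id="b" src="two.jpg" alt="">'
        '<img id="c" src="one.jpg" alt="">')
srcs = dedupe_srcs(page)
print(srcs)
```

Unlike the lookahead approach, this keeps the first occurrence of each src and does a single linear pass over the matches.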