How to negate this regex - html

The following regex checks whether a P. O. Box has been entered in the text box:
\b[P|p](OST|ost)?[.\s-]+[O|o](FFICE|ffice)?[.\s-]+[B|b](OX|ox)\b
I want to negate it, so that validation passes only when the user has not entered a P. O. Box in the text box.
I know this can also be done in JavaScript, but my platform has a different form structure: it is a Demandware form, where the regex is supplied as a field attribute.
We can submit a regex in this field and it will validate automatically. Any ideas?

I recommend not attempting to do this with a regular expression. It is way too easy to defeat [1]. Instead, you need to outsource the problem to someone who knows what they are doing. In this case, since you are dealing with US addresses only, that is the USPS.
So, you should use the USPS address standardization/verification API. You can submit an address, and it will return a "cleaned" version of that address and tell you whether or not the address is valid. If it is a post office box, it will return it in a standardized format, so you no longer need a regular expression that can be defeated, only a simple string match. As an added bonus, you get a standardized and validated representation of the delivery address, reducing [2] the possibility of error.
I recognize I am sidestepping your actual engineering question. But part of engineering is abandoning solutions that are the wrong path. You need to validate addresses. So validate addresses rather than trying to build a state machine that can detect some inputs that represent post office boxes but will fail on others. And the USPS provides a validation service and they are the authoritative experts here.
[1] I am not saying that you'll face adversaries, just that you'll face all the creative, sloppy, lazy ways that people have of entering their addresses.
[2] But not eliminating.

You need to use a negative lookahead: (?!pattern).
In this case:
(?!\b[Pp](OST|ost)?[.\s-]+[Oo](FFICE|ffice)?[.\s-]+[Bb](OX|ox)\b)
For reference:
Regex lookarounds

You can use this pattern:
^(?:[^p]+|\Bp+|p(?!(?:ost)?[.\s-]+o(?:ffice)?[.\s-]+box\b))+$
The idea is to test only the substrings that begin with "p", for better performance. To make this check case-insensitive, you can add the (?i) modifier to the pattern:
^(?i)(?:[^p]+|\Bp+|p(?!(?:ost)?[.\s-]+o(?:ffice)?[.\s-]+box\b))+$

If the regex flavor is JavaScript's, then you can use negative look-ahead:
^(?!.*?\b[Pp](OST|ost)?[.\s-]+[Oo](FFICE|ffice)?[.\s-]+[Bb](OX|ox)\b)
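As a rough illustration of how the anchored negative look-ahead behaves, here is a minimal Python sketch. The function name is_acceptable and the sample addresses are made up for the example, and the capturing groups are written as non-capturing ones, which does not change what is matched. Python's re module stands in for whatever engine the form actually uses.

```python
import re

# Anchored negative look-ahead: succeeds only if no P. O. Box token
# appears anywhere in the string.
NOT_PO_BOX = re.compile(
    r"^(?!.*?\b[Pp](?:OST|ost)?[.\s-]+[Oo](?:FFICE|ffice)?[.\s-]+[Bb](?:OX|ox)\b)"
)


def is_acceptable(address: str) -> bool:
    """Return True when the address does not look like a P. O. Box."""
    return NOT_PO_BOX.match(address) is not None


for sample in ["123 Main Street", "P.O. Box 123", "Post Office Box 9", "PO Box 123"]:
    print(sample, "->", is_acceptable(sample))
```

Note that "PO Box 123" is accepted, because the pattern requires at least one separator between the P and the O. That is the kind of input the first answer warns about.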

Related

XSS validation: Is it safe to check only if value contains <> and %3C %3E

I've been doing server side XSS validation. Here is what I found to use:
List of forbidden attributes: javascript:,mocha:,eval(,alert(,vbscript:,livescript:,expression(,url(,&{,&#,/*,*/,onclick,oncontextmenu,ondblclick,onmousedown,onmouseenter,onmouseleave,onmousemove,onmouseover,onmouseout,onmouseup,onkeydown,onkeypress,onkeyup,onblur,onchange,onfocus,onfocusin,onfocusout,oninput,oninvalid,onreset,onsearch,onselect,onsubmit,ondrag,ondragend,ondragenter,ondragleave,ondragover,ondragstart,ondrop,oncopy,oncut,onpaste,ontouchcancel,ontouchend,ontouchmove,ontouchstart,onwheel
However, this list seems too strict, since some legitimate values such as "™" are rejected when I use it.
Would it be safe, as far as preventing XSS attacks goes, to check only that the value doesn't contain any of '<', '>', "%3E", or "%3C"?
You do not need to block any of those characters. It is a lesson I learned long ago: blocking them mostly just cuts out support for other languages, and you can work with these characters safely.
Imagine a chat box, since a textarea is where such input shows up most often. A user can send <html></html>, and if the message isn't handled properly, every user receiving it will have a new HTML document opened inside the chat (scary).
Fortunately, this can be handled on the client side by rendering the message as text only.
When dealing with logins, you could check the username against a regex; if that passes, look the username up in the database; if a match is found, hash the submitted password and compare it with the stored hash; if they don't match, reject the login.
With proper encoding/decoding (escaping practices and the like), I've never needed to block characters or treat anything specially. Maybe this will help your search.
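To make the "escape instead of block" idea concrete, here is a minimal Python sketch using the standard library's html.escape. The function name render_chat_message and the surrounding <p> wrapper are assumptions made for the example. It only covers text placed in HTML element content, not other contexts such as URLs or inline scripts.

```python
from html import escape


def render_chat_message(text: str) -> str:
    """Escape user text so the browser shows it as text, not markup."""
    # escape() rewrites &, <, > and, with quote=True, both quote characters.
    return "<p>" + escape(text, quote=True) + "</p>"


print(render_chat_message("<html></html>"))
print(render_chat_message("Acme™ & Sons"))
```

Nothing is rejected here: the trademark sign passes through unchanged, while the angle brackets are turned into entities.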

MySQL regex to validate email not working - curly brace quantifier ignored

I'm trying to use the following regular expression to validate emails in a MySQL database:
^[^@]+@[^@]+\.[^@]{2,}$
in a condition like this:
...and email REGEXP '^[^@]+@[^@]+\.[^@]{2,}$'
For the most part, the expression is working. But it's allowing single-character top level domains. For example, both the following emails pass validation:
something@hotmail.com and something@hotmail.c
The second case is clearly a typo. The {2,} of the regex should allow any string of characters other than the @ symbol, of length 2 or more, after the dot.
I've run the regex itself through multiple testers using different flavors (Perl, Tcl, etc.), and it works as expected every time, rejecting the single-character TLD version of the email address. It's only when I use this regex in MySQL that it fails.
I've checked, and there are no additional characters after the ".c" in the erroneous email address. Is there something inherent to MySQL or this version that could be preventing this from working?
Running MySQL version 5.5.61-cll
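The likely culprit is the \. escape: inside a MySQL string literal the backslash is consumed by the string parser unless it is doubled, so the regex engine sees a bare dot, which matches any character. A bracket expression avoids the problem, so you may try the following pattern:
^[^@]+@[^.]+[.][^.]{2,}([.][^.]{2,})*$
The rightmost portion of the pattern means:
[.] match a literal dot
[^.]{2,} followed by a domain component (any 2 or more characters but dot)
([.][^.]{2,})* followed by dot and another component, zero or more times
Demo
So this would match:
jon.skeet@google.com
jon.skeet@google.co.uk
But would not match:
gordonlinoff@blah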
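The difference between the patterns can be checked outside the database. The sketch below uses Python's re module as a stand-in for MySQL's regex engine, and it assumes that MySQL's string-literal parsing drops the single backslash, so that the engine effectively receives the second pattern. The variable names are invented for the example.

```python
import re

# The pattern as typed in the question.
as_written = re.compile(r"^[^@]+@[^@]+\.[^@]{2,}$")
# What the engine would receive if the backslash is lost: a bare dot.
as_engine_sees_it = re.compile(r"^[^@]+@[^@]+.[^@]{2,}$")
# The bracket-expression version, which needs no backslash at all.
bracketed = re.compile(r"^[^@]+@[^.]+[.][^.]{2,}([.][^.]{2,})*$")

typo = "something@hotmail.c"
print(bool(as_written.match(typo)))         # rejected when the dot is literal
print(bool(as_engine_sees_it.match(typo)))  # accepted when the dot is a wildcard
print(bool(bracketed.match(typo)))          # rejected again
```

With the bare dot, "hotmai" satisfies the domain part, "l" satisfies the wildcard, and ".c" satisfies the {2,} quantifier, which is how the single-character TLD slips through.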
By golly we are hackers and we validate stuff! What's more, we understand concepts like ACID compliance, data integrity, and single points of authority. So obviously, we should ensure that no invalid emails enter the DB. What's more? We have such a wonderful tool with which to do it: a proper schema with check constraints!
Unfortunately, emails are one of those details that are notoriously difficult to validate with simple regular expressions. It is possible, sure. No, actually, it's not. None of those links offer 100% compliance.
A much better approach is simply to test the address by sending it an activation email. David Gilbertson explains it far better than I'm going to in a concise SO answer, but the highlights:
Don't even try to validate.
Just test the address with an actual email.
For my projects, both personal and professional, this is the regex I use for email address sanity checking prior to sending an activation/confirmation email:
\S+@\S+
This is extremely simple (and yes, it still excludes some technically valid email addresses), easy to debug, and works for all legitimate traffic to our sites. (I have yet to see an email address even close to something like #!$%&’*+-/=?^_{}|~@example.com in our logs.)
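A sanity check along those lines might look like the following Python sketch. The function name worth_sending_activation is hypothetical, and matching the whole trimmed string with fullmatch is an assumption on my part, since the answer does not say how the pattern is anchored.

```python
import re

# Something, an @ sign, something: no attempt at real validation.
SANITY = re.compile(r"\S+@\S+")


def worth_sending_activation(address: str) -> bool:
    """Cheap pre-check before sending the activation email."""
    # fullmatch is an assumption: it also rejects inner whitespace.
    return SANITY.fullmatch(address.strip()) is not None


print(worth_sending_activation("jon.skeet@google.com"))
print(worth_sending_activation("not an address"))
```

The real test is still whether the activation email gets confirmed.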

Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I found on SO, but none of them are supported here: https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions,
in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex can I use to get just the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/
Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
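For readers who want to try Method 1 outside a regex tester, here is a small Python sketch. One deliberate change from the expression above: a second capturing group is added around the value so that findall returns name/value pairs. The variable names are invented for the example.

```python
import re

line = ('<outline text="Software Engineering Daily" type="rss" '
        'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" '
        'htmlUrl="http://softwareengineeringdaily.com" />')

# Method 1, with an extra group around the value (not in the original).
wanted = re.compile(r'\b(text|xmlUrl)="([^"]*)"')
pairs = dict(wanted.findall(line))
print(pairs["text"])
print(pairs["xmlUrl"])

# The negated variant: names of every attribute except text and xmlUrl.
others = re.compile(r'\b((?!text|xmlUrl)\w+)="[^"]*"')
print(others.findall(line))
```

The word boundary is what keeps htmlUrl out of the first result and stops the second pattern from starting in the middle of an attribute name.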
Method 2
This method only works with regex flavours that allow backreferences. According to this article about numbered backreferences from Regular-Expressions.info, JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences, and double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost.
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
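Since Python supports backreferences, Method 2 can be tried as in the sketch below. The sample tag is invented to exercise both delimiters and the optional whitespace, and example.com is a placeholder URL.

```python
import re

# Method 2: group 2 remembers the opening quote, \2 demands the same one.
ATTR = re.compile(r"""\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2""")

sample = """<outline text = 'Daily "Tech" News' type="rss" xmlUrl="http://example.com/feed" />"""

# Group 1 is the attribute name, group 3 the value between the quotes.
found = {m.group(1): m.group(3) for m in ATTR.finditer(sample)}
print(found)
```

Because the closing delimiter must equal the opening one, the double quotes inside the single-quoted value do not end the match early.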
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
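A Python sketch of Method 3 follows. The helper name attributes is made up, and the sample tag deliberately leaves the xmlUrl value unquoted. The value ends up in group 3 or group 4 depending on which alternative matched, so the code has to pick between them.

```python
import re

# Method 3: quoted value (groups 2 and 3) or bare value up to whitespace (group 4).
ATTR = re.compile(r"""\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))""")


def attributes(tag: str) -> dict:
    """Collect text/xmlUrl values whether or not they are quoted."""
    result = {}
    for m in ATTR.finditer(tag):
        # Group 2 is set only when the quoted alternative matched.
        value = m.group(3) if m.group(2) is not None else m.group(4)
        result[m.group(1)] = value
    return result


sample = ('<outline text="Software Engineering Daily" '
          'xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ />')
print(attributes(sample))
```

Having to choose between two groups is exactly the inconvenience that Method 4 addresses.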
Method 4
While Method 3 works, some people might complain that the attribute value might end up in either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"]*)"|'([^']*)')
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expressions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section so that you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) to grab anything between two " symbols. Note that it does not handle backslash-escaped quotes; it will match any non-" character between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"]*)"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.
For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).
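As a sketch of the parser route, here is how it could look with Python's standard xml.etree.ElementTree. The opml/body wrapper around the outline element is an assumption about the file's overall shape (only the outline line appears in the question), and the variable names are invented.

```python
import xml.etree.ElementTree as ET

# Assumed file shape: outline elements nested somewhere under <opml><body>.
opml = """<opml version="2.0">
  <body>
    <outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
  </body>
</opml>"""

root = ET.fromstring(opml)

# iter() walks nested outlines too; keep only those that carry a feed URL.
feeds = [
    (node.get("text"), node.get("xmlUrl"))
    for node in root.iter("outline")
    if node.get("xmlUrl")
]

for title, url in feeds:
    print(title)
    print(url)
```

Quoting style, attribute order and whitespace around = stop mattering here, because the parser deals with them.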

Matlab Regular expression query

Very new to regex and haven't found a descriptive explanation that narrows down my understanding enough to get me to a solution.
I use a script that scrapes HTML from Yahoo Finance to get financial options table data. Yahoo recently changed their HTML and the old algorithm no longer works. The old expression was the following:
Main_Pattern = '.*?</table><table[^>]*>(.*?)</table';
Tables = regexp(urlText, Main_Pattern, 'tokens');
Where Tables used to return data, it no longer does. Inspecting the HTML suggests to me that the data is no longer in <table>, but rather in <tbody>...
My question is "what does the Main_Pattern regex mean in layman's terms?" I'm trying to figure out how to modify that expression so that it is applicable to the current HTML.
While I agree with @Marcin that regular expressions are best learned by doing and by leaning on the reference for your chosen tool, I'll try to break down what it is doing.
.*?</table>: Match anything up to the first </table> literal (This is a Lazy expression due to the ?).
<table: Match this literal.
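[^>]*>: Match as many characters that aren't > as possible after the <table literal (the rest of the opening tag, such as its attributes), followed by the > that closes the tag. This is a Greedy expression since there is no ? after the *, but because the class excludes >, it always stops at the first > it meets.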
(.*?)</table: Match and capture anything between the > from the previous part up to the </table literal; what was captured can be retrieved using the 'tokens' options of regexp (you can also get the entire string that was matched using the 'match' option).
While I broke it into pieces, I'd like to emphasize that the entire expression itself works as a whole, which is why some parts refer to the previous parts.
Refer to the Operators and Characters section of the MATLAB documentation for more in-depth explanations of the above.
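The behaviour of the pattern can be seen on a toy input. The sketch below is a Python translation rather than MATLAB code: re.DOTALL is used on the assumption that it mirrors MATLAB's default of letting . match newlines, and findall plays the role of the 'tokens' option. The HTML string is invented.

```python
import re

# Same pattern as Main_Pattern; DOTALL lets "." cross line breaks.
MAIN_PATTERN = re.compile(r".*?</table><table[^>]*>(.*?)</table", re.DOTALL)

url_text = (
    '<table id="nav"><tr><td>menu</td></tr></table>'
    '<table class="options"><tr><td>42.0</td></tr></table>'
)

# With one capturing group, findall returns the captured text of each match.
tables = MAIN_PATTERN.findall(url_text)
print(tables)
```

Only the second table's contents are captured, because the pattern insists on a closing </table> immediately followed by a new <table ...>. If the page no longer has two adjacent tables, nothing matches at all.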
For the future, a more robust option might be to use MATLAB's xmlread and DOM object to traverse the table nodes.
I do understand that that is another API to learn, but it may be more maintainable for the future.

Regex not matching URL Params

I am currently working on a stub server I can plug into a webpage so I do not need to hit sagepay every time I test my payment screen. I need the server to receive a request from the web page and use the dynamic parameters contained in the URL to build the server response. The stub uses regex targets to pick out the parameters I need and add them to the response.
I am using this stub server
I built the accepted URL piece by piece, using the regex tester contained here to test each bit of logic. The expressions work separately, but when I try to join two or more of them together they refuse to work. Each parameter is separated by an ampersand (&) and the name of the parameter.
Here is a sample of the parameters:
paymentType=A&amount=147.06&policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad&paymentMethod=A&script=Retail/accept.py&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A&description=New Business Payment&firstName=Adam&surname=Har&addressLine1=20 Potters Road&city=London&postalCode=EC1 4JS&payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b&cardType=valid&continuousAuthority=true&makeCurrent=true
and in a list for ease of reading (without &'s)
paymentType=A
amount=147.06
policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad
paymentMethod=A
script=Retail/accept.py
scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A
description=New Business Payment
firstName=Adam
surname=Har
addressLine1=20 Chase road
city=London
postalCode=EC1 3PF
payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b
cardType=valid
continuousAuthority=true
makeCurrent=true
And here is my accepted URL parameters with the regex logic:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))&description=([a-zA-Z0-9 ]+s)&firstName=[A-Za-z]&surname=[A-Za-z]&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
again in a list:
registerPayment?outputType=xml
country=GB
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))
description=([a-zA-Z0-9 ]+s)
firstName=[A-Za-z]
surname=[A-Za-z]
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
My question is: why do my regexes and sample match fine separately, but not when I put them all together?
Additional question:
I am using the logic (([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+))) for the whole scriptParams parameter (the &'s here are part of the parameter). If I just want to get the 'uid' part and leave the rest, what expression would I need to target it (it is made up of A-Z, a-z, 0-9 and dashes)?
thank you
UPDATE
I have tweaked your answer slightly, because the stub server I am using will not accept the (?:[\s-]) when it loads the file containing the URL templates. I have also added % and 0-9 to many of the character classes because the request is URL-encoded (percent-encoded) before it is matched (which I had not anticipated), and a few of the params have rogue spaces beyond my control. Other than that, your solution worked great :)
Here is my new version of the scriptParams regex:
&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+
This accepts the whole parameter, and works fine in the regex tester. Now when I link anything after this part, there is an unsuccessful match.
I do not understand why this is a problem as the regex seem to string together nicely otherwise. Any ideas are appreciated.
Here is the full regex:
paymentType=[-%a-zA-Z0-9 ]+&amount=[0-9]+.[0-9]{2}&policyUid=([-A-Za-z0-9]+)&paymentMethod=([%a-zA-Z0-9]+)&script=[%/.a-zA-Z0-9]+&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+&firstName=[-%A-Za-z0-9]+&surname=[-%A-Za-z0-9]+&addressLine1=[-%a-zA-Z0-9 ]+&city=[-%a-zA-Z 0-9]+&postalCode=[-%a-zA-Z 0-9]+&payerUid=([-A-Za-z0-9]+)&cardType=[%A-Za-z0-9]+&continuousAuthority=[A-Za-z]+&makeCurrent=[A-Za-z]+
And here is the full set of URL params (with the URL encoding present):
paymentType=A&amount=104.85&policyUid=16a9cc22-0000-0000-5a96-5654d9a31f92&paymentMethod=A%20&script=RetailQuotes%2FacceptQuote.py%20&scriptParams=uid%3d16a9c958-0000-0000-5a96-565435311d07%26invokePCL%3dtrue%26paymentType%3dA%20&description=New%2520Business%2520Payment&firstName=Adam&surname=Har%20&addressLine1=26%2520Close&city=Potters%2520Town&postalCode=EC1%25206LR%20&payerUid=16a9c24e-0000-0000-5a96-5654b3f956e0&cardType=valid%20&continuousAuthority=true&makeCurrent=true
Thank you
PS
(Solved the server problem. Was a slight mistake I was making in the usage of URL params.)
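Judging only from the sample shown, one possible reason nothing links after scriptParams is that the encoded value continues past the uid with %26invokePCL..., and the character class [-A-Za-z0-9] cannot cross a % sign. The Python sketch below checks that reading on a fragment of the sample. The "widened" pattern is my own assumption, not something from the original post.

```python
import re
from urllib.parse import unquote

# Fragment of the encoded sample: scriptParams followed by description.
encoded = (
    "scriptParams=uid%3d16a9c958-0000-0000-5a96-565435311d07"
    "%26invokePCL%3dtrue%26paymentType%3dA%20"
    "&description=New%2520Business%2520Payment"
)

original = r"scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+"
# Hypothetical fix: allow % so the class can run to the real & separator.
widened = r"scriptParams=[a-zA-Z]{3}%3d[-%A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+"

original_matches = re.fullmatch(original, encoded) is not None
widened_matches = re.fullmatch(widened, encoded) is not None
print(original_matches, widened_matches)

# Decoding once shows what the scriptParams value really contains.
script_params = unquote(
    "uid%3d16a9c958-0000-0000-5a96-565435311d07"
    "%26invokePCL%3dtrue%26paymentType%3dA%20"
)
print(repr(script_params))
```

The decoded value also shows the trailing space (%20) mentioned above as a rogue space.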
First, not all of your regexes work: some are missing quantifiers, others have a $ for some reason, and some parameters are missing altogether! Here's what they should have been:
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))
invokePCL=([a-z]+)
paymentType=A
description=([a-zA-Z0-9 ]+)
firstName=[A-Za-z]+
surname=[A-Za-z]+
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
And combined, you get:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))&invokePCL=([a-z]+)&paymentType=A&description=([a-zA-Z0-9 ]+)&firstName=[A-Za-z]+&surname=[A-Za-z]+&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
regex101 demo
[Note: I took your regexes where they matched and made only minimal edits to them.]
For your second question, I'm not sure what you mean by the uid part, or by the & being part of the parameter. Given that there are 3 uids in the URL with a similar format (policy, scriptParams, user), you will have to put them all in the expression, unless you know a specific pattern for the scriptParams uid.
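The effect of the stray $ is easy to reproduce in isolation. In the Python sketch below (variable names invented, with a shortened URL containing just two of the parameters), the policyUid piece matches on its own but can never be followed by another piece, because $ only succeeds at the end of the input.

```python
import re

url = "policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad&paymentMethod=A"

with_anchor = r"policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)"
without_anchor = r"policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)"
next_piece = r"&paymentMethod=([a-zA-Z]+)"

# Tested alone, the anchored piece is fine: the value ends the string.
alone = re.search(with_anchor, "policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad")

# Joined, "$" demands end of input right before "&paymentMethod", so it fails.
joined_broken = re.search(with_anchor + next_piece, url)

# Without the anchor, the two pieces chain together.
joined_fixed = re.search(without_anchor + next_piece, url)

print(alone is not None, joined_broken is None, joined_fixed.groups())
```

This is one way separately tested pieces can stop working once they are concatenated.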
In the expression below, I made use of the fact that only scriptparams' uid was in lowercase:
uid=[0-9a-f]+(?:-[0-9a-f]+)+
regex101 demo
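A quick Python check of that expression against a shortened version of the sample query (only some of the parameters are kept here, and the variable names are made up):

```python
import re

query = (
    "paymentType=A&amount=147.06"
    "&policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad"
    "&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true"
    "&payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b"
)

# Case-sensitive on purpose: "policyUid=" and "payerUid=" have a capital U.
UID = re.compile(r"uid=[0-9a-f]+(?:-[0-9a-f]+)+")

matches = UID.findall(query)
print(matches)
```

The match depends entirely on the lowercase spelling, so it would break if the pattern were run with a case-insensitive flag.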