I've got a bit of code where I have a loop with a string.search() to parse a string of HTML. The purpose is to seek out any valid URLs and surround each one with HREF tags appropriately while ignoring anything else, like email addresses. The problem is no matter how I modify the regular expression, it either kicks back the part of the email address after the # sign and highlights it or it highlights or ignores everything.
An example string would be:
"</span><span class='blue'>Weaselgrease:</span><span class='magenta'>weaselgrease#weasel.grs vs weaselgrease.weasel.grs</span><span class='blue'> [12:41:33 AM]</span>"
Where 'weaselgrease.weasel.grs' would be identified as a proper URL and 'weaselgrease#weasel.grs' would be ignored.
The code I have currently is /([fh]t{1,2}ps?:\/\/)?[\w-]+\.\w{2,4}/
I know it's rather simple, but it doesn't need to be complex yet.
I've tried a conditional and gotten nowhere. I may just be missing something, but my searching and even playing legos with http://regex101.com/ has gotten me no closer.
Ultimately I'm going to have it do the following:
Identify a valid URL's index in the string
Ignore if it's an email
Ignore if it's just an IP address (no prepending http:// and no trailing slash)
But I'd be happy with just an inkling of help on what I need to do to get it to ignore email addresses.
URL without proper protocol (i.e. http, https, ftp) cannot be validated as such, because this means that almost everything that has . (dot) in it is a valid url.
So there is not a way to properly check if it's url or e-mail if you don't use the protocol. Example:
end of sentence.New sentence -> sentence.New is valid url in your case
weaselgrease#weasel.grs -> everything before # is ignored and weasel.grs is valid url
Related
I have url with parameters in my static web page
https://www.nejlepsi-skolky.cz/list-map-of-kindergartens-seznam-mapa-skolek-jesli?locationFrom=Velk%C3%A9%20Opatovice,Dlouh%C3%A1%20429;latFrom=49.6120414733887;lonFrom=16.6786785125732
the problem is that the second and the third equals signs are changed into %3D, when I click on the URL. Is there a possibility to say to the browser not to change equals? And why the first equal works fine?
I have found some answers about nginx, but this is just static site not connected to nginx settings.
Thank you
EDIT: Problem was caused with saving the page to the local storage. During the save the & signs are being changed by firefox into ; signs
the problem is that the second and the third equals signs are changed into %3D, when I click on the URL.
This is normal. It should not be a problem if you are reading the URL with something that supports the standards.
And why the first equal works fine?
The standard format of a query string is a series of key=value pairs which are separated by & characters.
Older versions of the standard supported ; as well as &, but they are obsolete.
= signs, being the symbol that indicates the end of a key and start of a value, should be encoded as %3D when they occur inside a key or a value.
The first equal in your example is the symbol that separates the key from the value. The second is data inside the value.
Is there a possibility to say to the browser not to change equals?
Change your URL so you use & as the separator instead of ;.
I've been doing server side XSS validation. Here is what I found to use:
List of forbidden attributes: javascript:,mocha:,eval(,alert(,vbscript:,livescript:,expression(,url(,&{,&#,/*,*/,onclick,oncontextmenu,ondblclick,onmousedown,onmouseenter,onmouseleave,onmousemove,onmouseover,onmouseout,onmouseup,onkeydown,onkeypress,onkeyup,onblur,onchange,onfocus,onfocusin,onfocusout,oninput,oninvalid,onreset,onsearch,onselect,onsubmit,ondrag,ondragend,ondragenter,ondragleave,ondragover,ondragstart,ondrop,oncopy,oncut,onpaste,ontouchcancel,ontouchend,ontouchmove,ontouchstart,onwheel
However they seem to be too strict since some correct values such as "™" are considered as illegal if I use this list.
I'm thinking if to check only value doesn't contain any of those characters like '<', '>', "%3E", "%3C" would be safe to prevent XSS attack?
You do not need to block any of those characters, long lesson-learned you're just cutting out other language support potentially and you can work with these characters safely.
Imagine we have a chat-box, as this is where these may get used most often in a textarea. The user can pass the following <html></html> and if it wasn't handled properly every user receiving the message will open up a new HTML document inside this chat (scary).
This gets handled on the client side fortunately, some hacks to make it "text-only".
When dealing with login you could regex check username, if that passes you can then SQL search the username if match is made check if password matches too by encrypting the password just given and matching it, no match get out.
I've never needed to block characters or treat anything special knowing proper encoding/decoding (escape practices and whatnot). Maybe this will help your search.
I'm trying to use the following regular expression to validate emails in a MySQL database:
^[^#]+#[^#]+\.[^#]{2,}$
in a condition like this:
...and email REGEXP '^[^#]+#[^#]+\.[^#]{2,}$'
For the most part, the expression is working. But it's allowing single-character top level domains. For example, both the following emails pass validation:
something#hotmail.com and something#hotmail.c
The second case is clearly a typo. The {2,} of the regex should allow any string of characters other than the # symbol, of length 2 or more, after the dot.
I've run the regex itself through multiple testers running different protocols, (Perl, TCL, etc.) and it works as expected every time, rejecting the single-character TLD version of the email address. It's only when I use this regex in a MySQL context that it fails.
I've checked, and there are no additional characters after the ".c" in the erroneous email address. Is there something inherent to MySQL or this version that could be preventing this from working?
Running MySQL version 5.5.61-cll
You may try using the following regex pattern:
^[^#]+#[^.]+[.][^.]{2,}([.][^.]{2,})*$
The rightmost portion of the pattern means:
[.] match a literal dot
[^.]{2,} followed by a domain component (any 2 or more characters but dot)
([.][^.]{2,})* followed by dot and another component, zero or more times
Demo
So this would match:
jon.skeet#google.com
jon.skeet#google.co.uk
But would not match:
gordonlinoff#blah
By golly we are hackers and we validate stuff! What's more, we understand concepts like ACID compliance, data integrity, and single points of authority. So obviously, we should ensure that no invalid emails enter the DB. What's more? We have such a wonderful tool with which to do it: a proper schema with check constraints!
Unfortunately, emails are one of those details that are notoriously difficult to validate with simple regular expressions. It is possible, sure. No, actually, it's not. None of those links offer 100% compliance.
A much better approach is simply to test the address by sending it an activation email. David Gilbertson explains it far better than I'm going to in a concise SO answer, but the highlights:
Don't even try validate.
Just test the address with an actual email.
For my projects, both personal and professional, this is the regex I use for email address sanity checking prior to sending an activation/confirmation email:
\S+#\S+
This is extremely simple (and yep, still excludes some technically valid email addresses), extremely simple to debug, and works for any legitimate traffic to our sites. (I have yet to see an email address even close to something like #!$%&’*+-/=?^_{}|~#example.com in our logs.)
I am currently working on a stub server I can plug into a webpage so I do not need to hit sagepay every time I test my payment screen. I need the server to receive a request from the web page and use the dynamic parameters contained in the URL to build the server response. The stub uses regex targets to pick out the parameters I need and add them to the response.
I am using this stub server
I built the accepted URL piece by piece, using the regex tester contained here to test each bit of logic. The expressions work separately, but when I try to join two or more of them together they refuse to work. Each parameter is separated by an ampersand (&) and the name of the parameter.
Here is a sample of the parameters:
paymentType=A&amount=147.06&policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad&paymentMethod=A&script=Retail/accept.py&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A&description=New Business Payment&firstName=Adam&surname=Har&addressLine1=20 Potters Road&city=London&postalCode=EC1 4JS&payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b&cardType=valid&continuousAuthority=true&makeCurrent=true
and in a list for ease of reading (without &'s)
paymentType=A
amount=147.06
policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad
paymentMethod=A
script=Retail/accept.py
scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A
description=New Business Payment
firstName=Adam
surname=Har
addressLine1=20 Chase road
city=London
postalCode=EC1 3PF
payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b
cardType=valid
continuousAuthority=true
makeCurrent=true
And here is my accepted URL parameters with the regex logic:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))&description=([a-zA-Z0-9 ]+s)&firstName=[A-Za-z]&surname=[A-Za-z]&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
again in a list:
registerPayment?outputType=xml
country=GB
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))
description=([a-zA-Z0-9 ]+s)
firstName=[A-Za-z]
surname=[A-Za-z]
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
My question is; why does my regex and sample match ok seperately, but dont when I put them all together ?
Additional question:
I am using the logic (([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+))) for the whole ScriptParams parameter (the &'s here are part of the parameter.) If I just want to get the 'uid' part and leave the rest, what expression would I need to target this (it is made up of A-z a-z 0-9 and dashes)?
thank you
UPDATE
I have tweaked your answer slightly, because the stub server I am using will not accept the (?:[\s-]) when it loads the file containing the URL templates. I have also incorporated a lot of % and 0-9 because the request is UTF encoded before it is matched (which I had not anticipated), and a few of the params have rogue spaces beyond my control. Other than that, your solution worked great :)
Here is my new version of the scriptParams regex:
&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+
This accepts the whole parameter, and works fine in the regex tester. Now when I link anything after this part, there is an unsuccessful match.
I do not understand why this is a problem as the regex seem to string together nicely otherwise. Any ideas are appreciated.
Here is the full regex:
paymentType=[-%a-zA-Z0-9 ]+&amount=[0-9]+.[0-9]{2}&policyUid=([-A-Za-z0-9]+)&paymentMethod=([%a-zA-Z0-9]+)&script=[%/.a-zA-Z0-9]+&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+&firstName=[-%A-Za-z0-9]+&surname=[-%A-Za-z0-9]+&addressLine1=[-%a-zA-Z0-9 ]+&city=[-%a-zA-Z 0-9]+&postalCode=[-%a-zA-Z 0-9]+&payerUid=([-A-Za-z0-9]+)&cardType=[%A-Za-z0-9]+&continuousAuthority=[A-Za-z]+&makeCurrent=[A-Za-z]+
And here is the full set of URL params (with UTF encoding present):
paymentType=A&amount=104.85&policyUid=16a9cc22-0000-0000-5a96-5654d9a31f92&paymentMethod=A%20&script=RetailQuotes%2FacceptQuote.py%20&scriptParams=uid%3d16a9c958-0000-0000-5a96-565435311d07%26invokePCL%3dtrue%26paymentType%3dA%20&description=New%2520Business%2520Payment&firstName=Adam&surname=Har%20&addressLine1=26%2520Close&city=Potters%2520Town&postalCode=EC1%25206LR%20&payerUid=16a9c24e-0000-0000-5a96-5654b3f956e0&cardType=valid%20&continuousAuthority=true&makeCurrent=true
Thank you
PS
(Solved the server problem. Was a slight mistake I was making in the usage of URL params.)
First, your regex not all work, some are missing quantifiers, others have a $ for some reason and some parameters are even missing! Here's what they should have been:
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))
invokePCL=([a-z]+)
paymentType=A
description=([a-zA-Z0-9 ]+)
firstName=[A-Za-z]+
surname=[A-Za-z]+
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
And combined, you get:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))&invokePCL=([a-z]+)&paymentType=A&description=([a-zA-Z0-9 ]+)&firstName=[A-Za-z]+&surname=[A-Za-z]+&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
regex101 demo
[Note, I took your regexes where they matched and ran minimal edits to them].
For your second question, I'm not sure what you mean by the Uid part and that & are part of the parameter. Given that there are 3 Uids in the url with similar format (policy, scriptparams, user), you will have to put them in the expression, unless you know a specific pattern to the scriptparams' Uid.
In the expression below, I made use of the fact that only scriptparams' uid was in lowercase:
uid=[0-9a-f]+(?:-[0-9a-f]+)+
regex101 demo
[This may not be precisely a programming question, but it's a puzzle that may best be answered by programmers. I tried it first on the Pro Webmasters site, to overwhelming silence]
We have an email address verification process on our website. The site first generates an appropriate key as a string
mykey
It then encodes that key as a bunch of bytes
&$dac~ʌ����!
It then base64 encodes that bunch of bytes
JiRkYWN+yoyIhIQ==
Since this key is going to be given as a querystring value of a URL that is to be placed in an HTML email, we need to first URLEncode it then HTMLEncode the result, giving us (there's no effect of HTMLEncoding in the example case, but I can't be bothered to rework the example)
JiRkYWN%2ByoyIhIQ%3D%3D
This is then embedded in HTML that is sent as part of an email, something like:
click here.
Or paste <b>http://myapp/verify?key=JiRkYWN%2ByoyIhIQ%3D%3D</b> into your browser.
When the receiving user clicks on the link, the site receives the request, extracts the value of the querystring 'key' parameter, base64 decodes it, decrypts it, and does the appropriate thing in terms of the site logic.
However on occasion we have users who report that their clicking is ineffective. One such user forwarded us the email he had been sent, and on inspection the HTML had been transformed into (to put it in terms of the example above)
click here
Or paste <b>http://myapp/verify?key=JiRkYWN+yoyIhIQ%3D%3D</b> into your browser.
That is, the %2B string - but none of the other percentage encoded strings - had been converted into a plus. (It's definitely leaving us with the right values - I've looked at the appropriate SMTP logs).
key=JiRkYWN%2ByoyIhIQ%3D%3D
key=JiRkYWN+yoyIhIQ%3D%3D
So I think that there are a couple of possibilities:
There's something I'm doing that's stupid, that I can't see, or
Some mail clients convert %2b strings to plus signs, perhaps to try to cope with the problem of people mistakenly URLEncoding plus signs
In case of 1 - what is it? In case of 2 - is there a standard, known way of dealing with this kind of scenario?
Many thanks for any help
The problem lies at this step
on inspection the HTML had been transformed into (to put it in terms of the example above)
click here
Or paste <b>http://myapp/verify?key=JiRkYWN+yoyIhIQ%3D%3D</b> into
your browser.
That is, the %2B string - but none of the other percentage encoded
strings - had been converted into a plus
Your application at "the other end" must be missing a step of unescaping. Regardless of if there is a %2B or a + a function like perls uri_unescape returns consistent answers
DB<9> use URI::Escape;
DB<10> x uri_unescape("JiRkYWN+yoyIhIQ%3D%3D")
0 'JiRkYWN+yoyIhIQ=='
DB<11> x uri_unescape("JiRkYWN%2ByoyIhIQ%3D%3D")
0 'JiRkYWN+yoyIhIQ=='
Here is what should be happening. All I'm showing are the steps. I'm using perl in a debugger. Step 54 encodes the string to base64. Step 55 shows how the base64 encoded string could be made into a uri escaped parameter. Steps 56 and 57 are what the client end should be doing to decode.
One possible work around is to ensure that your base64 "key" does not contain any plus signs!
DB<53> $key="AB~"
DB<54> x encode_base64($key)
0 'QUJ+
'
DB<55> x uri_escape('QUJ+')
0 'QUJ%2B'
DB<56> x uri_unescape('QUJ%2B')
0 'QUJ+'
DB<57> $result=decode_base64('QUJ+')
DB<58> x $result
0 'AB~'
What may be happening here is that the URLDecode is turning the %2b into a +, which is being interpreted as a space character in the URL. I was able to overcome a similar problem by first urldecoding the string, then using a replace function to replace spaces in the decoded string with + characters, and then decrypting the "fixed" string.