Regex not matching URL Params - html

I am currently working on a stub server I can plug into a webpage so I do not need to hit sagepay every time I test my payment screen. I need the server to receive a request from the web page and use the dynamic parameters contained in the URL to build the server response. The stub uses regex targets to pick out the parameters I need and add them to the response.
I am using this stub server
I built the accepted URL piece by piece, using the regex tester contained here to test each bit of logic. The expressions work separately, but when I try to join two or more of them together they refuse to work. Each parameter is separated by an ampersand (&) and the name of the parameter.
Here is a sample of the parameters:
paymentType=A&amount=147.06&policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad&paymentMethod=A&script=Retail/accept.py&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A&description=New Business Payment&firstName=Adam&surname=Har&addressLine1=20 Potters Road&city=London&postalCode=EC1 4JS&payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b&cardType=valid&continuousAuthority=true&makeCurrent=true
and in a list for ease of reading (without &'s)
paymentType=A
amount=147.06
policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad
paymentMethod=A
script=Retail/accept.py
scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A
description=New Business Payment
firstName=Adam
surname=Har
addressLine1=20 Chase road
city=London
postalCode=EC1 3PF
payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b
cardType=valid
continuousAuthority=true
makeCurrent=true
And here is my accepted URL parameters with the regex logic:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))&description=([a-zA-Z0-9 ]+s)&firstName=[A-Za-z]&surname=[A-Za-z]&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
again in a list:
registerPayment?outputType=xml
country=GB
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))
description=([a-zA-Z0-9 ]+s)
firstName=[A-Za-z]
surname=[A-Za-z]
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
My question is: why do my regex and sample match OK separately, but not when I put them all together?
Additional question:
I am using the logic (([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+))) for the whole scriptParams parameter (the &'s here are part of the parameter). If I just want to get the 'uid' part and leave the rest, what expression would I need to target this (it is made up of A-Z, a-z, 0-9 and dashes)?
thank you
UPDATE
I have tweaked your answer slightly, because the stub server I am using will not accept the (?:[\s-]) when it loads the file containing the URL templates. I have also incorporated a lot of % and 0-9 because the request is URL-encoded (percent-encoded) before it is matched, which I had not anticipated, and a few of the params have rogue spaces beyond my control. Other than that, your solution worked great :)
Here is my new version of the scriptParams regex:
&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+
This accepts the whole parameter, and works fine in the regex tester. Now when I link anything after this part, there is an unsuccessful match.
I do not understand why this is a problem as the regex seem to string together nicely otherwise. Any ideas are appreciated.
Here is the full regex:
paymentType=[-%a-zA-Z0-9 ]+&amount=[0-9]+.[0-9]{2}&policyUid=([-A-Za-z0-9]+)&paymentMethod=([%a-zA-Z0-9]+)&script=[%/.a-zA-Z0-9]+&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+&firstName=[-%A-Za-z0-9]+&surname=[-%A-Za-z0-9]+&addressLine1=[-%a-zA-Z0-9 ]+&city=[-%a-zA-Z 0-9]+&postalCode=[-%a-zA-Z 0-9]+&payerUid=([-A-Za-z0-9]+)&cardType=[%A-Za-z0-9]+&continuousAuthority=[A-Za-z]+&makeCurrent=[A-Za-z]+
And here is the full set of URL params (with the percent-encoding present):
paymentType=A&amount=104.85&policyUid=16a9cc22-0000-0000-5a96-5654d9a31f92&paymentMethod=A%20&script=RetailQuotes%2FacceptQuote.py%20&scriptParams=uid%3d16a9c958-0000-0000-5a96-565435311d07%26invokePCL%3dtrue%26paymentType%3dA%20&description=New%2520Business%2520Payment&firstName=Adam&surname=Har%20&addressLine1=26%2520Close&city=Potters%2520Town&postalCode=EC1%25206LR%20&payerUid=16a9c24e-0000-0000-5a96-5654b3f956e0&cardType=valid%20&continuousAuthority=true&makeCurrent=true
Thank you
PS
(Solved the server problem. Was a slight mistake I was making in the usage of URL params.)
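One possible explanation for the failed match described in the update, sketched here in Python purely for illustration (the stub server's regex engine may differ): in the encoded request the scriptParams value continues past the uid with %26invokePCL%3dtrue%26paymentType%3dA%20, but the class [-A-Za-z0-9]+ does not include %, so it stops at the first %26 and the following &description can never line up. The variable names and the widened class below are assumptions for the sketch, not part of the original setup.

```python
import re

# Fragment of the encoded request from the update, around scriptParams.
request = (
    "&scriptParams=uid%3d16a9c958-0000-0000-5a96-565435311d07"
    "%26invokePCL%3dtrue%26paymentType%3dA%20"
    "&description=New%2520Business%2520Payment"
)

# Class from the update: no '%', so it stops in front of '%26invokePCL...'.
narrow = r"&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+"

# Hypothetical widened class: '%' added so the encoded tail is consumed too.
wide = r"&scriptParams=[a-zA-Z]{3}%3d[-%A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+"

narrow_hit = re.search(narrow, request)
wide_hit = re.search(wide, request)

print("narrow class matches:", narrow_hit is not None)
print("wide class matches:", wide_hit is not None)
```

The narrow class on its own still matches the uid, which is consistent with the fragment working in a tester and failing only once more of the template is appended.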

First, not all of your regexes work: some are missing quantifiers, others have a $ in the middle for some reason, and some parameters are missing altogether! Here's what they should have been:
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))
invokePCL=([a-z]+)
paymentType=A
description=([a-zA-Z0-9 ]+)
firstName=[A-Za-z]+
surname=[A-Za-z]+
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
And combined, you get:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))&invokePCL=([a-z]+)&paymentType=A&description=([a-zA-Z0-9 ]+)&firstName=[A-Za-z]+&surname=[A-Za-z]+&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
regex101 demo
[Note: I kept your regexes wherever they already matched and made only minimal edits to the rest.]
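To see why the pieces can pass one by one and still fail when joined, here is a small sketch in Python (an assumption; the stub server's engine is not specified) using a shortened, hypothetical sample built from four of the parameters. A $ inside a fragment is satisfied when that fragment is tested alone against a string that ends there, but it can never be followed by &, and a character class without + consumes exactly one character.

```python
import re

# Shortened sample: only four parameters, joined the same way as the real URL.
sample = ("policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad"
          "&paymentMethod=A&firstName=Adam&surname=Har")

# Fragments as in the question: '$' mid-pattern, no quantifier on the names.
broken = (r"policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)"
          r"&paymentMethod=([a-zA-Z]+)&firstName=[A-Za-z]&surname=[A-Za-z]")

# Same fragments with the '$' removed and '+' added to the name classes.
fixed = (r"policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)"
         r"&paymentMethod=([a-zA-Z]+)&firstName=[A-Za-z]+&surname=[A-Za-z]+")


def matches(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None


# The '$' fragment is fine in isolation, which hides the problem.
alone = matches(r"policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)",
                "policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad")

print("fragment alone:", alone)
print("broken joined:", matches(broken, sample))
print("fixed joined:", matches(fixed, sample))
```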
For your second question, I'm not sure what you mean by the uid part, or by the &'s being part of the parameter. There are three uids in the URL with a similar format (policyUid, the uid inside scriptParams, and payerUid), so you will have to keep the parameter names in the expression unless you know a pattern specific to the scriptParams uid.
In the expression below, I made use of the fact that only the scriptParams uid key is written entirely in lowercase (the other two end in Uid with a capital U):
uid=[0-9a-f]+(?:-[0-9a-f]+)+
regex101 demo
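A minimal sketch of pulling out just that uid, assuming Python's re module and a shortened version of the sample URL; the capture group around the value is an addition for the example and is not part of the expression above.

```python
import re

# Shortened sample containing all three uid-style parameters.
params = ("policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad"
          "&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf"
          "&invokePCL=true"
          "&payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b")

# Lowercase 'uid=' does not occur inside 'policyUid=' or 'payerUid=',
# so the search lands on the scriptParams value only.
hit = re.search(r"uid=([0-9a-f]+(?:-[0-9a-f]+)+)", params)
script_uid = hit.group(1) if hit else None

print(script_uid)
```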

Related

grep : Count the number of elements in json response [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negated character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except whitespace”, and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, each time the regex engine is about to consume another character with the ".", it first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
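The three variants above can be compared side by side. This is a sketch in Python, assuming the log arrives as a single line with the entries separated by spaces (if the entries were on separate lines, "." would not cross the newlines by default and the greedy version would behave differently).

```python
import re

# Assumption: one line, entries separated by single spaces.
log = (r"J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM "
       r"J0000010: Project name: E:\foo.pf "
       r"J0000011: Job name: MBiek Direct Mail Test "
       r"J0000020: Document 1 - Completed successfully")

# Greedy: runs to the end, then backtracks to the LAST 'J' + 7 digits + ':'.
greedy = re.search(r"Project name:\s+(.*)\s+J[0-9]{7}:", log).group(1)

# Lazy: expands one character at a time, stops at the FIRST such marker.
lazy = re.search(r"Project name:\s+(.*?)\s+J[0-9]{7}:", log).group(1)

# Negated class: the capture cannot contain whitespace at all.
negated = re.search(r"Project name:\s+(\S*)\s+J[0-9]{7}:", log).group(1)

print("greedy :", greedy)
print("lazy   :", lazy)
print("negated:", negated)
```

Note that the \S* variant only works while the project path contains no spaces; the lazy version does not have that restriction.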
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes these new concepts easy to learn.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+.[a-zA-Z]+ instead of .* makes the match more restrictive.

MySQL regex to validate email not working - curly brace quantifier ignored

I'm trying to use the following regular expression to validate emails in a MySQL database:
^[^@]+@[^@]+\.[^@]{2,}$
in a condition like this:
...and email REGEXP '^[^@]+@[^@]+\.[^@]{2,}$'
For the most part, the expression is working. But it's allowing single-character top level domains. For example, both the following emails pass validation:
something@hotmail.com and something@hotmail.c
The second case is clearly a typo. The {2,} of the regex should allow any string of characters other than the @ symbol, of length 2 or more, after the dot.
I've run the regex itself through multiple testers using different regex flavors (Perl, TCL, etc.) and it works as expected every time, rejecting the single-character TLD version of the email address. It's only when I use this regex in a MySQL context that it fails.
I've checked, and there are no additional characters after the ".c" in the erroneous email address. Is there something inherent to MySQL or this version that could be preventing this from working?
Running MySQL version 5.5.61-cll
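You may try using the following regex pattern:
^[^@]+@[^.]+[.][^.]{2,}([.][^.]{2,})*$
Note that it writes the literal dot as [.] rather than \. because inside a MySQL string literal the backslash is itself an escape character, so '\.' most likely reaches the regex engine as a bare . (any character). That would explain why something@hotmail.c passes your original pattern.
The rightmost portion of the pattern means:
[.] match a literal dot
[^.]{2,} followed by a domain component (any 2 or more characters but dot)
([.][^.]{2,})* followed by dot and another component, zero or more times
Demo
So this would match:
jon.skeet@google.com
jon.skeet@google.co.uk
But would not match:
gordonlinoff@blah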
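A likely reason the question's pattern misbehaves only in MySQL is string-literal escaping: by default MySQL consumes the backslash in '\.' before the regex engine sees it, leaving a dot that matches any character. The sketch below emulates that in Python (not MySQL), writing the address separator as @; the variable names are made up for the example and the "as MySQL sees it" pattern is an assumption about the default SQL mode.

```python
import re

# What the author intended the engine to see.
intended = r"^[^@]+@[^@]+\.[^@]{2,}$"

# Assumed result after MySQL's string parser drops the backslash.
as_mysql_sees_it = r"^[^@]+@[^@]+.[^@]{2,}$"

# Pattern from the answer: '[.]' needs no backslash, so nothing is lost.
bracketed = r"^[^@]+@[^.]+[.][^.]{2,}([.][^.]{2,})*$"

typo = "something@hotmail.c"
good = "something@hotmail.com"

for name, pattern in [("intended", intended),
                      ("as_mysql_sees_it", as_mysql_sees_it),
                      ("bracketed", bracketed)]:
    print(name,
          "| typo:", re.search(pattern, typo) is not None,
          "| good:", re.search(pattern, good) is not None)
```

With the backslash gone, "hotmai" satisfies [^@]+, "l" satisfies the bare dot, and ".c" satisfies [^@]{2,}, so the typo slips through. Doubling the backslash in the SQL literal ('\\.') would be another way to keep the dot literal.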
By golly we are hackers and we validate stuff! What's more, we understand concepts like ACID compliance, data integrity, and single points of authority. So obviously, we should ensure that no invalid emails enter the DB. What's more? We have such a wonderful tool with which to do it: a proper schema with check constraints!
Unfortunately, emails are one of those details that are notoriously difficult to validate with simple regular expressions. It is possible, sure. No, actually, it's not. None of those links offer 100% compliance.
A much better approach is simply to test the address by sending it an activation email. David Gilbertson explains it far better than I'm going to in a concise SO answer, but the highlights:
Don't even try to validate.
Just test the address with an actual email.
For my projects, both personal and professional, this is the regex I use for email address sanity checking prior to sending an activation/confirmation email:
\S+@\S+
This is extremely simple (and yep, still excludes some technically valid email addresses), easy to debug, and works for any legitimate traffic to our sites. (I have yet to see an email address even close to something like #!$%&’*+-/=?^_{}|~@example.com in our logs.)

Why is this %2B string being urldecoded?

[This may not be precisely a programming question, but it's a puzzle that may best be answered by programmers. I tried it first on the Pro Webmasters site, to overwhelming silence]
We have an email address verification process on our website. The site first generates an appropriate key as a string
mykey
It then encodes that key as a bunch of bytes
&$dac~ʌ����!
It then base64 encodes that bunch of bytes
JiRkYWN+yoyIhIQ==
Since this key is going to be given as a querystring value of a URL that is to be placed in an HTML email, we need to first URLEncode it then HTMLEncode the result, giving us (there's no effect of HTMLEncoding in the example case, but I can't be bothered to rework the example)
JiRkYWN%2ByoyIhIQ%3D%3D
This is then embedded in HTML that is sent as part of an email, something like:
click here.
Or paste <b>http://myapp/verify?key=JiRkYWN%2ByoyIhIQ%3D%3D</b> into your browser.
When the receiving user clicks on the link, the site receives the request, extracts the value of the querystring 'key' parameter, base64 decodes it, decrypts it, and does the appropriate thing in terms of the site logic.
However on occasion we have users who report that their clicking is ineffective. One such user forwarded us the email he had been sent, and on inspection the HTML had been transformed into (to put it in terms of the example above)
click here
Or paste <b>http://myapp/verify?key=JiRkYWN+yoyIhIQ%3D%3D</b> into your browser.
That is, the %2B string - but none of the other percentage encoded strings - had been converted into a plus. (It's definitely leaving us with the right values - I've looked at the appropriate SMTP logs).
key=JiRkYWN%2ByoyIhIQ%3D%3D
key=JiRkYWN+yoyIhIQ%3D%3D
So I think that there are a couple of possibilities:
There's something I'm doing that's stupid, that I can't see, or
Some mail clients convert %2b strings to plus signs, perhaps to try to cope with the problem of people mistakenly URLEncoding plus signs
In case of 1 - what is it? In case of 2 - is there a standard, known way of dealing with this kind of scenario?
Many thanks for any help
The problem lies at this step
on inspection the HTML had been transformed into (to put it in terms of the example above)
click here
Or paste <b>http://myapp/verify?key=JiRkYWN+yoyIhIQ%3D%3D</b> into
your browser.
That is, the %2B string - but none of the other percentage encoded
strings - had been converted into a plus
Your application at "the other end" must be missing an unescaping step. Regardless of whether the value contains %2B or +, a function like Perl's uri_unescape returns the same result:
DB<9> use URI::Escape;
DB<10> x uri_unescape("JiRkYWN+yoyIhIQ%3D%3D")
0 'JiRkYWN+yoyIhIQ=='
DB<11> x uri_unescape("JiRkYWN%2ByoyIhIQ%3D%3D")
0 'JiRkYWN+yoyIhIQ=='
Here is what should be happening. All I'm showing are the steps. I'm using perl in a debugger. Step 54 encodes the string to base64. Step 55 shows how the base64 encoded string could be made into a uri escaped parameter. Steps 56 and 57 are what the client end should be doing to decode.
One possible work around is to ensure that your base64 "key" does not contain any plus signs!
DB<53> $key="AB~"
DB<54> x encode_base64($key)
0 'QUJ+
'
DB<55> x uri_escape('QUJ+')
0 'QUJ%2B'
DB<56> x uri_unescape('QUJ%2B')
0 'QUJ+'
DB<57> $result=decode_base64('QUJ+')
DB<58> x $result
0 'AB~'
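One way to follow the workaround mentioned above (keeping plus signs out of the key) is the URL-safe base64 alphabet, which uses - and _ in place of + and /. This is a sketch in Python rather than Perl, and it is an assumption that both the encoding and decoding sides of the site can be changed to use it.

```python
import base64
from urllib.parse import quote

key = b"AB~"

# Standard alphabet: this input produces a '+' that must be escaped as %2B.
standard = base64.b64encode(key).decode("ascii")

# URL-safe alphabet: '-' replaces '+', so there is nothing for a mail
# client to "helpfully" unescape.
token = base64.urlsafe_b64encode(key).decode("ascii")

# Round trip on the receiving side.
decoded = base64.urlsafe_b64decode(token)

print(standard, "->", quote(standard, safe=""))
print(token, "->", quote(token, safe=""))
print(decoded)
```

Padding characters (=) can still appear with other key lengths and would still be percent-encoded, but those were not the ones being altered in the reported emails.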
What may be happening here is that the URLDecode is turning the %2b into a +, which is being interpreted as a space character in the URL. I was able to overcome a similar problem by first urldecoding the string, then using a replace function to replace spaces in the decoded string with + characters, and then decrypting the "fixed" string.
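The difference between the two kinds of decoding can be shown with the example key from the question. This is a sketch using Python's urllib.parse; the site's actual framework is not stated, so treating its query-string parser as form-style decoding is an assumption.

```python
from urllib.parse import parse_qs, unquote

intact = "key=JiRkYWN%2ByoyIhIQ%3D%3D"   # link as it was sent
mangled = "key=JiRkYWN+yoyIhIQ%3D%3D"    # link after the client rewrote %2B

# Plain percent-decoding leaves a literal '+' alone, so both agree.
plain_intact = unquote(intact.split("=", 1)[1])
plain_mangled = unquote(mangled.split("=", 1)[1])

# Form-style decoding treats a literal '+' as an encoded space.
form_intact = parse_qs(intact)["key"][0]
form_mangled = parse_qs(mangled)["key"][0]

# Repair described above: base64 never contains spaces, so any space in
# the decoded key must have started life as a '+'.
repaired = form_mangled.replace(" ", "+")

print(plain_intact, plain_mangled)
print(form_intact, "|", form_mangled, "|", repaired)
```

The replace step is only safe because the value is known to be base64; it should not be applied to parameters that can legitimately contain spaces.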

The search results in Drupal includes HTML Entities. How can I have a clean output?

How can I get clean HTML output on search result pages? Each time I include special characters like "&" in the search term, I get results with "&" highlighted, but the output also includes the raw HTML entity. Thus, the results contain &amp;, &quot;, etc. Here's a sample screenshot: http://min.us/mt3rOV5zVtOh6
Meanwhile, when I search with "&amp;" included in the search term, the results come back with clean output.
The piece of code in search-result.tpl.php
http://pastebin.com/zCmMJLNh
I've already tried several decoding functions but no success. Been trying to fix this for days already. The site is using Drupal 6 and the search module has been overridden.
You say "...the search module has been overridden". This could be why the search snippet is still HTML-entity-encoded on output (e.g. already-escaped HTML being passed through check_plain again).
A better fix would be to find the cause in the modification, e.g. a preprocess function that modifies the search snippet (if there is one).
Alternatively, you could probably run the $snippet through decode_entities,
i.e. print decode_entities($snippet);
This assumes the HTML has already been escaped once; if it has not, decoding it before printing can be a security risk.
See also: http://php.net/manual/en/function.html-entity-decode.php
and: http://www.php.net/manual/en/function.htmlspecialchars-decode.php
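The symptom described in the question is what double escaping looks like. The sketch below illustrates it with Python's html module instead of PHP, purely as a language-neutral demonstration; it does not call Drupal's decode_entities, and the snippet text is made up.

```python
import html

snippet = 'Tom & Jerry say "hi"'

# Escaped once: correct for output, the browser shows the original text.
once = html.escape(snippet)

# Escaped twice: the browser now shows the entities themselves.
twice = html.escape(once)

# One decoding pass brings the double-escaped text back to the safe form.
undone = html.unescape(twice)

print(once)
print(twice)
print(undone == once)
```

Decoding only removes one layer, which is why it is safe on text that was escaped twice and risky on text that was escaped only once or not at all.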
Well, you could try drupal_html_to_text to convert the snippet into plain text.
The right way is probably to figure out why those results aren't getting converted. Based on your comments it looks like the problem is only when you search specifically for "&". More specifically, it's the regex in the search.module (/modules/search/search.module - line 1188 in 6):
preg_match_all('/ ("([^"]+)"|(?!OR)([^" ]+))/', ' '. $keys, $matches);
It only matches spaces before the keyword (not after). You could modify the $keys here like:
if ($keys == '&') $keys = '&amp;';
Or something like that (of course that means hacking core - meh).
You could also possibly add a form_alter via a module and modify the search form (see this link on how to add the form_alter). Then you could add a custom submit handler which would alter the search term in the form before it is submitted.