This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*
Related
This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*
I've tried the answers I've found in SOF, but none supported here : https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions.
in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex I can use to so I can just get the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/
Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. Information according this article about numbered backreferences from Regular-Expressions.info
See regex in use here
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute values might either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"])"|'([^']*))
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expresions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section that way you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) to grabbing anything between two " symbols. Note that it does not check for escaped backslashes, it will match any non-" character between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"])"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.
For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).
While writing a grammar for Github for syntax highlighting programs written in the Racket language, I have stumbled upon a problem.
In Racket #| starts a multiline comment and |# ends it.
The problem is that multiline comments can be nested:
#| a comment #| still a comment |# even
more comment |#
Here is my non-working attempt:
repository:
multilinecomment:
begin: \#\|
end: \|\#
name: comment
contentName: comment
patterns:
- include: "#multilinecomment"
name: comment
- match: ([^\|]|\|(?=[^#]))*
name: comment
The intent of the match patterns are:
"#multilinecomment"
A multiline comment can contain another multiline comment.
([^\|]|\|(?=[^#]))*
The meaning of the subexpressions:
[^\|] any characters not an `|`
\|(?=[^#]) an `|` followed by a non-`#`
The entire expression thus matches a string not containg |#
Update:
Got an answer from Allan Odgaard on the TextMate mailing list:
http://textmate.1073791.n5.nabble.com/TextMate-grammars-and-nested-multiline-comments-td28743.html
So I've tested a bunch of languages in Sublime that have multiline comments (C/C++, Java, HTML, PHP, JavaScript), and none of the language syntaxes support multiline comments embedded in multiline comments - the syntax highlighting for the comment scope ends with the first "comment close" marker, not with symmetric markers. Now, this isn't to say that it's impossible, because the BracketHighlighter plugin works great for matching symmetric tags, brackets, and other markers. However, it's written in Python, and uses custom logic for its matching algorithms, something that may not be available in the Oniguruma engine that powers Sublime's syntax highlighter, and apparently Github's as well.
Basically, from your description of the problem, you need a code parser to ensure that nested comments are legal, something you can't do with just a syntax highlighting definition. If you're writing this just for Sublime, a custom plugin could take care of that, but I don't know enough about Github's Linguist syntax highlighting system to say if you're allowed to do that. I'm not a regex master yet, but it seems to me that it would be rather difficult to achieve this purely by regex, as you'd need to somehow keep track of an arbitrary number of internal symmetric "open" and "close" markers before finding (and identifying!) the final one.
Sorry I couldn't provide a definitive answer other than I'm not sure this is possible, but that's the best I can come up with without knowing more about Sublime's and Github's internals, something that (at least in Sublime's case) won't happen unless it's open-sourced. Good luck!
Old post, and I don't have the reputation for a comment, but it is emphatically NOT possible to detect arbitrarily nested comments using purely regular expressions. Intuitively, this is because all regular expressions can be transformed into a finite state machine, and keeping track of nesting depth requires a (theoretically) infinite amount of state (the number of states needs to be equal to at least the different possible nesting depths, which here is infinite).
In practice this number grows very slowly, so if you don't want to go to too much trouble you could probably write something that allows nesting up to a reasonable depth. Otherwise you'll probably need a separate phase that parses through and finds the comments to tell the syntax highlighter to ignore them.
You had the correct idea but it looks like your second pattern also matches for the "begin nested comment" sequence #| which will never give a chance for your recursive #multilinecomment pattern to kick in.
All you have to do is replace your second pattern with something similar to
(#(?=[^|])|\|(?=[^#])|[^|#])+
Take the last match out. You do not need it. Its redundant to what textmate will do naturally, which is to match all additional text in to the comment scope until the end marker comes along, or the entire pattern recurses upon itself.
Very new to regex and haven't found a descriptive explaination to narrow down my understanding of regex to get me to a solution.
I use a script that scrapes html script from Yahoo finance to get financial options table data. Yahoo recently changed their HTML code and the old algorithm no longer works. The old expression was the following:
Main_Pattern = '.*?</table><table[^>]*>(.*?)</table';
Tables = regexp(urlText, Main_Pattern, 'tokens');
Where Tables used to return data, it no longer does. An HTML inspection of the HTML suggests to me that the data is no longer in <table>, but rather in <tbody>...
My question is "what does the Main_Pattern regex mean in layman's terms?" I'm trying to figure how to modify that expression such that is is applicable to the current HTML.
While I agree with #Marcin and Regular Expressions are best learned by doing and leveraging the reference of your chosen tool, I'll try and break down in what it is doing.
.*?</table>: Match anything up to the first </table> literal (This is a Lazy expression due to the ?).
<table: Match this literal.
[^>]*>: Match as much as possible that isn't > from after <table literal to the last occurrence of a > that satisfies the rest of the expression (this is a Greedy expression since there is no ? after the *).
(.*?)</table: Match and capture anything between the > from the previous part up to the </table literal; what was captured can be retrieved using the 'tokens' options of regexp (you can also get the entire string that was matched using the 'match' option).
While I broke it into pieces, I'd like to emphasize that the entire expression itself works as a whole, which is why some parts refer to the previous parts.
Refer to the Operators and Characters section of the MATLAB documentation for more in-depth explanations of the above.
For the future, a more robust option might be to use MATLAB's xmlread and DOM object to traverse the table nodes.
I do understand that that is another API to learn, but it may be more maintainable for the future.
I am currently working on a stub server I can plug into a webpage so I do not need to hit sagepay every time I test my payment screen. I need the server to receive a request from the web page and use the dynamic parameters contained in the URL to build the server response. The stub uses regex targets to pick out the parameters I need and add them to the response.
I am using this stub server
I built the accepted URL piece by piece, using the regex tester contained here to test each bit of logic. The expressions work separately, but when I try to join two or more of them together they refuse to work. Each parameter is separated by an ampersand (&) and the name of the parameter.
Here is a sample of the parameters:
paymentType=A&amount=147.06&policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad&paymentMethod=A&script=Retail/accept.py&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A&description=New Business Payment&firstName=Adam&surname=Har&addressLine1=20 Potters Road&city=London&postalCode=EC1 4JS&payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b&cardType=valid&continuousAuthority=true&makeCurrent=true
and in a list for ease of reading (without &'s)
paymentType=A
amount=147.06
policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad
paymentMethod=A
script=Retail/accept.py
scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A
description=New Business Payment
firstName=Adam
surname=Har
addressLine1=20 Chase road
city=London
postalCode=EC1 3PF
payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b
cardType=valid
continuousAuthority=true
makeCurrent=true
And here is my accepted URL parameters with the regex logic:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))&description=([a-zA-Z0-9 ]+s)&firstName=[A-Za-z]&surname=[A-Za-z]&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
again in a list:
registerPayment?outputType=xml
country=GB
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))
description=([a-zA-Z0-9 ]+s)
firstName=[A-Za-z]
surname=[A-Za-z]
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
My question is; why does my regex and sample match ok seperately, but dont when I put them all together ?
Additional question:
I am using the logic (([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+))) for the whole ScriptParams parameter (the &'s here are part of the parameter.) If I just want to get the 'uid' part and leave the rest, what expression would I need to target this (it is made up of A-z a-z 0-9 and dashes)?
thank you
UPDATE
I have tweaked your answer slightly, because the stub server I am using will not accept the (?:[\s-]) when it loads the file containing the URL templates. I have also incorporated a lot of % and 0-9 because the request is UTF encoded before it is matched (which I had not anticipated), and a few of the params have rogue spaces beyond my control. Other than that, your solution worked great :)
Here is my new version of the scriptParams regex:
&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+
This accepts the whole parameter, and works fine in the regex tester. Now when I link anything after this part, there is an unsuccessful match.
I do not understand why this is a problem as the regex seem to string together nicely otherwise. Any ideas are appreciated.
Here is the full regex:
paymentType=[-%a-zA-Z0-9 ]+&amount=[0-9]+.[0-9]{2}&policyUid=([-A-Za-z0-9]+)&paymentMethod=([%a-zA-Z0-9]+)&script=[%/.a-zA-Z0-9]+&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+&firstName=[-%A-Za-z0-9]+&surname=[-%A-Za-z0-9]+&addressLine1=[-%a-zA-Z0-9 ]+&city=[-%a-zA-Z 0-9]+&postalCode=[-%a-zA-Z 0-9]+&payerUid=([-A-Za-z0-9]+)&cardType=[%A-Za-z0-9]+&continuousAuthority=[A-Za-z]+&makeCurrent=[A-Za-z]+
And here is the full set of URL params (with UTF encoding present):
paymentType=A&amount=104.85&policyUid=16a9cc22-0000-0000-5a96-5654d9a31f92&paymentMethod=A%20&script=RetailQuotes%2FacceptQuote.py%20&scriptParams=uid%3d16a9c958-0000-0000-5a96-565435311d07%26invokePCL%3dtrue%26paymentType%3dA%20&description=New%2520Business%2520Payment&firstName=Adam&surname=Har%20&addressLine1=26%2520Close&city=Potters%2520Town&postalCode=EC1%25206LR%20&payerUid=16a9c24e-0000-0000-5a96-5654b3f956e0&cardType=valid%20&continuousAuthority=true&makeCurrent=true
Thank you
PS
(Solved the server problem. Was a slight mistake I was making in the usage of URL params.)
First, your regex not all work, some are missing quantifiers, others have a $ for some reason and some parameters are even missing! Here's what they should have been:
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))
invokePCL=([a-z]+)
paymentType=A
description=([a-zA-Z0-9 ]+)
firstName=[A-Za-z]+
surname=[A-Za-z]+
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
And combined, you get:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))&invokePCL=([a-z]+)&paymentType=A&description=([a-zA-Z0-9 ]+)&firstName=[A-Za-z]+&surname=[A-Za-z]+&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
regex101 demo
[Note, I took your regexes where they matched and ran minimal edits to them].
For your second question, I'm not sure what you mean by the Uid part and that & are part of the parameter. Given that there are 3 Uids in the url with similar format (policy, scriptparams, user), you will have to put them in the expression, unless you know a specific pattern to the scriptparams' Uid.
In the expression below, I made use of the fact that only scriptparams' uid was in lowercase:
uid=[0-9a-f]+(?:-[0-9a-f]+)+
regex101 demo