Related
Can anyone please direct me to an example of where one can send a user input variable to a checking function or module & return the validated input assigning / updating the initialised variable?. I am trying to re-create something I did in C++ many years ago where I am trying to validate an integer! In this particular case that the number of bolts input in a building frame connection is such. Any direction would be greatly appreciated as my internet searches and trawls through my copy of Python A Crash Course have yet to shed any light! Many thanks in anticipation that someone will feel benevolent towards a Python newbie!
Regards Steve
Below is one on my numerous attempts at this, really I would just like to abandon and use While and a function call. In this one apparently I am not allowed to put > (line 4) between str and int, this desite my attempt to force N to be int - penultimate line!
def int_val(N):
#checks
# check 1. n > 0 for real entries
N > 0
isinstance(N, int)
N=N
return N
print("N not ok enter again")
#N = input("Input N the Number of bolts ")
# Initialiase N=0
#N = 0
# Enter the number of bolts
N = input("Input N the Number of bolts ")
int_val(N)
print("no of bolts is", N)
Is something like this what you have in mind? It takes advantage of the fact that using the built-in int function will convert a string to an integer if possible, but otherwise throw a ValueError.
def str_to_posint(s):
"""Return value if converted and greater than zero else None."""
try:
num = int(s)
return num if num > 0 else None
except ValueError:
return None
while True:
s = input("Enter number of bolts: ")
if num_bolts := str_to_posint(s):
break
print(f"Sorry, \"{s}\" is not a valid number of bolts.")
print(f"{num_bolts = }")
Output:
Enter number of bolts: twenty
Sorry, "twenty" is not a valid number of bolts.
Enter number of bolts: 20
num_bolts = 20
def str_to_posint(s):
"""Return value if converted and greater than zero else None."""
try:
num = int(s)
return num if num > 0 else None
except ValueError:
return None
while True:
s = input("Enter number of bolts: ")
if num_bolts := str_to_posint(s):
break
print(f"Sorry, "{s}" is not a valid number of bolts.")
print(f"{num_bolts = }")
I'm working on an aggregated config file parsing tool, hoping it can support .json, .yaml and .toml files. So, I have done the next tests:
The example.json config file is as:
{
"DEFAULT":
{
"ServerAliveInterval": 45,
"Compression": true,
"CompressionLevel": 9,
"ForwardX11": true
},
"bitbucket.org":
{
"User": "hg"
},
"topsecret.server.com":
{
"Port": 50022,
"ForwardX11": false
},
"special":
{
"path":"C:\\Users",
"escaped1":"\n\t",
"escaped2":"\\n\\t"
}
}
The example.yaml config file is as:
DEFAULT:
ServerAliveInterval: 45
Compression: yes
CompressionLevel: 9
ForwardX11: yes
bitbucket.org:
User: hg
topsecret.server.com:
Port: 50022
ForwardX11: no
special:
path: C:\Users
escaped1: "\n\t"
escaped2: \n\t
and the example.toml config file is as:
[DEFAULT]
ServerAliveInterval = 45
Compression = true
CompressionLevel = 9
ForwardX11 = true
['bitbucket.org']
User = 'hg'
['topsecret.server.com']
Port = 50022
ForwardX11 = false
[special]
path = 'C:\Users'
escaped1 = "\n\t"
escaped2 = '\n\t'
Then, the test code with output is as:
import pickle,json,yaml
# TOML, see https://github.com/hukkin/tomli
try:
import tomllib
except ModuleNotFoundError:
import tomli as tomllib
path = "example.json"
with open(path) as file:
config1 = json.load(file)
assert isinstance(config1,dict)
pickled1 = pickle.dumps(config1)
path = "example.yaml"
with open(path, 'r', encoding='utf-8') as file:
config2 = yaml.safe_load(file)
assert isinstance(config2,dict)
pickled2 = pickle.dumps(config2)
path = "example.toml"
with open(path, 'rb') as file:
config3 = tomllib.load(file)
assert isinstance(config3,dict)
pickled3 = pickle.dumps(config3)
print(config1==config2) # True
print(config2==config3) # True
print(pickled1==pickled2) # False
print(pickled2==pickled3) # True
So, my question is, since the parsed obj are all dicts, and these dicts are equal to each other, why their pickled codes are not the same, i.e., why is the pickled code of the dict parsed from json different to other two?
Thanks in advance.
The difference is due to:
The json module performing memoizing for object attributes with the same value (it's not interning them, but the scanner object contains a memo dict that it uses to dedupe identical attribute strings within a single parsing run), while yaml does not (it just makes a new str each time it sees the same data), and
pickle faithfully reproducing the exact structure of the data it's told to dump, replacing subsequent references to the same object with a back-reference to the first time it was seen (among other reasons, this makes it possible to dump recursive data structures, e.g. lst = [], lst.append(lst), without infinite recursion, and reproduce them faithfully when unpickled)
Issue #1 isn't visible in equality testing (strs compare equal with the same data, not just the same exact object in memory). But when pickle sees "ForwardX11" the first time, it inserts the pickled form of the object and emits a pickle opcode that assigns a number to that object. If that exact object is seen again (same memory address, not merely same value), instead of reserializing it, it just emits a simpler opcode that just says "Go find the object associated with the number from last time and put it here as well". If it's a different object though, even one with the same value, it's new, and gets serialized separately (and assigned another number in case the new object is seen again).
Simplifying your code to demonstrate the issue, you can inspect the generated pickle output to see how this is happening:
s = r'''{
"DEFAULT":
{
"ForwardX11": true
},
"FOO":
{
"ForwardX11": false
}
}'''
s2 = r'''DEFAULT:
ForwardX11: yes
FOO:
ForwardX11: no
'''
import io, json, yaml, pickle, pickletools
d1 = json.load(io.StringIO(s))
d2 = yaml.safe_load(io.StringIO(s2))
pickletools.dis(pickle.dumps(d1))
pickletools.dis(pickle.dumps(d2))
Try it online!
The output from that code for the json parsed input is (with # comments inline to point out important things), at least on Python 3.7 (the default pickle protocol and exact pickling format can change from release to release), is:
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'DEFAULT'
18: q BINPUT 1
20: } EMPTY_DICT
21: q BINPUT 2
23: X BINUNICODE 'ForwardX11' # Serializes 'ForwardX11'
38: q BINPUT 3 # Assigns the serialized form the ID of 3
40: \x88 NEWTRUE
41: s SETITEM
42: X BINUNICODE 'FOO'
50: q BINPUT 4
52: } EMPTY_DICT
53: q BINPUT 5
55: h BINGET 3 # Looks up whatever object was assigned the ID of 3
57: \x89 NEWFALSE
58: s SETITEM
59: u SETITEMS (MARK at 5)
60: . STOP
highest protocol among opcodes = 2
while the output from the yaml loaded data is:
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'DEFAULT'
18: q BINPUT 1
20: } EMPTY_DICT
21: q BINPUT 2
23: X BINUNICODE 'ForwardX11' # Serializes as before
38: q BINPUT 3 # and assigns code 3 as before
40: \x88 NEWTRUE
41: s SETITEM
42: X BINUNICODE 'FOO'
50: q BINPUT 4
52: } EMPTY_DICT
53: q BINPUT 5
55: X BINUNICODE 'ForwardX11' # Doesn't see this 'ForwardX11' as being the exact same object, so reserializes
70: q BINPUT 6 # and marks again, in case this copy is seen again
72: \x89 NEWFALSE
73: s SETITEM
74: u SETITEMS (MARK at 5)
75: . STOP
highest protocol among opcodes = 2
printing the id of each such string would get you similar information, e.g., replacing the pickletools lines with:
for k in d1['DEFAULT']:
print(id(k))
for k in d1['FOO']:
print(id(k))
for k in d2['DEFAULT']:
print(id(k))
for k in d2['FOO']:
print(id(k))
will show a consistent id for both 'ForwardX11's in d1, but differing ones for d2; a sample run produced (with inline comments added):
140067902240944 # First from d1
140067902240944 # Second from d1 is *same* object
140067900619760 # First from d2
140067900617712 # Second from d2 is unrelated object (same value, but stored separately)
While I didn't bother checking if toml behaved the same way, given that it pickles the same as the yaml, it's clearly not attempting to dedupe strings; json is uniquely weird there. It's not a terrible idea that it does so mind you; the keys of a JSON dict are logically equivalent to attributes on an object, and for huge inputs (say, 10M objects in an array with the same handful of keys), it might save a meaningful amount of memory on the final parsed output by deduping (e.g. on CPython 3.11 x86-64 builds, replacing 10M copies of "ForwardX11" with a single copy would reduce 590 MB for string data to just 59 bytes).
As a side-note: This "dicts are equal, pickles are not" issue could also occur:
When the two dicts were constructed with the same keys and values, but the order in which the keys were inserted differed (modern Python uses insertion-ordered dicts; comparisons between them ignore ordering, but pickle would be serializing them in whatever order they iterate in naturally).
When there are objects which compare equal but have different types (e.g. set vs. frozenset, int vs. float); pickle would treat them separately, but equality tests would not see a difference.
Neither of these is the issue here (both json and yaml appear to be constructing in the same order seen in the input, and they're parsing the ints as ints), but it's entirely possible for your test of equality to return True, while the pickled forms are unequal, even when all the objects involved are unique.
I have the following closure:
def get!(Item, id) do
Enum.find(
#items,
fn(item) -> item.id == id end
)
end
As I believe this looks ugly and difficult to read, I'd like to give this a name, like:
def get!(Item, id) do
defp has_target_id?(item), do: item.id = id
Enum.find(#items, has_target_id?/1)
end
Unfortunately, this results in:
== Compilation error in file lib/auction/fake_repo.ex ==
** (ArgumentError) cannot invoke defp/2 inside function/macro
(elixir) lib/kernel.ex:5238: Kernel.assert_no_function_scope/3
(elixir) lib/kernel.ex:4155: Kernel.define/4
(elixir) expanding macro: Kernel.defp/2
lib/auction/fake_repo.ex:28: Auction.FakeRepo.get!/2
Assuming it is possible, what is the correct way to do this?
The code you posted has an enormous amount of syntax errors/glitches. I would suggest you start with getting accustomed to the syntax, rather than trying to make Elixir better by inventing the things that nobody uses.
Here is the correct version that does what you wanted. The task might be accomplished with an anonymous function, although I hardly see a reason to make a perfectly looking idiomatic Elixir look ugly.
defmodule Foo do
#items [%{id: 1}, %{id: 2}, %{id: 3}]
def get!(id) do
has_target_id? = fn item -> item.id == id end
Enum.find(#items, has_target_id?)
end
end
Foo.get! 1
#⇒ %{id: 1}
Foo.get! 4
#⇒ nil
You can do this:
def get!(Item, id) do
Enum.find(
#items,
&compare_ids(&1, id)
)
end
defp compare_ids(%Item{}=item, id) do
item.id == id
end
But, that's equivalent to:
Enum.find(
#items,
fn item -> compare_ids(item, id) end
)
and may not pass your looks ugly and difficult to read test.
I was somehow under the impression Elixir supports nested functions?
Easy enough to test:
defmodule A do
def go do
def greet do
IO.puts "hello"
end
greet()
end
end
Same error:
$ iex a.ex
Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]
** (ArgumentError) cannot invoke def/2 inside function/macro
(elixir) lib/kernel.ex:5150: Kernel.assert_no_function_scope/3
(elixir) lib/kernel.ex:3906: Kernel.define/4
(elixir) expanding macro: Kernel.def/2
a.ex:3: A.go/0
wouldn't:
defp compare_ids(item, id), do: item.id == id
be enough? Is there any advantage to including %Item{} or making
separate functions for returning both true and false conditions?
What you gain by specifying the first parameter as:
func(%Item{} = item, target_id)
is that only an Item struct will match the first parameter. Here is an example:
defmodule Item do
defstruct [:id, :name, :description]
end
defmodule Dog do
defstruct [:id, :name, :owner]
end
defmodule A do
def go(%Item{} = item), do: IO.inspect(item.id, label: "id: ")
end
In iex:
iex(1)> item = %Item{id: 1, name: "book", description: "old"}
%Item{description: "old", id: 1, name: "book"}
iex(2)> dog = %Dog{id: 1, name: "fido", owner: "joe"}
%Dog{id: 1, name: "fido", owner: "joe"}
iex(3)> A.go item
id: : 1
1
iex(4)> A.go dog
** (FunctionClauseError) no function clause matching in A.go/1
The following arguments were given to A.go/1:
# 1
%Dog{id: 1, name: "fido", owner: "joe"}
a.ex:10: A.go/1
iex(4)>
You get a function clause error if you call the function with a non-Item, and the earlier an error occurs, the better, because it makes debugging easier.
Of course, by preventing the function from accepting other structs, you make the function less general--but because it's a private function, you can't call it from outside the module anyway. On the other hand, if you wanted to call the function on both Dog and Item structs, then you could simply specify the first parameter as:
|
V
func(%{}=thing, target_id)
then both an Item and a Dog would match--but not non-maps.
What you gain by specifying the first parameter as:
|
V
func(%Item{id: id}, target_id)
is that you let erlang's pattern matching engine extract the data you need, rather than calling item.id as you would need to do with this definition:
func(%Item{}=item, target_id)
In erlang, pattern matching in a parameter list is the most efficient/convenient/stylish way to write functions. You use pattern matching to extract the data that you want to use in the function body.
Going even further, if you write the function definition like this:
same variable name
| |
V V
func(%Item{id: target_id}, target_id)
then erlang's pattern matching engine not only extracts the value for the id field from the Item struct, but also checks that the value is equal to the value of the target_id variable in the 2nd argument.
Defining multiple function clauses is a common idiom in erlang, and it is considered good style because it takes advantage of pattern matching rather than logic inside the function body. Here's an erlang example:
get_evens(List) ->
get_evens(List, []).
get_evens([Head|Tail], Results) when Head rem 2 == 0 ->
get_evens(Tail, [Head|Results]);
get_evens([Head|Tail], Results) when Head rem 2 =/= 0 ->
get_evens(Tail, Results);
get_evens([], Results) ->
lists:reverse(Results).
I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
^ ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quickreference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular (atomic grouping since Ruby 1.9.3)
JavaScript API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by #jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl 1 2 3 4
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.
In the above diagram, S1 and S2 are two states where S1 is the starting and final step. So if we try with the string 0110 , the transition goes as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the no of opening parenthesis; this means it has to be stored at some place. But since FSAs can not do that, a regular expression can not be written.
However, an algorithm can be written to do this task. Algorithms are generally falls under Pushdown Automata (PDA). PDA is one level above of FSA. PDA has an additional stack to store some additional information. PDAs can be used to solve the above problem, because we can 'push' the opening parenthesis in the stack and 'pop' them once we encounter a closing parenthesis. If at the end, stack is empty, then opening parenthesis and closing parenthesis matches. Otherwise not.
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not a commonly available.
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns and regular-expressions is the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions. - see #dehmann
If it's just first open to last close see #Zach
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile("""
( .*? )
( \( | \) | \[ | \] | \{ | \} | \< | \> |
\' | \" | BEGIN | END | $ )
( .* )
""", re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
lst = []
while True:
m = pat.match(s)
if m.group(1) != "":
lst.append(m.group(1))
if m.group(2) == term:
return lst, m.group(3)
if m.group(2) in matching:
item, s = matchnested(m.group(3), matching[m.group(2)])
lst.append(m.group(2))
lst.append(item)
lst.append(matching[m.group(2)])
else:
raise ValueError("After <<%s %s>> expected %s not %s" %
(lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
for s in ("simple string",
""" "double quote" """,
""" 'single quote' """,
"one'two'three'four'five'six'seven",
"one(two(three(four)five)six)seven",
"one(two(three)four)five(six(seven)eight)nine",
"one(two)three[four]five{six}seven<eight>nine",
"one(two[three{four<five>six}seven]eight)nine",
"oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
"ERROR testing ((( mismatched ))] parens"):
print "\ninput", s
try:
lst, s = matchnested(s)
print "output", lst
except ValueError as e:
print str(e)
print "done"
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you first occurrence
str.lastIndexOf(')'); - last one
So you need a string between,
String searchedString = str.substring(str1.indexOf('('),str1.lastIndexOf(')');
because js regex doesn't support recursive match, i can't make balanced parentheses matching work.
so this is a simple javascript for loop version that make "method(arg)" string into array
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more # here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
""" returns an array of code snippets from a string (data)"""
start_pos = None
end_pos = None
count_open = 0
count_close = 0
code_snippets = []
for i,v in enumerate(data):
if v =='{':
count_open+=1
if not start_pos:
start_pos= i
if v=='}':
count_close +=1
if count_open == count_close and not end_pos:
end_pos = i+1
if start_pos and end_pos:
code_snippets.append((start_pos,end_pos))
start_pos = None
end_pos = None
return code_snippets
I used this to extract code snippets from a text file.
This do not fully address the OP question but I though it may be useful to some coming here to search for nested structure regexp:
Parse parmeters from function string (with nested structures) in javascript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* #return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)
Given a string like this:
a,"string, with",various,"values, and some",quoted
What is a good algorithm to split this based on commas while ignoring the commas inside the quoted sections?
The output should be an array:
[ "a", "string, with", "various", "values, and some", "quoted" ]
Looks like you've got some good answers here.
For those of you looking to handle your own CSV file parsing, heed the advice from the experts and Don't roll your own CSV parser.
Your first thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.
Python:
import csv
reader = csv.reader(open("some.csv"))
for row in reader:
print row
If my language of choice didn't offer a way to do this without thinking then I would initially consider two options as the easy way out:
Pre-parse and replace the commas within the string with another control character then split them, followed by a post-parse on the array to replace the control character used previously with the commas.
Alternatively split them on the commas then post-parse the resulting array into another array checking for leading quotes on each array entry and concatenating the entries until I reached a terminating quote.
These are hacks however, and if this is a pure 'mental' exercise then I suspect they will prove unhelpful. If this is a real world problem then it would help to know the language so that we could offer some specific advice.
Of course using a CSV parser is better but just for the fun of it you could:
Loop on the string letter by letter.
If current_letter == quote :
toggle inside_quote variable.
Else if (current_letter ==comma and not inside_quote) :
push current_word into array and clear current_word.
Else
append the current_letter to current_word
When the loop is done push the current_word into array
The author here dropped in a blob of C# code that handles the scenario you're having a problem with:
CSV File Imports in .Net
Shouldn't be too difficult to translate.
What if an odd number of quotes appear
in the original string?
This looks uncannily like CSV parsing, which has some peculiarities to handling quoted fields. The field is only escaped if the field is delimited with double quotations, so:
field1, "field2, field3", field4, "field5, field6" field7
becomes
field1
field2, field3
field4
"field5
field6" field7
Notice if it doesn't both start and end with a quotation, then it's not a quoted field and the double quotes are simply treated as double quotes.
Insedently my code that someone linked to doesn't actually handle this correctly, if I recall correctly.
Here's a simple python implementation based on Pat's pseudocode:
def splitIgnoringSingleQuote(string, split_char, remove_quotes=False):
string_split = []
current_word = ""
inside_quote = False
for letter in string:
if letter == "'":
if not remove_quotes:
current_word += letter
if inside_quote:
inside_quote = False
else:
inside_quote = True
elif letter == split_char and not inside_quote:
string_split.append(current_word)
current_word = ""
else:
current_word += letter
string_split.append(current_word)
return string_split
I use this to parse strings, not sure if it helps here; but with some minor modifications perhaps?
function getstringbetween($string, $start, $end){
$string = " ".$string;
$ini = strpos($string,$start);
if ($ini == 0) return "";
$ini += strlen($start);
$len = strpos($string,$end,$ini) - $ini;
return substr($string,$ini,$len);
}
$fullstring = "this is my [tag]dog[/tag]";
$parsed = getstringbetween($fullstring, "[tag]", "[/tag]");
echo $parsed; // (result = dog)
/mp
This is a standard CSV-style parse. A lot of people try to do this with regular expressions. You can get to about 90% with regexes, but you really need a real CSV parser to do it properly. I found a fast, excellent C# CSV parser on CodeProject a few months ago that I highly recommend!
Here's one in pseudocode (a.k.a. Python) in one pass :-P
def parsecsv(instr):
i = 0
j = 0
outstrs = []
# i is fixed until a match occurs, then it advances
# up to j. j inches forward each time through:
while i < len(instr):
if j < len(instr) and instr[j] == '"':
# skip the opening quote...
j += 1
# then iterate until we find a closing quote.
while instr[j] != '"':
j += 1
if j == len(instr):
raise Exception("Unmatched double quote at end of input.")
if j == len(instr) or instr[j] == ',':
s = instr[i:j] # get the substring we've found
s = s.strip() # remove extra whitespace
# remove surrounding quotes if they're there
if len(s) > 2 and s[0] == '"' and s[-1] == '"':
s = s[1:-1]
# add it to the result
outstrs.append(s)
# skip over the comma, move i up (to where
# j will be at the end of the iteration)
i = j+1
j = j+1
return outstrs
def testcase(instr, expected):
outstr = parsecsv(instr)
print outstr
assert expected == outstr
# Doesn't handle things like '1, 2, "a, b, c" d, 2' or
# escaped quotes, but those can be added pretty easily.
testcase('a, b, "1, 2, 3", c', ['a', 'b', '1, 2, 3', 'c'])
testcase('a,b,"1, 2, 3" , c', ['a', 'b', '1, 2, 3', 'c'])
# odd number of quotes gives a "unmatched quote" exception
#testcase('a,b,"1, 2, 3" , "c', ['a', 'b', '1, 2, 3', 'c'])
Here's a simple algorithm:
Determine if the string begins with a '"' character
Split the string into an array delimited by the '"' character.
Mark the quoted commas with a placeholder #COMMA#
If the input starts with a '"', mark those items in the array where the index % 2 == 0
Otherwise mark those items in the array where the index % 2 == 1
Concatenate the items in the array to form a modified input string.
Split the string into an array delimited by the ',' character.
Replace all instances in the array of #COMMA# placeholders with the ',' character.
The array is your output.
Heres the python implementation:
(fixed to handle '"a,b",c,"d,e,f,h","i,j,k"')
def parse_input(input):
quote_mod = int(not input.startswith('"'))
input = input.split('"')
for item in input:
if item == '':
input.remove(item)
for i in range(len(input)):
if i % 2 == quoted_mod:
input[i] = input[i].replace(",", "#COMMA#")
input = "".join(input).split(",")
for item in input:
if item == '':
input.remove(item)
for i in range(len(input)):
input[i] = input[i].replace("#COMMA#", ",")
return input
# parse_input('a,"string, with",various,"values, and some",quoted')
# -> ['a,string', ' with,various,values', ' and some,quoted']
# parse_input('"a,b",c,"d,e,f,h","i,j,k"')
# -> ['a,b', 'c', 'd,e,f,h', 'i,j,k']
I just couldn't resist to see if I could make it work in a Python one-liner:
arr = [i.replace("|", ",") for i in re.sub('"([^"]*)\,([^"]*)"',"\g<1>|\g<2>", str_to_test).split(",")]
Returns ['a', 'string, with', 'various', 'values, and some', 'quoted']
It works by first replacing the ',' inside quotes to another separator (|),
splitting the string on ',' and replacing the | separator again.
Since you said language agnostic, I wrote my algorithm in the language that's closest to pseudocode as posible:
def find_character_indices(s, ch):
return [i for i, ltr in enumerate(s) if ltr == ch]
def split_text_preserving_quotes(content, include_quotes=False):
quote_indices = find_character_indices(content, '"')
output = content[:quote_indices[0]].split()
for i in range(1, len(quote_indices)):
if i % 2 == 1: # end of quoted sequence
start = quote_indices[i - 1]
end = quote_indices[i] + 1
output.extend([content[start:end]])
else:
start = quote_indices[i - 1] + 1
end = quote_indices[i]
split_section = content[start:end].split()
output.extend(split_section)
output += content[quote_indices[-1] + 1:].split()
return output