parsing nested structures in R - json

I have a JSON-like string that represents a nested structure. It is not real JSON, in that the names and values are not quoted. I want to parse it into a nested structure, e.g. a list of lists.
#example:
x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
and the result should be like this:
x_list = list(a=1,b=2,c=c(1,2,3),d=list(e="something"))
Is there a convenient function I don't know about that does this kind of parsing?
Thanks.

If all of your data is consistent, there is a simple solution involving regex and jsonlite package. The code is:
if (!require(jsonlite, quietly = TRUE)) {
  # If the package is not installed: install it, then load it into the R session.
  install.packages("jsonlite", repos = "https://ftp.heanet.ie/mirrors/cran.r-project.org")
  library(jsonlite)
}
x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
json_x_string = "{\"a\":1, \"b\":2, \"c\":[1,2,3], \"d\":{\"e\":\"something\"}}"
fromJSON(json_x_string)
s <- gsub( "([A-Za-z]+)", "\"\\1\"", gsub( "([A-Za-z]*)=", "\\1:", x_string ) )
fromJSON( s )
The first section checks whether the package is installed. If it is, it loads it; otherwise it installs it and then loads it. I usually include this in any R code I write to make it simpler to transfer between PCs and people.
Your string is x_string; we want it to look like json_x_string, which gives the desired output when we call fromJSON().
The regex is split into two steps, and I'm fairly sure it could be made more elegant, but since this depends on your data being consistent I'll leave it like this for now. First it changes "=" to ":", then it adds quotation marks around every group of letters. Calling fromJSON(s) gives the output:
fromJSON(s)
$a
[1] 1
$b
[1] 2
$c
[1] 1 2 3
$d
$d$e
[1] "something"
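For readers outside R, the same two-step rewrite can be sketched in Python. This is only an illustration of the idea above, assuming (as the answer does) that keys and string values consist purely of ASCII letters; the name parse_loose is made up for this example.

```python
import json
import re


def parse_loose(text):
    """Rewrite the unquoted key=value syntax into JSON, then parse it."""
    # Step 1: turn every "name=" into "name:".
    with_colons = re.sub(r"([A-Za-z]*)=", r"\1:", text)
    # Step 2: wrap every run of letters in double quotes.
    quoted = re.sub(r"([A-Za-z]+)", r'"\1"', with_colons)
    return json.loads(quoted)


result = parse_loose("{a=1, b=2, c=[1,2,3], d={e=something}}")
print(result)
```

As with the R version, this breaks as soon as a key or value contains digits mixed with letters, spaces, or punctuation.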

I would rather avoid JSON parsing here because of its lack of extensibility and flexibility, and stick to a solution based on regex + recursion.
Here is extensible base code that parses your input string as desired.
The main recursion function:
# Parse string
parse.string = function(.string){
  regex = "^((.*)=)??\\{(.*)\\}"
  # Recursion termination: element parsing
  if(iselement(.string)){
    return(parse.element(.string))
  }
  # Extract components
  elements.str = gsub(regex, "\\3", .string)
  elements.vector = get.subelements(elements.str)
  # Recursively parse each element
  parsed.elements = list(sapply(elements.vector, parse.string, USE.NAMES = F))
  # Extract list's name and return
  name = gsub(regex, "\\2", .string)
  names(parsed.elements) = name
  return(parsed.elements)
}
Helping functions:
library(stringr)
# Test if the string is a base element
iselement = function(.string){
  grepl("^[^[:punct:]]+=[^\\{\\}]+$", .string)
}
# Parse element
parse.element = function(element.string){
  splits = strsplit(element.string, "=")[[1]]
  element = splits[2]
  # Parse numeric elements (suppress the coercion warning for non-numeric text)
  numeric.value = suppressWarnings(as.numeric(element))
  if(!is.na(numeric.value)){
    element = numeric.value
  }
  # TODO: Extend here to include vectors
  # Reformat and return
  element = list(element)
  names(element) = splits[1]
  return(element)
}
# Get subelements from a string
get.subelements = function(.string){
  # Regex of allowed elements - Extend here to include more types
  elements.regex = c("[^, ]+?=\\{.+?\\}", # Sublist
                     "[^, ]+?=\\[.+?\\]", # Vector
                     "[^, ]+?=[^=,]+")    # Base element
  str_extract_all(.string, pattern = paste(elements.regex, collapse = "|"))[[1]]
}
Parsing results:
string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
string_2 = "{a=1, b=2, c=[1,2,3], d=somthing}"
named_string = "xyz={a=1, b=2, c=[1,2,3], d={e=something, f=22}}"
named_string_2 = "xyz={d={e=something, f=22}}"
parse.string(string)
# [[1]]
# [[1]]$a
# [1] 1
#
# [[1]]$b
# [1] 2
#
# [[1]]$c
# [1] "[1,2,3]"
#
# [[1]]$d
# [[1]]$d$e
# [1] "something"
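Note that in the output above the vector c is still the raw string "[1,2,3]", which is the TODO left in parse.element. As a rough illustration of how the recursion could be extended to cover vectors, here is a sketch in Python rather than R. It replaces the element regexes with a depth counter for splitting, which is a swapped-in technique and not what the R code does; split_top_level and parse_value are hypothetical names.

```python
def split_top_level(body):
    """Split on commas that are not nested inside {} or []."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_value(text):
    """Recursively parse {k=v, ...} into dicts and [a, b, ...] into lists."""
    text = text.strip()
    if text.startswith("{") and text.endswith("}"):
        result = {}
        for part in split_top_level(text[1:-1]):
            key, _, value = part.partition("=")
            result[key.strip()] = parse_value(value)
        return result
    if text.startswith("[") and text.endswith("]"):
        return [parse_value(p) for p in split_top_level(text[1:-1])]
    # Base element: try a number first, fall back to the raw string.
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


parsed = parse_value("{a=1, b=2, c=[1,2,3], d={e=something}}")
print(parsed)
```

The sketch assumes well-formed input; it does no error reporting for unbalanced braces.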


Cannot index into null array from function returned variable, or issues accessing regex data returned

I'm not sure whether I'm returning the value from the function incorrectly, but when I try to access its info, I get the error above:
Cannot index into a null array
I've tried a couple of different ways, and I can't tell whether I'm returning this incorrectly from the function or just accessing the returned info incorrectly. In the question "Cannot index into null array", some of the asker's array had null values. But when I print my info to the screen before I exit the function, it has info. How do I return the value found in the function so that I can loop through the contents in my main code and use one of the strings in the object? This is a continuation of "parsing repeated pattern".
#parse data out of cpp code and loop through to further process
#function
Function Get-CaseContents{
[cmdletbinding()]
Param ( [string]$parsedCaseMethod, [string]$parseLinesGroupIndicator)
Process
{
# construct regex
$fullregex = [regex]"_stprintf[\s\S]*?_T\D*", # Start of error message, capture until digits
"(?<sdkErr>\d+)", # Error number, digits only
"\D[\s\S]*?", # match anything, non-greedy
"(?<sdkDesc>\((.+?)\))", # Error description, anything within parentheses, non-greedy
"([\s\S]*?outError\s*=(?<sdkOutErr>\s[a-zA-Z_]*))", # Capture OutErr string and parse out part after underscore later
"[\s\S]*?", # match anything, non-greedy
"(?<sdkSeverity>outSeverity\s*=\s[a-zA-Z_]*)", # Capture severity string and parse out part after underscore later
'' -join ''
# run the regex
$Values = $parsedCaseMethod | Select-String -Pattern $fullregex -AllMatches
# Convert Name-Value pairs to object properties
$result = foreach ($match in $Values.Matches){
[PSCustomObject][ordered]@{
sdkErr = $match.Groups['sdkErr']
sdkDesc = $match.Groups['sdkDesc']
sdkOutErr = $match.Groups['sdkOutErr']
sdkSeverity = ($match.Groups['sdkSeverity'] -split '_')[-1] #take part after _
}
}
#Write-Host "result:" $result -ForegroundColor Green
$result
return $Values
...
#main code
...
#call method to get case info (sdkErr, sdkDesc, sdkOutErr, sdkSeverity)
$ValuesCase = Get-CaseContents -parsedCaseMethod $matchFound -parseLinesGroupIndicator "_stprintf" #need to get returned info back
$result = foreach ($match in $ValuesCase.Matches){
[PSCustomObject][ordered]@{
sdkErr = $match.Groups['sdkErr']
sdkDesc = $match.Groups['sdkDesc']
sdkOutErr = $match.Groups['sdkOutErr']
sdkSeverity = ($match.Groups['sdkSeverity'] -split '_')[-1] #take part after _
} #result
} #foreach ValuesCase
The example of string sent to the function to parse is:
...
case kRESULT_STATUS_Undefined_Opcode:
_stprintf( outDevStr, _T("8004 - (Comm. Err 04) - %s(Undefined Opcode)"), errorStr);
outError = INVALID_PARAM;
outSeverity = CCA_WARNING;
break;
case kRESULT_STATUS_Comm_Timeout:
_stprintf( outDevStr, _T("8005 - (Comm. Err 05) - %s(Timeout sending command)"), errorStr);
outError = INVALID_PARAM;
outSeverity = CCA_WARNING;
break;
case kRESULT_STATUS_TXD_Failed:
_stprintf( outDevStr, _T("8006 - (Comm. Err 06) - %s(TXD Failed--Send buffer overflow.)"), errorStr);
outError = INVALID_PARAM;
outSeverity = CCA_WARNING;
break;
...
Another thing I tried is (but it also had the index into null array issue):
foreach($matchRegex in $ValuesCase.Matches)
{
$sdkOutErr = $matchRegex.Groups['sdkOutErr']
Write-Host sdkOutErr -ForegroundColor DarkMagenta
}
Ultimately, I need to grab $sdkOutErr to further process. I'll need to use the other variables too in the returned object, but this is the first one I need. I love the way the output is formatted in the function, but probably don't know how to return the info and use what is returned. I'm not sure what to search for to figure out the issue other than the error message, which leads me to believe I'm returning the info wrong. I don't think I need to return $result, because I think that's just a string with the values in the $values.Matches in the function. I need to access the values returned as I mentioned.
I checked, and the contents sent to the function is not blank.
I tried returning $results, and it looks like this when I write-Host, which would be difficult to access each sdkOutErr:
@{sdkErr=1000; sdkDesc=(Out of Memory); sdkOutErr= NO_MEMORY; sdkSeverity=FATAL} @{sdkErr=1002; sdkDesc=(Failed to load DLL); sdkOutErr= OTHER_ERROR; sdkSeverity=FATAL} @{sdkErr=1003; sdkDesc=(Failed to load DLL); sdkOutErr= OTHER_ERROR; sdkSeverity=FATAL} @{sdkErr=1004; sdkDesc=(Failed to open); sdkOutErr= OTHER_ERROR; sdkSeverity=FATAL} @{sdkErr=1005; sdkDesc=(Unable to access the specified profile); sdkOutErr= OTHER_ERROR; sdkSeverity=FATAL} @{sdkErr=100 ...
How can I return this from the function so that it's not a null array/index, and the data is accessible if I use a foreach loop (or two) in the main code to get the sdkOutErr (to start).
I'm fairly new to (complicated) PowerShell, and I have a feeling I need a map inside the array in my function, but I'm not sure.
Before I returned the function Values or results, it was printing something like this out. Once I added in main $ValuesCase=Get-CaseContents... (returning $values from function), or $parsedCase = Get-CaseContents... (returning $results from function), it stopped showing this on the screen:
sdkErr sdkDesc sdkOutErr sdkSeverity
------ ------- --------- -----------
1000 (Out of Memory) NO_MEMORY FATAL
1002 (Failed to load DLL) OTHER_ERROR FATAL
1003 (Failed to load DLL) OTHER_ERROR FATAL
1004 (Failed to open) OTHER_ERROR FATAL
You wrote: "I tried returning $results, and it looks like this when I write-Host, which would be difficult to access each sdkOutErr."
Getting all the sdkOutErr values is not as difficult as you might imagine:
$results.sdkOutErr # this will output the `sdkOutErr` value from each object in the array
Or, outside the function:
(Get-CaseContents -parsedCaseMethod $matchFound -parseLinesGroupIndicator "_stprintf").sdkOutErr
Another option, which might perform better if the result set is large, is to use ForEach-Object to grab just the sdkOutErr values:
$fullResults = Get-CaseContents -parsedCaseMethod $matchFound -parseLinesGroupIndicator "_stprintf"
$sdkOutErrValuesOnly = $fullResults |ForEach-Object -MemberName sdkOutErr
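The pattern in this answer (return a list of records from the function, then project one field out of it) is not specific to PowerShell. Below is a rough sketch of the same flow in Python, run against two of the case blocks from the question. The regex is a simplified variant of the question's (the sdkDesc group is left out), and get_case_contents is a hypothetical stand-in for Get-CaseContents, so treat it as an illustration only.

```python
import re

SOURCE = r'''
case kRESULT_STATUS_Undefined_Opcode:
_stprintf( outDevStr, _T("8004 - (Comm. Err 04) - %s(Undefined Opcode)"), errorStr);
outError = INVALID_PARAM;
outSeverity = CCA_WARNING;
break;
case kRESULT_STATUS_Comm_Timeout:
_stprintf( outDevStr, _T("8005 - (Comm. Err 05) - %s(Timeout sending command)"), errorStr);
outError = INVALID_PARAM;
outSeverity = CCA_WARNING;
break;
'''

CASE_PATTERN = re.compile(
    r'_stprintf[\s\S]*?_T\D*'                                 # start of message, up to the digits
    r'(?P<sdkErr>\d+)'                                        # error number
    r'[\s\S]*?outError\s*=\s*(?P<sdkOutErr>[A-Za-z_]+)'       # outError value
    r'[\s\S]*?outSeverity\s*=\s*(?P<sdkSeverity>[A-Za-z_]+)'  # outSeverity value
)


def get_case_contents(text):
    """Return one plain record (dict) per case block, not the raw match objects."""
    return [
        {
            "sdkErr": m.group("sdkErr"),
            "sdkOutErr": m.group("sdkOutErr"),
            # keep only the part after the last underscore, e.g. CCA_WARNING -> WARNING
            "sdkSeverity": m.group("sdkSeverity").split("_")[-1],
        }
        for m in CASE_PATTERN.finditer(text)
    ]


records = get_case_contents(SOURCE)
out_errs = [record["sdkOutErr"] for record in records]
print(out_errs)
```

Returning finished records rather than match objects means the caller never has to index into regex groups, which is where the null-array error tends to appear.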

How to parse a string to key value pair using regex?

What is the best way to parse the string into key value pair using regex?
Sample input:
application="fre" category="MessagingEvent" messagingEventType="MessageReceived"
Expected output:
application "fre"
Category "MessagingEvent"
messagingEventType "MessageReceived"
We already tried the following regex and it's working.
application=(?<application>(...)*) *category=(?<Category>\S*) *messagingEventType=(?<messagingEventType>\S*)
But we want a generic regex that will parse the sample input into the expected output as key-value pairs.
Any idea or solution will be helpful.
input = 'application="fre" category="MessagingEvent" messagingEventType="MessageReceived"'
puts input.
scan(/(\w+)="([^"]+)"/). # scan for KV-pairs
map{ |k, v| %Q|#{k.ljust(30,' ')}"#{v}"| }. # adjust as you requested
join($/) # join with platform-dependent line delimiters
#⇒ application                   "fre"
#  category                      "MessagingEvent"
#  messagingEventType            "MessageReceived"
Instead of using regex, it can be done by splitting the string and storing the pieces in a hash, like below:
input = 'application="fre" category="MessagingEvent" messagingEventType="MessageReceived"'
res = {}
input.split.each { |str| a,b = str.split('='); res[a] = b}
puts res
==> {"application"=>"\"fre\"", "category"=>"\"MessagingEvent\"", "messagingEventType"=>"\"MessageReceived\""}
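For comparison, here is a minimal Python sketch of the generic regex approach. It assumes every value is wrapped in double quotes and contains no embedded double quote; the variable names are only illustrative.

```python
import re

line = 'application="fre" category="MessagingEvent" messagingEventType="MessageReceived"'

# Each match yields a (key, value) tuple; dict() turns the list of tuples into a mapping.
pairs = dict(re.findall(r'(\w+)="([^"]*)"', line))

for key, value in pairs.items():
    print(f'{key} "{value}"')
```

Unlike the whitespace split above, the regex strips the surrounding quotes and keeps working when a value contains spaces.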

How Do I Consume an Array of JSON Objects using Plumber in R

I have been experimenting with Plumber in R recently, and am having success when I pass the following data using a POST request;
{"Gender": "F", "State": "AZ"}
This allows me to write a function like the following to return the data.
#* @post /score
score <- function(Gender, State){
  data <- list(
    Gender = as.factor(Gender)
    , State = as.factor(State))
  return(data)
}
However, when I try to POST an array of JSON objects, I can't seem to access the data through the function
[{"Gender":"F","State":"AZ"},{"Gender":"F","State":"NY"},{"Gender":"M","State":"DC"}]
I get the following error
{
"error": [
"500 - Internal server error"
],
"message": [
"Error in is.factor(x): argument \"Gender\" is missing, with no default\n"
]
}
Does anyone have an idea of how Plumber parses JSON? I'm not sure how to access and assign the fields to vectors to score the data.
Thanks in advance
I see two possible solutions here. The first would be a command line based approach which I assume you were attempting. I tested this on a Windows OS and used column based data.frame encoding which I prefer due to shorter JSON string lengths. Make sure to escape quotation marks correctly to avoid 'argument "..." is missing, with no default' errors:
curl -H "Content-Type: application/json" --data "{\"Gender\":[\"F\",\"F\",\"M\"],\"State\":[\"AZ\",\"NY\",\"DC\"]}" http://localhost:8000/score
# [["F","F","M"],["AZ","NY","DC"]]
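If the client only has the row-oriented array from the question, it has to be pivoted into this column-oriented shape before posting. A small sketch of that pivot, written in Python purely for illustration, follows; rows_to_columns is a hypothetical helper and it assumes every record carries the same keys.

```python
import json


def rows_to_columns(rows):
    """Pivot a list of records into a dict of equal-length columns."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


rows = json.loads(
    '[{"Gender":"F","State":"AZ"},{"Gender":"F","State":"NY"},{"Gender":"M","State":"DC"}]'
)
payload = json.dumps(rows_to_columns(rows), separators=(",", ":"))
print(payload)
```

The resulting payload string is the body passed to --data in the curl call above.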
The second approach is R native and has the advantage of having everything in one place:
library(jsonlite)
library(httr)
## sample data
lst = list(
Gender = c("F", "F", "M")
, State = c("AZ", "NY", "DC")
)
## jsonify
jsn = lapply(
lst
, toJSON
)
## query
request = POST(
url = "http://localhost:8000/score?"
, query = jsn # values must be length 1
)
response = content(
request
, as = "text"
, encoding = "UTF-8"
)
fromJSON(
response
)
# [,1]
# [1,] "[\"F\",\"F\",\"M\"]"
# [2,] "[\"AZ\",\"NY\",\"DC\"]"
Be aware that httr::POST() expects a list of length-1 values as query input, so the array data should be jsonified beforehand. If you want to avoid the additional package imports altogether, some system(), sprintf(), etc. magic should do the trick.
Finally, here is my plumber endpoint (living in R/plumber.R and condensed a little bit):
#* @post /score
score = function(Gender, State){
  lapply(
    list(Gender, State)
    , as.factor
  )
}
and code to fire up the API:
pr = plumber::plumb("R/plumber.R")
pr$run(port = 8000)

Parse a MySQL insert statement with multiple rows [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
          ^                                                      ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular (atomic grouping since Ruby 1.9.3)
JavaScript API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
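Python's built-in re module is another flavor without recursion, so the fixed-depth approach applies there too. Rather than typing the nesting by hand, the non-unrolled pattern can be generated for any depth. The sketch below does that; nested_pattern is a made-up helper, and calling it with depth 3 reproduces the first pattern shown above.

```python
import re


def nested_pattern(depth):
    """Build a regex matching parentheses with up to `depth` inner nesting levels."""
    pattern = r"\([^)(]*\)"  # innermost level: no nested parentheses allowed
    for _ in range(depth):
        # Wrap the previous level: plain characters or one deeper group, repeated.
        pattern = r"\((?:[^)(]|" + pattern + r")*\)"
    return pattern


text = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
match = re.search(nested_pattern(3), text)
print(match.group(0))
```

Input nested more deeply than the chosen depth will only match an inner, shallower group, so the depth has to be picked with the data in mind.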
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl 1 2 3 4
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
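A minimal sketch of that counter-based scan, written here in Python, might look like the following. The function name outer_parens is invented for this example, and it returns only the first balanced group.

```python
def outer_parens(text):
    """Return the first balanced (...) group in text, or None if there is none."""
    depth = 0      # number of "(" seen that are not yet closed
    start = None   # index of the "(" that opened the current outer group
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # Counter is back to zero: this is the matching outer ")".
                return text[start:i + 1]
    return None


sample = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
print(outer_parens(sample))
```

A stray ")" before any "(" is simply skipped, and an unclosed group yields None.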
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as a finite-state automaton (FSA). As the name indicates, an FSA remembers only its current state; it keeps no information about previous states.
Consider an FSA with two states, S1 and S2, where S1 is both the start state and the final state. If we feed it the string 0110, the transitions go as follows:
         0     1     1     0
-> S1 -> S2 -> S2 -> S2 -> S1
In the above steps, when we are at the second S2, i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01, as it can only remember the current state and the next input symbol.
In the above problem, we need to know the number of opening parentheses; this means it has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written.
However, an algorithm can be written to do this task. Such algorithms generally fall under pushdown automata (PDA). A PDA is one level above an FSA: it has an additional stack to store extra information. PDAs can solve the above problem, because we can 'push' each opening parenthesis onto the stack and 'pop' it once we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they do not.
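The push/pop idea can be sketched in a few lines of Python. This is an illustration only; it also handles square and curly brackets, which goes slightly beyond the single-parenthesis case discussed above, and is_balanced is a name chosen for this example.

```python
# Maps each closing bracket to the opening bracket it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}


def is_balanced(text):
    """Check bracket balance with an explicit stack, ignoring other characters."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)              # push every opening bracket
        elif ch in PAIRS:
            # A closer must match the most recently pushed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack                      # leftovers mean unclosed openers


print(is_balanced("(text here(possible text)text(possible text(more text)))"))
```

The stack is what an FSA lacks: its height records how many openers are still pending.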
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
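To make the "first opening to last closing" behavior concrete, here is a short Python check of the lookaround regex from this answer. The sample string is invented for the demonstration.

```python
import re

sample = "a (b) c (d) e"

# Greedy .* runs from just after the first "(" to just before the last ")".
found = re.search(r"(?<=\().*(?=\))", sample)
print(found.group(0))
```

The result spans both groups, including the ") c (" in between, which shows that the pattern does not pair the parentheses up.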
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns, and regular expressions are the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
                                                 Character markEnd, Boolean includeMarkers)
{
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;
    int lastOpenDelimiter = -1;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                lastOpenDelimiter = (includeMarkers ? i : i + 1);
            }
        }
        else if (c == markEnd) {
            if (level == 1) {
                subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
            }
            if (level > 0) level--;
        }
    }
    return subTreeList;
}
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions; see @dehmann.
If it's just first open to last close, see @Zach.
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile(r"""
    ( .*? )
    ( \( | \) | \[ | \] | \{ | \} | \< | \> |
      \' | \" | BEGIN | END | $ )
    ( .* )
    """, re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)
        if m.group(1) != "":
            lst.append(m.group(1))
        if m.group(2) == term:
            return lst, m.group(3)
        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(str(e))
    print("done")
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you first occurrence
str.lastIndexOf(')'); - last one
So you need a string between,
String searchedString = str.substring(str.indexOf('('), str.lastIndexOf(')'));
Because JS regex doesn't support recursive matching, I couldn't make balanced-parentheses matching work with it.
So this is a simple JavaScript for-loop version that turns a "method(arg)" string into an array, for inputs such as:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
  let ops = []
  let method, arg
  let isMethod = true
  let open = []
  for (const char of str) {
    // skip whitespace
    if (char === ' ') continue
    // append method or arg string
    if (char !== '(' && char !== ')') {
      if (isMethod) {
        (method ? (method += char) : (method = char))
      } else {
        (arg ? (arg += char) : (arg = char))
      }
    }
    if (char === '(') {
      // nested parenthesis should be a part of arg
      // (arg may still be undefined here, e.g. for "f(())")
      if (!isMethod) arg = (arg || '') + char
      isMethod = false
      open.push(char)
    } else if (char === ')') {
      open.pop()
      // check end of arg
      if (open.length < 1) {
        isMethod = true
        ops.push({ method, arg })
        method = arg = undefined
      } else {
        arg = (arg || '') + char
      }
    }
  }
  return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more @ here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
    """ returns a list of (start, end) positions of balanced {...} snippets in a string (data)"""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            # compare against None so that a brace at index 0 is handled
            if start_pos is None:
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it may be useful to some coming here to search for a nested-structure regexp:
Parse parameters from function string (with nested structures) in JavaScript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
 * get param content of function string.
 * only params string should be provided without parentheses
 * WORK even if some/all params are not set
 * @return [param1, param2, param3]
 */
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

R fromJSON incorrectly reads Unicode from file

I am trying to read a JSON object from a file in R; the file contains names and surnames in Unicode. Here is the content of the file "x1.json":
{"general": {"last_name":
"\u041f\u0430\u0449\u0435\u043d\u043a\u043e", "name":
"\u0412\u0456\u0442\u0430\u043b\u0456\u0439"}}
I use RJSONIO package and when I declare the JSON object directly, everything goes well:
x<-fromJSON('{"general": {"last_name": "\u041f\u0430\u0449\u0435\u043d\u043a\u043e", "name": "\u0412\u0456\u0442\u0430\u043b\u0456\u0439"}}')
x
# $general
# last_name name
# "Пащенко" "Віталій"
But when I read the same content from the file, the strings are converted to some encoding unknown to me:
x1<-fromJSON("x1.json")
x1
# $general
# last_name name
# "\0370I5=:>" "\022VB0;V9"
Note that these are not escaped "\u" (which was discussed here)
I have tried to specify "encoding" argument, but this did not help:
> x1<-fromJSON("x1.json", encoding = "UTF-8")
> x1
$general
last_name name
"\0370I5=:>" "\022VB0;V9"
System information:
> Sys.getlocale()
[1] "LC_COLLATE=Ukrainian_Ukraine.1251;LC_CTYPE=Ukrainian_Ukraine.1251;LC_MONETARY=Ukrainian_Ukraine.1251;LC_NUMERIC=C;LC_TIME=Ukrainian_Ukraine.1251"
Switching to English (Sys.setlocale("LC_ALL","English")) has not changed the situation.
If your file had unicode data like this (instead of its representation)
{"general": {"last_name":"Пащенко", "name":"Віталій"}}
then,
> fromJSON("x1.json", encoding = "UTF-8")
will work
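The distinction between Unicode data and its escaped representation can be seen in any JSON parser. As a point of comparison only (this is Python's standard json module, not RJSONIO), the escaped file content from the question is plain ASCII, and the parser is what turns the \u sequences into characters:

```python
import json

# Raw string: the backslashes stay literal, exactly as they appear in x1.json.
raw = (
    r'{"general": {"last_name": "\u041f\u0430\u0449\u0435\u043d\u043a\u043e", '
    r'"name": "\u0412\u0456\u0442\u0430\u043b\u0456\u0439"}}'
)

print(raw.isascii())      # the escaped representation contains only ASCII
data = json.loads(raw)    # decoding the \u escapes is the parser's job
print(data["general"]["last_name"], data["general"]["name"])
```

Since the escaped text is pure ASCII, the encoding used to read the file should not matter for it; only the escape decoding does.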
If you really want your code to work with current file, try like this
JSONstring=""
con <- file("x1.json",open = "r")
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
JSONstring <- paste(JSONstring,parse(text = paste0("'",oneLine, "'"))[[1]],sep='')
}
fromJSON(JSONstring)
Use library("jsonlite") instead of rjson:
library("jsonlite")
mydf <- toJSON( mydf, encoding = "UTF-8")
This works fine.