JSON data deserialize using Regular Expressions - json

I'm having trouble extracting keys and values from JSON data with a regular expression when the JSON contains \ or ".
{
"KeyOne":"Value One",
"KeyTwo": "Value \\ two",
"KeyThree": "Value \" Three",
"KeyFour": "ValueFour\\"
}
This is sample data from which I want to read the keys and values. How can I achieve this with regular expressions?
Note: I'm deserializing this JSON data on the server side (SAP ABAP).

On releases earlier than 7.2 (from memory) you can use the class /ui2/cl_json.
On 7.3 or later, use the kernel-based XML library, which also supports JSON (the sample code below uses the sXML reader classes).
It is orders of magnitude faster than /ui2/cl_json.
Where the source structure is known, you can use the identity transformation approach: either create a matching structure in ABAP or use an existing ABAP equivalent. Otherwise, just traverse the JSON document.
The example string from the question was parsed without problems.
EDIT: Add sample code
REPORT zjsondemo.
CLASS lcl DEFINITION CREATE PUBLIC.
PUBLIC SECTION.
METHODS json_stru_known.
METHODS json_stru_traverse.
ENDCLASS.
CLASS lcl IMPLEMENTATION.
METHOD json_stru_known.
DATA l_src_json TYPE string.
DATA l_mara TYPE mara.
WRITE: / 'DEMO 1 Known structure Identity transformation '.
l_src_json = `{"MARA":{"MATNR":"012345678", "MATKL": "DUMMY" }}`.
WRITE: / 'Convert to MARA -> ', l_src_json.
CALL TRANSFORMATION id SOURCE XML l_src_json
RESULT mara = l_mara. "
WRITE: / 'MARA - MATNR', l_mara-matnr,
/ ' MATKL', l_mara-matkl.
TYPES:
BEGIN OF lty_foo_bar,
KeyOne TYPE string,
KeyTwo Type string,
KeyThree TYPE string,
KeyFour Type string,
END OF lty_foo_bar.
DATA:
lv_json_string TYPE string,
ls_data TYPE lty_foo_bar.
" in this example we use upper case attribute names
"because we map to SAP target
" structure which has upper case names.
" if you need lowercase variables then you can not map straight to an
" SAP type. Then you need to use the traverse technique. See example 2
lv_json_string = |\{| &&
|"KEYONE":"Value One",| &&
|"KEYTWO": "Value \\\\ two", | &&
|"KEYTHREE": "Value \\" Three", | &&
|"KEYFOUR": "ValueFour\\\\" | &&
|\}|.
lv_json_string = `{"JUNK":` && lv_json_string && `}`.
CALL TRANSFORMATION id SOURCE XML lv_json_string
RESULT junk = ls_data. "
Write: / ls_data-keyone,ls_data-keytwo, ls_data-keythree , ls_data-keyfour.
ENDMETHOD.
METHOD json_stru_traverse.
DATA l_src_json TYPE string.
DATA: lo_node TYPE REF TO if_sxml_node.
DATA: lif_element TYPE REF TO if_sxml_open_element,
lif_element_close TYPE REF TO if_sxml_close_element,
lif_value_node TYPE REF TO if_sxml_value,
l_val TYPE string,
l_attr TYPE if_sxml_attribute=>attributes,
l_att_val TYPE string.
FIELD-SYMBOLS: <attr> LIKE LINE OF l_attr.
WRITE: / 'DEMO 2 Traverse any json document'.
l_src_json = `{"MATNR":"012345678", "MATKL": "DUMMY", "SOMENODE": "With this value" }`.
WRITE: / 'Parse as JSON with 3 nodes -> ', l_src_json.
DATA(reader) = cl_sxml_string_reader=>create( cl_abap_codepage=>convert_to( l_src_json ) ).
lo_node = reader->read_next_node( ). " {
IF lo_node IS INITIAL.
RETURN.
ENDIF.
DO 3 TIMES.
lif_element ?= reader->read_next_node( ).
l_attr = lif_element->get_attributes( ).
LOOP AT l_attr ASSIGNING <attr>.
l_att_val = <attr>->get_value( ).
WRITE: / 'Attribute:', l_att_val.
ENDLOOP.
lif_value_node ?= reader->read_next_node( ).
l_val = lif_value_node->get_value( ).
WRITE: '=>', l_val.
lif_element_close ?= reader->read_next_node( ).
ENDDO.
ENDMETHOD.
ENDCLASS.
START-OF-SELECTION.
DATA lo_lcl TYPE REF TO lcl.
CREATE OBJECT lo_lcl.
lo_lcl->json_stru_known( ).
lo_lcl->json_stru_traverse( ).
The SAP system is supplied with many example programs.
Search for demo*json
SAP docu on json parsing

As @mrzasa and @joanis said in their comments: Do not use RegEx to parse JSON!
For small objects or when performance is not a concern, you can use /ui2/cl_json:
TYPES:
BEGIN OF lty_foo_bar,
KeyOne TYPE string,
KeyTwo Type string,
KeyThree TYPE string,
KeyFour Type string,
END OF lty_foo_bar.
DATA:
lv_json_string TYPE string,
ls_data TYPE lty_foo_bar.
lv_json_string = |\{| &&
|"KeyOne":"Value One",| &&
|"KeyTwo": "Value \\\\ two", | &&
|"KeyThree": "Value \\" Three", | &&
|"KeyFour": "ValueFour\\\\" | &&
|\}|.
/ui2/cl_json=>deserialize(
EXPORTING
json = lv_json_string
CHANGING
data = ls_data ).
ls_data-KeyOne contains 'Value One' and so on.
For larger objects and/or better performance, check the kernel-based XML approach from @phil soady's answer. Handling upper- and lower-case letters correctly is still a headache in ABAP anyway.
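To see why a real parser copes with the escaped characters where a regex struggles, here is a small illustration in Python rather than ABAP (the variable names are my own, not part of either answer). It feeds the sample document from the question to a standard JSON parser, which resolves the \\ and \" escapes while decoding:

```python
import json

# The sample document from the question, kept as raw text so the
# backslashes reach the parser exactly as they appear in the JSON.
raw = r'''{
"KeyOne":"Value One",
"KeyTwo": "Value \\ two",
"KeyThree": "Value \" Three",
"KeyFour": "ValueFour\\"
}'''

# The parser tracks string boundaries and escape sequences itself,
# which is the part a simple regex gets wrong.
data = json.loads(raw)

for key, value in data.items():
    print(key, "=>", value)
```

Each decoded value holds a single backslash or a plain double quote where the JSON text had an escape sequence.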

Related

How to use a link rule reference in an ordered choice expression in a TextX grammar?

I am new to TextX. I am trying to create a grammar for defining data types that have fields that could be of a simple type or of the type of another data type. The grammar description is:
Library:
data_types *= DataType
;
DataType: name=ID "{"
fields*=Field
"}" ;
Field: type=([DataType] | ID) name=ID;
//Type: [DataType] | ID;
An example of a model following this grammar would be
vec {
int64 a
int64 b
int64 c
}
matrix {
vec a
vec b
}
I want to link the type of the field to either a data type that is already declared, or to some simple string. However, when compiling the above grammar with textx generate dummy.tx --target dot, I get the error Error: None:9:13: error: Expected ''((\\')|[^'])*'' or '"((\\")|[^"])*"' or re_match or rule_ref or '[' at position dummy.tx:(9, 13) => 'eld: type=*([DataType'..
Is there any way to accomplish what I want? I have tried putting the type declaration in a separate block, as seen in the comment, but that did not help. Any suggestion or hint would be highly appreciated.
A standard approach is to use custom classes and create builtins for all types that are not created by users themselves. It is best to show how this is done in code using your example. Note the use of registration, so that the language with its registered builtins is available to the textx CLI command. Also, see the entity example, as the same technique is used there.
from textx import metamodel_from_str
from textx.registration import (language, register_language,
                                metamodel_for_language)
# We use registration support to register language
# This way it will be available to textx CLI command
@language('library', '.lib')
def library_lang():
    "Library language."
    grammar = r'''
    Library:
        data_types *= DataType
    ;
    DataType: name=ID "{"
        fields*=Field
    "}" ;
    Field: type=[Type] name=ID;
    Type: DataType | BuiltInType;
    BuiltInType: name=ID;
    '''
    # Here we use our class for builtin types so we can
    # make our own instances for builtin types.
    # See textX Entity example for more.
    class BuiltInType:
        def __init__(self, parent, name):
            self.parent = parent
            self.name = name
    # Create all builtin instances.
    builtins = {
        'int64': BuiltInType(None, 'int64'),
    }
    return metamodel_from_str(grammar,
                              # Register custom classes and builtins.
                              classes=[BuiltInType],
                              builtins=builtins)
# This should really be done through setup.{cfg,py}
# Here it is done through registration API for an example to
# be self-contained.
register_language(library_lang)
model_str = r'''
vec {
int64 a
int64 b
int64 c
}
matrix {
vec a
vec b
}
'''
# Now we can get registered language metamodel by name and
# parse our model.
model = metamodel_for_language('library').model_from_str(model_str)
# ... do something with the model
assert len(model.data_types) == 2
assert model.data_types[0].name == 'vec'
assert model.data_types[0].fields[0].name == 'a'
assert model.data_types[0].fields[0].type.name == 'int64'

How to parse CSV file in the most performant way?

I would like to parse big CSV files in ABAP in the most performant way under the following conditions:
We do not know the structure of the CSV -> the parse result should be a table of string_table or something similar
The parsing should happen in accordance to https://www.rfc-editor.org/rfc/rfc4180
No solution specific calls
I found a very nice blog https://blogs.sap.com/2014/09/09/understanding-csv-files-and-their-handling-in-abap/ but it has its shortcoming:
Write your own code - The code example is not sufficient
Read the file using KCD_CSV_FILE_TO_INTERN_CONVERT - solution specific (not available everywhere) and will dump on fields that are big enough
Use RTTI and dynamic programming along with FM RSDS_CONVERT_CSV - we do not know the structure in advance
Use class CL_RSDA_CSV_CONVERTER - we do not know the structure in advance
I also checked the first available solution on github - https://github.com/thedoginthewok/ZwdCSV . Unfortunately, it has macros in the code (absolutely unacceptable) and also requires you to know the structure in advance.
I also tried to use the regex to do the job, but on big files this is too slow.
Even though I am extremely annoyed by this fact, I had to create a solution myself (I cannot believe that I actually did it - it should be in the standard...)
My first solution was a direct copy-paste of Java code into ABAP (https://mkyong.com/java/how-to-read-and-parse-csv-file-in-java/). Unfortunately, as my other question, How to iterate over string characters in ABAP in performant way?, showed, iterating over a string in ABAP is not as easy as it is in Java.
I then tried a split/count approach, and so far it has the best performance. Does anyone know a better way to achieve this?
REPORT z_csv_test.
CLASS lcl_csv_parser DEFINITION CREATE PRIVATE.
PUBLIC SECTION.
TYPES:
tt_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
CLASS-METHODS:
create
IMPORTING
!iv_delimiter TYPE string DEFAULT '"'
!iv_separator TYPE string DEFAULT ','
!iv_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf
RETURNING
VALUE(r_result) TYPE REF TO lcl_csv_parser.
METHODS:
parse
IMPORTING
iv_string TYPE string
RETURNING
VALUE(r_result) TYPE tt_string_matrix,
constructor
IMPORTING
!iv_delimiter TYPE string
!iv_separator TYPE string
!iv_line_separator TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
DATA:
gv_delimiter TYPE string,
gv_separator TYPE string,
gv_line_separator TYPE string,
gv_escaped_delimiter TYPE string.
METHODS parse_line_to_string_table
IMPORTING
iv_line TYPE string
RETURNING
VALUE(r_result) TYPE string_table.
ENDCLASS.
CLASS lcl_csv_parser IMPLEMENTATION.
METHOD create.
r_result = NEW #(
iv_delimiter = iv_delimiter
iv_line_separator = CONV #( iv_line_separator )
iv_separator = iv_separator ).
ENDMETHOD.
METHOD constructor.
me->gv_delimiter = iv_delimiter.
me->gv_separator = iv_separator.
me->gv_line_separator = iv_line_separator.
me->gv_escaped_delimiter = |{ iv_delimiter }{ iv_delimiter }|.
ENDMETHOD.
METHOD parse.
"get the lines
SPLIT iv_string AT me->gv_line_separator INTO TABLE DATA(lt_lines).
DATA lx_open_line TYPE abap_bool VALUE abap_false.
DATA lv_current_line TYPE string.
LOOP AT lt_lines ASSIGNING FIELD-SYMBOL(<ls_line>).
FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_line> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
IF ( lv_count MOD 2 ) = 1.
IF lx_open_line = abap_true.
lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
lx_open_line = abap_false.
APPEND parse_line_to_string_table( lv_current_line ) TO r_result.
ELSE.
lv_current_line = <ls_line>.
lx_open_line = abap_true.
ENDIF.
ELSE.
IF lx_open_line = abap_true.
lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
ELSE.
APPEND parse_line_to_string_table( <ls_line> ) TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
ENDMETHOD.
METHOD parse_line_to_string_table.
SPLIT iv_line AT me->gv_separator INTO TABLE DATA(lt_line).
DATA lx_open_field TYPE abap_bool VALUE abap_false.
DATA lv_current_field TYPE string.
LOOP AT lt_line ASSIGNING FIELD-SYMBOL(<ls_field>).
FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_field> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
IF ( lv_count MOD 2 ) = 1.
IF lx_open_field = abap_true.
lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
lx_open_field = abap_false.
APPEND lv_current_field TO r_result.
ELSE.
lv_current_field = <ls_field>.
lx_open_field = abap_true.
ENDIF.
ELSE.
IF lx_open_field = abap_true.
lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
ELSE.
APPEND <ls_field> TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
REPLACE ALL OCCURRENCES OF me->gv_escaped_delimiter IN TABLE r_result WITH me->gv_delimiter.
ENDMETHOD.
ENDCLASS.
CLASS lcl_test_csv_parser DEFINITION
FINAL
CREATE PUBLIC .
PUBLIC SECTION.
CLASS-METHODS run.
CLASS-METHODS get_file
RETURNING VALUE(r_result) TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
ENDCLASS.
CLASS lcl_test_csv_parser IMPLEMENTATION.
METHOD get_file.
DATA lv_file_line TYPE string.
DO 10 TIMES.
lv_file_line = |"1234,{ cl_abap_char_utilities=>cr_lf }567890",{ lv_file_line }|.
ENDDO.
lv_file_line = lv_file_line && cl_abap_char_utilities=>cr_lf.
DATA(lt_file_as_table) = VALUE string_table(
FOR i = 1 THEN i + 1 UNTIL i = 1000000
( lv_file_line ) ).
CONCATENATE LINES OF lt_file_as_table INTO r_result.
ENDMETHOD.
METHOD run.
DATA lv_prepare_start TYPE timestampl.
GET TIME STAMP FIELD lv_prepare_start.
DATA(lv_file) = get_file( ).
DATA lv_prepare_end TYPE timestampl.
GET TIME STAMP FIELD lv_prepare_end.
WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = lv_prepare_end tstmp2 = lv_prepare_start ) }|.
DATA lv_parse_start TYPE timestampl.
GET TIME STAMP FIELD lv_parse_start.
DATA(lo_parser) = lcl_csv_parser=>create( ).
DATA(lt_file) = lo_parser->parse( lv_file ).
DATA lv_parse_end TYPE timestampl.
GET TIME STAMP FIELD lv_parse_end.
WRITE |Parse took { cl_abap_tstmp=>subtract( tstmp1 = lv_parse_end tstmp2 = lv_parse_start ) }|.
ENDMETHOD.
ENDCLASS.
START-OF-SELECTION.
lcl_test_csv_parser=>run( ).
I'd like to present a different approach that relies heavily on find. Compared to your line-based approach, it seems to have equivalent performance for unquoted fields but performs slightly better when quoted fields are present.
In general, this uses the pattern position = find( off = position + 1 ) to iterate over the string in chunks, and then uses substring to copy ranges into strings. In a loop that iterates a million times, every nanosecond saved has an impact on performance, and moving as much work as possible out of the inner loop increases performance significantly. For the "simple" case of 10-digit fields both algorithms perform equally well; for "longer" 30-digit fields, however, your algorithm gets faster in comparison. For fields with quotes, the scan-and-concat approach I've used seems to be faster than the "reconstruct" approach.
I guess that although small gains are possible through more clever ABAP, further significant optimizations are only possible by pushing even more of the work into the engine.
Anyway, here's the algorithm:
CLASS lcl_csv_parser_find IMPLEMENTATION.
METHOD parse.
DATA line TYPE string_table.
DATA position TYPE i.
DATA(string_length) = strlen( i_string ).
" Accessing member fields is slightly slower than accessing local variables; in a tight loop this matters
DATA(separators) = me->separators.
DATA(delimiter) = me->delimiter.
CHECK string_length <> 0.
" Checking for delimiters inside the DO loop is quite slow.
" Scanning the whole file once up front allows skipping that check if no delimiter is present.
" This led to a slight performance increase of 1s for 1 million rows
DATA(next_delimiter) = find( val = i_string sub = delimiter ).
DO.
DATA(start_position) = position.
DATA(field) = ``.
" Check if field is enclosed in double quotes, as we need to unescape then
IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
start_position = start_position + 1. " literal starts after opening quote
DO.
position = find( val = i_string off = position + 1 sub = delimiter ).
" literal must be closed
" ASSERT position <> -1.
DATA(subliteral_length) = position - start_position.
field = field && substring( val = i_string off = start_position len = subliteral_length ).
DATA(following_position) = position + 1.
IF following_position >= string_length OR i_string+following_position(1) <> delimiter.
" End of literal is reached
position = position + 1. " skip closing quote
EXIT. " DO
ELSE.
" Found escape quote instead
position = following_position + 1.
field = field && me->delimiter.
" continue searching
ENDIF.
" ASSERT sy-index < 1000.
ENDDO.
ELSE.
" Unescaped field, simply find the ending comma or newline
position = find_any_of( val = i_string off = position + 1 sub = separators ).
IF position = -1.
position = string_length.
ENDIF.
field = substring( val = i_string off = start_position len = position - start_position ).
ENDIF.
APPEND field TO line.
" Check if line ended and new line is started
DATA(current) = substring( val = i_string off = position len = 2 ).
IF current = me->line_separator.
APPEND line TO r_result.
CLEAR line.
position = position + 2. " skip newline
ELSE.
" ASSERT i_string+position(1) = me->separator.
position = position + 1.
ENDIF.
" Check if file ended
IF position >= string_length.
RETURN.
ENDIF.
" ASSERT sy-index < 100000001.
ENDDO.
ENDMETHOD.
ENDCLASS.
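For readers without an ABAP system at hand, the same find-driven scanning idea can be sketched in Python. This is an illustrative port under my own assumptions, not a translation of the benchmarked ABAP code: the function name parse_csv_find and its parameters are hypothetical, str.find stands in for the ABAP find function, and no performance claim is implied.

```python
def parse_csv_find(text, delimiter='"', separator=",", line_separator="\r\n"):
    """Parse CSV text by jumping between find() hits instead of
    inspecting one character at a time (hypothetical helper)."""
    rows, row = [], []
    position, length = 0, len(text)
    if length == 0:
        return rows
    while True:
        if text.startswith(delimiter, position):
            # Quoted field: hop from quote to quote, unescaping doubled quotes.
            parts = []
            start = position + 1
            while True:
                end = text.find(delimiter, start)
                if end == -1:
                    raise ValueError("unterminated quoted field")
                parts.append(text[start:end])
                if text.startswith(delimiter, end + 1):
                    parts.append(delimiter)  # doubled quote is an escaped quote
                    start = end + 2
                else:
                    position = end + 1  # skip the closing quote
                    break
            field = "".join(parts)
        else:
            # Unquoted field: ends at the next separator or line break.
            hits = [p for p in (text.find(separator, position),
                                text.find(line_separator, position)) if p != -1]
            end = min(hits) if hits else length
            field = text[position:end]
            position = end
        row.append(field)
        if position >= length:
            rows.append(row)  # last line had no trailing line break
            return rows
        if text.startswith(line_separator, position):
            rows.append(row)
            row = []
            position += len(line_separator)
            if position >= length:
                return rows
        elif text.startswith(separator, position):
            position += len(separator)
        else:
            raise ValueError(f"unexpected character at offset {position}")


sample = '1,"2,\r\n3","say ""hi"""\r\nx,y\r\n'
result = parse_csv_find(sample)
print(result)
```

The sample covers the three awkward cases from the question: a separator inside quotes, a line break inside quotes, and an escaped (doubled) quote.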
As a sidenote, instead of creating a huge table of string fields as stated in #1, I would experiment with some kind of "visitor pattern", e.g. pass an instance of such an interface to the parser:
INTERFACE if_csv_visitor.
METHODS begin_line.
METHODS end_line.
METHODS visit_field
IMPORTING
i_field TYPE string.
ENDINTERFACE.
In a lot of cases you'll write the CSV fields into a structure anyway,
so this saves allocating that quite large table.
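A rough sketch of that visitor idea, written in Python for brevity. The class CountingVisitor and the function parse_with_visitor are hypothetical names, and Python's standard csv module stands in for the hand-written parser; the point is only that the parser pushes each field to callbacks instead of building the full table.

```python
import csv
import io


class CountingVisitor:
    """Hypothetical visitor: keeps running totals instead of storing rows."""

    def __init__(self):
        self.lines = 0
        self.fields = 0
        self.last_field = None

    def begin_line(self):
        self.lines += 1

    def visit_field(self, field):
        self.fields += 1
        self.last_field = field

    def end_line(self):
        pass  # a real visitor might flush a filled structure here


def parse_with_visitor(text, visitor):
    # newline="" leaves line endings untouched so that the csv module
    # can handle line breaks inside quoted fields itself.
    for row in csv.reader(io.StringIO(text, newline="")):
        visitor.begin_line()
        for field in row:
            visitor.visit_field(field)
        visitor.end_line()


visitor = CountingVisitor()
parse_with_visitor('"1234,\r\n7890",abc\r\nx,"y""z"\r\n', visitor)
print(visitor.lines, visitor.fields, visitor.last_field)
```

Only one row is alive at any time, so memory use no longer grows with the size of the file.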
And for further reference, here's the whole report:
*&---------------------------------------------------------------------*
*& Report Z_CSV
*&---------------------------------------------------------------------*
*&
*&---------------------------------------------------------------------*
REPORT Z_CSV.
* --------------------- Generic CSV Parser ----------------------------*
CLASS lcl_csv_parser DEFINITION ABSTRACT.
PUBLIC SECTION.
TYPES:
t_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
METHODS:
parse ABSTRACT
IMPORTING
i_string TYPE string
RETURNING
VALUE(r_result) TYPE t_string_matrix,
constructor
IMPORTING
i_delimiter TYPE string DEFAULT '"'
i_separator TYPE string DEFAULT ','
i_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf.
PROTECTED SECTION.
DATA:
delimiter TYPE string,
separator TYPE string,
line_separator TYPE string,
escaped_delimiter TYPE string,
separators TYPE string.
ENDCLASS.
CLASS lcl_csv_parser IMPLEMENTATION.
METHOD constructor.
me->delimiter = i_delimiter.
me->separator = i_separator.
me->line_separator = i_line_separator.
me->escaped_delimiter = |{ i_delimiter }{ i_delimiter }|.
me->separators = i_separator && i_line_separator.
ENDMETHOD.
ENDCLASS.
* --------------------------- Line based CSV Parser ------------------------ *
CLASS lcl_csv_parser_line DEFINITION INHERITING FROM lcl_csv_parser.
PUBLIC SECTION.
METHODS parse REDEFINITION.
PRIVATE SECTION.
METHODS parse_line_to_string_table
IMPORTING
i_line TYPE string
RETURNING
VALUE(r_result) TYPE string_table.
ENDCLASS.
CLASS lcl_csv_parser_line IMPLEMENTATION.
METHOD parse.
"get the lines
SPLIT i_string AT me->line_separator INTO TABLE DATA(lines).
DATA open_line TYPE abap_bool VALUE abap_false.
DATA current_line TYPE string.
LOOP AT lines ASSIGNING FIELD-SYMBOL(<line>).
FIND ALL OCCURRENCES OF me->delimiter IN <line> IN CHARACTER MODE MATCH COUNT DATA(count).
IF ( count MOD 2 ) = 1.
IF open_line = abap_true.
current_line = |{ current_line }{ me->line_separator }{ <line> }|.
open_line = abap_false.
APPEND parse_line_to_string_table( current_line ) TO r_result.
ELSE.
current_line = <line>.
open_line = abap_true.
ENDIF.
ELSE.
IF open_line = abap_true.
current_line = |{ current_line }{ me->line_separator }{ <line> }|.
ELSE.
APPEND parse_line_to_string_table( <line> ) TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
ENDMETHOD.
METHOD parse_line_to_string_table.
SPLIT i_line AT me->separator INTO TABLE DATA(fields).
DATA open_field TYPE abap_bool VALUE abap_false.
DATA current_field TYPE string.
LOOP AT fields ASSIGNING FIELD-SYMBOL(<field>).
FIND ALL OCCURRENCES OF me->delimiter IN <field> IN CHARACTER MODE MATCH COUNT DATA(count).
IF ( count MOD 2 ) = 1.
IF open_field = abap_true.
current_field = |{ current_field }{ me->separator }{ <field> }|.
open_field = abap_false.
APPEND current_field TO r_result.
ELSE.
current_field = <field>.
open_field = abap_true.
ENDIF.
ELSE.
IF open_field = abap_true.
current_field = |{ current_field }{ me->separator }{ <field> }|.
ELSE.
APPEND <field> TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
REPLACE ALL OCCURRENCES OF me->escaped_delimiter IN TABLE r_result WITH me->delimiter.
ENDMETHOD.
ENDCLASS.
*--------------- Find based CSV Parser ------------------------------------*
CLASS lcl_csv_parser_find DEFINITION INHERITING FROM lcl_csv_parser.
PUBLIC SECTION.
METHODS parse REDEFINITION.
ENDCLASS.
CLASS lcl_csv_parser_find IMPLEMENTATION.
METHOD parse.
DATA line TYPE string_table.
DATA position TYPE i.
DATA(string_length) = strlen( i_string ).
" Accessing member fields is slightly slower than accessing local variables; in a tight loop this matters
DATA(separators) = me->separators.
DATA(delimiter) = me->delimiter.
CHECK string_length <> 0.
" Checking for delimiters inside the DO loop is quite slow.
" Scanning the whole file once up front allows skipping that check if no delimiter is present.
" This led to a slight performance increase of 1s for 1 million rows
DATA(next_delimiter) = find( val = i_string sub = delimiter ).
DO.
DATA(start_position) = position.
DATA(field) = ``.
" Check if field is enclosed in double quotes, as we need to unescape then
IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
start_position = start_position + 1. " literal starts after opening quote
DO.
position = find( val = i_string off = position + 1 sub = delimiter ).
" literal must be closed
" ASSERT position <> -1.
DATA(subliteral_length) = position - start_position.
field = field && substring( val = i_string off = start_position len = subliteral_length ).
DATA(following_position) = position + 1.
IF following_position >= string_length OR i_string+following_position(1) <> delimiter.
" End of literal is reached
position = position + 1. " skip closing quote
EXIT. " DO
ELSE.
" Found escape quote instead
position = following_position + 1.
field = field && me->delimiter.
" continue searching
ENDIF.
" ASSERT sy-index < 1000.
ENDDO.
ELSE.
" Unescaped field, simply find the ending comma or newline
position = find_any_of( val = i_string off = position + 1 sub = separators ).
IF position = -1.
position = string_length.
ENDIF.
field = substring( val = i_string off = start_position len = position - start_position ).
ENDIF.
APPEND field TO line.
" Check if line ended and new line is started
DATA(current) = substring( val = i_string off = position len = 2 ).
IF current = me->line_separator.
APPEND line TO r_result.
CLEAR line.
position = position + 2. " skip newline
ELSE.
" ASSERT i_string+position(1) = me->separator.
position = position + 1.
ENDIF.
" Check if file ended
IF position >= string_length.
RETURN.
ENDIF.
" ASSERT sy-index < 100000001.
ENDDO.
ENDMETHOD.
ENDCLASS.
* -------------------- Tests -------------------------------------------------------- *
CLASS lcl_test_csv_parser DEFINITION
FINAL
CREATE PUBLIC .
PUBLIC SECTION.
CLASS-METHODS run.
CLASS-METHODS get_file_complex
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_simple
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_long
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_longer
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_mixed
RETURNING VALUE(r_result) TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
ENDCLASS.
CLASS lcl_test_csv_parser IMPLEMENTATION.
METHOD get_file_complex.
DATA(file_line) =
repeat( val = |"1234,{ cl_abap_char_utilities=>cr_lf }7890",| occ = 9 ) &&
|"1234,{ cl_abap_char_utilities=>cr_lf }7890"| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_simple.
DATA(file_line) =
repeat( val = |1234567890,| occ = 9 ) &&
|1234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_long.
DATA(file_line) =
repeat( val = |12345678901234567890,| occ = 4 ) &&
|12345678901234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_longer.
DATA(file_line) =
repeat( val = |1234567890123456789012345678901234567890,| occ = 2 ) &&
|1234567890123456789012345678901234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_mixed.
DATA(file_line) =
|1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD run.
DATA prepare_start TYPE timestampl.
GET TIME STAMP FIELD prepare_start.
TYPES:
BEGIN OF t_file,
name TYPE string,
content TYPE string,
END OF t_file,
t_files TYPE STANDARD TABLE OF t_file WITH EMPTY KEY.
DATA(files) = VALUE t_files(
( name = `simple` content = get_file_simple( ) )
( name = `long` content = get_file_long( ) )
( name = `longer` content = get_file_longer( ) )
( name = `complex` content = get_file_complex( ) )
( name = `mixed` content = get_file_mixed( ) )
).
DATA prepare_end TYPE timestampl.
GET TIME STAMP FIELD prepare_end.
WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = prepare_end tstmp2 = prepare_start ) }|. SKIP 2.
WRITE: 'File', 15 'Line Parse', 30 'Find Parse', 45 'Match'. NEW-LINE.
ULINE.
LOOP AT files INTO DATA(file).
WRITE file-name UNDER 'File'.
DATA line_start TYPE timestampl.
GET TIME STAMP FIELD line_start.
DATA(line_parser) = NEW lcl_csv_parser_line( ).
DATA(line_result) = line_parser->parse( file-content ).
DATA line_end TYPE timestampl.
GET TIME STAMP FIELD line_end.
WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = line_end tstmp2 = line_start ) }s| UNDER 'Line Parse'.
DATA find_start TYPE timestampl.
GET TIME STAMP FIELD find_start.
DATA(find_parser) = NEW lcl_csv_parser_find( ).
DATA(find_result) = find_parser->parse( file-content ).
DATA find_end TYPE timestampl.
GET TIME STAMP FIELD find_end.
WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = find_end tstmp2 = find_start ) }s| UNDER 'Find Parse'.
" WRITE COND #( WHEN line_result = find_result THEN 'yes' ELSE 'no') UNDER 'Match'.
NEW-LINE.
ENDLOOP.
ENDMETHOD.
ENDCLASS.
START-OF-SELECTION.
lcl_test_csv_parser=>run( ).

Saving json file by dumping dictionary in a for loop, leading to malformed json

So I have the following dictionaries that I get by parsing a text file
keys = ["scientific name", "common names", "colors"]
values = ["somename1", ["name11", "name12"], ["color11", "color12"]]
keys = ["scientific name", "common names", "colors"]
values = ["somename2", ["name21", "name22"], ["color21", "color22"]]
and so on. I am dumping the key value pairs using a dictionary to a json file using a for loop where I go through each key value pair one by one
for loop starts
d = dict(zip(keys, values))
with open("file.json", 'a') as j:
json.dump(d, j)
If I open the saved json file I see the contents as
{"scientific name": "somename1", "common names": ["name11", "name12"], "colors": ["color11", "color12"]}{"scientific name": "somename2", "common names": ["name21", "name22"], "colors": ["color21", "color22"]}
Is this the right way to do it?
The purpose is to query the common name or colors for a given scientific name. So then I do
with open("file.json", "r") as j:
data = json.load(j)
I get the error, json.decoder.JSONDecodeError: Extra data:
I think this is because I am not dumping the dictionaries to JSON correctly in the for loop. I have to insert some square brackets programmatically; just doing json.dump(d, j) won't suffice.
JSON may only have one root element. This root element can be [], {} or most other datatypes.
In your file, however, you get multiple root elements:
{...}{...}
This isn't valid JSON, and the error Extra data refers to the second {}, where valid JSON would end instead.
You can write multiple dicts to a JSON string, but you need to wrap them in an array:
[{...},{...}]
But now off to how I would fix your code. First, I rewrote what you posted, because your code was rather pseudo-code and didn't run directly.
import json
inputs = [(["scientific name", "common names", "colors"],
           ["somename1", ["name11", "name12"], ["color11", "color12"]]),
          (["scientific name", "common names", "colors"],
           ["somename2", ["name21", "name22"], ["color21", "color22"]])]
for keys, values in inputs:
    d = dict(zip(keys, values))
    with open("file.json", 'a') as j:
        json.dump(d, j)
with open("file.json", 'r') as j:
    print(json.load(j))
As you correctly realized, this code fails with
json.decoder.JSONDecodeError: Extra data: line 1 column 105 (char 104)
The way I would write it, is:
import json
inputs = [(["scientific name", "common names", "colors"],
           ["somename1", ["name11", "name12"], ["color11", "color12"]]),
          (["scientific name", "common names", "colors"],
           ["somename2", ["name21", "name22"], ["color21", "color22"]])]
jsonData = list()
for keys, values in inputs:
    d = dict(zip(keys, values))
    jsonData.append(d)
with open("file.json", 'w') as j:
    json.dump(jsonData, j)
with open("file.json", 'r') as j:
    print(json.load(j))
Also, with Python's json library it is important to write the entire JSON file in one go, meaning you open it with 'w' instead of 'a'.
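Since the stated goal is to look up common names or colors by scientific name, one possible next step is to index the loaded list by that key. A minimal sketch, assuming scientific names are unique (by_name is a name introduced here for illustration, and the JSON text is kept in memory instead of a file so the snippet is self-contained):

```python
import json

records = [
    {"scientific name": "somename1",
     "common names": ["name11", "name12"],
     "colors": ["color11", "color12"]},
    {"scientific name": "somename2",
     "common names": ["name21", "name22"],
     "colors": ["color21", "color22"]},
]

# One root element (a list), so the text round-trips cleanly.
text = json.dumps(records)
loaded = json.loads(text)

# Index the records by scientific name for direct lookups.
# Assumption: each scientific name occurs only once.
by_name = {record["scientific name"]: record for record in loaded}

print(by_name["somename2"]["colors"])
print(by_name["somename1"]["common names"])
```

If the same scientific name can occur more than once, the dictionary would keep only the last record, so a list per key would be needed instead.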

Parse a MySQL insert statement with multiple rows [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
          ^                                                      ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
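The counter-based scan described above can be sketched as follows. This is a minimal Python illustration, not the code from the linked answer; the function name outer_parenthesized is my own, and it assumes that a stray closing parenthesis before any opening one should simply be ignored.

```python
from typing import Optional


def outer_parenthesized(text: str) -> Optional[str]:
    """Return the first balanced '(...)' group in text, or None if
    no opening parenthesis is ever fully closed."""
    depth = 0   # number of '(' not yet matched by a ')'
    start = -1  # index of the outermost '('
    for index, char in enumerate(text):
        if char == "(":
            if depth == 0:
                start = index
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # Counter is back at zero: this is the final closing parenthesis.
                return text[start:index + 1]
    return None


example = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
print(outer_parenthesized(example))
```

The scan is a single pass over the string and works for any nesting depth, which is exactly what the fixed-depth regex variants cannot offer.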
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
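To see what this expression does and does not do, here is a small sketch with Python's re module. The helper name outer_span is made up for this example, and fullmatch is used because the explanation refers to the beginning and end of the string.

```python
import re

# Prefix without "(", then first "(" through last ")", then suffix without ")".
PATTERN = re.compile(r"[^\(]*(\(.*\))[^\)]*")


def outer_span(text):
    """Return the text from the first '(' to the last ')', or None."""
    m = PATTERN.fullmatch(text)
    return m.group(1) if m else None


print(outer_span("foo (a (b) c) bar"))
# The brackets are not balanced here, yet the pattern still matches:
print(outer_span("x (a)) (b y"))
```

The second call returns a span with one more closing bracket than opening brackets, which is why a counter-based parser is the better choice when balance matters.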
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, an FSA can remember only the current state; it has no information about previous states.
In the above diagram, S1 and S2 are two states, where S1 is both the starting and the final state. So if we try the string 0110, the transitions go as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at the second S2, i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01, as it can only remember the current state and the next input symbol.
In the above problem, we need to know the number of opening parentheses; this means it has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written.
However, an algorithm can be written to do this task. Such algorithms generally fall under Pushdown Automata (PDA). A PDA is one level above an FSA: it has an additional stack to store extra information. PDAs can solve the above problem because we can 'push' each opening parenthesis onto the stack and 'pop' one when we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they do not.
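The push/pop procedure described above might look like this in Python. It is only a sketch of the idea, with a plain list standing in for the PDA's stack; the function name is_balanced is an assumption for this example.

```python
def is_balanced(text):
    """Return True if every '(' in text is closed by a later ')'."""
    stack = []  # indices of '(' characters that are still open
    for i, ch in enumerate(text):
        if ch == "(":
            stack.append(i)   # push the opening parenthesis
        elif ch == ")":
            if not stack:
                return False  # closing parenthesis with nothing to match
            stack.pop()       # pop the matching opening parenthesis
    return not stack          # empty stack: everything matched


print(is_balanced("(a(b)c)"))
print(is_balanced("(()"))
```

For a single kind of bracket the stack could be replaced by a counter; the stack form is kept here because it mirrors the PDA description.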
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns, and regular expressions are the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions; see @dehmann's answer.
If it's just first open to last close, see @Zach's answer.
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile(r"""
    ( .*? )
    ( \( | \) | \[ | \] | \{ | \} | \< | \> |
      \' | \" | BEGIN | END | $ )
    ( .* )
    """, re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)
        if m.group(1) != "":
            lst.append(m.group(1))
        if m.group(2) == term:
            return lst, m.group(3)
        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(str(e))
    print("done")
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you first occurrence
str.lastIndexOf(')'); - last one
So you need a string between,
String searchedString = str.substring(str.indexOf('('), str.lastIndexOf(')'));
Because JS regex doesn't support recursive matching, I can't make balanced parentheses matching work.
So this is a simple JavaScript for-loop version that turns a "method(arg)" string into an array:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
The result looks like this:
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
A language of the form {a^n b^n | n >= 0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more here.
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
    """Return a list of (start, end) index pairs, one per balanced {...} block in data."""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            # Compare against None so a block starting at index 0 is not missed.
            if start_pos is None:
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it may be useful to some coming here to search for a nested-structure regexp:
Parse parameters from a function string (with nested structures) in JavaScript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* @return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parentheses.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

Converting a MongoDB Query String with Datetime field into Dict in Python

I have a string as follows,
s= "query : {'$and': [{'$or': [{'Component': 'pfr'}, {'Component': 'ng-pfr'}, {'Component': 'common-flow-table'}, {'Component': 'media-mon'}]}, {'Submitted-on': {'$gte': datetime.datetime(2016, 2, 21, 0, 0)}}, {'Submitted-on': {'$lte': datetime.datetime(2016, 2, 28, 0, 0)}}]}
" which is a MongoDB query stored in a string. How can I convert it into a dict or JSON format in Python?
Your format is not standard, so you need a hack to get it.
import json
s = " query : {'names' :['abc','xyz'],'location':'India'}"
key, value = s.strip().split(':', 1)
r = value.replace("'", '"')
data = {
key: json.loads(r)
}
From your comment: the datetime gives problems. Then I present to you the hack of hacks: the eval function.
import datetime
import json
s = " query : {'names' :['abc','xyz'],'location':'India'}"
key, value = s.strip().split(':', 1)
# we can leave out the replacing, single quotes is fine for eval
data = {
key: eval(value)
}
NB: eval is very unsafe, especially on unsanitized input.
NB: these hacks break easily; the first one fails, for example, when a key or value contains a quote character.
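If the string holds only plain literals (no datetime calls), a safer alternative to eval is ast.literal_eval from the standard library. This swaps in a different technique from the two hacks above; it is a sketch, and the function name parse_query is made up for this example.

```python
import ast


def parse_query(s):
    """Split 'key : {python-literal}' and parse the literal without eval."""
    key, value = s.strip().split(":", 1)
    # literal_eval accepts only literals (dicts, lists, strings, numbers, ...),
    # so single-quoted strings are fine and arbitrary code is rejected.
    return {key.strip(): ast.literal_eval(value.strip())}


print(parse_query(" query : {'names' :['abc','xyz'],'location':'India'}"))

# A datetime.datetime(...) call is not a literal, so it is refused:
try:
    parse_query("query : {'d': datetime.datetime(2016, 2, 21, 0, 0)}")
except ValueError as exc:
    print("rejected:", exc)
```

So this does not solve the datetime case from the question; it only covers the literal-only case without running arbitrary code.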