The data has HTML decimal entity values inside the text:
col1
-------------------------------------------------------------
Drell-Yan Process Background Estimation Using e&#956; Method
Expressions of constant &#960;
Computational Analysis of Protein &#946;-Structure
&#948;13C and 14C Measurements in Aerosol Particles
I need to get actual symbols instead of all the HTML decimal values.
Html.Table in Power Query M can decode HTML decimal entities into the actual symbols:
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("PY7BCsIwDIZfJcyrg03nUHaa06MwEA8ydyg1arFrR5KCvr1l4C7JId/3/+m65EBobXpVDlryGplhr/T7ST64OxxZzKDEeAcXNu4JeAtZtioXu01ZwQnl5e9Jv+yS42ek6EaQwT9Axy3KCfzxMqsmrvHDGGRKVBbqOL5sJiW2Cxo3G0VZpWehoCUQTu582Vb5ugFDkBdNfEJxJAZ0whD9Gsmzt9AqEqMtctL3Pw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [col1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"col1", type text}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "HtmlTable", each Html.Table([col1],{{"HtmlDecoded",":root"}})),
    #"Expanded HtmlTable" = Table.ExpandTableColumn(#"Added Custom", "HtmlTable", {"HtmlDecoded"}, {"HtmlDecoded"})
in
    #"Expanded HtmlTable"
Or in place:
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("PY7BCsIwDIZfJcyrg03nUHaa06MwEA8ydyg1arFrR5KCvr1l4C7JId/3/+m65EBobXpVDlryGplhr/T7ST64OxxZzKDEeAcXNu4JeAtZtioXu01ZwQnl5e9Jv+yS42ek6EaQwT9Axy3KCfzxMqsmrvHDGGRKVBbqOL5sJiW2Cxo3G0VZpWehoCUQTu582Vb5ugFDkBdNfEJxJAZ0whD9Gsmzt9AqEqMtctL3Pw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [col1 = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"col1", type text}}),
    Decoded = Table.TransformColumns(#"Changed Type", {{"col1", each Table.FirstValue(Html.Table(_,{{"HtmlDecoded",":root"}})) }} )
in
    Decoded
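As a quick sanity check (my example, using one of the encoded values above), a single string can be decoded the same way:

Table.FirstValue(Html.Table("Expressions of constant &#960;", {{"HtmlDecoded", ":root"}}))

which should return "Expressions of constant π".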
I have some JSON fields stored in a database which contain a String -> Double mapping, e.g.:
{
"some type of thing": 0.45,
"other type of thing": 0.35,
"something else": 0.2
}
I want to represent this as a ThingComposition:
data ThingType = ThingTypeSome
               | ThingTypeOther
               | ThingTypeUnknown Text

-- | Create a ThingType from a text representation
txtToThing :: Text -> ThingType
txtToThing "some type of thing"  = ThingTypeSome
txtToThing "other type of thing" = ThingTypeOther
txtToThing s                     = ThingTypeUnknown s

-- Deserialise ThingType from JSON text
instance FromJSON ThingType where
    parseJSON val = withText "ThingType" (return . txtToThing) val
data ThingComposition = ThingComposition [(ThingType, Double)]
                      | InvalidThingComposition

instance FromJSON ThingComposition where
    parseJSON val = withObject "ThingComposition"
        _
        val
The _ is what I have no idea how to fill out. I've tried something like the following, but I can't get the types to align, and I can't work out the best way to do this given that the JSON representation may not match the types; I don't want to end up with a list of [(Either String ThingType, Either String Double)]. How can I parse the object at the top into the ThingComposition type?
_ = (return . ThingComposition) . map (bimap parseJSON parseJSON) . toList
I would make some supporting instances for your ThingType, then reuse the FromJSON (HashMap k v) instance.
-- added "deriving Eq" to your declaration; otherwise unchanged
data ThingType = ThingTypeSome
| ThingTypeOther
| ThingTypeUnknown Text
deriving Eq
thingToTxt :: ThingType -> Text
thingToTxt ThingTypeSome = "some type of thing"
thingToTxt ThingTypeOther = "other type of thing"
thingToTxt (ThingTypeUnknown s) = s
instance FromJSONKey ThingType where
fromJSONKey = FromJSONKeyText txtToThing
instance Hashable ThingType where
hashWithSalt n = hashWithSalt n . thingToTxt
With that supporting code, you now have a FromJSON instance for HashMap ThingType Double, which is superior in many ways to a [(ThingType, Double)].
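For completeness, here is a minimal sketch (my addition, not from the original answer; it assumes the aeson and unordered-containers imports shown) of how the ThingComposition instance can then delegate to that HashMap instance:

import Data.Aeson (FromJSON (..))
import Data.Aeson.Types (Parser)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM

instance FromJSON ThingComposition where
    -- parse the object as HashMap ThingType Double, then convert to the list the
    -- constructor expects; unknown keys land in ThingTypeUnknown via txtToThing
    parseJSON v = ThingComposition . HM.toList
        <$> (parseJSON v :: Parser (HashMap ThingType Double))

Alternatively, change ThingComposition to hold the HashMap ThingType Double directly and drop the HM.toList conversion.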
I would like to parse big CSV files in ABAP in the most performant way under the following conditions:
We do not know the structure of the CSV, so the parse result should be a table of string_table or something similar
The parsing should happen in accordance to https://www.rfc-editor.org/rfc/rfc4180
No solution-specific calls
I found a very nice blog https://blogs.sap.com/2014/09/09/understanding-csv-files-and-their-handling-in-abap/ but it has its shortcomings:
Write your own code - The code example is not sufficient
Read the file using KCD_CSV_FILE_TO_INTERN_CONVERT - solution-specific (not available everywhere) and dumps on sufficiently long fields
Use RTTI and dynamic programming along with FM RSDS_CONVERT_CSV - we do not know the structure in advance
Use class CL_RSDA_CSV_CONVERTER - we do not know the structure in advance
I also checked the first available solution on GitHub - https://github.com/thedoginthewok/ZwdCSV. Unfortunately, it has macros in the code (absolutely unacceptable) and also requires you to know the structure in advance.
I also tried to use regex to do the job, but on big files this is too slow.
Even though I am extremely annoyed by this fact, I had to create a solution myself (I cannot believe that I actually did it - it should be in the standard...)
My first solution was a direct copy-paste of Java code into ABAP (https://mkyong.com/java/how-to-read-and-parse-csv-file-in-java/). Unfortunately, as my other question How to iterate over string characters in ABAP in a performant way? showed, iterating over a string in ABAP is not as easy as it is in Java.
I then tried a split/count approach, and so far it has the best performance. Does anyone know a better way to achieve this?
REPORT z_csv_test.
CLASS lcl_csv_parser DEFINITION CREATE PRIVATE.
PUBLIC SECTION.
TYPES:
tt_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
CLASS-METHODS:
create
IMPORTING
!iv_delimiter TYPE string DEFAULT '"'
!iv_separator TYPE string DEFAULT ','
!iv_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf
RETURNING
VALUE(r_result) TYPE REF TO lcl_csv_parser.
METHODS:
parse
IMPORTING
iv_string TYPE string
RETURNING
VALUE(r_result) TYPE tt_string_matrix,
constructor
IMPORTING
!iv_delimiter TYPE string
!iv_separator TYPE string
!iv_line_separator TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
DATA:
gv_delimiter TYPE string,
gv_separator TYPE string,
gv_line_separator TYPE string,
gv_escaped_delimiter TYPE string.
METHODS parse_line_to_string_table
IMPORTING
iv_line TYPE string
RETURNING
VALUE(r_result) TYPE string_table.
ENDCLASS.
CLASS lcl_csv_parser IMPLEMENTATION.
METHOD create.
r_result = NEW #(
iv_delimiter = iv_delimiter
iv_line_separator = CONV #( iv_line_separator )
iv_separator = iv_separator ).
ENDMETHOD.
METHOD constructor.
me->gv_delimiter = iv_delimiter.
me->gv_separator = iv_separator.
me->gv_line_separator = iv_line_separator.
me->gv_escaped_delimiter = |{ iv_delimiter }{ iv_delimiter }|.
ENDMETHOD.
METHOD parse.
"get the lines
SPLIT iv_string AT me->gv_line_separator INTO TABLE DATA(lt_lines).
DATA lx_open_line TYPE abap_bool VALUE abap_false.
DATA lv_current_line TYPE string.
LOOP AT lt_lines ASSIGNING FIELD-SYMBOL(<ls_line>).
FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_line> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
IF ( lv_count MOD 2 ) = 1.
IF lx_open_line = abap_true.
lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
lx_open_line = abap_false.
APPEND parse_line_to_string_table( lv_current_line ) TO r_result.
ELSE.
lv_current_line = <ls_line>.
lx_open_line = abap_true.
ENDIF.
ELSE.
IF lx_open_line = abap_true.
lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
ELSE.
APPEND parse_line_to_string_table( <ls_line> ) TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
ENDMETHOD.
METHOD parse_line_to_string_table.
SPLIT iv_line AT me->gv_separator INTO TABLE DATA(lt_line).
DATA lx_open_field TYPE abap_bool VALUE abap_false.
DATA lv_current_field TYPE string.
LOOP AT lt_line ASSIGNING FIELD-SYMBOL(<ls_field>).
FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_field> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
IF ( lv_count MOD 2 ) = 1.
IF lx_open_field = abap_true.
lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
lx_open_field = abap_false.
APPEND lv_current_field TO r_result.
ELSE.
lv_current_field = <ls_field>.
lx_open_field = abap_true.
ENDIF.
ELSE.
IF lx_open_field = abap_true.
lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
ELSE.
APPEND <ls_field> TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
REPLACE ALL OCCURRENCES OF me->gv_escaped_delimiter IN TABLE r_result WITH me->gv_delimiter.
ENDMETHOD.
ENDCLASS.
CLASS lcl_test_csv_parser DEFINITION
FINAL
CREATE PUBLIC .
PUBLIC SECTION.
CLASS-METHODS run.
CLASS-METHODS get_file
RETURNING VALUE(r_result) TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
ENDCLASS.
CLASS lcl_test_csv_parser IMPLEMENTATION.
METHOD get_file.
DATA lv_file_line TYPE string.
DO 10 TIMES.
lv_file_line = |"1234,{ cl_abap_char_utilities=>cr_lf }567890",{ lv_file_line }|.
ENDDO.
lv_file_line = lv_file_line && cl_abap_char_utilities=>cr_lf.
DATA(lt_file_as_table) = VALUE string_table(
FOR i = 1 THEN i + 1 UNTIL i = 1000000
( lv_file_line ) ).
CONCATENATE LINES OF lt_file_as_table INTO r_result.
ENDMETHOD.
METHOD run.
DATA lv_prepare_start TYPE timestampl.
GET TIME STAMP FIELD lv_prepare_start.
DATA(lv_file) = get_file( ).
DATA lv_prepare_end TYPE timestampl.
GET TIME STAMP FIELD lv_prepare_end.
WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = lv_prepare_end tstmp2 = lv_prepare_start ) }|.
DATA lv_parse_start TYPE timestampl.
GET TIME STAMP FIELD lv_parse_start.
DATA(lo_parser) = lcl_csv_parser=>create( ).
DATA(lt_file) = lo_parser->parse( lv_file ).
DATA lv_parse_end TYPE timestampl.
GET TIME STAMP FIELD lv_parse_end.
WRITE |Parse took { cl_abap_tstmp=>subtract( tstmp1 = lv_parse_end tstmp2 = lv_parse_start ) }|.
ENDMETHOD.
ENDCLASS.
START-OF-SELECTION.
lcl_test_csv_parser=>run( ).
I'd like to present a different approach that uses find heavily. Compared to your line-based approach, it seems to have equivalent performance for unquoted fields but performs slightly better if quoted fields are present:
In general, this uses the pattern position = find( off = position + 1 ) to iterate over the string in chunks, and then uses substring to copy ranges into strings. What can be observed here is that in a loop that iterates a million times, every nanosecond saved has an impact on performance, so moving as much as possible out of the inner loop increases performance significantly. For the "simple" case of 10-digit fields both algorithms perform equally well, but for "longer" 30-digit fields your algorithm gets comparatively faster. For fields with quotes, the scan & concat approach I've used seems to be faster than the "reconstruct" approach.
I guess that although one can achieve small gains through more clever ABAP, further significant optimizations are only possible by utilizing the engine even more.
Anyway, here's the algorithm:
CLASS lcl_csv_parser_find IMPLEMENTATION.
METHOD parse.
DATA line TYPE string_table.
DATA position TYPE i.
DATA(string_length) = strlen( i_string ).
" Dereferencing member fields is slightly slower than variable access, in a close loop this matters
DATA(separators) = me->separators.
DATA(delimiter) = me->delimiter.
CHECK string_length <> 0.
" Checking for delimiters in the DO loop is quite slow.
" By scanning the whole file once and skipping that check if no delimiter is present
" This lead to a slight performance increase of 1s for 1 million rows
DATA(next_delimiter) = find( val = i_string sub = delimiter ).
DO.
DATA(start_position) = position.
DATA(field) = ``.
" Check if field is enclosed in double quotes, as we need to unescape then
IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
start_position = start_position + 1. " literal starts after opening quote
DO.
position = find( val = i_string off = position + 1 sub = delimiter ).
" literal must be closed
" ASSERT position <> -1.
DATA(subliteral_length) = position - start_position.
field = field && substring( val = i_string off = start_position len = subliteral_length ).
DATA(following_position) = position + 1.
IF position = string_length OR i_string+following_position(1) <> delimiter.
" End of literal is reached
position = position + 1. " skip closing quote
EXIT. " DO
ELSE.
" Found escape quote instead
position = following_position + 1.
field = field && me->delimiter.
" continue searching
ENDIF.
" ASSERT sy-index < 1000.
ENDDO.
ELSE.
" Unescaped field, simply find the ending comma or newline
position = find_any_of( val = i_string off = position + 1 sub = separators ).
IF position = -1.
position = string_length.
ENDIF.
field = substring( val = i_string off = start_position len = position - start_position ).
ENDIF.
APPEND field TO line.
" Check if line ended and new line is started
DATA(current) = substring( val = i_string off = position len = 2 ).
IF current = me->line_separator.
APPEND line TO r_result.
CLEAR line.
position = position + 2. " skip newline
ELSE.
" ASSERT i_string+position(1) = me->separator.
position = position + 1.
ENDIF.
" Check if file ended
IF position >= string_length.
RETURN.
ENDIF.
" ASSERT sy-index < 100000001.
ENDDO.
ENDMETHOD.
ENDCLASS.
As a sidenote, instead of creating a huge table of string fields as stated in #1, I would experiment with some kind of "visitor pattern", e.g. pass an instance of such an interface to the parser:
INTERFACE if_csv_visitor.
  METHODS begin_line.
  METHODS end_line.
  METHODS visit_field
    IMPORTING
      i_field TYPE string.
ENDINTERFACE.
As in a lot of cases you'll write the CSV fields into a structure anyway, this saves allocating that quite large table. A minimal visitor implementation is sketched below.
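A minimal sketch of such a visitor (my illustration, untested; it assumes a two-column target structure and assigns fields to components by position):

CLASS lcl_struct_visitor DEFINITION.
  PUBLIC SECTION.
    INTERFACES if_csv_visitor.
    TYPES:
      BEGIN OF t_row,
        col1 TYPE string,
        col2 TYPE string,
      END OF t_row.
    DATA rows TYPE STANDARD TABLE OF t_row WITH EMPTY KEY READ-ONLY.
  PRIVATE SECTION.
    DATA current_row TYPE t_row.
    DATA field_index TYPE i.
ENDCLASS.

CLASS lcl_struct_visitor IMPLEMENTATION.
  METHOD if_csv_visitor~begin_line.
    CLEAR: current_row, field_index.
  ENDMETHOD.
  METHOD if_csv_visitor~visit_field.
    " assign the field to the next structure component by position
    field_index = field_index + 1.
    ASSIGN COMPONENT field_index OF STRUCTURE current_row TO FIELD-SYMBOL(<component>).
    IF sy-subrc = 0.
      <component> = i_field.
    ENDIF.
  ENDMETHOD.
  METHOD if_csv_visitor~end_line.
    APPEND current_row TO rows.
  ENDMETHOD.
ENDCLASS.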
And for further reference, here's the whole report:
*&---------------------------------------------------------------------*
*& Report Z_CSV
*&---------------------------------------------------------------------*
*&
*&---------------------------------------------------------------------*
REPORT Z_CSV.
* --------------------- Generic CSV Parser ----------------------------*
CLASS lcl_csv_parser DEFINITION ABSTRACT.
PUBLIC SECTION.
TYPES:
t_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
METHODS:
parse ABSTRACT
IMPORTING
i_string TYPE string
RETURNING
VALUE(r_result) TYPE t_string_matrix,
constructor
IMPORTING
i_delimiter TYPE string DEFAULT '"'
i_separator TYPE string DEFAULT ','
i_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf.
PROTECTED SECTION.
DATA:
delimiter TYPE string,
separator TYPE string,
line_separator TYPE string,
escaped_delimiter TYPE string,
separators TYPE string.
ENDCLASS.
CLASS lcl_csv_parser IMPLEMENTATION.
METHOD constructor.
me->delimiter = i_delimiter.
me->separator = i_separator.
me->line_separator = i_line_separator.
me->escaped_delimiter = |{ i_delimiter }{ i_delimiter }|.
me->separators = i_separator && i_line_separator.
ENDMETHOD.
ENDCLASS.
* --------------------------- Line based CSV Parser ------------------------ *
CLASS lcl_csv_parser_line DEFINITION INHERITING FROM lcl_csv_parser.
PUBLIC SECTION.
METHODS parse REDEFINITION.
PRIVATE SECTION.
METHODS parse_line_to_string_table
IMPORTING
i_line TYPE string
RETURNING
VALUE(r_result) TYPE string_table.
ENDCLASS.
CLASS lcl_csv_parser_line IMPLEMENTATION.
METHOD parse.
"get the lines
SPLIT i_string AT me->line_separator INTO TABLE DATA(lines).
DATA open_line TYPE abap_bool VALUE abap_false.
DATA current_line TYPE string.
LOOP AT lines ASSIGNING FIELD-SYMBOL(<line>).
FIND ALL OCCURRENCES OF me->delimiter IN <line> IN CHARACTER MODE MATCH COUNT DATA(count).
IF ( count MOD 2 ) = 1.
IF open_line = abap_true.
current_line = |{ current_line }{ me->line_separator }{ <line> }|.
open_line = abap_false.
APPEND parse_line_to_string_table( current_line ) TO r_result.
ELSE.
current_line = <line>.
open_line = abap_true.
ENDIF.
ELSE.
IF open_line = abap_true.
current_line = |{ current_line }{ me->line_separator }{ <line> }|.
ELSE.
APPEND parse_line_to_string_table( <line> ) TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
ENDMETHOD.
METHOD parse_line_to_string_table.
SPLIT i_line AT me->separator INTO TABLE DATA(fields).
DATA open_field TYPE abap_bool VALUE abap_false.
DATA current_field TYPE string.
LOOP AT fields ASSIGNING FIELD-SYMBOL(<field>).
FIND ALL OCCURRENCES OF me->delimiter IN <field> IN CHARACTER MODE MATCH COUNT DATA(count).
IF ( count MOD 2 ) = 1.
IF open_field = abap_true.
current_field = |{ current_field }{ me->separator }{ <field> }|.
open_field = abap_false.
APPEND current_field TO r_result.
ELSE.
current_field = <field>.
open_field = abap_true.
ENDIF.
ELSE.
IF open_field = abap_true.
current_field = |{ current_field }{ me->separator }{ <field> }|.
ELSE.
APPEND <field> TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
REPLACE ALL OCCURRENCES OF me->escaped_delimiter IN TABLE r_result WITH me->delimiter.
ENDMETHOD.
ENDCLASS.
*--------------- Find based CSV Parser ------------------------------------*
CLASS lcl_csv_parser_find DEFINITION INHERITING FROM lcl_csv_parser.
PUBLIC SECTION.
METHODS parse REDEFINITION.
ENDCLASS.
CLASS lcl_csv_parser_find IMPLEMENTATION.
METHOD parse.
DATA line TYPE string_table.
DATA position TYPE i.
DATA(string_length) = strlen( i_string ).
" Dereferencing member fields is slightly slower than variable access, in a close loop this matters
DATA(separators) = me->separators.
DATA(delimiter) = me->delimiter.
CHECK string_length <> 0.
" Checking for delimiters in the DO loop is quite slow.
" By scanning the whole file once and skipping that check if no delimiter is present
" This lead to a slight performance increase of 1s for 1 million rows
DATA(next_delimiter) = find( val = i_string sub = delimiter ).
DO.
DATA(start_position) = position.
DATA(field) = ``.
" Check if field is enclosed in double quotes, as we need to unescape then
IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
start_position = start_position + 1. " literal starts after opening quote
DO.
position = find( val = i_string off = position + 1 sub = delimiter ).
" literal must be closed
" ASSERT position <> -1.
DATA(subliteral_length) = position - start_position.
field = field && substring( val = i_string off = start_position len = subliteral_length ).
DATA(following_position) = position + 1.
IF position = string_length OR i_string+following_position(1) <> delimiter.
" End of literal is reached
position = position + 1. " skip closing quote
EXIT. " DO
ELSE.
" Found escape quote instead
position = following_position + 1.
field = field && me->delimiter.
" continue searching
ENDIF.
" ASSERT sy-index < 1000.
ENDDO.
ELSE.
" Unescaped field, simply find the ending comma or newline
position = find_any_of( val = i_string off = position + 1 sub = separators ).
IF position = -1.
position = string_length.
ENDIF.
field = substring( val = i_string off = start_position len = position - start_position ).
ENDIF.
APPEND field TO line.
" Check if line ended and new line is started
DATA(current) = substring( val = i_string off = position len = 2 ).
IF current = me->line_separator.
APPEND line TO r_result.
CLEAR line.
position = position + 2. " skip newline
ELSE.
" ASSERT i_string+position(1) = me->separator.
position = position + 1.
ENDIF.
" Check if file ended
IF position >= string_length.
RETURN.
ENDIF.
" ASSERT sy-index < 100000001.
ENDDO.
ENDMETHOD.
ENDCLASS.
* -------------------- Tests -------------------------------------------------------- *
CLASS lcl_test_csv_parser DEFINITION
FINAL
CREATE PUBLIC .
PUBLIC SECTION.
CLASS-METHODS run.
CLASS-METHODS get_file_complex
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_simple
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_long
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_longer
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_mixed
RETURNING VALUE(r_result) TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
ENDCLASS.
CLASS lcl_test_csv_parser IMPLEMENTATION.
METHOD get_file_complex.
DATA(file_line) =
repeat( val = |"1234,{ cl_abap_char_utilities=>cr_lf }7890",| occ = 9 ) &&
|"1234,{ cl_abap_char_utilities=>cr_lf }7890"| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_simple.
DATA(file_line) =
repeat( val = |1234567890,| occ = 9 ) &&
|1234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_long.
DATA(file_line) =
repeat( val = |12345678901234567890,| occ = 4 ) &&
|12345678901234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_longer.
DATA(file_line) =
repeat( val = |1234567890123456789012345678901234567890,| occ = 2 ) &&
|1234567890123456789012345678901234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_mixed.
DATA(file_line) =
|1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD run.
DATA prepare_start TYPE timestampl.
GET TIME STAMP FIELD prepare_start.
TYPES:
BEGIN OF t_file,
name TYPE string,
content TYPE string,
END OF t_file,
t_files TYPE STANDARD TABLE OF t_file WITH EMPTY KEY.
DATA(files) = VALUE t_files(
( name = `simple` content = get_file_simple( ) )
( name = `long` content = get_file_long( ) )
( name = `longer` content = get_file_longer( ) )
( name = `complex` content = get_file_complex( ) )
( name = `mixed` content = get_file_mixed( ) )
).
DATA prepare_end TYPE timestampl.
GET TIME STAMP FIELD prepare_end.
WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = prepare_end tstmp2 = prepare_start ) }|. SKIP 2.
WRITE: 'File', 15 'Line Parse', 30 'Find Parse', 45 'Match'. NEW-LINE.
ULINE.
LOOP AT files INTO DATA(file).
WRITE file-name UNDER 'File'.
DATA line_start TYPE timestampl.
GET TIME STAMP FIELD line_start.
DATA(line_parser) = NEW lcl_csv_parser_line( ).
DATA(line_result) = line_parser->parse( file-content ).
DATA line_end TYPE timestampl.
GET TIME STAMP FIELD line_end.
WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = line_end tstmp2 = line_start ) }s| UNDER 'Line Parse'.
DATA find_start TYPE timestampl.
GET TIME STAMP FIELD find_start.
DATA(find_parser) = NEW lcl_csv_parser_find( ).
DATA(find_result) = find_parser->parse( file-content ).
DATA find_end TYPE timestampl.
GET TIME STAMP FIELD find_end.
WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = find_end tstmp2 = find_start ) }s| UNDER 'Find Parse'.
" WRITE COND #( WHEN line_result = find_result THEN 'yes' ELSE 'no') UNDER 'Match'.
NEW-LINE.
ENDLOOP.
ENDMETHOD.
ENDCLASS.
START-OF-SELECTION.
lcl_test_csv_parser=>run( ).
SSIS supports Power Query as a source, but it does not seem to support M code that uses Html.Table. I'm using only SQL Server Data Tools (SSDT).
M code below:
let
    Scrapper = (Page as number) as table =>
        let
            Source = Web.BrowserContents("https://www.zerohedge.com/?page=" & Number.ToText(Page)),
            #"Extracted Table From Html" = Html.Table(Source, {{"Title", ".teaser-title:nth-last-child(4)"}, {"Date", ".extras__created:nth-last-child(1)"}, {"Views", ".extras__views:nth-last-child(2)"}}, [RowSelector=".view-content:nth-last-child(2) > DIV.views-row"]),
            #"Changed Type" = Table.TransformColumnTypes(#"Extracted Table From Html",{{"Title", type text}, {"Date", type datetime}, {"Views", Int64.Type}})
        in
            #"Changed Type",
    Source = {0..1},
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "Pages"}}),
    #"Invoked Custom Function" = Table.AddColumn(#"Renamed Columns", "Scrapper", each Scrapper([Pages])),
    #"Expanded Scrapper" = Table.ExpandTableColumn(#"Invoked Custom Function", "Scrapper", {"Title", "Date", "Views"}, {"Title", "Date", "Views"})
in
    #"Expanded Scrapper"
Error: 0x0 at Data Flow Task, Power Query Source: The import
Html.Table matches no exports. Did you miss a module reference?
I don't see any other solution than to wait for Microsoft to support Html.Table.
I'm trying to merge several columns in a table with columns of another table. Each column in the primary table contains text, while the PrimaryAnalysis table contains indexes for the texts. I'd like to create columns of indexes for the primary table, but I'm having to do that one column at a time, thus:
#"Merged Queries" = Table.NestedJoin(#"Changed Type2",{"Text.1"},PrimaryAnalysis,{"Letter"},"NewColumn"),
#"Expanded NewColumn" = Table.ExpandTableColumn(#"Merged Queries", "NewColumn", {"Index"}, {"Index"}),
#"Renamed Columns2" = Table.RenameColumns(#"Expanded NewColumn",{{"Index", "First"}}),
#"Merged Queries1" = Table.NestedJoin(#"Renamed Columns2",{"Text.2"},PrimaryAnalysis,{"Letter"},"NewColumn"),
#"Expanded NewColumn1" = Table.ExpandTableColumn(#"Merged Queries1", "NewColumn", {"Index"}, {"Index"}),
#"Renamed Columns3" = Table.RenameColumns(#"Expanded NewColumn1",{{"Index", "2nd"}}),
#"Merged Queries2" = Table.NestedJoin(#"Renamed Columns3",{"Text.3"},PrimaryAnalysis,{"Letter"},"NewColumn"),
#"Expanded NewColumn2" = Table.ExpandTableColumn(#"Merged Queries2", "NewColumn", {"Index"}, {"Index"}),
#"Renamed Columns4" = Table.RenameColumns(#"Expanded NewColumn2",{{"Index", "3rd"}}),
Now I have to do that for 23 columns. Is there a way to implement DO...Repeat or any other loop in Power Query to perform this task?
Thanks in advance.
One way to loop in Power Query is to use a recursive function.
In the code below I read an Excel file with a table that should be similar to your primary table (so the step #"Changed Type2" in the code below should be similar to your step #"Changed Type2").
Next, a function AddIndices is defined, which adds one index column per iteration. After 23 iterations the function stops; otherwise it calls itself.
An important point of attention with such recursive functions is that they MUST include a Table.Buffer (see the step "Expanded"); otherwise in each iteration the code tries to evaluate all former iterations again and gets stuck. Table.Buffer prevents this.
In the last step of the query, the function is invoked.
let
    Source = Excel.Workbook(File.Contents("C:\Users\Marcel\Documents\Forum bijdragen\StackOverflow Power Query\Loop Computation in Power Query.xlsx"), null, true),
    Tabel1_Table = Source{[Item="Tabel1",Kind="Table"]}[Data],
    #"Changed Type2" = Table.TransformColumnTypes(Tabel1_Table,{{"Text.1", type text}, {"Text.2", type text}, {"Text.3", type text}, {"Text.4", type text}, {"Text.5", type text}, {"Text.6", type text}, {"Text.7", type text}, {"Text.8", type text}, {"Text.9", type text}, {"Text.10", type text}, {"Text.11", type text}, {"Text.12", type text}, {"Text.13", type text}, {"Text.14", type text}, {"Text.15", type text}, {"Text.16", type text}, {"Text.17", type text}, {"Text.18", type text}, {"Text.19", type text}, {"Text.20", type text}, {"Text.21", type text}, {"Text.22", type text}, {"Text.23", type text}}),
    // Recursive function:
    AddIndices = (TableSoFar as table, optional Iteration as number) as table =>
        let
            CurrentIteration = if Iteration = null then 1 else Iteration,
            CurrentColumn = "Text." & Text.From(CurrentIteration),
            NewIndexColumn = "Index." & Text.From(CurrentIteration),
            MergedTable = Table.NestedJoin(TableSoFar, {CurrentColumn}, PrimaryAnalysis, {"Letter"}, "NewColumn"),
            Expanded = Table.Buffer(Table.ExpandTableColumn(MergedTable, "NewColumn", {"Index"}, {NewIndexColumn})),
            // @ is the scoping operator required for the recursive call
            Result = if CurrentIteration = 23 then Expanded else @AddIndices(Expanded, CurrentIteration + 1)
        in
            Result,
    // Call recursive function:
    AddedIndices = AddIndices(#"Changed Type2")
in
    AddedIndices
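As a design note: because the number of iterations is known up front, the same result can also be achieved without explicit recursion (and without the Table.Buffer subtlety) by folding over the column numbers with List.Accumulate. A minimal sketch, assuming the same PrimaryAnalysis table and "Text.n" column names:

AddedIndices = List.Accumulate(
    {1..23},
    #"Changed Type2",
    (state, i) =>
        Table.ExpandTableColumn(
            Table.NestedJoin(state, {"Text." & Text.From(i)}, PrimaryAnalysis, {"Letter"}, "NewColumn"),
            "NewColumn",
            {"Index"},
            {"Index." & Text.From(i)}))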
I'm trying to use csv.reader to create a list of items from a string, but I'm having trouble. For instance, I have the following string:
bibinfo = "wooldridge1999asymptotic, author = \"Wooldridge, Jeffrey M.\", title = \"Asymptotic Properties of Weighted M-Estimators for Variable Probability Samples\", journal = \"Econometrica\", volume = \"\", year = 1999"
And I run the following code:
import csv
from io import StringIO
bibitems = [bibitem for bibitem in csv.reader(StringIO(bibinfo), skipinitialspace = True)][0]
But instead of getting a list in which commas within a pair of double quotes are not treated as separators, I obtain the following (unwanted) result:
['wooldridge1999asymptotic', 'author = "Wooldridge', 'Jeffrey M."', 'title = "Asymptotic Properties of Weighted M-Estimators for Variable Probability Samples"', 'journal = "Econometrica"', 'volume = ""', 'year = 1999']
In other words, it separates some items (like author's surname from first name) when it should not. I followed the tips in this other link, but it seems that I'm missing something else too.
It works if the " is at the beginning of the item, because csv.reader treats a field as quoted only when the quote character is the first character of the field (after any whitespace skipped by skipinitialspace):
"author = Wooldridge, Jeffrey M."
With the changed text:
>>> s = """wooldridge1999asymptotic, "author = Wooldridge, Jeffrey M.", title = "Asymptotic Properties of Weighted M-Estimators for Variable Probability Samples", journal = "Econometrica", volume = "", year = 1999"""
>>> list(csv.reader(s.splitlines(), skipinitialspace=True))
[['wooldridge1999asymptotic',
'author = Wooldridge, Jeffrey M.',
'title = "Asymptotic Properties of Weighted M-Estimators for Variable Probability Samples"',
'journal = "Econometrica"',
'volume = ""',
'year = 1999']]
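The underlying rule (my note, not part of the original answer): csv.reader honors a quote only when it opens a field; a quote that appears after other characters in the field is kept literally. A quick demonstration:

>>> list(csv.reader(['a,"b,c"'], skipinitialspace=True))
[['a', 'b,c']]
>>> list(csv.reader(['a,x = "b,c"'], skipinitialspace=True))
[['a', 'x = "b', 'c"']]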