Conversion of a list to a dictionary python - json

{
"event_type": "ITEM_PREVIEW",
"event_id": "67521d60cbb5f4dedef901d5e82f394ed122662d",
"created_at": "2015-10-21T14:12:46-07:00"
}
I have this JSON which is being read as a list; how do I
convert it back to JSON,
or
convert it to a dict, so that event_type is a key and ITEM_PREVIEW is its value?
I tried converting it to a string and using json.Encoder.
I also tried the code below: the first function gets events and saves them to a file, and I want the second one to be able to parse that information.
def events():
    for event in client.events().generate_events_with_long_polling():
        print(event)
        ev = open('events.txt', 'a')
        json.dump(event, ev)
        ev.write('\n')
        ev.close()
    return ev
#events()
def trigger():
    entries = open('events.txt', 'rU')
    print('\n', type(entries))
    dictss = entries.readlines()
    print('\n', type(dictss), '\n', len(dictss))
    for q in dictss:
        print(q)
    w = dict([x.strip().split(":") for x in dictss if " " in x])
    print(w)
trigger()
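Since each line written by events() is a complete JSON object, a simpler way to get the dict you describe (event_type as a key, ITEM_PREVIEW as its value) is to parse every line with json.loads instead of splitting on ':'. A minimal sketch, assuming events.txt was written by the first function:
import json

def parse_events():
    with open('events.txt') as entries:
        for line in entries:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)  # e.g. {'event_type': 'ITEM_PREVIEW', 'event_id': '...', ...}
            print(event['event_type'], event['event_id'])

parse_events()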

How to parse CSV file in the most performant way?

I would like to parse big CSV files in ABAP in the most performant way under the following conditions:
We do not know the structure of the CSV -> the parse result should be a table of string_table or something similar
The parsing should happen in accordance with https://www.rfc-editor.org/rfc/rfc4180
No solution-specific calls
I found a very nice blog https://blogs.sap.com/2014/09/09/understanding-csv-files-and-their-handling-in-abap/ but each of its suggestions has shortcomings:
Write your own code - the code example is not sufficient
Read the file using KCD_CSV_FILE_TO_INTERN_CONVERT - solution specific (not available everywhere) and will dump on fields that are long enough
Use RTTI and dynamic programming along with FM RSDS_CONVERT_CSV - we do not know the structure in advance
Use class CL_RSDA_CSV_CONVERTER - we do not know the structure in advance
I also checked the first available solution on GitHub - https://github.com/thedoginthewok/ZwdCSV . Unfortunately, it has macros in the code (absolutely unacceptable) and also requires you to know the structure in advance.
I also tried to use a regex to do the job, but on big files this is too slow.
Even though I am extremely annoyed by this fact, I had to create a solution myself (I cannot believe that I actually did it - it should be in the standard...).
My first solution was a direct copy-paste of Java code into ABAP (https://mkyong.com/java/how-to-read-and-parse-csv-file-in-java/). Unfortunately, as my other question How to iterate over string characters in ABAP in performant way? showed, it is not as easy to iterate over a string in ABAP as it is in Java.
I then tried a split/count approach, and so far it has the best performance. Does anyone know a better way to achieve this?
REPORT z_csv_test.
CLASS lcl_csv_parser DEFINITION CREATE PRIVATE.
PUBLIC SECTION.
TYPES:
tt_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
CLASS-METHODS:
create
IMPORTING
!iv_delimiter TYPE string DEFAULT '"'
!iv_separator TYPE string DEFAULT ','
!iv_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf
RETURNING
VALUE(r_result) TYPE REF TO lcl_csv_parser.
METHODS:
parse
IMPORTING
iv_string TYPE string
RETURNING
VALUE(r_result) TYPE tt_string_matrix,
constructor
IMPORTING
!iv_delimiter TYPE string
!iv_separator TYPE string
!iv_line_separator TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
DATA:
gv_delimiter TYPE string,
gv_separator TYPE string,
gv_line_separator TYPE string,
gv_escaped_delimiter TYPE string.
METHODS parse_line_to_string_table
IMPORTING
iv_line TYPE string
RETURNING
VALUE(r_result) TYPE string_table.
ENDCLASS.
CLASS lcl_csv_parser IMPLEMENTATION.
METHOD create.
r_result = NEW #(
iv_delimiter = iv_delimiter
iv_line_separator = CONV #( iv_line_separator )
iv_separator = iv_separator ).
ENDMETHOD.
METHOD constructor.
me->gv_delimiter = iv_delimiter.
me->gv_separator = iv_separator.
me->gv_line_separator = iv_line_separator.
me->gv_escaped_delimiter = |{ iv_delimiter }{ iv_delimiter }|.
ENDMETHOD.
METHOD parse.
"get the lines
SPLIT iv_string AT me->gv_line_separator INTO TABLE DATA(lt_lines).
DATA lx_open_line TYPE abap_bool VALUE abap_false.
DATA lv_current_line TYPE string.
LOOP AT lt_lines ASSIGNING FIELD-SYMBOL(<ls_line>).
FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_line> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
IF ( lv_count MOD 2 ) = 1.
IF lx_open_line = abap_true.
lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
lx_open_line = abap_false.
APPEND parse_line_to_string_table( lv_current_line ) TO r_result.
ELSE.
lv_current_line = <ls_line>.
lx_open_line = abap_true.
ENDIF.
ELSE.
IF lx_open_line = abap_true.
lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
ELSE.
APPEND parse_line_to_string_table( <ls_line> ) TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
ENDMETHOD.
METHOD parse_line_to_string_table.
SPLIT iv_line AT me->gv_separator INTO TABLE DATA(lt_line).
DATA lx_open_field TYPE abap_bool VALUE abap_false.
DATA lv_current_field TYPE string.
LOOP AT lt_line ASSIGNING FIELD-SYMBOL(<ls_field>).
FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_field> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
IF ( lv_count MOD 2 ) = 1.
IF lx_open_field = abap_true.
lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
lx_open_field = abap_false.
APPEND lv_current_field TO r_result.
ELSE.
lv_current_field = <ls_field>.
lx_open_field = abap_true.
ENDIF.
ELSE.
IF lx_open_field = abap_true.
lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
ELSE.
APPEND <ls_field> TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
REPLACE ALL OCCURRENCES OF me->gv_escaped_delimiter IN TABLE r_result WITH me->gv_delimiter.
ENDMETHOD.
ENDCLASS.
CLASS lcl_test_csv_parser DEFINITION
FINAL
CREATE PUBLIC .
PUBLIC SECTION.
CLASS-METHODS run.
CLASS-METHODS get_file
RETURNING VALUE(r_result) TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
ENDCLASS.
CLASS lcl_test_csv_parser IMPLEMENTATION.
METHOD get_file.
DATA lv_file_line TYPE string.
DO 10 TIMES.
lv_file_line = |"1234,{ cl_abap_char_utilities=>cr_lf }567890",{ lv_file_line }|.
ENDDO.
lv_file_line = lv_file_line && cl_abap_char_utilities=>cr_lf.
DATA(lt_file_as_table) = VALUE string_table(
FOR i = 1 THEN i + 1 UNTIL i = 1000000
( lv_file_line ) ).
CONCATENATE LINES OF lt_file_as_table INTO r_result.
ENDMETHOD.
METHOD run.
DATA lv_prepare_start TYPE timestampl.
GET TIME STAMP FIELD lv_prepare_start.
DATA(lv_file) = get_file( ).
DATA lv_prepare_end TYPE timestampl.
GET TIME STAMP FIELD lv_prepare_end.
WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = lv_prepare_end tstmp2 = lv_prepare_start ) }|.
DATA lv_parse_start TYPE timestampl.
GET TIME STAMP FIELD lv_parse_start.
DATA(lo_parser) = lcl_csv_parser=>create( ).
DATA(lt_file) = lo_parser->parse( lv_file ).
DATA lv_parse_end TYPE timestampl.
GET TIME STAMP FIELD lv_parse_end.
WRITE |Parse took { cl_abap_tstmp=>subtract( tstmp1 = lv_parse_end tstmp2 = lv_parse_start ) }|.
ENDMETHOD.
ENDCLASS.
START-OF-SELECTION.
lcl_test_csv_parser=>run( ).
I'd like to present a different approach that uses find heavily. Compared to your line-based approach it seems to have equivalent performance for unquoted fields, but performs slightly better if quoted fields are present.
In general, it uses the pattern position = find( off = position + 1 ) to iterate over the string in chunks, and then uses substring to copy ranges into strings. What can be observed here is that in a loop that iterates a million times, every nanosecond saved has an impact, so moving as much work as possible out of the inner loop increases performance significantly. For the "simple" case of 10-digit fields both algorithms perform equally well; for "longer" 30-digit fields your algorithm gets comparatively faster. For fields with quotes, the scan & concat approach I've used seems to be faster than the "reconstruct" approach.
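For readers who find the ABAP string functions hard to follow, the same chunked-scan idea can be sketched in Python (purely an illustration of the pattern under the usual RFC 4180 assumptions, not code from this report):
def parse_csv(text, delimiter='"', separator=',', line_sep='\r\n'):
    """Chunked scan: find() with a start offset plus slicing, no per-character loop."""
    result, line, pos, length = [], [], 0, len(text)
    while pos < length:
        if text[pos] == delimiter:                       # quoted field
            parts, pos = [], pos + 1
            while True:
                nxt = text.find(delimiter, pos)          # next quote
                parts.append(text[pos:nxt])
                if text[nxt + 1:nxt + 2] == delimiter:   # "" is an escaped quote
                    parts.append(delimiter)
                    pos = nxt + 2
                else:                                    # real closing quote
                    pos = nxt + 1
                    break
            line.append(''.join(parts))
        else:                                            # unquoted field: scan to separator or newline
            nxt = min((i for i in (text.find(separator, pos),
                                   text.find(line_sep, pos)) if i != -1), default=length)
            line.append(text[pos:nxt])
            pos = nxt
        if text[pos:pos + len(line_sep)] == line_sep:    # end of record
            result.append(line)
            line = []
            pos += len(line_sep)
        elif pos < length:                               # skip the field separator
            pos += 1
    if line:                                             # last record without trailing newline
        result.append(line)
    return result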
I guess that although one can achieve small gains through more clever ABAP, further significant optimizations are only possible by utilizing the engine even more.
Anyway, here's the algorithm:
CLASS lcl_csv_parser_find IMPLEMENTATION.
METHOD parse.
DATA line TYPE string_table.
DATA position TYPE i.
DATA(string_length) = strlen( i_string ).
" Dereferencing member fields is slightly slower than variable access, in a close loop this matters
DATA(separators) = me->separators.
DATA(delimiter) = me->delimiter.
CHECK string_length <> 0.
" Checking for delimiters in the DO loop is quite slow.
" By scanning the whole file once and skipping that check if no delimiter is present
" This lead to a slight performance increase of 1s for 1 million rows
DATA(next_delimiter) = find( val = i_string sub = delimiter ).
DO.
DATA(start_position) = position.
DATA(field) = ``.
" Check if field is enclosed in double quotes, as we need to unescape then
IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
start_position = start_position + 1. " literal starts after opening quote
DO.
position = find( val = i_string off = position + 1 sub = delimiter ).
" literal must be closed
" ASSERT position <> -1.
DATA(subliteral_length) = position - start_position.
field = field && substring( val = i_string off = start_position len = subliteral_length ).
DATA(following_position) = position + 1.
IF position = string_length OR i_string+following_position(1) <> delimiter.
" End of literal is reached
position = position + 1. " skip closing quote
EXIT. " DO
ELSE.
" Found escape quote instead
position = following_position + 1.
field = field && me->delimiter.
" continue searching
ENDIF.
" ASSERT sy-index < 1000.
ENDDO.
ELSE.
" Unescaped field, simply find the ending comma or newline
position = find_any_of( val = i_string off = position + 1 sub = separators ).
IF position = -1.
position = string_length.
ENDIF.
field = substring( val = i_string off = start_position len = position - start_position ).
ENDIF.
APPEND field TO line.
" Check if line ended and new line is started
DATA(current) = substring( val = i_string off = position len = 2 ).
IF current = me->line_separator.
APPEND line TO r_result.
CLEAR line.
position = position + 2. " skip newline
ELSE.
" ASSERT i_string+position(1) = me->separator.
position = position + 1.
ENDIF.
" Check if file ended
IF position >= string_length.
RETURN.
ENDIF.
" ASSERT sy-index < 100000001.
ENDDO.
ENDMETHOD.
ENDCLASS.
As a sidenote, instead of creating a huge table of string fields as stated in #1, I would experiment with some kind of "visitor pattern", e.g. pass an instance of such an interface to the parser:
INTERFACE if_csv_visitor.
METHODS begin_line.
METHODS end_line.
METHODS visit_field
IMPORTING
i_field TYPE string.
ENDINTERFACE.
As in a lot of cases you'll write the CSV fields into a structure anyway, this saves allocating that quite large table.
And for further reference, here's the whole report:
*&---------------------------------------------------------------------*
*& Report Z_CSV
*&---------------------------------------------------------------------*
*&
*&---------------------------------------------------------------------*
REPORT Z_CSV.
* --------------------- Generic CSV Parser ----------------------------*
CLASS lcl_csv_parser DEFINITION ABSTRACT.
PUBLIC SECTION.
TYPES:
t_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
METHODS:
parse ABSTRACT
IMPORTING
i_string TYPE string
RETURNING
VALUE(r_result) TYPE t_string_matrix,
constructor
IMPORTING
i_delimiter TYPE string DEFAULT '"'
i_separator TYPE string DEFAULT ','
i_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf.
PROTECTED SECTION.
DATA:
delimiter TYPE string,
separator TYPE string,
line_separator TYPE string,
escaped_delimiter TYPE string,
separators TYPE string.
ENDCLASS.
CLASS lcl_csv_parser IMPLEMENTATION.
METHOD constructor.
me->delimiter = i_delimiter.
me->separator = i_separator.
me->line_separator = i_line_separator.
me->escaped_delimiter = |{ i_delimiter }{ i_delimiter }|.
me->separators = i_separator && i_line_separator.
ENDMETHOD.
ENDCLASS.
* --------------------------- Line based CSV Parser ------------------------ *
CLASS lcl_csv_parser_line DEFINITION INHERITING FROM lcl_csv_parser.
PUBLIC SECTION.
METHODS parse REDEFINITION.
PRIVATE SECTION.
METHODS parse_line_to_string_table
IMPORTING
i_line TYPE string
RETURNING
VALUE(r_result) TYPE string_table.
ENDCLASS.
CLASS lcl_csv_parser_line IMPLEMENTATION.
METHOD parse.
"get the lines
SPLIT i_string AT me->line_separator INTO TABLE DATA(lines).
DATA open_line TYPE abap_bool VALUE abap_false.
DATA current_line TYPE string.
LOOP AT lines ASSIGNING FIELD-SYMBOL(<line>).
FIND ALL OCCURRENCES OF me->delimiter IN <line> IN CHARACTER MODE MATCH COUNT DATA(count).
IF ( count MOD 2 ) = 1.
IF open_line = abap_true.
current_line = |{ current_line }{ me->line_separator }{ <line> }|.
open_line = abap_false.
APPEND parse_line_to_string_table( current_line ) TO r_result.
ELSE.
current_line = <line>.
open_line = abap_true.
ENDIF.
ELSE.
IF open_line = abap_true.
current_line = |{ current_line }{ me->line_separator }{ <line> }|.
ELSE.
APPEND parse_line_to_string_table( <line> ) TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
ENDMETHOD.
METHOD parse_line_to_string_table.
SPLIT i_line AT me->separator INTO TABLE DATA(fields).
DATA open_field TYPE abap_bool VALUE abap_false.
DATA current_field TYPE string.
LOOP AT fields ASSIGNING FIELD-SYMBOL(<field>).
FIND ALL OCCURRENCES OF me->delimiter IN <field> IN CHARACTER MODE MATCH COUNT DATA(count).
IF ( count MOD 2 ) = 1.
IF open_field = abap_true.
current_field = |{ current_field }{ me->separator }{ <field> }|.
open_field = abap_false.
APPEND current_field TO r_result.
ELSE.
current_field = <field>.
open_field = abap_true.
ENDIF.
ELSE.
IF open_field = abap_true.
current_field = |{ current_field }{ me->separator }{ <field> }|.
ELSE.
APPEND <field> TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
REPLACE ALL OCCURRENCES OF me->escaped_delimiter IN TABLE r_result WITH me->delimiter.
ENDMETHOD.
ENDCLASS.
*--------------- Find based CSV Parser ------------------------------------*
CLASS lcl_csv_parser_find DEFINITION INHERITING FROM lcl_csv_parser.
PUBLIC SECTION.
METHODS parse REDEFINITION.
ENDCLASS.
CLASS lcl_csv_parser_find IMPLEMENTATION.
METHOD parse.
DATA line TYPE string_table.
DATA position TYPE i.
DATA(string_length) = strlen( i_string ).
" Dereferencing member fields is slightly slower than variable access, in a close loop this matters
DATA(separators) = me->separators.
DATA(delimiter) = me->delimiter.
CHECK string_length <> 0.
" Checking for delimiters in the DO loop is quite slow.
" By scanning the whole file once and skipping that check if no delimiter is present
" This lead to a slight performance increase of 1s for 1 million rows
DATA(next_delimiter) = find( val = i_string sub = delimiter ).
DO.
DATA(start_position) = position.
DATA(field) = ``.
" Check if field is enclosed in double quotes, as we need to unescape then
IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
start_position = start_position + 1. " literal starts after opening quote
DO.
position = find( val = i_string off = position + 1 sub = delimiter ).
" literal must be closed
" ASSERT position <> -1.
DATA(subliteral_length) = position - start_position.
field = field && substring( val = i_string off = start_position len = subliteral_length ).
DATA(following_position) = position + 1.
IF position = string_length OR i_string+following_position(1) <> delimiter.
" End of literal is reached
position = position + 1. " skip closing quote
EXIT. " DO
ELSE.
" Found escape quote instead
position = following_position + 1.
field = field && me->delimiter.
" continue searching
ENDIF.
" ASSERT sy-index < 1000.
ENDDO.
ELSE.
" Unescaped field, simply find the ending comma or newline
position = find_any_of( val = i_string off = position + 1 sub = separators ).
IF position = -1.
position = string_length.
ENDIF.
field = substring( val = i_string off = start_position len = position - start_position ).
ENDIF.
APPEND field TO line.
" Check if line ended and new line is started
DATA(current) = substring( val = i_string off = position len = 2 ).
IF current = me->line_separator.
APPEND line TO r_result.
CLEAR line.
position = position + 2. " skip newline
ELSE.
" ASSERT i_string+position(1) = me->separator.
position = position + 1.
ENDIF.
" Check if file ended
IF position >= string_length.
RETURN.
ENDIF.
" ASSERT sy-index < 100000001.
ENDDO.
ENDMETHOD.
ENDCLASS.
* -------------------- Tests -------------------------------------------------------- *
CLASS lcl_test_csv_parser DEFINITION
FINAL
CREATE PUBLIC .
PUBLIC SECTION.
CLASS-METHODS run.
CLASS-METHODS get_file_complex
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_simple
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_long
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_longer
RETURNING VALUE(r_result) TYPE string.
CLASS-METHODS get_file_mixed
RETURNING VALUE(r_result) TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
ENDCLASS.
CLASS lcl_test_csv_parser IMPLEMENTATION.
METHOD get_file_complex.
DATA(file_line) =
repeat( val = |"1234,{ cl_abap_char_utilities=>cr_lf }7890",| occ = 9 ) &&
|"1234,{ cl_abap_char_utilities=>cr_lf }7890"| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_simple.
DATA(file_line) =
repeat( val = |1234567890,| occ = 9 ) &&
|1234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_long.
DATA(file_line) =
repeat( val = |12345678901234567890,| occ = 4 ) &&
|12345678901234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_longer.
DATA(file_line) =
repeat( val = |1234567890123456789012345678901234567890,| occ = 2 ) &&
|1234567890123456789012345678901234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD get_file_mixed.
DATA(file_line) =
|1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890| &&
cl_abap_char_utilities=>cr_lf.
r_result = repeat( val = file_line occ = 1000000 ).
ENDMETHOD.
METHOD run.
DATA prepare_start TYPE timestampl.
GET TIME STAMP FIELD prepare_start.
TYPES:
BEGIN OF t_file,
name TYPE string,
content TYPE string,
END OF t_file,
t_files TYPE STANDARD TABLE OF t_file WITH EMPTY KEY.
DATA(files) = VALUE t_files(
( name = `simple` content = get_file_simple( ) )
( name = `long` content = get_file_long( ) )
( name = `longer` content = get_file_longer( ) )
( name = `complex` content = get_file_complex( ) )
( name = `mixed` content = get_file_mixed( ) )
).
DATA prepare_end TYPE timestampl.
GET TIME STAMP FIELD prepare_end.
WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = prepare_end tstmp2 = prepare_start ) }|. SKIP 2.
WRITE: 'File', 15 'Line Parse', 30 'Find Parse', 45 'Match'. NEW-LINE.
ULINE.
LOOP AT files INTO DATA(file).
WRITE file-name UNDER 'File'.
DATA line_start TYPE timestampl.
GET TIME STAMP FIELD line_start.
DATA(line_parser) = NEW lcl_csv_parser_line( ).
DATA(line_result) = line_parser->parse( file-content ).
DATA line_end TYPE timestampl.
GET TIME STAMP FIELD line_end.
WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = line_end tstmp2 = line_start ) }s| UNDER 'Line Parse'.
DATA find_start TYPE timestampl.
GET TIME STAMP FIELD find_start.
DATA(find_parser) = NEW lcl_csv_parser_find( ).
DATA(find_result) = find_parser->parse( file-content ).
DATA find_end TYPE timestampl.
GET TIME STAMP FIELD find_end.
WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = find_end tstmp2 = find_start ) }s| UNDER 'Find Parse'.
" WRITE COND #( WHEN line_result = find_result THEN 'yes' ELSE 'no') UNDER 'Match'.
NEW-LINE.
ENDLOOP.
ENDMETHOD.
ENDCLASS.
START-OF-SELECTION.
lcl_test_csv_parser=>run( ).

How use wikidata api to access to the statements

I'm trying to get information from Wikidata. For example, to access "cobalt-70" I use the API.
import requests

API_ENDPOINT = "https://www.wikidata.org/w/api.php"
query = "cobalt-70"
params = {
    'action': 'wbsearchentities',
    'format': 'json',
    'language': 'en',
    'search': query
}
r = requests.get(API_ENDPOINT, params=params)
print(r.json())
So there is a "claims" key which gives access to the statements. Is there a good way to check whether a value exists in a statement? For example, "cobalt-70" has the value 0.5 inside the property P2114. So how can I check whether a value exists in the statements of the entity, as in this example?
Is there an approach to access it? Thank you!
I'm not sure this is exactly what you are looking for, but if it's close enough, you can probably modify it as necessary:
import requests
import json

url = 'https://www.wikidata.org/wiki/Special:EntityData/Q18844865.json'
req = requests.get(url)
j_dat = req.json()  # parse the entity data into a dict
targets = j_dat['entities']['Q18844865']['claims']['P2114']
for target in targets:
    values = target['mainsnak']['datavalue']['value'].items()
    for value in values:
        print(value[0], value[1])
Output:
amount +0.5
unit http://www.wikidata.org/entity/Q11574
upperBound +0.6799999999999999
lowerBound +0.32
amount +108.0
unit http://www.wikidata.org/entity/Q723733
upperBound +115.0
lowerBound +101.0
EDIT:
To find property id by value, try:
targets = j_dat['entities']['Q18844865']['claims'].items()
for target in targets:
    line = target[1][0]['mainsnak']['datavalue']['value']
    if isinstance(line, dict):
        for v in line.values():
            if v == "+0.5":
                print('property: ', target[0])
Output:
property: P2114
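To tie this back to the wbsearchentities call from the question, one could take the id of the first search hit and build the Special:EntityData URL from it; a minimal sketch (it assumes the first hit is the entity you want):
import requests

API_ENDPOINT = "https://www.wikidata.org/w/api.php"
params = {'action': 'wbsearchentities', 'format': 'json',
          'language': 'en', 'search': 'cobalt-70'}
hits = requests.get(API_ENDPOINT, params=params).json()['search']
entity_id = hits[0]['id']  # assumption: the first hit is the item we want

url = 'https://www.wikidata.org/wiki/Special:EntityData/' + entity_id + '.json'
claims = requests.get(url).json()['entities'][entity_id]['claims']
print('P2114' in claims)  # True if the entity has that property at all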
I tried a solution which consists of searching inside the JSON object, as in the solution proposed here: https://stackoverflow.com/a/55549654/8374738. I hope it can help. Here is the idea.
import pprint

def search(d, search_pattern, prev_datapoint_path=''):
    output = []
    current_datapoint = d
    current_datapoint_path = prev_datapoint_path
    if type(current_datapoint) is dict:
        for dkey in current_datapoint:
            if search_pattern in str(dkey):
                c = current_datapoint_path
                c += "['" + dkey + "']"
                output.append(c)
            c = current_datapoint_path
            c += "['" + dkey + "']"
            for i in search(current_datapoint[dkey], search_pattern, c):
                output.append(i)
    elif type(current_datapoint) is list:
        for i in range(0, len(current_datapoint)):
            if search_pattern in str(i):
                c = current_datapoint_path
                c += "[" + str(i) + "]"
                output.append(i)
            c = current_datapoint_path
            c += "[" + str(i) + "]"
            for i in search(current_datapoint[i], search_pattern, c):
                output.append(i)
    elif search_pattern in str(current_datapoint):
        c = current_datapoint_path
        output.append(c)
    output = filter(None, output)
    return list(output)
And you just need to use:
pprint.pprint(search(res.json(),'0.5','res.json()'))
Output:
["res.json()['claims']['P2114'][0]['mainsnak']['datavalue']['value']['amount']"]

Is there a way to take a list of strings and create a JSON file, where both the key and value are list items?

I am creating a Python script that can read scanned, tabular .pdfs, extract some important data, and insert it into a JSON to later be loaded into a SQL database (I will also be developing the DB as a project for learning MongoDB).
Basically, my issue is that I have never worked with JSON files before, but that was the format I was recommended to output to. The scraping script works and the pre-processing could be a lot cleaner, but for now it works. The issue I run into is that the keys and values end up in the same list, and some of the values, because they contain a decimal point, are split across two list items. I'm not really sure where to even start.
I suppose that since I know the indexes of the list I could easily assign keys and values, but then it would not be applicable to just any .pdf; that is, the script cannot be coded explicitly.
import PyPDF2 as pdf2
import textract
import re  # cleanText below needs re

filename = "TestSpec.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(0)
    count += 1
    text += pageObj.extractText()
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")

def cleanText(x):
    '''
    This function takes the byte data extracted from scanned PDFs and cleans it of all
    unnecessary data.
    Requires re
    '''
    stringedText = str(x)
    cleanText = stringedText.replace('\n', '')
    splitText = re.split(r'\W+', cleanText)
    caseingText = [word.lower() for word in splitText]
    cleanOne = [word for word in caseingText if word != 'n']
    dexStop = cleanOne.index("od260")
    dexStart = cleanOne.index("sheet")
    clean = cleanOne[dexStart + 1:dexStop]
    return clean

cleanText = cleanText(text)
This is the current output
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
and we want the output as a JSON setup like
{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}
and so on. Just not sure how to do that.
here is a screenshot of the data from my sample pdf
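For the key/value pairing itself, a rough sketch of the idea, assuming you decide (from the known indexes mentioned above) which slices of cleanText are labels and which are values; the slice positions below are only placeholders:
import json

# Placeholder slices -- adjust to where the fields actually sit in cleanText.
record = {
    "Date": "".join(cleanText[0:3]),           # e.g. '21' + 'feb' + '2019'
    "Sequence ID": "-".join(cleanText[4:6]),   # e.g. 'lacz' + 'rp'
    "Sequence 5'-3'": " ".join(cleanText[7:19]),
}
with open("output.json", "w") as f:
    json.dump(record, f, ensure_ascii=False)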
So, I have figured out some of this. I am still having issues with grabbing the last third of the data I need without explicitly programming it in, but here is what I have so far. Once I have everything working I will worry about optimizing and condensing it.
# for PDF reading
import PyPDF2 as pdf2
import textract
# for data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json
# This finds and opens the pdf file, reads the data, and extracts the data.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()
# checks if extracted data is in string form or picture, if picture textract reads data.
# it then closes the pdf file
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()
# Converts text to string from byte data for preprocessing
stringedText = str(text)
# Removed escaped lines and replaced them with actual new lines.
formattedText = stringedText.replace('\\n', '\n').lower()
# Slices the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10): (formattedText.index("secondary") - 2)]
clean = re.sub('\n', " ", slice1)
clean2 = re.sub(' +', ' ', clean)
# Creating the PrimerData dictionary
with open("PrimerData.json",'w') as file:
primerDataSlice = clean[clean.index("molecular"): -1]
primerData = re.split(": |\n", primerDataSlice)
primerKeys = primerData[0::2]
primerValues = primerData[1::2]
primerDict = {"Primer Data": dict(zip(primerKeys,primerValues))}
# Generatring the JSON array "Primer Data"
primerJSON = json.dumps(primerDict, ensure_ascii=False)
file.write(primerJSON)
# Grabbing the date (this has just the date, so json will have to add date.)
date = re.findall('(\d{2}[\/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[\/\- ]\d{2,4})', clean2)
Without input data it is difficult to give you working code. A minimal working example with input would help. As for JSON handling, python dictionaries can dump to json easily. See examples here.
https://docs.python-guide.org/scenarios/json/
Get a json string from a dictionary and write to a file. Figure out how to parse the text into a dictionary.
import json
d = {"Date" : "21feb2019", "Sequence ID" : "lacz-rp", "Sequence 5'-3'" : "gat"}
json_data = json.dumps(d)
print(json_data)
# Write that data to a file
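with open("output.json", "w") as f:   # "output.json" is just an example file name
    f.write(json_data)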
So, I did figure this out. The problem was really that pulling all the data into a single list during pre-processing wasn't a great idea, considering that the keys for the dictionary never change.
Here is the semi-finished result for making the Dictionary and JSON file.
# Assumed imports for this snippet: date for the received date, and Biopython
# helpers for melting temperature (mt), molecular weight (mw), Seq and generic_dna.
from datetime import date
from Bio.SeqUtils import MeltingTemp as mt
from Bio.SeqUtils import molecular_weight as mw
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna  # Biopython < 1.78

# Collect the sequence name
name = clean2[clean2.index("Sequence") + 11: clean2.index("Sequence") + 19]
# Collecting Shipment info
ordered = input("Who placed this order? ")
received = input("Who is receiving this order? ")
dateOrder = re.findall(
r"(\d{2}[/\- ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/\- ]\d{2,4})",
clean2)
dateReceived = date.today()
refNo = clean2[clean2.index("ref.No. ") + 8: clean2.index("ref.No.") + 17]
orderNo = clean2[clean2.index("Order No.") +
10: clean2.index("Order No.") + 18]
# Finding and grabbing the sequence data. Storing it and then finding the
# GC content and melting temp or TM
bases = int(clean2[clean2.index("bases") - 3:clean2.index("bases") - 1])
seqList = [line for line in clean2 if re.match(r'^[AGCT]+$', line)]
sequence = "".join(i for i in seqList[:bases])
def gc_content(x):
    count = 0
    for i in x:
        if i == 'G' or i == 'C':
            count += 1
        else:
            count = count
    return round((count / bases) * 100, 1)
gc = gc_content(sequence)
tm = mt.Tm_GC(sequence, Na=50)
moleWeight = round(mw(Seq(sequence, generic_dna)), 2)
dilWeight = float(clean2[clean2.index("ug/OD260:") +
10: clean2.index("ug/OD260:") + 14])
dilution = dilWeight * 10
primerDict = {
    "Primer Data": {
        "Sequence": sequence,
        "Bases": bases,
        "TM (50mM NaCl)": tm,
        "% GC content": gc,
        "Molecular weight": moleWeight,
        "ug/0D260": dilWeight,
        "Dilution volume (uL)": dilution
    },
    "Shipment Info": {
        "Ref. No.": refNo,
        "Order No.": orderNo,
        "Ordered by": ordered,
        "Date of Order": dateOrder,
        "Received By": received,
        "Date Received": str(dateReceived.strftime("%d-%b-%Y"))
    }}
# Generating the JSON array "Primer Data"
with open("".join(name) + ".json", 'w') as file:
primerJSON = json.dumps(primerDict, ensure_ascii=False)
file.write(primerJSON)

Auto-generate json paths from given json structure using Python

I am trying to auto-generate JSON paths from a given JSON structure, but I am stuck on the programmatic part. Can someone please help out with an idea to take it further?
Below is the code I have so far.
def iterate_dict(dict_data, key, tmp_key):
    for k, v in dict_data.items():
        key = key + tmp_key + '.' + k
        key = key.replace('$$', '$')
        if type(v) is dict:
            tmp_key = key
            key = '$'
            iterate_dict(v, key, tmp_key)
        elif type(v) is list:
            str_encountered = False
            for i in v:
                if type(i) is str:
                    str_encountered = True
                    tmp_key = key
                    break
                tmp_key = key
                key = '$'
                iterate_dict(i, key, tmp_key)
            if str_encountered:
                print(key, v)
                if tmp_key is not None:
                    tmp_key = str(tmp_key)[:-str(k).__len__() - 1]
                key = '$'
        else:
            print(key, v)
            key = '$'

import json
iterate_dict(dict(json.loads(d_data)), '$', '')
consider the below json structure
{
    "id": "1",
    "categories": [
        {
            "name": "author",
            "book": "fiction",
            "leaders": [
                {
                    "ref": ["wiki", "google"],
                    "athlete": {
                        "$ref": "some data"
                    },
                    "data": {
                        "$data": "some other data"
                    }
                }
            ]
        },
        {
            "name": "dummy name"
        }
    ]
}
Expected output out of python script:
$id = 1
$categories[0].name = author
$categories[0].book = fiction
$categories[0].leaders[0].ref[0] = wiki
$categories[0].leaders[0].ref[1] = google
$categories[0].leaders[0].athlete.$ref = some data
$categories[0].leaders[0].data.$data = some other data
$categories[1].name = dummy name
Current output with above python script:
$.id 1
$$.categories.name author
$$.categories.book fiction
$$$.categories.leaders.ref ["wiki", "google"]
$$$$$.categories.leaders.athlete.$ref some data
$$$$$$.categories.leaders.athlete.data.$data some other data
$$.name dummy name
The following recursive function is similar to yours, but instead of just displaying a dictionary, it can also take a list. This means that if you passed in a dictionary where one of the values was a nested list, then the output would still be correct (printing things like dict.key[3][4] = element).
def disp_paths(it, p='$'):
    for k, v in (it.items() if type(it) is dict else enumerate(it)):
        if type(v) is dict:
            disp_paths(v, '{}.{}'.format(p, k))
        elif type(v) is list:
            for i, e in enumerate(v):
                if type(e) is dict or type(e) is list:
                    disp_paths(e, '{}.{}[{}]'.format(p, k, i))
                else:
                    print('{}.{}[{}] = {}'.format(p, k, i, e))
        else:
            f = '{}.{} = {}' if type(it) is dict else '{}[{}] = {}'
            print(f.format(p, k, v))
which, when run with your dictionary (disp_paths(d)), gives the expected output of:
$.categories[0].leaders[0].athlete.$ref = some data
$.categories[0].leaders[0].data.$data = some other data
$.categories[0].leaders[0].ref[0] = wiki
$.categories[0].leaders[0].ref[1] = google
$.categories[0].book = fiction
$.categories[0].name = author
$.categories[1].name = dummy name
$.id = 1
Note that this is unfortunately not ordered, which was unavoidable at the time because dictionaries did not guarantee any order before Python 3.7 (they were treated as plain sets of key:value pairs).
If you need help understanding my modifications, just drop a comment!
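For completeness, here is one way to run it on the JSON from the question (a small usage sketch; raw_json is assumed to hold that JSON text):
import json

raw_json = '''{"id": "1", "categories": [{"name": "author", "book": "fiction",
 "leaders": [{"ref": ["wiki", "google"], "athlete": {"$ref": "some data"},
 "data": {"$data": "some other data"}}]}, {"name": "dummy name"}]}'''

disp_paths(json.loads(raw_json))  # prints one "$...path = value" line per leaf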

Converting a MongoDB Query String with Datetime field into Dict in Python

I have a string as follows,
s = "query : {'$and': [{'$or': [{'Component': 'pfr'}, {'Component': 'ng-pfr'}, {'Component': 'common-flow-table'}, {'Component': 'media-mon'}]}, {'Submitted-on': {'$gte': datetime.datetime(2016, 2, 21, 0, 0)}}, {'Submitted-on': {'$lte': datetime.datetime(2016, 2, 28, 0, 0)}}]}"
which is a MongoDB query stored in a string. How do I convert it into a dict or JSON format in Python?
Your format is not standard, so you need a hack to get it.
import json

s = " query : {'names' :['abc','xyz'],'location':'India'}"
key, value = s.strip().split(':', 1)
r = value.replace("'", '"')
data = {
    key: json.loads(r)
}
From your comment: the datetime gives problems. Then I present to you the hack of hacks: the eval function.
import datetime
import json

s = " query : {'names' :['abc','xyz'],'location':'India'}"
key, value = s.strip().split(':', 1)
# we can leave out the replacing, single quotes is fine for eval
data = {
    key: eval(value)
}
N.B. eval, especially on unsanitized input, is very unsafe.
N.B. hacks like these will break eventually; the first one, for example, as soon as a value or key contains a quote character.
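If you do go the eval route, a common mitigation (a sketch, not a real sandbox) is to hand eval an empty set of builtins and only the datetime name the expression actually needs:
import datetime

key, value = s.strip().split(':', 1)
# expose nothing but the datetime module to the evaluated expression
data = {key: eval(value, {'__builtins__': {}}, {'datetime': datetime})}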