BigQuery JSON functions - cannot extract all values if the JSON string is not well formatted

I have a JSON string stored in a field in BigQuery which has this structure:
{'language': 'Eng', 'date_started': '2021-02-08 16: 56: 55 GMT', 'link_id': '111', 'url_variables': {'touchpoint': {'key': 'touchpoint', 'value': 'phone', 'type': 'url'
}, 'interaction_id': {'key': 'interaction_id', 'value': '111', 'type': 'url'
}
}, 'ip_address': None, 'referer': '', 'user_agent': None, 'response_time': 111, 'data_quality': [], 'longitude': '', 'latitude': '', 'country': '', 'city': '', 'region': '', 'postal': '', 'dma': '', 'survey_data': {'25': {'id': 25, 'type': 'TEXTBOX', 'question': 'feedback_source', 'section_id': 1, 'shown': False
}, '229': {'id': 229, 'type': 'TEXTBOX', 'question': 'recruitment_method', 'section_id': 1, 'shown': False
}, '227': {'id': 227, 'type': 'TEXTBOX', 'question': 'meeting_point', 'section_id': 1, 'answer': 'phone', 'shown': True
}, '221': {'id': 221, 'type': 'TEXTBOX', 'question': 'interaction_id', 'section_id': 1, 'answer': '222', 'shown': True
}, '217': {'id': 217, 'type': 'TEXTBOX', 'question': 'session_id', 'section_id': 1, 'answer': '333', 'shown': True
}, '231': {'id': 231, 'type': 'ESSAY', 'question': 'BlaBla question 4', 'section_id': 3, 'answer': 'Bla Bla answer', 'shown': True
}, '255': {'id': 255, 'type': 'TEXTBOX', 'question': 'tz_offset', 'section_id': 3, 'answer': '-120', 'shown': True
}, '77': {'id': 77, 'type': 'parent', 'question': 'Bla Bla 1', 'section_id': 35, 'options': {'10395': {'id': 10395, 'option': 'Neutraal', 'answer': '3'
}
}, 'shown': True
}, '250': {'id': 250, 'type': 'RADIO', 'question': 'Bla Bla?', 'section_id': 66, 'original_answer': '1', 'answer': '1', 'answer_id': 10860, 'shown': True
}, '251': {'id': 251, 'type': 'RADIO', 'question': 'Bla Bla', 'section_id': 66, 'original_answer': '0', 'answer': '0', 'answer_id': 10863, 'shown': True
}
}
}
I'm able to extract some of the values with the query below, but I cannot extract response_time or any of the values inside the survey_data structure.
They always come out as null.
DECLARE resp STRING
DEFAULT "{'id': '111', 'contact_id': '', 'status': 'Complete', 'is_test_data': '0', 'date_submitted': '2021-07-08 17: 02: 16 GMT', 'session_id': '111', 'language': 'Eng', 'date_started': '2021-02-08 16: 56: 55 GMT', 'link_id': '111', 'url_variables': {'touchpoint': {'key': 'touchpoint', 'value': 'phone', 'type': 'url' }, 'interaction_id': {'key': 'interaction_id', 'value': '111', 'type': 'url' } }, 'ip_address': None, 'referer': '', 'user_agent': None, 'response_time': 111, 'data_quality': [], 'longitude': '', 'latitude': '', 'country': '', 'city': '', 'region': '', 'postal': '', 'dma': '', 'survey_data': {'25': {'id': 25, 'type': 'TEXTBOX', 'question': 'feedback_source', 'section_id': 1, 'shown': False }, '229': {'id': 229, 'type': 'TEXTBOX', 'question': 'recruitment_method', 'section_id': 1, 'shown': False }, '227': {'id': 227, 'type': 'TEXTBOX', 'question': 'meeting_point', 'section_id': 1, 'answer': 'phone', 'shown': True }, '221': {'id': 221, 'type': 'TEXTBOX', 'question': 'interaction_id', 'section_id': 1, 'answer': '222', 'shown': True }, '217': {'id': 217, 'type': 'TEXTBOX', 'question': 'session_id', 'section_id': 1, 'answer': '333', 'shown': True }, '231': {'id': 231, 'type': 'ESSAY', 'question': 'BlaBla question 4', 'section_id': 3, 'answer': 'Bla Bla answer', 'shown': True }, '255': {'id': 255, 'type': 'TEXTBOX', 'question': 'tz_offset', 'section_id': 3, 'answer': '-120', 'shown': True }, '77': {'id': 77, 'type': 'parent', 'question': 'Bla Bla 1', 'section_id': 35, 'options': {'10395': {'id': 10395, 'option': 'Neutraal', 'answer': '3' } }, 'shown': True }, '250': {'id': 250, 'type': 'RADIO', 'question': 'Bla Bla?', 'section_id': 66, 'original_answer': '1', 'answer': '1', 'answer_id': 10860, 'shown': True }, '251': {'id': 251, 'type': 'RADIO', 'question': 'Bla Bla', 'section_id': 66, 'original_answer': '0', 'answer': '0', 'answer_id': 10863, 'shown': True } } }";
SELECT
  JSON_VALUE(resp, '$.url_variables.interaction_id.value') as url_interaction_id_value,
  JSON_VALUE(resp, '$.url_variables.interaction_id.type') as url_interaction_id_type,
  JSON_VALUE(resp, '$.language') as language,
  JSON_QUERY(resp, '$.response_time') as response_time, -- NOT WORKING
  JSON_QUERY(resp, '$.survey_data') as survey_data -- NOT WORKING
I tried jq in bash from the CLI, and it complains that some of the None values are not quoted.
Question:
Does this mean that BigQuery extracts values from the JSON string as far as it can, until it encounters something that is not well formed (e.g. the unquoted None values), and from that point on it cannot parse any further and returns nulls?
NB: In another app, I have been able to parse the json file in Python and extract values from inside the json string.

Looks like you have a few formatting issues with your resp field, which you can fix with a few REPLACEs as in the example below:
SELECT
  JSON_VALUE(resp, '$.url_variables.interaction_id.value') as url_interaction_id_value,
  JSON_VALUE(resp, '$.url_variables.interaction_id.type') as url_interaction_id_type,
  JSON_VALUE(resp, '$.language') as language,
  JSON_QUERY(resp, '$.response_time') as response_time, -- WORKING NOW
  JSON_QUERY(resp, '$.survey_data') as survey_data -- WORKING NOW
FROM (
  SELECT REPLACE(REPLACE(REPLACE(resp, "None,", "'None',"), "True", "true"), "False", "false") as resp
  FROM `project.dataset.table`
)
If applied to the sample data in your question, it now returns everything you need.
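As a side note: the stored string is Python dict repr syntax (single quotes, None, True/False) rather than JSON, so if you control the pipeline that loads this field, another option is to normalize it client-side before it ever reaches BigQuery. A minimal Python sketch (the raw string below is just a shortened placeholder sample):
import ast, json

raw = "{'language': 'Eng', 'ip_address': None, 'response_time': 111, 'shown': True}"  # shortened sample
parsed = ast.literal_eval(raw)   # safely evaluates the Python-literal syntax
valid_json = json.dumps(parsed)  # emits double quotes, null, true/false
print(valid_json)                # {"language": "Eng", "ip_address": null, "response_time": 111, "shown": true}
With valid JSON stored in the column, JSON_VALUE and JSON_QUERY work without any REPLACE gymnastics.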

Normalize a json column that has column names passed as values in Python

Edit: sample json of details column:
{6591: '[]',
8112: "[{'name': 'start', 'time': 1659453223851}, {'name': 'arrival', 'time': 1659454209024, 'location': [-73.7895605, 40.6869539]}, {'name': 'departure', 'time': 1659453289013, 'location': [-73.8124575, 40.7091602]}]",
5674: '[]',
4236: '[]',
3148: "[{'name': 'start', 'time': 1659121571280}, {'name': 'arrival', 'time': 1659122768105, 'location': [-74.220351348, 40.748419051]}, {'name': 'departure', 'time': 1659121605076, 'location': [-74.189452444, 40.715865856]}]",
3408: "[{'name': 'start', 'time': 1659113772531}, {'name': 'arrival', 'time': 1659114170204, 'location': [-73.9469142, 40.671488]}, {'name': 'departure', 'time': 1659113832693, 'location': [-73.956379, 40.6669802]}]",
1438: '[]',
3634: '[]',
5060: "[{'name': 'start', 'time': 1659190337964}, {'name': 'arrival', 'time': 1659190367182, 'location': [-76.614058283, 39.292697049]}, {'name': 'departure', 'time': 1659190345722, 'location': [-76.614058283, 39.292697049]}]",
6614: '[]',
7313: '[]',
7653: '[]',
9446: '[]',
1237: '[]',
6974: "[{'name': 'start', 'time': 1659383554887}, {'name': 'adminCompletion', 'time': 1659386192031, 'data': {'adminId': 'ZFQCAL6aeS', 'sendNotificationFromAdminComplete': False}}, {'name': 'arrival', 'time': 1659385764198, 'location': [-73.943001009, 40.705886527]}, {'name': 'departure', 'time': 1659383653199, 'location': [-73.94038015, 40.814893186]}]",
762: '[]',
4843: '[]',
8682: '[]',
7271: '[]',
4672: "[{'name': 'start', 'time': 1659131562088}, {'name': 'arrival', 'time': 1659131937387, 'location': [-87.62621, 41.9015626]}, {'name': 'departure', 'time': 1659131637316, 'location': [-87.6263294, 41.9094856]}]"}
I have a dataframe with columns like 'details' and 'id'; it looks like this. I want to completely flatten the details column.
details id
[{'name': 'start', 'time': 1659479418}, {'name': 'arrival', 'time': 1659452651073, 'location': [-75.040536278, 40.034055]}, {'name': 'departure', 'time': 1659451650, 'location': [-75.1609003, 39.947729034]}] 1
[] 2
[] 3
[{'name': 'start', 'time': 1659126581459}, {'name': 'arrival', 'time': 1659128206850, 'location': [-80.3165751, 25.8625698]}, {'name': 'departure', 'time': 1659126641679, 'location': [-80.2511886, 25.921769]}] 4
[{'name': 'start', 'time': 1659120813100}, {'name': 'arrival', 'time': 1659121980125, 'location': [-76.642292, 39.307895253]}, {'name': 'departure', 'time': 1659120903093, 'location': [-76.741190426, 39.34240617]}] 5
[] 6
[] 7
[{'name': 'start', 'time': 1659217203753}, {'name': 'adminCompletion', 'time': 1659217336224, 'data': {'adminId': '~R~WZt7bKO979BRTqHyarS2p', 'sendNotification': False}}, {'name': 'arrival', 'time': 1659217308939, 'location': [-73.941830752, 40.702405857]}, {'name': 'departure', 'time': 1659217288936, 'location': [-73.941830752, 40.702405857]}] 8
[{'name': 'start', 'time': 1659189824814}, {'name': 'arrival', 'time': 1659191937100, 'location': [-76.406627, 39.984]}, {'name': 'departure', 'time': 1659189915191, 'location': [-76.614515552, 39.292407218]}] 9
[] 10
what is expected from this is:
start_time admincompletiontime adminId sendnotification arrival_time arrival_location departure_time departure_location id
1659479418 1.65945E+12 [-75.040536278, 40.034055] 1659451650 [-75.1609003, 39.947729034] 1
2
3
1.65913E+12 1.65913E+12 [-80.3165751, 25.8625698] 1.65913E+12 [-80.2511886, 25.921769] 4
1.65922E+12 1.65922E+12 ~R~WZt7bKO979BRTqHyarS2p FALSE 1.65922E+12 [-73.941830752, 40.702405857] 1.65922E+12 [-73.941830752, 40.702405857] 8
I want to extract all the columns that are passed as values. pd.json_normalize() did not work for me in this case. Please suggest.
Your data is pretty scuffed and needs cleaning up, but following a pattern like this should start you in the right direction:
import pandas as pd
from ast import literal_eval

# Each value is a Python-repr string ('[]' or "[{...}]"), so parse it first.
data = {key: literal_eval(value) for key, value in data.items()}
# Re-key each event by its 'name' so json_normalize yields start.*, arrival.*, ... columns.
data = [[{y['name']: {'time': y['time'], 'location': y.get('location')}} for y in x] for x in data.values() if x]
df = pd.concat([pd.json_normalize(x) for x in data])
df = (df.dropna(how='all', axis=1)
        .bfill()
        .dropna()
        .drop_duplicates('start.time')
        .reset_index(drop=True))
print(df)
Output:
start.time arrival.time arrival.location departure.time departure.location adminCompletion.time
0 1.659453e+12 1.659454e+12 [-73.7895605, 40.6869539] 1.659453e+12 [-73.8124575, 40.7091602] 1.659386e+12
1 1.659122e+12 1.659454e+12 [-73.7895605, 40.6869539] 1.659453e+12 [-73.8124575, 40.7091602] 1.659386e+12
2 1.659114e+12 1.659123e+12 [-74.220351348, 40.748419051] 1.659122e+12 [-74.189452444, 40.715865856] 1.659386e+12
3 1.659190e+12 1.659114e+12 [-73.9469142, 40.671488] 1.659114e+12 [-73.956379, 40.6669802] 1.659386e+12
4 1.659384e+12 1.659190e+12 [-76.614058283, 39.292697049] 1.659190e+12 [-76.614058283, 39.292697049] 1.659386e+12
5 1.659132e+12 1.659386e+12 [-73.943001009, 40.705886527] 1.659384e+12 [-73.94038015, 40.814893186] 1.659386e+12
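If you also need to keep the original ids next to the flattened events (as in your expected output), a variation along these lines might be closer. This is just a sketch, assuming data is the id-to-string dict shown in your edit:
import pandas as pd
from ast import literal_eval

rows = []
for key, value in data.items():
    flat = {'id': key}
    for event in literal_eval(value):                 # '[]' parses to an empty list
        name = event['name']                          # start / arrival / departure / adminCompletion
        flat[f'{name}.time'] = event.get('time')
        if 'location' in event:
            flat[f'{name}.location'] = event['location']
        for k, v in event.get('data', {}).items():    # adminCompletion carries a nested 'data' dict
            flat[f'{name}.{k}'] = v
    rows.append(flat)

df = pd.DataFrame(rows)                               # one row per id, NaN where an event is missing
print(df)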

in Python 3, how can I slice JSON data where objects all start with the same name?

I have a JSON string that returns device info, and if devices are found, they will be listed as device0, device1, device2, etc. In the simple code below, how can I discover all devices found in the JSON and then print the info below for each device? I currently look up each device statically, and I want this discovery to be dynamic, printing the results for each one found.
r1 = requests.get(url=url_api, params=PARAMS)
devicedata = r1.json()
if 'device0' in devicedata:
    print('')
    device0Name = devicedata['device0']['device_name']
    print(device0Name)
    print('Temp: {}'.format(devicedata['device0']['obs'][0]['ambient_temp']))
    print('Probe Temp: {}'.format(devicedata['device0']['obs'][0]['probe_temp']))
    print('Humidity: {}%'.format(devicedata['device0']['obs'][0]['humidity']))
    print('')
# JSON info looks like this...
{'device0': {'success': True, 'device_type': 'TX60', 'obs': [{'device_id': '1111', 'device_type': 'TX60', 'u_timestamp': '1580361017', 'ambient_temp': '45.7', 'probe_temp': '45.5', 'humidity': '82', 'linkquality': '100', 'lowbattery': '0', 'success': '9', 's_interval': '99', 'timestamp': '1/29/2020 11:10 PM', 'utctime': 1580361017}], 'alerts': {'miss': {'id': '520831', 'alert_type': 'miss', 's_id': '1111', 'max': '-100', 'min': '30', 'wet': '0', 'alert_id': '1', 'phone': 'yes', 'email': '', 'state': None}, 'batt': {'id': '520832', 'alert_type': 'batt', 's_id': '1111', 'max': '-100', 'min': '-100', 'wet': '0', 'alert_id': '1', 'phone': 'yes', 'email': '', 'state': None}}, 'ispws': 0, 'unit': {'temp': '°F', 'temp2': '°F', 'rh': '%'}, 'device_id': '1111', 'expired': '0', 'interval': '30', 'reg_date': '2020-01-17 22:06:48', 'create_date': 1579298808, 'device_name': 'Back Yard', 'assocGateway': '1', 'problem': False}, 'device1': {'success': True, 'device_type': 'TX60', 'obs': [{'device_id': '2222', 'device_type': 'TX60', 'u_timestamp': '1580360303', 'ambient_temp': '63.6', 'probe_temp': 'N/C', 'humidity': '64', 'linkquality': '100', 'lowbattery': '0', 'success': '9', 's_interval': '99', 'timestamp': '1/29/2020 10:58 PM', 'utctime': 1580360303}], 'alerts': {'miss': {'id': '520220', 'alert_type': 'miss', 's_id': '2222', 'max': '-100', 'min': '30', 'wet': '0', 'alert_id': '1', 'phone': 'yes', 'email': '', 'state': None}, 'batt': {'id': '520221', 'alert_type': 'batt', 's_id': '2222', 'max': '-100', 'min': '-100', 'wet': '0', 'alert_id': '1', 'phone': 'yes', 'email': '', 'state': None}}, 'ispws': 0, 'unit': {'temp': '°F', 'temp2': '°F', 'rh': '%'}, 'device_id': '3333', 'expired': '1', 'interval': '30', 'reg_date': '2016-03-19 01:45:04', 'create_date': 1500868369, 'device_name': 'Crawl Space', 'assocGateway': '1', 'problem': False}, 'device2': {'success': True, 'device_type': 'TX60', 'obs': [{'device_id': '3333', 'device_type': 'TX60', 'u_timestamp': '1580360195', 'ambient_temp': '70.2', 'probe_temp': 'N/C', 'humidity': '48', 'linkquality': '100', 'lowbattery': '0', 'success': '9', 's_interval': '99', 'timestamp': '1/29/2020 10:56 PM', 'utctime': 1580360195}], 'alerts': None, 'ispws': 0, 'unit': {'temp': '°F', 'temp2': '°F', 'rh': '%'}, 'device_id': '3333', 'expired': '0', 'interval': '15', 'reg_date': '2020-01-30 04:34:00', 'create_date': 1580358840, 'device_name': 'Basement', 'assocGateway': '2', 'problem': False}, 'tz': 'America/Chicago'}
The output for a single device looks like this..
Back Yard
Temp: 50.9
Probe Temp: 51.2
Humidity: 92%
Crawl Space
Temp: 65.4
Probe Temp: N/C
Humidity: 55%
Basement
Temp: 70
Probe Temp: N/C
Humidity: 48%
Found it.
for devKey in devicedata.keys():
    if "device" in devKey:
        dev = devicedata[devKey]
        name = dev["device_name"]
        obs = dev["obs"][0]
        temp = obs["ambient_temp"]
        probeTemp = obs["probe_temp"]
        humidity = obs["humidity"]
        print(name)
        print('Temp: {}'.format(temp))
        print('Probe Temp: {}'.format(probeTemp))
        print('Humidity: {}%'.format(humidity))
        print('')
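A slightly stricter variant of the same loop, just as a sketch: match only keys that actually start with "device" and visit them in numeric order, assuming every such key ends in a number as in the sample above.
device_keys = sorted(
    (k for k in devicedata if k.startswith('device')),
    key=lambda k: int(k[len('device'):])   # device0, device1, ... in numeric order
)
for devKey in device_keys:
    dev = devicedata[devKey]
    obs = dev['obs'][0]
    print(dev['device_name'])
    print('Temp: {}'.format(obs['ambient_temp']))
    print('Probe Temp: {}'.format(obs['probe_temp']))
    print('Humidity: {}%'.format(obs['humidity']))
    print('')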

Python list to multi-level json

I'm a Python beginner and I have a list that needs to be converted to JSON format.
I hope to get some help.
raw data:
result = [('A', 'a1', '1'),
          ('A', 'a2', '2'),
          ('B', 'b1', '1'),
          ('B', 'b2', '2')]
The result I want:
[{'type':'A',
'data':[{'name':'a1','url':'1'},
{'name':'a2','url':'2'}]
},
{'type': 'B',
'data': [{'name':'b1', 'url': '1'},
{'name':'b2','url':'2'}]
}]
('A', 'a1', '1') is an example of a tuple, and tuples are iterable.
result = [('A', 'a1', '1'),
          ('A', 'a2', '2'),
          ('B', 'b1', '1'),
          ('B', 'b2', '2')]

type_list = []
for tup in result:
    if len(type_list) == 0:
        type_list.append({'type': tup[0], 'data': [{'name': tup[1], 'url': tup[2]}]})
    else:
        found = False
        for type_obj in type_list:
            if type_obj['type'] == tup[0]:
                type_obj['data'].append({'name': tup[1], 'url': tup[2]})
                found = True
                break
        if not found:
            type_list.append({'type': tup[0], 'data': [{'name': tup[1], 'url': tup[2]}]})

print(type_list)
which prints:
[{'type': 'A', 'data': [{'name': 'a1', 'url': '1'}, {'name': 'a2', 'url': '2'}]}, {'type': 'B', 'data': [{'name': 'b1', 'url': '1'}, {'name': 'b2', 'url': '2'}]}]
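For reference, the same structure can be built more compactly with a plain dict keyed by the type; this is a sketch of the same idea, not the only way to do it:
grouped = {}
for type_, name, url in result:
    grouped.setdefault(type_, []).append({'name': name, 'url': url})

type_list = [{'type': t, 'data': d} for t, d in grouped.items()]
print(type_list)
Since Python 3.7 dicts preserve insertion order, so 'A' and 'B' come out in the order they first appear in result.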

datetime keyerror in json API data

Using Python 3.5, I'm trying to return data from the Todoist REST API, which is in JSON format.
[{'id': 2577166691, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577166691', 'completed': False, 'order': 2, 'content': 'soon', 'priority': 1, 'comment_count': 0, 'due': {'recurring': False, 'date': '2018-04-01', 'timezone': 'UTC+10:00', 'datetime': '2018-04-01T10:00:00Z', 'string': 'Mar 31 2019'}, 'indent': 1}, {'id': 2577166849, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577166849', 'completed': False, 'order': 3, 'content': 'To City +1', 'priority': 1, 'comment_count': 0, 'due': {'recurring': False, 'date': '2018-03-31', 'string': 'Mar 31'}, 'indent': 1}, {'id': 2577225965, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577225965', 'completed': False, 'order': 4, 'content': 'To City +2', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577974095, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577974095', 'completed': False, 'order': 5, 'content': 'To City +3', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577974970, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577974970', 'completed': False, 'order': 6, 'content': 'Next train from City', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577975012, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577975012', 'completed': False, 'order': 7, 'content': 'From City +1', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577975101, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577975101', 'completed': False, 'order': 8, 'content': 'From City +2', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577975145, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577975145', 'completed': False, 'order': 9, 'content': 'From City +3', 'priority': 1, 'comment_count': 0, 'indent': 1}]
I can correctly obtain data for all items, e.g.
print(json_tasks[0]['id'])
2577166691
And it also works for
print(json_tasks[0]['due']['recurring'])
False
print(json_tasks[0]['due']['date'])
2018-04-01
But:
print(json_tasks[0]['due']['datetime'])
KeyError: 'datetime'
I have tried a number of things but I'm stumped. What am I doing wrong? How can I get it to recognise 'datetime' as a key?
The code below, when I ran it, printed out 2018-04-01T10:00:00Z.
json_tasks = [{'id': 2577166691, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577166691', 'completed': False, 'order': 2, 'content': 'soon', 'priority': 1, 'comment_count': 0, 'due': {'recurring': False, 'date': '2018-04-01', 'timezone': 'UTC+10:00', 'datetime': '2018-04-01T10:00:00Z', 'string': 'Mar 31 2019'}, 'indent': 1}, {'id': 2577166849, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577166849', 'completed': False, 'order': 3, 'content': 'To City +1', 'priority': 1, 'comment_count': 0, 'due': {'recurring': False, 'date': '2018-03-31', 'string': 'Mar 31'}, 'indent': 1}, {'id': 2577225965, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577225965', 'completed': False, 'order': 4, 'content': 'To City +2', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577974095, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577974095', 'completed': False, 'order': 5, 'content': 'To City +3', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577974970, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577974970', 'completed': False, 'order': 6, 'content': 'Next train from City', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577975012, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577975012', 'completed': False, 'order': 7, 'content': 'From City +1', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577975101, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577975101', 'completed': False, 'order': 8, 'content': 'From City +2', 'priority': 1, 'comment_count': 0, 'indent': 1}, {'id': 2577975145, 'project_id': 2181643136, 'url': 'https://todoist.com/showTask?id=2577975145', 'completed': False, 'order': 9, 'content': 'From City +3', 'priority': 1, 'comment_count': 0, 'indent': 1}]
print(json_tasks[0]['due']['datetime'])
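If the KeyError shows up when you loop over all the tasks, it is because the key only exists for some of them: the second task's 'due' has no 'datetime', and later tasks have no 'due' at all. A hedged lookup with .get() avoids it, for example:
for task in json_tasks:
    due = task.get('due') or {}                      # some tasks have no 'due' at all
    print(task['id'], due.get('datetime', 'no datetime'))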

import csv to json/dict with additional keywords/organization

I'm trying to convert a .csv to JSON/dict. The data in its current form looks like this:
cat1,cat2,cat3,name
1,2,3,a
4,5,6,b
7,8,9,c
I'm currently using something like this (I'm also importing it into a pandas DataFrame because it will be used for graphing from the JSON file):
import csv

with open('Data.csv') as f:
    reader = csv.DictReader(f)
    rows = list(reader)
print(rows)
[{'cat1': '1', 'name': 'a', 'cat3': '3', 'cat2': '2'},
{'cat1': '4', 'name': 'b', 'cat3': '6', 'cat2': '5'},
{'cat1': '7', 'name': 'c', 'cat3': '9', 'cat2': '8'}]
and I want it to look like this in json/dict format:
{"data: [{"all_cats": {"cat1": 1}, {"cat2": 2}, {"cat3": 3}}, "name": a},
{"all_cats": {"cat1": 4}, {"cat2": 5}, {"cat3": 6}}, "name": b},
{"all_cats": {"cat1": 7}, {"cat2": 8}, {"cat3": 8}}, "name": c}]}
Importing directly doesn't allow me to include: 'cat1', 'cat2', 'cat3' under 'all_cats' and keep 'name' separate.
Any help would be appreciated.
If your file is space separated rather than comma separated, you have to add delimiter=" ". Additionally, if some of your rows have whitespace at the start, you also have to add skipinitialspace=True.
reader = csv.DictReader(f, delimiter=" ", skipinitialspace=True)
rows = list(dict(row) for row in reader)
Thus if you now do:
for row in rows:
    print(row)
The output will be:
{'cat1': '1', 'cat2': '2', 'cat3': '3', 'name': 'a'}
{'cat1': '4', 'cat2': '5', 'cat3': '6', 'name': 'b'}
{'cat1': '7', 'cat2': '8', 'cat3': '9', 'name': 'c'}
As already mentioned in the other answer, what you specified as the desired result is not valid JSON. You can check whether a string contains valid JSON using the json.loads() function:
import json
jsonDATAstring_1 = """
{"data: [{"all_cats": {"cat1": 1}, {"cat2": 2}, {"cat3": 3}}, "name": a},
{"all_cats": {"cat1": 4}, {"cat2": 5}, {"cat3": 6}}, "name": b},
{"all_cats": {"cat1": 7}, {"cat2": 8}, {"cat3": 8}}, "name": c}]}
"""
json.loads(jsonDATAstring_1)
which, for the expected JSON format you specified, results in:
json.decoder.JSONDecodeError: Expecting ':' delimiter: line 2 column 12 (char 12)
From your question, I assume that the JSON string you actually want is the following one:
jsonDATAstring_2 = """
{"data": [{"all_cats": {"cat1": 1, "cat2": 2, "cat3": 3}, "name": "a"},
{"all_cats": {"cat1": 4, "cat2": 5, "cat3": 6}, "name": "b"},
{"all_cats": {"cat1": 7, "cat2": 8, "cat3": 8}, "name": "c"}]}
"""
json.loads(jsonDATAstring_2)
This second string loads OK, so assuming:
rows = [{'cat1': '1', 'name': 'a', 'cat3': '3', 'cat2': '2'},
        {'cat1': '4', 'name': 'b', 'cat3': '6', 'cat2': '5'},
        {'cat1': '7', 'name': 'c', 'cat3': '9', 'cat2': '8'}]
you can get what you want as follows:
dctData = {"data": []}
lstCats = ['cat1', 'cat2', 'cat3']
for row in rows:
dctAllCats = {"all_cats":{}, "name":"?"}
for cat in lstCats:
dctAllCats["all_cats"][cat] = row[cat]
dctAllCats["name"] = row["name"]
dctData["data"].append(dctAllCats)
import pprint
pp = pprint.PrettyPrinter()
pp.pprint(dctData)
which gives:
{'data': [{'all_cats': {'cat1': '1', 'cat2': '2', 'cat3': '3'}, 'name': 'a'},
{'all_cats': {'cat1': '4', 'cat2': '5', 'cat3': '6'}, 'name': 'b'},
{'all_cats': {'cat1': '7', 'cat2': '8', 'cat3': '9'}, 'name': 'c'}]}
Now it is possible to serialize the Python dictionary object to a JSON string (or a file):
jsonString = json.dumps(dctData)
print(jsonString)
which gives:
{"data": [{"all_cats": {"cat1": "1", "cat2": "2", "cat3": "3"}, "name": "a"}, {"all_cats": {"cat1": "4", "cat2": "5", "cat3": "6"}, "name": "b"}, {"all_cats": {"cat1": "7", "cat2": "8", "cat3": "9"}, "name": "c"}]}