I'm trying to convert a .csv to json/dict such that the data in its current form:
cat1,cat2,cat3,name
1,2,3,a
4,5,6,b
7,8,9,c
I'm currently using something like this (I also load it into a pandas DataFrame, because it will be used for graphing from the JSON file):
import csv

with open('Data.csv') as f:
    reader = csv.DictReader(f)
    rows = list(reader)
print(rows)
[{'cat1': '1', 'name': 'a', 'cat3': '3', 'cat2': '2'},
{'cat1': '4', 'name': 'b', 'cat3': '6', 'cat2': '5'},
{'cat1': '7', 'name': 'c', 'cat3': '9', 'cat2': '8'}]
and I want it to look like this in json/dict format:
{"data: [{"all_cats": {"cat1": 1}, {"cat2": 2}, {"cat3": 3}}, "name": a},
{"all_cats": {"cat1": 4}, {"cat2": 5}, {"cat3": 6}}, "name": b},
{"all_cats": {"cat1": 7}, {"cat2": 8}, {"cat3": 8}}, "name": c}]}
Importing directly doesn't allow me to include: 'cat1', 'cat2', 'cat3' under 'all_cats' and keep 'name' separate.
Any help would be appreciated.
Your sample data is comma separated, so csv.DictReader's default delimiter already handles it. If your actual file were space separated you would have to add delimiter=" ", and if some rows had leading whitespace you would also need skipinitialspace=True:
reader = csv.DictReader(f)  # add delimiter=" ", skipinitialspace=True only if your file needs them
rows = [dict(row) for row in reader]
Thus if you now do:
for row in rows:
    print(row)
The output will be:
{'cat1': '1', 'cat2': '2', 'cat3': '3', 'name': 'a'}
{'cat1': '4', 'cat2': '5', 'cat3': '6', 'name': 'b'}
{'cat1': '7', 'cat2': '8', 'cat3': '9', 'name': 'c'}
As already mentioned in the other answer, what you specified is not valid JSON. You can check whether a string is valid JSON with the json.loads(jsonDATAstring) function:
import json
jsonDATAstring_1 = """
{"data: [{"all_cats": {"cat1": 1}, {"cat2": 2}, {"cat3": 3}}, "name": a},
{"all_cats": {"cat1": 4}, {"cat2": 5}, {"cat3": 6}}, "name": b},
{"all_cats": {"cat1": 7}, {"cat2": 8}, {"cat3": 8}}, "name": c}]}
"""
json.loads(jsonDATAstring_1)
which, for the JSON format you specified, results in:
json.decoder.JSONDecodeError: Expecting ':' delimiter: line 2 column 12 (char 12)
From your question, I assume the JSON string you actually want is the following:
jsonDATAstring_2 = """
{"data": [{"all_cats": {"cat1": 1, "cat2": 2, "cat3": 3}, "name": "a"},
{"all_cats": {"cat1": 4, "cat2": 5, "cat3": 6}, "name": "b"},
{"all_cats": {"cat1": 7, "cat2": 8, "cat3": 9}, "name": "c"}]}
"""
json.loads(jsonDATAstring_2)
This second string loads OK, so assuming:
rows = [{'cat1': '1', 'name': 'a', 'cat3': '3', 'cat2': '2'},
{'cat1': '4', 'name': 'b', 'cat3': '6', 'cat2': '5'},
{'cat1': '7', 'name': 'c', 'cat3': '9', 'cat2': '8'}]
you can get what you want as follows:
import pprint

dctData = {"data": []}
lstCats = ['cat1', 'cat2', 'cat3']
for row in rows:
    dctAllCats = {"all_cats": {}, "name": "?"}
    for cat in lstCats:
        dctAllCats["all_cats"][cat] = row[cat]
    dctAllCats["name"] = row["name"]
    dctData["data"].append(dctAllCats)

pp = pprint.PrettyPrinter()
pp.pprint(dctData)
which gives:
{'data': [{'all_cats': {'cat1': '1', 'cat2': '2', 'cat3': '3'}, 'name': 'a'},
{'all_cats': {'cat1': '4', 'cat2': '5', 'cat3': '6'}, 'name': 'b'},
{'all_cats': {'cat1': '7', 'cat2': '8', 'cat3': '9'}, 'name': 'c'}]}
Now it is possible to serialize the Python dictionary object to JSON string (or file):
jsonString = json.dumps(dctData)
print(jsonString)
which gives:
{"data": [{"all_cats": {"cat1": "1", "cat2": "2", "cat3": "3"}, "name": "a"}, {"all_cats": {"cat1": "4", "cat2": "5", "cat3": "6"}, "name": "b"}, {"all_cats": {"cat1": "7", "cat2": "8", "cat3": "9"}, "name": "c"}]}
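The same transformation can also be sketched more compactly with a comprehension. Here the category values are additionally converted to int, since your target format shows them as numbers (assuming the rows list from above):

```python
import json

rows = [{'cat1': '1', 'name': 'a', 'cat3': '3', 'cat2': '2'},
        {'cat1': '4', 'name': 'b', 'cat3': '6', 'cat2': '5'},
        {'cat1': '7', 'name': 'c', 'cat3': '9', 'cat2': '8'}]

# Every key except "name" goes under "all_cats"; values become ints
dctData = {"data": [
    {"all_cats": {k: int(v) for k, v in row.items() if k != "name"},
     "name": row["name"]}
    for row in rows
]}
print(json.dumps(dctData))
```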
Related
I have a JSON string stored in a field in BigQuery which has this structure:
{'language': 'Eng', 'date_started': '2021-02-08 16: 56: 55 GMT', 'link_id': '111',
 'url_variables': {'touchpoint': {'key': 'touchpoint', 'value': 'phone', 'type': 'url'},
                   'interaction_id': {'key': 'interaction_id', 'value': '111', 'type': 'url'}},
 'ip_address': None, 'referer': '', 'user_agent': None, 'response_time': 111,
 'data_quality': [], 'longitude': '', 'latitude': '', 'country': '', 'city': '',
 'region': '', 'postal': '', 'dma': '',
 'survey_data': {'25': {'id': 25, 'type': 'TEXTBOX', 'question': 'feedback_source', 'section_id': 1, 'shown': False},
                 '229': {'id': 229, 'type': 'TEXTBOX', 'question': 'recruitment_method', 'section_id': 1, 'shown': False},
                 '227': {'id': 227, 'type': 'TEXTBOX', 'question': 'meeting_point', 'section_id': 1, 'answer': 'phone', 'shown': True},
                 '221': {'id': 221, 'type': 'TEXTBOX', 'question': 'interaction_id', 'section_id': 1, 'answer': '222', 'shown': True},
                 '217': {'id': 217, 'type': 'TEXTBOX', 'question': 'session_id', 'section_id': 1, 'answer': '333', 'shown': True},
                 '231': {'id': 231, 'type': 'ESSAY', 'question': 'BlaBla question 4', 'section_id': 3, 'answer': 'Bla Bla answer', 'shown': True},
                 '255': {'id': 255, 'type': 'TEXTBOX', 'question': 'tz_offset', 'section_id': 3, 'answer': '-120', 'shown': True},
                 '77': {'id': 77, 'type': 'parent', 'question': 'Bla Bla 1', 'section_id': 35, 'options': {'10395': {'id': 10395, 'option': 'Neutraal', 'answer': '3'}}, 'shown': True},
                 '250': {'id': 250, 'type': 'RADIO', 'question': 'Bla Bla?', 'section_id': 66, 'original_answer': '1', 'answer': '1', 'answer_id': 10860, 'shown': True},
                 '251': {'id': 251, 'type': 'RADIO', 'question': 'Bla Bla', 'section_id': 66, 'original_answer': '0', 'answer': '0', 'answer_id': 10863, 'shown': True}}}
I'm able to extract some of the values with the query below, but I cannot extract response_time or any of the values inside the survey_data structure.
They always come out as null.
DECLARE resp STRING
DEFAULT "{'id': '111', 'contact_id': '', 'status': 'Complete', 'is_test_data': '0', 'date_submitted': '2021-07-08 17: 02: 16 GMT', 'session_id': '111', 'language': 'Eng', 'date_started': '2021-02-08 16: 56: 55 GMT', 'link_id': '111', 'url_variables': {'touchpoint': {'key': 'touchpoint', 'value': 'phone', 'type': 'url' }, 'interaction_id': {'key': 'interaction_id', 'value': '111', 'type': 'url' } }, 'ip_address': None, 'referer': '', 'user_agent': None, 'response_time': 111, 'data_quality': [], 'longitude': '', 'latitude': '', 'country': '', 'city': '', 'region': '', 'postal': '', 'dma': '', 'survey_data': {'25': {'id': 25, 'type': 'TEXTBOX', 'question': 'feedback_source', 'section_id': 1, 'shown': False }, '229': {'id': 229, 'type': 'TEXTBOX', 'question': 'recruitment_method', 'section_id': 1, 'shown': False }, '227': {'id': 227, 'type': 'TEXTBOX', 'question': 'meeting_point', 'section_id': 1, 'answer': 'phone', 'shown': True }, '221': {'id': 221, 'type': 'TEXTBOX', 'question': 'interaction_id', 'section_id': 1, 'answer': '222', 'shown': True }, '217': {'id': 217, 'type': 'TEXTBOX', 'question': 'session_id', 'section_id': 1, 'answer': '333', 'shown': True }, '231': {'id': 231, 'type': 'ESSAY', 'question': 'BlaBla question 4', 'section_id': 3, 'answer': 'Bla Bla answer', 'shown': True }, '255': {'id': 255, 'type': 'TEXTBOX', 'question': 'tz_offset', 'section_id': 3, 'answer': '-120', 'shown': True }, '77': {'id': 77, 'type': 'parent', 'question': 'Bla Bla 1', 'section_id': 35, 'options': {'10395': {'id': 10395, 'option': 'Neutraal', 'answer': '3' } }, 'shown': True }, '250': {'id': 250, 'type': 'RADIO', 'question': 'Bla Bla?', 'section_id': 66, 'original_answer': '1', 'answer': '1', 'answer_id': 10860, 'shown': True }, '251': {'id': 251, 'type': 'RADIO', 'question': 'Bla Bla', 'section_id': 66, 'original_answer': '0', 'answer': '0', 'answer_id': 10863, 'shown': True } } }";
SELECT
JSON_VALUE( resp, '$.url_variables.interaction_id.value') as url_interaction_id_value ,
JSON_VALUE( resp, '$.url_variables.interaction_id.type') as url_interaction_id_type,
JSON_VALUE( resp, '$.language') as language,
JSON_QUERY( resp, '$.response_time') as response_time, -- NOT WORKING
JSON_QUERY( resp, '$.survey_data') as survey_data -- NOT WORKING
I tried jq in bash from the CLI, and it complains that some of the None values are not quoted.
Question:
Does this mean that BigQuery extracts values from the JSON string as far as it can, until it encounters something that is not well formed (e.g. the unquoted None values), and from that point on can no longer parse and returns nulls?
NB: In another app, I have been able to parse the json file in Python and extract values from inside the json string.
It looks like you have a few formatting issues in your resp field, which you can fix with a few REPLACEs, as in the example below:
SELECT
JSON_VALUE( resp, '$.url_variables.interaction_id.value') as url_interaction_id_value ,
JSON_VALUE( resp, '$.url_variables.interaction_id.type') as url_interaction_id_type,
JSON_VALUE( resp, '$.language') as language,
JSON_QUERY( resp, '$.response_time') as response_time, -- WORKING NOW
JSON_QUERY( resp, '$.survey_data') as survey_data -- WORKING NOW
FROM (
SELECT REPLACE(REPLACE(REPLACE(resp, "None,", "'None',"), "True", "true"), "False", "false") as resp
FROM `project.dataset.table`
)
If applied to the sample data in your question, it now gets you everything you need.
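As the NB in the question hints, the stored string is not JSON at all but a Python literal (single quotes, unquoted None, True/False), which is why Python can parse it while jq and BigQuery cannot. A minimal sketch using the standard library's ast.literal_eval, shown here on a small hypothetical excerpt of the stored string:

```python
import ast

# A small excerpt of the stored value: a Python literal, not JSON
resp = ("{'language': 'Eng', 'ip_address': None, 'response_time': 111, "
        "'survey_data': {'25': {'id': 25, 'shown': False}}}")

# literal_eval safely evaluates Python literal syntax (no arbitrary code)
parsed = ast.literal_eval(resp)
print(parsed['response_time'])
print(parsed['survey_data']['25']['shown'])
```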
I have a JSON file that I'm trying to parse. My script has a few print statements to check whether I'm parsing the various fields correctly, and inside the for loop both ids and ip_metadata[ip] look right. However, when I print the entire dictionary at the end, every entry's ids list contains the values from the last record. Only ids has this problem; the other fields are stored correctly. I can't see why ids isn't stored correctly.
JSON file:
{
"count": 4,
"data": [
{
"ip": "1.2.3.4",
"cty": {
"country": "US",
"organization": "ABC"
},
"info": {
"p": [
{
"id": 123,
"grp": "A"
},
{
"id": 234,
"grp": "B"
},
{
"id": 345,
"grp": "C"
},
{
"id": 456,
"grp": "D"
}
]
}
},
{
"ip": "2.3.4.5",
"cty": {
"country": "US",
"organization": "ABC"
},
"info": {
"p": [
{
"id": 111,
"grp": "A"
},
{
"id": 222,
"grp": "B"
},
{
"id": 333,
"grp": "C"
},
{
"id": 444,
"grp": "D"
}
]
}
},
{
"ip": "1.2.3.1",
"cty": {
"country": "AU",
"organization": "ABC"
},
"info": {
"p": [
{
"id": 222,
"grp": "A"
},
{
"id": 333,
"grp": "B"
},
{
"id": 444,
"grp": "C"
}
]
}
},
{
"ip": "10.2.3.4",
"cty": {
"country": "US",
"organization": "DDD"
},
"info": {
"p": [
{
"id": 555,
"grp": "A"
},
{
"id": 666,
"grp": "B"
},
{
"id": 777,
"grp": "C"
},
{
"id": 888,
"grp": "D"
}
]
}
}
],
"status": "ok"
}
My python script is
import json
import glob
from collections import defaultdict

ip_metadata = defaultdict(list)

def main():
    for json_file in glob.glob("test_input/test_json.json"):
        with open(json_file, "r") as fin:
            ids = []
            json_data = json.load(fin)
            if json_data["count"] > 0:
                for data in json_data['data']:
                    ip = data['ip']
                    country = data['cty']['country']
                    organization = data['cty']['organization']
                    ids[:] = []
                    ids_2 = data['info']['p']
                    for idss in ids_2:
                        id = idss['id']
                        grp = idss['grp']
                        ids.append((id, grp))
                    print(ids)
                    ip_metadata[ip].append((country, organization, ids))
                    print(ip_metadata[ip])
    print("=============================")
    for k, v in ip_metadata.items():
        print(k, v)

if __name__ == '__main__':
    main()
The output is
[(123, 'A'), (234, 'B'), (345, 'C'), (456, 'D')]
[('US', 'ABC', [(123, 'A'), (234, 'B'), (345, 'C'), (456, 'D')])]
[(111, 'A'), (222, 'B'), (333, 'C'), (444, 'D')]
[('US', 'ABC', [(111, 'A'), (222, 'B'), (333, 'C'), (444, 'D')])]
[(222, 'A'), (333, 'B'), (444, 'C')]
[('AU', 'ABC', [(222, 'A'), (333, 'B'), (444, 'C')])]
[(555, 'A'), (666, 'B'), (777, 'C'), (888, 'D')]
[('US', 'DDD', [(555, 'A'), (666, 'B'), (777, 'C'), (888, 'D')])]
=============================
1.2.3.4 [('US', 'ABC', [(555, 'A'), (666, 'B'), (777, 'C'), (888, 'D')])]
2.3.4.5 [('US', 'ABC', [(555, 'A'), (666, 'B'), (777, 'C'), (888, 'D')])]
1.2.3.1 [('AU', 'ABC', [(555, 'A'), (666, 'B'), (777, 'C'), (888, 'D')])]
10.2.3.4 [('US', 'DDD', [(555, 'A'), (666, 'B'), (777, 'C'), (888, 'D')])]
I managed to resolve this by removing the ids = [] line under the open() line and replacing ids[:] = [] with ids = []. The reason this works: ids[:] = [] clears the same list object in place, and ip_metadata[ip].append((country, organization, ids)) stores a reference to that object, not a copy. Every stored tuple therefore points at the one shared list, which ends up holding whatever was written into it last. Rebinding with ids = [] creates a brand-new list on each iteration, so each stored reference points at its own list.
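The effect can be reproduced in a few lines, independent of the JSON parsing above: clearing a list in place with [:] mutates the object that every earlier reference still points to, while rebinding creates a fresh object each time:

```python
shared = []
stored = []
for n in (1, 2, 3):
    shared[:] = []          # clears the SAME list object in place
    shared.append(n)
    stored.append(shared)   # stores a reference, not a copy
print(stored)               # -> [[3], [3], [3]]  (all alias one list)

fixed = []
for n in (1, 2, 3):
    new_list = [n]          # a brand-new list on every iteration
    fixed.append(new_list)
print(fixed)                # -> [[1], [2], [3]]
```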
I'm a Python beginner, and I have a list that needs to be converted to JSON format.
I hope to get some help.
raw data:
result = [('A', 'a1', '1'),
('A', 'a2', '2'),
('B', 'b1', '1'),
('B', 'b2', '2')]
The result I want:
[{'type':'A',
'data':[{'name':'a1','url':'1'},
{'name':'a2','url':'2'}]
},
{'type': 'B',
'data': [{'name':'b1', 'url': '1'},
{'name':'b2','url':'2'}]
}]
('A', 'a1', '1') is an example of a tuple, which is iterable.
result = [('A', 'a1', '1'),
('A', 'a2', '2'),
('B', 'b1', '1'),
('B', 'b2', '2')]
type_list = []
for tup in result:
    if len(type_list) == 0:
        type_list.append({'type': tup[0], 'data': [{'name': tup[1], 'url': tup[2]}]})
    else:
        found = False
        for type_obj in type_list:
            if type_obj['type'] == tup[0]:
                type_obj['data'].append({'name': tup[1], 'url': tup[2]})
                found = True
                break
        if not found:
            type_list.append({'type': tup[0], 'data': [{'name': tup[1], 'url': tup[2]}]})
print(type_list)
which prints:
[{'type': 'A', 'data': [{'name': 'a1', 'url': '1'}, {'name': 'a2', 'url': '2'}]}, {'type': 'B', 'data': [{'name': 'b1', 'url': '1'}, {'name': 'b2', 'url': '2'}]}]
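The same grouping can be sketched more compactly with a dict keyed by type, which avoids the nested search loop (assuming, as above, that the tuples may arrive in any order):

```python
result = [('A', 'a1', '1'),
          ('A', 'a2', '2'),
          ('B', 'b1', '1'),
          ('B', 'b2', '2')]

groups = {}
for typ, name, url in result:
    # setdefault creates an empty list the first time a type is seen
    groups.setdefault(typ, []).append({'name': name, 'url': url})

# dicts preserve insertion order, so types appear in first-seen order
type_list = [{'type': t, 'data': d} for t, d in groups.items()]
print(type_list)
```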
Suppose I have a table like this:
and an array of objects that contains:
[
{ID: '0', Name: 'Leo', Age: 21, CurrentState: 5},
{ID: '1', Name: 'George', Age: 26, CurrentState: 6},
{ID: '2', Name: 'Diana', Age: 27, CurrentState: 4}
]
How can I update the columns ID, Name, Age, and CurrentState without affecting the values of IsAdmin column?
If I run the following code
Table.bulkCreate([
  {ID: '0', Name: 'Leo', Age: 21, CurrentState: 5},
  {ID: '1', Name: 'George', Age: 26, CurrentState: 6},
  {ID: '2', Name: 'Diana', Age: 27, CurrentState: 4}
], {
  updateOnDuplicate: true
})
All the values of IsAdmin of the updated rows will be set to FALSE, which is its default value.
How can I avoid this problem?
Just pass an array containing only the attributes that need to be updated:
Table.bulkCreate([
  {ID: '0', Name: 'Leo', Age: 21, CurrentState: 5},
  {ID: '1', Name: 'George', Age: 26, CurrentState: 6},
  {ID: '2', Name: 'Diana', Age: 27, CurrentState: 4}
], {
  updateOnDuplicate: ['Name', 'CurrentState']
})
options.updateOnDuplicate
Array
optional
Fields to update if row key already exists (on duplicate key update)? (only supported by mysql). By default, all fields are updated.
Given the following two arrays of dictionaries, how can I merge them such that the resulting array of dictionaries contains only those dictionaries whose version is greatest?
data1 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Three', 'version': 2},
{'id': 4, 'name': 'Four', 'version': 1},
{'id': 5, 'name': 'Five', 'version': 1}]
data2 = [{'id': 1, 'name': 'One', 'version': 1},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Threeee', 'version': 3},
{'id': 6, 'name': 'Six', 'version': 2}]
The merged result should look like this:
data3 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Threeee', 'version': 3},
{'id': 4, 'name': 'Four', 'version': 1},
{'id': 5, 'name': 'Five', 'version': 1},
{'id': 6, 'name': 'Six', 'version': 2}]
How about the following:
// define data1 and data2
var data1 = new[]{
new {id = 1, name = "Oneeee", version = 2},
new {id = 2, name = "Two", version = 1},
new {id = 3, name = "Three", version = 2},
new {id = 4, name = "Four", version = 1},
new {id = 5, name ="Five", version = 1}
};
var data2 = new[] {
new {id = 1, name = "One", version = 1},
new {id = 2, name = "Two", version = 1},
new {id = 3, name = "Threeee", version = 3},
new {id = 6, name = "Six", version = 2}
};
//create a dictionary to handle lookups
var dict1 = data1.ToDictionary (k => k.id);
var dict2 = data2.ToDictionary (k => k.id);
// now query the data
var q = from k in dict1.Keys.Union(dict2.Keys)
select
dict1.ContainsKey(k) ?
(
dict2.ContainsKey(k) ?
(
dict1[k].version > dict2[k].version ? dict1[k] : dict2[k]
) :
dict1[k]
) :
dict2[k];
// convert enumerable back to array
var result = q.ToArray();
An alternative solution that is database friendly if data1 and data2 are tables:
var q = (
from d1 in data1
join d2 in data2 on d1.id equals d2.id into data2j
from d2j in data2j.DefaultIfEmpty()
where d2j == null || d1.version >= d2j.version
select d1
).Union(
from d2 in data2
join d1 in data1 on d2.id equals d1.id into data1j
from d1j in data1j.DefaultIfEmpty()
where d1j == null || d2.version > d1j.version
select d2
);
var result = q.ToArray();
Assume you have some class like (JSON.NET attributes used here):
public class Data
{
[JsonProperty("id")]
public int Id { get; set; }
[JsonProperty("name")]
public string Name { get; set; }
[JsonProperty("version")]
public int Version { get; set; }
}
You can parse your JSON into arrays of this Data type:
var str1 = @"[{'id': 1, 'name': 'Oneeee', 'version': 2},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Three', 'version': 2},
{'id': 4, 'name': 'Four', 'version': 1},
{'id': 5, 'name': 'Five', 'version': 1}]";
var str2 = @"[{'id': 1, 'name': 'One', 'version': 1},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Threeee', 'version': 3},
{'id': 6, 'name': 'Six', 'version': 2}]";
var data1 = JsonConvert.DeserializeObject<Data[]>(str1);
var data2 = JsonConvert.DeserializeObject<Data[]>(str2);
Then you can concat the two arrays, group the items by id, and select from each group the item with the highest version:
var data3 = data1.Concat(data2)
.GroupBy(d => d.Id)
.Select(g => g.OrderByDescending(d => d.Version).First())
.ToArray(); // or ToDictionary(d => d.Id)
Result (serialized back to JSON):
[
{ "id": 1, "name": "Oneeee", "version": 2 },
{ "id": 2, "name": "Two", "version": 1 },
{ "id": 3, "name": "Threeee", "version": 3 },
{ "id": 4, "name": "Four", "version": 1 },
{ "id": 5, "name": "Five", "version": 1 },
{ "id": 6, "name": "Six", "version": 2 }
]
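Since the question's data is written as Python dicts, here is the equivalent merge sketched in Python: keep, per id, the entry with the greater version, with data1 winning ties (which matches data3 above):

```python
data1 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
         {'id': 2, 'name': 'Two', 'version': 1},
         {'id': 3, 'name': 'Three', 'version': 2},
         {'id': 4, 'name': 'Four', 'version': 1},
         {'id': 5, 'name': 'Five', 'version': 1}]
data2 = [{'id': 1, 'name': 'One', 'version': 1},
         {'id': 2, 'name': 'Two', 'version': 1},
         {'id': 3, 'name': 'Threeee', 'version': 3},
         {'id': 6, 'name': 'Six', 'version': 2}]

merged = {}
for d in data1 + data2:
    kept = merged.get(d['id'])
    # keep the new entry only if nothing is stored yet or it is strictly newer
    if kept is None or d['version'] > kept['version']:
        merged[d['id']] = d

data3 = sorted(merged.values(), key=lambda d: d['id'])
print(data3)
```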