I am trying to merge two JSON objects stored in Python dictionaries. Here, the first dictionary is the parent into which the second dictionary gets merged. In reality, the second dictionary represents the second line of a TSV file that holds the next record of a JSON array.
My program reads the TSV file line by line and merges the lines into one single nested JSON object.
Let us consider the two dictionaries:
Parent dictionary dict1: {"CA": [{"Marin": [{"zip":1}], "population":10000}]}
and,
dict2: {"CA": {"Marin": {"zip":2}}}
Note: dict1 is the source-of-truth with regards to the correct json structure.
Here, as you can see, I would like to append the {"zip": 2} record to Marin county under the California state entry.
Here is my merge code:
class MyClass:
    # merges two lines containing json arrays inside nested json
    def merge_lines(dict1: dict, dict2: dict) -> dict:
        for key in dict1:
            if key in dict2:
                if isinstance(dict1[key], dict):
                    MyClass.merge_lines(dict1[key], dict2[key])
                elif isinstance(dict1[key], list):
                    to_be_merged_list = [dict2[key]]
                    dict1[key].extend(to_be_merged_list)
        return dict1
Below is how I am trying to test:
def test_nested_json_arrays(self):
    d1 = {"CA": [{"Marin": [{"zip":1}], "population":10000}]}
    d2 = {"CA": {"Marin": {"zip":2}}}
    expected_result = {"CA": [{"Marin": [{"zip":1}, {"zip":2}], "population":1000}]}
    actual = MyClass.merge_lines(d1, d2)
    assert expected_result == actual
However, I am getting the below result:
E AssertionError: assert {'CA': [{'Mar...tion': 1000}]} == {'CA': [{'Mari... {'zip': 2}}]}
E Differing items:
E {'CA': [{'Marin': [{'zip': 1}, {'zip': 2}], 'population': 1000}]} != {'CA': [{'Marin': [{'zip': 1}], 'population': 10000}, {'Marin': {'zip': 2}}]}
E Full diff:
E - {'CA': [{'Marin': [{'zip': 1}, {'zip': 2}], 'population': 1000}]}
E + {'CA': [{'Marin': [{'zip': 1}], 'population': 10000}, {'Marin': {'zip': 2}}]}
Can someone help me figure out the changes required in the code to fix this?
Note: the field names are not constant and this can apply to any combination of country, state, county, zip and other nested attributes.
dict1 = {"CA": [{"Marin": [{"zip":1}], "population":10000}]}
dict2 = {"CA": {"Marin": {"zip":2}}}
Looking at dict2:
You are given dict2, containing key k (in this case k = "CA")
dict2[k] is itself a dictionary that contains one (or more) key-value pair(s), with key c = "Marin" and value z.
z now is the dictionary that you care about.
Looking at dict1:
For each element county_info in dict1[k], you care about the one that has the key c.
The value at this key (county_info[c]) is a list, to which you want to append z.
So let's do that:
def merge_lines(dict1, dict2):
    for k, v in dict2.items():
        for c, z in v.items():
            # Find the element of dict1[k] that has the key c:
            for county_info in dict1[k]:
                if c in county_info:
                    county_info[c].append(z)
                    break
Since the function modifies dict1 in-place, running merge_lines(dict1, dict2) gives us a modified dict1 that looks like what you expect:
{'CA': [{'Marin': [{'zip': 1}, {'zip': 2}], 'population': 10000}]}
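If you want to apply this to the TSV file described in the question, here is a minimal usage sketch (it assumes a hypothetical records.tsv where the first line is the parent JSON in dict1's shape and every following line is an update in dict2's shape; adjust the parsing to your real layout):
import json

with open("records.tsv") as tsv:
    merged = json.loads(next(tsv))                 # first line becomes the parent (dict1's shape)
    for line in tsv:
        if line.strip():
            merge_lines(merged, json.loads(line))  # merge every later record in place

print(merged)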
This is a more generic approach, where d1 is updated with changes contained in d2.
Of course, you should provide more examples of updates from d2 to see if my solution works in all cases.
#!/usr/bin/env python3

d1 = {"CA": [{"Marin": [{"zip": 1}], "population": 10000}]}
d2 = {"CA": {"Marin": {"zip": 2}}}

# read update data from d2 and append new values to d1
def update(d1, d2):
    for key, value in d2.items():
        if key in d1:
            for key1, value1 in value.items():
                if key1 in d1[key][0]:
                    d1[key][0][key1].append(value1)
                else:
                    d1[key][0][key1] = [value1]
        else:
            d1[key] = [value]
    return d1

print(update(d1, d2))
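For the two example dictionaries, this prints the merged result:
{'CA': [{'Marin': [{'zip': 1}, {'zip': 2}], 'population': 10000}]}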
So I have the following dictionaries that I get by parsing a text file
keys = ["scientific name", "common names", "colors]
values = ["somename1", ["name11", "name12"], ["color11", "color12"]]
keys = ["scientific name", "common names", "colors]
values = ["somename2", ["name21", "name22"], ["color21", "color22"]]
and so on. I am dumping the key/value pairs to a JSON file as a dictionary, using a for loop where I go through each pair of keys and values one by one:
# for loop starts
d = dict(zip(keys, values))
with open("file.json", 'a') as j:
    json.dump(d, j)
If I open the saved json file I see the contents as
{"scientific name": "somename1", "common names": ["name11", "name12"], "colors": ["color11", "color12"]}{"scientific name": "somename2", "common names": ["name21", "name22"], "colors": ["color21", "color22"]}
Is this the right way to do it?
The purpose is to query the common name or colors for a given scientific name. So then I do
with open("file.json", "r") as j:
data = json.load(j)
I get the error, json.decoder.JSONDecodeError: Extra data:
I think this is because I am not dumping the dictionaries to JSON correctly in the for loop. I have to insert some square brackets programmatically. Just doing json.dump(d, j) won't suffice.
JSON may only have one root element. This root element can be [], {} or most other datatypes.
In your file, however, you get multiple root elements:
{...}{...}
This isn't valid JSON, and the error Extra data refers to the second {}, where valid JSON would end instead.
You can write multiple dicts to a JSON string, but you need to wrap them in an array:
[{...},{...}]
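A quick way to see the difference is to feed both forms to json.loads (a small illustrative check, not part of your code):
import json

json.loads('[{"a": 1}, {"b": 2}]')   # fine: one root element, an array of two objects
json.loads('{"a": 1}{"b": 2}')       # raises json.decoder.JSONDecodeError: Extra data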
But now on to how I would fix your code. First, I rewrote what you posted, because your code was rather pseudo-code and did not run directly.
import json

inputs = [(["scientific name", "common names", "colors"],
           ["somename1", ["name11", "name12"], ["color11", "color12"]]),
          (["scientific name", "common names", "colors"],
           ["somename2", ["name21", "name22"], ["color21", "color22"]])]

for keys, values in inputs:
    d = dict(zip(keys, values))
    with open("file.json", 'a') as j:
        json.dump(d, j)

with open("file.json", 'r') as j:
    print(json.load(j))
As you correctly realized, this code fails with
json.decoder.JSONDecodeError: Extra data: line 1 column 105 (char 104)
The way I would write it, is:
import json

inputs = [(["scientific name", "common names", "colors"],
           ["somename1", ["name11", "name12"], ["color11", "color12"]]),
          (["scientific name", "common names", "colors"],
           ["somename2", ["name21", "name22"], ["color21", "color22"]])]

jsonData = list()
for keys, values in inputs:
    d = dict(zip(keys, values))
    jsonData.append(d)

with open("file.json", 'w') as j:
    json.dump(jsonData, j)

with open("file.json", 'r') as j:
    print(json.load(j))
Also, with Python's json library it is important that you write the entire JSON file in one go, meaning with 'w' instead of 'a'.
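If you really do need to append one record per iteration, a common alternative (not shown in your code) is line-delimited JSON: one object per line, read back line by line. A rough sketch, reusing inputs from above and a hypothetical file.jsonl:
import json

# Append one JSON object per line instead of concatenating whole documents.
with open("file.jsonl", 'a') as j:
    for keys, values in inputs:
        j.write(json.dumps(dict(zip(keys, values))) + '\n')

# json.load() on the whole file would still fail, so read it back one line at a time.
with open("file.jsonl") as j:
    records = [json.loads(line) for line in j if line.strip()]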
I have a JSON file that I am trying to pull specific attribute data from. The JSON data is essentially a dictionary. Before the data is turned into a file, it is first held in a variable like this:
params = {'f': 'json', 'where': '1=1', 'geometryType': 'esriGeometryPolygon', 'spatialRel': 'esriSpatialRelIntersects','outFields': '*', 'returnGeometry': 'true'}
r = requests.get('https://hazards.fema.gov/gis/nfhl/rest/services/CSLF/Prelim_CSLF/MapServer/3/query', params)
cslfJson = r.json()
and then written into a file like this:
path = r"C:/Workspace/Sandbox/ScratchTests/cslf.json"
with open(path, 'w') as f:
    json.dump(cslfJson, f, indent=2)
Within this JSON data is an attribute called DFIRM_ID. I want to create an empty list called dfirm_id = [], get all of the values for DFIRM_ID, and append each value to the list with dfirm_id.append(value). I am thinking I need to somehow read through the JSON variable data or the actual file, but I am not sure how to do it. Any suggestions on an easy method to accomplish this?
dfirm_id = []
for k, v in cslf:
    if cslf[k] == 'DFIRM_ID':
        dfirm.append(cslf[v])
As requested, here is what print(cslfJson) looks like:
It actually prints a huge dictionary that looks like this:
{'displayFieldName': 'CSLF_ID', 'fieldAliases': {'OBJECTID':
'OBJECTID', 'CSLF_ID': 'CSLF_ID', 'Area_SF': 'Area_SF', 'Pre_Zone':
'Pre_Zone', 'Pre_ZoneST': 'Pre_ZoneST', 'PRE_SRCCIT': 'PRE_SRCCIT',
'NEW_ZONE': 'NEW_ZONE', 'NEW_ZONEST': .... {'attributes': {'OBJECTID':
26, 'CSLF_ID': '13245C_26', 'Area_SF': 5.855231804165408e-05,
'Pre_Zone': 'X', 'Pre_ZoneST': '0.2 PCT ANNUAL CHANCE FLOOD HAZARD',
'PRE_SRCCIT': '13245C_STUDY1', 'NEW_ZONE': 'A', 'NEW_ZONEST': None,
'NEW_SRCCIT': '13245C_STUDY2', 'CHHACHG': 'None (Zero)', 'SFHACHG':
'Increase', 'FLDWYCHG': 'None (Zero)', 'NONSFHACHG': 'Decrease',
'STRUCTURES': None, 'POPULATION': None, 'HUC8_CODE': None, 'CASE_NO':
None, 'VERSION_ID': '2.3.3.3', 'SOURCE_CIT': '13245C_STUDY2', 'CID':
'13245C', 'Pre_BFE': -9999, 'Pre_BFE_LEN_UNIT': None, 'New_BFE':
-9999, 'New_BFE_LEN_UNIT': None, 'BFECHG': 'False', 'ZONECHG': 'True', 'ZONESTCHG': 'True', 'DFIRM_ID': '13245C', 'SHAPE_Length':
0.009178426056888393, 'SHAPE_Area': 4.711699932249018e-07, 'UID': 'f0125a91-2331-4318-9a50-d77d042a48c3'}}, {'attributes': .....}
If your json data is already a dictionary, then take advantage of that. The beauty of a dictionary / hashmap is that it provides an average time complexity of O(1).
Based on your comment, I believe this will solve your problem:
dfirm_id = []
for feature in cslf['features']:
    dfirm_id.append(feature['attributes']['DFIRM_ID'])
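A minimal end-to-end sketch, assuming cslf is the cslfJson dictionary from the question (or the result of json.load on the file you wrote):
import json

with open(r"C:/Workspace/Sandbox/ScratchTests/cslf.json") as f:
    cslf = json.load(f)

dfirm_id = [feature['attributes']['DFIRM_ID'] for feature in cslf['features']]
print(dfirm_id[:5])   # e.g. ['13245C', ...] based on the sample record shown above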
I am trying to generate JSON paths automatically from a given JSON structure, but I am stuck on the programmatic part. Can someone please help out with an idea to take it further?
Below is the code I have so far.
def iterate_dict(dict_data, key, tmp_key):
    for k, v in dict_data.items():
        key = key + tmp_key + '.' + k
        key = key.replace('$$', '$')
        if type(v) is dict:
            tmp_key = key
            key = '$'
            iterate_dict(v, key, tmp_key)
        elif type(v) is list:
            str_encountered = False
            for i in v:
                if type(i) is str:
                    str_encountered = True
                    tmp_key = key
                    break
                tmp_key = key
                key = '$'
                iterate_dict(i, key, tmp_key)
            if str_encountered:
                print(key, v)
                if tmp_key is not None:
                    tmp_key = str(tmp_key)[:-str(k).__len__() - 1]
                key = '$'
        else:
            print(key, v)
            key = '$'

import json
iterate_dict(dict(json.loads(d_data)), '$', '')
consider the below json structure
{
    "id": "1",
    "categories": [
        {
            "name": "author",
            "book": "fiction",
            "leaders": [
                {
                    "ref": ["wiki", "google"],
                    "athlete": {
                        "$ref": "some data"
                    },
                    "data": {
                        "$data": "some other data"
                    }
                }
            ]
        },
        {
            "name": "dummy name"
        }
    ]
}
Expected output out of python script:
$id = 1
$categories[0].name = author
$categories[0].book = fiction
$categories[0].leaders[0].ref[0] = wiki
$categories[0].leaders[0].ref[1] = google
$categories[0].leaders[0].athlete.$ref = some data
$categories[0].leaders[0].data.$data = some other data
$categories[1].name = dummy name
Current output with above python script:
$.id 1
$$.categories.name author
$$.categories.book fiction
$$$.categories.leaders.ref ["wiki", "google"]
$$$$$.categories.leaders.athlete.$ref some data
$$$$$$.categories.leaders.athlete.data.$data some other data
$$.name dummy name
The following recursive function is similar to yours, but instead of just displaying a dictionary, it can also take a list. This means that if you passed in a dictionary where one of the values was a nested list, then the output would still be correct (printing things like dict.key[3][4] = element).
def disp_paths(it, p='$'):
    for k, v in (it.items() if type(it) is dict else enumerate(it)):
        if type(v) is dict:
            disp_paths(v, '{}.{}'.format(p, k))
        elif type(v) is list:
            for i, e in enumerate(v):
                if type(e) is dict or type(e) is list:
                    disp_paths(e, '{}.{}[{}]'.format(p, k, i))
                else:
                    print('{}.{}[{}] = {}'.format(p, k, i, e))
        else:
            f = '{}.{} = {}' if type(it) is dict else '{}[{}] = {}'
            print(f.format(p, k, v))
which, when run with your dictionary (disp_paths(d)), gives the expected output of:
$.categories[0].leaders[0].athlete.$ref = some data
$.categories[0].leaders[0].data.$data = some other data
$.categories[0].leaders[0].ref[0] = wiki
$.categories[0].leaders[0].ref[1] = google
$.categories[0].book = fiction
$.categories[0].name = author
$.categories[1].name = dummy name
$.id = 1
Note that this is unfortunately not ordered, but that is unavoidable as dictionaries have no inherent order (they are just sets of key:value pairs)
If you need help understanding my modifications, just drop a comment!
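In case it helps, this is how d can be built from the JSON in the question before calling the function (assuming the document is held in a string s):
import json

d = json.loads(s)   # `s` holds the JSON shown in the question
disp_paths(d)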
With a CSV of ~50 rows (stars) and ~30 columns (name, magnitudes and distance) that has some empty string values (''), I am trying to do two things for which the help so far hasn't been useful: (1) parse empty strings as 0.0, so that I can (2) append each row to a list of lists (what I called s).
In other words:
- s is a list of stars (each one has all its parameters)
- d is a particular parameter for all the stars (distance), which I obtain correctly.
Big issue is with s. My try:
with open('stars.csv', 'r') as mycsv:
    csv_stars = csv.reader(mycsv)
    next(csv_stars)  # skip header
    stars = list(csv_stars)

s = []  # star
d = []  # distances
for row in stars:
    row[row==''] = '0'
    s.append(float(row))  # stars
    d.append(arcsec*AU*float(row[30]))
I can't think of a better syntax, and so I get the error
s.append(float(row)) # stars
TypeError: float() argument must be a string or a number
From s I would obtain later the magnitudes for all the stars, separately. But first things first...
#cwasdwa Please look at the code below; it will give you an idea. I am sure there might be a better way. This solution is based on what I have understood from your code.
with open('stars.csv', 'r') as mycsv:
    csv_stars = csv.reader(mycsv)
    next(csv_stars)  # skip header
    stars = list(csv_stars)

s = []  # star
d = []  # distances
for row in stars:
    newRow = []  # create new row array to convert all '' to 0.0
    for x in row:
        if x == '':
            newRow.append(0.0)
        else:
            newRow.append(x)
    s.append(newRow)  # stars
    if row[30] == '':
        value = 0.0
    else:
        value = row[30]
    d.append(arcsec*AU*float(value))
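A more compact sketch of the same idea, using a list comprehension (equivalent behaviour; the non-empty values still stay as strings):
for row in stars:
    s.append([0.0 if x == '' else x for x in row])   # replace every '' with 0.0
    d.append(arcsec * AU * float(row[30] or 0.0))    # '' in column 30 also becomes 0.0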
I have a number of csv files. Two example files are shown below.
input1.csv
Actinocyclus actinochilus,7
Asterionella formosa,4
Aulacodiscus orientalis,1
Aulacoseira granulata,3
Chaetoceros radicans,1
Corethron hystrix,6
Coscinodiscaceae,1
Dactyliosolen fragilissimus,32
Diadesmis gallica,1
Diatoma hyemalis,1
Synedropsis hyperboreoides,4
Trigonium formosum,4
Urosolenia eriensis,2
input2.csv
Actinocyclus actinochilus,55
Asterionella formosa,3
Aulacoseira granulata,5
Chaetoceros radicans,7
Dactyliosolen fragilissimus,5
Diatoma hyemalis,1
Stephanopyxis turris,1
Striatella unipunctata,1
Synedropsis hyperboreoides,28
Trigonium formosum,3
Urosolenia eriensis,2
I want to merge these csv files by adding the values in column two for rows with the same name in column one, as in the example output below.
output.csv
Actinocyclus actinochilus,62
Asterionella formosa,7
Aulacodiscus orientalis,1
Aulacoseira granulata,8
Chaetoceros radicans,8
Corethron hystrix,6
Coscinodiscaceae,1
Dactyliosolen fragilissimus,37
Diadesmis gallica,1
Diatoma hyemalis,2
Stephanopyxis turris,1
Striatella unipunctata,1
Synedropsis hyperboreoides,32
Trigonium formosum,7
Urosolenia eriensis,4
I tried join and cat but these stacked them together. Any idea how I could add them together?
Solution for multiple files
This is a Python 3 solution. If you need it to work with Python 2, change this line names = inp.keys() | data.keys() into names = inp.viewkeys() | data.viewkeys().
# get this list of file names from somewhere like `glob`
file_names = ['input1.csv', 'input2.csv', 'input3.csv', 'input4.csv']

def file_to_dict(file_name):
    """Read a two-column csv file into a dict with first column as key
    and an integer value from the second column.
    """
    with open(file_name) as fobj:
        pairs = (line.split(',') for line in fobj if line.strip())
        return {k.strip(): int(v) for k, v in pairs}

def merge(data, file_name):
    """Merge input file with dict `data` adding the numerical values.
    """
    inp = file_to_dict(file_name)
    names = inp.keys() | data.keys()
    for name in names:
        data[name] = data.get(name, 0) + inp.get(name, 0)
    return data

data = {}
for file_name in file_names:
    merge(data, file_name)

with open('output.csv', 'w') as fobj:
    for name, val in sorted(data.items()):
        fobj.write('{},{}\n'.format(name, val))
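For example, the file names could come from glob instead of being hard-coded (the pattern below is only a guess at your naming scheme):
import glob

file_names = sorted(glob.glob('input*.csv'))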
Solution for two files
This produces the desired output:
def file_to_dict(file_name):
    """Read a two-column csv file into a dict with first column as key
    and an integer value from the second column.
    """
    with open(file_name) as fobj:
        pairs = (line.split(',') for line in fobj if line.strip())
        return {k.strip(): int(v) for k, v in pairs}

inp1 = file_to_dict('input1.csv')
inp2 = file_to_dict('input2.csv')
names = sorted(inp1.keys() | inp2.keys())

with open('output.csv', 'w') as fobj:
    for name in names:
        val = inp1.get(name, 0) + inp2.get(name, 0)
        fobj.write('{},{}\n'.format(name, val))
Explanation
The function file_to_dict reads one input file and returns a dictionary like this:
{'Actinocyclus actinochilus': 7,
'Asterionella formosa': 4,
...
Next:
pairs = (line.split(',') for line in fobj if line.strip())
pairs holds a generator expression that represents all name-value pairs as strings. Then:
{k.strip(): int(v) for k, v in pairs}
creates a dictionary from these pairs, stripping off extra white space from the name and converting the string in the second column into an integer.
After reading both input files with this function:
names = sorted(inp1.keys() | inp2.keys())
uses the union of the names from both inputs, i.e. all names that appear in either input1 or input2, and sorts them alphabetically.
The output file needs to be opened in write mode:
with open('output.csv', 'w') as fobj:
for each name:
for name in names:
we retrieve the value from the input dictionaries:
val = inp1.get(name, 0) + inp2.get(name, 0)
The method get returns the value if the name is in the dictionary. Otherwise, it returns the 0 given as the second argument.
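For example, with the sample files above:
inp1.get('Stephanopyxis turris', 0)   # -> 0, the name only appears in input2.csv
inp2.get('Stephanopyxis turris', 0)   # -> 1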
Finally, we write this result line by line:
fobj.write('{},{}\n'.format(name, val))
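For what it's worth, the same merge can also be written with collections.Counter, which adds the values of matching keys; a sketch of that variant, reusing the file_to_dict helper from above:
from collections import Counter

totals = Counter()
for file_name in ['input1.csv', 'input2.csv']:
    totals += Counter(file_to_dict(file_name))   # values for the same name are summed

with open('output.csv', 'w') as fobj:
    for name, val in sorted(totals.items()):
        fobj.write('{},{}\n'.format(name, val))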