python: Parsing a JSON file into a list of dictionaries

I have the following JSON file, annotations, and here is a screenshot of its tree structure.
I want to parse it and extract the following info; the screenshot of the target format is taken from the Standard Dataset Dicts documentation linked here.
I tried to use this code, which is not working as expected:
def get_buildings_dicts(img_dir):
    json_file = os.path.join(img_dir, "annotations.json")
    with open(json_file) as f:
        imgs_anns = json.load(f)
    dataset_dicts = []
    for idx, v in enumerate(imgs_anns):
        record = {}
        filename = os.path.join(img_dir, v["imagePath"])
        height, width = cv2.imread(filename).shape[:2]
        record["file_name"] = filename
        record["image_id"] = idx
        record["height"] = height
        record["width"] = width
        annos = v["shapes"][idx]
        objs = []
        for anno in annos:
            # assert not anno["region_attributes"]
            anno = anno["shape_type"]
            px = anno["points"][0]
            py = anno["points"][1]
            poly = [(x + 0.5, y + 0.5) for x, y in zip(px, py)]
            poly = [p for x in poly for p in x]
            obj = {
                "bbox": [np.min(px), np.min(py), np.max(px), np.max(py)],
                "bbox_mode": BoxMode.XYXY_ABS,
                "segmentation": [poly],
                "category_id": 0,
            }
            objs.append(obj)
        record["annotations"] = objs
        dataset_dicts.append(record)
    return dataset_dicts
Here is the expected output for one of the final dict items:
{
    "file_name": "balloon/train/34020010494_e5cb88e1c4_k.jpg",
    "image_id": 0,
    "height": 1536,
    "width": 2048,
    "annotations": [
        {
            "bbox": [994, 619, 1445, 1166],
            "bbox_mode": <BoxMode.XYXY_ABS: 0>,
            "segmentation": [[1020.5, 963.5, 1000.5, 899.5, 994.5, 841.5, 1003.5, 787.5, 1023.5, 738.5, 1050.5, 700.5, 1089.5, 663.5, 1134.5, 638.5, 1190.5, 621.5, 1265.5, 619.5, 1321.5, 643.5, 1361.5, 672.5, 1403.5, 720.5, 1428.5, 765.5, 1442.5, 800.5, 1445.5, 860.5, 1441.5, 896.5, 1427.5, 942.5, 1400.5, 990.5, 1361.5, 1035.5, 1316.5, 1079.5, 1269.5, 1112.5, 1228.5, 1129.5, 1198.5, 1134.5, 1207.5, 1144.5, 1210.5, 1153.5, 1190.5, 1166.5, 1177.5, 1166.5, 1172.5, 1150.5, 1174.5, 1136.5, 1170.5, 1129.5, 1153.5, 1122.5, 1127.5, 1112.5, 1104.5, 1084.5, 1061.5, 1037.5, 1032.5, 989.5, 1020.5, 963.5]],
            "category_id": 0
        }
    ]
}

I think the only tricky part is dealing with the nested lists, but a handful of comprehensions can make life easier for us.
Try:
import json

new_images = []
with open("merged_file.json", "r") as file_in:
    for index, image in enumerate(json.load(file_in)):
        # height, width = cv2.imread(filename).shape[:2]
        height, width = 100, 100  # placeholder; restore the cv2 line above for real sizes
        new_images.append({
            "image_id": index,
            "file_name": image["imagePath"],  # key renamed to match the expected output; join with img_dir if you need the full path
            "height": height,
            "width": width,
            "annotations": [
                {
                    "category_id": 0,
                    # "bbox_mode": BoxMode.XYXY_ABS,
                    "bbox_mode": 0,
                    "bbox": [
                        min(x for x, y in shape["points"]),
                        min(y for x, y in shape["points"]),
                        max(x for x, y in shape["points"]),
                        max(y for x, y in shape["points"]),
                    ],
                    "segmentation": [coord for point in shape["points"] for coord in point],
                }
                for shape in image["shapes"]
            ],
        })
print(json.dumps(new_images, indent=2))
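Note that the expected output in the question adds 0.5 to every segmentation coordinate (but not to the bbox values). If you need that offset as well, here is a minimal illustration using a made-up shape (hypothetical data, not from the question's file):

# Hypothetical shape with three points, only to show the half-pixel offset.
shape = {"points": [[1020, 963], [1000, 899], [994, 841]]}

segmentation = [[coord + 0.5 for point in shape["points"] for coord in point]]
bbox = [
    min(x for x, y in shape["points"]),
    min(y for x, y in shape["points"]),
    max(x for x, y in shape["points"]),
    max(y for x, y in shape["points"]),
]
print(segmentation)  # [[1020.5, 963.5, 1000.5, 899.5, 994.5, 841.5]]
print(bbox)          # [994, 841, 1020, 963]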

Related

Python- Issue parsing multi-layered API JSON into CSV

I'm trying to parse the NIH grant API and am running into a complex layering issue. In the JSON output below, I've been able to navigate into the "results" section which contains all the fields I want, except some are layered within another dictionary. What I'm trying to do is get the JSON data within "full_study_section", "organization", and "project_num_split" to be in the same layer as "appl_id", "contact_pi_name", "fiscal_year", and so forth. This post was helpful but I'm not quite sure how to level the layers through iteration.
{
    "meta": {
        "limit": 25,
        "offset": 0,
        "properties": {},
        "search_id": null,
        "sort_field": "project_start_date",
        "sort_order": "desc",
        "sorted_by_relevance": false,
        "total": 78665
    },
    "results": [
        {
            "appl_id": 10314644,
            "contact_pi_name": "BROCATO, EMILY ROSE",
            "fiscal_year": 2021,
            "full_study_section": {
                "group_code": "32",
                "name": "Special Emphasis Panel[ZAA1 GG (32)]",
                "sra_designator_code": "GG",
                "sra_flex_code": "",
                "srg_code": "ZAA1",
                "srg_flex": ""
            },
            "organization": {
                "city": null,
                "country": null,
                "dept_type": "PHARMACOLOGY",
                "external_org_id": 353201,
                "fips_country_code": null,
                "org_city": "RICHMOND",
                "org_country": "UNITED STATES",
                "org_duns": [
                    "105300446"
                ],
                "org_fips": "US",
                "org_ipf_code": "353201",
                "org_name": "VIRGINIA COMMONWEALTH UNIVERSITY",
                "org_state": "VA",
                "org_state_name": null,
                "org_zipcode": "232980568"
            },
            "project_end_date": null,
            "project_num": "1F31AA029259-01A1",
            "project_num_split": {
                "activity_code": "F31",
                "appl_type_code": "1",
                "full_support_year": "01A1",
                "ic_code": "AA",
                "serial_num": "029259",
                "suffix_code": "A1",
                "support_year": "01"
            },
            "project_start_date": "2022-03-07T05:00:00Z",
            "subproject_id": null
        },
Code:
import requests
import json
import csv

params = {
    "criteria":
    {
        "fiscal_years": [2021]
    },
    "include_fields": [
        "ApplId", "ContactPiName", "FiscalYear",
        "OrgCountry", "AllText",
        "FullStudySection", "Organization", "ProjectEndDate",
        "ProjectNum", "ProjectNumSplit", "ProjectStartDate", "SubprojectId"
    ],
    "offset": 0,
    "limit": 25,
    "sort_field": "project_start_date",
    "sort_order": "desc"
}

response = requests.post("https://api.reporter.nih.gov/v2/projects/search", json=params)
#print(response.status_code)
#print(response.text)
resdecode = json.loads(response.text)
#print(json.dumps(resdecode, sort_keys=True, indent=4, separators=(',', ':')))
data = resdecode["results"]
#print(json.dumps(data, sort_keys=True, indent=4, separators=(',', ':')))
pns = resdecode["results"][0]["project_num_split"]
#print(json.dumps(pns, sort_keys=True, indent=4, separators=(',', ':')))
# for item in data:
#     appl_id = item.get("appl_id")
#     print(appl_id)
writerr = csv.writer(open('C:/Users/nkmou/Desktop/Venture/Tech Opportunities/NIH.csv', 'w', newline=''))
count = 0
for row in resdecode:
    if count == 0:
        header = resdecode.keys()
        writerr.writerow(header)
        count += 1
    writerr.writerow(row)
writerr.close()
In order to move the items under full_study_section, organization, and project_num_split to the same level as appl_id, contact_pi_name, and fiscal_year, you will have to loop through each of the results, recreate the key-value pairs from those three nested dicts, and then remove the full_study_section, organization, and project_num_split keys once done. The code below should work as you expected.
import requests
import json
import csv

params = {
    "criteria":
    {
        "fiscal_years": [2021]
    },
    "include_fields": [
        "ApplId", "ContactPiName", "FiscalYear",
        "OrgCountry", "AllText",
        "FullStudySection", "Organization", "ProjectEndDate",
        "ProjectNum", "ProjectNumSplit", "ProjectStartDate", "SubprojectId"
    ],
    "offset": 0,
    "limit": 25,
    "sort_field": "project_start_date",
    "sort_order": "desc"
}

response = requests.post("https://api.reporter.nih.gov/v2/projects/search", json=params)
resdecode = json.loads(response.text)
data = resdecode["results"]

# Flatten the three nested dicts into each result, then drop the originals.
for item in data:
    x = ["full_study_section", "organization", "project_num_split"]
    for i in x:
        for key, value in item[i].items():
            item[key] = value
        del item[i]

with open('C:/Users/nkmou/Desktop/Venture/Tech Opportunities/NIH.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    count = 0
    for row in data:
        if count == 0:
            header = row.keys()
            writer.writerow(header)
            count += 1  # was "count =+ 1", which just assigns +1
        writer.writerow(row.values())
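If you would rather not hard-code the three keys, a generic one-level flatten is also possible. This is only a sketch of the same idea, not part of the original answer, and it assumes the nested keys do not collide with the top-level ones:

def flatten_one_level(record):
    # Pull every nested dict's key/value pairs up into the record itself.
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            flat.update(value)  # assumes no key collisions with the top level
        else:
            flat[key] = value
    return flat

data = [flatten_one_level(item) for item in data]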
You can move the items to the required level and then remove the nested dicts.
import json
import pprint

pp = pprint

file = open("test.json")
jsonData = json.load(file)

full_study_section = jsonData['results'][0]['full_study_section']
organization = jsonData['results'][0]['organization']
project_num_split = jsonData['results'][0]['project_num_split']

jsonData['results'][0].update(full_study_section)
jsonData['results'][0].update(project_num_split)
jsonData['results'][0].update(organization)

jsonData['results'][0].pop('full_study_section')
jsonData['results'][0].pop('project_num_split')
jsonData['results'][0].pop('organization')

pp.pprint(jsonData)
Output:
{u'meta': {u'limit': 25,
           u'offset': 0,
           u'properties': {},
           u'search_id': None,
           u'sort_field': u'project_start_date',
           u'sort_order': u'desc',
           u'sorted_by_relevance': False,
           u'total': 78665},
 u'results': [{u'activity_code': u'F31',
               u'appl_id': 10314644,
               u'appl_type_code': u'1',
               u'city': None,
               u'contact_pi_name': u'BROCATO, EMILY ROSE',
               u'country': None,
               u'dept_type': u'PHARMACOLOGY',
               u'external_org_id': 353201,
               u'fips_country_code': None,
               u'fiscal_year': 2021,
               u'full_support_year': u'01A1',
               u'group_code': u'32',
               u'ic_code': u'AA',
               u'name': u'Special Emphasis Panel[ZAA1 GG (32)]',
               u'org_city': u'RICHMOND',
               u'org_country': u'UNITED STATES',
               u'org_duns': [u'105300446'],
               u'org_fips': u'US',
               u'org_ipf_code': u'353201',
               u'org_name': u'VIRGINIA COMMONWEALTH UNIVERSITY',
               u'org_state': u'VA',
               u'org_state_name': None,
               u'org_zipcode': u'232980568',
               u'project_end_date': None,
               u'project_num': u'1F31AA029259-01A1',
               u'project_start_date': u'2022-03-07T05:00:00Z',
               u'serial_num': u'029259',
               u'sra_designator_code': u'GG',
               u'sra_flex_code': u'',
               u'srg_code': u'ZAA1',
               u'srg_flex': u'',
               u'subproject_id': None,
               u'suffix_code': u'A1',
               u'support_year': u'01'}]}

Manim Diagonal bar_names

I want to animate a bar chart in manim and it works just fine. However, the bar_names are long and have to be displayed rather small. Is there a way to rotate them so they can be displayed bigger?
CONFIG = {
    "max_value": 100,
    "bar_names": ["Fleisch von Wiederkäuern", "Anderes Fleisch, Fisch", "Milchprodukte", "Früchte",
                  "Snacks, etc.", "Gemüse", "Pflanzliche Öle", "Getreideprodukte", "Pflanzliche Proteine"],
    "bar_label_scale_val": 0.2,
    "bar_stroke_width": 0,
    "width": 10,
    "height": 6,
    "label_y_axis": False,
}

def construct(self):
    composition = [96.350861, 18.5706488, 14.7071608, 8.25588773, 7.33856028, 4.24083463, 1.65574964, 1.36437485, 1]
    chart = BarChart(values=composition, **self.CONFIG)
    self.play(Write(chart), run_time=2)
Maybe something like this?
(I just made new labels, so delete the old ones or set their size to zero.)
(Also, I modified some of the names because my LaTeX crashed with some of the special characters.)
CONFIG = {
    "height": 4,
    "width": 10,
    "n_ticks": 4,
    "tick_width": 0.2,
    "label_y_axis": False,
    "y_axis_label_height": 0.25,
    "max_value": 100,
    "bar_colors": [BLUE, YELLOW],
    "bar_fill_opacity": 0.8,
    "bar_stroke_width": 0,
    "bar_names": ["Fleisch von Wiederkuern", "Anderes Fleisch, Fisch", "Milchprodukte", "Frchte",
                  "Snacks, etc.", "Gemse", "Pflanzliche le", "Getreideprodukte", "Pflanzliche Proteine"],
    "bar_label_scale_val": 0
}

def construct(self):
    bar_names = ["Fleisch von Wiederkuern", "Anderes Fleisch, Fisch", "Milchprodukte", "Frchte",
                 "Snacks, etc.", "Gemse", "Pflanzliche le", "Getreideprodukte", "Pflanzliche Proteine"]
    Lsize = 0.55
    Lseparation = 1.1
    Lpositionx = -5.4
    Lpositiony = 2
    bar_labels = VGroup()
    for i in range(len(bar_names)):
        label = TexMobject(bar_names[i])
        label.scale(Lsize)
        label.move_to(DOWN * Lpositiony + (i * Lseparation + Lpositionx) * RIGHT)
        label.rotate(np.pi * (1.5 / 6))
        bar_labels.add(label)
    composition = [96.350861, 18.5706488, 14.7071608, 8.25588773, 7.33856028, 4.24083463, 1.65574964, 1.36437485, 1]
    chart = BarChart(values=composition, **self.CONFIG)
    chart.shift(UP)
    self.play(Write(chart), Write(bar_labels), run_time=2)
# Manim Community Version 0.7.0 in Google Colab
%%manim -qm -v WARNING BarChartExample2
import numpy as np
mobject.probability.np = np

class BarChartExample2(Scene):
    CONFIG = {
        "height": 4,
        "width": 10,
        "n_ticks": 4,
        "tick_width": 0.2,
        "label_y_axis": False,
        "y_axis_label_height": 0.25,
        "max_value": 100,
        "bar_colors": [BLUE, YELLOW],
        "bar_fill_opacity": 0.8,
        "bar_stroke_width": 0,
        "bar_names": ["Fleisch von Wiederkuern", "Anderes Fleisch, Fisch", "Milchprodukte",
                      "Frchte", "Snacks, etc.", "Gemse", "Pflanzliche le", "Getreideprodukte", "Pflanzliche Proteine"],
        "bar_label_scale_val": 0
    }

    def construct(self):
        bar_names = ["Fleisch von Wiederkuern", "Anderes Fleisch, Fisch", "Milchprodukte",
                     "Frchte", "Snacks, etc.", "Gemse", "Pflanzliche le", "Getreideprodukte", "Pflanzliche Proteine"]
        Lsize = 0.55
        Lseparation = 1.1
        Lpositionx = -5.4
        Lpositiony = 2
        bar_labels = VGroup()
        for i in range(len(bar_names)):
            # label = TexMobject(bar_names[i])
            label = MathTex(bar_names[i])
            label.scale(Lsize)
            label.move_to(DOWN * Lpositiony + (i * Lseparation + Lpositionx) * RIGHT)
            label.rotate(np.pi * (1.5 / 6))
            bar_labels.add(label)
        composition = [96.350861, 18.5706488, 14.7071608, 8.25588773, 7.33856028, 4.24083463, 1.65574964, 1.36437485, 1]
        chart = BarChart(values=composition, **self.CONFIG)
        chart.shift(UP)
        self.play(Write(chart), Write(bar_labels), run_time=2)
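As an aside (my addition, not part of the original answers): in Manim Community you can sidestep the LaTeX crashes on umlauts entirely by using Text, which renders through Pango rather than LaTeX, so names like "Früchte" work unmodified:

from manim import *
import numpy as np

class RotatedTextLabel(Scene):
    def construct(self):
        # Text is rendered by Pango, not LaTeX, so umlauts are safe.
        label = Text("Fleisch von Wiederkäuern").scale(0.4)
        label.rotate(np.pi / 4)  # same rotation idea as the labels above
        self.play(Write(label))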

Formatting: adding nested dictionaries to a JSON file in a specific format

My Python script works and appends to my JSON file; however, I have tried to add a numbered entry identifier with no success. Additionally, I am trying to produce a specific output format each time the calculations are iterated. I am looking for detailed examples and guidance.
Current Python Script
import json

# Dictionary All-Calculations
def dict_calc(num1, num2):
    add = str(float(num1) + float(num2))
    sub = str(float(num1) - float(num2))
    mul = str(float(num1) * float(num2))
    div = str(float(num1) / float(num2))
    calc_d = {"Add": add, "Subtract": sub, "Multiply": mul, "Divide": div}
    return calc_d

# Yes or No
def y_n(answer):
    if answer[:1] == 'y':
        return True
    if answer[:1] == 'n':
        return False

# Main Dictionary
data_table = {}
while True:
    num1 = input("\n Enter first number: ")
    num2 = input("\n Enter second number: ")
    data_table = dict_calc(num1, num2)
    with open('dict_calc.json', 'a', encoding='utf-8') as f:
        json.dump(data_table, f, ensure_ascii=True, indent=4)
    answer = input("\n Run Again? (Y/N) ").lower().strip()
    if y_n(answer) == True:
        continue
    else:
        print("\n Thank You and Goodbye")
        break
Current Output Example
{
    "Add": "579.0",
    "Subtract": "-333.0",
    "Multiply": "56088.0",
    "Divide": "0.26973684210526316"
}{
    "Add": "1245.0",
    "Subtract": "-333.0",
    "Multiply": "359784.0",
    "Divide": "0.5779467680608364"
}{
    "Add": "1396.0",
    "Subtract": "554.0",
    "Multiply": "410475.0",
    "Divide": "2.315914489311164"
}
Desired Output Example - I am trying to add an "Entry" key whose number increases after each iteration. In addition, I am also trying to emulate the output format below.
[
    {
        "Entry": "1",
        "Add": "579.0",
        "Subtract": "-333.0",
        "Multiply": "56088.0",
        "Divide": "0.26973684210526316"
    },
    {
        "Entry": "2",
        "Add": "1245.0",
        "Subtract": "-333.0",
        "Multiply": "359784.0",
        "Divide": "0.5779467680608364"
    },
    {
        "Entry": "3",
        "Add": "1396.0",
        "Subtract": "554.0",
        "Multiply": "410475.0",
        "Divide": "2.315914489311164"
    }
]
JSON is a nested structure; you can't simply append more data to the end of a file that holds it. See the JSON Lines format for an append-friendly alternative (a sketch follows the example output below).
If you are using the regular JSON format, you must read the whole JSON structure in, update it, and write it out fully again, or simply write it once the structure is complete.
Example:
import json

# Dictionary All-Calculations
def dict_calc(num1, num2, entry):
    add = str(float(num1) + float(num2))
    sub = str(float(num1) - float(num2))
    mul = str(float(num1) * float(num2))
    div = str(float(num1) / float(num2))
    calc_d = {"Entry": str(entry), "Add": add, "Subtract": sub, "Multiply": mul, "Divide": div}
    return calc_d

# Yes or No
def y_n(answer):
    if answer[:1] == 'y':
        return True
    if answer[:1] == 'n':
        return False

# Empty list that will hold the dictionaries.
data_table = []
entry = 0  # for tracking entry numbers
while True:
    num1 = input("\n Enter first number: ")
    num2 = input("\n Enter second number: ")
    # Count the entry and add it to the dictionary list.
    entry += 1
    data_table.append(dict_calc(num1, num2, entry))
    answer = input("\n Run Again? (Y/N) ").lower().strip()
    if y_n(answer) == True:
        continue
    else:
        print("\n Thank You and Goodbye")
        # Write the complete list of dictionaries in one operation.
        with open('dict_calc.json', 'w', encoding='utf-8') as f:
            json.dump(data_table, f, ensure_ascii=True, indent=4)
        break
Output:
[
    {
        "Entry": "1",
        "Add": "3.0",
        "Subtract": "-1.0",
        "Multiply": "2.0",
        "Divide": "0.5"
    },
    {
        "Entry": "2",
        "Add": "8.0",
        "Subtract": "-1.0",
        "Multiply": "15.75",
        "Divide": "0.7777777777777778"
    },
    {
        "Entry": "3",
        "Add": "13.399999999999999",
        "Subtract": "-2.2",
        "Multiply": "43.68",
        "Divide": "0.717948717948718"
    }
]
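For reference, the JSON Lines approach mentioned above lets you genuinely append one record per run instead of rewriting the whole file. A minimal sketch (my addition; dict_calc.jsonl is a hypothetical file name):

import json

record = {"Entry": "1", "Add": "3.0", "Subtract": "-1.0", "Multiply": "2.0", "Divide": "0.5"}

# Append one compact JSON object per line; each line parses independently.
with open('dict_calc.jsonl', 'a', encoding='utf-8') as f:
    f.write(json.dumps(record) + "\n")

# Reading the file back:
with open('dict_calc.jsonl', encoding='utf-8') as f:
    entries = [json.loads(line) for line in f]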
A few things you need to change:
you need to change data_table's type to a list,
you need to append the dict_calc function's result to it,
and add a counter.
Here is your code:
import json

# Dictionary All-Calculations
def dict_calc(counter, num1, num2):
    add = str(float(num1) + float(num2))
    sub = str(float(num1) - float(num2))
    mul = str(float(num1) * float(num2))
    div = str(float(num1) / float(num2))
    calc_d = {"Entry": str(counter), "Add": add, "Subtract": sub, "Multiply": mul, "Divide": div}
    return calc_d

# Yes or No
def y_n(answer):
    if answer[:1] == 'y':
        return True
    if answer[:1] == 'n':
        return False

# Main list of dictionaries
data_table = []
counter = 1
while True:
    num1 = input("\n Enter first number: ")
    num2 = input("\n Enter second number: ")
    data_table.append(dict_calc(counter, num1, num2))
    counter += 1
    # 'w' (not 'a') so the file is rewritten as one valid list each time
    with open('dict_calc.json', 'w', encoding='utf-8') as f:
        json.dump(data_table, f, ensure_ascii=True, indent=4)
    answer = input("\n Run Again? (Y/N) ").lower().strip()
    if y_n(answer) == True:
        continue
    else:
        print("\n Thank You and Goodbye")
        break

How to measure the difference between two frames given a set of pixel units for each frame?

I am using a dense optical flow algorithm to calculate the optical flow on a given video; after I run the algorithm, I receive the output below.
I would like to find a way to sum up the changes between two frames (two sets of vectors in pixel units), i.e. to reduce the change between frames to a single numerical value, in order to determine whether the two frames are "similar" or "different".
This is the output (from what I understand, it is basically the change in x and y for each pixel):
flow
[[[ 0.00080293 0.00456178]
[ 0.0023454 0.00762859]
[ 0.00337119 0.01088941]
...
[ 0.08646814 0.17195833]
[ 0.07680464 0.15070145]
[ 0.04990056 0.09711792]]
[[ 0.00197109 0.00610898]
[ 0.00431191 0.01074001]
[ 0.00629149 0.01567514]
...
[ 0.11541913 0.23083425]
[ 0.10006026 0.19827926]
[ 0.06407876 0.12646647]]
[[ 0.00333168 0.0071025 ]
[ 0.00625938 0.01281219]
[ 0.01047979 0.02093185]
...
[ 0.15598673 0.31461456]
[ 0.1284331 0.25725985]
[ 0.08006614 0.16013806]]
...
[[-0.11634359 0.09029744]
[-0.14934781 0.11287674]
[-0.24678642 0.17862432]
...
[ 0.00260158 0.00103487]
[ 0.00391656 0.00041338]
[ 0.00312206 0.00064316]]
[[-0.06021533 0.04847184]
[-0.07352059 0.05851178]
[-0.12553327 0.09319763]
...
[ 0.00314228 -0.00119414]
[ 0.00410303 -0.00139949]
[ 0.00334636 -0.00098234]]
[[-0.0192373 0.010998 ]
[-0.02326458 0.01555626]
[-0.04161371 0.02764582]
...
[ 0.00236979 -0.00039244]
[ 0.00327405 -0.00078911]
[ 0.00281549 -0.00057979]]]
flow
[[[-8.4514404e-03 -9.1092577e-03]
[-8.2096420e-03 -1.6217180e-02]
[-9.7641135e-03 -2.3235001e-02]
...
[ 8.4836602e-02 9.4629139e-02]
[ 7.0593305e-02 7.2248474e-02]
[ 6.2410351e-02 5.8204494e-02]]
[[-1.6573617e-02 -1.5174728e-02]
[-1.5833536e-02 -2.2253623e-02]
[-1.7538801e-02 -3.1138226e-02]
...
[ 1.3201687e-01 1.3085920e-01]
[ 1.1270510e-01 1.0012541e-01]
[ 1.0345179e-01 8.3722569e-02]]
[[-2.1787306e-02 -2.0292744e-02]
[-2.2391599e-02 -2.8152039e-02]
[-2.3549989e-02 -3.8980592e-02]
...
[ 1.5739001e-01 1.6933599e-01]
[ 1.3471533e-01 1.2855931e-01]
[ 1.2196152e-01 1.0327549e-01]]
...
[[-3.9006339e-03 -3.0767643e-03]
[-1.8084457e-02 -8.7532159e-03]
[-4.0460575e-02 -1.6521217e-02]
...
[ 5.4473747e-03 -1.9708525e-03]
[ 4.3195980e-03 -1.6532388e-03]
[ 2.4038905e-03 -2.6415614e-04]]
[[-2.2322503e-03 -3.0169063e-03]
[-1.1787469e-02 -8.9037549e-03]
[-2.8192652e-02 -1.6921449e-02]
...
[ 1.9799198e-03 -3.8150212e-04]
[ 1.5747466e-03 -5.4049061e-04]
[ 9.2306529e-04 -1.1204407e-04]]
[[-1.1798806e-03 -1.9108414e-03]
[-6.6612735e-03 -5.3157108e-03]
[-1.6056010e-02 -9.3358066e-03]
...
[ 4.8137631e-04 6.4036541e-04]
[ 3.4130082e-04 3.7227676e-04]
[ 1.7955518e-04 1.8480681e-04]]]...
This is the code we are using for the optical flow calculation:
def _calc_optical_flow_(distance_array):
    cap = cv2.VideoCapture("videos/3.1.avi")
    output_file_text = open("output.txt", "w+")
    ret, frame1 = cap.read()
    prvs = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    hsv = np.zeros_like(frame1)
    frames_array.append(frame1)
    hsv[..., 1] = 255
    count = 0
    distance = 0
    while(1):
        ret, frame2 = cap.read()
        if frame2 is None:
            break
        frames_array.append(frame2)
        next = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prvs, next, None, pyr_scale=0.5, levels=3, winsize=15,
                                            iterations=1, poly_n=5, poly_sigma=1.2, flags=0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        hsv[..., 0] = ang * 180 / np.pi / 2
        hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
        rgb = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
        if count == 10:
            count = 0
            output_file_text.write("flow\n")
            output_file_text.write(np.array_str(flow) + "\n")
            distance = function(flow, distance)  # "function" is a placeholder in the question
            distance_array.append(distance)
            #print("flow", flow)
        cv2.imshow('frame2', rgb)
        count = count + 1
        k = cv2.waitKey(10) & 0xff
        if k == 27:
            break
        elif k == ord('s'):
            pass  # body was commented out; pass keeps the branch valid
            #cv2.imwrite('opticalfb.png', frame2)
            #cv2.imwrite('opticalhsv.png', rgb)
        prvs = next
    output_file_text.close()
    cap.release()
    cv2.destroyAllWindows()
    return distance_array
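The thread does not include an answer, but as an illustration of the idea in the question: one simple reduction (a sketch of my own, not from the thread) is to collapse each flow field into a scalar by averaging the per-pixel displacement magnitudes, then compare that score against a threshold:

import numpy as np

def flow_change_score(flow):
    # flow has shape (H, W, 2): the per-pixel (dx, dy) in pixel units.
    magnitudes = np.linalg.norm(flow, axis=2)  # per-pixel displacement length
    return float(magnitudes.mean())            # use .sum() for a total instead

# Hypothetical usage; the threshold must be tuned for your video:
# score = flow_change_score(flow)
# frames_are_similar = score < 0.05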

I need to create a spark dataframe from a nested json file in scala

I have a Json file that looks like this
{
    "tags": [
        {
            "1": "NpProgressBarTag",
            "2": "userPath",
            "3": "screen",
            "4": 6,
            "12": 9,
            "13": "buttonName",
            "16": 0,
            "17": 10,
            "18": 5,
            "19": 6,
            "20": 1,
            "35": 1,
            "36": 1,
            "37": 4,
            "38": 0,
            "39": "npChannelGuid",
            "40": "npShowGuid",
            "41": "npCategoryGuid",
            "42": "npEpisodeGuid",
            "43": "npAodEpisodeGuid",
            "44": "npVodEpisodeGuid",
            "45": "npLiveEventGuid",
            "46": "npTeamGuid",
            "47": "npLeagueGuid",
            "48": "npStatus",
            "50": 0,
            "52": "gupId",
            "54": "deviceID",
            "55": 1,
            "56": 0,
            "57": "uiVersion",
            "58": 1,
            "59": "deviceOS",
            "60": 1,
            "61": 0,
            "62": "channelLineupID",
            "63": 2,
            "64": "userProfile",
            "65": "sessionId",
            "66": "hitId",
            "67": "actionTime",
            "68": "seekTo",
            "69": "seekFrom",
            "70": "currentPosition"
        }
    ]
}
I tried to create a dataframe using
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
when I run this I get
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do I create a DataFrame based on the contents of the "tags" key? All I need is to pull the data out of "tags" and apply a case class like this (backticks added, since type is a reserved word in Scala):
case class ProgLang(id: String, `type`: String)
I need to convert this JSON data into a DataFrame with two columns, i.e. .toDF("id", "type").
Can anyone shed some light on this error?
You may modify the JSON using Circe.
Given that your values are sometimes Strings and other times Numbers, this was quite complex.
import io.circe._, io.circe.parser._, io.circe.generic.semiauto._

val json = """ ... """ // your JSON here.
val doc = parse(json).right.get

val mappedDoc = doc.hcursor.downField("tags").withFocus { array =>
  array.mapArray { jsons =>
    jsons.map { json =>
      json.mapObject { o =>
        o.mapValues { v =>
          // Cast numbers to strings.
          if (v.isString) v else Json.fromString(v.asNumber.get.toString)
        }
      }
    }
  }
}

final case class ProgLang(id: String, `type`: String)
final case class Tags(tags: List[Map[String, String]])
implicit val TagsDecoder: Decoder[Tags] = deriveDecoder

val tags = mappedDoc.top.get.as[Tags]

val data = for {
  tag <- tags.right.get.tags // the original had "res29" here, a REPL artifact
  (id, _type) <- tag
} yield ProgLang(id, _type)
Now that you have a List of ProgLang, you may create a DataFrame directly from it, save it as a file with one JSON object per line, save it as a CSV file, etc.
If the file is very big, you may use fs2 to stream it while transforming; it integrates nicely with Circe.
DISCLAIMER: I am far from being a "pro" with Circe, and this seems over-complicated for something that looks like a "simple task"; there is probably a better / cleaner way of doing it (maybe using optics?), but hey, it works! Anyways, if anyone knows a better way to solve this, feel free to edit the question or provide yours.
val path = "some/path/to/jsonFile.json"
spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json(path)
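For anyone working in Python rather than Scala, the equivalent read in PySpark looks like this (my addition; assumes a Spark version where multiLine is supported, i.e. 2.2+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine lets Spark parse a JSON document that spans multiple lines,
# which is what otherwise produces the _corrupt_record column.
df = spark.read.option("multiLine", True).json("some/path/to/jsonFile.json")
df.printSchema()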
Try the following code if your JSON file is not very big:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("some/path/to/jsonFile.json").values)