I was wondering whether there is a way to remove or replace null values and empty square brackets in JSON or in a pandas DataFrame. I tried replacing them after converting everything to strings via .astype(str). That works, but it converts all JSON values to strings, so I cannot process the data further with the same structure. I would appreciate any solution or recommendation. Thanks.
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame({"col1": ["a", [1, 2, 3], [], "d"], "col2": ["e", [], "f", "g"]})
print(df)
# Output
Here is one way to do it:
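# Replace empty lists with pd.NA and leave every other value unchanged.
# Note: pandas 2.1+ deprecates DataFrame.applymap in favor of DataFrame.map.
df = df.applymap(lambda x: pd.NA if isinstance(x, list) and not x else x)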
print(df)
# Output
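If the data is still plain JSON rather than a DataFrame, the same idea can be applied before loading it. The following is only a sketch: the helper name replace_empty_lists and its replacement parameter are made up for illustration, and it assumes you want None (JSON null) in place of each empty list.

```python
import json


def replace_empty_lists(value, replacement=None):
    """Recursively swap empty lists for `replacement` (hypothetical helper)."""
    if isinstance(value, list):
        if not value:
            return replacement
        return [replace_empty_lists(item, replacement) for item in value]
    if isinstance(value, dict):
        return {key: replace_empty_lists(item, replacement) for key, item in value.items()}
    # Strings, numbers, booleans and None pass through untouched.
    return value


raw = '{"col1": ["a", [1, 2, 3], [], "d"], "col2": ["e", [], "f", "g"]}'
cleaned = replace_empty_lists(json.loads(raw))
print(json.dumps(cleaned))
```

Because the values keep their original types, the cleaned structure can still be passed on to pandas or serialized back to JSON.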
I have a JSON file which I need to load into memory in chunks.
Consider this file.json example:
[{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1}
]
Then I want to use the pandas read_json function combined with chunksize:
import pandas as pd
file = "file.json"
dtype = {"weekly_report_day": str, "project_id": str, "actions": int, "event_id": int}
chunked = pd.read_json(file, orient='records', dtype=dtype, chunksize=1, lines=True)
for df in chunked:
    print(df)
This, however, returns an error.
I would like to ask for suggestions (I need to use chunks because the original data is very large).
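One likely cause: read_json only accepts chunksize together with lines=True, and lines=True expects JSON Lines input, that is, one complete object per line with no enclosing [ ] and no trailing commas. The file shown above is a single JSON array, so it does not match that layout. A minimal sketch, assuming the data can be produced as JSON Lines; the in-memory sample text and the chunk size of 2 are illustrative assumptions, not part of the original question.

```python
import io

import pandas as pd

# Hypothetical JSON Lines sample: the same record as above, one object per line.
record = (
    '{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2",'
    '"actions":1,"users":1,"event_id":1}'
)
jsonl_text = "\n".join(record for _ in range(6))

# With lines=True and chunksize set, read_json returns an iterator of DataFrames.
chunks = pd.read_json(io.StringIO(jsonl_text), lines=True, chunksize=2)

sizes = []
for chunk in chunks:
    sizes.append(len(chunk))  # number of rows in this chunk
    print(chunk)
```

For a real file, pass the path instead of the StringIO object; each chunk is an ordinary DataFrame.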
I have multiple JSON files that need to be converted into one CSV file.
These are the example JSON files:
tryout1.json
{
"Product":{
"one":"Desktop Computer",
"two":"Tablet",
"three":"Printer",
"four":"Laptop"
},
"Price":{
"five":700,
"six":250,
"seven":100,
"eight":1200
}}
tryout2.json
{
"Product":{
"one":"dell xps tower",
"two":"ipad",
"three":"hp office jet",
"four":"macbook"
},
"Price":{
"five":500,
"six":200,
"seven":50,
"eight":1000
}}
This is the python code that I wrote for converting those 2 json files
import pandas as pd
df1 = pd.read_json('/home/mich/Documents/tryout.json')
print(df1)
df2 = pd.read_json('/home/mich/Documents/tryout2.json')
print(df2)
df = pd.concat([df1, df2])
df.to_csv ('/home/mich/Documents/tryout.csv', index = None)
result = pd.read_csv('/home/mich/Documents/tryout.csv')
print(result)
But I didn't get the result I need. How can I put the first JSON file in one column (both product and price) and the second file in the next column?
You can first create a combined column of products and prices for each file, then concatenate them.
I am using axis=1 because I want them combined side by side (as columns);
axis=0 would combine them by rows.
import pandas as pd
df1 = pd.read_json('/home/mich/Documents/tryout.json')
df1['product_price'] = df1['Product'].fillna(df1['Price'])
df2 = pd.read_json('/home/mich/Documents/tryout2.json')
df2['product_price'] = df2['Product'].fillna(df2['Price'])
pd.concat([df1['product_price'], df2['product_price']],axis=1)
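A self-contained variation on the same idea, written so it runs without the files on disk. It is only a sketch: the dicts repeat the sample data from the question, the helper product_price and the column labels tryout1/tryout2 are names chosen for illustration, and the CSV is written to an in-memory buffer rather than to the path in the question.

```python
import io

import pandas as pd

tryout1 = {
    "Product": {"one": "Desktop Computer", "two": "Tablet", "three": "Printer", "four": "Laptop"},
    "Price": {"five": 700, "six": 250, "seven": 100, "eight": 1200},
}
tryout2 = {
    "Product": {"one": "dell xps tower", "two": "ipad", "three": "hp office jet", "four": "macbook"},
    "Price": {"five": 500, "six": 200, "seven": 50, "eight": 1000},
}


def product_price(doc):
    """Stack products and prices of one document into a single Series."""
    return pd.concat([pd.Series(doc["Product"]), pd.Series(doc["Price"])])


# axis=1 places the two Series side by side; keys supplies the column labels.
combined = pd.concat(
    [product_price(tryout1), product_price(tryout2)],
    axis=1,
    keys=["tryout1", "tryout2"],
)

buffer = io.StringIO()
combined.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, pass a path to to_csv instead of the buffer.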
I have a slightly complicated json that I need to convert into a dataframe. This is a standard output json from another API and hence the field names will not change.
I have the below dict which is more complicated than what I have worked with till now
>>> import pandas as pd
>>> data = [{'annotation_spec': {'description': 'Story_Driven',
... 'display_name': 'Story_Driven'},
... 'segments': [{'confidence': 0.52302074,
... 'segment': {'end_time_offset': {'nanos': 973306000, 'seconds': 14},
... 'start_time_offset': {}}}]},
... {'annotation_spec': {'description': 'real', 'display_name': 'real'},
... 'segments': [{'confidence': 0.5244379,
... 'segment': {'end_time_offset': {'nanos': 973306000, 'seconds': 14},
... 'start_time_offset': {}}}]}]
I looked through all related SO posts and the closest I can get this into a dataframe is this
from pandas.io.json import json_normalize
pd.DataFrame.from_dict(json_normalize(data,record_path=
['segments'],meta=[['annotation_spec','description'],
['annotation_spec','display_name']],errors='ignore'))
This gives me an output like this
>>> from pandas.io.json import json_normalize
>>> pd.DataFrame.from_dict(json_normalize(data,record_path=['segments'],meta=[['annotation_spec','description'],['annotation_spec','display_name']],errors='ignore'))
confidence segment annotation_spec.description annotation_spec.display_name
0 0.523021 {u'end_time_offset': {u'nanos': 973306000, u's... Story_Driven Story_Driven
1 0.524438 {u'end_time_offset': {u'nanos': 973306000, u's... real real
>>>
I want to break down the "segment" column above into its components as well. How can I do that?
Basically, json_normalize takes care of nested dicts; the problem here is the list under the segments key.
If the length of that list is always 1, we can just remove the list and then apply json_normalize:
# Function to remove the list: if a value is a list, take its first element.
# (In pandas 1.0 and later, json_normalize is available as pd.json_normalize.)
remove_list = lambda dct: {k: (v[0] if isinstance(v, list) else v) for k, v in dct.items()}
data_clean = [remove_list(entry) for entry in data]
json_normalize(data_clean, sep="__")
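If the segments list can hold more than one element, taking only the first one drops data. As a plain-Python alternative (not the json_normalize approach above), the sketch below emits one flat row per segment. The helper names flatten and explode_segments are hypothetical, and the second segment in the sample is invented to show the multi-element case.

```python
def flatten(obj, prefix="", sep="__"):
    """Flatten nested dicts into one level, joining key names with `sep`."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Empty dicts (like start_time_offset: {}) contribute no columns.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def explode_segments(entries):
    """Return one flat row per element of each entry's 'segments' list."""
    rows = []
    for entry in entries:
        base = flatten({k: v for k, v in entry.items() if k != "segments"})
        for segment in entry.get("segments", []):
            row = dict(base)
            row.update(flatten(segment))
            rows.append(row)
    return rows


data = [
    {
        "annotation_spec": {"description": "real", "display_name": "real"},
        "segments": [
            {"confidence": 0.5244379,
             "segment": {"end_time_offset": {"nanos": 973306000, "seconds": 14},
                         "start_time_offset": {}}},
            # Invented second segment, only to illustrate lists longer than 1.
            {"confidence": 0.25,
             "segment": {"end_time_offset": {"seconds": 20},
                         "start_time_offset": {"seconds": 15}}},
        ],
    }
]

rows = explode_segments(data)
for row in rows:
    print(row)
```

The resulting list of flat dicts can be handed to pd.DataFrame(rows); keys missing from a row simply become NaN in that column.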
The format in the file looks like this
{ 'match' : 'a', 'score' : '2'},{......}
I've tried pd.DataFrame, and I've also tried reading the file line by line, but both put everything in one cell.
I'm new to Python.
Thanks in advance.
The expected result is a pandas DataFrame.
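Try using the json_normalize() function.
Example:
import pandas as pd
values = [{'match': 'a', 'score': '2'}, {'match': 'b', 'score': '3'}, {'match': 'c', 'score': '4'}]
# pandas 1.0+ exposes this as pd.json_normalize; older versions use
# `from pandas.io.json import json_normalize` instead.
df = pd.json_normalize(values)
print(df)  # a DataFrame with columns 'match' and 'score', one row per dict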
If each line of your file corresponds to one JSON object, you can do the following:
# import library for working with JSON and pandas
import json
import pandas as pd
# make an empty list
data = []
# open your file and add every row as a dict to the list with data
with open("/path/to/your/file", "r") as file:
    for line in file:
        data.append(json.loads(line))
# make a pandas data frame
df = pd.DataFrame(data)
If a single line of your file contains more than one JSON object, you first have to locate the individual objects. One option is to scan the text with JSONDecoder.raw_decode, which would look like this:
# import all you will need
import pandas as pd
import json
from json import JSONDecoder
# define function
def extract_json_objects(text, decoder=JSONDecoder()):
    pos = 0
    while True:
        # look for the start of the next candidate object
        match = text.find('{', pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            # not valid JSON at this position, so move one character on
            pos = match + 1
# make an empty list
data = []
# open your file and add every JSON object as a dict to the list with data
with open("/path/to/your/file", "r") as file:
    for line in file:
        for item in extract_json_objects(line):
            data.append(item)
# make a pandas data frame
df = pd.DataFrame(data)
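One caveat: the sample line in the question uses single quotes, which json.loads and raw_decode reject as invalid JSON. If the file really looks like that, a possible workaround is to parse it as Python literals instead. This is a sketch under assumptions: the text is trusted, fits in memory, and consists of comma-separated dict literals; the second dict in the sample stands in for the elided part of the question.

```python
import ast

# Hypothetical file content in the single-quoted format from the question.
text = "{ 'match' : 'a', 'score' : '2'},{ 'match' : 'b', 'score' : '3'}"

# Wrapping the text in brackets turns it into one list literal, which
# ast.literal_eval evaluates without executing arbitrary code.
rows = list(ast.literal_eval("[" + text + "]"))
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in the same way as the data list above.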
The following is my JSON response, whose values are encoded in Base64.
response={"response": [{"objcontent": [{"title": "Pressure","rowkeys": [
"lat",
"lon",
"Pressure"
],
"rowvalues": [
[
"WxsArK0NV0A=",
"uaQCWFxSM0A=",
"ncvggc7lcUA6MVVLnZiMQH6msaA+0yhANzLp2RsZhkBwobfXt9BXQKtxbnjV+IFARq3fVqOWiEBwyyvmt+V9QDGg7k8YUHpA4IZm9W/De0A="
],
[
"WxsArK0NV0A=",
"HqJT4w7RUkA=",
"BfPox4I5ikCLVYxUxWqIQIFwlJFA+IVAJeQ6gBLyhEBB0QlkoGiCQDOkvnAZUm1AkGbWKEgza0A+FCkwH4phQHwSRSY+iVRAKcvC4pRliEA="
],
[
"WxsArK0NV0A=",
"G5rYdw0NXkA=",
"C9dhhIVrg0B2hCvzOoKKQMrMWhll5o5AIujgxBB0ZkD8+EipfXx0QOXh0LLycH5ATdtxKqbtdkAw66X3l/VhQLqvZBbd13FAjKl2+8UUjUA="
],
[
"WxsArK0NV0A=",
"PTvsm55daEA=",
"W+wyHC12dUCrvSLM1d6BQMfay0ZjbYpAjnk4Ecc8dkDH35pL429xQPTOwkF6Z41Aci5JATkXjUBQ6Wjlp3RQQFlpNGmsNHpAFf0DUor+dUA="
]]}]}]}
I decoded the values and used them to draw a plot. The following is the code:
import base64
import struct
import numpy as np
import pylab as pl
for response_i in response['response']:
    for row in response_i['objcontent'][0]['rowvalues']:
        for item in row:
            decoded = base64.b64decode(item)
            if len(decoded) < 9:
                # 8 bytes: a single double (lat or lon)
                a = struct.unpack('d', decoded)
            else:
                # 80 bytes: ten doubles (Pressure)
                a = struct.unpack('10d', decoded)
            last = np.array(a)
            pl.show(pl.plot(last))
But I would like to separate the values of each list. 'rowkeys' has 3 elements, ["lat", "lon", "Pressure"], and accordingly each list in 'rowvalues' has 3 values.
My question is: how can I separate the values in rowvalues and group them under the matching rowkeys entry?
In the end I should have 3 lists that cover all the values, like this (shown here before decoding):
'lat': [WxsArK0NV0A=,WxsArK0NV0A=,WxsArK0NV0A=,WxsArK0NV0A=]
'lon': [uaQCWFxSM0A=,HqJT4w7RUkA=,G5rYdw0NXkA=,PTvsm55daEA=]
'pressure': [ncvggc7lcUA6MVVLnZiMQH6msaA+0yhANzLp2RsZhkBwobfXt9BXQKtxbnjV+IFARq3fVqOWiEBwyyvmt+V9QDGg7k8YUHpA4IZm9W/De0A=, BfPox4I5ikCLVYxUxWqIQIFwlJFA+IVAJeQ6gBLyhEBB0QlkoGiCQDOkvnAZUm1AkGbWKEgza0A+FCkwH4phQHwSRSY+iVRAKcvC4pRliEA=, C9dhhIVrg0B2hCvzOoKKQMrMWhll5o5AIujgxBB0ZkD8+EipfXx0QOXh0LLycH5ATdtxKqbtdkAw66X3l/VhQLqvZBbd13FAjKl2+8UUjUA=, W+wyHC12dUCrvSLM1d6BQMfay0ZjbYpAjnk4Ecc8dkDH35pL429xQPTOwkF6Z41Aci5JATkXjUBQ6Wjlp3RQQFlpNGmsNHpAFf0DUor+dUA=]
One approach would be to manually sort the data, like so:
from collections import defaultdict
from base64 import b64decode
import json
d = defaultdict(list)
# json_file is the path to your file. This assumes the whole file is
# Base64-encoded; for a plain JSON file, use json.load(f) instead.
with open(json_file) as f:
    js = json.loads(b64decode(f.read()).decode())
# 'response' and 'objcontent' are lists, so take their first elements
content = js['response'][0]['objcontent'][0]
for i, col_name in enumerate(content['rowkeys']):
    for row_val in content['rowvalues']:
        d[col_name].append(row_val[i])
defaultdict automatically creates a new list when a key is accessed that didn't previously exist, which makes your code slightly sleeker.
Another option would be to use pandas.DataFrame and load the data like so:
import pandas as pd
# js is the parsed JSON dict; 'response' and 'objcontent' are lists
content = js['response'][0]['objcontent'][0]
df = pd.DataFrame(content['rowvalues'], columns=content['rowkeys'])
The neat thing about pandas is that it is quite expansive in its features; for example, once the Base64 strings have been decoded to numbers, you could plot the data from the previously created DataFrame like so:
df.plot()
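Either way, the cells still hold Base64 strings, so they need decoding before any plotting. A minimal sketch of grouping the decoded values by row key, using only the first row from the question. The helper name decode_doubles is made up for illustration, and the little-endian "<" format is an assumption (the code in the question uses the machine's native byte order).

```python
import base64
import struct


def decode_doubles(b64_text):
    """Decode a Base64 string into a list of 8-byte floats (hypothetical helper)."""
    raw = base64.b64decode(b64_text)
    count = len(raw) // 8  # each double occupies 8 bytes
    # Assumption: values are little-endian doubles.
    return list(struct.unpack(f"<{count}d", raw))


row_keys = ["lat", "lon", "Pressure"]
row_values = [
    [
        "WxsArK0NV0A=",
        "uaQCWFxSM0A=",
        "ncvggc7lcUA6MVVLnZiMQH6msaA+0yhANzLp2RsZhkBwobfXt9BXQKtxbnjV+IFARq3fVqOWiEBwyyvmt+V9QDGg7k8YUHpA4IZm9W/De0A=",
    ],
]

# One list per key; each entry is the decoded value list of one row.
columns = {key: [] for key in row_keys}
for row in row_values:
    for key, item in zip(row_keys, row):
        columns[key].append(decode_doubles(item))

for key, values in columns.items():
    print(key, values)
```

Each lat and lon entry decodes to a single number and each Pressure entry to ten, which matches the 'd' and '10d' formats used in the question.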