Create a Single CSV from Multiple JSON Files - json

I'm looking for a way to modify the script below to produce a single CSV from multiple JSON files. The CSV should contain one row per JSON file (row 1 = JSON 1, row 2 = JSON 2, etc.), with each row holding values for the same fields. The script below produces a CSV with only one row of data.
import pandas as pd
df = pd.read_json("pywu.cache.json")
df = df.loc[["station_id", "observation_time", "weather", "temperature_string", "display_location"],"current_observation"].T
df = df.append(pd.Series([df["display_location"]["latitude"], df["display_location"]["longitude"]], index=["latitude", "longitude"]))
df = df.drop("display_location")
print(df['latitude'], df['longitude'])
df = pd.to_numeric(df, errors="ignore")
pd.DataFrame(df).T.to_csv("CurrentObs.csv", index=False, header=False, sep=",")
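One possible way to get one row per file is to collect a flat dict from each JSON file and build the DataFrame once at the end. The following is only a sketch: the helper names (`observation_row`, `json_files_to_csv`), the `obs*.json` file names, and the sample values are made up for illustration, and it assumes every file has the same `current_observation` layout that the script above reads. Unlike the original script, it writes a header row.

```python
import glob
import json
import os
import tempfile

import pandas as pd

FIELDS = ["station_id", "observation_time", "weather", "temperature_string"]


def observation_row(path):
    """Return one flat dict (one CSV row) for a single JSON file."""
    with open(path) as fh:
        obs = json.load(fh)["current_observation"]
    row = {name: obs[name] for name in FIELDS}
    # Lift the two nested values out of display_location.
    row["latitude"] = obs["display_location"]["latitude"]
    row["longitude"] = obs["display_location"]["longitude"]
    return row


def json_files_to_csv(pattern, csv_path):
    """Write one CSV row per JSON file matching the glob pattern."""
    rows = [observation_row(p) for p in sorted(glob.glob(pattern))]
    df = pd.DataFrame(rows)
    df.to_csv(csv_path, index=False)
    return df


# Demo with two made-up files (hypothetical values, not real observations).
demo_dir = tempfile.mkdtemp()
for i, station in enumerate(["KAAA", "KBBB"], start=1):
    sample = {
        "current_observation": {
            "station_id": station,
            "observation_time": "Last Updated on June 1, 10:00 AM",
            "weather": "Clear",
            "temperature_string": "70 F (21 C)",
            "display_location": {"latitude": "37.77", "longitude": "-122.41"},
        }
    }
    with open(os.path.join(demo_dir, f"obs{i}.json"), "w") as fh:
        json.dump(sample, fh)

df = json_files_to_csv(
    os.path.join(demo_dir, "obs*.json"),
    os.path.join(demo_dir, "CurrentObs.csv"),
)
print(df)
```

Building the list of rows first and calling `pd.DataFrame` once avoids repeated appends, which is the part of the original script that limits it to a single row.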

Related

Convert multiple JSON File to CSV File, each in one column

I have multiple JSON files that need to be converted into one CSV file.
Here are the example JSON files:
tryout1.json
{
"Product":{
"one":"Desktop Computer",
"two":"Tablet",
"three":"Printer",
"four":"Laptop"
},
"Price":{
"five":700,
"six":250,
"seven":100,
"eight":1200
}}
tryout2.json
{
"Product":{
"one":"dell xps tower",
"two":"ipad",
"three":"hp office jet",
"four":"macbook"
},
"Price":{
"five":500,
"six":200,
"seven":50,
"eight":1000
}}
This is the python code that I wrote for converting those 2 json files
import pandas as pd
df1 = pd.read_json('/home/mich/Documents/tryout.json')
print(df1)
df2 = pd.read_json('/home/mich/Documents/tryout2.json')
print(df2)
df = pd.concat([df1, df2])
df.to_csv ('/home/mich/Documents/tryout.csv', index = None)
result = pd.read_csv('/home/mich/Documents/tryout.csv')
print(result)
But I didn't get the result I need. How can I put the first JSON file in one column (both products and prices) and the second in the next column?
You can first create a combined column of products and prices in each DataFrame, then concat the two combined columns.
I am using axis=1 since I want them to be combined side by side (as columns).
axis=0 would combine them by rows.
import pandas as pd
df1 = pd.read_json('/home/mich/Documents/tryout.json')
df1['product_price'] = df1['Product'].fillna(df1['Price'])
df2 = pd.read_json('/home/mich/Documents/tryout2.json')
df2['product_price'] = df2['Product'].fillna(df2['Price'])
pd.concat([df1['product_price'], df2['product_price']],axis=1)
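The same idea can be tried without any files on disk. The sketch below uses in-memory dicts in the shape of tryout1.json and tryout2.json (shortened to two products each); the helper name `one_column` is hypothetical. It casts `Product` to `object` before filling, on the assumption that a string-typed column might otherwise refuse the numeric prices.

```python
import pandas as pd

# Shortened stand-ins for tryout1.json and tryout2.json.
tryout1 = {
    "Product": {"one": "Desktop Computer", "two": "Tablet"},
    "Price": {"five": 700, "six": 250},
}
tryout2 = {
    "Product": {"one": "dell xps tower", "two": "ipad"},
    "Price": {"five": 500, "six": 200},
}


def one_column(data, name):
    """Merge Product and Price into a single column named after the file."""
    df = pd.DataFrame(data)
    # Product is missing where Price is present, so fill the gaps with prices.
    merged = df["Product"].astype(object).fillna(df["Price"])
    return merged.rename(name)


# axis=1 places the two merged columns side by side.
combined = pd.concat(
    [one_column(tryout1, "tryout1"), one_column(tryout2, "tryout2")],
    axis=1,
)
print(combined)
```

Calling `combined.to_csv(...)` afterwards would then write one column per JSON file.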

Pandas ExcelWriter removing valid NaN values when they are needed in output spreadsheet

I am using Pandas to load a JSON file and output it to Excel via the ExcelWriter. "NaN" is a valid value in the JSON, but it is getting stripped in the spreadsheet. How can I keep the NaN value?
Here's the json input file (simple_json_test.json)
{"action_time":"2020-04-23T07:39:51.918Z","triggered_value":"NaN"}
{"action_time":"2020-04-23T07:39:51.918Z","triggered_value":"2"}
{"action_time":"2020-04-23T07:39:51.918Z","triggered_value":"1"}
{"action_time":"2020-04-23T07:39:51.918Z","triggered_value":"NaN"}
Here's the python code:
import pandas as pd
from datetime import datetime
with open('simple_json_test.json', 'r') as f:
    data = f.readlines()
data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
df = pd.read_json(data_json_str)
# Write dataframe to excel
df['action_time'] = df['action_time'].dt.tz_localize(None)
# Write the dataframe to excel
writer = pd.ExcelWriter('jsonNaNExcelTest.xlsx', engine='xlsxwriter',datetime_format='yyy-mm-dd hh:mm:ss.000')
df.to_excel(writer, header=True, sheet_name='Pandas_Test',index=False)
# Widen the columns
worksheet = writer.sheets['Pandas_Test']
worksheet.set_column('A:B', 25)
writer.save()
In the output Excel file, the NaN values are stripped.
Once that basic question is answered, I want to be able to specify the columns in which "NaN" is a valid value, so that it is saved to Excel.
The default action for to_excel() is to convert NaN to the empty string ''. See the Pandas docs for to_excel() and the na_rep parameter.
You can specify an alternative like this:
df.to_excel(writer, header=True, sheet_name='Pandas_Test',
index=False, na_rep='NaN')
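Since `na_rep` applies to the whole sheet, the follow-up wish (keeping "NaN" only in chosen columns) needs a different approach. One option, sketched below, is to replace the missing values in just those columns with the text "NaN" before writing. The column `other_value` and the list `keep_nan` are made-up names for illustration.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame(
    {
        "triggered_value": [nan, 2.0, 1.0, nan],
        "other_value": [nan, 5.0, 6.0, 7.0],  # hypothetical second column
    }
)

# Columns in which a missing value should show up as the text "NaN".
keep_nan = ["triggered_value"]

out = df.copy()
for col in keep_nan:
    # Cast to object so the column can hold numbers and the string "NaN".
    out[col] = out[col].astype(object).where(out[col].notna(), "NaN")

print(out)
```

Writing `out` with `to_excel()` and the default `na_rep` should then leave the missing cells of the other columns empty, while the selected columns carry the literal text "NaN".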

JSON input Datetime not formatting correctly in excel using Pandas Excelwriter

I am trying to read in json into a dataframe in Pandas and then output the df to excel using pandas ExcelWriter. I am getting mixed outputs in excel. Both of the datetimes formats in the json are YYYY-MM-DDTHH:MM:SS.sssZ. For example, 2020-04-23T07:39:51.918Z.
Here is my code:
import pandas as pd
from datetime import datetime
with open('simple_json_test.txt', 'r') as f:
    data = f.readlines()
data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
df = pd.read_json(data_json_str)
print (df.dtypes)
# Write the dataframe to excel
writer = pd.ExcelWriter('simpleJsonToExcelTest.xlsx', engine='xlsxwriter')
df.to_excel(writer, header=True, sheet_name='Pandas_Test',index=False)
writer.save()
I get the following error when I run my code: "ValueError: Excel does not support datetimes with timezones. Please ensure that the datetimes are timezone unaware before writing to Excel."
I printed df.dtypes to see the types of the columns:
Triggered Time object
action_time datetime64[ns]
dtype: object
It's weird, since both seem to be in the same format in the JSON. Here is the JSON:
{"action_time":"2020-04-23T07:39:51.918Z","Triggered Time":"2020-04-23T07:39:51.900Z"}
{"action_time":"2020-04-23T07:39:51.918Z","Triggered Time":"2020-04-23T07:39:51.900Z"}
{"action_time":"2020-04-23T07:39:51.918Z","Triggered Time":"2020-04-23T07:39:51.900Z"}
{"action_time":"2020-04-23T07:39:51.918Z","Triggered Time":"2020-04-23T07:39:51.900Z"}
I made the following updates to the code and got it to run successfully, however the output in the excel file is not the same.
import pandas as pd
from datetime import datetime
with open('simple_json_test.txt', 'r') as f:
    data = f.readlines()
data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
df = pd.read_json(data_json_str)
print (df.dtypes)
df['action_time'] = pd.to_datetime(df['action_time'],errors='coerce',utc=True)
df['action_time'] = df['action_time'].apply(lambda a: datetime.strftime(a, "%Y-%m-%d %H:%M:%S%f")[:-3])
df['action_time'] = pd.to_datetime(df['action_time'], errors='coerce',format='%Y-%m-%d %H:%M:%S%f')
print (df.dtypes)
# Write the dataframe to excel
writer = pd.ExcelWriter('simpleJsonToExcelTest.xlsx', engine='xlsxwriter')
df.to_excel(writer, header=True, sheet_name='Pandas_Test',index=False)
writer.save()
I'm new to pandas, so I don't fully understand some of the things I have tried, and they may be incorrect. The output in the Excel file is:
action_time column is YYYY-MM-DD HH:MM:SS
Triggered Time is YYYY-MM-DDTHH:MM:SS.sssZ
action_time          Triggered Time
2020-04-23 07:39:51  2020-04-23T07:39:51.918Z
The Triggered Time format is what I want (YYYY-MM-DDTHH:MM:SS.sssZ). I need to preserve the milliseconds. It looks like action_time in Excel is an actual date field, while Triggered Time is not.
I even tried converting the datatype of action_time to object, and that didn't work. I'm stuck at this point.
I don't know why "action_time" and "Triggered Time" are parsed with different types but replacing the space in "Triggered Time" converts both to datetime64[ns]. Maybe someone else can explain that part.
Anyway, with that in place you can format the datetime objects in Excel like this:
import pandas as pd
from datetime import datetime
with open('simple_json_test.txt', 'r') as f:
    data = f.readlines()
data = map(lambda x: x.rstrip(), data)
data = map(lambda x: x.replace('Triggered Time', 'Triggered_Time'), data)
data_json_str = "[" + ','.join(data) + "]"
df = pd.read_json(data_json_str)
print (df.dtypes)
# Write the dataframe to excel
writer = pd.ExcelWriter('simpleJsonToExcelTest.xlsx',
engine='xlsxwriter',
datetime_format='yyyy-mm-dd hh:mm:ss.000')
df.to_excel(writer, header=True, sheet_name='Pandas_Test', index=False)
# Widen the column for visibility.
worksheet = writer.sheets['Pandas_Test']
worksheet.set_column('A:B', 25)
writer.save()
Strip the timezone from the dates if needed. I didn't have to do that.
See also Formatting of the Dataframe output in the XlsxWriter docs.
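A likely explanation for the different types: the `read_json()` docs describe a `keep_default_dates` rule under which a column is only treated as date-like by default if its label ends with `_at` or `_time`, begins with `timestamp`, or is `modified` or `date`. "action_time" matches that rule and "Triggered Time" (with a space) does not, which seems consistent with the rename making both columns parse. Instead of renaming, the columns can be named explicitly through `convert_dates`. The sketch below uses two sample records in the question's format and also shows one way to produce text that keeps the milliseconds and the trailing "Z"; the variable names are illustrative only.

```python
import io

import pandas as pd

# Two sample records in the same shape as the question's input.
lines = [
    '{"action_time":"2020-04-23T07:39:51.918Z","Triggered Time":"2020-04-23T07:39:51.900Z"}',
    '{"action_time":"2020-04-23T07:39:51.918Z","Triggered Time":"2020-04-23T07:39:51.900Z"}',
]
json_text = "[" + ",".join(lines) + "]"

# Name both columns explicitly instead of relying on the default name rule.
df = pd.read_json(
    io.StringIO(json_text),
    convert_dates=["action_time", "Triggered Time"],
)

for col in ["action_time", "Triggered Time"]:
    # Excel cannot store timezone-aware datetimes, so drop the timezone if present.
    if df[col].dt.tz is not None:
        df[col] = df[col].dt.tz_localize(None)

# Optional: a text version that keeps milliseconds and the trailing "Z".
# %f yields six digits (microseconds), so the last three are cut off.
iso_text = df["Triggered Time"].dt.strftime("%Y-%m-%dT%H:%M:%S.%f").str[:-3] + "Z"
print(df.dtypes)
print(iso_text)
```

Keeping real datetimes and applying `datetime_format` in the writer leaves the cells sortable as dates in Excel; the text version is only needed if the exact ISO string has to appear in the cell.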

Parse nested json data in dataframe

I have a delimited file in which one column contains key=value pairs, one of which holds JSON. I need to parse this data into a dataframe.
Below is the record format:
trx_id|name|service_context|status
abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success
I need to convert all the information from these records into this format:
trx_id|name |type|payload.trx_id|payload.name|payload.counter.counter_type|payload.counter.counter_info|.....|payload.renewal.flag|status
abc123|order|cdr |abc123 |abs |product |transfer |.....|0 |success
abc456|order|cdr |abc456 |abs |product | |.....|1 |success
Currently I parse the key=value data manually with sep=';|[|], remove everything up to the '=', and update the column names.
For the JSON, I run the command below; however, the result replaces the existing table and contains only the parsed JSON.
test_parse = pd.concat([pd.json_normalize(json.loads(js)) for js in test_parse['payload']])
Is there any way to avoid manual steps when processing this type of data?
The below hint will be sufficient to solve the problem.
Do it partwise for each column and then merge them together (you will need to remove the columns once you are able to split into multiple columns):
import ast
import pandas as pd
# parse the text after '=' in service_context and flatten it into 'payload.*' columns
x = pd.json_normalize(df3['service_context'].apply(lambda x: (ast.literal_eval(x.split('=')[1])))).add_prefix('payload.')
# pull the counter_type values out of each counter list, one column per list position
y = pd.DataFrame(x['payload.counter'].apply(lambda x:[i['counter_type'] for i in x]).to_list())
y = y.rename(columns={0: 'counter_type', 1:'counter_info'})
for row in x['payload.product']:
    z1 = pd.json_normalize(row)
    z2 = pd.json_normalize(z1['customer_spec.resource_pecification'][0])
    ### Write your own code.
It's really a 3-step approach:
1. use the primary pipe (|) delimiter
2. extract the key/value pairs
3. normalize the JSON
import pandas as pd
import io, json
# overall data structure is pipe delimited
df = pd.read_csv(io.StringIO("""abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success"""),
sep="|", header=None, names=["trx_id","name","data","status"])
df2 = pd.concat([
    df,
    # split the ";"-delimited key=value pairs in the 3rd column into sub-columns
    pd.DataFrame(
        [[c.split("=")[1] for c in r] for r in df.data.str.split(";")],
        columns=[c.split("=")[0] for c in df.data.str.split(";")[0]],
    )
], axis=1)
# extract the JSON payload into columns. This leaves embedded lists as they are, since these
# are many-to-many relationships that need to be worked out by the data owner
df3 = pd.concat([
    df2,
    pd.concat([pd.json_normalize(json.loads(p)).add_prefix("payload.") for p in df2.payload]).reset_index(),
], axis=1)
output
trx_id name data status type payload index payload.trx_id payload.name payload.counter payload.language payload.type payload.can_replace payload.product payload.renewal_flag payload.price.transaction payload.price.discount
0 abc123 order type=cdr;payload={"trx_id":"abc123","name":"ab... success cdr {"trx_id":"abc123","name":"abs","counter":[{"c... 0 abc123 abs [{'counter_type': 'product'}, {'counter_type':... id AD yes [{'flag': '0', 'identifier_flag': '0', 'custom... 0 1800 0
use with caution - explode() embedded lists
df3p = df3["payload.product"].explode().apply(pd.Series)
df3.join(df3.explode("payload.counter")["payload.counter"].apply(pd.Series)).join(
pd.json_normalize(df3p.join(df3p["customer_spec"].apply(pd.Series)).explode("resource_pecification").to_dict(orient="records"))
)
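One caveat with splitting on every "=" is that it would break if a JSON value ever contained an "=" character. A slightly more defensive variant, sketched below, cuts the payload off first and splits each remaining pair on its first "=" only. The helper name `parse_service_context` is hypothetical, the sample records are shortened versions of the ones above, and the sketch assumes `payload` is always the last key in the column.

```python
import json

import pandas as pd


def parse_service_context(text):
    """Turn 'type=cdr;payload={...}' into a dict with a decoded payload."""
    # Everything after ';payload=' is JSON (assumes payload is the last key).
    head, _, payload = text.partition(";payload=")
    # Split the remaining pairs on the first '=' only.
    fields = dict(pair.split("=", 1) for pair in head.split(";") if pair)
    fields["payload"] = json.loads(payload)
    return fields


# Shortened versions of the service_context values from the question.
records = [
    'type=cdr;payload={"trx_id":"abc123","name":"abs",'
    '"price":{"transaction":1800,"discount":0},"renewal_flag":"0"}',
    'type=cdr;payload={"trx_id":"abc456","name":"abs",'
    '"price":{"transaction":1800,"discount":0},"renewal_flag":"1"}',
]

# json_normalize flattens nested dicts into dotted column names.
flat = pd.json_normalize([parse_service_context(r) for r in records])
print(flat.columns.tolist())
```

Because the payload sits under a `payload` key, `json_normalize` produces the `payload.` prefix on its own, and the result can be joined back to the pipe-delimited columns with `pd.concat(..., axis=1)`.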

How to convert this json file to pandas dataframe

The format in the file looks like this:
{ 'match' : 'a', 'score' : '2'},{......}
I've tried pd.DataFrame, and I've also tried reading the file line by line, but it gives me everything in one cell.
I'm new to Python.
Thanks in advance.
The expected result is a pandas dataframe.
Try using the json_normalize() function.
Example:
import pandas as pd
values = [{'match': 'a', 'score': '2'}, {'match': 'b', 'score': '3'}, {'match': 'c', 'score': '4'}]
df = pd.json_normalize(values)
print(df)
If one line of your file corresponds to one JSON object, you can do the following:
# import library for working with JSON and pandas
import json
import pandas as pd
# make an empty list
data = []
# open your file and add every row as a dict to the list with data
with open("/path/to/your/file", "r") as file:
    for line in file:
        data.append(json.loads(line))
# make a pandas data frame
df = pd.DataFrame(data)
If there is more than one JSON object on a single line of your file, then you first have to find those JSON objects; there are two possible options for doing so. The solution with the second option (scanning the text with JSONDecoder.raw_decode) would look like this:
# import all you will need
import pandas as pd
import json
from json import JSONDecoder
# define function
def extract_json_objects(text, decoder=JSONDecoder()):
    pos = 0
    while True:
        match = text.find('{', pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1
# make an empty list
data = []
# open your file and add every JSON object as a dict to the list with data
with open("/path/to/your/file", "r") as file:
    for line in file:
        for item in extract_json_objects(line):
            data.append(item)
# make a pandas data frame
df = pd.DataFrame(data)
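Note that the sample in the question uses single quotes, which strict JSON parsers such as `json.loads` reject. If the file really contains Python-style dict literals separated by commas (an assumption, since only one line of the file is shown), a minimal sketch is to wrap the text in brackets and let `ast.literal_eval` read it as a list. The sample text below is made up in the question's format.

```python
import ast

import pandas as pd

# Made-up content in the format shown in the question (single-quoted keys and values).
text = "{ 'match' : 'a', 'score' : '2'},{ 'match' : 'b', 'score' : '3'}"

# Wrapping in [ ] turns the comma-separated dicts into one list literal.
# literal_eval only evaluates literals, so it does not execute arbitrary code.
records = ast.literal_eval("[" + text + "]")

df = pd.DataFrame(records)
print(df)
```

For a real file, `text` would come from `open(path).read()`; if the file turns out to be valid double-quoted JSON, the `json`-based solutions above are the better fit.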