I have a JSON file that I need to load into memory in chunks.
Consider this file.json example:
[{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1}
]
Then I want to use the pandas read_json function combined with chunksize:
import pandas as pd
file = "file.json"
dtype = {"weekly_report_day": str, "project_id": str, "actions": int, "event_id": int}
chunked = pd.read_json(file, orient = 'records', dtype=dtype, chunksize = 1, lines = True)
for df in chunked:
    print(df)
This, however, returns an error.
Could you suggest a fix? I need to read the file in chunks because the original data is very large.
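One likely cause, assuming the file really is a single JSON array as shown: chunksize only works together with lines=True, and lines=True makes read_json treat every line as a standalone JSON document. A line such as `[{...},` is not valid JSON on its own. Below is a minimal sketch of one workaround that uses only the standard library to stream the records and group them into chunks. The helper names iter_array_records and iter_chunks are hypothetical, and the code assumes the layout from the example (one object per line, brackets and commas only at the line edges).

```python
import json
from itertools import islice


def iter_array_records(path):
    """Yield one dict per line of a JSON array stored one record per line.

    Assumption: each record is a JSON object on its own line, with the
    array brackets and separating commas at the very start or end of a line.
    """
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            text = line.strip()
            if text.startswith("["):
                text = text[1:]          # drop the opening array bracket
            text = text.rstrip(",] ")    # drop a trailing comma or closing bracket
            if text:
                yield json.loads(text)


def iter_chunks(records, size):
    """Group an iterable of records into lists of at most `size` items."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each chunk is a list of dicts, so pd.DataFrame(chunk) should give one frame per chunk without loading the whole file. Alternatively, rewriting the file once as newline-delimited JSON (one object per line, no brackets or commas) should let the original read_json(..., lines=True, chunksize=...) call work as written.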
I was wondering if there is a way to remove or replace null values and empty square brackets in JSON or in a pandas DataFrame. I tried replacing them after converting everything to strings via .astype(str). That works, but it turns every JSON value into a string, so I can no longer process the data with its original structure. I would appreciate any solution or recommendation. Thanks.
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame({"col1": ["a", [1, 2, 3], [], "d"], "col2": ["e", [], "f", "g"]})
print(df)
# Output
Here is one way to do it:
# Replace empty lists with pd.NA; on pandas >= 2.1, DataFrame.map is the newer name for applymap
df = df.applymap(lambda x: pd.NA if isinstance(x, list) and not x else x)
print(df)
# Output
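If the empty lists are still inside parsed JSON (nested dicts and lists) rather than in a DataFrame, the same idea can be applied before building the frame. This is a minimal sketch, not part of the original answer; the function name replace_empty_lists and the choice of None as the replacement are assumptions.

```python
def replace_empty_lists(value, replacement=None):
    """Recursively replace every empty list in parsed JSON with `replacement`."""
    if isinstance(value, list):
        if not value:
            return replacement
        return [replace_empty_lists(item, replacement) for item in value]
    if isinstance(value, dict):
        return {key: replace_empty_lists(item, replacement) for key, item in value.items()}
    # Strings, numbers, booleans and None are returned unchanged.
    return value


record = {"col1": [], "col2": [1, 2, []], "col3": "d"}
cleaned = replace_empty_lists(record)
```

Because only empty lists are touched, the other values keep their original types, which avoids the everything-becomes-a-string problem from .astype(str).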
I have a webapp that reads from a redis database. The database returns a List of Strings in json format. See this code snippet:
import redis
r = redis.StrictRedis(**redis_config)
keys = r.keys(pattern="*")
redis_values = r.mget(keys)
print(redis_values[0:2])
print(type(redis_values))
print(type(redis_values[0]))
Output:
['{"timestampx": "1621544968.075360000", "length": "528", "dscp": "0", "srcip": "172.16.1.2", "destip": "172.17.4.2"}', '{"timestampx": "1621544968.075750000", "length": "96", "dscp": "0", "srcip": "172.17.4.2", "destip": "172.16.1.2"}']
<class 'list'>
<class 'str'>
I cannot get this List of JSON strings into a Pandas dataframe. If I use:
myFrame = pd.DataFrame(redis_values)
print(myFrame.head())
Output:
0
0 {"timestampx": "1621620153.864122000", "length...
1 {"timestampx": "1621620111.615499000", "length...
2 {"timestampx": "1621620157.386244000", "length...
3 {"timestampx": "1621620123.367638000", "length...
4 {"timestampx": "1621620152.200464000", "length...
That's a 1-column frame with strings, not a 5-column frame with the data pulled from JSON.
What if I use read_json?
myFrame = pd.read_json(redis_values)
Output:
ValueError: Invalid file path or buffer object type: <class 'list'>
That fails completely.
What if I convert the List of strings to a List of JSON objects?
myJson = []
for rv in redis_values:
    rv = json.loads(rv)
    myJson.append(rv)
myFrame = pd.read_json(myJson)
Output:
ValueError: Invalid file path or buffer object type: <class 'list'>
If I dump redis_values to a file and then use read_json it works, but that's incredibly inefficient.
f = open('myjson.txt','w')
for rv in redis_values:
    f.write(rv+'\n')
f.close()
myFrame = pd.read_json('myjson.txt', lines=True)
Converting a List of Strings in JSON to a DataFrame shouldn't be this difficult. Can you help me?
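Try using json.loads with pd.DataFrame.
Example:
import json
import pandas as pd
# Parse each JSON string into a dict, then build the frame from the list of dicts
df = pd.DataFrame(list(map(json.loads, redis_values)))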
print(df)
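The parsing step can be checked on its own, without Redis or pandas. The sketch below is an illustration under two assumptions: the values may arrive as bytes (redis-py returns bytes unless the client is created with decode_responses=True), and mget may return None for keys that no longer exist. The helper name parse_json_rows is hypothetical.

```python
import json


def parse_json_rows(values):
    """Turn a list of JSON strings (or bytes) into a list of dicts."""
    rows = []
    for raw in values:
        if raw is None:
            # Skip missing keys instead of failing on json.loads(None).
            continue
        rows.append(json.loads(raw))  # json.loads accepts both str and bytes
    return rows


sample_values = [
    '{"timestampx": "1621544968.075360000", "length": "528", "dscp": "0"}',
    b'{"timestampx": "1621544968.075750000", "length": "96", "dscp": "0"}',
    None,
]
rows = parse_json_rows(sample_values)
```

Passing rows to pd.DataFrame should then give one column per JSON key, since every element is a dict rather than a string.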
The format in the file looks like this:
{ 'match' : 'a', 'score' : '2'},{......}
I've tried pd.DataFrame and I've also tried reading the file line by line, but both put everything in one cell.
I'm new to Python.
Thanks in advance.
The expected result is a pandas DataFrame.
Try using the json_normalize() function.
Example:
# json_normalize is available as pd.json_normalize since pandas 1.0
import pandas as pd
values = [{'match': 'a', 'score': '2'}, {'match': 'b', 'score': '3'}, {'match': 'c', 'score': '4'}]
df = pd.json_normalize(values)
print(df)
Output:
If one line of your file corresponds to one JSON object, you can do the following:
# import library for working with JSON and pandas
import json
import pandas as pd
# make an empty list
data = []
# open your file and add every row as a dict to the list with data
with open("/path/to/your/file", "r") as file:
    for line in file:
        data.append(json.loads(line))
# make a pandas data frame
df = pd.DataFrame(data)
If there is more than one JSON object on a single line of your file, you first need to locate each object within the line. One option is to scan the text with JSONDecoder.raw_decode, which would look like this:
# import all you will need
import pandas as pd
import json
from json import JSONDecoder
# define function
def extract_json_objects(text, decoder=JSONDecoder()):
    pos = 0
    while True:
        match = text.find('{', pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1
# make an empty list
data = []
# open your file and add every JSON object as a dict to the list with data
with open("/path/to/your/file", "r") as file:
    for line in file:
        for item in extract_json_objects(line):
            data.append(item)
# make a pandas data frame
df = pd.DataFrame(data)
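One caveat: the sample in the question uses single quotes, which json.loads and raw_decode reject because JSON requires double quotes. If the file really contains Python-style dict literals separated by commas, a minimal sketch using ast.literal_eval from the standard library could look like this. The function name parse_literal_records and the completed sample string are assumptions made for illustration.

```python
import ast


def parse_literal_records(text):
    """Parse comma-separated Python-style dict literals into a list of dicts."""
    # Wrapping the text in brackets turns "{...},{...}" into one list literal.
    records = ast.literal_eval("[" + text.strip().rstrip(",") + "]")
    # Keep only dicts in case the text contains other literals.
    return [record for record in records if isinstance(record, dict)]


sample = "{ 'match' : 'a', 'score' : '2'},{ 'match' : 'b', 'score' : '3'}"
records = parse_literal_records(sample)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as the data list above. ast.literal_eval only evaluates literals, so it does not execute arbitrary code from the file.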
My Python code reads an Excel sheet and converts it into a JSON file. One column in the sheet contains either "Planned" or "Unplanned".
1) In the JSON output, I want Planned replaced with "1" and Unplanned replaced with "2", without changing anything in the Excel file.
2) I don't want the "data" key to appear in the output.
3) In Excel, my Start Time column value looks like "2018-11-16 08:00:00". I want the output to be "2018-11-16T08:00:00Z". Currently I am getting a garbage value.
Below is my code.
import xlrd, json, time, pytz, requests
from os import sys
from datetime import datetime, timedelta
from collections import OrderedDict
def json_from_excel():
    excel_file = 'test.xlsx'
    jsonfile = open('ExceltoJSON.json', 'w')
    data = []
    datestr = str(datetime.now().date())
    loaddata = OrderedDict()
    workbook = xlrd.open_workbook(excel_file)
    worksheet = workbook.sheet_by_name('OMS-GX Data Extraction')
    sheet = workbook.sheet_by_index(0)
    for j in range(0, 6):
        for i in range(1, 40):
            temp = {}
            temp["requestedStart"] = (sheet.cell_value(i,0)) #Start Time
            temp["requestedComplete"] = (sheet.cell_value(i, 1)) #End Time
            temp["location"] = (sheet.cell_value(i, 3)) #Station
            temp["equipment"] = (sheet.cell_value(i, 4)) #Device Name
            temp["switchOrderTypeID"] = (sheet.cell_value(i, 5)) #Outage Type
            data.append(temp)
    loaddata['data'] = data
    json.dump(loaddata, jsonfile, indent=3, sort_keys=False)
    jsonfile.write('\n')
    return loaddata
if __name__ == '__main__':
    data = json_from_excel()
Below is my sample output:
{
   "data": [
      {
         "requestedStart": testtime,
         "requestedComplete": testtime,
         "location": "testlocation",
         "equipment": "testequipment",
         "switchOrderTypeID": "Planned"
      },
      {
         "requestedStart": testtime,
         "requestedComplete": testtime,
         "location": "testlocation",
         "equipment": "testequipment",
         "switchOrderTypeID": "Unplanned"
      }
   ]
}
Answer to the 1st question:
You can use a conditional expression. The question asks for "1" for Planned and "2" for Unplanned:
temp["switchOrderTypeID"] = "1" if sheet.cell_value(i, 5) == "Planned" else "2"
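If more outage types might appear later, a lookup table can be easier to extend than a conditional. This is a minimal sketch of that alternative; the names OUTAGE_TYPE_IDS and outage_type_id are hypothetical, and returning the original label for unknown values is an assumption about the desired behavior.

```python
# Mapping from the Excel label to the ID wanted in the JSON output.
OUTAGE_TYPE_IDS = {"Planned": "1", "Unplanned": "2"}


def outage_type_id(label):
    """Return the ID for a label, or the label itself if it is not recognised."""
    return OUTAGE_TYPE_IDS.get(str(label).strip(), label)
```

Inside the loop this would be used as temp["switchOrderTypeID"] = outage_type_id(sheet.cell_value(i, 5)), which leaves the Excel file untouched.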
Answer to the 2nd question:
Use loaddata = data, so that the output is a plain array of JSON objects without the "data" key.
Answer to the 3rd question:
from dateutil.parser import parse
t = "2018-11-16 08:00:00"
# Parse the string, then format it as an ISO 8601 timestamp with a literal Z suffix
print(parse(t).strftime("%Y-%m-%dT%H:%M:%SZ"))
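The "garbage value" in the question is probably the raw Excel serial date: xlrd returns date cells as floats (days since the workbook's epoch), so there may be no string for dateutil to parse. The sketch below converts such a float with the standard library only. It assumes the common 1900 date system and a serial of 61 or greater; the name excel_serial_to_iso is hypothetical. xlrd also ships its own helper, xlrd.xldate.xldate_as_datetime, which takes the value and workbook.datemode.

```python
from datetime import datetime, timedelta

# Epoch that matches Excel's 1900 date system for serials >= 61
# (it compensates for Excel's non-existent 29 February 1900).
EXCEL_EPOCH_1900 = datetime(1899, 12, 30)


def excel_serial_to_iso(serial):
    """Convert an Excel serial date (float days) to 'YYYY-MM-DDTHH:MM:SSZ'."""
    moment = EXCEL_EPOCH_1900 + timedelta(days=float(serial))
    # Round to the nearest second to absorb floating-point noise in the fraction.
    moment = (moment + timedelta(microseconds=500_000)).replace(microsecond=0)
    # The trailing Z is a literal; this assumes the sheet's times are already UTC.
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
```

In the loop this would replace the plain cell read, for example temp["requestedStart"] = excel_serial_to_iso(sheet.cell_value(i, 0)).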
I am trying to read JSON from a file, extract some values, transform them, and write the result to a new file.
{
    "metadata": {
        "info": "important info"
    },
    "timestamp": "2018-04-06T12:19:38.611Z",
    "content": {
        "id": "1",
        "name": "name test",
        "objects": [
            {
                "id": "1",
                "url": "http://example.com",
                "properties": [
                    {
                        "id": "1",
                        "value": "1"
                    }
                ]
            }
        ]
    }
}
Above is the JSON that I read from the file.
Below is the Python program that gets the values, creates a new JSON document, and writes it to a file.
import json
from pprint import pprint
def load_json(file_name):
    return json.load(open(file_name))
def get_metadata(json):
    return json["metadata"]
def get_timestamp(json):
    return json["timestamp"]
def get_content(json):
    return json["content"]
def create_json(metadata, timestamp, content):
    dct = dict(__metadata=metadata, timestamp=timestamp, content=content)
    return json.dumps(dct)
def write_json_to_file(file_name, json_content):
    with open(file_name, 'w') as file:
        json.dump(json_content, file)
STACK_JSON = 'stack.json'
STACK_OUT_JSON = 'stack-out.json'
if __name__ == '__main__':
    json_content = load_json(STACK_JSON)
    print("Loaded JSON:")
    print(json_content)
    metadata = get_metadata(json_content)
    print("Metadata:", metadata)
    timestamp = get_timestamp(json_content)
    print("Timestamp:", timestamp)
    content = get_content(json_content)
    print("Content:", content)
    created_json = create_json(metadata, timestamp, content)
    print("\n\n")
    print(created_json)
    write_json_to_file(STACK_OUT_JSON, created_json)
The problem is that the created JSON is not correct. This is what ends up in the file:
"{\"__metadata\": {\"info\": \"important info\"}, \"timestamp\": \"2018-04-06T12:19:38.611Z\", \"content\": {\"id\": \"1\", \"name\": \"name test\", \"objects\": [{\"id\": \"1\", \"url\": \"http://example.com\", \"properties\": [{\"id\": \"1\", \"value\": \"1\"}]}]}}"
This is not what I want to achieve; it is not the JSON object I expected. What am I doing wrong?
Solution:
Change the write_json_to_file(...) function like this:
def write_json_to_file(file_name, json_content):
    with open(file_name, 'w') as file:
        file.write(json_content)
Explanation:
The problem is that when you call write_json_to_file(STACK_OUT_JSON, created_json) at the end of your script, the variable created_json contains a string: the JSON representation of the dictionary built in create_json(...). But inside write_json_to_file(file_name, json_content), you call:
json.dump(json_content, file)
This tells the json module to write the JSON representation of json_content (which is already a string) into the file. The JSON representation of a string is a single value enclosed in double quotes ("), with every double quote inside it escaped by \.
What you want is to write the value of json_content to the file as-is, without JSON-serializing it a second time.
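The effect is easy to reproduce in isolation. The following sketch uses a shortened dictionary (the data is made up for illustration) to show what one and two rounds of serialization produce.

```python
import json

data = {"timestamp": "2018-04-06T12:19:38.611Z", "content": {"id": "1"}}

# One round of serialization: the text of a JSON object.
encoded_once = json.dumps(data)

# Second round: the text above is treated as a plain string value,
# so it is wrapped in quotes and its inner quotes are escaped.
encoded_twice = json.dumps(encoded_once)

# Decoding the double-encoded text once only gets back the string.
decoded_once = json.loads(encoded_twice)
# A second decode is needed to recover the original dictionary.
decoded_twice = json.loads(decoded_once)
```

Here decoded_once is a str and decoded_twice is the original dict, which is why the file in the question looks like one long quoted string.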
Problem
You're converting a dict into a JSON string and then, right before writing it to the file, converting it to JSON again. Serializing an already-serialized string wraps it in quotes and turns every inner " into \", because the encoder treats the whole text as a single string value.
How to solve it?
A good pattern is to read the JSON file, convert it into a dict, and do all of your processing on the dict. Convert back to JSON only at the point where you print, write to a file, or return output. Serialization is not free: by my rough estimate json.dump() adds about 2 ms of overhead, which may not seem like much, but it is about four times the runtime of code that otherwise finishes in 500 microseconds.
Other Recommendations
From your code, I'm guessing you come from a Java background. In Java, getThis() and getThat() accessors are a good way to structure code, because everything lives in classes. In Python, trivial getter functions like these mostly hurt readability; the PEP 8 style guide favors direct access for simple data.
I've updated the code below:
import json
def get_contents_from_json(file_path) -> dict:
    """
    Reads the contents of the json file into a dict
    :param file_path:
    :return: A dictionary of all contents in the file.
    """
    try:
        with open(file_path) as file:
            contents = file.read()
        return json.loads(contents)
    except json.JSONDecodeError:
        print('Error while reading json file')
    except FileNotFoundError:
        print(f'The JSON file was not found at the given path: \n{file_path}')
def write_to_json_file(metadata, timestamp, content, file_path):
    """
    Creates a dict of all the data and then writes it into the file
    :param metadata: The meta data
    :param timestamp: the timestamp
    :param content: the content
    :param file_path: The file in which json needs to be written
    :return: None
    """
    output_dict = dict(metadata=metadata, timestamp=timestamp, content=content)
    with open(file_path, 'w') as outfile:
        json.dump(output_dict, outfile, sort_keys=True, indent=4, ensure_ascii=False)
def main(input_file_path, output_file_path):
    # get a dict from the loaded json
    data = get_contents_from_json(input_file_path)
    # the print() supports multiple args so you don't need multiple print statements
    print('JSON:', json.dumps(data), 'Loaded JSON as dict:', data, sep='\n')
    try:
        # load your data from the dict instead of the methods since it's more pythonic
        metadata = data['metadata']
        timestamp = data['timestamp']
        content = data['content']
        # just cumulating your print statements
        print("Metadata:", metadata, "Timestamp:", timestamp, "Content:", content, sep='\n')
        # write your json to the file.
        write_to_json_file(metadata, timestamp, content, output_file_path)
    except KeyError:
        print('Could not find the expected keys in the provided json')
    except TypeError:
        print('There is something wrong with the loaded data')
if __name__ == '__main__':
    main('stack.json', 'stack-out.json')
Advantages of the above code:
More Modular and hence easily unit testable
Handling of exceptions
Readable
More pythonic
Comments because they are just awesome!