How can I load the following JSON (deeply nested) to a DataFrame? - json

A sample of the JSON is as shown below:
{
"AN": {
"dates": {
"2020-03-26": {
"delta": {
"confirmed": 1
},
"total": {
"confirmed": 1
}
}
}
},
"KA": {
"dates": {
"2020-03-09": {
"delta": {
"confirmed": 1
},
"total": {
"confirmed": 1
}
},
"2020-03-10": {
"delta": {
"confirmed": 3
},
"total": {
"confirmed": 4
}
}
}
}
}
I would like to load it into a DataFrame, such that the state names (AN, KA) are represented as Row names, and the dates and nested entries are present as Columns.
Any tips to achieve this would be very much appreciated. [I am aware of json_normalize, however I haven't figured out how to work it out yet.]
The output I am expecting, is roughly as shown below:

Can you update your post with the DataFrame you have in mind ? It'll be easier to understand what you want.
Also sometimes it's better to reshape your data if you can't make it work the way they are now.
Update:
Following your update here's what you can do.
You need to reshape your data, as I said when you can't achieve what you want it is best to look at the problem from another point of view. For instance (and from the sample you shared) the 'dates' keys is meaningless as the other keys are already dates and there are no other keys ate the same level.
A way to achieve what you want would be to use MultiIndex, it'll help you group your data the way you want. To use it you can for instance create all the indices you need and store in a dictionary the values associated.
Example :
If the only index you have is ('2020-03-26', 'delta', 'confirmed') you should have values = {'AN' : [1], 'KA':None}
Then you only need to create your DataFrame and transpose it.
I gave it a quick try and came up with a piece of code that should work. If you're looking for performance I don't think this will do the trick.
import pandas as pd
# d is the sample you shared
index = [[],[],[]]
values = {}
# Get all the dates
dates = [date for c in d.keys() for date in d[c]['dates'].keys() ]
for country in d.keys():
# For each country we create an array containing all 6 values for each date
# (missing values as None)
values[country] = []
for date in dates:
if date in d[country]['dates']:
for method in ['delta', 'total']:
for step in ['confirmed', 'recovered', 'tested']:
# Incrementing indices
index[0].append(date)
index[1].append(method)
index[2].append(step)
if step in value.keys():
values[country].append(deepcopy(d[country]['dates'][date][method][step]))
else :
values[country].append(None)
# When country does not have a date fill with None
else :
for method in ['delta', 'total']:
for step in ['confirmed', 'recovered', 'tested']:
index[0].append(date)
index[1].append(method)
index[2].append(step)
values[country].append(None)
# Removing duplicates introduced because we added n_countries times
# the indices
# 3 is the number of steps
# 2 is the number of methods
number_of_rows = 3*2*len(dates)
index[0] = index[0][:number_of_rows]
index[1] = index[1][:number_of_rows]
index[2] = index[2][:number_of_rows]
df = pd.DataFrame(values, index=index).T
Here is what I have for the transposed data frame of my output :
Hope this can help you

You clearly needs to reshape your json data before load it into a DataFrame.
Have you tried load your json like a dict ?
dataframe = pd.DataFrame.from_dict(JsonDict, orient="index")
The “orient” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.

Related

Check if a value exists in a json file with python

I've the following json file (banneds.json):
{
"players": [
{
"avatar": "https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/07/07aa315f664efa92456569429230bc2c254c3ff8_full.jpg",
"created": 1595050663,
"created_by": "<#128152620136267776>",
"nick": "teste",
"steam64": 76561198046619692
},
{
"avatar": "https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/21/21fa5c468597e9c890212b2e3bdb0fac781c040c_full.jpg",
"created": 1595056420,
"created_by": "<#128152620136267776>",
"nick": "ingridão",
"steam64": 76561199058918551
}
]
}
And I want to insert new values if the new value (inserted by user) is not already in the json, however when I try to search if the value is already there I receive a false value, an example of what I'm doing ( not the original code, only an example ):
import json
check = 76561198046619692
with open('banneds.json', 'r') as file:
data = json.load(file)
if check in data:
print(True)
else:
print(False)
I'm always receiving the "False" result, but the value is there, someone can give me a light of what I'm doing wrong please? I tried the entire night to find a solution, but no one works :(
Thanks for the help!
You are checking data as a dictionary object. When checking using if check in data it checks if data object have a key matching the value of the check variable (data.keys() to list all keys).
One easy way would be to use if check in data["players"].__str__() which will convert value to a string and search for the match.
If you want to make sure that check value only checks for the steam64 values, you can write a simple function that will iterate over all "players" and will check their "steam64" values. Another solution would be to make list of "steam64" values for faster and easier checking.
You can use any() to check if value of steam64 key is there.
For example:
import json
def check_value(data, val):
return any(player['steam64']==val for player in data['players'])
with open('banneds.json', 'r') as f_in:
data = json.load(f_in)
print(check_value(data, 76561198046619692))
Prints:
True

What JSON format does STRIP_OUTER_ARRAY support?

I have a file composed of a single array containing multiple records.
{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}
I load it with the following code:
create or replace stage stage.test
url='azure://xxx/xxx'
credentials=(azure_sas_token='xxx');
create table if not exists stage.client (json_data variant not null);
copy into stage.client_test
from #stage.test/client_test.json
file_format = (type = 'JSON' strip_outer_array = true);
Snowflake imports the entire file as one row.
I would like the the COPY INTO command to remove the outer array structure and load the records into separate table rows.
When I load larger files, I hit the size limit for variant and get the error Error parsing JSON: document is too large, max size 16777216 bytes.
If you can import the file into Snowflake, into a single row, then you can use LATERAL FLATTEN on the Clients field to generate one row per element in the array.
Here's a blog post on LATERAL and FLATTEN (or you could look them up in the snowflake docs):
https://support.snowflake.net/s/article/How-To-Lateral-Join-Tutorial
If the format of the file is, as specified, a single object with a single property that contains an array with 500 MB worth of elements in it, then perhaps importing it will still work -- if that works, then LATERAL FLATTEN is exactly what you want. But that form is not particularly great for data processing. You might want to use some text processing script to massage the data if that's needed.
RECOMMENDATION #1:
The problem with your JSON is that it doesn't have an outer array. It has a single outer object containing a property with an inner array.
If you can fix the JSON, that would be the best solution, and then STRIP_OUTER_ARRAY will work as you expected.
You could also try to recompose the JSON (an ugly business) after reading line for line with:
CREATE OR REPLACE TABLE X (CLIENT VARCHAR);
COPY INTO X FROM (SELECT $1 CLIENT FROM #My_Stage/Client.json);
User Response to Recommendation #1:
Thank you. So from what I gather, COPY with STRIP_OUTER_ARRAY can handle a file starting and ending with square brackets, and parse the file as if they were not there.
The real files don't have line breaks, so I can't read the file line by line. I will see if the source system can change the export.
RECOMMENDATION #2:
Also if you would like to see what the JSON parser does, you can experiment using this code, I have parsed JSON on the copy command using similar code. Working with your JSON data in small project can help you shape the Copy command to work as intended.
CREATE OR REPLACE TABLE SAMPLE_JSON
(ID INTEGER,
DATA VARIANT
);
INSERT INTO SAMPLE_JSON(ID,DATA)
SELECT
1,parse_json('{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}');
SELECT
C.value:ClientNo AS ClientNo
,C.value:ClientName::STRING AS ClientName
,ClientBusiness.value:BusinessNo::Integer AS BusinessNo
,ClientBusiness.value:IndustryCode::Integer AS IndustryCode
from SAMPLE_JSON f
,table(flatten( f.DATA,'Client' )) C
,table(flatten(c.value:ClientBusiness,'')) ClientBusiness;
User Response to Recommendation #2:
Thank you for the parse_json example!
Trouble is, the real files are sometimes 500 MB, so the parse_json function chokes.
Follow-up on Recommendation #2:
The JSON needs to be in the NDJSON http://ndjson.org/ format. Otherwise the JSON will be impossible to parse because of the potential for large files.
Hope the above helps other running into similar questions!

Get corresponding time and value from json using Django (Python 2)

The following is a sample of the JSON that I'm fetching from the source. I need to get certain values from it such as time, value, value type. The time and value are in different arrays and on different depth.
So, my aim is to get the time and value like for value 1, I should get the timestamp 1515831588000, 2 for 1515838788000 and 3 for 1515845987000.
Please mind that this is just a sample JSON so ignore any mistakes that I may have made while writing this array.
{
"feed":{
"component":[
{
"stream":[
{
"statistic":[
{
"type":"DOUBLE",
"data":[
1,
2,
3
]
}
],
"time":[
1515831588000,
1515838788000,
1515845987000
]
}
]
}
]
},
"message":"",
"success":true
}
Here is a function, that I have written to get the values but the timestamp that I'm getting on the final step is incorrect. Please help to resolve this issue.
get_feed_data() is the function that is giving the above JSON.
# Fetch the component UIDs from the database
component_uids = ComponentDetail.objects.values('uid')
# Make sure we have component UIDs to begin with
if component_uids:
for component_uid in component_uids:
# Fetch stream UIDs for each component from the database
stream_uids = StreamDetail.objects.values('uid').filter(comp_id=ComponentDetail.objects.get(uid=component_uid['uid']))
for stream_uid in stream_uids:
feed_data = json.loads(get_feed_data(component_uid['uid'], stream_uid['uid']))
sd = StreamDetail.objects.get(uid=stream_uid['uid'])
for component in feed_data['feed']['component']:
for stream in component['stream']:
t = {}
d = {}
stats = {}
for time_value in stream['time']:
t = time_value
for stats in stream['statistic']:
for data_value in stats['data']:
print({
'uid': sd, value_type:stats['type'], timestamp=t, value=data_value
})
I finally got it dome using lists. I simply append the data in a list and call it later in a loop using the index and fetch the corresponding date and time.

Azure tables unable to store flattened JSON

I am using the npm flat package, and arrays/objects are flattened, but object/array keys are surrounded by '' , like in 'task_status.0.data' using the object below.
These specific fields do not get stored into AzureTables - other fields go through, but these are silently ignored. How would I fix this?
var obj1 = {
"studentId": "abc",
"task_status": [
{
"status":"Current",
"date":516760078
},
{
"status":"Late",
"date":1516414446
}
],
"student_plan": "n"
}
Here is how I am using it - simplified code example: Again, it successfully gets written to the table, but does not write the properties that were flattened (see further below):
var flatten = require('flat')
newObj1 = flatten(obj1);
var entGen = azure.TableUtilities.entityGenerator;
newObj1.PartitionKey = entGen.String(uniqueIDFromMyDB);
newObj1.RowKey = entGen.String(uniqueStudentId);
tableService.insertEntity(myTableName, newObj1, myCallbackFunc);
In the above example, the flattened object would look like:
var obj1 = {
studentId: "abc",
'task_status.0.status': 'Current',
'task_status.0.date': 516760078,
'task_status.1.status': 'Late',
'task_status.1.date': 516760078,
student_plan: "n"
}
Then I would add PartitionKey and RowKey.
all the task_status fields would silently fail to be inserted.
EDIT: This does not have anything to do with the actual flattening process - I just checked a perfectly good JSON object, with keys that had 'x.y.z' in it, i.e. AzureTables doesn't seem to accept these column names....which almost completely destroys the value proposition of storing schema-less data, without significant rework.
. in column name is not supported. You can use a custom delimiter to flatten your objects instead.
For example:
newObj1 = flatten(obj1, {delimiter: '__'});

How to fetch a JSON file to get a row position from a given value or argument

I'm using wget to fetch several dozen JSON files on a daily basis that go like this:
{
"results": [
{
"id": "ABC789",
"title": "Apple",
},
{
"id": "XYZ123",
"title": "Orange",
}]
}
My goal is to find row's position on each JSON file given a value or set of values (i.e. "In which row XYZ123 is located?"). In previous example ABC789 is in row 1, XYZ123 in row 2 and so on.
As for now I use Google Regine to "quickly" visualize (using the Text Filter option) where the XYZ123 is standing (row 2).
But since it takes a while to do this manually for each file I was wondering if there is a quick and efficient way in one go.
What can I do and how can I fetch and do the request? Thanks in advance! FoF0
In python:
import json
#assume json_string = your loaded data
data = json.loads(json_string)
mapped_vals = []
for ent in data:
mapped_vals.append(ent['id'])
The order of items in the list will be indexed according to the json data, since the list is a sequenced collection.
In PHP:
$data = json_decode($json_string);
$output = array();
foreach($data as $values){
$output[] = $values->id;
}
Again, the ordered nature of PHP arrays ensure that the output will be ordered as-is with regard to indexes.
Either example could be modified to use a mapped dictionary (python) or an associative array (php) if needs demand.
You could adapt these to functions that take the id value as an argument, track how far they are into the array, and when found, break out and return the current index.
Wow. I posted the original question 10 months ago when I knew nothing about Python nor computer programming whatsoever!
Answer
But I learned basic Python last December and came up with a solution for not only get the rank order but to insert the results into a MySQL database:
import urllib.request
import json
# Make connection and get the content
response = urllib.request.urlopen(http://whatever.com/search?=ids=1212,125,54,454)
content = response.read()
# Decode Json search results to type dict
json_search = json.loads(content.decode("utf8"))
# Get 'results' key-value pairs to a list
search_data_all = []
for i in json_search['results']:
search_data_all.append(i)
# Prepare MySQL list with ranking order for each id item
ranks_list_to_mysql = []
for i in range(len(search_data_all)):
d = {}
d['id'] = search_data_all[i]['id']
d['rank'] = i + 1
ranks_list_to_mysql.append(d)