How to select right values in JSON file in pyspark - json

I got a json file similar to this.
"code": 298484,
"details": {
"date": "0001-01-01",
"code" : 0
}
code appears twice, one is filled and the other one is empty. I need the first one with the data in details. What is the approach in pyspark?
I tried to filter
df = rdd.map(lambda r: (r['code'], r['details'])).toDF()
But it shows _1, _2 (no schema).

Please try the following:
spark.read.json("path to json").select("code", "details.date")
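The dotted path works because the two code fields sit at different nesting levels. A minimal plain-Python sketch (not Spark; wrapping the question's fragment in braces to form a complete object is an assumption about the full file) shows the same distinction:

```python
import json

# Assumption: the question's fragment, wrapped in braces to form a full object.
raw = '{"code": 298484, "details": {"date": "0001-01-01", "code": 0}}'
record = json.loads(raw)

top_level_code = record["code"]          # the field select("code") refers to
nested_date = record["details"]["date"]  # the field select("details.date") refers to
nested_code = record["details"]["code"]  # a different field at a deeper path

print(top_level_code, nested_date, nested_code)
```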

Related

How can I load the following JSON (deeply nested) to a DataFrame?

A sample of the JSON is as shown below:
{
"AN": {
"dates": {
"2020-03-26": {
"delta": {
"confirmed": 1
},
"total": {
"confirmed": 1
}
}
}
},
"KA": {
"dates": {
"2020-03-09": {
"delta": {
"confirmed": 1
},
"total": {
"confirmed": 1
}
},
"2020-03-10": {
"delta": {
"confirmed": 3
},
"total": {
"confirmed": 4
}
}
}
}
}
I would like to load it into a DataFrame, such that the state names (AN, KA) are represented as Row names, and the dates and nested entries are present as Columns.
Any tips to achieve this would be very much appreciated. [I am aware of json_normalize, however I haven't figured out how to work it out yet.]
The output I am expecting, is roughly as shown below:
Can you update your post with the DataFrame you have in mind ? It'll be easier to understand what you want.
Also sometimes it's better to reshape your data if you can't make it work the way they are now.
Update:
Following your update here's what you can do.
You need to reshape your data. As I said, when you can't achieve what you want, it is best to look at the problem from another point of view. For instance (and from the sample you shared), the 'dates' key is meaningless, as the keys below it are already dates and there are no other keys at the same level.
A way to achieve what you want would be to use MultiIndex, it'll help you group your data the way you want. To use it you can for instance create all the indices you need and store in a dictionary the values associated.
Example :
If the only index you have is ('2020-03-26', 'delta', 'confirmed'), you should have values = {'AN': [1], 'KA': [None]}
Then you only need to create your DataFrame and transpose it.
I gave it a quick try and came up with a piece of code that should work. If you're looking for performance I don't think this will do the trick.
import pandas as pd

# d is the sample you shared
methods = ['delta', 'total']
steps = ['confirmed', 'recovered', 'tested']
# Get all the dates, sorted and without duplicates
dates = sorted({date for c in d for date in d[c]['dates']})
# One (date, method, step) index entry per row
index = pd.MultiIndex.from_product([dates, methods, steps])
values = {}
for country in d:
    # For each country we create a list containing all 6 values for each date
    # (missing values as None)
    values[country] = []
    for date in dates:
        # When the country does not have this date, day is {} and
        # every lookup below falls back to None
        day = d[country]['dates'].get(date, {})
        for method in methods:
            for step in steps:
                values[country].append(day.get(method, {}).get(step))
df = pd.DataFrame(values, index=index).T
Here is what I have for the transposed data frame of my output :
Hope this can help you
You clearly need to reshape your JSON data before loading it into a DataFrame.
Have you tried loading your JSON as a dict?
dataframe = pd.DataFrame.from_dict(JsonDict, orient="index")
The “orient” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
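The reshaping idea above can also be sketched without pandas. This is only an illustration under assumptions: flatten_states is a hypothetical helper name, and the sample is a shortened copy of the JSON from the question. It produces one flat record per state, keyed by (date, method, step) tuples.

```python
def flatten_states(data):
    """Return {state: {(date, method, step): value}} from the nested JSON."""
    rows = {}
    for state, state_data in data.items():
        row = {}
        for date, methods in state_data.get("dates", {}).items():
            for method, steps in methods.items():
                for step, value in steps.items():
                    row[(date, method, step)] = value
        rows[state] = row
    return rows


# Shortened copy of the sample from the question
sample = {
    "AN": {"dates": {"2020-03-26": {"delta": {"confirmed": 1},
                                    "total": {"confirmed": 1}}}},
    "KA": {"dates": {"2020-03-10": {"delta": {"confirmed": 3},
                                    "total": {"confirmed": 4}}}},
}

rows = flatten_states(sample)
print(rows["KA"][("2020-03-10", "total", "confirmed")])
```

The states end up as the outer keys, which is the row-oriented layout that orient="index" describes. Dates a state does not have are simply absent from its record rather than stored as None.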

Check if a value exists in a json file with python

I've the following json file (banneds.json):
{
"players": [
{
"avatar": "https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/07/07aa315f664efa92456569429230bc2c254c3ff8_full.jpg",
"created": 1595050663,
"created_by": "<#128152620136267776>",
"nick": "teste",
"steam64": 76561198046619692
},
{
"avatar": "https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/21/21fa5c468597e9c890212b2e3bdb0fac781c040c_full.jpg",
"created": 1595056420,
"created_by": "<#128152620136267776>",
"nick": "ingridão",
"steam64": 76561199058918551
}
]
}
And I want to insert new values if the new value (inserted by user) is not already in the json, however when I try to search if the value is already there I receive a false value, an example of what I'm doing ( not the original code, only an example ):
import json
check = 76561198046619692
with open('banneds.json', 'r') as file:
    data = json.load(file)
    if check in data:
        print(True)
    else:
        print(False)
I'm always receiving the "False" result, but the value is there, someone can give me a light of what I'm doing wrong please? I tried the entire night to find a solution, but no one works :(
Thanks for the help!
You are checking data as a dictionary object. Writing if check in data tests whether data has a key equal to the value of check (use data.keys() to list all keys), and the only key here is "players".
One quick way would be to use if str(check) in str(data["players"]), which converts both sides to strings and searches for a match. Note that this is a substring search, so it can also match the same digits appearing in another field.
If you want to make sure that check value only checks for the steam64 values, you can write a simple function that will iterate over all "players" and will check their "steam64" values. Another solution would be to make list of "steam64" values for faster and easier checking.
You can use any() to check if value of steam64 key is there.
For example:
import json

def check_value(data, val):
    return any(player['steam64'] == val for player in data['players'])

with open('banneds.json', 'r') as f_in:
    data = json.load(f_in)
    print(check_value(data, 76561198046619692))
Prints:
True
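Since the goal is to insert a player only when the ID is new, the lookup and the insert can be combined. A minimal sketch, assuming the file layout from the question; add_player is a hypothetical helper, and the sample data is a shortened stand-in for banneds.json:

```python
import json


def add_player(data, player):
    """Append player unless its steam64 ID is already present.

    Returns True if the player was added, False otherwise.
    """
    known_ids = {p["steam64"] for p in data["players"]}
    if player["steam64"] in known_ids:
        return False
    data["players"].append(player)
    return True


# Shortened stand-in for the contents of banneds.json
data = json.loads('{"players": [{"nick": "teste", "steam64": 76561198046619692}]}')

first = add_player(data, {"nick": "teste", "steam64": 76561198046619692})
second = add_player(data, {"nick": "new", "steam64": 76561199058918551})
print(first, second, len(data["players"]))
```

To persist the change, the updated data would then be written back to the file with json.dump.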

What JSON format does STRIP_OUTER_ARRAY support?

I have a file composed of a single array containing multiple records.
{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}
I load it with the following code:
create or replace stage stage.test
url='azure://xxx/xxx'
credentials=(azure_sas_token='xxx');
create table if not exists stage.client (json_data variant not null);
copy into stage.client_test
from @stage.test/client_test.json
file_format = (type = 'JSON' strip_outer_array = true);
Snowflake imports the entire file as one row.
I would like the COPY INTO command to remove the outer array structure and load the records into separate table rows.
When I load larger files, I hit the size limit for variant and get the error Error parsing JSON: document is too large, max size 16777216 bytes.
If you can import the file into Snowflake as a single row, then you can use LATERAL FLATTEN on the Client field to generate one row per element in the array.
Here's a blog post on LATERAL and FLATTEN (or you could look them up in the snowflake docs):
https://support.snowflake.net/s/article/How-To-Lateral-Join-Tutorial
If the format of the file is, as specified, a single object with a single property that contains an array with 500 MB worth of elements in it, then perhaps importing it will still work -- if that works, then LATERAL FLATTEN is exactly what you want. But that form is not particularly great for data processing. You might want to use some text processing script to massage the data if that's needed.
RECOMMENDATION #1:
The problem with your JSON is that it doesn't have an outer array. It has a single outer object containing a property with an inner array.
If you can fix the JSON, that would be the best solution, and then STRIP_OUTER_ARRAY will work as you expected.
You could also try to recompose the JSON (an ugly business) after reading it line by line with:
CREATE OR REPLACE TABLE X (CLIENT VARCHAR);
COPY INTO X FROM (SELECT $1 CLIENT FROM @My_Stage/Client.json);
User Response to Recommendation #1:
Thank you. So from what I gather, COPY with STRIP_OUTER_ARRAY can handle a file starting and ending with square brackets, and parse the file as if they were not there.
The real files don't have line breaks, so I can't read the file line by line. I will see if the source system can change the export.
RECOMMENDATION #2:
Also, if you would like to see what the JSON parser does, you can experiment with the code below; I have parsed JSON in the COPY command using similar code. Working with your JSON data in a small project can help you shape the COPY command to work as intended.
CREATE OR REPLACE TABLE SAMPLE_JSON
(ID INTEGER,
DATA VARIANT
);
INSERT INTO SAMPLE_JSON(ID,DATA)
SELECT
1,parse_json('{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}');
SELECT
C.value:ClientNo AS ClientNo
,C.value:ClientName::STRING AS ClientName
,ClientBusiness.value:BusinessNo::Integer AS BusinessNo
,ClientBusiness.value:IndustryCode::Integer AS IndustryCode
from SAMPLE_JSON f
,table(flatten( f.DATA,'Client' )) C
,table(flatten(c.value:ClientBusiness,'')) ClientBusiness;
User Response to Recommendation #2:
Thank you for the parse_json example!
Trouble is, the real files are sometimes 500 MB, so the parse_json function chokes.
Follow-up on Recommendation #2:
The JSON needs to be in the NDJSON format (http://ndjson.org/), with one record per line. Otherwise large files cannot be parsed, because the whole document is treated as a single value and runs into the size limit mentioned above.
Hope the above helps others running into similar questions!
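If the source system cannot change its export, a small preprocessing script could rewrite the file before staging. The sketch below is an illustration under assumptions, not part of the Snowflake tooling: to_ndjson is a hypothetical helper, it assumes the single top-level key is "Client", and it loads the whole document into memory, so very large files would likely need a streaming parser instead.

```python
import io
import json


def to_ndjson(source, target, key="Client"):
    """Write each element of the array under `key` as one JSON document per line."""
    document = json.load(source)
    count = 0
    for record in document[key]:
        target.write(json.dumps(record) + "\n")
        count += 1
    return count


# In-memory stand-ins for the input and output files
source = io.StringIO('{"Client": [{"ClientNo": 1, "ClientName": "Alpha"},'
                     ' {"ClientNo": 2, "ClientName": "Bravo"}]}')
target = io.StringIO()
written = to_ndjson(source, target)
lines = target.getvalue().splitlines()
print(written, lines[0])
```

With one record per line, each line is a complete JSON document on its own, so no single value has to hold the entire file.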

Python: create json query

I'm trying to get python to create a json formatted like :
[
{
"machine_working": true
},
{
"MachineName": "TBL165-169",
"MachineType": "Rig Test"
}
]
However, I can't seem to do it. This is the code I have currently, but it's giving me an error:
this_is_a_dict_too=[]
this_is_a_dict_too = dict(State="on",dict(MachineType="machinetype1",MachineName="MachineType2"))
File "c:\printjson.py", line 40
this_is_a_dict_too = dict(Statedsf="test",dict(MachineType="Rig Test",MachineName="TBL165-169"))
SyntaxError: non-keyword arg after keyword arg
this_is_a_dict_too = [dict(machine_working=True),dict(MachineType="machinetype1",MachineName="MachineType2")]
print(this_is_a_dict_too)
You are trying to put a dictionary inside a dictionary. The error message says that you passed an argument without a name (a corresponding key) after a keyword argument.
dict(a='b', b=dict(state='on'))
will work, but
dict(a='b', dict(state='on'))
won't.
What you presented is a list, so you can use
list((dict(a='b'), dict(b='a')))
Note that the example above uses two dictionaries packed into a tuple.
or
[ dict(a='b'), dict(b='a') ]
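To turn such a list into the JSON text from the question, json.dumps can serialize it. A minimal sketch using the values from the question (the name payload is only illustrative):

```python
import json

payload = [
    dict(machine_working=True),
    dict(MachineName="TBL165-169", MachineType="Rig Test"),
]

# Python's True is serialized as JSON true
text = json.dumps(payload, indent=2)
print(text)
```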

Python3: JSON to CSV

I have a JSON dict in Python which I would like to parse into a CSV, my data and code looks like this:
import csv
import json
x = {
"success": 1,
"return": {
"variable_id": {
"var1": "val1",
"var2": "val2"
}...
f = csv.writer(open("foo.csv", "w", newline=''))
for x in x:
    f.writerow([x["success"],
                '--variable value--',
                x["return"]["variable_id"]["var1"],
                x["return"]["variable_id"]["var2"]])
However, since variable_id's value is going to change I don't know how to refer to in the code. Apologies if this is trivial but I guess I lack the terminology to find the solution.
You can use the * (unpack) operator to do this, assuming only the values in your variable_id matter :
f.writerow([x["success"],
            '--variable value--',
            *[val for variable_id in x['return'].values() for val in variable_id.values()]])
The unpack operator essentially takes everything in x["return"]["variable_id"].values() and appends it in the list you're creating as input for writerow.
EDIT: this should now work even if you don't know how to reference variable_id. It works best if you have several variable_ids in x['return'].
If you only have one variable_id, then you can also try this :
f.writerow([x["success"],
'--variable value--',
*list(x['return'].values())[0].values()])
Or
f.writerow([x["success"],
'--variable value--',
*next(iter(x['return'].values())).values()])
You can get the variable_id key itself using list(x['return'].keys())[0].
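Putting the pieces together, a minimal end-to-end sketch might look like this. It assumes x holds a single record with exactly one key under "return" (the sample below uses made-up values in place of the elided part of the question's data) and writes to an in-memory buffer instead of foo.csv:

```python
import csv
import io

# Made-up record; "abc123" stands in for the changing variable_id key
x = {"success": 1, "return": {"abc123": {"var1": "val1", "var2": "val2"}}}

# Take the only key under "return" without knowing its name in advance
variable_id, fields = next(iter(x["return"].items()))

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([x["success"], variable_id, *fields.values()])

print(buffer.getvalue())
```

Here the key itself is written in place of the '--variable value--' placeholder; whether that is the intended content of that column is an assumption.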