I have a file composed of a single array containing multiple records.
{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}
I load it with the following code:
create or replace stage stage.test
url='azure://xxx/xxx'
credentials=(azure_sas_token='xxx');
create table if not exists stage.client (json_data variant not null);
copy into stage.client
from @stage.test/client_test.json
file_format = (type = 'JSON' strip_outer_array = true);
Snowflake imports the entire file as one row.
I would like the COPY INTO command to remove the outer array structure and load the records into separate table rows.
When I load larger files, I hit the size limit for VARIANT and get the error "Error parsing JSON: document is too large, max size 16777216 bytes".
If you can import the file into Snowflake, into a single row, then you can use LATERAL FLATTEN on the Client field to generate one row per element in the array.
Here's a blog post on LATERAL and FLATTEN (or you could look them up in the Snowflake docs):
https://support.snowflake.net/s/article/How-To-Lateral-Join-Tutorial
If the format of the file is, as specified, a single object with a single property containing an array with 500 MB worth of elements, then importing it may still work -- and if it does, LATERAL FLATTEN is exactly what you want. But that form is not particularly great for data processing, so you might want to use a text-processing script to massage the data if needed.
RECOMMENDATION #1:
The problem with your JSON is that it doesn't have an outer array. It has a single outer object containing a property with an inner array.
If you can fix the JSON so that the array itself is the outermost element, that would be the best solution, and then STRIP_OUTER_ARRAY will work as you expect.
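For illustration, the same sample with the wrapping object removed would need to look like this (the ClientBusiness arrays stay as they are):
[
  {
    "ClientNo": 1,
    "ClientName": "Alpha",
    "ClientBusiness": [ ... ]
  },
  {
    "ClientNo": 2,
    "ClientName": "Bravo",
    "ClientBusiness": [ ... ]
  }
]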
You could also try to recompose the JSON (an ugly business) after reading it line by line with:
CREATE OR REPLACE TABLE X (CLIENT VARCHAR);
COPY INTO X FROM (SELECT $1 CLIENT FROM @My_Stage/Client.json);
User Response to Recommendation #1:
Thank you. So from what I gather, COPY with STRIP_OUTER_ARRAY can handle a file starting and ending with square brackets, and parse the file as if they were not there.
The real files don't have line breaks, so I can't read the file line by line. I will see if the source system can change the export.
RECOMMENDATION #2:
Also, if you would like to see what the JSON parser does, you can experiment with the code below; I have parsed JSON in the COPY command using similar code. Working with your JSON data in a small project can help you shape the COPY command to work as intended.
CREATE OR REPLACE TABLE SAMPLE_JSON
(ID INTEGER,
DATA VARIANT
);
INSERT INTO SAMPLE_JSON(ID,DATA)
SELECT
1,parse_json('{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}');
SELECT
C.value:ClientNo AS ClientNo
,C.value:ClientName::STRING AS ClientName
,ClientBusiness.value:BusinessNo::Integer AS BusinessNo
,ClientBusiness.value:IndustryCode::Integer AS IndustryCode
from SAMPLE_JSON f
,table(flatten( f.DATA,'Client' )) C
,table(flatten(c.value:ClientBusiness,'')) ClientBusiness;
User Response to Recommendation #2:
Thank you for the parse_json example!
Trouble is, the real files are sometimes 500 MB, so the parse_json function chokes.
Follow-up on Recommendation #2:
For files this large, the JSON needs to be in the NDJSON (http://ndjson.org/) format, i.e. one complete JSON record per line. Otherwise the whole document must be parsed as a single value and can blow past the 16 MB VARIANT limit; a small script along the lines of the sketch below can do the conversion.
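For reference, here is a minimal Python sketch of such a conversion, assuming the source file still fits in memory on the machine doing the rewrite (a streaming JSON parser would be needed otherwise); the file names are placeholders:
import json

with open('client_test.json') as src, open('client_test.ndjson', 'w') as dst:
    doc = json.load(src)              # the single outer object
    for client in doc['Client']:      # the inner array of client records
        # One complete JSON document per line (NDJSON); each line then loads
        # as its own row with FILE_FORMAT = (TYPE = 'JSON').
        dst.write(json.dumps(client) + '\n')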
Hope the above helps others running into similar questions!
A sample of the JSON is as shown below:
{
"AN": {
"dates": {
"2020-03-26": {
"delta": {
"confirmed": 1
},
"total": {
"confirmed": 1
}
}
}
},
"KA": {
"dates": {
"2020-03-09": {
"delta": {
"confirmed": 1
},
"total": {
"confirmed": 1
}
},
"2020-03-10": {
"delta": {
"confirmed": 3
},
"total": {
"confirmed": 4
}
}
}
}
}
I would like to load it into a DataFrame such that the state names (AN, KA) become the row labels, and the dates and nested entries become the columns.
Any tips to achieve this would be very much appreciated. [I am aware of json_normalize; however, I haven't figured out how to apply it here yet.]
The output I am expecting is roughly as shown below:
Can you update your post with the DataFrame you have in mind? It'll be easier to understand what you want.
Also, sometimes it's better to reshape your data if you can't make it work the way it is now.
Update:
Following your update, here's what you can do.
You need to reshape your data; as I said, when you can't achieve what you want, it is best to look at the problem from another point of view. For instance (and from the sample you shared), the 'dates' key is meaningless, as the other keys are already dates and there are no other keys at the same level.
A way to achieve what you want would be to use a MultiIndex; it'll help you group your data the way you want. To use it you can, for instance, create all the indices you need and store the associated values in a dictionary.
Example:
If the only index you have is ('2020-03-26', 'delta', 'confirmed'), you should have values = {'AN': [1], 'KA': [None]}.
Then you only need to create your DataFrame and transpose it.
I gave it a quick try and came up with a piece of code that should work. If you're looking for performance, I don't think this will do the trick.
import pandas as pd
from copy import deepcopy

# d is the sample you shared
index = [[], [], []]
values = {}

# Get all the dates
dates = [date for c in d.keys() for date in d[c]['dates'].keys()]

for country in d.keys():
    # For each country we create an array containing all 6 values for each date
    # (missing values as None)
    values[country] = []
    for date in dates:
        if date in d[country]['dates']:
            for method in ['delta', 'total']:
                for step in ['confirmed', 'recovered', 'tested']:
                    # Incrementing indices
                    index[0].append(date)
                    index[1].append(method)
                    index[2].append(step)
                    if step in d[country]['dates'][date][method]:
                        values[country].append(deepcopy(d[country]['dates'][date][method][step]))
                    else:
                        values[country].append(None)
        # When a country does not have a date, fill with None
        else:
            for method in ['delta', 'total']:
                for step in ['confirmed', 'recovered', 'tested']:
                    index[0].append(date)
                    index[1].append(method)
                    index[2].append(step)
                    values[country].append(None)

# Removing duplicates introduced because we appended the indices once per country
# 3 is the number of steps
# 2 is the number of methods
number_of_rows = 3 * 2 * len(dates)
index[0] = index[0][:number_of_rows]
index[1] = index[1][:number_of_rows]
index[2] = index[2][:number_of_rows]

df = pd.DataFrame(values, index=index).T
Here is what I have for the transposed DataFrame of my output:
Hope this can help you
You clearly need to reshape your JSON data before loading it into a DataFrame.
Have you tried loading your JSON as a dict?
dataframe = pd.DataFrame.from_dict(JsonDict, orient="index")
The “orient” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
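Building on that idea, here is a minimal sketch, assuming the sample above sits in a file named sample.json (a placeholder name): it flattens the nested dates into tuple keys first, so the states end up as row labels and (date, delta/total, confirmed) become MultiIndex columns.
import json
import pandas as pd

# 'sample.json' is a placeholder name for the file holding the sample shown above.
with open('sample.json') as f:
    d = json.load(f)

# Flatten each state's nested dates into (date, method, metric) tuple keys.
flat = {
    state: {
        (date, method, metric): value
        for date, day in info['dates'].items()
        for method, metrics in day.items()
        for metric, value in metrics.items()
    }
    for state, info in d.items()
}

# orient='index' makes the states the row labels; the tuple keys become the
# columns, which are then turned into a proper MultiIndex.
df = pd.DataFrame.from_dict(flat, orient='index')
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)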
I have the following JSON file (banneds.json):
{
"players": [
{
"avatar": "https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/07/07aa315f664efa92456569429230bc2c254c3ff8_full.jpg",
"created": 1595050663,
"created_by": "<#128152620136267776>",
"nick": "teste",
"steam64": 76561198046619692
},
{
"avatar": "https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/21/21fa5c468597e9c890212b2e3bdb0fac781c040c_full.jpg",
"created": 1595056420,
"created_by": "<#128152620136267776>",
"nick": "ingridão",
"steam64": 76561199058918551
}
]
}
And I want to insert new values if the new value (entered by the user) is not already in the JSON. However, when I try to check whether the value is already there, I get a false result. An example of what I'm doing (not the original code, only an example):
import json

check = 76561198046619692

with open('banneds.json', 'r') as file:
    data = json.load(file)
    if check in data:
        print(True)
    else:
        print(False)
I'm always receiving the False result, but the value is there. Can someone shed some light on what I'm doing wrong, please? I tried the entire night to find a solution, but nothing works :(
Thanks for the help!
You are checking data as a dictionary object. if check in data checks whether the data object has a key matching the value of the check variable (data.keys() lists all the keys).
One quick-and-dirty way would be to use if str(check) in str(data["players"]), which converts the players list to a string and searches for a substring match.
If you want to make sure that check is compared only against the steam64 values, you can write a simple function that iterates over all "players" and checks their "steam64" values. Another solution would be to build a list of the "steam64" values for faster and easier checking, as sketched below.
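For instance, a minimal sketch of that list-based approach, reusing the data dict loaded from banneds.json as in the question:
# Build the list of banned steam64 IDs once; membership checks are then trivial.
steam_ids = [player['steam64'] for player in data['players']]

check = 76561198046619692
print(check in steam_ids)  # True for the sample file above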
You can use any() to check whether the value appears among the steam64 values.
For example:
import json

def check_value(data, val):
    return any(player['steam64'] == val for player in data['players'])

with open('banneds.json', 'r') as f_in:
    data = json.load(f_in)

print(check_value(data, 76561198046619692))
Prints:
True
I have some code which reads information in from a CSV file. The information is regarding road traffic accidents.
So with the code, I can target the values using d['Weather Conditions'], which gives me -
which is great! But what I'm actually looking to do is to store values of particular types in variables, e.g. so that all values equal to 'Fine without high winds' would be stored in an array-like object such as 'var windy' or something similar. Is there any way I could go about this simply?
Update:
some sample data
Any help would be appreciated. Thanks
Right now, in your code, you are simply printing to the console each weather condition you encounter as you go over the dataset row by row.
If you want to group your data based on the weather conditions, you can do it using d3.nest().
The example below uses some simple JSON data to show how you can do it. Here, using d3.nest(), we set the key to be the weather condition and the values to be the objects themselves.
After structuring, your data will be grouped; you can check by running the script below.
All accidents with rainy weather will be under the key rainy, and the same goes for all other weather conditions.
var data = [{
"Weather Condition": "rainy",
"property1": "val1",
"property2": "val2"
}, {
"Weather Condition": "sunny",
"property1": "val1",
"property2": "val2"
}, {
"Weather Condition": "sunny",
"property1": "val1",
"property2": "val2"
}];
// put the snippet below in your d3.csv function
var expensesByName = d3.nest()
.key(function(d) {
return d["Weather Condition"];
})
.entries(data);
console.log(expensesByName);
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.4.11/d3.min.js"></script>
I have several large json objects (think GB scale), where the object values in some of the innermost levels are arrays of objects. I'm using jq 1.4 and I'm trying to break these arrays into individual objects, each of which will have a key such as g__0 or g__1, where the numbers correspond to the index in the original array, as returned by the keys function. The number of objects in each array may be arbitrarily large (in my example it is equal to 3). At the same time I want to keep the remaining structure.
For what it's worth the original structure comes from MongoDB, but I am unable to change it at this level. I will then use this json file to create a schema for BigQuery, where an example column will be seeds.g__1.guid and so on.
What I have:
{
"port": 4500,
"notes": "This is an example",
"seeds": [
{
"seed": 12,
"guid": "eaf612"
},
{
"seed": 23,
"guid": "bea143"
},
{
"seed": 38,
"guid": "efk311"
}
]
}
What I am hoping to achieve:
{
"port": 4500,
"notes": "This is an example",
"seeds": {
"g__0": {
"seed": 12,
"guid": "eaf612"
},
"g__1": {
"seed": 23,
"guid": "bea143"
},
"g__2": {
"seed": 38,
"guid": "efk311"
}
}
}
Thanks!
The following jq program should do the trick. At least it produces the desired results for the given JSON. The program is so short and straightforward that I'll let it speak for itself:
def array2object(prefix):
. as $in
| reduce range(0;length) as $i ({}; .["\(prefix)\($i)"] = $in[$i]);
.seeds |= array2object("g__")
So, you essentially want to transpose (pivot) your data in the BigQuery table, so that instead of having the seeds data spread across rows (one row per array element), you have it laid out in columns such as seeds.g__0.seed, seeds.g__0.guid, seeds.g__1.seed, and so on.
Thus, my recommendation would be:
First, load your data as is to start with.
So now, instead of doing the schema transformation outside of BigQuery, let's rather do it within BigQuery!
Below is an example of how to achieve the transformation you are looking for (assuming you have at most three items/objects in the array):
#standardSQL
SELECT
port, notes,
STRUCT(
seeds[SAFE_OFFSET(0)] AS g__0,
seeds[SAFE_OFFSET(1)] AS g__1,
seeds[SAFE_OFFSET(2)] AS g__2
) AS seeds
FROM yourTable
You can test this with dummy data using a CTE, like below:
#standardSQL
WITH yourTable AS (
SELECT
4500 AS port, 'This is an example' AS notes,
[STRUCT<seed INT64, guid STRING>
(12, 'eaf612'), (23, 'bea143'), (38, 'efk311')
] AS seeds
UNION ALL SELECT
4501 AS port, 'This is an example 2' AS notes,
[STRUCT<seed INT64, guid STRING>
(42, 'eaf412'), (53, 'bea153')
] AS seeds
)
SELECT
port, notes,
STRUCT(
seeds[SAFE_OFFSET(0)] AS g__0,
seeds[SAFE_OFFSET(1)] AS g__1,
seeds[SAFE_OFFSET(2)] AS g__2
) AS seeds
FROM yourTable
So, technically, if you know the maximum number of items/objects in the seeds array, you can just write the needed SQL statement manually and run it against the real data.
Hope you got the idea.
Of course, you can script/automate the process; you can find examples of similar pivoting tasks here:
https://stackoverflow.com/a/40766540/5221944
https://stackoverflow.com/a/42287566/5221944
Is there any functionality to use logical operators in the JSON path file used in the COPY command?
For example, I have JSON which can contain a key that can be either Desc or Description.
So in the JSON it would be something like -
{
"Desc": "Hello",
"City" : "City1",
"Age": "21"
}
{
"Description" : "World",
"City" : "City2",
"Age": "25"
}
I'm using the COPY command to pull the data from the JSON above into my table in Redshift. The table has a column named "description_data", which should store the value from either "Desc" or "Description". So I want my path file to pick one or the other using an "OR" condition.
This is the path file that I'm currently using -
{
"jsonpaths": [
"$['Desc']",
"$['City']",
"$['Age']"
]
}
This is working fine.
What I'm trying to do is shown below (this is where I'm unsure whether there is any syntax or functionality to achieve the objective):
{
"jsonpaths": [
"$['Desc']" or "$['Description']",
"$['City']",
"$['Age']"
]
}
No, Redshift doesn't support this.
You can issue two COPY commands, one with Desc and another with Description, to load the data into two temporary tables. After that, you can merge the two into your final table.