I have a COPY statement that unloads from Snowflake to S3.
COPY INTO @MY_STAGE/path/my_filename FROM (
  SELECT OBJECT_CONSTRUCT(*) FROM my_table)
FILE_FORMAT = (TYPE = JSON COMPRESSION = NONE)
OVERWRITE = TRUE;
Current result in myfilename.json:
{
"a": 123,
"b": "def"
}
{
"a": 456,
"b": "ghi"
}
Using OBJECT_CONSTRUCT() produces output in NDJSON format. However, I want to save the file as a JSON array, such as:
[
{
"a": 123,
"b": "def"
},
{
"a": 456,
"b": "ghi"
}
]
The data needs to be stored in a single file. I know I need to do something with SELECT OBJECT_CONSTRUCT(*) FROM my_table, but I'm not sure how to transform it.
You can use ARRAY_AGG to achieve this:
COPY INTO @MY_STAGE/path/my_filename FROM (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*)) FROM my_table)
FILE_FORMAT = (TYPE = JSON COMPRESSION = NONE)
OVERWRITE = TRUE;
This function converts your record set into an array. Since Snowflake arrays are basically JSON arrays, the array returned by ARRAY_AGG can be written directly into the JSON file.
Unfortunately, ARRAY_AGG can hold at most 16 MB of data. If your dataset is larger than that, it isn't possible to write all of it into a single file in your desired format.
Related
I am new to Apache Spark and I am trying to compare two JSON files.
My requirement is to find out which keys/values were added, removed, or modified, and at what path.
To explain my problem, I am sharing the code I have tried with a small JSON sample.
Sample JSON 1 is:
{
"employee": {
"name": "sonoo",
"salary": 57000,
"married": true
  }
}
Sample JSON 2 is:
{
"employee": {
"name": "sonoo",
"salary": 58000,
"married": true
  }
}
My code is:
//Compare two multiline json files
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Load first json file
val jsonData_1 = sqlContext.read.json(sc.wholeTextFiles("D:\\File_1.json").values)
//Load second json file
val jsonData_2 = sqlContext.read.json(sc.wholeTextFiles("D:\\File_2.json").values)
//Compare both json files
jsonData_2.except(jsonData_1).show(false)
The output which I get on executing this code is:
+--------------------+
|employee |
+--------------------+
|{true, sonoo, 58000}|
+--------------------+
But here only one field, i.e. salary, was modified, so the output should contain only the updated field with its path.
The expected output would be:
[
{
"op" : "replace",
"path" : "/employee/salary",
"value" : 58000
}
]
Can anyone point me in the right direction?
Assuming each JSON has an identifier, and that you have two JSON groups (e.g. folders), you need to compare the JSONs between the two groups:
Load the JSONs from each group into a DataFrame, providing a schema matching the structure of the JSON. After this, you have two DataFrames.
Compare the JSONs (by now rows in a DataFrame) by joining on the identifiers and looking for mismatched values.
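A rough sketch of that approach in Spark/Scala (the id field and the group_1/group_2 folder names are placeholders of my own; the sample JSONs above don't actually carry an identifier):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("json-diff").getOrCreate()

// Read each folder of JSON files; multiLine handles pretty-printed JSON.
// An explicit schema could also be supplied via .schema(...) as described above.
val oldDf = spark.read.option("multiLine", "true").json("D:\\group_1")
val newDf = spark.read.option("multiLine", "true").json("D:\\group_2")

// Pull out the identifier and the field(s) you want to compare.
val oldFlat = oldDf.select(col("id"), col("employee.salary").as("old_salary"))
val newFlat = newDf.select(col("id"), col("employee.salary").as("new_salary"))

// Join on the identifier and keep only the rows whose values differ.
val changed = oldFlat.join(newFlat, Seq("id"))
  .filter(col("old_salary") =!= col("new_salary"))

changed.show(false)
From the changed rows you can then build the "op"/"path"/"value" entries you expect, one per mismatched field.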
I am new to Lua. I have to parse the JSON value below and read all the val and attrid values defined in attributes; more values might come in the attributes section. I tried with a table, but no luck. Any help will be appreciated.
{
"obj1": {
"attributes": [
{
"val": "1",
"attrid": "test2"
},
{
"val": "1",
"attrid": "test1"
}
],
"status": 0
}
}
-- Require some JSON library.
-- You can get lua-cjson from luarocks.
local json = require 'cjson'
-- You probably get this from a file or something in your actual code
local your_json_string = "string_containing_json"
-- Parse the json into a Lua table
local data = json.decode(your_json_string)
-- Iterate over the array like any other Lua sequence
for i, attribute in ipairs(data.obj1.attributes) do
-- Do whatever you want with the val and attrid values
print(attribute.val)
print(attribute.attrid)
end
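For the sample JSON above (assuming your_json_string actually holds it), this loop would print:
1
test2
1
test1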
I have a T-SQL stored procedure that generates a JSON file for the database I am working with. A certain part of the JSON file is an array of properties with a name, value, and requirement variable. That part of the JSON file looks like this:
"Properties": [
{
"PropertyName": "Foo"
"PropertyValue": "Bar"
"Required": "True"
},
{
"PropertyName": "Foo2"
"PropertyValue": "Bar2"
"Required": "False"
}
]
This is generated as a subquery of a larger SQL query that creates the entire JSON document:
SELECT ...,
(SELECT P.Name as PropertyName, P.Value as PropertyValue, P.Requirement as Required
FROM Properties P FOR JSON PATH) as Properties
FROM Foo
FOR JSON PATH
I want the JSON file to be laid out so the value of PropertyName is the key for the other two values:
"Properties" [
{"Foo": {{"PropertyValue": "Bar"},{"Required": "True"}},
etc.
]
Ideally the code could be something like:
(Select P.Value as [[P.Name].PropertyValue], P.Requirement as [[P.Name].Required]
FROM Properties P
FOR JSON PATH)
But obviously that isn't a legal expression. Is there any way I can do this within an SQL framework or should I modify the file after it's created?
A newbie to protobuf here. I am working on compressing a JSON file using protobuf. The problem is that this JSON file comes as a response from a web server and contains certain fields whose names are random, i.e. with each request posted to the server, the key names differ. For example, consider the JSON below:
{
"field1": [
{
"abc": "vala",
"def": "valb",
"ghi": "valc"
}
],
"field2": "val2",
"field3": "val3"
}
In the above JSON, the field names "abc", "def", and "ghi" can vary each time. Is there a way in protobuf to get field1's value in its entirety (as a single string or anything else) without losing the random fields inside it?
I think you want "struct.proto", i.e.
syntax = "proto3";
import "google/protobuf/struct.proto";
message Foo {
.google.protobuf.Struct field1 = 1;
string field2 = 2;
string field3 = 3;
}
or possibly (because of the array):
syntax = "proto3";
import "google/protobuf/struct.proto";
message Foo {
repeated .google.protobuf.Struct field1 = 1;
string field2 = 2;
string field3 = 3;
}
However, I should emphasize that protobuf isn't well suited to parsing arbitrary JSON; for that you should use a JSON library, not a protobuf library.
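For illustration, here is a rough Python sketch (assuming the repeated-Struct version of Foo above has been compiled with protoc into a foo_pb2 module); google.protobuf.json_format can parse the JSON while keeping the arbitrary keys inside the Struct:
# Sketch only: foo_pb2 is assumed to be generated from the repeated-Struct Foo above.
from google.protobuf import json_format
import foo_pb2

json_text = '''{
  "field1": [{"abc": "vala", "def": "valb", "ghi": "valc"}],
  "field2": "val2",
  "field3": "val3"
}'''

msg = foo_pb2.Foo()
json_format.Parse(json_text, msg)  # maps the JSON array onto the repeated Struct

# The dynamic keys survive inside the Struct and can be read back by name.
print(msg.field1[0]["abc"])  # vala
print(msg.field2)            # val2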
I have a Pandas DataFrame that I need to convert to JSON. The to_json() DataFrame method results in an acceptable format, but it converts my DataFrame index to strings (e.g. 0 becomes "0.0"). I need "0".
The DataFrame comes from JSON using the pd.io.json.read_json() method, which sets the index to float64.
Input JSON:
{"chemical": {"1": "chem2", "0": "chem1"},
"type": {"1": "pesticide", "0": "pesticide"}}
DataFrame (from read_json()):
chemical type
0 chem1 pesticide
1 chem2 pesticide
Produced JSON (from to_json()):
{"chemical": {"0.0": "chem1", "1.0": "chem2"},
"type": {"0.0": "pesticide", "1.0": "pesticide"}}
Needed JSON:
{"chemical": {"0": "chem1", "1": "chem2"},
"type": {"0": "pesticide", "1": "pesticide"}}
@shx2 pointed me in the right direction, but I changed my approach to creating the DataFrame from JSON.
Instead of using the read_json() method on a JSON string, I used the pd.DataFrame.from_dict() method on the JSON as a Python dictionary to create the DataFrame. This results in df.index.dtype == dtype('O').
I had to set dtype='float64' in the from_dict() method to set the correct dtype for the non-string entries.
pd_obj = pd.DataFrame.from_dict(request.json["inputs"], dtype='float64')
Seems like the dtype of the index is float (check df.index.dtype). You need to convert it to int:
df.index = df.index.astype(int)
df.to_json()
=> {"chemical": {"0": "chem1", "1": "chem2"}, "type": {"0": "pesticide", "1": "pesticide"}}