How to parse JSON efficiently in Snowpipe with ON_ERROR=CONTINUE

I'm setting up a Snowpipe to load data from an S3 bucket into a Snowflake schema.
S3 contains files in NDJSON format. One file can contain multiple records, and I want to process all of them, even if one record is broken.
To do so, I need to add the on_error='continue' option to the pipe creation and use the CSV file format, as stated in the official Snowflake docs here.
That way I receive raw JSON strings that I need to parse to access the data. And since Snowpipes do not support nested selects, the only way to do that is to parse the JSON for each column individually.
Resulting in this copy statement:
copy into MY_TABLE
from (select parse_json($1):id, parse_json($1):name, parse_json($1):status
      from @MY_STAGE_CONNECTED_TO_S3)
on_error = 'continue'
This code has to parse the JSON three times for each row.
I have a table with ~40 columns, and for it this query runs ~5 times slower than the more straightforward solution below, which uses the file format option to parse the JSON but unfortunately does not support the on_error='continue' option.
copy into HQO_DEVELOPMENT.PUBLIC.DIM_TENANT
from (select $1:id, $1:name, $1:status
      from @HQO_DEVELOPMENT.PUBLIC.DIM_TENANT_STAGE_NN)
file_format = (type = 'json')
What was tried
Using a nested select like this: it is not supported
copy into HQO_DEVELOPMENT.PUBLIC.DIM_TENANT from
(select $1:id, $1:name, $1:status from (
    select parse_json($1) from
    @HQO_DEVELOPMENT.PUBLIC.DIM_TENANT_STAGE))
on_error = 'continue'
Using the JSON file type on the stage and omitting it on the pipe: doesn't help the issue.
Is there a way to have the benefit of on_error='continue' and not parse the JSON for every column?

Snowflake documentation says: Both CSV and semi-structured file types are supported; however, even when loading semi-structured data (e.g. JSON), you should set CSV as the file format type (default value). You can use the corresponding file format (e.g. JSON), but any error in the transformation will stop the COPY operation, even if you set the ON_ERROR option to continue or skip the file.
https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-table.html
On the other hand, I see that the on_error option works for NDJSON files, at least when you set the file type at the stage level.
For testing, I created the following NDJSON file:
{ "id":1, "name":"Gokhan", "location":"Netherlands" }
{ "id":2, "name":"Hans", location:Germany -- broken json #1 }
{ "id":3, "name":"Joe", "location":"UK" }
{ broken json #2 }
{ "id":4, "name":"Mike", "location":"US" }
I created a file format object and created a stage using this file format (you can alter your existing stage and set a file format):
CREATE FILE FORMAT myformat TYPE = json;
CREATE STAGE mystage FILE_FORMAT = (FORMAT_NAME = myformat);
I uploaded my sample NDJSON file to this stage and created a Snowpipe to load it:
CREATE PIPE mypipe AS
COPY INTO mytable
FROM (SELECT $1:id, $1:name, $1:location FROM @mystage)
ON_ERROR = CONTINUE;
When I refreshed the pipe, it successfully loaded the three "valid" records from the file.

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON return, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe, so essentially I'll have one big data frame with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then I'm guessing when a value actually appears in that field it changes to a boolean. The problem is then that unionByName fails because of the different datatypes.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could another solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure if this is what you are looking for. I hope it helps.
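Extending that idea, here is a rough sketch (untested, and json_file_paths is a hypothetical list of the 88 downloaded file paths) of enforcing the same schema on every read and folding the files into one DataFrame, so the problem column can never be inferred as a boolean; unionByName with allowMissingColumns needs Spark 3.1+:
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

# hypothetical schema covering the attributes of the downloaded files
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = spark.createDataFrame([], my_schema)  # empty master DataFrame with the fixed schema
for path in json_file_paths:  # hypothetical list of the 88 downloaded JSON file paths
    df_json = spark.read.schema(my_schema).json(path)  # schema is enforced, never inferred
    output = output.unionByName(df_json, allowMissingColumns=True)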

Databricks Delta tables from json files: Ignore initial load when running COPY INTO

I am working with Databricks on AWS. I have mounted an S3 bucket as /mnt/bucket-name/. This bucket contains json files under the prefix jsons. I create a Delta table from these json files as follows:
%python
df = spark.read.json('/mnt/bucket-name/jsons')
df.write.format('delta').save('/mnt/bucket-name/delta')
%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
So far, so good. Then new json files arrive in the bucket. In order to update the Delta table, I run the following:
%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
This does indeed update the Delta table, but it duplicates the rows contained in the initial load, i.e. the rows in df are now contained in table_name twice. I have the following workaround, whereby I create an empty dataframe with the correct schema:
%python
df_schema = spark.read.json('/mnt/bucket-name/jsons').schema
df = spark.createDataFrame([], df_schema)
df.write.format('delta').save('/mnt/bucket-name/delta')
%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
This works and there is no duplication, but it seems neither elegant nor efficient, since spark.read.json('/mnt/bucket-name/jsons').schema reads all the json files, even though only the schema needs to be inferred. (The schema of the json files can be assumed to be stable.) Is there a way to tell COPY INTO to ignore the initial json files? There's the option modifiedAfter, but that would be cumbersome and doesn't sit well with idempotence. I also considered recreating the dataframe and then running df.write.format('delta').mode('append').save('/mnt/bucket-name/delta') followed by REFRESH TABLE default.table_name, but this seems inefficient, since why should the initial json files be read again? Edit: This method also duplicates the initial load.
Or is there a way to circumvent using a Spark dataframe entirely and create a Delta table from the json files directly? I have searched for such a solution but to no avail.
One last point: Schema inference is crucial and so I do not want a solution that requires the schema of the json files to be written out manually.
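For reference, a minimal sketch of a variation on the workaround above, assuming the schema can be inferred from a single representative file (the file name below is made up), so that not every json file has to be read just to get the schema:
%python
# hypothetical: infer the schema from one representative file instead of the whole prefix
df_schema = spark.read.json('/mnt/bucket-name/jsons/sample-file.json').schema

# create the empty Delta location with that schema, exactly as in the workaround above
spark.createDataFrame([], df_schema).write.format('delta').save('/mnt/bucket-name/delta')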

Index JSON filename along with JSON content in Solr

I have 2 directories: 1 with txt files and the other with corresponding JSON (metadata) files (around 90000 of each). There is one JSON file for each txt file, and they share the same name (they don't share any other fields). I am trying to index all these files in Apache Solr.
The txt files just have plain text. I mapped each line to a field called 'sentence' and included the file name as a field using the Data Import Handler. No problems here.
The JSON files have metadata: 3 tags: a URL, an author, and a title (for the content in the corresponding txt file).
When I index the JSON files (I just used the _default schema and posted the fields to the schema, as explained in the official Solr tutorial), I don't know how to get the file name into the index as a field. As far as I know, there's no way to use the Data Import Handler for JSON files. I've read that I can pass a literal through the bin/post tool, but again, as far as I understand, I can't pass in the file name dynamically as a literal.
I NEED to get the file name; it is the only way I can associate the metadata with each sentence in the txt files in my downstream Python code.
So if anybody has a suggestion about how I should index the JSON file name along with the JSON content (or even some workaround), I'd be eternally grateful.
As @MatsLindh mentioned in the comments, I used Pysolr to do the indexing and get the filename. It's pretty basic, but I thought I'd post what I did as Pysolr doesn't have much documentation.
So, here's how you use Pysolr to index multiple JSON files, while also indexing the file name of the files. This method can be used if you have your files and your metadata files with the same filename (but different extensions), and you want to link them together somehow, like in my case.
Open a connection to your Solr instance using the pysolr.Solr command.
Loop through the directory containing your files, and get the filename of each file using os.path.basename and store it in a variable (after removing the extension, if necessary).
Read the file's JSON content into another variable.
Pysolr expects whatever is to be indexed to be stored in a list of dictionaries where each dictionary corresponds to one record.
Store all the fields you want to index in a dictionary (solr_content in my code below) while making sure the keys match the field names in your managed-schema file.
Append the dictionary created in each iteration to a list (list_for_solr in my code).
Outside the loop, use the solr.add command to send your list of dictionaries to be indexed in Solr.
That's all there is to it! Here's the code.
import os
import json
from glob import iglob

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/collection_name')
folderpath = 'directory-where-the-files-are-present'  # placeholder for the metadata directory
list_for_solr = []
for filepath in iglob(os.path.join(folderpath, '*.meta')):
    with open(filepath, 'r') as file:
        filename = os.path.basename(filepath)
        # filename is xxxx.yyyy.meta
        filename_without_extension = '.'.join(filename.split('.')[:2])
        content = json.load(file)
        solr_content = {}
        solr_content['authors'] = content['authors']
        solr_content['title'] = content['title']
        solr_content['url'] = content['url']
        solr_content['filename'] = filename_without_extension
        list_for_solr.append(solr_content)
solr.add(list_for_solr)

append data to an existing json file

Appreciate it if someone can point me in the right direction here, a bit new to Python :)
I have a json file that looks like this:
[
    {
        "user": "user5",
        "games": "game1"
    },
    {
        "user": "user6",
        "games": "game2"
    },
    {
        "user": "user5",
        "games": "game3"
    },
    {
        "user": "user6",
        "games": "game4"
    }
]
And i have a small csv file that looks like this:
module_a,module_b
10,20
15,16
1,11
2,6
I am trying to append the csv data into the above mentioned json so it looks like this, keeping the order as it is:
[
    {
        "user": "user5",
        "module_a": "10",
        "games": "game1",
        "module_b": "20"
    },
    {
        "user": "user6",
        "module_a": "15",
        "games": "game2",
        "module_b": "16"
    },
    {
        "user": "user5",
        "module_a": "1",
        "games": "game3",
        "module_b": "11"
    },
    {
        "user": "user6",
        "module_a": "2",
        "games": "game4",
        "module_b": "6"
    }
]
What would be the best approach to achieve this while keeping the output order as it is?
Appreciate any guidance.
The JSON specification doesn't prescribe ordering, and it won't be enforced by any JSON parser (unless it's a default mode of operation of the underlying platform), so going to great lengths just to keep the order when processing JSON files is usually pointless. To quote:
An object is an unordered collection of zero or more name/value pairs, where a name is a string and a value is a string, number, boolean, null, object, or array.
...
JSON parsing libraries have been observed to differ as to whether or not they make the ordering of object members visible to calling software. Implementations whose behavior does not depend on member ordering will be interoperable in the sense that they will not be affected by these differences.
That being said, if you really insist on order, you can parse your JSON into a collections.OrderedDict (and write it back from it) which will allow you to inject data at specific places while keeping the overall order. So, first load your JSON as:
import json
from collections import OrderedDict

with open("input_file.json", "r") as f:  # open the JSON file for reading
    json_data = json.load(f, object_pairs_hook=OrderedDict)  # read & parse it
Now that you have your JSON, you can go ahead and load up your CSV, and since there's not much else to do with the data you can immediately apply it to the json_data. One caveat, though - since there is no direct map between the CSV and the JSON, one has to assume the index as being the map (i.e. the first CSV row being applied to the first JSON element etc.), so we'll use enumerate() to track the current index. There is also no info on where to insert individual values, so we'll assume that the first column goes after the first JSON object entry, the second goes after the second entry and so on, and since they can have different lengths we'll use itertools.izip_longest() to interleave them. So:
import csv
from itertools import izip_longest  # use zip_longest on Python 3.x

with open("input_file.csv", "rb") as f:  # open the CSV file for reading
    reader = csv.reader(f)  # build a CSV reader
    header = next(reader)  # let's store the header so we can get the key values later
    for index, row in enumerate(reader):  # enumerate and iterate over the rest
        if index >= len(json_data):  # there are more CSV rows than we have elements in JSON
            break
        row = [(header[i], v) for i, v in enumerate(row)]  # turn the row into element tuples
        # since collections.OrderedDict doesn't support random access by index we'll have to
        # rebuild it by mixing in the CSV elements with the existing JSON elements
        # use json_data[index].items() on Python 3.x
        data = (v for p in izip_longest(json_data[index].iteritems(), row) for v in p)
        # then finally overwrite the current element in json_data with a new OrderedDict
        json_data[index] = OrderedDict(data)
And with our CSV data nicely inserted into the json_data, all that's left is to write back the JSON (you may overwrite the original file if you wish):
with open("output_file.json", "w") as f:  # open the output JSON file for writing
    json.dump(json_data, f, indent=2)  # finally, write back the modified JSON
This will produce the result you're after. It even respects the names in the CSV header so you can replace them with bob and fred and it will insert those keys in your JSON. You can even add more of them if you need more elements added to your JSON.
Still, just because it's possible, you really shouldn't rely on JSON ordering. If it's user readability you're after, there are far more suitable formats with optional ordering, like YAML.

Remove new lines and tabs in luajson output

I'm trying to convert a saved Lua table into something I can parse more easily for inclusion on a web page. I'm using Lua for Windows from code.google's luaforwindows. It includes harningt's luajson for handling this conversion. I've been able to figure out how to load in the contents of the file. I get no errors, but the "json" it produces is invalid. It just encloses the entire thing in quotes and adds \n and \t. The file I'm reading is a .lua file, which follows this format:
MyBorrowedData = {
    ["first"] = {
        ["firstSub"] = {
            ["firstSubSub"] = {
                {
                    ["stuffHere"] = "someVal"
                },
                {
                    ["stuffHere2"] = "some2Val"
                },
            },
        },
    },
}
Note the comma following the final item in each "row" of the table; is that the issue? Is it valid Lua data? Given the output, I feel like Lua is unable to parse the table when I read it in. I believe this even more because when I try to just require the Lua data file, I seem to be unable to iterate through the table manually.
Can anyone tell me if it's a bug in the code or poorly formatted data that's causing the issue?
The Lua lifting is easy:
local json = require("json")
local file = "myBorrowedData.lua"
local jsonOutput = "myBorrowedOutput.json"
r = io.input(file)
t = io.read('*all')
u = io.output(jsonOutput)
s = json.encode(t)
io.write(s)
You're reading the file as plain text, and not loading the Lua code contained inside of it. t becomes a string with the Lua file's contents, which of course serializes to a plain JSON string.
To serialize the data in the Lua file, you need to run it.
local json = require("json") -- same luajson module as above
local file = assert(io.open("myBorrowedData.lua", "r")) -- open the saved Lua table source
local lua_code_string = file:read("*all") -- read the Lua code as a string
file:close()
local func = loadstring(lua_code_string) -- Load the Lua code from the string (Lua 5.1, as shipped with Lua for Windows)
local data = {} -- Create a table to receive the data
setfenv(func, data) -- Set the loaded function's environment, so any global writes that func does will go to the data table instead.
func() -- Run the loaded function.
local serialized_out = json.encode(data)
Also, ending the last item of a table with a comma (or semicolon) is perfectly valid syntax. In fact, I recommend it; you don't need to worry about adding a comma to the former last object when adding new items to the table.