How to handle NaN in data in DBT model having JSON format? - json

We have a DBT model which is used to transform data in Athena and S3 with the following configuration:
{{
  config(
    materialized='incremental',
    external_location="s3://" + env_var('ORG') + "-" + env_var('ENV') + "-lan" + "/act/" + this.identifier,
    partitioned_by=['e_date'],
    incremental_strategy='append',
    format='json'
  )
}}
For one particular Athena table, we get the following error when we query it:
HIVE_CURSOR_ERROR: io.trino.hive.$internal.org.codehaus.jackson.JsonParseException: Non-standard token 'NaN': enable JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS to allow
Can you suggest changes to the config so that the JSON parser will parse the NaN values as well?
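For context on why the parser complains: NaN is not a legal token in standard JSON (RFC 8259), so strict parsers such as the Jackson reader behind Athena/Trino reject it unless a non-standard extension is enabled. A minimal Python sketch of the spec behaviour, plus a common workaround (emitting null instead of NaN before the data lands in S3; the field name is illustrative):

```python
import json
import math

# Python's json module accepts NaN by default (a non-standard extension),
# while a strict, spec-compliant parser rejects it, which is exactly what
# Athena's Jackson-based JSON reader does.
doc = json.loads('{"value": NaN}')   # lenient parse succeeds
print(doc["value"])

# Serializing with allow_nan=False enforces strict RFC 8259 JSON:
try:
    json.dumps({"value": float("nan")}, allow_nan=False)
except ValueError as e:
    print("strict JSON rejects NaN:", e)

# Workaround sketch: replace NaN with null (None) before writing the JSON.
clean = {k: (None if isinstance(v, float) and math.isnan(v) else v)
         for k, v in doc.items()}
print(json.dumps(clean))
```

In other words, rather than loosening the reader, the usual fix is to sanitize NaN values to null in the model (or upstream) before they are written to S3.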

Related

How to parse json and replace a value in a nested json?

I have the below json:
{
"status":"success",
"data":{
"_id":"ABCD",
"CNTL":{"XMN Version":"R3.1.0"},
"OMN":{"dree":["ANY"]},
"os0":{
"Enable":true,"Service Reference":"","Name":"",
"TD ex":["a0.c985.c0"],
"pn ex":["s0.c100.c0"],"i ex":{},"US Denta Treatment":"copy","US Denta Value":0,"DP":{"Remote ID":"","cir ID":"","Sub Options":"","etp Number":54469},"pe":{"Remote ID":"","cir ID":""},"rd":{"can Identifier":"","can pt ID":"","uno":"Default"},"Filter":{"pv":"pass","pv6":"pass","ep":"pass","pe":"pass"},"sc":"Max","dc":"","st Limit":2046,"dm":false},
"os1":{
"Enable":false,"Service Reference":"","Name":"",
"TD ex":[],
"pn ex":[],"i ex":{},"US Denta Treatment":"copy","US Denta Value":0,"DP":{"Remote ID":"","cir ID":"","Sub Options":"","etp Number":54469},"pe":{"Remote ID":"","cir ID":""},"rd":{"can Identifier":"","can pt ID":"","uno":"Default"},"Filter":{"pv":"pass","pv6":"pass","ep":"pass","pe":"pass"},"sc":"Max","dc":"","st Limit":2046,"dm":false},
"ONM":{
"ONM-ALARM-XMN":"Default","Auto Boot Mode":false,"XMN Change Count":0,"CVID":0,"FW Bank Files":[],"FW Bank":[],"FW Bank Ptr":65535,"pn Max Frame Size":2000,"Realtime Stats":false,"Reset Count":0,"SRV-XMN":"Unmodified","Service Config Once":false,"Service Config pts":[],"Skip ot":false,"Name":"","Location":"","dree":"","Picture":"","Tag":"","PHY Delay":0,"Labels":[],"ex":"From OMN","st Age":60,"Laser TX Disable Time":0,"Laser TX Disable Count":0,"Clear st Count":0,"MIB Reset Count":0,"Expected ID":"ANY","Create Date":"2023-02-15 22:41:14.422681"},
"SRV-XMN Values":{},
"nc":{"Name":"ABCD"},
"Alarm History":{
"Alarm IDs":[],"Ack Count":0,"Ack Operator":"","Purge Count":0},"h FW Upgrade":{"wsize":64,"Backoff Divisor":2,"Backoff Delay":5,"Max Retries":4,"End Download Timeout":0},"Epn FW Upgrade":{"Final Ack Timeout":60},
"UNI-x 1":{"Max Frame Size":2000,"Duplex":"Auto","Speed":"Auto","lb":false,"Enable":true,"bd Rate Limit":200000,"st Limit":100,"lb Type":"PHY","Clear st Count":0,"ex":"Off","pc":false},
"UNI-x 2":{"Max Frame Size":2000,"Duplex":"Auto","Speed":"Auto","lb":false,"Enable":true,"bd Rate Limit":200000,"st Limit":100,"lb Type":"PHY","Clear st Count":0,"ex":"Off","pc":false},
"UNI-POTS 1":{"Enable":true},"UNI-POTS 2":{"Enable":true}}
}
All I am trying to do is replace one small value in this super-complicated JSON. I am trying to change the value of the os0 tag's TD ex from ["a0.c985.c0"] to ["a0.c995.c0"].
Is freemarker the best way to do this? I need to change only 1 value. Can this be done through regex or should I use gson?
I can replace the value like this:
JsonObject jsonObject = new JsonParser().parse(inputJson).getAsJsonObject();
JsonElement jsonElement = jsonObject.get("data").getAsJsonObject().get("os0").getAsJsonObject().get("TD ex");
String str = jsonElement.getAsString();
System.out.println(str);
String[] strs = str.split("\\.");
String replaced = strs[0] + "." + strs[1].replaceAll("\\d+", "201") + "." + strs[2];
System.out.println(replaced);
How to put it back and create the json?
FreeMarker is a template engine, so it's not the tool for this. Load the JSON with a real JSON parser library (like Jackson or Gson) into a node tree, change the value there, and then use the same JSON library to generate JSON from the node tree. Also, always avoid doing anything to JSON with regular expressions: JSON (like most practical languages) can describe the same value in many ways, so writing a truly correct regular expression is impractical.
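For reference, the parse, modify, serialize round trip looks like this (sketched in Python for brevity on a trimmed-down version of the document; Gson's JsonObject/JsonArray node tree supports the same navigation and mutation, e.g. via getAsJsonArray().set(...) followed by toJson()):

```python
import json

# Trimmed-down stand-in for the large document in the question.
input_json = '{"data": {"os0": {"TD ex": ["a0.c985.c0"]}}}'

# 1. Parse the JSON into a node tree.
doc = json.loads(input_json)

# 2. Navigate to the value and replace it in place.
doc["data"]["os0"]["TD ex"][0] = "a0.c995.c0"

# 3. Serialize the modified tree back to JSON.
output_json = json.dumps(doc)
print(output_json)
```

No regex is involved, so the rest of the document is untouched regardless of formatting.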

AWS Athena and handling json

I have millions of files with the following (poor) JSON format:
{
"3000105002":[
{
"pool_id": "97808",
"pool_name": "WILDCAT (DO NOT USE)",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
},
{
"pool_id": "96838",
"pool_name": "DRY & ABANDONED",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
}]
}
I've tried to generate an Athena DDL that would accommodate this type of structure (especially the api field) with this:
CREATE EXTERNAL TABLE wp_info (
api:array < struct < pool_id:string,
pool_name:string,
status:string,
bhl:string,
acreage:string>>)
LOCATION 's3://foo/'
After trying to generate a table with this, the following error is thrown:
Your query has the following error(s):
FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type
What is a workable solution to this issue? Note that the api string is different for every one of the million files. The api key is not actually within any of the files, so I hope there is a way for Athena to accommodate just the string-typed values in these data.
If you don't have control over the JSON format that you are receiving, and you don't have a streaming service in the middle to transform the JSON format to something simpler, you can use regex functions to retrieve the relevant data that you need.
A simple way to do it is to use Create-Table-As-Select (CTAS) query that will convert the data from its complex JSON format to a simpler table format.
CREATE TABLE new_table
WITH (
external_location = 's3://path/to/ctas_partitioned/',
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT
regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
regexp_extract(line, '"pool_name": "([^"]*)"', 1) as pool_name,
...
FROM json_lines_table;
Queries against the new table will also perform better, since it uses the Parquet format.
Note that you can also update the table when you have new data, by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partition scheme.
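Since the regular expressions can be tricky to get right, it helps to sanity-check them locally before running the CTAS query. A small sketch using Python's re module against one of the sample records (note the * quantifier on pool_name; without it the group captures only a single character):

```python
import re

# One flattened line from the sample JSON above.
line = '{"pool_id": "97808", "pool_name": "WILDCAT (DO NOT USE)",'

# Equivalent of Athena's regexp_extract(line, pattern, 1):
pool_id = re.search(r'"pool_id": "(\d+)"', line).group(1)
pool_name = re.search(r'"pool_name": "([^"]*)"', line).group(1)

print(pool_id, pool_name)
```

Athena uses Java regular expressions, so character classes and quantifiers like these behave the same way.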

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U57') dtype('<U57') dtype('<U57')

I am using great_expectations for pipeline testing.
I have one DataFrame batch of type:
great_expectations.dataset.pandas_dataset.PandasDataset
I want to build a dynamic validation expression, i.e.
batch.("columnname","value")
in which the validation type, column name, and value come from a JSON file.
JSON structure:-
{
"column_name": "sex",
"validation_type": "expect_column_values_to_be_in_set",
"validation_value": ["MALE","FEMALE"]
},
When I build this expression, I get the error message described below.
Code:-
def add_validation(self, batch, validation_list):
    for d in validation_list:
        expression = "." + d["validation_type"] + "(" + d["column_name"] + "," + str(d["validation_value"]) + ")"
        print(expression)
        batch + expression
    batch.save_expectation_suite(discard_failed_expectations=False)
    return batch
Output:-
print statement output
.expect_column_values_to_be_in_set(sex,['MALE','FEMALE'])
Error:-
TypeError: ufunc 'add' did not contain a loop with signature matching
types dtype('
In great_expectations, the expectation_suite object is designed to capture all of the information necessary to evaluate an expectation. So, in your case, the most natural thing to do would be to translate the source json file you have into the great_expectations expectation suite format.
The best way to do that will depend on where you're getting the original JSON structure from -- you'd ideally want to do the translation as early as possible (maybe even before creating that source JSON?) and keep the expectations in the GE format.
For example, if all of the expectations you have are of the type expect_column_values_to_be_in_set, you could do a direct translation:
expectations = []
for d in validation_list:
    expectation_config = {
        "expectation_type": d["validation_type"],
        "kwargs": {
            "column": d["column_name"],
            "value_set": d["validation_value"]
        }
    }
    expectations.append(expectation_config)

expectation_suite = {
    "expectation_suite_name": "my_suite",
    "expectations": expectations
}
On the other hand, if you are working with a variety of different expectations, you would also need to make sure that the validation_value in your JSON gets mapped to the right kwargs for the expectation (for example, if you expect_column_values_to_be_between then you actually need to provide min_value and/or max_value).
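One way to handle that mapping is a small lookup table from validation_type to the kwarg name each expectation expects. This is only a sketch; the kwarg names for expectations beyond expect_column_values_to_be_in_set should be checked against the Great Expectations docs:

```python
# Map each expectation type to the kwarg its validation_value should populate.
# (Illustrative; extend for the validation types you actually use.)
KWARG_MAP = {
    "expect_column_values_to_be_in_set": "value_set",
    "expect_column_values_to_not_be_null": None,  # takes no extra value
}

def to_expectation_config(d):
    kwargs = {"column": d["column_name"]}
    value_kwarg = KWARG_MAP.get(d["validation_type"])
    if value_kwarg is not None:
        kwargs[value_kwarg] = d["validation_value"]
    return {"expectation_type": d["validation_type"], "kwargs": kwargs}

config = to_expectation_config({
    "column_name": "sex",
    "validation_type": "expect_column_values_to_be_in_set",
    "validation_value": ["MALE", "FEMALE"],
})
print(config)
```

Expectations that need multiple kwargs (such as min_value/max_value) would need a slightly richer mapping, but the principle is the same: translate once, then keep everything in the GE suite format.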

How to convert a JSON file to an SQLite database

If I have some sample data, how do I put it into SQLite (preferably fully automated)?
{"uri":"/","user_agent":"example1"}
{"uri":"/foobar","user_agent":"example1"}
{"uri":"/","user_agent":"example2"}
{"uri":"/foobar","user_agent":"example3"}
I found the easiest way to do this is by using jq and CSV as an intermediary format.
Getting the CSV
First write your data to a file.
I will assume data.json here.
Then construct the header using jq:
% head -1 data.json | jq -r 'keys | @csv'
"uri","user_agent"
The head -1 is because we only want one line.
jq's -r makes the output a plain string instead of a JSON string wrapping the CSV.
We then call the built-in function keys to get the keys of the input as an array.
This we send to the @csv formatter, which outputs a single string with the headers in quoted CSV format.
We then need to construct the data.
% jq -r 'map(tostring) | @csv' < data.json
"/","example1"
"/foobar","example1"
"/","example2"
"/foobar","example3"
We now take each input object and deconstruct it with map(tostring), which is equivalent to taking its values with .[] and putting them back into a simple array […].
This basically converts our dictionary to an array of its values.
Sent to the @csv formatter, we again get some CSV.
Putting it all together we get a single one-liner in the form of:
% (head -1 data.json | jq -r 'keys | @csv' && jq -r 'map(tostring) | @csv' < data.json) > data.csv
If you need to convert the data on the fly, i.e. without a file, try this:
% cat data.json | (read -r first && jq -r '(keys | @csv),(map(tostring) | @csv)' <<<"${first}" && jq -r 'map(tostring) | @csv')
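If jq is not available, the same conversion can be sketched with Python's standard library (assuming, as the jq commands above do, that every record has the same keys; note that csv.DictWriter keeps the keys in insertion order, while jq's keys sorts them):

```python
import csv
import io
import json

# Newline-delimited JSON, as in the question.
ndjson = '''{"uri":"/","user_agent":"example1"}
{"uri":"/foobar","user_agent":"example1"}'''

rows = [json.loads(line) for line in ndjson.splitlines()]

buf = io.StringIO()
# Take the header from the first record, like `head -1 data.json | jq keys`.
writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()),
                        quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

The resulting CSV can then be imported into SQLite exactly as described below.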
Loading it into SQLite
Open an SQLite database:
sqlite3 somedb.sqlite
Now in the interactive shell do the following (assuming you wrote the CSV to data.csv and want it in a table called my_table):
.mode csv
.import data.csv my_table
Now close the shell and open it again for a clean environment.
You can now easily SELECT from the database and do whatever you want to.
Edits
Edit:
As pointed out (thanks @Leo), the original question did show newline-delimited JSON objects, each of which on its own conforms to RFC 4627, but not all together in that format.
jq can handle a single JSON array of objects much the same way though by preprocessing the file using jq '.[]' <input.json >preprocessed.json.
If you happen to be dealing with JSON text sequences (rfc7464) luckily jq has got your back too with the --seq parameter.
Edit 2:
Both newline-separated JSON and JSON text sequences have one important advantage: they reduce memory requirements to O(1), meaning your total memory requirement depends only on your longest line of input. Putting the entire input in a single array instead requires one of the following. Either your parser can handle late errors (i.e. a syntax error after the first 100k elements), which generally isn't the case to my knowledge; or it has to parse the entire file twice (first validating syntax, then parsing, discarding previous elements in the process, as is the case with jq --stream), which also happens rarely to my knowledge; or it will try to parse the whole input at once and return the result in one step (think of receiving a Python dict which contains the entirety of your, say, 50G input data plus overhead), which is usually memory-backed, hence raising your memory footprint by just about your total data size.
Edit 3:
If you hit any obstacles, try using keys_unsorted instead of keys.
I haven't tested that myself (I kind of assume my columns were already sorted), however @Kyle Barron reports that this was needed.
Edit 4:
As pointed out by youngminz in a comment, the original command fails when working with non-{number,string} values like nested lists.
The command has been updated (with a slightly adapted version from the comment: map(), like [.[]], converts an object to an array of its values, and unlike map_values() it makes the mapping more readable).
Keys remain unaffected; if you really have complex types as keys (which may not even conform to JSON, but I'm too lazy to look it up right now), you can do the same for the key-related mappings.
A way to do this without CSV or a third-party tool is to use the JSON1 extension of SQLite combined with the readfile extension provided in the sqlite3 CLI tool. As well as being a "more direct" solution overall, this has the advantage of handling JSON NULL values more consistently than CSV, which would otherwise import them as empty strings.
If the input file is a well-formed JSON file, e.g. the example given as an array:
[
{"uri":"/","user_agent":"example1"},
{"uri":"/foobar","user_agent":"example1"},
{"uri":"/","user_agent":"example2"},
{"uri":"/foobar","user_agent":"example3"}
]
Then this can be read into the corresponding my_table table as follows. Open the SQLite database file my_db.db using the sqlite3 CLI:
sqlite3 my_db.db
then create my_table using:
CREATE TABLE my_table(uri TEXT, user_agent TEXT);
Finally, the JSON data in my_data.json can be inserted into the table with the CLI command:
INSERT INTO my_table SELECT
json_extract(value, '$.uri'),
json_extract(value, '$.user_agent')
FROM json_each(readfile('my_data.json'));
If the initial JSON file consists of newline-separated JSON elements, then it can first be converted using jq:
jq -s <my_data_raw.json >my_data.json
It's likely there is a way to do this directly in SQLite using JSON1, but I didn't pursue that given that I was already using jq to massage the data prior to import to SQLite.
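The same JSON1-based import can also be driven from Python's sqlite3 module, which avoids readfile entirely (a sketch; it assumes an SQLite build with the JSON functions compiled in, which is the default in modern Python builds):

```python
import json
import sqlite3

# Same sample data as the well-formed JSON array above.
data = [
    {"uri": "/", "user_agent": "example1"},
    {"uri": "/foobar", "user_agent": "example1"},
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table(uri TEXT, user_agent TEXT)")

# json_each() expands the top-level array into one row per element,
# mirroring the readfile() approach from the CLI.
con.execute(
    """
    INSERT INTO my_table
    SELECT json_extract(value, '$.uri'),
           json_extract(value, '$.user_agent')
    FROM json_each(?)
    """,
    (json.dumps(data),),
)

print(con.execute("SELECT * FROM my_table").fetchall())
```

In a real script you would pass open("my_data.json").read() instead of json.dumps(data).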
sqlitebiter appears to provide a python solution:
A CLI tool to convert CSV/Excel/HTML/JSON/LTSV/Markdown/SQLite/TSV/Google-Sheets to a SQLite database file. http://sqlitebiter.rtfd.io/
docs:
http://sqlitebiter.readthedocs.io/en/latest/
project:
https://github.com/thombashi/sqlitebiter
last update approximately 3 months ago
last issue closed approximately 1 month ago, none open
noted today, 2018-03-14
You can use spyql.
spyql reads the json files (with 1 json object per line) and generates INSERT statements that you can pipe into sqlite:
$ spyql -Otable=my_table "SELECT json->uri, json->user_agent FROM json TO sql" < sample3.json | sqlite3 my.db
This assumes that you already created an empty table in the sqlite database my.db.
Disclaimer: I am the author of spyql.
To work with a file of newline-delimited JSON objects, including \n in the data:
Add a header column name and ensure the JSON is compact (1 line per record).
cat <(echo '"line"') source.json | jq -c '.' > source.fauxcsv
Import the JSON and header as a "csv" into a temporary table, with a column separator \t that won't occur in the JSON. Then create the real table via SQLite's JSON functions.
sqlite3 file.db \
-cmd '.separator \t \n' \
-cmd '.import --schema temp source.fauxcsv temp_json_lines' <<-'EOSQL'
INSERT into records SELECT
json_extract(line, '$.rid'),
coalesce(json_extract(line, '$.created_at'), strftime('%Y-%m-%dT%H:%M:%fZ', 'now')),
json_extract(line, '$.name')
FROM temp_json_lines;
EOSQL
If (as in the original question) the JSON data comes in the form of JSONLines (that is, one JSON entity per line), and if it is desired to create a table with one of these entities per row, then sqlite3 can be used to import the data by setting .mode=line, e.g. as follows:
create table input (
raw JSON
);
.mode=line
.import input.json input
This approach is worth knowing not least because it can easily be adapted to handle cases where the data is not already in JSONLines format. For example, if input.json contains a single very long JSON array, we could use a tool such as jq or gojq to "splat" it:
.mode=line
.import "|jq -c .[] input.json" input
Similarly, if input.json contains a single object with many keys, and if it is desired to create a table of corresponding single-key objects:
.mode=line
.import "|jq -c 'to_entries[] | {(.key): .value}'" input
If the original data is a single very large JSON array or JSON object, jq's streaming parser could be used to save memory. In this context, it may be worth mentioning two CLI tools with minimal memory requirements: my own jm (based on JSON Machine), and jm.py (based on ijson). E.g., to "splat" each array in a file containing one or more JSON arrays:
.mode=line
.import "|jm input.json" input
With the JSON data safely in an SQLite table, it is (thanks to SQLite's support for JSON) now quite straightforward to create indices, populate other tables, etc., etc.
Here is the first answer compiled into a deno script:
// just for convenience (pathExists)
import {} from "https://deno.land/x/simple_shell@0.9.0/src/stringUtils.ts";
/**
 * @description
 * convert a json db to csv and then to sqlite
 *
 * @note
 * `sqliteTableConstructor` is a string that is used to create the table; if it is specified, the csv file *should not* contain a header row.
 * if it's not specified, then the csv file *must* contain a header row so it can be used to infer the column names.
 */
const jsonToSqlite = async (
{
jsonDbPath,
jsonToCsvFn,
sqliteDbPath,
sqliteTableConstructor,
tableName,
}: {
jsonDbPath: string;
sqliteDbPath: string;
tableName: string;
sqliteTableConstructor?: string;
// deno-lint-ignore no-explicit-any
jsonToCsvFn: (jsonDb: any) => string;
},
) => {
// convert it into csv
const csvDbPath = `${jsonDbPath.replace(".json", "")}.csv`;
if (csvDbPath.pathExists()) {
console.log(`${csvDbPath} already exists`);
} else {
const db = JSON.parse(await Deno.readTextFile(jsonDbPath));
const csv = jsonToCsvFn(db);
await Deno.writeTextFile(csvDbPath, csv);
}
// convert it to sqlite
if (sqliteDbPath.pathExists()) {
console.log(`${sqliteDbPath} already exists`);
} else {
const sqlite3 = Deno.spawnChild("sqlite3", {
args: [sqliteDbPath],
stdin: "piped",
stderr: "null", // required to make sqlite3 work
});
await sqlite3.stdin.getWriter().write(
new TextEncoder().encode(
".mode csv\n" +
(sqliteTableConstructor ? `${sqliteTableConstructor};\n` : "") +
`.import ${csvDbPath} ${tableName}\n` +
".exit\n",
),
);
await sqlite3.status;
}
};
Example of usage:
await jsonToSqlite(
{
jsonDbPath: "./static/db/db.json",
sqliteDbPath: "./static/db/db.sqlite",
tableName: "radio_table",
sqliteTableConstructor:
"CREATE TABLE radio_table(name TEXT, country TEXT, language TEXT, votes INT, url TEXT, favicon TEXT)",
jsonToCsvFn: (
db: StationDBType[],
) => {
const sanitize = (str: string) =>
str.trim().replaceAll("\n", " ").replaceAll(",", " ");
return db.filter((s) => s.name.trim() && s.url.trim())
.map(
(station) => {
return (
sanitize(station.name) + "," +
sanitize(station.country) + "," +
sanitize(station.language) + "," +
station.votes + "," +
sanitize(station.url) + "," +
sanitize(station.favicon)
);
},
).join("\n");
},
},
);
Edit1:
Importing csv into sqlite by default sets all column types to string. In this edit I allow the user to create the table first (via an optional constructor) before importing the csv into it; this way the exact column types can be specified.
Edit2:
Turns out that with deno and sqlite-deno you don't need to use csv as an intermediate format or shell out to sqlite3. Here is an example of how to achieve this:
The following code will create a new sqlite db from the json one.
import { DB } from "https://deno.land/x/sqlite@v3.2.1/mod.ts";
export interface StationDBType {
name: string;
country: string;
language: string;
votes: number;
url: string;
favicon: string;
}
export const db = new DB("new.sql");
db.query(
"create TABLE radio_table (name TEXT, country TEXT, language TEXT, votes INT, url TEXT, favicon TEXT)",
);
const jsonDb: StationDBType[] = JSON.parse(
await Deno.readTextFile("static/db/compressed_db.json"),
);
const sanitize = (s: string) => s.replaceAll('"', "").replaceAll("'", "");
db.query(
`insert into radio_table values ${
jsonDb.map((station) =>
"('" +
sanitize(station.name) +
"','" +
sanitize(station.country) +
"','" +
sanitize(station.language) +
"'," +
station.votes +
",'" +
sanitize(station.url) +
"','" +
sanitize(station.favicon) +
"')"
).join(",")
}`,
);
db.close();

How to use a CSV field to define the node label in a LOAD statement

This example is taken from https://neo4j.com/developer/guide-importing-data-and-etl/#_importing_the_data_using_cypher
LOAD CSV WITH HEADERS FROM "file:customers.csv" AS row
CREATE (:Customer {companyName: row.CompanyName, customerID: row.CustomerID, fax: row.Fax, phone: row.Phone});
What I want to do is use a field in the CSV file to define the label in the node. For example:
LOAD CSV WITH HEADERS FROM "FILE:///Neo4j_AttributeProvenance.csv" AS CSVLine CREATE (q:CSVLine.NodeType { NodeID:CSVLine.NodeID, SchemaName:CSVLine.SchemaName, TableName:CSVLine.TableName, DataType:CSVLine.DataType, PreviousNodeID:CSVLine.PreviousNodeID });
You should have a look at the APOC procedures. In this case there's a procedure able to create nodes dynamically, based on column values in your .csv file. The syntax is:
CALL apoc.create.node(['Label'], {key: value, …})
In your case the simplest syntax should be:
CALL apoc.create.node([CSVLine.NodeType], {NodeID: CSVLine.NodeID, etc}) YIELD node