Some background
Versioning notebooks can become very inefficient if the output is expected to vary a lot. I solved this problem with my Jupyter notebooks using nbstripout, but so far I've found no alternative for Zeppelin notebooks.
Because nbstripout uses nbformat to parse ipynb files, it's not an easy patch to make it support Zeppelin. On the other hand, the goal is not that complex: simply empty out all the "msg": "...".
Goal
Given a JSON file, empty out all 'paragraphs.result.msg' fields.
Sample (schema):
{"paragraps": [{"result": {"msg": "Very long output..."}}]}
In (1) and (2) below, I'll assume that the incoming JSON looks like this:
{
"paragraphs": [
{
"result": {
"msg": "msg1"
}
},
{
"result": {
"msg": "msg2"
}
}
]
}
1. To set the .result.msg values to ""
.paragraphs[].result.msg = ""
2. To remove the .result.msg fields altogether:
del(.paragraphs[].result.msg)
3. To remove "msg" fields in all objects, wherever they occur:
walk(if type == "object" then del(.msg) else . end)
(If your jq does not have walk, you can paste in the standard definition from the jq FAQ, shown after this list.)
4. To remove "msg" fields wherever they occur in a .result object in a .paragraphs array:
walk(if type == "object" and (.paragraphs|type) == "array"
then del(.paragraphs[].result?.msg?) else . end)
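To illustrate, running (1) against the sample input above (saved as note.json, a stand-in name for the notebook file) produces:
$ jq '.paragraphs[].result.msg = ""' note.json
{
  "paragraphs": [
    {
      "result": {
        "msg": ""
      }
    },
    {
      "result": {
        "msg": ""
      }
    }
  ]
}
And for (3) and (4): if your jq predates walk, this is the standard definition from the jq FAQ; paste it at the top of your program:
def walk(f):
  . as $in
  | if type == "object" then
      reduce keys[] as $key
        ({}; . + {($key): ($in[$key] | walk(f))})
      | f
    elif type == "array" then map(walk(f)) | f
    else f
    end;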
jq can do this. For example, to print the msg values:
jq '.paragraphs[].result.msg' file
http://stedolan.github.io/jq
Git Filter
The best solution (thanks to @steven-penny) is to run this:
git config filter.znbstripout.clean "jq '.paragraphs[].result.msg = \"\"'"
which will set up a filter called znbstripout that invokes the jq tool. Then, in your .gitattributes file you can just put:
*.json filter=znbstripout
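Note that a clean filter only runs when files are staged, so notebooks that are already tracked keep their old blobs until re-added. One way to push them all through the filter once (a sketch, assuming Git 2.16+ for --renormalize and jq on the PATH wherever Git runs):
git add --renormalize .
git commit -m "Strip Zeppelin notebook output"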
Python Script (usable with Git Hooks)
The following can be used as a git hook:
#!/usr/bin/env python3
from glob import glob
import json

# find every Zeppelin notebook under the current directory
files = glob('**/note.json', recursive=True)

for file in files:
    with open(file, 'r') as fp:
        nb = json.load(fp)
    # blank out the output of every paragraph that has one
    for p in nb['paragraphs']:
        if 'result' in p:
            p['result']['msg'] = ""
    with open(file, 'w') as fp:
        json.dump(nb, fp, sort_keys=True, indent=2)
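To wire this up as a pre-commit hook, a minimal sketch (assuming the script above is saved as strip_output.py, a hypothetical name, in the repository root):
#!/bin/sh
# .git/hooks/pre-commit (must be executable)
# Strip Zeppelin output, then re-stage all modified tracked files
# (coarse, but simple).
python3 strip_output.py && git add -u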
Related
While there are several posts about this topic on Stack Overflow, none match my exact use case. I am using a Linux shell script to run SnowSQL to generate a json file.
My json file needs to have a comma between json objects.
This:
{
"CAMPAIGN": "Welcome_New",
"UUID": "fe881781-bdc2-41b2-95f2-e0e8c19dc597"
}
{
"CAMPAIGN": "Welcome_Existing",
"UUID": "77a41c02-beb9-48bf-ada4-b2074c1a78cb"
}
...needs to look like this:
{
"CAMPAIGN": "Welcome_New",
"UUID": "fe881781-bdc2-41b2-95f2-e0e8c19dc597"
},
{
"CAMPAIGN": "Welcome_Existing",
"UUID": "77a41c02-beb9-48bf-ada4-b2074c1a78cb"
}
Here is my complete ksh script:
#!/usr/bin/ksh
. /appl/.snf_logon
export SNOW_PKEY_FILE=$(mktemp ./pkey-XXXXXX)
trap "rm -f ${SNOW_PKEY_FILE}" EXIT
LibGetSnowCred
{
    outFile=JSON_FILE_TYPE_TEST.json
    inDir=/testing
    outFileNm=@my_db.my_schema.my_file_stage/${outFile}
    snowsql \
        --private-key-path $SNOW_PKEY_FILE \
        -o exit_on_error=true \
        -o friendly=false \
        -o timing=false \
        -o log_level=ERROR \
        -o echo=true <<!
COPY INTO ${outFileNm}
FROM (SELECT object_construct(
        'UUID',UUID
        ,'CAMPAIGN',CAMPAIGN)
    FROM my_db.my_schema.JSON_Test_Table
    LIMIT 2)
FILE_FORMAT=(
    TYPE=JSON
    COMPRESSION=NONE
)
OVERWRITE=True
HEADER=False
SINGLE=True
MAX_FILE_SIZE=4900000000
;
get ${outFileNm} file://${inDir}/;
rm ${outFileNm};
!
    if [ $? -eq 0 ]; then
        echo "Export successful"
    else
        echo "ERROR in export"
    fi
}
Is it best practice to add the comma during the SELECT or after the file is generated, and how?
With or without that comma, the text is still not valid JSON, just text that looks like JSON. You export several rows, each row as an independent object. You need to gather all these objects into an array to produce valid JSON.
A JSON that encodes an array of rows looks like this:
[
{
"CAMPAIGN": "Welcome_New",
"UUID": "fe881781-bdc2-41b2-95f2-e0e8c19dc597"
},
{
"CAMPAIGN": "Welcome_Existing",
"UUID": "77a41c02-beb9-48bf-ada4-b2074c1a78cb"
}
]
The easiest way to produce this output is to ask the database to do it, if it supports this option: wrap all the records into an array before generating the JSON, instead of exporting each record as a separate JSON object. A Snowflake sketch follows.
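With Snowflake (used in the question above), one possible approach (a sketch, untested) is to aggregate the objects into a single array in the unload query using array_agg. Note this packs the entire result into one value, so it only suits modest result sizes; in the script above the COPY would become something like:
COPY INTO ${outFileNm}
FROM (SELECT array_agg(v)
      FROM (SELECT object_construct(
                'UUID',UUID
                ,'CAMPAIGN',CAMPAIGN) v
            FROM my_db.my_schema.JSON_Test_Table
            LIMIT 2))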
If this is not possible then you have a file that contains multiple JSONs. You can use jq to convert these individual JSONs into a JSON similar to the one described above (encoding an array of objects).
It is as simple as that:
jq --slurp '.' input_file > output_file
The --slurp option tells jq to read all the JSON values from input_file into memory, parse them, and put them into a single array. That array is the program's input.
'.' is the jq program: the identity filter. It says "output the current value" and does no processing of the input data. The current value here is the array.
After it executes the program (which, in this case, changes nothing), jq dumps the resulting value (as JSON, of course) to the standard output (by default, the screen).
The > output_file part redirects this output to a file (named output_file) instead of showing it on screen.
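For example, feeding two standalone objects to jq --slurp yields one array:
$ printf '%s\n' '{"CAMPAIGN":"Welcome_New"}' '{"CAMPAIGN":"Welcome_Existing"}' | jq --slurp '.'
[
  {
    "CAMPAIGN": "Welcome_New"
  },
  {
    "CAMPAIGN": "Welcome_Existing"
  }
]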
You can see how it works on the jq playground.
I have a key:value JSON object that is used in my JavaScript project. The values are strings and the object looks like this:
{
key1:{
someKey: "Some text",
someKey2: "Some text2"
},
key2:{
someKey3:{
someKey4: "Some text3",
someKey5: "Some text4"
}
}
}
I use it in the project like this: key1.someKey and key2.someKey3.someKey4. Do you have any idea how to delete unused properties? Let's say we don't use key2.someKey3.someKey5 in any file in the project, so I want it deleted from the JSON file. To the people in the comments: I didn't say I want to use JavaScript for this. I don't want to use it in a browser or on a server. I just want a script that can do that on my local computer.
If you live within javascript and node, you can use something like this to get all the paths:
Using some modified code from here: https://stackoverflow.com/a/70763473/999943
var lodash=require('lodash') // use this if calling from the node REPL
// import lodash from 'lodash'; // use this if calling from a script
const allPaths = (o, prefix = '', out = []) => {
if (lodash.isObject(o) || lodash.isArray(o)) Object.entries(o).forEach(([k, v]) => allPaths(v, prefix === '' ? k : `${prefix}.${k}`, out));
else out.push(prefix);
return out;
};
let j = {
key1: { someKey: 'Some text', someKey2: 'Some text2' },
key2: { someKey3: { someKey4: 'Some text3', someKey5: 'Some text4' } }
}
allPaths(j)
[
'key1.someKey',
'key1.someKey2',
'key2.someKey3.someKey4',
'key2.someKey3.someKey5'
]
That's all well and good, but now you want to take that list and look through your codebase for usage.
The main choices are text searching with grep, awk, or ag, or parsing the language and looking through its symbolic representation once it's loaded into your project. Tree-shaking can do this for libraries; I haven't looked into how to do tree-shaking for dictionary keys, or some other undefined-reference check like a linter might do for a language.
Once you have found all the instances, you either manually modify your list or use a JSON library to modify it.
My weapons of choice in this instance are:
jq and bash and grep
It's not infallible. But it's a start. (use with caution).
setup_test.sh
#!/usr/bin/env bash
mkdir src
echo "key2.someKey3.someKey4" > src/a.js
echo "key1.someKey2" > src/b.js
echo "key3.otherKey" > src/c.js
test.json
{
"key1":{
"someKey": "Some text",
"someKey2": "Some text2"
},
"key2":{
"someKey3":{
"someKey4": "Some text3",
"someKey5": "Some text4"
}
}
}
check_for_dict_references.sh
#!/usr/bin/env bash
json_input=$1
code_path=$2
cat << HEREDOC
json_input=$json_input
code_path=$code_path
HEREDOC
echo "Paths found in json"
paths="$(cat "$json_input" | jq -r 'paths | join(".")')"
no_refs=
for path in $paths; do
    escaped_path=$(echo "$path" | sed -e "s|\.|\\\\.|g")
    if ! grep -r "$escaped_path" "$code_path" ; then
        no_refs="$no_refs $path"
    fi
done
echo "Missing paths..."
echo "$no_refs"
echo "Creating a new json file without the unused paths"
del_paths_list=
for path in $no_refs; do
    del_paths_list+=".$path, "
done
del_paths_list=${del_paths_list:0:-2} # remove trailing comma space
cat "$json_input" | jq -r 'del('$del_paths_list')' > ${json_input}.new.json
After running setup_test.sh, we can test the jq + grep solution:
$ ./check_for_dict_references.sh test.json src
json_input=test.json
code_path=src
Paths found in json
src/b.js:key1.someKey2
src/b.js:key1.someKey2
src/b.js:key1.someKey2
src/a.js:key2.someKey3.someKey4
src/a.js:key2.someKey3.someKey4
src/a.js:key2.someKey3.someKey4
Missing paths...
key2.someKey3.someKey5
Creating a new json file without the unused paths
If you look closely, you'd expect it to also report key1.someKey, but that path got "found" in the middle of the name key1.someKey2. There are fancier matching tricks you can do (one is sketched below), but for the purpose of this script it may be enough as-is.
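One simple improvement, sketched here, is grep's -w flag, which only matches whole words (letters, digits, and underscore count as word characters), so a path can no longer be "found" inside a longer key name. In check_for_dict_references.sh the test would become:
if ! grep -rw "$escaped_path" "$code_path" ; then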
Now look in your directory for the new json file:
$ cat test.json.new.json
{
"key1": {
"someKey": "Some text",
"someKey2": "Some text2"
},
"key2": {
"someKey3": {
"someKey4": "Some text3"
}
}
}
Hope that helps.
I'm working on parsing JSON data using JSON.sh. And I wanted to read data from json file (test.json) whose content will be something like,
{
"/home/ukrishnan/projects/test.yml": {
"LOG_DRIVER": "syslog",
"IMAGE": "mysql:5.6"
},
"/home/ukrishnan/projects/mysql/app.xml": {
"ENV_ACCOUNT_BRIDGE_ENDPOINT": "/u01/src/test/sample.txt"
}
}
And I try to parse this JSON using JSON.sh by using,
test_parser=`sh ./lib/JSON.sh < test/test.json`
echo $test_parser
It prints,
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog" ["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6" ["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"} ["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt" ["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"} [] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
Whereas the same command (sh ./lib/JSON.sh < test/test.json), when run directly in a terminal, prints with line breaks:
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog"
["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6"
["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"}
["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt"
["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}
[] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
I wanted to read this and assign to bash variables like,
file_name='/home/ukrishnan/projects/test.yml'
key='LOG_DRIVER'
value='syslog'
As I'm almost completely new to shell scripting and grep/awk, I don't have much idea of how to achieve this. Any help would be greatly appreciated.
I wrote a JSON serializer / deserializer for gawk, if you're interested. Save that script and modify it, replacing everything above # === FUNCTIONS === with the following:
#!/usr/bin/gawk -f

# capture the JSON string from beginning to end into a scalar variable
{ json = json ORS $0 }

END {
    # objectify the JSON string into the multilevel array "obj"
    deserialize(json, obj)
    for (filename in obj) {
        print "file_name=" quote(filename)
        for (key in obj[filename]) {
            # print key="value"
            print key "=" quote(obj[filename][key])
        }
    }
}
Do chmod 755 json.awk and execute it. Output will resemble this:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
ENV_ACCOUNT_BRIDGE_ENDPOINT="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
LOG_DRIVER="syslog"
IMAGE="mysql:5.6"
Hopefully the logic is reasonably easy to follow. If you prefer to output filename=, key=, and value= on every loop iteration, modify the nested for loops accordingly:
for (filename in obj) {
    for (key in obj[filename]) {
        print "file_name=" quote(filename)
        print "key=" quote(key)
        print "value=" quote(obj[filename][key])
    }
}
That change will result in the following output:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
key="ENV_ACCOUNT_BRIDGE_ENDPOINT"
value="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
key="LOG_DRIVER"
value="syslog"
file_name="/home/ukrishnan/projects/test.yml"
key="IMAGE"
value="mysql:5.6"
Anyway, with that output, you can do something silly in BASH like this to populate and act upon the variables:
#!/bin/bash
./json.awk test5.json | while read -r line; do
    eval "$line"
    [ "${line/=*/}" = "value" ] && {
        echo "bash: file_name=$file_name"
        echo "bash: key=$key"
        echo "bash: value=$value"
        echo "------"
    }
done
It'd probably be more graceful just to do all processing within gawk from start to finish and not mess with the polyglot handoff, though.
Getting back to json.awk: if you prefer to keep json.awk modular for easy reuse in future projects, you could remove everything above # === FUNCTIONS ===, create a separate main.awk containing the code block at the top of this answer, and @include "json.awk" as a helper library pretty much anywhere outside of END {...} (just below the shebang, for example).
JSON.sh (from http://json.org) offers a nice bash-friendly means of flattening out a JSON file, and your question already shows what that looks like. The flattened form is:
[node] tab value
You have to think in UNIX-script terms to extract the information you want. You'll note the lines you're interested in actually follow this pattern:
["filename","key"] tab ["value"]
In regex notation, we replace:
filename with (.*)
key with (.*)
tab with \t
value with (.*)
We can retrieve the first, second and third matching groups with \1, \2, \3 respectively.
When used in sed we also note that these symbols []() need to be escaped with a backslash \, resulting in the following script:
./lib/JSON.sh < test/test.json | sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d'
/home/ukrishnan/projects/test.yml,LOG_DRIVER,syslog
/home/ukrishnan/projects/test.yml,IMAGE,mysql:5.6
/home/ukrishnan/projects/mysql/app.xml,ENV_ACCOUNT_BRIDGE_ENDPOINT,/u01/src/test/sample.txt
Now we put the lines in a loop and for each line, we can extract out filename,key,value:
for line in $(./lib/JSON.sh < test/test.json | sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d')
do
    IFS="," read -ra arr <<< "$line"
    filename=${arr[0]}
    key=${arr[1]}
    value=${arr[2]}
    cat <<EOF
filename : $filename
key : $key
value : $value
EOF
done
Which outputs:
filename : /home/ukrishnan/projects/test.yml
key : LOG_DRIVER
value : syslog
filename : /home/ukrishnan/projects/test.yml
key : IMAGE
value : mysql:5.6
filename : /home/ukrishnan/projects/mysql/app.xml
key : ENV_ACCOUNT_BRIDGE_ENDPOINT
value : /u01/src/test/sample.txt
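One caveat: the for line in $(...) loop word-splits on whitespace, so it would break if a filename or value ever contained a space. A while read variant (a sketch) avoids that:
./lib/JSON.sh < test/test.json \
| sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d' \
| while IFS="," read -r filename key value
do
    echo "filename : $filename"
    echo "key : $key"
    echo "value : $value"
done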
Need your expertise here!
I am trying to load a JSON file (generated by json.dumps) into Redshift using the COPY command. The file is in the following format:
[
{
"cookieId": "cb2278",
"environment": "STAGE",
"errorMessages": [
"70460"
]
}
,
{
"cookieId": "cb2271",
"environment": "STG",
"errorMessages": [
"70460"
]
}
]
We ran into the error "Invalid JSONPath format: Member is not an object."
When I got rid of the square brackets [] and removed the "," comma separators between the JSON dicts, it loaded perfectly fine:
{
"cookieId": "cb2278",
"environment": "STAGE",
"errorMessages": [
"70460"
]
}
{
"cookieId": "cb2271",
"environment": "STG",
"errorMessages": [
"70460"
]
}
But in reality, most JSON files from APIs have this formatting.
I could do a string replace or regex to get rid of the , and [], but I am wondering if there is a better way to load into Redshift seamlessly without modifying the file.
One way to convert a JSON array into a stream of the array's elements is to pipe the former into jq '.[]'. The output is sent to stdout.
If the JSON array is in a file named input.json, then the following command will produce a stream of the array's elements on stdout:
$ jq ".[]" input.json
If you want the output in jsonlines format, then use the -c switch (i.e. jq -c '.[]' input.json).
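Applied to the sample in the question (saved as input.json), the -c form yields exactly the newline-delimited layout that loaded fine:
$ jq -c '.[]' input.json
{"cookieId":"cb2278","environment":"STAGE","errorMessages":["70460"]}
{"cookieId":"cb2271","environment":"STG","errorMessages":["70460"]}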
For more on jq, see https://stedolan.github.io/jq
I have a json file that requires parsing.
Using scripting like sed/awk or perl, how do I extract value30 and substitute it for value6, prefixed by the string "XX" (e.g. XX + value30)?
Where:
field6 = fixed string
value6 = fixed string
value30 = varying string
[
{"field6" : "value6", "field30" : "value30" },
{ "field6" : "value6", "field30" : "value30" }
]
If I understand you correctly, this program should do what you're after:
use JSON qw(decode_json encode_json);
use strict;
use warnings;
# set the input line separator to undefined so the next read (<>) reads the entire file
undef $/;
# read the entire input (stdin or a file passed on the command line) and parse it as JSON
my $data = decode_json(<>);
my $from_field = "field6";
my $to_field = "field30";
for (@$data) {
    $_->{$to_field} = $_->{$from_field};
}
print encode_json($data), "\n";
It relies on the JSON module being installed, which you can install via cpanm (which should be available in most modern Perl distributions):
cpanm JSON
If the program is in the file substitute.pl and your json array is in data.json, then you would run it as:
perl substitute.pl data.json
# or
cat data.json | perl substitute.pl
It should produce:
[{"field30":"value6","field6":"value6"},{"field30":"value6","field6":"value6"}]
This replaces field30's value with field6's.
Is this what you were attempting to do?
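As a side note, jq can also express the transformation the question literally asks for (setting field6 to "XX" plus field30's value) in one line; a sketch, assuming that reading of the question:
jq 'map(.field6 = "XX" + .field30)' data.json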