Joining jq arrays for CSV output

I'm looking to create a CSV based on two JSON arrays (the arrays are reduced from a larger JSON array of key-value pairs):
[
"Name",
"Role",
"Type",
"Service",
"Group",
]
[
"some-server.com",
"web server",
"production",
"apps",
"main",
]
I'm able to get more or less what I'm looking for with:
jq -r '[.Tags[].Key], [.Tags[].Value] | join (",")' output.json
The issue is, the keys are not always sorted in the same order. For some objects I get:
Name, Role, Type
and other times:
Role, Type, Name ...
I'm looking for a way to make the output consistent.

You can normalize the objects using:
def sortKeys: to_entries | sort | from_entries;
For example, if A is an array of the unnormalized objects, you could write:
A | map(sortKeys)
Or the objects could be normalized as soon as they are created.
For CSV, you might want to fix the order based on a pre-determined array of key names. In that case, you could use:
def selectKeys(keys):
  . as $in | reduce keys[] as $k ({}; . + {($k): $in[$k]});
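As a rough illustration of selectKeys applied to the Tags layout from the question, here is a minimal sketch. The sample tags and the column list ["Name", "Role", "Type"] are assumptions made up for the example; adjust both to your real data.

```shell
# Tags arrive in an arbitrary order; the column list fixes the output order.
input='{"Tags":[{"Key":"Role","Value":"web server"},{"Key":"Type","Value":"production"},{"Key":"Name","Value":"some-server.com"}]}'

out=$(printf '%s' "$input" | jq -r '
  def selectKeys(keys):
    . as $in | reduce keys[] as $k ({}; . + {($k): $in[$k]});

  ["Name", "Role", "Type"] as $cols                # hypothetical column order
  | (.Tags | map({(.Key): .Value}) | add) as $obj  # one object built from the tag pairs
  | $cols, ($obj | selectKeys($cols) | [.[]])
  | @csv
')
printf '%s\n' "$out"
```

The header and the row come out in the order given by $cols, regardless of the order of the tags in the input.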

Related

Count number of objects whose attribute are "null" or contain "null"

I have the following JSON. I'd like to count how many objects have a type attribute that is either "null" or an array containing the value "null". In the following example, the answer would be two. Note that the JSON could also be deeply nested.
{
"A": {
"type": "string"
},
"B": {
"type": "null"
},
"C": {
"type": [
"null",
"string"
]
}
}
I came up with the following, but obviously this doesn't work since it misses the arrays. Any hints how to solve this?
jq '[..|select(.type?=="null")] | length'
This answer focuses on efficiency, straightforwardness, and generality.
In brief, the following jq program produces 2 for the given example.
def count(s): reduce s as $x (0; .+1);
def hasValue($value):
has("type") and
(.type | . == $value or (type == "array" and any(. == $value)));
count(.. | objects | select(hasValue("null")))
Notice that using this approach, it would be easy to count the number of objects having null or "null":
count(.. | objects | select(hasValue("null") or hasValue(null)))
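To see the first program handle nesting, here is a small run from the shell. The input is the question's example plus one extra, deeply nested object "D" that I added for illustration, so the expected count is 3 rather than 2.

```shell
input='{"A":{"type":"string"},"B":{"type":"null"},"C":{"type":["null","string"]},"D":{"inner":{"type":["null"]}}}'

count=$(printf '%s' "$input" | jq '
  def count(s): reduce s as $x (0; .+1);
  def hasValue($value):
    has("type") and
    (.type | . == $value or (type == "array" and any(. == $value)));
  count(.. | objects | select(hasValue("null")))
')
printf '%s\n' "$count"   # 3: B, C, and D.inner
```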
You were almost there. For arrays you could use IN. I also used objects, strings and arrays, which are shortcuts for a select on the corresponding types.
jq '[.. | objects.type | select(strings == "null", IN(arrays[]; "null"))] | length'
2
On larger structures you could also improve performance by not creating that array of which you would only calculate the length, but by instead just iterating over the matching items (e.g. using reduce) and counting on the go.
jq 'reduce (.. | objects.type | select(strings == "null", IN(arrays[]; "null"))) as $_ (0; .+1)'
2

"Transpose"/"Rotate"/"Flip" JSON elements

I would like to "transpose" (not sure that's the right word) JSON elements.
For example, I have a JSON file like this:
{
"name": {
"0": "fred",
"1": "barney"
},
"loudness": {
"0": "extreme",
"1": "not so loud"
}
}
... and I would like to generate a JSON array like this:
[
{
"name": "fred",
"loudness": "extreme"
},
{
"name": "barney",
"loudness": "not so loud"
}
]
My original JSON has many more first level elements than just "name" and "loudness", and many more names, features, etc.
For this simple example I could fully specify the transformation like this:
$ echo '{"name":{"0":"fred","1":"barney"},"loudness":{"0":"extreme","1":"not so loud"}}'| \
> jq '[{"name":.name."0", "loudness":.loudness."0"},{"name":.name."1", "loudness":.loudness."1"}]'
[
{
"name": "fred",
"loudness": "extreme"
},
{
"name": "barney",
"loudness": "not so loud"
}
]
... but this isn't feasible for the original JSON.
How can jq create the desired output while being key-agnostic for my much larger JSON file?
Yes, transpose is an appropriate word, as the following makes explicit.
The following generic helper function makes for a simple solution that is completely agnostic about the key names, both of the enclosing object and the inner objects:
# Input: an array of values
def objectify($keys):
. as $in | reduce range(0;length) as $i ({}; .[$keys[$i]] = $in[$i]);
Assuming consistency of the ordering of the inner keys
Assuming the key names in the inner objects are given in a consistent order, a solution can now be obtained as follows:
keys_unsorted as $keys
| [.[] | [.[]]] | transpose
| map(objectify($keys))
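For reference, this is how the pieces fit together when run against the sample from the question; the comments show the intermediate value after each stage. It is only a sketch of one way to invoke it from the shell.

```shell
input='{"name":{"0":"fred","1":"barney"},"loudness":{"0":"extreme","1":"not so loud"}}'

out=$(printf '%s' "$input" | jq -c '
  # Input: an array of values; output: an object pairing $keys[i] with .[i]
  def objectify($keys):
    . as $in | reduce range(0; length) as $i ({}; .[$keys[$i]] = $in[$i]);

  keys_unsorted as $keys        # ["name","loudness"]
  | [.[] | [.[]]]               # [["fred","barney"],["extreme","not so loud"]]
  | transpose                   # [["fred","extreme"],["barney","not so loud"]]
  | map(objectify($keys))
')
printf '%s\n' "$out"
```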
Without assuming consistency of the ordering of the inner keys
If the ordering of the inner keys cannot be assumed to be consistent, then one approach would be to order them, e.g. using this generic helper function:
def reorder($keys):
. as $in | reduce $keys[] as $k ({}; .[$k] = $in[$k]);
or if you prefer a reduce-free def:
def reorder($keys): [$keys[] as $k | {($k): .[$k]}] | add;
The "main" program above can then be modified as follows:
keys_unsorted as $keys
| (.[$keys[0]]|keys_unsorted) as $inner
| map_values(reorder($inner))
| [.[] | [.[]]] | transpose
| map(objectify($keys))
Caveat
The preceding solution only considers the key names in the first inner object.
Building upon Peak's solution, here is an alternative based on group_by to deal with arbitrary orders of inner keys.
keys_unsorted as $keys
| map(to_entries[])
| group_by(.key)
| map(with_entries(.key = $keys[.key] | .value |= .value))
Using paths is a good idea as pointed out by Hobbs. You could also do something like this :
[ path(.[][]) as $p | { key: $p[0], value: getpath($p), id: $p[1] } ]
| group_by(.id)
| map(from_entries)
This is a bit hairy, but it works:
. as $data |
reduce paths(scalars) as $p (
[];
setpath(
[ $p[1] | tonumber, $p[0] ];
( $data | getpath($p) )
)
)
First, capture the top level as $data because . is about to get a new value in the reduce block.
Then, call paths(scalars) which gives a key path to all of the leaf nodes in the input. e.g. for your sample it would give ["name", "0"] then ["name", "1"], then ["loudness", "0"], then ["loudness", "1"].
Run a reduce on each of those paths, starting the reduction with an empty array.
For each path, construct a new path, in the opposite order, with numbers-in-strings turned into real numbers that can be used as array indices, e.g. ["name", "0"] becomes [0, "name"].
Then use getpath to get the value at the old path in $data and setpath to set a value at the new path in . and return it as the next . for the reduce.
At the end, the result will be
[
{
"name": "fred",
"loudness": "extreme"
},
{
"name": "barney",
"loudness": "not so loud"
}
]
If your real data structure might be two levels deep then you would need to replace [ $p[1] | tonumber, $p[0] ] with a more appropriate expression to transform the path. Or maybe some of your "values" are objects/arrays that you want to leave alone, in which case you probably need to replace paths(scalars) with something like paths | select(length == 2).

Flatten nested JSON using jq

I'd like to flatten a nested json object, e.g. {"a":{"b":1}} to {"a.b":1} in order to digest it in solr.
I have 11 TB of JSON files that are both nested and contain dots in field names, meaning that neither Elasticsearch (dots) nor Solr (nested without the _childDocument_ notation) can digest them as is.
The other solutions would be to replace dots in the field names with underscores and push it to elasticsearch, but I have far better experience with solr therefore I prefer the flatten solution (unless solr can digest those nested jsons as is??).
I will prefer elasticsearch only if the digestion process will take far less time than solr, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in python).
Kindly help.
EDIT:
I think the pair of examples 3&4 solves this for me:
https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/
I'll try soon.
You can also use the following jq command to flatten nested JSON objects in this manner:
[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries
The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
This is just a variant of Santiago's jq:
. as $in
| reduce leaf_paths as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })
It avoids the overhead of the key/value construction and destruction.
(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)
Two important points about both these jq solutions:
Arrays are also flattened.
E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:
{
"a.b.0": 0,
"a.b.1": 1,
"a.b.2": 2
}
If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:
{"a.b":0, "a": {"b": 1}}
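A quick way to see the collision is to run the reduce-based variant on that input. This sketch uses paths(scalars) in place of leaf_paths; the later path overwrites the earlier one, so the value 0 is lost.

```shell
out=$(printf '%s' '{"a.b":0, "a": {"b": 1}}' | jq -c '
  . as $in
  | reduce paths(scalars) as $path ({};
      . + { ($path | map(tostring) | join(".")): ($in | getpath($path)) })
')
printf '%s\n' "$out"   # {"a.b":1}
```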
Here is a solution that uses tostream, select, join, reduce and setpath
reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
{}
; setpath($p; $v)
)
I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.
$ jq . test/odd-values.json
{
"one": {
"start-string": "foo",
"null-value": null,
"integer-number": 101
},
"two": [
{
"two-a": {
"non-integer-number": 101.75,
"number-zero": 0
},
"true-boolean": true,
"two-b": {
"false-boolean": false
}
}
],
"three": {
"empty-string": "",
"empty-object": {},
"empty-array": []
},
"end-string": "bar"
}
$ jqg . test/odd-values.json
{
"one.start-string": "foo",
"one.null-value": null,
"one.integer-number": 101,
"two.0.two-a.non-integer-number": 101.75,
"two.0.two-a.number-zero": 0,
"two.0.true-boolean": true,
"two.0.two-b.false-boolean": false,
"three.empty-string": "",
"three.empty-object": {},
"three.empty-array": [],
"end-string": "bar"
}
jqg was tested using jq 1.6
Note: I am the author of the jqg script.
As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file does just this:
{
"a.b":[1],
"id":"24e3e780-3a9e-4fa7-9159-fc5294e803cd",
"_version_":1535841499921514496
}
EDIT 1: solr 6.0.1 with bin/solr -e cloud. collection name is flat, all the rest are default (with data-driven-schema which is also default).
EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;.
EDIT 3: It is also possible to parallelize with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-" where -P is the parallelism factor. I used jq to set an id so that multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P, it created duplicates in the collection).
As @hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars): it returns the paths of all values for which scalars produces a truthy value. scalars emits its input if it is a scalar and produces no output otherwise. The problem is that null and false are scalars but are not truthy, so their paths are dropped from the output. The following code does work, because it checks the type of each value directly:
. as $in
| reduce paths(type != "object" and type != "array") as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })
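The difference is easy to check side by side. In this sketch the input object is a made-up example with one null, one false, and one number; the first run uses paths(scalars), the second uses the type-based test above.

```shell
input='{"a":{"b":null,"c":false,"d":1}}'

# paths(scalars) silently drops the null and false leaves.
lossy=$(printf '%s' "$input" | jq -c '
  . as $in
  | reduce paths(scalars) as $path ({};
      . + { ($path | map(tostring) | join(".")): ($in | getpath($path)) })
')

# The type-based test keeps them.
full=$(printf '%s' "$input" | jq -c '
  . as $in
  | reduce paths(type != "object" and type != "array") as $path ({};
      . + { ($path | map(tostring) | join(".")): ($in | getpath($path)) })
')
printf '%s\n%s\n' "$lossy" "$full"
```

The first line printed contains only "a.d"; the second contains all three keys.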

How to convert arbitrary simple JSON to CSV using jq?

Using jq, how can arbitrary JSON encoding an array of shallow objects be converted to CSV?
There are plenty of Q&As on this site that cover specific data models which hard-code the fields, but answers to this question should work given any JSON, with the only restriction that it's an array of objects with scalar properties (no deep/complex/sub-objects, as flattening these is another question). The result should contain a header row giving the field names. Preference will be given to answers that preserve the field order of the first object, but it's not a requirement. Results may enclose all cells with double-quotes, or only enclose those that require quoting (e.g. 'a,b').
Examples
Input:
[
{"code": "NSW", "name": "New South Wales", "level":"state", "country": "AU"},
{"code": "AB", "name": "Alberta", "level":"province", "country": "CA"},
{"code": "ABD", "name": "Aberdeenshire", "level":"council area", "country": "GB"},
{"code": "AK", "name": "Alaska", "level":"state", "country": "US"}
]
Possible output:
code,name,level,country
NSW,New South Wales,state,AU
AB,Alberta,province,CA
ABD,Aberdeenshire,council area,GB
AK,Alaska,state,US
Possible output:
"code","name","level","country"
"NSW","New South Wales","state","AU"
"AB","Alberta","province","CA"
"ABD","Aberdeenshire","council area","GB"
"AK","Alaska","state","US"
Input:
[
{"name": "bang", "value": "!", "level": 0},
{"name": "letters", "value": "a,b,c", "level": 0},
{"name": "letters", "value": "x,y,z", "level": 1},
{"name": "bang", "value": "\"!\"", "level": 1}
]
Possible output:
name,value,level
bang,!,0
letters,"a,b,c",0
letters,"x,y,z",1
bang,"""!""",1
Possible output:
"name","value","level"
"bang","!","0"
"letters","a,b,c","0"
"letters","x,y,z","1"
"bang","""!""","1"
First, obtain an array containing all the different object property names in your object array input. Those will be the columns of your CSV:
(map(keys) | add | unique) as $cols
Then, for each object in the object array input, map the column names you obtained to the corresponding properties in the object. Those will be the rows of your CSV.
map(. as $row | $cols | map($row[.])) as $rows
Finally, put the column names before the rows, as a header for the CSV, and pass the resulting row stream to the @csv filter.
$cols, $rows[] | @csv
All together now. Remember to use the -r flag to get the result as a raw string:
jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv'
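Here is that filter run against the second sample input from the question, as a sketch of how quoting behaves. Note that the columns come out in alphabetical order, because keys sorts the property names.

```shell
input='[
  {"name": "bang", "value": "!", "level": 0},
  {"name": "letters", "value": "a,b,c", "level": 0},
  {"name": "letters", "value": "x,y,z", "level": 1},
  {"name": "bang", "value": "\"!\"", "level": 1}
]'

out=$(printf '%s' "$input" | jq -r '
  (map(keys) | add | unique) as $cols
  | map(. as $row | $cols | map($row[.])) as $rows
  | $cols, $rows[]
  | @csv
')
printf '%s\n' "$out"
```

Strings are always double-quoted (with embedded quotes doubled), while numbers are left bare.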
The Skinny
jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ]])[] | @csv'
or:
jq -r '(.[0] | keys_unsorted) as $keys | ([$keys] + map([.[ $keys[] ]])) [] | @csv'
The Details
Aside
Describing the details is tricky because jq is stream-oriented, meaning it operates on a sequence of JSON data, rather than a single value. The input JSON stream gets converted to some internal type which is passed through the filters, then encoded in an output stream at program's end. The internal type isn't modeled by JSON, and doesn't exist as a named type. It's most easily demonstrated by examining the output of a bare index (.[]) or the comma operator (examining it directly could be done with a debugger, but that would be in terms of jq's internal data types, rather than the conceptual data types behind JSON).
$ jq -c '.[]' <<<'["a", "b"]'
"a"
"b"
$ jq -cn '"a", "b"'
"a"
"b"
Note that the output isn't an array (which would be ["a", "b"]). Compact output (the -c option) shows that each array element (or argument to the , filter) becomes a separate object in the output (each is on a separate line).
A stream is like a JSON-seq, but uses newlines rather than RS as an output separator when encoded. Consequently, this internal type is referred to by the generic term "sequence" in this answer, with "stream" being reserved for the encoded input and output.
Constructing the Filter
The first object's keys can be extracted with:
.[0] | keys_unsorted
Keys will generally be kept in their original order, but preserving the exact order isn't guaranteed. Consequently, they will need to be used to index the objects to get the values in the same order. This will also prevent values being in the wrong columns if some objects have a different key order.
To both output the keys as the first row and make them available for indexing, they're stored in a variable. The next stage of the pipeline then references this variable and uses the comma operator to prepend the header to the output stream.
(.[0] | keys_unsorted) as $keys | $keys, ...
The expression after the comma is a little involved. The index operator on an object can take a sequence of strings (e.g. "name", "value"), returning a sequence of property values for those strings. $keys is an array, not a sequence, so [] is applied to convert it to a sequence,
$keys[]
which can then be passed to .[]
.[ $keys[] ]
This, too, produces a sequence, so the array constructor is used to convert it to an array.
[.[ $keys[] ]]
This expression is to be applied to a single object. map() is used to apply it to all objects in the outer array:
map([.[ $keys[] ]])
Lastly for this stage, this is converted to a sequence so each item becomes a separate row in the output.
map([.[ $keys[] ]])[]
Why bundle the sequence into an array within the map only to unbundle it outside? map produces an array; .[ $keys[] ] produces a sequence. Applying map to the sequence from .[ $keys[] ] would produce an array of sequences of values, but since sequences aren't a JSON type, you instead get a flattened array containing all the values.
["NSW","AU","state","New South Wales","AB","CA","province","Alberta","ABD","GB","council area","Aberdeenshire","AK","US","state","Alaska"]
The values from each object need to be kept separate, so that they become separate rows in the final output.
Finally, the sequence is passed through the @csv formatter.
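The two points above (flattening without the inner array, and indexing by $keys to survive a different key order) can be checked with a small made-up input in which the second object lists its keys in the opposite order. This is only an illustrative sketch.

```shell
input='[{"code":"NSW","name":"New South Wales"},{"name":"Alberta","code":"AB"}]'

# Without the inner array constructor, map flattens every value into one array.
flat=$(printf '%s' "$input" | jq -c '(.[0] | keys_unsorted) as $keys | map(.[ $keys[] ])')

# With it, each object becomes its own row, in the first object's key order.
csv=$(printf '%s' "$input" | jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ]])[] | @csv')
printf '%s\n%s\n' "$flat" "$csv"
```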
Alternate
The items can be separated late, rather than early. Instead of using the comma operator to get a sequence (passing a sequence as the right operand), the header sequence ($keys) can be wrapped in an array, and + used to append the array of values. This still needs to be converted to a sequence before being passed to @csv.
The following filter is slightly different in that it will ensure every value is converted to a string. (jq 1.5+)
# For an array of many objects
jq -f filter.jq [file]
# For many objects (not within array)
jq -s -f filter.jq [file]
Filter: filter.jq
def tocsv:
  (map(keys)
    | add
    | unique
    | sort
  ) as $cols
  | map(. as $row
      | $cols
      | map($row[.] | tostring)
    ) as $rows
  | $cols, $rows[]
  | @csv;
tocsv
$ cat test.json
[
{"code": "NSW", "name": "New South Wales", "level":"state", "country": "AU"},
{"code": "AB", "name": "Alberta", "level":"province", "country": "CA"},
{"code": "ABD", "name": "Aberdeenshire", "level":"council area", "country": "GB"},
{"code": "AK", "name": "Alaska", "level":"state", "country": "US"}
]
$ jq -r '["Code", "Name", "Level", "Country"], (.[] | [.code, .name, .level, .country]) | @tsv ' test.json
Code Name Level Country
NSW New South Wales state AU
AB Alberta province CA
ABD Aberdeenshire council area GB
AK Alaska state US
$ jq -r '["Code", "Name", "Level", "Country"], (.[] | [.code, .name, .level, .country]) | @csv ' test.json
"Code","Name","Level","Country"
"NSW","New South Wales","state","AU"
"AB","Alberta","province","CA"
"ABD","Aberdeenshire","council area","GB"
"AK","Alaska","state","US"
I created a function that outputs an array of objects or arrays to csv with headers. The columns would be in the order of the headers.
def to_csv($headers):
  def _object_to_csv:
    ($headers | @csv),
    (.[] | [.[$headers[]]] | @csv);
  def _array_to_csv:
    ($headers | @csv),
    (.[][:$headers|length] | @csv);
  if .[0]|type == "object"
  then _object_to_csv
  else _array_to_csv
  end;
So you could use it like so:
to_csv([ "code", "name", "level", "country" ])
This variant of Santiago's program is also safe but ensures that the key names in
the first object are used as the first column headers, in the same order as they
appear in that object:
def tocsv:
  if length == 0 then empty
  else
    (.[0] | keys_unsorted) as $firstkeys
    | (map(keys) | add | unique) as $allkeys
    | ($firstkeys + ($allkeys - $firstkeys)) as $cols
    | ($cols, (.[] as $row | $cols | map($row[.])))
    | @csv
  end ;
tocsv
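To illustrate the column ordering, here is a sketch with a made-up input where the objects do not share the same keys. The first object's keys ("b", "a") lead, the remaining key ("c") follows, and missing values come out as empty cells.

```shell
out=$(printf '%s' '[{"b":1,"a":2},{"a":3,"c":4}]' | jq -r '
  def tocsv:
    if length == 0 then empty
    else
      (.[0] | keys_unsorted) as $firstkeys
      | (map(keys) | add | unique) as $allkeys
      | ($firstkeys + ($allkeys - $firstkeys)) as $cols
      | ($cols, (.[] as $row | $cols | map($row[.])))
      | @csv
    end;
  tocsv
')
printf '%s\n' "$out"
```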
If you're open to using other Unix tools, csvkit has an in2csv tool:
in2csv example.json
Using your sample data:
> in2csv example.json
code,name,level,country
NSW,New South Wales,state,AU
AB,Alberta,province,CA
ABD,Aberdeenshire,council area,GB
AK,Alaska,state,US
I like the pipe approach for piping directly from jq:
cat example.json | in2csv -f json -
A simple way is to just use string concatenation. If your input is a proper array:
# filename.txt
[
{"field1":"value1", "field2":"value2"},
{"field1":"value1", "field2":"value2"},
{"field1":"value1", "field2":"value2"}
]
then index with .[]:
cat filename.txt | jq -r '.[] | .field1 + ", " + .field2'
or if it's just line by line objects:
# filename.txt
{"field1":"value1", "field2":"value2"}
{"field1":"value1", "field2":"value2"}
{"field1":"value1", "field2":"value2"}
just do this:
cat filename.txt | jq -r '.field1 + ", " + .field2'

JSON to CSV Schema

I have some JSON data structures of bank accounts information that I export as CSV files in order to be opened up in Microsoft Excel. The JSON for each account is:
{
"apy": 2.0,
"product_type": "Investors Checking",
"features": {
"ATM_FEES": "Refunded",
"ATM_CARD_AVAILABLE": "Yes",
"SIMPLY_MAINTAIN_A_MONTHLY_BALANCE_OF": "$10,000"
},
"min_investment": "",
"max_investment": 20000,
"institution_type": "Credit Union",
"institution_num": 11307,
"institution": "Apple Federal Credit Union"
}
I can export it fine with columns for everything except the "features" dictionary. That ends up as a column containing the object:
{
"ATM_FEES": "Refunded",
"ATM_CARD_AVAILABLE": "Yes",
"SIMPLY_MAINTAIN_A_MONTHLY_BALANCE_OF": "$10,000"
}
For any given bank, the features dict can be any arbitrary length with a variety of features. I mostly have experience with document-oriented databases (MongoDB).
How should I construct a relational schema for the same data?
Here the CSV and relational structures don't match. A CSV file can have an arbitrary number of fields, with each feature as a separate column. In a relational database you would do that differently. I would suggest one table for the basic data and one for the features. Something like this:
table BANK_ACCOUNT_INFO:
ID
apy
product_type
min_investment
max_investment
institution_type
institution_num
institution
table BANK_ACCOUNT_FEATURES:
ID
BANK_ACCOUNT_ID
FEATURE_NAME
FEATURE_VALUE
One record in the basic table can be related to several records in the features table.
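If the export is done with jq, the rows for the features table could be produced along these lines. This is only a sketch: using institution_num as the BANK_ACCOUNT_ID value is my assumption for the example, and a real schema would likely use its own generated key.

```shell
input='{
  "institution_num": 11307,
  "features": {
    "ATM_FEES": "Refunded",
    "ATM_CARD_AVAILABLE": "Yes",
    "SIMPLY_MAINTAIN_A_MONTHLY_BALANCE_OF": "$10,000"
  }
}'

rows=$(printf '%s' "$input" | jq -r '
  .institution_num as $id            # assumption: stands in for BANK_ACCOUNT_ID
  | .features
  | to_entries[]                     # one {key, value} pair per feature
  | [$id, .key, .value]
  | @csv
')
printf '%s\n' "$rows"
```

Each feature becomes one row of (account id, feature name, feature value), however many features an account has.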
Here is a solution using jq
def headers:
  keys_unsorted[] as $k
  | if .[$k]|type == "object" then (.[$k]|headers)
    else $k
    end
;
def data:
  keys_unsorted[] as $k
  | if .[$k]|type == "object" then (.[$k]|data)
    else .[$k]
    end
;
(.[0] | [headers])
, (.[] | [data])
| @csv
If filter.jq contains this filter and data.json contains the sample data then
$ jq -Mrs -f filter.jq data.json
will produce
"apy","product_type","ATM_FEES","ATM_CARD_AVAILABLE","SIMPLY_MAINTAIN_A_MONTHLY_BALANCE_OF","min_investment","max_investment","institution_type","institution_num","institution"
2,"Investors Checking","Refunded","Yes","$10,000","",20000,"Credit Union",11307,"Apple Federal Credit Union"