JSON to CSV Schema

I have some JSON data structures of bank account information that I export as CSV files so they can be opened in Microsoft Excel. The JSON for each account is:
{
  "apy": 2.0,
  "product_type": "Investors Checking",
  "features": {
    "ATM_FEES": "Refunded",
    "ATM_CARD_AVAILABLE": "Yes",
    "SIMPLY_MAINTAIN_A_MONTHLY_BALANCE_OF": "$10,000"
  },
  "min_investment": "",
  "max_investment": 20000,
  "institution_type": "Credit Union",
  "institution_num": 11307,
  "institution": "Apple Federal Credit Union"
}
I can export it fine with columns for everything except the "features" dictionary. That ends up as a column containing the object:
{
  "ATM_FEES": "Refunded",
  "ATM_CARD_AVAILABLE": "Yes",
  "SIMPLY_MAINTAIN_A_MONTHLY_BALANCE_OF": "$10,000"
}
For any given bank, the features dict can be of arbitrary length, with a variety of features. I mostly have experience with document-oriented databases (MongoDB).
How should I construct a relational schema for the same data?

Here the CSV and the relational structure don't match. A CSV can have an arbitrary number of fields, with each feature as a separate column; in a relational database you would model that differently. I would suggest a table for the basic data and one for the features, something like this:
table BANK_ACCOUNT_INFO:
ID
apy
product_type
min_investment
max_investment
institution_type
institution_num
institution
table BANK_ACCOUNT_FEATURES:
ID
BANK_ACCOUNT_ID
FEATURE_NAME
FEATURE_VALUE
One record in the basic table can be related to several records in the features table.
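If you also need flat CSV files to load into those two tables, the features table is the tricky one; here is a minimal jq sketch (my own illustration, not part of the question) that emits one BANK_ACCOUNT_FEATURES row per feature. It assumes the accounts are collected in a JSON array in a file named accounts.json and uses the array index as a stand-in for the generated BANK_ACCOUNT_ID:
# One CSV line per (account, feature) pair; the array index is a placeholder ID.
jq -r 'to_entries[]
       | .key as $id
       | .value.features
       | to_entries[]
       | [$id, .key, .value]
       | @csv' accounts.json
For the sample account this yields rows like 0,"ATM_FEES","Refunded".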

Here is a solution using jq:
def headers:
  keys_unsorted[] as $k
  | if .[$k]|type == "object" then (.[$k]|headers)
    else $k
    end
;
def data:
  keys_unsorted[] as $k
  | if .[$k]|type == "object" then (.[$k]|data)
    else .[$k]
    end
;
(.[0] | [headers])
, (.[] | [data])
| @csv
If filter.jq contains this filter and data.json contains the sample data then
$ jq -Mrs -f filter.jq data.json
will produce
"apy","product_type","ATM_FEES","ATM_CARD_AVAILABLE","SIMPLY_MAINTAIN_A_MONTHLY_BALANCE_OF","min_investment","max_investment","institution_type","institution_num","institution"
2,"Investors Checking","Refunded","Yes","$10,000","",20000,"Credit Union",11307,"Apple Federal Credit Union"

Related

Maintain order of jq CSV output and include empty values

I am currently working on a bash script that combines the output of both the aws iam list-users and aws iam list-user-tags commands into a CSV file containing all users along with their respective information and assigned tags. To parse the JSON output of those commands I chose to use jq.
Retrieving, parsing, and converting (JSON to CSV) the list-users output works fine and produces the expected comma-separated list of values.
The output of list-user-tags does not quite behave that way. Its JSON output has the following schema:
{
  Tags: [
    {
      Key: "Name",
      Value: "NameOfUser"
    },
    {
      Key: "Email",
      Value: "EmailOfUser"
    },
    {
      Key: "Company",
      Value: "CompanyOfUser"
    }
  ]
}
Unfortunately, the order of the tags is not consistent across users (and possibly across queries), which currently makes it impossible for me to maintain the order defined in the CSV file. On top of that, there is the possibility of one or multiple missing tags.
What I am looking for is a way to achieve the following (preferably using jq):
Select a tag's "Value" by its "Key"
Check whether it exists and, if not, add an empty entry
Put the value in the exact same place every time (maintain a certain order)
Repeat for every entry in the original output
Convert the resulting array of values into CSV
What I tried so far:
aws iam list-user-tags --user-name abcdef --no-cli-pager \
| jq -r '[.Tags[] | select(.Key=="Name"),select(.Key=="Email"),select(.Key=="Company") | .Value // ""] | @csv'
Any help is much appreciated!
Let's suppose your sample "schema" is in a file named schema.jq
Then
jq -n -f schema.jq | jq -r '
  def get($key):
    map(select(.Key == $key))
    | if length == 0 then null else .[0].Value end;
  .Tags | [get("Name"), get("Email"), get("Company")] | @csv
'
produces the following CSV:
"NameOfUser","EmailOfUser","CompanyOfUser"
It should be easy to adapt this illustration to your needs.
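For instance, to run it directly against the CLI output for one user and prepend a fixed header row, something like the following should work (a sketch only, assuming the real list-user-tags output has the shape shown above, with properly quoted JSON keys):
# Print the header once, then run the per-user query inside your existing loop;
# missing tags come out as empty fields because @csv renders null as nothing.
echo '"Name","Email","Company"'
aws iam list-user-tags --user-name abcdef --no-cli-pager \
| jq -r '
    def get($key):
      map(select(.Key == $key))
      | if length == 0 then null else .[0].Value end;
    .Tags | [get("Name"), get("Email"), get("Company")] | @csv
  '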

Fine-tuning jq filters to reduce repetition in the filter string

I have a complex JSON object produced from an API call (full JSON found in this gist). It's describing attributes of an entity (fields, parameters, child relationships, etc.). Using jq, I'm trying to extract just one child field array and convert it to CSV where field keys are a single header row and values of each array item form the subsequent rows. (NOTE: fields are uniform across all items in the array.)
So far I'm successful, but I feel as if my jq filter string could be better as there is a repetition of unpacking this array in two separate filters.
Here is a redacted version of the JSON for reference:
{
  ...
  "result": {
    ...
    "fields": [
      {
        "aggregatable": true,
        "aiPredictionField": false,
        "autoNumber": false,
        "byteLength": 18,
        "name": "Id",
        ...
      },
      {
        "aggregatable": true,
        "aiPredictionField": false,
        "autoNumber": false,
        "byteLength": 18,
        "name": "OwnerId",
        ...
      },
      {
        "aggregatable": false,
        "aiPredictionField": false,
        "autoNumber": false,
        "byteLength": 0,
        "name": "IsDeleted",
        ...
      },
      ...
    ],
    ...
  }
}
So far, here is the working command:
jq -r '.result.fields | (.[0] | keys) , .[] | [.[] | tostring] | @csv'
repeated array unpacking---^-------------^
I could be happy with this, but I would prefer to unpack the result.fields array in the first filter so that it starts out like this:
jq -r '.result.fields[] | ...
Only then there is no longer an array, just a set of objects. I tried several things but none of them gave me what I wanted. Here are two things I tried before I realized that unpacking .result.fields[] destroyed anything array-like for me to work with (yep...slow learner here, and can be a bit thick):
jq -r '.result.fields[] | ( keys | .[0] ) , [.[] | tostring] | @csv'
jq -r '.result.fields[] | keys[0] , [.[] | tostring] | @csv'
So the real question is: can I unpack result.fields once and then work with what that gives me? And if not, is there a more efficient way to arrive at the CSV structure I'm looking for?
Your code is buggy, because keys sorts the keys. What's needed here is keys_unsorted.
If you want to accomplish everything in a single invocation of jq, you cannot start the pipeline with .result.fields[].
The following does avoid one very small inefficiency of your approach:
.result.fields
| (.[0] | keys_unsorted),
(.[] | [.[] | tostring])
| @csv
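If you save that filter to a file, say fields.jq (a name chosen only for this illustration), the invocation would be:
jq -r -f fields.jq response.json > fields.csv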

Extract schema of nested JSON object

Let's assume this is the source JSON file:
{
  "name": "tom",
  "age": 12,
  "visits": {
    "2017-01-25": 3,
    "2016-07-26": 4,
    "2016-01-24": 1
  }
}
I want to get:
[
"age",
"name",
"visits.2017-01-25",
"visits.2016-07-26",
"visits.2016-01-24"
]
I am able to extract the keys using: jq '. | keys' file.json, but this skips nested fields. How do I include those?
With your input, the invocation:
jq 'leaf_paths | join(".")'
produces:
"name"
"age"
"visits.2017-01-25"
"visits.2016-07-26"
"visits.2016-01-24"
If you want to include "visits", use paths. If you want the result as a JSON array, enclose the filter with square brackets: [ ... ]
If your input might include arrays, then unless you are using jq 1.6 or later, you will need to convert the integer indices to strings explicitly; also, since leaf_paths is now deprecated, you might want to use its definition directly. The resulting filter:
jq 'paths(scalars) | map(tostring) | join(".")'
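For example, with an input containing an array (a small example of my own, not from the question), the numeric index is stringified before joining:
$ jq 'paths(scalars) | map(tostring) | join(".")' <<<'{"a": [{"b": 1}], "c": true}'
"a.0.b"
"c"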
allpaths
To include paths to null, you could use allpaths defined as follows:
def allpaths:
  def conditional_recurse(f): def r: ., (select(. != null) | f | r); r;
  path(conditional_recurse(.[]?)) | select(length > 0);
Example:
{"a": null, "b": false} | allpaths | join(".")
produces:
"a"
"b"
all_leaf_paths
Assuming jq version 1.5 or higher, we can get to all_leaf_paths by following the strategy used in builtins.jq, that is, by adding these definitions:
def allpaths(f):
  . as $in | allpaths | select(. as $p | $in | getpath($p) | f);
def isscalar:
  . == null or . == true or . == false or type == "number" or type == "string";
def all_leaf_paths: allpaths(isscalar);
Example:
{"a": null, "b": false, "object":{"x":0} } | all_leaf_paths | join(".")
produces:
"a"
"b"
"object.x"
Some time ago, I wrote a structural-schema inference engine that produces simple structural schemas that mirror the JSON documents under consideration, e.g. for the sample JSON given here, the inferred schema is:
{
  "name": "string",
  "age": "number",
  "visits": {
    "2017-01-25": "number",
    "2016-07-26": "number",
    "2016-01-24": "number"
  }
}
This is not exactly the format requested in the original posting, but for large collections of objects, it does provide a useful overview.
More importantly, there is now a complementary validator for checking whether a collection of JSON documents matches a structural schema. The validator checks against schemas written in JESS (JSON Extended Structural Schemas), a superset of the simple structural schemas (SSS) produced by the schema inference engine. (The idea is that one can use the SSS as a starting point to add more elaborate constraints, including recursive constraints, within-document referential integrity constraints, etc.)
For reference, here is how the SSS for your sample JSON would be produced using the "schema" module:
jq 'include "schema"; schema' source.json > source.schema.json
And to validate source.json against an SSS or JESS schema:
JESS --schema source.schema.json source.json
This does what you want, although it doesn't return the data in an array; that should be an easy modification:
https://github.com/ilyash/show-struct
You can also check out this page:
https://ilya-sher.org/2016/05/11/most-jq-you-will-ever-need/

Joining jq arrays for CSV output

I'm looking to create a CSV based on two JSON arrays (the arrays are a reduction of a large JSON array with key/value pairs):
[
  "Name",
  "Role",
  "Type",
  "Service",
  "Group"
]
[
  "some-server.com",
  "web server",
  "production",
  "apps",
  "main"
]
I'm able to get more or less what I'm looking for with:
jq -r '[.Tags[].Key], [.Tags[].Value] | join(",")' output.json
The issue is, the keys are not always sorted in the same order. For some objects I get:
Name, Role, Type
and other times:
Role, Type, Name ...
I'm looking for a way to make the output consistent.
You can normalize the objects using:
def sortKeys: to_entries | sort | from_entries;
For example, if A is an array of the unnormalized objects, you could write:
A | map(sortKeys)
Or the objects could be normalized as soon as they are created.
For CSV, you might want to fix the order based on a pre-determined array of key names. In that case, you could use:
def selectKeys(keys):
  . as $in | reduce keys[] as $k ({}; . + {($k): $in[$k]});
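As a small illustration of selectKeys (my own example object, not the asker's data), fixing the column order and emitting CSV:
jq -r '
  def selectKeys(keys):
    . as $in | reduce keys[] as $k ({}; . + {($k): $in[$k]});
  map(selectKeys(["Name", "Role", "Type"]) | [.[]])[]
  | @csv
' <<<'[{"Role": "web server", "Name": "some-server.com", "Type": "production"}]'
This should print "some-server.com","web server","production": the values come out in the key order you specified, regardless of their order in the input, and absent keys become empty fields.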

How to convert arbitrary simple JSON to CSV using jq?

Using jq, how can arbitrary JSON encoding an array of shallow objects be converted to CSV?
There are plenty of Q&As on this site that cover specific data models which hard-code the fields, but answers to this question should work given any JSON, with the only restriction that it's an array of objects with scalar properties (no deep/complex/sub-objects, as flattening these is another question). The result should contain a header row giving the field names. Preference will be given to answers that preserve the field order of the first object, but it's not a requirement. Results may enclose all cells with double-quotes, or only enclose those that require quoting (e.g. 'a,b').
Examples
Input:
[
{"code": "NSW", "name": "New South Wales", "level":"state", "country": "AU"},
{"code": "AB", "name": "Alberta", "level":"province", "country": "CA"},
{"code": "ABD", "name": "Aberdeenshire", "level":"council area", "country": "GB"},
{"code": "AK", "name": "Alaska", "level":"state", "country": "US"}
]
Possible output:
code,name,level,country
NSW,New South Wales,state,AU
AB,Alberta,province,CA
ABD,Aberdeenshire,council area,GB
AK,Alaska,state,US
Possible output:
"code","name","level","country"
"NSW","New South Wales","state","AU"
"AB","Alberta","province","CA"
"ABD","Aberdeenshire","council area","GB"
"AK","Alaska","state","US"
Input:
[
{"name": "bang", "value": "!", "level": 0},
{"name": "letters", "value": "a,b,c", "level": 0},
{"name": "letters", "value": "x,y,z", "level": 1},
{"name": "bang", "value": "\"!\"", "level": 1}
]
Possible output:
name,value,level
bang,!,0
letters,"a,b,c",0
letters,"x,y,z",1
bang,"""!""",1
Possible output:
"name","value","level"
"bang","!","0"
"letters","a,b,c","0"
"letters","x,y,z","1"
"bang","""!""","1"
First, obtain an array containing all the different object property names in your object array input. Those will be the columns of your CSV:
(map(keys) | add | unique) as $cols
Then, for each object in the object array input, map the column names you obtained to the corresponding properties in the object. Those will be the rows of your CSV.
map(. as $row | $cols | map($row[.])) as $rows
Finally, put the column names before the rows, as a header for the CSV, and pass the resulting row stream to the #csv filter.
$cols, $rows[] | @csv
All together now. Remember to use the -r flag to get the result as a raw string:
jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv'
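One thing to note: unique sorts the column names, so the header row comes out in alphabetical order rather than in the first object's key order (the keys_unsorted-based answers below preserve that order). With the first example input, this should print:
"code","country","level","name"
"NSW","AU","state","New South Wales"
"AB","CA","province","Alberta"
"ABD","GB","council area","Aberdeenshire"
"AK","US","state","Alaska"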
The Skinny
jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ]])[] | @csv'
or:
jq -r '(.[0] | keys_unsorted) as $keys | ([$keys] + map([.[ $keys[] ]]))[] | @csv'
The Details
Aside
Describing the details is tricky because jq is stream-oriented, meaning it operates on a sequence of JSON data rather than on a single value. The input JSON stream gets converted to some internal type which is passed through the filters, then encoded in an output stream at the program's end. The internal type isn't modeled by JSON, and doesn't exist as a named type. It's most easily demonstrated by examining the output of a bare index (.[]) or the comma operator (examining it directly could be done with a debugger, but that would be in terms of jq's internal data types, rather than the conceptual data types behind JSON).
$ jq -c '.[]' <<<'["a", "b"]'
"a"
"b"
$ jq -cn '"a", "b"'
"a"
"b"
Note that the output isn't an array (which would be ["a", "b"]). Compact output (the -c option) shows that each array element (or argument to the , filter) becomes a separate object in the output (each is on a separate line).
A stream is like a JSON-seq, but uses newlines rather than RS as an output separator when encoded. Consequently, this internal type is referred to by the generic term "sequence" in this answer, with "stream" being reserved for the encoded input and output.
Constructing the Filter
The first object's keys can be extracted with:
.[0] | keys_unsorted
Keys will generally be kept in their original order, but preserving the exact order isn't guaranteed. Consequently, they will need to be used to index the objects to get the values in the same order. This will also prevent values being in the wrong columns if some objects have a different key order.
To both output the keys as the first row and make them available for indexing, they're stored in a variable. The next stage of the pipeline then references this variable and uses the comma operator to prepend the header to the output stream.
(.[0] | keys_unsorted) as $keys | $keys, ...
The expression after the comma is a little involved. The index operator on an object can take a sequence of strings (e.g. "name", "value"), returning a sequence of property values for those strings. $keys is an array, not a sequence, so [] is applied to convert it to a sequence,
$keys[]
which can then be passed to .[]
.[ $keys[] ]
This, too, produces a sequence, so the array constructor is used to convert it to an array.
[.[ $keys[] ]]
This expression is to be applied to a single object. map() is used to apply it to all objects in the outer array:
map([.[ $keys[] ]])
Lastly for this stage, this is converted to a sequence so each item becomes a separate row in the output.
map([.[ $keys[] ]])[]
Why bundle the sequence into an array within the map only to unbundle it outside? map produces an array; .[ $keys[] ] produces a sequence. Applying map to the sequence from .[ $keys[] ] would produce an array of sequences of values, but since sequences aren't a JSON type, you instead get a flattened array containing all the values.
["NSW","AU","state","New South Wales","AB","CA","province","Alberta","ABD","GB","council area","Aberdeenshire","AK","US","state","Alaska"]
The values from each object need to be kept separate, so that they become separate rows in the final output.
Finally, the sequence is passed through the @csv formatter.
Alternate
The items can be separated late, rather than early. Instead of using the comma operator to get a sequence (passing a sequence as the right operand), the header sequence ($keys) can be wrapped in an array, and + used to append the array of values. This still needs to be converted to a sequence before being passed to @csv.
The following filter is slightly different in that it will ensure every value is converted to a string. (jq 1.5+)
# For an array of many objects
jq -r -f filter.jq [file]
# For many objects (not within array)
jq -r -s -f filter.jq [file]
Filter: filter.jq
def tocsv:
    (map(keys)
     | add
     | unique
     | sort
    ) as $cols
    | map(. as $row
          | $cols
          | map($row[.] | tostring)
         ) as $rows
    | $cols, $rows[]
    | @csv;
tocsv
$ cat test.json
[
{"code": "NSW", "name": "New South Wales", "level":"state", "country": "AU"},
{"code": "AB", "name": "Alberta", "level":"province", "country": "CA"},
{"code": "ABD", "name": "Aberdeenshire", "level":"council area", "country": "GB"},
{"code": "AK", "name": "Alaska", "level":"state", "country": "US"}
]
$ jq -r '["Code", "Name", "Level", "Country"], (.[] | [.code, .name, .level, .country]) | #tsv ' test.json
Code Name Level Country
NSW New South Wales state AU
AB Alberta province CA
ABD Aberdeenshire council area GB
AK Alaska state US
$ jq -r '["Code", "Name", "Level", "Country"], (.[] | [.code, .name, .level, .country]) | #csv ' test.json
"Code","Name","Level","Country"
"NSW","New South Wales","state","AU"
"AB","Alberta","province","CA"
"ABD","Aberdeenshire","council area","GB"
"AK","Alaska","state","US"
I created a function that outputs an array of objects or arrays as CSV with headers. The columns will be in the order of the headers.
def to_csv($headers):
  def _object_to_csv:
    ($headers | @csv),
    (.[] | [.[$headers[]]] | @csv);
  def _array_to_csv:
    ($headers | @csv),
    (.[][:$headers|length] | @csv);
  if .[0]|type == "object"
  then _object_to_csv
  else _array_to_csv
  end;
So you could use it like so:
to_csv([ "code", "name", "level", "country" ])
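If you append that call to the definition and save the whole thing as, say, to_csv.jq (a file name chosen only for this illustration), the invocation follows the same filter.jq pattern shown earlier:
jq -r -f to_csv.jq test.json
For the states sample this produces a header row followed by one CSV row per state, with the columns in exactly the header order you passed in.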
This variant of Santiago's program is also safe but ensures that the key names in the first object are used as the first column headers, in the same order as they appear in that object:
def tocsv:
  if length == 0 then empty
  else
    (.[0] | keys_unsorted) as $firstkeys
    | (map(keys) | add | unique) as $allkeys
    | ($firstkeys + ($allkeys - $firstkeys)) as $cols
    | ($cols, (.[] as $row | $cols | map($row[.])))
    | @csv
  end;
tocsv
If you're open to using other Unix tools, csvkit has an in2csv tool:
in2csv example.json
Using your sample data:
> in2csv example.json
code,name,level,country
NSW,New South Wales,state,AU
AB,Alberta,province,CA
ABD,Aberdeenshire,council area,GB
AK,Alaska,state,US
I like the pipe approach, since it lets you feed in2csv directly from a pipeline (for example, from jq's output):
cat example.json | in2csv -f json -
A simple way is to just use string concatenation. If your input is a proper array:
# filename.txt
[
{"field1":"value1", "field2":"value2"},
{"field1":"value1", "field2":"value2"},
{"field1":"value1", "field2":"value2"}
]
then index with .[]:
cat filename.txt | jq -r '.[] | .field1 + ", " + .field2'
or if it's just line by line objects:
# filename.txt
{"field1":"value1", "field2":"value2"}
{"field1":"value1", "field2":"value2"}
{"field1":"value1", "field2":"value2"}
just do this:
cat filename.txt | jq -r '.field1 + ", " + .field2'
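One caveat: + only concatenates strings, so a numeric or boolean field has to be converted first, and unlike @csv this approach does no quoting, so values that themselves contain commas will break the output. A minimal sketch, assuming a hypothetical numeric field3 that is not in the sample above:
cat filename.txt | jq -r '.field1 + ", " + (.field3 | tostring)'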