I have JSON (that actually starts out as CSV) in the form of an array of elements of the form:
{
  "field1" : "value1",
  "field2.1; Field2.2; Field2.3" : "Field2.1Value0; Field2.2Value0; Field2.3Value0; Field2.1Value1; Field2.2Value1; Field2.3Value1; ..."
}
...
I would like to iterate over the string value of the field "field2.1; Field2.2; Field2.3", three ";"-separated items at a time, to produce an array of key-value pairs:
{
  "field1" : "value1",
  "newfield" : [
    { "Field2.1": "Field2.1Value0",
      "Field2.2": "Field2.2Value0",
      "Field2.3": "Field2.3Value0" },
    { "Field2.1": "Field2.1Value1",
      "Field2.2": "Field2.2Value1",
      "Field2.3": "Field2.3Value1" },
    ...
  ]
}
...
Note that there are actually a couple of keys that need to be expanded like this, each with a variable number of "sub-keys".
In other words, the original CSV file contains some columns that represent tuples of field values separated by semicolons.
I know how to get down to the "field2.1; Field2.2; Field2.3" string and, say, split it on ";", but then I'm stuck trying to iterate through it three (or however many) items at a time to produce the separate 3-tuples.
The real-world context is the CSV format of a catalog export from the Google Play Store.
For example, Field2.1 is Locale, Field2.2 is Title and Field2.3 is Description:
jq '."Locale; Title; Description" |= split(";") '
If possible, it would be nice if the iteration were based on the number of semicolon-separated "subfields" in the key. There is another column with a similar format for the price in each country.
The following assumes the availability of splits/1 for splitting a string based on a regex. If your jq does not have it, and you cannot or do not want to upgrade, you could devise a workaround using split/1, which splits on a literal string rather than a regex.
First, let's start with a simple variant of the problem that does not require recycling the headers. If the following jq program is in a file (say program.jq):
# Assuming headers is an array of strings,
# create an object from an array of values:
def objectify(headers):
  . as $in
  | reduce range(0; headers|length) as $i ({}; .[headers[$i]] = $in[$i]);

# From an object of the form {key: _, value: _},
# construct an object by splitting each _:
def devolve:
  if .key | index(";")
  then .key as $key
  | [.value | splits("; *")]
  | objectify([$key | splits("; *")])
  else { (.key): .value }
  end;

to_entries | map(devolve)
and if the following JSON is in input.json:
{
  "field1" : "value1",
  "field2.1; Field2.2; Field2.3" : "Field2.1Value0; Field2.2Value0; Field2.3Value0"
}
then the invocation:
jq -f program.jq input.json
should yield:
[
  {
    "field1": "value1"
  },
  {
    "field2.1": "Field2.1Value0",
    "Field2.2": "Field2.2Value0",
    "Field2.3": "Field2.3Value0"
  }
]
It might make sense to add some error-checking or error-correcting code.
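For instance, one defensive measure (a sketch, not part of the original program; the check and error message are illustrative additions) would be to have objectify fail loudly when the number of values does not match the number of headers:

```shell
# A defensive variant of objectify: refuse to build an object when the
# value count does not match the header count. The guard and message
# are illustrative additions, not from the original program.
echo '["v1","v2"]' | jq -c '
  def objectify(headers):
    if length != (headers | length)
    then error("objectify: \(length) values for \(headers | length) headers")
    else . as $in
      | reduce range(0; headers | length) as $i ({}; .[headers[$i]] = $in[$i])
    end;
  objectify(["h1", "h2"])'
# → {"h1":"v1","h2":"v2"}
```

With a three-element input array, the same program would instead terminate with the error message.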
Recycling the headers
Now let's modify the above so that headers will be recycled in accordance with the problem statement.
def objectifyRows(headers):
  (headers|length) as $m
  | (length / $m) as $n
  | . as $in
  | reduce range(0; $n) as $i ( [];
      .[$i] = (reduce range(0; $m) as $h ({};
        .[headers[$h]] = $in[($i * $m) + $h] )) );

def devolveRows:
  if .key | index(";")
  then .key as $key
  | [.value | splits("; *")]
  | objectifyRows([$key | splits("; *")])
  else { (.key): .value }
  end;

to_entries | map(devolveRows)
With input:
{
  "field1" : "value1",
  "field2.1; Field2.2; Field2.3" :
    "Field2.1Value0; Field2.2Value0; Field2.3Value0; Field2.4Value0; Field2.5Value0; Field2.6Value0"
}
the output would be:
[
  {
    "field1": "value1"
  },
  [
    {
      "field2.1": "Field2.1Value0",
      "Field2.2": "Field2.2Value0",
      "Field2.3": "Field2.3Value0"
    },
    {
      "field2.1": "Field2.4Value0",
      "Field2.2": "Field2.5Value0",
      "Field2.3": "Field2.6Value0"
    }
  ]
]
This output can now easily be tweaked along the lines suggested by the OP, e.g. to introduce a new key, one could pipe the above into:
.[0] + { newfield: .[1] }
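For instance, with a two-element array of that shape inlined (the Locale values here are made-up placeholders standing in for the real tuples):

```shell
# Merge the scalar fields with the array of expanded tuples under a new key.
# The "Locale" values are hypothetical placeholder data.
echo '[{"field1":"value1"},[{"Locale":"en-US"},{"Locale":"fr-FR"}]]' \
  | jq -c '.[0] + { newfield: .[1] }'
# → {"field1":"value1","newfield":[{"Locale":"en-US"},{"Locale":"fr-FR"}]}
```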
Functional definitions
Here are reduce-free but efficient (assuming jq >= 1.5) implementations of objectify and objectifyRows:
def objectify(headers):
  [headers, .] | transpose | map( {(.[0]): .[1]} ) | add;

def objectifyRows(headers):
  def gather(n):
    def g: if length > 0 then .[0:n], (.[n:] | g) else empty end;
    g;
  [gather(headers|length) | objectify(headers)];
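As a quick sanity check (not from the original answer), gather(n) emits the input array as a stream of consecutive slices of length n, which is exactly the chunking objectifyRows relies on:

```shell
# Collect the stream produced by gather(2) to show the chunking.
echo '[1,2,3,4,5,6]' | jq -c '
  def gather(n):
    def g: if length > 0 then .[0:n], (.[n:] | g) else empty end;
    g;
  [gather(2)]'
# → [[1,2],[3,4],[5,6]]
```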
Here is my almost-final solution, which inserts the new key and uses the first element of the ";" list as the sort key for the array.
def objectifyRows(headers):
  (headers|length) as $m
  | (headers[0]) as $firstkey
  | (length / $m) as $n
  | . as $in
  | reduce range(0; $n) as $i ( [];
      .[$i] = (reduce range(0; $m) as $h ({};
        .[headers[$h]] = $in[($i * $m) + $h] )) );

def devolveRows:
  if .key | index(";")
  then .key as $multikey
  | [.value | splits("; *")]
  # Create a new key whose value is an array of the "splits"
  | { ($multikey): objectifyRows([$multikey | splits("; *")]) }
  # here "arbitrarily" sort by the first split key
  | .[$multikey] |= sort_by(.[[$multikey | splits("; *")][0]])
  else { (.key): .value }
  end;

to_entries | map(devolveRows)
Related
I have a large JSON file that I am using jq to pare down to only those elements I need. I have that working, but some values are strings in all caps. Unfortunately, while jq has ascii_downcase and ascii_upcase, it has no built-in function for uppercasing only the first letter of each word.
I need to do this only for brand_name and generic_name, while ensuring that the manufacturer name is also first-letter-capitalized, with the exception of things like LLC, which should remain capitalized.
Here's my current jq statement:
jq '.results[] | select(.openfda.brand_name != null or .openfda.generic_name != null or .openfda.rxcui != null) | select(.openfda|has("rxcui")) | {brand_name: .openfda.brand_name[0], generic_name: .openfda.generic_name[0], manufacturer: .openfda.manufacturer_name[0], rxcui: .openfda.rxcui[0]}' filename.json > newfile.json
This is a sample output:
{
  "brand_name": "VELTIN",
  "generic_name": "CLINDAMYCIN PHOSPHATE AND TRETINOIN",
  "manufacturer": "Almirall, LLC",
  "rxcui": "882548"
}
I need the output to be:
{
  "brand_name": "Veltin",
  "generic_name": "Clindamycin Phosphate And Tretinoin",
  "manufacturer": "Almirall, LLC",
  "rxcui": "882548"
}
Suppose we are given an array of words that are to be left as is, e.g.:
def exceptions: ["LLC", "USA"];
We can then define a capitalization function as follows:
# Capitalize all the words in the input string other than those specified by exceptions:
def capitalize:
  INDEX(exceptions[]; .) as $e
  | [splits("\\b") | select(length > 0)]
  | map(if $e[.] then . else (.[:1] | ascii_upcase) + (.[1:] | ascii_downcase) end)
  | join("");
For example, given "abc-DEF ghi USA" as input, the result would be "Abc-Def Ghi USA".
Split at space characters to get an array of words, then split each word on the empty string to get an array of characters. Within each inner array, apply ascii_downcase to all elements but the first, then put everything back together using add on the inner arrays and join with a space character on the outer array.
(.brand_name, .generic_name) |= (
  (. / " ") | map(. / "" | .[1:] |= map(ascii_downcase) | add) | join(" ")
)
{
  "brand_name": "Veltin",
  "generic_name": "Clindamycin Phosphate And Tretinoin",
  "manufacturer": "Almirall, LLC",
  "rxcui": "882548"
}
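For reference, here is that filter run end to end on just the two fields in question (a self-contained sketch; the input is trimmed to the relevant keys):

```shell
# Recapitalize brand_name and generic_name word by word:
# split into words, lowercase every character after the first,
# reassemble each word, and rejoin with spaces.
echo '{"brand_name":"VELTIN","generic_name":"CLINDAMYCIN PHOSPHATE AND TRETINOIN"}' \
  | jq -c '(.brand_name, .generic_name) |= (
      (. / " ") | map(. / "" | .[1:] |= map(ascii_downcase) | add) | join(" ")
    )'
# → {"brand_name":"Veltin","generic_name":"Clindamycin Phosphate And Tretinoin"}
```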
To exclude certain words from processing, capture them with an if condition:
map_values((. / " ") | map(
    if IN("LLC", "AND") then .
    else . / "" | .[1:] |= map(ascii_downcase) | add end
  ) | join(" "))
{
  "brand_name": "Veltin",
  "generic_name": "Clindamycin Phosphate AND Tretinoin",
  "manufacturer": "Almirall, LLC",
  "rxcui": "882548"
}
Using jq we can easily merge two multi-level objects X and Y using *:
X='{
  "a": 1,
  "b": 5,
  "c": {
    "a": 3
  }
}' Y='{
  "d": 2,
  "a": 3,
  "c": {
    "x": 10,
    "y": 11
  }
}' && Z=`echo "[$X,$Y]" | jq '.[0] * .[1]'` && echo "Z='$Z'"
gives us:
Z='{
  "a": 3,
  "b": 5,
  "c": {
    "a": 3,
    "x": 10,
    "y": 11
  },
  "d": 2
}'
But in my case, I'm starting with X and Z and want to calculate Y (such that X * Y = Z). If our objects only had scalar properties, X + Y would equal Z, and we could calculate Y as Z - X. However, this fails when X or Y contains properties with object values, as in the example above:
X='{
  "a": 1,
  "b": 5,
  "c": {
    "a": 3
  }
}' Z='{
  "a": 3,
  "b": 5,
  "c": {
    "a": 3,
    "x": 10,
    "y": 11
  },
  "d": 2
}' && echo "[$X,$Z]" | jq '.[1] - .[0]'
throws an error jq: error (at <stdin>:16): object ({"a":3,"b":...) and object ({"a":1,"b":...) cannot be subtracted
Is there an elegant solution to this problem with jq?
UPDATE: I've accepted the answer that I found easier to read and maintain, and that had superior performance. In addition, I found a wrinkle in my requirements: if X contains a key K that is not present in Z, I need the output (Y) to nullify it by containing the key K with a value of null.
The best way I could come up with to do this was to pre-process Z to add the missing keys using the below:
def add_null($y):
  reduce (to_entries[] | [ .key, .value ]) as [ $k, $v ] (
    $y;
    if $y | has($k) | not then
      .[$k] = null
    elif $v | type == "object" then
      .[$k] = ($v | add_null($y[$k]))
    else
      .[$k] = $v
    end
  );
so we end up with:
def add_null(...);
def remove(...);
. as [ $X, $Z ] | ($X | add_null($Z)) | remove($X)
Any better suggestions to this variation are still appreciated!
def remove($o2):
  reduce ( to_entries[] | [ .key, .value ] ) as [ $k, $v1 ] (
    {};
    if $o2 | has($k) | not then
      # Keep existing value if $o2 doesn't have the key.
      .[$k] = $v1
    else
      $o2[$k] as $v2 |
      if $v1 | type == "object" then
        # We're comparing objects.
        ( $v1 | remove($o2[$k]) ) as $v_diff |
        if $v_diff | length == 0 then
          # Discard identical values.
          .
        else
          # Keep the differences of the values.
          .[$k] = $v_diff
        end
      else
        # We're comparing non-objects.
        if $v1 == $v2 then
          # Discard identical values.
          .
        else
          # Keep existing value if different.
          .[$k] = $v1
        end
      end
    end
  );
. as [ $Z, $X ] | $Z | remove($X)
Demo on jqplay
or
def sub($v2):
  ( type ) as $t1 |
  ( $v2 | type ) as $t2 |
  if $t1 == $t2 then
    if $t1 == "object" then
      with_entries(
        .key as $k |
        .value = (
          .value |
          if $v2 | has($k) then sub( $v2[$k] ) else . end
        )
      ) |
      select( length != 0 )
    else
      select( . != $v2 )
    end
  else
    .
  end;
. as [ $Z, $X ] | $Z | sub($X)
Demo on jqplay
I don't know if this is elegant, but it works for your sample data:
echo "[$X,$Z]" | jq '
  . as [$x, $z]
  | map([paths(scalars)])
  | .[0] |= map(select(. as $p | [$x, $z | getpath($p)] | .[1] == .[0]))
  | reduce (.[1] - .[0])[] as $p (null; setpath($p; $z | getpath($p)))
'
{
  "a": 3,
  "c": {
    "x": 10,
    "y": 11
  },
  "d": 2
}
We are trying to convert a JSON file to a TSV file, and we are having problems trying to eliminate duplicate Ids with unique.
JSON file
[
  {"Id": "101",
   "Name": "Yugi"},
  {"Id": "101",
   "Name": "Yugi"},
  {"Id": "102",
   "Name": "David"}
]
cat getEvent_all.json | jq -cr '.[] | [.Id] | unique_by(.[].Id)'
jq: error (at :0): Cannot iterate over string ("101")
A reasonable approach would be to use unique_by, e.g.:
unique_by(.Id)[]
| [.Id, .Name]
| @tsv
Alternatively, you could form the pairs first:
map([.Id, .Name])
| unique_by(.[0])[]
| @tsv
uniques_by/2
For very large arrays, though, or if you want to respect the original ordering, a sort-free alternative to unique_by should be considered. Here is a suitable, generic, stream-oriented alternative:
def uniques_by(stream; f):
  foreach stream as $x ({};
    ($x|f) as $s
    | ($s|type) as $t
    | (if $t == "string" then $s
       else ($s|tostring) end) as $y
    | if .[$t][$y] then .emit = false
      else .emit = true | (.item = $x) | (.[$t][$y] = true)
      end;
    if .emit then .item else empty end );
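Applied to the sample array from the question, uniques_by keeps the first occurrence of each Id and preserves the original ordering:

```shell
# Stream-oriented, sort-free deduplication by .Id, then TSV output.
echo '[{"Id":"101","Name":"Yugi"},{"Id":"101","Name":"Yugi"},{"Id":"102","Name":"David"}]' \
  | jq -r '
    def uniques_by(stream; f):
      foreach stream as $x ({};
        ($x|f) as $s
        | ($s|type) as $t
        | (if $t == "string" then $s else ($s|tostring) end) as $y
        | if .[$t][$y] then .emit = false
          else .emit = true | (.item = $x) | (.[$t][$y] = true)
          end;
        if .emit then .item else empty end);
    uniques_by(.[]; .Id) | [.Id, .Name] | @tsv'
# Output (tab-separated):
# 101	Yugi
# 102	David
```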
I want to compare two JSON files to see whether one can be extracted from the other.
P1 (p1.json)
{
  "id": 12,
  "keys": ["key1", "key2"],
  "body": {
    "height": "180cm",
    "wight": "70kg"
  },
  "name": "Alex"
}
P2 (p2.json)
{
  "id": 12,
  "keys": ["key2", "key1"],
  "body": {
    "height": "180cm"
  }
}
As can be seen, P2 is not completely equal to P1, but it can be extracted from P1 (it provides less data about the same person, but the data is correct).
Expected behavior:
p1 extends p2 --> true
p2 extends p1 --> false
Notes
- An array cannot be extracted from the same array with some additional elements
The following definition of extends/1 uses a purely object-based definition of extension (in particular, it does not sort arrays). The OP requirements regarding arrays are unclear to me, but a variant definition is offered in the following section.
# Usage: $in | extends($b) iff $in contains $b in an object-based sense
def extends($b):
  # Handle the case that both are objects:
  def objextends($x):
    . as $in | all($x|keys[]; . as $k | $in[$k] | extends($x[$k]));
  # Handle the case that both are arrays:
  def arrayextends($x):
    . as $in
    | length == ($x|length) and
      all( range(0; length); . as $i | $in[$i] | extends($x[$i]));
  if . == $b then true
  else . as $in
  | type as $intype
  | ($intype == ($b|type)) and
    (($intype == "object" and objextends($b)) or
     ($intype == "array" and arrayextends($b)))
  end;
Examples:
{a:{a:1,b:2}, b:2} | extends({a:{a:1}}) # true
{a:{a:1,b:2}, b:2} | extends({a:{a:2}}) # false
{a:{a:1,b:2}, b:[{x:1,y:2}]} | extends({a:{a:1}, b:[{x:1}]}) # true
Alternative definition
The following definition sorts arrays and is sufficiently generous to handle the given example:
# Usage: $in | extends2($b) iff $in contains $b in a way which ignores the order of array elements
def extends2($b):
  # Both are objects
  def objextends($x):
    . as $in | all($x|keys[]; . as $k | $in[$k] | extends2($x[$k]));
  def arrayextends($x): ($x|sort) - sort == [];
  if . == $b then true
  else . as $in
  | type as $intype
  | ($intype == ($b|type)) and
    (($intype == "object" and objextends($b)) or
     ($intype == "array" and arrayextends($b)))
  end;
With $P1 and $P2 as shown:
$P1 | extends2($P2) # yields true
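A quick self-contained check with the question's P1 and P2 inlined. Note that the inner recursion goes through extends2 itself, so array order is ignored at every nesting level:

```shell
# extends2: object containment that ignores the order of array elements.
P1='{"id":12,"keys":["key1","key2"],"body":{"height":"180cm","wight":"70kg"},"name":"Alex"}'
P2='{"id":12,"keys":["key2","key1"],"body":{"height":"180cm"}}'
echo "[$P1,$P2]" | jq -c '
  def extends2($b):
    def objextends($x):
      . as $in | all($x|keys[]; . as $k | $in[$k] | extends2($x[$k]));
    def arrayextends($x): ($x|sort) - sort == [];
    if . == $b then true
    else type as $t
    | ($t == ($b|type)) and
      (($t == "object" and objextends($b)) or
       ($t == "array" and arrayextends($b)))
    end;
  . as [$p1, $p2] | [($p1 | extends2($p2)), ($p2 | extends2($p1))]'
# → [true,false]
```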
If you know there are no duplicates in any subarrays, then you could use this approach, which computes the difference between the sets of [path, value] pairs returned by tostream, with array indices replaced by null:
def details:
  [ tostream
    | select(length==2) as [$p, $v]
    | [$p | map(if type=="number" then null else . end), $v]
  ];
def extends(a; b): (b|details) - (a|details) == [];
If P1 and P2 are functions returning the sample data
def P1: {
  "id": 12,
  "keys": ["key1", "key2"],
  "body": {
    "height": "180cm",
    "wight": "70kg"
  },
  "name": "Alex"
};

def P2: {
  "id": 12,
  "keys": ["key2", "key1"],
  "body": {
    "height": "180cm"
  }
};
then
extends(P1;P2) # returns true
, extends(P2;P1) # returns false
In the presence of duplicates the result is less clear. e.g.
extends(["a","b","b"];["a","a","b"]) # returns true
I'd like to filter the output from the JSON file below to get all keys that start with "tag_Name".
{
  ...
  "tag_Name_abc": [
    "10_1_4_3",
    "10_1_6_2",
    "10_1_5_3",
    "10_1_5_5"
  ],
  "tag_Name_efg": [
    "10_1_4_5"
  ],
  ...
}
I tried the following, but it failed:
$ cat output.json | jq 'map(select(startswith("tag_Name")))'
jq: error (at <stdin>:1466): startswith() requires string inputs
There are plenty of ways to do this, but the simplest is to convert the object to entries so you can access the keys, filter the entries by key name, and then convert back:
with_entries(select(.key | startswith("tag_Name")))
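For example, against a reduced version of the sample input (with a made-up non-matching key added to show it being dropped):

```shell
# Keep only the entries whose key starts with "tag_Name";
# "other_key" is a hypothetical non-matching key for illustration.
echo '{"tag_Name_abc":["10_1_4_3"],"tag_Name_efg":["10_1_4_5"],"other_key":["x"]}' \
  | jq -c 'with_entries(select(.key | startswith("tag_Name")))'
# → {"tag_Name_abc":["10_1_4_3"],"tag_Name_efg":["10_1_4_5"]}
```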
Here are a few more solutions:
1) combining values for matching keys with add
. as $d
| keys
| map( select(startswith("tag_Name")) | {(.): $d[.]} )
| add
2) filtering out non-matching keys with delpaths
delpaths([
  keys[]
  | select(startswith("tag_Name") | not)
  | [.]
])
3) filtering out non-matching keys with reduce and del
reduce keys[] as $k (
  .;
  if ($k | startswith("tag_Name")) then . else del(.[$k]) end
)