Find duplicates in JSON array

Does anyone know how to use jq to find the duplicate(s) in a JSON array?
For example:
Input:
[{"foo": 1, "bar": 2}, {"foo": 1, "bar": 2}, {"foo": 4, "bar": 5}]
Output:
[{"foo": 1, "bar": 2}]

One of many possible solutions in jq:
group_by(.) | map(select(length>1) | .[0])
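For instance, assuming jq is on the PATH, the invocation might look like this (here-string input for brevity):

```shell
# A sketch: feed the sample array inline and run the group_by-based filter
jq -c 'group_by(.) | map(select(length>1) | .[0])' \
  <<< '[{"foo": 1, "bar": 2}, {"foo": 1, "bar": 2}, {"foo": 4, "bar": 5}]'
# → [{"foo":1,"bar":2}]
```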

Solutions involving the built-in group_by entail a sort and are therefore inefficient if the goal is simply to identify the duplicates. Here is a sort-free solution that uses a generic and powerful bagof function, defined below, on a stream:
# Create a two-level dictionary giving [item, n] where n
# is the multiplicity of the item in the stream:
def bagof(stream):
  reduce stream as $x ({};
    ($x | [type, tostring]) as $key
    | getpath($key) as $entry
    | if $entry then setpath($key; [$x, ($entry[1] + 1)])
      else setpath($key; [$x, 1])
      end );

# Emit a stream of the duplicated items in the stream, s:
def duplicates(s): bagof(s) | .[][] | select(.[1] > 1) | .[0];

# Input: an array
# Output: an array of the items that are duplicated in the array
def duplicates: [duplicates(.[])];
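A sketch of how these definitions might be used, inlined here for a self-contained invocation (in practice you would put them in a file and use jq -f):

```shell
# A sketch: the bagof-based, sort-free duplicates filter applied to the sample input
jq -c '
  def bagof(stream):
    reduce stream as $x ({};
      ($x | [type, tostring]) as $key
      | getpath($key) as $entry
      | if $entry then setpath($key; [$x, ($entry[1] + 1)])
        else setpath($key; [$x, 1])
        end );
  def duplicates(s): bagof(s) | .[][] | select(.[1] > 1) | .[0];
  def duplicates: [duplicates(.[])];
  duplicates
' <<< '[{"foo": 1, "bar": 2}, {"foo": 1, "bar": 2}, {"foo": 4, "bar": 5}]'
# → [{"foo":1,"bar":2}]
```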

jq: error (at <stdin>:0): Cannot iterate over string, cannot execute unique problem

We are trying to convert a JSON file to a TSV file, and are having problems eliminating duplicate Ids with unique.
JSON file
[
{"Id": "101",
"Name": "Yugi"},
{"Id": "101",
"Name": "Yugi"},
{"Id": "102",
"Name": "David"}
]
cat getEvent_all.json | jq -cr '.[] | [.Id] | unique_by(.[].Id)'
jq: error (at :0): Cannot iterate over string ("101")
A reasonable approach would be to use unique_by, e.g.:
unique_by(.Id)[]
| [.Id, .Name]
| @tsv
Alternatively, you could form the pairs first:
map([.Id, .Name])
| unique_by(.[0])[]
| @tsv
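For instance, with the sample objects inline, the first approach might be invoked as follows (the -r option is needed for raw TSV output):

```shell
# A sketch: unique_by sorts by .Id and keeps one object per Id
jq -r 'unique_by(.Id)[] | [.Id, .Name] | @tsv' \
  <<< '[{"Id":"101","Name":"Yugi"},{"Id":"101","Name":"Yugi"},{"Id":"102","Name":"David"}]'
# → 101	Yugi
#   102	David
```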
uniques_by/2
For very large arrays, though, or if you want to respect the original ordering, a sort-free alternative to unique_by should be considered. Here is a suitable, generic, stream-oriented alternative:
def uniques_by(stream; f):
  foreach stream as $x ({};
    ($x|f) as $s
    | ($s|type) as $t
    | (if $t == "string" then $s else ($s|tostring) end) as $y
    | if .[$t][$y] then .emit = false
      else .emit = true | (.item = $x) | (.[$t][$y] = true)
      end;
    if .emit then .item else empty end );
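A self-contained sketch of uniques_by in action on the Id example; unlike unique_by, the original input order is preserved (here 102 comes first in the input, so it comes first in the output):

```shell
# A sketch, inlining the definition; the original input order is preserved
jq -r '
  def uniques_by(stream; f):
    foreach stream as $x ({};
      ($x|f) as $s
      | ($s|type) as $t
      | (if $t == "string" then $s else ($s|tostring) end) as $y
      | if .[$t][$y] then .emit = false
        else .emit = true | (.item = $x) | (.[$t][$y] = true)
        end;
      if .emit then .item else empty end );
  uniques_by(.[]; .Id) | [.Id, .Name] | @tsv
' <<< '[{"Id":"102","Name":"David"},{"Id":"101","Name":"Yugi"},{"Id":"101","Name":"Yugi"}]'
# → 102	David
#   101	Yugi
```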

Compare two json files using jq or any other tools in bash

I want to compare two JSON files to see if one can be extracted from the other.
P1 (p1.json)
{
"id": 12,
"keys": ["key1","key2"],
"body": {
"height": "180cm",
"wight": "70kg"
},
"name": "Alex"
}
P2 (p2.json)
{
"id": 12,
"keys": ["key2","key1"],
"body": {
"height": "180cm"
}
}
As can be seen, P2 is not completely equal to P1, but it can be extracted from P1 (it provides less data about the same person, but the data is correct).
Expected behavior:
p1 extends p2 --> true
p2 extends p1 --> false
Notes
- An array cannot be extracted from the same array with some additional elements
The following definition of extends/1 uses a purely object-based definition of extension (in particular, it does not sort arrays). The OP's requirements regarding arrays are unclear to me, but a variant definition is offered in the next section.
# Usage: $in | extends($b) iff $in contains $b in an object-based sense
def extends($b):
  # Handle the case that both are objects:
  def objextends($x):
    . as $in | all($x|keys[]; . as $k | $in[$k] | extends($x[$k]));
  # Handle the case that both are arrays:
  def arrayextends($x):
    . as $in
    | length == ($x|length) and
      all( range(0; length); . as $i | $in[$i] | extends($x[$i]));
  if . == $b then true
  else . as $in
    | type as $intype
    | ($intype == ($b|type)) and
      (($intype == "object" and objextends($b)) or
       ($intype == "array" and arrayextends($b)))
  end;
Examples:
{a:{a:1,b:2}, b:2} | extends({a:{a:1}}) # true
{a:{a:1,b:2}, b:2} | extends({a:{a:2}}) # false
{a:{a:1,b:2}, b:[{x:1,y:2}]} | extends({a:{a:1}, b:[{x:1}]}) # true
Alternative definition
The following definition sorts arrays and is sufficiently generous to handle the given example:
# Usage: $in | extends2($b) iff $in contains $b in a way that ignores the order of array elements
def extends2($b):
  # Both are objects:
  def objextends($x):
    . as $in | all($x|keys[]; . as $k | $in[$k] | extends2($x[$k]));
  # Both are arrays:
  def arrayextends($x): ($x|sort) - sort == [];
  if . == $b then true
  else . as $in
    | type as $intype
    | ($intype == ($b|type)) and
      (($intype == "object" and objextends($b)) or
       ($intype == "array" and arrayextends($b)))
  end;
With $P1 and $P2 as shown:
$P1 | extends2($P2) # yields true
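Here is a self-contained sketch checking both directions on the sample data (note that the recursion inside objextends goes through extends2 itself):

```shell
# A sketch: P1 extends2 P2 should be true, P2 extends2 P1 should be false
jq -cn '
  def extends2($b):
    def objextends($x):
      . as $in | all($x|keys[]; . as $k | $in[$k] | extends2($x[$k]));
    def arrayextends($x): ($x|sort) - sort == [];
    if . == $b then true
    else type as $intype
      | ($intype == ($b|type)) and
        (($intype == "object" and objextends($b)) or
         ($intype == "array" and arrayextends($b)))
    end;
  {id: 12, keys: ["key1","key2"], body: {height: "180cm", wight: "70kg"}, name: "Alex"} as $P1
  | {id: 12, keys: ["key2","key1"], body: {height: "180cm"}} as $P2
  | [($P1 | extends2($P2)), ($P2 | extends2($P1))]
'
# → [true,false]
```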
If you know there are no duplicates in any subarrays, you could use this approach, which computes the difference between the sets of [path, value] pairs returned by tostream, with array indices replaced by null:
def details: [
  tostream
  | select(length == 2) as [$p, $v]
  | [$p | map(if type == "number" then null else . end), $v]
];
def extends(a; b): (b | details) - (a | details) == [];
If P1 and P2 are defined as functions returning the sample data:
def P1: {
"id": 12,
"keys": ["key1","key2"],
"body": {
"height": "180cm",
"wight": "70kg"
},
"name": "Alex"
}
;
def P2: {
"id": 12,
"keys": ["key2","key1"],
"body": {
"height": "180cm"
}
}
;
then
extends(P1;P2) # returns true
, extends(P2;P1) # returns false
In the presence of duplicates, the result is less clear, e.g.:
extends(["a","b","b"];["a","a","b"]) # returns true

How to convert nested JSON to CSV using only jq

I have the following JSON:
{
"A": {
"C": {
"D": "T1",
"E": 1
},
"F": {
"D": "T2",
"E": 2
}
},
"B": {
"C": {
"D": "T3",
"E": 3
}
}
}
I want to convert it into CSV as follows:
A,C,T1,1
A,F,T2,2
B,C,T3,3
Description of output: the parent keys are printed until a leaf is reached; once a leaf is reached, its values are printed.
I've tried the following and couldn't get it to work:
cat my.json | jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $rows[] | @csv'
It throws an error.
I can't hardcode the parent keys, as the actual JSON has too many records, but the structure of the JSON is similar. What am I missing?
Some of the requirements are unclear, but the following solves one interpretation of the problem:
paths as $path
| {path: $path, value: getpath($path)}
| select(.value|type == "object" )
| select( [.value[]][0] | type != "object")
| .path + ([.value[]])
| @csv
(This program could be optimized but the presentation here is intended to make the separate steps clear.)
Invocation:
jq -r -f leaves-to-csv.jq input.json
Output:
"A","C","T1",1
"A","F","T2",2
"B","C","T3",3
Unquoted strings
To avoid the quotation marks around strings, you could replace the last component of the pipeline above with:
join(",")
Here is a solution using tostream and group_by:
[
tostream
| select(length == 2) # e.g. [["A","C","D"],"T1"]
| .[0][:-1] + [.[1]] # ["A","C","T1"]
]
| group_by(.[:-1]) # [[["A","C","T1"],["A","C",1]],...
| .[] # [["A","C","T1"],["A","C",1]]
| .[0][0:2] + map(.[-1]|tostring) # ["A","C","T1","1"]
| join(",") # "A,C,T1,1"
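A sketch of a complete invocation of this program on the sample input:

```shell
# A sketch: tostream-based flattening, grouped by the leading path components
jq -r '
  [ tostream
    | select(length == 2)
    | .[0][:-1] + [.[1]] ]
  | group_by(.[:-1])
  | .[]
  | .[0][0:2] + map(.[-1] | tostring)
  | join(",")
' <<< '{"A":{"C":{"D":"T1","E":1},"F":{"D":"T2","E":2}},"B":{"C":{"D":"T3","E":3}}}'
# → A,C,T1,1
#   A,F,T2,2
#   B,C,T3,3
```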

jq iterate over a array of values a subset at a time

I have json (that actually starts as csv) of the form of an array of elements of the form:
{
"field1" : "value1",
"field2.1; Field2.2; Field2.3" : "Field2.1Value0; Field2.2Value0; Field2.3Value0; Field2.1Value1; Field2.2Value1; Field2.3Value1; ..."
}
...
I would like to iterate over the string value of the "field2.1; Field2.2; Field2.3" field, three ";"-separated items at a time, to produce an array of key-value pairs:
{
"field1" : "value1",
"newfield" : [
  { "Field2.1": "Field2.1Value0",
    "Field2.2": "Field2.2Value0",
    "Field2.3": "Field2.3Value0" },
  { "Field2.1": "Field2.1Value1",
    "Field2.2": "Field2.2Value1",
    "Field2.3": "Field2.3Value1" },
  ...
]
}
...
Note that there are actually a couple of keys that need to be expanded like this, each with a variable number of "sub-keys".
In other words, the original CSV file contains some columns that represent tuples of field values separated by semicolons.
I know how to get down to the "field2.1; Field2.2; Field2.3" field and, say, split its value on ";", but then I'm stuck trying to iterate through the result three (or however many) items at a time to produce the separate 3-tuples.
The real world example/context is the format of the CSV from catalog export from the Google Play Store.
For example, Field2.1 is Locale, Field2.2 is Title, and Field2.3 is Description:
jq '."Locale; Title; Description" |= split(";") '
If possible, then it would be nice if the iteration is based on the number of semicolon separated "subfields" in the key value. There is another column that has a similar format for the price in each country.
The following assumes the availability of splits/1 for splitting a string based on a regex. If your jq does not have it, and if you cannot or don't want to upgrade, you could devise a workaround using split/1, which only works on strings.
First, let's start with a simple variant of the problem that does not require recycling the headers. If the following jq program is in a file (say program.jq):
# Assuming headers is an array of strings,
# create an object from an array of values:
def objectify(headers):
  . as $in
  | reduce range(0; headers|length) as $i ({}; .[headers[$i]] = $in[$i] );

# From an object of the form {key: _, value: _},
# construct an object by splitting each _:
def devolve:
  if .key | index(";")
  then .key as $key
  | [.value | splits("; *")] | objectify([$key | splits("; *")])
  else { (.key): .value }
  end;
to_entries | map( devolve )
and if the following JSON is in input.json:
{
"field1" : "value1",
"field2.1; Field2.2; Field2.3" : "Field2.1Value0; Field2.2Value0; Field2.3Value0"
}
then the invocation:
jq -f program.jq input.json
should yield:
[
{
"field1": "value1"
},
{
"field2.1": "Field2.1Value0",
"Field2.2": "Field2.2Value0",
"Field2.3": "Field2.3Value0"
}
]
It might make sense to add some error-checking or error-correcting code.
Recycling the headers
Now let's modify the above so that headers will be recycled in accordance with the problem statement.
def objectifyRows(headers):
(headers|length) as $m
| (length / $m) as $n
| . as $in
| reduce range(0; $n) as $i ( [];
.[$i] = (reduce range(0; $m) as $h ({};
.[headers[$h]] = $in[($i * $m) + $h] ) ) );
def devolveRows:
if .key|index(";")
then .key as $key
| ( [.value | splits("; *")] )
| objectifyRows([$key | splits("; *")])
else { (.key): .value }
end;
to_entries | map( devolveRows )
With input:
{
"field1" : "value1",
"field2.1; Field2.2; Field2.3" :
"Field2.1Value0; Field2.2Value0; Field2.3Value0; Field2.4Value0; Field2.5Value0; Field2.6Value0"
}
the output would be:
[
{
"field1": "value1"
},
[
{
"field2.1": "Field2.1Value0",
"Field2.2": "Field2.2Value0",
"Field2.3": "Field2.3Value0"
},
{
"field2.1": "Field2.4Value0",
"Field2.2": "Field2.5Value0",
"Field2.3": "Field2.6Value0"
}
]
]
This output can now easily be tweaked along the lines suggested by the OP; e.g., to introduce a new key, one could pipe the above into:
.[0] + { newfield: .[1] }
Functional definitions
Here are reduce-free but efficient (assuming jq >= 1.5) implementations of objectify and objectifyRows:
def objectify(headers):
[headers, .] | transpose | map( {(.[0]): .[1]} ) | add;
def objectifyRows(headers):
def gather(n):
def g: if length>0 then .[0:n], (.[n:] | g ) else empty end;
g;
[gather(headers|length) | objectify(headers)] ;
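A small self-contained illustration of these definitions, using hypothetical "Locale"/"Title" headers over a flat array of values:

```shell
# A sketch: gather chops the values into header-sized chunks,
# and objectify zips each chunk with the headers
jq -c '
  def objectify(headers):
    [headers, .] | transpose | map( {(.[0]): .[1]} ) | add;
  def objectifyRows(headers):
    def gather(n):
      def g: if length > 0 then .[0:n], (.[n:] | g) else empty end;
      g;
    [gather(headers|length) | objectify(headers)];
  objectifyRows(["Locale", "Title"])
' <<< '["en", "Hello", "fr", "Bonjour"]'
# → [{"Locale":"en","Title":"Hello"},{"Locale":"fr","Title":"Bonjour"}]
```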
Here is my almost-final solution, which inserts the new key and uses the first element of the ";"-separated list as the key for sorting the array.
def objectifyRows(headers):
(headers|length) as $m
| (headers[0]) as $firstkey
| (length / $m) as $n
| . as $in
| reduce range(0; $n) as $i ( [];
.[$i] = (reduce range(0; $m) as $h ({};
.[headers[$h]] = $in[($i * $m) + $h] ) ) )
;
def devolveRows:
if .key|index(";")
then .key as $multikey
| ( [.value | splits("; *")] )
# Create a new key with value being an array of the "splits"
| { ($multikey): objectifyRows([$multikey | splits("; *")])}
# here "arbitrarily" sort by the first split key
| .[$multikey] |= sort_by(.[[$multikey | splits("; *")][0]])
else { (.key): .value }
end;
to_entries | map( devolveRows )

How to make it work with filter in jq

I'd like to filter the output from the JSON file below to get all the keys that start with "tag_Name":
{
...
"tag_Name_abc": [
"10_1_4_3",
"10_1_6_2",
"10_1_5_3",
"10_1_5_5"
],
"tag_Name_efg": [
"10_1_4_5"
],
...
}
I tried the following, but it failed:
$ cat output.json |jq 'map(select(startswith("tag_Name")))'
jq: error (at <stdin>:1466): startswith() requires string inputs
There are plenty of ways to do this, but the simplest is to convert the object to entries so you have access to the keys, filter the entries by the key names you want, and then convert back:
with_entries(select(.key | startswith("tag_Name")))
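For example, on a trimmed version of the sample input:

```shell
# A sketch: keep only entries whose key starts with "tag_Name"
jq -c 'with_entries(select(.key | startswith("tag_Name")))' \
  <<< '{"other_key": [], "tag_Name_abc": ["10_1_4_3", "10_1_6_2"], "tag_Name_efg": ["10_1_4_5"]}'
# → {"tag_Name_abc":["10_1_4_3","10_1_6_2"],"tag_Name_efg":["10_1_4_5"]}
```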
Here are a few more solutions:
1) combining values for matching keys with add
. as $d
| keys
| map( select(startswith("tag_Name")) | {(.): $d[.]} )
| add
2) filtering out non-matching keys with delpaths
delpaths([
keys[]
| select(startswith("tag_Name") | not)
| [.]
])
3) filtering out non-matching keys with reduce and del
reduce keys[] as $k (
.
; if ($k|startswith("tag_Name")) then . else del(.[$k]) end
)
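A sketch of the third variant in action, on a trimmed version of the sample input:

```shell
# A sketch: delete each key that does not start with "tag_Name"
jq -c '
  reduce keys[] as $k ( .;
    if ($k | startswith("tag_Name")) then . else del(.[$k]) end )
' <<< '{"other_key": [], "tag_Name_abc": ["10_1_4_3"], "tag_Name_efg": ["10_1_4_5"]}'
# → {"tag_Name_abc":["10_1_4_3"],"tag_Name_efg":["10_1_4_5"]}
```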