Convert JSON to a long-format table using jq

Given an array of JSON objects, all having the same key names (key1, key2, key3), where just one key (key3) has an array as its value, how can it be converted to a long-format table?
Input:
[
  { "key1": "A",
    "key2": 1,
    "key3": ["aaa", "bbb"]
  },
  { "key1": "B",
    "key2": 2,
    "key3": ["ccc", "ddd"]
  }
]
Desired output:
key1  key2  key3
A     1     aaa
A     1     bbb
B     2     ccc
B     2     ddd

.[] | ([.key1, .key2] + (.key3[] | [.])) | @csv
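For reference, a minimal invocation (assuming the array above is saved as input.json; -r is needed so @csv emits plain text rather than a JSON string):
$ jq -r '.[] | ([.key1, .key2] + (.key3[] | [.])) | @csv' input.json
"A",1,"aaa"
"A",1,"bbb"
"B",2,"ccc"
"B",2,"ddd"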

jq - Get a higher level key after a selection

Given a JSON like the following:
{
  "data": [{
    "id": "1a2b3c",
    "info": {
      "a": {
        "number": 0
      },
      "b": {
        "number": 1
      },
      "c": {
        "number": 2
      }
    }
  }]
}
I want to select on a number that is greater than or equal to 2 and for that selection I want to return the values of id and number. I did this like so:
$ jq -r '.data[] | .id as $ID | .info[] | select(.number >= 2) | [$ID, .number]' in.json
[
"1a2b3c",
2
]
Now I would also like to return a higher level key for my selection, in my case I need to return c. How can I accomplish this?
Assuming you want the string "c" instead of 2 in the output, this will work:
$ jq '.data[] | .id as $ID | .info | to_entries[] | select(.value.number >= 2) | [$ID, .key]' input.json
[
"1a2b3c",
"c"
]
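The trick here is that to_entries turns an object into an array of {key, value} pairs, so the parent key survives the selection. A minimal sketch of what it produces:
$ echo '{"a": {"number": 0}, "c": {"number": 2}}' | jq -c 'to_entries'
[{"key":"a","value":{"number":0}},{"key":"c","value":{"number":2}}]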

Read pretty-printed JSON data through Spark

We read hourly-partitioned data stored in S3 through Spark in Scala. For example:
spark.read.textFile("s3://'Bucket'/'key'/'yyyy'/'MM'/'dd'/'hh'/*")
spark.read.textFile reads records one line at a time, so records stored as JSON Lines (the full JSON document on one line) are read and can be parsed later to retrieve data from the JSON.
Now I have to read data containing multiple JSON documents in pretty-printed format instead of JSON Lines. Using the same strategy gives a corrupt-record error. For example, the Dataset[String] obtained after reading through spark.read.textFile:
{
"a": 1,
"b": 2
}
is
+---------------+
|_corrupt_record|
+---------------+
|{              |
|  "a": 1,      |
|  "b": 2       |
|}              |
+---------------+
Input data:
{
"key1": "value1",
"key2": "value2"
}
{
"key1": "value1",
"key2": "value2"
}
Expected output:
+------+------+
|key1 |key2 |
+------+------+
|value1|value2|
|value1|value2|
+------+------+
This file has multiple pretty-printed JSON documents, with a newline as the delimiter between records.
Approaches already used:
spark.read.option("multiline", "true").json(""). This will not work, as the multiline option requires the data to be in the form [{},{}].
Working approach:
val x = sparkSession
  .read
  .json(sc
    .wholeTextFiles(filePath)   // one (path, content) pair per file
    .values
    .flatMap(x => { x
      .replace("\n", "")        // collapse each file to a single line
      .replace("}{", "}}{{")    // double the braces between adjacent records
      .split("\\}\\{")          // split back into one string per record
    }))
I just wanted to ask if there is a better approach, since the above solution does some slicing and dicing of the data that might lead to performance issues on large inputs. Thanks
This can be a working solution for you: use from_json() and the correct schema in order to parse the JSON correctly.
Create the dataframe here (this sketch assumes a SparkSession named spark and the usual import aliases):
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = spark.createDataFrame([(str([{"key1": "value1", "key2": "value2"}, {"key1": "value3", "key2": "value4"}]))], T.StringType())
df.show(truncate=False)
+----------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|
+----------------------------------------------------------------------------+
Now, use explode(), as the value/json column is a list, in order to map correctly.
And, finally, use getItem() to extract the columns:
df = df.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
df = df.withColumn("col", F.explode("col"))
df = df.withColumn("col", F.from_json("col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("key1", df.col.getItem("key1")).withColumn("key2", df.col.getItem("key2"))
df.show(truncate=False)
+----------------------------------------------------------------------------+--------------------------------+------+------+
|value                                                                       |col                             |key1  |key2  |
+----------------------------------------------------------------------------+--------------------------------+------+------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value1, key2 -> value2]|value1|value2|
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value3, key2 -> value4]|value3|value4|
+----------------------------------------------------------------------------+--------------------------------+------+------+

Extracting the elements of a JsonB dictionary in PostgreSQL 12

I have looked at similar questions, but I can't find how to go from storing dictionaries in:
Table A: id int, data jsonb
For example:
id = 1
data = {"Key1": 1, "Key2": "a2", "Key3": [3, 4]}
to Table B: id int, key text, payload jsonb
Using the same example as above, I would get the 3 records:
id | key  | payload
---+------+---------
1  | Key1 | 1
1  | Key2 | "a2"
1  | Key3 | [3, 4]
Thanks in advance for any help!
Use jsonb_each():
insert into table_b
select id, key, payload
from table_a
cross join lateral jsonb_each(data) as e(key, payload);
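For intuition, jsonb_each() alone expands a single jsonb value into (key, value) rows, and the lateral cross join pairs each of those rows with the id from table_a. For the example row, jsonb_each(data) should yield:
 key  | value
------+--------
 Key1 | 1
 Key2 | "a2"
 Key3 | [3, 4]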

JQ - Denormalize nested object

I've been trying to convert some JSON to CSV and I have the following problem:
I have the following input json:
{"id": 100, "a": [{"t" : 1,"c" : 2 }, {"t": 2, "c" : 3 }] }
{"id": 200, "a": [{"t": 2, "c" : 3 }] }
{"id": 300, "a": [{"t": 1, "c" : 3 }] }
And I expect the following CSV output:
id,t1,t2
100,2,3
200,,3
300,3,
Unfortunately, jq produces no output when one of the selects has no match.
Example:
echo '{ "id": 100, "a": [{"t" : 1,"c" : 2 }, {"t": 2, "c" : 3 }] }' | jq '{t1: (.a[] | select(.t==1)).c , t2: (.a[] | select(.t==2)).c }'
output:
{ "t1": 2, "t2": 3 }
but if one of the selects finds no match, nothing is returned at all.
Example:
echo '{ "id": 100, "a": [{"t" : 1,"c" : 2 }] }' | jq '{t1: (.a[] | select(.t==1)).c , t2: (.a[] | select(.t==2)).c }'
Expected output:
{ "t1": 2, "t2": null }
Does anyone know how to achieve this with JQ?
EDIT:
Based on a comment made by @peak I found the solution that I was looking for:
jq -r '["id","t1","t2"], [.id, (.a[] | select(.t==1)).c // null, (.a[] | select(.t==2)).c // null] | @csv'
The alternative operator does exactly what I was looking for.
Alternative Operator
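For intuition, the alternative operator a // b produces b whenever a yields no output, null, or false. A minimal sketch of the failing case from above:
$ echo '{"a": [{"t": 1, "c": 2}]}' | jq '(.a[] | select(.t==2)).c // null'
null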
Here's a simple solution that does not assume anything about the ordering of the items in the .a array, and easily generalizes to arbitrarily many .t values:
# Convert an array of {t, c} to a dictionary:
def tod: map({(.t|tostring): .c}) | add;
["id", "t1", "t2"], # header
(inputs
| (.a | tod) as $dict
| [.id, (range(1;3) as $i | $dict[$i|tostring]) ])
| @csv
Command-line options
Use the -n option (because inputs is being used), and the -r option (to produce CSV).
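Putting it together (assuming the filter above is saved as tod.jq and the three input objects as tmp.json):
$ jq -nr -f tod.jq tmp.json
"id","t1","t2"
100,2,3
200,,3
300,3,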
This is an absolute mess, but it works:
$ cat tmp.json
{"id": 100, "a": [{"t" : 1,"c" : 2 }, {"t": 2, "c" : 3 }] }
{"id": 200, "a": [{"t": 2, "c" : 3 }] }
{"id": 300, "a": [{"t": 1, "c" : 3 }] }
$ cat filter.jq
def t(id):
  .a |
  map({key: "t\(.t)", value: .c}) |
  ({id: id, t1: null, t2: null} | to_entries) + . | from_entries
;
inputs |
map(.id as $id | t($id)) |
(.[0] | keys_unsorted) as $hdr |
([$hdr] + map(to_entries | map(.value)))[] |
@csv
$ jq -rn --slurp -f filter.jq tmp.json
"id","t1","t2"
100,2,3
200,,3
300,3,
In short, you build an object containing the values from your input, then add it to a "default" object so the missing keys are filled in.
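For intuition, it is jq's object addition that fills in the defaults: keys on the right override keys on the left, and anything missing keeps its default. A minimal sketch:
$ jq -cn '{t1: null, t2: null} + {t1: 2}'
{"t1":2,"t2":null}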

JSON to CSV conversion with values as headers

I have the JSON file below and need to convert it to a CSV file, with some of the values becoming headers and the corresponding values populated below them. Here is a sample:
{
  "environments" : [ {
    "dimensions" : [ {
      "metrics" : [ {
        "name" : "count",
        "values" : [ "123" ]
      }, {
        "name" : "response_time",
        "values" : [ "15.7" ]
      } ],
      "name" : "abcd"
    }, {
      "metrics" : [ {
        "name" : "count",
        "values" : [ "456" ]
      }, {
        "name" : "response_time",
        "values" : [ "18.7" ]
      } ],
      "name" : "xyzz"
    } ]
  } ]
}
This is what I have tried already:
jq -r '.environments[].dimensions[] | .name as $p_name | .metrics[] | .name as $val_name | if $val_name == "response_time" then ($p_name,$val_name, .values[])' input.json
Expected output:
name,count,response_time
abcd, 123, 15.7
xyzz, 456, 18.7
If the goal is to rely on the JSON itself to supply the header names in whatever order the "metrics" arrays present them,
then consider:
.environments[].dimensions
| ["name", (.[0] | .metrics[] | .name)], # first emit the headers
( .[] | [.name, (.metrics[].values[0])] ) # ... and then the data rows
| @csv
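For reference, a full invocation (assuming the sample above is saved as input.json; note that @csv quotes string values):
$ jq -r '.environments[].dimensions | ["name", (.[0] | .metrics[] | .name)], (.[] | [.name, (.metrics[].values[0])]) | @csv' input.json
"name","count","response_time"
"abcd","123","15.7"
"xyzz","456","18.7"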
Generating the headers is easy, so I'll focus on generating the rest of the CSV.
The following has the advantage of being straightforward and will hopefully be more-or-less self-explanatory, at least with the jq manual at the ready. A tweak with an eye to efficiency follows.
jq -r '
# name,count,response_time
.environments[].dimensions[]
| .name as $p_name
| .metrics
| [$p_name]
+ map(select(.name == "count") | .values[0] )
+ map(select(.name == "response_time") | .values[0] )
| @csv
'
Efficiency
Here's a variant of the above which would be appropriate if the .metrics array had a large number of items:
jq -r '
# name,count,response_time
.environments[].dimensions[]
| .name as $p_name
| INDEX(.metrics[]; .name) as $dict
| [$p_name, $dict["count"].values[0], $dict["response_time"].values[0]]
| @csv
'
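For intuition, INDEX(.metrics[]; .name) builds a dictionary keyed by each metric's name, so every lookup is a direct key access rather than a scan of the array. A minimal sketch:
$ echo '[{"name":"count","values":["123"]},{"name":"response_time","values":["15.7"]}]' | jq -c 'INDEX(.[]; .name)'
{"count":{"name":"count","values":["123"]},"response_time":{"name":"response_time","values":["15.7"]}}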