PySpark: reading a JSON file with indentation character (\t)

I am trying to read a JSON file using PySpark. I can usually open JSON files without issue, but one of my files shows its indentation as \t characters when read. At first, I made the following attempt to read the file:
spark = SparkSession.builder.appName("spark_learning").getOrCreate()
read1 = spark.read.format("json").option("multiplelines", "true").load(file_path)
This resulted in a single ['_corrupt_record'] column as the outcome. In a second attempt, I tried the following code:
read2 = spark.read.format("text").load(file_path)
read2.show()
The output is
+--------------------+
| value|
+--------------------+
| {|
| \t"key1": "value1",|
| \t"key2": "value2",|
| \t"key3": "value3",|
| \t"key4": [{|
|\t\t"sub_key1": 1...|
| \t\t"sub_key2": [{|
|\t\t\t"nested_key...|
| \t}]|
| }|
+--------------------+
When I compared this JSON file to others I was able to read, I noticed the \t difference. It seems that the file's indentation is being read as literal \t characters. I also tried to replace \t with blank spaces using the available answers (e.g., How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?). However, I was not successful; it still gave me a _corrupt_record column. I would be happy to receive any help from this community.
(PS: I am new to the Big Data world and PySpark.)
Here is the sample data:
{
    "key1": "value1",
    "key2": "value2",
    "key3": "value3",
    "key4": [{
        "sub_key1": 1111,
        "sub_key2": [{
            "nested_key1": [5555]}]
    }]
}
(https://drive.google.com/file/d/1_0-9d41LnFR8_OGP4k0HK7ghn1JkQhmR/view?usp=sharing)

The .option() key should be multiline, not multiplelines. With this change you should be able to read the JSON as-is into a DataFrame; otherwise you have to read it with wholeTextFiles() and map it to JSON yourself (see the sketch after the output below).
df=spark.read.format("json").option("multiline","true").load(file_path)
df.show(truncate=False)
+------+------+------+--------------------+
|key1 |key2 |key3 |key4 |
+------+------+------+--------------------+
|value1|value2|value3|[{1111, [{[5555]}]}]|
+------+------+------+--------------------+
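For reference, here is a rough sketch of the wholeTextFiles() fallback mentioned above (not needed once the option name is fixed); it assumes spark and file_path are defined exactly as in the question:
# wholeTextFiles() yields (path, entire_file_content) pairs, so the \t
# indentation and line breaks no longer matter: each file arrives as one string.
whole_files = spark.sparkContext.wholeTextFiles(file_path)
# Hand the raw strings to the JSON reader; each string is parsed as a single document.
df_alt = spark.read.json(whole_files.values())
df_alt.show(truncate=False)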

Related

Maintain order of jq CSV output and include empty values

I am currently working on a bash script that combines the output of both the aws iam list-users and aws iam list-user-tags commands into a CSV file containing all users along with their respective information and assigned tags. To parse the JSON output of those commands I chose to use jq.
Retrieving, parsing, and converting (JSON to CSV) the list-users output works fine and produces the expected comma-separated list of values.
The output of list-user-tags does not quite behave that way. Its JSON output has the following schema:
{
  Tags: [
    {
      Key: "Name",
      Value: "NameOfUser"
    },
    {
      Key: "Email",
      Value: "EmailOfUser"
    },
    {
      Key: "Company",
      Value: "CompanyOfUser"
    }
  ]
}
Unfortunately the order of the tags is not consistent across users (and possibly across queries) which currently makes it impossible for me to maintain the order defined in the CSV file. On top of that there is the possibility of one or multiple missing tags.
What I am looking for is a way to achieve the following (preferably using jq):
Select a tag's "Value"-value by its "Key"-value
Check whether it exists and, if not, add an empty entry
Put the value in the exact same place every time (maintain a certain order)
Repeat for every entry in the original output
Convert the resulting array of values into CSV
What I tried so far:
aws iam list-user-tags --user-name abcdef --no-cli-pager \
| jq -r '[.Tags[] | select(.Key=="Name"),select(.Key=="Email"),select(.Key=="Company") | .Value // ""] | @csv'
Any help is much appreciated!
Let's suppose your sample "schema" is in a file named schema.jq
Then
jq -n -f schema.jq | jq -r '
def get($key):
map(select(.Key == $key))
| if length == 0 then null else .[0].Value end;
.Tags | [get("Name"), get("Email"), get("Company")] | @csv
'
produces the following CSV:
"NameOfUser","EmailOfUser","CompanyOfUser"
It should be easy to adapt this illustration to your needs.
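As a side note on the "missing tag" requirement: get returns null for an absent key, and @csv renders null as an empty field, so a user lacking, say, an Email tag still produces a correctly aligned row. A hypothetical run (the inline sample input is made up for illustration):
echo '{"Tags":[{"Key":"Name","Value":"NameOfUser"}]}' | jq -r '
  def get($key):
    map(select(.Key == $key))
    | if length == 0 then null else .[0].Value end;
  .Tags | [get("Name"), get("Email"), get("Company")] | @csv
'
# => "NameOfUser",,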

Fuzzy match string with jq

Let's say I have some JSON in a file. It's a subset of JSON data extracted from a larger JSON file (that's why I'll use --stream later in my attempted solution), and it looks like this:
[
{"_id":"1","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"2","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
],
[
{"_id":"55","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"56","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
]
It describes 4 posts written by 2 different authors, with unique _id fields for each post. Both authors wrote 2 posts, where 1 says "Hello World" and the other says "Goodbye World".
I want to match on the word "Hello" and return the _id only for fields containing "Hello". The expected result is:
1
55
The closest I could come in my attempt was:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body %like% "Hello")
| ._id
' <input_file
Assuming the input is modified slightly to make it a stream of the arrays as shown in the Q:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | test("Hello"))
| ._id
'
produces the desired output.
test uses regex matching. In your case, it seems you could use simple substring matching instead.
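For example, the same filter with plain substring matching via jq's built-in contains (startswith or index would also work):
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | contains("Hello"))
| ._id
'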
Handling extraneous commas
Assuming the input has commas between a stream of valid JSON exactly as shown, you could presumably use sed to remove them first.
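A rough sketch of such a pre-pass, assuming the stray comma always ends a line of its own as ], exactly as in the sample:
sed 's/^],$/]/' input_file | jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | test("Hello"))
| ._id
'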
Or, if you want an only-jq solution, use the following in conjunction with the -n, -r and --stream command-line options:
def iterate:
fromstream(1|truncate_stream(inputs?))
| select(.body | test("Hello"))
| ._id,
iterate;
iterate
(Notice the "?".)
The streaming parser (invoked with --stream) is usually not needed for the kind of task you describe, so in this response, I'm going to assume that the following (or a variant thereof) will suffice:
.[]
| select( .body | test("Hello") )._id
This of course assumes that the input is valid JSON.
Handling comma-delimited JSON
If your input is a comma-delimited stream of JSON as shown in the Q, you could use the following in conjunction with the -n command-line option:
# This is a variant of the built-in `recurse/1`:
def iterate(f): def r: f | (., r); r;
iterate( inputs? | .[] | select( .body | test("Hello") )._id )
Please note that this assumes that whatever occurs on a line after a delimiting comma can be ignored.
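A hypothetical invocation, with the program above saved in a file named (say) hello.jq:
jq -n -f hello.jq <input_file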

have jq detect errors in influxdb json output

I have a jq filter that converts (InfluxDB) JSON input to CSV for further parsing. However, this filter fails when InfluxDB returns an error. I'm trying to improve my jq filter to detect this case, but I can't get it to work. I need something like https://stackoverflow.com/a/41829748 but haven't been able to adapt it. Any ideas?
Example data
{"results":[{"statement_id":0,"series":[{"name":"energyv3","columns":["time","value"],"values":[["2015-07-30T23:59:00Z",56980800],["2015-07-31T23:59:00Z",95108400]]}]}]}
{"error":"error parsing query: found EOF, expected integer at line 1, char 34"}
Desired outcome
"\"time\",\"value\""
"\"2015-07-30T23:59:00Z\",56980800"
"\"2015-07-31T23:59:00Z\",95108400"
"error parsing query: found EOF, expected integer at line 1, char 34"
i.e.
For input with .results key: data formatted as csv (works OK)
For input with .error key: only error string (doesn't work)
Current filter used
select(.results) | (.results[0].series[0].columns), (.results[0].series[0].values[]) | @csv
Attempt to combine filters
((select(.error) | {error}) // null) + select(.results) | (.results[0].series[0].columns), (.results[0].series[0].values[]) | @csv
Based on your attempts, and the assumption that each object contains either results or error, this should do it:
( .results[0].series | .[0].columns, .[]?.values[] ) // [ .error ] | @csv
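For reference, a sketch of how this could be run against the two sample objects, assuming they are saved (one per line, as shown) in a file named influx.json; the filename is purely illustrative:
jq '( .results[0].series | .[0].columns, .[]?.values[] ) // [ .error ] | @csv' influx.json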

Bash script to extract all specific key values from an unstructured JSON file

I was trying to extract all the values from a specific key in the below JSON file.
{
  "tags": [
    {
      "name": "xxx1",
      "image_id": "yyy1"
    },
    {
      "name": "xxx2",
      "image_id": "yyy2"
    }
  ]
}
I used the below code to get the image_id key values.
echo new.json | jq '.tags[] | .["image_id"]'
I'm getting the below error message.
parse error: Invalid literal at line 2, column 0
I think either the JSON file is not in the proper format or the echo command used to feed in the JSON file is wrong.
Given the above input, my intended/desired output is:
yyy1
yyy2
What needs to be fixed to make this happen?
When you run:
echo new.json | jq '.tags[] | .["image_id"]'
...the string new.json -- not the contents of the file named new.json -- is fed to jq's stdin, and is thus what it tries to parse as JSON text.
Instead, run:
jq -r '.tags[] | .["image_id"]' <new.json
...to directly open new.json connected to the stdin of jq (and, with -r, to avoid adding unwanted quotes to the output stream).
Your filter .tags[] | .["image_id"]
is valid, but can be abbreviated to:
.tags[] | .image_id
or even:
.tags[].image_id
If you want the values associated with the "image_id" key, wherever that key occurs, you could go with:
.. | objects | select(has("image_id")) | .image_id
Or, if you don't mind throwing away false and null values:
.. | .image_id? // empty
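Putting both answers together, a minimal end-to-end run against the sample file (assuming it is saved as new.json) looks like this:
jq -r '.tags[].image_id' new.json
# yyy1
# yyy2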

replace comma in json file's field with jq-win

I have a problem working with a JSON file. I launch curl in an AutoIt script to download a JSON file from the web and then convert it to CSV format with jq-win:
jq-win32 -r ".[]" -c class.json>class.txt
and the json is in the following format:
[
  {
    "id":"1083",
    "name":"AAAAA",
    "channelNumber":8,
    "channelImage":""},
  {
    "id":"1084",
    "name":"bbbbb",
    "channelNumber":7,
    "channelImage":""},
  {
    "id":"1088",
    "name":"CCCCCC",
    "channelNumber":131,
    "channelImage":""},
  {
    "id":"1089",
    "name":"DDD,DDD",
    "channelNumber":132,
    "channelImage":""},
]
after jq-win, the file should become:
{"id":"1083","name":"AAAAA","channelNumber":8,"channelImage":""}
{"id":"1084","name":"bbbbb","channelNumber":7,"channelImage":""}
{"id":"1088","name":"CCCCCC","channelNumber":131,"channelImage":""}
{"id":"1089","name":"DDD,DDD","channelNumber":132,"channelImage":""}
and then the CSV file will be further processed by the AutoIt script and become:
AAAAA,1083
bbbbb,1084
CCCCCC,1088
DDD,DDD,1089
The JSON has around 300 records, and among them 5~6 records have a comma in a field, e.g. DDD,DDD,
so when I try to read in the CSV file with _FileReadToArray, the comma in DDD,DDD causes trouble.
My question is: can I replace the comma in the field using jq-win?
(I tried using fart.exe but it replaces every comma in the JSON file, which is not suitable for me.)
Thanks a lot.
Regds
LAM Chi-fung
can I replace comma in the field using jq-win ?
Yes. For example, use gsub, pretty much as you’d use awk’s gsub, e.g.
gsub(","; "|")
If you want more details, please provide more details as per [mcve] (a minimal reproducible example).
Example
With the given JSON input, the jq program:
.[]
| .name |= gsub(",";";")
| [.[]]
| map(tostring)
| join(",")
yields:
1083,AAAAA,8,
1084,bbbbb,7,
1088,CCCCCC,131,
1089,DDD;DDD,132,
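Since quoting a program with embedded double quotes on the Windows command line is fiddly, one option (a sketch; fix.jq is an illustrative filename holding the five-line program above) is to pass the filter with -f:
jq-win32 -r -f fix.jq class.json>class.txt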