Read multiline JSON in Apache Spark

I was trying to use a JSON file as a small DB. After creating a temporary table from the DataFrame, I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema() result:
root
|-- _corrupt_record: string (nullable = true)
My JSON file:
{
"id": 1,
"name": "Morty",
"age": 21
}
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
_corrupt_record is
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON as a one-liner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is with the newlines.
UPD3
I found the following sentence in the docs:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
It isn't convenient to keep JSON in such a format. Is there any workaround to get rid of the multi-line structure of the JSON, or to convert it to a one-liner?

Spark >= 2.2
Spark 2.2 introduced the multiLine option (originally named wholeFile), which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.
Spark < 2.2
Well, using JSONL-formatted data may be inconvenient, but I will argue that it is not an issue with the API but with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema, and without making some very specific assumptions about its formatting and shape, it is almost impossible to correctly identify top-level documents. Arguably, this is the worst possible format to use in a system like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.
That being said, if individual files are valid JSON documents (either a single document or an array of documents), you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values)
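If you also want a single-line (JSONL) copy on disk so that a plain spark.read.json works later, here is a rough sketch; the paths are placeholders, and collapsing newlines is safe because valid JSON cannot contain raw newlines inside string values:
// Collapse each whole-file JSON document onto a single line and save as JSON Lines.
sc.wholeTextFiles("/path/to/json-dir")
  .values
  .map(_.replace("\n", " "))
  .saveAsTextFile("/path/to/jsonl-out")

// The result can then be read as regular JSON Lines:
val flattened = spark.read.json("/path/to/jsonl-out")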

Just to add to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.

Related

Pulling data out of JSON Object via jq to CSV in Bash

I'm working on a bash script (running via Git Bash on Windows, technically, but I don't think that matters) that will convert some JSON API data into CSV files. Most of it has gone fairly well, especially considering that I'm not particularly familiar with jq, as this is my first time using it.
I've got some JSON data that looks like the array below. What I'm trying to do is select the cardType, maskedPan, amount and dateTime out of the data.
This is probably the first time in my life that my Google searching has failed me. I know (or should I say think) that this is actually an object and not just a simple array.
I've not really found anything that helps me grab the data I need and export it into a CSV file. I've had no issue grabbing the other data that I need, but these few pieces are proving to be a big problem for me.
The script I'm trying basically can be boiled down to this:
jq='/c/jq-win64.exe -r';
header='("cardType")';
fields='[.TransactionDetails[0].Value[0].cardType]';
$jq ''$header',(.[] | '$fields' | #csv)' < /t/API_Data/JSON/GetByDate-082719.json > /t/API_Data/CSV/test.csv;
If I do .TransactionDetails[0].Value I can get that whole chunk of data. But that is problematic in a CSV as it contains commas.
I suppose I could make this a TSV and import it into the database as one big string and sub string it out. But that isn't the "right" solution. I'm sure there is a way JQ can give me what I need.
"TransactionDetails": [
{
"TransactionId": 123456789,
"Name": "BlacklinePaymentDetail",
"Value": "{\"cardType\":\"Visa\",\"maskedPan\":\"1234\",\"paymentDetails\":{\"reference\":\"123456789012\",\"amount\":99.99,\"dateTime\":\"2019/08/27 08:41:09\"}}",
"ShowOnTill": false,
"PrintOnOrder": false,
"PrintOnReceipt": false
}
]
Ideally I'd be able to just have a field in the CSV for each of cardType, maskedPan, amount and dateTime instead of pulling the "Value" that contains all of them.
Any advice would be appreciated.
The ingredient you're missing is fromjson, which converts stringified JSON to JSON. After adding enclosing braces around your sample input, the invocation:
jq -r -f program.jq input.json
produces:
"Visa","1234",99.99,"2019/08/27 08:41:09"
where program.jq is:
.TransactionDetails[0].Value
| fromjson
| [.cardType, .maskedPan] + (.paymentDetails | [.amount, .dateTime])
| #csv

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

I have a json file:
{
"a": {
"b": 1
}
}
I am trying to read it:
val path = "D:/playground/input.json"
val df = spark.read.json(path)
df.show()
But I am getting an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed
when the referenced columns only include the internal corrupt record
column (named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and
spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the
same query. For example, val df =
spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
So I tried to cache it as they suggest:
val path = "D:/playground/input.json"
val df = spark.read.json(path).cache()
df.show()
But I keep getting the same error.
You may try either of these two ways.
Option-1: JSON in a single line, as answered above by @Avishek Bhattacharya.
Option-2: Add an option to read multi-line JSON in the code, as follows. You can also read the nested attribute, as shown below.
val df = spark.read.option("multiline","true").json("C:\\data\\nested-data.json")
df.select("a.b").show()
Here is the output for Option-2.
20/07/29 23:14:35 INFO DAGScheduler: Job 1 finished: show at NestedJsonReader.scala:23, took 0.181579 s
+---+
| b|
+---+
| 1|
+---+
The problem is with the JSON file. The file "D:/playground/input.json" looks, as you described, like this:
{
"a": {
"b": 1
}
}
This is not right. Spark, while processing JSON data, considers each new line as a complete JSON document, so it fails.
You should keep your complete JSON on a single line in compact form, with all whitespace and newlines removed.
Like this:
{"a":{"b":1}}
If you want multiple JSON documents in a single file, keep them like this:
{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}} ...
For more info, see the Spark documentation.
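As a quick check, once the file contains one JSON document per line, the plain read from the question should work; a small sketch using the path and nested field from the question:
val df = spark.read.json("D:/playground/input.json")  // now one document per line
df.select("a.b").show()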
This error means one of two things:
1- Either your file format isn't what you think it is (and you are using the wrong read method for it, e.g. it's plain text but you mistakenly used the json method), or
2- your file doesn't follow the standard for the format you are using (while you used the correct method for the correct format); this usually happens with JSON.
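A quick way to tell the two cases apart (a hedged sketch; the path is taken from the question above):
// Case 1: look at the raw lines to confirm the file really contains JSON.
spark.read.text("D:/playground/input.json").show(truncate = false)

// Case 2: if it is a multi-line JSON document, reading it with multiLine
// should produce a real schema instead of only _corrupt_record.
spark.read.option("multiLine", "true").json("D:/playground/input.json").printSchema()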

Oracle SQLcl: Spool to json, only include content in items array?

I'm making a query via Oracle SQLcl. I am spooling into a .json file.
The correct data is presented from the query, but the format is strange.
Starting off as:
SET ENCODING UTF-8
SET SQLFORMAT JSON
SPOOL content.json
Followed by a query, this produces a JSON file as requested.
However, how do I remove the outer structure, meaning this part:
{"results":[{"columns":[{"name":"ID","type":"NUMBER"},
{"name":"LANGUAGE","type":"VARCHAR2"},{"name":"LOCATION","type":"VARCHAR2"},{"name":"NAME","type":"VARCHAR2"}],"items": [
// Here is the actual data I want to see in the file exclusively
]
I only want to spool everything in the items array, not including that key itself.
Is this possible to set as a parameter before querying? Reading the Oracle docs has not yielded any answers, hence asking here.
That's how I handle this.
After outputting to a file, I use the jq command to recreate the file with only the items:
cat file.json | jq --compact-output --raw-output '.results[0].items' > items.json
Using this library: https://stedolan.github.io/jq/

Ways to parse JSON using KornShell

I have working code for parsing JSON output using KornShell by treating it as a string of characters. The issue I have is that the vendor keeps changing the position of the field that I am interested in. I understand that in JSON we can parse by key-value pairs.
Is there something out there that can do this? I am interested in a specific field, and I would like to use it to run checks on the status of another REST API call.
My sample json output is like this:
JSONDATA value :
{
"status": "success",
"job-execution-id": 396805,
"job-execution-user": "flexapp",
"job-execution-trigger": "RESTAPI"
}
I would need the job-execution-id value to monitor this job through the rest of the script.
I am using the following command to parse it:
RUNJOB=$(print ${DATA} |cut -f3 -d':'|cut -f1 -d','| tr -d [:blank:]) >> ${LOGDIR}/${LOGFILE}
The problem with this is that it is field-delimited by :. The field position has been known to be changed by the vendor during releases.
So I am trying to see if I can use a utility out there that would always give me the key-value pair of "job-execution-id": 396805, no matter where it is in the json output.
I started looking at jsawk, but it requires the JS interpreter to be installed on our machines, which I don't want. Any hints on how to go about finding which RPM I need to solve this?
I am using RHEL5.5.
Any help is greatly appreciated.
The ast-open project has libdss (and a dss wrapper) which supposedly could be used with ksh. Documentation is sparse and is limited to a few messages on the ast-user mailing list.
The regression tests for libdss contain some json and xml examples.
I'll try to find more info.
Python is included by default with CentOS so one thing you could do is pass your JSON string to a Python script and use Python's JSON parser. You can then grab the value written out by the script. An example you could modify to meet your needs is below.
Note that by specifying other dictionary keys in the Python script you can get any of the values you need without having to worry about the order changing.
Python script:
#get_job_execution_id.py
# The try/except is because you'll probably have Python 2.4 on CentOS 5.5,
# and the straight "import json" statement won't work unless you have Python 2.6+.
try:
import json
except:
import simplejson as json
import sys
json_data = sys.argv[1]
data = json.loads(json_data)
job_execution_id = data['job-execution-id']
sys.stdout.write(str(job_execution_id))
Kornshell script that executes it:
#!/bin/ksh
#get_job_execution_id.sh
JSON_DATA='{"status":"success","job-execution-id":396805,"job-execution-user":"flexapp","job-execution-trigger":"RESTAPI"}'
EXECUTION_ID=`python get_job_execution_id.py "$JSON_DATA"`
echo $EXECUTION_ID

Format for writing a JSON log file?

Are there any format standards for writing and parsing JSON log files?
The problem I see is that you can't have a "pure" JSON log file since you need matching brackets and trailing commas are forbidden. So while the following may be written by an application, it can't be parsed by standard JSON parsers:
[{date:'2012-01-01 02:00:01', severity:"ERROR", msg:"Foo failed"},
{date:'2012-01-01 02:04:02', severity:"INFO", msg:"Bar was successful"},
{date:'2012-01-01 02:10:12', severity:"DEBUG", msg:"Baz was notified"},
So you must have some conventions on how to structure your log files in a way that a parser can process them. The easiest thing would be "one log message object per line, newlines in string values are escaped". Are there any existing standards and tools?
You're not going to write a single JSON object per FILE, you're going to write a JSON object per LINE. Each line can then be parsed individually. You don't need to worry about trailing commas or having the whole set of objects enclosed in brackets, etc. See http://blog.nodejs.org/2012/03/28/service-logging-in-json-with-bunyan/ for a more detailed explanation of what this can look like.
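For example, the log from the question rewritten as one self-contained JSON object per line looks like this:
{"date": "2012-01-01 02:00:01", "severity": "ERROR", "msg": "Foo failed"}
{"date": "2012-01-01 02:04:02", "severity": "INFO", "msg": "Bar was successful"}
{"date": "2012-01-01 02:10:12", "severity": "DEBUG", "msg": "Baz was notified"}
Each line is valid JSON on its own, so the file can be parsed (or tailed) line by line.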
Also check out Fluentd http://fluentd.org/ for a neat toolset to work with.
Edit: this format is now called JSON Lines or jsonl, as pointed out by @Mnebuerquo below - see http://jsonlines.org/
The gem log_formatter is the Ruby choice; as a collection of formatters, it now supports a JSON formatter for plain Ruby and for Log4r.
It's simple to get started with for Ruby:
gem 'log_formatter'
require 'log_formatter'
require 'log_formatter/ruby_json_formatter'
logger.debug({data: "test data", author: 'chad'})
Result:
{
"source": "examples",
"data": "test data",
"author": "chad",
"log_level": "DEBUG",
"log_type": null,
"log_app": "app",
"log_timestamp": "2016-08-25T15:34:25+08:00"
}
For Log4r:
require 'log4r'
require 'log_formatter'
require 'log_formatter/log4r_json_formatter'
logger = Log4r::Logger.new('Log4RTest')
outputter = Log4r::StdoutOutputter.new(
"console",
:formatter => Log4r::JSONFormatter::Base.new
)
logger.add(outputter)
logger.debug( {data: "test data", author: 'chad'} )
Advanced usage: README
Full Example Code: examples