Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column - json

I have a json file:
{
  "a": {
    "b": 1
  }
}
I am trying to read it:
val path = "D:/playground/input.json"
val df = spark.read.json(path)
df.show()
But getting an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed
when the referenced columns only include the internal corrupt record
column (named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and
spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the
same query. For example, val df =
spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
So I tried to cache it as they suggest:
val path = "D:/playground/input.json"
val df = spark.read.json(path).cache()
df.show()
But I keep getting the same error.

You may try either of these two ways.
Option-1: Put the JSON on a single line, as described by @Avishek Bhattacharya in the answer below.
Option-2: Add an option to read multi-line JSON in the code, as follows. You can also read nested attributes, as shown below.
val df = spark.read.option("multiline","true").json("C:\\data\\nested-data.json")
df.select("a.b").show()
Here is the output for Option-2.
20/07/29 23:14:35 INFO DAGScheduler: Job 1 finished: show at NestedJsonReader.scala:23, took 0.181579 s
+---+
| b|
+---+
| 1|
+---+

The problem is with the JSON file. The file "D:/playground/input.json" looks as you described:
{
  "a": {
    "b": 1
  }
}
This is not right for Spark's default reader: Spark treats each new line of a JSON file as a complete JSON document, so it fails on pretty-printed input.
You should keep your complete JSON on a single line, in compact form, with all whitespace and newlines removed, like this:
{"a":{"b":1}}
If you want multiple JSON documents in a single file, keep them one per line:
{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}} ...
For more info, see the Spark documentation on JSON data sources.

This error means one of two things:
1- Your file format isn't what you think it is (and you are using the wrong method for it, e.g. it's plain text but you mistakenly used the JSON method).
2- Your file doesn't follow the standard for the format you are using (even though you used the correct method for the correct format); this usually happens with JSON.

Related

Can't display CSV file in pyspark(ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling)

I'm getting an error while displaying a CSV file through PySpark. I've attached the PySpark code and CSV file that I used.
from pyspark.sql import *
spark.conf.set("fs.azure.account.key.xxocxxxxxxx","xxxxx")
time_on_site_tablepath= "wasbs://dwpocblob@dwadfpoc.blob.core.windows.net/time_on_site.csv"
time_on_site = spark.read.format("csv").options(header='true', inferSchema='true').load(time_on_site_tablepath)
display(time_on_site.head(50))
The error is shown below
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
The CSV file's schema is shown below:
time_on_site:pyspark.sql.dataframe.DataFrame
next_eventdate:timestamp
barcode:integer
eventdate:timestamp
sno:integer
eventaction:string
next_action:string
next_deviceid:integer
next_device:string
type_flag:string
site:string
location:string
flag_perimeter:integer
deviceid:integer
device:string
tran_text:string
flag:integer
timespent_sec:integer
gg:integer
Sample CSV data is shown below:
next_eventdate,barcode,eventdate,sno,eventaction,next_action,next_deviceid,next_device,type_flag,site,location,flag_perimeter,deviceid,device,tran_text,flag,timespent_sec,gg
2018-03-16 05:23:34.000,1998296,2018-03-14 18:50:29.000,1,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,124385,0
2018-03-17 07:22:16.000,1998296,2018-03-16 18:41:09.000,3,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,45667,0
2018-03-19 07:23:55.000,1998296,2018-03-17 18:36:17.000,6,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,1,132458,1
2018-03-21 07:25:04.000,1998296,2018-03-19 18:23:26.000,8,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,133298,0
2018-03-24 07:33:38.000,1998296,2018-03-23 18:39:04.000,10,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,46474,0
What could be done to load the CSV file successfully?
There is no issue with your syntax; it works fine.
The issue is in your CSV data: the column named type_flag contains only None (null) values, so Spark cannot infer its data type.
So here are two options.
You can display the data without using head(), like this:
display(time_on_site)
Or, if you want to use head(), you need to replace the null values first; here I replaced them with the empty string (''):
time_on_site = time_on_site.fillna('')
display(time_on_site.head(50))
For some reason, probably a bug, even if you provide a schema on the spark.read.schema(my_schema).csv('path') call, you still get the same error on a display(df.head()) call.
display(df) works, though, which gave me a WTF moment.
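For completeness, here is a sketch (not from the answers above) of an explicit schema assembled from the column list in the question; declaring type_flag as a string sidesteps inference over the all-null column, though per the note above you may still need fillna('') before head():

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, TimestampType
)

# Schema built from the column list shown in the question.
schema = StructType([
    StructField("next_eventdate", TimestampType(), True),
    StructField("barcode", IntegerType(), True),
    StructField("eventdate", TimestampType(), True),
    StructField("sno", IntegerType(), True),
    StructField("eventaction", StringType(), True),
    StructField("next_action", StringType(), True),
    StructField("next_deviceid", IntegerType(), True),
    StructField("next_device", StringType(), True),
    StructField("type_flag", StringType(), True),  # all-null column, declared explicitly
    StructField("site", StringType(), True),
    StructField("location", StringType(), True),
    StructField("flag_perimeter", IntegerType(), True),
    StructField("deviceid", IntegerType(), True),
    StructField("device", StringType(), True),
    StructField("tran_text", StringType(), True),
    StructField("flag", IntegerType(), True),
    StructField("timespent_sec", IntegerType(), True),
    StructField("gg", IntegerType(), True),
])

# Reuses the path variable from the question; no schema inference needed.
time_on_site = (spark.read.format("csv")
                .option("header", "true")
                .schema(schema)
                .load(time_on_site_tablepath))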

Python loading multiple JSON values on one line

I have a file (not a valid JSON file) that looks similar to this:
[[0,0,0],[0,0,0],[0,0,0]]["testing", "foo", "bar"]
These are two (or more), non-delimited, valid JSON values that I need to load in from STDIN. I tried just using the following (in Python 3.7):
import sys
from json import loads

for line in sys.stdin:
    stripped = line.strip()
    if not stripped:
        break
    x = loads(stripped)
But that gave the error
json.decoder.JSONDecodeError: Extra data: line 1 column 118 (char 117)
which makes sense, as it can only load one JSON value at a time. How would I go about loading in multiple of these values from STDIN when they are not delimited? Is there a way to check if the JSON loader successfully completed a load and then start another one from the same line?
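One way to do exactly that (a sketch, not from the original thread) is json.JSONDecoder.raw_decode, which parses a single JSON value from a string and returns the index where parsing stopped, so you can resume from there:

import json
import sys

decoder = json.JSONDecoder()

def iter_json_values(text):
    # Yield successive JSON values from a string of concatenated values.
    idx = 0
    while idx < len(text):
        # Skip any whitespace between values.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        value, idx = decoder.raw_decode(text, idx)
        yield value

for line in sys.stdin:
    for value in iter_json_values(line):
        print(value)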

How to open and read a JSON file?

I have a JSON file, but it weighs 186 MB. I am trying to read it with Python:
import json
f = open('file.json','r')
r = json.loads(f.read())
ValueError: Extra data: line 88 column 2 -...
How can I open it?
Your JSON file isn't a JSON file, it's several JSON files mashed together.
The first instance of this occurs at character 1630070:
'шова"}]}]}{"response":[{"count'
^ here
That said, jq appears to be able to handle it, so the individual parts are fine.
You'll need to split the file at the boundaries of the individual JSON objects. Try catching the JSONDecodeError and using its .colno to slice the text into correct chunks.
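A sketch of that idea, using the error's .pos attribute (the absolute character offset, which is simpler than .colno for multi-line input):

import json

def split_concatenated_json(text):
    # Split a string containing several JSON documents mashed together.
    docs = []
    while text.strip():
        try:
            docs.append(json.loads(text))
            break  # the remainder parsed cleanly; we're done
        except json.JSONDecodeError as e:
            # e.pos is the index where the extra data begins:
            # parse up to it, then continue with the rest.
            docs.append(json.loads(text[:e.pos]))
            text = text[e.pos:]
    return docs

with open('file.json') as f:
    responses = split_concatenated_json(f.read())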
It should be (json.load reads from a file object, whereas json.loads expects a string):
r = json.load(f)

Read multiline JSON in Apache Spark

I was trying to use a JSON file as a small DB. After registering a temp table from the DataFrame, I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema() result:
root
|-- _corrupt_record: string (nullable = true)
My JSON file:
{
  "id": 1,
  "name": "Morty",
  "age": 21
}
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
_corrupt_record is
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON as a one-liner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is the newlines.
UPD3
I found in docs the next sentence:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
It isn't convenient to keep JSON in such a format. Is there any workaround to get rid of the multi-line structure of the JSON, or to convert it to a one-liner?
Spark >= 2.2
Spark 2.2 introduced the multiLine option (originally called wholeFile), which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.
Spark < 2.2
Well, using JSONL-formatted data may be inconvenient, but I will argue that it is not an issue with the API but with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema, and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top-level documents. Arguably this is one of the worst possible formats to use in systems like Apache Spark. It is also quite tricky, and typically impractical, to write valid JSON in distributed systems.
That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())
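For reference, the same wholeTextFiles approach in PySpark would look roughly like this (a sketch, assuming spark is an active SparkSession):

# Read each file as a single (path, content) pair, keep only the content
# strings, and let the JSON reader parse each one as a whole document.
rdd = spark.sparkContext.wholeTextFiles("/path/to/user.json").values()
df = spark.read.json(rdd)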
Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.

Saving Pandas DataFrame and meta-data to JSON format

I have a need to save a Pandas DataFrame, along with some metadata to a file in JSON format. (The JSON format is a requirement.)
Background
A) I can successfully read/write my rather large Pandas DataFrame from/to JSON using DataFrame.to_json() and pandas.read_json(). No problems.
B) I have no problems saving my metadata (dict) to JSON using json.dump()/json.load()
My first attempt
Since Pandas does not support DataFrame metadata directly, my first thought was to
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
json.dump(top_level_dict, fp)
Failure modes
C) I have found that even the simplified case of
df_dict = df.to_dict()
json.dump(df_dict, fp)
fails with:
TypeError: key (u'US', 112, 5, 80, 'wl') is not a string
D) Investigating, I've found that the complement also fails.
df.to_json(fp)
json.load(fp)
fails with
384 raise ValueError("No JSON object could be decoded")
ValueError: Expecting : delimiter: line 1 column 17 (char 16)
So it appears that the Pandas JSON format and Python's json library are not compatible.
My first thought is to chase down a way to modify the df.to_dict() output of C to make it amenable to Python's JSON library, but I keep hearing "If you're struggling to do something in Python, you're probably doing it wrong." in my head.
Question
What is the canonical/recommended method for adding metadata to a Pandas DataFrame and storing it to a JSON-formatted file?
Python 2.7.10
Pandas 0.17
Edit 1:
While trying out Evan Wright's great answer, I found the source of my problems: Pandas (as of 0.17) does not like saving Multi-Indexed DataFrames to JSON. The library I had created to save my (Multi-Indexed) DataFrames is quietly performing a df.reset_index() before calling DataFrame.to_json(). My newer code was not. So it was DataFrame.to_json() burping on the MultiIndex.
Lesson: Read the documentation kids, even when it's your own documentation.
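A minimal sketch of that fix (the index level names below are hypothetical placeholders for whatever your MultiIndex actually uses):

import pandas as pd

# Flatten the MultiIndex into ordinary columns before serializing.
df_flat = df.reset_index()
df_flat.to_json('frame.json')

# Read it back and restore the index.
df2 = pd.read_json('frame.json')
df2 = df2.set_index(['country', 'site', 'device'])  # hypothetical level names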
Edit 2:
If you need to store both the DataFrame and the metadata in a single JSON object, see my answer below.
You should be able to just put the data on separate lines.
Writing:
f = open('test.json', 'w')
df.to_json(f)           # line 1: the DataFrame
print >> f              # Python 2 syntax: writes a newline separator
json.dump(metadata, f)  # line 2: the metadata
Reading:
f = open('test.json')
df = pd.read_json(next(f))
metdata = json.loads(next(f))
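For anyone on Python 3, where print >> f is a syntax error, the same two-line layout might look like this (a sketch):

import json
import pandas as pd

with open('test.json', 'w') as f:
    f.write(df.to_json())   # line 1: the DataFrame
    f.write('\n')           # newline separator between the two records
    json.dump(metadata, f)  # line 2: the metadata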
In my question, I erroneously stated that I needed the JSON in a file. In that situation, Evan Wright's answer is my preferred solution.
In my case, I actually need to store the JSON output as a single "blob" in a database, so my dictionary-wrangling approach appears to be necessary.
If you similarly need to store the data and metadata in a single JSON blob, the following code will work:
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
with open(FILENAME, 'w') as outfile:
    json.dump(top_level_dict, outfile)
Just make sure the DataFrame is singly-indexed. If it's multi-indexed, reset the index (i.e. df.reset_index()) before doing the above.
Reading the data back in:
with open(FILENAME, 'r') as infile:
    top_level_dict = json.load(infile)

df_as_dict = top_level_dict.pop('data', {})
df = pandas.DataFrame.from_dict(df_as_dict)
meta = top_level_dict['metadata']
At this point, you'll need to re-create your MultiIndex (if applicable).