Read multiple JSON schemas with Spark

Software Configuration:
Hadoop distribution:Amazon 2.8.3
Applications:Hive 2.3.2, Pig 0.17.0, Hue 4.1.0, Spark 2.3.0
I tried to read JSON files that have multiple schemas:
val df = spark.read.option("mergeSchema", "true").json("s3a://s3bucket/2018/01/01/*")
This throws an error:
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:206)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:340)
How can I read JSON with multiple schemas in Spark?

This sometimes happens when you point to the wrong path, that is, when no data exists at that location.
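One way to rule out the wrong-path case is to list the inputs before handing the pattern to Spark. The following is a minimal sketch that assumes the files sit on a local filesystem; the helper name `non_empty_inputs` is hypothetical, and for an `s3a://` location you would need an S3 or Hadoop filesystem listing instead of `glob`.

```python
import glob
import os


def non_empty_inputs(pattern):
    """Return the local files matching a glob pattern that hold at least one byte."""
    return sorted(
        path
        for path in glob.glob(pattern)
        if os.path.isfile(path) and os.path.getsize(path) > 0
    )
```

If the returned list is empty, the reader has nothing to infer a schema from, which matches the symptom described above.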

Related

"Unable to infer schema for JSON." error in PySpark?

I have a json file with about 1,200,000 records.
I want to read this file with pyspark as :
spark.read.option("multiline","true").json('file.json')
But it causes this error:
AnalysisException: Unable to infer schema for JSON. It must be specified manually.
When I create a json file with a smaller record count in the main file, this code can read the file.
I can read this json file with pandas, when I set the encoding to utf-8-sig:
pd.read_json("file.json", encoding = 'utf-8-sig')
How can I solve this problem?
Try this out:
spark.read.option("multiline","true").option("inferSchema", "true").json('file.json')
Since adding the encoding helps, maybe the following is what you need:
spark.read.json("file.json", multiLine=True, encoding="utf8")
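Because pandas only reads the file with `utf-8-sig`, the file probably starts with a UTF-8 byte order mark (BOM). That is a guess based on the question, not a confirmed diagnosis. If it is the cause, one option is to write a copy of the file without the BOM and read that copy instead. A minimal sketch, where `copy_without_bom` is a hypothetical helper name:

```python
import codecs
import shutil


def copy_without_bom(src, dst):
    """Copy src to dst, dropping a leading UTF-8 byte order mark if there is one."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        head = fin.read(len(codecs.BOM_UTF8))
        if head != codecs.BOM_UTF8:
            # No BOM: keep the bytes that were read for the check.
            fout.write(head)
        # Stream the rest in chunks so large files are not loaded into memory.
        shutil.copyfileobj(fin, fout)
```

The copy is streamed, so it should also be usable for a file with about 1,200,000 records.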

Writing dataframe to CSV - Spark 1.6

I am trying to write a pyspark dataframe to CSV. I have Spark 1.6, and I am trying things such as the line: df.write.format('com.intelli.spark.csv').save('mycsv.csv') and df.write.format('com.databricks.spark.csv').save(PATH).
These always give an error along the lines of java.lang.ClassNotFoundException: Failed to find data source: com.intelli.spark.csv. Please find packages at http://spark-packages.org.
I have tried downloading spark-csv_2.10-0.1.jar and using it in the --jars argument of spark-submit, but that also leads to a similar error. I have also tried spark-shell --packages com.databricks:spark-csv_2.10:1.5.0, but it gives server access errors.
Try it this way. In Spark 1.6, you will have to convert it to an RDD and write that out.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
rdd1 = df.rdd.map(toCSVLine)
rdd1.saveAsTextFile('output_dir')
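Note that a plain ','.join produces broken rows when a value itself contains a comma, a quote, or a newline. A hedged alternative is to let Python's csv module do the quoting. This is a sketch that assumes Python 3; `to_csv_line` is a hypothetical replacement for toCSVLine above.

```python
import csv
import io


def to_csv_line(row):
    """Format one row (any iterable of values) as a single quoted CSV line."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    # writerow appends a line terminator; saveAsTextFile adds its own newline.
    return buf.getvalue().rstrip("\r\n")
```

It can be used the same way as before, for example df.rdd.map(to_csv_line).saveAsTextFile('output_dir').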
Edit: try adding this in your Spark code after passing the --py-files argument.
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")

How structured streaming dynamically parses kafka's json data

I am trying to read data from Kafka using structured streaming. The data received from kafka is in json format.
My code is as follows. In it, I use the from_json function to convert the JSON into a DataFrame for further processing.
val schema: StructType = new StructType()
.add("time", LongType)
.add("id", LongType)
.add("properties",new StructType()
.add("$app_version", StringType)
.
.
)
val df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers","...")
.option("subscribe","...")
.load()
.selectExpr("CAST(value AS STRING) as value")
.select(from_json(col("value"), schema))
My problem is that when new fields are added, I can't stop the Spark program to add them manually.
How can I parse these fields dynamically? I tried schema_of_json(),
but it only takes the first line to infer the field types, and it is not suitable for multi-level nested JSON data.
My problem is that if the field is increased, I can't stop the spark program to manually add these fields, then how can I parse these fields dynamically
It is not possible in Spark Structured Streaming (or even Spark SQL) out of the box. There are a couple of solutions though.
Changing Schema in Code and Resuming Streaming Query
You simply have to stop your streaming query, change the code to match the current schema, and resume it. It is possible in Spark Structured Streaming with data sources that support resuming from checkpoint. Kafka data source does support it.
User-Defined Function (UDF)
You could write a user-defined function (UDF) that would do this dynamic JSON parsing for you. That's also among the easiest options.
New Data Source (MicroBatchReader)
Another option is to create an extension to the built-in Kafka data source that would do the dynamic JSON parsing (similarly to Kafka deserializers). That requires a bit more development, but is certainly doable.
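As an illustration of the UDF option, the function below sketches what the body of such a UDF might do: parse each JSON string without a fixed schema and flatten it into a map from dotted paths to string values. It is plain Python and only an assumption about how one could approach it; the name `flatten_json` and the dotted-path convention are hypothetical. In PySpark it could be wrapped with pyspark.sql.functions.udf and a MapType(StringType(), StringType()) return type.

```python
import json


def flatten_json(text, sep="."):
    """Parse a JSON object string into a flat {path: string} dict.

    Returns an empty dict when the input is not a valid JSON object.
    """
    try:
        obj = json.loads(text)
    except (TypeError, ValueError):
        return {}
    if not isinstance(obj, dict):
        return {}

    flat = {}

    def walk(prefix, value):
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(prefix + sep + key if prefix else key, child)
        else:
            # Keep strings as they are; serialise everything else back to JSON text.
            flat[prefix] = value if isinstance(value, str) else json.dumps(value)

    walk("", obj)
    return flat
```

New fields then show up as new keys in the map rather than requiring a schema change, at the cost of losing typed columns.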

How do I read a Large JSON Array File in PySpark

Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a Large UTF-8 JSON Array file and switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110G and has ~150m JSON Objects.
HDInsight PySpark does not appear to support the Array of JSON file format for input, so I'm stuck. Also, I have "many" such files with different schemas, each containing hundreds of columns, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page:
from Databricks that supplied the below code snippet:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading to an RDD and other open methods, but PySpark appears to support only the JSONLines JSON file format, and I have the Array of JSON Objects due to ADLA's requirement for that file format.
I tried reading in as a text file, stripping Array characters, splitting on the JSON object boundaries and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str (ings).
I found a way through the above, and converted to a dataframe containing one column with Rows of strings that were the JSON Objects. However, I did not find a way to output only the JSON Strings from the data frame rows to an output file by themselves. They always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed them as JSON, extracted the values (the JSON I wanted), then parsed the values as JSON. This appeared to work, but when I tried to save, the RDD was of type PipelinedRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which I admittedly did not understand, and of course other conversion errors.
I finally found a way forward. I learned that I could read JSON directly from an RDD, including a PipelinedRDD. I found a way to remove the unicode byte order header and the wrapping array square brackets, split the JSON Objects based on a fortunate delimiter, and end up with a distributed dataset for more efficient processing. The output dataframe now has columns named after the JSON elements, has an inferred schema, and adapts dynamically to the other file formats.
Here is the code - hope it helps!:
#...Spark considers arrays of Json objects to be an invalid format
# and unicode files are prefixed with a byteorder marker
#
thanksMoiraRDD = sc.textFile('/a/valid/file/path', partitions).map(
    lambda x: x.encode('utf-8', 'ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)
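To see what the strip call does to each line, here is a small standalone sketch in plain Python 3 (no Spark), run on made-up sample lines. It relies on the same assumption as the code above, namely that every JSON object of the array sits on its own line; `clean_array_line` and the sample data are hypothetical.

```python
import json


def clean_array_line(line):
    """Strip the BOM, array brackets, separating commas and line breaks from one line."""
    return line.strip(",\r\n[]\ufeff")


# Made-up sample: a BOM-prefixed JSON array with one object per line.
lines = ['\ufeff[', '{"id": 1, "name": "a"},', '{"id": 2, "name": "b"}', ']']

# Lines holding only brackets become empty strings and are skipped.
objects = [json.loads(s) for s in map(clean_array_line, lines) if s]
```

Each remaining line is then a standalone JSON object, which is the JSON Lines shape that read.json expects. The approach would not work if one object spanned several lines.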

Spark exception handling for json

I am trying to catch/ignore a parsing error when I'm reading a json file
val DF = sqlContext.jsonFile("file")
There are a couple of lines that aren't valid json objects, but the data is too large to go through individually (~1TB)
I've come across exception handling for mapping using import scala.util.Try and in.map(a => Try(a.toInt)), referencing:
how to handle the Exception in spark map() function?
How would I catch an exception when reading a json file with the function sqlContext.jsonFile?
Thanks!
Unfortunately, you are out of luck here. DataFrameReader.json, which is used under the hood, is pretty much all-or-nothing. If your input contains malformed lines, you have to filter them out manually. A basic solution could look like this:
import scala.util.parsing.json._
val df = sqlContext.read.json(
  sc.textFile("file").filter(JSON.parseFull(_).isDefined)
)
Since the above validation is rather expensive, you may prefer to drop jsonFile / read.json completely and use the parsed JSON lines directly.
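The idea of parsing each line once and keeping the result can be sketched as follows. This is plain Python rather than the Scala used above, and the helper names `parse_json_line` and `parse_valid` are hypothetical; it illustrates the approach and is not the answer's own code.

```python
import json


def parse_json_line(line):
    """Return the parsed JSON object, or None when the line is not a valid JSON object."""
    try:
        value = json.loads(line)
    except ValueError:
        return None
    return value if isinstance(value, dict) else None


def parse_valid(lines):
    """Parse every line once and keep only the ones that yielded a JSON object."""
    return [obj for obj in map(parse_json_line, lines) if obj is not None]
```

On an RDD of lines in PySpark, the same function could be applied with map followed by a filter that drops the None results, so each line is parsed only once.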