Json to Parquet with custom schema with Spark - json

What I'm trying to do is something similar to the Stackoverflow question here: basically converting .seq.gz JSON files to Parquet files with a proper schema defined.
I don't want to infer the schema, rather I would like to define my own, ideally having my Scala case classes so they can be reused as models by other jobs.
I'm not sure whether I should deserialise my JSON into a case class and let toDS() implicitly convert my data, like below:
spark
  .sequenceFile(input, classOf[IntWritable], classOf[Text])
  .mapValues(
    json => deserialize[MyClass](json.toString) // json to case class instance
  )
  .toDS()
  .write.mode(SaveMode.Overwrite)
  .parquet(outputFile)
...or rather use a Spark DataFrame schema instead, or even a Parquet schema. I don't know how to do that, though.
My objective is having full control over my models and possibly map JSON types (which is a poorer format) to Parquet types.
Thanks!
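One possible way to combine both ideas from the question, as a minimal sketch rather than a definitive answer: derive the Spark schema from the case class with Encoders.product, then pass it to the JSON reader so nothing is inferred. The fields of MyClass, the JsonToParquet object and the run method are hypothetical names used only for illustration; the sketch also assumes Spark 2.2 or later, where the JSON reader accepts a Dataset[String].

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.sql.{Encoders, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical model; replace the fields with the real ones.
case class MyClass(id: Long, name: String, tags: Seq[String])

object JsonToParquet {
  // StructType derived from the case class, so the model stays the single source of truth.
  val schema: StructType = Encoders.product[MyClass].schema

  def run(spark: SparkSession, input: String, outputFile: String): Unit = {
    import spark.implicits._

    // Read the sequence file and keep only the JSON text of each record.
    val jsonLines = spark.sparkContext
      .sequenceFile(input, classOf[IntWritable], classOf[Text])
      .map { case (_, text) => text.toString } // copy out of the reused Writable
      .toDS()

    spark.read
      .schema(schema) // explicit schema: no inference pass over the data
      .json(jsonLines)
      .write.mode(SaveMode.Overwrite)
      .parquet(outputFile)
  }
}
```

With this approach the case class can still be reused by other jobs (for example via .as[MyClass] after reading the Parquet files back), while the JSON-to-Parquet type mapping is controlled by the field types declared in the class.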

Related

Entity Framework Queries For Complicated JSON Documents (npgsql)

I am handling legacy (old) JSON files that we are now uploading to a database that was built using code-first EF Core (with the JSON elements saved as a jsonb field in a postgresql db, represented as JsonDocument properties in the EF classes). We want to be able to query these massive documents against any of the JSON's many properties. I've been very interested in the excellent docs here https://www.npgsql.org/efcore/mapping/json.html?tabs=data-annotations%2Cpoco, but the problem in our case is that our JSON has incredibly complicated hierarchies.
According to the npgsql/EF doc above, a way to do this for "shallow" json hierarchies would be something like:
myDbContext.MyClass
.Where(e => e.JsonDocumentField.RootElement.GetProperty("FieldToSearch").GetString() == "SearchTerm")
.ToList();
But that only works if the field is directly under the root of the JsonDocument. If the doc is structured like, say
{"A": {
...
"B": {
...
"C": {
...
"FieldToSearch":
<snip>
Then the above query won't work. There is an alternative to map our JSON to an actual POCO model, but this JSON structure (a) may change and (b) is truly massive and would result in some ridiculously complicated objects.
Right now, I'm building SQL strings from field configurations, in which I store the path to each field I want, written with PostgreSQL's JSON operators.
Example:
"(JSONDocumentField->'A'->'B'->'C'->>'FieldToSearch')"
and then using that sql against the DB using
myDbContext.MyClass.FromSqlRaw(sql).ToList();
This is hacky and I'd much rather do it in a method call. Is there a way to force JsonDocument's GetProperty call to drill down into the hierarchy to find the first/any instance of the property name in question (or another method I'm not aware of)?
Thanks!

How structured streaming dynamically parses kafka's json data

I am trying to read data from Kafka using structured streaming. The data received from kafka is in json format.
My code is as follows. It uses the from_json function to convert the JSON into a DataFrame for further processing:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val schema: StructType = new StructType()
  .add("time", LongType)
  .add("id", LongType)
  .add("properties", new StructType()
    .add("$app_version", StringType)
    // ...
  )
val df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "...")
  .load()
  .selectExpr("CAST(value AS STRING) as value")
  .select(from_json(col("value"), schema))
My problem is that when new fields are added, I can't stop the Spark program to add them to the schema manually. How can I parse these fields dynamically? I tried schema_of_json(), but it only uses the first row to infer the field types, and it is not suitable for multi-level nested JSON data.
"My problem is that if the field is increased, I can't stop the spark program to manually add these fields, then how can I parse these fields dynamically"
It is not possible in Spark Structured Streaming (or even Spark SQL) out of the box. There are a couple of solutions though.
Changing Schema in Code and Resuming Streaming Query
You simply have to stop your streaming query, change the code to match the current schema, and resume it. It is possible in Spark Structured Streaming with data sources that support resuming from checkpoint. Kafka data source does support it.
User-Defined Function (UDF)
You could write a user-defined function (UDF) that would do this dynamic JSON parsing for you. That's also among the easiest options.
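As a rough sketch of what such a UDF could look like (an illustration under assumptions, not the only way to do it): the function below parses each JSON string with Jackson, which is normally already on Spark's classpath, and flattens it into a Map[String, String] keyed by dotted paths. The names DynamicJson, flatten and parseJson are made up for this example. Malformed or empty input is not handled here and would make the task fail.

```scala
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import org.apache.spark.sql.functions.udf
import scala.collection.JavaConverters._

object DynamicJson {
  // One mapper per JVM, created lazily on first use.
  private lazy val mapper = new ObjectMapper()

  // Flattens a JSON object into dotted-path keys; every value is kept as a string.
  def flatten(json: String): Map[String, String] = {
    def walk(prefix: String, node: JsonNode): Seq[(String, String)] =
      if (node.isObject)
        node.fields().asScala.toSeq.flatMap { e =>
          val key = if (prefix.isEmpty) e.getKey else s"$prefix.${e.getKey}"
          walk(key, e.getValue)
        }
      else
        // Scalars become their text form; arrays are kept as raw JSON text.
        Seq(prefix -> (if (node.isValueNode) node.asText() else node.toString))

    if (json == null) Map.empty else walk("", mapper.readTree(json)).toMap
  }

  // Spark UDF producing a map<string,string> column.
  val parseJson = udf(flatten _)
}
```

It could then replace from_json in the query, e.g. .select(DynamicJson.parseJson(col("value")).as("fields")). The trade-off is that all values arrive as strings, so newly added fields show up without a restart but still need casting downstream.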
New Data Source (MicroBatchReader)
Another option is to create an extension to the built-in Kafka data source that would do the dynamic JSON parsing (similarly to Kafka deserializers). That requires a bit more development, but is certainly doable.

Explicitly providing schema in a form of a json in Spark / Mongodb integration

When integrating Spark and MongoDB, it is possible to provide a sample schema in the form of an object, as described here: https://docs.mongodb.com/spark-connector/master/scala/datasets-and-sql/#sql-declare-schema
In short, this is the sample code showing how to provide the MongoDB Spark connector with a sample schema:
case class Character(name: String, age: Int)
val explicitDF = MongoSpark.load[Character](sparkSession)
explicitDF.printSchema()
I have a collection with a constant document structure. I can provide a sample JSON document, but creating a sample object manually is impossible (30k properties in a document, 1.5MB average size). Is there a way for Spark to infer the schema from that JSON alone and bypass the MongoDB connector's initial sampling, which is quite exhaustive?
Spark is able to infer the schema, especially from sources that carry one, such as MongoDB. For instance, for an RDBMS it executes a simple query that returns nothing but the table columns with their types (SELECT * FROM $table WHERE 1=0).
For the sampling, it will read all documents unless you specify the configuration option called samplingRatio, like this:
sparkSession.read.option("samplingRatio", 0.1)
With the option above, Spark will only read 10% of the data. You can of course set any value you want. But be careful: if your documents have inconsistent schemas (e.g. 50% have a field called "A" and the others do not), the schema deduced by Spark may be incomplete and you may end up missing some data.
Some time ago I wrote a post about schema projection if you're interested: http://www.waitingforcode.com/apache-spark-sql/schema-projection/read
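For the specific case in the question (one representative JSON document is available), a minimal sketch could let Spark infer the schema from that single document and then hand the resulting StructType to the reader explicitly. SampleSchema and fromSample are hypothetical names, and the sketch assumes Spark 2.2 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object SampleSchema {
  // Infers a StructType from one sample JSON document held in memory.
  def fromSample(spark: SparkSession, sampleJson: String): StructType = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS()).schema
  }
}
```

The schema could then be passed to the connector through the standard reader API, for example sparkSession.read.format("mongo").schema(schema).load(). The format name here is an assumption and depends on the connector version. With an explicit schema the connector should not need to sample the collection, though that is worth verifying against the connector documentation.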

how to convert nested json file into csv in scala

I want to convert my nested JSON into CSV. I used:
df.write.format("com.databricks.spark.csv").option("header", "true").save("mydata.csv")
But this works for flat JSON, not for nested JSON. Is there any way I can convert my nested JSON to CSV? Any help will be appreciated. Thanks!
When you ask Spark to convert a JSON structure to a CSV, Spark can only map the first level of the JSON.
This happens because of the simplicity of CSV files: each value is simply assigned to a name. That is why {"name1":"value1", "name2":"value2"...} can be represented as a CSV with this structure:
name1,name2, ...
value1,value2,...
In your case, you are converting a JSON with several levels, so the Spark exception is saying that it cannot figure out how to convert such a complex structure into a CSV.
If you try to add only a second level to your JSON, it will work, but be careful. It will remove the names of the second level to include only the values in an array.
You can have a look at this link, which includes an example for JSON datasets.
As I have no information about the nature of the data, I can't say much more about it. But if you need to write the information as a CSV you will need to simplify the structure of your data.
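One common way to simplify the structure, sketched here under assumptions: walk the DataFrame schema, collect the dotted path of every leaf field inside nested structs, and select each one as its own top-level column. Flatten, leafPaths and flatten are illustrative names. The sketch does not handle arrays (which CSV cannot store either) or field names that themselves contain dots.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

object Flatten {
  // Collects the dotted path of every non-struct field, descending into nested structs.
  def leafPaths(schema: StructType, prefix: String = ""): Seq[String] =
    schema.fields.toSeq.flatMap {
      case StructField(name, inner: StructType, _, _) => leafPaths(inner, s"$prefix$name.")
      case StructField(name, _, _, _)                 => Seq(s"$prefix$name")
    }

  // Selects every leaf as a top-level column, e.g. "address.city" becomes "address_city".
  def flatten(df: DataFrame): DataFrame =
    df.select(leafPaths(df.schema).map(p => col(p).as(p.replace(".", "_"))): _*)
}
```

Calling Flatten.flatten(df) before df.write with the CSV format gives the writer only flat columns to deal with.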
Read json file in spark and create dataframe.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)
Save the dataframe using spark-csv
people.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
Source :
read json
save to csv

How to write arbitrary map to Parquet with Spark

I will be parsing a bunch of json objects with spark and writing them out to parquet for searching and analysis later. Most of the json has a regular schema that maps directly to parquet, but there is one section where I have a data object that can have arbitrary fields and values with arbitrary data types.
Are there any thoughts on how to handle this? I cannot put a Map[String, Any] in a DataFrame.
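One option to consider, shown as a sketch rather than a recommendation: declare the free-form section as a plain string in the schema. As far as I know, Spark's JSON reader then stores the nested object as its raw JSON text, and individual values can be pulled out later with get_json_object. The field names "id" and "data" and the object ArbitrarySection are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, get_json_object}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object ArbitrarySection {
  // "id" stands for the regular part of the record, "data" for the free-form section.
  val schema: StructType = new StructType()
    .add("id", LongType)
    .add("data", StringType) // nested object is kept as raw JSON text

  // Reads the JSON with the explicit schema and writes it out as Parquet.
  def convert(spark: SparkSession, input: String, output: String): Unit =
    spark.read.schema(schema).json(input).write.parquet(output)

  // Extracts one value from the stored JSON text by JSON path, e.g. "$.some.field".
  def lookup(df: DataFrame, path: String): DataFrame =
    df.select(col("id"), get_json_object(col("data"), path).as("value"))
}
```

The cost is that the free-form part is not columnar in Parquet, so searching it means parsing the string at query time.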