Why does Spark read data even when no action is called? - json

I am confused about lazy loading in Spark when using spark.read.json.
I have the following code:
json_data_files = [
    "hdfs://hdfs_cluster:8020/data/*/*"
]
df_json = spark.read.json(json_data_files)
The JSON data on HDFS is partitioned by year and month (year=yyyy, month=mm), and I want to retrieve all data of that dataset.
In this code block I only define a read from the given location; no action is executed. Yet on the Spark UI I found a stage with a huge amount of input data.
As I understand it, Spark's lazy evaluation means data is not read until an action is called, so this confuses me.
After that, when I call the count() action, a new stage is created and Spark reads the data again.
My question is: why does Spark read data when no action is called (the first job/stage)? How can I optimize this?

It is doing a pass over the data to evaluate the schema, since one was not supplied - i.e. schema inference.
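If you want to skip that extra pass, supply the schema yourself. A minimal Scala sketch (PySpark has the same .schema(...) method); the field names below are illustrative, not taken from the question:

import org.apache.spark.sql.types._

// Illustrative schema - replace with the real fields of your JSON documents.
val userProfileSchema = new StructType()
  .add("user_id", LongType)
  .add("profile", StringType)

// With an explicit schema, no inference job is triggered before the first action.
val dfJson = spark.read
  .schema(userProfileSchema)
  .json("hdfs://hdfs_cluster:8020/data/*/*")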

Related

Spark partition projection/pushdown and schema inference with partitioned JSON

I would like to read a subset of partitioned data, in JSON format, with spark (3.0.1) inferring the schema from the JSON.
My data is partitioned as s3a://bucket/path/type=[something]/dt=2020-01-01/
When I try to read this with read(json_root_path).where($"type" === x && $"dt" >= y && $"dt" <= z), spark attempts to read the entire dataset in order to infer the schema.
When I try to figure out my partition paths in advance and pass them with read(paths :_*), spark throws an error that it cannot infer the schema and I need to specify the schema manually. (Note that in this case, unless I specify basePath, spark also loses the columns for type and dt, but that's fine, I can live with that.)
What I'm looking for, I think, is some option that tells spark either to infer the schema from only the relevant partitions, so the partitioning is pushed down, or that it can infer the schema from just the JSONs in the paths I've given it. Note that I do not have the option of running MSCK or Glue to maintain a Hive metastore. In addition, the schema changes over time, so it can't be specified in advance - taking advantage of spark JSON schema inference is an explicit goal.
Can anyone help?
Could you read each day you are interested in using schema inference and then union the DataFrames using schema-merge code like this:
Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema
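A hedged Scala sketch of that idea - align two per-day DataFrames to the union of their top-level columns (missing columns become typed nulls), then union them by name; the helper name is made up:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DataType

// Only top-level columns are handled here; nested struct differences would need more work.
def unionWithMergedSchema(a: DataFrame, b: DataFrame): DataFrame = {
  val aTypes = a.schema.map(f => f.name -> f.dataType).toMap
  val bTypes = b.schema.map(f => f.name -> f.dataType).toMap
  val allCols = (a.columns ++ b.columns).distinct

  def align(df: DataFrame, own: Map[String, DataType], other: Map[String, DataType]): DataFrame =
    df.select(allCols.map { c =>
      if (own.contains(c)) col(c) else lit(null).cast(other(c)).as(c)
    }: _*)

  align(a, aTypes, bTypes).unionByName(align(b, bTypes, aTypes))
}

// val merged = perDayDataFrames.reduce(unionWithMergedSchema)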
One way that comes to my mind is to extract the schema you need from a single file, and then force it when you want to read the others.
Since you know the first partition and the path, try to read first a single JSON like s3a://bucket/path/type=[something]/dt=2020-01-01/file_0001.json then extract the schema.
Then run the full read and pass the schema you extracted as a parameter: read(json_root_path).schema(json_schema).where(...
The schema should be converted into a StructType to be accepted.
I've found a question that may partially help you Create dataframe with schema provided as JSON file
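A hedged Scala sketch of that approach (the file name, basePath, and filter values are illustrative placeholders; exact partition-column behaviour with an explicit schema may vary by Spark version):

import org.apache.spark.sql.types.StructType
import spark.implicits._

// Infer the schema once, from a single file of one partition.
val jsonSchema: StructType = spark.read
  .json("s3a://bucket/path/type=[something]/dt=2020-01-01/file_0001.json")
  .schema

// Reuse it for the full read: no dataset-wide inference pass; basePath keeps
// type and dt available as partition columns (as noted in the question).
val df = spark.read
  .schema(jsonSchema)
  .option("basePath", "s3a://bucket/path/")
  .json("s3a://bucket/path/")
  .where($"type" === "[something]" && $"dt" >= "2020-01-01" && $"dt" <= "2020-01-31")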

How structured streaming dynamically parses kafka's json data

I am trying to read data from Kafka using Structured Streaming. The data received from Kafka is in JSON format.
My code is as follows:
In the code I use the from_json function to convert the JSON into a DataFrame for further processing.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema: StructType = new StructType()
  .add("time", LongType)
  .add("id", LongType)
  .add("properties", new StructType()
    .add("$app_version", StringType)
    // ... further nested fields elided
  )
val df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "...")
  .load()
  .selectExpr("CAST(value AS STRING) as value")
  .select(from_json(col("value"), schema))
My problem is that when new fields are added, I can't stop the Spark program just to add those fields manually. How can I parse these fields dynamically? I tried schema_of_json(), but it can only take the first row to infer the field types, and it is not suitable for multi-level nested JSON data.
My problem is that when new fields are added, I can't stop the Spark program to add those fields manually, so how can I parse these fields dynamically?
It is not possible in Spark Structured Streaming (or even Spark SQL) out of the box. There are a couple of solutions though.
Changing Schema in Code and Resuming Streaming Query
You simply have to stop your streaming query, change the code to match the current schema, and resume it. It is possible in Spark Structured Streaming with data sources that support resuming from checkpoint. Kafka data source does support it.
User-Defined Function (UDF)
You could write a user-defined function (UDF) that would do this dynamic JSON parsing for you. That's also among the easiest options.
New Data Source (MicroBatchReader)
Another option is to create an extension to the built-in Kafka data source that would do the dynamic JSON parsing (similarly to Kafka deserializers). That requires a bit more development, but is certainly doable.
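For the UDF option, here is an illustrative Scala sketch (not the asker's code): it flattens any JSON document into a map of "a.b.c" -> value pairs using Jackson (which ships with Spark), so new or nested fields show up without a fixed schema. Names such as parseJsonDynamically and flatten are made up, and arrays/nulls are handled naively.

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import org.apache.spark.sql.functions.{col, udf}
import scala.collection.JavaConverters._
import scala.collection.mutable

// Recursively collect leaf values under dotted keys, e.g. "properties.$app_version".
def flatten(node: JsonNode, prefix: String, out: mutable.Map[String, String]): Unit = {
  if (node.isObject) {
    node.fields().asScala.foreach { entry =>
      val key = if (prefix.isEmpty) entry.getKey else s"$prefix.${entry.getKey}"
      flatten(entry.getValue, key, out)
    }
  } else {
    out += (prefix -> node.asText())
  }
}

val parseJsonDynamically = udf { (json: String) =>
  val mapper = new ObjectMapper()   // created per call for simplicity
  val out = mutable.Map.empty[String, String]
  flatten(mapper.readTree(json), "", out)
  out.toMap
}

// Applied to the streaming DataFrame from the question:
// df.select(parseJsonDynamically(col("value")).as("fields"))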

What does dehydration mean in the context of data loading?

There are a few threads which discuss the meaning of data hydration.
But I can't find any definitions of data dehydration.
I can see three possible meanings
1) Extracting the current state of an object
2) Clearing the state of an object
3) Both 1) and 2)
EDIT
This is not a duplicate of other threads where the meaning of hydration is discussed. This question is about dehydration.
Dehydration is simply the opposite of hydration. In the context of data loading, suppose you have an API that reads data from an ORM. The process of transforming the ORM object tree into JSON containing only primitive types is dehydration - the extraction of data from the hydrated objects.
As you already mentioned:
Extracting the current state of an object
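To make that concrete, a tiny illustrative Scala sketch (the classes and field names are made up):

// Hypothetical "hydrated" ORM-style object graph.
case class Address(street: String, city: String)
case class User(id: Long, name: String, address: Address)

// Dehydration: reduce the object tree to primitives suitable for a JSON payload.
def dehydrate(user: User): Map[String, Any] = Map(
  "id"             -> user.id,
  "name"           -> user.name,
  "address.street" -> user.address.street,
  "address.city"   -> user.address.city
)

// dehydrate(User(1L, "Ada", Address("Main St", "Berlin")))
//   -> Map(id -> 1, name -> Ada, address.street -> Main St, address.city -> Berlin)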
Data dehydration usually refers to the process of storing database data off in a compressed file. The data is then removed from the database (and as such is inaccessible).
The process of rehydration is loading the stored data back into the original or an ancillary table so that it can be worked with.

Explicitly providing a schema in the form of JSON in Spark / MongoDB integration

When integrating Spark and MongoDB, it is possible to provide a sample schema in the form of an object, as described here: https://docs.mongodb.com/spark-connector/master/scala/datasets-and-sql/#sql-declare-schema
As a shortcut, here is sample code showing how one can provide the MongoDB Spark connector with a sample schema:
case class Character(name: String, age: Int)
val explicitDF = MongoSpark.load[Character](sparkSession)
explicitDF.printSchema()
I have a collection which has a constant document structure. I can provide a sample JSON; however, creating a sample object manually would be impossible (30k properties in a document, 1.5 MB average size). Is there a way for Spark to infer the schema just from that very JSON and circumvent the MongoDB connector's initial sampling, which is quite exhaustive?
Spark is able to infer the schema, especially from sources that carry one, such as MongoDB. For RDBMS sources, for instance, it executes a simple query that returns nothing but the table columns with their types (SELECT * FROM $table WHERE 1=0).
For the sampling it'll read all documents unless you specify the configuration option called samplingRatio, like this:
sparkSession.read.option("samplingRatio", 0.1)
For the above, Spark will only read 10% of the data. You can of course set any value you want. But be careful: if your documents have inconsistent schemas (e.g. 50% have a field called "A" and the others don't), the schema deduced by Spark may be incomplete and in the end you can miss some data.
Some time ago I wrote a post about schema projection if you're interested: http://www.waitingforcode.com/apache-spark-sql/schema-projection/read
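As for the question's own idea of inferring the schema from a single sample JSON, a hedged Scala sketch follows. The sample path is illustrative, connection settings are assumed to be configured on the session, and whether the short format name is "mongo" (connector 2.x) or "mongodb" (10.x), and whether an explicit schema fully skips the connector's sampling, depends on your connector version.

import org.apache.spark.sql.types.StructType

// Infer the schema from one exported representative document.
val sampleSchema: StructType =
  spark.read.json("hdfs:///samples/one_document.json").schema

// Pass the inferred schema explicitly to the MongoDB read.
val df = spark.read
  .format("mongo")
  .schema(sampleSchema)
  .load()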

How to load big files (JSON or CSV) in Spark once

I run more than one select on two registered tables in Spark, loaded from JSON and CSV.
But in every select the two files are loaded again. Can I load them into a global object once?
You can use persist() with StorageLevel MEMORY_AND_DISK:
import org.apache.spark.storage.StorageLevel
dataFrame.persist(StorageLevel.MEMORY_AND_DISK)
Check the documentation here.
Note: this option is most useful when you have performed some aggregations/transformations on the input dataset and are about to apply further transformations.
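A short usage sketch tying it together (the paths, the header option, and the join key id are illustrative):

import org.apache.spark.storage.StorageLevel

// Read each file once, persist it, and register it as a temp view so that
// subsequent selects reuse the cached data instead of re-reading the files.
val jsonDf = spark.read.json("hdfs:///data/big_file.json")
  .persist(StorageLevel.MEMORY_AND_DISK)
val csvDf = spark.read.option("header", "true").csv("hdfs:///data/big_file.csv")
  .persist(StorageLevel.MEMORY_AND_DISK)

jsonDf.createOrReplaceTempView("json_table")
csvDf.createOrReplaceTempView("csv_table")

// Both selects now hit the persisted data; only the first action materialises it.
spark.sql("SELECT count(*) FROM json_table").show()
spark.sql("SELECT j.* FROM json_table j JOIN csv_table c ON j.id = c.id").show()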