I would like to read a subset of partitioned data, in JSON format, with Spark (3.0.1) inferring the schema from the JSON.
My data is partitioned as s3a://bucket/path/type=[something]/dt=2020-01-01/
When I try to read this with read(json_root_path).where($"type" === x && $"dt" >= y && $"dt" <= z), Spark attempts to read the entire dataset in order to infer the schema.
When I try to figure out my partition paths in advance and pass them with read(paths: _*), Spark throws an error saying it cannot infer the schema and that I need to specify it manually. (Note that in this case, unless I specify basePath, Spark also loses the type and dt columns, but that's fine, I can live with that.) Both attempts are sketched below.
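Roughly, the two attempts look like this (the bucket path, the type value, and the date literals are placeholders, and a SparkSession named spark is assumed):

import spark.implicits._

// Attempt 1: schema inference scans every file before partition pruning kicks in.
val byFilter = spark.read.json("s3a://bucket/path/")
  .where($"type" === "x" && $"dt" >= "2020-01-01" && $"dt" <= "2020-01-31")

// Attempt 2: explicit partition paths; as described above, this fails to infer the
// schema and, without basePath, drops the type/dt columns.
val paths = Seq(
  "s3a://bucket/path/type=x/dt=2020-01-01/",
  "s3a://bucket/path/type=x/dt=2020-01-02/"
)
val byPaths = spark.read.json(paths: _*)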
What I'm looking for, I think, is some option that either tells Spark to infer the schema from only the relevant partitions, so the partitioning is pushed down, or tells it that it can infer the schema from just the JSON files in the paths I've given it. Note that I do not have the option of running MSCK or using Glue to maintain a Hive metastore. In addition, the schema changes over time, so it can't be specified in advance; taking advantage of Spark's JSON schema inference is an explicit goal.
Can anyone help?
Could you read each day you are interested in with schema inference, and then union the DataFrames using schema-merge code like this:
Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema
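As a rough sketch of that idea (the paths are placeholders, and it assumes Spark 3.1+, where unionByName gained allowMissingColumns; on 3.0.x you would align the column sets by hand, as in the linked answer):

val days = Seq("2020-01-01", "2020-01-02", "2020-01-03")

// Schema inference only touches the files under each day's path.
val perDay = days.map(d => spark.read.json(s"s3a://bucket/path/type=x/dt=$d/"))

// Merge the frames even if their inferred schemas differ slightly.
val merged = perDay.reduce(_.unionByName(_, allowMissingColumns = true))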
One way that comes to my mind is to extract the schema you need from a single file, and then force it when you want to read the others.
Since you know the first partition and the path, try first reading a single JSON file, such as s3a://bucket/path/type=[something]/dt=2020-01-01/file_0001.json, and extracting its schema.
Then run the full read and pass the extracted schema as a parameter: read(json_root_path).schema(json_schema).where(...
The schema must be a StructType to be accepted.
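A minimal sketch of that approach, with placeholder paths and filter values:

import spark.implicits._

// Infer the schema from a single representative file only.
val sampleSchema = spark.read
  .json("s3a://bucket/path/type=x/dt=2020-01-01/file_0001.json")
  .schema  // already a StructType

// Reuse it for the full read; the where clause prunes partitions, so no
// dataset-wide inference pass is needed.
val df = spark.read
  .schema(sampleSchema)
  .json("s3a://bucket/path/")
  .where($"type" === "x" && $"dt" >= "2020-01-01" && $"dt" <= "2020-01-31")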
I've found a question that may partially help you: Create dataframe with schema provided as JSON file
Suppose we are developing an application that pulls Avro records from a source
stream (e.g. Kafka/Kinesis/etc), parses them into JSON, then further processes that
JSON with additional transformations. Further assume these records can have a
varying schema (which we can look up and fetch from a registry).
We would like to use Spark's built-in from_avro function, but it is pretty clear that
Spark's from_avro wants you to hard-code a fixed schema into your code. It doesn't seem
to allow the schema to vary row by incoming row.
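For reference, the usual call shape looks something like this (kafkaDf and the schema string are made-up placeholders, and the spark-avro module must be on the classpath); the reader schema is fixed for the whole column:

import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// One JSON-format Avro schema, baked in at build time and applied to every row.
val avroSchemaJson =
  """{"type":"record","name":"Rec","fields":[{"name":"id","type":"long"}]}"""
val parsed = kafkaDf.select(from_avro(col("value"), avroSchemaJson).as("rec"))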
That sort of makes sense if you are parsing the Avro into Spark's internal row format. One would need
a consistent structure for the dataframe. But what if we wanted something like
from_avro which grabbed the bytes from some column in the row and also grabbed the string
representation of the Avro schema from some other column in the row, and then parsed that Avro
into a JSON string.
Does such a built-in method exist? Or is such functionality available in a third-party library?
Thanks!
I'm new to Structured Streaming, and I'd like to know whether there is a way to specify the Kafka value's schema, like we do in normal Structured Streaming jobs. The format of the Kafka value is a syslog-like CSV with 50+ fields, and splitting it manually is painfully slow.
Here's the relevant part of my code (see the full gist here):
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "myserver:9092")
.option("subscribe", "mytopic")
.load()
.select(split('value, """\^""") as "raw")
.select(ColumnExplode('raw, schema.size): _*) // flatten WrappedArray
.toDF(schema.fieldNames: _*) // apply column names
.select(fieldsWithTypeFix: _*) // cast column types from string
.select(schema.fieldNames.map(col): _*) // re-order columns, as defined in schema
.writeStream.format("console").start()
With no further operations, I can only achieve roughly 10 MB/s throughput on a 24-core, 128 GB server. Would it help if I converted the syslog to JSON beforehand? In that case I could use from_json with a schema, and maybe it would be faster.
Is there a way to specify the Kafka value's schema, like we do in normal Structured Streaming jobs?
No. The so-called output schema of the kafka external data source is fixed and cannot ever be changed. See this line.
Would it help if I converted the syslog to JSON beforehand? In that case I could use from_json with a schema, and maybe it would be faster.
I don't think so. I'd even say that CSV is a simpler text format than JSON (there's usually just a single separator).
Using the split standard function is the way to go, and I think you can hardly get better performance, since the work is just splitting each row and taking every element to build the final output.
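For what it's worth, here is a minimal sketch of that split-then-cast approach using only built-in functions; kafkaDf and schema stand in for the streaming DataFrame and the 50-field StructType from the question:

import org.apache.spark.sql.functions.{col, split}

val parsed = kafkaDf
  .select(split(col("value").cast("string"), """\^""").as("raw"))
  .select(schema.fields.zipWithIndex.map { case (f, i) =>
    // Pull each delimited element out by position and cast it to the declared type.
    col("raw").getItem(i).cast(f.dataType).as(f.name)
  }: _*)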
When integrating Spark and MongoDB, it is possible to provide a sample schema in the form of an object, as described here: https://docs.mongodb.com/spark-connector/master/scala/datasets-and-sql/#sql-declare-schema
As a shortcut, here is sample code showing how one can provide the MongoDB Spark connector with a sample schema:
case class Character(name: String, age: Int)
val explicitDF = MongoSpark.load[Character](sparkSession)
explicitDF.printSchema()
I have a collection which has a constant document structure. I can provide a sample JSON, but creating a sample object manually would be impossible (30k properties per document, 1.5 MB average size). Is there a way for Spark to infer the schema just from that JSON and circumvent the MongoDB connector's initial sampling, which is quite expensive?
Spark is able to infer the schema, especially from sources that carry it, such as MongoDB. For instance, for an RDBMS it executes a simple query returning nothing but the table columns with their types (SELECT * FROM $table WHERE 1=0).
For the sampling it will read all documents unless you specify the configuration option called samplingRatio, like this:
sparkSession.read.option("samplingRatio", 0.1)
With the above, Spark will only read 10% of the data. You can of course set any value you want. But be careful: if your documents have inconsistent schemas (e.g. 50% have a field called "A" and the others don't), the schema Spark deduces may be incomplete and in the end you may miss some data.
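To illustrate the mechanism with the built-in JSON source, where samplingRatio is a documented schema-inference option (the path is a placeholder; check the MongoDB connector's documentation for the exact sampling key it honours):

val sampled = sparkSession.read
  .option("samplingRatio", 0.1)  // infer the schema from roughly 10% of the input
  .json("s3a://bucket/sample-data/")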
Some time ago I wrote a post about schema projection if you're interested: http://www.waitingforcode.com/apache-spark-sql/schema-projection/read
I'm trying to create a DataFrame from a JSON file, but when I load the data, Spark automatically infers that the numeric values are of type Long, although they are actually Integers, and that is also how I parse the data in my code.
Since I'm loading the data in a test env, I don't mind using a few workarounds to fix the schema. I've tried more than a few, such as:
Changing the schema manually
Casting the data using a UDF
Defining the entire schema manually
The issue is that the schema is quite complex, and the fields I'm after are nested, which makes all of the options above irrelevant or too complex to write from scratch.
My main question is: how does Spark decide whether a numeric value is an Integer or a Long? And is there anything I can do to enforce that all (or some) numerics are of a specific type?
Thanks!
It's always LongType by default.
From the source code:
// For Integer values, use LongType by default.
case INT | LONG => LongType
So you cannot change this behaviour. You can iterate over the columns and cast them:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, NumericType}

// withColumn returns a new DataFrame, so fold over the numeric columns instead of discarding the result.
val casted = df.schema.fields.filter(_.dataType.isInstanceOf[NumericType])
  .foldLeft(df)((acc, c) => acc.withColumn(c.name, col(c.name).cast(IntegerType)))
It's only a snippet, but something like this should help you :)
I have a huge number (35k) of small (16 KB) JSON files stored in an S3 bucket. I need to load them into a DataFrame for further processing; here is my code for the extract:
val jsonData = sqlContext.read.json("s3n://bucket/dir1/dir2")
.where($"nod1.filter1"==="filterValue")
.where($"nod2.subNode1.subSubNode2.created"(0)==="filterValue2")
I'm storing this data in a temp table and using it for further operations (exploding nested structures into separate DataFrames):
jsonData.registerTempTable("jsonData")
So now I have an auto-generated schema for this deeply nested DataFrame.
With the above code I have terrible performance issues. I presume this is caused by not using sc.parallelize during the bucket load; moreover, I'm pretty sure that the schema auto-generation in the read.json() method is taking a lot of time.
Questions part:
What should my bucket load look like to be more efficient and faster?
Is there any way to declare this schema in advance (I'd need to work around the case-class tuple limit, though) to avoid auto-generation?
Does filtering the data during load make sense, or should I simply load everything and filter afterwards?
Found so far:
sqlContext.jsonRdd(rdd, schema)
It did the auto-generated schema part, but IntelliJ screams about a deprecated method; is there any alternative to it?
As an alternative to a case class, use a custom class that implements the Product interface; the DataFrame will then use the schema exposed by your class members without the case-class constraints. See the in-line comment here: http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
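Untested sketch of that Product-based idea (the field names and recordsRdd are hypothetical; the assumption is that Spark's reflection picks the schema up from the primary-constructor parameters):

// Hypothetical two-field record; a real class would expose all the needed members.
class JsonRecord(val id: Long, val created: String) extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[JsonRecord]
  def productArity: Int = 2
  def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => created
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}

// createDataFrame should then derive the schema from the class members, no case class needed.
val recordsDf = sqlContext.createDataFrame(recordsRdd.map(t => new JsonRecord(t._1, t._2)))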
If your JSON is composed of unrooted fragments, you could use s3distcp to group the files and concatenate them into fewer files. Also try the s3a protocol, as it performs better than s3n.