Spark JSON Read Nested Structured Strings as Structs - json

I have a file with many JSON records in it. Each record contains a Struct ('Properties') and within each, a String that looks like this:
'meterDetails: "#{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}"'
Note that the values are not enclosed in double quotes ("").
I want to treat this column (meterDetails) as another Struct in my DF as all structs will be flattened eventually.
I defined a schema, removed the # with regexp_replace('col','#',''), and applied from_json with that schema. This gave me a new column in JSON format, but with all NULL values.
Splitting the column with split(col("meterDetails"), ";") turns it into an Array, but converting that to JSON again gives all NULL values.
Question:
I'm clearly misunderstanding the #{..} structure passed by this API. In Spark, how should I convert this string into an object that natively results in a Struct?

I gravitate towards the str_to_map function, but then you need to transform the resulting map into a struct. I'm not sure this is the best way, but I'd do it like this.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[("#{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}",)],
['meterDetails'])
df.show(truncate=0)
# +-----------------------------------------------------------------------------------------------------------------+
# |meterDetails |
# +-----------------------------------------------------------------------------------------------------------------+
# |#{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}|
# +-----------------------------------------------------------------------------------------------------------------+
Script:
# Converting string to map
df = df.withColumn(
"meterDetails",
F.expr("str_to_map(TRIM(BOTH '#{}' FROM meterDetails), '; ', '=')")
)
# Converting map to struct
df = df.withColumn("meterDetails", F.to_json("meterDetails"))
json_schema = spark.read.json(df.rdd.map(lambda r: r.meterDetails)).schema
df = df.withColumn("meterDetails", F.from_json("meterDetails", json_schema))
df.show(truncate=0)
# +---------------------------------------------------------+
# |meterDetails |
# +---------------------------------------------------------+
# |{Storage, Read Operations, General Block Blob, 100000000}|
# +---------------------------------------------------------+
df.printSchema()
# root
# |-- meterDetails: struct (nullable = true)
# | |-- meterCategory: string (nullable = true)
# | |-- meterName: string (nullable = true)
# | |-- meterSubCategory: string (nullable = true)
# | |-- unitOfMeasure: string (nullable = true)
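For intuition, here is a minimal plain-Python sketch of what the str_to_map step does to one value: trim the #{ } wrapper, split the pairs on "; ", and split each pair on "=". The helper name parse_meter_details is made up for illustration and is not part of Spark.

```python
def parse_meter_details(raw: str) -> dict:
    """Hypothetical helper mirroring str_to_map(TRIM(BOTH '#{}' FROM x), '; ', '=')."""
    inner = raw.strip("#{}")  # drop the leading '#{' and trailing '}'
    result = {}
    for item in inner.split("; "):
        if not item:
            continue  # skip empty input such as '#{}'
        key, _, value = item.partition("=")  # split on the first '=' only
        result[key] = value
    return result


raw = ("#{meterName=Read Operations; meterCategory=Storage; "
       "meterSubCategory=General Block Blob; unitOfMeasure=100000000}")
meter_details = parse_meter_details(raw)
print(meter_details)
```

All values stay strings here, just as every field of the resulting struct in the printSchema() output above is a string.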

To turn this string into something you can work with, I would first use regexp_extract to get whatever is between #{ and }:
from pyspark.sql.functions import col, expr, regexp_extract, split
df1 = df1.withColumn("data", regexp_extract(col("string"), "\\{(.*)\\}", 1))
Now, data looks as below:
meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000
Next, I would split on ; (the pattern ";\\s*" also swallows the space after each semicolon, so the keys carry no leading space) and transform each item into a named_struct:
df1 = df1.withColumn("data", split(col("data"), ";\\s*"))
df1 = df1.withColumn("data",
    expr("transform(data, x -> named_struct('key',split(x, \"=\")[0],'value',split(x, \"=\")[1]))")
)
Now, data looks like:
[{meterName, Read Operations}, {meterCategory, Storage}, {meterSubCategory, General Block Blob}, {unitOfMeasure, 100000000}]
where col("data")[0].key gives meterName, for example.
As a further example, let's say you want another column that holds the keys only (note: key and value are the field names hard-coded in the previous step):
df1 = df1.withColumn("otherData", expr("transform(data, x -> x.key)"))
The result:
[meterName, meterCategory, meterSubCategory, unitOfMeasure]
I hope this is what you are looking for, good luck!

Related

Json creation from spark dataframe in scala

Currently, we are converting a spark dataframe to JSON String to be sent to kafka.
In the process, we are doing toJSON twice which inserts \ for the inner json.
Snippet of the code:
val df=spark.sql("select * from dB.tbl")
val busDate="2022-09-23"
case class kafkaMsg(busDate:String,msg:String)
Assuming my df has 2 columns as ID,STATUS, this will constitute the inner json of my kafka message.
JSON is created for msg and applied to case class.
val rdd=df.toJSON.rdd.map(msg=>kafkaMsg(busDate,msg))
Output at this step:
kafkaMsg(2022-09-23,{"id":1,"status":"active"})
Now, to send busDate and msg as JSON to kafka ,again a toJSON is applied.
val df1=spark.createDataFrame(rdd).toJSON
The output is:
{"busDate":"2022-09-23","msg":"{\"id\":1,\"status\":\"active\"}"}
The inner JSON contains \ escapes, which is not what the consumers are expecting.
Expected JSON:
{"busDate":"2022-09-23","msg":{"id":1,"status":"active"}}
How can I create this JSON without the \ escapes and send it to Kafka?
Please note that the msg value varies and cannot be mapped to a schema.
Your msg is escaped because it's already a string. So, you are toString-ing a String when you convert to JSON...
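The effect is easy to reproduce outside Spark. This is a small sketch with Python's json module (the variable names are mine): serializing a value that is already a JSON string escapes its quotes, while parsing it first nests it as an object.

```python
import json

# The inner message is already a JSON *string*, like the output of the first toJSON.
inner = json.dumps({"id": 1, "status": "active"})

# Serializing again treats it as plain text, so its quotes get escaped.
escaped = json.dumps({"busDate": "2022-09-23", "msg": inner})

# Parsing the inner string first makes it a nested object instead.
nested = json.dumps({"busDate": "2022-09-23", "msg": json.loads(inner)})

print(escaped)
print(nested)
```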
JSON can be represented as Map[String, ?], so define a schema if your input data doesn't already have it.
Using PySpark as an example.
from pyspark.sql.types import MapType, StringType, StructField, StructType
scm = StructType([
    StructField('busDate', StringType(), nullable=False),
    StructField('msg', MapType(StringType(), StringType()), nullable=False)
])
sdf = spark.createDataFrame([
    ('2022-09-23', {"id":1,"status":"active"}),
], schema=scm)
Schema - notice that msg is not a string, but a Map[String, String]. And no, you cannot have multiple value types in one map (see the question "Spark SQL and MapType with string keys and any values").
root
|-- busDate: string (nullable = false)
|-- msg: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
As JSON - You don't need Jackson, or hack around with RDDs...
kafkaDf = sdf.selectExpr("to_json(struct(*)) as value")
kafkaDf.show(truncate=False)
Not escaped...
Notice that the id type was converted. If that's not something you want, then you need to use msg : StructType rather than MapType and give id : IntegerType, for example. (This assumes all records in the dataframe are consistently typed, obviously)
+-----------------------------------------------------------+
|value |
+-----------------------------------------------------------+
|{"busDate":"2022-09-23","msg":{"id":"1","status":"active"}}|
+-----------------------------------------------------------+
You could also pull out the key (switched to using pyspark.sql.functions):
from pyspark.sql import functions as f
kafkaDf = sdf.select(
    f.col("msg.id").cast("int").alias('key'),
    f.to_json(f.struct('*')).alias('value')
)
kafkaDf.printSchema()
kafkaDf.show(truncate=False)
root
|-- key: integer (nullable = true)
|-- value: string (nullable = true)
+---+-----------------------------------------------------------+
|key|value |
+---+-----------------------------------------------------------+
|1 |{"busDate":"2022-09-23","msg":{"id":"1","status":"active"}}|
+---+-----------------------------------------------------------+
Then you can use kafkaDf.write.format("kafka"), as normal
Alternatively, if you wanted to wrap the string information in a single field rather than in key-value pairs, your Kafka consumers would need to handle that on their own, for example by deserializing twice: first the record, then the inner string (the JSON value).

PySpark: TypeError: col should be Column

I am trying to create a dataframe out of a nested JSON structure, but I am encountering a problem that I don't understand. I have exploded an array-of-dicts structure in the JSON and now I am trying to access these dicts and create columns from their values. This is what the dicts look like:
The values at index 1 (subject, glocations etc.) go under the key "name" according to the schema:
However, when I try:
dataframe = dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
it throws error:
PySpark: TypeError: col should be Column
There is no such problem with any other of the keys in the dict, i.e. "value".
I really do not understand the problem, do I have to assume that there are inconsistencies in the data? If yes, can you recommend a way to check for or even dodge them?
Edit: Khalid had a good idea to pre-define the schema. I tried to do so by storing one of the JSON files as a kind of default file. From that file, I wanted to extract the schema as follows:
import json
from pyspark.sql.types import StructType
schemapath = 'default_schema.json'
with open(schemapath) as f:
    d = json.load(f)
schemaNew = StructType.fromJson(d)
responseDf = spark.read.schema(schemaNew).json("apiResponse.json", multiLine=True)
however, line
schemaNew = StructType.fromJson(d)
throws following error:
KeyError: 'fields'
No idea, where this 'fields' is coming from...
Error messages in Spark tell the truth.
dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
TypeError: col should be Column
DataFrame.withColumn documentation tells you how its input parameters are called and their data types:
Parameters:
- colName: str
string, name of the new column.
- col: Column
a Column expression for the new column.
So, col is the parameter's name and Column is its type. Column is the data type that withColumn expects for the parameter named col. What did it actually receive? It received dataframe.keywords_exp.name. But what type is that?
print(type(dataframe.keywords_exp.name))
# <class 'method'>
As can be seen, it's not of the expected type Column...
To get Column from Struct's field, you must use a different syntax.
Note: data types in the dataframe are not what you think they are. You don't have dicts anymore. Instead, you have a Struct type column. The keys from the old dictionaries are now Field names for Struct type column.
To access struct fields, you should be using any of the following options:
df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df = dataframe.withColumn("keywords_name", dataframe.keywords_exp['name'])
(Both, F.col("keywords_exp.name") and dataframe.keywords_exp['name'] are of type Column.)
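As far as I can tell, attribute access fails only for name because the Column class has a real method called name, and Python finds a real attribute before it ever tries the fallback that turns unknown attributes into struct fields. The toy class below sketches that lookup order. FakeColumn is a made-up stand-in, not the real pyspark.sql.Column.

```python
class FakeColumn:
    """Toy stand-in for a column object; NOT the real pyspark.sql.Column."""

    def __init__(self, path):
        self.path = path

    def name(self, alias):
        # A real method that happens to be called `name`.
        return FakeColumn(f"{self.path} AS {alias}")

    def __getitem__(self, field):
        # Explicit field access always builds a nested path.
        return FakeColumn(f"{self.path}.{field}")

    def __getattr__(self, field):
        # Python calls this only when normal attribute lookup fails.
        return self[field]


keywords_exp = FakeColumn("keywords_exp")
by_attr_value = keywords_exp.value    # no such attribute, falls back to a field
by_attr_name = keywords_exp.name      # finds the bound method, not a field
by_item_name = keywords_exp["name"]   # bracket syntax bypasses attribute lookup

print(type(by_attr_value).__name__, by_attr_value.path)
print(type(by_attr_name).__name__)
print(type(by_item_name).__name__, by_item_name.path)
```

This is why "value" works with dot syntax while "name" does not, and why the bracket or F.col forms are the safer habit.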
This is a dataframe having the same schema as yours. You can see that withColumn works well:
from pyspark.sql import functions as F
dataframe = spark.createDataFrame(
[(("N", "glocations", 1, "Cuba"),)],
'keywords_exp struct<major:string,name:string,rank:bigint,value:string>')
dataframe.printSchema()
# root
# |-- keywords_exp: struct (nullable = true)
# | |-- major: string (nullable = true)
# | |-- name: string (nullable = true)
# | |-- rank: long (nullable = true)
# | |-- value: string (nullable = true)
df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df.show()
# +--------------------+-------------+
# | keywords_exp|keywords_name|
# +--------------------+-------------+
# |{N, glocations, 1...| glocations|
# +--------------------+-------------+
Try setting the schema before reading.
Edit: I think the JSON schema needs to be in a specific format. I know it's not documented very well, but you can produce an example with the .json() method to see the format and then adjust your schema files. See the updated example below:
aa.json
[{"keyword_exp": {"name": "aa", "value": "bb"}}, {"keyword_exp": {"name": "oo", "value": "ee"}}]
test.py
import json
from pyspark.sql.session import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
if __name__ == '__main__':
    spark = SparkSession.builder.appName("test-app").master("local[1]").getOrCreate()
    schema = StructType([
        StructField('keyword_exp', StructType([
            StructField('name', StringType(), False),
            StructField('value', StringType(), False),
        ])),
    ])
    json_str = schema.json()
    json_obj = json.loads(json_str)
    # Save output of this as file
    print(json_str)
    # Just to see it pretty
    print(json.dumps(json_obj, indent=4))
    # Save to file
    with open("file_schema.json", "w") as f:
        f.write(json_str)
    # Load
    with open("file_schema.json", "r") as f:
        scheme_obj = json.loads(f.read())
    # Re-load
    loaded_schema = StructType.fromJson(scheme_obj)
    # Read the data with the schema that was re-loaded from the file
    df = spark.read.json("./aa.json", schema=loaded_schema)
    df.printSchema()
    df = df.select("keyword_exp.name", "keyword_exp.value")
    df.show()
output:
{"fields":[{"metadata":{},"name":"keyword_exp","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":false,"type":"string"},{"metadata":{},"name":"value","nullable":false,"type":"string"}],"type":"struct"}}],"type":"struct"}
{
"fields": [
{
"metadata": {},
"name": "keyword_exp",
"nullable": true,
"type": {
"fields": [
{
"metadata": {},
"name": "name",
"nullable": false,
"type": "string"
},
{
"metadata": {},
"name": "value",
"nullable": false,
"type": "string"
}
],
"type": "struct"
}
}
],
"type": "struct"
}
root
|-- keyword_exp: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- value: string (nullable = true)
+----+-----+
|name|value|
+----+-----+
| aa| bb|
| oo| ee|
+----+-----+
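The KeyError: 'fields' from the question fits this picture: a data file such as aa.json has no top-level "fields" key, whereas the schema JSON printed above does. Here is a rough pre-check you could run before calling StructType.fromJson. The helper looks_like_struct_schema is hypothetical and only checks the top-level shape.

```python
import json


def looks_like_struct_schema(obj) -> bool:
    """Hypothetical check: does obj have the top-level shape of the schema JSON above?"""
    return (
        isinstance(obj, dict)
        and obj.get("type") == "struct"
        and isinstance(obj.get("fields"), list)
    )


# Shaped like the output of schema.json(): a struct with a list of fields.
schema_json = json.loads(
    '{"type": "struct", "fields": ['
    '{"metadata": {}, "name": "keyword_exp", "nullable": true, "type": "string"}]}'
)
# Shaped like a data file: records, no "fields" key anywhere at the top level.
data_json = json.loads('[{"keyword_exp": {"name": "aa", "value": "bb"}}]')

print(looks_like_struct_schema(schema_json))
print(looks_like_struct_schema(data_json))
```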
The Spark API seems to have problems with certain protected words. I came across this link when googling the error message
AttributeError: 'function' object has no attribute
https://learn.microsoft.com/en-us/azure/databricks/kb/python/function-object-no-attribute
Though "name" is not on the list, I changed all occurrences of "name" in the JSON to "nameabcde" and now I can access it.

Create dataframe from json string having true false value

I want to create a Spark dataframe from a JSON string in Python without specifying a schema. The JSON is multilevel nested and may contain arrays.
I used the call below to create the dataframe, but I get 'Cannot infer Schema':
spark.createDataFrame(jsonStr)
I tried loading same json from file using below
spark.read.option("multiline", "true").json("/path")
This statement didn't have any issue and loaded the data to spark dataframe.
Is there a similar way to load the data from a JSON variable?
It is fine even if not all the values are normalized.
Edit:
I found out that the issue might be due to the true and false (Boolean) values present in the JSON: when I try to use createDataFrame, Python treats true and false as variable names.
Is there any way to bypass this? The file also contains true or false. I also tried converting the list (a list of nested dictionaries) to JSON with json.dumps(). That gives this error:
Can not infer schema for type : <class 'str'>
Edit 2:
Input:
data = [
{
"a":"testA",
"b":"testB",
"c":false
}
]
Required output dataframe
a | b | c
--------------------
testA | testB | false
I get the required output when loading from the file. The file contains exactly the same content as data.
spark.read.option("multiline", "true").json("/path/test.json")
Also, if the data is a string, I get the error Can not infer schema for type : <class 'str'>
If you don't want to load the data from a JSON file, you'd have to provide a schema for the JSON and use from_json to parse it (here data is the JSON string):
from pyspark.sql import functions as F
from pyspark.sql import types as T
schema = T.ArrayType(T.StructType([
T.StructField('a', T.StringType()),
T.StructField('b', T.StringType()),
T.StructField('c', T.BooleanType()),
]))
df = (spark
.createDataFrame([('dummy',)], ['x'])
.withColumn('x', F.from_json(F.lit(data), schema))
)
df.show(10, False)
df.printSchema()
+-----------------------+
|x |
+-----------------------+
|[{testA, testB, false}]|
+-----------------------+
root
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: boolean (nullable = true)
If your input is a JSON string, you can deserialize it to a list of dictionaries before creating a Spark dataframe:
spark.createDataFrame(json.loads(data))
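The deserialization step on its own looks like this (a minimal sketch; json_str and rows are my names). json.loads maps the JSON literals true and false to Python's True and False, which is what avoids the "variable" problem from the question.

```python
import json

# The input must be a *string*; bare true/false are not valid Python literals.
json_str = '[{"a": "testA", "b": "testB", "c": false}]'

rows = json.loads(json_str)  # a list of plain Python dicts
print(rows)
print(type(rows[0]["c"]))
```

The resulting list of dictionaries is the kind of value that the createDataFrame call above receives.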

Extract json data in Spark/Scala

I have a json file with this structure
root
|-- labels: struct (nullable = true)
| |-- compute.googleapis.com/resource_name: string (nullable = true)
| |-- container.googleapis.com/namespace_name: string (nullable = true)
| |-- container.googleapis.com/pod_name: string (nullable = true)
| |-- container.googleapis.com/stream: string (nullable = true)
I want to extract the four .....googleapis.com/... into four columns.
I tried this:
import org.apache.spark.sql.functions._
df = df.withColumn("resource_name", df("labels.compute.googleapis.com/resource_name"))
.withColumn("namespace_name", df("labels.compute.googleapis.com/namespace_name"))
.withColumn("pod_name", df("labels.compute.googleapis.com/pod_name"))
.withColumn("stream", df("labels.compute.googleapis.com/stream"))
I also tried wrapping labels in an array and exploding it, which got rid of the first error (it complained that the sublevels are not an array or map):
df2 = df.withColumn("labels", explode(array(col("labels"))))
.select(col("labels.compute.googleapis.com/resource_name").as("resource_name"), col("labels.compute.googleapis.com/namespace_name").as("namespace_name"), col("labels.compute.googleapis.com/pod_name").as("pod_name"), col("labels.compute.googleapis.com/stream").as("stream"))
I still get this error
org.apache.spark.sql.AnalysisException: No such struct field compute in compute.googleapis.com/resource_name .....
I know Spark treats each dot as a nested level, but how can I write compute.googleapis.com/resource_name so that Spark recognises it as a single field name rather than a multilevel path?
I also tried to solve as stated here
How to get Apache spark to ignore dots in a query?
But this also did not solve my problem. I have labels.compute.googleapis.com/resource_name, adding backticks to the compute.googleapis.com/resource_name still gives same error.
Rename the nested fields by casting the struct to a schema with simpler names, then do the withColumn:
val schema = """struct<resource_name:string, namespace_name:string, pod_name:string, stream:string>"""
val df1 = df.withColumn("labels", $"labels".cast(schema))
You can use backticks (`) to isolate names that contain special characters like '.'. The backticks go after labels, around the nested field name only, because labels is the parent.
val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
.withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
.withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
.withColumn("stream", df("labels.`container.googleapis.com/stream`"))
extracted.show(10, false)
Output:
+--------------------+-------------+--------------+--------+------+
|labels |resource_name|namespace_name|pod_name|stream|
+--------------------+-------------+--------------+--------+------+
|[RN_1,NM_1,PM_1,S_1]|RN_1 |NM_1 |PM_1 |S_1 |
+--------------------+-------------+--------------+--------+------+
UPDATE 1
Full working example.
import org.apache.spark.sql.functions._
val j_1 =
"""
|{ "labels" : {
| "compute.googleapis.com/resource_name" : "RN_1",
| "container.googleapis.com/namespace_name" : "NM_1",
| "container.googleapis.com/pod_name" : "PM_1",
| "container.googleapis.com/stream" : "S_1"
| }
|}
""".stripMargin
val df = spark.read.json(Seq(j_1).toDS)
df.printSchema()
val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
.withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
.withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
.withColumn("stream", df("labels.`container.googleapis.com/stream`"))
extracted.show(10, false)
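If there are many such fields, the quoted paths can be generated instead of typed by hand. This is only a sketch in plain Python (the helper nested_field_path is made up, and it deliberately refuses names that themselves contain a backtick). The strings it builds are the same ones passed to df(...) above.

```python
def nested_field_path(parent: str, field: str) -> str:
    """Hypothetical helper: quote only the child name, e.g. parent.`child.with.dots`."""
    if "`" in field:
        raise ValueError("field names containing backticks are not handled in this sketch")
    return f"{parent}.`{field}`"


label_fields = [
    "compute.googleapis.com/resource_name",
    "container.googleapis.com/namespace_name",
    "container.googleapis.com/pod_name",
    "container.googleapis.com/stream",
]

# Map a short column name (the part after '/') to its quoted path.
paths = {field.split("/")[-1]: nested_field_path("labels", field) for field in label_fields}

for short_name, path in paths.items():
    print(short_name, "->", path)
```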

From Postgres JSONB into Spark JSONRDD [duplicate]

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, schema can be determined using schema_of_json function (please note that this assumes that an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("k", StringType, true), StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
This extracts the fields as individual strings, which can then be cast to the expected types.
The path argument is expressed using dot syntax, with leading $. denoting document root (since the code above uses string interpolation $ has to be escaped, hence $$.).
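To make the path syntax concrete, here is a simplified plain-Python model of resolving a $.key path against a JSON document. It is not Spark's implementation: the helper get_by_path is hypothetical, it handles only dotted object keys, and it returns Python values rather than strings.

```python
import json


def get_by_path(json_str: str, path: str):
    """Simplified model: supports only '$.key1.key2' paths over JSON objects."""
    if not path.startswith("$."):
        raise ValueError("path must start with '$.' (the document root)")
    node = json.loads(json_str)
    for key in path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None  # missing fields yield no value
        node = node[key]
    return node


record = '{"k": "foo", "v": 1.0, "nested": {"x": 2}}'
print(get_by_path(record, "$.k"))
print(get_by_path(record, "$.nested.x"))
print(get_by_path(record, "$.missing"))
```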
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that the blob field cannot be represented in JSON. Otherwise you can omit the splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map{
case Row(key: String, json: String) =>
s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use a UDF to parse the JSON and output a struct or map column. For example, something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
implicit val formats = net.liftweb.json.DefaultFormats
parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
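The difference between the two approaches can be mimicked in a few lines of plain Python. This is a deliberately crude model, not Spark's inference: the function names are mine, and any type conflict simply falls back to string, whereas Spark's real merging rules are richer.

```python
import json

# Crude mapping from Python types to Spark-like type names (an assumption of this sketch).
TYPE_NAMES = {bool: "boolean", int: "long", float: "double", str: "string"}


def merge_types(left, right):
    """Keep the type if both sides agree; otherwise fall back to string."""
    if left is None or left == right:
        return right
    return "string"


def infer_merged_schema(json_rows):
    """Infer one field -> type mapping from every row that is passed in."""
    schema = {}
    for raw in json_rows:
        for field, value in json.loads(raw).items():
            inferred = TYPE_NAMES.get(type(value), "string")
            schema[field] = merge_types(schema.get(field), inferred)
    return schema


rows = ['{"a": true}', '{"a": "hello"}', '{"b": 22}']
first_only = infer_merged_schema(rows[:1])  # like looking at a single element
merged = infer_merged_schema(rows)          # like scanning all the data

print(first_only)
print(merged)
```

Looking at only the first row sees a boolean a and nothing else, while scanning all rows widens a to string and picks up b, matching the two printSchema() outputs above.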
Here are the docs for spark.read.json(): Scala API / Python API
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
//You can define whatever struct type that your json states
val schema = StructType(Seq(
StructField("key", StringType, true),
StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
underlying JSON String is
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data in to Cassandra.
sqlContext.read.json(rdd).select("column_name1 or fields name in Json", "column_name2","column_name2")
.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
.mode(SaveMode.Append)
.save()
I use the following (available since 2.2.0; I am assuming that your JSON string column is at column index 0):
import org.apache.spark.sql.{DataFrame, Encoders, Row, SparkSession}
def parse(df: DataFrame, spark: SparkSession): DataFrame = {
  val stringDf = df.map((value: Row) => value.getString(0), Encoders.STRING)
  spark.read.json(stringDf)
}
It will automatically infer the schema in your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html