Create dataframe from json string having true false value - json

I want to create a Spark dataframe from a JSON string in Python without specifying a schema. The JSON is multilevel nested and may contain arrays.
I used the call below to create the dataframe, but I get 'Cannot infer Schema':
spark.createDataFrame(jsonStr)
I tried loading the same JSON from a file using:
spark.read.option("multiline", "true").json("/path")
This statement worked without any issue and loaded the data into a Spark dataframe.
Is there a similar way to load the data from a JSON variable?
It is fine even if not all the values are normalized.
Edit:
I found out that the issue might be due to the true and false (boolean) values present in the JSON: when I tried to use createDataFrame, Python treated true and false as variable names.
Is there any way to bypass this? The file also contains true or false. I also tried converting the list (a list of nested dictionaries) to JSON using json.dumps(), but that gives the error:
Can not infer schema for type : <class 'str'>
Edit 2:
Input:
data = [
{
"a":"testA",
"b":"testB",
"c":false
}
]
Required output dataframe
a | b | c
--------------------
testA | testB | false
I get the required output when loading from the file. The file contains exactly the same content as data.
spark.read.option("multiline", "true").json("/path/test.json")
Also, if the data is a string, I get the error Can not infer schema for type : <class 'str'>

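If you don't want to load the data from a JSON file, you have to provide a schema for the JSON and use from_json to parse it. Note that data here has to hold the JSON text as a Python string, since from_json parses a string column: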
from pyspark.sql import functions as F
from pyspark.sql import types as T
schema = T.ArrayType(T.StructType([
T.StructField('a', T.StringType()),
T.StructField('b', T.StringType()),
T.StructField('c', T.BooleanType()),
]))
df = (spark
.createDataFrame([('dummy',)], ['x'])
.withColumn('x', F.from_json(F.lit(data), schema))
)
df.show(10, False)
df.printSchema()
+-----------------------+
|x |
+-----------------------+
|[{testA, testB, false}]|
+-----------------------+
root
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: boolean (nullable = true)

If your input is a JSON string, you can deserialize it to a list of dictionaries before creating a Spark dataframe:
import json
spark.createDataFrame(json.loads(data))
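As a small illustration of the json.loads route above, the sketch below uses only Python's standard json module (no Spark session) to show why parsing the text sidesteps the true/false problem: the JSON literals become Python booleans. json.dumps goes the other way and returns a str, which is the type createDataFrame complained about. The variable names json_str, records and round_tripped are made up for this example.

```python
import json

# JSON text as a Python string; `false` is valid JSON but not valid Python.
json_str = '[{"a": "testA", "b": "testB", "c": false}]'

# json.loads parses the text into a list of dicts with Python booleans.
records = json.loads(json_str)
print(records)

# json.dumps does the opposite: it serializes back to a plain str.
round_tripped = json.dumps(records)
print(type(round_tripped).__name__)
```

records is the kind of value that the createDataFrame call in the answer above receives.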

Related

Json creation from spark dataframe in scala

Currently, we are converting a spark dataframe to JSON String to be sent to kafka.
In the process, we are doing toJSON twice which inserts \ for the inner json.
Snippet of the code:
val df=spark.sql("select * from dB.tbl")
val bus_dt="2022-09-23"
case class kafkaMsg(busDate:String,msg:String)
Assuming my df has 2 columns as ID,STATUS, this will constitute the inner json of my kafka message.
JSON is created for msg and applied to case class.
val rdd=df.toJSON.rdd.map(msg=>kafkaMsg(bus_dt,msg))
Output at this step:
kafkaMsg(2022-09-23,{"id":1,"status":"active"})
Now, to send busDate and msg as JSON to Kafka, toJSON is applied again.
val df1=spark.createDataFrame(rdd).toJSON
The output is:
{"busDate":"2022-09-23","msg":"{\"id\":1,\"status\":\"active\"}"}
The inner JSON contains \ escapes, which is not what the consumers are expecting.
Expected JSON:
{"busDate":"2022-09-23","msg":{"id":1,"status":"active"}}
How can I create this JSON without the \ and send it to Kafka?
Please note the msg value varies and cannot be mapped to a schema.
Your msg is escaped because it's already a string. So, you are toString-ing a String when you convert to JSON...
JSON can be represented as Map[String, ?], so define a schema if your input data doesn't already have it.
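The escaping can be reproduced without Spark. Here is a minimal sketch with Python's standard json module, assuming the sample values from the question: serializing the inner record first and then embedding that string yields the backslashes, while embedding the structured value yields nested JSON. The names inner, compact, double_encoded and nested are illustrative only.

```python
import json

inner = {"id": 1, "status": "active"}
compact = {"separators": (",", ":")}  # no spaces, to match the output in the question

# Inner record serialized first, then embedded as a *string*: quotes get escaped.
double_encoded = json.dumps(
    {"busDate": "2022-09-23", "msg": json.dumps(inner, **compact)}, **compact
)

# Inner record embedded as a structured value: it stays nested JSON.
nested = json.dumps({"busDate": "2022-09-23", "msg": inner}, **compact)

print(double_encoded)
print(nested)
```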
Using PySpark as an example.
from pyspark.sql.types import MapType, StringType, StructField, StructType
scm = StructType([
StructField('busDate', StringType(), nullable=False),
StructField('msg', MapType(StringType(), StringType()), nullable=False)
])
sdf = spark.createDataFrame([
('2022-09-23', {"id":1,"status":"active"}),
], schema=scm)
Schema - Notice that msg is not a string, but a Map[String, String]. And no, you cannot have multiple value types - Spark SQL and MapType with string keys and any values
root
|-- busDate: string (nullable = false)
|-- msg: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
As JSON - You don't need Jackson, or hack around with RDDs...
kafkaDf = sdf.selectExpr("to_json(struct(*)) as value")
kafkaDf.show(truncate=False)
Not escaped...
Notice that the id type was converted. If that's not something you want, then you need to use msg : StructType rather than MapType and give id : IntegerType, for example. (This assumes all records in the dataframe are consistently typed, obviously)
+-----------------------------------------------------------+
|value |
+-----------------------------------------------------------+
|{"busDate":"2022-09-23","msg":{"id":"1","status":"active"}}|
+-----------------------------------------------------------+
You could also pull out the key (switched to using pyspark.sql.functions, imported as f)
from pyspark.sql import functions as f
kafkaDf = sdf.select(
f.col("msg.id").cast("int").alias('key'),
f.to_json(f.struct('*')).alias('value')
)
kafkaDf.printSchema()
kafkaDf.show(truncate=False)
root
|-- key: integer (nullable = true)
|-- value: string (nullable = true)
+---+-----------------------------------------------------------+
|key|value |
+---+-----------------------------------------------------------+
|1 |{"busDate":"2022-09-23","msg":{"id":"1","status":"active"}}|
+---+-----------------------------------------------------------+
Then you can use kafkaDf.write.format("kafka"), as normal
Alternatively, if you wanted to wrap the string information in a single field, rather than key-value pairs, then your Kafka consumers would need to handle that on their own, for example by double-deserializing: first the record, then the inner string (JSON value).
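For the double-deserializing case, a consumer-side sketch in plain Python might look like this. It uses the standard json module only; the record value is the escaped payload from the question, and the variable names are hypothetical.

```python
import json

# Payload as it arrives when msg was embedded as an escaped string.
record = r'{"busDate":"2022-09-23","msg":"{\"id\":1,\"status\":\"active\"}"}'

outer = json.loads(record)      # first pass: msg is still a str
msg = json.loads(outer["msg"])  # second pass: msg becomes a dict

print(outer["busDate"], msg["id"], msg["status"])
```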

Spark JSON Read Nested Structured Strings as Structs

I have a file with many JSON records in it. Each record contains a Struct ('Properties') and within each, a String that looks like this:
'meterDetails: "#{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}"'
Note that values are not enclosed in a "".
I want to treat this column (meterDetails) as another Struct in my DF as all structs will be flattened eventually.
I tried defining a schema, removing the # with regexp_replace('col','#',''), and using from_json with that schema. This produced a new column in JSON format, but with all NULL values.
Splitting the column with split(col("meterDetails"),";") turns it into an Array, but upon conversion to JSON it is back to all NULL values.
Question:
I'm clearly misunderstanding the #{..} structure passed by this API. In Spark, should I convert this string to an object that will natively result in a Struct?
Somehow I gravitate towards the str_to_map function, but then you need to transform the map to a struct. I'm not sure this is the best way, but I'd do it like this.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[("#{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}",)],
['meterDetails'])
df.show(truncate=0)
# +-----------------------------------------------------------------------------------------------------------------+
# |meterDetails |
# +-----------------------------------------------------------------------------------------------------------------+
# |#{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}|
# +-----------------------------------------------------------------------------------------------------------------+
Script:
# Converting string to map
df = df.withColumn(
"meterDetails",
F.expr("str_to_map(TRIM(BOTH '#{}' FROM meterDetails), '; ', '=')")
)
# Converting map to struct
df = df.withColumn("meterDetails", F.to_json("meterDetails"))
json_schema = spark.read.json(df.rdd.map(lambda r: r.meterDetails)).schema
df = df.withColumn("meterDetails", F.from_json("meterDetails", json_schema))
df.show(truncate=0)
# +---------------------------------------------------------+
# |meterDetails |
# +---------------------------------------------------------+
# |{Storage, Read Operations, General Block Blob, 100000000}|
# +---------------------------------------------------------+
df.printSchema()
# root
# |-- meterDetails: struct (nullable = true)
# | |-- meterCategory: string (nullable = true)
# | |-- meterName: string (nullable = true)
# | |-- meterSubCategory: string (nullable = true)
# | |-- unitOfMeasure: string (nullable = true)
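To make the string-to-map step easier to follow, here is a plain-Python sketch of what the TRIM plus str_to_map expression does to one value, as I understand it: strip the characters #, { and } from both ends, split the pairs on '; ', then split each pair on the first '='. The function name parse_meter_details is made up, and this only mirrors the logic; it is not Spark code.

```python
def parse_meter_details(raw: str) -> dict:
    # Strip any of '#', '{', '}' from both ends, like TRIM(BOTH '#{}' FROM ...).
    body = raw.strip("#{}")
    # Split into pairs on '; ' and into key/value on the first '='.
    pairs = (item.split("=", 1) for item in body.split("; ") if item)
    return {key: value for key, value in pairs}


sample = (
    "#{meterName=Read Operations; meterCategory=Storage; "
    "meterSubCategory=General Block Blob; unitOfMeasure=100000000}"
)
parsed = parse_meter_details(sample)
print(parsed)
```

All values stay strings here, which matches the all-string struct in the printSchema output above.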
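To transform your dataset into an operable JSON file, I would first use regexp_extract to get whatever is within #{ and }. The snippets below use these imports:
from pyspark.sql.functions import col, expr, regexp_extract, split
The extraction is done through: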
df1 = df1.withColumn("data", regexp_extract(col("string"), "\\{(.*)\\}", 1))
Now, data looks as below:
meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000
Next, I would split on ; and transform your data into a named_struct:
df1 = df1.withColumn("data", split(col("data"), ";"))
df1 = df1.withColumn("data",
expr("transform(data, x -> named_struct('key',split(x, \"=\")[0],'value',split(x, \"=\")[1]))")
)
Now, data looks as:
[{meterName, Read Operations}, { meterCategory, Storage}, { meterSubCategory, General Block Blob}, { unitOfMeasure, 100000000}]
where col(data)[0].key gives meterName for example.
For more details, let's say you want another column to extract the keys only (note that the names key and value are hard-coded in the previous step):
df1 = df1.withColumn("otherData", expr("transform(data, x -> x.key)"))
The result:
[meterName, meterCategory, meterSubCategory, unitOfMeasure]
I hope this is what you are looking for, good luck!
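One detail worth noticing in the output above is the leading space in keys such as ' meterCategory', which comes from splitting on ';' while the pairs are actually separated by '; '. A plain-Python sketch of the same split (not Spark code; to_pairs and the other names are illustrative) shows the effect and one way to avoid it by trimming each piece:

```python
data = (
    "meterName=Read Operations; meterCategory=Storage; "
    "meterSubCategory=General Block Blob; unitOfMeasure=100000000"
)


def to_pairs(text: str, trim: bool) -> list:
    """Split on ';' into key/value dicts, optionally trimming whitespace first."""
    pairs = []
    for piece in text.split(";"):
        if trim:
            piece = piece.strip()
        key, value = piece.split("=", 1)
        pairs.append({"key": key, "value": value})
    return pairs


untrimmed_keys = [p["key"] for p in to_pairs(data, trim=False)]
trimmed_keys = [p["key"] for p in to_pairs(data, trim=True)]
print(untrimmed_keys)
print(trimmed_keys)
```

In Spark the equivalent would presumably be splitting on '; ' or wrapping the pieces in trim(); I have not run that against the original data.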

Is there a way to convert few 1000 columns from string to Integer, while saving as parquet file?

Using pyspark, I am extracting 1500 fields from JSON file and saving as parquet and create hive external table.
All the fields extracted from the JSON are in string format. In the Hive DDL, all the columns should be of Integer type.
When I save as parquet and query the Hive table, I see the error below:
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException
Is there a way to handle this error?
Converting the columns to Int before saving as parquet helps, but converting 1500 columns explicitly to Integer is not feasible.
I know a generic way of doing it, as follows:
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import col
# Consider df to be the dataframe from reading the JSON file.
>>> df.show()
+-------+------+
|details|header|
+-------+------+
| def| 2.0|
+-------+------+
>>> df.printSchema()
root
|-- details: string (nullable = true)
|-- header: string (nullable = true)
# Convert all columns to integer type.
>>> df_parq=df.select(*(col(c).cast(IntegerType()).alias(c) for c in df.columns))
>>> df_parq.printSchema()
root
|-- details: integer (nullable = true)
|-- header: integer (nullable = true)
# Write file with modified column types to Parquet.
>>> df_parq.write.parquet(r'F:\Parquet\sample_out3')
>>> df_read_parq=spark.read.parquet(r'F:\Parquet\sample_out3')
>>> df_read_parq.printSchema()
root
|-- details: integer (nullable = true)
|-- header: integer (nullable = true)
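With 1500 columns it can also help to generate the cast expressions as SQL strings, for example to leave a few columns untouched. A minimal sketch in plain Python follows; int_cast_exprs and its skip parameter are hypothetical helpers, and the resulting list is meant to be passed to DataFrame.selectExpr.

```python
def int_cast_exprs(columns, skip=()):
    """Build one SQL expression per column, casting everything not in `skip` to INT."""
    exprs = []
    for name in columns:
        quoted = f"`{name}`"  # backticks protect names with dots or spaces
        if name in skip:
            exprs.append(quoted)
        else:
            exprs.append(f"CAST({quoted} AS INT) AS {quoted}")
    return exprs


exprs = int_cast_exprs(["details", "header", "id"], skip=("id",))
print(exprs)
# With a real dataframe this would be used as: df.selectExpr(*exprs)
```

As with the cast in the answer, values that are not valid integers (such as 'def' in the sample) typically come back as null rather than raising, at least with ANSI mode off, so it is worth checking null counts before writing the parquet file.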

Extract json data in Spark/Scala

I have a json file with this structure
root
|-- labels: struct (nullable = true)
| |-- compute.googleapis.com/resource_name: string (nullable = true)
| |-- container.googleapis.com/namespace_name: string (nullable = true)
| |-- container.googleapis.com/pod_name: string (nullable = true)
| |-- container.googleapis.com/stream: string (nullable = true)
I want to extract the four .....googleapis.com/... into four columns.
I tried this:
import org.apache.spark.sql.functions._
df = df.withColumn("resource_name", df("labels.compute.googleapis.com/resource_name"))
.withColumn("namespace_name", df("labels.compute.googleapis.com/namespace_name"))
.withColumn("pod_name", df("labels.compute.googleapis.com/pod_name"))
.withColumn("stream", df("labels.compute.googleapis.com/stream"))
I also tried this, making labels an array, which solved the first error saying that the sublevels are not an array or map:
df2 = df.withColumn("labels", explode(array(col("labels"))))
.select(col("labels.compute.googleapis.com/resource_name").as("resource_name"), col("labels.compute.googleapis.com/namespace_name").as("namespace_name"), col("labels.compute.googleapis.com/pod_name").as("pod_name"), col("labels.compute.googleapis.com/stream").as("stream"))
I still get this error
org.apache.spark.sql.AnalysisException: No such struct field compute in compute.googleapis.com/resource_name .....
I know Spark treats each dot as a nested level, but how can I format compute.googleapis.com/resource_name so that Spark recognises it as the name of a single level rather than multiple levels?
I also tried to solve as stated here
How to get Apache spark to ignore dots in a query?
But this also did not solve my problem. I have labels.compute.googleapis.com/resource_name, adding backticks to the compute.googleapis.com/resource_name still gives same error.
Rename the columns (or sublevels), then do the withColumn:
val schema = """struct<resource_name:string, namespace_name:string, pod_name:string, stream:string>"""
val df1 = df.withColumn("labels", $"labels".cast(schema))
You can use a backtick (`) to isolate names that contain special characters like '.'. The backticks go after labels, since labels is the parent field.
val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
.withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
.withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
.withColumn("stream", df("labels.`container.googleapis.com/stream`"))
extracted.show(10, false)
Output:
+--------------------+-------------+--------------+--------+------+
|labels |resource_name|namespace_name|pod_name|stream|
+--------------------+-------------+--------------+--------+------+
|[RN_1,NM_1,PM_1,S_1]|RN_1 |NM_1 |PM_1 |S_1 |
+--------------------+-------------+--------------+--------+------+
UPDATE 1
Full working example.
import org.apache.spark.sql.functions._
val j_1 =
"""
|{ "labels" : {
| "compute.googleapis.com/resource_name" : "RN_1",
| "container.googleapis.com/namespace_name" : "NM_1",
| "container.googleapis.com/pod_name" : "PM_1",
| "container.googleapis.com/stream" : "S_1"
| }
|}
""".stripMargin
val df = spark.read.json(Seq(j_1).toDS)
df.printSchema()
val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
.withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
.withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
.withColumn("stream", df("labels.`container.googleapis.com/stream`"))
extracted.show(10, false)
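If there are many such fields, the backtick-quoted paths can be built programmatically. Below is a small plain-Python sketch of the string construction only (Python rather than Scala, to keep it runnable without Spark). quote_field_path is a hypothetical helper; it wraps the child name in backticks and does not handle names that themselves contain a backtick.

```python
def quote_field_path(parent: str, child: str) -> str:
    """Return a column path in which the child name is treated as one field."""
    return f"{parent}.`{child}`"


labels = [
    "compute.googleapis.com/resource_name",
    "container.googleapis.com/namespace_name",
    "container.googleapis.com/pod_name",
    "container.googleapis.com/stream",
]

# Map the short output column name (text after the last '/') to its quoted path.
paths = {name.rsplit("/", 1)[1]: quote_field_path("labels", name) for name in labels}
for alias, path in paths.items():
    print(alias, "->", path)
```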

From Postgres JSONB into Spark JSONRDD [duplicate]

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, schema can be determined using schema_of_json function (please note that this assumes that an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("k", StringType, true), StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
and extracts fields to individual strings which can be further cast to expected types.
The path argument is expressed using dot syntax, with leading $. denoting document root (since the code above uses string interpolation $ has to be escaped, hence $$.).
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that the blob field cannot be represented in JSON. Otherwise you can omit splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map{
case Row(key: String, json: String) =>
s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use a UDF to parse JSON and output a struct or map column. For example something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
implicit val formats = net.liftweb.json.DefaultFormats
parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> from pyspark.sql.functions import from_json, schema_of_json
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
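To see the difference between the two strategies without a cluster, here is a toy model in plain Python. It is not Spark's actual inference algorithm: it only knows four type names and resolves every type conflict to string, which is an assumption made for illustration. The function names are hypothetical.

```python
import json

rows = ['{"a": true}', '{"a": "hello"}', '{"b": 22}']


def type_name(value) -> str:
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def schema_from_first(json_rows) -> dict:
    """Mimic deriving the schema from a single record."""
    return {k: type_name(v) for k, v in json.loads(json_rows[0]).items()}


def schema_from_all(json_rows) -> dict:
    """Mimic merging the schemas of all records; conflicts fall back to string."""
    schema = {}
    for row in json_rows:
        for key, value in json.loads(row).items():
            found = type_name(value)
            schema[key] = found if schema.get(key, found) == found else "string"
    return schema


first_only = schema_from_first(rows)
merged = schema_from_all(rows)
print(first_only)
print(merged)
```

The single-record schema only knows a as boolean, while the merged one covers both a and b, mirroring the two printSchema outputs above.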
Here are the docs for spark.read.json(): Scala API / Python API
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
//You can define whatever struct type that your json states
val schema = StructType(Seq(
StructField("key", StringType, true),
StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
The underlying JSON string is
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data in to Cassandra.
sqlContext.read.json(rdd).select("column_name1 or fields name in Json", "column_name2","column_name2")
.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
.mode(SaveMode.Append)
.save()
I use the following
(available since 2.2.0; I am assuming that your JSON string column is at column index 0)
def parse(df: DataFrame, spark: SparkSession): DataFrame = {
val stringDf = df.map((value: Row) => value.getString(0), Encoders.STRING)
spark.read.json(stringDf)
}
It will automatically infer the schema in your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html