Spark Dataframe to StringType - json

In PySpark, how do I convert a DataFrame to a plain string?
Background:
I'm using PySpark with Kafka, and instead of hard-coding the broker name, I have parameterized it.
A JSON file holds the broker details; Spark reads this JSON input and assigns the values to variables. These variables are DataFrames containing a single string column.
I run into an issue when I pass these DataFrames into the PySpark-Kafka connection options to substitute the values.
Error:
Can only concatenate String (Not a Dataframe) to String.
Json parameter file :
{
"broker": "https://at.com:8082",
"topicname": "dev_hello"
}
PySpark Code :
parameter = spark.read.option("multiline", "true").json("/at/dev_parameter.json")
kserver = parameter.select("broker")
ktopic = parameter.select("topicname")
df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
.write
.format("kafka")
.outputMode("append")
.option("kafka.bootstrap.servers", "f"+ kserver)
.option("topic", "josn_data_topic",ktopic )
.save()
Please advise.
My second question: how do I pass these Python variables to another, Scala-based Spark notebook?

Use json.load instead of the Spark JSON reader, so the values are plain Python strings:
import json
with open("/at/dev_parameter.json") as f:
    parameter = json.load(f)
kserver = parameter["broker"]
ktopic = parameter["topicname"]
# df.write is a batch DataFrameWriter, so the save mode is set with mode(), not outputMode()
df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value") \
    .write \
    .format("kafka") \
    .mode("append") \
    .option("kafka.bootstrap.servers", kserver) \
    .option("topic", ktopic) \
    .save()
If you prefer the Spark JSON reader, take the value from the first row of each selection:
parameter = spark.read.option("multiline", "true").json("/at/dev_parameter.json")
kserver = parameter.select("broker").head()[0]
ktopic = parameter.select("topicname").head()[0]
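As a quick check of the json.load approach, here is a minimal sketch. It writes the parameter file from the question to a temporary path, loads it, and confirms that the values are plain `str` objects. The temporary path and the `kafka_options` dict are assumptions made for illustration; they are not part of the original code.

```python
import json
import os
import tempfile

# Sample parameter file from the question, written to a temporary path
# (in the question the real file lives at /at/dev_parameter.json).
params = {"broker": "https://at.com:8082", "topicname": "dev_hello"}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(params, tmp)
    param_path = tmp.name

with open(param_path) as f:
    parameter = json.load(f)
os.remove(param_path)

# Hypothetical helper dict; the option names are the ones used in the question.
kafka_options = {
    "kafka.bootstrap.servers": parameter["broker"],
    "topic": parameter["topicname"],
}

# Every value is a plain Python string, so it can be concatenated or passed
# directly to .option(...).
for name, value in kafka_options.items():
    print(name, "->", value, type(value).__name__)
```

With a DataFrame `df` at hand, such a dict could be applied in one call, for example `df.write.format("kafka").options(**kafka_options)`.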

Related

How to convert JSON to Spark schema automatically?

I have a big JSON document which I want to use in Spark Structured Streaming. I don't want to retype it manually as a Spark schema expression. Can I generate the schema automatically, once?
I wrote this
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Infer Schema") \
.getOrCreate()
df = spark \
.read \
.option("multiline", True) \
.json("file_examples/dataflow/row01.json")
df.printSchema()
df.show()
with open("dataflow_schema.json", "w") as fp:
    fp.write(df.schema.json())
Is this ok?
You are on the right path. You can save your schema as JSON and load it later. Before use, parse the saved JSON string into a dict and then convert that dict to a StructType:
import json
from pyspark.sql.types import StructType
with open("dataflow_schema.json", "r") as fp:
    json_schema_str = fp.read()
my_schema = StructType.fromJson(json.loads(json_schema_str))
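The file written by `df.schema.json()` is ordinary JSON, so it can be inspected without Spark. The sketch below walks such a document with the standard library and lists its fields, roughly in the spirit of `printSchema()`. The sample schema and its field names are made up for illustration, and `describe` is a hypothetical helper, not a Spark API.

```python
import json

# Hypothetical example of a saved schema; the field names are invented.
schema_json = """
{"type": "struct", "fields": [
  {"name": "id", "type": "long", "nullable": true, "metadata": {}},
  {"name": "tags",
   "type": {"type": "array", "elementType": "string", "containsNull": true},
   "nullable": true, "metadata": {}},
  {"name": "user",
   "type": {"type": "struct", "fields": [
     {"name": "name", "type": "string", "nullable": true, "metadata": {}}]},
   "nullable": true, "metadata": {}}
]}
"""


def describe(data_type, indent=0):
    """Return one line per field, indenting fields of nested structs."""
    lines = []
    if isinstance(data_type, dict) and data_type["type"] == "struct":
        for field in data_type["fields"]:
            ftype = field["type"]
            # Simple types are strings; complex types are nested dicts.
            label = ftype if isinstance(ftype, str) else ftype["type"]
            lines.append(" " * indent + f"{field['name']}: {label}")
            lines.extend(describe(ftype, indent + 2))
    elif isinstance(data_type, dict) and data_type["type"] == "array":
        # Descend into the element type in case it is a struct.
        lines.extend(describe(data_type["elementType"], indent))
    return lines


schema_lines = describe(json.loads(schema_json))
print("\n".join(schema_lines))
```

Checking the saved file this way before passing it to `StructType.fromJson` can help catch a truncated or hand-edited schema early.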
In your Structured Streaming query, if you have a JSON column, you can use from_json to convert it to a struct and then expand that struct into separate columns, e.g.:
from pyspark.sql.functions import from_json,col
# Assume that we have a kafkaStream
kafkaStream.selectExpr("CAST(value as string)")\
.select(from_json(col("value"),my_schema).alias("json_value"))\
.selectExpr("json_value.*") # extract as columns

pyspark flat csv to nested json in mongodb

Input CSV Data
userid, Code, Status
1234, 1 , final
1287, 2, notfinal
#Applied Pyspark Script
#Create Spark Session
spark = SparkSession.builder.master("yarn").appName().enableHiveSupport().config("spark.some.config.option", "some-value").getOrCreate()
#read csv data into dataframe
df = spark.read.load("Book3.csv",format="csv", sep=",", inferSchema="true", header="true")
#define schema for json df
newschema = StructType([StructField("userid", StringType()),StructField("report",
StringType(),metadata={"maxlength":6000})])
jsondf = df.rdd.map(lambda row: (row[0], ({"Code":row[1],"status" : row[2]})))\
.map(lambda row: (row[0], json.dumps(row[1])))\
.toDF(newschema)
jsondf.write.format("mongo").mode("append")\
.option("uri","mongodb://gcp.mongodb.net/").option("database","dbname").option("collection",
"testcollection").save()
Resultant Mongo data
{
"userid" : "1234",
"report" : "{\"Code\": \"1\", \"status\": \"final\"}"
}
{
"userid" : "1287",
"report" : "{\"Code\": \"2\", \"status\": \"notfinal\"}"
}
In Mongo I get a complete JSON-encoded string in "report", which is no surprise given that I declared the report field as StringType().
This makes any search on nested fields in Mongo impossible, which defeats the purpose of the code.
How can I make it a proper nested JSON document so that Mongo can search on nested fields as well?
When I try to change the field to properly structured JSON using the code below
>>> new_df = sql_context.read.json(df.rdd.map(lambda r: r.json))
>>> new_df.printSchema()
I get the error "raise AttributeError(item) AttributeError: json".
Please help with some code tips.
I am OK with using groupBy as well, but I am struggling with what to put in the aggregate functions, and I need a DataFrame as the result to write to Mongo.
The solution is to define the schema properly in PySpark (df_schema) and then map the base df into a new DataFrame (df_mongo), making sure that the values produced by df.rdd.map follow the layout defined in df_schema.
df = spark.read.load("sourcelocation",format="csv", sep="|", inferSchema="true", header="true")
df_schema = StructType([StructField("field1", StringType(),True),StructField("field2", StringType(),True)])
df_mongo = df.rdd.map(lambda row: ([row[15],row[12]])).toDF(df_schema)
df_mongo.write.format("mongo").mode("append").option("uri",mongodb_uri). \
option("database",dbname).option("collection", collection_name).save()
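To make the difference between the two document shapes concrete, here is a small sketch in plain Python, without Spark or Mongo, using the sample rows from the question. It builds the record once with "report" as a JSON-encoded string, as in the question, and once as a nested document. In PySpark the nested form should correspond to declaring "report" as a nested StructType with "Code" and "status" fields rather than as StringType; that mapping is stated here as an assumption and is not exercised by this code.

```python
import csv
import io
import json

# Sample rows from the question, including the stray spaces around values.
csv_text = """userid, Code, Status
1234, 1 , final
1287, 2, notfinal
"""

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
rows = [[value.strip() for value in row] for row in reader]

# Shape from the question: "report" is a JSON-encoded string.
as_string = [
    {"userid": r[0], "report": json.dumps({"Code": r[1], "status": r[2]})}
    for r in rows
]

# Nested shape: "report" is itself a document, so a nested-field lookup
# such as report.status has something to match against.
as_nested = [
    {"userid": r[0], "report": {"Code": r[1], "status": r[2]}}
    for r in rows
]

print(type(as_string[0]["report"]).__name__)
print(as_nested[0]["report"]["status"])
```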

how to easily extract json data of a known class from kafka in spark 2.3.0 structured streaming

In this Databricks blog post, they show how to extract JSON data from Kafka:
# Construct a streaming DataFrame that reads from topic1
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1") \
.option("startingOffsets", "earliest") \
.load() \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
# value schema: { "a": 1, "b": "string" }
schema = StructType().add("a", IntegerType()).add("b", StringType())
df.select( \
col("key").cast("string"),
from_json(col("value").cast("string"), schema))
Is there a way to create the schema from a known case class without manually defining all the fields?
Note: this is from the Spark 2.2 docs, and I believe that in Scala you have to add new for StructType in the schema line (IntegerType and StringType are objects and take no new):
val schema = new StructType().add("a", IntegerType).add("b", StringType)
I was able to figure this out thanks to this SO question.

How to use CustomJsonParser to parse json string in Spark Structured Streaming?

Instead of parsing the whole JSON string, the user provides a CustomJsonParser that parses part of the JSON string into a CustomObject. How can I use this CustomJsonParser to convert the JSON string in Spark Structured Streaming, instead of using the from_json and get_json_object methods?
Sample code:
val jsonDF = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", kakfaBrokers)
.option("subscribe", kafkaConsumeTopicName)
.option("group.id", kafkaConsumerGroupId)
.option("startingOffsets", startingOffsets)
.option("auto.offset.reset", autoOffsetReset)
.option("key.deserializer", classOf[StringDeserializer].getName)
.option("value.deserializer", classOf[StringDeserializer].getName)
.option("enable.auto.commit", "false")
.load()
val messagesDF = jsonDF.selectExpr("CAST(value AS STRING)")
spark.udf.register("parseJson", (json: String) =>
customJsonParser.parseJson(json)
)
val objDF = messagesDF.selectExpr("""parseJson(value) AS message""")
val query = objDF.writeStream
.outputMode(OutputMode.Append())
.format("console")
.start()
query.awaitTermination()
It runs with the following error:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type com.xxx.xxxEntity is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:693)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:159)

Reading JSON with Apache Spark - `corrupt_record`

I have a JSON file, nodes.json, that looks like this:
[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]
I am able to read and manipulate this record with Python.
I am trying to read this file in scala through the spark-shell.
From this tutorial, I can see that it is possible to read json via sqlContext.read.json
val vfile = sqlContext.read.json("path/to/file/nodes.json")
However, this results in a corrupt_record error:
vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is sound JSON and not corrupt.
Because Spark expects the JSON Lines format by default rather than a typical multi-line JSON document, you can tell Spark to read typical JSON by setting the multiline option:
val df = spark.read.option("multiline", "true").json("<file>")
In its default line-by-line mode, Spark cannot read this top-level JSON array as records, so you have to pass one object per line:
{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}
As it's described in the tutorial you're referring to:
Let's begin by loading a JSON file, where each line is a JSON object
The reasoning is quite simple. Spark expects a file containing many JSON entities, one per line, so that it can distribute their processing (roughly speaking, per entity).
To shed more light on it, here is a quote from the official docs:
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. As a consequence, a regular multi-line JSON file will
most often fail.
This format is called JSONL. Basically it's an alternative to CSV.
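If the input file cannot be changed at the source, one option is to convert the array-style file to JSON Lines before handing it to Spark. The sketch below does that with the standard library only, using the content from the question as an inline string; in practice the text would be read from nodes.json and the result written to a new file, and both of those steps are left out here as assumptions about the surrounding setup.

```python
import json

# The array-style content from the question (normally read from nodes.json).
array_text = """[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]"""

records = json.loads(array_text)

# One compact, self-contained JSON object per line (JSON Lines).
jsonl_lines = [json.dumps(record, separators=(",", ":")) for record in records]
jsonl_text = "\n".join(jsonl_lines) + "\n"
print(jsonl_text, end="")
```

This loads the whole array into memory, so it suits small and medium files; it is a sketch, not a streaming converter.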
To read the multi-line JSON as a DataFrame:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)
Reading large files in this manner is not recommended. From the wholeTextFiles docs:
Small files are preferred, large file is also allowable, but may cause bad performance.
I ran into the same problem. I used SparkContext and Spark SQL with the same configuration:
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("Simple Application")
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.config(conf)
.getOrCreate()
Then, using the Spark context, I read the whole JSON file (JSON is the path to the file):
val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2)
You can create a schema for future selects, filters...
val schema = StructType( List(
StructField("toid", StringType, nullable = true),
StructField("point", ArrayType(DoubleType), nullable = true),
StructField("index", DoubleType, nullable = true)
))
Create a DataFrame using spark sql:
var df: DataFrame = spark.read.schema(schema).json(jsonRDD).toDF()
For testing use show and printSchema:
df.show()
df.printSchema()
sbt build file:
name := "spark-single"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2"