How to convert JSON to Spark schema automatically? - json

I have a big JSON which is want to use in Spark Structured Streaming. I don't want to re-type this JSON as Spark schema expression manually. Can I do this automatically once?
I wrote this
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Infer Schema") \
.getOrCreate()
df = spark \
.read \
.option("multiline", True) \
.json("file_examples/dataflow/row01.json")
df.printSchema()
df.show()
with open("dataflow_schema.json", "w") as fp:
fp.write(df.schema.json())
Is this ok?

You are on the right path. You may save your schema as a json and then load it later. Be sure to convert it to json and then a StructType before use
import json
from pyspark.sql.types import StructType
with open("dataflow_schema.json", "r") as fp:
json_schema_str = fp.read()
my_schema = StructType.fromJson(json.loads(json_schema_str))
In your structured streaming query if you have a json column you may use the from_json method to convert your json to a struct type and eventually several columns eg:
from pyspark.sql.functions import from_json,col
# Assume that we have a kafkaStream
kafkaStream.selectExpr("CAST(value as string)")\
.select(from_json(col("value"),my_schema).alias("json_value"))\
.selectExpr("json_value.*") # extract as columns

Related

Spark Dataframe to StringType

In PySpark, how do I convert a Dataframe to normal String?
Background:
I'm using PySpark with Kafka and instead of hard coding broker name, I have parameterized Kafka broker name in PySpark.
Json file is holding the Broker details and Spark read this Json input and assign values to variable. These variables are of Dataframe type with String.
I'm facing issue when I pass dataframe to Pyspark-Kakfa connection details to substitute the values.
Error :
Can only concatenate String (Not a Dataframe) to String.
Json parameter file :
{
"broker": "https://at.com:8082",
"topicname": "dev_hello"
}
PySpark Code :
parameter = spark.read.option("multiline", "true").json("/at/dev_parameter.json")
kserver = parameter.select("broker")
ktopic = parameter.select("topicname")
df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
.write
.format("kafka")
.outputMode("append")
.option("kafka.bootstrap.servers", "f"+ **kserver**)
.option("topic", "josn_data_topic",**ktopic** )
.save()
Please advise on it.
my second query is how do I pass these Python based variables to another Scala based Spark notebook.
Use json.load instead of Spark json reader:
import json
with open("/at/dev_parameter.json") as f:
parameter = json.load(f)
kserver = parameter["broker"]
ktopic = parameter["topicname"]
df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value") \
.write \
.format("kafka") \
.outputMode("append") \
.option("kafka.bootstrap.servers", kserver) \
.option("topic", ktopic) \
.save()
If you prefer using Spark json reader, you can do:
parameter = spark.read.option("multiline", "true").json("/at/dev_parameter.json")
kserver = parameter.select("broker").head()[0]
ktopic = parameter.select("topicname").head()[0]

How to read csv with schema in pyspark

I know how to read a csv with pyspark, but I'm having a lot of problems to load it with the correct format. My csv has 3 columns, where the first and the second are strings, but the third is a list of dicts. I'm not being able to load this last column.
I tried with
schema = StructType([
StructField("_id", StringType()),
StructField("text", StringType()),
StructField("links", ArrayType(elementType=MapType(StringType(), StringType())))
])
but it's raising an error. With Inferschema neither it's working.
You need to have inferSchema="true". If it causes problems, read everything as string, and then you can use ast.literal_eval() from ast package to convert the str to dict.
You use this function:
def read_csv_spark(spark, file_path):
"""
:param spark: SparkSession or SQLContext
:param file_path: Path to the file
:return: Spark Dataframe
"""
df = (
spark.read.format("com.databricks.spark.csv")
.options(header="true", inferSchema="true")
.load(file_path)
)
return df

Reading a csv file as a spark dataframe

I have got a CSV file along with a header which has to be read through Spark(2.0.0 and Scala 2.11.8) as a dataframe.
Sample csv data:
Item,No. of items,Place
abc,5,xxx
def,6,yyy
ghi,7,zzz
.........
I'm facing problem when I try to read this csv data in spark as a dataframe, because the header contains column(No. of items) having special character "."
Code with which I try to read csv data is:
val spark = SparkSession.builder().appName("SparkExample")
import spark.implicits._
val df = spark.read.option("header", "true").csv("file:///INPUT_FILENAME")
Error I'm facing:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to resolve No. of items given [Item,No. of items,Place];
If I remove the "." from the header, I wont get any error. Even tried with escaping the character,but it escapes all the "." characters even from the data.
Is there any way to escape the special character "." only from the CSV header using spark code?
#Pooja Nayak, Not sure if this was solved; answering this in the interest of community.
sc: SparkContext
spark: SparkSession
sqlContext: SQLContext
// Read the raw file from localFS as-is.
val rdd_raw = sc.textFile("file:///home/xxxx/sample.csv")
// Drop the first line in first partition because it is the header.
val rdd = rdd_raw.mapPartitionsWithIndex{(idx,iter) =>
if(idx == 0) iter.drop(1) else iter
}
// A function to create schema dynamically.
def schemaCreator(header: String): StructType = {
StructType(header
.split(",")
.map(field => StructField(field.trim, StringType, true))
)
}
// Create the schema for the csv that was read and store it.
val csvSchema: StructType = schemaCreator(rdd_raw.first)
// As the input is CSV, split it at "," and trim away the whitespaces.
val rdd_curated = rdd.map(x => x.split(",").map(y => y.trim)).map(xy => Row(xy:_*))
// Create the DF from the RDD.
val df = sqlContext.createDataFrame(rdd_curated, csvSchema)
imports that are necessary
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark._
I am giving you example which is working with pyspark, hopefully same will work for you, just by adding some language related syntax.
file =r'C:\Users\e5543130\Desktop\sampleCSV2.csv'
conf = SparkConf().setAppName('FICBOutputGenerator')
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
df = sqlContext.read.options(delimiter=",", header="true").csv("cars.csv") #Without deprecated API
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("cars.csv")

How to deal with multiple csv.gz files in Spark?

I have a huge dataset with multiple tables. Each table is split into hundreds of csv.gz files and I need to import them to Spark through PySpark. Any idea on how to import the "csv.gz" files to Spark? Does SparkContext or SparkSession from SparkSQL provide a function to import this type of files?
You can import gzipped csv files natively using spark.read.csv():
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("stackOverflow") \
.getOrCreate()
fpath1 = "file1.csv.gz"
DF = spark.read.csv(fpath1, header=True)
where DF is a spark DataFrame.
You can read from multiple files by feeding in a list of files:
fpath1 = "file1.csv.gz"
fpath2 = "file2.csv.gz"
DF = spark.read.csv([fpath1, fpath2] header=True)
You can also create a "temporary view" allowing for SQL queries:
fpath1 = "file1.csv.gz"
fpath2 = "file2.csv.gz"
DF = spark.read.csv([fpath1, fpath2] header=True)
DF.createOrReplaceTempView("table_name")
DFres = spark.sql("SELECT * FROM table_name)
where DFres is a spark DataFrame generated from the query.

What are SparkSession Config Options

I am trying to use SparkSession to convert JSON data of a file to RDD with Spark Notebook. I already have the JSON file.
val spark = SparkSession
.builder()
.appName("jsonReaderApp")
.config("config.key.here", configValueHere)
.enableHiveSupport()
.getOrCreate()
val jread = spark.read.json("search-results1.json")
I am very new to spark and do not know what to use for config.key.here and configValueHere.
SparkSession
To get all the "various Spark parameters as key-value pairs" for a SparkSession, “The entry point to programming Spark with the Dataset and DataFrame API," run the following (this is using Spark Python API, Scala would be very similar).
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()
or without importing SparkConf:
spark.sparkContext.getConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
You can get a deeper level list of SparkSession configuration options by running the code below. Most are the same, but there are a few extra ones. I am not sure if you can change these.
spark.sparkContext._conf.getAll()
SparkContext
To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "Main entry point for Spark functionality," ... "connection to a Spark cluster," ... and "to create RDDs, accumulators and broadcast variables on that cluster,” run the following.
import pyspark
from pyspark import SparkConf, SparkContext
spark_conf = SparkConf().setAppName("test")
spark = SparkContext(conf = spark_conf)
SparkConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
Spark parameters
You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:
[(u'spark.eventLog.enabled', u'true'),
(u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
...
...
(u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
https://spark.apache.org/docs/latest//api/python/reference/api/pyspark.SparkConf.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
Setting Spark parameters
Each tuple is ("spark.some.config.option", "some-value") which you can set in your application with:
SparkSession
spark = (
SparkSession
.builder
.appName("Your App Name")
.config("spark.some.config.option1", "some-value")
.config("spark.some.config.option2", "some-value")
.getOrCreate())
sc = spark.sparkContext
SparkContext
spark_conf = (
SparkConf()
.setAppName("Your App Name")
.set("spark.some.config.option1", "some-value")
.set("spark.some.config.option2", "some-value"))
sc = SparkContext(conf = spark_conf)
spark-defaults
You can also set the Spark parameters in a spark-defaults.conf file:
spark.some.config.option1 some-value
spark.some.config.option2 "some-value"
then run your Spark application with spark-submit (pyspark):
spark-submit \
--properties-file path/to/your/spark-defaults.conf \
--name "Your App Name" \
--py-files path/to/your/supporting/pyspark_files.zip \
--class Main path/to/your/pyspark_main.py
This is how it worked for me to add spark or hive settings in my scala:
{
val spark = SparkSession
.builder()
.appName("StructStreaming")
.master("yarn")
.config("hive.merge.mapfiles", "false")
.config("hive.merge.tezfiles", "false")
.config("parquet.enable.summary-metadata", "false")
.config("spark.sql.parquet.mergeSchema","false")
.config("hive.merge.smallfiles.avgsize", "160000000")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.sql.orc.impl", "native")
.config("spark.sql.parquet.binaryAsString","true")
.config("spark.sql.parquet.writeLegacyFormat","true")
//.config(“spark.sql.streaming.checkpointLocation”, “hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db”)
.getOrCreate()
}
The easiest way to set some config:
spark.conf.set("spark.sql.shuffle.partitions", 500).
Where spark refers to a SparkSession, that way you can set configs at runtime. It's really useful when you want to change configs again and again to tune some spark parameters for specific queries.
In simple terms, values set in "config" method are automatically propagated to both SparkConf and SparkSession's own configuration.
for eg :
you can refer to
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-settings.html to understand how hive warehouse locations are set for SparkSession using config option
To know about the this api you can refer to : https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
Every Spark config option is expolained at: http://spark.apache.org/docs/latest/configuration.html
You can set these at run-time as in your example above or through the config file given to spark-submit