Spark Streaming + Kafka: appending JSON to Parquet

I am streaming meter-reading records as JSON from Kafka 2.11-1 into Spark 2.1. I don't understand how to convert the streamed object into a DataFrame before saving it to a Parquet file. I want the Scala script to infer the schema from the JSON, so that a new Parquet file format is generated automatically when the JSON format of the streaming source data changes (I'll figure out later how to detect this and start a new file whenever a format change occurs). For now, I am unable to write the Parquet file at all.
import org.apache.spark
import org.apache.spark.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode, SparkSession}
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
val sqlContext = new SQLContext(sc)
ssc.checkpoint("_checkpoint")
// Connect to Kafka
import org.apache.spark.streaming.kafka.KafkaUtils
import _root_.kafka.serializer.StringDecoder
val kafkaParams = Map("metadata.broker.list" -> "xx.xx.xx.xx:9092")
val kafkaTopics = Set("test")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, kafkaTopics)
messages.print()
messages.foreachRDD(rdd => {
  val part1 = rdd.map(_._1)
  val part2 = rdd.map(_._2) // this has the json
  print("%%%% part1 is : " + part1)
  print("%%%% part2 is : " + part2)
  // here: infer the schema from json and append the streamed data to a parquet file on hdfs
})
ssc.start()
ssc.awaitTermination()
The json looks like this:
-------------------------------------------
Time: 1513155855000 ms
-------------------------------------------
(null,{"customer_id":"customer_51","customer_acct_id":"cusaccid_1197","serv_acct_id":"service_1957","installed_service_id":"instserv_946","meter_id":"meter_319","channel_number":"156","interval_read_date":"2013-06-16 11:26:04","interval_received":"5","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_25","customer_acct_id":"cusaccid_1303","serv_acct_id":"service_844","installed_service_id":"instserv_1636","meter_id":"meter_663","channel_number":"1564","interval_read_date":"2014-02-13 12:52:34","interval_received":"8","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1955","customer_acct_id":"cusaccid_1793","serv_acct_id":"service_577","installed_service_id":"instserv_1971","meter_id":"meter_1459","channel_number":"1312","interval_read_date":"2017-05-23 07:32:13","interval_received":"11","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1833","customer_acct_id":"cusaccid_1381","serv_acct_id":"service_461","installed_service_id":"instserv_477","meter_id":"meter_1373","channel_number":"1769","interval_read_date":"2011-12-13 10:12:20","interval_received":"15","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1597","customer_acct_id":"cusaccid_1753","serv_acct_id":"service_379","installed_service_id":"instserv_1061","meter_id":"meter_1759","channel_number":"632","interval_read_date":"2013-07-22 05:49:55","interval_received":"7","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
2017-12-13 09:04:15,626 INFO org.apache.spark.streaming.scheduler.JobGenerator (Logging.scala:logInfo(54)) - Checkpointing graph for time 1513155855000 ms
I'm testing this using spark-shell:
spark-shell --jars /opt/alti-spark-2.1.1/external/kafka-0-8/target/spark-streaming-kafka-0-8_2.11-2.1.1.jar --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0
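One common way to fill in the commented step inside foreachRDD is to let Spark infer the schema from each micro-batch and append the result to a Parquet directory. The following is a minimal sketch in PySpark rather than Scala, not a tested answer for this cluster. The function name `append_batch_to_parquet` and the output path are hypothetical; it assumes `json_rdd` holds only the JSON strings (the second element of each Kafka tuple) and `spark` is an active SparkSession.

```python
def append_batch_to_parquet(spark, json_rdd, output_path):
    """Infer a schema from one micro-batch of JSON strings and append it as Parquet.

    Returns True if the batch was written, False if it was empty.
    """
    # Skip empty batches: there is nothing to infer a schema from.
    if json_rdd.isEmpty():
        return False
    # DataFrameReader.json accepts an RDD of JSON strings and infers the schema.
    df = spark.read.json(json_rdd)
    # "append" adds new part files to the existing output directory.
    df.write.mode("append").parquet(output_path)
    return True
```

In a streaming job this would be called from foreachRDD on the stream of message values. Because the schema is inferred per batch, batches with different JSON layouts can land in the same directory with different schemas; reading them back together may then need the Parquet `mergeSchema` read option.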

Related

Should I create a .parquet file in the directory path in order to write a DataFrame in Parquet format?

I'm trying to write Parquet data, but when I run this code I do not see any Parquet file created in the specified write path. Should I create an empty folder and point the write at that folder, or create a .parquet file and overwrite it with the data?
import org.apache.spark.sql.SparkSession
object csv_parquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Csv_To_Parquet").master("local[*]").getOrCreate()
    val df = spark.read.format("csv")
      .option("delimiter", ",")
      .option("header", "true")
      .option("path", "C:\\Users\\kiran\\Sample-Spreadsheet-10-rows.csv").load()
    df.show()
    df.write
      .parquet("C:\\parquet_data")
    spark.read.parquet("C:\\parquet_data").show()
    spark.close()
  }
}

Databricks Write Json file is too slow

I have a simple Scala snippet that reads and writes JSON files totalling 10 GB (from a directory mounted from a storage account). It took 1.7 hours, with almost all of the time spent in the line that writes the JSON files.
Cluster setup:
Azure Databricks DBR 7.3 LTS, spark 3.0.1, scala 2.12
11 workers + one driver of type Standard_E4as_v4 (each has 32.0 GB Memory, 4 Cores)
Why is writing so slow?
Isn't writing parallelized across partitions/workers the way reading is?
How can I speed up the writing, or the whole process?
Code for mounting the directory:
val containerName = "containerName"
val storageAccountName = "storageAccountName"
val sas = "sastoken"
val config = "fs.azure.sas." + containerName+ "." + storageAccountName + ".blob.core.windows.net"
dbutils.fs.mount(
  source = "wasbs://containerName@storageAccountName.blob.core.windows.net/",
  mountPoint = "/mnt/myfile",
  extraConfigs = Map(config -> sas))
Code for read/write json files:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._
val jsonDF = spark.read.json("/mnt/myfile")
jsonDF.write.json("/mnt/myfile/spark_output_test")
Solved by using what I would call better Spark APIs, like
read --> spark.read.format("json").load("path")
write --> res.write.format("parquet").save("path")
and by writing in Parquet format, as it is compressed and well optimized for Spark.

Convert spark dataframe into json files which contains array of json

I am writing a Spark application in Scala which reads a Hive table and saves the output to HDFS as a JSON file.
I read the Hive table using HiveContext, which returns a DataFrame. Below is the code snippet.
val sparkConf = new SparkConf().setAppName("SparkReadHive")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val df = sqlContext.sql(
"""
|SELECT *
|FROM database.table
|""".stripMargin)
df.write.format("json").save(path)
I need the output file to look like this:
[{"name":"tom", "age": 8},
{"name":"Jerry", "age": 7}]
However, what I get looks like this:
{"name":"tom", "age": 8}
{"name":"Jerry", "age": 7}
Can someone please help me with it? Thank you!
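We can use .toJSON, collect() and .mkString to build an array of JSON objects, and then use the Hadoop FileSystem API to create a file in HDFS with the desired format. Note that collect() brings every row to the driver, so this only suits results small enough to fit in driver memory.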
Example:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
import java.io._
//sample dataframe
val df=sc.parallelize(Seq(("tom",8),("Jerry",7))).toDF("name","age")
//making array of json object
val data=df.toJSON.collect().mkString("[",",\n","]")
//filesystem object
val path = new Path("hdfs://<namenode>:8020/<path>/myfile.txt")
val conf = new Configuration(sc.hadoopConfiguration)
val fs = path.getFileSystem(conf)
if (fs.exists(path))
  fs.delete(path, true)
val out = new BufferedOutputStream(fs.create(path))
out.write(data.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()
Check contents of file in HDFS:
hadoop fs -cat myfile.txt
[{"name":"tom","age":8},
{"name":"Jerry","age":7}]
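
The difference between the two layouts is that Spark's JSON writer emits one object per line (JSON Lines), while the desired file is a single JSON array. As a small illustration outside Spark, this Python sketch joins line-delimited JSON strings with the same "[", ",\n", "]" separators as the mkString call above. `to_json_array` is a hypothetical helper name; in PySpark the input lines could come from `df.toJSON().collect()`, assuming the data fits on the driver.

```python
import json

def to_json_array(json_lines):
    """Join JSON object strings into one JSON array string, one object per line."""
    # Re-serialize each line compactly so the output is valid JSON
    # regardless of the spacing used in the input lines.
    objects = [
        json.dumps(json.loads(line), separators=(",", ":"))
        for line in json_lines
        if line.strip()
    ]
    return "[" + ",\n".join(objects) + "]"

lines = ['{"name":"tom", "age": 8}', '{"name":"Jerry", "age": 7}']
array_text = to_json_array(lines)
print(array_text)
```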

DStream JSON object to SQLite

Stack Overflow community,
I have the following question:
I am using Spark Streaming and KafkaUtils to read from a Kafka topic, then I transform the DStream to JSON. What I want is to save this JSON object to an SQLite database in a column-row format.
Sample of the code I run in spark-streaming:
import sys
import json
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == '__main__':
    conf = SparkConf().setAppName("PythonStreamingDirectKafka").setMaster("spark://spark-master:7077")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")
    ssc = StreamingContext(sc, 20)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {'metadata.broker.list': brokers})
    # Keep only the message value of each (key, value) pair
    message = kvs.map(lambda x: x[1])
    message.pprint()
    # Functions
    # Parse each message as JSON
    json_object = message.map(lambda s: json.loads(s))
    temperatures_object = json_object.map(lambda s: s["temperature_value"])
    # Aggregations
    json_object.pprint()
    temperatures_object.pprint()
    ssc.start()
    ssc.awaitTermination()
The output of the DStream:
[image: DStream output]
SQLite schema:
[image: Database Schema]
Do you have any idea how to achieve this? I find it complicated to transfer JSON data from Spark Streaming to SQLite using PySpark.
I appreciate any help in advance!
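
One possible approach is to parse each message with json.loads and insert its fields as a row through Python's built-in sqlite3 module, calling the function on the driver from foreachRDD. The sketch below is only an illustration under assumptions: the table name `readings` and the `sensor_id` column are hypothetical, since the actual schema is only shown as an image in the post; `temperature_value` is the one field taken from the code above.

```python
import json
import sqlite3

def save_messages(conn, messages):
    """Parse JSON strings and insert one row per message into a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, temperature_value REAL)"
    )
    rows = []
    for raw in messages:
        record = json.loads(raw)
        # sensor_id is an assumed field; a missing value is stored as NULL.
        rows.append((record.get("sensor_id"), record["temperature_value"]))
    # Parameterized insert: one tuple per row.
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
sample = [
    '{"sensor_id": "s1", "temperature_value": 21.5}',
    '{"sensor_id": "s2", "temperature_value": 19.0}',
]
inserted = save_messages(conn, sample)
stored = conn.execute(
    "SELECT sensor_id, temperature_value FROM readings ORDER BY sensor_id"
).fetchall()
print(inserted, stored)
```

In the streaming job, the list passed as `messages` could be the result of `rdd.collect()` inside a foreachRDD callback, which assumes each batch is small enough to collect to the driver.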

Reading a csv file as a spark dataframe

I have a CSV file with a header which has to be read through Spark (2.0.0, with Scala 2.11.8) as a DataFrame.
Sample csv data:
Item,No. of items,Place
abc,5,xxx
def,6,yyy
ghi,7,zzz
.........
I'm facing a problem when I try to read this CSV data in Spark as a DataFrame, because the header contains a column (No. of items) with the special character ".".
The code with which I try to read the CSV data is:
val spark = SparkSession.builder().appName("SparkExample").getOrCreate()
import spark.implicits._
val df = spark.read.option("header", "true").csv("file:///INPUT_FILENAME")
Error I'm facing:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to resolve No. of items given [Item,No. of items,Place];
If I remove the "." from the header, I won't get any error. I even tried escaping the character, but that escapes all the "." characters, even in the data.
Is there any way to escape the special character "." only in the CSV header using Spark code?
@Pooja Nayak, not sure if this was solved; answering this in the interest of the community.
sc: SparkContext
spark: SparkSession
sqlContext: SQLContext
// Read the raw file from localFS as-is.
val rdd_raw = sc.textFile("file:///home/xxxx/sample.csv")
// Drop the first line in first partition because it is the header.
val rdd = rdd_raw.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
// A function to create schema dynamically.
def schemaCreator(header: String): StructType = {
  StructType(header
    .split(",")
    .map(field => StructField(field.trim, StringType, true))
  )
}
// Create the schema for the csv that was read and store it.
val csvSchema: StructType = schemaCreator(rdd_raw.first)
// As the input is CSV, split it at "," and trim away the whitespaces.
val rdd_curated = rdd.map(x => x.split(",").map(y => y.trim)).map(xy => Row(xy:_*))
// Create the DF from the RDD.
val df = sqlContext.createDataFrame(rdd_curated, csvSchema)
Necessary imports:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark._
Here is an example that works with PySpark; hopefully the same will work for you with some language-specific syntax changes.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
file =r'C:\Users\e5543130\Desktop\sampleCSV2.csv'
conf = SparkConf().setAppName('FICBOutputGenerator')
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
df = sqlContext.read.options(delimiter=",", header="true").csv("cars.csv") #Without deprecated API
# Alternative using the older external spark-csv package
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("cars.csv")
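
If the dot in the column name keeps causing trouble downstream, the header fields could also be cleaned before they are used as column names, for example when building the schema from the header line as in the first answer. The helper below is a hypothetical sketch of that renaming step only: `sanitize_column_names` is not a Spark API, and replacing every run of non-alphanumeric characters with an underscore is an assumed naming policy.

```python
import re

def sanitize_column_names(header_line, delimiter=","):
    """Return header fields with dots, spaces and other symbols replaced by underscores."""
    names = []
    for field in header_line.split(delimiter):
        # Collapse each run of non-alphanumeric characters into one underscore,
        # then drop any leading or trailing underscore.
        cleaned = re.sub(r"[^0-9A-Za-z]+", "_", field.strip()).strip("_")
        names.append(cleaned)
    return names

print(sanitize_column_names("Item,No. of items,Place"))
```

The data rows are left untouched, so only the header is affected.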