Convert a Spark DataFrame into a JSON file that contains an array of JSON objects

I am writing a Spark application in Scala which reads a Hive table and saves the output in HDFS as a JSON file.
I read the Hive table using HiveContext, which returns a DataFrame. Below is the code snippet.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sparkConf = new SparkConf().setAppName("SparkReadHive")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val df = sqlContext.sql(
  """
    |SELECT *
    |FROM database.table
    |""".stripMargin)

df.write.format("json").save(path)
I need the output file to look like below:
[{"name":"tom", "age": 8},
{"name":"Jerry", "age": 7}]
However, what I get is like below:
{"name":"tom", "age": 8}
{"name":"Jerry", "age": 7}
Can someone please help me with it? Thank you!

We can use .toJSON, collect() and .mkString to build an array of JSON objects, and then use the Hadoop FileSystem API to create a file in HDFS with the desired format.
Example:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
import java.io._
// Sample dataframe
val df = sc.parallelize(Seq(("tom", 8), ("Jerry", 7))).toDF("name", "age")

// Build a single string holding the array of JSON objects
val data = df.toJSON.collect().mkString("[", ",\n", "]")

// Write the string to HDFS using the Hadoop FileSystem API
val path = new Path("hdfs://<namenode>:8020/<path>/myfile.txt")
val conf = new Configuration(sc.hadoopConfiguration)
val fs = path.getFileSystem(conf)
if (fs.exists(path))
  fs.delete(path, true)

val out = new BufferedOutputStream(fs.create(path))
out.write(data.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()
Check the contents of the file in HDFS:
hadoop fs -cat myfile.txt
[{"name":"tom","age":8},
{"name":"Jerry","age":7}]

Related

Should I create a .parquet file in the directory path in order to write a DataFrame in Parquet format?

I'm trying to write Parquet data, but when I run this code I do not see any Parquet files created in the specified write path. Should I create an empty folder and point the write to that folder, or should I create a .parquet file and overwrite the data in the already created Parquet file?
import org.apache.spark.sql.SparkSession

object csv_parquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Csv_To_Parquet").master("local[*]").getOrCreate()
    val df = spark.read.format("csv")
      .option("delimiter", ",")
      .option("header", "true")
      .option("path", "C:\\Users\\kiran\\Sample-Spreadsheet-10-rows.csv")
      .load()
    df.show()
    df.write
      .parquet("C:\\parquet_data")
    spark.read.parquet("C:\\parquet_data").show()
    spark.close()
  }
}
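For what it's worth (an assumption on my part, since no answer is included here): you do not need to pre-create a folder or a .parquet file. df.write.parquet(path) creates the target directory itself, containing part-*.parquet files plus a _SUCCESS marker, and with the default save mode it fails if the path already exists. A minimal sketch reusing the same df and path:

// Spark creates C:\parquet_data itself; "overwrite" replaces the output of a previous run.
df.write
  .mode("overwrite")
  .parquet("C:\\parquet_data")

// Read it back to confirm the part files were written.
spark.read.parquet("C:\\parquet_data").show()

If an exception is thrown instead of files appearing, keep in mind that running Spark locally on Windows commonly also requires winutils.exe with HADOOP_HOME set; that is a separate setup issue.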

DStream JSON object to SQLite

Stack Overflow community,
I have the following question:
I am using Spark Streaming and KafkaUtils to read from a Kafka topic, and then I transform the DStream to JSON. What I want is to save this JSON object to an SQLite database in a column/row format.
A sample of the code I run in Spark Streaming:
import sys
import json

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == '__main__':
    conf = SparkConf().setAppName("PythonStreamingDirectKafka").setMaster("spark://spark-master:7077")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")

    ssc = StreamingContext(sc, 20)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {'metadata.broker.list': brokers})
    message = kvs.map(lambda x: x[1])
    message.pprint()

    # Functions
    json_object = message.map(lambda s: eval(s))
    temperatures_object = json_object.map(lambda s: s["temperature_value"])

    # Aggregations
    json_object.pprint()
    temperatures_object.pprint()

    ssc.start()
    ssc.awaitTermination()
The output of the DStream and the target SQLite schema are shown as screenshots in the original post (DStream output, database schema).
Do you have any idea how to achieve this? It is not clear to me how to transfer JSON data from Spark Streaming to SQLite using PySpark.
I appreciate any help in advance!

How to create DataFrame based on multiple JSON files

I have many JSON files inside a folder. All of them have the same structure. Now I want to create a DataFrame, and each JSON file should become a row of this DataFrame.
I know how to create a DataFrame from a single JSON string, but I don't know how to deal with multiple ones:
import spark.implicits._
val jsonStr = """{ "key": 111, "value": 54, "stamp": "aaa"}"""
val df = spark.read.json(Seq(jsonStr).toDS)
Assuming you have your JSON files in the folder src/main/resources, the following code will produce the desired result:
private val df: DataFrame = spark.read.json("src/main/resources")
df.show()
+---+-----+-----+
|key|stamp|value|
+---+-----+-----+
|111| aaa| 54|
|111| aaa| 54|
+---+-----+-----+
Note that the JSON files should be machine-readable rather than human-readable: by default Spark expects JSON Lines, i.e. each JSON record on a single line with no embedded newline characters.
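If the files are pretty-printed instead (one record spanning several lines), a possible workaround, as a minimal sketch assuming Spark 2.2+ where the multiLine option is available, is:

// Each file may then contain one pretty-printed object or an array of objects.
val df = spark.read
  .option("multiLine", "true")
  .json("src/main/resources")
df.show()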

Spark Streaming + Kafka - appending to Parquet

I am streaming meter-reading records as JSON from Kafka 2.11-1 into Spark 2.1. I don't understand how to convert the streamed object into a DataFrame before saving it to a Parquet file. I want the Scala script to infer the schema from the JSON, so that a new Parquet file format is generated automatically whenever the JSON format of the streaming source data changes (I'll figure out later how to detect this and start a new file when a format change occurs). For now, I am unable to write the Parquet file.
import org.apache.spark
import org.apache.spark.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode, SparkSession}

val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
val sqlContext = new SQLContext(sc)
ssc.checkpoint("_checkpoint")

// Connect to Kafka
import org.apache.spark.streaming.kafka.KafkaUtils
import _root_.kafka.serializer.StringDecoder

val kafkaParams = Map("metadata.broker.list" -> "xx.xx.xx.xx:9092")
val kafkaTopics = Set("test")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, kafkaTopics)
messages.print()

messages.foreachRDD(rdd => {
  val part1 = rdd.map(_._1)
  val part2 = rdd.map(_._2) // this has the json
  print("%%%% part1 is : " + part1)
  print("%%%% part2 is : " + part2)
  // here: infer the schema from json and append the streamed data to a parquet file on hdfs
})

ssc.start()
ssc.awaitTermination()
The JSON looks like this:
-------------------------------------------
Time: 1513155855000 ms
-------------------------------------------
(null,{"customer_id":"customer_51","customer_acct_id":"cusaccid_1197","serv_acct_id":"service_1957","installed_service_id":"instserv_946","meter_id":"meter_319","channel_number":"156","interval_read_date":"2013-06-16 11:26:04","interval_received":"5","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N", "provisioned_meter_ind:"N"})
(null,{"customer_id":"customer_25","customer_acct_id":"cusaccid_1303","serv_acct_id":"service_844","installed_service_id":"instserv_1636","meter_id":"meter_663","channel_number":"1564","interval_read_date":"2014-02-13 12:52:34","interval_received":"8","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1955","customer_acct_id":"cusaccid_1793","serv_acct_id":"service_577","installed_service_id":"instserv_1971","meter_id":"meter_1459","channel_number":"1312","interval_read_date":"2017-05-23 07:32:13","interval_received":"11","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1833","customer_acct_id":"cusaccid_1381","serv_acct_id":"service_461","installed_service_id":"instserv_477","meter_id":"meter_1373","channel_number":"1769","interval_read_date":"2011-12-13 10:12:20","interval_received":"15","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1597","customer_acct_id":"cusaccid_1753","serv_acct_id":"service_379","installed_service_id":"instserv_1061","meter_id":"meter_1759","channel_number":"632","interval_read_date":"2013-07-22 05:49:55","interval_received":"7","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
2017-12-13 09:04:15,626 INFO org.apache.spark.streaming.scheduler.JobGenerator (Logging.scala:logInfo(54)) - Checkpointing graph for time 1513155855000 ms
I'm testing this using spark-shell:
spark-shell --jars /opt/alti-spark-2.1.1/external/kafka-0-8/target/spark-streaming-kafka-0-8_2.11-2.1.1.jar --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0
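One common way to handle the missing step (a hedged sketch, not from the original post) is to turn each batch's JSON strings into a DataFrame with the JSON reader, which infers the schema per batch, and append the result to a Parquet path:

import org.apache.spark.sql.SaveMode

messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // The value of each (key, value) pair holds the JSON payload.
    val jsonRdd = rdd.map(_._2)
    // Infer the schema from the JSON strings of this batch.
    val batchDf = sqlContext.read.json(jsonRdd)
    // Append the batch to a Parquet directory on HDFS (hypothetical path).
    batchDf.write.mode(SaveMode.Append).parquet("hdfs:///tmp/meter_readings_parquet")
  }
}

Keep in mind that appending batches with a different schema to the same path can cause conflicts at read time, so a schema change would typically mean writing to a new directory, as the question already anticipates.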

Reading a CSV file as a Spark DataFrame

I have a CSV file with a header which has to be read through Spark (2.0.0 with Scala 2.11.8) as a DataFrame.
Sample csv data:
Item,No. of items,Place
abc,5,xxx
def,6,yyy
ghi,7,zzz
.........
I'm facing a problem when I try to read this CSV data in Spark as a DataFrame, because the header contains a column (No. of items) with the special character ".".
The code with which I try to read the CSV data is:
val spark = SparkSession.builder().appName("SparkExample").getOrCreate()
import spark.implicits._
val df = spark.read.option("header", "true").csv("file:///INPUT_FILENAME")
Error I'm facing:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to resolve No. of items given [Item,No. of items,Place];
If I remove the "." from the header, I won't get any error. I even tried escaping the character, but then it escapes all the "." characters, even in the data.
Is there any way to escape the special character "." only from the CSV header using spark code?
@Pooja Nayak, not sure if this was solved; answering this in the interest of the community.
sc: SparkContext
spark: SparkSession
sqlContext: SQLContext
// Read the raw file from the local FS as-is.
val rdd_raw = sc.textFile("file:///home/xxxx/sample.csv")

// Drop the first line in the first partition because it is the header.
val rdd = rdd_raw.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}

// A function to create the schema dynamically from the header line.
def schemaCreator(header: String): StructType = {
  StructType(header
    .split(",")
    .map(field => StructField(field.trim, StringType, true))
  )
}

// Create the schema for the csv that was read and store it.
val csvSchema: StructType = schemaCreator(rdd_raw.first)

// As the input is CSV, split it at "," and trim away the whitespaces.
val rdd_curated = rdd.map(x => x.split(",").map(y => y.trim)).map(xy => Row(xy: _*))

// Create the DF from the RDD.
val df = sqlContext.createDataFrame(rdd_curated, csvSchema)
Imports that are necessary:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark._
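As a side note (my own addition, not part of the original answer): once the DataFrame exists, a column name containing a dot can be referenced by wrapping it in backticks, which may already be enough to avoid the AnalysisException without rewriting the header. A minimal sketch:

// Select the problematic column by quoting its name with backticks.
df.select("`No. of items`").show()

// The same quoting works in SQL expressions.
df.selectExpr("Item", "`No. of items` AS item_count").show()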
I am giving you an example that works with PySpark; hopefully the same approach will work for you, just by adapting the language-specific syntax.
file = r'C:\Users\e5543130\Desktop\sampleCSV2.csv'
conf = SparkConf().setAppName('FICBOutputGenerator')
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)

# Without the deprecated API
df = sqlContext.read.options(delimiter=",", header="true").csv("cars.csv")

# With the spark-csv package API
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("cars.csv")
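Since the question is in Scala, here is a rough Scala equivalent of the same read (a minimal sketch, assuming a cars.csv file and Spark 2.x, where the CSV source is built in):

// Built-in CSV source in Spark 2.x.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .csv("cars.csv")

df.printSchema()
df.show()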