Spark | Could not create FileClient | read json | scala

I am trying to read a JSON file on a local Windows machine using Spark and Scala. I have tried the following:
import org.apache.spark.sql.SparkSession

object JsonTry extends App {
  // On Windows, hadoop.home.dir must point at a folder containing bin\winutils.exe
  System.setProperty("hadoop.home.dir", "C:\\winutils")

  val sparkSession = SparkSession.builder()
    .master("local[*]")
    .config("some-config", "some-value")
    .appName("App Name")
    .getOrCreate()

  val res = sparkSession.read.json("./src/main/resources/test.json")
  res.printSchema()
}
The JSON file, which is under the resources folder, looks like this:
{"name":"Some name"}
But I am getting an exception when I run this main class:
Exception in thread "main" java.io.IOException: Could not create FileClient
To my surprise, the following piece of code works, but I am looking to read the JSON from a file directly.
val res = sparkSession.read.option("multiline", true).json(sparkSession.sparkContext.parallelize(Seq("{\"name\":\"name\"}")))
Please let me know what is causing this issue, as I have not found any solution.

I tried to read a JSON file in a similar way and did not face any problem.
You may try this too:
import org.apache.spark.sql.SparkSession

object myTest extends App {
  val spark: SparkSession = SparkSession.builder()
    .appName("MyTest")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val jsonDataDF = spark.read.option("multiline", "true").json("/Users/gp/Desktop/temp/test.json")
  jsonDataDF.show()
}
Do let me know whether I understood your question properly.
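If the error persists on the Windows setup, two things worth double-checking (hedged suggestions, since the root cause is not visible from the error message alone): hadoop.home.dir should point at a folder that actually contains bin\winutils.exe, and it can help to pass an absolute path or an explicit file:/// URI instead of the relative ./src/... path. A minimal sketch, with a hypothetical object name and absolute path that you would replace with the real location of test.json:
import org.apache.spark.sql.SparkSession

object JsonTryAbsolutePath extends App {
  // Assumption: winutils.exe is installed at C:\winutils\bin\winutils.exe
  System.setProperty("hadoop.home.dir", "C:\\winutils")

  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("App Name")
    .getOrCreate()

  // Hypothetical absolute path; adjust to where test.json really lives
  val res = spark.read.json("file:///C:/projects/myapp/src/main/resources/test.json")
  res.printSchema()
}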

Related

Is there a way to modify this code to let spark streaming read from json?

I'm working on a Spark Streaming app that continuously reads data from localhost port 9098. Is there a way to change the localhost socket into <users/folder/path> so that it reads data from a folder path or JSON files automatically?
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.Logger
import org.apache.log4j.Level
object StreamingApplication extends App {
Logger.getLogger("Org").setLevel(Level.ERROR)
//creating spark streaming context
val sc = new SparkContext("local[*]", "wordCount")
val ssc = new StreamingContext(sc, Seconds(5))
// lines is a Dstream
val lines = ssc.socketTextStream("localhost", 9098)
// words is a transformed Dstream
val words = lines.flatMap(x => x.split(" "))
// bunch of transformations
val pairs = words.map(x=> (x,1))
val wordsCount = pairs.reduceByKey((x,y) => x+y)
// print is an action
wordsCount.print()
// start the streaming context
ssc.start()
ssc.awaitTermination()
}
Basically, I need help changing this line:
val lines = ssc.socketTextStream("localhost", 9098)
to this:
val lines = ssc.socketTextStream("<folder path>")
FYI, I'm using IntelliJ IDEA to build this.
I'd recommend reading the Spark documentation, especially the Scaladoc.
There seems to be a fileStream method (and a simpler textFileStream) on StreamingContext.
https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/streaming/StreamingContext.html
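For the folder-based case, a minimal sketch using textFileStream, which monitors a directory and turns every file newly added to it (after the stream starts) into lines of a DStream; the directory path below is only a placeholder for your <users/folder/path>:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "wordCount")
val ssc = new StreamingContext(sc, Seconds(5))

// Reads lines from any new text/JSON files dropped into the directory
val lines = ssc.textFileStream("/users/folder/path")

val words = lines.flatMap(_.split(" "))
val wordsCount = words.map((_, 1)).reduceByKey(_ + _)
wordsCount.print()

ssc.start()
ssc.awaitTermination()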

Is it possible to create a dataframe column with json data which doesn't have a fixed schema?

I am trying to create a DataFrame column with JSON data which does not have a fixed schema. I am trying to write it in its original form as a map/object but I am getting various errors.
I don't want to convert it to a string, as I need to write this data in its original form to the file.
Later this file is used for JSON processing, and the original structure should not be compromised.
Currently, when I try writing the data to a file, it contains all the escape characters and the entire JSON is treated as a string instead of a complex type. E.g.
{"field1":"d1","field2":"app","value":"{\"data\":\"{\\\"app\\\":\\\"am\\\"}\"}"}
You could try to define a schema for the JSON file.
I don't know what output you expect.
As a clue, here is an example and two interesting links:
spark-read-json-with-schema
spark-schema-explained-with-examples
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}
object RareJson {
val spark = SparkSession
.builder()
.appName("RareJson")
.master("local[*]")
.config("spark.sql.shuffle.partitions","4") //Change to a more reasonable default number of partitions for our data
.config("spark.app.id","RareJson") // To silence Metrics warning
.getOrCreate()
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
val input = "/home/cloudera/files/tests/rare.json"
def main(args: Array[String]): Unit = {
Logger.getRootLogger.setLevel(Level.ERROR)
try {
val structureSchema = new StructType()
.add("field1",StringType)
.add("field2",StringType)
.add("value",StringType,true)
val rareJson = sqlContext
.read
.option("allowBackslashEscapingAnyCharacter", true)
.option("allowUnquotedFieldNames", true)
.option("multiLine", true)
.option("mode", "DROPMALFORMED")
.schema(structureSchema)
.json(input)
rareJson.show(truncate = false)
// To have the opportunity to view the web console of Spark: http://localhost:4041/
println("Type whatever to the console to exit......")
scala.io.StdIn.readLine()
} finally {
sc.stop()
println("SparkContext stopped")
spark.stop()
println("SparkSession stopped")
}
}
}
Output:
+------+------+---------------------------+
|field1|field2|value |
+------+------+---------------------------+
|d1 |app |{"data":"{\"app\":\"am\"}"}|
+------+------+---------------------------+
You can try to parse the value column too, if it maintains the same format across all the rows, as sketched below.
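A minimal sketch with from_json, assuming every row's value column keeps the {"data": "..."} shape shown in the output above (the inner data field stays a string and would need a second parsing pass if you also want it as a struct):
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Schema of the outer JSON object held in the "value" string column
val valueSchema = new StructType().add("data", StringType)

// Rows whose value does not match the schema come back as null
val parsed = rareJson.withColumn("valueParsed", from_json(col("value"), valueSchema))
parsed.select(col("field1"), col("field2"), col("valueParsed.data")).show(truncate = false)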

Converting DataSet to Json Array Spark using Scala

I am new to Spark and unable to figure out a solution to the following problem.
I have a JSON file to parse, then I create a couple of metrics and write the data back in JSON format.
The following is the code I am using:
import org.apache.spark.sql._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions._
object quick2 {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder
.appName("quick1")
.master("local[*]")
.getOrCreate()
val rawData = spark.read.json("/home/umesh/Documents/Demo2/src/main/resources/sampleQuick.json")
val mat1 = rawData.select(rawData("mal_name"),rawData("cust_id")).distinct().orderBy("cust_id").toJSON.cache()
val mat2 = rawData.select(rawData("file_md5"),rawData("mal_name")).distinct().orderBy(asc("file_md5")).toJSON.cache()
val write1 = mat1.coalesce(1).toJavaRDD.saveAsTextFile("/home/umesh/Documents/Demo2/src/test/mat1/")
val write = mat2.coalesce(1).toJavaRDD.saveAsTextFile("/home/umesh/Documents/Demo2/src/test/mat2/")
}
}
Now, the above code writes proper JSON output.
However, the matrices can contain duplicate results as well,
for example:
md5 mal_name
1 a
1 b
2 c
3 d
3 e
So with the above code every object gets written on a single line,
like this:
{"file_md5":"1","mal_name":"a"}
{"file_md5":"1","mal_name":"b"}
{"file_md5":"2","mal_name":"c"}
{"file_md5":"3","mal_name":"d"}
and so on, but I want to combine the data for common keys, so the output should be:
{"file_md5":"1","mal_name":["a","b"]}
Can somebody please suggest what I should do here, or whether there is a better way to approach this problem?
Thanks!
You can use collect_list or collect_set on the mal_name column, as per your need; collect_set drops duplicate values within each group, while collect_list keeps them.
You can save the DataFrame/Dataset directly as a JSON file:
import org.apache.spark.sql.functions.collect_set
import spark.implicits._
rawData.groupBy($"file_md5")
.agg(collect_set($"mal_name").alias("mal_name"))
.write
.format("json")
.save("json/file/location/to/save")
As suggested by @mrsrinivas, I changed my code as below:
val mat2 = rawData.select(rawData("file_md5"),rawData("mal_name")).distinct().orderBy(asc("file_md5")).cache()
val labeledDf = mat2.toDF("file_md5","mal_name")
labeledDf.groupBy($"file_md5").agg(collect_list($"mal_name")).coalesce(1).write.format("json").save("/home/umesh/Documents/Demo2/src/test/run8/")
Keeping this question open for more suggestions, if any.

Issue with json conversions for scalax graph library for scala

We are using graph-json 1.11.0 and graph-core 1.11.5 within a Play 2.5.x application.
http://www.scala-graph.org/
The user guide examples for toJson and fromJson for a graph do not work with the current stable 1.11.0 release.
Ref : http://www.scala-graph.org/guides/json.html
We make use of a simple string graph, Graph[String, DiEdge], in our application.
We managed to write the graph-to-JSON conversion part, but we are unable to identify the exact fromJson syntax for the new stable version.
Below is sample code used in our application. Can someone help us with how to convert JSON to a graph?
import play.api.libs.json.{JsValue, Json}
import scalax.collection.Graph
import scalax.collection.GraphEdge.DiEdge
import scalax.collection.io.json.JsonGraph
import scalax.collection.io.json.descriptor.{Descriptor, StringNodeDescriptor}
import scalax.collection.io.json.descriptor.predefined.Di
object HierarchyGraph {
val descriptor = new Descriptor(StringNodeDescriptor,Di.descriptor[String]())
def toJson(graph : Graph[String, DiEdge]) : JsValue = {
val jsText = JsonGraph(graph).toJson(descriptor)
try {
Json.parse(jsText)
} catch {
case e : Exception => Json.toJson(Json.obj())
}
}
def fromJson(graphAsJsValue : JsValue) : Graph[String,DiEdge] = {
// json to graph conversion code here
}
}
Issue addressed by peter-empen on GitHub:
https://github.com/scala-graph/scala-graph/issues/71
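For completeness, a rough sketch of what the JSON-to-graph direction looks like in the user-guide style; treat this as an assumption rather than the confirmed fix, since the question notes the guide's examples needed adjusting for 1.11.0 and the linked issue has the exact working form:
import play.api.libs.json.JsValue
import scalax.collection.Graph
import scalax.collection.GraphEdge.DiEdge
import scalax.collection.io.json._  // user-guide import that enables Graph.fromJson

def fromJson(graphAsJsValue: JsValue): Graph[String, DiEdge] =
  // Reuses the same descriptor that toJson used; input is the raw JSON text
  Graph.fromJson[String, DiEdge](graphAsJsValue.toString, descriptor)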

Loading json data into hive using spark sql

I am unable to push JSON data into Hive. Below is the sample JSON data and my work. Please suggest what I am missing.
JSON data:
{
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani#gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani#gmail.com"
},
{
"userId":"thanks",
"jobTitleName":"Program Directory",
"firstName":"Tom",
"lastName":"Hanks",
"preferredFullName":"Tom Hanks",
"employeeCode":"E3",
"region":"CA",
"phoneNumber":"408-2222222",
"emailAddress":"tomhanks#gmail.com"
}
]
}
I tried to use sqlContext and the jsonFile method to load it, which fails to parse the JSON:
val f = sqlc.jsonFile("file:///home/vm/Downloads/emp.json")
f.show
error is : java.lang.RuntimeException: Failed to parse a value for data type StructType() (current token: VALUE_STRING)
I tried a different way and was able to crack it and get the schema:
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
sqlc.jsonRDD(jsonData).registerTempTable("employee")
val emp= sqlc.sql("select Employees[1].userId as ID,Employees[1].jobTitleName as Title,Employees[1].firstName as FirstName,Employees[1].lastName as LastName,Employees[1].preferredFullName as PeferedName,Employees[1].employeeCode as empCode,Employees[1].region as Region,Employees[1].phoneNumber as Phone,Employees[1].emailAddress as email from employee")
emp.show // displays all the values
I am able to get the data and schema separately for each record, but I am missing an idea for how to get all the data together and load it into Hive.
Any help or suggestion is much appreciated.
Here is the cracked answer:
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.{col, explode}
val hc = new HiveContext(sc)
hc.jsonRDD(jsonData).registerTempTable("employee")
val fuldf = hc.jsonRDD(jsonData)
// explode turns the Employees array into one row per employee struct
val dfemp = fuldf.select(explode(col("Employees")))
dfemp.saveAsTable("empdummy")
val df = hc.sql("select * from empdummy")
df.select("_c0.userId","_c0.jobTitleName","_c0.firstName","_c0.lastName","_c0.preferredFullName","_c0.employeeCode","_c0.region","_c0.phoneNumber","_c0.emailAddress").saveAsTable("dummytab")
Any suggestions for optimising the above code are welcome.
SparkSQL only supports reading JSON files when the file contains one JSON object per line.
SQLContext.scala
/**
* Loads a JSON file (one object per line), returning the result as a [[DataFrame]].
* It goes through the entire dataset once to determine the schema.
*
 * @group specificdata
 * @deprecated As of 1.4.0, replaced by `read().json()`. This will be removed in Spark 2.0.
 */
@deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
def jsonFile(path: String): DataFrame = {
read.json(path)
}
Your file should look like this (strictly speaking, it's not a proper JSON file)
{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani#gmail.com"}
{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani#gmail.com"}
{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks#gmail.com"}
Please have a look at the outstanding JIRA issue. I don't think it is much of a priority, but it's noted just for the record.
You have two options:
Convert your JSON data to the supported format, one object per line.
Have one file per JSON object; this will result in too many files.
Note that SQLContext.jsonFile is deprecated, use SQLContext.read.json.
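As an addition beyond the original answer: if a newer Spark (2.2 or later) with a Hive-enabled SparkSession is available, the multi-line file can also be read directly with the multiLine option and the Employees array flattened before saving to Hive. A minimal sketch, where spark is assumed to be a SparkSession built with enableHiveSupport():
import org.apache.spark.sql.functions.{col, explode}

// Read the whole nested, multi-line JSON document in one pass (Spark 2.2+)
val raw = spark.read.option("multiLine", true).json("file:///home/vm/Downloads/emp.json")

// One row per employee, then expand the struct into top-level columns
val employees = raw.select(explode(col("Employees")).as("emp")).select("emp.*")

// Persist as a Hive table
employees.write.saveAsTable("employee")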
Examples are available in the Spark documentation.