Spark file load - `try` and `except` in scala - json

I want to read a file that sits in one of two locations: try the first location and, if that fails, fall back to the second. In Python I would use a `try`, and if an IOError (file does not exist) is raised, read the second location in the `except` block. I can read one location in Scala like this:
val vertices_raw = sqlContext.read.json("location_a/file.json")
I have tried the following, using getOrElse:
val vertices_raw = sqlContext.read.json("location_a/file.json") getOrElse vertices_raw = sqlContext.read.json("location_b/file.json")
However, this did not compile.

You can do the same thing in Scala:
val vertices_raw: DataFrame = try {
  sqlContext.read.json("location_a/file.json")
} catch {
  case e: Exception => sqlContext.read.json("location_b/file.json")
}
Or, alternatively:
import scala.util.Try

val vertices_raw =
  Try(sqlContext.read.json("location_a/file.json"))
    .getOrElse(sqlContext.read.json("location_b/file.json"))
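If you ever need to chain more than two locations, a small sketch along the same lines should work (the third path here, location_c/file.json, is purely illustrative):

import scala.util.Try

// try each location in order; the first successful read wins,
// and .get rethrows the last failure if none of the locations can be read
val vertices_raw = Try(sqlContext.read.json("location_a/file.json"))
  .orElse(Try(sqlContext.read.json("location_b/file.json")))
  .orElse(Try(sqlContext.read.json("location_c/file.json")))
  .get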

Related

Is there a way to modify this code to let Spark Streaming read from JSON?

I'm working on a Spark Streaming app that continuously reads data from localhost port 9098. Is there a way to replace the localhost socket with <users/folder/path>, so that it reads data from a folder path or JSON files automatically?
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.Logger
import org.apache.log4j.Level

object StreamingApplication extends App {

  Logger.getLogger("Org").setLevel(Level.ERROR)

  // creating spark streaming context
  val sc = new SparkContext("local[*]", "wordCount")
  val ssc = new StreamingContext(sc, Seconds(5))

  // lines is a DStream
  val lines = ssc.socketTextStream("localhost", 9098)

  // words is a transformed DStream
  val words = lines.flatMap(x => x.split(" "))

  // bunch of transformations
  val pairs = words.map(x => (x, 1))
  val wordsCount = pairs.reduceByKey((x, y) => x + y)

  // print is an action
  wordsCount.print()

  // start the streaming context
  ssc.start()
  ssc.awaitTermination()
}
Basically, I need help modifying the line below:
val lines = ssc.socketTextStream("localhost", 9098)
to this:
val lines = ssc.socketTextStream("<folder path>")
FYI, I'm using IntelliJ IDEA to build this.
I'd recommend reading the Spark documentation, especially the scaladoc.
There seems to be a fileStream method for exactly this:
https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/streaming/StreamingContext.html
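As a rough sketch, StreamingContext also provides textFileStream, which watches a directory for newly created text files; swapping it in for the socket source could look like this ("<folder path>" is just the placeholder from the question):

// watch a directory instead of a socket; new files dropped into
// <folder path> are picked up as they appear
val lines = ssc.textFileStream("<folder path>")

// the rest of the word-count pipeline stays unchanged
val words = lines.flatMap(x => x.split(" "))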

Get only required file details from a HDFS directory in scala

I encountered an issue while working with the org.apache.hadoop.fs package in Spark Scala. I only need certain file details (file name, block size, modification time) from a given directory. I tried the following code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
val fs = FileSystem.get(new Configuration())
val dir: String = "/env/domain/work/latest_ts"
val input_files = fs.listStatus(new Path(dir))
The variable input_files is an Array[FileStatus] and has all the details about the files in that directory. In my Spark code, I only need the three fields mentioned above for each file, in the form of a List[Details].
case class Details(name: String, size: Double, time: String)
In the Array[FileStatus], we have 'path' (the full file path) as a String, the block size as a Long, and the modification time.
I tried parsing the Array[FileStatus] as JSON and extracting the required key-value pairs, but I couldn't. I also tried the following, where I created three lists separately and zipped them to form a list of tuples (String, Double, String), but it does not match List[Details] and throws an error at execution time.
val names = fs.listStatus(new Path(dir)).map(_.getPath().getName).toList
val size = fs.listStatus(new Path(dir)).map(_.getBlockSize.toDouble).toList
val time = fs.listStatus(new Path(dir)).map(_.getModificationTime.toString).toList
val input_tuple = (names zip time zip size) map {case ((n,t),s) => (n,t,s)}
val input_files : List[Details] = input_tuple.asInstanceOf[List[Details]]
The error I got was:
Exception during processing!
java.lang.ClassCastException: scala.Tuple3 cannot be cast to com.main.Details
Could anyone please advise whether there is a way to get the required fields from fs, or how to correctly cast the tuples I have to Details?
Please help, thanks in advance.
To convert to JSON and read the key-value pairs, I converted the Array[FileStatus] to a String using mkString(",") and tried to parse it with JSON.parseFull(input_string), which threw an error.
Here is what you can do:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
val fs = FileSystem.get(new Configuration())
val dir: String = "/env/domain/work/latest_ts"
val input_files = fs.listStatus(new Path(dir))
val details = input_files.map { m =>
  // pull just the three fields we need from each FileStatus
  Details(m.getPath.getName, m.getBlockSize, m.getModificationTime.toString)
}.toList
This will give you List[Details]. Hope this helps!
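As a side note, the asInstanceOf in the original attempt fails because a Tuple3 simply is not a Details. If you prefer to keep the zip-based approach, mapping each tuple through the constructor works as well (a small sketch reusing the names, time and size lists from the question):

// build Details explicitly instead of casting the tuples
val input_files: List[Details] =
  (names zip time zip size).map { case ((n, t), s) => Details(n, s, t) }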

spark/scala string to json inside map

I have a pairRDD that looks like
(1, {"id":1, "picture": "url1"})
(2, {"id":2, "picture": "url2"})
(3, {"id":3, "picture": "url3"})
...
where the second element is a string I got from the get() function at http://alvinalexander.com/scala/how-to-write-scala-http-get-request-client-source-fromurl. Here is that function:
@throws(classOf[java.io.IOException])
@throws(classOf[java.net.SocketTimeoutException])
def get(url: String,
        connectTimeout: Int = 5000,
        readTimeout: Int = 5000,
        requestMethod: String = "GET") = {
  import java.net.{URL, HttpURLConnection}
  val connection = (new URL(url)).openConnection.asInstanceOf[HttpURLConnection]
  connection.setConnectTimeout(connectTimeout)
  connection.setReadTimeout(readTimeout)
  connection.setRequestMethod(requestMethod)
  val inputStream = connection.getInputStream
  val content = io.Source.fromInputStream(inputStream).mkString
  if (inputStream != null) inputStream.close
  content
}
Now I want to convert that string to JSON to get the picture URL from it (following https://stackoverflow.com/a/38271732/1456026):
val step2 = pairRDD_1.map({ case (x, y) => {
  val jsonStr = y
  val rdd = sc.parallelize(Seq(jsonStr))
  val df = sqlContext.read.json(rdd)
  (x, y("picture"))
}})
but I'm constantly getting
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
When I printed out the first 20 elements and tried to convert the strings to JSON manually, one by one, outside .map, it worked:
val rdd = sc.parallelize(Seq("""{"id":1, "picture": "url1"}"""))
val df = sqlContext.read.json(rdd)
println(df)
>>>[id: string, picture: string]
How do I convert a string to JSON in Spark/Scala inside .map?
You cannot use SparkContext in a distributed operation. In the code above, sc and sqlContext cannot be accessed from within the map over pairRDD_1.
Consider using a JSON library to perform the conversion.
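As a rough sketch of what that could look like with json4s (one option among several; this assumes pairRDD_1 is an RDD of (key, jsonString) pairs as shown in the question):

import org.json4s._
import org.json4s.jackson.JsonMethods._

val step2 = pairRDD_1.map { case (x, y) =>
  // parse the JSON string on the executor; no SparkContext or SQLContext needed
  implicit val formats: Formats = DefaultFormats
  val picture = (parse(y) \ "picture").extract[String]
  (x, picture)
}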
Typically when you see this message, it's because you are using a resource in your map function (read: anonymous function) that was defined outside of it and is not serializable.
Running in clustered mode, the anonymous function will be running on a different machine altogether. On that separate machine, a new instance of your app is instantiated and its state (variables/values/etc.) is set from data that has been serialized by the driver and sent to the new instance. If your anonymous function is a closure (i.e. it uses variables outside of its scope), then those resources must be serializable in order to be sent to the worker nodes.
For example, a map function may attempt to use a database connection to grab some information for each record in the RDD. That database connection is only valid on the host that created it (from a networking perspective, of course), which is typically the driver program, so it cannot be serialized, sent, and used from a different host. In that case, you would use mapPartitions() to instantiate a database connection from the worker itself, and then map() each of the records within that partition to query the database; a sketch of that pattern follows.
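A minimal sketch of that mapPartitions pattern, assuming a hypothetical someRdd of lookup keys and a placeholder JDBC URL and query (swap in whatever client your database actually uses):

import java.sql.DriverManager

val enriched = someRdd.mapPartitions { keys =>
  // one connection per partition, opened on the worker itself
  val conn = DriverManager.getConnection("jdbc:postgresql://db-host/mydb", "user", "pass")
  val stmt = conn.prepareStatement("SELECT value FROM lookup WHERE key = ?")
  val results = keys.map { key =>
    stmt.setString(1, key)
    val rs = stmt.executeQuery()
    (key, if (rs.next()) rs.getString(1) else null)
  }.toList          // force evaluation before the connection is closed
  conn.close()
  results.iterator
}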
I can't provide much more help without your full code example, to see which value or variable cannot be serialized.
One option is to use the muster library from json4s.
source: http://muster.json4s.org/docs/jawn_codec.html
// case class defined outside main()
case class Pictures(id: String, picture: String)

// import library
import muster._
import muster.codec.jawn._

// here all the magic happens
val json_read_RDD = pairRDD_1.map({ case (x, y) =>
  val json_read_to_case_class = JawnCodec.as[Pictures](y)
  (x, json_read_to_case_class.picture)
})

// add to build.sbt
libraryDependencies ++= Seq(
  "org.json4s" %% "muster-codec-json" % "0.3.0",
  "org.json4s" %% "muster-codec-jawn" % "0.3.0")
Credit goes to Travis Hegner, who explained why the original code didn't work, and to Anton Okolnychyi for the advice to use a JSON library.

Loading json data into hive using spark sql

I am unable to push JSON data into Hive. Below is the sample JSON data and my work so far. Please point out what I am missing.
JSON data
{
  "Employees" : [
    {
      "userId":"rirani",
      "jobTitleName":"Developer",
      "firstName":"Romin",
      "lastName":"Irani",
      "preferredFullName":"Romin Irani",
      "employeeCode":"E1",
      "region":"CA",
      "phoneNumber":"408-1234567",
      "emailAddress":"romin.k.irani@gmail.com"
    },
    {
      "userId":"nirani",
      "jobTitleName":"Developer",
      "firstName":"Neil",
      "lastName":"Irani",
      "preferredFullName":"Neil Irani",
      "employeeCode":"E2",
      "region":"CA",
      "phoneNumber":"408-1111111",
      "emailAddress":"neilrirani@gmail.com"
    },
    {
      "userId":"thanks",
      "jobTitleName":"Program Directory",
      "firstName":"Tom",
      "lastName":"Hanks",
      "preferredFullName":"Tom Hanks",
      "employeeCode":"E3",
      "region":"CA",
      "phoneNumber":"408-2222222",
      "emailAddress":"tomhanks@gmail.com"
    }
  ]
}
I tried to use sqlContext and the jsonFile method to load it, but it fails to parse the JSON:
val f = sqlc.jsonFile("file:///home/vm/Downloads/emp.json")
f.show
The error is: java.lang.RuntimeException: Failed to parse a value for data type StructType() (current token: VALUE_STRING)
I tried a different way and was able to crack it and get the schema:
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
sqlc.jsonRDD(jsonData).registerTempTable("employee")
val emp= sqlc.sql("select Employees[1].userId as ID,Employees[1].jobTitleName as Title,Employees[1].firstName as FirstName,Employees[1].lastName as LastName,Employees[1].preferredFullName as PeferedName,Employees[1].employeeCode as empCode,Employees[1].region as Region,Employees[1].phoneNumber as Phone,Employees[1].emailAddress as email from employee")
emp.show // displays all the values
I am able to get the data and schema separately for each record, but I am missing an idea for how to get all the data and load it into Hive.
Any help or suggestion is much appreciated.
Here is the cracked answer:
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)

import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.{col, explode}

val hc = new HiveContext(sc)
hc.jsonRDD(jsonData).registerTempTable("employee")
val fuldf = hc.jsonRDD(jsonData)
val dfemp = fuldf.select(explode(col("Employees")))
dfemp.saveAsTable("empdummy")
val df = hc.sql("select * from empdummy")
df.select("_c0.userId", "_c0.jobTitleName", "_c0.firstName", "_c0.lastName", "_c0.preferredFullName", "_c0.employeeCode", "_c0.region", "_c0.phoneNumber", "_c0.emailAddress").saveAsTable("dummytab")
Any suggestions for optimising the above code are welcome.
SparkSQL only supports reading JSON files when the file contains one JSON object per line.
SQLContext.scala
/**
 * Loads a JSON file (one object per line), returning the result as a [[DataFrame]].
 * It goes through the entire dataset once to determine the schema.
 *
 * @group specificdata
 * @deprecated As of 1.4.0, replaced by `read().json()`. This will be removed in Spark 2.0.
 */
@deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
def jsonFile(path: String): DataFrame = {
  read.json(path)
}
Your file should look like this (strictly speaking, it's not a proper JSON file):
{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani#gmail.com"}
{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani#gmail.com"}
{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks#gmail.com"}
Please have a look at the outstanding JIRA issue. I don't think it is that high a priority, but it's worth noting for the record.
You have two options:
Convert your JSON data to the supported format, one object per line, or
have one file per JSON object - though this will result in too many files.
Note that SQLContext.jsonFile is deprecated; use SQLContext.read.json instead.
See also the examples in the Spark documentation.
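As an aside, and outside the Spark version this question was asked against: Spark 2.2+ can read a multi-line JSON document like the one above directly via the multiLine option. A sketch, assuming a SparkSession named spark is available:

import org.apache.spark.sql.functions.{col, explode}

// multiLine lets Spark parse one JSON document spread over many lines
val df = spark.read
  .option("multiLine", true)
  .json("file:///home/vm/Downloads/emp.json")

// flatten the Employees array into one row per employee
val employees = df.select(explode(col("Employees")).as("emp")).select("emp.*")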

Hocon: Read an array of objects from a configuration file

I have created a Play application (2.1) which uses the configuration in conf/application.conf in the HOCON format.
I want to add an array of projects to the configuration. The file conf/application.conf looks like this:
...
projects = [
{name: "SO", url: "http://stackoverflow.com/"},
{name: "google", url: "http://google.com"}
]
I am trying to read this configuration in my Scala project:
import scala.collection.JavaConversions._

case class Project(name: String, url: String)

val projectList: List[Project] =
  Play.maybeApplication.map { x =>
    val simpleConfig = x.configuration.getObjectList("projects").map { y =>
      y.toList.map { z =>
        Project(z.get("name").toString, z.get("url").toString) // ?!? doesn't work
        ...
      }}}}}}}} // *arg*
This approach seems very complicated; I get lost in a lot of Options, and my Eclipse IDE cannot give me any hints about the classes.
Does anybody have an example of how to read an array of objects from a HOCON configuration file?
Or should I use a JSON file with a JSON parser instead of HOCON?
The following works for me in Play 2.1.2 (I don't have a .maybeApplication on my play.Play object, though, and I'm not sure why you do):
import play.Play
import scala.collection.JavaConversions._

case class Project(name: String, url: String)

val projectList: List[Project] = {
  val projs = Play.application.configuration.getConfigList("projects") map { p =>
    Project(p.getString("name"), p.getString("url"))
  }
  projs.toList
}

println(projectList)
Giving output:
List(Project(SO,http://stackoverflow.com/), Project(google,http://google.com))
There's not a whole lot different, although I don't get lost in a whole lot of Option instances either (again, different from the API you seem to have).
More importantly, getConfigList seems to be a closer match for what you want to do, since it returns List[play.Configuration], which enables you to specify types on retrieval instead of resorting to casts or .toString() calls.
What are you trying to accomplish with the y.toList.map{z => part? If you want a collection of Project as the result, why not just do:
val simpleConfig = x.configuration.getObjectList("projects").map { y =>
  Project(y.get("name").toString, y.get("url").toString)
}
In this case, the map operation should be taking instances of ConfigObject, which is what y is. That seems to be all you need to get your Project instances, so I'm not sure why you are toList-ing that ConfigObject (which is a Map) into a List of Tuple2 and then mapping that again.
If it is a normal HOCON configuration then, similar to strangefeatures' answer, this will work:
import javax.inject._
import play.api.Configuration

trait Barfoo {
  def configuration: Configuration

  def projects = for {
    projectsFound <- configuration.getConfigList("projects").toList
    projectConfig <- projectsFound
    name          <- projectConfig.getString("name").toList
    url           <- projectConfig.getString("url").toList
  } yield Project(name, url)
}

class Foobar @Inject() (val configuration: Configuration) extends Barfoo
(Using Play 2.4+ Injection)
Given that the contents of the array are JSON and you have a case class, you could try to use the Play JSON API and work with the objects that way. The Inception part should make it trivial.
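For completeness, here is a minimal sketch that bypasses Play's wrapper and reads the same array with the plain Typesafe Config library (it assumes projects is defined in application.conf exactly as in the question):

import com.typesafe.config.ConfigFactory
import scala.collection.JavaConverters._

case class Project(name: String, url: String)

// load application.conf from the classpath and map each object in the array
val config = ConfigFactory.load()
val projects: List[Project] =
  config.getConfigList("projects").asScala.toList.map { c =>
    Project(c.getString("name"), c.getString("url"))
  }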