Get only required file details from a HDFS directory in scala - json

I encountered an issue while working with org.apache.hadoop.fs package in Spark Scala. I need only required file details(file name, block size, modification time) from a given directory. I tried using the following code
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
val fs = FileSystem.get(new Configuration())
val dir: String = "/env/domain/work/latest_ts"
val input_files = fs.listStatus(new Path(dir))
The variable input_files obtained is Array[FileStatus] and has all the details about the files in that directory. In My Spark code, I only need the above mentioned three parameters for each file present in the form of a List[Details].
case class Details(name: String, size: Double, time: String)
In the Array[FileStatus], we have 'path' (file full path) as String, block size as Long and modification time.
I tried parsing the Array[FileStatus] as Json and taking out required key value pairs but I couldn't. I also tried the following where I created three lists separately and zipped them to form a list of tuple (String, Double, String) but it is not matching to List[Details] and throqing an error while execution.
val names = fs.listStatus(new Path(dir)).map(_.getPath().getName).toList
val size = fs.listStatus(new Path(dir)).map(_.getBlockSize.toDouble).toList
val time = fs.listStatus(new Path(dir)).map(_.getModificationTime.toString).toList
val input_tuple = (names zip time zip size) map {case ((n,t),s) => (n,t,s)}
val input_files : List[Details] = input_tuple.asInstanceOf[List[Details]]
The error I got was
Exception during processing!
java.lang.ClassCastException: scala.Tuple3 cannot be cast to com.main.Details
Could any one please advise is there a way to get the required parameters from fs or how to correctly cast the tuple I have to Details
Please help, Thanks in advance
To convert Json and read key value pairs, I converted Array[FileStatus] to String using mkString(",") and tried to parse using JSON.parseFull(input_string) which threw an error.

Here is what you can do:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
val fs = FileSystem.get(new Configuration())
val dir: String = "/env/domain/work/latest_ts"
val input_files = fs.listStatus(new Path(dir))
val details = input_files.map(m => Details(m.getPath.toString, m.getBlockSize, m.getModificationTime.toString)).toList
This will give you List[Details]. Hope this helps!

Related

spark/scala string to json inside map

I have a pairRDD that looks like
(1, {"id":1, "picture": "url1"})
(2, {"id":2, "picture": "url2"})
(3, {"id":3, "picture": "url3"})
...
where second element is a string, i got from function get() from http://alvinalexander.com/scala/how-to-write-scala-http-get-request-client-source-fromurl. here is that function:
#throws(classOf[java.io.IOException])
#throws(classOf[java.net.SocketTimeoutException])
def get(url: String,
connectTimeout: Int = 5000,
readTimeout: Int = 5000,
requestMethod: String = "GET") =
{
import java.net.{URL, HttpURLConnection}
val connection = (new URL(url)).openConnection.asInstanceOf[HttpURLConnection]
connection.setConnectTimeout(connectTimeout)
connection.setReadTimeout(readTimeout)
connection.setRequestMethod(requestMethod)
val inputStream = connection.getInputStream
val content = io.Source.fromInputStream(inputStream).mkString
if (inputStream != null) inputStream.close
content
}
now I want to convert that string to json to get picture url from it. (from this https://stackoverflow.com/a/38271732/1456026)
val step2 = pairRDD_1.map({case(x,y)=>{
val jsonStr = y
val rdd = sc.parallelize(Seq(jsonStr))
val df = sqlContext.read.json(rdd)
(x,y("picture"))
}})
but i'm constantly getting
Exception in thread "main" org.apache.spark.SparkException: Task not
serializable
when i printed out first 20 elements and tried to convert strings to json manually one-by-one outside .map it worked.
val rdd = sc.parallelize(Seq("""{"id":1, "picture": "url1"}"""))
val df = sqlContext.read.json(rdd)
println(df)
>>>[id: string, picture: string]
how to convert string to json in spark/scala inside .map?
You cannot use SparkContext in a distributed operation. In the code above, you cannot access SparkContext in the map operation on pairRDD_1.
Consider using a JSON library to perform the conversion.
Typically when you see this message, it's because you are using a resource in your map function (read anonymous function) that was defined outside of it, and is not able to be serialized.
Running in clustered mode, the anonymous function will be running on a different machine altogether. On that separate machine, a new instance of your app is instantiated and it's state (variables/values/etc) is set from data that has been serialized by the driver and sent to the new instance. If you anonymous function is a closure (i.e. utilizes variables outside of it's scope), then those resources must be serializable, in order to be sent to the worker nodes.
For example, a map function may attempt to use a database connection to grab some information for each record in the RDD. That database connection is only valid on the host that created it (from a networking perspective, of course), which is typically the driver program, so it cannot be serialized, sent, and utilized from a different host. In this particular example, you would do a mapPartitions() to instantiate a database connection from the worker itself, then map() each of the records within that partition to query the database.
I can't provide much further help without your full code example, to see what potential value or variable is unable to be serialized.
One of the answers is to use json4s library.
source: http://muster.json4s.org/docs/jawn_codec.html
//case class defined outside main()
case class Pictures(id: String, picture: String)
// import library
import muster._
import muster.codec.jawn._
// here all the magic happens
val json_read_RDD = pairRDD_1.map({case(x,y) =>
{
val json_read_to_case_class = JawnCodec.as[Pictures](y)
(x, json_read_to_case_class.picture)
}})
// add to build.sbt
libraryDependencies ++= Seq(
"org.json4s" %% "muster-codec-json" % "0.3.0",
"org.json4s" %% "muster-codec-jawn" % "0.3.0")
credits goes to Travis Hegner, who explained why original code didn't work
and to Anton Okolnychyi for advice of using json library.

How to specify only particular fields using read.schema in JSON : SPARK Scala

I am trying to programmatically enforce schema(json) on textFile which looks like json. I tried with jsonFile but the issue is for creating a dataframe from a list of json files, spark has to do a 1 pass through the data to create a schema for the dataframe. So it needs to parse all the data which is taking longer time (4 hours since my data is zipped and of size TBs). So I want to try reading it as textFile and enforce schema to get interested fields alone to later query on the resulting data frame. But I am not sure how do I map it to the input. Can some give me some reference on how do I map schema to json like input.
input :
This is the full schema :
records: org.apache.spark.sql.DataFrame = [country: string, countryFeatures: string, customerId: string, homeCountry: string, homeCountryFeatures: string, places: array<struct<freeTrial:boolean,placeId:string,placeRating:bigint>>, siteName: string, siteId: string, siteTypeId: string, Timestamp: bigint, Timezone: string, countryId: string, pageId: string, homeId: string, pageType: string, model: string, requestId: string, sessionId: string, inputs: array<struct<inputName:string,inputType:string,inputId:string,offerType:string,originalRating:bigint,processed:boolean,rating:bigint,score:double,methodId:string>>]
But I am only interested in few fields like :
res45: Array[String] = Array({"requestId":"bnjinmm","siteName":"bueller","pageType":"ad","model":"prepare","inputs":[{"methodId":"436136582","inputType":"US","processed":true,"rating":0,"originalRating":1},{"methodId":"23232322","inputType":"UK","processed":falase,"rating":0,"originalRating":1}]
val records = sc.textFile("s3://testData/sample.json.gz")
val schema = StructType(Array(StructField("requestId",StringType,true),
StructField("siteName",StringType,true),
StructField("model",StringType,true),
StructField("pageType",StringType,true),
StructField("inputs", ArrayType(
StructType(
StructField("inputType",StringType,true),
StructField("originalRating",LongType,true),
StructField("processed",BooleanType,true),
StructField("rating",LongType,true),
StructField("methodId",StringType,true)
),true),true)))
val rowRDD = ??
val inputRDD = sqlContext.applySchema(rowRDD, schema)
inputRDD.registerTempTable("input")
sql("select * from input").foreach(println)
Is there any way to map this ? Or do I need to use son parser or something. I want to use textFile only because of the constraints.
Tried with :
val records =sqlContext.read.schema(schema).json("s3://testData/test2.gz")
But keeping getting the error :
<console>:37: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
StructField("inputs",ArrayType(StructType(StructField("inputType",StringType,true), StructField("originalRating",LongType,true), StructField("processed",BooleanType,true), StructField("rating",LongType,true), StructField("score",DoubleType,true), StructField("methodId",StringType,true)),true),true)))
^
It can load with following code with predefined schema, spark don't need to go through the file in ZIP file. The code in the question has ambiguity.
import org.apache.spark.sql.types._
val input = StructType(
Array(
StructField("inputType",StringType,true),
StructField("originalRating",LongType,true),
StructField("processed",BooleanType,true),
StructField("rating",LongType,true),
StructField("score",DoubleType,true),
StructField("methodId",StringType,true)
)
)
val schema = StructType(Array(
StructField("requestId",StringType,true),
StructField("siteName",StringType,true),
StructField("model",StringType,true),
StructField("inputs",
ArrayType(input,true),
true)
)
)
val records =sqlContext.read.schema(schema).json("s3://testData/test2.gz")
Not all the fields need to be provided. While it's good to provide all if possible.
Spark try best to parse all, if some row is not valid. It will add _corrupt_record as a column which contains the whole row.
While if it's plained json file file.

Loading json data into hive using spark sql

I am Unable to push json data into hive Below is the sample json data and my work . Please suggest me the missing one
json Data
{
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani#gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani#gmail.com"
},
{
"userId":"thanks",
"jobTitleName":"Program Directory",
"firstName":"Tom",
"lastName":"Hanks",
"preferredFullName":"Tom Hanks",
"employeeCode":"E3",
"region":"CA",
"phoneNumber":"408-2222222",
"emailAddress":"tomhanks#gmail.com"
}
]
}
I tried to use sqlcontext and jsonFile method to load which is failing to parse the json
val f = sqlc.jsonFile("file:///home/vm/Downloads/emp.json")
f.show
error is : java.lang.RuntimeException: Failed to parse a value for data type StructType() (current token: VALUE_STRING)
I tried in different way and able to crack and get the schema
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
sqlc.jsonRDD(jsonData).registerTempTable("employee")
val emp= sqlc.sql("select Employees[1].userId as ID,Employees[1].jobTitleName as Title,Employees[1].firstName as FirstName,Employees[1].lastName as LastName,Employees[1].preferredFullName as PeferedName,Employees[1].employeeCode as empCode,Employees[1].region as Region,Employees[1].phoneNumber as Phone,Employees[1].emailAddress as email from employee")
emp.show // displays all the values
I am able to get the data and schema seperately for each record but I am missing an idea to get all the data and load into hive.
Any help or suggestion is much appreaciated.
Here is the Cracked answer
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.HiveContext
val hc=new HiveContext(sc)
hc.jsonRDD(jsonData).registerTempTable("employee")
val fuldf=hc.jsonRDD(jsonData)
val dfemp=fuldf.select(explode(col("Employees")))
dfemp.saveAsTable("empdummy")
val df=sql("select * from empdummy")
df.select ("_c0.userId","_c0.jobTitleName","_c0.firstName","_c0.lastName","_c0.preferredFullName","_c0.employeeCode","_c0.region","_c0.phoneNumber","_c0.emailAddress").saveAsTable("dummytab")
Any suggestion for optimising the above code.
SparkSQL only supports reading JSON files when the file contains one JSON object per line.
SQLContext.scala
/**
* Loads a JSON file (one object per line), returning the result as a [[DataFrame]].
* It goes through the entire dataset once to determine the schema.
*
* #group specificdata
* #deprecated As of 1.4.0, replaced by `read().json()`. This will be removed in Spark 2.0.
*/
#deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
def jsonFile(path: String): DataFrame = {
read.json(path)
}
Your file should look like this (strictly speaking, it's not a proper JSON file)
{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani#gmail.com"}
{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani#gmail.com"}
{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks#gmail.com"}
Please have a look at the outstanding JIRA issue. Don't think it is that much of priority, but just for record.
You have two options
Convert your json data to the supported format, one object per line
Have one file per JSON object - this will result in too many files.
Note that SQLContext.jsonFile is deprecated, use SQLContext.read.json.
Examples from spark documentation

Scala play 2 framework how can I implement a write or format on Json

I am new to Scala and play 2 and haven't found a way to return a Json request from the database using Anorm. This is my simple code
def locations = Action {implicit c=>
import play.api.libs.json._
implicit val readLocations = SQL("select city,state from zips limit 1")
Ok(Json.toJson(readLocations))
}
The method is a post I simply want to return via Json 1 record from the database table there but I get this error
Error:(57, 21) Play 2 Compiler:
.scala:57: No Json serializer found for type anorm.SqlQuery. Try to implement an implicit Writes or Format for this type.
Ok(Json.toJson(readLocations))
Any suggestions would be welcomed, I have been switching the code above some but nothing is working. I know I need to do a Write or Format but can't seem to find out how.
^**
Looks like you are trying to send a List of Locations. You could do:
def locations = Action {implicit c=>
import play.api.libs.json._
implicit val locationFmt = Json.format[Location]
case class Location(city: String, state: String)
//Send Multiple Locations if you want
val readLocations = SQL("select city,state from zips").list.map{case Row(city: String, state: String) =>
Location(city, state)
}
// Send a single Location
val readLocation = SQL("select city,state from zips limit 1").list.headOption.map{case Row(city: String, state: String) =>
Location(city, state)
}.getOrElse(throw new NoSuchElementException)
Ok(Json.toJson(readLocation))
}

Inserting JsNumber into Mongo

When trying to insert a MongoDBObject that contains a JsNumber
val obj: DBObject = getDbObj // contains a "JsNumber()"
collection.insert(obj)
the following error occurs:
[error] play - Cannot invoke the action, eventually got an error: java.lang.IllegalArgumentException: can't serialize class scala.math.BigDecimal
I tried to replace the JsNumber with an Int, but I got the same error.
EDIT
Error can be reproduced via this test code. Full code in scalatest (https://gist.github.com/kman007us/6617735)
val collection = MongoConnection()("test")("test")
val obj: JsValue = Json.obj("age" -> JsNumber(100))
val q = MongoDBObject("name" -> obj)
collection.insert(q)
There are no registered handlers for Plays JSON implementation - you could add handlers to automatically translate plays Js Types to BSON types. However, that wont handle mongodb extended json which has a special structure dealing with non native json types eg: date and objectid translations.
An example of using this is:
import com.mongodb.util.JSON
val obj: JsValue = Json.obj("age" -> JsNumber(100))
val doc: DBObject = JSON.parse(obj.toString).asInstanceOf[DBObject]
For an example of a bson transformer see the joda time transformer.
It seems that casbah driver isn't compatible with Plays's JSON implementation. If I look through the cashbah code than it seems that you must use a set of MongoDBObject objects to build your query. The following snippet should work.
val collection = MongoConnection()("test")("test")
val obj = MongoDBObject("age" -> 100)
val q = MongoDBObject("name" -> obj)
collection.insert(q)
If you need the compatibility with Play's JSON implementation then use ReactiveMongo and Play-ReactiveMongo.
Edit
Maybe this Gist can help to convert JsValue objects into MongoDBObject objects.