spark/scala string to json inside map

I have a pairRDD that looks like
(1, {"id":1, "picture": "url1"})
(2, {"id":2, "picture": "url2"})
(3, {"id":3, "picture": "url3"})
...
where the second element is a string I got from the get() function described at http://alvinalexander.com/scala/how-to-write-scala-http-get-request-client-source-fromurl. Here is that function:
@throws(classOf[java.io.IOException])
@throws(classOf[java.net.SocketTimeoutException])
def get(url: String,
        connectTimeout: Int = 5000,
        readTimeout: Int = 5000,
        requestMethod: String = "GET") =
{
  import java.net.{URL, HttpURLConnection}
  val connection = (new URL(url)).openConnection.asInstanceOf[HttpURLConnection]
  connection.setConnectTimeout(connectTimeout)
  connection.setReadTimeout(readTimeout)
  connection.setRequestMethod(requestMethod)
  val inputStream = connection.getInputStream
  val content = io.Source.fromInputStream(inputStream).mkString
  if (inputStream != null) inputStream.close
  content
}
Now I want to convert that string to JSON to get the picture URL from it (following this answer: https://stackoverflow.com/a/38271732/1456026):
val step2 = pairRDD_1.map({ case (x, y) => {
  val jsonStr = y
  val rdd = sc.parallelize(Seq(jsonStr))
  val df = sqlContext.read.json(rdd)
  (x, y("picture"))
}})
but I'm constantly getting
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
When I printed out the first 20 elements and tried to convert the strings to JSON manually, one by one outside .map, it worked.
val rdd = sc.parallelize(Seq("""{"id":1, "picture": "url1"}"""))
val df = sqlContext.read.json(rdd)
println(df)
>>>[id: string, picture: string]
How do I convert a string to JSON in Spark/Scala inside .map?

You cannot use SparkContext in a distributed operation. In the code above, you cannot access SparkContext in the map operation on pairRDD_1.
Consider using a JSON library to perform the conversion.
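For instance, here is a minimal sketch of that idea using json4s-jackson (the library choice and the extraction call are my own, not from the original answer); the parsing happens entirely inside the closure, so neither SparkContext nor SQLContext is touched:

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

val step2 = pairRDD_1.map { case (x, y) =>
  // DefaultFormats is needed for extract[]; defining it inside the closure
  // avoids dragging any non-serializable driver-side state into the task.
  implicit val formats: Formats = DefaultFormats
  val picture = (parse(y) \ "picture").extract[String]
  (x, picture)
}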

Typically when you see this message, it's because you are using a resource in your map function (read: anonymous function) that was defined outside of it and is not serializable.
When running in clustered mode, the anonymous function will be running on a different machine altogether. On that separate machine, a new instance of your app is instantiated and its state (variables/values/etc.) is set from data that has been serialized by the driver and sent to the new instance. If your anonymous function is a closure (i.e. it uses variables outside of its scope), then those resources must be serializable in order to be sent to the worker nodes.
For example, a map function may attempt to use a database connection to grab some information for each record in the RDD. That database connection is only valid on the host that created it (from a networking perspective, of course), which is typically the driver program, so it cannot be serialized, sent, and used from a different host. In this particular example, you would use mapPartitions() to instantiate a database connection from the worker itself, then map() each of the records within that partition to query the database.
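A rough sketch of that mapPartitions pattern (the RDD name, JDBC URL, and SQL query below are placeholders of my own, not taken from the question):

import java.sql.{Connection, DriverManager}
import org.apache.spark.rdd.RDD

def enrich(recordIds: RDD[Int]): RDD[(Int, String)] =
  recordIds.mapPartitions { ids =>
    // One connection per partition, opened on the executor that processes it,
    // so nothing non-serializable has to travel from the driver.
    val conn: Connection = DriverManager.getConnection("jdbc:postgresql://db-host/mydb") // placeholder URL
    val stmt = conn.prepareStatement("SELECT picture FROM pictures WHERE id = ?")
    val results = ids.map { id =>
      stmt.setInt(1, id)
      val rs = stmt.executeQuery()
      val picture = if (rs.next()) rs.getString(1) else ""
      rs.close()
      (id, picture)
    }.toList // materialize before closing the connection
    stmt.close()
    conn.close()
    results.iterator
  }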
I can't provide much further help without your full code example, to see what potential value or variable is unable to be serialized.

One option is to use the json4s muster library.
Source: http://muster.json4s.org/docs/jawn_codec.html
// case class defined outside main()
case class Pictures(id: String, picture: String)

// import library
import muster._
import muster.codec.jawn._

// here all the magic happens
val json_read_RDD = pairRDD_1.map({ case (x, y) => {
  val json_read_to_case_class = JawnCodec.as[Pictures](y)
  (x, json_read_to_case_class.picture)
}})

// add to build.sbt
libraryDependencies ++= Seq(
  "org.json4s" %% "muster-codec-json" % "0.3.0",
  "org.json4s" %% "muster-codec-jawn" % "0.3.0")
Credit goes to Travis Hegner, who explained why the original code didn't work,
and to Anton Okolnychyi for the advice to use a JSON library.

Related

Why does SparkSQL always return out of range value when accessing any value in MySQL table?

I'm trying to access my MariaDB database in Spark to perform SQL queries on it.
It does successfully print the schema of the table, so the connection is working, but whenever I try to access any column or value inside the database I always get out of range exceptions:
java.sql.SQLException: Out of range value for column : value canonical
The full log and stacktrace is below.
I can access the database outside Spark and successfully get the values of the database.
Moreover, I've tried using deprecated classes such as SparkSQLContext to access the database with similar results.
object Main {
  def main(args: Array[String]) {
    // parse commandline parameters, get database properties
    val commandLineParser = new CommandLineParser()
    val commandLineParameters = commandLineParser.parseCommandLineParameters(args)
    val databaseProperties = PropertiesParser.readPropertiesFile(commandLineParameters.configFilePath)

    if (commandLineParameters.sparkSupport) {
      val spark =
        if (commandLineParameters.localMode) {
          SparkSession
            .builder()
            .appName("Spark Benchmark CLI")
            .config("spark.master", "local")
            .config("spark.driver.extraClassPath", "/opt/spark-apps/spark-apps/mariadb-java-client-2.4.1.jar")
            .getOrCreate()
        }

      // For implicit conversions like converting RDDs to DataFrames
      import spark.implicits._

      // connect
      Class.forName("org.mariadb.jdbc.Driver")
      val connection = DriverManager.getConnection(databaseProperties.jdbcURL, databaseProperties.user, databaseProperties.password)
      connection.isClosed

      // Spark likes working with properties, hence we create a properties object
      val connectionProperties = new Properties()
      connectionProperties.put("user", s"${databaseProperties.user}")
      connectionProperties.put("password", s"${databaseProperties.password}")
      connectionProperties.put("driver", s"${commandLineParameters.databaseDriver}")

      val table = spark.read.jdbc(databaseProperties.jdbcURL, commandLineParameters.table, connectionProperties)

      table.printSchema() // this does successfully print the schema
      table.show() // this is where the exceptions are created
    } else {
      // some code that accesses the database successfully outside spark
    }
  }
}
I expect to be able to run SQL queries inside Spark without any out of range value exceptions.
The full log and stacktrace of what is actually happening:
https://gist.github.com/Zethson/7e3f43cd80daac219704df25cccd68fa
A colleague of mine figured it out. It's a bug in the interaction between Spark and the MariaDB connector.
References:
https://jira.mariadb.org/browse/CONJ-421
https://issues.apache.org/jira/browse/SPARK-25013
I solved it by replacing mariadb with mysql in the DB URL.
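For illustration only, the change amounts to swapping the URL scheme (the host and database name below are made up); the MariaDB Connector/J version used in the question should also accept jdbc:mysql:// URLs, and Spark selects its MySQL JDBC dialect based on that prefix:

// Before: Spark does not recognise the jdbc:mariadb: prefix and falls back to a generic dialect.
// val jdbcURL = "jdbc:mariadb://db-host:3306/benchmarkdb"

// After: same database, same driver, but Spark now applies its MySQL JDBC dialect.
val jdbcURL = "jdbc:mysql://db-host:3306/benchmarkdb"
val table = spark.read.jdbc(jdbcURL, commandLineParameters.table, connectionProperties)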

Get only required file details from an HDFS directory in Scala

I encountered an issue while working with the org.apache.hadoop.fs package in Spark Scala. I need only certain file details (file name, block size, modification time) from a given directory. I tried the following code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
val fs = FileSystem.get(new Configuration())
val dir: String = "/env/domain/work/latest_ts"
val input_files = fs.listStatus(new Path(dir))
The resulting variable input_files is an Array[FileStatus] and has all the details about the files in that directory. In my Spark code, I only need the three above-mentioned parameters for each file, in the form of a List[Details].
case class Details(name: String, size: Double, time: String)
In the Array[FileStatus], we have 'path' (the full file path) as a String, the block size as a Long, and the modification time.
I tried parsing the Array[FileStatus] as JSON and extracting the required key-value pairs, but I couldn't. I also tried the following, where I created three lists separately and zipped them to form a list of tuples (String, Double, String), but it does not match List[Details] and throws an error during execution.
val names = fs.listStatus(new Path(dir)).map(_.getPath().getName).toList
val size = fs.listStatus(new Path(dir)).map(_.getBlockSize.toDouble).toList
val time = fs.listStatus(new Path(dir)).map(_.getModificationTime.toString).toList
val input_tuple = (names zip time zip size) map {case ((n,t),s) => (n,t,s)}
val input_files : List[Details] = input_tuple.asInstanceOf[List[Details]]
The error I got was
Exception during processing!
java.lang.ClassCastException: scala.Tuple3 cannot be cast to com.main.Details
Could anyone please advise whether there is a way to get the required parameters from fs, or how to correctly cast the tuple I have to Details?
Please help; thanks in advance.
To convert to JSON and read key-value pairs, I converted the Array[FileStatus] to a String using mkString(",") and tried to parse it using JSON.parseFull(input_string), which threw an error.
Here is what you can do:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
val fs = FileSystem.get(new Configuration())
val dir: String = "/env/domain/work/latest_ts"
val input_files = fs.listStatus(new Path(dir))
val details = input_files.map(m => Details(m.getPath.toString, m.getBlockSize, m.getModificationTime.toString)).toList
This will give you List[Details]. Hope this helps!
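As a side note on the ClassCastException above: a Tuple3 can never be cast to the Details case class with asInstanceOf. If you already have the zipped tuples, construct the case class explicitly instead (a small sketch reusing the variables from the question):

// (name, time, size) tuples -> Details(name, size, time)
val input_files: List[Details] = input_tuple.map { case (n, t, s) => Details(n, s, t) }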

GCP Proto Datastore encode JsonProperty in base64

I store a blob of Json in the datastore using JsonProperty.
I don't know the structure of the json data.
I am using endpoints proto datastore in order to retrieve my data.
The problem is that the JSON property is encoded in base64, and I want a plain JSON object.
For example, the JSON data will be:
{
first: 1,
second: 2
}
My code looks something like:
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel


class Model(EndpointsModel):
    data = ndb.JsonProperty()


@endpoints.api(name='myapi', version='v1', description='My Sample API')
class DataEndpoint(remote.Service):

    @Model.method(path='mymodel2', http_method='POST',
                  name='mymodel.insert')
    def MyModelInsert(self, my_model):
        my_model.data = {"first": 1, "second": 2}
        my_model.put()
        return my_model

    @Model.method(path='mymodel/{entityKey}',
                  http_method='GET',
                  name='mymodel.get')
    def getMyModel(self, model):
        print(model.data)
        return model


API = endpoints.api_server([DataEndpoint])
When I call the api for getting a model, I get:
POST /_ah/api/myapi/v1/mymodel2
{
"data": "eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ=="
}
where eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ== is the base64 encoding of {"second": 2, "first": 1}.
And the print statement gives me: {u'second': 2, u'first': 1}
So, in the method, I can explore the JSON blob data as a Python dict.
But in the API call, the data is encoded in base64.
I expected the API call to give me:
{
'data': {
'second': 2,
'first': 1
}
}
How can I get this result?
After the discussion in the comments of your question, let me share some sample code that you can use to store a JSON object in Datastore (it will be stored as a string) and later retrieve it in such a way that:
It will show as plain JSON after the API call.
You will be able to parse it again to a Python dict using eval.
I hope I understood correctly your issue, and this helps you with it.
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel


class Sample(EndpointsModel):
    column1 = ndb.StringProperty()
    column2 = ndb.IntegerProperty()
    column3 = ndb.StringProperty()


@endpoints.api(name='myapi', version='v1', description='My Sample API')
class MyApi(remote.Service):

    # URL: .../_ah/api/myapi/v1/mymodel - POSTS A NEW ENTITY
    @Sample.method(path='mymodel', http_method='GET', name='Sample.insert')
    def MyModelInsert(self, my_model):
        dict = {'first': 1, 'second': 2}
        dict_str = str(dict)
        my_model.column1 = "Year"
        my_model.column2 = 2018
        my_model.column3 = dict_str
        my_model.put()
        return my_model

    # URL: .../_ah/api/myapi/v1/mymodel/{ID} - RETRIEVES AN ENTITY BY ITS ID
    @Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
    def MyModelGet(self, my_model):
        if not my_model.from_datastore:
            raise endpoints.NotFoundException('MyModel not found.')
        dict = eval(my_model.column3)
        print("This is the Python dict recovered from a string: {}".format(dict))
        return my_model


application = endpoints.api_server([MyApi], restricted=False)
I have tested this code using the development server, but it should work the same in production using App Engine with Endpoints and Datastore.
After querying the first endpoint, it will create a new entity, which you will be able to find in Datastore and which contains a property column3 with your JSON data in string format.
Then, if you use the ID of that entity to retrieve it, your browser will show the string without any strange encoding, just plain JSON.
And in the console, you will be able to see that this string can be converted to a Python dict (or to JSON, using the json module if you prefer).
I hope I have not missed any point of what you want to achieve, but I think all the most important points are covered with this code: a property holding a JSON object, storing it in Datastore, retrieving it in a readable format, and being able to use it again as JSON/dict.
Update:
I think you should have a look at the list of available Property Types yourself, in order to find which one fits your requirements better. However, as an additional note, I have done a quick test working with a StructuredProperty (a property inside another property), by adding these modifications to the code:
# Define the nested model (your JSON object)
class Structured(EndpointsModel):
    first = ndb.IntegerProperty()
    second = ndb.IntegerProperty()


# Here I added a new property for simplicity; remember, StackOverflow does not write code for you :)
class Sample(EndpointsModel):
    column1 = ndb.StringProperty()
    column2 = ndb.IntegerProperty()
    column3 = ndb.StringProperty()
    column4 = ndb.StructuredProperty(Structured)


# Modify this endpoint definition to add a new property
@Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
def MyModelGet(self, my_model):
    if not my_model.from_datastore:
        raise endpoints.NotFoundException('MyModel not found.')
    # Add the new nested property here
    dict = eval(my_model.column3)
    my_model.column4 = dict
    print(json.dumps(my_model.column3))
    print("This is the Python dict recovered from a string: {}".format(dict))
    return my_model
With these changes, the response of the call to the endpoint now includes column4 as a JSON object itself (although it is not printed in-line, I do not think that should be a problem).
I hope this helps too. If this is not the exact behavior you want, maybe you should play around with the available Property Types, but I do not think there is one to which you can assign a Python dict (or JSON object) without previously converting it to a string.

Gatling: Compare web service JSON response using jsonFileFeeder

I'm using a JSON feeder to compare the JSON output of web services, as follows:
val jsonFileFeeder = jsonFile("test_data.json")
val strategy = (value: Option[String], session: Session) => value.map { jsonFileFeeder =>
val result = JSONCompare.compareJSON("expectedStr", "actualStr", JSONCompareMode.STRICT)
if (result.failed) Failure(result.getMessage)
else Success(value)
}.getOrElse(Failure("Missing body"))
val login = exec(http("Login")
.get("/login"))
.pause(1)
.feed(feeder)
.exec(http("authorization")
.post("/auth")
.headers(headers_10)
.queryParam("""email""", "${email}")
.queryParam("""password""", "${password}")
.check(status.is(200))
.check(bodyString.matchWith(strategy)))
.pause(1)
But it throws an error:
value matchWith is not a member of io.gatling.core.check.DefaultFindCheckBuilder[io.gatling.http.check.HttpCheck,io.gatling.http.response.Response,String,String]
15:10:01.963 [ERROR] i.g.a.ZincCompiler$ - .check(bodyString.matchWith(jsonFileFeeder)))
s\lib\Login.scala:18: not found: value JSONCompare
15:10:05.224 [ERROR] i.g.a.ZincCompiler$ - val result = JSONCompare.compareJSON(jsonFileFeeder, jsonFileFeeder, JSONCompareMode.STRICT)
^
15:10:05.631 [ERROR] i.g.a.ZincCompiler$ - two errors found
Compilation failed
Here's a sample script that semantically compares a JSON response with expected output:
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.core.json.Jackson
import java.nio.charset.StandardCharsets.UTF_8
import scala.concurrent.duration._

class BasicSimulation extends Simulation {

  lazy val expectedJson = Jackson.parse(
    getClass.getResourceAsStream("/output.json"),
    UTF_8
  )

  val scn = scenario("Scenario Name")
    .exec(http("request_1")
      .get("http://localhost:8000/output.json")
      .check(bodyString.transform(Jackson.parse).is(expectedJson))
    )

  setUp(scn.inject(atOnceUsers(1)))
}
It assumes there is a file output.json in the resources directory (the directory that also contains your data and request-bodies).
However, I think you should carefully consider whether this solution is right for your needs. It won't scale as well as JSONPath or regex checks (especially for large JSON files), it's inflexible, and it seems more like a functional testing task than a performance task. I suspect that if you're trying to compare JSON files in this way, then you're probably trying to solve the wrong problem.
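For reference, here is a minimal sketch of the JSONPath-style check mentioned above (the URL and field names are purely illustrative, not from the question):

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class JsonPathCheckSimulation extends Simulation {

  val scn = scenario("Scenario Name")
    .exec(http("request_1")
      .get("http://localhost:8000/success.json")
      // assert only the fields that matter instead of diffing the whole body
      .check(jsonPath("$.success").is("true"))
      .check(jsonPath("$.message").is("Request Succeeded")))

  setUp(scn.inject(atOnceUsers(1)))
}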
Note that the first script above doesn't use jsonFile, as jsonFile is designed for use as a feeder, whereas I suspect you want to compare a single request with a hard-coded response. However, jsonFile may prove useful if you will be making a number of different requests with different parameters and expect different (known) responses. Here's an example script that takes this approach:
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.core.json.Jackson
import scala.concurrent.duration._

class BasicSimulation extends Simulation {

  val myFeed = jsonFile("json_data.json").circular

  val scn = scenario("Scenario Name")
    .feed(myFeed)
    .exec(http("request_1")
      .get("${targetUrl}")
      .check(bodyString.transform(Jackson.parse).is("${expectedResponse}"))
    )

  setUp(scn.inject(atOnceUsers(2)))
}
It assumes there is a json resource in data/json_data.json, that looks something like the following:
[
{
"targetUrl":"http://localhost:8000/failure.json",
"expectedResponse":
{
"success": false,
"message": "Request Failed"
}
},
{
"targetUrl":"http://localhost:8000/success.json",
"expectedResponse":
{
"success": true,
"message": "Request Succeeded"
}
}
]
The expectedResponse should be the exact JSON you expect to get back from the server. And of course you don't just have to parameterise targetUrl, you can parameterise whatever you want in this way.
As an aside, you may also be interested to know that Gatling 2.1 is expected to allow comparing a response with a file without using hacks like these (although the current development version only supports comparing byte-for-byte, not comparing as JSON).

Inserting JsNumber into Mongo

When trying to insert a MongoDBObject that contains a JsNumber
val obj: DBObject = getDbObj // contains a "JsNumber()"
collection.insert(obj)
the following error occurs:
[error] play - Cannot invoke the action, eventually got an error: java.lang.IllegalArgumentException: can't serialize class scala.math.BigDecimal
I tried to replace the JsNumber with an Int, but I got the same error.
EDIT
Error can be reproduced via this test code. Full code in scalatest (https://gist.github.com/kman007us/6617735)
val collection = MongoConnection()("test")("test")
val obj: JsValue = Json.obj("age" -> JsNumber(100))
val q = MongoDBObject("name" -> obj)
collection.insert(q)
There are no registered handlers for Play's JSON implementation; you could add handlers to automatically translate Play's Js types to BSON types. However, that won't handle MongoDB Extended JSON, which has a special structure for dealing with non-native JSON types, e.g. date and ObjectId translations.
An example of using this is:
import com.mongodb.util.JSON
val obj: JsValue = Json.obj("age" -> JsNumber(100))
val doc: DBObject = JSON.parse(obj.toString).asInstanceOf[DBObject]
For an example of a BSON transformer, see the Joda Time transformer.
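To make that concrete, here is a rough sketch of what such a handler could look like, modelled on how the Joda Time helpers register themselves (this hook is my own illustration, not code shipped with Casbah):

import org.bson.{BSON, Transformer}
import play.api.libs.json.JsNumber

// Teach the legacy BSON encoder how to serialize Play's JsNumber:
// whenever it meets one, replace it with a plain java.lang.Double.
BSON.addEncodingHook(classOf[JsNumber], new Transformer {
  override def transform(o: AnyRef): AnyRef = o match {
    case JsNumber(value) => java.lang.Double.valueOf(value.toDouble)
    case other           => other
  }
})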
It seems that the Casbah driver isn't compatible with Play's JSON implementation. If I look through the Casbah code, it seems that you must use a set of MongoDBObject objects to build your query. The following snippet should work.
val collection = MongoConnection()("test")("test")
val obj = MongoDBObject("age" -> 100)
val q = MongoDBObject("name" -> obj)
collection.insert(q)
If you need compatibility with Play's JSON implementation, then use ReactiveMongo and Play-ReactiveMongo.
Edit
Maybe this Gist can help to convert JsValue objects into MongoDBObject objects.
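In the same spirit, a small hand-rolled converter (my own sketch, not the linked Gist) can walk a JsValue and build Casbah-friendly values directly:

import com.mongodb.casbah.Imports._
import play.api.libs.json._

// Recursively map Play JSON nodes onto types the MongoDB driver can serialize.
def jsValueToMongo(js: JsValue): Any = js match {
  case JsObject(fields) =>
    val builder = MongoDBObject.newBuilder
    fields.foreach { case (key, value) => builder += key -> jsValueToMongo(value) }
    builder.result()
  case JsArray(values) => MongoDBList(values.map(jsValueToMongo): _*)
  case JsString(s)     => s
  case JsNumber(n)     => n.toDouble // avoid scala.math.BigDecimal, which the driver cannot serialize
  case JsBoolean(b)    => b
  case _               => null       // JsNull and anything else
}

// Usage, following the example from the question:
val obj: JsValue = Json.obj("age" -> JsNumber(100))
val q = MongoDBObject("name" -> jsValueToMongo(obj))
// collection.insert(q)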