Loading JSON data into Hive using Spark SQL

I am unable to push JSON data into Hive. Below are the sample JSON data and my work so far. Please suggest what I am missing.
JSON data:
{
    "Employees" : [
        {
            "userId":"rirani",
            "jobTitleName":"Developer",
            "firstName":"Romin",
            "lastName":"Irani",
            "preferredFullName":"Romin Irani",
            "employeeCode":"E1",
            "region":"CA",
            "phoneNumber":"408-1234567",
            "emailAddress":"romin.k.irani@gmail.com"
        },
        {
            "userId":"nirani",
            "jobTitleName":"Developer",
            "firstName":"Neil",
            "lastName":"Irani",
            "preferredFullName":"Neil Irani",
            "employeeCode":"E2",
            "region":"CA",
            "phoneNumber":"408-1111111",
            "emailAddress":"neilrirani@gmail.com"
        },
        {
            "userId":"thanks",
            "jobTitleName":"Program Directory",
            "firstName":"Tom",
            "lastName":"Hanks",
            "preferredFullName":"Tom Hanks",
            "employeeCode":"E3",
            "region":"CA",
            "phoneNumber":"408-2222222",
            "emailAddress":"tomhanks@gmail.com"
        }
    ]
}
I tried to use SQLContext and the jsonFile method to load it, but it fails to parse the JSON:
val f = sqlc.jsonFile("file:///home/vm/Downloads/emp.json")
f.show
The error is: java.lang.RuntimeException: Failed to parse a value for data type StructType() (current token: VALUE_STRING)
I tried a different way and was able to get the schema:
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
sqlc.jsonRDD(jsonData).registerTempTable("employee")
val emp= sqlc.sql("select Employees[1].userId as ID,Employees[1].jobTitleName as Title,Employees[1].firstName as FirstName,Employees[1].lastName as LastName,Employees[1].preferredFullName as PeferedName,Employees[1].employeeCode as empCode,Employees[1].region as Region,Employees[1].phoneNumber as Phone,Employees[1].emailAddress as email from employee")
emp.show // displays all the values
I am able to get the data and schema separately for each record, but I am missing an idea of how to get all the data and load it into Hive.
Any help or suggestion is much appreciated.

Here is the solution I cracked:
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.{col, explode}
val hc = new HiveContext(sc)
hc.jsonRDD(jsonData).registerTempTable("employee")
val fuldf = hc.jsonRDD(jsonData)
val dfemp = fuldf.select(explode(col("Employees")))
dfemp.saveAsTable("empdummy")
val df = hc.sql("select * from empdummy")
df.select ("_c0.userId","_c0.jobTitleName","_c0.firstName","_c0.lastName","_c0.preferredFullName","_c0.employeeCode","_c0.region","_c0.phoneNumber","_c0.emailAddress").saveAsTable("dummytab")
Any suggestions for optimising the above code?
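For reference, the same steps can probably be collapsed so that the intermediate empdummy table is not needed. A rough, untested sketch (same HiveContext and imports as above; the final table name is just an example, and select("emp.*") expands the struct into columns on recent Spark versions, otherwise list the fields explicitly):
// Explode the Employees array and expand the resulting struct in one pass,
// then write straight to the final Hive table.
val flat = hc.jsonRDD(jsonData)
  .select(explode(col("Employees")).as("emp"))
  .select("emp.*")
flat.write.saveAsTable("employees_flat")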

SparkSQL only supports reading JSON files when the file contains one JSON object per line.
SQLContext.scala
/**
 * Loads a JSON file (one object per line), returning the result as a [[DataFrame]].
 * It goes through the entire dataset once to determine the schema.
 *
 * @group specificdata
 * @deprecated As of 1.4.0, replaced by `read().json()`. This will be removed in Spark 2.0.
 */
@deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
def jsonFile(path: String): DataFrame = {
  read.json(path)
}
Your file should look like this (strictly speaking, it's not a proper JSON file)
{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani#gmail.com"}
{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani#gmail.com"}
{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks#gmail.com"}
Please have a look at the outstanding JIRA issue. I don't think it is that high a priority, but it's worth noting for the record.
You have two options:
Convert your json data to the supported format, one object per line
Have one file per JSON object - this will result in too many files.
Note that SQLContext.jsonFile is deprecated, use SQLContext.read.json.
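For example, once the file has one object per line, the non-deprecated API is simply (a minimal sketch using the same sqlc and path as in the question):
// read.json replaces the deprecated jsonFile; each line of emp.json is one employee object.
val df = sqlc.read.json("file:///home/vm/Downloads/emp.json")
df.registerTempTable("employee")
sqlc.sql("select userId, jobTitleName, region from employee").show()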
See the examples in the Spark documentation.

Convert nested JSON string column into map type column in Spark

overall aim
I have data landing in blob storage from an Azure service in the form of JSON files, where each line in a file is a nested JSON object. I want to process this with Spark and finally store it as a Delta table with nested struct/map type columns, which can later be queried downstream using the dot notation columnName.key.
data nesting visualized
{
    key1: value1
    nestedType1: {
        key1: value1
        keyN: valueN
    }
    nestedType2: {
        key1: value1
        nestedKey: {
            key1: value1
            keyN: valueN
        }
    }
    keyN: valueN
}
current approach and problem
I am not using the default Spark JSON reader, as it results in some incorrect parsing of the files. Instead, I am loading the files as text files and then parsing them with UDFs using Python's json module (e.g. below), after which I use explode and pivot to get the first level of keys into columns.
import json
from pyspark.sql.functions import udf

@udf('MAP<STRING,STRING>')
def get_key_val(x):
    try:
        return json.loads(x)
    except:
        return None
After this initial transformation, I now need to convert the nestedType columns to valid map types as well. Since the initial function returns map<string,string>, the values in the nestedType columns are no longer valid JSON, so I cannot use json.loads; instead I use regex-based string operations:
import re
from pyspark.sql.functions import udf

@udf('MAP<STRING,STRING>')
def convert_map(string):
    try:
        regex = re.compile(r"""\w+=.*?(?:(?=,(?!"))|(?=}))""")
        obj = dict([(a.split('=')[0].strip(), a.split('=')[1]) for a in regex.findall(string)])
        return obj
    except Exception:
        return None  # give up on rows that cannot be parsed
This is fine for the second level of nesting, but going further would require another UDF and subsequent complications.
question
How can I use a Spark UDF or native Spark functions to parse the nested JSON data such that it is queryable in columnName.key format?
Also, there is no restriction on the Spark version. Hopefully I was able to explain this properly; do let me know if you want me to add some sample data and code for clarity. Any help is appreciated.
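For what it's worth, the native-function route usually goes through from_json with an explicit schema, which yields struct columns that are queryable as columnName.key. A rough Scala sketch (Spark 2.1+; the schema, field names and path below are invented to mirror the nesting above):
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Hypothetical schema describing the first two levels of nesting.
val schema = new StructType()
  .add("key1", StringType)
  .add("nestedType1", new StructType()
    .add("key1", StringType)
    .add("keyN", StringType))

val parsed = spark.read.text("/path/to/landing/*.json")   // one JSON object per line
  .select(from_json(col("value"), schema).as("j"))
  .select("j.*")                                           // first-level keys become columns

// Nested fields are now addressable with dot notation:
parsed.select(col("nestedType1.key1")).show()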

How to export all data from Elastic Search Index to file in JSON format with _id field specified?

I'm new to both Spark and Scala. I'm trying to read all data from a particular index in Elasticsearch into an RDD and use this data to write to MongoDB.
I'm loading the Elasticsearch data into an esJsonRDD, and when I try to print the RDD contents, it is in the following format:
(1765770532{"FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"})
Expected format,
{_id:"1765770532","FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
How can I achieve the output from elastic search to be formatted this way?.
Any help would be appreciated.
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark.rdd.EsSpark

object readFromES {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("readFromES")
      .set("es.nodes", Config.ES_NODES)
      .set("es.nodes.wan.only", Config.ES_NODES_WAN_ONLY)
      .set("es.net.http.auth.user", Config.ES_NET_HTTP_AUTH_USER)
      .set("es.net.http.auth.pass", Config.ES_NET_HTTP_AUTH_PASS)
      .set("es.net.ssl", Config.ES_NET_SSL)
      .set("es.output.json", "true")
    val sc = new SparkContext(conf)
    val RDD = EsSpark.esJsonRDD(sc, "userdata/user")
    //RDD.coalesce(1).saveAsTextFile(args(0))
    RDD.take(5).foreach(println)
  }
}
I would like the RDD output to be written to a file in the following JSON format (one line per doc):
{_id:"1765770532","FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
{_id:"1765770533","FirstName":DEF,"LastName":"DEF",Zipcode":"35525","City":"PortWinchestor","StateCode":"AI"}
"_id" is a part of metadata, to access it you should add .config("es.read.metadata", true) to config.
Then you can access it two ways, You can use
val RDD = EsSpark.esJsonRDD(sc, "userdata/user")
and manually add the _id field to the JSON.
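A rough sketch of that manual route (untested; it assumes each element of the RDD is an (id, jsonString) pair, as in the output shown above, and the output path is just an example):
// Prepend the metadata _id to each JSON document and write one JSON object per line.
val withId = RDD.map { case (id, json) =>
  s"""{"_id":"$id",""" + json.stripPrefix("{")
}
withId.coalesce(1).saveAsTextFile("/tmp/es_export")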
Or, the easier way is to read it as a DataFrame:
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("userdata/user")
  .withColumn("_id", $"_metadata".getItem("_id"))
  .drop("_metadata")
// Write as JSON to a file
df.write.json("output folder")
Here spark is the SparkSession created as:
val spark = SparkSession.builder().master("local[*]").appName("Test")
  .config("spark.es.nodes", "host")
  .config("spark.es.port", "ports")
  .config("spark.es.nodes.wan.only", "true")
  .config("es.read.metadata", true) // for enabling metadata
  .getOrCreate()
Hope this helps

GCP Proto Datastore encode JsonProperty in base64

I store a blob of JSON in the Datastore using JsonProperty.
I don't know the structure of the JSON data.
I am using endpoints-proto-datastore in order to retrieve my data.
The problem is that the JSON property is encoded in base64, and I want a plain JSON object.
For this example, the JSON data will be:
{
    first: 1,
    second: 2
}
My code looks something like:
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel


class Model(EndpointsModel):
    data = ndb.JsonProperty()


@endpoints.api(name='myapi', version='v1', description='My Sample API')
class DataEndpoint(remote.Service):

    @Model.method(path='mymodel2', http_method='POST',
                  name='mymodel.insert')
    def MyModelInsert(self, my_model):
        my_model.data = {"first": 1, "second": 2}
        my_model.put()
        return my_model

    @Model.method(path='mymodel/{entityKey}',
                  http_method='GET',
                  name='mymodel.get')
    def getMyModel(self, model):
        print(model.data)
        return model


API = endpoints.api_server([DataEndpoint])
When I call the api for getting a model, I get:
POST /_ah/api/myapi/v1/mymodel2
{
    "data": "eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ=="
}
where eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ== is the base64 encoding of {"second": 2, "first": 1}.
And the print statement gives me: {u'second': 2, u'first': 1}
So, in the method, I can explore the json blob data as a python dict.
But, in the api call, the data is encoded in base64.
I expected the API call to give me:
{
    'data': {
        'second': 2,
        'first': 1
    }
}
How can I get this result?
After the discussion in the comments of your question, let me share with you a sample code that you can use in order to store a JSON object in Datastore (it will be stored as a string), and later retrieve it in such a way that:
It will show as plain JSON after the API call.
You will be able to parse it again to a Python dict using eval.
I hope I understood correctly your issue, and this helps you with it.
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel


class Sample(EndpointsModel):
    column1 = ndb.StringProperty()
    column2 = ndb.IntegerProperty()
    column3 = ndb.StringProperty()


@endpoints.api(name='myapi', version='v1', description='My Sample API')
class MyApi(remote.Service):

    # URL: .../_ah/api/myapi/v1/mymodel - POSTS A NEW ENTITY
    @Sample.method(path='mymodel', http_method='GET', name='Sample.insert')
    def MyModelInsert(self, my_model):
        dict = {'first': 1, 'second': 2}
        dict_str = str(dict)
        my_model.column1 = "Year"
        my_model.column2 = 2018
        my_model.column3 = dict_str
        my_model.put()
        return my_model

    # URL: .../_ah/api/myapi/v1/mymodel/{ID} - RETRIEVES AN ENTITY BY ITS ID
    @Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
    def MyModelGet(self, my_model):
        if not my_model.from_datastore:
            raise endpoints.NotFoundException('MyModel not found.')
        dict = eval(my_model.column3)
        print("This is the Python dict recovered from a string: {}".format(dict))
        return my_model


application = endpoints.api_server([MyApi], restricted=False)
I have tested this code using the development server, but it should work the same in production using App Engine with Endpoints and Datastore.
After querying the first endpoint, it will create a new entity which you will be able to find in Datastore, and which contains a property column3 with your JSON data in string format.
Then, if you use the ID of that entity to retrieve it, your browser will show the string without any strange encoding, just plain JSON.
And in the console, you will be able to see that this string can be converted to a Python dict (or to JSON, using the json module if you prefer).
I hope I have not missed any point of what you want to achieve, but I think all the most important points are covered by this code: a property holding a JSON object, storing it in Datastore, retrieving it in a readable format, and being able to use it again as a JSON/dict.
Update:
I think you should have a look at the list of available Property Types yourself, in order to find which one fits your requirements better. However, as an additional note, I have done a quick test working with a StructuredProperty (a property inside another property), by adding these modifications to the code:
# Define the nested model (your JSON object)
class Structured(EndpointsModel):
    first = ndb.IntegerProperty()
    second = ndb.IntegerProperty()


# Here I added a new property for simplicity; remember, StackOverflow does not write code for you :)
class Sample(EndpointsModel):
    column1 = ndb.StringProperty()
    column2 = ndb.IntegerProperty()
    column3 = ndb.StringProperty()
    column4 = ndb.StructuredProperty(Structured)


# Modify this endpoint definition to add a new property
@Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
def MyModelGet(self, my_model):
    if not my_model.from_datastore:
        raise endpoints.NotFoundException('MyModel not found.')
    # Add the new nested property here
    dict = eval(my_model.column3)
    my_model.column4 = dict
    print(json.dumps(my_model.column3))
    print("This is the Python dict recovered from a string: {}".format(dict))
    return my_model
With these changes, the response of the call to the endpoint now shows column4 as a JSON object itself (although it is not printed in-line, I do not think that should be a problem).
I hope this helps too. If this is not the exact behavior you want, maybe you should play around with the available Property Types, but I do not think there is one with which you can store a Python dict (or JSON object) without previously converting it to a string.

Parsing nodes on JSON with Scala

I've been asked to parse a JSON file to get all the buses that are over a specified speed input by the user.
The JSON file can be downloaded here
It's like this:
{
    "COLUMNS": [
        "DATAHORA",
        "ORDEM",
        "LINHA",
        "LATITUDE",
        "LONGITUDE",
        "VELOCIDADE"
    ],
    "DATA": [
        [
            "04-16-2015 00:00:55",
            "B63099",
            "",
            -22.7931,
            -43.2943,
            0
        ],
        [
            "04-16-2015 00:01:02",
            "C44503",
            781,
            -22.853649,
            -43.37616,
            25
        ],
        [
            "04-16-2015 00:11:40",
            "B63067",
            "",
            -22.7925,
            -43.2945,
            0
        ]
    ]
}
The thing is: I'm really new to Scala and I have never worked with JSON before (shame on me). What I need is to get the "Ordem", "Linha" and "Velocidade" values from the DATA node.
I created a case class to encapsulate all the data, so as to later look for the buses that are over the specified speed.
case class Bus(ordem: String, linha: Int, velocidade: Int)
I did this by reading the file as a text file and splitting it. However, this way I need to know the content of the file in advance in order to get to the lines after the DATA node.
I want to know how to do this using a JSON parser. I've tried many solutions, but I couldn't adapt them to my problem, because I need to extract all the lines from the DATA node instead of nodes inside one node.
Can anyone help me?
PS: Sorry for my english, not a native speaker.
First of all, you need to understand the different JSON data types. The basic types in JSON are numbers, strings, booleans, arrays, and objects. The data in your example is an object with two keys: COLUMNS and DATA. The COLUMNS key has a value that is an array of strings. The DATA key has a value which is an array of arrays of strings and numbers.
You can use a library like PlayJSON to work with this type of data:
import play.api.libs.json._

val js = Json.parse(x).as[JsObject]   // x is the raw JSON string
val keys = (js \ "COLUMNS").as[List[String]]
val values = (js \ "DATA").as[List[List[JsValue]]]
val busses = values.map(valueList => {
  val keyValues = (keys zip valueList).toMap
  for {
    ordem <- keyValues("ORDEM").asOpt[String]
    linha <- keyValues("LINHA").asOpt[Int]
    velocidade <- keyValues("VELOCIDADE").asOpt[Int]
  } yield Bus(ordem, linha, velocidade)
})
Note the use of asOpt when converting the properties to the expected types. This operator converts the key-values to the provided type if possible (wrapped in Some), and returns None otherwise. So, if you want to provide a default value instead of ignoring other results, you could use keyValues("LINHA").asOpt[Int].getOrElse(0), for example.
You can read more about the Play JSON methods used here, like \, as, and asOpt, in their docs.
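Since each element of busses is an Option[Bus], the original task (all buses over a user-specified speed) could then be finished with something like this (a sketch; minSpeed is a made-up name for the user's input):
val minSpeed = 50  // example threshold supplied by the user
val fastBuses = busses.flatten.filter(_.velocidade > minSpeed)
fastBuses.foreach(println)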
You can use Spark SQL to achieve it. Refer to the section on JSON Datasets here.
In essence, use the Spark APIs to load the JSON and register it as a temp table.
You can run your SQL queries on the table from there.
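A rough sketch of that route (untested; it assumes Spark 2.2+ for the multiLine option, and that the field positions follow the COLUMNS array, with ORDEM at index 1, LINHA at index 2 and VELOCIDADE at index 5):
// Read the single multi-line JSON object and query the DATA array with plain SQL.
val df = spark.read.option("multiLine", true).json("rj_onibus_gps.json")
df.createOrReplaceTempView("gps")

val fastBuses = spark.sql("""
  SELECT d[1] AS ordem, d[2] AS linha, CAST(d[5] AS INT) AS velocidade
  FROM (SELECT explode(DATA) AS d FROM gps)
  WHERE CAST(d[5] AS INT) > 50
""")
fastBuses.show()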
As seen in @Ben Reich's answer, that code works great. Thank you very much.
However, my JSON had some type problems with "Linha". As can be seen in the JSON example that I put in the question, there are "" values and also numbers, e.g. 781.
When trying to do keyValues("LINHA").asOpt[Int].getOrElse(0), it was producing an error saying that value flatMap is not a member of Int.
So, I had to change some things:
import scala.io.Source.fromFile
import play.api.libs.json._

case class BusContainer(ordem: String, linha: String, velocidade: Int)

val jsonString = fromFile("./project/rj_onibus_gps.json").getLines.mkString
val js = Json.parse(jsonString).as[JsObject]
val keys = (js \ "COLUMNS").as[List[String]]
val values = (js \ "DATA").as[List[List[JsValue]]]
val buses = values.map(valueList => {
  val keyValues = (keys zip valueList).toMap
  println(keyValues("ORDEM"), keyValues("LINHA"), keyValues("VELOCIDADE"))
  for {
    ordem <- keyValues("ORDEM").asOpt[String]
    linha <- keyValues("LINHA").asOpt[Int].orElse(keyValues("LINHA").asOpt[String])
    velocidade <- keyValues("VELOCIDADE").asOpt[Int]
  } yield BusContainer(ordem, linha.toString, velocidade)
})
Thanks for the help!

Gatling: Compare web service JSON response using jsonFileFeeder

I'm using a JSON feeder to compare the JSON output of web services, as follows:
val jsonFileFeeder = jsonFile("test_data.json")
val strategy = (value: Option[String], session: Session) => value.map { jsonFileFeeder =>
  val result = JSONCompare.compareJSON("expectedStr", "actualStr", JSONCompareMode.STRICT)
  if (result.failed) Failure(result.getMessage)
  else Success(value)
}.getOrElse(Failure("Missing body"))

val login = exec(http("Login")
  .get("/login"))
  .pause(1)
  .feed(feeder)
  .exec(http("authorization")
    .post("/auth")
    .headers(headers_10)
    .queryParam("""email""", "${email}")
    .queryParam("""password""", "${password}")
    .check(status.is(200))
    .check(bodyString.matchWith(strategy)))
  .pause(1)
But it throws this error:
value matchWith is not a member of io.gatling.core.check.DefaultFindCheckBuilder[io.gatling.http.check.HttpCheck,io.gatling.http.response.Response,String,String]
15:10:01.963 [ERROR] i.g.a.ZincCompiler$ - .check(bodyString.matchWith(jsonFileFeeder)))
s\lib\Login.scala:18: not found: value JSONCompare
15:10:05.224 [ERROR] i.g.a.ZincCompiler$ - val result = JSONCompare.compareJSON(jsonFileFeeder, jsonFileFeeder, JSONCompareMode.STRICT)
^
15:10:05.631 [ERROR] i.g.a.ZincCompiler$ - two errors found
Compilation failed
Here's a sample script that semantically compares a JSON response with expected output:
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.core.json.Jackson
import java.nio.charset.StandardCharsets.UTF_8
import scala.concurrent.duration._

class BasicSimulation extends Simulation {

  lazy val expectedJson = Jackson.parse(
    getClass.getResourceAsStream("/output.json"),
    UTF_8
  )

  val scn = scenario("Scenario Name")
    .exec(http("request_1")
      .get("http://localhost:8000/output.json")
      .check(bodyString.transform(Jackson.parse).is(expectedJson))
    )

  setUp(scn.inject(atOnceUsers(1)))
}
It assumes there is a file output.json in the resources directory (the directory that also contains your data and request-bodies).
However, I think you should carefully consider whether this solution is right for your needs. It won't scale as well as JSONPath or regex checks (especially for large JSON files), it's inflexible, and it seems more like a functional testing task than a performance task. I suspect that if you're trying to compare JSON files in this way, then you're probably trying to solve the wrong problem.
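For comparison, a narrower check on individual fields with Gatling's built-in jsonPath support would look something like this (a sketch; the URL and expected value are borrowed from the examples in this answer, and the scenario name is made up):
// Same request as above, but asserting only the fields that matter
// instead of comparing the whole body.
val scnNarrow = scenario("Narrow check")
  .exec(http("request_1")
    .get("http://localhost:8000/output.json")
    .check(status.is(200),
           jsonPath("$.message").is("Request Failed")))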
Note that it doesn't use jsonFile, as jsonFile is designed for use as a feeder, whereas I suspect you want to compare a single request with a hard-coded response. However, jsonFile may prove useful if you will be making a number of different requests with different parameters and expect different (known) responses. Here's an example script that takes this approach:
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.core.json.Jackson
import scala.concurrent.duration._

class BasicSimulation extends Simulation {

  val myFeed = jsonFile("json_data.json").circular

  val scn = scenario("Scenario Name")
    .feed(myFeed)
    .exec(http("request_1")
      .get("${targetUrl}")
      .check(bodyString.transform(Jackson.parse).is("${expectedResponse}"))
    )

  setUp(scn.inject(atOnceUsers(2)))
}
It assumes there is a json resource in data/json_data.json, that looks something like the following:
[
    {
        "targetUrl": "http://localhost:8000/failure.json",
        "expectedResponse":
        {
            "success": false,
            "message": "Request Failed"
        }
    },
    {
        "targetUrl": "http://localhost:8000/success.json",
        "expectedResponse":
        {
            "success": true,
            "message": "Request Succeeded"
        }
    }
]
The expectedResponse should be the exact JSON you expect to get back from the server. And of course you don't just have to parameterise targetUrl, you can parameterise whatever you want in this way.
As an aside, you may also be interested to know that Gatling 2.1 is expected to allow comparing a response with a file without using hacks like these (although the current development version only supports comparing byte-for-byte, not comparing-as-JSON).