BSON structure created by Apache Spark and MongoDB Hadoop-Connector

I'm trying to save some JSON from Spark (Scala) to MongoDB using the MongoDB Hadoop-Connector. The problem I'm having is that this API always seems to save your data as "{_id: ..., value: {your JSON document}}".
In the code example below, my document gets saved like this:
{
  "_id" : ObjectId("55e80cfea9fbee30aa703261"),
  "value" : {
    "_id" : "55e6c65da9fbee285f2f9175",
    "year" : 2014,
    "month" : 5,
    "day" : 6,
    "hour" : 18,
    "user_id" : 246
  }
}
Is there any way to persuade the MongoDB Hadoop Connector to write the JSON/BSON in the structure you've specified, instead of nesting it under these _id/value fields?
Here's my Scala Spark code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.{BSONObject, Document}
import com.mongodb.hadoop.MongoOutputFormat

val jsonstr = List("""{
  "_id" : "55e6c65da9fbee285f2f9175",
  "year" : 2014,
  "month" : 5,
  "day" : 6,
  "hour" : 18,
  "user_id" : 246}""")

val conf = new SparkConf().setAppName("Mongo Dummy").setMaster("local[*]")
val sc = new SparkContext(conf)

// DB params
val host = "127.0.0.1"
val port = "27017"
val database = "dummy"
val collection = "fubar"

// input is the collection we want to read (not doing so here)
val mongo_input = s"mongodb://$host/$database.$collection"
// output is the collection we want to write
val mongo_output = s"mongodb://$host/$database.$collection"

// Set up extra config for the Hadoop connector
val hadoopConfig = new Configuration()
//hadoopConfig.set("mongo.input.uri", mongo_input)
hadoopConfig.set("mongo.output.uri", mongo_output)

// convert JSON to RDD
val rdd = sc.parallelize(jsonstr)

// write JSON data to DB
val saveRDD = rdd.map { json =>
  (null, Document.parse(json))
}
saveRDD.saveAsNewAPIHadoopFile("file:///bogus",
  classOf[Object],
  classOf[BSONObject],
  classOf[MongoOutputFormat[Object, BSONObject]],
  hadoopConfig)

// Finished
sc.stop()
And here's my SBT:
name := "my-mongo-test"
version := "1.0"
scalaVersion := "2.10.4"
// Spark needs to appear in SBT BEFORE Mongodb connector!
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"
// MongoDB-Hadoop connector
libraryDependencies += "org.mongodb.mongo-hadoop" % "mongo-hadoop-core" % "1.4.0"
To be honest, I'm kind of mystified at how hard it seems to be to save JSON --> BSON --> MongoDB from Spark. So any suggestions on how to save my JSON data more flexibly would be welcomed.

Well, I just found the solution. It turns out that the MongoRecordWriter used by MongoOutputFormat inserts any value that does not inherit from BSONWritable, MongoOutput or BSONObject under a value field.
The simplest solution, therefore, is to create an RDD whose values are BSONObject rather than Document.
I tried this solution in Java, but I'm sure it will work in Scala as well. Here is some sample code:
// `values` here is an existing JavaRDD of parsed input lines
JavaPairRDD<Object, BSONObject> bsons = values.mapToPair(lineValues -> {
    BSONObject doc = new BasicBSONObject();
    doc.put("field1", lineValues.get(0));
    doc.put("field2", lineValues.get(1));
    return new Tuple2<Object, BSONObject>(UUID.randomUUID().toString(), doc);
});

Configuration outputConfig = new Configuration();
outputConfig.set("mongo.output.uri",
        "mongodb://localhost:27017/my_db.lines");

bsons.saveAsNewAPIHadoopFile("file:///this-is-completely-unused"
        , Object.class
        , BSONObject.class
        , MongoOutputFormat.class
        , outputConfig);
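For completeness, here is a rough Scala sketch of the same fix, reusing rdd, mongo_output and hadoopConfig from the question above. It is untested, and it assumes the legacy com.mongodb.util.JSON parser (bundled with the MongoDB Java driver that mongo-hadoop pulls in) is available to turn each JSON string into a BSONObject:

import org.bson.BSONObject
import com.mongodb.hadoop.MongoOutputFormat

// Parse each JSON string into a BSONObject (com.mongodb.util.JSON.parse returns
// a BasicDBObject, which implements BSONObject), so MongoRecordWriter writes it
// as the top-level document instead of wrapping it under a "value" field.
val bsonRDD = rdd.map { json =>
  (null, com.mongodb.util.JSON.parse(json).asInstanceOf[BSONObject])
}

bsonRDD.saveAsNewAPIHadoopFile("file:///this-is-unused",
  classOf[Object],
  classOf[BSONObject],
  classOf[MongoOutputFormat[Object, BSONObject]],
  hadoopConfig)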

Related

Groovy - Parse XML Response, Select Fields, Create New File

I think I've read every Groovy parsing question on here, but I can't seem to find my exact scenario, so I'm reaching out for help. Please be kind: I'm new to Groovy, and I've really bitten off more than I can chew in this latest endeavor.
So I have this XML Response:
<?xml version="1.0" encoding="UTF-8"?>
<worklogs date_from="2020-04-19 00:00:00" date_to="2020-04-25 23:59:59" number_of_worklogs="60" format="xml" diffOnly="false" errorsOnly="false" validOnly="false" addDeletedWorklogs="true" addBillingInfo="false" addIssueSummary="true" addIssueDescription="false" duration_ms="145" headerOnly="false" userName="smm288" addIssueDetails="false" addParentIssue="false" addUserDetails="true" addWorklogDetails="false" billingKey="" issueKey="" projectKey="" addApprovalStatus="true" >
  <worklog>
    <worklog_id></worklog_id>
    <jira_worklog_id></jira_worklog_id>
    <issue_id></issue_id>
    <issue_key></issue_key>
    <hours></hours>
    <work_date></work_date>
    <username></username>
    <staff_id />
    <billing_key></billing_key>
    <billing_attributes></billing_attributes>
    <activity_id></activity_id>
    <activity_name></activity_name>
    <work_description></work_description>
    <parent_key></parent_key>
    <reporter></reporter>
    <external_id />
    <external_tstamp />
    <external_hours></external_hours>
    <external_result />
    <customField_11218></customField_11218>
    <customField_12703></customField_12703>
    <customField_12707></customField_12707>
    <hash_value></hash_value>
    <issue_summary></issue_summary>
    <user_details>
      <full_name></full_name>
      <email></email>
      <user-prop key="auto_approve_timesheet"></user-prop>
      <user-prop key="cris_id"></user-prop>
      <user-prop key="iqn_gl_string"></user-prop>
      <user-prop key="is_contractor"></user-prop>
      <user-prop key="is_employee"></user-prop>
      <user-prop key="it_leadership"></user-prop>
      <user-prop key="primary_role"></user-prop>
      <user-prop key="resource_manager"></user-prop>
      <user-prop key="team"></user-prop>
    </user_details>
    <approval_status></approval_status>
    <timesheet_approval>
      <status></status>
      <status_date></status_date>
      <reviewer></reviewer>
      <actor></actor>
      <comment></comment>
    </timesheet_approval>
  </worklog>
  ....
  ....
</worklogs>
And I'm retrieving this XML Response from an API call so the response is held within an object. NOTE: The sample XML above is from Postman.
What I'm trying to do is the following:
1. Only retrieve certain values from this response from all the nodes.
2. Write the values collected to a .json file.
I've created a map but now I'm kind of stuck on how to parse through it and create a .json file out of the fields I want.
This is what I have thus far:
@Grab('org.codehaus.groovy.modules.http-builder:http-builder:0.7.1')
@Grab('oauth.signpost:signpost-core:1.2.1.2')
@Grab('oauth.signpost:signpost-commonshttp4:1.2.1.2')
import groovyx.net.http.RESTClient
import groovyx.net.http.Method
import static groovyx.net.http.ContentType.*
import groovyx.net.http.HttpResponseException
import groovy.json.JsonBuilder
import groovy.json.JsonOutput
import groovy.json.*

// User Credentials
def jiraAuth = ""

// JIRA Endpoints
//def jiraUrl = "" //Dev
def jiraUrl = "" //Production

// Tempo API Tokens
//def tempoApiToken = "" //Dev
//def tempoApiToken = "" //Production

// Define Weekly Date Range
def today = new Date()
def lastPeriodStart = today - 8
def lastPeriodEnd = today - 2
def dateFrom = lastPeriodStart.format("yyyy-MM-dd")
def dateTo = lastPeriodEnd.format("yyyy-MM-dd")

def jiraClient = new RESTClient(jiraUrl)
jiraClient.ignoreSSLIssues()

def headers = [
    "Authorization"    : "Basic " + jiraAuth,
    "X-Atlassian-token": "no-check",
    "Content-Type"     : "application/json"
]

def response = jiraClient.get(
    path: "",
    query: [
        tempoApiToken: "${tempoApiToken}",
        format: "xml",
        dateFrom: "${dateFrom}",
        dateTo: "${dateTo}",
        addUserDetails: "true",
        addApprovalStatus: "true",
        addIssueSummary: "true"
    ],
    headers: headers
) { response, worklogs ->
    println "Processing..."
    // Start building the Output - Creates a Worklog Map
    worklogs.worklog.each { worklognodes ->
        def workLog = convertToMap(worklognodes)
        // Print out the Map
        println(workLog)
    }
}

// Helper Method
def convertToMap(nodes) {
    nodes.children().collectEntries {
        if (it.name() == 'user-prop') {
            [it['@key'], it.childNodes() ? convertToMap(it) : it.text()]
        } else {
            [it.name(), it.childNodes() ? convertToMap(it) : it.text()]
        }
    }
}
I'm only interested in parsing out the following fields from each node:
<worklogs>
  <worklog>
    <hours>
    <work_date>
    <billing_key>
    <customField_11218>
    <issue_summary>
    <user_details>
      <full_name>
      <user-prop key="auto_approve_timesheet">
      <user-prop key="it_leadership">
      <user-prop key="resource_manager">
      <user-prop key="team">
      <user-prop key="cris_id">
      <user-prop key="iqn_id">
    <approval_status>
  </worklog>
  ...
</worklogs>
I've tried the following:
1. Converting workLog to a JSON string (JsonOutput.toJson) and then pretty-printing it (JsonOutput.prettyPrint) - but this just returns a collection of .json responses which I can't do anything with (my thought process was: this is as good as I can get, and I'll just use a .json-to-.csv converter and get rid of what I don't want) - which is not the solution I ultimately want.
2. Printing the map workLog just returns little collections which I can't do anything with either.
3. Creating a new file using File and writing workLog out as a .json file, but again, it doesn't translate well.
The result of the println for workLog is here (just so everyone can see that the response is being held and the map matches the XML response).
[worklog_id: , jira_worklog_id: , issue_id: , issue_key: , hours: , work_date: , username: , staff_id: , billing_key: , billing_attributes: , activity_id: , activity_name: , work_description: , parent_key: , reporter: , external_id:, external_tstamp:, external_hours: , external_result:, customField_11218: , hash_value: , issue_summary: , user_details:[full_name: , email: , auto_approve_timesheet: , cris_id: , iqn_gl_approver: , iqn_gl_string: , iqn_id: , is_contractor: , is_employee: , it_leadership: , primary_role: , resource_manager: , team: ], approval_status: , timesheet_approval:[status: ]]
I would so appreciate it if anyone could offer some insights on how to move forward or even documentation that has good examples of what I'm trying to achieve (Apache's documentation is sorely lacking in examples, in my opinion).
It's not all of the way there, but I was able to get a JSON file created from the XML and the map. From there I can just use the .json file to create a .csv and then get rid of the columns I don't want.
// Define Weekly Date Range
def today = new Date()
def lastPeriodStart = today - 8
def lastPeriodEnd = today - 2
def dateFrom = lastPeriodStart.format("yyyy-MM-dd")
def dateTo = lastPeriodEnd.format("yyyy-MM-dd")

def jiraClient = new RESTClient(jiraUrl)
jiraClient.ignoreSSLIssues()

// Creates and Begins the File
File file = new File("${dateFrom}_RPT05.json")
file.write("")
file.append("[\n")

// Defines the File
def arrplace = 0
def arrsize = 0

def headers = [
    "Authorization"    : "Basic " + jiraAuth,
    "X-Atlassian-token": "no-check",
    "Content-Type"     : "application/json"
]

def response = jiraClient.get(
    path: "/plugins/servlet/tempo-getWorklog/",
    query: [
        tempoApiToken: "${tempoApiToken}",
        format: "xml",
        dateFrom: "${dateFrom}",
        dateTo: "${dateTo}",
        addUserDetails: "true",
        addApprovalStatus: "true",
        addIssueSummary: "true"
    ],
    headers: headers
) { response, worklogs ->
    println "Processing..."
    // Gets Size of Array
    worklogs.worklog.each { worklognodes ->
        arrsize = arrsize + 1
    }
    // Start Building the Output - Creates a Worklog Map
    worklogs.worklog.each { worklognodes ->
        worklognodes = convertToMap(worklognodes)
        // Convert Map to a JSON String
        def json_str = JsonOutput.toJson(worklognodes)
        // Adds Row to File
        file.append(json_str)
        arrplace = arrplace + 1
        if (arrplace < arrsize) {
            file.append(",")
        }
        file.append("\n")
        print "."
    }
}
file.append("]")

// Helper Method
def convertToMap(nodes) {
    nodes.children().collectEntries {
        if (it.name() == 'user-prop') {
            [it['@key'], it.childNodes() ? convertToMap(it) : it.text()]
        } else {
            [it.name(), it.childNodes() ? convertToMap(it) : it.text()]
        }
    }
}
The output is a collection/array of worklogs.

Using Microsoft.FSharpLu to serialize JSON to a stream

I've been using the Newtonsoft.Json and Newtonsoft.Json.Fsharp libraries to create a new JSON serializer and stream to a file. I like the ability to stream to a file because I'm handling large files and, prior to streaming, often ran into memory issues.
I stream with a simple function:
open Newtonsoft.Json
open Newtonsoft.Json.FSharp
open System.IO

let writeToJson (path: string) (obj: 'a) : unit =
    let serialized = JsonConvert.SerializeObject(obj)
    let fileStream = new StreamWriter(path)
    let serializer = new JsonSerializer()
    serializer.Serialize(fileStream, obj)
    fileStream.Close()
This works great. My problem is that the JSON string is then absolutely cluttered with stuff I don't need. For example,
let m =
    [
        (1.0M, None)
        (2.0M, Some 3.0M)
        (4.0M, None)
    ]
let makeType (tup: decimal * decimal option) = {FieldA = fst tup; FieldB = snd tup}
let y = List.map makeType m

Default.serialize y

val it : string =
  "[{"FieldA": 1.0},
    {"FieldA": 2.0,
     "FieldB": {
       "Case": "Some",
       "Fields": [3.0]
     }},
    {"FieldA": 4.0}]"
If this is written to a JSON and read into R, there are nested dataframes and any of the Fields associated with a Case end up being a list:
library(jsonlite)
library(dplyr)

q <- fromJSON("default.json")
x <-
  q %>%
  flatten()
x

> x
  FieldA FieldB.Case FieldB.Fields
1      1        <NA>          NULL
2      2        Some             3
3      4        <NA>          NULL

> sapply(x, class)
       FieldA FieldB.Case FieldB.Fields
    "numeric" "character"        "list"
I don't want to have to handle these things in R. I can do it but it's annoying and, if there are files with many, many columns, it's silly.
This morning, I started looking at the Microsoft.FSharpLu.Json documentation. This library has a Compact.serialize function. Quick tests suggest that this library will eliminate the need for nested dataframes and the lists associated with any Case and Field columns. For example:
Compact.serialize y

val it : string =
  "[{
     "FieldA": 1.0
   },
   {
     "FieldA": 2.0,
     "FieldB": 3.0
   },
   {
     "FieldA": 4.0
   }
  ]"
When this string is read into R,
q <- fromJSON("compact.json")
x <- q
x

> x
  FieldA FieldB
1      1     NA
2      2      3
3      4     NA

> sapply(x, class)
   FieldA    FieldB
"numeric" "numeric"
This is much simpler to handle in R, and I'd like to start using this library.
However, I don't know if I can get the Compact serializer to serialize to a stream. I see .serializeToFile, .deserializeStream, and .tryDeserializeStream, but nothing that can serialize to a stream. Does anyone know if Compact can handle writing to a stream? How can I make that work?
The helper to serialize to a stream is missing from the Compact module in FSharpLu.Json, but you should be able to do it by following the C# example from
http://www.newtonsoft.com/json/help/html/SerializingJSON.htm. Something along the lines of:
let writeToJson (path: string) (obj: 'a) : unit =
    let serializer = new JsonSerializer()
    serializer.Converters.Add(new Microsoft.FSharpLu.Json.CompactUnionJsonConverter())
    use sw = new StreamWriter(path)
    use writer = new JsonTextWriter(sw)
    serializer.Serialize(writer, obj)

json lookup list with enumeration in it

I'm currently learning Python and I'm familiar (still a beginner) with JSON.
My goal is to have one JSON with many lookup lists that may hold duplicated values under different keys.
Instead of duplicating the lists, I'm trying to find a way to keep only one copy and reuse it. I have made this simple example:
import json
json_enum1 = '{"01" : "ab", "02" : "cd"}'
json_enum2 = '{"01" : "zz", "02" : "xx"}'
json_string = '{"val1": null, "val2": null, "val3": null}'
parsed_json = json.loads(json_string)
parsed_enum1 = json.loads(json_enum1)
parsed_enum2 = json.loads(json_enum2)
parsed_json['val1'] = parsed_enum1
parsed_json['val2'] = parsed_enum2
parsed_json['val3'] = parsed_enum1
print(parsed_json)
print(parsed_json['val1']['01'])
print(parsed_json['val2']['01'])
print(parsed_json['val3']['02'])
result
{u'val3': {u'02': u'cd', u'01': u'ab'}, u'val2': {u'02': u'xx', u'01':
u'zz'}, u'val1': {u'02': u'cd', u'01': u'ab'}}
ab
zz
cd
I could also do that:
import json
json_string = '{"val1": {"01" : "ab", "02" : "cd"}, "val2": {"01" : "zz", "02" : "xx"}, "val3": {"01" : "ab", "02" : "cd"}}'
parsed_json = json.loads(json_string)
print(parsed_json)
print(parsed_json['val1']['01'])
print(parsed_json['val2']['01'])
print(parsed_json['val3']['02'])
which gives the same result, but now if json_enum1 changes, I need to change it twice.
And this is a very small example; the real data is way bigger.
My question is: is there a better way of doing what I'm describing/showing?

Look for JSON example with all allowed combinations of structure in max depth 2 or 3

I've wrote a program which process JSON objects. Now I want to verify if I've missed something.
Is there an JSON-example of all allowed JSON structure combinations? Something like this:
{
  "key1" : "value",
  "key2" : 1,
  "key3" : {"key1" : "value"},
  "key4" : [
    [
      "string1",
      "string2"
    ],
    [
      1,
      2
    ],
    ...
  ],
  "key5" : true,
  "key6" : false,
  "key7" : null,
  ...
}
As you can see at http://json.org/ on the right-hand side, the grammar of JSON isn't very complicated, but I've hit several exceptions because I forgot to handle some structure combinations that are possible. E.g. inside an array there can be "string, number, object, array, true, false, null", but my program couldn't handle arrays inside an array until I ran into an exception. So everything was fine until I got a valid JSON object with arrays inside an array.
I want to test my program with a JSON object (which I'm looking for). After this test I want to feel certain that my program handles every possible valid JSON structure on earth without an exception.
I don't need nesting to depth 5 or so. I only need something nested to depth 2 or at most 3, with every base type nesting all the allowed base types inside it.
Have you thought of escaped characters and objects within an object?
{
  "key1" : {
    "key1" : "value",
    "key2" : [
      "String1",
      "String2"
    ]
  },
  "key2" : "\"This is a quote\"",
  "key3" : "This contains an escaped slash: \\",
  "key4" : "This contains accented characters: \u00eb \u00ef"
}
Note: \u00eb and \u00ef are the characters ë and ï, respectively.
Choose a programming language that supports JSON.
Try to load your JSON; on failure, the exception's message is descriptive.
Example:
Python:
import json, sys;
json.loads(open(sys.argv[1]).read())
Generate:
import random, json, os, string

def json_null(depth = 0):
    return None

def json_int(depth = 0):
    return random.randint(-999, 999)

def json_float(depth = 0):
    return random.uniform(-999, 999)

def json_string(depth = 0):
    return ''.join(random.sample(string.printable, random.randrange(10, 40)))

def json_bool(depth = 0):
    return random.randint(0, 1) == 1

def json_list(depth):
    lst = []
    if depth:
        for i in range(random.randrange(8)):
            lst.append(gen_json(random.randrange(depth)))
    return lst

def json_object(depth):
    obj = {}
    if depth:
        for i in range(random.randrange(8)):
            obj[json_string()] = gen_json(random.randrange(depth))
    return obj

def gen_json(depth = 8):
    if depth:
        return random.choice([json_list, json_object])(depth)
    else:
        return random.choice([json_null, json_int, json_float, json_string, json_bool])(depth)

print(json.dumps(gen_json(), indent = 2))

How to handle errors when JSON data has an incorrect node

I'm expecting the following JSON format:
'{
  "teamId" : 9,
  "teamMembers" : [ {
    "userId" : 1000
  }, {
    "userId" : 2000
  } ]
}'
If I test my code with the following format:
'{
  "teaXmId" : 9,
  "teamMembers" : [ {
    "usXerId" : 1000
  }, {
    "userXId" : 2000
  } ]
}'
I'm parsing the JSON values as follows:
val userId = (request.body \\ "userId")
val teamId = (request.body \ "teamId")
val list = userId.toList
list.foreach( x => Logger.info("x val: " + x) )
It doesn't throw any error to handle; code execution just goes on. Later, when I try to use teamId or userId, of course it doesn't work.
So how do I check whether the parsing was done correctly, or stop execution right away and notify the user to provide the correct JSON format?
If a value is not found when using \, then the result will be of type JsUndefined(msg). You can throw an error immediately by making sure you have the type you expect:
val teamId = (request.body \ "teamId").as[Int]
or:
val JsNumber(teamId) = request.body \ "teamId" //teamId will be BigDecimal
When using \\, if nothing is found, then an empty List is returned, which makes sense. If you want to throw an error when a certain key is not found on any object of an array, you might get the object that contains the list and proceed from there:
val teamMembers = (request.body \ "teamMembers").as[Seq[JsValue]]
or:
val JsObject(teamMembers) = request.body \ "teamMembers"
And then:
val userIds = teamMembers.map(v => (v \ "userId").as[Int])
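If you would rather report the problem back to the caller than let .as throw, here is a rough sketch using validate and pattern matching on JsResult. The controller wiring (Results.Ok / Results.BadRequest and the handleTeam helper) is assumed for illustration, not taken from the question:

import play.api.libs.json._
import play.api.mvc.Results

// Validate the expected structure up front; JsResult lets us branch on
// success vs. failure instead of letting .as[...] throw somewhere later.
def handleTeam(body: JsValue) = {
  val teamIdResult = (body \ "teamId").validate[Int]
  val userIdResults = (body \ "teamMembers")
    .validate[Seq[JsValue]]
    .map(_.map(member => (member \ "userId").validate[Int]))

  (teamIdResult, userIdResults) match {
    case (JsSuccess(teamId, _), JsSuccess(ids, _)) if ids.forall(_.isSuccess) =>
      val userIds = ids.collect { case JsSuccess(id, _) => id }
      Results.Ok(s"teamId=$teamId, userIds=${userIds.mkString(",")}")
    case _ =>
      Results.BadRequest("Expected JSON with teamId and teamMembers[].userId")
  }
}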