read.json only reading the first object in Spark - json

I have a multi-line JSON file that I am reading with Spark's read.json; the problem is that it only reads the first object from the file:
val dataFrame = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(path)
dataFrame.rdd.saveAsTextFile("DataFrame")
Sample JSON:
{
  "_id" : "589895e123c572923e69f5e7",
  "thing" : "54eb45beb5f1e061454c5bf4",
  "timeline" : [
    {
      "reason" : "TRIP_START",
      "timestamp" : "2017-02-06T17:20:18.007+02:00",
      "type" : "TRIP_EVENT",
      "location" : [
        11.1174091,
        69.1174091
      ],
      "endLocation" : [],
      "startLocation" : []
    },
    {
      "reason" : "TRIP_END",
      "timestamp" : "2017-02-06T17:25:26.026+02:00",
      "type" : "TRIP_EVENT",
      "location" : [
        11.5691428,
        48.1122443
      ],
      "endLocation" : [],
      "startLocation" : []
    }
  ],
  "__v" : 0
}
{
  "_id" : "589895e123c572923e69f5e8",
  "thing" : "54eb45beb5f1e032241c5bf4",
  "timeline" : [
    {
      "reason" : "TRIP_START",
      "timestamp" : "2017-02-06T17:20:18.007+02:00",
      "type" : "TRIP_EVENT",
      "location" : [
        11.1174091,
        50.1174091
      ],
      "endLocation" : [],
      "startLocation" : []
    },
    {
      "reason" : "TRIP_END",
      "timestamp" : "2017-02-06T17:25:26.026+02:00",
      "type" : "TRIP_EVENT",
      "location" : [
        51.1174091,
        69.1174091
      ],
      "endLocation" : [],
      "startLocation" : []
    }
  ],
  "__v" : 0
}
I get only the first entry with id = 589895e123c572923e69f5e7.
Is there something that I am doing wrong?

Are you sure multiple multi-line JSON objects per file are supported?
Each line must contain a separate, self-contained valid JSON object... For a regular multi-line JSON file, set the multiLine option to true
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
Where a "regular JSON file" means the entire file is a singular JSON object / array, however, simply putting {} around your data won't work because you need a key for every object, and so you'd need a top level key, maybe say "objects". Similarly, you can try an array, but wrapping with []. Either way, these will only work if every object in that array or object is separated by commas.
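For example, the two sample records become parseable in multiLine mode once they are wrapped in a top-level array and separated by a comma (fields abbreviated here):
[
  { "_id" : "589895e123c572923e69f5e7", "thing" : "54eb45beb5f1e061454c5bf4", ... },
  { "_id" : "589895e123c572923e69f5e8", "thing" : "54eb45beb5f1e032241c5bf4", ... }
]
Spark then reads the file as a single document and produces one row per array element.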
tl;dr - the whole file needs to be one valid JSON value when multiLine=true.
You're only getting one object because the parser consumes the first set of braces and stops.
If you have full control over the JSON file, the indented layout is purely for human consumption. Just flatten the objects and let Spark parse the file the way the API is intended to be used.

Keep one JSON value per line in the file and remove .option("multiLine", true),
like this:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Related

How can I match fields with wildcards using jq?

I have a JSON object of the following form:
{
  "Task11c-0-20181209-12:59:30-65611" : {
    "attributes" : {
      "configname" : "Task11c",
      "datetime" : "20181209-12:59:30",
      "experiment" : "Task11c",
      "inifile" : "lab1.ini",
      "iterationvars" : "",
      "iterationvarsf" : "",
      "measurement" : "",
      "network" : "Manhattan1_1C",
      "processid" : "65611",
      "repetition" : "0",
      "replication" : "#0",
      "resultdir" : "results",
      "runnumber" : "0",
      "seedset" : "0"
    },
    ......
  },
  ......
  "Task11b-12-20181209-13:03:17-65612" : {
    ....
    ....
  },
  .......
}
I reported only the first part, but in general I have many other sub-objects whose keys match a string like Task11c-0-20181209-12:59:30-65611. They all start with the word Task. I want to extract the processid from each sub-object. I'm trying to use a wildcard as in bash, but that doesn't seem to be possible.
I also read about the match() function, but it works on strings, not JSON objects.
Thanks for the support.
Filter the keys that start with Task and get the attribute of your choice using a select() expression:
jq 'to_entries[] | select(.key|startswith("Task")).value.attributes.processid' json
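Run against the sample above, this prints the processid of every sub-object whose key starts with Task; for the part shown it outputs:
"65611"
to_entries[] turns each top-level key/value pair into a {key, value} object, so startswith() can test the key, while .value.attributes.processid extracts the field.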

Convert MongoDB Document to Extended JSON in Shell

I am looking for a shell tool that can convert a MongoDB document into Extended JSON.
So if the original JSON file looks like this:
{
  "_id" : ObjectId("5a8c60b8c83eaf000fb39547"),
  "name" : "myName",
  "created" : ISODate("2018-02-20T17:54:00.091Z"),
  "components" : [
...
The result would be something like this:
{
  "_id" : { "$oid" : "5a8c60b8c83eaf000fb39547" },
  "name" : "myName",
  "created" : { "$date" : "2018-02-20T17:54:00.091Z" },
  "components" : [
...
The MongoDB shell speaks JavaScript, so the answer is simple: use JSON.stringify(). If your command is db.serverStatus(), then you can simply do this:
JSON.stringify(db.serverStatus())
This won't output the proper "strict mode" representation of each of the fields ({ "floatApprox": <number> } instead of { "$numberLong": "<number>" }), but if what you care about is getting standards-compliant JSON out, this'll do the trick.
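The same one-liner works from the command line for documents in a collection (a sketch, assuming a database named mydb and a collection named things):
mongo mydb --quiet --eval 'JSON.stringify(db.things.findOne())'
Here --quiet suppresses the shell banner so only the JSON reaches stdout.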

Is updating an object nested inside an array via JsonPatchDocument possible?

I'm using Microsoft.AspNetCore.JsonPatch V2.1.1.
I have an object structure like this:
{
  "key" : "value",
  "nested" : [
    { "key" : "value" }
  ]
}
Now I want to update the key in the nested object within the array:
[
  { "op" : "replace", "path" : "/nested/0/key", "value" : "test" }
]
But I get this exception:
JsonPatchException: The target location specified by path segment '0' was not found.
Do I have to explicitly make an endpoint for the inner object with its own PATCH method?

Avoid repetition in JSON file

I am not familiar with JSON objects and I want to use them with Python. I have a JSON object like this:
{
  "a" : { "value" : 20200212, "conversion" : { "fun" : ["strptime"], "module" : ["datetime"], "extra_args" : ["%Y%m%d"] } },
  "b" : { "value" : "something here" },
  "c" : { "value" : 20211121, "conversion" : { "fun" : ["strptime"], "module" : ["datetime"], "extra_args" : ["%Y%m%d"] } }
}
My question: is it possible not to repeat this part in the file?
{ "fun" : ["strptime"], "module" : ["datetime"], "extra_args" : ["%Y%m%d"] }

Is it possible to extract specific data from a JSON document without reading all the values?

I have this JSON data.
My question is: is it possible to extract specific data from the JSON without reading all the values?
I mean, is it possible to query the data as we do in SQL?
{ "_id" : ObjectId("4e61501e6a73bc73f82f91f3"), "created_at" : "2011-09-02 17:52:30.285", "cust_id" : "sdtest", "moduleName" : "balances", "responses" : [
{
"questionNum" : "1",
"answer" : "Hard",
"comments" : "is that you john wayne?"
},
{
"questionNum" : "2",
"answer" : "Somewhat",
"comments" : "ARg!"
},
{
"questionNum" : "3",
"answer" : "",
"comments" : "Yes"
}
] }
Yes, but you will need to write extra code to do it, or use a third-party library. There are a few available: http://www.google.co.uk/search?q=json+linq+sql
Well, unless you use an incremental JSON parser, you'll have to parse the whole JSON first. After that, how you filter depends on what your programming language offers. For example, in Python:
import json

obj = json.loads(jsonData)  # jsonData is the raw JSON string shown above
# keep only the responses whose "answer" field is non-empty
answeredQuestions = [r for r in obj["responses"] if r["answer"]]
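Applied to the sample document, this keeps responses 1 and 2 and drops response 3, whose answer is the empty string.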