Streaming huge JSON with Akka Streams

I have a problem with a huge HTTP response containing a JSON slab, where only a portion is of interest.
I cannot change the response structure.
Here is an example:
{
  "searchString": "search",
  "redirectUrl": "",
  "0": {
    "numRecords": 123,
    "refinementViewModelCollector": {},
    // lots of data here
    "results": [
      {
        "productCode": "123",
        "productShortDescription": "Desc",
        "brand": "Brand",
        "productReview": {
          "reviewScore": 0
        },
        "priceView": {
          "salePriceDisplayable": false
        },
        "productImageUrl": "url",
        "alternateImageUrls": [
          "url1"
        ],
        "largeProductImageUrl": "url4",
        "videoUrl": ""
      },
      {
        "productCode": "124",
        "productShortDescription": "Desc",
        "brand": "Brand",
        "productReview": {
          "reviewScore": 0
        },
        "priceView": {
          "salePriceDisplayable": false
        },
        "preOrder": false,
        "productImageUrl": "url",
        "alternateImageUrls": [
          "url1"
        ],
        "largeProductImageUrl": "url4",
        "videoUrl": ""
      }
    ]
    // lots of data here
  }
}
My point of interest is the entries in the results JSON array, but they are sitting in the middle of the JSON.
I created a small Play WS client like this:
val wsClient: WSClient = ???
val ret = wsClient.url("url").stream()
ret.flatMap { response =>
  response.body.via(JsonFraming.objectScanner(1024))
    .map(_.utf8String)
    .runWith(Sink.foreach(println))
}
This will not work, because it takes the whole JSON slab as a single JSON object. I need to skip some data until the "results": entry appears in the stream, then start parsing entries, and skip all the rest.
Any ideas how to do this?

Check out Alpakka's JSON module, which can stream specific parts of a nested JSON structure:
response
  .body
  .via(JsonReader.select("$.0.results[*]"))
  .map(_.utf8String)
  .runWith(Sink.foreach(println)) // or runForeach(println)
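
For completeness, here is one way the whole pipeline could look. This is a minimal sketch, not a confirmed solution: the Result fields, the "url" placeholder, and the decoding step are assumptions, and JsonReader comes from the akka-stream-alpakka-json-streaming artifact:

import akka.actor.ActorSystem
import akka.stream.alpakka.json.scaladsl.JsonReader
import akka.stream.scaladsl.Sink
import play.api.libs.json.{Json, Reads}
import play.api.libs.ws.WSClient

implicit val system: ActorSystem = ActorSystem()
import system.dispatcher

// Only the fields we care about; other fields in each element are ignored.
case class Result(productCode: String, brand: String)
implicit val resultReads: Reads[Result] = Json.reads[Result]

val wsClient: WSClient = ???

wsClient.url("url").stream().flatMap { response =>
  response.body // on Play 2.6+ use response.bodyAsSource
    .via(JsonReader.select("$.0.results[*]")) // one ByteString per "results" element
    .map(bytes => Json.parse(bytes.utf8String).as[Result])
    .runWith(Sink.foreach(println))
}

Only the currently matched element needs to be buffered, so the rest of the slab never sits in memory.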

There are parsers that support parsing a stream. For a good example, check out this Circe example: https://github.com/circe/circe/tree/master/examples/sf-city-lots
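
The linked example streams a very large JSON document. A rough sketch of that idea with circe-fs2, assuming its byteArrayParser and decoder pipes and a placeholder Result case class (note byteArrayParser expects the array at the top level, unlike the nested results above):

import cats.effect.IO
import fs2.Stream
import io.circe.fs2.{byteArrayParser, decoder}
import io.circe.generic.auto._

// Placeholder shape for one element of the streamed array.
case class Result(productCode: String, brand: String)

// The byte source is left abstract; it could be a file or an HTTP body.
val bytes: Stream[IO, Byte] = ???

// byteArrayParser incrementally parses a top-level JSON array and emits
// its elements one by one; decoder turns each Json value into a Result.
val results: Stream[IO, Result] =
  bytes.through(byteArrayParser).through(decoder[IO, Result])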

I'd love a better, Scala-specific answer to this question, but check out the "Mixed Reads Example" in the documentation for Google's GSON library:
https://sites.google.com/site/gson/streaming
Gson also supports mixed streaming & object model access. This lets your application have the best of both worlds: the productivity of object model access with the efficiency of streaming.
...
This code reads a JSON document containing an array of messages. It steps through array elements as a stream to avoid loading the complete document into memory. It is concise because it uses Gson's object model to parse the individual messages.
This should have great memory performance (the code reads from a Java InputStream, so the full structure is never in memory), but may require some effort to get your results into Scala case classes.
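
As a sketch of what that mixed approach could look like when called from Scala (the Result class is a placeholder, and the skip-until-"results" logic mirrors the structure of the JSON in the question):

import java.io.{InputStream, InputStreamReader}
import com.google.gson.Gson
import com.google.gson.stream.JsonReader

// Placeholder for the fields of one "results" element; Gson fills it by reflection.
class Result {
  var productCode: String = _
  var brand: String = _
}

def readResults(in: InputStream): Unit = {
  val gson = new Gson()
  val reader = new JsonReader(new InputStreamReader(in, "UTF-8"))
  reader.beginObject() // the outer { ... }
  while (reader.hasNext) {
    reader.nextName() match {
      case "0" =>
        reader.beginObject() // the "0": { ... } wrapper
        while (reader.hasNext) {
          if (reader.nextName() == "results") {
            reader.beginArray()
            while (reader.hasNext) {
              // Object-model parsing for each element, streaming around it.
              val result = gson.fromJson[Result](reader, classOf[Result])
              println(result.productCode)
            }
            reader.endArray()
          } else reader.skipValue() // "numRecords", "refinementViewModelCollector", ...
        }
        reader.endObject()
      case _ => reader.skipValue() // "searchString", "redirectUrl", ...
    }
  }
  reader.endObject()
  reader.close()
}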

Related

Can Filebeat parse JSON fields instead of the whole JSON object into Kibana?

I am able to get a single JSON object in Kibana by having this in the filebeat.yml file:
output.elasticsearch:
  hosts: ["localhost:9200"]
How can I get at the individual elements in the JSON string? Say I wanted to compare all the "pseudorange" fields of all my JSON objects. How would I:
Select the "pseudorange" field from all my JSON messages to compare them.
Compare them visually in Kibana? At the moment I can't even find the message, let alone the individual fields, in the visualisation tab...
I have heard of people using Logstash to parse the string somehow, but is there no way of doing this simply with Filebeat? If there isn't, what do I do with Logstash to help filter the individual fields in the JSON, instead of having my message be just one big JSON string that I cannot interact with?
I get the following output from output.console (note I am putting some information in <> to hide it):
"#timestamp": "2021-03-23T09:37:21.941Z",
"#metadata": {
"beat": "filebeat",
"type": "doc",
"version": "6.8.14",
"truncated": false
},
"message": "{\n\t\"Signal_data\" : \n\t{\n\t\t\"antenna type:\" : \"GPS\",\n\t\t\"frequency type:\" : \"GPS\",\n\t\t\"position x:\" : 0.0,\n\t\t\"position y:\" : 0.0,\n\t\t\"position z:\" : 0.0,\n\t\t\"pseudorange:\" : 20280317.359730639,\n\t\t\"pseudorange_error:\" : 0.0,\n\t\t\"pseudorange_rate:\" : -152.02620448094211,\n\t\t\"svid\" : 18\n\t}\n}\u0000",
"source": <ip address>,
"log": {
"source": {
"address": <ip address>
}
},
"input": {
"type": "udp"
},
"prospector": {
"type": "udp"
},
"beat": {
"name": <ip address>,
"hostname": "ip-<ip address>",
"version": "6.8.14"
},
"host": {
"name": "ip-<ip address>",
"os": {
<ubuntu info>
},
"id": <id>,
"containerized": false,
"architecture": "x86_64"
},
"meta": {
"cloud": {
<cloud info>
}
}
}
In Filebeat, you can leverage the decode_json_fields processor in order to decode a JSON string and add the decoded fields into the root object:
processors:
  - decode_json_fields:
      fields: ["message"]
      process_array: false
      max_depth: 2
      target: ""
      overwrite_keys: true
      add_error_key: false
Credit to Val for this. His answer worked; however, as he suggested, my JSON string had a \u0000 at the end, which stops it being valid JSON and prevented the decode_json_fields processor from working as it should...
Upgrading to version 7.12 of Filebeat (also ensure version 7.12 of Elasticsearch and Kibana because mismatched versions between them can cause issues) allows us to use the script processor: https://www.elastic.co/guide/en/beats/filebeat/current/processor-script.html.
Credit to Val here again, this script removed the null terminator:
- script:
    lang: javascript
    id: trim
    source: >
      function process(event) {
        event.Put("message", event.Get("message").trim());
      }
After the null terminator was removed, the decode_json_fields processor did its job as Val suggested, and I was able to extract the individual elements of the JSON field, which allowed Kibana visualisations to look at the elements I wanted!

Siesta: Child resources

I am having difficulty understanding how Siesta figures out the child of a resource. For example, I have the following events resource:
JSON returned by "/events"
{
  "success": 1,
  "events": [
    {
      "id": 1,
      "type": "meeting",
      "eventDate": "2015-08-20",
      "notes": "fadsfasfa",
      "title": null
    },
    {
      "id": 2,
      "type": "game",
      "eventDate": "2015-08-31",
      "notes": "fdsafdf",
      "title": null
    }
  ]
}
Sadly, calling "/events/2", for example, does not return the event with id=2. Is there a way to tell Siesta which event has id=2?
Suppose you have:
  let events = myService.resource("/events")
Then you can navigate from the /events resource to the /events/2 resource like this:
  let event = events.child("2")
That will give you the same object as if you had asked for myService.resource("/events/2").
To extract that 2 from the JSON, use normal Swift JSON parsing techniques. (Siesta doesn’t apply any special inspection or interpretation to the JSON once it’s parsed.) I recommend using the SwiftyJSON library for easier JSON traversal. For example, it lets you do something like this to extract those event IDs and get the child resources:
let allEventResources =
  JSON(events.jsonDict)["events"]
    .arrayValue
    .flatMap { $0["id"].string }
    .map(events.child)

Extracting data from a JSON file

I have a large JSON file that looks similar to the code below. Is there any way I can iterate through each object, look for the field "element_type" (it is not present in all objects in the file, if that matters), and extract or write each object with the same element type to a file? For example, each user would end up in a file called user.json and each book in a file called book.json.
I thought about using JavaScript, but to my knowledge JS can't write to files. I also tried to do it using Linux command-line tools by removing all new lines, then inserting a new line after each "}," and then iterating through each line to find the element type and write it to a file. This worked for most of the data; however, where there were objects like the "problem_type" below, it inserted a new line in the middle of the data due to the nested JSON in the "times" element. I've run out of ideas at this point.
{
  "data": [
    {
      "element_type": "user",
      "first": "John",
      "last": "Doe"
    },
    {
      "element_type": "user",
      "first": "Lucy",
      "last": "Ball"
    },
    {
      "element_type": "book",
      "name": "someBook",
      "barcode": "111111"
    },
    {
      "element_type": "book",
      "name": "bookTwo",
      "barcode": "111111"
    },
    {
      "element_type": "problem_type",
      "name": "problem object",
      "times": "[{\"start\": \"1230\", \"end\": \"1345\", \"day\": \"T\"}, {\"start\": \"1230\", \"end\": \"1345\", \"day\": \"R\"}]"
    }
  ]
}
I would recommend Java for this purpose. It sounds like you're running on Linux, so it should be a good fit.
You'll have no problems writing to files, and you can use a library like this - http://json-lib.sourceforge.net/ - to gain access to things like JSONArray and JSONObject, which you can easily use to iterate through the data in your JSON, check what's in "element_type", and write to a file accordingly.
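To make that concrete, here is a minimal sketch of the same idea in Scala with Play JSON instead of json-lib (the input/output file names and the "unknown" fallback bucket are assumptions for the example):

import java.io.{File, PrintWriter}
import play.api.libs.json.{JsArray, JsValue, Json}
import scala.io.Source

object SplitByElementType extends App {
  // Parse the whole document; fine as long as it fits in memory.
  val root: JsValue = Json.parse(Source.fromFile("input.json").mkString)

  // Group the objects under "data" by their optional "element_type" field.
  val byType: Map[String, Seq[JsValue]] =
    (root \ "data").as[Seq[JsValue]]
      .groupBy(obj => (obj \ "element_type").asOpt[String].getOrElse("unknown"))

  // Write each group to <element_type>.json as a JSON array.
  byType.foreach { case (elementType, objects) =>
    val out = new PrintWriter(new File(s"$elementType.json"))
    try out.write(Json.prettyPrint(JsArray(objects.toIndexedSeq)))
    finally out.close()
  }
}

Since a real parser treats the nested JSON inside "times" as a plain string value, it never causes the mis-splitting that the line-based approach ran into.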

Best way to structure json data with backbone

I'm just about to write a web app using Backbone and want to know the best way to structure my JSON file. I've read that using 'dictionary' arrays is best, but was wondering if there's a better way of structuring the data. An example of how the data should be structured would be great too.
My data seems to have a lot of nested arrays, and these seem to be hard to search through.
JSON has two relevant container types: Object and Array.
Object is probably what you're thinking of when you say "dictionary array".
You probably want an object with arrays of objects :)
{"courses": [{
"name": "Spanish 101",
"subject": "How to speak Spanish",
}, {
"name": "Introduction to Film",
"subject": "Make pretty films",
}, {
"name": "Social Psychology",
"subject": "Why people are weird.",
}],
"sections": [],
"modules": [],
"topics": [],
"lessons": [],
// etc..
}
Each of the [] items would be filled with numerous objects.
Once you get this data into your app (via JSONP, AJAX, or if it's just assigned to a variable in your page), you can put it in your Backbone collections using the reset function (see http://backbonejs.org/#Collection-reset):
var Courses = new Backbone.Collection;
function processData(data) {
  Courses.reset(data.courses);
  // etc...
}

How should a JSON response be formatted?

I have a REST service that returns a list of objects. Each object contains objectcode and objectname.
This is my first time building a REST service, so I'm not sure how to format the response.
Should it be:
{
  "objects": {
    "count": 2,
    "object": [
      {
        "objectcode": "1",
        "objectname": "foo"
      },
      {
        "objectcode": "2",
        "objectname": "bar"
      },
      ...more objects
    ]
  }
}
OR
[
  {
    "objectcode": "1",
    "objectname": "foo"
  },
  {
    "objectcode": "2",
    "objectname": "bar"
  },
  ...more objects
]
I realize this might be a little subjective, but which would be easier to consume? I would also need to support an XML-formatted response later.
They are the same to consume, as a library handles both just fine. The first one has an advantage over the second, though: you will be able to expand the response to include other information additional to the objects (for example, categories) without breaking existing code.
Something like
{
  "objects": {
    "count": 2,
    "object": [
      {
        "objectcode": "1",
        "objectname": "foo"
      },
      {
        "objectcode": "2",
        "objectname": "bar"
      },
      ...more objects
    ]
  },
  "categories": {
    "count": 2,
    "category": [
      { "name": "some category" }
    ]
  }
}
Additionally, the JSON shouldn't be formatted for display, so remove whitespace, line breaks, etc. Also, the count isn't really necessary, as it can be derived while parsing the objects themselves.
I often see the first one. Sometimes it's easier to manipulate data when it carries metadata. For example, the Google Geocoding API uses the first one: http://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=true
It's not only a question of personal preference; it's also a question of your requirements. For example, if I were in the same situation and needed the object count on the client side, I'd go with the first approach; otherwise I would choose the second one.
Also, please note that a "classic" REST server will mostly work a bit differently: if a REST function is to return a list of objects, it should return only a list of URLs to those objects. The URLs should point to detail endpoints, so that by querying each endpoint you can get the details of a specific single object.
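For illustration, a list-of-links response in that style might look like this (the URLs are hypothetical):
{
  "objects": [
    { "href": "/objects/1" },
    { "href": "/objects/2" }
  ]
}
The client then follows each href to fetch the full objectcode/objectname details.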
As a client, I would prefer the second format. If all the first format adds is the number of objects, that is redundant information.