Jackson JsonParser: restart parsing in broken JSON

I am using Jackson to process JSON that comes in chunks in Hadoop. That means they are big files that are cut up into blocks (128 MB in my case, but the exact size doesn't really matter).
For efficiency reasons, I need it to be streaming (not possible to build the whole tree in memory).
I am using a mixture of JsonParser and ObjectMapper to read from my input.
At the moment, I am using a custom InputFormat that is not splittable, so I can read my whole JSON.
The structure of the (valid) JSON is something like:
[ { "Rep":
{
"date":"2013-07-26 00:00:00",
"TBook":
[
{
"TBookC":"ABCD",
"Records":
[
{"TSSName":"AAA",
...
},
{"TSSName":"AAB",
...
},
{"TSSName":"ZZZ",
...
}
] } ] } } ]
The records I want to read in my RecordReader are the elements inside "Records". The "..." means that there is more info there, which makes up my record.
If there is only one split, there is no problem at all.
I use a JsonParser for fine-grained control (reading the headers and moving to the "Records" token), and then I use the ObjectMapper together with the JsonParser to read the records as objects. For details:
MappingJsonFactory factory = new MappingJsonFactory();
factory.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false); // don't let the parser close the split's stream
mapper = new ObjectMapper(factory);
mapper.configure(Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
mapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);
parser = factory.createJsonParser(iStream);
mapper.readValue(parser, JsonNode.class);
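The record-reading loop then looks roughly like this - a minimal sketch in the same Jackson 1.x style as the setup above; the "Records" field check and the process() call are placeholders of mine, not the original code:
JsonToken token;
while ((token = parser.nextToken()) != null) {
    // Skip the headers until we reach the "Records" array.
    if (token == JsonToken.FIELD_NAME && "Records".equals(parser.getCurrentName())) {
        parser.nextToken(); // move to START_ARRAY
        // Read every element of the array as one record object.
        while (parser.nextToken() == JsonToken.START_OBJECT) {
            JsonNode record = mapper.readValue(parser, JsonNode.class);
            process(record); // placeholder for whatever the RecordReader emits
        }
    }
}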
Now, let's imagine I have a file with two input splits (i.e. there are a lot of elements in "Records").
The valid JSON starts on the first split, and I read and keep the headers (which I need for each record, in this case the "date" field).
The split would cut anywhere in the Records array. So let's assume I get a second split like this:
...
},
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
}
] } ] } } ]
Before I start parsing, I can move the InputStream (FSDataInputStream) to the beginning ("{") of the record containing the next "TSSName" (and this works fine). It's fine to discard the "garbage" at the beginning of the split. So we get this:
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
},
...
] } ] } } ]
Then I hand it to the JsonParser/ObjectMapper pair shown above.
The first object "ZZZ" is read OK.
But for the next one, "ZZZ2", it breaks: the JsonParser complains about malformed JSON, because it encounters a "," that is not inside an array. So it fails, and I cannot keep reading my records.
How can this be solved, so that I can still read my records from the second (and nth) split? Could I make the parser ignore these errors on the commas, or let it know in advance that it is reading the contents of an array?

It seems it's OK to just catch the exception: the parser carries on and is able to keep reading objects via the ObjectMapper.
I don't really like this - I would prefer an option that stops the parser from throwing exceptions on nonstandard or even bad JSON. So I don't know if this fully answers the question, but I hope it helps.
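For reference, a rough sketch of that workaround, assuming the stream has already been positioned at the first complete record of the split (the exception type, loop structure and process() call are my assumptions, not the original code):
while (true) {
    JsonToken token;
    try {
        token = parser.nextToken();
    } catch (JsonParseException e) {
        // The stray "," between records on a non-first split lands here; ignore it and retry.
        continue;
    }
    if (token == null || token == JsonToken.END_ARRAY) {
        break; // end of the Records array (or of the split)
    }
    if (token == JsonToken.START_OBJECT) {
        JsonNode record = mapper.readValue(parser, JsonNode.class);
        process(record); // placeholder
    }
}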

Related

Processing JSON arrays

I have a JSON file arranged in this pattern:
[
{
"Title ID": "4224031",
"Overtime Status": "Non-Exempt",
"Shift rates": "No Shift rates",
"On call rates": "No On call rates"
},
[
{
"Step: 1.0": [
"$38.87",
"(38.870000)"
]
}
]
][
{
"Title ID": "4225031",
"Overtime Status": "Non-Exempt",
"Shift rates": "No Shift rates",
"On call rates": "No On call rates"
},
[
{
"Step: 1.0": [
"$38.87",
"(38.870000)"
]
}
]
]
I am trying to get it into a Pandas DataFrame. I have tried opening the JSON file and running json.loads on its contents. Unfortunately, I get JSON decode errors like: "JSONDecodeError: Extra data: line 16 column 2 (char 182)". When running the JSON through a linter, I see that there might be an issue with the way the JSON is presented in the file: the individual bracketed parts are valid on their own, but concatenated together in the file they become invalid. I have then tried to get at the dictionaries inside the wrapping brackets but have not been able to make much progress. Does anyone have tips on how I can successfully access this JSON data and get it into a pandas DataFrame?
The JSON is invalid because it has more than one root in this representation. Effectively it is shaped like this:
jsonObject = [{"1":"3"}], [{"4":"5"}]
The hack I can think of is to replace these brackets ][ with ],[ using find and replace in an editor (and wrap the whole thing in one more pair of outer brackets, so there is a single root). You'll then be able to create a DataFrame, as it's a single list now. A sketch of this is shown below.
Second, if it's not a one-time job, then you need a regex (or code) that does this for you in a text-cleaning pipeline. I'm not good at writing working regexes, sorry.
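A minimal sketch of that hack done in code rather than in an editor - the file name is made up, and the replace assumes the "][" brackets are directly adjacent (as in the sample) and never occur inside a string value:
import json
import pandas as pd

with open("data.json") as f:
    raw = f.read()

# Turn the back-to-back arrays into one outer array: "[...][...]" -> "[[...],[...]]"
fixed = "[" + raw.replace("][", "],[") + "]"
blocks = json.loads(fixed)

# Each block is [ {title-level fields...}, [ {step rates...} ] ]; take the flat part.
df = pd.json_normalize([block[0] for block in blocks])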
I found a solution.
First, after examining the JSON data in a linter, I found that I had some extra brackets and braces at different points. So, I am running the data through a regex that cleans out the unnecessary brackets and braces.
Next, I run each line, which now looks like a stringified dictionary, through json.loads.
Finally, I call pd.DataFrame(pd.json_normalize(data)) to get my desired pandas DataFrame.
Thanks for the help from commenters.

How to write a splittable DoFn in Python - convert JSON to NDJSON in Apache Beam

I have a large dataset in GCS in json format that I need to load into BigQuery.
The problem is that the JSON data is not stored as NDJSON, but rather in a few large JSON files, where each top-level key should really be a field of its own record.
For example - the following Json:
{
"johnny": {
"type": "student"
},
"jeff": {
"type": "teacher"
}
}
should be converted into
[
{
"name": "johnny",
"type": "student"
},
{
"name": "jeff",
"type": "teacher"
}
]
I am trying to solve it via Google Cloud Dataflow and Apache Beam, but the performance is terrible, since each "worker" has to do a lot of work:
import json
import apache_beam as beam

class JsonToNdJsonDoFn(beam.DoFn):
    def __init__(self, pk_field_name):
        self.__pk_field_name = pk_field_name

    def process(self, line):
        for key, record in json.loads(line).items():
            record[self.__pk_field_name] = key
            yield record
I know that this can be solved somehow by implementing it as a SplittableDoFn - but the Python implementation example there is not really clear. How should I build this DoFn as splittable, and how would it be used as part of the pipeline?
You need a way to specify a partial range to process of the json file. It could be a byte range, for example.
The Avro example in the blog post is a good one. Something like:
class MyJsonReader(DoFn):
    def process(self, filename, tracker=DoFn.RestrictionTrackerParam):
        with fileio.ChannelFactory.open(filename) as file:
            start, stop = tracker.current_restriction()
            # Seek to the first record starting at or after the start offset.
            file.seek(start)
            next_record_start = find_next_record(file, start)
            while next_record_start is not None:
                # Claim the position of the current record.
                if not tracker.try_claim(next_record_start):
                    # Out of range of the current restriction - we're done.
                    return
                # read_record returns the record plus the offset where the next one starts.
                record, next_record_start = read_record(file, next_record_start)
                yield record

    def get_initial_restriction(self, filename):
        return (0, fileio.ChannelFactory.size_in_bytes(filename))
However, JSON doesn't have clear record boundaries, so if your work has to start at byte 548, there's no clear way of telling how far to shift. If the file is literally what you have there, then you can skip bytes until you see the pattern "<string>": {, and then read the JSON object starting at the {. A rough sketch of such a scan is below.
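For illustration, a find_next_record along those lines could look roughly like this (plain file I/O rather than the Beam API; the regex and the single-chunk read are simplifying assumptions):
import re

# A quoted key followed by ": {" marks the start of a record in this layout.
_RECORD_START = re.compile(rb'"[^"]+"\s*:\s*\{')

def find_next_record(file, start, chunk_size=64 * 1024):
    """Return the absolute offset of the first record boundary at or after start."""
    file.seek(start)
    buf = file.read(chunk_size)  # assumes the next boundary lies within one chunk
    match = _RECORD_START.search(buf)
    if match is None:
        return None  # no further record starts in range
    return start + match.start()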

Return nested JSON in AWS AppSync query

I'm quite new to AppSync (and GraphQL) in general, but I'm running into a strange issue when hooking up resolvers to our DynamoDB tables. Specifically, we have a nested map structure for one of our item's attributes that is arbitrarily constructed (its complexity and form depend on the type of the parent item), a little something like this:
"item" : {
"name": "something",
"country": "somewhere",
"data" : {
"nest-level-1a": {
"attr1a" : "foo",
"attr1b" : "bar",
"nest-level-2" : {
"attr2a": "something else",
"attr2b": [
"some list element",
"and another, for good measure"
]
}
}
},
"cardType": "someType"
}
Our accompanying GraphQL type is the following:
type Item {
name: String!
country: String!
cardType: String!
data: AWSJSON! ## note: it was originally String!
}
When we query the item we get the following response:
{
"data": {
"genericItemQuery": {
"name": "info/en/usa/bra/visa",
"country": "USA:BRA",
"cardType": "visa",
"data": "{\"tourist\":{\"reqs\":{\"sourceURL\":\"https://travel.state.gov/content/passports/en/country/brazil.html\",\"visaFree\":false,\"type\":\"eVisa required\",\"stayLimit\":\"30 days from date of entry\"},\"pages\":\"One page per stamp required\"}}"
}}}
The problem is we can't seem to get the Item.data field resolver to return a JSON object (even when we attach a separate field-level resolver to it on top of the general Query resolver). It always returns a String and, weirdly, if we change the expected field type to String!, the response will replace all : in data with =. We've tried everything with our response resolvers, including suggestions like How return JSON object from DynamoDB with appsync?, but we're completely stuck at this point.
Our current response resolver for our query has been reverted back to the standard response after none of the suggestions in the aforementioned post worked:
## 'Before' response mapping template on genericItemQuery query; same result as the 'After' listed below **
#set($result = $ctx.result)
#set($result.data = $util.parseJson($ctx.result.data))
$util.toJson($result)
## 'After' response mapping template **
$util.toJson($ctx.result)
We're trying to avoid a situation where we need to include supporting types for each nest level in data (since it changes based on parent Item type and in cases like the example I gave it can have three or four tiers), and we thought changing the schema type to AWSJSON! would do the trick. I'm beginning to worry there's no way to get around rebuilding our base schema, though. Any suggestions to the contrary would be helpful!
P.S. I've noticed in the CloudWatch logs that the appropriate JSON response exists under the context.result.data response field, but somehow there's the following transformedTemplate (which, again, I find very unusual considering we're not applying any mapping template except to transform the result into valid JSON):
"arn": ...
"transformedTemplate": "{data={tourist={reqs={sourceURL=https://travel.state.gov/content/passports/en/country/brazil.html, visaFree=false, type=eVisa required, stayLimit=30 days from date of entry}, pages=One page per stamp required}}, resIds=USA:BRA, cardType=visa, id=info/en/usa/bra/visa}",
"context": ...
Apologies for the lengthy question, but I'm stumped.
AWSJSON is a JSON string type, so you will always get back a string value (this is what your type definition must adhere to).
You could try to create a type for the data field that contains all possible fields and then resolve those fields according to the parent type, or alternatively you could try to implement GraphQL interfaces.

How to read invalid JSON format from Amazon Kinesis Firehose

I've got a most horrible scenario in which I want to read the files that Kinesis Firehose creates on our S3 bucket.
Kinesis Firehose creates files that don't have every JSON object on a new line, but are simply a concatenation of JSON objects:
{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}
Now, this is a scenario not supported by a normal JSON.parse, and I have tried working with the following regex: .scan(/({((\".?\":.?)*?)})/)
But the scan only seems to work in scenarios without nested brackets.
Does anybody know a working/better/more elegant way to solve this problem?
The regex in the initial answer is for unquoted JSON, which happens sometimes. This one:
({((\\?\".*?\\?\")*?)})
works for both quoted and unquoted JSON.
Besides that, I improved it a bit to keep it simpler, since you can have integer as well as normal values; anything within string literals will be ignored thanks to the double capturing group.
https://regex101.com/r/kPSc0i/1
Modify the input to be one large JSON array, then parse that:
require 'json'

input = File.read("input.json")
json = "[#{input.rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
You might want to combine the first two to save some memory:
json = "[#{File.read('input.json').rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
This assumes that } followed by some whitespace followed by { never occurs inside a key or value in your JSON encoded data.
As you concluded in your most recent comment, put_record_batch in Firehose requires you to manually add delimiters to your records so that they can be easily parsed by the consumers. You can add a newline, or some special character that is used solely for parsing (% for example) and should never appear in your payload. A minimal sketch of this is shown below.
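For example, a sketch of that idea using a newline as the delimiter (the helper name, the stream name parameter and the 500-records-per-call batching are my assumptions):
import json
import boto3

def send_batch_to_firehose(records, stream_name):
    firehose = boto3.client('firehose')
    # Append the delimiter to every record so the objects land newline-separated in S3.
    entries = [{'Data': (json.dumps(r) + '\n').encode('utf-8')} for r in records]
    # PutRecordBatch accepts at most 500 records per call.
    for i in range(0, len(entries), 500):
        firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=entries[i:i + 500]
        )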
The other option would be sending the records one by one. This is only viable if your use case does not require high throughput. For that you can loop over every record and send it as a stringified data blob. If done in Python, we would have a list "records" holding all our JSON objects:
import json
import boto3

def send_to_firehose(records):
    firehose_client = boto3.client('firehose')
    for record in records:
        data = json.dumps(record)
        firehose_client.put_record(
            DeliveryStreamName=<your stream>,
            Record={'Data': data}
        )
Firehose buffers the data by default before sending it to your bucket, and you should end up with something like this. This will be easy to parse and load into memory in your preferred data structure.
[
{
"metadata": {
"schema_id": "4096"
},
"payload": {
"zaza": 12,
"price": 20,
"message": "Testing sendnig the data in message attribute",
"source": "coming routing to firehose"
}
},
{
"metadata": {
"schema_id": "4096"
},
"payload": {
"zaza": 12,
"price": 20,
"message": "Testing sendnig the data in message attribute",
"source": "coming routing to firehose"
}
}
]

TJSONUnMarshal: how to track what is actually unmarshalled

Is there another way to track what gets unmarshalled than writing my own reverter for each field?
I'm updating my local data based on a JSON message, and my problem is (simplified):
I'm expecting json like
{ "items": [ { "id":1, "name":"foobar", "price":"12.34" } ] }
which is then unmarshaled to class TItems by
UnMarshaller.TryCreateObject( TItems, TJsonObject( OneJsonElement ), TargetItem )
My problem is that I can't make difference between
{ "items": [ { "id":1, "name":"", "price":"12.34" } ] }
and
{ "items": [ { "id":1, "price":"12.34" } ] }
In both cases name is blank, and I'd like to update only those fields that are passed in the JSON message. Of course I could create a reverter for each field, but there are plenty of fields and messages, so that would get quite huge.
I tried to look at the REST.Jsonreflect.pas source, but couldn't make sense of it.
I'm using Delphi 10.
In the Rest.Json unit there is a TJson class that offers several convenience methods, such as converting objects to JSON and vice versa. Specifically, it has a class function JsonToObject where you can specify options, for example to ignore empty strings or ignore empty arrays. I think the TJson class can serve you. For unmarshalling complex business objects you have to write custom converters, though.
Actually, my problem turned out to be simple to solve.
Instead of using TJSONUnMarshal.TryCreateObject I now use TJSONUnMarshal.CreateObject. The first one has its object parameter declared with the out modifier, while CreateObject has it declared with the var modifier, so I was able to create the object, initialize it from the database and pass it to CreateObject, which then only modifies the fields present in the JSON message.