How to read invalid JSON format amazon firehose - json

I have a nasty scenario: I want to read the files that Kinesis Firehose creates in our S3 bucket.
Firehose does not put each JSON object on its own line; it simply writes the objects back to back into one file:
{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}
Normal JSON.parse does not support this, so I tried working with the following regex: .scan(/({((\".?\":.?)*?)})/)
But the scan only seems to work when there are no nested brackets.
Does anybody know a working, better, or more elegant way to solve this problem?

The regex in the initial answer is for unquoted JSON, which happens sometimes. This one:
({((\\?\".*?\\?\")*?)})
works for both quoted and unquoted JSON.
I also simplified it a bit, since values can be integers as well as ordinary strings. Anything inside string literals is ignored thanks to the double capturing group.
https://regex101.com/r/kPSc0i/1

Modify the input to be one large JSON array, then parse that:
require 'json'
input = File.read("input.json")
json = "[#{input.rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
You might want to combine the first two to save some memory:
json = "[#{File.read('input.json').rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
This assumes that } followed by some whitespace followed by { never occurs inside a key or value in your JSON encoded data.
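If that assumption is a concern (the sample above has braces inside a string value), a decoder that consumes one value at a time avoids guessing where objects end. Below is a minimal sketch in Python rather than Ruby, using the standard library's json.JSONDecoder.raw_decode; the function name iter_concatenated_json is made up for illustration.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the decoded value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


sample = '{"param1":"value1","param2":2,"param3":"nested {bracket}"}{"param1":"a}{b","param2":3}'
objects = list(iter_concatenated_json(sample))
```

Because the decoder tracks string boundaries itself, a "}{" inside a value does not split the record.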

As you concluded in your most recent comment, put_record_batch in Firehose requires you to add delimiters to your records yourself so that consumers can parse them easily. You can append a newline or some special character that is used solely for parsing, % for example, which must never appear in your payload.
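A rough sketch of the newline-delimiter approach, assuming the records are JSON-serializable dictionaries. The helper names (to_firehose_records, send_batch, parse_lines) are invented for this example; firehose_client is expected to be something like boto3.client('firehose'), and the chunk size reflects the 500-record limit of PutRecordBatch.

```python
import json

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(items):
    """Serialize each item as one JSON line so consumers can split on newlines."""
    return [{"Data": (json.dumps(item) + "\n").encode("utf-8")} for item in items]


def send_batch(firehose_client, stream_name, items):
    """Send items in chunks of MAX_BATCH via put_record_batch."""
    records = to_firehose_records(items)
    for i in range(0, len(records), MAX_BATCH):
        firehose_client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[i:i + MAX_BATCH],
        )


def parse_lines(blob):
    """Consumer side: decode one JSON value per non-empty line."""
    return [json.loads(line) for line in blob.splitlines() if line.strip()]
```

A real producer would also inspect the response of each call (FailedPutCount) and retry the rejected records; that is left out here to keep the sketch short.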
The other option is to send record by record. This is only viable if your use case does not require high throughput. For that you can loop over the records and send each one as a stringified data blob. In Python, assuming a collection "records" that holds all our JSON objects as dictionaries:
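import json
import boto3

def send_to_firehose(records):
    # records: iterable of JSON-serializable dictionaries
    firehose_client = boto3.client('firehose')
    for record in records:
        # Firehose expects the payload as a string or bytes blob
        data = json.dumps(record)
        firehose_client.put_record(
            DeliveryStreamName='<your stream>',
            Record={'Data': data},
        )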
Firehose by default buffers the data before sending it to your bucket and it should end up with something like this. This will be easy to parse and load in memory in your preferred data structure.
[
  {
    "metadata": {
      "schema_id": "4096"
    },
    "payload": {
      "zaza": 12,
      "price": 20,
      "message": "Testing sendnig the data in message attribute",
      "source": "coming routing to firehose"
    }
  },
  {
    "metadata": {
      "schema_id": "4096"
    },
    "payload": {
      "zaza": 12,
      "price": 20,
      "message": "Testing sendnig the data in message attribute",
      "source": "coming routing to firehose"
    }
  }
]

Related

Azure Data Factory - Azure Dataflow, json being converted/serialized into a different format string

For context I have Azure Dataflow reading from CosmosDB where the table has two "schemas". By this I mean one format has a field called "data" that is a JSON representation of data I need. The other format also has a field called "data" but the field is just a compressed string and I believe this is what causes the issue I'm having.
When dataflow reads the field from the source, the JSON gets turned into a non-json format through some form of serialization because when the projection gets imported, the data field is being read as a string and not a complex object. (Example below) I suspect this is because the two "data" fields have the same name.
I unfortunately cannot change how the data is being stored in CosmosDB so I cannot change one of the field names to another value.
Is there any way to prevent this? I would like to keep it in JSON format, with quotes, colons, etc., instead of what I have below.
Example of data in CosmosDB:
{
  "Name": "sample",
  "ValueInfo": [
    {
      "Field1": "foo",
      "Field2": "bar"
    }
  ]
}
How it looks in ADF:
{
  Name=sample,
  ValueInfo=[
    {
      Field1=foo,
      Field2=bar
    }
  ]
}

Processing JSON arrays

I have a JSON file arranged in this pattern:
[
  {
    "Title ID": "4224031",
    "Overtime Status": "Non-Exempt",
    "Shift rates": "No Shift rates",
    "On call rates": "No On call rates"
  },
  [
    {
      "Step: 1.0": [
        "$38.87",
        "(38.870000)"
      ]
    }
  ]
][
  {
    "Title ID": "4225031",
    "Overtime Status": "Non-Exempt",
    "Shift rates": "No Shift rates",
    "On call rates": "No On call rates"
  },
  [
    {
      "Step: 1.0": [
        "$38.87",
        "(38.870000)"
      ]
    }
  ]
]
I am trying to get it into a pandas DataFrame. I have tried opening the JSON file and running json.load(s). Unfortunately, I get JSON decode errors like: "JSONDecodeError: Extra data: line 16 column 2 (char 182)". When I run the JSON through a linter, I see that there might be an issue with the way the JSON is laid out in the file. The parts between the brackets are valid, but once wrapped in the brackets they become invalid. I then tried to get at the dictionaries inside the wrapping brackets but have not been able to make much progress. Does anyone have tips on how I can successfully access this JSON data and get it into a pandas DataFrame?
The JSON is invalid because it has more than one root in this representation.
It has to look like this:
jsonObject = [{"1":"3"}], [{"4":"5"}]
One hack I can think of is to replace ][ with ],[ using find and replace in an editor. You will then be able to create a DataFrame, since the data is now a list.
Second, if it's not a one-time job, you need to write a regex that does this for you in a text-cleaning pipeline (or in code). I'm not good at writing working regexes (sorry, mate).
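The find-and-replace hack could be scripted roughly like this. It is only a sketch: it assumes a ] followed by a [ never appears inside a string value, and that every entry has the [header, [steps]] shape shown in the question. The names load_concatenated_arrays and flatten are made up for this example.

```python
import json
import re


def load_concatenated_arrays(text):
    """Turn '[...][...]' into one list by inserting commas and wrapping in brackets."""
    # Assumption: ']' followed by '[' only occurs at the boundary between roots.
    joined = re.sub(r"\]\s*\[", "],[", text.strip())
    return json.loads("[" + joined + "]")


def flatten(entry):
    """Merge the header dict and the nested step dicts of one entry into a flat row."""
    header, steps = entry
    row = dict(header)
    for step in steps:
        row.update(step)
    return row


text = (
    '[{"Title ID": "4224031"}, [{"Step: 1.0": ["$38.87", "(38.870000)"]}]]'
    '[{"Title ID": "4225031"}, [{"Step: 1.0": ["$38.87", "(38.870000)"]}]]'
)
rows = [flatten(entry) for entry in load_concatenated_arrays(text)]
```

The resulting list of flat dictionaries can then be handed to pandas, for example pd.DataFrame(rows), if pandas is available.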
I found a solution.
First, after examining the JSON data in a linter, I found extra brackets and braces at various points, so I run the data through a regex that strips the unnecessary brackets and braces.
Next, I run each line, which now looks like a stringified dictionary, through json.loads.
Finally, I call pd.DataFrame(pd.json_normalize(data)) to get my desired pandas DataFrame.
Thanks to the commenters for the help.

How can I get Boomi to return valid JSON

I am querying records from Salesforce and trying to return the record set as a JSON array of records.
Unfortunately, each record comes back as a complete JSON document of its own rather than as an element of an array within a single JSON object.
{
  "AppointmentID": "a046g00000Nyk6oAAB"
}{
  "AppointmentID": "a046g00000NyjhfAAB"
}{
  "AppointmentID": "a046g00000NygSfAAJ"
}
There are no commas between the records. So I built the array into the JSON response and get:
{
  "Appointments": [
    {
      "AppointmentID": "a046g00000Nyk6oAAB"
    }
  ]
}{
  "Appointments": [
    {
      "AppointmentID": "a046g00000NyjhfAAB"
    }
  ]
}{
  "Appointments": [
    {
      "AppointmentID": "a046g00000NygSfAAJ"
    }
  ]
}
and it sends each record as the entire JSON template rather than as an element of the array. Again, it also does not send commas between the elements. I can work with a less-than-ideal structure, but I need valid JSON returned.
Lastly, I tried to modify the results with a Data Process shape using a Search and Replace,
searching for: \}\{
replacing with: \}\,\{
to force a comma between the braces, but the search never finds any matches even though this is a valid JavaScript regex search.
Any suggestions would be greatly appreciated.
Final/Fixed Map
It's likely that the destination profile is incorrect and that you created the JSON profile manually. I would write out the JSON that you're expecting, with all of the fields, and then import it (when you open the JSON profile, there is a blue button in the top right).
Also, Salesforce usually returns each record as one document rather than combining them. So it's likely that multiple documents are coming out of the map and you'll need to combine them (Data Process shape).

How to write a splittable DoFn in python - convert json to ndjson in apache beam

I have a large dataset in GCS in json format that I need to load into BigQuery.
The problem is that the JSON data is not stored as NDJSON but rather in a few large JSON files, where each key in the JSON should really be a field in the record itself.
For example - the following Json:
{
  "johnny": {
    "type": "student"
  },
  "jeff": {
    "type": "teacher"
  }
}
should be converted into
[
  {
    "name": "johnny",
    "type": "student"
  },
  {
    "name": "jeff",
    "type": "teacher"
  }
]
I am trying to solve it with Google Dataflow and Apache Beam, but the performance is terrible since each worker has to do a lot of work:
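import json
import apache_beam as beam

class JsonToNdJsonDoFn(beam.DoFn):
    def __init__(self, pk_field_name):
        self.__pk_field_name = pk_field_name

    def process(self, line):
        # Each top-level key becomes one record, with the key stored as a field.
        for key, record in json.loads(line).items():
            record[self.__pk_field_name] = key
            yield record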
I know that this can be solved somehow by implementing it as a splittable DoFn, but the Python implementation example there is not really clear. How should I build this DoFn as splittable, and how will it be used as part of the pipeline?
You need a way to specify a partial range of the JSON file to process. It could be a byte range, for example.
The Avro example in the blog post is a good one. Something like:
class MyJsonReader(DoFn):
    # Sketch only: find_next_record and read_record are placeholder helpers
    # that you would have to write yourself.
    def process(self, filename, tracker=DoFn.RestrictionTrackerParam):
        with fileio.ChannelFactory.open(filename) as file:
            start, stop = tracker.current_restriction()
            # Seek to the first block starting at or after the start offset.
            file.seek(start)
            next_record_start = find_next_record(file, start)
            while next_record_start is not None:
                # Claim the position of the current record
                if not tracker.try_claim(next_record_start):
                    # Out of range of the current restriction - we're done.
                    return
                # end points to the end of the record that was read
                record, end = read_record(file, next_record_start)
                yield record
                # Advance to the record that follows the one just read.
                next_record_start = find_next_record(file, end)

    def get_initial_restriction(self, filename):
        return (0, fileio.ChannelFactory.size_in_bytes(filename))
However, JSON doesn't have clear record boundaries, so if your work has to start at byte 548, there's no clear way of telling how far to shift. If the file is literally what you have there, you can skip bytes until you see the pattern "<string>": { and then read the JSON object starting at the {.
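The "skip until you see the pattern" idea might look something like the following. This is a plain-Python sketch that works on an in-memory string rather than a Beam file handle, and it only holds for flat files like the one in the question: a nested "key": { inside a record would be mistaken for a record start. Both helper names are hypothetical.

```python
import json
import re

# A quoted key, a colon, then an opening brace, e.g. "jeff": {
KEY_THEN_OBJECT = re.compile(r'"((?:[^"\\]|\\.)*)"\s*:\s*\{')


def find_next_record(text, start):
    """Return (key, brace_pos) for the first '"key": {' at or after start, or None."""
    match = KEY_THEN_OBJECT.search(text, start)
    if match is None:
        return None
    # The key is returned raw; escape sequences in it are not decoded.
    return match.group(1), match.end() - 1


def read_record(text, brace_pos):
    """Decode the object that starts at brace_pos; return (record, end_offset)."""
    return json.JSONDecoder().raw_decode(text, brace_pos)


text = '{"johnny": {"type": "student"}, "jeff": {"type": "teacher"}}'
key, brace_pos = find_next_record(text, 5)  # offset 5 lands inside "johnny"
record, end = read_record(text, brace_pos)
```

Starting in the middle of the first key, the search skips ahead to the next complete key and object pair, which is the behavior a worker starting at an arbitrary offset would need.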

jackson jsonparser restart parsing in broken JSON

I am using Jackson to process JSON that comes in chunks in Hadoop. That means, they are big files that are cut up in blocks (in my problem it's 128M but it doesn't really matter).
For efficiency reasons, I need it to be streaming (not possible to build the whole tree in memory).
I am using a mixture of JsonParser and ObjectMapper to read from my input.
At the moment, I am using a custom InputFormat that is not splittable, so I can read my whole JSON.
The structure of the (valid) JSON is something like:
[ { "Rep":
{
"date":"2013-07-26 00:00:00",
"TBook":
[
{
"TBookC":"ABCD",
"Records":
[
{"TSSName":"AAA",
...
},
{"TSSName":"AAB",
...
},
{"TSSName":"ZZZ",
...
}
] } ] } } ]
The records I want to read in my RecordReader are the elements inside the "Records" element. The "..." means that there is more information there, which makes up my record.
If I have only one split, there is no problem at all.
I use a JsonParser for the fine-grained work (reading the headers and moving to the "Records" token) and then use ObjectMapper and JsonParser to read records as objects. For details:
configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
MappingJsonFactory factory = new MappingJsonFactory();
mapper = new ObjectMapper(factory);
mapper.configure(Feature.FAIL_ON_UNKNOWN_PROPERTIES,false);
mapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS,false);
parser = factory.createJsonParser(iStream);
mapper.readValue(parser, JsonNode.class);
Now, let's imagine I have a file with two inputsplits (i.e. there are a lot of elements in "Records").
The valid JSON starts on the first split, and I read and keep the headers (which I need for each record, in this case the "date" field).
The split would cut anywhere in the Records array. So let's assume I get a second split like this:
...
},
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
}
] } ] } } ]
Before I start parsing, I can move the InputStream (FSDataInputStream) to the beginning ("{") of the record with the next "TSSName" in it (and this works fine). It's fine to discard the leading "garbage" at the beginning. So we get this:
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
},
...
] } ] } } ]
Then I hand it to the JsonParser/ObjectMapper pair seen above.
The first object, "ZZZ", is read OK.
But for the next one, "ZZZ2", it breaks: the JsonParser complains about malformed JSON. It encounters a "," that is not inside an array, so it fails, and then I cannot keep reading my records.
How can this problem be solved so that I can still read my records from the second (and nth) split? How can I make the parser ignore these errors on the commas, or let the parser know in advance that it is reading the contents of an array?
It seems to be enough to just catch the exception: the parser carries on and is able to keep reading objects via the ObjectMapper.
I don't really like it; I would prefer an option that stops the parser from throwing exceptions on nonstandard or even bad JSON. So I don't know if this fully answers the question, but I hope it helps.