Processing JSON arrays - json

I have a JSON file arranged in this pattern:
[
  {
    "Title ID": "4224031",
    "Overtime Status": "Non-Exempt",
    "Shift rates": "No Shift rates",
    "On call rates": "No On call rates"
  },
  [
    {
      "Step: 1.0": [
        "$38.87",
        "(38.870000)"
      ]
    }
  ]
][
  {
    "Title ID": "4225031",
    "Overtime Status": "Non-Exempt",
    "Shift rates": "No Shift rates",
    "On call rates": "No On call rates"
  },
  [
    {
      "Step: 1.0": [
        "$38.87",
        "(38.870000)"
      ]
    }
  ]
]
I am trying to get it into a pandas DataFrame. I have tried opening a connection to the JSON file and running json.load() on it. Unfortunately, I get JSON decode errors like: "JSONDecodeError: Extra data: line 16 column 2 (char 182)". When running the JSON through a linter, I see that there might be an issue with the way the JSON is presented in the file: the parts between the brackets are valid, but when wrapped in brackets they become invalid. I have also tried to get at the dictionaries inside the wrapping brackets but have not been able to make much progress. Does anyone have tips on how I can successfully access this JSON data and get it into a pandas DataFrame?

The JSON is invalid because it has more than one root element in this representation.
For it to parse, everything has to live under a single root, something like this:
jsonObject = [[{"1": "3"}], [{"4": "5"}]]
The hack I can think of is to replace these brackets ][ with ],[ by find and replace in an editor, and then wrap the whole thing in one more pair of brackets. You'll then be able to create a DataFrame, as it's a single list now.
Second, if it's not a one-time job, you need a regex that can do this for you in a text-cleaning pipeline (or code). I'm not good at writing working regexes (sorry, mate), but a rough sketch of the idea follows below.
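For illustration, a minimal Python sketch of that clean-and-wrap approach; the file name data.json is an assumption, and it presumes ][ only ever appears at a record boundary, as in the sample above:

import json
import re

import pandas as pd

with open("data.json") as f:
    raw = f.read()

# Join the back-to-back roots "][" into "],[", then wrap everything in one
# outer pair of brackets so the whole file is a single valid JSON list.
fixed = "[" + re.sub(r"\]\s*\[", "],[", raw.strip()) + "]"
data = json.loads(fixed)

# Each entry is now [info_dict, [step_dict, ...]]; flatten each pair into one row.
rows = []
for info, steps in data:
    row = dict(info)
    for step in steps:
        row.update(step)
    rows.append(row)

df = pd.json_normalize(rows)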

I found a solution.
First, after examining the JSON data in a linter, I found that I had some extra brackets and braces at different points. So I am running the data through a regex that cleans out the unnecessary brackets and braces.
Next, I run each line, which now looks like a dictionary serialized as a string, through json.loads.
Finally, I call pd.DataFrame(pd.json_normalize(data)) to get my desired pandas DataFrame.
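A rough sketch of those last two steps; the cleaning regex from the first step is not reproduced here, so this assumes the cleaned file (hypothetically cleaned.json) already holds one JSON object per line:

import json

import pandas as pd

records = []
with open("cleaned.json") as f:
    for line in f:
        line = line.strip().rstrip(",")
        if line:
            records.append(json.loads(line))

df = pd.json_normalize(records)  # already returns a DataFrame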
Thanks for the help from commenters.

Related

Convert dict or JSON formats with nested objects into string

Now I have a string in dict format, but as far as I can guess it is JSON. It looks like this:
{
"gid":"1201400250397201",
"memberships":[
"can be nested objects",
...
],
"name":"Name of task",
"parent":{
"gid":"1201400250397199",
"name":"name of parent task"
},
"permalink_url":"https://url...."
}
So, first question: am I right? I used dumps() from the json library but got unicode escape sequences; loads() didn't work for me, I got the error "the JSON object must be str, bytes or bytearray, not dict".
Second question: if it's not JSON format, how can I get a comfortable view of it? I did this:
First I get the dict-like object, then I print each of the dictionary's keys:
for key in task:
    print(task[key])
output:
1201400250397201
[]
Name of task
{'gid': '1201400250397199', 'name': 'name of parent task'}
https://url....
Actually, it would be great if I could get something like this:
gid: 1201400250397201
name: Name of task
parent_name: name of parent task
etc., but I don't know how to get it :(
Next question: as you can see, for the "parent" part (penultimate line) I also get a dictionary. How can I extract it and get a convenient format?
Or maybe you have your own comfortable methods?
As stated in your error, the object you are working with is already a dictionary. You can print it directly as JSON with json.dumps:
import json

task = {'gid': '1201400250397201', 'memberships': [{}], 'name': 'Name of task', 'parent': {'gid': '1201400250397199', 'name': 'name of parent task'}, 'permalink_url': 'https://url....'}
print(json.dumps(task, indent=4))
Setting indent=4 makes it readable and you'll get:
{
"gid": "1201400250397201",
"memberships": [
{}
],
"name": "Name of task",
"parent": {
"gid": "1201400250397199",
"name": "name of parent task"
},
"permalink_url": "https://url...."
}
If you don't want unicode characters to be escaped, add the argument ensure_ascii=False:
print(json.dumps(task, indent=4, ensure_ascii=False))
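To get the flat "key: value" view asked about above (including the parent's name), here is a minimal sketch that walks the dictionary directly; the parent_-style prefixing is just one possible choice:

def print_flat(d, prefix=""):
    # Print nested keys as parent-prefixed "key: value" lines.
    for key, value in d.items():
        if isinstance(value, dict):
            print_flat(value, prefix=f"{prefix}{key}_")
        else:
            print(f"{prefix}{key}: {value}")

print_flat(task)
# gid: 1201400250397201
# memberships: [{}]
# name: Name of task
# parent_gid: 1201400250397199
# parent_name: name of parent task
# permalink_url: https://url....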

How can I get Boomi to return valid JSON

I am querying records from Salesforce and trying to return the record set as a JSON array of records.
Unfortunately, it returns every record as if it were a complete JSON document of its own, rather than as an array element in the same JSON object.
{
"AppointmentID": "a046g00000Nyk6oAAB"
}{
"AppointmentID": "a046g00000NyjhfAAB"
}{
"AppointmentID": "a046g00000NygSfAAJ"
}
There are no commas between the records. So I built the array into the JSON response and get:
{
"Appointments": [
{
"AppointmentID": "a046g00000Nyk6oAAB"
}
]
}{
"Appointments": [
{
"AppointmentID": "a046g00000NyjhfAAB"
}
]
}{
"Appointments": [
{
"AppointmentID": "a046g00000NygSfAAJ"
}
]
}
and it sends each record as the entire JSON template rather than as an element of the array. Again, it also does not send commas back between the elements. I can work with a less-than-ideal structure, but I need valid JSON returned.
Lastly, I tried to modify the results with a Data Process Shape using a Search and Replace,
searching for: \}\{
replacing with: \}\,\{
trying to force a comma between the braces, but the search never finds any matches even though this is a valid JavaScript regex search.
Any suggestions would be greatly appreciated.
(Screenshot: final/fixed map.)
It's likely that the destination profile is incorrect and that you manually created the JSON profile. I would write the JSON out that you're expecting with all of the fields and then import (when you open the JSON profile, it's a blue button in the top right).
Also, Salesforce usually returns each record as 1 document and not combined. So, it's likely multiple documents are coming out of the map and you'll need to do a combine (data process shape).

JSON file to Redshift using JSONPATHS file - Invalid jsonpath format

I'm trying to load a JSON file from S3 into Redshift using COPY with a JSONPaths file. The file contains N records.
Loading the entire set in one go throws an error:
Invalid operation: Invalid JSONPath format. Supported notations are 'dot-notation' and 'bracket-notation'
The JSONPaths file:
{"jsonpaths":
[
"$.item[:].col1",
"$.item[:].col2",
"$.item[:].col3"
]
}
sample file:
{"item":
[
{
"col1":"A",
"col2":"b",
"col3":"d"
},
{
"col1": "123",
"col2": "red",
"col3": "456"
}
]
}
A working JSONPaths file:
{"jsonpaths":
[
"$.item[0].col1",
"$.item[0].col2",
"$.item[0].col3"
]
}
What am I doing wrong to cause this error?
As per the documentation, there are two ways of specifying JSONPaths: dot notation and bracket notation.
In this example, dot notation has been used, but the arrays have been indexed with a colon (:). The correct way to index JSON array elements is with a number, which is why the second JSONPaths file works.

How to read invalid JSON format amazon firehose

I've got a horrible scenario where I want to read the files that Kinesis Firehose creates on our S3.
Kinesis Firehose creates files that don't have every JSON object on a new line; each file is simply a concatenation of JSON objects.
{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}
This is a scenario not supported by a normal JSON.parse, and I have tried working with the following regex: .scan(/({((\".?\":.?)*?)})/)
But the scan only seems to work in scenarios without nested brackets.
Does anybody know a working/better/more elegant way to solve this problem?
The one in the initial answer is for unquoted JSON, which happens sometimes. This one:
({((\\?\".*?\\?\")*?)})
works for both quoted and unquoted JSON.
Besides this, I improved it a bit to keep it simpler, as you can have integer as well as normal values; anything within string literals will be ignored thanks to the double capturing group.
https://regex101.com/r/kPSc0i/1
Modify the input to be one large JSON array, then parse that:
require "json"

input = File.read("input.json")
json = "[#{input.rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
You might want to combine the first two to save some memory:
json = "[#{File.read('input.json').rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
This assumes that } followed by some whitespace followed by { never occurs inside a key or value in your JSON encoded data.
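If that assumption about the data is risky, a streaming decode avoids string munging altogether. Here is a minimal sketch in Python (shown in Python rather than Ruby because the Firehose example later in this thread is also Python), using json.JSONDecoder.raw_decode to walk the concatenated objects one at a time:

import json

def iter_concatenated_json(text):
    # Yield each top-level object from a string of concatenated JSON documents.
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # Skip any whitespace between objects before decoding the next one.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

with open("input.json") as f:
    records = list(iter_concatenated_json(f.read()))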
As you concluded in your most recent comment, put_record_batch in Firehose requires you to manually put delimiters in your records so they can be easily parsed by the consumers. You can add a newline, or some special character that is used solely for parsing, % for example, which should never appear in your payload.
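For illustration, a minimal sketch of batching with an explicit newline delimiter; the stream name is a placeholder, and the API's per-call record limit is not handled here:

import json

import boto3

def send_batch_with_delimiters(records, stream_name):
    # Serialize each record and append a newline so the objects in the
    # delivered S3 file can be split back apart line by line.
    firehose = boto3.client("firehose")
    entries = [{"Data": (json.dumps(r) + "\n").encode("utf-8")} for r in records]
    firehose.put_record_batch(DeliveryStreamName=stream_name, Records=entries)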
The other option would be sending record by record. This is only viable if your use case does not require high throughput. For that, you can loop over every record and load it as a stringified data blob. If done in Python, we would have a list "records" holding all our JSON objects (each one a dictionary).
import json

import boto3

def send_to_firehose(records):
    firehose_client = boto3.client('firehose')
    for record in records:
        data = json.dumps(record)
        firehose_client.put_record(
            DeliveryStreamName='<your stream>',  # replace with your delivery stream name
            Record={'Data': data}
        )
Firehose by default buffers the data before sending it to your bucket and it should end up with something like this. This will be easy to parse and load in memory in your preferred data structure.
[
{
"metadata": {
"schema_id": "4096"
},
"payload": {
"zaza": 12,
"price": 20,
"message": "Testing sendnig the data in message attribute",
"source": "coming routing to firehose"
}
},
{
"metadata": {
"schema_id": "4096"
},
"payload": {
"zaza": 12,
"price": 20,
"message": "Testing sendnig the data in message attribute",
"source": "coming routing to firehose"
}
}
]
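A small sketch of loading such a delivered file, assuming it arrives either as one valid JSON array (as shown above) or as newline-delimited objects (as discussed earlier):

import json

def load_delivered_records(path):
    with open(path) as f:
        text = f.read().strip()
    if text.startswith("["):
        # One valid JSON array, as in the example above.
        return json.loads(text)
    # Otherwise assume newline-delimited JSON, one object per line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]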

jackson jsonparser restart parsing in broken JSON

I am using Jackson to process JSON that comes in chunks in Hadoop. That means they are big files that are cut up into blocks (in my problem each block is 128M, but it doesn't really matter).
For efficiency reasons, I need it to be streaming (not possible to build the whole tree in memory).
I am using a mixture of JsonParser and ObjectMapper to read from my input.
At the moment, I am using a custom InputFormat that is not splittable, so I can read my whole JSON.
The structure of the (valid) JSON is something like:
[ { "Rep":
{
"date":"2013-07-26 00:00:00",
"TBook":
[
{
"TBookC":"ABCD",
"Records":
[
{"TSSName":"AAA",
...
},
{"TSSName":"AAB",
...
},
{"TSSName":"ZZZ",
...
}
] } ] } } ]
The records I want to read in my RecordReader are the elements inside the "Records" element. The "..." means that there is more info there, which conforms my record.
If I have an only split, there is no problem at all.
I use a JsonParser for fine-grained control (reading the headers and moving to the "Records" token), and then I use ObjectMapper together with JsonParser to read records as objects. For details:
MappingJsonFactory factory = new MappingJsonFactory();
// keep the underlying input stream open between reads
factory.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
mapper = new ObjectMapper(factory);
mapper.configure(Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
mapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);
parser = factory.createJsonParser(iStream);
mapper.readValue(parser, JsonNode.class);
Now, let's imagine I have a file with two inputsplits (i.e. there are a lot of elements in "Records").
The valid JSON starts on the first split, and I read and keep the headers (which I need for each record, in this case the "date" field).
The split would cut anywhere in the Records array. So let's assume I get a second split like this:
...
},
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
}
] } ] } } ]
Before I start parsing, I can move the InputStream (FSDataInputStream) to the beginning ("{") of the record with the next "TSSName" in it (and this is done OK). It's fine to discard the "garbage" at the beginning. So we get this:
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
},
...
] } ] } } ]
Then I hand it to the JsonParser/ObjectMapper pair seen above.
The first object "ZZZ" is read OK.
But for the next "ZZZ2", it breaks: the JsonParser complains about malformed JSON, encountering a "," that is not inside an array. So it fails, and then I cannot keep reading my records.
How could this problem be solved, so I can still read my records from the second (and nth) split? How could I make the parser ignore these errors on the commas, or let the parser know in advance that it's reading the contents of an array?
It seems it's OK to just catch the exception: the parser carries on and is able to keep reading objects via the ObjectMapper.
I don't really like it - I would prefer an option where the parser would not throw exceptions on nonstandard or even bad JSON. So I don't know if this fully answers the question, but I hope it helps.