Failed to parse JSON with BigQuery from Google Storage bucket - json

I upload the attached JSON from my backend to a Google Cloud Storage bucket.
Now I'm trying to connect this JSON to a BigQuery table, but I get the following error. What changes do I need to make?
Error while reading table: XXXXX, error message: Failed to parse JSON: No object found when new array is started.; BeginArray returned false; Parser terminated before end of string
[["video_screen","click_on_screen","false","202011231958","1","43","0"],["buy","error","2","202011231807","1","6","0"],["sign_in","enter","user_details","202011231220","2","4","0"],["video_screen","click_on_screen","false","202011230213","1","4","0"],["video_screen","click_on_screen","false","202011230633","1","4","0"],["video_screen","click_on_screen","false","202011230709","1","4","0"],["video_screen","click_on_screen","false","202011230712","1","4","0"],["video_screen","click_on_screen","false","202011230723","1","4","0"],["video_screen","click_on_screen","false","202011230725","1","4","0"],["video_screen","click_on_screen","false","202011231739","1","4","0"],["category","select","MTV","202011232228","1","3","0"],["sign_in","enter","user_details","202011230108","2","3","0"],["sign_in","enter","user_details","202011230442","2","3","0"],["video","select","youtube","202011230108","1","3","0"],["video","select","youtube","202011230633","1","3","0"],["video_screen","click_on_screen","false","202011230458","1","3","0"],["video_screen","click_on_screen","false","202011230552","1","3","0"],["video_screen","click_on_screen","false","202011230612","1","3","0"],["video_screen","click_on_screen","false","202011231740","1","3","0"],["category","select","Disney Karaoke","202011232228","1","2","0"],["category","select","Duet","202011232228","1","2","0"],["category","select","Free","202011230726","1","2","0"],["category","select","Free","202011231830","2","2","0"],["category","select","Free","202011232228","1","2","0"],["category","select","Love","202011232228","1","2","0"],["category","select","New","202011232228","1","2","0"],["category","select","Pitch Perfect 2","202011232228","1","2","0"],["developer","click","hithub","202011230749","1","2","0"],["sign_in","enter","user_details","202011230134","1","2","0"],["sign_in","enter","user_details","202011230211","1","2","0"],["sign_in","enter","user_details","202011230219","1","2","0"]]

BigQuery reads newline-delimited JSON (JSONL) files, and your example is not in that format:
JSONL uses \n as the delimiter between records. Your example is all on one line, with commas between records.
Every JSONL line is a JSON object, so it starts with { and ends with }. Your example contains JSON arrays, which are not supported.
JSONL is based on JSON, so every data element must be named. The first record might appear as { "field1_name": "video_screen", "field2_name": "click_on_screen", "field3_name": false, "field4_name": 202011231958, "field5_name": 1, "field6_name": 43, "field7_name": 0 }
JSONL has no outer pair of brackets []. The first line starts with { (not [{) and the last line ends with } (not }]).
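A small script can convert the uploaded file into JSONL before loading. Here is a minimal sketch; the file names and the seven column names (field1_name through field7_name, matching the hypothetical record above) are placeholders to replace with your real schema:

import json

# Hypothetical column names; replace them with your real schema.
FIELDS = ["field1_name", "field2_name", "field3_name", "field4_name",
          "field5_name", "field6_name", "field7_name"]

with open("events.json") as src, open("events.jsonl", "w") as dst:
    rows = json.load(src)  # the outer [...] array of arrays
    for row in rows:
        # One JSON object per line, newline-delimited.
        dst.write(json.dumps(dict(zip(FIELDS, row))) + "\n")

The converted file can then be loaded with source format NEWLINE_DELIMITED_JSON.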

Related

Create JSON with correct key-value strings to be parsed

I have a requirement to reproduce a JSON document, basically to run some tests on code of mine that parses similar JSON in real time.
This is my code snippet; event is generated from an S3 PUT notification, i.e. published to SQS from an SNS listener:
event_body = json.loads(event["Records"][0]["body"])
event_body_msg = json.loads(event_body['Message'])
event_body_dict = event_body_msg['Records'][0]
s3_buck = event_body_dict['s3']['bucket']['name']
I want my JSON to contain the hierarchical structure that would be parsed by this code (my test asserts on the value of s3_buck). Here is the JSON I came up with:
{
    "Records": [
        {
            "messageId": "a6665910-ab5a-46c3-baaf-6086c0c90511",
            "receiptHandle": "AQEBscBCR7DwSLqd5SXvEAX+8NUImpMPNmJ9hSD03HgWHhPnNZoIIqHkqI8lvwGMLjhX2R1ogPfo09z8EHcI7Nuh851vi4cIPBngMbIS6yw/rBtG115vSUyfN8i1yKM6Oz7iSJ2kIJCGmWRF2Rhsc8dH31zcyZKbVz/SzCOK8S/E9SdFHkPi2iNm4tr4PgrI+ZrvtYUicOuZQAJ8++hYo0rB43YCZKSZWMV1LG4iz2+OKVO08qZv3WyJ3pUegW4LXNp1xAf2abep44lYgWqqDWyWRlnpKayagqaTSaqR/OzNM3Iky9MnXqVz3g7CRBO28h2noUy4T6cW6HmlZ+xe3TWHOToJeWqiRnsY1HYuZxGscRpDUXIq5V7pZPhkLU2XbdQg",
            "body": "{\"Message\": {\"Records\": [{\"s3\": {\"bucket\": {\"name\": \"demo-bucket-name\",\"arn\": \"arn:aws:s3:::demo-bucket-name\"},\"object\": {\"key\": \"demo-key-prefix.json\"}}}]}}"
        }
    ]
}
I am trying to replicate the AWS SNS notification to create a sample JSON that contains only the attributes for my use case. Here is the sample event SNS produces (copied from the Lambda console): https://www.heypasteit.com/clip/0IUAE3. I am only picking the attributes I need, like the S3 bucket name or ARN.
However, the problem is that I run into errors when I parse event["Records"][0]["body"], with the error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 251 (char 250)
I tried enclosing the JSON string under the "body" key using r""" but no luck. I'm wondering what the right format is to create this JSON.
body is not valid as a JSON string: it contains two records without an enclosing [ ] and a , separator.
Let's print your body string, then validate it with https://jsonlint.com/
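One robust way to build such a fixture is to construct the nested structure as Python dictionaries and let json.dumps do the escaping, since body (and, in a real SNS event, Message) is a JSON document embedded as a string. A minimal sketch under that assumption, with demo-bucket-name and demo-key-prefix.json as placeholders:

import json

# Innermost S3 event, as a plain dict.
s3_records = {"Records": [{"s3": {
    "bucket": {"name": "demo-bucket-name",
               "arn": "arn:aws:s3:::demo-bucket-name"},
    "object": {"key": "demo-key-prefix.json"}}}]}

# SNS embeds the S3 event as a *string* under "Message",
# and SQS embeds the SNS payload as a *string* under "body".
sns_payload = {"Message": json.dumps(s3_records)}
event = {"Records": [{"body": json.dumps(sns_payload)}]}

# The parsing code from the question now works unchanged.
event_body = json.loads(event["Records"][0]["body"])
event_body_msg = json.loads(event_body["Message"])
s3_buck = event_body_msg["Records"][0]["s3"]["bucket"]["name"]
print(s3_buck)  # demo-bucket-name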

Selecting a random greeting from a JSON file using Python

I have a JSON file that looks like this, from which I have to pick at random, so that every time any input comes in it shows a random one of the three greetings in the JSON file.
{
    "1": "Welcome",
    "2": "Hello",
    "3": "Hi"
}
I read the JSON file
greeting_template1=readjson(input_file_path+'greeting_template1.json')
and to randomise
greeting_template1 = random.choice(greeting_template1)
But I am getting the error:
greeting_template1 = random.choice(greeting_template1)
File "C:\Users\\AppData\Local\Continuum\anaconda3\envs\lib\random.py", line 262, in choice
return seq[i]
KeyError: 2
Please point out where I am going wrong.
As others have pointed out, your JSON is not valid. A valid JSON file would be:
{
    "1": "Welcome",
    "2": "Hello",
    "3": "Hi"
}
And the code to get a random greeting would look something like:
import json
import random

with open('greeting_template1.json') as json_file:
    data = json.load(json_file)

random_greeting = data[random.choice(list(data))]
The reason you are getting the error is that random.choice() needs a sequence as an argument. Parsing JSON gives you a Python dictionary, which is not a sequence.
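You can also see where the specific KeyError: 2 comes from: random.choice picks a random integer index i and evaluates seq[i], and indexing a dict looks i up as a key. A short demonstration, assuming the three-key dict above:

import random

data = {"1": "Welcome", "2": "Hello", "3": "Hi"}

# random.choice computes an integer i < len(data) and returns data[i];
# the keys here are the strings "1", "2", "3", so the integer lookup
# raises KeyError.
try:
    random.choice(data)
except KeyError as exc:
    print("KeyError:", exc)

# Choosing from the keys (or the values) works:
print(data[random.choice(list(data))])
print(random.choice(list(data.values())))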
Your document has three JSON documents in it, not one. Once you close the initial {, that is your JSON. You need to rewrite it to:
{
    "1": "Welcome",
    "2": "Hello",
    "3": "Hi"
}

I need to flatten a JSON web response with nested arrays [[ ]] into a DataFrame

I'm trying to convert an HTTP JSON response into a DataFrame, then out to a CSV file.
I'm struggling with the JSON-to-DataFrame step.
URL:
http://api.kraken.com/0/public/OHLC?pair=XXBTZEUR&interval=1440
JSON response (partial; there are 720 records in the arrays):
{
    "error": [],
    "result": {
        "XXBTZEUR": [
            [1486252800, "959.7", "959.7", "935.0", "943.6", "945.6", "4423.72544809", 5961],
            [1486339200, "943.8", "959.7", "940.0", "952.9", "953.5", "4464.48492401", 7678],
            [1486425600, "953.6", "990.0", "952.7", "988.5", "977.3", "8123.94462701", 10964],
            [1486512000, "988.4", "1000.1", "963.3", "987.5", "983.7", "10989.31074845", 16741],
            [1486598400, "987.4", "1007.4", "847.9", "926.4", "934.5", "22530.11626076", 52668],
            [1486684800, "926.4", "949.0", "886.0", "939.7", "916.7", "11173.53504917", 12588]
        ],
        "last": 1548288000
    }
}
I get
KeyError: 'XXBTZEUR'
on the json_normalize line. That seems to indicate that json_normalize is trying to build the DataFrame from the "XXBTZEUR" level, not lower down at the record level. How do I get json_normalize to read the records instead, i.e. how do I get it to reference deep enough?
I have read several other posts on this site without understanding what I'm doing wrong.
One post mentions that json.loads() must be used. Does json_string.json() also load the JSON object, or do I need json.loads() instead?
I also tried variations of json_normalize:
BTCEUR_Daily_Table = json_normalize(json_data[[]])
TypeError: unhashable type: 'list'
Can json_normalize not load an array into a DataFrame row?
code so far:
import requests
from pandas import json_normalize  # on older pandas: from pandas.io.json import json_normalize

BTCEUR_Daily_URL = 'http://api.kraken.com/0/public/OHLC?pair=XXBTZEUR&interval=1440'
json_string = requests.get(BTCEUR_Daily_URL)
json_data = json_string.json()
BTCEUR_Daily_Table = json_normalize(json_data, record_path=["XXBTZEUR"])
What I need as a result:
In my DF, I just want the arrayed records shown in the "body" of the JSON structure. None of the header & footer are needed.
The solution I found was:
BTCEUR_Daily_Table = json_normalize(data=json_data, record_path=[['result','XXBTZEUR']])
The second parameter specifies the full "path" to the parent label of the records.
Apparently double brackets are needed to specify a full path; otherwise the two labels are taken to mean two top-level names.
Without another post here, I would never have found the solution.
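For completeness, here is the whole pipeline including the CSV step. (On the json.loads question above: requests' .json() already parses the response body, so json.loads is not needed.) The column names follow the order documented for Kraken's OHLC endpoint, but treat them as an assumption and verify against the API docs:

import requests
from pandas import json_normalize  # on older pandas: from pandas.io.json import json_normalize

url = 'http://api.kraken.com/0/public/OHLC?pair=XXBTZEUR&interval=1440'
json_data = requests.get(url).json()  # .json() parses the response body

table = json_normalize(data=json_data, record_path=[['result', 'XXBTZEUR']])
# The OHLC entries are positional; name the columns accordingly.
table.columns = ['time', 'open', 'high', 'low', 'close', 'vwap', 'volume', 'count']
table.to_csv('BTCEUR_daily.csv', index=False)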

How to read invalid JSON format from Amazon Firehose

I've got this most horrible scenario in which I want to read the files that Kinesis Firehose creates on our S3.
Kinesis Firehose creates files that don't put every JSON object on a new line; the files are simply concatenated JSON objects.
{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}
This is a scenario not supported by a normal JSON.parse, and I have tried working with the following regex: .scan(/({((\".?\":.?)*?)})/)
But the scan only seems to work in scenarios without nested brackets.
Does anybody know a working/better/more elegant way to solve this problem?
The regex in the initial answer is for unquoted JSON, which happens sometimes. This one:
({((\\?\".*?\\?\")*?)})
works for both quoted and unquoted JSON.
Besides this, I improved it a bit to keep it simpler, since you can have integer as well as normal values; anything within string literals will be ignored thanks to the double capturing group.
https://regex101.com/r/kPSc0i/1
Modify the input to be one large JSON array, then parse that:
require 'json'

input = File.read("input.json")
# Insert commas between back-to-back objects, then wrap everything in [ ].
json = "[#{input.rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
You might want to combine the first two to save some memory:
json = "[#{File.read('input.json').rstrip.gsub(/\}\s*\{/, '},{')}]"
data = JSON.parse(json)
This assumes that } followed by some whitespace followed by { never occurs inside a key or value in your JSON encoded data.
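If Python is an option, you can avoid that assumption entirely by letting the JSON parser itself find each object boundary. Here is a sketch using json.JSONDecoder.raw_decode, which parses one document and reports where it ended:

import json

def parse_concatenated_json(blob):
    """Yield each object from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    idx, end = 0, len(blob)
    while idx < end:
        while idx < end and blob[idx].isspace():
            idx += 1  # skip whitespace between documents
        if idx >= end:
            break
        obj, idx = decoder.raw_decode(blob, idx)  # parse one document
        yield obj

with open("input.json") as f:
    records = list(parse_concatenated_json(f.read()))

Because a real parser tracks nesting, this also handles } { sequences inside string values, which the regex and gsub approaches cannot.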
As you concluded in your most recent comment, put_record_batch in Firehose requires you to manually add delimiters to your records so they can easily be parsed by consumers. You can add a newline, or some special character used solely for parsing, % for example, which should never appear in your payload.
The other option would be sending record by record. This is only viable if your use case does not require high throughput. For that you can loop over every record and send it as a stringified data blob. If done in Python, we would have a list records holding all our JSON objects (each one a dictionary).
import json

import boto3

def send_to_firehose(records):
    firehose_client = boto3.client('firehose')
    for record in records:
        data = json.dumps(record)
        firehose_client.put_record(
            DeliveryStreamName='<your stream>',  # substitute your delivery stream name
            Record={'Data': data}
        )
Firehose by default buffers the data before sending it to your bucket, and it should end up with something like this. This will be easy to parse and load into memory in your preferred data structure.
[
    {
        "metadata": {
            "schema_id": "4096"
        },
        "payload": {
            "zaza": 12,
            "price": 20,
            "message": "Testing sending the data in message attribute",
            "source": "coming routing to firehose"
        }
    },
    {
        "metadata": {
            "schema_id": "4096"
        },
        "payload": {
            "zaza": 12,
            "price": 20,
            "message": "Testing sending the data in message attribute",
            "source": "coming routing to firehose"
        }
    }
]

JSON handling in Robot Framework

I have a JSON file in which there is a field that I need to edit, then save the file for the next usage.
The field I need to edit is shown below.
The value I need to assign to the field is generated randomly at run time; I'll capture it in a variable, pass it to the specific JSON key "dp", then save the JSON.
The saved JSON will be used for a REST POST.
{
    "p": "10",
    "v": 100,
    "vt": [
        {
            "dp": "Field to be edited" (integer value)
        }
    ]
}
The simplest solution would be to write a Python keyword that can change the value for you; a sketch of such a keyword follows the steps below. However, you can solve this with robot keywords by performing the following steps:
convert the JSON string to a dictionary
modify the dictionary
convert the dictionary back to a JSON string
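For the Python keyword route, a minimal library could look like this. It is a hedged sketch: the file name update_json.py and the keyword name are hypothetical, and it assumes "vt" is an object holding "dp", as in the complete example further down:

# update_json.py -- hypothetical helper library for Robot Framework
import json

def set_dp_value(json_string, value):
    """Return json_string with vt['dp'] replaced by an integer value."""
    data = json.loads(json_string)
    data["vt"]["dp"] = int(value)  # assumes "vt" is an object containing "dp"
    return json.dumps(data)

In a test you would import it with Library    update_json.py and call ${json_string}=    Set Dp Value    ${json_string}    ${value}.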
Convert the JSON string to a dictionary
Python has a module (json) for working with JSON data. You can use the evaluate keyword to convert your JSON string to a python dictionary using the loads (load string) method of that module.
Assuming your JSON data is in a robot variable named ${json_string}, you can convert it to a python dictionary like this:
${json}=    evaluate    json.loads('''${json_string}''')    json
With the above, ${json} now holds a reference to a dictionary that contains all of the json data.
Modify the dictionary
The Collections library that comes with robot has a keyword named set to dictionary which can be used to set the value of a dictionary element. In this case, you need to change the value of a dictionary nested inside the vt element of the JSON object. We can reference that nested dictionary using robot's extended variable syntax.
For example:
set to dictionary    ${json["vt"]}    dp=the new value
With that, ${json} now has the new value. However, it is still a python dictionary rather than JSON data, so there's one more step.
Convert the dictionary back to JSON
Converting the dictionary back to JSON is the reverse of the first step. Namely, use the dumps (dump string) method of the json module:
${json_string}=    evaluate    json.dumps(${json})    json
With that, ${json_string} will contain a valid JSON string with the modified data.
Complete example
The following is a complete working example. The JSON string will be printed before and after the substitution of the new value:
*** Settings ***
Library    Collections

*** Test Cases ***
Example
    ${json_string}=    catenate
    ...    {
    ...        "p": "10",
    ...        "v": 100,
    ...        "vt": {
    ...            "dp": "Field to be edited"
    ...        }
    ...    }
    log to console    \nOriginal JSON:\n${json_string}
    ${json}=    evaluate    json.loads('''${json_string}''')    json
    set to dictionary    ${json["vt"]}    dp=the new value
    ${json_string}=    evaluate    json.dumps(${json})    json
    log to console    \nNew JSON string:\n${json_string}
For reading data from and writing data to a file, I am using the OperatingSystem library:
${json}    Get Binary File    ${json_path}nameOfJsonFile.json
It works for me in API testing, to read a .json file and POST it, like here:
*** Settings ***
Library    Collections
Library    ExtendedRequestsLibrary
Library    OperatingSystem

*** Variables ***
${uri}    https://blabla.com/service/
${json_path}    C:/home/user/project/src/json/

*** Test Cases ***
Robot Test Case
    Create Session    alias    ${uri}
    &{headers}    Create Dictionary    Content-Type=application/json; charset=utf-8
    ${json}    Get Binary File    ${json_path}nameOfJsonFile.json
    ${resp}    Post Request    alias    data=${json}    headers=${headers}
    Should Be Equal As Strings    ${resp.status_code}    200
For integer values in JSON, the other answers did not work for me.
This worked:
${json}=    Catenate    { "p": "10", "v": 100, "vt": { "dp": "Field to be edited" } }
${value}    Set Variable    2    # the value you want
${value}    Convert To Integer    ${value}
${json}=    Evaluate    json.loads('''${json}''')    json
Set To Dictionary    ${json["vt"]}    dp=${value}
${json}=    Evaluate    json.dumps(${json})    json
Log    ${json}
Convert To Integer was required; otherwise the value would still be the string "${value}".