I have data that was identified as being in JSON format. The data is in a *.txt file, and the following is a snapshot from the text:
{"company_number":"09145694","data":{"address":{"address_line_1":"St. Andrews Road","country":"England","locality":"Henley-On-Thames","postal_code":"RG9 1HP","premises":"2"},"ceased_on":"2018-05-14","country_of_residence":"England","date_of_birth":{"month":2,"year":1977},"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d","kind":"individual-person-with-significant-control","links":{"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"},"name":"Mrs Nga Thanh Wildman","name_elements":{"forename":"Nga","middle_name":"Thanh","surname":"Wildman","title":"Mrs"},"nationality":"Vietnamese","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}
{"company_number":"08581893","data":{"address":{"address_line_1":"High Street","address_line_2":"Wendover","country":"England","locality":"Aylesbury","postal_code":"HP22 6EA","premises":"14a","region":"Buckinghamshire"},"ceased_on":"2016-07-01","country_of_residence":"England","date_of_birth":{"month":9,"year":1947},"etag":"45f9c9e5494b574eb52abc3990a49bd96fe09df3","kind":"individual-person-with-significant-control","links":{"self":"/company/08581893/persons-with-significant-control/individual/RgR9Zhc7yGhV0SBys8_WJ6H9O1o"},"name":"Mr Stephen Robert Charles Davies","name_elements":{"forename":"Stephen","middle_name":"Robert Charles","surname":"Davies","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","ownership-of-shares-25-to-50-percent-as-firm"],"notified_on":"2016-06-30"}}
{"company_number":"08581893","data":{"address":{"address_line_1":"High Street","address_line_2":"Wendover","country":"England","locality":"Aylesbury","postal_code":"HP22 6EA","premises":"14a","region":"Buckinghamshire"},"ceased_on":"2016-07-01","country_of_residence":"England","date_of_birth":{"month":6,"year":1965},"etag":"d55168c49f85ab1ef38a12ed76238d68f79f5a01","kind":"individual-person-with-significant-control","links":{"self":"/company/08581893/persons-with-significant-control/individual/-6HQmkhiomEBXJI2rgHccU67fpM"},"name":"Mr Quentin Colin Maxwell Solt","name_elements":{"forename":"Quentin","middle_name":"Colin Maxwell","surname":"Solt","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2016-06-30"}}
How do I transfer this to a normal Excel table with appropriate headings, please?
The code I have tried follows the suggestion in the comments, and I have added the dictionary to the Excel file as described here, as suggested by @skin.
github.com/VBA-tools/VBA-JSON
I am getting a run-time error 424 ("Object required") on the line
Set Parsed = JsonConverter.ParseJson(JsonText)
Here is the code:
Sub jsonchik()
    Dim FSO As New FileSystemObject
    Dim JsonTS As TextStream
    Dim JsonText As String
    Dim Parsed As Dictionary

    ' Read .json file
    Set JsonTS = FSO.OpenTextFile("psc-snapshot-2022-11-12_1of22.txt", ForReading)
    JsonText = JsonTS.ReadAll
    JsonTS.Close

    ' Parse json to Dictionary
    ' "values" is parsed as Collection
    ' each item in "values" is parsed as Dictionary
    Set Parsed = JsonConverter.ParseJson(JsonText)

    ' Prepare and write values to sheet
    Dim Values As Variant
    ReDim Values(Parsed("values").Count, 3)

    Dim Value As Dictionary
    Dim i As Long
    i = 0
    For Each Value In Parsed("values")
        Values(i, 0) = Value("a")
        Values(i, 1) = Value("b")
        Values(i, 2) = Value("c")
        i = i + 1
    Next Value

    Sheets("example").Range(Cells(1, 1), Cells(Parsed("values").Count, 3)) = Values
End Sub
Many thanks,
Suren
I have found a solution using Microsoft Power Query (Get & Transform), which has a JSON parser.
First, my data had some problems, so I had to validate it here: https://jsonlint.com/
Second, I used MS Excel / Get & Transform Data / From Text/CSV to import my data.
In the import wizard, I clicked Transform Data instead of Load To.
Then, under the Transform menu / Text Column / Parse / JSON.
Works like a charm!
Thank you all.
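For reference, the same flattening can also be done outside Excel with a short, hedged pandas sketch (pandas and openpyxl are assumed to be installed; the file name is the one from the question):

import json
import pandas as pd

# Each line of the snapshot file is one JSON object; flatten the nested
# "data" fields into columns and write an Excel table.
records = []
with open("psc-snapshot-2022-11-12_1of22.txt", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

df = pd.json_normalize(records)  # e.g. data.name, data.address.postal_code, ...

# Excel cells cannot hold Python lists (e.g. natures_of_control), so join them.
for col in df.columns:
    df[col] = df[col].apply(lambda v: ", ".join(v) if isinstance(v, list) else v)

df.to_excel("psc-snapshot.xlsx", index=False)  # requires openpyxl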
You get this because you haven't loaded the code for JsonConverter.
Download the zip file with the code
Unzip the content
Right-click on your project and Import the file named JsonConverter.bas
Now, JsonConverter will refer to the module you just imported and ParseJson to the method in that module.
Remark:
I suspect that you don't have Option Explicit set at the top of your module. If you had, you would get a "Variable not defined" compile error with ParseJson highlighted. That would certainly be more informative in terms of error messaging, and it's just one of the good reasons to use Option Explicit!
You can use the json library in Python to convert the JSON file to a CSV file, for example:
import json
import csv

# Open the JSON file
with open('yourfile.json', 'r', encoding='utf-8') as json_file:
    json_data = []
    # Read the file line by line
    for line in json_file:
        json_data.append(json.loads(line))

# Open a CSV file for writing
with open('yourfile.csv', 'w', newline='') as csv_file:
    fieldnames = ['company_number', 'address_line_1', 'country', 'locality', 'postal_code',
                  'premises', 'ceased_on', 'country_of_residence', 'date_of_birth', 'name',
                  'nationality', 'natures_of_control', 'notified_on']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()

    # Write each JSON object as a row in the CSV file
    for json_object in json_data:
        company_number = json_object['company_number']
        address = json_object['data']['address']
        address_line_1 = address['address_line_1']
        country = address['country']
        locality = address['locality']
        postal_code = address['postal_code']
        premises = address['premises']
        ceased_on = json_object['data']['ceased_on']
        country_of_residence = json_object['data']['country_of_residence']
        date_of_birth = json_object['data']['date_of_birth']
        name = json_object['data']['name']
        nationality = json_object['data']['nationality']
        natures_of_control = json_object['data']['natures_of_control']
        notified_on = json_object['data']['notified_on']
        writer.writerow({'company_number': company_number, 'address_line_1': address_line_1,
                         'country': country, 'locality': locality, 'postal_code': postal_code,
                         'premises': premises, 'ceased_on': ceased_on,
                         'country_of_residence': country_of_residence,
                         'date_of_birth': date_of_birth, 'name': name,
                         'nationality': nationality, 'natures_of_control': natures_of_control,
                         'notified_on': notified_on})
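If you need an actual .xlsx workbook rather than a CSV, one extra hedged step with pandas (the openpyxl engine is assumed to be installed) converts the file produced above:

import pandas as pd

# Reads the CSV written by the script above and saves it as an Excel workbook.
pd.read_csv('yourfile.csv').to_excel('yourfile.xlsx', index=False)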
Basic Information
I am creating a python script that can encrypt and decrypt a file with previous session data.
The Problem
I am able to decrypt my file and read it using a key. This returns a bytes string which I can in turn convert to a string. However, this string needs to be converted to a dictionary, which I cannot do. Using ast, json and eval I have run into errors.
Bytes string
decrypted = fernet.decrypt(encrypted)
String
string = decrypted.decode("UTF-8").replace("'", '"')
If I use eval() or ast.literal_eval() I get the following error:
Then I tried using json.loads() and I get the following error:
The information blocked out in both images is there to protect my SSH connections. In the first image, the SyntaxError points at the last digit of my IP address.
The Function
The function that is responsible for this when called looks like this:
import json
from cryptography.fernet import Fernet

def FileDecryption():
    with open('enc_key.key', 'rb') as filekey:
        key = filekey.read()
        filekey.close()
    fernet = Fernet(key)

    with open('saved_data.txt', 'rb') as enc_file:
        encrypted = enc_file.read()
        enc_file.close()

    decrypted = fernet.decrypt(encrypted)
    print(decrypted)
    string = decrypted.decode("UTF-8").replace("'", '"')
    data = f'{string}'
    print(data)
    #data = eval(data)
    data = json.loads(data)
    print(type(data))
    for key in data:
        #command_string = ["load", data[key][1], data[key][2], data[key][3], data[key][4]]
        #SSH.CreateSSH(command_string)
        print(key)
Any help would be appreciated. Thanks!
Your data seems like it was written incorrectly in the first place, but without a complete example it's hard to say.
Here's a complete example that round-trips a JSON-able data object.
# requirement:
# pip install cryptography
from cryptography.fernet import Fernet
import json

def encrypt(data, data_filename, key_filename):
    key = Fernet.generate_key()
    with open(key_filename, 'wb') as file:
        file.write(key)
    fernet = Fernet(key)
    encrypted = fernet.encrypt(json.dumps(data).encode())
    with open(data_filename, 'wb') as file:
        file.write(encrypted)

def decrypt(data_filename, key_filename):
    with open(key_filename, 'rb') as file:
        key = file.read()
    fernet = Fernet(key)
    with open(data_filename, 'rb') as file:
        return json.loads(fernet.decrypt(file.read()))

data = {'key1': 'value1', 'key2': 'value2'}
encrypt(data, 'saved_data.txt', 'enc_key.key')
decrypted = decrypt('saved_data.txt', 'enc_key.key')
print(decrypted)
Output:
{'key1': 'value1', 'key2': 'value2'}
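To see why the quote-swapping in the question is fragile, here is a small standalone demonstration (not tied to the Fernet code; the sample values are made up):

import json

session = {"host": "203.0.113.7", "note": "it's fine"}

# str(session) plus replacing quotes breaks as soon as a value contains an apostrophe.
broken = str(session).replace("'", '"')
try:
    json.loads(broken)
except json.JSONDecodeError as e:
    print("quote-swapping fails:", e)

# json.dumps / json.loads round-trips cleanly.
print(json.loads(json.dumps(session)))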
I am trying to convert CSV files in a folder to a single JSON file. The code below does the job, but the issue is that the JSON file has the first CSV written several times. Below is the code I tried. I guess I am going wrong with assigning the data variable. Help me fix it.
import csv, json, os

dir_path = 'C:/Users/USER/Desktop/output_files'
inputfiles = [file for file in os.listdir(dir_path) if file.endswith('.csv')]
outputfile = "data_backup1.json"

for file in inputfiles:
    filepath = os.path.join(dir_path, file)
    data = {}
    with open(filepath, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            id = row['ID']
            data[id] = row
            with open(outputfile, "a") as jsonfile:
                jsonfile.write(json.dumps(data, indent=4))
Expected output: Json file needs to have each csv written only once into it.
If your .csv files and all of their rows have different ['ID']s, your dictionary keys will be unique. In that case, your dictionary grows by one entry per .csv row read.
You have to change the indentation of the jsonfile.write() call as shown below so that each file's data is written only once. To sort your entries, you could add sort_keys=True to that call.
for file in inputfiles:
    filepath = os.path.join(dir_path, file)
    data = {}
    with open(filepath, "r") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            id = row['ID']
            data[id] = row
    with open(outputfile, "a") as jsonfile:
        jsonfile.write(json.dumps(data, indent=4, sort_keys=True))
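If the goal is one valid JSON document covering all the CSVs (rather than one appended dump per file), a hedged variant is to collect everything first and write once; this assumes the IDs are unique across files:

import csv, json, os

dir_path = 'C:/Users/USER/Desktop/output_files'
outputfile = "data_backup1.json"

# Collect every row from every CSV into one dict keyed by ID, then write once.
data = {}
for file in os.listdir(dir_path):
    if file.endswith('.csv'):
        with open(os.path.join(dir_path, file), "r") as csvfile:
            for row in csv.DictReader(csvfile):
                data[row['ID']] = row

with open(outputfile, "w") as jsonfile:
    json.dump(data, jsonfile, indent=4, sort_keys=True)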
I've got an Athena table where some fields have a fairly complex nested format. The backing records in S3 are JSON. Along these lines (but we have several more levels of nesting):
CREATE EXTERNAL TABLE IF NOT EXISTS test (
  timestamp double,
  stats array<struct<time:double, mean:double, var:double>>,
  dets array<struct<coords: array<double>,
                    header: struct<frame:int, seq:int, name:string>>>,
  pos struct<x:double, y:double, theta:double>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json'='true')
LOCATION 's3://test-bucket/test-folder/'
Now we need to be able to query the data and import the results into Python for analysis. Because of security restrictions I can't connect directly to Athena; I need to be able to give someone the query and then they will give me the CSV results.
If we just do a straight select * we get back the struct/array columns in a format that isn't quite JSON.
Here's a sample input file entry:
{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}
And example output:
select * from test
"timestamp","stats","dets","pos"
"1.520640777666096E9","[{time=15.0, mean=45.23, var=0.31}, {time=19.0, mean=17.315, var=2.612}]","[{coords=[2.4, 1.7, 0.3], header={frame=1, seq=1, name=hello}}]","{x=5.0, y=1.4, theta=0.04}"
I was hoping to get those nested fields exported in a more convenient format - getting them in JSON would be great.
Unfortunately it seems that cast to JSON only works for maps, not structs, because it just flattens everything into arrays:
SELECT timestamp, cast(stats as JSON) as stats, cast(dets as JSON) as dets, cast(pos as JSON) as pos FROM "sampledb"."test"
"timestamp","stats","dets","pos"
"1.520640777666096E9","[[15.0,45.23,0.31],[19.0,17.315,2.612]]","[[[2.4,1.7,0.3],[1,1,""hello""]]]","[5.0,1.4,0.04]"
Is there a good way to convert to JSON (or another easy-to-import format) or should I just go ahead and do a custom parsing function?
I have skimmed through all the documentation, and unfortunately there seems to be no way to do this as of now. The only possible workaround is flattening the struct by selecting its fields individually when querying Athena:
SELECT
  my_field,
  my_field.a,
  my_field.b,
  my_field.c.d,
  my_field.c.e
FROM
  my_table
Or I would convert the data to JSON using post-processing. The script below shows how:
#!/usr/bin/env python
import io
import re

pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
pattern3 = re.compile(r'\\"', re.I)

with io.open("test.csv") as f:
    headers = list(map(lambda f: f.strip(), f.readline().split(",")))
    for line in f.readlines():
        orig_line = line
        data = []
        for i, l in enumerate(line.split('","')):
            data.append(headers[i] + ":" + re.sub('^"|"$', "", l))
        line = "{" + ','.join(data) + "}"
        line = pattern1.sub(r'"\1":', line)
        line = pattern2.sub(r':"\1"', line)
        print(line)
The output on your input data is
{"timestamp":1.520640777666096E9,"stats":[{"time":15.0, "mean":45.23, "var":0.31}, {"time":19.0, "mean":17.315, "var":2.612}],"dets":[{"coords":[2.4, 1.7, 0.3], "header":{"frame":1, "seq":1, "name":"hello"}}],"pos":{"x":5.0, "y":1.4, "theta":0.04}
}
This is valid JSON.
The Python code from @tarun almost got me there, but I had to modify it in several ways because of my data. In particular, I have:
JSON structures saved in Athena as strings
Strings that contain multiple words and therefore need to be wrapped in double quotes; some of them contain "[]" and "{}" symbols.
Here is the code that worked for me; hopefully it will be useful for others:
#!/usr/bin/env python
import io
import re, sys

pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
pattern3 = re.compile(r'\\"', re.I)

with io.open(sys.argv[1]) as f:
    headers = list(map(lambda f: f.strip(), f.readline().split(",")))
    print(headers)
    for line in f.readlines():
        orig_line = line
        # save the double-quote cases, which mean there is a string with quotes inside
        line = re.sub('""', "#", orig_line)
        data = []
        for i, l in enumerate(line.split('","')):
            item = re.sub('^"|"$', "", l.rstrip())
            if (item[0] == "{" and item[-1] == "}") or (item[0] == "[" and item[-1] == "]"):
                data.append(headers[i] + ":" + item)
            else:  # we have a string
                data.append(headers[i] + ": \"" + item + "\"")
        line = "{" + ','.join(data) + "}"
        line = pattern1.sub(r'"\1":', line)
        line = pattern2.sub(r':"\1"', line)
        # restore the saved double quotes, now that we are inside the json
        line = re.sub("#", '"', line)
        print(line)
This method does not modify the query; it is post-processing. For JavaScript/Node.js we can use the npm package athena-struct-parser.
Detailed answer with an example:
https://stackoverflow.com/a/67899845/6662952
Reference - https://www.npmjs.com/package/athena-struct-parser
I used a simple approach to get around the struct -> JSON Athena limitation. I created a second table where the JSON columns were saved as raw strings. Using Presto JSON and array functions, I was able to query the data and return the valid JSON string to my program:
--Array transform functions too
select
json_extract_scalar(dd, '$.timestamp') as timestamp,
transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.time')) as arr_stats_time,
transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.mean')) as arr_stats_mean,
transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.var')) as arr_stats_var
from
(select '{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}' as dd);
I know the query will take longer to execute but there are ways to optimize.
I worked around this by creating a second table using the same S3 location, but changing the field's data type to string. The resulting CSV then contained the string that Athena pulled from the object in the JSON file, and I was able to parse the result.
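Once the nested field comes back as a plain string in the CSV, parsing it in Python is straightforward. A hedged sketch (the column name stats_raw and the file name results.csv are placeholders for whatever your export uses):

import json
import pandas as pd

# Placeholder names: results.csv is the Athena export, stats_raw the string column.
df = pd.read_csv("results.csv")
df["stats"] = df["stats_raw"].apply(json.loads)  # each cell becomes a Python list/dict
print(df["stats"].iloc[0])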
I also had to adjust @tarun's code, because I had more complex data and nested structures. Here is the solution I've got; I hope it helps:
import re
import json
import numpy as np
import pandas as pd

pattern1 = re.compile(r'(?<=[{,\[])\s*([^{}\[\],"=]+)=')
pattern2 = re.compile(r':([^{}\[\],"]+|()(?![{\[]))')
pattern3 = re.compile(r'"null"')

def convert_metadata_to_json(value):
    if type(value) is str:
        value = pattern1.sub('"\\1":', value)
        value = pattern2.sub(': "\\1"', value)
        value = pattern3.sub('null', value)
    elif np.isnan(value):
        return None
    return json.loads(value)

df = pd.read_csv('test.csv')
df['metadata_json'] = df.metadata.apply(convert_metadata_to_json)
How do I scrape this data,
http://jsonviewer.stack.hu/#http://91.134.133.185:5000/viaroute?loc=25.299919,55.376774&loc=25.298738,55.369181
and extract only "total_time" to a file?
It should be fairly easy to achieve this with a little searching.
You just have to find some modules to work with JSON, dataframes and text files, and learn how to use them.
Steps (a short sketch follows the documentation links below):
1 - read the JSON data using pandas.read_json()
2 - set data = df['total_time']
3 - write data using pandas.to_csv()
Simple as py.
Documentation:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
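Here is a hedged sketch of those steps; it swaps pandas.read_json for requests plus pandas.json_normalize, because total_time sits nested under route_summary (the URL is the one from the question and may no longer respond):

import requests
import pandas as pd

url = ("http://91.134.133.185:5000/viaroute"
       "?loc=25.299919,55.376774&loc=25.298738,55.369181")

# 1 - read the JSON response into a flat, one-row dataframe
df = pd.json_normalize(requests.get(url).json())  # nested keys become dotted columns

# 2 - pick out the total_time column
data = df["route_summary.total_time"]

# 3 - write it to a file
data.to_csv("total_time.txt", index=False)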
import json
json_string = '''Json data here'''
data = json.loads(json_string)
total_time = data["route_summary"]["total_time"]
f = open("file_name_here.txt", "w+")
f.write(str(total_time))
f.close()
I've written this program for you:
import json, urllib2
url = 'http://91.134.133.185:5000/viaroute?loc=25.299919,55.376774&loc=25.298738,55.369181'
response = urllib2.urlopen(url)
data = json.load(response)
tot_time = str(data['route_summary']['total_time'])
s = tot_time + "\n"
outfile = "C:\\Users\\USER\\Desktop\\outfile.txt"
with open(outfile, "a+") as f:
f.write(s)
It'll append each observation to the end of outfile.txt
Saving json data to a file and reading that file
import json, urllib2
url = 'http://91.134.133.185:5000/viaroute?loc=25.299919,55.376774&loc=25.298738,55.369181'
response = urllib2.urlopen(url)
data = json.load(response)
outfile = "C:\\Users\\USER\\Desktop\\outfile.txt"
#saving json to file
with open(outfile, "w") as f:
    f.write(str(data))

#reading file with json data
with open(outfile, 'r') as g:
    json_data = g.readline()
    print json_data
#Output:
{u'route_geometry': u'{_ego#m}|rhBpBaBvHuC`EuArEUtEtAlDvEnD`MlDvMli#hsEfFzn#QlTgNhwCs#fKwBhF', u'status': 0, u'via_indices': [0, 15], u'route_summary': {u'total_time': 101, u'end_point': u'', u'start_point': u'', u'total_distance': 871}, u'route_name': [u'', u''], u'hint_data': {u'checksum': 326195011, u'locations': [u'AXQDAP____8AAAAABwAAABEAAAAYAAAAIwIAAERwAgAAAAAADgyCAef7TAMCAAEB', u'bOsDAP____8AAAAAAwAAAAcAAADFAQAAFAAAAEJwAgAAAAAANQeCAd3dTAMFAAEB']}, u'via_points': [[25.299982, 55.376873], [25.29874, 55.369179]], u'status_message': u'Found route between points', u'found_alternative': False}
I am writing records to a Kinesis Firehose stream that are eventually written to an S3 file by Amazon Kinesis Firehose.
My record object looks like
ItemPurchase {
String personId,
String itemId
}
The data written to S3 looks like:
{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}
NO COMMA SEPARATION.
NO STARTING BRACKET as in a JSON array: [
NO ENDING BRACKET as in a JSON array: ]
I want to read this data and get a list of ItemPurchase objects.
List<ItemPurchase> purchases = getPurchasesFromS3(IOUtils.toString(s3ObjectContent))
What is the correct way to read this data?
It boggles my mind that Amazon Firehose dumps JSON messages to S3 in this manner, and doesn't allow you to set a delimiter or anything.
Ultimately, the trick I found to deal with the problem was to process the text file using the JSON raw_decode method
This will allow you to read a bunch of concatenated JSON records without any delimiters between them.
Python code:
import json

decoder = json.JSONDecoder()

with open('giant_kinesis_s3_text_file_with_concatenated_json_blobs.txt', 'r') as content_file:
    content = content_file.read()

content_length = len(content)
decode_index = 0

while decode_index < content_length:
    try:
        obj, decode_index = decoder.raw_decode(content, decode_index)
        print("File index:", decode_index)
        print(obj)
    except json.JSONDecodeError as e:
        print("JSONDecodeError:", e)
        # Scan forward and keep trying to decode
        decode_index += 1
I also had the same problem; here is how I solved it:
1. replace "}{" with "}\n{"
2. split the lines by "\n".
input_json_rdd.map(lambda x: re.sub("}{", "}\n{", x, flags=re.UNICODE)) \
              .flatMap(lambda line: line.split("\n"))
A nested JSON object has several "}"s, so splitting the line by "}" doesn't solve the problem.
I've had the same issue.
It would have been better if AWS allowed us to set a delimiter but we can do it on our own.
In my use case, I've been listening on a stream of tweets, and once receiving a new tweet I immediately put it to Firehose.
This, of course, resulted in a 1-line file which could not be parsed.
So, to solve this, I appended a \n to each tweet's JSON before sending it (a sketch follows below).
This, in turn, let me use packages that can output lines when reading stream contents and parse the file easily.
Hope this helps you.
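A minimal hedged sketch of that write-side fix with boto3 (the stream name and the tweet variable are placeholders, not from the original answer):

import json
import boto3

firehose = boto3.client("firehose")

def put_tweet(tweet):
    # Append a newline so each record lands on its own line in the S3 object.
    firehose.put_record(
        DeliveryStreamName="my-tweet-stream",  # placeholder stream name
        Record={"Data": (json.dumps(tweet) + "\n").encode("utf-8")},
    )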
I think the best way to tackle this is to first create a properly formatted JSON file containing well-separated JSON objects. In my case, I added ',' to the events that were pushed into the Firehose. Then, after a file is saved in S3, all the files contain JSON objects separated by a delimiter (a comma, in our case). Another thing that must be added is '[' and ']' at the beginning and end of the file. Then you have a proper JSON file containing multiple JSON objects, and parsing them becomes possible.
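If you only manage to add the commas between records (and not the surrounding brackets), a hedged read-side variant is to wrap the content yourself when loading (the file name is a placeholder):

import json

with open("firehose_output.txt", encoding="utf-8") as f:
    body = f.read().strip().rstrip(",")  # tolerate a trailing comma after the last record

records = json.loads("[" + body + "]")
print(len(records), "records parsed")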
If the input source for the Firehose is an Analytics application, this concatenated JSON without a delimiter is a known issue, as cited here. You should have a Lambda function, like the one here, that outputs JSON objects on multiple lines.
I used a transformation Lambda to add a line break at the end of every record
import base64
import copy

def lambda_handler(event, context):
    output = []

    for record in event['records']:
        # Decode from base64 (Firehose records are base64 encoded)
        payload = base64.b64decode(record['data'])

        # Read json as utf-8
        json_string = payload.decode("utf-8")

        # Add a line break
        output_json_with_line_break = json_string + "\n"

        # Encode the data
        encoded_bytes = base64.b64encode(bytearray(output_json_with_line_break, 'utf-8'))
        encoded_string = str(encoded_bytes, 'utf-8')

        # Create a deep copy of the record and append to output with transformed data
        output_record = copy.deepcopy(record)
        output_record['data'] = encoded_string
        output_record['result'] = 'Ok'
        output.append(output_record)

    print('Successfully processed {} records.'.format(len(event['records'])))

    return {'records': output}
Use this simple Python code.
import json

input_str = '''{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}'''
data_str = "[{}]".format(input_str.replace("}{","},{"))
data_json = json.loads(data_str)
And then (if you want) convert to Pandas.
import pandas as pd
df = pd.DataFrame().from_records(data_json)
print(df)
And this is the result:
itemId personId
0 i-111 p-111
1 i-222 p-222
2 i-333 p-333
If there's a way to change how the data is written, please separate all the records with a line break. That way you can read the data simply, line by line. If not, then build a scanner object that takes "}" as a delimiter and use the scanner to read. That would do the job.
You can find each valid JSON object by counting the braces. Assuming the file starts with a {, this Python snippet should work:
import json

def read_block(stream):
    open_brackets = 0
    block = ''
    while True:
        c = stream.read(1)
        if not c:
            break
        if c == '{':
            open_brackets += 1
        elif c == '}':
            open_brackets -= 1
        block += c
        if open_brackets == 0:
            yield block
            block = ''

if __name__ == "__main__":
    c = 0
    with open('firehose_json_blob', 'r') as f:
        for block in read_block(f):
            record = json.loads(block)
            print(record)
This problem can be solved with a JSON parser that consumes objects one at a time from a stream. The raw_decode method of the JSONDecoder exposes just such a parser, but I've written a library that makes it straightforward to do this with a one-liner.
from firehose_sipper import sip
for entry in sip(bucket=..., key=...):
do_something_with(entry)
I've added some more details in this blog post
In Spark, we had the same problem. We're using the following:
import re
from pyspark.sql.functions import *

@udf
def concatenated_json_to_array(text):
    final = "["
    separator = ""
    for part in text.split("}{"):
        final += separator + part
        separator = "}{" if re.search(r':\s*"([^"]|(\\"))*$', final) else "},{"
    return final + "]"

def read_concatenated_json(path, schema):
    return (spark.read
            .option("lineSep", None)
            .text(path)
            .withColumn("value", concatenated_json_to_array("value"))
            .withColumn("value", from_json("value", schema))
            .withColumn("value", explode("value"))
            .select("value.*"))
It works as follows:
Read the data as one string per file (no delimiters!)
Use a UDF to introduce the JSON array and split the JSON objects by introducing a comma. Note: be careful not to break any strings with }{ in them!
Parse the JSON with a schema into DataFrame fields.
Explode the array into separate rows
Expand the value object into columns.
Use it like this:
from pyspark.sql.types import *

schema = ArrayType(
    StructType([
        StructField("type", StringType(), True),
        StructField("value", StructType([
            StructField("id", IntegerType(), True),
            StructField("joke", StringType(), True),
            StructField("categories", ArrayType(StringType()), True)
        ]), True)
    ])
)

path = '/mnt/my_bucket_name/messages/*/*/*/*/'
df = read_concatenated_json(path, schema)
I've written more details and considerations here: Parsing JSON data from S3 (Kinesis) with Spark. Do not just split by }{, as it can mess up your string data! For example: { "line": "a\"r}{t" }.
You can use the script below.
If the streamed data size does not exceed the buffer size that you set, each S3 file will have one pair of brackets ([]) and commas.
import base64

print('Loading function')

def lambda_handler(event, context):
    output = []

    for record in event['records']:
        print(record['recordId'])
        payload = base64.b64decode(record['data']).decode('utf-8') + ',\n'

        # Do custom processing on the payload here
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(payload.encode('utf-8'))
        }
        output.append(output_record)

    last = len(event['records']) - 1
    print('Successfully processed {} records.'.format(len(event['records'])))

    start = '[' + base64.b64decode(output[0]['data']).decode('utf-8')
    end = base64.b64decode(output[last]['data']).decode('utf-8') + ']'

    output[0]['data'] = base64.b64encode(start.encode('utf-8'))
    output[last]['data'] = base64.b64encode(end.encode('utf-8'))

    return {'records': output}
Using a JavaScript regex:
JSON.parse(`[${item.replace(/}\s*{/g, '},{')}]`);