Python 3 Tornado bytes and JSON issues - json

I recently started with Python 3, and specifically with the Tornado framework.
My task was to integrate Facebook authentication, and I used the demo from here:
https://github.com/tornadoweb/tornado/tree/master/demos/facebook
The problem is that user is a dictionary containing bytes values.
class AuthLoginHandler(BaseHandler, tornado.auth.FacebookGraphMixin):
    @tornado.web.asynchronous
    def get(self):
        ....
    def _on_auth(self, user):
        if not user:
            raise tornado.web.HTTPError(500, "Facebook auth failed")
        self.set_secure_cookie("fbdemo_user", tornado.escape.json_encode(user))
        self.redirect(self.get_argument("next", "/"))
_on_auth always produces this error: b'token or sesion_expire data here' is not JSON serializable
I've come up with a few solutions found on Stack Overflow:
Fix the data before encoding:
import collections.abc
def convert(data):
    '''
    Converts bytes data into unicode strings, so this can be encoded into JSON
    '''
    if isinstance(data, str):
        return str(data)
    elif isinstance(data, bytes):
        return data.decode('utf-8')
    elif isinstance(data, collections.abc.Mapping):
        return dict(map(convert, data.items()))
    elif isinstance(data, collections.abc.Iterable):
        return type(data)(map(convert, data))
    else:
        return data
# ... and somewhere in the code
tornado.escape.json_encode(convert(user))
And the next one is to extend the json itself:
import json
class JSONEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, bytes):
            return o.decode('utf-8')
        return json.JSONEncoder.default(self, o)
Now the question: why are there such issues with data where type(data) == <class 'bytes'>, and am I doing it right?
Thank you

Better late than never.
The Python 3 json encoder does not accept byte strings. Tornado provides a function, to_basestring, which can be used to overcome this problem.
Here's what the source doc says about the issue:
In python2, byte and unicode strings are mostly interchangeable, so
functions that deal with a user-supplied argument in combination with
ascii string constants can use either and should return the type the
user supplied. In python3, the two types are not interchangeable, so
this method is needed to convert byte strings to unicode.
Usage:
tornado.escape.to_basestring(value)
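To make the failure concrete, here is a minimal sketch that uses only the standard library, so it runs without Tornado. The user dict and the decode_bytes helper are hypothetical stand-ins, and the helper only handles a flat dictionary; it plays the same per-value role that to_basestring plays in Tornado.

```python
import json

# Hypothetical stand-in for the dict that Tornado hands to _on_auth.
user = {"name": "Alice", "access_token": b"abc123"}

failed = False
try:
    json.dumps(user)
except TypeError:
    # The json module refuses bytes values in Python 3.
    failed = True


def decode_bytes(value):
    """Decode bytes to str and leave every other value untouched."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


# Convert each value before encoding (flat dicts only in this sketch).
clean_user = {key: decode_bytes(val) for key, val in user.items()}
encoded = json.dumps(clean_user)
```

Nested dictionaries or lists would need a recursive version, such as the convert function from the question.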

Related

dumping YAML with tags as JSON

I know I can use ruamel.yaml to load a file with tags in it. But when I want to dump without them I get an error. Simplified example:
from ruamel.yaml import YAML
from json import dumps
import sys
yaml = YAML()
data = yaml.load(
"""
!mytag
a: 1
b: 2
c: 2022-05-01
"""
)
try:
    yaml2 = YAML(typ='safe', pure=True)
    yaml.default_flow_style = True
    yaml2.dump(data, sys.stdout)
except Exception as e:
    print('exception dumping using yaml', e)
try:
    print(dumps(data))
except Exception as e:
    print('exception dumping using json', e)
exception dumping using yaml cannot represent an object: ordereddict([('a', 1), ('b', 2), ('c', datetime.date(2022, 5, 1))])
exception dumping using json Object of type date is not JSON serializable
I cannot change the load() without getting an error on the tag. How can I get output with the tags stripped (YAML or JSON)?
You get the error because neither the safe dumper (pure or not) nor JSON knows about the ruamel.yaml internal
types that preserve comments, tagging, block/flow-style, etc.
Dumping as YAML, you could register these types with alternate dump methods. As JSON this is more complex,
as AFAIK you can only convert the leaf nodes (i.e. the YAML scalars; you would e.g. be
able to use that to dump the datetime.date instance that is loaded as the value of key c).
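As a rough illustration of the leaf-node conversion mentioned above, the json module accepts a default callable that is invoked only for values it cannot serialize itself. This is a sketch under assumptions: the plain dict stands in for data that has already been stripped of ruamel.yaml types, and encode_leaf is a hypothetical name. It does nothing about tags or non-string keys.

```python
import datetime
import json


def encode_leaf(obj):
    """Called by json only for objects it cannot serialize on its own."""
    if isinstance(obj, (datetime.date, datetime.datetime)):
        return obj.isoformat()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Hypothetical stand-in for the loaded data, using a plain dict.
plain = {"a": 1, "b": 2, "c": datetime.date(2022, 5, 1)}
result = json.dumps(plain, default=encode_leaf)
print(result)
```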
I have used YAML as a readable, editable and programmatically updatable config file, with
a much faster loading JSON version of the data used if its file is not older than the corresponding YAML (if
it is older, the JSON gets created from the YAML). The thing to do in order to dump(s) is to
recursively generate Python primitives that JSON understands.
The following does so, but there are other constructs besides datetime
instances that JSON doesn't allow, e.g. sequences or dicts used
as keys (which is allowed in YAML, but not in JSON). For keys that are
sequences I concatenate the string representations of the elements:
import ruamel.yaml  # binds the name used below for scalarbool and comments
from ruamel.yaml import YAML
import sys
import datetime
import json
from collections.abc import Mapping
yaml = YAML()
data = yaml.load("""\
!mytag
a: 1
b: 2
c: 2022-05-01
[d, e]: !myseq [42, 196]
{f: g, 18: y}: !myscalar x
""")
def json_dump(data, out, indent=None):
    def scalar(obj):
        # convert a leaf node (or a key) into something JSON accepts
        if obj is None:
            return None
        if isinstance(obj, (datetime.date, datetime.datetime)):
            return str(obj)
        if isinstance(obj, ruamel.yaml.scalarbool.ScalarBoolean):
            return obj == 1
        if isinstance(obj, bool):
            return bool(obj)
        if isinstance(obj, int):
            return int(obj)
        if isinstance(obj, float):
            return float(obj)
        if isinstance(obj, tuple):
            return '_'.join([str(x) for x in obj])
        if isinstance(obj, Mapping):
            return '_'.join([f'{k}-{v}' for k, v in obj.items()])
        if not isinstance(obj, str): print('type', type(obj))
        return obj
    def prep(obj):
        # recursively rebuild the structure from plain dicts and lists
        if isinstance(obj, dict):
            return {scalar(k): prep(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [prep(elem) for elem in obj]
        if isinstance(obj, ruamel.yaml.comments.TaggedScalar):
            return prep(obj.value)
        return scalar(obj)
    res = prep(data)
    json.dump(res, out, indent=indent)
json_dump(data, sys.stdout, indent=2)
which gives:
{
  "a": 1,
  "b": 2,
  "c": "2022-05-01",
  "d_e": [
    42,
    196
  ],
  "f-g_18-y": "x"
}

JSON Serialization is empty when serialising from eulxml in python

I am working with eXistDB in Python and using the eulxml library to map the XML in the database to custom objects. I then want to serialize these objects to JSON (for another application to consume), but I'm running into issues. jsonpickle doesn't work (it returns all sorts of excess garbage, and the values of the fields aren't actually encoded but rather shown as their eulxml type) and the standard json.dumps() simply gives me empty JSON (this was after trying to implement the solution detailed here). The problem seems to stem from the fact that the __dict__ values are not initialised in __init__ (as they are mapped as class properties), so the __dict__ appears empty upon serialization. Here is some sample code:
Serializable Class Object
class Serializable(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # hack to fix _json.so make_encoder serialize properly
        self.__setitem__('dummy', 1)
    def _myattrs(self):
        return [
            (x, self._repr(getattr(self, x)))
            for x in self.__dir__()
            if x not in Serializable().__dir__()
        ]
    def _repr(self, value):
        if isinstance(value, (str, int, float, list, tuple, dict)):
            return value
        else:
            return repr(value)
    def __repr__(self):
        return '<%s.%s object at %s>' % (
            self.__class__.__module__,
            self.__class__.__name__,
            hex(id(self))
        )
    def keys(self):
        return iter([x[0] for x in self._myattrs()])
    def values(self):
        return iter([x[1] for x in self._myattrs()])
    def items(self):
        return iter(self._myattrs())
Base Class
from eulxml import xmlmap
import inspect
import lxml
import json as JSON
from models.serializable import Serializable
class AlcalaBase(xmlmap.XmlObject,Serializable):
    def toJSON(self):
        return JSON.dumps(self, indent=4)
    def to_json(self, skipBegin=False):
        json = list()
        if not skipBegin:
            json.append('{')
        json.append(str.format('"{0}": {{', self.ROOT_NAME))
        for attr, value in inspect.getmembers(self):
            if (attr.find("_") == -1
                    and attr.find("serialize") == -1
                    and attr.find("context") == -1
                    and attr.find("node") == -1
                    and attr.find("schema") == -1):
                if type(value) is xmlmap.fields.NodeList:
                    if len(value) > 0:
                        json.append(str.format('"{0}": [', attr))
                        for v in value:
                            json.append(v.to_json())
                            json.append(",")
                        json = json[:-1]
                        json.append("]")
                    else:
                        json.append(str.format('"{0}": null', attr))
                elif (type(value) is xmlmap.fields.StringField
                        or type(value) is str
                        or type(value) is lxml.etree._ElementUnicodeResult):
                    value = JSON.dumps(value)
                    json.append(str.format('"{0}": {1}', attr, value))
                elif (type(value) is xmlmap.fields.IntegerField
                        or type(value) is int
                        or type(value) is float):
                    json.append(str.format('"{0}": {1}', attr, value))
                elif value is None:
                    json.append(str.format('"{0}": null', attr))
                elif type(value) is list:
                    if len(value) > 0:
                        json.append(str.format('"{0}": [', attr))
                        for x in value:
                            json.append(x)
                            json.append(",")
                        json = json[:-1]
                        json.append("]")
                    else:
                        json.append(str.format('"{0}": null', attr))
                else:
                    json.append(value.to_json(skipBegin=True))
                json.append(",")
        json = json[:-1]
        if not skipBegin:
            json.append('}')
        json.append('}')
        return ''.join(json)
Sample Class that implements Base
from eulxml import xmlmap
from models.alcalaMonth import AlcalaMonth
from models.alcalaBase import AlcalaBase
class AlcalaPage(AlcalaBase):
    ROOT_NAME = "page"
    id = xmlmap.StringField('pageID')
    year = xmlmap.IntegerField('content/#yearID')
    months = xmlmap.NodeListField('content/month', AlcalaMonth)
The toJSON() method on the base class is the one that uses the Serializable class and returns empty JSON, e.g. "{}". The to_json() method is my attempt at a hand-rolled JSON implementation, but that has its own problems (for some reason it skips certain properties / child objects for no reason I can see, but that's a thread for another day).
If I access myobj.keys or myobj.values (both of which are exposed via Serializable) I can see the property names and values as I would expect, but I have no idea why json.dumps() produces an empty JSON string.
Does anyone have any idea why I cannot get these objects to serialize to JSON? I've been pulling my hair out for weeks with this. Any help would be greatly appreciated.
So after a lot of playing around, I was finally able to fix this with jsonpickle and it took only 3 lines of code:
def toJson(self):
    jsonpickle.set_preferred_backend('simplejson')
    return jsonpickle.encode(self, unpicklable=False)
I used simplejson to eliminate some of the additional object notation that was being added and the unpicklable property removed the rest (I'm not sure if this would work with the default json backend as I didn't test it).
Now when I call toJson() on any object that inherits from this base class, I get very nice json and it works brilliantly.
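For comparison, here is a standard-library-only sketch of the underlying situation: values exposed through class-level properties never appear in the instance __dict__, so they have to be collected explicitly before encoding. Everything here is hypothetical and for illustration only (the Page class, its FIELDS tuple and the to_plain hook); it is not how eulxml or jsonpickle work internally.

```python
import json


class Page:
    # Hypothetical list of the property names that should be serialized.
    FIELDS = ("id", "year")

    @property
    def id(self):
        return "p1"

    @property
    def year(self):
        return 1774

    def to_dict(self):
        # Read each property through getattr, since __dict__ is empty.
        return {name: getattr(self, name) for name in self.FIELDS}


def to_plain(obj):
    """default hook: turn objects with to_dict() into plain dicts."""
    if hasattr(obj, "to_dict"):
        return obj.to_dict()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


page = Page()
instance_attrs = vars(page)  # empty: the values live on the class
encoded = json.dumps(page, default=to_plain)
```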

Python 3 - Writing data from struct.unpack into json without individual recasting

I have a large object that is read from a binary file using struct.unpack and some of the values are character arrays which are read as bytes.
Since the character arrays in Python 3 are read as bytes instead of string (like in Python 2) they cannot be directly passed to json.dumps since "bytes" are not JSON serializable.
Is there any way to go from unpacked struct to json without searching through each value and converting the bytes to strings?
You can use a custom encoder in this case. See below
import json
x = {}
x['bytes'] = [b"i am bytes", "test"]
x['string'] = "strings"
x['unicode'] = u"unicode string"
class MyEncoder(json.JSONEncoder):
    def default(self, o):
        if type(o) is bytes:
            return o.decode("utf-8")
        return super(MyEncoder, self).default(o)
print(json.dumps(x, cls=MyEncoder))
# {"bytes": ["i am bytes", "test"], "string": "strings", "unicode": "unicode string"}
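Applied to struct.unpack output, the same idea might look like the sketch below. The record layout ("<i8s") and the field names are assumptions made up for the example. Fixed-width character arrays come back NUL-padded, so the hook strips the padding before decoding.

```python
import json
import struct

# Hypothetical layout: a little-endian int followed by an 8-byte char array.
packed = struct.pack("<i8s", 7, b"sensor")
record_id, raw_name = struct.unpack("<i8s", packed)


def bytes_default(obj):
    """default hook: json calls this only for values it cannot encode."""
    if isinstance(obj, bytes):
        # Strip the NUL padding of the fixed-width field, then decode.
        return obj.rstrip(b"\x00").decode("utf-8")
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


output = json.dumps({"id": record_id, "name": raw_name}, default=bytes_default)
print(output)
```

Note that the default hook is consulted for values only. Dictionary keys that are bytes still raise a TypeError and have to be converted beforehand.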

Reading the data written to s3 by Amazon Kinesis Firehose stream

I am writing records to a Kinesis Firehose stream, and they are eventually written to an S3 file by Amazon Kinesis Firehose.
My record object looks like:
ItemPurchase {
    String personId,
    String itemId
}
The data written to S3 looks like:
{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}
NO COMMA SEPARATION.
NO STARTING BRACKET ([) as in a JSON array.
NO ENDING BRACKET (]) as in a JSON array.
I want to read this data and get a list of ItemPurchase objects.
List<ItemPurchase> purchases = getPurchasesFromS3(IOUtils.toString(s3ObjectContent))
What is the correct way to read this data?
It boggles my mind that Amazon Firehose dumps JSON messages to S3 in this manner, and doesn't allow you to set a delimiter or anything.
Ultimately, the trick I found to deal with the problem was to process the text file using the JSONDecoder.raw_decode method.
This will allow you to read a bunch of concatenated JSON records without any delimiters between them.
Python code:
import json
decoder = json.JSONDecoder()
with open('giant_kinesis_s3_text_file_with_concatenated_json_blobs.txt', 'r') as content_file:
    content = content_file.read()
    content_length = len(content)
    decode_index = 0
    while decode_index < content_length:
        try:
            # raw_decode returns the object and the index where it ended
            obj, decode_index = decoder.raw_decode(content, decode_index)
            print("File index:", decode_index)
            print(obj)
        except json.JSONDecodeError as e:
            print("JSONDecodeError:", e)
            # Scan forward and keep trying to decode
            decode_index += 1
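The same raw_decode technique can be wrapped in a generator, which is a minimal sketch rather than a definitive implementation. The name iter_concatenated_json is hypothetical. Unlike the loop above, it skips whitespace between records explicitly (raw_decode does not skip leading whitespace) and lets decode errors propagate instead of scanning forward.

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    index = 0
    length = len(text)
    while index < length:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while index < length and text[index].isspace():
            index += 1
        if index >= length:
            break
        obj, index = decoder.raw_decode(text, index)
        yield obj


# The second record contains "}{" inside a string value on purpose.
sample = '{"personId":"p-111","itemId":"i-111"}\n{"personId":"p-222","note":"a}{b"}'
records = list(iter_concatenated_json(sample))
```

Because a real JSON parser does the work, a "}{" sequence inside a string value does not split the record.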
I also had the same problem; here is how I solved it:
replace "}{" with "}\n{"
split the lines by "\n".
(input_json_rdd.map(lambda x: re.sub("}{", "}\n{", x, flags=re.UNICODE))
    .flatMap(lambda line: line.split("\n")))
A nested JSON object has several "}"s, so splitting the line by "}" alone doesn't solve the problem.
I've had the same issue.
It would have been better if AWS allowed us to set a delimiter but we can do it on our own.
In my use case, I've been listening on a stream of tweets, and once receiving a new tweet I immediately put it to Firehose.
This, of course, resulted in a 1-line file which could not be parsed.
So, to solve this, I have concatenated the tweet's JSON with a \n.
This, in turn, let me use some packages that can output lines when reading stream contents, and parse the file easily.
Hope this helps you.
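A small sketch of that newline-delimited approach, using an in-memory buffer in place of Firehose and S3 (the events list is made-up sample data). It relies on the fact that json.dumps escapes newlines inside strings, so each record always occupies exactly one line.

```python
import io
import json

# Made-up sample events; the second one contains a newline in its text.
events = [{"id": 1, "text": "first"}, {"id": 2, "text": "line\nbreak"}]

# Producer side: append "\n" to every serialized record before sending it.
payload = "".join(json.dumps(event) + "\n" for event in events)

# Consumer side: read the file line by line and parse each line on its own.
parsed = [json.loads(line) for line in io.StringIO(payload) if line.strip()]
```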
I think the best way to tackle this is to first create a properly formatted JSON file containing well-separated JSON objects. In my case I appended ',' to each event that was pushed into Firehose. After a file is saved in S3, it will then contain JSON objects separated by a delimiter (a comma in our case). The other thing that must be added is '[' and ']' at the beginning and end of the file. Then you have a proper JSON file containing multiple JSON objects, and parsing it becomes possible.
If the input source for the firehose is an Analytics application, this concatenated JSON without a delimiter is a known issue as cited here. You should have a lambda function as here that outputs JSON objects in multiple lines.
I used a transformation Lambda to add a line break at the end of every record:
import base64
import copy
def lambda_handler(event, context):
    output = []
    for record in event['records']:
        # Decode from base64 (Firehose records are base64 encoded)
        payload = base64.b64decode(record['data'])
        # Read json as utf-8
        json_string = payload.decode("utf-8")
        # Add a line break
        output_json_with_line_break = json_string + "\n"
        # Encode the data
        encoded_bytes = base64.b64encode(bytearray(output_json_with_line_break, 'utf-8'))
        encoded_string = str(encoded_bytes, 'utf-8')
        # Create a deep copy of the record and append to output with transformed data
        output_record = copy.deepcopy(record)
        output_record['data'] = encoded_string
        output_record['result'] = 'Ok'
        output.append(output_record)
    print('Successfully processed {} records.'.format(len(event['records'])))
    return {'records': output}
Use this simple Python code:
import json
input_str = '''{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}'''
data_str = "[{}]".format(input_str.replace("}{","},{"))
data_json = json.loads(data_str)
And then (if you want) convert to pandas:
import pandas as pd
df = pd.DataFrame.from_records(data_json)
print(df)
And this is the result:
  itemId personId
0  i-111    p-111
1  i-222    p-222
2  i-333    p-333
If there's a way to change the way data is written, please separate all the records by a line. That way you can read the data simply, line by line. If not, then simply build a scanner object which takes "}" as a delimiter and use the scanner to read. That would do the job.
You can find each valid JSON object by counting the brackets. Assuming the file starts with a {, this Python snippet should work:
import json
def read_block(stream):
    open_brackets = 0
    block = ''
    while True:
        c = stream.read(1)
        if not c:
            break
        if c == '{':
            open_brackets += 1
        elif c == '}':
            open_brackets -= 1
        block += c
        if open_brackets == 0:
            yield block
            block = ''
if __name__ == "__main__":
    c = 0
    with open('firehose_json_blob', 'r') as f:
        for block in read_block(f):
            record = json.loads(block)
            print(record)
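Plain bracket counting miscounts when a string value itself contains { or }. The following variant is a sketch under the same assumption (each record is a JSON object starting with {) that additionally tracks whether the reader is inside a string and honours backslash escapes. The name read_objects and the sample data are made up for the example.

```python
import io
import json


def read_objects(stream):
    """Yield the text of each top-level JSON object read from a text stream."""
    depth = 0          # current nesting level of braces outside strings
    in_string = False  # True while between an opening and closing quote
    escaped = False    # True right after a backslash inside a string
    block = []
    while True:
        ch = stream.read(1)
        if not ch:
            break
        if depth == 0 and ch != "{":
            continue  # skip whitespace or other characters between objects
        block.append(ch)
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                yield "".join(block)
                block = []


# Both records contain braces inside string values on purpose.
sample = '{"a": "x}{y"}{"b": {"c": "q\\"}"}}'
objects = [json.loads(text) for text in read_objects(io.StringIO(sample))]
```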
This problem can be solved with a JSON parser that consumes objects one at a time from a stream. The raw_decode method of the JSONDecoder exposes just such a parser, but I've written a library that makes it straightforward to do this with a one-liner.
from firehose_sipper import sip
for entry in sip(bucket=..., key=...):
do_something_with(entry)
I've added some more details in this blog post
In Spark, we had the same problem. We're using the following:
import re
from pyspark.sql.functions import *
@udf
def concatenated_json_to_array(text):
    final = "["
    separator = ""
    for part in text.split("}{"):
        final += separator + part
        # keep "}{" as-is when it falls inside an unterminated string value
        separator = "}{" if re.search(r':\s*"([^"]|(\\"))*$', final) else "},{"
    return final + "]"
def read_concatenated_json(path, schema):
    return (spark.read
        .option("lineSep", None)
        .text(path)
        .withColumn("value", concatenated_json_to_array("value"))
        .withColumn("value", from_json("value", schema))
        .withColumn("value", explode("value"))
        .select("value.*"))
It works as follows:
Read the data as one string per file (no delimiters!)
Use a UDF to introduce the JSON array and split the JSON objects by introducing a comma. Note: be careful not to break any strings with }{ in them!
Parse the JSON with a schema into DataFrame fields.
Explode the array into separate rows
Expand the value object into column.
Use it like this:
from pyspark.sql.types import *
schema = ArrayType(
    StructType([
        StructField("type", StringType(), True),
        StructField("value", StructType([
            StructField("id", IntegerType(), True),
            StructField("joke", StringType(), True),
            StructField("categories", ArrayType(StringType()), True)
        ]), True)
    ])
)
path = '/mnt/my_bucket_name/messages/*/*/*/*/'
df = read_concatenated_json(path, schema)
I've written more details and considerations here: Parsing JSON data from S3 (Kinesis) with Spark. Do not just split by }{, as it can mess up your string data! For example: { "line": "a\"r}{t" }.
You can use the script below.
If the streamed data size does not exceed the buffer size that you set, each S3 file will have one pair of brackets ([]) and comma-separated records.
import base64
print('Loading function')
def lambda_handler(event, context):
    output = []
    for record in event['records']:
        print(record['recordId'])
        payload = base64.b64decode(record['data']).decode('utf-8')+',\n'
        # Do custom processing on the payload here
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            # b64encode returns bytes; decode so the response is JSON serializable
            'data': base64.b64encode(payload.encode('utf-8')).decode('utf-8')
        }
        output.append(output_record)
    last = len(event['records'])-1
    print('Successfully processed {} records.'.format(len(event['records'])))
    # Wrap the batch in brackets: '[' before the first record, ']' after the last
    start = '['+base64.b64decode(output[0]['data']).decode('utf-8')
    end = base64.b64decode(output[last]['data']).decode('utf-8')+']'
    output[0]['data'] = base64.b64encode(start.encode('utf-8')).decode('utf-8')
    output[last]['data'] = base64.b64encode(end.encode('utf-8')).decode('utf-8')
    return {'records': output}
Using JavaScript Regex.
JSON.parse(`[${item.replace(/}\s*{/g, '},{')}]`);

Forcing Python json module to work with ASCII

I'm using json.dump() and json.load() to save/read a dictionary of strings to/from disk. The issue is that I can't have any of the strings in unicode. They seem to be in unicode no matter how I set the parameters to dump/load (including ensure_ascii and encoding).
If you are just dealing with simple JSON objects, you can use the following:
def ascii_encode_dict(data):
    ascii_encode = lambda x: x.encode('ascii')
    return dict(map(ascii_encode, pair) for pair in data.items())
json.loads(json_data, object_hook=ascii_encode_dict)
Here is an example of how it works:
>>> json_data = '{"foo": "bar", "bar": "baz"}'
>>> json.loads(json_data) # old call gives unicode
{u'foo': u'bar', u'bar': u'baz'}
>>> json.loads(json_data, object_hook=ascii_encode_dict) # new call gives str
{'foo': 'bar', 'bar': 'baz'}
This answer works for a more complex JSON structure, and gives some nice explanation on the object_hook parameter. There is also another answer there that recursively takes the result of a json.loads() call and converts all of the Unicode strings to byte strings.
And if the json object is a mix of datatypes, not only unicode strings, you can use this expression:
def ascii_encode_dict(data):
    ascii_encode = lambda x: x.encode('ascii') if isinstance(x, unicode) else x
    return dict(map(ascii_encode, pair) for pair in data.items())
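The snippets above are Python 2 code: the unicode type and the u'' reprs do not exist in Python 3, where every str is Unicode. If the actual goal is an ASCII-only file on disk, a Python 3 sketch could rely on ensure_ascii instead. This is an assumption about the intent, and the data dict is made-up sample data.

```python
import json

# Made-up sample data containing non-ASCII characters.
data = {"name": "Zoë", "city": "München"}

# ensure_ascii=True (the default) writes non-ASCII characters as \uXXXX escapes.
text = json.dumps(data, ensure_ascii=True)

# The serialized text is pure ASCII, so this encode step cannot fail.
ascii_bytes = text.encode("ascii")

# Loading turns the escapes back into the original characters.
restored = json.loads(ascii_bytes.decode("ascii"))
```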