Forcing Python json module to work with ASCII - json

I'm using json.dump() and json.load() to save/read a dictionary of strings to/from disk. The issue is that I can't have any of the strings in unicode. They seem to be in unicode no matter how I set the parameters to dump/load (including ensure_ascii and encoding).

If you are just dealing with simple JSON objects, you can use the following:
import json
def ascii_encode_dict(data):
    # Encode every key and value of one decoded JSON object to an ASCII byte string
    ascii_encode = lambda x: x.encode('ascii')
    return dict(map(ascii_encode, pair) for pair in data.items())
json.loads(json_data, object_hook=ascii_encode_dict)
Here is an example of how it works:
>>> json_data = '{"foo": "bar", "bar": "baz"}'
>>> json.loads(json_data) # old call gives unicode
{u'foo': u'bar', u'bar': u'baz'}
>>> json.loads(json_data, object_hook=ascii_encode_dict) # new call gives str
{'foo': 'bar', 'bar': 'baz'}
This answer works for a more complex JSON structure, and gives some nice explanation on the object_hook parameter. There is also another answer there that recursively takes the result of a json.loads() call and converts all of the Unicode strings to byte strings.

And if the json object is a mix of datatypes, not only unicode strings, you can use this expression:
def ascii_encode_dict(data):
    ascii_encode = lambda x: x.encode('ascii') if isinstance(x, unicode) else x
    return dict(map(ascii_encode, pair) for pair in data.items())
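The answers above target Python 2, where encoding a unicode object yields a str. As a minimal sketch, assuming Python 3 instead (where str is already Unicode and encoding yields bytes), the same object_hook idea looks like this. It also shows that the hook runs once per decoded object, so nested dictionaries are covered but strings inside lists are not:

```python
import json

def ascii_encode_dict(data):
    """object_hook: encode the str keys and values of one decoded object to ASCII bytes."""
    def ascii_encode(x):
        return x.encode('ascii') if isinstance(x, str) else x
    return {ascii_encode(k): ascii_encode(v) for k, v in data.items()}

# The hook is called for the inner object first, then for the outer one.
nested = json.loads('{"outer": {"inner": "x"}, "items": ["a"]}',
                    object_hook=ascii_encode_dict)
print(nested)
```

Strings held in lists pass through untouched, which is why the linked answers fall back to a recursive conversion for more complex structures.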

Related

Python 3 - Writing data from struct.unpack into json without individual recasting

I have a large object that is read from a binary file using struct.unpack and some of the values are character arrays which are read as bytes.
Since the character arrays in Python 3 are read as bytes instead of string (like in Python 2) they cannot be directly passed to json.dumps since "bytes" are not JSON serializable.
Is there any way to go from unpacked struct to json without searching through each value and converting the bytes to strings?
You can use a custom encoder in this case. See below
import json
x = {}
x['bytes'] = [b"i am bytes", "test"]
x['string'] = "strings"
x['unicode'] = u"unicode string"
class MyEncoder(json.JSONEncoder):
    def default(self, o):
        if type(o) is bytes:
            return o.decode("utf-8")
        return super(MyEncoder, self).default(o)
print(json.dumps(x, cls=MyEncoder))
# {"bytes": ["i am bytes", "test"], "string": "strings", "unicode": "unicode string"}
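To tie the custom encoder back to the struct.unpack case in the question, here is a small sketch. The format string "<4sI" and the field names name and count are made up for illustration; only struct and json from the standard library are used:

```python
import json
import struct

class BytesEncoder(json.JSONEncoder):
    """Decode bytes values as UTF-8 text; defer everything else to the base class."""
    def default(self, o):
        if isinstance(o, bytes):
            return o.decode("utf-8")
        return super().default(o)

# Hypothetical record layout: a 4-byte character array followed by an unsigned int.
packed = struct.pack("<4sI", b"abcd", 7)
name, count = struct.unpack("<4sI", packed)

encoded = json.dumps({"name": name, "count": count}, cls=BytesEncoder)
print(encoded)
```

Note that default() is only consulted for values; dictionary keys of type bytes still raise a TypeError and would have to be converted beforehand.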

Decoding a String with escaped special characters in Scala issue

I have a multi-line JSON file with records that contain special characters encoded as hexadecimals. Here is an example of a single JSON record:
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
This record is supposed to be {"value":"ıarines Bintıç Ramuçlar"}, i.e. each '"' character is replaced with the corresponding hexadecimal escape \x22, and other special Unicode characters are replaced with one or two hexadecimal escapes (for instance, \xC3\xA7 encodes ç).
I need to convert such Strings into regular Unicode Strings in Scala, so that printing one produces {"value":"ıarines Bintıç Ramuçlar"} without the hexadecimal escapes.
In Python I can easily decode these records with a line of code:
>>> a = "{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"
>>> a.decode("utf-8")
u'{"value":"\u0131arines Bint\u0131\xe7 Ramu\xe7lar"}'
>>> print a.decode("utf-8")
{"value":"ıarines Bintıç Ramuçlar"}
But in Scala I can't find a way to decode it. I unsuccessfully tried to convert it like this:
scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(new String(a.getBytes(), "UTF-8"))
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
I also tried URLDecoder as I found in solution for similar problem (but with URL):
scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(java.net.URLDecoder.decode(a.replace("\\x", "%"), "UTF-8"))
{"value":"ıarines Bintıç Ramuçlar"}
It produced the desired result for this example, but it does not seem safe for generic text fields, since it is designed to work with URLs and requires replacing every \x with % in the string.
Does Scala have some better way to deal with this issue?
I am new to Scala and will be thankful for any help
UPDATE:
I have made a custom solution with javax.xml.bind.DatatypeConverter.parseHexBinary. It works for now, but it seems cumbersome and not at all elegant. I think there should be a simpler way to do this.
Here is the code:
import javax.xml.bind.DatatypeConverter
import scala.annotation.tailrec
import scala.util.matching.Regex
def decodeHexChars(string: String): String = {
  val regexHex: Regex = """\A\\[xX]([0-9a-fA-F]{1,2})(.*)""".r
  def purgeBuffer(buffer: String, acc: List[Char]): List[Char] = {
    if (buffer.isEmpty) acc
    else new String(DatatypeConverter.parseHexBinary(buffer)).reverse.toList ::: acc
  }
  @tailrec
  def traverse(s: String, acc: List[Char], buffer: String): String = s match {
    case "" =>
      val accUpdated = purgeBuffer(buffer, acc)
      accUpdated.foldRight("")((str, b) => b + str)
    case regexHex(chars, suffix) =>
      traverse(suffix, acc, buffer + chars)
    case _ =>
      val accUpdated = purgeBuffer(buffer, acc)
      traverse(s.tail, s.head :: accUpdated, "")
  }
  traverse(string, Nil, "")
}
Each \x?? encodes one byte, like \x22 encodes " and \x5C encodes \. But in UTF-8 some characters are encoded using multiple bytes, so you need to transform \xC4\xB1 to ı symbol and so on.
replaceAllIn is really nice, but it might eat your slashes. So, if you don't use groups (like \1) in a replaced string, quoteReplacement is a recommended way to escape \ and $ symbols.
/** "22" -> 34, "AA" -> -86 */
def hex2byte(hex: String) = Integer.parseInt(hex, 16).toByte
/** decode strings like \x22 or \xC4\xB1\xC3\xA7 to specified encoding */
def decodeHexadecimals(str: String, encoding: String="UTF-8") =
  new String(str.split("""\\x""").tail.map(hex2byte), encoding)
/** fix weird strings */
def replaceHexadecimals(str: String, encoding: String="UTF-8") =
  """(\\x[\dA-F]{2})+""".r.replaceAllIn(str, m =>
    util.matching.Regex.quoteReplacement(
      decodeHexadecimals(m.group(0), encoding)))
P.S. Does anyone know the difference between java.util.regex.Matcher.quoteReplacement and scala.util.matching.Regex.quoteReplacement?
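The problem is that this encoding is really specific to Python (I think). Something like this might work, though it decodes each escape on its own, so multi-byte UTF-8 sequences such as \xC4\xB1 will not come out right: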
val s = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
"""\\x([A-F0-9]{2})""".r.replaceAllIn(s, (x: Regex.Match) =>
new String(BigInt(x.group(1), 16).toByteArray, "UTF-8")
)
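For comparison, the run-based approach described in the answers above (collect consecutive \xNN escapes, turn the whole run into bytes, then decode the bytes together) can be sketched in Python 3. This is an illustration only, not a Scala solution; the names HEX_RUN and replace_hexadecimals are invented for the example:

```python
import re

# One or more consecutive literal \xNN escapes, e.g. \xC4\xB1
HEX_RUN = re.compile(r'(?:\\x[0-9A-Fa-f]{2})+')

def replace_hexadecimals(text, encoding="utf-8"):
    """Replace each run of literal \\xNN escapes with the text those bytes encode."""
    def decode_run(match):
        hex_digits = match.group(0).replace("\\x", "")
        return bytes.fromhex(hex_digits).decode(encoding)
    # A function replacement is inserted literally, so backslashes are not re-processed.
    return HEX_RUN.sub(decode_run, text)

raw = r'{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}'
decoded = replace_hexadecimals(raw)
print(decoded)
```

Decoding a whole run at once is what keeps multi-byte characters intact; decoding each escape separately would split ı (\xC4\xB1) into two invalid single bytes.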

How to retain double quotes while loading a json in python

Dumping JSON using YAML,
c= {"a":1}
d = yaml.dump(c)
Loading JSON using YAML
yaml.load(d)
{'a': 1} # double quotes are lost
How can I ensure that the output of the load has double quotes?
Note: I tried json and simplejson also, all behave the same way.
For Python there is no difference between single and double quotes.
If you need the response as a JSON string, use the standard json module; it will create a string with correctly formatted JSON, with double quotes.
>>> import json
>>> json.dumps({'a': 1})
'{"a": 1}'
Some frameworks or modules (such as requests) have built-in functions to send correctly formatted JSON (they may use the standard json module in the background), so you don't have to do it on your own.
This
c = {"a":1}
d = yaml.dump(c)
doesn't dump JSON; it dumps a Python dict as YAML. Use json.dumps() to make a JSON string from the dict, then optionally load/dump it as YAML, preserving the double quotes by specifying preserve_quotes while loading:
import sys
import json
import ruamel.yaml
c= {"a":1}
json_string = json.dumps(c)
print(json_string)
print('---------')
data = ruamel.yaml.round_trip_load(json_string, preserve_quotes=True)
data['a'] = 3
ruamel.yaml.round_trip_dump(data, sys.stdout)
that will print:
{"a": 1}
---------
{"a": 3}

Reading the data written to s3 by Amazon Kinesis Firehose stream

I am writing records to a Kinesis Firehose stream, which are eventually written to an S3 file by Amazon Kinesis Firehose.
My record object looks like
ItemPurchase {
String personId,
String itemId
}
The data written to S3 looks like:
{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}
NO COMMA SEPARATION.
NO STARTING BRACKET ([) as in a JSON array.
NO ENDING BRACKET (]) as in a JSON array.
I want to read this data and get a list of ItemPurchase objects.
List<ItemPurchase> purchases = getPurchasesFromS3(IOUtils.toString(s3ObjectContent))
What is the correct way to read this data?
It boggles my mind that Amazon Firehose dumps JSON messages to S3 in this manner, and doesn't allow you to set a delimiter or anything.
Ultimately, the trick I found to deal with the problem was to process the text file using the JSON raw_decode method
This will allow you to read a bunch of concatenated JSON records without any delimiters between them.
Python code:
import json
decoder = json.JSONDecoder()
with open('giant_kinesis_s3_text_file_with_concatenated_json_blobs.txt', 'r') as content_file:
    content = content_file.read()
content_length = len(content)
decode_index = 0
while decode_index < content_length:
    try:
        # raw_decode returns the parsed object and the index where it ended
        obj, decode_index = decoder.raw_decode(content, decode_index)
        print("File index:", decode_index)
        print(obj)
    except json.JSONDecodeError as e:
        print("JSONDecodeError:", e)
        # Scan forward and keep trying to decode
        decode_index += 1
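The same raw_decode loop can be wrapped in a reusable generator. This is a minimal sketch, and the function name iter_concatenated_json is invented for the example. It skips whitespace between records explicitly, because raw_decode expects the value to start exactly at the given index:

```python
import json

def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    index = 0
    length = len(text)
    while index < length:
        # raw_decode does not skip leading whitespace, so step over it first
        while index < length and text[index].isspace():
            index += 1
        if index >= length:
            break
        obj, index = decoder.raw_decode(text, index)
        yield obj

blob = ('{"personId":"p-111","itemId":"i-111"}'
        '{"personId":"p-222","itemId":"i-222"}\n'
        '{"personId":"p-333","itemId":"i-333"}')
purchases = list(iter_concatenated_json(blob))
print(purchases)
```

Unlike the loop above, this version lets a json.JSONDecodeError propagate instead of scanning forward, which is usually the safer default when corrupt input should not be silently skipped.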
I had the same problem; here is how I solved it:
1. Replace "}{" with "}\n{".
2. Split each line by "\n".
import re
(input_json_rdd
    .map(lambda x: re.sub("}{", "}\n{", x, flags=re.UNICODE))
    .flatMap(lambda line: line.split("\n")))
A nested JSON object contains several "}"s, so splitting the line by "}" alone doesn't solve the problem.
I've had the same issue.
It would have been better if AWS allowed us to set a delimiter but we can do it on our own.
In my use case, I've been listening on a stream of tweets, and once receiving a new tweet I immediately put it to Firehose.
This, of course, resulted in a 1-line file which could not be parsed.
So, to solve this, I have concatenated the tweet's JSON with a \n.
This, in turn, let me use some packages that can output lines when reading stream contents, and parse the file easily.
Hope this helps you.
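Once every record ends with a newline, reading the file back needs nothing more than the standard library. A minimal sketch, with read_json_lines as an invented helper name and io.StringIO standing in for the downloaded S3 object:

```python
import io
import json

def read_json_lines(stream):
    """Yield one parsed JSON value per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for the content of a newline-delimited Firehose output file
sample = io.StringIO('{"personId":"p-111","itemId":"i-111"}\n'
                     '{"personId":"p-222","itemId":"i-222"}\n')
records = list(read_json_lines(sample))
print(records)
```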
I think the best way to tackle this is to first create a properly formatted JSON file containing well-separated JSON objects. In my case I appended ',' to each event pushed into Firehose. After a file is saved in S3, it then contains JSON objects separated by a delimiter (a comma in our case). You also need to add '[' and ']' at the beginning and end of the file. Then you have a proper JSON file containing multiple JSON objects, and parsing it becomes possible.
If the input source for the firehose is an Analytics application, this concatenated JSON without a delimiter is a known issue as cited here. You should have a lambda function as here that outputs JSON objects in multiple lines.
I used a transformation Lambda to add a line break at the end of every record:
import base64
import copy
def lambda_handler(event, context):
    output = []
    for record in event['records']:
        # Decode from base64 (Firehose records are base64 encoded)
        payload = base64.b64decode(record['data'])
        # Read json as utf-8
        json_string = payload.decode("utf-8")
        # Add a line break
        output_json_with_line_break = json_string + "\n"
        # Encode the data
        encoded_bytes = base64.b64encode(bytearray(output_json_with_line_break, 'utf-8'))
        encoded_string = str(encoded_bytes, 'utf-8')
        # Create a deep copy of the record and append to output with transformed data
        output_record = copy.deepcopy(record)
        output_record['data'] = encoded_string
        output_record['result'] = 'Ok'
        output.append(output_record)
    print('Successfully processed {} records.'.format(len(event['records'])))
    return {'records': output}
Use this simple Python code.
import json
input_str = '''{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}'''
# Insert commas between the objects and wrap them in a JSON array
data_str = "[{}]".format(input_str.replace("}{","},{"))
data_json = json.loads(data_str)
And then (if you want) convert to Pandas.
import pandas as pd
df = pd.DataFrame.from_records(data_json)
print(df)
And this is the result:
  itemId personId
0  i-111    p-111
1  i-222    p-222
2  i-333    p-333
If there's a way to change the way data is written, please separate all the records by a line. That way you can read the data simply, line by line. If not, then simply build a scanner object which takes "}" as a delimiter and use the scanner to read. That would do the job.
You can find each valid JSON object by counting the brackets. Assuming the file starts with a {, this Python snippet should work:
import json
def read_block(stream):
    open_brackets = 0
    block = ''
    while True:
        c = stream.read(1)
        if not c:
            break
        if c == '{':
            open_brackets += 1
        elif c == '}':
            open_brackets -= 1
        block += c
        # A balanced count means one complete top-level object has been read
        if open_brackets == 0:
            yield block
            block = ''
if __name__ == "__main__":
    with open('firehose_json_blob', 'r') as f:
        for block in read_block(f):
            record = json.loads(block)
            print(record)
This problem can be solved with a JSON parser that consumes objects one at a time from a stream. The raw_decode method of the JSONDecoder exposes just such a parser, but I've written a library that makes it straightforward to do this with a one-liner.
from firehose_sipper import sip
for entry in sip(bucket=..., key=...):
    do_something_with(entry)
I've added some more details in this blog post
In Spark, we had the same problem. We're using the following:
import re
from pyspark.sql.functions import *
@udf
def concatenated_json_to_array(text):
    final = "["
    separator = ""
    for part in text.split("}{"):
        final += separator + part
        # Keep "}{" as-is when the text so far ends inside an unterminated string value
        separator = "}{" if re.search(r':\s*"([^"]|(\\"))*$', final) else "},{"
    return final + "]"
def read_concatenated_json(path, schema):
    return (spark.read
            .option("lineSep", None)
            .text(path)
            .withColumn("value", concatenated_json_to_array("value"))
            .withColumn("value", from_json("value", schema))
            .withColumn("value", explode("value"))
            .select("value.*"))
It works as follows:
1. Read the data as one string per file (no delimiters!).
2. Use a UDF to introduce the JSON array and split the JSON objects by introducing a comma. Note: be careful not to break any strings with }{ in them!
3. Parse the JSON with a schema into DataFrame fields.
4. Explode the array into separate rows.
5. Expand the value object into columns.
Use it like this:
from pyspark.sql.types import *
schema = ArrayType(
    StructType([
        StructField("type", StringType(), True),
        StructField("value", StructType([
            StructField("id", IntegerType(), True),
            StructField("joke", StringType(), True),
            StructField("categories", ArrayType(StringType()), True)
        ]), True)
    ])
)
path = '/mnt/my_bucket_name/messages/*/*/*/*/'
df = read_concatenated_json(path, schema)
I've written more details and considerations here: Parsing JSON data from S3 (Kinesis) with Spark. Do not just split by }{, as it can mess up your string data! For example: { "line": "a\"r}{t" }.
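The warning about splitting on }{ can be checked with plain Python, outside Spark. In this sketch a naive replace still produces parseable JSON but silently alters the string value, while json.JSONDecoder.raw_decode, which understands string boundaries, keeps it intact. The second record is made up for the example:

```python
import json

# Two concatenated records; the first one contains "}{" inside a string value.
text = r'{ "line": "a\"r}{t" }{ "line": "ok" }'

# Naive approach: turn every "}{" into "},{" and wrap the result in brackets.
naive_records = json.loads("[" + text.replace("}{", "},{") + "]")

# Parser-based approach: let the JSON decoder find each object's end.
decoder = json.JSONDecoder()
safe_records = []
index = 0
while index < len(text):
    obj, index = decoder.raw_decode(text, index)
    safe_records.append(obj)

print(naive_records[0]["line"])  # the value gained a comma it never had
print(safe_records[0]["line"])   # the value is unchanged
```

The naive version does not even raise an error here, which makes this kind of corruption easy to miss.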
You can use the script below.
As long as the streamed data does not exceed the buffer size you set, each S3 file gets one pair of brackets ([]) with the records separated by commas.
import base64
print('Loading function')
def lambda_handler(event, context):
    output = []
    for record in event['records']:
        print(record['recordId'])
        payload = base64.b64decode(record['data']).decode('utf-8')+',\n'
        # Do custom processing on the payload here
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            # Decode to str so the response stays JSON serializable on Python 3
            'data': base64.b64encode(payload.encode('utf-8')).decode('utf-8')
        }
        output.append(output_record)
    last = len(event['records'])-1
    print('Successfully processed {} records.'.format(len(event['records'])))
    # Prepend '[' to the first record and append ']' to the last one
    start = '['+base64.b64decode(output[0]['data']).decode('utf-8')
    end = base64.b64decode(output[last]['data']).decode('utf-8')+']'
    output[0]['data'] = base64.b64encode(start.encode('utf-8')).decode('utf-8')
    output[last]['data'] = base64.b64encode(end.encode('utf-8')).decode('utf-8')
    return {'records': output}
Using JavaScript Regex.
JSON.parse(`[${item.replace(/}\s*{/g, '},{')}]`);

Python 3 Tornado bytes and JSON issues

I stumbled into Python 3, and specifically into the Tornado framework.
My task was to integrate Facebook authentication, and I used the demo code from here:
https://github.com/tornadoweb/tornado/tree/master/demos/facebook
The point is that user is a dictionary containing bytes data.
class AuthLoginHandler(BaseHandler, tornado.auth.FacebookGraphMixin):
    @tornado.web.asynchronous
    def get(self):
        ....
    def _on_auth(self, user):
        if not user:
            raise tornado.web.HTTPError(500, "Facebook auth failed")
        self.set_secure_cookie("fbdemo_user", tornado.escape.json_encode(user))
        self.redirect(self.get_argument("next", "/"))
_on_auth always produces this Error: b'token or sesion_expire data here' is not JSON serializable
I've come up with a few solutions found on Stack Overflow.
The first is to fix the data before encoding:
import collections.abc
def convert(data):
    '''
    Converts bytes data into unicode strings, so this can be encoded into JSON
    '''
    if isinstance(data, str):
        return str(data)
    elif isinstance(data, bytes):
        return data.decode('utf-8')
    elif isinstance(data, collections.abc.Mapping):
        return dict(map(convert, data.items()))
    elif isinstance(data, collections.abc.Iterable):
        return type(data)(map(convert, data))
    else:
        return data
# ... and somewhere in the code
tornado.escape.json_encode(convert(user))
And the next one is to extend the json itself:
import json
class JSONEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, bytes):
            return o.decode('utf-8')
        return json.JSONEncoder.default(self, o)
Now the question: why are there such issues with data where type(data) == <class 'bytes'>, and am I doing it right?
Thank you.
Better late than never.
Python 3 json encoder does not accept byte strings. Tornado provides a method to_basestring which can be used to overcome this problem.
Here's what the source doc says about the issue:
In python2, byte and unicode strings are mostly interchangeable, so
functions that deal with a user-supplied argument in combination with
ascii string constants can use either and should return the type the
user supplied. In python3, the two types are not interchangeable, so
this method is needed to convert byte strings to unicode.
Usage:
tornado.escape.to_basestring(value)
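to_basestring converts a single value, while the user dictionary in the question holds bytes at several levels. As a standard-library sketch of applying the same idea to a whole structure (assuming UTF-8 data; to_text_recursive and the sample user dictionary are hypothetical, not part of Tornado):

```python
import json

def to_text_recursive(value, encoding="utf-8"):
    """Decode bytes to str throughout nested dicts, lists and tuples."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, dict):
        return {to_text_recursive(k, encoding): to_text_recursive(v, encoding)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_text_recursive(item, encoding) for item in value]
    return value

# Hypothetical auth result mixing bytes and str, as in the question
user = {b"name": b"Alice", "session_expires": [b"3600"], "id": 42}
cookie_value = json.dumps(to_text_recursive(user), sort_keys=True)
print(cookie_value)
```

Converting before encoding also takes care of bytes used as dictionary keys, which a JSONEncoder.default override never sees.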