Why do utf8 and utf-8 produce different outputs?

These two commands output different results:
In [102]: json.dumps({'Café': 1}, ensure_ascii=False, encoding='utf-8')
Out[102]: '{"Caf\xc3\xa9": 1}'
In [103]: json.dumps({'Café': 1}, ensure_ascii=False, encoding='utf8')
Out[103]: u'{"Caf\xe9": 1}'
What's the difference between utf-8 and utf8?

Notice that the second call returns a unicode object.
It seems strange, but the documentation calls this out:
If ensure_ascii is False, the result may contain non-ASCII characters and the return value may be a unicode instance.
It would appear that only the exact spelling "utf-8" is special-cased: with ensure_ascii=False and a UTF-8 encoded byte string as input (not unicode), the result stays a byte string. With unicode input:
>>> json.dumps({u'Caf€': 1}, ensure_ascii=False, encoding='utf-8')
u'{"Caf\u20ac": 1}'
With ensure_ascii=False, every other valid encoding returns a unicode instance.
If you set ensure_ascii=True, the behaviour is consistent and works with other encodings, such as "windows-1252" (the input then needs to be a unicode object).
I guess the rationale is that JSON should be ASCII and all non-ASCII characters should be escaped, even when the encoding is UTF-8.
To avoid surprises, follow these rules:
For spec-compliant, ASCII-only JSON:
Pass a unicode object
Call:
>>> json.dumps({u'Caf€': 1}, ensure_ascii=True)
'{"Caf\\u20ac": 1}'
For UTF-8 encoded JSON:
Pass a unicode object
Call:
>>> json.dumps({u'Caf€': 1}, ensure_ascii=False).encode("utf-8")
'{"Caf\xe2\x82\xac": 1}'

Related

How to clean a JSON string with double backslashes and \u00 escapes (Python 3)

I have several ugly json strings like the following:
test_string = '{\\"test_key\\": \\"Testing tilde \\u00E1\\u00F3\\u00ED\\"}'
which I need to transform into a more visually friendly dictionary and then save to a file:
{'test_key': 'Testing tilde áóí'}
So for that I am doing:
test_string = test_string.replace("\\\"", "\"") # I suppose there is a safer way to do this
print(test_string)
#{"test_key": "Testing tilde \u00E1\u00F3\u00ED"}
test_dict = json.loads(test_string, strict=False)
print(test_dict)
#{'test_key': 'Testing tilde áóí'}
At this point test_dict seems correct. Then I save it to a file:
with open('test.json', "w") as json_w_file:
    json.dump(test_dict, json_w_file)
At this point the content of test.json is the ugly version of the json:
{"test_key": "Testing tilde \u00E1\u00F3\u00ED"}
Is there a safer way to transform my ugly json to a dictionary?
Then how could I save the visually friendly version of my dictionary to a file?
Python 3
The string looks like double-encoded JSON to me. This decodes it and writes a UTF-8 JSON file.
test_string = '{\\"test_key\\": \\"Testing tilde \\u00E1\\u00F3\\u00ED\\"}'
test_dict = json.loads(json.loads(f'"{test_string}"'))
with open('test.json', 'w', encoding='utf-8') as json_w_file:
    json.dump(test_dict, json_w_file, ensure_ascii=False)
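To verify the round trip (a minimal check that reads back the test.json written above):
with open('test.json', encoding='utf-8') as json_r_file:
    print(json_r_file.read())
# {"test_key": "Testing tilde áóí"}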

Can't decode byte list to string list

In my users variable I store a list of bytes objects that I get from a local LDAP server and convert to a string list with my for loop.
I must return this list with jsonify.
If I don't use that encoding key I get a different output from the original, but still encoded.
The problem is I can't access the decode method anywhere.
Any help?
users = ldap.get_group_members('ship_crew')
user_list = []
for user in users:
    user_list.append(str(user, encoding='utf-8').split(",")[0].split("=")[1])
return jsonify(user_list)
original list from users variable:
[
  "cn=Philip J. Fry,ou=people,dc=planetexpress,dc=com",
  "cn=Turanga Leela,ou=people,dc=planetexpress,dc=com",
  "cn=Bender Bending Rodr\u00edguez,ou=people,dc=planetexpress,dc=com"
]
for loop with encoded output:
[
  "Philip J. Fry",
  "Turanga Leela",
  "Bender Bending Rodr\u00edguez"
]
expected:
[
  "Philip J. Fry",
  "Turanga Leela",
  "Bender Bending Rodríguez"
]
I would use a regex to extract your names:
import re
l = [
    "cn=Philip J. Fry,ou=people,dc=planetexpress,dc=com",
    "cn=Turanga Leela,ou=people,dc=planetexpress,dc=com",
    "cn=Bender Bending Rodr\u00edguez,ou=people,dc=planetexpress,dc=com"
]
NAME_PATTERN = re.compile(r'cn=(.*?),')
result = [NAME_PATTERN.match(s).group(1) for s in l]
print(result)
Output:
['Philip J. Fry', 'Turanga Leela', 'Bender Bending Rodríguez']
Note that when you dump it to JSON, the í isn't written literally: by default json.dumps escapes everything to ASCII, so the character comes out as the \u00ed escape (Unicode code point U+00ED):
import json
print(json.dumps(result, indent=2))
Output:
[
  "Philip J. Fry",
  "Turanga Leela",
  "Bender Bending Rodr\u00edguez"
]
You can get around this by setting ensure_ascii=False if you want, though if you are using this in an API I would be careful and stick to ASCII output with \u escapes:
print(json.dumps(result, indent=2, ensure_ascii=False))
Output:
[
  "Philip J. Fry",
  "Turanga Leela",
  "Bender Bending Rodríguez"
]
Your output is correct JSON. Unicode code point U+00ED is í, and in JSON any character can be escaped using its Unicode code point; "\u00ed" in your JSON output is a valid way to write that character.
It would also be correct JSON to include that character without escaping it, but apparently jsonify chooses to escape it.
Any competent JSON decoder will then turn it back into í.
If you're using the standard library's json.dumps, you can pass ensure_ascii=False to prevent this behaviour if you don't want it, but I don't know what "jsonify" is.
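A quick demonstration with the standard library (a minimal sketch):
import json
# The \u00ed escape and the literal character parse to the same Python string.
assert json.loads('"Rodr\\u00edguez"') == 'Rodríguez'
assert json.loads('"Rodríguez"') == 'Rodríguez'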

Read a file in R with mixed character encodings

I'm trying to read tables into R from HTML pages that are mostly encoded in UTF-8 (and declare <meta charset="utf-8">) but have some strings in some other encodings (I think Windows-1252 or ISO 8859-1). Here's an example. I want everything decoded properly into an R data frame. XML::readHTMLTable takes an encoding argument but doesn't seem to allow one to try multiple encodings.
So, in R, how can I try several encodings for each line of the input file? In Python 3, I'd do something like:
with open('file', 'rb') as o:
    for line in o:
        try:
            line = line.decode('UTF-8')
        except UnicodeDecodeError:
            line = line.decode('Windows-1252')
There do seem to be R library functions for guessing character encodings, like stringi::stri_enc_detect, but when possible, it's probably better to use the simpler deterministic method of trying a fixed set of encodings in order. It looks like the best way to do this is to take advantage of the fact that when iconv fails to convert a string, it returns NA.
linewise.decode = function(path)
    sapply(readLines(path), USE.NAMES = F, function(line) {
        if (validUTF8(line))
            return(line)
        l2 = iconv(line, "Windows-1252", "UTF-8")
        if (!is.na(l2))
            return(l2)
        l2 = iconv(line, "Shift-JIS", "UTF-8")
        if (!is.na(l2))
            return(l2)
        stop("Encoding not detected")
    })
If you create a test file with
$ python3 -c 'with open("inptest", "wb") as o: o.write(b"This line is ASCII\n" + "This line is UTF-8: I like π\n".encode("UTF-8") + "This line is Windows-1252: Müller\n".encode("Windows-1252") + "This line is Shift-JIS: ハローワールド\n".encode("Shift-JIS"))'
then linewise.decode("inptest") indeed returns
[1] "This line is ASCII"
[2] "This line is UTF-8: I like π"
[3] "This line is Windows-1252: Müller"
[4] "This line is Shift-JIS: ハローワールド"
To use linewise.decode with XML::readHTMLTable, just say something like XML::readHTMLTable(linewise.decode("http://example.com")).

Encoding works for one list and not for the other in Twitter using Python

I am trying to start collecting data from Twitter using the twitter module and Python. Here's my code:
import twitter
import win_unicode_console
win_unicode_console.enable()
CONSUMER_KEY = 'xxxxxxxxxxxxxxxxxx'
CONSUMER_SECRET = 'xxxxxxxxxxxxxxx'
OAUTH_TOKEN = 'xxxxxxxxxxxxxxxxx'
OAUTH_TOKEN_SECRET = 'xxxxxxxxxxxx'
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)
print(twitter_api)
WORLD_WOE_ID = 1
US_WOE_ID = 23424977
world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)
print(us_trends)
print(world_trends)
I was getting an encoding error, so I used
print((us_trends).encode('utf-8'))
which resulted in
AttributeError: 'TwitterListResponse' object has no attribute 'encode'
so I decided to use the win_unicode_console module.
But what's confusing is that printing us_trends works:
[{'trends': [{'name': 'El Chapo', 'url': 'http://twitter.com/search?q=%22El+Chapo%22', 'promoted_content': None, 'query': '%22El+Chapo%22', 'tweet_volume': 103536}, {'name': 'Antonio Brown', 'url': 'http://twitter.com/search?q=%22Antonio+Brown%22', 'promoted_
but the statement
print(world_trends)
gives below error
File "C:\Users\nawendu\Desktop\TWIT.PY", line 25, in <module>
print(world_trends)
File
line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 24-
29: character maps to <undefined>
How can the encoding work for us_trends and not for world_trends?
encode is a method of strings. You have a JSON object, which doesn't have that method.
When you print an object, it has to be converted to a string representation in your output encoding (probably a Windows code page here). If it contains characters (e.g. emoji) that don't exist in that output encoding, you get an error.
Encodings are a difficult topic (and a pain point in Python), but you'll need to learn about them if you want to print output.
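One workaround (a minimal sketch; the nested list below is a stand-in for the real Twitter response) is to serialize with ensure_ascii=True before printing, so only ASCII characters ever reach the console:
import json
world_trends = [{'trends': [{'name': 'Fútbol'}]}]  # stand-in for the API response
print(json.dumps(world_trends, indent=2, ensure_ascii=True))
# non-ASCII characters come out as \uXXXX escapes, safe for any console encoding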

How to load a json file with strings including double quotes (")

I've been given a load of JSON files which I'm trying to load into Python 3.5.
I've already had to do some clean-up work, removing double backslashes and extra quotation marks; however, I've run into an issue I don't know how to solve.
I'm running the following code:
responselist = []
with open(filepath, 'r') as json_file:
    reader = json_file.readlines()
    for row in reader:
        row = row.replace('\\', '')
        row = row.replace('"{', '{')
        row = row.replace('}"', '}')
        response = json.loads(row)
        for i in response:
            responselist.append(i['ActionName'])
However it's throwing up the error:
JSONDecodeError: Expecting ',' delimiter: line 1 column 388833 (char 388832)
The part of the JSON that's causing the issue is the status text entry below:
"StatusId":8,
"StatusIdString":"UnknownServiceError",
"StatusText":"u003cCompany docTypeu003d"Mobile.Tile" statusIdu003d"421" statusTextu003d"Start time of 11/30/2015 12:15:00 PM is more than 5 minutes in the past relative to the current time of 12/1/2015 12:27:01 AM." copyrightu003d"Copyright Company Inc." versionNumberu003d"7.3" createdDateu003d"2015-12-01T00:27:01Z" responseIdu003d"e74710c0-dc7c-42db-b608-bf905d95d153" /u003e",
"ActionName":"GetTrafficTile"
I added the line breaks to illustrate my point; it looks like Python is unhappy that the string contains double quotes.
I have a feeling this may be due to my replacing '\\' with '' messing with the escaped characters in the string. Is there any way to repair these nested strings? I don't mind if the StatusText field is deleted completely; all I'm after is a list of the ActionName fields.
EDIT:
I've hosted an example file here:
https://www.dropbox.com/s/1oanrneg3aqandz/2015-12-01T00%253A00%253A42.527Z_2015-12-01T00%253A01%253A17.478Z?dl=0
This is exactly as I received it, before I replaced the extra backslashes and quotation marks.
Here is a pared down version of the sample with one bad entry
["{\"apiServerType\":0,\"RequestId\":\"52a65260-1637-4653-a496-7555a2386340\",\"StatusId\":0,\"StatusIdString\":\"Ok\",\"StatusText\":null,\"ActionName\":\"GetCameraImage\",\"Url\":\"http://mosi-prod.cloudapp.net/api/v1/GetCameraImage?AuthToken=vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew%7C&CameraId=13782\",\"Lat\":0.0,\"Lon\":0.0,\"iVendorId\":12561,\"iConsumerId\":2986897,\"iSliverId\":51846,\"UserId\":\"2986897\",\"HardwareId\":null,\"AuthToken\":\"vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew|\",\"RequestTime\":\"2015-12-01T00:00:42.5278699Z\",\"ResponseTime\":\"2015-12-01T00:01:02.5926127Z\",\"AppId\":null,\"HttpMethod\":\"GET\",\"RequestHeaders\":\"{\\\"Connection\\\":[\\\"keep-alive\\\"],\\\"Via\\\":[\\\"HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com\\\"],\\\"Accept\\\":[\\\"application/json\\\"],\\\"Accept-Encoding\\\":[\\\"gzip\\\",\\\"deflate\\\"],\\\"Accept-Language\\\":[\\\"en-us\\\"],\\\"Host\\\":[\\\"mosi-prod.cloudapp.net\\\"],\\\"User-Agent\\\":[\\\"Traffic/5.4.0\\\",\\\"CFNetwork/758.1.6\\\",\\\"Darwin/15.0.0\\\"]}\",\"RequestContentHeaders\":\"{}\",\"RequestContentBody\":\"\",\"ResponseBody\":null,\"ResponseContentHeaders\":\"{\\\"Content-Type\\\":[\\\"image/jpeg\\\"]}\",\"ResponseHeaders\":\"{}\",\"MiniProfilerJson\":null}"]
The problem is a little different than you think. Whatever program built these files used data that was already json-encoded and ended up double and even triple encoding some of the information. I peeled it apart in a shell session and got usable python data. You can (1) go dope-slap whoever wrote the program that built this steaming pile of... um... goodness? and (2) manually scan through and decode inner json strings.
I decoded the data and it was a list of strings, but those strings looked suspiciously like json
>>> data = json.load(open('test.json'))
>>> type(data)
<class 'list'>
>>> d0 = data[0]
>>> type(d0)
<class 'str'>
>>> d0[:70]
'{"apiServerType":0,"RequestId":"52a65260-1637-4653-a496-7555a2386340",'
Sure enough, I can decode it
>>> d0_1 = json.loads(d0)
>>> type(d0_1)
<class 'dict'>
>>> d0_1
{'ResponseBody': None, 'StatusText': None, 'AppId': None, 'ResponseTime': '2015-12-01T00:01:02.5926127Z', 'HardwareId': None, 'RequestTime': '2015-12-01T00:00:42.5278699Z', 'StatusId': 0, 'Lon': 0.0, 'Url': 'http://mosi-prod.cloudapp.net/api/v1/GetCameraImage?AuthToken=vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew%7C&CameraId=13782', 'RequestContentBody': '', 'RequestId': '52a65260-1637-4653-a496-7555a2386340', 'MiniProfilerJson': None, 'RequestContentHeaders': '{}', 'ActionName': 'GetCameraImage', 'StatusIdString': 'Ok', 'HttpMethod': 'GET', 'iSliverId': 51846, 'ResponseHeaders': '{}', 'ResponseContentHeaders': '{"Content-Type":["image/jpeg"]}', 'apiServerType': 0, 'AuthToken': 'vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew|', 'iConsumerId': 2986897, 'RequestHeaders': '{"Connection":["keep-alive"],"Via":["HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com"],"Accept":["application/json"],"Accept-Encoding":["gzip","deflate"],"Accept-Language":["en-us"],"Host":["mosi-prod.cloudapp.net"],"User-Agent":["Traffic/5.4.0","CFNetwork/758.1.6","Darwin/15.0.0"]}', 'iVendorId': 12561, 'Lat': 0.0, 'UserId': '2986897'}
Picking one of the entries, that looks like more json
>>> hdrs = d0_1['RequestHeaders']
>>> type(hdrs)
<class 'str'>
Yep, it decodes to what I want
>>> hdrs_0 = json.loads(hdrs)
>>> type(hdrs_0)
<class 'dict'>
>>>
>>> hdrs_0["Via"]
['HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com']
>>>
>>> type(hdrs_0["Via"])
<class 'list'>
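A sketch of option (2), automated (deep_decode is a hypothetical helper of my own, not from the original post): walk the decoded structure and re-parse any string that is itself valid JSON:
import json

def deep_decode(obj):
    # Recursively re-parse strings containing embedded JSON.
    # Caveat: numeric-looking strings such as "2986897" also parse,
    # so they come back as numbers; filter by key if that matters.
    if isinstance(obj, str):
        try:
            return deep_decode(json.loads(obj))
        except ValueError:
            return obj
    if isinstance(obj, list):
        return [deep_decode(v) for v in obj]
    if isinstance(obj, dict):
        return {k: deep_decode(v) for k, v in obj.items()}
    return obj

with open('test.json') as f:
    data = deep_decode(json.load(f))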
Here you are :) :
responselist = []
with open('dataFile.json', 'r') as json_file:
    reader = json_file.readlines()
    for row in reader:
        strActNm = 'ActionName":"'; lenActNm = len(strActNm)
        actionAt = row.find(strActNm)
        while actionAt > 0:
            nxtQuotAt = row.find('"', actionAt + lenActNm + 2)
            responselist.append(row[actionAt-1: nxtQuotAt+1])
            actionAt = row.find('ActionName":"', nxtQuotAt)
print(responselist)
which gives:
>python3.6 -u "dataFile.py"
['"ActionName":"GetTrafficTile"']
>Exit code: 0
where dataFile.json is the file with the line you provided and dataFile.py the code provided above.
It's the hard way, but if the files are in a bad format you have to work around it, and simple pattern matching works in any case. For more complex cases you would need regular expressions, but here a simple .find() is enough to do the job.
The code also finds multiple "actions" in a line (if the line contains more than one action).
Here the result for the file you provided in your link while using following small modification of the code above:
responselist = []
with open('dataFile1.json', 'r') as json_file:
    reader = json_file.readlines()
    for row in reader:
        strActNm = '\\"ActionName\\":\\"'
        # strActNm = 'ActionName":"'
        lenActNm = len(strActNm)
        actionAt = row.find(strActNm)
        while actionAt > 0:
            nxtQuotAt = row.find('"', actionAt + lenActNm + 2)
            responselist.append(row[actionAt: nxtQuotAt+1].replace('\\', ''))
            actionAt = row.find('ActionName":"', nxtQuotAt)
print(responselist)
gives:
>python3.6 -u "dataFile.py"
['"ActionName":"GetCameraImage"']
>Exit code: 0
where dataFile1.json is the file you provided in the link.