Can't decode byte list to string list - json

In my users variable i store a byte list that i get from a local ldap and convert to a string list with my for loop.
I must return this list as jsonify.
If i don't use that encoding key i get a different output from the original but still encoded.
The problem is i can't access the decode method anywhere.
Any help?
users = ldap.get_group_members('ship_crew')
user_list = []
for user in users:
user_list.append((str(user, encoding='utf-8').split(",")[0].split("=")[1]))
return jsonify(user_list)
original list from users variable:
[
"cn=Philip J. Fry,ou=people,dc=planetexpress,dc=com",
"cn=Turanga Leela,ou=people,dc=planetexpress,dc=com",
"cn=Bender Bending Rodr\u00edguez,ou=people,dc=planetexpress,dc=com"
]
for loop with encoded output:
[
"Philip J. Fry",
"Turanga Leela",
"Bender Bending Rodr\u00edguez"
]
expected:
[
"Philip J. Fry",
"Turanga Leela",
"Bender Bending Rodríguez"
]

I Would use regex to extract your names:
import re
l = [
"cn=Philip J. Fry,ou=people,dc=planetexpress,dc=com",
"cn=Turanga Leela,ou=people,dc=planetexpress,dc=com",
"cn=Bender Bending Rodr\u00edguez,ou=people,dc=planetexpress,dc=com"
]
NAME_PATTERN = re.compile(r'cn=(.*?),')
result = [NAME_PATTERN.match(s).group(1) for s in l]
print(result)
Output:
['Philip J. Fry', 'Turanga Leela', 'Bender Bending Rodríguez']
Note that when you dump it to JSON the í isn't supported since by default it tries converting to ASCII so it dumps it to UTF-16 (Unicode 0x00ED):
import json
print(json.dumps(result, indent=2))
Output:
[
"Philip J. Fry",
"Turanga Leela",
"Bender Bending Rodr\u00edguez"
]
You can get around this via setting ensure_ascii=False if you want, though if you are using this in an API I would be careful and stick with ASCII with unicode encodings:
print(json.dumps(result, indent=2, ensure_ascii=False))
Output:
[
"Philip J. Fry",
"Turanga Leela",
"Bender Bending Rodríguez"
]

Your output is correct JSON. Unicode code point 00ED is í, and in JSON any character can be escaped using its Unicode code point. "\00ed" as in your JSON output is a valid way to write that character.
It would also be correct JSON to have that character without encoding it, but apparently jsonify chooses to encode it.
Any competent JSON decoder will then turn it back into a í.
If using the standard library's json.dumps you can use ensure_ascii=False to prevent this behaviour if you don't want it, but I don't know what "jsonify" is.

Related

Python Regex: How to match the string and then modify that string by adding something at the end

UPDATED CODE: It is working but now the problem is that the code is attaching same random_value to every Path.
Following is my code with a sample chunk of text. I want to read Path and it's value then add (/some unique random alphabet and number combination) at the end of every Path value without changing the already existed value. For example I want the Path to be like
"Path" : "already existed value/1A" e.t.c something like that.
I am unable to make the exact regex pattern of replacing it.
Any help would be appreciated.
It can be done by json parse but the requirement of the task is to do it via REGEX.
from io import StringIO
import re
import string
import random
reader = StringIO("""{
"Bounds": [
{
"HasClip": true,
"Lang": "no",
"Page": 0,
"Path": "//Document/Sect[2]/Aside/P",
"Text": "Potsdam, den 9. Juni 2021 ",
"TextSize": 12.0
}
],
},
{
"Bounds": [
{
"HasClip": true,
"Lang": "de",
"Page": 0,
"Path": "//Document/Sect[3]/P[4]",
"Text": "this is some text ",
"TextSize": 9.0,
}
],
}""")
def id_generator(size=3, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))
text = reader.read()
random_value = id_generator()
pattern = r'"Path": "(.*?)"'
replacement = '"Path": "\\1/'+random_value+'"'
text = re.sub(pattern, replacement, text)
#This is working but it is only attaching one same random_value on every Path
print(text)
Use group 1 in the replacement:
replacement = '"Path": "\\1/1A"'
See live demo.
The replacement regex \1 puts back what was captured in group 1 of the match via (.*?).
Since you already have a json structure, maybe it would help to use the json module to parse it.
import json
myDict = json.loads("your json string / variable here")
# now myDict is a dictionary that you can use to loop/read/edit/modify and you can then export myDict as json.

Dask how to open json with list of dicts

I'm trying to open a bunch of JSON files using read_json In order to get a Dataframe as follow
ddf.compute()
id owner pet_id
0 1 "Charlie" "pet_1"
1 2 "Charlie" "pet_2"
3 4 "Buddy" "pet_3"
but I'm getting the following error
_meta = pd.DataFrame(
columns=list(["id", "owner", "pet_id"]])
).astype({
"id":int,
"owner":"object",
"pet_id": "object"
})
ddf = dd.read_json(f"mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files looks like
[
{
"id": 1,
"owner": "Charlie",
"pet_id": "pet_1"
},
{
"id": 2,
"owner": "Charlie",
"pet_id": "pet_2"
}
]
As far I understand the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it the meta= argument
PD:
I also tried doing it in the following way
{
"id": [1, 2],
"owner": ["Charlie", "Charlie"],
"pet_id": ["pet_1", "pet_2"]
}
But Dask is wrongly interpreting the data
ddf.compute()
id owner pet_id
0 [1, 2] ["Charlie", "Charlie"] ["pet_1", "pet_2"]
1 [4] ["Buddy"] ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
blocksize=None, orient="records",
lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file, which is the more common case for Dask (you are not assuming that a newline character means a new record)
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.

How to evaluate JSON Path with fields that contain quotes inside a value?

I have a NiFi flow that takes JSON files and evaluates a JSON Path argument against them. It work perfectly except when dealing with records that contain Korean text. The Jayway JSONPath evaluator does not seem to recognize the escape "\" in the headline field before the double quote character. Here is an example (newlines added to help with formatting):
{"data": {"body": "[이데일리 김관용 기자] 우리 군이 2018 남북정상회담을 앞두고 남
북간 군사적 긴장\r\n완화와 평화로운 회담 분위기 조성을 위해 23일 0시를 기해 군사분계선
(MDL)\r\n일대에서의 대북확성기 방송을 중단했다.\r\n\r\n국방부는 이날 남북정상회담 계기
대북확성기 방송 중단 관련 내용을 발표하며\r\n“이번 조치가 남북간 상호 비방과 선전활동을
중단하고 ‘평화, 새로운 시작’을\r\n만들어 나가는 성과로 이어지기를 기대한다”고 밝혔
다.\r\n\r\n전방부대 우리 군 장병이 대북확성기 방송을 위한 장비를 점검하고 있다.\r\n[사
진=국방부공동취재단]\r\n\r\n\r\n\r\n▶ 당신의 생활 속 언제 어디서나 이데일리 \r\n▶
스마트 경제종합방송 ‘이데일리 TV’ | 모바일 투자정보 ‘투자플러스’\r\n▶ 실시간 뉴스와
속보 ‘모바일 뉴스 앱’ | 모바일 주식 매매 ‘MP트래블러Ⅱ’\r\n▶ 전문가를 위한 국내 최상의
금융정보단말기 ‘이데일리 마켓포인트 3.0’ | ‘이데일리 본드웹 2.0’\r\n▶ 증권전문가방송
‘이데일리 ON’ 1666-2200 | ‘ON스탁론’ 1599-2203\n<ⓒ종합 경제정보 미디어 이데일리 -
무단전재 & 재배포 금지> \r\n",
"mimeType": "text/plain",
"language": "ko",
"headline": "국방부 \"軍 대북확성기 방송, 23일 0시부터 중단\"",
"id": "EDYM00251_1804232beO/5WAUgdlYbHS853hYOGrIL+Tj7oUjwSYwT"}}
If this object is in my file, the JSON path evaluates blanks for all the path arguments. Is there a way to force the Jayway engine to recognize the "\"? It appears to function correctly in other languages.
I was able to resolve this after understanding the difference between definite and indefinite paths. The Jayway github README points out the following will make a path indefinite and return a list:
When evaluating a path you need to understand the concept of when a
path is definite. A path is indefinite if it contains:
.. - a deep scan operator
?(<expression>) - an expression
[<number>, <number> (, <number>)] - multiple array indexes Indefinite paths
always returns a list (as represented by current JsonProvider).
My JSON looked like the following:
{
"Version":"14",
"Items":[
{"data": {"body": "[이데일리 ... \r\n",
"mimeType": "text/plain",
"language": "ko",
"headline": "국방부 \"軍 ... 중단\"",
"id": "1"}
},
{"data": {"body": "[이데일리 ... \r\n",
"mimeType": "text/plain",
"language": "ko",
"headline": "국방부 \"軍 ... 중단\"",
"id": "2"}
...
}
]
}
This JSON path selector I was using ($.data.headline) did not grab the values as I expected. It instead returned null values.
Changing it to $.Items[*].data.headline or $..data.headline returns a list of each headline.

json encoding - Why do utf8 and utf-8 produce different outputs?

These two commands output different results:
In [102]: json.dumps({'Café': 1}, ensure_ascii=False, encoding='utf-8')
Out[102]: '{"Caf\xc3\xa9": 1}'
In [103]: json.dumps({'Café': 1}, ensure_ascii=False, encoding='utf8')
Out[103]: u'{"Caf\xe9": 1}'
What's the difference between utf-8 and utf8?
Notice that the second iteration returns a Unicode object.
It seems strange but the documentation calls this out:
If ensure_ascii is False, the result may contain non-ASCII characters and the return value may be a unicode instance.
It would appear that only "UTF-8" works with ensure_ascii=False AND if the input is a UTF-8 encoded string (Not Unicode). With a Unicode input:
>>> json.dumps({u'Caf€': 1}, ensure_ascii=False, encoding='utf-8')
u'{"Caf\u20ac": 1}'
With ensure_ascii=False, all other valid encodings return a Unicode instance.
If you set ensure_ascii=True, then the encoding is consistent and works with other encoding, such as "windows-1252" (The input needs to be a Unicode)
I guess the rationale is that JSON should be ASCII and all encodings should be escaped, even when it's UTF-8.
To avoid any surprises follow these rules:
For proper spec. ASCII JSON:
Pass Unicode object
Call:
>>> json.dumps({u'Caf€': 1}, ensure_ascii=True)
'{"Caf\\u20ac": 1}'
UTF-8 Encoded JSON:
Pass Unicode object
Call:
>>> json.dumps({u'Caf€': 1}, ensure_ascii=False).encode("utf-8")
'{"Caf\xe2\x82\xac": 1}'

R list toJSON without slash symbols

I have a list:
[[1]]$period
[1] "DAY"
[[1]]$dates
[1] 1.361743e+12 1.362348e+12 1.362953e+12 1.363558e+12 1.364162e+12 1.364764e+12 1.365368e+12 1.365973e+12 1.366578e+12
I want to put this list to json:
toJSON(my_list)
answer:
[
{
\"period\": \"DAY\",
\"dates\": [
1361743200000,
1362348000000,
1362952800000,
1363557600000,
1364162400000,
1364763600000,
1365368400000,
1365973200000,
1366578000000
]
}
]
The answer is with slash symbols "\".
How to get rid of slash symbols? Maybe I should apply another function, to parse my_list to json?
The slash is just R's escape character. Used in this context it allows a quotation mark without closing the string. Although it appears in R console output, it doesn't appear when writing out to a file and it and the character you are escaping are counted as a single character:
x <- "ab\"c"
x
[1] "ab\"c"
writeLines(x)
ab"c
nchar(x)
[1] 4