I am getting the following Exception when trying to save a row to a db:
Unexpected error: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('latin-1', u"First 'A\u043a' Last", 7, 8, 'ordinal not in range(256)'), <traceback object at 0x106562908>)
Before inserting, I am converting every string in a dictionary to latin-1 like this:
for k, v in row.items():
    if type(v) is str:
        row[k] = v.decode('utf-8').encode('latin-1')
The offending character seems to be the u'\u043a' following the 'A'; in other cases there are other characters that are likewise "not in range(256)".
Help appreciated.
Solved. The problem was attempting to decode a string that was already Unicode. I also added an error handler to the encode() arguments:
v.encode('latin-1', 'ignore')
Note that 'ignore' silently drops any characters latin-1 cannot represent; pass 'replace' instead if you want them substituted with a '?'.
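For reference, a minimal Python 2 sketch of the corrected loop (assuming, as in the question, that plain byte strings in the row are UTF-8 encoded):
# Python 2: normalize every value to a latin-1 byte string before inserting
for k, v in row.items():
    if isinstance(v, str):        # byte string: assume UTF-8, as above
        v = v.decode('utf-8')
    if isinstance(v, unicode):    # decoded text: encode once, with an error handler
        row[k] = v.encode('latin-1', 'ignore')  # or 'replace' for '?' markers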
I'm having a hard time dealing with some parsing issues related to Emojis.
I have a JSON document requested from the Brandwatch site using urllib (1). I must decode it as UTF-8; however, when I do so, I get surrogate escapes, and the JSON loader cannot deal with them (2).
I've tried using BeautifulSoup4, which works great; however, when there's a &quot; in the site's result, it is transformed to ", and then the JSON loader cannot deal with it, saying that a , is missing. After tons of searching, I gave up trying to escape the ", which would be the ideal fix (3).
So now, I'm stuck with both "solutions/problems". Any ideas on how to proceed?
Obs: This is a program that fetches data from Brandwatch and puts it into a MySQL database, so performance is a concern here.
Obs2: PyJQ is a jq binding for Python which does the request, and I can change the opener.
(1) - Dealing with the first approach using urllib, these are the relevant parts of the code used for it:
def downloader(url):
    return json.loads(urllib.request.urlopen(url).read().decode('utf8'))
...
parsed = pyjq.all(jqparser, url=url, vars={"today": start_date}, opener=downloader)
Error raised:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
If I print the result of urllib.request.urlopen(url).read().decode('utf8') instead of sending it to the JSON loader, this is what appears. These escape sequences seem to be emojis:
"fullname":"Botinhas\uD83D\uDC62"
(2) Dealing with the second approach using BeautifulSoup4, here's the relevant part of the code (same as above, just with the downloader function changed):
def downloader(url):
    return json.loads(BeautifulSoup(urllib.request.urlopen(url), 'lxml').get_text())
...
parsed = pyjq.all(jqparser, url=url, vars={"today": start_date}, opener=downloader)
And this is the error raised:
Expecting ',' delimiter: line 1 column 4814765 (char 4814764)
Printing the output, the " before "Diretas Já" should have been escaped:
"title":"Por "Diretas Já", manifestações pelo país ocorrem em preparação ao "Ocupa Brasília" - Sindicato dos Engenheiros no Estado do Rio de Janeiro"
I've thought of running a regex; however, I'm not sure whether that would be the most appropriate solution for this case, as performance is a concern.
(3) - Part of the Brandwatch result with the &quot; problem mentioned above
UPDATE:
As Martin stated in the comments, I ran a replace swapping &quot; for nothing. Then it raised the former problem with the emoji:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
UPDATE2:
I've added this to the downloader function:
re.sub(r'\\u[dD][0-9a-fA-F]{3}', "", urllib.request.urlopen(url).read().decode('utf-8', 'ignore'))
It solved the issue; however, I don't think it's the best way to solve it. If anybody knows a better option, please share.
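One arguably cleaner option (a sketch, untested against pyjq; the downloader shape mirrors the code above) is to parse the JSON first and then strip any lone surrogate code points from the resulting strings, rather than pattern-matching the raw escape sequences:
import json
import re
import urllib.request

# Matches any UTF-16 surrogate code point left in a decoded Python string
_SURROGATES = re.compile('[\ud800-\udfff]')

def _drop_surrogates(value):
    # Recursively clean every string inside the parsed JSON structure
    if isinstance(value, str):
        return _SURROGATES.sub('', value)
    if isinstance(value, list):
        return [_drop_surrogates(v) for v in value]
    if isinstance(value, dict):
        return {k: _drop_surrogates(v) for k, v in value.items()}
    return value

def downloader(url):
    raw = urllib.request.urlopen(url).read().decode('utf-8')
    return _drop_surrogates(json.loads(raw))

Well-formed pairs such as \uD83D\uDC62 survive (json.loads merges them into real emoji); only the unpaired surrogates that pyjq chokes on are removed.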
I'm using a program called Wrapper.py (this), but there's some kind of error. Because it's Python, I've tried to find the error myself. As far as I can tell, it tries to write and load some JSON, but it receives strings like "Közép-európai nyelvezet", and this causes a UnicodeDecodeError:
>>> import json
>>> out = {"a": "Közép-európai nyelvterület"}
>>> json.dumps(out)
Traceback (the path, etc.)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 1: Invalid start byte
Then I googled and found this solution for encoding:
>>> a = json.dumps(out, ensure_ascii=False)
>>> a
'{"a":"K\x94z\x82p-eur\xa2pai nyelvter\x81let"}'
Then I wanted to load it:
>>> json.loads(a)
Traceback, etc.
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 1: Invalid start byte
>>> json.load(a, ensure_ascii=False)
Traceback
TypeError: __init__() got an unexpected keyword argument 'ensure_ascii'
How can I load my data back?
Thanks in advance for your help!
Use text (unicode) instead of byte strings. The 0x94 byte shows your input was never UTF-8 to begin with; it looks like a DOS code page such as cp852, where 0x94 is 'ö', which is exactly why the utf8 codec rejects it.
out = {u"a":u"Közép-európai nyelvterület"}
I'm pulling vertex attributes from a MySQL DB with CHARSET=latin1, and I get 'Error in nchar(labels) : invalid multibyte string 326' when I try the following:
plot(graph,
     layout=fr_layout,
     vertex.label=V(graph)$univ,
     vertex.size=2,
     edge.arrow.size=.5)
For example, I have a vertex that is "Università degli Studi di Milano" and "St. John's University". What's causing the error, and how can I fix it? I've tried using CAST during my SELECT and replacing all punctuation, but that doesn't seem to change anything. How do I convert accented characters in a MySQL field to something R can use as a label in a plot?
Use iconv to convert the labels from latin-1 to UTF-8 before plotting, e.g. applied to the vertex attribute used above:
V(graph)$univ <- iconv(V(graph)$univ, "latin1", "UTF-8")
I'm unable to understand why a 'Character decoding failed' warning occurs on my server:
May 19, 2012 2:56:57 AM org.apache.tomcat.util.http.Parameters processParameters
WARNING: Parameters: Character decoding failed. Parameter 'width' with value '100%' has been ignored. Note that the name and value quoted here may be corrupted due to the failed decoding. Use debug level logging to see the original, non-corrupted values.
Well, the % at the end signifies a bad encoding: in a URL-encoded value, a % must be followed by two hexadecimal digits.
If the 'Parameters: Character decoding failed' warning is caused by a literal % in a parameter, percent-encoding it on the client is enough; nothing needs to change on the server. I am using this in my Ajax function:
var productPromoCodeIdParam = productPromoCodeId.replace(/%/g, '%25'); // global replace: a plain '%' string argument would only swap the first match
// encodeURIComponent(productPromoCodeId) handles all reserved characters at once
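For comparison, a minimal Python sketch of the same idea (urllib.parse is just the standard library here, not part of the original setup), percent-encoding the value before it enters the query string:
from urllib.parse import quote, urlencode

width = '100%'
print(quote(width))                 # 100%25
print(urlencode({'width': width}))  # width=100%25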
Our Rails 3 app needs to be able to accept foreign characters like ä and こ, and save them to our MySQL db, which has its character set as 'utf8'.
One of our models runs a validation which is used to strip out all the non-word characters in its name, before being saved. In Ruby 1.8.7 and Rails 2, the following was sufficient:
def strip_non_words(string)
  string.gsub!(/\W/, '')
end
This stripped out bad characters but preserved things like 'ä', 'こ', and '3'. With Ruby 1.9's new encodings, however, that statement no longer works: it now removes those characters along with the ones we actually want gone. I am trying to find a way to restore the old behavior.
Changing the gsub to something like this:
def strip_non_words(string)
  string.gsub!(/[[:punct:]]/, '')
end
lets the string pass through fine, but then the database kicks up the following error:
Mysql2::Error: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation
Running the string through Iconv to try and convert it, like so:
def strip_non_words(string)
Iconv.conv('LATIN1', 'UTF8', string)
string.gsub!(/[[:punct]]/,'')
end
Results in this error:
Iconv::IllegalSequence: "こäè" # "こäè" being a test string
I'm basically at my wits' end here. Does anyone know of a way to do what I need?
This ended up being a bit of an interesting fix.
I discovered that Ruby has a regex I could use, but only for ASCII strings. So I had to convert the string to ASCII, run the regex, then convert it back for submission to the db. The end result looks like this:
def strip_non_words(string)
  string_encoded = string.force_encoding(Encoding::ASCII_8BIT)
  string_encoded.gsub!(/[^\p{Word}]+/, '') # strip the non-word characters
  string_reencoded = string_encoded.force_encoding('ISO-8859-1')
  string_reencoded # return
end
Turns out you have to encode things separately due to how Ruby handles changing a character encoding: http://ablogaboutcode.com/2011/03/08/rails-3-patch-encoding-bug-while-action-caching-with-memcachestore/