I can't understand why this "character decoding failed" warning occurs on my server:
May 19, 2012 2:56:57 AM org.apache.tomcat.util.http.Parameters processParameters
WARNING: Parameters: Character decoding failed. Parameter 'width' with value '100%' has been ignored. Note that the name and value quoted here may be corrupted due to the failed decoding. Use debug level logging to see the original, non-corrupted values.
Well, the % at the end signifies a bad encoding. If properly encoded, a % should be followed by two hexadecimal digits.
If the 'Parameters: Character decoding failed' warning is caused by a literal % in a parameter value, the code below fixes it and nothing else needs to change. I am using this in my AJAX function:
var productPromoCodeIdParam = productPromoCodeId.replace('%','%25');
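For comparison, a minimal Python sketch of why '100%' fails and what the properly encoded value looks like (standard library only):
from urllib.parse import quote

# '%' introduces an escape sequence in URL encoding, so a literal '%'
# must itself be sent as '%25' for the server to decode the parameter.
print(quote('100%'))   # -> '100%25'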
I tried to download this dataset from Kaggle, and when I try to import it I get a character-decoding error (error screenshot not reproduced here).
I opened it in Excel and even in Notepad and saved it as UTF-8, but I still got the error. Does this mean this dataset can only be opened with Python? I have not yet studied Python, but I wanted to do a few queries with SQL and some visualizations for my project.
https://www.kaggle.com/datasets/vardan95ghazaryan/top-250-football-transfers-from-2000-to-2018
The character set must be specified in multiple places:
The client
The table definition (or defaulted from the database)
and maybe other places.
For further discussion, please show the line in question, plus the hex of that line, plus what you expect the line to say.
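A quick way to produce that hex, as a Python sketch (the filename is hypothetical):
with open("transfers.csv", "rb") as f:      # hypothetical filename
    for lineno, line in enumerate(f, 1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            # Print the first line whose bytes are not valid UTF-8,
            # together with its raw hex, as requested above.
            print(lineno, line)
            print(line.hex(" "))
            break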
I found this in the Kaggle download; there are doubtless other issues:
Diego Tristán
The á character in that name is encoded as hex E1, implying that it is one of these encodings: cp1250, dec8, latin1, latin2, latin5. (It is likely to be latin1.)
Your Workbench setup was (apparently) configured to assume that any data coming at it would be UTF-8. When it saw the E1, it croaked because that is not valid UTF-8.
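You can reproduce the failure in a couple of lines of Python (a sketch, not what Workbench does internally):
raw = b"Diego Trist\xe1n"     # 0xE1, as found in the file
print(raw.decode("latin-1"))  # 'Diego Tristán' - decodes fine
raw.decode("utf-8")           # raises UnicodeDecodeError: invalid continuation byte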
Find out how you can configure "imports". It should allow you to change the "character set"; change that to "latin1". Then try the import again.
Meanwhile, complain to Kaggle that UTF-8 is becoming the de facto standard and they should change their data to that encoding.
You say you "saved as UTF-8"; if so, can you provide that file? I'll do a similar analysis.
I'm having a hard time dealing with some parsing issues related to Emojis.
I have a JSON document requested from the Brandwatch site using urllib (1). I must then decode it as UTF-8; however, when I do so, I get surrogate escapes, and json.loads cannot deal with them (2).
I've tried BeautifulSoup4, which works great; however, when there's a &quot; in the site result, it is transformed into a literal ", and then json.loads cannot deal with it, saying that a , is missing. After tons of searching, I gave up trying to escape the ", which would be the ideal fix (3).
So now, I'm stuck with both "solutions/problems". Any ideas on how to proceed?
Note: this is a program that fetches data from Brandwatch and puts it into a MySQL database, so performance is an issue here.
Note 2: PyJQ is a jq binding for Python which does the request, and I can change the opener.
(1) Dealing with the first approach using urllib, these are the relevant parts of the code used for it:
import json
import urllib.request

def downloader(url):
    return json.loads(urllib.request.urlopen(url).read().decode('utf8'))
...
parsed = pyjq.all(jqparser,url=url, vars={"today" : start_date}, opener=downloader)
Error raised:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
If I print the result of urllib.request.urlopen(url).read().decode('utf8') instead of sending it to json.loads, this is what appears; these escapes seem to be emoji:
"fullname":"Botinhas\uD83D\uDC62"
(2) Dealing with the second approach using BeautifulSoup4, here's the relevant part of the code (same as above, with only the downloader function changed):
import json
import urllib.request
from bs4 import BeautifulSoup

def downloader(url):
    return json.loads(BeautifulSoup(urllib.request.urlopen(url), 'lxml').get_text())
...
parsed = pyjq.all(jqparser,url=url, vars={"today" : start_date}, opener=downloader)
And this is the error raised:
Expecting ',' delimiter: line 1 column 4814765 (char 4814764)
Printing the result shows that the " before "Diretas Já" is not escaped:
"title":"Por "Diretas Já", manifestações pelo país ocorrem em preparação ao "Ocupa Brasília" - Sindicato dos Engenheiros no Estado do Rio de Janeiro"
I've thought of running a regex; however, I'm not sure whether that would be the most appropriate solution for this case, as performance is an issue.
(3) Part of the Brandwatch result with the " problem mentioned above (snippet not reproduced here).
UPDATE:
As Martin stated in the comments, I ran a replace, swapping the offending " for nothing. Then it raised the former problem, with the emoji:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
UPDATE2:
I've added this to the downloader function, stripping the literal \uDxxx surrogate escapes before parsing:
re.sub(r'\\u[dD][0-9a-fA-F]{3}', "", urllib.request.urlopen(url).read().decode('utf-8', 'ignore'))
It solved the issue; however, I don't think it's the best way to solve it. Does anybody know a better option?
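For what it's worth, a gentler alternative is to re-pair the surrogates into real code points rather than delete them; a minimal sketch using Python's standard surrogatepass error handler (the function name is mine):
def repair_surrogates(text):
    # Paired surrogate escapes such as '\ud83d\udc62' survive a round-trip
    # through UTF-16 with 'surrogatepass' and come back as the real
    # character (here U+1F462) instead of being deleted.
    # A truly unpaired surrogate would still fail the final decode.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')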
Based on the Perl JSON 2.90 documentation, to encode a JSON object in UTF-8 all you need to do is:
$json_text = JSON->new->utf8->encode($perl_scalar)
That is obvious, and it is what I did. After a while, I got an issue report on GitHub from one of my users, which really surprised me, as it shouldn't have been happening!
I beat on it for hours trying to figure out what was happening, but the solution turned out to be very weird, and wrong from my point of view.
What eventually worked for me is this:
$json_text = JSON->new->latin1->encode($perl_scalar)
After that, I tested this code with all different characters, including Russian and Chinese, and it just worked.
Can anyone please explain why encoding works correctly with latin1 and not with utf8, when it actually should be vice versa?
Two possible bugs could result in the described outcome.
You were passing strings already encoded using UTF-8 to encode.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.C3.A9, you are suffering from this bug.
If you are suffering from this bug, properly decode all inputs to your program, and continue using JSON->new->utf8->encode (aka encode_json).
You were encoding the output of the JSON command using UTF-8 a second time, possibly via a :utf8 or :encoding layer on a file handle.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.E9, you are suffering from this bug.
If you are suffering from this bug, either use JSON->new->encode (aka to_json) and keep the second layer of encoding, or use JSON->new->utf8->encode (aka encode_json) and remove the second layer of encoding.
In neither case is the solution to use JSON->new->latin1->encode.
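The second bug is easy to see outside Perl, too; here is a Python sketch of the same double-encoding arithmetic (illustrative only, not the asker's code):
s = "installé"
once = s.encode("utf-8")                        # b'install\xc3\xa9' - correct UTF-8
# Bug 2: a second layer treats those bytes as latin-1 text and encodes again:
twice = once.decode("latin-1").encode("utf-8")  # b'install\xc3\x83\xc2\xa9' - mojibake
# Why ->latin1 seemed to work: latin-1 output plus the stray second UTF-8
# layer produces correctly encoded UTF-8 exactly once:
fixed = s.encode("latin-1").decode("latin-1").encode("utf-8")  # b'install\xc3\xa9'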
What are you doing to output $json_text? What kind of binmode do you use on that handle? The screenshot looks like it's double-encoded, which suggests the handle has :utf8 or :encoding enabled (which is incorrect for writing already-encoded data). As unintuitive as it may seem, ->latin1 giving a correct result matches that hypothesis (PerlIO assumes any binary string is encoded as latin-1).
I am actually generating an MS Excel file with the currencies, and if you look at the file I generated (tinyurl.com/currencytestxls), opening it in a text editor shows the correct symbol, but somehow MS Excel does not display the symbol. I am guessing there is some issue with the encoding. Any thoughts?
Here is my tcl code to generate the symbol:
set yen_val [format %c 165]
Firstly, this does produce a yen symbol (I put the format string in double quotes here just for clarity):
format "%c" 165
You can then pass it around just fine. The problem is likely to come when you try to output it; when Tcl writes a string to the outside world (with the possible exception of the terminal on Windows, as that's tricky) it encodes that string into a definite byte sequence. The default encoding is the one reported by:
encoding system
But you can see what it is and change it for any channel (if you pass in the new name):
fconfigure $theChannel -encoding $theEncoding
For example, on my system (which uses UTF-8, which can handle any character):
% fconfigure stdout -encoding
utf-8
% puts [format %c 165]
¥
If you use an encoding that cannot represent a particular character, the replacement character for that encoding is used instead; for many encodings, that's a "?". When you are sending data to another program (including to a web server or to a browser over the internet), it is vital that both sides agree on what the encoding of the data is. Sometimes this agreement is by convention (e.g., the system encoding), sometimes it is defined by the protocol (HTTP headers have this clearly defined), and sometimes it is done by metadata transferred explicitly within the content itself.
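For comparison, the same substitution behaviour sketched in Python (Tcl substitutes automatically; Python needs the 'replace' error handler):
# An encoding that cannot represent the yen sign substitutes '?':
print("¥100".encode("ascii", "replace"))   # b'?100'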
If you're writing a CSV file to be ingested by Excel, use either the “unicode” or the “utf-8” encoding and make sure you put the byte-order mark in correctly. Tcl doesn't write BOMs automatically (because it's the wrong thing to do in some cases). To write a BOM, do this as the first thing when you start writing the file:
puts -nonewline $channel "\ufeff"
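The equivalent move in Python, for comparison, is the utf-8-sig codec, which prepends that BOM automatically (the filename is hypothetical):
# 'utf-8-sig' writes U+FEFF first, which Excel sniffs to pick UTF-8.
with open("currencies.csv", "w", encoding="utf-8-sig", newline="") as f:
    f.write("currency,symbol\nJPY,¥\n")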
I am getting the following Exception when trying to save a row to a db:
Unexpected error: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('latin-1', u"First 'A\u043a' Last", 7, 8, 'ordinal not in range(256)'), <traceback object at 0x106562908>)
Before inserting, I am converting every string in a dictionary to latin-1 like this:
for k,v in row.items():
if type(v) is str:
row[k] = v.decode('utf-8').encode('latin-1')
The offending character seems to be the \u043a in 'A\u043a'; in other cases there seem to be other characters also "not within range."
Help appreciated.
Solved. The problem was attempting to decode a string which was already UTF-8. I also added 'ignore' to the encode() arguments:
v.encode('latin-1', 'ignore')
Note that 'ignore' silently drops any non-encodable characters; to have them replaced with a '?' instead, use 'replace'.
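A quick sketch of the difference between the two error handlers (shown with Python 3 literals for brevity):
u = "First 'A\u043a' Last"
print(u.encode('latin-1', 'ignore'))    # b"First 'A' Last"  - the \u043a is dropped
print(u.encode('latin-1', 'replace'))   # b"First 'A?' Last" - replaced with '?'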