How to detect parts of Unicode in a string in Python - MySQL

I call an API to get some info, and sometimes the response has values like the one below.
"address": "BOULEVARD DU MÃ\u0089ROU - SN PEÃ\u008fRE, "
How can I detect these and convert them to the correct Latin letters? I want to upload this data to a MySQL database. Right now it throws the following warning.
Warning: (1366, "Incorrect string value: '\\xC2\\x88ME A...' for column 'address' at row 1")
I'm using pymysql to insert this info into the DB.

The example data was originally encoded as UTF-8, but decoded as latin1. You can reverse the process to fix it, or read it from the source as UTF-8 to begin with:
>>> s = "BOULEVARD DU MÃ\u0089ROU - SN PEÃ\u008fRE, "
>>> s.encode('latin1').decode('utf8')
'BOULEVARD DU MÉROU - SN PEÏRE, '

You can use the str .encode() method, chained with .decode():
>>> "BOULEVARD DU MÃ\u0089ROU - SN PEÃ\u008fRE, ".encode("latin-1").decode("utf-8")
'BOULEVARD DU MÉROU - SN PEÏRE, '
Though be aware that if the API response contains any characters that cannot be encoded as latin-1, you'll hit a UnicodeEncodeError.
If at all possible, rather than doing this, you'll probably want to change the character set of your MySQL database to UTF-8.
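On the database side, connecting with an explicit UTF-8 character set avoids the 1366 warning once the text itself is fixed. A minimal sketch (the connection parameters and the addresses table are hypothetical):

import pymysql

# Hypothetical connection details; charset='utf8mb4' makes the client/server
# conversation UTF-8 so properly encoded text is stored as-is.
conn = pymysql.connect(host="localhost", user="user", password="secret",
                       database="mydb", charset="utf8mb4")

s = "BOULEVARD DU MÃ\u0089ROU - SN PEÃ\u008fRE, "
fixed = s.encode("latin1").decode("utf8")  # undo the mojibake first

with conn.cursor() as cur:
    cur.execute("INSERT INTO addresses (address) VALUES (%s)", (fixed,))
conn.commit()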

It looks like you have multiple errors -- "double encoding" and literal Unicode "codepoints" (\uxxxx). Hence, it is hard to unravel what went wrong.
It would be better to go back to the source and fix the encoding at each stage -- not to try to encode/decode after the mess is made. In almost all cases no conversion code is needed if you specify UTF-8 at every stage.
Here are some notes on what to do in Python: http://mysql.rjweb.org/doc.php/charcoll#python
The hex for É should be C389 and the hex for Ï should be C38F. There should be no \uxxxx except in HTML. Even in HTML, it is normally better to simply use the utf8 encoding, since HTML can handle it.
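You can verify those byte values from a Python 3 shell:
>>> 'É'.encode('utf-8').hex()
'c389'
>>> 'Ï'.encode('utf-8').hex()
'c38f'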

Related

Unhandled exception: 'charmap' codec can't decode byte 0x81 in position 3852: character maps to <undefined>

So I tried to download this dataset from Kaggle, and when I try to import it, it shows the 'charmap' error quoted above (screenshot omitted).
I opened it in Excel and even Notepad and saved as UTF-8, but still faced an error. Does this mean this dataset can only be opened with Python? I have not yet studied Python, but I wanted to do a few queries with SQL and visualizations for my project.
https://www.kaggle.com/datasets/vardan95ghazaryan/top-250-football-transfers-from-2000-to-2018
The character set must be specified in multiple places:
The client
The table definition (or defaulted from the database)
and maybe other places.
For further discussion, please show the line that is in question, plus the hex of that line, plus what you expect the line to say.
I found this in that download; there are doubtless other issues:
Diego Tristán
The á character in that name is encoded as hex E1, implying that it is one of these encodings: cp1250, dec8, latin1, latin2, latin5. (It is likely to be latin1.)
Your Workbench setup was (apparently) configured to assume that any data coming at it would be UTF-8. When it saw the E1, it croaked because that is not valid UTF-8.
Find out how you can configure "imports". It should allow you to change the "character set"; change that to "latin1". Then try the import again.
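Alternatively, you can convert the file to UTF-8 yourself before importing. A minimal pandas sketch (both filenames are placeholders for the downloaded Kaggle CSV):

import pandas as pd

# Read with the encoding the file is actually in, then re-save as UTF-8.
df = pd.read_csv("transfers.csv", encoding="latin1")
df.to_csv("transfers_utf8.csv", index=False, encoding="utf-8")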
Meanwhile, complain to Kaggle that UTF-8 is becoming the de facto standard and they should change their data to that encoding.
You say you "saved as UTF-8"; if so, can you provide me with that file? I'll do a similar analysis.

How to decode an HTTP request with utf-8 and treat the surrogate keys (Emojis)

I'm having a hard time dealing with some parsing issues related to Emojis.
I have a JSON response requested from the Brandwatch site using urllib. (1) Then I must decode it as UTF-8; however, when I do so, I get surrogate code points, and the JSON loader cannot deal with them. (2)
I've tried using BeautifulSoup4, which works great; however, when there's a &quot; in the site result, it is transformed to ", and then the JSON loader cannot deal with it, saying a , is missing. After tons of searching, I gave up trying to escape the ", which would be the ideal fix. (3)
So now, I'm stuck with both "solutions/problems". Any ideas on how to proceed?
Obs: This is a program that fetches data from Brandwatch and puts it inside a MySQL database, so performance is an issue here.
Obs2: PyJQ is a JQ binding for Python which does the request, and I can change the opener.
(1) - Dealing with the first approach using urllib, these are the relevants parts of the code used for it:
def downloader(url):
    return json.loads(urllib.request.urlopen(url).read().decode('utf8'))
...
parsed = pyjq.all(jqparser, url=url, vars={"today": start_date}, opener=downloader)
Error raised:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
If I print the result of urllib.request.urlopen(url).read().decode('utf8') instead of sending it to the JSON loader, this is what appears. These keys seem to be emojis.
"fullname":"Botinhas\uD83D\uDC62"
(2) Dealing with the second approach using BeautifulSoup4, here's the relevant part of the code. (Same as above, just changed the downloader function)
def downloader(url):
    return json.loads(BeautifulSoup(urllib.request.urlopen(url), 'lxml').get_text())
...
parsed = pyjq.all(jqparser, url=url, vars={"today": start_date}, opener=downloader)
And this is the error raised:
Expecting ',' delimiter: line 1 column 4814765 (char 4814764)
Printing the result, the " before Diretas Já should have been escaped:
"title":"Por "Diretas Já", manifestações pelo país ocorrem em preparação ao "Ocupa Brasília" - Sindicato dos Engenheiros no Estado do Rio de Janeiro"
I've thought of running a regex, however, I'm not sure whether this would be the most appropriate solution to this case as performance is an issue.
(3) - Part of the Brandwatch result with the &quot; problem mentioned above
UPDATE:
As Martin stated in the comments, I ran a replace swapping &quot; for nothing. Then it raised the former problem with the emoji:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
UPDATE2:
I've added this to the downloader function:
re.sub(r'\\u[dD][89a-fA-F][0-9a-fA-F]{2}', "", urllib.request.urlopen(url).read().decode('utf-8', 'ignore'))  # strips \uD800-\uDFFF (surrogate) escapes only
It solved the issue; however, I don't think it's the best way to solve it. If anybody knows a better option, please share it.
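A gentler alternative than deleting the escapes is to let Python recombine the surrogate pairs, so the emoji survives. A minimal sketch (assuming the decoded strings hold raw surrogate code points, as in the example above):

# Repair a string that ended up holding raw surrogate code points.
raw = "Botinhas\ud83d\udc62"  # two surrogates left over from a naive decode
fixed = raw.encode("utf-16", "surrogatepass").decode("utf-16")
print(fixed)  # Botinhas👢 -- the pair recombines into U+1F462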

MySQL fails to save UTF-8 string in some cases

During spam fighting, I found some spam comments stored without any content...
After trying to isolate the problem, here is what I found after saving similar comments both to a file and to the MySQL database...
This is what the first few "chars" of the comment look like (in hex, because the input encoding is unknown):
D1EA E0F7 E0F2 FC20 EFEE EFF3 EBFF F0ED FBE5 20EF F0EE E3F0 E0EC ECFB
After executing
INSERT INTO test VALUES (0xD1EAE0F7E0F2FC20EFEEEFF3EBFFF0EDFBE520EFF0EEE3F0E0ECECFB21), (0x21D1EAE0F7E0F2FC20EFEEEFF3EBFFF0EDFBE520EFF0EEE3F0E0ECECFB), (0x21)
the test MySQL table (utf-8) contains 3 rows: the first without any text, the second and third with the single character "!" as text... (Note that 21, the hex code for "!", is also at the end of the first entry, yet it is not saved. A latin1 encoding saved some useless replacement text for every byte, but this post is not about that.)
Of course, D1EA isn't a valid UTF-8 sequence (0xD1 = 1101 0001 is a two-byte lead that should be followed by one 10xxxxxx byte, but 0xEA = 1110 1010 is not one), but a robust system like a database server should be able to deal with it...
My guess is, MySQL (ver. 5.1.66-0+squeeze1) shouldn't choose when to save data and when not, even if it's not a valid UTF-8 encoded string... Or at least, it should not claim the query was successful when it decides not to store the data!
Is it a bug in MySQL, or what?
Thanks
The encoding is Windows-1251, and it decodes to
Скачать популярные программы
// "Download popular software", per Google Translate
You should reject non-UTF8 input in your code before doing anything with it.
if (!mb_check_encoding($input, "UTF-8")) {
    header("HTTP/1.1 400 Bad Request");
    die("Invalid encoding");
}
FTR, your queries are hex literals, not misencoded text.
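For comparison, the same pre-insert check sketched in Python (an illustration built from the hex dump in the question, not part of the original answer):

def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

comment = bytes.fromhex("D1EAE0F7E0F2FC20EFEEEFF3EBFFF0EDFBE5")
print(is_valid_utf8(comment))          # False -- not valid UTF-8
print(comment.decode("windows-1251"))  # Скачать популярные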

How to Encode and Decode "Acute accented characters" using Perl

I am working on a web-based educational website, where we are using Perl, MySQL 5, Apache and Template Toolkit. We are planning to introduce support for multiple languages on our website.
What we have done:
If we have a tab name like <h1>Courses Main Page</h1> in our template file, we have converted that to
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<h1>[% glossary.$language.courses_main_page %]</h1>
where $language holds the value the user selects when logging in.
We have a table to maintain this data in our MySQL DB:
CREATE TABLE translation (
    english varchar(255) NOT NULL,
    language varchar(255) NOT NULL,
    translation varchar(2000) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Translation of Element text to a foreign language';
In the connect function of MySQL, I am providing 'SET character_set_results=NULL'.
I tried with utf8, but the issue, which had been limited to some tabs, then spread to many sections.
So as soon as the user logs into the system, we fetch all the translations, store them in a Perl hash, and cache it. We pass this hash to the template file, which substitutes the values.
Problem: Acute accented characters like á and é etc are getting replaced with some different character set symbols.
For example: in the front end we are seeing "Cursos Página Principal" instead of Cursos Página Principal.
It is very similar to the solution given in htmlentities and é (e acute)
Can anyone tell me how to achieve the same in Perl?
Denoting the charset
For example: in the front end we are seeing "Cursos Página Principal" instead of Cursos Página Principal.
This mojibake happens when the characters are transferred as UTF-8 but interpreted as ISO-8859-1 or similar. So I suggest the easiest way to fix this is making sure that your HTML page gets shipped to the client with a proper mime type, i.e.
Content-Type: text/html; charset=utf-8
If that information is present in the HTTP header, the value there will override any setting in the HTML document itself. So make sure that either you set the HTTP header, or that your HTTP header specifies no charset at all, so that the browser will have a look at the meta setting.
In some browsers (Firefox for example) you can manually change the character set using View / Character Encoding. You can use that to check whether a wrong character encoding while rendering really is the cause of the problem.
Actually encoding and decoding
There are some situations where fixing the charset won't help. It might be that you simply don't control that part of your framework. Or that something translates your characters from ISO-8859-1 to UTF-8 twice, so that the unreadable symbols are in fact represented as UTF-8 already. In these cases, you can use the Encode module to encode the characters in Perl directly, using HTML character references as output:
use Encode qw(decode encode FB_HTMLCREF);
# maybe: $unicodeString = decode("utf-8", $byteString);
$htmlString = encode("ascii", $unicodeString, FB_HTMLCREF);
Whether or not the decode step is necessary depends on how you talk to your database. If your database connection is capable of supporting Unicode, then you'll already have Unicode strings, and you can simply encode these to HTML. For DBD::mysql there is a parameter mysql_enable_utf8 => 1 which achieves this. Using it is preferable to decoding things in your own code. This answer has details on the syntax.
One example on what these functions do:
$byteString = "Cursos P\xc3\xa1gina Principal."; # two bytes
$unicodeString = "Cursos P\N{U+00E1}gina Principal."; # one unicode character
$htmlString = "Cursos P&#225;gina Principal."; # html character reference

JSON parsing with Unicode characters

I have a JSON file with Unicode characters, and I'm having trouble parsing it. I've tried the JSON library in Flash CS5, and I have tried http://json.parser.online.fr/, and I always get "unexpected token - eval fails".
I'm sorry, there really was a problem with the syntax; it came this way from the client.
Can someone please help me? Thanks
Quoth the RFC:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
So a correctly encoded Unicode character should not be a problem. Which leads me to believe that it's not correctly encoded (maybe it uses latin-1 instead of UTF-8). How did you create the file? In a text editor?
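One way to test that theory is to try both decodings on the raw bytes. A minimal sketch in Python (the filename is a placeholder):

import json

with open("data.json", "rb") as f:
    raw = f.read()

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    text = raw.decode("latin-1")  # the fallback guessed above

data = json.loads(text)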
There might be an obscure Unicode whitespace character hidden in your string.
This URL contains more detail:
http://timelessrepo.com/json-isnt-a-javascript-subset
In ASP.NET you would think you would use System.Text.Encoding to convert a string like "Paul\u0027s" back to a string like "Paul's", but I tried for hours and found nothing that worked.
The trouble is that hardcoding a string as shown above already decodes the string, as you will see if you put a breakpoint on it. So in the end I wrote a loop that converts the hex escape (0x27) to the decimal HTML entity (&#39;), ending up with HTML encoding, and then decoded that.
for (int code = 1; code <= 256; code++)
{
    // Build the four-digit JSON escape (e.g. "\u0027") and the equivalent
    // decimal HTML entity (e.g. "&#39;"), then swap one for the other.
    // Note: Replace is case-sensitive; "x4" matches lowercase escapes like \u00e1.
    string hex = "\\u" + code.ToString("x4");
    string dec = "&#" + code + ";";
    HTML = HTML.Replace(hex, dec);
}
HTML = System.Web.HttpUtility.HtmlDecode(HTML);
Ugly as sin, I know, but without using the latest framework (not on the ISP's server) it was the best I could do, and someone must know a better solution.
I had the same problem, and I just changed the file encoding from Mac-Roman/Windows-1252 to UTF-8, and it worked.
I had the same problem with Twitter JSON files. I was parsing them in Python with json.loads(tweet), but it failed for half of the records.
I changed to Python 3 and it works well now.
If you seem to have trouble with the encoding of a JSON file generated by Python with json.dumps() (i.e. escaped codes such as \u00fc aren't displayed correctly regardless of your editor's encoding setting): it outputs ASCII by default and escapes the non-ASCII characters! See python json unicode - how do I eval using javascript (and python: json.dumps can't handle utf-8? and Why does json.dumps escape non-ascii characters with "\uxxxx").
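The fix is to pass ensure_ascii=False, which keeps non-ASCII characters literal in the output:
>>> import json
>>> json.dumps({"city": "Zürich"})
'{"city": "Z\\u00fcrich"}'
>>> json.dumps({"city": "Zürich"}, ensure_ascii=False)
'{"city": "Zürich"}'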