Django with MySQL and UTF-8 [duplicate]

Possible Duplicate:
How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?
Background:
I am using Django with MySQL 5.1 and I am having trouble with 4-byte UTF-8 characters causing fatal errors throughout my web application.
I've used a script to convert all tables and columns in my database to UTF-8 which has fixed most unicode issues, but there is still an issue with 4-byte unicode characters. As noted elsewhere, MySQL 5.1 does not support UTF-8 characters over 3 bytes in length.
Whenever I enter a 4-byte unicode character (e.g. 🀐) into a ModelForm on my Django website, the form validates and then an exception similar to the following is raised:
Incorrect string value: '\xF0\x9F\x80\x90' for column 'first_name' at row 1
My question:
What is a reasonable way to avoid fatal errors caused by 4-byte UTF-8 characters in a Django web application with a MySQL 5.1 database?
I have considered:
Selectively disabling MySQL warnings to avoid that specific error message (not sure whether that is possible yet)
Creating middleware that will look through the request.POST QueryDict and substitute/remove all invalid UTF-8 characters
Somehow hooking/altering/monkey-patching the mechanism that outputs SQL queries for Django or for MySQLdb, to substitute/remove all invalid UTF-8 characters before the query is executed
Example middleware for replacing invalid characters (inspired by this SO question):
import re

class MySQLUnicodeFixingMiddleware(object):
    # Matches any code point outside the BMP, i.e. anything that needs
    # 4 bytes in UTF-8 and that MySQL 5.1's utf8 charset cannot store.
    INVALID_UTF8_RE = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

    def process_request(self, request):
        """Replace 4-byte unicode characters by REPLACEMENT CHARACTER"""
        request.POST = request.POST.copy()
        for key, values in request.POST.iterlists():
            request.POST.setlist(
                key,
                [self.INVALID_UTF8_RE.sub(u'\uFFFD', v) for v in values])

Do you have the option to upgrade MySQL? If you do, you can upgrade and set the encoding to utf8mb4.
Assuming that you don't have that option, I see these options for you:
1) Add JavaScript/frontend validation to prevent entry of anything other than 1-, 2-, or 3-byte unicode characters,
2) Supplement that with a cleanup function in your models to strip the data of any 4-byte unicode characters (which would be your option 2 or 3); see the sketch below.
At the same time, it does look like your users are in fact using 4-byte characters. If there is a business case for using them in your application, you could go to the powers that be and request an upgrade.
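A minimal sketch of that model-level cleanup (option 2); the Person model and first_name field here are placeholders, and the pattern is the same BMP-only regex as in the middleware above:

import re
from django.db import models

# Matches anything outside the BMP, i.e. every 4-byte UTF-8 character.
FOUR_BYTE_RE = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

class Person(models.Model):
    first_name = models.CharField(max_length=30)

    def clean(self):
        # Called by ModelForm validation via full_clean(); replace any
        # 4-byte character before the value reaches MySQL.
        if self.first_name:
            self.first_name = FOUR_BYTE_RE.sub(u'\uFFFD', self.first_name)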

Related

MySQL - Table Data Import Wizard error in MacOS "Unhandled exception: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)"

I am unable to load any CSV file into MySQL. Using the Table Data Import Wizard, this error pops up every time I get to the 'Configure Import Settings' step:
"Unhandled exception: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)"
... even though the CSV is encoded as UTF-8 and that seems to be the default encoding setting for MySQL Workbench. Granted, I am not very skilled with computers; I have only a few weeks' exposure to MySQL. This has not always happened to me: I had no issues with this a couple of months ago while I was taking a database management course.
But, I think this is where my problem lies: at one point I tried to uninstall MySQL Workbench and Community Server and re-install, and ever since, this error happens every time I try to load data. I am even using a very basic test file that still won't load (all column types are set to 'Text' in Excel and saved as UTF-8 CSV).
I am using MySQL 8.0.28 on MacOS 11.5.2 (Big Sur)
Case 1, you wanted ï ("LATIN SMALL LETTER I WITH DIAERESIS"):
Character set ASCII is not adequate for the accented letters you have. You probably need latin1.
Case 2, the first 3 bytes of the file are (hex) EF BB BF:
That is a "BOM", a marker at the beginning of the file that indicates that it is encoded in UTF-8. But, apparently, the program reading it does not handle such a marker.
In some situations, you can remove the 3 bytes and proceed; in other situations, you need to read it using some UTF-8 setting.
Since you say "'Text' in Excel and saved as UTF-8 CSV", I suspect that it is case 2. But that only addresses the source (Excel), over which you may not have enough control to get rid of the BOM.
Since I don't know what app provides the "Table Data Import Wizard", I cannot address the destination side of the problem. Maybe the wizard has a setting of UTF-8 or utf8mb4 or utf8; any of those might work instead of "ascii".
Sorry, I don't have the full explanation, but maybe the clues "BOM" or "EFBBBF" will help you find a solution either in Excel or in the Wizard.
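If the BOM does turn out to be the problem and you can preprocess the file, here is a minimal Python sketch for stripping it (the file names are placeholders; the 'utf-8-sig' codec silently consumes a leading EF BB BF):

import io

# Read with 'utf-8-sig' so any leading BOM is consumed, then write plain UTF-8.
with io.open('test.csv', 'r', encoding='utf-8-sig') as fin:
    data = fin.read()
with io.open('test_nobom.csv', 'w', encoding='utf-8') as fout:
    fout.write(data)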
I was able to solve it by saving my Excel file to CSV using the "MS-DOS CSV" or "Macintosh CSV" format instead. After that, I was able to import my CSV through the Import Wizard without the bug.

Chinese character encoding error when moving database from AWS to Google Cloud SQL

We have a website which deals with Chinese characters and was hosted on AWS.
There I could save Chinese characters in the database without any problem.
Now we have moved to Google Cloud, and I am facing an issue saving Chinese characters in the database.
They display as ä¸€åœ°å…©æª¢
I am following all rules like "column should be utf8-unicode-ci" and "database connection as utf8".
It is working fine on localhost.
Any idea what the problem could be?
Thanks.
If the data (column) in the database holds (similar) UTF-8-encoded data in both cases, and the code/platform which handles the data in the web page is the same (meaning not Python 2 vs Python 3, for example), then the difference might be the current locale setting: either of the Google server (environment variables), the SQL client (UTF-8 settings), or the PHP settings.
Let's start with the SQL client:
Try running the PHP function
mysqli_character_set_name($link)
to get the current connection encoding. If it is not UTF-8, set it with
mysqli_set_charset($link, 'utf8')
If this is not working, also check the HTML side by setting the charset in the META tag to UTF-8
<meta charset="utf-8">
and enforce it in PHP with
declare(encoding='utf8')
Looks like you have latin1 somewhere in the processing.
ä¸€åœ°å…©æª¢ is "Mojibake" for 一地兩檢
See Mojibake in Trouble with UTF-8 characters; what I see is not what I stored
Some Chinese characters take 4 bytes, not just 3 bytes. So, I recommend you use utf8mb4, not simply utf8.
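A quick Python sketch (mine, not part of the answer) that reproduces the diagnosis; decoding UTF-8 bytes as cp1252/latin1 is exactly what produces the garbled text above:

# -*- coding: utf-8 -*-
original = u'一地兩檢'

# UTF-8 bytes misread as cp1252 give the classic Mojibake.
mojibake = original.encode('utf-8').decode('cp1252')
print(mojibake)   # ä¸€åœ°å…©æª¢

# The reverse round trip recovers the original characters.
print(mojibake.encode('cp1252').decode('utf-8'))   # 一地兩檢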

Character encoding issues with migrating from MSSQL to MySQL

We have an application called JIRA running on Windows using MSSQL and I need to migrate it to Linux/MySQL. The character encoding in the existing MSSQL db is latin1 but I need to use UTF-8 in MySQL.
I take an XML dump of the MSSQL data using a backup mechanism provided by the application, then run it through a Python filter to convert the encoding from latin1 to UTF-8. Here is the Python code that was provided to me by my colleague:
#!/usr/bin/python
import codecs, re

# Wide Python builds can match astral (4-byte UTF-8) characters directly;
# narrow builds represent them as surrogate pairs, hence the fallback.
try:
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

#fin = codecs.open('unicodestuff.txt', encoding='utf-8', errors='replace')
fin = codecs.open('entities.xml', encoding='latin1')
fout = codecs.open('stripped.xml', encoding='utf-8', mode='w', errors='replace')
for line in fin:
    line = highpoints.sub(u'', line)
    fout.write(line)
fin.close()
fout.close()
I take the filtered XML dump and, using a "restore" mechanism in the application, I restore the data. However, after restoring I spot-checked a few records on the MySQL side, and I see some weird characters which I assume are related to character encoding. For example,
On the MSSQL side, the text string is
““Number of debits exceeds maximum of 0”
“2-Restrict All Credits”
Default ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง
Branch : 724 มาบุญครอง
whereas on the MySQL side, the corresponding text appears as
â??â??Number of debits exceeds maximum of 0â?
â??2-Restrict All Creditsâ?
Default à¸à¸­à¸à¸à¸£à¸°à¹à¸ à¸à¸à¸±à¸à¸à¸µà¸à¸¹à¸à¸à¹à¸­à¸ à¹à¸à¹à¹à¸¥à¸à¸à¸±à¸à¸à¸µà¹à¸¡à¹à¸à¸¹à¸à¸à¹à¸­à¸
Branch : 724 มาà¸à¸¸à¸à¸à¸£à¸­à¸
Can you please give me some ideas for fixing these character encoding issues? Let me know if additional information is required.
Thanks,
Sam
Clearly your XML file does not actually use the Latin-1 character set. You've shown that text such as "ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง" is present in it. The Latin-1 character set does what it says on the label: it represents letters from Latin alphabets, and the Thai letters you've shown do not exist in it. If the headers in your XML file claim that it's in Latin-1, then those headers are untrue and the XML is, strictly speaking, not valid. But it might still be usable.
Now the problem is, what character encoding is that XML file actually using? To find out, you may have to examine the XML file in hexadecimal. There are three main possibilities: (1) it's using an old codepage such as 874 which contains these characters; (2) it's using UTF-16; (3) it's using UTF-8.
If you examine in hexadecimal a section of the XML which contains some of this non-latin text, and some of the latin letters nearby, here's what you might see. If it's in a codepage such as 874, each latin letter will be one byte with a value from 32 to 7F, and each nonlatin letter will be one (or possibly two?) bytes with values of 80 to FF. If it's in UTF-16, each latin letter will be two bytes, one from 32 to 7F and the other being always 00, and the nonlatin letters will be two bytes with neither being 00. If it's in UTF-8, the latin letters will be one byte from 32 to 7F, and the nonlatin letters will be (probably) three bytes, all being from 80 to FF.
There may be an alternative to examining hexadecimal. Some text editor programs can save text files in your choice of encoding formats. TextPad 7, for instance, can save as ANSI, DOS, UTF-8, Unicode, or Unicode (big-endian). The latter two options are actually UTF-16. Try loading the XML into such a program, and saving copies of it as UTF-8 and as Unicode. One of these copies should be the same size as the original (plus or minus two or three bytes), and the other will be a different size. Whichever matches the size is probably the correct format. If both differ, then you've got something weird.
Anyway, if you save a version as UTF-8 and then are able to open it and see your data intact, you should then be able to import that without using a Python translator.
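As an alternative to examining hexadecimal by hand, here is a rough Python sketch of the three-way test described above (the file name is a placeholder and the NUL-byte threshold is an arbitrary heuristic):

# Guess between a legacy codepage (e.g. 874), UTF-16, and UTF-8.
with open('entities.xml', 'rb') as f:
    data = f.read()

if data[:2] in (b'\xff\xfe', b'\xfe\xff'):
    print('UTF-16 (BOM present)')
elif data.count(b'\x00') > len(data) // 4:
    print('Lots of NUL bytes: probably UTF-16 without a BOM')
else:
    try:
        data.decode('utf-8')
        print('Decodes cleanly as UTF-8')
    except UnicodeDecodeError:
        print('Not valid UTF-8: likely a legacy codepage such as 874')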

Excel CSV String Not Fully Uploading To MySQL

I have this string in Excel (I've UTF-encoded it). When I save as CSV and import to MySQL, I get only the result below. I know it's probably a charset issue, but could you explain why, as I'm having difficulty understanding it?
In Excel Cell:
PARTY HARD PAYDAY SPECIAL â UPTO £40 OFF EVENT PACKAGES INCLUDING HOTTEST EVENTS! MUST END SUNDAY! http://bit.ly/1Gzrw9H
Ends up in DB:
PARTY HARD PAYDAY SPECIAL
The field is defined as VARCHAR(10000) with utf8_general_ci collation.
MySQL's utf8 does not support the full range of Unicode. There are some 4-byte characters that cannot be processed and, I guess, stored properly in regular utf8. I am assuming that upon import it is truncating the value after SPECIAL, since MySQL does not know how to process or store the character in the string that comes after that.
In order to handle full UTF-8 with 4-byte characters, you will have to switch over to utf8mb4.
This is from the MySQL documentation:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character and supports supplementary characters...
You can read more at dev.mysql.com.
Also, here is a great detailed explanation of regular-utf8 issues in MySQL and how to switch to utf8mb4.
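A sketch of that switch using MySQLdb, as in the first question; the connection parameters and table name are placeholders, and it assumes a server new enough (5.5+) to know utf8mb4:

import MySQLdb

# Connect with the 4-byte charset so the session itself can carry the data.
conn = MySQLdb.connect(host='localhost', user='app', passwd='secret',
                       db='mydb', charset='utf8mb4')
cur = conn.cursor()

# Convert the table and its text columns to utf8mb4.
cur.execute("ALTER TABLE messages CONVERT TO CHARACTER SET utf8mb4"
            " COLLATE utf8mb4_general_ci")
conn.commit()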

Deciphering MySQL Encoding

I'm having an issue with encoding in MySQL, and I need some help in figuring out what's going on.
First, some parameters. The default encoding of the table is utf8. The character_set_client, character_set_connection, collation_connection, and character_set_server MySQL system variables, though, are all latin1.
I ssh into my MySQL server and connect to the local server using the local command-line client. I select a record/column, and let's say the character in the string that's returned comes back as A, which is correct. A is represented in UTF-8 by the hex bytes "C5 9F."
However, the PHP app that hits the server interprets it as XY. In the MySQL command-line client, if I send the command "SET NAMES utf8", it will also now display it as XY.
If I do a SELECT INTO OUTFILE and use hexedit to inspect the file, I see two hex bytes that map to X, then two hex bytes that map to Y ("C3 85" for X and "C5 B8" for Y). Basically, it's taking the two byte values and displaying each as a UTF-8 character.
First and foremost, it looks like the database is indeed storing things as UTF-8, but the wrong kind of UTF-8, correct? Are the characters going in as raw Unicode, but somehow, maybe because of the system variables, not being translated to UTF-8?
Second, how/why is the MySQL command line client correctly interpreting XY as A?
Finally, regarding the successful interpretation by the MySQL command-line client: is there a chart that shows how C3 85 C5 B8 gets converted to A, or how XY gets converted to A?
Thanks a bunch for any insight.
Your question is kind of confusing, so I'll explain with an example of my own:
You connect to the database without issuing SET NAMES, so the connection is set to Latin-1. That means the database expects any communication between you and it to be encoded in Latin-1.
You send the bytes C3A2 to the database, which you want to mean "â" in the UTF-8 encoding.
The database, expecting Latin-1, is interpreting this as the characters "Ã¢" (C3 and A2 in the Latin-1 encoding).
The database will store these two characters internally in whatever encoding the table is set to.
You connect to the database in a different fashion, running SET NAMES UTF-8. The database now expects to talk to you in UTF-8.
You query the data stored in the database, and you receive the characters "Ã¢" encoded in UTF-8 as C383 C2A2, because you told the database to store the characters "Ã¢" and you are now querying them over a UTF-8 connection.
If you connected to the database again using Latin-1 for the connection, the database would give you the characters "Ã¢" encoded in Latin-1, which are the bytes C3 A2. If the client that you used to connect is interpreting that in Latin-1, you'll see the characters "Ã¢". If the client is interpreting that as UTF-8, you'll see the character "â".
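A small Python 3 sketch (not from the original answer) that walks through exactly this round trip:

# The client means "â" and sends its UTF-8 bytes over a Latin-1 connection.
sent = 'â'.encode('utf-8')              # b'\xc3\xa2'

# The database reads those bytes as Latin-1: the two characters 'Ã' and '¢'.
stored = sent.decode('latin-1')          # 'Ã¢'

# Retrieved over a UTF-8 connection, each character is re-encoded in UTF-8.
over_utf8 = stored.encode('utf-8')       # b'\xc3\x83\xc2\xa2' (C383 C2A2)

# Retrieved over a Latin-1 connection, the original bytes come back intact,
# and a UTF-8-aware client finally shows "â" again.
over_latin1 = stored.encode('latin-1')   # b'\xc3\xa2'
print(over_latin1.decode('utf-8'))       # â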
Essentially these are the points at which something can screw up:
the database will interpret any bytes it receives as characters in whatever encoding is set for the connection and convert the encoding of these characters to match the table they're supposed to be stored in
the database will convert the encoding of any characters from the encoding they're stored in into the encoding of the connection when retrieving data
the client may or may not interpret the bytes it receives from the database into the right characters to display on screen; command-line environments especially aren't always set up to correctly display UTF-8 data
Hope that helps.