How to insert a string with special characters into MySQL?

I am going crazy. Please help.
I have a form with a text input. The value is inserted into a MySQL text column. Sometimes a user will enter some unknown character that breaks the insert.
For example, one that I just found is Topkapı, which is a town in Turkey. You will notice the last character, ı. On insert, this causes a database error:
Error Executing Database Query. Incorrect string value: '\xC4\xB1 and...' for column 'country_description' at row 1
Is there a simple method to either remove these characters or escape them? I am using cfqueryparam and have tried HTMLEditFormat, cfsavecontent, etc. to no avail.

EncodeForHTML() does not fix this particular issue if you are actually inserting HTML from TinyMCE for example.
What fixed this was changing the Collation to utf8mb4. You can do this in Workbench by expanding the header. It's collapsed by default.
Backup your table.
Go to "Alter Table".
Click the arrows on the top right of the window.
Select utf8mb4 from the Collation dropdown.
Click "Apply"
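If you'd rather do the same thing in SQL than through the Workbench dialog, it is a single statement; the table name below is just a placeholder for whichever table holds country_description:
-- Re-encodes every character column in the table and changes the table default,
-- so input like "Topkapı" (ı = U+0131, bytes C4 B1) stores cleanly.
ALTER TABLE country CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;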

Here are your options in my opinion:
If you're using ColdFusion 10 or above, try using EncodeForHTML()
Validate your UI to accept only US and UK English characters, numbers etc.
Change the column data type in MySQL to VARCHAR(n) CHARSET utf8.
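For that third option, the column-level statement could look roughly like this; the table name and length are assumptions based on the error message above, and utf8mb4 is used instead of plain utf8 since it also covers 4-byte characters such as emoji:
-- Change only the affected column's character set.
ALTER TABLE country MODIFY country_description VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;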
Hope this helps.

Related

Incorrect string value: '\xE2\x80\xAF(fo...' for column 'description' at row 1 Error: INSERT INTO my_table_name

Sometimes, when text is copy-pasted from a third-party website into the textarea of my form-based application, the data doesn't get inserted into the database; instead, it throws the error below.
Incorrect string value: '\xE2\x80\xAF(fo...' for column 'my_column_name' at row 1 Error: INSERT INTO my_table_name
I tried the query below in MySQL Workbench to solve this issue.
ALTER TABLE my_database_name.my_table CONVERT TO CHARACTER SET utf8
But I am getting the error below from the database.
Error Code: 1118. Row size too large. The maximum row size for the used table type, not counting BLOBs, is 65535. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs
That 65,535-byte limit is MySQL's maximum row size (BLOB/TEXT columns don't count toward it). Converting a VARCHAR column to utf8 means up to three bytes per character instead of one, which is what pushes the row over the limit. You need to change the largest column(s) to TEXT or BLOB.
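A sketch of how that could look, using the table/column names from the error messages above as placeholders for your real schema (and sticking with the plain utf8 target you already tried; utf8mb4 would be the more future-proof choice):
-- Move the oversized column out of the row-size calculation first...
ALTER TABLE my_database_name.my_table MODIFY my_column_name TEXT;
-- ...then the character set conversion should fit within the 65,535-byte row limit.
ALTER TABLE my_database_name.my_table CONVERT TO CHARACTER SET utf8;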
One more thing: when copying content from a website or a Word document, paste it into a plain text editor first and check whether the expected content was copied.
In code you can strip the offending character before inserting. The example below is PHP:
// Strip the three-byte UTF-8 sequence for NARROW NO-BREAK SPACE (match the whole sequence, not [\xE2\x80\xAF] as a character class, or other multi-byte characters get mangled).
$content = preg_replace('/\xE2\x80\xAF/', '', $content);
Don't use whitespace in names: hex E2 80 AF is UTF-8 for NARROW NO-BREAK SPACE.
I worry that doing ALTER TABLE my_database_name.my_table CONVERT TO CHARACTER SET utf8 without first diagnosing the problem has only made things worse.
You were probably using latin1 before? If you had any other non-English text in the database, it may (or may not) now be messed up.
We may be able to fix the mess, but we need to know more details about what you originally had and what steps led to this.
Also, what language(s) do you expect your customers to be using?
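If you want to supply those details (and check whether the existing rows are latin1, UTF-8, or double-encoded), one way is to look at the raw bytes MySQL is storing; the names below come from the error messages above, so adjust them to your schema:
-- Compare the text with its stored bytes to see what encoding is really in the column.
SELECT my_column_name, HEX(my_column_name) FROM my_database_name.my_table LIMIT 10;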

Invalid unicode character causing MySQL string error

I need to add a record to our MySQL database (via Omeka) that includes an invalid Unicode character (this one).
The error message I get via Omeka is:
Mysqli statement execute error : Incorrect string value: '\xF0\xAA\xA8\xA7\xE7\x94...' for column 'text' at row 1
The database field is longtext with collation utf8_unicode_ci. There are already a lot of records in this table and I'm not quite sure what I should change without affecting the other data already in it. Suggestions?
ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
Meanwhile, the text for that row in that column is probably truncated or the whole row is missing.
As best as I can tell, F0 AA A8 A7 decodes to Unicode codepoint U+2AA27, which is in the area of CJK (Chinese) characters rather than Emoji; either way, such characters need utf8mb4.
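If you want to double-check which codepoint those bytes represent, MySQL itself can decode them (assuming a server that supports utf8mb4 and utf32):
-- Reinterpret the raw bytes as utf8mb4, then re-encode as utf32 so HEX() shows the codepoint.
SELECT HEX(CONVERT(CONVERT(0xF0AAA8A7 USING utf8mb4) USING utf32));  -- 0002AA27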

Character Encoding error when copying double quotes from word or other source

I am using JSP servlets and have a MySQL database. I have an input field, "Introduction". The problem is that when a user copy-pastes a paragraph from Word, the " (double quote) character is stored as ? in my table, but only when the character is copied from Word or some other such source. Also, if a user copies two paragraphs with space between them, a buggy character enters my SQL table and the JS that tries to load the introduction in my JSP page fails. I have also attached a screenshot of this. Please help me resolve this.
Microsoft, in its infinite wisdom, decided to have non-standard double quotes -- a left version and a right version. But that should be fixable, since those quotes do exist somewhere in the huge world of utf8 characters.
However, the data from your 'copy' was probably not copied in utf8 encoding. Since it is unclear how that is being done, we can't give you complete details on fixing it.
The "best" plan is to establish "utf8" at all stages of data/client/server/database/table/column/etc.
The quick-and-dirty fix is to replace the funny quotes with ascii quotes.
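A sketch of that quick-and-dirty route done on the MySQL side; the table and column names here are made up, and it only helps for rows where the curly quotes actually made it into the column rather than being flattened to '?':
-- Talk utf8mb4 on this session so the literals below are interpreted correctly.
SET NAMES utf8mb4;
-- Swap Word's left/right double quotes (U+201C / U+201D) for plain ASCII quotes.
UPDATE articles SET introduction = REPLACE(REPLACE(introduction, '“', '"'), '”', '"');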

Navicat utf8 not working on mysql database

I'm currently trying to merge data changes between two out-of-sync ExpressionEngine databases. For this I have opted to use Navicat.
The website makes heavy use of the greek character set in templates. When I view greek table field data in phpmyadmin, I see the characters in greek as expected. When I load them up into Navicat, I only see "???" question marks in their place. When I try to sync data between the databases, the result is that question marks are put in the place of greek characters.
The field types in question are "text" using "utf8_general_ci".
What am I doing wrong?
I had the same problem with Navicat on Linux. The problem was solved by following these steps:
1. In Navicat, open your table in design view.
2. Under the Fields tab, select the fields that need to display Unicode. Change the Character set and Collation to utf8 and utf8_general_ci.
3. Under the Options tab, change the Character set and Collation to utf8--UTF-8 Unicode and utf8_general_ci.
4. Save.
5. Select Tools > Options. Under Font, change the "Editor Font" to e.g. AR PL ZenKai Uni.
6. Re-start Navicat.
7. Select the correct "Editor Font" in Options until it shows your data correctly (repeat steps 5 and 6 as needed).
Source: http://wiki.navicat.com/wiki/index.php/How_can_I_display_Unicode_in_Linux%3F
Ran into this issue and saw there is no proper answer.
The solution is:
In Connection Properties... for your MySQL database, click on the Advanced tab and check (tick if you're from the UK) the box for "Use MySQL character set", and then your tables will display correctly.
In the connection's properties, under the Advanced tab, set Encoding to Auto. This is for Navicat 15.0.22.
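Whatever client you use, you can confirm what character set the connection actually negotiated by running this query from that connection; if character_set_client/connection/results show latin1 while the data is Greek utf8, you get exactly these '?' substitutions:
-- Session character set and collation variables as MySQL sees them.
SHOW VARIABLES LIKE 'character_set%';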

Where did I go wrong with this unicode field in MySQL?

I have a table with a field which contains strings in my MySQL database.
The MySQL version is 5.0.51a. The default character set for the table is 'utf8'.
Many of the strings have Unicode characters such as \xae and \u2122 (the registered symbol and trademark symbol, respectively).
For example, suppose I have a row with a field containing this value:
"Bing® Blang™ Blaow"
The default character set of my mysql command line client is "latin1".
If I issue a SELECT statement in the mysql client program from the command line without specifying a character set, the output of the title shows up like so:
"Bing® Blang Blaow"
The (R) symbol is correct but the (TM) symbol is missing. If I cut and paste this string from the console into TextMate, the (TM) symbol appears, but is half-way behind the g in the word "Blang".
I am assuming that the half-way-behind-the-g thing is a just a display error in TextMate (though if anyone can provide further detail that'd be great, but that's not really the important part).
The main thing I am inferring from the its-there-after-you-cut-and-paste behavior is that the data is in the database but there's something wrong with some sort of character set setting somewhere.
If I override the default encoding of the mysql client on the command line like so:
mysql --default-character-set=utf8
Then do the same select, the string comes out as:
"Bing® Blang™ Blaow"
which is to say that both the (R) and (TM) symbols appear and are in the right place, but both are preceded by the character \xc2, which is an A with a circumflex on top (Â).
(Incidentally this is also how the data is displayed when I pull it out using python and display it on a web page, which is what my real problem is).
Anyway, what is going on here? Everything we have done recently has used UTF8 everywhere possible, but it's possible that some of these rows were inserted prior to that change which means they would've been using the latin1 default... however neither encoding seems to produce the right result?
If the rows were inserted when the table's default encoding was latin1, before it was switched to utf8 (via ALTER TABLE), would the encoding of the existing data actually have been updated? Should one of the encodings work now? Will unicode ever stop kicking my ass?
There are quite a number of issues here:
About the characters
You indicate that the text has characters U+AE and U+2122 (® and ™ respectively). However, the results imply that the text actually has U+99 as the character after "Blang": when you set MySQL to output UTF8, you see "Â™" -- the UTF8 sequence for U+99 displayed on a terminal that is interpreting this byte stream as Windows-1252.
U+99 probably isn't what you wanted: in Unicode, that is an extended control character with no graphic representation. It just so happens that in Windows-1252, 0x99 is the encoding of the trademark symbol (U+2122).
(Please note that both MySQL and most web browsers have a common, "broken" behavior of using Windows-1252 when you choose Latin1. Sigh.)
What's probably wrong
Your terminal isn't operating in the right character set. It is clearly operating in Windows-1252.
Programs should be connecting to the database in UTF-8. You can do that on the command line, as you've found, or by executing the statement SET NAMES utf8 on your database handle before doing anything else. Some other database APIs may have other ways of doing this, but there is no generic way for all SQL engines. SET NAMES ... is specific to MySQL, but it sets all the required character set variables (there are three!) at once.
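For reference, this is roughly what SET NAMES expands to on the server (documented MySQL behaviour; utf8 could equally be utf8mb4 on newer servers):
SET NAMES utf8;
-- is roughly equivalent to:
SET character_set_client = utf8;
SET character_set_connection = utf8;
SET character_set_results = utf8;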
The process that is inserting data into the database is taking user input and not correctly converting it from Windows-1252 into UTF-8 before inserting. This is how you got a U+99 into your database. Since I don't know how you are getting that data, I'm not sure what to fix, but here are several possibilities:
If the data comes from a web page form, be sure the page with the form is served in UTF-8, is properly marked as such (via the MIME Type, and the <meta> tag.) Be sure also, that the <form> tag is not specifying a different character set.
When converting the data, be sure that you use iconv or similar libraries to convert from the input character set to UTF-8. Even if you think the input is Latin1, do not try to do this by hand (for example, by zero-expanding every byte to 16 bits and then claiming it is UTF-16 - that won't work for Windows-1252!). Make absolutely certain that you know the character set of the source data. In particular, be sure to know whether it is Latin1 or Windows-1252.
Instead of converting the user input, you could connect to the database in the character set of the user input, and then just insert the raw byte data you get from the user. However, you must be sure to only do insertions this way: reading data back from the database with the user's character set in effect will lose information if other rows have data that can't be represented in that character set. It is possible to set up a MySQL connection so that you issue statements in one character set and read results back in another... But it isn't for the faint of heart, and future programmers will likely go nuts trying to understand why the code does this.
If, when you pull the data out with Python and display it in a web page, you see the string "Â™", then that is an indication that you are pulling the data out of the database correctly as UTF-8, but then putting it into a web page that is not correctly identified as UTF-8. Probably it is just defaulting to Latin1, which as noted above will really be Windows-1252.
Nonetheless, even if you fix the display, note that the data base has bad data in it, since U+99 isn't really the trademark symbol in a UTF-8 column. You'll need to clean up your data, by reading all the data, and replacing any characters in the range of U+80 through U+9F with what they were likely to have been, assuming the data was really Windows-1252. If you're not certain what character set the data was in originally -- then this data is, alas, just junk.
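A minimal sketch of that cleanup for one known-bad character, done directly in SQL; the table and column names are placeholders, and it assumes the stray character really is U+0099 left over from Windows-1252's 0x99:
-- Replace U+0099 (stored as C2 99 in UTF-8) with the real trademark sign U+2122 (E2 84 A2).
-- REPLACE() is a no-op for rows that don't contain the character.
UPDATE my_table SET title = REPLACE(title, CONVERT(0xC299 USING utf8), CONVERT(0xE284A2 USING utf8));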
About changing character sets of tables
Converting the character set and collation of the table after inserting data will convert the columns, but, of course, any data already inserted will have already lost whatever characters the original character set couldn't represent.
Be careful to note the difference between ALTER TABLE foo CONVERT TO CHARACTER SET ... and ALTER TABLE foo CHARACTER SET ... The latter only changes the default character set for the table, and will not change any columns, even if they were set to the default at creation. (MySQL only uses the defaults at column creation time; it doesn't remember that a given column is "defaulted", nor does it keep it in sync with the table's default.)
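To make that contrast concrete (foo is just the placeholder table name from above):
-- Re-encodes the existing character columns AND changes the table default:
ALTER TABLE foo CONVERT TO CHARACTER SET utf8mb4;
-- Only changes the default applied to columns added later; existing columns keep their character set:
ALTER TABLE foo CHARACTER SET utf8mb4;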
I think it has to do with the settings of the MySQL connection in your Python code.
Try setting conn.character_set_name or something like that; it depends on the MySQL connection lib you are using.
In the case of MySQLdb it should be something like this:
import new  # Python 2 stdlib module used for the monkey-patch below
# Make MySQLdb report/decode the connection character set as UTF-8.
def character_set_name(*args, **kwargs): return 'utf-8'
conn.character_set_name = new.instancemethod(character_set_name, conn, conn.__class__)
Could it be that some of the columns have an explicitly different character set than the table default?
something like this...?
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci