MySQL does not identify character '?' on select - mysql

I got a table in MySQL with the following columns:
id name email address borningDate
I have a form in a HTML page that submits this data to a servlet, responsible for saving it at the database. Due to charset issues (already fixed), I saved a row like this, when trying to store letters with accents:
19 ? ? ? 2015-03-01
and now I want to delete this row.
Yeah, doing this:
DELETE FROM table WHERE id=19;
works nice. My didatic question is: why, if I try something like this:
DELETE FROM table WHERE name='?';
it returns 0 rows affected, like if it can't see ? as a valid character?

Try doing
SELECT id, HEX(name), HEX(email), HEX(address), borningDate FROM table
This will tell you what's actually in the database. It probably isn't actually ASCII question marks. The question marks are probably substitution characters applied when MySQL tries to convert the column's character set to the connection's character set.
To manage this more specifically, do SHOW CREATE TABLE table and look for the character set being used for the text columns. This probably shows up at the end of the table definition as something like DEFAULT CHARSET utf8 or some such thing. But it might be specified in the column definition.
Once you know the character set, issue the command SET NAMES charset, for example, SET NAMES utf8. Then reissue your commands and see if you get better results than the ? substitution character. That assumes, of course, that the client program you are using can handle the character set mentioned.

Related

Losing data on converting MySQL latin1_swedish_ci to utf8_unicode_ci

When I try to convert data from latin1_swedish_ci to utf8_unicode_ci I loose data ! The TEXT column is cut at the first special character.
For example:
Becomes:
Yet I tried many ways to convert my column and all solutions end up deleting data at the first special character!
I tried by phpMyAdmin or with this SQL request:
UPDATE `page` SET page_text = CONVERT(cast(CONVERT(page_text USING latin1) AS BINARY) USING utf8);
I also tried the php script :
https://github.com/nicjansma/mysql-convert-latin1-to-utf8/blob/master/mysql-convert-latin1-to-utf8.php
With all the time the same result, data are lost at first special character!
What should I do?
UPDATE
I could change the data to utf8 with
ALTER TABLE page CONVERT TO CHARACTER SET utf8mb4;
or
ALTER TABLE page CONVERT TO CHARACTER SET utf8;
without loosing data but it does not display properly special characters.
Using the php function utf8_encode($myvar); does display correctly special characters.
To convert a table, use
ALTER TABLE ... CONVERT TO ...
Or, to change individually columns, use
ALTER TABLE ... MODIFY COLUMN ...
Instead, you seem to have done something different. For further analysis, please provide SELECT col, HEX(col) ... before and after the conversion, plus the conversion used.
See "truncated" in this . The proper fix is found here, but depends on what you see from the HEX.

Implement search in codeigniter

Everyone know how to implement search in page with codeiginter.
One issue is that my keyword, for example "Șaweqă" has diacritisc. How to search into database table with this keyword.
Simple $this->db->like($keyword); return database error.
My database character encoding is set to utf8-general-ci and the table where i do search has rows with utf8_romanian_ci encoding.
The output of SHOW CREATE TABLE should look one of these ways:
the_column ... CHARACTER SET utf8, -- or utf8mb4
or, if not specified on the_column, then
CREATE TABLE ... (...) DEFAULT CHARACTER SET utf8; -- or utf8mb4
To check that Șaweqă was stored (via INSERT, etc) correctly as utf8, do something like SELECT HEX(col).... You should get C898 61 77 65 71 C483 for Șaweqă. (I added spaces to distinguish the characters.)
To check that it is fetched (via SELECT, etc) correctly, look at the output.
Try allowing special character in the CI config file :
$config['permitted_uri_chars'] = 'a-z 0-9~%.:#_\-ä';
And if you are using GET method, try to urldecode() the keyword :
$keyword = urldecode($keyword);

MySQL invalid String error for specific Special Characters but not for others?

I have a simple mySQL database with a single table and row. The row is set up like so:
utf8mb4_general_ci - varchar(255)
When I manually insert the following characters, it works fine:
☎ , इ ई , ∑
But when I attempt to insert characters like these:
𝓝 or 😫
It doesn't work and gives me an Invalid String error. I found information on the first character here.
It states that the character belongs to the following block:
MATHEMATICAL_ALPHANUMERIC_SYMBOLS
How can I detect characters like these to prevent them from being submitted by users. Is there a simple regex I can use?
𝓝 and 😫 are in the range of utf8mb4. So, you need to establish the connection from your client to the mysqld server as being utf8mb4. Since you have not said what your client is, I can't give you a specific answer. http://mysql.rjweb.org/doc.php/charcoll#python and the following sections may cover your client. Be sure to use utf8mb4 instead of utf8, but leave UTF-8 alone.

mysql - funny square characters added to the value when inserting it into table

I have a php script that inserts values into mySQL table
INSERT INTO stories (title) VALUES('$_REQUEST[title]);
I checked the values of my request variables before going into the table and it's fine.
But when I add title=john to the table for example,
I get something like this:
title = "[][][][]john"
and when I extract the value, it's a newline then john.
I have my columns set to utf-8, I tried swedish character set as well.
Note: I don't get this error when inserting values from the phpMyAdmin commandline
You need {} around any array notation when used inside "".
$q="INSERT INTO stories(title) VALUES('{$_REQUEST['title']}')";
BTW, it would be better, when checking your $_REQUEST vars to store the sanitized versions in new variables, and to be sure to escape them with real_escape_string()
SET NAMES <encoding> query must be executed every time you connect to your database.
very simple rule.
where <encoding> is your HTML page encoding in mysql dialect (utf8 for the utf-8)
You need to check the character set of the database, the server, and the client.
Note that it's not a swedish character set, it's a swedish collation.

Where did I go wrong with this unicode field in MySQL?

I have a table with a field which contains strings in my MySQL database.
The MySQL version is 5.0.51a. The default character set for the table is 'utf8'.
Many of the strings have unicode characters such as \xae and \u21222 (registered symbol and trademark symbol respectively).
For example, suppose I have a row with a field this value:
"Bing® Blang™ Blaow"
The default character set of my mysql command line client is "latin1".
If I issue a SELECT statement in the mysql client program from the command line without specifying a character set, the output of the title shows up like so:
"Bing® Blang Blaow"
The (R) symbol is correct but the (TM) symbol is missing. If I cut and paste this string from the console into TextMate, the (TM) symbol appears, but is half-way behind the g in the word "Blang".
I am assuming that the half-way-behind-the-g thing is a just a display error in TextMate (though if anyone can provide further detail that'd be great, but that's not really the important part).
The main thing I am inferring from the its-there-after-you-cut-and-paste behavior is that the data is in the database but there's something wrong with some sort of character set setting somewhere.
If I override the default encoding of the mysql client on the command line like so:
mysql --default-character-set=utf8
Then do the same select, the string comes out as:
"Bing® Blang™ Blaow"
which is to say that both the (R) and (TM) symbols appear and are in the right place but both are preceded by the unicode character \xae which is an A with a circumflex on top.
(Incidentally this is also how the data is displayed when I pull it out using python and display it on a web page, which is what my real problem is).
Anyway, what is going on here? Everything we have done recently has used UTF8 everywhere possible, but it's possible that some of these rows were inserted prior to that change which means they would've been using the latin1 default... however neither encoding seems to produce the right result?
If the rows were inserted when the default encoding on the table was latin1 before it was switched to utf8, then the encoding was switched (via alter table..) then would the encoding have actually been updated? Should one of the encodings work now? Will unicode ever stop kicking my ass?
There are quite a number of issues here:
About the characters
You indicate that the text has characters U+AE and U+2122 (® and ™ respectively). However, the results imply that the text has U+99 as the character after "Blang": When you set MySQL to output UTF8, then you see this "™" -- which is the UTF8 sequence for U+99 displayed on a terminal that is interpreting this byte stream as Windows-1252.
U+99 probably isn't what you wanted: In Unicode, that is an extended control character with no graphic representation. It just so happens that in Windows-1252, that 0x99 is the encoding of the trademark symbol (U+2122).
(Please note that both MySQL and most web browsers have a common, "broken" behavior of using Windows-1252 when you choose Latin1. Sigh.)
What's probably wrong
Your terminal isn't operating in the right character set. It is clearly operating in Windows-1252.
Programs should be connecting to the database in UTF-8. You can do that in the command line, as you've found, or by executing the statement SET NAMES utf8_general_ci; in your database handle before doing anything else. Some other database APIs may have other ways of doing this, but there is no generic way for all SQL engines. SET NAMES ... is specific to MySQL, but sets all the required character set variables (there are three!) at once.
The process that is inserting data into the database is taking user input and not correctly converting it from Windows-1252 into UTF-8 before inserting. This is how you got a U+99 into your database. Since I don't know how you are getting that data, I'm not sure what to fix, but here are several possibilities:
If the data comes from a web page form, be sure the page with the form is served in UTF-8, is properly marked as such (via the MIME Type, and the <meta> tag.) Be sure also, that the <form> tag is not specifying a different character set.
When converting the data, be sure that you use iconv or similar libraries to convert from the input character set to UTF-8. Even if you think the input is Latin1, do not try to do this by hand (for example, by zero expanding every byte to 16-bits then claiming this is UTF-16 - that won't work for Windwos-1252!). Make absolutely certain that you know the character set of the source data. In particular, be sure to know if it is Latin1 or Windows-1252.
Instead of converting the user input, you could connect to the database in character set of the user input, and then just insert the raw byte data you get from the user. However, you must be sure to only do insertions this way: reading back data from the data with the user's character set in effect will lose information if other rows have data that can't be represented in that character set. It is possible to set up a MySQL connection so that you issue statements in one character set and read results back in another... But it isn't for the faint of heart, and future programmers will likely go nuts trying to understand why the code does this.
If, when you pull the data out with Python and display it in a web page, you see the string "™", then that is indication that your are pulling the data out of the database correctly as UTF-8, but then putting it into a web page that is not correctly identified as UTF-8. Probably it is just defaulting to Latin1, which as noted above will really be Windows-1252.
Nonetheless, even if you fix the display, note that the data base has bad data in it, since U+99 isn't really the trademark symbol in a UTF-8 column. You'll need to clean up your data, by reading all the data, and replacing any characters in the range of U+80 through U+9F with what they were likely to have been, assuming the data was really Windows-1252. If you're not certain what character set the data was in originally -- then this data is, alas, just junk.
About changing character sets of tables
Converting the character set and collation of the table after inserting data will convert the columns, but, of course, any data already inserted will have already lost whatever characters the original character set couldn't represent.
Be careful to note the difference between ALTER TABLE foo CONVERT TO CHARACTER SET ... and ALTER TABLE foo CHARACTER SET ... The later only changes the default character set for the table, and will not change any columns, even if they were set to the default at creation. (MySQL only uses the defaults at column creation time, it doesn't remember that a given column is "defaulted" not does it keep it in sync with the table's default.)
I think it has to do with the settings of the mysql connection in your Python code.
try setting conn.character_set_name or something like that, depends on the mysql connection lib you are using.
in case of MySQLdb it should be smthng like this:
def character_set_name(*args, **kwargs): return 'utf-8'
conn.character_set_name = new.instancemethod(character_set_name, conn, conn.__class__)
Could it be that some of the columns have an explicitly different character set than the table default?
something like this...?
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci