MySQL common ASCII character collation - mysql

I get the following error whenever I try to run a particular query:
COLLATION 'latin1_swedish_ci' is not valid for CHARACTER SET 'utf8'
The problem in my case is that I need this query to run, without modification, against two separate databases that use different character sets (one is latin1, the other is utf8).
Since the strings I am trying to match are guaranteed to be basic letters (a-z), I was wondering if there was any way to force the comparison to work irrespective of the specific encoding?
I mean, an A is an A no matter how it is encoded. Is there some way to tell MySQL to compare the content of the string as letters rather than as whatever binary thing it does internally? I don't even understand why it can't auto-convert collations, since it is quite capable of doing it when explicitly told to.

Related

run mysql without collation (utf-8 only)

I run a sqlite3 database with utf8-strings from many languages. For various reasons I want to move to mysql, but I constantly run into trouble because of the mysql-collation feature.
One problem is that I am not even able to reliably know what is in my database. (For example I get "?" for non-latin characters and "�" for latin-based characters like öé, etc. - but I have absolutely no idea whether the problem lies in the import from sqlite3 to mysql or in reading from the mysql-database.)
Is there a way to get rid of this "feature" and let mysql do what I tell it without trying to be smart? I use UTF-8 everywhere and I never need any mangling of strings: Input is always UTF-8 and output should be always UTF-8. Also I really would like to know what really is stored in the database - i.e. without a collation-feature corrupting the data during readout.
You could use the MySQL VARBINARY column type, which stores a sequence of arbitrary bytes without interpreting them in any particular charset (or maybe VARCHAR BINARY, which is subtly different).
MySQL (at least in versions before 8.0) uses latin1_swedish_ci unless you explicitly specify something different. That's the opposite of smart. You have to be smart and change that default. This can be done with e.g. the --character-set-server and --collation-server command line options. See Specifying Character Sets and Collations for other means and further options.

Accent insensitive search query in MySQL

Is there any way to make search query accent insensitive?
The column's and table's collations are utf8_polish_ci and I don't want to change them.
example word : toruń
select * from pages where title like '%torun%'
It doesn't find "toruń". How can I do that?
You can change the collation at runtime in the sql query,
...where title like '%torun%' collate utf8_unicode_ci
but beware that changing the collation on the fly at runtime forgoes the possibility of mysql using an index, so performance on large tables may be terrible.
Or, you can copy the column to another column, such as searchable_title, but change the collation on it. It's actually common to do this type of stuff, where you copy data but have it in some slightly different form that's optimized for some specific workload/purpose. You can use triggers as a nice way to keep the duplicated columns in sync. This method has the potential to perform well, if indexed.
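One way to fill such a searchable_title column from application code is to strip the accents before writing the copy. The following is only a sketch of that idea in Python, not something MySQL does for you: fold_accents is a hypothetical helper name, and it relies on Unicode decomposition rather than on any MySQL collation.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Hypothetical helper: remove combining accent marks from text."""
    # NFD splits a character such as 'ń' into 'n' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep the base characters and drop the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("toruń"))  # torun
```

Letters that have no Unicode decomposition, such as the Polish ł, pass through unchanged, so they would need an explicit replacement if they should match a plain l.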
Note - Make sure that your db really has those characters and not html entities.
Also, the character set of your connection matters. The above assumes it's set to utf8, for example by running set names utf8
If not, you need an introducer for the literal value
...where title like _utf8'%torun%' collate utf8_unicode_ci
and of course, the value in the single quotes must actually be utf8 encoded, even if the rest of the sql query isn't.
This won't work in extreme circumstances, but try changing the column collation to the UTF-8 collation utf8_unicode_ci. Then accented characters will be equal to their non-accented counterparts.
You could try SOUNDEX:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
This compares two strings by how they sound, but it obviously returns many more results.

MySQL Character Sets

I noticed today that our database uses character set "utf8 -- UTF-8 Unicode" and collation "utf8_general_ci" but most of the tables and columns inside are using CHARSET=latin1. Will I run into any problems with this?
The reason I ask is that we have been running into a lot of problems syncing data between two databases.
For an overview of MySQL character sets, read for example http://mysqldump.azundris.com/archives/60-Handling-character-sets.html
The server, a schema/database and a table have no character sets, they have just defaults that are inherited downwards (server to schema to table). Columns that are of a CHAR, VARCHAR or any TEXT type have character sets, and do so on a per column basis. If no specific character set is defined for them, they inherit from the table.
Inheritance for all these objects happens at object creation time.
The other thing that has a character set is the connection. Since the connection is the collection of things the server knows about the client, the character set of the connection should be set to whatever character set you are using in your client.
MySQL will then correctly convert between the character set of a column and the character set of a connection. Usually there are no problems with that.
The most common problem people have with it is lying to the server, that is, setting the character set of a connection to something different from what the client is actually sending or using. The connection character set can be set at runtime by sending the command SET NAMES ... as the first thing at connection setup, and it is very important that you specify the correct thing here.
If you do, and for example send latin1 data into a connection that has been SET NAMES latin1, storing data into a latin1 column will not convert data, whereas storing data into a utf8 column will convert your latin1 umlauts (ö = F6) into utf8 umlauts (ö = C3 B6) on disk. Reading will transparently convert back, if the connection is properly set up.
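The byte values quoted above can be checked outside MySQL. This short Python sketch only illustrates the two encodings of the same character; it says nothing about how MySQL performs the conversion internally.

```python
text = "ö"

# The same character encoded in the two character sets discussed above.
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

print(latin1_bytes.hex())  # f6
print(utf8_bytes.hex())    # c3b6

# Decoding each byte sequence with its own character set gives the same text back.
same_text = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")
```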
In your setup, if your connection is SET NAMES utf8 and you are sending data to a latin1 column, only data that can be represented in latin1 can be stored. There will be data truncation, and a data truncation warning, if you for example try to store Japanese hiragana in such a latin1 column.
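To see why hiragana cannot survive in a latin1 column, it helps to try the same conversion by hand. The sketch below uses Python's latin-1 codec as a stand-in; the question-mark substitution is Python's errors="replace" behaviour and is shown only as an analogy for the data loss, not as a statement of exactly what MySQL stores.

```python
hiragana = "ひらがな"

# latin1 has no code points for hiragana, so a strict conversion fails.
try:
    hiragana.encode("latin-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# A lossy conversion substitutes a placeholder for every character it cannot map.
lossy = hiragana.encode("latin-1", errors="replace")

print(representable)  # False
print(lossy)          # b'????'
```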
In my experience, mixing up MySQL character sets led to string sorting that did not work 100% correctly. You would be better off having everything in UTF-8 to be on the safe side.
I think it depends on what you actually store in those columns. If you store UTF-8 multi-byte characters in a column with the latin1 charset, you might run into sorting trouble. But as long as there are only EN/US characters you should be OK.
You will run into problems if there's a possibility of storing "international" text -- that is, non-latin characters.
If I understand what you're posting correctly, this means that the default for new tables in your database is UTF-8, but your existing tables use latin1. That could be a problem. It depends on your data, as mentioned above.

Unicode Comparing in PHP/MySQL

The name Accîdent seems to be different than AccÎdent when I do a database query to update the column. Yet Accîdent and AccÎdent point to the same place...
In MySQL Accîdent = AccÃ®dent when inserted.
Also, AccÎdent = AccÃŽdent.
Do you know why this is?
By default, MySQL assumes the client uses the latin1 character set. If you're using UTF-8 in your PHP scripts, then this assumption is false. You need to specify to MySQL that you're using UTF-8 by issuing this SQL statement just after the database connection is opened:
SET NAMES utf8
Then the data inserted by the following SQL statements will use the correct character set. This means that you need to re-insert your data or follow the MySQL conversion procedure (see the last paragraphs).
It is recommended that your tables are configured to store data in UTF-8, too, to avoid unnecessary read/write character set conversions. That's not required, though.
More information is available in the MySQL documentation. Specifically, Connection Character Sets and Collations.
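The garbling shown in the question can be reproduced without a database. This sketch assumes the UTF-8 bytes sent by PHP are being read as MySQL's latin1, which is based on Windows cp1252, so Python's cp1252 codec is used as a stand-in for it.

```python
original = "AccÎdent"

# UTF-8 bytes interpreted as cp1252: each two-byte character turns into two characters.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # AccÃŽdent

# Reversing the wrong interpretation recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # AccÎdent
```

The repair step only works while no bytes were lost along the way, which is why fixing the connection character set first matters.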
First, you seem to be storing UTF-8 data in a table of different encoding. MySQL will try and cope, but the side effect is as you see - data in the database will look "weird". When creating a table, you need to specify the character encoding - preferably UTF-8. For existing tables, you'll need to convert the data.
Second, the tables have a "collation" besides the encoding. Encoding determines how the characters map to bytes; collation determines sorting and comparison. There are language-specific collations, but utf8_general_ci should be the one you're looking for (ci stands for "case insensitive"). Then your two strings would match.
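As a rough analogy for what a case-insensitive collation does to the two names from the question, here is the same comparison in Python. It uses str.casefold, which is Python's own case folding and not the MySQL collation algorithm, so treat it only as an illustration of the idea.

```python
lower_name = "Accîdent"
upper_name = "AccÎdent"

# A plain comparison treats the two spellings as different strings.
exact_match = lower_name == upper_name

# Folding case first makes them compare equal, much like a _ci collation would.
folded_match = lower_name.casefold() == upper_name.casefold()

print(exact_match)   # False
print(folded_match)  # True
```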

MySQL update error when special characters are used

I was wondering if anyone had come across this one before. I have a customer who uses special characters in their product description field. Updating to a MySQL database works fine if we use their HTML equivalents but it fails if the character itself is used (copied from either character map or Word I would assume).
Has anyone seen this behaviour before? The character in question in this case is ø, and we can't seem to do a replace on it (in ASP at least) as the character comes through to the SQL string as a "?".
Any suggestions much appreciated - thanks!
This suggests a mismatched character set between your database (connection) and actual data.
Most likely, you're using ISO-8859-1 on your site, but MySQL thinks it should be getting UTF-8.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html describes what to check and how to change it. The simplest way is probably to run the query "SET NAMES latin1" when connecting to the database (assuming that's the character set you need).
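The mismatch can be illustrated with the character from the question. This Python sketch assumes the site really sends ISO-8859-1 bytes and the receiving side expects UTF-8; the replacement character it produces is Python's behaviour and stands in for the "?" seen in the SQL string.

```python
# How 'ø' is sent by a site that uses ISO-8859-1 (latin1).
latin1_bytes = "ø".encode("latin-1")
print(latin1_bytes.hex())  # f8

# The lone byte F8 is not valid UTF-8, so a UTF-8 reader cannot recover the character.
as_seen_by_utf8_reader = latin1_bytes.decode("utf-8", errors="replace")

# Declaring the character set the client actually uses decodes it correctly.
as_seen_by_latin1_reader = latin1_bytes.decode("latin-1")
print(as_seen_by_latin1_reader)  # ø
```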
Being a fan of Unicode, I'd suggest switching over to UTF-8 entirely, but I realize that this is not always a feasible option.
Edit: @markokocic: Collation only dictates the sorting order. Although this should of course match your character set, it does not affect the range of characters that can be stored in a field.
Have you tried setting the collation for the table to UTF-8 or something other than latin1/ASCII?