Should each individual column have its charset defined? - mysql

Simple question: when creating database tables I already specify the collation for the entire table, however, individual columns also have a collation field. Should this be left blank or filled with the same collation as the table (for non-numeric types that is)?
Could anyone shed some light on what this field is for exactly? My suspicion is that it is used for when the column's collation differs from the table's collation, however, I would like a confirmation.

Leave it blank and it will use the table's collation. (See here in the manual.)
As for what all this is for, see the manual section Character Set Support and, in particular, Specifying Character Sets and Collations.
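As a minimal sketch of the inheritance described above (the table and column names are hypothetical, and `utf8` is assumed as the character set), a column with no collation of its own picks up the table default, while an explicit `COLLATE` on a column overrides it:

```sql
-- Hypothetical table: the default collation is declared once, at table level.
CREATE TABLE articles (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  -- No collation given: this column inherits utf8_general_ci from the table.
  title VARCHAR(200) NOT NULL,
  -- Explicit override: this column alone compares byte by byte.
  slug  VARCHAR(200) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL
) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

-- The Collation column of the output shows what each column ended up with.
SHOW FULL COLUMNS FROM articles;
```

So the column-level field only needs filling in when a column should behave differently from the rest of the table.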

Related

Query on collating sequence in database

As per this link, I would like to understand the meaning of
Case-sensitivity in string comparison depends on the collating sequence used
I also found the following description on MSDN, but I still could not understand what collation means from it:
A collation specifies the bit patterns that represent each character in a data set. Collations also determine the rules that sort and compare data. SQL Server supports storing objects that have different collations in a single database. For non-Unicode columns, the collation setting specifies the code page for the data and which characters can be represented. Data that is moved between non-Unicode columns must be converted from the source code page to the destination code page.
So, can you help me understand the meaning/significance of a collating sequence in a database, with an example?
Note: I am currently part of database intro course.
When creating a database you may need to store data in different languages, and different languages have different sets of characters with different sort orders, so you need some way to sort and compare them accordingly. That is what a collation is for: it controls how string values are sorted and compared. In T-SQL you can specify it with the COLLATE clause, as explained here: http://msdn.microsoft.com/en-us/library/ms184391.aspx
You can go through the documentation to find out which configurations are supported. If you don't specify a collation when creating a database, it picks up the default collation of the current SQL Server instance. You can also set a collation at the database level or on an individual column, and you can use a COLLATE clause when selecting data, in which case it applies to the comparison or sort order of that query.
Here is a related question that will help you understand further: What does 'COLLATE SQL_Latin1_General_CP1_CI_AS' do?
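To make this concrete, here is a small T-SQL sketch. The table `dbo.People` and its rows are made up for illustration; the collation names are standard SQL Server ones. It shows the same data compared and sorted under different collations:

```sql
-- Hypothetical table with a case-insensitive (CI), accent-sensitive (AS) column.
CREATE TABLE dbo.People (
    Name VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS
);
INSERT INTO dbo.People (Name) VALUES ('alice');
INSERT INTO dbo.People (Name) VALUES ('Alice');

-- Uses the column's CI collation: both rows match.
SELECT Name FROM dbo.People WHERE Name = 'ALICE';

-- Forces a case-sensitive (CS) comparison for this query only: no rows match.
SELECT Name FROM dbo.People
WHERE Name = 'ALICE' COLLATE SQL_Latin1_General_CP1_CS_AS;

-- COLLATE in ORDER BY changes the sort rules for this query only.
SELECT Name FROM dbo.People ORDER BY Name COLLATE Latin1_General_BIN;
```

This is what "case-sensitivity in string comparison depends on the collating sequence used" means in practice: the stored data is identical, and only the comparison rules change.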

SQL UNIQUE key not treating international characters as distinct

I have just exported a MySQL database (in LATIN1) and converted to UTF-8 in the process, and imported on to a newer system.
It seemed to go OK, but I did hit a few instances where a UNIQUE key threw an error because two entries differed only in an international character, e.g.
"åle" was not considered unique from "ale"
I did not find anything in the documentation on UNIQUE keys that mentioned character sets or encodings at all.
How can I configure the database to ensure that it considers these letters unique?
This depends on the "COLLATION" setting for the column in question. You can see the current collation with "SHOW FULL COLUMNS IN yourtablename".
For example, "utf8_general_ci" considers "ale", "åle" and "ALE" the same. Depending on your use case, something like "utf8_swedish_ci" or "utf8_bin" might be more appropriate. Note that changing collation will also change what ".. where yourcolumn=value" matches, and the ordering of "...order by yourcolumn".
You can change the collation with "ALTER TABLE" (for a single column), or database-wide. More information in the manual: http://dev.mysql.com/doc/refman/5.5/en/globalization.html
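A sketch of the statements involved, using the placeholder names from above plus a hypothetical `yourdatabase`. The column type `VARCHAR(255) NOT NULL` is an assumption; `MODIFY` needs the column's full existing definition restated:

```sql
-- Inspect the current collation of every column in the table.
SHOW FULL COLUMNS IN yourtablename;

-- Change one column to a binary collation, so 'ale', 'åle' and 'ALE'
-- are all treated as distinct values (type and NOT NULL are assumed here).
ALTER TABLE yourtablename
  MODIFY yourcolumn VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;

-- Change the database default. This only affects tables created afterwards;
-- existing columns keep the collation they already have.
ALTER DATABASE yourdatabase CHARACTER SET utf8 COLLATE utf8_swedish_ci;
```

If the UNIQUE key already exists, the index is rebuilt under the new collation when the column is modified, so rows that previously collided can then coexist.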

Accent insensitive search query in MySQL

Is there any way to make search query accent insensitive?
the column's and table's collation are utf8_polish_ci and I don't want to change them.
example word : toruń
select * from pages where title like '%torun%'
It doesn't find "toruń". How can I do that?
You can change the collation at runtime in the sql query,
...where title like '%torun%' collate utf8_unicode_ci
but beware that changing the collation on the fly at runtime forgoes the possibility of mysql using an index, so performance on large tables may be terrible.
Or, you can copy the column to another column, such as searchable_title, but change the collation on it. It's actually common to do this type of stuff, where you copy data but have it in some slightly different form that's optimized for some specific workload/purpose. You can use triggers as a nice way to keep the duplicated columns in sync. This method has the potential to perform well, if indexed.
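A minimal sketch of that duplicated-column approach, assuming `title` fits in `VARCHAR(255)`; the trigger and index names are hypothetical:

```sql
-- Shadow column with an accent-insensitive collation; `title` keeps utf8_polish_ci.
ALTER TABLE pages
  ADD COLUMN searchable_title VARCHAR(255)
    CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- Backfill existing rows.
UPDATE pages SET searchable_title = title;

-- Keep the copy in sync on every insert and update.
CREATE TRIGGER pages_before_insert BEFORE INSERT ON pages
FOR EACH ROW SET NEW.searchable_title = NEW.title;

CREATE TRIGGER pages_before_update BEFORE UPDATE ON pages
FOR EACH ROW SET NEW.searchable_title = NEW.title;

-- Optional index on the shadow column.
CREATE INDEX idx_pages_searchable_title ON pages (searchable_title);

-- Search the shadow column instead of `title`.
SELECT * FROM pages WHERE searchable_title LIKE '%torun%';
```

Note that an ordinary index helps equality and prefix searches (`LIKE 'torun%'`); a pattern with a leading wildcard still has to scan.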
Note - Make sure that your db really has those characters and not html entities.
Also, the character set of your connection matters. The above assumes it's set to utf8, for example via a statement like SET NAMES utf8.
If not, you need an introducer for the literal value
...where title like _utf8'%torun%' collate utf8_unicode_ci
and of course, the value in the single quotes must actually be utf8 encoded, even if the rest of the sql query isn't.
This won't work in extreme circumstances, but try changing the column collation to utf8_unicode_ci. Then accented characters will be equal to their non-accented counterparts.
You could try SOUNDEX:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
This compares two strings by how they sound, but it will obviously return many more results.

UTF8 string comparisons in MySQL

We have issues with utf8-string comparisons in MySQL 5, regarding case and accents :
From what I gathered, MySQL implements collations by considering that "groups of characters should be considered equal".
For example, in the utf8_unicode_ci collation, all the letters "EÉÈÊeéèê" are in the same box (together with other variants of "e").
So if you have a table containing ["video", "vidéo", "vidÉo", "vidÊo", "vidêo", "vidÈo", "vidèo", "vidEo"] (in a varchar column declared with utf8_general_ci collation) :
when asking MySQL to sort the rows according to this column, the sorting is random (MySQL does not enforce a sorting rule between "é" and "É" for example),
when asking MySQL to add a Unique Key on this column, it raises an error because it considers all the values are equal.
What setting can we fiddle with to fix these two points ?
PS : on a related note, I do not see any case sensitive collation for the utf8 charset. Did I miss something ?
[edit] I think my initial question still holds some interest, and I will leave it as is (and maybe one day get a positive answer).
It turned out, however, that our problems with string comparisons regarding accents were not linked to the collation of our text columns. They were linked to a configuration problem with the character_set_client parameter when talking with MySQL, which defaulted to latin1.
Here is the article that explained it all to us, and allowed us to fix the problem :
Getting out of MySQL character set hell
It is lengthy, but trust me, you need this length to explain both the problem and the fix.
Use a collation that considers these characters to be distinct. Maybe utf8_bin (it's case sensitive, since it does binary comparison of characters)
http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
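Both points from this thread, the connection character set and the choice of collation, can be checked from a client session. This is a sketch that assumes the client really sends UTF-8 bytes:

```sql
-- Show the session's character set settings, including character_set_client.
SHOW VARIABLES LIKE 'character_set_%';

-- Tell the server that this connection sends and expects utf8.
SET NAMES utf8;

-- Compare the same two literals under two collations.
-- utf8_general_ci treats 'é' and 'e' as equal; utf8_bin compares raw bytes.
SELECT 'vidéo' = 'video' COLLATE utf8_general_ci AS equal_general_ci,
       'vidéo' = 'video' COLLATE utf8_bin        AS equal_bin;
```

If `character_set_client` is still latin1 while the application sends UTF-8, the server misreads the bytes, and comparisons can go wrong whatever collation the column has.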

SSIS - Data from MySql to MsSql some characters are?

I just transferred some data from MySql to MsSql (2K5). In a text field, some of my characters, such as apostrophes, are now ? (question mark). To me this indicates some sort of collation or character set error, right?
To be honest, I don't know which one I should be using.
The MySql db's current charset is utf8_general_ci and in ms sql it is SQL_Latin1_General_CP1_CI_AS.
I have tried changing the charset of the mysql table to latin1_swedish_ci, however this doesn't help.
Thanks for the input
Have you tried changing the target (SQL Server) column data type to NVARCHAR?
The utf8_general_ci collation on the MySQL column indicates a Unicode data type. If the source is Unicode, so should be the target - for the easiest transition.
Collations themselves play a minor role here. They just affect comparison and sorting.
You might also need to check the SSIS type of the columns in your dataflow. Remember, the data type and character set are set at the connection manager on the source (and that may involve a conversion from the original native character set). Also, any operations like derived columns or conversions will have a character set which can be altered and will persist down that column's lineage in the data flow. At the end when it gets to the destination, there could be additional character set coercion/conversion.
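As a sketch of the NVARCHAR suggestion above (the table `dbo.ImportedPosts` and its columns are hypothetical), a Unicode destination column on the SQL Server side might look like this:

```sql
-- Hypothetical destination table: NVARCHAR stores Unicode, so characters
-- outside the Latin1 code page are not replaced with '?'.
CREATE TABLE dbo.ImportedPosts (
    Id   INT NOT NULL PRIMARY KEY,
    Body NVARCHAR(MAX) NULL
);

-- The N prefix marks a Unicode literal; without it the literal is first
-- converted to the database's code page.
INSERT INTO dbo.ImportedPosts (Id, Body) VALUES (1, N'toruń');

SELECT Id, Body FROM dbo.ImportedPosts;
```

In the SSIS data flow, the matching column types would be the Unicode ones (DT_WSTR or DT_NTEXT) rather than DT_STR or DT_TEXT.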