Following this link, I would like to understand the meaning of this statement:
Case-sensitivity in string comparison depends on the collating sequence used
After reading the following description on MSDN, I still could not understand what collation means:
A collation specifies the bit patterns that represent each character in a data set. Collations also determine the rules that sort and compare data. SQL Server supports storing objects that have different collations in a single database. For non-Unicode columns, the collation setting specifies the code page for the data and which characters can be represented. Data that is moved between non-Unicode columns must be converted from the source code page to the destination code page.
So, can you help me understand the meaning and significance of a collating sequence in a database, with an example?
Note: I am currently taking an introductory database course.
When you create a database you may need to store data in different languages. Different languages have different numbers of characters and different sort orders, so you need some way to sort and compare them correctly; that is what a collation is for. A collation controls the way string values are sorted and compared. In T-SQL you can set it with the COLLATE clause, as explained here: http://msdn.microsoft.com/en-us/library/ms184391.aspx
The documentation lists the collations that are supported. If you don't specify a collation when creating a database, it picks up the default collation of the current SQL Server instance. You can also set a collation at the database level or on an individual column, and you can apply one in a SELECT statement, where it affects comparison and the sort order for that query.
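As a rough illustration of a collation set on a column and then overridden for a single query, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for SQL Server only so that the example runs without a server; NOCASE and BINARY are SQLite's own collations, and the table name people is made up for the example. T-SQL uses the same COLLATE keyword, but with collation names such as SQL_Latin1_General_CP1_CI_AS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column-level collation: comparisons and sorting on "name" ignore ASCII case.
# (Hypothetical table; NOCASE is a SQLite collation, not a SQL Server one.)
conn.execute("CREATE TABLE people (name TEXT COLLATE NOCASE)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("alice",), ("Bob",), ("carol",)],
)


def names(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Uses the collation declared on the column (NOCASE).
column_order = names("SELECT name FROM people ORDER BY name")

# Overrides the collation for this query only (byte-wise comparison).
binary_order = names("SELECT name FROM people ORDER BY name COLLATE BINARY")

# Equality follows the same rules as sorting.
nocase_match = names("SELECT name FROM people WHERE name = 'BOB'")
binary_match = names("SELECT name FROM people WHERE name = 'BOB' COLLATE BINARY")

print(column_order)  # ['alice', 'Bob', 'carol']
print(binary_order)  # ['Bob', 'alice', 'carol']
print(nocase_match)  # ['Bob']
print(binary_match)  # []
```

The point of the sketch is that the same stored rows sort and match differently depending on which collation is in effect, without the data itself changing.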
Here is a related question that will help you understand further: What does 'COLLATE SQL_Latin1_General_CP1_CI_AS' do?
Related
I am getting an error like this:
COLLATION 'latin1_swedish_ci' is not valid for CHARACTER SET 'utf8'
This happens whenever I try to run a particular query. The problem in my case is that I need this query to run, without modification, against two separate databases that use different character sets (one is latin1, the other is utf8).
Since the strings I am trying to match are guaranteed to be basic letters (a-z), I was wondering if there was any way to force the comparison to work irrespective of the specific encoding?
I mean, an A is an A no matter how it is encoded. Is there some way to tell MySQL to compare the content of the strings as letters rather than as whatever binary thing it uses internally? I don't even understand why it can't auto-convert collations, since it is quite capable of doing it when explicitly told to.
I have just exported a MySQL database (in LATIN1) and converted to UTF-8 in the process, and imported on to a newer system.
It seemed to go OK, but I did hit a few instances where a UNIQUE key threw an error because two entries differed only in an international character, e.g.
"åle" was not considered unique from "ale"
I did not find anything in the documentation on UNIQUE keys that mentioned character sets or encodings at all.
How can I configure the database to ensure that it considers these letters unique?
This depends on the "COLLATION" setting for the column in question. You can see the current collation with "SHOW FULL COLUMNS IN yourtablename".
For example, "utf8_general_ci" considers "ale", "åle" and "ALE" the same. Depending on your use case, something like "utf8_swedish_ci" or "utf8_bin" might be more appropriate. Note that changing collation will also change what ".. where yourcolumn=value" matches, and the ordering of "...order by yourcolumn".
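To make the difference concrete, here is a minimal Python sketch of the two kinds of comparison. It is only an approximation: general_ci_key is a made-up helper that strips accents and case with the standard unicodedata module, which is not MySQL's actual utf8_general_ci algorithm, and bin_key simply compares encoded bytes the way a binary collation would.

```python
import unicodedata


def general_ci_key(text):
    """Rough stand-in for an accent- and case-insensitive collation.

    Assumption: decomposing and dropping combining marks approximates
    what utf8_general_ci does for these particular strings.
    """
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


def bin_key(text):
    """Stand-in for a binary collation: compare the encoded bytes."""
    return text.encode("utf-8")


values = ["ale", "åle", "ALE"]

distinct_ci = {general_ci_key(v) for v in values}
distinct_bin = {bin_key(v) for v in values}

print(len(distinct_ci))   # 1: all three collide, so a UNIQUE key would reject them
print(len(distinct_bin))  # 3: all three are distinct
```

A UNIQUE index uses the column's collation to decide whether two values are equal, which is why the three strings collide under one collation and coexist under another.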
You can change the collation with "ALTER TABLE" (for a single column), or database-wide. More information in the manual: http://dev.mysql.com/doc/refman/5.5/en/globalization.html
I am running a production application with a MySQL database server. I forgot to change a column's collation from latin1 to utf8_unicode, which results in strange data when multi-language data is saved to the column.
My question is, what will happen with my existing data if I change my collation to utf8_unicode now? Will it destroy or corrupt the existing data or will the data remain, but the new data will be saved as utf8 as it should?
I will change with phpMyAdmin web client.
The article http://mysqldump.azundris.com/archives/60-Handling-character-sets.html discusses this at length and also shows what will happen.
Please note that you are mixing up a CHARACTER SET (actually an encoding) with a COLLATION.
A character set defines the physical representation of a string in bytes on disk. You can make this visible, using the HEX() function, for example SELECT HEX(str) FROM t WHERE id = 1 to see how MySQL stores the bytes of your string. What MySQL delivers to you may be different, depending on the character set of your connection, defined with SET NAMES .....
A collation is a sort order. It is dependent on the character set. For example, your data may be in the latin1 character set, but it may be ordered according to either of the two German sort orders, latin1_german1_ci or latin1_german2_ci. Depending on your choice, umlauts such as ö will sort either as o (latin1_german1_ci) or as oe (latin1_german2_ci).
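The distinction can be sketched in a few lines of Python. The first part shows that a character set only decides which bytes are stored; the second part imitates the two German sort orders with hand-written translation tables. GERMAN1, GERMAN2 and sort_with are names invented for this sketch, and the tables cover only the umlauts and ß, so this is an illustration of the idea rather than MySQL's real collation rules.

```python
# Character set: the same text becomes different bytes.
latin1_hex = "Öl".encode("latin-1").hex()
utf8_hex = "Öl".encode("utf-8").hex()
print(latin1_hex)  # d66c
print(utf8_hex)    # c3966c

# Collation: the same text sorts differently.
# Assumed simplification of latin1_german1_ci (ö sorts as o) ...
GERMAN1 = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "s"})
# ... and of latin1_german2_ci (ö sorts as oe).
GERMAN2 = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def sort_with(table, items):
    """Sort case-insensitively after applying one of the tables above."""
    return sorted(items, key=lambda word: word.lower().translate(table))


words = ["Ofen", "Öl", "Oder"]
german1_order = sort_with(GERMAN1, words)
german2_order = sort_with(GERMAN2, words)

print(german1_order)  # ['Oder', 'Ofen', 'Öl']
print(german2_order)  # ['Oder', 'Öl', 'Ofen']
```

In both sorts the stored strings are identical; only the comparison key changes, which is why changing a collation does not rewrite the data while changing a character set does.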
When you are changing a character set, the data in your table needs to be rewritten. MySQL will read all data and all indexes in the table, make a hidden copy of the table which temporarily takes up disk space, then move the old table into a hidden location, move the hidden copy into place and finally drop the old data, freeing up disk space. For some time in between, you will need twice the storage.
When you are changing a collation, the sort order of the data changes but not the data itself. If the column you are changing is not part of an index, nothing needs to be done besides rewriting the frm file, and sufficiently recent versions of MySQL should not do more.
When you are changing a collation of a column that is part of an index, the index needs to be rewritten, as an index is a sorted excerpt of a table. This will again trigger the ALTER TABLE table copy logic outlined above.
MySQL tries to preserve data while doing this: as long as the data you have can be represented in the target character set, the conversion will not be lossy. Warnings will be printed if data truncation occurs, and data which cannot be represented in the target character set will be replaced by ?.
Running a quick test in MySQL 5.1 with a VARCHAR column set to latin1_bin, I inserted some non-latin chars:
INSERT INTO Test VALUES ('英國華僑');
I select them and get rubbish (as expected).
SELECT text from Test;
gives
text
????
I then changed the collation of the column to utf8_unicode and re-ran the SELECT and it shows the same result
text
????
This is what I would expect - It will keep the data and the data will remain rubbish, because when the data was inserted the column lost the extra character information and just inserted a ? for each non-latin character and there is no way for the ???? to again become 英國華僑.
Your data will stay in place but it won't be fixed.
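The same one-way loss can be reproduced outside the database. In this minimal Python sketch, encoding with errors="replace" is used as a stand-in for the substitution MySQL performed on insert; that mapping is an assumption made for illustration, not a description of MySQL internals.

```python
original = "英國華僑"

# What ended up in the latin1 column: one '?' byte per character
# that latin1 cannot represent.
stored = original.encode("latin-1", errors="replace")
print(stored)        # b'????'
print(stored.hex())  # 3f3f3f3f

# Declaring the column utf8 afterwards only changes how the bytes are read.
# The bytes are still four question marks.
after_change = stored.decode("utf-8")
print(after_change)  # ????
```

Nothing in the stored bytes records which characters the question marks used to be, so no later change of character set or collation can restore them.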
Valid data will be properly converted:
When you change a data type using CHANGE or MODIFY, MySQL tries to convert existing column values to the new type as well as possible. Warning: This conversion may result in alteration of data.
http://dev.mysql.com/doc/refman/5.5/en/alter-table.html
... and more specifically:
To convert a binary or nonbinary string column to use a particular character set, use ALTER TABLE. For successful conversion to occur, one of the following conditions must apply: [...] If the column has a nonbinary data type (CHAR, VARCHAR, TEXT), its contents should be encoded in the column character set, not some other character set. If the contents are encoded in a different character set, you can convert the column to use a binary data type first, and then to a nonbinary column with the desired character set.
http://dev.mysql.com/doc/refman/5.1/en/charset-conversion.html
So your problem is invalid data, i.e., data encoded in a different character set. I tried the tip suggested by the documentation and it basically ruined my data, but the reason is that my data was already lost: running SELECT column, HEX(column) FROM table showed that multibyte chars had been inserted as 0x3F (i.e., the ? symbol in Latin1). My MySQL stack had been smart enough to detect that the input data was not Latin1 and convert it into something "compatible". And once data is gone, you can't get it back.
To sum up:
Use HEX() to find out if you still have your data.
Make your tests in a copy of your table.
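As a rough aid for reading HEX() output, the sketch below mimics it in Python and applies a simple heuristic. Both helper names are invented here, and the heuristic (every byte is 0x3F) is an assumption that will also flag a value that legitimately consists only of question marks, so treat it as a hint rather than proof.

```python
def hex_like_mysql(text, codec):
    """Mimic HEX(col): uppercase hex of the bytes that would be stored.

    `codec` is a Python codec name; unrepresentable characters become '?'.
    """
    return text.encode(codec, errors="replace").hex().upper()


def looks_lost(hex_value):
    """Heuristic: True if every stored byte is 0x3F ('?')."""
    raw = bytes.fromhex(hex_value)
    return len(raw) > 0 and all(byte == 0x3F for byte in raw)


intact = hex_like_mysql("英國華僑", "utf-8")
lost = hex_like_mysql("英國華僑", "latin-1")

print(lost)                # 3F3F3F3F
print(looks_lost(intact))  # False: multibyte sequences are still there
print(looks_lost(lost))    # True: only question marks were stored
```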
My question is, what will happen with my existing data if I change my collation to utf8_unicode now?
Answer: If you change to utf8_unicode_ci, nothing will happen to your existing data (which is already corrupt and will remain corrupt until you modify it).
Will it destroy or corrupt the existing data or will the data remain, but the new data will be saved as utf8 as it should?
Answer: After you change to utf8_unicode_ci, existing data will not be destroyed. It will remain the same as before (something like ????). However, if you insert new data containing Unicode characters, it will be stored correctly.
I will change with phpMyAdmin web client.
Answer: Sure, you can change the collation with phpMyAdmin by going to Operations > Table options.
CAUTION! Some problems are solved via
ALTER TABLE ... CONVERT TO ...
Some are solved via a 2-step process
ALTER TABLE ... MODIFY ... VARBINARY...
ALTER TABLE ... MODIFY ... VARCHAR...
If you do the wrong one, you will have a worse mess!
Do SELECT HEX(col), col ... to see what you really have.
Study this to see what case you have: Trouble with utf8 characters; what I see is not what I stored
Perform the correct fix, based on these cases: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
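Why the choice of fix matters can be sketched in Python for one common case: the bytes on disk are really UTF-8, but the column is declared latin1. The sketch uses cp1252 for latin1 because MySQL's latin1 is cp1252; the two "paths" are simplified models of what each ALTER TABLE variant does to the bytes, written as assumptions for illustration rather than as a trace of the server.

```python
# Bytes on disk are really UTF-8, but the column is declared latin1.
on_disk = "AccÎdent".encode("utf-8")

# Path 1, modelled on ALTER TABLE ... CONVERT TO ...:
# the bytes are decoded using the declared charset, then re-encoded.
transcoded = on_disk.decode("cp1252").encode("utf-8")

# Path 2, modelled on MODIFY ... VARBINARY then MODIFY ... VARCHAR:
# the bytes are left untouched and only the declared charset changes.
relabelled = on_disk

print(transcoded.decode("utf-8"))  # AccÃŽdent (double-encoded, a worse mess)
print(relabelled.decode("utf-8"))  # AccÎdent (correct)
print(len(on_disk), len(transcoded))  # 9 11
```

For this case the two-step route is the right one. If the bytes had genuinely been latin1, the roles would be reversed, which is why checking with HEX() first is essential.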
The name Accîdent seems to be different than AccÎdent when I do a database query to update the column. Yet Accîdent and AccÎdent point to the same place...
In MySQL Accîdent = Accîdent when inserted.
Also, AccÎdent = AccÃŽdent.
Do you know why this is?
By default, MySQL assumes the client uses the latin1 character set. If you're using UTF-8 in your PHP scripts, then this assumption is false. You need to specify to MySQL that you're using UTF-8 by issuing this SQL statement just after the database connection is opened:
SET NAMES utf8
Then the data inserted by the following SQL statements will use the correct character set. This means that you need to re-insert your data or follow the MySQL conversion procedure (see the last paragraphs).
It is recommended that your tables are configured to store data in UTF-8, too, to avoid unnecessary read/write character set conversions. That's not required, though.
More information is available in the MySQL documentation. Specifically, Connection Character Sets and Collations.
First, you seem to be storing UTF-8 data in a table of different encoding. MySQL will try and cope, but the side effect is as you see - data in the database will look "weird". When creating a table, you need to specify the character encoding - preferably UTF-8. For existing tables, you'll need to convert the data.
Second, the tables have a "collation" in addition to the encoding. The encoding determines how characters map to bytes; the collation determines sorting and comparison. There are language-specific collations, but utf8_general_ci should be the one you're looking for (ci stands for "case insensitive"), and with it your two strings would match.
I just transferred some data from MySQL to MS SQL Server (2005). In a text field, some of my characters, such as apostrophes, are now ? (question mark). To me this indicates some sort of collation or character set error, right?
To be honest, I don't know which one I should be using.
The MySQL database currently uses utf8_general_ci and the SQL Server one uses SQL_Latin1_General_CP1_CI_AS.
I have tried changing the MySQL table to latin1_swedish_ci, but this doesn't help.
Thanks for the input
Have you tried changing the target (SQL Server) column data type to NVARCHAR?
The utf8_general_ci collation on the MySQL column indicates a Unicode data type. If the source is Unicode, so should be the target - for the easiest transition.
Collations themselves play a minor role here. They just affect comparison and sorting.
You might also need to check the SSIS type of the columns in your dataflow. Remember, the data type and character set is set at the connection manager on the source (and that may involve a conversion from the original native character set). Also, any operations like derived columns or conversions will have a character set which can be altered and will persist down that column's lineage in the data flow. At the end when it gets to the destination, there could be additional character set coercion/conversion.