MySQL string matching using dashes - mysql

I am currently migrating data from MySQL 5.6.41 on Windows, to MySQL 8.0.21 on Windows. Overall, a pretty smooth migration, with a couple of very frustrating hiccups. There's one table that looks like this:
CREATE TABLE `domains` (
`intDomainID` int(11) NOT NULL AUTO_INCREMENT,
`txtDomain` varchar(100) NOT NULL,
`dtDateTime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`blnTor` int(1) NOT NULL DEFAULT '0',
`txtTLD` varchar(100) NOT NULL,
PRIMARY KEY (`intDomainID`),
UNIQUE KEY `txtDomain` (`txtDomain`)
) ENGINE=InnoDB AUTO_INCREMENT=10127897 DEFAULT CHARSET=utf8mb4;
The CREATE SCHEMA is complete, and was created by Workbench's "Copy to Clipboard" --> "Create Schema" function.
When I used the built-in Workbench export/import, the import always failed with a "Duplicate value in txtDomain" error (paraphrasing here). That is weird, because the original table has a UNIQUE KEY constraint on that field, so there cannot be duplicates, and I confirmed that the values it flagged as duplicates were NOT duplicates in the original database.
I then dumped the table using SELECT ... INTO OUTFILE, moved the file over to the new server, and did a LOAD DATA INFILE. This also failed with the same "Duplicate value in txtDomain" error.
I then removed the UNIQUE KEY constraint and redid the LOAD DATA INFILE. This worked; the data is there. However, I cannot add the UNIQUE KEY constraint back due to "duplicates". I investigated and found this:
Query result on MySQL 5.6.41: (screenshot)
Query result on MySQL 8.0.21: (screenshot)
Now, what is going on? The table definition, the database, table and field charset/collations are identical. I need that UNIQUE KEY constraint back...
Why is http://d­­eepdot35wv­­m­eyd5.onion:80 == http://d­­ee-p---dot35w-v­­m­eyd5.onion:80 ??
In case it helps, my export command:
SELECT * INTO OUTFILE 'S:\\temp\\domains.txt'
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
FROM domains;
My import command:
LOAD DATA INFILE 'E:\\DB Backup\\ServerMigration\\domains.txt'
INTO TABLE domains
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';
COLLATIONS:
Old Server: utf8_general_ci [I don't remember touching this value]
New Server: utf8mb4_0900_ai_ci [I didn't touch this value]
DB old/new are the same: utf8mb4_0900_ai_ci
Table old/new are the same: utf8mb4_0900_ai_ci
This is what the raw TXT file looks like on the file system: (screenshot)
Note: if I paste one of the URLs from the screenshot in here, it magically turns into the "correct" value, without the dashes:
i.e.: http://deepdot35w­­v­­­m­­­eyd5.onion:80
Note2: Using Notepad++, if I convert a regular "dash" to HEX I get "2D". However, if I convert one of the dashes from the URLs that are causing trouble, I get HEX "C2AD". So it seems that I'm dealing with a weird Unicode character and not a dash?
Note3: If anyone wants a small sample file, here it is:
https://www.dropbox.com/s/1ssbl95t2jgn2xy/domains_small.zip

The character in question is U+00AD "SOFT HYPHEN" - a non-printable character that is used to signal a possible hyphenation point inside a word.
It seems that the collation used handles these characters differently on the new setup (MySQL 8.0 with default collation settings) than it did on the old setup (MySQL 5.6 with default collation settings):
These non-printable characters are now ignored in a comparison.
You can test the difference with this simple fiddle. The comparison returns "0" on 5.6/5.7, but "1" on MySQL 8.0 -> https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=a9bd0bf7de815dc14a886c5069bd1a0f
Note that the SQL fiddle also uses a default collation configuration when it's not specified explicitly.
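If you want to check this directly on the new server instead of in the fiddle, here is a minimal sketch; it assumes a MySQL 8.0 session and builds the soft hyphen from its UTF-8 bytes 0xC2 0xAD:
SET NAMES utf8mb4;
SELECT 'deepdot' = CONCAT('deep', _utf8mb4 x'C2AD', 'dot') COLLATE utf8mb4_0900_ai_ci AS v_0900_ai_ci,
       'deepdot' = CONCAT('deep', _utf8mb4 x'C2AD', 'dot') COLLATE utf8mb4_general_ci AS v_general_ci,
       'deepdot' = CONCAT('deep', _utf8mb4 x'C2AD', 'dot') COLLATE utf8mb4_bin        AS v_bin;
-- Expected on a default 8.0 setup: 1, 0, 0 - only the 0900 collations treat U+00AD as ignorable.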
You might fix that by setting a binary UTF-8 collation for the txtDomain column, which is actually what you want for technical strings anyway:
CREATE TABLE `domains` (
`intDomainID` int(11) NOT NULL AUTO_INCREMENT,
`txtDomain` varchar(100) NOT NULL
CHARACTER SET utf8mb4
COLLATE utf8mb4_bin,
`dtDateTime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`blnTor` int(1) NOT NULL DEFAULT '0',
`txtTLD` varchar(100) NOT NULL,
PRIMARY KEY (`intDomainID`),
UNIQUE KEY `txtDomain` (`txtDomain`)
) ENGINE=InnoDB AUTO_INCREMENT=10127897 DEFAULT CHARSET=utf8mb4;
UPDATE: As it turns out, the COLLATION must have been different between the old (5.6) and new (8.0) setup, as utf8mb4_0900_ai_ci was introduced with MySQL 8.0. The old collation must have been utf8mb4_general_ci, which when applied shows the desired behaviour in MySQL 8.0, too.
But still, you should use a binary collation for technical strings like URLs anyway.
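Applied to the existing table, one possible path back to the UNIQUE KEY looks like this. This is only a sketch using the table from the question; try it on a copy first, because stripping the soft hyphens can surface genuine duplicates:
ALTER TABLE domains
MODIFY txtDomain varchar(100)
CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;   -- soft hyphens now compare as distinct

-- Optionally remove stray U+00AD characters that were never meant to be part of the data:
SELECT COUNT(*) FROM domains WHERE txtDomain LIKE CONCAT('%', _utf8mb4 x'C2AD', '%');
UPDATE domains SET txtDomain = REPLACE(txtDomain, _utf8mb4 x'C2AD', '');

ALTER TABLE domains ADD UNIQUE KEY txtDomain (txtDomain);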

Related

Correct mysql syntax to change a column with collation

This line of code was generated automatically by phpmyadmin (perhaps an old version running on my company's website, but I don't want to deal with that right now):
ALTER TABLE `lc_error_logs` CHANGE `OneDetailedMessage` `OneDetailedMessage` VARCHAR(5000) CHARSET=latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL;
When I attempt to execute that SQL, I get this error message via phpMyAdmin:
Query error:
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '=latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL' at line 1.
I don't see what is wrong with the query, and would like to know what is wrong.
Notes:
All I did was change the VARCHAR length.
It works fine if I delete the column, then recreate it with these same settings. It is just something wrong with that ALTER TABLE syntax.
I can change INT columns in the same table (e.g. INT to SMALLINT) without any problems. Something about the COLLATE? But that is the collation we use throughout this DB.
mysql version: "5.5.47-0ubuntu0.14.04.1"
It doesn't matter what values I use for the VARCHAR length. It doesn't matter what I attempt to change (e.g. changing the column name gives the same error). And I've had this problem before on other tables. I've never been able to successfully alter a VARCHAR in this DB if I specify a collation. (Even though this is the default collation for this DB, generated automatically when I do not specify a collation.)
ALTER TABLE `lc_error_logs` CHANGE `OneDetailedMessage` `OneDetailedMessage` VARCHAR(5000) CHARACTER SET latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL;
Try CHARACTER SET instead of CHARSET=. The key=value form (CHARSET=latin1) is only accepted as a table option; inside a column definition you have to spell it out as CHARACTER SET latin1, as in the query above.
The problem is with the CHARSET/COLLATE part. If I remove that, the query works (uses the DB's defaults for charset and collation):
ALTER TABLE `lc_error_logs` CHANGE `OneDetailedMessage` `OneDetailedMessage` VARCHAR(5000) NULL DEFAULT NULL;
Or the equivalent, slightly simpler:
ALTER TABLE `lc_error_logs` MODIFY `OneDetailedMessage` VARCHAR(5000) NULL DEFAULT NULL;
NOTE:
I still don't know what the correct syntax is, or whether it is an issue with how the DB is set up. So if someone can supply an answer that works even with specification of collation, I will accept that answer. Otherwise, I will accept this answer, and move on.

MySQL: column size limit

I'm currently working on a Windows OS and I have installed MySQL community server 5.6.30 and everything is fine. I have a script that initializes the DB and again, everything works fine.
Now I'm trying to run this script on a Linux environment -- same MySQL version -- and I get the following error:
ERROR 1074 (42000) at line 3: Column length too big for column
'txt' (max = 21845); use BLOB or TEXT instead
Script -
DROP TABLE IF EXISTS texts;
CREATE TABLE `texts` (
`id` BINARY(16) NOT NULL DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0',
`txt` VARCHAR(50000) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
Obviously there's some MySQL server configuration on my Windows OS that I need to replicate on Linux; can anyone share any ideas?
Update 1
On AWS's RDS it also works, and I'm pretty sure that is just a service on top of Linux, so obviously it's just a config issue.
Does anybody know how to reach VARCHAR 50K with UTF8? I don't want to use TEXT or MEDIUMTEXT or anything else, just plain old VARCHAR(size).
Update 2
I appreciate the different solutions that were suggested, but I'm not looking for a new solution; I'm only looking for an answer as to why VARCHAR(50K) works under Windows but not under Linux.
BTW, I'm using character set UTF8 and collation utf8_general_ci.
Answer
To answer my own question: it was an issue with the SQL_MODE. It was set to STRICT_TRANS_TABLES, which had to be removed.
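For completeness, a small sketch of how to compare and relax the mode (session scope shown here; make it permanent in my.cnf if that is really what you want):
SELECT @@GLOBAL.sql_mode, @@SESSION.sql_mode;   -- compare this between the Windows and Linux servers
SET SESSION sql_mode = '';                      -- drop strict mode for the current session only
-- Without STRICT_TRANS_TABLES the over-long VARCHAR is not rejected; MySQL converts the
-- column to a TEXT type and issues a warning instead of failing with ERROR 1074.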
According to the documentation:
Although InnoDB supports row sizes larger than 65,535 bytes
internally, MySQL itself imposes a row-size limit of 65,535 for the
combined size of all columns:
mysql> CREATE TABLE t (a VARCHAR(8000), b VARCHAR(10000),
-> c VARCHAR(10000), d VARCHAR(10000), e VARCHAR(10000),
-> f VARCHAR(10000), g VARCHAR(10000)) ENGINE=InnoDB;
ERROR 1118 (42000): Row size too large. The maximum row size for the
used table type, not counting BLOBs, is 65535. You have to change some
columns to TEXT or BLOBs
(Unfortunately, this example does not provide the character set so we don't really know how large the columns are.)
The utf8 encoding uses 1, 2, or 3 bytes per character. So, the maximum number of characters that can safely fit in a row of 65,535 bytes (the MySQL maximum) is 21,845 characters (21,845 * 3 = 65,535).
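A rough illustration of that ceiling (the exact cutoff is slightly lower, because the 2-byte length prefix and any other columns also count toward the 65,535-byte row limit):
CREATE TABLE t_fits (txt VARCHAR(21000)) ENGINE=InnoDB DEFAULT CHARSET=utf8;  -- 63,000 bytes: accepted
CREATE TABLE t_big  (txt VARCHAR(30000)) ENGINE=InnoDB DEFAULT CHARSET=utf8;  -- 90,000 bytes: ERROR 1074 in strict mode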
Despite the versions being similar, it would appear that Windows is being conservative in its space allocation and guaranteeing that you can store any characters in the field. Linux seems to have a more laissez-faire attitude: you can store some strings with over 21,845 characters, depending on the characters.
I have no idea why this difference would exist in the same version. Both methods are "right" in some sense. There are simple enough work-arounds:
Use TEXT.
Switch to a collation that has shorter characters (which is presumably what you want to store).
Reduce the size of the field.
Please simply use TEXT to declare the txt column:
DROP TABLE IF EXISTS texts;
CREATE TABLE `texts` (
`id` BINARY(16) NOT NULL DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0',
`txt` TEXT DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
utf8 needs up to 3 bytes per character. utf8mb4: 4; latin1: 1; ascii: 1; etc. VARCHAR(N) is implemented as a 1- or 2-byte length in front of the bytes for the text. That is allowed to hold N characters (not bytes). So, if you say you want utf8, then 3*N must be less than 65535, the max value for a 2-byte length.
Be glad you are not running in some old version, where VARCHAR had a limit of 255.
If your txt does not need characters other than ascii or English, then use CHARACTER SET latin1.
In InnoDB, when there are 'long' fields (big varchars, texts, blobs, etc), some or all of the column is stored in a separate block(s). There is a limit of about 8000 bytes for what is stored together in the record.
If you really need 50K of utf8, then MEDIUMTEXT is what you need. It uses a 3-byte length and can hold up to 16M bytes (5M characters, possibly more, since utf8 is a variable length encoding).
Most applications can (should?) use either ascii (1 byte per character) or utf8mb4 (1-4 bytes per character). The latter allows for all languages, including Emoji and the 4-byte Chinese characters that utf8 cannot handle.
As for why Windows and Linux work differently here, I don't know. Are you using the same version? Suggest you file a bug report with http://bugs.mysql.com . (And provide a link to it from this Question.)
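If 50K of utf8 really is a hard requirement, a MEDIUMTEXT version of the table from the question might look like this (a sketch; utf8mb4 chosen per the advice above):
CREATE TABLE `texts` (
`id` BINARY(16) NOT NULL DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0',
`txt` MEDIUMTEXT,   -- up to 16 MB, stored off-page by InnoDB, so it does not hit the 65,535-byte row limit
PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8mb4;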
If you absolutely must use varchar - which is a bad solution to this problem! - then here's something you can try:
CREATE TABLE `texts` (
`id` BINARY(16) NOT NULL DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0',
`txt` VARCHAR(20000) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
CREATE TABLE `texts2` (
`id` BINARY(16) NOT NULL DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0',
`txt` VARCHAR(20000) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
CREATE TABLE `texts3` (
`id` BINARY(16) NOT NULL DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0',
`txt` VARCHAR(10000) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
That's 50,000 characters. Now your client application will have to manage breaking the text up into the separate chunks and creating the records in each table. Likewise, reading the text back in will require you to pull from all three tables (see the sketch below), but you will then have your 50,000 characters.
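A sketch of the read side, assuming the same id is used in all three tables:
SELECT CONCAT(COALESCE(t1.txt, ''), COALESCE(t2.txt, ''), COALESCE(t3.txt, '')) AS full_txt
FROM texts t1
LEFT JOIN texts2 t2 ON t2.id = t1.id
LEFT JOIN texts3 t3 ON t3.id = t1.id
WHERE t1.id = UNHEX('000102030405060708090A0B0C0D0E0F');   -- example 16-byte key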
It's just not at all recommended to do this with any database implementation.
I've worked in a few environments where large text was stored in columns in the database, and it always wound up causing more problems than it solved.
These should really be spooled to files on disk, and a reference to the full path to the file stored in the database.
Then run some indexing engine over this corpus of documents.
You will get greater scalability from this, and easier management.
Just to add for more clarity: if you are using a solution that definitely requires a long VARCHAR, like in my case when trying to configure WatchDog.NET to use a MySQL database for a .NET web API log,
you can sign into the MySQL database as the root user and then run:
SET GLOBAL sql_mode = ""

BLOB not storing utf-8 (Chinese) in MySQL

I have a table like this
CREATE TABLE account_data (
id BIGINT NOT NULL,
data BLOB NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
However, MySQL is not storing it as Chinese, but as some garbage values
like «å«¢ å æææ æ­¶ç·è
I checked everywhere and it says the CHARSET must be utf8, which is my case.
The MySQL version I am using is 5.6.14.
I Tried
ALTER TABLE account_data MODIFY data BLOB CHARACTER SET utf8 COLLATE utf8_unicode_ci;
but for some reason MySQL is giving a syntax error for BLOB.
If I do
insert into account_data (id, data)
VALUES (5952638508182497, "123456偟 滭滹漇 嶕憱撏 齞齝囃 熤熡");
and check the value in the TEXT viewer in MySQL Workbench, I can see 123456, but for the Chinese I am seeing garbage.
Thanks
You can try to set the character set of the DB and also of the column:
ALTER DATABASE <database_name> CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE <table_name> MODIFY <column_name> VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Hope this can help you!
For Chinese, you want CHARACTER SET utf8mb4, and optionally some COLLATION starting with utf8mb4_; the default is probably fine.
BLOB should work as you described it, but probably the insertion connection and the reading connection were configured differently. You mentioned Workbench; what charset is set up in it? Was that used for both reading and writing?
The sample gibberish you provided (which does not map correctly into any Chinese) looks like Mojibake, which is usually caused by having latin1 established for the connection. (This was the old default.)
You mentioned 5.6.14; was that used for both inserting and selecting?
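One way to check is to run the following in both the inserting session and the reading session (Workbench included) and compare the output; mismatched client/connection/results character sets are the classic cause of Mojibake:
SHOW VARIABLES LIKE 'character_set%';
SET NAMES utf8mb4;   -- declares the client-side character set for this connection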

Emoji is not stored properly in MySQL 5.6 with collation utf8mb4

I am trying to store emoji in the database on my server. I am using an AWS EC2 instance as the server; my server details are listed below:
OS: ubuntu0.14.04.1
MySQL version: 5.6.19-0ubuntu0.14.04.1 - (Ubuntu)
Database client version: libmysql - mysqlnd 5.0.11-dev - 20120503
I created a database test and table emoji in the server with following SQL:
CREATE DATABASE IF NOT EXISTS `test` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
USE `test`;
CREATE TABLE IF NOT EXISTS `emoji` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`text` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=1;
When I try to execute the following insert, a warning appears and the data is not stored properly:
INSERT INTO `test`.`emoji` (`id` , `text`) VALUES (NULL , '👆 👇 👈 👉');
Inserted row id: 3
Warning: #1366 Incorrect string value: '\xF0\x9F\x91\x86 \xF0...' for column 'text' at row 1
The value stored in the text column is: ???? ???? ???? ????
The same scenario works with my local database and the values are stored properly. Almost all configurations on my local machine are the same, except the OS (Windows).
I was able to recreate your issue using SqlWorkbench.
Your client most likely has established a connection to the DB whose character set does not match the character set of the table.
Run this statement before you run the insert statement to align the character set and collation of the connection:
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci
Hope this helps, character sets can be tricky.
Trying to save emojis in my existing database table using the following stack: Node.js 12.13.x, MySQL 5.6.
Workaround:
Either follow this solution,
or change the column data type to BLOB, i.e.
ALTER TABLE table_name CHANGE column column BLOB NULL
Hope this trick will work for you!
Migrating from MSSQL to MySQL using Workbench, I always got this problem.
Workbench already sets utf8mb4, and I was still getting the error.
Then I followed Haisum Usman's suggestion:
Set the column as BLOB in the generated migration SQL.
Migrate the data.
Change the column to LONGTEXT (see the sketch below)!
Lots of time invested to get this working.
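The last step of that list, as a sketch (the table and column names here are placeholders, not from the original migration):
ALTER TABLE migrated_table
MODIFY migrated_column LONGTEXT
CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;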

MYSQL: Can't Select values although I know they are there

I've been dealing with this problem in my MySQL database for several hours now. I work on OS X 10.8.4 and use the tool Sequel Pro to work with my database. The table I'm having trouble with looks like this:
CREATE TABLE `Descriptions` (
`id` int(11) unsigned zerofill NOT NULL AUTO_INCREMENT,
`company` varchar(200) DEFAULT NULL,
`overview` mediumtext,
`trade` mediumtext,
PRIMARY KEY (`id`))
ENGINE=InnoDB AUTO_INCREMENT=1703911 DEFAULT CHARSET=utf8;
I imported a csv file like this:
LOAD DATA LOCAL INFILE 'users/marc/desktop/descriptions kopie.txt'
INTO TABLE descriptions
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
(#dummy, company, overview, trade)
When I look at the data in my table now, everything looks the way I expect (SELECT * syntax). But I can't work with the data. When I try to select the company 'SISTERS', which I know exists, I get no results. Also, the fields "overview" and "trade" are not NULL when there's no data; they just contain an empty string. The other tables in the database work just fine with the imported data. Somehow MySQL just doesn't see the values as something to work with; it doesn't bother to read them.
What I tried so far:
- I used Text Wrangler to convert the CSV to TXT (UTF-8) and loaded it into the database, did not work
- I changed the fields into BLOB and back to varchar/mediumtext to force MySQL to do something with the data, did not work
- I tried to use the Sequel Pro import function, did not change anything
- I tried to make a new table and copy the old one into it, did not change anything
- I tried to force MySQL to change the values by using the CONCAT syntax (just adding random values which I could delete again later)
Could it have something to do with the collation settings? Could it have something to do with my regional settings (Switzerland) on OS X? Any other ideas? I would appreciate any help very much.
Kind Regards,
Marc
I was able to solve the problem. When I opened the CSV in Text Wrangler and made the invisible characters visible, it was full of red reversed question marks. Those sneaky bastards messed everything up. I don't know what they are, but they were the problem. I removed them with the "Zap Gremlins..." option.
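For anyone hitting the same thing without a text editor handy, a hex dump from inside MySQL is a quick way to spot invisible characters. This is only a sketch against the Descriptions table from the question; the REPLACE targets are common CSV culprits, not necessarily the exact character involved here:
SELECT id, company, HEX(company)
FROM Descriptions
WHERE company LIKE '%SISTERS%';   -- the wildcard match still hits rows where stray bytes surround the value

-- Carriage returns and UTF-8 byte order marks are frequent offenders; they can be stripped in place:
UPDATE Descriptions SET company = TRIM(REPLACE(REPLACE(company, '\r', ''), _utf8 x'EFBBBF', ''));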