I am reading data from a CSV text file using ColdFusion and inserting it into a table. The database is UTF-8, and so is the table.
This string •Detroit Diesel series-60 engine keeps getting stored in the Description field as
â€¢Detroit Diesel series-60 engine. (This is what comes back from the database itself, not just how it is displayed in the browser.)
I can manually insert the string into a new record from the command line, and the characters are preserved correctly, so UTF-8 clearly supports the bullet character. What could I be doing wrong?
Datasource connection string:
this.datasources["blabla"] = {
class: 'org.gjt.mm.mysql.Driver'
, connectionString: 'jdbc:mysql://localhost:3306/blabla?useUnicode=true&characterEncoding=UTF-8&jdbcCompliantTruncation=true&allowMultiQueries=false&useLegacyDatetimeCode=true'
, username: 'nottellingyou'
, password: "encrypted:zzzzzzz"
};
CREATE TABLE output, minus several columns
CREATE TABLE `autos` (
`VIN` varchar(30) NOT NULL,
`Description` text,
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
In addition, I've run
ALTER TABLE blabla.autos
MODIFY description TEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Full code of import file here: https://gist.github.com/mborn319/c40573d6a58f88ec6bf373efbbf92f29
CSV file here. See line 7: http://pastebin.com/fM7fFtXD
In my CFML script, I tried dumping the data as @Leigh and @Rick James suggested, and saw that the characters were already garbled before the insert into MySQL. Based on this, I realized I needed to specify the charset when reading the file:
<cffile
action="read"
file="#settings.csvfile#"
variable="autodata"
charset="utf-8">
Result: •Detroit Diesel series-60 engine. This now inserts into the database correctly.
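To verify what is actually stored, rather than trusting how a client renders it, you can look at the raw bytes in the column. A minimal check, assuming the table from the question; the VIN value is a placeholder:
-- If the bullet is stored correctly, the hex should begin with E2 80 A2 (UTF-8 for •)
SELECT Description, HEX(Description)
FROM autos
WHERE VIN = 'SOMEVIN123';   -- placeholder VIN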
Related
Server version: 10.8.3-MariaDB
Server charset: UTF-8 Unicode (utf8mb4)
InnoDB
I'm getting an error trying to import into a blank db (db is already created, just trying to import now):
ERROR 1366 (22007) at line 19669: Incorrect string value: '\x92t' for column glen_wazzup.nuke_bbsearch_wordlist.word_text at row 1
The SQL:
CREATE TABLE `nuke_bbsearch_wordlist` (
`word_text` varchar(50) binary NOT NULL default '',
`word_id` mediumint(8) unsigned NOT NULL auto_increment,
`word_common` tinyint(1) unsigned NOT NULL default '0',
PRIMARY KEY (`word_text`),
KEY `word_id` (`word_id`)
) ENGINE=InnoDB AUTO_INCREMENT=18719 ;
Line 19669 (error line):
INSERT INTO `nuke_bbsearch_wordlist` VALUES (0x6469646e9274, 6895, 0);
From my reading, this has something to do with character encoding: the character is an apostrophe, and the wires are getting crossed somewhere. I've read that you can use an ALTER statement, but this is a raw SQL import file that can't be imported yet, so I'm not sure how (or exactly what) to change in the file so that it will import.
The word is didn’t -- note that the apostrophe is not the ASCII character, but hex 92 if encoded in latin1 (and several other character sets), or E28099 if encoded in utf8 or utf8mb4.
On the other hand, you have stated "Server charset: UTF-8 Unicode (utf8mb4)", but 0x92 on its own is not valid UTF-8.
You are trying to import? How? From what? From mysqldump? From a CSV file? You have an INSERT statement; does that come from a dump?
In any case, it would probably be correct to state that the file is in "Character set latin1".
The collation is not important.
The solution may be as easy as converting your import source file from ISO-8859-1 to UTF-8 encoding.
To do the conversion on Linux, you can run recode l1..u8 <filename >filename.out (if installed) or iconv -f ISO-8859-1 -t UTF-8 -o filename.out filename. And then import filename.out to MySQL.
However, the source encoding may be different from ISO-8859-1 (e.g. it may be ISO-8859-2), so you may want to try multiple source encodings, and check which output file looks right (e.g. by looking at non-ASCII characters in filename.out).
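To see this concretely, MySQL itself can decode the hex literal from the failing INSERT under the two interpretations. This is purely illustrative and can be run from any client:
-- Interpret the failing value as latin1/cp1252: 0x92 decodes to a right single quote
SELECT CONVERT(0x6469646E9274 USING latin1) AS as_latin1;
-- The same apostrophe transcoded to utf8mb4 becomes the 3-byte sequence E2 80 99
SELECT HEX(CONVERT(CONVERT(0x6469646E9274 USING latin1) USING utf8mb4)) AS as_utf8_hex;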
I am trying to set up a database to store string data that is in multiple languages and includes Chinese characters, among many others.
Steps I have taken so far (roughly sketched below):
I have created a schema which uses utf8mb4 character set and utf8mb4_unicode_ci collation.
I have created a table which includes CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; at the end of the CREATE statement.
I am attempting to LOAD DATA INFILE from a CSV file with CHARACTER SET utf8mb4 specified in the LOAD statement.
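For reference, the statements for those three steps presumably look roughly like this sketch; the schema, table, and file names are placeholders, only company_name is from the error message:
CREATE SCHEMA mydb
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

CREATE TABLE mydb.companies (
  company_name VARCHAR(255)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

LOAD DATA INFILE '/path/to/data.csv'   -- placeholder path
  INTO TABLE mydb.companies
  CHARACTER SET utf8mb4                -- declares how the file itself is encoded
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  IGNORE 1 LINES;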
However, I am receiving an error Error Code: 1366. Incorrect string value: '\xCE\x09DIS' for column 'company_name' at row 43630.
Did it successfully parse 43629 rows? Then croak on that row? It may actually be garbage in the file.
Do you know what that company name should be? What does the rest of the line say?
Do you have another example? Remove that one line and run the LOAD again.
CE can be interpreted by any 1-byte charset, but not necessarily in a meaningful way.
09 is the "tab" character in virtually all charsets; is it reasonable to have a tab in a company name??
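One way to track down exactly which bytes are on that line is to load into a staging copy of the table whose text column is VARBINARY, so nothing can be rejected, and then look at the hex. A sketch only; the table name, column length, and LOAD options are placeholders standing in for whatever the real ones are:
-- Staging copy whose text column accepts any byte sequence
CREATE TABLE companies_staging LIKE companies;
ALTER TABLE companies_staging MODIFY company_name VARBINARY(255);

LOAD DATA INFILE '/path/to/data.csv'   -- same file and options as the failing load
  INTO TABLE companies_staging
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  IGNORE 1 LINES;

-- Locate the row(s) containing the suspicious CE 09 byte pair and show the raw hex
SELECT HEX(company_name)
FROM companies_staging
WHERE INSTR(company_name, 0xCE09) > 0;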
I am using mysqldump to back up a table. The schema is as follows:
CREATE TABLE `student` (
`ID` bigint(20) unsigned DEFAULT NULL,
`DATA` varbinary(64) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I can use the following command to back up the data in the table:
mysqldump -uroot -p123456 tdb > dump.sql
Now I want to write my own code using the MySQL C interface to generate a file similar to dump.sql.
So I just
read the data and store it in a char *p (using mysql_fetch_row);
write the data into the file using fprintf(f, "%s", p);
However, when I check the table fields written into the file, I find that the file generated by mysqldump and the one generated by my own program are different. For example,
one data field in the file generated by mysqldump
'[[ \\^X\í^G\ÑX` C;·Qù^Dô7<8a>¼!{<96>aÓ¹<8c> HÀaHr^Q^^½n÷^Kþ<98>IZ<9f>3þ'
one data field in the file generated by my program
[[ \^Xí^GÑX` C;·Qù^Dô7<8a>¼!{<96>aÓ¹<8c> HÀaHr^Q^^½n÷^Kþ<98>IZ<9f>3þ
So, my question is: why is writing the data out with fprintf(f, "%s", p) not a correct way to back it up? Is it enough to just add ' at the front and end of the string? If so, what if the data in that field happens to contain ' itself?
Also, I wonder what it means to write some unprintable characters into a text file.
Also, I read stackoverflow.com/questions/16559086 and tried the --hex-blob option. Is it OK if I transform every byte of the binary data into hex form and then write plain text strings into dump.sql?
Then, instead of getting
'[[ \\^X\í^G\ÑX` C;·Qù^Dô7<8a>¼!{<96>aÓ¹<8c> HÀaHr^Q^^½n÷^Kþ<98>IZ<9f>3þ'
I got something like
0x5B5B095C18ED07D1586009433BB751F95E44F4378ABC217B9661D3B98C0948C0614872111EBD6EF70BFE98495A9F33FE
All the characters are printable now!
However, if I choose this method, I wonder whether I will run into problems when I use encoding schemes other than latin1.
Also, the above are all my own ideas; I also wonder whether there are other ways to back up the data using the C interface.
Thank you for your help!
latin1, utf8, etc are CHARACTER SETs. They apply to TEXT and VARCHAR columns, not BLOB and VARBINARY columns.
Using --hex-blob is a good idea.
If you have "unprintable characters" in TEXT or CHAR, then either you have been trying to put a BLOB into such a column -- naughty -- or the print mechanism is not set for the appropriate charset.
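The reason --hex-blob works is that a hex literal round-trips arbitrary bytes with no quoting, escaping, or character-set interpretation. A small sketch against the student table above; the hex value is shortened here purely for illustration:
-- What mysqldump --hex-blob effectively writes for a VARBINARY value
INSERT INTO student (ID, DATA) VALUES (1, 0x5B5B095C18ED07D1586009);

-- Reading it back as hex shows the identical bytes; no quoting or escaping is involved
SELECT ID, HEX(DATA) FROM student WHERE ID = 1;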
I have twenty pipe-delimited text files that I would like to convert into a MySQL database. The manual that came with the data says:
Owing to the difficulty of displaying data for characters outside of
standard Latin Character Sets, all data is displayed using Unicode
(UCS-2) character encoding. All CSV files are structured using
commercial standards with the preferred format being pipe delimiter
(“|”) and carriage return + line feed (CRLF) as row terminators.
I am using MySQL Workbench 6.2.5 on Win 8.1, but the manual provides example SQL Server scripts to create the twenty tables. Here's one.
/****** Object: Table [dbo].[tbl_Company_Profile_Stocks] Script Date:
12/12/2007 08:42:05 ******/
CREATE TABLE [dbo].[tbl_Company_Profile_Stocks](
[BoardID] [int] NULL,
[BoardName] [nvarchar](255) NULL,
[ClientCompanyID] [int] NULL,
[Ticker] [nvarchar](255) NULL,
[ISIN] [nvarchar](255) NULL,
[OrgVisible] [nvarchar](255) NULL
)
Which I adjust as follows for MySQL.
/****** Object: Table dbo.tbl_Company_Profile_Stocks Script Date:
12/12/2007 08:42:05 ******/
CREATE TABLE dbo.tbl_Company_Profile_Stocks
(
BoardID int NULL,
BoardName varchar(255) NULL,
ClientCompanyID int NULL,
Ticker varchar(255) NULL,
ISIN varchar(255) NULL,
OrgVisible varchar(255) NULL
);
Because the manual says that the flat files are UCS-2, I set the dbo schema to a UCS-2 default collation when I create it. This works fine AFAIK. It is the LOAD DATA INFILE that fails. Because the data are pipe-delimited with CRLF line endings, I try the following.
LOAD DATA LOCAL INFILE 'C:/Users/Richard/Dropbox/Research/BoardEx_data/unzipped/Company_Profile_Stocks20100416.csv'
INTO TABLE dbo.tbl_company_profile_stocks
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
But in this case no rows are imported and the message is 0 row(s) affected Records: 0 Deleted: 0 Skipped: 0 Warnings: 0. So I try \n line endings instead. This imports something, but my integer values become zeros and the text becomes very widely spaced. The message is 14121 row(s) affected, 64 warning(s): 1366 Incorrect integer value: <snip> Records: 14121 Deleted: 0 Skipped: 0 Warnings: 28257.
If I open the flat text file in Sublime Text 3, the Encoding Helper package suggests that the file has UTF-16 LE with BOM encoding. If I repeat the above with UTF-16 default collation when I create the dbo schema, then my results are the same.
How can I fix this? Encoding drives me crazy!
Probably the main problem is that the LOAD DATA needs this clause (see the LOAD DATA reference manual):
CHARACTER SET ucs2
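That is, the statement from the question with the clause added (though, as the asker notes further down, LOAD DATA turned out not to accept ucs2 files at all):
LOAD DATA LOCAL INFILE 'C:/Users/Richard/Dropbox/Research/BoardEx_data/unzipped/Company_Profile_Stocks20100416.csv'
INTO TABLE dbo.tbl_company_profile_stocks
CHARACTER SET ucs2          -- declares how the file itself is encoded
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;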
In case that does not suffice, ...
Can you get a hex dump of a little of the csv file? I want to make sure it is really ucs2. (ucs2 is very rare. Usually text is transferred in utf8.) If it looks readable when you paste text into this forum, then it is probably utf8 instead.
There is no "dbo" ("database owner"), only database, in MySQL.
Please provide SHOW CREATE TABLE tbl_Company_Profile_Stocks
(just a recommendation) Don't prefix table names with "tbl_"; it does more to clutter than to clarify.
Provide a PRIMARY KEY for the table (see the sketch below).
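Putting the last two recommendations together, a MySQL version of that table might look roughly like this; the surrogate key and the character set are illustrative choices, not requirements:
CREATE TABLE company_profile_stocks (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,   -- surrogate key; use a natural key if one exists
  BoardID INT NULL,
  BoardName VARCHAR(255) NULL,
  ClientCompanyID INT NULL,
  Ticker VARCHAR(255) NULL,
  ISIN VARCHAR(255) NULL,
  OrgVisible VARCHAR(255) NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;     -- or whichever character set you settle on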
@Rick James had the correct answer (i.e., set the encoding for LOAD DATA with the CHARACTER SET option). But in my case this didn't work, because MySQL doesn't support loading UCS-2 data files.
Note
It is not possible to load data files that use the ucs2 character set.
Here are a few approaches that work. In the end I went with SQLite rather than MySQL, but the last solution should work with MySQL, or any other DB that accepts flat files.
SQLiteStudio
SQLiteStudio was the easiest solution in this case. I prefer command line solutions, but the SQLiteStudio GUI accepts UCS-2 encoding and any delimiter. This keeps the data in UCS-2.
Convert to ASCII in Windows command line
The easiest conversion to ASCII is in the Windows command line with TYPE.
for %%f in (*.csv) do (
echo %%~nf
type "%%~nf.csv" > "%%~nf.txt"
)
This may cause problems with special characters. In my case it left in single and double quotes that caused some problems with the SQLite import. This is the crudest approach.
Convert to ASCII in Python
import codecs
import glob
import os

# The converted files go into a 'converted' subdirectory; create it if needed
if not os.path.isdir('converted'):
    os.makedirs('converted')

for fileOld in glob.glob('*.csv'):
    print 'Reading: %s' % fileOld
    fileNew = os.path.join('converted', fileOld)
    # Read UTF-16 LE input; write ASCII output, silently dropping characters
    # that cannot be represented in ASCII
    with codecs.open(fileOld, 'r', encoding='utf-16le') as old, \
         codecs.open(fileNew, 'w', encoding='ascii', errors='ignore') as new:
        print 'Writing: %s' % fileNew
        for line in old:
            # Strip single and double quotes, which caused problems on import
            new.write(line.replace("'", '').replace('"', ''))
This is the most extensible approach and gives you more precise control over which data you convert or retain.
When I run this at the MySQL command line, it works fine:
INSERT INTO MYTABLE VALUES(NULL,101942,'2015-05-08','sähkötupakalle');
The 'ä' and the 'ö' end up in the MySQL varchar column just fine.
However, when I put the same data in a file, and use
LOAD DATA LOCAL INFILE
then the 'ä' and the 'ö' get mangled, and I end up with data in the MySQL varchar column that looks like this:
sÃ¤hkÃ¶tupakalle
Any ideas for how I can get these characters to load correctly using "LOAD DATA LOCAL INFILE" ?? FYI, my table has CHARSET=utf8.
Apparently the file you are loading is correctly encoded with utf8? But you did not include the CHARACTER SET utf8 clause?
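In other words, something along these lines; the path and the FIELDS/LINES options are placeholders, the CHARACTER SET clause is the point:
LOAD DATA LOCAL INFILE '/path/to/data.csv'   -- placeholder path
INTO TABLE MYTABLE
CHARACTER SET utf8        -- tells MySQL the encoding of the file itself, independent of the connection charset
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';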
Symptom of "Mojibake":
When SELECTing the text, each non-English character is replaced by 2-3 characters that you could call gibberish or garbage.
How you got in the mess:
The client's bytes to be INSERTed into the table were encoded as utf8 (good), and
The charset for the connection was latin1 (eg, via SET NAMES latin1), and
The table column was declared CHARACTER SET latin1
How to fix the text and the table:
Do the 2-step ALTER:
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.
That converts the column definition while leaving the bits alone.
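As a concrete (hypothetical) instance of that template -- say the column is a VARCHAR(100) NOT NULL:
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(100) NOT NULL;                   -- step 1: stop interpreting the bytes as latin1
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(100) CHARACTER SET utf8 NOT NULL;  -- step 2: relabel the same bytes as utf8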
How to fix the code (in general):
Change the client's declaration of charset to utf8 - via SET NAMES utf8 or equivalent.
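A quick way to confirm what the connection is actually declaring after the change:
SET NAMES utf8;
-- character_set_client, character_set_connection and character_set_results should now all report utf8
SHOW VARIABLES LIKE 'character_set%';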