I'm unable to load a row containing special characters into AWS Redshift.
Getting an error: String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: c8 4d (error 4)
The string causing the problem is: Crème (the è).
For a temporary fix, I am using:
copy dev.table (a, b, c, d)
from 's3://test-bucket/redshift_data_source/test_data.csv'
CREDENTIALS 'aws_access_key_id=xxxxxxxxxx;aws_secret_access_key=xxxxxxxxxxxx'
CSV delimiter ',' IGNOREHEADER 1 COMPUPDATE OFF acceptinvchars;
acceptinvchars lets the rows load into the varchar column, but it replaces those characters with ?. How can I load them as-is?
The best solution seems to be to convert your source data to UTF-8. It is currently saved using some other encoding.
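If the file turns out to be Latin-1/Windows-1252 (the usual suspect for a stray è), a small one-off conversion before uploading to S3 will do it. A minimal sketch, assuming Windows-1252 as the source encoding and using local file names as placeholders:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReencodeCsv {
    public static void main(String[] args) throws Exception {
        // Decode the raw bytes with the assumed source encoding (Windows-1252),
        // then write the same text back out as UTF-8 so Redshift's COPY accepts it.
        byte[] raw = Files.readAllBytes(Paths.get("test_data.csv"));
        String text = new String(raw, "windows-1252");
        Files.write(Paths.get("test_data_utf8.csv"), text.getBytes(StandardCharsets.UTF_8));
    }
}

Once the converted file is in S3, the COPY should succeed without needing ACCEPTINVCHARS at all.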
I have a CSV file and need to load it into a Greenplum DB.
My code looks like this:
CREATE TABLE usr_wrk.CAR(
brand varchar(255),
model varchar(255),
ID INTEGER
);
COPY usr_wrk.CAR FROM '...Car.csv' DELIMITER ',' CSV HEADER
But I get this error:
[22025] ERROR: invalid Unicode escape: Unicode escapes must be full-length: \uXXXX or \UXXXXXXXX.
Rows of the CSV file look, for example, like this:
Jaguar,XJ,1
Or
Citroen,C4,91
I replaced all non-Latin words, and there are no NULL or empty values, but the error still appears. Does anybody have thoughts on this?
P.S.
I don't have admin rights and can only create, drop, and manage tables in this schema.
You might try one of the following:
copy usr_wrk.car from '.../Car.csv' DELIMITER ',' ESCAPE as 'OFF' NULL as '' CSV HEADER;
OR
copy usr_wrk.car from '.../Car.csv' DELIMITER ',' ESCAPE as '\' NULL as '' CSV HEADER;
The default escape for CSV format is a double quote. Turning it off or setting it to the default TEXT-format escape (a backslash) may get you around this. You could also remove the CSV header from the file and declare it as a TEXT file with a comma delimiter, to avoid having to specify the ESCAPE character.
Are you sure there are no special characters around the car names? Thinking specifically of umlauts or grave accents that would make the data multibyte and trigger that error.
You might try doing: head Car.csv | od -c | more and see if any multibyte characters show up in your file (this assumes you are on a Linux system).
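If od isn't available, a quick script can do the same check. A minimal sketch (not from the original answer) that flags every non-ASCII byte in Car.csv:

import java.nio.file.Files;
import java.nio.file.Paths;

public class FindNonAscii {
    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Paths.get("Car.csv"));
        for (int i = 0; i < raw.length; i++) {
            int b = raw[i] & 0xFF;          // treat the byte as unsigned
            if (b > 0x7F) {                 // anything above 0x7F is not plain ASCII
                System.out.printf("non-ASCII byte 0x%02X at offset %d%n", b, i);
            }
        }
    }
}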
If it is possible for you, you might try using the GPLOAD utility to load the file. You can specify the ENCODING of the data file as 'LATIN1', which may get you past the UTF-8 error you are hitting.
Hope this helps.
Jim
I have a PHP script that extracts attachments (Unicode text CSV files) from Gmail and uploads them to a MySQL database. All of that works fine. But once the data is in the database, I cannot run the simplest of queries against it.
If I first bring the file into Excel and then export it as a CSV file, everything works fine: I can query and get the expected results.
I have done enough reading to understand (I think) that the issue is somehow related to the fact that Unicode text is either UTF-8 or UTF-16, but when I convert the table to either of those, the data comes in fine but I still cannot run a successful query.
Update:
I have an individual named White in the lastrep column of the data. The only way I can pull the associated records is by using wild cards between characters, as in:
SELECT * FROM `dailyactual` WHERE `lastrep` like "%W%h%i%t%e%"
Any help would be appreciated.
Jim
Use the utf8mb4 character set. Instructions: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-upgrading.html
In utf8 or utf8mb4 character set, 'White' would be 'White' (hex 57 68 69 74 65). In utf16, there would be (effectively) a zero byte between each character; hex: 0057 0068 0069 0074 0065.
Can you get a hex dump of part of the file?
If you can specify the output encoding from Excel, do so. Otherwise, specify the input encoding for MySQL as utf16, or whatever the encoding turns out to be. Since there are many ways of bringing a CSV file into MySQL, I can't be more specific.
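If the attachment does turn out to be UTF-16, one option is to re-encode it to UTF-8 before loading it, so a plain WHERE lastrep = 'White' matches again. A minimal sketch, with the file names as placeholders:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf16ToUtf8 {
    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Paths.get("dailyactual.csv"));
        // "UTF-16" honors a byte-order mark to pick big- or little-endian
        String text = new String(raw, "UTF-16");
        Files.write(Paths.get("dailyactual_utf8.csv"), text.getBytes(StandardCharsets.UTF_8));
    }
}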
I have a source file which contains Chinese characters. After loading that file into a table in a Postgres DB, all the characters are garbled and I'm not able to see the Chinese characters. The encoding of the Postgres DB is UTF-8. I'm using the psql utility on my local macOS machine to check the output. The source file was generated from a MySQL db using mysqldump and contains only INSERT statements.
INSERT INTO "trg_tbl" ("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col7",
"col8", "col9", "col10", "col11", "col12", "col13", "col14",
"col15", "col16", "col17", "col18", "col19", "col20", "col21",
"col22", "col23", "col24", "col25", "col26", "col27", "col28",
"col29", "col30", "col31", "col32", "col33")
VALUES ( 1, 1, '与é<U+009D>žç½‘_首页&频é<U+0081>“页顶部广告ä½<U+008D>(946×90)',
'通æ <U+008F>广告(Leaderboard Banner)',
0,3,'',946,90,'','','','',0,'f',0,'',NULL,NULL,NULL,NULL,NULL,
'2011-08-19 07:29:56',0,0,0,'',NULL,0,NULL,'CPM',NULL,NULL,0);
What can I do to resolve this issue?
The text was mangled before producing that SQL statement. You probably wanted the text to start with 与 instead of the "Mojibake" version: 与. I suggest you fix the dump either to produce utf8 characters or hex. Then the load may work, or there may be more places to specify utf8, such as SET NAMES or the equivalent.
Also, for Chinese, CHARACTER SET utf8mb4 is preferred in MySQL.
é<U+009D>ž is so mangled I don't want to figure out the second character.
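To illustrate the mechanism: 与 is simply the UTF-8 bytes of 与 decoded as Windows-1252. If regenerating the dump is not an option and the bytes survived intact, a repair along these lines may recover the text (a sketch, not a guaranteed fix -- fixing the dump remains the cleaner route):

import java.nio.charset.StandardCharsets;

public class FixMojibake {
    public static void main(String[] args) throws Exception {
        String mangled = "与";
        // Re-encode with the wrong charset to recover the original UTF-8 bytes,
        // then decode those bytes correctly.
        String repaired = new String(mangled.getBytes("windows-1252"), StandardCharsets.UTF_8);
        System.out.println(repaired);   // prints 与
    }
}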
I have a MySQL database in UTF-8 format, but the characters inside the database are ISO-8859-1 (ISO-8859-1 strings are stored in UTF-8). I've tried recode, but it only converted e.g. ü to ü. Does anybody out there have a solution?
If you tried to store ISO-8859-1 characters in a database which is set to UTF-8, you just managed to corrupt your "special characters" -- MySQL will retrieve the bytes from the database and try to assemble them as UTF-8 rather than ISO-8859-1. The only way to read the data correctly is to use a script which does something like:
// rs is a JDBC ResultSet positioned on a row of the affected table
ResultSet rs = ...
// fetch the raw bytes instead of letting the driver decode them
byte[] b = rs.getBytes( COLUMN_NAME );
// reassemble the bytes with the encoding they were actually written in
String s = new String( b, "ISO-8859-1" );
This ensures you get the raw bytes (which, from what you said, came from an ISO-8859-1 string), and then you can assemble them back into an ISO-8859-1 string.
There may be another problem as well -- what do you use to "view" the strings in the database? Could it be that your console doesn't have the right charset to display those characters, rather than the characters being stored wrongly?
NOTE: Updated the above after the last comment
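A more complete, self-contained sketch of the same approach (connection details, table, and column names here are placeholders, and the MySQL Connector/J driver is assumed to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadLatin1FromUtf8Db {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT name FROM mytable")) {
            while (rs.next()) {
                byte[] raw = rs.getBytes("name");           // bytes exactly as stored
                String s = new String(raw, "ISO-8859-1");   // reassemble as ISO-8859-1
                System.out.println(s);
            }
        }
    }
}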
I just went through this. The biggest part of my solution was exporting the database to .csv and doing a Find / Replace on the characters in question. The character at issue may look like a space, but copy it directly from the cell as your Find parameter.
Once this is done (and missing this is what took me all morning):
Save the file as CSV (MS-DOS)
Excellent post on the issue
Source of MS-DOS idea