I have a csv file and need to load it to Greenplum DB.
My code looks like this:
CREATE TABLE usr_wrk.CAR(
brand varchar(255),
model varchar(255),
ID INTEGER
);
COPY usr_wrk.CAR FROM '...Car.csv' DELIMITER ',' CSV HEADER
But I get this error:
[22025] ERROR: invalid Unicode escape: Unicode escapes must be full-length: \uXXXX or \UXXXXXXXX.
Rows of the csv file look, for example, like this:
Jaguar,XJ,1
Or
Citroen,C4,91
I replaced all non-Latin words, and there are no NULL or empty values, but the error still appears. Does anybody have thoughts on this?
P.S.
I don't have admin rights and can only create, drop, and alter tables in this schema.
You might try one of the following:
copy usr_wrk.car from '.../Car.csv' DELIMITER ',' ESCAPE as 'OFF' NULL as '' CSV HEADER;
OR
copy usr_wrk.car from '.../Car.csv' DELIMITER ',' ESCAPE as '\' NULL as '' CSV HEADER;
The default escape for CSV format is a double quote. Turning it off or setting it to the default TEXT-format escape (a backslash) may get you around this. You could also remove the CSV header from the file and declare it as a TEXT file with a comma delimiter to avoid having to specify the ESCAPE character.
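For example, a minimal sketch of that last route, assuming the header row has already been stripped out of Car.csv (the path is left as the placeholder from the question):
-- TEXT format is the default when CSV is not specified, so no HEADER/ESCAPE options are needed
COPY usr_wrk.car FROM '.../Car.csv' DELIMITER ',' NULL '';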
Are you sure there are no special characters around the car names? Thinking specifically of umlauts or grave accents that would make the data multibyte and trigger that error.
You might try doing: head Car.csv | od -c | more and see if any multibyte characters show up in your file (this assumes you are on a Linux system).
If it is possible for you to do so, you might try using the GPLOAD utility to load the file. You can specify the ENCODING of the data file as 'LATIN1', which may get you past the UTF error you are hitting.
Hope this helps.
Jim
Related
I receive a data file from the client in our ETL process, and we load the data into a MySQL database using the LOAD DATA INFILE functionality with CHARACTER SET utf8.
LOAD DATA LOCAL INFILE '${filePath}'
INTO TABLE test_staging
CHARACTER SET 'utf8'
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
(${testcolumns}) SET
first_name = @first_name;
Data from client
1|"test"|"name"|2
2|"asdf"|asdf&test|2
3|fun|value|2
When I load the above data into the database, the HTML entities are inserted literally as strings instead of being converted to the actual characters.
Database Data
id first_name last_name
1 "test" "name"
2 "asdf" asdf&test
3 fun value
I tried changing the CHARACTER SET value from utf8 to latin1, but the result is the same.
I also tried replacing the special characters while loading the data into the database, but the issue is that I receive all kinds of HTML entities in the file; I cannot keep adding a REPLACE call for every one of them.
LOAD DATA LOCAL INFILE '${filePath}'
INTO TABLE test_staging
CHARACTER SET 'utf8'
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
(${testcolumns}) SET
first_name = REPLACE(REPLACE(REPLACE(first_name,'&#39;','\''),'&quot;','"'),'&amp;','&');
Is there any character set which converts the html data and loads correctly?
Expected Database Data
id first_name last_name
1 "test" "name"
2 "asdf" asdf&test
3 fun value
Any help is appreciated... Thanks
The problem you are facing is not about character sets. It happens because the software your client uses intentionally converts HTML special characters to their entity codes.
It is probably possible to convert them back using MySQL, though I couldn't find a quick solution; but as you are handling this data with ETL, the better option seems to be to use an external tool before you insert the data into the database. One of these, for example:
cat input-with-specialchars.html | recode html..ascii
xmlstarlet unesc
perl -MHTML::Entities -pe 'decode_entities($_);'
etc.
or something else depending on what tools you have available in your system or which ones you can afford to install.
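If an external tool is not an option, a limited MySQL-side fallback (just a sketch, and as the question notes it does not scale to arbitrary entities) is a post-load UPDATE with nested REPLACE calls for the entities you actually receive, decoding &amp; last:
-- assumes the staging columns are named as in the sample output above
UPDATE test_staging
SET first_name = REPLACE(REPLACE(REPLACE(first_name, '&quot;', '"'), '&#39;', '\''), '&amp;', '&'),
    last_name  = REPLACE(REPLACE(REPLACE(last_name,  '&quot;', '"'), '&#39;', '\''), '&amp;', '&');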
I've been trying to load a csv file with the following row in it:
91451960_NE,-1,171717198,50075943,"MARTIN LUTHER KING, JR WAY",1,NE
Note the comma in the name. I've tried all permutations of REMOVEQUOTES, DELIMITER ',', etc... and none of them work.
I have other rows with quotes in the middle of the name, so the ESCAPE option has to be there as well.
According to other posts,
DELIMITER ',' ESCAPE REMOVEQUOTES IGNOREHEADER 1;
should work but does not. Redshift gives a "Delimiter not found" error.
Is the ESCAPE causing issues and do I have to escape the comma?
I have tried loading your data using CSV as data format parameter and this worked for me. Please keep in mind that CSV cannot be used with FIXEDWIDTH, REMOVEQUOTES, or ESCAPE.
create TEMP table awscptest (a varchar(40),b int,c bigint,d bigint,e varchar(40),f int,g varchar(10));
copy awscptest from 's3://sds-dev-db-replica/test.txt'
iam_role 'arn:aws:iam::<accountID>:<IAM_role>'
delimiter as ',' EMPTYASNULL CSV NULL AS '\0';
References: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html
http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#load-from-csv
This is a commonly recurring question. If you are actually using the CSV format for your files (not just some ad hoc text file that uses commas), then you need to enclose the field in double quotes. If you have commas and quotes, then you need to enclose the field in double quotes and escape the double quotes in the field data.
There is a definition of the CSV file format - RFC 4180. All text characters can be represented correctly in CSV if you follow the format.
https://www.ietf.org/rfc/rfc4180.txt
Use the CSV option to the Redshift COPY command, not just TEXT with a DELIMITER of ','. Redshift will also follow the official file format if you tell it that the file is CSV.
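A minimal sketch of that (the table name, bucket path, and role ARN below are placeholders, not values from the question):
COPY my_table
FROM 's3://my-bucket/my_file.csv'
IAM_ROLE 'arn:aws:iam::<accountID>:role/<role_name>'
CSV
IGNOREHEADER 1;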
In this case, you have a comma (,) in the name field. Clean the data by removing that comma before loading to Redshift, for example with PySpark:
df = (df.withColumn('name', F.regexp_replace(F.col('name'), ',', ' ')))
Store the new dataframe in S3 and then use the below COPY command to load it into Redshift:
COPY table_name
FROM 's3 path'
IAM_ROLE 'iam role'
DELIMITER ','
ESCAPE
IGNOREHEADER 1
MAXERROR AS 5
COMPUPDATE FALSE
ACCEPTINVCHARS
ACCEPTANYDATE
FILLRECORD
EMPTYASNULL
BLANKSASNULL
NULL AS 'null';
I'm trying to load a CSV file to Redshift.
Delimiter '|'
The first row of the CSV:
1 |Bhuvi|"This is ok"|xyz@domain.com
I used this command to load.
copy tbl from 's3://datawarehouse/source.csv'
iam_role 'arn:aws:iam:::role/cas-pulse-redshift'
delimiter '|'
removequotes
ACCEPTINVCHARS ;
ERROR:
raw_field_value | This is ok" |xyz@domain.com
err_code | 1214
err_reason | Delimited value missing end quote
then I tried this too.
copy tbl from 's3://datawarehouse/source.csv'
iam_role 'arn:aws:iam:::role/xxx'
CSV QUOTE '\"'
DELIMITER '|'
ACCEPTINVCHARS ;
Disclaimer - Even though this post does not answer the question asked here, I am posting this analysis in case it helps some one.
The error "Delimited value missing end quote" can be reported in cases where a quoted text column is missing the end quote, or if the text column value has a new line in the value itself. In my case, there was a newline in the text column value.
As per RFC 4180 the specification of CSV says,
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes.
For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
So a valid CSV can have multi-line rows, and the correct way to import it into Redshift is to specify the CSV format option. This also assumes that any column value containing the quote character has that quote character escaped by another, preceding quote character, as per the CSV RFC specification:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote.
For example:
"aaa","b""bb","ccc"
If the file we are trying to import is not a valid CSV and is merely named with a .csv extension, as may well be the case, then we have the following options.
Try copying the file without specifying the CSV option, and fine-tune the delimiter, escape, and quoting behaviour with the corresponding COPY options.
If no set of options can copy the data consistently, then pre-process the file to make it consistent.
In general, the behaviour is more deterministic if we export and import data in formats that are consistent.
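For the valid-CSV case described above, the COPY ends up looking much like the second attempt in the question, for example (the role ARN is shortened to a placeholder):
COPY tbl
FROM 's3://datawarehouse/source.csv'
IAM_ROLE 'arn:aws:iam::<accountID>:role/<role_name>'
CSV
QUOTE '"'
DELIMITER '|';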
Data:
1|\N|"First\Line"
2|\N|"Second\Line"
3|100|\N
\N represents NULL in MySQL and MariaDB.
I'm trying to load the above data into a table named ID_OPR using the LOAD DATA LOCAL INFILE method.
Table structure:
CREATE TABLE ID_OPR (
idnt decimal(4),
age decimal(3),
comment varchar(100)
);
My code looks like below:
LOAD DATA LOCAL INFILE <DATA FILE LOCATION> INTO TABLE <TABLE_NAME> FIELDS TERMINATED BY '|' ESCAPED BY '' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';
The problem with this code is that it aborts with the error Incorrect decimal value: '\\N' for column <column name>.
Question:
How can I load this data with NULL values in the second (decimal) column, and without losing the \ (backslash) in the third (string) column?
I'm trying this in MariaDB, which is similar to MySQL in most cases.
Update:
The error I mentioned appears as a warning and the data actually gets loaded into the table. But the catch here is with the text data.
For example, in the case of the third record above, \N is loaded into the string column as the literal string \N, but I want it to be NULL.
Is there any way to make it recognize this null value? Something like DECODE in Oracle?
You can't have it both ways - either \ is an escape character or it is not. From MySQL docs:
If the FIELDS ESCAPED BY character is empty, no characters are escaped and NULL is output as NULL, not \N. It is probably not a good idea to specify an empty escape character, particularly if field values in your data contain any of the characters in the list just given.
So, I'd suggest a consistently formatted input file, however that was generated:
use \\ if you want to keep the backslash in the strings (see the sketch after this list)
make \ an escape character in your load command
OR
make strings always, not optionally, enclosed in quotes
leave escape character empty, as is
use NULL for nulls, not \N
BTW, this also explains the warnings you were experiencing loading \N in your decimal field.
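A minimal sketch of that first approach, assuming the file has been regenerated with \\ wherever a literal backslash is wanted (the file path below is a placeholder):
-- data rewritten as, e.g.:  1|\N|"First\\Line"
LOAD DATA LOCAL INFILE '/path/to/data.txt'
INTO TABLE ID_OPR
FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n';
-- \N now loads as NULL, and \\ loads as a single backslash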
Deal with nulls as blanks; that should fix it.
1||"First\Line"
2||"Second\Line"
3|100|
That's how nulls are handled in CSVs and TSVs. And don't expect the decimal datatype to go NULL, as it stays 0; use int or bigint instead if needed. You should forget about ESCAPED BY; as long as string data is enclosed by "" that deals with the escaping problem.
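If you take the blank-for-null route and still want a real NULL in the decimal column, one MariaDB/MySQL-level option (a sketch, not taken from either answer above) is to read the field into a user variable and convert blanks in a SET clause:
LOAD DATA LOCAL INFILE '/path/to/data.txt'    -- placeholder path
INTO TABLE ID_OPR
FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"' ESCAPED BY ''
LINES TERMINATED BY '\n'
(idnt, @age, comment)
SET age = NULLIF(@age, '');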
We need three text files and one batch file for LOAD DATA:
Suppose your file location is 'D:\loaddata' and your text file is 'D:\loaddata\abc.txt'. Create:
1. D:\loaddata\abc.bad -- empty
2. D:\loaddata\abc.log -- empty
3. D:\loaddata\abc.ctl
a. Write the code below for no separator (fixed-width positions):
OPTIONS ( SKIP=1, DIRECT=TRUE, ERRORS=10000000, ROWS=5000000)
load data
infile 'D:\loaddata\abc.txt'
TRUNCATE
into table Your_table
(
a_column POSITION (1:7) char,
b_column POSITION (8:10) char,
c_column POSITION (11:12) char,
d_column POSITION (13:13) char,
f_column POSITION (14:20) char
)
b. Write the code below for a comma separator:
OPTIONS ( SKIP=1, DIRECT=TRUE, ERRORS=10000000, ROWS=5000000)
load data
infile 'D:\loaddata\abc.txt'
TRUNCATE
into table Your_table
FIELDS TERMINATED BY ","
TRAILING NULLCOLS
(a_column,
b_column,
c_column,
d_column,
e_column,
f_column
)
4. D:\loaddata\abc.bat -- write the code below:
sqlldr db_user/db_password@your_tns control=D:\loaddata\abc.ctl log=D:\loaddata\abc.log
After double-clicking the "D:\loaddata\abc.bat" file, your data will be loaded into the desired Oracle table. If anything goes wrong, check your "D:\loaddata\abc.bad" and "D:\loaddata\abc.log" files.
I am trying to insert a delimited txt file into MySQL, but it seems there is something wrong with the encoding. I get Error code 1366: Incorrect string value in MySQL when I try to insert. When I open the txt file, the line that caused the error looks like this.
Any idea how I can insert this data?
You need to escape special characters - something like \' for a quote, and so on.
Somehow hex 9C got into the file. In latin1, that is œ.
Plan A:
Edit the file to remove it. The text around there is all in uppercase English, so I suspect it was not supposed to be œ.
Plan B:
Use the charset option to set the incoming stream to latin1. Then deal with œ later (or not).
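If the file is being loaded with LOAD DATA (an assumption - the question does not show the exact insert statement), Plan B would look roughly like this, with the path, table name, and delimiter as placeholders:
LOAD DATA LOCAL INFILE '/path/to/file.txt'   -- placeholder path
INTO TABLE my_table                          -- hypothetical table name
CHARACTER SET latin1                         -- declare the incoming bytes as latin1
FIELDS TERMINATED BY '\t'                    -- adjust to the file's actual delimiter
LINES TERMINATED BY '\n';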