Redshift loading CSV with commas in a text field

I've been trying to load a csv file with the following row in it:
91451960_NE,-1,171717198,50075943,"MARTIN LUTHER KING, JR WAY",1,NE
Note the comma in the name. I've tried all permutations of REMOVEQUOTES, DELIMITER ',', etc... and none of them work.
I have other rows with quotes in the middle of the name, so the ESCAPE option has to be there as well.
According to other posts,
DELIMITER ',' ESCAPE REMOVEQUOTES IGNOREHEADER 1;
should work but does not. Redshift gives a "Delimiter not found" error.
Is the ESCAPE causing issues and do I have to escape the comma?

I have tried loading your data using CSV as the data format parameter, and it worked for me. Please keep in mind that CSV cannot be used with FIXEDWIDTH, REMOVEQUOTES, or ESCAPE.
create TEMP table awscptest (a varchar(40),b int,c bigint,d bigint,e varchar(40),f int,g varchar(10));
copy awscptest from 's3://sds-dev-db-replica/test.txt'
iam_role 'arn:aws:iam::<account_id>:role/<iam_role>'
delimiter as ',' EMPTYASNULL CSV NULL AS '\0';
References: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html
http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#load-from-csv

This is a commonly recurring question. If you are actually using the CSV format for your files (not just some ad hoc text file that uses commas) then you need to enclose the field in double quotes. If you have commas and quotes, then you need to enclose the field in double quotes and escape the double quotes in the field data.
There is a definition for the CSV file format, RFC 4180. All text characters can be represented correctly in CSV if you follow the format.
https://www.ietf.org/rfc/rfc4180.txt
Use the CSV option with the Redshift COPY command, not just TEXT with a delimiter of ','. Redshift will also follow the official file format if you tell it that the file is CSV.
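For illustration, a minimal Python sketch (using the standard csv module; not part of the original answer) that writes the row from the question with RFC 4180 quoting:
import csv
import sys

# The street name contains a comma, so csv.writer encloses that field
# in double quotes; an embedded double quote would be doubled instead.
row = ["91451960_NE", -1, 171717198, 50075943,
       "MARTIN LUTHER KING, JR WAY", 1, "NE"]
csv.writer(sys.stdout).writerow(row)
# -> 91451960_NE,-1,171717198,50075943,"MARTIN LUTHER KING, JR WAY",1,NE
A file produced this way loads cleanly with COPY ... CSV, as in the accepted answer above.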

In this case, you have a comma (,) in the name field. Clean the data by removing that comma before loading it into Redshift.
from pyspark.sql import functions as F
df = df.withColumn('name', F.regexp_replace(F.col('name'), ',', ' '))
Store the new dataframe in S3, then use the COPY command below to load it into Redshift (a sketch of the S3 write step follows the COPY).
COPY table_name
FROM 's3 path'
IAM_ROLE 'iam role'
DELIMITER ','
ESCAPE
IGNOREHEADER 1
MAXERROR AS 5
COMPUPDATE FALSE
ACCEPTINVCHARS
ACCEPTANYDATE
FILLRECORD
EMPTYASNULL
BLANKSASNULL
NULL AS 'null';
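As for the "store the new dataframe in S3" step, a minimal PySpark sketch (the bucket/path and write options are placeholders, not from the original answer):
# Hypothetical S3 output path; replace with your own bucket/prefix.
output_path = 's3://your-bucket/cleaned/'
# Write the cleaned dataframe as CSV with a header row, so the COPY's
# IGNOREHEADER 1 option skips it on load.
df.write.mode('overwrite').option('header', 'true').csv(output_path)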

Related

Informix 'UNLOAD TO' generates backslashes when I export table as CSV

Whenever I try to export a table from an Informix database to a CSV file, I find that the generated file contains backslashes. This is the query I used:
UNLOAD TO 'C:/Documents and Settings/XXXX/XXXX/test.txt' DELIMITER '|'
select * from xxx
This is an example of the results I get in the CSV file
A|B|C|D|E|F\
This\
Is\
SOME\
TEXT\
|
A2|B2|C3|D4|E5|F6
If anyone knows how to resolve this, I would really appreciate it.
This is because there are newline characters in the values of this column.
You can remove the newlines with replace function.
First you must enable newlines in quoted strings by running this stored procedure:
EXECUTE PROCEDURE IFX_ALLOW_NEWLINE('T');
Then you can use replace to remove (or change to another character) the newlines from the column that has them (in this example it is column3):
UNLOAD TO 'C:/Documents and Settings/XXXX/XXXX/test.txt' DELIMITER '|'
SELECT column1, column2, replace(column3, "
", "")
FROM xxx
Note that in the call to the replace function there is only a newline between the two quotes of the second parameter; the third parameter is the value with which you want to replace the newlines.

Using \COPY to load CSV with JSON fields into Postgres

I'm attempting to load TSV data from a file into a Postgres table using the \COPY command.
Here's an example data row:
2017-11-22 23:00:00 "{\"id\":123,\"class\":101,\"level\":3}"
Here's the psql command I'm using:
\COPY bogus.test_table (timestamp, sample_json) FROM '/local/file.txt' DELIMITER E'\t'
Here's the error I'm receiving:
ERROR: invalid input syntax for type json
DETAIL: Token "sample_json" is invalid.
CONTEXT: JSON data, line 1: "{"sample_json...
COPY test_table, line 1, column sample_json: ""{\"id\":123,\"class\":101,\"level\":3}""
I verified the JSON is in the correct format and read a couple of similar questions, but I'm still not sure what's going on here. An explanation would be awesome.
To load your data file as it is:
\COPY bogus.test_table (timestamp, sample_json) FROM '/local/file.txt' CSV DELIMITER E'\t' QUOTE '"' ESCAPE '\'
Your json is quoted. It shouldn't have surrounding " characters, and the " characters surrounding the field names shouldn't be escaped.
It should look like this:
2017-11-22 23:00:00 {"id":123,"class":101,"level":3}
Aeblisto's answer almost did the trick for my crazy JSON fields, but I needed to modify only one small bit, the QUOTE with backslash; here it is in full form:
COPY "your_schema_name.yor_table_name" (your, column_names, here)
FROM STDIN
WITH CSV DELIMITER E'\t' QUOTE E'\b' ESCAPE '\';
-- rows of data go here
\.
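For what it's worth, a hedged sketch of driving that COPY FROM STDIN from Python with psycopg2's copy_expert (connection string, table, columns, and file name are placeholders):
import psycopg2

conn = psycopg2.connect("dbname=bogus user=postgres")  # placeholder DSN
copy_sql = (
    "COPY your_schema_name.your_table_name (your, column_names, here) "
    "FROM STDIN WITH CSV DELIMITER E'\\t' QUOTE E'\\b' ESCAPE '\\'"
)
with conn, conn.cursor() as cur, open("/local/file.txt") as f:
    # copy_expert streams the local file to the server-side COPY.
    cur.copy_expert(copy_sql, f)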

Unicode Error while loading data from csv to Greenplum

I have a csv file and need to load it to Greenplum DB.
My code looks like this:
CREATE TABLE usr_wrk.CAR(
brand varchar(255),
model varchar(255),
ID INTEGER
);
COPY usr_wrk.CAR FROM '...Car.csv' DELIMITER ',' CSV HEADER
But I get this error:
[22025] ERROR: invalid Unicode escape: Unicode escapes must be full-length: \uXXXX or \UXXXXXXXX.
Rows of the csv file look, for example, like:
Jaguar,XJ,1
Or
Citroen,C4,91
I replaced all non-Latin words, and there are no NULL or empty values, but the error still appears. Does anybody have thoughts on this?
P.S.
I don't have admin rights and can make/drop and rule tables only in this schema.
You might try one of the following:
copy usr_wrk.car from '.../Car.csv' DELIMITER ',' ESCAPE as 'OFF' NULL as '' CSV HEADER;
OR
copy usr_wrk.car from '.../Car.csv' DELIMITER ',' ESCAPE as '\' NULL as '' CSV HEADER;
The default escape is a double quote for the CSV format. Turning it off or setting it to the default TEXT format escape (a backslash) may get you around this. You could also remove the CSV header from the file and declare it as a TEXT file with a comma delimiter to avoid having to specify the ESCAPE character.
Are you sure there are no special characters around the car names? Thinking specifically of umlauts or grave accents that would make the data multibyte and trigger that error.
You might try doing: head Car.csv | od -c | more and see if any multibyte characters show up in your file (this assumes you are on a Linux system).
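As an alternative to od, a short Python sketch (the file name is a placeholder) that flags any non-ASCII bytes, i.e. the multibyte characters that could trigger the Unicode error:
# Report line numbers that contain bytes outside the 7-bit ASCII range.
with open("Car.csv", "rb") as f:
    for line_no, raw in enumerate(f, start=1):
        bad = [hex(b) for b in raw if b > 0x7F]
        if bad:
            print(line_no, bad)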
If it is possible for you to do so, you might try using the GPLOAD utility to load the file. You can specify the ENCODING of the data file as 'LATIN1', which may get you past the UTF-8 error you are hitting.
Hope this helps.
Jim

Redshift - Delimited value missing end quote

I'm trying to load a CSV file into Redshift.
Delimiter '|'
First row of the CSV:
1 |Bhuvi|"This is ok"|xyz#domain.com
I used this command to load it:
copy tbl from 's3://datawarehouse/source.csv'
iam_role 'arn:aws:iam:::role/xxx'
delimiter '|'
removequotes
ACCEPTINVCHARS ;
ERROR:
raw_field_value | This is ok" |xyz#domain.com
err_code | 1214
err_reason | Delimited value missing end quote
Then I tried this too:
copy tbl from 's3://datawarehouse/source.csv'
iam_role 'arn:aws:iam:::role/xxx'
CSV QUOTE '\"'
DELIMITER '|'
ACCEPTINVCHARS ;
Disclaimer: even though this post does not answer the question asked here, I am posting this analysis in case it helps someone.
The error "Delimited value missing end quote" can be reported in cases where a quoted text column is missing the end quote, or if the text column value has a new line in the value itself. In my case, there was a newline in the text column value.
As per RFC 4180, the CSV specification says:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes.
For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
So a valid CSV can have multi-line rows, and the correct way to import it in Redshift is to specify the CSV format option. This also assumes that all columns having the quote character in the value will have the quote character escaped by another preceding quote character. This is also as per the CSV RFC specification.
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote.
For example:
"aaa","b""bb","ccc"
If the file that we are trying to import is not valid CSV, and is just named with a .csv extension as may be the case, then we have the following options.
Try copying the file without specifying the CSV option, and fine-tune the delimiter, escape, and quoting behaviour with the corresponding COPY options.
If a set of options is not able to consistently copy the data, then pre-process the file to make it consistent (a sketch of one way to do this is shown below).
In general, it helps to make the behaviour deterministic if we try to export and import data in formats that are consistent.
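As one possible form of that pre-processing, a hedged Python sketch (file names are placeholders, and it assumes the source can at least be parsed as pipe-delimited text) that rewrites the file with consistent RFC 4180 style quoting:
import csv

# Read the pipe-delimited source and rewrite it so every field is quoted
# and any embedded double quote is doubled, per RFC 4180.
with open("source.csv", newline="") as src, \
        open("source_clean.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst, delimiter="|", quoting=csv.QUOTE_ALL)
    for fields in reader:
        writer.writerow(fields)
The rewritten file can then be loaded with the CSV and QUOTE options as described above.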

Export table enclosing values with quotes to local csv in hive

I am trying to export a table to a local csv file in hive.
INSERT OVERWRITE LOCAL DIRECTORY '/home/sofia/temp.csv'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
select * from mytable;
The problem is that some of the values contain the newline "\n" character and the resulting file becomes really messy.
Is there any way of enclosing the values in quotes when exporting in Hive, so that the csv file can contain special characters (and especially the newline)?
One possible solution could be to use the Hive CSV SerDe (Serializer/Deserializer). It provides a way to specify custom delimiter, quote, and escape characters.
Limitation:
It does not handle embedded newlines
Availability:
The CSV Serde is available in Hive 0.14 and greater.
Background:
The CSV SerDe is based on https://github.com/ogrodnek/csv-serde, and was added to the Hive distribution in HIVE-7777.
Usage:
This SerDe works for most CSV data, but does not handle embedded newlines. To use the SerDe, specify the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde.
The original documentation is available at https://github.com/ogrodnek/csv-serde.
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Default separator, quote, and escape characters if unspecified
DEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER "
DEFAULT_SEPARATOR ,
Reference: Hive csv-serde