Redshift - Delimited value missing end quote - csv

I'm trying to load a CSV file into Redshift.
Delimiter: '|'
First row of the CSV:
1 |Bhuvi|"This is ok"|xyz#domain.com
I used this command to load.
copy tbl from 's3://datawarehouse/source.csv'
iam_role 'arn:aws:iam:::role/xxx'
delimiter '|'
removequotes
ACCEPTINVCHARS ;
ERROR:
raw_field_value | This is ok" |xyz#domain.com
err_code | 1214
err_reason | Delimited value missing end quote
Then I tried this too.
copy tbl from 's3://datawarehouse/source.csv'
iam_role 'arn:aws:iam:::role/xxx'
CSV QUOTE '\"'
DELIMITER '|'
ACCEPTINVCHARS ;

Disclaimer: even though this post does not answer the question asked here, I am posting this analysis in case it helps someone.
The error "Delimited value missing end quote" can be reported when a quoted text column is missing its end quote, or when the text value contains a newline. In my case, there was a newline in the text column value.
As per RFC 4180, the CSV specification says:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes.
For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
So a valid CSV can have multi-line rows, and the correct way to import it into Redshift is to specify the CSV format option. This also assumes that any quote character appearing inside a field value is escaped by a preceding quote character, which is again per the CSV RFC specification (a short Python illustration follows the RFC excerpt below).
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote.
For example:
"aaa","b""bb","ccc"
If the file we are trying to import is not valid CSV and merely has a .csv extension, as may well be the case, then we have the following options.
Try copying the file without specifying the CSV option, fine-tuning the delimiter, escape, and quoting behaviour with the corresponding COPY options.
If no combination of options copies the data consistently, pre-process the file to make it consistent (a sketch of such pre-processing follows below).
In general, it helps to make the behaviour deterministic if we try to export and import data in formats that are consistent.
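As an illustration of that pre-processing idea, here is a minimal Python sketch (the file names and the '|' delimiter are assumptions based on the question above) that re-reads a delimited file and rewrites it with consistent RFC-style quoting so that COPY ... CSV can load it:
import csv

# Hypothetical input/output paths; adjust the delimiter and quoting to match the source file.
with open("source.csv", newline="") as src, open("clean.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst, delimiter="|", quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        # The writer re-quotes fields and doubles any embedded quote characters consistently.
        writer.writerow([field.strip() for field in row])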


Import RFC 4180 files (CSV spec) into Snowflake? (Unable to create file format that matches CSV RFC spec)

Summary:
Original question from a year ago: How to escape double quotes within data when it is already enclosed by double quotes
I have the same need as the original poster: I have a CSV file that matches the CSV RFC spec (my data has properly qualified double quotes, commas, and line feeds in it; Excel is able to read it just fine because the file matches the spec and Excel properly implements the spec).
Unfortunately I can't figure out how to import files that match the CSV RFC 4180 spec into Snowflake. Any ideas?
Details:
We've been creating CSV files that match the RFC 4180 spec for years in order to maximize compatibility across applications and OSes.
Here is a sample of what my data looks like:
KEY,NAME,DESCRIPTION
1,AFRICA,This is a simple description
2,NORTH AMERICA,"This description has a comma, so I have to wrap the whole field in double quotes"
3,ASIA,"This description has ""double quotes"" in it, so I have to qualify the double quotes and wrap the field in double quotes"
4,EUROPE,"This field has a carriage
return so it is wrapped in double quotes"
5,MIDDLE EAST,Simple description with single ' quote
When opening this file in Excel, Excel properly reads the rows/columns (because Excel follows the RFC spec).
In order to import this file into Snowflake, I first try to create a file format and I set the following:
Column Separator: Comma
Row Separator: New Line
Header lines to skip: 1
Field optionally enclosed by: Double Quote
Escape Character: "
Escape Unenclosed Field: None
But when I go to save the file format, I get this error:
Unable to create file format "CSV_SPEC".
SQL compilation error: value ["] for parameter 'FIELD_OPTIONALLY_ENCLOSED_BY' conflict with parameter 'ESCAPE'
It would appear that I'm missing something? I would think that I must be getting the Snowflake configuration wrong.
While writing up this question and testing all the scenarios I could think of, I found a file format that seems to work:
Column Separator: Comma
Row Separator: New Line
Header lines to skip: 1
Field optionally enclosed by: Double Quote
Escape Character: None
Escape Unenclosed Field: None
Same information again, but in SQL form:
ALTER FILE FORMAT "DB_NAME"."SCHEMA_NAME"."CSV_SPEC3" SET
  COMPRESSION = 'NONE'
  FIELD_DELIMITER = ','
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
  TRIM_SPACE = FALSE
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
  ESCAPE = 'NONE'
  ESCAPE_UNENCLOSED_FIELD = 'NONE'
  DATE_FORMAT = 'AUTO'
  TIMESTAMP_FORMAT = 'AUTO'
  NULL_IF = ('\\N');
I don't know why this works, but it does, so there you go. (My guess: when FIELD_OPTIONALLY_ENCLOSED_BY is set to the double quote, Snowflake already treats a doubled quote inside a quoted field as an escaped quote, so also setting ESCAPE to that same character conflicts.)
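For completeness, a hedged sketch of using that file format from Python via the snowflake-connector-python package (the connection details, stage name, and target table REGIONS are placeholders, not part of the original question):
import snowflake.connector

# Placeholder credentials and object names.
conn = snowflake.connector.connect(user="USER", password="PASSWORD", account="ACCOUNT")
cur = conn.cursor()
cur.execute('USE SCHEMA "DB_NAME"."SCHEMA_NAME"')
cur.execute("""
    COPY INTO REGIONS
    FROM @MY_STAGE/regions.csv
    FILE_FORMAT = (FORMAT_NAME = 'CSV_SPEC3')
""")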

Unable to load csv file into Snowflake

I am getting the below error when I try to load a CSV from my system into a Snowflake table:
Unable to copy files into table.
Numeric value '"4' is not recognized File '#EMPP/ui1591621834308/snow.csv', line 2, character 25 Row 1, column "EMPP"["SALARY":5] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
You appear to be loading your CSV with the file format option of FIELD_OPTIONALLY_ENCLOSED_BY='"' specified.
This option will allow reading any fields properly quoted with the " character, and even support such fields carrying the delimiter character as well as the " character if properly escaped. Some examples that could be considered valid:
CSV FORM | ACTUAL DATA
------------------------
abc | abc
"abc" | abc
"a,bc" | a,bc
"a,""bc""" | a,"bc"
In particular, notice that the final example follows the specified rule:
When a field contains this character, escape it using the same character. For example, if the value is the double quote character and a field contains the string A "B" C, escape the double quotes as follows:
A ""B"" C
If your CSV file carries quote marks within the data but is not actually quoting the fields (and delimiters and newlines never appear inside data fields), you can remove the FIELD_OPTIONALLY_ENCLOSED_BY option from your file format definition and just read the file as plain comma-delimited fields.
If your CSV does use quoting, ensure that whatever is producing the CSV files is using a valid CSV format writer and not simple string munging, and recreate it with the quotes properly escaped. If the above data example is to be considered valid in quoted form, it must instead appear within the file as "4" or 4.
The error message is saying that you have a value in your file that contains a "4, which is being loaded into a table that has a number field for that value. Since that isn't a number, it fails. This appears to be happening in the very first row of your file, so you could open it up and take a look at the value. If it's just one record, you can add ON_ERROR = 'CONTINUE' to your command so that it skips it and moves on.
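If you would rather find the offending value than skip it, here is a small Python sketch for scanning the file locally (the file name and the position of the SALARY column are assumptions based on the error message):
import csv

# Hypothetical local copy of the file; the error points at SALARY, assumed here to be column 5.
with open("snow.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                                   # skip the header row
    for lineno, row in enumerate(reader, start=2):
        value = row[4] if len(row) > 4 else None
        try:
            float(value)
        except (TypeError, ValueError):
            print(f"line {lineno}: SALARY value {value!r} is not numeric")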

Redshift loading CSV with commas in a text field

I've been trying to load a csv file with the following row in it:
91451960_NE,-1,171717198,50075943,"MARTIN LUTHER KING, JR WAY",1,NE
Note the comma in the name. I've tried all permutations of REMOVEQUOTES, DELIMITER ',', etc... and none of them work.
I have other rows with quotes in the middle of the name, so the ESCAPE option has to be there as well.
According to other posts,
DELIMITER ',' ESCAPE REMOVEQUOTES IGNOREHEADER 1;
should work but does not. Redshift gives a "Delimiter not found" error.
Is the ESCAPE causing issues and do I have to escape the comma?
I have tried loading your data using CSV as the data format parameter and this worked for me. Please keep in mind that CSV cannot be used with FIXEDWIDTH, REMOVEQUOTES, or ESCAPE.
create TEMP table awscptest (a varchar(40),b int,c bigint,d bigint,e varchar(40),f int,g varchar(10));
copy awscptest from 's3://sds-dev-db-replica/test.txt'
iam_role 'arn:aws:iam::<accountID>:<IAM_role>'
delimiter as ',' EMPTYASNULL CSV NULL AS '\0';
References: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html
http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#load-from-csv
This is a commonly recurring question. If you are actually using the CSV format for your files (not just some ad hoc text file that uses commas), then you need to enclose the field in double quotes. If you have commas and quotes, then you need to enclose the field in double quotes and escape the double quotes in the field data.
There is a definition for the CSV file format, RFC 4180. All text characters can be represented correctly in CSV if you follow the format.
https://www.ietf.org/rfc/rfc4180.txt
Use the CSV option with the Redshift COPY command, not just TEXT with a DELIMITER of ','. Redshift will also follow the official file format if you tell it that the file is CSV.
In this case, you have a comma (,) in the name field. Clean the data by removing that comma before loading into Redshift.
from pyspark.sql import functions as F

df = df.withColumn('name', F.regexp_replace(F.col('name'), ',', ' '))
Store the new dataframe in S3 (see the write sketch below) and then use the below COPY command to load into Redshift.
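A hedged sketch of that write step in PySpark (bucket and path are placeholders; df is the cleaned dataframe from above):
# Write the cleaned dataframe back to S3 as CSV with a header row
# (the COPY below uses IGNOREHEADER 1 to skip it).
(df.write
   .mode("overwrite")
   .option("header", "true")
   .csv("s3://your-bucket/cleaned/"))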
COPY 'table_name'
FROM 's3 path'
IAM_ROLE 'iam role'
DELIMITER ','
ESCAPE
IGNOREHEADER 1
MAXERROR AS 5
COMPUPDATE FALSE
ACCEPTINVCHARS
ACCEPTANYDATE
FILLRECORD
EMPTYASNULL
BLANKSASNULL
NULL AS 'null';

MySQL importing CSV file with phpmyadmin without cell quotes

I have a huge CSV file with the data like this:
ID~Name~Location~Price~Rating
02~Foxtrot~Scotland~~9
08~Alpha~Iceland~9.90~4
32~ForestLane~Germany~14.35~
The issue is that when importing using phpMyAdmin, it asks for "Columns enclosed with:" and "Columns escaped with:". The trouble is that this CSV doesn't have quotes around the cells.
If I leave this blank, it gives the error: Invalid parameter for CSV import: Columns escaped with
Is there a way to import without having quotes on the CSV?
I can reproduce this behavior. I'll bring it up on the phpMyAdmin development discussion list, but in the meantime you can work around it by using some nonsense character for "Columns escaped with" and leaving "Columns enclosed with" blank. Pick a character that doesn't occur in your data, say " or £, and use that for "Columns escaped with". For instance, I have a data set where I know £ doesn't exist, so I can use that as the "Columns escaped with" character -- if you don't have any escaped characters, you can enter any character there.
I'll update if I can provide any more useful information, but certainly that workaround should allow you to import your data.

Using Excel to create a CSV file with special characters and then Importing it into a db using SSIS

Take this XLS file
I then save this XLS file as CSV and then open it up with a text editor. This is what I see:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
I see that the double quote character in column C was stored as AB""C: the column value was enclosed in quotes and the double quote in the data was replaced with two double quote characters, to indicate that the quote occurs within the data and does not terminate the column value. I also see that the value for column G, 3,2, is enclosed in quotes so that it is clear that the comma occurs within the data rather than indicating a new column. So far, so good.
I am a little surprised that not all of the column values are enclosed in quotes, but even this seems reasonably OK when I assume that Excel only quotes column values when special characters like a comma or a double quote character exist in the data.
Now I try to use SQL Server to import the csv file. Note that I specify a double quote character as the Text Qualifier character.
And a comma as the column delimiter character. However, note that SSIS imports column 3 incorrectly, e.g., not translating the two consecutive double quote characters into a single occurrence of a double quote character.
What do I have to do to get Excel and SSIS to get along?
Generally people avoid the issue by using column delimiter characters that are LESS LIKELY to occur in the data, but this is not a real solution.
I find that if I modify the file from this
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
...to this:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB"C","D,E",F,03,"3,2"
i.e., collapsing the two consecutive quotes in column C's value into one, the data is loaded properly. However, this is a little confusing to me. First of all, how does SSIS determine that the double quote between the B and the C does not terminate that column value? Is it because the following character is not a comma column delimiter or a row delimiter (CRLF)? And why does Excel export it this way?
According to Wikipedia, here are a couple of traits of a CSV file:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
However, it looks like SSIS doesn't like it that way when importing. What can be done to get Excel to create a CSV file that could contain ANY special characters used as column delimiters, text delimiters or row delimiters in the data? There's no reason that it can't work using the approach specified in Wikipedia, which is what I thought the old MS DTS packages used to do...
Update:
If I use Notepad to change the input file to
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
"1","ABC","AB""C","D,E","F","03","3,2","AB""C"
Excel reads it just fine
but SSIS returns
The preview sample contains embedded text qualifiers ("). The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Conclusion:
Just like the error message says in your update...
The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Confirmed bug in Microsoft Connect. I encourage everyone reading this to click on this aforementioned link and place your vote to have them fix this stinker. This is in the top 10 of the most egregious bugs I have encountered.
Do you need to use a comma delimiter?
I used a pipe delimiter with no text qualifier and it worked fine. Here is my output from the text file.
1|ABC|AB"C|D,E|F|03|3,2
You have 3 options in my opinion.
Read the data into a stage table.
Run any update queries you need on the columns
Now select your data from the stage table and output it to a flat file.
OR
Use pipes as your delimiters.
OR
Do all of this in a C# application and build it in code.
You could send the row to a script in SSIS and parse and build the file you want there as well.
Using text qualifiers and "character" delimited fields is problematic for sure.
Have Fun!