Invalid quote formatting for CSV with double quotes (Redshift loading) - csv

I have a CSV containing the following:
id,homie_id,user_id,some_data,some_datetime,list_stuff,confirmed_at,report_id
1,57,1,,,"{\"assets\":[]}","2014-12-26 16:50:32",18
2,59,1,,,"{\"assets\":[]}","2014-12-26 16:50:46",18
When I run the COPY command, I get an error "Invalid quote formatting for CSV"
Why is that? It has the backslash before the quote, so it should be acceptable. I see Redshift says to use "" instead (https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-data-format-parameters) but is there a way to tell it to accept \" since that is a valid way to escape quotes with CSVs?
I've already tried Googling and don't see a reason this wouldn't work.

There isn't really a way to force Redshift to use backslash as the escape character.
You have to convert your input data to the format that Redshift can handle. One way is just to replace all backslahes with double quotes. For example,
"{\"assets\":[]}" turns into "{""assets"":[]}" which is then parsable by Redshift and in the end the actual data should look like {"assets":[]} for that field.
From the docs:
The default quote character is a double quotation mark ( " ). When the
quote character is used within a field, escape the character with an
additional quote character. For example, if the quote character is a
double quotation mark, to insert the string A "quoted" word the input
file should include the string "A ""quoted"" word"

Related

Removing double quotes from CSV but leaving qualifiers (using Notepad++)

I have a comma delimited text file where one of the columns (appropriately) has text encased with double quotes. There are also many instances of double quotes within the content of this particular column. I've used the following to remove many of the double quotes, replacing them with single quotes (excluding any double quotes next to a comma).
(?<!^)(?<![,])"(?![,])(?!$)
How do I isolate/replace the double quote after [fine,] without removing the "good" double quotes?
column1,"he's doing 'fine," says Tom, but nothing specific. Blah, blah, blah", column3
Here is another example of "good" double quotes that I don't want to remove (where the first two columns are blank/empty)
,,"This is text I need",
Assuming that double quotes only occur in one column then I suggest a two-step approach. First change all double quotes in the file to single quotes, using a simple replace all. Next change the first and last single quotes back to double quotes. This can be done in one regex, replace (^[^\r\n']*)'(.*)'(^[^\r\n']*)$ with \1"\2"\3.
If single quotes occur in other columns and see should not be altered then a three-step approach can be used. Choose a character that does not occur anywhere in the text. Change all double quotes to that character, I will use ! as an example. As above, change the first and last ! to double quotes. This can be done in one regex, replace (^[^\r\n']*)!(.*)!(^[^\r\n']*)$ with \1"\2"\3. Finally change all the ! to single quotes. If you cannot find an unused character then you can use a longer string that is not in the file instead, perhaps something like _<<abc>>_ instead of the !.
Struggled with this a bit, but based on your question, there might be a possible solution. If you only have one column which has unescaped quotes or commas, you might be able to count the commas in front of that column and the commas after that column then strip all the quotes and commas between them. If you have multiple columns with unescaped characters, this might be harder.
Not familiar with Notepad++, but reading other answers I assume there is a way to use regex. If so, you can use this one:
(?<!^|",)"(?!,"|$)

Data is getting wrapped in the double quotes for Hive's CHAR data types if there is trailing space in it [duplicate]

I am using the following code to export my data frame to csv:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
Note that I use delimiter="\t", as I don't want to add additional quotation marks around each field. However, when I checked the output csv file, there are still some fields which are enclosed by quotation marks. e.g.
abcdABCDAAbbcd ....
1234_3456ABCD ...
"-12345678AbCd" ...
It seems that the quotation mark appears when the leading character of a field is "-". Why is this happening and is there a way to avoid this? Thanks!
You don't use all the options provided by the CSV writer. It has quoteMode parameter which takes one of the four values (descriptions from the org.apache.commons.csv documentation:
ALL - quotes all fields
MINIMAL (default) - quotes fields which contain special characters such as a delimiter, quotes character or any of the characters in line separator
NON_NUMERIC - quotes all non-numeric fields
NONE - never quotes fields
If want to avoid quoting the last options looks a good choice, doesn't it?

How to use UPDATE in MySQL with string containing escape characters

please look here:
UPDATE cars_tbl
SET description = '{\rtf1'
WHERE (ID=1)
Description field is "blob", where my RTF document is to be stored.
When I check updated data I always find
{
tf1
\r simply disapears. I tried to find solution on the web, but no success. My rtf files are corrupted on many places, because the escape characters used in the string are substituted. How to suppress this substitution and update field with string as is?
Thanx for advice
Lyborko
Backslash is an escape character, so to keep it you need a double backslash:
UPDATE cars_tbl
SET description = '{\\rtf1'
WHERE (ID=1)
As an aside \r is a carriage return.. and it hasn't disappeared in your data; it is responsible for tf1 appearing on the line below the {.
You can achieve this with a more generic approach
use of QUOTE() in mysql
MySQL QUOTE() produces a string which is a properly escaped data value in an SQL statement, out of an user supplied string as argument.
The function achieve this by enclosing the string with single quotes, and by preceding each single quote, backslash, ASCII NUL and control-Z with a backslash.
example
UPDATE cars_tbl
SET description = QUOTE('{\rtf1')
WHERE (ID=1)
UPDATE
to escape your RTF you can also just use REPLACE this way all your \ will become \\
Example
UPDATE cars_tbl
SET description = REPLACE('{\rtf1', '\', '\\')
WHERE (ID=1)

explain these lines of mysql string Literals

I Have selected these lines from Mysql official site dev.mysql.com.
I am unable to understand what these lines means.
There are several ways to include quote characters within a string:
A “'” inside a string quoted with “'” may be written as “''”.
A “"” inside a string quoted with “"” may be written as “""”.
I did not understand how this sql.
mysql> SELECT 'hel''lo';
Outout: hel'lo
Please Help
You have a string inside single quotes, then it finds another quote, escaped by yet another code. So, it will translate into
'(start of string)hel'(escaping the next quote)'(the escaped quote)lo'(ending the string)
And thus outputting:
hel'lo
It's simple. If you need to put a quote within a string literal delimited by those quotes, you can't use just a standalone quote character (like 'O'Brien') since there's no easy way to tell which of the second or third quote is the closing quote.
So they introduce a rule. If the SQL interpreter is within a quoted string and it finds another quote, it uses these rules:
if the quote is immediately followed by another quote, assume the user wants one quote within the literal.
otherwise it's the closing quote for the literal.
So, for example, consider:
select * from people where surname = 'O'Brien' order by id
Now you and I can tell which of those quotes actually terminates the string literal because we understand how names work. The computer does not take that for granted, instead requiring:
select * from people where surname = 'O''Brien' order by id
and turning the '' inside the literal into a single '.

Space Before Quote in CSV Field

From the CSV spec (RFC 4180), Spaces are considered part of a field and should not be ignored. Obviously if the field contains double quotes it should retain the spaces inside the quotes.
My question is, what about spaces outside of the double quotes? The only way I can see this happening is if the tool that generated the CSV didn't do it properly.
Example: one, "two" ,three
Should the space before and after "two" be included?
That cell is invalid - to properly code that row it should be:
one," ""two"" ",three
Double quotes must also be escaped (as double-double quote) since they are used as the escape sequence. If you don't want to preserve the quotes around two, technically there are two things invalid about the row - (1) the spaces before and after the quotes and (2) the fact that there are quotes around the cell but nothing to be escaped. CSV demands that there can only be quotes around the cell if there are commas or quotes inside the content of the cell.
If I were in your case, I would err on the side of leniency.
I dealt with this using BULK INSERT and BCP format files, which is tricky to account for the quote and comma delimiting. In the event that there could be variation, say with a , " delimiter We used the lowest common delimiter, so the comma in your example, then stripped out what wasn't needed like all the double quotes.
But it could also be that your source data was only comma delimited and this was the actual contents of that field. Either way, I would toss out the quotes when loading the field, in whatever method was appropriate.