Prevent LOAD DATA INFILE from escaping double double quotes - mysql

I have csv data like the following:
"E12 98003";1085894;"HELLA";"8GS007949261";"";1
"5 3/4"";652493;"HELLA";"9HD140976001";"";1
Some fields are enclosed in double quotes. The problem is that, as you can see in the second line, the data in the first column ends with a double quotation mark that is part of the data.
I tried something along the lines of:
LOAD DATA INFILE 'file.csv'
INTO TABLE mytable
FIELDS TERMINATED BY ';' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
but it will use the quotation mark in the data to escape the field enclosing quotation mark. I also tried ESCAPED BY '' and ESCAPED BY '\\' with no success.
Is there a way to stop the LOAD DATA INFILE command from escaping the double double quotation marks?
Or should I parse the csv and put double quotation marks when there is only one?
I am parsing the files anyway using powershell to change the encoding to utf8. Is there some way to fix this quickly there? My powershell code:
function Convert-FileToUTF8 {
param([string]$infile,
[string]$outfile,
[System.Int32]$encodingCode)
$encoding = [System.Text.Encoding]::GetEncoding($encodingCode)
$text = [System.IO.File]::ReadAllText($infile, $encoding)
[System.IO.File]::WriteAllText($outfile, $text)
}
Ok, I did it using a .NET regular expression to fix the csv. It is costly, but not too much.
I wrote
$text = [regex]::Replace($text, "(?m)(?<!^)(?<!\;)""(?!\;)(?!\r?$)", '""');
just before the last line in the function and it seems to work ok. Since I am a novice in regular expressions this could probably be improved.

The main problem is that the input data constitutes invalid CSV syntax, as stated in RFC-4180, paragraph 7:
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
But in your PowerShell script you could try to fix this issue with an extra line, using the Replace method on $text once you have its value:
$text = $text.Replace('"";', '""";')
This should be enough, as the loader will deal well with unescaped double quotes if they appear elsewhere in the data, as stated on mysql.com (my highlight):
If the field begins with the ENCLOSED BY character, instances of that character are recognized as terminating a field value only if followed by the field or line TERMINATED BY sequence.
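Of course, if the badly formatted CSV has data that contains ";, then you still have a problem. It is very hard to determine whether such an occurrence terminates the field or should be seen as part of the data, even for humans :-) Also note that the plain Replace above matches the end of an empty quoted field as well (;""; becomes ;""";), so it is only safe when the file contains no empty quoted fields.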
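For comparison, here is a minimal Python sketch of the lookaround idea from the question's own regex: a quote is doubled only when it is not next to the ; separator and not at a line boundary, so a closing quote and an empty field are left untouched. The name double_stray_quotes is made up for illustration, and the sketch assumes that no field value contains "; itself.

```python
import re

# A quote that is neither at the start/end of a line nor adjacent to the
# ';' separator is assumed to be data and gets doubled.
STRAY_QUOTE = re.compile(r'(?m)(?<!^)(?<!;)"(?!;)(?!\r?$)')


def double_stray_quotes(text: str) -> str:
    """Return text with unescaped in-field double quotes doubled."""
    return STRAY_QUOTE.sub('""', text)


if __name__ == "__main__":
    sample = '"5 3/4"";652493;"HELLA";"9HD140976001";"";1\r\n'
    print(double_stray_quotes(sample), end="")
```

Running it on the second sample line turns "5 3/4"" into "5 3/4""" and leaves the empty field ;""; as it is.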
Another thing to pay attention to as found on mysql.com:
If the input values are not necessarily enclosed within quotation marks, use OPTIONALLY before the ENCLOSED BY keywords.

In addition: importing CSV files into MySQL with the values enclosed in quotes works fine when using the ENCLOSED BY option, UNLESS the enclosed field is the last field in a row AND you used Excel to create the CSV file. Excel omits the field separator after the last field in a row. MySQL doesn't mind, unless the last field is enclosed in quotes. Then the import terminates at that line.
Examples:
This works fine: ...;value2;value3 (no trailing separator)
This also works fine ...;"value 2";value3 (value enclosed in quotes)
This also works fine ...;value 2;"value3"; (last field value enclosed in quotes and trailing separator)
But this breaks the import: ...;value2;"value 3" (last field value enclosed in quotes and no trailing separator)
Took me some time to figure this out; hope sharing this saves somebody else that time.

Related

Error loading MySQL/MariaDB Inline Data With Unescaped Double Quotes In Fields

I am having essentially the same problem as described here but the issue was left unresolved in that question.
I am trying to import a series of data files totaling about 100 million records into a MariaDB database. I've run into issues with some lines in the import file that look like:
"GAYATRI INC DBA "WHIPIN"","1950","S I","","AUSTIN","TX","78704","5124425337","B","93"
which I was trying to load with a statement like:
LOAD DATA INFILE 'testline.txt'
INTO TABLE data
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(@name,@housenum,@street,@aptnum,@city,@state,@zip,@phone,@business,@year)
SET name=@name, housenum=@housenum, street=@street, aptnum=@aptnum, city=@city, state=@state, zip=@zip, phone=@phone, business=@business, year=@year;
but am receiving errors because the first field contains unescaped double quotes in the text of the field. That seems to be OK in and of itself as the database seems smart enough to handle that in most situations. However, because the field ends with a double quote in the text plus a double quote to close the field it assumes the first double quote is escaping the second double quote following RFC4180 and thus is not terminating the field even though the next character is a comma.
The source files can't be created any differently as they are exports from old software which I do not control. Obviously searching through 100 million records and changing entries like this by hand is not feasible. I'm unsure of whether any fields might contain commas though it's probably safe to assume they do in this quantity of records so programmatically forcing fields to break at commas is probably out too.
Any ideas on how to get them to import correctly?
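One possible preprocessing approach, sketched in Python under some loud assumptions: every field in the export is enclosed in quotes (as in the sample line), no field contains the three-character sequence "," itself, no field contains a line break, and none of the quotes in the input are already escaped. Under those assumptions a line can be split on "," rather than on bare commas, which tolerates commas inside fields. The names requote_line and EXPECTED_FIELDS are hypothetical.

```python
import csv
import io

EXPECTED_FIELDS = 10  # column count of the sample line; adjust per file


def requote_line(line: str, expected: int = EXPECTED_FIELDS) -> str:
    """Re-emit one fully quoted line with inner double quotes doubled."""
    body = line.rstrip("\r\n")
    if len(body) < 2 or not (body.startswith('"') and body.endswith('"')):
        raise ValueError("line is not fully quoted")
    # Drop the outer quotes, then split on the quote-comma-quote boundary.
    fields = body[1:-1].split('","')
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    out = io.StringIO()
    # QUOTE_ALL encloses every field and doubles any embedded quote.
    csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\r\n").writerow(fields)
    return out.getvalue()
```

Because each line is handled independently, the function can be applied while streaming a large file line by line. Lines that raise ValueError can be written to a separate file for manual review instead of being loaded.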

MySQL Load data infile -- double quotes in a double quoted value as "a "double" quoted value"

I have a csv file with millions of rows. Here is the command I am using to load data
load data local infile 'myfile' into table test.mytable
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n' ignore 1 lines
This handles almost everything except some lines where there are double quotes inside a double-quoted string, as in
"first column",second column,"third column has "double quotes" inside", fourth column
It truncates the third column and gives me a warning that the row does not contain data for all columns.
Appreciate your help
The CSV is broken. There is no way MySQL or any other program can import it reliably. Double quotes need to be escaped if they appear inside a column.
You might fix the CSV with a script. If a quote doesn't have a comma in front of or behind it, it's probably part of the text and should be escaped.
The following regular expression uses negative lookbehind and lookahead to find quotes that have neither a comma nor a line boundary right in front of or behind them.
/(?<!^)(?<!,)(\s*)"(\s*)(?!,)(?!$)/
See it on regex101
On the command line you can run
perl -pe 's/(?<!,)(?<!^)(\s*)"(\s*)(?!,)(?!$)/\1\\"\2/g' data.csv > data-fixed.csv
Note that this method isn't foolproof. If there is a double quote that does have a comma next to it but is part of the text, there is little you can do to fix the CSV. In that case, the script simply has no way of knowing whether it's a column delimiter or not.
Try this:
mysqlimport --fields-optionally-enclosed-by='"' --fields-terminated-by=, --lines-terminated-by="\r\n" --user=YOUR_USERNAME --password YOUR_DATABASE YOUR_TABLE.csv

formatting MySQL output to valid CSV or XLSX

I have a query whose output I format and dump onto a CSV file.
This is the code I'm using,
(query.....)
INTO OUTFILE "/tmp/dump.csv"
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
However when I open the CSV in Google Sheets or Excel, the columns are broken up into hundreds of smaller ones.
When I open the CSV in a plain text editor, I see that the column values themselves contain quotes (single and double), commas and line breaks.
Only the double-quotes are escaped.
Even though the double-quotes are escaped, they are omitted when interpreted by Google Sheets and Excel.
I tried manually editing the CSV entries; escaping the commas and such. But no luck. The commas still break the columns. However, in a couple of instances they didn't break the column. I am not able to figure why though.
So my question is: how do I correctly format the output to accommodate these characters and dump it into a CSV or even an XLSX (in case CSV is not capable of handling situations like these)?
For context, I'm operating in a WordPress environment. If there is a solution in PHP, that can work too.
EDIT ::
Here is a sample line from the CSV,
"1369","Blaze Pannier Mounts for KTM Duke 200 & 390","HTA.04.740.80200/B","<strong>Product Description</strong><span data-sheets-value=\"[null,2,"SW Motech brings you the Blaze Pannier Brackets for the Duke 200 & 390. "]\" data-sheets-userformat=\"[null,null,15293,[null,0],11]\">SW Motech brings you the Blaze Pannier Brackets for the Duke 200 & 390.</span>"," <strong>What's in the box? </strong><span data-sheets-value=\"[null,2,"2 Quick Lock SupportsnMounting materialnMounting Instructions"]\" data-sheets-userformat=\"[null,null,15293,[null,0],null,[null,[[null,2,0,null,null,[null,2,13421772]],[null,0,0,3],[null,1,0,null,1]]],[null,[[null,2,0,null,null,[null,2,13421772]],[null,0,0,3],[null,1,0,null,1]]],[null,[[null,2,0,null,null,[null,2,13421772]],[null,0,0,3],[null,1,0,null,1]]],[null,[[null,2,0,null,null,[null,2,13421772]],[null,0,0,3],[null,1,0,null,1]]],null,0,1,0,null,[null,2,0],"calibri,arial,sans,sans-serif",11]\">2 Quick Lock SupportsMounting materialMounting Instructions</span> ","Installation Instructions"
From RFC 4180
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
Any double quotes inside fields enclosed with double quotes need to be escaped with another double quote. So given the three values abc, ab"c and ", the expected formatting would be "abc","ab""c","""".
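To see what that quoting looks like in practice, here is a small Python sketch that uses the standard csv module. It does not touch MySQL at all. It only assumes the rows have already been fetched by some client as lists of strings, and the helper name to_rfc4180 is made up for this example.

```python
import csv
import io


def to_rfc4180(rows):
    """Serialize rows with every field quoted and inner quotes doubled."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Commas and quotes inside a value stay within their quoted field.
    print(to_rfc4180([["abc", 'ab"c', '"'], ["1369", "Mounts, brackets"]]), end="")
```

The first row comes out as "abc","ab""c","""", which is the form spreadsheet applications expect. A backslash in front of the quote, as in the sample line, is not part of RFC 4180.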

error with MySQL load data infile field with double quotes

I have .csv file data like this:
"UPRR 38 PAN AM "M"","1"
and I loaded the data into a table with two columns (a and b) using the command below.
LOAD DATA LOCAL INFILE 'E:/monthly_data.csv'
INTO TABLE test_data_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';
But when I select from the table, it gives the unexpected results shown below.
a contains:
UPRR 38 PAN AM "M","1
... and b is NULL.
Thanks
You can replace all the instances of "double quote double quote" in your file, either
A. by opening the files and using find and replace, or
B. by writing a script that opens the files and replaces the extra quote that is messing things up.
You have this:
ENCLOSED BY '"'
Thus " is not a regular character any more. It's a special character that has a special meaning: it highlights the start and end of a column value. If you want to type a " that does not behave that way you need to escape it. The RFC 4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files document explains how to do that:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote
a;b
"UPRR 38 PAN AM ""M""";1
As they say, garbage in, garbage out ;-)
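As a quick sanity check outside MySQL, the corrected line can be fed to any RFC 4180 style parser. The Python sketch below uses the standard csv module; parse_line and is_well_formed are names invented for this illustration, and strict=True is used so that a stray quote inside a quoted field is reported as an error instead of being silently accepted.

```python
import csv


def parse_line(line: str, delimiter: str = ","):
    """Parse one CSV line, raising csv.Error on malformed quoting."""
    return next(csv.reader([line], delimiter=delimiter, strict=True))


def is_well_formed(line: str, delimiter: str = ",") -> bool:
    """Return True if the line parses cleanly under strict quoting rules."""
    try:
        parse_line(line, delimiter)
        return True
    except csv.Error:
        return False


if __name__ == "__main__":
    print(parse_line('"UPRR 38 PAN AM ""M""","1"'))
    print(is_well_formed('"UPRR 38 PAN AM "M"","1"'))
```

A check like is_well_formed can be run over a file before loading it to list the lines that need fixing.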

MySQL fields terminated by tab

I am trying to load a tab-delimited file into MySQL. I want a query something like this: LOAD DATA LOCAL INFILE 'file' INTO TABLE tbl FIELDS TERMINATED BY 'TAB'. Is there something I can substitute for TAB to make this work?
Have you tried '\t'? The escape character followed by "t" is interpreted as a tab. I haven't tried it, but it might be what you need.
Just tried to find the answer to this question myself to save re-saving my file with commas separating instead of tabs...
From an old MySQL reference manual, a long way down the page, you can find that TAB is the default separator for files loaded using LOAD DATA in MySQL.
See: http://dev.mysql.com/doc/refman/4.1/en/load-data.html
I just loaded a tab-separated file into MySQL 5.1 in this way.
BW
fields terminated by '\t'
Try this one
Note :
Field and Line Handling
For both the LOAD DATA and SELECT ... INTO OUTFILE statements, the syntax of the FIELDS and LINES clauses is the same. Both clauses are optional, but FIELDS must precede LINES if both are specified.
If you specify a FIELDS clause, each of its subclauses (TERMINATED BY, [OPTIONALLY] ENCLOSED BY, and ESCAPED BY) is also optional, except that you must specify at least one of them. Arguments to these clauses are permitted to contain only ASCII characters.
If you specify no FIELDS or LINES clause, the defaults are the same as if you had written this:
FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '\n' STARTING BY ''
Backslash is the MySQL escape character within strings in SQL statements. Thus, to specify a literal backslash, you must specify two backslashes for the value to be interpreted as a single backslash. The escape sequences '\t' and '\n' specify tab and newline characters, respectively.
In other words, the defaults cause LOAD DATA to act as follows when reading input:
Look for line boundaries at newlines.
Do not skip any line prefix.
Break lines into fields at tabs.
Do not expect fields to be enclosed within any quoting characters.
Interpret characters preceded by the escape character \ as escape sequences. For example, \t, \n, and \\ signify tab, newline, and backslash, respectively. See the discussion of FIELDS ESCAPED BY later for the full list of escape sequences.
Conversely, the defaults cause SELECT ... INTO OUTFILE to act as follows when writing output:
Write tabs between fields.
Do not enclose fields within any quoting characters.
Use \ to escape instances of tab, newline, or \ that occur within field values.
Write newlines at the ends of lines.
see: https://dev.mysql.com/doc/refman/8.0/en/load-data.html
for more details.
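The reading rules for the defaults can be illustrated with a rough Python sketch. It is not how MySQL implements them. It only mimics the behaviour described above for a single line that has already been cut at the newline, and it deliberately ignores the other escape sequences and NULL handling covered in the manual. The function name parse_default_line is hypothetical.

```python
ESCAPES = {"t": "\t", "n": "\n"}  # only the sequences named in the text above


def parse_default_line(line: str):
    """Split one line at tabs, honouring backslash escapes."""
    fields, buf, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            # Known sequences map to control characters; for anything else
            # the backslash is dropped and the next character kept literally.
            buf.append(ESCAPES.get(nxt, nxt))
            i += 2
        elif ch == "\t":
            # An unescaped tab ends the current field.
            fields.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(ch)
            i += 1
    fields.append("".join(buf))
    return fields
```

For example, a line consisting of one, a tab, two\tstill two, a tab and back\\slash yields three fields, with the second containing a real tab and the third a single backslash.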