I downloaded a tab-delimited file from a well-known source and now want to upload it into a MySQL table. I am doing this using load data local infile.
This data file, which has over 10 million records, also has the misfortune of many backslashes.
$ grep '\\' tabd_file.txt | wc -l
223212
These backslashes aren't a problem, except when they come at the end of a field. MySQL interprets a backslash as an escape character, so when one comes at the end of a field it escapes the tab that follows and merges that field into the next one, or possibly into the next row.
In spite of these backslashes, I only received 6 warnings from MySQL when loading it into a table. In each of these warnings, a row doesn't have the proper number of columns precisely because the backslash concatenated two adjacent fields in the same row.
My question is, how to deal with these backslashes? Should I specify load data local infile [...] escaped by '' to remove any special meaning from them? Or would this have unintended consequences? I can't think of a single important use of an escape sequence in this data file. The actual tabs that terminate fields are "physical tabs", not "\t" sequences.
Or, is removing the escape character from my load command bad practice? Should I just replace every instance of '\' in the file with '\\'?
Thanks for any advice :-)
If you don't need the escaping, then definitely use ESCAPED BY ''.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
"If the FIELDS ESCAPED BY character is empty, escape-sequence interpretation does not occur. "
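If escape handling has to stay on for some other reason, the alternative raised in the question (replacing every '\' with '\\') can be done as a preprocessing pass. The following is only a minimal Python sketch of that idea; the function names are made up for illustration, and it assumes the default ESCAPED BY '\\', under which a doubled backslash in the file is loaded as one literal backslash.

```python
import io


def double_backslashes(line):
    """Return the line with every backslash doubled."""
    return line.replace("\\", "\\\\")


def rewrite(src, dst):
    """Copy src to dst with backslashes doubled; return how many lines changed."""
    changed = 0
    for line in src:
        fixed = double_backslashes(line)
        if fixed != line:
            changed += 1
        dst.write(fixed)
    return changed


if __name__ == "__main__":
    # Hypothetical sample: the first field ends in a backslash right before the tab.
    sample = io.StringIO("C:\\temp\\\tsecond\nplain\tfields\n")
    out = io.StringIO()
    print(rewrite(sample, out), "line(s) changed")
```

For a real file, src and dst would be file objects opened with the file's actual encoding, and LOAD DATA would then read the rewritten copy.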
I am having essentially the same problem as described here but the issue was left unresolved in that question.
I am trying to import a series of data files totaling about 100 million records into a MariaDB database. I've run into issues with some lines in the import file that look like:
"GAYATRI INC DBA "WHIPIN"","1950","S I","","AUSTIN","TX","78704","5124425337","B","93"
which I was trying to load with a statement like:
LOAD DATA INFILE 'testline.txt'
INTO TABLE data
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(@name,@housenum,@street,@aptnum,@city,@state,@zip,@phone,@business,@year)
SET name=@name, housenum=@housenum, street=@street, aptnum=@aptnum, city=@city, state=@state, zip=@zip, phone=@phone, business=@business, year=@year;
but am receiving errors because the first field contains unescaped double quotes in the text of the field. That seems to be OK in and of itself as the database seems smart enough to handle that in most situations. However, because the field ends with a double quote in the text plus a double quote to close the field it assumes the first double quote is escaping the second double quote following RFC4180 and thus is not terminating the field even though the next character is a comma.
The source files can't be created any differently as they are exports from old software which I do not control. Obviously searching through 100 million records and changing entries like this by hand is not feasible. I'm unsure of whether any fields might contain commas though it's probably safe to assume they do in this quantity of records so programmatically forcing fields to break at commas is probably out too.
Any ideas on how to get them to import correctly?
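One possible direction, offered only as a sketch: since every field in the sample line is enclosed in double quotes, the three-character sequence "," can serve as the field boundary instead of a bare comma, and the records can then be rewritten with embedded quotes doubled so that ENCLOSED BY '"' reads them unambiguously. The Python below assumes that every field is quoted and that no field value itself contains the sequence ","; both function names are hypothetical.

```python
import csv
import io


def split_loose_record(line):
    """Split a fully quoted record on the '","' boundary between fields."""
    body = line.rstrip("\r\n")
    if len(body) < 2 or body[0] != '"' or body[-1] != '"':
        raise ValueError("record is not fully quoted")
    # Drop the outermost quotes, then cut at every quote-comma-quote sequence.
    return body[1:-1].split('","')


def to_strict_csv(fields):
    """Re-emit the fields with every field quoted and embedded quotes doubled."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    writer.writerow(fields)
    return buf.getvalue()
```

Running each line of the export through split_loose_record and then to_strict_csv would turn the first field of the sample into "GAYATRI INC DBA ""WHIPIN""", which the existing LOAD DATA statement should be able to parse. Lines that raise ValueError would need to be inspected separately.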
I've spent a fair amount of time googling this but I can't seem to point myself in the right direction of exactly what I'm looking for. The problem with my .csv file is that, while the line terminator is ',,,,', some lines do not include this, so when I import the file it's fine until it gets to one of these, but then it treats it as one record that's about twice as long as the amount of columns a standard record should have, and then it's thrown off from that point forward. What I need to do is skip the records (data between ',,,,' terminations) that have more than the correct number of columns (15). I realize this will essentially skip 2 records each time this happens, but that's fine for the purpose of what I'm doing with a pretty large dataset.
I've come across the IGNORE keyword, but that doesn't seem to apply. What I'm looking for is something like: for each record during import, skip record if record.columns.count > 15. Here is my import statement, thanks for any help provided.
LOAD DATA LOCAL INFILE "/Users/foo/Desktop/csvData.csv"
INTO TABLE csvData
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY ',,,,';
If you just want to skip the malformed records, a simple awk command to filter only the good records is:
awk -F, '{ if (NF == 15) print; }' csvData.csv > csvData_fixed.csv
Then LOAD DATA from the fixed file.
If you want to get fancier, you could write a script using awk (or Python or whatever you like) to rewrite the malformed records in the proper format.
Re your comment: The awk command reads your original file and outputs only each line that has exactly 15 fields, where fields are separated by commas.
Apparently your input data has no lines that have exactly 15 fields, even though you described it that way.
Another thought: it's a little bit weird to use the line terminator of ',,,,' in your original LOAD DATA command. Normally the line terminator is '\n' which is a newline character. So when you redefine the line terminator as ',,,,' it means MySQL will keep reading text until it finds ',,,,' even if that ends up reading dozens of fields over multiple lines of text. Perhaps you could set your line terminator to ',,,,\n'.
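One caveat with the awk filter is that -F, splits on every comma, so a quoted field that contains a comma is counted as two fields. A rough Python equivalent that counts fields with the standard csv module is sketched here. It assumes one record per physical line and double-quote enclosure; good_records is a made-up name, and the expected count of 15 comes from the question.

```python
import csv

EXPECTED_FIELDS = 15  # column count stated in the question


def good_records(lines, expected=EXPECTED_FIELDS):
    """Yield only the raw lines that parse to exactly `expected` CSV fields."""
    for line in lines:
        # Parse a single line; commas inside double quotes do not split fields.
        row = next(csv.reader([line]), [])
        if len(row) == expected:
            yield line
```

If each physical line still ends with the ',,,,' marker, those trailing commas count as extra empty fields, so the expected value would have to be raised to match.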
I have a CSV file that I need to format (i.e., turn into) a SQL file for ingestion into MySQL. I am looking for a way to add the text delimiters (single quote) to the text, but not to the numbers, booleans, etc. I am finding it difficult because some of the text that I need to enclose in single quotes have commas themselves, making it difficult to key in to the commas for search and replace. Here is an example line I am working with:
1239,1998-08-26,'Severe Storm(s)','Texas,Val Verde,"DEL RIO, PARKS",'No',25,"412,007.74"
This is a FEMA data file with 131246 lines that I got from data.gov and am trying to get into a MySQL database. As you can see, I need to insert a single quote after Texas and before Val Verde, so I tried:
s/,/','/3
But that only replaced the first occurrence of the comma on the first three lines of the file. Once I get past that, I will need to find a way to deal with "DEL RIO, PARKS", as that has a comma that I do not want to place a single quote around.
So, is there a "nice" way to manipulate this data to get it from plain CSV to a proper SQL format?
Thanks
CSV files are notoriously dicey to parse. Different programs export CSV in different ways, possibly including strangeness like embedding new lines within a quoted field or different ways of representing quotes within a quoted field. You're better off using a tool specifically suited to parsing CSV -- perl, python, ruby and java all have CSV parsing libraries, or there are command line programs such as csvtool or ffe.
If you use a scripting language's CSV library, you may also be able to leverage the language's SQL import as well. That's overkill for a one-off, but if you're importing a lot of data this way, or if you're transforming data, it may be worthwhile.
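As an illustration of the CSV-library route, here is a minimal Python sketch that parses one line with the standard csv module and renders it as a SQL value list, quoting text but leaving numbers bare. The function names, the rule for what counts as a number, and the cleaned-up sample line in the comment are assumptions for illustration; the escaping is deliberately minimal, and a database driver with parameterized inserts would be the safer choice for real data.

```python
import csv
import re

PLAIN_NUMBER = re.compile(r"-?\d+(\.\d+)?")
GROUPED_NUMBER = re.compile(r"-?\d{1,3}(,\d{3})+(\.\d+)?")  # e.g. 412,007.74


def sql_literal(value):
    """Render one CSV field as a SQL literal: numbers bare, text single-quoted."""
    if PLAIN_NUMBER.fullmatch(value):
        return value
    if GROUPED_NUMBER.fullmatch(value):
        return value.replace(",", "")  # drop thousands separators
    # Minimal escaping: double any backslash and any single quote.
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def sql_values(csv_line):
    """Parse one CSV line and return a parenthesised SQL value list."""
    row = next(csv.reader([csv_line]))
    return "(" + ", ".join(sql_literal(v) for v in row) + ")"


# Hypothetical input, based on the example line with the hand-added quotes removed:
# 1239,1998-08-26,Severe Storm(s),Texas,Val Verde,"DEL RIO, PARKS",No,25,"412,007.74"
```

Because the csv module handles the double-quoted fields, the comma in "DEL RIO, PARKS" stays inside one value. Fields such as ZIP codes with leading zeros would be emitted as bare numbers by this rule, so the number test may need tightening per column.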
I think that I would also want to do some troubleshooting to find out why the CSV import into MySQL failed.
I would take an approach like this:
:%s/,\("[^"]*"\|[^,"]*\)/,'\1'/g
:%s/^\("[^"]*"\|[^,"]*\)/'\1'/g
In words: after each comma, match either a double-quoted run of characters or (the \| alternation) a run of characters containing no comma or double quote, and wrap the match in single quotes.
Next, for the first column in each row, match the same two alternatives anchored at the start of the line and wrap the match in single quotes.
Try the csv plugin. It can convert the data into other formats. The help includes an example of how to convert the data for import into a database.
Just to bring this to a close, I ended up using @Eric Andres's idea, which was the MySQL load data option:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE MYTABLE FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
The initial .csv file still took a little massaging, but not as much as if I were to do it by hand.
When I commented that the LOAD DATA had truncated my file, I was incorrect. I was treating the file as a typical .sql file and assumed the "ID" column I had added would auto-increment. This turned out to not be the case. I had to create a quick script that prepended an ID to the front of each line. After that, the LOAD DATA command worked for all lines in my file. In other words, all data has to be in place within the file to load before the load, or the load will not work.
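The quick script itself is not shown. A minimal Python sketch of the same idea might look like the following, assuming comma-separated lines and sequential IDs starting at 1; with_ids is a made-up name.

```python
def with_ids(lines, start=1):
    """Yield each line with a sequential ID and a comma prepended."""
    for offset, line in enumerate(lines):
        yield f"{start + offset},{line}"


if __name__ == "__main__":
    # Hypothetical two-line input standing in for the real .csv file.
    for out_line in with_ids(["a,b\n", "c,d\n"]):
        print(out_line, end="")
```

Passing an open file object as lines and writing the results to a second file produces the input for LOAD DATA.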
Thanks again to all who replied, and to @Eric Andres for his idea, which I ultimately used.
I'm trying to import a large csv file with 27797 rows into MySQL. Here is my code:
load data local infile 'foo.csv' into table bar fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 lines;
It works fine. However, some rows of this file contain backslashes (\), for example:
"40395383771234304","40393156566585344","84996340","","","2011-02-23 12:59:44 +0000","引力波宇宙广播系统零号控制站","#woiu 太好了"
"40395151830421504","40392270645563392","23063222","","","2011-02-23 12:58:49 +0000","引力波宇宙广播系统零号控制站","#wx0 确切地讲安全电压是\""不高于36V\""而不是\""36V\"", 呵呵. 话说要如何才能测它的电压呢?"
"40391869477158912","40390512645124096","23063222","","","2011-02-23 12:45:46 +0000","引力波宇宙广播系统零号控制站","#wx0 这是别人的测量结果, 我没验证过. 不过麻麻的感觉的确是存在的, 而且用适配器充电时麻感比用电脑的前置USB接口充电高"
"15637769883","15637418359","35192559","","","2010-06-07 15:44:15 +0000","强互作用力宇宙探测器","#Hc95 那就不是DOS程序啦,只是个命令行程序,就像Android里的adb.exe。$ adb push d:\hc95.tar.gz /tmp/ $ adb pull /system/hc95/eyes d:\re\"
After importing, lines with backslashes will be broken.
How could I fix it? Should I use sed or awk to substitute all \ with \\ (within 27797 rows...)? Or can this be fixed by just modifying the SQL query?
This is a bit more of a discussion than a direct answer. Do you need the double quotes in the middle of the values in the final data (in the DB)? The fact that you have a large amount of data to munge doesn't present any problems at all.
The "" thing is what Oracle does for quotes inside strings. I think whatever built that file attempted to escape the quote sequence. This is the string manual for MySQL. Either of these is valid:
select "hel""lo", "\"hello";
I would tend to do the editing separately from the import, so it is easier and faster to see whether things worked. If your text file is less than 10MB, it shouldn't take more than a minute to update it via sed. This removes every backslash and writes the result to standard output:
sed -e 's/\\//g' foo.csv
From your comments, you can set the escape char to be something other than '\'.
ESCAPED BY 'char'
This means the loader should add the values verbatim. If it gets too complicated, you can base64() the data before you insert it, which will stop any tools from breaking the UTF-8 sequences.
What I did in a similar situation was to create a java string first in a test application. Then compile the test class and fix any errors that I found.
For example:
String me = "LOAD DATA LOCAL INFILE 'X:/access.log/' REPLACE INTO TABLE `logrecords`\n" +
"FIELDS TERMINATED BY \'|\'\n"+
"ENCLOSED BY \'\"\'\n"+
"ESCAPED BY \'\\\\\'\n"+
"LINES TERMINATED BY \'\\r\\n\'(\n"+
"`startDate` ,\n"+
"`IP` ,\n"+
"`request` ,\n"+
"`threshold` ,\n"+
"`useragent`\n"+
")";
System.out.println("" +me);
We have a large tab-delimited text file (approximately 120,000 records, 50MB) that we're trying to shove into MySQL using mysqlimport. Some fields are enclosed in double-quotes, some not. We're using the fields-optionally-enclosed-by='\"' switch, but the problem is some of the field values themselves contain double-quotes (indicating inches) so the delimited field value might be something like "ABCDEF19"". Make sense?
We have no control over the source of the file, so we can't change the formatting there. I tried removing the fields-optionally-enclosed-by switch, but then the double-quotes that surround the values are imported.
The records with quotes in the values are getting seriously messed up. Is there a way we can tell mysqlimport that some fields are optionally enclosed by quotes, but may still contain quotes? We've thought maybe a global search and replace to escape the double-quotes in field values? Or any other suggestions?
If your data includes quotes inside the body of a quoted field without escaping them somehow, you have a problem. You can't guarantee that mysqlimport will handle this properly.
Massage the data first before trying to insert it in this way.
Luckily, it is tab-delimited, so you can run a regex to replace the embedded quotes with an escaped version and then tell mysqlimport the escape character.
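A variation on that clean-up, sketched in Python rather than as a regex: instead of escaping the inner quotes, strip the enclosing pair from each tab-separated field, so the file can be imported without the fields-optionally-enclosed-by switch. This assumes that tabs never occur inside a field and that no unenclosed value both starts and ends with a double quote; the function name is hypothetical.

```python
def strip_enclosing_quotes(line):
    """Remove one pair of enclosing double quotes from each tab-separated field."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # keep the original line ending as-is
    cleaned = []
    for field in body.split("\t"):
        if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
            field = field[1:-1]  # inner quotes, such as an inch mark, are kept
        cleaned.append(field)
    return "\t".join(cleaned) + ending
```

With this, a field written as "ABCDEF19"" in the file becomes ABCDEF19" in the output, while fields that were never enclosed pass through unchanged.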
You could import it with the quotes (fields-optionally-enclosed-by switch removed) and then run a check where if the value has double quotes at the beginning and end (assuming none of the values have inches at the beginning) then truncate by 1 character from the beginning and end to remove the extra quotes you got from importing.
EDIT: after reading kekoav's response I have to agree that if you are able to manipulate the file before importing that would be a much wiser option, but if you are forced to remove quotes afterwards, you could use something like this:
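UPDATE table
SET column =
IF(
-- strip only when the value both starts and ends with a double quote
LEFT(table.column,1) = '"' AND RIGHT(table.column,1) = '"',
MID(table.column,2,(CHAR_LENGTH(table.column)-2)),
table.column
)
Repeat this for every 'column' in 'table'.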