MySQL LOAD DATA INFILE: skip rows IF

I've spent a fair amount of time googling this, but I can't seem to point myself in the right direction of exactly what I'm looking for. The problem with my .csv file is that, while the line terminator is ',,,,', some lines are missing it. The import is fine until it hits one of those lines; then it treats two records as a single record with roughly twice as many columns as a standard record has, and everything is thrown off from that point forward. What I need to do is skip any record (the data between ',,,,' terminators) that has more than the correct number of columns (15). I realize this will effectively skip two records each time it happens, but that's fine for what I'm doing with a pretty large dataset.
I've come across the IGNORE keyword, but that doesn't seem to apply. What I'm looking for is something like: for each record during import, skip record if record.columns.count > 15. Here is my import statement, thanks for any help provided.
LOAD DATA LOCAL INFILE "/Users/foo/Desktop/csvData.csv"
INTO TABLE csvData
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY ',,,,';

If you just want to skip the malformed records, a simple awk command to filter only the good records is:
awk -F, '{ if (NF == 15) print; }' csvData.csv > csvData_fixed.csv
Then LOAD DATA from the fixed file.
If you want to get fancier, you could write a script using awk (or Python or whatever you like) to rewrite the malformed records in the proper format.
Re your comment: The awk command reads your original file and outputs only each line that has exactly 15 fields, where fields are separated by commas.
Apparently your input data has no lines that have exactly 15 fields, even though you described it that way.
Another thought: it's a little bit weird to use ',,,,' as the line terminator in your original LOAD DATA command. Normally the line terminator is '\n', the newline character. When you redefine it as ',,,,', MySQL keeps reading text until it finds ',,,,', even if that means reading dozens of fields across multiple lines of text. Perhaps you could set your line terminator to ',,,,\n'.
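With the rest of your statement unchanged, that would be:
LOAD DATA LOCAL INFILE "/Users/foo/Desktop/csvData.csv"
INTO TABLE csvData
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY ',,,,\n';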

Related

Error loading MySQL/MariaDB Inline Data With Unescaped Double Quotes In Fields

I am having essentially the same problem as described here but the issue was left unresolved in that question.
I am trying to import a series of data files totaling about 100 million records into a MariaDB database. I've run into issues with some lines in the import file that look like:
"GAYATRI INC DBA "WHIPIN"","1950","S I","","AUSTIN","TX","78704","5124425337","B","93"
which I was trying to load with a statement like:
LOAD DATA INFILE 'testline.txt'
INTO TABLE data
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(@name,@housenum,@street,@aptnum,@city,@state,@zip,@phone,@business,@year)
SET name=@name, housenum=@housenum, street=@street, aptnum=@aptnum, city=@city, state=@state, zip=@zip, phone=@phone, business=@business, year=@year;
but am receiving errors because the first field contains unescaped double quotes in the text of the field. That seems to be OK in and of itself, as the database seems smart enough to handle it in most situations. However, because the field ends with a double quote in the text plus a double quote to close the field, MySQL assumes the first double quote is escaping the second (per RFC 4180) and therefore does not terminate the field, even though the next character is a comma.
The source files can't be created any differently, as they are exports from old software which I do not control. Obviously, searching through 100 million records and fixing entries like this by hand is not feasible. I'm unsure whether any fields contain commas, though in this quantity of records it's probably safe to assume some do, so programmatically forcing fields to break at commas is probably out too.
Any ideas on how to get them to import correctly?
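One workaround sometimes used for fully quoted exports like this (a sketch, untested against this data): since every field is enclosed in quotes, treat the literal three-character sequence '","' as the field delimiter and strip the outermost quotes with LINES STARTING BY / TERMINATED BY, so embedded double quotes lose their special meaning. This assumes every field is quoted and no field ever contains the sequence '","' itself:
LOAD DATA INFILE 'testline.txt'
INTO TABLE data
FIELDS TERMINATED BY '","' ESCAPED BY ''
LINES STARTING BY '"' TERMINATED BY '"\r\n'
(name, housenum, street, aptnum, city, state, zip, phone, business, year);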

How to remove line break from LOAD DATA LOCAL INFILE query?

I am using this query to import data from a txt file into my table:
LOAD DATA LOCAL INFILE '~/Desktop/data.txt' INTO TABLE codes LINES TERMINATED BY '\n' (code)
This is working fine. But when I take a look in the "code"-field, every entry has a line break at its end. Is there a way to get rid of this?
The LOAD DATA INFILE command is not really suitable for data cleansing, but you may get lucky. First, determine exactly which characters make up those 'line breaks'.
It is possible that the text file uses Windows-style line breaks (\r\n). In this case, use LINES TERMINATED BY '\r\n'. If the line breaks consist of different characters but are consistent across all lines, include those characters in the LINES TERMINATED BY clause.
If the line break characters are inconsistent, then you may have to create a stored procedure or use an external programming language to cleanse your data.
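For example, if the breaks turn out to be Windows '\r\n' endings, the import could be rewritten as below, and rows already loaded with a stray trailing carriage return can be cleaned up afterwards (a sketch using the question's table and column):
LOAD DATA LOCAL INFILE '~/Desktop/data.txt' INTO TABLE codes
LINES TERMINATED BY '\r\n' (code);

-- strip a stray trailing carriage return from rows already imported
UPDATE codes SET code = TRIM(TRAILING '\r' FROM code);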

cut when CSV file has commas in a cell (Linux)

I have IMDB data in csv format. Here is a snapshot.
[root@jamatney IMDB]# head IMDBMovie.txt
id,name,year,rank
0,#28 (2002),2002,
1,#7 Train: An Immigrant Journey, The (2000),2000,
2,$ (1971),1971,6.4000000000000004
3,$1000 Reward (1913),1913,
4,$1000 Reward (1915),1915,
5,$1000 Reward (1923),1923,
6,$1,000,000 Duck (1971),1971,5
7,$1,000,000 Reward, The (1920),1920,
8,$10,000 Under a Pillow (1921),1921,
I'd like to import this data into a MySQL database. However, there are commas present in the name cells, which prevents me from loading the data into the database correctly, as my loading query is:
mysql> LOAD DATA LOCAL INFILE 'IMDB/IMDBMovie.txt' INTO TABLE Movie FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n' IGNORE 1 LINES;
I've thought about using some combination of rev and cut to isolate the offending column, then find/replace the commas, but can't seem to get it to work. Was wondering if this is the right approach, or if there's a better way.
It looks like the first field and last two fields are unambiguous, so all you have to do is write a script that pulls those out and surrounds what remains in quotes. My bash-fu isn't quite good enough to do it with rev and cut, but I was able to write a Python script that works. You can then add an OPTIONALLY ENCLOSED BY clause to your LOAD DATA query (shown after the output below).
f = open("IMDBMovie.txt")
print(next(f).strip())  # header
for line in f:
    fields = line.strip().split(",")
    # Get unambiguous fields.
    id = fields.pop(0)
    rank = fields.pop(-1)
    year = fields.pop(-1)
    # Surround name with quotes.
    name = '"{}"'.format(",".join(fields))
    print("{},{},{},{}".format(id, name, year, rank))
On your test data, the output was:
id,name,year,rank
0,"#28 (2002)",2002,
1,"#7 Train: An Immigrant Journey, The (2000)",2000,
2,"$ (1971)",1971,6.4000000000000004
3,"$1000 Reward (1913)",1913,
4,"$1000 Reward (1915)",1915,
5,"$1000 Reward (1923)",1923,
6,"$1,000,000 Duck (1971)",1971,5
7,"$1,000,000 Reward, The (1920)",1920,
8,"$10,000 Under a Pillow (1921)",1921,
This is too long for a comment.
Good luck. Your input file is in a lousy format. It is not really CSV. Here are two options:
(1) Open the file in Excel (or your favorite spreadsheet) and save it out with tab delimiters instead. Keep your fingers crossed that none of the fields have a tab. Or use another delimiter such as a pipe character.
(2) Load each row into a table with only one column, a big character string column. Then parse each row into its constituent fields (SUBSTRING_INDEX() can be very useful), as sketched below.
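A sketch of option (2), with a hypothetical one-column staging table and the four columns from the question (`rank` is backquoted because it is a reserved word in recent MySQL versions):
CREATE TABLE staging (line TEXT);

LOAD DATA LOCAL INFILE 'IMDB/IMDBMovie.txt'
INTO TABLE staging
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(line);

-- id is everything before the first comma; year and rank are the last
-- two fields; the name (commas and all) is whatever remains in between
INSERT INTO Movie (id, name, year, `rank`)
SELECT
  SUBSTRING_INDEX(line, ',', 1),
  SUBSTRING(line,
            CHAR_LENGTH(SUBSTRING_INDEX(line, ',', 1)) + 2,
            CHAR_LENGTH(line)
              - CHAR_LENGTH(SUBSTRING_INDEX(line, ',', 1))
              - CHAR_LENGTH(SUBSTRING_INDEX(line, ',', -2))
              - 2),
  SUBSTRING_INDEX(SUBSTRING_INDEX(line, ',', -2), ',', 1),
  NULLIF(SUBSTRING_INDEX(line, ',', -1), '')
FROM staging;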

MySQL error 1261 (doesn't contain data for all columns) on last row with no empty values

I'm doing a LOAD DATA INFILE in MySQL through MySQL Workbench. I'm pretty new to SQL in general, so this may be a simple fix, but I can't get it to work. It throws a 1261 error (doesn't contain data for all columns) on the last row, but the last row (like the rest of the CSV) doesn't have any blank or null values.
I've looked around for help and read the manual, but everything I've seen has been about dealing with null values.
I exported the CSV from Excel, to the extent that matters.
The code I'm using to import is (I've changed the field, file, and table names to be more generic):
load data infile '/temp/filename.csv'
into table table1
fields terminated by ","
lines terminated by '\r'
ignore 1 lines
(Col1,Col2,Col3,Col4,Col5,col6,col7,Col8,Col9);
The first two columns are varchar and char, respectively with the remaining columns all formatted as double.
Here's the last few lines of the csv file:
364,6001.009JR,43.96,0,0,0,0,0,0
364,6001.900FM,0,0,0,0,0,0,0
364,6001.900JR,0,0,0,0,0,0,0
The only thing I can think of is that I'm supposed to have some signal after the last line to indicate that the file is finished, but I haven't found anything to indicate what that would be.
Any help would be appreciated
When I've had similar errors, it was because there were unexpected newlines inside my data (a newline within one row makes it look like two too-short rows on import).
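Another possibility, given that the error hits only the last row: if the file actually uses Windows '\r\n' line endings, then with LINES TERMINATED BY '\r' the file's final '\n' is read as one short extra row. A variant worth trying, using the question's names:
load data infile '/temp/filename.csv'
into table table1
fields terminated by ","
lines terminated by '\r\n'
ignore 1 lines
(Col1,Col2,Col3,Col4,Col5,col6,col7,Col8,Col9);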

Loading data into MySQL: How to deal with backslashes?

I downloaded a tab-delimited file from a well-known source and now want to upload it into a MySQL table. I am doing this using load data local infile.
This data file, which has over 10 million records, also has the misfortune of many backslashes.
$ grep '\\' tabd_file.txt | wc -l
223212
These backslashes aren't a problem, except when they come at the end of a field. MySQL interprets a backslash as an escape character, and at the end of a field it messes up the next field, or possibly the next row.
In spite of these backslashes, I only received 6 warnings from MySQL when loading it into a table. In each of these warnings, a row doesn't have the proper number of columns precisely because the backslash concatenated two adjacent fields in the same row.
My question is, how to deal with these backslashes? Should I specify load data local infile [...] escaped by '' to remove any special meaning from them? Or would this have unintended consequences? I can't think of a single important use of an escape sequence in this data file. The actual tabs that terminate fields are "physical tabs", not "\t" sequences.
Or, is removing the escape character from my load command bad practice? Should I just replace every instance of '\' in the file with '\\'?
Thanks for any advice :-)
If you don't need the escaping, then definitely use ESCAPED BY ''.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
"If the FIELDS ESCAPED BY character is empty, escape-sequence interpretation does not occur. "