Cut when CSV file has commas in cells (Linux, MySQL)

I have IMDB data in csv format. Here is a snapshot.
[root@jamatney IMDB]# head IMDBMovie.txt
id,name,year,rank
0,#28 (2002),2002,
1,#7 Train: An Immigrant Journey, The (2000),2000,
2,$ (1971),1971,6.4000000000000004
3,$1000 Reward (1913),1913,
4,$1000 Reward (1915),1915,
5,$1000 Reward (1923),1923,
6,$1,000,000 Duck (1971),1971,5
7,$1,000,000 Reward, The (1920),1920,
8,$10,000 Under a Pillow (1921),1921,
I'd like to import this data into a MySQL database. However, there are commas present in the name cells. This prevents me from loading the data into the database correctly, as my load query is:
mysql> LOAD DATA LOCAL INFILE 'IMDB/IMDBMovie.txt' INTO TABLE Movie FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n' IGNORE 1 LINES;
I've thought about using some combination of rev and cut to isolate the offending column, then find/replace the commas, but can't seem to get it to work. Was wondering if this is the right approach, or if there's a better way.

It looks like the first field and the last two fields are unambiguous, so all you have to do is write a script to pull those out and surround what remains in quotes. My bash-fu isn't quite good enough to get it done with rev and cut, but I was able to write a Python script to get it done. You can then add an OPTIONALLY ENCLOSED BY clause to your LOAD DATA query (see the sketch after the output below).
f = open("IMDBMovie.txt")
print(next(f), end="")  # header; next() keeps the newline, so suppress print's own
for line in f:
    fields = line.strip().split(",")
    # Get the unambiguous fields.
    id = fields.pop(0)
    rank = fields.pop(-1)
    year = fields.pop(-1)
    # Surround the name (everything left over) with quotes.
    name = '"{}"'.format(",".join(fields))
    print("{},{},{},{}".format(id, name, year, rank))
On your test data, the output was
id,name,year,rank
0,"#28 (2002)",2002,
1,"#7 Train: An Immigrant Journey, The (2000)",2000,
2,"$ (1971)",1971,6.4000000000000004
3,"$1000 Reward (1913)",1913,
4,"$1000 Reward (1915)",1915,
5,"$1000 Reward (1923)",1923,
6,"$1,000,000 Duck (1971)",1971,5
7,"$1,000,000 Reward, The (1920)",1920,
8,"$10,000 Under a Pillow (1921)",1921,

This is too long for a comment.
Good luck. Your input file is in a lousy format. It is not really CSV. Here are two options:
(1) Open the file in Excel (or your favorite spreadsheet) and save it out with tab delimiters instead. Keep your fingers crossed that none of the fields have a tab. Or use another delimiter such as a pipe character.
(2) Load each row into a table with only one column, a big character string column. Then, parse the rows into their constituent fields (substring_index() can be very useful).
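For option (2), the parsing can lean on the same observation as the answer above: only the name is ambiguous, while the first field and the last two are clean. A sketch, assuming the rows were loaded into a hypothetical one-column table staging(raw):
-- staging(raw) is a made-up one-column table holding whole rows
SELECT SUBSTRING_INDEX(raw, ',', 1) AS id
     , SUBSTRING(raw
                , LENGTH(SUBSTRING_INDEX(raw, ',', 1)) + 2
                , LENGTH(raw) - LENGTH(SUBSTRING_INDEX(raw, ',', 1))
                              - LENGTH(SUBSTRING_INDEX(raw, ',', -2)) - 2) AS name
     , SUBSTRING_INDEX(SUBSTRING_INDEX(raw, ',', -2), ',', 1) AS year
     , SUBSTRING_INDEX(raw, ',', -1) AS `rank`
  FROM staging;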

Related

Error loading MySQL/MariaDB Inline Data With Unescaped Double Quotes In Fields

I am having essentially the same problem as described here but the issue was left unresolved in that question.
I am trying to import a series of data files totaling about 100 million records into a MariaDB database. I've run into issues with some lines in the import file that look like:
"GAYATRI INC DBA "WHIPIN"","1950","S I","","AUSTIN","TX","78704","5124425337","B","93"
which I was trying to load with a statement like:
LOAD DATA INFILE 'testline.txt'
INTO TABLE data
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(@name,@housenum,@street,@aptnum,@city,@state,@zip,@phone,@business,@year)
SET name=@name, housenum=@housenum, street=@street, aptnum=@aptnum, city=@city, state=@state, zip=@zip, phone=@phone, business=@business, year=@year;
but am receiving errors because the first field contains unescaped double quotes in the text of the field. That seems to be OK in and of itself, as the database seems smart enough to handle it in most situations. However, because the field ends with a double quote in the text plus a double quote to close the field, the parser assumes the first double quote is escaping the second one (following RFC 4180) and thus does not terminate the field, even though the next character is a comma.
The source files can't be created any differently, as they are exports from old software which I do not control. Obviously, searching through 100 million records and changing entries like this by hand is not feasible. I'm unsure whether any fields might contain commas, though it's probably safe to assume some do in this quantity of records, so programmatically forcing fields to break at every comma is probably out too.
Any ideas on how to get them to import correctly?
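One workaround worth sketching (an assumption-laden sketch, not a confirmed fix): since every field in these exports is fully quoted, the three-character sequence "," only ever appears between fields, so it can serve as the field terminator itself, and the stray outer quotes can be trimmed off afterwards:
-- Sketch: treat "," as the delimiter so interior quotes never matter.
-- Assumes no field's text contains the literal three-character sequence ","
LOAD DATA INFILE 'testline.txt'
INTO TABLE data
FIELDS TERMINATED BY '","'
LINES TERMINATED BY '"\r\n'
(@name, @housenum, @street, @aptnum, @city, @state, @zip, @phone, @business, @year)
SET name = TRIM(LEADING '"' FROM @name), housenum = @housenum, street = @street,
    aptnum = @aptnum, city = @city, state = @state, zip = @zip, phone = @phone,
    business = @business, year = @year;
Only the first field on each line keeps its opening quote (hence the TRIM), because the field terminator consumes the quote-comma-quote between fields and the line terminator consumes the closing quote before the line break.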

Importing a series of .CSV files that contain one field while adding additional 'known' data in other fields

I've got a process that creates a csv file that contains ONE set of values that I need to import into a field in a MySQL database table. This process creates a specific file name that identifies the values of the other fields in that table. For instance, the file name T001U020C075.csv would be broken down as follows:
T001 = Test 001
U020 = User 020
C075 = Channel 075
The file contains a single row of data separated by commas for all of the test results for that user on a specific channel and it might look something like:
12.555, 15.275, 18.333, 25.000 ... (there are hundreds, maybe thousands, of results per user, per channel).
What I'm looking to do is to import directly from the CSV file adding the field information from the file name so that it looks something like:
insert into results (test_no, user_id, channel_id, result) values (1, 20, 75, 12.555)
I've tried to use "Bulk Insert" but that seems to want to import all of the fields where each ROW is a record. Sure, I could go into each file and convert the row to a column and add the data from the file name into the columns preceding the results but that would be a very time consuming task as there are hundreds of files that have been created and need to be imported.
I've found several "import CSV" solutions but they all assume all of the data is in the file. Obviously, it's not...
The process that generated these files is unable to be modified (yes, I asked). Even if it could be modified, it would only provide the proper format going forward and what is needed is analysis of the historical data. And, the new format would take significantly more space.
I'm limited to using either MATLAB or MySQL Workbench to import the data.
Any help is appreciated.
Bob
A possible SQL approach to getting the data loaded into the table would be to run a statement like this:
LOAD DATA LOCAL INFILE '/dir/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = '001'
, user_id = '020'
, channel_id = '075'
;
We need the comma to be the line separator, and we can specify as the field separator some character that is guaranteed not to appear in the data. That way LOAD DATA sees a single "field" on each "line".
(If there isn't a trailing comma at the end of the file, after the last value, we need to test to make sure we are getting the last value, that is, the last "line" as we're telling LOAD DATA to read the file.)
We could use user-defined variables in place of the literals, but that leaves the part about parsing the filename. That's really ugly in SQL, but it could be done, assuming a consistent filename format...
-- parse filename components into user-defined variables
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'T',-1),'U',1) AS t
     , SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'U',-1),'C',1) AS u
     , SUBSTRING_INDEX(f.n,'C',-1) AS c
     , f.n AS n
  FROM ( SELECT SUBSTRING_INDEX(SUBSTRING_INDEX( i.filename ,'/',-1),'.csv',1) AS n
           FROM ( SELECT '/tmp/T001U020C075.csv' AS filename ) i
       ) f
  INTO @ls_t
     , @ls_u
     , @ls_c
     , @ls_n
;
While we're testing, we probably want to see the result of the parsing.
-- for debugging/testing
SELECT @ls_t
     , @ls_u
     , @ls_c
     , @ls_n
;
And then there's the part about running the actual LOAD DATA statement. We've got to specify the filename again, and we need to make sure we're using the same filename ...
LOAD DATA LOCAL INFILE '/tmp/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = @ls_t
, user_id = @ls_u
, channel_id = @ls_c
;
(The client will need read permission on the .csv file.)
Unfortunately, we can't wrap this in a procedure, because running a LOAD DATA statement is not allowed from a stored program.
Some would correctly point out that, as a workaround, we could compile/build a user-defined function (UDF) to execute an external program, and a procedure could call that. Personally, I wouldn't do it, but it is an alternative we should mention, given the constraints.

MySQL Load Data InFile; skip rows IF

I've spent a fair amount of time googling this, but I can't seem to point myself in the right direction of exactly what I'm looking for. The problem with my .csv file is that, while the line terminator is ',,,,', some lines do not include it. So the import is fine until it gets to one of these lines, at which point it treats two records as a single record that's about twice as long as the number of columns a standard record should have, and it's thrown off from that point forward. What I need to do is skip the records (data between ',,,,' terminations) that have more than the correct number of columns (15). I realize this will essentially skip 2 records each time this happens, but that's fine for the purpose of what I'm doing with a pretty large dataset.
I've come across the IGNORE keyword, but that doesn't seem to apply. What I'm looking for is something like: for each record during import, skip the record if record.columns.count > 15. Here is my import statement; thanks for any help provided.
LOAD DATA LOCAL INFILE "/Users/foo/Desktop/csvData.csv"
INTO TABLE csvData
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY ',,,,';
If you just want to skip the malformed records, a simple awk command to filter only the good records is:
awk -F, '{ if (NF == 15) print; }' csvData.csv > csvData_fixed.csv
Then LOAD DATA from the fixed file.
If you want to get fancier, you could write a script using awk (or Python or whatever you like) to rewrite the malformed records in the proper format.
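If a scripting pass isn't an option, a pure-SQL alternative (a sketch, not from the original answer; the staging table is made up) is to load each ',,,,'-terminated record into a single wide column first and filter on comma count:
-- Stage whole records, one per ',,,,' terminator.
CREATE TABLE staging (raw TEXT);
LOAD DATA LOCAL INFILE '/Users/foo/Desktop/csvData.csv'
INTO TABLE staging
LINES TERMINATED BY ',,,,'
(raw);
-- A well-formed 15-field record has exactly 14 separating commas
-- (assuming no quoted field contains a comma); the keepers can then
-- be split into real columns with SUBSTRING_INDEX().
SELECT raw
FROM staging
WHERE LENGTH(raw) - LENGTH(REPLACE(raw, ',', '')) = 14;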
Re your comment: The awk command reads your original file and outputs only each line that has exactly 15 fields, where fields are separated by commas.
Apparently your input data has no lines that have exactly 15 fields, even though you described it that way.
Another thought: it's a little bit weird to use the line terminator of ',,,,' in your original LOAD DATA command. Normally the line terminator is '\n' which is a newline character. So when you redefine the line terminator as ',,,,' it means MySQL will keep reading text until it finds ',,,,' even if that ends up reading dozens of fields over multiple lines of text. Perhaps you could set your line terminator to ',,,,\n'.
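For instance, a hedged variant of the original statement, assuming each record really does end with ',,,,' followed by a newline:
LOAD DATA LOCAL INFILE "/Users/foo/Desktop/csvData.csv"
INTO TABLE csvData
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY ',,,,\n';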

How can I quickly reformat a CSV file into SQL format in Vim?

I have a CSV file that I need to format (i.e., turn into) a SQL file for ingestion into MySQL. I am looking for a way to add the text delimiters (single quote) to the text, but not to the numbers, booleans, etc. I am finding it difficult because some of the text that I need to enclose in single quotes have commas themselves, making it difficult to key in to the commas for search and replace. Here is an example line I am working with:
1239,1998-08-26,'Severe Storm(s)','Texas,Val Verde,"DEL RIO, PARKS",'No',25,"412,007.74"
This is a FEMA data file, with 131246 lines, that I got off of data.gov and am trying to get into a MySQL database. As you can see, I need to insert a single quote after Texas and before Val Verde, so I tried:
s/,/','/3
But that only replaced the first occurrence of the comma on the first three lines of the file. Once I get past that, I will need to find a way to deal with "DEL RIO, PARKS", as that has a comma that I do not want to place a single quote around.
So, is there a "nice" way to manipulate this data to get it from plain CSV to a proper SQL format?
Thanks
CSV files are notoriously dicey to parse. Different programs export CSV in different ways, possibly including strangeness like embedding new lines within a quoted field or different ways of representing quotes within a quoted field. You're better off using a tool specifically suited to parsing CSV -- perl, python, ruby and java all have CSV parsing libraries, or there are command line programs such as csvtool or ffe.
If you use a scripting language's CSV library, you may also be able to leverage the language's SQL import as well. That's overkill for a one-off, but if you're importing a lot of data this way, or if you're transforming data, it may be worthwhile.
I think that I would also want to do some troubleshooting to find out why the CSV import into MySQL failed.
I would take an approach like this:
:%s/,\("[^"]*"\|[^,"]*\)/,'\1'/g
:%s/^\("[^"]*"\|[^,"]*\)/'\1'/g
In words: after each comma, look for either a double-quoted set of characters or (the \| alternation) a set of characters containing no comma or double quote, and wrap that set in single quotes.
The second command does the same for the first column in a row, anchored at the start of the line (^) instead of at a comma.
Try the csv plugin. It can convert the data into other formats, and the help includes an example of how to convert the data for importing it into a database.
Just to bring this to a close, I ended up using @Eric Andres's idea, which was the MySQL LOAD DATA option:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE MYTABLE FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
The initial .csv file still took a little massaging, but not as much as if I had done it all by hand.
When I commented that the LOAD DATA had truncated my file, I was incorrect. I was treating the file as a typical .sql file and assumed the "ID" column I had added would auto-increment. This turned out not to be the case. I had to create a quick script that prepended an ID to the front of each line. After that, the LOAD DATA command worked for all lines in my file. In other words, all data has to be in place within the file before the load, or the load will not work.
Thanks again to all who replied, and to @Eric Andres for his idea, which I ultimately used.

MySql load data infile STR_TO_DATE returning blank?

I'm importing 1M+ records into my table from a CSV file.
Works great using the load data local infile method.
However, the dates are all different formats.
A quick google lead me to this function:
STR_TO_DATE
However, when I implement that, I get nothing, an empty insert. Here's my SQL, cut down to include one date (I've got 4 with the same issue) and generic column names:
load data local infile 'myfile.csv' into table `mytable`
fields terminated by '\t'
lines terminated by '\n'
IGNORE 1 LINES
( `column name 1`
, `my second column`
, @temp_date
, `final column`)
SET `Get Date` = STR_TO_DATE(@temp_date, '%c/%e/%Y')
If I do:
SET `Get Date` = @temp_date
The date from the csv is captured in the format it was in the file.
However, when I try the first method, my table column is empty. I've changed the column type to varchar(255) from timestamp to capture whatever is going in, but ultimately, I want to capture y-m-d H:i:s (not sure if STR_TO_DATE can do that?).
I'm also unsure as to why I need the @ symbol... Google failed me there.
So, my questions are:
Why do I need the @ symbol to use this function?
Should the data format ('%c/%e/%Y') be the format of the inputted data or my desired output?
Can I capture time in this way too?
Sorry for the large post!
Back to Google for now...
Why do I need the @ symbol to use this function?
The @ symbol means that you are using a user variable, so the string that is read isn't put right away into the table but into a piece of memory that lets you operate on it before inserting it. More info at http://dev.mysql.com/doc/refman/5.0/en/user-variables.html
Should the data format ('%c/%e/%Y') be the format of the inputted data or my desired output?
It's the format of the input data; more info at http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_str-to-date
Can I capture time in this way too?
You should be able to, as long as you choose the correct format; since H:i:s implies a 24-hour clock, something like
STR_TO_DATE(@temp_date, '%c/%e/%Y %H:%i:%s');
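For example, a quick way to sanity-check the format string against a sample value (the date here is made up):
SELECT STR_TO_DATE('4/29/2012 13:45:30', '%c/%e/%Y %H:%i:%s');
-- returns 2012-04-29 13:45:30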
I had this problem. What solved it for me was making sure I accounted for whitespace that wasn't part of the delimiters in my load file. So if ',' is the delimiter:
..., 4/29/2012, ...
might be interpreted as " 4/29/2012"
So it should be
...,4/29/2012,...
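Rather than editing the file, the stray whitespace can also be stripped during the load itself, for example by wrapping the variable in TRIM() in the SET clause (a sketch reusing the variable from the statement above):
SET `Get Date` = STR_TO_DATE(TRIM(@temp_date), '%c/%e/%Y')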