Posted this to Reddit yesterday, but no love. I'm on CentOS, writing bash scripts and parsing data to import into MySQL.
I'm converting a story archive that stored the main part of each story in a plain text file, and I need to import these multi-line text files into a column in my database. I know I can use mysqlimport, and I have the file set up as pipe-delimited, BUT because the text file I'm importing has carriage returns/line breaks in it, each paragraph is imported as its own row. So a 9-paragraph text file imports as 9 rows when I use mysqlimport.
Is there a way to do this?
I know the ideal text file for importing (with pipe delimiters) would look like this:
this is my record|12345
another record|24353
have another bagel, why don't you?|43253
However, my file is actually closer to this:
This is the first line of my first paragraph. And now I'm going to do some more line wrapping and stuff.
This is a second line from the same text file that should be treated as a single record along with the first line in a single "blob" or text field. |12345
This is the last stumbling block in recovering from a bad piece of software someone dropped in my lap, and I hope this can be done. I have 14,000 of these text files (each in this format), so doing them by hand is kind of out of the question.
Encode newlines as '\n' and, in the same way, tabs as '\t' before you store the text. This is good practice whenever you are storing a URL or raw text in your database; it also helps you avoid SQL injection issues, and it will solve your current problem too.
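A minimal sketch of that idea, assuming each story lives in its own story_*.txt file with the numeric id after a pipe on the last line, and that the text itself contains no backslashes or extra pipes. The file names, database name, and credentials are only placeholders; adjust to your setup.
# Collapse the real newlines inside each story file into the two-character
# sequence '\n', so the whole file becomes one pipe-delimited record.
for f in story_*.txt; do
    sed ':a;N;$!ba;s/\n/\\n/g' "$f"   # join all lines, escaping the breaks
    printf '\n'                        # one real newline terminates the record
done > stories.txt

# mysqlimport derives the table name from the file name ("stories" here);
# the default escape character '\' turns the \n sequences back into newlines.
mysqlimport --local --fields-terminated-by='|' my_database stories.txt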
Please let me know if this helps. Thanks.
I do not know about the performance when you convert the lines to SQL statements, but I think it can be useful:
Input
This is the first line of my first paragraph. And now I'm going to do some more line wrapping and stuff.
This is a second line from the same text file that should be treated as a single record along with the first line in a single "blob" or text field. |12345
I am hoping I understood the question correct.
Everything without a pipe is part of the first field.
And the line with a pipe is for field 1 and 2.
Like this one |12346
Script
my_insert="INSERT INTO my_table
(field1, field2)
VALUES
('"
firstline=0
while read -r line; do
if [[ -z "${line}" ]]; then
printf "\n"
continue;
fi
if [[ "${firstline}" -eq 0 ]]; then
printf "%s" "${my_insert}"
firstline=1
fi
line_no_pipe=${line%|*}
if [[ "${line}" = "${line_no_pipe}" ]]; then
printf "%s\n" "${line}"
else
printf "%s',%s);\n" "${line_no_pipe}" "${line##*|}"
firstline=0
fi
done < input
Output
INSERT INTO my_table
(field1, field2)
VALUES
('This is the first line of my first paragraph. And now I'm going to do some more line wrapping and stuff.
This is a second line from the same text file that should be treated as a single record along with the first line in a single "blob" or text field. ',12345);
INSERT INTO my_table
(field1, field2)
VALUES
('I am hoping I understood the question correct.
Everything without a pipe is part of the first field.
And the line with a pipe is for field 1 and 2.
Like this one ',12346);
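If you save the loop above as a small script that reads standard input (say lines_to_sql.sh, with the "< input" redirection dropped), you could run it over the whole archive and pipe the generated statements straight into mysql. The path, script name, and credentials here are only placeholders.
# Convert every archive file and feed the generated INSERTs to MySQL.
for f in /path/to/archive/*.txt; do
    ./lines_to_sql.sh < "$f"
done | mysql --user=me --password my_database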
Related
I've spent a fair amount of time googling this, but I can't seem to point myself in the right direction of exactly what I'm looking for. The problem with my .csv file is that, while the line terminator is ',,,,', some lines do not include it. When I import the file it's fine until it hits one of these, but then it treats two records as one record with about twice the number of columns a standard record should have, and it's thrown off from that point forward. What I need to do is skip the records (data between ',,,,' terminations) that have more than the correct number of columns (15). I realize this will essentially skip 2 records each time this happens, but that's fine for the purpose of what I'm doing with a pretty large dataset.
I've come across the IGNORE keyword, but that doesn't seem to apply. What I'm looking for is something like: for each record during import, skip record if record.columns.count > 15. Here is my import statement, thanks for any help provided.
LOAD DATA LOCAL INFILE "/Users/foo/Desktop/csvData.csv"
INTO TABLE csvData
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY ',,,,';
If you just want to skip the malformed records, a simple awk command to filter only the good records is:
awk -F, '{ if (NF == 15) print; }' csvData.csv > csvData_fixed.csv
Then LOAD DATA from the fixed file.
If you want to get fancier, you could write a script using awk (or Python or whatever you like) to rewrite the malformed records in the proper format.
Re your comment: The awk command reads your original file and outputs only each line that has exactly 15 fields, where fields are separated by commas.
Apparently your input data has no lines that have exactly 15 fields, even though you described it that way.
Another thought: it's a little bit weird to use the line terminator of ',,,,' in your original LOAD DATA command. Normally the line terminator is '\n' which is a newline character. So when you redefine the line terminator as ',,,,' it means MySQL will keep reading text until it finds ',,,,' even if that ends up reading dozens of fields over multiple lines of text. Perhaps you could set your line terminator to ',,,,\n'.
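For example, here is the statement from the question with only the terminator changed; the mysql invocation and the database name are assumptions.
mysql --local-infile=1 --user=me --password csv_database <<'SQL'
LOAD DATA LOCAL INFILE '/Users/foo/Desktop/csvData.csv'
INTO TABLE csvData
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY ',,,,\n';
SQL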
I have a txt file which contains quoted, comma-delimited text, and I am trying to figure out what type of newline is being used.
The reason is that I am trying to import it into MySQL Server using LOAD DATA LOCAL INFILE, and obviously I need to tell it the correct LINES TERMINATED BY.
When I use either \n or \r\n it imports exactly half of the records, skipping a line each time.
But when I use \r it imports exactly double, giving me exactly as many all-NULL rows as there are records.
When I open the file in Notepad there is no space between the lines; however, if I open it in a browser there is a blank line between each line, as though there is a paragraph break in there somewhere. Likewise, if I choose "Open with > Excel" it does not split into columns, and has a blank line between each row. The only way to open it properly in Excel is to use "Get external data > From text" and choose comma as the delimiter.
I provide a couple of lines below, copied and pasted exactly, and obviously it would be great if someone could let me know the correct settings to use for importing. But it would be even better if there were a way for me to quickly tell what type of newline any particular file is using (there is also a blank line at the very end of the file, as for the other rows).
"Item No.","Description","Description 2","Customers Price","Home stock","Brand Name","Expected date for delivery","Item Group No.","Item Group Name","Item Product Link (Web)","Item Picture Link (Web)","EAN/UPC","Weight","UNSPSC code","Product type code","Warranty"
"/5PL0006","Drum Unit","DK-23","127.00","32","Kyocera","04/11/2013","800002","Drums","http://uk.product.com/product.aspx?id=%2f5PL0006","http://s.product.eu/products/2_PICTURE-TAKEN.JPG","5711045360824","0.30","44103109","","3M"
"/DK24","DK-24 Drum Unit FS-3750","","119.00","8","Dell","08/11/2013","800002","Drums","http://uk.product.com/product.aspx?id=%2fDK24","http://s.product.eu/products/2_PICTURE-TAKEN.JPG","5711045360718","0.20","44103109","","3M"
OK, so what I wish to do is get specific lines from a TEXT column without loading all the data from that column into memory.
So let's say I have 100k lines of text in the TEXT column and I wish to get lines 9000-9100 from there.
I can do it with files, but is it possible with MySQL as well?
Or is it better to use a file for this?
To my knowledge it is not possible to address values in a column of type TEXT by line. (You can do it by byte with SUBSTRING.)
So do it with a file, or use another structure to store it in the database, e.g. save each line in a separate row with a line number; then you can easily query the lines you want, as sketched below.
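A minimal version of that layout; the table, column, and credential names are made up for illustration.
mysql --user=me --password my_database <<'SQL'
-- one row per line, keyed by story and line number
CREATE TABLE story_lines (
    story_id INT NOT NULL,
    line_no  INT NOT NULL,
    line     TEXT,
    PRIMARY KEY (story_id, line_no)
);

-- fetch lines 9000-9100 of one story without reading the rest of the text
SELECT line
FROM story_lines
WHERE story_id = 1 AND line_no BETWEEN 9000 AND 9100
ORDER BY line_no;
SQL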
What I'm trying to do is export a view/table from Sybase ASE 12.0 into a CSV file, but I'm having a lot of difficulty with it.
We want to import it into IDEA or MS-Access. The way that these programs operate is with the text-field encapsulation character and a field separator character, along with new lines being the record separator (without being able to modify this).
Well, using bcp to export it is ultimately fruitless with its built-in options. It doesn't allow you to define a text-field encapsulation character (as far as I can tell). So we tried to create another view, on top of the original view/table, that concatenates the fields that have newlines in them (text fields); however, you can't do that without losing data, because it forces the field into a varchar of 8000 characters/bytes, while our longest field is 16000 characters (so there's definitely some truncation).
So, we decided to create columns in the new view that had the text field delimiters. However, that put our column count for the view at 320 -- 70 more than the 250 column limit in ASE 12.0.
bcp can only work on existing tables and views, so what can we do to export this data? We're pretty much open to anything.
If it's only the newline char that is causing problems, can you not just do a replace?
create view new_view as
select field1, field2, replace(text_field_with_char, 'newline char', ' ')
from old_view
You may have to consider exporting as 2 files, importing into your target as 2 tables and then combining them again in the target. If both files have a primary key this is simple.
That sounds like bcp's the right tool, but process the output via awk or perl.
But are those things you have and know? That might be a little overhead for you.
If you're on Windows you can get ActivePerl for free, and it could be quick.
Something like:
perl -F, -lane 'print "\"$F[0]\",$F[1],\"$F[2]\",$F[3]";' bcp-output-file
How's that? $F is an array of the fields. The text ones you wrap in escaped quotes (\").
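Put together, the pipeline might look something like this; the server, login, database, and view names are placeholders, and it assumes the replace() view above has already stripped the embedded newlines from the text fields.
# Export in character mode with a comma field terminator, then add the
# quote encapsulation with the perl filter before handing it to IDEA/Access.
bcp my_database..my_view out bcp-output-file -c -t ',' -U me -P secret -S MYSERVER
perl -F, -lane 'print "\"$F[0]\",$F[1],\"$F[2]\",$F[3]"' bcp-output-file > final.csv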
You can use BCP format files for this.
bcp .... -f XXXX.fmt
BCP can also produce these format files interactively if you don't give
any of the -c, -n, or -f flags. Then you can save the format file and experiment with it, editing it and re-running BCP.
To save time while exporting and debugging, use the -F and -L flags, like "-F 1 -L 10" -- this exports only the first 10 rows.
I have an input file I want to load into a MySQL database, but spread throughout the file are comment lines, which start with !. For example,
!dataset_value_type = count
The other lines that I want to read into the table are normal, without the leading !.
What is the import command to ignore lines that start with !? I just see commands to ONLY take lines that start with something (LINES STARTING BY)
Ouch! I think you will need to pre-process your data file. Something like:
perl -ni.bak -e 'print unless /^!/;' data-file.dat
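If you'd rather not touch the original file, a grep filter into a clean copy does the same job; the database and table names below are placeholders.
# Keep only the lines that do not start with '!' and load the clean copy.
grep -v '^!' data-file.dat > data-file.clean
mysql --local-infile=1 --user=me --password my_database \
    -e "LOAD DATA LOCAL INFILE 'data-file.clean' INTO TABLE my_table"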