MySQL dump character escaping and CSV read

I am trying to dump the results of a MySQL query to a CSV file and read it with a Java-based open source CSV reader. Here are the problems I am facing:
My data set has around 50 fields, and a few of them contain text with line breaks. To keep these from breaking my CSV reader, I specified FIELDS OPTIONALLY ENCLOSED BY "\"" so that line breaks are wrapped inside double quotes. With this setting, the other fields are wrapped in double quotes too, even when they contain no line breaks.
By default, the escape character for a MySQL dump appears to be \ (backslash). This causes line breaks to be written with a \ at the end of the line, which breaks the CSV parser. If I specify FIELDS ESCAPED BY '' (empty string) to remove that \, the double quotes in my text are no longer escaped, which still breaks the CSV read.
It would be great if I could skip escaping the line breaks but still escape double quotes, so that the CSV reader does not break.
Any suggestions on what I can do here?
Thanks,
Sriram

Try dumping your data into CSV using uniVocity-parsers. You can then read the result using the same library:
Try this for dumping the data out:
ResultSet resultSet = executeYourQuery();
// To dump the data of our ResultSet, we configure the output format:
CsvWriterSettings writerSettings = new CsvWriterSettings();
writerSettings.getFormat().setLineSeparator("\n");
writerSettings.setHeaderWritingEnabled(true); // if you want the column names to be printed out.
// Then create a routines object:
CsvRoutines routines = new CsvRoutines(writerSettings);
// The write() method takes care of everything. Both resultSet and output are closed by the routine.
routines.write(resultSet, new File("/path/to/your.csv"), "UTF-8");
And this to read your file:
// creates a CSV parser
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.getFormat().setLineSeparator("\n");
parserSettings.setHeaderExtractionEnabled(true); //extract headers from file
CsvParser parser = new CsvParser(parserSettings);
// call beginParsing to read records one by one, iterator-style. Note that there are many ways to read your file, check the documentation.
parser.beginParsing(new File("/path/to/your.csv"), "UTF-8");
String[] row;
while ((row = parser.parseNext()) != null) {
    System.out.println(Arrays.toString(row));
}
Hope this helps.
Disclaimer: I'm the author of this library, it's open source and free (Apache V2.0 license)
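For comparison, the behaviour asked for in the question (line breaks kept verbatim inside quotes, embedded double quotes escaped by doubling, no backslashes) is what most RFC 4180-style CSV writers produce by default. Here is a minimal sketch using Python's standard csv module; the rows are made-up sample data, not output from MySQL:

```python
import csv
import io

# Hypothetical rows standing in for a query result.
rows = [
    [1, "plain text", "first line\nsecond line"],
    [2, 'he said "hi"', "no line break"],
]

buffer = io.StringIO()
# QUOTE_MINIMAL quotes only the fields that need it; doublequote=True
# writes an embedded " as "" instead of using a backslash.
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, doublequote=True,
                    lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back recovers the original strings unchanged.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed)
```

Only the two fields containing a line break or a double quote end up quoted, and the line break is stored as-is inside the quotes.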


Import RFC4180 files (CSV spec) into snowflake? (Unable to create file format that matches CSV RFC spec)

Summary:
Original question from a year ago: How to escape double quotes within a data when it is already enclosed by double quotes
I have the same need as the original poster: I have a CSV file that matches the CSV RFC spec (my data has double quotes that are properly qualified, commas, and line feeds). Excel is able to read it just fine because the file matches the spec and Excel reads the spec properly.
Unfortunately I can't figure out how to import files that match the CSV RFC 4180 spec into Snowflake. Any ideas?
Details:
We've been creating CSV files that match the RFC 4180 spec for years in order to maximize compatibility across applications and OSes.
Here is a sample of what my data looks like:
KEY,NAME,DESCRIPTION
1,AFRICA,This is a simple description
2,NORTH AMERICA,"This description has a comma, so I have to wrap the whole field in double quotes"
3,ASIA,"This description has ""double quotes"" in it, so I have to qualify the double quotes and wrap the field in double quotes"
4,EUROPE,"This field has a carriage
return so it is wrapped in double quotes"
5,MIDDLE EAST,Simple descriptoin with single ' quote
When opening this file in Excel, Excel properly reads the rows/columns (because Excel follows the RFC spec).
In order to import this file into Snowflake, I first try to create a file format and I set the following:
Name | Value
Column Separator | Comma
Row Separator | New Line
Header lines to skip | 1
Field optionally enclosed by | Double Quote
Escape Character | "
Escape Unenclosed Field | None
But when I go to save the file format, I get this error:
Unable to create file format "CSV_SPEC".
SQL compilation error: value ["] for parameter 'FIELD_OPTIONALLY_ENCLOSED_BY' conflict with parameter 'ESCAPE'
It would appear that I'm missing something. I assume I must be getting the Snowflake configuration wrong.
While writing up this question and testing all the scenarios I could think of, I found a file format that seems to work:
Name | Value
Column Separator | Comma
Row Separator | New Line
Header lines to skip | 1
Field optionally enclosed by | Double Quote
Escape Character | None
Escape Unenclosed Field | None
Same information, in SQL form:
ALTER FILE FORMAT "DB_NAME"."SCHEMA_NAME"."CSV_SPEC3" SET
  COMPRESSION = 'NONE'
  FIELD_DELIMITER = ','
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
  TRIM_SPACE = FALSE
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
  ESCAPE = 'NONE'
  ESCAPE_UNENCLOSED_FIELD = 'NONE'
  DATE_FORMAT = 'AUTO'
  TIMESTAMP_FORMAT = 'AUTO'
  NULL_IF = ('\\N');
I don't know why this works, but it does, so there you go.
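One plausible reading of why this works: in RFC 4180, a double quote inside a quoted field is escaped by doubling it, so the enclosing character acts as its own escape and there is no separate escape character to configure. The sketch below shows that rule using Python's csv module on a shortened copy of the sample above. It illustrates the RFC behaviour only and is not Snowflake code.

```python
import csv
import io

# Shortened copy of the sample file from the question.
sample = (
    'KEY,NAME,DESCRIPTION\n'
    '1,AFRICA,simple\n'
    '2,NORTH AMERICA,"has a comma, here"\n'
    '3,ASIA,"has ""double quotes"" in it"\n'
    '4,EUROPE,"has a carriage\nreturn"\n'
    "5,MIDDLE EAST,single ' quote\n"
)

# newline="" leaves line endings untouched so the csv module sees them as-is.
reader = csv.reader(io.StringIO(sample, newline=""))
header = next(reader)
records = list(reader)

for record in records:
    print(record)
```

The reader needs no escape character: the doubled quotes in row 3 come back as single quotes, and the line break in row 4 stays inside one field.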

NBSP creeping inside mySQL data [duplicate]

I have a spreadsheet that really has only one complicated table. I basically convert the spreadsheet to a CSV and use a Groovy script to generate the INSERT scripts.
However, I cannot do this with a table that has 28 fields, because the data within some of the fields on the spreadsheet makes exporting to CSV even more complicated. The fields in the new CSV are not differentiated properly, or my script has not accounted for it.
Does anyone have any suggestions on a better approach to do this? Thanks.
Have a look at the LOAD DATA INFILE statement. It will help you import data from the CSV file into a table.
This is a recurrent question on stackoverflow. Here is an updated answer.
There are actually several ways to import an Excel file into a MySQL database, with varying degrees of complexity and success.
Excel2MySQL or Navicat utilities. Full disclosure, I am the author of Excel2MySQL. These 2 utilities aren't free, but they are the easiest option and have the fewest limitations. They also include additional features to help with importing Excel data into MySQL. For example, Excel2MySQL automatically creates your table and automatically optimizes field data types like dates, times, floats, etc. If you're in a hurry or can't get the other options to work with your data, then these utilities may suit your needs.
LOAD DATA INFILE: This popular option is perhaps the most technical and requires some understanding of MySQL command execution. You must manually create your table before loading and use appropriately sized VARCHAR field types. Therefore, your field data types are not optimized. LOAD DATA INFILE has trouble importing large files that exceed 'max_allowed_packet' size. Special attention is required to avoid problems importing special characters and foreign unicode characters. Here is a recent example I used to import a csv file named test.csv.
phpMyAdmin: Select your database first, then select the Import tab. phpMyAdmin will automatically create your table and size your VARCHAR fields, but it won't optimize the field types. phpMyAdmin has trouble importing large files that exceed 'max_allowed_packet' size.
MySQL for Excel: This is a free Excel Add-in from Oracle. This option is a bit tedious because it uses a wizard and the import is slow and buggy with large files, but this may be a good option for small files with VARCHAR data. Fields are not optimized.
For comma-separated values (CSV) files, the results view panel in Workbench has an "Import records from external file" option that imports CSV data directly into the result set. Execute that and click "Apply" to commit the changes.
For Excel files, consider using the official MySQL for Excel plugin.
A while back I answered a very similar question on the EE site and offered the following block of Perl as a quick-and-dirty example of how you could load an Excel sheet directly into MySQL. This bypasses the need to export and import via CSV, which hopefully preserves more of those special characters and eliminates the need to worry about escaping the content.
#!/usr/bin/perl -w
# Purpose: Insert each Worksheet, in an Excel Workbook, into an existing MySQL DB, of the same name as the Excel(.xls).
# The worksheet names are mapped to the table names, and the column names to column names.
# Assumes each sheet is named and that the first ROW on each sheet contains the column(field) names.
#
use strict;
use Spreadsheet::ParseExcel;
use DBI;
use Tie::IxHash;
die "You must provide a filename to $0 to be parsed as an Excel file" unless @ARGV;
my $sDbName = $ARGV[0];
$sDbName =~ s/\.xls//i;
my $oExcel = new Spreadsheet::ParseExcel;
my $oBook = $oExcel->Parse($ARGV[0]);
my $dbh = DBI->connect("DBI:mysql:database=$sDbName;host=192.168.123.123","root", "xxxxxx", {'RaiseError' => 1,AutoCommit => 1});
my ($sTableName, %hNewDoc, $sFieldName, $iR, $iC, $oWkS, $oWkC, $sSql);
print "FILE: ", $oBook->{File} , "\n";
print "DB: $sDbName\n";
print "Collection Count: ", $oBook->{SheetCount} , "\n";
for(my $iSheet=0; $iSheet < $oBook->{SheetCount} ; $iSheet++)
{
    $oWkS = $oBook->{Worksheet}[$iSheet];
    $sTableName = $oWkS->{Name};
    print "Table(WorkSheet name):", $sTableName, "\n";
    for(my $iR = $oWkS->{MinRow} ; defined $oWkS->{MaxRow} && $iR <= $oWkS->{MaxRow} ; $iR++)
    {
        tie ( %hNewDoc, "Tie::IxHash");
        for(my $iC = $oWkS->{MinCol} ; defined $oWkS->{MaxCol} && $iC <= $oWkS->{MaxCol} ; $iC++)
        {
            $sFieldName = $oWkS->{Cells}[$oWkS->{MinRow}][$iC]->Value;
            $sFieldName =~ s/[^A-Z0-9]//gi; #Strip non alpha-numerics from the Column name
            $oWkC = $oWkS->{Cells}[$iR][$iC];
            $hNewDoc{$sFieldName} = $dbh->quote($oWkC->Value) if($oWkC && $sFieldName);
        }
        if ($iR == $oWkS->{MinRow}){
            #eval { $dbh->do("DROP TABLE $sTableName") };
            $sSql = "CREATE TABLE IF NOT EXISTS $sTableName (".(join " VARCHAR(512), ", keys (%hNewDoc))." VARCHAR(255))";
            #print "$sSql \n\n";
            $dbh->do("$sSql");
        } else {
            $sSql = "INSERT INTO $sTableName (".(join ", ",keys (%hNewDoc)).") VALUES (".(join ", ",values (%hNewDoc)).")\n";
            #print "$sSql \n\n";
            eval { $dbh->do("$sSql") };
        }
    }
    print "Rows inserted(Rows):", ($oWkS->{MaxRow} - $oWkS->{MinRow}), "\n";
}
# Disconnect from the database.
$dbh->disconnect();
Note:
Change the connection string (the DBI->connect call that creates $dbh) to suit, and adjust the user ID and password arguments as needed.
If you need XLSX support, a quick switch to Spreadsheet::XLSX is all that's needed. Alternatively, it only takes a few lines of code to detect the file type and call the appropriate library.
The above is a simple hack that assumes everything in a cell is a string/scalar. If preserving type is important, a little function with a few regexps can be used in conjunction with a few if statements to ensure numbers and dates remain in the applicable format when written to the DB.
The above code is dependent on a number of CPAN modules, that you can install, assuming outbound ftp access is permitted, via a:
cpan YAML Data::Dumper Spreadsheet::ParseExcel Tie::IxHash Encode Scalar::Util File::Basename DBD::mysql
Should return something along the following lines (tis rather slow, due to the auto commit):
# ./Excel2mysql.pl test.xls
FILE: test.xls
DB: test
Collection Count: 1
Table(WorkSheet name):Sheet1
Rows inserted(Rows):9892

How can I quickly reformat a CSV file into SQL format in Vim?

I have a CSV file that I need to format (i.e., turn into) a SQL file for ingestion into MySQL. I am looking for a way to add the text delimiters (single quote) to the text, but not to the numbers, booleans, etc. I am finding it difficult because some of the text that I need to enclose in single quotes have commas themselves, making it difficult to key in to the commas for search and replace. Here is an example line I am working with:
1239,1998-08-26,'Severe Storm(s)','Texas,Val Verde,"DEL RIO, PARKS",'No',25,"412,007.74"
This is a FEMA data file with 131246 lines that I got from data.gov and am trying to get into a MySQL database. As you can see, I need to insert a single quote after Texas and before Val Verde, so I tried:
s/,/','/3
But that only replaced the first occurrence of the comma on the first three lines of the file. Once I get past that, I will need to find a way to deal with "DEL RIO, PARKS", as that has a comma that I do not want to place a single quote around.
So, is there a "nice" way to manipulate this data to get it from plain CSV to a proper SQL format?
Thanks
CSV files are notoriously dicey to parse. Different programs export CSV in different ways, possibly including strangeness like embedding new lines within a quoted field or different ways of representing quotes within a quoted field. You're better off using a tool specifically suited to parsing CSV -- perl, python, ruby and java all have CSV parsing libraries, or there are command line programs such as csvtool or ffe.
If you use a scripting language's CSV library, you may also be able to leverage the language's SQL import as well. That's overkill for a one-off, but if you're importing a lot of data this way, or if you're transforming data, it may be worthwhile.
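As a rough illustration of the library route, here is a minimal Python sketch that parses one line with the standard csv module and builds an INSERT statement, quoting text but leaving numbers bare. The table name fema, the helper names, and the shape of the raw input line (before any manual quoting was attempted) are assumptions made for the example; booleans such as No are simply treated as text here.

```python
import csv
import io
import re

# Plain digits or comma-grouped thousands, with an optional decimal part.
NUMBER = re.compile(r"-?(\d+|\d{1,3}(,\d{3})+)(\.\d+)?")

def sql_literal(value):
    """Return a numeric field bare and any other field single-quoted."""
    if NUMBER.fullmatch(value):
        return value.replace(",", "")  # 412,007.74 -> 412007.74
    # Double embedded single quotes; a real load should prefer
    # parameterized queries or LOAD DATA over hand-built literals.
    return "'" + value.replace("'", "''") + "'"

def to_insert(table, row):
    """Build one INSERT statement from a parsed CSV row."""
    return f"INSERT INTO {table} VALUES ({', '.join(sql_literal(v) for v in row)});"

# Assumed shape of the raw line before any manual quoting was attempted.
line = '1239,1998-08-26,Severe Storm(s),Texas,Val Verde,"DEL RIO, PARKS",No,25,"412,007.74"\n'
row = next(csv.reader(io.StringIO(line, newline="")))
statement = to_insert("fema", row)  # "fema" is a hypothetical table name
print(statement)
```

Because the CSV parser has already split the fields, the comma inside "DEL RIO, PARKS" needs no special handling.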
I think that I would also want to do some troubleshooting to find out why the CSV import into MySQL failed.
I would take an approach like this:
:%s/,\("[^"]*"\|[^,"]*\)/,'\1'/g
:%s/^\("[^"]*"\|[^,"]*\)/'\1'/g
In words: the first command finds each comma followed by either a double-quoted run of characters or (\|) a run of characters containing no comma or double quote, and wraps that run in single quotes.
The second command does the same for the first column of each row, anchoring the match at the start of the line (^) instead of at a comma.
Try the csv plugin. It lets you convert the data into other formats. The help includes an example of how to convert the data for import into a database.
Just to bring this to a close, I ended up using @Eric Andres' idea, which was the MySQL load data option:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE MYTABLE FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
The initial .csv file still took a little massaging, but not as much as if I were to do it by hand.
When I commented that the LOAD DATA had truncated my file, I was incorrect. I was treating the file as a typical .sql file and assumed the "ID" column I had added would auto-increment. This turned out to not be the case. I had to create a quick script that prepended an ID to the front of each line. After that, the LOAD DATA command worked for all lines in my file. In other words, all data has to be in place within the file to load before the load, or the load will not work.
Thanks again to all who replied, and to @Eric Andres for his idea, which I ultimately used.

Using Excel to create a CSV file with special characters and then Importing it into a db using SSIS

Take this XLS file
I then save this XLS file as CSV and then open it up with a text editor. This is what I see:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
I see that the double quote character in column C was stored as AB""C, the column value was enclosed with quotations and the double quote character in the data was replaced with 2 double quote characters to indicate that the quote is occurring within the data and not terminating the column value. I also see that the value for column G, 3,2, is enclosed in quotes so that it is clear that the comma occurs within the data rather than indicating a new column. So far, so good.
I am a little surprised that not all of the column values are enclosed in quotes, but even this seems reasonable if I assume that Excel only adds text qualifiers when special characters like a comma or a double quote character exist in the data.
Now I try to use SQL Server (SSIS) to import the CSV file. Note that I specify a double quote character as the text qualifier and a comma as the column delimiter.
However, SSIS imports column 3 incorrectly, i.e., it does not translate the two consecutive double quote characters into a single occurrence of a double quote character.
What do I have to do to get Excel and SSIS to get along?
Generally people avoid the issue by using column delimiter characters that are LESS LIKELY to occur in the data, but this is not a real solution.
I find that if I modify the file from this
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
...to this:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB"C","D,E",F,03,"3,2"
i.e., reducing the two consecutive quotes in column C's value to one, the data is loaded properly. However, this is a little confusing to me. First of all, how does SSIS determine that the double quote between the B and the C is not terminating that column value? Is it because the following characters are not a comma column delimiter or a row delimiter (CRLF)? And why does Excel export it this way?
According to Wikipedia, here are a couple of traits of a CSV file:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
However, it looks like SSIS doesn't like it that way when importing. What can be done to get Excel to create a CSV file that could contain ANY special characters used as column delimiters, text delimiters or row delimiters in the data? There's no reason that it can't work using the approach specified in Wikipedia, which is what I thought the old MS DTS packages used to do...
Update:
If I use Notepad to change the input file to
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
"1","ABC","AB""C","D,E","F","03","3,2","AB""C"
Excel reads it just fine
but SSIS returns
The preview sample contains embedded text qualifiers ("). The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Conclusion:
Just like the error message says in your update...
The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Confirmed bug in Microsoft Connect. I encourage everyone reading this to click on this aforementioned link and place your vote to have them fix this stinker. This is in the top 10 of the most egregious bugs I have encountered.
Do you need to use a comma delimiter?
I used a pipe delimiter with no text qualifier and it worked fine. Here is my output from the text file.
1|ABC|AB"C|D,E|F|03|3,2
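If the CSV that Excel produced is all you have, one way to get to a pipe-delimited file like the one above is to parse it with an RFC 4180-aware reader and re-emit it without qualifiers. This is a minimal Python sketch under that assumption; the function name is made up for illustration, and it refuses fields that cannot be represented without a text qualifier.

```python
import csv
import io

def to_pipe_delimited(csv_text):
    """Re-emit RFC 4180 CSV text as pipe-delimited lines without qualifiers."""
    out_lines = []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        for field in row:
            # Without a text qualifier these characters cannot be represented.
            if "|" in field or "\n" in field or "\r" in field:
                raise ValueError(f"field cannot be written unquoted: {field!r}")
        out_lines.append("|".join(row))
    return "\n".join(out_lines)

# The file as exported by Excel in the question.
excel_csv = 'Col1,Col2,Col3,Col4,Col5,Col6,Col7\n1,ABC,"AB""C","D,E",F,03,"3,2"\n'
converted = to_pipe_delimited(excel_csv)
print(converted)
```

The trade-off is the one already noted in the question: this only works as long as the pipe character and line breaks never occur in the data.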
You have 3 options in my opinion.
Read the data into a stage table.
Run any update queries you need on the columns
Now select your data from the stage table and output it to a flat file.
OR
Use pipes as your delimiters.
OR
Do all of this in a C# application and build it in code.
You could send the row to a script in SSIS and parse and build the file you want there as well.
Using text qualifiers and "character" delimited fields is problematic for sure.
Have Fun!

prevent CRLF in CSV export data

I have an export feature that reads data from the DB (entire records) and writes it to a .txt file, one record per row, with each field separated by ';'. The problem I am facing is that some fields contain CRLFs, and when I write them to the file the output continues on the next line, destroying the structure of the file.
The only solution I can think of is to replace the CRLFs with a custom value and, at import, replace it back with CRLF. But I don't like this solution because these files are huge and the replace operation decreases performance.
Do you have any other ideas?
thank you!
Yes, use a CSV generator that quotes string values. For example, Python's csv module.
For example (ripped and modified from the csv docs):
import csv

def write(filename):
    # Python 3: newline='' lets the csv module control line endings itself.
    with open(filename, 'w', newline='') as f:
        spamWriter = csv.writer(f, quoting=csv.QUOTE_ALL)
        spamWriter.writerow(['Spam'] * 5 + ['Baked Beans'])
        spamWriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam\nbar'])

def read(filename):
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            print(row)

write('eggs.csv')
read('eggs.csv')
Outputs:
['Spam', 'Spam', 'Spam', 'Spam', 'Spam', 'Baked Beans']
['Spam', 'Lovely Spam', 'Wonderful Spam\nbar']
If you have control over how the file is exported and imported, you might want to consider using XML. I also believe you can use double quotes to mark values that contain literals like ",".