Postgres: import a file that has columns separated by newlines (MySQL equivalent)

I have a large text file that has one column per row and I want to import this data file into Postgres.
I have a working MySQL script.
LOAD DATA LOCAL
INFILE '/Users/Farmor/data.sql'
INTO TABLE tablename
COLUMNS TERMINATED BY '\n';
How can I translate this into Postgres? I've tried, among other things, this command:
COPY tablename
FROM '/Users/Farmor/data.sql'
WITH DELIMITER '\n'
However it complains:
ERROR: COPY delimiter must be a single one-byte character

The immediate error is because '\n' here is just a two-character string: a backslash and an n.
You want:
COPY tablename
FROM '/Users/Farmor/data.sql'
WITH DELIMITER E'\n'
The E'' syntax is a PostgreSQL extension.
It still won't work, though, because PostgreSQL's COPY can't understand files with newline column delimiters. I've never even seen that format.
You'll need to load it using another tool and transform it into CSV. Use an office suite, the csv module for Python, Text::CSV for Perl, or whatever. Then feed the cleaned-up CSV into PostgreSQL.
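For concreteness, here is a minimal Python sketch of such a transform, assuming the file really does hold a fixed number of columns per record, one per line (the output path and the column count are assumptions):
import csv

COLUMNS_PER_ROW = 3  # assumption: each record spans this many input lines

with open('/Users/Farmor/data.sql') as src, \
     open('/Users/Farmor/data.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    row = []
    for line in src:
        row.append(line.rstrip('\n'))
        if len(row) == COLUMNS_PER_ROW:
            writer.writerow(row)  # emit one CSV row per group of input lines
            row = []
The resulting file can then be loaded with COPY tablename FROM '/Users/Farmor/data.csv' WITH (FORMAT csv).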

While PostgreSQL will not recognize \n as a field delimiter, the original question asked how to import each row as a single column, and this can be accomplished in PostgreSQL by defining a delimiter that is not found in the data. For example:
COPY tablename
FROM '/Users/Farmor/data.sql'
WITH DELIMITER '~';
If no ~ is found in the row, PostgreSQL will treat the entire row as one column.
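If the file sits on the client machine rather than on the database server (server-side COPY reads files as the server process), the same trick can be driven from Python via psycopg2's copy_expert. A minimal sketch; the connection details are assumptions:
import psycopg2

conn = psycopg2.connect('dbname=mydb user=farmor')  # assumed connection details
with conn, conn.cursor() as cur, open('/Users/Farmor/data.sql') as f:
    # '~' is a dummy delimiter; each whole line lands in the single column
    cur.copy_expert("COPY tablename FROM STDIN WITH (DELIMITER '~')", f)
Note that COPY's default text format still treats backslashes in the data as escapes, so this only works cleanly if the input contains none.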

Your delimiter is two characters, so it's a valid error message.
I believe the simplest approach would be to modify the file you're importing from and actually change the delimiters to something other than \n, but that might not be an option in your situation.
This question addresses the same issue:
ERROR: COPY delimiter must be a single one-byte character

Related

Redshift - Delimited value missing end quote

I'm trying to load a CSV file into Redshift.
Delimiter '|'
The first line of the CSV:
1 |Bhuvi|"This is ok"|xyz#domain.com
I used this command to load it:
copy tbl from 's3://datawarehouse/source.csv'
iam_role 'arn:aws:iam:::role/xxx'
delimiter '|'
removequotes
ACCEPTINVCHARS ;
ERROR:
raw_field_value | This is ok" |xyz#domain.com
err_code | 1214
err_reason | Delimited value missing end quote
Then I tried this too:
copy tbl from 's3://datawarehouse/source.csv'
iam_role 'arn:aws:iam:::role/xxx'
CSV QUOTE '\"'
DELIMITER '|'
ACCEPTINVCHARS ;
Disclaimer - even though this post does not answer the question asked here, I am posting this analysis in case it helps someone.
The error "Delimited value missing end quote" can be reported in cases where a quoted text column is missing the end quote, or if the text column value has a new line in the value itself. In my case, there was a newline in the text column value.
As per RFC 4180, the CSV specification says:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes.
For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
So a valid CSV can have multi-line rows, and the correct way to import it in Redshift is to specify the CSV format option. This also assumes that all columns having the quote character in the value will have the quote character escaped by another preceding quote character. This is also as per the CSV RFC specification.
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote.
For example:
"aaa","b""bb","ccc"
If the file that we are trying to import is not valid CSV and is just named with a .csv extension, as may well be the case, then we have the following options.
Try copying the file without specifying the CSV option, and fine tuning the delimiter and escape and quoting behaviour with the corresponding copy options.
If a set of options is not able to consistently copy data, then pre-process the file to make it consistent.
In general, it helps to make the behaviour deterministic if we try to export and import data in formats that are consistent.
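As a quick check that a conforming parser really treats such multi-line rows as single records, Python's csv module parses the RFC example above like this (a small sketch):
import csv, io

# The RFC 4180 example: a quoted field containing a CRLF line break
data = io.StringIO('"aaa","b\r\nbb","ccc"\r\nzzz,yyy,xxx\r\n', newline='')
for row in csv.reader(data):
    print(row)
# ['aaa', 'b\r\nbb', 'ccc']
# ['zzz', 'yyy', 'xxx']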

How to bulk load into Cassandra other than the COPY method?

I am using the COPY method for copying the .csv file into the Cassandra tables,
but I am getting a record-mismatch error:
Record 41(Line 41) has mismatched number of records (85 instead of 82)
This is happening for all the .csv files, and all the .csv files are system generated.
Is there any workaround for this error?
Based on your error message, it sounds like the copy command is working for you, until record 41. What are you using as a delimiter? The default delimiter for the COPY command is a comma, and I'll bet that your data has some additional commas in it on line 41.
A few options:
Edit your data and remove the extra commas.
Alter your .csv file to encapsulate the values of all of your fields in double-quotes, as COPY's default QUOTE value is ". This will allow you to leave the in-text commas.
Alter your .csv file to delimit with pipes | instead of a comma, and set the COPY command's DELIMITER option to | (a sketch of this rewrite follows below).
Try using either the Cassandra bulk loader or json2sstable utility to import your data. I've never used them, but I would bet you'll have similar problems if you have commas in your data set.
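Options 2 and 3 can be combined in one short pre-processing pass with Python's csv module (a sketch; the file names are assumptions):
import csv

# Rewrite the comma-delimited file as pipe-delimited with every field quoted,
# so embedded commas can no longer shift the column count.
with open('input.csv', newline='') as src, \
     open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src)  # parses whatever quoting the file already has
    writer = csv.writer(dst, delimiter='|', quoting=csv.QUOTE_ALL)
    for row in reader:
        writer.writerow(row)
The rewritten file should then load with the COPY command's DELIMITER option set to |.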

How can I quickly reformat a CSV file into SQL format in Vim?

I have a CSV file that I need to format (i.e., turn into) a SQL file for ingestion into MySQL. I am looking for a way to add the text delimiters (single quotes) to the text, but not to the numbers, booleans, etc. I am finding it difficult because some of the text that I need to enclose in single quotes contains commas itself, making it hard to target the right commas with search and replace. Here is an example line I am working with:
1239,1998-08-26,'Severe Storm(s)','Texas,Val Verde,"DEL RIO, PARKS",'No',25,"412,007.74"
This is a FEMA data file, with 131,246 lines, that I got off of data.gov and am trying to get into a MySQL database. As you can see, I need to insert a single quote after Texas and before Val Verde, so I tried:
s/,/','/3
But that only replaced the first occurrence of the comma on the first three lines of the file. Once I get past that, I will need to find a way to deal with "DEL RIO, PARKS", as that has a comma that I do not want to place a single quote around.
So, is there a "nice" way to manipulate this data to get it from plain CSV to a proper SQL format?
Thanks
CSV files are notoriously dicey to parse. Different programs export CSV in different ways, possibly including strangeness like embedding new lines within a quoted field or different ways of representing quotes within a quoted field. You're better off using a tool specifically suited to parsing CSV -- perl, python, ruby and java all have CSV parsing libraries, or there are command line programs such as csvtool or ffe.
If you use a scripting language's CSV library, you may also be able to leverage the language's SQL import as well. That's overkill for a one-off, but if you're importing a lot of data this way, or if you're transforming data, it may be worthwhile.
I think that I would also want to do some troubleshooting to find out why the CSV import into MySQL failed.
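To make the scripting route concrete, here is a minimal Python sketch of a CSV-to-INSERT converter (the file names, table name, and the quote-everything-non-numeric rule are all assumptions, and it ignores NULLs and other edge cases):
import csv

def sql_literal(value):
    # Leave plain numbers bare; quote everything else, doubling single quotes.
    try:
        float(value)
        return value
    except ValueError:
        return "'" + value.replace("'", "''") + "'"

with open('disasters.csv', newline='') as f, open('disasters.sql', 'w') as out:
    for row in csv.reader(f):
        out.write('INSERT INTO mytable VALUES (%s);\n'
                  % ', '.join(sql_literal(v) for v in row))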
I would take an approach like this:
:%s/,\("[^"]*"\|[^,"]*\)/,'\1'/g
:%s/^\("[^"]*"\|[^,"]*\)/'\1'/g
In words: look for a comma followed by either a double-quoted run of characters or (the \| alternation) a run containing no comma or double quote, and wrap the captured run in single quotes.
Then, for the first column in a row, do the same, anchoring at the start of the line instead of at a comma.
Try the csv plugin. It allows you to convert the data into other formats. The help includes an example of how to convert the data for importing it into a database.
Just to bring this to a close, I ended up using @Eric Andres' idea, which was the MySQL LOAD DATA option:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE MYTABLE FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
The initial .csv file still took a little massaging, but not as much as if I were to do it by hand.
When I commented that LOAD DATA had truncated my file, I was incorrect. I was treating the file as a typical .sql file and assumed the "ID" column I had added would auto-increment. This turned out not to be the case. I had to create a quick script that prepended an ID to the front of each line. After that, the LOAD DATA command worked for all lines in my file. In other words, all data has to be in place within the file before the load, or the load will not work.
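For anyone needing the same fix, prepending a sequential ID is a short Python job (a sketch; the file names are assumptions, and it presumes one record per physical line):
# Prepend a sequential ID column so every value is present before LOAD DATA runs.
with open('file.csv') as src, open('file_with_ids.csv', 'w') as dst:
    for i, line in enumerate(src, start=1):
        dst.write('%d,%s' % (i, line))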
Thanks again to all who replied, and to @Eric Andres for his idea, which I ultimately used.

Using Excel to create a CSV file with special characters and then Importing it into a db using SSIS

Take this XLS file
I then save this XLS file as CSV and then open it up with a text editor. This is what I see:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
I see that the double quote character in column C was stored as AB""C, the column value was enclosed with quotations and the double quote character in the data was replaced with 2 double quote characters to indicate that the quote is occurring within the data and not terminating the column value. I also see that the value for column G, 3,2, is enclosed in quotes so that it is clear that the comma occurs within the data rather than indicating a new column. So far, so good.
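Python's csv module follows the same convention and makes a handy sanity check on what Excel wrote (a small sketch):
import csv, io

line = '1,ABC,"AB""C","D,E",F,03,"3,2"'
print(next(csv.reader(io.StringIO(line))))
# ['1', 'ABC', 'AB"C', 'D,E', 'F', '03', '3,2']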
I am a little surprised that not all of the column values are enclosed in quotes, but even this seems reasonably OK if I assume that Excel only adds the quoting when special characters like a comma or a double-quote character exist in the data.
Now I try to use SQL Server to import the csv file. Note that I specify a double quote character as the Text Qualifier character.
And a comma as the column delimiter character. However, note that SSIS imports column 3 incorrectly, e.g., not translating the two consecutive double-quote characters as a single occurrence of a double-quote character.
What do I have to do to get Excel and SSIS to get along?
Generally people avoid the issue by using column delimiter characters that are less likely to occur in the data, but this is not a real solution.
I find that if I modify the file from this
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
...to this:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB"C","D,E",F,03,"3,2"
i.e., removing the two consecutive quotes in column C's value, the data is loaded properly. However, this is a little confusing to me. First of all, how does SSIS determine that the double quote between the B and the C is not terminating that column value? Is it because the following character is not a comma column delimiter or a row delimiter (CRLF)? And why does Excel export it this way?
According to Wikipedia, here are a couple of traits of a CSV file:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
However, it looks like SSIS doesn't like it that way when importing. What can be done to get Excel to create a CSV file that could contain ANY special characters used as column delimiters, text delimiters or row delimiters in the data? There's no reason it can't work using the approach specified in Wikipedia, which is what I thought the old MS DTS packages used to do...
Update:
If I use Notepad change the input file to
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
"1","ABC","AB""C","D,E","F","03","3,2","AB""C"
Excel reads it just fine
but SSIS returns
The preview sample contains embedded text qualifiers ("). The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Conclusion:
Just like the error message says in your update...
The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Confirmed bug in Microsoft Connect. I encourage everyone reading this to click on this aforementioned link and place your vote to have them fix this stinker. This is in the top 10 of the most egregious bugs I have encountered.
Do you need to use a comma delimiter?
I used a pipe delimiter with no text qualifier and it worked fine. Here is my output from the text file.
1|ABC|AB"C|D,E|F|03|3,2
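A pre-processing pass that produces exactly this kind of pipe-delimited, qualifier-free output could look like the following Python sketch (the file names are assumptions, and it presumes no field ever contains a pipe):
import csv

with open('excel_export.csv', newline='') as src, open('pipes.txt', 'w') as dst:
    for row in csv.reader(src):          # csv.reader undoes Excel's "" escaping
        dst.write('|'.join(row) + '\n')  # assumes no field contains a pipe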
You have 3 options in my opinion.
Read the data into a stage table.
Run any update queries you need on the columns
Now select your data from the stage table and output it to a flat file.
OR
Use pipes as your delimiters.
OR
Do all of this in a C# application and build it in code.
You could send the row to a script in SSIS and parse and build the file you want there as well.
Using text qualifiers and "character" delimited fields is problematic for sure.
Have Fun!

Change the column data delimiter on mysqldump output

I'm looking to change the formatting of the output produced by the mysqldump command in the following way:
(data_val1,data_val2,data_val3,...)
to
(data_val1|data_val2|data_val3|...)
The change here being a different delimiter. This would then allow me to (in Python) parse the data lines using a line.split("|") call and end up with the values correctly split (as opposed to doing line.split(",") and having values that contain commas split into multiple values).
I've tried using the --fields-terminated-by flag, but this requires the --tab flag to be used as well. I don't want to use the --tab flag as it splits the dump into several files. Does anyone know how to alter the delimiter that mysqldump uses?
This is not a good idea. Instead of using string.split() in Python, use the csv module to properly parse CSV data, which may be enclosed in quotes and may contain internal commas that aren't delimiters.
import csv
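For instance, continuing from that import, a minimal sketch of parsing such rows with the csv module rather than split (the file name is an assumption):
with open('dump_rows.csv', newline='') as f:
    for row in csv.reader(f):  # handles quoted fields with embedded commas
        print(row)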
MySQL dump files are intended to be used as input back into MySQL. If you really want pipe-delimited output, use the SELECT INTO OUTFILE syntax instead with the FIELDS TERMINATED BY '|' option.