Meaning of Empty Line in CSV File - csv

At first this seemed obvious, but now I'm not so sure.
If a CSV file has the following line:
a,
I would interpret that as two fields with the values "a" and "". But then looking at an empty line, I could just as easily argue that it signifies one field with the value "".
I accept that an empty line at the end of the file should be interpreted as the end of the file (no field). But does anyone have any information on what an empty line within the file should mean?

Looking at how Excel handles empty lines when reading CSV files, I can see that Excel does not ignore them.
Unfortunately, there is no way to tell if the empty line was treated as an empty field or no fields at all because Excel always has the same number of columns.
I saw some proprietary uses of the CSV format where there was an option for how blank lines should be treated. In the end, this is the approach I took. My CSV reader class has four options for how to deal with empty lines:
Ignore and skip over them
Treat them as a row with zero fields
Treat them as a row with one empty field
Treat them as the end of the input file
If anyone's interested, I will be posting the new source code to replace the existing article at Reading and Writing CSV Files in C#.
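For illustration only, here is a minimal Python sketch of the four options (not the C# class from the article). The names `EmptyLine` and `read_rows` are made up for this example, and fields are split on commas only, so quoted fields are not handled.

```python
import enum


class EmptyLine(enum.Enum):
    """Hypothetical policy names for the four options listed above."""
    IGNORE = "ignore"                # skip over the empty line
    NO_FIELDS = "no_fields"          # row with zero fields
    ONE_EMPTY_FIELD = "one_empty"    # row with one empty field
    END_OF_INPUT = "end_of_input"    # stop reading


def read_rows(lines, empty_line=EmptyLine.IGNORE):
    """Yield rows from an iterable of lines, applying the chosen policy.

    Simplification: fields are split on commas; quoting is not supported.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line != "":
            yield line.split(",")
        elif empty_line is EmptyLine.IGNORE:
            continue
        elif empty_line is EmptyLine.NO_FIELDS:
            yield []
        elif empty_line is EmptyLine.ONE_EMPTY_FIELD:
            yield [""]
        else:  # EmptyLine.END_OF_INPUT
            return


sample = ["a,b\n", "\n", "c,\n"]
for policy in EmptyLine:
    print(policy.name, list(read_rows(sample, policy)))
```

Making the policy an explicit argument keeps the decision with the caller, which seems sensible given that the format itself does not settle the question.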

Be aware that an empty line might be part of a multiline quoted field:
1,2,"this
is
field number
3",4,5
is valid CSV.
In most CSV files I've seen, the number of fields is constant per row (although that doesn't have to be so), so unless a CSV file only has one column, I would expect empty lines (outside of quoted fields) to be a mistake.
I just checked: Python's csv.reader returns an empty row (a list with no fields) for an empty line, and csv.DictReader skips such lines entirely. I guess that's reasonable.
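A quick way to check this yourself with Python's standard csv module. The sample data is made up: it has one empty line between records and one empty line inside a quoted field.

```python
import csv
import io

# Header, an empty line, then a record whose second field spans three
# lines (the middle one empty) inside double quotes.
data = 'a,b\n\n1,"x\n\ny"\n'

# csv.reader reports the empty line between records as an empty list.
rows = list(csv.reader(io.StringIO(data)))
print(rows)

# csv.DictReader skips that empty row instead of returning it.
dict_rows = list(csv.DictReader(io.StringIO(data)))
print(dict_rows)
```

In both cases the empty line inside the quotes stays part of the field value, so only empty lines outside quoted fields are open to interpretation.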

To the best of my understanding and experience, it stands for a missing record and should be ignored. Don't treat it as EOF.

TL;DR: After thinking about the RFC, I would interpret an empty line as a record with one empty value.
The RFC (https://datatracker.ietf.org/doc/html/rfc4180) contains a grammar which includes, among other things, this:
file = [header CRLF] record *(CRLF record) [CRLF]
...
record = field *(COMMA field)
...
field = (escaped / non-escaped)
non-escaped = *TEXTDATA
Strictly speaking, the grammar does not define the semantics, but I would still interpret it to mean that a record has at least one field, possibly with an empty value.
If I were writing a grammar in which a record could have no fields at all, I would write something different, maybe:
record = *fields CRLF
fields = field *(COMMA field)
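To make that reading concrete, here is a tiny Python sketch. The function name `parse_record` is invented for the example, and it assumes unquoted fields only, so a field is any run of text without commas (possibly empty).

```python
def parse_record(line):
    """Parse one record under a literal reading of
    `record = field *(COMMA field)`, assuming no quoted fields.

    One field is always present; every comma adds another one.
    """
    return line.split(",")


print(parse_record("a,"))  # two fields: "a" and ""
print(parse_record(""))    # one field: ""
```

Under this reading there is no input that produces a record with zero fields, which is the point made above.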

Related

How can I quickly reformat a CSV file into SQL format in Vim?

I have a CSV file that I need to format (i.e., turn into) a SQL file for ingestion into MySQL. I am looking for a way to add the text delimiters (single quotes) to the text, but not to the numbers, booleans, etc. I am finding it difficult because some of the text that I need to enclose in single quotes contains commas itself, making it difficult to key on the commas for search and replace. Here is an example line I am working with:
1239,1998-08-26,'Severe Storm(s)','Texas,Val Verde,"DEL RIO, PARKS",'No',25,"412,007.74"
This is a FEMA data file with 131246 lines that I got from data.gov and am trying to get into a MySQL database. As you can see, I need to insert a single quote after Texas and before Val Verde, so I tried:
s/,/','/3
But that only replaced the first occurrence of the comma on the first three lines of the file. Once I get past that, I will need to find a way to deal with "DEL RIO, PARKS", as that has a comma that I do not want to place a single quote around.
So, is there a "nice" way to manipulate this data to get it from plain CSV to a proper SQL format?
Thanks
CSV files are notoriously dicey to parse. Different programs export CSV in different ways, possibly including strangeness like embedding new lines within a quoted field or different ways of representing quotes within a quoted field. You're better off using a tool specifically suited to parsing CSV -- perl, python, ruby and java all have CSV parsing libraries, or there are command line programs such as csvtool or ffe.
If you use a scripting language's CSV library, you may also be able to leverage the language's SQL import as well. That's overkill for a one-off, but if you're importing a lot of data this way, or if you're transforming data, it may be worthwhile.
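As a rough sketch of the library approach in Python: the standard csv module does the parsing, and a small helper decides which values get single quotes. The names `sql_literal` and `csv_to_inserts`, the table name `fema`, and the cleaned-up sample line are all assumptions made for this example. Only single quotes are escaped (by doubling them), so for a real load a parameterized query through a database driver would be safer.

```python
import csv
import io
import re

# Plain integers and decimals are left unquoted; everything else is text.
NUMERIC = re.compile(r"-?\d+(\.\d+)?")


def sql_literal(value):
    """Return value as a SQL literal: bare if numeric, else single-quoted."""
    if NUMERIC.fullmatch(value):
        return value
    return "'" + value.replace("'", "''") + "'"


def csv_to_inserts(text, table):
    """Yield one INSERT statement per CSV record in text."""
    for row in csv.reader(io.StringIO(text)):
        if row:  # csv.reader yields [] for blank lines; skip those
            values = ", ".join(sql_literal(v) for v in row)
            yield f"INSERT INTO {table} VALUES ({values});"


# Hypothetical plain-CSV version of the example line from the question.
sample = ('1239,1998-08-26,Severe Storm(s),Texas,Val Verde,'
          '"DEL RIO, PARKS",No,25,"412,007.74"\n')

for statement in csv_to_inserts(sample, "fema"):
    print(statement)
```

Note that "412,007.74" comes out as quoted text because of the thousands separator; it would need to be cleaned separately to load as a number.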
I think that I would also want to do some troubleshooting to find out why the CSV import into MySQL failed.
I would take an approach like this:
:%s/,\("[^"]*"\|[^,"]*\)/,'\1'/g
:%s/^\("[^"]*"\|[^,"]*\)/'\1'/g
In words: the first command looks for a comma followed by either a double-quoted set of characters or (\|) a set of characters containing no comma or double quote, and wraps that set of characters in single quotes.
The second command does the same for the first column in a row, which is anchored at the start of the line (^) rather than preceded by a comma. Note that every field gets wrapped, numbers included, and any double quotes stay inside the single quotes.
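If it helps to experiment outside Vim, the same two substitutions can be approximated with Python's re module. This is only a translation of the patterns above; `quote_fields` is a name made up for the sketch.

```python
import re


def quote_fields(line):
    """Apply rough equivalents of the two Vim substitutions to one line."""
    # Every field that follows a comma: a double-quoted run, or a run
    # containing no comma or double quote, wrapped in single quotes.
    line = re.sub(r',("[^"]*"|[^,"]*)', r",'\1'", line)
    # The first field, anchored at the start of the line.
    return re.sub(r'^("[^"]*"|[^,"]*)', r"'\1'", line)


print(quote_fields('1,ab,"c,d",5'))
```

Running it on a sample line shows two things: the comma inside the double-quoted field is left alone, but numbers are wrapped too and the double quotes end up inside the single quotes, so some cleanup is still needed afterwards.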
Try the csv plugin. It can convert the data into other formats. The help includes an example of how to convert the data for importing it into a database.
Just to bring this to a close, I ended up using @Eric Andres's idea, which was the MySQL load data option:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE MYTABLE FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
The initial .csv file still took a little massaging, but not as much as if I were to do it by hand.
When I commented that the LOAD DATA had truncated my file, I was incorrect. I was treating the file as a typical .sql file and assumed the "ID" column I had added would auto-increment. This turned out to not be the case. I had to create a quick script that prepended an ID to the front of each line. After that, the LOAD DATA command worked for all lines in my file. In other words, all data has to be in place within the file to load before the load, or the load will not work.
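The script itself is not shown here, but something along these lines would do the job in Python. The function name `prepend_ids` and the file names in the usage comment are placeholders, and the sketch assumes one record per line (no line breaks inside quoted fields).

```python
def prepend_ids(lines, start=1):
    """Yield each line with a sequential ID and a comma prepended.

    Assumes one record per physical line.
    """
    for number, line in enumerate(lines, start):
        yield f"{number},{line}"


# Usage sketch with placeholder file names:
#
#     with open("file.csv", newline="") as src, \
#             open("file_with_ids.csv", "w", newline="") as dst:
#         dst.writelines(prepend_ids(src))

print(list(prepend_ids(["a,b\n", "c,d\n"])))
```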
Thanks again to all who replied, and to @Eric Andres for his idea, which I ultimately used.

Lose data in random fields when importing from file into table using phpmyadmin

I have an Access DB. I exported tables to xlsx. Then I saved them as .ods using OpenOffice because I found out that phpMyAdmin no longer supports Excel files. I have my MySQL database formatted exactly as it should be to accept the data. I import and everything seems fine except one little detail.
In some fields, the value is NULL instead of the value it should have according to the .ods file. Some rows show the same value for that field correctly, some show NULL.
Also, the "faulty" rows have some fields that show the value 0 for fields that were empty in the imported file (instead of NULL). The default value for those fields in MySQL is NULL. Each row has many fields like that, all of the same data type (tinyint). Some appear correctly as NULL and some have the value 0.
I can't see a pattern in any of this.
Any help is appreciated.
Check that imported strings are enclosed in quotes ("") and that NULLs are not, and that everything is separated appropriately: usually a comma between fields, with each record/row terminated by a semicolon. The best way to see what MySQL expects is to export some existing data in the same format and compare it against what you are trying to import. One missed quote and the deal is off. Be consistent in using either double quotes (") or single quotes ('); as far as I know, the backtick (`) is not used for values. If you are "squishing" your data through an application that applies "smart quotes", as MS Word does (and possibly OpenOffice), that can also cause issues. Where a value should be NULL, put the word NULL in your CSV import, either inside or outside quotes as appropriate.

Using Excel to create a CSV file with special characters and then Importing it into a db using SSIS

Take this XLS file
I then save this XLS file as CSV and then open it up with a text editor. This is what I see:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
I see that the double quote character in column C was stored as AB""C, the column value was enclosed with quotations and the double quote character in the data was replaced with 2 double quote characters to indicate that the quote is occurring within the data and not terminating the column value. I also see that the value for column G, 3,2, is enclosed in quotes so that it is clear that the comma occurs within the data rather than indicating a new column. So far, so good.
I am a little surprised that not all of the column values are enclosed in quotes, but even this seems reasonable if I assume that Excel only adds text qualifiers (quotes) when special characters like a comma or a double quote character exist in the data.
Now I try to use SQL Server to import the CSV file. Note that I specify a double quote character as the Text Qualifier character.
And a comma character as the Column delimiter character. However, note that SSIS imports column 3 incorrectly, i.e., it does not translate the two consecutive double quote characters into a single occurrence of a double quote character.
What do I have to do to get Excel and SSIS to get along?
Generally people avoid the issue by using column delimiter characters that are LESS LIKELY to occur in the data, but this is not a real solution.
I find that if I modify the file from this
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB""C","D,E",F,03,"3,2"
...to this:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,ABC,"AB"C","D,E",F,03,"3,2"
i.e., collapsing the two consecutive quotes in column C's value into a single quote, the data is loaded properly. However, this is a little confusing to me. First of all, how does SSIS determine that the double quote between the B and the C is not terminating that column value? Is it because the following characters are not a comma column delimiter or a row delimiter (CRLF)? And why does Excel export it this way?
According to Wikipedia, here are a couple of traits of a CSV file:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
However, it looks like SSIS doesn't like it that way when importing. What can be done to get Excel to create a CSV file that could contain ANY special characters used as column delimiters, text delimiters or row delimiters in the data? There's no reason that it can't work using the approach specified in Wikipedia, which is what I thought the old MS DTS packages used to do...
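For comparison, here is how Python's standard csv module handles the same row. This says nothing about SSIS itself; it only shows that the doubled-quote convention round-trips cleanly in a parser that supports it. The row values are taken from the example file above.

```python
import csv
import io

# The data row from the example, as plain (unescaped) values.
row = ["1", "ABC", 'AB"C', "D,E", "F", "03", "3,2"]

# Write with the default dialect: minimal quoting, embedded quotes doubled.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)

# Read it back: the doubled quote becomes a single quote again.
parsed = next(csv.reader(io.StringIO(line, newline="")))
print(parsed == row)
```

The written line matches what Excel produced, including leaving fields like ABC and 03 unquoted.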
Update:
If I use Notepad to change the input file to
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
"1","ABC","AB""C","D,E","F","03","3,2","AB""C"
Excel reads it just fine
but SSIS returns
The preview sample contains embedded text qualifiers ("). The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Conclusion:
Just like the error message says in your update...
The flat file parser does not support embedding text qualifiers in data. Parsing columns that contain data with text qualifiers will fail at run time.
Confirmed bug in Microsoft Connect. I encourage everyone reading this to click on this aforementioned link and place your vote to have them fix this stinker. This is in the top 10 of the most egregious bugs I have encountered.
Do you need to use a comma delimiter?
I used a pipe delimiter with no Text qualifier and it worked fine. Here is my output from the text file.
1|ABC|AB"C|D,E|F|03|3,2
You have 3 options in my opinion.
Read the data into a stage table.
Run any update queries you need on the columns
Now select your data from the stage table and output it to a flat file.
OR
Use pipes as your delimiters.
OR
Do all of this in a C# application and build it in code.
You could send the row to a script in SSIS and parse and build the file you want there as well.
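As one possible shape for such a script, here is a short Python sketch (rather than C# or an SSIS script task) that rewrites the Excel-style CSV as pipe-delimited text with no text qualifier. The function name `to_pipe_delimited` is made up, and the sketch simply refuses any field that contains a pipe or a line break, since those cannot be represented without a qualifier.

```python
import csv
import io


def to_pipe_delimited(text):
    """Convert quoted CSV text to pipe-delimited lines without qualifiers."""
    lines = []
    for row in csv.reader(io.StringIO(text, newline="")):
        for field in row:
            # Without a text qualifier these characters would be ambiguous.
            if "|" in field or "\n" in field or "\r" in field:
                raise ValueError(f"field cannot be written unquoted: {field!r}")
        lines.append("|".join(row))
    return lines


source = '1,ABC,"AB""C","D,E",F,03,"3,2"\r\n'
print(to_pipe_delimited(source))
```

The output for the sample row is the same pipe-delimited line shown above.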
Using text qualifiers and "character" delimited fields is problematic for sure.
Have Fun!

How to deal with a string with comma in it from a csv, when we have to read the data by using loadrunner?

When I use LoadRunner, it can read data from a CSV file. As we know, the fields in a CSV file are separated by commas.
The question is: if a parameter in the CSV contains a comma itself, the string will be split into several segments. That is not what I want.
How can we get the original data with comma in it?
When data has a comma, use an escape character to store the data in the parameter.
For example, if the name is 'Smith, John', it can be stored as Smith\, John in the Loadrunner data file.
When you save a file in Excel that has commas in the actual cell data, the whole cell will be inside two " characters. Also it seems that cells with a space in them are inside " chars.
Example
ColA,ColB,"ColC with a , inside",ColD,ColE
More info on CSV file format: http://www.parse-o-matic.com/parse/pskb/CSV-File-Format.htm
The answer to the question is that perhaps the easiest way to deal with , separators is to change the separator to a ; character. This is also a valid separator in CSV files.
Then the example would be:
ColA;ColB;"ColC with a , inside";ColD;ColE
Maybe the right way is to use C functions to read data from the file (for example fopen/fread)? Once you have read it, you will be able to use "strchr" to find the first quote character and the second quote character. Everything in that interval would be a value, and it doesn't matter if a comma is inside.
For the documentation about fopen, fread,strchr, you could refer to the HP or C function references.
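The idea is easy to try out before writing the C version. Below is a Python sketch of the same logic, with str.find standing in for strchr; the function name `quoted_value` is invented for the example, and it only looks at the first pair of quotes on a line.

```python
def quoted_value(line):
    """Return the text between the first pair of double quotes, or None."""
    first = line.find('"')               # like strchr(line, '"')
    if first == -1:
        return None
    second = line.find('"', first + 1)   # like strchr(first + 1, '"')
    if second == -1:
        return None
    return line[first + 1:second]


print(quoted_value('ColA,ColB,"ColC with a , inside",ColD,ColE'))
```

A value containing an escaped (doubled) quote would need extra handling that this sketch does not attempt.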
Hope this will help you.
Assuming you are reading from a data file for the parameters, just use a custom separator. Comma is the default, but you can define it to be whatever you want. Whenever I have a comma in the variable data I tend to use a pipe symbol, '|', as a way to distinguish the columns of data in the data file.
Examine your parameter dialog carefully and you will see where to make the change.

Handling a .CSV when one of its values may include a comma in the string

I have a .csv that I need to convert to a coldfusion query. I have used the cflib.org CSVtoQuery method which works fine... but...
If there is a 'cell' in the csv that includes a comma in the string, such as a list, the query row for that record gets messed up as it sees the comma in the string as a new value.
I have no control over how the data is going in, so I can't have it written or passed inside quotes or the like.
Does anyone know if there is a way to process a .csv (convert to a query or other workable struct) that may have commas in the values?
No, there isn't. Whoever is making the CSV is not making it properly. No CSV parser can tell the difference between commas that separate and commas that don't if there is no way to tell the difference.
Whoever is making the file should choose a different delimiter.