I am using OpenCSV to read CSV files. Looking over the docs, I don't see guidelines on how to handle malformed data.
I have a CSV file with all the expected features: fields are separated by commas, and each field is surrounded by quotes in case a value contains a comma. However, every line (except the header) is missing its leading quote. Here is an example:
"Header 1","Header2"
value1","value2"
value1","value2"
The CSV parser ended up skipping every other line because of the way the quotes lined up, which obviously causes problems.
I would consider this an error: I know what the data should look like, and the first column is missing its opening quotation mark. But as far as the CSV spec is concerned, might this be considered valid? If so, I suppose I would have to build extra checks myself to make sure that I am not missing any lines, even though the file contains valid CSV data.
According to the RFC for CSV files:
While there are various specifications and implementations for the
CSV format, there is no formal
specification in existence, which allows for a wide variety of
interpretations of CSV files.
So simply put, malformed? No. Informal? No. Even this article (linked in the RFC) mentions that quoted and unquoted lines can be mixed.
For the data you show:
"Header 1","Header2"
value1","value2"
value1","value2"
we could argue that the data is not malformed, provided the fields are treated as unquoted, never contain a separator, and never span multiple lines. That would give the values:
"Header 1" "Header2"
value1" "value2"
value1" "value2"
Of course it's obvious this data was meant to have quoted fields. In that case the data is certainly malformed, and could be parsed differently with different parsers (maybe even as multiline fields).
Valid options would be:
value1,value2 // no quotes at all
"value1","value2" // all quoted
value1,"value2,more data" // only quoted when there is a separator inside
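One way to build the "extra check" the question mentions is to pre-scan the file and flag physical lines that hold an odd number of double quotes before they reach the parser. The sketch below is only an illustration in Python, not OpenCSV functionality: the function name is made up, and the check assumes no field legitimately spans several lines (a multi-line quoted field would also produce odd counts).

```python
def find_unbalanced_quote_lines(text):
    """Return the 1-based numbers of lines with an odd number of double quotes."""
    suspicious = []
    for number, line in enumerate(text.split("\n"), start=1):
        # A doubled quote ("") inside a quoted field adds two quotes,
        # so it does not change whether the count is odd or even.
        if line.count('"') % 2 == 1:
            suspicious.append(number)
    return suspicious


sample = (
    '"Header 1","Header2"\n'
    'value1","value2"\n'
    'value1","value2"\n'
)
flagged = find_unbalanced_quote_lines(sample)
print(flagged)
```

For the sample data this flags the two data lines and leaves the header alone, which is enough to stop before a parser silently merges records.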
I've come across some pattern values for type="tel" inputs, such as \d{3}[\-]\d{3}[\-]\d{4}; however, I need a pattern that matches the proper format and two common non-proper formats:
(123) 456-7890 (Proper)
123-456-7890
1234567890
The following seems to work for all three formats:
<input pattern="?(\d{3}?)? \d{3}?-\d{4}" type="tel" />
Is this valid or are there strings that aren't valid that this would still pass? Is this optimized well or is there a faster way to run the regular expression? Bonus: what can be done to properly support international telephone number formats?
When I tested it at w3schools, it accepted inputs such as (1234567890 and jhj, which is not what you’re after.
However, a solution can be examined here.
It checks all of your example strings at the same time because the global flag is set. Flag settings are to the right of the pattern.
On the left-hand side there’s a regex debugger which shows how it worked it all out.
The explanation on the right-hand side describes what each symbol means.
The pattern should match only the formats specified.
pattern="\(\d{3}\)\s\d{3}-\d{4}|\d{3}-\d{3}-\d{4}|\d{10}"
It looks complicated, but on inspection it is just your three formats joined with | (alternation):
proper pattern | 1st non-proper format | 2nd non-proper format
(123) 456-7890 | 123-456-7890 | 1234567890
"\( \d{3} \) \s \d{3} - \d{4} | \d{3} - \d{3} - \d{4} | \d{10}"
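The alternation can also be checked outside the browser. The following is a rough Python sketch, not part of the original answer: it relies on the fact that the HTML pattern attribute has to match the whole value, which re.fullmatch imitates, and the helper name is_accepted is invented for the example.

```python
import re

# Same three alternatives as the pattern attribute above.
PHONE_PATTERN = re.compile(r"\(\d{3}\)\s\d{3}-\d{4}|\d{3}-\d{3}-\d{4}|\d{10}")


def is_accepted(value):
    """True when the whole string matches one of the three formats."""
    return PHONE_PATTERN.fullmatch(value) is not None


for candidate in ["(123) 456-7890", "123-456-7890", "1234567890",
                  "(1234567890", "jhj"]:
    print(repr(candidate), is_accepted(candidate))
```

The last two candidates are the strings that slipped through the pattern in the question; with the alternation they are rejected.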
I am writing a CSV parser and I want it to comply with this standard. It states:
Each record is located on a separate line, delimited by a line break (CRLF)
How should I handle rows ending with only a CR or only an LF character? Should I treat them as literals and pass them to the field, or interpret them as a row end? Or should I declare the file malformed?
I guess the most flexible solution would be to accept either type of line ending, but I am trying to figure out what the standards say.
What do you think about it?
You should certainly not treat them as malformed, because there can be different line endings on Linux, Windows and Mac for example.
It's better to support them all.
Also, fields can have newlines in them as well, if they are properly quoted. So you'll need to check for that too.
For example:
123,"test on 2
lines",456
is a valid csv row.
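To illustrate the lenient approach, here is a small sketch using Python's standard csv module, which (as far as I know) accepts CR, LF and CRLF as record terminators while keeping line breaks inside quoted fields. It is only meant to show the behaviour a tolerant parser could aim for; parse_csv is a name chosen for this example.

```python
import csv
import io


def parse_csv(text):
    # newline="" leaves CR, LF and CRLF untranslated, so the csv module
    # decides by itself what ends a record and what belongs to a field.
    return list(csv.reader(io.StringIO(text, newline="")))


# The same two records written with three different record terminators.
# The quoted field contains an LF that must stay part of the value.
crlf = 'a,b\r\n123,"test on 2\nlines",456\r\n'
lf = 'a,b\n123,"test on 2\nlines",456\n'
cr = 'a,b\r123,"test on 2\nlines",456\r'

for text in (crlf, lf, cr):
    print(parse_csv(text))
```

All three inputs should produce the same two records, with the embedded line break preserved in the middle field.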
I use the Unit Separator character (US, 0x1F) in my database. When I export the data to an XML 1.0 file, the character is not accepted and the attribute is left with an empty value.
I have data in database like this:
"option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"
I expected to export it to an XML 1.0 file like this:
<elementname attr1="option1=10;option2=20;option3=aaa[US]bbb[US]ccc;"/>
However, the [US] is not accepted by XML 1.0. Any suggestions?
I could replace '\37' (octal 37, hex 1F) with something like "XXX", "$" or "(0x1f)" before writing to XML,
and replace it back when importing from XML and writing to the database. However, if I replace it with "&#x1F;", the character reference for the unit separator, I end up with "&amp;#x1F;", which is definitely not what I wanted.
If I manually edit the XML file to contain "&#x1F;", MSXML cannot load it and gives the error "Invalid Unicode Character".
Any suggestions?
Thank you
Summary:
Let's make an analogy with how a compiler works. There are two phases: "Pre-compile" and "Compile".
XML file generation acts like the "Compile" phase, e.g. it converts "<" to "&lt;".
However, the Unit Separator is not supported by XML 1.0, so the "Compile" phase will not convert it to the character reference "&#x1F;".
So we have to look for a solution in the "Pre-compile" phase, which is our own application's responsibility.
When writing:
Option1: <unit>aaa</unit><unit>bbb</unit>
Option2: simply use "_x241F_" to replace "\37" in the string if "_x241F_" is not conflicting with any existing token in the string.
When reading:
According to Option1: Load the elements and concatenate them into a single string with "\37" as the separator.
According to Option2: simply use "\37" to replace "_x241F_".
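Option2 can be sketched in a few lines. This is a hedged Python illustration rather than the code actually used with MSXML: the function names are invented, and the guard against an existing "_x241F_" token reflects the condition stated above that the placeholder must not conflict with the data.

```python
US = "\x1f"              # UNIT SEPARATOR, octal 37
PLACEHOLDER = "_x241F_"  # token taken from Option2 above


def encode_for_xml(value):
    """'Pre-compile' step: swap the unit separator for the placeholder."""
    if PLACEHOLDER in value:
        # The scheme only works when the token is not already in the data.
        raise ValueError("placeholder already present in the value")
    return value.replace(US, PLACEHOLDER)


def decode_from_xml(value):
    """Reverse step when reading the attribute back from the XML file."""
    return value.replace(PLACEHOLDER, US)


stored = "option1=10;option2=20;option3=aaa\x1fbbb\x1fccc;"
exported = encode_for_xml(stored)
print(exported)
print(decode_from_xml(exported) == stored)
```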
I've also found out that MSXML (even the latest version, MSXML6.dll) will not load XML 1.1.
So if we are unfortunately using MSXML, we have to write our own "Pre-compile" code to handle such characters before feeding the "Compile" phase.
Note: I borrowed the idea of "_x241F_" from here.
Thanks for everyone's help
There is no HTML entity for U+001F UNIT SEPARATOR. Besides, HTML entities would be irrelevant when dealing with generic XML.
The character references would be &#31; and &#x1F;, in HTML and in XML, but the character is not allowed in HTML or in XML. For XML 1.0, which this seems to be about, please refer to section 2.2 Characters, where the normative definition is the following production (the associated comment is misleading, and comments are non-normative):
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
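The production translates directly into a range check. Below is a small Python sketch of that check; the function names are invented for the example, and it only tests membership in Char, it does not repair anything.

```python
def is_xml10_char(ch):
    """True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_positions(text):
    """Return the indexes of characters that XML 1.0 does not allow."""
    return [i for i, ch in enumerate(text) if not is_xml10_char(ch)]


value = "aaa\x1fbbb\x1fccc"
print(find_invalid_positions(value))
```

Running it on the value from the question points at the two unit separators, which is exactly what an XML 1.0 parser objects to.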
The conclusions to be drawn depend on the meaning and purpose of UNIT SEPARATOR in the text. It has no generally defined meaning; it is up to applications to assign a meaning to it and process it accordingly.
Usually UNIT SEPARATOR is used to separate units of some kind, so the natural approach would be to process the incoming data so that instead of such separators, the data, when converted to XML format, has units denoted by markup. So for data like aaa[US]bbb[US]ccc where [US] is UNIT SEPARATOR, you would generate something like <unit>aaa</unit><unit>bbb</unit><unit>ccc</unit>.
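As a rough sketch of that markup approach, the following Python uses the standard xml.etree.ElementTree module. The element name option3 and both function names are assumptions made for the example; only the unit element comes from the text above.

```python
import xml.etree.ElementTree as ET

US = "\x1f"  # UNIT SEPARATOR


def units_to_element(tag, value):
    """Split value on the unit separator and wrap each part in <unit>."""
    parent = ET.Element(tag)
    for part in value.split(US):
        ET.SubElement(parent, "unit").text = part
    return parent


def element_to_units(parent):
    """Rebuild the original string from the <unit> children."""
    return US.join(child.text or "" for child in parent.findall("unit"))


element = units_to_element("option3", "aaa\x1fbbb\x1fccc")
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
print(element_to_units(ET.fromstring(xml_text)) == "aaa\x1fbbb\x1fccc")
```

The serialized text contains no control character at all, so it loads in any XML 1.0 parser, and the separator is restored only inside the application.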
This website
http://www.fileformat.info/info/unicode/char/1f/index.htm
suggests one of the following:
HTML Entity (decimal): &#31;
HTML Entity (hex): &#x1f;
I have a program that outputs a table, and I was wondering if there are any advantages/disadvantages between the csv and tsv formats.
TSV is very efficient for Javascript/Perl/Python to process, without losing
any typing information, and also easy for humans to read.
The format has been supported in 4store since its public release, and
it's reasonably widely used.
The way I look at it is: CSV is for loading into spreadsheets, TSV is
for processing by bespoke software.
You can see the technical specification of each format here.
The choice depends on the application. In a nutshell, if your fields don't contain commas, use CSV; otherwise TSV is the way to go.
TL;DR
In both formats, the problem arises when the delimiter can appear within the fields, so it is necessary to indicate that the delimiter is not working as a field separator but as a value within the field, which can be somewhat painful.
For example, using CSV: Kalman, Rudolf, von Neumann, John, Gabor, Dennis
Some basic approaches are:
Delete all the delimiters that appear within the field.
E.g. Kalman Rudolf, von Neumann John, Gabor Dennis
Escape the character (usually by prepending a backslash \).
E.g. Kalman\, Rudolf, von Neumann\, John, Gabor\, Dennis
Enclose each field with other character (usually double quotes ").
E.g. "Kalman, Rudolf", "von Neumann, John", "Gabor, Dennis"
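The escaping and quoting approaches can be tried with Python's standard csv module. This is only a sketch to make the two options concrete; write_row and read_row are helper names invented here, and the output spacing differs slightly from the hand-written examples above because the writer adds no space after the delimiter.

```python
import csv
import io

names = ["Kalman, Rudolf", "von Neumann, John", "Gabor, Dennis"]


def write_row(fields, **fmt):
    """Serialize one record with the given csv formatting options."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, **fmt).writerow(fields)
    return buffer.getvalue()


def read_row(line, **fmt):
    """Parse one record back using the same formatting options."""
    return next(csv.reader(io.StringIO(line, newline=""), **fmt))


# Quoting: fields that contain the delimiter are wrapped in double quotes.
quoted = write_row(names)
# Escaping: no quotes, each embedded comma is preceded by a backslash.
escaped = write_row(names, quoting=csv.QUOTE_NONE, escapechar="\\")

print(quoted, end="")
print(escaped, end="")
print(read_row(quoted) == names)
print(read_row(escaped, quoting=csv.QUOTE_NONE, escapechar="\\") == names)
```

Both variants round-trip only when the reader is told which convention the writer used, which is the practical source of the pain described above.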
CSV
The fields are separated by a comma ,.
For example:
Name,Score,Country
Peter,156,GB
Piero,89,IT
Pedro,31415,ES
Advantages:
It is more generic and useful when sharing with non-technical people,
as most software packages can read it without playing with the
settings.
Disadvantages:
Escaping the comma within the fields can be frustrating because not
everybody follows the standards.
All the extra escaping characters and quotes add weight to the final file size.
TSV
The fields are separated by a tabulation <TAB> or \t
For example:
Name<TAB>Score<TAB>Country
Peter<TAB>156<TAB>GB
Piero<TAB>89<TAB>IT
Pedro<TAB>31415<TAB>ES
Advantages:
It is not necessary to escape the delimiter as it is not usual to have the tab-character within a field. Otherwise, it should be removed.
Disadvantages:
It is less widespread.
TSV-utils makes an interesting comparison, reproduced below. In a nutshell, use TSV.
Comparing TSV and CSV formats
The differences between TSV and CSV formats can be confusing. The obvious distinction is the default field delimiter: TSV uses TAB, CSV uses comma. Both use newline as the record delimiter.
By itself, using different field delimiters is not especially significant. Far more important is the approach to delimiters occurring in the data. CSV uses an escape syntax to represent comma and newlines in the data. TSV takes a different approach, disallowing TABs and newlines in the data.
The escape syntax enables CSV to fully represent common written text. This is a good fit for human edited documents, notably spreadsheets. This generality has a cost: reading it requires programs to parse the escape syntax. While not overly difficult, it is still easy to do incorrectly, especially when writing one-off programs. It is good practice to use a CSV parser when processing CSV files. Traditional Unix tools like cut, sort, awk, and diff do not process CSV escapes; alternate tools are needed.
By contrast, parsing TSV data is simple. Records can be read using the typical readline routines found in most programming languages. The fields in each record can be found using split routines. Unix utilities can be called by providing the correct field delimiter, e.g. awk -F "\t", sort -t $'\t'. No special parser is needed. This is much more reliable. It is also faster, no CPU time is used parsing the escape syntax.
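The readline-and-split approach described here fits in a couple of lines. A minimal Python sketch, assuming the data really contains no TABs or newlines inside fields (the function name parse_tsv is made up):

```python
import io


def parse_tsv(text):
    """Read records line by line and split each one on TAB."""
    return [line.rstrip("\n").split("\t") for line in io.StringIO(text)]


data = (
    "Name\tScore\tCountry\n"
    "Peter\t156\tGB\n"
    "Piero\t89\tIT\n"
    "Pedro\t31415\tES\n"
)
rows = parse_tsv(data)
print(rows)
```

No quoting state has to be tracked, which is the simplicity argument made above; the price is that the format cannot carry a TAB or newline inside a value.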
The speed advantages are especially pronounced for record oriented operations. Record counts (wc -l), deduplication (uniq, tsv-uniq), file splitting (head, tail, split), shuffling (GNU shuf, tsv-sample), etc. TSV is faster because record boundaries can be found using highly optimized newline search routines (e.g. memchr). Identifying CSV record boundaries requires fully parsing each record.
These characteristics make TSV format well suited for the large tabular data sets common in data mining and machine learning environments. These data sets rarely need TAB and newline characters in the fields.
The most common CSV escape format uses quotes to delimit fields containing delimiters. Quotes must also be escaped; this is done by using a pair of quotes to represent a single quote. Consider the data in this table:
Field-1 | Field-2 | Field-3
abc | hello, world! | def
ghi | Say "hello, world!" | jkl
In Field-2, the first value contains a comma, and the second value contains both quotes and a comma. Here is the CSV representation, using escapes to represent commas and quotes in the data.
Field-1,Field-2,Field-3
abc,"hello, world!",def
ghi,"Say ""hello, world!""",jkl
In the above example, only fields with delimiters are quoted. It is also common to quote all fields whether or not they contain delimiters. The following CSV file is equivalent:
"Field-1","Field-2","Field-3"
"abc","hello, world!","def"
"ghi","Say ""hello, world!""","jkl"
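The equivalence of the two files can be checked with a CSV parser. The sketch below uses Python's standard csv module as one example of such a parser; it is not part of the tsv-utils text, and parse is just a local helper name.

```python
import csv
import io


def parse(text):
    """Parse CSV text into a list of records (lists of field values)."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Only fields containing delimiters or quotes are quoted.
minimal = (
    'Field-1,Field-2,Field-3\n'
    'abc,"hello, world!",def\n'
    'ghi,"Say ""hello, world!""",jkl\n'
)
# Every field is quoted.
fully_quoted = (
    '"Field-1","Field-2","Field-3"\n'
    '"abc","hello, world!","def"\n'
    '"ghi","Say ""hello, world!""","jkl"\n'
)

print(parse(minimal))
print(parse(minimal) == parse(fully_quoted))
```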
Here's the same data in TSV. It is much simpler as no escapes are involved:
Field-1<TAB>Field-2<TAB>Field-3
abc<TAB>hello, world!<TAB>def
ghi<TAB>Say "hello, world!"<TAB>jkl
The similarity between TSV and CSV can lead to confusion about which tools are appropriate. Furthering this confusion, it is somewhat common to have data files using comma as the field delimiter, but without comma, quote, or newlines in the data. No CSV escapes are needed in these files, with the implication that traditional Unix tools like cut and awk can be used to process these files. Such files are sometimes referred to as "simple CSV". They are equivalent to TSV files with comma as a field delimiter. Traditional Unix tools and tsv-utils tools can process these files correctly by specifying the field delimiter. However, "simple csv" is a very ad hoc and ill defined notion. A simple precaution when working with these files is to run a CSV-to-TSV converter like csv2tsv prior to other processing steps.
Note that many CSV-to-TSV conversion tools don't actually remove the CSV escapes. Instead, many tools replace comma with TAB as the field delimiter, but still use CSV escapes to represent TAB, newline, and quote characters in the data. Such data cannot be reliably processed by Unix tools like sort, awk, and cut. The csv2tsv tool in tsv-utils avoids escapes by replacing TAB and newline with a space (customizable). This works well in the vast majority of data mining scenarios.
To see what a specific CSV-to-TSV conversion tool does, convert CSV data containing quotes, commas, TABs, newlines, and double-quoted fields. For example:
$ echo $'Line,Field1,Field2\n1,"Comma: |,|","Quote: |""|"\n"2","TAB: |\t|","Newline: |\n|"' | <csv-to-tsv-converter>
Approaches that generate CSV escapes will enclose a number of the output fields in double quotes.
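For comparison, here is a sketch of an escape-free conversion in Python. It is not the csv2tsv implementation from tsv-utils; it only imitates the behaviour described above (TAB and newline inside a field become a space), and the function name and the replacement parameter are assumptions for this example.

```python
import csv
import io


def csv_to_simple_tsv(csv_text, replacement=" "):
    """Convert CSV text to TSV without escapes.

    TABs and line breaks inside a field are replaced, so every output
    line is one record and every TAB is a field delimiter.
    """
    lines = []
    for record in csv.reader(io.StringIO(csv_text, newline="")):
        cleaned = [
            field.replace("\r\n", replacement)
                 .replace("\n", replacement)
                 .replace("\r", replacement)
                 .replace("\t", replacement)
            for field in record
        ]
        lines.append("\t".join(cleaned))
    return "".join(line + "\n" for line in lines)


# Same test data as the echo command above.
test_input = (
    'Line,Field1,Field2\n'
    '1,"Comma: |,|","Quote: |""|"\n'
    '"2","TAB: |\t|","Newline: |\n|"\n'
)
tsv = csv_to_simple_tsv(test_input)
print(tsv, end="")
```

In the output no field is wrapped in double quotes, and the only quote left is the literal one from the data.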
References:
Wikipedia: Tab-separated values - Useful description of TSV format.
IANA TSV specification - Formal definition of the tab-separated-values mime type.
Wikipedia: Comma-separated-values - Describes CSV and related formats.
RFC 4180 - IETF CSV format description, the closest thing to an actual standard for CSV.
brendano/tsvutils: The philosophy of tsvutils - Brendan O'Connor's discussion of the rationale for using TSV format in his open source toolkit.
So You Want To Write Your Own CSV code? - Thomas Burette's humorous, and accurate, blog post describing the troubles with ad-hoc CSV parsing. Of course, you could use TSV and avoid these problems!
You can use any delimiter you want, but tabs and commas are supported by many applications, including Excel, MySQL, PostgreSQL. Commas are common in text fields, so if you escape them, more of them need to be escaped. If you don't escape them and your fields might contain commas, then you can't confidently run "sort -k2,4" on your file. You might need to escape some characters in fields anyway (null bytes, newlines, etc.). For these reasons and more, my preference is to use TSVs, and escape tabs, null bytes, and newlines within fields. Additionally, it is usually easier to work with TSVs. Just split each line by the tab delimiter. With CSVs there are quoted fields, possibly fields with newlines, etc. I only use CSVs when I'm forced to.
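A sketch of the "TSV with escaped tabs, null bytes, and newlines" preference in Python follows. The backslash escape sequences and all function names are assumptions for the example; the answer above does not say which escape scheme it uses.

```python
# Characters that must not appear raw inside a field, and their escapes.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r", "\0": "\\0"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r", "0": "\0"}


def escape_field(value):
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_field(value):
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            following = next(chars, "")
            # Unknown sequences are kept as written.
            out.append(_UNESCAPES.get(following, "\\" + following))
        else:
            out.append(ch)
    return "".join(out)


def write_row(fields):
    return "\t".join(escape_field(f) for f in fields)


def read_row(line):
    # After escaping, every raw TAB is a delimiter, so a plain split works.
    return [unescape_field(f) for f in line.rstrip("\n").split("\t")]


row = ["a\tb", "line1\nline2", "back\\slash", "nul\0"]
encoded = write_row(row)
print(encoded)
print(read_row(encoded + "\n") == row)
```

Because the record stays on one physical line, line-oriented tools such as sort can still be pointed at a field with the TAB delimiter.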
I think that, in general, CSV is supported more widely than TSV.