Introduction
We have a pretty standard way of importing .txt and .csv into our data warehouse using SSIS.
Our txt/csvs are produced with speech marks as text qualifiers, so a typical file may look like this:
"0001","025",1,"01/01/19","28/12/18",4,"ST","SMITH,JOHN","15/01/19"
"0002","807",1,"01/01/19","29/12/18",3,"ST","JONES,JOY","06/02/19"
"0003","160",1,"01/01/19","29/12/18",3,"ST","LEWIS,HANNAH","18/01/19"
We have set all our SSIS packages to strip out the speech marks by setting Text Qualifier = "
Problem
However, as some of our data entry is done manually, speech marks are sometimes used, particularly in free-text fields such as NAME where people have nicknames/aliases. This causes errors in our SSIS loading.
An example of a problematic row would be:
"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"
My question
Is there a way to somehow strip out these problematic speech marks, i.e. the speech marks surrounding "STAN", so that the column would be treated as MOORE,STANLEY STAN?
If there was a way within SSIS to do this, great. If not, we are open to other ideas outside of SSIS.
Solution needs to be scalable as we have hundreds of SSIS packages where this problem can occur.
I have a few suggestions:
1) I know Excel has a setting that says something like "Treat consecutive delimiters as one."
2) Change your delimiter to something else, like a pipe (|, the vertical bar). You can distinguish delimiters from quote marks that are meant to be included in the resulting value, because a string delimiter either immediately precedes or immediately follows a comma; a quote character anywhere else is not a delimiter (see the sketch below).
3) If you do not need to pass the data through any T-SQL, you might want to replace non-delimiter quotes with single quotes or, depending on the final output, maybe the HTML entity (&quot;) instead.
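A minimal Python sketch of that rule, intended as a pre-processing step run on each file before SSIS picks it up. It keeps a quote only when it opens or closes a field (start/end of line or directly beside a comma) and drops every other quote; it will still misfire if a stray quote in the data happens to sit directly beside a comma:

def strip_stray_quotes(line: str) -> str:
    cleaned = []
    for i, ch in enumerate(line):
        if ch != '"':
            cleaned.append(ch)
            continue
        opens = i == 0 or line[i - 1] == ','                # field-opening qualifier
        closes = i == len(line) - 1 or line[i + 1] == ','   # field-closing qualifier
        if opens or closes:
            cleaned.append(ch)
        # otherwise: stray quote inside a field, drop it
    return ''.join(cleaned)

row = '"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"'
print(strip_stray_quotes(row))
# "0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY STAN","12/04/19"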
Hope this helps,
Joey
I am trying to import a large CSV dataset into neo4j using the neo4j-import tool. Quotation is not used anywhere, and therefore I get errors when parsing using --quote " --quote ' --quote ´ and the like. Even choosing very rare Unicode characters doesn't help with this multi-gig CSV, because it also contains Arabic letters, math symbols and everything else you can imagine.
So: Is there a way to disable the quotation checking completely?
Perhaps it would be useful to have the import tool accept character configuration values specified as ASCII codes. If so, then you could specify --quote \0 and no character would match. That would also be useful for specifying other special characters in general, I'd guess.
You need to make sure the CSV file uses quotation marks, since they allow the tool to reliably determine when strings end.
Any string in your data file might contain the delimiter character (a comma, by default). Even if there were a way to turn off quotation checking, the tool would treat every delimiter character as the end of a field. Therefore, any string field that happened to contain the delimiter character would be terminated prematurely, causing errors.
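To make this concrete, here is a quick illustration with Python's csv module (not neo4j-import itself, but the failure mode is the same):

import csv

row = '1,"Doe, John",42'

# With quote handling on, the embedded comma stays inside one field:
print(next(csv.reader([row])))
# ['1', 'Doe, John', '42']

# With quote handling off, every comma ends a field:
print(next(csv.reader([row], quoting=csv.QUOTE_NONE)))
# ['1', '"Doe', ' John"', '42']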
How do I read an RFC4180-standard CSV file into SPSS? Specifically, how do I handle string values with embedded double quotes, which are (properly) escaped with a second double quote?
Here's one instance of a record with a problematic value:
2985909844,,3,3,3,3,3,3,1,2,2,"I recall an ad for ""RackSpace"", but I don't recall if this was here or in another page.",200,1,1,1,0,1,0,Often
The SPSS syntax I used is as follows:
GET DATA
/TYPE=TXT
/FILE="/Users/pieter/Work/Stackoverflow/2013_StackOverflowRecoded.csv"
/IMPORTCASE=ALL
/ARRANGEMENT=DELIMITED
/DELCASE=LINE
/FIRSTCASE=2
/DELIMITERS=","
/QUALIFIER='"'
/VARIABLES= ... list of column names...
The import succeeds, but gets off track and throws warnings after encountering such values.
I'm afraid this is a bug in SPSS and therefore not something you can solve yourself.
You might want to ask the IBM Support team about this issue and post their answer here, if you find it helpful.
One workaround would be to change the escaped double quotes in your *.csv file(s) to some other quote type. This should be little work if you use an advanced text editor such as Notepad++ or the "sed" command line tool on UNIX-like operating systems.
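For instance, a minimal Python sketch of that workaround, using the path from the GET DATA syntax in the question; it assumes the file contains no empty quoted fields, which also look like "":

from pathlib import Path

# Replace RFC4180 doubled quotes ("") with an apostrophe before handing
# the file to SPSS. Blind replace: safe only if "" never means an empty field.
src = Path("/Users/pieter/Work/Stackoverflow/2013_StackOverflowRecoded.csv")
fixed = src.read_text(encoding="utf-8").replace('""', "'")
src.with_name(src.stem + "_fixed.csv").write_text(fixed, encoding="utf-8")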
Trying an example in the current version of Statistics (22), doubled qualifiers are handled correctly. However, if you generate the syntax with the Text Wizard, the field widths in the generated syntax are too short, so you would need to increase them.
I have a program that outputs a table, and I was wondering if there are any advantages/disadvantages between the csv and tsv formats.
TSV is very efficient for JavaScript/Perl/Python to process without losing any typing information, and it is also easy for humans to read.
The format has been supported in 4store since its public release, and
it's reasonably widely used.
The way I look at it is: CSV is for loading into spreadsheets, TSV is
for processing by bespoke software.
You can see the technical specification of each here.
The choice depends on the application. In a nutshell, if your fields don't contain commas, use CSV; otherwise TSV is the way to go.
TL;DR
In both formats, the problem arises when the delimiter can appear within the fields, so you need a way to indicate that the character is not acting as a field separator but is part of the value, which can be somewhat painful.
For example, using CSV: Kalman, Rudolf, von Neumann, John, Gabor, Dennis
Some basic approaches are (see the sketch after this list):
Delete all the delimiters that appear within the field.
E.g. Kalman Rudolf, von Neumann John, Gabor Dennis
Escape the character (usually by prepending a backslash \).
E.g. Kalman\, Rudolf, von Neumann\, John, Gabor\, Dennis
Enclose each field in another character (usually double quotes ").
E.g. "Kalman, Rudolf", "von Neumann, John", "Gabor, Dennis"
CSV
The fields are separated by a comma ,.
For example:
Name,Score,Country
Peter,156,GB
Piero,89,IT
Pedro,31415,ES
Advantages:
It is more generic and useful when sharing with non-technical people, as most software packages can read it without playing with the settings.
Disadvantages:
Escaping the comma within the fields can be frustrating because not
everybody follows the standards.
All the extra escaping characters and quotes add weight to the final file size.
TSV
The fields are separated by a tab character (<TAB> or \t).
For example:
Name<TAB>Score<TAB>Country
Peter<TAB>156<TAB>GB
Piero<TAB>89<TAB>IT
Pedro<TAB>31415<TAB>ES
Advantages:
It is not necessary to escape the delimiter, as it is unusual to have the tab character within a field; if one does occur, it has to be removed or replaced.
Disadvantages:
It is less widespread.
TSV-utils makes an interesting comparison, copied below. In a nutshell: use TSV.
Comparing TSV and CSV formats
The differences between TSV and CSV formats can be confusing. The obvious distinction is the default field delimiter: TSV uses TAB, CSV uses comma. Both use newline as the record delimiter.
By itself, using different field delimiters is not especially significant. Far more important is the approach to delimiters occurring in the data. CSV uses an escape syntax to represent comma and newlines in the data. TSV takes a different approach, disallowing TABs and newlines in the data.
The escape syntax enables CSV to fully represent common written text. This is a good fit for human-edited documents, notably spreadsheets. This generality has a cost: reading it requires programs to parse the escape syntax. While not overly difficult, it is still easy to do incorrectly, especially when writing one-off programs. It is good practice to use a CSV parser when processing CSV files. Traditional Unix tools like cut, sort, awk, and diff do not process CSV escapes; alternate tools are needed.
By contrast, parsing TSV data is simple. Records can be read using the typical readline routines found in most programming languages. The fields in each record can be found using split routines. Unix utilities can be called by providing the correct field delimiter, e.g. awk -F "\t", sort -t $'\t'. No special parser is needed. This is much more reliable. It is also faster: no CPU time is used parsing the escape syntax.
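As a minimal Python sketch of that point (scores.tsv is a hypothetical file in the format of the TSV example above):

with open("scores.tsv", encoding="utf-8") as f:
    header = f.readline().rstrip("\n").split("\t")   # e.g. ['Name', 'Score', 'Country']
    for line in f:
        fields = line.rstrip("\n").split("\t")       # no csv parser needed
        print(dict(zip(header, fields)))
# {'Name': 'Peter', 'Score': '156', 'Country': 'GB'} ...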
The speed advantages are especially pronounced for record oriented operations. Record counts (wc -l), deduplication (uniq, tsv-uniq), file splitting (head, tail, split), shuffling (GNU shuf, tsv-sample), etc. TSV is faster because record boundaries can be found using highly optimized newline search routines (e.g. memchr). Identifying CSV record boundaries requires fully parsing each record.
These characteristics make the TSV format well suited for the large tabular data sets common in data mining and machine learning environments. These data sets rarely need TAB and newline characters in the fields.
The most common CSV escape format uses quotes to delimit fields containing delimiters. Quotes must also be escaped; this is done by using a pair of quotes to represent a single quote character. Consider the data in this table:
Field-1   Field-2               Field-3
abc       hello, world!         def
ghi       Say "hello, world!"   jkl
In Field-2, the first value contains a comma, and the second value contains both quotes and a comma. Here is the CSV representation, using escapes to represent the commas and quotes in the data:
Field-1,Field-2,Field-3
abc,"hello, world!",def
ghi,"Say ""hello, world!""",jkl
In the above example, only fields with delimiters are quoted. It is also common to quote all fields whether or not they contain delimiters. The following CSV file is equivalent:
"Field-1","Field-2","Field-3"
"abc","hello, world!","def"
"ghi","Say ""hello, world!""","jkl"
Here's the same data in TSV (tabs shown as <TAB>). It is much simpler, as no escapes are involved:
Field-1<TAB>Field-2<TAB>Field-3
abc<TAB>hello, world!<TAB>def
ghi<TAB>Say "hello, world!"<TAB>jkl
The similarity between TSV and CSV can lead to confusion about which tools are appropriate. Furthering this confusion, it is somewhat common to have data files using comma as the field delimiter, but without comma, quote, or newlines in the data. No CSV escapes are needed in these files, with the implication that traditional Unix tools like cut and awk can be used to process them. Such files are sometimes referred to as "simple CSV". They are equivalent to TSV files with comma as the field delimiter. Traditional Unix tools and tsv-utils tools can process these files correctly by specifying the field delimiter. However, "simple CSV" is a very ad hoc and ill-defined notion. A simple precaution when working with these files is to run a CSV-to-TSV converter like csv2tsv prior to other processing steps.
Note that many CSV-to-TSV conversion tools don't actually remove the CSV escapes. Instead, many tools replace comma with TAB as the field delimiter, but still use CSV escapes to represent TAB, newline, and quote characters in the data. Such data cannot be reliably processed by Unix tools like sort, awk, and cut. The csv2tsv tool in tsv-utils avoids escapes by replacing TAB and newline with a space (customizable). This works well in the vast majority of data mining scenarios.
To see what a specific CSV-to-TSV conversion tool does, convert CSV data containing quotes, commas, TABs, newlines, and double-quoted fields. For example:
$ echo $'Line,Field1,Field2\n1,"Comma: |,|","Quote: |""|"\n"2","TAB: |\t|","Newline: |\n|"' | <csv-to-tsv-converter>
Approaches that generate CSV escapes will enclose a number of the output fields in double quotes.
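For comparison, a minimal Python sketch of the escape-free strategy described above: parse the CSV escapes once, then emit TSV with TAB and newline in the data replaced by a space. This mirrors the csv2tsv behaviour; it is not the actual tsv-utils implementation:

import csv
import sys

def csv_to_tsv(src, dst):
    # csv.reader handles quoted fields, doubled quotes, and embedded newlines;
    # the output then needs no escapes at all.
    for row in csv.reader(src):
        dst.write("\t".join(
            field.replace("\t", " ").replace("\r", " ").replace("\n", " ")
            for field in row
        ) + "\n")

csv_to_tsv(sys.stdin, sys.stdout)   # e.g. python csv_to_tsv.py < data.csv > data.tsv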
References:
Wikipedia: Tab-separated values - Useful description of TSV format.
IANA TSV specification - Formal definition of the tab-separated-values mime type.
Wikipedia: Comma-separated-values - Describes CSV and related formats.
RFC 4180 - IETF CSV format description, the closest thing to an actual standard for CSV.
brendano/tsvutils: The philosophy of tsvutils - Brendan O'Connor's discussion of the rationale for using TSV format in his open source toolkit.
So You Want To Write Your Own CSV code? - Thomas Burette's humorous, and accurate, blog post describing the troubles with ad-hoc CSV parsing. Of course, you could use TSV and avoid these problems!
You can use any delimiter you want, but tabs and commas are supported by many applications, including Excel, MySQL, and PostgreSQL. Commas are common in text fields, so if you escape them, you end up escaping a lot of them. If you don't escape them and your fields might contain commas, then you can't confidently run "sort -k2,4" on your file. You might need to escape some characters in fields anyway (null bytes, newlines, etc.). For these reasons and more, my preference is to use TSVs and escape tabs, null bytes, and newlines within fields. Additionally, it is usually easier to work with TSVs: just split each line on the tab delimiter. With CSVs there are quoted fields, possibly fields with newlines, and so on. I only use CSVs when I'm forced to.
I think that, generally, the CSV format is supported more often than the TSV format.
I wonder if there is any way to generate a culture-neutral CSV file, or at least to specify the data format of certain columns in the file.
For example, I generated a CSV file that contains numbers with a period (.) as the decimal separator, and passed it to a client in a country where the decimal separator is a comma (,). The client opens it with Excel and sees all the values changed.
Is there any way to resolve this issue, or should I just not use a CSV file in this case?
Thank you in advance.
What you want is a "quoted CSV file".
That is, as well as separating your values with commas, you also enclose them in (usually) double quotes.
Like so:
"first","second","3,00","Some other text, etc."
This format is quite common and supported by Excel.
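If you are generating the file programmatically, Python's csv module can produce exactly this form (a minimal sketch):

import csv, sys

# QUOTE_ALL encloses every field in double quotes.
csv.writer(sys.stdout, quoting=csv.QUOTE_ALL).writerow(
    ["first", "second", "3,00", "Some other text, etc."]
)
# "first","second","3,00","Some other text, etc."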
Two ways I came up with to avoid the decimal separator altogether:
1) Use scientific notation, so 1.25 would be: 125E-2
2) Make it a formula, so 1.25 would be: =125/100
Both pretty crappy, depending on your target audience, but at least Excel sees them as numbers and can calculate with them.
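A small Python sketch of both workarounds (the helper names are made up for illustration):

from decimal import Decimal
from fractions import Fraction

def to_scientific(x):
    # 1.25 -> "125E-2": integer mantissa, so no decimal separator appears
    sign, digits, exp = Decimal(str(x)).normalize().as_tuple()
    mantissa = ("-" if sign else "") + "".join(map(str, digits))
    return f"{mantissa}E{exp}"

def to_formula(x):
    # 1.25 -> "=5/4" (reduced fraction): Excel evaluates the division itself
    f = Fraction(str(x))
    return f"={f.numerator}/{f.denominator}"

print(to_scientific(1.25), to_formula(1.25))   # 125E-2 =5/4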
A CSV file will be separated by commas (the 'C' in CSV), but you can output a text file with any delimiter and qualifier and you'll still be able to open it in Excel; you specify them in step 2 of the Import Text Wizard.
A common choice for situations like this is to use tabs (TSV).
You can use Tab-Separated Values, which do not vary between cultures and are supported by Microsoft Excel. Common file extensions are .tsv and .tab.
http://en.wikipedia.org/wiki/Tab-separated_values