Load a CSV into Apache Beam where there is a comma in some of the fields

I am loading a CSV into Apache Beam, but the CSV I am loading has commas in the fields. It looks like this:
ID, Name
1, Barack Obama
2, Barry, Bonds
How can I go about fixing this issue?

This is not specific to Beam, but a general problem with CSV. It's unclear whether the second data row should be read as ID="2, Barry" Name="Bonds" or the other way around.
If you can use some context (e.g. ID is always an integer, or only one field could possibly contain commas) you could solve this by reading it as a text file line by line and parsing it into separate fields with a custom DoFn (assuming the fields themselves do not contain newlines).
Generally, non-separating commas should be inside quotes in well-formed CSV, which makes this much more tractable (e.g. it would just work with the Beam DataFrames read_csv).
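For illustration, here is a minimal sketch of that custom-DoFn approach in the Beam Python SDK, assuming the layout from the question: the first field is always an integer ID and only the name field can contain commas (the file name is hypothetical):

import apache_beam as beam

class ParseRow(beam.DoFn):
    # Split on the first comma only, since the ID cannot contain
    # commas but the name might (assumption from the question).
    def process(self, line):
        row_id, name = line.split(",", 1)
        yield {"id": int(row_id.strip()), "name": name.strip()}

with beam.Pipeline() as p:
    rows = (
        p
        | "Read" >> beam.io.ReadFromText("people.csv", skip_header_lines=1)
        | "Parse" >> beam.ParDo(ParseRow())
    )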

Related

Concern while importing/linking CSV to Access database

I have a CSV file whose delimiter is , (comma), and a few of the data columns in the same file have commas in them.
Hence, while linking/importing the file, the data is getting jumbled into the next column.
I have tried all possible means, like skipping columns etc., but am not getting any fruitful results.
Please let me know if this can be handled through a VBA function in MS Access.
If the CSV file contains text fields that contain commas and are not surrounded by a text qualifier (usually ") then the file is malformed and cannot be parsed in a bulletproof way. That is,
1,Hello world!,1.414
2,"Goodbye, cruel world!",3.142
can be reliably parsed, but
1,Hello world!,1.414
2,Goodbye, cruel world!,3.142
cannot. However, if you have additional information about the file, e.g., that it should contain three columns
a Long Integer column,
a Short Text column, and
a Double column
then your VBA code could read the file line-by-line and split the string on commas into an array. The first array element would be the Long Integer, the last array element would be the Double value, and the remaining "columns" in between could be concatenated together to reconstruct the string.
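The answer describes this in VBA, but the reconstruction logic itself is easy to sketch; here it is in Python, assuming exactly the three columns above, with only the middle text column able to contain commas (the file name is made up):

def parse_line(line):
    parts = line.rstrip("\n").split(",")
    row_id = int(parts[0])        # first element: the Long Integer
    value = float(parts[-1])      # last element: the Double
    text = ",".join(parts[1:-1])  # everything in between: the Short Text, rejoined
    return row_id, text, value

with open("malformed.csv") as f:
    for line in f:
        print(parse_line(line))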
As you can imagine, that approach could easily be confounded (e.g., if there was more than one text field that might contain commas). Therefore it is not particularly appealing.
(Also worth noting is that the CSV parser in Access has never been able to properly handle text fields that contain line breaks, but at least we can import those CSV files into Excel and then import into Access from the Excel file.)
TL;DR - If the CSV file contains unqualified text containing commas then the system that produced it is broken and should be fixed.

Is there any technical difference between a CSV, a TSV or a TXT file?

I use these files constantly in my application, but aren't CSV, TSV or TXT files all flat files?
The content is:
"sample","sample"
They are all text files following the same "guidelines". The difference between the files, as long as the creator followed some "rules", is that:
A csv file will have comma-separated values and a tsv file will have tab-separated values.
For .txt files, there is no formatting specified.
.csv stands for comma separated values, .tsv stands for tab separated values.
As the names suggest, different elements in the file are separated by ',' and '\t' respectively.
The type is chosen depending on the data. If we have, say, numbers larger than 3 digits, we might need commas as part of the content, and it would be better to use a tsv in that case.
Both are types of text files and are increasingly used for classification and data mining purposes.
They do not have any other technical distinguishing factor.
A text file (which might have a txt file extension) will have lines separated by a platform specific line separator (CRLF on Windows, LF on Linux, and so on), and it will tend to contain characters human readable as text in some encoding. Apart from that human readability expectation this allows pretty much any file content on some platforms, so this is more of a content classification than a specific file format.
The other two formats are usually considered special cases of a text file intended to allow easy automated processing; tsv, a "tab separated values" file is simpler than csv, a "comma separated values" file.
csv will have commas as field separators, and it may use quoting and escaping especially to handle commas and quotes occurring in those fields. It may also include a header line as the first line in the file. The last line in the file may or may not end with a line separator.
tsv simply disallows tabs in the values; the header line is mandatory, and so is the final line separator.
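A quick way to see the two sets of rules in practice, using Python's csv module for the quoting-aware case (the sample strings are made up):

import csv, io

# csv: a quoted field may contain the delimiter
csv_line = '1,"Goodbye, cruel world!",3.142\n'
print(next(csv.reader(io.StringIO(csv_line))))
# -> ['1', 'Goodbye, cruel world!', '3.142']

# tsv: no quoting; a plain split is enough because tabs cannot occur in values
tsv_line = "1\tGoodbye, cruel world!\t3.142\n"
print(tsv_line.rstrip("\n").split("\t"))
# -> ['1', 'Goodbye, cruel world!', '3.142']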
A "flat file", in connection with databases, is a text file as opposed to a machine optimized storage method (such as a fixed size record file or a compressed backup file or a file using more elaborate markup language supporting data validation); a flat file tends to be csv or tsv or similar.
This answer benefited from a comment by Alex Shpilkin.

java.io.IOException: wrong number of values (WEKA CSV to ARFF)

I am currently working on a data mining project in Weka, using my own dataset that I found. The only issue is that converting my file from CSV format into ARFF format is causing issues.
java.io.IOException: wrong number of values. Read 2, expected 5, Read Token[EOL], line 3
This is the error I am getting. I have browsed around online looking for similar issues and have tried removing all quotes and special characters that throw this exception. Every place I looked told me to remove special characters and I believe there are none left. The link to my dataset is here: https://docs.google.com/spreadsheets/d/1xqEe7MZE9SdKB_yvFSgWeSVYuDrq0b31Eu5oECNbGH0/edit#gid=1736568367&vpid=A1
These are the first three lines of my file, where the first line holds the attribute names; the file is comma-separated.
Inequality Adjusted HPI Rank,Sub Region,Inequality Adjusted Life Expectancy,Inquality Adjusted Well being,Footprint
,Inequality adjusted HPI
1,1,73.1,6.9,2.5,48.2
2,6,65.17333333,5.487667631,1.390974448,45.97489063
If you open your file with a text editor, you will see that Footprint has quotes around it. Delete the quotes and you are good to go!
Weka is normally not that good at reading CSV files that include special characters, and ARFF files are normally easier to use. Therefore, in such cases, the easiest way is to convert your CSV file to an ARFF file using R (the "RWeka" and "foreign" libraries can handle this conversion).
There is also another possibility. I was creating my CSV file and the header had a different number of elements compared to the rest of the data. So, check the header as well...!
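Since this error means that some row's field count differs from what the header promises, a short script can locate the offending lines before the file ever reaches Weka; a sketch in Python (the file name is hypothetical):

import csv

with open("dataset.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print(f"line {lineno}: expected {len(header)} values, got {len(row)}")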

How do I deal with commas/tabs that are part of the data in CSV/TSV in MarkLogic

I am trying to load a CSV file that has commas as part of the data into MarkLogic using RecordLoader. The data loads, but MarkLogic treats the commas that are part of the data as delimiters. I tried to escape the commas with backslashes, but that didn't work and the data remains dirty with the backslashes. I thought about replacing the data commas with other symbols so that I could change them back to commas after loading, but I don't know if there is a way to modify the data after loading, and I would have to reposition the XML tags line by line.
How can I load a CSV/TSV file and keep the commas/tabs that are part of the data as part of the data and not as delimiters?
Thanks in advance.
RecordLoader's DelimitedDataLoader doesn't support any escaping today. If you want to add it as a patch, https://github.com/marklogic/recordloader/blob/master/src/java/com/marklogic/recordloader/xcc/DelimitedDataLoader.java#L102 is the place to start looking at the code.
Although you asked about RecordLoader, you could also use the MarkLogic Content Pump. See Creating Documents from Delimited Text Files.
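If you do preprocess before RecordLoader, one approach in the spirit of the question's replace-the-commas idea is to parse the file with an escape-aware reader and re-emit it using a delimiter that cannot occur in the data. A minimal Python sketch, assuming the backslash-escaped file described in the question and assuming RecordLoader lets you configure the delimiter (file names and the pipe delimiter are hypothetical):

import csv

# Parse the backslash-escaped source, then re-emit with a pipe delimiter
# that does not occur in the data, so simple delimiter splitting works.
# csv.writer with QUOTE_NONE fails loudly if a field does contain a pipe.
with open("escaped.csv", newline="") as src, \
     open("pipes.txt", "w", newline="") as dst:
    reader = csv.reader(src, escapechar="\\", quoting=csv.QUOTE_NONE)
    writer = csv.writer(dst, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        writer.writerow(row)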

Culture independent CSV

I wonder if there is any way to generate a culture-neutral CSV file, or at least to specify the data format of certain columns present in the file.
For example, I generated a CSV file that contains numbers with a (.) decimal separator, and after
passing it to a client in a country where the decimal separator is (,), the client opens it with Excel and sees all the values changed.
Is there any way to resolve this issue, or should I just not use a CSV file in this case?
Thank you in advance.
What you want is a "quoted CSV file".
That is, as well as separating your values with commas, you also enclose them in (usually) double quotes.
Like so:-
"first","second","3,00","Some other text, etc."
This format is quite common and supported by Excel.
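Python's csv module, for one, can produce exactly this quoting (reproducing the sample row above; the output file name is made up):

import csv

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["first", "second", "3,00", "Some other text, etc."])
# out.csv now contains: "first","second","3,00","Some other text, etc."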
Two ways I came up with to avoid the decimal separator altogether:
1) Use scientific notation, so 1.25 would be: 125E-2
2) Make it a formula, so 1.25 would be: =125/100
Both pretty crappy, depending on your target audience, but at least Excel sees them as numbers and can calculate with them.
A CSV file will be separated by commas (the 'C' in CSV), but you can output a text file with any delimiter and qualifier and you'll still be able to open it in Excel - you specify them in step 2 of the Text Import Wizard.
A common choice for situations like this is to use tabs (TSV).
You can use Tab-Separated Values, which do not vary between cultures and are supported by Microsoft Excel. Common file extensions are .tsv and .tab.
http://en.wikipedia.org/wiki/Tab-separated_values
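As a sketch, writing such a culture-neutral TSV from Python, formatting the numbers with an invariant '.' decimal separator regardless of the machine's locale (the column names and values are made up):

import csv

rows = [("alpha", 1.25), ("beta", 3.142)]
with open("out.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["name", "value"])
    for name, value in rows:
        # str.format on a Python float always uses '.', independent of locale
        writer.writerow([name, f"{value:.3f}"])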