Golang CSV read : extraneous " in field error - csv

I am using a simple program to read CSV file, somehow I noticed when I created a CSV using EXCEL or windows based computer go library fails to read it. even when I use cat command it only shows me last line on the terminal. It always results in this error extraneous " in field.
I researched somewhat than I found it is somewhat related to carriage return differences between OS.
But I really want to ask how to make a generic csv reader. I tried reading the same csv using pandas and it was reading successfully. But i am not been able to achieve this using my Go code.
Also screen shot of correct csv Is here

Your file clearly shows that you've got an extra quote at the end of the content. While programs like pandas may be fine with that, I assume it's not valid csv so go does return an error.
Quick example of what's wrong with your data: https://play.golang.org/p/KBikSc1nzD
Update: After your update and a little bit of searching, I have to apoligize, the carriage return does matter and seems to be tha main culprit here, Go seems to be ok handling the \r\n windows variant but not the \r one. In that case what you can do is wrap the bytes.Reader into a custom reader that replaces the \r byte with the \n byte.
Here's an example: https://play.golang.org/p/vNjzwAHmtg
Please note, that the example is just that, an example, it's not handling all the possible cases where \r might be a legit byte.

Related

Octave - dlmread and csvread convert the first value to zero

When I try to read a csv file in Octave I realize that the very first value from it is converted to zero. I tried both csvread and dlmread and I'm receiving no errors. I am able to open the file in a plain text editor and I can see the correct value there. From what I can tell, there are no funny hidden characters, spacings, or similar in the csv file. Files also contain only numbers. The only thing that I feel might be important is that I have five columns/groups that each have different number of values in them.
I went through the commands' documentation on Octave Forge and I do not know what may be causing this. Does anyone have an idea what I can troubleshoot?
To try to illustrate the issue, if I try to load a file with the contents:
1.1,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
Command window will return:
0.0,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
( with additional trailing zeros after the decimal point).
Command syntaxes I'm using are:
dt = csvread("FileName.csv")
and
dt = dlmread("FileName.csv",",")
and they both return the same.
Your csv file contains a Byte Order Mark right before the first number. You can confirm this if you open the file in a hex editor, you will see the sequence EF BB BF before the numbers start.
This causes the first entry to be interpreted as a 'string', and since strings are parsed based on whether there are numbers in 'front' of the string sequence, this is parsed as the number zero. (see also this answer for more details on how csv entries are parsed).
In my text editor, if I start at the top left of the file, and press the right arrow key once, you can tell that the cursor hasn't moved (meaning I've just gone over the invisible byte order mark, which takes no visible space). Pressing backspace at this point to delete the byte order mark allows the csv to be read properly. Alternatively, you may have to fix your file in a hex editor, or find some other way to convert it to a proper Ascii file (or UTF without the byte order mark).
Also, it may be worth checking how this file was produced; if you have any control in that process, perhaps you can find why this mark was placed in the first place and prevent it. E.g., if this was exported from Excel, you can choose plain 'csv' format instead of 'utf-8 csv'.
UPDATE
In fact, this issue seems to have already been submitted as a bug and fixed in the development branch of octave. See #58813 :)

Java.io.IOException: wrong number of values (WEKA CSV to ARFF)

Currently working on a Data Mining project using my own dataset I had found using Weka. The only issue is that taking my file from csv format and converting it into arff format is causing issues.
java.io.IOException: wrong number of values. Read 2, expected 5, Read Token[EOL], line 3
This is the error I am getting. I have browsed around online looking for similar issues and have tried removing all quotes and special characters that throw this exception. Every place I looked told me to remove special characters and I believe there are none left. The link to my dataset is here : https://docs.google.com/spreadsheets/d/1xqEe7MZE9SdKB_yvFSgWeSVYuDrq0b31Eu5oECNbGH0/edit#gid=1736568367&vpid=A1
This is the first three lines of my file where the first is the attribute names, file is separated by commas in note
Inequality Adjusted HPI Rank,Sub Region,Inequality Adjusted Life Expectancy,Inquality Adjusted Well being,Footprint
,Inequality adjusted HPI
1,1,73.1,6.9,2.5,48.2
2,6,65.17333333,5.487667631,1.390974448,45.97489063
If you open your file with a text editor, you will see that Footprint has quotes around it. Delete the quotes and you are good to go!
Weka is normally not that good in reading CSV files that include special characters, and ARFF files are normally easier to use. Therefore, in such cases, the easiest way is to convert your CSV file to an ARFF file using R ("RWeka" and "foreign" libraries can handle this conversion).
There is also another possibility. I was creating my CSV file and the header had a different number of elements compared to the rest of the data. So, check the header as well...!

When COPYing to Redshift, how to deal with special characters in a CSV?

I am using a COPY with ACCEPTINVCHARS to load a CSV into Amazon Redshift.
Unfortunately I get errors like
Missing newline: Unexpected character 0x69 found at location 129
However, if I try to use the ESCAPE option as well, I get the exception
CSV is not compatible with ESCAPE
What am I supposed to do in order to COPY this into Redshift? I'm fine if the chars get replaced with ? or whatever.
Ignore the header as the headers might not be of the same datatype as your fields.
Use IGNOREHEADER AS
Refer to the forum for more details,
https://forums.aws.amazon.com/thread.jspa?messageID=557452
For future generations, "CSV is not compatible with ESCAPE" is probably right but you don't actually need the CSV keyword to load CSV, so it's worth trying to remove the CSV keyword from your copy command.

how to use ascii character for quote in COPY in cqlsh

I am uploading data from a a big .csv file into Cassandra using copy in cqlsh.
I am using cassandra 1.2 and CQL 3.0.
However since " is part of my data I have to use some other character for uploading my data, I need to use any extended ASCII characters. I tried various approaches but fails.
The following works, but need to use an extended ascii characters for my purpose..
copy (<columnnames>) from <filename> where deleimiter='|' and quote = '"';
copy (<columnnames>) from <filename> where deleimiter='|' and quote = '~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Pls advice on how I can use an extended ASCII character for quote parameter..
Thanks in advance
A note on the COPY documentation page suggests that for bulk loading (like in your case), the json2sstable utility should be used. You can then load the sstables to your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling all characters from ASCII table.
I had a similar problem, and inspected the source code of cqlsh (it's a python script). In my case, I was generating the csv with python, so it was a matter of finding the right python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using python.

Changing The Delimiter to CTRL+A in Python CSV Module

I'm trying to write a csv file with the delimiter ctrl+a. I'm going to have to eventually write the file to hadoop and I'm unable to use a standard delimiter.
Currently I'm trying this:
writer = csv.writer(f, delimiter = "\u0001")
for item in aList:
writer.writerow(item)
f.close()
However, the outputted excel file doesn't appear to be written correctly...
Some rows are condensed into one block, while others will have one field in the first and then the rest condensed into the second block, etc.
Is the error where I'm setting up the writer object, or am I just not familiar with separating files this way?
You can try using the nonprinting "group separator" character, which can be represented in python code as '\035'
see http://www.asciitable.com/index/asciifull.gif for some other nonprinting characters if you need more.
It may be helpful to include more context about why you want to use nonstandard delimiter. And whether Excel parsing of the file is necessary, or just a quick check to see if the file might be parsed properly by the target system, Hadoop.