Octave - dlmread and csvread convert the first value to zero - csv

When I try to read a csv file in Octave I realize that the very first value from it is converted to zero. I tried both csvread and dlmread and I'm receiving no errors. I am able to open the file in a plain text editor and I can see the correct value there. From what I can tell, there are no funny hidden characters, spacings, or similar in the csv file. Files also contain only numbers. The only thing that I feel might be important is that I have five columns/groups that each have different number of values in them.
I went through the commands' documentation on Octave Forge and I do not know what may be causing this. Does anyone have an idea what I can troubleshoot?
To try to illustrate the issue, if I try to load a file with the contents:
1.1,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
Command window will return:
0.0,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
( with additional trailing zeros after the decimal point).
Command syntaxes I'm using are:
dt = csvread("FileName.csv")
and
dt = dlmread("FileName.csv",",")
and they both return the same.

Your csv file contains a Byte Order Mark right before the first number. You can confirm this if you open the file in a hex editor, you will see the sequence EF BB BF before the numbers start.
This causes the first entry to be interpreted as a 'string', and since strings are parsed based on whether there are numbers in 'front' of the string sequence, this is parsed as the number zero. (see also this answer for more details on how csv entries are parsed).
In my text editor, if I start at the top left of the file, and press the right arrow key once, you can tell that the cursor hasn't moved (meaning I've just gone over the invisible byte order mark, which takes no visible space). Pressing backspace at this point to delete the byte order mark allows the csv to be read properly. Alternatively, you may have to fix your file in a hex editor, or find some other way to convert it to a proper Ascii file (or UTF without the byte order mark).
Also, it may be worth checking how this file was produced; if you have any control in that process, perhaps you can find why this mark was placed in the first place and prevent it. E.g., if this was exported from Excel, you can choose plain 'csv' format instead of 'utf-8 csv'.
UPDATE
In fact, this issue seems to have already been submitted as a bug and fixed in the development branch of octave. See #58813 :)

Related

How do I preserve the leading 0 of a number using Unoconv when converting from a .csv file to a .xls file?

I have a 3 column csv file. The 2nd column contains numbers with a leading zero. For example:
044934343
I need to convert a .csv file into a .xls and to do that I'm using the command line tool called 'unoconv'.
It's converting as expected, however when I load up the .xls in Excel instead of showing '04493434', the cell shows '4493434' (the leading 0 has been removed).
I have tried surrounding the number in the .csv file with a single quote and a double quote however the leading 0 is still removed after conversion.
Is there a way to tell unoconv that a particular column should be of a TEXT type? I've tried to read the man page of unocov however the options are little confusing.
Any help would be greatly appreciated.
Perhaps I came too late at the scene, but just in case someone is looking for an answer for a similar question this is how to do:
unoconv -i FilterOptions=44,34,76,1,1/1/2/2/3/1 --format xls <csvFileName>
The key here is "1/1/2/2/3/1" part, which tells unoconv that the second column's type should be "TEXT", leaving the first and third as "Standard".
You can find more info here: https://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options#Token_7.2C_csv_import
BTW this is my first post here...

Golang CSV read : extraneous " in field error

I am using a simple program to read CSV file, somehow I noticed when I created a CSV using EXCEL or windows based computer go library fails to read it. even when I use cat command it only shows me last line on the terminal. It always results in this error extraneous " in field.
I researched somewhat than I found it is somewhat related to carriage return differences between OS.
But I really want to ask how to make a generic csv reader. I tried reading the same csv using pandas and it was reading successfully. But i am not been able to achieve this using my Go code.
Also screen shot of correct csv Is here
Your file clearly shows that you've got an extra quote at the end of the content. While programs like pandas may be fine with that, I assume it's not valid csv so go does return an error.
Quick example of what's wrong with your data: https://play.golang.org/p/KBikSc1nzD
Update: After your update and a little bit of searching, I have to apoligize, the carriage return does matter and seems to be tha main culprit here, Go seems to be ok handling the \r\n windows variant but not the \r one. In that case what you can do is wrap the bytes.Reader into a custom reader that replaces the \r byte with the \n byte.
Here's an example: https://play.golang.org/p/vNjzwAHmtg
Please note, that the example is just that, an example, it's not handling all the possible cases where \r might be a legit byte.

Java.io.IOException: wrong number of values (WEKA CSV to ARFF)

Currently working on a Data Mining project using my own dataset I had found using Weka. The only issue is that taking my file from csv format and converting it into arff format is causing issues.
java.io.IOException: wrong number of values. Read 2, expected 5, Read Token[EOL], line 3
This is the error I am getting. I have browsed around online looking for similar issues and have tried removing all quotes and special characters that throw this exception. Every place I looked told me to remove special characters and I believe there are none left. The link to my dataset is here : https://docs.google.com/spreadsheets/d/1xqEe7MZE9SdKB_yvFSgWeSVYuDrq0b31Eu5oECNbGH0/edit#gid=1736568367&vpid=A1
This is the first three lines of my file where the first is the attribute names, file is separated by commas in note
Inequality Adjusted HPI Rank,Sub Region,Inequality Adjusted Life Expectancy,Inquality Adjusted Well being,Footprint
,Inequality adjusted HPI
1,1,73.1,6.9,2.5,48.2
2,6,65.17333333,5.487667631,1.390974448,45.97489063
If you open your file with a text editor, you will see that Footprint has quotes around it. Delete the quotes and you are good to go!
Weka is normally not that good in reading CSV files that include special characters, and ARFF files are normally easier to use. Therefore, in such cases, the easiest way is to convert your CSV file to an ARFF file using R ("RWeka" and "foreign" libraries can handle this conversion).
There is also another possibility. I was creating my CSV file and the header had a different number of elements compared to the rest of the data. So, check the header as well...!

Why are uri chars (or at least spaces) being dropped on an html file upload?

I have a file upload form and would like to use the filename on the server, however I notice that when I upload it the spaces are dropped. On the client/browser I can do something like this in an event called after the input type='file' element has changed:
function process_svg (e) {
var files = e.target.files || e.originalEvent.dataTransfer.files;
console.log(files[0].filename);
And if I upload a file with the name 'some file - type.ext' 'some file - type.ext' will be printed in the console. On the server (running bottle) however if I run:
#route('/some_route')
def some_route():
print(request.files['form_name_attr'].filename)
I get 'somefile-type.ext.' I am guessing this has to do with uri escaping (or lack there of), but since you cannot change a file preupload how do you get around this and preserve it? Strangely I cannot find mention of this on google, in part I have had trouble thinking of appropriate search terms, but I'm also aware that this may not actually be native behaviour, but a bug elsewhere in my code.
I do not think that is the case as I've issued these console.log and print statements at the end (right before the upload) and beginning (right when the server starts processing the request) and do not believe I really have any code to touch it in between, however if that is the case please let me know as I could be looking in the wrong direction.
You want raw_filename, not filename.
(Note that it may contain unsafe characters.)
#route('/some_route', method='POST')
def some_route():
print(request.files['form_name_attr'].filename) # "cleaned" file name
print(request.files['form_name_attr'].raw_filename) # unmodified file name
Found this in the source code for FileUpload.filename:
Only ASCII letters, digits, dashes, underscores and dots are allowed
in the final filename. Accents are removed, if possible. Whitespace is
replaced by a single dash. Leading or tailing dots or dashes are
removed. The filename is limited to 255 characters.

Implementing run-length encoding

I've written a program to perform run length encoding.
In typical scenario if the text is
AAAAAABBCDEEEEGGHJ
run length encoding will make it
A6B2C1D1E4G2H1J1
but it was adding extra 1 for each non repeating character. Since i'm compressing BMP files with it, i went with an idea of placing a marker "$" to signify the occurance of a repeating character, (assuming that image files have huge amount of repeating text).
So it'd look like
$A6$B2CD$E4$G2HJ
For the current example it's length is the same, but there's a noticable difference for BMP files. Now my problem is in decoding. It so happens some BMP Files have the pattern $<char><num> i.e. $I9 in the original file, so in the compressed file also i'd contain the same text. $I9, however upon decoding it'd treat it as a repeating I which repeats 9 times! So it produces wrong output. What i want to know is which symbol can i use to mark the start of a repeating character (run) so that it doesn't conflict with the original source.
Why don't you encode each $ in the original file as $$ in the compressed file?
And/or use some other character instead of $ - one that is not used much in bmp files.
Also note that the BMP format has RLE compression 'built-in' - look here, near the bottom of the page - under "Image Data and Compression".
I don't know what you're using your program for, or if it's just for learning, but if you used the "official" bmp method, your compressed images wouldn't need decompression before viewing.
AAAAAABBCDEEEEGGHJ$IIIIIIIII ==> $A6$B2CD$E4$G2HJ$$I9
If the repeat character occurs in the data, try inserting an extra repeat character in the encoded data. Then if the decoder sees a double repeat character it can insert the actual repeat character
$A6$B2CD$E4$G2HJ$$I9 ==> AAAAAABBCDEEEEGGHJ$IIIIIIIII
What most programs do to signify that some character needs to be treated literally is that they have a defined escape sequence.
For example, in regular expressions, the following are specially defined characters that usually have a meaning:
^[].*+{}()$
Yes, your fun dollar sign character is in there, and it usually means end of line.
So what a programmer using regular expressions has to do to have these characters interpreted literally is that they need to express those characters as an escape sequence. For example, to interpret $ as $, and not end of line, the programmer uses \$, which is the escape sequence.(1)
In your case, you can store literal dollar signs into your compressed file as \$.(2)
NB: grep inverts this logic.
The above solutions to store $ as $$ becomes confusing when you have runs of $ in the BMP file.
If you have the luxury of being able to scan the entire input before starting to compress it, you could choose the least frequent value in the input as your escape value.
For example, given this input:
AAAABBCCCCDDEEEEEEEFFG
You could choose "G" as your escape value (or even "H" if it's part of your symbol set) and adopt a convention whereby the first character of the encoded stream is the escape value. So the string above might encode to:
GGA4BBGC4DDGE7FFGG
or even better:
HHA4BBHC4DDHE7FFG
Please note that there's no point in encoding a "run" of two identical values because the "compressed" version (e.g. HD2) is longer than the uncompressed version (DD).
Hope that helps!
If I understand correctly, the problem is that $ is both a symbol for marking a repeat, and also can be a 'BMP' value as well?
If so, what you could do is to mark a double $ ('$$') character to denote that the '$' character should be treated not as a repeat, but as a single '$'. This would of course mean that the '$' is expensive to encode (takes two symbols instead of 1), but would solve your problem.
If you wanted to have a run of the '$' character, you would need to encode it as:
$$$5 - meaning '$' run of '$$'=$, '5' - 5 times.
I'm honestly not sure what would possessed someone to use a text-based RLE if they want to compress binary data with it. A BMP is not text.
Right now, since only a single byte is read after the $, and it is interpreted as ascii number from 0 to 9, this process has a run length range of 0 to 9, meaning you can only compress values up to 9 repetitions before a new run-length flag needs to be written. After all, you can't make the difference between $I34 for a run-length of 34, and $I3 + 4 for a literal 4 behind the repeat of 3.
If this same byte is instead interpreted as binary value, it can contain values from 0 to 255, giving a massive difference in efficiency.
As for the escaping of $ signs themselves, I'd advice either always treating it as repeat of at least 1 ($$1), or, better yet, encoding the entire thing differently, with the order of the run length values and the data swapped, so a code becomes $<length><data>; then you can use $0 as special symbol to mean 'just $'. When decompressing and encountering the 0 after a $, simply don't read on for a third byte. A run length of 0 should never appear in the compressed data anyway, so it can be given a special meaning, but this is useless if the data byte is put first, since then it'd still be the same length as a normal repeat.