liblinear's train.exe: "Wrong input format at line 1" - regression

I'm trying to run liblinear's train.exe on Windows:
>train ex1_train.txt
Wrong input format at line 1
Here's the beginning of the file. What's wrong?
17.592 1:6.1101
9.1302 1:5.5277
13.662 1:8.5186
11.854 1:7.0032
6.8233 1:5.8598
11.886 1:8.3829
4.3483 1:7.4764
12 1:8.5781
6.5987 1:6.4862
3.8166 1:5.0546
3.2522 1:5.7107
15.505 1:14.164
3.1551 1:5.734
7.2258 1:8.4084
0.71618 1:5.6407
3.5129 1:5.3794
5.3048 1:6.3654
0.56077 1:5.1301
3.6518 1:6.4296
5.3893 1:7.0708

Liblinear requires the same input format as LibSVM. From their README file:
The format of training and testing data file is:
<label> <index1>:<value1> <index2>:<value2> ...
Each line contains an instance and is ended by a '\n' character. For
classification, <label> is an integer indicating the class label
(multi-class is supported). For regression, <label> is the target
value which can be any real number. For one-class SVM, it's not used
so can be any number. The pair <index>:<value> gives a feature
(attribute) value: <index> is an integer starting from 1 and <value>
is a real number. The only exception is the precomputed kernel, where
<index> starts from 0; see the section of precomputed kernels. Indices
must be in ASCENDING order.
Since we don't have the entire file, the best answer we can provide is to make sure all these instructions are followed: e.g., that there is no TAB instead of a space, no '\r\n' instead of '\n', etc. A good way to debug would be to take a few lines and keep adding more until you get the error.
head -10 <yourfile> > tmp10
head -20 <yourfile> > tmp20
and so on, and see where the error first pops up.

My problems were that you can't use zero as a feature index, and the feature indices need to be in ascending order.
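If the manual bisection gets tedious, a small script can check every line against the rules quoted from the README. A minimal Python sketch (the exact set of checks is my own reading of those rules):

import sys

def check_libsvm_file(path):
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            if raw.endswith(b"\r\n"):
                print(f"line {lineno}: Windows '\\r\\n' line ending")
            if b"\t" in raw:
                print(f"line {lineno}: TAB instead of space")
            prev = 0
            for pair in raw.split()[1:]:       # the first token is the label
                index, _, _ = pair.partition(b":")
                if not index.isdigit() or int(index) < 1:
                    print(f"line {lineno}: bad feature index {index!r}")
                    break
                if int(index) <= prev:
                    print(f"line {lineno}: indices not in ascending order")
                    break
                prev = int(index)

check_libsvm_file(sys.argv[1])

Run it as python check_format.py ex1_train.txt and fix whatever it reports.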

Related

Octave - dlmread and csvread convert the first value to zero

When I try to read a csv file in Octave, I notice that the very first value in it is converted to zero. I tried both csvread and dlmread and I'm receiving no errors. I am able to open the file in a plain text editor and I can see the correct value there. From what I can tell, there are no funny hidden characters, spacings, or similar in the csv file. The files also contain only numbers. The only thing that I feel might be important is that I have five columns/groups that each have a different number of values in them.
I went through the commands' documentation on Octave Forge and I do not know what may be causing this. Does anyone have an idea of what I can troubleshoot?
To try to illustrate the issue, if I try to load a file with the contents:
1.1,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
The command window will return:
0.0,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
(with additional trailing zeros after the decimal point).
Command syntaxes I'm using are:
dt = csvread("FileName.csv")
and
dt = dlmread("FileName.csv",",")
and they both return the same.
Your csv file contains a Byte Order Mark right before the first number. You can confirm this by opening the file in a hex editor: you will see the sequence EF BB BF before the numbers start.
This causes the first entry to be interpreted as a 'string', and since strings are parsed based on whether there are numbers at the 'front' of the string sequence, it is parsed as the number zero (see also this answer for more details on how csv entries are parsed).
In my text editor, if I start at the top left of the file and press the right arrow key once, I can tell that the cursor hasn't moved (meaning I've just gone over the invisible byte order mark, which takes no visible space). Pressing backspace at this point to delete the byte order mark allows the csv to be read properly. Alternatively, you may have to fix your file in a hex editor, or find some other way to convert it to a proper ASCII file (or UTF-8 without the byte order mark).
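If you'd rather fix the file with a script, Python's utf-8-sig codec consumes a leading byte order mark if one is present; a minimal sketch (the file names are placeholders):

# Read with 'utf-8-sig', which strips a leading BOM, then write the
# text back out as plain UTF-8 without one.
with open("FileName.csv", "r", encoding="utf-8-sig") as src:
    data = src.read()
with open("FileName_fixed.csv", "w", encoding="utf-8", newline="") as dst:
    dst.write(data)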
Also, it may be worth checking how this file was produced; if you have any control over that process, perhaps you can find out why the mark was placed there in the first place and prevent it. E.g., if this was exported from Excel, you can choose the plain 'csv' format instead of 'utf-8 csv'.
UPDATE
In fact, this issue seems to have already been submitted as a bug and fixed in the development branch of Octave. See bug #58813 :)

What does 'multiline strings are different' mean in RIDE (Robot Framework) output?

I am trying to compare the data of two csv files and followed the process below in RIDE:
${csvA} =    Get File    ${filePathA}
${csvB} =    Get File    ${filePathB}
Should Be Equal As Strings    ${csvA}    ${csvB}
Here are the contents of my two csv files:
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
As some of the data does not match, when I run the code in RIDE the result is FAIL, and the log shows the data below:
Multiline strings are different:
--- first
+++ second
@@ -1,4 +1,4 @@
Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa
I would like to know the meaning of the --- first, +++ second, and @@ -1,4 +1,4 @@ content.
Thanks in advance!
When robot compares multiline strings (data that has newlines in it), it shows the differences in the same format as the standard unix diff tool. Those characters are all part of what's called a unified diff. Even though you pass in raw data, it's treating the data as two files and showing the differences between the two in a format familiar to most programmers.
Here are two references to read more about the format:
What does "## -1 +1 ##" mean in Git's diff output?. (stackoverflow)
the diff man page (gnu.org)
In short, the @@ line gives you a reference for which line numbers are different, and the + and - prefixes show you which lines differ.
In your specific example it's telling you that three lines were different between the two strings: the line beginning with Divy, the line beginning with Parth, and the line beginning with acc. Since the line beginning with Harshil does not show a + or -, that means it was identical between the two strings.
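For illustration, Python's difflib produces exactly this unified diff format; the following is just a demonstration of the format, not Robot Framework's internal code:

import difflib

csvA = "Harshil,45,8.03,DMJ\nDivy,55,8,VVN\nParth,1,9,vvn\nkjhjmb,44,0.5,bugg\n"
csvB = "Harshil,45,8.03,DMJ\nDivy,55,78,VVN\nParth,1,9,vvnbcb\nacc,5,6,afafa\n"

diff = difflib.unified_diff(csvA.splitlines(), csvB.splitlines(),
                            fromfile="first", tofile="second", lineterm="")
print("\n".join(diff))

This prints the same --- first / +++ second / @@ -1,4 +1,4 @@ header and +/- lines shown in the RIDE log.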

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS, then abstract a Hive table on top of it using a SerDe. I wonder, though, what would be the appropriate delimiter per JSON entry in a file? Is it a new line ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
How about if we encounter a new line character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape any "\n" that may be inside the JSON elements.
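For example, a standard JSON serializer already performs this escaping; a minimal Python sketch (the record contents are made up):

import json

records = [
    {"id": "entry1", "post": "first line\nsecond line"},
    {"id": "entry2", "post": "no newline here"},
]

with open("posts.json", "w") as f:
    for record in records:
        # json.dumps renders the embedded newline as the two characters
        # '\' and 'n', so each record really occupies exactly one line.
        f.write(json.dumps(record) + "\n")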
Indented JSON won't work well with hadoop/hive: to distribute processing, hadoop must be able to tell where a record ends, so that it can split the processing of a file of N bytes among W workers into W chunks of roughly N/W bytes each.
The splitting is done by the particular InputFormat that's been used, in case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having stray "\n" characters around would confuse Hadoop, and it would give you incomplete records.
As an alternative (I wouldn't recommend it), you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through hadoop/hive, choosing a character that won't appear in the JSON (for instance \001, i.e. CTRL-A, which Hive commonly uses as a field delimiter). That can be tricky, though, since the delimiter also has to be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies or uses the file on HDFS must be aware of it and will need special code to parse the file, whereas if you keep "\n" as the delimiter, the files remain normal text files that can be used by other tools.
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde

In Stata, how do I add variable labels from a separate csv file?

I have a set of csv files that are very simple to load into Stata using the -insheet- command. But they have very uninformative variable names. For each of these files, I also have a file of metadata consisting of two columns: the original (uninformative) variable names, and a description of what the variables actually mean. I'd like to use these metadata files to create variable labels, preferably without going through and typing up all the separate label commands or turning the metadata file into a dictionary for each file. It seems like there must be a quick way of loading the metadata file into Stata and looping through it to generate the label commands, but I don't know what it is. Any thoughts?
Ideally each line of the metadata is something like
varname1 "more interesting description"
in which case you can prefix each line with
label var
and then run the file as if it were a do-file using do. See the help for label. That is easy in a decent text editor: for example, search for the start of each line and replace it with label var followed by a space.
What could bite here includes:
You don't have double quotes " " as delimiters, in which case you need to insert them.
The extra information does not qualify as a variable label because it is more than 80 characters long. See help limits.
There are other ways to do this with Stata. You could write a program to read in the metadata and write out a do-file using file, but if this were my problem I would reach first for my text editor. (Most experienced Stata programmers use something else as well as doedit.)
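If you do go the program route, the generating script doesn't even have to run inside Stata. A minimal Python sketch, assuming the metadata csv has exactly two columns, varname then description, and that the descriptions contain no double quotes (file names are placeholders):

import csv

with open("metadata.csv", newline="") as src, open("labels.do", "w") as dst:
    for varname, description in csv.reader(src):
        # Truncate to Stata's 80-character variable-label limit.
        dst.write(f'label var {varname} "{description[:80]}"\n')

Afterwards, running do labels.do in Stata (with the data in memory) applies all the labels at once.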

Implementing run-length encoding

I've written a program to perform run length encoding.
In typical scenario if the text is
AAAAAABBCDEEEEGGHJ
run length encoding will make it
A6B2C1D1E4G2H1J1
but it was adding an extra 1 for each non-repeating character. Since I'm compressing BMP files with it, I went with the idea of placing a marker "$" to signify the occurrence of a repeating character (assuming that image files have a huge amount of repeating data).
So it'd look like
$A6$B2CD$E4$G2HJ
For the current example its length is the same, but there's a noticeable difference for BMP files. Now my problem is in decoding. It so happens that some BMP files contain the pattern $<char><num>, e.g. $I9, in the original file, so the compressed file would contain the same text. However, upon decoding, $I9 would be treated as an I repeated 9 times, so it produces wrong output! What I want to know is which symbol I can use to mark the start of a run so that it doesn't conflict with the original source.
Why don't you encode each $ in the original file as $$ in the compressed file?
And/or use some other character instead of $ - one that is not used much in bmp files.
Also note that the BMP format has RLE compression 'built-in' - look here, near the bottom of the page - under "Image Data and Compression".
I don't know what you're using your program for, or if it's just for learning, but if you used the "official" bmp method, your compressed images wouldn't need decompression before viewing.
AAAAAABBCDEEEEGGHJ$IIIIIIIII ==> $A6$B2CD$E4$G2HJ$$$I9
If the repeat character occurs in the data, try inserting an extra repeat character in the encoded data. Then, if the decoder sees a doubled repeat character, it can emit the actual repeat character instead of starting a run.
$A6$B2CD$E4$G2HJ$$$I9 ==> AAAAAABBCDEEEEGGHJ$IIIIIIIII
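A minimal Python sketch of this doubling scheme, assuming single-digit run lengths as in the question's format (runs longer than 9 are split):

from itertools import groupby

MARKER = "$"

def encode(text):
    out = []
    for char, group in groupby(text):
        count = len(list(group))
        if char == MARKER:
            out.append(MARKER * 2 * count)   # each literal '$' is doubled
        else:
            while count > 9:                 # counts are single digits
                out.append(MARKER + char + "9")
                count -= 9
            if count > 1:
                out.append(MARKER + char + str(count))
            else:
                out.append(char)
    return "".join(out)

def decode(text):
    out = []
    i = 0
    while i < len(text):
        if text[i] == MARKER:
            if text[i + 1] == MARKER:        # '$$' -> literal '$'
                out.append(MARKER)
                i += 2
            else:                            # '$<char><digit>' -> run
                out.append(text[i + 1] * int(text[i + 2]))
                i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

assert encode("AAAAAABBCDEEEEGGHJ$IIIIIIIII") == "$A6$B2CD$E4$G2HJ$$$I9"
assert decode("$A6$B2CD$E4$G2HJ$$$I9") == "AAAAAABBCDEEEEGGHJ$IIIIIIIII"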
What most programs do to signify that some character needs to be treated literally is that they have a defined escape sequence.
For example, in regular expressions, the following are specially defined characters that usually have a meaning:
^[].*+{}()$
Yes, your fun dollar sign character is in there, and it usually means end of line.
So what a programmer using regular expressions has to do to have these characters interpreted literally is express them as an escape sequence. For example, to interpret $ as $, and not end of line, the programmer uses \$, which is the escape sequence.
In your case, you can store literal dollar signs in your compressed file as \$.
NB: grep inverts this logic.
The above solution of storing $ as $$ becomes confusing when you have runs of $ in the BMP file.
If you have the luxury of being able to scan the entire input before starting to compress it, you could choose the least frequent value in the input as your escape value.
For example, given this input:
AAAABBCCCCDDEEEEEEEFFG
You could choose "G" as your escape value (or even "H" if it's part of your symbol set) and adopt a convention whereby the first character of the encoded stream is the escape value. So the string above might encode to:
GGA4BBGC4DDGE7FFGG
or even better:
HHA4BBHC4DDHE7FFG
Please note that there's no point in encoding a "run" of two identical values because the "compressed" version (e.g. HD2) is longer than the uncompressed version (DD).
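A minimal Python sketch of that selection step (string.ascii_uppercase stands in for whatever symbol set you actually use):

from collections import Counter
import string

def choose_escape(data, alphabet=string.ascii_uppercase):
    counts = Counter(data)
    # A symbol that never occurs is ideal; otherwise the least
    # frequent one minimizes the escaping overhead.
    return min(alphabet, key=lambda symbol: counts[symbol])

print(choose_escape("AAAABBCCCCDDEEEEEEEFFG"))   # 'H', absent from the input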
Hope that helps!
If I understand correctly, the problem is that $ is both the symbol marking a repeat and a value that can occur in the BMP data itself?
If so, what you could do is use a double $ ('$$') to denote that the '$' character should be treated not as a repeat marker, but as a single literal '$'. This would of course mean that '$' is expensive to encode (it takes two symbols instead of one), but it would solve your problem.
If you wanted to have a run of the '$' character, you would need to encode it as:
$$$5 - that is, a '$' run marker, '$$' for the literal character '$', and '5' for five repetitions.
I'm honestly not sure what would possess someone to use a text-based RLE if they want to compress binary data with it. A BMP is not text.
Right now, since only a single byte is read after the $, and it is interpreted as an ASCII digit from 0 to 9, this scheme has a run-length range of 0 to 9, meaning you can only compress up to 9 repetitions before a new run-length flag needs to be written. After all, you can't tell the difference between $I34 meaning a run length of 34, and $I3 followed by a literal 4 after the run of 3.
If this same byte is instead interpreted as a binary value, it can hold values from 0 to 255, giving a massive difference in efficiency.
As for escaping the $ signs themselves, I'd advise either always treating $ as a repeat of at least one ($$1), or, better yet, encoding the entire thing differently, with the order of the run-length value and the data swapped, so that a code becomes $<length><data>. Then you can use $0 as a special symbol meaning 'just $': when decompressing and encountering a 0 after a $, simply don't read a third byte. A run length of 0 should never appear in the compressed data anyway, so it can be given a special meaning; but this trick is useless if the data byte is put first, since the escape would then still be the same length as a normal repeat.
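A minimal Python sketch of that length-first byte scheme ($<length><data>, with the two-byte code $0 standing for a literal $; the threshold for when a run is worth encoding is my own choice, not part of the answer):

from itertools import groupby

MARKER = ord("$")

def encode(data):
    out = bytearray()
    for byte, group in groupby(data):
        count = len(list(group))
        if byte == MARKER and count == 1:
            out += bytes([MARKER, 0])            # '$' 0 -> a lone literal '$'
        elif count > 3 or byte == MARKER:
            while count > 0:
                chunk = min(count, 255)          # the length is one raw byte
                out += bytes([MARKER, chunk, byte])
                count -= chunk
        else:
            out += bytes([byte]) * count         # short runs stay literal
    return bytes(out)

def decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == MARKER:
            length = data[i + 1]
            if length == 0:                      # '$' 0 -> literal '$'
                out.append(MARKER)
                i += 2
            else:                                # '$' <length> <byte> -> run
                out += bytes([data[i + 2]]) * length
                i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

sample = b"AAAAAABBCDEEEEGGHJ$IIIIIIIII"
assert decode(encode(sample)) == sample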