Reading csv without specifying enclosure characters in Weka - csv

I have a dataset that I want to open in Weka, so I converted it as csv file. (The file contains some text including commas/apostrophes/quotation marks, while its seperator is pipeline character.)
When I try to read this csv file, in options window, I specify pipeline (|) as my fieldSeperator, leave enclosureCharacters empty, and don't touch the rest of the options. This can be seen in the screenshot:
Then I get this error:
File not recognised as an 'CSV data files' file. Reason: Enclosures
can only be single characters.
Seems like Weka's csv loader does not accept enclosureCharacters field empty? What can I write into this field? I think my file does not have enclosures for its text data.

Related

Invalid literal because symbol appears when reading a csv file

When I am using replit I can remove the little symbol that appears when I drag and drop in a csv file so my main.py can read it, otherwise I get invalid literal base 10 issue. I am trying to run this on local machine with sublime text and getting same error now as it is reading the file from the directory, so I assume it is adding this symbol in before reading.... I can click on the csv file in replit and edit, but cannot do this in sublime.
Can someone explain what this is for? HOw can I get it to read the basic comma delimited numbers in the file (It is a game tile map).
with open(f'level{level}_data.csv', newline= '') as csvfile:
reader = csv.reader(csvfile, delimiter=',')
Saved it is comma delimited csv instead of UTF-8 comma delimited csv. It then imports without the 'question mark in a diamon' symbol. I understand this is an unrecognised special character, but I have nothing apart from integers in my table. Maybe someone could clarify that?...

convert CSV to JSON using Python

I need to convert a CSV file to JSON file using Python. I used this,
variable = csv.DictReader(file.csv)
It throws this ERROR
csv.Error: line contains NULL byte
I checked the CSV file in Excel, it shows no NULL chars, but when I printed the data in CSV file using Python. There are some data like SOHNULNULHG (here last 2 letters, HG is the data displaying in the Excel). I need to remove these ASCII chars in the CSV file, while converting to JSON. (i.e. I need only HG from the above string)
I just ran into the same issue. I converted my csv file to csv UTF-8 and ran it again without any errors. That seemed to fix the ASCII char issue. Hope that helps.
To convert the csv type, I just opened my file up in Excel, did save as, then selected CSV UTF-8(Comma delimited)(*.csv) in the Save as type.
Hope that helps.

xlsread in octave return zero values

I am trying to read a csv file in octave. The file contains a table with both numeric and text data. It also contains information of date and hour. In addition, the first line is in a different format then the rest of the lines since it contains titles.
The csvread can only read numeric data (according to Octave help), so I tried using xlsread as follows:
[NUMARR, TXTARR, RAWARR, LIMITS] = xlsread ('Line.csv')
I get only a matrix of NUMARR with numeric values. However, all other returned variables are empty- their dimension is 0x0.
How do I get all the text and all other information?
TX!
To solve this issue, open your CSV file in Windows notepad and save it as ANSI format instead of UNICODE.

Is there any technical difference between CSV, a TSV or a TXT file?

I use these files constantly in my application, but aren't CSV, TSV or TXT files all flat files?
The content is:
"sample","sample"
They are all text files, following the same "guidelines". The difference between the files are - as long as the creator followed some "rules", that:
A csv file will have comma separated values and a tsv file will have tab seperated values.
For .txt files, there is no formatting specified.
.csv stands for comma separated values, .tsv stands for tab separated values.
As the names suggest, different elements in the file are separated by ',' and '\t' respectively.
The type is chosen depending on the data. If we have say numbers larger than 3 digits, we might need commas as part of the content ans it would be better to use a csv in that case.
Both are types of text files and are increasingly used for classification and data mining purposes.
They do not have any other technical distinguishing factor.
A text file (which might have a txt file extension) will have lines separated by a platform specific line separator (CRLF on Windows, LF on Linux, and so on), and it will tend to contain characters human readable as text in some encoding. Apart from that human readability expectation this allows pretty much any file content on some platforms, so this is more of a content classification than a specific file format.
The other two formats are usually considered special cases of a text file intended to allow easy automated processing; tsv, a "tab separated values" file is simpler than csv, a "comma separated values" file.
csv will have commas as field separators, and it may use quoting and escaping especially to handle commas and quotes occurring in those fields. It may also include a header line as the first line in the file. The last line in the file may or may not end with a line separator.
(Details.)
tsv simply disallows tabs in the values, the header line is mandatory, the final line separator is mandatory.
(Details.)
A "flat file", in connection with databases, is a text file as opposed to a machine optimized storage method (such as a fixed size record file or a compressed backup file or a file using more elaborate markup language supporting data validation); a flat file tends to be csv or tsv or similar.
This answer benefited from a comment by Alex Shpilkin.

Remove all binary characters from a file

Occasionally, I have a hard time manipulating data in a CSV file because of the following error.
Binary file (standard input) matches
I researched several articles online but cannot seem to find one that helps me remove all of the binary characters or elements from a CSV file.
Unfortunately, I do not know where to start with this.
If I run the 'file' command on the file, I get the following output:
Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminators
The second from last line in the file prints as:
"???? ?????, ???? ???",????,"?????, ????",???,,,,,,,,,,,,,,,,,,,,,,,,* Home,email#address.com,,
The second line in the file prints as:
,,,,,,,,,,,,,,,,,,,,,,,,,,,* ,email#address.com,,
This file contains too many lines to open in Excel or a GUI, "Save as..." and remove the binary elements that way.
Please help me. Thank you!