Perl CSV reading characters that aren't there

I'm reading a file on Ubuntu using Perl's Text::CSV_XS module:
open my $fh, '<:encoding(utf8)', 'file.csv' or die "Can't read csv: $!"; # error shows on this line
while (my $row = $list->getline ($fh)) {
....
}
and this reads just fine until one line gives an error:
UTF-8 "\xE9" does not map to Unicode at 0.xlsx_to_json.pl line 198, <$_[...]> line 14019.
Looking online, this suggests that this is an ê character or something similar, which is strange because I don't see any such characters on line 14019; that line looks just like any other line.
I tried changing the open line to
open my $fh, '<', 'file.csv'
but that gives the same error.
I tried opening the CSV and saving it as a CSV with a different delimiter, but I can't do that in Excel 2016 anymore; the option to change the delimiter simply doesn't appear.
I tried opening it in LibreOffice to save as a CSV, but an update removed the ability to change the delimiter.
How can I read this CSV file without this strange error?

Your file is not a valid UTF-8 file: byte E9 appears where it's not expected. In UTF-8, \xE9 starts a three-byte sequence, so it must be followed by two continuation bytes.
Followed by two continuation bytes = ok
$ perl -M5.010 -MEncode=decode -e'
decode("UTF-8", "\xE9\xBF\xBF", Encode::FB_WARN | Encode::LEAVE_SRC);
say "done";
'
done
Not followed by two continuation bytes = bad
$ perl -M5.010 -MEncode=decode -e'
decode("UTF-8", "\xE9\x41", Encode::FB_WARN | Encode::LEAVE_SRC);
say "done";
'
UTF-8 "\xE9" does not map to Unicode at -e line 2.
done
Fix your bad data.
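If you want to see exactly where the bad bytes are before fixing the data, here is a minimal sketch (assuming the file name file.csv from the question) that reads the file as raw bytes and reports every line that fails to decode as UTF-8:
use strict;
use warnings;
use Encode qw(decode);

# Read raw bytes so the decode() call below is the only place UTF-8 is checked.
open my $fh, '<:raw', 'file.csv' or die "Can't read csv: $!";
while (my $line = <$fh>) {
    # FB_CROAK makes decode die on the first invalid byte; LEAVE_SRC keeps $line untouched.
    eval { decode('UTF-8', $line, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 }
        or print "line $.: $@";
}
close $fh;
Each reported line number tells you where to look; from there you can repair the file or re-export it as real UTF-8 from the source application.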

Related

neo4j throws error for "\" character

I am exporting a CSV file and need to read it line by line.
One of the lines in the CSV file contains the string "C:\Program Files\". Because of this line, it throws the error below.
At D:\workdir\Neo4j_Database\Database1\import\Data.csv:22798 - there's
a field starting with a quote and whereas it ends that quote there
seems to be characters in that field after that ending quote. That
isn't supported. This is what I read: 'CMM 10.0.1 Silent Installation
will install SW always in "C:\Program Files"",V10.0,
,,,,,,,,105111,AVASAAIS AG,E,,"G,"'
If I remove the last \ from the line, then it does not throw this error.
I am not sure how to resolve this without modifying the CSV file.
Note: the CSV loader used is LOAD CSV.

Bash: Base64 encode 1 column in a very large .csv and output to new file

I've tried using the code below, but the CSV file has over 80 million lines (roughly 25 GB) and some of the special characters seem to break the echo command. The CSV has 2 columns separated by a comma.
ex:
blah, blah2
data1,data2
line3,fd$$#$%T%^Y%&$$B
somedata,%^&%^&%^&^
The goal is to take that second column and base64-encode it, to get it ready to import into a SQL db. I'm doing a base64 encode on the second column so there's Unicode support etc. and no character will corrupt the db.
I'm looking for a more efficient way of doing this that won't break on special chars etc.
awk -F "," '
{
"echo "$2" | base64" | getline x
print $1, x
}
' OFS=',' input.csv > base64.csv
Error:
sh: 1: Syntax error: word unexpected (expecting ")") :
not foundrf :
not found201054 :
not foundth :
not foundz09
| base64' (Too many open files)ut.csv FNR=1078) fatal: cannot open pipe `echo q1w2e3r4
The problem is that you're not quoting the argument to echo in the awk script.
But there's no need to use awk for this; bash can parse the file directly.
while IFS=, read -r col1 col2
do
    base64=$(base64 <<<"$col2")
    echo "$col1,$base64"
done < input.csv > base64.csv
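If running one base64 process per row is still too slow for 80 million lines, an in-process encoder avoids the fork overhead entirely. Here is a sketch in Perl using the core MIME::Base64 module; it assumes the first column contains no commas and the fields are unquoted, and uses the input.csv/base64.csv names from the question:
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

open my $in,  '<', 'input.csv'  or die "Can't read input.csv: $!";
open my $out, '>', 'base64.csv' or die "Can't write base64.csv: $!";
while (my $line = <$in>) {
    chomp $line;
    # Split on the first comma only, so commas inside column 2 survive.
    my ($col1, $col2) = split /,/, $line, 2;
    # The empty second argument stops encode_base64 from adding line breaks.
    print {$out} $col1, ',', encode_base64($col2 // '', ''), "\n";
}
close $in;
close $out;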
Try something like this in your MySQL command-line client:
LOAD DATA LOCAL INFILE '/tmp/filename.txt' INTO TABLE tbl FIELDS TERMINATED BY ','
You can reorder fields if needed and apply special expressions if you need to remove special characters, concatenate strings, convert date format, etc. If you still really need base64 conversion, MySQL versions 5.6 and later have a native function for that (TO_BASE64()), while there is a UDF for the older ones. See base64 encode in MySQL
However, as long as your columns do not have commas, LOAD DATA INFILE will be able to handle it, and you can save some disk space by avoiding the conversion.
For details on how to use LOAD DATA INFILE, see MySQL manual: https://dev.mysql.com/doc/refman/5.7/en/load-data.html
You will need to authenticate to MySQL as a user with the INSERT privilege on the target table, and have the local-infile option enabled (e.g. by passing --local-infile=1 on the command line).
The goal is to take that second column and base64
With awk's getline function:
awk -F',[[:space:]]*' '{ cmd="echo \042"$2"\042 | base64"; cmd | getline v;
close(cmd); print $1","v }' input.csv > base64.csv
The base64.csv contents (for your current input):
blah,YmxhaDIK
data1,ZGF0YTIK
line3,ZmQyNzMwOCMkJVQlXlklJjI3MzA4Qgo=
somedata,JV4mJV4mJV4mXgo=

Updating files using AWK: Why do I get weird newline character after each replacement?

I have a .csv containing a few columns. One of those columns needs to be updated to the same number in ~1000 files. I'm trying to use AWK to edit each file, but I'm not getting the intended result.
What the original .csv looks like
heading_1,heading_2,heading_3,heading_4
a,b,c,1
d,e,f,1
g,h,i,1
j,k,m,1
I'm trying to update column 4 from 1 to 15.
awk '$4="15"' FS=, OFS=, file > update.csv
When I run this on a .csv generated in Excel, the result is a ^M character after the first line (which it updates to 15), and then it terminates and does not update any of the other rows.
It repeats the same mistake on each file when running through all files in a directory.
for file in *.csv; do awk '$4="15"' FS=, OFS=, "$file" > "${file}_updated.csv"; done
Alternatively, if someone has a better way to do this task, I'm open to suggestions.
Excel is generating the control-Ms, not awk. Run dos2unix or similar on your file before running awk on it.
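If dos2unix isn't installed, a Perl one-liner is a common stand-in; it rewrites the file in place (keeping a .bak backup) and assumes the carriage return sits right before the newline, as in Excel-generated CSVs (yourfile.csv here is a placeholder for whichever file you're cleaning):
perl -pi.bak -e 's/\r$//' yourfile.csv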
Well, I couldn't reproduce your problem on my Linux box, as writing 15 to the last column will overwrite the \r (the ^M is actually 0x0D, i.e. \r) before the newline \n, but you could always remove the \r first:
$ awk 'sub(/\r/,""); ...' file
I have had some issues with non-ASCII characters when a file is processed in a different locale, for example a file with ISO-8859-1 encoding processed with GNU awk in a UTF-8 shell.

weka - csv file upload produces null error

Hi,
no matter what I try, I keep getting the error "file not recognised as 'CSV data files' file, reason: null" while loading a CSV file into the Weka Explorer. Any suggestions as to what could be wrong?
I have been trying to "correct" this type of error (Wrong number of values, Read 1, expected 2 Token[EOL], line 17), and after it stops giving those, the null one appears.
The file in question: file link
Thank you in advance!
I've preprocessed the file with these shell commands:
# optional:
# The file uses "\r" (sometimes displayed as ^M) characters
# as line separator. Character "\n" is better.
# make it a unix-compliant csv file
# original file is saved into ~/Downloads/rezultati.csv.bak
perl -pi.bak -E "s/\r/\n/g" ~/Downloads/rezultati.csv
# end optional
# take the first 240 lines, dropping the defective last line.
# I don't know what's wrong with it. maybe it's "No newline at end of file"
# I'll just omit that single line starting with ID 243.
head -240 ~/Downloads/rezultati.csv > ~/Downloads/rezultati-240.csv
rezultati-240.csv can be loaded into Weka.

Data Type of Module Output

I have a script that I run on various texts to convert XHTML entities (e.g., &uuml;) to ASCII. For example, my script is written in the following manner:
open (INPUT, '+<file') || die "File doesn't exist! $!";
open (OUTPUT, '>file') || die "Can't find file! $!";
while (<INPUT>) {
    s/&uuml/ü/g;
    print OUTPUT $_;
}
This works as expected and substitutes the XHTML with the ASCII equivalent. However, since this is often run, I've attempted to convert it into a module. But Perl doesn't return "ü"; it returns the decomposition. How can I get Perl to return the data back with the ASCII equivalent (as run and printed in my regular .pl file)?
There is no ASCII. Not in practice anyway, and certainly not outside the US. I suggest you specify an encoding that will have all characters you might encounter (ASCII does not contain ü; it is only a 7-bit encoding!). Latin-1 is possible, but still suboptimal, so you should use Unicode, preferably UTF-8.
If you don't want to output in Unicode, at least your Perl script should be encoded with UTF-8. To signal this to the perl interpreter, put use utf8; at the top of your script.
Then open the input file with an encoding layer like this:
open my $fh, "<:encoding(UTF-8)", $filename
The same goes for the output file. Just make sure to specify an encoding when you want to use one.
You can change the encoding of a file with binmode, just see the documentation.
You can also use the Encode module to translate a byte string to unicode and vice versa. See this excellent question for further information about using Unicode with Perl.
If you want to, you can use the existing HTML::Entities module to handle the entity decoding and just focus on the I/O.
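To illustrate that last suggestion, here is a minimal sketch using HTML::Entities together with explicit UTF-8 layers on both handles; input.html and output.txt are placeholder file names, not anything from the original script:
use strict;
use warnings;
use utf8;                               # the script source itself is UTF-8 encoded
use HTML::Entities qw(decode_entities);

# Placeholder file names; substitute your own.
open my $in,  '<:encoding(UTF-8)', 'input.html' or die "Can't read input.html: $!";
open my $out, '>:encoding(UTF-8)', 'output.txt' or die "Can't write output.txt: $!";
while (my $line = <$in>) {
    # decode_entities turns &uuml; into ü, &amp; into &, and so on.
    print {$out} decode_entities($line);
}
close $in;
close $out;
With the UTF-8 output layer in place, the decoded ü is written out as a properly encoded character rather than raw bytes.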