Why are Chinese characters becoming ASCII in vi? - CSV

My CSV file's Chinese characters turned into ASCII-looking characters when I ran an awk command on it. I know this because when I open the CSV file in Vim,
I see this:
words,country,percent_sum,week
å<88><86>æ<9c><9f>,China,16.5,11/22/15
å<8f><91>è´§,China,31.36,11/22/15
The Chinese words are turned into ASCII characters. The only thing I did was:
cat myfile.csv|awk -F, '{if(NF==4 && $4 != "12/13/15-12/19/15" ) print }' > tmp
which is weird because I didn't overwrite my CSV file; I wrote to a tmp file instead.
However, when I cat the CSV file in the terminal, it looks fine.
Is this a Vim setting that I need to change?
I already have these settings in my vimrc:
set encoding=utf-8
set fileencoding=utf-8

There is nothing wrong with Vim if the Chinese characters display normally in it.
cat uses the terminal locale settings, as Alastair suggested, so please check your locale and pay attention to LANG and LC_ALL. You can also try typing Chinese in your terminal to see if it works correctly; if it does, your cat and awk should work as expected.
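For example, a quick way to check the environment (a sketch; the exact UTF-8 locale name will vary by system):
locale                    # check LANG, LC_ALL and the other LC_* values
file myfile.csv tmp       # both files should still be reported as UTF-8 text
export LANG=en_US.UTF-8   # example only: switch this session to a UTF-8 locale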

Related

Decode HTML entities to Chinese characters

I need some help.
My curl output shows the characters as follows (on a Linux terminal):
기타/부재시 집 앞에 놓고가셔&#46104
I need the output to show the actual characters, like (기타/부재시 집 앞에 놓고가셔되).
In other words, how do I convert these HTML entities to the corresponding characters on the terminal?
Please note I do not have PHP installed on my machine, so I cannot use html_entity_decode or other PHP decode methods.
I do have Perl and Python installed.
Just pipe the output through this simple Perl substitution:
perl -CO -pe 's/&#(\d+);/chr $1/ge'
-p reads the input line by line and prints each after processing
-CO turns on UTF-8 encoding of the output
/e evaluates the replacement part of the s/// substitution as code
chr just returns the character corresponding to the given code point.
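A quick way to try it (a hypothetical test string; the real input would come from the curl pipe):
echo 'order note: &#46104;' | perl -CO -pe 's/&#(\d+);/chr $1/ge'
# prints: order note: 되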

How to convert Combining Diacritical Marks to single grapheme

I have a PDF document with text I want to copy and paste into an HTML document.
The problem is that all accented characters are actually made of combining diacritical marks instead of single Unicode code points.
So for instance, é, which would be represented by the single Unicode code point U+00E9, is here encoded as two separate chars: e followed by the combining acute accent (U+0301).
This isn't very easy to deal with, especially since some browsers (Firefox) display a whitespace after the accented letter whereas some others (Chrome) do not.
Hence, is there a way to automatically convert those pesky characters into friendly single Unicode code point characters?
You want to normalize the string to one of the composed forms, NFC or NFKC (the difference is that NFKC removes some formatting distinctions, such as ligatures).
I'm not sure what you mean exactly by 'automatically', but it is possible in most languages, for example:
Python : unicodedata.normalize():
import unicodedata
original_string = '\u0065\u0301'  # 'e' followed by the combining acute accent: two code points
normalized_string = unicodedata.normalize('NFC', original_string)  # now the single code point 'é'
Java / Android : Normalizer.normalize()
String normalizedText = Normalizer.normalize(originalString, Normalizer.Form.NFC);
.Net : String.Normalize()
string normalizedText = originalString.Normalize(NormalizationForm.FormC);
Oracle SQL COMPOSE()
SELECT COMPOSE(UNISTR('\0065\0301')) FROM DUAL;
Online test: Dencode (NFD and NFKD are called Decoded NFC, Decoded NFKC)
If you are on Linux, you can use uconv to convert from one Unicode form to the other.
To convert a text file from UTF-8 NFD to UTF-8 NFC:
uconv -f utf8 -t utf8 -x NFC $in_file -o $out_file
Since you mentioned copy/pasting, you could also do it directly with an alias. Assuming your Linux is using UTF-8 by default:
alias to_nfc='xclip -o -selection clipboard | uconv -x NFC | xclip -selection clipboard'
Then you only have to type to_nfc in a terminal to have your clipboard converted.
On Debian-based systems (Ubuntu, etc.), uconv is in the package icu-devtools:
sudo apt install icu-devtools
On CentOS: yum install icu
On Mac (or Windows if you have Perl installed) you could try
perl -C -MUnicode::Normalize -pe '$_=NFC($_)' < $in_file > $out_file
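A quick sanity check of the conversion (a sketch; assumes a bash-like shell with uconv and xxd available):
printf 'e\xcc\x81\n' | uconv -f utf8 -t utf8 -x NFC | xxd
# the decomposed input bytes (65 cc 81) should come back as the single
# composed code point U+00E9, i.e. the UTF-8 bytes c3 a9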

Updating files using AWK: Why do I get a weird newline character after each replacement?

I have a .csv containing a few columns. One of those columns needs to be updated to the same number in ~1000 files. I'm trying to use AWK to edit each file, but I'm not getting the intended result.
What the original .csv looks like
heading_1,heading_2,heading_3,heading_4
a,b,c,1
d,e,f,1
g,h,i,1
j,k,m,1
I'm trying to update column 4 from 1 to 15.
awk '$4="15"' FS=, OFS=, file > update.csv
When I run this on a .csv generated in Excel, the result is a ^M character after the first line (which it does update to 15), and then it terminates and does not update any of the other lines.
It repeats the same mistake on each file when running through all files in a directory.
for file in *.csv; do awk '$4="15"' FS=, OFS=, $file > $file"_updated.csv"; done
Alternatively, if someone has a better way to do this task, I'm open to suggestions.
Excel is generating the control-Ms, not awk. Run dos2unix or similar on your file before running awk on it.
Well, I couldn't reproduce your problem on my Linux box, since writing 15 to the last column overwrites the \r (the ^M is actually 0x0D, i.e. \r) before the newline \n, but you could always remove the \r first:
$ awk 'sub(/\r/,""); ...' file
I have had some issues with non-ASCII characters when a file is processed in a different locale, for example a file with ISO-8859-1 encoding processed with GNU awk in a UTF-8 shell.
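A combined sketch of both suggestions (strip the trailing \r, then set column 4; the NR>1 guard keeps the header row intact, and the _updated filenames are just examples):
awk 'BEGIN{FS=OFS=","} {sub(/\r$/,"")} NR>1{$4=15} 1' file.csv > file_updated.csv
# or for every CSV in the directory:
for file in *.csv; do
    awk 'BEGIN{FS=OFS=","} {sub(/\r$/,"")} NR>1{$4=15} 1' "$file" > "${file%.csv}_updated.csv"
done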

MATLAB: Read HTML-Codes (within XML)

I'm trying to read the following XML-file of a Polish treebank using MATLAB: http://zil.ipipan.waw.pl/Sk%C5%82adnica?action=AttachFile&do=view&target=Sk%C5%82adnica-frazowa-0.5-TigerXML.xml.gz
Polish letters seem to be encoded as HTML-codes: http://webdesign.about.com/od/localization/l/blhtmlcodes-pl.htm
For instance, &#322; stands for 'ł'. If I open the treebank using 'UTF-8', I get words like k&#322;ania&#322;, which should actually be displayed as 'kłaniał'.
Now, I see 2 options to read the treebank correctly:
Directly read the XML-file such that HTML-codes are transformed into the corresponding characters.
First save the words in non-decoded format (e.g. as k&#322;ania&#322;) and then transform the characters afterwards.
Is it possible to do one of the 2 options (or both) in MATLAB?
A non-MATLAB solution is to preprocess the file through some external utility. For instance, with Ruby installed, one could use the HTMLentities gem to unescape all the special characters.
sudo gem install htmlentities
Let file.xml be the file, which should consist of ASCII-only chars. The Ruby code to convert it could look like this:
#!/usr/bin/env ruby
require 'htmlentities'
xml = File.open("file.xml").read
converted_xml = HTMLEntities.new.decode xml
IO.write "decoded_file.xml", converted_xml
(To run the file, don't forget to chmod +x it to make it executable).
Or more compactly, as a one-liner
ruby -e "require 'htmlentities';IO.write(\"decoded_file.xml\",HTMLEntities.new.decode(File.open(\"file.xml\").read))"
You could then postprocess the xml however you wish.
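If Ruby isn't available, the same preprocessing could be done with Python's standard library html.unescape, which handles both named and numeric entities (a sketch, assuming python3 is installed):
python3 -c 'import html,sys; sys.stdout.write(html.unescape(sys.stdin.read()))' < file.xml > decoded_file.xml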

What type of encoding is this: "Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators"?

I am trying to export the output of an MSSQL query, which must use UTF (UTF-16, I suppose) encoding according to the description, so I am using the -W and -u switches with sqlcmd.
By default, Ä is converted to a character that looks like z with two dots or an inverted ^ above it, and the editor lists the file as the ANSI character set.
When I try to use Notepad++ to convert this file to UTF-8, it shows me strange highlighted characters (x8E) for Ä, and others such as x86 and x94 for other characters, no matter what encoding I set as the default in Notepad++.
When I transferred the file to an Ubuntu 12.04 machine, the file and chardet commands say:
user@user:~/Desktop/encoding/checkencoding$ file convertit4.csv
convertit4.csv: Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators
user@user:~/Desktop/encoding/checkencoding$ chardet convertit4.csv
convertit4.csv: ISO-8859-2 (confidence: 0.77)
I am confused about what kind of encoding it uses.
The purpose is to convert it to UTF-8 encoding without any errors, so I can upload it to the Magmi importer.
Note: I am using this command to remove the underline after the headers:
type c:\outfiles\convertit1.temp | findstr /r /v "^\-[;\-]*$" > c:\outfiles\convertit4.csv
I hope this line is not the problem.
I hope this information is complete enough to solve the issue. If any more information is needed, please let me know.
Regards.
Try the -f option as in
http://www.yaldex.com/sql_server_tutorial_3/ch06lev1sec1.html
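The -f switch lets sqlcmd specify the input/output code pages it uses. If the export still arrives in a legacy code page, converting it on the Ubuntu side is another option; the x8E byte for Ä matches the old OEM code page 850, so that is a reasonable first guess (a sketch, not a confirmed diagnosis):
iconv -f CP850 -t UTF-8 convertit4.csv > convertit4_utf8.csv
# if the umlauts come out wrong, try the Windows ANSI code page instead:
iconv -f WINDOWS-1252 -t UTF-8 convertit4.csv | head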