How to convert Combining Diacritical Marks to a single grapheme - html

I have a PDF document with text I want to copy and paste into an HTML document.
The problem is that all accented characters are actually made of combining diacritical marks instead of single Unicode code points.
So, for instance, é, which could be represented by the single Unicode code point U+00E9, is here encoded as two separate characters: e followed by the combining acute accent (U+0301).
This isn't very easy to deal with, especially since some browsers (Firefox) display whitespace after the accented letter whereas others (Chrome) do not.
Hence, is there a way to automatically convert those pesky characters into friendly single Unicode code point characters?

You want to normalize the string to one of the composed forms, NFC or NFKC (the difference is that NFKC also removes some formatting distinctions, such as ligatures).
I'm not sure what you mean exactly by 'automatically', but I believe it is possible in most languages, for example:
Python: unicodedata.normalize() (a short verification follows this list):
import unicodedata
original_string = '\u0065\u0301'
normalized_string = unicodedata.normalize('NFC', original_string)
Java / Android: Normalizer.normalize()
String normalizedText = Normalizer.normalize(originalString, Normalizer.Form.NFC);
.NET: String.Normalize()
string normalizedText = originalString.Normalize(NormalizationForm.FormC);
Oracle SQL: COMPOSE()
SELECT COMPOSE(UNISTR('\0065\0301')) FROM DUAL;
Online test: Dencode (where NFD and NFKD are called "Decoded NFC" and "Decoded NFKC")
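To verify the Python call above, here is a minimal standalone sketch (standard library only; the variable names match the example):
import unicodedata

original_string = '\u0065\u0301'        # 'e' followed by the combining acute accent
normalized_string = unicodedata.normalize('NFC', original_string)

print(len(original_string))             # 2 code points before normalization
print(len(normalized_string))           # 1 code point after normalization
print(normalized_string == '\u00e9')    # True: the precomposed é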

If you are on Linux, you can use uconv to convert from one Unicode form to the other.
To convert a text file from UTF-8 NFD to UTF-8 NFC:
uconv -f utf8 -t utf8 -x NFC $in_file -o $out_file
Since you mentioned copy/pasting, you could also do it directly with an alias. Assuming your Linux is using UTF-8 by default:
alias to_nfc='xclip -o -selection clipboard | uconv -f utf8 -t utf8 -x NFC | xclip -selection clipboard'
Then you only have to type to_nfc in a terminal to have your clipboard converted.
On Debian-based systems (Ubuntu, etc.), uconv is in the package icu-devtools:
sudo apt install icu-devtools
On CentOS: yum install icu
On Mac (or Windows if you have Perl installed) you could try
perl -C -MUnicode::Normalize -pe '$_=NFC($_)' < $in_file > $out_file
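If Perl is not available but Python 3 is, here is a roughly equivalent sketch (the script name nfc.py and the argument handling are my own, not part of the original answer):
# nfc.py -- usage: python3 nfc.py in_file out_file
import sys
import unicodedata

in_file, out_file = sys.argv[1], sys.argv[2]

# Read the decomposed (NFD) UTF-8 text
with open(in_file, encoding='utf-8') as src:
    text = src.read()

# Write it back out composed (NFC)
with open(out_file, 'w', encoding='utf-8') as dst:
    dst.write(unicodedata.normalize('NFC', text))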

Related

decode html entities to chinese characters

My curl output shows the characters as (on a Linux terminal)
기타/부재시 집 앞에 놓고가셔&#46104
but I need the output with the actual characters, like
기타/부재시 집 앞에 놓고가셔되
How can I convert these HTML entities to characters on the terminal?
Please note that I do not have PHP installed on my machine, so I cannot use html_entity_decode or other PHP decode methods.
I have perl and python installed on my machine.
Just pipe the output through this simple Perl substitution:
perl -CO -pe 's/&#(\d+);/chr $1/ge'
-p reads the input line by line and prints each after processing
-CO turns on UTF-8 encoding of the output
/e evaluates the replacement part of the s/// substitution as code
chr just returns the character of the given number in the character set.
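Since the question mentions that Python is also installed, a hedged alternative: Python 3's standard html module decodes the same numeric references, so you could pipe the curl output through this one-liner instead (my own sketch, not from the original answer; it assumes a UTF-8 terminal):
python3 -c 'import sys, html; sys.stdout.write(html.unescape(sys.stdin.read()))'
html.unescape handles decimal references such as &#46104 as well as named entities, and it should also cope with references that are missing the trailing semicolon, like the one in the question.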

How to port app from Borland Pascal to FreePascal and Unicode terminal

I am trying to port the first app I ever wrote from old Borland Pascal to FreePascal and run it in a Linux Unicode shell.
Unfortunately, the app uses the CRT unit and writes non-standard ASCII graphical characters. So I tried to rewrite statements like these:
gotoxy(2,3); write(#204);
writeln('3. Intro');
to these:
gotoxy(2,3); write('╠');
write('3. Intro', #10);
Two notes:
I use Unicode characters directly in the code because I did not find out how to write Unicode characters via their code points.
I used the write procedure instead of writeln to make sure that Unix line endings are produced.
But after replacing all non-standard ASCII characters and getting rid of all writeln statements, it became even worse.
(The original question included before-and-after screenshots of the terminal output here.)
Why does it end up like this? What can I do better?
After some time, here is an update on what I found out.
1) I cannot port it
As user dmsc rightly pointed out, CRT does not support UTF-8. His suggested hack did not work for me.
2) When you can't port it, emulate the environment.
The graphical characters I needed were part of CP437. There is a program called luit that is made for converting application output from the locale's encoding into UTF-8. Unfortunately, this did not work for me. It simply erased the characters:
# Via iconv, everything is OK:
$ printf "top right corner in CP437: \xbf \n" | iconv -f CP437 -t UTF-8
top right corner in CP437: ┐
# But not via luit, which simply omits the character:
$ luit -gr g2 -g2 'CP 437' printf "top right corner in CP437: \xbf \n"
top right corner in CP437:
So my solution is to run gnome-terminal, add and set the Hebrew (IBM862) encoding (tutorial here), and enjoy your app!
The CRT unit does not currently work with UTF-8, as it assumes that each character on the screen is exactly one byte; see http://www.freepascal.org/docs-html-3.0.0/rtl/crt/index.html
But simple applications can be made to work by "tricking" GotoXY into always doing a full cursor positioning, by doing:
GotoXY(1,1);
GotoXY(x, y);
To replace all the strings in your source file, you can use recode; in a terminal, type:
recode cp437..u8 < original.pas > fixed.pas
Then you need to replace all the numeric characters (like your #204 example) with the equivalent UTF-8; to find the replacement, you can use:
echo -e '\xCC' | recode cp437/..u8
The 'CC' is hexadecimal for 204, and as a result the character '╠' will be printed.
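If you would rather look up those equivalents programmatically, Python's built-in cp437 codec gives the same mapping (a small sketch of my own, not part of the original answer):
# Decode a few CP437 codes used for box drawing; 204 is the #204 from the question
for code in (204, 186, 205):
    print(code, bytes([code]).decode('cp437'))   # prints: 204 ╠, 186 ║, 205 ═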

MATLAB: Read HTML-Codes (within XML)

I'm trying to read the following XML-file of a Polish treebank using MATLAB: http://zil.ipipan.waw.pl/Sk%C5%82adnica?action=AttachFile&do=view&target=Sk%C5%82adnica-frazowa-0.5-TigerXML.xml.gz
Polish letters seem to be encoded as HTML-codes: http://webdesign.about.com/od/localization/l/blhtmlcodes-pl.htm
For instance, &#322; stands for 'ł'. If I open the treebank using 'UTF-8', I get words like k&#322;ania&#322;, which should actually be displayed as 'kłaniał'.
Now, I see 2 options to read the treebank correctly:
Directly read the XML-file such that HTML-codes are transformed into the corresponding characters.
First save the words in non-decoded format (e.g. as k&#322;ania&#322;) and then transform the characters afterwards.
Is it possible to do one of the 2 options (or both) in MATLAB?
A non-MATLAB solution is to preprocess the file through some external utility. For instance, with Ruby installed, one could use the HTMLentities gem to unescape all the special characters.
sudo gem install htmlentities
Let file.xml be the file to convert, which should consist of ASCII-only characters. The Ruby code to convert it could look like this:
#!/usr/bin/env ruby
require 'htmlentities'
xml = File.open("file.xml").read
converted_xml = HTMLEntities.new.decode xml
IO.write "decoded_file.xml", xml
(To run the file, don't forget to chmod +x it to make it executable).
Or more compactly, as a one-liner
ruby -e "require 'htmlentities';IO.write(\"decoded_file.xml\",HTMLEntities.new.decode(File.open(\"file.xml\").read))"
You could then postprocess the xml however you wish.
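For completeness, the same preprocessing can be done with the Python standard library alone (a sketch that mirrors the Ruby script above; the filenames are the same placeholders):
#!/usr/bin/env python3
# Decode all HTML character references in file.xml using only the standard library
import html

with open('file.xml', encoding='ascii') as src:
    xml = src.read()

with open('decoded_file.xml', 'w', encoding='utf-8') as dst:
    dst.write(html.unescape(xml))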

what type of encoding is this "Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators"?

I am trying to export the output of an MS SQL query, which according to the description should use UTF (UTF-16, I suppose) encoding; I am using the -W and -u
options with sqlcmd.
By default, Ä is converted to a character that looks like a z with two dots (or something like an inverted ^) on it, and the file is listed as using the ANSI character set.
When I try to use Notepad++ to convert this file to UTF-8, it shows me some strange highlighted characters (x8E) for Ä, and others such as x86 and x94 for other characters, no matter what encoding I use as the default in Notepad++.
When I transferred the file to an Ubuntu 12.04 machine, the file command says:
user@user:~/Desktop/encoding/checkencoding$ file convertit4.csv
convertit4.csv: Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators
user@user:~/Desktop/encoding/checkencoding$ chardet convertit4.csv
convertit4.csv: ISO-8859-2 (confidence: 0.77)
I am confused about what kind of encoding it uses.
The purpose is to convert it to UTF-8 encoding without any errors, in order to upload it to the Magmi importer.
Note: I am using this command to remove the underline after the headers: type c:\outfiles\convertit1.temp | findstr /r /v "^\-[;\-]*$" > c:\outfiles\convertit4.csv (I hope this command is not the problem).
I hope this information is complete enough to solve the issue; if any more information is needed, please let me know.
Regards.
Try the -f option (which lets you specify the code page sqlcmd uses), as described in
http://www.yaldex.com/sql_server_tutorial_3/ch06lev1sec1.html
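If the export turns out to be in an OEM code page rather than UTF-16 (the x8E byte for Ä is consistent with CP437 or CP850, but that is only a guess), a small Python sketch could re-encode it to UTF-8; the output filename here is made up:
# Assumption: the export is CP850 (or CP437) text; 0x8E maps to Ä in both.
# newline='' keeps the CRLF line endings untouched.
with open('convertit4.csv', encoding='cp850', newline='') as src:
    text = src.read()
with open('convertit4_utf8.csv', 'w', encoding='utf-8', newline='') as dst:
    dst.write(text)
If chardet's ISO-8859-2 guess is right instead, only the encoding name in the sketch needs to change.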

CL-JSON encodes Unicode chars by outputting their Unicode escape string in ASCII format. How can I override this?

I am using CL-JSON to encode an object. It writes the encoded string out in ASCII, and the non-ASCII characters are written as sequences of ASCII characters in "\uXXXX" form. The result is that even if I open the output file stream with external format :utf-8, the file contains only ASCII characters. When I try to view it with, for example, Notepad++, I cannot convert it to Unicode because all the data is now plain ASCII (even the "\uXXXX" sequences). I would like to know either whether there is an editor that will automatically convert the file to Unicode and recognize those escape sequences, or whether there is a way to tell CL-JSON to keep the output characters as Unicode. Any ideas?
EDIT: here is some more info:
CL-USER>(with-open-file (out "dump.json"
:direction :output
:if-does-not-exist :create
:if-exists :overwrite
:external-format :utf-8)
(json:encode-json '("abcd" "αβγδ") out)
(format out "~%"))
CL-USER>(quit)
bash$ file dump.json
dump.json: ASCII text
bash$ cat dump.json
["abcd","\u03B1\u03B2\u03B3\u03B4"]
bash$ uname -a
Linux suse-server 3.0.38-0.5-default #1 SMP Fri Aug 3 09:02:17 UTC 2012 (358029e) x86_64 x86_64 x86_64 GNU/Linux
bash$ sbcl --version
SBCL 1.0.50
bash$
EDIT2:
YASON does what I need, outputting characters without escaping them in \uXXXX format, but unfortunately it lacks other features that I need, so it is not an option.
I know this is a temporary solution, but I changed the CL-JSON source by redefining the appropriate function so that it does not Unicode-escape the ranges outside ASCII. The function is named write-json-chars and it resides in the file encoder.lisp in the sources.