CL-JSON encodes Unicode chars by outputting their Unicode escape string in ASCII format. How can I override this?

I am using CL-JSON to encode an object. It emits the encoded string in ASCII, with the non-ASCII characters written out as sequences of ASCII characters in "\uXXXX" form. As a result, even if I open the output file stream with external format :utf-8, the file contains only ASCII characters. When I try to view it with, for example, Notepad++, I cannot convert it to Unicode, because all the data is now plain ASCII (including the "\uXXXX" sequences). I would like to know either whether there is an editor that will recognize those escape sequences and convert the file to Unicode automatically, or whether there is a way to tell CL-JSON to keep the output characters in Unicode. Any ideas?
EDIT: here is some more info:
CL-USER> (with-open-file (out "dump.json"
                              :direction :output
                              :if-does-not-exist :create
                              :if-exists :overwrite
                              :external-format :utf-8)
           (json:encode-json '("abcd" "αβγδ") out)
           (format out "~%"))
CL-USER> (quit)
bash$ file dump.json
dump.json: ASCII text
bash$ cat dump.json
["abcd","\u03B1\u03B2\u03B3\u03B4"]
bash$ uname -a
Linux suse-server 3.0.38-0.5-default #1 SMP Fri Aug 3 09:02:17 UTC 2012 (358029e) x86_64 x86_64 x86_64 GNU/Linux
bash$ sbcl --version
SBCL 1.0.50
bash$
EDIT2:
YASON does what I need here, outputting characters without escaping them in \uXXXX form, but unfortunately it lacks other features that I need, so it is not an option.

I know this is only a temporary solution, but I changed the CL-JSON source, redefining the relevant function so that it does not \u-escape characters outside the ASCII range. The function is named write-json-chars and it lives in encoder.lisp in the sources.
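For reference, here is a minimal sketch in Common Lisp of what such a change might look like. It only escapes the characters JSON actually requires; the real write-json-chars in encoder.lisp handles more cases, and its exact signature may differ between CL-JSON versions, so treat this as an illustration rather than a drop-in patch.

;; Hypothetical variant: escape only the double quote, the backslash and
;; control characters below #x20; write everything else, including
;; non-ASCII characters, straight to the (UTF-8) output stream.
(defun write-json-chars (s stream)
  (loop for ch across s
        for code = (char-code ch)
        do (cond ((char= ch #\") (write-string "\\\"" stream))
                 ((char= ch #\\) (write-string "\\\\" stream))
                 ((< code #x20) (format stream "\\u~4,'0X" code))
                 (t (write-char ch stream)))))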

Related

Apache NiFi - All the spanish characters (ñ, á, í, ó, ú) in CSV changed to question mark (?) in JSON

I've fetched a CSV file using the GetFile processor; the CSV has Spanish characters (ñ, á, í, ó, ú and more) within the English words.
When I try to use the ConvertRecord processor with a JSONRecordSetWriter controller service, the JSON output contains question marks instead of the special characters.
What is the correct way to convert CSV records into JSON format with proper encoding?
Any response/feedback will be much appreciated.
Note: CSV File is UTF-8 encoded and fetched and read properly in NiFi.
If you have verified that the input is UTF-8, try this:
Open $NIFI/conf/bootstrap.conf
Add -Dfile.encoding=UTF-8 very early in the list of JVM arguments to force the JVM not to use the OS's settings. This has mainly been a problem in the past with the JVM on Windows.
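In bootstrap.conf the JVM arguments are the numbered java.arg.* entries, so the addition might look roughly like this (the number is only a placeholder, pick one not already used in your file, and restart NiFi afterwards):
# force the JVM to use UTF-8 instead of the operating system's default encoding
java.arg.57=-Dfile.encoding=UTF-8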

decode html entities to chinese characters

I need some help. My curl output shows the Chinese characters like this (on a Linux terminal):
기타/부재시 집 앞에 놓고가셔&#46104
I need the output in Chinese characters, like (기타/부재시 집 앞에 놓고가셔되).
Or, put another way: how can I convert these HTML entities to Chinese characters on the terminal?
Please note I do not have PHP installed on my machine, so I cannot use html_entity_decode or other PHP decode functions.
I have perl and python installed on my machine.
Just pipe the output through this simple Perl substitution:
perl -CO -pe 's/&#(\d+);/chr $1/ge'
-p reads the input line by line and prints each after processing
-CO turns on UTF-8 encoding of the output
/e evaluates the replacement part of the s/// substitution as code
chr just returns the character of the given number in the character set.
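Since you mention Python is installed as well, a roughly equivalent approach is to pipe the curl output through Python 3's html module (html.unescape also handles numeric character references; this assumes your terminal uses UTF-8):
python3 -c 'import sys, html; sys.stdout.write(html.unescape(sys.stdin.read()))'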

How to convert Combining Diacritical Marks to single grapheme

I have a PDF document with text I want to copy and paste into an HTML document.
The problem is that all accented characters are actually made of combining diacritical marks instead of single Unicode code points.
So for instance, é, which would be represented by the single Unicode code point U+00E9, is here encoded as two separate characters: an e followed by the combining acute accent U+0301.
This isn't very easy to deal with, especially since some browsers (Firefox) display a whitespace after the accented letter whereas some others (Chrome) do not.
Hence, is there a way to automatically convert those pesky characters into friendly single Unicode code point characters?
You want to normalize the string to one of the composed forms, NFC or NFKC (the difference is that NFKC also removes some formatting distinctions, such as ligatures).
I'm not sure what exactly you mean by 'automatically', but I believe it is possible in most languages, for example :
Python : unicodedata.normalize():
import unicodedata
original_string = '\u0065\u0301'
normalized_string = unicodedata.normalize('NFC', original_string)
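As a quick check that the normalization did something (this print is just an illustration added here):
print(len(original_string), len(normalized_string))  # prints "2 1": two code points collapsed into one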
Java / Android : Normalizer.normalize()
String normalizedText = Normalizer.normalize(originalString, Normalizer.Form.NFC);
.Net : String.Normalize()
string normalizedText = originalString.Normalize(NormalizationForm.FormC);
Oracle SQL COMPOSE()
SELECT COMPOSE(UNISTR('\0065\0301')) FROM DUAL;
Online test : Dencode (NFD and NFKD are called Decoded NFC, Decoded NFKC )
If you are on Linux, you can use uconv to convert from one Unicode form to the other.
To convert a text file from UTF-8 NFD to UTF-8 NFC :
uconv -f utf8 -t utf8 -x NFC $in_file -o $out_file
Since you mentioned copy/pasting, you could also do it directly with an alias. Assuming your Linux is using UTF-8 by default :
alias to_nfc='xclip -o -selection clipboard | uconv -f utf8 -t utf8 -x NFC | xclip -selection clipboard'
Then you only have to type to_nfc in a terminal to have your clipboard converted.
On Debian-based systems (Ubuntu, etc.), uconv is in package icu-devtools :
sudo apt install icu-devtools
On Centos: yum install icu
On Mac (or Windows if you have Perl installed) you could try
perl -C -MUnicode::Normalize -pe '$_=NFC($_)' < $in_file > $out_file

MATLAB: Read HTML-Codes (within XML)

I'm trying to read the following XML-file of a Polish treebank using MATLAB: http://zil.ipipan.waw.pl/Sk%C5%82adnica?action=AttachFile&do=view&target=Sk%C5%82adnica-frazowa-0.5-TigerXML.xml.gz
Polish letters seem to be encoded as HTML-codes: http://webdesign.about.com/od/localization/l/blhtmlcodes-pl.htm
For instance, &#322; stands for 'ł'. If I open the treebank using 'UTF-8', I get words like k&#322;ania&#322;, which should actually be displayed as 'kłaniał'.
Now, I see 2 options to read the treebank correctly:
Directly read the XML-file such that HTML-codes are transformed into the corresponding characters.
First save the words in non-decoded format (e.g. as k&#322;ania&#322;) and then transform the characters afterwards.
Is it possible to do one of the 2 options (or both) in MATLAB?
A non-MATLAB solution is to preprocess the file through some external utility. For instance, with Ruby installed, one could use the HTMLentities gem to unescape all the special characters.
sudo gem install htmlentities
Let file.xml be the name of the input file, which should consist of ASCII-only characters. The Ruby code to convert the file could look like this:
#!/usr/bin/env ruby
require 'htmlentities'
xml = File.open("file.xml").read
converted_xml = HTMLEntities.new.decode xml
IO.write "decoded_file.xml", converted_xml
(To run the file, don't forget to chmod +x it to make it executable).
Or more compactly, as a one-liner
ruby -e "require 'htmlentities';IO.write(\"decoded_file.xml\",HTMLEntities.new.decode(File.open(\"file.xml\").read))"
You could then postprocess the xml however you wish.
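If installing a Ruby gem is not convenient, Python 3's html module can perform the same preprocessing step; the sketch below assumes, as above, that file.xml contains only ASCII characters:
python3 -c "import html; open('decoded_file.xml', 'w', encoding='utf-8').write(html.unescape(open('file.xml', encoding='utf-8').read()))"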

what type of encoding is this "Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators"?

I am trying to export the output of an MS SQL query, which according to the description must use UTF encoding (UTF-16, I suppose), so I am using the -W and -u options with sqlcmd.
By default, Ä is converted to a z with two dots or something like an inverted ^ above it, and the editor lists the character set as ANSI.
When I try to use Notepad++ to convert this file to UTF-8, it shows some strange highlighted characters (x8E) for Ä, and others such as x86 and x94 for other characters, no matter which encoding I use as the default in Notepad++.
When I transferred the file to an Ubuntu 12.04 machine, the file and chardet commands report:
user@user:~/Desktop/encoding/checkencoding$ file convertit4.csv
convertit4.csv: Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators
user@user:~/Desktop/encoding/checkencoding$ chardet convertit4.csv
convertit4.csv: ISO-8859-2 (confidence: 0.77)
I am confused about what kind of encoding it actually uses.
The purpose is to convert it to UTF-8 without any errors so it can be uploaded to the Magmi importer.
Note: I am using this command to remove the underline after the headers: type c:\outfiles\convertit1.temp | findstr /r /v "^\-[;\-]*$" > c:\outfiles\convertit4.csv. I hope this line is not the problem.
I hope this information is enough to solve the issue; if any more information is needed, please let me know.
Regards.
Try the -f option as in
http://www.yaldex.com/sql_server_tutorial_3/ch06lev1sec1.html
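A sketch of what that could look like (everything here is a placeholder: server, database, query and output path; -s ";" assumes a semicolon column separator to match the findstr pattern above, and 65001 is the Windows code page number for UTF-8, so check the exact -f syntax in the documentation for your sqlcmd version):
sqlcmd -S myserver -d mydb -Q "SELECT * FROM mytable" -W -s ";" -f 65001 -o c:\outfiles\convertit1.temp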