JSON encoding in Perl output

Context:
I have to migrate a Perl script to Python. The problem is that the configuration files this Perl script uses are actually valid Perl code. My Python version uses .yaml files for configuration.
Therefore, I basically had to write a converter between Perl and YAML. From what I found, Perl does not play well with YAML, but there are libraries for dumping Perl hashes to JSON, and Python handles JSON almost natively, so I used JSON as an intermediate format: Perl -> JSON -> YAML. The first conversion is done in Perl, and the second in Python (which also does some mangling of the data).
Using the library mentioned by @simbabque, I can output YAML natively, which I afterwards must modify and play with. As I know next to nothing about Perl, I prefer to do that part in Python.
Problem:
The source config files look something like this:
$sites = {
    "0100101001" => {
        mail    => 1,
        from    => 'mail@mail.com',
        to      => 'mail@mail.com',
        subject => 'á é í ó ú',
        msg     => 'á é í ó ú',
        ftp     => 0,
        sftp    => 0,
    },
    "22222222" => {
    [...]
And many more of those.
My "parsing" code is the following:
use strict;
use warnings;
# use JSON;
use YAML;
use utf8;
use Encode;
use Getopt::Long;

my $conf;
GetOptions('conf=s' => \$conf) or die;

our ($sites);

do $conf;

# my $json = encode_json($sites);
my $yaml = Dump($sites);

binmode(STDOUT, ':encoding(utf8)');
# print($json);
print($yaml);
Nothing out of the ordinary. I simply need the YAML version of the Perl data (the commented-out lines are left over from the earlier JSON intermediate). In fact, it mostly works. My problem is with the encoding.
The output of the above code is this:
[...snip...]
mail: 1
msg: á é í ó ú
sftp: 0
subject: á é í ó ú
[...snip...]
The encoding goes to hell and back. As far as I have read, UTF-8 is the default, and just in case I force it with binmode, but to no avail.
What am I missing here? Any workaround?
Note: I thought it may have been my shell, but locale outputs this:
❯ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Which seems ok.
Note 2: I know next to nothing about Perl, and it is not my intent to become an expert in it, so any enhancements/tips are greatly appreciated too.
Note 3: I read this answer, and my code is loosely based on it. The main difference is that I'm not sure how to encode a file, instead of a simple string.

The sites config file is UTF-8 encoded. Here are three workarounds:
Put a use utf8 pragma inside the sites configuration file. The use utf8 pragma in the main script is not sufficient to make files included with do/require be treated as UTF-8 encoded.
If that is not feasible, decode the input before you pass it to the JSON encoder. Something like

open CFG, '<:encoding(UTF-8)', $conf or die "Can't open $conf: $!";
do { local $/; eval <CFG> };
close CFG;

instead of

do $conf;

Use JSON::to_json instead of JSON::encode_json. encode_json expects decoded input (strings of Unicode code points) and its output is UTF-8 encoded. The output of to_json is not encoded, or rather, it has the same encoding as the input, which is what you want here.
There is no need to encode the final output as UTF-8 on top of that. Any of the three workarounds will already produce UTF-8 encoded output.
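Putting the pieces together, a minimal sketch of the question's script with the second workaround applied (untested; it assumes YAML::Dump, like the JSON encoder, should be fed decoded strings, which is why the output is encoded explicitly at the end):

use strict;
use warnings;
use YAML;
use Getopt::Long;

my $conf;
GetOptions('conf=s' => \$conf) or die "usage: $0 --conf FILE\n";

our $sites;

# Read and decode the config file instead of `do $conf`, so the
# eval'ed code produces strings of Unicode code points.
open my $cfg, '<:encoding(UTF-8)', $conf or die "Can't open $conf: $!";
my $code = do { local $/; <$cfg> };
close $cfg;

eval $code;
die "Error in $conf: $@" if $@;

# Dump now returns a character string; encode it on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print Dump($sites);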

Related

Dump Chinese data into a json file

I am running into a problem while dumping Chinese (non-Latin) data into a JSON file.
I am trying to store a list in a JSON file with the following code:

with open("file_name.json", "w", encoding="utf8") as file:
    json.dump(edits, file)

It dumps without any errors.
But when I view the file, it looks like this:

[{"sentence": "\u5979\u7d30\u5c0f\u8072\u5c0d\u6211\u8aaa\uff1a\u300c\u6211\u501f\u4f60\u4e00\u679d\u925b\u7b46\u3002\u300d"}...]

I also tried it without the encoding option:

with open("file_name.json", "w") as file:
    json.dump(edits, file)

My question is: why does my JSON file look like this, and how can I dump it so it contains the Chinese strings instead of Unicode escapes?
Any help would be appreciated. Thanks :)
Check out the docs for json.dump.
Specifically, it has a switch, ensure_ascii, which, if set to False, makes the function output the characters unescaped.
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
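A minimal sketch of the fix; the sample list reproduces the sentence from the escaped output above:

import json

# Sample data matching the question's escaped output.
edits = [{"sentence": "她細小聲對我說：「我借你一枝鉛筆。」"}]

# ensure_ascii=False writes non-ASCII characters as-is instead of
# \uXXXX escapes; keep the explicit UTF-8 encoding on the file.
with open("file_name.json", "w", encoding="utf8") as file:
    json.dump(edits, file, ensure_ascii=False)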

Perl: Why do i need to set the latin1 flag explicitly since JSON 2.xx?

Since JSON 2.xx I need to set the latin1 flag in order to get umlauts safely into the HTML document:

my $obj_with_umlauts = {
    title => 'geändert',
};
my $json = JSON->new()->latin1(1)->encode($obj_with_umlauts);

This was not necessary with JSON 1.xx:

my $json = JSON->new()->objToJson($obj_with_umlauts);

The HTML document is in iso-8859-1 (meta tag).
Can anybody explain to me why?
This is such a huge can of worms that you're opening here.
I suspect that the answer is something along the lines of "a bug was fixed in the character handling of JSON.pm". But it's hard to know what is going on without a lot more information about your situation.
How is $string_with_umlauts being set? How are you encoding the data that you write to the HTML document?
Do you want to handle utf8 data correctly (you really should) or are you happy assuming that you live in a Latin1 world?
It's important to realise that if you completely ignore Unicode considerations then it can often seem that your programs are working correctly as errors often cancel each other out. When you start to address Unicode issues, it can seem that your programs are getting worse until you address all of the issues.
The Perl Unicode Tutorial is a good place to start learning about these things.
P.S. It's "Perl", not "PERL".
What are you talking about?
$ perl -MJSON -E'
    say $JSON::VERSION;
    my $json = JSON->new()->objToJson(["\xE4"]);
    say sprintf "%v02X", $json;
'
1.15
5B.22.E4.22.5D   # Unicode code points for ["ä"]

$ perl -MJSON -E'
    say $JSON::VERSION;
    my $json = JSON->new()->encode(["\xE4"]);
    say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D   # Unicode code points for ["ä"]
Those two strings are identical! In fact, adding ->latin1() doesn't change anything because the iso-8859-1 encoding of Unicode code point U+00E4 is E4.
$ perl -MJSON -E'
    say $JSON::VERSION;
    my $json = JSON->new()->latin1()->encode(["\xE4"]);
    say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D   # iso-8859-1 encoding of ["ä"]
There is one difference between the last two: it's stored differently in the scalar. That should make absolutely no difference. If code treats them differently, then that code is incorrectly reading the data in the scalar, and that code is buggy.
$string_with_umlauts definitely is a string in winLatin
Well, that's error number one.
JSON expects strings of decoded text (strings of Unicode code points), not encoded text.
That said, there happens to be no difference between a string encoded using iso-8859-1 and a string of Unicode code points. For example, when encoded using iso-8859-1, "ä" is byte E4, and it's Unicode code point U+00E4, two different notation for the same number.
If the string is encoded using cp1252, though, you'll have problems with characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ (the characters in cp1252 but not in iso-8859-1). For example, when encoded using cp1252, "€" is byte 80, but it's Unicode code point U+20AC. 0x80 != 0x20AC.
The html document is in iso-8859-1 (meta-tag).
Then at some point, you'll have to encode the output into iso-8859-1. You can do it using an :encoding layer, using Encode's encode, or using JSON's ->latin1 directive. The advantage of the final option is that it causes JSON to escape any character outside of the iso-8859-1 character set before attempting to encode it.
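A minimal sketch of those three options (illustrative only; it assumes the script itself is saved as UTF-8, hence the use utf8):

use strict;
use warnings;
use utf8;    # script saved as UTF-8, so 'geändert' below is decoded text
use Encode qw(encode);
use JSON;

my $data = { title => 'geändert' };

# Option 1: an :encoding layer on the output handle.
binmode STDOUT, ':encoding(iso-8859-1)';
print JSON->new->encode($data);

# Option 2: encode explicitly with Encode.
my $bytes = encode('iso-8859-1', JSON->new->encode($data));

# Option 3: JSON's ->latin1 encodes directly and escapes any character
# that does not fit in iso-8859-1, so it cannot fail to encode.
my $latin1 = JSON->new->latin1->encode($data);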
Can anybody explain to me why?
You have code (an XS module) that reads the underlying string buffer of the scalar and incorrectly treats that as the content of the string. There is a bug in that module.

Scrambled umlauts since upgrade from JSON1 to JSON2 in Perl

I wondered why some German umlauts were scrambled on our page.
Then I found out that the recent version of JSON (I use 2.07) converts strings in a different manner than JSON 1.5.
The problem is that I have a hash with strings like this:

use Data::Dumper;
my $test = {
    'fields' => 'überrascht'
};

print Dumper(to_json($test)); gives me

$VAR1 = "{ \"fields\" : \"\x{fc}berrascht\" } ";

Using the old module,

$json = JSON->new();
print Dumper($json->to_json($test));

gives me (the correct result)

$VAR1 = '{"fields":[{"title":"überrascht"}]}';

So umlauts are scrambled using the new JSON 2 module.
What do I need to do to get them correct?
Update: It might be bad to use Data::Dumper to show the output, because Dumper uses its own encoding. Still, a difference in the result from Dumper shows that something is treated differently here. It might be better to describe the backend, as Brad mentioned:
The JSON string gets printed using Template Toolkit and then gets assigned to a JavaScript variable for further use. The correct JavaScript shows something like this:

{
    "title" : "Geändert",
},

using the new module I get

{
    "title" : "GeÃ¤ndert",
},

The target page is in 8859-1 (latin1).
Any suggestions?
\x{fc} is ü, at least in Latin-1, Latin-9 etc. Also, ü is codepoint U+00FC in Unicode. However, we want UTF-8 (I suppose). The easiest solution to get UTF-8 string literals is to save your Perl source code with this encoding, and put a use utf8; at the top of your script.
Then, encoding the string as JSON yields correct output:
use strict; use warnings; use utf8;
use Data::Dumper; use JSON;
print Dumper encode_json {fields => "nicht überrascht"};
The encode_json assumes UTF-8. Read the documentation for more info.
Output:
$VAR1 = '{"fields":"nicht überrascht"}';
(JSON module version: 2.53)
my $json_text = to_json($data);
is short for
my $json_text = JSON->new->encode($data);
This returns a string of Unicode code points. U+00FC is indeed the correct Unicode code point for "ü", so the output is correct. (As proof, the HTML source for that is actually &#xFC;.)
It's hard to tell what your original output actually contained (since you showed non-ASCII characters), so it's hard to determine what your problem actually is.
But one thing you must do before outputing the string is to convert it from a string of code points into bytes, say, by using Encode's encode or encode_utf8.
my $json_cp1252 = encode('cp1252', to_json($data));
my $json_utf8 = encode_utf8(to_json($data));
If the appropriate encoding is UTF-8, you can also use any of the following:
my $json_utf8 = to_json($data, { utf8 => 1 });
my $json_utf8 = encode_json($data);
my $json_utf8 = JSON->new->utf8->encode($data);
Use encode_json instead. According to the manual it converts the given Perl data structure to a UTF-8 encoded, binary string.
Regarding your update: If you actually want to produce JSON in Latin1 (ISO-8859-1), you can try:
to_json($test, { latin1 => 1 })
Or
JSON->new->latin1->encode($test)
Note that if you dump the result, getting \x{fc} for ü is correct in this case. I guess that the root of your problem is that you receive text in Perl's UTF-8 format from somewhere. In this case, the latin1 option of the JSON module is needed.
You can also try to use ascii instead of latin1 as the safest option.
Another solution might be to specify an output encoding for Template-Toolkit. I don't know if that's possible. Or, you could encode your result as Latin1 in the final step before sending it to the client.
Strictly speaking, Latin-1-encoded JSON is not valid JSON. The JSON spec allows UTF-8, UTF-16 or UTF-32 encodings.
If you want to be standards-compliant, or you want to ensure your JSON will be compatible with both your current pages and future UTF-8-based pages, you need to use JSON->new->utf8->encode($str). Being strict about generating valid JSON could save you lots of headaches in the future.
You can translate UTF-8 JSON to Latin-1 using client-side Javascript if you need to, using this trick.
The ascii option also produces valid JSON, by escaping any non-ASCII characters using valid JSON Unicode escapes. But the latin1 option does not, and therefore should be avoided IMHO. The utf8(0) option should be avoided too unless you specify an encoding when writing the data out to clients: utf8(0) is subtly different from the utf8 option in that it generates Perl character strings instead of byte strings. If you do any I/O using character strings without specifying an encoding, Perl will translate it on-the-fly back to Latin-1. The utf8 option generates raw UTF-8 bytes, which are perfect for doing raw I/O.
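For comparison, a minimal sketch (not from the thread) contrasting the three flags, using a fresh JSON object each time since the flags are sticky once set:

use strict;
use warnings;
use utf8;    # the source file is saved as UTF-8, so 'Geändert' is decoded text
use JSON;

my $data = { title => 'Geändert' };

# ascii: escapes everything non-ASCII; valid JSON, safe in any page encoding.
print JSON->new->ascii->encode($data), "\n";   # {"title":"Ge\u00e4ndert"}

# latin1: iso-8859-1 bytes; works on a Latin-1 page but is not a
# JSON-spec encoding.
print JSON->new->latin1->encode($data), "\n";

# utf8: raw UTF-8 bytes, suitable for raw I/O to a UTF-8 page.
print JSON->new->utf8->encode($data), "\n";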

Data Type of Module Output

I have a script that I run on various texts to convert XHTML entities (e.g., &uuml;) to ASCII. For example, my script is written in the following manner:

open (INPUT, '+<file') || die "File doesn't exist! $!";
open (OUTPUT, '>file') || die "Can't find file! $!";

while (<INPUT>) {
    s/&uuml;/ü/g;
    print OUTPUT $_;
}

This works as expected and substitutes the XHTML entity with the ASCII equivalent. However, since this is run often, I've attempted to convert it into a module. But Perl doesn't return "ü"; it returns the decomposition. How can I get Perl to return the data with the ASCII equivalent (as run and printed in my regular .pl file)?
There is no ASCII. Not in practice anyway, and certainly not outside the US. I suggest you specify an encoding that will have all characters you might encounter (ASCII does not contain ü, it is only a 7-bit encoding!). Latin-1 is possible, but still suboptimal, so you should use Unicode, preferably UTF-8.
If you don't want to output Unicode, your Perl script should at least be encoded as UTF-8. To signal this to the perl interpreter, put use utf8; at the top of your script.
Then open the input file with an encoding layer, like this:

open my $fh, "<:encoding(UTF-8)", $filename;

The same goes for the output file. Just make sure to specify an encoding when you want to use one.
You can change the encoding of an already-open filehandle with binmode; see the documentation.
You can also use the Encode module to translate a byte string to Unicode and vice versa. See this excellent question for further information about using Unicode with Perl.
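A minimal sketch of both approaches (the file names and sample bytes are illustrative):

use strict;
use warnings;
use utf8;                       # the script itself is saved as UTF-8
use Encode qw(decode encode);

# Approach 1: encoding layers decode on read and encode on write.
open my $in,  '<:encoding(UTF-8)', 'in.txt'  or die "in.txt: $!";
open my $out, '>:encoding(UTF-8)', 'out.txt' or die "out.txt: $!";
while (my $line = <$in>) {
    $line =~ s/&uuml;/ü/g;      # s/// operates on decoded text here
    print {$out} $line;
}

# Approach 2: Encode translates between byte and character strings.
my $bytes = "\xC3\xBC";                  # UTF-8 bytes for "ü"
my $chars = decode('UTF-8', $bytes);     # byte string -> code points
my $again = encode('UTF-8', $chars);     # code points -> byte string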
If you want to, you can use the existing HTML::Entities module to handle the entity decoding and just focus in the I/O.

How can I decode HTML entities?

Here's a quick Perl question:
How can I convert HTML special characters like &uuml; or &#39; to normal ASCII text?
I started with something like this:
s/\&#(\d+);/chr($1)/eg;
and I could write something like it for all HTML entities, but a function like this probably already exists?
Note that I don't need a full HTML->Text converter. I already parse the HTML with the HTML::Parser. I just need to convert the text with the special chars I'm getting.
Take a look at HTML::Entities:
use HTML::Entities;
my $html = "Snoopy & Charlie Brown";
print decode_entities($html), "\n";
You can guess the output.
The above answers tell you how to decode the entities into Perl strings, but you also asked how to change those into ASCII.
Assuming that this is really what you want and you don't want all the Unicode characters, you can look at the Text::Unidecode module from CPAN to zap all those odd characters back into a roughly similar collection of ASCII characters:

use Text::Unidecode qw(unidecode);
use HTML::Entities qw(decode_entities);

my $source = '&#x5317;&#x4EB0;';
print unidecode(decode_entities($source));
# That prints: Bei Jing

Note that there are hex-specified characters too. They look like this: &#xE9; (é).
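For the do-it-yourself route, the question's substitution can be extended to cover the hex form as well (a sketch; decode_entities already handles both):

# The question's regex handles decimal entities; a second pass with
# hex() handles the &#x...; form.
my $text = 'caf&#xE9; and caf&#233;';
$text =~ s/&#(\d+);/chr($1)/eg;                  # decimal entities
$text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/eg;   # hex entities
print $text;    # prints: café and café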
Use HTML::Entities' decode_entities to translate the entities into actual characters. To convert that to ASCII requires more work. I've used iconv (Perl interface: Text::Iconv) with the transliterate option on with some success in the past. But if you are dealing with a limited set of entities, or you don't actually need it reduced to ASCII equivalents, you may be better off limiting what decode_entities produces or providing it with custom conversion maps. See the HTML::Entities doc.
There are a handful of predefined HTML entities - &amp; &quot; &gt; and so on - that you could hard code.
However, the larger case of numeric entities - &#123; - is going to be much harder, as those values are Unicode, and conversion to ASCII is going to range from difficult to impossible.
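A sketch of that hard-coded route, reusing the chr trick from the question for the numeric entities (the sample string is illustrative):

# Hard-code the handful of predefined entities; numeric entities
# fall through to chr() as in the question's regex.
my %named = (
    amp  => '&',
    quot => '"',
    apos => "'",
    lt   => '<',
    gt   => '>',
);

my $text = '&quot;Snoopy &amp; Charlie&quot; caf&#233;';
$text =~ s/&(amp|quot|apos|lt|gt);/$named{$1}/g;
$text =~ s/&#(\d+);/chr($1)/eg;
print $text;    # "Snoopy & Charlie" café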
I use this script. Save it as html2utf.py, and use it like echo $some_html | html2utf.py.
#!/usr/bin/env python3
"""
An alternative for `perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)'` (which you can use by `cpanm HTML::Entities`) and `recode html..`.
"""
import fileinput
import html

for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))
I have created a one-liner for bash, using Perl to decode the HTML entities that are passed to perl. My solution is a blend of this answer (see above) and something I found on commandlinefu.com last week.
Most of us who code in Bash aren't in the habit of using echo -n to strip out the \n newline character, since it doesn't usually affect Bash text parsing. With Perl, and with this particular method, it's important to use echo -n, or else perl will interpret the 'newline' \n character as a literal part of the response, adding an unwanted %0A to your results.
Here's my bash-perl one-liner hybrid:
encodedURL="$(echo -n "$entityURL" | perl -MHTML::Entities -MURI::Escape -ne 'print uri_escape(decode_entities($_))')"
Example:
Input: Seals & Croft - Summer Breeze
Output: Seals%20%26%20Croft%20-%20Summer%20Breeze