Perl JSON encode in UTF-8 strange behaviour

Based on the Perl JSON 2.90 documentation, to encode a JSON object in UTF-8 all you need to do is:
$json_text = JSON->new->utf8->encode($perl_scalar)
That is obvious, and that is what I did. After a while, I got an issue report on GitHub from one of my users, which really surprised me, as it shouldn't have been happening!
I spent hours trying to figure out what was going on, but the solution turned out to be very weird and, from my point of view, wrong.
What eventually worked for me is this:
$json_text = JSON->new->latin1->encode($perl_scalar)
After that, I tested this code with all sorts of characters, including Russian and Chinese, and it just worked.
Can anyone please explain why encoding works correctly with latin1 and not with utf8, when it actually should be vice versa?

Two possible bugs could result in the described outcome.
You were passing strings already encoded using UTF-8 to encode.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.C3.A9, you are suffering from this bug.
If you are suffering from this bug, properly decode all inputs to your program, and continue using JSON->new->utf8->encode (aka encode_json).
You were encoding the output of the JSON command using UTF-8 a second time, possibly via a :utf8 or :encoding layer on a file handle.
If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.E9, you are suffering from this bug.
If you are suffering from this bug, either use JSON->new->encode (aka to_json) and keep the second layer of encoding, or use JSON->new->utf8->encode (aka encode_json) and remove the second layer of encoding.
In neither case is the solution to use JSON->new->latin1->encode.
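For concreteness, here is a minimal sketch of what the two fixes look like together; the input bytes and the output file name are invented for illustration:

use strict;
use warnings;
use JSON qw(encode_json);
use Encode qw(decode);

# Pretend these bytes arrived from outside the program (a file, socket, database...).
my $raw_bytes = "install\xC3\xA9";               # "installé" encoded as UTF-8

# Fix for bug 1: decode the input so Perl holds code points, not UTF-8 bytes.
my $text = decode('UTF-8', $raw_bytes);          # sprintf '%vX' now ends in ...E9

# Let encode_json (aka JSON->new->utf8->encode) do the one and only encoding step.
my $json_text = encode_json({ word => $text });

# Fix for bug 2: write the already-encoded bytes through a raw handle,
# with no :utf8 or :encoding layer that would encode them a second time.
open my $fh, '>:raw', 'out.json' or die $!;
print {$fh} $json_text;
close $fh;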

What are you doing to output $json_text? What kind of binmode do you use on that handle? The screenshot looks like it's double-encoded, which suggests the handle has :utf8 or :encoding enabled (which is incorrect for writing already-encoded data to). Unintuitive as it may seem, ->latin1 giving a correct result matches that hypothesis (PerlIO assumes any binary string is encoded as latin-1).
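If you want to keep an :encoding(UTF-8) layer on the handle, the key is to hand it text rather than bytes. A rough sketch (the file name is made up):

use strict;
use warnings;
use JSON;

my $data = { word => "caf\x{E9}" };

# Option A: keep the :encoding layer and give it *text* (code points).
open my $out, '>:encoding(UTF-8)', 'out.json' or die $!;
print {$out} JSON->new->encode($data);           # aka to_json: returns a text string
close $out;

# Option B (equivalent result): drop the layer and let JSON produce the bytes.
# open my $out, '>:raw', 'out.json' or die $!;
# print {$out} JSON->new->utf8->encode($data);   # aka encode_json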

Related

Convert UTF-8 to ANSI in Tcl

proc pub:write { nick host handle channel arg } {
set fid [open /var/www/test.txt w]
puts $fid "█████████████████████████████████████████████████████████████████"
puts $fid "██"
close $fid
}
When I open it in a web browser, the result is:
█████████████████████████████████████████████████████████████████
but it should be:
█████████████████████████████████████████████████████████████████
Welcome to the yawning pit of complexity that is string encodings. You've got to get two things right to make what you're trying to do work. READ EVERYTHING BELOW BEFORE MAKING CHANGES as it all interacts horribly.
The character needs to be written to the file using the right encoding. This is done by configuring the encoding on the channel, which defaults to a system-specific value that is usually but not always right.
I'm taking a very wild guess that an encoding like “cp437 DOSLatinUS” is the right one.
fconfigure $fid -encoding cp437
However, Tcl's usually pretty good at picking the right thing to do by default.
Also, there's a huge number of different encodings. Some are very similar to each other and picking which one to use is a bit of a black art. The usual best bet is to stick with utf8 when possible, and otherwise to use the correct encoding (defined by protocol or by the system) and take a vast amount of care. This is really complicated!
You've also got to get the character into Tcl correctly in the first place. This means that the character has to be encoded in the source file, and Tcl has to read that file with the right encoding. Since the file is being written by another program (your editor usually) there's all sorts of potential for trouble. If you can discover what encoding is being used there (usually a matter of complete guesswork) then you can use the -encoding option to tclsh or source to allow Tcl to figure out what is going on.
Alternatively, stick with the ASCII subset in your source as that's pretty reliably handled the same whatever encoding is in use. You do this by converting each █ to the Tcl escape sequence \u2588. At least like that, you can be sure that you're only hunting down problems with the output encoding.
When debugging this thing, only change one thing at a time before retesting as there's a lot of bits that can go wrong and poison what is going on in ways that produce weird results downstream. I advise trying the escape sequence first as that at least means that you know that the input data is correct; once you know that you're not pushing garbage in, you can try hunting down whether you're actually getting problems with getting garbage out and what to do about it.
Finally, be aware that mixing in networking in this makes the problems about ten times harder…

Perl: Why do I need to set the latin1 flag explicitly since JSON 2.xx?

Since JSON 2.xx, I need to set the latin1 flag in order to get umlauts safely into the HTML document:
my $obj_with_umlauts = {
    title => 'geändert',
};
my $json = JSON->new()->latin1(1)->encode($obj_with_umlauts);
This was not necessary using JSON 1.xx:
my $json = JSON->new()->objToJson($obj_with_umlauts);
The HTML document is in iso-8859-1 (meta tag).
Can anybody explain to me why?
This is such a huge can of worms that you're opening here.
I suspect that the answer is something along the lines of "a bug was fixed in the character handling of JSON.pm". But it's hard to know what is going on without a lot more information about your situation.
How is $string_with_umlauts being set? How are you encoding the data that you write to the HTML document?
Do you want to handle utf8 data correctly (you really should) or are you happy assuming that you live in a Latin1 world?
It's important to realise that if you completely ignore Unicode considerations then it can often seem that your programs are working correctly as errors often cancel each other out. When you start to address Unicode issues, it can seem that your programs are getting worse until you address all of the issues.
The Perl Unicode Tutorial is a good place to start learning about these things.
P.S. It's "Perl", not "PERL".
What are you talking about?
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->objToJson(["\xE4"]);
say sprintf "%v02X", $json;
'
1.15
5B.22.E4.22.5D # Unicode code points for ["ä"]
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # Unicode code points for ["ä"]
Those two strings are identical! In fact, adding ->latin1() doesn't change anything because the iso-8859-1 encoding of Unicode code point U+00E4 is E4.
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->latin1()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # iso-8859-1 encoding of ["ä"]
There is one difference between the last two: it's stored differently in the scalar. That should make absolutely no difference. If code treats them differently, then that code is incorrectly reading the data in the scalar, and that code is buggy.
$string_with_umlauts definitely is a string in winLatin
Well, that's error number one.
JSON expects strings of decoded text (strings of Unicode code points), not encoded text.
That said, there happens to be no difference between a string encoded using iso-8859-1 and a string of Unicode code points. For example, when encoded using iso-8859-1, "ä" is byte E4, and its Unicode code point is U+00E4, two different notations for the same number.
If the string is encoded using cp1252, though, you'll have problems with characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ (the characters in cp1252 but not in iso-8859-1). For example, when encoded using cp1252, "€" is byte 80, but it's Unicode code point U+20AC. 0x80 != 0x20AC.
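You can see the mismatch with a couple of lines of Encode (a small sketch, nothing project-specific):

use Encode qw(decode);

my $bytes = "\x80";                              # "€" encoded using cp1252
my $text  = decode('cp1252', $bytes);            # now a string of code points
printf "U+%04X\n", ord $text;                    # prints U+20AC, not U+0080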
The HTML document is in iso-8859-1 (meta tag).
Then at some point, you'll have to encode the output into iso-8859-1. You can do it using an :encoding layer, or using Encode's encode or using JSON's ->latin1 directive. The advantage of using this final option is that it will cause JSON to escape any character outside of the iso-8859-1 character set before attempting to encode it.
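For instance, a small sketch of the ->latin1 route (the data here is made up): characters that fit in iso-8859-1 come out as single bytes, anything else is escaped.

use JSON;

my $data = { title => "ge\x{E4}ndert", price => "\x{20AC}5" };   # "geändert", "€5"
my $json = JSON->new->latin1->encode($data);

# $json is a byte string: "ä" becomes the single byte E4, while "€" (not in
# iso-8859-1) is escaped as \u20ac, so the output is safe for an iso-8859-1 page.
printf "%v02X\n", $json;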
Can anybody explain to me why?
You have code (an XS module) that reads the underlying string buffer of the scalar and incorrectly treats that as the content of the string. The bug is in that module.

Extended ASCII characters show up as junk in MySQL db when inserted through Perl

I have a MySQL 'articles' table and I am trying to make the following insert using SQLyog.
insert into articles (id,title) values (2356606,'Jérôme_Lejeune');
This works fine and the data shows fine when I do a select query.
The problem is that when I do the same insert query through my perl script, the name shows up with some junk characters in place of é and ô in the database. I need to know how to properly store the name through my script. The part of code that does the insert is like this.
$sql_insert = "insert into articles (id,title) values (?,?)";
$sth_insert = $dbh->prepare($sql_insert);
$sth_insert->execute($id,$title);
$id and $title have the correct required data which I have checked by print before I am inserting them. Please assist.
You have opened up the character encoding can of worms, and you have a lot to learn before you will solve this problem and have it stay solved.
You are probably already used to thinking of how a character of text can be encoded as a string of bits. Under the ASCII encoding, for example, the 8-bit string 01000001 (65) is used to indicate the A character. When you start to think about how many different languages there are and how many different kinds of characters there are, you quickly realize that an 8-bit encoding is not going to get you very far. So a number of other character encodings have proliferated. Some of the most popular are latin1 (ISO-8859-1) and UTF-8. Both of these encodings can render the é and ô characters, but they use quite different bit strings to represent them. As you write to a file (or to the terminal) or add a row to a database, Perl and MySQL have a notion of what the character encoding of the output stream is. An encoding is also used when you read data. If you don't know what this encoding is, then it doesn't make any sense to say that the data looks good/looks bad when you store it and retrieve it.
Perl and MySQL can, with the right settings, handle both of these encodings and several more. Which encoding you choose to use is not as important as making sure that all the pieces of your application are using the same encoding. But you should choose an encoding that
can encode all of the characters you will need (for this problem, you mention é and ô, but will there be others? what about in the future?)
is supported by all the pieces of your application (front-end, database, back-end)
Here's some suggested reading to get you headed in the right direction:
The Encode module for Perl
character sets in MySQL
(others should feel free to recommend additional links)
I can't speak to MySQL so much, but character encoding support in Perl is rapidly evolving (which isn't to say that it ain't damn good). The latest versions of Perl will have the best support (for the most obscure character sets) and the best features (for example, regular expressions and character classes) for characters beyond ASCII.
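To make the "different bit strings" point concrete, here is a tiny sketch using the Encode module from the reading list above:

use Encode qw(encode);

my $char = "\x{E9}";                             # "é" as a Perl text string (U+00E9)
printf "latin1: %vX\n", encode('iso-8859-1', $char);   # E9
printf "utf8:   %vX\n", encode('UTF-8', $char);         # C3.A9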
There are a few things to follow.
First you have to make sure that Perl understands that the data moving between your program and the DB is encoded as UTF-8 (I expect your databases and tables are set up properly). For this you need to say so explicitly when connecting to the database, like this:
my($dbh) = DBI->connect(
    'dbi:mysql:test',
    'user',
    'password',
    {
        mysql_enable_utf8 => 1,
    }
);
Next, you need to send data to output, and you must make sure it is encoded as UTF-8. For this I like this pretty good module:
use utf8::all;
But this module is not in core, so you may want to set it up with binmode yourself instead:
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
And if you deal with web pages, you have to make sure that the browser understands that you are sending your data encoded as UTF-8. For that you should make sure your HTTP headers include the encoding:
Content-Type: text/html; charset=utf-8;
and set it with an HTML meta tag too:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Now you should have your bases covered.
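Putting the database half together, a minimal sketch of the insert from the question with decoding added (connection details are placeholders, and the table and column are assumed to already use a utf8 character set):

use strict;
use warnings;
use DBI;
use Encode qw(decode);

my $dbh = DBI->connect('dbi:mysql:test', 'user', 'password',
                       { mysql_enable_utf8 => 1, RaiseError => 1 });

# If $title arrived as UTF-8 bytes (from a file, web form, ...), decode it first
# so Perl is holding characters rather than raw bytes.
my $id    = 2356606;
my $title = decode('UTF-8', "J\xC3\xA9r\xC3\xB4me_Lejeune");

my $sth = $dbh->prepare('INSERT INTO articles (id, title) VALUES (?, ?)');
$sth->execute($id, $title);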

Cannot properly decode html entities in perl

I am having an issue which I am unable to solve after spending the last 10 hours searching around the internet for an answer.
I have some data in this format
??E??0??<?20120529184453+0200?20120529184453+0200???G0E?5?=20111213T103134000-136.225.6.103-30365316-1448169323, ver: 12??W??tP?2??
??|?????
??:o?????tP???B#?????B#??????)0????
49471010550??? ???tP???3??<????????????????
I have some PHP code, not written by me, which just runs html_entity_decode on that and returns the correct results.
When I try running Perl's decode_entities I get a completely different result. After some debugging it seems to me that PHP is "properly" replacing what seem to be invalid entities, such as &#0; or &#8;, with their ASCII counterparts, namely NUL and backspace for the two cases mentioned.
Perl, on the other hand, does not seem to decode those "invalid" entities and leaves them alone, which later on screws up the result (which goes through unpack or, in PHP's case, bin2hex, and fails because rather than unpacking null to 00 it will unpack each individual character of &#0;).
I have tried everything I can think of, including running the following substitution in Perl after running decode_entities:
$var =~ s/&#(\d+);/chr($1)/g
however that does not work at all.
This is driving me mad, and I would like to have this done in Perl rather than PHP. I really hope I don't have to write 1000 pattern-matching lines in Perl to cover all possible entities and numbers.
Does anybody have an idea how to go about this problem without resorting to porting PHP's entire html_entity_decode function to Perl or writing endless lines of pattern matching?
You're almost there. Instead of
$var =~ s/&#(\d+);/chr($1)/g
say
$var =~ s/&#(\d+);/chr($1)/ge
The /e modifier instructs Perl to 'e'valuate the replacement pattern.
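A self-contained way to see the difference (the sample string is made up):

use strict;
use warnings;

my $var = 'A&#0;B&#8;C&#72;';                    # invented sample with numeric references

# Without /e the replacement is the literal text "chr($1)"; with /e it is
# evaluated as Perl code, so each &#NNN; becomes the character chr(NNN).
$var =~ s/&#(\d+);/chr($1)/ge;

printf "%v02X\n", $var;                          # 41.00.42.08.43.48 -- real NUL and backspace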

Migrating MS Access data to MySQL: character encoding issues

We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except one thing: Some of the characters are represented as question marks. This: "He wasn't ready" shows up like this: "He wasn?t ready", only in some cases (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.
Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them to the double-byte left-quote and right-quote characters, and translates a single quote into the double-byte apostrophe character. The double-byte characters in question belong to an MS code page which is roughly compatible with UTF-16, except for the silly quote characters.
There is a Perl script called 'demoroniser.pl' which undoes all this malarkey and converts the quotes back to plain ASCII.
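If you would rather not track down demoroniser.pl, here is a rough sketch of the same idea in Perl; the mapping table below covers only the usual quote and dash characters, so extend it as needed:

use strict;
use warnings;

my %ascii_for = (
    "\x{2018}" => "'",  "\x{2019}" => "'",       # left/right single quotes
    "\x{201C}" => '"',  "\x{201D}" => '"',       # left/right double quotes
    "\x{2013}" => '-',  "\x{2014}" => '--',      # en dash, em dash
);

sub de_smart {
    my ($text) = @_;
    my $class = join '', map { quotemeta } keys %ascii_for;
    $text =~ s/([$class])/$ascii_for{$1}/g;      # swap each smart character for ASCII
    return $text;
}

print de_smart("He wasn\x{2019}t ready"), "\n";  # prints: He wasn't ready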
It's most likely due to the fact that the data in the Access file is UTF, and MDB Tools is trying to convert it to ASCII/Latin-1/ISO-8859-1 or some other encoding. Since these encodings don't map all the UTF characters properly, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.