Why is JSON::XS Not Generating Valid UTF-8?

I'm getting some corrupted JSON and I've reduced it down to this test case.
use utf8;
use 5.18.0;
use Test::More;
use Test::utf8;
use JSON::XS;

BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}

foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}

diag ord('»');
done_testing;
And this is the output:
utf8.t ..
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver «French Bread»"}
# Failed test 'JSON: {"value":"Deliver «French Bread»"}'
# at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ¥æ¬å½"}
1..4
{"value":"日本国"}
# 187
So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The utf8 pragma is correctly marking my source as UTF-8. Further, that trailing 187 is from the diag. That's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. (And the test output still looks like crap; I never could quite get that right with Test::Builder.)
Switching to JSON::PP produces the same output.
This is Perl 5.18.1 running on OS X Yosemite.

is_sane_utf8 doesn't do what you think it does. You're supposed to pass it strings you've already decoded. I'm not sure what the point of it is, but it's not the right tool here. If you want to check whether a string is valid UTF-8, you could use
use Encode qw(decode_utf8);

ok( eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
    '$string is valid UTF-8' );
To show that JSON::XS is correct, let's look at the sequence is_sane_utf8 flagged.
+--------------------- Start of two byte sequence
| +---------------- Not zero (good)
| | +---------- Continuation byte indicator (good)
| | |
v v v
C2 AB = [110]00010 [10]101011
00010 101011 = 000 1010 1011 = U+00AB = «
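You can verify that round trip from Perl as well; this is just a quick sanity check with Encode, not part of the original test:

use Encode qw(decode_utf8);

my $char = decode_utf8("\xC2\xAB");   # decode the byte pair that was flagged
printf "U+%04X\n", ord $char;         # prints U+00AB, i.e. «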
The following shows that JSON::XS produces the same output as Encode.pm:
use utf8;
use 5.18.0;
use JSON::XS;
use Encode;

foreach my $string ('Deliver «French Bread»', '日本国') {
    my $hashref = { value => $string };
    say(sprintf("Input: U+%v04X", $string));
    say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));
    my $json = encode_json($hashref);
    say(sprintf("JSON: %v02X", $json));
    say("");
}
Output (with some spaces added):
Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
UTF-8 of input: 44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D
Input: U+65E5.672C.56FD
UTF-8 of input: E6.97.A5.E6.9C.AC.E5.9B.BD
JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D

JSON::XS is generating valid UTF-8, but you're using the resulting UTF-8 encoded byte strings in two different contexts that expect character strings.
Issue 1: Test::utf8
Here are the two main situations when is_sane_utf8 will fail:
1. You have a miscoded character string that was decoded from a UTF-8 byte string as if it were Latin-1, or that came from double-encoded UTF-8; or the character string is perfectly fine but happens to look like a potentially "dodgy" miscoding (using the terminology from its docs).
2. You have a valid UTF-8 byte string containing the encoded code points U+0080 through U+00FF, for example «French Bread».
The is_sane_utf8 test is intended only for character strings and has the documented potential for false negatives.
Issue 2: Output Encoding
All of your non-JSON strings are character strings, while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. Since you're using the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are implicitly encoded to UTF-8 with good results, while the byte strings containing JSON are being double encoded. STDERR, however, has no :encoding PerlIO layer set, so the encoded JSON byte strings look fine in your warnings: they're already encoded and are passed straight through.
Only use the :encoding(UTF-8) PerlIO layer for IO with character strings, as opposed to the UTF-8 encoded byte strings returned by default from the JSON encoder.
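As a minimal sketch of that advice (assuming you want to keep the :encoding(UTF-8) layer), either decode the encoder's byte string back to characters, or ask the encoder for characters in the first place via its utf8 flag:

use utf8;
use JSON::XS;
use Encode qw(decode_utf8);

my $data = { value => 'Deliver «French Bread»' };

# Option 1: decode the UTF-8 byte string back into a character string
# before printing it through an :encoding(UTF-8) handle.
my $json_chars = decode_utf8( encode_json($data) );

# Option 2: build an encoder with the utf8 flag off, so it returns
# a character string directly.
my $json_chars2 = JSON::XS->new->utf8(0)->encode($data);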

Related

Convert emoji Unicode byte sequences to Unicode characters with jq

I'm filtering Facebook Messenger JSON dumps with jq. The source JSON contains emojis as Unicode sequences. How can I output these back as emojis?
echo '{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}' | jq -c '.'
Actual result:
{"content":"ð¤·ð¿ââï¸"}
Desired result:
{"content":"🤷🏿‍♂️"}
@chepner's use of Latin1 in Python finally shook free in my head how to do this with jq almost directly. You'll need to pipe through iconv:
$ echo '{"content":"\u00f0\u..."}' | jq -c . | iconv -t latin1
{"content":"🤷🏿‍♂️"}
In JSON, the string \u00f0 does not mean "the byte 0xF0, as part of a UTF-8 encoded sequence." It means "Unicode code point 0x00F0." That's ð, and jq is displaying it correctly as the UTF-8 encoding 0xc3 0xb0.
The iconv call reinterprets the UTF-8 string for ð (0xc3 0xb0) back into Latin1 as 0xf0 (Latin1 exactly matches the first 256 Unicode code points). Your UTF-8 capable terminal then interprets that as the first byte of a UTF-8 sequence.
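For comparison, the same Latin-1 round trip can be written in Perl; a sketch using just the first four \u00XX escapes from the dump (the UTF-8 bytes of the base emoji), not the full sequence:

use JSON::XS;
use Encode qw(encode decode);

# Each \u00XX escape decodes to a code point in the 0x00-0xFF range,
# which is really one byte of the UTF-8 encoding.
my $mojibake = decode_json(q({"content":"\u00f0\u009f\u00a4\u00b7"}))->{content};

# Re-encode those code points as Latin-1 to recover the raw bytes,
# then decode the bytes as the UTF-8 they actually are.
my $fixed = decode('UTF-8', encode('latin1', $mojibake));
printf "U+%04X\n", ord $fixed;   # U+1F937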
The problem is that the response contains the UTF-8 encoding of the Unicode code points, not the code points themselves. jq cannot decode this itself. You could use another language; for example, in Python
>>> x = json.load(open("response.json"))['content']
>>> x
'ð\x9f¤·ð\x9f\x8f¿â\x80\x8dâ\x99\x82ï¸\x8f'
>>> x.encode('latin1').decode()
'🤷🏿\u200d♂️'
It's not exact, but I'm not sure the encoding is unambiguous. For example,
>>> x.encode('latin1')
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿‍♂️'.encode()
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿‍♂️'.encode().decode()
'🤷🏿\u200d♂️'
The result of re-encoding the response using Latin-1 is identical to encoding the desired emoji as UTF-8, but decoding doesn't give back precisely the same emoji (or at least, Python isn't rendering it identically).
Here's a jq-only solution. It works with both the C and Go implementations of jq.
# input: a decimal integer
# output: the corresponding binary array, most significant bit first
def binary_digits:
  if . == 0 then 0
  else [recurse( if . == 0 then empty else ./2 | floor end ) % 2]
    | reverse
    | .[1:]   # remove the leading 0
  end ;

def binary_to_decimal:
  reduce reverse[] as $b ({power: 1, result: 0};
      .result += .power * $b
    | .power *= 2)
  | .result;

# input: an array of decimal integers representing the UTF-8 bytes of a Unicode codepoint
# output: the corresponding decimal number of that codepoint
def utf8_decode:
  # Magic numbers:
  # x80: 128  # 10000000
  # xe0: 224  # 11100000
  # xf0: 240  # 11110000
  (-6) as $mb      # non-first bytes start 10 and carry 6 bits of data
                   # first byte of a 2-byte encoding starts 110 and carries 5 bits of data
                   # first byte of a 3-byte encoding starts 1110 and carries 4 bits of data
                   # first byte of a 4-byte encoding starts 11110 and carries 3 bits of data
  | map(binary_digits) as $d
  | .[0]
  | if   . < 128 then $d[0]
    elif . < 224 then [$d[0][-5:][], $d[1][$mb:][]]
    elif . < 240 then [$d[0][-4:][], $d[1][$mb:][], $d[2][$mb:][]]
    else              [$d[0][-3:][], $d[1][$mb:][], $d[2][$mb:][], $d[3][$mb:][]]
    end
  | binary_to_decimal ;

{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}
| .content |= (explode | [utf8_decode] | implode)
Transcript:
$ jq -nM -f program.jq
{
  "content": "🤷"
}
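For comparison, the same byte-level decode that utf8_decode performs is a couple of lines in Perl; a sketch with the four bytes of the base emoji hard-coded:

use Encode qw(decode);

my @bytes = (0xF0, 0x9F, 0xA4, 0xB7);             # UTF-8 bytes of U+1F937
my $char  = decode('UTF-8', pack('C*', @bytes));  # pack to a byte string, decode as UTF-8
printf "U+%04X\n", ord $char;                     # prints U+1F937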
First of all, you need a font which supports this.
You are confusing Unicode composed chars with UTF-8 encoding. Note also that a JSON \u escape takes exactly four hex digits, so a code point above U+FFFF such as U+1F937 has to be written as a UTF-16 surrogate pair. It has to be either:
$ echo '{"content":"\ud83e\udd37\u200d\u2642"}' | jq -c '.'
or
$ echo '{"content":"\ud83e\udd37\u200d\u2642\ufe0f"}' | jq -c '.'

Why does reading JSON from database with Perl warn about "wide characters"?

I'm reading data from a database using Perl (I'm new to Perl); one of the columns is a JSON array. The problem I'm having is that when I try to read the data in the JSON I get the error "Wide character in subroutine entry".
Table:
id | name | date | data
Sample data:
{ "duration":"24", "name":"My Test","visible":"1" }
use JSON qw(decode_json);

my $pages = '';
my $current_connection = $DATABASE_CONNECTION->prepare("SELECT * FROM data WHERE syt = 'area1'");
$current_connection->execute();

while ( my $db_data = $current_connection->fetchrow_hashref() ) {
    my $name   = $db_data->{name};
    my $object = decode_json( $db_data->{data} );
    foreach my $key ( sort keys %{$object} ) {
        $pages .= "<p> $object->{$key} </p>";
    }
}
That error means a character greater than 255 was passed to a sub expecting a string of bytes.
When stored in the database, the string is encoded using some character encoding, possibly UTF-8. You appear to have decoded the text (e.g. by using mysql_enable_utf8mb4 => 1), producing a string of Unicode Code Points. However, decode_json expects UTF-8.
The solution is to use from_json or JSON->new->decode instead of decode_json; these expect decoded text (a string of UCP).
You can verify that this is the issue using sprintf '%vX', $json.
For example,
If you get E9 and 2660 for "é" and "♠",
That's their UCP. Use from_json or JSON->new->decode.
If you get C3.A9 and E2.99.A0 for "é" and "♠",
That's their UTF-8. Use decode_json or JSON->new->utf8->decode.
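A quick way to see both cases side by side; a sketch with the text hard-coded rather than fetched from a database:

use JSON qw(from_json decode_json);
use Encode qw(encode_utf8);

my $chars = qq({"name":"caf\x{E9}"});   # decoded text (a string of code points)
my $bytes = encode_utf8($chars);        # the same JSON as UTF-8 bytes

printf "%vX\n", $chars;                 # ends in ...E9    (UCP)
printf "%vX\n", $bytes;                 # ends in ...C3.A9 (UTF-8)

my $from_text  = from_json($chars);     # from_json expects decoded text
my $from_bytes = decode_json($bytes);   # decode_json expects UTF-8 bytes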

json.dumps returns "\u0001" for "\1". I need to print the exact characters "\1" after passing through json.dumps

Here is my code:
import json
a = "\1"
print json.dumps(a)
It returns "\u0001" instead of the desired "\1".
Is there any way to get the exact characters back after passing through json.dumps?
In Python, the string literal "\1" represents just one character, whose character code is 1. The backslash functions here as an escape that provides the character code as an octal value.
So either escape the backslash like this:
a = "\\1"
Or use the raw string literal notation with the r prefix:
a = r"\1"
Both will assign exactly the same string: print a produces:
\1
The output of json.dumps(a) will be:
"\\1"
... since in JSON, too, a literal backslash (reverse solidus) needs to be escaped by another backslash. But the encoded string truly represents \1.
The following prints True:
print a == json.loads(json.dumps(a))

Getting the length of a utf8mb4 string with Perl from MySQL

I wrote a small Perl function that takes a string and checks its length without the spaces. The basic code looks as follows:
sub foo
{
    use utf8;
    my @wordsArray = split(/ /, $_[0]);
    my $result = length(join('', @wordsArray));
    return $result;
}
When I provide this function with a string containing special characters (such as Hebrew letters), it seems to work great.
The problem starts when I use a value coming from a MySQL column with a character set of utf8mb4: in such a case, the value being calculated is higher than the value in the previous example.
I can guess why this behaviour occurs: the special characters are written in a 4-byte manner in the table, and thus each letter counts as two characters in the utf8 encoding.
Does anyone know how the above can be resolved, so that I will get the right number of characters from a string coming from a DB table defined as utf8mb4?
EDIT:
Some more information regarding the above code:
The DB column used as an argument for the function is of type VARCHAR(1000), with collation of utf8mb4_unicode_ci.
I am fetching the row via a MySql connection configured as follows:
$mySql = DBI->connect(
    "DBI:mysql:$db_info{'database'}:$db_info{'hostname'};mysql_multi_statements=1;",
    "$db_info{'user'}",
    "$db_info{'password'}",
    { 'RaiseError' => 1, 'AutoCommit' => 0 });
...
$mySql->do("set names utf8mb4");
An example data value would be "שלום עולם" (which in Hebrew means "Hello World").
1) When calling foo($request->{VALUE}); (where VALUE is the column data from the DB), the result is 16 (each Hebrew character is counted as two characters, and the one space between the words is disregarded). Dumper in this case is:
$VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";
2) When calling foo("שלום עולם");:
when declaring use utf8;, the result is 8 (as there are 8 visible characters in this string). Dumper (Useqq=1) in this case is:
$VAR1 = "\x{5e9}\x{5dc}\x{5d5}\x{5dd} \x{5e2}\x{5d5}\x{5dc}\x{5dd}";
when not declaring use utf8;, the result is 16, and it is similar to the case of sending the value from the DB:
$VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";
Looks like I need to find a way of converting the received value to UTF-8 before starting to work with it.
What MySQL calls utf8 is a limited subset of UTF-8 which allows only three bytes per character, and so covers only code points up to 0xFFFF. utf8mb4 allows the full four-byte range of UTF-8 as it is defined today (the original UTF-8 specification went even further, allowing encoded characters up to 6 bytes long).
The consequence is that any data from either a utf8 or a utf8mb4 column is simply a UTF-8 string in Perl, and there should be no difference between the two database encodings.
It is my guess that you haven't enabled UTF-8 for your DBI handle, so everything is being treated as just a sequence of bytes. You should enable the mysql_enable_utf8 option when you make the connect call, which should then look something like
my $dbh = DBI->connect($dsn, $user, $password, { mysql_enable_utf8 => 1 });
With the additional data, I can see that the string you are retrieving from the database is indeed שלום עולם, UTF-8-encoded.
However, if I decode it, then first of all I get a non-space character count of 8 from both your foo subroutine and my own, not 9; and also you should be getting characters back from the database, not bytes.
I suspect that you may have written an encoded string to the database in the first place. Here's a short program that creates a MySQL table, writes two records to it (one character string and one encoded string) and retrieves what it has written. You will see that the only thing that makes a difference is the setting of mysql_enable_utf8. The behaviour is the same whether or not the original string is encoded, and with or without SET NAMES utf8mb4.
Further experimentation showed that either mysql_enable_utf8 or SET NAMES utf8mb4 will get DBI to write the data correctly, but the latter has no effect on reading.
I suggest that your solution should be to use ONLY mysql_enable_utf8 when either reading or writing.
You should also use utf8 at the top of all your programs. Missing this out means you can't use any non-ASCII characters in your code.
use utf8;
use strict;
use warnings;

use DBI;
use open qw/ :std :encoding(utf-8) /;

STDOUT->autoflush;

my $VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";

my $dbh = DBI->connect(
    qw/ DBI:mysql:database=temp admin admin /, {
        RaiseError        => 1,
        PrintError        => 0,
        mysql_enable_utf8 => 1,
    }
) or die DBI::errstr;

$dbh->do('SET NAMES utf8mb4');

$dbh->do('DROP TABLE IF EXISTS temp');
$dbh->do('CREATE TABLE temp (value VARCHAR(64) CHARACTER SET utf8mb4)');

my $insert = $dbh->prepare('INSERT INTO temp (value) VALUES (?)');
$insert->execute('שלום עולם');
$insert->execute($VAR1);

my $values = $dbh->selectcol_arrayref('SELECT value FROM temp');

printf "string: %s  foo: %d\n", $_, foo($_) for @$values;

sub foo2 {
    $_[0] =~ tr/ //c;
}

sub foo {
    length join '', split / /, $_[0];
}
output with mysql_enable_utf8 => 1
string: שלום עולם foo: 8
string: שלום עולם foo: 8
output with mysql_enable_utf8 => 0
string: ש××× ×¢××× foo: 16
string: ש××× ×¢××× foo: 16
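The byte-versus-character distinction behind those two outputs can be shown without a database at all; a small sketch using Encode:

use Encode qw(encode_utf8 decode_utf8);

my $chars = "\x{5E9}\x{5DC}\x{5D5}\x{5DD}";   # שלום as four code points
my $bytes = encode_utf8($chars);              # the same text as UTF-8 bytes

print length($chars), "\n";                   # 4 (characters)
print length($bytes), "\n";                   # 8 (bytes, two per Hebrew letter)
print length(decode_utf8($bytes)), "\n";      # 4 again after decoding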

Why does the JSON module quote some numbers but not others?

We recently switched to the new JSON2 Perl module.
I thought everything got returned quoted now.
But I encountered some cases in which a number (250) got returned as an unquoted number in the JSON string created by Perl.
Out of curiosity:
Does anyone know why such cases exist and how the JSON module decides whether to quote a value?
It will be unquoted if it's a number. Without getting too deeply into Perl internals, something is a number if it's a literal number or the result of an arithmetic operation, and it hasn't been stringified since its numeric value was produced.
use 5.010;
use JSON::XS;

my $json = JSON::XS->new->allow_nonref;

say $json->encode(42);    # 42
say $json->encode("42");  # "42"

my $x = 4;
say $json->encode($x);    # 4

my $y = "There are $x lights!";  # interpolation stringifies $x
say $json->encode($x);    # "4"

$x++;                     # modifies the numeric value of $x
say $json->encode($x);    # 5
Note that printing a number isn't "stringifying it" even though it produces a string representation of the number to output; print $x doesn't cause a number to be a string, but print "$x" does.
Anyway, all of this is a bit weird, but if you want a value to be reliably unquoted in JSON then put 0 + $value into your structure immediately before encoding it, and if you want it to be reliably quoted then use "" . $value or "$value".
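A small sketch of both coercions (canonical is only used here to keep the key order stable):

use 5.010;
use JSON::XS;

my $coder = JSON::XS->new->canonical;

my $n = "42";                                   # currently a string
say $coder->encode({ forced_num => 0 + $n });   # {"forced_num":42}
say $coder->encode({ forced_str => "" . 42 });  # {"forced_str":"42"}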
You can force it into a string by doing something like this:
$number_str = '' . $number;
For example:
perl -MJSON -le 'print encode_json({foo=>123, bar=>"".123})'
{"bar":"123","foo":123}
It looks like older versions of JSON had an autoconvert functionality that could be set. Did you not have $JSON::AUTOCONVERT set to a true value?