Why does reading JSON from a database with Perl warn about "wide characters"?

I'm reading data from a database using Perl (I'm new to Perl), one of the columns is a JSON array. The problem I'm having is that when I try to read the data in the JSON I get an error "Wide character in subroutine entry".
Table:
id | name | date | data
Sample data:
{ "duration":"24", "name":"My Test","visible":"1" }
use JSON qw(decode_json);

my $current_connection = $DATABASE_CONNECTION->prepare( "SELECT * FROM data WHERE syt = 'area1' " );
$current_connection->execute();

while( my $db_data = $current_connection->fetchrow_hashref() )
{
    my $name   = $db_data->{name};
    my $object = decode_json($db_data->{data});

    foreach my $key (sort keys %{$object}) {
        my $result;
        $pages .= "<p> $result->{$key}->{name} </p>";
    }
}

That error means a character greater than 255 was passed to a sub expecting a string of bytes.
When stored in the database, the string is encoded using some character encoding, possibly UTF-8. You appear to have decoded the text (e.g. by using mysql_enable_utf8mb4 => 1), producing a string of Unicode Code Points. However, decode_json expects UTF-8.
The solution is to use from_json or JSON->new->decode instead of decode_json; these expect decoded text (a string of UCP).
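For example, a minimal sketch of that change applied to the loop from the question (assuming the data column holds the flat object shown in the sample data, and that the driver is already returning decoded text):

use JSON;

my $json_parser = JSON->new;    # ->decode expects decoded text (a string of characters)

my $pages = '';
while ( my $db_data = $current_connection->fetchrow_hashref() ) {
    my $object = $json_parser->decode( $db_data->{data} );   # no "Wide character" warning
    $pages .= "<p> $object->{name} </p>";
}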
You can verify that this is the issue using sprintf '%vX', $json.
For example,
If you get E9 and 2660 for "é" and "♠",
That's their UCP. Use from_json or JSON->new->decode.
If you get C3.A9 and E2.99.A0 for "é" and "♠",
That's their UTF-8. Use decode_json or JSON->new->utf8->decode.
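A quick way to run that check before deciding which decoder to call (a sketch; $db_data->{data} stands in for whatever string you are about to decode):

use JSON qw(from_json decode_json);

printf "%vX\n", $db_data->{data};
# groups like E9 or 2660         -> code points: use from_json / JSON->new->decode
# groups like C3.A9 or E2.99.A0  -> UTF-8 bytes: use decode_json / JSON->new->utf8->decode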

Related

Using Text::CSV on a String Containing Quotes

I have pored over this site (and others) trying to glean the answer for this but have been unsuccessful.
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, auto_diag => 1 } );
$line = q(data="a=1,b=2",c=3);
my $csvParse = $csv->parse($line);
my @fields = $csv->fields();
for my $field (@fields) {
print "FIELD ==> $field\n";
}
Here's the output:
# CSV_XS ERROR: 2034 - EIF - Loose unescaped quote # rec 0 pos 6 field 1
FIELD ==>
I am expecting 2 array elements:
data="a=1,b=2"
c=3
What am I missing?
You may get away with using Text::ParseWords. Since you are not using real CSV, it may be fine. Example:
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
my $line = q(data="a=1,b=2",c=3);
my @fields = quotewords(',', 1, $line);
print Dumper \@fields;
This will print
$VAR1 = [
'data="a=1,b=2"',
'c=3'
];
As you requested. You may want to test further on your data.
Your input data isn't "standard" CSV, at least not the kind that Text::CSV expects and not the kind that things like Excel produce. An entire field has to be quoted or not at all. The "standard" encoding of that would be "data=""a=1,b=2""",c=3 (which you can see by asking Text::CSV to print your expected data using say).
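A sketch of that check, using Text::CSV's combine and string methods (the say method mentioned above amounts to the same thing):

use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
$csv->combine('data="a=1,b=2"', 'c=3');
print $csv->string, "\n";    # prints: "data=""a=1,b=2""",c=3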
If you pass the allow_loose_quotes option to the Text::CSV constructor, it won't error on your input, but it won't consider the quotes to be "protecting" the comma, so you will get three fields, namely 'data="a=1', 'b=2"' and 'c=3'.
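A sketch of that, for reference (auto_diag stays on; the exact field values assume the default quote and escape characters):

use Text::CSV;
use Data::Dumper;

my $csv  = Text::CSV->new({ binary => 1, auto_diag => 1, allow_loose_quotes => 1 });
my $line = q(data="a=1,b=2",c=3);
$csv->parse($line);
print Dumper [ $csv->fields ];
# roughly: $VAR1 = [ 'data="a=1', 'b=2"', 'c=3' ];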

Getting length of utf8mb4 string with Perl from MySQL

I wrote a small Perl function that takes a string and checks its length without the spaces. The basic code looks as follows:
sub foo
{
    use utf8;
    my @wordsArray = split(/ /, $_[0]);
    my $result = length(join('', @wordsArray));
    return $result;
}
When I provide this function with a string containing special characters (such as Hebrew letters), it seems to work great.
The problem starts when I use a value coming from a MySQL column with a character set of utf8mb4: in such a case, the value being calculated is higher than the value in the previous example.
I can guess why such behavior occurs: the special characters are written in a 4-byte manner in the table, and thus each letter is counted as two characters in the utf8 encoding.
Does anyone know how the above can be resolved, so that I will get the right number of characters from a string coming from a DB table defined as utf8mb4?
EDIT:
Some more information regarding the above code:
The DB column used as an argument for the function is of type VARCHAR(1000), with collation of utf8mb4_unicode_ci.
I am fetching the row via a MySQL connection configured as follows:
$mySql = DBI->connect(
"DBI:mysql:$db_info{'database'}:$db_info{'hostname'};mysql_multi_statements=1;",
"$db_info{'user'}",
"$db_info{'password'}",
{'RaiseError' => 1,'AutoCommit' => 0});
...
$mySql->do("set names utf8mb4");
An example data value would be "שלום עולם" (which in Hebrew means "Hello World").
1) When calling foo($request->{VALUE}); (where VALUE is the column data from the DB), the result is 16 (each Hebrew character is counted as two characters, and the one space between them is disregarded). Dumper in this case is:
$VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";
2) When calling foo("שלום עולם");:
when declaring use utf8;, the result is 8 (as there are 8 visible characters in this string). Dumper (Useqq=1) in this case is:
$VAR1 = "\x{5e9}\x{5dc}\x{5d5}\x{5dd} \x{5e2}\x{5d5}\x{5dc}\x{5dd}";
when not declaring use utf8;, the result is 16, and is similar to the case of sending the value from the DB:
$VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";
Looks like I need to find a way of converting the received value to UTF-8 before starting to work with it.
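For reference, that conversion would look something like the sketch below, using Encode to decode the bytes coming back from DBI (the answer that follows recommends fixing this at the connection level instead):

use Encode qw(decode);

my $bytes = $request->{VALUE};         # raw UTF-8 bytes from the database
my $chars = decode('UTF-8', $bytes);   # Perl character string
my $count = foo($chars);               # now counts characters, not bytes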
What MySQL calls utf8 is a limited subset of UTF-8 which allows only three bytes per character and covers code points up to 0xFFFF. Even utf8mb4 doesn't cover the full UTF-8 range, which supports encoded characters up to 6 bytes long.
The consequence is that any data from either a utf8 or a utf8mb4 column is simply a UTF-8 string in Perl, and there should be no difference between the two database encodings.
It is my guess that you haven't enabled UTF-8 for your DBI handle, so everything is being treated as just a sequence of bytes. You should enable the mysql_enable_utf8 option when you make the connect call, which should then look something like this:
my $dbh = DBI->connect($dsn, $user, $password, { mysql_enable_utf8 => 1 });
With the additional data, I can see that the string you are retrieving from the database is indeed שלום עולם, UTF-8-encoded.
However, if I decode it, then first of all I get a non-space character count of 8 from both your foo subroutine and my own, not 9; and also you should be getting characters back from the database, not bytes.
I suspect that you may have written an encoded string to the database in the first place. Here's a short program that creates a MySQL table, writes two records to it (one character string and one encoded string) and retrieves what it has written. You will see that the only thing that makes a difference is the setting of mysql_enable_utf8. The behaviour is the same whether or not the original string is encoded, and with or without SET NAMES utf8mb4.
Further experimentation showed that either mysql_enable_utf8 or SET NAMES utf8mb4 will get DBI to write the data correctly, but the latter has no effect on reading.
I suggest that your solution should be to use ONLY mysql_enable_utf8 when either reading or writing.
You should also use utf8 at the top of all your programs; missing this out means you can't use any non-ASCII characters in your code.
use utf8;
use strict;
use warnings;

use DBI;

use open qw/ :std :encoding(utf-8) /;

STDOUT->autoflush;

my $VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";

my $dbh = DBI->connect(
    qw/ DBI:mysql:database=temp admin admin /, {
        RaiseError        => 1,
        PrintError        => 0,
        mysql_enable_utf8 => 1,
    }
) or die DBI::errstr;

$dbh->do('SET NAMES utf8mb4');

$dbh->do('DROP TABLE IF EXISTS temp');
$dbh->do('CREATE TABLE temp (value VARCHAR(64) CHARACTER SET utf8mb4)');

my $insert = $dbh->prepare('INSERT INTO temp (value) VALUES (?)');
$insert->execute('שלום עולם');
$insert->execute($VAR1);

my $values = $dbh->selectcol_arrayref('SELECT value FROM temp');

printf "string: %s foo: %d\n", $_, foo($_) for @$values;

sub foo2 {
    $_[0] =~ tr/ //c;
}

sub foo {
    length join '', split / /, $_[0];
}
output with mysql_enable_utf8 => 1
string: שלום עולם foo: 8
string: שלום עולם foo: 8
output with mysql_enable_utf8 => 0
string: ש××× ×¢××× foo: 16
string: ש××× ×¢××× foo: 16

Why is JSON::XS Not Generating Valid UTF-8?

I'm getting some corrupted JSON and I've reduced it down to this test case.
use utf8;
use 5.18.0;
use Test::More;
use Test::utf8;
use JSON::XS;
BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}

foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}

diag ord('»');
done_testing;
And this is the output:
utf8.t ..
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver «French Bread»"}
# Failed test 'JSON: {"value":"Deliver «French Bread»"}'
# at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ¥æ¬å½"}
1..4
{"value":"日本国"}
# 187
So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The utf8 pragma is correctly marking my source. Further, that trailing 187 is from the diag. That's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. (And the test output still looks like crap. Never could quite get that right with Test::Builder).
Switching to JSON::PP produces the same output.
This is Perl 5.18.1 running on OS X Yosemite.
is_sane_utf8 doesn't do what you think it does. You're supposed to pass strings you've decoded to it. I'm not sure what the point of it is, but it's not the right tool. If you want to check if a string is valid UTF-8, you could use
ok(eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
'$string is valid UTF-8');
To show that JSON::XS is correct, let's look at the sequence is_sane_utf8 flagged.
         +-------------------- Start of two byte sequence
         |   +---------------- Not zero (good)
         |   |      +--------- Continuation byte indicator (good)
         |   |      |
         v   v      v
C2 AB = [110]00010 [10]101011

00010 101011 = 000 1010 1011 = U+00AB = «
The following shows that JSON::XS produces the same output as Encode.pm:
use utf8;
use 5.18.0;
use JSON::XS;
use Encode;
foreach my $string ('Deliver «French Bread»', '日本国') {
    my $hashref = { value => $string };
    say(sprintf("Input: U+%v04X", $string));
    say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));
    my $json = encode_json($hashref);
    say(sprintf("JSON: %v02X", $json));
    say("");
}
Output (with some spaces added):
Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
UTF-8 of input: 44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D
Input: U+65E5.672C.56FD
UTF-8 of input: E6.97.A5.E6.9C.AC.E5.9B.BD
JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D
JSON::XS is generating valid UTF-8, but you're using the resulting UTF-8 encoded byte strings in two different contexts that expect character strings.
Issue 1: Test::utf8
Here are the two main situations when is_sane_utf8 will fail:
1. You have a miscoded character string, one that had been decoded from a UTF-8 byte string as if it were Latin-1 or from double-encoded UTF-8; or the character string is perfectly fine and just looks like a potentially "dodgy" miscoding (using the terminology from its docs).
2. You have a valid UTF-8 byte string containing the encoded code points U+0080 through U+00FF, for example «French Bread».
The is_sane_utf8 test is intended only for character strings and has the documented potential for false negatives.
Issue 2: Output Encoding
All of your non-JSON strings are character strings while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. Since you're using the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are being implicitly encoded to UTF-8 with good results, while the byte strings containing JSON are being double encoded. STDERR however does not have an :encoding PerlIO layer set, so the encoded JSON byte strings look good in your warnings since they're already encoded and being passed straight out.
Only use the :encoding(UTF-8) PerlIO layer for IO with character strings, as opposed to the UTF-8 encoded byte strings returned by default from the JSON encoder.
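In other words, keep the two kinds of strings apart. A sketch of both options, assuming a hash like the one in the test (the file names are just placeholders):

use utf8;
use JSON::XS;

my $hashref = { value => 'Deliver «French Bread»' };

# Byte strings: encode_json returns UTF-8 encoded bytes, so use a handle with no encoding layer
open my $raw, '>:raw', 'out_bytes.json' or die $!;
print {$raw} encode_json($hashref);

# Character strings: ->encode (without ->utf8) returns characters, so an :encoding layer is appropriate
open my $chr, '>:encoding(UTF-8)', 'out_chars.json' or die $!;
print {$chr} JSON::XS->new->encode($hashref);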

How to force a variable to be treated as a String in perl?

I have a JSON object which has a key/value pair where the value of one such pair is 0E10.
The problem is that this value should be a string, but it is being treated as a float because of the presence of the letter E after a number, hence it shows 0 whenever I print this value (0e+10).
Can somebody please help me solve this problem?
I am using Perl to pass the JSON and reading it through JavaScript. (A solution in any language would be acceptable.)
This is what I get when I print the JSON.
KEY1 : 0E10
KEY2 : "XYZ"
You can clearly see that, if the value is string it puts under quotes (") but for 0E10 it is not using the quotes (").
The problem in my case is that I am reading the JSON from an API whose control is beyond my reach. I have a backend service written in Perl which passes on the JSON returned by the API. So whenever I hit a URL, the backend service written in Perl is called. This service gets the JSON from the API and returns it back to the service which is hitting the URL.
See the difference:
Option A
use strict;
use warnings;
use JSON;
my $value = 12345;
my $hr = { KEY1=> $value, KEY2=> "XYZ" };
my $json = encode_json $hr;
print $json, "\n";
#<-- prints: {"KEY2":"XYZ","KEY1":12345}
Option B: double quote the $value assign to KEY1
use strict;
use warnings;
use JSON;
my $value = 12345;
my $hr = { KEY1=> "$value", KEY2=> "XYZ" };
my $json = encode_json $hr;
print $json, "\n";
#<-- prints: {"KEY2":"XYZ","KEY1":"12345"}
If you want to generate key: 0E10 (as opposed to key: 0 and key: '0E10'), then you'll have to generate your own JSON. Perl doesn't have a way of storing 0E10 differently than 0E9. (Neither do JavaScript, Java, C, C++, ...)
If you're willing to accept any exponent, you'll probably still have to generate your own JSON. Perl doesn't have a type system, and JSON encoders tend to use integer notation for integers (in the mathematical sense).
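A quick demonstration of that point: by the time the encoder sees the value, 0E10 is just the number zero, indistinguishable from 0E9 or plain 0.

use 5.010;
use JSON::PP;

say 0E10 == 0E9 ? 'same number' : 'different numbers';   # prints: same number
say encode_json({ KEY1 => 0E10 });                        # prints: {"KEY1":0}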
I specifically tested that JSON::XS and JSON::PP will use 0 for a zero internally stored as a floating-point number.
$ perl -MJSON::XS -MDevel::Peek -E'($_=1.1)-=$_; Dump $_; say encode_json([$_]);'
SV = PVNV(0x8002b7d8) at 0x800720f0
  REFCNT = 1
  FLAGS = (NOK,pNOK)
  IV = 1
  NV = 0
  PV = 0
[0]
$ perl -MJSON::PP -MDevel::Peek -E'($_=1.1)-=$_; Dump $_; say encode_json([$_]);'
SV = PVNV(0x801602b0) at 0x8008e520
  REFCNT = 1
  FLAGS = (NOK,pNOK)
  IV = 1
  NV = 0
  PV = 0
[0]
(NOK indicates the scalar contains a value stored as a floating point number.)

I am converting a json file returned from the server into perl data structures

I am able to convert a hard-coded JSON string into Perl hashes; however, when I want to convert a complete JSON file into Perl data structures which can be parsed later in any manner, I am getting the following error:
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)") at json_vellai.pl line 9
use JSON::PP;
$json = JSON::PP->new();
$json = $json->allow_singlequote([$enable]);
open (FH, "jsonsample.doc") or die "could not open the file\n";
#$fileContents = do { local $/;<FH>};
@fileContents = <FH>;
print @fileContents;
$str = $json->allow_barekey->decode(@filecontents);
foreach $t (keys %$str)
{
    print "\n $t -- $str->{$t}";
}
This is how my code looks. Please help me out.
It looks to me like decode doesn't want a list, it wants a scalar string.
You could slurp the file:
undef $/;
$fileContents = <FH>;
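Putting that together, a minimal corrected sketch (assuming jsonsample.doc contains a single JSON object):

use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new->allow_singlequote->allow_barekey;

open my $fh, '<', 'jsonsample.doc' or die "could not open the file: $!";
my $fileContents = do { local $/; <$fh> };   # slurp the whole file into one scalar
close $fh;

my $str = $json->decode($fileContents);

foreach my $t (keys %$str) {
    print "\n $t -- $str->{$t}";
}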