Getting length of utf8mb4 string with Perl from MySQL

I wrote a small Perl function that takes a string and computes its length without the spaces. The basic code looks as follows:
sub foo
{
    use utf8;
    my @wordsArray = split(/ /, $_[0]);
    my $result = length(join('', @wordsArray));
    return $result;
}
When I provide this function with a string containing special characters (such as Hebrew letters), it seems to work great.
The problem starts when I use a value coming from a MySQL column with a character set of utf8mb4: in that case, the calculated value is higher than in the previous example.
I can guess why this behavior occurs: the special characters are written in a multi-byte manner in the table, and thus each letter is counted as two characters in the utf8 encoding.
Does anyone know how the above can be resolved, so that I get the right number of characters from a string coming from a DB table defined as utf8mb4?
EDIT:
Some more information regarding the above code:
The DB column used as an argument for the function is of type VARCHAR(1000), with collation of utf8mb4_unicode_ci.
I am fetching the row via a MySQL connection configured as follows:
$mySql = DBI->connect(
    "DBI:mysql:$db_info{'database'}:$db_info{'hostname'};mysql_multi_statements=1;",
    "$db_info{'user'}",
    "$db_info{'password'}",
    {'RaiseError' => 1, 'AutoCommit' => 0});
...
$mySql->do("set names utf8mb4");
An example data value would be "שלום עולם" (which is Hebrew for "Hello World").
1) When calling foo($request->{VALUE}); (where VALUE is the column data from the DB), the result is 16 (each Hebrew character is counted as two characters, and the one space between the words is disregarded). Dumper output in this case is:
$VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";
2) When calling foo("שלום עולם");:
when declaring use utf8;, the result is 8 (as there are 8 non-space characters in this string). Dumper (Useqq=1) output in this case is:
$VAR1 = "\x{5e9}\x{5dc}\x{5d5}\x{5dd} \x{5e2}\x{5d5}\x{5dc}\x{5dd}";
when not declaring use utf8;, the result is 16, similar to the case of sending the value from the DB:
$VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";
Looks like I need to find a way of decoding the received value before starting to work with it.
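For reference, a minimal sketch of that conversion (assuming the value coming back from DBI is raw UTF-8 bytes; Encode is a core Perl module):
use Encode qw( decode );

my $decoded = decode('UTF-8', $request->{VALUE});   # bytes -> characters
my $length  = foo($decoded);                        # now 8, counting characters rather than bytes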

What MySQL calls utf8 is a limited subset of UTF-8 which allows only three bytes per character and covers code points only up to 0xFFFF. Even utf8mb4 doesn't cover the full range of the original UTF-8 design, which allowed encoded characters up to six bytes long.
The consequence is that any data from either a utf8 or a utf8mb4 column is simply a UTF-8 string in Perl, and there should be no difference between the two database encodings.
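For illustration, here is a hypothetical sketch (not from the original answer; it assumes a connected $dbh with RaiseError set and strict SQL mode) of where the two column character sets differ: a supplementary-plane character such as an emoji needs four bytes in UTF-8, which a utf8 column cannot store.
# utf8 columns allow at most 3 bytes per character, so a 4-byte
# character such as an emoji is rejected; utf8mb4 accepts it.
$dbh->do("CREATE TABLE t3 (v VARCHAR(10) CHARACTER SET utf8)");
$dbh->do("CREATE TABLE t4 (v VARCHAR(10) CHARACTER SET utf8mb4)");
$dbh->do("INSERT INTO t4 VALUES (?)", undef, "\x{1F600}");   # works
$dbh->do("INSERT INTO t3 VALUES (?)", undef, "\x{1F600}");   # dies: Incorrect string value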
It is my guess that you haven't enabled UTF-8 for your DBI handle, so everything is being treated as just a sequence of bytes. You should enable the mysql_enable_utf8 option when you make the connect call, which should then look something like this:
my $dbh = DBI->connect($dsn, $user, $password, { mysql_enable_utf8 => 1 });
With the additional data, I can see that the string you are retrieving from the database is indeed שלום עולם, UTF-8-encoded.
However, if I decode it, then first of all I get a non-space character count of 8 from both your foo subroutine and my own, not 9; and you should also be getting characters back from the database, not bytes.
I suspect that you may have written an encoded string to the database in the first place. Here's a short program that creates a MySQL table, writes two records to it (one character string and one encoded string) and retrieves what it has written. You will see that the only thing that makes a difference is the setting of mysql_enable_utf8. The behaviour is the same whether or not the original string is encoded, and with or without SET NAMES utf8mb4.
Further experimentation showed that either mysql_enable_utf8 or SET NAMES utf8mb4 will get DBI to write the data correctly, but the latter has no effect on reading.
I suggest that your solution should be to use ONLY mysql_enable_utf8, for both reading and writing.
You should also put use utf8; at the top of all your programs. Leaving it out means you can't use any non-ASCII characters in your code.
use utf8;
use strict;
use warnings;

use DBI;

use open qw/ :std :encoding(utf-8) /;

STDOUT->autoflush;

# The UTF-8-encoded bytes of "שלום עולם", as dumped in the question
my $VAR1 = "\327\251\327\234\327\225\327\235 \327\242\327\225\327\234\327\235";

my $dbh = DBI->connect(
    qw/ DBI:mysql:database=temp admin admin /, {
        RaiseError        => 1,
        PrintError        => 0,
        mysql_enable_utf8 => 1,
    }
) or die $DBI::errstr;

$dbh->do('SET NAMES utf8mb4');

$dbh->do('DROP TABLE IF EXISTS temp');
$dbh->do('CREATE TABLE temp (value VARCHAR(64) CHARACTER SET utf8mb4)');

my $insert = $dbh->prepare('INSERT INTO temp (value) VALUES (?)');
$insert->execute('שלום עולם');   # character string
$insert->execute($VAR1);         # encoded byte string

my $values = $dbh->selectcol_arrayref('SELECT value FROM temp');

printf "string: %s foo: %d\n", $_, foo($_) for @$values;

# Equivalent count using tr/// in scalar context (counts non-space characters)
sub foo2 {
    $_[0] =~ tr/ //c;
}

sub foo {
    length join '', split / /, $_[0];
}
output with mysql_enable_utf8 => 1
string: שלום עולם foo: 8
string: שלום עולם foo: 8
output with mysql_enable_utf8 => 0
string: ש××× ×¢××× foo: 16
string: ש××× ×¢××× foo: 16

Related

Why does reading JSON from database with Perl warn about "wide characters"?

I'm reading data from a database using Perl (I'm new to Perl); one of the columns is a JSON array. The problem I'm having is that when I try to read the data in the JSON I get the error "Wide character in subroutine entry".
Table:
id | name | date | data
Sample data:
{ "duration":"24", "name":"My Test","visible":"1" }
use JSON qw(decode_json);

my $current_connection = $DATABASE_CONNECTION->prepare("SELECT * FROM data WHERE syt = 'area1'");
$current_connection->execute();

while (my $db_data = $current_connection->fetchrow_hashref())
{
    my $name   = $db_data->{name};
    my $object = decode_json($db_data->{data});
    foreach my $key (sort keys %{$object}) {
        $pages .= "<p> $object->{$key} </p>";
    }
}
That error means a character greater than 255 was passed to a sub expecting a string of bytes.
When stored in the database, the string is encoded using some character encoding, possibly UTF-8. You appear to have decoded the text (e.g. by using mysql_enable_utf8mb4 => 1), producing a string of Unicode Code Points. However, decode_json expects UTF-8.
The solution is to use from_json or JSON->new->decode instead of decode_json; these expect decoded text (a string of UCP).
You can verify that this is the issue using sprintf '%vX', $json.
For example:
If you get E9 and 2660 for "é" and "♠", that's their UCP: use from_json or JSON->new->decode.
If you get C3.A9 and E2.99.A0 for "é" and "♠", that's their UTF-8: use decode_json or JSON->new->utf8->decode.
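A self-contained sketch of that check (the sample strings here are illustrative, not from the question):
use Encode qw( encode );

my $char_string = "\x{E9} \x{2660}";               # decoded text: é ♠ (code points)
my $byte_string = encode('UTF-8', $char_string);   # the same text as UTF-8 bytes

printf "%vX\n", $char_string;   # E9.20.2660        -> decoded: use from_json
printf "%vX\n", $byte_string;   # C3.A9.20.E2.99.A0 -> encoded: use decode_json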

I need to extract Data from a single line of json-data which is inbetween two variables (Powershell)

My variables:
In front of the data:
DeviceAddresses":[{"Id":
After the data:
,"
I tried this, but there must be some error because of all the special characters I'm using:
$devicepattern = {DeviceAddresses":[{"Id":{.*?},"}
#$deviceid = [regex]::match($changeduserdata, $devicepattern).Groups[1].Value
#$deviceid
As you've found, some character literals can't be used as-is in a regex pattern because they carry special meaning - we call these meta-characters.
In order to match the corresponding character literal in an input string, we need to escape it with \ -
to match a literal (, we use the escape sequence \(,
for a literal }, we use \}, and so on...
Fortunately, you don't need to know or remember which ones are meta-characters or escapable sequences - we can use Regex.Escape() to escape all the special character literals in a given pattern string:
$prefix = [regex]::Escape('DeviceAddresses":[{"Id":')
$capture = '(.*?)'
$suffix = [regex]::Escape(',"')
$devicePattern = "${prefix}${capture}${suffix}"
You also don't need to call [regex]::Match directly; PowerShell will populate the automatic $Matches variable with match groups whenever a scalar -match succeeds:
if($changeduserdata -match $devicePattern){
$deviceid = $Matches[1]
} else {
Write-Error 'DeviceID not found'
}
For reference, the following ASCII literals need to be escaped in .NET's regex grammar:
$ ( ) * + . ? [ \ ^ { |
Additionally, # and the regular space character need to be escaped, and a number of other whitespace characters have to be translated to their respective escape sequences to make patterns safe for use with the IgnorePatternWhitespace option (this is not applicable to your current scenario):
\u0009 => '\t' # Tab
\u000A => '\n' # Line Feed
\u000C => '\f' # Form Feed
\u000D => '\r' # Carriage Return
... all of which Regex.Escape() takes into account for you :)
To complement Mathias R. Jessen's helpful answer:
Generally, note that JSON data is much easier to work with - and processed more robustly - if you parse it into objects whose properties you can access - see the bottom section.
As for your regex attempt:
Note: The following also applies to all PowerShell-native regex features, such as the -match, -replace, and -split operators, the switch statement, and the Select-String cmdlet.
Mathias' answer uses [regex]::Escape() to escape the parts of the regex pattern to be used verbatim by the regex engine.
This is unequivocally the best approach if those verbatim parts aren't known in advance - e.g., when provided via a variable or expression, or passed as an argument.
However, in a regex pattern that is specified as a string literal it is often easier to individually \-escape the regex metacharacters, i.e. those characters that would otherwise have special meaning to the regex engine.
The list of characters that need escaping is (it can be inferred from the .NET Regular-Expression Quick Reference):
\ ( ) | . * + ? ^ $ [ {
If you enable the IgnorePatternWhiteSpace option (which you can do inline with (?x) at the start of a pattern), you'll additionally have to \-escape:
#
significant whitespace characters (those you actually want matched) specified verbatim (e.g., ' ', or via string interpolation, "`t"); this does not apply to those specified via escape sequences (e.g., \t or \n).
Therefore, the solution could be simplified to:
# Sample JSON
$changeduserdata = '{"DeviceAddresses":[{"Id": 42,"More": "stuff"}]}'
# Note how [ and { are \-escaped
$deviceId = if ($changeduserdata -match 'DeviceAddresses":\[\{"Id":(.*?),"') {
$Matches[1]
}
Using ConvertFrom-Json to properly parse JSON into objects is both more robust and more convenient, as it allows property access (dot notation) to extract the value of interest:
# Sample JSON
$changeduserdata = '{"DeviceAddresses":[{"Id": 42,"More": "stuff"}]}'
# Convert to an object ([pscustomobject]) and drill down to the property
# of interest; note that the value of .DeviceAddresses is an *array* ([...]).
$deviceId = (ConvertFrom-Json $changeduserdata).DeviceAddresses[0].Id # -> 42

Perl fetchrow_hashref results differ: integer vs. string values

I really need your help understanding the following Perl example code:
#!/usr/bin/perl
# Hashtest

use strict;
use DBI;
use DBIx::Log4perl;
use Data::Dumper;
use utf8;

if (my $dbh = DBIx::Log4perl->connect("DBI:mysql:myDB", "myUser", "myPassword", {
        RaiseError        => 1,
        PrintError        => 1,
        AutoCommit        => 0,
        mysql_enable_utf8 => 1
    }))
{
    my $data = undef;

    my $sql_query = <<EndOfSQL;
SELECT 1
EndOfSQL

    my $out = $dbh->prepare($sql_query);
    $out->execute() or exit(0);

    my $row = $out->fetchrow_hashref();
    $out->finish();

    # Debugging
    print Dumper($row);

    $dbh->disconnect;
    exit(0);
}
1;
If I run this code on two machines, I get different results.
Result on machine 1 (the result I need, with an integer value):
arties#p51s:~$ perl hashTest.pl
Log4perl: Seems like no initialization happened. Forgot to call init()?
$VAR1 = {
'1' => 1
};
Result on machine 2 (the result that makes trouble, because of the string value):
arties#core3:~$ perl hashTest.pl
Log4perl: Seems like no initialization happened. Forgot to call init()?
$VAR1 = {
'1' => '1'
};
As you can see, on machine 1 the value from MySQL is interpreted as an integer value, and on machine 2 as a string value.
I need the integer value on both machines. And it is not possible to modify the hash later, because the original code has too many values that would have to be changed...
Both machines use DBI 1.642 and DBIx::Log4perl 0.26.
The only difference is the Perl version: machine 1 (v5.26.1) vs. machine 2 (v5.14.2).
So the big question is, how can I make sure I always get the integer in the hash as the result?
Update 10.10.2019:
To show the problem better, I have improved the above example:
...
use Data::Dumper;
use JSON; # <-- Inserted
use utf8;
...
...
print Dumper($row);
# JSON Output
print JSON::to_json($row)."\n"; # <-- Inserted
$dbh->disconnect;
...
Now the output on machine 1, with the JSON output as the last line:
arties#p51s:~$ perl hashTest.pl
Log4perl: Seems like no initialization happened. Forgot to call init()?
$VAR1 = {
'1' => 1
};
{"1":1}
And the output on machine 2:
arties#core3:~$ perl hashTest.pl
$VAR1 = {
'1' => '1'
};
{"1":"1"}
You can see that both Data::Dumper and JSON show the same behaviour. And as I wrote before, +0 is not an option because the original hash is much more complex.
Both machines use JSON 4.02.
@Nick P: That's the solution you linked ("Why does DBI implicitly change integers to strings?"): the DBD::mysql version was different on the two systems! So I upgraded machine 2 from version 4.020 to version 4.050, and now both systems give the same result. And integers are integers ;-)
So the result on both machines is now:
$VAR1 = {
'1' => 1
};
{"1":1}
Thank you!
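For anyone comparing setups the same way: the installed DBD::mysql version can be checked with a one-liner like the following (a quick diagnostic, not part of the original thread).
perl -MDBD::mysql -le 'print $DBD::mysql::VERSION'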

How to convert \u0421 into letter "C"?

I made a POST query to a server and got JSON back. It contains a wrong symbol: instead of "Correct" I got "\u0421orrect". How can I encode this text?
A parse_json function renders it as "РЎorrect".
I found out that
$a = "\x{0421}orrect";
$a= encode("utf-8", $a);
returns "РЎorrect", and
$a = "\x{0421}orrect";
$a= encode("cp1251", $a);
returns "Correct"
So I've decided to change \u to \x and then to use cp1251.
\u to \x
I wrote:
Encode::Escape::enmode 'unicode-escape', 'perl';
Encode::Escape::demode 'unicode-escape', 'python';
$content= encode 'unicode-escape', decode 'unicode-escape', $content;
and got \x{0421}orrect.
And then I tried:
$content = encode( 'cp1251', $content );
And... nothing changed! I still have \x{0421}orrect...
I noticed something interesting:
$a = "\x{0421}orrect";
$a= encode("cp1251", $a);
returns "Correct"
BUT
$a = '\x{0421}orrect';
$a= encode("cp1251", $a);
still returns "\x{0421}orrect".
Maybe this is a key, but I don't know what I can do with this.
I've already tried encode and decode, Encode::from_to, JSON::XS and utf8.
You mention escaping multiple times, but you want to do the opposite (unescape).
decode_json/from_json will correctly return "Сorrect" (where the "С" is CYRILLIC CAPITAL LETTER ES).
use JSON::XS qw( decode_json );
my $json_utf8 = '{"value":"\u0421orrect"}';
my $data = decode_json($json_utf8);
You do need to encode your outputs, though. For example, if you have a Cyrillic-based Windows system and you wanted to create a native file, you could use:
open(my $fh, '>:encoding(cp1251)', $qfn)
or die("Can't create \"$qfn\": $!\n");
say $fh $data->{value};
If you want to hardcode the encoding, or if you're interested in encoding the output to STDOUT and STDERR as well, check out this.
Apologies if you realise this already - I just think it's worth pointing out so we're all on the same page.
Character number \x{0421} has the description "CYRILLIC CAPITAL LETTER ES" and looks like this: С
Character number \x{0043} has the description "LATIN CAPITAL LETTER C" and looks like this: C
So depending on the font you're using, it's entirely likely that the two characters appear identical.
You asked "How can I encode this text?" but you didn't explain what you mean by that or why you want to "encode" it. There is no encoding that will convert 'С' (\x{0421}) into 'C' (\x{0043}) - they are two different characters from two different alphabets.
So the question is, what are you trying to achieve? Are you trying to check if the string returned from the server matched "Correct"? If so, that simply won't work, because the server is returning the string "Сorrect". They might look the same, but they are two different strings.
It's possible that the whole situation is an error in the server code and it should be returning "Correct". If that is the case and you can't rely on the server reliably returning "Correct", one workaround would be to use a character replacement to "normalise" the string before you inspect its contents. For example:
use feature qw( say );
use JSON::XS qw( decode_json );

# Single-quoted heredoc so the \u0421 escape reaches the JSON parser literally
my $response = <<'EOF';
{
"status": "\u0421orrect"
}
EOF

my $data   = decode_json($response);
my $status = $data->{status};

# Normalise: replace CYRILLIC CAPITAL LETTER ES with LATIN CAPITAL LETTER C
$status =~ tr/\x{0421}/C/;

if ($status eq "Correct") {
    say "The status is correct";
}
else {
    say "The status is not correct";
}
This code will work now, and in the future if the server code is fixed to return "Correct".

Why is JSON::XS Not Generating Valid UTF-8?

I'm getting some corrupted JSON and I've reduced it down to this test case.
use utf8;
use 5.18.0;

use Test::More;
use Test::utf8;
use JSON::XS;

BEGIN {
    # damn it
    my $builder = Test::Builder->new;
    foreach (qw/output failure_output todo_output/) {
        binmode $builder->$_, ':encoding(UTF-8)';
    }
}

foreach my $string ( 'Deliver «French Bread»', '日本国' ) {
    my $hashref = { value => $string };
    is_sane_utf8 $string, "String: $string";
    my $json = encode_json($hashref);
    is_sane_utf8 $json, "JSON: $json";
    say STDERR $json;
}

diag ord('»');

done_testing;
And this is the output:
utf8.t ..
ok 1 - String: Deliver «French Bread»
not ok 2 - JSON: {"value":"Deliver «French Bread»"}
# Failed test 'JSON: {"value":"Deliver «French Bread»"}'
# at utf8.t line 17.
# Found dodgy chars "<c2><ab>" at char 18
# String not flagged as utf8...was it meant to be?
# Probably originally a LEFT-POINTING DOUBLE ANGLE QUOTATION MARK char - codepoint 171 (dec), ab (hex)
{"value":"Deliver «French Bread»"}
ok 3 - String: 日本国
ok 4 - JSON: {"value":"æ¥æ¬å½"}
1..4
{"value":"日本国"}
# 187
So the string containing guillemets («») is valid UTF-8, but the resulting JSON is not. What am I missing? The utf8 pragma is correctly marking my source as UTF-8. Further, that trailing 187 is from the diag. That's less than 255, so it almost looks like a variant of the old Unicode bug in Perl. (And the test output still looks like crap; I never could quite get that right with Test::Builder.)
Switching to JSON::PP produces the same output.
This is Perl 5.18.1 running on OS X Yosemite.
is_sane_utf8 doesn't do what you think it does. You're supposed to pass strings you've decoded to it. I'm not sure what the point of it is, but it's not the right tool. If you want to check whether a string is valid UTF-8, you could use:
use Encode qw( decode_utf8 );

ok(eval { decode_utf8($string, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 },
    '$string is valid UTF-8');
To show that JSON::XS is correct, let's look at the sequence is_sane_utf8 flagged.
        +---------------------- Start of two-byte sequence
        |    +----------------- Not zero (good)
        |    |     +----------- Continuation byte indicator (good)
        |    |     |
        v    v     v
C2 AB = [110]00010 [10]101011

00010 101011 = 000 1010 1011 = U+00AB = «
The following shows that JSON::XS produces the same output as Encode.pm:
use utf8;
use 5.18.0;

use JSON::XS;
use Encode;

foreach my $string ('Deliver «French Bread»', '日本国') {
    my $hashref = { value => $string };
    say(sprintf("Input: U+%v04X", $string));
    say(sprintf("UTF-8 of input: %v02X", encode_utf8($string)));
    my $json = encode_json($hashref);
    say(sprintf("JSON: %v02X", $json));
    say("");
}
Output (with some spaces added):
Input: U+0044.0065.006C.0069.0076.0065.0072.0020.00AB.0046.0072.0065.006E.0063.0068.0020.0042.0072.0065.0061.0064.00BB
UTF-8 of input: 44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB
JSON: 7B.22.76.61.6C.75.65.22.3A.22.44.65.6C.69.76.65.72.20.C2.AB.46.72.65.6E.63.68.20.42.72.65.61.64.C2.BB.22.7D
Input: U+65E5.672C.56FD
UTF-8 of input: E6.97.A5.E6.9C.AC.E5.9B.BD
JSON: 7B.22.76.61.6C.75.65.22.3A.22.E6.97.A5.E6.9C.AC.E5.9B.BD.22.7D
JSON::XS is generating valid UTF-8, but you're using the resulting UTF-8 encoded byte strings in two different contexts that expect character strings.
Issue 1: Test::utf8
Here are the two main situations when is_sane_utf8 will fail:
You have a miscoded character string that was decoded from a UTF-8 byte string as if it were Latin-1, or decoded from double-encoded UTF-8; or the character string is perfectly fine and merely looks like a potentially "dodgy" miscoding (using the terminology from its docs).
You have a valid UTF-8 byte string containing the encoded code points U+0080 through U+00FF, for example «French Bread».
The is_sane_utf8 test is intended only for character strings and has the documented potential for false negatives.
Issue 2: Output Encoding
All of your non-JSON strings are character strings while your JSON strings are UTF-8 encoded byte strings, as returned from the JSON encoder. Since you're using the :encoding(UTF-8) PerlIO layer for TAP output, the character strings are being implicitly encoded to UTF-8 with good results, while the byte strings containing JSON are being double encoded. STDERR however does not have an :encoding PerlIO layer set, so the encoded JSON byte strings look good in your warnings since they're already encoded and being passed straight out.
Only use the :encoding(UTF-8) PerlIO layer for IO with character strings, as opposed to the UTF-8 encoded byte strings returned by default from the JSON encoder.
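A minimal sketch of the two consistent pairings (the file names and handles here are hypothetical, not from the question):
use utf8;
use JSON::XS ();

my $hashref = { value => 'Deliver «French Bread»' };

# Pairing 1: byte-oriented handle + UTF-8-encoded JSON (->utf8 enabled)
open my $raw_fh, '>', 'out_bytes.json' or die $!;
print {$raw_fh} JSON::XS->new->utf8->encode($hashref);

# Pairing 2: :encoding(UTF-8) handle + character-string JSON (->utf8 disabled)
open my $char_fh, '>:encoding(UTF-8)', 'out_chars.json' or die $!;
print {$char_fh} JSON::XS->new->encode($hashref);
Both files end up with identical, valid UTF-8 bytes; mixing the pairings is what produces double-encoded or mojibake output.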