Normalization on utf8 filenames stored in JSON with perl - json

I have two Json files which come from different OSes.
Both files are encoded in UTF-8 and contain UTF-8 encoded filenames.
One file comes from OS X and the filename is in NFD form: (od -bc)
0000160 166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145
v e t l a ́ ** / H o u s e m e
the second contains the same filename but in NFC form:
000760 166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163
v e t l á ** / H o u s e m e s
As I have learned, this is called 'different normalization', and there is an CPAN module Unicode::Normalize for handling it.
I'm reading both files with the next:
my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ;
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;
The read_file is from File::Slurp and decode_json from the JSON::XS.
Reading the JSON into perl structure, from one json file the filename comes into key position and from the second file comes into the values. I need to search when the hash key from the 1st hash is equvalent to a value from the second hash, so need ensure than they are "binary" identical.
Tried the next:
grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
and
grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
produces for me the same output.
Now the questions:
How to simply read both json files to get the same normalization into the both $hashrefs?
or need after the decode_json run someting like on both hashes?
while(my($k,$v) = each(%$json1)) {
$copy->{ NFD($k) } = NFD($v);
}
In short:
How to read different JSON files to get the same normalization 'inside' the perl $href? It is possible to achieve somewhat nicer as explicitly doing NFD on each key value and creating another NFD normalized (big) copy of the hashes?
Some hints, suggestions - pleae...
Because my english is very bad, here is a simulation of the problem
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use Data::Dumper;
use JSON::XS;
#Creating two files what contains different "normalizations"
my($nfc, $nfd);;
$nfc->{ NFC('key') } = NFC('vál');
$nfd->{ NFD('vál') } = 'something';
#save as NFC - this comes from "FreeBSD"
my $jnfc = JSON::XS->new->encode($nfc);
open my $fd, ">:utf8", "nfc.json" or die("nfc");
print $fd $jnfc;
close $fd;
#save as NFD - this comes from "OS X"
my $jnfd = JSON::XS->new->encode($nfd);
open $fd, ">:utf8", "nfd.json" or die("nfd");
print $fd $jnfd;
close $fd;
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
say $jd->{ $jc->{key} } // "NO FOUND"; #wanted to print "something"
my $jc2;
#is here a better way to DO THIS?
while(my($k,$v) = each(%$jc)) {
$jc2->{ NFD($k) } = NFD($v);
}
say $jd->{ $jc2->{key} } // "NO FOUND"; #OK

While searching the right solution for your question i discovered: the software is c*rp :) See: https://stackoverflow.com/a/17448888/632407 .
Anyway, found the solution for your particular question - how to read json with filenames regardless of normalization:
instead of your:
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
use the next:
#now read them
my $jc = get_json_from_utf8_file('nfc.json') ;
my $jd = get_json_from_utf8_file('nfd.json') ;
...
sub get_json_from_utf8_file {
my $file = shift;
return
decode_json #let parse the json to perl
encode 'utf8', #the decode_json want utf8 encoded binary string, encode it
NFC #conv. to precomposed normalization - regardless of the source
read_file #your file contains utf8 encoded text, so read it correctly
$file, { binmode => ':utf8' } ;
}
This should (at least i hope) ensure than regardles what decomposition uses the JSON content, the NFC will convert it to precomposed version and the JSON:XS will read parse it correctly to the same internal perl structure.
So your example prints:
something
without traversing the $json
The idea comes from Joseph Myers and Nemo ;)
Maybe some more skilled programmers will give more hints.

Even though it may be important right now only to convert a few file names to the same normalization for comparison, other unexpected problems could arise from almost anywhere if JSON data has a different normalization.
So my suggestion is to normalize the entire input from both sources as your first step before doing any parsing (i.e., at the same time you read the file and before decode_json). This should not corrupt any of your JSON structures since those are delimited using ASCII characters. Then your existing perl code should be able to blindly assume all UTF8 characters have the same normalization.
$rawdata1 = read_file($file1, {binmode => ':raw'}) or die "...";
$rawdata2 = read_file($file2, {binmode => ':raw'}) or die "...";
my $json1 = decode_json NFD($rawdata1);
my $json2 = decode_json NFD($rawdata2);
To make this process slightly faster (it should be plenty fast already, since the module uses fast XS procedures), you can find out whether one of the two data files is already in a certain normalization form, and then leave that file unchanged, and convert the other file into that form.
For example:
$rawdata1 = read_file($file1, {binmode => ':raw'}) or die "...";
$rawdata2 = read_file($file2, {binmode => ':raw'}) or die "...";
if (checkNFD($rawdata1)) {
# then you know $file1 is already in Normalization Form D
# (i.e., it was formed by canonical decomposition).
# so you only need to convert $file2 into NFD
$rawdata2 = NFD($rawdata2);
}
my $json1 = decode_json $rawdata1;
my $json2 = decode_json $rawdata2;
Of course, you would naturally have to experiment now in the development time to see if one or other of the input files is already in a normalized form, and then in your final version of the code, you would no longer need a conditional statement, but simply convert the other input file into the same normalized form.
Also note that it is suggested to produce output in NFC form (if your program produces any output that would be stored and used later). See here, for example: http://www.perl.com/pub/2012/05/perlunicookbook-unicode-normalization.html

Hm. I can't advice you some better "programming" solution. But why simply doesn't run
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < freebsd.json >bsdok.json
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < osx.json >osxok.json
and now your script can read and use both because they are both in the same normalisation? So instead searching for som programming solution inside of your script, solve the problem before entering to the script. (The second command is unnecessary - simple convert on the file level. Sure is more easy as traversing data structures...

Instead of traversing the data structure manually, let a module handle this for you.
Data::Visitor
Data::Rmap
Data::Dmap

Related

Why does reading JSON from database with Perl warn about "wide characters"?

I'm reading data from a database using Perl (I'm new to Perl), one of the columns is a JSON array. The problem I'm having is that when I try to read the data in the JSON I get an error "Wide character in subroutine entry".
Table:
id | name | date | data
Sample data.
{ "duration":"24", "name":"My Test","visible":"1" }
use JSON qw(decode_json);
my $current_connection = $DATABASE_CONNECTION->prepare( "SELECT * FROM data WHERE syt = 'area1' " );
$current_connection->execute();
while( my $db_data = $current_connection->fetchrow_hashref() )
{
my $name = $db_data->{name};
my $object = decode_json($db_data->{data});
foreach my $key (sort keys %{$object}) {
my $result;
$pages .= "<p> $result->{$key}->{name} </p>";
}
}
That error means a character greater than 255 was passed to a sub expecting a string of bytes.
When stored in the database, the string is encoded using some character encoding, possibly UTF-8. You appear to have decoded the text (e.g. by using mysql_enable_utf8mb4 => 1), producing a string of Unicode Code Points. However, decode_json expects UTF-8.
The solution is to use from_json or JSON->new->decode instead of decode_json; these expect decoded text (a string of UCP).
You can verify that this is the issue using sprintf '%vX', $json.
For example,
If you get E9 and 2660 for "é" and "♠",
That's their UCP. Use from_json or JSON->new->decode.
If you get C3.A9 and E2.99.A0 for "é" and "♠",
That's their UTF-8. Use decode_json or JSON->new->utf8->decode.

Using Text::CSV on a String Containing Quotes

I have pored over this site (and others) trying to glean the answer for this but have been unsuccessful.
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, auto_diag => 1 } );
$line = q(data="a=1,b=2",c=3);
my $csvParse = $csv->parse($line);
my #fields = $csv->fields();
for my $field (#fields) {
print "FIELD ==> $field\n";
}
Here's the output:
# CSV_XS ERROR: 2034 - EIF - Loose unescaped quote # rec 0 pos 6 field 1
FIELD ==>
I am expecting 2 array elements:
data="a=1,b=2"
c=3
What am I missing?
You may get away with using Text::ParseWords. Since you are not using real csv, it may be fine. Example:
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
my $line = q(data="a=1,b=2",c=3);
my #fields = quotewords(',', 1, $line);
print Dumper \#fields;
This will print
$VAR1 = [
'data="a=1,b=2"',
'c=3'
];
As you requested. You may want to test further on your data.
Your input data isn't "standard" CSV, at least not the kind that Text::CSV expects and not the kind that things like Excel produce. An entire field has to be quoted or not at all. The "standard" encoding of that would be "data=""a=1,b=2""",c=3 (which you can see by asking Text::CSV to print your expected data using say).
If you pass the allow_loose_quotes option to the Text::CSV constructor, it won't error on your input, but it won't consider the quotes to be "protecting" the comma, so you will get three fields, namely data="a=1, b=2" and c=3.

putting commas separated values into a new file line by line

From command line, we are passing multiple values separated by commas such as sydney,delhi,NY,Russia as an option. These values are getting stored under $runTest in the perl script. Now I want to create a new file under the script with contents of $runTest but as line by line. For example:
INPUT (passed values from command line):
sydney,delhi,NY,Russia
OUTPUT (under new file: myfile):
sydney
delhi
NY
Russia
In this simple example, it is better to use split on a delimiter than tr in such case. A few minor points: use snake_case for names instead of CamelCase, and use autodie to make open, close, etc, fatal, without the need to clutter the code with or die "...":
use autodie;
my $run_test = 'sydney,delhi,NY,Russia';
open my $out, '>', 'myFile';
print {$out} map { "$_\n" } split /,/, $run_test;
close $out;
For more robust parsing in general, beyond this simple example, prefer specialized modules, such as Text::CSV or Text::CSV_XS for csv parsing. Compared to the overly simplistic split, Text::CSV_XS enables correct input/output of quoted fields, fields containing the delimiter (comma), binary characters, provides error messages and more. Example:
use Text::CSV_XS;
use autodie;
open my $out, q{>}, q{myFile};
# All of these input strings are parsed correctly, unlike when using "split":
# my $run_test = q{sydney,delhi,NY,Russia};
# my $run_test = q{sydney,delhi,NY,Russia,"field,with,commas"};
my $run_test = q{sydney,delhi,NY,Russia,"field,with,commas","field,with,missing,quote};
# binary => 1 : enable parsing binary characters in quoted fields.
# auto_diag => 1 : print the internal error code and the associated error message to STDERR.
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
if ( $csv->parse( $run_test ) ) {
print {$out} map { "$_\n" } $csv->fields;
}
else {
print STDERR q{parse() failed on: }, $csv->error_input, qq{\n};
}

Retrieving original JSON code from a decoded array node in Perl

I'm working on a script which receives JSON code for an array of objects similar to this:
{
"array":[
{ "id": 1, "text": "Some text" },
{ "id": 2, "text": "Some text" }
]
}
I decode it using JSON::XS and then filter out some of the results. After this, I need to store the JSON code for each individual node into a queue for later processing. The format this queue requires is JSON too, so the code I'd need to insert for each node would be something like this:
{ "id": 1, "text": "Some text" }
However, after decode_json has decoded a node, all that's left are hash references for each node:
print $json->{'array'}[0]; # Would print something like HASH(0x7ffa80c83270)
I know I could get something similar to the original JSON code using encode_json on the hash reference, but the resulting code is different from the original code, UTF-8 characters get all weird, and it seems like a lot of extra processing, specially considering the amount of data this script has to deal with.
Is there a way to retrieve the original JSON code from a decoded array node? Does JSON::XS keep the original chunks somewhere after they have been decoded?
EDIT
About the weird UTF-8 characters, they just look weird on the screen:
#!/usr/bin/perl
use utf8;
use JSON::XS;
binmode STDOUT, ":utf8";
$old_json = '{ "text": "Drag\u00f3n" }';
$json = decode_json($old_json);
print $json->{'text'}; # Dragón
$new_json = encode_json($json);
print $new_json; # {"text":"Dragón"}
$json = decode_json($new_json);
print $json->{'text'}; # Dragón
encode_json will produce equivalent JSON to what you originally had before you decoded it with decode_json. Characters encoded using UTF-8 do not get all weird.
$ cat a.pl
use Encode qw( encode_utf8 );
use JSON::XS qw( decode_json encode_json );
my $json = encode_utf8(qq!{"name":"\x{C9}ric" }!);
print($json, "\n");
print(encode_json(decode_json($json)), "\n");
$ perl a.pl | od -c
0000000 { " n a m e " : " 303 211 r i c "
0000020 } \n { " n a m e " : " 303 211 r i c
0000040 " } \n
0000043
If you want a parser that preserves the original JSON, you'll surely have to write your own; the existing ones don't do that.
No, it doesn't exist anywhere. The "original JSON" isn't stored per-element; it's decoded in a single pass.
No, this is not possible. Every JSON object can have multiple, but equivalent representations:
{ "key": "abc" }
and
{
"key" : "abc"
}
are pretty much the same.
So just use the re-encoded JSON your module gives you.
Even if JSON::XS caches the chunks, extracting them would be a breach of encapsulation, therefore having no guarantee of working if the module is upgraded. And it is bad design.
Don't care about performance. The XS modules have exceptional performance, as they are coded in C. And if you were paranoid about performance, you wouldn't use JSON but some binary format. And you wouldn't be using Perl, but Fortran ;-)
You should treat equivalent data as equivalent data. Even if the presentation is different.
If the unicode chars look weird, but process fine, there is no problem. If they don't get processed correctly, you might have to specify an exact encoding.

I am converting a json file returned from the server into perl data structures

I am able to convert a hard coded json string into perl hashes however if i want to convert a complete json file into perl data structures which can be parsed later in any manner, I am getting the folloring error.
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)") at json_vellai.pl line 9
use JSON::PP;
$json= JSON::PP->new()
$json = $json->allow_singlequote([$enable]);
open (FH, "jsonsample.doc") or die "could not open the file\n";
#$fileContents = do { local $/;<FH>};
#fileContents = <FH>;
#print #fileContents;
$str = $json->allow_barekey->decode(#filecontents);
foreach $t (keys %$str)
{
print "\n $t -- $str->{$t}";
}
This is how my code looks .. plz help me out
It looks to me like decode doesn't want a list, it wants a scalar string.
You could slurp the file:
undef $/;
$fileContents = <FH>;