perl & python writing out non-ASCII characters into JSON differently - json

I have hash keys that look like this:
1я310яHOM_REF_truth:HOM_ALT_test:discordant_hom_ref_to_hom_altяAяC
this is a string that is joined by the Cyrillic letter я which I chose as a delimiter because it will never appear in this files.
I write this to a JSON file in Perl 5.30.2 thus:
use JSON 'encode_json';
sub hash_to_json_file {
my $hash = shift;
my $filename = shift;
my $json = encode_json $hash;
open my $out, '>', $filename;
say $out $json
}
and in python 3.8:
use json
def hash_to_json_file(hashtable,filename):
json1=json.dumps(hashtable)
f = open(filename,"w+")
print(json1,file=f)
f.close()
when I try to load a JSON written by Python back into a Perl script, I see a cryptic error that I don't know how to solve:
Wide character in say at read_json.pl line 27.
Reading https://perldoc.perl.org/perlunifaq.html I've tried adding use utf8 to my script, but it doesn't work.
I've also tried '>:encoding(UTF-8)' within my subroutine, but the same error results.
Upon inspection of the JSON files, I see keys like "1Ñ180ÑHET_ALT_truth:HET_REF_test:discordant_het_alt_to_het_refÑAÑC,G" where ÑAÑ substitutes я. In the JSON written by python, I see \u044f I think that this is the wide character, but I don't know how to change it back.
I've also tried changing my subroutine:
use Encode 'decode';
sub json_file_to_hash {
my $file = shift;
open my $in, '<:encoding(UTF-8)', $file;
my $json = <$in>;
my $ref = decode_json $json;
$ref = decode('UTF-8', $json);
return %{ $ref }
}
but this gives another error:
Wide character in hash dereference at read_json.pl line 17, <$_[...]> line 1
How can I get python JSON read into Perl correctly?

use utf8; # Source is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # For say to STDOUT. Also default for open()
use JSON qw( decode_json encode_json );
sub hash_to_json_file {
my $qfn = shift;
my $ref = shift;
my $json = encode_json($ref); # Produces UTF-8
open(my $fh, '>:raw', $qfn) # Write it unmangled
or die("Can't create \"$qfn\": $!\n");
say $fh $json;
}
sub json_file_to_hash {
my $qfn = shift;
open(my $fh, '<:raw', $qfn) # Read it unmangled
or die("Can't create \"$qfn\": $!\n");
local $/; # Read whole file
my $json = <$fh>; # This is UTF-8
my $ref = decode_json($json); # This produces decoded text
return $ref; # Return the ref rather than the keys and values.
}
my $src = { key => "1я310яHOM_REF_truth:HOM_ALT_test:discordant_hom_ref_to_hom_altяAяC" };
hash_to_json("a.json", $src);
my $dst = hash_to_json("a.json");
say $dst->{key};
You could also avoid using :raw by using from_json and to_json.
use utf8; # Source is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # For say to STDOUT. Also default for open()
use JSON qw( from_json to_json );
sub hash_to_json_file {
my $qfn = shift;
my $hash = shift;
my $json = to_json($hash); # Produces decoded text.
open(my $fh, '>', $qfn) # "use open" will add :encoding(UTF-8)
or die("Can't create \"$qfn\": $!\n");
say $fh $json; # Encoded by :encoding(UTF-8)
}
sub json_file_to_hash {
my $qfn = shift;
open(my $fh, '<', $qfn) # "use open" will add :encoding(UTF-8)
or die("Can't create \"$qfn\": $!\n");
local $/; # Read whole file
my $json = <$fh>; # Decoded text thanks to "use open".
my $ref = from_json($json); # $ref contains decoded text.
return $ref; # Return the ref rather than the keys and values.
}
my $src = { key => "1я310яHOM_REF_truth:HOM_ALT_test:discordant_hom_ref_to_hom_altяAяC" };
hash_to_json("a.json", $src);
my $dst = hash_to_json("a.json");
say $dst->{key};

I like the ascii option so that the JSON output is all 7-bit ASCII
my $json = JSON->new->ascii->encode($hash);
Both the Perl and Python JSON modules will be able to read it.

Related

Escape special characters in JSON string

I have Perl script which contains variable $env->{'arguments'}, this variable should contain a JSON object and I want to pass that JSON object as argument to my other external script and run it using backticks.
Value of $env->{'arguments'} before escaping:
$VAR1 = '{"text":"This is from module and backslash \\ should work too"}';
Value of $env->{'arguments'} after escaping:
$VAR1 = '"{\\"text\\":\\"This is from module and backslash \\ should work too\\"}"';
Code:
print Dumper($env->{'arguments'});
escapeCharacters(\$env->{'arguments'});
print Dumper($env->{'arguments'});
my $command = './script.pl '.$env->{'arguments'}.'';
my $output = `$command`;
Escape characters function:
sub escapeCharacters
{
#$env->{'arguments'} =~ s/\\/\\\\"/g;
$env->{'arguments'} =~ s/"/\\"/g;
$env->{'arguments'} = '"'.$env->{'arguments'}.'"';
}
I would like to ask you what is correct way and how to parse that JSON string into valid JSON string which I can use as argument for my script.
You're reinventing a wheel.
use String::ShellQuote qw( shell_quote );
my $cmd = shell_quote('./script.pl', $env->{arguments});
my $output = `$cmd`;
Alternatively, there's a number of IPC:: modules you could use instead of qx. For example,
use IPC::System::Simple qw( capturex );
my $output = capturex('./script.pl', $env->{arguments});
Because you have at least one argument, you could also use the following:
my $output = '';
open(my $pipe, '-|', './script.pl', $env->{arguments});
while (<$pipe>) {
$output .= $_;
}
close($pipe);
Note that current directory isn't necessarily the directory that contains the script that executing. If you want to executing script.pl that's in the same directory as the currently executing script, you want the following changes:
Add
use FindBin qw( $RealBin );
and replace
'./script.pl'
with
"$RealBin/script.pl"
Piping it to your second program rather than passing it as an argument seems like it would make more sense (and be a lot safer).
test1.pl
#!/usr/bin/perl
use strict;
use JSON;
use Data::Dumper;
undef $/;
my $data = decode_json(<>);
print Dumper($data);
test2.pl
#!/usr/bin/perl
use strict;
use IPC::Open2;
use JSON;
my %data = ('text' => "this has a \\backslash", 'nums' => [0,1,2]);
my $json = JSON->new->encode(\%data);
my ($chld_out, $chld_in);
print("Executing script\n");
my $pid = open2($chld_out, $chld_in, "./test1.pl");
print $chld_in "$json\n";
close($chld_in);
my $out = do {local $/; <$chld_out>};
waitpid $pid, 0;
print(qq~test1.pl output =($out)~);

Perl encoding from JSON issue

apologies if this is a really stupid question or already asked elsewhere. I'm reading in some JSON and using decode_json on it, then extracting text from it and outputting that to a file.
My problem is that Unicode characters are encoded as eg \u2019 in the JSON, decode_json appears to convert this to \x{2019}. When I grab this text and output to a UTF8-encoded file, it appears as garbage.
Sample code:
use warnings;
use strict;
use JSON qw( decode_json );
use Data::Dumper;
open IN, $file or die;
binmode IN, ":utf8";
my $data = <IN>;
my $json = decode_json( $data );
open OUT, ">$outfile" or die;
binmode OUT, ":utf8";
binmode STDOUT, ":utf8";
foreach my $textdat (#{ $json->{'results'} }) {
print STDOUT Dumper($textdat);
my $text = $textdat->{'text'};
print OUT "$text\n";
}
The Dumper output shows that the \u encoding has been converted to \x encoding. What am I doing wrong?
decode_json needs UTF-8 encoded input, so use from_json instead that accepts unicode:
my $json = from_json($data);
Another option would be to encode the data yourself:
use Encode;
my $encoded_data = encode('UTF-8', $data);
...
my $json = decode_json($data);
But it makes little sense to encode data just to decode it.
decode_json expects UTF-8, but you're passing decoded text (Unicode Code Points) instead.
So, you could remove the existing character decoding.
use feature qw( say );
use open 'std', ':encoding(UTF-8)';
use JSON qw( decode_json );
my $json_utf8 = do {
open(my $fh, '<:raw', $in_qfn)
or die("Can't open \"$in_qfn\": $!\n");
local $/;
<$fh>;
};
my $data = decode_json($json_utf8);
{
open(my $fh, '>', $out_qfn)
or die("Can't create \"$out_qfn\": $!\n");
for my $result (#{ $data->{results} }) {
say $fh $result->{text};
}
}
Or, you could use from_json (or JSON->new->decode) instead of decode_json.
use feature qw( say );
use open 'std', ':encoding(UTF-8)';
use JSON qw( from_json ); # <---
my $json_ucp = do {
open(my $fh, '<', $in_qfn) # <---
or die("Can't open \"$in_qfn\": $!\n");
local $/;
<$fh>;
};
my $data = from_json($json_ucp); # <---
{
open(my $fh, '>', $out_qfn)
or die("Can't create \"$out_qfn\": $!\n");
for my $result (#{ $data->{results} }) {
say $fh $result->{text};
}
}
The arrows point to the three minor differences between the two snippets.
I made a number of cleanups.
Missing local $/; in case there are line breaks in the JSON.
Don't use 2-arg open.
Don't needlessly use global variables.
Use better names for variables. $data and $json were notably reversed, and $file didn't contain a file.
Limit the scope of your variables, especially if they use up system resources (e.g. file handles).
Use :encoding(UTF-8) (the standard encoding) instead of :encoding(utf8) (an encoding only used by Perl). :utf8 is even worse as it uses the internal encoding rather than the standard one, and it can lead to corrupt scalars if provided bad input.
Get rid of the noisy quotes around identifiers used as hash keys.

Corrupted JSON encoding in Perl (missign comma)

My custom code (on Perl) give next wrong JSON, missing comma between blocks:
{
"data": [{
"{#LOGFILEPATH}": "/tmp/QRZ2007.tcserverlogs",
"{#LOGFILE}": "QRZ2007"
} **missing comma** {
"{#LOGFILE}": "ARZ2007",
"{#LOGFILEPATH}": "/tmp/ARZ2007.tcserverlogs"
}]
}
My terrible code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use utf8;
use JSON;
binmode STDOUT, ":utf8";
my $dir = $ARGV[0];
my $json = JSON->new->utf8->space_after;
opendir(DIR, $dir) or die $!;
print '{"data": [';
while (my $file = readdir(DIR)) {
next unless (-f "$dir/$file");
next unless ($file =~ m/\.tcserverlogs$/);
my $fullPath = "$dir/$file";
my $filenameshort = basename($file, ".tcserverlogs");
my $data_to_json = {"{#LOGFILEPATH}"=>$fullPath,"{#LOGFILE}"=>$filenameshort};
my $data_to_json = {"{#LOGFILEPATH}"=>$fullPath,"{#LOGFILE}"=>$filenameshort};
print $json->encode($data_to_json);
}
print ']}'."\n";
closedir(DIR);
exit 0;
Dear Team i am not a programmer, please any idea how fix it, thank you!
If you do not print a comma, you will not get a comma.
You are trying to build your own JSON string from pre-encoded pieces of smaller data structures. That will not work unless you tell Perl when to put commas. You could do that, but it's easier to just collect all the data into a Perl data structure that is equivalent to the JSON string you want to produce, and encode the whole thing in one go when you're done.
my $dir = $ARGV[0];
my $json = JSON->new->utf8->space_after;
my #data;
opendir( DIR, $dir ) or die $!;
while ( my $file = readdir(DIR) ) {
next unless ( -f "$dir/$file" );
next unless ( $file =~ m/\.tcserverlogs$/ );
my $fullPath = "$dir/$file";
my $filenameshort = basename( $file, ".tcserverlogs" );
my $data_to_json = { "{#LOGFILEPATH}" => $fullPath, "{#LOGFILE}" => $filenameshort };
push #data, $data_to_json;
}
closedir(DIR);
print $json->encode( { data => \#data } );

How do I use encode_json with string in Perl?

Here is my code that I try to open the file to get data and change it to UTF-8, then read each line and store it in variable my $abstract_text and send it back in JSON structure.
my $fh;
if (!open($fh, '<:encoding(UTF-8)',$path))
{
returnApplicationError("Cannot read abstract file: $path ($!)\nERRORCODE|111|\n");
}
printJsonHeader;
my #lines = <$fh>;
my $abstract_text = '';
foreach my $line (#lines)
{
$abstract_text .= $line;
}
my $json = encode_json($abstract_text);
close $fh;
print $json;
By using that code, I get this error;
hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this)
error message also point out that the problem is in this line;
my $json = encode_json($abstract_text);
I want to send the data back as a string (which is in UTF-8). Please help.
I assume you're using either JSON or JSON::XS.
Both allow for non-reference data, but not via the procedural encode_json routine.
You'll need to use the object-oriented approach:
use strict; # obligatory
use warnings; # obligatory
use JSON::XS;
my $encoder = JSON::XS->new();
$encoder->allow_nonref();
print $encoder->encode('Hello, world.');
# => "Hello, world."

JSON: dies on decoding when created file with pretty print

Why do I get this error, when I use the pretty print version?
'"' expected, at character offset 2 (before "(end of string)") at ./perl.pl line 29.
#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':encoding(utf-8)';
use Data::Dumper;
use JSON;
my $json = JSON->new->utf8;
my $hashref = {
'muster, hanß' => {
'hello' => {
year => 2000,
color => 'green'
}
}
};
my $utf8_encoded_json_text = $json->pretty->encode( $hashref ); # leads to a die
#my $utf8_encoded_json_text = $json->encode( $hashref ); # works
open my $fh, '>', 'testfile.json' or die $!;
print $fh $utf8_encoded_json_text;
close $fh;
open $fh, '<', 'testfile.json' or die $!;
$utf8_encoded_json_text = readline $fh;
close $fh;
$hashref = decode_json( $utf8_encoded_json_text );
say Dumper $hashref;
Because when you read the file back in, you're using readline, and only reading the first line of the file. When pretty is off, the entire output is on one line. When pretty is on, the JSON is spread out over multiple lines, so you're passing invalid truncated JSON to decode_json.
Read the entire content by using local $/ = undef; or slurp or whatever else you want.