Perl encoding from JSON issue

Apologies if this is a really stupid question or has already been asked elsewhere. I'm reading in some JSON and using decode_json on it, then extracting text from it and outputting that to a file.
My problem is that Unicode characters are encoded as e.g. \u2019 in the JSON, and decode_json appears to convert this to \x{2019}. When I grab this text and output it to a UTF-8 encoded file, it appears as garbage.
Sample code:
use warnings;
use strict;
use JSON qw( decode_json );
use Data::Dumper;
open IN, $file or die;
binmode IN, ":utf8";
my $data = <IN>;
my $json = decode_json( $data );
open OUT, ">$outfile" or die;
binmode OUT, ":utf8";
binmode STDOUT, ":utf8";
foreach my $textdat (@{ $json->{'results'} }) {
print STDOUT Dumper($textdat);
my $text = $textdat->{'text'};
print OUT "$text\n";
}
The Dumper output shows that the \u encoding has been converted to \x encoding. What am I doing wrong?

decode_json needs UTF-8 encoded input, so use from_json instead, which accepts decoded Unicode text:
my $json = from_json($data);
Another option would be to encode the data yourself:
use Encode;
my $encoded_data = encode('UTF-8', $data);
...
my $json = decode_json($encoded_data);
But it makes little sense to encode data just to decode it.

decode_json expects UTF-8, but you're passing decoded text (Unicode Code Points) instead.
So, you could remove the existing character decoding.
use feature qw( say );
use open ':std', ':encoding(UTF-8)';
use JSON qw( decode_json );
my $json_utf8 = do {
open(my $fh, '<:raw', $in_qfn)
or die("Can't open \"$in_qfn\": $!\n");
local $/;
<$fh>;
};
my $data = decode_json($json_utf8);
{
open(my $fh, '>', $out_qfn)
or die("Can't create \"$out_qfn\": $!\n");
for my $result (@{ $data->{results} }) {
say $fh $result->{text};
}
}
Or, you could use from_json (or JSON->new->decode) instead of decode_json.
use feature qw( say );
use open ':std', ':encoding(UTF-8)';
use JSON qw( from_json ); # <---
my $json_ucp = do {
open(my $fh, '<', $in_qfn) # <---
or die("Can't open \"$in_qfn\": $!\n");
local $/;
<$fh>;
};
my $data = from_json($json_ucp); # <---
{
open(my $fh, '>', $out_qfn)
or die("Can't create \"$out_qfn\": $!\n");
for my $result (@{ $data->{results} }) {
say $fh $result->{text};
}
}
The arrows point to the three minor differences between the two snippets.
I made a number of cleanups.
Added the missing local $/;, so the whole file is read even if the JSON contains line breaks.
Don't use 2-arg open.
Don't needlessly use global variables.
Use better names for variables. $data and $json were notably reversed, and $file contained a file name, not a file.
Limit the scope of your variables, especially if they use up system resources (e.g. file handles).
Use :encoding(UTF-8) (the standard encoding) instead of :encoding(utf8) (an encoding only used by Perl). :utf8 is even worse, as it uses the internal encoding rather than the standard one, and it can lead to corrupt scalars if given bad input. (A small sketch follows this list.)
Get rid of the noisy quotes around identifiers used as hash keys.
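As a small sketch of the encoding-layer point (reusing the $in_qfn and $out_qfn names from the snippets above), an explicit :encoding(UTF-8) layer on both handles decodes on read and encodes on write:
use strict;
use warnings;
open(my $in, '<:encoding(UTF-8)', $in_qfn)
or die("Can't open \"$in_qfn\": $!\n");
open(my $out, '>:encoding(UTF-8)', $out_qfn)
or die("Can't create \"$out_qfn\": $!\n");
while (my $line = <$in>) {
print $out $line; # decoded on read, re-encoded as standard UTF-8 on write
}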

perl & python writing out non-ASCII characters into JSON differently

I have hash keys that look like this:
1я310яHOM_REF_truth:HOM_ALT_test:discordant_hom_ref_to_hom_altяAяC
This is a string joined with the Cyrillic letter я, which I chose as a delimiter because it will never appear in these files.
I write this to a JSON file in Perl 5.30.2 thus:
use JSON 'encode_json';
sub hash_to_json_file {
my $hash = shift;
my $filename = shift;
my $json = encode_json $hash;
open my $out, '>', $filename;
say $out $json
}
and in python 3.8:
import json
def hash_to_json_file(hashtable, filename):
    json1 = json.dumps(hashtable)
    f = open(filename, "w+")
    print(json1, file=f)
    f.close()
When I try to load JSON written by Python back into a Perl script, I see a cryptic error that I don't know how to solve:
Wide character in say at read_json.pl line 27.
After reading https://perldoc.perl.org/perlunifaq.html, I've tried adding use utf8 to my script, but it doesn't work.
I've also tried '>:encoding(UTF-8)' within my subroutine, but the same error results.
Upon inspection of the JSON files, I see keys like "1Ñ180ÑHET_ALT_truth:HET_REF_test:discordant_het_alt_to_het_refÑAÑC,G", where ÑAÑ substitutes я. In the JSON written by Python, I see \u044f; I think that this is the wide character, but I don't know how to change it back.
I've also tried changing my subroutine:
use Encode 'decode';
sub json_file_to_hash {
my $file = shift;
open my $in, '<:encoding(UTF-8)', $file;
my $json = <$in>;
my $ref = decode_json $json;
$ref = decode('UTF-8', $json);
return %{ $ref }
}
but this gives another error:
Wide character in hash dereference at read_json.pl line 17, <$_[...]> line 1
How can I get python JSON read into Perl correctly?
use utf8; # Source is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # For say to STDOUT. Also default for open()
use JSON qw( decode_json encode_json );
sub hash_to_json_file {
my $qfn = shift;
my $ref = shift;
my $json = encode_json($ref); # Produces UTF-8
open(my $fh, '>:raw', $qfn) # Write it unmangled
or die("Can't create \"$qfn\": $!\n");
say $fh $json;
}
sub json_file_to_hash {
my $qfn = shift;
open(my $fh, '<:raw', $qfn) # Read it unmangled
or die("Can't create \"$qfn\": $!\n");
local $/; # Read whole file
my $json = <$fh>; # This is UTF-8
my $ref = decode_json($json); # This produces decoded text
return $ref; # Return the ref rather than the keys and values.
}
my $src = { key => "1я310яHOM_REF_truth:HOM_ALT_test:discordant_hom_ref_to_hom_altяAяC" };
hash_to_json("a.json", $src);
my $dst = hash_to_json("a.json");
say $dst->{key};
You could also avoid using :raw by using from_json and to_json.
use utf8; # Source is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # For say to STDOUT. Also default for open()
use JSON qw( from_json to_json );
sub hash_to_json_file {
my $qfn = shift;
my $hash = shift;
my $json = to_json($hash); # Produces decoded text.
open(my $fh, '>', $qfn) # "use open" will add :encoding(UTF-8)
or die("Can't create \"$qfn\": $!\n");
say $fh $json; # Encoded by :encoding(UTF-8)
}
sub json_file_to_hash {
my $qfn = shift;
open(my $fh, '<', $qfn) # "use open" will add :encoding(UTF-8)
or die("Can't create \"$qfn\": $!\n");
local $/; # Read whole file
my $json = <$fh>; # Decoded text thanks to "use open".
my $ref = from_json($json); # $ref contains decoded text.
return $ref; # Return the ref rather than the keys and values.
}
my $src = { key => "1я310яHOM_REF_truth:HOM_ALT_test:discordant_hom_ref_to_hom_altяAяC" };
hash_to_json("a.json", $src);
my $dst = hash_to_json("a.json");
say $dst->{key};
I like the ascii option, so that the JSON output is all 7-bit ASCII:
my $json = JSON->new->ascii->encode($hash);
Both the Perl and Python JSON modules will be able to read it.
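For example, a minimal sketch (the \u044f escape shown in the comments assumes the я delimiter from the question):
use utf8; # the source below contains the я literal
use JSON qw( decode_json );
my $json = JSON->new->ascii->encode({ key => "aяb" });
# $json is pure 7-bit ASCII, e.g. {"key":"a\u044fb"}, readable by any JSON parser
my $back = decode_json($json); # the \u044f escape decodes back to я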

Can I use Text::CSV_XS to parse a csv-format string without writing it to disk?

I am getting a "csv file" from a vendor (using their API), but what they do is just spew the whole thing into their response. It wouldn't be a significant problem except that, of course, some of those pesky humans entered the data and put in "features" like line breaks. What I am doing now is creating a file for the raw data and then reopening it to read the data:
open RAW, ">", "$rawfile" or die "ERROR: Could not open $rawfile for write: $! \n";
print RAW $response->content;
close RAW;
my $csv = Text::CSV_XS->new({ binary=>1,always_quote=>1,eol=>$/ });
open my $fh, "<", "$rawfile" or die "ERROR: Could not open $rawfile for read: $! \n";
while ( $line = $csv->getline ($fh) ) { ...
Somehow this seems ... inelegant. It seems that I ought to be able to just read the data from $response->content (a multiline string) as if it were a file, but I'm drawing a total blank on how to do this.
A pointer would be greatly appreciated.
Thanks,
Paul
You could use a string filehandle:
use Carp qw( croak );
my $data = $response->content;
open my $fh, "<", \$data or croak "unable to open string filehandle: $!";
my $csv = Text::CSV_XS->new({ binary=>1,always_quote=>1,eol=>$/ });
while ( $line = $csv->getline ($fh) ) { ... }
Yes, you can use Text::CSV_XS on a string, via its functional interface
use warnings;
use strict;
use feature 'say';
use Text::CSV_XS qw(csv); # must use _XS version
my $csv = qq(a,line\nand,another);
my $aoa = csv(in => \$csv)
or die Text::CSV_XS->error_diag;
say "#$_" for #aoa;
Note that this indeed needs Text::CSV_XS (normally Text::CSV works but not with this).
I don't know why this isn't available in the OO interface (or perhaps it is, but it's not documented).
While the above parses the string directly, as asked, you can also lessen the "inelegant" aspect of your example by writing the content directly to a file as it's acquired, which most libraries support, for example via the :content_file option of LWP::UserAgent's get method.
Let me also note that most of the time you want the library to decode the content for you, so with LWP::UserAgent use decoded_content (see HTTP::Response).
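A rough sketch of both suggestions with LWP::UserAgent, assuming $url is the vendor API endpoint and $rawfile the file name from the question:
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
# Option 1: stream the body straight to disk while it downloads.
my $res = $ua->get($url, ':content_file' => $rawfile);
die $res->status_line unless $res->is_success;
# Option 2: keep it in memory and let LWP decode charsets/content encodings.
my $body = $ua->get($url)->decoded_content;
open my $fh, '<', \$body or die "unable to open string filehandle: $!";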
I cooked up this example with Mojo::UserAgent. For the CSV input I used various data sets from the NYC Open Data. This is also going to appear in the next update for Mojo Web Clients.
I build the request without making the request right away, and that gives me the transaction object, $tx. I can then replace the read event so I can immediately send the lines into Text::CSV_XS:
#!perl
use v5.10;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $url = ...;
my $tx = $ua->build_tx( GET => $url );
$tx->res->content->unsubscribe('read')->on(read => sub {
state $csv = do {
require Text::CSV_XS;
Text::CSV_XS->new;
};
state $buffer;
state $reader = do {
open my $r, '<:encoding(UTF-8)', \$buffer;
$r;
};
my ($content, $bytes) = @_;
$buffer .= $bytes;
while (my $row = $csv->getline($reader) ) {
say join ':', $row->@[2,4];
}
});
$tx = $ua->start($tx);
That's not as nice as I'd like it to be because all the data still show up in the buffer. This is slightly more appealing, but it's fragile in the ways I note in the comments. I'm too lazy at the moment to make it any better because that gets hairy very quickly as you figure out when you have enough data to process a record. My particular code isn't as important as the idea that you can do whatever you like as the transactor reads data and passes it into the content handler:
use v5.10;
use strict;
use warnings;
use feature qw(signatures);
no warnings qw(experimental::signatures);
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $url = ...;
my $tx = $ua->build_tx( GET => $url );
$tx->res->content
->unsubscribe('read')
->on( read => process_bytes_factory() );
$tx = $ua->start($tx);
sub process_bytes_factory {
return sub ( $content, $bytes ) {
state $csv = do {
require Text::CSV_XS;
Text::CSV_XS->new( { decode_utf8 => 1 } );
};
state $buffer = '';
state $line_no = 0;
$buffer .= $bytes;
# fragile if the entire content does not end in a
# newline (or whatever the line ending is)
my $last_line_incomplete = $buffer !~ /\n\z/;
# will not work if the format allows embedded newlines
my @lines = split /\n/, $buffer;
$buffer = pop @lines if $last_line_incomplete;
foreach my $line ( @lines ) {
my $status = $csv->parse($line);
my @row = $csv->fields;
say join ':', $line_no++, @row[2,4];
}
};
}

Escape special characters in JSON string

I have a Perl script which contains the variable $env->{'arguments'}. This variable holds a JSON object, and I want to pass that JSON object as an argument to another external script and run it using backticks.
Value of $env->{'arguments'} before escaping:
$VAR1 = '{"text":"This is from module and backslash \\ should work too"}';
Value of $env->{'arguments'} after escaping:
$VAR1 = '"{\\"text\\":\\"This is from module and backslash \\ should work too\\"}"';
Code:
print Dumper($env->{'arguments'});
escapeCharacters(\$env->{'arguments'});
print Dumper($env->{'arguments'});
my $command = './script.pl '.$env->{'arguments'}.'';
my $output = `$command`;
Escape characters function:
sub escapeCharacters
{
#$env->{'arguments'} =~ s/\\/\\\\"/g;
$env->{'arguments'} =~ s/"/\\"/g;
$env->{'arguments'} = '"'.$env->{'arguments'}.'"';
}
I would like to ask what the correct way is to escape that JSON string so that it stays valid JSON and I can use it as an argument for my script.
You're reinventing a wheel.
use String::ShellQuote qw( shell_quote );
my $cmd = shell_quote('./script.pl', $env->{arguments});
my $output = `$cmd`;
Alternatively, there's a number of IPC:: modules you could use instead of qx. For example,
use IPC::System::Simple qw( capturex );
my $output = capturex('./script.pl', $env->{arguments});
Because you have at least one argument, you could also use the following:
my $output = '';
open(my $pipe, '-|', './script.pl', $env->{arguments})
or die("Can't launch script.pl: $!\n");
while (<$pipe>) {
$output .= $_;
}
close($pipe);
Note that the current directory isn't necessarily the directory that contains the currently executing script. If you want to execute the script.pl that's in the same directory as the currently executing script, make the following changes:
Add
use FindBin qw( $RealBin );
and replace
'./script.pl'
with
"$RealBin/script.pl"
Piping it to your second program rather than passing it as an argument seems like it would make more sense (and be a lot safer).
test1.pl
#!/usr/bin/perl
use strict;
use JSON;
use Data::Dumper;
undef $/;
my $data = decode_json(<>);
print Dumper($data);
test2.pl
#!/usr/bin/perl
use strict;
use IPC::Open2;
use JSON;
my %data = ('text' => "this has a \\backslash", 'nums' => [0,1,2]);
my $json = JSON->new->encode(\%data);
my ($chld_out, $chld_in);
print("Executing script\n");
my $pid = open2($chld_out, $chld_in, "./test1.pl");
print $chld_in "$json\n";
close($chld_in);
my $out = do {local $/; <$chld_out>};
waitpid $pid, 0;
print(qq~test1.pl output =($out)~);

How do I use encode_json with string in Perl?

Here is my code. I try to open the file and read its data as UTF-8, then read each line, store it in the variable $abstract_text, and send it back as a JSON structure.
my $fh;
if (!open($fh, '<:encoding(UTF-8)',$path))
{
returnApplicationError("Cannot read abstract file: $path ($!)\nERRORCODE|111|\n");
}
printJsonHeader;
my @lines = <$fh>;
my $abstract_text = '';
foreach my $line (@lines)
{
$abstract_text .= $line;
}
my $json = encode_json($abstract_text);
close $fh;
print $json;
By using that code, I get this error:
hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this)
The error message also points out that the problem is in this line:
my $json = encode_json($abstract_text);
I want to send the data back as a string (which is in UTF-8). Please help.
I assume you're using either JSON or JSON::XS.
Both allow for non-reference data, but not via the procedural encode_json routine.
You'll need to use the object-oriented approach:
use strict; # obligatory
use warnings; # obligatory
use JSON::XS;
my $encoder = JSON::XS->new();
$encoder->allow_nonref();
print $encoder->encode('Hello, world.');
# => "Hello, world."

JSON: dies on decoding when created file with pretty print

Why do I get this error when I use the pretty-print version?
'"' expected, at character offset 2 (before "(end of string)") at ./perl.pl line 29.
#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':encoding(utf-8)';
use Data::Dumper;
use JSON;
my $json = JSON->new->utf8;
my $hashref = {
'muster, hanß' => {
'hello' => {
year => 2000,
color => 'green'
}
}
};
my $utf8_encoded_json_text = $json->pretty->encode( $hashref ); # leads to a die
#my $utf8_encoded_json_text = $json->encode( $hashref ); # works
open my $fh, '>', 'testfile.json' or die $!;
print $fh $utf8_encoded_json_text;
close $fh;
open $fh, '<', 'testfile.json' or die $!;
$utf8_encoded_json_text = readline $fh;
close $fh;
$hashref = decode_json( $utf8_encoded_json_text );
say Dumper $hashref;
Because when you read the file back in, you're using readline, and only reading the first line of the file. When pretty is off, the entire output is on one line. When pretty is on, the JSON is spread out over multiple lines, so you're passing invalid truncated JSON to decode_json.
Read the entire file instead, e.g. by setting local $/ = undef; before the readline (slurp mode), or with whatever other slurping method you prefer.
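A minimal sketch of that fix, reusing the variables from the script above:
open $fh, '<', 'testfile.json' or die $!;
$utf8_encoded_json_text = do { local $/; readline $fh };
close $fh;
$hashref = decode_json( $utf8_encoded_json_text );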