Retrieving original JSON code from a decoded array node in Perl

I'm working on a script which receives JSON code for an array of objects similar to this:
{
  "array": [
    { "id": 1, "text": "Some text" },
    { "id": 2, "text": "Some text" }
  ]
}
I decode it using JSON::XS and then filter out some of the results. After this, I need to store the JSON code for each individual node into a queue for later processing. The format this queue requires is JSON too, so the code I'd need to insert for each node would be something like this:
{ "id": 1, "text": "Some text" }
However, after decode_json has decoded a node, all that's left are hash references for each node:
print $json->{'array'}[0]; # Would print something like HASH(0x7ffa80c83270)
I know I could get something similar to the original JSON code using encode_json on the hash reference, but the resulting code is different from the original, UTF-8 characters get all weird, and it seems like a lot of extra processing, especially considering the amount of data this script has to deal with.
Is there a way to retrieve the original JSON code from a decoded array node? Does JSON::XS keep the original chunks somewhere after they have been decoded?
EDIT
About the weird UTF-8 characters, they just look weird on the screen:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use JSON::XS;

binmode STDOUT, ":utf8";

my $old_json = '{ "text": "Drag\u00f3n" }';
my $json = decode_json($old_json);
print $json->{'text'}, "\n"; # Dragón

my $new_json = encode_json($json);
print $new_json, "\n"; # {"text":"Dragón"}

$json = decode_json($new_json);
print $json->{'text'}, "\n"; # Dragón

encode_json will produce equivalent JSON to what you originally had before you decoded it with decode_json. Characters encoded using UTF-8 do not get all weird.
$ cat a.pl
use Encode qw( encode_utf8 );
use JSON::XS qw( decode_json encode_json );
my $json = encode_utf8(qq!{"name":"\x{C9}ric" }!);
print($json, "\n");
print(encode_json(decode_json($json)), "\n");
$ perl a.pl | od -c
0000000 { " n a m e " : " 303 211 r i c "
0000020 } \n { " n a m e " : " 303 211 r i c
0000040 " } \n
0000043
If you want a parser that preserves the original JSON, you'll surely have to write your own; the existing ones don't do that.

No, it doesn't exist anywhere. The "original JSON" isn't stored per-element; it's decoded in a single pass.

No, this is not possible. Every JSON object can have multiple equivalent representations:
{ "key": "abc" }
and
{
"key" : "abc"
}
are pretty much the same.
So just use the re-encoded JSON your module gives you.
Even if JSON::XS cached the chunks, extracting them would be a breach of encapsulation, with no guarantee of still working after a module upgrade. It is also bad design.
Don't worry about performance. The XS modules have exceptional performance, as they are coded in C. And if you were truly paranoid about performance, you wouldn't use JSON but some binary format. And you wouldn't be using Perl, but Fortran ;-)
You should treat equivalent data as equivalent data, even if the presentation differs.
If the Unicode characters look weird but are processed fine, there is no problem. If they aren't processed correctly, you may have to specify an exact encoding.
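For illustration, here is a minimal sketch of the "just re-encode each node" approach; wanted() and enqueue() are hypothetical stand-ins for the real filter and queue:
use strict;
use warnings;
use JSON::XS;

# Hypothetical stand-ins for the real filter and queue:
sub wanted  { $_[0]{id} != 2 }            # keep everything except id 2, say
sub enqueue { print "queued: $_[0]\n" }   # the real queue insert would go here

my $raw_json = '{"array":[{"id":1,"text":"Some text"},{"id":2,"text":"Some text"}]}';

my $coder = JSON::XS->new->utf8;          # UTF-8 octets in and out, like decode_json/encode_json
my $data  = $coder->decode($raw_json);

for my $node ( @{ $data->{array} } ) {
    next unless wanted($node);
    enqueue( $coder->encode($node) );     # equivalent (not byte-identical) JSON per node
}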

Related

Using Text::CSV on a String Containing Quotes

I have pored over this site (and others) trying to glean the answer for this but have been unsuccessful.
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, auto_diag => 1 } );
my $line = q(data="a=1,b=2",c=3);
my $csvParse = $csv->parse($line);
my @fields = $csv->fields();
for my $field (@fields) {
    print "FIELD ==> $field\n";
}
Here's the output:
# CSV_XS ERROR: 2034 - EIF - Loose unescaped quote # rec 0 pos 6 field 1
FIELD ==>
I am expecting 2 array elements:
data="a=1,b=2"
c=3
What am I missing?
You may get away with using Text::ParseWords. Since you are not using real CSV, it may be fine. Example:
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
my $line = q(data="a=1,b=2",c=3);
my @fields = quotewords(',', 1, $line);
print Dumper \@fields;
This will print
$VAR1 = [
          'data="a=1,b=2"',
          'c=3'
        ];
As you requested. You may want to test further on your data.
Your input data isn't "standard" CSV, at least not the kind that Text::CSV expects and not the kind that things like Excel produce. An entire field has to be quoted or not at all. The "standard" encoding of that would be "data=""a=1,b=2""",c=3 (which you can see by asking Text::CSV to print your expected data using say).
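To see that encoding, here is a quick sketch using Text::CSV's combine and string methods:
use strict;
use warnings;
use Text::CSV;

# Round-trip the two expected fields to see the standard CSV encoding.
my $csv = Text::CSV->new({ binary => 1 });
$csv->combine('data="a=1,b=2"', 'c=3');
print $csv->string, "\n";   # "data=""a=1,b=2""",c=3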
If you pass the allow_loose_quotes option to the Text::CSV constructor, it won't error on your input, but it won't consider the quotes to be "protecting" the comma, so you will get three fields, namely data="a=1, b=2" and c=3.
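And a sketch of the allow_loose_quotes variant, showing the three fields described above:
use strict;
use warnings;
use Text::CSV;

# allow_loose_quotes keeps parse() from failing on the stray quotes,
# but the quotes no longer protect the embedded comma.
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, allow_loose_quotes => 1 });
$csv->parse(q(data="a=1,b=2",c=3));
print "FIELD ==> $_\n" for $csv->fields;
# FIELD ==> data="a=1
# FIELD ==> b=2"
# FIELD ==> c=3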

Putting comma-separated values into a new file line by line

From the command line we pass multiple comma-separated values, such as sydney,delhi,NY,Russia, as an option. These values get stored in $runTest in the Perl script. Now I want to create a new file from the script with the contents of $runTest, but line by line. For example:
INPUT (passed values from command line):
sydney,delhi,NY,Russia
OUTPUT (under new file: myfile):
sydney
delhi
NY
Russia
In this simple example it is better to split on the delimiter than to use tr. A few minor points: use snake_case for names instead of CamelCase, and use autodie to make open, close, etc. fatal, without the need to clutter the code with or die "...":
use autodie;
my $run_test = 'sydney,delhi,NY,Russia';
open my $out, '>', 'myFile';
print {$out} map { "$_\n" } split /,/, $run_test;
close $out;
For more robust parsing in general, beyond this simple example, prefer specialized modules such as Text::CSV or Text::CSV_XS for CSV parsing. Compared to the overly simplistic split, Text::CSV_XS correctly handles input and output of quoted fields, fields containing the delimiter (comma), and binary characters, and it provides error messages, among other things. Example:
use Text::CSV_XS;
use autodie;
open my $out, q{>}, q{myFile};
# All of these input strings are parsed correctly, unlike when using "split":
# my $run_test = q{sydney,delhi,NY,Russia};
# my $run_test = q{sydney,delhi,NY,Russia,"field,with,commas"};
my $run_test = q{sydney,delhi,NY,Russia,"field,with,commas","field,with,missing,quote};
# binary => 1 : enable parsing binary characters in quoted fields.
# auto_diag => 1 : print the internal error code and the associated error message to STDERR.
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1 } );
if ( $csv->parse( $run_test ) ) {
    print {$out} map { "$_\n" } $csv->fields;
}
else {
    print STDERR q{parse() failed on: }, $csv->error_input, qq{\n};
}

Parsing a JSON file in PowerShell with specific characters

I am trying to extract the values between specific characters in PowerShell. Basically I have a JSON file with thousands of objects like this:
"Name": "AllZones_APOPreface_GeographyMatch_FromBRE_ToSTR",
"Sequence": 0,
"Condition": "this.TripOriginLocationCode==\"BRE\"&&this.TripDestinationLocationCode==\"STR\"",
"Action": "this.FeesRate=0.19m;this.ZoneCode=\"Zone1\";halt",
"ElseAction": ""
I want everything within \" \"
I.e., here I would see that BRE and STR map to Zone1.
All I need is those three values output.
I have been searching for how to do this with ConvertFrom-Json but with no success; maybe I just haven't found a good article on it.
Thanks
Start by representing your JSON as a string:
$myjson = @'
{
"Name": "AllZones_APOPreface_GeographyMatch_FromBRE_ToSTR",
"Sequence": 0,
"Condition": "this.TripOriginLocationCode==\"BRE\"&&this.TripDestinationLocationCode==\"STR\"",
"Action": "this.FeesRate=0.19m;this.ZoneCode=\"Zone1\";halt",
"ElseAction": ""
}
'@
Next, create a regular expression that matches everything in between \" and \", that's under 10 characters long (else it'll match unwanted results).
$regex = [regex]::new('\\"(?<content>.{1,10})\\"')
Next, perform the regular expression comparison, by calling the Matches() method on the regular expression. Pass your JSON string into the method parameters, as the text that you want to perform the comparison against.
$matchlist = $regex.Matches($myjson)
Finally, grab the content match group that was defined in the regular expression, and extract the values from it.
$matchlist.Groups.Where({ $PSItem.Name -eq 'content' }).Value
Result
BRE
STR
Zone1
Approach #2: Use Regex Look-behinds for more accurate matching
Here's a more specific regular expression that uses look-behinds to validate each field appropriately. Then we assign each match to a developer-friendly variable name.
$regex = [regex]::new('(?<=TripOriginLocationCode==\\")(?<OriginCode>\w+)|(?<=TripDestinationLocationCode==\\")(?<DestinationCode>\w+)|(?<=ZoneCode=\\")(?<ZoneCode>\w+)')
$matchlist = $regex.Matches($myjson)
### Assign each component to its own friendly variable name
$OriginCode, $DestinationCode, $ZoneCode = $matchlist[0].Value, $matchlist[1].Value, $matchlist[2].Value
### Construct a string from the individual components
'Your origin code is {0}, your destination code is {1}, and your zone code is {2}' -f $OriginCode, $DestinationCode, $ZoneCode
Result
Your origin code is BRE, your destination code is STR, and your zone code is Zone1

Perl - interpolate JSON string values

I have a JSON file like this:
{
  "tool_name": {
    "command": "$ENV{TOOL_BIN_DIR}/some_file_name",
    "args": "some args"
  }
}
I am using the JSON module from Perl 5.14 and its decode_json function to read the file and get the data into a Perl hash.
But when I refer to the read data in code like this:
my $cmd = "$data->{tool_name}->{command}";
print $cmd;
I get
$ENV{TOOL_BIN_DIR}/some_file_name
How can I make Perl resolve the value of this variable?
This example uses an environment variable, but in general, how can I use variables from JSON?
Using eval opens you up to malicious or accidental damage: the string you are executing could contain any Perl code that may do anything at all to your system.
It is preferable to use interpolate from the String::Interpolate module, which uses Perl's own interpolation engine, the same one that expands ordinary double-quoted strings at run time.
This program sets up a value for the environment variable TOOL_BIN_DIR and expands all the values in the tool_name hash that contain a dollar $ or at @ sigil.
I've used Data::Dump to display the contents of the data after the interpolation.
If you don't know which values are likely to contain something that needs expanding, you may want to write a recursive subroutine that processes the values of all nested hashes and arrays; a sketch of that follows the output below.
use strict;
use warnings 'all';
use JSON 'decode_json';
use String::Interpolate 'interpolate';
use Data::Dump 'dd';
my $data = decode_json <<'__END_JSON__';
{
  "tool_name": {
    "command": "$ENV{TOOL_BIN_DIR}/some_file_name",
    "args": "some args"
  }
}
__END_JSON__
$ENV{TOOL_BIN_DIR} = 'tool_dir_test';
for ( values %{ $data->{tool_name} } ) {
    $_ = interpolate($_) if /[\$\@]/;
}
dd $data;
output
{
tool_name => { args => "some args", command => "tool_dir_test/some_file_name" },
}
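As a sketch of that recursive idea, here is a hypothetical interpolate_tree helper (not part of String::Interpolate) that expands every nested string value in place:
use strict;
use warnings;
use String::Interpolate 'interpolate';

# Hypothetical recursive walker: expand every nested string value
# that contains a $ or @ sigil, modifying the structure in place.
sub interpolate_tree {
    my $ref = ref $_[0];
    if    ( $ref eq 'HASH' )  { interpolate_tree($_) for values %{ $_[0] } }
    elsif ( $ref eq 'ARRAY' ) { interpolate_tree($_) for @{ $_[0] } }
    elsif ( defined $_[0] && $_[0] =~ /[\$\@]/ ) {
        $_[0] = interpolate($_[0]);   # writes through the @_ alias
    }
}

$ENV{TOOL_BIN_DIR} = 'tool_dir_test';
my $data = { tool_name => { command => '$ENV{TOOL_BIN_DIR}/some_file_name', args => 'some args' } };
interpolate_tree($data);
print $data->{tool_name}{command}, "\n";   # tool_dir_test/some_file_name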
You can use eval, as in:
my $cmd = '$ENV{HOME}/toto';
print eval('"' . $cmd . '"'), "\n";
But remember, this is very unsafe from a security point of view. You should probably avoid having to do this.

Normalization of UTF-8 filenames stored in JSON with Perl

I have two JSON files which come from different OSes.
Both files are encoded in UTF-8 and contain UTF-8 encoded filenames.
One file comes from OS X, and the filename is in NFD form (od -bc):
0000160 166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145
v e t l a ́ ** / H o u s e m e
The second contains the same filename, but in NFC form:
000760 166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163
v e t l á ** / H o u s e m e s
As I have learned, this is called "different normalization", and there is a CPAN module, Unicode::Normalize, for handling it.
I'm reading both files like this:
my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ;
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;
read_file is from File::Slurp and decode_json is from JSON::XS.
After reading the JSON into a Perl structure, the filename from one JSON file ends up in key position, while from the second file it ends up in the values. I need to find when a hash key from the first hash is equivalent to a value from the second hash, so I need to ensure that they are "binary" identical.
I tried the following:
grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
and
grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
both produce the same output for me.
Now the questions:
How can I simply read both JSON files so that the same normalization ends up in both $hashrefs?
Or do I need to run something like this on both hashes after decode_json?
while (my ($k, $v) = each %$json1) {
    $copy->{ NFD($k) } = NFD($v);
}
In short:
How can I read different JSON files so that the same normalization ends up "inside" the Perl $href? Is it possible to achieve this more nicely than by explicitly doing NFD on each key and value and creating another NFD-normalized (big) copy of the hashes?
Some hints and suggestions, please...
Because my English is very bad, here is a simulation of the problem:
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use Data::Dumper;
use JSON::XS;
# Create two files that contain different "normalizations"
my ($nfc, $nfd);
$nfc->{ NFC('key') } = NFC('vál');
$nfd->{ NFD('vál') } = 'something';
#save as NFC - this comes from "FreeBSD"
my $jnfc = JSON::XS->new->encode($nfc);
open my $fd, ">:utf8", "nfc.json" or die("nfc");
print $fd $jnfc;
close $fd;
#save as NFD - this comes from "OS X"
my $jnfd = JSON::XS->new->encode($nfd);
open $fd, ">:utf8", "nfd.json" or die("nfd");
print $fd $jnfd;
close $fd;
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
say $jd->{ $jc->{key} } // "NO FOUND"; #wanted to print "something"
my $jc2;
# Is there a better way to do this?
while (my ($k, $v) = each %$jc) {
    $jc2->{ NFD($k) } = NFD($v);
}
say $jd->{ $jc2->{key} } // "NO FOUND"; #OK
While searching for the right solution to your question I discovered that the software is c*rp :) See: https://stackoverflow.com/a/17448888/632407
Anyway, I found the solution for your particular question - how to read JSON containing filenames regardless of their normalization:
Instead of your:
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
use the next:
#now read them
my $jc = get_json_from_utf8_file('nfc.json') ;
my $jd = get_json_from_utf8_file('nfd.json') ;
...
sub get_json_from_utf8_file {
    my $file = shift;
    return
        decode_json         # parse the JSON into a Perl structure
        encode 'utf8',      # decode_json wants a UTF-8 encoded binary string, so encode it
        NFC                 # convert to precomposed normalization, regardless of the source
        read_file           # the file contains UTF-8 encoded text, so read it correctly
        $file, { binmode => ':utf8' };
}
This should (at least I hope) ensure that, regardless of which decomposition the JSON content uses, NFC will convert it to the precomposed version and JSON::XS will parse it into the same internal Perl structure.
So your example prints:
something
without traversing the $json.
The idea comes from Joseph Myers and Nemo ;)
Maybe some more skilled programmers will give more hints.
Even though it may be important right now only to convert a few file names to the same normalization for comparison, other unexpected problems could arise from almost anywhere if JSON data has a different normalization.
So my suggestion is to normalize the entire input from both sources as your first step before doing any parsing (i.e., at the same time you read the file and before decode_json). This should not corrupt any of your JSON structures since those are delimited using ASCII characters. Then your existing perl code should be able to blindly assume all UTF8 characters have the same normalization.
my $rawdata1 = read_file($file1, {binmode => ':utf8'}) or die "...";
my $rawdata2 = read_file($file2, {binmode => ':utf8'}) or die "...";
my $json1 = decode_json encode_utf8 NFD($rawdata1); # NFD needs characters, not octets: read as :utf8, re-encode for decode_json
my $json2 = decode_json encode_utf8 NFD($rawdata2);
To make this process slightly faster (it should be plenty fast already, since the module uses fast XS procedures), you can find out whether one of the two data files is already in a certain normalization form, and then leave that file unchanged, and convert the other file into that form.
For example:
my $rawdata1 = read_file($file1, {binmode => ':utf8'}) or die "...";
my $rawdata2 = read_file($file2, {binmode => ':utf8'}) or die "...";
if (checkNFD($rawdata1)) {
    # then you know $file1 is already in Normalization Form D
    # (i.e., it was formed by canonical decomposition),
    # so you only need to convert $file2 into NFD
    $rawdata2 = NFD($rawdata2);
}
my $json1 = decode_json encode_utf8 $rawdata1;
my $json2 = decode_json encode_utf8 $rawdata2;
Of course, you would have to experiment at development time to see whether one or the other of the input files is already in a normalized form; in the final version of the code you would no longer need the conditional statement, but would simply convert the other input file into the same normalized form.
Also note that it is suggested to produce output in NFC form (if your program produces any output that would be stored and used later). See here, for example: http://www.perl.com/pub/2012/05/perlunicookbook-unicode-normalization.html
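As a small sketch of that suggestion (the file name and data here are made up):
use strict;
use warnings;
use JSON::XS;
use Unicode::Normalize qw(NFC);
use Encode qw(encode_utf8);

my $data = { name => "Drago\x{301}n" };   # NFD: "o" followed by a combining acute

# Without ->utf8, encode() returns a character string; normalize it to
# NFC, then encode to UTF-8 octets for the raw filehandle.
open my $out, '>:raw', 'result.json' or die $!;
print {$out} encode_utf8( NFC( JSON::XS->new->encode($data) ) );
close $out;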
Hm. I can't advise a better "programming" solution, but why not simply run
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < freebsd.json > bsdok.json
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < osx.json > osxok.json
so that your script can read and use both, because they are in the same normalization? Instead of searching for a programming solution inside your script, solve the problem before the data enters the script. (The second command is unnecessary - the OS X file is already in NFD.) A simple conversion at the file level is surely easier than traversing data structures...
Instead of traversing the data structure manually, let a module handle this for you; a sketch using one of them follows the list.
Data::Visitor
Data::Rmap
Data::Dmap
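For instance, a sketch with Data::Rmap; note that rmap visits leaf values but not hash keys, so keys would still need an explicit rebuild like the while loop shown in the question:
use strict;
use warnings;
use JSON::XS qw(decode_json);
use Unicode::Normalize qw(NFC);
use Data::Rmap qw(rmap);

my $data = decode_json('{"name":"Drago\u0301n","list":["va\u0301l"]}');

# rmap aliases each leaf scalar to $_, so assigning to $_ rewrites the
# value in place; JSON nulls arrive as undef, hence the guard.
rmap { $_ = NFC($_) if defined } $data;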