Perl YAML to JSON

What I am trying to do should be VERY straightforward and simple.
use JSON;
use YAML;
use Data::Dumper;
my $yaml_hash = YAML::LoadFile("data_file.yaml");
print ref($yaml_hash);    # prints HASH as expected
print Dumper($yaml_hash); # correctly prints the hash
my $json_text = encode_json($yaml_hash);
The encode_json errors out saying:
cannot encode reference to scalar 'SCALAR(0x100ab630)' unless the scalar is 0 or 1
I am not able to understand why encode_json thinks that $yaml_hash is a reference to a scalar when in fact it is a reference to a HASH.
What am I doing wrong?

It is not $yaml_hash itself that encode_json is complaining about; it is some reference in one of the hash values (or deeper). Scalar references can be represented in YAML but not in JSON.

YAML enables you to load objects and scalar references. JSON does not, by default.
I suspect that your data file most likely contains an inside-out object, and JSON doesn't know how to work with the scalar reference.
The following demonstrates loading a YAML hash containing a scalar reference in one of the values and then failing to encode it using JSON:
use strict;
use warnings;
use YAML;
use JSON;
# Load a YAML hash containing a scalar ref as a value.
my ($hashref) = Load(<<'END_YAML');
---
bar: !!perl/ref
  =: 17
foo: 1
END_YAML
use Data::Dump;
dd $hashref;
my $json_text = encode_json($hashref);
Output:
{ bar => \17, foo => 1 }
cannot encode reference to scalar at script.pl line 18.
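If dereferencing the scalar refs is acceptable for your data, one workaround is to walk the structure and unwrap them before encoding. This is only a sketch (deref_scalars is a helper name I'm inventing here, not something provided by JSON or YAML):
sub deref_scalars {
    # recursively replace scalar refs with the values they point to
    my ($node) = @_;
    my $type = ref $node;
    if ($type eq 'HASH') {
        $_ = deref_scalars($_) for values %$node;
    }
    elsif ($type eq 'ARRAY') {
        $_ = deref_scalars($_) for @$node;
    }
    elsif ($type eq 'SCALAR') {
        return $$node;    # \17 becomes 17
    }
    return $node;
}
print encode_json(deref_scalars($hashref));
# prints something like {"bar":17,"foo":1} (key order may vary)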

Here are one-liners that can be used to pipe YAML in and produce JSON on STDOUT:
perl -0777 -MYAML -MJSON -e 'print(JSON->new()->utf8()->pretty()->encode(Load(<STDIN>)))'
or, even shorter, if you don't care about formatting:
perl -0777 -MYAML -MJSON -e 'print encode_json(Load(<STDIN>))'
For large volumes and faster parsing I'd also recommend the YAML::XS and JSON::XS counterparts.
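For example, with the XS modules installed, the pretty-printing variant becomes:
perl -0777 -MYAML::XS -MJSON::XS -e 'print JSON::XS->new()->utf8()->pretty()->encode(Load(<STDIN>))'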

Related

When creating a variable from command output, Bash removes a backslash from the JSON. How do I make it keep both backslashes to maintain valid JSON?

I'm doing the following to capture some ADO JSON data:
iteration="$(az boards iteration team list --team Test --project Test --timeframe current)"
Normally, the output of that command contains a JSON key/value pair like the following:
"path": "Test\\Sprint1"
But after capturing the STDOUT into that iteration variable, if I do
echo "$iteration"
that key/value pair becomes
"path": "Test\Sprint1"
And if I attempt to use jq on that output, it breaks because it's not recognized as valid JSON any longer. I'm very unfamiliar with Bash. How can I get that JSON to remain valid all the way through?
As already commented by markp-fuso:
It looks like your echo command is interpreting the backslashes. You can confirm this by running echo 'a\\b' and looking at the output.
The portable way to deal with such problems is to use printf instead of echo:
printf %s\\n "$iteration"

Unescaping data in Mason/Perl and creating JSON out of it

string s = "%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D"
I want this to be converted into JSON in the Mason language. (Mason is very similar to Perl.)
I am doing this and it is working partly:
URI::Escape::uri_unescape($ItemAssociationGroupData)
This is returning:
{parentAsin:asin_1,+businessType:+"AHS",renderType:RenderAll,constraints:[{type:+Delete,mutuallyInclusive:false}]}
Here I don't want the "+" signs, and the final output should be JSON, not a string. This can be done online with this tool, but I want to do the same in code:
https://www.url-encode-decode.com/
I have tried JSON::XS::to_json and HTML::Entities, among other things, but they are not working and return undef values.
Any help here is appreciated.
Just replace the + with spaces.
uri_unescape( $ItemAssociationGroupData =~ s/\+/ /rg )
That produces
{parentAsin:asin_1, businessType: "AHS",renderType:RenderAll,constraints:[{type: Delete,mutuallyInclusive:false}]}
But that string isn't JSON. The keys of objects must be string literals in JSON, and string literals must be quoted.
Cpanel::JSON::XS's allow_barekey option will make it accept unquoted keys, but no JSON parser is going to accept the other unquoted string literals (asin_1, RenderAll, Delete). Not even JavaScript would accept that.
I don't know where you're getting that string from, but it's not really very close to JSON.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use JSON;
use URI::Escape;
use Data::Dumper;
my $str = '%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D';
my $json = uri_unescape($str);
say $json;
say Dumper decode_json($json);
We get this output:
{parentAsin:asin_1,+businessType:+"AHS",renderType:RenderAll,constraints:[{type:+Delete,mutuallyInclusive:false}]}
And then this error:
'"' expected, at character offset 1 (before "parentAsin:asin_1,+b...") at json_decode line 21.
That's caused by the keys in your objects not being in quoted strings. Ok, we can fix that. We'll also replace the '+' signs with spaces.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use JSON;
use URI::Escape;
use Data::Dumper;
my $str = '%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D';
# ADDED THIS LINE
$str =~ s/\+/ /g;
my $json = uri_unescape($str);
# ADDED THIS LINE
$json =~ s/(\w+?):/"$1":/g;
say $json;
say Dumper decode_json($json);
Now we get better output:
{"parentAsin":asin_1, "businessType": "AHS","renderType":RenderAll,"constraints":[{"type": Delete,"mutuallyInclusive":false}]}
But we still get an error:
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 14 (before "asin_1,+"businessTyp...") at json_decode line 21.
This is because your values also need to be quoted strings. But fixing this is harder because some of your values are already quoted (e.g. "AHS") and some values don't need to be quoted (e.g. false).
So it's hard to know the best approach to take from here. My first instinct would be to go back to whatever is generating that original string and see if you can get the bugs fixed so you get a proper JSON string.
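If fixing the source really isn't an option and the unquoted values are always simple bare words, one fragile heuristic (just a sketch, and easy to break) is to also quote anything after a colon that isn't already quoted, a number, or one of the JSON literals:
# FRAGILE GUESS: quote bare-word values; leave quoted strings,
# numbers and true/false/null alone
$json =~ s/:\s*(?!true\b|false\b|null\b|"|-?\d)(\w+)/: "$1"/g;
With the sample string above this produces valid JSON, but any value containing spaces or punctuation will defeat it.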

Perl: threads vs JSON

The program listed below fails with the following error:
JSON text must be an object or array (but found number, string, true, false or null, use allow_nonref to allow this) at json_test.pl line 10.
It works fine when I comment out the thread startup/join, or when the JSON is parsed before the thread is run.
The message seems to be coming from the JSON library, so I suppose something is wrong with it.
Any ideas what's going on and how to fix it?
# json_test.pl
use strict;
use warnings;
use threads;
use JSON;
use Data::Dumper;
my $t = threads->new(\&DoSomething);
my $str = '{"category":"dummy"}';
my $json = JSON->new();
my $data = $json->decode($str);
print Dumper($data);
$t->join();
sub DoSomething
{
    sleep 10;
    return 1;
}
JSON uses JSON::XS if it is installed, and JSON::XS is not compatible with Perl threads (please don't take that author's words at face value: threads are discouraged and difficult to use effectively, but they are not deprecated and there are no plans to remove them). The community-preferred fork Cpanel::JSON::XS is thread-safe and is used by JSON::MaybeXS by default, which is a mostly drop-in replacement for JSON.
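A minimal rework of the test script using JSON::MaybeXS (assuming it is installed) might look like this:
# json_test.pl, reworked
use strict;
use warnings;
use threads;
use JSON::MaybeXS;   # prefers Cpanel::JSON::XS when available
use Data::Dumper;
my $t = threads->new(\&DoSomething);
my $str = '{"category":"dummy"}';
my $json = JSON::MaybeXS->new();
my $data = $json->decode($str);
print Dumper($data);
$t->join();
sub DoSomething
{
    sleep 10;
    return 1;
}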

Mask certain file paths in binary files

I have a binary file containing some file paths. If the path starts with a certain string, the rest of the file path [\x20-\x7f]+ should be masked, leaving the general structure and size of the file intact!
So with a list of paths to search for like this:
/usr/local/bin/
/home/joe/
Then an occurrence like this in the binary data:
^#^#^#^#/home/joe/documents/hello.docx^#^#^#^#
Should be changed to this:
^#^#^#^#/home/joe/********************^#^#^#^#
What is the best way to do this? Do sed, perl or awk have a way? Or do I have to write a C or PHP program where I find the string and write strlen() number of mask characters in its place?
perl is a good choice for working on binary data. For sed and awk, only the GNU implementations can generally cope with binary data; the others would choke on NUL bytes, on long sequences between two newline characters, or on non-terminated lines.
perl -pi.back -e 's{(/usr/local/bin|/home/joe)/\K[\x20-\x7f]+}{$& =~ s/./*/rg}ge' binary-file
You need a not-too-old version of perl for the /r flag (which returns the result of the substitution instead of applying it to the variable) and \K (which resets the start of the matched string).
By default, perl -p works on one line at a time; since the newline character is not part of [\x20-\x7f], that's fine.
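If the file contains very long stretches without a newline, a slurp-mode variant (assuming the whole file fits in memory) is:
perl -0777 -pi.back -e 's{(/usr/local/bin|/home/joe)/\K[\x20-\x7f]+}{$& =~ s/./*/rg}ge' binary-file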
Here is some perl code that works, though I'm sure it can be optimised. It is a filter, so it reads all of stdin into $data, then for each string in the array @dirs it does a substitution for the pattern. The replacement, however, is not a fixed string but a function call, replace($dir,$1), which is evaluated because of the e modifier on the substitution.
#!/usr/bin/perl
use strict;
sub replace {
    my ($dir, $rest) = @_;
    $rest =~ s/./*/g;
    return $dir . $rest;
}
my @dirs = ('/usr/local/bin/', '/home/joe/');
my $data = join("", <STDIN>);
foreach my $dir (@dirs) {
    $data =~ s|$dir([\x20-\x7f]+)|replace($dir, $1)|ge;
}
print $data;
The function is given 2 arguments, the directory and the captured part of the pattern. It returns these concatenated after replacing each character in the captured string.

parse domains from html page using perl

I have an HTML page that contains URLs like:
<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">
I want to extract:
site.com/path
www.site.org
that is, the part between <h3><a href=" and /index.php.
I've tried this code:
#!/usr/local/bin/perl
use strict;
use warnings;
open (MYFILE, 'MyFileName.txt');
while (<MYFILE>)
{
my $values1 = split('http://', $_); #VALUE WILL BE: www.site.org/path/index2.php
my #values2 = split('index.php', $values1); #VALUE WILL BE: www.site.org/path/ ?option=com_content
print $values2[0]; # here it must print www.site.org/path/ but it don't
print "\n";
}
close (MYFILE);
but this gives the output:
2
1
2
2
1
1
and it doesn't parse https websites.
Hope you understand; regards.
The main thing wrong with your code is that when you call split in scalar context as in your line:
my $values1 = split('http://', $_);
it returns the size of the list created by the split (see the documentation for split).
But I don't think split is appropriate for this task anyway. If you know that the value you are looking for will always lie between 'http[s]://' and '/index.php' you just need a regex substitution in your loop (you should also be more careful opening your file...):
open(my $myfile_fh, '<', 'MyFileName.txt') or die "Couldn't open $!";
while(<$myfile_fh>) {
s{.*http[s]?://(.*)/index\.php.*}{$1} && print;
}
close($myfile_fh);
It's likely you will need a more general regex than that, but I think this would work based on your description of the problem.
This feels to me like a job for modules:
HTML::LinkExtor
URI
Generally, using regexps to parse HTML is risky; a sketch using those modules follows below.
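A rough sketch of that module-based approach (the file name is taken from the question; adjust to taste):
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;
# collect href attributes from <a> tags
my @urls;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
});
$parser->parse_file('MyFileName.txt');
# keep the host plus the path up to /index.php
for my $url (@urls) {
    my $uri = URI->new($url);
    next unless $uri->scheme && $uri->scheme =~ /^https?$/;
    my $path = $uri->path;
    $path =~ s{/index\.php.*$}{};
    print $uri->host, $path, "\n";
}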
dms explained in his answer why using split isn't the best solution here:
It returns the number of items in scalar context
A normal regex is better suited for this task.
However, I do not think that line-based processing of the input is valid for HTML, or that using a substitution makes sense (it does not, especially when the pattern looks like .*Pattern.*).
Given a URL, we can extract the required information like this:
if ($url =~ m{^https?://(.+?)/index\.php}s) { # domain+path now in $1
say $1;
}
But how do we extract the URLs? I'd recommend the wonderful Mojolicious suite.
use strict; use warnings;
use feature 'say';
use File::Slurp 'slurp'; # makes it easy to read files.
use Mojo;
my $html_file = shift @ARGV; # take file name from command line
my $dom = Mojo::DOM->new(scalar slurp $html_file);
for my $link ($dom->find('a[href]')->each) {
    say $1 if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
}
The find method can take CSS selectors (here: all a elements that have an href attribute). The each method flattens the result set into a list which we can loop over.
As we print to STDOUT, we can use shell redirection to put the output into the desired file, e.g.
$ perl the-script.pl html-with-links.html >only-links.txt
The whole script as a one-liner:
$ perl -Mojo -E'$_->attr("href") =~ m{^https?://(.+?)/index\.php}s and say $1 for x(b("test.html")->slurp)->find("a[href]")->each'