Unescaping data in mason/Perl and creating a Json out of it - json

string s = "%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D"
I want this to be converted into a JSON in Mason Language. (Mason is very similar to perl).
I am doing this and it is working partly:
URI::Escape::uri_unescape($ItemAssociationGroupData)
This is returning:
{parentAsin:asin_1,+businessType:+"AHS",renderType:RenderAll,constraints:[{type:+Delete,mutuallyInclusive:false}]}
Here I dont want the "+" signs and the final output should be a Json and not a String. Like this can be done online on this tool, but I want to do same in code.
https://www.url-encode-decode.com/
I have tried: JSON::XS::to_json && HTML::Entities.. n all but they are not working and returning undef values.
Any help here is appreciated

Just replace the + with spaces.
uri_unescape( $ItemAssociationGroupData =~ s/\+/ /rg )
That produces
{parentAsin:asin_1, businessType: "AHS",renderType:RenderAll,constraints:[{type: Delete,mutuallyInclusive:false}]}
But that string isn't JSON. The keys of objects must be string literals in JSON, and string literals must be quoted.
Cpanel::JSON::XS's allow_barekey option will make it accept unquoted keys, but no JSON parser is going to accept the other unquoted string literals (asin_1, RenderAll, Delete). Not even JavaScript would accept that.

I don't know where you're getting that string from, but it's not really very close to JSON.
!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use JSON;
use URI::Escape;
use Data::Dumper;
my $str = '%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D';
my $json = uri_unescape($str);
say $json;
say Dumper decode_json($json);
We get this output:
{parentAsin:asin_1,+businessType:+"AHS",renderType:RenderAll,constraints:[{type:+Delete,mutuallyInclusive:false}]}
And then this error:
'"' expected, at character offset 1 (before "parentAsin:asin_1,+b...") at json_decode line 21.
That's caused by the keys in your objects not being in quoted strings. Ok, we can fix that. We'll also replace the '+' signs with spaces.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use JSON;
use URI::Escape;
use Data::Dumper;
my $str = '%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D';
# ADDED THIS LINE
$str =~ s/\+/ /g;
my $json = uri_unescape($str);
# ADDED THIS LINE
$json =~ s/(\w+?):/"$1":/g;
say $json;
say Dumper decode_json($json);
Now we get better output:
{"parentAsin":asin_1, "businessType": "AHS","renderType":RenderAll,"constraints":[{"type": Delete,"mutuallyInclusive":false}]}
But we still get an error:
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 14 (before "asin_1,+"businessTyp...") at json_decode line 21.
This is because your values also need to be quoted strings. But fixing this is harder because some of your values are already quoted (e.g. "AHS") and some values don't need to be quoted (e.g. false).
So it's hard to know the best approach to take from here. My first instinct would be to go back to whatever is generating that original string and see if you can get the bugs fixed so you get a proper JSON string.

Related

How can I convert HTML entities?

I am reading in a large string in Perl from a webpage using WWW::Mechanzie. Am not writing it into a file, just going through it. However apostrophes are coming out as &#27. Is there a way to automatically convert the entire string so that I get ' instead of its character code?
To decode strings with HTML entities you can use the decode() method in HTML::Entities. For example:
use feature qw(say);
use strict;
use warnings;
use HTML::Entities;
my $str = "An &#39example&#39";
say decode_entities($str);
Output:
An 'example'

Perl: threads vs JSON

Below listed program fails with the following error:
JSON text must be an object or array (but found number, string, true, false or null, use allow_nonref to allow this) at json_test.pl line 10.
Works fine when I comment out thread startup/join, or when JSON is parsed before thread is run.
Message seems to be coming from JSON library, so I suppose something is wrong with it.
Any ideas what's going on and how to fix it?
# json_test.pl
use strict;
use warnings;
use threads;
use JSON;
use Data::Dumper;
my $t = threads->new(\&DoSomething);
my $str = '{"category":"dummy"}';
my $json = JSON->new();
my $data = $json->decode($str);
print Dumper($data);
$t->join();
sub DoSomething
{
sleep 10;
return 1;
}
JSON uses JSON::XS if installed which is not compatible with Perl threads (please don't take the author's words at face value - threads are discouraged and difficult to use effectively, but not deprecated and there are no plans to remove them). The community-preferred fork Cpanel::JSON::XS is thread safe and will be used by JSON::MaybeXS by default, which is a mostly drop-in replacement for JSON.

Perl YAML to JSON

What I am trying to do should be VERY straightforward and simple.
use JSON;
use YAML;
use Data::Dumper;
my $yaml_hash = YAML::LoadFile("data_file.yaml");
print ref($yaml_hash) # prints HASH as expected
print Dumper($yaml_hash) # correctly prints the hash
my $json_text = encode_json($yaml_hash);
The encode_json errors out saying:
cannot encode reference to scalar 'SCALAR(0x100ab630)' unless the scalar is 0 or 1
I am not able to understand why encode_json thinks that $yaml_hash is a reference to a scalar when in fact it is a reference to a HASH
What am I doing wrong?
It is not $yaml_hash that it is complaining about, it is some reference in one of the hash values (or deeper). Scalar references can be represented in YAML but not in JSON.
YAML enables you to load objects and scalar references. JSON does not by default
I suspect that your data file most likely contains an inside-out object, and JSON doesn't know how to work with the scalar reference.
The following demonstrates loading a YAML hash containing a scalar reference in one of the values and then failing to encode it using JSON:
use strict;
use warnings;
use YAML;
use JSON;
# Load a YAML hash containing a scalar ref as a value.
my ($hashref) = Load(<<'END_YAML');
---
bar: !!perl/ref
=: 17
foo: 1
END_YAML
use Data::Dump;
dd $hashref;
my $json_text = encode_json($hashref);
Output:
{ bar => \17, foo => 1 }
cannot encode reference to scalar at script.pl line 18.
Here are one liners that can be used to pipe YAML in and produce JSON on STDOUT
perl -0777 -MYAML -MJSON -e 'print(JSON->new()->utf8()->pretty()->encode(Load(<STDIN>)))'
or even shorter if you don't care for formatting
perl -0777 -MYAML -MJSON -e 'print encode_json(Load(<STDIN>))'
For large volumes and faster parsing I'd also recommend using YAML::XS and JSON::XS counterparts

parse domains from html page using perl

i have an html page that contain urls like :
<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">
i want to extract :
site.com/path
www.site.org
between <h3><a href=" & /index.php .
i've tried this code :
#!/usr/local/bin/perl
use strict;
use warnings;
open (MYFILE, 'MyFileName.txt');
while (<MYFILE>)
{
my $values1 = split('http://', $_); #VALUE WILL BE: www.site.org/path/index2.php
my #values2 = split('index.php', $values1); #VALUE WILL BE: www.site.org/path/ ?option=com_content
print $values2[0]; # here it must print www.site.org/path/ but it don't
print "\n";
}
close (MYFILE);
but this give an output :
2
1
2
2
1
1
and it don't parse https websites.
hope you've understand , regards.
The main thing wrong with your code is that when you call split in scalar context as in your line:
my $values1 = split('http://', $_);
It returns the size of the list created by the split. See split.
But I don't think split is appropriate for this task anyway. If you know that the value you are looking for will always lie between 'http[s]://' and '/index.php' you just need a regex substitution in your loop (you should also be more careful opening your file...):
open(my $myfile_fh, '<', 'MyFileName.txt') or die "Couldn't open $!";
while(<$myfile_fh>) {
s{.*http[s]?://(.*)/index\.php.*}{$1} && print;
}
close($myfile_fh);
It's likely you will need a more general regex than that, but I think this would work based on your description of the problem.
This feels to me like a job for modules
HTML::LinkExtor
URI
Generally using regexps to parse HTML is risky.
dms explained in his answer why using split isn't the best solution here:
It returns the number of items in scalar context
A normal regex is better suited for this task.
However, I do not think that line-based processing of the input is valid for HTML, or that using a substitution makes sense (it does not, especially when the pattern looks like .*Pattern.*).
Given an URL, we can extract the required information like
if ($url =~ m{^https?://(.+?)/index\.php}s) { # domain+path now in $1
say $1;
}
But how do we extract the URLs? I'd recommend the wonderful Mojolicious suite.
use strict; use warnings;
use feature 'say';
use File::Slurp 'slurp'; # makes it easy to read files.
use Mojo;
my $html_file = shift #ARGV; # take file name from command line
my $dom = Mojo::DOM->new(scalar slurp $html_file);
for my $link ($dom->find('a[href]')->each) {
say $1 if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
}
The find method can take CSS selectors (here: all a elements that have an href attribute). The each flattens the result set into a list which we can loop over.
As I print to STDOUT, we can use shell redirection to put the output into a wanted file, e.g.
$ perl the-script.pl html-with-links.html >only-links.txt
The whole script as a one-liner:
$ perl -Mojo -E'$_->attr("href") =~ m{^https?://(.+?)/index\.php}s and say $1 for x(b("test.html")->slurp)->find("a[href]")->each'

Scrambled umlauts since upgrade from JSON1 to JSON2 in Perl

I wondered why some german umlauts were scrambled on our page.
Then i found out that the recent version of JSON (i use 2.07) does convert strings in an other manner than JSON 1.5.
Problem here is that i have a hash with strings like
use Data::Dumper;
my $test = {
'fields' => 'überrascht'
};
print Dumper(to_json($test)); gives me
$VAR1 = "{ \"fields\" : \"\x{fc}berrascht\" } ";
Using the old module using
$json = JSON->new();
print Dumper ($json->to_json($test));
gives me (the correct result)
$VAR1 = '{"fields":[{"title":"überrascht"}]}';
So umlauts are scrammbled using the new JSON 2 module.
What do i need to get them correct?
Update: It might be bad to use Data::Dumper to show output, because Dumper uses its own encoding. Well, a difference in the result from Dumper shows that anything is treated differently here. It might be better to describe the backend as Brad mentioned:
The json string gets printed using Template-Toolkit and then gets assigned to a javascript variable for further use. The correct javascript shows something like this
{
"title" : "Geändert",
},
using the new module i get
{
"title" : "Geändert",
},
The target page is in 8859-1 (latin1).
Any suggestions?
\x{fc} is ü, at least in Latin-1, Latin-9 etc. Also, ü is codepoint U+00FC in Unicode. However, we want UTF-8 (I suppose). The easiest solution to get UTF-8 string literals is to save your Perl source code with this encoding, and put a use utf8; at the top of your script.
Then, encoding the string as JSON yields correct output:
use strict; use warnings; use utf8;
use Data::Dumper; use JSON;
print Dumper encode_json {fields => "nicht überrascht"};
The encode_json assumes UTF-8. Read the documentation for more info.
Output:
$VAR1 = '{"fields":"nicht überrascht"}';
(JSON module version: 2.53)
my $json_text = to_json($data);
is short for
my $json_text = JSON->new->encode($data);
This returns a string of Unicode Code Points. U+00FC is indeed the correct Unicode code point for "ü", so the output is correct. (As proof, the HTML source for that is actually "ü".)
It's hard to tell what your original output actually contained (since you showed non-ASCII characters), so it's hard to determine what your problem is actually.
But one thing you must do before outputing the string is to convert it from a string of code points into bytes, say, by using Encode's encode or encode_utf8.
my $json_cp1252 = encode('cp1252', to_json($data));
my $json_utf8 = encode_utf8(to_json($data));
If the appropriate encoding is UTF-8, you can also use any of the following:
my $json_utf8 = to_json($data, { utf8 => 1 });
my $json_utf8 = encode_json($data);
my $json_utf8 = JSON->new->utf8->encode($data);
Use encode_json instead. According to the manual it converts the given Perl data structure to a UTF-8 encoded, binary string.
Regarding your update: If you actually want to produce JSON in Latin1 (ISO-8859-1), you can try:
to_json($test, { latin1 => 1 })
Or
JSON->new->latin1->encode($test)
Note that if you dump the result, getting \x{fc} for ü is correct in this case. I guess that the root of your problem is that you receive text in Perl's UTF-8 format from somewhere. In this case, the latin1 option of the JSON module is needed.
You can also try to use ascii instead of latin1 as the safest option.
Another solution might be to specify an output encoding for Template-Toolkit. I don't know if that's possible. Or, you could encode your result as Latin1 in the final step before sending it to the client.
Strictly-speaking, Latin-1-encoded JSON is not valid JSON. The JSON spec allows UTF-8, UTF-16 or UTF-32 encodings.
If you want to be standards-compliant or you want to ensure your JSON will be compatible with both your current pages and future UTF-8-based pages, you need to use JSON->new->utf8->encode($str). Being strict about generated valid JSON could save you lots of headaches in the future.
You can translate UTF-8 JSON to Latin-1 using client-side Javascript if you need to, using this trick.
The ascii option also produces valid JSON, by escaping any non-ASCII characters using valid JSON unicode escapes. But the latin1 option does not, and therefore should be avoided IMHO. The utf8(0) option should be avoided too unless you specify an encoding when writing the data out to clients: utf8(0) is subtly different from the utf8 option in that it generates Perl character strings instead of byte strings. If you do any I/O using character strings without specifying an encoding, Perl will translate it on-the-fly back to Latin-1. The utf8 option generates raw UTF-8 bytes, which are perfect for doing raw I/O.