How can I convert HTML entities?

I am reading a large string from a webpage in Perl using WWW::Mechanize. I am not writing it to a file, just processing it in memory. However, apostrophes come out as &#39;. Is there a way to automatically convert the entire string so that I get ' instead of its character code?

To decode strings containing HTML entities you can use the decode_entities() function from HTML::Entities. For example:
use feature qw(say);
use strict;
use warnings;
use HTML::Entities;
my $str = "An &#39;example&#39;";
say decode_entities($str);
Output:
An 'example'
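The same call also handles hexadecimal and named entities. The snippet below is a minimal sketch; the sample string is made up for illustration and is not taken from the page in the question.

```perl
use strict;
use warnings;
use feature qw(say);
use HTML::Entities qw(decode_entities);

# Hypothetical input mixing decimal, hexadecimal and named entities.
my $raw = 'Tom &amp; Jerry&#39;s &quot;show&quot; isn&#x27;t over';

# In non-void context decode_entities returns the decoded string.
my $decoded = decode_entities($raw);

say $decoded;
```

With WWW::Mechanize, the string passed in would typically be whatever $mech->content returned.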

Related

Unescaping data in mason/Perl and creating a Json out of it

string s = "%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D"
I want this to be converted into a JSON in Mason Language. (Mason is very similar to perl).
I am doing this and it is working partly:
URI::Escape::uri_unescape($ItemAssociationGroupData)
This is returning:
{parentAsin:asin_1,+businessType:+"AHS",renderType:RenderAll,constraints:[{type:+Delete,mutuallyInclusive:false}]}
Here I don't want the "+" signs, and the final output should be JSON, not a string. This can be done online with this tool, but I want to do the same in code.
https://www.url-encode-decode.com/
I have tried JSON::XS::to_json, HTML::Entities and others, but they do not work and return undef.
Any help is appreciated.
Just replace the + with spaces.
uri_unescape( $ItemAssociationGroupData =~ s/\+/ /rg )
That produces
{parentAsin:asin_1, businessType: "AHS",renderType:RenderAll,constraints:[{type: Delete,mutuallyInclusive:false}]}
But that string isn't JSON. The keys of objects must be string literals in JSON, and string literals must be quoted.
Cpanel::JSON::XS's allow_barekey option will make it accept unquoted keys, but no JSON parser is going to accept the other unquoted string literals (asin_1, RenderAll, Delete). Not even JavaScript would accept that.
I don't know where you're getting that string from, but it's not really very close to JSON.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use JSON;
use URI::Escape;
use Data::Dumper;
my $str = '%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D';
my $json = uri_unescape($str);
say $json;
say Dumper decode_json($json);
We get this output:
{parentAsin:asin_1,+businessType:+"AHS",renderType:RenderAll,constraints:[{type:+Delete,mutuallyInclusive:false}]}
And then this error:
'"' expected, at character offset 1 (before "parentAsin:asin_1,+b...") at json_decode line 21.
That's caused by the keys in your objects not being in quoted strings. Ok, we can fix that. We'll also replace the '+' signs with spaces.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use JSON;
use URI::Escape;
use Data::Dumper;
my $str = '%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D';
# ADDED THIS LINE
$str =~ s/\+/ /g;
my $json = uri_unescape($str);
# ADDED THIS LINE
$json =~ s/(\w+?):/"$1":/g;
say $json;
say Dumper decode_json($json);
Now we get better output:
{"parentAsin":asin_1, "businessType": "AHS","renderType":RenderAll,"constraints":[{"type": Delete,"mutuallyInclusive":false}]}
But we still get an error:
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 14 (before "asin_1,+"businessTyp...") at json_decode line 21.
This is because your values also need to be quoted strings. But fixing this is harder because some of your values are already quoted (e.g. "AHS") and some values don't need to be quoted (e.g. false).
So it's hard to know the best approach to take from here. My first instinct would be to go back to whatever is generating that original string and see if you can get the bugs fixed so you get a proper JSON string.
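If the generator really cannot be fixed, one possible workaround is to quote the remaining bare words with a second substitution. This is only a heuristic sketch: it assumes bare values are simple identifiers, that true, false and null are the only literals that must stay unquoted, and that no quoted string contains text resembling a bare key or value. JSON::PP is used here because it ships with Perl; it exports the same decode_json as JSON.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use JSON::PP;
use URI::Escape;

my $str = '%7BparentAsin%3Aasin_1%2C+businessType%3A+%22AHS%22%2CrenderType%3ARenderAll%2Cconstraints%3A%5B%7Btype%3A+Delete%2CmutuallyInclusive%3Afalse%7D%5D%7D';

# Turn '+' into spaces, then undo the percent-encoding.
(my $json = $str) =~ tr/+/ /;
$json = uri_unescape($json);

# Quote the bare object keys (same substitution as above).
$json =~ s/(\w+?):/"$1":/g;

# Heuristic: quote bare identifiers used as values, leaving JSON literals alone.
my %literal = map { $_ => 1 } qw(true false null);
$json =~ s{:\s*\K([A-Za-z_]\w*)(?=\s*[,\}\]])}{ $literal{$1} ? $1 : qq{"$1"} }ge;

say $json;

my $data = decode_json($json);
say $data->{parentAsin};
say $data->{constraints}[0]{type};
```

A string value such as "a:b,c" would defeat these regexes, so fixing the producer remains the more robust option.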

Inserting two arrays into columns of a html table perl

I have two arrays that have related data. I need to insert them into a html table. I am accessing these arrays from a different program by using modules which I found out by searching the forum.
package My::Module;
use strict;
use warnings;
use File::Slurp;
use Data::Dumper;
use Exporter;
our @ISA = 'Exporter';
our @EXPORT = qw(\@owners \@values);
our(@owners, @values);
$Data::Dumper::Indent = 1;
my @fileDatas = read_file("/x/home/venganesan/output.txt");
This file is named Module.pm and lives in a folder called My. The relevant parts of the other file, which will produce the table, are:
use strict;
use warnings;
use CGI;
use My::Module;
my $q = new CGI;
print $q->header;
print $q->start_html(-title=>"Table testing", -style =>{'src'=> '/x/home/venganesan/style.css'});
print $q->h1("Modified WOWO diff");
print $q->table( {-border=>1, cellpadding=>3},
$q->Tr($q->th(['WOWODiff', 'Owner', 'Signoff'])),
foreach $own(@owners){
$q->Tr(
$q->td([$own,'Two', 'Three'])},
$q->td(['four', 'Five', 'Six']),
),
I am just trying to print one array to see how it works and then include the other. When I use Module.pm, the output I get is both arrays printed on the command line without the HTML. If I remove it, I get the HTML code. I am learning Perl and new modules on the fly. I am open to criticism and better ways to implement the code.
It's 2013. No-one should be generating HTML using CGI.pm these days. By all means, use CGI.pm for generating headers and parsing CGI requests, but please consider using something like the Template Toolkit for your HTML.
I'm not clear what your question is. Are you saying that you get errors if you use My::Module (that's a terrible name for it, by the way)? In that case you should see what gets written to the web server's error log and address the problems given there.
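As a rough illustration of the Template Toolkit suggestion, the sketch below renders two parallel arrays as table rows. The sample data, the choice of which array fills which column, and the template text are all invented for this example; only the Template module's new/process/error calls are the library's own.

```perl
use strict;
use warnings;
use Template;

# Hypothetical stand-ins for the @owners and @values exported by the module.
my @owners = ('alice', 'bob');
my @values = ('diff-1', 'diff-2');

# Pair the arrays up by index so the template sees one record per row.
my @rows = map { +{ owner => $owners[$_], value => $values[$_] } } 0 .. $#owners;

my $template = <<'END_TT';
<table border="1" cellpadding="3">
<tr><th>WOWODiff</th><th>Owner</th></tr>
[% FOREACH row IN rows -%]
<tr><td>[% row.value | html %]</td><td>[% row.owner | html %]</td></tr>
[% END -%]
</table>
END_TT

my $tt = Template->new;
$tt->process(\$template, { rows => \@rows }, \my $html)
    or die $tt->error;

print $html;
```

In a CGI script the header would still be printed first, for example with CGI.pm's header method, before the rendered HTML.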

Scrambled umlauts since upgrade from JSON1 to JSON2 in Perl

I wondered why some German umlauts were scrambled on our page.
Then I found out that the recent version of JSON (I use 2.07) converts strings differently from JSON 1.5.
Problem here is that i have a hash with strings like
use Data::Dumper;
my $test = {
'fields' => 'überrascht'
};
print Dumper(to_json($test)); gives me
$VAR1 = "{ \"fields\" : \"\x{fc}berrascht\" } ";
Using the old module using
$json = JSON->new();
print Dumper ($json->to_json($test));
gives me (the correct result)
$VAR1 = '{"fields":[{"title":"überrascht"}]}';
So umlauts are scrambled using the new JSON 2 module.
What do i need to get them correct?
Update: It might be bad to use Data::Dumper to show output, because Dumper uses its own encoding. Well, a difference in the result from Dumper shows that anything is treated differently here. It might be better to describe the backend as Brad mentioned:
The json string gets printed using Template-Toolkit and then gets assigned to a javascript variable for further use. The correct javascript shows something like this
{
"title" : "Geändert",
},
using the new module i get
{
"title" : "Geändert",
},
The target page is in 8859-1 (latin1).
Any suggestions?
\x{fc} is ü, at least in Latin-1, Latin-9 etc. Also, ü is codepoint U+00FC in Unicode. However, we want UTF-8 (I suppose). The easiest solution to get UTF-8 string literals is to save your Perl source code with this encoding, and put a use utf8; at the top of your script.
Then, encoding the string as JSON yields correct output:
use strict; use warnings; use utf8;
use Data::Dumper; use JSON;
print Dumper encode_json {fields => "nicht überrascht"};
The encode_json assumes UTF-8. Read the documentation for more info.
Output:
$VAR1 = '{"fields":"nicht überrascht"}';
(JSON module version: 2.53)
my $json_text = to_json($data);
is short for
my $json_text = JSON->new->encode($data);
This returns a string of Unicode Code Points. U+00FC is indeed the correct Unicode code point for "ü", so the output is correct. (As proof, the HTML source for that is actually "ü".)
It's hard to tell what your original output actually contained (since you showed non-ASCII characters), so it's hard to determine what your problem is actually.
But one thing you must do before outputting the string is to convert it from a string of code points into bytes, say, by using Encode's encode or encode_utf8.
my $json_cp1252 = encode('cp1252', to_json($data));
my $json_utf8 = encode_utf8(to_json($data));
If the appropriate encoding is UTF-8, you can also use any of the following:
my $json_utf8 = to_json($data, { utf8 => 1 });
my $json_utf8 = encode_json($data);
my $json_utf8 = JSON->new->utf8->encode($data);
Use encode_json instead. According to the manual it converts the given Perl data structure to a UTF-8 encoded, binary string.
Regarding your update: If you actually want to produce JSON in Latin1 (ISO-8859-1), you can try:
to_json($test, { latin1 => 1 })
Or
JSON->new->latin1->encode($test)
Note that if you dump the result, getting \x{fc} for ü is correct in this case. I guess that the root of your problem is that you receive text in Perl's UTF-8 format from somewhere. In this case, the latin1 option of the JSON module is needed.
You can also try to use ascii instead of latin1 as the safest option.
Another solution might be to specify an output encoding for Template-Toolkit. I don't know if that's possible. Or, you could encode your result as Latin1 in the final step before sending it to the client.
Strictly-speaking, Latin-1-encoded JSON is not valid JSON. The JSON spec allows UTF-8, UTF-16 or UTF-32 encodings.
If you want to be standards-compliant or you want to ensure your JSON will be compatible with both your current pages and future UTF-8-based pages, you need to use JSON->new->utf8->encode($str). Being strict about generating valid JSON could save you lots of headaches in the future.
You can translate UTF-8 JSON to Latin-1 using client-side Javascript if you need to, using this trick.
The ascii option also produces valid JSON, by escaping any non-ASCII characters using valid JSON unicode escapes. But the latin1 option does not, and therefore should be avoided IMHO. The utf8(0) option should be avoided too unless you specify an encoding when writing the data out to clients: utf8(0) is subtly different from the utf8 option in that it generates Perl character strings instead of byte strings. If you do any I/O using character strings without specifying an encoding, Perl will translate it on-the-fly back to Latin-1. The utf8 option generates raw UTF-8 bytes, which are perfect for doing raw I/O.
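The difference between these options can be seen by comparing string lengths. The following is a small sketch using JSON::PP (bundled with Perl, and offering the same utf8 and ascii options as JSON); the sample hash is taken from the question's "Geändert" example, written with an escape so the script does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use feature 'say';
use JSON::PP;
use Encode qw(encode);

my $data = { title => "Ge\x{e4}ndert" };    # "Geändert" as a character string

# No utf8 option: a string of code points, which must be encoded before output.
my $chars = JSON::PP->new->encode($data);

# utf8 option: raw UTF-8 bytes, so the umlaut takes two bytes.
my $utf8 = JSON::PP->new->utf8->encode($data);

# Explicit encoding step for a Latin-1 page: one byte per character.
my $latin1 = encode('ISO-8859-1', $chars);

# ascii option: non-ASCII characters become \uXXXX escapes, valid in any page encoding.
my $ascii = JSON::PP->new->ascii->encode($data);

say length $chars;
say length $utf8;
say length $latin1;
say $ascii;
```

For a page served as ISO-8859-1, the ascii form is probably the least fragile choice, since it survives any of these encodings unchanged.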

Perl WWW::Mechanize from string variable

In Perl, using the WWW::Mechanize module (it is required; I cannot use another module), is it possible to "parse" a document from a string variable instead of a URL?
I mean instead of
$mech->get($url);
to do something like
$html = '<html...';
$mech->???($html);
Possible?
You could write the data to disk and then get() it in the usual manner. Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp;
use URI::file;
use WWW::Mechanize;
my $data = '<html><body>foo</body></html>';
# write the data to disk
my $fh = File::Temp->new;
print $fh $data;
$fh->close;
my $mech = WWW::Mechanize->new;
$mech->get( URI::file->new( $fh->filename ) );
print $mech->content;
prints: <html><body>foo</body></html>
Got it:
$mech->get(0);
$mech->update_html('<html>...</html>');
It works!
Not really. You could try getting the HTTP::Response object using $mech->response and then using that object's content method to replace the content with your own string. But you would have to adjust all the message headers as well and it would get quite messy.
What is it that you want to do? The methods like forms and images that WWW::Mechanize provides are based on other modules and are fairly simple to code.

How to Allow CGI in HTML to accept Whitespace as Parameter

Currently I have a link like follows:
<a href=/myweb/cgi-bin/my.cgi?name=B. anthracis>B. anthracis</a>
But instead of taking B. anthracis as input parameter, it takes B. instead.
How can I modify the above HTML or CGI script to allow that?
And currently my CGI script looks like this:
use CGI;
my $cgi = CGI->new();
my $param = $cgi->param('name');
print "$param\n";
You should URL encode the query string:
<a href="/myweb/cgi-bin/my.cgi?name=B.%20anthracis">B. anthracis</a>
And including the quotes on your attributes is strongly recommended.
You can use encodeURIComponent in JavaScript or uri_escape in Perl to encode each parameter name and value before building the query string.
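On the Perl side, building such a link could look like the sketch below. The path and parameter name come from the question; the variable names are illustrative only.

```perl
use strict;
use warnings;
use feature 'say';
use URI::Escape qw(uri_escape);
use HTML::Entities qw(encode_entities);

my $name = 'B. anthracis';

# Percent-encode the parameter value so the space survives as %20.
my $href = '/myweb/cgi-bin/my.cgi?name=' . uri_escape($name);

# Quote the attribute and HTML-escape the visible link text.
my $link = sprintf '<a href="%s">%s</a>', $href, encode_entities($name);

say $link;
```

CGI.pm decodes the query string when parsing the request, so $cgi->param('name') in the receiving script should return the original value with the space intact.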