Corrupted JSON encoding in Perl (missing comma)

My custom Perl code produces the following invalid JSON, with a missing comma between blocks:
{
  "data": [{
    "{#LOGFILEPATH}": "/tmp/QRZ2007.tcserverlogs",
    "{#LOGFILE}": "QRZ2007"
  } **missing comma** {
    "{#LOGFILE}": "ARZ2007",
    "{#LOGFILEPATH}": "/tmp/ARZ2007.tcserverlogs"
  }]
}
My terrible code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use utf8;
use JSON;

binmode STDOUT, ":utf8";

my $dir  = $ARGV[0];
my $json = JSON->new->utf8->space_after;

opendir(DIR, $dir) or die $!;
print '{"data": [';
while (my $file = readdir(DIR)) {
    next unless (-f "$dir/$file");
    next unless ($file =~ m/\.tcserverlogs$/);
    my $fullPath      = "$dir/$file";
    my $filenameshort = basename($file, ".tcserverlogs");
    my $data_to_json  = { "{#LOGFILEPATH}" => $fullPath, "{#LOGFILE}" => $filenameshort };
    print $json->encode($data_to_json);
}
print ']}' . "\n";
closedir(DIR);
exit 0;
Dear team, I am not a programmer. Any idea how to fix it? Thank you!

If you do not print a comma, you will not get a comma.
You are trying to build your own JSON string from pre-encoded pieces of smaller data structures. That will not work unless you tell Perl when to put commas. You could do that, but it's easier to just collect all the data into a Perl data structure that is equivalent to the JSON string you want to produce, and encode the whole thing in one go when you're done.
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use JSON;

binmode STDOUT, ":utf8";

my $dir  = $ARGV[0];
my $json = JSON->new->utf8->space_after;
my @data;

opendir( DIR, $dir ) or die $!;
while ( my $file = readdir(DIR) ) {
    next unless ( -f "$dir/$file" );
    next unless ( $file =~ m/\.tcserverlogs$/ );
    my $fullPath      = "$dir/$file";
    my $filenameshort = basename( $file, ".tcserverlogs" );
    my $data_to_json  = { "{#LOGFILEPATH}" => $fullPath, "{#LOGFILE}" => $filenameshort };
    push @data, $data_to_json;
}
closedir(DIR);
print $json->encode( { data => \@data } );
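If you really did want to print your own commas, the idea would look something like this sketch (reusing the $json object and @data array from above; shown only for illustration, the single encode is still the cleaner fix):
# Encode each element separately and insert the commas yourself.
my @pieces = map { $json->encode($_) } @data;
print '{"data": [', join(',', @pieces), ']}', "\n";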

Related

perl & python writing out non-ASCII characters into JSON differently

I have hash keys that look like this:
1я310яHOM_REF_truth:HOM_ALT_test:discordant_hom_ref_to_hom_altяAяC
This is a string joined by the Cyrillic letter я, which I chose as a delimiter because it will never appear in these files.
I write this to a JSON file in Perl 5.30.2 thus:
use JSON 'encode_json';
use feature 'say';

sub hash_to_json_file {
    my $hash     = shift;
    my $filename = shift;
    my $json     = encode_json $hash;
    open my $out, '>', $filename;
    say $out $json;
}
and in Python 3.8:
import json

def hash_to_json_file(hashtable, filename):
    json1 = json.dumps(hashtable)
    f = open(filename, "w+")
    print(json1, file=f)
    f.close()
When I try to load a JSON file written by Python back into a Perl script, I see a cryptic error that I don't know how to solve:
Wide character in say at read_json.pl line 27.
Reading https://perldoc.perl.org/perlunifaq.html I've tried adding use utf8 to my script, but it doesn't work.
I've also tried '>:encoding(UTF-8)' within my subroutine, but the same error results.
Upon inspecting the JSON files written by Perl, I see keys like "1Ñ180ÑHET_ALT_truth:HET_REF_test:discordant_het_alt_to_het_refÑAÑC,G", where Ñ appears in place of я. In the JSON written by Python, I see \u044f instead. I think this is the wide character, but I don't know how to change it back.
I've also tried changing my subroutine:
use Encode 'decode';

sub json_file_to_hash {
    my $file = shift;
    open my $in, '<:encoding(UTF-8)', $file;
    my $json = <$in>;
    my $ref  = decode_json $json;
    $ref = decode('UTF-8', $json);
    return %{ $ref };
}
but this gives another error:
Wide character in hash dereference at read_json.pl line 17, <$_[...]> line 1
How can I get JSON written by Python read into Perl correctly?
use strict;
use warnings;
use feature qw( say );

use utf8;                            # Source is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)'; # For say to STDOUT. Also default for open().

use JSON qw( decode_json encode_json );

sub hash_to_json_file {
    my $qfn  = shift;
    my $ref  = shift;
    my $json = encode_json($ref);    # Produces UTF-8.
    open(my $fh, '>:raw', $qfn)      # Write it unmangled.
        or die("Can't create \"$qfn\": $!\n");
    say $fh $json;
}

sub json_file_to_hash {
    my $qfn = shift;
    open(my $fh, '<:raw', $qfn)      # Read it unmangled.
        or die("Can't open \"$qfn\": $!\n");
    local $/;                        # Read whole file.
    my $json = <$fh>;                # This is UTF-8.
    my $ref = decode_json($json);    # This produces decoded text.
    return $ref;                     # Return the ref rather than the keys and values.
}
my $src = { key => "1я310яHOM_REF_truth:HOM_ALT_test:discordant_hom_ref_to_hom_altяAяC" };
hash_to_json_file("a.json", $src);
my $dst = json_file_to_hash("a.json");
say $dst->{key};
You could also avoid using :raw by using from_json and to_json.
use strict;
use warnings;
use feature qw( say );

use utf8;                            # Source is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)'; # For say to STDOUT. Also default for open().

use JSON qw( from_json to_json );

sub hash_to_json_file {
    my $qfn  = shift;
    my $hash = shift;
    my $json = to_json($hash);       # Produces decoded text.
    open(my $fh, '>', $qfn)          # "use open" will add :encoding(UTF-8).
        or die("Can't create \"$qfn\": $!\n");
    say $fh $json;                   # Encoded by :encoding(UTF-8).
}

sub json_file_to_hash {
    my $qfn = shift;
    open(my $fh, '<', $qfn)          # "use open" will add :encoding(UTF-8).
        or die("Can't open \"$qfn\": $!\n");
    local $/;                        # Read whole file.
    my $json = <$fh>;                # Decoded text thanks to "use open".
    my $ref = from_json($json);      # $ref contains decoded text.
    return $ref;                     # Return the ref rather than the keys and values.
}
my $src = { key => "1я310яHOM_REF_truth:HOM_ALT_test:discordant_hom_ref_to_hom_altяAяC" };
hash_to_json_file("a.json", $src);
my $dst = json_file_to_hash("a.json");
say $dst->{key};
I like the ascii option, so that the JSON output is all 7-bit ASCII:
my $json = JSON->new->ascii->encode($hash);
Both the Perl and Python JSON modules will be able to read it.
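For example, a quick sketch of what that option produces (the literal я requires use utf8; in the source):
use utf8;            # source code contains a literal я
use feature 'say';
use JSON;

my $json = JSON->new->ascii;
say $json->encode( { key => "яAя" } );   # {"key":"\u044fA\u044f"} : pure 7-bit ASCII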

Can I use Text::CSV_XS to parse a csv-format string without writing it to disk?

I am getting a "csv file" from a vendor (using their API), but what they do is just spew the whole thing into their response. It wouldn't be a significant problem except that, of course, some of those pesky humans entered the data and put in "features" like line breaks. What I am doing now is creating a file for the raw data and then reopening it to read the data:
open RAW, ">", $rawfile or die "ERROR: Could not open $rawfile for write: $!\n";
print RAW $response->content;
close RAW;

my $csv = Text::CSV_XS->new({ binary => 1, always_quote => 1, eol => $/ });
open my $fh, "<", $rawfile or die "ERROR: Could not open $rawfile for read: $!\n";
while ( my $line = $csv->getline($fh) ) { ...
Somehow this seems ... inelegant. It seems that I ought to be able to just read the data from $response->content (a multiline string) as if it were a file. But I'm drawing a total blank on how to do this.
A pointer would be greatly appreciated.
Thanks,
Paul
You could use a string filehandle:
my $data = $response->content;
open my $fh, "<", \$data or croak "unable to open string filehandle: $!";

my $csv = Text::CSV_XS->new({ binary => 1, always_quote => 1, eol => $/ });
while ( my $line = $csv->getline($fh) ) { ... }
Yes, you can use Text::CSV_XS on a string, via its functional interface:
use warnings;
use strict;
use feature 'say';
use Text::CSV_XS qw(csv);   # must use the _XS version

my $csv = qq(a,line\nand,another);

my $aoa = csv(in => \$csv)
    or die Text::CSV_XS->error_diag;

say "@$_" for @$aoa;
Note that this indeed needs Text::CSV_XS (Text::CSV normally works as a drop-in, but not with this).
I don't know why this isn't available in the OO interface (or perhaps it is, but it isn't documented).
While the above parses the string directly as asked, you can also lessen the "inelegant" aspect of your example by writing the content directly to a file as it is acquired, which most libraries support, for example via the :content_file option of LWP::UserAgent's get method; see the sketch below.
Let me also note that most of the time you want the library to decode the content for you, so with LWP::UserAgent use decoded_content (see HTTP::Response).
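A minimal sketch of that download-to-file approach with LWP::UserAgent (the URL and filename here are placeholders):
use strict;
use warnings;
use LWP::UserAgent;

my $url     = 'http://example.com/export.csv';   # placeholder
my $rawfile = 'vendor_data.csv';                 # placeholder

my $ua  = LWP::UserAgent->new;
my $res = $ua->get( $url, ':content_file' => $rawfile );  # body goes straight to disk
die "Download failed: ", $res->status_line, "\n" unless $res->is_success;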
I cooked up this example with Mojo::UserAgent. For the CSV input I used various data sets from the NYC Open Data. This is also going to appear in the next update for Mojo Web Clients.
I build the request without making the request right away, and that gives me the transaction object, $tx. I can then replace the read event so I can immediately send the lines into Text::CSV_XS:
#!perl
use v5.10;

use Mojo::UserAgent;

my $ua  = Mojo::UserAgent->new;
my $url = ...;
my $tx  = $ua->build_tx( GET => $url );

$tx->res->content->unsubscribe('read')->on( read => sub {
    state $csv = do {
        require Text::CSV_XS;
        Text::CSV_XS->new;
        };
    state $buffer;
    state $reader = do {
        open my $r, '<:encoding(UTF-8)', \$buffer;
        $r;
        };

    my ( $content, $bytes ) = @_;
    $buffer .= $bytes;
    while ( my $row = $csv->getline($reader) ) {
        say join ':', $row->@[2,4];
    }
} );

$tx = $ua->start($tx);
That's not as nice as I'd like it to be because all the data still show up in the buffer. This is slightly more appealing, but it's fragile in the ways I note in the comments. I'm too lazy at the moment to make it any better because that gets hairy very quickly as you figure out when you have enough data to process a record. My particular code isn't as important as the idea that you can do whatever you like as the transactor reads data and passes it into the content handler:
use v5.10;
use strict;
use warnings;
use feature qw(signatures);
no warnings qw(experimental::signatures);

use Mojo::UserAgent;

my $ua  = Mojo::UserAgent->new;
my $url = ...;
my $tx  = $ua->build_tx( GET => $url );

$tx->res->content
    ->unsubscribe('read')
    ->on( read => process_bytes_factory() );

$tx = $ua->start($tx);

sub process_bytes_factory {
    return sub ( $content, $bytes ) {
        state $csv = do {
            require Text::CSV_XS;
            Text::CSV_XS->new( { decode_utf8 => 1 } );
            };
        state $buffer  = '';
        state $line_no = 0;

        $buffer .= $bytes;

        # fragile if the entire content does not end in a
        # newline (or whatever the line ending is)
        my $last_line_incomplete = $buffer !~ /\n\z/;

        # will not work if the format allows embedded newlines
        my @lines = split /\n/, $buffer;
        $buffer = pop @lines if $last_line_incomplete;

        foreach my $line ( @lines ) {
            my $status = $csv->parse($line);
            my @row    = $csv->fields;
            say join ':', $line_no++, @row[2,4];
        }
    };
}

Getting links from an HTML table using HTML::TableExtract and HTML::Extor in Perl

My goal is to extract the links from the tables titled "Agonists," "Antagonists," and "Allosteric Regulators" in the following site:
http://www.iuphar-db.org/DATABASE/ObjectDisplayForward?objectId=1&familyId=1
I've been using HTML::TableExtract to extract the tables but have been unable to get HTML::LinkExtor to retrieve the links in question. Here is the code I have so far:
use warnings;
use strict;
use HTML::TableExtract;
use HTML::LinkExtor;

my @names = `ls /home/wallakin/LINDA/ligands/iuphar/data/html2/`;

foreach (@names) {
    chomp($_);
    my $te = HTML::TableExtract->new( headers => [ "Ligand",
                                                   "Sp.",
                                                   "Action",
                                                   "Affinity",
                                                   "Units",
                                                   "Reference" ] );
    my $le = HTML::LinkExtor->new();
    $te->parse_file("/home/wallakin/LINDA/ligands/iuphar/data/html2/$_");
    my $output = $_;
    $output =~ s/\.html/\.txt/g;
    open( RESET, ">/home/wallakin/LINDA/ligands/iuphar/data/links/$output" ) or die "Can't reset";
    close RESET;
    #open (DATA, ">>/home/wallakin/LINDA/ligands/iuphar/data/links/$output") or die "Can't append to file";
    foreach my $ts ( $te->tables ) {
        foreach my $row ( $ts->rows ) {
            $le->parse( $row->[0] );
            for my $link_tag ( $le->links ) {
                my %links = @$link_tag;
                print @$link_tag, "\n";
            }
        }
    }
    #print "Links extracted from $_\n";
}
I've tried using some sample code from another thread on this site (Perl parse links from HTML Table) to no avail. I'm not sure whether it's a problem of parsing or table recognition. Any help provided would be greatly appreciated. Thanks!
Try this as a base script (you only need to adapt it to fetch links):
use warnings;
use strict;
use HTML::TableExtract;
use HTML::LinkExtor;
use WWW::Mechanize;
use utf8;

binmode(STDIN,  ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

my $m = WWW::Mechanize->new( autocheck => 1, quiet => 0 );
$m->agent_alias("Linux Mozilla");
$m->cookie_jar({});

my $te = HTML::TableExtract->new(
    headers => [
        "Ligand",
        "Sp.",
        "Action",
        "Affinity",
        "Units",
        "Reference"
    ]
);

$te->parse(
    $m->get("http://tinyurl.com/jvwov9m")->content
);

foreach my $ts ( $te->tables ) {
    print "Table (", join( ',', $ts->coords ), "):\n";
    foreach my $row ( $ts->rows ) {
        print join( ',', @$row ), "\n";
    }
}
You don't describe what the problem actually is: what exactly doesn't work? What does $row->[0] contain? But part of the problem might be that HTML::TableExtract returns just the 'visible' text, not the raw HTML, by default. You probably want to use the keep_html option in HTML::TableExtract, as sketched below.
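A minimal sketch of that option (the constructor arguments are otherwise the ones from the question):
# With keep_html, cell contents retain their markup, so HTML::LinkExtor
# actually has <a> tags to find when you call $le->parse($row->[0]).
my $te = HTML::TableExtract->new(
    headers   => [ "Ligand", "Sp.", "Action", "Affinity", "Units", "Reference" ],
    keep_html => 1,
);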

How can I extract data from HTML tables in Perl?

I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:
<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
Here I wish to take out "Time Played", "Artist", "Title", and "Label", and print them to an output file.
I've tried many regular expressions such as:
$lines =~ / (<td>) /
OR
$lines =~ / <td>(.*)< /
OR
$lines =~ / >(.*)< /
My current program looks like so:
#!perl -w
open INPUT_FILE,  "<",  "FIRST_LINE_OF_OUTPUT.txt" or die $!;
open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt"       or die $!;

my $lines = join '', <INPUT_FILE>;
print "Hello 2\n";

if ( $lines =~ / (\S.*\S) / ) {
    print "this is 1: \n";
    print $1;
    if ( $lines =~ / <td>(.*)< / ) {
        print "this is the 2nd 1: \n";
        print $1;
        print "the word was: $1.\n";
        $Time = $1;
        print $Time;
        print OUTPUT_FILE $Time;
    }
    else {
        print "2ND IF FAILED\n";
    }
}
else {
    print "THIS FAILED\n";
}

close(INPUT_FILE);
close(OUTPUT_FILE);
Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
HTML::Parser
HTML::TreeBuilder
HTML::TableExtract
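For instance, here is a short sketch with HTML::TreeBuilder from the list above, applied to the row quoted in the question (the $html string is the one shown there):
use HTML::TreeBuilder;

my $tree  = HTML::TreeBuilder->new_from_content($html);
# Collect the text of every <td> cell, skipping the empty ones.
my @cells = grep { length } map { $_->as_text } $tree->find_by_tag_name('td');
print join(", ", @cells), "\n";   # Time Played, Artist, Title, Label
$tree->delete;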
Use HTML::TableExtract. Really.
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

my $file = 'Table3.htm';

unless ( -e $file ) {
    my $rc = getstore(
        'http://www.ntsb.gov/aviation/Table3.htm',
        $file
    );
    die "Failed to download document\n" unless $rc == 200;
}

my @headers = qw( Year Fatalities );

my $te = HTML::TableExtract->new(
    headers => \@headers,
    attribs => { id => 'myTable' },
);
$te->parse_file($file);

my ($table) = $te->tables;

print join( "\t", @headers ), "\n";

for my $row ( $te->rows ) {
    print join( "\t", @$row ), "\n";
}
This is what I meant in another post by "task-specific" HTML parsers.
You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.
That's an easy one:
my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join(", ", @stuff), "\n";
See http://codepad.org/qz9d5Bro if you want to try running it.

Can I use perltidy's HTML formatter in my automated Perl build?

I'm using Module::Build to perform build, test, testpod, html, & install actions on my Perl module that I'm developing. The HTML files that are generated are okay, but I'd be much happier if I could somehow configure Module::Build to use the perltidy -html formatting utility instead of its own HTML formatter.
Anyone know of a way I can replace the HTML formatter that comes with Module::Build with the prettier perltidy HTML formatter?
Addendum: When I said "replace" above, that was probably misleading. I don't really want to write code to replace the html formatter that comes with Module::Build. I really want to know if Module::Build has any other HTML formatter options. The HTML it generates is so plain and generic looking. It's so boring. I like perltidy's output a lot.
Here is how I got it working right now in a build script that I wrote, but it's totally a hack ... shelling out to the command-line perltidy script:
use strict;
use warnings;

# get the list of files in the directory
my $libLocation = "lib/EDF";
opendir( DIR, $libLocation );
my @filenameArray = readdir(DIR);

# iterate over all files to find the *.pm set
for my $file (@filenameArray) {
    if ( $file =~ m/ # matching regex
                     \.  # literal period character
                     pm  # the pm file extension
                   /x    # end of regex
       )
    {
        my $return = `perl D:/Perl/site/bin/perltidy -q --indent-columns=4 --maximum-line-length=80 -html -opath blib/libhtml2 -toc $libLocation/$file`;
        if ( $return eq "" ) {
            print "HTMLized " . $file . "\n";
        }
        else {
            print "Error: " . $return . "\n";
        }
    }
}
But I was really hoping there was a way to use Module::Build and just tell it, with a flag or an argument or whatever, to use a different HTML formatter. I guess that's a pipe dream, though:
use strict;
use warnings;
use Module::Build;

my $build = Module::Build->resume(
    properties => {
        config_dir => '_build',
    },
);

$build->dispatch('build');
$build->dispatch('html', engine => 'perltidy');
or maybe:
$build->dispatch('htmltidy');
Well, the action is implemented in htmlify_pods in Module::Build::Base. It should be possible to override that method.
Much Later ...
Here is my attempt (tested only once):
package My::Builder;

use strict;
use warnings;
use base 'Module::Build';

sub htmlify_pods {
    my $self    = shift;
    my $type    = shift;
    my $htmldir = shift || File::Spec->catdir($self->blib, "${type}html");

    require Module::Build::Base;
    require Module::Build::PodParser;
    require Perl::Tidy;

    $self->add_to_cleanup('pod2htm*');

    my $pods = $self->_find_pods(
        $self->{properties}{"${type}doc_dirs"},
        exclude => [ Module::Build::Base::file_qr('\.(?:bat|com|html)$') ] );
    return unless %$pods;    # nothing to do

    unless ( -d $htmldir ) {
        File::Path::mkpath($htmldir, 0, oct(755))
            or die "Couldn't mkdir $htmldir: $!";
    }

    my @rootdirs = ($type eq 'bin') ? qw(bin) :
        $self->installdirs eq 'core' ? qw(lib) : qw(site lib);

    my $podpath = join ':',
        map  $_->[1],
        grep -e $_->[0],
        map  [File::Spec->catdir($self->blib, $_), $_],
        qw( script lib );

    foreach my $pod ( keys %$pods ) {
        my ($name, $path) = File::Basename::fileparse($pods->{$pod},
            Module::Build::Base::file_qr('\.(?:pm|plx?|pod)$'));
        my @dirs = File::Spec->splitdir( File::Spec->canonpath( $path ) );
        pop( @dirs ) if $dirs[-1] eq File::Spec->curdir;

        my $fulldir = File::Spec->catfile($htmldir, @rootdirs, @dirs);
        my $outfile = File::Spec->catfile($fulldir, "${name}.html");
        my $infile  = File::Spec->abs2rel($pod);

        next if $self->up_to_date($infile, $outfile);

        unless ( -d $fulldir ) {
            File::Path::mkpath($fulldir, 0, oct(755))
                or die "Couldn't mkdir $fulldir: $!";
        }

        my $path2root = join( '/', ('..') x (@rootdirs + @dirs) );
        my $htmlroot  = join( '/',
            ($path2root,
             $self->installdirs eq 'core' ? () : qw(site) ) );

        my $fh = IO::File->new($infile) or die "Can't read $infile: $!";
        my $abstract = Module::Build::PodParser->new(fh => $fh)->get_abstract();

        my $title = join( '::', (@dirs, $name) );
        $title .= " - $abstract" if $abstract;

        my %opts = (
            argv => join(" ",
                qw( -html --podflush ),
                "--title=$title",
                '--podroot=' . $self->blib,
                "--htmlroot=$htmlroot",
                "--podpath=$podpath",
            ),
            source      => $infile,
            destination => $outfile,
        );

        if ( eval { Pod::Html->VERSION(1.03) } ) {
            $opts{argv} .= ' --podheader';
            $opts{argv} .= ' --backlink=Back to Top';
            if ( $self->html_css ) {
                $opts{argv} .= " --css=$path2root/" . $self->html_css;
            }
        }

        $self->log_info("HTMLifying $infile -> $outfile\n");
        $self->log_verbose("perltidy %opts\n");
        Perl::Tidy::perltidy(%opts);    # or warn "pod2html @opts failed: $!";
    }
}

1;
To use it ...
#!/usr/bin/perl
use strict;
use warnings;
use My::Builder;

my $builder = My::Builder->new(
    module_name => 'My::Test',
    license     => 'perl',
);
$builder->create_build_script;
It's very easy to define new Module::Build actions that you can call with dispatch, and there are plenty of examples in the Module::Build documentation. Define an action to handle your new step:
sub ACTION_htmltidy
{
    my( $self ) = @_;

    $self->depends_on( ...other targets... );

    require Perl::Tidy;

    ...do your damage...
}
If you want another action to use yours, you can extend that action so it declares the dependency:
sub ACTION_install
{
    my( $self ) = @_;

    $self->depends_on( 'htmltidy' );

    $self->SUPER::ACTION_install;
}