Perl XML2JSON: How to preserve XML element order?

I have a configuration file which is in XML format. I need to parse the XML and convert it to JSON. I'm able to convert it with the XML::XML2JSON module of Perl, but the problem is that it does not maintain the order of the XML elements. I strictly need the elements in order, otherwise I cannot configure correctly.
My XML file is something like this. I have to configure an IP address and set that IP as the gateway for a certain route.
<Config>
<ip>
<address>1.1.1.1</address>
<netmask>255.255.255.0</netmask>
</ip>
<route>
<network>20.20.20.0</network>
<netmask>55.255.255.0</netmask>
<gateway>1.1.1.1</gateway>
</route>
</Config>
This is my Perl code to convert it to JSON:
my $file = 'config.xml';
use Data::Dumper;
use XML::XML2JSON;
open my $fh, '<',$file or die;
$/ = undef;
my $data = <$fh>;
my $XML = $data;
my $XML2JSON = XML::XML2JSON->new();
my $Obj = $XML2JSON->xml2obj($XML);
print Dumper($Obj);
The output I'm getting is,
$VAR1 = {'Config' => {'route' => {'netmask' => {'$t' => '55.255.255.0'},'gateway' => {'$t' => '1.1.1.1'},'network' => {'$t' => '20.20.20.0'}},'ip' => {'netmask' => {'$t' => '255.255.255.0'},'address' => {'$t' => '1.1.1.1'}}},'#encoding' => 'UTF-8','#version' => '1.0'};
I have a script which reads the JSON object and does the configuration.
But it fails, because it first tries to set the gateway IP address on a route before that IP address has been configured, and only then adds the IP address.
I strictly need the key ip to come first and then route for the configuration to work without error. I have many such dependencies where the order of keys is a must.
Is there any way I can tackle this problem? I tried almost all the XML parsing modules, like XML::Simple, XML::Twig and XML::Parser, but nothing helped.

Here's a program that I hacked together that uses XML::Parser to parse some XML data and generate the equivalent JSON in the same order. It ignores any attributes, processing instructions etc., and requires that every XML element contain either a list of child elements or a text node. Mixing text and elements won't work, and this isn't checked except that the program will die when it tries to dereference a string.
It's intended to be a framework for you to enhance as you require, but it works fine as it stands with the XML data you show in your question.
use strict;
use warnings 'all';
use XML::Parser;
my $parser = XML::Parser->new(Handlers => {
Start => \&start_tag,
End => \&end_tag,
Char => \&text,
});
my $struct;
my @stack;
$parser->parsefile('config.xml');
print_json($struct->[1]);
sub start_tag {
my $expat = shift;
my ($tag, %attr) = @_;
my $elem = [ $tag => [] ];
if ( $struct ) {
my $content = $stack[-1][1];
push @{ $content }, $elem;
}
else {
$struct = $elem;
}
push @stack, $elem;
}
sub end_tag {
my $expat = shift;
my ($elem) = @_;
die "$elem <=> $stack[-1][0]" unless $stack[-1][0] eq $elem;
for my $content ( $stack[-1][1] ) {
$content = "@$content" unless grep ref, @$content;
}
pop @stack;
}
sub text {
my $expat = shift;
my ($string) = @_;
return unless $string =~ /\S/;
$string =~ s/\A\s+//;
$string =~ s/\s+\z//;
push @{ $stack[-1][1] }, $string;
}
sub print_json {
my ($data, $indent, $comma) = (@_, 0, '');
print "{\n";
for my $i ( 0 .. $#$data ) {
# Note that $data, $indent and $comma are overridden here
# to reflect the inner context
#
my $elem = $data->[$i];
my $comma = $i < $#$data ? ',' : '';
my ($tag, $data) = @$elem;
my $indent = $indent + 1;
printf qq{%s"%s" : }, ' ' x $indent, $tag;
if ( ref $data ) {
print_json($data, $indent, $comma);
}
else {
printf qq{"%s"%s\n}, $data, $comma;
}
}
# $indent and $comma (and $data) are restored here
#
printf "%s}%s\n", ' ' x $indent, $comma;
}
output
{
"ip" : {
"address" : "1.1.1.1",
"netmask" : "255.255.255.0"
},
"route" : {
"network" : "20.20.20.0",
"netmask" : "55.255.255.0",
"gateway" : "1.1.1.1"
}
}
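A different way to get the same effect, sketched here on the assumption that XML::LibXML, Tie::IxHash and JSON::PP are available: keep the decoded hashes insertion-ordered with Tie::IxHash, and JSON::PP will then emit the keys in document order as long as canonical mode is off. Note that repeated sibling names (two route elements, say) would overwrite each other in this sketch.
use strict;
use warnings;
use XML::LibXML;
use Tie::IxHash;
use JSON::PP;

# Turn an element into either its text (leaf) or an insertion-ordered hash
sub elem_to_ordered {
    my ($elem) = @_;
    my @kids = $elem->findnodes('./*');
    return $elem->textContent unless @kids;     # leaf node: just its text
    tie my %h, 'Tie::IxHash';                   # hash that remembers insertion order
    $h{ $_->nodeName } = elem_to_ordered($_) for @kids;
    return \%h;
}

my $doc  = XML::LibXML->load_xml( location => 'config.xml' );
my $data = elem_to_ordered( $doc->documentElement );

# JSON::PP walks hash keys in their natural order unless canonical() is set,
# so the tied hashes keep the XML document order in the JSON output
print JSON::PP->new->pretty->encode($data);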

The problem isn't so much to do with XML parsing as with the fact that Perl hashes are not ordered, so when you 'write' some JSON the keys can come out in any order.
The way to avoid this is to apply a sort function to your JSON.
You can do this by using sort_by to sort explicitly:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
use JSON::PP;
use Data::Dumper;
sub order_nodes {
my %rank_of = ( ip => 0, route => 1, address => 2, network => 3, netmask => 4, gateway => 5 );
print "$JSON::PP::a <=> $JSON::PP::b\n";
return $rank_of{$JSON::PP::a} <=> $rank_of{$JSON::PP::b};
}
my $twig = XML::Twig -> parse (\*DATA);
my $json = JSON::PP -> new;
$json ->sort_by ( \&order_nodes );
print $json -> encode( $twig -> simplify );
__DATA__
<Config>
<ip>
<address>1.1.1.1</address>
<netmask>255.255.255.0</netmask>
</ip>
<route>
<network>20.20.20.0</network>
<netmask>55.255.255.0</netmask>
<gateway>1.1.1.1</gateway>
</route>
</Config>
In some scenarios, setting canonical can help, as that sets ordering to lexical order. (And means your JSON output would be consistently ordered). This doesn't apply to your case.
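For reference, that is a one-line switch (lexical key order only, so it would not give the ip-before-route order needed here):
my $json = JSON::PP->new->canonical(1);   # sort keys alphabetically when encoding
print $json->encode( $twig->simplify );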
You could build the node ordering via XML::Twig, either with an XPath expression or by using twig_handlers. I gave it a quick go, but came slightly unstuck in figuring out how you'd 'tell' the ordering when one block gives you address/netmask and the other network/netmask/gateway.
As a simple example you could:
my $count = 0;
foreach my $node ( $twig -> get_xpath ( './*' ) ) {
$rank_of{$node->tag} = $count++ unless exists $rank_of{$node->tag};
}
print Dumper \%rank_of;
This will ensure ip and route are always the right way around. However it doesn't order the subkeys.
That actually gets a bit more complicated, as you'd need to recurse, and then decide how to handle 'collisions' (netmask, for example: address comes before it, but how does it sort relative to network?).
Or alternatively:
my $count = 0;
foreach my $node ( $twig->get_xpath('.//*') ) {
$rank_of{ $node->tag } = $count++ unless exists $rank_of{ $node->tag };
}
This walks all the nodes, and puts them in order. It doesn't quite work, because netmask appears in both stanzas though.
You get:
{"ip":{"address":"1.1.1.1","netmask":"255.255.255.0"},"route":{"netmask":"55.255.255.0","network":"20.20.20.0","gateway":"1.1.1.1"}}
I couldn't figure out a neat way of collapsing both lists.
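One way to sidestep the collision is to skip simplify() and sort_by altogether and emit the JSON text yourself while walking the twig in document order. A minimal sketch, reusing the $twig parsed above and assuming (as in the question) that every element holds either child elements or plain text, and that tag names need no JSON escaping:
use JSON::PP;

my $scalar_enc = JSON::PP->new->allow_nonref;    # for quoting the leaf values

sub twig_to_json {
    my ($elt) = @_;
    my @kids = grep { !$_->is_pcdata } $elt->children;
    return $scalar_enc->encode( $elt->text ) unless @kids;   # leaf: quoted text
    return '{'
        . join( ',', map { '"' . $_->tag . '":' . twig_to_json($_) } @kids )
        . '}';
}

print twig_to_json( $twig->root ), "\n";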

Related

Can I use Text::CSV_XS to parse a csv-format string without writing it to disk?

I am getting a "csv file" from a vendor (using their API), but what they do is just spew the whole thing into their response. It wouldn't be a significant problem except that, of course, some of those pesky humans entered the data and put in "features" like line breaks. What I am doing now is creating a file for the raw data and then reopening it to read the data:
open RAW, ">", "$rawfile" or die "ERROR: Could not open $rawfile for write: $! \n";
print RAW $response->content;
close RAW;
my $csv = Text::CSV_XS->new({ binary=>1,always_quote=>1,eol=>$/ });
open my $fh, "<", "$rawfile" or die "ERROR: Could not open $rawfile for read: $! \n";
while ( $line = $csv->getline ($fh) ) { ...
Somehow this seems ... inelegant. It seems that I ought to be able to just read the data from $response->content (a multiline string) as if it were a file. But I'm drawing a total blank on how to do this.
A pointer would be greatly appreciated.
Thanks,
Paul
You could use a string filehandle:
my $data = $response->content;
open my $fh, "<", \$data or croak "unable to open string filehandle : $!";
my $csv = Text::CSV_XS->new({ binary=>1,always_quote=>1,eol=>$/ });
while ( $line = $csv->getline ($fh) ) { ... }
Yes, you can use Text::CSV_XS on a string, via its functional interface
use warnings;
use strict;
use feature 'say';
use Text::CSV_XS qw(csv); # must use _XS version
my $csv = qq(a,line\nand,another);
my $aoa = csv(in => \$csv)
or die Text::CSV->error_diag;
say "#$_" for #aoa;
Note that this indeed needs Text::CSV_XS (normally Text::CSV works but not with this).
I don't know why this isn't available in the OO interface (or perhaps is but is not documented).
While the above parses the string directly, as asked, you can also lessen the "inelegant" aspect of your example by writing the content directly to a file as it's acquired, which most libraries support, for instance via the :content_file option of LWP::UserAgent's get method.
Let me also note that most of the time you want the library to decode the content, so with LWP::UserAgent use decoded_content (see HTTP::Response).
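For example, a sketch of both suggestions against a setup like the question's (the URL and file name are placeholders):
use strict;
use warnings;
use LWP::UserAgent;

my $url     = 'https://example.com/export.csv';   # placeholder
my $rawfile = 'raw.csv';                           # placeholder
my $ua      = LWP::UserAgent->new;

# Stream the response body straight into a file while it is being fetched
my $response = $ua->get( $url, ':content_file' => $rawfile );

# Or keep it in memory and let LWP undo any transfer/content encoding
my $csv_text = $ua->get($url)->decoded_content;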
I cooked up this example with Mojo::UserAgent. For the CSV input I used various data sets from the NYC Open Data. This is also going to appear in the next update for Mojo Web Clients.
I build the request without making the request right away, and that gives me the transaction object, $tx. I can then replace the read event so I can immediately send the lines into Text::CSV_XS:
#!perl
use v5.10;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $url = ...;
my $tx = $ua->build_tx( GET => $url );
$tx->res->content->unsubscribe('read')->on(read => sub {
state $csv = do {
require Text::CSV_XS;
Text::CSV_XS->new;
};
state $buffer;
state $reader = do {
open my $r, '<:encoding(UTF-8)', \$buffer;
$r;
};
my ($content, $bytes) = @_;
$buffer .= $bytes;
while (my $row = $csv->getline($reader) ) {
say join ':', $row->@[2,4];
}
});
$tx = $ua->start($tx);
That's not as nice as I'd like it to be because all the data still show up in the buffer. This is slightly more appealing, but it's fragile in the ways I note in the comments. I'm too lazy at the moment to make it any better because that gets hairy very quickly as you figure out when you have enough data to process a record. My particular code isn't as important as the idea that you can do whatever you like as the transactor reads data and passes it into the content handler:
use v5.10;
use strict;
use warnings;
use feature qw(signatures);
no warnings qw(experimental::signatures);
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $url = ...;
my $tx = $ua->build_tx( GET => $url );
$tx->res->content
->unsubscribe('read')
->on( read => process_bytes_factory() );
$tx = $ua->start($tx);
sub process_bytes_factory {
return sub ( $content, $bytes ) {
state $csv = do {
require Text::CSV_XS;
Text::CSV_XS->new( { decode_utf8 => 1 } );
};
state $buffer = '';
state $line_no = 0;
$buffer .= $bytes;
# fragile if the entire content does not end in a
# newline (or whatever the line ending is)
my $last_line_incomplete = $buffer !~ /\n\z/;
# will not work if the format allows embedded newlines
my @lines = split /\n/, $buffer;
$buffer = pop @lines if $last_line_incomplete;
foreach my $line ( @lines ) {
my $status = $csv->parse($line);
my @row = $csv->fields;
say join ':', $line_no++, @row[2,4];
}
};
}

DBM::Deep is failing to import hashref having 'true' or 'false' values

I have the JSON text given below:
test.json
{
"a" : false
}
I want to create the DBM::Deep hash for the above JSON. My code looks like this:
dbm.pl
use strict;
use warnings;
use DBM::Deep;
use JSON;
use Data::Dumper;
# create the dbm::deep object
my $db = DBM::Deep->new(
file => 'test.db',
type => DBM::Deep->TYPE_HASH
);
my $json_text = do {
open( my $json_fh, $path )
or die("Can't open \$path\": $!\n");
local $/;
<$json_fh>;
};
my $json = JSON->new;
my $data = $json->decode($json_text);
print Dumper($data);
# create dbm::deep hash
eval { $db->{$path} = $data; };
if ($@) {
print "error : $@\n";
}
I am getting the output/error below on execution of the above code:
Error
$VAR1 = {
'a' => bless( do{(my $o = 0)}, 'JSON::XS::Boolean' )
};
error : DBM::Deep: Storage of references of type 'SCALAR' is not supported. at dbm.pl line 26
It seems that JSON internally uses JSON::XS, which converts the true/false values into JSON::XS::Boolean objects, and DBM::Deep is not able to handle these, while it can handle the null value.
The above code works fine for the inputs below:
{
"a" : 'true' # if true is in quotes
}
or
{
"a" : null
}
I tried many things, but nothing worked. Does anyone have a workaround?
The JSON parser you are using, among others, returns an object that works as a boolean when it encounters true or false in the JSON. This allows the data to be re-encoded into JSON without change, but it can cause this kind of issue.
null doesn't have this problem because Perl has a native value (undef) that can be used to represent it unambiguously.
The following converts these objects into simple values.
sub convert_json_bools {
local *_convert_json_bools = sub {
my $ref_type = ref($_[0])
or return;
if ($ref_type eq 'HASH') {
_convert_json_bools($_) for values(%{ $_[0] });
}
elsif ($ref_type eq 'ARRAY') {
_convert_json_bools($_) for @{ $_[0] };
}
elsif ($ref_type =~ /::Boolean\z/) {
$_[0] = $_[0] ? 1 : 0;
}
else {
warn("Unsupported type $ref_type\n");
}
};
&_convert_json_bools;
}
convert_json_bools($data);
Your code works fine for me, with the only change being to set
my $path = 'test.json';
You should check your module version numbers. These are the ones that I have
print $DBM::Deep::VERSION, "\n"; # 2.0013
print $JSON::VERSION, "\n"; # 2.90
print $JSON::XS::VERSION, "\n"; # 3.02
and I am running Perl v5.24.0
The dumped output is as follows
Newly-created DBM::Deep database
$VAR1 = bless( {}, 'DBM::Deep::Hash' );
output of $json->decode
$VAR1 = {
'a' => undef
};
Populated DBM::Deep database after the eval
$VAR1 = bless( {
'test.json' => bless( {
'a' => undef
}, 'DBM::Deep::Hash' )
}, 'DBM::Deep::Hash' );
All of that looks to be as it should

perl Mojo and JSON for simultaneous requests

I'm usually no Perl coder. However I've got to complete this task.
The following code works for me:
#!/usr/bin/perl
use LWP::UserAgent;
use JSON;
use strict;
my $md5 = $ARGV[0];
$md5 =~ s/[^A-Fa-f0-9 ]*//g;
die "invalid MD5" unless ( length($md5) == 32 );
my $ua = LWP::UserAgent->new(ssl_opts => { verify_hostname => 1 }, timeout => 10);
my $key="12345...7890";
my $url='https://www.virustotal.com/vtapi/v2/file/report';
my $response = $ua->post( $url, ['apikey' => $key, 'resource' => $md5] );
die "$url error: ", $response->status_line unless $response->is_success;
my $results=$response->content;
my $json = JSON->new->allow_nonref;
my $decjson = $json->decode( $results);
print "md5: ",$md5,"\n";
print "positives: ", $decjson->{"positives"}, "\n";
print "total: ", $decjson->{"total"}, "\n";
print "date: ", $decjson->{"scan_date"}, "\n";
Now I would like to recode the above for using asynchronous http using Mojo. I'm trying this:
#!/usr/bin/perl
use warnings;
use strict;
use Mojo;
use Mojo::UserAgent;
my $md5 = $ARGV[0];
$md5 =~ s/[^A-Fa-f0-9 ]*//g;
die "invalid MD5" unless ( length($md5) == 32 );
my ($vt_positives, $vt_scandate, $response_vt);
my $url='https://www.virustotal.com/vtapi/v2/file/report';
my $key="12345...7890";
my $ua = Mojo::UserAgent->new;
my $delay = Mojo::IOLoop->delay;
$ua->max_redirects(0)->connect_timeout(3)->request_timeout(6);
$ua->max_redirects(5);
$delay->begin;
$response_vt = $ua->post( $url => ['apikey' => $key, 'resource' => $md5] => sub {
my ($ua, $tx) = @_;
$vt_positives=$tx->res->json->{"positives"};
print "Got response: $vt_positives\n";
});
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
The first code is OK, the second isn't working. I must be doing something wrong when sending the request since I seem to get a 403 response (incorrect API usage). I also tried -> json calls but it didn't work out.
And even if I had done the request correctly, I'm not sure if I'm correctly decoding the json results with Mojo.
Help will be appreciated!
EDIT
It seems that we missed the real question, how to post forms. Oops sorry about that.
Posting forms depends on which version of Mojolicious you are using. Until recently (v3.85 -- 2013-02-13) there was a post_form method. On reflection however, it was decided there should either be *_form methods for every request type, or we should do something smarter, and thus the form generator was born.
$response_vt = $ua->post(
$url,
form => {'apikey' => $key, 'resource' => $md5},
sub { ... }
);
It can be added to any request method, making it much more consistent than the old form. Also note that it should be a hashref, not an arrayref as LWP allows. BTW there is also a json generator that works like this too, or you can even add your own!
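The json generator mentioned above takes the same shape; purely for illustration (the VirusTotal endpoint itself expects form data, not JSON):
$response_vt = $ua->post(
    $url,
    json => { apikey => $key, resource => $md5 },
    sub { ... }
);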
I'm leaving my original answer, showing non-blocking usage, which you may now amend given the above.
ORIGINAL
Building off the logic from creaktive, this is how I would start. The major difference is that there isn't a monitor watching to be sure that there are workers going; rather, when one finishes, it checks to be sure that there are no idlers.
I have also made some changes to the parsing logic, but nothing major.
#!/usr/bin/env perl
use Mojo::Base -strict;
use utf8::all;
use Mojo::URL;
use Mojo::UserAgent;
# FIFO queue
my @urls = qw(
http://sysd.org/page/1/
http://sysd.org/page/2/
http://sysd.org/page/3/
);
# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
->new(max_redirects => 5)
->detect_proxy;
start_urls($ua, \@urls, \&get_callback);
sub start_urls {
my ($ua, $queue, $cb) = @_;
# Limit parallel connections to 4
state $idle = 4;
state $delay = Mojo::IOLoop->delay(sub{say @$queue ? "Loop ended before queue depleted" : "Finished"});
while ( $idle and my $url = shift @$queue ) {
$idle--;
print "Starting $url, $idle idle\n\n";
$delay->begin;
$ua->get($url => sub{
$idle++;
print "Got $url, $idle idle\n\n";
$cb->(@_, $queue);
# refresh worker pool
start_urls($ua, $queue, $cb);
$delay->end;
});
}
# Start event loop if necessary
$delay->wait unless $delay->ioloop->is_running;
}
sub get_callback {
my ($ua, $tx, $queue) = @_;
# Parse only OK HTML responses
return unless
$tx->res->is_status_class(200)
and $tx->res->headers->content_type =~ m{^text/html\b}ix;
# Request URL
my $url = $tx->req->url;
say "Processing $url";
parse_html($url, $tx, $queue);
}
sub parse_html {
my ($url, $tx, $queue) = #_;
state %visited;
my $dom = $tx->res->dom;
say $dom->at('html title')->text;
# Extract and enqueue URLs
$dom->find('a[href]')->each(sub{
# Validate href attribute
my $link = Mojo::URL->new($_->{href});
return unless eval { $link->isa('Mojo::URL') };
# "normalize" link
$link = $link->to_abs($url)->fragment(undef);
return unless grep { $link->protocol eq $_ } qw(http https);
# Don't go deeper than /a/b/c
return if @{$link->path->parts} > 3;
# Access every link only once
return if $visited{$link->to_string}++;
# Don't visit other hosts
return if $link->host ne $url->host;
push @$queue, $link;
say " -> $link";
});
say '';
return;
}
Take a look at this concurrent-requesting Mojolicious-based web crawler I wrote to illustrate my article Web Scraping with Modern Perl:
#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);
use Mojo::UserAgent;
# FIFO queue
my @urls = map { Mojo::URL->new($_) } qw(
http://sysd.org/page/1/
http://sysd.org/page/2/
http://sysd.org/page/3/
);
# Limit parallel connections to 4
my $max_conn = 4;
# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
->new(max_redirects => 5)
->detect_proxy;
# Keep track of active connections
my $active = 0;
Mojo::IOLoop->recurring(
0 => sub {
for ($active + 1 .. $max_conn) {
# Dequeue or halt if there are no active crawlers anymore
return ($active or Mojo::IOLoop->stop)
unless my $url = shift @urls;
# Fetch non-blocking just by adding
# a callback and marking as active
++$active;
$ua->get($url => \&get_callback);
}
}
);
# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
sub get_callback {
my (undef, $tx) = @_;
# Deactivate
--$active;
# Parse only OK HTML responses
return
if not $tx->res->is_status_class(200)
or $tx->res->headers->content_type !~ m{^text/html\b}ix;
# Request URL
my $url = $tx->req->url;
say $url;
parse_html($url, $tx);
return;
}
sub parse_html {
my ($url, $tx) = @_;
say $tx->res->dom->at('html title')->text;
# Extract and enqueue URLs
for my $e ($tx->res->dom('a[href]')->each) {
# Validate href attribute
my $link = Mojo::URL->new($e->{href});
next if 'Mojo::URL' ne ref $link;
# "normalize" link
$link = $link->to_abs($tx->req->url)->fragment(undef);
next unless grep { $link->protocol eq $_ } qw(http https);
# Don't go deeper than /a/b/c
next if @{$link->path->parts} > 3;
# Access every link only once
state $uniq = {};
++$uniq->{$url->to_string};
next if ++$uniq->{$link->to_string} > 1;
# Don't visit other hosts
next if $link->host ne $url->host;
push @urls, $link;
say " -> $link";
}
say '';
return;
}
LWP::UserAgent takes arguments to post as either ref to array or ref to hash format.
http://search.cpan.org/~gaas/libwww-perl-6.04/lib/LWP/UserAgent.pm#REQUEST_METHODS
$ua->post( $url, \%form )
$ua->post( $url, \@form )
which you provide in the first script in the ref to array format "\@form"
my $response = $ua->post( $url, ['apikey' => $key, 'resource' => $md5] );
being as it is a hash this is probably better written in the hash format "\%form"
my $response = $ua->post( $url, {'apikey' => $key, 'resource' => $md5} );
In Mojo::UserAgent the arguments to post are a little more complex, but essentially appear to be a "string" of hash refs to hash keys, with which I am unfamiliar. However you may find using the hash ref format provides the expected arguments correctly.
http://search.cpan.org/~sri/Mojolicious-3.87/lib/Mojo/UserAgent.pm#post
POST
my $tx = $ua->post('kraih.com');
my $tx = $ua->post('http://kraih.com' => {DNT => 1} => 'Hi!');
my $tx = $ua->post('http://kraih.com' => {DNT => 1} => form => {a => 'b'});
my $tx = $ua->post('http://kraih.com' => {DNT => 1} => json => {a => 'b'});
Try this:
$response_vt = $ua->post( $url => form => {'apikey' => $key, 'resource' => $md5} => sub {... });

Converting MySQL TEXT field with breaklines to XML by Perl script returns a malformed notation

I have a table in MySQL that has one field defined as TEXT. The information is fed to the database by a webform using a textarea.
I'm using the following script to generate an XML with the information of the table:
#!/usr/bin/perl
use strict;
use DBI;
use XML::Generator::DBI;
use XML::Handler::YAWriter;
my $dbh = DBI->connect ("DBI:access info goes here",
{ RaiseError => 1, PrintError => 0});
my $out = XML::Handler::YAWriter->new (AsFile => "-", Encoding=>"ISO-8859-1");
my $gen = XML::Generator::DBI->new (
Handler => $out,
dbh => $dbh
);
$gen->execute ("SELECT text FROM table");
$dbh->disconnect ();
The problem is that when the text entered has line breaks, it generates malformed XML:
<text {http://axkit.org/NS/xml-generator-dbi}encoding="HASH(0x9c43ba0)">PHA+YWlqZHNvaWFqZG9pYXNqZG9pYXNqb2RpanNhaW9kanNhb2lkYXNvaWo8L3A+DQo8cD5zPC9w
Pg0KPHA+ZDwvcD4NCjxwPmFzPC9wPg0KPHA+ZHNhPC9wPg0KPHA+ZDwvcD4NCjxwPnNhZHNhZHNh
ZHM8L3A+DQo8cD4mbmJzcDs8L3A+DQo8cD5hc2Rhc2Rzc2FkZHNkc2FzZHNhPC9wPg0KPHA+Jm5i
c3A7PC9wPg0KPHA+YXNkZHNhZHNhYXNkc2Rhc2RhYXNkPC9wPg==
</text>
For example if the text entered is:
<p>One</p>
<p>Two</p>
It outputs the malformed XML, but when the text is:
<p>One</p> <p>Two</p>
It prints out the XML correctly.
Is there any way to 'strip' the line breaks from the textarea input or ignore them in the creation of the XML?
Thanks.
It might work to enforce well-formed-ness:
$text =~ s|(?i)(<br)>|$1 />|gm;
This will turn any bare linebreak tag (<br>) into an empty tag compliant with XML well-formedness.
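For example, on a made-up snippet:
my $text = "<p>One</p><br>\n<p>Two</p>";
$text =~ s|(?i)(<br)>|$1 />|gm;
print $text;   # <p>One</p><br /> followed by <p>Two</p> on the next line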
With my cursory look at the classes you're using, it looks like if you can step into the handler chain and handle, say, the characters events, you might be able to do something like this before the call to XML::Generator::DBI->execute:
$gen->set_content_handler(
SAXHandlerWrapper->new(
characters => sub {
s|(?i)(<br)>|$1 />|gm;
return $out->characters( $_ )
}
)
);
Where the following defines SAXHandlerWrapper:
package SAXHandlerWrapper;
use 5.010;
use strict;
use warnings;
use Carp qw<croak>;
use Params::Util qw<_CODE _HASH _IDENTIFIER _INSTANCE>;
use Scalar::Util qw<blessed>;
sub _make_handler {
my $name = shift || $_;
return if __PACKAGE__->can( $name );
no strict;
*$name = sub {
my $action = shift->{ $name };
local $_ = $_[0];
return &$action;
}
}
sub new {
my $self = bless {}, shift;
my $current_name;
@_ = %{ shift() } if &_HASH( $_[0] );
while ( local $_ = shift @_ ) {
given ( $_ ) {
when ( !_IDENTIFIER( $_ )) {
croak( "Invalid parameter name: $_!" );
}
when ( 'event' ) {
croak( "Invalid event name: $_!" )
unless $current_name = _IDENTIFIER( shift )
;
_make_handler( $current_name );
}
when ( 'action' ) {
croak( 'Action not code reference!' )
unless my $action = _CODE( shift )
;
croak( 'No active handler name!' ) unless $current_name;
$self->{ $current_name } = $action;
}
default {
croak( "Invalid event: $_!" )
unless $self->{ $_ } = _CODE( shift )
;
_make_handler( $_ );
}
}
}
Carp::croak( 'Nothing handled!' ) unless %$self;
foreach ( grep { !_CODE( $self->{$_} ) } keys %$self ) {
Carp::croak( "Handler for $_ is not complete!" );
}
return $self;
}

Can I use perltidy's HTML formatter in my automated Perl build?

I'm using Module::Build to perform build, test, testpod, html, & install actions on my Perl module that I'm developing. The HTML files that are generated are okay, but I'd be much happier if I could somehow configure Module::Build to use the perltidy -html formatting utility instead of its own HTML formatter.
Anyone know of a way I can replace the HTML formatter that comes with Module::Build with the prettier perltidy HTML formatter?
Addendum: When I said "replace" above, that was probably misleading. I don't really want to write code to replace the html formatter that comes with Module::Build. I really want to know if Module::Build has any other HTML formatter options. The HTML it generates is so plain and generic looking. It's so boring. I like perltidy's output a lot.
Here is how I got it working right now in a build script that I wrote, but it's totally a hack, shelling out to the command-line perltidy script:
use strict;
use warnings;
# get list of files in directory
my $libLocation = "lib/EDF";
opendir( DIR, $libLocation );
my @filenameArray = readdir(DIR);
# iterate over all files to find *.pm set
for my $file (@filenameArray) {
if ( $file =~ m/ # matching regex
\. # literal period character
pm # the pm file extension
/x # end of regex
)
{
my $return = `perl D:/Perl/site/bin/perltidy -q --indent-columns=4 --maximum-line-length=80 -html -opath blib/libhtml2 -toc $libLocation/$file`;
if ($return eq "") {
print "HTMLized " . $file . "\n";
}
else {
print "Error: " . $return . "\n";
}
}
}
But I was really hoping there was a way to use Module::Build and just tell it, with a flag or an argument or whatever, to use a different HTML formatter. I guess that's a pipe dream, though:
use strict;
use warnings;
use Module::Build;
my $build = Module::Build->resume (
properties => {
config_dir => '_build',
},
);
$build->dispatch('build');
$build->dispatch('html', engine => 'perltidy');
or maybe:
$build->dispatch('htmltidy');
Well, the action is implemented in htmlify_pods in Module::Build::Base.
It should be possible to override that method.
Much Later ...
Here is my attempt (tested only once):
package My::Builder;
use strict;
use warnings;
use base 'Module::Build';
sub htmlify_pods {
my $self = shift;
my $type = shift;
my $htmldir = shift || File::Spec->catdir($self->blib, "${type}html");
require Module::Build::Base;
require Module::Build::PodParser;
require Perl::Tidy;
$self->add_to_cleanup('pod2htm*');
my $pods = $self->_find_pods(
$self->{properties}{"${type}doc_dirs"},
exclude => [ Module::Build::Base::file_qr('\.(?:bat|com|html)$') ] );
return unless %$pods; # nothing to do
unless ( -d $htmldir ) {
File::Path::mkpath($htmldir, 0, oct(755))
or die "Couldn't mkdir $htmldir: $!";
}
my @rootdirs = ($type eq 'bin') ? qw(bin) :
$self->installdirs eq 'core' ? qw(lib) : qw(site lib);
my $podpath = join ':',
map $_->[1],
grep -e $_->[0],
map [File::Spec->catdir($self->blib, $_), $_],
qw( script lib );
foreach my $pod ( keys %$pods ) {
my ($name, $path) = File::Basename::fileparse($pods->{$pod},
Module::Build::Base::file_qr('\.(?:pm|plx?|pod)$'));
my @dirs = File::Spec->splitdir( File::Spec->canonpath( $path ) );
pop( @dirs ) if $dirs[-1] eq File::Spec->curdir;
my $fulldir = File::Spec->catfile($htmldir, @rootdirs, @dirs);
my $outfile = File::Spec->catfile($fulldir, "${name}.html");
my $infile = File::Spec->abs2rel($pod);
next if $self->up_to_date($infile, $outfile);
unless ( -d $fulldir ){
File::Path::mkpath($fulldir, 0, oct(755))
or die "Couldn't mkdir $fulldir: $!";
}
my $path2root = join( '/', ('..') x (@rootdirs+@dirs) );
my $htmlroot = join( '/',
($path2root,
$self->installdirs eq 'core' ? () : qw(site) ) );
my $fh = IO::File->new($infile) or die "Can't read $infile: $!";
my $abstract = Module::Build::PodParser->new(fh => $fh)->get_abstract();
my $title = join( '::', (@dirs, $name) );
$title .= " - $abstract" if $abstract;
my %opts = (
argv => join(" ",
qw( -html --podflush ),
"--title=$title",
'--podroot='.$self->blib,
"--htmlroot=$htmlroot",
"--podpath=$podpath",
),
source => $infile,
destination => $outfile,
);
if ( eval{Pod::Html->VERSION(1.03)} ) {
$opts{argv} .= ' --podheader';
$opts{argv} .= ' --backlink=Back to Top';
if ( $self->html_css ) {
$opts{argv} .= " --css=$path2root/" . $self->html_css;
}
}
$self->log_info("HTMLifying $infile -> $outfile\n");
$self->log_verbose("perltidy %opts\n");
Perl::Tidy::perltidy(%opts); # or warn "pod2html @opts failed: $!";
}
}
1;
To use it:
#!/usr/bin/perl
use strict;
use warnings;
use My::Builder;
my $builder = My::Builder->new(
module_name => 'My::Test',
license => 'perl',
);
$builder->create_build_script;
It's very easy to define new Module::Build actions that you can call with dispatch, and there are plenty of examples in the Module::Build documentation. Define an action to handle your new step:
sub ACTION_htmltidy
{
my( $self ) = @_;
$self->depends_on( ...other targets... );
require Perl::Tidy;
...do your damage...
}
If you want another action to use yours, you can extend it so you can make the dependency:
sub ACTION_install
{
my( $self ) = @_;
$self->depends_on( 'htmltidy' );
$self->SUPER::install;
}