I am looking to substitute anything that is not an HTML tag from an HTML document. So, basically trying to get rid of all the text within the document.
I have the below regex to remove all HTML from a string, but need help with the opposite scenario.
$string =~ s/<[^>]+>//g;
Thanks.
If this is regex s///ubstitution to remove all html from document
$string =~ s/<[^>]+>//g;
Then you can use the same regex in a m//atch operator to keep all html from document
$string = join '', $string =~ m/<[^>]+>/g;
If the above regex satisfies your requirements, then you're done :) But maybe you want to consider this ol' regex pattern, slightly longer :D http://perlmonks.org/?node_id=161281
Mind the caveats like Ethan Browne mentions :)
Ethan Brown namechecks HTML::DOM as if it were the only CPAN solution.
HTML::Parser is more ubiquitous, but it's not hard to Google for more.
http://metacpan.org/pod/HTML::Parser
A solution using HTML::Parser is (tested once):
use HTML::Parser ();
my $p = HTML::Parser->new(api_version => 3);
$p->handler( text => sub { }, "");
$p->handler( default => sub { print shift }, "text");
$p->parse_file('content.html') || die $!;
Are you looking for this?
$string =~ s/>[^<]*</></mg;
Or this?
$string =~ s/(?<=>)[^<]*(?=<)//mg;
LibXML makes it easy to select stuff that isn't tags/comments/processing-instruction and remove it
#!/usr/bin/perl --
use strict;
use warnings;
use XML::LibXML 1.70; ## for load_html/load_xml/location
use XML::LibXML::PrettyPrint;
Main( #ARGV );
exit( 0 );
sub Main {
binmode STDOUT;
my $loc = shift or die "
Usage:
$0 ko00010.html
$0 http://example.com/ko00010.html\n\n";
my $dom = XML::LibXML->new(
qw/
recover 2
no_blanks 1
/
)->load_html( location => $loc, );
## http://www.w3.org/TR/xpath/#node-tests
## http://www.w3.org/TR/xpath/#NT-NodeType
## http://www.w3.org/TR/xpath/#section-Text-Nodes
for my $text ( $dom->findnodes(q{ //text() }) ){
node_detach( $text );
}
local $XML::LibXML::skipXMLDeclaration = 1; ## <?xml ?>
local $XML::LibXML::setTagCompression = 0; ## <p />
#~ print "$dom";
my $pp = XML::LibXML::PrettyPrint->new_for_html;
$pp->{indent_string}=' ';
print $pp->pretty_print( $dom );
}
sub node_detach {
my( $self ) = #_;
$self->parentNode->removeChild( $self );
}
Related
To extract links from HTML pages use the code:
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
open( SITE, "< index.html" );
my $html = <SITE>; # load the HTML file
my $content = 'XML::LibXML'->load_html(string => "index.html", recover => 1);
my #del = qw( Contact Tables );
my $condition = join ' or ', map "text()='$_'", #del;
for my $anch ($content->findnodes("//a[$condition]/..")) {
$anch->parentNode->removeChild($anch);
}
open (NOTEPAD, ">> index.html");
print NOTEPAD "$content";
close(NOTEPAD);
My problem is use the $file variable to read the contents of HTML page, it not work.
If I use the HTML tags inside the $file variable it works. But not is solution.
You can use WWW::Mechanize; to accomplish your task.
my $mech = WWW::Mechanize->new();
$mech->get( $url ); #url to extract links
my #links = $mech->links();
foreach my $link (#links) {
my $curr_url = $link->url_abs;
}
For complete documentation of this module refer WWW::Mechaniize
You can use Mojo::DOM (part of Mojolicious) which allows you to use CSS selectors, which I feel is a much better approach;
use Mojo::DOM;
use Mojo::File qw( path );
my $dom = Mojo::DOM->new( path('index.html')->slurp );
foreach ( $dom->find('a')->each ) {
# Do something with $_
}
my $html = $dom->to_string;
Your program opens file index.html, reads one line of it into $html, and then tries to parse the string index.html as if it were HTML data
Forget about opening and reading the file; you can get XML::LibXML to do all of that for you, with
my $content = XML::LibXML->load_html( location => 'index.html', recover => 1 );
I have a Perl CGI script. I'm trying to generate hyperlinks on an HTML page using all the elements in an array. I'm using functional CGI style programming. Here's a minimal representation of my code:
#!/usr/bin/perl
use strict; use warnings;
use CGI qw( :standard);
print header;
print start_html(
-title => 'Get LINK!'
);
my %HoA = (
'foo' => [ '12', '23', '593' ],
'bam' => [ '232', '65' ],
);
my #array = ("foo", "bam");
foreach my $i (#array){
foreach my $j (#{$HoA{$i}}){
my $link = get_link($i);
print "$i"."\t"; # this doesn't work!!
}
}
#-----------------------------------
# this subroutine works!
sub get_link{
my $id = $_[0];
my $link = 'http://www.example.com/'.$id;
return $link;
}
#------------------------------------
Desired Output
foo foo foo bam bam
Any help or suggestions are appreciated.
print "$g"."\t"; # this doesn't work!!
that's because your quotes in your quote end your quote. You need to escape them:
print "$g"."\t"; # this doesn't work!!
single quotes work on the inside, too.
As per the comment, qq-style quoting is also available to you, so you can use double quotes and variable interpolation at the same time without the escape characters.
There's a few things that don't make much sense.
The code below doesn't work because $link is a ref. Also, you're not using $j anywhere,
which should give you a sign that there's a bit of a "design" issue.
foreach my $i (#array){
foreach my $j (#{$HoA{$i}}){
my $link = get_link($i);
print "$g"."\t"; # this doesn't work!!
}
}
Why can't this be rewritten like so:
for my $array_ref (keys %HoA) {
for my $item ( #{ $HoA{$array_ref} } ){
my $link = get_link($item);
# print ....
}
}
How does this sub "work"? $id is an array reference here. Do you just mean to shift onto $id?
#-----------------------------------
# this subroutine works!
sub get_link{
#my $id = $HoA{$_[0]};
# with the refactored double for loop, or the way you had it if you fix the
# `ref` issue, a `shift` should do the trick.
my $id = shift;
my $link = 'http://www.google.com/'.$id;
return $link;
}
#------------------------------------
You also need to escape the quotes in print statement:
print "$g"."\t";
Here is another way you can do this ...
use strict;
use warnings;
use CGI qw(:standard);
print header,
start_html(-title => 'Get LINK!');
my %HoA = (
foo => [ qw(12 23 593) ],
bam => [ qw(232 65) ],
);
foreach my $i ( reverse sort keys %HoA ) {
foreach ( #{$HoA{$i}} ) {
print a({-href => 'http:/www.google.com/'.$_,
-target => '_blank'}, $i) . "\n";
}
}
Output:
foo foo foo bam bam
I have an html page that has particular text that I want to parse into a databse using a Perl Script.
I want to be able to strip off all the stuff I don't want, an exmple of the html is-
<div class="postbody">
<h3><a href "foo">Re: John Smith <span class="posthilit">England</span></a></h3>
<div class="content">Is C# better than Visula Basic?</div>
</div>
Therefore I would want to import into the database
Name: John Smith.
Lives in: England.
Commented: Is C# better than Visula Basic?
I have started to create a Perl script but it needs to be changed to work for what I want;
use DBI;
open (FILE, "list") || die "couldn't open the file!";
open (F1, ">list.csv") || die "couldn't open the file!";
print F1 "Name\|Lives In\|Commented\n";
while ($line=<FILE>)
{
chop($line);
$text = "";
$add = 0;
open (DATA, $line) || die "couldn't open the data!";
while ($data=<DATA>)
{
if ($data =~ /ds\-div/)
{
$data =~ s/\,//g;
$data =~ s/\"//g;
$data =~ s/\'//g;
$text = $text . $data;
}
}
#p = split(/\\/, $line);
print F1 $p[2];
print F1 ",";
print F1 $p[1];
print F1 ",";
print F1 $p[1];
print F1 ",";
print F1 "\n";
$a = $a + 1;
Any input would be greatly appreciated.
Please do not use regular expressions to parse HTML as HTML is not a regular language. Regular expressions describe regular languages.
It is easy to parse HTML with HTML::TreeBuilder (and its family of modules):
#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_content(
do { local $/; <DATA> }
);
for ( $tree->look_down( 'class' => 'postbody' ) ) {
my $location = $_->look_down( 'class' => 'posthilit' )->as_trimmed_text;
my $comment = $_->look_down( 'class' => 'content' )->as_trimmed_text;
my $name = $_->look_down( '_tag' => 'h3' )->as_trimmed_text;
$name =~ s/^Re:\s*//;
$name =~ s/\s*$location\s*$//;
print "Name: $name\nLives in: $location\nCommented: $comment\n";
}
__DATA__
<div class="postbody">
<h3>Re: John Smith <span class="posthilit">England</span></h3>
<div class="content">Is C# better than Visual Basic?</div>
</div>
Output
Name: John Smith
Lives in: England
Commented: Is C# better than Visual Basic?
However, if you require much more control, have a look at HTML::Parser as has already been answered by ADW.
Use an HTML parser, like HTML::TreeBuilder to parse the HTML--don't do it yourself.
Also, don't use two-arg open with global handles, don't use chop--use chomp (read the perldoc to understand why). Find yourself a newer tutorial. You are using a ton of OLD OLD OLD Perl. And damnit, USE STRICT and USE WARNINGS. I know you've been told to do this. Do it. Leaving it out will do nothing but buy you pain.
Go. Read. Modern Perl. It is free.
my $page = HTML::TreeBuilder->new_from_file( $file_name );
$page->elementify;
my #posts;
for my $post ( $page->look_down( class => 'postbody' ) ) {
my %post = (
name => get_name($post),
loc => get_loc($post),
comment => get_comment($post),
);
push #posts, \%post;
}
# Persist #posts however you want to.
sub get_name {
my $post = shift;
my $name = $post->look_down( _tag => 'h3' );
return unless defined $name;
$name->look_down->(_tag=>'a');
return unless defined $name;
$name = ($name->content_list)[0];
return unless defined $name;
$name =~ s/^Re:\s*//;
$name =~ /\s*$//;
return $name;
}
sub get_loc {
my $post = shift;
my $loc = $post->look_down( _tag => 'span', class => 'posthilit' );
return unless defined $loc;
return $loc->as_text;
}
sub get_comment {
my $post = shift;
my $com = $post->look_down( _tag => 'div', class => 'content' );
return unless defined $com;
return $com->as_text;
}
Now you have a nice data structure with all your post data. You can write it to CSV or a database or whatever it is you really want to do. You seem to be trying to do both.
You'd be much better using the HTML::Parser module from the CPAN.
Many email clients don't like linked CSS stylesheets, or even the embedded <style> tag, but rather want the CSS to appear inline as style attributes on all your markup.
BAD: <link rel=stylesheet type="text/css" href="/style.css">
BAD: <style type="text/css">...</style>
WORKS: <h1 style="margin: 0">...</h1>
However this inline style attribute approach is a right pain to manage.
I've found tools for Ruby and PHP that will take a CSS file and some separate markup as input and return you the merged result - a single file of markup with all the CSS converted to style attributes.
I'm looking for a Perl solution to this problem, but I've not found one on CPAN or by searching Google. Any pointers? Alternatively, are there CPAN modules one could combine to achieve the same result?
Ruby http://premailer.dialect.ca/
PHP http://www.pelagodesign.com/sidecar/emogrifier/
Perl ?
I do not know of a complete, pre-packaged solution.
CSS::DOM's compute_style is subject to pretty much the same caveats as emogrifier above. That module, in conjunction with HTML::TokeParser ought to be usable to cook up something.
Update: Here is a buggy mish-mash of things:
#!/usr/bin/perl
use strict;
use warnings;
use CSS::DOM;
use File::Slurp;
use HTML::DOM;
use HTML::TokeParser;
die "convert html_file css_file" unless #ARGV == 2;
my ($html_file, $css_file) = #ARGV;
my $html_parser = HTML::TokeParser->new($html_file)
or die "Cannot open '$html_file': $!";
my $sheet = CSS::DOM::parse( scalar read_file $css_file );
while ( my $token = $html_parser->get_token ) {
my $type = $token->[0];
my $text = $type eq 'T' ? $token->[1] : $token->[-1];
if ( $type eq 'S' ) {
unless ( skip( $token->[1] ) ) {
$text = insert_computed_style($sheet, $token);
}
}
print $text;
}
sub insert_computed_style {
my ($sheet, $token) = #_;
my ($tag, $attr, $attrseq) = #$token[1 .. 3];
my $doc = HTML::DOM->new;
my $element = $doc->createElement($tag);
for my $attr_name ( #$attrseq ) {
$element->setAttribute($attr_name, $attr->{$attr_name});
}
my $style = CSS::DOM::compute_style(
element => $element, user_sheet => $sheet
);
my #attrseq = (style => grep { lc $_ ne 'style' } #$attrseq );
$attr->{style} = $style->cssText;
my $text .= join(" ",
"<$tag",
map{ qq/$_='$attr->{$_}'/ } #attrseq );
$text .= '>';
return $text;
}
sub skip {
my ($tag) = #_;
$tag = lc $tag;
return 1 if $tag =~ /^(?:h(?:ead|tml)|link|meta|script|title)$/;
}
You can use CPAN Perl module CSS::Inliner https://metacpan.org/release/CSS-Inliner
I'm using Module::Build to perform build, test, testpod, html, & install actions on my Perl module that I'm developing. The HTML files that are generated are okay, but I'd be much happier if I could somehow configure Module::Build to use the perltidy -html formatting utility instead of its own HTML formatter.
Anyone know of a way I can replace the HTML formatter that comes with Module::Build with the prettier perltidy HTML formatter?
Addendum: When I said "replace" above, that was probably misleading. I don't really want to write code to replace the html formatter that comes with Module::Build. I really want to know if Module::Build has any other HTML formatter options. The HTML it generates is so plain and generic looking. It's so boring. I like perltidy's output a lot.
Here is how I got it working right now in a build script that I wrote, but it's totally a hack ... falling out to the command line perltidy script:
use strict;
use warnings;
# get list of files in directory
my $libLocation = "lib/EDF";
opendir( DIR, $libLocation );
my #filenameArray = readdir(DIR);
# iterate over all files to find *.pm set
for my $file (#filenameArray) {
if ( $file =~ m/ # matching regex
\. # literal period character
pm # the pm file extenstion
/x # end of regex
)
{
my $return = `perl D:/Perl/site/bin/perltidy -q --indent-columns=4 --maximum-line-length=80 -html -opath blib/libhtml2 -toc $libLocation/$file`;
if ($return eq "") {
print "HTMLized " . $file . "\n";
}
else {
print "Error: " . $return . "\n";
}
}
}
But I was really hoping there was a way to use Module::Build and just tell it with a flag or an argument or whatever to tell it to use a different HTML formatter. I guess that's a pipe dream, though:
use strict;
use warnings;
use Module::Build;
my $build = Module::Build->resume (
properties => {
config_dir => '_build',
},
);
$build->dispatch('build');
$build->dispatch('html', engine => 'perltidy');
or maybe:
$build->dispatch('htmltidy');
Well, the action is implemented in
htmlify_pods
in Module::Build::Base.
It should be possible to override that method.
Much Later ...
Here is my attempt (tested only once):
package My::Builder;
use strict;
use warnings;
use base 'Module::Build';
sub htmlify_pods {
my $self = shift;
my $type = shift;
my $htmldir = shift || File::Spec->catdir($self->blib, "${type}html");
require Module::Build::Base;
require Module::Build::PodParser;
require Perl::Tidy;
$self->add_to_cleanup('pod2htm*');
my $pods = $self->_find_pods(
$self->{properties}{"${type}doc_dirs"},
exclude => [ Module::Build::Base::file_qr('\.(?:bat|com|html)$') ] );
return unless %$pods; # nothing to do
unless ( -d $htmldir ) {
File::Path::mkpath($htmldir, 0, oct(755))
or die "Couldn't mkdir $htmldir: $!";
}
my #rootdirs = ($type eq 'bin') ? qw(bin) :
$self->installdirs eq 'core' ? qw(lib) : qw(site lib);
my $podpath = join ':',
map $_->[1],
grep -e $_->[0],
map [File::Spec->catdir($self->blib, $_), $_],
qw( script lib );
foreach my $pod ( keys %$pods ) {
my ($name, $path) = File::Basename::fileparse($pods->{$pod},
Module::Build::Base::file_qr('\.(?:pm|plx?|pod)$'));
my #dirs = File::Spec->splitdir( File::Spec->canonpath( $path ) );
pop( #dirs ) if $dirs[-1] eq File::Spec->curdir;
my $fulldir = File::Spec->catfile($htmldir, #rootdirs, #dirs);
my $outfile = File::Spec->catfile($fulldir, "${name}.html");
my $infile = File::Spec->abs2rel($pod);
next if $self->up_to_date($infile, $outfile);
unless ( -d $fulldir ){
File::Path::mkpath($fulldir, 0, oct(755))
or die "Couldn't mkdir $fulldir: $!";
}
my $path2root = join( '/', ('..') x (#rootdirs+#dirs) );
my $htmlroot = join( '/',
($path2root,
$self->installdirs eq 'core' ? () : qw(site) ) );
my $fh = IO::File->new($infile) or die "Can't read $infile: $!";
my $abstract = Module::Build::PodParser->new(fh => $fh)->get_abstract();
my $title = join( '::', (#dirs, $name) );
$title .= " - $abstract" if $abstract;
my %opts = (
argv => join(" ",
qw( -html --podflush ),
"--title=$title",
'--podroot='.$self->blib,
"--htmlroot=$htmlroot",
"--podpath=$podpath",
),
source => $infile,
destination => $outfile,
);
if ( eval{Pod::Html->VERSION(1.03)} ) {
$opts{argv} .= ' --podheader';
$opts{argv} .= ' --backlink=Back to Top';
if ( $self->html_css ) {
$opts{argv} .= " --css=$path2root/" . $self->html_css;
}
}
$self->log_info("HTMLifying $infile -> $outfile\n");
$self->log_verbose("perltidy %opts\n");
Perl::Tidy::perltidy(%opts); # or warn "pod2html #opts failed: $!";
}
}
1;
** To use it .. **
#!/usr/bin/perl
use strict;
use warnings;
use My::Builder;
my $builder = My::Builder->new(
module_name => 'My::Test',
license => 'perl',
);
$builder->create_build_script;
It's very easy to define new Module::Build actions that you can call with dispatch, and there are plenty of examples in the Module::Build documentation. Define an action to handle your new step:
sub ACTION_htmltidy
{
my( $self ) = #_;
$self->depends_on( ...other targets... );
require Perl::Tidy;
...do your damage...
}
If you want another action to use yours, you can extend it so you can make the dependency:
sub ACTION_install
{
my( $self ) = #_;
$self->depends_on( 'htmltidy' );
$self->SUPER::install;
}