Extract links from page HTML with Perl - html

To extract links from HTML pages I use the following code:
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
open( SITE, "< index.html" );
my $html = <SITE>; # load the HTML file
my $content = 'XML::LibXML'->load_html(string => "index.html", recover => 1);
my @del = qw( Contact Tables );
my $condition = join ' or ', map "text()='$_'", @del;
for my $anch ($content->findnodes("//a[$condition]/..")) {
    $anch->parentNode->removeChild($anch);
}
open (NOTEPAD, ">> index.html");
print NOTEPAD "$content";
close(NOTEPAD);
My problem is using the variable to read the contents of the HTML page; it does not work. If I put the HTML tags directly inside the variable, it works, but that is not a solution.

You can use WWW::Mechanize to accomplish your task.
my $mech = WWW::Mechanize->new();
$mech->get( $url ); # URL of the page to extract links from
my @links = $mech->links();
foreach my $link (@links) {
    my $curr_url = $link->url_abs;
    # do something with $curr_url
}
For complete documentation of this module, refer to WWW::Mechanize.

You can use Mojo::DOM (part of Mojolicious), which allows you to use CSS selectors; I feel that is a much better approach:
use Mojo::DOM;
use Mojo::File qw( path );

my $dom = Mojo::DOM->new( path('index.html')->slurp );
foreach ( $dom->find('a')->each ) {
    # Do something with $_
}
my $html = $dom->to_string;
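For the link extraction itself, the loop body could read each anchor's href; a minimal sketch (attr is a standard Mojo::DOM method, and anchors without an href are skipped):
# Print the target of every anchor that has an href attribute
for my $a ( $dom->find('a')->each ) {
    my $href = $a->attr('href');
    print "$href\n" if defined $href;
}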

Your program opens the file index.html, reads just one line of it into $html, and then tries to parse the literal string "index.html" as if it were HTML data.
Forget about opening and reading the file; you can get XML::LibXML to do all of that for you, with
my $content = XML::LibXML->load_html( location => 'index.html', recover => 1 );
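Putting it together for the original goal, a minimal sketch that loads the file and prints every link target (findnodes and the //a/@href XPath are standard XML::LibXML; index.html is the question's file):
use strict;
use warnings;
use XML::LibXML;

my $content = XML::LibXML->load_html( location => 'index.html', recover => 1 );

# Each node returned by //a/@href is an XML::LibXML::Attr; value() gives the URL
print $_->value, "\n" for $content->findnodes('//a/@href');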

Related

Can I use Text::CSV_XS to parse a csv-format string without writing it to disk?

I am getting a "csv file" from a vendor (using their API), but what they do is just spew the whole thing into their response. It wouldn't be a significant problem except that, of course, some of those pesky humans entered the data and put in "features" like line breaks. What I am doing now is creating a file for the raw data and then reopening it to read the data:
open RAW, ">", "$rawfile" or die "ERROR: Could not open $rawfile for write: $! \n";
print RAW $response->content;
close RAW;
my $csv = Text::CSV_XS->new({ binary=>1,always_quote=>1,eol=>$/ });
open my $fh, "<", "$rawfile" or die "ERROR: Could not open $rawfile for read: $! \n";
while ( $line = $csv->getline ($fh) ) { ...
Somehow this seems ... inelegant. It seems that I ought to be able to just read the data from $response->content (a multiline string) as if it were a file, but I'm drawing a total blank on how to do this.
A pointer would be greatly appreciated.
Thanks,
Paul
You could use a string filehandle:
use Carp;

my $data = $response->content;
open my $fh, "<", \$data or croak "unable to open string filehandle : $!";

my $csv = Text::CSV_XS->new({ binary => 1, always_quote => 1, eol => $/ });
while ( my $line = $csv->getline($fh) ) { ... }
Yes, you can use Text::CSV_XS on a string, via its functional interface
use warnings;
use strict;
use feature 'say';
use Text::CSV_XS qw(csv); # must use _XS version
my $csv = qq(a,line\nand,another);
my $aoa = csv(in => \$csv)
    or die Text::CSV_XS->error_diag;
say "@$_" for @$aoa;
Note that this indeed needs Text::CSV_XS (normally Text::CSV works but not with this).
I don't know why this isn't available in the OO interface (or perhaps is but is not documented).
While the above parses the string directly, as asked, you can also lessen the "inelegant" aspect of your example by writing the content directly to a file as it is acquired, which most libraries support, for example via the :content_file option of LWP::UserAgent's get method.
Let me also note that most of the time you want the library to decode the content for you, so with LWP::UserAgent use decoded_content (see HTTP::Response).
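A minimal sketch of that approach, reusing $url and $rawfile from the question (:content_file is a documented option of LWP::UserAgent's get):
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Stream the response body directly to disk instead of holding it in memory
my $response = $ua->get( $url, ':content_file' => $rawfile );
die $response->status_line unless $response->is_success;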
I cooked up this example with Mojo::UserAgent. For the CSV input I used various data sets from the NYC Open Data. This is also going to appear in the next update for Mojo Web Clients.
I build the request without making the request right away, and that gives me the transaction object, $tx. I can then replace the read event so I can immediately send the lines into Text::CSV_XS:
#!perl
use v5.10;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $url = ...;
my $tx = $ua->build_tx( GET => $url );
$tx->res->content->unsubscribe('read')->on(read => sub {
    state $csv = do {
        require Text::CSV_XS;
        Text::CSV_XS->new;
    };
    state $buffer;
    state $reader = do {
        open my $r, '<:encoding(UTF-8)', \$buffer;
        $r;
    };
    my ($content, $bytes) = @_;
    $buffer .= $bytes;
    while (my $row = $csv->getline($reader)) {
        say join ':', $row->@[2,4];
    }
});
$tx = $ua->start($tx);
That's not as nice as I'd like it to be because all the data still show up in the buffer. This is slightly more appealing, but it's fragile in the ways I note in the comments. I'm too lazy at the moment to make it any better because that gets hairy very quickly as you figure out when you have enough data to process a record. My particular code isn't as important as the idea that you can do whatever you like as the transactor reads data and passes it into the content handler:
use v5.10;
use strict;
use warnings;
use feature qw(signatures);
no warnings qw(experimental::signatures);
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $url = ...;
my $tx = $ua->build_tx( GET => $url );
$tx->res->content
    ->unsubscribe('read')
    ->on( read => process_bytes_factory() );
$tx = $ua->start($tx);

sub process_bytes_factory {
    return sub ( $content, $bytes ) {
        state $csv = do {
            require Text::CSV_XS;
            Text::CSV_XS->new( { decode_utf8 => 1 } );
        };
        state $buffer  = '';
        state $line_no = 0;

        $buffer .= $bytes;

        # fragile if the entire content does not end in a
        # newline (or whatever the line ending is)
        my $last_line_incomplete = $buffer !~ /\n\z/;

        # will not work if the format allows embedded newlines
        my @lines = split /\n/, $buffer;
        $buffer = pop @lines if $last_line_incomplete;

        foreach my $line ( @lines ) {
            my $status = $csv->parse($line);
            my @row    = $csv->fields;
            say join ':', $line_no++, @row[2,4];
        }
    };
}

How do I add links to cells using HTML::TagTree?

I am using HTML::TagTree as it seems to create the HTML file for a table easily. I want to add HTML links to the text within some cells.
From the documentation provided here, I am not able to get a clear answer about how to add a new tag for the text inside a cell. Here's my code.
Main line to focus on: $new_row->td($1,'style=text-align:center','a:href="second_page.html"')
I don't think I clearly understand how to add more tags and attributes. Can someone please help?
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TagTree;
my $filename = 'list.txt';
my $html = HTML::TagTree->new('html'); # Define the top of the tree of objects.
my $head = $html->head(); # Put a 'head' branch on the tree.
my $body = $html->body(); # Put a 'body' branch on the tree
$head->title("Report");
$head->meta('', 'name=author CONTENT="xxx"');
$body->div->h1('Main page name'); # Example of method chaining to create
# a long branch.
my $table = $body->table('', 'width=100% border=1');
my $row1 = $table->tr();
$row1->td('Feature Code','style=background-color:khaki;text-align:center');
$row1->td('Feature Name','style=background-color:khaki;text-align:center');
$row1->td('% completed','style=background-color:khaki;text-align:center');
open(my $fh, '<', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
    if ($row =~ m/([.\d]+): (.+)/) {
        my $new_row = $table->tr();
        $new_row->td($1, 'style=text-align:center', 'a:href="page_for_each_item.html"');
        $new_row->td($2);
    }
}
# Print to STDOUT the actual HTML representation of the tree
$html->print_html();
After some tries I think I found one way of doing it:
my $new_row = $table->tr();
my $text = $html->object();
$text->a($original_text,"href=second_page.html");
$new_row->td($text,'style=text-align:center');

Perl Regex for Not HTML

I am looking to remove anything that is not an HTML tag from an HTML document; basically, I am trying to get rid of all the text within the document.
I have the below regex to remove all HTML from a string, but need help with the opposite scenario.
$string =~ s/<[^>]+>//g;
Thanks.
If this is the regex s///ubstitution you use to remove all HTML from the document
$string =~ s/<[^>]+>//g;
then you can use the same regex in a m//atch operator to keep only the HTML from the document:
$string = join '', $string =~ m/<[^>]+>/g;
If the above regex satisfies your requirements, then you're done :) But maybe you want to consider this ol' regex pattern, slightly longer :D http://perlmonks.org/?node_id=161281
Mind the caveats that Ethan Brown mentions :)
Ethan Brown namechecks HTML::DOM as if it were the only CPAN solution.
HTML::Parser is more ubiquitous, but it's not hard to Google for more.
http://metacpan.org/pod/HTML::Parser
A solution using HTML::Parser is (tested once):
use HTML::Parser ();
my $p = HTML::Parser->new(api_version => 3);
$p->handler( text => sub { }, "");
$p->handler( default => sub { print shift }, "text");
$p->parse_file('content.html') || die $!;
Are you looking for this?
$string =~ s/>[^<]*</></mg;
Or this?
$string =~ s/(?<=>)[^<]*(?=<)//mg;
LibXML makes it easy to select stuff that isn't tags/comments/processing-instructions and remove it:
#!/usr/bin/perl --
use strict;
use warnings;
use XML::LibXML 1.70; ## for load_html/load_xml/location
use XML::LibXML::PrettyPrint;
Main( @ARGV );
exit( 0 );
sub Main {
    binmode STDOUT;

    my $loc = shift or die "
Usage:
$0 ko00010.html
$0 http://example.com/ko00010.html\n\n";

    my $dom = XML::LibXML->new(
        qw/
            recover 2
            no_blanks 1
        /
    )->load_html( location => $loc, );

    ## http://www.w3.org/TR/xpath/#node-tests
    ## http://www.w3.org/TR/xpath/#NT-NodeType
    ## http://www.w3.org/TR/xpath/#section-Text-Nodes
    for my $text ( $dom->findnodes(q{ //text() }) ) {
        node_detach( $text );
    }

    local $XML::LibXML::skipXMLDeclaration = 1; ## <?xml ?>
    local $XML::LibXML::setTagCompression  = 0; ## <p />
    #~ print "$dom";

    my $pp = XML::LibXML::PrettyPrint->new_for_html;
    $pp->{indent_string} = ' ';
    print $pp->pretty_print( $dom );
}

sub node_detach {
    my( $self ) = @_;
    $self->parentNode->removeChild( $self );
}

Getting links from an HTML table using HTML::TableExtract and HTML::Extor in Perl

My goal is to extract the links from the tables titled "Agonists," "Antagonists," and "Allosteric Regulators" in the following site:
http://www.iuphar-db.org/DATABASE/ObjectDisplayForward?objectId=1&familyId=1
I've been using HTML::TableExtract to extract the tables but have been unable to get HTML::LinkExtor to retrieve the links in question. Here is the code I have so far:
use warnings;
use strict;
use HTML::TableExtract;
use HTML::LinkExtor;
my @names = `ls /home/wallakin/LINDA/ligands/iuphar/data/html2/`;

foreach (@names)
{
    chomp ($_);
    my $te = HTML::TableExtract->new( headers => [ "Ligand",
                                                   "Sp.",
                                                   "Action",
                                                   "Affinity",
                                                   "Units",
                                                   "Reference" ] );
    my $le = HTML::LinkExtor->new();
    $te->parse_file("/home/wallakin/LINDA/ligands/iuphar/data/html2/$_");
    my $output = $_;
    $output =~ s/\.html/\.txt/g;
    open (RESET, ">/home/wallakin/LINDA/ligands/iuphar/data/links/$output") or die "Can't reset";
    close RESET;
    #open (DATA, ">>/home/wallakin/LINDA/ligands/iuphar/data/links/$output") or die "Can't append to file";
    foreach my $ts ($te->tables)
    {
        foreach my $row ($ts->rows)
        {
            $le->parse($row->[0]);
            for my $link_tag ( $le->links )
            {
                my %links = @$link_tag;
                print @$link_tag, "\n";
            }
        }
    }
    #print "Links extracted from $_\n";
}
I've tried using some sample code from another thread on this site (Perl parse links from HTML Table) to no avail. I'm not sure whether it's a problem of parsing or table recognition. Any help provided would be greatly appreciated. Thanks!
Try this as a base script (you only need to adapt it to fetch the links):
use warnings; use strict;
use HTML::TableExtract;
use HTML::LinkExtor;
use WWW::Mechanize;
use utf8;
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
my $m = WWW::Mechanize->new( autocheck => 1, quiet => 0 );
$m->agent_alias("Linux Mozilla");
$m->cookie_jar({});
my $te = HTML::TableExtract->new(
    headers => [
        "Ligand",
        "Sp.",
        "Action",
        "Affinity",
        "Units",
        "Reference"
    ]
);
$te->parse(
    $m->get("http://tinyurl.com/jvwov9m")->content
);
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}
You don't describe what the problem is: what exactly doesn't work? What does $row->[0] contain? Part of the problem might be that HTML::TableExtract returns just the 'visible' text, not the raw HTML, by default. You probably want to use the keep_html option in HTML::TableExtract.
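For example, a minimal sketch of the constructor with that option (keep_html is documented in HTML::TableExtract; the headers are the ones from the question):
my $te = HTML::TableExtract->new(
    headers   => [ "Ligand", "Sp.", "Action", "Affinity", "Units", "Reference" ],
    keep_html => 1, # keep the raw HTML in each cell so HTML::LinkExtor has <a> tags to parse
);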

Extract text from HTML - Perl using HTML::TreeBuilder

I'm trying to read .html files and extract the text in <p> tags. Logically, my code below should work: using HTML::TreeBuilder, I parse the HTML, then extract the text in <p> with find_by_attribute("p"). But my script produced only empty directories. Did I leave anything out?
#!/usr/bin/perl
use strict;
use HTML::TreeBuilder 3;
use FileHandle;
my @task = ('ar','cn','en','id','vn');

foreach my $lang (@task) {
    mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
    opendir (my $dir, "./$lang/") or die "$!";
    my @files = grep (/\.html/, readdir ($dir));
    closedir ($dir);

    foreach my $file (@files) {
        open (my $fh, '<', "./$lang/$file") or die "$!";
        my $root = HTML::TreeBuilder->new;
        $root->parse_file("./$lang/$file");
        my @all_p = $root->find_by_attribute("p");
        foreach my $p (@all_p) {
            my $ptag = HTML::TreeBuilder->new_from_content ($p->as_HTML);
            my $filewrite = substr($file, 0, -5);
            open (my $outwrite, '>>', "extract_$lang/$filewrite.txt") or die $!;
            print $outwrite $ptag->as_text . "\n";
            my $pcontents = $ptag->as_text;
            print $pcontents . "\n";
            close (outwrite);
        }
        close (FH);
    }
}
My .html files are plain-text HTML saved from .asp websites, e.g. http://www.singaporemedicine.com/vn/hcp/med_evac_mtas.asp
My .html files are saved in:
./ar/*
./cn/*
./en/*
./id/*
./vn/*
You are confusing element with attribute. The program can be written much more concisely:
#!/usr/bin/env perl
use strictures;
use File::Glob qw(bsd_glob);
use Path::Class qw(file);
use URI::file qw();
use Web::Query qw(wq);
use autodie qw(:all);
foreach my $lang (qw(ar cn en id vn)) {
    mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
    foreach my $file (bsd_glob "./$lang/*.html") {
        my $basename = file($file)->basename;
        $basename =~ s/[.]html$/.txt/;
        open my $out, '>>:encoding(UTF-8)', "./extract_$lang/$basename";
        $out->say($_) for wq(URI::file->new_abs($file))->find('p')->text;
        close $out;
    }
}
Use find_by_tag_name to search for tag names, not find_by_attribute.
You want find_by_tag_name, not find_by_attribute:
my #all_p = $root->find_by_tag_name("p");
From the docs:
$h->find_by_tag_name('tag', ...)
In list context, returns a list of elements at or under $h that have
any of the specified tag names. In scalar context, returns the first
(in pre-order traversal of the tree) such element found, or undef if
none.
You might want to take a look at Mojo::DOM which lets you use CSS selectors.
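For instance, a minimal sketch of the same <p> extraction with Mojo::DOM, assuming the question's $lang/$file layout (find, each, and text are standard Mojo::DOM methods):
use Mojo::DOM;
use Mojo::File qw(path);

my $dom = Mojo::DOM->new( path("./$lang/$file")->slurp );
# Print the text content of every <p> element
print $_->text, "\n" for $dom->find('p')->each;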