Extract text from HTML - Perl using HTML::TreeBuilder - html

I'm trying to access the .html files and extract the text in <p> tags. Logically, my code below should work. By using the HTML::TreeBuilder. I parse the html then extract text in <p> using find_by_attribute("p"). But my script came out with empty directories. Did i leave out anything?
#!/usr/bin/perl
use strict;
use HTML::TreeBuilder 3;
use FileHandle;
my #task = ('ar','cn','en','id','vn');
foreach my $lang (#task) {
mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
opendir (my $dir, "./$lang/") or die "$!";
my #files = grep (/\.html/,readdir ($dir));
closedir ($dir);
foreach my $file (#files) {
open (my $fh, '<', "./$lang/$file") or die "$!";
my $root = HTML::TreeBuilder->new;
$root->parse_file("./$lang/$file");
my #all_p = $root->find_by_attribute("p");
foreach my $p (#all_p) {
my $ptag = HTML::TreeBuilder->new_from_content ($p->as_HTML);
my $filewrite = substr($file, 0, -5);
open (my $outwrite, '>>', "extract_$lang/$filewrite.txt") or die $!;
print $outwrite $ptag->as_text . "\n";
my $pcontents = $ptag->as_text;
print $pcontents . "\n";
close (outwrite);
}
close (FH);
}
}
My .html files are the plain text htmls from .asp websites e.g. http://www.singaporemedicine.com/vn/hcp/med_evac_mtas.asp
My .html files are saved in:
./ar/*
./cn/*
./en/*
./id/*
./vn/*

You are confusing element with attribute. The program can be written much more concisely:
#!/usr/bin/env perl
use strictures;
use File::Glob qw(bsd_glob);
use Path::Class qw(file);
use URI::file qw();
use Web::Query qw(wq);
use autodie qw(:all);
foreach my $lang (qw(ar cn en id vn)) {
mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
foreach my $file (bsd_glob "./$lang/*.html") {
my $basename = file($file)->basename;
$basename =~ s/[.]html$/.txt/;
open my $out, '>>:encoding(UTF-8)', "./extract_$lang/$basename";
$out->say($_) for wq(URI::file->new_abs($file))->find('p')->text;
close $out;
}
}

Use find_by_tag_name to search for tag names, not find_by_attribute.

You want find_by_tag_name, not find_by_attribute:
my #all_p = $root->find_by_tag_name("p");
From the docs:
$h->find_by_tag_name('tag', ...)
In list context, returns a list of elements at or under $h that have
any of the specified tag names. In scalar context, returns the first
(in pre-order traversal of the tree) such element found, or undef if
none.

You might want to take a look at Mojo::DOM which lets you use CSS selectors.

Related

How do I add link to cells using HTML::TagTree?

I am using HTML::TagTree as it seems to create the html file for table easily. I want to add html links to the text within some cells.
From the documentation provided here, I am not able to get a clear answer about how to add a new tag for the text inside a cell. Here's my code.
Main line to focus on: $new_row->td($1,'style=text-align:center','a:href="second_page.html"')
I don't think I clearly understand how to add more tags and attributes. Can someone please help?
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TagTree;
my $filename = 'list.txt';
my $html = HTML::TagTree->new('html'); # Define the top of the tree of objects.
my $head = $html->head(); # Put a 'head' branch on the tree.
my $body = $html->body(); # Put a 'body' branch on the tree
$head->title("Report");
$head->meta('', 'name=author CONTENT="xxx"');
$body->div->h1('Main page name'); # Example of method chaining to create
# a long branch.
my $table = $body->table('', 'width=100% border=1');
my $row1 = $table->tr();
$row1->td('Feature Code','style=background-color:khaki;text-align:center');
$row1->td('Feature Name','style=background-color:khaki;text-align:center');
$row1->td('% completed','style=background-color:khaki;text-align:center');
open(my $fh, '<', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>){
if($row =~ m/([.\d]+): (.+)/){
my $new_row = $table->tr();
$new_row->td($1,'style=text-align:center','a:href="page_for_each_item.html"');
$new_row->td($2);
}
}
# Print to STDOUT the actual HTML representation of the tree
$html->print_html();
After some tries I think I found one way of doing it:
my $new_row = $table->tr();
my $text = $html->object();
$text->a($original_text,"href=second_page.html");
$new_row->td($text,'style=text-align:center');

Extract link from page HTML with Perl

To extract links from HTML pages use the code:
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
open( SITE, "< index.html" );
my $html = <SITE>; # load the HTML file
my $content = 'XML::LibXML'->load_html(string => "index.html", recover => 1);
my #del = qw( Contact Tables );
my $condition = join ' or ', map "text()='$_'", #del;
for my $anch ($content->findnodes("//a[$condition]/..")) {
$anch->parentNode->removeChild($anch);
}
open (NOTEPAD, ">> index.html");
print NOTEPAD "$content";
close(NOTEPAD);
My problem is use the $file variable to read the contents of HTML page, it not work.
If I use the HTML tags inside the $file variable it works. But not is solution.
You can use WWW::Mechanize; to accomplish your task.
my $mech = WWW::Mechanize->new();
$mech->get( $url ); #url to extract links
my #links = $mech->links();
foreach my $link (#links) {
my $curr_url = $link->url_abs;
}
For complete documentation of this module refer WWW::Mechaniize
You can use Mojo::DOM (part of Mojolicious) which allows you to use CSS selectors, which I feel is a much better approach;
use Mojo::DOM;
use Mojo::File qw( path );
my $dom = Mojo::DOM->new( path('index.html')->slurp );
foreach ( $dom->find('a')->each ) {
# Do something with $_
}
my $html = $dom->to_string;
Your program opens file index.html, reads one line of it into $html, and then tries to parse the string index.html as if it were HTML data
Forget about opening and reading the file; you can get XML::LibXML to do all of that for you, with
my $content = XML::LibXML->load_html( location => 'index.html', recover => 1 );

Perl accessing and printing data from a html folder in a parent folder

so I have a Folder called DATA, and it includes the following: part1.html, part2.html, part3.html, HTML.htm, plain.html, and jojo.jsp.
Now i use the following commmand to open the DATA folder and extract the files containing .htm
opendir(DIR,'DATA');
my(#dir) = grep /\.htm/, readdir (DIR);
closedir(DIR);
It successfully prints out the name of the files containing the extensions .html . Now i wish to use the html file that are filtered and print the data out into the cygwin terminal. I tried to use the files and stored it to a variable, and use a foreach loop to open the first html file using Filehandler and printing out the data init. The loop will repeat itself and do the same for all the other html files. But i seemed to run into the error! Please help!
my $value = join(#dir);
print "$value\n";
foreach(#dir){
my $movies = my $value;
open (FHD, $movies) || die " could not open $movies\n";
my #movies = <FHD>;
my $value2 = join(', ', #movies);
print "$value2\n";
What's with this line?
my $movies = my $value;
You're making this a lot harder than it needs to be.
Just use glob to read the directory as that will automatically include the path information on your found files.
use strict;
use warnings;
use autodie;
for my $html (glob('DATA/*.htm*')) {
print "File: $html\n";
open my $fh, '<', $html;
print <$fh>;
}

how to pass a variable from an HTML form to a perl cgi script?

I would like to use an HTML form to pass a variable to Perl CGI script so that I can process that variable, and then print it out on another HTML page.
Here is my HTML code: http://jsfiddle.net/wTVQ5/.
Here is my Perl CGI script to links the HTML. Here is the way I would like to do it (since it uses less lines and probably more efficient).
#!/usr/bin/perl
use warnings; use strict;
use CGI qw( :standard);
my $query = CGI->new;
# Process an HTTP request
my $user = $query->param('first_name');
# process $user... for example:
my $foo = "Foo";
my $str = $user . $foo;
print "Content-type:text/html\r\n\r\n";
print "<html>";
print "<head>";
print "<title>Hello - Second CGI Program</title>";
print "</head>";
print "<body>";
print "<h2>Hello $str - Second CGI Program</h2>";
print "</body>";
print "</html>";
1;
Here's a way I read from a tutorial and makes more sense to me:
#!/usr/bin/perl
use warnings; use strict;
my ($buffer, #pairs, $pair, $name, $value, %FORM);
# Read in text
$ENV{'REQUEST_METHOD'} =~ tr/a-z/A-Z/;
if ($ENV{'REQUEST_METHOD'} eq "POST") {
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
}else {
$buffer = $ENV{'QUERY_STRING'};
}
# Split information into name/value pairs
#pairs = split(/&/, $buffer);
foreach $pair (#pairs) {
($name, $value) = split(/=/, $pair);
$value =~ tr/+/ /;
$value =~ s/%(..)/pack("C", hex($1))/eg;
$FORM{$name} = $value;
}
my $user = $FORM{first_name};
# process $user... for example:
my $foo = "Foo";
my $str = $user . $foo;
print "Content-type:text/html\r\n\r\n";
print "<html>";
print "<head>";
print "<title>Hello - Second CGI Program</title>";
print "</head>";
print "<body>";
print "<h2>Hello $str - Second CGI Program</h2>";
print "</body>";
print "</html>";
1;
Both of these don't work properly BTW. When I click on the submit button on the HTML page, it just links me to the script instead of passing the variable, processing it, and printing out the HTML page.
this line:
print "Content-type:text/html\r\n\r\n";
should be:
print "Content-type:text/html\n\n";
or better:
print $query->header;
Also, ensure your web server was well configurated for CGI. And, if you have enough time, use a modern web application approach, there are many frameworks that may be better than CGI (Dancer, Mojolicious, OX, ...)
I see your using CGI 'standard', no need really to initiate a CGI->new unless you just wanted to, also you said less lines, you could just do something like this.
use strict;
use warnings;
use CGI qw( :standard );
my $user = param('first_name') || q/foo/;
print header,
start_html(-title => 'Hello'), h1('Hello ' . $user), end_html;
You need to edit your httpd.conf with something like this.
AddHandler cgi-script cgi pl
<Directory /path/to/cgi/files>
Options +ExecCGI
</Directory>
If you are running this locally you could create a folder named public_html in your home directory structure and set this to run your scripts, you would just have to configure that also mapping it to that location.

Perl regular expression for html

I need to extract the IMDB id(example:for the movie 300 it is tt0416449) for a movie specified by the variable URL. I have looked at the page source for this page and come up with the following regex
use LWP::Simple;
$url = "http://www.imdb.com/search/title?title=$FORM{'title'}";
if (is_success( $content = LWP::Simple::get($url) ) ) {
print "$url is alive!\n";
} else {
print "No movies found";
}
$code = "";
if ($content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s) {
$code = $1;
}
I am getting an internal server error at this line
$content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s
I am very new to perl, and would be grateful if anyone could point out my mistake(s).
Use an HTML parser. Regular expressions cannot parse HTML.
Anyway, the reason for the error is probably that you forgot to escape a forward slash in your regex. It should look like this:
/<td class="number">1\.<\/td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s
A very nice interface for this type of work is provided by some tools of the Mojolicious distribution.
Long version
The combination of its UserAgent, DOM and URL classes can work in a very robust way:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
use Mojo::URL;
# preparations
my $ua = Mojo::UserAgent->new;
my $url = "http://www.imdb.com/search/title?title=Casino%20Royale";
# try to load the page
my $tx = $ua->get($url);
# error handling
die join ', ' => $tx->error unless $tx->success;
# extract the url
my $movie_link = $tx->res->dom('a[href^=/title]')->first;
my $movie_url = Mojo::URL->new($movie_link->attrs('href'));
say $movie_url->path->parts->[-1];
Output:
tt0381061
Short version
The funny one liner helper module ojo helps to build a very short version:
$ perl -Mojo -E 'say g("imdb.com/search/title?title=Casino%20Royale")->dom("a[href^=/title]")->first->attrs("href") =~ m|([^/]+)/?$|'
Output:
tt0381061
I agree XML is anti-line-editing thus anti-unix but, there is AWK.
If awk can do, perl can surely do. I can produce a list:
curl -s 'http://www.imdb.com/find?q=300&s=all' | awk -vRS='<a|</a>' -vFS='>|"' -vID=$1 '
$NF ~ ID && /title/ { printf "%s\t", $NF; match($2, "/tt[0-9]+/"); print substr($2, RSTART+1, RLENGTH-2)}
' | uniq
Pass search string to "ID".
Basically it's all about how you choose your tokenizer in awk, I use the <a> tag. Should be easier in perl.