Parsing HTML link from around 600 URLs using Perl - html

I have a list of around 600 drugs as input, and I have written a Perl script that builds the 600 corresponding URLs and grabs each URL's content. Inside each page there is a link labelled Share/Embed Graph that can be clicked to view the HTML source of the embeddable graph. I need the script to follow that link on each of the 600 pages and print the 600 HTML sources to STDOUT. Right now my script is:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use Template::Extract;
use Data::Dumper;

my $infile  = $ARGV[0];
my $outfile = $ARGV[1];
open( my $in,  "<", $infile )  || die "Could not open $infile: $!\n";
open( my $out, ">", $outfile ) || die "Could not open $outfile: $!\n";

# one drug name per input line
my @arrayofdrugterms;
while (<$in>) {
    chomp;
    push @arrayofdrugterms, $_;
}
close $in;

my $mech = WWW::Mechanize->new( autocheck => 0 );
my $ext  = Template::Extract->new;

foreach my $drugterm (@arrayofdrugterms) {
    my $url = "http://www.drugcite.com/?q=$drugterm";
    print $out "$url\n";

    $mech->get($url);
    next unless $mech->success;

    # extraction template for the embed snippet; a heredoc avoids the
    # quoting problems of nesting double quotes inside a "..." string
    my $template = <<"END_TEMPLATE";
<div id="[% DrugCite %]" style="padding:10px">
<img src="http://www.drugcite.com/img/?item=vg&q=$drugterm&meddra=pt" style="border:0px" alt="Top 10 $drugterm ADVERSE EVENTS - DrugCite.com">
<br />
<a href="http://www.drugcite.com/?q=$drugterm" style="font-size:7pt;margin-left:20px;color:#c0c0c0;text-decoration:none">[% source %]</a>
</div>
END_TEMPLATE

    my $result = $ext->extract( $template, $mech->content );
    print Dumper($result) if $result;

    # follow the Share/Embed Graph link and print its HTML source;
    # match on the link's visible text, not its URL
    foreach my $link ( $mech->links ) {
        next unless defined $link->text;
        if ( $link->text =~ /^Share\/Embed Graph$/ ) {
            $mech->get( $link->url );
            print STDOUT $mech->content, "\n";
            last;
        }
    }
}
close $out;
Any help is appreciated. Thanks.
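For the link-following step specifically, here is a minimal sketch using WWW::Mechanize's find_link, which matches on a link's visible text rather than its URL. The link text "Share/Embed Graph" and the drugcite.com URL scheme are taken from the question; reading drug names one per line from STDIN is an assumption for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );

while ( my $drug = <STDIN> ) {
    chomp $drug;
    $mech->get("http://www.drugcite.com/?q=$drug");
    next unless $mech->success;

    # find_link matches the link's visible text, not its href
    my $link = $mech->find_link( text => 'Share/Embed Graph' );
    if ($link) {
        $mech->get( $link->url );
        print $mech->content, "\n";    # HTML source to STDOUT
    }
    else {
        warn "No Share/Embed Graph link found for $drug\n";
    }
}
```

This avoids iterating over $mech->links by hand and comparing against the wrong field of the link object.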

Related

How to read a csv using Perl?

I want to read a CSV using Perl, excluding the first row. Further, the col2 and col3 values need to be stored in another file, and the row that was read must be deleted.
Edit: The following code has worked. I just need the deletion part.
use strict;
use warnings;

my $file = 'D:\Patching_test\ptch_file.csv';
open( my $data, '<', $file ) or die "Could not open file '$file' $!";

my $filename = 'D:\DB_Patch.properties';
while ( my $line = <$data> ) {
    next if $. == 1;    # skip the header row
    chomp $line;
    my ( $field1, $field2, $field3 ) = split ',', $line;
    print "$field1 : $field2 : $field3 ";

    unlink $filename;
    open( my $sh, '>', $filename )
        or die "Could not open file '$filename' $!";
    print $sh "Patch_id=$field2\n";
    print $sh "Patch_Name=$field3";
    close($sh);
}
close($data);
exit 0;
OP's problem is poorly presented for processing:
no input data sample provided
no desired output data presented
no modified input file after processing presented
Based on the problem description, the following code is provided:
use strict;
use warnings;
use feature 'say';

my $input  = 'D:\Patching_test\ptch_file.csv';
my $output = 'D:\DB_Patch.properties';
my $temp   = 'D:\script_temp.dat';

open my $in, '<', $input
    or die "Couldn't open $input: $!";
open my $out, '>', $output
    or die "Couldn't open $output: $!";
open my $tmp, '>', $temp
    or die "Couldn't open $temp: $!";

while ( <$in> ) {
    chomp;
    if ( $. == 1 ) {
        say $tmp $_;    # keep the header row
    } else {
        my ( $patch_id, $patch_name ) = ( split ',' )[ 1, 2 ];
        say $out "Patch_id=$patch_id";
        say $out "Patch_Name=$patch_name";
    }
}

close $in;
close $out;
close $tmp;

# processed rows are "deleted" by replacing the input file
# with the header-only temp file
rename $temp, $input;
exit 0;
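If the goal is strictly "the deletion part", the core module Tie::File can also drop the processed rows from the CSV in place, without a temp file. This is a sketch, assuming (as in the question) that every data row has been processed and only the header line should survive; the helper name and the idea of calling it after the processing loop are mine:

```perl
use strict;
use warnings;
use Tie::File;

# Remove every data row from the CSV in place, keeping only the
# header (record 0). Tie::File maps file lines onto an array, so
# splice-ing the array edits the file on disk.
sub delete_processed_rows {
    my ($csv_path) = @_;
    tie my @rows, 'Tie::File', $csv_path
        or die "Could not tie $csv_path: $!";
    splice @rows, 1;    # drop rows 1..end; row 0 survives
    untie @rows;
}

# e.g.: delete_processed_rows('D:\Patching_test\ptch_file.csv');
```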

HTML tag parsing script

I've written an HTML tag parsing script that I think should work but I'm getting a file not found error. Maybe I'm having a senior moment but I'm stuck. I have all of the *.html files that I want to parse in a directory called Test and I am executing the perl script from a folder called temp that has the directory Test in it. The exact error is: Error opening Test/1.html: No such file or directory.
Here's the code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use HTTP::Headers;
use HTML::HeadParser;
use Text::CSV;
my $csv1 = Text::CSV->new ( { binary => 1 } ) or die Text::CSV->error_diag();
$csv1->eol ("\n");
my $dfile = 'all_tags.csv';
open my $fh1, ">:encoding(utf8)", "$dfile" or die "Error opening $dfile: $!";
my $dir = 'Test';
find (\&HTML_Files, $dir);
print "directory is";
print $dir;
close $fh1 or die "Error closing $dfile: $!";
exit;
sub HTML_Files {
Parse_HTML_Header($File::Find::name) if /\.html?$/;
}
sub Parse_HTML_Header {
my $ifile = shift;
open(my $fh0, '<', $ifile) or die "Error opening $ifile: $!\n";
my $text = '';
{
$/ = undef;
$text = <$fh0>;
}
close $fh0;
my $h = HTTP::Headers->new;
my $p = HTML::HeadParser->new($h);
$p->parse($text);
for ($h->header_field_names) {
my @values = split ',', $h->header($_);
if (/keywords/i) {
$csv1->print ($fh1, \@values);
} elsif (/description/i) {
$csv1->print ($fh1, \@values);
} elsif (/title/i) {
$csv1->print ($fh1, \@values);
}
}
}
It's because File::Find is doing a chdir as it runs. You should pass $_ instead of $File::Find::name. Or set no_chdir:
no_chdir
Does not chdir() to each directory as it recurses. The wanted() function will need to be aware of this, of course. In this case, $_ will be the same as $File::Find::name .
Because you are specifying a relative path, $File::Find::name is also a relative path. You can avoid this by specifying a full path to find as well. (e.g. /full/path/to/dir)
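A minimal sketch of the no_chdir variant, assuming the same layout as the question (a Test directory of *.html files). With no_chdir set, $File::Find::name stays a path you can open directly from the calling directory; the sub name here is mine:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Collect all *.html / *.htm files under $dir without chdir-ing,
# so the returned paths remain valid from the caller's directory.
sub find_html_files {
    my ($dir) = @_;
    my @files;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                # with no_chdir, $_ equals $File::Find::name
                push @files, $File::Find::name if /\.html?$/;
            },
        },
        $dir
    );
    return @files;
}

# e.g.: Parse_HTML_Header($_) for find_html_files('Test');
```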

HTML::TagFilter returning HTML::Element HASH object

New to Perl, and I am digging what I can do as well as the support and documentation available for all these great libraries; however, I am having an issue with a script I am working on. Prior to implementing HTML::TagFilter, I was using line 63 (print FH $tree->as_HTML) to print the HTML content I was looking for to a file; specifically, I looked for everything inside the body tag. Now I'd like to print out only the p tags, h tags, and img tags, without any attributes. When I run my code, the files are created in the proper directory, but each file contains a printed hash object (HTML::Element=HASH(0x3a104c8)).
use open qw(:locale);
use strict;
use warnings qw(all);
use HTML::TreeBuilder 5 -weak; # Ensure weak references in use
use URI::Split qw/ uri_split uri_join /;
use HTML::TagFilter;
my @links;
open(FH, "<", "index/site-index.txt")
or die "Failed to open file: $!\n";
while(<FH>) {
chomp;
push @links, $_;
}
close FH;
my $dir = "";
while($dir eq ""){
print "What is the name of the site we are working on? ";
$dir = <STDIN>;
chomp $dir;
}
#make directory to store files
mkdir($dir);
my $entities = "";
my $indent_char = "\t";
my $filter = HTML::TagFilter->new(
allow=>{ p => { none => [] }, h1 => { none => [] }, h2 => { none => [] }, h3 => { none => [] }, h4 => { none => [] }, h5 => { none => [] }, h6 => { none => [] }, img => { none => [] }, },
log_rejects => 1,
strip_comments => 1
);
foreach my $url (@links){
#print $url;
my ($filename) = $url =~ m#([^/]+)$#;
#print $filename;
$filename =~ tr/=/_/;
$filename =~ tr/?/_/;
#print "\n";
my $currentfile = $dir . '/' . $filename . '.html';
print "Preparing " . $currentfile . "\n" . "\n";
open (FH, '>', $currentfile)
or die "Failed to open file: $!\n";
my $tree = HTML::TreeBuilder->new_from_url($url); # new_from_url already fetches and parses the page
$tree = $tree->look_down('_tag', 'body');
if($tree){
$tree->dump; # a method we inherit from HTML::Element
print FH $filter->filter($tree);
#print FH $tree->as_HTML($entities, $indent_char), "\n";
} else{
warn "No body tag found";
}
print "File " . $currentfile . " completed.\n" . "\n";
close FH;
}
Why is this happening and how can I print the actual content I am looking for?
Thank you.
$filter->filter() expects HTML. HTML::TreeBuilder is not HTML but a subclass of HTML::Element, and look_down() returns an HTML::Element. That is what you see from your print: when you treat the reference as a string, you get the string representation of the object, HTML::Element=HASH(0x7f81509ab6d8), i.e. the class name (HTML::Element), the fact that the object is implemented as a HASH structure, and the memory address of the object.
You can fix it all by calling filter with the HTML from the look_down:
print FH $filter->filter($tree->as_HTML);

HTML post Perl script to text file

I have a Perl script (I didn't write it) that takes a POST from an html page and it displays a certain section of a txt file to a webpage. The problem is, now I need it to also make a text file of that section to a text file on our Unix server. Any help? Code below.
#!/usr/bin/perl
#
print "Content-type: text/html\n\n";
print '<pre>';
read(STDIN, $buf, $ENV{'CONTENT_LENGTH'});
#print "$buf\n";
#print "$REMOTE_USER = $REMOTE_USER\n";
@pairs = split(/&/, $buf);
#print "$pairs\n";
($txt_HRcode, $lc_HRcode) = split(/=/,$pairs[0]);
#print "$txt_HRcode\n";
#$HRcode = " HRcode: E2PSYAA0";
$HRcode = " HRcode: F8".uc($lc_HRcode)."0";
#print "$HRcode\n";
open(LINEFIND, "grep -n \"$HRcode\" /release/ucpmr/ucpmr.txt |") or die print "Can't Open File" ;
$line_num = <LINEFIND>;
#print "$line_num\n";
#if($line_num !~ m/$HRcode/) {print "SEQUENCE CODE NOT FOUND"; die()};
($sline, $hrd, $lin_text) = split(/:/, $line_num);
$beg_line = ($sline - 2);
$end_line = ($beg_line + 10000);
#print "$beg_line\n";
#print "$end_line\n";
close(LINEFIND);
open(DISP, "/release/ucpmr/ucpmr.txt") or die print "File is no longer in History. Press Back to Return";
for ($incr = 1; $incr <= $end_line; $incr += 1) {
$line = <DISP>;
if ($incr > $beg_line) {
if ($incr > $sline) {
if ($line =~ m/HRcode: F8/) {
if ($line !~ m/$HRcode/) { $quit_line = $incr - 3; last; }
}
}
}
}
close(DISP);
open(PRINTFIND, "/release/ucpmr/ucpmr.txt") or die print "File is no longer in History. Press Back to Return";
for ($incr = 1; $incr <= $quit_line; $incr += 1) {
$line = <PRINTFIND>;
#$line =~ s/\d\d\d-\d\d-/XXX-XX-/;
if ($incr > $beg_line) { print "$line"; }
}
#print "quit line is : $quit_line\n";
print "</pre>";
Change the end part, starting from open(PRINTFIND, ... like this:
open(PRINTFIND, "/release/ucpmr/ucpmr.txt") or die print "File is no longer in History. Press Back to Return";
open(my $fh, '>', '/release/ucpmr/TEXT_FILE_NAME.txt') or die "Cannot open output file: $!";
for ($incr = 1; $incr <= $quit_line; $incr += 1) {
$line = <PRINTFIND>;
#$line =~ s/\d\d\d-\d\d-/XXX-XX-/;
if ($incr > $beg_line) { print "$line"; print $fh $line; }
}
close $fh;
#print "quit line is : $quit_line\n";
print "</pre>";
... but dude, if you're a .NET guy, do yourself a favor and rewrite this mess in .NET, seriously...
open(my $fh, '>', '/release/ucpmr/TEXT_FILE_NAME.txt');
print $fh FILE_CONTENT;
close $fh;
like this?

How can I extract data from HTML tables in Perl?

I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:
<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
Here I wish to take out "Time Played", "Artist", "Title", and "Label", and print them to an output file.
I've tried many regular expressions such as:
$lines =~ / (<td>) /
OR
$lines =~ / <td>(.*)< /
OR
$lines =~ / >(.*)< /
My current program looks like so:
#!perl -w
open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;
open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;
my $lines = join '', <INPUT_FILE>;
print "Hello 2\n";
if ($lines =~ / (\S.*\S) /) {
print "this is 1: \n";
print $1;
if ($lines =~ / <td>(.*)< / ) {
print "this is the 2nd 1: \n";
print $1;
print "the word was: $1.\n";
$Time = $1;
print $Time;
print OUTPUT_FILE $Time;
} else {
print "2ND IF FAILED\n";
}
} else {
print "THIS FAILED\n";
}
close(INPUT_FILE);
close(OUTPUT_FILE);
Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
HTML::Parser
HTML::TreeBuilder
HTML::TableExtract
Use HTML::TableExtract. Really.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
my $file = 'Table3.htm';
unless ( -e $file ) {
my $rc = getstore(
'http://www.ntsb.gov/aviation/Table3.htm',
$file);
die "Failed to download document\n" unless $rc == 200;
}
my @headers = qw( Year Fatalities );
my $te = HTML::TableExtract->new(
headers => \@headers,
attribs => { id => 'myTable' },
);
$te->parse_file($file);
my ($table) = $te->tables;
print join("\t", @headers), "\n";
for my $row ( $te->rows ) {
print join("\t", @$row), "\n";
}
This is what I meant in another post by "task-specific" HTML parsers.
You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.
That's an easy one:
my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";
See http://codepad.org/qz9d5Bro if you want to try running it.