How can I extract data from HTML tables in Perl?

How can I extract data from HTML tables in Perl? - html

I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:
<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
Here I wish to take out "Time Played", "Artist", "Title", and "Label", and print them to an output file.
I've tried many regular expressions such as:
$lines =~ / (<td>) /
OR
$lines =~ / <td>(.*)< /
OR
$lines =~ / >(.*)< /
My current program looks like so:
#!perl -w
open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;
open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;
my $lines = join '', <INPUT_FILE>;
print "Hello 2\n";
if ($lines =~ / (\S.*\S) /) {
print "this is 1: \n";
print $1;
if ($lines =~ / <td>(.*)< / ) {
print "this is the 2nd 1: \n";
print $1;
print "the word was: $1.\n";
$Time = $1;
print $Time;
print OUTPUT_FILE $Time;
} else {
print "2ND IF FAILED\n";
}
} else {
print "THIS FAILED\n";
}
close(INPUT_FILE);
close(OUTPUT_FILE);

Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
HTML::Parser
HTML::TreeBuilder
HTML::TableExtract

Use HTML::TableExtract. Really.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
my $file = 'Table3.htm';
unless ( -e $file ) {
my $rc = getstore(
'http://www.ntsb.gov/aviation/Table3.htm',
$file);
die "Failed to download document\n" unless $rc == 200;
}
my #headers = qw( Year Fatalities );
my $te = HTML::TableExtract->new(
headers => \#headers,
attribs => { id => 'myTable' },
);
$te->parse_file($file);
my ($table) = $te->tables;
print join("\t", #headers), "\n";
for my $row ($te->rows ) {
print join("\t", #$row), "\n";
}
This is what I meant in another post by "task-specific" HTML parsers.
You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.

That's an easy one:
my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my #stuff = $html =~ />([^<]+)</g;
print join (", ", #stuff), "\n";
See http://codepad.org/qz9d5Bro if you want to try running it.

Related

Corrupted JSON encoding in Perl (missign comma)

My custom code (on Perl) give next wrong JSON, missing comma between blocks:
{
"data": [{
"{#LOGFILEPATH}": "/tmp/QRZ2007.tcserverlogs",
"{#LOGFILE}": "QRZ2007"
} **missing comma** {
"{#LOGFILE}": "ARZ2007",
"{#LOGFILEPATH}": "/tmp/ARZ2007.tcserverlogs"
}]
}
My terrible code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use utf8;
use JSON;
binmode STDOUT, ":utf8";
my $dir = $ARGV[0];
my $json = JSON->new->utf8->space_after;
opendir(DIR, $dir) or die $!;
print '{"data": [';
while (my $file = readdir(DIR)) {
next unless (-f "$dir/$file");
next unless ($file =~ m/\.tcserverlogs$/);
my $fullPath = "$dir/$file";
my $filenameshort = basename($file, ".tcserverlogs");
my $data_to_json = {"{#LOGFILEPATH}"=>$fullPath,"{#LOGFILE}"=>$filenameshort};
my $data_to_json = {"{#LOGFILEPATH}"=>$fullPath,"{#LOGFILE}"=>$filenameshort};
print $json->encode($data_to_json);
}
print ']}'."\n";
closedir(DIR);
exit 0;
Dear Team i am not a programmer, please any idea how fix it, thank you!

If you do not print a comma, you will not get a comma.
You are trying to build your own JSON string from pre-encoded pieces of smaller data structures. That will not work unless you tell Perl when to put commas. You could do that, but it's easier to just collect all the data into a Perl data structure that is equivalent to the JSON string you want to produce, and encode the whole thing in one go when you're done.
my $dir = $ARGV[0];
my $json = JSON->new->utf8->space_after;
my #data;
opendir( DIR, $dir ) or die $!;
while ( my $file = readdir(DIR) ) {
next unless ( -f "$dir/$file" );
next unless ( $file =~ m/\.tcserverlogs$/ );
my $fullPath = "$dir/$file";
my $filenameshort = basename( $file, ".tcserverlogs" );
my $data_to_json = { "{#LOGFILEPATH}" => $fullPath, "{#LOGFILE}" => $filenameshort };
push #data, $data_to_json;
}
closedir(DIR);
print $json->encode( { data => \#data } );

How to convert tag names and values from XML into HTML using Perl

Is there any way to convert a simple XML document into HTML using Perl that would give me a table of tag names and tag values?
The XML file output.xml is like this
<?xml version="1.0"?>
<doc>
<GI-eSTB-MIB-NPH>
<eSTBGeneralErrorCode.0>INTEGER: 0</eSTBGeneralErrorCode.0>
<eSTBGeneralConnectedState.0>INTEGER: true(1)</eSTBGeneralConnectedState.0>
<eSTBGeneralPlatformID.0>INTEGER: 2076</eSTBGeneralPlatformID.0>
<eSTBGeneralFamilyID.0>INTEGER: 25</eSTBGeneralFamilyID.0>
<eSTBGeneralModelID.0>INTEGER: 60436</eSTBGeneralModelID.0>
<eSTBMoCAMACAddress.0>STRING: 0:0:0:0:0:0</eSTBMoCAMACAddress.0>
<eSTBMoCANumberOfNodes.0>INTEGER: 0</eSTBMoCANumberOfNodes.0>
</GI-eSTB-MIB-NPH>
</doc>
I am trying to create HTML which looks like this
1. eSTBGeneralPlatformID.0 - INTEGER: 2076
2. eSTBGeneralFamilyID.0 - INTEGER: 25
3.
I was trying to use code from the web but I am really having a hard time understanding how to generate the required format for HTML tags.
What I was trying was this
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
use XML::LibXML;
#Add TagNumberConversion.pl here
my $parser = XML::Parser->new();
$parser->setHandlers(
Start => \&start,
End => \&end,
Char => \&char,
Proc => \&proc,
);
my $header = &getXHTMLHeader();
print $header;
$parser->parsefile( '20150630104826.xml' );
my $currentTag = "";
sub start() {
my ( $parser, $name, %attr ) = #_;
$currentTag = $name;
if ( $currentTag eq 'doc' ) {
print "<head><title>"
. "Output of snmpwalk for cpeIP4"
. "</title></head>";
print "<body><h2>" . "Output of snmpwalk for cpeIP4" . "</h2>";
print '<table summary="'
. "Output of snmpwalk for cpeIP4"
. '"><tr><th>Tag Name</th><th>Tag Value</th></tr>';
}
elsif ( $currentTag eq 'GI-eSTB-MIB-NPH' ) {
print "<tr>";
}
elsif ( $currentTag =~ /^eSTB/ ) {
print "<tr>";
}
else {
print "<td>";
}
}
sub end() {
my ( $parser, $name, %attr ) = #_;
$currentTag = $name;
if ( $currentTag eq 'doc' ) {
print "</table></body></html>";
}
elsif ( $currentTag eq 'GI-eSTB-MIB-NPH' ) {
print "</tr>";
}
elsif ( $currentTag =~ /^eSTB/ ) {
print "</tr>";
}
else {
print "</td>";
}
}
sub char() {
my ( $parser, $data ) = #_;
print $data;
}
sub proc() {
my ( $parser, $target, $data ) = #_;
if ( lc( $target ) eq 'perl' ) {
$data = eval( $data );
print $data;
}
}
sub getXHTMLHeader() {
my $header = '<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">';
return $header;
}
This is code in progress, but I realize that this will be overkill for my requirement.
So I am trying to figure out if there is any quick way to do it using Perl.
Please give me some pointers if there is indeed any quick way.

The quick and dirty way is to just use a regular expression. However it comes with the risk of missing some data and getting burned by edge cases. But since you asked for it...
#!/usr/bin/env perl
use strict;
open my $fh, 'filename.xml'
or die "unable to open filename.xml : $!";
my $count = 1;
print "<head><title>'Output of snmpwalk for cpeIP4'</title></head>\n";
print "<body><h2>'Output of snmpwalk for cpeIP4'</h2>\n";
print "<table summary='Output of snmpwalk for cpeIP4'><tr><th>Tag Name</th><th>Tag Value</th></tr>\n";
while (my $line = <$fh>) {
next unless $line =~ m|<eSTB|;
# Store into into $tag and $value
# the result of matching whitespace, followed by '<'
# followed by anything (store into $tag)
# followed by '>'
# followed by anything (store into $value)
# followed by '<'
my ($tag, $value) = $line =~ m|\s+<(.+?)>(.+?)<|;
print "<tr><td>" . $count++ . ". $tag</td><td>$value</td></tr>\n";
}
print "</table></body></html>\n";
Produces the following:
<head><title>'Output of snmpwalk for cpeIP4'</title></head>
<body><h2>'Output of snmpwalk for cpeIP4'</h2>
<table summary='Output of snmpwalk for cpeIP4'><tr><th>Tag Name</th><th>Tag Value</th></tr>
<tr><td>1. eSTBGeneralErrorCode.0</td><td>INTEGER: 0</td></tr>
<tr><td>2. eSTBGeneralConnectedState.0</td><td>INTEGER: true(1)</td></tr>
<tr><td>3. eSTBGeneralPlatformID.0</td><td>INTEGER: 2076</td></tr>
<tr><td>4. eSTBGeneralFamilyID.0</td><td>INTEGER: 25</td></tr>
<tr><td>5. eSTBGeneralModelID.0</td><td>INTEGER: 60436</td></tr>
<tr><td>6. eSTBMoCAMACAddress.0</td><td>STRING: 0:0:0:0:0:0</td></tr>
<tr><td>7. eSTBMoCANumberOfNodes.0</td><td>INTEGER: 0</td></tr>
</table></body></html>

Firstly, I think you're using the wrong tool for this. I always find XML::LibXML far easier to use than XML::Parser. You load XML::LibXML, but you never make use of it.
Secondly, I think you'll find your live is easier if you think of this as two stages - one to extract the data and one to output the new data.
Here's the first stage, which stores the data you need in an array.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use XML::LibXML;
use Data::Dumper;
my $file = shift || die "Must give XML file\n";
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($file);
my #tags;
# Find the nodes using an XPath expression
foreach ($doc->findnodes('//GI-eSTB-MIB-NPH/*')) {
push #tags, { name => $_->nodeName, content => $_->textContent };
}
# Just here to show the intermediate data structure
say Dumper \#tags;
You then need to use #tags to generate your output. For over fifteen years we've know that it's a terrible idea to include hard-coded HTML in amongst your Perl code, so I'd highly recommend looking at a templating system like the Template Toolkit.
I created a xml.tt file like this:
<html>
<head>
<title>Output of snmpwalk for cpeIP4</title>
</head>
<body><h2>Output of snmpwalk for cpeIP4</h2>
<table summary='Output of snmpwalk for cpeIP4'>
<tr>
<th>Tag Name</th><th>Tag Value</th><
/tr>
[% FOREACH tag IN tags -%]
<tr><td>[% loop.count %]. [% tag.name %]</td><td>[% tag.content %]</td></tr>
[% END -%]
</table>
</body>
</html>
And then the second half of my program looks like this:
use Template;
my $tt = Template->new;
$tt->process('xml.tt', { tags => \#tags });
I hope you agree that all looks a lot simpler than your approach.

Getting links from an HTML table using HTML::TableExtract and HTML::Extor in Perl

My goal is to extract the links from the tables titled "Agonists," "Antagonists," and "Allosteric Regulators" in the following site:
http://www.iuphar-db.org/DATABASE/ObjectDisplayForward?objectId=1&familyId=1
I've been using HTML::TableExtract to extract the tables but have been unable to get HTML::LinkExtor to retrieve the links in question. Here is the code I have so far:
use warnings;
use strict;
use HTML::TableExtract;
use HTML::LinkExtor;
my #names = `ls /home/wallakin/LINDA/ligands/iuphar/data/html2/`;
foreach (#names)
{
chomp ($_);
my $te = HTML::TableExtract->new( headers => [ "Ligand",
"Sp.",
"Action",
"Affinity",
"Units",
"Reference" ] );
my $le = HTML::LinkExtor->new();
$te->parse_file("/home/wallakin/LINDA/ligands/iuphar/data/html2/$_");
my $output = $_;
$output =~ s/\.html/\.txt/g;
open (RESET, ">/home/wallakin/LINDA/ligands/iuphar/data/links/$output") or die "Can't reset";
close RESET;
#open (DATA, ">>/home/wallakin/LINDA/ligands/iuphar/data/links/$output") or die "Can't append to file";
foreach my $ts ($te->tables)
{
foreach my $row ($ts->rows)
{
$le->parse($row->[0]);
for my $link_tag ( $le->links )
{
my %links = #$link_tag;
print #$link_tag, "\n";
}
}
}
#print "Links extracted from $_\n";
}
I've tried using some sample code from another thread on this site (Perl parse links from HTML Table) to no avail. I'm not sure whether it's a problem of parsing or table recognition. Any help provided would be greatly appreciated. Thanks!

Try this as a base script (you only need to adapt it to fetch links) :
use warnings; use strict;
use HTML::TableExtract;
use HTML::LinkExtor;
use WWW::Mechanize;
use utf8;
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
my $m = WWW::Mechanize->new( autocheck => 1, quiet => 0 );
$m->agent_alias("Linux Mozilla");
$m->cookie_jar({});
my $te = HTML::TableExtract->new(
headers => [
"Ligand",
"Sp.",
"Action",
"Affinity",
"Units",
"Reference"
]
);
$te->parse(
$m->get("http://tinyurl.com/jvwov9m")->content
);
foreach my $ts ($te->tables) {
print "Table (", join(',', $ts->coords), "):\n";
foreach my $row ($ts->rows) {
print join(',', #$row), "\n";
}
}

You don't describe what the problem is...what exactly doesn't work? What does $row->[0] contain? But part of the problem might be that TableExtract returns just the 'visible' text, not the raw html, by default. You probably want to use the keep_html option in HTML::TableExtract.

how to pass a variable from an HTML form to a perl cgi script?

I would like to use an HTML form to pass a variable to Perl CGI script so that I can process that variable, and then print it out on another HTML page.
Here is my HTML code: http://jsfiddle.net/wTVQ5/.
Here is my Perl CGI script to links the HTML. Here is the way I would like to do it (since it uses less lines and probably more efficient).
#!/usr/bin/perl
use warnings; use strict;
use CGI qw( :standard);
my $query = CGI->new;
# Process an HTTP request
my $user = $query->param('first_name');
# process $user... for example:
my $foo = "Foo";
my $str = $user . $foo;
print "Content-type:text/html\r\n\r\n";
print "<html>";
print "<head>";
print "<title>Hello - Second CGI Program</title>";
print "</head>";
print "<body>";
print "<h2>Hello $str - Second CGI Program</h2>";
print "</body>";
print "</html>";
1;
Here's a way I read from a tutorial and makes more sense to me:
#!/usr/bin/perl
use warnings; use strict;
my ($buffer, #pairs, $pair, $name, $value, %FORM);
# Read in text
$ENV{'REQUEST_METHOD'} =~ tr/a-z/A-Z/;
if ($ENV{'REQUEST_METHOD'} eq "POST") {
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
}else {
$buffer = $ENV{'QUERY_STRING'};
}
# Split information into name/value pairs
#pairs = split(/&/, $buffer);
foreach $pair (#pairs) {
($name, $value) = split(/=/, $pair);
$value =~ tr/+/ /;
$value =~ s/%(..)/pack("C", hex($1))/eg;
$FORM{$name} = $value;
}
my $user = $FORM{first_name};
# process $user... for example:
my $foo = "Foo";
my $str = $user . $foo;
print "Content-type:text/html\r\n\r\n";
print "<html>";
print "<head>";
print "<title>Hello - Second CGI Program</title>";
print "</head>";
print "<body>";
print "<h2>Hello $str - Second CGI Program</h2>";
print "</body>";
print "</html>";
1;
Both of these don't work properly BTW. When I click on the submit button on the HTML page, it just links me to the script instead of passing the variable, processing it, and printing out the HTML page.

this line:
print "Content-type:text/html\r\n\r\n";
should be:
print "Content-type:text/html\n\n";
or better:
print $query->header;
Also, ensure your web server was well configurated for CGI. And, if you have enough time, use a modern web application approach, there are many frameworks that may be better than CGI (Dancer, Mojolicious, OX, ...)

I see your using CGI 'standard', no need really to initiate a CGI->new unless you just wanted to, also you said less lines, you could just do something like this.
use strict;
use warnings;
use CGI qw( :standard );
my $user = param('first_name') || q/foo/;
print header,
start_html(-title => 'Hello'), h1('Hello ' . $user), end_html;

You need to edit your httpd.conf with something like this.
AddHandler cgi-script cgi pl
<Directory /path/to/cgi/files>
Options +ExecCGI
</Directory>
If you are running this locally you could create a folder named public_html in your home directory structure and set this to run your scripts, you would just have to configure that also mapping it to that location.

Remove trailing commas at the end of the string using Perl

I'm parsing a CSV file in which each line look something as below.
10998,4499,SLC27A5,Q9Y2P5,GO:0000166,GO:0032403,GO:0005524,GO:0016874,GO:0047747,GO:0004467,GO:0015245,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
There seems to be trailing commas at the end of each line.
I want to get the first term, in this case "10998" and get the number of GO terms related to it.
So my output in this case should be,
Output:
10998,7
But instead it shows 299. I realized overall there are 303 commas in each line. And I'm not able to figure out an easy way to remove trailing commas. Can anyone help me solve this issue?
Thanks!
My Code:
use strict;
use warnings;
open my $IN, '<', 'test.csv' or die "can't find file: $!";
open(CSV, ">GO_MF_counts_Genes.csv") or die "Error!! Cannot create the file: $!\n";
my #genes = ();
my $mf;
foreach my $line (<$IN>) {
chomp $line;
my #array = split(/,/, $line);
my #GO = splice(#array, 4);
my $GO = join(',', #GO);
$mf = count($GO);
print CSV "$array[0],$mf\n";
}
sub count {
my $go = shift #_;
my $count = my #go = split(/,/, $go);
return $count;
}

I'd use juanrpozo's solution for counting but if you still want to go your way, then remove the commas with regex substitution.
$line =~ s/,+$//;

I suggest this more concise way of coding your program.
Note that the line my #data = split /,/, $line discards trailing empty fields (#data has only 11 fields with your sample data) so will produce the same result whether or not trailing commas are removed beforehand.
use strict;
use warnings;
open my $in, '<', 'test.csv' or die "Cannot open file for input: $!";
open my $out, '>', 'GO_MF_counts_Genes.csv' or die "Cannot open file for output: $!";
foreach my $line (<$in>) {
chomp $line;
my #data = split /,/, $line;
printf $out "%s,%d\n", $data[0], scalar grep /^GO:/, #data;
}

You can apply grep to #array
my $mf = grep { /^GO:/ } #array;
assuming $array[0] never matches /^GO:/

For each your line:
foreach my $line (<$IN>) {
my ($first_term) = ($line =~ /(\d+),/);
my #tmp = split('GO', " $line ");
my $nr_of_GOs = #tmp - 1;
print CSV "$first_term,$nr_of_GOs\n";
}

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

How can I extract data from HTML tables in Perl? - html

That's an easy one: my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>'; my #stuff = $html =~ />([^<]+)</g; print join (", ", #stuff), "\n"; See http://codepad.org/qz9d5Bro if you want to try running it.

Related

Corrupted JSON encoding in Perl (missign comma)

How to convert tag names and values from XML into HTML using Perl

Getting links from an HTML table using HTML::TableExtract and HTML::Extor in Perl

how to pass a variable from an HTML form to a perl cgi script?

Remove trailing commas at the end of the string using Perl

Categories

Resources