HTML Parsing using HTML::Tree in Perl

I'm trying to scrape a webpage and get the values that are inside the html tags. The end result would be a way to separate values in a way that looks like
Club: x Location: y URL: z
Here's what I have so far
use HTML::Tree;
use LWP::Simple;
$url = "http://home.gotsoccer.com/clubs.aspx?&clubname=&clubstate=AL&clubcity=";
$content = get($url);
$tree = HTML::Tree->new();
$tree->parse($content);
@td = $tree->look_down( _tag => 'td', class => 'ClubRow');
foreach $1 (@td) {
print $1->as_text();
print "\n";
}
And what is printed is like
AYSO UnitedMadison, ALwww.aysounitednorthalabama.org
This is what the HTML looks like
<td class="ClubRow" width="80%">
<div>
AYSO United</div>
<div class="SubHeading">Madison, AL</div>
<img src="/images/icons/ArrowRightSm.png" class="LinkIcon"><font color="black">www.aysounitednorthalabama.org</font>
</td>
I need a way to either split these fields into separate variables or add some sort of delimiting character so I can do it with Regex. There isn't much documentation online, so any help would be appreciated.

First, this is an abomination:
foreach $1 (@td) {
print $1->as_text();
print "\n";
}
You might think it is cute, but it is confusing to use a regex capture variable such as $1 as a loop variable, especially since you also say "I need a way... so I can do it with Regex." (emphasis mine)
This is the kind of nonsense that leads to unmaintainable programs which give Perl a bad name.
Always use strict and warnings and use a plain variable for your loops.
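For instance, the question's loop rewritten that way (same fetching and parsing, just with strict, warnings, and a plain lexical loop variable):
use strict;
use warnings;
use HTML::Tree;
use LWP::Simple;
my $url     = "http://home.gotsoccer.com/clubs.aspx?&clubname=&clubstate=AL&clubcity=";
my $content = get($url);
my $tree    = HTML::Tree->new();
$tree->parse($content);
my @td = $tree->look_down( _tag => 'td', class => 'ClubRow' );
for my $td (@td) {    # plain lexical loop variable instead of $1
    print $td->as_text(), "\n";
}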
Second, you are interested in three specific elements in each td: 1) The text of a[class="ClubLink"]; 2) The text of div[class="SubHeading"]; and 3) The text of font[color="black"].
So, just extract those three bits of information instead of flattening the text inside a td:
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Tree;
my $html = <<HTML;
<td class="ClubRow" width="80%"> <div><a class="ClubLink">AYSO United</a></div>
<div class="SubHeading">Madison, AL</div> <a
href="http://www.aysounitednorthalabama.org" target="_blank"><img
src="/images/icons/ArrowRightSm.png" class="LinkIcon"><font
color="black">www.aysounitednorthalabama.org</font></a> </td>
HTML
my $tree = HTML::Tree->new_from_content( $html );
my @wanted = (
[class => 'ClubLink'],
[class => 'SubHeading'],
[_tag => 'font', color => 'black'],
);
my @td = $tree->look_down( _tag => 'td', class => 'ClubRow');
for my $td ( @td ) {
my ($club, $loc, $www) = map $td->look_down(@$_)->as_text, @wanted;
print join(' - ', $club, $loc, $www), "\n";
}
Output:
$ ./gg.pl
AYSO United - Madison, AL - www.aysounitednorthalabama.org
Of course, I would have probably used HTML::TreeBuilder::XPath to take advantage of XPath queries.
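For reference, a rough sketch of that XPath approach might look like the following (untested, assuming HTML::TreeBuilder::XPath is installed and reusing the same $html as above):
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $html );
for my $td ( $tree->findnodes('//td[@class="ClubRow"]') ) {
    # findvalue returns the text of the node(s) matching each expression
    my $club = $td->findvalue('.//a[@class="ClubLink"]');
    my $loc  = $td->findvalue('.//div[@class="SubHeading"]');
    my $www  = $td->findvalue('.//font[@color="black"]');
    print join(' - ', $club, $loc, $www), "\n";
}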

Here's a Mojolicious example. It's the same thing that Sinan did but with a different toolbox which has the tools to fetch and process the webpage. It looks a bit long, but that's just the comments and documentation. ;)
I like that Mojolicious is "batteries included", so once I load one of the modules, I probably have everything else I need for the task:
use v5.10;
use Mojo::UserAgent;
my $url = "http://home.gotsoccer.com/clubs.aspx?&clubname=&clubstate=AL&clubcity=";
my $ua = Mojo::UserAgent->new;
my $tx = $ua->get( $url );
# You could do some error checking here in case the fetch fails
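# for example, a minimal check might be (a sketch only; check the
# Mojo::Message::Response docs for the methods your version provides):
#   die "Failed to fetch $url\n" unless $tx->res->is_success;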
$tx->res->dom
# there are lots of ClubRow td cells, but we want the one with
# the width attribute. Find all of those. See Mojo::DOM::CSS for
# docs on CSS selectors.
->find( 'td[class=ClubRow][width=80%]' )
# now go through each td and extract several things
->map( sub {
# these selectors represent the club location, name, and website
state $find = [ qw(
a[class=ClubLink]
div[class=SubHeading]
font[color=black]
) ];
my $chunk = $_;
# return the location, name, and link as a tuple for later
# processing
[
map { s/\t+/ /gr } # remove tabs so we can use them as a separator
map { $chunk->find( $_ )->map( 'text' )->[0] }
@$find
]
} )
# do something with all tuples. In this case, output them as tab
# separated values (which is why we removed tabs already). You
# should be able to easily import this into a spreadsheet application.
->each( sub { say join "\t", @$_ } );
The output has that annoying first line, but you can fix that up on your own:
*****Other Club*****
Alabama Soccer Association www.alsoccer.org
Alabaster Competitive SC acsc.teampages.com/
Alabaster Parks and Rec
Alex City YSL www.alexcitysoccer.com/
Auburn Thunder SC auburnthundersoccer.com/
AYSO United Madison, AL www.aysounitednorthalabama.org
Birmingham Area Adult Soccer League
Birmingham Bundesliga LLC Birmingham, AL www.birmingham7v7.com
Birmingham Premier League
Birmingham United SA Birmingham, AL, AL www.birminghamunited.com/
Blount County Youth Soccer Oneonta, AL bcysfury.com
Briarwood SC Birmingham, AL www.questrecreation.org/briarwood-soccer-club.html...
Capital City Streaks Montgomery, AL www.capitalcitystreaks.org
City of Calera Youth Soccer


decode_json and return first key in hash

JSON string input: https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo
I am trying to return just the first key (current day) in the hash but have been unable to do so. My code looks like the following
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Data::Dumper;
use JSON;
my $html = get("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=AMD&apikey=CMDPTEHVYH7W5VSZ");
my $decoded = decode_json($html);
my ($open) = $decoded->{'Time Series (Daily)'}->[0]->{'1. open'};
I keep getting "Not an ARRAY reference" which I researched and got more confused.
I can access what I want directly with the below code but I want to access just the first result or the current day:
my ($open) = $decoded->{'Time Series (Daily)'}{'2017-12-20'}{'1. open'};
Also if I do something like this:
my ($open) = $decoded->{'Time Series (Daily)'};
print Dumper($open);
The output is as follows:
$VAR1 = {
'2017-09-07' => {
'1. open' => '12.8400',
'5. volume' => '35467788',
'2. high' => '12.9400',
'4. close' => '12.6300',
'3. low' => '12.6000'
},
'2017-11-15' => {
'3. low' => '10.7700',
'4. close' => '11.0700',
'2. high' => '11.1300',
'5. volume' => '33326871',
'1. open' => '11.0100'
},
'2017-11-30' => {
'1. open' => '10.8700',
'2. high' => '11.0300',
'5. volume' => '43101899',
'3. low' => '10.7600',
'4. close' => '10.8900'
},
Thank you in advance for any help you can provide a noob.
Problem 1: { denotes the start of a JSON object, which gets decoded into a hash. Trying to derefence an array is going to fail.
Problem 2: Like Perl hashes, JSON objects are unordered, so talking about the
"first key" makes no sense. Perhaps you want the most recent date?
use List::Util qw( maxstr );
my $time_series_daily = $decoded->{'Time Series (Daily)'};
my $latest_date = maxstr keys %$time_series_daily;
my $open = $time_series_daily->{$latest_date}{'1. open'};
You are picking among hashref keys, not array (sequential container) elements. Since hashes are inherently unordered, you can't index into that list; you need to sort the keys as needed.
With the exact format you show this works
my $top = (sort { $b cmp $a } keys %{ $decoded->{'Time Series (Daily)'} } )[0];
say $decoded->{'Time Series (Daily)'}{$top}{'1. open'};
It gets the list of keys, inverse-sorts them (alphabetically), and takes the first element of that list.
If your date-time format may vary then you'll need to parse it for sorting.
If you only ever want the most recent one, this is inefficient since it sorts the whole list. In that case use a more specific tool to extract only the "largest" element, like
use List::Util qw(reduce);
my $top = reduce { $a gt $b ? $a : $b }
keys %{ $decoded->{'Time Series (Daily)'} };
But then in your case this can be done simply with maxstr from the same List::Util module, as shown in ikegami's answer. On the other hand, if the datetime format doesn't lend itself to the direct lexicographical comparison used by maxstr, then reduce allows the use of custom comparisons.
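For instance, a small sketch with a hypothetical day-month-year key format (e.g. '20-12-2017'), which would not compare correctly as plain strings:
use List::Util qw(reduce);
# rearrange hypothetical 'DD-MM-YYYY' keys to 'YYYY-MM-DD' so a plain
# string comparison picks the most recent date
my $ymd = sub { join '-', reverse split /-/, $_[0] };
my $top = reduce { $ymd->($a) gt $ymd->($b) ? $a : $b }
          keys %{ $decoded->{'Time Series (Daily)'} };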

perl: Creating PDF Avery labels on linux?

Which would be the best and most flexible process for creating formatted PDFs of Avery labels on a Linux machine with Perl?
The labels need to include images and will have formatting similar to spanned rows and columns in an HTML table. An example would be several rows of text on the left-hand side and an image on the right-hand side that spans the text.
These are my thoughts but if you have additional ideas please let me know.
perl to PDF with PDF::API2
perl to PS with ??? -> PS to PDF with ???
perl to HTML w/ CSS formatting -> HTML to PDF with wkhtmltopdf
Has anybody done this and have any pointers, examples or links that may be of assistance?
Thank You,
~Donavon
They are all viable options.
I found wkhtmltopdf to be too resource intensive and slow. If you do want to go down that route there are existing html templates already which can be found by a quick google search.
PDF::API2 performs very well, and I run it on a server system with no problems. Here's an example script I use for laying out elements in a grid format;
#!/usr/bin/env perl
use strict 'vars';
use FindBin;
use PDF::API2;
# Min usage, expects blank.pdf to exist in same location
render_grid_pdf(
labels => ['One', 'Two', 'Three', 'Four'],
cell_width => 200,
cell_height => 50,
no_of_columns => 2,
no_of_rows => 2,
);
# Advanced usage
render_grid_pdf(
labels => ['One', 'Two', 'Three', 'Four'],
cell_width => 200,
cell_height => 50,
no_of_columns => 2,
no_of_rows => 2,
font_name => "Helvetica-Bold",
font_size => 12,
template => "blank.pdf",
save_as => "my_labels.pdf",
# Manually set coordinates to start printing
page_offset_x => 20, # Acts as a left margin
page_offset_y => 600,
);
sub render_grid_pdf {
my %args = @_;
# Print data
my $labels = $args{labels} || die "Labels required";
# Template, outfile and labels
my $template = $args{template} || "$FindBin::Bin/blank.pdf";
my $save_as = $args{save_as} || "$FindBin::Bin/out.pdf";
# Layout Properties
my $no_of_columns = $args{no_of_columns} || die "Number of columns required";
my $no_of_rows = $args{no_of_rows} || die "Number of rows required";
my $cell_width = $args{cell_width} || die "Cell width required";
my $cell_height = $args{cell_height} || die "Cell height required";
my $font_name = $args{font_name} || "Helvetica-Bold";
my $font_size = $args{font_size} || 12;
# Note: PDF::API2 uses Cartesian coordinates, 0,0 being
# bottom left. These offsets are used to set the print
# reference to top-left to make things easier to manage
my $page_offset_x = $args{page_offset_x} || 0;
my $page_offset_y = $args{page_offset_y} || $no_of_rows * $cell_height;
# Open an existing PDF file as a template
my $pdf = PDF::API2->open("$template");
# Add a built-in font to the PDF
my $font = $pdf->corefont($font_name);
my $page = $pdf->openpage(1);
# Add some text to the page
my $text = $page->text();
$text->font($font, $font_size);
# Print out labels
my $current_label = 0;
OUTERLOOP: for (my $row = 0; $row < $no_of_rows; $row++) {
for (my $column = 0; $column < $no_of_columns; $column++) {
# Calculate label x, y positions
my $label_y = $page_offset_y - $row * $cell_height;
my $label_x = $page_offset_x + $column * $cell_width;
# Print label
$text->translate( $label_x, $label_y );
$text->text( $labels->[$current_label]);
# Increment labels index
$current_label++;
# Exit condition
if ( $current_label >= scalar @{$labels}) {
last OUTERLOOP;
}
}
}
# Save the PDF
$pdf->saveas($save_as);
}
Great you have found an answer you like.
Another option, which may or may not have suited you, would be to prepare the sheet that will be printed as labels as an open/libreoffice document, with pictures, layout, non-variant text ... (and can do all your testing runs through open/libreoffice).
Then:
use OpenOffice::OODoc;
then: read your data from a database
then:
my $document = odfDocument( file => "$outputFilename",
create => "text",
template_path => $myTemplateDir );
then:
for (my $r = 0; $r < $NumOfTableRows; $r++ ) {
for (my $c = 0; $c < $NumOfTableCols; $c++) {
:
$document->cellValue($theTableName, $r, $c, $someText);
# test: was written properly ?
my $writtenTest = $document->cellValue($theTableName, $r, $c);
chomp $writtenTest;
if ($someText ne $writtenTest) {
:
}
}
}
then:
$document->save($outputFilename );
# save (convert to) a pdf
# -f format;
# -n no start new listener; use existing
# -T timeout to connect to its *OWN* listener (only);
# -e exportFilterOptions
`unoconv -f pdf -n -T 60 -e PageRange=1-2 $outputFilename `;
# I remove the open/libreoffice doc, else clutter and confusion
`rm $outputFilename `;
As a quick overview of the practical issues:
name the layout tables
place your nice, correct open/libreoffice doc in, say, "/usr/local/lib/site_perl/myNewTemplateDir/". You will need this directory (I think this is the default, but I pass it anyway as $myTemplateDir).
gotcha: the module's routines that wait for Open/Libreoffice to start (for the unoconv converter to start) do NOT work - unoconv will still not work after they say it will. I create dummy PDFs until one actually works - exists and has a non-zero size.
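For what it's worth, that last workaround could be sketched roughly like this (the retry count and sleep are arbitrary, and the PDF name is simply derived from the document name for illustration):
# keep converting until the PDF shows up with a non-zero size
(my $pdfFilename = $outputFilename) =~ s/\.\w+$/.pdf/;
for my $attempt (1 .. 10) {
    `unoconv -f pdf -n -T 60 -e PageRange=1-2 $outputFilename`;
    last if -s $pdfFilename;   # exists and is not empty
    sleep 2;
}
warn "conversion still failed: $pdfFilename\n" unless -s $pdfFilename;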

Perl Mojo::DOM to find and replace html blocks

Since everyone here advised on using the Perl module Mojo::DOM for this task, I am asking how to do it with it.
I have this html code in template:
some html content here top base
<!--block:first-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here bottom base
What I want to do (please do not suggest using template modules again) is to find the inner block first:
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
then replace it with some html code, then find the second block:
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
then replace it with some html code, then find the third block:
<!--block:first-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
I did not advise using Mojo::DOM for this task, as it's probably overkill, but ... you could.
The real answer is the one that I've already stated in other questions, and that is to use an already existing framework such as Template::Toolkit. It's powerful, well tested, and speedy since it allows for the caching of templates.
However, you desire to roll your own templating solution. Any such solution should include a parsing, validation, and execution phase. We're just going to be focusing on the first two steps as you've shared no real info on the last.
There is not going to be any real magic in Mojo::DOM. Its benefit and power is that it can fully and easily parse HTML, catching all of those potential edge cases. It will only be able to help with the parsing phase of templating, though, as it's your own rules that decide the validation. In fact, it basically just acts as a drop-in replacement for split in the earlier solution I provided to you. That's why it's probably too heavyweight a solution.
Because it's not hard to make the modifications, I've gone ahead and coded a full solution below. However, to make things more interesting, and to try to prove one of my greater points, it's time to share some Benchmark testing between the 3 available solutions:
Mojo::DOM for parsing as demonstrated below.
split for parsing as proposed by me in Match nested html comment blocks regex
recursive regex proposed by sln in Perl replace nested blocks regex
The below contains all three solutions:
use strict;
use warnings;
use Benchmark qw(:all);
use Mojo::DOM;
use Data::Dump qw(dump dd);
my $content = do {local $/; <DATA>};
#dd parse_using_mojo($content);
#dd parse_using_split($content);
#dd parse_using_regex($content);
timethese(100_000, {
'regex' => sub { parse_using_regex($content) },
'mojo' => sub { parse_using_mojo($content) },
'split' => sub { parse_using_split($content) },
});
sub parse_using_mojo {
my $content = shift;
my $dom = Mojo::DOM->new($content);
# Resulting Data Structure
my @data = ();
# Keep track of levels of content
# - This is a throwaway data structure to facilitate the building of nested content
my @levels = ( \@data );
for my $html ($dom->all_contents->each) {
if ($html->node eq 'comment') {
# Start of Block - Go up to new level
if ($html =~ m{^<!--\s*block:(.*)-->$}s) {
#print +(' ' x #levels) ."<$1>\n"; # For debugging
my $hash = {
block => $1,
content => [],
};
push @{$levels[-1]}, $hash;
push @levels, $hash->{content};
next;
# End of Block - Go down level
} elsif ($html =~ m{^<!--\s*endblock\s*-->$}) {
die "Error: Unmatched endblock found before " . dump($html) if #levels == 1;
pop #levels;
#print +(' ' x #levels) . "</$levels[-1][-1]{block}>\n"; # For debugging
next;
}
}
push @{$levels[-1]}, '' if !@{$levels[-1]} || ref $levels[-1][-1];
$levels[-1][-1] .= $html;
}
die "Error: Unmatched start block: $levels[-2][-1]{block}" if #levels > 1;
return \#data;
}
sub parse_using_split {
my $content = shift;
# Tokenize Content
my @tokens = split m{<!--\s*(?:block:(.*?)|(endblock))\s*-->}s, $content;
# Resulting Data Structure
my @data = (
shift @tokens, # First element of split is always HTML
);
# Keep track of levels of content
# - This is a throwaway data structure to facilitate the building of nested content
my @levels = ( \@data );
while (@tokens) {
# Tokens come in groups of 3. Two capture groups in split delimiter, followed by html.
my ($block, $endblock, $html) = splice @tokens, 0, 3;
# Start of Block - Go up to new level
if (defined $block) {
#print +(' ' x #levels) ."<$block>\n"; # For Debugging
my $hash = {
block => $block,
content => [],
};
push @{$levels[-1]}, $hash;
push @levels, $hash->{content};
# End of Block - Go down level
} elsif (defined $endblock) {
die "Error: Unmatched endblock found before " . dump($html) if #levels == 1;
pop #levels;
#print +(' ' x #levels) . "</$levels[-1][-1]{block}>\n"; # For Debugging
}
# Append HTML content
push @{$levels[-1]}, $html;
}
die "Error: Unmatched start block: $levels[-2][-1]{block}" if #levels > 1;
return \#data;
}
sub parse_using_regex {
my $content = shift;
my $href = {};
ParseCore( $href, $content );
return $href;
}
sub ParseCore
{
my ($aref, $core) = @_;
# Set the error mode on/off here ..
my $BailOnError = 1;
my $IsError = 0;
my ($k, $v);
while ( $core =~ /(?is)(?:((?&content))|(?><!--block:(.*?)-->)((?&core)|)<!--endblock-->|(<!--(?:block:.*?|endblock)-->))(?(DEFINE)(?<core>(?>(?&content)|(?><!--block:.*?-->)(?:(?&core)|)<!--endblock-->)+)(?<content>(?>(?!<!--(?:block:.*?|endblock)-->).)+))/g )
{
if (defined $1)
{
# CONTENT
$aref->{content} .= $1;
}
elsif (defined $2)
{
# CORE
$k = $2; $v = $3;
$aref->{$k} = {};
# $aref->{$k}->{content} = $v;
# $aref->{$k}->{match} = $&;
my $curraref = $aref->{$k};
my $ret = ParseCore($aref->{$k}, $v);
if ( $BailOnError && $IsError ) {
last;
}
if (defined $ret) {
$curraref->{'#next'} = $ret;
}
}
else
{
# ERRORS
print "Unbalanced '$4' at position = ", $-[0];
$IsError = 1;
# Decide to continue here ..
# If BailOnError is set, just unwind recursion.
# -------------------------------------------------
if ( $BailOnError ) {
last;
}
}
}
return $k;
}
__DATA__
some html content here top base
<!--block:first-->
<table border="1" style="color:red;">
<tr class="lines">
<td align="left" valign="<--valign-->">
<b>bold</b>mewsoft
<!--hello--> <--again--><!--world-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base
some html content here 6-8 top base
<!--block:six-->
some html content here 6 top
<!--block:seven-->
some html content here 7 top
<!--block:eight-->
some html content here 8a
some html content here 8b
<!--endblock-->
some html content here 7 bottom
<!--endblock-->
some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base
The results for the simple template with 3 nested blocks:
Benchmark: timing 100000 iterations of mojo, regex, split...
mojo: 50 wallclock secs (50.36 usr + 0.00 sys = 50.36 CPU) @ 1985.78/s (n=100000)
regex: 14 wallclock secs (13.42 usr + 0.00 sys = 13.42 CPU) @ 7453.79/s (n=100000)
split: 2 wallclock secs ( 2.70 usr + 0.00 sys = 2.70 CPU) @ 37050.76/s (n=100000)
Normalizing to regex at 100%, equates to mojo at 375%, and split at 20%.
And for the more complicated template included in the above code:
Benchmark: timing 100000 iterations of mojo, regex, split...
mojo: 237 wallclock secs (236.61 usr + 0.02 sys = 236.62 CPU) @ 422.61/s (n=100000)
regex: 46 wallclock secs (47.25 usr + 0.00 sys = 47.25 CPU) @ 2116.31/s (n=100000)
split: 7 wallclock secs ( 6.65 usr + 0.00 sys = 6.65 CPU) @ 15046.64/s (n=100000)
Normalizing to regex at 100%, equates to mojo at 501%, and split at 14%. (7 times as fast)
Does speed matter?
As is demonstrated above, we can see without question that my split solution is going to be faster than any of the other solutions thus far. This should not be a surprise. It's an extremely simple tool and therefore it's fast.
In truth though, speed doesn't really matter.
Why not? Well, because whatever data structure you build from parsing and validating the template can be cached and reloaded each time you want to execute a template, until a template changes.
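For example, one way to cache the parsed structure is with Storable, keyed off the template's modification time (this is just a sketch of the idea; the file names and the parser you plug in are assumptions):
use Storable qw(store retrieve);
# Reuse the cached parse unless the template file is newer than the cache
sub load_parsed_template {
    my ($template_file, $cache_file, $parser) = @_;
    if ( -e $cache_file && -M $cache_file < -M $template_file ) {
        return retrieve($cache_file);
    }
    my $content = do { local ( @ARGV, $/ ) = ($template_file); <> };
    my $data    = $parser->($content);   # e.g. \&parse_using_split
    store( $data, $cache_file );
    return $data;
}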
Final decisions
Because speed doesn't matter with caching, what you should focus on instead is how readable is the code, how fragile is it, how easily can it be extended and debugged, etc.
As much as I appreciate a well crafted regex, they tend to be fragile. Putting all of your parsing and validation logic into a single line of code is just asking for trouble.
That leaves either the split solution or mojo.
If you're caching like I described, you can actually choose either one without concern. The code I provided for each is essentially the same with slight variations, so it gets to be personal preference. Even the fact that split is 20-35 times faster at the initial parsing matters less than whether the code is more maintainable by using an actual HTML parser.
Good luck choosing your final approach. I still have my fingers crossed you'll go with TT some day, but you'll pick your own poison :)

Extracting Yahoo Financial Corporate Bond Data using PERL to mysql

I would like to extract the profile information for each of the lines listed on the following table for all the multiple pages:
http://reports.finance.yahoo.com/z1?b=1&so=a&sf=m&tc=1&stt=-&pr=0&cpl=-1&cpu=-1&yl=-1&yu=-1&ytl=-1&ytu=-1&mtl=-1&mtu=-1&rl=5&ru=-1&cll=0
Here is sample of one of the links to one of the lines that was listed on the table (which are all in the "Issue" column):
http://reports.finance.yahoo.com/z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
I'd like to store all the information contained for each Issue for all the lines and pages in a MySQL database. I assume Perl would be a good tool to use for this, but my experience with it is very limited.
I think I would need to gather all the links in the issue column for all the pages of the table (which was 2600+ pages at the time), and somehow extract the information from each of those pages from the links.
Any help would be greatly appreciated.
This will let you get started in some way and show you a general technique of doing this with regexes (which could be hard to understand if you are not very familiar with perl and regex matching).
I did it for the first page only and I did put as many comments in my code as possible to help you understand it. If you are not able to understand what this code actually does I would suggest trying to use a different tool (or maybe try a module like Web::Scraper or Mojo::DOM).
Read some perl docs if you would really like to get your job done in perl...
http://perldoc.perl.org/perlre.html
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use feature 'say';
my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';
my $page_content = get($start_url);
die "Oops, something went wrong!" unless defined $page_content;
process_bond_results_page($page_content);
sub process_bond_results_page {
my $content = shift;
# iterates $content as long as /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g regex matches
# puts row content (content between <tr...>(...)</tr> in a special $1 variable)
while($content =~ /<tr class=\"yfnc_tabledata1\">(.+?)<\/tr>/g) {
# uncomment line below to see what $1 contains
# say $1;
# cleanup not needed HTML tags
my $tr_data = cleanup_html_tags($1);
# match content in between <td> & </td> tags and put them on @tds list
my (@tds) = $tr_data =~ /<td>(.*?)<\/td>/g;
# 2nd element of @tds list contains ISSUE NAME text
# Line below extracts link_to_issue and $issue_name and assigns them to respective variables
my ($link_to_issue, $issue_name) = $tds[1] =~ /<a[^>]*?href=\"([^\"]*?)\"[^>]*?>(.+?)<\/a>/g;
# Replace 2nd element of list that contains data like ISSUE NAME
# with just ISSUE NAME
$tds[1] = $issue_name;
# Append $link_to_issue at the end of @tds list
push(@tds,$link_to_issue);
# Print @tds array with values separated by TABs
say join("\t", @tds);
}
# Does it have Next link?
my ($next_link) = $content =~ /<a[^>]*?href=\"([^\"]+?)\">Next<\/a><\/b>/g;
say 'NEXT: ' . $next_link if $next_link;
return;
}
sub cleanup_html_tags {
my $html = shift;
$html =~ s/<\/?(font|div)[^>]*?>//g; # remove <font...>, <div...>, </font>, </div>
$html =~ s/<td[^>]*?>/<td>/g; # replace all <td...> with just <td>
$html =~ s/<\/?nobr>//g; # remove <nobr> and </nobr>
return $html;
}
Above will print:
Corp MERRILL LYNCH CO INC MTN BE 100.63 5.000 3-Feb-2014 -19.649 4.969 A No /z2?ce=5314754150501796218050&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp CME GROUP INC 100.84 5.750 15-Feb-2014 -8.334 5.702 AA No /z2?ce=5715449144561716016149&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp CAPITAL ONE BK MTN BE 100.80 5.125 15-Feb-2014 -8.334 5.084 A No /z2?ce=5715254147581635317455&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HESS CORP 100.92 7.000 15-Feb-2014 -8.351 6.937 BBB No /z2?ce=5415446151491606016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp PACCAR INC 100.90 6.875 15-Feb-2014 -8.295 6.813 A No /z2?ce=5214751144551836016451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp WACHOVIA CORP NEW 100.78 4.875 15-Feb-2014 -8.337 4.837 A No /z2?ce=4915445142581546016054&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp CATERPILLAR FINL SVCS MTNS BE 100.89 6.125 17-Feb-2014 -7.597 6.071 A No /z2?ce=5715245150561764615951&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp KRAFT FOODS INC 100.97 6.750 19-Feb-2014 -6.921 6.685 BBB No /z2?ce=5315654144531746017754&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp WESTERN UN CO 101.05 6.500 26-Feb-2014 -5.154 6.432 BBB No /z2?ce=4915145143581556015548&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp AMERICA MOVIL SAB DE CV 101.06 5.500 1-Mar-2014 -4.615 5.443 A No /z2?ce=5815451145541816015954&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HARTFORD FINL SVCS GROUP INC 100.96 4.750 1-Mar-2014 -4.454 4.705 BBB No /z2?ce=5415548146571526017250&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HEWLETT PACKARD CO 101.12 6.125 1-Mar-2014 -4.599 6.057 BBB No /z2?ce=5415446149551516016556&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp RYDER SYS MTN BE 101.08 5.850 1-Mar-2014 -4.495 5.788 BBB No /z2?ce=5114851146531605117352&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp HSBC FIN CORP HSBC FIN 100.72 2.000 15-Mar-2014 -3.011 1.986 A No /z2?ce=5415650149491807117451&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
Corp SYSCO CORP 101.06 4.600 15-Mar-2014 -2.772 4.552 A No /z2?ce=5014953143561486015756&q=b%3d1%26cll%3d0%26cpl%3d-1.000000%26cpu%3d-1.000000%26mtl%3d-1%26mtu%3d-1%26pr%3d0%26rl%3d5%26ru%3d-1%26sf%3dm%26so%3da%26stt%3d-%26tc%3d1%26yl%3d-1.000000%26ytl%3d-1.000000%26ytu%3d-1.000000%26yu%3d-1.000000
NEXT: z1?b=2&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000
Since user3195726 suggested it, here it is using Mojo::UserAgent and Mojo::DOM:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
my $start_url = 'http://reports.finance.yahoo.com/z1?b=1&cll=0&cpl=-1.000000&cpu=-1.000000&mtl=-1&mtu=-1&pr=0&rl=5&ru=-1&sf=m&so=a&stt=-&tc=1&yl=-1.000000&ytl=-1.000000&ytu=-1.000000&yu=-1.000000';
my $dom = Mojo::UserAgent->new->get($start_url)->res->dom;
$dom->find('tr.yfnc_tabledata1')->each(sub{
my $tds = $_->find('td');
my $anchor = $tds->[1]->at('a');
my $link = $anchor->{href};
my $name = $anchor->all_text;
$tds = $tds->map('all_text');
$tds->[1] = $name;
push @$tds, $link;
say $tds->join("\t");
});
say 'Next: ' . $dom->find('a')->first(sub{ $_->all_text eq 'Next'})->{href};
The finds all use CSS3 selectors; the rest is just transforms.
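Paging through the remaining 2600+ pages would just mean repeating that for each Next link until there isn't one; a rough, untested sketch of the outer loop:
use Mojo::URL;
my $ua  = Mojo::UserAgent->new;
my $url = Mojo::URL->new($start_url);
while ($url) {
    my $dom = $ua->get($url)->res->dom;
    # ... extract the tr.yfnc_tabledata1 rows here, as shown above ...
    my $next = $dom->find('a')->first(sub { $_->all_text eq 'Next' });
    # the Next href is relative, so resolve it against the current URL
    $url = $next ? Mojo::URL->new( $next->{href} )->to_abs($url) : undef;
}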

How to create dynamically sized HTML table?

I have a Perl CGI script. I would like to make a dynamic, appropriately-sized table based on query information from a simple HTML form: http://jsfiddle.net/wBgBZ/4/. I wanted to use HTML::Table but the server doesn't have the module installed. The administrator won't install it either. Therefore, I have to do it the old fashion way.
Here's what I have so far.
#!/usr/bin/perl
use strict; use warnings;
use CGI qw( :standard);
print header;
print start_html(
-title => 'Creating Tables'
);
# Process an HTTP request
my $query = param("names");
my @students_in_class = split(/;/, $query);
my %attributes = (
'Tommy' => 'A star baseball player who has lots of potential to play in the Major League of Baseball. ',
'Tyrone' => 'An honor roll athlete. His father is really proud of him. When he graduates, he wants to work at the National Institute for Public Health. His father wants him to become a doctor but he wants to pursue Physics.',
'Marshall' => 'A professional WWE wrestler.',
);
print table({-border=> undef},
caption('Students in the class'),
Tr({-align=>'CENTER',-valign=>'TOP'},
[
th(['Student', 'List of Attributes']),
foreach (@students_in_class){ # !!!!! problem line !!!!!!
td(['$_' , '$attributes{$}']),
}
]
)
);
Such that if the user enters the following into the search bar: Tyrone;Tommy;Marshall
the CGI should produces something similar to the following
Desired Output
http://jsfiddle.net/PrLvU/
If the user enters just Marshall;Tommy, the table should be 3x2.
It doesn't work. I need a way to dynamically add rows to the table.
This is untested, but I think this is what you are wanting. You may need to change some of the table attributes to your desired needs.
use strict;
use warnings;
use CGI qw( :standard );
print header,
start_html(-title => 'Creating Tables');
my $query = param('names');
my @headers;
my @students = split(/;/, $query);
my %attributes = (
Tommy => 'A star baseball player.',
Tyrone => 'An honor roll athlete.',
Marshall => 'A professional WWE wrestler.',
);
$headers[0] = Tr(th('Student'), th('List of Attributes'));
for my $i (@students) {
push @headers, Tr( td($i), td($attributes{$i}));
}
print table( {-border => undef}, @headers );
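If you prefer building the rows inline, the same table can be produced with map instead of the loop (just a stylistic variation on the code above):
print table(
    {-border => undef},
    Tr( th('Student'), th('List of Attributes') ),
    map { Tr( td($_), td( $attributes{$_} || '' ) ) } @students,
);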