Get numbers, a given number of characters after a phrase, from HTML

Basically, I've opened an HTML file in Perl and wrote these lines:
if(INFILE =~ \$txt_TeamNumber\) {
$teamNumber = \$txt_TeamNumber\
}
and I need to find txt_TeamNumber, go 21 characters forward, and grab the next 1-5 digits. Here is the part of the HTML file I'm trying to extract info from:
<td style="width: 25%;">Team Number:
</td>
<td style="width: 75%;">
<input name="ctl00$ContentPlaceHolder1$txt_TeamNumber" type="text" value="186" maxlength="5" readonly="readonly" id="ctl00_ContentPlaceHolder1_txt_TeamNumber" disabled="disabled" tabindex="1" class="aspNetDisabled" style="width:53px;">
</td>

This is a very good example of the benefits of using a ready-made parser.
One of the standard modules for parsing HTML is HTML::TreeBuilder. Its effectiveness rests to a good extent on HTML::Element, so always keep that page handy for reference.
The question doesn't say where the HTML comes from. For testing I put it in a file, wrapped it in the needed tags, and load it from that file. If it actually comes from the internet, change that part accordingly.
use warnings;
use strict;
use Path::Tiny;
use HTML::TreeBuilder;
my $file = "snippet.html";
my $html = path($file)->slurp; # or open and slurp by hand
my $tree = HTML::TreeBuilder->new_from_content($html);
my @nodes = $tree->look_down(_tag => 'input');
foreach my $node (@nodes) {
    # skip inputs whose name doesn't match, to avoid calling attr() on undef
    my $target = $node->look_down('name', qr/\$txt_TeamNumber/) or next;
    my $val = $target->attr('value');
    print "'value': $val\n";
}
This prints the line: 'value': 186. Note that we never have to pick the text apart by hand.
I assume that the 'name' attribute contains the literal string $txt_TeamNumber, which is why the $ is escaped in the pattern.
The code uses the excellent Path::Tiny to slurp the file. If there are issues with installing a module, just read the file into a string by hand (if it does come from a file and not from the internet).
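For example, a plain-Perl slurp without extra modules could look like this (a small sketch, reusing the $file name from above):
my $html = do {
    open my $fh, '<', $file or die "Can't open $file: $!";
    local $/;                    # slurp mode: read the whole file at once
    <$fh>;
};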
See the docs and the abundant other examples for the full utility of the HTML parsing modules used above. There are of course other ways and approaches, made ready for use by yet other good modules; search around for the right tool.
I strongly suggest staying clear of any idea of parsing HTML (or anything similar) with regex.

Watch for variable scoping. You should be able to get it with a simple regexp capture:
if (INFILE =~ /$txt_TeamNumber/) {
    $teamNumber = /$txt_TeamNumber/;
    ($value) = /$txt_TeamNumber.*?value="(.*?)"/;
}
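Put together, a minimal working version of that idea might look like the sketch below. It assumes the HTML sits in snippet.html and that value="..." appears on the same line as the txt_TeamNumber name, which is exactly the fragility the parser-based answer avoids:
use strict;
use warnings;

my $teamNumber;
open my $in, '<', 'snippet.html' or die "Can't open snippet.html: $!";
while (my $line = <$in>) {
    # capture the 1-5 digits from value="..." after the txt_TeamNumber name
    if ($line =~ /txt_TeamNumber.*?value="(\d{1,5})"/) {
        $teamNumber = $1;
        last;
    }
}
close $in;
print "Team number: $teamNumber\n" if defined $teamNumber;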

Related

PERL/CGI- gets more than text from input textarea

I'm not at all familiar with Perl, but have some understanding of HTML. I'm currently trying to configure code from an online program that processes text inputted by the user to calculate and output a few important numbers, in order to do the same for a large number of files containing text in a local directory. The problem lies in my lack of understanding of how or why the code from the site is splitting the inputted text by looking for & and =, as the inputted text never contains these characters, and neither do my files. Here's some of the code from the online program:
if ($ENV{'REQUEST_METHOD'} ne "POST") {
&error('Error','Use Standard Input by METHOD=POST');
}
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
if ($buffer eq '') {
&error('Error','Can not execute directly','Check the usage');
}
$ref = $ENV{'HTTP_REFERER'};
@pairs = split(/&/,$buffer);
foreach $pair (@pairs) {
($name,$value) = split(/=/,$pair);
if ($name eq "ATOMS") { $atoms = $value; }
It then uses these "pairs" to appropriately calculate the required numbers. The input from the user is simply a textarea named "ATOMS", and the form action is the cgi script:
<form method=POST action="/path/to/the/cgi/file.cgi">
<textarea name="ATOMS" rows=20 cols=80></textarea>
</form>
I've left out the less important details of both the HTML and Perl code. So far all I've been able to do is get the content of all files in a given directory as text, but when I feed this to the script that uses the text from the textarea to calculate the values (in place of the variable $buffer), it doesn't work, which I suspect is due to the split calls, which cannot find the & and = symbols. How does the code get these symbols from the online script, and how can I implement that for my local files? Let me know if any additional information is needed, and thanks in advance!
The encoding scheme forms use (by default) to POST data over HTTP consists of key=value pairs (hence the =) which are separated by & characters.
The latter doesn't much matter for your form since it has only one control in it.
This is described pretty succinctly in the HTML 4 specification and in more detail in the HTML 5 specification.
If you aren't dealing with data from a form, you should remove all the form decoding code.
Not sure where you got that code from, but it's prehistoric (from the last millennium).
You should be able to replace it all with:
use CGI ':cgi';
my $atoms = param('ATOMS');
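For your local files there is no form submission at all, so you can drop the decoding entirely and hand each file's contents straight to the calculation. A rough sketch, where the directory name data and the .txt extension are just placeholders:
use strict;
use warnings;

for my $path (glob "data/*.txt") {
    open my $fh, '<', $path or die "Can't open $path: $!";
    my $atoms = do { local $/; <$fh> };   # whole file; no & or = decoding needed
    close $fh;
    # ... run the same calculation the CGI script performed on $atoms ...
}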

How to split up long HTML into "functions"

When writing a non-trivial static HTML page, the large-scale structure gets very hard to see, with the fine structure and the content all mixed into it. Scrolling several pages to find the closing tag that matches a given opening tag is frustrating. In general it feels messy, awkward, and hard to maintain... like writing a large program with no functions.
Of course, when I write a large program, I break it up hierarchically into smaller functions. Is there any way to do this with a large HTML file?
Basically I'm looking for a template system, where the content to be inserted into the template is just more HTML that's optionally (here's the important part) located in the same file.
That is, I want to be able to do something like what is suggested by this hypothetical syntax:
<html>
<head>{{head}}</head>
<body>
<div class="header">{{header}}</div>
<div class="navbar">{{navbar}}</div>
<div class="content">{{content}}</div>
<div class="footer">{{footer}}</div>
</body>
</html>
{{head}} =
<title>Hello World!</title>
{{styles}}
{{scripts}}
{{styles}} =
<link rel="stylesheet" type="text/css" href="style.css">
{{navbar}} =
...
...
... and so on...
Then presumably there would be a simple way to "compile" this to make a standard HTML file.
Are there any tools out there to allow writing HTML this way?
Most template engines require each include to be a separate file, which isn't useful.
UPDATE: Gnu M4 seems to do exactly the sort of thing I'm looking for, but with a few caveats:
The macro definitions have to appear before they are used, when I'd rather they be after.
M4's syntax mixes very awkwardly with HTML. Since the file is no longer HTML, it can't be easily syntax checked for errors. The M4 processor is very forgiving and flexible, making errors in M4 files hard to find sometimes - the parser won't complain, or even notice, when what you wrote means something other than what you probably meant.
There's no way to get properly indented HTML out, making the output an unreadable mess. (Since production HTML might be minified anyway, that's not a major issue, and it can always be run through a formatter if it needs to be readable.)
This will parse your template example and do what you want.
perl -E 'my $pre.=join("",<>); my ($body,%h)=split(/^\{\{(\w+)\}\}\s*=\s*$/m, $pre); while ($body =~ s/\{\{(\w+)\}\}/$h{$1}/ge) { if ($rec++>200) {die("Max recursion (200)!")}};$body =~ s/({{)-/$1/sg; $body =~ s/-(}})/$1/sg; print $body' htmlfiletoparse.html
And here's the script version.
Save it as a file, e.g. joshTplEngine.pl ;)
#!/usr/bin/perl
## get/read lines from files. Multiple input files are supported
my $pre.=join("",<>);
## split files to body and variables %h = hash
## variable is [beginning of the line]{{somestring}} = [newline]
my ($body,%h)=split(/^\{\{# split on variable line and
(\w+) ## save name
\}\}
\s*=\s*$/xm, $pre);
## replace recursively all variables defined as {{somestring}}
while ($body =~ s/
\{\{
(\w+) ## name of the variable
\}\}
/ ##
$h{$1} ## all variables have been read into hash %h and $1 contains the variable name from the match
/gxe) {
## check for number of recursions, limit to 200
if ($rec++>200) {
die("Max recursion (200)!")
}
}
## replace {{- to {{ and -}} to }}
$body =~ s/({{)-/$1/sg;
$body =~ s/-(}})/$1/sg;
## end, print
print $body;
Usage:
joshTplEngine.pl /some/html/file.html [/some/other/file] | tee /result/dir/out.html
I hope this little snippet of Perl helps you get your templating done.

Parse HTML Page For Links With Regex Using Perl [duplicate]

Possible Duplicate:
How can I remove external links from HTML using Perl?
Alright, I'm working on a job for a client right now who just switched his language choice to Perl. I'm not the best at Perl, but I've done stuff like this before with it, albeit a while ago.
There are lots of links like this:
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
(1992)</a>
I want to match the path "/en/subtitles/3586224/death-becomes-her-en" and put those paths into an array or list (not sure which one is better in Perl). I've been searching the Perl docs, as well as looking at regex tutorials, and most if not all seemed geared towards using =~ to match stuff rather than to capture matches.
Thanks,
Cody
Use a proper HTML parser to parse HTML. See the examples included with the HTML::Parser distribution.
Or, consider the following simple example:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
my @hrefs;
while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        push @hrefs, $href if $href =~ m!/en/subtitles/!;
    }
}
print "$_\n" for @hrefs;
__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');"
class="bnone">Death Becomes Her
(1992)</a>
Output:
/en/subtitles/3586224/death-becomes-her-en
Don't use regexes. Use an HTML parser like HTML::TreeBuilder.
use HTML::TreeBuilder;

my @links;
my $tree = HTML::TreeBuilder->new;   # empty tree
$tree->parse_file($file_name);
$tree->elementify;
@links = map { $_->attr('href') } $tree->look_down( _tag => 'a' );
$tree = $tree->delete;
# Do stuff with the @links array
URLs like the one in your example can be matched with a regular expression like
($url) = /href=\"([^\"]+)\"/i
If the HTML ever uses single quotes (or no quotes) around a URL, or if there are ever quote characters in the URL, then this will not work quite right. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them but carry on if you are confident that the input will be well behaved.
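If you do stick with a regex and want to tolerate single quotes or unquoted values as well, something a bit more defensive could be used; this is only a sketch, not a replacement for a parser:
# match href='...', href="..." or href=bare-value
if ( /href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i ) {
    my $url = defined $1 ? $1 : defined $2 ? $2 : $3;
}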

I'm new to Perl and have a few regex questions

I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html:
<dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
<dd>
<p>
[Content]
</p>
</dd>
... and so on.
and here's the example script I'm studying:
#!/usr/bin/perl -w
use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
$rss->channel(title => "The more accurate diary. Really.",
link => $url,
description => "Telsa's diary of life with a hacker:"
. " the current ramblings");
foreach (split ('<dt>', $page))
{
if (/<a\sname="
([^"]*) # Anchor name
">
<strong>
([^>]*) # Post title
<\/strong><\/a><\/dt>\s*<dd>
(.*) # Body of post
<\/dd>/six)
{
$rss->add_item(title => $2,
link => "$url#$1",
description => encode_entities($3));
}
}
If you have a moment to better help me understand, my questions are:
how does the following line work:
([^"]*) # Anchor name
how does the following line work:
([^>]*) # Post title
what does the "six" mean in the following line:
</dd>/six)
Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!
how does the following line work...
([^"]*) # Anchor name
zero or more things which aren't ", captured as $1, $2, or whatever, depending on the number of brackets ( in we are.
how does the following line work...
([^>]*) # Post title
zero or more things which aren't >, captured as $1, $2, or whatever.
what does the "six" mean in the
following line...
</dd>/six)
s = match as single line (this just means that "." matches everything, including \n, which it would not do otherwise)
i = match case insensitive
x = ignore whitespace in regex.
x also makes it possible to put comments into the regex itself, so the things like # Post title there are just comments.
See perldoc perlre for more / better information. The link is for Perl 5.10. If you don't have Perl 5.10 you should look at the perlre document for your version of Perl instead.
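A tiny, self-contained illustration of those three modifiers (just a sketch to make the flags concrete):
my $text = "<strong>October\n25th</strong>";

# /i lets <STRONG> match <strong>; /x allows the spacing and the comment;
# /s lets . cross the newline, so the capture spans both lines
if ( $text =~ / <STRONG> (.*) <\/strong>   # captured as $1
              /six ) {
    print "matched: $1\n";
}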
[^"]* means "any string of zero or more characters that doesn't contain a quotation mark". This is surrounded by quotes making forming a quoted string, the kind that follows <a name=
[^>]* is similar to the above, it means any string that doesn't contain >. Note here that you probably mean [^<], to match until the opening < for the next tag, not including the actual opening.
that's a collection of php specific regexp flags. I know i means case insensitive, not sure about the rest.
The code is an extended regex (the /x modifier), which allows you to put whitespace and comments in your regexes. See perldoc perlre and perlretut. Otherwise it behaves like a normal regex.
Same as above.
The characters are regex modifiers.

How can I extract URL and link text from HTML in Perl?

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.
If the page contained these links:
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
The output would be:
Google, http://www.google.com
Apple, http://www.apple.com
What is the best way to do this in Perl?
Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work with lists of URLs.
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}
Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
Mech is basically a browser in an object.
Have a look at HTML::LinkExtractor and HTML::LinkExtor, part of the HTML::Parser package.
HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
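For comparison, a minimal HTML::LinkExtor sketch, assuming the page is already in $html; note that it only hands back URLs, not the link text:
use HTML::LinkExtor;

my $p = HTML::LinkExtor->new;   # no callback: collect links, read them via ->links
$p->parse($html);
$p->eof;

for my $link ( $p->links ) {
    my ($tag, %attrs) = @$link;
    next unless $tag eq 'a' && $attrs{href};
    print "$attrs{href}\n";     # URL only; the anchor text is not available here
}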
If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):
#!/usr/bin/perl
if($#ARGV < 0) {
print "$0: Need URL argument.\n";
exit 1;
}
my @content = split(/\n/,`wget -qO- $ARGV[0]`);
my @links = grep(/<a.*href=.*>/,@content);
foreach my $c (@links){
$c =~ /<a.*href="([\s\S]+?)".*>/;
$link = $1;
$c =~ /<a.*href.*>([\s\S]+?)<\/a>/;
$title = $1;
print "$title, $link\n";
}
There are likely a few things I did wrong here, but it worked in the handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc.).
I like using pQuery for things like this...
use 5.010;    # for say()
use pQuery;

pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);
Also check out this previous stackoverflow.com question, Emulation of lex-like functionality in Perl or Python, for similar answers.
Another way to do this is to use XPath to query the parsed HTML. It is needed in complex cases, like extracting all links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this.
my $tree=HTML::TreeBuilder::XPath->new_from_content($c);
my $nodes=$tree->findnodes(q{//map[@name='map1']/area});
while (my $node=$nodes->shift) {
my $t=$node->attr('title');
}
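For the complex case mentioned above (all links inside a div with a specific class), the XPath could look like the sketch below; the class name page-content and the $html variable are placeholders:
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# every <a href="..."> inside <div class="page-content">
for my $a ( $tree->findnodes(q{//div[@class='page-content']//a[@href]}) ) {
    printf "%s -> %s\n", $a->as_text, $a->attr('href');
}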
Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.
Andy recommended WWW::Mechanize. That's probably the best solution.
If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.
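If you do go the HTML::TreeBuilder route, a short sketch of the search-the-tree idea could look like this, assuming the page is already in $html:
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content($html);

for my $a ( $tree->look_down(_tag => 'a', href => qr/./) ) {
    # the link text is simply the text content of the <a> element
    printf "%s, %s\n", $a->as_trimmed_text, $a->attr('href');
}

$tree->delete;   # free the tree when done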
Previous answers were perfectly good and I know I’m late to the party but this got bumped in the [perl] feed so…
XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set the recover option when parsing badly formed HTML.
use XML::LibXML;
my $doc = XML::LibXML->load_html(IO => \*DATA);
for my $anchor ( $doc->findnodes("//a[\@href]") )
{
printf "%15s -> %s\n",
$anchor->textContent,
$anchor->getAttribute("href");
}
__DATA__
<html><head><title/></head><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>
–yields–
Google -> http://www.google.com
Apple -> http://www.apple.com
HTML::LinkExtractor is better than HTML::LinkExtor.
It can give both the link text and the URL.
Usage:
use HTML::LinkExtractor;
my $input = q{If <a href="http://www.apple.com/">Apple</a> }; # HTML string
my $LX = new HTML::LinkExtractor(undef,undef,1);
$LX->parse(\$input);
for my $Link ( @{ $LX->links } ) {
if( $$Link{_TEXT}=~ m/Apple/ ) {
print "\n LinkText $$Link{_TEXT} URL $$Link{href}\n";
}
}
HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.
We can use a regular expression to extract the link along with its link text. This is another way to do it.
local $/ = '';
my $a = <DATA>;
while( $a =~ m/<a[^>]*?href=\"([^>]*?)\"[^>]*?>\s*([\w\W]*?)\s*<\/a>/igs )
{
print "Link:$1 \t Text: $2\n";
}
__DATA__
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>