What type of perl is this? - html

I need to debug some code that looks like the following. However, I don't know what language it's in. It appears to be a combination of perl and html. Could someone please let me know what exactly this is, so that I can do future research?
//All in the same file
<%doc>
# Something
# Something
</%doc>
<%args>
$id => undef
$debug => undef
$other => under
</%args>
<%perl>
Code that appears to be perl code
</%perl>
<!DOCTYPE html>
<html>
HTML that also appears to have code (maybe perl code) inside <%these types of %> brackets
as well as <& these types &>
</html>
Can someone please explain exactly what this is? Is it Perl, or is it HTML, or some combination of the two? And if it's the latter, is this how you reference the Perl code from within HTML: <% foo %> <& bar &>
Sorry for the confusing question. I'd be happy to provide more details.

It appears to be a template for HTML::Mason.
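For orientation, here is a minimal sketch of what each delimiter from the question does in an HTML::Mason 1.x component. The component path /footer.mas and the variable names are made up for illustration; they are not from the code in the question.

```mason
<%doc>
Documentation only; Mason discards this block.
</%doc>
<%args>
$name => 'world'
</%args>
<%perl>
# Plain Perl, run each time the component is rendered.
my $greeting = ucfirst 'hello';
</%perl>
<!DOCTYPE html>
<html>
<%# The next line interpolates Perl expressions; |h HTML-escapes the value. %>
<p><% $greeting %>, <% $name |h %>!</p>
% if ($name eq 'world') {
<p>A line starting with % is a single line of Perl.</p>
% }
<%# The next line calls another component and inserts its output here. %>
<& /footer.mas, year => 2024 &>
</html>
```

So the file is HTML with embedded Perl: <% ... %> prints the value of a Perl expression, and <& ... &> calls another Mason component.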

Related

Get numbers, a given number of characters after a phrase, from HTML

Basically, I've opened an HTML file in perl, and wrote this line:
if(INFILE =~ \$txt_TeamNumber\) {
$teamNumber = \$txt_TeamNumber\
}
and I need to get the txt_TeamNumber, go 21 spaces forward, and get the next 1-5 numbers. Here is the part of the HTML file I'm trying to extract info from:
<td style="width: 25%;">Team Number:
</td>
<td style="width: 75%;">
<input name="ctl00$ContentPlaceHolder1$txt_TeamNumber" type="text" value="186" maxlength="5" readonly="readonly" id="ctl00_ContentPlaceHolder1_txt_TeamNumber" disabled="disabled" tabindex="1" class="aspNetDisabled" style="width:53px;">
</td>
This is a very good example of the benefits of using a ready-made parser.
One of the standard modules for parsing HTML is HTML::TreeBuilder. Much of its effectiveness comes from its good use of HTML::Element, so always have that page ready for reference.
The question doesn't say where the HTML comes from. For testing I put it in a file, wrapped with the needed tags, and load it from that file. I expect it really comes from the internet; please change accordingly.
use warnings;
use strict;
use Path::Tiny;
use HTML::TreeBuilder;
my $file = "snippet.html";
my $html = path($file)->slurp; # or open and slurp by hand
my $tree = HTML::TreeBuilder->new_from_content($html);
# Find <input> elements whose name contains the literal text $txt_TeamNumber
my @nodes = $tree->look_down(_tag => 'input', name => qr/\$txt_TeamNumber/);
foreach my $node (@nodes) {
    my $val = $node->attr('value');
    print "'value': $val\n";
}
This prints the line: 'value': 186. Note that we never have to parse anything ourselves.
I assume that the 'name' attribute is identified by literal $txt_TeamNumber, thus $ is escaped.
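As a small illustration of that escaping, here is a sketch using only core Perl. The names matches_both and $needle are made up for this example. Both \$ and \Q...\E make the pattern match a literal dollar sign instead of interpolating a variable:

```perl
use strict;
use warnings;

# The attribute value from the question; single quotes keep the $ signs literal.
my $name = 'ctl00$ContentPlaceHolder1$txt_TeamNumber';

# \$ matches a literal dollar sign. Unescaped, Perl would try to interpolate
# a variable called $txt_TeamNumber into the pattern.
my $escaped = qr/\$txt_TeamNumber/;

# \Q...\E (quotemeta) builds the same kind of pattern from a plain string.
my $needle = '$txt_TeamNumber';
my $quoted = qr/\Q$needle\E/;

# Hypothetical helper: true only if both patterns find the literal text.
sub matches_both {
    my ($string) = @_;
    return ($string =~ $escaped && $string =~ $quoted) ? 1 : 0;
}

print matches_both($name), "\n";                        # prints 1
print matches_both('ctl00_ContentPlaceHolder1'), "\n";  # prints 0
```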
The code uses the excellent Path::Tiny to slurp the file. If installing a module is a problem, just read the file into a string by hand (if it does come from a file and not from the internet).
See the docs and the many other examples for the full utility of the HTML parsing modules used above. There are of course other approaches, made ready for use by yet other good modules. Please search for the right tool.
I strongly suggest staying clear of any idea of parsing HTML (or anything similar) with a regex.
Watch for variable scoping. You should be able to get it with a simple regexp capture:
my $teamNumber;
while (my $line = <INFILE>) {
    # \$ matches a literal dollar sign instead of interpolating a variable
    if ($line =~ /\$txt_TeamNumber.*?value="(.*?)"/) {
        $teamNumber = $1;
    }
}

what is the purpose of angle bracket percent in html

I just wanted to know what <% these %> do exactly. I've used them for exporting some HTML tables and data to Excel, but I don't really know what their function is.
Any answers are appreciated.
So when I use these in the code below, am I actually using ASP?
<body>
<%
String exportToExcel = request.getParameter("exportToExcel");
if (exportToExcel != null && exportToExcel.toString().equalsIgnoreCase("YES")) { //application/vnd.ms-excel
response.setContentType("application/vnd.ms-excel"); //application/vnd.opentextformatsofficedocument.spreadsheetml.sheet
response.setHeader("Content-Disposition", "inline; filename=" + "whatever.xls");
}
%>
i got it from http://www.quicklyjava.com/export-web-page-to-word
Answered over here: Name for Angle Bracket Percent Sign. Which then links to another answer.
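In short, they delimit code blocks that are executed on the server when the page is rendered. ASP, ASP.NET and JSP all use <% ... %> for this. The code in your example is Java (request.getParameter, response.setContentType), so it is a JSP scriptlet rather than ASP.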
More information: here
EDIT: As others have commented, I found all this from a quick search.

I'm new to Perl and have a few regex questions

I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html:
<dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
<dd>
<p>
[Content]
</p>
</dd>
... and so on.
and here's the example script I'm studying:
#!/usr/bin/perl -w
use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
$rss->channel(title => "The more accurate diary. Really.",
link => $url,
description => "Telsa's diary of life with a hacker:"
. " the current ramblings");
foreach (split ('<dt>', $page))
{
if (/<a\sname="
([^"]*) # Anchor name
">
<strong>
([^>]*) # Post title
<\/strong><\/a><\/dt>\s*<dd>
(.*) # Body of post
<\/dd>/six)
{
$rss->add_item(title => $2,
link => "$url#$1",
description => encode_entities($3));
}
}
If you have a moment to better help me understand, my questions are:
how does the following line work:
([^"]*) # Anchor name
how does the following line work:
([^>]*) # Post title
what does the "six" mean in the following line:
</dd>/six)
Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!
how does the following line work...
([^"]*) # Anchor name
Zero or more characters that aren't ", captured as $1, $2, and so on, depending on how many opening parentheses ( precede this one in the pattern.
how does the following line work...
([^>]*) # Post title
zero or more things which aren't >, captured as $1, $2, or whatever.
what does the "six" mean in the
following line...
</dd>/six)
s = match as single line (this just means that "." matches everything, including \n, which it would not do otherwise)
i = match case insensitive
x = ignore whitespace in regex.
x also makes it possible to put comments into the regex itself, so the things like # Post title there are just comments.
See perldoc perlre for more / better information. The link is for Perl 5.10. If you don't have Perl 5.10 you should look at the perlre document for your version of Perl instead.
[^"]* means "any string of zero or more characters that doesn't contain a quotation mark". In the pattern it sits between two literal quotes, so together they match a quoted string, the kind that follows <a name=
[^>]* is similar to the above: it means any string that doesn't contain >. Note that you probably mean [^<] here, to match up to (but not including) the opening < of the next tag.
That's a collection of regexp modifier flags; they are Perl's own, not PHP-specific. i means case insensitive, s lets . match newlines, and x permits whitespace and comments inside the pattern.
The code is an extended regex. It allows you to put whitespace and comments in your regexes. See perldoc perlre and perlretut. Otherwise like normal.
Same.
The characters are regex modifiers.
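To see the capture groups and the three modifiers working together, here is a minimal sketch that applies the script's pattern to a single chunk. parse_entry and the sample data are made up for illustration; the pattern itself is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): applies the
# question's pattern to one chunk and returns the three captures.
sub parse_entry {
    my ($chunk) = @_;
    if ($chunk =~ /<a\sname="
                   ([^"]*)    # group 1: anchor name, stops at the closing quote
                   ">
                   <strong>
                   ([^>]*)    # group 2: post title
                   <\/strong><\/a><\/dt>\s*<dd>
                   (.*)       # group 3: body, may span lines thanks to the s flag
                   <\/dd>/six) {
        return ($1, $2, $3);
    }
    return;    # empty list when the chunk does not match
}

# Upper-case tags still match because of the i flag.
my $chunk = qq{<A NAME="2004-10-25"><STRONG>October 25th</STRONG></A></DT>\n}
          . qq{<DD>\n<p>\n[Content]\n</p>\n</DD>\n};

my ($anchor, $title, $body) = parse_entry($chunk);
print "anchor: $anchor\n";    # anchor: 2004-10-25
print "title:  $title\n";     # title:  October 25th
```

The whitespace and the # comments inside the pattern are ignored because of x, which is why the regex can be laid out over several lines.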

How can I extract URL and link text from HTML in Perl?

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.
If the page contained these links:
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
The output would be:
Google, http://www.google.com
Apple, http://www.apple.com
What is the best way to do this in Perl?
Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of URLs.
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $some_url );    # $some_url holds the address of the page to fetch
my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}
Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
Mech is basically a browser in an object.
Have a look at HTML::LinkExtor, part of the HTML::Parser package, and at HTML::LinkExtractor.
HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):
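#!/usr/bin/perl
use strict;
use warnings;

if ($#ARGV < 0) {
    print "$0: Need URL argument.\n";
    exit 1;
}
# Fetch the page with wget and split it into lines.
# NOTE: the URL goes to the shell unquoted; only use this with trusted input.
my @content = split(/\n/, `wget -qO- $ARGV[0]`);
my @links = grep(/<a.*href=.*>/, @content);
foreach my $c (@links) {
    # Skip the line if either capture fails, rather than reusing a stale $1.
    my ($link)  = $c =~ /<a.*href="([\s\S]+?)".*>/ or next;
    my ($title) = $c =~ /<a.*href.*>([\s\S]+?)<\/a>/ or next;
    print "$title, $link\n";
}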
There's likely a few things I did wrong here, but it works in a handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc).
I like using pQuery for things like this...
use feature 'say';
use pQuery;
pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);
Also checkout this previous stackoverflow.com question Emulation of lex like functionality in Perl or Python for similar answers.
Another way to do this is to use XPath to query the parsed HTML. It is useful in complex cases, such as extracting all links in a div with a specific class. Use HTML::TreeBuilder::XPath for this.
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content($c);    # $c holds the HTML source
my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
while (my $node = $nodes->shift) {
    my $t = $node->attr('title');
}
Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.
Andy recommended WWW::Mechanize. That's probably the best solution.
If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.
Previous answers were perfectly good and I know I’m late to the party but this got bumped in the [perl] feed so…
XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set recover option when parsing badly formed HTML.
use XML::LibXML;
my $doc = XML::LibXML->load_html(IO => \*DATA);
for my $anchor ( $doc->findnodes("//a[\@href]") )
{
    printf "%15s -> %s\n",
        $anchor->textContent,
        $anchor->getAttribute("href");
}
__DATA__
<html><head><title/></head><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>
–yields–
Google -> http://www.google.com
Apple -> http://www.apple.com
HTML::LinkExtractor is better than HTML::LinkExtor
It can give both link text and URL.
Usage:
use HTML::LinkExtractor;
my $input = q{If <a href="http://www.apple.com"> Apple </a>}; # HTML string
my $LX = HTML::LinkExtractor->new(undef, undef, 1);
$LX->parse(\$input);
for my $Link ( @{ $LX->links } ) {
    if ( $$Link{_TEXT} =~ m/Apple/ ) {
        print "\n LinkText $$Link{_TEXT} URL $$Link{href}\n";
    }
}
HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.
A regular expression can also extract each link together with its link text; this is one more way to do it.
local $/ = '';
my $a = <DATA>;
while( $a =~ m/<a[^>]*?href=\"([^>]*?)\"[^>]*?>\s*([\w\W]*?)\s*<\/a>/igs )
{
    print "Link:$1 \t Text: $2\n";
}
__DATA__
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>

Can emacs re-indent a big blob of HTML for me?

When editing HTML in emacs, is there a way to automatically pretty-format a blob of markup, changing something like this:
<table>
<tr>
<td>blah</td></tr></table>
...into this:
<table>
  <tr>
    <td>
      blah
    </td>
  </tr>
</table>
You can run sgml-pretty-print and then indent-for-tab-command on the same region/buffer, provided you are in html-mode or nxml-mode.
sgml-pretty-print adds new lines in the proper places and indent-for-tab-command adds nice indentation. Together they lead to properly formatted html/xml.
By default, when you visit a .html file in Emacs (22 or 23), it will put you in html-mode. That is probably not what you want. You probably want nxml-mode, which is seriously fancy. nxml-mode seems to only come with Emacs 23, although you can download it for earlier versions of emacs from the nXML web site. There is also a Debian and Ubuntu package named nxml-mode. You can enter nxml-mode with:
M-x nxml-mode
You can view nxml mode documentation with:
C-h i g (nxml-mode) RET
All that being said, you will probably have to use something like Tidy to re-format your xhtml example. nxml-mode will get you from
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body>
<table>
<tr>
<td>blah</td></tr></table>
</body>
to
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body>
<table>
<tr>
<td>blah</td></tr></table>
</body>
</html>
but I don't see a more general facility to do line breaks on certain xml tags as you want. Note that C-j will insert a new line with proper indentation, so you may be able to do a quick macro or hack up a defun that will do your tables.
http://www.delorie.com/gnu/docs/emacs/emacs_277.html
After selecting the region you want to fix. (To select the whole buffer use C-x h)
C-M-q
Reindent all the lines within one parenthetical grouping(indent-sexp).
C-M-\
Reindent all lines in the region (indent-region).
i wrote a function myself to do this for xml, which works well in nxml-mode. should work pretty well for html as well:
(defun jta-reformat-xml ()
  "Reformats xml to make it readable (respects current selection)."
  (interactive)
  (save-excursion
    (let ((beg (point-min))
          (end (point-max)))
      (if (and mark-active transient-mark-mode)
          (progn
            (setq beg (min (point) (mark)))
            (setq end (max (point) (mark))))
        (widen))
      (setq end (copy-marker end t))
      (goto-char beg)
      (while (re-search-forward ">\\s-*<" end t)
        (replace-match ">\n<" t t))
      (goto-char beg)
      (indent-region beg end nil))))
In Emacs 25, which I'm currently building from source, assuming you are in HTML mode, use C-x h to select all, and then press Tab.
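You can put a line break after every closing tag with a regexp replace:
M-x replace-regexp
Replace regexp: \(</[^>]+>\)
Replace with: \1 followed by C-q C-j (C-q C-j types a literal newline)
Then indent the whole buffer:
C-x h
M-x indent-region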
This question is quite old, but I wasn't really happy with the various answers. A simple way to re-indent an HTML file, given that you are running a relatively newer version of emacs (I am running 24.4.1) is to:
open the file in emacs
mark the entire file with C-x h (note: if you would like to see what is being marked, add (setq transient-mark-mode t) to your .emacs file)
execute M-x indent-region
What's nice about this method is that it does not require any plugins (Conway's suggestion), it does not require a replace regexp (nevcx's suggestion), nor does it require switching modes (jfm3's suggestion). Jay's suggestion is in the right direction: in general, executing C-M-q will indent according to a mode's rules. For example, C-M-q works, in my experience, in js-mode and in several other modes. But neither html-mode nor nxml-mode seems to implement C-M-q.
Tidy can do what you want, but only for whole buffer it seems (and the result is XHTML)
M-x tidy-buffer
You can pipe a region to xmllint (if you have it) using:
M-|
Shell command on region: xmllint --format -
The result will end up in a new buffer.
I do this with XML, and it works, though I believe xmllint needs certain other options to work with HTML or other not-perfect XML. nxml-mode will tell you if you have a well-formed document.
The easiest way to do it is via command line.
Make sure you have tidy installed
type tidy -i -m <<file_name>>
Note that the -m option modifies the file in place, replacing the original with the tidied version. If you don't want that, you can type tidy -i -o <<tidied_file_name>> <<untidied_file_name>>
The -i is for indentation. Alternatively, you can create a .tidyrc file that has settings such as
indent: auto
indent-spaces: 2
wrap: 72
markup: yes
output-xml: no
input-xml: no
show-warnings: yes
numeric-entities: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: no
break-before-br: no
uppercase-tags: no
uppercase-attributes: no
This way all you have to do is type tidy -o <<tidied_file_name>> <<untidied_file_name>>.
For more just type man tidy on the command line.