How to stop at the next specific character in regex - html

I have many links in a large variable, and am using regex to extract links. The ideal link would look like
<a href="/search/title/?vendornum=StaplesA03">View Stock</a>
And my regex works perfectly, looking for two matches: the full link and the vendornum.
/<a href="\/search\/product\/(.*?)\/.*?>(.*?)<\/a>/igm
But occasionally, the link will include other info such as a class, which has its own quotes
<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
And the extra "s throw me off. I cannot figure out the first match, which would be the first two "s
<a href="([^"]+)".*[^>].*?>View Stock</a>
I know regex can be very challenging, and I am using RegEx101.com, a real life saver.
But I just can't seem to figure out how to match the first pattern, the full href link, while excluding any other attributes, with their own quotes, before I reach the closing >.
Any regex experts that can guide me?

There is generally no reason to build an HTML parser by hand, from scratch; there's usually trouble awaiting down the road. Regexes for this are picky, sensitive to details, and brittle to even tiny input changes, while requirements tend to evolve. Why not use one of a few great HTML libraries?
An example with HTML::TreeBuilder (which also extracts the links, a need stated in a comment):
use warnings;
use strict;
use feature 'say';
use HTML::TreeBuilder;
my $links_string =
q(<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a href="/search/title/?vendornum=StaplesA17" >View More Stock</a> );
my $dom = HTML::TreeBuilder->new_from_content($links_string);
my @links_html;
foreach my $tag ( $dom->look_down(_tag => "a") ) {
push @links_html, $tag->as_HTML; # the whole link, as is
my $href = $tag->attr("href");
my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/;
say "$name = $value";
say $tag->as_trimmed_text; # or: ->as_text, keep some spaces
# Or:
# say for $tag->content_list; # all children, and/or text
}
# say for @links_html;
I use a string with a newline between links for your "many links in a large variable", perhaps with some spaces around as well. This doesn't affect parsing done by the library.
A few comments
The workhorse here is the HTML::Element class, with its powerful and flexible look_down method. If the string indeed has just links then you can probably use that class directly, but when done as above a full HTML document would parse just as well.
Once I get the URL I use a very simple-minded regex to pull out a single name-value pair. Adjust if there can be more pairs, or let me know. Better yet, use URI if there's more to it.
The as_trimmed_text method returns the text parts of the element's children, which in this case is presumably just the text of the link. The content_list returns all child nodes (the same here).
Use URI::Escape if there are percent-encoded characters to convert, per RFC 3986
This prints
vendornum = StaplesA03
View Stock
vendornum = StaplesA17
View More Stock
Another option is Mojo::DOM, which is a part of a whole ecosystem
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;
my $links_string = q( ... ); # as above
my $dom = Mojo::DOM->new($links_string);
my @links_html;
foreach my $node ( $dom->find('a')->each ) {
push @links_html, $node->to_string; # or $node, which stringifies to HTML
my $href = $node->attr('href');
my ($name, $value) = $href =~ /\?([^=]+)=([^&]+)/;
say "$name = $value";
say $node->text;
}
# say for @links_html;
I use the same approach as above, and this prints the same. But note that Mojolicious provides yet other convenient ways. Often, calls are chained using a range of its useful methods, and very fine navigation through HTML is easily done using CSS selectors.
While it is probably useful here to loop as above, as an example we can also do
my $v = $dom -> find('a')
-> map(
sub {
my ($name, $value) = $_->attr('href') =~ /\?(.+?)=([^&]+)/;
say "$name = $value";
say $_->text;
}
);
which prints the same as above. See Mojo::Collection to play with this further.
The parameters in the URL can be parsed using Mojo::URL if you know the parameter name
my $value = Mojo::URL->new($href)
-> query
-> param('vendornum');
If these aren't fixed then Mojo::Parameters is useful
my $param_names = Mojo::Parameters
-> new( Mojo::URL->new($href)->query )
-> names
where $param_names is an arrayref with names of all parameters in the query, or use
my $pairs = Mojo::Parameters->new( Mojo::URL->new($href)->query ) -> pairs;
# Or
# my %pairs = @{ Mojo::Parameters->new( Mojo::URL->new($href)->query )->pairs };
which returns an arrayref with all name, value pairs listed in succession (and which can be assigned directly to a hash, for instance).
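If pulling in Mojolicious just for the query string is overkill, the same pair extraction can be sketched in core Perl. This is a minimal sketch with a made-up URL; it assumes no percent-encoding and no repeated names — reach for URI or Mojo::URL otherwise:

```perl
use strict;
use warnings;

my $href = '/search/title/?vendornum=StaplesA03&page=2';

# everything after the '?' is the query string
my ($query) = $href =~ /\?(.*)\z/;

# split into pairs; the limit of 2 keeps any '=' inside values intact
my %pairs = map { split /=/, $_, 2 } split /&/, $query;

print "$_ = $pairs{$_}\n" for sort keys %pairs;
```

which prints `page = 2` and `vendornum = StaplesA03`.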
An HTML document can be nicely parsed using XML::LibXML as well.

If I read correctly, you'd like to extract the vendornum value from the URL, and the link text. Best to use an html parser.
If you want to live dangerously with code that can break you can use a regex to parse html:
my $html = '<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>';
if($html =~ /<a href="[^\?]*\?vendornum=([^"]*)[^>]*>([^<]*).*$/) {
print "vendornum: $1, link text: $2\n";
} else {
print "no match";
}
Output:
vendornum: StaplesA03, link text: View Stock
Explanation:
vendornum=([^"]*) - scan for vendornum=, and capture everything after that until just before "
[^>]*> - scan over remaining attributes, such as class="", up to closing angle bracket
([^<]*) - capture link text
.*$ - scan up to end of text
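If some of the links use single quotes around attributes, the same idea can be generalized; this is a sketch with made-up sample links, not production code:

```perl
use strict;
use warnings;

# hypothetical sample links: one double-quoted with a class, one single-quoted
my @links = (
    q{<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>},
    q{<a href='/search/title/?vendornum=StaplesB07'>More Stock</a>},
);

for my $link (@links) {
    # ["'] accepts either quote style; [^"'>] keeps the scan inside the attribute
    if ( $link =~ /<a href=["'][^"'>]*\?vendornum=([^"'&]+)["'][^>]*>([^<]*)/ ) {
        print "vendornum: $1, link text: $2\n";
    }
}
```

which prints `vendornum: StaplesA03, link text: View Stock` and `vendornum: StaplesB07, link text: More Stock`.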

First of all you should consider using HTML::TreeBuilder for things like this. Once you get the hang of it it can be easier than coming up with regexes. However for quick and dirty tasks, a regex is fine.
my $text =
'<a href="/search/title/?vendornum=StaplesA03" class="product-lister" >View Stock</a>
<a x=y href="/search/product/?Vendornum=651687" foo=bar>View Stockings</A>';
my $regex =
qr{<a\s[^>]*?href="(?<link>[^"]*?\?vendornum=(?<vendornum>\w+)[^"]*)"[^>]*?>(?<desc>(?:(?!</a>).)*)</a>}i;
while ( $text =~ m/$regex/g ) { Data::Dump::dd \%+; } # needs: use Data::Dump;
Returns
{
# tied Tie::Hash::NamedCapture
desc => "View Stock",
link => "/search/title/?vendornum=StaplesA03",
vendornum => "StaplesA03",
}
{
# tied Tie::Hash::NamedCapture
desc => "View Stockings",
link => "/search/product/?Vendornum=651687",
vendornum => 651687,
}
HTH

Related

How to Enable HTML::TableExtract to Recognize Special Characters

I was trying to parse a page that contains scientific notation (Greek letters, etc.).
This is the page. Note that there are other pages with more notations to be parsed.
For example it contains the following HTML
<td> human Interleukin 1&beta; </td>
where &beta; encodes the Greek letter β.
However after parsing with HTML::TableExtract it became:
human Interleukin 1\x{3b2}
Is there a way to make the code below capture the original HTML as it is,
i.e. maintaining the 1&beta;?
use HTML::TableExtract;
use Data::Dumper;
# Local file for http://www.violinet.org/vaxjo/vaxjo_detail.php?c_vaxjo_id=55
my $file = 'vaxjo_detail.php?c_vaxjo_id=50.html';
my $te = HTML::TableExtract->new();
$te->parse_file($file);
my ($table) = $te->tables;
print Dumper $table ;
It did not return
human Interleukin 1\x{3b2}
It returned
human Interleukin 1β
Dumper simply prints that out as a Perl string literal
"human Interleukin 1\x{3b2}"
Anyway, if you want the raw HTML instead of the text it represents, I believe passing keep_html => 1 to the constructor will do the trick.
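A minimal sketch of that; the sample table row here is made up, and the expectation (hedged, as above) is that keep_html stores each cell's raw markup rather than decoded text:

```perl
use strict;
use warnings;
use HTML::TableExtract;

my $html = '<table><tr><td> human Interleukin 1&beta; </td></tr></table>';

# keep_html => 1 asks for the cell content as markup instead of decoded text
my $te = HTML::TableExtract->new( keep_html => 1 );
$te->parse($html);

my ($table) = $te->tables;
my ($row)   = $table->rows;
print $row->[0], "\n";
```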

I can't save the contents of a table in html into a text file. Using CGI and PERL

Hello my friends
I am fairly new in javascript and html as well so this project is taking much longer than I probably need.
Here is the problem, I want to save the contents of a table into a text file.
Here is the code in the form
print start_form(-action=>"",-onSubmit=>"return false;");
print table({-border=>1,-cellpadding=>3,-name=>"tabla",-id=>"tab"},
th(["Matricula","Nombre","Apellido Paterno","Apellido Materno","Sexo","Edad","Carrera","Email",
"Materias","Promedio","Borrar"]));
print submit(-label=>'Guardar',-onClick=>'salvar();',-onSubmit=>"salvar(); return false;");
print hidden(-name=>'escondido',-id=>'hid',
-default=>['0']);
print hidden(-name=>'escondido2',-id=>'hid2',
-default=>['nadiemeve2']);
print end_form;
Rows are added to the table as it reads a file using javascript, so when I want to save the contents of the displayed table into a text file I do this:
if (param) {
open FT, "+>>".param('escondido2') or die "No se puede abrir el archivo";
print FT param('celda60');
close FT;
}
Here, 'celda60' is the name of a single text field inside a cell; the name is given as the cell is created.
As you might notice, I am just trying to save a single cell as a test. The thing is, it doesn't work. It doesn't save a thing. But if I do:
print FT param('celda60')."TEST";
the only thing it saves in my file is TESTESTESTEST, so the problem gotta be param('celda60')
In case you wonder 'salvar()' looks like this
function salvar(){
var table2 = document.getElementById('tab');
var rowCount2 = table2.rows.length - 1;
document.getElementById('hid').value=rowCount2; }
the only thing it does is to get the amount of rows so when I get the saving thing going I can save the content inside each textfield from each cell by doing a few cycles.
Any idea of what I am doing wrong here? I mean, besides lots of rudimentary stuff I guess.
Client side modifications of an HTML document are not automatically reflected via CGI into the state of the server-side program. You need an additional mechanism, e.g. AJAX.
You need to read a well-written Perl and CGI tutorial from sometime after 1998.
I recommend Ovid's CGI Course. It's a venerable classic, but it is very well written and does not encourage bad practices.
You've got at least one place where I can run arbitrary commands using shell escapes in my CGI arguments. That is why practices like tainting user data and using 3-argument open are standard, and have been for many years.
General advice:
Make sure you are using strict and warnings.
Use 3 argument open with lexical handles, or just use IO::File. open my $fh, '>>', $pathtofile or die "Ouch $!";
You don't have to use the CGI module with Perl to do CGI. It is one convenient way to handle parameter parsing and HTML building. There are MANY options.
print can handle a list of arguments. There is no need to have 50 print statements in your code.
Add whitespace to your code. Spacing things out makes it more readable.
On the last couple points here's an example:
sub print_form {
print
start_form(-action=>"", -onSubmit=>"return false;"),
table({-border=>1,-cellpadding=>3,-name=>"tabla",-id=>"tab"},
th([ "Matricula", "Nombre",
"Apellido Paterno", "Apellido Materno",
"Sexo", "Edad",
"Carrera", "Email",
"Materias", "Promedio",
"Borrar",
])
),
submit( -label => 'Guardar',
-onClick => 'salvar();',
-onSubmit => 'salvar(); return false;',
),
hidden( -name=>'escondido', -id=>'hid', -default=>['0'] ),
hidden( -name=>'escondido2', -id=>'hid2', -default=>['nadiemeve2'] ),
end_form();
}
Although, in my code I'd make the html and pass it back. I like to keep side effects, like printing stuff out, closely grouped.
Maybe you are overcomplicating this. Something like the following will let a user enter a list of values in a form and then save those values to a data file on the server.
use strict;
use warnings;
use CGI ':standard';
use IO::File;
my @FIELDS = qw(
matricula nombre
apellido_paterno apellido_materno
sexo edad
carrera email
materias promedio
borrar
);
if ( param('gotstuff') ) {
my $fh = IO::File->new( 'datafile.txt', '>>' ) or die "Can't open: $!";
my @data = map { scalar param($_) } @FIELDS; # one value per field
$fh->print( join ',', @data );
# Print a thank you for your data page here.
}
else {
# Do your other junk here, start some html, etc
# print the form and so on.
print_form();
# Print the rest of your HTML here.
}
sub print_form {
my @fields = qw(
matricula nombre
apellido_paterno apellido_materno
sexo edad
carrera email
materias promedio
borrar
);
my @headings = map { my $w = $_; $w =~ s/_/ /g; ucfirst $w } @fields;
print
start_form(-action => script_name(), -method => 'POST' ),
table({-border=>1,-cellpadding=>3,-name=>"tabla",-id=>"tab"},
Tr([
th(\@headings),
td([ map { textfield($_) } @fields ]),
]),
),
submit( -label => 'Guardar',
-onClick => 'salvar();',
-onSubmit => 'salvar(); return false;',
),
hidden( 'gotstuff', 'gotstuff' ),
end_form();
}

How can I modify HTML files in Perl?

I have a bunch of HTML files, and what I want to do is to look in each HTML file for the keyword 'From Argumbay' and change this with some href that I have.
I thought it was very simple at first, so what I did is open each HTML file and load its contents into an array (list), then look for each keyword and replace it with s///, and dump the contents back to the file. The problem? Sometimes the keyword can also appear in an href, in which case I don't want it replaced, or it can appear inside some tags and such.
An EXAMPLE: http://www.astrosociety.org/education/surf.html
I would like my script to replace each occurrence of the word 'here' with some href that I have in $href, but as you can see, there is another 'here' which is already href'ed; I don't want it to be wrapped in another href.
In this case there aren't additional 'here's apart from the one in the href, but let's assume that there are.
I want to replace the keyword only if it's just text. Any idea?
BOUNTY EDIT: I believe it's a simple thing, but it seems to erase all the comments found in the HTML and SHTML files (the main issue is that it erases SSIs in SHTMLs). I tried calling the store_comments(1) method on $html before calling the recursive function, but to no avail. Any idea what I am missing here?
To do this with HTML::TreeBuilder, you would read the file, modify the tree, and write it out (to the same file, or a different file). This is fairly complex, because you're trying to convert part of a text node into a tag, and because you have comments that can't move.
A common idiom with HTML-Tree is to use a recursive function that modifies the tree:
use strict;
use warnings;
use 5.008;
use File::Slurp 'read_file';
use HTML::TreeBuilder;
sub replace_keyword
{
my $elt = shift;
return if $elt->is_empty;
$elt->normalize_content; # Make sure text is contiguous
my $content = $elt->content_array_ref;
for (my $i = 0; $i < @$content; ++$i) {
if (ref $content->[$i]) {
# It's a child element, process it recursively:
replace_keyword($content->[$i])
unless $content->[$i]->tag eq 'a'; # Don't descend into <a>
} else {
# It's text:
if ($content->[$i] =~ /here/) { # your keyword or regexp here
$elt->splice_content(
$i, 1, # Replace this text element with...
substr($content->[$i], 0, $-[0]), # the pre-match text
# A hyperlink with the keyword itself:
[ a => { href => 'http://example.com' },
substr($content->[$i], $-[0], $+[0] - $-[0]) ],
substr($content->[$i], $+[0]) # the post-match text
);
} # end if text contains keyword
} # end else text
} # end for $i in content index
} # end replace_keyword
my $content = read_file('foo.shtml');
# Wrap the SHTML fragment so the comments don't move:
my $html = HTML::TreeBuilder->new;
$html->store_comments(1);
$html->parse("<html><body>$content</body></html>");
my $body = $html->look_down(qw(_tag body));
replace_keyword($body);
# Now strip the wrapper to get the SHTML fragment back:
$content = $body->as_HTML;
$content =~ s!^<body>\n?!!;
$content =~ s!</body>\s*\z!!;
print STDOUT $content; # Replace STDOUT with a suitable filehandle
The output from as_HTML will be syntactically correct HTML, but not necessarily nicely-formatted HTML for people to view the source of. You can use HTML::PrettyPrinter to write out the file if you want that.
If tags matter in your search and replace, you'll need to use HTML::Parser.
This tutorial looks a bit easier to understand than the documentation with the module.
If you want to go with a regular-expression-only method, and you're prepared to accept the following provisos:
this will not work correctly within HTML comments
this will not work where the < or > character is used within a tag
this will not work where the < or > character is used and not part of a tag
this will not work where a tag spans multiple lines (if you're processing one line at a time)
If any of the above conditions do exist then you will have to use one of the HTML/XML parsing strategies outlined by other answers.
Otherwise:
my $searchfor = "From Argumbay";
my $replacewith = "<a href='http://google.com/?s=Argumbay'>From_Argumbay</a>";
1 while $html =~ s/
\A # beginning of string
( # group all non-searchfor text
( # sub group non-tag followed by tag
[^<]*? # non-tags (non-greedy)
<[^>]*> # whole tags
)*? # zero or more (non-greedy)
)
\Q$searchfor\E # search text
/$1$replacewith/sx;
Note that this will NOT work if $searchfor matches $replacetext (so don't put "From Argumbay" back into the replacement text).
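For example, the loop above applied to a made-up test string replaces the bare occurrence while leaving the one hidden inside a tag untouched, because tags are always consumed whole:

```perl
use strict;
use warnings;

my $searchfor   = "From Argumbay";
my $replacewith = "<a href='http://google.com/?s=Argumbay'>From_Argumbay</a>";

# made-up test string: one bare occurrence, one inside a title attribute
my $html = 'Read <b title="not From Argumbay here">From Argumbay</b> today.';

# condensed one-line form of the commented regex above
1 while $html =~ s/\A((?:[^<]*?<[^>]*>)*?)\Q$searchfor\E/$1$replacewith/s;

print $html, "\n";
```

This prints `Read <b title="not From Argumbay here"><a href='http://google.com/?s=Argumbay'>From_Argumbay</a></b> today.` — only the element text was wrapped.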

How can I remove external links from HTML using Perl?

I am trying to remove external links from an HTML document but keep the anchors but I'm not having much luck. The following regex
$html =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;
will match the beginning of an anchor tag and the end of an external link tag e.g.
1
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
so I end up with nothing instead of
1
some other html
It just so happens that all anchors have their href attribute in uppercase, so I know I can do a case sensitive match, but I don't want to rely on it always being the case in the future.
Is there something I can change so it only matches the one a tag?
Echoing Chris Lutz' comment, I hope the following shows that it is really straightforward to use a parser (especially if you want to be able to deal with input you have not yet seen such as <a class="external" href="...">) rather than putting together fragile solutions using s///.
If you are going to take the s/// route, at least be honest, do depend on href attributes being all upper case instead of putting up an illusion of flexibility.
Edit: By popular demand ;-), here is the version using HTML::TokeParser::Simple. See the edit history for the version using just HTML::TokeParser.
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
while ( my $token = $parser->get_token ) {
if ($token->is_start_tag('a')) {
my $href = $token->get_attr('href');
if (defined $href and $href !~ /^#/) {
print $parser->get_trimmed_text('/a');
$parser->get_token; # discard </a>
next;
}
}
print $token->as_is;
}
__DATA__
1
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>
<p>Maybe you did not consider click here >>>
either</p>
Output:
C:\Temp> hjk
1
some other html
No. 155 <!-- end tag not necessarily on the same line -->
An example you might not have considered
<p>Maybe you did not consider click here >>>
either</p>
NB: The regex-based solution you checked as "correct" breaks if the files that are linked to have the .html extension rather than .htm. Given that, I find your concern with not relying on the upper-case HREF attributes unwarranted. If you really want quick and dirty, you should not bother with anything else; rely on the all-caps HREF and be done with it. If, however, you want to ensure that your code works with a much larger variety of documents and for much longer, you should use a proper parser.
A bit more like a SAX type parser is HTML::Parser:
use strict;
use warnings;
use English qw<$OS_ERROR>;
use HTML::Parser;
use List::Util qw<first>;
my $omitted;
sub tag_handler {
my ( $self, $tag_name, $text, $attr_hashref ) = @_;
if ( $tag_name eq 'a' ) {
my $href = first { defined } @{ $attr_hashref }{ qw<href HREF> };
$omitted = substr( $href, 0, 7 ) eq 'http://';
return if $omitted;
}
print $text;
}
sub end_handler {
my $tag_name = shift;
if ( $tag_name eq 'a' && $omitted ) {
$omitted = 0;
return;
}
print shift;
}
my $parser
= HTML::Parser->new( api_version => 3
, default_h => [ sub { print shift; }, 'text' ]
, start_h => [ \&tag_handler, 'self,tagname,text,attr' ]
, end_h => [ \&end_handler, 'tagname,text' ]
);
$parser->parse_file( $path_to_file ) or die $OS_ERROR;
Yet another solution. I love HTML::TreeBuilder and family.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new_from_file(\*DATA);
foreach my $a ($root->find_by_tag_name('a')) {
if ($a->attr('href') !~ /^#/) {
$a->replace_with_content($a->as_text);
}
}
print $root->as_HTML(undef, "\t");
__DATA__
1
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>
<p>Maybe you did not consider click here >>>
either</p>
Why not just only remove links for which the href attribute doesn't begin with a pound sign? Something like this:
$html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;
Even more simple, if you don't care about tag attributes:
$html =~ s/<a[^>]+>(.+?)<\/a>/$1/sig;
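Combining the two, here is a runnable sketch that unwraps only external links, with all the usual regex-on-HTML caveats. The sample fragment echoes the question, with the end tag on its own line:

```perl
use strict;
use warnings;

my $html = <<'HTML';
<a href="#s1">1</a>
some other html
<a href="155.htm">No. 155
</a>
HTML

# (?!#) is a zero-width lookahead: it rejects hrefs starting with '#'
# without consuming the first character of the others
$html =~ s/<a\s+href="(?!#)[^"]*"[^>]*>(.*?)<\/a>/$1/sig;

print $html;
```

The in-page anchor survives intact while the external link is reduced to its text.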

How can I remove unused, nested HTML span tags with a Perl regex?

I'm trying to remove unused spans (i.e. those with no attribute) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.
I'm having a problem with my regex not picking the correct pair of start and end tags to remove.
my $a = 'a <span>b <span style="color:red;">c</span> d</span>e';
$a =~ s/<span\s*>(.*?)<\/span>/$1/g;
print "$a\n";
returns
a b <span style="color:red;">c d</span>e
but I want it to return
a b <span style="color:red;">c</span> de
Help appreciated.
Try HTML::Parser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
my @print_span;
my $p = HTML::Parser->new(
start_h => [ sub {
my ($text, $name, $attr) = @_;
if ( $name eq 'span' ) {
my $print_tag = %$attr;
push @print_span, $print_tag;
return if !$print_tag;
}
print $text;
}, 'text,tagname,attr'],
end_h => [ sub {
my ($text, $name) = #_;
if ( $name eq 'span' ) {
return if !pop @print_span;
}
print $text;
}, 'text,tagname'],
default_h => [ sub { print shift }, 'text'],
);
$p->parse_file(\*DATA) or die "Err: $!";
$p->eof;
__END__
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a header</h1>
a <span>b <span style="color:red;">c</span> d</span>e
</body>
</html>
Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).
This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:
(<span[^>]*>(?:[^<]++|(?1))*<\/span>)
See perlfaq 6.11.
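A minimal illustration of the recursion syntax on unattributed spans (a sketch; it needs Perl 5.10+ for (?1) and possessive quantifiers, and assumes no stray < in the text):

```perl
use strict;
use warnings;

my $str = 'a <span>b <span>c</span> d</span> e';

# (?1) recurses into capture group 1, so nested spans pair up correctly
if ( $str =~ /(<span>(?:[^<]++|(?1))*<\/span>)/ ) {
    print $1, "\n";
}
```

This prints `<span>b <span>c</span> d</span>`: the recursion lets the match span the nested pair instead of stopping at the first </span>.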
Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed <span> start-tags, allowing the </span> end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.
You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.
Don't use regexps for processing (HTML ==) XML. You never know what input you'll get. Consider this, valid HTML:
a <span>b <span style="color:red;" title="being closed with </span>">c</span> de
Would you have thought of that?
Use an XML processor instead.
Also see the Related Questions (to the right) for your question.
With all your help I've published a script that does everything I need.
http://github.com/timabell/decrufter/