Trouble Getting Regular Expression To Work - html

I'm trying to use regular expressions to remove certain blocks of code from a text file. So far, most of my regular expression lines have worked to remove the code. However, I have two questions:
1) Whenever I remove a chunk of text, blank space is left where the text used to be, rather than the text simply being removed.
An example of my regex code is:
$file =~ s/<ul(.*)>//gi;
Which removes all lines with the basic format <ul...>, which is what I want it to do. However, as mentioned prior, it replaces the tag and all contained data with blank spaces, and I was wondering how to stop this particular substitution.
2) Certain regular expression codes that should work, don't seem to. For instance, I want to remove
<script type="text/javascript">
function getCookies() { return ""; }
</script>
I have tried using various regex codes, but nothing seems to remove these lines. For instance:
$file =~ s/<script type(.*)<\/script>//gi;
Which removes the <script type...> and </script> tags respectively, but leaves the
function getCookies() { return ""; }
...intact. I'm unsure as to why this happens, and I would very much like to correct this. How would this be possible? Any help on either of these two questions would be immensely helpful!
Edit: Sorry all, I'm using Perl!
Also: I just tried using
$file =~ /<script type(.*)<\/script>/sgi
...as well as /msgi, but neither worked unfortunately. Both the <script type> and </script> tags were removed, but for some reason the
function getCookies() { return ""; }
...section stayed. Here is my entire code, including all regex:
use strict;
use warnings;
my $firstarg;
if ($ARGV[0]){
$firstarg = $ARGV[0];
}
open (DATA, $ARGV[1]);
my $file = do {local $/; <DATA>};
$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
$file =~ s/<head>//gi;
$file =~ s/<\/head>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<\link>//gi;
$file =~ s/CDM(.*)\;//gi;
$file =~ s/<\!(.*)->//gi;
$file =~ s/<body(.*)>//gi;
$file =~ s/<\/body>//gi;
$file =~ s/<div(.*)>//gi;
$file =~ s/<\/div>//gi;
$file =~ s/function(.*)>//gi;
$file =~ s/<noscript>//gi;
$file =~ s/<\/noscript>//gi;
$file =~ s/<a(.*)>//gi;
$file =~ s/<\/a>//gi;
$file =~ s/<ul(.*)>//gi;
$file =~ s/<\/ul>//gi;
$file =~ s/<li(.*)>//gi;
$file =~ s/<\/li>//gi;
$file =~ s/<form(.*)>//gi;
$file =~ s/<\/form>//gi;
$file =~ s/<iframe(.*)>//gi;
$file =~ s/<\/iframe>//gi;
$file =~ s/<select(.*)>//gi;
$file =~ s/<\/select>//gi;
$file =~ s/<textarea(.*)>//gi;
$file =~ s/<\/textarea>//gi;
$file =~ s/<b>//gi;
$file =~ s/<\/b>//gi;
$file =~ s/<H1>//gi;
$file =~ s/<H2>//gi;
$file =~ s/<H3>//gi;
$file =~ s/<H4>//gi;
$file =~ s/<H5>//gi;
$file =~ s/<H6>//gi;
$file =~ s/<\/H1>//gi;
$file =~ s/<\/H2>//gi;
$file =~ s/<\/H3>//gi;
$file =~ s/<\/H4>//gi;
$file =~ s/<\/H5>//gi;
$file =~ s/<\/H6>//gi;
$file =~ s/<option(.*)>//gi;
$file =~ s/<\/option>//gi;
$file =~ s/<p>//gi;
$file =~ s/<\/p>//gi;
$file =~ s/<span(.*)>//gi;
$file =~ s/<\/span>//gi;
$file =~ s/<!doctype(.*)>//gi;
$file =~ s/<base(.*)>//gi;
$file =~ s/<br>//gi;
$file =~ s/<hr>//gi;
$file =~ s/<img(.*)>//gi;
$file =~ s/<input(.*)>//gi;
$file =~ s/<link(.*)>//gi;
$file =~ s/<meta(.*)>//gi;
$file =~ s/<script type(.*)<\/script>//gi;
print $file;
Ok, now that I deleted the <script> regex that was causing one problem, another has been created - using:
$file =~ s/<script type(.*)<\/script>//gi;
removes everything in between the first instance of <script ...>, but not the tag itself, not the repetitions of the tag throughout. Using:
$file =~ s/<script type(.*)<\/script>//mgi;
results in the exact same thing. Using:
$file =~ s/<script type(.*)<\/script>//sgi;
results in the printing of several new line characters, but no other text, same for /msgi.
Urgh, the problems never end... :(
NEW EDIT: I would like to apologize for posting a question about parsing HTML using regex. I realize that there is a rather large backlash within the programming community regarding this practice (or attempt at practice, since this seems to fail more often than not). However, I am unfortunately forced to use regex to parse selected HTML files, from which it should be possible to remove the majority, if not all, of the HTML tags. I am not allowed to use a module, despite this being the most obvious and simplest of answers.

If you are not allowed to use anything but Perl regular expressions then you could adapt the code to strip HTML tags from a text:
#!/usr/bin/perl -w
use strict;
use warnings;
$_ = do { local $/; <DATA> };
# see http://www.perlmonks.org/?node_id=161281
# ALGORITHM:
# find < ,
# comment <!-- ... -->,
# or comment <? ... ?> ,
# or one of the start tags that require a corresponding
# end tag, plus everything up to that end tag
# or if \s or ="
# then skip to next "
# else [^>]
# >
s{
< # open tag
(?: # open group (A)
(!--) | # comment (1) or
(\?) | # another comment (2) or
(?i: # open group (B) for /i
( # one of start tags
SCRIPT | # for which
APPLET | # must be skipped
OBJECT | # all content
STYLE # to correspond
) # end tag (3)
) | # close group (B), or
([!/A-Za-z]) # one of these chars, remember in (4)
) # close group (A)
(?(4) # if previous case is (4)
(?: # open group (C)
(?! # and next is not : (D)
[\s=] # \s or "="
["`'] # with open quotes
) # close (D)
[^>] | # and not close tag or
[\s=] # \s or "=" with
`[^`]*` | # something in quotes ` or
[\s=] # \s or "=" with
'[^']*' | # something in quotes ' or
[\s=] # \s or "=" with
"[^"]*" # something in quotes "
)* # repeat (C) 0 or more times
| # else (if previous case is not (4))
.*? # minimum of any chars
) # end if previous char is (4)
(?(1) # if comment (1)
(?<=--) # wait for "--"
) # end if comment (1)
(?(2) # if another comment (2)
(?<=\?) # wait for "?"
) # end if another comment (2)
(?(3) # if one of tags-containers (3)
</ # wait for end
(?i:\3) # of this tag
(?:\s[^>]*)? # skip junk to ">"
) # end if (3)
> # tag closed
}{}gsx; # STRIP THIS TAG
print;
__END__
<html><title>remove script, ul</title>
<script type="text/javascript">
function getCookies() { return ""; }
</script>
<body>
<ul><li>1
<li>2
<p>paragraph
Output
remove script, ul
1
2
paragraph
NOTE: This regex doesn't work for nested tag-containers e.g.:
<!DOCTYPE html>
<meta charset="UTF-8">
<title>Nested <object> example</title>
<body>
<object data="uri:here">fallback content for uri:here
<object data="uri:another">uri:another fallback
</object>!!!this text should be striped too!!!
</object>
Output
Nested <object> example
!!!this text should be striped too!!!
Don't parse HTML with regexes. Use an HTML parser or a tool built on top of one, e.g., HTML::Parser:
#!/usr/bin/perl -w
use strict;
use warnings;
use HTML::Parser ();
HTML::Parser->new(
ignore_elements => ["script"],
ignore_tags => ["ul"],
default_h => [ sub { print shift }, 'text'],
)->parse_file(\*DATA) or die "error: $!\n";
__END__
<html><title>remove script, ul</title>
<script type="text/javascript">
function getCookies() { return ""; }
</script>
<body>
<ul><li>1
<li>2
<p>paragraph
Output
<html><title>remove script, ul</title>
<body>
<li>1
<li>2
<p>paragraph

To reply to your last comment:
perl -e'$file="<script etc>\nfoo\n</script>bar"; $file =~ s/<script.*script>//gis; print $file'
this does seem to do what you want, as suggested by others. I don't see how that is different from what you're trying, though.
....
Can you add this:
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper($file);
before the regexp and give us the result?
.....
Bingo:
line 5 and 6 of your $file =~ list already filter them out:
$file =~ s/<\!DOCTYPE(.*)>//gi;
$file =~ s/<html>//gi;
$file =~ s/<\/html>//gi;
$file =~ s/<title>//gi;
$file =~ s/<\/title>//gi;
## Here they come:
$file =~ s/<script(.*)>//gi;
$file =~ s/<\/script>//gi;
$file =~ s/<head>//gi;

I'm not sure what programming language you're using, but assuming that you're in perl, try putting the s modifier at the end of the regex:
$file =~ /<script type(.*)<\/script>/sgi
The /s modifier makes the . match any character, including newlines (normally it doesn't include newlines)
Edit: I apologize, I'm not good at Perl, but I did some looking around and I finally realized that the s/ in front is for substitutions. In this case, your regex should be:
$file =~ s/<script type(.*)<\/script>//sgi;
to remove everything, including the script tags. However, if you just want to remove the content between the tags and keep the tags themselves, it is:
$file =~ s/(<script type="[^"]*"\s*>).*(<\/script>)/$1$2/sgi;
Notice the $1$2 between the slashes. This text is the replacement text. In this case we are using the text from the capturing groups in place of the original. In your question you were using two slashes in a row (s/<ul(.*)>//gi), which means you're substituting an empty string for the whole match. It seems to me that you're actually looking to replace everything with a blank space (ASCII 0x20), like s/<ul(.*)>/ /gi.
Since your last edit - You'll want to use one regex for the scripts since you don't want the contents:
$file =~ s/(<script type="[^"]*"\s*>).*(<\/script>)/ /sgi;
and another generic regex for all the other tags:
$file =~ s/<\/?\s*[^>]+>//sgi
I'm assuming here that you don't want to limit to just the tags you displayed above, you just want to kill all HTML. There is a *nix utility called html2text that does this. You might want to look into using that.
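A minimal sketch of how the /s modifier and greediness interact, assuming the whole file is already held in one string. The sample HTML and the variable names are made up for illustration. With /s, a greedy .* runs from the first <script to the last </script>, which may explain why almost nothing was left in the asker's output; a non-greedy .*? removes each block on its own.

```perl
use strict;
use warnings;

# Hypothetical sample input with two separate script blocks.
my $html = <<'HTML';
<p>before</p>
<script type="text/javascript">
function getCookies() { return ""; }
</script>
<p>middle</p>
<script type="text/javascript">var x = 1;</script>
<p>after</p>
HTML

# Greedy: with /s, .* spans from the first <script to the LAST </script>,
# so the text between the two blocks is lost as well.
(my $greedy = $html) =~ s/<script\b.*<\/script>//sgi;

# Non-greedy: .*? stops at the nearest </script>, so each block is removed
# separately. The trailing \n? also swallows the line break after the tag,
# which avoids leaving an empty line behind.
(my $lazy = $html) =~ s/<script\b.*?<\/script>\n?//sgi;

print "greedy:\n$greedy\nnon-greedy:\n$lazy";
```

The optional \n? is one possible answer to the "blank space" part of the question: the tag is removed, but the newline that followed it is not unless the pattern consumes it.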

You’re going to have to be a lot more careful than that. See both approaches in this answer.

This:
$file =~ s/<div(.*)>//gi;
won't do what you expect. The '*' operator is greedy. If you have a line like:
hello<div id="foo"><b>bar!</b>baz
it'll substitute as much as it can, leaving only:
hellobaz
You want:
$file =~ s/<div[^>]*>//gi;
or
$file =~ s/<div.*?>//gi;
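The three variants can be compared side by side on the example line from this answer. This is only a small sketch; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $line = 'hello<div id="foo"><b>bar!</b>baz';

# Greedy .* runs to the LAST ">" on the line.
(my $greedy = $line) =~ s/<div(.*)>//gi;

# [^>]* cannot cross a ">", so the match ends at the tag's own closing bracket.
(my $negated = $line) =~ s/<div[^>]*>//gi;

# .*? expands only as far as needed to reach the first ">".
(my $lazy = $line) =~ s/<div.*?>//gi;

print "$greedy\n";   # hellobaz
print "$negated\n";  # hello<b>bar!</b>baz
print "$lazy\n";     # hello<b>bar!</b>baz
```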

Related

Encoding Special Characters in Perl

I have this string for example:
This is an example text ã&"><£
When I run this Perl code on the string:
my ($string) = @_;
$string =~ s/>//g;
$string =~ s/<//g;
$string =~ s/&/and/g;
$string =~ s/\"//g;
$string =~ s/-//;
$string =~ s/ó//;
$string =~ s/;//g;
$string =~ s/&/&/g;
$string = encode_entities($string, '<>&"');
$string = encode_utf8($string);
return $string;
I receive this result:
This is an example text ã£ã£
Instead of the expected one:
This is an example text ã&amp;&quot;&gt;&lt;£
How can I solve it?
Can you try the following script:
use feature qw(say);
use strict;
use warnings;
use utf8;
use open qw(:std :encoding(utf-8));
use HTML::Entities;
my $string = 'This is an example text ã&"><£';
$string = encode_entities($string, '<>&"');
say $string;
Output:
This is an example text ã&amp;&quot;&gt;&lt;£

how to pass a variable from an HTML form to a perl cgi script?

I would like to use an HTML form to pass a variable to Perl CGI script so that I can process that variable, and then print it out on another HTML page.
Here is my HTML code: http://jsfiddle.net/wTVQ5/.
Here is my Perl CGI script that the HTML form links to. This is the way I would like to do it (since it uses fewer lines and is probably more efficient).
#!/usr/bin/perl
use warnings; use strict;
use CGI qw( :standard);
my $query = CGI->new;
# Process an HTTP request
my $user = $query->param('first_name');
# process $user... for example:
my $foo = "Foo";
my $str = $user . $foo;
print "Content-type:text/html\r\n\r\n";
print "<html>";
print "<head>";
print "<title>Hello - Second CGI Program</title>";
print "</head>";
print "<body>";
print "<h2>Hello $str - Second CGI Program</h2>";
print "</body>";
print "</html>";
1;
Here's a way I read in a tutorial, which makes more sense to me:
#!/usr/bin/perl
use warnings; use strict;
my ($buffer, @pairs, $pair, $name, $value, %FORM);
# Read in text
$ENV{'REQUEST_METHOD'} =~ tr/a-z/A-Z/;
if ($ENV{'REQUEST_METHOD'} eq "POST") {
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
}else {
$buffer = $ENV{'QUERY_STRING'};
}
# Split information into name/value pairs
@pairs = split(/&/, $buffer);
foreach $pair (@pairs) {
($name, $value) = split(/=/, $pair);
$value =~ tr/+/ /;
$value =~ s/%(..)/pack("C", hex($1))/eg;
$FORM{$name} = $value;
}
my $user = $FORM{first_name};
# process $user... for example:
my $foo = "Foo";
my $str = $user . $foo;
print "Content-type:text/html\r\n\r\n";
print "<html>";
print "<head>";
print "<title>Hello - Second CGI Program</title>";
print "</head>";
print "<body>";
print "<h2>Hello $str - Second CGI Program</h2>";
print "</body>";
print "</html>";
1;
Both of these don't work properly BTW. When I click on the submit button on the HTML page, it just links me to the script instead of passing the variable, processing it, and printing out the HTML page.
this line:
print "Content-type:text/html\r\n\r\n";
should be:
print "Content-type:text/html\n\n";
or better:
print $query->header;
Also, ensure your web server is properly configured for CGI. And, if you have enough time, consider a modern web application approach; there are many frameworks that may be better than CGI (Dancer, Mojolicious, OX, ...)
I see you're using CGI 'standard', so there is really no need to call CGI->new unless you want to. Also, since you asked for fewer lines, you could just do something like this.
use strict;
use warnings;
use CGI qw( :standard );
my $user = param('first_name') || q/foo/;
print header,
start_html(-title => 'Hello'), h1('Hello ' . $user), end_html;
You need to edit your httpd.conf with something like this.
AddHandler cgi-script cgi pl
<Directory /path/to/cgi/files>
Options +ExecCGI
</Directory>
If you are running this locally, you could create a folder named public_html in your home directory and serve your scripts from there; you would just have to configure the server to map to that location as well.

How to use perl to replace text between two specific lines

Hi, I have an HTML file called 1.html that looks like this:
<div class="t_l"></div>
<some>
lines
of
codes
</some>
<div class="t_r"></div>
I want to replace the content between those two divs with other content, which is stored in a file called "banner".
The banner file is:
<other>
lines
of some
codes
</other>
So what I want to get is:
<div class="t_l"></div>
<other>
lines
of some
codes
</other>
<div class="t_r"></div>
What I come up with using perl is something like this:
# Slurp file 1.html into a single string
open(FILE,"1.html") or die "Can't open file: $!";
undef $/;
my $file = <FILE>;
open(BANNER,"banner") or die "Can't open file: $!";
undef $/;
my $banner = <BANNER>;
close BANNER;
# Set strings to find and insert
my $first_line = '<div class="t_l"></div>';
my $second_line = '<div class="t_r"></div>';
$file =~ s/$first_line\n.+?\n$second_line#s/$first_line\n$banner\n$second_line/;
close FILE;
# Write output to output.txt
open(OUTPUT,">1new.html") or die "Can't open file: $!";
print OUTPUT $file;
close OUTPUT;
The above code cannot work. Any suggestions?
You're almost there.
The normal regex behavior of . is to match any character except a newline. .+? in your regex doesn't work for you because there are more newline characters between $first_line and $second_line.
Using the /s modifier tells Perl to let . match newline characters, too.
(You also have an extraneous "#s" in your expression)
So a working substitution is:
$file =~ s/$first_line\n.+?\n$second_line/$first_line\n$banner\n$second_line/s;
Go with
$file =~ s/($first_line\n).+?(\n$second_line)/$1$banner$2/s;
or
$file =~ s/(?<=$first_line\n).+?(?=\n$second_line)/$banner/s;
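The substitutions in this thread interpolate the marker strings directly into the pattern, which works here because the two div lines contain no regex metacharacters. A slightly more defensive sketch, assuming the same file contents, quotes the markers with \Q...\E. The here-documents below are stand-ins for 1.html and banner so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the contents of 1.html and banner.
my $file = <<'HTML';
<div class="t_l"></div>
<some>
lines
of
codes
</some>
<div class="t_r"></div>
HTML

my $banner = <<'HTML';
<other>
lines
of some
codes
</other>
HTML
chomp $banner;    # drop the trailing newline so no blank line is inserted

my $first_line  = '<div class="t_l"></div>';
my $second_line = '<div class="t_r"></div>';

# \Q...\E quotes any regex metacharacters in the marker strings.
# /s lets . match newlines, and .*? stops at the first closing marker.
$file =~ s/(\Q$first_line\E\n).*?(\n\Q$second_line\E)/$1$banner$2/s;

print $file;
```

The chomp matters when the banner is slurped from a file, since the slurped text usually ends with a newline of its own.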

Perl - How to decode or replace ' from database to single quote for browser display

I have found lots of questions close to this but, nothing that helped me solve it. Prob due to my lack of expertise.
PHP has html_entity_decode which could have helped but, Perl does not, I believe.
In my MySQL database I have ' " following lengths in a description like: 12' 6"
I would like it to display as 12' 6".
I have tried:
$string =~ s/:[']:/'/g;
$string =~ s/:["]:/"/g;
$string =~ s/'/'/g;
$string =~ s/"/"/g;
$string =~ s/\'/'/g;
$string =~ s/\"/"/g;
$string =~ s/\'/\'/g;
$string =~ s/\"/\"/g;
perl -pi -e 's:':':g' $_; #crashes.
perl -pi -e 's:":":g' $string #also crashes.
system -pi -e 's:':':g' $_; #crashes.
system -pi -e 's:":":g' $string #also crashes.
I am at a loss. Can someone help?
Have a look at HTML::Entities:
use warnings;
use strict;
use HTML::Entities;
my $str = '12&#39; 6&quot;';
print decode_entities($str);
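If HTML::Entities is not installed, the idea can be approximated in core Perl. The following is a rough sketch only: the decode_basic_entities name and the small entity table are made up for this example, the input string assumes the stored text looks like 12&#39; 6&quot;, and it handles far fewer entities than decode_entities does.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Entities: decodes numeric
# references and a handful of named entities only.
my %named = (amp => '&', lt => '<', gt => '>', quot => '"', apos => "'");

sub decode_basic_entities {
    my ($text) = @_;
    # One pass over the string, so already-decoded text is never decoded twice.
    $text =~ s{&(?:#x([0-9a-fA-F]+)|#([0-9]+)|(\w+));}{
          defined $1        ? chr(hex $1)     # hex reference, e.g. &#x27;
        : defined $2        ? chr($2)         # decimal reference, e.g. &#39;
        : exists $named{$3} ? $named{$3}      # known named entity
        :                     "&$3;"          # unknown name: leave untouched
    }ge;
    return $text;
}

print decode_basic_entities('12&#39; 6&quot;'), "\n";   # prints: 12' 6"
```

For real data, the module remains the better choice, since it knows the full entity table.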

Convert CSS Style Attributes to HTML Attributes using Perl

Real quick background: We have a PDFMaker (HTMLDoc) that converts HTML into a PDF. HTMLDoc doesn't consistently pick up the styles that we need from the HTML that is provided to us by the client. Thus I'm trying to convert things such as style="width:80px;height:90px;" to width=80 height=90.
My attempt so far has revealed my limited understanding of back references and how to utilize them properly during Perl Regex. I can take an input file and convert it to an output file, but it only catches one "style" per line, and only replaces one name/value pair from that css.
I'm probably approaching this the wrong way but I can't figure out a faster or smarter way to do this in Perl. Any help would be greatly appreciated!
NOTE: The only attributes I'm trying to change for this particular script are "height", "width" and "border," because our client utilizes a tool that automatically applies styles to elements that they drag around with a WYSIWYG-style editor. Obviously, using a regex to strip these out of a lot of places works fairly well, as you just let the table cells be sized by their content, which looks okay, but I figured a quicker way to deal with the issue would just be to replace those three attributes with "width" "height" and "border" attributes, which behave mostly the same as their css counterparts (excepting that CSS allows you to actually customize the width, color, and style of the border, but all they ever use is solid 1px, so I can add a condition to replace "solid 1px" with "border=1". I realize these are not fully equivalent, but for this application it would be a step).
Here's what I've got so far:
#!/usr/bin/perl
if (!$ARGV[0] || !$ARGV[1])
{
print "Usage: converter.pl [input file] [output file] \n";
exit;
}
open FILE, "<", $ARGV[0] or die $!;
open OUTFILE, ">", $ARGV[1] or die $!;
my $line;
my $guts;
while ( <FILE> ) {
$line = $_ ;
$line =~ /style=\"(.+)\"/;
$guts = $1;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$name = $1;
$value = $2;
$guts = $name."=".$value;
$line =~ s/style=\"(.+)\"/$guts/g;
print OUTFILE $line ;
}
exit;
Note: This is NOT homework, and no I'm not asking you to do my job for me, this would end up being an internal tool that just sped up the process of formatting our incoming html to work properly in the pdf converter we have.
UPDATE
For those interested, I got an initial working version. This one only replaces width and height, the border attribute we're scrapping for now. But if anyone wanted to see how we did it, take a look...
#!/usr/bin/perl
## NOTES ##
# This script was made to simply replace style attributes with their name/value pair equivalents as attributes.
# It was designed to replace width and height attributes on a metric buttload of table elements from client data we got.
# As such, it's not really designed to handle more than that, and only strips the unit "PX" from the values.
# All of these can be modified in the second foreach loop, which checks for height and width.
if (!$ARGV[0] || !$ARGV[1])
{
print "Usage: quickvert.pl [input file] [output file] \n";
exit;
}
open FILE, "<", $ARGV[0] or die $!;
open OUTFILE, ">", $ARGV[1] or die $!;
my $line;
my $guts;
my $count = 1;
while ( <FILE> ) {
$line = $_ ;
my (@match) = $line =~ /style=\"(.+?)\"/g;
my $guts;
my $newguts;
foreach (@match) {
#print $_ ."\n";
$guts = $_;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$newguts = "";
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$value =~ s/px//g;
if ( $name =~ m/height/g || $name =~ m/width/g ) {
$newguts .= "$name='$value' ";
} else {
$newguts .= "";
}
}
#print "replacing $guts with $newguts on line $count \n";
$line =~ s/style=\"$guts\"/$newguts/i;
}
#print $newguts;
print OUTFILE $line ;
$count++;
}
exit;
You will have a very difficult time with this, for a few reasons:
Most things that can be accomplished with CSS can't be done with HTML attributes. To deal with this you'd either have to ignore or attempt to compensate for things like margins and padding, etc...
Many things that correspond between HTML attributes and CSS actually behave slightly differently, and you will need to account for this. To deal with this you would have to write specific code for each difference...
Because of the way CSS rules are applied, you basically need to use a complete CSS engine to parse and apply all of the rules before you will know what needs to be done at the element/attribute level. To deal with this you could just ignore anything except inline styles, but...
This work is almost as complicated as writing a rendering engine for a browser. You might be able to deal with a few specific cases, but even there your success rate would be haphazard at best.
EDIT: Given your very specific feature set, I can give you a little advice on your implementation:
You want to be case-insensitive and use a non-greedy match when looking for the value of the style attribute, i.e.:
$line =~ /style=\"(.+?)\"/i;
So that you only find stuff up to the very next double-quote, not the entire content of the line up to the last double quote. Also, you probably want to skip the line if the match isn't found, so:
next unless ($line =~ /style=\"(.+?)\"/i);
For parsing the guts, I'd use split instead of regex:
my $newguts;
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$newguts .= "$name='$value' ";
}
$line =~ s/style=\"$guts\"/$newguts/i;
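Putting the split-based advice together, one possible shape for the conversion is sketched below. The style_to_attrs helper and the sample line are hypothetical, and the sketch assumes double-quoted style attributes and only the width and height properties mentioned in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: turn width/height declarations from an inline style
# into HTML attributes and drop every other declaration.
sub style_to_attrs {
    my ($style) = @_;
    my @attrs;
    for my $decl (split /;/, $style) {
        my ($name, $value) = split /:/, $decl, 2;
        next unless defined $name && defined $value;
        s/^\s+|\s+$//g for $name, $value;     # trim surrounding whitespace
        next unless $name =~ /^(?:width|height)$/i;
        $value =~ s/px$//i;                   # strip the "px" unit only
        push @attrs, lc($name) . qq{="$value"};
    }
    return join ' ', @attrs;
}

my $line = '<td style="width:80px;height:90px;color:red;">x</td> <td style="height: 10px">y</td>';

# /g handles several style attributes on one line; /e calls the helper
# for each match instead of interpolating $guts back into a pattern.
$line =~ s/style="([^"]*)"/style_to_attrs($1)/gie;

print "$line\n";
```

Doing the replacement with /e avoids re-matching the captured text as a regex, so characters such as # or ( inside a style value cannot break the second substitution.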
Of course, this being Perl there are standard mantras such as always use strict and warnings, try to use named matches rather than $1, $2, etc., but I'm trying to restrict my advice to stuff that will move your solution forward right away.
Have a look on CPAN for HTML parsing modules like HTML::TreeBuilder, HTML::DOM or even XML modules like XML::LibXML.
Below is a quick example using HTML::TreeBuilder, which adds a border="1" attribute to any tag that has a style attribute with border content:
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
my $data =q{
<html>
<head>
</head>
<body>
<h1>blah</h1>
<p style="color: red;">Red</p>
<span style="width:80px;height:90px;border: 1px solid #000000">Some text</span>
</body>
</html>
};
my $tree = HTML::TreeBuilder->new;
$tree->parse_content( $data );
for my $style ( $tree->look_down( sub { $_[0]->attr('style') } ) ) {
my $prop = $style->attr( 'style' );
$style->attr( 'border', 1 ) if $prop =~ m/border/;
}
say $tree->as_HTML;
Which will reproduce the HTML but with border="1" added just to the span tag.
Alongside these modules, you can also have a look at CSS and CSS::DOM to help parse the CSS part.
I don't know your stance on proprietary software, but PrinceXML is the best HTML to PDF converter available.