Editing multiple HTML files using SED (or something similar)

I have about 1000 HTML files to edit which represent footnotes in a large technical document. I have been asked to go through the HTML files one by one and manually edit the HTML, to get it all on the straight and narrow.
I know that this could probably be done in a matter of seconds with SED as the changes to each file are similar. The body text in each file can be different but I want to change the tags to match the following:
<body>
<p class="Notes">See <i>R v Swain</i> (1992) 8 CRNZ 657 (HC).</p>
</body>
The text may change; for example, it could say 'See R v Pinky and the Brain (1992)' or something like that, but basically the body markup should look like that.
Currently, however, the body text may be:
<body>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span
class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Pinky and the Brain</i> (1992) </span></span></span></span></span></p>
</body>
or even:
<body>
<p class="FootnoteText"><span class="FootnoteReference"><span lang="EN-US"
xml:lang="EN-US" style="font-size: 10.0pt;"><span><![endif]></span></span></span>See <i>R v Pinky and the Brain</i> (1992)</p>
</body>
Can anybody suggest a SED expression or something similar that would solve this?

Like this?:
perl -pe 's/Swain/Pinky and the Brain/g;' -i lots.html of.html files.html
The breakdown:
-e = "Use code on the command line"
-p = "Execute the code on every line of every file, and print out the line, including what changed"
-i = "Actually replace the files with the new content"
If you swap out -i with -i.old then lots.html.old and of.html.old (etc) will contain the files before the changes, in case you need to go back.
This will replace just Swain with Pinky and the Brain in all the files. Further changes would require more runs of the command. Or:
s/Swain/Pinky/g; s/Twain/Brain/g;
To swap Swain with Pinky and Twain with Brain everywhere.
Update:
If you can be sure about the incoming formatting of the data, then something like this may suffice:
# cat ff.html
<body>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span
class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Twain</i> (1992) </span></span></span></span></span></p>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span
class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Swain</i> (1992) </span></span></span></span></span></p>
</body>
# perl -pe 'BEGIN{undef $/;} s/<[pP][ >].*?See <i>(.*?)<\/i>(.*?)<.*?\/[pP]>/<p class="Notes">See <i>$1<\/i>$2<\/p>/gsm;' ff.html
<body>
<p class="Notes">See <i>R v Twain</i> (1992) </p>
<p class="Notes">See <i>R v Swain</i> (1992) </p>
</body>
Explanations:
BEGIN{undef $/;} = treat the whole document as one string, or else html that has newlines in it won't get handled properly
<[pP][ >] = the beginning of a p-tag (upper or lower case)
.*? = lots of stuff, non-greedy-matched i.e. http://en.wikipedia.org/wiki/Regular_expression#Lazy_quantification
See <i> = literally look for that string - very important, since that seems to be the only common denominator
(.*?) = put more stuff into a parentheses group (to be used later)
<\/i> = the end i-tag
(.*?) = put more stuff into a parentheses group (to be used later)
<.*?\/[pP]> = the end p-tag and any other tags mashed up before it (like all your spans)
and replace it with the string you want, where $1 and $2 are what got snagged in the parentheses before, i.e. the two (.*?)'s
g = global search - so possibly more than one match per file
s = let . match newlines too, so a tag split across lines still matches (the whole file is one string now due to the BEGIN at the top)
m = let ^ and $ match at line boundaries; harmless but unused here, since the pattern has no anchors
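For comparison, here is a minimal Python sketch of the same idea: read the whole file as one string and apply one lazy, newline-aware substitution. The function name normalize_footnotes is made up for illustration, and like the Perl version it assumes every footnote contains the literal text "See <i>".

```python
import re

# Same pattern as the Perl one-liner: from an opening <p ...>, lazily up to
# "See <i>", capture the case name and the text after </i>, then skip any
# remaining tags up to the closing </p>.
FOOTNOTE = re.compile(
    r'<[pP][ >].*?See <i>(.*?)</i>(.*?)<.*?/[pP]>',
    re.DOTALL,  # like Perl's /s: let . match newlines
)

def normalize_footnotes(html: str) -> str:
    """Rewrite each footnote paragraph as <p class="Notes">See <i>..</i>..</p>."""
    return FOOTNOTE.sub(r'<p class="Notes">See <i>\1</i>\2</p>', html)

if __name__ == "__main__":
    import sys
    # Hypothetical usage: python normalize.py < ff.html
    sys.stdout.write(normalize_footnotes(sys.stdin.read()))
```

As with the Perl version, a paragraph that lacks "See <i>" is left untouched, so it is worth checking the output for leftover span tags.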

First convert your HTML files to proper XHTML using http://tidy.sourceforge.net and then use xmlstarlet to do the necessary XHTML processing.
Note: Get the current version of xmlstarlet for in-place XML file editing.
Here's a simple, yet complete mini-example:
curl -s http://checkip.dyndns.org > dyndns.html
tidy -wrap 0 -numeric -asxml -utf8 2>/dev/null < dyndns.html > dyndns.xml
# test: print body text to stdout (dyndns.xml)
xml sel -T \
-N XMLNS="http://www.w3.org/1999/xhtml" \
-t -m "//XMLNS:body" -v '.' -n \
dyndns.xml
# edit body text in-place (dyndns.xml)
xml ed -L \
-N XMLNS="http://www.w3.org/1999/xhtml" \
-u "//XMLNS:body" -v '<p>NEW BODY TEXT</p>' \
dyndns.xml
# create new HTML file (by overwriting the original one!)
xml unesc < dyndns.xml > dyndns.html

To consolidate the span tags you may use tidy (version released on 25 March 2009) as well!
# get current tidy version: http://tidy.cvs.sourceforge.net/viewvc/tidy/tidy/
# see also: http://tidy.sourceforge.net/docs/quickref.html#merge-spans
tidy -q -c --merge-spans yes file.html

You will have to check your input files to verify some assumptions can be made. Based on your two examples, I have made the following assumptions. You will need to check them and take some sample input files to verify you have found all assumptions.
The file consists of a single footnote contained in a single <body></body> pair. The body tags are always present and well formed.
The footnote is buried somewhere inside a <p></p> pair and one or many <span></span> tags. <!...> tags can be discarded.
The following Perl script works for both examples you have supplied (on Linux with Perl 5.10.0). Before using it, make sure you have a backup of your original html files. By default, it will only print the result on stdout without changing any file.
#!/usr/bin/perl
$overwrite = 0;
# undefine the input record separator so that <FH> slurps the whole file
# (setting it to '' would select paragraph mode and stop at the first blank line)
undef $/;
foreach $filename (@ARGV)
{
# slurp entire file in $text variable
open FH, "<$filename";
$full_text = <FH>;
close FH;
if ($overwrite)
{
! -f "$filename.bak" && rename $filename, "$filename.bak";
}
# match everything that is found before the body tag, everything
# between and including the body tags, and what follows
# s modifier causes full_text to be considered a single long string
# instead of individual lines
($before_body, $body, $after_body) = ($full_text =~ m!(.*)<body>(.*)</body>(.*)!s);
#print $before_body, $body, $after_body;
# Discard unwanted tags from the body
$body =~ s%<span.*?>%%sg;
$body =~ s%</span.*?>%%sg;
$body =~ s%<p.*?>%%sg;
$body =~ s%</p.*?>%%sg;
$body =~ s%<!.*?>%%sg;
# Remaining leading and trailing whitespace likely to be newlines: remove
$body =~ s%^\s*%%sg;
$body =~ s%\s*$%%sg;
if ($overwrite)
{
open FH, ">$filename";
print FH $before_body, "<body>\n<p class=\"Notes\">$body</p>\n</body>", $after_body;
close FH;
}
else
{
print $before_body, "<body>\n<p class=\"Notes\">$body</p>\n</body>", $after_body;
}
}
To use it:
./script.pl file1.html
./script.pl file1.html file2.html
./script.pl *.html
Tweak it and when you're happy set $overwrite=1. The script creates a .bak only if it does not already exist.
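If Perl is not an option, the same tag-stripping approach can be sketched in Python. This is only an illustration under the assumptions listed above (one footnote per file, a well-formed <body></body> pair); clean_body is a made-up name, and unlike the script it only transforms a string and leaves reading and writing the files to the caller.

```python
import re

def clean_body(full_text: str) -> str:
    """Strip span, p and <!...> tags from the body and rewrap it in one Notes paragraph."""
    match = re.search(r'(.*)<body>(.*)</body>(.*)', full_text, re.DOTALL)
    if match is None:
        # Assumption: files without a <body></body> pair are left untouched
        return full_text
    before, body, after = match.groups()
    # Discard unwanted tags; DOTALL lets a tag that spans two lines match
    body = re.sub(r'</?span.*?>', '', body, flags=re.DOTALL)
    body = re.sub(r'</?p\b.*?>', '', body, flags=re.DOTALL)
    body = re.sub(r'<!.*?>', '', body, flags=re.DOTALL)
    # Remaining leading and trailing whitespace is likely to be newlines
    body = body.strip()
    return f'{before}<body>\n<p class="Notes">{body}</p>\n</body>{after}'
```

Calling clean_body on the text of each file and printing the result first, before writing anything back, mirrors the $overwrite = 0 dry run of the script.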

If you have 1 entry per file, no rigid structure in these files, and possibly multiple lines, I would go for a php or perl script to process them file by file, while emitting suitable warnings when the patterns don't match.
use
php -f thescript.php
to execute thescript.php, which contains
<?php
$path = "datapath/";
$dir = opendir($path);
while ( ( $fn = readdir($dir) ) !== false )
{
if ( preg_match("/html$/",$fn) ) process($path.$fn);
}
function process($file)
{
$in = file_get_contents($file);
$in2 = str_replace("\n"," ",strip_tags($in,"<i>"));
if ( preg_match("#^(.*)<i>(.*)</i>(.*)$#i",$in2,$match) )
{
list($dummy,$p0,$p1,$p2) = $match;
$out = "<body>$p0<i>$p1</i>$p2</body>";
file_put_contents($file.".out",$out);
} else {
print "Problem with $file? (stripped down to: $in2)\n";
file_put_contents($file.".problematic",$in);
}
}
?>
You could tweak this to your needs until the number of misses is low enough to do the last few by hand. You probably need to add some $p0 = trim($p0); etc. to sanitize everything.

Related

Use regular expression to extract img tag from HTML in Perl

I need to extract a captcha image from a URL and recognise it with Tesseract.
My code is:
#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
###Add code here!
#Grab img from HTML code
#if ($html =~ /<img. *?src. *?>/)
#{
# $img1 = $1;
#}
#else
#{
# $img1 = "";
#}
$img2 = grep(/<img. *src=.*>/,$html);
if ($html =~ /\img[^>]* src=\"([^\"]*)\"[^>]*/)
{
my $takeImg = $1;
my @dirs = split('/', $takeImg);
my $img = $dirs[2];
}
else
{
print "Image not found\n";
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$img' > ocr_me.img\n";
system "GET '$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;
As you can see, I'm trying to extract the img src attribute. The solution from "use shell command tesseract in perl script to print a text output" did not work for me ($img1). I also tried an adapted version ($img2) of the solution from "How can I extract URL and link text from HTML in Perl?".
If you need HTMLcode from that page, here is:
<html>
<head>
<title>Perl test</title>
</head>
<body style="font: 18px Arial;">
<nobr>somenumbersimg src="/JJ822RCXHFC23OXONNHR.png"
somenumbers<img src="/captcha/1533030599.png"/>
somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" </nobr><br/><br/><form method="post" action="?u=user&p=pass">User: <input name="u"/><br/>PW: <input name="p"/><br/><input type="hidden" name="file" value="1533030599.png"/>Text: <input name="text"></br><input type="submit"></form><br/>
</body>
</html>
I get the error that the image was not found. I think my problem is a wrong regular expression. I cannot install any modules such as HTTP::Parser or similar.
Aside from the fact that using regular expressions on HTML isn't very reliable, your regular expression in the following code isn't going to work because it's missing a capture group, so $1 won't be assigned a value.
if ($html =~ /<img. *?src. *?>/)
{
$img = $1;
}
If you want to extract parts of text using a regular expression you need to put that part inside brackets. Like for example:
$example = "hello world";
$example =~ /(hello) world/;
this will set $1 to "hello".
The regular expression itself doesn't make much sense either: where you have ". *?", it matches any single character followed by zero or more spaces. Is that a typo for ".*?", which matches any number of characters but, unlike the greedy ".*", stops as soon as the next part of the regex can match?
This regular expression is possibly closer to what you're looking for. It'll match the first img tag that has a src attribute that starts with "/captcha/" and store the image URL in $1
$html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s;
To break down how it works: the "m%...%" is just a different way of writing "/.../" that lets you put slashes in the regex without escaping them. "[^>]*" matches zero or more of any character except ">", so it can't run past the end of the tag. "(/captcha/[^"]*)" uses a capture group to grab everything inside the double quotes, which is the URL. The "s" modifier on the end makes "." match newlines as well; this pattern contains no ".", so it isn't strictly needed, and "[^>]*" already matches newlines, so an img tag split over multiple lines will still match.
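The same capture-group idea, sketched in Python in case it helps to see it outside Perl. The function name find_captcha_src is made up for illustration, and it assumes the captcha image is the first <img> whose src starts with "/captcha/", as in the page shown in the question.

```python
import re

def find_captcha_src(html: str):
    """Return the src of the first <img> whose src starts with /captcha/, or None."""
    # [^>]* keeps the match inside one tag; the parentheses capture the URL
    match = re.search(r'<img[^>]*src="(/captcha/[^"]*)"', html)
    return match.group(1) if match else None
```

Returning None when nothing matches plays the role of the "Image not found" branch in the original script.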

Perl regex working in online PCRE tester but not in perl command

I've written the following PCRE regex to strip scripts from HTML pages: <script.*?>[\s\S]*?< *?\/ *?script *?>
It works on many online PCRE regex testers:
https://regex101.com/r/lsxyI6/1
https://www.regextester.com/?fam=102647
It does NOT work when I run the following perl substitution command in a bash terminal: cat tmp.html | perl -pe 's/<script.*?>[\s\S]*?< *?\/ *?script *?>//g'
I am using the following test data:
<script>
$(document).ready(function() {
var url = window.location.href;
var element = $('ul.nav a').filter(function() {
if (url.charAt(url.length - 1) == '/') {
url = url.substring(0, url.length - 1);
}
return this.href == url;
}).parent();
if (element.is('li')) {
element.addClass('active');
}
});
</script>
P.S. I am using regex to parse HTML because the HTML parser I am forced to use (xmlpath) breaks when there are complex scripts on the page. I am using this regex to remove scripts from the page before passing it to the parser.
You need to use -0 to tell perl not to treat each line of the file as its own separate record.
perl -0 -pe 's/<script.*?>[\s\S]*?< *?\/ *?script *?>//g' tmp.html
Strictly speaking, -0 tells perl to split records on the null character '\0', which slurps the file only because HTML normally contains none. perl -0777 very explicitly slurps the whole file.
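To see why the record separator matters, here is a small Python sketch that applies the same regex once per line and once to the whole text. The sample page is invented for the demonstration.

```python
import re

# The regex from the question, unchanged apart from not escaping the slash
SCRIPT = re.compile(r'<script.*?>[\s\S]*?< *?/ *?script *?>')

def strip_scripts(text: str) -> str:
    """Remove every <script>...</script> block found in text."""
    return SCRIPT.sub('', text)

page = (
    "<p>before</p>\n"
    "<script>\n"
    "var url = window.location.href;\n"
    "</script>\n"
    "<p>after</p>\n"
)

# Line by line (what perl -pe does by default): no single line holds both
# the opening and the closing tag, so nothing ever matches.
line_by_line = ''.join(strip_scripts(line) for line in page.splitlines(keepends=True))

# Whole text at once (what -0777 gives you): the block is removed.
whole_text = strip_scripts(page)
```

Here line_by_line comes out identical to page, while whole_text no longer contains the script block.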
By the way, because I find slurping whole files distasteful, and because I don't care what html has to say about line breaks...a quicker, cleaner, more correct way to do this IF you can guarantee there is no important content on <script> tag lines is:
perl -ne 'print if !(/<script>/../<\/script>/)' tmp.html
(modifying the two regexes to your fancy, of course)
.. is a stateful operator that is flipped on by the expression before it being true and off by the one after being true.
~/test£ cat example.html
<important1/>
<edgecase1/><script></script><edgecase2/>
<important2/>
<script></script>
<important3/>
<script>
<notimportant/>
</script>
~/test£ perl -ne 'print if !(/<script>/../<\/script>/)' example.html
<important1/>
<important2/>
<important3/>
And to (mostly) address content on script tag lines but outside tags:
~/test£ perl -ne 'print if !(/<script>/../<\/script>/);print "$1\n" if /(.+)<script>/;print "$1\n" if /<\/script>(.+)/;' example.html
<important1/>
<edgecase1/>
<edgecase2/>
<important2/>
<important3/>
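Python has no flip-flop operator, but the first one-liner can be approximated with an explicit flag. A rough sketch, with drop_script_lines as a made-up name, and the same caveat that anything sharing a line with a script tag is dropped:

```python
def drop_script_lines(lines):
    """Yield lines outside <script>...</script> ranges, dropping the tag lines too."""
    inside = False
    for line in lines:
        # Flip on when an opening tag is seen ...
        if not inside and '<script>' in line:
            inside = True
        if inside:
            # ... and off again on the line that holds the closing tag,
            # which may be the very same line
            if '</script>' in line:
                inside = False
            continue
        yield line
```

Used as ''.join(drop_script_lines(open('example.html'))), this keeps only the three <important.../> lines of the example above.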

Remove HTML comments from Markdown file

This would come in handy when converting from Markdown to HTML, for example, if one needs to prevent comments from appearing in the final HTML source.
Example input my.md:
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
<!--
... due to a general shortage in the Y market
TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->
Example output my-filtered.md:
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
On Linux, I would do something like this:
cat my.md | remove_html_comments > my-filtered.md
I am also able to write an AWK script that handles some common cases,
but as I understood, neither AWK nor any other of the common tools for simple text manipulation (like sed) are really up to this job. One would need to use an HTML parser.
How to write a proper remove_html_comments script, and with what tools?
I see from your comment that you mostly use Pandoc.
Pandoc version 2.0, released October 29, 2017, adds a new option --strip-comments. The related issue provides some context to this change.
Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments as part of the conversion process.
It might be a bit counter-intuitive, but I would use an HTML parser.
Example with Python and BeautifulSoup:
import sys
from bs4 import BeautifulSoup, Comment
md_input = sys.stdin.read()
soup = BeautifulSoup(md_input, "html5lib")
for element in soup(text=lambda text: isinstance(text, Comment)):
element.extract()
# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:
output = "".join(map(str, soup.find("body").contents))
print(output)
Output:
$ cat my.md | python md.py
# Contract Cancellation
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
It shouldn't break any other HTML you might have in your .md files (it might change the code formatting a bit, but not its meaning).
Of course, test it thoroughly if you decide to use it.
Edit – Try it out online here: https://repl.it/NQgG (input is read from input.md, not stdin)
This awk should work
$ awk -v FS="" '{ for(i=1; i<=NF; i++){if($i$(i+1)$(i+2)$(i+3)=="<!--"){i+=4; p=1} else if(!p && $i!="-->"){printf $i} else if($i$(i+1)$(i+2)=="-->") {i+=3; p=0;} } printf RS}' file
Dear Contractor X, due to delays in our imports, we would like to ...
best,
me
For better readability and explanation :
awk -v FS="" # Set null as field separator so that each character is treated as a field and it will prevent the formatting as well
'{
for(i=1; i<=NF; i++) # Iterate through each character
{
if($i$(i+1)$(i+2)$(i+3)=="<!--") # If combination of 4 chars makes a comment start tag
{ # then raise flag p and increment i by 4
i+=4; p=1
}
else if(!p && $i!="-->") # if p==0 then print the character
printf $i
else if($i$(i+1)$(i+2)=="-->") # if combination of 3 fields forms comment close tag
{ # then reset flag and increment i by 3
i+=3; p=0;
}
}
printf RS
}' file
If you open it with vim you could do:
:%s/<!--\_.\{-}-->//g
With \_. you allow the regular expression to match any character, including the newline; the \{-} makes the match lazy, otherwise you would lose all content from the first to the last comment.
I have tried to use the same expression with sed, but it won't work.
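Outside vim, the same lazy multi-line match can be sketched in Python, where re.DOTALL plays the role of \_. and .*? the role of \{-}. The function name is only illustrative, and like the vim command it does not try to tidy up the whitespace left behind.

```python
import re

def remove_html_comments(text: str) -> str:
    """Remove every <!-- ... --> comment, including ones spanning several lines."""
    # .*? is lazy, so each comment ends at the nearest -->;
    # DOTALL lets . cross line breaks
    return re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)

if __name__ == "__main__":
    import sys
    # Hypothetical usage: python remove_html_comments.py < my.md > my-filtered.md
    sys.stdout.write(remove_html_comments(sys.stdin.read()))
```

A multi-line comment that sat on its own lines leaves an empty line behind, and a trailing comment leaves the space that preceded it.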
My AWK solution, probably easier to understand than the one from @batMan, at least for high-level devs. The functionality should be about the same.
file remove_html_comments:
#!/usr/bin/awk -f
# Removes common, simple cases of HTML comments.
#
# Example:
# > cat my.html | remove_html_comments > my-filtered.html
#
# This is useful, for example, to pre-parse Markdown before generating
# an HTML or PDF document, to make sure the commented-out content
# does not end up in the final document, not even as a comment
# in source code.
#
# Example:
# > cat my.markdown | remove_html_comments | pandoc -o my-filtered.html
#
# Source: hoijui
# License: CC0 1.0 - https://creativecommons.org/publicdomain/zero/1.0/
BEGIN {
com_lvl = 0;
}
/<!--/ {
if (com_lvl == 0) {
line = $0
sub(/<!--.*/, "", line)
printf "%s", line
}
com_lvl = com_lvl + 1
}
com_lvl == 0
/-->/ {
if (com_lvl == 1) {
line = $0
sub(/.*-->/, "", line)
print line
}
com_lvl = com_lvl - 1;
}

Multiple Condition in grep regex

I want to use grep to search a file for all lines containing <div, except that I do not want lines between <!-- and -->.
I have this regex to find the lines containing <div: ^.*\<div.*$
I have this regex to exclude the lines between <!-- and -->: ^((?!\<!--.*?--\>).)*$ — but it doesn't work. It only matches when the comment is in a single line. How can I fix that?
How can I combine these in one grep line? Or do I have to type two greps?
grep does not support multiline searches like your search for <!-- ... -->. This can be worked around by using various helper commands, but in your case it's not worth it. It's better to just use a more powerful language, such as sed or AWK or Perl:
perl -ne '$on = 1 if m/<!--/; $on = "" if m/-->/; print if !$on and m/<div/' FILE
Edited to add: If you also want to discount instances of <!-- ... <div ... --> on a single line, you can write:
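perl -ne ' my $line = $_;
if ($in_comment) {
next unless s/.*?-->//; # still inside a comment: skip the line
$in_comment = 0;
}
s/<!--.*?-->//g; # drop comments that open and close on this line
$in_comment = 1 if s/<!--.*//s; # a comment opens here and continues
print $line if m/<div/
' FILE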
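The same bookkeeping may be easier to follow as a small Python generator. This is a sketch rather than a full HTML parser; div_lines_outside_comments is a made-up name, and it assumes comments are never nested.

```python
import re

def div_lines_outside_comments(lines):
    """Yield lines that contain <div outside of any <!-- ... --> comment."""
    in_comment = False
    for line in lines:
        visible = line
        if in_comment:
            end = visible.find('-->')
            if end == -1:
                continue              # the whole line is inside a comment
            visible = visible[end + 3:]
            in_comment = False
        # Drop comments that open and close on this line
        visible = re.sub(r'<!--.*?-->', '', visible)
        start = visible.find('<!--')
        if start != -1:               # a comment opens here and continues
            visible = visible[:start]
            in_comment = True
        if '<div' in visible:
            yield line
```

It yields the original line, not the stripped one, so the output looks like what grep would print.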

Convert CSS Style Attributes to HTML Attributes using Perl

Real quick background: We have a PDFMaker (HTMLDoc) that converts HTML into a PDF. HTMLDoc doesn't consistently pick up the styles that we need from the HTML that is provided to us by the client. Thus I'm trying to convert things such as style="width:80px;height:90px;" to width=80 height=90.
My attempt so far has revealed my limited understanding of back references and how to utilize them properly during Perl Regex. I can take an input file and convert it to an output file, but it only catches one "style" per line, and only replaces one name/value pair from that css.
I'm probably approaching this the wrong way but I can't figure out a faster or smarter way to do this in Perl. Any help would be greatly appreciated!
NOTE: The only attributes I'm trying to change for this particular script are "height", "width" and "border," because our client utilizes a tool that automatically applies styles to elements that they drag around with a WYSIWYG-style editor. Obviously, using a regex to strip these out of a lot of places works fairly well, as you just let the table cells be sized by their content, which looks okay, but I figured a quicker way to deal with the issue would just be to replace those three attributes with "width" "height" and "border" attributes, which behave mostly the same as their css counterparts (excepting that CSS allows you to actually customize the width, color, and style of the border, but all they ever use is solid 1px, so I can add a condition to replace "solid 1px" with "border=1". I realize these are not fully equivalent, but for this application it would be a step).
Here's what I've got so far:
#!/usr/bin/perl
if (!$ARGV[0] || !$ARGV[1])
{
print "Usage: converter.pl [input file] [output file] \n";
exit;
}
open FILE, "<", $ARGV[0] or die $!;
open OUTFILE, ">", $ARGV[1] or die $!;
my $line;
my $guts;
while ( <FILE> ) {
$line = $_ ;
$line =~ /style=\"(.+)\"/;
$guts = $1;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$name = $1;
$value = $2;
$guts = $name."=".$value;
$line =~ s/style=\"(.+)\"/$guts/g;
print OUTFILE $line ;
}
exit;
Note: This is NOT homework, and no I'm not asking you to do my job for me, this would end up being an internal tool that just sped up the process of formatting our incoming html to work properly in the pdf converter we have.
UPDATE
For those interested, I got an initial working version. This one only replaces width and height, the border attribute we're scrapping for now. But if anyone wanted to see how we did it, take a look...
#!/usr/bin/perl
## NOTES ##
# This script was made to simply replace style attributes with their name/value pair equivalents as attributes.
# It was designed to replace width and height attributes on a metric buttload of table elements from client data we got.
# As such, it's not really designed to handle more than that, and only strips the unit "PX" from the values.
# All of these can be modified in the second foreach loop, which checks for height and width.
if (!$ARGV[0] || !$ARGV[1])
{
print "Usage: quickvert.pl [input file] [output file] \n";
exit;
}
open FILE, "<", $ARGV[0] or die $!;
open OUTFILE, ">", $ARGV[1] or die $!;
my $line;
my $guts;
my $count = 1;
while ( <FILE> ) {
$line = $_ ;
my (@match) = $line =~ /style=\"(.+?)\"/g;
my $guts;
my $newguts;
foreach (@match) {
#print $_ ."\n";
$guts = $_;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$newguts = "";
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$value =~ s/px//g;
if ( $name =~ m/height/g || $name =~ m/width/g ) {
$newguts .= "$name='$value' ";
} else {
$newguts .= "";
}
}
#print "replacing $guts with $newguts on line $count \n";
$line =~ s/style=\"$guts\"/$newguts/i;
}
#print $newguts;
print OUTFILE $line ;
$count++;
}
exit;
You will have a very difficult time with this, for a few reasons:
Most things that can be accomplished with CSS can't be done with HTML attributes. To deal with this you'd either have to ignore or attempt to compensate for things like margins and padding, etc...
Many things that correspond between HTML attributes and CSS actually behave slightly differently, and you will need to account for this. To deal with this you would have to write specific code for each difference...
Because of the way CSS rules are applied, you basically need to use a complete CSS engine to parse and apply all of the rules before you will know what needs to be done at the element/attribute level. To deal with this you could just ignore anything except inline styles, but...
This work is almost as complicated as writing a rendering engine for a browser. You might be able to deal with a few specific cases, but even there your success rate would be haphazard at best.
EDIT: Given your very specific feature set, I can give you a little advice on your implementation:
You want to be case-insensitive and use a non-greedy match when looking for the value of the style attribute, i.e.:
$line =~ /style=\"(.+?)\"/i;
So that you only find stuff up to the very next double-quote, not the entire content of the line up to the last double quote. Also, you probably want to skip the line if the match isn't found, so:
next unless ($line =~ /style=\"(.+?)\"/i);
For parsing the guts, I'd use split instead of regex:
my $newguts;
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$newguts .= "$name='$value' ";
}
$line =~ s/style=\"$guts\"/$newguts/i;
Of course, this being Perl there are standard mantras such as always use strict and warnings, try to use named matches rather than $1, $2, etc., but I'm trying to restrict my advice to stuff that will move your solution forward right away.
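To illustrate the split-based parsing without the line-by-line bookkeeping, here is a minimal Python sketch that rewrites every style attribute in a string through a replacement callback. The names WANTED, style_to_attrs and convert_styles are made up, and it assumes that only width and height are wanted and that values are double-quoted and in px.

```python
import re

WANTED = {'width', 'height'}  # assumption: only these properties are converted

def style_to_attrs(match):
    """Turn one style="..." attribute into width/height HTML attributes."""
    attrs = []
    for declaration in match.group(1).split(';'):
        if ':' not in declaration:
            continue  # skip empty pieces such as the one after a trailing ';'
        name, value = (part.strip() for part in declaration.split(':', 1))
        if name.lower() in WANTED:
            value = re.sub(r'px$', '', value, flags=re.IGNORECASE)
            attrs.append(f'{name.lower()}="{value}"')
    return ' '.join(attrs)

def convert_styles(html: str) -> str:
    """Replace every style attribute; the lazy .+? stops at the next double quote."""
    return re.sub(r'style="(.+?)"', style_to_attrs, html, flags=re.IGNORECASE)
```

Because the callback runs once per match, several style attributes on one line are all handled. A style with nothing worth keeping is simply removed, which leaves a stray space inside the tag.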
Have a look on CPAN for HTML parsing modules like HTML::TreeBuilder, HTML::DOM or even XML modules like XML::LibXML.
Below is a quick example using HTML::TreeBuilder which adds a border="1" attribute to any tag that has a style attribute with border content:
use strict;
use warnings;
use feature 'say'; # needed for the say call at the end
use HTML::TreeBuilder;
my $data =q{
<html>
<head>
</head>
<body>
<h1>blah</h1>
<p style="color: red;">Red</p>
<span style="width:80px;height:90px;border: 1px solid #000000">Some text</span>
</body>
</html>
};
my $tree = HTML::TreeBuilder->new;
$tree->parse_content( $data );
for my $style ( $tree->look_down( sub { $_[0]->attr('style') } ) ) {
my $prop = $style->attr( 'style' );
$style->attr( 'border', 1 ) if $prop =~ m/border/;
}
say $tree->as_HTML;
Which will reproduce the HTML but with border="1" added just to the span tag.
Alongside these modules you can also have a look at CSS and CSS::DOM to help parse the CSS bit.
I don't know your stance on proprietary software, but PrinceXML is the best HTML to PDF converter available.