The code below reads the content of a file and puts it into the body of an email.
use strict;
my $filename = '.../text.txt';
open (my $ifh, '<', $filename)
or die "Could not open file '$filename' $!";
local $/ = undef;
my @row = (<$ifh>)[0..9];
close ($ifh);
print "@row\n";
my ($body) = @_;
my ($html_body)= @_;
.
.
.
print(MAIL "Subject: Important Announcement \n");
.
.
.
push(@$html_body, "<h1><b><font color= red ><u>ATTENTION!</u></b></h1></font><br>");
push(@$html_body, "@row");
.
.
.
print(MAIL "$body", "@$html_body");
close(MAIL);
Unfortunately, I am having trouble producing an email body with the same formatting as the text.txt file. The email comes out as a single line instead of three paragraphs.
The problem you're facing is that plain text contains no formatting information when placed inside an HTML document. End-of-line characters are ignored and treated just like ordinary white space. You need to add HTML tags to the text to convey the formatting you want, or you could wrap it in a pre tag, as that will display it "as is".
As mentioned by others in the comments above, your use of @_ doesn't make sense. It doesn't really make sense for $html_body to be treated like an array either, when all you're doing is appending HTML to it. So I've rewritten that chunk of code to use it as a scalar and append the HTML to it instead. I've also fixed some mistakes in the HTML, as you need to close tags in the reverse of the order in which you opened them.
print MAIL "Subject: Important Announcement \n";
print MAIL "\n"; # Need a blank line after the header to show it's finished
my $html_body = "<html><body>";
$html_body .= "<h1><b><font color=\"red\"><u>ATTENTION!</u></font></b></h1>";
$html_body .= "<pre>";
$html_body .= join("", @row);
$html_body .= "</pre>";
$html_body .= "</body></html>";
print MAIL $html_body;
close(MAIL);
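If a pre block is not wanted, the other option mentioned above (adding HTML tags to the text) might look roughly like the sketch below. It assumes paragraphs in the text file are separated by blank lines; the helper name text_to_html_paragraphs and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn blank-line-separated plain text into <p> elements.
sub text_to_html_paragraphs {
    my ($text) = @_;

    # Escape the characters that are special in HTML (single pass).
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;

    # Assumption: one or more blank lines separate paragraphs.
    my @paragraphs = grep { /\S/ } split /\n\s*\n/, $text;

    for my $paragraph (@paragraphs) {
        $paragraph =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        $paragraph =~ s/\n/<br>\n/g;     # keep single line breaks visible
    }
    return join "\n", map { "<p>$_</p>" } @paragraphs;
}

my $sample = "First paragraph.\n\nSecond paragraph,\nline two.\n\nThird paragraph.\n";
print text_to_html_paragraphs($sample), "\n";
```

The result could then be appended to $html_body in place of the pre block.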
First of all, @_ is the array of arguments passed to a subroutine, and it looks like you're not in one. So, doing:
my ($body) = @_;
my ($html_body) = @_;
sets both $body and $html_body to $_[0], which is undef.
How to fix?
There are two ways if you wrap it in a subroutine:
Use shift -> Which will make the above code look like:
my ($body) = shift;
my ($html_body)= shift;
Or,
my ($body, $html_body) = @_;
I would recommend the last one because it is less code and is more readable than the first one.
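As a minimal sketch of the recommended form, here is a complete subroutine that unpacks both arguments from @_ in one list assignment. The name build_message and the sample strings are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical subroutine: the name and arguments are illustrative only.
sub build_message {
    my ($body, $html_body) = @_;    # unpack both arguments in one step
    return $body . $html_body;
}

print build_message("Plain text part\n", "<p>HTML part</p>\n");
```

Outside a subroutine @_ is empty, which is why the original assignments produced undef.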
I have a Perl Script which makes a DB connection and displays the output in HTML format. The data which it's trying to display has tags embedded (<>) in it, hence the HTML does not get displayed. If I open the actual HTML file which the script generates using Notepad, I see the data. However I am unable to display it due to the tags. Any idea how this can be fixed?
#!/usr/bin/perl
use DBI;
use HTML::Escape 'escape_html';
unlink("D:\\Perl32\\scripts\\UndeliveredRAW.html");
my $host = '${Node.Caption}';
my $user = '${USER}';
my $pwd = '${PASSWORD}';
my $driver = "SQL Server";
$dbhslam = DBI->connect("dbi:ODBC:Driver=$driver;Server=$host;UID=$user;PWD=$pwd") || die "connect failed:";
$sthslam = $dbhslam->prepare("SELECT
DBA_Reports.dbo.undelivered_raw_host_msgs.ID,
DBA_Reports.dbo.undelivered_raw_host_msgs.MESSAGE
FROM
DBA_Reports.dbo.undelivered_raw_host_msgs");
$sthslam->execute;
$msg = "Up";
$Count = 0;
$Output = "";
$Temp = "";
$tbl = "<TABLE border=1 bordercolor=orange cellspacing=0 cellpadding=1>";
$tblhd = "<TR><TH>ID</TH><TH>MESSAGE</TH></TR>";
while (my $ref = $sthslam->fetchrow_hashref()) {
$Count++;
$Output .= '<TR><TD align=center rowspan=1 valign=top width=1000 height=1000>'
. $ref->{'ID'}.'</TD>'
. '<TD align=center rowspan=1 valign=top width=1000 height=1000>'
. escape_html($ref->{'MESSAGE'}).'</TD></TR>';
}
$dbhslam->disconnect;
$Output = "$tbl$tblhd$Output</TABLE>";
my $filename1 = 'D:\\Perl32\\Scripts\\UndeliveredRAW.html';
open(my $fh1, '>', $filename1) or die "Could not open file '$filename1' $!";
print $fh1 "$Output";
close $fh1;
if ($Count > 0) {
$msg = $Output;
}
print "\nMessage: $msg";
print "\nStatistic: $Count";
Desired Output
Contents of HTML generated
The following snippet is a slightly modified version of the posted code.
See the loop for how escape_html(...) is applied to the data obtained from the database.
#!/usr/bin/env perl
#
# vim: ai ts=4 sw=4
use strict;
use warnings;
use feature 'say';
use DBI;
use HTML::Escape qw/escape_html/;
my $filename = 'D:\Perl32\scripts\UndeliveredRAW.html';
unlink($filename) if -e $filename;
my $host = '${Node.Caption}';
my $user = '${USER}';
my $pwd = '${PASSWORD}';
my $driver = 'SQL Server';
my $dbh = DBI->connect("dbi:ODBC:Driver=$driver;Server=$host;UID=$user;PWD=$pwd")
or die 'DB connect failed:';
my $query = '
SELECT
DBA_Reports.dbo.undelivered_raw_host_msgs.MESSAGE
FROM
DBA_Reports.dbo.undelivered_raw_host_msgs
';
my $sth = $dbh->prepare($query) or die $dbh->errstr;
$sth->execute or die $sth->errstr;
my $msg = 'Up';
my $Count = 0;
my $tbl = '
<TABLE border=1 bordercolor=orange cellspacing=0 cellpadding=1>
<TR><TH>MESSAGE</TH></TR>
';
while (my $ref = $sth->fetchrow_hashref()) {
$Count++;
$tbl .= "\n\t<TR><TD align=center rowspan=1 valign=top width=5000 height=5000>"
. escape_html($ref->{'MESSAGE'})
. '</TD></TR>';
}
$tbl .= '
</TABLE>
';
my $html =
'<!DOCTYPE html>
<html>
<head>
<title>Undelivered RAW</title>
</head>
<body>
<h1>DB table data</h1>
' . $tbl . '
</body>
</html>
';
open my $fh, '>', $filename
or die "Could not open file '$filename' $!";
print $fh $html;
close $fh;
if ($Count > 0) {
say 'Message: ' . $msg;
say 'Statistic: ' . $Count;
}
Note: to avoid polluting the code with HTML style attributes, find some time to learn CSS. Also, your generated HTML does not include the required DOCTYPE, html, head, title and body sections.
Reference:
DBI
HTML::Escape
DBI/DBD::ODBC Tutorial
CSS
HTML
HTML has a well-understood mechanism to include characters that would normally be interpreted as special characters. For example, if you want to include a < in your HTML, that would normally be seen as starting a new HTML element in your document.
The solution is to replace those problematic characters with HTML entities that represent those characters. For example, < should be replaced with &lt;. Note that this means the ampersand (&) needs to be added to the set of characters that should be replaced (in this case by &amp;) if you want to include it in your HTML.
Perl has a long history of being used on the web, so it's no surprise that there are many tools available to carry out this replacement. HTML::Escape is probably the best known. It supplies a single function (escape_html()) which takes a text string and returns that same string with all of the problematic characters replaced by the appropriate entities.
use HTML::Escape 'escape_html';
my $html = '<some text> & <some other text>';
my $escaped_html = escape_html($html);
After running this code, $escaped_html now contains "&lt;some text&gt; &amp; &lt;some other text&gt;". And if you send that text to a browser, you will get the correct output displayed.
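To make the replacement concrete, here is a hand-rolled approximation that uses only core Perl. It is a sketch for illustration, not the module's actual implementation; the names simple_escape and %ENTITY_FOR are hypothetical, and in real code escape_html() from HTML::Escape is the better choice.

```perl
use strict;
use warnings;

# Hand-rolled approximation for illustration only; prefer escape_html()
# from HTML::Escape in real code. The names here are hypothetical.
my %ENTITY_FOR = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

sub simple_escape {
    my ($text) = @_;
    # A single pass over the string, so an inserted "&" is never escaped twice.
    $text =~ s/([&<>"'])/$ENTITY_FOR{$1}/g;
    return $text;
}

print simple_escape('<some text> & <some other text>'), "\n";
# prints: &lt;some text&gt; &amp; &lt;some other text&gt;
```

Doing the substitution in one pass matters: replacing < first and & afterwards in separate steps would turn &lt; into &amp;lt;.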
So the easiest solution is to load HTML::Escape at the top of your program and then call escape_html() whenever you're adding potentially problematic strings to your output. That means your while loop would look something like this:
while (my $ref = $sthslam->fetchrow_hashref()) {
$Count++;
$Output .= '<TR><TD align=center rowspan=1 valign=top width=5000 height=5000>'
. escape_html($ref->{'MESSAGE'})
. '</TD></TR>';
}
Note that I've removed the $Temp variable (which didn't seem to be doing anything useful) and switched to using .= to build up your output string. .= is the "assignment concatenation" operator - it adds the new string on its right to the end of whatever currently exists in the variable on its left.
You seem to be learning Perl on the job (which is great) but it's a real shame that you're learning it in an environment that seems to use techniques that have been outdated for about twenty years. Your question is a good example of why trying to build up raw HTML strings inside your Perl code is a bad idea. It's a far better idea to use a templating engine of some kind (the de facto standard in the Perl world seems to be the Template Toolkit).
I also recommend looking at Cascading Style Sheets as a more modern approach to styling your HTML output.
I know that HTML::Parser is a thing and from reading around, I've realized that trying to parse HTML with regex is usually a suboptimal way of doing things, but for a Perl class I'm currently trying to use regular expressions (hopefully just a single match) to identify and store the sentences from a saved HTML doc. Eventually I want to be able to calculate the number of sentences, words/sentence and hopefully average length of words on the page.
For now, I've just tried to isolate things which follow ">" and precede a ". " just to see what if anything it isolates, but I can't get the code to run, even when manipulating the regular expression. So I'm not sure if the issue is in the regex, somewhere else or both. Any help would be appreciated!
#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;
open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;
print "<pre>";
###Main Program###
&sentences;
###sentence identifier sub###
sub sentences {
@sentences;
while ($html =~ />[^<]\. /gis) {
push @sentences, $1;
}
#for debugging, comment out when running
print join("\n",@sentences);
}
print "</pre>";
Your regex should be />[^<]*?\./gis
The *? means match zero or more, non-greedily. As it stood, your regex would match only a single non-< character followed by a period and a space. This way it will match all non-< characters up to the first period.
There may be other problems.
Now read this
A first improvement would be to write $html =~ />([^<.]+)\. /gs. You need to capture the match with parentheses, and to allow more than one character per sentence ;--)
This does not get all the sentences though, just the first one in each element.
A better way would be to capture all the text, then extract sentences from each fragment
while( $html=~ m{>([^<]*)<}g) { push @text_content, $1};
foreach (@text_content) { while( m{([^.]*)\.}gs) { push @sentences, $1; } }
(untested because it's early in the morning and coffee is calling)
All the usual caveats about parsing HTML with regexps apply, most notably the presence of '>' in the text.
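The two-step approach above can be fleshed out into a small runnable sketch that also computes the counts the question asks for. The sample HTML, the helper name extract_sentences, and the choice to treat ., ! and ? as sentence terminators are all assumptions for illustration; the same caveats about regexes and HTML apply.

```perl
use strict;
use warnings;

# Hypothetical helper: collect text between tags, then split it into sentences.
sub extract_sentences {
    my ($html) = @_;

    # Step 1: grab every run of text that sits between a ">" and the next "<".
    my @text_content;
    while ($html =~ m{>([^<]+)<}g) { push @text_content, $1 }

    # Step 2: split each fragment on sentence terminators (assumed: . ! ?).
    my @sentences;
    for my $fragment (@text_content) {
        while ($fragment =~ m{([^.!?]+)[.!?]}g) {
            my $sentence = $1;
            $sentence =~ s/^\s+|\s+$//g;
            push @sentences, $sentence if length $sentence;
        }
    }
    return @sentences;
}

# Made-up sample document.
my $html = '<html><body><p>One two three. Four five.</p>'
         . '<div>Six seven eight nine.</div></body></html>';

my @sentences = extract_sentences($html);
my $word_total = 0;
for my $sentence (@sentences) {
    my @words = split ' ', $sentence;
    $word_total += @words;
}

printf "Sentences: %d\n", scalar @sentences;
printf "Words per sentence: %.2f\n", $word_total / @sentences if @sentences;
```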
I think this does more or less what you need. Keep in mind that this script only looks at text inside p tags. The file name is passed in as a command line argument (shift).
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Grabber;
my $file_location = shift;
print "\n\nfile: $file_location";
my $totalWordCount = 0;
my $sentenceCount = 0;
my $wordsInSentenceCount = 0;
my $averageWordsPerSentence = 0;
my $char_count = 0;
my $contents;
my $rounded;
my $rounded2;
open ( my $file, '<', $file_location ) or die "cannot open < file: $!";
while( my $line = <$file>){
$contents .= $line;
}
close( $file );
my $dom = HTML::Grabber->new( html => $contents );
$dom->find('p')->each( sub{
my $p_tag = $_->text;
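++$totalWordCount while $p_tag =~ /\S+/g;
# Count sentence terminators before touching the string
++$sentenceCount while $p_tag =~ /[.!?]+/g;
# Count non-whitespace characters once per paragraph, on a copy
(my $compact = $p_tag) =~ s/\s//g;
$char_count += length $compact;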
});
print "\n Total Words: $totalWordCount\n";
print " Total Sentences: $sentenceCount\n";
$rounded = $totalWordCount / $sentenceCount;
print " Average words per sentence: $rounded.\n\n";
print " Total Characters: $char_count.\n";
my $averageCharsPerWord = $char_count / $totalWordCount ;
$rounded2 = sprintf("%.2f", $averageCharsPerWord );
print " Average characters per word: $rounded2.\n\n";
What I am trying to do is get the contents of a file from another server. Since I'm not well versed in Perl, and don't know its modules and functions, I've gone about it this way:
my $fileContents;
if( $md5Con =~ m/\.php$/g ) {
my $ftp = Net::FTP->new($DB_ftpserver, Debug => 0) or die "Cannot connect to some.host.name: $@";
$ftp->login($DB_ftpuser, $DB_ftppass) or die "Cannot login ", $ftp->message;
$ftp->get("/" . $root . $webpage, "c:/perlscripts/" . md5_hex($md5Con) . "-code.php") or die $ftp->message;
open FILE, ">>c:/perlscripts/" . md5_hex($md5Con) . "-code.php" or die $!;
$fileContents = <FILE>;
close(FILE);
unlink("c:/perlscripts/" . md5_hex($md5Con) . "-code.php");
$ftp->quit;
}
What I thought I'd do is get the file from the server, put it on my local machine, edit the content, upload it to wherever, and then delete the temp file.
But I cannot seem to figure out how to get the contents of the file:
open FILE, ">>c:/perlscripts/" . md5_hex($md5Con) . "-code.php" or die $!;
$fileContents = <FILE>;
close(FILE);
I keep getting this error:
Use of uninitialized value $fileContents
which I'm guessing means it isn't returning a value.
Any help much appreciated.
>>>>>>>>>> EDIT <<<<<<<<<<
my $fileContents;
if( $md5Con =~ m/\.php$/g ) {
my $ftp = Net::FTP->new($DB_ftpserver, Debug => 0) or die "Cannot connect to some.host.name: $@";
$ftp->login($DB_ftpuser, $DB_ftppass) or die "Cannot login ", $ftp->message;
$ftp->get("/" . $root . $webpage, "c:/perlscripts/" . md5_hex($md5Con) . "-code.php") or die $ftp->message;
my $file = "c:/perlscripts/" . md5_hex($md5Con) . "-code.php";
{
local( $/ ); # undefine the record separator
open FILE, "<", $file or die "Cannot open:$!\n";
my $fileContents = <FILE>;
#print $fileContents;
my $bodyContents;
my $headContents;
if( $fileContents =~ m/<\s*body[^>]*>.*$/gi ) {
print $0 . $1 . "\n";
$bodyContents = $dbh->quote($1);
}
if( $fileContents =~ m/^.*<\/head>/gi ) {
print $0 . $1 . "\n";
$headContents = $dbh->quote($1);
}
$bodyTable = $dbh->quote($bodyTable);
$headerTable = $dbh->quote($headerTable);
$dbh->do($createBodyTable) or die " error: Couldn't create body table: " . DBI->errstr;
$dbh->do($createHeadTable) or die " error: Couldn't create header table: " . DBI->errstr;
$dbh->do("INSERT INTO $headerTable ( headData, headDataOutput ) VALUES ( $headContents, $headContents )") or die " error: Couldn't connect to database: " . DBI->errstr;
$dbh->do("INSERT INTO $bodyTable ( bodyData, bodyDataOutput ) VALUES ( $bodyContents, $bodyContents )") or die " error: Couldn't connect to database: " . DBI->errstr;
$dbh->do("INSERT INTO page_names (linkFromRoot, linkTrue, page_name, table_name, navigation, location) VALUES ( $linkFromRoot, $linkTrue, $page_name, $table_name, $navigation, $location )") or die " error: Couldn't connect to database: " . DBI->errstr;
unlink("c:/perlscripts/" . md5_hex($md5Con) . "-code.php");
}
$ftp->quit;
}
The above, using print, WILL print the whole file. BUT, for some reason the two regular expressions are returning false. Any idea why?
if( $fileContents =~ m/<\s*body[^>]*>.*$/gi ) {
print $0 . $1 . "\n";
$bodyContents = $dbh->quote($1);
}
if( $fileContents =~ m/^.*<\/head>/gi ) {
print $0 . $1 . "\n";
$headContents = $dbh->quote($1);
}
This is covered in section 5 of the Perl FAQ included with the standard distribution.
How can I read in an entire file all at once?
You can use the slurp method of Path::Class::File to do it in one step.
use Path::Class;
$all_of_it = file($filename)->slurp; # entire file in scalar
@all_lines = file($filename)->slurp; # one line per element
The customary Perl approach for processing all the lines in a file is to do so one line at a time:
open (INPUT, $file) || die "can't open $file: $!";
while (<INPUT>) {
chomp;
# do something with $_
}
close(INPUT) || die "can't close $file: $!";
This is tremendously more efficient than reading the entire file into memory as an array of lines and then processing it one element at a time, which is often—if not almost always—the wrong approach. Whenever you see someone do this:
@lines = <INPUT>;
you should think long and hard about why you need everything loaded at once. It's just not a scalable solution. You might also find it more fun to use the standard Tie::File module, or the DB_File module's $DB_RECNO bindings, which allow you to tie an array to a file so that accessing an element of the array actually accesses the corresponding line in the file.
You can read the entire filehandle contents into a scalar.
{
local(*INPUT, $/);
open (INPUT, $file) || die "can't open $file: $!";
$var = <INPUT>;
}
That temporarily undefs your record separator, and will automatically close the file at block exit. If the file is already open, just use this:
$var = do { local $/; <INPUT> };
For ordinary files you can also use the read function.
read( INPUT, $var, -s INPUT );
The third argument tests the byte size of the data on the INPUT filehandle and reads that many bytes into the buffer $var.
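For comparison, here is a self-contained sketch of the "undef the record separator" technique using a lexical filehandle and three-argument open. The helper name slurp_file is hypothetical, and the temporary file exists only so the example can run on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file into one scalar.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    local $/;                 # undef record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Create a throwaway file so the sketch is runnable as-is.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "line one\nline two\n";
close $tmp_fh;

my $text = slurp_file($tmp_name);
print $text;
```

Note the '<' mode: opening the file with '>>' (append) gives a handle that cannot be read from, so <FILE> returns undef.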
Use Path::Class::File::slurp if you want to read all file contents in one go.
However, more importantly, use an HTML parser to parse HTML.
open FILE, "c:/perlscripts/" . md5_hex($md5Con) . "-code.php" or die $!;
while (<FILE>) {
# each line is in $_
}
close(FILE);
will open the file and allow you to process it line-by-line (if that's what you want - otherwise investigate binmode). I think the problem is in your prepending the filename to open with >>. See this tutorial for more info.
I note you're also using regular expressions to parse HTML. Generally I would recommend using a parser to do this (e.g. see HTML::Parser). Regular expressions aren't suited to HTML due to HTML's lack of regularity, and won't work reliably in general cases.
Also, if you are in need of editing the contents of the files take a look at the CPAN module
Tie::File
This module relieves you of the need to create a temp file for editing the content
and writing it back to the same file.
EDIT:
What you are looking for is a way to slurp the file. Maybe you have to undefine
the record separator variable $/
The below code works fine for me:
use strict;
my $file = "test.txt";
{
local( $/ ); # undefine the record separator
open FILE, "<", $file or die "Cannot open:$!\n";
my $lines =<FILE>;
print $lines;
}
Also see the section "Traditional Slurping" in this article.
BUT, for some reason the two regular expressions are returning false. Any idea why?
. in a regular expression by default matches any character except newline. Presumably you have newlines before the </head> tag and after the <body> tag. To make . match any character including newlines, use the //s flag.
I'm not sure what your print $0 . $1 ... code is about; you aren't capturing anything in your matches to be stored in $1, and $0 isn't a variable used for regular expression captures, it's something very different.
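A minimal sketch of both fixes together, the /s flag and explicit capturing parentheses, might look like this. The variable names are borrowed from the question; the sample document is made up, and a real HTML parser remains the more reliable route.

```perl
use strict;
use warnings;

# Made-up sample document with newlines around the head and body tags.
my $fileContents = "<html>\n<head>\n<title>Demo</title>\n</head>\n"
                 . "<body class=\"x\">\n<p>Hello</p>\n</body>\n</html>\n";

# /s lets "." match newlines; the parentheses capture what the match found.
# A match in list context returns the captures, so no $1 is needed.
my ($headContents) = $fileContents =~ m{^(.*</head>)}si;
my ($bodyContents) = $fileContents =~ m{(<\s*body[^>]*>.*)$}si;

print "HEAD:\n$headContents\n\n" if defined $headContents;
print "BODY:\n$bodyContents\n"   if defined $bodyContents;
```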
If you want to get the content of the file,
@lines = <FILE>;
Use File::Slurp::Tiny. As convenient as File::Slurp, but without the bugs.
Is there a way to extract an HTML page title using Perl? I know it can be passed as a hidden variable during form submit and then retrieved in Perl that way, but I was wondering if there is a way to do this without the submit?
Like, let's say I have an HTML page like this:
<html><head><title>TEST</title></head></html>
and then in Perl I want to do :
$q -> h1('something');
How can I replace 'something' dynamically with what is contained in <title> tags?
I would use pQuery. It works just like jQuery.
You can say:
use pQuery;
my $page = pQuery("http://google.com/");
my $title = $page->find('title');
say "The title is: ", $title->html;
Replacing stuff is similar:
$title->html('New Title');
say "The entirety of google.com with my new title is: ", $page->html;
You can pass an HTML string to the pQuery constructor, which it sounds like you want to do.
Finally, if you want to use arbitrary HTML as a "template", and then "refine" that with Perl commands, you want to use Template::Refine.
HTML::HeadParser does this for you.
It's not clear to me what you are asking. You seem to be talking about something that could run in the user's browser, or at least something that already has an html page loaded.
If that's not the case, the answer is URI::Title.
use strict;
use LWP::Simple;
my $url = 'http://www.google.com'|| die "Specify URL on the cmd line";
my $html = get ($url);
$html =~ m{<TITLE>(.*?)</TITLE>}gism;
print "$1\n";
The previous answer is wrong. If the HTML title tag is used more than once, this can easily be overcome by checking that the title tag is valid (no tags in between).
my ($title) = $test_content =~ m/<title>([a-zA-Z\/][^>]+)<\/title>/si;
Get the title name from the file.
my $spool = 0;
open my $fh, "<", $absPath or die $!;
#open ($fh, "<$tempfile" );
# write the opening bracket
print WFL "[";
while (<$fh>) {
# removes the new line from the line read
chomp;
# removes the leading and trailing spaces.
$_=~ s/^\s+|\s+$//g;
# case where the <title> and </title> occur on one line
# we print and exit in one instant
if (($_=~/$startstring/i)&&($_=~/$endstring/i)) {
print WFL "'";
my ($title) = $_=~ m/$startstring(.+)$endstring/si;
print WFL "$title";
print WFL "',";
last;
}
# case when the <title> is in one line and </title> is in other line
#starting <title> string is found in the line
elsif ($_=~/$startstring/i) {
print WFL "'";
# extract everything after <title> but nothing before <title>
my ($title) = $_=~ m/$startstring(.+)/si;
print WFL "$title";
$spool = 1;
}
# ending string </title> is found
elsif ($_=~/$endstring/i) {
# read everything before </title> and nothing above that
my ($title) = $_=~ m/(.+)$endstring/si;
print WFL " ";
print WFL "$title";
print WFL "',";
$spool = 0;
last;
}
# this is useful for reading all lines between <title> and </title>
elsif ($spool == 1) {
print WFL " ";
print WFL "$_";
}
}
close $fh;
# end of getting the title name
If you just want to extract the page title you can use a regular expression. I believe that would be something like:
my ($title) = $html =~ m/<title>(.+)<\/title>/si;
where your HTML page is stored in the string $html. In si, the s stands for single line mode (i.e., the dot also matches a newline) and i for ignore case.
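A slightly more defensive variant of that regular expression is sketched below: it uses a non-greedy match so a second title element later in the document is not swallowed, tolerates attributes on the tag, and trims whitespace. The helper name extract_title is hypothetical, and this is still a regex heuristic rather than real HTML parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first <title> text, or nothing if absent.
sub extract_title {
    my ($html) = @_;
    # .*? is non-greedy, /s lets "." cross newlines, /i ignores case.
    my ($title) = $html =~ m{<title[^>]*>(.*?)</title>}si;
    return unless defined $title;
    $title =~ s/^\s+|\s+$//g;
    return $title;
}

my $html  = '<html><head><title>TEST</title></head></html>';
my $title = extract_title($html);
print "Title: $title\n" if defined $title;
```

The returned string could then be passed to $q->h1($title) in place of the literal 'something'.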
Real quick background: We have a PDFMaker (HTMLDoc) that converts HTML into a PDF. HTMLDoc doesn't consistently pick up the styles that we need from the HTML that is provided to us by the client. Thus I'm trying to convert things such as style="width:80px;height:90px;" to width=80 height=90.
My attempt so far has revealed my limited understanding of back references and how to utilize them properly during Perl Regex. I can take an input file and convert it to an output file, but it only catches one "style" per line, and only replaces one name/value pair from that css.
I'm probably approaching this the wrong way but I can't figure out a faster or smarter way to do this in Perl. Any help would be greatly appreciated!
NOTE: The only attributes I'm trying to change for this particular script are "height", "width" and "border," because our client utilizes a tool that automatically applies styles to elements that they drag around with a WYSIWYG-style editor. Obviously, using a regex to strip these out of a lot of places works fairly well, as you just let the table cells be sized by their content, which looks okay, but I figured a quicker way to deal with the issue would just be to replace those three attributes with "width" "height" and "border" attributes, which behave mostly the same as their css counterparts (excepting that CSS allows you to actually customize the width, color, and style of the border, but all they ever use is solid 1px, so I can add a condition to replace "solid 1px" with "border=1". I realize these are not fully equivalent, but for this application it would be a step).
Here's what I've got so far:
#!/usr/bin/perl
if (!$ARGV[0] || !$ARGV[1])
{
print "Usage: converter.pl [input file] [output file] \n";
exit;
}
open FILE, "<", $ARGV[0] or die $!;
open OUTFILE, ">", $ARGV[1] or die $!;
my $line;
my $guts;
while ( <FILE> ) {
$line = $_ ;
$line =~ /style=\"(.+)\"/;
$guts = $1;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$name = $1;
$value = $2;
$guts = $name."=".$value;
$line =~ s/style=\"(.+)\"/$guts/g;
print OUTFILE $line ;
}
exit;
Note: This is NOT homework, and no I'm not asking you to do my job for me, this would end up being an internal tool that just sped up the process of formatting our incoming html to work properly in the pdf converter we have.
UPDATE
For those interested, I got an initial working version. This one only replaces width and height, the border attribute we're scrapping for now. But if anyone wanted to see how we did it, take a look...
#!/usr/bin/perl
## NOTES ##
# This script was made to simply replace style attributes with their name/value pair equivalents as attributes.
# It was designed to replace width and height attributes on a metric buttload of table elements from client data we got.
# As such, it's not really designed to handle more than that, and only strips the unit "PX" from the values.
# All of these can be modified in the second foreach loop, which checks for height and width.
if (!$ARGV[0] || !$ARGV[1])
{
print "Usage: quickvert.pl [input file] [output file] \n";
exit;
}
open FILE, "<", $ARGV[0] or die $!;
open OUTFILE, ">", $ARGV[1] or die $!;
my $line;
my $guts;
my $count = 1;
while ( <FILE> ) {
$line = $_ ;
my (@match) = $line =~ /style=\"(.+?)\"/g;
my $guts;
my $newguts;
foreach (@match) {
#print $_ ."\n";
$guts = $_;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$newguts = "";
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$value =~ s/px//g;
if ( $name =~ m/height/g || $name =~ m/width/g ) {
$newguts .= "$name='$value' ";
} else {
$newguts .= "";
}
}
#print "replacing $guts with $newguts on line $count \n";
$line =~ s/style=\"\Q$guts\E\"/$newguts/i;
}
#print $newguts;
print OUTFILE $line ;
$count++;
}
exit;
You will have a very difficult time with this, for a few reasons:
Most things that can be accomplished with CSS can't be done with HTML attributes. To deal with this you'd either have to ignore or attempt to compensate for things like margins and padding, etc...
Many things that correspond between HTML attributes and CSS actually behave slightly differently, and you will need to account for this. To deal with this you would have to write specific code for each difference...
Because of the way CSS rules are applied, you basically need to use a complete CSS engine to parse and apply all of the rules before you will know what needs to be done at the element/attribute level. To deal with this you could just ignore anything except inline styles, but...
This work is almost as complicated as writing a rendering engine for a browser. You might be able to deal with a few specific cases, but even there your success rate would be haphazard at best.
EDIT: Given your very specific feature set, I can give you a little advice on your implementation:
You want to be case-insensitive and use a non-greedy match when looking for the value of the style attribute, i.e.:
$line =~ /style=\"(.+?)\"/i;
So that you only find stuff up to the very next double-quote, not the entire content of the line up to the last double quote. Also, you probably want to skip the line if the match isn't found, so:
next unless ($line =~ /style=\"(.+?)\"/i);
For parsing the guts, I'd use split instead of regex:
my $newguts;
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$newguts .= "$name='$value' ";
}
$line =~ s/style=\"\Q$guts\E\"/$newguts/i;
Of course, this being Perl there are standard mantras such as always use strict and warnings, try to use named matches rather than $1, $2, etc., but I'm trying to restrict my advice to stuff that will move your solution forward right away.
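Putting those pieces together, a rough sketch that handles every style attribute on a line (not just the first) could use a substitution with the /e flag, which evaluates the replacement as code. The helper name style_to_attributes and the sample line are made up; the sketch assumes only width and height are wanted and that values are in px, as described in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one style value into width/height attributes.
sub style_to_attributes {
    my ($style) = @_;
    my @attrs;
    for my $declaration (split /;/, $style) {
        my ($name, $value) = split /:/, $declaration, 2;
        next unless defined $name && defined $value;
        s/^\s+|\s+$//g for $name, $value;               # trim both parts
        next unless $name =~ /^(?:width|height)$/i;     # assumed: only these two
        $value =~ s/px$//i;                             # strip the px unit
        push @attrs, lc($name) . qq{="$value"};
    }
    return join ' ', @attrs;
}

# Made-up input line with two style attributes.
my $line = '<td style="width:80px;height:90px;">a</td>'
         . '<td style="color:red;width:10px">b</td>';

# /g replaces every style attribute on the line; /e calls the helper for each.
$line =~ s/style="([^"]*)"/style_to_attributes($1)/gie;
print "$line\n";
```

Because the replacement is computed from the captured text, there is no need to interpolate $guts back into a second pattern.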
Have a look on CPAN for HTML parsing modules like HTML::TreeBuilder, HTML::DOM or even XML modules like XML::LibXML.
Below is a quick example using HTML::TreeBuilder which adds a border="1" attribute to any tag that has a style attribute with border content:
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
my $data =q{
<html>
<head>
</head>
<body>
<h1>blah</h1>
<p style="color: red;">Red</p>
<span style="width:80px;height:90px;border: 1px solid #000000">Some text</span>
</body>
</html>
};
my $tree = HTML::TreeBuilder->new;
$tree->parse_content( $data );
for my $style ( $tree->look_down( sub { $_[0]->attr('style') } ) ) {
my $prop = $style->attr( 'style' );
$style->attr( 'border', 1 ) if $prop =~ m/border/;
}
say $tree->as_HTML;
Which will reproduce the HTML but with border="1" added just to the span tag.
Alongside these modules you can also have a look at CSS and CSS::DOM to help parse the CSS bit.
I don't know your stance on proprietary software, but PrinceXML is the best HTML to PDF converter available.