What is the issue with this CGI/HTML/Perl program? [closed] - html

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
I am using XAMMP Apache Server. There is a text file named 'animal.txt' containing the name of an animal on each line, with a basic description.
I am trying to write a Perl program, using HTML and CGI. The aim of the program is to have the user search an animal name, using a HTML form. This will then link to a Perl program which loops through the animal.txt file and reads the file line by line. The program will take any line that matches the users search from the original HTML form, and print all animals containing the same name.
So far this is where I am :
(Clientside form)
#!\xampp\perl\bin\perl.exe
use CGI qw/:standard/; # load standard CGI routines
use CGI::Carp('fatalsToBrowser');
print header(); # create the HTTP header
print <<HTML
<head>
<title>Shop Here</title>
</head>
<body>
<h1>Animal list search</h1>
A basic form <br />
<form action="dumpsVar2.pl">
Search: <input type="text", name="Search" size=5><br><br>
Submit: <input type="submit" name="select" size="7"><br>
</form>
</body>
</html>
HTML
And then the perl program: (Server-Side)
#!\xampp\perl\bin\perl.exe
use CGI qw(:standard);
use CGI::Carp('fatalsToBrowser');
$query = new CGI;
#parameters = $query -> param;
print header, start_html("Parameters");
print "$0 was passed these parameters:<br> <br> ";
foreach $name (#parameters) {
$value = $query -> param($name);
print p("$name = $value");
}
$inFile = "animal.txt";
open (IN, $inFile) or die "Can't find file: $inFile";
#animals = (<IN>);
$item = param;
foreach $line (<IN>) {
if ($line =~ /$item/) {
print "$item";
}
}
print end_html;

Several issues here, but the crux of the matter is this code.
#animals = (<IN>);
$item = param;
foreach $line (<IN>) {
if ($line =~ /$item/) {
print "$item";
}
}
Let's look at each line in turn:
#animals = (<IN>);
This reads all of the data from IN into the array #animals. It also leaves IN's file pointer at the end of the file. Any further attempts to read data from IN will fail.
$item = param;
If you call param with no arguments, you get a list of the parameter names that were found in the CGI request. As you're assigning this list to a scalar value, this behaviour changes and you'll get the number of parameters. In your system this will always be 1. So $item contains the value 1.
foreach $line (<IN>) {
Remember how you read all of the data from IN a couple of lines back? Well you're trying to read more data from it here. And that's not going to work. I think you probably wanted #animals here, not <IN>. Currently your foreach is never executed as on the first iteration the call to <IN> returns undef - which is false.
if ($line =~ /$item/) {
Let's assume that you've replaced <IN> with #animals in your foreach loop - so that the loop body is actually executed. This still isn't doing what you wanted. Remember that $item contains 1 rather than the name of an animal to search for. And I doubt that you have an animal called "1".
What you probably want is something more like this:
my $animal = param('Search');
while (<IN>) {
print if /$animal/;
}
I'd also point out that learning CGI in 2014 is pretty ridiculous. You would be far better off looking at a simple Perl web framework like Web::Simple or Dancer.

I think your problem is
$item = param;
Which is putting the number of parameters in your form, in this case 2, into $item. I doubt you have an animal named 2
Change it to
$item = param('Search');

Related

Checking for alphanumeric characters and getting input from HTML form

I'm fairly new to programming in Perl and I have a couple of compilation issues I can't seem to resolve. My program gets input from this HTML form.
Question: Should my form use the post or get method?
<FORM action="./cgi-bin/Perl.pl" method="GET">
<br>
Full name: <br><input type="text" name="full_name" maxlength="20"><br>
Username: <br><input type="text" name="user_name" maxlength="8"><br>
Password: <br><input type="password" name="password" maxlength="15"><br>
Confirm password: <br><input type="password" name="new_password" maxlength="15"><br>
I open a CSV file, write the value of user_name into an array and do a number of checks on the user's input.
Problem #1: I need to check that full_name, user_name, password, and new_password are all alphanumeric or a space but I keep getting multiple errors that look like:
Use of uninitialized value $full_name in string eq at Perl.pl line 33
I don't think I've used CGI correctly to get these values from the form. I also believe I'm not correctly checking for alphanumeric characters. How can I resolve this?
Problem #2: I need to redirect the user to a specific webpage if their passwords don't match and if the username is already taken. I used a meta redirect but it's not doing it successfully. How can I display a proper error page?
This is my code:
#!/usr/bin/perl
use CGI qw(:standard);
use strict;
use warnings;
print "Content-type: text/html\n\n";
#opening Members.csv for reading
my $file = '/home/2014/amosqu/public_html/cgi-bin/Members.csv';
open(my $csv, '<', $file) || die "Could not open your file";
#getting these from HTML form
my $full_name = param('full_name');
my $user_name= param('user_name');
my $password = param('password');
my $new_password = param('new_password');
my #users = ();
#splitting each line of csv file
foreach (<$csv>) {
chomp;
my #fields = split (/\,/);
push #users, $fields[1]; #put all usernames inside of array
}
close $csv;
#opening Members.csv for appending
open(my $fh, '>>', $file) || die "Could not open your file";
#SOURCE OF PROBLEM 1
#checking that all values are alphanumeric
if(($full_name && $user_name && $password && $new_password) eq /\A[[:alnum:]]+\z/) {
#if passwords don't match, redirect to error page
if($password ne $new_password){
print qq(<html>\n);
print qq(<head>\n);
print qq(<title> Passwords don't match. </title> \n);
print qq{<meta http-equiv="refresh"content="5;URL="http://www.cs.mcgill.ca/~amosqu/registration.html">\n};
print qq(</head>\n);
print qq(<body>\n);
print qq(<b><center> Passwords don't match </b></center>\n\n);
print qq(</body>\n);
print qq(</html>\n);
}
#if they do match, check that user name isn't in Members.csv
else {
if(grep (/$user_name/, #users)) {
print qq(<html>\n);
print qq(<head>\n);
print qq(<title> Sorry username already taken. </title>\n);
print qq{<meta http-equiv="refresh"content="5;URL="http://www.cs.mcgill.ca/~amosqu/registration.html">\n};
print qq(</head>\n);
print qq(<body>\n);
print qq(<b><center> Username already taken. </b></center>\n\n);
print qq(</body>\n);
print qq(</html>\n);
}
#if it isn't already in Members.csv append values to the file
else {
print $fh "$full_name, $user_name, $password \n";
}
}
}
close $fh;
This should get you going. There is a number of issues with your code that don't stop it from working, but current wisdom is not to use CGI at all so I will roll with you.
Use GET unless you have a good reason to use POST
The problem is here
if(($full_name && $user_name && $password && $new_password) eq /\A[[:alnum:]]+\z/) {
You are using a Boolean && operation that combines the truth of the three variables, and checking whether that, as a string, is equal to the result of matching the contents of $_ against that regular expression.
You must check each of the variables individually, and use the binding operator =~ to test them against a regex. It is also bad form to use the POSIX character classes. I suggest you use grep, like this
my $mismatch = grep /[^A-Z0-9]/i, $full_name, $user_name, $password, $new_password;
Now, $mismatch is true if any of the variables contain a non-alphanumeric character. (Strictly, it is set to the number of variables that have a a non-alphanumeric character, which is zero (false) if none of them do.)
Then you can say
if (not $mismatch) { ... }
It looks like you just need an else that builds a separate page.

Processing form data using CGI

I'm programming in Perl and need to get data from the following HTML form:
<FORM action="./cgi-bin/Perl.pl" method="GET">
<br>
Full name: <br><input type="text" name="full_name" maxlength="20"><br>
Username: <br><input type="text" name="user_name" maxlength="8"><br>
Password: <br><input type="password" name="password" maxlength="15"><br>
Confirm password: <br><input type="password" name="new_password" maxlength="15"><br>
<input type="submit" value ="Submit"><br><br>
</FORM>
EDIT: If i cannot use CGI.pm, will the following work?
local ($buffer, #pairs, $pair, $name, $value, %FORM);
# Read in text
$ENV{'REQUEST_METHOD'} =~ tr/a-z/A-Z/;
if ($ENV{'REQUEST_METHOD'} eq "GET") {
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
}
else {
$buffer = $ENV{'QUERY_STRING'};
}
# Split information into name/value pairs
#pairs = split(/&/, $buffer);
foreach $pair (#pairs)
{
($name, $value) = split(/=/, $pair);
$value =~ tr/+/ /;
$value =~ s/%(..)/pack("C", hex($1))/eg;
$FORM{$name} = $value;
}
but every time I attempt to use these values I get the error:
Use of unitilialized value
How can I properly use CGI to handle my form data?
EDIT: It's possible that my error lies elsewhere. This is my code. Could it be the way in which I'm using grep? Should I not be using the GET method?
#!/usr/bin/perl
use CGI qw(:standard);
use strict;
use warnings;
print "Content-type: text/html\n\n";
#getting these from HTML form
my $full_name = param('full_name');
my $user_name= param('user_name');
my $password = param('password');
my $new_password = param('new_password');
#checking that inputs are alphanumeric or an underscore
my $mismatch = grep /^[a-zA-Z0-9_]*$/i, $full_name, $user_name, $password, $new_password;
if($mismatch) {
#error message if invalid input
print qq(<html>\n);
print qq(<head>\n);
print qq(<title> Error: alphanumeric inputs only. </title>\n);
print qq{<meta http-equiv="refresh" content="5;URL="http://www.cs.mcgill.ca/~amosqu/registration.html">\n};
print qq(</head>\n);
print qq(<body>\n);
print qq(<b><center> Inputs with alphanumeric characters only please. </b></center>\n\n);
print qq(</body>\n);
print qq(</html>\n);
}
You have altered the regex that I suggested in my answer to your previous question, which was
grep /[^A-Z0-9]/i, $full_name, $user_name, $password, $new_password
You have changed it so that $mismatch is now set to the number of parameters that are valid, and the condition for an invalid set of arguments is now the awkward $mismatch < 4.
If your requirement has altered from alphanumeric to alphanumeric plus underscore, then you can restore the sense of the grep by writing
my $mismatch = grep /\W/, $full_name, $user_name, $password, $new_password
which will set $mismatch to a positive true value if any of the values contains a "non-word" character, which is alphanumeric plus underscore as you wanted.
However, the problem you are seeing
Use of uninitialized value $_ in pattern match (m//)
is because at least one of the parameters $full_name, $user_name, $password, or $new_password is undefined. You need to find out which one and why it is happening. Are you sure that all four query parameters full_name, user_name, password, and new_password are present in the query string you're getting back? Take a look at what the query_string method returns to see.
Well, "use of initialized value" isn't an error, it's just a warning. More recent versions of Perl will tell you which variable is causing the problem.
Are you sure that it's the grep line that is generating the errors? Are you sure that you are filling in all of the form inputs when you're testing this?
The following suggestions, don't address your warning. But they are problems with your code.
The regex you are using for your grep seems broken. Your code says "set $missing to the number of variables that include nothing but alphanumeric characters". That will be set to four if you get four alphanumeric-only inputs. You then trigger the error page - which seems to be the inverse of what you want. Your regex also checks for zero or more alphanumeric characters. So it accepts the empty string. Is that what you want?
Also, your regex is too complicated. There's no need to include both A-Z and a-z if you're using /i to make the match case-insensitive. In fact your regex can be collapsed to /^\w*$/i (as \w means "alphanumeric characters plus an underscore").
Checking that inputs only contain alphanumeric characters is probably a bad idea as well. Most people's full names will include at least one space. And limiting passwords to just containing alphanumeric characters is a terrible idea.
When people point out that CGI is no longer recommended, that doesn't mean that you should go back to using Matt Wright's broken CGI parameter parser from twenty years ago. That code is just as broken as it always was. No-one should be using it. You should be looking at one of the modern Perl web development frameworks that are based on PSGI - something like Web::Simple, Dancer or Mojolicious. See CGI::Alternatives for details.

Perl param() receiving from its own print HTML

I have a Perl script that reads in data from a database and prints out the result in HTML forms/tables. The form of each book also contains a submit button.
I want Perl to create a text file (or read into one already created) and print the title of the book that was inside the form submitted. But I can't seem to get param() to catch the submit action!
#!/usr/bin/perl -w
use warnings; # Allow for warnings to be sent if error's occur
use CGI; # Include CGI.pm module
use DBI;
use DBD::mysql; # Database data will come from mysql
my $dbh = DBI->connect('DBI:mysql:name?book_store', 'name', 'password')
or die("Could not make connection to database: $DBI::errstr"); # connect to the database with address and pass or return error
my $q = new CGI; # CGI object for basic stuff
my $ip = $q->remote_host(); # Get the user's ip
my $term = $q->param('searchterm'); # Set the search char to $term
$term =~ tr/A-Z/a-z/; # set all characters to lowercase for convenience of search
my $sql = '
SELECT *
FROM Books
WHERE Title LIKE ?
OR Description LIKE ?
OR Author LIKE ?
'; # Set the query string to search the database
my $sth = $dbh->prepare($sql); # Prepare to connect to the database
$sth->execute("%$term%", "%$term%", "%$term%")
or die "SQL Error: $DBI::errstr\n"; # Connect to the database or return an error
print $q->header;
print "<html>";
print "<body>";
print " <form name='book' action='bookcart.php' method=post> "; # Open a form for submitting the result of book selection
print "<table width=\"100%\" border=\"0\"> ";
my $title = $data[0];
my $desc = $data[1];
my $author = $data[2];
my $pub = $data[3];
my $isbn = $data[4];
my $photo = $data[5];
print "<tr> <td width=50%>Title: $title</td> <td width=50% rowspan=5><img src=$photo height=300px></td></tr><tr><td>Discreption Tags: $desc</td></tr><tr><td>Publication Date: $pub</td></tr><tr><td>Author: $author</td></tr><tr><td>ISBN: $isbn</td> </tr></table> <br>";
print "Add this to shopping cart:<input type='submit' name='submit' value='Add'>";
if ($q->param('submit')) {
open(FILE, ">>'$ip'.txt");
print FILE "$title\n";
close(FILE);
}
print "</form>"; # Close the form for submitting to shopping cart
You haven't used use strict, to force you to declare all your variables. This is a bad idea
You have used remote_host, which is the name of the client host system. Your server may not be able to resolve this value, in which case it will remain unset. If you want the IP address, use remote_addr
You have prepared and executed your SQL statement but have fetched no data from the query. You appear to expect the results to be in the array #data, but you haven't declared this array. You would have been told about this had you had use strict in effect
You have used the string '$ip'.txt for your file names so, if you were correctly using the IP address in stead of the host name, your files would look like '92.17.182.165'.txt. Do you really want the single quotes in there?
You don't check the status of your open call, so you have no idea whether the open succeeded, or the reason why it may have failed
I doubt if you have really spent the last 48 hours coding this. I think it is much more likely that you are throwing something together in a rush at the last minute, and using Stack Overflow to help you out of the hole you have dug for yourself.
Before asking for the aid of others you should at least use minimal good-practice coding methods such as applying use strict. You should also try your best to debug your code: it would have taken very little to find that $ip has the wrong value and #data is empty.
Use strict and warnings. You want to use strict for many reasons. A decent article on this is over at perlmonks, you can begin with this. Using strict and warnings
You don't necessarily need the following line, you are using DBI and can access mysql strictly with DBI.
use DBD::mysql;
Many of options are available with CGI, I would recommend reading the perldoc on this also based on user preferences and desired wants and needs.
I would not use the following:
my $q = new CGI;
# I would use as so..
my $q = CGI->new;
Use remote_addr instead of remote_host to retrieve your ip address.
The following line you are converting all uppercase to lowercase, unless it's a need to specifically read from your database with all lowercase, I find this useless.
$term =~ tr/A-Z/a-z/;
Next your $sql line, again user preference, but I would look into sprintf or using it directly inside your calls. Also you are trying to read an array of data that does not exist, where is the call to get back your data? I recommend reading the documentation for DBI also, many methods of returning your data. So you want your data back using an array for example...
Here is an untested example and hint to help get you started.
use strict;
use warnings;
use CGI qw( :standard );
use CGI::Carp qw( fatalsToBrowser ); # Track your syntax errors
use DBI;
# Get IP Address
my $ip = $ENV{'REMOTE_ADDR'};
# Get your query from param,
# I would also parse your data here
my $term = param('searchterm') || undef;
my $dbh = DBI->connect('DBI:mysql:db:host', 'user', 'pass',
{RaiseError => 1}) or die $DBI::errstr;
my $sql = sprintf ('SELECT * FROM Books WHERE Title LIKE %s
OR Description LIKE %s', $term, $term);
my $sth = $dbh->selectall_arrayref( $sql );
# Retrieve your result data from array ref and turn into
# a hash that has title for the key and a array ref to the data.
my %rows = ();
for my $i ( 0..$#{$sth} ) {
my ($title, $desc, $author, $pub, $isbn, $pic) = #{$sth->[$i]};
$rows{$title} = [ $desc, $author, $pub, $isbn, $pic ];
}
# Storing your table/column names
# in an array for mapping later.
my #cols;
$cols[0] = Tr(th('Title'), th('Desc'), th('Author'),
th('Published'), th('ISBN'), th('Photo'));
foreach (keys %rows) {
push #cols, Tr( td($_),
td($rows{$_}->[0]),
td($rows{$_}->[1]),
td($rows{$_}->[2]),
td($rows{$_}->[3]),
td(img({-src => $rows{$_}->[4]}));
}
print header,
start_html(-title => 'Example'),
start_form(-method => 'POST', -action => 'bookcart.php'), "\n",
table( {-border => undef, -width => '100%'}, #cols ),
submit(-name => 'Submit', -value => 'Add Entry'),
end_form,
end_html;
# Do something with if submit is clicked..
if ( param('Submit') ) {
......
}
This assumes that you're using the OO approach to CGI.pm, and that $q is the relevant object. This should work, assuming that you have $q = new CGI somewhere in your script.
Can you post the rest of the script?
I've created a mockup to test this, and it works as expected:
#!/usr/bin/perl
use CGI;
my $q = new CGI;
print $q->header;
print "<form><input type=submit name=submit value='add'></form>\n";
if ($q->param('submit')) {
print "submit is \"" . $q->param('submit') . "\"\n";
}
After the submit button is clicked, the page displays that submit is "add" which means the evaluation is going as planned.
I guess what you need to do is make sure that $q is your CGI object, and move forward from there.

HTML parser using perl

I'm trying to parse the html file using perl script. I'm trying to grep all the text with html tag p. If I view the source code the data is written in this format.
<p> Metrics are all virtualization specific and are prioritized and grouped as follows: </p>
Here is the following code.
use HTML::TagParser();
use URI::Fetch;
//my #list = $html->getElementsByTagName( "p" );
foreach my $elem ( #list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (#array,"$text");
foreach $_ (#array) {
# print "$_\n";
print $html_fh "$_\n";
chomp ($_);
push (#array1, "$_");
}
}
}
$end = $#array1+1;
print "Elements in the array: $end\n";
close $html_fh;
The problem that I'm facing is that the output which is generated is 4.60 Mb and lot of the array elements are just repetition sentences. How can I avoid such repetition? Is there any other efficient way to grep the lines which I'm interested. Can anybody help me out with this issue?
The reason you are seeing duplicated lines is that you are printing your entire array once for every element in it.
foreach my $elem ( #list ) {
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (#array,"$text"); # this array is printed below
foreach $_ (#array) { # This is inside the other loop
# print "$_\n";
print $html_fh "$_\n"; # here comes the print
chomp ($_);
push (#array1, "$_");
}
}
So for example, if you have an array "foo", "bar", "baz", it would print:
foo # first iteration
foo # second
bar
foo # third
bar
baz
So, to fix your duplication errors, move the second loop outside the first one.
Some other notes:
You should always use these two pragmas:
use strict;
use warnings;
They will provide more help than any other single thing that you can do. The short learning curve associated with fixing the errors that appear more than make up for the massively reduced time spent debugging.
//my #list = $html->getElementsByTagName( "p" );
Comments in perl start with #. Not sure if this is a typo, because you use this array below.
foreach my $elem ( #list ) {
You don't need to actually store the tags into an array unless you need an array. This is an intermediate variable only in this case. You can simply do the following (note that for and foreach are exactly the same):
for my $elem ($html->getElementsByTagName("p")) {
These variables are also intermediate, and two of them unused.
my $tagname = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
push (#array,"$text");
Also note that you never have to quote a variable this way. You can simply do this:
push #array, $elem->innerText;
foreach $_ (#array) {
The $_ variable is used by default, no need to specify it explicitly.
print $html_fh "$_\n";
chomp ($_);
push (#array1, "$_");
I'm not sure why you are chomping the variable after you print it, but before you store it in this other array, but it doesn't seem to make sense to me. Also, this other array will contain the exact same elements as the other array, only duplicated.
$end = $#array1+1;
This is another intermediate variable, and also it can be simplified. The $# sigil will give you the index of the last element, but the array itself in scalar context will give you the size of it:
$end = #array1; # size = last index + 1
But you can do this in one go:
print "Elements in the array: " . #array1 . "\n";
Note that using the concatenation operator . here enforces scalar context on the array. If you had used the comma operator , it would have list context, and the array would have been expanded into a list of its elements. This is a typical way to manipulate by context.
close $html_fh;
Explicitly closing a file handle is not required as it will automatically closed when the script ends.
If you use Web::Scraper instead, your code gets even simpler and clearer (as long as you are able to construct CSS selectors or XPath queries):
#!/usr/bin/env perl
use strict;
use warnings qw(all);
use URI;
use Web::Scraper;
my $result = scraper {
process 'p',
'paragraph[]' => 'text';
}->scrape(URI->new('http://www.perl.org/'));
for my $test (#{$result->{paragraph}}) {
print "$test\n";
}
print "Elements in the array: " . (scalar #{$result->{paragraph}});
Here is another way to get all the content from between <p> tags, this time using Mojo::DOM part of the Mojolicious project.
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10; # say
use Mojo::DOM;
my $html = <<'END';
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<div>Should not find this</div>
<p>Paragraph 3</p>
END
my $dom = Mojo::DOM->new($html);
my #paragraphs = $dom->find('p')->pluck('text')->each;
say for #paragraphs;

Convert CSS Style Attributes to HTML Attributes using Perl

Real quick background : We have a PDFMaker (HTMLDoc) that converts html into a pdf. HTMLDoc doesn't consistently pick up the styles that we need from the html that is provided to us by the client. Thus I'm trying to convert things such as style="width:80px;height:90px;" to height=80 width=90.
My attempt so far has revealed my limited understanding of back references and how to utilize them properly during Perl Regex. I can take an input file and convert it to an output file, but it only catches one "style" per line, and only replaces one name/value pair from that css.
I'm probably approaching this the wrong way but I can't figure out a faster or smarter way to do this in Perl. Any help would be greatly appreciated!
NOTE: The only attributes I'm trying to change for this particular script are "height", "width" and "border," because our client utilizes a tool that automatically applies styles to elements that they drag around with a WYSIWYG-style editor. Obviously, using a regex to strip these out of a lot of places works fairly well, as you just let the table cells be sized by their content, which looks okay, but I figured a quicker way to deal with the issue would just be to replace those three attributes with "width" "height" and "border" attributes, which behave mostly the same as their css counterparts (excepting that CSS allows you to actually customize the width, color, and style of the border, but all they ever use is solid 1px, so I can add a condition to replace "solid 1px" with "border=1". I realize these are not fully equivalent, but for this application it would be a step).
Here's what I've got so far:
#!/usr/bin/perl
if (!#ARGV[0] || !#ARGV[1])
{
print "Usage: converter.pl [input file] [output file] \n";
exit;
}
open FILE, "<", #ARGV[0] or die $!;
open OUTFILE, ">", #ARGV[1] or die $!;
my $line;
my $guts;
while ( <FILE> ) {
$line = $_ ;
$line =~ /style=\"(.+)\"/;
$guts = $1;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$name = $1;
$value = $2;
$guts = $name."=".$value;
$line =~ s/style=\"(.+)\"/$guts/g;
print OUTFILE $line ;
}
exit;
Note: This is NOT homework, and no I'm not asking you to do my job for me, this would end up being an internal tool that just sped up the process of formatting our incoming html to work properly in the pdf converter we have.
UPDATE
For those interested, I got an initial working version. This one only replaces width and height, the border attribute we're scrapping for now. But if anyone wanted to see how we did it, take a look...
#!/usr/bin/perl
## NOTES ##
# This script was made to simply replace style attributes with their name/value pair equivalents as attributes.
# It was designed to replace width and height attributes on a metric buttload of table elements from client data we got.
# As such, it's not really designed to handle more than that, and only strips the unit "PX" from the values.
# All of these can be modified in the second foreach loop, which checks for height and width.
if (!#ARGV[0] || !#ARGV[1])
{
print "Usage: quickvert.pl [input file] [output file] \n";
exit;
}
open FILE, "<", #ARGV[0] or die $!;
open OUTFILE, ">", #ARGV[1] or die $!;
my $line;
my $guts;
my $count = 1;
while ( <FILE> ) {
$line = $_ ;
my (#match) = $line =~ /style=\"(.+?)\"/g;
my $guts;
my $newguts;
foreach (#match) {
#print $_ ."\n";
$guts = $_;
$guts =~ /([a-zA-Z]+)\:([a-zA-Z0-9]+)\;/;
$newguts = "";
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$value =~ s/px//g;
if ( $name =~ m/height/g || $name =~ m/width/g ) {
$newguts .= "$name='$value' ";
} else {
$newguts .= "";
}
}
#print "replacing $guts with $newguts on line $count \n";
$line =~ s/style=\"$guts\"/$newguts/i;
}
#print $newguts;
print OUTFILE $line ;
$count++;
}
exit;
You will have a very difficult time with this, for a few reasons:
Most things that can be accomplished with CSS can't be done with HTML attributes. To deal with this you'd either have to ignore or attempt to compensate for things like margins and padding, etc...
Many things that correspond between HTML attributes and CSS actually behave slightly differently, and you will need to account for this. To deal with this you would have to write specific code for each difference...
Because of the way CSS rules are applied, you basically need to use a complete CSS engine to parse and apply all of the rules before you will know what needs to be done at the element/attribute level. To deal with this you could just ignore anything except inline styles, but...
This work is almost as complicated as writing a rendering engine for a browser. You might be able to deal with a few specific cases, but even there your success rate would be haphazard at best.
EDIT: Given your very specific feature set, I can give you a little advice on your implementation:
You want to be case-insensitive and use a non-greedy match when looking for the value of the style attribute, i.e.:
$line =~ /style=\"(.+?)\"/i;
So that you only find stuff up to the very next double-quote, not the entire content of the line up to the last double quote. Also, you probably want to skip the line if the match isn't found, so:
next unless ($line =~ /style=\"(.+?)\"/i);
For parsing the guts, I'd use split instead of regex:
my $newguts;
foreach my $style (split(/;/,$guts)) {
my ($name, $value) = split(/:/,$style);
$newguts .= "$name='$value' ";
}
$line =~ s/style=\"$guts\"/$newguts/i;
Of course, this being Perl there are standard mantras such as always use strict and warnings, try to use named matches rather than $1, $2, etc., but I'm trying to restrict my advice to stuff that will move your solution forward right away.
Have a look on CPAN for HTML parsing modules like HTML::TreeBuilder, HTML::DOM or even XML modules like XML::LibXML.
Below is quick example using HTML::TreeBuilder which adds border="1" attribute to any tag that has style attribute with border content:
use strict;
use warnings;
use HTML::TreeBuilder;
my $data =q{
<html>
<head>
</head>
<body>
<h1>blah</h1>
<p style="color: red;">Red</p>
<span style="width:80px;height:90px;border: 1px solid #000000">Some text</span>
</body>
</html>
};
my $tree = HTML::TreeBuilder->new;
$tree->parse_content( $data );
for my $style ( $tree->look_down( sub { $_[0]->attr('style') } ) ) {
my $prop = $style->attr( 'style' );
$style->attr( 'border', 1 ) if $prop =~ m/border/;
}
say $tree->as_HTML;
Which will reproduce the HTML but with border="1" added just to the span tag.
In unison to these modules you can also have a look at CSS and CSS::DOM to help parse the CSS bit.
I don't know your stance on proprietary software, but PrinceXML is the best HTML to PDF converter available.