I asked a question a couple of days ago about stripping HTML from files with Perl. I am a n00b and I've searched the site for answers to my question, but unfortunately I couldn't find anything. This is probably because I'm a n00b and I didn't see the answer when I was looking right at it.
So, here is the situation. I have a directory with around 20 GB of text files. I want to strip the HTML from each file and output each file to a unique text file. I've written the program below, which seems to do the trick for the first 12 text files in the directory (there are about 12,000 text files in total). However, I run into a couple of snags. The first snag is that after the 12th text file has been parsed, I start getting warnings about deep recursion, and shortly after that the program quits because I've run out of memory. I imagine that my programming is extremely inefficient. So, I'm wondering if any of you see any obvious errors in my code below that would cause me to run out of memory. Once I figure things out, hopefully I'll be able to contribute.
#!/usr/bin/perl -w
#use strict;
use Benchmark;
#get the HTML-Format package from the package manager.
use HTML::Formatter;
#get the HTML-TREE from the package manager
use HTML::TreeBuilder;
use HTML::FormatText;
$startTime = new Benchmark;
my $direct="C:\\Directory";
my $slash='\\';
opendir(DIR1,"$direct")||die "Can't open directory";
my @New1=readdir(DIR1);
foreach $file(@New1)
{
if ($file=~/^\./){next;}
#Initialize the variable names.
my $HTML=0;
my $tree="Empty";
my $data="";
#Open the file and put the file in variable called $data
{
local $/;
open (SLURP, "$direct$slash"."$file") or die "can't open $file: $!";
#read the contents into data
$data = <SLURP>;
#close the filehandle called SLURP
close SLURP or die "cannot close $file: $!";
if($data=~m/<HTML>/i){$HTML=1;}
if($HTML==1)
{
#the following steps strip out any HTML tags, etc.
$tree=HTML::TreeBuilder->new->parse($data);
$formatter=HTML::FormatText->new(leftmargin=> 0, rightmargin=>60);
$Alldata=$formatter->format($tree);
}
}
#print
my $outfile = "out_".$file;
open (FOUT, "> $direct\\$outfile");
print FOUT "file: $file\nHTML: $HTML\n$Alldata\n","*" x 40, "\n" ;
close(FOUT);
}
$endTime = new Benchmark;
$runTime = timediff($endTime, $startTime);
print ("Processing files took ", timestr($runTime));
You are using up a lot of space with the list of files in @New1.
In addition, if you are using an older version of HTML::TreeBuilder then your objects of this class may need explicitly deleting, as they used to be immune to automatic Perl garbage collection.
Here is a program that avoids both of these problems, by reading the directory incrementally, and by using HTML::FormatText->format_string to format the text, which implicitly deletes any HTML::TreeBuilder objects that it creates.
In addition, File::Spec makes a tidier job of building absolute file paths, and it is a core module so it will not need installing on your system.
use strict;
use warnings;
use File::Spec;
use HTML::FormatText;
my $direct = 'C:\Directory';
opendir my $dh, $direct or die "Can't open directory";
while ( readdir $dh ) {
next if /^\./;
my $file = File::Spec->catfile($direct, $_);
my $outfile = File::Spec->catfile($direct, "out_$_");
next unless -f $file;
my $html = do {
open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!);
local $/;
<$fh>;
};
next unless $html =~ /<html/i;
my $formatted = HTML::FormatText->format_string(
$html, leftmargin => 0, rightmargin => 60);
open my $fh, '>', $outfile or die qq(Unable to open "$outfile" for writing: $!);
print $fh "File: $file\n\n";
print $fh "$formatted\n";
print $fh "*" x 40, "\n" ;
close $fh or die qq(Unable to close "$outfile" after writing: $!);
}
What was wrong with the answer to your previous question?
You're opening files for writing without checking the return code. Are you sure they succeed? And in which directory do you think the files are created?
A better approach would be to:
read the files one by one
strip the HTML
write out each new file in the correct directory, checking the return code
something like:
while ( my $file = readdir DIR ) {
....process file
open my $newfile, '>', "$direct/out_$outfile" or die "cannot open $outfile: $!\n";
... etc
}
How to reduce the memory footprint of this application:
Does the problem persist when you add $tree = $tree->delete to the end of your loop?
The perl garbage collector cannot resolve circular references; so you have to destroy the tree manually so you don't run out of memory.
(See the first example in the module documentation at http://metacpan.org/pod/HTML::TreeBuilder)
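Applied to the question's loop, the suggestion looks roughly like this (a self-contained sketch; $data stands in for the slurped file contents):
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;
my $data      = '<html><body><p>example input</p></body></html>';
my $tree      = HTML::TreeBuilder->new_from_content($data);   # parses the string and finishes the parse
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 60);
my $Alldata   = $formatter->format($tree);
$tree->delete;   # break the tree's circular references so Perl can reclaim the memory
print $Alldata;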
You should put the readdir inside the loop. The way you coded it, you first read in this gigantic list of files. When you say
my $file;
while (defined($file = readdir DIR1)) {..}
only one entry is actually read at a time. Should save some extra memory.
A few other comments on style:
default values
You give $tree the default value of "Empty". That is completely unnecessary. If you want to show how undefined a variable is, set it to undef, which it is by default. Perl guarantees this initialization.
backslashes
You use backslashes as a directory separator? Stop worrying and just use normal slashes. Unless you are on actual DOS, forward slashes work fine; Windows isn't that dumb.
statement modifiers
This line
if ($file=~/^\./){next;}
can be written far more readably as
next if $file =~ /^\./;
consistent use of parens
Your use of parens for function argument lists is inconsistent. You can omit the parens for all built-in functions unless there is ambiguity. I prefer avoiding them; others may find them easier to read. But please stick to one style!
better regex
You test for the existence of /<HTML>/i. What if I told you the html tag can have attributes? You should rather consider testing for /<html/i.
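A quick illustration (the sample markup is made up):
my $data = '<html lang="en"><body>hello</body></html>';
print "strict match\n" if $data =~ /<HTML>/i;   # never prints: the tag carries attributes
print "loose match\n"  if $data =~ /<html/i;    # prints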
simplification (removes another bug)
Your test
if($data=~m/<HTML>/i){$HTML=1;}
if($HTML==1) {...}
can be written as
$HTML = $data =~ /<html/i;
if ($HTML == 1) {...}
can be written as
$HTML = $data =~ /<html/i
if ($HTML) {...}
can be folded into
if ($data =~ /<html/i) {...}
The way you implemented it, the $HTML variable was never reset to a false value. So once a file contained html, all subsequent files would have been treated as html as well. You can counteract such problems by defining your vars in the innermost sensible scope.
use HTML::FormatText, tribute to @pavel
Use the modules you use to the fullest. Look what I found in the example for HTML::FormatText:
my $string = HTML::FormatText->format_file(
'test.html',
leftmargin => 0, rightmargin => 50
);
You can easily adapt that to circumvent building the tree manually. Why didn't you try this approach, as @pavel suggested in your other post? It would have saved you the memory problem...
use strict
Why did you comment out use strict? Getting as many fatal warnings as possible is important when learning a language, or when writing solid code. It would force you to declare all your variables, like $file, sensibly. And use warnings rather than the -w switch, which is a bit outdated.
well done
But a very big "well done" on checking the return value of close ;-) That is very un-n00bish!
I have an assignment for school. We need to use the POST method to pass data from an embedded HTML form to a Perl script. The Perl will then loop through the data and display it back to the user.
I was using a separate file and the GET method but my instructor said I had to embed the HTML and use POST. But the information is no longer stored in the environment variable.
Is there a built-in variable I can use to access this information?
#!/usr/bin/perl -w
print "Content-Type:text/html\n\n";
$qstring = $ENV{'QUERY_STRING'};
$rmethod = $ENV{'REQUEST_METHOD'};
if ( $rmethod eq "GET" ) {
&displayform();
}
elsif ( $rmethod eq "POST" ) {
@pairs = split(/&/, $qstring);
foreach (@pairs) {
($key, $value) = split(/=/);
$value =~ tr/+/ /;
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
$form{$key} = $value;
}
print "<html><head><title>Student Survey</title></head>\n";
print "<body>\n";
&displayInfo();
print "</body></html>\n";
}
print "</body></html>\n";
}
sub displayInfo {
print "Full Name:", $form{"person"}, "<br>";
print "Favourite:", $form{"sport"}, "<br>";
print "Favourite Course:", $form{"course"}, "<br>";
print "GPA:", $form{"gpa"}, "<br>";
}
sub displayform {
print qq~ (FORM GOES HERE)
Someone is teaching you very bad practices. I don't know whether your sample code is following examples supplied by your school or whether you have cobbled it together from bad examples on the internet - but either way, this code uses techniques that have been out of date for twenty years.
There's a whole debate to be had about the wisdom of teaching CGI programming in 2017 (see CGI::Alternatives for a brief discussion of some better approaches) but let's ignore that and assume that CGI is a good idea here.
If you're writing a CGI program, then you should use the CGI.pm library which has been part of the standard Perl distribution for over twenty years (it was removed recently, but the chances of your school using a version this up to date is tiny).
A standard CGI program, using CGI.pm looks like this:
#!/usr/bin/env perl
use strict;
use warnings;
use CGI qw[header param]; # Load the two functions we're going to use
# Display the content-type header
print header;
# See if we have been passed parameters.
if (param) {
# Display the parameters
# Note: I've ignored HTML here. You shouldn't
print 'Name: ', param('person');
print 'Sport: ', param('sport');
# etc...
} else {
# Display the form
...
}
I ignored HTML in my example because embedding HTML in your Perl code is a terrible idea. It's a much better idea to use a templating system (I recommend the Template Toolkit).
A few other points:
Always use strict and use warnings.
-w on the shebang was obsoleted by use warnings in 2000.
Using & on subroutine calls has been unnecessary since Perl 5 was released in 1994.
I know that you don't know any better and that you're just following what your teacher is telling you to do. But it's really depressing to see such outdated practices being taught in school.
Update: And just to add the answer to your original question. You're right that in a POST request, the parameter data is no longer available in the QUERY_STRING environment variable - you need to read it from STDIN instead. That's one of the many advantages of CGI.pm - you use the same method (the param() subroutine) to access both GET and POST parameters.
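For reference, here is roughly what fetching the raw parameter string looks like without CGI.pm (a sketch only, and not something I recommend over param()):
#!/usr/bin/perl
use strict;
use warnings;
my $data = '';
if (($ENV{REQUEST_METHOD} // '') eq 'POST') {
    # POST parameters arrive on STDIN; CONTENT_LENGTH says how many bytes to read
    read STDIN, $data, $ENV{CONTENT_LENGTH} // 0;
}
else {
    # GET parameters are in the environment, as in your script
    $data = $ENV{QUERY_STRING} // '';
}
# $data now holds the raw "key=value&key=value" string in either case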
I have an HTML page that contains URLs like:
<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">
I want to extract:
site.com/path
www.site.org
between <h3><a href=" and /index.php.
I've tried this code:
#!/usr/local/bin/perl
use strict;
use warnings;
open (MYFILE, 'MyFileName.txt');
while (<MYFILE>)
{
my $values1 = split('http://', $_); #VALUE WILL BE: www.site.org/path/index2.php
my #values2 = split('index.php', $values1); #VALUE WILL BE: www.site.org/path/ ?option=com_content
print $values2[0]; # here it must print www.site.org/path/ but it don't
print "\n";
}
close (MYFILE);
but this gives the output:
2
1
2
2
1
1
and it doesn't parse https websites.
Hope you understand. Regards.
The main thing wrong with your code is that when you call split in scalar context as in your line:
my $values1 = split('http://', $_);
It returns the size of the list created by the split. See split.
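For instance:
my $count = split /x/, 'axbxc';   # scalar context: $count is 3, the number of fields
my @parts = split /x/, 'axbxc';   # list context: @parts is ('a', 'b', 'c')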
But I don't think split is appropriate for this task anyway. If you know that the value you are looking for will always lie between 'http[s]://' and '/index.php' you just need a regex substitution in your loop (you should also be more careful opening your file...):
open(my $myfile_fh, '<', 'MyFileName.txt') or die "Couldn't open $!";
while(<$myfile_fh>) {
s{.*http[s]?://(.*)/index\.php.*}{$1} && print;
}
close($myfile_fh);
It's likely you will need a more general regex than that, but I think this would work based on your description of the problem.
This feels to me like a job for modules:
HTML::LinkExtor
URI
Generally using regexps to parse HTML is risky.
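A rough sketch of that module-based approach (the input file name is taken from the question; the path trimming is my guess at the exact output you want):
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a' and $attr{href};
    my $uri = URI->new($attr{href});
    return unless $uri->scheme and $uri->scheme =~ /^https?$/;
    (my $dir = $uri->path) =~ s{/[^/]*$}{};   # drop the trailing "/index.php" part
    print $uri->host, $dir, "\n";             # e.g. "site.com/path"
});
$parser->parse_file('MyFileName.txt');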
dms explained in his answer why using split isn't the best solution here:
It returns the number of items in scalar context
A normal regex is better suited for this task.
However, I do not think that line-based processing of the input is valid for HTML, or that using a substitution makes sense (it does not, especially when the pattern looks like .*Pattern.*).
Given a URL, we can extract the required information like this:
if ($url =~ m{^https?://(.+?)/index\.php}s) { # domain+path now in $1
say $1;
}
But how do we extract the URLs? I'd recommend the wonderful Mojolicious suite.
use strict; use warnings;
use feature 'say';
use File::Slurp 'slurp'; # makes it easy to read files.
use Mojo;
my $html_file = shift @ARGV; # take file name from command line
my $dom = Mojo::DOM->new(scalar slurp $html_file);
for my $link ($dom->find('a[href]')->each) {
say $1 if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
}
The find method can take CSS selectors (here: all a elements that have an href attribute). The each flattens the result set into a list which we can loop over.
As the script prints to STDOUT, we can use shell redirection to put the output into whatever file we want, e.g.
$ perl the-script.pl html-with-links.html >only-links.txt
The whole script as a one-liner:
$ perl -Mojo -E'$_->attr("href") =~ m{^https?://(.+?)/index\.php}s and say $1 for x(b("test.html")->slurp)->find("a[href]")->each'
I have a <textarea> for user input, and, as they are invited to do, users liberally add line breaks in the browser and I save this data directly to the database.
Upon displaying this data back on a webpage, I need to convert the line breaks to <br> tags in a reliable way that takes into consideration the \n's, the \r\n's, and any other common line break sequences employed by client systems.
What is the best way to do this in Perl without doing regex substitutions every time? I am hoping, naturally, for yet another awesome CPAN module recommendation... :)
There's nothing wrong with using regexes here:
s/\r?\n/<br>/g;
Actually, if you're having to deal with Mac users, or if there still happens to be some weird computer that uses form-feeds, you would probably have to use something like this:
$input =~ s/(\r\n|\n|\r|\f)/<br>/g;
#!/usr/bin/perl
use strict; use warnings;
use Socket qw( :crlf );
my $text = "a${CR}b${CRLF}c${LF}";
$text =~ s/$LF|$CR$LF?/<br>/g;
print $text;
Following up on @daxim's comment, here is the modified version:
#!/usr/bin/perl
use strict; use warnings;
use charnames ':full';
my $text = "a\N{CR}b\N{CR}\N{LF}c\N{LF}";
$text =~ s/\N{LF}|\N{CR}\N{LF}?/<br>/g;
print $text;
Following up on @Marcus's comment, here is a contrived example:
#!/usr/bin/perl
use strict; use warnings;
use charnames ':full';
my $t = (my $s = "a\012\015\012b\012\012\015\015c");
$s =~ s/\r?\n/<br>/g;
$t =~ s/\N{LF}|\N{CR}\N{LF}?/<br>/g;
print "This is \$s: $s\nThis is \$t:$t\n";
This is a mishmash of carriage returns and line feeds (which, at some point in the past, I did encounter).
Here is the output of the script on Windows using ActiveState Perl:
C:\Temp> t | xxd
0000000: 5468 6973 2069 7320 2473 3a20 613c 6272 This is $s: a<br
0000010: 3e3c 6272 3e62 3c62 723e 3c62 723e 0d0d ><br>b<br><br>..
0000020: 630d 0a54 6869 7320 6973 2024 743a 613c c..This is $t:a<
0000030: 6272 3e3c 6272 3e62 3c62 723e 3c62 723e br><br>b<br><br>
0000040: 3c62 723e 3c62 723e 630d 0a <br><br>c..
or, as text:
This is $s: a<br><br>b<br><br>
This is $t:a<br><br>b<br><br><br><br>c
Admittedly, you are not likely to end up with this input. However, if you want to cater for any unexpected oddities that might indicate a line ending, you might want to use
$s =~ s/\N{LF}|\N{CR}\N{LF}?/<br>/g;
Also, for reference, CGI.pm canonicalizes line-endings this way:
# Define the CRLF sequence. I can't use a simple "\r\n" because the meaning
# of "\n" is different on different OS's (sometimes it generates CRLF, sometimes LF
# and sometimes CR). The most popular VMS web server
# doesn't accept CRLF -- instead it wants a LR. EBCDIC machines don't
# use ASCII, so \015\012 means something different. I find this all
# really annoying.
$EBCDIC = "\t" ne "\011";
if ($OS eq 'VMS') {
$CRLF = "\n";
} elsif ($EBCDIC) {
$CRLF= "\r\n";
} else {
$CRLF = "\015\012";
}
As a matter of general principle, storing the data as entered by the user and doing the EOL-to-<br> conversion each time it's displayed is the better (even Right™) way to do it, both for the sake of having access to the original version of the data and because you may decide at some point that you want to change your filtering algorithm.
But, no, I personally would not use a regex in this case. I would use Parse::BBCode, which provides a whole lot of additional functionality (i.e., full BBCode support, or at least as much as you choose not to disable) in addition to providing line breaks without requiring users to explicitly enter markup for them.
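A minimal sketch of the Parse::BBCode route (default configuration; check the module's documentation for how it treats line breaks and which tags you want to allow):
use strict;
use warnings;
use Parse::BBCode;
my $from_textarea = "first line\r\nsecond line\n[b]bold[/b]";
my $parser = Parse::BBCode->new();            # default tag set; restrict as needed
my $html   = $parser->render($from_textarea); # renders the BBCode (and plain text) to HTML
print $html;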
I have a .SQL file containing a large number of queries. They are being run against a database containing data for multiple states over multiple years. The machine I am running this on can only handle running the queries for one state, in one year, at a time.
I am trying to create a Perl script that takes user input for the state abbreviation, the state id number, and the year. It then creates a directory for that state and year. Then it opens the "base" .SQL file and searches and replaces the base state id and year with the user input, and saves this new .SQL file to the created directory.
The current script I have (below) stops at
open(IN,'<$infile')
with
"Can't open [filename]"
It seems that it is having difficulty finding or opening the .SQL file. I have quadruple-checked to make sure the paths are correct, and I have even tried replacing the
$path
with an absolute path for the base file. If it was having trouble with creating the new file I'd have more direction, but since it can't find/open the base file I do not know how to proceed.
#!/usr/local/bin/perl
use Cwd;
$path = getcwd();
#Cleans up the path
$path =~ s/\\/\//sg;
#User inputs
print "What is the 2 letter state abbreviation for the state? Ex. 'GA'\n";
$stlet = <>;
print "What is the 2 digit state abbreviation for the state? Ex. '13'\n";
$stdig = <>;
print "What four-digit year are you doing the calculations for? Ex. '2008'\n";
$year = <>;
chomp $stlet;
chomp $stdig;
chomp $year;
#Creates the directory
mkdir($stlet);
$new = $path."\/".$stlet;
mkdir("$new/$year");
$infile = '$path/Base/TABLE_1-26.sql';
$outfile = '$path/$stlet/$year/TABLE_1-26.sql';
open(IN,'<$infile') or die "Can't open $infile: $!\n";
open(OUT,">$infile2") or die "Can't open $outfile: $!\n";
print "Working...";
while (my $search = <IN>) {
chomp $search;
$search =~ s/WHERE pop.grp = 132008/WHERE pop.grp = $stdig$year/g;
print OUT "$search\n";
}
close(IN);
close(OUT);
I know I also probably need to tweak the regular expression some, but I'm trying to take things one at a time. This is my first Perl script, and I haven't really been able to find anything that handles .SQL files like this that I can understand.
Thank you!
$infile = '$path/Base/TABLE_1-26.sql';
The string in that line is single quoted, so $path won't interpolate, so your program is looking for a file literally named $path/Base/TABLE_1-26.sql.
You want
$infile = "$path/Base/TABLE_1-26.sql";
Or, better,
use File::Spec;
....
$infile = File::Spec->catfile($path, 'Base', 'TABLE_1-26.sql');
and similarly for $outfile -- or was that $infile2 ? :) I'd strongly recommend putting a use strict; and use warnings; at the top of this and your future scripts.
This is a quickly cooked up script, but I am having some difficulty due to unfamiliarity with regexes and Perl.
The script is supposed to read in an HTML file. There is a place in the file (by itself) where I have a bunch of <div>s. I want to remove every third of them -- they are grouped in fours.
My script below won't compile, let alone run.
#!/usr/bin/perl
use warnings;
use strict;
&remove();
sub remove {
my $input = $ARGV[0];
my $output = $ARGV[1];
open INPUT, $input or die "couldn't open file $input: $!\n";
open OUTPUT, ">$output" or die "couldn't open file $output: $!\n";
my @file = <INPUT>;
foreach (@file) {
my $int = 0;
if ($_ =~ '<div class="cell">') {
$int++;
{ // this brace was the wrong way
if ($int % 4 == 3) {
$_ =~ '/s\<div class="cell">\+.*<\/div>/;/g';
}
}
print OUTPUT @file;
}
Thanks for all your help. I know it is wrong to parse with a regex, but I just want this one to work.
Postmortem: The problem is almost solved. And I shame those who told me that a regex is not good -- I knew that to begin with. But then again, I wanted something fast and had programmed the XSLT that produced it. In this case I didn't have the source to run it again, otherwise I would program it into the XSLT.
I agree that HTML can't really be parsed by regexes, but for quick little hacks on HTML that you know the format of, regexes work great. The trick to doing repetition replacements with a regex is to put the repetition into the regex. If you don't do that you'll run into trouble syncing the position of the regex matcher with the input you're reading.
Here's the quick-and-dirty way I'd write the Perl. It removes the third div element even when it is nested within the first two divs. The whole file is read and then I use the "g" global replace modifier to make the regex do the counting. If you haven't seen the "x" modifier before, all it does is let you add spaces for formatting—the spaces are ignored in the regex.
remove(@ARGV);
sub remove {
my ($input, $output) = @_;
open(INPUT, "<", $input) or die "couldn't open file $input: $!\n";
open(OUTPUT, ">", $output) or die "couldn't open file $output: $!\n";
my $content = join("", <INPUT>);
close(INPUT);
$content =~ s|(.*? <div \s+ class="cell"> .*? <div \s+ class="cell"> .*?)
<div \s+ class="cell"> .*? </div>
(.*? <div \s+ class="cell">)|$1$2|sxg;
print OUTPUT $content;
close OUTPUT;
}
When your code doesn't compile, read the error and warning messages you get.
If they don't make sense, consult perldoc perldiag (or
put "use diagnostics;" in your code to automatically do this for you).
Well, you're right that you shouldn't be parsing HTML with regular expressions. And since that is the case, it probably won't "just work."
Ideally, you need to be using an HTML parsing and manipulation library. Don't think of HTML as a big string for you to manipulate with text functions: it's a serialized, formatted data structure. You should monkey with it only using a library for that purpose. The various libraries have already fixed the hundreds of bugs that you are likely to face, making it a zillion times more likely that a simple HTML manipulation routine written against them will "just work." The master-level Perl programmers would generally not parse HTML this way, and it's not because they're obsessive and irrational about code quality and purity -- it's because they know that reinventing the wheel themselves is unlikely to yield something that rolls as smooth as the existing machinery.
I recommend HTML::Tree because it functions the way I think of HTML (and XML). I think there are a couple of other libraries that may be more popular.
The real truth is, if you can't even get your program to compile, you need to invest a little more time (a half day or so) figuring out the basics before you come looking for help. You have an error in your syntax for using the s///g regular expression substitution operator, and you need to find out how that is supposed to work before you go any further. It's not hard, and you can find out what you need from the Camel book, or the perlretut manpage, or several other sources. If you don't learn how to debug your program now, then likely any help you receive here is just going to take you to the next syntax error which you won't be able to get past.
Once you get the squiggly brackets matching each other, and start using the substitution regex properly, you also need to move the
my $int = 0;
out of the for loop - it is currently being reset on every line that is read, so it will only ever have the value of 0 or 1.
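In other words, something like this (a sketch only, keeping the question's variable names):
my $int = 0;                          # declared once, before the loop
foreach (@file) {
    $int++ if /<div class="cell">/;   # now it keeps counting across all the lines
    # ... the ($int % 4 == 3) substitution goes here
}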
The subroutine has lost its way. Start by taking a look at the structure of that:
sub remove { # First opening bracket
my $input = $ARGV[0];
my $output = $ARGV[1];
open INPUT, $input or die "couldn't open file $input: $!\n";
open OUTPUT, ">$output" or die "couldn't open file $output: $!\n";
my @file = <INPUT>;
foreach (@file) { # Second opening bracket
my $int = 0;
if ($_ =~ '<div class="cell">') { # Third opening bracket
$int++;
{ # Fourth opening bracket
if ($int % 4 == 3) { # Fifth opening bracket
$_ =~ '/s\<div class="cell">\+.*<\/div>/;/g';
} # First closing bracket
} # Second closing bracket
print OUTPUT @file;
} # Third closing bracket
# No fourth closing bracket?
# No fifth closing bracket?
I think you wanted this:
sub remove {
my $input = $ARGV[0];
my $output = $ARGV[1];
open INPUT, $input or die "couldn't open file $input: $!\n";
open OUTPUT, ">$output" or die "couldn't open file $output: $!\n";
my @file = <INPUT>;
foreach (@file) {
my $int = 0;
if ($_ =~ '<div class="cell">') {
$int++;
}
if ($int % 4 == 3) {
$_ =~ '/s\<div class="cell">\+.*<\/div>/;/g';
}
}
print OUTPUT @file;
}
That will compile, and takes us to the next issue: Why are you single-quoting the regex? (Also see Cebjyre's point about the placement of my $int = 0.)
(To pick up on Ysth's point, you can also always run a script with perl -Mdiagnostics script-name to get the longer diagnostic messages.)