Infinite loop using a pair of Perl regex matches - html

I wrote a small Perl script with regular expressions to get HTML components of a website.
I know it's not a good way of doing this kind of job, but I was trying to test my regex skills.
When run with either one of the two regex patterns in the while loop it runs perfectly and displays the correct output. But when I try to check both patterns in the while loop the second pattern matches every time and the loop runs infinitely.
My script:
#!/usr/bin/perl -w
use strict;
while (<STDIN>) {
while ( (m/<span class=\"itempp\">([^<]+)+?<\/span>/g) ||
(m/<font size=\"-1\">([^<]+)+?<\/font>/g) ) {
print "$1\n";
}
}
I am testing the above script with a sample input:
Link title
<span class="itempp">$150</span>
<font size="-1"> (Location)</font>
Desired output:
$150
(Location)
Thank you! Any help would be highly appreciated!

Whenever a global regex fails to match it resets the position where the next global regex will start searching. So when the first of your two patterns fails it forces the second to look from the beginning of the string again.
This behaviour can be disabled by adding the /c modifier, which leaves the position unchanged if a regex fails to match.
In addition, you can improve your patterns by removing the escape characters (" doesn't need escaping and / needn't be escaped if you choose a different delimiter) and the superfluous +? after the captures.
Also, use warnings is much better than -w on the command line.
Here is a working version of your code.
use strict;
use warnings;
while (<STDIN>) {
while( m|<span class="itempp">([^<]+)</span>|gc
or m|<font size="-1">([^<]+)</font>|gc ) {
print "$1\n";
}
}
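The position reset described above can be observed directly with pos(). This is only a minimal sketch; the sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $str = 'aXbX';    # hypothetical sample string

$str =~ /X/g;                     # matches the first X
my $after_match = pos($str);      # position just past that X

$str =~ /Z/g;                     # fails without /c, so pos is reset
my $after_fail = pos($str);       # undef

$str =~ /X/g;                     # starts from the beginning again
$str =~ /Z/gc;                    # fails with /c, so pos is kept
my $after_fail_c = pos($str);     # same position as after the match

print "after match: $after_match\n";
print "after failed /g: ", (defined $after_fail ? $after_fail : 'undef'), "\n";
print "after failed /gc: $after_fail_c\n";
```

Without /c the second pattern in the question always restarts at the beginning of the line, which is why it keeps matching.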

while (<DATA>) {
if (m{<(?:span class="itempp"|font size="-1")>\s*([^<]+)}i) {
print "$1\n";
}
}
__DATA__
Link title
<span class="itempp">$150</span>
<font size="-1"> (Location)</font>

You did not change $_ during or after matching, so the pattern always matches again and the loop never ends.
To fix it, you can add $_ = $'; after the print, so that the next match runs against the rest of the string.


How do I remove spaces around each field in a CSV in Perl?

My CSV file contains whitespace around the fields:
Id ; FirstName ; LastName ; email
123; Marc ; TOTO ; marc@toto.com
I would like to delete the whitespace in each line of my CSV, like this:
Id;FirstName;LastName;email
123;Marc;TOTO;marc@toto.com
I would like to use a regex in Perl.
It is always a good idea to use a library for file formats like CSV. Even when a case like this seems trivial and safe to parse with a regex, surprises can and do sneak up. Also, requirements tend to change, and projects and data only get more complex. Once there is sensible code using a good library, changes to the project are generally absorbed far more easily.
A library like the excellent Text::CSV can use ; as a separator and can remove that extra whitespace while parsing the file, with a suitable option.
To keep it short and in a one-liner the functional interface is helpful
perl -MText::CSV=csv -we'
csv (in => *ARGV, sep => ";", allow_whitespace => 1)' name.csv > new.csv
Prints as desired with the supplied example file.
Not a perl solution but the awk solution is so simple it might be acceptable:
awk '{OFS="";$1=$1;print $0}' file.csv;
Setting OFS to the empty string overrides the default output field separator (a single space). Assigning $1=$1 forces awk to rebuild $0 from its whitespace-separated fields, joined with OFS, so all whitespace between them is gone before the line is printed. Note that this also removes whitespace inside a field value.
Although your title says to remove spaces at the end of each line, you may want to remove the whitespace around the field values. In that case, please try:
perl -pe "s/\s*;\s*/;/g" input_file.csv
Output:
Id;FirstName;LastName;email
123;Marc;TOTO;marc@toto.com
Please note that this breaks if a field itself contains ;, as in abc;"foo ; bar";def, and in other complicated cases.
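The same trimming can be written as a small script instead of a one-liner. This is a sketch under the same limitation (no quoted fields containing ;); the function name trim_fields is made up for illustration.

```perl
use strict;
use warnings;

# Split on ';', trim each field, and join the fields again.
# Assumes no quoted fields that contain ';' themselves.
sub trim_fields {
    my ($line) = @_;
    chomp $line;
    my @fields = split /;/, $line, -1;    # -1 keeps trailing empty fields
    s/^\s+|\s+$//g for @fields;           # strip leading and trailing whitespace
    return join ';', @fields;
}

print trim_fields("Id ; FirstName ; LastName ; email\n"), "\n";
```

Unlike the awk version, this keeps whitespace inside a field value and only strips it at the edges.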

Compress heredoc declaration to one line in bash?

I have this which works to declare a JSON string in a bash script:
local my_var="foobar"
local json=`cat <<EOF
{"quicklock":"${my_var}"}
EOF`
The above heredoc works, but I can't seem to format it any other way, it literally has to look exactly like that lol.
Is there any way to get the command to be on one line, something like this:
local json=`cat <<EOF{"quicklock":"${my_var}"}EOF`
that would be so much nicer, but doesn't seem to take, obviously simply because that's not how EOF works I guess lol.
I am looking for a shorthand way to declare JSON in a file that:
Does not require a ton of escape chars.
That allows for dynamic interpolation of variables.
Note: The actual JSON I want to use has multiple dynamic variables with many key/value pairs. Please extrapolate.
I'm not a JSON guy, don't really understand the "well-formed" arguments in the discussion above, but, you can use a 'here-string' rather than a 'here-document', like this:
my_var="foobar"
json=`cat <<<{\"quicklock\":\"${my_var}\"}`
Why not use jq? It's pretty good at managing string interpolation, and it lints your structure.
$ echo '{}' >> foo.json
$ declare myvar="assigned-var"
$ jq --arg ql "$myvar" '.quicklock=$ql' foo.json
The text that comes out of that call to jq can then be redirected into a file or whatever else you want to do with it. It would look something like this:
{"quicklock": "assigned-var"}
You can do this with printf:
local json="$(printf '{"quicklock":"%s"}' "$my_var")"
(Never mind that SO's syntax highlighting looks odd here. Quotes inside $(...) are parsed independently of the surrounding double quotes, so they nest without escaping.)
A note (thanks to Charles Duffy's comment on the question): I'm assuming $my_var is not controlled by user input. If it is, you'll need to be careful to ensure it is legal for a JSON string. I highly recommend barring non-ASCII characters, double quotes, and backslashes. If you have jq available, you can use it as Charles noted in the comments to ensure you have well-formed output.
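If jq is not available but Perl is, the core JSON::PP module can take care of the escaping. This is a sketch that is not part of the answers here, and the sample value is made up.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical value containing characters that need escaping in JSON.
my $my_var = 'foo "bar" \\ baz';

# canonical() sorts the keys so the output is stable.
my $json = JSON::PP->new->canonical->encode({ quicklock => $my_var });
print "$json\n";
```

From a bash script the same idea fits on one line, for example json=$(perl -MJSON::PP -e 'print encode_json({quicklock => $ARGV[0]})' "$my_var").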
You can define your own helper function to address the situation with missing bash syntax:
function begin() { eval echo $(sed "${BASH_LINENO[0]}"'!d;s/.*begin \(.*\) end.*/\1/;s/"/\\\"/g' "${BASH_SOURCE[0]}"); }
Then you can use it as follows.
my_var="foobar"
json=$(begin { "quicklock" : "${my_var}" } end)
echo "$json"
This fragment displays the desired output:
{ "quicklock" : "foobar" }
This is just a proof of concept. You can define the syntax any way you want (for example, ending the input with a custom EOF string, or escaping invalid characters correctly). Since Bash allows function identifiers that use characters other than alphanumerics, it is even possible to define a syntax like this:
json=$(/ { "quicklock" : "${my_var}" } /)
Moreover, if you relax the first criterion (escape characters), ordinary assignment will nicely solve this problem:
json="{ \"quicklock\" : \"${my_var}\" }"
How about just using the shell's natural concatenation of strings? If you concatenate ${my_var1} rather than interpolate it, you can avoid escapes and get everything on one line:
my_var1="foobar"
my_var2="quux"
json='{"quicklock":"'${my_var1}'","slowlock":"'$my_var2'"}'
That said, this is a pretty crude scheme, and as others have pointed out you'll have problems if the variables, say, contain quote characters.
Since avoiding escape characters is a strong requirement, here is a here-doc based solution:
#!/bin/bash
my_var='foobar'
read -r -d '' json << EOF
{
"quicklock": "$my_var"
}
EOF
echo "$json"
It will give you the same output as the first solution I mentioned.
Just be careful: if you put the first EOF inside double quotes,
read -r -d '' json << "EOF"
then $my_var is not treated as a variable but as plain text, so you would get this output:
{
"quicklock": "$my_var"
}

CGI table with perl

I am trying to build a login form with CGI, using perl.
sub show_login_form{
return div ({-id =>'loginFormDiv'}),
start_form, "\n",
CGI->start_table, "\n",
CGI->end_table, "\n",
end_form, "\n",
div, "\n";
}
I was wondering why I don't need to add CGI-> before start_form but if I don't include it before start_table and end_table, "start_table" and "end_table" are printed as strings?
Thank you for your help.
Why can I use some subroutines?
Because you are likely importing them using the following use statement:
use CGI qw(:standard);
As documented in CGI - Using the function oriented interface, this will import "standard" features, 'html2', 'html3', 'html4', 'ssl', 'form' and 'cgi'.
But that does not include the table methods.
To get them too, you can modify your use statement to the following:
use CGI qw(:standard *table);
Why does removing CGI-> print start_table as a string?
Because you unwisely do not have use strict turned on.
If you had, you would've gotten the following error:
Bareword "start_table" not allowed while "strict subs" in use
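Both behaviours can be reproduced without CGI at all. The following is a minimal sketch that compiles the same bareword with and without strict subs; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# With strict subs the snippet fails to compile; capture the error text.
my $ok = eval q{ use strict; my @parts = (start_table, "\n"); 1 };
my $error = $@;
print "strict: $error";

# Without strict subs the unknown bareword silently becomes a string.
my @parts = eval q{ no strict 'subs'; no warnings; (start_table, "\n") };
print "no strict: $parts[0]\n";
```

This is why the missing import shows up as the literal text start_table in the page instead of as an error.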

parse domains from html page using perl

I have an HTML page that contains URLs like:
<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">
I want to extract:
site.com/path
www.site.org
that is, the text between <h3><a href=" and /index.php, without the scheme.
I've tried this code:
#!/usr/local/bin/perl
use strict;
use warnings;
open (MYFILE, 'MyFileName.txt');
while (<MYFILE>)
{
my $values1 = split('http://', $_); #VALUE WILL BE: www.site.org/path/index2.php
my @values2 = split('index.php', $values1); #VALUE WILL BE: www.site.org/path/ ?option=com_content
print $values2[0]; # here it must print www.site.org/path/ but it don't
print "\n";
}
close (MYFILE);
But this gives the output:
2
1
2
2
1
1
It also doesn't parse https websites.
I hope that is clear. Regards.
The main thing wrong with your code is that when you call split in scalar context as in your line:
my $values1 = split('http://', $_);
It returns the size of the list created by the split. See split.
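The difference between the two contexts can be seen in a short sketch that uses one of the sample lines from the question; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = '<h3><a href="http://site.com/path/index.php" h="blablabla">';

my $count = split m{http://}, $line;    # scalar context: number of fields
my @parts = split m{http://}, $line;    # list context: the fields themselves

print "scalar context: $count\n";
print "list context:   $parts[1]\n";
```

A line containing http:// splits into two fields and an https:// line into one, which explains the 2s and 1s in the question's output.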
But I don't think split is appropriate for this task anyway. If you know that the value you are looking for will always lie between 'http[s]://' and '/index.php' you just need a regex substitution in your loop (you should also be more careful opening your file...):
open(my $myfile_fh, '<', 'MyFileName.txt') or die "Couldn't open $!";
while(<$myfile_fh>) {
s{.*http[s]?://(.*)/index\.php.*}{$1} && print;
}
close($myfile_fh);
It's likely you will need a more general regex than that, but I think this would work based on your description of the problem.
This feels to me like a job for modules
HTML::LinkExtor
URI
Generally using regexps to parse HTML is risky.
dms explained in his answer why using split isn't the best solution here:
It returns the number of items in scalar context
A normal regex is better suited for this task.
However, I do not think that line-based processing of the input is valid for HTML, or that using a substitution makes sense (it does not, especially when the pattern looks like .*Pattern.*).
Given a URL, we can extract the required information like this:
if ($url =~ m{^https?://(.+?)/index\.php}s) { # domain+path now in $1
say $1;
}
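Applied to the two URLs from the question, that pattern behaves as follows. This is a small sketch; the helper name domain_path is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns host plus path up to /index.php, or undef.
sub domain_path {
    my ($url) = @_;
    return $url =~ m{^https?://(.+?)/index\.php}s ? $1 : undef;
}

for my $url ('http://site.com/path/index.php',
             'https://www.site.org/index.php?option=com_content') {
    my $found = domain_path($url);
    print defined $found ? "$found\n" : "no match\n";
}
```

The non-greedy (.+?) stops at the first /index.php, so a query string after it does not end up in the capture.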
But how do we extract the URLs? I'd recommend the wonderful Mojolicious suite.
use strict; use warnings;
use feature 'say';
use File::Slurp 'slurp'; # makes it easy to read files.
use Mojo::DOM; # the DOM parser from the Mojolicious suite
my $html_file = shift @ARGV; # take file name from command line
my $dom = Mojo::DOM->new(scalar slurp $html_file);
for my $link ($dom->find('a[href]')->each) {
say $1 if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
}
The find method can take CSS selectors (here: all a elements that have an href attribute). The each flattens the result set into a list which we can loop over.
As I print to STDOUT, we can use shell redirection to put the output into a wanted file, e.g.
$ perl the-script.pl html-with-links.html >only-links.txt
The whole script as a one-liner:
$ perl -Mojo -E'$_->attr("href") =~ m{^https?://(.+?)/index\.php}s and say $1 for x(b("test.html")->slurp)->find("a[href]")->each'

How do I convert various user-inputted line break characters to <br> using Perl?

I have a <textarea> for user input, and, as they are invited to do, users liberally add line breaks in the browser and I save this data directly to the database.
Upon displaying this data back on a webpage, I need to convert the line breaks to <br> tags in a reliable way that takes into consideration the \n's, the \r\n's and any other common line break sequences employed by client systems.
What is the best way to do this in Perl without doing regex substitutions every time? I am hoping, naturally, for yet another awesome CPAN module recommendation... :)
There's nothing wrong with using regexes here:
s/\r?\n/<br>/g;
Actually, if you're having to deal with Mac users, or if there still happens to be some weird computer that uses form-feeds, you would probably have to use something like this:
$input =~ s/(\r\n|\n|\r|\f)/<br>/g;
#!/usr/bin/perl
use strict; use warnings;
use Socket qw( :crlf );
my $text = "a${CR}b${CRLF}c${LF}";
$text =~ s/$LF|$CR$LF?/<br>/g;
print $text;
Following up on @daxim's comment, here is the modified version:
#!/usr/bin/perl
use strict; use warnings;
use charnames ':full';
my $text = "a\N{CR}b\N{CR}\N{LF}c\N{LF}";
$text =~ s/\N{LF}|\N{CR}\N{LF}?/<br>/g;
print $text;
Following up on @Marcus's comment, here is a contrived example:
#!/usr/bin/perl
use strict; use warnings;
use charnames ':full';
my $t = (my $s = "a\012\015\012b\012\012\015\015c");
$s =~ s/\r?\n/<br>/g;
$t =~ s/\N{LF}|\N{CR}\N{LF}?/<br>/g;
print "This is \$s: $s\nThis is \$t:$t\n";
This is a mishmash of carriage returns and line feeds (which, at some point in the past, I did encounter).
Here is the output of the script on Windows using ActiveState Perl:
C:\Temp> t | xxd
0000000: 5468 6973 2069 7320 2473 3a20 613c 6272  This is $s: a<br
0000010: 3e3c 6272 3e62 3c62 723e 3c62 723e 0d0d  ><br>b<br><br>..
0000020: 630d 0a54 6869 7320 6973 2024 743a 613c  c..This is $t:a<
0000030: 6272 3e3c 6272 3e62 3c62 723e 3c62 723e  br><br>b<br><br>
0000040: 3c62 723e 3c62 723e 630d 0a              <br><br>c..
or, as text:
chis is $s: a<br><br>b<br><br>
This is $t:a<br><br>b<br><br><br><br>c
Admittedly, you are not likely to end up with this input. However, if you want to cater for any unexpected oddities that might indicate a line ending, you might want to use
$s =~ s/\N{LF}|\N{CR}\N{LF}?/<br>/g;
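Wrapped in a function, the same substitution might look like the sketch below. The name nl2br is made up for illustration, and octal escapes are used instead of \r and \n so that the pattern means the same bytes on every platform.

```perl
use strict;
use warnings;

# Hypothetical helper: turns CRLF, lone CR and lone LF into <br>.
sub nl2br {
    my ($text) = @_;
    # CRLF is listed first so it becomes one <br> rather than two.
    $text =~ s/\015\012|\015|\012/<br>/g;
    return $text;
}

print nl2br("a\015b\015\012c\012"), "\n";
```

Because the stored text is left untouched, the function can be called at display time and changed later without touching the database.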
Also, for reference, CGI.pm canonicalizes line-endings this way:
# Define the CRLF sequence. I can't use a simple "\r\n" because the meaning
# of "\n" is different on different OS's (sometimes it generates CRLF, sometimes LF
# and sometimes CR). The most popular VMS web server
# doesn't accept CRLF -- instead it wants a LR. EBCDIC machines don't
# use ASCII, so \015\012 means something different. I find this all
# really annoying.
$EBCDIC = "\t" ne "\011";
if ($OS eq 'VMS') {
$CRLF = "\n";
} elsif ($EBCDIC) {
$CRLF= "\r\n";
} else {
$CRLF = "\015\012";
}
As a matter of general principle, storing the data as entered by the user and doing the EOL-to-<br> conversion each time it's displayed is the better (even Right™) way to do it, both for the sake of having access to the original version of the data and because you may decide at some point that you want to change your filtering algorithm.
But, no, I personally would not use a regex in this case. I would use Parse::BBCode, which provides a whole lot of additional functionality (i.e., full BBCode support, or at least as much as you choose not to disable) in addition to providing line breaks without requiring users to explicitly enter markup for them.