Perl regular expression for HTML

I need to extract the IMDb ID (for example, tt0416449 for the movie 300) for a movie specified by the variable $url. I looked at the source of the search results page and came up with the following regex:
use LWP::Simple;
$url = "http://www.imdb.com/search/title?title=$FORM{'title'}";
if (is_success( $content = LWP::Simple::get($url) ) ) {
print "$url is alive!\n";
} else {
print "No movies found";
}
$code = "";
if ($content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s) {
$code = $1;
}
I am getting an internal server error at this line
$content=~/<td class="number">1\.</td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s
I am very new to Perl and would be grateful if anyone could point out my mistake(s).

Use an HTML parser. Regular expressions cannot reliably parse arbitrary HTML.
Anyway, the reason for the error is probably that you forgot to escape a forward slash in your regex (the one in </td>), which ends the pattern early. It should look like this:
/<td class="number">1\.<\/td><td class="image"><a href="\/title\/tt[\d]{1,7}"/s
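Two details are worth adding to the corrected regex: it still has no capture group, so $1 stays empty, and a different delimiter removes the need to escape slashes at all. A minimal sketch, assuming the link markup looks like the question's pattern (the sample HTML and the helper name first_imdb_id are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: return the first IMDb id found in a title link, or undef.
sub first_imdb_id {
    my ($html) = @_;
    # m{...} delimiters mean "/" needs no escaping; (tt\d+) captures the id into $1.
    return $html =~ m{<a href="/title/(tt\d+)} ? $1 : undef;
}

# Made-up fragment shaped like the markup the question's regex expects.
my $sample = '<td class="number">1.</td><td class="image"><a href="/title/tt0416449/">300</a>';
my $id = first_imdb_id($sample);
print defined $id ? "$id\n" : "no id found\n";
```

This is still pattern matching on markup, so the parser-based answers remain the more robust route if the page layout changes.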

A very nice interface for this type of work is provided by some tools of the Mojolicious distribution.
Long version
The combination of its UserAgent, DOM and URL classes can work in a very robust way:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
use Mojo::URL;
# preparations
my $ua = Mojo::UserAgent->new;
my $url = "http://www.imdb.com/search/title?title=Casino%20Royale";
# try to load the page
my $tx = $ua->get($url);
# error handling
die join ', ' => $tx->error unless $tx->success;
# extract the url
my $movie_link = $tx->res->dom('a[href^=/title]')->first;
my $movie_url = Mojo::URL->new($movie_link->attrs('href'));
say $movie_url->path->parts->[-1];
Output:
tt0381061
Short version
The one-liner helper module ojo allows a much shorter version:
$ perl -Mojo -E 'say g("imdb.com/search/title?title=Casino%20Royale")->dom("a[href^=/title]")->first->attrs("href") =~ m|([^/]+)/?$|'
Output:
tt0381061

I agree that XML is hostile to line-oriented editing, and thus un-Unix-like, but there is AWK.
If awk can do it, Perl surely can. I can produce a list:
curl -s 'http://www.imdb.com/find?q=300&s=all' | awk -vRS='<a|</a>' -vFS='>|"' -vID=$1 '
$NF ~ ID && /title/ { printf "%s\t", $NF; match($2, "/tt[0-9]+/"); print substr($2, RSTART+1, RLENGTH-2)}
' | uniq
Pass the search string via "ID".
Basically it's all about how you choose your tokenizer in awk; I split records on the <a> tag. It should be easier in Perl.

Related

How to replace HTML td tags

I am using Perl to achieve this
while(<INFILE>){
chomp;
if(/\<td/){
system("perl -i -e 's/<td/<td bgcolor="blue"/g' $_");
}
}
When I run the command I get
./HtmlTest.pl file.html
Bareword found where operator expected at ./HtmlTest.pl line 13, near ""perl -i -e 's/<td/<td bgcolor="grey"
(Missing operator before grey?)
String found where operator expected at ./HtmlTest.pl line 13, near "grey"/g' $_""
syntax error at ./HtmlTest.pl line 13, near ""perl -i -e 's/<td/<td bgcolor="grey"
Execution of ./HtmlTest.pl aborted due to compilation errors.
I am not able to figure out why.
Even if I run it as
perl HtmlTest.pl file.html
I get the same errors.
Sample html table
<td>ABC</td>
<td>DEF</td>
<td>20:00:00</td>
Any advice appreciated
Regexes become fragile and hard to maintain when it comes to parsing complex HTML files; a better approach is to use a dedicated HTML parser. Here is an example using XML::LibXML, provided you have a valid HTML file:
use strict;
use warnings;
use XML::LibXML;
my $filename = 'file.html';
my $html = XML::LibXML->load_html( location => $filename );
for my $node ($html->findnodes('//td')) {
$node->setAttribute(bgcolor => "blue");
}
print $html->toStringHTML;
I think you need to escape the " inside the string, since it complains about
"near "grey"/g'
(assuming you tried with grey in your code).
The whole string is "perl -i -e '<string_no_quotes>' $_", so if string_no_quotes contains a " the outer string ends early and you get this error; the " needs to be escaped.
Update:
Would something like this work, where you write the result to stdout and redirect it to the file instead?
foreach my $line ('<td>ABC</td>', '<td>DEF</td>', '<td>20:00:00</td>', '<h1>test</h1>') {
    $_ = $line;
    # add the attribute only on lines that contain a <td tag
    s/<td/<td bgcolor="blue"/g if /<td/;
    print "$_\n";
}
I replaced the while loop with for loop so I could test it in an online parser. The one I used was this: https://www.tutorialspoint.com/execute_perl_online.php
In the OP's code the following line contains unescaped double quotes; with them escaped it becomes
system("perl -i -e 's/<td/<td bgcolor=\"blue\"/g' $_");
This is still wrong: $_ holds the current line read from <INFILE>, but perl -i expects the name of an input file in that position. Without -p the one-liner also has no read/print loop, so the file would not be changed.
The following code demonstrates an alternative solution that does not use any modules. It is not the best solution either.
use strict;
use warnings;
while( <DATA> ) {
s/<td>/<td bgcolor="blue">/;
print;
}
__DATA__
<block>
Some text goes in this place
</block>
<td>ABC</td>
<td>DEF</td>
<td>20:00:00</td>
<p>
New paragraph describing something
</p>
Instead of using bgcolor="blue", a more correct approach is an external CSS stylesheet, with the cells referring to a style by class (for example class='some_style').
This approach lets you change the style file for the desired tags without touching the HTML file.
You edit the CSS file with the desired style and your web page is shown with the new colors, text styles, list types, etc.
See: HTML Styles CSS
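The in-place edit that the system() call was attempting can also be done inside the script itself, which avoids the quoting problem entirely. A minimal sketch, assuming the files are small HTML fragments like the sample table (the helper name colour_cells and the .bak backup suffix are choices made for this illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: add a bgcolor attribute to every <td start tag,
# rewriting each named file in place and keeping a .bak copy.
sub colour_cells {
    my ($colour, @files) = @_;
    return unless @files;                  # with no files, <> would read STDIN
    local ($^I, @ARGV) = ('.bak', @files); # $^I switches on in-place editing for <>
    while (my $line = <>) {
        $line =~ s/<td\b/<td bgcolor="$colour"/g;
        print $line;                       # goes to the rewritten file, not the terminal
    }
    return;
}

# Usage: perl colour_cells.pl file.html
colour_cells('blue', @ARGV);
```

Running it twice on the same file would add the attribute twice, since the substitution does not check for an existing bgcolor; a parser-based approach such as the XML::LibXML answer avoids that.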

Escape special characters in JSON string

I have a Perl script which contains the variable $env->{'arguments'}. This variable should contain a JSON object, and I want to pass that JSON object as an argument to another external script and run it using backticks.
Value of $env->{'arguments'} before escaping:
$VAR1 = '{"text":"This is from module and backslash \\ should work too"}';
Value of $env->{'arguments'} after escaping:
$VAR1 = '"{\\"text\\":\\"This is from module and backslash \\ should work too\\"}"';
Code:
print Dumper($env->{'arguments'});
escapeCharacters(\$env->{'arguments'});
print Dumper($env->{'arguments'});
my $command = './script.pl '.$env->{'arguments'}.'';
my $output = `$command`;
Escape characters function:
sub escapeCharacters
{
#$env->{'arguments'} =~ s/\\/\\\\"/g;
$env->{'arguments'} =~ s/"/\\"/g;
$env->{'arguments'} = '"'.$env->{'arguments'}.'"';
}
What is the correct way to escape that JSON string so that I can use it as an argument for my script?
You're reinventing a wheel.
use String::ShellQuote qw( shell_quote );
my $cmd = shell_quote('./script.pl', $env->{arguments});
my $output = `$cmd`;
Alternatively, there are a number of IPC:: modules you could use instead of qx. For example,
use IPC::System::Simple qw( capturex );
my $output = capturex('./script.pl', $env->{arguments});
Because you have at least one argument, you could also use the following:
my $output = '';
open(my $pipe, '-|', './script.pl', $env->{arguments});
while (<$pipe>) {
$output .= $_;
}
close($pipe);
Note that the current directory isn't necessarily the directory that contains the script that is executing. If you want to execute the script.pl that's in the same directory as the currently executing script, you want the following changes:
Add
use FindBin qw( $RealBin );
and replace
'./script.pl'
with
"$RealBin/script.pl"
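To see why the list form needs no escaping, here is a small self-contained sketch. It assumes a Unix-like system, and it uses a child perl running a short -e program as a hypothetical stand-in for script.pl; the helper name run_with_json_arg is made up as well:

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json);

# Run a child perl with the JSON text as one argument and return what it prints.
# The -e program stands in for script.pl: it decodes its first argument.
sub run_with_json_arg {
    my ($json) = @_;
    my @cmd = ($^X, '-MJSON::PP', '-e',
               'print decode_json($ARGV[0])->{text}', $json);
    # List form: no shell is involved, so quotes and backslashes pass through as-is.
    open(my $pipe, '-|', @cmd) or die "Cannot start child: $!";
    my $output = do { local $/; <$pipe> };
    close($pipe) or die "Child failed, status $?";
    return $output;
}

my $json = encode_json({ text => 'quotes " and backslash \\ survive' });
print run_with_json_arg($json), "\n";
```

The argument reaches the child byte for byte, so the only encoding step left is producing valid JSON in the first place.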
Piping it to your second program rather than passing it as an argument seems like it would make more sense (and be a lot safer).
test1.pl
#!/usr/bin/perl
use strict;
use JSON;
use Data::Dumper;
undef $/;
my $data = decode_json(<>);
print Dumper($data);
test2.pl
#!/usr/bin/perl
use strict;
use IPC::Open2;
use JSON;
my %data = ('text' => "this has a \\backslash", 'nums' => [0,1,2]);
my $json = JSON->new->encode(\%data);
my ($chld_out, $chld_in);
print("Executing script\n");
my $pid = open2($chld_out, $chld_in, "./test1.pl");
print $chld_in "$json\n";
close($chld_in);
my $out = do {local $/; <$chld_out>};
waitpid $pid, 0;
print(qq~test1.pl output =($out)~);

How can I delete parts of a JSON web response?

I have a simple Perl script and I want to remove everything up to the word "city", or everything up to the nth occurrence (the 2nd in my particular case) of a comma (","). Here's what it looks like:
#!/usr/bin/perl
use warnings;
use strict;
my $CMD = `curl http://ip-api.com/json/8.8.8.8`;
chomp($CMD);
my $find = "^[^city]*city";
$CMD =~ s/$find//;
print $CMD;
The output is this:
{"as":"AS15169 Google Inc.","city":"Mountain View","country":"United States","countryCode":"US","isp":"Google","lat" :37.386,"lon":-122.0838,"org":"Google","query":"8.8.8.8","region":"CA","regionName":"California","status":"success","timezone":"America/Los_Angeles","zip":"94035"}
So I want to drop
" {"as":"AS15169 Google Inc.","
or drop up to
{"as":"AS15169 Google Inc.","city":"Mountain View",
EDIT:
I see I was doing far too much when matching the string. I simplified the fix for my problem with removing all before "city". My $find has been changed to
my $find = ".*city";
While I also changed the replace function like so,
$CMD =~ s/$find/city/;
Still haven't figured out how to remove all before the nth occurrence of a comma or any character / string for that matter.
The content you get back is JSON, so you can easily turn it into a Perl data structure, play with it, and even turn it back into JSON if you like. That's the point! And, it's so easy:
use Mojo::UserAgent;
use Mojo::JSON qw(decode_json encode_json);
my $ua = Mojo::UserAgent->new;
my $tx = $ua->get( 'http://ip-api.com/json/8.8.8.8' );
my $json = $tx->res->body;
my $perl = decode_json( $json );
delete $perl->{'as'};
my $new_json = encode_json( $perl );
print $new_json;
Mojolicious is wonderful for this. It's my preferred way of dealing with JSON even without the user-agent stuff. If you play with the JSON string directly, you're likely to have problems when the order of elements changes or the string contains wide characters.
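The same decode, delete, encode round trip can be sketched without the network or Mojolicious, using the core JSON::PP module. The helper name drop_keys and the shortened sample response are made up for illustration:

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: decode a JSON object, remove the named keys, re-encode.
sub drop_keys {
    my ($json_text, @keys) = @_;
    my $data = JSON::PP->new->decode($json_text);
    delete @{$data}{@keys};    # hash slice delete; missing keys are ignored
    # canonical() sorts the keys so the output order is stable
    return JSON::PP->new->canonical->encode($data);
}

# Shortened stand-in for the ip-api.com response shown in the question.
my $response = '{"as":"AS15169 Google Inc.","city":"Mountain View","country":"United States"}';
print drop_keys($response, 'as'), "\n";
# prints {"city":"Mountain View","country":"United States"}
```

Because the keys are addressed by name, this keeps working if the service reorders the fields.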
You don't have to manually decode_json() with Mojolicious. Simply do this:
my $tx = $ua->get('http://ip-api.com/json/8.8.8.8');
my $json = $tx->res->json;
my $as = $json->{as};
You can even go fancy with JSON pointers:
my $as = $tx->res->json("/as");
Something like
#!/usr/bin/perl -w
my $results = `curl http://ip-api.com/json/8.8.8.8`;
chomp $results;
$results =~ s/^.*city":"\w+\s?\w+",//g;
print $results . "\n";
should do the trick, unless there's a misunderstanding of what you want to keep vs. remove.
FYI, http://regexr.com/ is totally my go to for regex happiness.
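The remaining part of the question, dropping everything up to the nth comma, can be sketched with a counted group. This is only an illustration of the regex technique: it counts every comma, including any inside quoted values, so decoding the JSON is still the safer route. The helper name strip_through_nth is made up:

```perl
use strict;
use warnings;

# Hypothetical helper: remove everything up to and including the nth
# occurrence of $sep. The text is returned unchanged if there are fewer.
sub strip_through_nth {
    my ($text, $sep, $n) = @_;
    $n = int $n;
    my $q = quotemeta $sep;    # treat the separator literally
    # (?:.*?$q){$n} matches "shortest run of text, then separator" n times
    (my $rest = $text) =~ s/\A(?:.*?$q){$n}//s;
    return $rest;
}

print strip_through_nth('a,b,c,d', ',', 2), "\n";    # prints c,d
```

The same helper works for any literal separator, for example strip_through_nth($text, '","', 1).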

Parsing JSON Data::Dumper output array in Perl

I'm trying to edit an old Perl script and I'm a complete beginner. The request from the server returns as:
$VAR1 = [
{
'keywords' => [
'bare knuckle boxing',
'support group',
'dual identity',
'nihilism',
'support',
'rage and hate',
'insomnia',
'boxing',
'underground fighting'
],
}
];
How can I parse this JSON string to grab:
$keywords = "bare knuckle boxing,support group,dual identity,nihilism,support,rage and hate,insomnia,boxing,underground fighting"
Full perl code
#!/usr/bin/perl
use LWP::Simple; # From CPAN
use JSON qw( decode_json ); # From CPAN
use Data::Dumper; # Perl core module
use strict; # Good practice
use warnings; # Good practice
use WWW::TheMovieDB::Search;
use utf8::all;
use Encode;
use JSON::Parse 'json_to_perl';
use JSON::Any;
use JSON;
my $api = new WWW::TheMovieDB::Search('APIKEY');
my $img = $api->type('json');
$img = $api->Movie_imdbLookup('tt0137523');
my $decoded_json = decode_json( encode("utf8", $img) );
print Dumper $decoded_json;
Thanks.
Based on comments and on your recent edit, I would say that what you are asking is how to navigate a perl data structure, contained in the variable $decoded_json.
my $keywords = join ",", @{ $decoded_json->[0]{'keywords'} };
say qq{ @{ $arrayref->[0]->{'keywords'} } };
As TLP pointed out, all you've shown is a combination of perl arrays/hashes. But you should look at the JSON.pm documentation, if you have a JSON string.
The result you present looks similar to JSON, but it is the Perl representation of it (i.e. => instead of :, etc.). I don't think you need to look into the JSON part, as you already have the data. You just need to use Perl to join the data into a text string.
Just to elaborate on vol7ron's solution:
#get a reference to the list of keywords
my $keywords_list = $decoded_json->[0]{'keywords'};
#merge this list with commas
my $keywords = join(',', @$keywords_list );
print $keywords;
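Put together as a self-contained sketch, with a shortened copy of the dumped structure standing in for the real API response (the helper name keywords_csv is made up):

```perl
use strict;
use warnings;

# Shortened stand-in for the decoded structure shown in the question:
# an array ref whose first element is a hash with a 'keywords' array ref.
my $decoded_json = [
    { keywords => [ 'bare knuckle boxing', 'support group', 'nihilism' ] },
];

# Hypothetical helper: join the keywords of the first record with commas.
sub keywords_csv {
    my ($data) = @_;
    # "// []" falls back to an empty list if the key is missing
    return join ',', @{ $data->[0]{keywords} // [] };
}

print keywords_csv($decoded_json), "\n";
# prints bare knuckle boxing,support group,nihilism
```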

Extract text from HTML - Perl using HTML::TreeBuilder

I'm trying to read the .html files and extract the text in <p> tags. Logically, my code below should work: using HTML::TreeBuilder, I parse the HTML and then extract the text in <p> using find_by_attribute("p"). But my script produces empty directories. Did I leave anything out?
#!/usr/bin/perl
use strict;
use HTML::TreeBuilder 3;
use FileHandle;
my @task = ('ar','cn','en','id','vn');
foreach my $lang (@task) {
mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
opendir (my $dir, "./$lang/") or die "$!";
my @files = grep (/\.html/,readdir ($dir));
closedir ($dir);
foreach my $file (@files) {
open (my $fh, '<', "./$lang/$file") or die "$!";
my $root = HTML::TreeBuilder->new;
$root->parse_file("./$lang/$file");
my @all_p = $root->find_by_attribute("p");
foreach my $p (@all_p) {
my $ptag = HTML::TreeBuilder->new_from_content ($p->as_HTML);
my $filewrite = substr($file, 0, -5);
open (my $outwrite, '>>', "extract_$lang/$filewrite.txt") or die $!;
print $outwrite $ptag->as_text . "\n";
my $pcontents = $ptag->as_text;
print $pcontents . "\n";
close (outwrite);
}
close (FH);
}
}
My .html files are the plain text htmls from .asp websites e.g. http://www.singaporemedicine.com/vn/hcp/med_evac_mtas.asp
My .html files are saved in:
./ar/*
./cn/*
./en/*
./id/*
./vn/*
You are confusing element with attribute. The program can be written much more concisely:
#!/usr/bin/env perl
use strictures;
use File::Glob qw(bsd_glob);
use Path::Class qw(file);
use URI::file qw();
use Web::Query qw(wq);
use autodie qw(:all);
foreach my $lang (qw(ar cn en id vn)) {
mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
foreach my $file (bsd_glob "./$lang/*.html") {
my $basename = file($file)->basename;
$basename =~ s/[.]html$/.txt/;
open my $out, '>>:encoding(UTF-8)', "./extract_$lang/$basename";
$out->say($_) for wq(URI::file->new_abs($file))->find('p')->text;
close $out;
}
}
Use find_by_tag_name to search for tag names, not find_by_attribute.
You want find_by_tag_name, not find_by_attribute:
my @all_p = $root->find_by_tag_name("p");
From the docs:
$h->find_by_tag_name('tag', ...)
In list context, returns a list of elements at or under $h that have
any of the specified tag names. In scalar context, returns the first
(in pre-order traversal of the tree) such element found, or undef if
none.
You might want to take a look at Mojo::DOM which lets you use CSS selectors.
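A minimal sketch of the find_by_tag_name fix on an in-memory document, assuming HTML::TreeBuilder is installed from CPAN (the helper name paragraph_texts and the sample markup are made up):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of every <p> element in an HTML string.
sub paragraph_texts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    # find_by_tag_name matches element names; find_by_attribute would look
    # for an attribute/value pair instead, which is why it found nothing.
    my @texts = map { $_->as_text } $tree->find_by_tag_name('p');
    $tree->delete;    # release the tree explicitly
    return @texts;
}

my $html = '<html><body><h1>Title</h1><p>First paragraph</p><p>Second</p></body></html>';
print "$_\n" for paragraph_texts($html);
```

Each element already offers as_text, so there is no need to re-parse $p->as_HTML into a second tree as the question's loop does.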