In Perl, using the module WWW::Mechanize (that specific module is required, not another one), is it possible to "parse" a document from a string variable instead of a URL?
I mean instead of
$mech->get($url);
to do something like
$html = '<html...';
$mech->???($html);
Is this possible?
You could write the data to disk and then get() it in the usual manner. Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp;
use URI::file;
use WWW::Mechanize;
my $data = '<html><body>foo</body></html>';
# write the data to disk
my $fh = File::Temp->new;
print $fh $data;
$fh->close;
my $mech = WWW::Mechanize->new;
$mech->get( URI::file->new( $fh->filename ) );
print $mech->content;
prints: <html><body>foo</body></html>
Got it:
$mech->get(0);
$mech->update_html('<html>...</html>');
It works!
Not really. You could try getting the HTTP::Response object using $mech->response and then using that object's content method to replace the content with your own string. But you would have to adjust all the message headers as well, and it would get quite messy.
What is it that you want to do? The methods like forms and images that WWW::Mechanize provides are based on other modules and are fairly simple to code.
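For what it's worth, here is a rough sketch of that response-swapping idea (untested, and the stale headers are exactly the messiness mentioned above; the data: URL is just a trick to give Mechanize a successful response to overwrite):
use WWW::Mechanize;
my $html = '<html><body><a href="/foo">foo</a></body></html>';
my $mech = WWW::Mechanize->new;
$mech->get('data:text/html,placeholder'); # any request that succeeds will do
$mech->response->content($html);          # swap in our own string
$mech->update_html($html);                # re-run Mechanize's HTML parsing
print $_->url, "\n" for $mech->links;     # links now come from $html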
I am reading a large string into Perl from a web page using WWW::Mechanize. I am not writing it to a file, just going through it. However, apostrophes are coming out as &#39;. Is there a way to automatically convert the entire string so that I get ' instead of its character code?
To decode strings with HTML entities you can use the decode_entities() function from HTML::Entities. For example:
use feature qw(say);
use strict;
use warnings;
use HTML::Entities;
my $str = "An &#39;example&#39;";
say decode_entities($str);
Output:
An 'example'
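In the Mechanize scenario from the question, that would look something like this (a sketch; it assumes the page has already been fetched into $mech):
use HTML::Entities;
my $clean = decode_entities($mech->content); # &#39; becomes '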
The program listed below fails with the following error:
JSON text must be an object or array (but found number, string, true, false or null, use allow_nonref to allow this) at json_test.pl line 10.
It works fine when I comment out the thread startup/join, or when the JSON is parsed before the thread is started.
The message seems to come from the JSON library, so I suppose something is wrong with it.
Any ideas what's going on and how to fix it?
# json_test.pl
use strict;
use warnings;
use threads;
use JSON;
use Data::Dumper;
my $t = threads->new(\&DoSomething);
my $str = '{"category":"dummy"}';
my $json = JSON->new();
my $data = $json->decode($str);
print Dumper($data);
$t->join();
sub DoSomething
{
sleep 10;
return 1;
}
JSON uses JSON::XS if it is installed, and JSON::XS is not compatible with Perl threads (please don't take the author's words at face value: threads are discouraged and difficult to use effectively, but they are not deprecated and there are no plans to remove them). The community-preferred fork Cpanel::JSON::XS is thread-safe and is used by default by JSON::MaybeXS, which is a mostly drop-in replacement for JSON.
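For instance, the test program above should work if you switch it to JSON::MaybeXS (a sketch, assuming JSON::MaybeXS and Cpanel::JSON::XS are installed):
# json_test_fixed.pl
use strict;
use warnings;
use threads;
use JSON::MaybeXS; # picks Cpanel::JSON::XS when available
use Data::Dumper;
my $t = threads->new(\&DoSomething);
my $str = '{"category":"dummy"}';
my $json = JSON::MaybeXS->new(); # same OO interface as JSON->new()
my $data = $json->decode($str);  # no longer dies alongside the thread
print Dumper($data);
$t->join();
sub DoSomething
{
    sleep 10;
    return 1;
}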
I have an HTML page that contains URLs like:
<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">
I want to extract:
site.com/path
www.site.org
that is, the text between <h3><a href=" and /index.php.
I've tried this code:
#!/usr/local/bin/perl
use strict;
use warnings;
open (MYFILE, 'MyFileName.txt');
while (<MYFILE>)
{
my $values1 = split('http://', $_); # VALUE WILL BE: www.site.org/path/index2.php
my @values2 = split('index.php', $values1); # VALUE WILL BE: www.site.org/path/ ?option=com_content
print $values2[0]; # here it should print www.site.org/path/ but it doesn't
print "\n";
}
close (MYFILE);
But this gives the output:
2
1
2
2
1
1
And it doesn't parse https websites.
I hope you understand. Regards.
The main thing wrong with your code is that you call split in scalar context, as in your line:
my $values1 = split('http://', $_);
In scalar context it returns the size of the list created by the split. See the documentation for split.
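A quick way to see the difference:
my $count = split /,/, 'a,b,c'; # scalar context: $count is 3
my @parts = split /,/, 'a,b,c'; # list context: ('a', 'b', 'c')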
But I don't think split is appropriate for this task anyway. If you know that the value you are looking for will always lie between 'http[s]://' and '/index.php', you just need a regex substitution in your loop (you should also be more careful opening your file...):
open(my $myfile_fh, '<', 'MyFileName.txt') or die "Couldn't open $!";
while(<$myfile_fh>) {
s{.*http[s]?://(.*)/index\.php.*}{$1} && print;
}
close($myfile_fh);
It's likely you will need a more general regex than that, but I think this would work based on your description of the problem.
This feels to me like a job for the modules HTML::LinkExtor and URI. Generally, using regexes to parse HTML is risky.
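For example, a sketch along those lines (untested; it reuses the file name from the question):
#!/usr/local/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;
# collect the href value of every <a> tag in the file
my @hrefs;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
});
$parser->parse_file('MyFileName.txt');
for my $href (@hrefs) {
    my $uri = URI->new($href);
    next unless $uri->scheme && $uri->scheme =~ /^https?\z/;
    # keep the host plus whatever path precedes /index.php
    if ($uri->path =~ m{^(.*)/index\.php\z}) {
        print $uri->host . $1, "\n";
    }
}
For the two example lines this prints site.com/path and www.site.org.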
dms explained in his answer why using split isn't the best solution here: it returns the number of items in scalar context. A normal regex is better suited for this task.
However, I do not think that line-based processing of the input is valid for HTML, or that using a substitution makes sense (it does not, especially when the pattern looks like .*Pattern.*).
Given a URL, we can extract the required information like this:
if ($url =~ m{^https?://(.+?)/index\.php}s) { # domain+path now in $1
say $1;
}
But how do we extract the URLs? I'd recommend the wonderful Mojolicious suite.
use strict; use warnings;
use feature 'say';
use File::Slurp 'slurp'; # makes it easy to read files.
use Mojo::DOM;
my $html_file = shift @ARGV; # take file name from command line
my $dom = Mojo::DOM->new(scalar slurp $html_file);
for my $link ($dom->find('a[href]')->each) {
say $1 if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
}
The find method can take CSS selectors (here: all a elements that have an href attribute). The each flattens the result set into a list which we can loop over.
As the script prints to STDOUT, we can use shell redirection to put the output into the desired file, e.g.
$ perl the-script.pl html-with-links.html >only-links.txt
The whole script as a one-liner:
$ perl -Mojo -E'$_->attr("href") =~ m{^https?://(.+?)/index\.php}s and say $1 for x(b("test.html")->slurp)->find("a[href]")->each'
I have two arrays that contain related data. I need to insert them into an HTML table. I am accessing these arrays from a different program by using modules, which I found out about by searching the forum.
package My::Module;
use strict;
use warnings;
use File::Slurp;
use Data::Dumper;
use Exporter;
our @ISA = 'Exporter';
our @EXPORT = qw(\@owners \@values);
our(@owners, @values);
$Data::Dumper::Indent = 1;
my #fileDatas = read_file("/x/home/venganesan/output.txt");
This is under a folder My and is named Module.pm. Parts of the other file, which will contain the table, are:
use strict;
use warnings;
use CGI;
use My::Module;
my $q = new CGI;
print $q->header;
print $q->start_html(-title=>"Table testing", -style =>{'src'=> '/x/home/venganesan/style.css'});
print $q->h1("Modified WOWO diff");
print $q->table( {-border=>1, cellpadding=>3},
$q->Tr($q->th(['WOWODiff', 'Owner', 'Signoff'])),
foreach $own (@owners) {
$q->Tr(
$q->td([$own,'Two', 'Three'])},
$q->td(['four', 'Five', 'Six']),
),
I am just trying to print one array to see how it works, and then I will include the other. The output I am getting is both arrays on the command line, without the HTML, when I use Module.pm. If I remove it, I get the HTML code. I am learning Perl and new modules on the fly. I am open to criticism and better ways to implement the code.
It's 2013. No-one should be generating HTML using CGI.pm these days. By all means, use CGI.pm for generating headers and parsing CGI requests, but please consider using something like the Template Toolkit for your HTML.
I'm not clear what your question is. Are you saying that you get errors if you use My::Module (that's a terrible name for it, by the way)? In that case you should see what gets written to the web server's error log and address the problems given there.
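To illustrate the Template Toolkit suggestion: here is a minimal sketch with the template held inline (it assumes My::Module has been fixed so that it really does export @owners):
use strict;
use warnings;
use CGI;
use Template;
use My::Module; # assumed to export @owners
my $q = CGI->new;
print $q->header;
# the [% ... %] directives are Template Toolkit syntax
my $template = <<'EOT';
<table border="1" cellpadding="3">
  <tr><th>WOWODiff</th><th>Owner</th><th>Signoff</th></tr>
  [% FOREACH own IN owners %]
  <tr><td>[% own %]</td><td>Two</td><td>Three</td></tr>
  [% END %]
</table>
EOT
my $tt = Template->new;
$tt->process(\$template, { owners => \@owners }) or die $tt->error;
This keeps the HTML in one readable place instead of scattering it through $q->Tr and $q->td calls.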
Currently I have a link like the following:
<a href=/myweb/cgi-bin/my.cgi?name=B. anthracis>B. anthracis</a>
But instead of taking "B. anthracis" as the input parameter, it takes just "B.".
How can I modify the above HTML or CGI script to allow that?
And currently my CGI script looks like this:
use CGI;
my $cgi = CGI->new();
my $param = $cgi->param('name');
print "$param\n";
You should URL encode the query string:
<a href="/myweb/cgi-bin/my.cgi?name=B.%20anthracis">B. anthracis</a>
And putting quotes around your attribute values is strongly recommended.
You can use encodeURIComponent in JavaScript or uri_escape in Perl to encode each parameter name and value before building the query string.
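For instance, on the Perl side (a sketch using URI::Escape):
use URI::Escape qw(uri_escape);
my $name = 'B. anthracis';
my $url = '/myweb/cgi-bin/my.cgi?name=' . uri_escape($name);
# $url is now /myweb/cgi-bin/my.cgi?name=B.%20anthracis
print qq{<a href="$url">$name</a>\n};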