Download files with Perl

I have updated my code to look like this, but when I run it, it says it cannot find the specified link. Also, what is a good way to test that it is indeed connecting to the page?
#!/usr/bin/perl -w
use strict;
use LWP;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $browser = LWP::UserAgent->new;
$browser->credentials(
    'Apache/2.2.3 (CentOS):80',
    'datawww2.wxc.com',
    '************' => '*************'
);
my $response = $browser->get(
    'http://datawww2.wxc.com/kml/echo/MESH_Max_180min/'
);
$mech->follow_link( n => 8 );
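One quick way to check whether the request actually reaches the page is to inspect the response object before following any links; a diagnostic sketch using the $response from the code above:

print $response->status_line, "\n";           # e.g. "200 OK" or "401 Authorization Required"
print $response->decoded_content if $response->is_success;   # dump the body on success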
(Original Post)
What is the best way to download small files with Perl?
I looked on CPAN and found lwp-download, but it seems to only download from a direct link. I have a page with links that change every thirty minutes, with the date and time in the name, so they are never the same. Is there a built-in function I can use? Everyone on Google keeps saying to use Wget, but I wanted to stick with Perl if possible, just to help me learn it better while I program with it.
Also, there is a user name and password to log into the site. I know how to access the site using Perl, but I thought that might change what I can use to download with.

As stated in a comment on your other question: here
You can use the same method to retrieve .csv files as .html, or any other text-based file for that matter.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
my $csv = get("http://www.spc.noaa.gov/climo/reports/last3hours_hail.csv")
or die "Could not fetch NWS CSV page.";
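If you want the file saved to disk rather than held in a variable, LWP::Simple's getstore() does it in one call; a minimal sketch:

use LWP::Simple qw(getstore is_success);
# getstore() returns the HTTP status code of the request
my $rc = getstore("http://www.spc.noaa.gov/climo/reports/last3hours_hail.csv", "hail.csv");
die "Download failed with status $rc" unless is_success($rc);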
To log in, you may need to use WWW::Mechanize to fill out the web form (look at $mech->get(), $mech->submit_form(), and $mech->follow_link()).
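A rough sketch of that flow, assuming a form-based login (the login URL and the form field names here are guesses; check the real form in your browser first):

use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get('http://datawww2.wxc.com/login');    # hypothetical login page
$mech->submit_form(
    form_number => 1,                           # first form on the page
    fields      => {
        username => 'your_user',                # assumed field names
        password => 'your_pass',
    },
);
$mech->follow_link( url_regex => qr/MESH_Max_180min/ );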

Basically, you need to fetch the page, parse it to get the URL, and then download the file.
Personally, I'd use HTML::TreeBuilder::XPath, write a quick XPath expression to go straight to the correct href attribute node, and then plug that into LWP.
use HTML::TreeBuilder::XPath;
use LWP::Simple qw(get getstore);
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($page_content);                      # the HTML you fetched earlier, e.g. with get()
foreach my $href ( $tree->findvalues('//a/@href') ) {   # every link's href attribute
    getstore( $href, ( split m{/}, $href )[-1] );       # save under the link's own file name
}
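Alternatively, WWW::Mechanize can do the link-finding itself, which suits a page whose link names change every thirty minutes; a sketch (the credentials call assumes HTTP basic auth, and the URL pattern is an assumption):

use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->credentials( 'username', 'password' );    # basic-auth user/pass for the site
$mech->get('http://datawww2.wxc.com/kml/echo/MESH_Max_180min/');
my $link = $mech->find_link( url_regex => qr/MESH_Max_180min/ )   # match by pattern, not position
    or die "No matching link found\n";
$mech->get( $link->url_abs, ':content_file' => 'latest.kml' );    # save straight to disk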

Related

JSON::XS under mod_perl fails with POST requests

I am using the default install of Apache and mod_perl on Ubuntu 16.04.1 LTS. I have reproduced this with the default JSON::XS, and also after updating to the latest from CPAN (JSON-XS-3.02).
The code below works in all cases if I am not using mod_perl.
The script and HTML below work when using Perl via mod_cgi, with both POST and GET requests.
If, however, I am using mod_perl and I use a POST (as in the HTML provided), it fails: "Hello" does not print, and I get the following error in my Apache log file.
Usage: JSON::XS::new(klass).
If I pass the same parameter(s) via a GET request, the script works fine.
test2.pl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use JSON::XS;
my $q = CGI->new();
print $q->header(-type => 'text/plain');
my $action = $q->param('a');
my $json_str = '{"foo":"bar"}';
my $pscalar = JSON::XS->new->utf8->decode($json_str);
print "Hello";
exit 1;
HTML to call the above (named test2.pl on the server)
<html>
<body>
<form action="test2.pl" method="POST">
<input type="text" name="a"/>
<button type="submit">Submit</button>
</form>
</body>
</html>
OK, so this was a rather wild goose chase: analyzing Apache core dumps and stack traces, fixing bugs that weren't really there... Long story short:
I was trying to add an include directory to my Perl @INC by using
PerlSwitches -I/usr/local/lib/site_perl/my_new_directory
As part of that, I added
PerlOptions +Parent
so that I would get a new interpreter for each virtual host, making my -I effective for only one virtual host at a time.
I had added those flags before I enabled mod_perl, so when I enabled mod_perl, it just never worked.
By removing the PerlOptions +Parent line, things started working as expected.
As a side note, it appears +Parent makes things wonky in general.
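To see which interpreter and include path a virtual host actually ends up with, a tiny registry script can be dropped into the affected vhost (a diagnostic sketch, not part of the fix itself):

#!/usr/bin/perl
use strict;
use warnings;
use CGI;
my $q = CGI->new;
print $q->header('text/plain');
# the MOD_PERL environment variable is set when running under mod_perl
print "mod_perl: ", ( $ENV{MOD_PERL} || 'not loaded' ), "\n";
print "\@INC:\n", join( "\n", @INC ), "\n";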

PHP echo content of HTML page not working correctly

I am trying to use the following PHP code to display another HTML page. Sadly, nothing is printed on the screen, and yes, I have checked and confirmed that the link works. Any thoughts on why this could be happening would be helpful, thank you.
$site = readfile("http://k9minecraft.tk/thanks.html");
echo $site;
First, make sure PHP is configured so that allow_url_fopen is on.
If you want to save the string to a variable, use file_get_contents instead, since it reads the file into memory. See the official documentation for file_get_contents for more details.
$site = file_get_contents("http://k9minecraft.tk/thanks.html");
echo $site;
The readfile function writes the file directly to the output buffer, so it doesn't require an echo. See the official documentation for readfile for more details.
readfile("http://k9minecraft.tk/thanks.html");
readfile is more efficient in terms of memory usage, whereas file_get_contents is more useful in many situations.
<?php
// other PHP code here
?>
<a href="http://k9minecraft.tk/thanks.html">Link Name</a>
<?php
// continue other PHP code here
?>
How about that? Without using readfile.
readfile() actually returns just the number of bytes read from the file; the content itself is written to the output buffer.
Turn output buffering OFF.
Use something like ob_end_flush().

Perl table Extract or other method for multi page table

I'm trying to extract elements from a table. I have successfully used get and HTML::TableExtract to get elements of the table. The problem is that the table spans multiple pages, navigated with an arrow button that reveals additional pages. How would I extract these other pages, since they are not new links but, I think, generated with JS or some such?
Specifically, I am trying to extract the table under "Data for this Data Range" at:
http://ycharts.com/companies/GOOG/pe_ratio#series=type:company,id:GOOG,calc:pe_ratio,,id:AAPL,type:company,calc:pe_ratio,,id:AMZN,type:company,calc:pe_ratio&zoom=3&startDate=&endDate=&format=real&recessions=false
See how there is the "Viewing x of 45" label and the First, Previous, Next, Last buttons.
The rest of the table can be viewed with Next; how would I extract those pages in Perl?
Update:
Hi Simbabque, thanks for the response.
So I see that clicking Next calls:
ng-click="getHistoricalData(historicalData.currentPage+1)"
Is there a way I can call this method? I tried to use click, but it is not bound to a name. (JS?)
I am trying WWW::Mechanize::Firefox now, but I feel like there must be an easier way to use regular Mech, call the function, and re-read the page?
The website builds up the tables using AJAX requests. Those are a little harder to parse. You can use WWW::Mechanize to fetch the initial page and then hit the AJAX calls for the table. It helps you keep track of cookies and stuff automatically.
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://ycharts.com/companies/GOOG/pe_ratio#series=type:company,id:GOOG,calc:pe_ratio,,id:AAPL,type:company,calc:pe_ratio,,id:AMZN,type:company,calc:pe_ratio&zoom=3&startDate=&endDate=&format=real&recessions=false');

my $response = $mech->post(
    'http://ycharts.com/companies/GOOG/pe_ratio/data_ajax',
    {
        startDate => '1/1/1962',
        endDate   => '12/3/2013',
        pageNum   => 4,
    }
);

if ( $response->is_success ) {
    print $response->decoded_content;    # or whatever
}
else {
    die $response->status_line;
}
This is just a basic example and will not work. It gives a 403 Forbidden. Probably there is more data required. Use Firebug or a similar tool to inspect what is happening. For example, there's another call to http://ping.chartbeat.net/ping?h=ycharts.com&p=%2Fcompanies%2FGOOG%2Fpe_ratio&u=o3m6snxteynby1b8&d=ycharts.com&g=20054&n=1&f=00001&c=10.81&x=200&y=1812&o=1663&w=658&j=30&R=0&W=1&I=0&E=109&e=6&b=1903&t=usmc0fjfd1j0h87g&V=16&_ happening automatically every now and again, with varying parameters. That is most likely required to keep the session going.
This page is pretty sophisticated. This might not be the best approach.
You could also try to use WWW::Mechanize::Firefox or even Selenium to remote-operate a browser. That will be better suited as it takes care of all the AJAX stuff that is happening.
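For example, with WWW::Mechanize::Firefox the pager can be clicked directly and the updated table re-read; a sketch (it assumes Firefox with the MozRepl extension is running, and the XPath is an assumption based on the ng-click attribute above):

use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new;
$mech->get('http://ycharts.com/companies/GOOG/pe_ratio');
# click the "Next" pager control and wait for the AJAX update to settle
$mech->click( { xpath => '//a[contains(@ng-click, "getHistoricalData")]', synchronize => 1 } );
my $html = $mech->content;   # now contains the next page of the table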
Or you could look for a public API that just hands over that data voluntarily. I bet there is one around... or just pay for a ycharts pro account and hit the download button. ;-)

Retrieving HTTP URLs using Perl scripting

I'm trying to save a whole web page on my system as a .html file and then parse that file to find some tags and use them.
I'm able to save/parse http://<url>, but not able to save/parse https://<url>. I'm using Perl.
I'm using the following code to save an HTTP page, and it works fine, but it doesn't work for HTTPS:
use strict;
use warnings;
use LWP::Simple qw($ua get);
use LWP::UserAgent;
use LWP::Protocol::https;
use HTTP::Cookies;

sub main
{
    my $ua = LWP::UserAgent->new();
    my $cookies = HTTP::Cookies->new(
        file     => "cookies.txt",
        autosave => 1,
    );
    $ua->cookie_jar($cookies);
    $ua->agent("Google Chrome/30");

    #$ua->ssl_opts( SSL_ca_file => 'cert.pfx' );
    $ua->proxy('http', 'http://proxy.com');
    my $response = $ua->get('http://google.com');
    #$ua->credentials($response, "", "usrname", "password");

    unless ($response->is_success) {
        print "Error: " . $response->status_line;
    }

    # Let's save the output.
    my $save = "save.html";
    unless (open SAVE, '>' . $save) {
        die "\nCannot create save file '$save'\n";
    }

    # Without this line, we may get a
    # 'wide characters in print' warning.
    binmode(SAVE, ":utf8");
    print SAVE $response->decoded_content;
    close SAVE;

    print "Saved ",
        length($response->decoded_content),
        " bytes of data to '$save'.";
}

main();
Is it possible to parse an HTTPS page?
Always worth checking the documentation for the modules that you're using...
You're using modules from libwww-perl. That includes a cookbook. And in that cookbook, there is a section about HTTPS, which says:
URLs with https scheme are accessed in exactly the same way as with http scheme, provided that an SSL interface module for LWP has been properly installed (see the README.SSL file found in the libwww-perl distribution for more details). If no SSL interface is installed for LWP to use, then you will get "501 Protocol scheme 'https' is not supported" errors when accessing such URLs.
The README.SSL file says this:
As of libwww-perl v6.02 you need to install the LWP::Protocol::https module from its own separate distribution to enable support for https://... URLs for LWP::UserAgent.
So you just need to install LWP::Protocol::https.
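Once LWP::Protocol::https (which pulls in IO::Socket::SSL and Mozilla::CA) is installed, an https URL needs no special handling; a minimal sketch:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $response = $ua->get('https://www.google.com/');
die "Error: ", $response->status_line unless $response->is_success;
print $response->decoded_content;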
You need to have https://metacpan.org/module/Crypt::SSLeay for https links
It provides SSL support for LWP.
Bit me in the ass with a project of my own.

How to get phpinfo() variables from php programmatically?

I am attempting to get a dependable (consistent across requests) list of "hidden" constants in PHP (as in, the client side won't know about them in most cases without hacking).
Some of the things I am interested in are the following:
./configure options.
I would also like the very first System value in phpinfo.
The loaded PHP modules (as shown in the Apache section)
The build date of PHP.
Registered PHP streams
Registered stream socket transports
Registered stream filters
How can I get either just a portion of the phpinfo output, or these values as a regular string? Note that it doesn't matter if there is markup included, but I don't want to parse the whole phpinfo output, as that just seems really slow, and surely there is a better way.
Here you go:
ini_get_all() or get_loaded_extensions() were the closest I could find
php_uname()
apache_get_modules()
phpversion() was the closest I could find
stream_get_wrappers()
stream_get_transports()
stream_get_filters()
See also get_defined_constants() and some more.
As Chacha102 mentioned you can also use output control functions and parse the phpinfo():
ob_start();
phpinfo();
$variable = ob_get_contents();
ob_get_clean();
Due to the use of ob_get_clean() it won't mess up other output buffering levels you may be using.
Most of the stuff available from phpinfo() can be found in constants. Try looking through:
print_r(get_defined_constants());
Or the functions on this page: http://us.php.net/manual/en/ref.info.php. There are tons of functions to get information about specific extensions.
The following functions might be worth looking at:
ini_get() http://us.php.net/manual/en/function.ini-get.php
getenv() http://us.php.net/manual/en/function.getenv.php
get_cfg_var() http://us.php.net/manual/en/function.get-cfg-var.php
Maybe I am a bit late, but basically if you programmatically invoke the PHP binary (php.exe) from a shell:
php -i
then you can parse all the information required.