Retrieving HTTPS URLs using Perl scripting

I'm trying to save the whole web page on my system as a .html file and then parse that file, to find some tags and use them.
I'm able to save/parse http://<url>, but not able to save/parse https://<url>. I'm using Perl.
I'm using the following code, which works fine for HTTP but doesn't work for HTTPS:
use strict;
use warnings;
use LWP::Simple qw($ua get);
use LWP::UserAgent;
use LWP::Protocol::https;
use HTTP::Cookies;

sub main
{
    my $ua = LWP::UserAgent->new();
    my $cookies = HTTP::Cookies->new(
        file     => "cookies.txt",
        autosave => 1,
    );
    $ua->cookie_jar($cookies);
    $ua->agent("Google Chrome/30");

    #$ua->ssl_opts( SSL_ca_file => 'cert.pfx' );
    $ua->proxy('http', 'http://proxy.com');

    my $response = $ua->get('http://google.com');
    #$ua->credentials($response, "", "usrname", "password");

    unless ($response->is_success) {
        print "Error: " . $response->status_line;
    }

    # Let's save the output.
    my $save = "save.html";
    unless (open SAVE, '>' . $save) {
        die "\nCannot create save file '$save'\n";
    }

    # Without this line, we may get a
    # 'wide characters in print' warning.
    binmode(SAVE, ":utf8");
    print SAVE $response->decoded_content;
    close SAVE;

    print "Saved ",
        length($response->decoded_content),
        " bytes of data to '$save'.";
}

main();
Is it possible to parse an HTTPS page?

Always worth checking the documentation for the modules that you're using...
You're using modules from libwww-perl. That includes a cookbook. And in that cookbook, there is a section about HTTPS, which says:
URLs with https scheme are accessed in exactly the same way as with
http scheme, provided that an SSL interface module for LWP has been
properly installed (see the README.SSL file found in the libwww-perl
distribution for more details). If no SSL interface is installed for
LWP to use, then you will get "501 Protocol scheme 'https' is not
supported" errors when accessing such URLs.
The README.SSL file says this:
As of libwww-perl v6.02 you need to install the LWP::Protocol::https
module from its own separate distribution to enable support for
https://... URLs for LWP::UserAgent.
So you just need to install LWP::Protocol::https.
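For reference, here's a minimal sketch of fetching an HTTPS page once LWP::Protocol::https (and its IO::Socket::SSL dependency) is installed. The URL, output filename and proxy host are placeholders, not anything from your setup:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;   # loads LWP::Protocol::https automatically for https:// URLs

my $ua = LWP::UserAgent->new(
    agent    => 'Mozilla/5.0',
    ssl_opts => { verify_hostname => 1 },   # the default; point SSL_ca_file at a PEM bundle for a custom CA
);
# $ua->proxy(['http', 'https'], 'http://proxy.example.com:8080');   # only if you are behind a proxy

my $response = $ua->get('https://www.google.com/');
die 'Error: ' . $response->status_line unless $response->is_success;

open my $fh, '>:encoding(UTF-8)', 'save.html'
    or die "Cannot create save file: $!";
print {$fh} $response->decoded_content;
close $fh;

If you still get "501 Protocol scheme 'https' is not supported" after installing, that usually means the SSL modules are not visible to the perl that actually runs the script.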

You need to have https://metacpan.org/module/Crypt::SSLeay for https links
It provides SSL support for LWP.
Bit me in the ass with a project of my own.

Related

Fatfree routing with PHP built-in web server

I'm learning Fat-Free's routing and found it behaves unexpectedly.
Here is my code in index.php:
$f3 = require_once(dirname(dirname(__FILE__)). '/lib/base.php');
$f3 = \Base::instance();
echo 'received uri: '.$_SERVER['REQUEST_URI'].'<br>';
$f3->route('GET /brew/#count',
    function($f3, $params) {
        echo $params['count'].' bottles of beer on the wall.';
    }
);
$f3->run();
and here is the URL which I access: http://xx.xx.xx.xx:8090/brew/12
I get a 404 error:
received uri: /brew/12
Not Found
HTTP 404 (GET /12)
The strange thing is that the URI in F3 is now "/12" instead of "/brew/12", and I guess this is the issue.
When I check the base.php (3.6.5), $this->hive['BASE'] = "/brew" and $this->hive['PATH'] = "/12".
But if F3 only uses $this->hive['PATH'] to match the predefined route, it won't be able to match them.
If I change the route to:
$f3->route('GET /brew',
and use the URL: http://xx.xx.xx.xx:8090/brew, then the route matches without issue.
In this case, $this->hive['BASE'] = "" and $this->hive['PATH'] = "/brew". If F3 compares the $this->hive['PATH'] with predefined route, they match each other.
BTW, I'm using PHP's built-in web server and since $_SERVER['REQUEST_URI'] (which is used by base.php) returns the correct URI, I don't think there is anything wrong with the URL rewrite in my .htrouter.php.
Any idea? What did I miss here?
Here is the content of my .htrouter.php:
<?php
# get the relative URL
$uri = urldecode(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));

# if the request is for a real file (such as html, image, js, css), serve it as it is
if ($uri !== '/' && file_exists(__DIR__ . $uri)) {
    return false;
}

# if the request is for a virtual URL, pass it to the bootstrap file - index.php
$_GET['_url'] = $_SERVER['REQUEST_URI'];
require_once __DIR__ . './public/index.php';
Your issue is directly related to the way you're using the PHP built-in web server.
As stated in the PHP docs, here's how the server handles requests:
URI requests are served from the current working directory where PHP was started, unless the -t option is used to specify an explicit document root. If a URI request does not specify a file, then either index.php or index.html in the given directory are returned. If neither file exists, the lookup for index.php and index.html will be continued in the parent directory and so on until one is found or the document root has been reached. If an index.php or index.html is found, it is returned and $_SERVER['PATH_INFO'] is set to the trailing part of the URI. Otherwise a 404 response code is returned.
If a PHP file is given on the command line when the web server is started it is treated as a "router" script. The script is run at the start of each HTTP request. If this script returns FALSE, then the requested resource is returned as-is. Otherwise the script's output is returned to the browser.
That means that, by default (without a router script), the web server already does a pretty good job of routing nonexistent URIs to the index.php file in your document root.
In other words, provided your file structure is like:
lib/
base.php
template.php
etc.
public/
index.php
The following command is enough to start your server and dispatch the requests properly to the framework:
php -S 0.0.0.0:8090 -t public/
Or if you're running the command directly from the public/ folder:
cd public
php -S 0.0.0.0:8090
Beware that the working directory of your application depends on the folder from which you run the command. To make it predictable, I strongly advise you to add chdir(__DIR__); at the top of your public/index.php file. This way, all subsequent require calls will be relative to your public/ folder, e.g. $f3 = require('../lib/base.php');
Routing file-style URIs
The built-in server, by default, won't pass nonexistent file-like URIs to your index.php, as stated in the docs:
If a URI request does not specify a file, then either index.php or index.html in the given directory are returned
So if you plan to define some routes with dots, such as:
$f3->route('GET /brew.json','Brew->json');
$f3->route('GET /brew.html','Brew->html');
Then it won't work because PHP won't pass the request to index.php.
In that case, you need to call a custom router, such as the .htrouter.php you were trying to use. The only thing is that your .htrouter.php has obviously been designed for a different framework (F3 doesn't care about $_GET['_url'] but cares about $_SERVER['SCRIPT_NAME']).
Here's an example of .htrouter.php that should work with F3:
// public directory definition
$public_dir=__DIR__.'/public';
// serve existing files as-is
if (file_exists($public_dir.$_SERVER['REQUEST_URI']))
return FALSE;
// patch SCRIPT_NAME and pass the request to index.php
$_SERVER['SCRIPT_NAME']='index.php';
require($public_dir.'/index.php');
NB: the $public_dir variable should be set according to the location of the .htrouter.php file.
For example if you call:
php -S 0.0.0.0:8090 -t public/ .htrouter.php
it should be $public_dir=__DIR__.'/public'.
But if you call:
cd public
php -S 0.0.0.0:8090 .htrouter.php
it should be $public_dir=__DIR__.
OK, I checked base.php and found out that when F3 calculates the base URI, it uses $_SERVER['SCRIPT_NAME']:
$base = '';
if (!$cli)
    $base = rtrim($this->fixslashes(
        dirname($_SERVER['SCRIPT_NAME'])), '/');
If we have the web server forward all requests directly to index.php, then $_SERVER['SCRIPT_NAME'] = /index.php, and in this case base is ''.
If we use URL rewriting via .htrouter.php to index.php, then $_SERVER['SCRIPT_NAME'] = /brew/12, and in this case base is '/brew', which causes the issue.
Since I'm going to use the URL rewrite, I have to comment out the if statement and make sure base = ''.
Thanks xfra35 for providing the clue.
An Apache-like PHP router is available here; it can do URL rewriting:
https://github.com/kyesil/QPHP/blob/master/router.php
Usage:
php -S localhost:8081 router.php

JSON::XS under mod_perl fails with POST requests

I am using the default install of Apache and mod_perl on Ubuntu 16.04.1 LTS. I have also reproduced this with the default JSON::XS, and I updated to the latest from CPAN, JSON-XS-3.02.
The code below works in all cases if I am not using mod_perl.
The script and html below work when using perl via mod_cgi with both POST and GET requests.
If, however, I am using mod_perl and I use a POST (as in the HTML provided), it fails: "Hello" does not print, and I get the following error in my Apache log file.
Usage: JSON::XS::new(klass).
If I pass the same parameter(s) via a GET method, the script works fine.
test2.pl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use JSON::XS;
my $q = new CGI();
print $q->header(-type => 'text/plain');
my $action = $q->param('a');
my $json_str = '{"foo":"bar"}';
my $pscalar = JSON::XS->new->utf8->decode($json_str);
print "Hello";
exit 1;
HTML to call the above (named test2.pl on the server)
<html>
<body>
<form action="test2.pl" method="POST">
<input type="text" name="a"/>
<button type="submit">
</form>
</body>
</html>
OK, so this was a rather wild goose chase: analyzing Apache core dumps and stack traces, fixing bugs that weren't really there... Long story short:
I was trying to add an include directory to my perl by using
PerlSwitches -I/usr/local/lib/site_perl/my_new_directory
As part of that I added PerlOptions +Parent, so that I would get a new interpreter pool for each virtual host and my -I would be effective for only one virtual host at a time.
I had added those flags before I enabled mod_perl, so when I enabled mod_perl, it just never worked.
By removing PerlOptions +Parent, things started working as expected.
As a side note, it appears +Parent makes things wonky in general.
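For what it's worth, here is a hedged sketch of the relevant httpd.conf lines after the fix; the include path is the one quoted above, and the handler block is the stock ModPerl::Registry setup, which may differ from your actual config:

# Add a custom include directory for the embedded Perl (path from the original post)
PerlSwitches -I/usr/local/lib/site_perl/my_new_directory

# PerlOptions +Parent    # removed -- the separate parent interpreter per vhost was what broke JSON::XS on POST

<Directory /var/www/cgi-bin>
    SetHandler perl-script
    PerlResponseHandler ModPerl::Registry
    PerlOptions +ParseHeaders
    Options +ExecCGI
</Directory>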

Understanding JSON-RPC in Perl

I am trying to understand the concept of JSON-RPC and its Perl implementation. Though I can find a lot of examples for Python/Java, I find surprisingly few examples for it in Perl.
I am following this example but am not sure it is complete. The example I had in mind was to add 2 integers. Now I have a very basic HTML page set up, like so:
<html>
<body>
<input type="text" name="num1"><br>
<input type="text" name="num2"><br>
<button>Add</button>
</body>
</html>
Next, based on the example above, I have 3 files:
test1.pl
# Daemon version
use JSON::RPC::Server::Daemon;
# see documentation at:
# https://metacpan.org/pod/distribution/JSON-RPC/lib/JSON/RPC/Legacy.pm
my $server = JSON::RPC::Server::Daemon->new(LocalPort => 8080);
$server -> dispatch({'/test' => 'myApp'});
$server -> handle();
test2.pl
#!/usr/bin/perl
use JSON::RPC::Client;
my $client = new JSON::RPC::Client;
my $uri = 'http://localhost:8080/test';
my $obj = {
method => 'sum', # or 'MyApp.sum'
params => [10, 20],
};
my $res = $client->call( $uri, $obj );
if ($res) {
    if ($res->is_error) {
        print "Error : ", $res->error_message;
    } else {
        print $res->result;
    }
} else {
    print $client->status_line;
}
myApp.pl
package myApp;
#optionally, you can also
use base qw(JSON::RPC::Procedure); # for :Public and :Private attributes
sub sum : Public(a:num, b:num) {
    my ($s, $obj) = @_;
    return $obj->{a} + $obj->{b};
}
1;
While I understand what these files individually do, I am at a complete loss when it comes to combining them and making them work together.
My questions are as follows:
Does the button in the HTML page go inside a <form> tag (like we would normally do in a CGI-based program)? If yes, what file does that call? If no, then how do I pass the values to be added?
What is the order of execution of the 3 Perl files? Which one calls which one? How is the flow of execution?
When I tried to run the Perl files from the CLI, i.e. using ./test2.pl, I got the following error: Error 301 Moved Permanently. What moved permanently? Which file was it trying to access? I tried running the files from within /var/www/html and /var/www/html/test.
Some help in understanding the nuances of this would really be appreciated. Thanks in advance
Does the button in the HTML page go inside a <form> tag (like we would normally do in a CGI-based program)? If yes, what file does that call? If no, then how do I pass the values to be added?
HTML has nothing at all to do with JSON-RPC. While the RPC call is done via an HTTP POST request, if you're doing that from the browser, you'll need to use XMLHttpRequest (i.e. AJAX). Unlike an HTML form post, the Content-Type: header will need to be something specific to JSON-RPC (e.g. application/json or similar), and you'll need to encode your form data via JSON.stringify and correctly construct the JSON-RPC "envelope", including the id, jsonrpc, method and params properties.
Rather than doing this by hand, you might use a purpose-built JSON-RPC JavaScript client like the jQuery-JSONRPC plugin (there are many others) -- although the protocol is so simple that implementations are usually less than 20 lines of code.
From the jQuery-JSONRPC documentation, you'd set up the connection like this:
$.jsonRPC.setup({
endPoint: '/ENDPOINT-ROUTE-GOES-HERE'
});
and you'd call the server-side method like this:
$.jsonRPC.request('sum', {
    params: [YOURNUMBERINPUTELEMENT1.value, YOURNUMBERINPUT2.value],
    success: function(result) {
        /* Do something with the result here */
    },
    error: function(result) {
        /* Result is an RPC 2.0 compatible response object */
    }
});
What is the order of execution of the 3 Perl files? Which one calls which one? How is the flow of execution?
You'll likely only need test2.pl for testing. It's an example implementation of a JSON-RPC client. You likely want your client to run in your web browser (as described above). The client JavaScript will make an HTTP POST request to wherever test1.pl is serving content (e.g. http://localhost:8080).
Or, if you want to keep your code as HTML<-->CGI, then you'll need to make JSON-RPC client calls from within your Perl CGI server-side code (which seems silly if it's on the same machine).
When test1.pl calls dispatch, the MyApp module will be loaded.
Then, when test1.pl calls handle, the sum function in the MyApp package will be called.
The JSON::RPC::Server module takes care of marshalling from JSON-RPC to perl datastructures and back again around the call to handle. die()ing in sum should result in a JSON-RPC exception being transmitted to the calling client, rather than death of the test1.pl script.
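If you want to see what actually goes over the wire without involving a browser, here is a hedged sketch of the same call made with plain LWP and JSON::PP. The endpoint matches the daemon from test1.pl, and the envelope uses named a/b parameters to match the Public(a:num, b:num) signature; adjust both if your setup differs:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON::PP qw(encode_json decode_json);

# Hand-built JSON-RPC envelope with the id, jsonrpc, method and params properties.
my $envelope = encode_json({
    id      => 1,
    jsonrpc => '2.0',
    method  => 'sum',
    params  => { a => 10, b => 20 },
});

my $ua  = LWP::UserAgent->new;
my $res = $ua->post(
    'http://localhost:8080/test',        # where test1.pl is listening
    'Content-Type' => 'application/json',
    Content        => $envelope,
);
die 'HTTP error: ' . $res->status_line unless $res->is_success;

my $reply = decode_json($res->decoded_content);
if (defined $reply->{error}) {
    print "Error response: ", $res->decoded_content, "\n";
} else {
    print "Result: $reply->{result}\n";
}

Running test1.pl in one terminal and this script in another takes the HTML page and CGI questions out of the picture entirely.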
When I tried to run the Perl files from the CLI, i.e. using ./test2.pl, I got the following error: Error 301 Moved Permanently. What moved permanently? Which file was it trying to access? I tried running the files from within /var/www/html and /var/www/html/test.
This largely depends on the configuration of your machine. There's nothing obvious (in your code) to suggest that a 301 Moved Permanently would be issued in response to a valid JSON-RPC request.

Importing local json file using d3.json does not work

I'm trying to import a local .json file using d3.json().
The file filename.json is stored in the same folder as my HTML file.
Yet the json parameter in the callback is null.
d3.json("filename.json", function(json) {
root = json;
root.x0 = h / 2;
root.y0 = 0;});
. . .
}
My code is basically the same as in this d3.js example
If you're running in a browser, you cannot load local files.
But it's fairly easy to run a dev server, on the commandline, simply cd into the directory with your files, then:
python -m SimpleHTTPServer
(or python -m http.server using python 3)
Now in your browser, go to localhost:8000 (or whatever port is shown on the commandline).
The following used to work in older versions of d3:
var json = {"my": "json"};
d3.json(json, function(json) {
root = json;
root.x0 = h / 2;
root.y0 = 0;
});
In version d3.v5, you should do it as
d3.json("file.json").then(function(data){ console.log(data)});
Similarly, with csv and other file formats.
You can find more details at https://github.com/d3/d3/blob/master/CHANGES.md
Adding to the previous answers, it's simpler to use the HTTP server available on most Linux/Mac machines (just by having Python installed).
Run the following command in the root of your project
python -m SimpleHTTPServer
Then, instead of accessing file://.....index.html, open your browser at http://localhost:8000 (or whatever port the server reports). This way the browser will fetch all the files in your project without being blocked.
Refer to the code at http://bl.ocks.org/eyaler/10586116; it reads from a file and creates a graph.
I also had the same problem, but later I figured out that the problem was in the JSON file I was using (an extra comma). If you are getting null here, try printing the error you are getting, like this:
d3.json("filename.json", function(error, graph) {
alert(error)
})
This works in Firefox; in Chrome, somehow, it's not printing the error.
Loading a local CSV or JSON file with (d3)js is not considered safe, so browsers prevent you from doing it. There are some solutions to get it working, though. The following line basically does not work (csv or json) because it is a local import:
d3.csv("path_to_your_csv", function(data) {console.log(data) });
Solution 1:
Disable the security in your browser
Different browsers have different security settings that you can disable. This solution can work and you can load your files. Disabling them is, however, not advisable: it will make you vulnerable to all kinds of threats. On the other hand, who is going to use your software if you tell them to manually disable the security?
Disable the security in Chrome:
--disable-web-security
--allow-file-access-from-files
Solution 2:
Load your csv/json file from a website.
This may seem like a weird solution, but it will work. It is an easy fix, though it can be impractical. See here for an example; check out the page source. This is the idea:
d3.csv("https://path_to_your_csv", function(data) {console.log(data) });
Solution 3:
Start your own web server, e.g. with Python.
Serving the files over HTTP avoids the browser's local-file restrictions. This may be a solution when you experiment with your code on your own machine, but it may not be the solution when you have users. This example will serve HTTP on port 8888 unless it is already taken:
python -m http.server 8888
python -m SimpleHTTPServer 8888 &
Open the (Chrome) browser address bar and type the address below. This will open index.html; in case your page has a different name, type the path to that local HTML page.
localhost:8888
Solution 4:
Use localhost and CORS
You can use localhost and CORS, but the approach is not user-friendly because setting it up may not be so straightforward.
Solution 5:
Embed your data in the HTML file
I like this solution the most. Instead of loading your CSV, you can write a script that embeds your data directly in the HTML. This will allow users to use their favorite browser, and there are no security issues. This solution may not be so elegant because your HTML file can grow very large depending on your data, but it will work. See here for an example; check out the page source.
Remove this line:
d3.csv("path_to_your_csv", function(data) { })
Replace with this:
var data =
[
$DATA_COMES_HERE$
]
You can't readily read local files, at least not in Chrome, and possibly not in other browsers either.
The simplest workaround is to include your JSON data in your script file, get rid of your d3.json call, and keep the code from the callback you pass to it.
Your code would then look like this:
json = { ... };
root = json;
root.x0 = h / 2;
root.y0 = 0;
...
I have used this:
d3.json("graph.json", function(error, xyz) {
    if (error) throw error;
    // the rest of my d3 graph code here
});
so you can refer to your JSON data through the variable xyz, and graph.json is the name of my local JSON file.
Use the resource as a local variable:
var filename = { x0: 0, y0: 0 };

// you can use a different name for the function than d3.json
d3.json = (x, cb) => cb.call(null, x);

d3.json(filename, function(json) {
    root = json;
    root.x0 = h / 2;
    root.y0 = 0;
});
//...
}

Download files with Perl

I have updated my code to look like this. When I run it, though, it says it cannot find the specified link. Also, what is a good way to test that it is indeed connecting to the page?
#!/usr/bin/perl -w
use strict;
use LWP;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $browser = LWP::UserAgent->new;
$browser->credentials(
    'Apache/2.2.3 (CentOS):80',
    'datawww2.wxc.com',
    '************' => '*************'
);
my $response = $browser->get(
    'http://datawww2.wxc.com/kml/echo/MESH_Max_180min/'
);
$mech->follow_link( n => 8 );
(Original Post)
What is the best way to download small files with Perl?
I looked on CPAN and found lwp-download, but it seems to only download a URL you give it directly. I have a page with links that change every thirty minutes, with the date and time in the name, so they are never the same. Is there a built-in function I can use? Everyone on Google keeps saying to use Wget, but I was kind of wanting to stick with Perl if possible, just to help me learn it better while I program with it.
Also there is a user name and password to log into the site. I know how to access the site using Perl still, but I thought that might change what I can use to download with.
As stated in a comment in your other question: here
You can use the same method to retrieve .csv files as .html, or any other text-based file for the matter.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
my $csv = get("http://www.spc.noaa.gov/climo/reports/last3hours_hail.csv")
or die "Could not fetch NWS CSV page.";
To login, you may need to use WWW::Mechanize to fill out the webform (look at $mech->get(), $mech->submit_form(), and $mech->follow_link())
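A hedged sketch of that WWW::Mechanize approach follows. The login URL, form fields and link pattern are all assumptions, since they depend on the site's actual login form and directory listing:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );   # die automatically on HTTP errors

# Fetch the login page and submit the form (URL and field names are hypothetical).
$mech->get('http://datawww2.wxc.com/login');
$mech->submit_form(
    form_number => 1,
    fields      => { username => 'user', password => 'secret' },
);

# Fetch the directory listing and follow a link by pattern instead of by position,
# since the file names change every thirty minutes.
$mech->get('http://datawww2.wxc.com/kml/echo/MESH_Max_180min/');
$mech->follow_link( url_regex => qr/\.kml$/i );

# Save whatever the followed link returned.
$mech->save_content('latest.kml');

$mech->links() and $mech->find_all_links() are also handy for the "is it really connecting?" question: dump them after the first get() and you'll see exactly which links Mechanize can see.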
Basically, you need to fetch the page, parse it to get the URL, and then download the file.
Personally, I'd use HTML::TreeBuilder::XPath, write a quick XPath expression to go straight to the correct href attribute node, and then plug that into LWP.
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse({put page content here});
foreach ($tree->findnodes({put xpath expression here})) {
    {download the file}
}
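Fleshed out, that might look something like the sketch below. The XPath expression and the .kml pattern are guesses, since they depend on the markup of the directory listing; everything else is plain LWP:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use URI;

my $base = 'http://datawww2.wxc.com/kml/echo/MESH_Max_180min/';
my $ua   = LWP::UserAgent->new;

my $page = $ua->get($base);
die 'Could not fetch listing: ' . $page->status_line unless $page->is_success;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($page->decoded_content);
$tree->eof;

# The XPath is a guess: grab every <a href> that mentions .kml and resolve it against the base URL.
for my $a ($tree->findnodes('//a[contains(@href, ".kml")]')) {
    my $url    = URI->new_abs($a->attr('href'), $base);
    my ($name) = $url->path =~ m{([^/]+)$};

    my $file = $ua->get($url);
    next unless $file->is_success;

    open my $fh, '>', $name or die "Cannot write $name: $!";
    binmode $fh;
    print {$fh} $file->content;   # raw bytes, in case the files are not plain text
    close $fh;
}
$tree->delete;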