Scrape multi-frame website - HTML

I'm auditing our existing web application, which makes heavy use of HTML frames. I would like to download all of the HTML in each frame, is there a method of doing this with wget or a little bit of scripting?

as an addition to Steve's answer:
Span to any host—‘-H’
The ‘-H’ option turns on host spanning, thus allowing Wget's recursive run to visit any host referenced by a link. Unless sufficient recursion-limiting criteria are applied (such as a maximum depth), these foreign hosts will typically link to yet more hosts, and so on until Wget ends up sucking up much more data than you have intended.
Limit spanning to certain domains—‘-D’
The ‘-D’ option allows you to specify the domains that will be followed, thus limiting the recursion only to the hosts that belong to these domains. Obviously, this makes sense only in conjunction with ‘-H’.
A typical example would be downloading the contents of ‘www.server.com’, but allowing downloads from ‘images.server.com’, etc.:
wget -rH -Dserver.com http://www.server.com/
You can specify more than one address by separating them with a comma,
e.g. ‘-Ddomain1.com,domain2.com’.
taken from: wget manual

wget --recursive --domains=www.mysite.com http://www.mysite.com
A recursive crawl like this will also traverse into frames and iframes. Be careful to limit the scope of the recursion to your own web site, since you probably don't want to crawl the whole web.
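If the frames happen to be served from a second host (an images or static subdomain, say), you can combine this with the host-spanning options above; a sketch, with placeholder hostnames:

wget -r -l 2 -H -Dmysite.com http://www.mysite.com/

The -l 2 depth cap keeps the host-spanning from wandering off across the rest of the web.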

wget has a -r option to make it recursive, try wget -r -l1 (in case the font makes it hard to read: that last part is a lower case L followed by a number one)
The -l1 part tells it to recurse to a maximum depth of 1. Try playing with this number to scrape more.

Alternatives to CGI.pm for header() and param()?

I've been an avid user of CGI.pm since the previous millennium so I was a bit surprised when it disappeared from my old Ubuntu server when I upgraded it recently. My short-term fix was sudo cpan install CGI, but a quick web search to find out why it was missing in the first place revealed CGI::Alternatives which explains why it has gone and offers some suggestions for alternatives. For my purposes, HTML::Tiny looks best for replacing my programmatic HTML generation, but Alternatives is strangely silent on the subject of HTTP headers and CGI parameters.
I broadened my search and found lighter alternatives to CGI.pm on perlmonks, where one response suggests CGI::Simple, but the recommendation is less than whole-hearted - "its not quite as up to date as CGI.pm".
So is CGI::Simple the way to go, or is there a better alternative?
Please don't spend time suggesting "rewrite everything using framework XXX". I really don't have the time or energy for that. I'm happy to replace all my HTML generation with HTML::Tiny, so I'm looking for something with a similar (or lower!) amount of rework to replace header() and param().
You're missing the point if you're looking for an alternative that provides header and param.
The argument for the removal of CGI.pm from core (but not from CPAN) is that you shouldn't have to deal with CGI yourself; you should be using a framework that handles this for you.
If you don't agree with this — if you're looking for an equivalent to header and param — go ahead and keep using CGI.pm.
If you do agree, CGI::Simple is no better than CGI.pm.
As others have said, there's no reason not to use CGI together with HTML::Tiny. So that's the answer to your question. For the last five years that I was using CGI, my programs all started something like:
use CGI qw[param header];
which is the approach you're talking about here.
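Spelled out, that boilerplate looks something like this (a sketch; the parameter name is invented):

use CGI qw[param header];
print header( -type => 'text/html' );   # emits the Content-Type header
my $name = param('name');               # reads one CGI parameter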
If you wait a year or two, the plan is for the HTML generation functions to be removed from the main module, so your problems will all go away at that point.
But that's not what I'd do in your situation. I'd switch to using PSGI and Plack. You said that you don't want anyone to suggest a new framework, so I'm not going to do that. Plack isn't a framework, it's a toolbox for writing PSGI applications. Certainly, I'd use a framework like Dancer, but you don't have to. You can happily use Plack without any of the frameworks built on top of it.
You'll still get most of the advantages of PSGI. You'll be able to deploy your applications in any way you like. You'll have access to all the awesome Plack middleware. Testing your program will be far easier.
When you're using "raw" Plack, the equivalent of CGI::param is Plack::Request::parameters and the equivalent of CGI::header is Plack::Response::headers.
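Inside a PSGI app the mapping is roughly this (a sketch, not a drop-in replacement; the parameter name is invented):

use Plack::Request;
use Plack::Response;

my $app = sub {
    my $env  = shift;
    my $req  = Plack::Request->new($env);
    my $name = $req->parameters->{name};    # roughly what CGI::param gave you
    my $res  = Plack::Response->new(200);
    $res->content_type('text/plain');       # roughly what CGI::header gave you
    $res->body("Hello, " . ($name // "world") . "\n");
    return $res->finalize;                  # the three-element PSGI response
};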
So there are three answers to your question:
1. Carry on using CGI.pm. Just stop using the HTML generation functions and replace them with HTML::Tiny.
2. Use raw PSGI/Plack and bring your web development into the 21st century.
3. Use one of Perl's many great web frameworks.
Unfortunately, you don't seem to like any of those answers.
The issue with CGI.pm is not that it's going away, merely that it will no longer be distributed as part of the core Perl distribution. However, that doesn't mean you have to install it from CPAN. On your Ubuntu system you can just do:
sudo apt-get install libcgi-pm-perl
and you'll be off and running with the same old CGI you know and love :-)
The correct answer to my question is that use CGI::Simple is better than use CGI qw(header param) because it loads faster.
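The switch is nearly mechanical; a sketch (the parameter name is invented):

use CGI::Simple;
my $q = CGI::Simple->new;
print $q->header( -type => 'text/html' );   # takes the same -style arguments as CGI.pm
my @vals = $q->param('a');                  # multi-valued parameters work as before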
Answers along the lines of "Use Plack, it's the future of Perl for websites" weren't helpful to me because I didn't have time to learn a new programming paradigm or to discover how to reconfigure my web server to make it work, no matter how insistent the Plack Evangelists were that I was wrong in what I was trying to do.
I've now had a bit of time to wade through the links to documentation and presentation slides I was offered and I can see what they were getting at, but one failing in what I've read so far is the lack of a concise end-to-end working example to help get my head around things ... so here's what I knocked together to get me started (and, no, I haven't finished yet!). I hope that others beginning the journey from CGI to PSGI will find this useful to help get them underway...
First you need to install Plack. I'm running an Ubuntu 14.04 installation so it was simply a matter of running sudo apt-get install libplack-perl. The generic way is to install Task::Plack from CPAN.
Next you need to know where your cgi-bin directory is located. You ought to know already if you're a CGI die-hard! Since I'm running Apache mine is defined in /etc/apache2/conf-available/serve-cgi-bin.conf by ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/.
Now for the magic. We're going to create a CGI script that runs a PSGI app, handing it data from the CGI environment. This is good for experimentation and testing but NOT for deployment, as you don't get any of the speed benefit that PSGI can give you (for that you need something like Plack::Handler::Apache2, Plack::Handler::FCGI or mod_psgi in Apache, or a dedicated PSGI server such as Starman or Starlet, or one of the other handlers mentioned on PlackPerl.org). Create /usr/lib/cgi-bin/psgi-cgi.pl with the following contents and make it executable:-
#!/usr/bin/perl
use strict;
use warnings;
use Plack::Util;
use Plack::Handler::CGI;

# Apache invokes this script with PATH_TRANSLATED pointing at the
# requested .psgi file; load that app and run it via the CGI adapter.
my $app = Plack::Util::load_psgi($ENV{PATH_TRANSLATED});
Plack::Handler::CGI->new->run($app);
Next we need to tell Apache to pass PSGI app files to this handler. I did this by creating /etc/apache2/conf-available/psgi-cgi.conf containing:-
Action psgi-cgi /cgi-bin/psgi-cgi.pl
AddHandler psgi-cgi .psgi
then loaded it into my Apache server by running sudo a2enconf psgi-cgi and sudo service apache2 reload. Basically you need to get these lines into your httpd.conf file and restart the server.
Finally, my first PSGI script, which I created in my server's DocumentRoot as /var/www/html/hello.psgi:-
use strict;
use warnings;
use Plack::Request;

my $app = sub {
    my $env = shift;
    my $req = Plack::Request->new($env);
    my $par = $req->parameters;   # Hash::MultiValue of GET and POST params
    return [
        200,                                  # HTTP status code
        [ 'Content-Type' => 'text/plain' ],   # header name/value pairs
        [ "Hello world!\n",                   # body lines
          map( "$_ = " . join( ", ", $par->get_all($_) ) . "\n",
               sort keys %$par ),
        ]
    ];
};
The application is a coderef which returns a 3-element arrayref: the first is the HTTP status code, the second holds the name/value pairs for the HTTP header, and the third is the body of the response (which could be generated using HTML::Tiny for a web page). The first two elements answer the question of what you need instead of the CGI::header function - nothing! (though for more complex handling you'll need Plack::Response::headers). The example also shows how to replace CGI::param - use Plack::Request::parameters, which returns a Hash::MultiValue object containing the values of URL (GET) and BODY (POST) parameters, including the ones with multiple values.
Finally, a test:-
$ wget -q -O- 'http://localhost/hello.psgi?a=1&a=2&a=3&b=1&b=4'
Hello world!
a = 1, 2, 3
b = 1, 4
I hope this is useful to other CGI die-hards in taking their first steps towards PSGI proficiency, and I hope the Plack Evangelists will acknowledge that it takes a lot of reading and comprehension to get even this far.
CGI::Minimal would be a good option; it is much lighter than CGI and CGI::Simple, but it lacks their more advanced methods.

In what scenario is the --srcdir config option valid?

I am working on a configuration program - it isn't autoconf, but I'm trying (as much as possible) to make the ./configure files that use it behave like those generated by autoconf -- and that means (as much as possible) supporting the same variable options.
Only there's one option that makes no sense to me. I mean, yes, I am fully clear on what the option means, I just can't conceive of a single scenario in which someone would be well-advised to use that option - except for one scenario in which I'm equally curious why ./configure scripts can't auto-detect the information it would provide.
The option I am referring to is the "--srcdir" option. The reason it so befuddles me is that the only scenario I can imagine in which the source-code files won't be in your present working directory (or relative to it, as the configure script expects) is if the "configure" script itself isn't in your present working directory -- and in that one scenario, I really am unable to imagine why the "configure" script can't extrapolate the source directory from the name it is invoked by, and instead has to have that --srcdir option to give it that information.
For example, let's say your program's source code is located in the "awesome/software" directory. That means that, from where you are, the "configure" script would be "awesome/software/configure". Why can't the "configure" script deduce that the source directory is "awesome/software" just from the fact that it is invoked by the name "awesome/software/configure", instead of requiring me to add a separate command-line option: --srcdir=awesome/software
And if this is not the kind of scenario where one would need to specify the --srcdir option (or if it is not the only kind of such scenario), can someone describe any other kind of scenario where the person installing a program would be well-advised to alter the "srcdir" variable from its default?
The option I am referring to is the "--srcdir" option. ...
the only scenario I can imagine in which the source-code files won't be in your present-working-directory (or relative to your present-working-directory as the configure script is programmed to expect) is if the "configure" script itself isn't in your present-working-directory
Right.
and in that one scenario, I really am unable to imagine why the "configure" script can't extrapolate the source-directory from the name it is invoked by - and instead has to have that --srcdir option to give it that information.
I'm not sure it's required. The configure script will attempt to guess the location of srcdir:
# Find the source files, if location was not specified.
if test -z "$srcdir"; then
  ac_srcdir_defaulted=yes
  # Try the directory containing this script, then the parent directory.
  ...
So if it's in neither of those places, the guess will fail; hence the need for --srcdir. Maybe it is (or was?) needed where there's some kind of performance differential: the sources are stored on a "slow" drive while the build happens on a "fast" drive, and configure runs faster if it lives on the "fast" drive too, so it can no longer be found next to the sources...
At any rate, --srcdir is just a variable assignment, so it's not hard to do.
Why can't the "configure" script deduce that the source-directory is "awesome/software" just from the fact that it is invoked by the name "awesome/software/configure"
The configure source seems to do that without specifying --srcdir, but I have not tried it.
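For what it's worth, the usual shape of this is an out-of-tree ("VPATH") build. Normally configure's guess works because the script is invoked by its own path; --srcdir is the explicit override for when the script has been separated from the tree. A sketch with made-up paths:

mkdir build && cd build
cp /mnt/slow/src/mypkg/configure .          # the script no longer sits with the sources
./configure --srcdir=/mnt/slow/src/mypkg    # its guess would fail, so say where they are
make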

Mercurial command-line "API" reference?

I'm working on a Mercurial GUI client that interacts with hg.exe through the command line (the preferred high-level API, as I understand it).
However, I am having trouble determining the possible outputs of each command. I can see several outputs by simulating situations, but I was wondering if there is a complete reference of the possible outputs for each command.
For instance, for the command hg fetch, some possible outputs are:
pulling from https://User@server.com/Repo
searching for changes
no changes found
if there are no changes, or:
abort: outstanding uncommitted changes
or one of several other messages, depending on the situation.
I would like to structure my program to handle as many of these cases as possible, but it's hard for me to know in advance what they all are.
Is there a documented reference for the command-line? I have not been able to find one with The Google.
Look through the translation strings file (the i18n/*.po catalogues in the Mercurial source). Then you'll know you have every message handled and be able to see which parts of it vary.
Also, fetch is just a convenience wrapper around pull/update/merge. If you're invoking mercurial programmatically you probably want to keep those three very different concepts separate in your running it so you know which part failed. In your example above it's the 'update' failing, so the 'pull' would have succeeded and the 'update's failing would allow you to provide the user with a better message.
(fetch is an abomination, which is part of why it's disabled by default)
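A sketch of what keeping them separate might look like from a wrapper script (the repository URL is a placeholder):

hg pull https://server.example/Repo || { echo "pull failed" >&2; exit 1; }
hg update || { echo "update failed - perhaps uncommitted changes" >&2; exit 1; }

This way a message like "abort: outstanding uncommitted changes" is unambiguously an update problem, not a pull problem.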
Is this what you were looking for: https://www.mercurial-scm.org/wiki/MercurialBook ?
Mercurial 1.9 brings a command server: a stable (in the sense that the API doesn't change much) and low-overhead (there is no need to spawn an hg process for every command) way to drive Mercurial. The communication is done via a pipe.
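Starting it looks something like this (the repository path is a placeholder):

hg serve --cmdserver pipe -R /path/to/repo

Your client then speaks the command-server protocol over the child process's stdin and stdout instead of parsing one-shot command output.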

how to configure Apache + SVN webDAV directory listing

I have a Subversion server running with Apache mod_dav_svn, and it works nicely, but the browsing ability via HTML is a bit spartan. Is there a way to customize it at all?
There are two things I'd like to do that would make a huge difference:
1. Separate the directories from the files so all the directories are at the top. Right now everything is in alphabetical order. (The screenshot I posted happens to show all the directories preceding the files in alphabetical order, but trust me, that's not the normal case.)
2. List the basic file statistics (file size, modification time, last-changed revision, etc.)
Is it possible to do either of these with mod_dav_svn?
In a vanilla Subversion install, the web interface is very spartan by design. (Remember the HTTP interface is designed for SVN clients, not human beings.)
You can customize the display somewhat via the SVNIndexXSLT directive. (Here is a good place to start).
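The directive names an XSLT stylesheet that mod_dav_svn will reference from its XML directory listings; something like this in the Apache config (paths are placeholders):

<Location /svn>
  DAV svn
  SVNParentPath /var/svn
  SVNIndexXSLT "/svnindex.xsl"
</Location>

Note the stylesheet is applied by the browser, so /svnindex.xsl must be a URL path your server actually serves.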
If you want something richer (with logs and diff features), you will need to install a special front end. WebSVN and ViewVC are very popular. There is also Trac, but this is a higher-level tool.
A list of other repo browsing tools.
Just FYI, we use WebSVN for our repo instance. It took some effort to get it up and running, but once it is set up you can pretty much leave it alone.
WebSVN looks like it might help you. I tried Trac and it is very slick, but I found it complicated and it seems overkill for what you're looking for, imo.
Not out of the box - that is, without modifying the source code. You might be interested in tools like ViewSVN or the more sophisticated Trac or Redmine.

Using diff to find the portions of many files that are the same? (bizzaro-diff, or inverse-diff)

Bizzaro-Diff!!!
Is there a way to do a bizzaro/inverse-diff that only displays the portions of a group of files that are the same? (i.e. way more than three files)
Odd question, I know...but I'm converting someone's ancient static pages to something a little more manageable.
You want a clone detector. It detects similar code chunks across large source systems.
See our CloneDR tool: http://www.semdesigns.com/Products/Clone/index.html
You could try the comm command (for common). It'll only compare 2 files at a time, but you should be able to do 3+ with some clever scripting.
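For instance (comm wants sorted input; the file names are placeholders):

comm -12 <(sort a.html) <(sort b.html) > common-ab.txt
comm -12 common-ab.txt <(sort c.html)       # lines common to all three files

Bear in mind this treats the files as sorted sets of lines, so ordering information is lost.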
You could try sim. Been a few years since I've used it, but I recall it being very useful when looking for similarities within a file or in many different files.
This is a classic problem.
If I had to quick-and-dirty it, I'd probably do something like a diff -U 1000000 (assuming a version of diff that supports it), piped through sed to just get the lines in common (and strip the leading spaces). You'd have to loop through all the files, though.
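Something along these lines, where the context radius exceeds the file length so every common line appears prefixed with a space (file names are placeholders):

diff -U 1000000 a.html b.html | sed -n 's/^ //p'

You'd then repeat that pairwise across the set, feeding each result back in as one of the inputs.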
Edit: I forgot there is also a Tcl implementation that would be slightly more versatile, but would require more coding. You may be able to find an implementation for your language of choice.