Archived web content without going to the website - open-source

I want to fetch web data without going to the actual website.
http://archive.org/web/web.php is an example that keeps snapshots of websites. The problem with this is that the data is quite old (5-6 months).
Is there any other archive where more recent HTML content can be found?
Thanks

Do you want to curl the website?
You can use PHP's cURL functions to fetch a web page:
http://php.net/manual/en/book.curl.php
Or you can use a command-line tool, wget or curl, on Unix:
http://linux.about.com/od/commands/l/blcmdl1_curl.htm
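For example, either of these fetches a page's raw HTML and saves it locally (http://example.com/ is just a placeholder URL):
# fetch the page with curl and save the HTML to page.html
curl -o page.html http://example.com/
# or do the same with wget
wget -O page.html http://example.com/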

Related

How do I get a largely static HTML site generated from a Rails app?

I would like to create a Rails 4 app where some data is entered into the db via a form, and when it is published, any changes on the site are compiled and the entire consumer-facing site becomes just a bunch of flat HTML files.
That way, no db query is made on each request; just a simple HTML file is sent.
This is similar to the way Octopress operates, where you write a blog post locally, and when you deploy it basically compiles the entire site into a large set of connected HTML files that are then pushed to your host (gh-pages, for instance).
Is there a way to use extensive caching or something similar to get the same effect in Rails 4, should I go about it another way in Rails, or should I just try to customize Octopress for my needs?
Have a look at page caching; it has been moved out of Rails core into a separate gem:
https://github.com/rails/actionpack-page_caching
It saves the generated HTML files to a specified directory, which you should be able to deploy separately from the rest of the application.

Website scanner for specific HTML tags?

Basically I am part of a web team with a website built in the CMS EasySite (it has over 3000 pages). I was wondering if there is a tool, or any other way, of scanning the HTML of each page for a specific tag (e.g. style="font-size:10px").
A lot of people copy and paste content from MS Word, which obviously copies the formatting too. Although it doesn't show on the desktop site, it shows up on mobile/tablet devices, so this needs sorting out on all current pages.
I would do the following:
Mirror your site on your local filesystem (for example with wget --mirror http://example.com; see the wget man page)
Work on the downloaded files to perform your search with the tools you like most (Python or grep – for example grep -rn 'style="font-size:10px"' mirror_directory)
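Put together, a first pass might look something like this (example.com stands in for the real site, and the inline style is the one from the question):
# mirror the rendered site into ./example.com/
wget --mirror http://example.com
# list every downloaded page that still contains the offending inline style
grep -rl 'style="font-size:10px"' example.com/ > affected-pages.txt
The resulting affected-pages.txt gives you the list of pages to clean up in EasySite.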

Automatically copy text from a web page

There is a VPN provider that keeps changing their password. I have an autologin, but obviously the vpn connection drops every time that they change the password, and I have to manually copy and paste the new password into the credentials file.
http://www.vpnbook.com/freevpn
This is annoying. I realise that the vpn probably wants people not to be able to do this, but it's not against the ToS and not illegal, so work with me here!
I need a way to automatically generate a file which has nothing in it except
username
password
on separate lines, just like the one above. Downloading the entire page as a text file automatically (I can do that) will therefore not work. OpenVPN will not understand the credentials file unless it is purely and simply
username
password
and nothing more.
So, any ideas?
Ideally this kind of thing is done via an API that vpnbook provides; a script can then easily access the information and store it in a text file.
Barring that (and it looks like vpnbook doesn't have an API), you'll have to use a technique called web scraping.
To automate this via "Web Scraping", you'll need to write a script that does the following:
First, log in to vpnbook.com with your credentials
Then navigate to the page that has the credentials
Then traverse the structure of the page (called the DOM) to find the info you want
Finally, save out this info to a text file.
I typically do web scraping with Ruby and the mechanize library. The first example on the Mechanize examples page shows how to visit the Google homepage, perform a search for "Hello World", and then grab the links in the results one at a time, printing each out. This is similar to what you are trying to do, except that instead of printing the results you would want to write them to a text file (Google for how to write a text file with Ruby):
require 'rubygems'
require 'mechanize'

# create an agent that identifies itself as Safari on a Mac
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

a.get('http://google.com/') do |page|
  # fill in the search form and submit it
  search_result = page.form_with(:id => 'gbqf') do |search|
    search.q = 'Hello world'
  end.submit

  # print the text of each link on the results page
  search_result.links.each do |link|
    puts link.text
  end
end
To run this on your computer you would need to:
a. Install Ruby
b. Save this in a file called scrape.rb
c. Run it from the command line with "ruby scrape.rb"
OS X comes with an older Ruby that would work for this. Check out the Ruby site for instructions on how to install it or get it working for your OS.
Before using a gem like mechanize you need to install it:
gem install mechanize
(this depends on RubyGems being installed, which typically ships with Ruby).
If you're new to programming this might sound like a big project, but you'll have an amazing tool in your toolbox for the future: you'll feel like you can do pretty much anything you need to, rather than relying on other developers to have happened to build the software you need.
Note: for sites that rely on JavaScript, Mechanize won't work; you can use Capybara + PhantomJS to drive an actual browser that can run JavaScript from Ruby.
Note 2: It's possible that you don't actually have to go through the motions of (1) going to the login page, (2) filling in your info, and (3) clicking "Login". Depending on how their authentication works, you may be able to go directly to the page that displays the info you need and provide your credentials to that page using basic auth or some other means. You'll have to look at how their auth system works and do some trial and error. The most straightforward, most-likely-to-work approach is to do just what a real user would do: log in through the login page.
Update
After writing all this, I came across the vpnbook-utils library (during a search for "vpnbook api") which I think does what you need:
...With this little tool you can generate OpenVPN config files for the free VPN provider vpnbook.com...
...it also extracts the ever changing credentials from the vpnbook.com website...
It looks like, with a single command:
vpnbook config
you can automatically grab the credentials and write them into a config file.
Good luck! I still recommend you learn ruby :)
You don't even need to parse the content. Just do a string search for the second occurrence of Username:, cut everything before that, and use sed to pull out the content between the next two occurrences of <strong> and </strong>. You can use curl or wget -qO- to get the website's content.
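As a rough, untested sketch of that pipeline (it assumes the credentials on http://www.vpnbook.com/freevpn appear inside <strong> tags at and after the line with the second "Username:" label, which you should verify against the page source):
# print everything from the second "Username:" onwards, keep the first two
# <strong>...</strong> values (username, then password), and strip the tags
curl -s http://www.vpnbook.com/freevpn \
  | awk '/Username:/ { seen++ } seen >= 2' \
  | grep -o '<strong>[^<]*</strong>' \
  | head -n 2 \
  | sed 's/<[^>]*>//g' > credentials.txt
If the page layout changes, this will silently produce garbage, so sanity-check credentials.txt before handing it to OpenVPN.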

Printable Large PDF on the Web

The Problem
I have a 35 MB PDF file with 130 pages that I need to put online so that people can print off different sections from it each week.
I host the PDF file on Amazon S3 now and have been told that users don't like having to wait for the whole file to download before they can choose which pages they want to print.
I assume I am going to have to get creative and output the whole magazine to JPGs and find a neat viewer, or find another service like ISSUU that doesn't suck.
The Requirements and Situation
I am given 130 single-page PDF files each week (all together these make up The Magazine).
Users can browse the Magazine
Users can print a few pages.
Can Pay
Automated Process
Things I've tried
Google Docs Viewer - I get an error: "Sorry, we are unable to retrieve the document for viewing or you don't have permission to view the document."
ISSUU.com - They make my users log in to print. No way to automate the upload/conversion.
FlexPaper - Uses SWFTools (see next)
SWFTools - "File is too complex" error.
Hosting the PDF file with an image preview of the cover - Users say having to download the whole file before viewing it is too slow. (I can't get new users. =()
Anyone have a solution to this? Or a fix for something I have tried already?
PDF documents can be optimized for downloading over the web; this process is known as PDF linearization. If you have control over the PDF files you are going to use, you could try to optimize them as linearized PDFs. There are many tools that can help with this task, just to name a few:
Ghostscript (GPL)
Amyuni PDF Converter (Commercial, Windows only, usual disclaimer applies)
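With Ghostscript, for instance, linearization is a single switch on the pdfwrite device; a minimal sketch, with magazine.pdf standing in for your real file:
# rewrite the PDF with "fast web view" (linearization) enabled
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dFastWebView=true \
   -sOutputFile=magazine-linearized.pdf magazine.pdf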
Another option could be to split your file into sections and only deliver each section to its "owner". For the rest of the information, you can add bookmarks linking to the other sections, so that they can also be retrieved if needed. For example:
If linearization alone is not enough and you have no natural way to split the file, you could split it by page numbers and create bookmarks like these:
-Pages 1-100
-Pages 101-200
-Pages 201-300
...
-Pages 901-1000
-All pages*
The last bookmark is for the ambitious guy that wants to have the whole thing by all means.
And of course you can combine the two approaches and deliver each section as a linearized PDF.
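As a sketch of that combined approach, again with Ghostscript and the hypothetical magazine.pdf, each section can be extracted by page range and linearized in the same pass:
# pull pages 1-100 into their own linearized PDF; repeat per section
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dFastWebView=true \
   -dFirstPage=1 -dLastPage=100 \
   -sOutputFile=pages-001-100.pdf magazine.pdf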
Blankasaurus,
Based on what you've tried, it looks like you are willing to prep the document(s), or I wouldn't suggest this. See if it meets your needs: download ColdFusion and install it locally on your PC/VM. You can use CF's cfpdf function to automatically create "thumbnails" (you can set the size) of each of the pages without much work. Then load them into your favorite gallery script, with links to the individual PDFs. Convoluted, I know, but it shouldn't take more than 10 minutes once you get the gallery script working.
I would recommend splitting the PDF into pages and then using a web-based viewer to publish them online. FlexPaper provides many open source tools, such as pdf2json and pdftoimage, to help with the publishing. Have a look at our examples here:
http://flexpaper.devaldi.com/demo/

How do I show an external file's creation date on a website?

I am using this site template to create a mobile/iPhone-friendly site. I want to have it link to files, and below each link I want it to show the creation date. Currently everything is working fine, but every time I upload a file I also have to go into index.html and change the modification date. Is there any type of script that will do this for me and work on my site? I have a very basic understanding of HTML, hence why I am using a template.
Thanks!
You're going to need some server-side scripting like PHP or ASP.NET. Those have built-in file I/O libraries you can use to get the creation/modification date.
There are two ways to show a file's last modification date (and only one of them works for the creation date).
You can have a file list generated by the file-listing capability of your web server. Basically, any URL mapping to a directory that is permitted to show its contents will result in a web page listing the directory contents, like this:
Index of /images/appimages/MastheadButtons
Name               Last modified      Size  Description
-------------------------------------------------------
Parent Directory   22-Jun-2010 09:35    -
GP.JPG   [link]    22-Jun-2010 09:41    1k
web.jpg  [link]    29-Jan-2003 15:28   17k
You can have a back-end (CGI) script that produces the HTML page and prints any info you wish.
If you only know HTML, the second approach would not be practical for you. If you know some programming language in which to write web apps (PHP, Perl, anything), you can ask a more targeted question about how to achieve what you want in that language.
However, HTML by itself runs in your browser. It doesn't execute any code on the web server where the file lives, and thus doesn't know anything about the files.
Found the solution: I changed the file extension of the iPhone website template page to .php and then inserted this code where I want the modification date to appear:
<?= date("m/d/Y H:i:s",filemtime("filename.extension")) ?>