Automatically copy text from a web page

There is a VPN provider that keeps changing their password. I have an autologin, but obviously the VPN connection drops every time they change the password, and I have to manually copy and paste the new password into the credentials file.
http://www.vpnbook.com/freevpn
This is annoying. I realise the VPN provider probably doesn't want people to be able to do this, but it's not against the ToS and it isn't illegal, so work with me here!
I need a way to automatically generate a file which has nothing in it except
username
password
on separate lines, just like the one above. Downloading the entire page as a text file automatically (I can do that) will therefore not work. OpenVPN will not understand the credentials file unless it is purely and simply
username
password
and nothing more.
So, any ideas?

Ideally, this kind of thing is done via an API that vpnbook provides; a script can then access the information and store it in a text file much more easily.
Barring that (and it looks like vpnbook doesn't have an API), you'll have to use a technique called Web Scraping.
To automate this via "Web Scraping", you'll need to write a script that does the following:
First, log in to vpnbook.com with your credentials
Then navigate to the page that has the credentials
Then traverse the structure of the page (called the DOM) to find the info you want
Finally, save out this info to a text file.
I typically do web scraping with Ruby and the mechanize library. The first example on the Mechanize examples page shows how to visit the Google homepage, perform a search for "Hello World", and then print out each link in the results, one at a time. This is similar to what you are trying to do, except that instead of printing the links you would write your text to a file (search for how to write a text file with Ruby):
require 'rubygems'
require 'mechanize'

# create an agent that identifies itself as Safari
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

a.get('http://google.com/') do |page|
  # fill in the search form and submit it
  search_result = page.form_with(:id => 'gbqf') do |search|
    search.q = 'Hello world'
  end.submit

  # print the text of each result link
  search_result.links.each do |link|
    puts link.text
  end
end
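Adapting that pattern to your case, here's a rough sketch. The assumption that the credentials are the first two <strong> elements on the vpnbook page is mine, so inspect the actual markup and adjust the selector; the free-credentials page may not even require a login:

require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.vpnbook.com/freevpn')

# Assumption: the username and password are the first two <strong>
# elements on the page -- verify against the real markup.
username, password = page.search('strong').first(2).map(&:text)

# OpenVPN wants exactly two lines: username, then password.
File.open('credentials.txt', 'w') do |f|
  f.puts username
  f.puts password
end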
To run this on your computer you would need to:
a. Install Ruby
b. Save the script in a file called scrape.rb
c. Run it from the command line: ruby scrape.rb
OS X comes with an older Ruby that will work for this. Check out the Ruby site for instructions on installing or updating it for your OS.
Before using a gem like mechanize you need to install it:
gem install mechanize
(this depends on Rubygems being installed, which I think typically comes with Ruby).
If you're new to programming this might sound like a big project, but you'll gain an amazing tool for your toolbox: you'll feel like you can do pretty much anything you need to, without relying on other developers having happened to build the software you need.
Note: for sites that rely on JavaScript, mechanize won't work. You can use Capybara + PhantomJS to drive an actual browser that can run JavaScript from Ruby.
Note 2: It's possible that you don't actually have to go through the motions of (1) going to the login page, (2) filling in your info, (3) clicking "Login", etc. Depending on how their authentication works, you may be able to go directly to the page that displays the info you need and provide your credentials directly to that page, using basic auth or other means. You'll have to look at how their auth system works and do some trial and error here. The most straightforward, most likely to work approach is to do what a real user would do: log in through the login page.
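For example, if the page happened to sit behind HTTP basic auth, Mechanize (2.x) could supply the credentials up front. This is purely hypothetical, since vpnbook may not use basic auth at all:

require 'mechanize'

agent = Mechanize.new
# register basic-auth credentials for the site before the first request
# (hypothetical -- check how the site actually authenticates)
agent.add_auth('http://www.vpnbook.com/', 'your_username', 'your_password')
page = agent.get('http://www.vpnbook.com/freevpn')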
Update
After writing all this, I came across the vpnbook-utils library (during a search for "vpnbook api") which I think does what you need:
...With this little tool you can generate OpenVPN config files for the free VPN provider vpnbook.com...
...it also extracts the ever changing credentials from the vpnbook.com website...
It looks like, with one command:
vpnbook config
you can automatically grab the credentials and write them into a config file.
Good luck! I still recommend you learn ruby :)

You don't even need to parse the content. Just search for the second occurrence of Username:, cut everything before it, then use sed to extract the content between the next two occurrences of <strong> and </strong>. You can use curl or wget -qO- to fetch the page's content.
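A rough shell sketch of that approach; the markup details (where "Username:" appears and the <strong> tags) are assumptions, so check the actual page source and adjust:

#!/bin/sh
page=$(wget -qO- http://www.vpnbook.com/freevpn)
# drop everything up to and including the second occurrence of "Username:"
rest=${page#*Username:}
rest=${rest#*Username:}
# assume the next two <strong>...</strong> elements hold the credentials
printf '%s\n' "$rest" \
  | grep -o '<strong>[^<]*</strong>' \
  | sed 's/<[^>]*>//g' \
  | head -2 > credentials.txt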

Related

Extracting values from a login accessible web page post-javascript using Ruby

I have a stock trading website that is only accessible after logging into the site. After logging in, there is a stock value that I am trying to extract. That number is not readily available and takes a while to load as it is being updated from the company's database.
I am trying to write a script in Ruby that will allow me to extract the number and then use it in my program.
In Firebug, the tag looks like this, but only after the number has loaded:
<span id="ContentPlaceHolderTodaysStock">10,747</span>
I have explored libraries such as hpricot and nokogiri and have tried code similar to the following:
require "nokogiri"
require "open-uri"
doc = Nokogiri::HTML(open("website.com/stocks"))
puts doc.xpath("//span/text()")
The problems I run into are:
1) it only reads the HTML from the login page "website.com" instead of "website.com/stocks"
2) once I do get past the login, how do I use the HTML code after the JavaScript has loaded?
I have also tried Watir, which can get me past problem #1, but then doing something like the following doesn't help with problem #2, because it returns the original HTML source...
require 'net/http'
# fetches only the raw HTML -- no JavaScript is executed
source = Net::HTTP.get('website.com', '/stocks')
Any help in solving this problem would be greatly appreciated. Thank you!
Since you are able to log in using Watir, you may as well use it to get the text off the page. Watir has built-in methods for waiting for asynchronous components to load - see http://watirwebdriver.com/waiting/.
To get the text, you will want something like:
puts browser.span(:id => 'element_id').when_present.text
If it's being loaded after-the-fact, it can't be seen by Nokogiri. You'll need to use something like Watir.
once I do get past the login, how do I use the HTML code after the JavaScript has loaded?
You can't get there with Nokogiri. The added HTML doesn't exist in Nokogiri's world, since it's given the base HTML via OpenURI. Nokogiri doesn't execute JavaScript.
Watir, on the other hand, can do all that, so it's your only choice. You'll have to figure out how to navigate through the login page, request the stock page, then loop, waiting until the text appears, then grab it and do whatever you want with it.
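A minimal Watir sketch of that flow; the URLs and form locators below are placeholders you'll need to adapt to the real site:

require 'watir-webdriver'

browser = Watir::Browser.new
browser.goto 'http://website.com'                 # login page (placeholder)
# field and button locators are assumptions -- inspect the real form
browser.text_field(:name => 'username').set 'your_username'
browser.text_field(:name => 'password').set 'your_password'
browser.button(:value => 'Login').click

browser.goto 'http://website.com/stocks'
# when_present waits until the JavaScript has filled the span in
puts browser.span(:id => 'ContentPlaceHolderTodaysStock').when_present.text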

How might I read (parse) html directly from a file using Watir?

I can already do this with Nokogiri of course
doc = Nokogiri::HTML(src)
where src is a text column in my database.
But I really like Watir's search interface for developers over Nokogiri.
My searches on the internet haven't turned up much on how to do this for unhosted HTML.
You can access local HTML files by adding "file://" to the start of the path to the file (see my blog post on the topic).
For example, let's say you have an HTML file on your computer at "C:\users\testuser\desktop\test_file.html".
If you want to open this file and interact with it using Watir, you can do:
browser = Watir::Browser.new
browser.goto('file://C:\users\testuser\desktop\test_file.html')
Then you can interact with the browser/page/html as you normally would with Watir.
Note: If you get a NoMethodError: unknown property or method: 'document' exception when trying to interact with the browser, make sure that your browser is being opened by a user with administrative privileges.
If the above does not work for you, you can try navigating with the driver directly like so:
browser = Watir::Browser.new
browser.driver.navigate.to('file:///Users/path/to/file.html')
P.S. I am on a Mac, but this should work irrespective of your OS.
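If, as in the question, the HTML lives in a database column rather than on disk, one option is to write it to a temporary file first and point Watir at that. A sketch, assuming src holds the HTML string:

require 'tempfile'
require 'watir-webdriver'

# write the HTML string (e.g. your database text column) to a temp file
file = Tempfile.new(['page', '.html'])
file.write(src)
file.close

browser = Watir::Browser.new
browser.goto "file://#{file.path}"
# now use Watir's search interface as usual, e.g.:
puts browser.span(:id => 'some_id').text   # hypothetical id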

How do I make a file download require a username and password

In a basic HTML web page, how do you make the user have to enter a username and password before they are allowed to download a file?
What is the best way of achieving this on a website, preferably in plain HTML?
This can't be achieved in HTML.
With client-side technologies, the best you're likely to achieve is a JavaScript prompt whose input you use to direct people to a secret URI.
This is something that really should be handled by the web server.
You won't be able to do this with plain HTML. Easiest way is probably to place the protected file in a directory protected by an .htaccess and .htpasswd file.
I agree with David...this can't really be done. Putting up a JavaScript prompt only protects you so far as well...that isn't really secure.
If you are on an Apache server, you could set up a .htaccess file with some user authorization options that point at a page containing the link to your download file. A simple implementation would give you one user/password combo that you could distribute to your users. The Apache documentation for this may be found here. Unfortunately I'm not really familiar with how IIS handles this sort of thing.
If you don't want to distribute a generic username/password combo to your users, you're pretty much going to be stuck creating (or making use of an existing) user-management system. There are quite a few modules strewn throughout the web, and a simple Google search should bring you to quite a number of tutorials or existing implementations, depending on what you require.
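For reference, a minimal sketch of the Apache approach described above; paths and names are examples:

# .htaccess in the directory containing the protected file
AuthType Basic
AuthName "Restricted downloads"
AuthUserFile /path/to/.htpasswd
Require valid-user

The .htpasswd file itself can be created with Apache's htpasswd utility, e.g. htpasswd -c /path/to/.htpasswd someuser.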
First download phpMyAdmin and open it. Click on "create login" and name the file login.php. Then create a new file named file.php. After you click yes, a login prompt pops up; enter the username and password you want.
Hope this helps!

How to configure Netbeans code entry point when you use mod-rewriting

I am developing a website in PHP and I am using mod_rewrite rules. I want to use the NetBeans Run Configuration (under project properties) to set code entry points that look like http://project/news or http://project/user/12.
It seems NetBeans has a problem with this and needs an entry point to a physical file, like http://project/user.php?id=12.
Has anyone found a good way to work around this?
I see your question is a bit old, but since it has no answer, I will give you one.
What I did to solve the problem was to give NetBeans what it wants in terms of a valid physical file, but provide my controller (index.php in this case) with the data to act correctly. I pass this data using a query parameter. Using your example of project as the website domain and user/12 as the URL, put the following in the NetBeans Run Configuration and Arguments boxes. NetBeans does not need the ?, as it inserts that automatically; see the complete URL below the input boxes.
Project URL: http://project
Index File: index.php (put your controller name here)
Arguments: url=user/12
http://project/index.php?url=user/12
Then, in your controller (index.php in this example), test for the url query param; if it exists, parse it instead of the actual server request, as you would normally.
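A sketch of what that test might look like in index.php; the names are illustrative:

<?php
// front controller (index.php)
// Prefer the NetBeans-style ?url=... parameter when present;
// otherwise fall back to the rewritten request path.
if (isset($_GET['url'])) {
    $route = $_GET['url'];                                  // e.g. "user/12"
} else {
    $route = trim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '/');
}
$segments = explode('/', $route);                           // ["user", "12"]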
I also do not want the above URL to be publicly accessible, so by using an IS_DEVELOPER define, which is true only for configured developer IP addresses, I can control who has access to that special URL.
Alternatively, if you are trying to debug specific pages, you can set the NetBeans run configuration to:
http://project/
and debug your project. You must run through your home page once, but since the debugger is then active, you can just navigate to http://project/user/12 in your browser and NetBeans will break at that entry point. I found passing through my home page every time a pain, so I use the technique above.
Hopefully that provides enough insight to work with your project. It has worked well for me, and if you need more detail, just ask.
EDIT: Also, one can make the Run Configuration project URL the complete URL http://project/user/12 and leave the Index File and Arguments blank; that works too, without any special code in the controller (tested in NetBeans 7.1). I think I will start using this method.

Make html validation part of build cycle

Currently, when I build my site, I have to manually validate every page at the W3C site (meaning: when Opera pops up, press Ctrl+Alt+Shift+U).
Is it possible to automatically validate every page whenever I build my pages?
P.S.: This page doesn't validate ;)
You can download and install your own copy of the validator (http://validator.w3.org/source/) and invoke it locally instead of trekking out to w3.org for each page. Still, this requires piggybacking on a web server, through plain HTTP or the API. For a simpler solution, you may prefer to download the SP library (http://www.jclark.com/sp/index.htm or http://openjade.sourceforge.net/), on which the W3C validator is based, and invoke the nsgmls command from the command line.
There are, of course, also many desktop HTML validators that can process a batch of HTML pages at once. They may not be fully automated, but they would certainly be much easier than manually checking each page. For example: http://arealvalidator.com/ (Windows), http://www.webthing.com/software/validator-lite/install.html (Unix).
Might not be the best choice for you, but there's an Ant task for this: XmlValidate.
If you've got the HTML files in source control like SVN or Git, you can use a pre-commit hook script to run client-side validators on them. Or, if you're feeling adventurous, you could use that method to ping another script on the server that validates the live pages...
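A minimal sketch of such a Git hook, saved as .git/hooks/pre-commit; validate-html stands in for whichever local validator you install:

#!/bin/sh
# validate every staged .html file; abort the commit on the first failure
for f in $(git diff --cached --name-only --diff-filter=ACM | grep '\.html$'); do
  validate-html "$f" || exit 1   # placeholder validator command
done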