Pulling webpages from an adult site -- how to get past the site agreement?

I'm trying to parse a bunch of webpages from an adult website using Ruby:
require 'hpricot'
require 'open-uri'
doc = Hpricot(open('random page on an adult website'))
However, what I end up getting instead is that initial 'Site Agreement' page making sure that you're 18+, etc.
How do I get past the Site Agreement and pull the webpages I want? (If there's a way to do it, any language is fine.)

You're going to have to figure out how the site detects that a visitor has accepted the agreement.
The most obvious choice would be cookies. Likely when a visitor accepts the agreement, a cookie is sent to their browser, which is then passed back to the site on every subsequent request.
You'll have to get your script to act like a visitor by accepting the cookie, and sending it with every subsequent request. This will require programming on your part to request the "accept agreement" page first, find the cookie, and store it for use. It's likely that they don't use a specific cookie for the agreement, but rather store it in a session, in which case you just need to find the session cookie.
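If the site does use a cookie, one way to handle it by hand is to request the agreement-acceptance URL first, pull the cookie out of the response, and pass it along with every page you actually want to parse. A rough Ruby sketch only -- the URLs and the accept parameter are made up; check the real ones in the page source:

require 'net/http'
require 'open-uri'
require 'hpricot'

# Hit the "accept agreement" URL once and keep the session cookie it sets.
# (Placeholder URL and parameter -- inspect the real agreement page.)
agree_uri = URI.parse('http://www.example-adult-site.com/agree?accept=yes')
response  = Net::HTTP.get_response(agree_uri)
cookie    = response['Set-Cookie'].to_s.split(';').first   # e.g. "PHPSESSID=abc123"

# Send that cookie back with every page you actually want to parse.
doc = Hpricot(open('http://www.example-adult-site.com/some-page',
                   'Cookie' => cookie))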

The 'Site Agreement' page probably has a link you have to click or form you have to submit to send back to the server to proceed. Read the source of that page to be sure. You could send that response back from your application. I don't know how to do that in Ruby, but I've seen similar tasks done using cURL and libcurl, which can probably be used from Ruby.

Install the LiveHTTPHeaders plugin for Firefox and visit the site. Watch the headers and see what happens when you accept the agreement. You'll probably see that the browser sends some request (possibly a POST) and accepts some cookies. Then you'll have to repeat whatever the browser does in your Ruby script.
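In Ruby, the Mechanize gem can do most of that repetition for you: it keeps cookies between requests and can submit forms, so the script can "accept" the agreement and then fetch the real pages in the same session. A rough sketch -- the URLs are placeholders, and it assumes the agreement page's first form is the accept form:

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://www.example-adult-site.com/')

# Assume the agreement page's first form is the "I agree" form; submit it.
if (form = page.forms.first)
  page = agent.submit(form, form.buttons.first)
end

# The cookie set by the agreement now lives in agent.cookie_jar, so further
# requests in this session go straight to the real content.
content_page = agent.get('http://www.example-adult-site.com/random-page')
puts content_page.title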

Related

Capture response from .jsp

I am a naive user.
There is this website which is a really important source of information for my business.
To monitor websites, I convert them to RSS feeds using the page2rss service and then monitor the feeds in IFTTT.
However, this particular site does not use static web pages; it generates data in response to API calls.
Here is a sample API Call:
https://www.mpeproc.gov.in/ROOTAPP/GetTenderFreeView.jsp?Department=Urban%20Administration%20and%20Development%20Department&company=MPSEDC
Is there a way I could record the response from this call to an HTML page on my server? Or is there any other way to monitor such dynamic pages?
There are solutions, but not simple ones. The first page uses JavaScript to create a FORM which it then submits. You can simulate this with the command-line tool curl; see https://superuser.com/questions/149329/what-is-the-curl-command-line-syntax-to-do-a-post-request
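If the endpoint answers a plain GET (as the sample URL above suggests), a few lines of Ruby run from a scheduled job could save the response as a static HTML page that page2rss can then watch; the output filename here is arbitrary. If the site really does require the POSTed form, pass the form fields with Net::HTTP.post_form instead.

require 'open-uri'

url = 'https://www.mpeproc.gov.in/ROOTAPP/GetTenderFreeView.jsp?Department=Urban%20Administration%20and%20Development%20Department&company=MPSEDC'

# Fetch the dynamic response and store it as a static page on your server.
# (On very old Rubies, use open(url) from open-uri instead of URI.open.)
html = URI.open(url).read
File.write('tenders.html', html)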
But take note that many sites don't like scraping; if they notice what you're doing, you may end up on a blacklist. So it's better to ask the site's owner for permission before you aim automated tools at their precious data.

Finding out what http requests were made by the user

I am working on a project that involves finding out which HTTP requests were made by the user.
I have all the HTTP request and response headers (but not the data), and I need to find out which content was requested by the user and which content was fetched automatically (e.g. ad pages, background streaming, and all sorts of irrelevant content).
When recording the network traffic (even for a short period) a lot of content gets generated, and most of it is not relevant.
Since I'm no expert in HTTP, I'd like some direction as to which headers I can safely rely on (assuming most web pages send them), and which headers might be omitted, so that it would not be safe to rely on them.
My current idea:
find all the HTML files, determine which of them are the main HTML files (no Referer, or a search-engine Referer), then recursively mark all the files requested by those HTML files as relevant and discard the rest.
The problem with this is that I've been told I can't trust the Referer header, and I have no idea how to identify which HTML files the user actually clicked.
Any kind of help will be appreciated. Sorry if the post is not formatted well; this is my first question here.
EDIT:
I've been told the question isn't clear enough, so all I'm asking for is some way to determine which requests were triggered by the user and which requests were made automatically.
To determine which request was sent by the user, look at the first request sent over the connection and at its response body.
All external files referenced in that first body, which are then fetched one after another, were most likely requested automatically, without the user's interaction.
The time between requests could also be a factor worth looking at.
Another thing you already mentioned yourself is the Referer header. As far as RFC 2616 section 14.36 goes it can be trusted, since the Referer header must not be sent if the Request-URI comes from user input (e.g. typed into the address bar). Note, though, that automatically fetched content may also lack the Referer header, since it is optional.
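As a very rough illustration of that heuristic, here is a Ruby sketch. It assumes each recorded request has already been reduced to a hash with :url, :referer and :content_type fields (those names are made up), and it simply treats HTML requests without a Referer, or with a search-engine Referer, as user-initiated and everything referred from them as automatic:

requests = [
  { url: 'http://news.example.com/',          referer: nil,                        content_type: 'text/html' },
  { url: 'http://news.example.com/style.css', referer: 'http://news.example.com/', content_type: 'text/css' },
  { url: 'http://ads.example.net/banner.js',  referer: 'http://news.example.com/', content_type: 'application/javascript' }
]

# HTML requests with no Referer (or a search-engine Referer) are treated as
# pages the user navigated to directly.
user_pages = requests.select do |r|
  r[:content_type].to_s.start_with?('text/html') &&
    (r[:referer].nil? || r[:referer] =~ /\b(google|bing|yahoo)\./)
end
user_urls = user_pages.map { |r| r[:url] }

# Everything whose Referer points at one of those pages was most likely
# fetched automatically (images, scripts, ads, ...).
automatic = requests.select { |r| user_urls.include?(r[:referer]) }

puts "User-initiated: #{user_urls.inspect}"
puts "Automatic:      #{automatic.map { |r| r[:url] }.inspect}"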

Cookie: basic info and queries

I read this in Wiki:
A cookie, also known as an HTTP cookie, web cookie, or browser cookie, is usually a small piece of data sent from a website and stored in a user's web browser while a user is browsing a website. When the user browses the same website in the future, the data stored in the cookie can be retrieved by the website to notify the website of the user's previous activity.[1] Cookies were designed to be a reliable mechanism for websites to remember the state of the website or activity the user had taken in the past. This can include clicking particular buttons, logging in, or a record of which pages were visited by the user even months or years ago.
Now I want to know who creates cookies. Is it the browser, or can every site create a cookie on its own? Who controls what information is saved in a cookie, and how can all the form field data be saved in a cookie?
I think "Setting a cookie" section will help you a lot.
http://en.wikipedia.org/wiki/HTTP_cookie
The website creates the cookie, whether on the front end (a JavaScript cookie) or the back end (e.g. a PHP cookie).
The website developer controls what is stored in the cookie.
The website developer gets the information from a form, processes it, and then stores it in the cookie.
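A tiny Ruby sketch of that idea, using the Sinatra gem (the route names and cookie name are arbitrary): the server-side code decides which form fields, if any, end up in a cookie.

require 'sinatra'

post '/login' do
  # The developer chooses what goes into the cookie -- here just the username
  # submitted in the form, not every field.
  response.set_cookie('username', value: params['username'], path: '/')
  'Logged in'
end

get '/' do
  "Hello, #{request.cookies['username'] || 'guest'}"
end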
Cookies are created by the site owner; in effect they act as small pieces of client-side session state.
Cookies are created on the client machine at the request of the web server. With PHP sessions, for example, the server has the browser store a PHPSESSID cookie that identifies the user; the browser sends that cookie back with every subsequent request, so the PHP code on the server can recognize the user.
The creator of the website controls what data is stored in the cookie and how it is used. For example:
<?php
session_start();

// Redirect visitors who have not logged in to the login page.
if (empty($_SESSION['logged_in'])) {
    header('Location: login.php');
    exit;
}
?>
The code above checks whether the session contains a 'logged_in' value; if the user has not logged in, they are redirected to the login page, otherwise they can continue to view the page.
" THanks you , could please let me know can one site access cookies of other site and read information from it and make sense out of it – Vinayjava 1 hour ago"
One site cannot normally read another site's cookies directly, but attacks such as cross-site scripting (XSS) injection can be used to steal user cookies, and stolen cookies (or cross-site request forgery, which rides on them) can then be abused.
If you have any other questions about cookies, message me and I should be able to help.

Web crawler dealing with "Sign up or log in to read full content"

Given a page like this, I am trying to extract all the answer text with a Ruby web crawler.
I am using Nokogiri and search('div[#class="answer_content"]').inner_text to access the answers, but I can't seem to access all the text, even though I am in fact logged in. About 200 words down, I get the message "sign up or log in to read full content."
Also, is this div class the correct one to use?
It seems to me that you need to authenticate the crawler itself. I did this a few weeks ago. I used a Firefox extension called Tamper Data, which let me see the requests made between the browser and the server. In my case, authentication was handled by a session id; I just had to grab it and pass it with each request I made to the server.
In your case the authentication might be done in a different way; you'll have to see for yourself. Anyway, I can give more detail if this isn't clear enough.
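For example, if the site uses a session cookie, you could copy the cookie from a logged-in browser session (the cookie name and value below are placeholders, as is the URL) and send it with every request the crawler makes:

require 'nokogiri'
require 'open-uri'

# Placeholder: paste the real session cookie from your logged-in browser
# (visible with Tamper Data or the browser's developer tools).
session_cookie = 'session_id=PASTE_VALUE_HERE'

html = URI.open('https://example.com/some-question-page',
                'Cookie' => session_cookie).read
doc = Nokogiri::HTML(html)

doc.css('div.answer_content').each do |answer|
  puts answer.text.strip
end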

Is there any way of sending on POST data with a Redirect in an MVC3 controller?

I have a form which is posted to an MVC3 controller and then has to be POSTed to an external URL. The browser needs to end up at that external URL, so I thought a permanent redirect would be perfect.
However, how do I send the form POST data with the redirect?
I don't really want to send another page down to the browser to do it.
Thanks
A redirect will always be a GET, not a POST.
If the second POST doesn't need to come from the client, you can make it with HttpWebRequest from the server. Beware that the secondary POST may hold up the return of the client's request if the external server is down or running slowly.
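A rough sketch of the server making the second POST itself, shown in Ruby for illustration (in ASP.NET MVC you would do the same with HttpWebRequest); the URL and form fields are placeholders:

require 'net/http'
require 'uri'

# Placeholders: substitute the real external endpoint and the posted fields.
external_url = URI('https://external.example.com/receive')
form_fields  = { 'orderId' => '123', 'amount' => '9.99' }

# The server forwards the POST itself; the browser never makes this request,
# so it can still be redirected (with a GET) afterwards.
response = Net::HTTP.post_form(external_url, form_fields)
puts response.code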
A permanent redirect is wholly inappropriate here. First, it will not cause form values to be resubmitted. Second, the semantics are all wrong: you would be telling the browser "do not request this URL again; instead, go here". Yet you presumably do want future submissions to keep coming to your URL.
Gaz's idea could work, and it involves only your server.
Alternatively, send back a page containing a form with the same submitted values, whose action is the external URL, and use client-side code to submit it automatically.