When a user requests a page (a page request), the browser then requests several other components within that page (inline requests), e.g. images, CSS files, JS files, and so on.
By just sniffing the traffic between the client and the server, is there a way to differentiate page requests from inline requests, and to find the time interval between a user's page requests (the page viewing time)?
The web is inherently stateless, so it's tricky to figure out which requests mean what. If you know, for example, that the application is Java EE based, you can look for the JSESSIONID cookie in the request headers to figure out whether requests belong to the same user.
If you further know the pattern of page requests, you can then start to figure out how long they're spending on a page.
If you have access to the traffic on the wire, you could sneakily inject some JavaScript into the outgoing page and get monitoring events back.
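A rough sketch of that post-processing in Ruby: it assumes the sniffed requests have already been parsed into an array of hashes with :time, :path and :headers, and that input shape, like the Accept-header heuristic itself, is an assumption for illustration rather than a guaranteed rule.

# Heuristic: a navigation (page request) normally asks for text/html,
# while images, CSS and JS (inline requests) are fetched with other
# Accept values right after it.
def page_request?(req)
  req[:headers]['Accept'].to_s.include?('text/html')
end

# Viewing time = gap between consecutive page requests.
def viewing_times(requests)
  pages = requests.select { |r| page_request?(r) }.sort_by { |r| r[:time] }
  pages.each_cons(2).map { |a, b| { page: a[:path], seconds: b[:time] - a[:time] } }
end

# Hypothetical sniffed data:
requests = [
  { time: 0,  path: '/index.html', headers: { 'Accept' => 'text/html' } },
  { time: 1,  path: '/style.css',  headers: { 'Accept' => 'text/css' } },
  { time: 42, path: '/about.html', headers: { 'Accept' => 'text/html' } }
]
p viewing_times(requests)   # => [{:page=>"/index.html", :seconds=>42}]

Note that the last page has no following request, so its viewing time can't be computed this way.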
I am running a website with affiliate links.
When the visitors of mydomain.com/page.php click on such an affiliate link,
they are being sent to a link on a domain owned by the affiliate network (network.com/link), and then redirected through the affiliate network to the relevant page in the store (store.com/page.asp).
Over the last two months, the reports of the affiliate network indicate that about 13,000 clicks that I sent to such links carried mydomain.com/page.php as the referring URL, as I would expect.
However, about 20 other clicks carried abnormal referring URLs, such as:
http://app.mam.vaccint.com/getapp/CT3297962/mam.html
http://www.store.com/page.asp
http://www.network.com/link
http://apnwidgets.ask.com/widget/everest/radio/4/radio-button.html
http://search.yahoo.com/search
http://www.google.com/webhp
http://www.bing.com/
http://192.168.1.1/spyware/blockpage
Unfortunately, this has led the compliance team of my affiliate network to believe that I have a hidden traffic source apart from my website. They claim it appears as if I am using some kind of third-party software to send traffic to store.com, which is of course not true.
They are holding me accountable for this and I am required to provide an explanation.
What could have caused my website visitors to arrive at network.com / store.com while carrying the above referring URLs?
I'm not certain, but looking at the referring URLs it seems quite likely that these pages had your content embedded or listed on them. For example:
google.com/webhp - a search result page listing your content, a cached copy, or an image result of your page
bing.com - another search-result page (often a web cache)
192.168.1.1/spyware/blockpage - looks like someone accessed your site but ended up on this custom firewall block page; somehow the affiliate widget still got loaded, presumably because the firewall permitted it
store.com/page.asp & network.com/link - these look like internal redirect URLs which sent traffic on to the relevant page (store.com/page.asp)
(the rest) - the other links likely have a similar story: they ended up sending traffic to your affiliate network, but with a different referring URL
I'm sure if you replicate this case in front of them via Google cache / Bing cache, they would get a better understanding of the issue.
Otherwise, try to identify the source referrer of network.com/link, which is probably under their control, so they should have access to the logs.
I am working on a project that involves finding out which HTTP requests were made by the user.
I have all the HTTP request and response headers (but not the data), and I need to find out which content was requested by the user and which content was fetched automatically (e.g. ad pages, background streaming, and all sorts of irrelevant content).
When recording the network traffic (even for a short period) a lot of content gets generated, and most of it is not relevant.
Since I'm no expert in HTTP, I'd like some guidance on which headers I can safely rely on (assuming most web pages send them), and which headers might be omitted and so cannot be trusted.
My current idea:
Find all the HTML files, determine which were the main HTML files (no referrer, or a search-engine referrer), then recursively mark all the files requested by those HTML files as relevant, and discard the rest.
The problem is that I've been told I can't trust the Referer header, and I have no idea how to identify which HTML files were actually clicked by the user.
Any help will be appreciated. Sorry if the post is not formatted well; this is my first question here.
EDIT:
I've been told the question isn't clear enough, so to restate it: all I'm asking for is some way to determine which requests were triggered by the user and which requests were made automatically.
To determine which requests were sent by the user, look at the first request sent over the connection and at its response body.
All external files referenced in that first body which are subsequently requested are most likely fetched automatically, without the user's interaction.
The time passing between requests could also be a factor worth looking at.
Another thing you already mentioned yourself is the Referer header. As far as RFC 2616 section 14.36 goes it can be trusted: the Referer header must not be sent if the Request-URI was obtained from user input. However, automatically fetched content may also arrive without a Referer header, since the header is optional.
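As a rough illustration of those heuristics (Accept header plus inter-request timing), here is a sketch in Ruby. It assumes the captured headers are already parsed into hashes with :time, :url and :headers; the field names and the 2-second gap threshold are made up for illustration, not a validated classifier.

USER_GAP = 2.0  # seconds of quiet time before a request suggests a user click

def likely_user_request?(req, prev)
  wants_html = req[:headers]['Accept'].to_s.include?('text/html')
  quiet      = prev.nil? || (req[:time] - prev[:time]) > USER_GAP
  # Navigations usually ask for HTML after a pause; sub-resources (images,
  # CSS, JS, ad iframes) arrive in a burst right after the page they belong to.
  wants_html && quiet
end

def classify(requests)
  sorted = requests.sort_by { |r| r[:time] }
  sorted.each_with_index.map do |req, i|
    prev = i.zero? ? nil : sorted[i - 1]
    [req[:url], likely_user_request?(req, prev) ? :user : :automatic]
  end
end

Combining this with a Referer check (a request whose Referer is one of the already-identified user pages is probably a sub-resource of that page) should get you reasonably far, keeping in mind that the header can be absent.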
I'm playing with the idea of having a completely decoupled HTML5 frontend, but still user authentication for a web app. Is this possible or will I run into some heavy browser security issues?
The idea is to have all static content delivered through a CDN on a domain like example.com, and to fetch dynamic data (and handle user authentication) through a separate subdomain, like api.example.com.
This would speed up the loading time of the site, and I could keep the frontend stuff in a completely separate repo so that the developers don't have to worry about setting up the backend to develop and test new features.
Is this already possible with some JS framework, perhaps backbone.js, angular.js, ember.js, or knockout.js?
It definitely is, but I think it is more about the approach than the technology. I have implemented what you describe for a project (it's online, but I don't want to do a shameless plug here; if you're interested in checking it out I can post the link). My stack is Java in the backend exposing a REST API for both authentication and business logic. The client is a backbone.js application. I explicitly decided NOT to use sessions at all. It is completely stateless. This of course means that the user must be re-authenticated at every request.
When the user logs in through a slightly modified OAuth endpoint, they get a token that must be passed with every request. Cookies work well here because they are handled automatically by the browser; if the token is not passed as a cookie, the backend expects it as a parameter. The frontend communicates using the REST endpoints.

It's a single-page, fully client-side application: the backend serves a page that is basically empty and includes a few JS files that are the application itself. No other page load occurs. Logout is done by simply deleting the cookie or not sending the auth token; the server cannot, and doesn't have to, "forget" about the user. Tokens are nice because they can be invalidated, either explicitly or by changing the password.

I chose this approach because it made it easy to develop a desktop app and a browser plugin for my webapp without touching a single line of backend code.
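To make the token idea concrete, here is a minimal client-side sketch in Ruby using Net::HTTP. The endpoints (api.example.com/login and /api/items) and the X-Auth-Token header name are hypothetical; the real names depend on your API.

require 'net/http'
require 'json'
require 'uri'

API = URI('https://api.example.com')

# Log in once: the server issues a token instead of creating a session.
def login(user, pass)
  res = Net::HTTP.post(URI("#{API}/login"),
                       { username: user, password: pass }.to_json,
                       'Content-Type' => 'application/json')
  JSON.parse(res.body)['token']
end

# Every subsequent request carries the token, so the backend stays stateless.
def fetch_items(token)
  req = Net::HTTP::Get.new(URI("#{API}/api/items"))
  req['X-Auth-Token'] = token
  Net::HTTP.start(API.host, API.port, use_ssl: true) { |http| http.request(req) }
end

In a browser-based single-page app the same idea applies, except the token lives in a cookie or localStorage and the frontend attaches it to each XHR call.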
I don't understand: how are web servers and trackers like Google Analytics able to track referrals?
Is it part of HTTP?
Is it some (un)specified behavior of the browsers?
Apparently every time you click on a link on a web page, the original web page is passed along with the request.
What is the exact mechanism behind that? Is it specified by some spec?
I've read a few docs and I've played with my own Tomcat server and my own Google Analytics account, but I don't understand how the "magic" happens.
Bonus (totally related) question: if, on my own website (served by Tomcat), I put a link to another site, does the other site see my website as the "referrer" without me doing anything special in Tomcat?
Referer (misspelled in the spec) is an HTTP header. It's a standard header that all major HTTP clients support (though some proxy servers and firewalls can be configured to strip it or mangle it). When you click on a link, your browser sends an HTTP request that contains the page being requested and the page on which the link was found, among other things.
Since this is a client (request) header, the server is irrelevant. And yes, clicking a link on a page hosted on your own server will result in that page's URL being sent to the other site's server, even though your server may not necessarily be reachable from that other site, depending on your network configuration.
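To see this from the server side, a tiny Rack app (save as config.ru and run with rackup) shows that the server merely reads a header the browser chose to send; this is a generic sketch, not tied to any particular framework.

# config.ru
run lambda { |env|
  referer = env['HTTP_REFERER'] || '(none)'   # set by the browser, if at all
  [200, { 'content-type' => 'text/plain' }, ["You came from: #{referer}\n"]]
}

Click a link pointing at this app from another page and the referring page's URL shows up; type the URL directly into the address bar and you get "(none)".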
One detail to add to what's already been said about how browsers send it: HTTPS changes the behavior a bit. I'm not sure whether it's in any spec, but if you jump from HTTPS to HTTP, whether you stay on the same domain or go to a different one, the referrer is sometimes not sent. I don't know the exact rules, but I've observed this in the wild. If there's some spec or description of this, it would be great.
EDIT: ok, the RFC says plainly:
Clients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol.
So, if you go from an HTTPS page to an HTTP link, the referrer info is not sent.
From: http://en.wikipedia.org/wiki/HTTP_referrer
"The referrer field is an optional part of the HTTP request sent by the browser program to the web server."
From RFC 2616:
"The Referer[sic] request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained (the 'referrer', although the header field is misspelled.)"
If you request a web page using a browser, your browser will send the HTTP Referer header along with the request.
Your browser passes the referrer with each page request.
It seems unusual that JavaScript has access to this as well, but it does.
Yes, the browser sends the previous page's URL in the HTTP headers. This is defined in the HTTP/1.1 spec:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.36
The answer to your question is yes, as the browser sends the referer.
"The referrer field is an optional part of the HTTP request sent by the browser program to the web server."
http://en.wikipedia.org/wiki/HTTP_referrer
When you click on a link the browser adds a Referer header to the request. It is part of HTTP. You can read more about it in RFC 2616, section 14.36.
I'm trying to parse a bunch of webpages from an adult website using Ruby:
require 'hpricot'
require 'open-uri'
doc = Hpricot(open('random page on an adult website'))
However, what I end up getting instead is that initial 'Site Agreement' page making sure that you're 18+, etc.
How do I get past the Site Agreement and pull the webpages I want? (If there's a way to do it, any language is fine.)
You're going to have to figure out how the site detects that a visitor has accepted the agreement.
The most obvious choice would be cookies. Likely when a visitor accepts the agreement, a cookie is sent to their browser, which is then passed back to the site on every subsequent request.
You'll have to get your script to act like a visitor by accepting the cookie, and sending it with every subsequent request. This will require programming on your part to request the "accept agreement" page first, find the cookie, and store it for use. It's likely that they don't use a specific cookie for the agreement, but rather store it in a session, in which case you just need to find the session cookie.
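A sketch of that flow with Ruby's Net::HTTP follows; the agreement path (/agreement) and the form field (agree=yes) are guesses, so inspect the real form first and adjust.

require 'net/http'
require 'uri'
require 'hpricot'

site = URI('http://www.example-adult-site.com')   # hypothetical host

Net::HTTP.start(site.host, site.port) do |http|
  # 1. Submit the "I am 18+" form and capture the cookie the site sets.
  agree = Net::HTTP::Post.new('/agreement')
  agree.set_form_data('agree' => 'yes')
  cookie = http.request(agree)['Set-Cookie'].to_s.split(';').first

  # 2. Replay that cookie on every later request, then parse as before.
  page = Net::HTTP::Get.new('/some/page.html')
  page['Cookie'] = cookie
  doc = Hpricot(http.request(page).body)
end

If the site uses a generic session cookie plus a server-side flag instead of a dedicated agreement cookie, the same code still works: you're just carrying the session cookie forward.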
The 'Site Agreement' page probably has a link you have to click or form you have to submit to send back to the server to proceed. Read the source of that page to be sure. You could send that response back from your application. I don't know how to do that in Ruby, but I've seen similar tasks done using cURL and libcurl, which can probably be used from Ruby.
Install the LiveHTTPHeaders plugin for Firefox and visit the site. Watch the headers and see what happens when you accept the agreement. You'll probably see that the browser sends some request (possibly a POST) and accepts some cookies. Then you'll have to repeat whatever the browser does in your Ruby script.