Detecting a "unique" anonymous user - unique

It is impossible to identify a user or request as unique since duping is trivial.
However, there are a handful of methods that, combined, can hamper cheating attempts and give a user quasi-unique status.
I know of the following:
IP Address - store the IP address of each visitor in a database of some sort
Can be faked
Multiple computers/users can have the same address
Users with dynamic IP addresses (some ISPs issue them)
Cookie tracking - store a cookie per visitor. Visitors that don't have it are considered "unique" (a rough sketch follows after this list)
Can be faked
Cookies can be blocked or cleared via browser
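For reference, a minimal sketch of the cookie-based approach in Node/Express (the visitor_id cookie name and the recordVisit call are placeholders, not anything from the question itself):

// Hypothetical sketch: tag each visitor with a random ID cookie.
// Anyone clearing or blocking cookies shows up as a new "unique" visitor.
import express from "express";
import cookieParser from "cookie-parser";
import { randomUUID } from "node:crypto";

const app = express();
app.use(cookieParser());

app.use((req, res, next) => {
  let visitorId = req.cookies["visitor_id"];
  if (!visitorId) {
    visitorId = randomUUID();
    res.cookie("visitor_id", visitorId, {
      maxAge: 365 * 24 * 60 * 60 * 1000, // one year
      httpOnly: true,
      sameSite: "lax",
    });
  }
  // recordVisit(visitorId, req.ip); // placeholder for whatever store you use
  next();
});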
Are there more ways to track non-authorized (non-login, non-authentication) website visitors?

There are actually many ways you can detect a "unique" user. Many of these methods are used by our marketing friends. It gets even easier when you have plugins enabled such as Java, Flash, etc.
Currently my favorite presentation of cookie-based tracking is evercookie (http://samy.pl/evercookie/). It creates a "permanent" cookie via multiple storage mechanisms that the average user is not able to flush (a simplified sketch follows after the list). Specifically it uses:
Standard HTTP Cookies
Local Shared Objects (Flash Cookies)
Silverlight Isolated Storage
Storing cookies in RGB values of auto-generated, force-cached PNGs, using the HTML5 Canvas tag to read the pixels (cookies) back out
Storing cookies in Web History
Storing cookies in HTTP ETags
Storing cookies in Web cache
window.name caching
Internet Explorer userData storage
HTML5 Session Storage
HTML5 Local Storage
HTML5 Global Storage
HTML5 Database Storage via SQLite
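A heavily simplified sketch of the redundancy idea behind evercookie, using only a cookie, Web Storage and window.name (the uid key name is made up; the real library uses far more of the vectors listed above):

// Write the same ID to several storage mechanisms and restore it from
// whichever one survives a partial clear.
const KEY = "uid"; // placeholder key name

function readId(): string | null {
  const fromCookie = document.cookie
    .split("; ")
    .find((c) => c.startsWith(KEY + "="))
    ?.split("=")[1];
  return (
    fromCookie ??
    window.localStorage.getItem(KEY) ??
    window.sessionStorage.getItem(KEY) ??
    (window.name || null)
  );
}

function writeId(id: string): void {
  document.cookie = `${KEY}=${id}; max-age=31536000; path=/`;
  window.localStorage.setItem(KEY, id);
  window.sessionStorage.setItem(KEY, id);
  window.name = id; // survives cookie/storage clears within the same tab
}

const id = readId() ?? crypto.randomUUID();
writeId(id); // re-seed every mechanism so a partial clear doesn't lose the ID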
I can't remember the URL, but there is also a site which tells you how "anonymous" you are based on everything it can gather from your web browser: what plugins you have loaded, what version, what language, screen size, ... Then you can leverage the plugins I was talking about earlier (Flash, Java, ...) to find out even more about the user. I'll edit this post when I find the page which showed you "how unique you are", or maybe somebody else knows it. Actually, it looks as if every user is in a way unique!
--EDIT--
Found the page I was talking about: Panopticlick - "How Unique and trackable is your browser".
It collects stuff like User Agent, HTTP_ACCEPT headers, Browser Plugins, Time Zone, Screen Size and Depth, System Fonts (via Java?), Cookies...
My result: Your browser fingerprint appears to be unique among the 1,221,154 tested so far.

Panopticlick has a quite refined method for checking for unique users using fingerprinting. Apart from IP address and user agent, it uses things like time zone, screen resolution, fonts installed on the system, plugins installed in the browser, etc., so it comes up with a very distinct ID for each and every user without storing anything on their computers. False negatives (finding two different users with the exact same fingerprint) are very rare.
A problem with that approach is that it can yield some false positives, i.e. it considers the same user to be a new one if they've installed a new font, for example. Whether this is OK or not depends on your application, I suppose.
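To make the idea concrete, here is a minimal browser-side sketch of that kind of fingerprint. The choice of attributes is illustrative only; Panopticlick uses many more signals (fonts, plugins, canvas rendering, ...):

// Combine a handful of browser/system attributes and hash them into one
// identifier. Nothing is stored on the client; the same attributes simply
// hash to the same value on the next visit.
async function fingerprint(): Promise<string> {
  const parts = [
    navigator.userAgent,
    navigator.language,
    Intl.DateTimeFormat().resolvedOptions().timeZone,
    `${screen.width}x${screen.height}x${screen.colorDepth}`,
    String(navigator.hardwareConcurrency),
  ];
  const data = new TextEncoder().encode(parts.join("||"));
  const digest = await crypto.subtle.digest("SHA-256", data);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

fingerprint().then((id) => console.log("fingerprint:", id));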

Yes, it's impossible to tell anonymous visitors apart with 100% certainty. The best that you can do is to gather the information that you have, and try to tell as many visitors apart as you can.
There is one more piece of information that you can use:
Browser string
It's not unique, but in combination with the other information it increases the resolution.
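For example, on the server you could fold the IP address and the browser (User-Agent) string into one quasi-identifier; a rough Express sketch (the header choice and hash are just an illustration):

import express from "express";
import { createHash } from "node:crypto";

const app = express();

app.use((req, _res, next) => {
  // Hash IP + User-Agent into one key; counting distinct keys estimates
  // "unique" visitors. Neither value is reliable on its own.
  const ua = req.headers["user-agent"] ?? "";
  const visitorKey = createHash("sha256")
    .update(`${req.ip}|${ua}`)
    .digest("hex");
  console.log("visitor key:", visitorKey);
  next();
});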
If you need to tell the visitors apart with 100% certainty, then you need to make them log in.

There is no sure-fire way to achieve this, in my view. Of your options, cookies are the most likely to yield a reasonably realistic number. NATing and proxy servers can mask the IP addresses of a large number of users, and dynamic IP address allocation will confuse the results for a lot of others.
Have you considered using e.g. Google Analytics or similar? They do unique visitor tracking as part of their service, and they probably have a lot more money to throw at finding heuristic solutions to this problem than you or I. Just a thought!

Related

How to handle new third party cookie rules starting in Chrome 80?

Beginning with Chrome 80, third-party cookies will be blocked unless they have SameSite=None and Secure as attributes, with None being a new value for that attribute. https://blog.chromium.org/2019/10/developers-get-ready-for-new.html
The above blog post states that Firefox and Edge plan on implementing these changes at an undetermined date. There is also a list of incompatible browsers here: https://www.chromium.org/updates/same-site/incompatible-clients.
What would be the best practice for handling this situation for cross-browser compatibility?
An initial thought is to use local storage instead of a cookie but there is a concern that a similar change could happen with local storage in the future.
You hit on a good point that as browsers move towards stronger methods of preserving user privacy, this changes how sites need to consider handling data. There's definitely a tension between the composable / embeddable nature of the web and the privacy / security concerns of that mixed content. I think this is currently coming to the foreground in the conflict between locking down fingerprinting vectors to prevent user tracking and the fact that those are often the same signals used by sites to detect fraud. It's the age-old problem that if you have perfect privacy for "good" reasons, then all the people doing "bad" things (like cycling through a batch of stolen credit cards) also have perfect privacy.
Anyway, outside the ethical dilemmas of all of this I would suggest finding ways to encourage users to have an intentional, first-party relationship with your site / service when you need to track some kind of state related to them. It feels like a generally safe assumption to code as if all storage will be partitioned in the long run and that any form of tracking should be via informed consent. If that's not the direction things go, then I still think you will have created a better experience.
In the short term, there are some options at https://web.dev/samesite-cookie-recipes:
Use two sets of cookies, one with current-format headers and one without, to catch all browsers (a sketch follows after this list).
Sniff the useragent to return appropriate headers to the browser.
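A rough Express sketch of the first recipe (the cookie names here are placeholders; modern browsers read the SameSite=None cookie, while older incompatible clients fall back to the legacy twin):

import express from "express";
import cookieParser from "cookie-parser";

const app = express();
app.use(cookieParser());

app.get("/set-session", (_req, res) => {
  const value = "session-token"; // placeholder value
  // Modern cookie: explicit cross-site opt-in.
  res.cookie("3pcookie", value, { sameSite: "none", secure: true });
  // Legacy twin: no SameSite attribute, for clients that reject SameSite=None.
  res.cookie("3pcookie-legacy", value, { secure: true });
  res.send("ok");
});

app.get("/read-session", (req, res) => {
  // Prefer the modern cookie, fall back to the legacy one.
  const value = req.cookies["3pcookie"] ?? req.cookies["3pcookie-legacy"];
  res.send(value ?? "no cookie");
});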
You can also maintain a first-party cookie, e.g. SameSite=Lax or SameSite=Strict that you use to refresh the cross-site cookies when the user visits your site in a top-level context. For example, if you provide an embeddable widget that gives personalised content, in the event there are no cookies you can display a message in the widget that links the user to the original site to sign in. That way you're explicitly communicating the value to your user of allowing them to be identified across this site boundary.
For a longer-term view, you can look at proposals like HTTP State Tokens which outlines a single, client-controlled token with an explicit cross-site opt-in. There's also the isLoggedIn proposal which is concerned with providing a way of indicating to the browser that a specific token is used to track the user's session.

Do any common email clients pre-fetch links rather than images?

Although I know a lot of email clients will pre-fetch or otherwise cache images, I am unaware of any that pre-fetch regular links like <a href="...">some link</a>.
Is this a practice done by some emails? If it is, is there a sort of no-follow type of rel attribute that can be added to the link to help prevent this?
As of Feb 2017 Outlook (https://outlook.live.com/) scans emails arriving in your inbox and it sends all found URLs to Bing, to be indexed by Bing crawler.
This effectively makes all one-time-use links like login/pass-reset/etc. useless.
(Users of my service were complaining that one-time login links don't work for some of them and it appeared that BingPreview/1.0b is hitting the URL before the user even opens the inbox)
Drupal seems to be experiencing the same problem: https://www.drupal.org/node/2828034
Although I know a lot of email clients will pre-fetch or otherwise cache images.
That is not even a given anymore.
Many email clients – be they web-based, or standalone applications – have privacy controls that prevent images from being automatically loaded, to prevent tracking of who read a (specific) email.
On the other hand, there are clients like, for example, Gmail's web interface, which tries to establish the standard of downloading all referenced external images, presumably to mitigate/invalidate such attempts at user tracking – if a large majority of Gmail users have those images downloaded automatically, whether they actually opened the email or not, the data that can be gained for analytical purposes becomes watered down.
I am unaware of any that pre-fetch regular links like some link
Let’s stay with Gmail for example purposes, but others will behave similarly: since Google is always interested in “what’s out there on the web”, it is highly likely that their crawlers will follow that link to see what it contains/leads to – for their own indexing purposes.
If it is, is there a sort of no-follow type of rel attribute that can be added to the link to help prevent this?
rel=nofollow concerns ranking rather than crawling, and a noindex (either in robots.txt or via a meta element/rel attribute) also won’t keep nosy bots from at least requesting the URL.
Plus, other clients involved – such as a firewall/anti-virus/anti-malware – might also request it for analytical purposes without any user actively triggering it.
If you want to be (relatively) sure that an action is triggered only by a (specific) human user, then use URLs in emails or other kinds of messages over the internet only to lead them to a website where they confirm the action to be taken via a form, using method=POST; whether some kind of authentication or CSRF protection is also needed goes a little beyond the context of this question.
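A minimal Express sketch of that confirm-via-POST pattern (the route names and the confirmAction call are placeholders, and the token should be validated/escaped in real code):

import express from "express";

const app = express();
app.use(express.urlencoded({ extended: false }));

// GET: safe for crawlers and pre-fetchers to hit, it changes nothing.
app.get("/confirm", (req, res) => {
  const token = String(req.query.token ?? "");
  res.send(`
    <form method="POST" action="/confirm">
      <input type="hidden" name="token" value="${token}">
      <button type="submit">Yes, complete this action</button>
    </form>`);
});

// POST: only triggered by a human actually submitting the form.
app.post("/confirm", (req, res) => {
  // confirmAction(req.body.token); // placeholder: validate the token, apply the change
  res.send("Action confirmed.");
});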
Common email clients do not have crawlers to search or pre-build <a> tag related documents, if that is what you're asking, as trying to pre-build and cache a web location could be an immense task if the page is dynamic or large enough.
Images are stored locally to reduce the load time of the email, which is a convenience factor and a network load reduction, but when you open an email hyperlink it will load in your web browser rather than the email client.
I just ran a test using analytics to report any server traffic, and an email containing just
linktomysite
did not throw any resulting crawls to the site from Outlook 07, Outlook 10, Thunderbird, or Apple Mail (Yosemite). You could try using a Wireshark capture to check for network traffic from the client to specific outgoing IPs if you're really interested.
You won't find any native email clients that do that, but you could come across some "web accelerators" that, when using a web-based email, could try to pre-fetch links. I've never seen anything to prevent it.
Links (GETs) aren't supposed to "do" anything; only a POST should. For example, the "unsubscribe me" link in your email should not directly unsubscribe the subscriber. It should GET a page from which the subscriber can then POST.
W3Schools does a good job of explaining how you should expect a GET to work (caching, etc.):
http://www.w3schools.com/tags/ref_httpmethods.asp

a/b testing a major html/css redesign

At my company we are redesigning our e-commerce website. The HTML and CSS are being rewritten from the ground up to make the website responsive / mobile friendly.
Since it concerns one of our biggest websites, responsible for generating over 80% of our revenue, it is very important that nothing goes "wrong".
Our application is running on a LAMP stack.
What are the best practices for testing a major redesign?
Some issues I am thinking of:
When A/B testing a whole design (if possible), I guess you definitely don't want Google to come by and index your new design (since it's still in the test phase). How to handle this?
Should you redirect a percentage of the users to a new URL (or perhaps a subdomain)? Or is it better to serve the new content from the existing indexed URLs based on session?
How to compare statistics from a Google Analytics point of view?
How to hint Google about a new design? Should I e.g. create a new UA code?
A solution might be to set a cookie only for customers who enter the website via the homepage. Doing so, you exclude AdWords traffic and returning visitors who might be expecting the other web design; serve them the original website and leave their experience untouched.
Start the test with homepage traffic only: set a cookie and redirect a percentage to a subdomain. Measure conversion rate with a dimension in Google Analytics, within the same analytics account. Set a 'disallow' rule for the subdomain in your robots.txt to exclude it from crawling by search engines.
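A rough sketch of that cookie-based split, in Node/Express for illustration (the cookie name, percentage, and subdomain are made-up values; the same logic can be written for a LAMP stack):

import express from "express";
import cookieParser from "cookie-parser";

const app = express();
app.use(cookieParser());

app.get("/", (req, res, next) => {
  let bucket = req.cookies["ab_bucket"];
  if (!bucket) {
    // Only assign buckets on the homepage, so ad and deep-link traffic stays untouched.
    bucket = Math.random() < 0.1 ? "redesign" : "control"; // 10% to the redesign
    res.cookie("ab_bucket", bucket, { maxAge: 30 * 24 * 60 * 60 * 1000 });
  }
  if (bucket === "redesign") {
    return res.redirect("https://beta.example.com/"); // subdomain disallowed in robots.txt
  }
  next(); // fall through to the existing design
});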
Marc, you’re mixing a few different concerns here:
Instrumentation. If your changes can be expressed via HTML/CSS/JavaScript only, i.e. are optimizational in nature, you may be able to instrument using tools like VWO or Optimizely. If there are server-side changes too, then a tool like SiteSpect (any server stack) or Variant (Java only) might be in order. The advantage of using a commercial product is that they provide a number of important features out of the box, e.g. collecting experiment data, experience stability (a returning user sees the same experience), etc. You may be able to instrument on your own, but unless you’re looking at a handful of pages, that typically is hard, particularly if you want to do it outside of the app, via DevOps mechanisms.
SEO. If you get your instrumentation right, this shouldn’t be an issue. Public URIs should not differ for the control and variant of the same resource.
Traffic routing. Another reason to consider a commercial tool. They factor that out of your app and let you set percentages. Some tools, like Variant, will allow you to write custom targeters, e.g. “value” users always see control.

How to implement a single sign-on authentication server?

I want to implement a discrete remote authentication server that handles login for many sites. Somewhat similar to OpenID.
Basically, I have site-1 and site-2 and they're both reliant on the same user database, which is on a separate auth-site. So, auth-site handles user authentication for them, and during this process, makes information on the authenticating user available to the requesting system.
Each site can be on a completely separate domain name, on completely separate machines.
This is all via HTTP(S), there can be no direct database access.
There's one last quirk: once a user has logged in to site-1, when accessing any other site reliant on auth-site, that site must treat the user as already authenticated.
This whole business must be entirely fuss-free to the end-user. It should work like a simple everyday login form.
As a concrete example, say we're talking about stackoverflow.com and serverfault.com, and they both authenticate via authentic-overflow-server-stack.com. Again, once logged in to either site, I can go to the other and do my business without logging in again.
What I'd like to know are the general interaction mechanism between the sites behind this scenario.
In my particular setup, I'm using Rails, but I'm not looking for code[1], just general best practice and guidance, so feel free to answer in pseudo-code or any generally readable language. OTOH, bear in mind that I'll have decent MVC, REST, and meta-programming in my toolkit.
[1]: unless you happen to know an existing tiny neat free MIT/BSD-licensed app/plugin/generator that handles this.
It sounds like (especially with the emphasis on fuss-free) you want something like what the Wikimedia Foundation is doing. Basically, you log on to en.wikipedia.org, then that server communicates with other servers (e.g. en.wikinews.org) and gets authentication tokens. Finally, those tokens are embedded into images, e.g. http://en.wikinews.org/wiki/Special:AutoLogin?token=xxxxxxxxxxxxxxx , and when your browser visits that URL (img src) it gets an authentication cookie for Wikinews. Of course, the source code is available for your review at http://www.mediawiki.org/wiki/Extension:CentralAuth .
OpenID is also a good choice, but it does require that the user "consciously" visit two domains. An example of one entity with two domains doing this is Canonical. E.g., if you go to https://help.ubuntu.com/community/UserPreferences they will redirect you to Launchpad (https://login.launchpad.net/+openid) for authentication.
Note that Wikipedia is doing this over http, but you can do it all https to ensure the img src tokens aren't intercepted.
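A hypothetical sketch of the receiving side of that img-src auto-login trick, in Express (the route name, token lookup, and cookie name are all made up; the real mechanism is CentralAuth, linked above):

import express from "express";

const app = express();

// Placeholder stubs standing in for calls to the shared auth server:
function lookupToken(token: string): { id: string } | null {
  return token ? { id: "user-for-" + token } : null;
}
function createSessionFor(user: { id: string }): string {
  return "session-for-" + user.id;
}

// site-1 embeds <img src="https://site-2.example/auto-login?token=..."> after login.
app.get("/auto-login", (req, res) => {
  const token = String(req.query.token ?? "");
  const user = lookupToken(token); // should be a one-time, short-lived token
  if (user) {
    res.cookie("session", createSessionFor(user), { httpOnly: true, secure: true });
  }
  res.status(204).end(); // nothing to render; a 1x1 image would also do
});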
Looks like CAS is good enough for me, and has ruby implementations, along with dozens of other lesser languages, e.g. one that rhymes with femoral bone rage.
http://code.google.com/p/rubycas-server/
http://code.google.com/p/rubycas-client/
It sounds like you want to actually use the OpenID protocol itself. There's no reason you can't restrict the authentication provider to only your own server, and do some shortcuts that make the authentication process transparent. Also, the OpenID protocol supports what you describe about logging into one implies logging in to all services.

What are the approaches to restrict the access to a group of machines in a web system?

My bank's website has a security feature that lets me register the machines that are allowed to make banking transactions. If someone steals my password, he won't be able to transfer my money from his computer. Only my personal computers are allowed to make transactions from my account. So...
What are the approaches to restrict the access to a group of machines in a web system?
In other words, how to identify the computer who made the http request in the web server?
Why not use a client certificate stored in the certificate store of an authorized host, or on a cryptographic token such as a smartcard that can be plugged into any desired computer?
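A hypothetical Node/TypeScript sketch of the client-certificate idea: an HTTPS server that only accepts connections presenting a certificate signed by your own CA (file names are placeholders):

import * as https from "node:https";
import type { TLSSocket } from "node:tls";
import { readFileSync } from "node:fs";

const server = https.createServer(
  {
    key: readFileSync("server-key.pem"),
    cert: readFileSync("server-cert.pem"),
    ca: readFileSync("client-ca.pem"), // the CA you used to issue client certificates
    requestCert: true,
    rejectUnauthorized: true, // refuse clients without a valid certificate
  },
  (req, res) => {
    // The client certificate identifies the machine (or the smartcard plugged into it).
    const peer = (req.socket as TLSSocket).getPeerCertificate();
    res.end(`Hello, certificate subject: ${peer.subject?.CN ?? "unknown"}`);
  }
);

server.listen(8443);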
Update: You should take into account that uniquely identifying a computer means obtaining something at a relatively low level, inaccessible to code embedded in an HTML page (JavaScript, as opposed to a signed applet or ActiveX), unless you install something on the desired computer (or execute something signed such as an applet or ActiveX).
One thing that is unique per computer is the MAC address of the Ethernet card, which is almost ubiquitous on every rather modern (and not so modern) computer. However, that may not be secure enough, since many cards allow changing their MAC address.
The Pentium III used to have a unique serial number inside the CPU, which could have fit your use case perfectly. The downside is that no newer CPUs come with such a thing, due to privacy concerns from most users.
You could also combine many elements of the computer, such as CPU id (model, speed, etc.), motherboard model, hard disk space, memory installed and so on. I think Windows XP used to gather this type of information to feed a hash that uniquely identifies a computer for activation purposes.
Update 2: Hard disks also come with serial numbers that can be retrieved by software. Here is an example of how to get one for activation purposes (your case). However, it will still work if somebody takes the HD to another computer. Nonetheless, you can combine it with more unique data from the computer (such as the MAC address, as I said before). I would also add a unique key generated for the user and kept in a database of your own (one that could be retrieved online from a server), along with the rest, to feed a hash function that identifies the system.
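If you do have native code running on the machine (which, as noted above, is what this really requires), here is a hypothetical Node/TypeScript sketch of hashing combined machine attributes plus a server-issued key (the disk serial is a made-up placeholder; reading it needs an OS-specific call):

import { createHash } from "node:crypto";
import * as os from "node:os";

function machineId(perUserKey: string): string {
  // Collect MAC addresses of the non-internal network interfaces (spoofable).
  const macs = Object.values(os.networkInterfaces())
    .flat()
    .filter((n): n is os.NetworkInterfaceInfo => !!n && !n.internal)
    .map((n) => n.mac);

  const parts = [
    ...macs,
    os.cpus()[0]?.model ?? "",     // CPU model
    String(os.totalmem()),         // installed memory
    "DISK-SERIAL-PLACEHOLDER",     // would come from an OS-specific query
    perUserKey,                    // key issued by your server and stored per user
  ];
  return createHash("sha256").update(parts.join("|")).digest("hex");
}

console.log(machineId("key-issued-by-server"));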
Did you actually install something?
Over and above what Mark Brittingham mentions about IP addresses, I suppose some kind of hash code that is known only to your bank's computer and your computer(s) would work, provided you installed something. However, if you don't have a very strong password to begin with, what would stop someone from "registering" their computer to steal money from you?
I would guess your bank was doing it by using a trusted applet - my bank used to have a similar approach (honestly I thought it was a bit of a hassle - now they're using a calculator-like code generator instead). The trusted applet has access to your file system, so it can write some sort of identifier to a file on your system and retrieve this later.
A tutorial on using trusted applets.
I'm thinking about using Gears to store locally a hash-something to flag that the computer is registered.
If you are looking for the IP address of the computer that makes an account-creation request, you can easily pull that from the Request. In ASP.NET, you'd use:
string IPAddress = Request.UserHostAddress;
You could then store that with the account record and only accept logins for that account from that IP address. The problem, of course, is that this will not work for a public site at all. Most people come through an ISP that assigns IP addresses dynamically. Even with an always-on internet connection, the ISP will occasionally drop and re-open the connection, resulting in a change of IP address.
Anyway, is this what you are looking for?
Update: if you are looking to register a specific computer, have you considered using cookies? The drawback, of course, is that someone may clear their cookies and thus "unregister" their computer. The problem is, the web only has so much access to your computer (not much), so there is no fool-proof way to "register" a computer. Even if you install an ActiveX control, they could uninstall or delete it (although this is more persistent than a cookie). In the end, you'll always have to provide the end-user with some method for re-registering. And, if you do that, then you might as well have them log in anyway.