Google Chrome usage of bloom filter

Google Chrome usage of bloom filter - google-chrome

I was reading an wikipedia article on the usage of Bloom Filters . It was mentioned in the article that Bloom filters are used by Google Chrome to detect whether a URL entered is malicious. Because of the presence of false positive
Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against a local Bloom filter and only upon a hit ,a full check of the URL is performed
I am guessing full check means that Google stores a harsh table of the list of malicious URL and the URL is hashed to checked if it is present in the table. If this is the case , isent it better to just have the hash table instead of hash table + bloom filter??
Please enlightened me on this , is my version of full check correct ???

A Bloom filter is a Probabilistic data structure tells us that the element either definitely is not in the set or may be in the set. Bloom filter takes less space (depends on hash functions configured and error rate) compared to the Hashmap. Hashmap can determine whether an element exists or not, whereas bloomfilter can only deterministically check for non existence of an element.
Let us look at the Google Chrome use case. When a user enters a URL it should validate whether the URL is a safe or not. To validate the URL, chrome can make a call to the google serve (internally google can maintain any data structure to find this out). However, the challenge with this approach is multiple fold.For every URL request served on chrome the URL validation happens through Google Server which adds additional dependency on google servers, network roundtrip time, and requirement to maintain high availability to validate the URLs for all the URLs fired from chrome browsers across the world.
Since this data does not change very often (may be updated in an hour or so), chrome might have considered bundling all the malicious sites data as a bloom filter and google might sync this data periodically with the clients(malicious sites are less compared to the full blown websites. ). When user opens a url in chrome it checks the bloom filter if the URL does not exist it is safe. If it exists then bloom filter is not sure about it so the traffic goes to the google server for validating (this traffic will be way less compared to routing all the traffic).

A bloom filter for all malicous URL's is small enough to be kept on your computer and even in memory. Because almost all sites you enter are not malicous it would be better if you wouldn't do an extra request for them, that's where the bloom filter comes in.
You might not feel it but for slow internet connections it's very useful.

Not only is the Bloom filter much smaller and faster than a web query, it also protects Google's malicious URL API from what would otherwise be a tremendous workload.

From my understanding, bloom filters can store data efficiently in a limited amount of space. The contract of bloom filter is that it does not return false negative, however based on the vector size of your bloom filter it might return you some false positives.
Just to make sure of the false positives, Google is either using Hashing or sending that urls to their servers to recheck the url there by eliminating the load of sending all the urls to their servers.

Google stores a filter of listed malicous URLs in a Bloom-Filter.
See here:
Chromium Code Reviews Issue 10896048
Alex Yakunin's blog: Nice Bloom filter application

Related

Are google's search results influenced by our data?

I have always wondered that.
For example, If I search for the term "composer" or "what is composer", it shows the php package manager. Why does it show programmer-related results? Obviously, it makes sense that it does that, since the results I get are much more relevant to me.
What if an aspiring composer googles that? What results will they get?
Another example is, if I enter the word "spring" to the search engine, it shows the spring framework, instead of, let's say, the season.
So, my question(s):
Does google actually use the data it collects to show relevant search results? (I am not talking about ads, but search results)
If yes, why doesn't incognito mode work?
How can I avoid google using other parameters, besides the very term I typed in, to affect the search results?

Yes. This is the very core of Google's business model. The same data that influences search results is also applied to ad placement (see their real-time bidding system); when you do searches, it's likely you will see ads about the same subjects fairly soon afterwards.
Incognito mode is a very limited form of anonymisation; it's really not very anonymous at all. If you visit a page in a browser that has some google-controlled element (e.g. Google Analytics, a CDN JS library, or a font), then shortly afterwards perform a google search, there will be very many points in common that allow google to match you as very likely the same person (e.g. your IP, time of day, recent similar requests, user agent string, window size, fonts available) even if it blocks cookies that would identify you explicitly. This form of fingerprinting is quite hard to avoid, though Safari is a lot better at it than Chrome. Tor provides much more robust anonymisation by normalising many fingerprintable elements, as well as hiding your IP.
That's difficult because making use of all this information will indeed lead to generally more relevant search results, so it's in Google's interests to use whatever it can (within technical and mostly legal limits). Tor will disconnect the search results from you, but it may instead provide you with results linked to whoever else might have been using the same Tor exit node as you recently, which might not be pleasant! The same would apply to using VPN services.

How long may I store latitude and longitude retrieved from the Google Maps Geocoding API?

I have read Google's Terms and Conditions here: https://developers.google.com/maps/documentation/geocoding/support#comunity-support, but I am still a little unclear on how long we can store latitude and longitude in our own database.
I thought I found the answer here: Terms and Conditions Google Maps: Can I store lat/lng and address components?, but reading some of the recent comments raised doubts once again.
Specifically, if the sole intent is to use the latitude and longitude retrieved from the API with a Google map, can I store those attributes in my own database indefinitely or only for 30 days?
How do I make contact with someone at Google directly so I have a definitive answer to this question and don't need to go contact a lawyer to interpret the terms and conditions.
Thank you,
Terry

The Terms of Service, section 10.5, clause d, states this:
No caching or storage. You will not pre-fetch, cache, index, or store
any Content to be used outside the Service, except that you may store
limited amounts of Content solely for the purpose of improving the
performance of your Maps API Implementation due to network latency
(and not for the purpose of preventing Google from accurately tracking
usage), and only if such storage:
is temporary (and in no event more
than 30 calendar days);
is secure;
does not manipulate or aggregate any part of the Content or Service;
and does not modify attribution in any way.
This appears to me to specify that the caching must be temporary--you can't actively decide that you're going to cache the data for a max of 30 days. By your own words you want to cache it to prevent API hits, but that is explicitly prohibited by this clause.
If you were caching for a short duration for a specific purpose, such as knowing that a given user will be using the data again in a relatively short period of time, caching would be allowed. Caching just for the sake of caching is not allowed.
You are allowed to cache indefinitely if it's related to a user preference. For example, storing lat/long information is okay if you're saving a user's home coordinates, but only the actual preference data and not any results generated by the API that are related to the personal data.
I am not a lawyer, but this section appears rather clear to me.

REST API - file (ie images) processing - best practices

We are developing server with REST API, which accepts and responses with JSON. The problem is, if you need to upload images from client to server.
Note: and also I am talking about a use-case where the entity (user) can have multiple files (carPhoto, licensePhoto) and also have other properties (name, email...), but when you create new user, you don't send these images, they are added after the registration process.
The solutions I am aware of, but each of them have some flaws
1. Use multipart/form-data instead of JSON
good : POST and PUT requests are as RESTful as possible, they can contain text inputs together with file.
cons : It is not JSON anymore, which is much easier to test, debug etc. compare to multipart/form-data
2. Allow to update separate files
POST request for creating new user does not allow to add images (which is ok in our use-case how I said at beginning), uploading pictures is done by PUT request as multipart/form-data to for example /users/4/carPhoto
good : Everything (except the file uploading itself) remains in JSON, it is easy to test and debug (you can log complete JSON requests without being afraid of their length)
cons : It is not intuitive, you cant POST or PUT all variables of entity at once and also this address /users/4/carPhoto can be considered more as a collection (standard use-case for REST API looks like this /users/4/shipments). Usually you cant (and dont want to) GET/PUT each variable of entity, for example users/4/name . You can get name with GET and change it with PUT at users/4. If there is something after the id, it is usually another collection, like users/4/reviews
3. Use Base64
Send it as JSON but encode files with Base64.
good : Same as first solution, it is as RESTful service as possible.
cons : Once again, testing and debugging is a lot worse (the body can have megabytes of data), there is increase in size and also in processing time in both - client and server
I would really like to use solution no. 2, but it has its cons... Anyone can give me a better insight of "what is best" solution?
My goal is to have RESTful services with as much standards included as possible, while I want to keep it as simple as possible.

OP here (I am answering this question after two years, the post made by Daniel Cerecedo was not bad at a time, but the web services are developing very fast)
After three years of full-time software development (with focus also on software architecture, project management and microservice architecture) I definitely choose the second way (but with one general endpoint) as the best one.
If you have a special endpoint for images, it gives you much more power over handling those images.
We have the same REST API (Node.js) for both - mobile apps (iOS/android) and frontend (using React). This is 2017, therefore you don't want to store images locally, you want to upload them to some cloud storage (Google cloud, s3, cloudinary, ...), therefore you want some general handling over them.
Our typical flow is, that as soon as you select an image, it starts uploading on background (usually POST on /images endpoint), returning you the ID after uploading. This is really user-friendly, because user choose an image and then typically proceed with some other fields (i.e. address, name, ...), therefore when he hits "send" button, the image is usually already uploaded. He does not wait and watching the screen saying "uploading...".
The same goes for getting images. Especially thanks to mobile phones and limited mobile data, you don't want to send original images, you want to send resized images, so they do not take that much bandwidth (and to make your mobile apps faster, you often don't want to resize it at all, you want the image that fits perfectly into your view). For this reason, good apps are using something like cloudinary (or we do have our own image server for resizing).
Also, if the data are not private, then you send back to app/frontend just URL and it downloads it from cloud storage directly, which is huge saving of bandwidth and processing time for your server. In our bigger apps there are a lot of terabytes downloaded every month, you don't want to handle that directly on each of your REST API server, which is focused on CRUD operation. You want to handle that at one place (our Imageserver, which have caching etc.) or let cloud services handle all of it.
small 2023 update: If possible, but CDN in front of the pictures, it usually will save you a lot of money and make the pictures even more available (i.e. no issues when peaks happen).
Cons : The only "cons" which you should think of is "not assigned images". User select images and continue with filling other fields, but then he says "nah" and turn off the app or tab, but meanwhile you successfully uploaded the image. This means you have uploaded an image which is not assigned anywhere.
There are several ways of handling this. The most easiest one is "I don't care", which is a relevant one, if this is not happening very often or you even have desire to store every image user send you (for any reason) and you don't want any deletion.
Another one is easy too - you have CRON and i.e. every week and you delete all unassigned images older than one week.

There are several decisions to make:
The first about resource path:
Model the image as a resource on its own:
Nested in user (/user/:id/image): the relationship between the user and the image is made implicitly
In the root path (/image):
The client is held responsible for establishing the relationship between the image and the user, or;
If a security context is being provided with the POST request used to create an image, the server can implicitly establish a relationship between the authenticated user and the image.
Embed the image as part of the user
The second decision is about how to represent the image resource:
As Base 64 encoded JSON payload
As a multipart payload
This would be my decision track:
I usually favor design over performance unless there is a strong case for it. It makes the system more maintainable and can be more easily understood by integrators.
So my first thought is to go for a Base64 representation of the image resource because it lets you keep everything JSON. If you chose this option you can model the resource path as you like.
If the relationship between user and image is 1 to 1 I'd favor to model the image as an attribute specially if both data sets are updated at the same time. In any other case you can freely choose to model the image either as an attribute, updating the it via PUT or PATCH, or as a separate resource.
If you choose multipart payload I'd feel compelled to model the image as a resource on is own, so that other resources, in our case, the user resource, is not impacted by the decision of using a binary representation for the image.
Then comes the question: Is there any performance impact about choosing base64 vs multipart?. We could think that exchanging data in multipart format should be more efficient. But this article shows how little do both representations differ in terms of size.
My choice Base64:
Consistent design decision
Negligible performance impact
As browsers understand data URIs (base64 encoded images), there is no need to transform these if the client is a browser
I won't cast a vote on whether to have it as an attribute or standalone resource, it depends on your problem domain (which I don't know) and your personal preference.

Your second solution is probably the most correct. You should use the HTTP spec and mimetypes the way they were intended and upload the file via multipart/form-data. As far as handling the relationships, I'd use this process (keeping in mind I know zero about your assumptions or system design):
POST to /users to create the user entity.
POST the image to /images, making sure to return a Location header to where the image can be retrieved per the HTTP spec.
PATCH to /users/carPhoto and assign it the ID of the photo given in the Location header of step 2.

There's no easy solution. Each way has their pros and cons . But the canonical way is using the first option: multipart/form-data. As W3 recommendation guide says
The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
We aren't sending forms,really, but the implicit principle still applies. Using base64 as a binary representation, is incorrect because you're using the incorrect tool for accomplish your goal, in other hand, the second option forces your API clients to do more job in order to consume your API service. You should do the hard work in the server side in order to supply an easy-to-consume API. The first option is not easy to debug, but when you do it, it probably never changes.
Using multipart/form-data you're sticked with the REST/http philosophy. You can view an answer to similar question here.
Another option if mixing the alternatives, you can use multipart/form-data but instead of send every value separate, you can send a value named payload with the json payload inside it. (I tried this approach using ASP.NET WebAPI 2 and works fine).

How to get the id of the logined google user from chrome extension?

Without using oauth2.
Because I don't want to get any user's data or do an authentication, only get the id.
And I want to monitor login/logout (chrome.identity.onSignInChanged does not work).
ps I need ID for storing data on my server (chrome.storage.sync is too small).

You say you dont want to use oauth because you dont need any user data. However the id or email IS user data and there's an oauth scope just for that. Use it, else the other alternatives might break in the future, or wait until chrome identity is out for all.
Another way if you really dont want oauth is to store a random number in chrome sync and use that as your id. If the random is large enough you will avoid collitions in practice. Prepend the random with the current millseconds since 1970 and I bet there will be no collitions.

chrome.identity.onSignInChanged is not working (as you mentioned), because it is currently available on the dev channel only (according to this and other sources online).
So, with a little bit of luck, it will be available on the stable channel soon...
A little hacky, but (as suggested in this answer) you could make an AJAX requests to https://www.google.com/settings/account and parse the content to extract the info about user being logged in and user e-mail.
(That's not very robust, of course, since the e-mail might change, but maybe good enough for a temporary work-around.)

Are cryptographic hashes injective under certain conditions?

sorry for the lengthy post, I have a question about common crpytographic hashing algorithms, such as the SHA family, MD5, etc.
In general, such a hash algorithm cannot be injective, since the actual digest produced has usually a fixed length (e.g., 160 bits under SHA-1, etc.) whereas the space of possible messages to be digested is virtually infinite.
However, if we generate a digest of a message which is at most as long as the digest generated, what are the properties of the commonly used hashing algorithms? Are they likely to be injective on this limited message space? Are there algorithms, which are known to produce collisions even on messages whose bit lengths is shorter than the bit length of the digest produced?
I am actually looking for an algorithm, which has this property, i.e., which, at least in principle, may generate colliding hashes even for short input messages.
Background: We have a browser plug-in, which for every web-site visited, makes a server request asking, whether the web-site belongs to one of our known partners. But of course, we don't want to spy on our users. So, in order to make it hard to generate some kind of surf history, we do not actually send the URL visited, but a hash digest (currently, SHA-1) of some cleaned up version. On the server side, we have a table of hashes of well-known URIs, which is matched against the received hash. We can live with a certain amount of uncertainty here, since we consider not being able to track our users a feature, not a bug.
For obvious reasons, this scheme is pretty fuzzy and admits false positives as well as URIs not matched which should have.
So right now, we are considering changing the fingerprint generation to something, which has more structure, for example, instead of hashing the full (cleaned up) URI, we might instead:
split the host name into components at "." and hash those individually
check the path into components at "." and hash those individually
Join the resulting hashes into a fingerprint value. Example: hashing "www.apple.com/de/shop" using this scheme (and using Adler 32 as hash) might yield "46989670.104268307.41353536/19857610/73204162".
However, as such a fingerprint has a lot of structure (in particular, when compared to a plain SHA-1 digest), we might accidentally make it pretty easy again to compute actual URI visited by a user (for example, by using a pre-computed table of hash values for "common" compont values, such as "www").
So right now, I am looking for a hash/digest algorithm, which has a high rate of collisions (Adler 32 is seriously considered) even on short messages, so that the probability of a given component hash being unique is low. We hope, that the additional structure we impose provides us with enough additional information to as to improve the matching behaviour (i.e., lower the rate of false positives/false negatives).

I do not believe hashes are guaranteed to be injective for messages the same size the digest. If they were, they would be bijective, which would be missing the point of a hash. This suggests that they are not injective for messages smaller than the digest either.
If you want to encourage collisions, i suggest you use any hash function you like, then throw away bits until it collides enough.
For example, throwing away 159 bits of a SHA-1 hash will give you a pretty high collision rate. You might not want to throw that many away.
However, what you are trying to achieve seems inherently dubious. You want to be able to tell that the URL is one of your ones, but not which one it is. That means you want your URLs to collide with each other, but not with URLs that are not yours. A hash function won't do that for you. Rather, because collisions will be random, since there are many more URLs which are not yours than which are (i assume!), any given level of collision will lead to dramatically more confusion over whether a URL is one of yours or not than over which of yours it is.
Instead, how about sending the list of URLs to the plugin at startup, and then having it just send back a single bit indicating if it's visiting a URL in the list? If you don't want to send the URLs explicitly, send hashes (without trying to maximise collisions). If you want to save space, send a Bloom filter.

Since you're willing to accept a rate of false positives (that is, random sites identified as whitelisted when in fact they are not), a Bloom filter might be just the thing.
Each client downloads a Bloom filter containing the whole whitelist. Then the client has no need to otherwise communicate with the server, and there is no risk of spying.
At 2 bytes per URL, the false positive rate would be below 0.1%, and at 4 bytes per URL below 1 in 4 million.
Downloading the whole filter (and perhaps regular updates to it) is a large investment of bandwidth up front. But supposing that it has a million URLs on it (which seems quite a lot to me, given that you can probably apply some rules to canonicalize URLs before lookup), it's a 4MB download. Compare this with a list of a million 32 bit hashes: same size, but the false positive rate would be somewhere around 1 in 4 thousand, so the Bloom filter wins for compactness.
I don't know how the plugin contacts the server, but I doubt that you can get an HTTP transaction done in much under 1kB -- perhaps less with keep-alive connections. If filter updates are less frequent than one per 4k URL visits by a given user (or a smaller number if there are less than a million URLs, or greater than 1 in 4 million false positive probability), this has a chance of using less bandwidth than the current scheme, and of course leaks much less information about the user.
It doesn't work quite as well if you require new URLs to be whitelisted immediately, although I suppose the client could still hit the server at every page request, to check whether the filter has changed and if so download an update patch.
Even if the Bloom filter is too big to download in full (perhaps for cases where the client has no persistent storage and RAM is limited), then I think you could still introduce some information-hiding by having the client compute which bits of the Bloom filter it needs to see, and asking for those from the server. With a combination of caching in the client (the higher the proportion of the filter you have cached, the fewer bits you need to ask for and hence the less you tell the server), asking for a window around the actual bit you care about (so you don't tell the server exactly which bit you need), and the client asking for spurious bits it doesn't really need (hide the information in noise), I expect you could obfuscate what URLs you're visiting. It would take some analysis to prove how much that actually works, though: a spy would be aiming to find a pattern in your requests that's correlated with browsing a particular site.

I'm under the impression that you actually want public key cryptography, where you provide the visitor with a public key used to encode the URL, and you decrypt the URL using the secret key.
There are JavaScript implementations a bit everywhere.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008