I'm working on an app that will be used in public transit vehicles as an advertising system. This is part of a university research project that brings high-speed wifi to vehicles. I have a Google Map displayed in the app, but I know there will be times when the connection is down for a few hours at a time. Therefore I want to cache map tiles for a good portion of the city for when the connection goes down.

I know Google's TOS says that caching is prohibited except under the following circumstances: "limited amounts of Content for the purpose of improving the performance of your Maps API Implementation if you do so temporarily, securely, and in a manner that does not permit use of the Content outside of the Service". I feel as if my app falls under the clause of improving the performance of my Maps implementation, because I would be downloading the maps the majority of the time and only going to the cache when I absolutely need to. And I would be refreshing the cache quite often, too.

Do others agree that I would be allowed to do this? I haven't actually done anything yet, so I figured I would get the opinions of others first. Also, does anyone know what "limited amounts of Content" would amount to?
Are you sure that that's what the TOS say? It looks to me as if there are two possibly-relevant sets of terms.
http://maps.google.com/help/terms_maps.html (for ordinary Google Maps, I think) doesn't say anything about caching explicitly. If those are the terms relevant to your usage, you might be OK. (The thing I'd worry about is 2a -- no copying of "the Content or any part thereof". What counts as copying for this purpose?) But I'm not a lawyer, neither are you, and if this is important then you should probably actually ask Google.
https://www.google.com/enterprise/earthmaps/legal/us/maps_purchase_agreement.html (for "Google Maps for Business", I think) goes further than what you say: "Customer may store limited amounts of Content solely to improve the performance of the Customer Implementation due to network latency" (emphasis mine) -- and caching things to make the system work when the network isn't there at all seems to go rather beyond that. But I'm not a lawyer, neither are you, and if this is important then you should probably actually ask Google.
If your caching strategy results in making more requests than you otherwise might, you should be aware of the limits Google imposes on that, too.
In any case, I'm not a lawyer, neither are you, and if this is important then you should probably actually ask Google. ("You" might actually mean "the legal people at your university" or something.)
You might also want to consider OpenStreetMap.
I have always wondered about this.
For example, if I search for the term "composer" or "what is composer", it shows the PHP package manager. Why does it show programmer-related results? Obviously, it makes sense that it does, since the results I get are much more relevant to me.
What if an aspiring composer googles that? What results will they get?
Another example: if I enter the word "spring" into the search engine, it shows the Spring framework instead of, let's say, the season.
So, my question(s):
Does Google actually use the data it collects to show relevant search results? (I am not talking about ads, but search results.)
If yes, why doesn't incognito mode work?
How can I stop Google from using parameters other than the term I typed in to influence the search results?
Yes. This is the very core of Google's business model. The same data that influences search results is also applied to ad placement (see their real-time bidding system); when you do searches, it's likely you will see ads about the same subjects fairly soon afterwards.
Incognito mode is a very limited form of anonymisation; it's really not very anonymous at all. If you visit a page in a browser that has some Google-controlled element (e.g. Google Analytics, a CDN JS library, or a font), then shortly afterwards perform a Google search, there will be many points in common that allow Google to match you as very likely the same person (e.g. your IP, time of day, recent similar requests, user agent string, window size, fonts available), even if it blocks cookies that would identify you explicitly. This form of fingerprinting is quite hard to avoid, though Safari does a lot more to resist it than Chrome. Tor provides much more robust anonymisation by normalising many fingerprintable elements, as well as hiding your IP.
That's difficult because making use of all this information will indeed lead to generally more relevant search results, so it's in Google's interests to use whatever it can (within technical and mostly legal limits). Tor will disconnect the search results from you, but it may instead provide you with results linked to whoever else might have been using the same Tor exit node as you recently, which might not be pleasant! The same would apply to using VPN services.
What is the best practice for not annoying users with flood limits, yet still blocking bots that run automated searches?
What is going on:
I have become more aware of odd search behaviour, and I finally had the time to catch who is behind it. It is 157.55.39.*, also known as Bing. Which is odd, because when $_GET['q'] is detected, noindex is added.
The problem, however, is that they are slowing down the SQL server, as there are just too many requests coming in.
What I have done so far:
I have implemented a search flood limit. But since I did it with a session cookie, checking and calculating from the last search timestamp, Bing obviously ignores cookies and carries on.
The worst-case scenario is to add reCAPTCHA, but I don't want the "Are you human?" tickbox every time you search. It should appear only when a flood is detected. So basically, the real question is how to detect too many requests from a client and trigger some sort of reCAPTCHA to stop them.
EDIT #1:
I handled the situation currently, with:
<?php
# Get the end-client IP (honour forwarding headers, fall back to REMOTE_ADDR)
define('CLIENT_IP', (filter_var(@$_SERVER['HTTP_X_FORWARDED_IP'], FILTER_VALIDATE_IP)
    ? @$_SERVER['HTTP_X_FORWARDED_IP']
    : (filter_var(@$_SERVER['HTTP_X_FORWARDED_FOR'], FILTER_VALIDATE_IP)
        ? @$_SERVER['HTTP_X_FORWARDED_FOR']
        : $_SERVER['REMOTE_ADDR'])));
# Detect Bing by its 157.55.39.* prefix:
if (substr(CLIENT_IP, 0, strrpos(CLIENT_IP, '.')) == '157.55.39') {
    # Tell them "not right now"...
    header('HTTP/1.1 503 Service Temporarily Unavailable');
    # ...and block the request
    die();
}
It works. But it seems like another temp solution to a more systematic problem.
I would like to mention that I still want search engines, including Bing, to index /search.html, just not to actually run searches there. There is no "latest searches" feature or anything like that, so it's a mystery where they are getting the queries from.
EDIT #2 -- How I solved it
If someone else in the future has these problems, I hope this helps.
First of all, it turns out that Bing has the same URL parameter feature that Google has, so I was able to tell Bing to ignore the URL parameter "q".
Based on the accepted answer, I added Disallow rules for the parameter q to robots.txt:
Disallow: /*?q=*
Disallow: /*?*q=*
I also configured the Bing webmaster console not to bother us during peak traffic hours.
Overall, this immediately had a positive effect on server resource usage. I will, however, implement an overall flood limit for identical queries, specifically where $_GET is involved, in case Bing should ever decide to visit an AJAX call (example: ?action=upvote&postid=1). A rough sketch of that idea follows below.
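For anyone wanting a starting point, here is a minimal sketch of the kind of per-query flood limit I have in mind, assuming the APCu extension is available; the key format, the 10-hits / 60-second window and the 429 response are purely illustrative:

<?php
# Minimal per-query flood limit sketch (assumes the APCu extension is enabled).
function flood_limited(string $clientIp, int $maxHits = 10, int $windowSeconds = 60): bool
{
    # Key on client IP + the exact query string, so identical repeated requests are throttled.
    $key = 'flood:' . $clientIp . ':' . md5($_SERVER['QUERY_STRING'] ?? '');

    # apcu_add() only succeeds the first time; apcu_inc() counts subsequent hits.
    if (!apcu_add($key, 1, $windowSeconds)) {
        $hits = apcu_inc($key);
        if ($hits !== false && $hits > $maxHits) {
            return true; # too many identical requests inside the window
        }
    }
    return false;
}

if (flood_limited($_SERVER['REMOTE_ADDR'])) {
    # At this point you could serve a reCAPTCHA instead of a hard block.
    header('HTTP/1.1 429 Too Many Requests');
    die();
}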
Spam is a problem that all website owners struggle to deal with.
And there are a lot of ways to build good protection, ranging from very simple measures to very strong protection mechanisms.
But for you right now I see one simple solution.
Use robots.txt and disallow the Bing spider from crawling your search page.
You can do this very easily.
Your robots.txt file would look like:
User-agent: bingbot
Disallow: /search.html?q=
But this will totally block the search engine spider from crawling your search results.
If you want just to limit such requests, but not totally block them, try this:
User-agent: bingbot
crawl-delay: 10
This will force Bing to crawl your website pages only every 10 seconds.
But with such a delay, it will crawl at most 8,640 pages a day (86,400 seconds in a day divided by 10), which is a very small number of requests per day.
If you are good with this, then you are OK.
But what if you want to control this behaviour manually on the server itself, protecting the search form not only from web crawlers, but also from attackers?
They could send your server over 50,000 requests per hour with ease.
In this case, I would recommend two solutions.
Firstly, put CloudFlare in front of your website, and don't forget to check whether your server's real IP is still discoverable via services like ViewDNS IP History, because many websites behind CloudFlare overlook this (even popular ones).
If your live server IP is visible in the history, then you may want to consider changing it (highly recommended).
Secondly, you could use Memcached to store flood data and detect whether a certain IP is querying too much (e.g. more than 30 queries/min).
If it is, block its ability to search (again via Memcached) for some time; a rough sketch follows below.
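For illustration, here is a minimal sketch of that idea, assuming the PHP memcached extension and a Memcached server on localhost; the 30 queries/minute threshold, the key names and the 10-minute block are made up for the example:

<?php
# Minimal sketch of IP-based flood detection with Memcached.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$ip       = $_SERVER['REMOTE_ADDR'];
$blockKey = 'search_block:' . $ip;
$countKey = 'search_count:' . $ip;

# Already blocked? Refuse the search outright.
if ($memcached->get($blockKey)) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('Please slow down.');
}

# add() creates the counter with a 60-second window; increment() bumps it afterwards.
if (!$memcached->add($countKey, 1, 60)) {
    $count = $memcached->increment($countKey);
    if ($count !== false && $count > 30) {
        # More than 30 queries in a minute: block further searches from this IP for 10 minutes.
        $memcached->set($blockKey, 1, 600);
    }
}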
Of course, this is not the best solution you could use, but it will work and won't cost your server much.
My client wants some of the functionality of Google maps namely:
- geocoding
- generating maps with points based on postal code or long/lat
- optimal trip mapping
Their issues with Google maps
- cannot control outages
- postal codes are sometimes inaccurate or not updated frequently for Canada/UK
- they have no way to correct inaccurate information
They would prefer to host the mapping application themselves, but will require postal code updates.
Can anyone suggest such a product?
thanks
"cannot control outages - postal codes are sometimes inaccurate or not updated frequently for Canada/UK - they have no way to correct inaccurate information"
Outages
Hosting your own mapping is the only way to control this, but you would be very hard-pressed to beat Google Maps' or Bing Maps' uptime over the last 5 years. Take a look at the following:
OpenStreetMap for the road data; this is open data, very good in the UK (I'm not sure about Canada), and you can make your own changes and submit them (or just change the data you have downloaded).
GeoServer, Mapnik or MapServer will read OpenStreetMap data and create the image tiles needed to build your own maps in whatever style you wish. If you don't need all countries and all zoom levels, these products can create all the tiles you will need in advance, but usually tiles have to be created in real time and cached. You need a BIG, fast server to manage the tile crunching.
OpenLayers or Leaflet are open-source JavaScript mapping libraries that will display your tiles for you.
Obviously this is just for road maps; aerial imagery would cost you an absolute fortune.
Post Code Data
Many people do not realize that UK postcode data with latitude and longitude is now completely free and available to download every quarter from the official source (Ordnance Survey): http://www.ordnancesurvey.co.uk/oswebsite/products/code-point-open/index.html.
This is the same data source Google will use, and there is none better, but it will always contain inaccuracies and always be a few months out of date.
Finally
Hopefully that answers the question you asked and gives you information to inform your client. Now for the question you didn't ask: "Is this approach good value to my client?"
I won't presume to know your business or client; however, what I described above is possible, but with one to many months of work involved to get it all working together, and even then it won't have anywhere near the performance or uptime of something like Google or Bing Maps, and it only offers a small subset of their features.
I think you're looking for something like Caliper. It's a very custom, and I would expect expensive, solution. Not suggested.
http://www.caliper.com/GISMappingSoftwareDevelopment.htm
One solution could be to use two different mapping services and compare their results; this way there's a much better chance the data is accurate. You can also fix inaccurate data by creating a layer which acts as a barrier between the API and your user, where data you know is inaccurate is corrected before it's displayed; see the sketch just below. Not sure exactly what you're doing though, so this might not work for you.
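To make the idea concrete, here is a minimal sketch of such a correction layer for postal-code geocoding; the override coordinates and the lookupFromApi() stand-in are hypothetical, and in practice the overrides would live in your own database:

<?php
# Minimal sketch of a "correction layer" between a geocoding API and the user.
# Corrections you maintain yourself: postal code => [lat, lng]
$overrides = [
    'SW1A 1AA' => [51.5014, -0.1419],
    'M5V 2T6'  => [43.6426, -79.3871],
];

# Stand-in for the real mapping API call (hypothetical).
function lookupFromApi(string $postalCode): array
{
    # ... call Google/Bing/your own geocoder here ...
    return [0.0, 0.0];
}

function geocode(string $postalCode, array $overrides): array
{
    $key = strtoupper(trim($postalCode));
    # If we know the API's answer for this code is wrong, return the correction instead.
    return $overrides[$key] ?? lookupFromApi($postalCode);
}

var_dump(geocode('sw1a 1aa', $overrides)); # uses the correction, never hits the API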
Is trip mapping/routing the basic functionality you want to do?
Before rushing into rolling your own, I'd suggest a good think about the consequences of doing so. The first that springs to mind is that whilst the pros are that you can now control your data, the cons are that you now control your data.
So you are going to have to consider where and when you get updates and the processes you are going to have to employ to keep your maps in sync with the rest of the world. There are a lot of headaches involved in these things which is why so many people use externally hosted solutions such as Googles.
Let me clarify here: I have no intention of peeping into TfL's database, and no evil intentions towards it or any related information.
But, of course, millions of users benefit greatly from the way it serves up information.
http://journeyplanner.tfl.gov.uk/
So, if we want to create a site like the TfL Journey Planner, what are the basic things we need to keep in mind?
Which architecture should we use?
Can we create this website using ASP.NET (we should be able to)?
Is TfL integrating its website with Google Maps or any other GPS service?
Edit:
When you enter the ZIP/PIN code or station name, it automatically creates a map from source to destination and also calculates the distance.
My question here is: how do they calculate the distance? Do they use maps or GPS, or did they create their own web service?
To answer the points, in order:
Which architecture should we use?
One you know and understand; there is more than one approach that could be used to build a similar thing.
Can we create this website using ASP.NET (we should be able to)?
You could. Similarly, you could do it as a Java servlet or PHP application. If you were feeling particularly warped, you could probably make something work in pure JavaScript (but your clients might hate you).
Is TfL integrating its website with Google Maps or any other GPS service?
They're more likely using Ordnance Survey data that they've rendered their own maps from (certainly if you pan right out, coverage runs out quite quickly).
From a routing perspective, they're probably using something like Dijkstra's Algorithm, although it's probably very optimised to cope with timetabling.
There are numerous algorithms for routing, which boil down to "relative cost" (where that cost may be distance, time, money, or a combination). Not taking timetables into consideration, you can precalculate the costs between connected nodes (e.g. Liverpool St -> Bank via the Central Line is ~5 mins); this would give a baseline for something like Dijkstra, although you'd still need to factor in the cost of interchanging between modes of transport, waiting for connections to arrive, etc. A minimal sketch of that baseline is below.
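To make that baseline concrete, here is a minimal Dijkstra sketch over a hand-made graph of station-to-station times; the stations and minutes are invented for illustration, and interchange and waiting costs are ignored entirely:

<?php
# Minimal Dijkstra sketch: edge weights are travel times in minutes between connected stations.
$graph = [
    'Liverpool St' => ['Bank' => 5, 'Moorgate' => 2],
    'Bank'         => ['Liverpool St' => 5, 'Waterloo' => 4],
    'Moorgate'     => ['Liverpool St' => 2, 'Bank' => 3],
    'Waterloo'     => ['Bank' => 4],
];

function shortestTimes(array $graph, string $start): array
{
    $dist = array_fill_keys(array_keys($graph), INF);
    $dist[$start] = 0;
    $visited = [];

    while (count($visited) < count($graph)) {
        # Pick the unvisited node with the smallest known cost so far.
        $current = null;
        foreach ($dist as $node => $d) {
            if (!isset($visited[$node]) && ($current === null || $d < $dist[$current])) {
                $current = $node;
            }
        }
        if ($current === null || $dist[$current] === INF) {
            break; # remaining nodes are unreachable
        }
        $visited[$current] = true;

        # Relax the edges out of the current node.
        foreach ($graph[$current] as $neighbour => $cost) {
            if ($dist[$current] + $cost < $dist[$neighbour]) {
                $dist[$neighbour] = $dist[$current] + $cost;
            }
        }
    }
    return $dist;
}

print_r(shortestTimes($graph, 'Liverpool St')); # e.g. Waterloo comes out at 9 minutes via Bank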
You might want to look into routing algorithms in general (there's even info over on OpenStreetMap's wiki) before looking into the complexities introduced with timetabled services.
I'm specifically thinking about the BugMeNot service, which provides user name and password combos to a good number of sites. Now, I realize that pay-for-content sites might be worried about this (and I would suspect that most watch for shared accounts), but how about other sites? Should administrators be on the lookout for these accounts? Should web developers do anything differently to take them into account (and perhaps prevent their use)?
I think it depends on the aim of your site. If usage analytics are all-important, then this is something you'd have to watch out for. If advertising is your only revenue stream, then does it really matter which username someone uses?
Probably the best way to discourage use of BugMeNot accounts is to make it worthwhile to have an actual account. E.g. no one would use that here, since we all want rep and a profile; or, if you're sending out useful emails, people want to receive them.
Ask yourself the question "Why do we require users to register to access my site?" Once you have a business reason for this requirement, you can try to work out what the effect would be of having some part of it bypassed by suspect account information.
Work on the basis that at least 10 to 15 percent of account information will be rubbish - and if people using the site can't see any benefit to them personally for registering, and if the registration process is even remotely tedious or an imposition, then accept that you will be either driving more potential visitors away, or increasing your "crap to useful information" ratio.
Don't make registration mandatory just to read something. I.e. ask people to register when you are providing some functionality for them that 'saves' settings, data, etc. I would imagine a site like Stack Overflow gets fewer fake registrations (reading questions doesn't require an account) than, say, the New York Times, where you need an account to read articles.
If that is not within your control, you may consider removing dormant accounts, i.e. removing accounts after a certain period of inactivity; a rough sketch of that kind of cleanup follows.
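As a rough illustration, here is a minimal cleanup job along those lines, assuming a MySQL users table with last_login_at and is_verified columns; the table/column names, credentials and 12-month cutoff are all assumptions for the example:

<?php
# Minimal sketch of a dormant-account cleanup job (e.g. run daily from cron).
$pdo = new PDO('mysql:host=localhost;dbname=app', 'app_user', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$stmt = $pdo->prepare(
    # Only drop accounts that never verified and have not logged in for 12 months.
    'DELETE FROM users
     WHERE last_login_at < DATE_SUB(NOW(), INTERVAL 12 MONTH)
       AND is_verified = 0'
);
$stmt->execute();

echo $stmt->rowCount() . " dormant accounts removed\n";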
That entirely depends.
Most sites that find themselves listed on bugmenot.com tend to be the ones that require registration in order to access otherwise-free content.
If registration is required in order to interact with the site (i.e. add comments/posts/etc.), then chances are most people would rather create their own account than use one that has been made public.
So before considering whether to do things like automatically check BugMeNot, think about whether there are problems with your business model.
There are a few situations where pay-to-access content sites (I'm thinking things like, ahem, 'adult' sites) end up with a few user accounts being published publicly (usually because someone has brute-forced some account details), and in that case there may be an argument for putting significant effort into it.
From an administrator's viewpoint, absolutely. That registration is required for a reason, even if it's something as simple as user tracking or profile maintenance. Several thousand people using the same login entirely defeats the purpose. IP tracking could help mitigate this problem, but it would definitely be hard to eliminate entirely.
No need to worry about BugMeNot: http://www.bugmenot.com/report.php
With BugMeNot, keep in mind that this service is not actually there to harm sites, but rather to make using them easier. You can request that your site be blocked if it is pay-per-view, community-based (i.e. a forum or wiki), or the account includes sensitive information (like banking). This means that in virtually all situations where you would think BugMeNot is a bad thing, BugMeNot does not want to be used. So maybe things are not as bad as you might think.