I have always wondered that.
For example, if I search for the term "composer" or "what is composer", it shows the PHP package manager. Why does it show programmer-related results? Obviously it makes sense that it does, since those results are much more relevant to me.
What if an aspiring composer googles that? What results will they get?
Another example: if I enter the word "spring" into the search engine, it shows the Spring framework instead of, say, the season.
So, my question(s):
Does Google actually use the data it collects to show relevant search results? (I am not talking about ads, but about search results.)
If yes, why doesn't incognito mode work?
How can I stop Google from using anything besides the exact term I typed in to determine the search results?
Yes. This is the very core of Google's business model. The same data that influences search results is also applied to ad placement (see their real-time bidding system); when you do searches, it's likely you will see ads about the same subjects fairly soon afterwards.
Incognito mode is a very limited form of anonymisation; it's really not very anonymous at all. If you visit a page that contains some Google-controlled element (e.g. Google Analytics, a CDN-hosted JS library, or a web font) and shortly afterwards perform a Google search, there are very many points in common that allow Google to match you as very likely the same person (e.g. your IP address, the time of day, recent similar requests, your user agent string, window size, and available fonts), even though incognito blocks the cookies that would identify you explicitly. This form of fingerprinting is quite hard to avoid, though Safari is a lot better at resisting it than Chrome. Tor provides much more robust anonymisation by normalising many fingerprintable elements, as well as hiding your IP.
That's difficult, because making use of all this information does indeed lead to generally more relevant search results, so it's in Google's interest to use whatever it can (within technical and, mostly, legal limits). Tor will disconnect the search results from you, but it may instead give you results linked to whoever else has recently been using the same Tor exit node as you, which might not be pleasant! The same applies to VPN services.
Related
What is the best practice to not annoy users with flood limits, yet still block bots doing automated searches?
What is going on:
I have been noticing odd search behaviour for a while, and I finally had the time to catch who it is: 157.55.39.*, also known as Bing. Which is odd, because a noindex is added whenever $_GET['q'] is detected.
The problem, however, is that they are slowing down the SQL server, as there are simply too many requests coming in.
What I have done so far:
I have implemented a search flood limit, but since I did it with a session cookie (checking and calculating from the last search timestamp), Bing obviously ignores cookies and carries on.
The worst-case scenario is to add reCAPTCHA, but I don't want the "Are you human?" tickbox to appear every time you search. It should appear only when a flood is detected. So basically, the real question is: how do I detect too many requests from a client, so that I can trigger some sort of reCAPTCHA to stop them?
EDIT #1:
For now, I have handled the situation with:
<?php
# Get the end-client IP (falling back through proxy headers to REMOTE_ADDR)
define('CLIENT_IP',
    filter_var(@$_SERVER['HTTP_X_FORWARDED_IP'], FILTER_VALIDATE_IP)
        ? $_SERVER['HTTP_X_FORWARDED_IP']
        : (filter_var(@$_SERVER['HTTP_X_FORWARDED_FOR'], FILTER_VALIDATE_IP)
            ? $_SERVER['HTTP_X_FORWARDED_FOR']
            : $_SERVER['REMOTE_ADDR']));

# Detect Bing by its IP range:
if (substr(CLIENT_IP, 0, strrpos(CLIENT_IP, '.')) == '157.55.39') {
    # Tell them "not right now":
    header('HTTP/1.1 503 Service Temporarily Unavailable');
    # ...and block the request
    die();
}
It works, but it feels like yet another temporary fix for a more systemic problem.
I would like to mention that I still want search engines, including Bing, to index /search.html, just not to actually run searches there. There is no "latest searches" list or anything like that, so it's a mystery where they are getting the queries from.
EDIT #2 -- How I solved it
If someone else in the future has these problems, I hope this helps.
First of all, it turns out that Bing has the same URL-parameter feature that Google has, so I was able to tell Bing to ignore the URL parameter "q".
Based on the accepted answer, I added Disallow rules for the parameter q to robots.txt:
Disallow: /*?q=*
Disallow: /*?*q=*
I also set the Bing webmaster console to not bother us during peak traffic.
Overall, this immediately had a positive effect on server resource usage. I will, however, still implement an overall flood limit for identical queries, specifically where $_GET is involved, in case Bing should ever decide to hit an AJAX call (example: ?action=upvote&postid=1).
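For completeness, a rough sketch of the kind of per-query flood limit I have in mind, reusing the CLIENT_IP constant from EDIT #1 (this assumes PHP 7+ with the APCu extension; the threshold, the 60-second window and the 429 response are arbitrary choices):
<?php
# Rough sketch: limit identical queries per IP (assumes PHP 7+ and the APCu extension)
$query = strtolower(trim($_GET['q'] ?? ''));
$key   = 'flood:' . CLIENT_IP . ':' . md5($query);

# Open a 60-second counting window for this IP+query pair, then count this request
apcu_add($key, 0, 60);
$hits = apcu_inc($key);

if ($hits !== false && $hits > 10) {
    # Same query more than 10 times a minute from one IP:
    # back off here (or render the reCAPTCHA challenge instead of dying)
    header('HTTP/1.1 429 Too Many Requests');
    header('Retry-After: 60');
    die();
}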
Spam is a problem that all website owners struggle to deal with.
There are a lot of ways to build good protection, from very simple measures to very elaborate and strong protection mechanisms.
But in your case I see one simple solution.
Use robots.txt and disallow the Bing spider from crawling your search page.
You can do this very easily.
Your robots.txt file would look like this:
User-agent: bingbot
Disallow: /search.html?q=
But this will completely block the search engine spider from crawling your search results.
If you just want to limit such requests rather than block them completely, try this:
User-agent: bingbot
Crawl-delay: 10
This tells Bing to crawl your website's pages at most once every 10 seconds.
With such a delay, it will crawl at most 8,640 pages a day (86,400 seconds / 10), which is a very small number of requests per day.
If you are fine with that, then you are all set.
But what if you want to control this behaviour on the server itself, protecting the search form not only from web crawlers but also from hackers?
They could easily send your server over 50,000 requests per hour.
In this case, I would recommend two solutions.
First, put your website behind Cloudflare, and don't forget to check whether your server's real IP is still discoverable via services like ViewDNS IP History, because many websites behind Cloudflare overlook this (even popular ones).
If your current server IP shows up in the history, you should consider changing it (highly recommended).
Second, you could use Memcached to store flood data and detect whether a certain IP is querying too often (e.g. more than 30 queries per minute).
If it is, block it from searching (again via Memcached) for some time.
Of course, this is not the most sophisticated solution you could use, but it will work and it costs your server very little.
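A minimal sketch of that idea (assuming the PHP Memcached extension and a Memcached server on localhost; the thresholds and block duration are arbitrary):
<?php
# Rough sketch of per-IP flood detection with Memcached
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$ip       = $_SERVER['REMOTE_ADDR'];
$countKey = 'searches:' . $ip;
$blockKey = 'blocked:'  . $ip;

# Already blocked? Refuse immediately.
if ($mc->get($blockKey)) {
    header('HTTP/1.1 429 Too Many Requests');
    die();
}

# Start a 60-second counting window for this IP, then count the request
$mc->add($countKey, 0, 60);
$count = $mc->increment($countKey);

if ($count !== false && $count > 30) {
    # More than 30 queries per minute: block this IP for 10 minutes
    $mc->set($blockKey, 1, 600);
    header('HTTP/1.1 429 Too Many Requests');
    die();
}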
When I use the search functionality of the Scribd docs API with a query like
http://api.scribd.com/api?method=docs.search&api_key=API_KEY&query=hello+world
it returns irrelevant results, different from those of the site's own search. This request, for example, returns results about Guitar Hero, World of Warcraft, Virtual Worlds, etc., whereas the site search at https://www.scribd.com/search-documents?query=hello+world gives documents titled "Hello World", as you would expect. Is there a parameter I can add to the API call that will make it return relevant results?
You may try playing with the simple parameter to see if it makes any difference to your queries. According to the API reference (half of it is inaccessible at the moment) it makes the results the same as for the website:
(optional) This option specifies whether or not to allow advanced search queries (more information). When set to false, the API search behaves the same as the search on Scribd.com. When set to true, the API search allows advanced queries that contain filters such as title:"A Tale of Two Cities". Set to "true" by default.
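For example (assuming the parameter is simply appended to the query string), the request above would become:
http://api.scribd.com/api?method=docs.search&api_key=API_KEY&query=hello+world&simple=false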
I tried your query myself with simple set to false; it changes things a bit, but the results are still not adequate. Even running their own sample queries 1:1 gives 90% irrelevant results.
Then I found a similar issue discussed in the following Google Groups thread back in 2011. At the end, Jared Friedman (the CTO of Scribd) himself admits that API search and website search work differently and that fixing this is not among their priorities. In 2014 another developer complained. It seems that, four years later, this is still the case.
I'd suggest contacting Scribd support directly and asking them about the current status of the docs.search API, and whether there is some preliminary approval process in place (for example, they may do a background check on accounts and only then provide relevant results, returning just test results for any query otherwise), although I doubt it.
I'm wondering if there's a way to do Google searches with filters that stay in effect permanently, like a filter profile. So, for instance, every time you did a search, you could get results that didn't include, say, Yahoo Answers, without having to type -yahoo -answers.
A feature like this would be invaluable because it's very common to perform a search and want to filter out a lot of popular sites that would normally top the rankings. For example, suppose you're searching for a news topic and don't want to read mainstream media articles. You could add the words reuters, cnn, huffington post, daily mail, and so on to your filter profile and never see those sites turn up in any of your searches ever again.
I'm asking because I'm interested in making an extension that would do precisely this, but there's no point if such a feature already exists.
You can create a custom search in minutes. It's called Google CSE (Custom Search Engine).
This is a sample public link that I've created based on your example above: https://www.google.com/cse/publicurl?cx=006201654654568968489:1kv4asuwfvs
In the settings:
I can choose to exclude by URL, by URL pattern, or even specific URLs within my search results.
If you need more ways, here's a good and relevant link.
Search filters can be specified as part of the URL (e.g. append site:example.com/section1 to a Google query to only yield results whose locations start with that prefix). So you can make a search plugin that substitutes your query into such a template and install it into your browser.
Search plugins are generally XML files with a standardized schema. OpenSearch is one such standard supported by Chrome.
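As a rough illustration, such an OpenSearch description might look like the following (the ShortName and the example.com/section1 filter are just placeholders taken from the example above):
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Filtered Google</ShortName>
  <Description>Google search restricted to example.com/section1</Description>
  <Url type="text/html"
       template="https://www.google.com/search?q={searchTerms}+site:example.com/section1"/>
</OpenSearchDescription>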
There are sites that host collections of user-submitted plugins as well as tools to generate your own. An example that I use is the Mycroft project (originally created for Apple Sherlock software that pioneered the concept and later accepted into the Mozilla project when Firefox took on the feature).
I am trying to develop a Chrome extension, part of which needs a global keyword-find feature, just like the built-in "Find" (Ctrl+F) that comes with the browser. (EDIT: it needs to invoke "Find" multiple times, concurrently, on the same tab.)
My first thought was to find an API that exposes Chrome's "find" functionality. However, after going through the API list, I don't see what I am looking for. Also, the keywords for my question ("Chrome extension", "Chrome API", "find", "search") are so generic that I cannot find similar examples or information about such an API even after extensive googling.
To provide a consistent user experience, I would love to offer a similar, if not identical, "Find" tool in my extension. To avoid reinventing the wheel, it would be best if I could somehow invoke the built-in function. Existing extensions mostly roll their own implementation in JavaScript, with limitations (they cannot search inside iframes, have no global highlighting, etc.); that will be my last resort.
Does anyone know of such an API (one that invokes the browser's built-in "Find" tool), or of an example similar to my question? If not, please let me know the best way to implement this in JavaScript, as I am new to lexical analysis and parsing.
Many Thanks!!
-Gavin
P.S: This is my first post here, if I haven't given enough information on my question (or you don't think this is a question at all), feel free to let me know!
EDIT 2: I am trying to build an improved "Find"-style extension that can handle this scenario:
On a text-heavy page, I want to locate a region that mentions both keywordA and keywordB, where the two keywords are not immediately adjacent to each other and each appears many times in the document. In this case I can neither search for "keywordA keywordB" (because they are not next to each other) nor for the individual keywords (too many occurrences).
For example, in an HTML-based math textbook, you want to locate the chapter that mentions "linear algebra" and "matrix" together the most times.
The built-in search does not support multiple concurrent invocations on the same tab. So even if it becomes accessible via an API some day, it is unlikely to support concurrency, because concurrent searches are not natural for the general use case of a single interactive user, and they involve the UI. One improvement I can imagine for the built-in search is support for a query language that allows searching for alternatives (e.g. car | auto for OR-ing), which would somewhat address the "multiple" part of your requirement.
Your option is to search the page text from a content script, for example (with jQuery):
// select every element whose text contains the phrase (note: ancestors of a matching element match too)
var search_i = $('*:contains("text to find")');
This way you can perform and combine as many searches as you want, but you'll need to design a proper (understandable) UI that presents the results of each search without interfering with the others.
I'm working on an app that will be used in public transit vehicles as an advertising system. This is part of a university research project that brings high-speed WiFi to vehicles. I have a Google Map displayed in the app, but I know there will be times when the connection is down for a few hours at a time, so I want to cache map tiles for a good portion of the city for when the connection goes down.
I know Google's TOS says that caching is not allowed except under the following circumstances: "limited amounts of Content for the purpose of improving the performance of your Maps API Implementation if you do so temporarily, securely, and in a manner that does not permit use of the Content outside of the Service". I feel as if my app falls under the clause of improving the performance of my Maps implementation, because I would be downloading the maps most of the time and going to the cache only when I absolutely need to, and I would be refreshing the cache quite often too.
Do others agree that I would be allowed to do this? I haven't actually done anything yet, so I figured I would get the opinions of others first. Also, does anyone know what "limited amounts of Content" would amount to?
Are you sure that that's what the TOS say? It looks to me as if there are two possibly-relevant sets of terms.
http://maps.google.com/help/terms_maps.html (for ordinary Google Maps, I think) doesn't say anything about caching explicitly. If those are the terms relevant to your usage, you might be OK. (The thing I'd worry about is 2a -- no copying of "the Content or any part thereof". What counts as copying for this purpose?) But I'm not a lawyer, neither are you, and if this is important then you should probably actually ask Google.
https://www.google.com/enterprise/earthmaps/legal/us/maps_purchase_agreement.html (for "Google Maps for Business", I think) goes further than what you say: "Customer may store limited amounts of Content solely to improve the performance of the Customer Implementation due to network latency" (emphasis mine) -- and caching things to make the system work when the network isn't there at all seems to go rather beyond that. But I'm not a lawyer, neither are you, and if this is important then you should probably actually ask Google.
If your caching strategy results in making more requests than you otherwise might, you should be aware of the limits Google impose on that, too.
In any case, I'm not a lawyer, neither are you, and if this is important then you should probably actually ask Google. ("You" might actually mean "the legal people at your university" or something.)
You might also want to consider OpenStreetMap.