Google Keyword Tool Results - json

I want to understand the data I get back after I make a search query for "potato"
in Google's keyword tool.
It's interesting because at the bottom it contains cost per click, names, and suggestions, but the actual values for things like global searches are not definite.
Since it's a bit long, I pasted it here:
http://pastebin.com/UCTEhdB1
Any ideas are welcome.
Do not recommend any APIs

I am assuming that the values from your pastebin are coming from this tool.
I understand that this is not what you asked for, but there does not appear to be any documentation on the data returned by the Keyword Tool. That is to be expected, considering that Google offers the Google AdWords API (unfortunately, a paid service).

Related

Are google's search results influenced by our data?

I have always wondered that.
For example, if I search for the term "composer" or "what is composer", it shows the PHP package manager. Why does it show programmer-related results? Obviously, it makes sense that it does that, since the results I get are much more relevant to me.
What if an aspiring composer googles that? What results will they get?
Another example: if I enter the word "spring" into the search engine, it shows the Spring framework instead of, let's say, the season.
So, my question(s):
Does Google actually use the data it collects to show relevant search results? (I am not talking about ads, but search results.)
If yes, why doesn't incognito mode work?
How can I stop Google from using parameters other than the very term I typed in to affect the search results?
Yes. This is the very core of Google's business model. The same data that influences search results is also applied to ad placement (see their real-time bidding system); when you do searches, it's likely you will see ads about the same subjects fairly soon afterwards.
Incognito mode is a very limited form of anonymisation; it's really not very anonymous at all. If you visit a page that contains some Google-controlled element (e.g. Google Analytics, a CDN JS library, or a font), then shortly afterwards perform a Google search, there will be many points in common that allow Google to match you as very likely the same person (e.g. your IP, time of day, recent similar requests, user agent string, window size, fonts available), even if incognito mode blocks the cookies that would identify you explicitly. This form of fingerprinting is quite hard to avoid, though Safari is a lot better at resisting it than Chrome. Tor provides much more robust anonymisation by normalising many fingerprintable elements, as well as hiding your IP.
That's difficult because making use of all this information will indeed lead to generally more relevant search results, so it's in Google's interests to use whatever it can (within technical and mostly legal limits). Tor will disconnect the search results from you, but it may instead provide you with results linked to whoever else might have been using the same Tor exit node as you recently, which might not be pleasant! The same would apply to using VPN services.

Why does this model fail?

Here is the data set
https://gist.github.com/kirkstrobeck/d8b768867890807f9dc9
When I train on it with the Google Prediction API, the status stays at RUNNING for about 30 minutes, then changes to ERROR: INTERNAL ERROR.
Why does it fail? It seems to be a standard consumable regression model data set.
When attempting to answer this question, I looked at the API you mention as well as its requirements. These requirements concern the file format and how the text in the file is laid out. The first thing I will point out is that the Google Prediction API expects training data that "is uploaded to Google Cloud Storage as a CSV (comma-separated value) file." Your file is a TXT (at least on GitHub), but appears to have the correct structure of a CSV. However, when you look at the standards for this file type, almost everyone has a different way they want it done. Google has very strict requirements on the file format (they also have some good examples here: cloud.google.com/prediction/docs/developer-guide#examples). Long story short, you shouldn't have spaces between your columns. They might cause an error during processing, since the file then matches neither the Wikipedia standards nor Google's requirements.
EDIT: Sorry about the weird link stuff, I don't have enough rep to do more than two yet.
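As a rough illustration of the whitespace point above, here is a minimal sketch that strips the spaces around each column before the file is uploaded. The helper name stripColumnSpaces and the sample rows are made up for this example, and it assumes that no field contains a quoted comma; whether this alone resolves the INTERNAL ERROR is not confirmed.

```javascript
// Hypothetical helper (not part of any Google API): trims whitespace around
// every comma-separated field. Assumes no field contains a quoted comma;
// a real CSV parser would be needed for that case.
function stripColumnSpaces(csvText) {
  return csvText
    .split(/\r?\n/)
    .map((line) => line.split(",").map((field) => field.trim()).join(","))
    .join("\n");
}

// Made-up sample rows with stray spaces between the columns.
const sample = "1.5, 2, 3\n4.0 ,5, 6";
console.log(stripColumnSpaces(sample));
```

Running the cleaned text through a diff against the original is a quick way to confirm that only whitespace changed.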

Scribd API search showing irrelevant answers

When I use the search functionality on the scribd docs API to search for a function, like
http://api.scribd.com/api?method=docs.search&api_key=API_KEY&query=hello+world
It returns irrelevant results, and ones different to the search functionality of the site. This request, for example, returns results about Guitar Hero, World of Warcraft, Virtual Worlds, etc., whereas the site search on https://www.scribd.com/search-documents?query=hello+world gives documents titled "Hello World", as you would expect. Is there a parameter that I can add to the API call that will make it return relevant results?
You may try playing with the simple parameter to see if it makes any difference to your queries. According to the API reference (half of it is inaccessible at the moment) it makes the results the same as for the website:
(optional) This option specifies whether or not to allow advanced search queries (more information). When set to false, the API search behaves the same as the search on Scribd.com. When set to true, the API search allows advanced queries that contain filters such as title:"A Tale of Two Cities". Set to "true" by default.
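For illustration, the parameter could be appended to the request from the question roughly as follows. This is only a sketch: buildSearchUrl is a made-up helper, API_KEY is the placeholder from the question, and the endpoint and parameter names are taken from the question and the quote above rather than verified against the live API.

```javascript
// Hypothetical helper: builds a docs.search request URL with the "simple"
// option included. The endpoint and parameter names come from the question
// and the quoted reference text.
function buildSearchUrl(apiKey, query, simple) {
  const params = new URLSearchParams({
    method: "docs.search",
    api_key: apiKey,
    query: query,
    simple: String(simple),
  });
  return "http://api.scribd.com/api?" + params.toString();
}

console.log(buildSearchUrl("API_KEY", "hello world", false));
```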
I tried your query myself. Setting the simple option to false changes the results a bit, but they are still not adequate. Even their sample queries, run 1:1, give about 90% irrelevant results.
Then I found a similar issue being discussed in the following Google Groups thread back in 2011. At the end, Jared Friedman (the CTO of Scribd) himself admits that the API search and the website search work differently and that fixing this is not among their priorities. In 2014 another developer complained. It seems to me that four years later this is still the case.
I'd suggest contacting Scribd support directly and asking what the current status of the docs.search API is and whether some preliminary approval process is in place (for example, they may do a background check on accounts and only then provide relevant results, otherwise returning just test results for any query), although I doubt it.

Google searches with permanent filters

I'm wondering if there's a way to make google searches where you can set filters you want to be in effect permanently - like a filter profile. So, for instance, every time you would do a search, you could get results that didn't include say, Yahoo Answers, without having to type in -yahoo -answers.
A feature like this would be invaluable because it's very common to perform a search and want to filter out a lot of popular sites that would normally top the rankings. For example, suppose you're searching for a news topic and don't want to read mainstream media articles. You could add the words reuters, cnn, huffington post, daily mail, and so on to your filter profile and never see those sites turn up in any of your searches ever again.
I'm asking because I'm interested in making an extension that would do precisely this, but there's no point if such a feature already exists.
You can create a custom search in minutes. It's called Google CSE (Custom Search Engine).
This is a sample public link that I've created based on your example above: https://www.google.com/cse/publicurl?cx=006201654654568968489:1kv4asuwfvs
In the settings, I can choose to exclude by URL, URL pattern, or even URLs within my search results.
If you need more ways, here's a good and relevant link.
Search filters can be specified as part of the URL (e.g. append site:example.com/section1 to a Google query to only yield results whose locations start with that prefix). So you can make a search plugin that substitutes your query into such a template and install it into your browser.
Search plugins are generally XML files with a standardized schema. OpenSearch is one such standard supported by Chrome.
There are sites that host collections of user-submitted plugins as well as tools to generate your own. An example that I use is the Mycroft project (originally created for Apple Sherlock software that pioneered the concept and later accepted into the Mozilla project when Firefox took on the feature).
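The template substitution described above can be sketched in a few lines. This is an illustration rather than a finished plugin: the excludedSites list and the buildFilteredSearchUrl helper are made up for this example, and it uses Google's -site: operator to exclude whole domains instead of the -yahoo -answers keywords from the question.

```javascript
// Hypothetical filter profile: domains to exclude from every search.
const excludedSites = ["answers.yahoo.com", "dailymail.co.uk"];

// Appends a "-site:" filter for each excluded domain to the user's query
// and substitutes the result into a Google search URL.
function buildFilteredSearchUrl(query, sites) {
  const filters = sites.map((site) => "-site:" + site).join(" ");
  const fullQuery = filters ? query + " " + filters : query;
  return "https://www.google.com/search?q=" + encodeURIComponent(fullQuery);
}

console.log(buildFilteredSearchUrl("potato recipes", excludedSites));
```

An extension or search plugin would apply the same substitution to whatever the user types, so the filter list only has to be maintained in one place.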

How to integrate chrome browser built in tools (specifically "find") when trying to develop an extension?

I am trying to develop a Chrome extension, part of which needs a global find-keyword functionality, just like the built-in "Find" (Ctrl+F) that comes with the browser. (EDIT: It needs to invoke "Find" multiple times and concurrently on the same tab.)
My first thought was to find an API that exposes the "find" functionality of Chrome. However, after going through the list, I don't see what I am looking for. Also, the keywords for my question ("Chrome extension", "Chrome API", "find", "search") are too generic, so I cannot find similar examples or information about such an API even after extensive googling.
To provide a consistent user experience, I would love to offer a similar, if not identical, "Find" tool in my extension. To avoid reinventing the wheel, it would be best if I could somehow invoke the built-in function. Existing extensions mostly use their own JavaScript implementations, with limitations (cannot search inside iframes, no global highlight, etc.). Writing my own will be my last resort.
Does anyone know of such an API (one that invokes the built-in "find" tool of the browser) or of an example similar to my question? If not, please let me know the best way to implement it in JavaScript, as I am new to lexical analysis and parsing.
Many Thanks!!
-Gavin
P.S: This is my first post here, if I haven't given enough information on my question (or you don't think this is a question at all), feel free to let me know!
EDIT2: I am trying to build an improvement extension based on "Find" that can solve this scenario:
In a text-intensive page, I want to locate a region that mentions keywordA and keywordB, but these two keywords are not immediately adjacent to each other and both of them appear many times in the document. In this case I can search neither for "keywordA keywordB" (because they are not next to each other) nor for the individual keywords (too many occurrences).
For example, in an html-based math textbook, you want to locate a chapter that mentions "linear algebra" and "matrix" together the most times.
The built-in search does not support multiple concurrent invocations on the same tab. So even if it becomes accessible via an API some day, it's unlikely to support concurrency, because concurrent searches are not natural for the general use case of a single interactive user, and they involve UI. One improvement to the existing built-in search that I can imagine is support for a query language, which would allow searching for alternatives (e.g. car | auto for OR-ing); this would partly address the "multiple" part of your requirement.
One option is to use a content script for searching text, for example (with jQuery):
// Selects every element whose text contains the phrase (case-sensitive); ancestors such as <body> match too.
var search_i = $('*:contains("text to find")');
This way you can perform and combine as many searches as you want, but you'll need to design a proper (understandable) UI that presents the results of every search without interfering with the other searches.
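To make the keywordA/keywordB scenario from the question concrete, here is a minimal sketch of how several searches could be combined once a content script has collected the text of each section. It is plain JavaScript with no DOM access; countOccurrences, bestSection, and the sample chapters are made up for this example, and scoring a section by the smaller of the two counts is just one possible choice.

```javascript
// Counts non-overlapping, case-insensitive occurrences of a keyword in text.
function countOccurrences(text, keyword) {
  const haystack = text.toLowerCase();
  const needle = keyword.toLowerCase();
  if (needle === "") return 0;
  let count = 0;
  let index = haystack.indexOf(needle);
  while (index !== -1) {
    count += 1;
    index = haystack.indexOf(needle, index + needle.length);
  }
  return count;
}

// Returns the index of the section in which both keywords co-occur the most.
// Each section is scored by the smaller of its two counts; -1 means that no
// section contains both keywords.
function bestSection(sections, keywordA, keywordB) {
  let bestIndex = -1;
  let bestScore = 0;
  sections.forEach((text, i) => {
    const score = Math.min(
      countOccurrences(text, keywordA),
      countOccurrences(text, keywordB)
    );
    if (score > bestScore) {
      bestScore = score;
      bestIndex = i;
    }
  });
  return bestIndex;
}

// Made-up sample data standing in for the text of each chapter element.
const chapters = [
  "Linear algebra studies vector spaces.",
  "A matrix represents a linear map. Linear algebra uses the matrix heavily; linear algebra and matrix theory overlap.",
  "A matrix has rows and columns.",
];
console.log(bestSection(chapters, "linear algebra", "matrix"));
```

In a content script, the sections array would be filled from the text of the page's chapter or paragraph elements, and the returned index would identify the element to scroll to or highlight.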