Scribd API search showing irrelevant answers - scribd

When I use the search functionality of the Scribd docs API to run a query such as
http://api.scribd.com/api?method=docs.search&api_key=API_KEY&query=hello+world
it returns irrelevant results that differ from those of the site's own search. This request, for example, returns results about Guitar Hero, World of Warcraft, Virtual Worlds, etc., whereas the site search at https://www.scribd.com/search-documents?query=hello+world gives documents titled "Hello World", as you would expect. Is there a parameter that I can add to the API call that will make it return relevant results?

You may try playing with the simple parameter to see if it makes any difference to your queries. According to the API reference (half of it is inaccessible at the moment) it makes the results the same as for the website:
(optional) This option specifies whether or not to allow advanced search queries (more information). When set to false, the API search behaves the same as the search on Scribd.com. When set to true, the API search allows advanced queries that contain filters such as title:"A Tale of Two Cities". Set to "true" by default.
I tried your query myself. Setting the simple option to false changes the results a bit, but they are still not adequate. Even their sample queries, run 1:1, still give about 90% irrelevant results.
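For reference, a minimal sketch of how such a request URL could be assembled with the simple option set explicitly. The endpoint and the method, api_key, query and simple parameter names come from the question and the quoted reference; API_KEY is a placeholder, the helper name is made up for illustration, and the endpoint may no longer be served.

```javascript
// Sketch only: builds the docs.search URL from the question, adding `simple`.
// buildScribdSearchUrl is a hypothetical helper, not part of any Scribd SDK.
function buildScribdSearchUrl(apiKey, query, simple) {
  const params = new URLSearchParams({
    method: 'docs.search',
    api_key: apiKey,
    query: query,
    simple: String(simple), // "false" should mimic the site search, per the quoted docs
  });
  return 'http://api.scribd.com/api?' + params.toString();
}

console.log(buildScribdSearchUrl('API_KEY', 'hello world', false));
```

Comparing the responses for simple=true and simple=false on the same query is a quick way to see whether the option has any effect for your account.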
Then I found a similar issue discussed in a Google Groups thread back in 2011. At the end of it, Jared Friedman (the CTO of Scribd) himself admits that the API search and the website search work differently and that fixing this is not among their priorities. In 2014 another developer complained. Seems to me that four years later this is still the case.
I'd suggest contacting Scribd support directly and asking about the current status of the docs.search API, and whether there is some preliminary approval process in place. For example, they may run a background check on accounts and only then provide relevant results, returning just test results for any query until then, although I doubt it.

Related

Are google's search results influenced by our data?

I have always wondered that.
For example, if I search for the term "composer" or "what is composer", it shows the PHP package manager. Why does it show programmer-related results? Obviously, it makes sense that it does, since the results I get are much more relevant to me.
What if an aspiring composer googles that? What results will they get?
Another example: if I enter the word "spring" into the search engine, it shows the Spring framework instead of, let's say, the season.
So, my question(s):
Does Google actually use the data it collects to show relevant search results? (I am not talking about ads, but search results.)
If yes, why doesn't incognito mode work?
How can I prevent Google from using other parameters, besides the very term I typed in, to affect the search results?
Yes. This is the very core of Google's business model. The same data that influences search results is also applied to ad placement (see their real-time bidding system); when you do searches, it's likely you will see ads about the same subjects fairly soon afterwards.
Incognito mode is a very limited form of anonymisation; it's really not very anonymous at all. If you visit a page in a browser that has some Google-controlled element (e.g. Google Analytics, a CDN JS library, or a font), then shortly afterwards perform a Google search, there will be very many points in common that allow Google to match you as very likely the same person (e.g. your IP, time of day, recent similar requests, user agent string, window size, fonts available), even if the browser blocks cookies that would identify you explicitly. This form of fingerprinting is quite hard to avoid, though Safari is a lot better at resisting it than Chrome. Tor provides much more robust anonymisation by normalising many fingerprintable elements, as well as hiding your IP.
Avoiding this is difficult, because making use of all this information does generally lead to more relevant search results, so it's in Google's interests to use whatever it can (within technical and mostly legal limits). Tor will disconnect the search results from you, but it may instead provide you with results linked to whoever else might have been using the same Tor exit node as you recently, which might not be pleasant! The same would apply to using VPN services.

GoogleBetterAds - violatingSites.list - google-apis-explorer

I can get a list of summaries of violating sites, using the following link:
https://developers.google.com/ad-experience-report/[...]/violatingSites/list
My questions:
Is this list exhaustive?
If not, is it possible to get an exhaustive list (or not) and how?
Is it possible to know how these websites are pulled (the share of websites analysed, etc)?
- Is this list exhaustive?
What is the size of your actual API response?
If the response grows longer and longer, with new data added at each new request, you can assume you have the exhaustive list (with a possible update latency).
If the response always has the same size but with different data, i.e. old entries disappear and are replaced by new ones, it is not exhaustive.
- If not, is it possible to get an exhaustive list (or not) and how?
I have no idea at the moment; the total number of websites could be in the billions.
- Is it possible to know how these websites are pulled (the share of websites analysed, etc)?
I have no idea about that at the moment either. I think the process is either confidential or described in the general conditions and, subtly, in the documentation.

Trying to achieve predictable search results from Google Drive API

Short version:
What is the proper way to list/query files by suffix: "fullText contains 'ext'", "fileExtension = 'ext'" or "title contains 'ext'"? These do not always return the same results; only the first of them is documented, and it's not consistent.
Long version:
I've been developing Google Drive apps for years. Every now and then I have to change my list queries to get the correct results. My application needs to find files with a certain suffix. Official documentation indicates that I need to use the "fullText contains 'ext'" syntax, but sometimes this fails to find some files. At one time I switched to the undocumented "fileExtension = 'ext'" syntax, but again after some time I found files that wouldn't show up and went back to fullText searches. However, again I've seen files not showing up with that search and tried using "title contains 'ext'" (or v3 "name contains 'ext'") which seems to work, but for how long? I don't like using undocumented queries which might just suddenly stop working.
I feel like I'm going in circles since I don't know why fullText fails (and only for some users; and when it does work, I've seen the parents field come up empty sometimes, which doesn't happen with other queries) or why the title search works (it is not documented to search suffixes, and I'm pretty sure it didn't use to work). I might just perform all three searches, but this affects performance, and the "or" keyword with some combinations of those three searches returns no results at all.
My application has thousands of files, each with multiple revisions, in hundreds of folders. Each folder is shared with dozens of users, and those permissions change on a regular basis as people are added to and removed from projects. There are hundreds of different owners of the individual files. I suspect this complexity, and the time it takes to propagate permissions and file changes, affects my queries, but it doesn't explain why one search would work and another wouldn't, or why the information returned on a file in one query would differ from the other. That is, even after several days the problem doesn't correct itself, and often a file must be removed and re-uploaded for everyone to see it. I have experienced the slow updates to metadata for shared files resulting in mismatches between metadata, files, and search results, but I take all of that into account and still have queries which simply won't work properly.
Maybe I'm expecting too much from a free API? Overall I'm very happy with what I can do, but it can be very frustrating when it's not working and you know you're doing it right! :)
You can search or filter files with the 'files.list' or 'children.list' methods of the Drive API. These methods accept the 'q' parameter, which is the search query.
For more information, see: https://developers.google.com/drive/v3/web/search-parameters
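As a hedged sketch of one way to make suffix matching more predictable: use the 'q' parameter only as a coarse server-side filter and do the actual suffix test on the client. The helper names below are made up for illustration, the fields selection is an assumption about what the application needs, and authentication is omitted. This does not explain the inconsistencies described in the question; it only moves the final decision out of the search index.

```javascript
// Escape backslashes and single quotes inside a Drive query string literal.
function escapeQueryValue(value) {
  return value.replace(/\\/g, '\\\\').replace(/'/g, "\\'");
}

// Coarse server-side filter (v3 syntax from the question: name contains 'ext').
function buildNameQuery(ext) {
  return "name contains '" + escapeQueryValue(ext) + "' and trashed = false";
}

// Authoritative client-side test: keep only names ending in ".ext".
function filterBySuffix(files, ext) {
  const suffix = '.' + ext.toLowerCase();
  return files.filter((f) => f.name.toLowerCase().endsWith(suffix));
}

// files.list URL for the v3 endpoint; the caller still has to add credentials.
function buildListUrl(ext) {
  const params = new URLSearchParams({
    q: buildNameQuery(ext),
    fields: 'nextPageToken, files(id, name, parents)',
  });
  return 'https://www.googleapis.com/drive/v3/files?' + params.toString();
}

console.log(buildListUrl('ext'));
console.log(filterBySuffix([{ name: 'a.ext' }, { name: 'ext.txt' }], 'ext'));
```

If the server-side filter itself misses files, a slower fallback is to drop the name condition, page through the full listing, and rely on filterBySuffix alone.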

Google searches with permanent filters

I'm wondering if there's a way to make google searches where you can set filters you want to be in effect permanently - like a filter profile. So, for instance, every time you would do a search, you could get results that didn't include say, Yahoo Answers, without having to type in -yahoo -answers.
A feature like this would be invaluable because it's very common to perform a search and want to filter out a lot of popular sites that would normally top the rankings. For example, suppose you're searching for a news topic and don't want to read mainstream media articles. You could add the words reuters, cnn, huffington post, daily mail, and so on to your filter profile and never see those sites turn up in any of your searches ever again.
I'm asking because I'm interested in making an extension that would do precisely this, but there's no point if such a feature already exists.
You can create a Custom Search in minutes. It's called Google CSE (Custom Search Engine).
This is a sample public link that I've created based on your example above: https://www.google.com/cse/publicurl?cx=006201654654568968489:1kv4asuwfvs
In the settings:
I can choose to exclude by url, url pattern, or even urls within my search results
If you need more ways, here's a good and relevant link.
Search filters can be specified as part of the URL (e.g. append site:example.com/section1 to a Google query to only yield results whose locations start with that prefix). So you can make a search plugin that substitutes your query into such a template and install it into your browser.
Search plugins are generally XML files with a standardized schema. OpenSearch is one such standard supported by Chrome.
There are sites that host collections of user-submitted plugins as well as tools to generate your own. An example that I use is the Mycroft project (originally created for Apple Sherlock software that pioneered the concept and later accepted into the Mozilla project when Firefox took on the feature).
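The template substitution described above can be sketched as a small function that appends a fixed list of exclusions to every query. The excluded hosts and the function name are hypothetical examples; the sketch assumes the -site: operator (the minus prefix combined with site:) is the filter you want.

```javascript
// Hypothetical "filter profile": hosts to exclude from every search.
const excludedSites = ['answers.yahoo.com', 'cnn.com'];

// Append one -site: term per excluded host and build the search URL.
function buildFilteredSearchUrl(query, sites) {
  const exclusions = sites.map((s) => '-site:' + s);
  const fullQuery = [query, ...exclusions].join(' ');
  return 'https://www.google.com/search?q=' + encodeURIComponent(fullQuery);
}

console.log(buildFilteredSearchUrl('news topic', excludedSites));
```

An extension or search plugin would perform the same substitution, with the list of excluded sites stored in its settings.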

How to integrate chrome browser built in tools (specifically "find") when trying to develop an extension?

I am trying to develop a Chrome extension, part of which needs global find-keyword functionality, just like the built-in "Find" (Ctrl+F) that comes with the browser. (EDIT: It needs to invoke "Find" multiple times and concurrently on the same tab.)
My first thought was to find an API that exposes the "find" functionality of Chrome. However, after going through the list, I don't see what I am looking for. Also, the keywords for my question ("Chrome extension", "Chrome API", "find", "search") are too generic; I cannot find similar examples or information about such an API even after extensive googling.
In order to provide a consistent user experience, I would love to provide a similar, if not identical, "Find" tool in my extension. To avoid reinventing the wheel, it would be best if I could somehow invoke the built-in function. Existing extensions mostly use their own JavaScript implementations, with limitations (cannot search inside iframes, do not have global highlight, etc.). This will be my last resort.
Does anyone know of such an API (one that invokes the built-in "find" tool of the browser) or a similar example? If not, please let me know the best way to implement it in JavaScript, as I am new to lexical analysis and parsing.
Many Thanks!!
-Gavin
P.S: This is my first post here, if I haven't given enough information on my question (or you don't think this is a question at all), feel free to let me know!
EDIT2: I am trying to build an improvement extension based on "Find" that can solve this scenario:
In a text-intensive page, I want to locate a region that mentions keywordA and keywordB, where the two keywords are not immediately adjacent to each other and both appear many times in the document. In this case I can search neither for "keywordA keywordB" (because they are not next to each other) nor for the individual keywords (too many occurrences).
For example, in an html-based math textbook, you want to locate a chapter that mentions "linear algebra" and "matrix" together the most times.
The built-in search does not support multiple concurrent invocations on the same tab. So even if it becomes accessible via an API some day, it's unlikely to support concurrency, because concurrent searches are not natural for the general use case of a single interactive user, and they involve UI. One improvement to the existing built-in search that I can imagine is support for a query language that allows searching for alternatives (e.g. car | auto for OR-ing), which would somewhat address the "multiple" part of your requirement.
Your option is to use a content script for searching text, for example (with jQuery):
// Note: :contains also matches every ancestor of the matching text (html, body, ...).
var search_i = $('*:contains("text to find")');
This way you can perform and combine as many searches as you want, but you'll need to design a proper (understandable) UI that presents the results of every search without interfering with the other searches.
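For the scenario in EDIT2 (finding the region that mentions several keywords together most often), a content script could first split the page into sections and then rank them. Below is a minimal sketch of the ranking step only. The section objects, the function names and the scoring rule (the count of the least frequent keyword) are assumptions made for illustration, not an established algorithm.

```javascript
// Count non-overlapping, case-insensitive occurrences of keyword in text.
function countOccurrences(text, keyword) {
  const haystack = text.toLowerCase();
  const needle = keyword.toLowerCase();
  if (needle.length === 0) return 0;
  let count = 0;
  let pos = haystack.indexOf(needle);
  while (pos !== -1) {
    count += 1;
    pos = haystack.indexOf(needle, pos + needle.length);
  }
  return count;
}

// Score each section by its least frequent keyword, so a section must mention
// every keyword to score above zero; return the matches, best first.
function rankSections(sections, keywords) {
  return sections
    .map((s) => ({
      title: s.title,
      score: Math.min(...keywords.map((k) => countOccurrences(s.text, k))),
    }))
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score);
}

const ranked = rankSections(
  [
    { title: 'Chapter 1', text: 'Linear algebra uses a matrix. A matrix in linear algebra.' },
    { title: 'Chapter 2', text: 'A matrix only.' },
  ],
  ['linear algebra', 'matrix']
);
console.log(ranked);
```

In a content script, the sections array could be filled from the DOM, for example one entry per chapter element with its textContent; how a page marks up its chapters is page-specific, so that part is left out here.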