Trying to search WikiNews - json

I'm trying to search WikiNews, both for specific news stories and for the latest headlines. I've been reading about the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page), but it doesn't seem to map to what I'm trying to do.
Taking two examples: I need to be able to get the latest headlines (ideally for a specific region (United States, France, Great Britain, etc.) and for a specific topic (Finance, Sport, Media, etc.)), but right now I'd settle for just getting the latest stories regardless. I've tried a couple of things:
https://en.wikinews.org/w/api.php?action=query&prop=categories&clprop=timestamp&format=json
just returns batchcomplete
http://en.wikinews.org/w/api.php?action=query&list=recentchanges&rnnamespace=0 looks like it might be more promising, but only if I could filter it to show only news stories - and show a good deal more than it currently does. Clearly it would also be desirable to add parameters for location / story type in the query rather than filtering them after the list is received.
With regard to searching, I've had even less luck. I've tried searching on a topic that I know is ~~causing trouble~~ making the news:
https://en.wikinews.org/w/api.php?action=query&titles=Donald_Trump&prop=revisions&rvprop=content&format=json&redirects&continue - but the return is not a list of stories!
Has anyone searched WikiNews? Does anyone have suggestions for achieving what I'm trying to do?

action=query&format=json&list=search&redirects=1&srsearch="Donald Trump" -incategory:disputed incategory:"August_25,_2016|August_24,_2016|August_23,_2016|August_22,_2016|August_21,_2016|August_20,_2016|August_19,_2016|August_18,_2016"&srnamespace=0&srenablerewrites=1 will search for articles from the last few days mentioning Trump. (See the full docs for the keywords that can be used in a search query.)
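As a concrete illustration, here's a minimal Python sketch of that query (assuming the requests library, and that Wikinews still tags each story with a per-day category; the phrase and day window are placeholders):

```python
import requests
from datetime import date, timedelta

API = "https://en.wikinews.org/w/api.php"

def recent_stories(phrase, days=8):
    # Wikinews files each story under a publication-date category,
    # e.g. "August_25,_2016"; build an incategory filter covering
    # the last `days` dates.
    dates = [date.today() - timedelta(days=i) for i in range(days)]
    cats = "|".join(f"{d.strftime('%B')}_{d.day},_{d.year}" for d in dates)
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": f'"{phrase}" -incategory:disputed incategory:"{cats}"',
        "srnamespace": 0,
        "srenablerewrites": 1,
    }
    r = requests.get(API, params=params, timeout=10)
    r.raise_for_status()
    return [hit["title"] for hit in r.json()["query"]["search"]]

print(recent_stories("Donald Trump"))
```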
Most of your examples do not actually generate a list (you can see that from the lack of a list parameter); they expect one or more article names via the titles parameter and return information about those. Your second example is valid (except the correct parameter name is rcnamespace), but it returns recently edited articles, which is a very random way of trying to find news on a topic.
In general it seems like you are trying to randomly guess what the API modules do. Did you miss the docs and sandbox?

Related

Using MediaWiki, how can I get a random Wikipedia page of a given quality?

I'm using the Wikimedia random API to get random articles from Wikipedia. However, using this API I get completely random articles; the only parameter I control here is rnnamespace, which allows me to filter out talk pages, user pages and so on.
I know that some Wikipedia pages are assessed for their quality, and I'd like to get a random article drawn only from, for example, the set of featured articles. Is there a way I could use the API to do that?
I was wondering if my only option was to make SQL queries, even though ideally I could rely only on the API.
Sadly, there is no proper API (the task for it is T63840). Use Special:RandomInCategory with the Featured articles category. Or https://randomincategory.toolforge.org/ for a slower but more mathematically correct alternative.
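For example, fetching https://en.wikipedia.org/wiki/Special:RandomInCategory/Featured_articles should redirect to a random member of that category (the URL form with the category as a subpage parameter is assumed here from how other special pages work).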
So I found a partly satisfying solution. I can use the categorymembers API, which returns pages from a given category.
There's a timestamp sort option which allows listing articles from a specific date, so my idea is to choose a date at random, get the list of articles from that date, and then choose again randomly among those articles.
Of course, this does not guarantee a uniform distribution across articles, but it should work pretty well anyway.
I'll include my code later on to complete the answer.
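In the meantime, here is a minimal sketch of that two-stage idea, assuming the requests library and Category:Featured articles on the English Wikipedia (the 2004 start date is a rough assumption):

```python
import random
import requests
from datetime import date, timedelta

API = "https://en.wikipedia.org/w/api.php"

def random_featured_article():
    # Stage 1: pick a random date and ask categorymembers for a batch
    # of pages added to the category from that date onward
    # (cmstart only works together with cmsort=timestamp).
    start = date(2004, 1, 1)  # assumed: roughly when the category began
    when = start + timedelta(days=random.randrange((date.today() - start).days))
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": "Category:Featured articles",
        "cmsort": "timestamp",
        "cmstart": when.strftime("%Y-%m-%dT00:00:00Z"),
        "cmlimit": 50,
    }
    r = requests.get(API, params=params, timeout=10)
    r.raise_for_status()
    members = r.json()["query"]["categorymembers"]
    # Stage 2: pick uniformly within the batch. As noted above, this is
    # only approximately uniform overall.
    return random.choice(members)["title"] if members else None

print(random_featured_article())
```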

GoogleBetterAds - violatingSites.list - google-apis-explorer

I can get a list of summaries of violating sites using the following link:
https://developers.google.com/ad-experience-report/[...]/violatingSites/list
My questions:
Is this list exhaustive?
If not, is it possible to get an exhaustive list (or not) and how?
Is it possible to know how these websites are pulled (the share of websites analysed, etc)?
- Is this list exhaustive?
What's the size of your actual API response?
If the response keeps getting longer, with new data added on each new request, you can assume you have the exhaustive list (with a possible update latency).
If the response always has the same size but different data, i.e. old data no longer appears and is replaced by new data, it's not exhaustive.
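A minimal sketch of that check, assuming the v1 REST endpoint of the Ad Experience Report API and an API key (the endpoint URL and response field names are taken from the API explorer; treat them as assumptions):

```python
import requests

URL = "https://adexperiencereport.googleapis.com/v1/violatingSites"  # assumed
API_KEY = "YOUR_API_KEY"  # placeholder

def fetch_sites():
    r = requests.get(URL, params={"key": API_KEY}, timeout=10)
    r.raise_for_status()
    # "violatingSites" / "reviewedSite" are the field names shown in the
    # API explorer; adjust if the response shape differs.
    return {s["reviewedSite"] for s in r.json().get("violatingSites", [])}

# Take two snapshots some time apart: if old entries vanish while the
# list stays roughly the same size, it's a rolling window, not an
# exhaustive list.
before = fetch_sites()
# ... wait a few hours or days, then:
after = fetch_sites()
print("dropped:", len(before - after), "added:", len(after - before))
```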
- If not, is it possible to get an exhaustive list (or not) and how?
I have no idea at the moment; the total number of websites could be in the billions...
- Is it possible to know how these websites are pulled (the share of websites analysed, etc)?
I have no idea at the moment either. I think the process is either confidential, or it is described in the general conditions and, subtly, in the documentation...

How to use Wikipedia API to get page statistics for all pages in a Category?

I am looking to identify the most popular pages in a Wikipedia category (for example, which graph algorithms had the highest page views in the last year?). However, there seems to be little up-to-date information on Wikipedia APIs, especially for obtaining statistics.
For example, the StackOverflow post on How to use Wikipedia API to get the page view statistics of a particular page in Wikipedia? contains answers that no longer seem to work.
I have dug around a bit, but I am unable to find any usable APIs, other than a really nice website where I could potentially do this manually, by typing page titles one by one (up to ten pages at a time): https://tools.wmflabs.org/pageviews/. Would appreciate any help. Thanks!
You can use a MediaWiki API call like this to get the titles in the category: https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics
Then you can use this to get page view statistics for each page: https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end
(be careful of the rate limit)
E.g. for the last year, article "Physics" (part of the Physics category): https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/Physics/daily/20151104/20161104
If you're dealing with large categories, it may be best to start downloading statistics from https://dumps.wikimedia.org/other/pageviews/2016/2016-11/ to avoid making so many REST API calls.
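Putting the two calls together, a minimal Python sketch (assuming the requests library; the category, date range and sleep interval are placeholders, and large categories would also need cmcontinue paging):

```python
import time
import requests

WP_API = "https://en.wikipedia.org/w/api.php"
PV_API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
          "en.wikipedia.org/all-access/all-agents/{}/daily/{}/{}")
HEADERS = {"User-Agent": "category-pageviews-sketch/0.1"}  # be polite

def category_titles(category):
    # One batch of category members; follow cmcontinue for > 500 pages.
    params = {
        "action": "query", "format": "json", "list": "categorymembers",
        "cmtitle": category, "cmnamespace": 0, "cmlimit": 500,
    }
    r = requests.get(WP_API, params=params, headers=HEADERS, timeout=10)
    r.raise_for_status()
    return [m["title"] for m in r.json()["query"]["categorymembers"]]

def yearly_views(title, start="20151104", end="20161104"):
    url = PV_API.format(title.replace(" ", "_"), start, end)
    r = requests.get(url, headers=HEADERS, timeout=10)
    if r.status_code != 200:  # e.g. no pageview data for this title
        return 0
    return sum(item["views"] for item in r.json()["items"])

views = {}
for title in category_titles("Category:Graph algorithms"):
    views[title] = yearly_views(title)
    time.sleep(0.2)  # stay well under the REST API rate limit

for title, v in sorted(views.items(), key=lambda kv: -kv[1])[:10]:
    print(v, title)
```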
TreeViews is a tool designed to do exactly this. Getting good data is going to be hard if your category contains thousands of pages, in which case you'd better do the calculations yourself as Krenair suggests.

Scribd API search showing irrelevant answers

When I use the search functionality of the Scribd docs API, like
http://api.scribd.com/api?method=docs.search&api_key=API_KEY&query=hello+world
it returns irrelevant results, different from those of the site's own search. This request, for example, returns results about Guitar Hero, World of Warcraft, Virtual Worlds, etc., whereas the site search on https://www.scribd.com/search-documents?query=hello+world gives documents titled "Hello World", as you would expect. Is there a parameter I can add to the API call to make it return relevant results?
You may try playing with the simple parameter to see if it makes any difference to your queries. According to the API reference (half of it is inaccessible at the moment) it makes the results the same as for the website:
(optional) This option specifies whether or not to allow advanced search queries (more information). When set to false, the API search behaves the same as the search on Scribd.com. When set to true, the API search allows advanced queries that contain filters such as title:"A Tale of Two Cities". Set to "true" by default.
I tried your query myself: setting simple to false changes things a bit, but the results are still not adequate. Even running their sample queries 1:1 gives 90% irrelevant results.
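For reference, a minimal sketch of the request with simple turned off (API_KEY is a placeholder; the response comes back as XML):

```python
import requests

params = {
    "method": "docs.search",
    "api_key": "API_KEY",  # placeholder
    "query": "hello world",
    "simple": "false",  # per the docs, should mimic the website search
}
r = requests.get("http://api.scribd.com/api", params=params, timeout=10)
print(r.text)  # XML <rsp> payload with the result set
```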
Then I found a similar issue being discussed in the following Google Groups thread back in 2011. At the end, Jared Friedman (the CTO of Scribd) himself admits that the API search and the website search work differently, and that fixing this is not among their priorities. In 2014 another developer complained. It seems to me that four years later this is still the case.
I'd suggest contacting Scribd support directly and asking what the current status of the docs.search API is, and whether there is some preliminary approval process in place (for example, they might do a background check on accounts and only then serve relevant results, returning just test results for any query otherwise), although I doubt it.

How does Google determine the date a thread was posted?

When searching for a term in Google, you can click "Discussion" on the left-hand side of the page. This leads you to forum-based discussions which you can select. I am in the process of designing a discussion board for a user group, and I would like Google to index my data with post time.
You can filter the results by "Any Time" - "Past Hour" - "Past 24 Hours" - "Past Week" - etc.
What is the best way to ensure that the post date is communicated to google? RSS feed for thread? Special HTML label tag with particular id? Or some other method?
Google continually improves their heuristics and as such, I don't think there are any (publicly known) rules for what you describe. In fact, I just did a discussion search myself and found the resulting pages to have wildly differing layouts, and not all of them have RSS feeds or use standard forum software. I would just guess that Google looks for common indicators such as Post #, Author, Date.
Time-based filtering is mostly based on how frequently Google indexes your page and identifies new content (although discussion pages could also be filtered based on individual post dates, which is once again totally up to Google). Just guessing, but it might also help to add Last-Modified headers to your pages.
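For instance, a minimal Flask sketch of that header (newest_post_time is a hypothetical helper standing in for a database lookup):

```python
from datetime import datetime, timezone
from email.utils import format_datetime

from flask import Flask, make_response

app = Flask(__name__)

def newest_post_time(thread_id):
    # Hypothetical helper: would look up the newest post's timestamp
    # for the thread; hard-coded here for the sketch.
    return datetime(2016, 8, 25, 12, 0, tzinfo=timezone.utc)

@app.route("/thread/<int:thread_id>")
def thread(thread_id):
    resp = make_response(f"<html>thread {thread_id}</html>")
    # Advertise when the content last changed so crawlers can pick it up.
    resp.headers["Last-Modified"] = format_datetime(
        newest_post_time(thread_id), usegmt=True)
    return resp
```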
I believe Google will simply look at when the content appeared. No need for parsing there, and no special treatment required on your end.
I once read a paper from a Googler (a paper I sadly can't find anymore; if somebody finds it, please give me a note) where this was outlined. A lot of formulas and so on, but the bottom line was: Google has analyzed the structure of the top forum systems on the web. It does not use a page metaphor to analyse a forum, but breaks it down into topics, threads and posts.
So basically, if you use a standard, popular forum system, Google knows that it is a forum and puts you into the discussion segment. If you build your own forum software, it is probably best to follow existing, established forum conventions (topics, threads, posts, authors...).