How does Google determine the date a thread was posted? - html

When you search for a term in Google, you can click "Discussion" on the left-hand side of the page. This leads you to forum-based discussions. I am designing a discussion board for a user group, and I would like Google to index my data along with the post time.
You can filter the results by "Any Time", "Past Hour", "Past 24 Hours", "Past Week", etc.
What is the best way to make sure the post date is communicated to Google? An RSS feed for the thread? A special HTML tag with a particular id? Or some other method?

Google continually improves their heuristics and as such, I don't think there are any (publicly known) rules for what you describe. In fact, I just did a discussion search myself and found the resulting pages to have wildly differing layouts, and not all of them have RSS feeds or use standard forum software. I would just guess that Google looks for common indicators such as Post #, Author, Date.
Time-based filtering is mostly based on how frequently Google indexes your page and identifies new content (although discussion pages could also be filtered based on individual post dates, which is once again totally up to Google). Just guessing, but it might also help to add Last-Modified headers to your pages.
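For the Last-Modified suggestion, here is a minimal sketch of how a forum page could build that header from the time of its newest post. The function name and the idea of using the newest post's timestamp are assumptions made for illustration; whether Google actually uses the header for the discussion filter is, as said above, a guess.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(last_post_time):
    """Return a (name, value) HTTP header pair for a thread page.

    last_post_time is assumed to be a timezone-aware datetime holding
    the time of the newest post in the thread (hypothetical input).
    """
    # HTTP dates are expressed in GMT, so normalise to UTC first.
    as_utc = last_post_time.astimezone(timezone.utc)
    return ("Last-Modified", format_datetime(as_utc, usegmt=True))


header = last_modified_header(datetime(2010, 3, 5, 14, 30, 0, tzinfo=timezone.utc))
print(header)  # ('Last-Modified', 'Fri, 05 Mar 2010 14:30:00 GMT')
```

The pair can be handed to whatever web framework serves the thread page.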

I believe Google will simply look at when the content appeared. No need for parsing there, and no special treatment required on your end.

I once read a paper from a Googler (a paper I sadly can't find anymore; if somebody finds it, please drop me a note) where this was outlined. It had a lot of formulas and so on, but the bottom line was: Google has analyzed the structure of the top forum systems on the web. It does not use a page metaphor to analyse a forum, but breaks it down into topics, threads and posts.
So basically, if you use a standard, popular forum system, Google knows that it is a forum and puts you into the discussion segment. If you build your own forum software, it is probably best to follow existing, established forum conventions (topics, threads, posts, authors...).


Using MediaWiki, how can I get a random Wikipedia page of a given quality?

I'm using the MediaWiki random API to get a random article from Wikipedia. However, with this API I get completely random articles; the only parameter I control is rnnamespace, which allows me to filter out talk pages, user pages and so on.
I know that some Wikipedia pages are assessed for their quality, and I'd like to get a random article drawn only from, for example, the set of featured articles. Is there a way to do that with the API?
I was wondering whether my only option is to run SQL queries, although ideally I would rely on the API alone.
Sadly, there is no proper API (the task for it is T63840). Use Special:RandomInCategory with the Featured articles category. Or https://randomincategory.toolforge.org/ for a slower but more mathematically correct alternative.
So I found a partly satisfying solution. I can use the categorymembers API, which returns the pages in a given category.
The results can be sorted by timestamp (cmsort=timestamp, with cmstart and cmend as bounds), which makes it possible to list the articles added to the category from a specific date. So my idea is to choose a date at random, get the list of articles from that date, and then choose at random again among those articles.
Of course, this does not guarantee a uniform distribution over the articles, but it should work pretty well anyway.
I'll include my code later on to complete the answer.
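In the meantime, the idea above could be sketched roughly like this. It is not the code promised above, just an illustration: the helper names, the date range and the User-Agent string are made up, and the timestamp used by categorymembers is the time a page was added to the category.

```python
import json
import random
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def random_timestamp(rng, earliest, latest):
    """Pick a random moment between two datetimes (inclusive)."""
    span_seconds = int((latest - earliest).total_seconds())
    return earliest + timedelta(seconds=rng.randrange(span_seconds + 1))


def categorymembers_params(category, start, limit=50):
    """Parameters listing pages added to `category` from `start` onward."""
    return {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": "0",        # articles only
        "cmsort": "timestamp",     # sort by time added to the category
        "cmdir": "asc",
        "cmstart": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "cmlimit": str(limit),
    }


def fetch(params):
    """Run the query; the User-Agent value is a placeholder."""
    request = Request(API_URL + "?" + urlencode(params),
                      headers={"User-Agent": "random-featured-sketch/0.1"})
    with urlopen(request) as reply:
        return json.load(reply)


def pick_title(response, rng):
    """Choose one title at random from a categorymembers response."""
    members = response.get("query", {}).get("categorymembers", [])
    return rng.choice(members)["title"] if members else None


rng = random.Random()
start = random_timestamp(rng,
                         datetime(2005, 1, 1, tzinfo=timezone.utc),  # assumed lower bound
                         datetime.now(timezone.utc))
params = categorymembers_params("Category:Featured articles", start)
# title = pick_title(fetch(params), rng)  # uncomment to hit the live API
```

If the random date falls after the last addition to the category, the response is empty and pick_title returns None, so a caller would simply retry with another date.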

How to use Wikipedia API to get page statistics for all pages in a Category?

I am looking to identify the most popular pages in a Wikipedia category (for example, which graph algorithms had the highest page views in the last year?). However, there seems to be little up-to-date information on the Wikipedia APIs, especially for obtaining statistics.
For example, the StackOverflow post on How to use Wikipedia API to get the page view statistics of a particular page in Wikipedia? contains answers that no longer seem to work.
I have dug around a bit, but I am unable to find any usable APIs, other than a really nice website, where I could potentially do this manually, by typing page titles one by one (max. up to ten pages only): https://tools.wmflabs.org/pageviews/. Would appreciate any help. Thanks!
You can use a MediaWiki API call like this to get the titles in the category: https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics
Then you can use this to get page view statistics for each page: https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end
(careful of the rate limit)
E.g. for the last year, article "Physics" (part of the Physics category): https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/Physics/daily/20151104/20161104
If you're dealing with large categories, it may be best to start downloading statistics from https://dumps.wikimedia.org/other/pageviews/2016/2016-11/ to avoid making so many REST API calls.
TreeViews is a tool designed to do exactly this. Getting good data is going to be hard if your category contains thousands of pages, in which case you'd better do the calculations yourself as Krenair suggests.
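Putting the two API calls from the first answer together, a rough sketch could look like the following. The URL layout is copied from the example above; the helper names are made up, and the assumption is that the REST response carries an "items" list whose entries have a "views" count. Fetching is left out so the rate limit stays in your hands.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(title, start, end, project="en.wikipedia.org"):
    """Build the per-article daily pageviews URL (dates as YYYYMMDD)."""
    # Titles use underscores instead of spaces and must be URL-encoded.
    article = quote(title.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/all-access/all-agents/{article}/daily/{start}/{end}"


def total_views(response):
    """Sum the daily view counts in one pageviews response."""
    return sum(item["views"] for item in response.get("items", []))


def rank_by_views(totals):
    """Sort a {title: total_views} mapping, most viewed first."""
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


print(pageviews_url("Physics", "20151104", "20161104"))
```

Call pageviews_url once per title returned by categorymembers, feed each JSON response to total_views, and pass the resulting dictionary to rank_by_views.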

Anyone ever tried to use Twitter to replace comments sections on web apps?

Here's the scenario I'm imagining.
Take a simple blog, where users typically post comments in a form at the bottom of each article. Instead of that, use the Twitter API to pull tweets based on a hashtag. Base the hashtag on the article id (e.g. #site10201), where "site" is a prefix and the number is the article id.
Then provide a link to post a tweet using that hashtag, which would then get picked up by your Twitter API pull.
I'm imagining horrible spam issues, but other than that, bad idea?
It has some drawbacks compared to more run-of-the-mill database systems:
1. Additional network overhead. Most self-hosted blogs rely on the database and the blog being on the same server (physical or virtual), so a db lookup is fast (and reliable) compared to Twitter API requests.
2. Caching issues. One host is only allowed X requests to Twitter in a given time window (requests beyond that fail with an error), so how are you going to manage that from your website, in a scenario that becomes steadily more complex as articles are added? Presumably you need to authenticate, so the easy way out is a security liability. (The easy way out being to use JavaScript in the browser to perform the actual request, which neatly circumvents the problem in 20/80 fashion.) Granted, most blogs don't get that kind of traffic. ;)
3. You leave your precious or not so precious comments at the mercy of the fail whale. Which is kind of odd, considering that self-hosting a blog basically means you wanted that control in the first place by not using a service like Blogger.
4. Is it possible to ensure the uniqueness of hashtags in the general case? What are you going to do if someone had the same bright idea, only took the name of the tag 5ms before you did? Would you end up pulling the drivel of someone else's blog comments rather than the brilliance you have come to expect from yours? ;)
5. Lesser point: you rely on others having a Twitter account. Anonymous replies are off the table.
6. TOS and other considerations that may be imposed on you by Twitter, either now or in the future. (2) is actually a major item of Twitter's TOS.
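For what it's worth, the hashtag scheme from the question is easy to sketch. The "site" prefix comes from the question; the function names are made up, and the link assumes Twitter's web intent URL with its hashtags parameter. Pulling the tweets back needs an authenticated API call and is not shown.

```python
import re
from urllib.parse import urlencode

SITE_PREFIX = "site"  # prefix from the question; pick something unlikely to collide


def article_hashtag(article_id):
    """Hashtag under which comments for one article are tweeted."""
    return f"#{SITE_PREFIX}{article_id}"


def article_id_from_hashtag(tag):
    """Map a hashtag back to an article id, or None if it is not ours."""
    match = re.fullmatch(rf"#{re.escape(SITE_PREFIX)}(\d+)", tag)
    return int(match.group(1)) if match else None


def tweet_link(article_id):
    """Link that opens a tweet box pre-filled with the article hashtag."""
    # The hashtags parameter is assumed to take the tag without the '#'.
    query = urlencode({"hashtags": f"{SITE_PREFIX}{article_id}"})
    return f"https://twitter.com/intent/tweet?{query}"


print(article_hashtag(10201))
print(tweet_link(10201))
```

Note that nothing here addresses the uniqueness problem in point 4: anyone can tweet under the same tag.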

SEO Problem for new dictionary site, google hasn't indexed content

I loaded about 15,000 pages, letters A & B of a dictionary, and submitted a text sitemap to Google. I'm planning to use Google's search with advertisements as the mechanism for navigating my site. Google's Webmaster Tools accepted the sitemaps as good but then did not index the pages. My index page has been indexed by Google, and at this point I have not linked to any pages from it.
So to get Google's search to work I need to get all my content indexed. It appears Google will not index from the sitemap alone, so I was thinking of adding pages that spider out in links from the main index page. But I don't want to create a bunch of pages that programmatically link to all of the pages without knowing whether this has a chance of working. Eventually I plan on having about 150,000 pages, each page being a word or phrase with its definition. I wrote a program that pulls this from a dictionary database. I would like to show the content that I have to anyone interested, to demonstrate the value of the dictionary in relation to the dictionary software that I'm completing. Any suggestions for getting the entire site indexed by Google so I can appear in the search results?
Thanks
First off, you have to have legitimate, interesting content and not be some sort of attempt to scam or fool Google (see below).
Assuming that's true, Google has some guidelines for getting your site indexed. I highly recommend reading this document.
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35769
The first recommendation is:
Make a site with a clear hierarchy and text links. Every page should be reachable from at least one static text link.
This means that, yes, you have to have links to each page.
My experience is with a bookseller with about 90,000 titles, some of which are rare and hard to obtain elsewhere. Humans will mostly access the book pages through the search interface. Each book page has a fair amount of information: title, publisher, date published, etc., about 20 fields in all.
The site and its ever-changing catalog has been live for more than a decade, but the titles were not accessible to search engines since they were only accessible through the search interface. So google only indexed the CMS pages and not the book catalog.
We started by having a site map. This didn't get google to index the book catalog. Then after reading the guidelines we added static text links to all the titles. Then we added accurate meta tags. (The urls remained normal IIS PHP cruft i.e. book.php?id=123546.) There is no duplicate content.
At one point we got google to index 17,000 pages, but since then the number of pages in google's index has dropped to about 900 - all the non book pages and a few books that are featured on the site.
Unfortunately the client didn't want to pay for us to continue to try to get google to index the titles, so that's how it is today.
Finally, it's not really clear from your post what you are trying to accomplish. If you are trying to put an existing, boring old dictionary up on the web and then get ad revenue, Google is going to ignore you.
Google crawling takes time, especially for new sites/domains. The sitemaps you submit are just "informational" for Google and don't necessarily mean that everything in them will be indexed.
Some notes:
Make sure you display unique content or Google will penalize you for duplicating.
Avoid pages with very little text; Google likes pages with a reasonable amount of content.
Keep adding content progressively, a couple times every day if possible.
You should start to see your pages crawled hourly/daily (depending on the update frequency) and appearing in search engines after about 2 weeks to 2 months (mine usually take 1 month to be considered "crawled"). Something you can do in the meantime is to get backlinks from other websites and keep checking Webmaster Tools to see Google's crawl rate.
Good luck! :)
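To make the "every page reachable from a static text link" advice concrete, here is a minimal sketch that splits a word list into per-letter index pages of plain links. The /define/ URL layout, the file names and the page size are all made up for illustration, and there is no guarantee that this alone gets the pages indexed.

```python
from html import escape
from itertools import groupby
from urllib.parse import quote


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_index_pages(words, per_page=100):
    """Map index-page file name -> HTML list linking to each word page.

    Words are assumed to be non-empty strings; the /define/<word> URL
    scheme is hypothetical.
    """
    pages = {}
    ordered = sorted(words, key=str.lower)
    for letter, group in groupby(ordered, key=lambda w: w[0].lower()):
        batches = chunk(list(group), per_page)
        for number, batch in enumerate(batches, start=1):
            links = "\n".join(
                f'<li><a href="/define/{quote(word)}">{escape(word)}</a></li>'
                for word in batch
            )
            pages[f"index-{letter}-{number}.html"] = f"<ul>\n{links}\n</ul>"
    return pages


pages = build_index_pages(["aardvark", "abacus", "baboon"], per_page=2)
for name in sorted(pages):
    print(name)
```

The home page would then link to each index page, which gives the clear hierarchy the guidelines ask for: home page, letter index, word page.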

Building an index of URLs, what features to include?

I am working towards building an index of URLs. The objective is to build and store a data structure whose key is a domain URL (e.g. www.nytimes.com) and whose value is a set of features associated with that URL. I am looking for your suggestions for this set of features. For example, I would like to store www.nytimes.com as follows:
[www.nytimes.com: [lang:en, alexa_rank:96, content_type:news, spam_probability:0.0001, etc.]]
Why am I building this? Well, the ultimate goal is to do some interesting things with this index; for example, I may do clustering on it and find interesting groups. I have a whole lot of text that was generated by a whole lot of URLs over a whole lot of time :) So data is not a problem.
Any kind of suggestions are very welcome.
Make it work first with what you've already suggested. Then start adding features suggested by everybody else.
"ideas are worth nothing unless executed."
-- http://www.codinghorror.com/blog/2010/01/cultivate-teams-not-ideas.html
I would maybe start here:
Google white papers on IR
Then also search for white papers on IR on Google maybe?
Also a few things to add to your index:
Subdomains associated with the domain
IP addresses associated with the domain
Average page speed
Links pointing at the domain in Yahoo - e.g. link:nytimes.com, or search on Yahoo
Number of pages on the domain - site:nytimes.com on Google
Traffic numbers from compete.com or Google Trends
whois info e.g. age of domain, length of time registered for etc.
Some other places to research - http://www.majesticseo.com/, http://www.opensearch.org/Home and http://www.seomoz.org they all have their own indexes
I'm sure there's plenty more, but hopefully the IR stuff will get the cogs whirring :)
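As a starting point in the "make it work first" spirit, the record from the question plus a few of the features listed above could be sketched like this. The field names beyond the question's own (lang, alexa_rank, content_type, spam_probability) are made up, and the values are the illustrative ones from the question.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DomainFeatures:
    """Feature record stored per domain; every field is optional."""
    lang: Optional[str] = None
    alexa_rank: Optional[int] = None
    content_type: Optional[str] = None
    spam_probability: Optional[float] = None
    subdomains: set = field(default_factory=set)
    ip_addresses: set = field(default_factory=set)
    domain_age_days: Optional[int] = None  # e.g. derived from whois data


def upsert(index, domain, **features):
    """Create or update the record for `domain` and return it."""
    record = index.setdefault(domain.lower(), DomainFeatures())
    for name, value in features.items():
        if not hasattr(record, name):
            raise KeyError(f"unknown feature: {name}")
        setattr(record, name, value)
    return record


index = {}
upsert(index, "www.nytimes.com", lang="en", alexa_rank=96,
       content_type="news", spam_probability=0.0001)
print(index["www.nytimes.com"])
```

A plain dict keyed by domain keeps lookups trivial, and new features only need a new field on the dataclass.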