I loaded about 15,000 pages (letters A and B of a dictionary) and submitted a text sitemap to Google. I plan to use Google's ad-supported site search as the mechanism for navigating my site. Google Webmaster Tools accepted the sitemap as valid but then did not index the pages. My index page has been indexed by Google, but at this point it does not link to any of the other pages.
So to get Google's search to work, I need to get all my content indexed. It appears Google will not index from the sitemap alone, so I was thinking of adding pages that spiders can reach through links from the main index page. But I don't want to create a bunch of pages that programmatically link to all the others without knowing whether this has a chance of working. Eventually I plan on having about 150,000 pages, each defining a single word or phrase; I wrote a program that pulls these from a dictionary database. I would like to demonstrate the content to anyone interested, to show the value of the dictionary in relation to the dictionary software I'm completing. Any suggestions for getting the entire site indexed by Google so I can appear in search results?
Thanks
First off, you have to have legitimate, interesting content and not be some sort of attempt to scam or fool Google (see below).
Assuming that's true, Google has some guidelines for getting your site indexed. I highly recommend reading this document.
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35769
The first recommendation is:
Make a site with a clear hierarchy and text links. Every page should be reachable from at least one static text link.
This means that, yes, you have to have links to each page.
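For example, one common pattern is a set of static "browse" pages, each linking a few hundred entries, all reachable from the home page. Here is a minimal Python sketch (the file layout and URL scheme are my assumptions, not anything Google prescribes):

```python
# Sketch: generate static "browse" pages so every dictionary entry is
# reachable through a plain text link. Paths and page size are hypothetical.
import html
import os
from urllib.parse import quote

ENTRIES = ["aardvark", "abacus", "banana"]  # in practice, read from your database
PAGE_SIZE = 200  # keep each browse page to a crawl-friendly number of links

os.makedirs("browse", exist_ok=True)
for page, start in enumerate(range(0, len(ENTRIES), PAGE_SIZE), start=1):
    items = "\n".join(
        f'<li><a href="/define/{quote(word)}.html">{html.escape(word)}</a></li>'
        for word in ENTRIES[start:start + PAGE_SIZE]
    )
    with open(f"browse/index-{page}.html", "w", encoding="utf-8") as f:
        f.write(f"<html><body><ul>\n{items}\n</ul></body></html>")
```

The home page would then link to browse/index-1.html, browse/index-2.html, and so on, giving the crawler a static text-link path to every entry.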
My experience is with a bookseller with about 90,000 titles, some of which are rare and hard to obtain elsewhere. Humans access the book pages mostly through the search interface. Each book page has a fair amount of information: title, publisher, date published, and so on; about 20 fields in all.
The site and its ever-changing catalog have been live for more than a decade, but the titles were not visible to search engines, since they were only reachable through the search interface. So Google indexed only the CMS pages and not the book catalog.
We started with a sitemap. This didn't get Google to index the book catalog. Then, after reading the guidelines, we added static text links to all the titles, and then accurate meta tags. (The URLs remained normal IIS PHP cruft, i.e. book.php?id=123546.) There is no duplicate content.
At one point we got Google to index 17,000 pages, but since then the number of pages in Google's index has dropped to about 900: all the non-book pages and a few books that are featured on the site.
Unfortunately the client didn't want to pay for us to continue trying to get Google to index the titles, so that's how it stands today.
Finally, it's not really clear from your post what you are trying to accomplish. If you are trying to put an existing, boring old dictionary up on the web and then get ad revenue, Google is going to ignore you.
Google crawling takes time, especially for new sites/domains. The sitemaps you submit are just "informational" for Google and don't necessarily mean that everything in them will be indexed.
Some notes:
Make sure you display unique content or Google will penalize you for duplicate content.
Avoid pages with little text; Google likes pages with a reasonable amount of content.
Keep adding content progressively, a couple of times a day if possible.
You should start to see your pages crawled hourly or daily (depending on your update frequency) and appearing in search engines within about 2 weeks to 2 months (mine usually take about a month to be considered "crawled"). Something you can do in the meantime is get backlinks from other websites and keep checking Webmaster Tools to watch Google's crawl rate.
Good luck! :)
I am looking to identify the most popular pages in a Wikipedia category (for example, which graph algorithms had the highest page views in the last year?). However, there seems to be little up-to-date information about Wikipedia APIs, especially for obtaining statistics.
For example, the StackOverflow post on How to use Wikipedia API to get the page view statistics of a particular page in Wikipedia? contains answers that no longer seem to work.
I have dug around a bit, but I am unable to find any usable APIs, other than a really nice website where I could potentially do this manually by typing page titles one by one (a maximum of ten pages at a time): https://tools.wmflabs.org/pageviews/. Would appreciate any help. Thanks!
You can use a MediaWiki API call like this to get the titles in the category: https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics
Then you can use this to get page view statistics for each page: https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end
(careful of the rate limit)
E.g. for the last year, article "Physics" (part of the Physics category): https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/Physics/daily/20151104/20161104
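Putting the two endpoints together, a rough Python sketch (the category, date range and pause length are arbitrary; titles containing slashes would need proper URL-encoding):

```python
# Sketch: list a category's members, then sum daily page views for each.
import time
import requests

S = requests.Session()
S.headers["User-Agent"] = "pageview-sketch/0.1 (your contact address here)"

def category_members(category):
    """Yield article titles in a category, following API continuation."""
    params = {"action": "query", "list": "categorymembers", "format": "json",
              "cmtitle": f"Category:{category}", "cmlimit": 500}
    while True:
        data = S.get("https://en.wikipedia.org/w/api.php", params=params).json()
        for m in data["query"]["categorymembers"]:
            if m["ns"] == 0:  # articles only; skip subcategories, talk pages, etc.
                yield m["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def yearly_views(title, start="20151104", end="20161104"):
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           f"en.wikipedia.org/all-access/all-agents/{title.replace(' ', '_')}/"
           f"daily/{start}/{end}")
    r = S.get(url)
    if r.status_code == 404:  # no view data for this article in the range
        return 0
    r.raise_for_status()
    return sum(item["views"] for item in r.json()["items"])

totals = {}
for title in category_members("Physics"):
    totals[title] = yearly_views(title)
    time.sleep(0.1)  # stay well under the REST API's rate limit

for title, views in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print(views, title)
```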
If you're dealing with large categories, it may be best to start downloading statistics from https://dumps.wikimedia.org/other/pageviews/2016/2016-11/ to avoid making so many REST API calls.
TreeViews is a tool designed to do exactly this. Getting good data is going to be hard if your category contains thousands of pages, in which case you would be better off doing the calculations yourself, as Krenair suggests.
I'm wondering if there's a way to make Google searches where you can set filters that stay in effect permanently, like a filter profile. So, for instance, every time you did a search, you could get results that didn't include, say, Yahoo Answers, without having to type in -yahoo -answers.
A feature like this would be invaluable because it's very common to perform a search and want to filter out a lot of popular sites that would normally top the rankings. For example, suppose you're searching for a news topic and don't want to read mainstream media articles. You could add the words reuters, cnn, huffington post, daily mail, and so on to your filter profile and never see those sites turn up in any of your searches ever again.
I'm asking because I'm interested in making an extension that would do precisely this, but there's no point if such a feature already exists.
You can create a custom search in minutes. It's called Google CSE (Custom Search Engine).
This is a sample public link that I've created based on your example above: https://www.google.com/cse/publicurl?cx=006201654654568968489:1kv4asuwfvs
In the settings:
I can choose to exclude by URL, by URL pattern, or even individual URLs within my search results.
If you need more ways, here's a good and relevant link.
Search filters can be specified as part of the URL (e.g. append site:example.com/section1 to a Google query to yield only results whose locations start with that prefix). So you can make a search plugin that substitutes your query into such a template and install it in your browser.
Search plugins are generally XML files with a standardized schema. OpenSearch is one such standard supported by Chrome.
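For illustration, a minimal OpenSearch description that bakes the exclusions into the query template might look like this (the excluded sites are just examples):

```xml
<!-- Sketch of an OpenSearch plugin; every query it sends to Google
     gets the -site: exclusions appended automatically. -->
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Filtered Google</ShortName>
  <Description>Google search minus a few sites</Description>
  <Url type="text/html"
       template="https://www.google.com/search?q={searchTerms}+-site:answers.yahoo.com+-site:dailymail.co.uk"/>
</OpenSearchDescription>
```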
There are sites that host collections of user-submitted plugins, as well as tools to generate your own. An example that I use is the Mycroft project (originally created for the Apple Sherlock software that pioneered the concept, and later accepted into the Mozilla project when Firefox took on the feature).
When you search in Google, you can click "Discussion" on the left-hand side of the page. This leads you to forum-based discussions that you can select. I am in the process of designing a discussion board for a user group, and I would like Google to index my data with the post time.
You can filter the results by "Any Time" - "Past Hour" - "Past 24 Hours" - "Past Week" - etc.
What is the best way to ensure that the post date is communicated to google? RSS feed for thread? Special HTML label tag with particular id? Or some other method?
Google continually improves their heuristics and as such, I don't think there are any (publicly known) rules for what you describe. In fact, I just did a discussion search myself and found the resulting pages to have wildly differing layouts, and not all of them have RSS feeds or use standard forum software. I would just guess that Google looks for common indicators such as Post #, Author, Date.
Time-based filtering is mostly based on how frequently Google indexes your page and identifies new content (although discussion pages could also be filtered based on individual post dates, which is once again totally up to Google). Just guessing, but it might also help to add Last-Modified headers to your pages.
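If you want to try that, here is a minimal sketch using only Python's standard library (the thread path and date are made up):

```python
# Sketch: serve a forum thread with a Last-Modified header reflecting
# the newest post, so crawlers can see when the content last changed.
from datetime import datetime, timezone
from email.utils import format_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical lookup: when the thread at each path last received a post.
LAST_POST_TIMES = {"/thread/42": datetime(2010, 5, 1, 12, 0, tzinfo=timezone.utc)}

class ForumHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        modified = LAST_POST_TIMES.get(self.path)
        if modified:
            # RFC 1123 format, e.g. "Sat, 01 May 2010 12:00:00 GMT"
            self.send_header("Last-Modified", format_datetime(modified, usegmt=True))
        self.end_headers()
        self.wfile.write(b"<html><body>thread content here</body></html>")

HTTPServer(("", 8000), ForumHandler).serve_forever()
```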
I believe Google will simply look at when the content appeared. No need for parsing there, and no special treatment required on your end.
I once read a paper from a Googler (a paper I sadly can't find anymore; if somebody finds it, please give me a note) where this was outlined. There were a lot of formulas and so on, but the bottom line was: Google has analyzed the structure of the top forum systems on the web. It does not use a page metaphor to analyze a forum; it breaks the forum down into topics, threads and posts.
So basically, if you use a standard, popular forum system, Google knows that it is a forum and puts you into the discussion segment. If you build your own forum software, it is probably best to follow existing, established forum conventions (topics, threads, posts, authors, and so on).
I want to change the title and summary of my website in Google search results.
Is it enough to change the metadata and title in my HTML files and wait a few weeks for Google to update the results? (How long does it take on average?)
Thanks
Basically yes. The time you'll have to wait varies a great deal from site to site, depending on PageRank, visitor frequency and other factors. Stack Overflow results are indexed within minutes; smaller sites within days or weeks.
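For reference, these are the two tags involved (the wording is just an example, and Google may still rewrite the snippet as it sees fit):

```html
<head>
  <title>The new title you want shown in results</title>
  <meta name="description"
        content="The one- or two-sentence summary you would like Google to use as the snippet.">
</head>
```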
It may also be worth signing up with the Google Webmaster Tools to find out how "visible" your site is to the Google bot and when it last dropped by.
Yes, that's sufficient, but if you want to try to kick-start it a bit you can resubmit your top-level page here. You might also consider trying out some of the Google Webmaster Tools, which in addition to actual tools also have help entries like these:
Requesting reconsideration of your site
Changing your site's title and description in search results
Yes, it is enough. Also make sure that if you have a sitemap.xml file, it is updated and resubmitted to Google. Nowadays Google takes 3-5 days to crawl your files for fresh indexing.
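If you generate the sitemap programmatically, a minimal Python sketch might look like this (URLs and dates are placeholders):

```python
# Sketch: write a minimal sitemap.xml for a list of page URLs.
from datetime import date
from xml.sax.saxutils import escape

urls = ["https://example.com/", "https://example.com/about.html"]

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url in urls:
        f.write("  <url>\n")
        f.write(f"    <loc>{escape(url)}</loc>\n")
        f.write(f"    <lastmod>{date.today().isoformat()}</lastmod>\n")
        f.write("  </url>\n")
    f.write("</urlset>\n")
```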
I am working towards building an index of URLs. The objective is to build and store a data structure whose keys are domain URLs (e.g. www.nytimes.com) and whose values are sets of features associated with those URLs. I am looking for your suggestions for this set of features. For example, I would like to store www.nytimes.com as follows:
[www.nytimes.com: [lang: en, alexa_rank: 96, content_type: news, spam_probability: 0.0001, ...]]
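In code, I picture something like this minimal Python sketch (field names beyond my example are hypothetical):

```python
# Sketch of the index described above: one record of features per domain.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DomainFeatures:
    lang: str = "unknown"
    alexa_rank: Optional[int] = None
    content_type: str = "unknown"
    spam_probability: float = 0.0
    extras: dict = field(default_factory=dict)  # room for features added later

index = {
    "www.nytimes.com": DomainFeatures(
        lang="en", alexa_rank=96, content_type="news", spam_probability=0.0001
    ),
}
```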
Why am I building this? Well, the ultimate goal is to do some interesting things with this index; for example, I may do clustering on it and find interesting groups. I have a whole lot of text that was generated by a whole lot of URLs over a whole lot of time :) So data is not a problem.
Any suggestions are very welcome.
Make it work first with what you've already suggested. Then start adding features suggested by everybody else.
"Ideas are worth nothing unless executed."
-- http://www.codinghorror.com/blog/2010/01/cultivate-teams-not-ideas.html
I would maybe start here:
Google white papers on IR
Then also search Google itself for more white papers on IR.
Also a few things to add to your index:
Subdomains associated with the domain
IP addresses associated with the domain
Average page speed (see the sketch after this list)
Links pointing at the domain in Yahoo, e.g. link:nytimes.com, or search on Yahoo
Number of pages on the domain - site:nytimes.com on Google
Traffic numbers on compete.com or Google Trends
WHOIS info, e.g. age of domain, length of time registered for, etc.
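As a tiny illustration of the "average page speed" item above, here is a hedged sketch that times a few fetches (sample count and delay are arbitrary; a real crawler would need retries and politeness rules):

```python
# Toy measurement of average page fetch time using only the standard library.
import time
import urllib.request

def average_fetch_seconds(url, samples=3):
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()  # include body download time, not just headers
        timings.append(time.monotonic() - start)
        time.sleep(1)  # be polite between requests
    return sum(timings) / len(timings)

print(average_fetch_seconds("https://www.nytimes.com/"))
```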
Some other places to research: http://www.majesticseo.com/, http://www.opensearch.org/Home and http://www.seomoz.org. They all have their own indexes.
I'm sure there's plenty more, but hopefully the IR stuff will get the cogs whirring :)