Site crawling/indexing issues caused by link structure? - html

I'm doing SEO-type work for a client with many diverse site properties, none of which I built myself. One of them in particular, to which I'm linking here, appears to be having trouble getting indexed by search engines. Interestingly, I've tried multiple sitemap generator tools and they too seem to have problems crawling the site; although the site consists of only a few pages plus external links, the sitemap tools (and, I suspect, search engines) are seeing only the homepage itself and nothing else.
In Google Webmaster Tools, I'm seeing a couple of crawl errors (404s) relating to home/index.html, but nothing else. Also, in Google Analytics, over 80% of the traffic is direct (i.e. not search traffic), which seems alarming. The site has been live for about a month and is being promoted by various sources. Even searching Google for the domain name itself doesn't bring the homepage up in the results (!), let alone any related keywords.
My ultimate question is whether there are any glaring issues in the code that might prevent proper indexing. I notice that the developer chose to structure the navigation around named directories, i.e. linking to "home/index.html," "team/index.html," "about/index.html," etc., when it seems more conventional to name the HTML file itself, i.e. "team.html" and "about.html." Could this be part of the problem?
Thanks for any insight here.

You have two major issues here.
The first issue is that the root http://www.raisetheriver.org/ has a meta refresh that redirects the page to http://www.raisetheriver.org/home/index.html.
Google recommends against using meta refreshes; a 301 redirect should be used if you want to redirect pages. However, I recommend against redirecting the root home page to another page at all, as a website's home page is expected to be at the root.
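For reference, the meta refresh in question looks something like this (the delay value is illustrative), and, assuming the site runs on Apache (an assumption; other servers have equivalents), you could instead serve the home page at the root and permanently redirect the old path back to it in .htaccess:

<!-- in the <head>: a meta refresh, discouraged for redirects -->
<meta http-equiv="refresh" content="0; url=http://www.raisetheriver.org/home/index.html">

# .htaccess (Apache assumed): permanently redirect the old path to the root
Redirect 301 /home/index.html /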
The second issue is that all the pages on the site are blocked from being indexed by Google, because the source code contains <meta name="robots" content="noindex">, which instructs search engines not to index the page.
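The fix is to remove that tag, or change it to explicitly allow indexing (this is also the default behavior when the tag is absent):

<meta name="robots" content="index, follow">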
Correct these issues and the site will be able to get indexed in Google, and sitemap generators will be able to crawl it.

Having subdirectories for the pages won't be an issue for web crawlers; even large sites like Amazon, eBay and many others organize their pages in subdirectories.
The errors may have occurred because your sitemap.xml or sitemap.html contains invalid or broken links that have been indexed by Google. You can generate a sitemap using http://www.xml-sitemaps.com/; I use it myself and it works perfectly.
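For reference, a minimal sitemap.xml has this shape (the URL and date are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/team/index.html</loc>
    <lastmod>2014-01-01</lastmod>
  </url>
</urlset>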
Also, please check manually in your cPanel whether all directories and pages are working. If you find any invalid links, you should be able to fix them.

Related

HTML and CSS website not showing up on search engines

I have a website that I made in HTML and CSS. I registered the domain and I am hosting it using Google Drive. If you go into a browser and type (websitename).com into the URL bar it works, but if you type it into the Google or Bing search engine it doesn't come up. About 20 results came up and none of them were my site. I am using godaddy.com for my domain name. Do I have to enable something? What am I doing wrong?
A few things you should know:
You need to go into Google Search Console (https://www.google.com/webmasters/tools/home) and Bing Webmaster Tools, add your website, and submit a sitemap.
Websites aren't crawled and indexed immediately; it takes time. Also, your website may never rank well: it depends on how relevant the search engines determine your content to be (see this article on SEO: http://searchengineland.com/guide/what-is-seo).
Also, you should post this in the Webmasters community, not here.

Users linking to old web pages

What would be the best way to handle this situation?
Company has two types of products, therefore two separate web pages to serve up each:
products-professional.html
products-consumer.html
Company changes structure and now does not want to list the products separately; the new page is:
products.html
According to Google Webmaster Tools, some sites have links to our old pages. I've added redirects to point them towards the new page, but the errors still show in Google Webmaster Tools. I don't want errors.
You should:
monitor GWT errors and add missing redirects (see the example below)
try to contact as many of the users linking to the old URLs as possible and ask them to fix their links
Since the second point is hard to achieve 100%, your redirects have to be bulletproof, and even then Google can find some weird URLs from a year ago and report errors.
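Assuming an Apache server (an assumption; the same mapping can be configured on any server), the redirects could live in .htaccess as permanent (301) redirects, which tell Google the old URLs have moved for good:

# .htaccess: permanently redirect the retired product pages to the new one
Redirect 301 /products-professional.html /products.html
Redirect 301 /products-consumer.html /products.html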

Google results only showing links of my main page

I have a website, futbolpasionatlanta.com, that has several pages that are internally linked. For some reason my Google results only show my main page (the index.php, or just www.futbolpasionatlanta.com); none of the internally linked pages show up in the results.
Any ideas what I can do to correct this?
Is it something I would change in my head tag?
Thanks,
If you want to encourage Google to crawl and index deeper into your site, you should try to get incoming links directly to those inner pages. The higher the quality of those links, the better.
It might take some time before Google indexes your whole site, but it will eventually be indexed.
For your next question:
You can use a robots.txt file and submit it in Webmaster Tools if you haven't done that yet. It can be used to block some pages from being crawled:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
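For example, a robots.txt file at the site root (the paths here are placeholders) blocks crawlers from the listed paths:

User-agent: *
Disallow: /private/
Disallow: /drafts/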

Finding number of pages of a website

I want to find the number of pages of a website. Usually I look for a sitemap, but I just encountered a site which does not have one, so I am out of ideas on how to find its total page count. I tried to Google the URL but that did not help much. Is there any other way to find out how many pages a website has?
Thanks in advance.
Ask Google "site:yourdomain.com"
This gives you all indexed pages.
Or use the free tool "Xenu". It crawls the whole site, but it won't find pages which have no internal links pointing to them. You can also export a sitemap with it.
I was about to suggest the same thing :) If this is a website you own, you can also add it to Google Webmaster Tools. It will show you lots of things about your site, including the number of links, pages, search terms, etc. It's very useful and free of charge.
I have found a better solution myself. You can go to Google Advanced Search and restrict the search results to your domain name, leaving everything else empty. It will give you the list of all pages cached by Google.
You could also try A1 Website Analyzer. But as with all link-checker software, you will have to make sure you configure it correctly to obey or not obey (whatever your needs are) robots.txt, noindex and nofollow instructions. (A common source of confusion in my experience.)

Crawling data or using API

How do these sites gather all their data - questionhub, bigresource, thedevsea, developerbay?
Is it legal to show data in a frame, as bigresource does?
#amazed
EDITED : fixed some spelling issues 20110310
How do these sites gather all data - questionhub, bigresource ...
Here's a very general sketch of what is probably happening in the background at a website like questionhub.com:
1. Spider program (google "spider program" to learn more; see the sketch after this list)
a. Configured to start reading web pages at stackoverflow.com (for example).
b. Run the program so it goes to the home page of stackoverflow.com and starts visiting all the links it finds on those pages.
c. Returns the HTML data from all of those pages.
2. Search index program
Reads the HTML data returned by the spider and creates a search index, storing the words it found AND what URLs those words were found at.
3. User interface web page
Provides a feature-rich user interface so you can search the sites that have been spidered.
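A minimal sketch of such a spider in Python (purely illustrative; what these sites actually run is unknown), using only the standard library:

import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl from start_url; returns {url: html}."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html  # a real indexer would tokenize words -> URL here
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urllib.parse.urljoin(url, link)
            if absolute.startswith("http"):
                queue.append(absolute)
    return pages

pages = crawl("https://stackoverflow.com/")  # needs network access to run
print(len(pages), "pages fetched")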
Is it legal to show data in a frame, as bigresource does?
To be technical, "it all depends" ;-)
Normally, websites want to be visible in Google, so why not in other search engines too? Just as Google displays part of the text that was found when a site was spidered, questionhub.com (or others) has chosen to show more of the text found on the original page, possibly keeping the formatting that was in the original HTML, or changing the formatting to fit their standard visual styling.
A remote site can 'request' that spiders do NOT go through some or all of its web pages by adding rules to a well-known file called robots.txt. Spiders do not have to honor robots.txt, but a vigilant website will track the IP addresses of spiders that do not honor its robots.txt file and then block those addresses from looking at anything on the website. You can find plenty of information about robots.txt here on Stack Overflow, or by running a query on Google.
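For example, a robots.txt like this (the bot name is a placeholder) asks one spider to stay out entirely while leaving the site open to everyone else:

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: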
There are several industries (besides Google) built around what you are asking. There are tags on Stack Overflow for search-engine and search; read some of those questions and answers. Lucene/Solr are open-source search engine components. There is a companion open-source spider, but the name eludes me right now. Good luck.
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, or give it a + (or -) as a useful answer. This goes for your other posts here too ;-)