Users linking to old web pages - html

What would be the best way to handle this situation?
Company has two types of products, and therefore two separate web pages to serve each:
products-professional.html
products-consumer.html
Company changes structure and now does not want to list the products separately; the new page is:
products.html
According to Google Webmaster Tools, some sites have links to our old pages. I've added redirects on them to point towards the new page, but the errors still show in Google Webmaster Tools. I don't want errors.

You should:
monitor GWT errors and add missing redirects
try to contact as many of the users linking to the old URLs as possible and ask them to fix their links
Since the second point is hard to achieve 100%, your redirects have to be bulletproof, and even then Google can find some odd URLs from a year ago and report errors.
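If you want the redirects to be easy to verify, a small script can confirm that every old URL answers with a permanent (301) redirect pointing at the new page. A minimal sketch in Python, assuming the requests library; the domain and paths are placeholders:

    import requests

    # Hypothetical mapping of old URLs to the page they should 301 to.
    REDIRECTS = {
        "http://example.com/products-professional.html": "http://example.com/products.html",
        "http://example.com/products-consumer.html": "http://example.com/products.html",
    }

    for old, new in REDIRECTS.items():
        # allow_redirects=False lets us inspect the redirect response itself.
        resp = requests.get(old, allow_redirects=False, timeout=10)
        ok = resp.status_code == 301 and resp.headers.get("Location") == new
        print(old, "->", "OK" if ok else "FAIL (status %s)" % resp.status_code)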

Related

What are dpuf (extension) files?

I have seen this extension in some URLs and I would like to know what it is used for.
It seems odd, but I couldn't find any information about it. I think it is specific to some plug-in.
It seems to be connected to the 'ShareThis' buttons on those websites.
I found this page, which gives a quite comprehensive explanation:
This tag is mainly used for tracking URL sharing on various social networks, so every time anyone copies your blog content, they get a URL ending with #sthash and an extension of .dpuf or .dpbs.

Site crawling/indexing issues caused by link structure?

I'm doing SEO-type work for a client with many diverse site properties-- none of which I built myself. One of them in particular, to which I'm linking here, appears to be having issues being indexed by search engines. Interestingly, I've tried multiple sitemap generator tools and they too seem to have problems indexing the site; although the site is made up of only a few pages and external links, the sitemap tools-- and I suspect search engines-- are only seeing the homepage itself and nothing else.
In Google Webmaster Tools, I'm seeing a couple of crawl errors (404) relating to home/index.html but nothing else. Also, in Google Analytics, over 80% of the traffic is direct-- i.e. not search traffic-- which seems alarming. The site has been live for about a month and is being promoted by various sources. Even searching Google using the domain name itself doesn't bring the homepage up in results (!), let alone any related keywords.
My ultimate question is whether there appear to be any glaring issues with the code that might prevent proper indexing. I'm noticing that the developer chose to structure the navigation by naming directories, i.e. linking to "home/index.html," "team/index.html," "about/index.html," etc., when it seems more natural to name the HTML file itself, i.e. "team.html" and "about.html." Could this be part of the problem?
Thanks for any insight here.
You have two major issues here.
The first issue is that the root http://www.raisetheriver.org/ has a meta refresh that redirects the page to http://www.raisetheriver.org/home/index.html
Google recommends against using meta refresh; 301 redirects should be used if you want to redirect pages. However, I recommend against redirecting the root home page to another page, as a website's home page is expected to be the root.
The second issue is that all the pages on the site are blocked from being indexed in Google, as they have the following code in their source: <meta name="robots" content="noindex">, which instructs search engines not to index the page.
Correct these issues and the site will be able to get indexed in Google, and sitemap generators will be able to crawl it.
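To confirm both problems are gone after the fix, you can fetch a page and scan its source for the two offending tags. A rough sketch in Python, assuming the requests library (the regexes only cover the usual attribute order, so treat this as a sanity check, not a validator):

    import re
    import requests

    def check_page(url):
        # Fetch the page and look for a meta refresh or a robots noindex tag.
        html = requests.get(url, timeout=10).text.lower()
        if re.search(r'<meta[^>]+http-equiv=["\']?refresh', html):
            print(url, "- uses a meta refresh (prefer a 301 redirect)")
        if re.search(r'<meta[^>]+name=["\']?robots[^>]+noindex', html):
            print(url, "- blocked from indexing by a robots noindex tag")

    check_page("http://www.raisetheriver.org/home/index.html")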
Having subdirectories for the pages won't be an issue for web crawlers, because even large sites like Amazon, eBay, and many others have their pages arranged in subdirectories.
This error may have occurred because your sitemap.xml or sitemap.html contains invalid or broken links that have been indexed by Google. You can generate a sitemap using http://www.xml-sitemaps.com/ - I use it myself and it works perfectly.
Also, manually check in cPanel that all directories and pages are working. If you find any invalid links, you can fix them.
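If you'd rather not rely on an online generator, a basic sitemap.xml can be produced from a hand-maintained page list. A minimal sketch in Python; the URLs are just examples:

    from xml.sax.saxutils import escape

    # Hand-maintained list of pages (placeholder URLs).
    PAGES = [
        "http://www.example.com/",
        "http://www.example.com/team/index.html",
        "http://www.example.com/about/index.html",
    ]

    with open("sitemap.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in PAGES:
            f.write("  <url><loc>%s</loc></url>\n" % escape(url))
        f.write("</urlset>\n")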

Keep pageranks switching to Wordpress

We are doing really well in Google search results and have a high PageRank with our HTML website (around 30 pages).
Now we are switching to a WordPress website on the same domain and are keeping most of the HTML pages. But we are also building another WordPress site on a NEW domain, where we will showcase the hardware products (which we currently show on our existing domain in HTML).
How could we safely switch one half of the HTML pages to WordPress (on the same domain) and keep the PageRank, and move the other half to a WordPress site on a NEW domain and keep the PageRank?
Thanks in advance!
Try this tutorial. It's not quite the same, but it walks you through the important parts of a transfer to minimize loss of SEO.
Basically, make sure you keep all the current links to your pages working after the transfer (see the sketch after this list):
Import all posts, comments, and pages.
Maintain permalinks for posts and pages (a one-to-one mapping between Blogger.com and WordPress pages).
Redirect permalinks for labels and search archives.
Retain all feed subscribers.
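For pages whose paths change in the move, one way to keep the old links working is to generate permanent-redirect rules from a one-to-one old-to-new mapping. A minimal sketch in Python that writes Apache Redirect 301 lines into an .htaccess-style file; the paths are made up:

    # Hypothetical one-to-one mapping of old HTML paths to new
    # WordPress permalinks.
    MAPPING = {
        "/products.html": "/products/",
        "/about.html": "/about-us/",
    }

    with open("redirects.htaccess", "w") as f:
        for old_path, new_path in MAPPING.items():
            # One permanent redirect per moved page.
            f.write("Redirect 301 %s %s\n" % (old_path, new_path))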

Finding number of pages of a website

I want to find the number of pages of a website. Usually what I look for is a sitemap, but I just encountered a site which does not have one, so I am out of ideas for how to find its total number of pages. I tried to Google the URL, but that did not help much. Is there any other way to find out how many pages a website has?
Thanks in advance.
Ask Google "site:yourdomain.com"
This gives you all indexed pages.
Or use the free tool "Xenu". It crawls the whole site, but it won't find pages which have no internal links pointing to them. You can also export a sitemap with it.
I was about to suggest the same thing :) If this is a website you own, you can also add it to Google Webmaster Tools. It will show you lots of things about your site, including the number of links, pages, search terms, etc. It's very useful and free of charge.
I have found a better solution myself. You can go to Google Advanced Search and restrict the search results to your domain name. Leave everything else empty. It will give you the list of all pages cached by Google.
You could also try A1 Website Analyzer. But with all link-checker software, you have to make sure you configure it correctly to obey or ignore (whichever your needs are) robots.txt, noindex, and nofollow instructions. (A common source of confusion, in my experience.)
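To see roughly what a crawler like Xenu does, a small script can follow internal links from the home page and count the unique URLs it reaches. A minimal sketch in Python, assuming the requests library; the domain is a placeholder, and the naive href regex will also pick up stylesheets and the like:

    import re
    from urllib.parse import urljoin, urlparse

    import requests

    START = "http://www.example.com/"   # placeholder domain
    HOST = urlparse(START).netloc

    seen, queue = set(), [START]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        # Naive link extraction: resolve relative links, drop fragments,
        # and only follow links that stay on the same host.
        for href in re.findall(r'href=["\'](.*?)["\']', html):
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == HOST:
                queue.append(link)

    print(len(seen), "URLs found")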

Crawling data or using API

How do these sites gather all their data - questionhub, bigresource, thedevsea, developerbay?
Is it legal to show data in a frame, as bigresource does?
#amazed
EDITED : fixed some spelling issues 20110310
How do these sites gather all the data - questionhub, bigresource ...
Here's a very general sketch of what is probably happening in the background at a website like questionhub.com (a condensed code sketch follows the outline):
1. Spider program (google "spider program" to learn more)
a. Configured to start reading web pages at stackoverflow.com (for example).
b. Run the program so it goes to the home page of stackoverflow.com and starts visiting all links that it finds on those pages.
c. Returns the HTML data from all of those pages.
2. Search index program
Reads the HTML data returned by the spider and creates a search index, storing the words that it found AND the URLs where those words were found.
3. User-interface web page
Provides a feature-rich user interface so you can search the sites that have been spidered.
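Condensing the outline above into code: fetch a few pages, strip the markup, build an inverted index mapping words to URLs, and look a word up. A minimal sketch in Python, assuming the requests library; in a real spider the page list would be discovered by following links rather than hard-coded:

    import re
    from collections import defaultdict

    import requests

    # Stand-ins for pages a spider would have discovered by itself.
    PAGES = [
        "https://stackoverflow.com/",
        "https://stackoverflow.com/questions",
    ]

    index = defaultdict(set)  # word -> set of URLs where the word appears
    for url in PAGES:
        html = requests.get(url, timeout=10).text
        text = re.sub(r"<[^>]+>", " ", html)   # crude tag stripping
        for word in re.findall(r"[a-z]{3,}", text.lower()):
            index[word].add(url)

    # The "search" part: look a word up in the index.
    print(index.get("questions", set()))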
Is it legal to show data in a frame, as bigresource does?
To be technical, "it all depends" ;-)
Normally, websites want to be visible in Google, so why not in other search engines too. Just as Google displays part of the text that was found when a site was spidered, questionhub.com (or others) has chosen to show more of the text found on the original page, possibly keeping the formatting that was in the original HTML OR changing the formatting to fit their standard visual styling.
A remote site can 'request' that spiders do NOT go through some or all of its web pages by adding a rule in a well-known file called robots.txt. Spiders do not have to honor the robots.txt, but a vigilant website will track the IP addresses of spiders that do not honor its robots.txt file and then block those IP addresses from looking at anything on the website. You can find plenty of information about robots.txt here on stackoverflow OR by running a query on google.
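Python's standard library ships a robots.txt parser, so a well-behaved spider can check a URL before fetching it. A minimal sketch; the user-agent name is made up:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://stackoverflow.com/robots.txt")
    rp.read()
    # Would a spider identifying itself as "MySpider" be allowed here?
    print(rp.can_fetch("MySpider", "https://stackoverflow.com/questions"))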
There are several industries (besides Google) built around what you are asking. There are tags on stack-overflow for search-engine and search; read some of those questions and answers. Lucene/Solr are open-source search-engine components. There is a companion open-source spider, but the name eludes me right now. Good luck.
I hope this helps.
P.S. As you appear to be a new user: if you get an answer that helps you, please remember to mark it as accepted, or give it a + (or -) as a useful answer. This goes for your other posts here too ;-)