What is sitemap.xml - html

Hello fellow programmers,
I am building a website and I read about sitemap.xml, but I can't find a definition anywhere of what it is or what it contains.
Can someone help me? What does it do, what is it for, and what is in it?

http://www.sitemaps.org/ is the official resource.
The protocol page is probably the most important part of the entire site. It describes how to properly format your sitemap.xml file so that search engines can properly crawl your website.
from sitemaps.org
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

It provides search engines with information about the structure of your site. See the Wikipedia article.

It's an XML file that contains all of the URLs on your site, along with some other information that makes your site easier for search engines to crawl.
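To make the format concrete, here is a minimal sketch in Python (all URLs, dates and values below are placeholders) that writes a small sitemap.xml following the protocol described at sitemaps.org:

```python
# Generate a tiny sitemap.xml: one <url> entry per page, with the optional
# <lastmod>, <changefreq> and <priority> tags from the sitemaps.org protocol.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

pages = [
    {"loc": "http://www.example.com/", "lastmod": "2024-01-15",
     "changefreq": "daily", "priority": "1.0"},
    {"loc": "http://www.example.com/about", "lastmod": "2023-11-02",
     "changefreq": "monthly", "priority": "0.5"},
]

urlset = ET.Element("urlset", xmlns=NS)
for page in pages:
    url = ET.SubElement(urlset, "url")
    for tag, value in page.items():
        ET.SubElement(url, tag).text = value

# Write the file that you would place at the root of your site.
ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```

The resulting file is usually placed at the root of the site (e.g. /sitemap.xml) and can then be referenced from robots.txt or submitted to search engines.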

Related

How to display Wikipedia pages in my website?

I am building my own version of thewikigame.com. It is based on the concept of Wikiracing. On the website, the developer embeds the Wikipedia pages into his own website, and tracks how many clicks the user makes to reach the target page. I am not sure how he did this. What would be the best way possible to get the HTML + styling to display these pages on my own website?
As I understand from your question, you first need to download the whole content of enwiki to your website. To do this, you have to download the XML dump for English Wikipedia from this mirror, then import that XML into your wiki using one of these methods.
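As a rough sketch of the first step (downloading the dump), something like the following could work; the exact dump URL below is an assumption based on the usual naming at dumps.wikimedia.org, so check the dump listing for the current file (it is very large):

```python
# Stream the English Wikipedia dump to disk without loading it into memory.
# NOTE: the file name below is an assumption; confirm it on dumps.wikimedia.org.
import shutil
from urllib.request import urlopen

DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")

with urlopen(DUMP_URL) as response, open("enwiki-pages-articles.xml.bz2", "wb") as out:
    shutil.copyfileobj(response, out)
```

The downloaded XML can then be imported into your own MediaWiki install, for example with MediaWiki's importDump.php maintenance script.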

How to make my users' blogs visible in search engines?

On my web site, my users can create their own blogs.
When a user creates a blog, all of the blog content is saved in the database, and the content is loaded from the DB when someone requests it.
My question is: are these blogs searchable by search engines like Google?
If not, how do I make them searchable, and what are the ways I can improve their discoverability in search engines?
If your pages are rendered server side, your articles will be crawled by the bots and indexed by search engines; it's just a matter of time.
However, you can improve your chances of being indexed faster and better with these simple techniques:
Add correct meta tags in the HTML head of your pages (Meta tags)
Add a robots.txt file at the root of your site (Robots.txt)
Add a sitemap file at the root of your site (Sitemap)
Add a JSON-LD description of your blog and of each article in the head of your pages (JSON-LD)
Be sure to use semantic HTML for your content (Semantic HTML)
Provide social links and social pages that link back to your site
Those are basic, yet effective, ways to make sure your site is properly indexed by search engines.
You can also test your SEO ranking with online tools like rankgen.com.
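As a small illustration of two of the items in the list above (all URLs and field values are placeholders), here is how a robots.txt that points at your sitemap and a JSON-LD description of a blog post might be generated:

```python
# Build a robots.txt that references the sitemap, and a JSON-LD
# (schema.org BlogPosting) block to embed in each article's <head>.
import json

robots_txt = "\n".join([
    "User-agent: *",
    "Allow: /",
    "Sitemap: http://www.example.com/sitemap.xml",
])

blog_post_jsonld = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "My first post",
    "datePublished": "2024-01-15",
    "author": {"@type": "Person", "name": "Jane Doe"},
}

# Serve robots_txt at /robots.txt, and embed the JSON-LD in each article page:
script_tag = ('<script type="application/ld+json">'
              + json.dumps(blog_post_jsonld)
              + "</script>")
print(robots_txt)
print(script_tag)
```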

What's the best way to handle subdomains in a sitemap?

If you have a website www.yourdomain.com and a subdomain blog.yourdomain.com
(both sites containing similar information), what is the best sitemap setup?
Is it best to have one sitemap for both sites (and if so, what would it look like?), or two separate sitemaps?
Which would be most effective in terms of search traffic optimisation?
If the content is identical, use a rel="canonical" tag to tell Google (and the other search engines) which URL should be used to accrue page rank.
If you don't, Google will split your page rank 'juice' over both pages. Ideally, you want to concentrate your juice on one URL, as it will get a better page rank.
Choose either the main site or the subdomain to produce your sitemap for. It doesn't really hurt anything if you do both.
The rel="canonical" tags go in your HTML pages.
You can simply create a single sitemap at:
http://yourdomain.com/sitemap.xml and define all of your blog post URLs under this sitemap.
This sitemap will help you index the large archive of content pages on blog.yourdomain.com. That way, Google's crawlers can index everything from a single place in less time.
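If you prefer the other option (one sitemap per host), a sitemap index file can tie them together; here is a sketch (the child sitemap URLs are placeholders). Keep in mind that the protocol has location rules about which hosts a sitemap or index may reference, so check sitemaps.org before relying on a cross-host setup:

```python
# Generate a sitemap index file that lists one sitemap per host.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

index = ET.Element("sitemapindex", xmlns=NS)
for loc in ("http://www.yourdomain.com/sitemap.xml",
            "http://blog.yourdomain.com/sitemap.xml"):
    ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc

ET.ElementTree(index).write("sitemap_index.xml", encoding="utf-8",
                            xml_declaration=True)
```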

Finding number of pages of a website

I want to find the number of pages of a website. Usually what I look for is a sitemap, but I just encountered a site which does not have one, so I am out of ideas for how to find its total number of pages. I tried to Google the URL but that did not help much. Is there any other way to find out how many pages a website has?
Thanks in advance.
Ask Google "site:yourdomain.com"
This gives you all indexed pages.
Or use the free tool Xenu. It crawls the whole site, but it won't find pages which have no internal links pointing to them. You can also export a sitemap with it.
I was about to suggest the same thing :) If this is a website you own, you can also add it to Google Webmaster Tools. It will show you lots of things about your site, including the number of links, pages, search terms, etc. It's very useful and free of charge.
I have found a better solution myself: go to Google Advanced Search and restrict the search results to your domain name. Leave everything else empty. It will give you the list of all pages cached by Google.
You could also try A1 Website Analyzer. But with any link-checker software, you will have to make sure you configure it correctly to obey (or not obey, whatever your needs are) robots.txt, noindex and nofollow instructions. (A common source of confusion in my experience.)
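For a site with no sitemap, a rough sketch of the "crawl the whole site" approach mentioned above (standard-library Python, placeholder start URL) could look like this; like Xenu, it only finds pages that are linked internally, and it ignores robots.txt and JavaScript-generated links:

```python
# Breadth-first crawl of one domain, counting the HTML pages it can reach.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def count_pages(start_url, limit=500):
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), 0
    while queue and len(seen) < limit:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download
        pages += 1
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute, _ = urldefrag(urljoin(url, href))  # resolve and drop #fragments
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

print(count_pages("http://www.example.com/"))
```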

Crawling data or using API

How do these sites gather all their data - questionhub, bigresource, thedevsea, developerbay?
Is it legal to show data in a frame as bigresource does?
#amazed
EDITED : fixed some spelling issues 20110310
How these sites gather all data- questionhub, bigresource ...
Here's a very general sketch of what is probably happening in the background at a website like questionhub.com:
Spider program (google "spider program" to learn more)
a. Configured to start reading web pages at stackoverflow.com (for example).
b. Run so that it goes to the home page of stackoverflow.com and starts visiting all the links that it finds on those pages.
c. Returns the HTML data from all of those pages.
Search index program
Reads the HTML data returned by the spider and creates a search index, storing the words that it found AND the URLs those words were found at.
User interface web page
Provides a feature-rich user interface so you can search the sites that have been spidered.
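A toy version of the "search index program" step (the example pages and their text are made up) might look like this:

```python
# Build an inverted index: for each word, remember which URLs it appears on.
import re
from collections import defaultdict

def build_index(pages):
    """pages: dict of {url: html}. Returns {word: set of urls}."""
    index = defaultdict(set)
    for url, html in pages.items():
        text = re.sub(r"<[^>]+>", " ", html)            # crude tag removal
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

pages = {
    "http://example.com/a": "<h1>Sitemap basics</h1><p>XML list of URLs</p>",
    "http://example.com/b": "<p>Robots.txt and crawling</p>",
}
index = build_index(pages)
print(sorted(index["xml"]))       # -> ['http://example.com/a']
print(sorted(index["crawling"]))  # -> ['http://example.com/b']
```

The user-interface step is then just a lookup in that index for the words the visitor typed.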
Is it legal to show data in a frame as bigresource does?
To be technical, "it all depends" ;-)
Normally, websites want to be visible in Google, so why not in other search engines too?
Just as Google displays part of the text that was found when a site was spidered, questionhub.com (or others) has chosen to show more of the text found on the original page, possibly keeping the formatting that was in the original HTML or changing the formatting to fit their standard visual styling.
A remote site can 'request' that spiders do NOT go through some or all of its web pages by adding rules to a well-known file called robots.txt. Spiders do not have to honor robots.txt, but a vigilant website will track the IP addresses of spiders that do not honor its robots.txt file and block those IP addresses from looking at anything on the website. You can find plenty of information about robots.txt here on Stack Overflow or by running a query on Google.
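For a spider that wants to behave, Python's standard library already includes a robots.txt parser; a minimal check (the site and user-agent are placeholders) looks like this:

```python
# Check robots.txt before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # downloads and parses the robots.txt file

if rp.can_fetch("MySpider/1.0", "http://www.example.com/some/page.html"):
    print("allowed to crawl this page")
else:
    print("robots.txt asks us to stay away")
```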
There are several industries (besides Google) built around what you are asking. There are tags on Stack Overflow for search-engine and search; read some of those questions and answers. Lucene/Solr are open-source search engine components. There is a companion open-source spider, but the name eludes me right now. Good luck.
I hope this helps.
P.S. As you appear to be a new user: if you get an answer that helps you, please remember to mark it as accepted, or give it a + (or -) as a useful answer. This goes for your other posts here too ;-)