Sitemaps

Sitemaps is an XML-based protocol that lets a webmaster inform search engines about the URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.
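To make the per-URL information concrete, the following is a minimal sketch of how such a file might be generated with Python's standard library. It assumes the element names defined by the sitemaps.org schema (`urlset`, `url`, `loc`, `lastmod`, `changefreq`, `priority`); the function name `build_sitemap` and the example.com URLs are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Sitemap 0.90 schema published at sitemaps.org.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as a string for a list of entry dicts.

    Each entry needs a "loc" key; "lastmod", "changefreq" and
    "priority" are optional and are written only when present.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                # ElementTree escapes &, < and > in text when serializing.
                ET.SubElement(url, tag).text = str(entry[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical entries, for illustration only.
entries = [
    {"loc": "https://www.example.com/", "lastmod": "2024-01-15",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "https://www.example.com/about?lang=en&ref=home", "priority": 0.5},
]
print(build_sitemap(entries))
```

Only `loc` is treated as required in this sketch; the other three values are hints that a crawler may or may not act on.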

For the graphical representation of the architecture of a web site, see site map.

History

Google first introduced Sitemaps 0.84 in June 2005 so web developers could publish lists of links from across their sites.[1] Google, Yahoo! and Microsoft announced joint support for the Sitemaps protocol in November 2006.[2] The schema version was changed to "Sitemap 0.90", but no other changes were made.


In April 2007, Ask.com and IBM announced support for Sitemaps.[3] Google, Yahoo! and MSN also announced auto-discovery for sitemaps through robots.txt. In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their web sites.[4]


The Sitemaps protocol is based on ideas[5] from "Crawler-friendly Web Servers,"[6] with improvements including auto-discovery through robots.txt and the ability to specify the priority and change frequency of pages.

Sitemaps are particularly beneficial on websites where:

Some areas of the website are not available through the browsable interface[7]

Webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines

The site is very large and there is a chance for the web crawlers to overlook some of the new or recently updated content[7]

The website has a huge number of pages that are isolated or not well linked together[7]

The website has few external links[7]

Other formats

Text file

The Sitemaps protocol allows the Sitemap to be a simple list of URLs in a text file. The file specifications of XML Sitemaps apply to text Sitemaps as well; the file must be UTF-8 encoded, and cannot be more than 50MiB (uncompressed) or contain more than 50,000 URLs. Sitemaps that exceed these limits should be broken up into multiple sitemaps with a sitemap index file (a file that points to multiple sitemaps).[9]
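As a rough illustration of these constraints, the sketch below writes a text Sitemap and checks both limits before returning it. It assumes the one-URL-per-line layout described by the sitemaps.org protocol; the function name `build_text_sitemap` and the constant names are hypothetical.

```python
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MiB, measured on the uncompressed file


def build_text_sitemap(urls):
    """Return the UTF-8 bytes of a text sitemap, one URL per line.

    Raises ValueError when either protocol limit is exceeded, in which
    case the URLs should be split across several sitemap files.
    """
    urls = list(urls)
    if len(urls) > MAX_URLS:
        raise ValueError(f"{len(urls)} URLs exceeds the limit of {MAX_URLS}")
    data = ("\n".join(urls) + "\n").encode("utf-8")
    if len(data) > MAX_BYTES:
        raise ValueError(f"{len(data)} bytes exceeds the limit of {MAX_BYTES}")
    return data


# Hypothetical URLs, for illustration only.
payload = build_text_sitemap([
    "https://www.example.com/",
    "https://www.example.com/contact",
])
print(payload.decode("utf-8"), end="")
```

Raising an error rather than silently truncating is a design choice of this sketch; a real generator would more likely split the list and emit a sitemap index.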

Syndication feed

A syndication feed is a permitted method of submitting URLs to crawlers; this is advised mainly for sites that already have syndication feeds. One stated drawback is that this method might provide crawlers only with more recently created URLs, although other URLs can still be discovered during normal crawling.[8]


It can be beneficial to have a syndication feed as a delta update (containing only the newest content) to supplement a complete sitemap.

Google - Webmaster Support on Sitemaps: "Using a sitemap doesn't guarantee that all the items in your sitemap will be crawled and indexed, as Google processes rely on complex algorithms to schedule crawling. However, in most cases, your site will benefit from having a sitemap, and you'll never be penalized for having one."[10]

Bing - Bing uses the standard sitemaps.org protocol and is very similar to the one mentioned below.

Yahoo - After the search deal commenced between Yahoo! Inc. and Microsoft, Yahoo! Site Explorer has merged with Bing Webmaster Tools.

Sitemap limits

Sitemap files have a limit of 50,000 URLs and 50MiB (52,428,800 bytes) per sitemap. Sitemaps can be compressed using gzip, reducing bandwidth consumption. Multiple sitemap files are supported, with a Sitemap index file serving as an entry point. A Sitemap index file may not list more than 50,000 Sitemaps, must be no larger than 50MiB, and can also be compressed. A site can have more than one Sitemap index file.[8]
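The splitting scheme described above can be sketched as follows. This is an illustrative outline rather than a reference implementation: it enforces only the 50,000-entry limit (not the 50MiB size limit), it assumes the `sitemapindex`, `sitemap` and `loc` element names from the sitemaps.org schema, and the helper names and file URLs are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_ENTRIES = 50_000  # applies to URLs per sitemap and sitemaps per index


def chunk_urls(urls, size=MAX_ENTRIES):
    """Split a URL list into consecutive chunks of at most `size` URLs."""
    urls = list(urls)
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML (str) pointing at the given sitemap files."""
    sitemap_urls = list(sitemap_urls)
    if len(sitemap_urls) > MAX_ENTRIES:
        raise ValueError("a sitemap index may list at most 50,000 sitemaps")
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def gzip_text(text):
    """Compress a sitemap or index for serving as a .gz file."""
    return gzip.compress(text.encode("utf-8"))


# Hypothetical site with 120,000 pages, for illustration only.
all_urls = [f"https://www.example.com/page/{n}" for n in range(120_000)]
chunks = chunk_urls(all_urls)
index_xml = build_sitemap_index(
    f"https://www.example.com/sitemap-{n}.xml.gz"
    for n in range(1, len(chunks) + 1)
)
print(len(chunks), "sitemap files referenced by one index")
```

Each chunk would then be written as its own sitemap file, and only the index needs to be submitted or advertised to crawlers.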


As with all XML files, any data values (including URLs) must use entity escape codes for the characters ampersand (&), single quote ('), double quote ("), less than (<), and greater than (>).
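For illustration, the five characters listed above can be escaped with Python's standard `xml.sax.saxutils.escape`, which handles `&`, `<` and `>` by default and accepts a mapping for the two quote characters. The wrapper name `escape_for_sitemap` is hypothetical, and the sketch assumes the value is being written into the XML by hand rather than through an XML library that already escapes text.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > on its own; the quotes are added here.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_for_sitemap(value):
    """Entity-escape a data value (such as a URL) for use in sitemap XML."""
    return escape(value, QUOTE_ENTITIES)


print(escape_for_sitemap("https://www.example.com/search?q=tea&page=2"))
# prints: https://www.example.com/search?q=tea&amp;page=2
```

Escaping should be applied exactly once; running it over an already escaped value would turn `&amp;` into `&amp;amp;`.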


Best practice for optimising a sitemap index for search engine crawlability is to ensure the index refers only to sitemaps as opposed to other sitemap indexes. Nesting a sitemap index within a sitemap index is invalid according to Google.[11]

Biositemap

Metadata

Resources of a Resource

Yahoo! Site Explorer

Google Webmaster Tools
