HTML and XML Sitemaps Improve SEO

Improving Website Performance

Fact: HTML sitemaps improve your customer’s on-site browsing experience. XML sitemaps improve the crawler’s experience, help Google find more of your content, and get it into the search index faster.

HTML Sitemaps

By Mark Sprague

Sitemaps are useful on a number of levels: they provide a directory for quickly determining the scope of a site’s content, where that content is located, and how it is organized. HTML sitemaps also give search engine crawlers a secondary way to discover content if they have problems parsing menu links.

  1. HTML Sitemaps give users a useful top-down view of what is available at the site, but keep in mind that web sites tend to grow organically over time, with pages added here and sections deleted there. Review the site map quarterly so that you are not directing users to 404 pages. Alternatively, use automated tools to regenerate the Sitemap on a schedule of your choosing.
  2. An HTML Sitemap on a custom 404 page is a good way to direct a crawler or a customer back into the content if they encounter a broken link.

HTML Sitemaps are generally organized in one of two ways: an outline format or a directory format. The outline format is probably best because you can use indenting to show sub-links. There are other formats, such as the cloud map, but they are cumbersome and not very useful to a customer.
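
As a rough illustration of the outline format, the Python sketch below renders a small page tree as nested HTML lists, so that the nesting shows the sub-links. The Page structure, the function name and the example.com URLs are hypothetical and exist only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html import escape


@dataclass
class Page:
    """One entry in the site outline (hypothetical structure for this sketch)."""
    title: str
    url: str
    children: list[Page] = field(default_factory=list)


def render_outline(pages: list[Page], depth: int = 0) -> str:
    """Render pages as nested <ul> lists; nesting shows the sub-links."""
    pad = "  " * depth
    lines = [f"{pad}<ul>"]
    for page in pages:
        link = f'<a href="{escape(page.url, quote=True)}">{escape(page.title)}</a>'
        if page.children:
            lines.append(f"{pad}  <li>{link}")
            lines.append(render_outline(page.children, depth + 2))
            lines.append(f"{pad}  </li>")
        else:
            lines.append(f"{pad}  <li>{link}</li>")
    lines.append(f"{pad}</ul>")
    return "\n".join(lines)


site = [
    Page("Home", "http://www.example.com/"),
    Page("Services", "http://www.example.com/services/", [
        Page("SEO", "http://www.example.com/services/seo/"),
    ]),
]
print(render_outline(site))
```

A generator like this could be run on a schedule so the HTML Sitemap does not drift out of date as pages are added and removed.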

XML Sitemaps

XML-Sitemaps are not meant for human consumption; they provide crawlers with an explicit roadmap to every page on your website. An XML-Sitemap is especially useful if your site displays content dynamically (database driven) or uses Web 2.0 techniques such as AJAX. Though these techniques help create visually pleasing, interactive sites, they tend to be harder for crawlers to index.

  1. XML Sitemaps provide crawlers with a road map unencumbered with JavaScript, Frames, Flash, dynamic IDs and poorly written source code. This increases the probability that the crawler will find and index all your content.

XML-Sitemaps are also useful if your website is brand new, that is, if there are very few backlinks to your site, which makes it harder for crawlers to find you. The same may be true if you have internal content silos that are not well cross-linked. In general, the larger and more complex your site is, the more value an XML-Sitemap provides at crawl time. XML-Sitemaps may not be needed if you have a simple site with few pages. However, Sitemaps are fairly easy to generate, so it never hurts to have one.

XML-Sitemap Standards

The current XML-Sitemap standard (0.9) is defined at http://sitemaps.org/ and has been adopted by all the major search engines. Google has taken this standard and created specialized XML-Sitemaps for:

  1. Video content
  2. News content
  3. Mobile content
  4. Geo content

These specialty sitemaps are important because Google has specialized indexes that are mutually exclusive. Your entire web site may be in Google’s massive general index, but that does not mean that your mobile friendly site, your news archive and video content will show up in the News / Mobile / Video search indexes, for example. There is more information about these special Sitemaps later in this document.

Alternate Sitemap Standards

There are other formats supported by the Sitemap Protocol. These include:

Syndication Feeds (for Blogs) such as RSS 2.0 feeds and Atom 1.0 / 0.3 feeds. Sites must already have a syndication feed to make use of this format.

  1. Most useful for submitting new URLs
  2. Supports <link> plus a modification date: <pubDate> in RSS 2.0, <updated> in Atom 1.0 and <modified> in Atom 0.3

Text Files (txt format):

  1. List of URLs only, fully qualified, including the protocol (http://)
  2. One URL to a line
  3. No meta data is allowed
  4. No header / footer information is allowed
  5. Must be saved as a UTF-8 encoded file
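
The text-file rules above are simple enough to sketch in a few lines of Python. The function names and the default file name are assumptions made for this example, not part of any standard tool.

```python
def build_text_sitemap(urls):
    """Return the body of a text Sitemap: one fully qualified URL per line."""
    lines = []
    for url in urls:
        url = url.strip()
        if not url.lower().startswith(("http://", "https://")):
            raise ValueError(f"URL must include the protocol: {url!r}")
        lines.append(url)
    # No header, footer or metadata: just the URLs.
    return "\n".join(lines) + "\n"


def write_text_sitemap(urls, path="sitemap.txt"):
    """Write the list to disk as UTF-8 (the file name is an assumption)."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(build_text_sitemap(urls))
```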

ROR.XML format (Resources of a Resource). Though this format is not widely supported, it was around first, and it has a rich set of tags that describe content. Some of the categories the ROR.XML format supports:

  1. Product feed
  2. Article feed
  3. Event feed
  4. Review feed
  5. Sitemap feed
  6. Job feed
  7. Classified feed
  8. Property feed
  9. People feed
  10. Song feed

XML-Sitemap Common Sense Rules of Thumb

Sitemaps have very strict requirements for submission to Google. These include:

  • Must adhere to the Sitemap protocol 0.9 standard.
  • A Sitemap file can not contain more than 50,000 URLs.
  • A Sitemap file must be no larger than 50MB uncompressed (earlier versions of the protocol set the limit at 10MB).
  • If you have more than one Sitemap file, list them in a Sitemap Index file.
  • The Sitemap needs to be referenced in the robots.txt file.
  • The Sitemap file must be UTF-8 encoded.
  • A Sitemap should be stored in the root directory.
  • All URLs in a Sitemap must be from a single host.
  • A sub-domain must have its own Sitemap.
  • Data values in URLs that contain an ampersand, a single or double quote, or a less-than/greater-than sign must use entity escape codes. See the next section for the list of codes.
  • URIs must conform to the RFC-3986 standard.
  • IRIs must conform to the RFC-3987 standard.
  • A single standard canonical URL domain name should be chosen and used.
  • Remove session IDs from the URLs.
  • Use only ASCII characters in the URL.
  • Upper ASCII characters (codes above 127) can not be used.
  • Don’t use control codes or special characters such as * and {}.
  • Punycodes for encoding international domain names are not supported.
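
Several of these rules can be checked mechanically before a Sitemap is generated. The Python sketch below covers only three of them (the URL count, the single-host rule and the ASCII-only rule) plus the URL length limit that the protocol sets for each entry. The function name and the wording of the messages are made up for illustration.

```python
from urllib.parse import urlsplit

MAX_URLS = 50_000       # per-file URL limit from the list above
MAX_LOC_LENGTH = 2_048  # each URL must be shorter than this


def check_sitemap_urls(urls):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if len(urls) > MAX_URLS:
        problems.append(f"{len(urls)} URLs exceeds the limit of {MAX_URLS}")
    hosts = {urlsplit(u).netloc.lower() for u in urls}
    if len(hosts) > 1:
        problems.append(f"URLs span multiple hosts: {sorted(hosts)}")
    for u in urls:
        if not u.isascii():
            problems.append(f"non-ASCII characters in {u!r}")
        if len(u) >= MAX_LOC_LENGTH:
            problems.append(f"URL too long: {u[:40]}...")
    return problems
```

A check like this does not replace validation against the published schema; it only catches a few common mistakes early.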

Entity Escape Codes

Escape codes for the characters that must be entity-escaped in Sitemap URLs:

Character       Symbol   Escape Code
Ampersand       &        &amp;
Single Quote    '        &apos;
Double Quote    "        &quot;
Greater Than    >        &gt;
Less Than       <        &lt;
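
If you build Sitemap entries by hand, the escaping can be done with Python’s standard library, as in this small sketch. The helper name escape_loc and the example URL are assumptions made for illustration.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the two quote characters
# are supplied as extra entities.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_loc(url):
    """Entity-escape a URL for use inside a <loc> element."""
    return escape(url, QUOTE_ENTITIES)


print(escape_loc("http://www.example.com/view?widget=3&height=5"))
# prints: http://www.example.com/view?widget=3&amp;height=5
```

Note that XML libraries which serialize a document for you generally escape text on output, so escaping by hand as well would double-escape the ampersand.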

Crawler Access to XML-Sitemaps

XML is the preferred format for a Sitemap, but RSS feeds and text files are also acceptable. The Sitemap will need to be registered with Google, Yahoo and MSN using the designated webmaster tools for each search engine. As part of that process you will specify the path to the XML-Sitemap, e.g. http://MSprague.com/sitemap.xml. The XML-Sitemap is not really submitted to Google; rather, it is stored in the root directory of your website. Your XML-Sitemap will also need to be referenced by your robots.txt file.
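
The robots.txt reference is a single "Sitemap:" line containing the full URL of the Sitemap. As a minimal sketch, the Python helper below appends that line to existing robots.txt content; the function name and the sample rules are assumptions.

```python
def add_sitemap_directive(robots_txt, sitemap_url):
    """Return robots.txt content with a 'Sitemap:' line for sitemap_url."""
    directive = f"Sitemap: {sitemap_url}"
    lines = robots_txt.splitlines()
    # Avoid adding the same reference twice.
    if any(line.strip() == directive for line in lines):
        return robots_txt
    lines.append(directive)
    return "\n".join(lines) + "\n"


current = "User-agent: *\nDisallow: /private/\n"
print(add_sitemap_directive(current, "http://MSprague.com/sitemap.xml"))
```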

Before you deploy your XML-Sitemap, you should validate it against the published schema (0.9 standard) and check it for errors. A number of tools can help validate your XML-Sitemap; a list is available at: http://www.xml.com/pub/a/2000/12/13/schematools.html
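
Short of full schema validation, a quick sanity check can catch the most common mistakes. The Python sketch below only verifies that the file is well-formed XML, that the root element is a urlset in the 0.9 namespace, and that every entry has a location. It is a lightweight pre-check under those assumptions, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def quick_check(xml_text):
    """Return a list of problems found; this is not full schema validation."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        return [f"unexpected root element: {root.tag}"]
    problems = []
    for i, url in enumerate(root.findall(f"{{{SITEMAP_NS}}}url"), start=1):
        loc = url.find(f"{{{SITEMAP_NS}}}loc")
        if loc is None or not (loc.text or "").strip():
            problems.append(f"<url> entry {i} has no <loc>")
    return problems
```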

Sitemap Generators

There are several categories of Sitemap generators available to web site developers.  They run the gamut from online services to class libraries. The major categories are:

Server-side tool kits available for 32/64 bit OS for Windows and Linux, written in PHP, Perl and Python. Two examples:

  1. Paid: http://www.softswot.com/sitemapinfo.php
  2. Free: http://www.smart-it-consulting.com/article.htm?node=154&page=82

Extensions and plugins for CMS, development platforms and publishing platforms such as .Net, Drupal and WordPress. Two examples:

  1. Paid: http://www.pc4people.com/products.php?cat=57
  2. Free: http://drupal.org/project/xmlsitemap

Applications that can be downloaded for free, or for a fee. Two examples:

  1. Paid: http://www.sitemappro.com/
  2. Free: http://www.vigos.com/products/gsitemap/

On-line services that will ingest your web site and create an XML-Sitemap. Two examples:

  1. Paid: http://www.autositemap.com/
  2. Free: http://www.xml-sitemaps.com/

Class libraries available for Java, Perl, ASP and PHP. Two examples:

  1. Free: http://www.phpclasses.org/browse/package/2612.html
  2. Free: http://www.iteam5.net/francesco/sitemap_gen/

XML-Sitemap Tag Definitions

Tag (status): Description

<urlset> (required): Encapsulates the file and references the current protocol standard.

<url> (required): Parent tag for each URL entry. The remaining tags are children of this tag.

<loc> (required): URL of the page. This URL must begin with the protocol (such as http) and end with a trailing slash, if your web server requires it. This value must be less than 2,048 characters.

<lastmod> (optional): The date of last modification of the file. This date should be in W3C Datetime format. This format allows you to omit the time portion, if desired, and use YYYY-MM-DD. Note that this tag is separate from the If-Modified-Since (304) header the server can return, and search engines may use the information from both sources differently.

<changefreq> (optional): How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

The value “always” should be used to describe documents that change each time they are accessed. The value “never” should be used to describe archived URLs.

Please note that the value of this tag is considered a hint and not a command. Even though search engine crawlers may consider this information when making decisions, they may crawl pages marked “hourly” less frequently than that, and they may crawl pages marked “yearly” more frequently than that. Crawlers may periodically crawl pages marked “never” so that they can handle unexpected changes to those pages.

<priority> (optional): The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. This value does not affect how your pages are compared to pages on other sites; it only lets the search engines know which pages you deem most important for the crawlers. The default priority of a page is 0.5. Please note that the priority you assign to a page is not likely to influence the position of your URLs in a search engine’s result pages. Search engines may use this information when selecting between URLs on the same site, so you can use this tag to increase the likelihood that your most important pages are present in a search index. Also, please note that assigning a high priority to all of the URLs on your site is not likely to help you. Since the priority is relative, it is only used to select between URLs on your site.
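
To show how these tags fit together, here is a minimal Python sketch that builds a Sitemap with the standard library. The function name and the dictionary-based input format are assumptions made for this example; the tag names and the namespace come from the 0.9 protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a <urlset> document from dicts with a required 'loc' key and
    optional 'lastmod', 'changefreq' and 'priority' keys."""
    ET.register_namespace("", SITEMAP_NS)  # emit the namespace as the default xmlns
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                child = ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}")
                child.text = str(entry[tag])  # ElementTree escapes &, < and >
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


print(build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
     "changefreq": "monthly", "priority": 0.8},
]).decode("utf-8"))
```

Because the library escapes text when it serializes the document, URLs are passed in unescaped here.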

XML-Sitemap Index Tag Definitions

Tag (status): Description

<sitemapindex> (required): Encapsulates information about all of the Sitemaps in the file.

<sitemap> (required): Encapsulates information about an individual Sitemap.

<loc> (required): Identifies the location of the Sitemap. This location can be a Sitemap, an Atom file, an RSS file or a simple text file.

<lastmod> (optional): Identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. The value for the lastmod tag should be in W3C Datetime format. By providing the last modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index, i.e. a crawler may only retrieve Sitemaps that were modified since a certain date. This incremental Sitemap fetching mechanism allows for the rapid discovery of new URLs on very large sites.

Example XML-Sitemap Index Code

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.MSprague.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
</sitemapindex>
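
For a large site, the URL list first has to be split into files that respect the 50,000 URL limit, and the resulting files are then listed in the index. The Python sketch below shows one way this might be done; the function names and the sitemapN.xml.gz naming pattern in the usage note are assumptions.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a URL list into groups that each fit in one Sitemap file."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_locs, lastmod=None):
    """Build a <sitemapindex> listing each Sitemap file's location."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc in sitemap_locs:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            # Applies the same timestamp to every file (a simplification).
            ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)
```

A caller would write one Sitemap per group returned by chunk_urls, then pass locations such as http://www.MSprague.com/sitemap1.xml.gz, sitemap2.xml.gz and so on to build_sitemap_index.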

XML-Video Sitemap Common Sense Rules of Thumb

  • Only URLs (to include the entire supporting page) that point to video content will be indexed. URLs without video content will be discarded.
  • Pages with multiple videos require a unique URL for each video.
  • URLs can point to video players.
  • A Video Sitemap can contain up to 10,000 videos.
  • A Video Sitemap must be 30MB in size or smaller.
  • Multiple Video Sitemaps must be submitted in a Video Sitemap Index file.
  • Google crawlers support indexing .mpg, .mpeg, .mp4, .mov, .wmv, .asf, .avi, .ra, .ram, .rm, and .flv files.
  • The Sitemap must be referenced by the robots.txt file.

XML-Video Sitemap Tags

Tag (status): Description

<loc> (required): Specifies the landing page (aka play page, referrer page) for the video. When a user clicks on a video result on a search results page, they will be sent to this landing page.

<video:video> (required): Parent element that encloses the video tags for one video.

<video:player_loc> (required): At least one of <video:player_loc> and <video:content_loc> is required. A URL pointing to a Flash player for a specific video. In general, this is the information in the "src" attribute of an <embed> tag. The required attribute allow_embed specifies whether Google can embed the video in search results. Allowed values are “Yes” or “No”.

<video:content_loc> (required): At least one of <video:player_loc> and <video:content_loc> is required. This should be a .mpg, .mpeg, .mp4, .mov, .wmv, .asf, .avi, .ra, .ram, .rm, .flv, or other video file format, and can be omitted if <video:player_loc> is specified.

<video:thumbnail_loc> (required): A URL pointing to the video thumbnail image file. Google accepts most image sizes and types but recommends thumbnails of at least 160×120 in .jpg, .png, or .gif format.

<video:title> (required): The title of the video. Limited to 100 characters.

<video:description> (required): The description of the video. Descriptions longer than 2048 characters will be truncated.

<video:rating> (optional): The rating of the video. The value must be a float number in the range 0.0-5.0.

<video:view_count> (optional): The number of times the video has been viewed.

<video:publication_date> (optional): The date the video was first published, in W3C format. Acceptable values are complete date (YYYY-MM-DD) and complete date plus hours, minutes and seconds (YYYY-MM-DDThh:mm:ss). Fraction and time zone suffixes are optional. For example, 2007-07-16T19:20:30+08:00.

<video:tag> (optional): A tag associated with the video. Tags are generally very short descriptions of key concepts associated with a video or piece of content. A single video could have several tags, although it might belong to only one category. For example, a video about grilling food may belong in the Grilling category, but could be tagged “steak”, “meat”, “summer”, and “outdoor”. Create a new <video:tag> element for each tag associated with a video. A maximum of 32 tags is permitted.

<video:category> (optional): The video’s category. For example, cooking. The value should be a string no longer than 256 characters. In general, categories are broad groupings of content by subject. Usually a video will belong to a single category. For example, a site about cooking could have categories for Broiling, Baking, and Grilling.

<video:family_friendly> (optional): “No” if the video should be available only to users with SafeSearch turned off.

<video:duration> (optional): The duration of the video in seconds. Value must be between 0 and 28800 (8 hours). Non-digit characters are disallowed.

<video:expiration_date> (optional): The date after which the video will no longer be available, in W3C format. Acceptable values are complete date (YYYY-MM-DD) and complete date plus hours, minutes and seconds (YYYY-MM-DDThh:mm:ss). Fraction and time zone suffixes are optional. For example, 2007-07-16T19:20:30+08:00.
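
As an illustration of how the required video tags combine, the Python sketch below builds a single Video Sitemap entry and enforces the title and duration limits from the table. The function name, its parameters and the namespace URI for the video prefix are assumptions made for this example, so check them against Google’s current documentation before relying on them.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"  # assumed namespace URI


def build_video_url(page_url, content_url, thumbnail_url, title, description,
                    duration=None):
    """Build one <url> element carrying a <video:video> block."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("video", VIDEO_NS)
    if len(title) > 100:
        raise ValueError("title is limited to 100 characters")
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    video = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
    fields = [
        ("thumbnail_loc", thumbnail_url),
        ("title", title),
        ("description", description[:2048]),  # longer text would be truncated anyway
        ("content_loc", content_url),
    ]
    if duration is not None:
        if not 0 <= duration <= 28800:
            raise ValueError("duration must be between 0 and 28800 seconds")
        fields.append(("duration", str(int(duration))))
    for tag, value in fields:
        ET.SubElement(video, f"{{{VIDEO_NS}}}{tag}").text = value
    return url
```

The returned element can be appended to a urlset root element to form the complete Video Sitemap.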

XML-Mobile Sitemaps

The Google Mobile Sitemap uses the 0.9 Sitemap protocol, with a new tag and namespace. Here is an example of a Mobile XML-Sitemap containing the listing of one content entry:

<?xml version="1.0" encoding="UTF-8" ?>
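
Only the XML declaration of the example is shown above. As a rough sketch of what a complete one-entry Mobile Sitemap might contain, the Python below generates one. The marker tag (assumed here to be an empty mobile:mobile element), its namespace URI, the function name and the example URL are all assumptions to verify against Google’s documentation.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MOBILE_NS = "http://www.google.com/schemas/sitemap-mobile/1.0"  # assumed namespace URI


def build_mobile_sitemap(urls):
    """Build a <urlset> in which every entry carries the mobile marker tag."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("mobile", MOBILE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc in urls:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{MOBILE_NS}}}mobile")  # empty marker element
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


print(build_mobile_sitemap(["http://mobile.example.com/article100.html"]).decode("utf-8"))
```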

Currently, the mobile protocol supports the following markup languages:

  1. XHTML (WAP 2.0 – mobile profile)
  2. cHTML (iMode)
  3. WML (WAP 1.2)

A couple of things to keep in mind: not all site map generators can generate mobile sitemaps, and if you do not include the <mobile:mobile/> tag, your content will not be considered mobile and will not be crawled for the mobile index.

XML-News Sitemaps

XML-News Sitemaps are for breaking news only, and have a life of 72 hours. The Sitemap requires the “Sitemap-News” namespace, a publication date and optionally, keywords – which are Google news categories.

Example of a News sitemap containing a single article:

<?xml version="1.0" encoding="UTF-8"?>
<news:keywords>Business, Mergers, Acquisitions</news:keywords>
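
The example above is incomplete, so here is a hedged Python sketch of how a single News Sitemap entry with a publication date and keywords might be generated. The function name, the example URL and the namespace URI are assumptions, and Google’s News Sitemap format has further required tags that this article does not cover, so treat the output as partial.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"  # assumed namespace URI


def build_news_url(loc, published, keywords=()):
    """Build one <url> element with a <news:news> block.

    published should be a timezone-aware datetime.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
    # Include the time of day, not just the date (W3C Datetime format, UTC).
    stamp = published.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
    ET.SubElement(news, f"{{{NEWS_NS}}}publication_date").text = stamp
    if keywords:
        ET.SubElement(news, f"{{{NEWS_NS}}}keywords").text = ", ".join(keywords)
    return url


entry = build_news_url(
    "http://www.example.com/business/article55.html",
    datetime(2008, 12, 23, 10, 30, tzinfo=timezone.utc),
    ["Business", "Mergers", "Acquisitions"],
)
print(ET.tostring(entry, encoding="unicode"))
```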

Rules of Thumb for News Sitemaps:

  • An XML-News Sitemap can contain no more than 1,000 URLs.
  • The <news:publication_date> tag should also contain a time stamp.
  • The Google crawler will not index news articles older than 72 hours.
  • News older than 72 hours is still accessible from the Google News Archive.
  • If a Google News Category does not exist for your article, you have the option to create a new category.

XML-Geo Sitemaps

Geo Sitemaps use the KML, KMZ and GeoRSS markup languages to provide location-specific information for the Google geo-search index using the <geo:geo> tag.


Geo-specific tag definition

Tag (status): Description

<geo:format> (required): Case-insensitive. Specifies the format of the geo content. Examples include “kml” and “georss”.


The following table lists the formats supported by Geo Sitemaps. You should use the Short Name to create your Geo Sitemap.

Short Name   Full Name   Description
kml          KML         KML file
kmz          KMZ         KMZ archive
georss       GeoRSS      RSS with GeoRSS extensions
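
A Geo Sitemap entry points at a KML, KMZ or GeoRSS file and names its format using one of the short names above. The Python sketch below builds one such entry. The geo:geo wrapper element, the namespace URI, the function name and the example URL are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
GEO_NS = "http://www.google.com/geo/schemas/sitemap/1.0"  # assumed namespace URI
GEO_FORMATS = {"kml", "kmz", "georss"}  # short names from the table above


def build_geo_url(loc, geo_format):
    """Build one <url> element that declares the format of a geo file."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("geo", GEO_NS)
    short_name = geo_format.lower()  # the value is case-insensitive
    if short_name not in GEO_FORMATS:
        raise ValueError(f"unsupported geo format: {geo_format!r}")
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    geo = ET.SubElement(url, f"{{{GEO_NS}}}geo")  # assumed wrapper element
    ET.SubElement(geo, f"{{{GEO_NS}}}format").text = short_name
    return url


print(ET.tostring(build_geo_url("http://www.example.com/office.kml", "KML"),
                  encoding="unicode"))
```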


The KML (Keyhole Markup Language) is an XML-based schema that is used to support 2-D and 3-D browsing primarily using longitude and latitude coordinates. Other tags include: images, polygons, 3D models and textual descriptions. KML was originally developed for use in Google Earth.

An example of KML code:

<kml xmlns="http://www.opengis.net/kml/2.2">
<Placemark><name>Lexington ebusiness Consulting</name>
<description>Social Media</description>
</Placemark></kml>

KMZ files are zipped KML files, and are used for distribution purposes.

GeoRSS comes in two flavors: GeoRSS-GML and GeoRSS-Simple. It is primarily used to encode geo-related information (e.g. bridges and roads) in articles and blogs, using syndicated feeds for distribution.

Find Out More

Get in touch if you would like to find out more about improving website performance. Call me now at 781-862-3126 or send me an email at: Mark@Msprague.com

Lexington eBusiness Consulting

About Lexington eBusiness Consulting

Providing comprehensive SEO services to the Boston community…

Mark Sprague’s 25 years of product development experience include expertise in Search Engines, Information Products, SEO platforms and Social Networking applications. That expertise can help you refine products and services, and improve your search engine marketing and your website’s performance, by:

  • Developing a superior data-driven SEO strategy for your website.
  • Understanding your customers’ search behavior and normalizing it to your content strategy.
  • Understanding how search engine technology practically impacts SEO and content strategies.
  • Understanding how search technology impacts content in a social networking environment.
  • Developing a superior user experience based on sound information architecture, usability and coding standards.

Lexington eBusiness Consulting
Mark Sprague,  CEO
580 Lowell Street
Lexington, MA 02420

1 Comment

  1. Peter
    Feb 28, 2013 @ 10:51:22

    how do you use geosite map and kml file on Multi site domain?

