HTML and XML Sitemaps Improve SEO
Improving Website Performance
Fact: HTML sitemaps improve your customer’s on-site browsing experience. XML sitemaps improve the crawler’s experience, help Google find more of your content, and get it into the search index faster.
Sitemaps are useful at a number of levels – they provide a directory for quickly determining content scope, where content is located, and how the content is organized on a web site. HTML sitemaps also give search engine crawlers a secondary method to discover content if they have problems parsing menu links.
- While HTML Sitemaps give users a useful top-down view of what is available at the site – keep in mind that web sites tend to grow organically over time – adding pages here, deleting sections there. A site map should be reviewed on a quarterly basis so that you are not directing users to 404 pages. On the other hand you can use automated tools to generate a new Sitemap on the schedule of your choosing.
- An HTML Sitemap on a custom 404 page is great for redirecting a crawler or a customer back into the content if they happen to encounter a broken link.
HTML Sitemaps are generally organized in one of two ways: an outline format or a directory format. The outline format is probably best because you can use indenting to show the existence of sub-links. There are other formats, such as the cloud map; however, these formats are cumbersome and not all that useful to a customer.
XML-Sitemaps are not meant for human consumption – they provide crawlers with an explicit roadmap for every page on your website. An XML-Sitemap is especially useful if your site displays content dynamically (database driven) or uses web 2.0 tools such as AJAX. Though these tools help create visually pleasing interactive sites, they tend to be more problematic for crawlers to index.
XML-Sitemaps are also useful if your website is brand new – that is, if there are very few back-links to your site, making it harder for crawlers to find you. This may also be true if you have internal content silos that are not well cross-linked. In general, the larger and more complex your site is, the more value an XML-Sitemap provides at crawl time. XML-Sitemaps may not be needed if you have a simple site with few pages. However, Sitemaps are fairly easy to generate, so it never hurts to have one.
The current XML-Sitemap standard (0.9) was defined by http://sitemaps.org/, and was adopted by all the major search engines. Google has taken this standard and created specialized XML-Sitemaps for:
- Video content
- News content
- Mobile content
- Geo content
These specialty sitemaps are important because Google has specialized indexes that are mutually exclusive. Your entire web site may be in Google’s massive general index, but that does not mean that your mobile friendly site, your news archive and video content will show up in the News / Mobile / Video search indexes, for example. There is more information about these special Sitemaps later in this document.
Alternate Sitemap Standards
There are other formats supported by the Sitemap Protocol. These include:
Syndication Feeds (for Blogs) such as RSS 2.0 feeds and Atom 1.0 / 0.3 feeds. Sites must already have a syndication feed to make use of this format.
- Most useful for submitting new URLs
- Supports <Link>, <PubDate> and <Modified>
Text Files (txt format):
- List of URLs only, fully specified including the HTTP://
- One URL to a line
- No meta data is allowed
- No header / footer information is allowed
- Must be saved as a UTF-8 encoded file
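The text-file rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the example.com URLs are hypothetical and are not part of the Sitemap protocol.

```python
def build_text_sitemap(urls):
    """Return the body of a plain-text Sitemap: one absolute URL per line."""
    lines = []
    for url in urls:
        url = url.strip()
        # Every entry must be fully specified, including the protocol.
        if not url.lower().startswith(("http://", "https://")):
            raise ValueError(f"URL must include the protocol: {url!r}")
        lines.append(url)
    # No header, footer, or metadata lines: just the URLs.
    return "\n".join(lines) + "\n"


def write_text_sitemap(urls, path):
    """Save the Sitemap as a UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(build_text_sitemap(urls))
```

Calling `write_text_sitemap(["http://www.example.com/"], "sitemap.txt")` would produce a one-line file that follows the rules listed above.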
ROR.XML format (Resources of a Resource). Though this format is not widely supported, it was around first, and has a rich set of tags that describe content. Some of the categories the ROR.XML format supports:
- Product feed
- Article feed
- Event feed
- Review feed
- Sitemap feed
- Job feed
- Classified feed
- Property feed
- People feed
- Song feed
XML-Sitemap Common Sense Rules of Thumb
Sitemaps have very strict requirements for submission to Google. These include:
- Must adhere to the Sitemap protocol 0.9 standard.
- A Sitemap file cannot contain more than 50K URLs.
- A Sitemap file must be under 10MB in size.
- If you have more than one Sitemap file, list them in a Sitemap Index file.
- The Sitemap needs to be referenced in the robots.txt file.
- The Sitemap file must be UTF-8 encoded.
- A Sitemap should be stored in the Root Directory.
- All URLs in a Sitemap must be from a single host.
- A sub-domain must have its own Sitemap.
- URLs that contain an ampersand, quotes, or less-than/greater-than signs as data values must use entity escape codes. See the next section for the list of codes.
- URIs must conform to the RFC-3986 standard.
- IRIs must conform to the RFC-3987 standard.
- A single standard canonical URL domain name should be chosen and used.
- Remove session IDs from the URLs.
- Use only ASCII characters in the URL.
- Upper ASCII characters can not be used (i.e. *).
- Don’t use control codes and special characters.
- Punycodes for encoding international domain names are not supported.
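Two of the rules above, the 50K URL limit per file and the single-host requirement, are easy to check before a Sitemap is written. The following is a minimal sketch with hypothetical helper names; it does not check file size or any of the other rules.

```python
from urllib.parse import urlsplit

MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the list above


def split_into_sitemaps(urls, limit=MAX_URLS_PER_FILE):
    """Split a URL list into chunks that each fit in one Sitemap file."""
    urls = list(urls)
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


def hosts_in(urls):
    """Return the set of host names used, so mixed-host lists can be caught."""
    return {urlsplit(url).netloc.lower() for url in urls}
```

If `split_into_sitemaps` returns more than one chunk, each chunk becomes its own Sitemap file and the files are listed in a Sitemap Index. If `hosts_in` returns more than one host, the list should be separated by host first.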
Entity Escape Codes
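Acceptable escape codes for selected ASCII characters:
|Character||Escape Code|
|Ampersand (&)||&amp;|
|Single quote (')||&apos;|
|Double quote (")||&quot;|
|Greater than (>)||&gt;|
|Less than (<)||&lt;|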
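As an illustration, the standard XML entity escapes for the ampersand, quotes and less/greater than can be applied with Python's standard library. `xml.sax.saxutils.escape` handles `&`, `<` and `>` by default, and the two quote characters are passed in as extra entities. The function name `escape_url` and the sample URL are hypothetical.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > on its own; quotes are supplied explicitly.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_url(url):
    """Entity-escape a URL so it can be placed inside a Sitemap <loc> tag."""
    return escape(url, QUOTE_ENTITIES)


print(escape_url("http://www.example.com/view?id=1&name='a'"))
```

XML libraries that build the document for you, such as `xml.etree.ElementTree`, apply this escaping automatically when serializing, so a helper like this is mainly useful when the Sitemap is assembled as plain strings.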
Crawler Access to XML-Sitemaps
The preferred format for a Sitemap is XML, but RSS feeds and text files are acceptable. The Sitemap will need to be registered with Google, Yahoo and MSN using the designated webmaster tools for each search engine. As part of that process you will specify the path to the XML-Sitemap, for example http://MSprague.com/sitemap.xml. The XML-Sitemap is not really submitted to Google; rather, it is stored in the root directory of your website. Your XML-Sitemap will also need to be referenced by your robots.txt file.
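The robots.txt reference is a single `Sitemap:` line containing the full URL of the Sitemap. The sketch below adds that line to existing robots.txt content and reads it back with `urllib.robotparser` from the Python standard library (`site_maps()` needs Python 3.8 or later). The helper names and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def add_sitemap_reference(robots_txt, sitemap_url):
    """Return robots.txt content with a 'Sitemap:' line for sitemap_url."""
    line = f"Sitemap: {sitemap_url}"
    if line in (existing.strip() for existing in robots_txt.splitlines()):
        return robots_txt  # already referenced, nothing to do
    body = robots_txt.rstrip("\n")
    return (body + "\n\n" if body else "") + line + "\n"


def listed_sitemaps(robots_txt):
    """Return the Sitemap URLs that a robots.txt body declares."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() is None when none are listed
```

The `Sitemap:` directive is independent of any `User-agent` group, which is why the sketch simply appends it at the end of the file.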
Before you deploy your XML-Sitemap, you should validate it against the published schema (0.9 standard) and check it for errors. A number of tools are available that can help validate your XML-Sitemap. Here is a list of available tools: http://www.xml.com/pub/a/2000/12/13/schematools.html
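Short of full schema validation, a quick pre-flight check can catch the most common mistakes. The sketch below uses only the Python standard library and checks three things: that the file is well-formed XML, that the root element is `urlset` in the 0.9 namespace, and that every entry has a location. It is not a substitute for validating against the published schema, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_sitemap(xml_text):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append("root element is not <urlset> in the 0.9 namespace")
    for index, url in enumerate(root.findall(f"{{{SITEMAP_NS}}}url"), start=1):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc")
        if not loc or not loc.strip():
            problems.append(f"<url> entry {index} has no <loc>")
    return problems
```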
Useful search engine submission information:
There are several categories of Sitemap generators available to web site developers. They run the gamut from online services to class libraries. The major categories are:
Server-side tool kits available for 32/64 bit OS for Windows and Linux, written in PHP, Perl and Python. Two examples:
- Paid: http://www.softswot.com/sitemapinfo.php
- Free: http://www.smart-it-consulting.com/article.htm?node=154&page=82
Extensions and plugins for CMS, development platforms and publishing platforms such as .Net, Drupal and WordPress. Two examples:
Applications that can be downloaded for free, or for a fee. Two examples:
On-line services that will ingest your web site and create an XML-Sitemap. Two examples:
Class libraries available for Java, Perl, ASP and PHP. Two examples:
- Free: http://www.phpclasses.org/browse/package/2612.html
- Free: http://www.iteam5.net/francesco/sitemap_gen/
XML-Sitemap Tag Definitions
|<urlset>||required||Encapsulates the file and references the current protocol standard.|
|<url>||required||Parent tag for each URL entry. The remaining tags are children of this tag.|
|<loc>||required||URL of the page. This URL must begin with the protocol (such as http) and end with a trailing slash, if your web server requires it. This value must be less than 2,048 characters.|
|<lastmod>||optional||The date of last modification of the file. This date should be in W3C Datetime format. This format allows you to omit the time portion, if desired, and use YYYY-MM-DD. Note that this tag is separate from the If-Modified-Since (304) header the server can return, and search engines may use the information from both sources differently.|
|<changefreq>||optional||How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are: always, hourly, daily, weekly, monthly, yearly, never. The value “always” should be used to describe documents that change each time they are accessed. The value “never” should be used to describe archived URLs. Please note that the value of this tag is considered a hint and not a command. Even though search engine crawlers may consider this information when making decisions, they may crawl pages marked “hourly” less frequently than that, and they may crawl pages marked “yearly” more frequently than that. Crawlers may periodically crawl pages marked “never” so that they can handle unexpected changes to those pages.|
|<priority>||optional||The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. This value does not affect how your pages are compared to pages on other sites; it only lets the search engines know which pages you deem most important for the crawlers. The default priority of a page is 0.5. Please note that the priority you assign to a page is not likely to influence the position of your URLs in a search engine’s result pages. Search engines may use this information when selecting between URLs on the same site, so you can use this tag to increase the likelihood that your most important pages are present in a search index. Also, please note that assigning a high priority to all of the URLs on your site is not likely to help you. Since the priority is relative, it is only used to select between URLs on your site.|
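To show how the Sitemap tags fit together (a `urlset` root, one `url` entry per page, and the location, last-modified, change-frequency and priority children), here is a minimal generator sketch using Python's standard `xml.etree.ElementTree`. The function name, the dictionary layout and the example.com URL are assumptions made for illustration; `xml_declaration=True` needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build Sitemap XML (UTF-8 bytes) from a list of dicts keyed by tag name."""
    ET.register_namespace("", SITEMAP_NS)  # emit xmlns="..." without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        if "loc" not in entry:
            raise ValueError("every entry needs a 'loc' value")
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # 'loc' is required; the other three tags are optional.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = str(entry[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


sitemap_bytes = build_sitemap([
    {"loc": "http://www.example.com/", "changefreq": "weekly", "priority": 0.8},
])
```

ElementTree entity-escapes characters such as the ampersand when it serializes, so URLs can be passed in unescaped.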
XML-Sitemap Index Tag Definitions
|<sitemapindex>||required||Encapsulates information about all of the Sitemaps in the file.|
|<sitemap>||required||Encapsulates information about an individual Sitemap.|
|<loc>||required||Identifies the location of the Sitemap. This location can be a Sitemap, an Atom file, RSS file or a simple text file.|
|<lastmod>||optional||Identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. The value for the lastmod tag should be in W3C Datetime format. By providing the last modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index, i.e. a crawler may only retrieve Sitemaps that were modified since a certain date. This incremental Sitemap fetching mechanism allows for the rapid discovery of new URLs on very large sites.|
Example XML-Sitemap Index Code
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.MSprague.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
</sitemapindex>
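The incremental fetching that the last-modified timestamp enables can be sketched from the crawler's side: read the Sitemap Index and keep only the Sitemaps modified after a given date. This is a simplified illustration, not how any particular search engine works. The function names are hypothetical, and date-only values are assumed to be UTC.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def _parse_w3c(value):
    """Parse a W3C Datetime value; date-only values are treated as UTC."""
    stamp = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    return stamp if stamp.tzinfo else stamp.replace(tzinfo=timezone.utc)


def sitemaps_modified_since(index_xml, since):
    """Return Sitemap locations whose lastmod is later than 'since' (aware datetime)."""
    root = ET.fromstring(index_xml)
    changed = []
    for sitemap in root.findall("sm:sitemap", NS):
        loc = sitemap.findtext("sm:loc", namespaces=NS)
        if loc is None:
            continue
        lastmod = sitemap.findtext("sm:lastmod", namespaces=NS)
        # Without a lastmod the file cannot be ruled out, so it is fetched.
        if lastmod is None or _parse_w3c(lastmod) > since:
            changed.append(loc.strip())
    return changed
```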
XML-Video Sitemap Common Sense Rules of Thumb
- Only URLs (to include the entire supporting page) that point to video content will be indexed. URLs without video content will be discarded.
- Pages with multiple videos require a unique URL for each video.
- URLs can point to video players.
- A Video Sitemap can contain up to 10K videos.
- A Video Sitemap must be 30MB in size or smaller.
- Multiple Video Sitemaps must be submitted in a Video Sitemap Index file.
- Google crawlers support indexing .mpg, .mpeg, .mp4, .mov, .wmv, .asf, .avi, .ra, .ram, .rm, and .flv files.
- The Sitemap must be referenced by the robots.txt file.
XML-Video Sitemap Tags
|<loc>||Required||The tag specifies the landing page (aka play page, referrer page) for the video. When a user clicks on a video result on a search results page, they will be sent to this landing page.|
|<video:player_loc>||Required||At least one of <video:player_loc> or <video:content_loc> is required.|
|<video:content_loc>||Required||At least one of <video:player_loc> or <video:content_loc> is required.|
|<video:thumbnail_loc>||Required||A URL pointing to the video thumbnail image file. Most image sizes and types are accepted, but thumbnails of at least 160×120 in .jpg, .png, or .gif format are recommended.|
|<video:title>||Required||The title of the video. Limited to 100 characters.|
|<video:description>||Required||The description of the video. Descriptions longer than 2048 characters will be truncated.|
|<video:rating>||Optional||The rating of the video. The value must be a floating-point number in the range 0.0-5.0.|
|<video:view_count>||Optional||The number of times the video has been viewed.|
|<video:publication_date>||Optional||The date the video was first published, in W3C format. Acceptable values are complete date (YYYY-MM-DD) and complete date plus hours, minutes and seconds (YYYY-MM-DDThh:mm:ss). Fraction and time zone suffixes are optional. For example, 2007-07-16T19:20:30+08:00.|
|<video:tag>||Optional||A tag associated with the video. Tags are generally very short descriptions of key concepts associated with a video or piece of content. A single video could have several tags, although it might belong to only one category. For example, a video about grilling food may belong in the Grilling category, but could be tagged “steak”, “meat”, “summer”, and “outdoor”. Create a new <video:tag> element for each tag associated with a video.|
|<video:category>||Optional||The video’s category.|
|<video:family_friendly>||Optional||“No” if the video should be available only to users with SafeSearch turned off.|
|<video:duration>||Optional||The duration of the video in seconds. Value must be between 0 and 28800 (8 hours). Non-digit characters are disallowed.|
|<video:expiration_date>||Optional||The date after which the video will no longer be available, in W3C format. Acceptable values are complete date (YYYY-MM-DD) and complete date plus hours, minutes and seconds (YYYY-MM-DDThh:mm:ss). Fraction and time zone suffixes are optional. For example, 2007-07-16T19:20:30+08:00.|
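The limits in the table above lend themselves to a simple pre-submission check. The sketch below takes one video entry as a dictionary keyed by the tag names without the `video:` prefix, which is a layout chosen only for this example, and reports which constraints it breaks. The function name and sample values are hypothetical.

```python
def check_video_entry(video):
    """Return a list of problems for one Video Sitemap entry (empty if none)."""
    problems = []
    for field in ("loc", "thumbnail_loc", "title", "description"):
        if not video.get(field):
            problems.append(f"missing required field: {field}")
    if not (video.get("player_loc") or video.get("content_loc")):
        problems.append("need at least one of player_loc or content_loc")
    if len(video.get("title", "")) > 100:
        problems.append("title is longer than 100 characters")
    if len(video.get("description", "")) > 2048:
        problems.append("description exceeds 2048 characters and will be truncated")
    rating = video.get("rating")
    if rating is not None and not 0.0 <= float(rating) <= 5.0:
        problems.append("rating must be between 0.0 and 5.0")
    duration = video.get("duration")
    # Duration is whole seconds, digits only, up to 28800 (8 hours).
    if duration is not None and not (
        str(duration).isdigit() and 0 <= int(duration) <= 28800
    ):
        problems.append("duration must be 0-28800 seconds, digits only")
    return problems
```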
The Google Mobile Sitemap uses the 0.9 Sitemap protocol, with a new tag and namespace. Here is an example of a Mobile XML-Sitemap containing the listing of one content entry:
<?xml version="1.0" encoding="UTF-8" ?>
Currently, the mobile protocol supports the following markup languages:
- XHTML (WAP 2.0 – mobile profile)
- cHTML (iMode)
- WML (WAP 1.2)
A couple of things to keep in mind: not all Sitemap generators can generate mobile Sitemaps, and if you do not include the <mobile:mobile/> tag, your content will not be considered mobile and will not be crawled for the mobile index.
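If your generator cannot produce mobile Sitemaps, the extra namespace and marker tag are simple to add by hand. The sketch below is one possible approach using Python's standard `xml.etree.ElementTree`. The function name and the example URL are hypothetical; the namespace URI is the one Google published for its mobile Sitemap extension.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MOBILE_NS = "http://www.google.com/schemas/sitemap-mobile/1.0"


def build_mobile_sitemap(urls):
    """Return a mobile Sitemap (as text) that marks every URL as mobile content."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("mobile", MOBILE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in urls:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page
        # The empty marker element is what flags the page as mobile.
        ET.SubElement(url, f"{{{MOBILE_NS}}}mobile")
    return ET.tostring(urlset, encoding="unicode")


print(build_mobile_sitemap(["http://mobile.example.com/article100.html"]))
```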
XML-News Sitemaps are for breaking news only, and have a life of 72 hours. The Sitemap requires the “Sitemap-News” namespace, a publication date and optionally, keywords – which are Google news categories.
Example of a News sitemap containing a single article:
<?xml version="1.0" encoding="UTF-8"?>
<news:keywords>Business, Mergers, Acquisitions</news:keywords>
Rules of Thumb for News Sitemaps:
- An XML-News Sitemap can contain no more than 1K URLs.
- The Publication_Date tag should also contain a time stamp.
- The Google crawler will not index news articles older than 72 hours.
- News older than 72 hours is still accessible from the Google News Archive.
- If a Google News Category does not exist for your article, you have the option to create a new category.
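Because of the 72-hour window and the 1K URL cap, a News Sitemap is usually regenerated from a filtered list of recent articles. The following sketch shows one way to do that selection. The function name and the article dictionary layout (a `published` key holding a timezone-aware datetime) are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

MAX_NEWS_URLS = 1000                # "no more than 1K URLs"
NEWS_WINDOW = timedelta(hours=72)   # articles older than this are not indexed


def select_news_articles(articles, now=None):
    """Return the newest articles that are still inside the 72-hour window."""
    now = now or datetime.now(timezone.utc)
    fresh = [
        article for article in articles
        if timedelta(0) <= now - article["published"] <= NEWS_WINDOW
    ]
    # Newest first, so the cap drops the oldest items if there are too many.
    fresh.sort(key=lambda article: article["published"], reverse=True)
    return fresh[:MAX_NEWS_URLS]
```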
Geo Sitemaps use the KML, KMZ and GeoRSS markup languages to provide location-specific information for the Google geo-search index, using the geo-specific tag defined below.
Geo-specific tag definition
|<geo:format>||Required||Case-insensitive. Specifies the format of the geo content. Examples include “kml” and “georss”.|
The following table lists the formats supported by Geo Sitemaps. You should use the Short Name to create your Geo Sitemap.
|Short Name||Full Name||Description|
|georss||GeoRSS||RSS with geoRSS extensions|
The KML (Keyhole Markup Language) is an XML-based schema that is used to support 2-D and 3-D browsing primarily using longitude and latitude coordinates. Other tags include: images, polygons, 3D models and textual descriptions. KML was originally developed for use in Google Earth.
An example of KML code:
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Lexington ebusiness Consulting</name>
  </Placemark>
</kml>
KMZ files are zipped KML files, and are used for distribution purposes.
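Since a KMZ file is a zipped KML file, one can be produced with Python's standard `zipfile` module. This is a minimal sketch: the function name is hypothetical, and naming the inner file `doc.kml` follows a common convention rather than anything stated in this article.

```python
import io
import zipfile


def kml_to_kmz(kml_text, inner_name="doc.kml"):
    """Zip a KML document and return the resulting KMZ file as bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # KML is XML, so it is stored UTF-8 encoded inside the archive.
        archive.writestr(inner_name, kml_text.encode("utf-8"))
    return buffer.getvalue()
```

The returned bytes can be written to a file with a `.kmz` extension for distribution.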
GeoRSS comes in two flavors: GeoRSS-GML and GeoRSS-Simple. It is primarily used to encode geo-related information (e.g. bridges and roads) in articles and blogs, using syndicated feeds for distribution.
About Lexington eBusiness Consulting
Providing comprehensive SEO services to the Boston community…
Mark Sprague’s 25 years of product development experience, which includes expertise in Search Engines, Information Products, SEO platforms and Social Networking applications, provides in-depth expertise to help you refine products and services, and improve your search engine marketing and website’s performance by:
- Developing a superior data-driven SEO strategy for your website.
- Understanding your customers’ search behavior and normalizing it to your content strategy.
- Understanding how search engine technology practically impacts SEO and content strategies.
- Understanding how search technology impacts content in a social networking environment.
- Developing a superior user experience based on sound information architecture, usability and coding standards.
Lexington eBusiness Consulting
Mark Sprague, CEO
580 Lowell Street
Lexington, MA 02420