XML sitemaps serve a very niche purpose in search engine optimization: facilitating indexation. Posting an XML sitemap is kind of like rolling out the red carpet for search engines and giving them a roadmap of the preferred routes through the site. It’s the site owner’s chance to tell crawlers, “I’d really appreciate it if you’d focus on these URLs in particular.” Whether the engines accept those recommendations of which URLs to crawl depends on the signals the site is sending.
What Are XML Sitemaps?
Simply put, an XML sitemap is a bit of Extensible Markup Language (XML), a standard machine-readable format consumable by search engines and other data-munching programs like feed readers. XML sitemaps convey information about one thing: the URLs that make up a site. Each XML sitemap file follows the same basic form. A one-page site located at www.example.com would have the following XML sitemap:
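As a rough illustration of that basic form, here is a minimal Python sketch that assembles the sitemap for a one-page site. The function name build_sitemap and the lastmod, changefreq, and priority values are placeholders invented for this example; only the element names and the urlset namespace come from the sitemaps.org protocol.

```python
from xml.sax.saxutils import escape

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of page dicts.

    Each dict needs a 'loc' key; 'lastmod', 'changefreq', and
    'priority' are optional and are written only when present.
    (build_sitemap is a hypothetical helper, not a library function.)
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<urlset xmlns="{SITEMAP_NS}">',
    ]
    for page in pages:
        lines.append("  <url>")
        # escape() turns &, <, and > into XML entities.
        lines.append(f"    <loc>{escape(page['loc'])}</loc>")
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in page:
                lines.append(f"    <{tag}>{escape(str(page[tag]))}</{tag}>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)


# Placeholder values for a one-page site at www.example.com.
one_page_site = [
    {
        "loc": "http://www.example.com/",
        "lastmod": "2013-01-01",
        "changefreq": "monthly",
        "priority": "0.8",
    }
]

print(build_sitemap(one_page_site))
# Prints:
# <?xml version="1.0" encoding="UTF-8"?>
# <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
#   <url>
#     <loc>http://www.example.com/</loc>
#     <lastmod>2013-01-01</lastmod>
#     <changefreq>monthly</changefreq>
#     <priority>0.8</priority>
#   </url>
# </urlset>
```

The first two lines and the closing urlset tag stay fixed; each additional page would add one more url block.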
The XML declaration and the <urlset> element are the same for every XML sitemap file. For each URL listed, a <url> and a <loc> tag are required, with optional <lastmod>, <changefreq>, and <priority> tags. Only the values inside these tags change from one URL to the next.
<loc> contains the absolute URL, or locator, for a page.
<lastmod> specifies the file’s last modification date.
<changefreq> indicates the frequency with which a file is changed.
<priority> indicates the file’s importance within the site.
Avoid the temptation to set every URL to daily frequency and maximum priority. No multi-page site is structured and maintained this way, so search engines will be more inclined to ignore the whole XML sitemap if the frequency and priority tags do not reflect reality.
The URLs in an XML sitemap can be on the same domain or, provided the search engines’ cross-submission requirements are met, on different subdomains and domains. However, each XML sitemap file can contain no more than 50,000 URLs and is limited in size: 10MB under the original protocol, raised to 50MB uncompressed in the current sitemaps.org specification. To conserve bandwidth, XML sitemaps can be compressed using gzip. When a site contains more than 50,000 URLs or exceeds the size limit, multiple XML sitemaps need to be generated and called together from an XML sitemap index file. In the same way an XML sitemap lists the URLs in a site, the XML sitemap index lists the XML sitemaps for a site. For each XML sitemap listed, only the <loc> value and the optional <lastmod> value change.
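The split into multiple files can be sketched in Python as follows. This is an illustration under stated assumptions rather than a complete generator: it enforces only the 50,000-URL limit (not the file-size limit), and the helper names, the product URLs, and the sitemap-1.xml style file names are all hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit from the protocol


def chunk_urls(urls, max_urls=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into groups small enough for one sitemap each."""
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]


def build_sitemap_index(sitemap_locations):
    """Return sitemap index XML that lists each child sitemap's address."""
    root = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for location in sitemap_locations:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = location
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical catalog of 120,001 product URLs.
urls = [f"http://www.example.com/product-{n}" for n in range(120_001)]
chunks = chunk_urls(urls)

# Hypothetical naming scheme: one numbered file per chunk.
locations = [
    f"http://www.example.com/sitemap-{i}.xml"
    for i in range(1, len(chunks) + 1)
]
index_xml = build_sitemap_index(locations)
```

Here the 120,001 URLs fall into three groups (50,000, 50,000, and 20,001), so the index lists three sitemap files. A production version would also need to check each file's size and could gzip the output.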
For more examples of XML sitemaps, visit any site and add /sitemap.xml after the domain. For example, http://www.practicalecommerce.com/sitemap.xml is the XML sitemap index for this site. If adding sitemap.xml doesn’t work, the XML sitemap may be named differently. Try checking the robots.txt file to see if the XML sitemap address is listed there on a line beginning with “Sitemap:”. For example, check out http://www.dell.com/robots.txt for a huge list of XML sitemaps.
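Checking robots.txt for sitemap addresses can also be automated. The sketch below assumes Python 3.8 or later, where the standard library's RobotFileParser offers site_maps(); the helper name sitemaps_from_robots and the sample robots.txt content are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the sitemap addresses declared in robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


# Hypothetical robots.txt content for www.example.com.
sample_robots = """\
User-agent: *
Disallow: /checkout/
Sitemap: http://www.example.com/sitemap-index.xml
Sitemap: http://www.example.com/sitemap-products.xml
"""

found = sitemaps_from_robots(sample_robots)
```

To inspect a live site, the same parser can fetch the file itself through its set_url() and read() methods instead of being given text.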
What to Exclude
Because XML sitemaps serve as a set of recommended links to crawl, any noncanonical URLs should be excluded from the XML sitemap. Any URLs that have been disallowed in the robots.txt file (such as secure ecommerce pages, duplicate content, and print and email versions of pages) should also not be included in the XML sitemap. Likewise, any files that are kept out of the index by meta robots noindex tags, or that point to a different URL with a canonical tag, should not be included in the XML sitemap. If the crawlers find URLs in the XML sitemap that have been purposely excluded by one of these means, it sends a mixed signal: “Don’t crawl this URL. But do consider it more important than the other URLs on my site.” The crawlers will obey the exclusion commands issued by robots.txt disallows and meta robots noindex. But if enough of these mixed signals are present, the XML sitemap may be discredited and lose its recommending ability.
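These exclusion rules can be expressed as a filter applied before the sitemap is written. The following Python sketch assumes the site can already report, for each page, whether it carries a noindex tag and which canonical URL it declares; the Page class, the sitemap_candidates function, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.robotparser import RobotFileParser


@dataclass
class Page:
    url: str
    noindex: bool = False            # page has a meta robots noindex tag
    canonical: Optional[str] = None  # canonical URL declared by the page, if any


def sitemap_candidates(pages, robots_txt, user_agent="*"):
    """Return only the URLs that send no mixed signal in a sitemap."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    keep = []
    for page in pages:
        if not robots.can_fetch(user_agent, page.url):
            continue  # disallowed in robots.txt
        if page.noindex:
            continue  # excluded from the index
        if page.canonical and page.canonical != page.url:
            continue  # noncanonical duplicate of another URL
        keep.append(page.url)
    return keep


# Hypothetical robots.txt rules and page inventory.
robots_rules = "User-agent: *\nDisallow: /checkout/\n"
pages = [
    Page("http://www.example.com/"),
    Page("http://www.example.com/checkout/cart"),
    Page("http://www.example.com/print/widget", noindex=True),
    Page("http://www.example.com/widget?ref=email",
         canonical="http://www.example.com/widget"),
    Page("http://www.example.com/widget",
         canonical="http://www.example.com/widget"),
]

clean_urls = sitemap_candidates(pages, robots_rules)
```

With this sample data, only the home page and the canonical widget page remain; the cart, print, and email-referral URLs are dropped.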