Last month, we featured a basic introduction to the Extensible Markup Language (XML) and showed how to begin writing your own XML documents. While we were able to create an XML document to describe data, it may not be clear why this is useful. To put what we learned in the last tutorial to use, and to provide a real-world example of how XML is used, we will create a Google Sitemap file in this tutorial.
Google Sitemap is a beta feature of the Google search engine that allows webmasters to submit an XML file describing the structure of a website. While this will not help the page ranking of the site in the Google index, it does provide a roadmap to help the Google search engine crawl a website more effectively. Google has defined a set of XML tags and attributes, known as an XML schema, that can be used in a sitemap file. This pre-defined schema allows us to state which pages are available to be crawled by the search engine, how important we consider each page to be, when the content of each page was last modified, and how often we expect the content on each page to be updated.
Armed with our knowledge of XML from last month’s tutorial, we are going to build our own Google Sitemap for our fictitious website that sells widgets. Our first step will be to analyze our website and gather the information that we will need in order to create our Google Sitemap. As mentioned before, we will need to list all the pages that we want the search engine to crawl, rank the importance of each page, find out the last time each page was modified, and decide how often we expect each page to be updated. Using this information we can build our XML file based on the Sitemap protocol defined by Google. Since XML provides a method for marking up information using tags and attributes, we need to make sure that our XML tags are consistent with what Google is expecting from us. This is where the power and flexibility of XML become apparent. By defining what tags, attributes, and values we can use in an XML file (the “schema”), Google is able to create a protocol which allows us to efficiently transfer a large amount of information about our website in a structured manner that their system can use automatically.
Once we have created our XML file, we will need to post our Google Sitemap on our website, and submit its location to the search engine. While this does not guarantee inclusion into the Google index, the search engine will now be able to access our sitemap file when it indexes our website and use the information contained in it to more effectively crawl and index the website content. Combined with other search engine optimization techniques, creating a Google Sitemap can help you to improve your search engine placement and get an edge on your competition.
In this tutorial we will be creating a Google Sitemap for a small website that sells widgets. Since we will be keeping it simple, our website will consist of a home page, a contact page, and a products page. Luckily we don’t need to actually invent content for these pages, since the Google Sitemap file that we are going to create is meant to guide the search engine to the content of the site, not describe it. In order to create our XML file we will first need to gather some information about our website pages.
The Google Sitemap protocol defines XML tags that will be used to structure the information about our website. While we are required to include information about the location of each page on our website (or the URL of each page), other information is optional. As we mentioned before, we also have the option of including information about the importance of each page, the last time each page was modified, and how often we expect each page to be modified. Since we want to include as much information as we can, let’s take a look at our website and see what we will need.
Our home page is important. We have made sure that it is saturated with relevant keywords, and we have optimized the code to make sure that search engines like our home page. Since this page describes our company and gives an overview of our products, we consider it to be pretty important. All of our web pages were modified right before we created this sitemap file, so we know that value will be the same for all pages. We also update our home page weekly to provide new content and special offers. This is all the information that we need to begin.
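Before writing any XML, it can help to jot down the facts for each page in one place. The sketch below is one possible way to do that in Python. The `Page` class and its field names are our own illustration, chosen to mirror the sitemap tags used later in this tutorial; they are not part of the Google Sitemap protocol. The date and priority are the values this tutorial settles on for the home page.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Page:
    """Facts about one page; only the URL is required by the protocol."""
    loc: str                          # full URL of the page
    lastmod: Optional[date] = None    # date the content last changed
    changefreq: Optional[str] = None  # how often we expect it to change
    priority: Optional[float] = None  # 0.0 (least) to 1.0 (most important)


# Facts gathered for the home page of our example widget site.
home = Page(
    loc="http://www.widgetsexample.com",
    lastmod=date(2005, 10, 31),
    changefreq="weekly",
    priority=0.8,
)
```

Repeating this for the contact and products pages gives us everything we need before we start typing tags.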
While we are building our file we will repeat this process for each page that we include in the sitemap. However, let’s first use our XML knowledge from the last tutorial and begin to build our file. The Google Sitemap protocol requires that our XML file be encoded in UTF-8, a Unicode encoding, in order to accommodate special characters. This is important because we need to declare the encoding in our XML declaration tag. Our file then starts with:
<code><?xml version="1.0" encoding="UTF-8"?> </code>
If you remember from the last tutorial, this opening tag simply declares that this is an XML file and describes the character encoding of the file. This will always be the same for a Google Sitemap, so don’t worry about understanding this one. Just make sure that it is there. Our next piece of code will also remain the same in any Google Sitemap, and that is a tag that encapsulates all the data in the file. It also has an attribute that identifies this file as a Google Sitemap:
<code><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
</urlset>
</code>
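If you would rather generate this skeleton than type it, the same two pieces can be produced with Python's standard `xml.etree.ElementTree` module. This is only a sketch under our own assumptions: the variable names are illustrative, and the serializer may quote or abbreviate the output slightly differently from the hand-written version (for example, an empty element may be written in its short self-closing form).

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute shown above.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"

# Register it as the default namespace so the output uses a plain
# <urlset xmlns="..."> instead of a generated prefix such as ns0.
ET.register_namespace("", SITEMAP_NS)

# The root element that will hold every <url> entry.
urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")

# Serialize as UTF-8 bytes with the XML declaration in front.
xml_bytes = ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))
```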
Notice that we have again created an opening tag, which contains an attribute, and a closing tag. We have left a space between them since all of the data in our file will be contained inside this <urlset> tag. Next we create a tag that will contain the information about the home page of our site. Information about each page of the site is contained within another tag called the <url> tag. As you can see, each <url> tag sits inside the <urlset> tag:
<code><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
  </url>
</urlset>
</code>
Again we have left a space to insert the information about the home page. First we want to describe the location of the home page on our website, which is done with the <loc> tag. This is a required tag since it provides the URL for the search engine to find the page on the website, and the URL of the web page must include the http:// prefix. In our case, our home page <loc> tag would be:
<code><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.widgetsexample.com</loc>
  </url>
</urlset>
</code>
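Because a location without the http:// prefix is an easy mistake to make, a quick check can catch it before the file is uploaded. The helper below is a hypothetical sketch of ours, not something defined by the protocol; accepting https:// as well is our own assumption.

```python
from urllib.parse import urlparse


def is_valid_loc(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host name."""
    parts = urlparse(url)
    # Assumption: https is treated the same as http for this check.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_valid_loc("http://www.widgetsexample.com"))  # prints True
print(is_valid_loc("www.widgetsexample.com"))         # prints False
```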
As you can see, since our home page is the index page to our website, we just need to put our website URL in the location tag. Next we want to tell the search engine the last time that we modified the content on this page, which we had determined to be the same for each page. In our case we will say that the pages were last modified on October 31. Notice the format used for the date, as it is important that the year be displayed first, then the month and day of the month:
<code><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.widgetsexample.com</loc>
    <lastmod>2005-10-31</lastmod>
  </url>
</urlset>
</code>
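If the modification date comes from a program rather than from memory, the year-month-day form used above is exactly what Python's `date.isoformat()` produces. A minimal sketch, with variable names of our own choosing:

```python
from datetime import date

# The date our example pages were last modified.
last_modified = date(2005, 10, 31)

# isoformat() writes the year first, then the month and the day.
lastmod_text = last_modified.isoformat()
print(lastmod_text)  # prints 2005-10-31
```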
We know that the content on our home page is updated at least weekly to promote product offers, company announcements, and industry news, so we want to include this information in our sitemap. This is accomplished with the next available tag, <changefreq>, which tells the search engine how often the content of the page is expected to change. The available values for this tag are always, hourly, daily, weekly, monthly, yearly, and never.
It is important to note here that you should put the value that best describes your website. While many might feel the temptation to tell the search engine that their content is always updated, that could hurt your placement in the long run. It won’t take too long for the search engine to figure out that the content is not changing as frequently as expected, and you may be punished for trying to fool the system. More importantly, you want the search engine to crawl your site and discover fresh content, so the more accurate you can be here the better. Remember, this value does not limit the search engine to crawling these pages at the defined interval, it simply advises it as to how often it should expect new content. Take a look at our file now:
<code><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.widgetsexample.com</loc>
    <lastmod>2005-10-31</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
</code>
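Since only a fixed set of words is accepted inside <changefreq>, a small guard can catch a misspelling such as "weekley" before it reaches the file. The function name below is our own, and the list is the set of values defined by the sitemap protocol (always, hourly, daily, weekly, monthly, yearly, never); treat this as a sketch rather than an official validator.

```python
# Values the sitemap protocol accepts for <changefreq>.
ALLOWED_CHANGEFREQ = (
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
)


def check_changefreq(value: str) -> str:
    """Return the value unchanged, or raise if it is not an accepted word."""
    if value not in ALLOWED_CHANGEFREQ:
        raise ValueError(f"unknown changefreq value: {value!r}")
    return value


print(check_changefreq("weekly"))  # prints weekly
```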
Finally, we want to inform the Google search engine that we think that this page is important. We do this by including the <priority> tag outlined in the Google Sitemap protocol. We can insert a value between 0.0 and 1.0 inside this tag to rank the importance, with 0.0 being the least important and 1.0 being the most important. Since we have deemed our home page to be pretty important, but not as important as our products page, we will use a value of 0.8:
<code><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.widgetsexample.com</loc>
    <lastmod>2005-10-31</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
</code>
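The 0.0 to 1.0 range is easy to enforce in code. The helper below is a hypothetical sketch of ours that rejects out-of-range numbers and writes the value with one decimal place, matching the style used in this tutorial.

```python
def format_priority(value: float) -> str:
    """Return the priority as text with one decimal place, e.g. 0.8."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"priority must be between 0.0 and 1.0, got {value}")
    return f"{value:.1f}"


print(format_priority(0.8))  # prints 0.8
print(format_priority(1))    # prints 1.0
```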
As you can see, we have now described all of the pertinent information that we can about our home page in our Google Sitemap document. We now need to repeat the process for our other two pages. Remember that information about each page is contained within the <url> tags, so we will simply add more of these tags after the first. Take a look at our finished XML file:
<code><?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.widgetsexample.com</loc>
    <lastmod>2005-10-31</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.widgetsexample.com/contact.html</loc>
    <lastmod>2005-10-31</lastmod>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.widgetsexample.com/products.html</loc>
    <lastmod>2005-10-31</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
</code>
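For a three-page site, typing the file by hand is perfectly reasonable. For a larger site, the same structure can be generated from a list of pages. The sketch below shows one way to do this with Python's standard library. The names `PAGES` and `build_sitemap` are our own, the dates follow the single last-modified date described earlier in this tutorial, and the serialized output may differ from the hand-written file in whitespace and quoting.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"

# One dictionary per page; only "loc" is required, the rest is optional.
PAGES = [
    {"loc": "http://www.widgetsexample.com", "lastmod": "2005-10-31",
     "changefreq": "weekly", "priority": "0.8"},
    {"loc": "http://www.widgetsexample.com/contact.html",
     "lastmod": "2005-10-31", "priority": "0.5"},
    {"loc": "http://www.widgetsexample.com/products.html",
     "lastmod": "2005-10-31", "changefreq": "daily", "priority": "1.0"},
]


def build_sitemap(pages):
    """Return the sitemap for the given pages as UTF-8 encoded bytes."""
    ET.register_namespace("", SITEMAP_NS)  # write xmlns without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # Emit the tags in the same order used in this tutorial,
        # skipping any optional tag the page does not define.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = page[tag]
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


sitemap_bytes = build_sitemap(PAGES)
print(sitemap_bytes.decode("utf-8"))
```

Writing `sitemap_bytes` to a file opened in binary mode would give us a file ready to upload.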
Notice that we have set the importance value for our products page to 1.0, indicating that we feel it is the most important page on our website. Also notice that since we don’t expect the contact page to change, we have not included the <changefreq> tag. We have also placed the least amount of importance on this page, although we have still indicated that we feel it is important. This completes our XML Google Sitemap file, and all that remains is to submit it to the Google Sitemap service.
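Before submitting a hand-written file, it is worth confirming that it is well-formed XML and that every entry has the required <loc> tag. The function below is a rough sketch of such a check using Python's standard library. The name `count_urls` is hypothetical, and this is not a full validation against Google's schema.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def count_urls(xml_bytes: bytes) -> int:
    """Parse a sitemap and return the number of <url> entries it holds."""
    # fromstring raises ET.ParseError if the XML is not well-formed.
    root = ET.fromstring(xml_bytes)
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        raise ValueError("root element is not a sitemap <urlset>")
    ns = {"sm": SITEMAP_NS}
    urls = root.findall("sm:url", ns)
    for url in urls:
        if url.find("sm:loc", ns) is None:
            raise ValueError("every <url> entry needs a <loc> tag")
    return len(urls)


sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">'
    b"<url><loc>http://www.widgetsexample.com</loc></url>"
    b"</urlset>"
)
print(count_urls(sample))  # prints 1
```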
The first step to submitting our sitemap file is to upload it to our website. In this example, we will save our sitemap file as sitemap.xml in the root directory of our website. Once we have done this, we can provide the URL of our sitemap to Google so that the search engine may begin using it. You will need to log in with your Google account information, or create an account if you don’t already have one. Once we have logged in, we will enter the complete URL of our sitemap, which in this example is http://www.widgetsexample.com/sitemap.xml.
Click “Submit URL” and we are finished. If you decide to change your sitemap in the future, simply resubmit it once it has changed. We have now created a Google Sitemap file that will assist the search engine in crawling our website more effectively, and added another tool to our search engine optimization arsenal.