Search engines tend to have problems fully indexing dynamic websites (in other words, sites that are hooked up to a database of content).
The kinds of sites search engines have the biggest trouble with are those with overly complex URL structures: URLs containing numerous variables, marked by strings of ampersands and equals signs as well as session IDs, user IDs, and referral tracking codes. Matt Cutts, Senior Engineer for Google, said at Search Engine Strategies in San Jose this past August that you are safe if the number of variables in your URL is one or two, unless one of those variables is named "id" (or something else resembling a session ID), in which case all bets are off.
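To make that rule of thumb concrete, here is a minimal sketch (the domain, URL, and parameter names are hypothetical) using Python's standard urllib.parse module to count the variables in a query string; anything past two, or anything resembling a session ID, is the kind of URL that raises a red flag.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical dynamic URL with four variables, one of them a session ID
url = "http://www.example.com/product.cgi?cat=42&item=1029&sessionid=A1B2C3&ref=partner7"

params = parse_qs(urlparse(url).query)
print(len(params))            # 4 -- well past the one-or-two-variable comfort zone
print("sessionid" in params)  # True -- the kind of variable that trips up spiders
```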
Overly complex URLs are unfriendly to users, who might want to copy a URL and paste it into an email to a friend, or link from their own website to a particular page deep within your site. They are also unfriendly to search engine spiders, because they are a tip-off that the page is dynamically generated and can lead to what is called a spider trap.
A spider trap occurs when a search engine spider keeps following links to URLs that appear to differ from URLs it has already explored, even though they lead to the same content.
Imagine, for example, a search engine spider arriving at the site and being assigned a session ID, which is then embedded in the URLs of all the links on the page. The next time a spider comes to the site, it gets a brand-new session ID, because your web server cannot tell it is the same spider that visited a few minutes earlier. The result is numerous copies of the exact same page getting indexed, which is bad for the search engine and bad for the search engine's users because of all that duplicated content.
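The duplication becomes obvious once you strip away the session parameter. The sketch below (parameter names and URLs are hypothetical) normalizes two crawled URLs and shows that they are, in fact, the same page.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def canonicalize(url, session_params=("sessionid", "sid")):
    """Drop session-style parameters so URL variations collapse to one."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in session_params]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Two visits by a spider, each handed a fresh session ID by the server
first  = "http://www.example.com/product.cgi?item=1029&sessionid=A1B2C3"
second = "http://www.example.com/product.cgi?item=1029&sessionid=Z9Y8X7"

print(canonicalize(first) == canonicalize(second))  # True -- same page, two URLs
```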
The worst spider traps feed the spider an endless variety of URLs for the same limited set of pages. Each search engine has its own tolerance for how many variables in a URL are acceptable. The goal, however, is to eliminate all signs of the dynamic nature of your pages from the URL, in other words removing all question marks, ampersands, equals signs, "cgi-bin", user IDs, and session IDs from the URLs to make the pages far more palatable to the spiders.
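As a rough illustration of the target, here is a small sketch, with a hypothetical URL scheme, of the kind of before-and-after transformation you are aiming for: a dynamic URL full of query-string clutter becomes a clean, static-looking path.

```python
# Hypothetical before-and-after:
#   /cgi-bin/product.cgi?cat=42&item=1029&sessionid=A1B2C3
#   becomes
#   /products/42/1029/
def friendly_url(category_id, item_id):
    """Build the clean, static-looking URL your site should link to."""
    return f"/products/{category_id}/{item_id}/"

print(friendly_url(42, 1029))  # /products/42/1029/
```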
Not only does a clean, simple URL eliminate the potential indexing problems described above, but as a bonus you are also more likely to garner "deep links" from other sites (links pointing directly to pages deep within your site), because the URL looks user-friendly, stable, and easy to copy and paste into a web browser, email message, or web page editor.
The best approach is to replace all dynamic-looking links with search-engine-friendly ones. Don't be tempted by the shortcut of creating a site map page that links to these search-engine-friendly URLs while leaving all the remaining links across your site intact. The URLs you haven't fixed will not contribute to the link gain of the pages with friendly URLs. You want to maximize link gain by having as few variations of each URL as possible. Variations in the URLs dilute link gain, because not all links are voting for the same page: some vote for one version of the page at one URL, others for other versions of the same page at different URLs. If, say, fifty links are spread evenly across five URL variations of the same product page, each variation collects only a fifth of the votes that a single canonical URL would.
Assuming you have a dynamic site that is not yet search engine friendly as far as the URLs are concerned, but you would like to make it so, you have three options:
One is to rewrite the URLs using a URL-rewriting server module, such as mod_rewrite (for Apache) or ISAPI_Rewrite (for IIS). Ask your server administrator for information about these modules.
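Conceptually, a rewrite rule is a regular-expression mapping from the friendly URL the spider sees back to the dynamic URL your script expects. The real rules live in your server configuration rather than in application code, but the sketch below (using a hypothetical URL scheme) expresses in Python the mapping such a rule performs.

```python
import re

# What a rewrite rule does, expressed in Python. The real rule would live in
# Apache or IIS configuration; the URL scheme here is hypothetical.
#   Incoming request:  /products/42/1029/
#   Served internally: /product.cgi?cat=42&item=1029
RULE = (re.compile(r"^/products/(\d+)/(\d+)/$"), r"/product.cgi?cat=\1&item=\2")

def rewrite(path):
    pattern, target = RULE
    # Rewrite friendly paths; pass anything else through untouched
    return pattern.sub(target, path) if pattern.match(path) else path

print(rewrite("/products/42/1029/"))  # /product.cgi?cat=42&item=1029
print(rewrite("/about-us.html"))      # /about-us.html (unchanged)
```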
A second option is to recode your ecommerce platform so it stops passing information through query strings and instead uses the PATH_INFO environment variable. In other words, you would recode your scripts to look for variables embedded in the directory names or the file name rather than in the query string. This tends to be quite a bit more complicated to implement, however.
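To make the path info idea concrete, here is a minimal sketch assuming a CGI-style environment in which the web server populates the PATH_INFO variable, with a hypothetical script name and path layout: a request for /cgi-bin/product/42/1029 arrives with PATH_INFO set to "/42/1029", and the script reads its variables from the path rather than from the query string.

```python
import os

def ids_from_path_info():
    """Read category and item IDs from PATH_INFO instead of a query string."""
    # e.g. a request for /cgi-bin/product/42/1029 gives PATH_INFO = "/42/1029"
    segments = [s for s in os.environ.get("PATH_INFO", "").split("/") if s]
    if len(segments) < 2:
        return None, None  # no usable path parameters
    category_id, item_id = segments[0], segments[1]
    return category_id, item_id
```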
The third option is to use a third-party hosted proxy serving solution, in other words, an Application Service Provider.
The first option is usually preferable, assuming you have the IT resources to implement it and your server supports the technology required for URL rewriting. The second option is good if you can't do URL rewriting but have programming resources available along with access to the source code of your ecommerce platform.
But if neither of those two options is feasible for whatever reason, you could use a third-party solution that automatically corrects the URLs for you. This is particularly useful if you are caught in the middle of a code freeze, such as during the holiday season.
You may wonder where the new Google Sitemaps program fits in here. Until Google Sitemaps lets you convey which URL variations point to the same content, it is an incomplete solution, because it will fail to aggregate PageRank across all those variations. You may already have five versions of a product page indexed in Google, and Sitemaps could simply exacerbate the problem by getting a sixth version indexed, rather than collapsing the five versions into one with much higher PageRank.
No matter which approach you take, making your URLs search engine friendly will pay dividends.