PHP Session IDs
I'm sure this one has been beaten to death somewhere, but I wanted to touch on it here since it came up for us recently. First of all, a PHP session ID is a unique identifier that PHP generates for each visitor's session. It can be appended to a URL, so that you end up with:
mywebsite.com/?PHPSESSID=183249712347012374871234872314
This is bad. That unique identifier seems to make search engines think the URL is different from mywebsite.com, which means the same page gets indexed as duplicates, and you will be hurt in the rankings. Not all search engines have this problem. Google appears to be very astute at figuring it out.
So as our new site began to register with the engines, I noticed that Google was doing great, but MSN was draining our site (over 400 MB per visit) and listing session IDs in the URLs of the pages it had found. That had to be fixed, and for anyone facing the same situation, here is the solution...
The hosting company helped to configure the PHP settings. If you do not have access to the server configuration, you can usually put a file called php.ini inside your root folder, and it should work for you. Essentially, you want to tell PHP to use cookies for session IDs. This means that if a user has cookies enabled, PHP will store the session ID in a cookie. There are also other settings to look at, such as:
session.use_only_cookies, session.use_trans_sid
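As a rough sketch, a php.ini fragment for the setup described here might look like the following. The values are my assumptions, chosen to match the behavior described in this post (cookies first, with the session ID in the URL as a fallback), not the exact settings our host applied, and whether a per-folder php.ini is honored depends on your host.

```ini
; Sketch only: the values below are assumptions, not a recommended configuration.
; Store the session ID in a cookie when the visitor accepts cookies.
session.use_cookies = 1
; 0 lets PHP still accept a session ID passed in the URL (1 would refuse it).
session.use_only_cookies = 0
; 1 lets PHP append the session ID to URLs for visitors without cookies
; (0 would turn that URL rewriting off for everyone).
session.use_trans_sid = 1
```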
So that takes care of that: since most users will have cookies enabled, they will no longer see the session ID in the URL. Of course, this is purely cosmetic. It does nothing for the search spiders that don't accept cookies. In that case, as long as URL-based session IDs are still enabled, PHP puts the session ID at the end of the URLs. So we haven't gotten anywhere.
The answer to this is to go into your PHP code, and wherever there is a call to start a session (session_start() or session_register()), put in a check that looks at the user-agent of the requesting client. If it is a robot or spider (you have to create a list of these yourself), then you should make sure that a session is not started. It's a lot of work, especially if you have a large site or application. However, it is the only way to ensure that spiders won't get session IDs in the URLs. While there are tons of spiders, I would recommend the following, as they seem to provide most of the search results on the web:
- Google: Googlebot
- MSN: msnbot
- Inktomi: slurp
- Overture: fast
- Alexa: ia_archiver
- AltaVista: scooter
These are the strings that you should look for in the user-agent to determine whether one of these spiders is visiting the site.
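Here is a minimal sketch of that check. The function names is_spider() and start_session_unless_spider() and the $SPIDER_SIGNATURES variable are made up for illustration, and the match is a simple case-insensitive substring test against the list above. It only wraps session_start(), since session_register() was removed in later PHP versions.

```php
<?php
// Substrings to look for in the User-Agent header, taken from the list above.
// Note that a short string like 'fast' can also match unrelated user-agents.
$SPIDER_SIGNATURES = array('googlebot', 'msnbot', 'slurp', 'fast', 'ia_archiver', 'scooter');

/**
 * Return true if the user-agent contains any of the spider signatures.
 * stripos() makes the comparison case-insensitive.
 */
function is_spider($userAgent, array $signatures)
{
    foreach ($signatures as $signature) {
        if (stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

/**
 * Start a session only when the visitor does not look like a spider.
 * Returns true if a session was started, false otherwise.
 */
function start_session_unless_spider(array $signatures)
{
    // The header may be missing entirely, so fall back to an empty string.
    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (is_spider($userAgent, $signatures)) {
        return false;
    }
    return session_start();
}
```

You would then call start_session_unless_spider($SPIDER_SIGNATURES) everywhere the code used to call session_start() directly. Keep in mind that any code reading $_SESSION afterwards has to cope with no session existing when the visitor is a spider.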
So in the end we are left with three possible visitor types. The first is a normal user-agent (IE or Firefox) that allows cookies, which will not see session IDs since they will be stored in the cookie. The second is a normal user-agent (IE, Firefox, etc.) that does not allow cookies. These folks will see session IDs attached to URLs, but that is the sacrifice they make for not allowing cookies. The third, which is the most important from an SEO standpoint, is a spider (the user-agent matches one from the list you have defined), in which case a session will not be started, to ensure consistent URLs and to make sure that no duplicates are indexed.
I hope that this helps.