It sounds counterintuitive, but sometimes the goal of search engine optimization is avoiding indexation.
Consider the following scenarios.
- Faceted navigation spews out thousands of pages of duplicate or low-value content.
- Email landing pages contain targeted promotions for select groups of customers.
- Add-to-wishlist links generate a new page with the same title tag as the product page.
- A printable coupon continues to rank in search results long past its expiration date.
Each of these situations, and many others, leaves you wishing the search engines hadn’t indexed portions of your site. By understanding the tools available, particularly what each can and cannot accomplish, you can choose the best method to prevent indexation.
1. Robots.txt Disallow
The disallow is the easiest to implement and also the most likely to accidentally wreak havoc on your SEO program. A disallow line in the robots.txt file located at the root of a site commands ethical search engines not to crawl specified files or folders. It can even use wildcards to specify patterns of URLs to match, such as all URLs ending in .gif or all URLs containing the phrase “email-landing.”
This method is the best way to prevent content that has never been indexed from getting indexed, but has absolutely no impact on your customers’ experience once they’re on your site. In other words, disallowing a page means search engines can’t send new visitors to the page, but visitors can navigate to it once they’re on your site. A word of warning, however: Thoroughly test any change to your robots.txt file in Google Webmaster Tools before you send it live. I’ve seen too many sites accidentally disallow important content or their entire sites with a single disallow command.
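As a sketch, you can check a disallow rule against Python’s built-in parser before pushing it live. The folder name below is hypothetical, and note that `urllib.robotparser` implements only the original prefix-matching standard, so wildcard patterns like `/*.gif$` (an extension honored by the major engines) can’t be verified this way:

```python
from urllib import robotparser

# Hypothetical robots.txt blocking a folder of email landing pages.
rules = """\
User-agent: *
Disallow: /email-landing/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The disallowed folder is blocked; everything else stays crawlable.
print(rp.can_fetch("*", "https://example.com/email-landing/spring-offer"))  # False
print(rp.can_fetch("*", "https://example.com/products/widget"))             # True
```

A quick script like this is no substitute for testing in Google Webmaster Tools, but it catches the worst mistakes, such as a stray `Disallow: /` blocking the whole site.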
2. Meta Robots noindex
On a page-by-page basis, the meta robots noindex tag commands search engine crawlers not to index that specific page. Unlike the robots.txt disallow, which can block entire folders and match patterns of URLs, each meta robots noindex command blocks only a single page from being indexed. Unless noindex is paired with a nofollow command in the meta robots tag, search engines can still crawl the page and follow any links they find. As a result, noindexing is useful for encouraging deeper crawling of a site while still preventing specific pages from being included in the index for ranking. When used at the template level, it’s easy to noindex every page that uses a specific template – for example, the wishlist scenario mentioned above.
Like the disallow, the meta robots noindex tag has no impact on your visitors’ experience once they’re on your site. In other words, noindexing a page means search engines can’t send new visitors to the page, but visitors can navigate to it once they’re on your site.
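For reference, the tag is a single line in the `<head>` of the page template; here it is sketched for the hypothetical wishlist template:

```html
<!-- In the <head> of the wishlist page template (hypothetical example) -->
<meta name="robots" content="noindex">

<!-- Pair noindex with nofollow only if crawlers should also
     ignore the links on the page -->
<meta name="robots" content="noindex, nofollow">
```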
3. Server Header Status
If neither humans nor bots should be able to access the content, a 301 redirect is the best option for SEO. If a page has gone live and is indexed already, it has some measure of value in terms of trust and authority. Wasting that trust and authority is like burning money. In addition to redirecting the customer to the correct content, a 301 redirect commands search engines to deindex the URL and pass the link authority collected in that page to a different one.
In other words, placing a 301 redirect on a page means that any request for that page by a customer or bot will get redirected instead to the new page. Neither customers nor bots will be able to access the page’s former contents once it has been redirected. In the example of an expired coupon that still ranks well in search results, that page likely has some powerful authority to continue ranking. Implementing a 301 redirect would pass that link authority and searching consumers to a current promotion where the consumers could convert to sale.
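A minimal sketch of that redirect as a Python WSGI application, with a hypothetical expired-coupon URL mapped to the current promotion:

```python
# Hypothetical mapping from expired coupon URLs to current promotions.
REDIRECTS = {
    "/coupons/expired-spring-sale": "/coupons/current-sale",
}

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path in REDIRECTS:
        # 301 = Moved Permanently: browsers follow it, and search
        # engines deindex the old URL and pass its authority along.
        start_response("301 Moved Permanently",
                       [("Location", REDIRECTS[path])])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Current promotion"]
```

In practice the same mapping usually lives in the web server’s rewrite configuration rather than application code; the point is that every request for the old URL, human or bot, receives the 301 and the new location.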
If a 301 redirect is physically impossible – which is rarely the case – deleting the URL and serving a 404 “file not found” error will deindex the URL so that it won’t rank or bloat the search engines’ indices. But a 404 error will also cause any authority that page has built up to shrivel away like a grape on the vine in the hot, dry sun.
However, if customers need to access the content but search engines shouldn’t index it, either robots.txt disallow or a meta robots noindex is your best bet.
4. Password Protection
This is standard for content that requires confidential information, but it can also prevent bots from crawling content. The effectiveness of this option for SEO needs to be balanced with customer experience, naturally, but password protection is a good way to restrict search engine access to content meant only for specific customer segments.
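As a sketch, HTTP Basic authentication in a WSGI app shows why this works: crawlers never supply credentials, so they stop at the 401 response and the protected content stays out of the index. The username and password below are placeholders:

```python
import base64

# Placeholder credentials for a hypothetical members-only section.
_EXPECTED = "Basic " + base64.b64encode(b"member:secret").decode()

def protected_app(environ, start_response):
    if environ.get("HTTP_AUTHORIZATION") != _EXPECTED:
        # Bots can't authenticate, so they never see the content.
        start_response("401 Unauthorized",
                       [("WWW-Authenticate", 'Basic realm="Members"')])
        return [b"Authentication required"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Members-only promotion"]
```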
5. Other Technical Barriers
Content that crawlers simply can’t reach, such as pages buried behind form submissions, JavaScript-driven navigation, or required cookies, also stays out of the index. These barriers are rarely put in place for SEO reasons, but they block bots just as effectively, if less predictably.
Which Method to Use?
Notice that I didn’t include the rel=nofollow link attribute in this list of ways to block bots. Nofollow is a classic misnomer because search engine crawlers absolutely do crawl the links marked nofollow. They just don’t pass link authority through the link to the destination page. In essence, all a nofollow tells a crawler is that you’re willing to link to a page but you’re not willing to trust it.
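A nofollow link looks like this (the destination URL is a placeholder):

```html
<!-- Crawlers may still fetch this URL; they just pass it no authority -->
<a href="https://example.com/untrusted-page" rel="nofollow">See this page</a>
```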
Also conspicuously absent from this list is the canonical tag. Usually referenced alongside 301 redirects, canonical tags do not block search engines from indexing content. Whereas the 301 redirect is a command, the canonical tag is a polite request to avoid indexing the page and to pass any link authority to the page specified. When you need to be certain a page is blocked from the index, a command is preferable by far to a polite request.
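For reference, the canonical tag is a single link element in the `<head>` pointing at the preferred URL (a placeholder here):

```html
<!-- A request, not a command: engines may choose to ignore it -->
<link rel="canonical" href="https://www.example.com/products/widget">
```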
Of these five methods, the first three are by far the most frequently used. Understanding the strengths and limitations of robots.txt disallows, meta robots noindexing, and 301 redirects will arm you with the tools needed to trim unwanted pages from the search engines’ indices.