
SEO: 5 Ways to Avoid Indexation

It sounds counterintuitive, but sometimes the goal of search engine optimization is avoiding indexation.

Consider the following scenarios.

  • Faceted navigation spews out thousands of pages of duplicate or low-value content.
  • Email landing pages contain targeted promotions for select groups of customers.
  • Add-to-wishlist links generate a new page with the same title tag as the product page.
  • A printable coupon continues to rank in search results long past its expiration date.

Each of these situations, and many others, finds you wishing the search engines hadn’t indexed portions of your site. By understanding the tools available, particularly what each can and cannot accomplish, you can choose the best method to prevent indexation.

1. Robots.txt Disallow

The disallow is the easiest to implement and also the most likely to accidentally wreak havoc on your SEO program. A disallow line in the robots.txt file located at the root of a site commands ethical search engines not to crawl specified files or folders. It can even use wildcards to specify patterns of URLs to match, such as all URLs ending in .gif or all URLs containing the phrase “email-landing.”
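
As a minimal sketch, the patterns mentioned above might look like this in robots.txt (the paths are hypothetical, and wildcard support varies by search engine, so verify against each engine’s documentation before relying on it):

    User-agent: *
    Disallow: /*.gif$
    Disallow: /*email-landing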

This method is the best way to prevent content that has never been indexed from getting indexed, but has absolutely no impact on your customers’ experience once they’re on your site. In other words, disallowing a page means search engines can’t send new visitors to the page, but visitors can navigate to it once they’re on your site. A word of warning, however: Thoroughly test any change to your robots.txt file in Google Webmaster Tools before you send it live. I’ve seen too many sites accidentally disallow important content or their entire sites with a single disallow command.

2. Meta Robots noindex

On a page-by-page basis, the meta robots noindex tag commands search engine crawlers not to index that specific page. Unlike the robots.txt disallow, which can block entire folders and match patterns of URLs, each meta robots noindex command blocks only a single page from being indexed. Unless noindex is paired with a nofollow command in the meta robots tag, search engines can still crawl the page and follow any links they find. As a result, noindexing is useful for encouraging deeper crawling of a site while still preventing specific pages from being included in the index for ranking. When used at the template level, it’s easy to noindex every page that uses a specific template – for example, the wishlist scenario mentioned above.
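
As a rough sketch, the tag sits in the head of the page template; the “follow” value is the crawlers’ default behavior and is shown here only for clarity:

    <meta name="robots" content="noindex, follow">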

Like the disallow, the meta robots noindex tag has no impact on your visitors’ experience once they’re on your site. In other words, noindexing a page means search engines can’t send new visitors to the page, but visitors can navigate to it once they’re on your site.

3. Server Header Status

If neither humans nor bots should be able to access the content, a 301 redirect is the best option for SEO. If a page has gone live and is indexed already, it has some measure of value in terms of trust and authority. Wasting that trust and authority is like burning money. In addition to redirecting the customer to the correct content, a 301 redirect commands search engines to deindex the URL and pass the link authority collected in that page to a different one.

In other words, placing a 301 redirect on a page means that any request for that page by a customer or bot will get redirected instead to the new page. Neither customers nor bots will be able to access the page’s former contents once it has been redirected. In the example of an expired coupon that still ranks well in search results, that page likely has some powerful authority to continue ranking. Implementing a 301 redirect would pass that link authority, and send searching consumers, to a current promotion where they could convert to a sale.
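
For illustration, a 301 redirect for that expired coupon might look like this in an Apache .htaccess file (the paths are hypothetical; equivalent directives exist for other servers):

    Redirect 301 /coupons/expired-coupon.html /promotions/current-promotion.html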

If a 301 redirect is physically impossible – which is rarely the case – deleting the URL and serving a 404 “file not found” error will deindex the URL so that it won’t rank or bloat the search engines’ indices. But a 404 error will also cause any authority that page has built up to shrivel away like a grape on the vine in the hot, dry sun.

However, if customers need to access the content but search engines shouldn’t index it, either robots.txt disallow or a meta robots noindex is your best bet.

4. Password Protection

This is standard for content that involves confidential information, but it can also prevent bots from crawling content. The effectiveness of this option for SEO needs to be balanced against customer experience, naturally, but password protection is a good way to restrict search engine access to content meant only for specific customer segments.
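
As a rough sketch, HTTP basic authentication on an Apache server can be enabled with an .htaccess file in the protected folder (the realm name and password-file path are hypothetical); neither visitors without credentials nor bots can reach the content behind it:

    AuthType Basic
    AuthName "Member Promotions"
    AuthUserFile /home/example/.htpasswd
    Require valid-user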

5. Other Technical Barriers

If you’re rich in developer skills and can confidently tweak your platform, implementing cookies or complex JavaScript could keep the bots at bay as well. I like to call this method positive invisibility because typically I’d be referring to these technologies as something to avoid for the sake of SEO. But when you want to prevent indexation, placing the content behind a door that can only be accessed by a user agent that accepts cookies or is able to execute complex JavaScript can certainly do the trick. Keep in mind that this method will also keep out customers who have cookies or JavaScript disabled.
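
As a minimal sketch of the cookie-and-JavaScript approach (the /members-only-content URL and the "gated" element are hypothetical), the page loads the sensitive content only after confirming the browser accepts cookies and executes script:

    <div id="gated"></div>
    <script>
      // Set a test cookie; a crawler that ignores cookies or does not
      // execute JavaScript never requests the gated content.
      document.cookie = "cookietest=1; path=/";
      if (document.cookie.indexOf("cookietest=1") !== -1) {
        var xhr = new XMLHttpRequest();
        xhr.open("GET", "/members-only-content", true);
        xhr.onload = function () {
          document.getElementById("gated").innerHTML = xhr.responseText;
        };
        xhr.send();
      }
    </script>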

Which Method to Use?

Notice that I didn’t include using a rel=nofollow directive on your links in this list of ways to block bots. Nofollow is a classic misnomer because search engine crawlers absolutely do crawl the links marked nofollow. They just don’t pass link authority through the link to the destination page. In essence, all a nofollow tells a crawler is that you’re willing to link to a page but you’re not willing to trust it.
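
For reference, a nofollowed link looks like this (the destination URL is a placeholder):

    <a href="http://www.example.com/untrusted-page" rel="nofollow">Anchor text</a>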

Also conspicuously absent from this list is the canonical tag. Usually referenced alongside 301 redirects, canonical tags do not block search engines from indexing content. Whereas the 301 redirect is a command, the canonical tag is a polite request to avoid indexing the page and to pass any link authority to the page specified. When you need to be certain a page is blocked from the index, a command is preferable by far to a polite request.
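
For reference, a canonical tag in the head of a duplicate page points to the preferred version (the URL is a placeholder):

    <link rel="canonical" href="http://www.example.com/preferred-page" />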

Of these five methods, the first three are by far the most frequently used. Understanding the strengths and limitations of robots.txt disallows, meta robots noindexing, and 301 redirects will arm you with the tools needed to trim unwanted pages from the search engines’ indices.

Jill Kocher

Comments (5)

  1. eCommerce Designer - iMReGaBRi October 25, 2013

    Hello Jill! Thanks for your awesome article. I’m sharing it wherever I can; it’s really useful!

    I have a question. What can you do if you have an ecommerce site with 30,000-50,000 generated product-list subpages because of the layered navigation?

    I know I can use noindex, nofollow on the layered-navigation part of the site build, but the problem already exists: when the ecommerce site launched, no one used noindex, nofollow, so after one year there are tons of new HTML pages (duplicates too) because of the layered navigation.

    I think I’ll do the noindex, nofollow changes, but then what comes next? Webmaster Tools reports thousands of missing pages. Do I just need to wait one or two months for the problem to resolve itself after I add noindex, nofollow? (I can’t manually 301 redirect 50,000 pages, and redirecting those pages to the homepage isn’t a perfect solution, I think.)

    Thanks for your help, and keep up the good work. I will share all of your articles!

    Have a nice weekend,
    Gabriel

    http://www.ecommerce-designer.eu

  2. B. Moore October 25, 2013

    What do you do if your site is leaking pages with SIDs in the URL into the Google SERPs?

    Any help is greatly appreciated.

  3. James West October 26, 2013

    There are several ways to prevent Google, Yahoo!, Bing or Ask from indexing a site’s pages. Thanks for looking at the different search engine blocking methods!
    James West – seocavalry.com

  4. Alex October 29, 2013

    Can Google consider hiding faceted navigation (or layered navigation) from Googlebot to be cloaking?

  5. Bhavik Vyas November 1, 2013

    Nice to see you out in the daylight, Sean. Re blocking specific areas: as long as your URLs are set up with directories that allow you to isolate specific page types, you can block the bots in robots.txt from crawling those directories. I also recommend double-bagging them by using the noindex tag on those pages.

    Regards
    Bhavik Vyas
