SEO: The Duplicate Content Penalty

The question of Google’s supposed “duplicate content penalty” seems to be on everybody’s minds these days. This issue is particularly relevant for dynamic ecommerce websites, as they often have multiple URLs that lead to the same product content (or nearly the same, with only a variation in the product’s color or size).

Several factors magnify the problem of duplicate content. The fact that manufacturers give all their distributors the same product descriptions means those identical phrases end up on numerous sites including your own. Then there are the “scrapers” that run rampant across the web, lifting your content to use on their spam sites. How to foil the scrapers is a big topic that will need to be addressed in a future column.

There has been a lot of talk — misinformation, really — about the “duplicate content penalty” from Google. It’s a myth; it doesn’t exist. You’re hearing it straight from the horse’s (Google’s) mouth here, here, and here. I have it from reliable sources within the Googleplex that Google very rarely penalizes for duplicate content. Instead, it’s a filter.

Duplicate content is copy that is substantively similar to copy elsewhere on the web and that search engine spiders can find. Search engines don’t want to present users with similar-looking listings on the same search results page because that would degrade the search experience. So Google applies a query-time filter to avoid displaying multiple copies of the same or very similar content.

Google engineers don’t want to penalize useful websites for the inadvertent creation of duplicate pages — such as when there isn’t a 301 redirect from domain.com to www.domain.com, when the site responds to multiple domain names, or when there are tracking or otherwise superfluous parameters in the URL like session IDs, tracking tags like source=topnav, flags like photos=on and photos=off, etc.
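The inadvertent duplicates described above can be collapsed by reducing every incoming URL to one preferred form and issuing a 301 redirect whenever the request differs from it. What follows is a minimal sketch, not a definitive implementation. It assumes www.domain.com is the preferred host, and the parameter names in SUPERFLUOUS_PARAMS are hypothetical stand-ins for whatever session IDs, tracking tags, and flags your own platform appends.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the www host is the preferred (canonical) one.
CANONICAL_HOST = "www.domain.com"
ALTERNATE_HOSTS = {"domain.com"}

# Hypothetical names for parameters that do not change the page content.
SUPERFLUOUS_PARAMS = {"sessionid", "source", "photos"}


def canonicalize(url: str) -> str:
    """Return the single preferred form of a URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in ALTERNATE_HOSTS:
        host = CANONICAL_HOST
    # Keep only the parameters that affect content, in a stable order,
    # so that reordered query strings map to the same URL.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SUPERFLUOUS_PARAMS
    )
    return urlunsplit((parts.scheme, host, parts.path or "/", urlencode(kept), ""))


def redirect_target(url: str) -> Optional[str]:
    """Return the URL to 301-redirect to, or None if the URL is already canonical."""
    canonical = canonicalize(url)
    return canonical if canonical != url else None
```

In this sketch, a request for http://domain.com/product.html?id=42&sessionid=abc123&source=topnav would be redirected to http://www.domain.com/product.html?id=42, while a request that is already in the preferred form passes through untouched.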

Indeed, even Google’s own Googlestore.com has long-standing duplicate content issues. A previous incarnation of the site had thousands of copies of category pages — e.g., a search for inurl:accessories.html site:googlestore.com returned more than 7,000 pages — due to session IDs in the URLs. Googlestore.com has since corrected that, but it still has, on average, five copies of every product page indexed.

A situation like this leads to PageRank dilution. One definitive version of the product page will receive more PageRank than any one of five versions will, because the votes (links) are split five ways. The end result is that a product page with duplicate copies will never rank as well in Google’s search engine as a unique page.
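The arithmetic behind that split can be illustrated with a toy model. It assumes every inbound link carries an equal vote and that the votes divide evenly across the indexed copies; real PageRank is considerably more involved, and the figure of 50 inbound links is made up for the example.

```python
def votes_per_url(inbound_links: int, indexed_copies: int) -> float:
    """Split a page's inbound links evenly across its indexed copies (toy model)."""
    if indexed_copies < 1:
        raise ValueError("a page needs at least one indexed copy")
    return inbound_links / indexed_copies


# Hypothetical product page with 50 inbound links.
single_version = votes_per_url(50, 1)  # one definitive URL keeps all 50 votes
five_versions = votes_per_url(50, 5)   # each of five copies is left with 10
```

Under these assumptions, consolidating five copies into one URL gives that URL five times the votes any single copy had.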

This is not a penalty, merely the natural consequence of an overlooked problem.

It’s true that even a filter can feel like a penalty if you are filtered out of the search results and your competitor is left to collect the reward. This is more likely to occur if you use the same manufacturer-supplied product copy as everyone else and fold in little or no unique content of your own. If you add nothing of your own, you will automatically garner less authority/PageRank than your competitors.

“I’ve been hit with a duplicate content penalty” seems to be the excuse du jour. A year ago it was “I’m being sandboxed.” I’m tired of hearing either one bandied about by site operators who use the excuse as a crutch. The real problem is a lack of understanding of best practices and a flawed SEO implementation.

Best practice dictates that you should eliminate, to the best of your ability, the occurrence of duplicate pages in the search engines. It also requires that you make your content as unique as possible to differentiate it from what’s found on other sites. If the snippet and/or “shingles” on your page are strikingly similar to those on someone else’s, be warned that Google tends to favor the page with greater PageRank. Of course, this is an overly simplistic explanation. Suffice it to say, since duplicate results are going to be filtered out, you will be better off if your site is well-endowed with strong PageRank.
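For readers unfamiliar with “shingles,” the term refers to overlapping runs of consecutive words, and two pages can be compared by how many shingles they share. The sketch below uses the textbook form of that technique (word shingles scored with Jaccard similarity). It is an illustration only: how Google actually measures similarity is not public, and the shingle size of four words is an arbitrary assumption.

```python
def shingles(text: str, size: int = 4) -> set:
    """Return the set of overlapping word runs ("shingles") in a text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def resemblance(a: str, b: str, size: int = 4) -> float:
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    union = set_a | set_b
    if not union:
        return 0.0  # nothing to compare
    return len(set_a & set_b) / len(union)
```

You could run your own product description and the manufacturer’s feed copy through resemblance() to get a rough sense of how much of your page is shared text; a score near 1.0 means the two are nearly identical.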

Don’t get overly concerned if spammers scrape your content and post it on their own sites. Google is not going to demote you for that. Similarly, don’t get overly concerned if you have the same product copy as a hundred other retailers who sell the same wares. As I said, you won’t be penalized for that either.

My best advice to you is to augment any non-unique content you have obtained through data feeds. Surround it with unique, relevant content such as customer-contributed (or self-written) product reviews and related product recommendations (up-sells and cross-sells). By all means, tweak the product copy as much as you can: Paraphrase it, incorporate synonyms, revise it to include better keyword choices, and embellish those paragraphs with additional descriptive prose. Don’t stop with descriptions: Make unique product page title tags rather than just using the product name (which is what everybody else does). Incorporate additional bits of information into the title tag, such as the model number, if that is something a lot of people use to search. You can figure this out by using keyword research tools; see my past article on keyword sleuthing.
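The title tag advice can be sketched as a small template function. This is one possible approach under stated assumptions: the function name, the field names, the ordering, and the “|” separator are all illustrative choices, and which extra details are worth including should come from your own keyword research.

```python
from html import escape
from typing import Optional


def build_title_tag(
    product_name: str,
    model_number: Optional[str] = None,
    descriptor: Optional[str] = None,
    store_name: Optional[str] = None,
) -> str:
    """Compose a product title tag from the name plus optional extra details."""
    # Lead with the product name, then the details shoppers may search on.
    lead = " ".join(
        part.strip()
        for part in (product_name, model_number, descriptor)
        if part and part.strip()
    )
    parts = [lead]
    if store_name and store_name.strip():
        parts.append(store_name.strip())
    # Escape &, < and > so the text is safe inside the title element.
    return "<title>" + escape(" | ".join(parts), quote=False) + "</title>"
```

For a hypothetical product, build_title_tag("Acme Trail Boot", "TB-200", "Waterproof Hiking Boot", "Example Outfitters") yields a title that carries the model number and a descriptive phrase rather than the bare product name every other retailer uses.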

Many online retailers have implemented URL rewriting to make their URLs more search engine friendly; see my article Avoid Complex URLs. Implementing URL rewriting may sometimes be a major initiative that must be phased in. Keep in mind that not all the URLs across the site can be replaced with search engine-friendly versions in one go-around. In any event, the final outcome is usually a less duplicate-laden website, because superfluous variables have been removed. The trick is to ensure that, as the old URLs are dropped from the search indices, they all stay alive and functioning through the use of a 301 (permanent) redirect to their replacements.
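Keeping old URLs alive after a rewrite amounts to a lookup from each legacy URL to its replacement, answered with a 301 status. The following is a rough sketch of that lookup, independent of any particular web server. The legacy path /product.php, its id parameter, and the entries in REWRITTEN_PATHS are hypothetical; in practice the mapping would come from your catalog, and the redirect itself would usually be issued by your server or application framework.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit

# Hypothetical mapping from legacy product IDs to rewritten, friendly paths.
REWRITTEN_PATHS = {
    "42": "/widgets/blue-widget.html",
    "43": "/widgets/red-widget.html",
}

LEGACY_PATH = "/product.php"  # assumed form of the old dynamic URL


def resolve(request_url: str) -> Tuple[int, Optional[str]]:
    """Return (status, location) for a request path plus query string."""
    parts = urlsplit(request_url)
    if parts.path != LEGACY_PATH:
        return 200, None  # not a legacy URL; serve it normally
    ids = parse_qs(parts.query).get("id", [])
    if ids and ids[0] in REWRITTEN_PATHS:
        # Permanent redirect; session IDs and other extras are dropped.
        return 301, REWRITTEN_PATHS[ids[0]]
    return 404, None  # legacy URL with no known replacement
```

Because only the id parameter is consulted, every variant of an old URL (with or without session IDs or tracking tags) lands on the same rewritten page, which is what removes the duplicates.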

Duplicate pages are a reality of the web. Articles are syndicated all over the web. Developers stick session IDs and flags in the URLs that don’t substantially modify the content but create duplicate pages for the spiders when they crawl. Thankfully, Google has learned how to live with it.

Still, the easier we can make it for Googlebot, the better. (Tips for making it easier are here and here.)

Most of all, remember this: Duplicate content isn’t something that should keep you up at night. Now go get some sleep!

March 20, 2007 • Stephan Spencer
