There is a long-held belief that search engines, namely Google, penalize websites that duplicate content or produce material that is largely the same as other sites on the Internet. But I’m here to tell you that the Duplicate Content Penalty is a myth.
Think about it this way. If a page of content has five links into it and that page of content only loads at one URL, then all five of those links will flow their link popularity to a single URL. But imagine that same page of content with five links pointing to it that loads at five different URLs. Each of those duplicate URLs for that same piece of content now gets a single link’s worth of passed link popularity. Each is only one-fifth as strong as the single URL with all five links pointing to it.
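The arithmetic behind that comparison can be sketched in a few lines. This is only an illustration: it assumes every link passes the same amount of link popularity and that links are spread evenly across the duplicate URLs, which is a simplification and not how any search engine actually scores pages. The function name is made up for this example.

```python
def links_per_url(total_links: int, url_count: int) -> float:
    """Average number of inbound links each URL receives when
    total_links are spread evenly across url_count URLs."""
    if url_count < 1:
        raise ValueError("url_count must be at least 1")
    return total_links / url_count


# One URL for the page: all five links point to it.
single = links_per_url(5, 1)

# Five duplicate URLs for the same page: one link each.
split = links_per_url(5, 5)

# Relative strength of each duplicate compared with the single URL.
print(single, split, split / single)
```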
The Duplicate Content Penalty myth fosters misunderstanding about the real issue: link popularity. The ideal scenario for SEO is one URL for one page of content with one keyword target. I would advise ecommerce merchants to focus their efforts on optimization rather than penalty avoidance.
Causes of Content Duplication
Many different factors result in duplicate content, but one statement is true for them all: Duplicate content doesn’t exist unless there’s a link to it. If a site has duplicate content it’s because there’s at least one link to the same content at different URLs. Links to duplicate content URLs can crop up in breadcrumbs when tracking parameters are appended, when a site doesn’t link consistently by subdomain, when filtering and sorting options append parameters to the URL, when print versions generate a new URL, and many more ways. Worse, each of these can compound the other sources of duplicate content, spawning hundreds of URL variations for the same single page of content.
Home pages would be one example. In some cases, the domain resolves as the home page but clicking on the navigational links to the home page (the same page of content) results in a different URL. Banana Republic has 18 Google-indexed versions of its home page, and several others that aren’t indexed, including:
- http://www.bananarepublic.com/
- http://bananarepublic.gap.com/
- http://bananarepublic.gap.com/?ssiteID=plft
- http://bananarepublic.gap.com/?kwid=1&redirect=true
- http://bananarepublic.gap.com/browse/home.do?ssiteID=ON
Each of these home page URLs has at least one page linking to it. Think how much stronger the home page could be if every link that now points to a duplicate home page URL pointed to http://www.bananarepublic.com/ instead.
Types of Content Duplication
Canonical
Lack of canonicalization is a common source of duplicate content. Canonicalization refers to the removal of duplicate versions, or in SEO, to the consolidation of link popularity to a single version of a URL for a single page of content. Consider the following 10 example URLs for the same fictional page of content:
- Canonical URL: http://www.example.com/directory 4/index.html
- Protocol duplication: https://www.example.com/directory 4/index.html
- IP duplication: http://62.184.141.58/directory 4/index.html
- Subdomain duplication: http://example.com/directory 4/index.html
- File path duplication: http://www.example.com/site/directory 4/index.html
- File duplication: http://www.example.com/directory 4/
- Case duplication: http://www.example.com/Directory 4/Index.html
- Special character duplication: http://www.example.com/directory%204/index.html
- Tracking duplication: http://www.example.com/directory 4/index.html?tracking=true
- Legacy URL duplication: http://www.example.com/site/directory.aspx?directory=4&stuff=more
The URLs may be fictional but I’ve worked with sites that had every one of these sources of duplicate content and more. In the worst cases, link popularity was split between more than 1,000 URLs for a single product page. That page would be much stronger if every link pointed to a single URL.
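To make the idea concrete, here is a rough sketch of how the example variations above could be mapped back to the one canonical URL. Everything in it is an assumption for illustration: the alias hosts, the "/site" prefix rule, the "tracking" parameter name and the choice to lowercase paths all come from the fictional list, not from any real site. The legacy URL is left out because it would need an explicit lookup table.

```python
from urllib.parse import parse_qsl, unquote, urlencode, urlsplit, urlunsplit

CANONICAL = "http://www.example.com/directory 4/index.html"

# Assumed settings for the fictional site in the list above.
CANONICAL_HOST = "www.example.com"
HOST_ALIASES = {"example.com", "62.184.141.58"}
TRACKING_PARAMS = {"tracking"}
DEFAULT_FILE = "index.html"


def canonicalize(url: str) -> str:
    """Collapse known duplicate variations of a URL into one form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in HOST_ALIASES:              # subdomain and IP duplication
        host = CANONICAL_HOST
    path = unquote(parts.path).lower()    # special character and case duplication
    if path.startswith("/site/"):         # file path duplication
        path = path[len("/site"):]
    if path.endswith("/"):                # file duplication
        path += DEFAULT_FILE
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in TRACKING_PARAMS]  # tracking duplication
    # Protocol duplication: always emit the http scheme.
    return urlunsplit(("http", host, path, urlencode(kept), ""))


VARIANTS = [
    "http://www.example.com/directory 4/index.html",
    "https://www.example.com/directory 4/index.html",
    "http://62.184.141.58/directory 4/index.html",
    "http://example.com/directory 4/index.html",
    "http://www.example.com/site/directory 4/index.html",
    "http://www.example.com/directory 4/",
    "http://www.example.com/Directory 4/Index.html",
    "http://www.example.com/directory%204/index.html",
    "http://www.example.com/directory 4/index.html?tracking=true",
]

for variant in VARIANTS:
    print(canonicalize(variant) == CANONICAL, variant)
```

Lowercasing the path is only safe on a site whose real URLs are all lowercase, so treat that rule in particular as a placeholder.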
The most effective way to canonicalize duplicate content, consolidate link popularity and de-index the duplicates is with 301 redirects.
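As a minimal sketch of what that looks like in practice, the function below decides whether a requested URL should be answered with a 301 and a Location header. The redirect table and the function name are hypothetical, built from the fictional URLs above. A real site would normally configure this in its web server or ecommerce platform.

```python
from http import HTTPStatus

CANONICAL_URL = "http://www.example.com/directory 4/index.html"

# Hypothetical table of known duplicate URLs and where each should redirect.
# The space in the path comes from the fictional examples; on the wire it
# would be percent-encoded.
REDIRECT_MAP = {
    "http://example.com/directory 4/index.html": CANONICAL_URL,
    "http://www.example.com/directory 4/": CANONICAL_URL,
    "http://www.example.com/site/directory.aspx?directory=4&stuff=more": CANONICAL_URL,
}


def respond(requested_url: str) -> tuple[int, dict[str, str]]:
    """Return (status code, headers) for a requested URL."""
    target = REDIRECT_MAP.get(requested_url)
    if target is not None:
        # Permanent redirect: send users and search engines to one URL.
        return int(HTTPStatus.MOVED_PERMANENTLY), {"Location": target}
    return int(HTTPStatus.OK), {}


print(respond("http://example.com/directory 4/index.html"))
print(respond(CANONICAL_URL))
```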
Cannibal
When two or more pages target the same keyword, that’s cannibalization. Ecommerce sites fall into this trap frequently when usability necessities such as pagination, filtering and sorting, email to a friend and other functions create unique pages with some or all of the same content. Technically these pages are not exact duplicates. They need to exist for usability reasons, so they can’t be canonicalized to a single URL with 301 redirects.
Site owners have two options in this case: Either differentiate the content to target different keyword themes, or apply a canonical tag to recommend consolidation of link popularity without redirecting the user.
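For the second option, the canonical tag is a link element in the head of the page. The sketch below builds one for a sorted category page. The shoe category URLs and the function name are hypothetical, chosen only to illustrate the pattern.

```python
from html import escape


def canonical_link_tag(canonical_url: str) -> str:
    """Build the <link rel="canonical"> element for a page's <head>."""
    # Escape the URL so characters such as & are valid inside the attribute.
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'


# Hypothetical example: sorted and paginated views of one category page
# all recommend the plain category URL as the version to rank.
VARIANT_TO_CANONICAL = {
    "http://www.example.com/shoes/?sort=price": "http://www.example.com/shoes/",
    "http://www.example.com/shoes/?sort=name&page=2": "http://www.example.com/shoes/",
}

for variant, canonical in VARIANT_TO_CANONICAL.items():
    print(variant, "->", canonical_link_tag(canonical))
```

The variant pages keep working for shoppers. The tag only recommends to the engines which URL should collect the link popularity.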
Resolving Duplicate Content
Remember that 301 redirects are an SEO’s best friend when it comes to canonicalizing and resolving duplicate content. If a redirect is off limits because the URL needs to function for humans, a canonical tag is the next best bet for consolidating link popularity. There are other options for suppressing content, such as meta noindex, robots.txt disallow, and 404 errors, but these will only de-index the duplication without consolidating the link popularity. For more detailed information on resolving duplicate content, view this tutorial or this video from Google Webmaster Tools on duplicate content.
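To show what a robots.txt disallow does, here is a small sketch using Python's standard robots.txt parser. The "/print/" directory and both URLs are hypothetical. The rule keeps compliant crawlers away from the print versions, but nothing in it passes the blocked URLs' link popularity to another page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a directory of print-version pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The print version is blocked for crawlers that obey robots.txt.
print_allowed = parser.can_fetch("*", "http://www.example.com/print/directory-4.html")

# The regular page is still crawlable.
page_allowed = parser.can_fetch("*", "http://www.example.com/directory-4.html")

print(print_allowed, page_allowed)
```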
This article is filed under Search Engine Optimization and has the following keyword tags: search engine optimization, marketing, training.
12 Comments
Steve says:
You are right.
I'm not sure if this is what you meant to say, though: "Duplicate content doesn’t exist unless there’s a link to it."
Shouldn't it be: "Duplicate content does exist unless there’s a link to it." ?
Jill Kocher says:
Hi Steve, thanks for the comment. Nope, I definitely meant: "Duplicate content doesn’t exist unless there’s a link to it." The only way these duplicate URLs get generated and indexed is by the presence of at least one link to them.
For example, if you click the header navigation link for "DVDs and Books" on the Discovery Store, you get http://store.discovery.com/?v=discovery_dvds-books&nvbar=DVDs+%26+Books with an nvbar tracking parameter appended. Other links to the exact same "DVDs and Books" content page point to a URL without the tracking parameter: http://store.discovery.com/?v=discovery_dvds-books. The v parameter loads the content in both URLs, and the nvbar parameter just tracks when users click to the page from the header navigation bar. In this example, the tracking URL http://store.discovery.com/?v=discovery_dvds-books&nvbar=DVDs+%26+Books could not exist without the link to it in the header navigation.
Does that help?
BetaScott says:
There is certainly a lot of conflicting information about this out there. Many SEO experts claim that filters are run to determine if something is duplicate content and drop it from the database.
So, would syndicated content that is fully republished on another site not have any detrimental effect on the original?
Steve says:
Jill,
I think I understand what you mean. The example you gave me would be considered duplicate content since both URLs have links pointing to them.
If one of those URLs didn't have a link pointing to it, then it wouldn't be duplicate content since one of them would not be indexed.
Are we on the same page?
Greg Percifield says:
I have been using our robots.txt for issues such as /page/ /sort/ /alpha/ and pagination.
I've done this because there are many cases when there are fewer than 10 products to fill a certain page and any of the links above could produce the exact same results.
Thanks to your article, I will be updating these so that we use the canonical tag.
GoogleVictim says:
Jill,
I believe the duplicate content you are referring to is the smaller part of the duplicate content problem that most web masters face. I agree that at a single site level duplicate content dilutes the link popularity but is not penalized.
However the bigger problem has been content shared with other sites. For example: Product descriptions on shopping sites, city or hotel descriptions on travel sites. Shared content on affiliate sites.
I have seen Google lash sites just because they share content with other similar sites.
What is your take on that?
George Zlatin says:
Yeah, I agree with Google Victim. You can't mention duplicate content without considering content "borrowed" from other sites... In my opinion there is definitely a Google penalty for this type of duplicate content.
Jill Kocher says:
Hi Steve -- yes, exactly my point. Do you buy it?
Jill Kocher says:
Hi BetaScott -- while it's true that engines tend to move duplicate content out of their primary index by way of filtering, that doesn't in the least diminish the problem it causes for sites. My contention is that it's less about the cluttery crufty blech that duplicate content creates in the index (although that's still a problem) and more about the waste of link popularity. If Google or Yahoo etc. decide to disregard a URL because it's a duplicate, then the links pointing to that duplicate URL are wasted in terms of the link popularity benefit they could provide. If a site resolves the duplicate content issue by forcing links to a single URL instead of 10s or 100s of duplicates, all those links that point to the individual URLs now point to one URL that has a stronger chance to rank by virtue of its stronger link popularity.
Jill Kocher says:
GoogleVictim & George, content syndication is a whole other issue, for sure. There are 2 types of sites here, the content syndicator (the original source) and the content repurposer (receives and reposts syndicated content). Search engines value unique content, that's a fact. So unless a site is strong enough in other ways (external links, other sources of unique content to offset the duplication, etc.) to overcome the "me too" impact of building a site around syndicated or stock content, it's not as likely that the site will rank well.
The holy grail is finding a way to mash up syndicated content with user generated content in a fantastically usable and compelling package that will attract links from other sites based on its sheer awesomeness. Naturally that's difficult to execute. And it's hard to create large amounts of unique content in a scalable and cost effective manner, otherwise sites would just do it. But whether we think it's fair or not, the engines' preference for unique content and link popularity is not likely to change.
Nat says:
Hi Jill,
Thanks for posting this article. The article and the link to the video were very helpful.
flackie says:
This article assumes that different URLs will automatically split page rank. Google is a bit cleverer than this: it will recognize pages that are the same, and pool the page rank for them.
This is the official Google blog on the matter: http://googlewebmastercentral.blogspot.com/2008/09/demystifying-duplicate-content-penalty.html
" 1. When we detect duplicate content, such as through variations caused by URL parameters, we group the duplicate URLs into one cluster. 2. We select what we think is the "best" URL to represent the cluster in search results. 3. We then consolidate properties of the URLs in the cluster, such as link popularity, to the representative URL."
Also, recommending 301 redirects is missing a far easier and better solution - the canonical tag. Again, here is Google's official blog: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html