Since its most recent Panda algorithm updates, Google has become more adept at penalising sites where the quality and quantity of content are poor. This has been most evident on minimalist sites that offer more rich images than text, but it has also affected sites with limited content for the user above the fold of the page.
The updates also continue to penalise sites whose content is duplicated across their own pages and across the Web. In this blog post I aim to give you some technical tips that could help you combat content duplication on your site. I will also explain a few best practices that you should apply to your site regarding content presentation and promotion.
Simply put, the best way to deal with content duplication is to remove it (or not create it in the first place). Sometimes, though, duplicated content needs to stay live for the user’s benefit. If that’s the case, the tips below can help prevent search engines from picking up on the duplication and penalising it.
A robots.txt disallow should be used with caution: if you disallow content incorrectly, it could result in your entire site or entire sections being removed from the index. This step may also be used in conjunction with the two following tips. Basically, all you need to do is identify the page(s) that duplicate other pages and disallow them, as in the example below:
Suppose the page www.mysite.com/home/ shows the same copy as the page www.mysite.com/. You can add a disallow to your robots.txt file that tells search engines to stay out of ‘/home/’ and everything beneath it. Be sure that none of the pages you want indexed sit beneath ‘/home/’ (for example ‘/home/important-stuff/’), as that content will be blocked too and could drop out of the index.
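As a minimal sketch of the disallow described above, the rule can be written out and sanity-checked before it goes live. The snippet assumes Python is to hand and uses its standard-library urllib.robotparser; the ‘/homepage/’ URL is a hypothetical extra page, added only to show that similar-looking paths are unaffected. Python’s parser does simple prefix matching, so treat the result as a rough check rather than a promise of how any particular search engine will behave.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule from the example: block /home/ and everything beneath it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /home/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.mysite.com/",
        "https://www.mysite.com/home/",
        "https://www.mysite.com/home/important-stuff/",
        "https://www.mysite.com/homepage/",  # hypothetical page, not from the article
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```

Running a check like this over the URLs you care about is a cheap way to catch a rule that blocks more than you meant it to.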
For more information about robots.txt files there is an excellent guest post on the Koozai site here.
Google Webmaster Tools URL Parameters
The URL Parameters section of Google Webmaster Tools works in much the same way as the robots.txt file: it lets you manually tell Google what to ignore and what to track when URL parameters control things such as pagination or translation of pages on your site. It is really important that this is implemented properly and checked thoroughly because, like the robots.txt file (but potentially worse), it can de-index more than you wanted it to. There is a warning on the first screen for a reason, and it’s best to learn more before using it.
.xml Sitemap Omission
This is relatively straightforward. If you know which duplicated content you don’t want indexed and you have implemented the other two tips I have mentioned, it’s best to remove those pages from the sitemap.xml file as well. Once you have taken the URLs out of the file, you can upload it to your domain and re-submit it to Google Webmaster Tools.
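If the sitemap is large, taking the URLs out by hand gets tedious. Here is a rough sketch of doing it with Python’s standard-library XML parser, assuming a standard sitemaps.org file; the function name and the two-URL example sitemap are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical two-page sitemap reusing the /home/ example from earlier.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mysite.com/</loc></url>
  <url><loc>https://www.mysite.com/home/</loc></url>
</urlset>"""


def drop_from_sitemap(sitemap_xml, unwanted_urls):
    """Return the sitemap XML without any <url> whose <loc> is in unwanted_urls."""
    ET.register_namespace("", SITEMAP_NS)  # keep the output free of ns0: prefixes
    root = ET.fromstring(sitemap_xml)
    unwanted = set(unwanted_urls)
    # Copy the list first so removing entries does not disturb the iteration.
    for url_el in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url_el.find(f"{{{SITEMAP_NS}}}loc")
        if loc is not None and (loc.text or "").strip() in unwanted:
            root.remove(url_el)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(drop_from_sitemap(EXAMPLE_SITEMAP, ["https://www.mysite.com/home/"]))
```

The cleaned output is what you would then upload and re-submit.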
‘nofollow’ Internal Links
This is an old-school approach, but if you have gone to the trouble of telling search engines to keep pages out of the index, it might be worth adding the ‘nofollow’ attribute to links that sit on pages you want indexed and point to pages you don’t. This acts as an additional flag to search engines that the duplicate pages are there for the user and not for search gains.
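Finding every internal link that needs the attribute can be the fiddly part. The sketch below is one possible way to audit a page, assuming you already have a list of the duplicate paths and the page’s HTML; the class name and sample markup are hypothetical. It simply reports links to those paths that lack rel="nofollow".

```python
from html.parser import HTMLParser

# Hypothetical page markup with three internal links.
PAGE = (
    '<a href="/home/">Home</a>'
    '<a href="/home/" rel="nofollow">Home again</a>'
    '<a href="/about/">About</a>'
)


class NofollowAudit(HTMLParser):
    """Collect links to known duplicate paths that are missing rel="nofollow"."""

    def __init__(self, duplicate_paths):
        super().__init__()
        self.duplicate_paths = set(duplicate_paths)
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        rel_values = (attr.get("rel") or "").lower().split()
        if href in self.duplicate_paths and "nofollow" not in rel_values:
            self.missing.append(href)


if __name__ == "__main__":
    audit = NofollowAudit(["/home/"])
    audit.feed(PAGE)
    print("Links still needing nofollow:", audit.missing)
```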
Canonicalisation may need to be done for lots of different reasons, so I’ll list a couple first to give you an idea of when you may want to canonicalise a page:
In the examples above you would add a canonical link to the duplicate pages. It tells search engines that the content on these pages is known to be duplicate and that the original source can be found at the URL the canonical link points to.
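For reference, a canonical link is a single tag in the head of the duplicate page. A minimal sketch of generating it, assuming your templates are built in Python (the helper name is hypothetical):

```python
from html import escape


def canonical_link(original_url: str) -> str:
    """Build the <link> tag to place in the <head> of a duplicate page."""
    # Escape the URL so characters such as & stay valid inside the attribute.
    return f'<link rel="canonical" href="{escape(original_url, quote=True)}">'


if __name__ == "__main__":
    # On www.mysite.com/home/ this would point back at the original page.
    print(canonical_link("https://www.mysite.com/"))
```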
noindex / nofollow in Meta
This is a relatively simple exercise and is usually implemented on pages such as blog category, tag and archive pages, where the content is there to help the user find the page they are looking for but is the same as the content in the blog itself. Adding nofollow and noindex to the robots meta tag, as well as removing the pages from the index, is a good idea. There are also some other good ‘best practices’ to follow in that respect and I will cover them later in this blog. You can also mix and match the values (e.g. ‘noindex, follow’ if you want robots to see the content but not index it).
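The combinations mentioned above all use the same robots meta tag with different values. As a small sketch, again assuming Python-built templates and a hypothetical helper name:

```python
def robots_meta(index: bool, follow: bool) -> str:
    """Build a robots meta tag for the <head> of a page."""
    content = ", ".join(
        ("index" if index else "noindex", "follow" if follow else "nofollow")
    )
    return f'<meta name="robots" content="{content}">'


if __name__ == "__main__":
    # Keep a category page out of the index and do not follow its links.
    print(robots_meta(index=False, follow=False))
    # Let robots follow the links on the page but keep the page itself out of the index.
    print(robots_meta(index=False, follow=True))
```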
www vs. Non-www Redirects
When your site can be reached at both https://www.sitename and https://sitename, the content could well be seen as duplicated. The best way around this is to redirect one version to the other so that only one version can ever be indexed. You can do this with a URL rewrite at server level.
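The rewrite itself normally lives in your web server configuration, and the syntax depends on the server. The logic is the same everywhere, though, and can be sketched in Python. This assumes you have picked the www version as the one to keep; the function name is hypothetical and any port number in the URL is ignored for simplicity.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def _without_www(host: str) -> str:
    return host[4:] if host.startswith("www.") else host


def redirect_target(url: str, preferred_host: str) -> Optional[str]:
    """Return the URL to 301-redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    preferred = preferred_host.lower()
    if host == preferred:
        return None  # already on the preferred version
    if _without_www(host) != _without_www(preferred):
        return None  # a different site altogether, leave it alone
    # Same path and query string, just on the preferred host.
    return urlunsplit((parts.scheme, preferred, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(redirect_target("https://mysite.com/blog/?page=2", "www.mysite.com"))
```

Whichever version you keep, the redirect should be permanent (a 301) and should preserve the path, so that every non-preferred URL lands on its exact counterpart.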
Rel=Prev & Rel=Next
This can be a little tricky to implement and is mainly used on the component pages of a paginated series. It works in very much the same way as canonicalisation: it indicates to search engines the relationship between the URLs in the series. Google explains it really well on its Webmaster Central Blog.
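To make the relationship concrete, here is a sketch of the tags each page in a series would carry. The ‘?page=N’ URL pattern and the function name are assumptions for illustration; your own pagination URLs may look quite different.

```python
def pagination_links(base_url: str, page: int, last_page: int) -> list:
    """Return the rel=prev / rel=next tags for one page of a paginated series."""

    def page_url(n: int) -> str:
        # Assumed URL scheme: page 1 is the bare URL, later pages use ?page=N.
        return base_url if n == 1 else f"{base_url}?page={n}"

    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{page_url(page - 1)}">')
    if page < last_page:
        tags.append(f'<link rel="next" href="{page_url(page + 1)}">')
    return tags


if __name__ == "__main__":
    for tag in pagination_links("https://www.mysite.com/blog/", page=2, last_page=3):
        print(tag)
```

The first page carries only rel="next" and the last page only rel="prev"; every page in between carries both.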
What To Do With International Duplicates
Sometimes sites have duplicate versions aimed at audiences in different countries that speak the same language. For example, you may have www.mysite.co.uk for a UK audience and www.mysite.com.au for an Australian audience. For whatever reason, rather than have all audiences reach the site from the same URL, the sites were set up on separate domains like these.
There are a few ways to stop search engines from thinking the content across the different TLDs is duplicated. Some of them are really simple and a few may be out of scope for your site for whatever reason, but most can be implemented.
In no particular order:
What To Do With Test Sites
The simplest solution is not to have them live! If possible, keep them on a test server that is not accessible from the internet. However, there may be a reason to have them live, perhaps on a subdomain. In that case the best way to ensure they aren’t indexed is to disallow them in the robots.txt file and use the URL Parameters settings in Webmaster Tools. Having done this, you may also need to (if the test site allows) add canonical links to the test versions of the pages pointing to the live versions.
You should also set a password on the test site so a random user doesn’t accidentally get to your test site by mistyping your URL.
You may have duplicate content on your site and not even know it. I’m not talking about entire pages, as they should be pretty obvious to spot and resolve. I am talking about dynamic content on pages or category-style pages. Below are a few examples of places on a site that could be deemed duplicated, each with a quick solution that might fix the problem before it earns you a penalty.
On your site you may have a section to promote your blog that contains the most recent articles. Often the blog snippets are pulled dynamically from the blog and will therefore duplicate the blog’s own content.
How to resolve this – You could remove the snippet altogether and add a Blog link to the top navigation bar instead. Alternatively, some blog platforms allow the author to write a separate excerpt for each post, which then takes the place of the dynamic snippet.
Blog Category and Archive Pages
As I mentioned earlier in the post, you can use the tips above to keep these pages out of the index, but if you don’t really need them for the user then just remove them. Some blogs have plugins to remove these (e.g. many of the great plugins by Yoast), so it can be easy. Failing that, redirect the category or archive URL to the main blog page.
In much the same way as blog snippets, testimonial snippets are likely to be duplicated. If you have a fair few testimonials, you can use a few in the promotion on your home page and keep the rest on the main testimonials page. Alternatively, treat the section as an advert for the testimonials page and invite the user to view it, without using a snippet at all.
Scrolling Product Banners
If you have a few ‘Top Products’ that you like to promote on a scrolling banner or anywhere else on your site, the descriptions are quite likely to be duplicated. The best way to resolve this is to write unique product descriptions for the banner or promotion where possible, or to serve the banner from a separate frame so that the descriptions are seen on only one page.
The end / The end
These are just a few examples, and there are lots of other ways you can inadvertently duplicate your content. Hopefully you now have a good idea of how to resolve it, or at least how to protect yourself from penalties as best you can.
If you have any other tips or ideas please feel free to share them!
For more information or to read something very similar elsewhere I have posted this article on several different sites….*