Canonicalization is an effective tool for getting rid of internal or
external duplication. It clarifies which URL is the preferred version of
the content, which matters most when you have a lot of syndication or a
GET parameter heavy environment.
Rel=canonical is a way to specify the preferred version of resources
with duplicate content. It will not prevent your duplicates from being
crawled and it will not prevent search engines from indexing your
content. Instead it serves as a suggestion to the search engine which
document to prefer when two or more documents are identical or very
similar.
To understand how rel=canonical is used by search engines, you should imagine the following:
The search engine has an unfiltered set of results for a search query.
Now the search engine tries to eliminate the duplicates. At this point
the suggestion made with the rel=canonical will be processed and in most
cases be used as a directive to show your preferred URL for the content
while filtering duplicate content.
There are quite a few cases in which the rel=canonical is commonly
misused. The effect can be ranking loss or wasted ranking potential.
This article will point out common mistakes and show when to use rel=canonical - and when not to use it.
How to use rel=canonical
Canonical in HTML markup vs HTTP header
There are two ways to include a canonical into your site. One is in
the markup in the <head> tag, which is the most well known method.
In fact many CMS do this on their own by now.
<html>
<head>
<link rel="canonical" href="http://example.com/page.html">
</head>
<body>
...
The other way to include a canonical is
sending it with the HTTP header.
HTTP/1.1 200 OK
Content-Type: application/pdf
Link: <http://example.com/page.html>; rel="canonical"
Content-Length: 4223
...
The HTTP header canonical is usually used for documents that are not
HTML. This way you can use the HTTP header to set canonical URLs for
images, PDFs or any other document. In practice this is often applied to
print versions of the content, downloadable PDF versions and similar
cases where you end up with non HTML duplicates.
When to use rel=canonical
The most common and reasonable case of canonical usage is a self
reference in every unique document. The statement is basically: "Hey
there, I am the original document. Get me indexed and list me in search
results if different versions of this content are found fitting for the
search query."
This approach usually prevents problems with identical copies of your
content. Duplicate content can occur internally and externally.
The following scenarios can be prevented with proper canonical usage:
- Problems with GET parameters
  - Tracking parameters
  - Session parameters
  - Unwanted / unverified parameters
  - Unsorted parameters
- Problems with multiple URLs for the same content
  - CMS has more than one version for the content (e.g. version with ID, and speaking URL)
- Problems due to accessibility on different hosts / protocols / ports
  - HTTP / HTTPS
  - Port 80 / 8080
  - www / without www
  - different domains
- Duplicate content from external content syndication
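A self referencing canonical only helps with the parameter and host problems above if every variant of a page computes the same URL. The following is a minimal Python sketch of such a normalization, not a definitive implementation. The function name canonical_url, the IGNORED_PARAMS set and the default scheme and host are assumptions made for illustration; adjust them to your own site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of parameters that never change the content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def canonical_url(url, scheme="http", host="example.com"):
    """Return one preferred absolute URL for all variants of the same page."""
    parts = urlsplit(url)
    # Drop tracking/session parameters, then sort the rest so that
    # ?b=2&a=1 and ?a=1&b=2 map to the same canonical URL.
    params = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Force a single scheme and host; the port and fragment are dropped.
    return urlunsplit((scheme, host, parts.path or "/", urlencode(params), ""))


variant = "https://www.example.com:8080/shoes?utm_source=news&size=9&color=red"
print(canonical_url(variant))
```

Whether a parameter is safe to drop depends on your application, so the ignore list should be reviewed rather than copied.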
How and when not to use the rel=canonical
Choose full URLs over shortened URLs for rel=canonical
One possible source of problems with the canonical tag is using shortened (relative) URLs instead of full URLs as the canonical URL.
There is always a good chance your website has content that is
available using different protocols or hosts but with the same directory
and file name.
The markup can then contain exactly the same rel=canonical in every version, but in each version it resolves to a different URL.
Consider two pages that differ only in protocol and are set up like this:
http://example.com/page.html
<link rel="canonical" href="page.html">
https://example.com/page.html
<link rel="canonical" href="page.html">
Resolving those canonical links will result in the following different URLs:
- http://example.com/page.html
- https://example.com/page.html
The more complete your canonical URL is, the less error prone it is.
<link rel="canonical" href="page.html">
this can result in problems with
- directories
- hosts
- protocols
<link rel="canonical" href="/page.html">
this can result in problems with
- hosts
- protocols
<link rel="canonical" href="//example.com/page.html">
this can result in problems with
- protocols
<link rel="canonical" href="http://example.com/page.html">
this version is not affected by any of the problems.
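To see how the four notations above behave, you can resolve each href against several addresses of the same page. This is a small Python sketch using the standard library's urljoin. The two page URLs are hypothetical examples, and the helper name resolve_all is ours.

```python
from urllib.parse import urljoin

# Hypothetical addresses under which the same content is reachable.
PAGES = [
    "http://example.com/shop/page.html",
    "https://www.example.com/page.html",
]

HREFS = [
    "page.html",
    "/page.html",
    "//example.com/page.html",
    "http://example.com/page.html",
]


def resolve_all(href):
    """Resolve one canonical href against every page URL it appears on."""
    return {urljoin(page, href) for page in PAGES}


for href in HREFS:
    targets = sorted(resolve_all(href))
    print(f"{href!r} resolves to {len(targets)} URL(s): {targets}")
```

With these two page URLs, only the absolute href resolves to a single target. The three shorter notations each produce two different canonical URLs.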
While there are legitimate reasons to use short or relative URLs, and
RFC 6596 allows them, absolute canonicals avoid a number of possible issues. Keep in mind:
- Relative URLs are shorter but more error prone, harder to maintain
and harder for third party crawlers to evaluate if they are not applied
100% correctly.
- If your site gets cloned to an external website, a relative URL in
the canonical tag resolves to the clone's own address, so it does not
point back to you - in fact you might be helping the content thief.
We strongly suggest that you always use absolute URLs when using rel=canonical.
Do not use rel=canonical for localization
If your website targets more than one country or more than one
language you should help search engines to properly identify the correct
URLs for the specific languages or target countries.
It is not very common, but there are websites with multiple languages
where the rel=canonical element points to a preferred language version.
The misconception is that you can use the rel=canonical element to specify a preferred localized version.
Instead, use rel=alternate together with the hreflang attribute to clarify which version
belongs to which target market. It will then be easier for search
engines to index all versions properly and show the right results to the
right target group.
The hreflang attribute can be used to make a connection between all distinct
language versions of your content. You can also use x-default to
specify a URL as default for users outside your focus regions or
targeted languages.
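As an illustration, the following Python sketch renders the rel=alternate hreflang tags for a set of language versions. The function name hreflang_links, the language codes and the URLs are assumptions for this example only. In this sketch every language version would carry the same set of tags in its <head>, next to its own self referencing canonical.

```python
from html import escape


def hreflang_links(versions, default_url=None):
    """Build one <link rel="alternate"> tag per language version."""
    lines = [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}">'
        for lang, url in versions.items()
    ]
    if default_url is not None:
        # x-default marks the fallback for users outside the targeted languages.
        lines.append(
            f'<link rel="alternate" hreflang="x-default" href="{escape(default_url)}">'
        )
    return lines


# Hypothetical language versions of one page.
versions = {
    "en": "http://example.com/en/page.html",
    "de": "http://example.com/de/page.html",
}
print("\n".join(hreflang_links(versions, default_url="http://example.com/page.html")))
```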
More information about how to set up multilanguage websites can be found in the Google Search Console Help Center.
We strongly suggest not to use rel=canonical for localization.
Do not use rel=canonical for PageRank Sculpting
In some cases we found canonical tags used for PageRank Sculpting.
PageRank Sculpting is a method to shape PageRank flow on a website. Some
of a website's pages are not supposed to rank in the search results.
That's why some webmasters try to channel the PageRank past these pages
in order to strengthen the actual landing pages. This usually applies to
functional pages like "Imprint" or "About us" - or even to pagination
and category pages.
In such a setup the "Imprint" or "About us" page would carry a
canonical tag with an important landing page as its target URL.
The attempt is based on the assumption that a canonical will channel
all link juice to the canonical URL, regardless of the page's content. Our
tests have shown that this is not the case.
In our experience it isn't helpful and might result in search engine spiders ignoring the website's canonical tags altogether.
Don't use canonical tags for PageRank Sculpting!
Canonical usage in pagination
Pagination is a technique for dividing content into discrete pages.
Pagination is used when the content is too large to show on just one
URL. In this case the content can be split into multiple pages.
If your website uses pagination, you might want to prevent search
engines from crawling or indexing the pagination beyond the first page.
This applies especially if
- the pagination pages do not add any indexation value over page one
- the website has a very deep pagination level, so the crawler would spend a considerable share of your crawl budget on crawling your pagination
- the pagination pages can be considered thin content
With pagination you are usually not dealing with duplicate content. The
pages are not even considered to be similar. Rel=canonical is for
duplicate or similar content. Defining a non self referencing canonical
URL on pagination pages is in most cases a misuse of the rel=canonical
element.
With a non self referencing canonical URL on pagination pages, you
basically tell search engines to ignore your content on the specific
page. If this is your goal, you should use the noindex robots directive
to prevent indexing of those pages or block crawling with
robots.txt.
Canonical usage for similar products
A more common mistake - especially for online shops - is the usage of
rel=canonical for very similar products (minor differences, different
colours or product versions).
The positive effect: Your similar products won't get recognized as duplicate content.
The negative effect: Your canonical URL product version gets preferred
in the search results. So if a customer searches for the alternate, the
wrong product version is likely to show up in search.
Example:
You are selling T-shirts and you offer the same shirt in the colour
versions blue, red and yellow on different URLs. To
prevent duplicate content issues you point your canonical to the best
selling version of the product - say, blue - to have it preferred
in the search results. The default product gets a self referencing
rel=canonical while the alternates all get a canonical pointing to the
default product URL.
If a search user now searches for "red shirt" Google might remember it
once found a red shirt in your shop but it also remembers you've told
it (by using the rel=canonical) to show the blue shirt page.
This leads to a search result that is less fitting to the user search
query and therefore a lower click through rate from the search results.
What you really want is to get each of your alternates into the Google
index, showing up at the right moment. You can achieve this by using
schema.org markup.
There is a neat little property called
isSimilarTo.
With the help of this structured data property you can tell Google that
your blue, red and yellow shirts are all of the same importance and
differ in only one small aspect. With this in place you can
use self referencing canonical URLs on all your product alternates.
And that's exactly what you want to tell the search engines. Usually
Google will get it right and show the right product version for the
right query.
Canonical usage for mobile website versions
Sometimes a website offers a distinct mobile website version that is hosted on an extra host like
m.example.com. Now the question is how to prevent duplicate content issues between
your main website and that extra mobile version of your website.
Rel=alternate tells the search engines there are other versions of
this content available that may be a perfect fit for the search user,
depending on device and user agent.
In some cases we've seen that the mobile version is set up with a
canonical pointing to the main version of the website but the
rel=alternate was missing. The result is that the mobile website version
will likely not be shown in the search results. Mobile users will find
your desktop optimized website in search results. Usually the desktop
optimized website will not be mobile friendly and if there is no
autodetection and redirect to the mobile website, users will have a
hard time using your website on their mobile devices.
What you really want is to have both versions of the site to show up
in the right situation. That's what you achieve using the combination of
rel=alternate and rel=canonical.
example desktop version http://example.com
<html>
<head>
<link rel="canonical" href="http://example.com/">
<link rel="alternate" href="http://m.example.com/" media="only screen and (max-width: 640px)">
</head>
<body>
example mobile version http://m.example.com
<html>
<head>
<link rel="canonical" href="http://example.com/">
</head>
<body>
Common unintended mistakes with the rel=canonical
Using multiple canonical tags with different target URLs
If you put a second canonical tag in your page and the two carry
conflicting canonical URLs, search engines are likely to ignore both
canonical tags.
Usually this happens unintentionally as a result of SEO plugin usage, leading to weird search engine behaviour.
Take care: This also applies to combinations of an HTTP header canonical and an HTML
canonical. If you don't look closely, it can be tricky to
find.
Using canonical outside of <head> area
Another common mistake in canonical usage is to place it outside the
<head> tag, especially in the body tag. Most search engines will
ignore canonical tags outside the <head>. Google in particular will not
interpret any canonical tag placed outside the <head> tag.
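Both mistakes described above, conflicting canonical tags and canonical tags outside the <head>, can be spotted automatically. Below is a minimal Python sketch based on the standard library's html.parser. The class name CanonicalAudit and the function audit are hypothetical, the sketch assumes explicit <head> and </head> tags, and it only inspects the markup, so a canonical sent in the HTTP header still has to be compared separately.

```python
from html.parser import HTMLParser


class CanonicalAudit(HTMLParser):
    """Collect canonical hrefs, split by position inside or outside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_head_hrefs = []
        self.outside_head_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link":
            attributes = dict(attrs)
            rel_values = (attributes.get("rel") or "").lower().split()
            if "canonical" in rel_values:
                bucket = self.in_head_hrefs if self.in_head else self.outside_head_hrefs
                bucket.append(attributes.get("href"))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def audit(html_text):
    """Return a list of canonical problems found in the markup."""
    parser = CanonicalAudit()
    parser.feed(html_text)
    problems = []
    if len(set(parser.in_head_hrefs)) > 1:
        problems.append("conflicting canonicals in <head>")
    if parser.outside_head_hrefs:
        problems.append("canonical outside <head>")
    return problems


page = (
    '<html><head><link rel="canonical" href="http://example.com/a">'
    '<link rel="canonical" href="http://example.com/b"></head>'
    '<body><link rel="canonical" href="http://example.com/a"></body></html>'
)
print(audit(page))
```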
Canonical pointing to target URL with other status code than 200
If your canonical URL delivers a non-200 status code, it might cause your suggestion to get ignored by search engines.
In case the canonical URL target delivers a 30x redirect, it forces
the search engine spider to crawl one additional URL. Across many pages
these extra requests add up quickly and waste your crawl budget.
The worst case is a rel=canonical target that returns a 4xx or 5xx status
code. Such a status code makes the canonical link fail
and therefore likely forces the search engine to ignore your
canonical altogether. As a result the index might get stuffed with
duplicates.
Always make sure you check your canonical URL for proper functionality and server response code 200.
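Such a check is easy to script. The following is a rough Python sketch using only the standard library. The function names classify_status and fetch_status are ours, and the sketch assumes the server answers HEAD requests the same way it answers GET requests, which is not guaranteed for every server.

```python
import http.client
from urllib.parse import urlsplit


def classify_status(status):
    """Map the status code of a canonical target to a short verdict."""
    if status == 200:
        return "ok"
    if 300 <= status < 400:
        return "redirect: point the canonical at the final URL instead"
    if 400 <= status < 600:
        return "error: the canonical is likely to be ignored"
    return "unexpected: check manually"


def fetch_status(url, timeout=10):
    """Request the URL once, without following redirects, and return the status."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        connection = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        connection = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        connection.request("HEAD", path)
        return connection.getresponse().status
    finally:
        connection.close()


# Usage (needs network access):
# print(classify_status(fetch_status("http://example.com/page.html")))
```

Because http.client does not follow redirects on its own, a 30x answer from the canonical target is reported as such instead of being hidden behind the final page.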
Thanks to
https://audisto.com/