Search engine giants Google, Yahoo, and Microsoft all announced on their respective official search engine blogs that they would begin supporting a new so-called canonical tag intended to help their search engines cut through the clutter of the Web. More specifically, the announcements are a signal to webmasters to start using the specialized tag to help the big search engines do a better job of indexing Web sites where multiple URLs (uniform resource locators) all point to the same content.
Basically, dynamically generated Web sites with content in databases tend to create different URLs that can point to or create the same page of content. When a search engine attempts to crawl these sites, it finds all these different URLs and has to figure out which ones are most important. This is one reason search engine users sometimes end up seeing the same content show up more than once.
To complicate matters, site publishers that run special marketing programs, for example, might create different URLs that all link to a single page so they can track where their leads came from.
“If your site has identical or vastly similar content that’s accessible through multiple URLs, this [new canonical] format provides you with more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to your preferred version,” explained Joachim Kupke, Google senior software engineer, and Maile Ohye, developer programs tech lead, on the official Google Webmaster Center Blog.
Overworked and Confused Search Engines
Apparently, all this clutter makes it difficult for search engines to accurately index Web sites, and the new canonical tag is designed to let webmasters specify which URL the search engines (and searchers) should really care about. Google, Yahoo and Microsoft all explained the problem in various ways, but Yahoo’s example is perhaps the easiest to grok:
By using a <link rel="canonical"> tag in the <head> section of a page, a site can identify the one base URL that the search engines should pay attention to: the desired canonical URL.
Other links that could all generate the same page of content would then be treated as duplicates that the search engines could eliminate from search results.
So instead of attempting to manage hundreds, thousands or even millions of essentially duplicate links on various Web sites, search engines can be told which URLs really matter.
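To make the idea concrete, here is a minimal Python sketch of how a crawler might read the tag and fold duplicate URLs together. It is an illustration only, not how any of the three search engines actually process the tag; the example.com URLs, the page markup, and the names CanonicalFinder, find_canonical and consolidate are all hypothetical.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head and self.canonical is None:
            attr = dict(attrs)
            rels = (attr.get("rel") or "").lower().split()
            if "canonical" in rels and attr.get("href"):
                self.canonical = attr["href"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_canonical(html):
    """Return the canonical URL a page declares, or None if it declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page that is reachable through several different URLs.
PAGE = """<html><head>
<title>Blue Widget</title>
<link rel="canonical" href="http://www.example.com/widget">
</head><body>...</body></html>"""

CRAWLED = {
    "http://www.example.com/widget": PAGE,
    "http://www.example.com/widget?sessionid=123": PAGE,
    "http://www.example.com/widget?ref=newsletter": PAGE,
}


def consolidate(crawled):
    """Group crawled URLs under the canonical URL each page declares."""
    groups = {}
    for url, html in crawled.items():
        preferred = find_canonical(html) or url  # no tag: the URL stands for itself
        groups.setdefault(preferred, []).append(url)
    return groups


if __name__ == "__main__":
    for preferred, urls in consolidate(CRAWLED).items():
        print(preferred, "<-", len(urls), "crawled URLs")
```

In this sketch all three crawled URLs collapse into a single entry keyed by the URL the page itself names as canonical, which is the consolidation the announcements describe.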
Confusing? Here’s Another Example
Take an e-commerce Web site and a simple, ubiquitous product, a t-shirt: the site can end up creating many URLs for that one product based on just a handful of variables.
“Once you add up every combination of color and size [of a t-shirt], you could end up with a huge number of URLs that point to a page that is exactly the same other than the photo and caption. Once you multiply this by every product on the site, you could end up with a much greater number of URLs than you have content pages,” Vanessa Fox, a search engine and search marketing expert with Ninebyblue.com and Searchengineland.com, told TechNewsWorld.
“This issue is compounded for e-commerce — and similar — sites by sort orders. Different parameters in a URL may indicate that the page is sorted by lowest price or highest rating, for instance,” she added.
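The arithmetic behind Fox's point can be sketched in a few lines of Python. Every value here is made up for illustration: the shop.example.com address, the parameter names, and the lists of colors, sizes and sort orders are assumptions, not data from any real site.

```python
from itertools import product
from urllib.parse import urlencode

# All values below are invented for illustration.
BASE = "http://shop.example.com/tshirt"
COLORS = ["red", "blue", "green", "black"]
SIZES = ["s", "m", "l", "xl"]
SORTS = ["price_asc", "rating_desc", "newest"]


def variant_urls(base, colors, sizes, sorts):
    """Build one URL per combination of color, size and sort order."""
    return [
        base + "?" + urlencode({"color": c, "size": s, "sort": o})
        for c, s, o in product(colors, sizes, sorts)
    ]


urls = variant_urls(BASE, COLORS, SIZES, SORTS)

if __name__ == "__main__":
    print(len(urls))  # 4 colors * 4 sizes * 3 sort orders = 48
```

Under these assumed numbers, a single t-shirt page is reachable through 48 URLs; a catalog of 500 such products would expose 24,000 URLs for only 500 pages of content.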
Site Owners Have Been Asking for This
The move to canonical URLs is not just about making it easier for search engines to do their jobs, although it does help with that.
“If the search engine crawler spends a lot of time crawling variations of a single URL, it won’t have time to crawl as many unique pages, so the search engine index won’t be as comprehensive as it could have been. Pages will be missing. If the search engine returns the ‘wrong’ version in the results — such as an earlier version of a wiki-type page — then the results won’t be as useful as they could be,” Fox explained.
“In addition, over the last few years, search engines have made a substantial effort to facilitate a relationship with site owners and to make things work well for both the search engines and site owners. Site owners have been asking for this for a long time, and the engines have been working on the best way to help them with this issue,” she added.
The end goal, of course, is more relevant content in front of searchers.
“Businesses also want the ‘best’ version of the page to be returned in search results. In the case of the historical revisions of the wiki-like page, having an older version in the results might lead to a poor user experience,” Fox said.
“For branding purposes, businesses want the ‘clean’ URL to appear in the search results. Studies have shown that shorter URLs that are easy to understand are more likely to be clicked on by searchers. So, if a business can get the ‘canonical’ version of a URL to be displayed over a longer one with extraneous parameters, then they may see additional acquisition from search,” she noted.
So That’s a Lot of Coding for Webmasters, Right?
At first glance, it might appear that webmasters would have to do a considerable amount of coding to clean up all their URLs. However, now that Google, Microsoft and Yahoo all say they are behind a specific implementation of canonical URLs, companies can build applications that automate the process and generate canonical URLs for all the different ways they deliver the same content.
Joost de Valk, a Dutch developer and self-described “general” online marketeer, has already released plugins or modules for three platforms (WordPress, Magento and Drupal) that webmasters can install to start generating canonical tag information on their sites.
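As a rough idea of what such automation could look like, the Python sketch below derives a canonical URL by dropping query parameters that do not change the page's content, then emits the tag for the page's <head>. It is not how the plugins mentioned above work; the CONTENT_PARAMS whitelist, the function names and the example URLs are assumptions made for illustration, and a real site would need its own list of meaningful parameters.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed whitelist: parameters that actually select different content.
# Tracking codes, session IDs and the like are dropped.
CONTENT_PARAMS = {"id", "page"}


def canonical_url(url, keep=CONTENT_PARAMS):
    """Return the URL with only content-selecting parameters, in a stable order."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in keep)
    # Lowercase the host and drop the fragment so trivial variants compare equal.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def canonical_tag(url):
    """Build the <link> element to place in the page's <head>."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)


if __name__ == "__main__":
    print(canonical_tag(
        "http://www.Example.com/product?utm_source=ad&id=42&sessionid=9#reviews"
    ))
```

A whitelist is used here rather than a blacklist of known tracking parameters, on the assumption that a new, unrecognized parameter is more likely to be clutter than content; a site where the opposite holds would want to invert that choice.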