One of the interesting things about sharing an office with Jyri is that our free-association stream-of-consciousness conversations often lead to places worth exploring further.
On Friday Jyri and I started wondering about the link rel values documented in the XFN 1.1 profile, which include not only the relatively commonplace me and friend values, but also unconventional values such as colleague, muse, and spouse. But how frequently are the lesser-known rel values really used? Rather than speculate blindly, I wrote a simple mapreduce to check the web and find out for sure.
The mapreduce scanned approximately 177 million recently crawled HTML
documents, parsing and counting rel values in link and
anchor tags along the way. In those 177M documents, I found just over
19 billion <a> and <link> tags in total. And of those 19B
tags, 1.8 billion of them contained a non-empty rel
attribute.
Following the HTML5 rules for space-separated tokens, I split each rel value on [\s\t\n\r\f] and extracted each individual value. In total, over 1.9B instances of rel values were found, or an average of just over 10 per HTML document (with some tags having more than one rel value).
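For the curious, here is a minimal Python sketch of the parse, split, and count steps described above. It is not the actual mapreduce: the names RelCollector, map_document, and reduce_counts are made up for illustration, and it leans on the standard library's html.parser rather than a real crawl pipeline. It splits only on the HTML5 space characters (space, tab, line feed, form feed, carriage return) and does not lowercase the values.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# HTML5 "space characters": space, tab, line feed, form feed, carriage return.
SPACE_CHARS = re.compile(r"[ \t\n\r\f]+")


class RelCollector(HTMLParser):
    """Collects individual rel tokens from <a> and <link> start tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag not in ("a", "link"):
            return
        for name, value in attrs:
            if name == "rel" and value:
                # Drop empty strings produced by leading/trailing whitespace.
                self.tokens.extend(t for t in SPACE_CHARS.split(value) if t)


def map_document(html):
    """Map step (hypothetical): emit a (rel_token, 1) pair per token in one document."""
    parser = RelCollector()
    parser.feed(html)
    parser.close()
    return [(token, 1) for token in parser.tokens]


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the emitted counts for each token."""
    totals = Counter()
    for token, count in pairs:
        totals[token] += count
    return totals


sample = (
    '<link rel="shortcut icon" href="/favicon.ico">'
    '<a rel="nofollow" href="/x">x</a>'
    '<a rel="me\tfriend" href="/y">y</a>'
)
totals = reduce_counts(map_document(sample))
```

In a real run the map step would execute once per crawled document and the reduce step would merge the pairs from all of them; here both run in-process on a single made-up snippet.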
I found a staggering 1.8M unique rel value strings in use, many of them used only once or twice across the entire web. The top 6 most frequently used rel values accounted for 80% of all usage, and the top 11 alone were responsible for 90%. In fact, fewer than 1,000 of the most frequently used unique rel values are sufficient to cover 99% of all usage. In other words, the tail is long indeed, with the remainder of those 1.8M unique rel values accounting for less than 1% of the total usage.
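The "top N values cover X% of usage" figures come from a cumulative sum over the counts sorted from most to least frequent. A rough sketch of that calculation follows, assuming the counts are already in a dictionary. The function name values_needed_for_share and the toy numbers are invented for illustration and are not the real data.

```python
def values_needed_for_share(counts, percent):
    """Return how many of the most frequent values are needed to cover
    at least `percent` (an integer 0-100) of the total usage."""
    total = sum(counts.values())
    running = 0
    ranked = sorted(counts.values(), reverse=True)
    for n, count in enumerate(ranked, start=1):
        running += count
        # Integer arithmetic avoids floating-point surprises at the boundary.
        if running * 100 >= percent * total:
            return n
    return len(counts)


# Toy counts only, chosen so that the total is 100.
toy_counts = {"nofollow": 50, "stylesheet": 30, "tag": 10, "me": 6, "muse": 4}
top_for_80 = values_needed_for_share(toy_counts, 80)
top_for_90 = values_needed_for_share(toy_counts, 90)
```

With these toy counts, two values cover 80% of usage and three cover 90%, which is the same shape of result as the top 6 and top 11 figures above, just on a much smaller scale.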
In passing, I noticed that approximately 3 million rel value strings also contained a comma character, presumably cases where the author mistakenly thought that the "," character could be used as a delimiter. However, since these cases account for just 0.18% of all rel value strings, they have little impact on the overall totals.
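A check like the comma one can be sketched in a few lines, assuming the counts are keyed by rel value string. The helper name comma_share and the sample counts below are hypothetical; the real figure was computed over the full crawl.

```python
def comma_share(counts):
    """Return the fraction of all counted rel value strings that contain a comma."""
    total = sum(counts.values())
    with_comma = sum(n for value, n in counts.items() if "," in value)
    return with_comma / total if total else 0.0


# Made-up counts: 3 of the 100 counted strings contain a comma.
toy_counts = {"nofollow": 97, "nofollow,noindex": 2, "external,": 1}
share = comma_share(toy_counts)
```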
Here are the top 25 rel values found in <a> and <link> tags in a moderately sized sample of the web today:
| Rank | Value | Count | Relative Frequency |
| 1 | nofollow | 832980014 | |
| 2 | stylesheet | 338648161 | |
| 3 | tag | 168764800 | |
| 4 | alternate | 109150404 | |
| 5 | icon | 69183607 | |
| 6 | chapter | 56395793 | |
| 7 | forum | 55920646 | |
| 8 | shortcut | 53906964 | |
| 9 | bookmark | 30683701 | |
| 10 | archives | 25381711 | |
| 11 | category | 24361195 | |
| 12 | external | 19181232 | |
| 13 | search | 14227485 | |
| 14 | edituri | 8109835 | |
| 15 | apple-touch-icon | 6753583 | |
| 16 | help | 4842211 | |
| 17 | prev | 4537344 | |
| 18 | next | 4390373 | |
| 19 | pingback | 4302068 | |
| 20 | wlwmanifest | 4125573 | |
| 21 | contents | 3959350 | |
| 22 | contact | 3504587 | |
| 23 | service.post | 2678873 | |
| 24 | top | 2502015 | |
| 25 | me | 2501273 | |
The most frequently used values are not surprising at all.
The nofollow value is used as a hint to search engines
that the target of an <a> tag should not be used in ranking
calculations. The stylesheet value is used on
<link> tags to indicate that the target is an external CSS
document. The tag value is a microformat used to indicate a category for the page, as popularized by sites such as Technorati and Delicious.
And alternate is frequently used to facilitate the
autodiscovery of an RSS or Atom feed for a given site.
Further down we learn that, as OpenID continues to gain adoption, the openid.server and openid.delegate rel values come in at #35 and #43 respectively, which is impressive, since each is only needed once per page. And even the newer OpenID2-style tags are not far behind, with openid2.provider and openid2.local_id reaching #51 and #837 respectively.
The search rel value, the OpenSearch discovery mechanism, is near and dear to my heart, so I was pleased to see it ranked so high at #13. Again, these discovery links are only needed once per page, so this is a sign of strong adoption. Admittedly, not all rel="search" links are OpenSearch related, but I have another, more comprehensive analysis of OpenSearch documents that shows similarly pervasive adoption rates.
Even the newly
agreed-upon canonical rel value makes a
showing at #271, and will surely rise to the top 25 or so over the next year
or two.
And the XFN rel values? The contact rel value is the
most common at #22, with me and friend
just behind at #25 and #28 respectively. Filling out the list
are acquaintance (#58), met
(#68), colleague (#84), co-worker
(#126), neighbor (#180), muse
(#196), co-resident (#232), parent
(#255), sibling (#414), sweetheart
(#446), spouse (#570), crush
(#794), kin (#834), child (#879),
with date bringing up the rear at #1086.
This survey indicates that rel values are both widely and meaningfully used, with adoption being driven by a wide array of needs, such as semantic markup, search engine hints, client-side rendering, discovery and identity protocols, blogging, and/or content that can be later edited.
But more importantly, we learned that a full 0.0003% of all the links have declared, for all the world to see, that some URI out there is their source of inspiration, their Calliope, their Erato, their muse.

February 16th, 2009 at 8:44 pm
Wow tech has changed! It wasn’t so long ago only an army of devs for an online search company could pull this kind of info! I think I just found myself a true tech blog, subscribing now…
February 16th, 2009 at 8:48 pm
Excellent blog post, man.
I never really thought about the usage of the “rel” attributes to and until now. I never understood the point of XFN, and am pretty sure that I’ve only ever used “rel=’nofollow’” in my history of web design and web app programming.
I’ve known that there were a few people that put more thought of what goes into the “rel” attribute, but your quantification of the usage, and how the attribute is exactly being used, really puts it into perspective.
I’d love to see a followup post that explained more about how you got these results. :)
February 16th, 2009 at 8:50 pm
Oh dear, sorry for the double post, but it appears that my usage of <a> and <link> messed up the comment. I wasn’t expecting the HTML to go unescaped.
For clarification, that sentence should read:
I never really thought about the usage of the “rel” attributes to <a> and <link> and until now. I never understood the point of XFN, and am pretty sure that I’ve only ever used “rel=’nofollow’” in my history of web design and web app programming.
February 16th, 2009 at 11:07 pm
Hey….. Thanks for posting this.
Yeah. I was thinking about rel=”canonical” as well.
I suspect this will shoot right near the top..
We’re going to be posting automated stats on our crawler in our next release…. which should be any day now :)
February 16th, 2009 at 11:28 pm
Kevin — Spinn3r crawls content exclusively from feeds, right, or do you have a web crawler going as well? I feel like I’ve seen hits from your spider in my web logs but I could be mistaken.
Either way, it will be interesting to see what values are popular on feed-oriented sites as opposed to the general web. I suspect that certain values will appear just as frequently, such as ‘nofollow’. But others, like the ‘openid’ values, won’t be used much at all.
February 17th, 2009 at 3:18 am
For comparison, I have some rel/rev data from a year ago based on 130K pages from dmoz.org. I counted number of pages rather than total number of occurrences, because that reduces the noise in the data caused by a few pages with huge numbers of repeated values. E.g. this page has a load of rel=”chapter”, and it’s used 12 times per page on average in my data, which explains part of the difference in ranking compared to your data.
(Also I didn’t split on tokens or convert to lowercase, since I wanted to see the original values.)
My data also shows a lot of similar rel patterns like rel="track:track_pagetag=…" and rel="balloon29" and rel="lightbox[pkguitars]", presumably all coming from a handful of pages that are abusing the semantics of rel. It’d be interesting to know how many of your 1.8M unique values are instances of this kind of pattern coming from a small number of pages, and how many are real legitimate-looking values.
February 17th, 2009 at 7:05 am
Philip – that’s a great point about a handful of pages with tons of rel values biasing the results. I definitely saw patterns like “track:track…” that show up further down the list; that’s one of the reasons I only published the top 25 tags and cherry picked from there.
In this case I really was curious about total usage, as I was wondering specifically about values like ‘me’ and ‘muse’, which legitimately appear more than once.
That said, I might re-run the test counting only once per page, just to compare.
Also, I didn’t normalize to lowercase, but in retrospect I should have.
Thanks for the link, Philip.
Cheers,
-DeWitt
February 17th, 2009 at 12:29 pm
well, you could make several different ranked lists
as well as lists of what rel values appear most together (on the same page and/or within individual rel=""s)
this could also be used to detect common errors (capitalization, spelling, wrong separators) which could be used in the w3c’s html validator or html tidy