Term Frequency, Inverse Document Frequency (TFIDF): Exploring TheRarestWords.com

TheRarestWords.com is an impressive little web hack by an enthusiast artisan. The application attempts to identify the rarest words on a page.

Why would you care? First, understanding where you veer from the mainstream is quite interesting. It's a great way to find misspellings, people's names, and other less pedestrian rarities.

An Aside: Language is thought to be infinitely generative. Perhaps every human utters a huge number of completely unique statements, even if you ignore person and location names to give the computation a fighting chance. Hard to say... likely a typical long tail situation, maybe with a little more tail.

Secondly, word usage scarcity is a rich source of information in a language that overloads words with different meanings and relies on the surrounding sequence to constrain meaning. Information retrieval (IR) mechanisms have a hard time really capitalizing on word sequence and co-occurrence proximity at full web scale, if anywhere. Statistical word counts tend to win out in evaluation.

The buzzword in IR is TFIDF, or term frequency inverse document frequency. This is a method for giving more importance to the less common words in a document that match the query. Mid-range frequency words get discounted, but if the page is truly relevant they're likely key terms and often repeated.
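To make the weighting concrete, here's a minimal sketch of one common TFIDF variant (term count divided by document length, times the log of total documents over documents containing the term). The `tfidf` function name, the whitespace tokenizer, and the three toy documents are my own assumptions for illustration, not anything TheRarestWords.com is known to use.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = count of the term in the document / tokens in the document
    idf = log(number of documents / documents containing the term)
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy documents, invented for this example.
docs = [
    "the jade apple tree",
    "the apple tree in the yard",
    "the old oak tree",
]
scores = tfidf(docs)
```

In this toy set, "the" and "tree" appear in every document, so their weight is zero, while "jade" (one document) outweighs "apple" (two documents) in the first document.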

Rarest Words at AlwaysBeTesting



Moving beyond term frequencies gets you to n-grams and the requirement to recognize frequencies of multi-word segments. Here, basic part of speech tagging and related tech can really help reduce the problem set -- or you can go the hard way and simply embody part of speech tagging by crunching huge quantities of a language's text. Google has published a database of n-grams built from 1,024,908,267,229 tokens of text, in fact, spanning 13,588,391 unique words that each appear at least 200 times. They don't report how big a web crawl generated this database.

In fact, TFIDF is hard to do at scale, per long-time hacker buddy Vi.c. At play-dataset size, it's grad school work -- given a really sharp Prof.

Think about "jade apple tree". Jade is going to be truly rare. If you do n-grams, you can detect that "apple tree" is a common two-word pattern and give credit for how infrequently "jade" co-occurs with it. I'll return to the impact of the degree of common use of a word in search at the end.
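Here's a rough sketch of the n-gram bookkeeping behind that example, counting single words and adjacent word pairs. The `ngrams` helper and the four-line corpus are made up for illustration; a real system would count over a very large body of text.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over each run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

# Toy corpus, invented for this example.
corpus = [
    "the apple tree in the yard",
    "an apple tree by the road",
    "a jade ring",
    "the jade apple tree",
]

unigram_counts = Counter()
bigram_counts = Counter()
for line in corpus:
    tokens = line.split()
    unigram_counts.update(tokens)
    bigram_counts.update(ngrams(tokens, 2))

# Look up how often each adjacent pair in the phrase was seen.
query = "jade apple tree".split()
bigram_report = {pair: bigram_counts[pair] for pair in ngrams(query, 2)}
```

With these counts, ("apple", "tree") shows up as the frequent pair and ("jade", "apple") as the rare one, which is the signal a ranking function could reward.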

The tool exposes your most unique word uses and your uses of very common words that don't quite approach the stop word list level.
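I don't know how the tool actually does this, but one plausible sketch is to look each word on the page up in a background frequency table, drop the stop words, and report both ends of what's left. Everything here is hypothetical: the `background` counts are invented numbers, and `STOP_WORDS`, `RARE_MAX`, and `COMMON_MIN` are arbitrary choices.

```python
# Hypothetical background counts (occurrences per million words); invented numbers.
background = {
    "the": 60000, "and": 28000,
    "level": 120, "cool": 90, "quick": 80,
    "onerous": 2, "deliberative": 1,
}
STOP_WORDS = {"the", "and"}  # assumed stop list
RARE_MAX = 5                 # at or below this count, call the word rare
COMMON_MIN = 50              # at or above this count, call the word common

def rare_and_common(text):
    """Split the distinct words of a page into (rare, common), skipping stop words."""
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    words -= STOP_WORDS | {""}
    # Words missing from the table count as 0, so they land in the rare list.
    rare = sorted(w for w in words if background.get(w, 0) <= RARE_MAX)
    common = sorted(w for w in words if background.get(w, 0) >= COMMON_MIN)
    return rare, common

rare, common = rare_and_common("The onerous, deliberative SXSW level: cool and quick.")
```

Treating unknown words as rare is a design choice that surfaces misspellings and names, at the cost of some noise.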

TechCrunch brought TheRarestWords.com to my attention. I ran it on my analytics & e-commerce focused blog, AlwaysBeTesting.com. It returned an interesting set of related blog suggestions based upon rarest words. The author is quite humble about the feature, but it's an interesting hack!





Some of the other suggestions were a bit more off-kilter. Perhaps due to a bit of a word fetish, I turned up a few oddball matches. Rare words on my site include deliberative, sxsw, subjective, onerous, and quantifying. Some of the common words are lack, cool, level, solution, opinion, and quick.

A categorization feature produces some things that don't quite look like categories, but are sensible nonetheless: Use Cases, Web Designing, Understanding, Designer, Internet Business, Toolbox, Marketing Strategies, Tasks, Evaluate, Recommend, Tool Box, and Requirements.

Hats off to the crafty Russian coder on a hobby project!

Is TheRarestWords an SEO Tool?

If you're truly advanced in targeting content to user needs and variations in expression, so as to maximize your coverage of the query tail, then this type of analysis is quite productive for SEO. For most folks aiming at SEO, the fact that less frequent words are less frequent means that you don't really care. You'll likely be amazed at the mid-frequency queries you match just by occupying your niche and doing the basic practices well.

I've even considered a similar app, but the ROI for most site owners is in good, accessible markup and solid off-site promotion strategies. As it happens, we at StomperNet just released an SEO evaluation tool along the lines of existing predecessors but free (with email subscription) and including numerous instruction videos on corrective actions. Check out Stomper Site Seer if you're really aiming for traffic.

Built with BlogCFC, version 5.9. Contact Andy Edmonds or read more at Free IQ or SurfMind. © 2007.