Term Frequency, Inverse Document Frequency (TFIDF): Exploring TheRarestWords.com
TheRarestWords.com is an impressive little web hack with a nice global flavor, built by an enthusiast artisan. The application attempts to identify the rarest words on a page.
Why would you care? First, understanding where you veer from the mainstream is quite interesting. It's also a great way to find misspellings, people's names, and other less pedestrian rarities.
An Aside: Language is thought to be infinitely generative. Perhaps every human utters a huge number of completely unique statements, even ignoring person and location names to give the computation a fighting chance. Hard to say... it's likely a typical long-tail situation, maybe with a little more tail.
The buzzword in information retrieval (IR) is TFIDF, or term frequency, inverse document frequency. This is a method for giving more weight to the less common words in a document that match the query. Mid-range frequency words get discounted, even though they're likely key terms that are often repeated when the page is truly relevant.
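To make the weighting concrete, here is a minimal sketch of one common textbook form of TFIDF in Python. It is not how TheRarestWords.com computes anything (the site doesn't say). The `tf_idf` function, the whitespace tokenizer, and the three toy documents are all assumptions made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = count of the term in the document / document length
    idf = log(N / df), where df is the number of documents containing the term
    """
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy documents, invented for this example.
docs = [
    "the jade apple tree",
    "the apple tree in the garden",
    "the garden gate",
]
scores = tf_idf(docs)

# The highest-weighted term in the first document.
top = max(scores[0], key=scores[0].get)
print(top, scores[0])
```

In this toy set, "the" appears in every document, so its idf is log(1) = 0 and it contributes nothing, while a word that appears in only one document gets the largest boost.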
[Image: Rarest Words at AlwaysBeTesting]
Moving beyond term frequencies gets you to n-grams and the need to track frequencies of multi-word segments. Here, basic part-of-speech tagging and related tech can really help reduce the problem set -- or you can go the hard way and effectively learn part-of-speech patterns by crunching huge quantities of a language's text. Google has published a database of n-grams drawn from 1,024,908,267,229 word tokens, spanning 13,588,391 unique words that each occur at least 200 times. They don't report how big a web crawl generated this database.
In fact, TFIDF is hard to do at scale, per longtime hacker buddy Vi.c. At toy-dataset size, it's grad school work -- given a really sharp prof.
Think about "jade apple tree". "Jade" is going to be truly rare. If you do n-grams, you can detect that "apple tree" is a common two-word pattern and give credit for how infrequently "jade" co-occurs with it. I'll return to how a word's degree of common use affects search at the end.
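The "jade apple tree" case can be sketched with plain unigram and bigram counts. This is only an illustration on a three-sentence toy corpus invented here. A real system would lean on something like the Google n-gram data, and the `ngrams` helper is a hypothetical name.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-word tuples found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy corpus, invented for illustration only.
corpus = [
    "the apple tree is in bloom",
    "an apple tree grows by the gate",
    "she carved a jade apple tree",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(ngrams(tokens, 2))

# Look up how often each word and each adjacent pair of the phrase occurs.
phrase = "jade apple tree".split()
word_counts = {w: unigrams[w] for w in phrase}
pair_counts = {" ".join(p): bigrams[p] for p in ngrams(phrase, 2)}

print(word_counts)
print(pair_counts)
```

Even at this scale the pattern shows up: "apple tree" recurs as a unit, while "jade" and "jade apple" each occur once, so the rarity of the whole phrase comes from "jade" rather than from the common pair.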
The tool exposes your rarest word choices as well as your uses of very common words that fall just short of stop-word status.
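One plausible way to produce those two lists is to rank a page's words against a background frequency table after dropping stop words. This is a guess at the general idea, not the site's actual method. The `background` counts, the `stop_words` set, and the `rare_and_common` function below are all hypothetical.

```python
# Hypothetical background counts (occurrences in some reference corpus).
background = {
    "the": 50000, "and": 42000,
    "quick": 900, "level": 800, "solution": 700,
    "onerous": 3, "deliberative": 2,
}
# Hypothetical stop list: words too common to be worth reporting.
stop_words = {"the", "and"}


def rare_and_common(text, background, stop_words, k=2):
    """Return (k rarest words, k most common non-stop words) on a page."""
    words = {w for w in text.lower().split() if w not in stop_words}
    # Words missing from the table count as 0, i.e. the rarest of all.
    # Ties are broken alphabetically so the output is deterministic.
    ranked = sorted(words, key=lambda w: (background.get(w, 0), w))
    return ranked[:k], ranked[-k:][::-1]


page = "the quick solution and the onerous deliberative level"
rare, common = rare_and_common(page, background, stop_words)
print("rare:", rare)
print("common:", common)
```

On a page with fewer than 2k distinct non-stop words the two lists will overlap, which is fine for a sketch but something a real tool would need to handle.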
TechCrunch brought TheRarestWords.com to my attention. I ran it on my analytics- and e-commerce-focused blog, AlwaysBeTesting.com. It returned an interesting set of related blog suggestions based upon rarest words. The author is quite humble about the feature, but it's an interesting hack!
- LukeW's Functioning Form shares a focus on eye-tracking and user experience in online transactions, while ROI Revolution focuses on Google Analytics and transaction modeling.
- Things get interesting with Bill Rempel.com -- a numbers junkie in stock trading.
- A social software site and a beyond-blogging site both make sense and are perhaps something of a random sample, given TheRarestWords' partial database of word frequencies.
- Blog conversation is exposed in the link to my fellow StomperNet Faculty Dan Thies' blog.
Some of the other suggestions were a bit more off-kilter. Perhaps due to a bit of a word fetish, I turned up a few oddball matches. Rare words on my site include deliberative, sxsw, subjective, onerous, and quantifying. Some of the common words are lack, cool, level, solution, opinion, and quick.
A categorization feature produces some things that don't quite look like categories but are sensible nonetheless: Use Cases, Web Designing, Understanding, Designer, Internet Business, Toolbox, Marketing Strategies, Tasks, Evaluate, Recommend, Tool Box, and Requirements.
Hats off to the crafty Russian coder on a hobby project!
Is TheRarestWords an SEO Tool?
If you're truly advanced in targeting content to user needs and variations in expression so as to maximize your coverage of the query tail, then this type of analysis is quite productive for SEO. For most folks aiming at SEO, the fact that less frequent words are less frequent means that you don't really care. You'll likely be amazed at the mid-frequency queries you match just by occupying your niche and doing the basic practices well. I've even considered a similar app, but the ROI for most site owners is in good, accessible markup and solid off-site promotion strategies. As it happens, we at StomperNet just released an SEO evaluation tool along the lines of existing predecessors, but free (with email subscription) and including numerous instructional videos on corrective actions. Check out Stomper Site Seer if you're really aiming for traffic.










