Term Frequency, Inverse Document Frequency (TFIDF): Exploring TheRarestWords.com

TheRarestWords.com is an impressive little web hack with a nice global feel, built by an enthusiast artisan. The application attempts to identify the rarest words on a page.

Why would you care? First, understanding where you veer from the mainstream is quite interesting. It's a great way to find misspellings, people's names, and other less pedestrian rarities.

An Aside: Language is thought to be infinitely generative. Perhaps every human utters a huge number of completely unique statements, ignoring person and location names to give the computation a fighting chance. Hard to say... likely a typical long tail situation, maybe with a little more tail.

Secondly, word usage scarcity is a rich source of information in a language which overpopulates words with different meanings and requires surrounding sequences to constrain meaning. Information retrieval (IR) mechanisms have a hard time really capitalizing on word sequence and co-occurrence proximity at full web scale, if anywhere. Statistical word counts tend to win out in evaluation.

The buzzword in IR is TFIDF, or term frequency inverse document frequency. This is a method for giving more importance to the less common words in a document that match the query. Mid-range frequency words get discounted, but they're likely key terms, if the page is truly relevant, and often repeated.
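To make the weighting concrete, here's a minimal sketch of one common TFIDF formulation (term frequency times the log of inverse document frequency). The toy corpus and the function names are my own assumptions for illustration; this is not how TheRarestWords.com or any particular search engine computes it, and real systems use many variants of the formula.

```javascript
// Minimal TF-IDF sketch. Corpus and function names are illustrative assumptions.

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Count how many documents contain each term (document frequency).
function documentFrequencies(tokenizedDocs) {
  const df = new Map();
  for (const tokens of tokenizedDocs) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }
  return df;
}

// Score every term in one document as tf * log(N / df).
function tfidf(tokens, df, numDocs) {
  const counts = new Map();
  for (const term of tokens) {
    counts.set(term, (counts.get(term) || 0) + 1);
  }
  const scores = new Map();
  for (const [term, count] of counts) {
    const tf = count / tokens.length;
    const idf = Math.log(numDocs / df.get(term));
    scores.set(term, tf * idf);
  }
  return scores;
}

// Hypothetical three-document corpus.
const docs = [
  "the jade apple tree in the garden",
  "the apple tree by the road",
  "the road to the garden",
].map(tokenize);

const df = documentFrequencies(docs);
const scores = tfidf(docs[0], df, docs.length);

// Highest scoring terms first.
const ranked = [...scores.entries()].sort((a, b) => b[1] - a[1]);
console.log(ranked);
```

In this toy corpus "the" appears in every document, so its inverse document frequency is log(1) = 0 and it drops out entirely, while "jade" appears in only one document and outranks "apple".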

Rarest Words at AlwaysBeTesting



Moving beyond term frequencies gets you to n-grams and the requirement to recognize frequencies of multi-word segments. Here, basic part-of-speech tagging and related tech can really help reduce the problem set -- or you can go the hard way and simply embody part-of-speech tagging by crunching huge quantities of a language's text. Google has published a database of n-grams, built from 1,024,908,267,229 tokens of text and spanning 13,588,391 unique words that each appear at least 200 times. They don't report how big of a web crawl generated this database.

In fact, TFIDF is hard to do at scale, per long time hacker buddy Vi.c. At play dataset size, it's grad school work -- given a really sharp Prof.

Think about "jade apple tree". Jade is going to be truly rare. If you do n-grams, you can detect that "apple tree" is a common two word pattern and give credit for the co-occurrences' infrequency. I'll return to the impact of the degree of common use of a word in search at the end.
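As a rough illustration of the n-gram idea, here's a small sketch that counts adjacent word pairs (bigrams) in a toy corpus. The sentences and the function name are hypothetical; a real system would work from something like the Google n-gram counts rather than three sentences.

```javascript
// Count adjacent word pairs (bigrams) across a list of sentences.
function bigramCounts(sentences) {
  const counts = new Map();
  for (const sentence of sentences) {
    const words = sentence.toLowerCase().split(/\s+/).filter(Boolean);
    for (let i = 0; i + 1 < words.length; i++) {
      const key = words[i] + " " + words[i + 1];
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  return counts;
}

// Hypothetical corpus for illustration only.
const corpus = [
  "an apple tree grew here",
  "the old apple tree fell",
  "a jade apple tree ornament",
];

const bigrams = bigramCounts(corpus);
console.log("apple tree:", bigrams.get("apple tree"));
console.log("jade apple:", bigrams.get("jade apple"));
```

Here "apple tree" shows up in every sentence while "jade apple" shows up once, which is the signal you'd use to treat "apple tree" as a common unit and "jade" as the rare modifier.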

The tool exposes your most unique word uses and your uses of very common words that don't quite approach the stop word list level.

TechCrunch brought TheRarestWords.com to my attention. I ran it on my analytics & e-commerce focused blog, AlwaysBeTesting.com. It returned an interesting set of related blog suggestions based upon rarest words. The author is quite humble about the feature, but it's an interesting hack!





Some of the other suggestions were a bit more off-kilter. Perhaps due to a bit of a word fetish, I turned up a few oddball matches. Rare words on my site include deliberative, sxsw, subjective, onerous, and quantifying. Some of the common words are lack, cool, level, solution, opinion, and quick.

A categorization feature produces some things that don't quite look like categories, but are sensible nonetheless: Use Cases, Web Designing, Understanding, Designer, Internet Business, Toolbox, Marketing Strategies, Tasks, Evaluate, Recommend, Tool Box, and Requirements.

Hats off to the crafty Russian coder on a hobby project!

Is TheRarestWords an SEO Tool?

If you're truly advanced in targeting content to user needs and variations in expression in a way that maximizes your coverage of the query tail, then this type of analysis is quite productive for SEO. For most folks aiming at SEO, the fact that less frequent words are less frequent means that you don't really care. You'll likely be amazed at the mid-frequency queries you match just by occupying your niche and doing the basic practices well.

I've even considered a similar app, but the ROI for most site owners is in good, accessible markup and solid off-site promotion strategies. As it happens, we at StomperNet just released an SEO evaluation tool along the lines of existing predecessors but free (with email subscription) and including numerous instruction videos on corrective actions. Check out Stomper Site Seer if you're really aiming for traffic.

Taking Open Office 3.0 Through the (Web Analyst) Paces

The beta of Open Office 3 is a big step for Mac users, removing the need for X11 or the aging NeoOffice port. I tried out the spreadsheet app for some basic data crunching tasks.

I was pleased with the design of the charting workflow. Click through to Flickr for the 4 image sequence. On casual inspection, it seemed a bit easier to work with than the multi-path 2003 Excel or the new 07 Excel.

There are also new charting features for regression plotting and custom error bars.

Check out a quick Jing screencast of running a "data pilot" (akin to pivot tables) on some recent eyetracking data. Alternatively, check out this howto blog post on the Open Office blog.

While I had to read docs to figure out how to do a pivot table, check out this neat menu explorer help function!

While I opened a simple .xlsx file successfully, InfoWorld tried more complex files and found support in need of more work. I did notice that OO won't save to .xlsx, though it opened several simple files fine.

Will OO v3 be a credible alternative to MS Office? Depends on the task -- but it looks to do a decent job on an analyst's basic requirements.

StomperNet: Going Natural 3 - Adwords Triangulation Method

We launched the Stomper Scrutinizer within the video series called Going Natural 2 back in December, and now it's time for GN3.

The first video in the series showcases Dan Thies talking about his strategies for AdWords. Most notable is his description of how to do split testing with AdWords for fractions of traffic. In essence, if you want to dedicate 10% of traffic to a new ad, you create 10 variants of which 9 are identical. This also provides an "A-A" test that can help you understand variability in your data. Watch the video.
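The arithmetic behind that trick can be sketched in a few lines. This assumes AdWords rotates the ads in a group evenly, which is my assumption for the sketch and not something the video or AdWords guarantees; the function name is hypothetical.

```javascript
// Greatest common divisor, used to reduce the percentage to the smallest ad set.
function gcd(a, b) {
  return b === 0 ? a : gcd(b, a % b);
}

// Given the share of traffic (whole percent) you want on the new ad, return
// how many copies of the new ad and of the existing ad to create, ASSUMING
// the ads in the group are rotated evenly.
function adCopiesForShare(percent) {
  if (!Number.isInteger(percent) || percent <= 0 || percent >= 100) {
    throw new RangeError("percent must be a whole number between 1 and 99");
  }
  const divisor = gcd(percent, 100);
  const newCopies = percent / divisor;
  const total = 100 / divisor;
  return { newCopies, controlCopies: total - newCopies, total };
}

console.log(adCopiesForShare(10)); // 1 new ad plus 9 identical copies of the old one
console.log(adCopiesForShare(25)); // 1 new ad plus 3 identical copies of the old one
```

The identical copies double as the "A-A" test: any spread in their click-through rates is noise, which gives you a feel for how big a difference has to be before it means anything.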

There's a bonus video, new excerpts from the Going Natural 1 Series, along with a downloadable version of my video on understanding vision and web design called Click Fu.

Tracking UI Level Links: An Open Source Script

One of the challenges with the current complex site designs is that multiple links to the same destination tend to appear on the same page.

This does not allow you to understand how your allocation of real estate is being used without resorting to really fancy analytics packages.

To solve this problem I developed a script that, upon every click, walks up the document object model (DOM) looking for an HTML tag with a ui attribute.

If it finds one before it hits the BODY tag, then it adds a parameter to the link called ui with the value of that attribute, allowing you to understand which links on a page are being used. Shown below is a report from Google Analytics for this site showing how people arrive at my bio page:


If you visit my bio from the About menu at the top of my blog, it adds ui=nav.

I've licensed the script under the MPL and you're welcome to use it on your site, providing you share any enhancements. Grab it at http://alwaysbetesting.com/abtest/includes/logger.js.

All you have to do is include the script on your page and add ui=element_name attributes to key HTML elements that make up your site structure.

Because this happens with JavaScript, and only when the user clicks, there's no danger of creating duplicate content for the search engine spiders. Alas, that also means it doesn't fix the typical analytics overlay views for duplicate links, but you can typically see the source element in your path reports.
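For readers who want to see the shape of the approach, here's a minimal sketch of the walk-up-the-DOM idea described above. It is not the contents of logger.js; the function names are hypothetical and the details (such as skipping links that already carry a ui parameter) are my own assumptions.

```javascript
// Walk up from a node until BODY, returning the first `ui` attribute value found.
function findUiValue(node) {
  while (node && node.tagName !== "BODY") {
    const value = node.getAttribute ? node.getAttribute("ui") : null;
    if (value) return value;
    node = node.parentNode;
  }
  return null;
}

// Find the nearest enclosing link of the clicked node, if any.
function findLink(node) {
  while (node && node.tagName !== "BODY") {
    if (node.tagName === "A" && node.href) return node;
    node = node.parentNode;
  }
  return null;
}

// Append ui=<value> to a URL, respecting an existing query string and fragment.
function addUiParam(href, value) {
  const hashIndex = href.indexOf("#");
  const base = hashIndex === -1 ? href : href.slice(0, hashIndex);
  const hash = hashIndex === -1 ? "" : href.slice(hashIndex);
  const separator = base.includes("?") ? "&" : "?";
  return base + separator + "ui=" + encodeURIComponent(value) + hash;
}

// Click handler: tag the clicked link with the ui value of its closest ancestor.
function tagClickedLink(event) {
  const link = findLink(event.target);
  if (!link) return;
  if (/[?&]ui=/.test(link.href)) return; // assumption: never tag a link twice
  const value = findUiValue(link);
  if (value) link.href = addUiParam(link.href, value);
}

// Only wire up the listener when running in a browser.
if (typeof document !== "undefined") {
  document.addEventListener("click", tagClickedLink);
}
```

With markup along the lines of a container carrying ui="nav" wrapped around the About menu, a click on the bio link inside it would be rewritten to end in ?ui=nav before the browser follows it.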

Google Website Optimizer and the iPhone...

What do they have in common? Even with the recent opening of Google Website Optimizer, much like the iPhone, you still have to hack GWO to get maximum value out of your testing. While I am grateful for improved support for factorial analyses and more help content, it would have been so easy to do better.

Here's the problem: Google Website Optimizer restricts your understanding of the effects of your experiment to a single outcome variable, like a conversion.

While getting consensus on a single overall evaluation criterion (OEC) is critical to successful ongoing testing and iteration in a business organization, you also want to use tests to improve the product team's understanding of the customer, products, the website experience, and their interactions.

Without the ability to go deeper to explain why an experimental condition drove the most conversions, you're simply playing roulette with your pixels, not building a better tuned product team.

So, yes, also like the iPhone, this limitation reduces the need for complex skills like statistical significance testing. (Not exposing a command line in the iPhone eliminates the challenge of Unix.)

There is a solution for those who aren't afraid of the truth... A set of enterprising analyst / coders have reverse engineered the GWO cookie and demonstrated how to port the values back over to Google Analytics. ROI Revolution shows how to extract the GWO condition. While he shows integrating it with synthetic page tracker calls, I'd recommend using the "user defined" segmentation values via __utmSetVar (old school urchin.js) or pageTracker._setVar (new school ga.js).

Find more GWO power user tricks in my delicious feed gwo.

Google Analytic Gems #2: Quantifying Deliberative Conversions

Another little known gem in Google Analytics is the pair of Time to Purchase and Visits to Purchase reports.

For MyWeddingFavors, where the purchase is an exceptionally meaningful one for our customers, some 40% of our sales happen on subsequent visits:

Careful though! Looking into Days to Purchase, we see only 25% of sales happen on a different day than the introduction to the customer.



So a meaningful share (15%, the 40% that take multiple visits minus the 25% that take multiple days) of our sales are stretched over several visits within a single day, while only 25% happen on a visit on a subsequent day. Understanding this pattern has some serious implications for design, business strategy, and e-commerce feature set.

Google Analytic Gems #1: Split Test Evaluation with Only New Users

A question on LinkedIn, now closed, asked "what are some hard to find but useful reports in GA?" While clicks to task completion is far from the ultimate metric, Google Analytics does suffer from some slightly onerous depth issues for specific data points.

We do a lot of split testing with Google Analytics by defining custom segments and serving each segment a different UI. One issue with understanding the impact of new features, particularly for sites with lots of repeat visitors (e.g. content / blog sites vs ecommerce), is the novelty effect. New features, or even simple changes in layout, can have a short term halo as users notice and engage with the changed content.

Google Analytics does allow you to look at your user defined segments for new and repeat visitors, but it does require a few clicks. Follow along with the picture:


Starting in the visitors submenu (1), the New & Returning report allows you to drill into New users (2). The segment drop down has lots of useful pivots, including "user defined" (3).

Picture #4 shows the results for new users of a split test that moved a mailing list subscribe box from left to right. The magnitude of the effect diminished over time as we tested this. However, by drilling into only new users, we see the original effect size. Looking at all users, the effect is smaller, and looking at returning users, the effect is smaller still.
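If you export the per-segment counts, a quick way to compare the effect for new and returning visitors is a pooled two-proportion z-test. The sketch below uses made-up counts purely for illustration (they are not the numbers from my test), and the function name is hypothetical.

```javascript
// Pooled two-proportion z-test: compares the conversion rates of two variants.
function twoProportionZ(conversionsA, visitorsA, conversionsB, visitorsB) {
  const rateA = conversionsA / visitorsA;
  const rateB = conversionsB / visitorsB;
  const pooled = (conversionsA + conversionsB) / (visitorsA + visitorsB);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / visitorsA + 1 / visitorsB)
  );
  return { rateA, rateB, diff: rateB - rateA, z: (rateB - rateA) / standardError };
}

// HYPOTHETICAL counts: subscribe box on the left (A) vs. on the right (B).
const newVisitors = twoProportionZ(40, 2000, 70, 2000);
const returningVisitors = twoProportionZ(90, 3000, 96, 3000);

// |z| above roughly 1.96 corresponds to a two-sided 5% significance level.
console.log("new visitors:", newVisitors);
console.log("returning visitors:", returningVisitors);
```

With these illustrative numbers the difference clears the 1.96 threshold for new visitors and falls well short of it for returning visitors, which is the kind of pattern the novelty effect produces once the two groups are pooled.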

Dealing with the halo effect is one of the reservations that was expressed during "AB Testing: Designer Friend or Foe" at SXSW. Splitting users into new and returning is one of the easiest strategies for seeing through this confound.

Getting Serious About Testing: Learn from the Pros

Last week's SXSW panel on AB Testing: Designer Friend or Foe left me wishing for a more robust treatment of the experimental design issues around online testing. It was a great panel, and I appreciated the real world experience of the panelists, but aside from Micah, the approach was very much from a design world. This is fine, but issues came up that stats exist to solve, and the distinction between multivariate and AB testing was glossed over.

In particular, designed well, multivariate testing can be used to test hypotheses about user models, not just a way to play roulette with font colors and sizes.

There is a robust body of knowledge that lives between statistics, traditional experimental psychology, and cognitive modeling, and it rests on the shoulders of giants in practical business success through experimentation.

The Exp Platform, led by Ronny Kohavi, at MSFT publishes from this position of strength. Their latest, 7 pitfalls to controlled experiments on the web, is a solid read for those aspiring to live in this space.

AB testing might indeed be a foe to the designer when done without appropriate expert support -- at least for more aggressive evaluations.

Here's a recap of the Seven Pitfalls:

  1. Avoiding experiments because computing the success metric is hard.
  2. Attempting to run experiments without the pre-requisites: representative & sufficient traffic, appropriate instrumentation, agreed upon metrics.
  3. Hubris: Over-optimization without collecting data along the way.
  4. Bad math: inappropriately deployed confidence intervals, % change, and interaction effects.
  5. Use of composite metrics when power is insufficient. An example, not in the paper, is the use of checkout completion for a product page change, when add-to-cart % would be more sensitive.
  6. Uneven sampling: bad balance between control and test distributions.
  7. Lack of robot detection.
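To put a number on the power point in pitfall 5, here's a rough sample size sketch using the standard normal approximation for comparing two proportions. The base rates (2% checkout completion, 10% add-to-cart) and the 10% relative lift are hypothetical, and this is my own illustration, not a calculation from the paper.

```javascript
// Approximate visitors needed per arm to detect a relative lift in a
// conversion rate, using the normal approximation for two proportions.
function sampleSizePerArm(baseRate, relativeLift) {
  const zAlpha = 1.96;  // two-sided 5% significance level
  const zBeta = 0.8416; // 80% power
  const p1 = baseRate;
  const p2 = baseRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p2 - p1) ** 2);
}

// HYPOTHETICAL base rates, both with a 10% relative lift.
const checkoutVisitors = sampleSizePerArm(0.02, 0.1);
const addToCartVisitors = sampleSizePerArm(0.1, 0.1);

console.log("checkout completion, visitors per arm:", checkoutVisitors);
console.log("add-to-cart, visitors per arm:", addToCartVisitors);
```

Under these assumptions the rarer checkout metric needs several times the traffic of the add-to-cart metric to detect the same relative lift, which is one reason the metric closer to the change tends to be more sensitive.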

I've blogged the guide to practical web experiments and it's also highly recommended. It provides an overview of the key issues to deal with in setting things up including sampling, failure versus success evaluation, and common pitfalls like day of the week effects.

More from the historical '05 SXSW design perspective: How to Inform Design: How to Set Your Pants on Fire, presented March 14th, 2005 by Nick Finck, Kit Seeborg, and Jeffrey Veen.

Design Metrics Wrapup

What fun! The SXSW conversation format is quite cool, though it really needs a dedicated space as our group size was limited by how far voices carried.

Drop a line in the comments if we promised to follow up on something I haven't posted. This is a work in progress, so check back if there are some empty items when you visit.

References during the chat

Blogs

Resource Lists

Analytics

Usability Training

Books

...

Thanks to everyone who participated, and to Micah for bringing me along.

SXSW: Driving Design From User Data

I wrote about the crucial conversation at SXSW with Micah Alpern a few weeks ago. The time has come!

In talking through this with Micah, we came back to the crucial insight that the availability of artifacts of the usage of internet software creates an opportunity and challenge for designers. What follows is a reference for our conversation, which will include a short intro and mostly conversation. Subject to conversational flow, we'll be asking the participants to share stories:

  1. What's your favorite HIPPO story? For those of you who haven't encountered the hippo meme, it's about decision making based upon something more than the highest paid individual's personal opinion.
  2. What business or user goal would you like to be informed by metrics?

The talk precis:

Design Metrics: Better Than 'Because I Said So':

Too often designers are put in a position of defending design decisions based on personal preference or an unarticulated sense of expertise. We'll discuss how to use metrics to understand user and business goals. Then how these metrics can be used to evaluate design decisions, make tradeoffs, and shape strategies.

Our goal is to better enable productive conversations with key stakeholders, using the tools of metrics to understand and advocate a position.

In the most productive cases, this means designing with measurement toward end goals in mind. In less developed scenarios, there may be some foundations in need of construction.

There are a lot of reasons to test designs with live users. The most pedestrian is business acceptance testing. We'll be more focused on using metrics to resolve internal debate and on multivariate testing to learn more about the motivations, mental models, and personas of users, as well as value estimation.

We believe the "Role of Designer" is to drive hypotheses about the user, to internalize the results, and to use them to inform future design.

Of course, testing is not the only tool in the toolbox. You have to choose the right tool for the job. Key dimensions:

  • Quantitative vs. qualitative
  • Small vs. large scale
  • Advanced techniques: Sequence modeling, learnability metrics
  • Repeat vs. non-repeat visitor

That said, creative techniques with Greasemonkey or limited scale prototypes can make testing available in situations where you might not think it's possible.

We're scheduled for 11:30 AM in Ballroom E on Monday. Hope to see you there. If you can't make it, stay tuned for a follow-up.

Built with BlogCFC, version 5.9. Contact Andy Edmonds or read more at Free IQ or SurfMind. © 2007.