Kohavi's "Guide to Controlled Experiments on the Web"

I was lucky enough to work with Ronny Kohavi at MSFT as he ramped up the "Experimentation Platform" there. Some months ago he shared a paper in progress with me, and it is now generally available:
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO

Besides being one of the more concise primers on the key statistical concepts for testing, the paper offers a series of lists of key considerations across the testing process.

Limitations of Controlled Experiments

  1. Quantitative Metrics, but No Explanations
  2. Primacy and Newness Effects
  3. Features Must be Implemented
  4. Consistency of User Experience
  5. Parallel Experiments
  6. Launch & Media Events

Highlights from this section include the need to join metrics to user comments in order to diagnose success or failure, and the hard-earned opinion that parallel experiments run to discover gains from interactions are rarely successful. Of course, MSFT (and Ronny's former haunt, Amazon) have no shortage of data volume. Multivariate testing, or even better Taguchi testing, can explore a design space more rapidly when traffic is less than at-scale.
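To make the design-space point concrete, here is a minimal sketch (mine, not from the paper) comparing a full factorial layout for three two-level page elements with the four-cell orthogonal array usually labeled L4 in Taguchi-style designs. The three factors are hypothetical. The smaller design only supports main-effect estimates, because the third factor is confounded with the interaction of the first two.

```python
from itertools import combinations, product

# Full factorial: every combination of three two-level factors (8 cells).
full_factorial = list(product([0, 1], repeat=3))

# L4 orthogonal array: the third column is the XOR of the first two (4 cells).
l4_array = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]


def is_balanced(design):
    """True if every pair of columns shows each level combination equally often."""
    n_cols = len(design[0])
    for i, j in combinations(range(n_cols), 2):
        counts = {}
        for row in design:
            key = (row[i], row[j])
            counts[key] = counts.get(key, 0) + 1
        # All four level pairs must appear, and appear the same number of times.
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True


print(len(full_factorial), "cells in the full factorial")
print(len(l4_array), "cells in the L4 array")
print("L4 pairwise balanced:", is_balanced(l4_array))
```

Half the cells means each cell gets twice the traffic, which is the appeal when volume is limited. The price is that interaction effects cannot be separated from main effects.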

In addition to a great discussion of sampling methodology, the paper goes on to describe learnings from leading testing at Amazon and designing the next generation platform at MSFT.

Lessons Learned

Analysis

  1. Mine the Data
  2. Speed Matters
  3. Test One Factor at a Time (or Not)

My favorite here is "mine the data" -- don't just look at the averages. Looking ahead to the next section, a recent A/B test at Smart Marketing revealed a complete reversal of an effect across day of week. Had I not dug deep into the dataset, we would likely have called the effect too small to warrant a change. Instead, we're looking at an increase in conversion of more than 0.5%.
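A rough sketch of what that kind of digging looks like. The counts below are invented for illustration and are not the Smart Marketing data; the names `results`, `lift`, and `pooled` are my own. The pooled numbers show almost no effect, while the weekday and weekend segments move in opposite directions.

```python
# Hypothetical (visitors, conversions) per arm, split by part of the week.
results = {
    "weekday": {"control": (10000, 500), "treatment": (10000, 560)},
    "weekend": {"control": (10000, 500), "treatment": (10000, 445)},
}


def rate(visitors, conversions):
    """Conversion rate as a fraction of visitors."""
    return conversions / visitors


def lift(segment):
    """Absolute difference in conversion rate, treatment minus control."""
    return rate(*segment["treatment"]) - rate(*segment["control"])


def pooled(arm):
    """Sum visitors and conversions for one arm across all segments."""
    visitors = sum(seg[arm][0] for seg in results.values())
    conversions = sum(seg[arm][1] for seg in results.values())
    return visitors, conversions


overall_lift = rate(*pooled("treatment")) - rate(*pooled("control"))
segment_lifts = {name: lift(seg) for name, seg in results.items()}

print(f"overall lift: {overall_lift:+.3%}")
for name, value in segment_lifts.items():
    print(f"{name} lift: {value:+.3%}")
```

A segment-level result like this still needs its own significance check before acting on it, since slicing the data many ways raises the odds of a spurious reversal.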

Trust and Execution

  1. Run continuous A/A tests
  2. Automate Ramp-up and Abort
  3. Determine the Minimum Sample Size
  4. Assign 50% of Users to Treatment
  5. Beware of Day of Week Effects

A/A tests provide great sanity checks on sampling methods and variability. Assigning 50% of users to the treatment gets a result roughly 25x faster than assigning 1%, because the smaller arm limits the statistical power of the comparison.
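The 25x figure can be checked with a little arithmetic. This is a sketch under two assumptions of mine: both arms have the same per-user variance, and traffic arrives at a constant rate. Under those assumptions the variance of the difference between arms is proportional to 1 / (N * p * (1 - p)), where N is total users and p is the treatment share, so the run time needed for a fixed precision scales with 1 / (p * (1 - p)). The function name `relative_duration` is hypothetical.

```python
def relative_duration(treatment_share, baseline_share=0.5):
    """How many times longer a test must run than one at baseline_share.

    Assumes equal variance in both arms and a constant traffic rate, so the
    required total sample size is proportional to 1 / (p * (1 - p)).
    """
    for share in (treatment_share, baseline_share):
        if not 0 < share < 1:
            raise ValueError("shares must be strictly between 0 and 1")
    baseline = baseline_share * (1 - baseline_share)
    candidate = treatment_share * (1 - treatment_share)
    return baseline / candidate


for share in (0.5, 0.1, 0.01):
    print(f"{share:.0%} treatment: {relative_duration(share):.1f}x the 50/50 run time")
```

For a 1% treatment the ratio is 0.25 / 0.0099, a little over 25, which matches the figure above.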

Culture and Business

  1. Agree on the Metrics Upfront
  2. Beware of Launching Features that "Do Not Hurt" Users
  3. Weigh the Feature Maintenance Costs
  4. Change to a Data-Driven Culture

A key challenge in metrics is assessing short term versus long term value -- immediate revenue versus customer retention.

So, go read it -- it's also handy to have around to share with folks who want a quick primer on the stats involved in split testing.

Built with BlogCFC, version 5.9. Contact Andy Edmonds or read more at Free IQ or SurfMind. © 2007.