Last week's SXSW panel, "AB Testing: Designer Friend or Foe," left me wishing for a more robust treatment of the experimental design issues around online testing. It was a great panel, and I appreciated the real-world experience of the panelists, but aside from Micah, the approach was very much from a design perspective. That's fine, but issues came up that statistics exists to solve, and the distinction between multivariate and AB testing was glossed over.
In particular, designed well, multivariate testing can be used to test hypotheses about user models, not just to play roulette with font colors and sizes.
There is a robust body of knowledge that spans statistics, traditional experimental psychology, and cognitive modeling, and that rests on the shoulders of giants who built practical business success through experimentation.
The Experimentation Platform team at Microsoft, led by Ronny Kohavi, publishes from this position of strength. Their latest, on seven pitfalls in controlled experiments on the web, is a solid read for those aspiring to live in this space.
AB testing might indeed be a foe to the designer when done without appropriate expert support, at least for more aggressive evaluations.
Here's a recap of the Seven Pitfalls:
- Avoiding experiments because computing the success metric is hard.
- Attempting to run experiments without the prerequisites: representative and sufficient traffic, appropriate instrumentation, and agreed-upon metrics.
- Hubris: Over-optimization without collecting data along the way.
- Bad math: misapplied confidence intervals, percent-change calculations, and interaction effects.
- Use of composite metrics when power is insufficient. An example, not in the paper, is using checkout completion to evaluate a product page change, when add-to-cart rate would be more sensitive.
- Uneven sampling: a skewed split between the control and test populations.
- Lack of robot detection.
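The uneven-sampling pitfall is one you can check mechanically: if you intended a 50/50 split, test whether the observed assignment counts are consistent with it. Here's a minimal sketch using a two-sided z-test for a proportion (standard-library Python only; the function name and the example counts are mine, purely illustrative):

```python
import math

def sample_ratio_mismatch(n_control, n_treatment, expected_ratio=0.5):
    """Two-sided z-test: is the observed control/treatment split
    consistent with the intended assignment ratio?"""
    n = n_control + n_treatment
    observed = n_control / n
    se = math.sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (observed - expected_ratio) / se
    # two-sided p-value via the normal CDF (erf)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# An 800-user imbalance at this scale is already suspicious:
z, p = sample_ratio_mismatch(50_000, 50_800)   # p comes out below 0.05
```

A low p-value here usually signals a bug in assignment or logging (redirect loss, bot filtering applied asymmetrically, etc.) rather than bad luck, and it invalidates the experiment's comparison.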
I've previously blogged about the practical guide to web experiments, which is also highly recommended. It provides an overview of the key issues in setting things up, including sampling, failure-versus-success evaluation, and common pitfalls like day-of-week effects.
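To make the earlier sensitivity point concrete, here's a rough back-of-the-envelope sample-size calculation using the standard normal-approximation formula for comparing two proportions. The base rates below (10% add-to-cart, 2% checkout completion) are my illustrative assumptions, not numbers from either paper:

```python
import math
from statistics import NormalDist

def required_n_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate per-arm sample size to detect a relative lift in a
    conversion rate, via the two-proportion normal approximation."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)  # ~2.80 for defaults
    delta = base_rate * relative_lift        # absolute difference to detect
    p_bar = base_rate + delta / 2            # pooled rate under the lift
    return math.ceil(2 * z**2 * p_bar * (1 - p_bar) / delta**2)

# Same 5% relative lift, but measured on two different metrics:
n_add_to_cart = required_n_per_arm(0.10, 0.05)  # ~tens of thousands per arm
n_checkout    = required_n_per_arm(0.02, 0.05)  # several times more
```

The lower-base-rate downstream metric needs roughly five times the traffic for the same relative lift, which is exactly why reaching past the most sensitive metric to a composite or downstream one can leave an experiment underpowered.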