Google Analytics Gems #2: Quantifying Deliberative Conversions

Another little-known gem in Google Analytics is the pair of Time to Purchase and Visits to Purchase reports.

For MyWeddingFavors, where the purchase is an exceptionally meaningful one for our customers, some 40% of our sales happen on subsequent visits.

Careful, though! Looking into Days to Purchase, we see that only 25% of sales happen on a different day than the customer's first visit.



So roughly 15% of our shopping experiences stretch across multiple visits within a single day, while 25% conclude on a visit on a later day. Understanding this pattern has serious implications for design, business strategy, and our e-commerce feature set.
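
These two reports bucket transactions by visit count and by days since the first visit, and the gap between them is what surfaces the same-day return visits. Here is a minimal sketch of that arithmetic on hypothetical order records annotated with visits-to-purchase and days-to-purchase (not an actual Google Analytics export format):

```python
# Hypothetical orders: (visits_to_purchase, days_to_purchase)
orders = [(1, 0), (1, 0), (3, 0), (2, 1), (5, 9), (1, 0), (4, 2), (2, 0)]

total = len(orders)
return_visit = sum(1 for visits, _ in orders if visits > 1)
later_day = sum(1 for _, days in orders if days > 0)

print(f"Purchased on a return visit:      {return_visit / total:.0%}")
print(f"Purchased on a later day:         {later_day / total:.0%}")
print(f"Return visit within the same day: {(return_visit - later_day) / total:.0%}")
```

In our case the same arithmetic yields 40% return-visit purchases, 25% later-day purchases, and therefore roughly 15% same-day return visits.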

Getting Serious About Testing: Learn from the Pros

Last week's SXSW panel, AB Testing: Designer Friend or Foe, left me wishing for a more robust treatment of the experimental design issues around online testing. It was a great panel, and I appreciated the panelists' real-world experience, but aside from Micah, the approach was very much from the design world. That's fine, but issues came up that statistics exists to solve, and the distinction between multivariate and AB testing was glossed over.

In particular, designed well, multivariate testing can be used to test hypotheses about user models, not just to play roulette with font colors and sizes.

There is a robust body of knowledge that lives at the intersection of statistics, traditional experimental psychology, and cognitive modeling, resting on the shoulders of giants who achieved practical business success through experimentation.

The Experimentation Platform group at MSFT, led by Ronny Kohavi, publishes from this position of strength. Their latest, 7 pitfalls to controlled experiments on the web, is a solid read for those aspiring to live in this space.

AB testing might indeed be a foe to the designer when done without appropriate expert support -- at least for more aggressive evaluations.

Here's a recap of the Seven Pitfalls:

  1. Avoiding experiments because computing the success metric is hard.
  2. Attempting to run experiments without the prerequisites: representative and sufficient traffic, appropriate instrumentation, and agreed-upon metrics.
  3. Hubris: over-optimization without collecting data along the way.
  4. Bad math: inappropriately deployed confidence intervals, % change calculations, and interaction effects (see the sketch below this list).
  5. Use of composite metrics when statistical power is insufficient. An example, not in the paper, is using checkout completion to evaluate a product page change, when add-to-cart rate would be more sensitive.
  6. Uneven sampling: a bad balance between control and test distributions.
  7. Lack of robot detection.
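
Pitfall 4 is worth a concrete illustration. Below is a minimal sketch of the standard normal-approximation confidence interval for the difference between two conversion rates, on hypothetical counts; it is not the exact treatment from the paper, just the textbook version people most often get wrong:

```python
import math

# Hypothetical A/B results: conversions out of visitors
conv_a, n_a = 380, 20_000   # control
conv_b, n_b = 442, 20_000   # treatment

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# Standard error of the difference between two independent proportions
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# 95% confidence interval on the absolute difference
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"Lift: {diff:+.2%} (95% CI {lo:+.2%} to {hi:+.2%})")

# Note: a confidence interval on the *relative* % change is wider than
# just dividing these bounds by p_a, because p_a is itself an estimate.
```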

I've blogged the practical guide to web experiments as well, and it's also highly recommended. It provides an overview of the key issues in setting things up, including sampling, failure versus success evaluation, and common pitfalls like day-of-week effects.

For more from the historical SXSW design perspective, see How to Inform Design: How to Set Your Pants on Fire, presented March 14th, 2005, by Nick Finck, Kit Seeborg, and Jeffrey Veen.

Design Metrics Wrapup

What fun! The SXSW conversation format is quite cool, though it really needs a dedicated space as our group size was limited by how far voices carried.

Drop a line in the comments if we promised to follow up on something I haven't posted. This is a work in progress, so check back if there are some empty items when you visit.

References during the chat

Blogs

Resource Lists

Analytics

Usability Training

Books

...

Thanks to everyone who participated, and to Micah for bringing me along.

SXSW: Driving Design From User Data

I wrote about the crucial conversation at SXSW with Micah Alpern a few weeks ago. The time has come!

In talking through this with Micah, we came back to the crucial insight that the availability of usage artifacts from internet software creates both an opportunity and a challenge for designers. What follows is a reference for our conversation, which will consist of a short intro followed mostly by discussion. Subject to conversational flow, we'll be asking the participants to share stories:

  1. What's your favorite HiPPO story? For those of you who haven't encountered the meme, HiPPO stands for the Highest Paid Person's Opinion; the point is to base decisions on something more than that opinion alone.
  2. What business or user goal would you like to be informed by metrics?

The talk precis:

Design Metrics: Better Than 'Because I Said So':

Too often designers are put in a position of defending design decisions based on personal preference or an unarticulated sense of expertise. We'll discuss how to use metrics to understand user and business goals. Then how these metrics can be used to evaluate design decisions, make tradeoffs, and shape strategies.

Our goal is to better enable productive conversations with key stakeholders, using the tools of metrics to understand and advocate a position.

In the most productive cases, this means designing with measurement toward end goals in mind. In less developed scenarios, there may be some foundations in need of construction.

There are a lot of reasons to test designs with live users. The most pedestrian is business acceptance testing. We'll be more focused on using metrics to resolve internal debate, on multivariate testing to learn more about the motivations, mental models, and personas of users, and on value estimation.

We believe the role of the designer is to drive hypotheses about the user, then internalize the results and use them to inform future design.

Of course, testing is not the only tool in the toolbox. You have to choose the right tool for the job. Key dimensions:

  • Quantitative, Qualitative
  • Small vs. large scale
  • Advanced techniques: Sequence modeling, learnability metrics
  • Repeat vs. non-repeat visitor

That said, creative techniques with Greasemonkey or limited-scale prototypes can make testing available in situations where you might not think it possible.

We're scheduled for 11:30 AM in Ballroom E on Monday. Hope to see you there. If you can't make it, stay tuned for a follow-up.

SXSW Coming Up! Design Metrics: Better Than 'Because I Said So'

I'm greatly looking forward to SXSW 08 in a couple of weeks. I'll be doing a "core conversation" with Micah Alpern:

Core Conversation: Design Metrics: Better Than 'Because I Said So': Too often designers are put in a position of defending design decisions based on personal preference or an unarticulated sense of expertise. We'll discuss how to use metrics to understand user and business goals. Then how these metrics can be used to evaluate design decisions, make tradeoffs, and shape strategies.

While design efficacy can be treated as a contributor to overall site success, there are some more subtle metrics which can reveal specific strengths and weaknesses of design. I'll post a recap following the gig.

Online Video Metrics: How to Deal with Scrubbing?

Over at webmetricsguru.com, Marshall quotes the following key video metrics from Dennis @ Visual Revenue:

9 Essential Online Video Metrics

  • Online video started
  • Online video Pre-roll advertisement started*
  • Online video core content started
  • Online video Post-roll advertisement started*

  • Online video positive consumption action
  • Online video negative consumption action

  • Online video ended
  • Online video played, percentage of total
  • Online video played, seconds
As another blogger points out, things get really interesting when you start to consider embedded videos.

There is a challenge that neither of these authors mentions -- what about users scrubbing the timeline? Video complete doesn't mean the same thing if the user fast-forwarded through most of it. Logging total time played, % viewed, and complete gives you a bit of insight into this. Consider this range of user behavior:
| Description | Total Time Played | % Viewed | Complete |
| Full view | 12:00 | 100% | Yes |
| Fast forward to watch a 2-minute segment | 2:18 | 18% | No |
| Screencast how-to view with pause/play actions while following instructions | 16:00 | 110% | Yes |
| Quick scan: fast forward, watch, etc. | 5:00 | 24% | Yes |

There are a lot of subtleties here. Do you double-credit re-watching and allow more than 100% viewed? If so, you muddy the real meaning of the percentage. That's a good justification for logging % viewed in addition to time played; otherwise the percentage is nothing more than time played normalized by video length, and carries no extra information.
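
One way to capture those subtleties is to log the played ranges themselves and derive both numbers from them. Here's a minimal sketch, assuming a hypothetical player that reports (start, end) playback intervals in seconds -- this is not how any particular player actually logs:

```python
def video_metrics(intervals, duration):
    """intervals: list of (start, end) playback ranges in seconds."""
    # Total time played credits re-watching, so it can exceed the duration.
    total_played = sum(end - start for start, end in intervals)

    # Unique coverage merges overlapping ranges, so it is capped at 100%.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    unique = sum(end - start for start, end in merged)

    return {
        "time_played_pct": total_played / duration,
        "unique_viewed_pct": unique / duration,
        "complete": any(end >= duration for _, end in merged),
    }

# A viewer who watches the intro twice, then scrubs to the ending of a
# 12-minute video: "complete" is true despite only a third being viewed.
print(video_metrics([(0, 120), (0, 120), (600, 720)], duration=720))
```

Logging both percentages keeps "complete" honest: the viewer above finishes the video while actually seeing only a third of it.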

We've created custom logging in the FreeIQ video player, both for the video embedder (who uses Google Analytics) and for the management of the FreeIQ site. For now we simply log completes, but we're working on an efficient way to capture some of the subtleties here.

From this logging, we computed an average 25% video completion for our Going Natural 2 series videos -- not bad given that these are greater than 20 minutes in length.

Blogdom Buzz on Net Promoter

The Net Promoter scale is one of the most powerful single questions you can ask to measure customer satisfaction. Here's what it looks like:

Would you recommend this blog to a friend?

  • Absolutely
  • Likely
  • Maybe
  • Not Likely
  • Never
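
The example above is a simplified five-point version; the canonical Net Promoter question asks for a 0-10 likelihood rating, and the score is the percentage of promoters (9s and 10s) minus the percentage of detractors (0 through 6). A quick sketch of that calculation on made-up responses:

```python
def net_promoter_score(responses):
    """responses: list of 0-10 'likelihood to recommend' answers."""
    total = len(responses)
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return 100.0 * (promoters - detractors) / total

print(net_promoter_score([10, 9, 9, 8, 7, 7, 6, 4, 10, 3]))  # -> 10.0
```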

There's a good bit of dialogue around the blogosphere on this, and some recent research.

Metrics should be actionable

A key objection is that the Net Promoter scale is not actionable. It's certainly not diagnostic, but much of business is about maximizing the intersection of user needs and business objectives. Balance those objectives poorly, and the NP scale will reveal the effects of that choice. Diagnostic surveys, and correlations between them and the NP score, can reveal the variety of ways customer SAT (satisfaction) might be improved.

Triangulation & Large Scale Business Practices

This process of correlating metrics with each other is called triangulation and is an essential piece of every analyst's toolkit. Triangulation helps you resist validity challenges and provides a useful testing ground for applying industry wisdom to your specific situation.

I worked with some very strong NP advocates at MSFT, and while I never got the religion, I strongly value its ability to distinguish good profits from bad. For instance, over-monetizing search results pages would increase a pretty core metric for any search engine: revenue. But revenue is closely tied to the size and frequency of the customer base. NP is a way to check gains in revenue to ensure that, in the longer run, the business strategy is not Pyrrhic.

Kohavi's "Guide to Controlled Experiments on the Web"

I was lucky to get to work with Ronny Kohavi at MSFT as he ramped up the "Experimentation Platform" there. He shared with me a paper in progress some months ago that is now generally available.
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO

In addition to one of the more concise primers on key statistical concepts for testing, the paper offers a series of lists of key considerations across the testing process.

Limitations of Controlled Experiments

  1. Quantitative Metrics, but No Explanations
  2. Primacy and Newness Effects
  3. Features Must be Implemented
  4. Consistency of User Experience
  5. Parallel Experiments
  6. Launch & Media Events

Highlights from this section include the need to join metrics to user comments for diagnosing success or failure, and the hard-earned opinion that parallel experiments run to discover interaction gains are rarely successful. Of course, MSFT (and Ronny's former haunt, Amazon) have no issue with data volume. Multivariate, or better yet Taguchi-style, testing can explore a design space more rapidly with less-than-at-scale traffic.

In addition to a great discussion of sampling methodology, the paper goes on to describe learnings from leading testing at Amazon and designing the next generation platform at MSFT.

Lessons Learned

Analysis

  1. Mine the Data
  2. Speed Matters
  3. Test One Factor at a Time (or Not)

My favorite here is "mine the data" -- don't just look at the averages. Foreshadowing the next section, a recent A/B test at Smart Marketing revealed a complete reversal of an effect across day of week -- had I not dug deep into the data set, we would likely have called it too small an effect to warrant a change. Instead, we're looking at a >0.5% increase in conversion.
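
Here's a minimal sketch of that kind of segmentation, on made-up session records (the variants, days, and numbers are illustrative, not the Smart Marketing data):

```python
from collections import defaultdict

# Made-up sessions: (day_of_week, variant, converted)
sessions = [
    ("Mon", "A", 1), ("Mon", "A", 1), ("Mon", "B", 0), ("Mon", "B", 0),
    ("Sat", "A", 0), ("Sat", "A", 0), ("Sat", "B", 1), ("Sat", "B", 1),
]

stats = defaultdict(lambda: [0, 0])   # (day, variant) -> [conversions, sessions]
for day, variant, converted in sessions:
    stats[(day, variant)][0] += converted
    stats[(day, variant)][1] += 1

for (day, variant), (conv, n) in sorted(stats.items()):
    print(f"{day} {variant}: {conv}/{n} converted")
```

Averaged over the week the two variants look identical; split by day, the effect reverses, which is exactly the kind of pattern a single overall conversion number hides.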

Trust and Execution

  1. Run continuous A/A tests
  2. Automate Ramp-up and Abort
  3. Determine the Minimum Sample Size
  4. Assign 50% of Users to Treatment
  5. Beware of Day of Week Effects

A/A tests provide great sanity checks on sampling methods and variability. Running 50% of traffic in an experiment will get a result 25x faster than running 1%.
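
The 25x figure falls out of the arithmetic: the variance of the estimated difference scales with 1/n_treatment + 1/n_control, so a lopsided split needs far more days to reach the same precision. A minimal sketch of that calculation (no particular test or library assumed):

```python
def relative_duration(treatment_fraction):
    """Days to reach a fixed precision, relative to a 50/50 split.

    With daily traffic N split f : (1 - f), the variance of the estimated
    difference scales with 1/(f*N) + 1/((1-f)*N), so the required duration
    scales with 1/f + 1/(1-f).
    """
    f = treatment_fraction
    return (1 / f + 1 / (1 - f)) / (1 / 0.5 + 1 / 0.5)

for f in (0.5, 0.1, 0.01):
    print(f"{f:.0%} in treatment -> {relative_duration(f):.1f}x as long")
```

This reproduces the roughly 25x slowdown for a 1% ramp versus an even split.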

Culture and Business

  1. Agree on the Metrics Upfront
  2. Beware of Launching Features that "Do Not Hurt" Users
  3. Weigh the Feature Maintenance Costs
  4. Change to a Data-Driven Culture

A key challenge in metrics is assessing short term versus long term value -- immediate revenue versus customer retention.



So, go read it -- it's also handy to have around to share with folks who want a quick primer on the stats involved in split testing.

Term Alert: Pre-Bounce

While bounce rates are not enemy #1 for everyone -- the biggest issue is whichever transition in your conversion funnel loses the highest percentage of users, typically the abandoned cart for e-commerce shops -- I really enjoyed this post coining the term pre-bounce.

The Pre-Bounce is a user rejecting your site based upon visual previews popping up all over the web. Aesthetics matter.

Trying Out Compete.com

With the questionable business behavior around Alexa, the bubbling buzz about engagement metrics, and the announcement of a new API, this morning seemed like the time to give Compete.com a thorough evaluation -- checking whether it confirms Alexa's recognition of Free IQ's meteoric growth over the last month.

Compete captures the rise of Free IQ to a top 10k site:


Compete also does the basic Alexa-style competitive user reach tracking:

Compare this to Alexa's similar view.

Compete's interesting deviation from standard practice is computing the percentage of attention a site receives out of all the internet-focused attention in its participant pool -- essentially a site's "temporal share":

The Compete site isn't very upfront with its usage stats, but I did see a reference to 2M users. The ComScore panel is only 200k users, but it's carefully balanced across demographics. Compete and Alexa, which rely on toolbar installation, likely over-represent the geek population. Nonetheless, they provide real insights, and it's generally been true that the geeks lead the way for the masses... though some of the new 2.0 tagging, presence sharing, etc. may tap the larger marketplace's capacity for technology adoption.
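
For what it's worth, the temporal-share idea reduces to a simple ratio over the panel's total time online. A toy sketch on made-up panel data (this is not Compete's actual methodology or numbers):

```python
# Made-up panel totals: minutes spent per site across all panelists
panel_minutes = {
    "smallsite.example": 1_200,
    "bigsearchengine.example": 45_000,
    "socialnetwork.example": 88_000,
}

total = sum(panel_minutes.values())
for site, minutes in panel_minutes.items():
    print(f"{site}: {minutes / total:.2%} of panel attention")
```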

On the measurement side, these new applications are challenging traditional pageview metrics. There's a great post over at the WAA paralleling this change to the history of science in physics. Certainly, the attention-based focus of Compete offers a compelling alternative to pageviews as a measure of business success, but assessing clickstreams and usability in web 2.0 apps is a slightly thornier problem. More to come on that front with an upcoming scientific publication...
