Statistics 202: Statistical Aspects of Data Mining

Highly recommended - a Stanford class delivered at Google with the lectures available on Google Video / YouTube.

The course uses both the open source stats tool R and Excel, covering the heavyweight and lightweight ways to get things done.

The slides are at stats202.com.

Navigating the YouTube UI to work through these vids is painful, so I created a playlist with the videos. Click through for the larger format video:

Happy data-mining!

Monitoring Branded Traffic with Google Analytics

Tracking branded traffic is a useful activity for any business. Branded traffic refers to users who seek out your business by name, either through historical familiarity or through offline marketing translating into online activity. Combined with return rate metrics, it can measure both the effectiveness of offline advertising and the growth of a loyal customer base.

In some cases, these visits show up as "direct traffic" in analytics. This means the visit came in without a referring URL in the HTTP headers. Clicking a link on one site to visit another passes the preceding site's URL in the referrer field of the HTTP headers, the part of a web "click" that's invisible to the average user. When a user types in a URL directly or clicks a bookmark, this field is empty. Alas, the usefulness of an empty referrer as an indicator of branded traffic has degraded over the years as firewall products have begun to block the referrer over potential privacy concerns.

Another phenomenon has worked to decrease the comprehensiveness of direct traffic as a measure of branded search. The pervasiveness and ease of use of internet search engines in modern browsers have led users to simplify the task of reaching a destination down to one tool: the search box. While it's theoretically more efficient to enter the URL in the URL bar, many users simply use their default search engine for both keywords and URL entry (est. 5% of search engine queries, from Teevan, Adar, Jones, & Potts. SIGIR 2007) because the simplicity of one operation outweighs the efficiency gain of a dedicated URL bar.

A large number of domain squatters sit on misspellings and typos of popular destinations and generally produce useless experiences. That makes the search engines' error correction on URLs an especially nice change from a cybersquatter's shop of trinkets, and it skews popularity further toward the search box.

So I recommend creating a Google Analytics advanced segment that captures both direct traffic and inbound search referrals with query terms that uniquely identify your business. For some names, this may not be viable, but for most you can capture a large percentage of the direct navigation. For instance, myweddingfavors.com gets branded searches for "my wedding favors" and "myweddingfavors.com". The "my" term is the key signal that the query is branded.

Here's an example of the config:

Be sure to name your segment for the site you set it up on as segments are shared across sites in an account.
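The segment logic can also be sketched in code. This is only an illustration of the rule described above, not the Google Analytics configuration itself: the function name isBrandedVisit, the brand patterns, and the reliance on the "q" and "p" query parameters are assumptions made for the example.

```javascript
// Hypothetical helper: decide whether a visit counts as "branded".
// Assumption: the search phrase arrives in the "q" parameter
// (or "p", which Yahoo uses) of the referring URL.
function isBrandedVisit(referrer, brandPatterns) {
  // An empty referrer means direct traffic (type-in or bookmark).
  if (!referrer) return true;

  let url;
  try {
    url = new URL(referrer);
  } catch (e) {
    return false; // unparseable referrer: treat as non-branded
  }

  const query = url.searchParams.get("q") || url.searchParams.get("p");
  if (!query) return false; // a referring site with no search phrase

  // Branded if any pattern matches the lowercased search phrase.
  const normalized = query.toLowerCase();
  return brandPatterns.some((pattern) => pattern.test(normalized));
}

// Example patterns for the myweddingfavors.com case: the "my" term is required.
const brandPatterns = [/\bmy\s*wedding\s*favors\b/, /myweddingfavors\.com/];
```

A search for "wedding favors" alone does not match, which is the point: only queries carrying the identifying term count toward the branded segment.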

And here's a report validating the conclusions:

  • 85% of iBlipper branded traffic is type-in and other components of direct traffic.
  • As it should be, no referring sites are included in iBlipper branded traffic.
  • Over a third of search engine traffic is branded.

Turning Your Conversion Funnel into a Floodgate

Last year, I wrote about a split test we did that was featured in a Marketing Sherpa report on product videos and opt-in status. The test of click to play vs autoplay vs no video on an e-commerce site for its top 10 best selling products revealed a huge benefit to click to play.

I presented the raw numbers for the first time at StomperLive8 and the first ten minutes are online on YouTube. I talk about how split testing negates the variation inherent in time of day, day of week, and more unique temporal influences. Next up is the video case study illustrating the need to look deeper than a single "overall evaluation criterion" (OEC) to understand the results of testing.

The video leaves off with this graphic, which shows from product page view to add to cart in the back two bars and from add to cart to "cart complete" (aka checkout initiated) in the front bar. Checkout is an actual sale.



Looking at the raw numbers makes the graphic's meaning clearer:

Click to play reduced cart abandonment from a baseline of 37% to 5%! While it was a huge win, it also decreased add to cart slightly. The premise is that shoppers were better educated by the video, and that opt-in (click to play) made the best use of shoppers' time.
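To make the arithmetic behind those two stages concrete, here is a minimal sketch. The counts are invented purely to reproduce the 37% and 5% abandonment figures; they are not the test's raw numbers, and the function names are hypothetical.

```javascript
// Share of product page views that led to an add to cart.
function addToCartRate(productViews, addToCart) {
  return productViews === 0 ? 0 : addToCart / productViews;
}

// Share of add-to-cart events that never reached "cart complete".
function abandonmentRate(addToCart, cartComplete) {
  return addToCart === 0 ? 0 : 1 - cartComplete / addToCart;
}

// Made-up counts per 1000 product page views (illustration only).
const baseline = { views: 1000, addToCart: 100, cartComplete: 63 };
const clickToPlay = { views: 1000, addToCart: 80, cartComplete: 76 };

for (const [name, v] of Object.entries({ baseline, clickToPlay })) {
  console.log(
    name,
    "add to cart:", addToCartRate(v.views, v.addToCart).toFixed(2),
    "abandonment:", abandonmentRate(v.addToCart, v.cartComplete).toFixed(2),
    "cart complete per view:", (v.cartComplete / v.views).toFixed(3)
  );
}
```

With numbers shaped like these, a variant can lose slightly on add to cart and still come out ahead on completed carts, which is why a single evaluation criterion can mislead.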

You can think of it as a force model.

Product Page Type      Checkout   Shop More
No Video               Medium     Low
Auto Play Video        Low        Medium
Click to Play Video    High       High

Helping your customers better understand your products is a good thing, especially if you have great unique products as this business did. In addition, customers will appreciate being able to get the critical details on your products and will be more likely to browse further if they decide the current product is not quite right.

The full set of keynotes from the StomperLive8 show is available for sale, covering a wide range of topics in e-commerce, marketing, and internet success.

Bing! You've got SERP Position

The SEO Position Plus script has been updated to pull page offset from Bing.com -- thanks to Stomper Jim MacKay. Get it at the normal place and keep an eye on http://gist.github.com/138555.

Don't Play Roulette with your Split Testing Efforts

While split or multivariate testing is an extremely powerful tool, it should not be deployed as the sole method driving improvements to your online business. If testing is applied as the only way to determine the right path forward, the outcome is certain to be less successful over the long term.

At the Usability Professionals' Association conference a few weeks ago in Portland, I chatted with Beverly Freeman @ eBay and she recorded the essence on the whiteboard:

My premise is that testing is much more productive when it is aimed at hypothesis testing because the cumulative outcomes of multiple tests create a body of knowledge about your business, your users, and their interaction.

The next point on the whiteboard is about triangulation. Analytics can provide strong, statistically meaningful results, but making sure you can rationalize a "why" is important.

The general pattern here is that you can identify an issue in the usability lab, or another high resolution, low volume observation scenario, and then quantify it with testing. Alternatively, you may observe something in analytics and then engage with customers in lab testing or other inquiry to understand it.

Using Open Source Tools to Understand Your Data: R & GGobi

I've been a long time, occasional user of "R", an open source alternative to high-end statistics packages like SAS & SPSS. I recently spent some time learning an associated data exploration tool called "GGobi" and its integration with R (rggobi).

While R is a worthy tool for data summarization, stats, plotting, and cleaning, GGobi is especially useful for exploration. It features "linked plots" in which mousing over a point in one plot highlights it in all other plot windows. I'm a big fan of the scatterplot matrix (shown to right) for understanding relationships between variables and distribution shapes.

This is probably overkill for basic website stats from Google Analytics, but is very useful for more complex data. I've crafted a screencast that walks through the basics of reading a CSV and launching GGobi through R. It then goes on to create a scatterplot matrix and apply a custom color scheme.

The dataset in use here is data volunteered from Firefox users about their bookmarks and history data-stores. Details of the analysis are described here.

Getting a handle on R is pretty much a pre-requisite for getting value from GGobi. There are lots of great resources online including Statmethods.net Quick-R but the essential reference action is to add R-Seek.org to your browser search box.

Google Analytics API Roundup

Man, you gotta love the internet. Within a week of the general availability of the Google Analytics API there's an explosion of interesting new works and open source creations. While some folks may have had access before the announcement, the checkin streams on github show there's no time like the present.

There are two Ruby wrappers, Gattica and Garb, as well as one in Python. (Update: See this official Google Analytics blog post for more)

I did a bit of hacking over the weekend and adapted one of the code samples to present a word cloud of search referral keywords. Bounce rate is mapped to opacity, and each word's frequency is the sum of traffic from all phrases that include the word.
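The aggregation behind such a cloud can be sketched roughly as follows. This is not the hack's actual source: the row shape (keyword, visits, bounces), the function name buildWordCloud, and the choice to make low-bounce words more opaque are all assumptions for illustration.

```javascript
// rows: [{ keyword: "analytics motion charts", visits: 10, bounces: 5 }, ...]
// (hypothetical shape for keyword report rows)
function buildWordCloud(rows) {
  const words = new Map();
  for (const { keyword, visits, bounces } of rows) {
    // Count each word once per phrase, even if it repeats within the phrase.
    const unique = new Set(keyword.toLowerCase().split(/\s+/).filter(Boolean));
    for (const word of unique) {
      const entry = words.get(word) || { visits: 0, bounces: 0 };
      entry.visits += visits;   // frequency = sum of traffic across phrases
      entry.bounces += bounces;
      words.set(word, entry);
    }
  }

  const maxVisits = Math.max(0, ...[...words.values()].map((w) => w.visits));
  return [...words.entries()].map(([word, { visits, bounces }]) => ({
    word,
    visits,
    // Relative size in [0, 1], scaled against the busiest word.
    size: maxVisits > 0 ? visits / maxVisits : 0,
    // Assumption: a lower bounce rate renders as a more opaque word.
    opacity: visits > 0 ? 1 - bounces / visits : 0,
  }));
}
```

Summing visits and bounces per word before dividing gives a traffic-weighted bounce rate, so one low-volume phrase does not dominate a word's opacity.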

Mad props to the Juice Analytics folks for their Google API explorer tool, which makes it easy to get a grasp on the capabilities and restrictions. There are some hard to predict limitations on the dimension and metric combinations allowed. Also notable among the limitations is the lack of support for eventing data.

There's sure to be lots of interesting things to come. I'm not especially interested in tools that mimic the functionality of the web site in desktop application form, but rather in tools that go beyond the analyses that GA enables. Check out Google's gallery for a mix of the two.

The desktop reporting.com folks have a basic replication widget available with the promise of fancier stuff to come. The Juice Analytics folks have a keyword research tool called Concentrate.

We're pondering keyword applications as well, along with traffic source, automated diagnostics, and more advanced path analyses. What would you like to see available from your Google Analytics data?

Finally, if you'd like to try the keyword cloud view on your site, here's the link for the exceptionally quick keyword cloud hack above. Strangely, it works best in Safari / Chrome at the moment.

SEO Position Plus: Log Your Exact Google Rank with Google Analytics

We've been working hard on SEO Position over the last few months and the latest version has been in use by Stomper members for a while now.

With the announcement of updates to the Google referral string, it's now important to update the script, and there are some serious new features.

For those unfamiliar with the SEO Position script, in the picture above/right, the query "analytics motion charts" generated a click from Page 1 of the Google results while "motion charts" generated a click from Page 2.
SEO Position Plus tracks more types of referrals than the original script. All the categories logged to Google Analytics events are prefixed with SEO.

  • (Google | Yahoo | AOL | Live | MSN) Page: tracks page number of referrals from different engines
  • Google Site: Distinguishes subdomains & verticals like news, reader, corp, etc.
  • Google TLD: Find your average ranking by page number by country
  • Google Images: Pull out keywords, traffic flow, etc. from Google image referrals

Mark at MivaMerchant wrote up a great tutorial (Stomper Members: See my video in the portal).

So that's all cool and useful, but the last item "SEO Google Position" captures the exact rank of the result clicked on Google to generate this visit. The announcement says the referral change is being rolled out and we're currently only seeing it for 1/10 to 1/40 of traffic.

I expect the data volume to increase and for this to provide much better data on ranking, ranking changes, and the ROI on ranking changes.
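As a rough sketch of the idea (not the SEO Position Plus source), the page number and exact rank can be read out of a Google referrer like this. The function name parseGoogleReferrer is hypothetical, and deriving the page from the "start" offset at 10 results per page is an assumption; only the cd parameter comes from the announcement discussed above.

```javascript
// Hypothetical parser for a Google search referrer.
function parseGoogleReferrer(referrer) {
  let url;
  try {
    url = new URL(referrer);
  } catch (e) {
    return null; // empty or malformed referrer
  }
  // Accept google.com, www.google.co.uk, images.google.com, and so on.
  if (!/(^|\.)google\./.test(url.hostname)) return null;

  const params = url.searchParams;

  // Assumption: "start" is the result offset and pages hold 10 results.
  const start = parseInt(params.get("start") || "0", 10);
  const page = Math.floor((Number.isNaN(start) ? 0 : start) / 10) + 1;

  // "cd" carries the exact rank of the clicked result when present.
  const cd = parseInt(params.get("cd"), 10);
  const position = Number.isNaN(cd) ? null : cd;

  return { keyword: params.get("q"), page, position };
}
```

When cd is missing, as it is for most traffic during the rollout, position stays null and only the page-level data is available.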

Here's what you'll get when a referral comes through with the new cd parameter:


The script is available at /abtest/includes/seopositionplus.js and is released as open source under the Mozilla Public License. Use it for free for whatever you like, but if you make it better, you have to share!

Place the script after your Google Analytics code. Copy the SEOpositionPlus file to your server and add the following line beneath your call to pageTracker:

<script src="seopositionplus.js" type="text/javascript" language="javascript"></script>

Google Gets Faster

Recent talks by Jeff Dean of Google at Web Search Data Mining '09 (video) and an earlier talk at Univ of Washington present some interesting history on the evolution of Google's search technology.

The most tantalizing aspect of this is the notion of a "fast index", perhaps in-memory on many servers, dedicated to indexing (and computing authority or PageRank) for rapidly moving content like Digg and YouTube video honors. In general, with Twitter bubbling, the notion of real time search is focusing the industry on one of Google's key relevance metrics, freshness.

I presented an overview of Off-Page SEO factors at the Atlanta Web Entrepreneurs SEO group this week and Sam Beckett was kind enough to YouTube it:

As I've been compiling my thoughts on this, I created a nifty Prezi with some observations on Jeff Dean's content.

Some 10 years ago, Google had to flip indexes to accomplish updates. This is described as happening on a per machine basis. We've seen increases in the speed of updates, but recently the degree to which Google is paying attention to fast moving social media suggests a revolutionary speedup.

Jeff's talks hint at some of the mechanisms.


"Sub second latencies"... new pages are added to the index very rapidly, and our experience and Jeff's dialogue both hint at PageRank calculations at high speed.


The use of in-memory data structures offers some hints at how super rapid rank updates might happen.

Take Away

I'm still pretty early in assessing the impact of a new understanding of the underlying mechanisms, but I'll offer one hypothesis: Google is now capable of detecting the duration a link lives on a "hot list" like Digg's upcoming page. This means a successful social media promotion can have a much greater effect than simple social media participation.

We see the number of diggs affect how long it takes for a Digg permalink page to fall out of the top rankings. It's likely that the anchor text, or title of the Digg, is added to the index record for the page -- so pick your social media link text very carefully. It's also likely the thing that has the biggest impact on the long term effect of social media promotion.

A word from our Sponsor

Need more search engine success? We give you the basics and the hard-hitting science in the Stomping the Search Engines 2 DVD course. I teach the understanding search engines segment and try to walk a fine line between the basics and deep, long term insights.

Get it for just $1 when you try the Net Effect magazine from StomperNet. It covers traffic, conversion, social media, business building and operations, and more.

SEO Position & UI Links Google Analytics Scripts Updated

The SEO Position script now supports Yahoo, MSN & Live, in addition to Google. The name of the event has changed from "Google SEO" to Google. Thanks to Jim M. from Bunk Beds Now for the help expanding the script.

Alas, it was not originally apparent that logging a source event will flip the bit on "user bounce" and deflate your bounce rate for a page. Rapid detection of near-page-1 rankings may be worth the trade-off.

I've also updated the UI Region Logger to use eventing instead of synthetic page views. Just flip the ui_useEventing boolean to true in the script source.

Read more in the original post, but to recap, this script requires that you tag key areas of your interface with a UI attribute. For example, you'd add ui="sidebar" to the div that contains your sidebar. With every click, the script walks up the DOM and checks if there's a UI label above it. I'm looking forward to providing a tool to do a heatmap style visualization of click regions once I build up a good data set with this one. The data is much easier to isolate than in the prior exit link mode.
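The DOM walk can be sketched like this. It is a minimal illustration rather than the UI Region Logger's source: the function name findUiRegion and the "UI Region" event category are made up for the example, and the wiring assumes the classic pageTracker object is on the page.

```javascript
// Walk up from the clicked node until an ancestor carries a ui="..." label.
function findUiRegion(node) {
  for (let el = node; el; el = el.parentNode) {
    // Document and text nodes have no getAttribute, so skip them.
    if (typeof el.getAttribute === "function") {
      const label = el.getAttribute("ui");
      if (label) return label;
    }
  }
  return null; // the click happened outside any labeled region
}

// Browser-only wiring; skipped when there is no document (e.g. under Node).
if (typeof document !== "undefined") {
  document.addEventListener("click", function (event) {
    const region = findUiRegion(event.target);
    if (region && typeof pageTracker !== "undefined") {
      // "UI Region" is a hypothetical category name for this sketch.
      pageTracker._trackEvent("UI Region", "click", region);
    }
  });
}
```

One listener on the document covers every click, so regions only need the attribute and no per-link handlers.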

The image shows the first data from this site with one click on the tag cloud in the right sidebar and 2 clicks in the tools menu.


Built with BlogCFC, version 5.9. Contact Andy Edmonds or read more at Free IQ or SurfMind. © 2007.