Statistics 202: Statistical Aspects of Data Mining

Highly recommended - a Stanford class delivered at Google with the lectures available on Google Video / YouTube.

The course uses both the open source stats tool R and Excel, covering the heavyweight and lightweight ways to get things done.

The slides are at stats202.com.

Navigating the YouTube UI to work through these vids is painful, so I created a playlist with the videos. Click through for the larger format video:

Happy data-mining!

Monitoring Branded Traffic with Google Analytics

Tracking branded traffic is a useful activity for any business. Branded traffic refers to users who seek out your business by name, either through historical familiarity or perhaps the translation of offline marketing into online activity. Thus, with the help of return rate metrics, it can measure the effectiveness of offline advertising and the growth of a loyal customer base.

In some cases, these visits show up as "direct traffic" in analytics. This means the visit came in without a referring URL in the HTTP headers. Clicking a link from one site to another passes the preceding page's URL in the Referer header, the part of a web "click" that's invisible to the average user. When a user types in a URL directly, or clicks a bookmark, this field is empty. Alas, the usefulness of an empty referrer as an indicator of branded traffic has degraded over the years as firewall products have begun to block the referrer over privacy concerns.

Another phenomenon has worked to decrease the comprehensiveness of direct traffic as a measure of branded traffic. The pervasiveness and ease of use of internet search engines in modern browsers has led users to simplify the task of reaching a destination down to one tool: the search box. While it's theoretically more efficient to enter the URL in the URL bar, many users simply use their default search engine for both keywords and URL entry (est. 5% of search engine queries, from Teevan, Adar, Jones, & Potts, SIGIR 2007) because the simplicity of one operation outweighs the efficiency gain of a dedicated URL bar.

Domain squatters register misspellings and typos of popular destinations and serve generally useless pages. That makes the search engines' error correction on URLs an especially nice alternative to landing in a cybersquatter's shop of trinkets, and it adds to their popularity for navigation.

So I recommend creating a Google Analytics advanced segment that captures both direct traffic and inbound search referrals with query terms that uniquely identify your business. For some names, this may not be viable, but for most you can capture a large share of the direct navigation. For instance, myweddingfavors.com gets branded searches for "my wedding favors" and "myweddingfavors.com". The "my" term is the key signal that the query is branded.

Here's an example of the config:

Be sure to name your segment for the site you set it up on as segments are shared across sites in an account.
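The same rule can be sketched in code for anyone post-processing exported visit data. This is only an illustration of the logic, not the segment definition itself: the `visit` shape (a `medium` of "(none)" for direct traffic and "organic" for search, plus a `keyword`) and the `isBrandedVisit` helper are assumptions made for the example.

```javascript
// Hypothetical helper: decides whether one visit counts as branded.
// Assumed visit shape: { medium: "(none)" | "organic" | "referral", keyword: string | null }
function isBrandedVisit(visit, brandPatterns) {
  // Direct traffic (type-ins, bookmarks) is treated as branded.
  if (visit.medium === "(none)") return true;
  // Only search referrals with a known query can match a brand term.
  if (visit.medium !== "organic" || !visit.keyword) return false;
  const query = visit.keyword.toLowerCase();
  return brandPatterns.some((pattern) => pattern.test(query));
}

// Patterns for the myweddingfavors.com example: the "my" prefix is the brand signal,
// with or without spaces between the words.
const brandPatterns = [/\bmy\s*wedding\s*favors/];

const visits = [
  { medium: "(none)", keyword: null },
  { medium: "organic", keyword: "My Wedding Favors" },
  { medium: "organic", keyword: "myweddingfavors.com" },
  { medium: "organic", keyword: "cheap wedding favors" },
  { medium: "referral", keyword: null },
];
const brandedCount = visits.filter((v) => isBrandedVisit(v, brandPatterns)).length;
```

Referring sites never match, which is the behavior you want from a branded segment.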

And here's a report validating the conclusions:

  • 85% of iBlipper branded traffic is type-in and other components of direct
  • As it should be, no referring sites are included in iBlipper branded
  • Over a third of search engine traffic is branded

Bing! You've got SERP Position

The SEO Position Plus script has been updated to pull page offset from Bing.com -- thanks to Stomper Jim MacKay. Get it at the normal place and keep an eye on http://gist.github.com/138555.

Using Open Source Tools to Understand Your Data: R & GGobi

I've been a long-time, occasional user of "R", an open source alternative to high-end statistics packages like SAS & SPSS. I recently spent some time learning an associated data exploration tool called "GGobi" and its integration with R (rggobi).

While R is a worthy tool for data summarization, stats, plotting, and cleaning, GGobi is especially useful for exploration. It features "linked plots" in which mousing over a point in one plot highlights it in all other plot windows. I'm a big fan of the scatterplot matrix (shown to right) for understanding relationships between variables and distribution shapes.

This is probably overkill for basic website stats from Google Analytics, but is very useful for more complex data. I've crafted a screencast that walks through the basics of reading a CSV and launching GGobi through R. It then goes on to create a scatterplot matrix and apply a custom color scheme.

The dataset in use here is data volunteered from Firefox users about their bookmarks and history data-stores. Details of the analysis are described here.

Getting a handle on R is pretty much a prerequisite for getting value from GGobi. There are lots of great resources online, including Statmethods.net Quick-R, but the essential step is to add R-Seek.org to your browser search box.

Google Analytics API Roundup

Man, you gotta love the internet. Within a week of the general availability of the Google Analytics API there's an explosion of interesting new works and open source creations. While some folks may have had access before the announcement, the checkin streams on github show there's no time like the present.

There are two Ruby wrappers, Gattica and Garb, as well as one in Python. (Update: See this official Google Analytics blog post for more)

I did a bit of hacking over the weekend and adapted one of the code samples to present a word cloud of search referral keywords. Bounce rate is mapped to opacity, and each word's frequency is the sum of traffic from all phrases that include the word.
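For the curious, the aggregation behind a cloud like this can be sketched in a few lines. This is not the hack's actual source: the row shape (`keyword`, `visits`, `bounces`) and the choice to make low-bounce words more opaque are assumptions for the example.

```javascript
// Sketch: roll keyword phrases up into per-word totals for a word cloud.
// Assumed input rows: { keyword: string, visits: number, bounces: number }
function buildWordCloud(rows) {
  const totals = new Map();
  for (const { keyword, visits, bounces } of rows) {
    // Count each word once per phrase, even if the phrase repeats it.
    const words = new Set(keyword.toLowerCase().split(/\s+/).filter(Boolean));
    for (const word of words) {
      const t = totals.get(word) || { visits: 0, bounces: 0 };
      t.visits += visits;
      t.bounces += bounces;
      totals.set(word, t);
    }
  }
  return [...totals].map(([word, t]) => ({
    word,
    weight: t.visits, // sum of traffic from all phrases containing the word
    // Assumption: a lower bounce rate renders as a more opaque word.
    opacity: t.visits > 0 ? 1 - t.bounces / t.visits : 0,
  }));
}

const cloud = buildWordCloud([
  { keyword: "motion charts", visits: 10, bounces: 5 },
  { keyword: "analytics motion charts", visits: 30, bounces: 15 },
]);
```

The `weight` would typically drive font size and `opacity` the CSS opacity of each word.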

Mad props to the Juice Analytics folks for their Google API explorer tool, which makes it easy to get a grasp on the capabilities and restrictions. There are some hard-to-predict limitations on the dimension and metric combinations allowed. Also notable among the limitations is the lack of support for eventing data.

There's sure to be lots of interesting things to come. I'm not especially interested in tools that mimic the functionality of the web site in desktop application form, but rather in tools that go beyond the analyses that GA enables. Check out Google's gallery for a mix of the two.

The desktop reporting.com folks have a basic replication widget available with the promise of fancier stuff to come. The Juice Analytics folks have a keyword research tool called Concentrate.

We're pondering keyword applications as well, along with traffic source, automated diagnostics, and more advanced path analyses. What would you like to see available from your Google Analytics data?

Finally, if you'd like to try the keyword cloud view on your site, here's the link for the exceptionally quick keyword cloud hack above. Works best in Safari / Chrome at the moment strangely.

SEO Position Plus: Log Your Exact Google Rank with Google Analytics

We've been working hard on SEO Position over the last few months and the latest version has been in use by Stomper members for a while now.

With the announcement of updates to the Google referral string, it's now important to update the script, and there are some serious new features.

For those unfamiliar with the SEO Position script, in the picture above/right, the query "analytics motion charts" generated a click from Page 1 of the Google results while "motion charts" generated a click from Page 2.

SEO Position Plus tracks more types of referrals than the original script. All the categories logged to Google Analytics events are prefixed with SEO.

  • (Google | Yahoo | AOL | Live | MSN) Page: tracks page number of referrals from different engines
  • Google Site: Distinguishes subdomains & verticals like news, reader, corp, etc.
  • Google TLD: Find your average ranking by page number by country
  • Google Images: Pull out keywords, traffic flow, etc. from Google image referrals

Mark at MivaMerchant wrote up a great tutorial (Stomper Members: See my video in the portal).

So that's all cool and useful, but the last item "SEO Google Position" captures the exact rank of the result clicked on Google to generate this visit. The announcement says the referral change is being rolled out and we're currently only seeing it for 1/10 to 1/40 of traffic.

I expect the data volume to increase and for this to provide much better data on ranking, ranking changes, and the ROI on ranking changes.
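To make the idea concrete, here is a minimal sketch of reading the exact rank from a referrer. It is not the SEO Position Plus source: it assumes the new referral string carries the clicked position in a `cd` query parameter and the search phrase in `q`, and the `parseGoogleRank` name is made up for the example.

```javascript
// Sketch: extract the clicked result position from a Google referrer.
// Assumes the position arrives in the "cd" parameter and the query in "q".
function parseGoogleRank(referrer) {
  let url;
  try {
    url = new URL(referrer);
  } catch (e) {
    return null; // empty or malformed referrer
  }
  // Accept google.com as well as country domains and subdomains.
  if (!/(^|\.)google\./.test(url.hostname)) return null;
  const rank = parseInt(url.searchParams.get("cd"), 10);
  if (Number.isNaN(rank)) return null; // old-style referral without a rank
  return { keyword: url.searchParams.get("q") || "", rank };
}

const withRank = parseGoogleRank("http://www.google.com/url?sa=t&source=web&cd=7&q=motion+charts");
const withoutRank = parseGoogleRank("http://www.google.com/search?q=motion+charts");
```

Referrals that still use the old format simply return null, so they can fall back to page-level tracking.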

Here's what you'll get when a referral comes through with the new cd parameter:


The script is available at /abtest/includes/seopositionplus.js and is released as open source under the Mozilla Public License. Use it for free for whatever you like, but if you make it better, you have to share!

Place the script following your Google Analytics code. Copy the SEOpositionPlus file to your server and add the following line beneath your call to pageTracker:

<script src="seopositionplus.js" type="text/javascript" language="javascript"></script>

SEO Position & UI Links Google Analytics Scripts Updated

The SEO Position script now supports Yahoo, MSN & Live, in addition to Google. The name of the event has changed from "Google SEO" to Google. Thanks to Jim M. from Bunk Beds Now for the help expanding the script.

Alas, it was not originally apparent that logging a source event will flip the bit on "user bounce" and deflate your bounce rate for a page. Rapid detection of near-page-1 rankings may be worth the trade-off.

I've also updated the UI Region Logger to use eventing instead of synthetic page views. Just flip the ui_useEventing boolean to true in the script source.

Read more in the original post, but to recap, this script requires that you tag key areas of your interface with a UI attribute. For example, you'd add ui="sidebar" to the div that contains your sidebar. With every click, the script walks up the DOM and checks if there's a UI label above it. I'm looking forward to providing a tool to do a heatmap-style visualization of click regions once I build up a good data set with this one. The data is much easier to isolate than in the prior exit link mode.

The image shows the first data from this site with one click on the tag cloud in the right sidebar and 2 clicks in the tools menu.

Using the new Google Eventing for SEO Reporting

I'm a huge fan of metrics about activities within a page, like scroll depth or form abandonment analyses. Google Analytics' new eventing facilities (almost out of beta, it seems!) enable these kinds of data logging and reporting.

Intro to GA Eventing

You get 4 slots of data to log to:
  • Category
  • Action
  • Label
  • Value

The official eventing docs describe an example with a Videos category, actions for play/pause/stop, and labels of the video name.
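In ga.js the call is `_trackEvent(category, action, opt_label, opt_value)`, with the category first and an integer value. A small sketch of the video case follows; the `logVideoEvent` wrapper and the stand-in tracker are made up so the example runs outside a browser, and on a real page you would pass the pageTracker object from your GA snippet.

```javascript
// Sketch: log a video interaction through a ga.js-style tracker.
// `tracker` must expose _trackEvent(category, action, label, value).
function logVideoEvent(tracker, action, videoName, secondsWatched) {
  // The optional value slot takes an integer, so round the seconds.
  const value = Math.round(secondsWatched);
  return tracker._trackEvent("Videos", action, videoName, value);
}

// Stand-in tracker that records calls instead of sending them to GA.
const recordedEvents = [];
const fakeTracker = {
  _trackEvent: (category, action, label, value) => {
    recordedEvents.push({ category, action, label, value });
    return true;
  },
};

logVideoEvent(fakeTracker, "Play", "Intro screencast", 0);
logVideoEvent(fakeTracker, "Pause", "Intro screencast", 42.4);
```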

Ranking Insights

I've developed a script that enables ranking information to be logged to GA events. I pull the page offset (i.e. the start parameter) from the Google search referral string.
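Roughly, the page lookup works like the sketch below. This is an illustration rather than the script itself: `googleResultPage` is a made-up name, and it assumes 10 results per page and that a missing start parameter means page 1.

```javascript
// Sketch: turn the "start" offset in a Google search referrer into a page number.
// Assumes 10 results per page unless told otherwise.
function googleResultPage(referrer, resultsPerPage = 10) {
  let url;
  try {
    url = new URL(referrer);
  } catch (e) {
    return null; // empty or malformed referrer
  }
  if (!/(^|\.)google\./.test(url.hostname)) return null;
  // No start parameter is treated as the first page of results.
  const start = parseInt(url.searchParams.get("start") || "0", 10);
  if (Number.isNaN(start) || start < 0) return null;
  return Math.floor(start / resultsPerPage) + 1;
}

const firstPage = googleResultPage("http://www.google.com/search?q=analytics+motion+charts");
const secondPage = googleResultPage("http://www.google.com/search?q=motion+charts&start=10");
```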

The reporting options are somewhat limited for events -- you can't pivot any report outside of the Event section by this data as of yet. It took a couple iterations to get the logging design to a useful point.

Rankings by Page

Using the Label report, we can see event data by Page with average rank (labeled "average value" in the data grid). The usage tab on this report will let you assess engagement by bounce rate and page views.

If you're logging multiple categories of events, you'll want to drill down to labels through the Google SEO category.

Rankings by Keyword

Using the drop-down pivot to select keyword will get your average ranking per keyword, across all pages.


Uh, Google...

Why did I have to write code to get this data? Hard to say. Webmaster Tools gives you some data on ranking position. This, along with accurate reporting of image search referrals, should be built into Google Analytics.

A Word from Our Sponsor

This script is brought to you, in full open source fashion, by StomperNet LLC. The latest offering from StomperNet is Formula 5 -- a training program designed to help you amplify your business success. The program works with multiplicative effects, beyond those demonstrated in my conversion funnel modeler to encompass your entire business. Check it out now.

Get the Script

The script is a dozen or so lines of code designed to be placed after your call to GA's pageTracker function. You do have to be running the new ga.js scripts, not the legacy urchin.js.

Copy the SEOposition file to your server and add the following line beneath your call to pageTracker:
<script src="seoposition.js" type="text/javascript" language="javascript"></script>


The script is seoPosition.js and is released under the Mozilla Public License (MPL). The MPL is a friendly open source license allowing any type of use but requiring that enhancements to the existing file be contributed back to open source. This work was inspired by a filter hack from andrescholten.nl -- I didn't want to go through that trouble on every site. Thanks to DaveL @ E-Tail.be for calling my attention to this.



Hands on with Google Analytics Motion Charts

I was very excited earlier this week to discover I had access to the new Google Analytics beta features. Custom reports are certainly a useful tool; they allow you to construct both large scale exploratory views as well as concise views in which the viewer doesn't have to ask "which metrics should I look at?"

The big payoff is in the new Motion Chart visualizations, which attempt to capture 5 dimensions by mapping attributes to x, y, size, and color, animated over time.

While you could check out the official videos, here's a look at my recent foray into creating an iPhone (web) application.

Pictured are referrals from the Apple web application directory, where iBlipper landed Sept. 10th. On the x-axis are unique searches, or phrases typed into the iBlipper application. The y-axis is a correlated metric, time on site, and size is mapped to % new users.

We can watch as pages 1-5 deliver less and less traffic as the app drops off the category-independent list and falls down the entertainment app list in the default recency ordering.

Pay close attention to the axis values -- there are some subtle interpretations available from mixing engagement, volume, and loyalty (% new visits in this case).

For instance, iBlipper briefly landed on the top 10 entertainment apps list, with a URL of /webapps/entertainment/index_top.html. These users seemed to spend more time on the site without entering their own search phrases, suggesting a less directed choice in visiting iBlipper and more passive usage of the application. This shows up as the green dot highlighted to the right: for its number of unique searches, it sits higher in time on site than most other referral paths.



I'll leave you with some power user tricks for using motion charts:

  • Filters applied in a report view control the data shown in the visualization. In the video case, I've filtered by referrals including '/webapp', a unique signature for the Apple directory
  • There's a subtle option on the x & y axis to code by lin(ear) or log(arithmic). Adjusting both axes to logarithmic can greatly inform on the underlying mechanisms.

Tracking UI Level Links: An Open Source Script

One of the challenges with the current complex site designs is that multiple links to the same destination tend to appear on the same page.

This does not allow you to understand how your allocation of real estate is being used without resorting to really fancy analytics packages.

To solve this problem, I developed a script that, upon every click, walks up the document object model (DOM) looking for a ui attribute on an HTML tag.

If it finds one before it hits the BODY tag, then it adds a parameter to the link called ui with the value of that attribute, allowing you to understand which links on a page are being used. Shown below is a report from Google Analytics for this site showing how people arrive at my bio page:


If you visit my bio from the About menu at the top of my blog, it adds ui=nav.

I've licensed the script under the MPL and you're welcome to use it on your site, providing you share any enhancements. Grab it at http://alwaysbetesting.com/abtest/includes/logger.js.

All you have to do is include the script on your page and add ui=element_name attributes to key HTML elements that make up your site structure.
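The core of the approach can be sketched as two small functions. This is a simplified illustration, not the contents of logger.js: `findUiRegion` and `appendUiParam` are names invented for the example, and the nodes only need `tagName`, `parentNode`, and `getAttribute`, so the sketch also runs outside a browser.

```javascript
// Sketch: walk up from the clicked node until a ui attribute or the BODY tag is found.
function findUiRegion(node) {
  for (let el = node; el && el.tagName !== "BODY"; el = el.parentNode) {
    const ui = el.getAttribute ? el.getAttribute("ui") : null;
    if (ui) return ui;
  }
  return null; // no labelled region above the click
}

// Sketch: add ui=<region> to a link, respecting an existing query string or hash.
function appendUiParam(href, ui) {
  if (!ui) return href;
  const hashAt = href.indexOf("#");
  const base = hashAt === -1 ? href : href.slice(0, hashAt);
  const hash = hashAt === -1 ? "" : href.slice(hashAt);
  const separator = base.includes("?") ? "&" : "?";
  return base + separator + "ui=" + encodeURIComponent(ui) + hash;
}

// Plain objects standing in for DOM nodes: <body><div ui="nav"><a>..</a></div></body>
const body = { tagName: "BODY", parentNode: null, getAttribute: () => null };
const nav = { tagName: "DIV", parentNode: body, getAttribute: (n) => (n === "ui" ? "nav" : null) };
const link = { tagName: "A", parentNode: nav, getAttribute: () => null };

const taggedHref = appendUiParam("/bio.cfm", findUiRegion(link));
```

In a page, you would call these from a click handler and rewrite the link's href just before the browser follows it.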

Because this happens with Javascript, and only when the user clicks, there's no danger of creating duplicate content for the search engine spiders. Alas, that means it doesn't fix typical analytics overlay views either for duplicate links, but you can typically see the source element in your path reports.


Built with BlogCFC, version 5.9. Contact Andy Edmonds or read more at Free IQ or SurfMind. © 2007.