Modelling your Snowplow event data: Part 1

Wednesday 8 August, 2018 | By: Simon Rumble

One of the important features of Snowplow is that you can build your own custom data models to suit your unique analytical requirements in a space-efficient and performant way. It’s important that your data model can evolve and grow in complexity as your business grows and your needs get more advanced.

This is the first in a series of blog posts taking your through the thought process of building a data model. Good background reading for this series is the Snowplow blog post on event modeling.

Business context

Here at Snowflake Analytics we’re putting some effort into using our own Snowplow pipeline inside the business. At the moment we’re kind of like the proverbial plumber’s house with non-functioning pipes

My main focus at the moment is our sales process: tracking prospects as they come in, qualifying leads and then converting them into customers. Content like this blog forms an important part of that process, exposing our message to our target market of businesses that want to do advanced analytics.

Site visitors’ content consumption patterns give us some insight into their interests and expertise. We want to feed these insights into our sales pipeline. Where possible we can identify people or their companies and, based on their content consumption, we can decide if it’s worth trying to track down the individuals who are at the right point in the thought process to be a good candidate for our services.

Why model data?

Snowplow architecture stage 5: Data Modelling

The data that drops into the atomic.events table is event-level data. To make it useful we need to apply some logic to it. You can get a lot done with simple queries to show you basic metrics, for example:

SELECT
  page_urlpath,
  SUM(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) AS pageviews,
  COUNT(DISTINCT domain_sessionid) AS sessions,
  COUNT(DISTINCT domain_userid) AS unique_browsers
FROM atomic.events
WHERE
  app_id = 'snowflake-analytics.com'
  AND CONVERT_TIMEZONE('Australia/Sydney', derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '7 days'
GROUP BY 1
ORDER BY 2 DESC

This will get you some quick metrics replicating what you’d get in a tool like Google Analytics.

There are some limitations baked into the query above too. First up, you’re accepting the default definitions for sessions and unique_browsers. The domain_sessionid field is generated by the client JavaScript which has some limitations and will differ from the numbers Google Analytics generates in some odd ways. domain_userid is useful for a single domain site but if you have multiple domains and want to merge those, or you’ve got identified users and you’d like to deduplicate those across devices, or you use AMP which doesn’t get that cookie set, you need more.

Of course you’re also missing any kind of engagement metrics. One of Snowplow’s best features is its ability to measure accurate time spent. This, or even just replicating the way GA and Adobe do it, will require something more complex and a chunk more load on your database.

Remove extraneous data, save size, speed up queries

If you start seeing lots of data volume, these queries will start being onerous quite quickly. For standard end-user reporting you want to have the fastest performance possible and avoid putting lots of load on your database recalculating this stuff every time. Summary tables are the answer here.

Keeping all your raw data around in a database can get expensive quite quickly, especially given there’s a bunch of replicated data that you probably don’t need often. Columns like useragent are repeated for every single event, page pings happen often to give you good time spent resolution, and there’s the deprecated fields like br_gears which aren’t useful to anyone.

Best to get rid of those and only keep the fields and rows you need, merging the result of all your calculations. Your modelled data might include a unified row for each pageview with some additional columns representing time spent, scroll depth and without any extraneous columns you don’t need.

Centralise business rules upstream of reporting

Working from a uniform set of business rules makes everyone’s life easier. We’ve all experienced the fun of explaining why two different reports created by different groups end up with different results. By limiting access to the raw data and consolidating upstream you can set and maintain these business rules and common calculations in one place. For example, how you define:

  • time spent metric
  • unique browsers metric
  • bounces
  • traffic source classification
  • attribution rules
  • conversions

By moving these into a central data model, everyone works from the same source and you get less confusion. The business model can be updated as things evolve.

Keep complexity hidden

Related to centralisation, you can hide complexity and let your analyst and reporting users get on with their work while you worry about the complex stuff. For example, the Snowplow AMP tracker has some tricky limitations. It produces no page pings, so calculations of time spent will need to be somewhat arbitrary or use the GA/Adobe method. How do you merge these with your more accurate measurements from normal web pages and inside apps? AMP also doesn’t set a first-party cookie to become domain_userid. How will you deal with unique browser metrics for these users?

By building a model with a common definition, your reporting users can see a simple single row per pageview data model, with columns representing the output of complex logic for things like user deduplication, time spent, scroll depth and conversions. Their queries can be simple aggregations and filtering, while your data model holds all the complex conditional logic to deal with edge cases.

An example of the complexity rabbit hole

While working on the data model for this blog post I came across another weird edge case that’s worth sharing. The rows below have the same pageview_id, some have the same event_id and domain_userid and they get 25 pageviews for each combination, despite being in places as far apart as Indonesia and California.

Odd behaviour from Google agent

These are almost certainly bots of some kind, quite a lot have Google’s fingerprints on them. It can be quite difficult to work out if you want to count these, and whether some subset you want to count and another you don’t want to count. These are complexities you want to decide on once, then hide from everyone forever after.

Paper over definitional changes and breakage

It’s quite common to see tracking you’ve built breaks because a web developer changed something unexpectedly. This happens a lot if you use page URLs to build a Goal Funnel in Google Analytics. The URL for the conversion step changes and you end up with a gap in data. By modelling out your data you can seamlessly switch across from the old to the new definition and your analytical users need never know, simply seeing a boolean for whether a conversion occured.

Model scope

An important thing to think about is the scope of each object in your data model and what fits in each. Pageviews summarise things about the pageview itself. Sessions hold things that change once per session such as traffic source information. Users hold everything we know about the user, and potentially other ways we can identify users.

Pageview scope

  • Engagement metrics:
    • Time spent on page
    • Scroll depth
  • Events that happen on page:
    • in-view events
    • social share
    • form submissions
  • Conversions

Session scope

  • Attribution
    • Traffic sources
    • Classification of those sources

User scope

  • Identity from login, email clickthrough, other
  • Data tied to identity: CRM
  • IP-lookup data: Clearbit: may have multiple values

A candidate data model

Pulling all this together, we might come up with a data model that looks like this. To build a complete view, you’d select from the atomic.events table, then join that to com_snowplowanalytics_snowplow_web_page_1 to get the pageview identifier. From there I can join on the pageview, session and user tables to get a complete picture. It’s probably easiest to create a view that extends atomic.events and bolts on pageview-, session- and user-scope details for every row.

A potential final data model

Eventually you would write out new data tables that replace the need to ever hit atomic.events directly, containing the items you want from atomic.events and the identifiers for the applicable pageview, session and user objects. You wouldn’t copy across page pings because all you need is summarised in the pageview table. Again, you might create a view that does the joins to give you one row per pageview with everything an analyst could need.

Next steps

In the next part of this series I’ll show you how to build an initial pageview scope data model to analyse content engagement metrics, specifically accurate time spent and scroll depth based off page pings.

Further editions will go through the session and user scope models. Finally we’ll pull it all together by automating the model and creating a derived modelled data table.

About

We exist to make organisations better understand their businesses by enabling all decision makers in a company to work with the same version of the truth.

Social Links