Modelling your Snowplow event data: Part 5 Automation

Tuesday 18 September, 2018 | By: Simon Rumble

In the first four parts of this series, we modelled out:

  • Pageviews: accurate time spent incorporating page pings and maximum scroll depth
  • Sessions: traffic sources and initial landing details
  • Users: filtering internal traffic and looking up things we know about the users

Now it’s time to move the model into production and start using it. To do this we’ll use Snowplow’s SQL Runner tool at the end of each Snowplow batch load into Redshift.

Schema for models

In my previous posts I’ve created a series of lookup_* tables with my modelled data. To keep this clean, you probably want to create a new schema specifically for your modelled data. This makes clear the separation of raw, unopinionated event-level data from modelled, opinionated data that applies your business rules. Once everyone has got used to it, you can remove access to the raw data for most of your users, ensuring everyone is working from the same business rules.

By convention, Snowplow uses scratch and derived schemas for data modelling steps.

  • scratch is for ephemeral data used in intermediate processing steps but not actually part of the output. It should be discarded as a final processing step.
  • derived is for your model output data.
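If those schemas don’t exist yet, a one-time setup might look like the following sketch. The schema names follow the Snowplow convention above; the user and group names are illustrative assumptions.

```sql
-- One-time setup: create the modelling schemas.
CREATE SCHEMA IF NOT EXISTS scratch;
CREATE SCHEMA IF NOT EXISTS derived;

-- The data modelling user (here "datamodeling", as in the playbook below)
-- needs to create and drop tables in both schemas.
GRANT ALL ON SCHEMA scratch TO datamodeling;
GRANT ALL ON SCHEMA derived TO datamodeling;

-- Analysts only need read access to the model output,
-- not to scratch or to the raw atomic data.
GRANT USAGE ON SCHEMA derived TO GROUP analysts;
```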

SQL Runner

SQL Runner is a pretty simple tool that takes a series of SQL scripts and executes them on command against a specified database with specified credentials. The important piece of information from the documentation is the difference between :queries: and :steps:.

Queries within a Step execute in parallel, while the Steps themselves execute in the order in which they appear. If your models build on one another, you’ll need to separate them into Steps so they run in sequence. The model I’ve built out in this series doesn’t have any dependencies between the steps, so they can all run in parallel within the same Step.
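For example, if the session model did depend on the pageview model having been rebuilt first, a sketch of the steps section would split them up like this (step names are illustrative):

```yaml
:steps:
  - :name: "Rebuild the pageview model first"
    :queries:
      - :name: pageview
        :file: model-pageview.sql
  - :name: "Then rebuild the models that depend on it, in parallel"
    :queries:
      - :name: session
        :file: model-session.sql
      - :name: user
        :file: model-user.sql
```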

Configuring your scripts

The syntax for SQL Runner’s playbooks is straightforward. For our model playbook we’re just executing the following:

:targets:
  - :name: "Snowflake Analytics data model"
    :type: redshift
    :host: redshift-endpoint # The endpoint as shown in the Redshift console
    :database: snowplow # Name of database
    :port: 5439 # Default Redshift port
    :username: datamodeling
    :password: {{secret "redshift-datamodeling-password"}}
    :ssl: true # SSL disabled by default
:steps:
  - :name: "Rebuild pageview, session and user models in parallel"
    :queries:
      - :name: pageview
        :file: model-pageview.sql
      - :name: session
        :file: model-session.sql
      - :name: user
        :file: model-user.sql

Automation

Managed Service customers

If you’re a Snowplow Managed Service customer, SQL Runner is handled for you. The process is quite simple:

  1. Clone your pipeline’s snowplow-proservices repository
  2. Create an sql-runner directory in jobs/main
  3. Place the playbook YAML file in a directory called playbooks
  4. Place your SQL scripts in a directory called sql
  5. Commit your changes to a named branch on the repository and push back upstream
  6. Raise a support ticket with Snowplow advising of the branch you’ve created and the schedule on which you’d like the new data model to run, for example “daily at UTC midnight” or “after every batch”
  7. Snowplow will implement your new data model scripts

From then on, any changes you make on the master branch for existing playbooks will update within about 45 minutes. You’ll need to follow the process above again if you create a new playbook running on a different schedule.

Open Source users

If you’re using an Open Source Snowplow pipeline, you’ll have to set up and manage scheduling yourself, most likely in the same cron-driven script that currently kicks off enrichment and database loads. A good place to start is the Guide for devops.
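As a sketch, assuming sql-runner is installed on the box that runs your loads, a crontab entry could run the playbook on a fixed schedule after the overnight batch has finished. The paths and playbook name here are assumptions; adjust them to your installation.

```shell
# Crontab sketch: rebuild the data model at 03:00 UTC daily, after the
# overnight batch load completes. -playbook points at the playbook YAML;
# -sqlroot points at the directory holding the model SQL scripts.
0 3 * * * /opt/snowplow/sql-runner -playbook /opt/snowplow/playbooks/data-model.yml -sqlroot /opt/snowplow/sql
```

If you need the model to run after every batch rather than on a timer, append the same sql-runner invocation to the end of the script that performs the load, so it only fires once the load has succeeded.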

Optimisations

Scope down to a smaller time range

My scripts execute across the entire events dataset we have. Our site isn’t high traffic so this isn’t a huge imposition, executing in about a minute on our Redshift database. If you have decent traffic volumes, you’ll want to scope down the reprocessing of models somewhat. If your queries involve any kind of lookback, that will take some careful tuning of your query windows. Be particularly careful if you’re calculating sessions in some way different to COUNT(DISTINCT domain_sessionid) as a session can easily cross a batch boundary.
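One common pattern, sketched below, is to restrict each rebuild to recent batches plus a lookback buffer: delete the model rows for that window, then re-insert only the recomputed rows. The window sizes are illustrative, and the table names follow the model built earlier in this series.

```sql
-- Sketch: reprocess only the last 7 days plus a 1-day buffer, so that
-- sessions straddling a batch boundary are recalculated whole rather
-- than split across two runs.
DELETE FROM derived.pageview
WHERE pageview_id IN (
  SELECT pg.id
  FROM atomic.events ev
  JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg
    ON ev.event_id = pg.root_id
  WHERE ev.collector_tstamp > SYSDATE - INTERVAL '8 days'
);

-- ...then re-run the INSERT for the same window, filtering the source
-- events with the identical collector_tstamp predicate.
```

The same delete-and-reinsert window applies to the session and user models; the key is that the buffer is wider than the longest session or lookback your model depends on.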

Creating useful SQL views

The joins you need to make for this to be useful are a bit complicated to do routinely, so it makes sense to create a nice simple view for your pageview events that incorporates these elements on every row. These views become something you can expose to a visualisation tool like Tableau to enable easy exploration while incorporating engagement metrics, your business rules for traffic source attribution and information about users.

Below is the view I’ve created for our data set. For convenience I’m also including the timezone conversion we routinely use to convert queries to Sydney time. For performance reasons, you might end up writing the result of this view out into a table as an additional step: that costs storage space, but queries against it should be faster.

CREATE VIEW derived.view_pageviews AS (
  SELECT
    CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) AS derived_tstamp_sydney,
    ev.*,
    pv.pageview_id,
    pv.time_spent_seconds,
    pv.max_pp_yoffset_max,
    pv.max_doc_height,
    pv.scroll_depth_percent,
    se.first_refr_urlhost,
    se.first_page_referrer,
    se.first_refr_urlscheme,
    se.first_refr_urlport,
    se.first_refr_urlpath,
    se.first_refr_urlfragment,
    se.first_refr_medium,
    se.first_refr_source,
    se.first_refr_term,
    se.first_mkt_campaign,
    se.first_mkt_medium,
    se.first_mkt_source,
    se.first_page_urlpath,
    se.last_page_urlpath,
    se.channel,
    us.first_session,
    us.last_session,
    us.last_domain,
    us.last_name,
    us.last_legal_name,
    us.last_description,
    us.last_type,
    us.last_linkedin_handle,
    us.last_logo,
    us.last_metrics_alexa_global_rank,
    us.last_metrics_annual_revenue,
    us.last_metrics_employees,
    us.last_metrics_market_cap,
    us.last_crunchbase_handle,
    us.user_internal
  FROM atomic.events ev
  JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 pg
    ON ev.event_id = pg.root_id
  JOIN derived.pageview pv
    ON pg.id = pv.pageview_id
  JOIN derived.session se
    ON ev.domain_sessionid = se.domain_sessionid
  JOIN derived.user us
    ON ev.domain_userid = us.domain_userid
  WHERE
    ev.app_id = 'snowflake-analytics.com'
    AND ev.event = 'page_view'
) WITH NO SCHEMA BINDING
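If the view proves slow in practice, materialising it is a small sketch of an additional playbook step. The table name here is an illustrative assumption.

```sql
-- Sketch: materialise the view into a table as a final playbook step.
-- Trades storage space for query speed; downstream tools then query
-- derived.pageviews instead of the view.
DROP TABLE IF EXISTS derived.pageviews;
CREATE TABLE derived.pageviews AS
SELECT * FROM derived.view_pageviews;
```

Because this step depends on the pageview, session and user models all being rebuilt first, it belongs in its own Step after the parallel model-rebuild Step in the playbook.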

Querying the data

Now let’s use the models and views we’ve built and start analysing some content.

Engagement metrics

This query gives you some handy engagement metrics including bucketing pageviews into scroll depth groups.

SELECT
  page_urlpath,
  SUM(CASE WHEN scroll_depth_percent = 0 THEN 1 END) AS "0%",
  SUM(CASE WHEN scroll_depth_percent > 0 AND scroll_depth_percent <= 0.25 THEN 1 END) AS "25%",
  SUM(CASE WHEN scroll_depth_percent > 0.25 AND scroll_depth_percent <= 0.5 THEN 1 END) AS "50%",
  SUM(CASE WHEN scroll_depth_percent > 0.5 AND scroll_depth_percent <= 0.75 THEN 1 END) AS "75%",
  SUM(CASE WHEN scroll_depth_percent > 0.75 THEN 1 END) AS "100%",
  ROUND(MEDIAN(scroll_depth_percent)*100) AS "median scroll depth percent",
  AVG(time_spent_seconds) AS "average time spent seconds",
  SUM(time_spent_seconds) AS "total time spent"
FROM derived.view_pageviews
WHERE
  page_urlpath LIKE '/blog/data-modeling%'
  AND user_internal IS NOT TRUE
GROUP BY 1
ORDER BY COUNT(*) DESC

(Screenshot: content engagement metrics)

(Screenshot: bucketed scroll depth report)

Looking at identified companies

As our site is aimed at B2B marketing, identifying the companies visiting it is particularly interesting. I’m excluding traffic from ourselves and our friends at Snowplow, and might add more exclusions as we keep exploring the data.

SELECT
  COALESCE(last_domain, user_ipaddress) AS last_domain,
  last_crunchbase_handle,
  last_linkedin_handle,
  last_name,
  SUM(time_spent_seconds) AS time_spent_seconds,
  COUNT(*) AS pageviews,
  COUNT(DISTINCT domain_sessionid) AS sessions,
  COUNT(DISTINCT domain_userid) AS unique_browsers,
  ROUND(AVG(scroll_depth_percent) * 100) AS avg_scroll_depth
FROM
  derived.view_pageviews
  WHERE
    user_internal IS NOT TRUE
    AND COALESCE(last_domain, '') NOT IN ('snowplowanalytics.com')
    AND derived_tstamp_sydney > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '90 days'
GROUP BY 1, 2, 3, 4
ORDER BY 8 DESC, 6 DESC, 5 DESC

Next steps?

So that’s the end of our giant odyssey into Snowplow data models. We haven’t covered everything that can go into a data model here. In particular, one of the best uses of data models is identity mapping: stitching together all the historical activity of a domain_userid once you can identify who is behind it. To do this you’d use the same technique I’ve used here for companies, keyed on whatever identification point (or points!) you have. We, unfortunately, don’t have an easy way to reliably identify people.
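As a sketch, assuming you capture some identifier at an identification point (a login or a form submission, say, landing in the standard user_id field), an identity map could take the most recent known identity for each browser:

```sql
-- Sketch: map each domain_userid to the most recently seen identifier
-- for it. user_id here stands in for whatever identification point you
-- actually capture; substitute your own field.
CREATE TABLE derived.identity_map AS
SELECT domain_userid, user_id
FROM (
  SELECT
    domain_userid,
    user_id,
    -- Keep only the latest identification event per browser.
    ROW_NUMBER() OVER (
      PARTITION BY domain_userid
      ORDER BY collector_tstamp DESC
    ) AS rn
  FROM atomic.events
  WHERE user_id IS NOT NULL
) identified
WHERE rn = 1;
```

Joining this table back onto events on domain_userid then attributes all of a browser’s history, including activity before the identification event, to the identified user.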

You might now like to have a read of the Snowplow web data model, which can be a bit daunting at first but probably makes a whole lot more sense now that you’ve looked through a simpler model.
