Modelling your Snowplow event data: Part 3 Sessions

Wednesday 22 August, 2018 | By: Simon Rumble

In the first two parts of this series, we looked at modelling out pageview events to include accurate time spent and scroll depths. Now we’ll roll up sessions.

A session or visit to a web site is when a user comes to your site, does some stuff, then left. The key feature of a session is that the user came from somewhere, did some stuff, then left. The “some stuff” is a bunch of pageviews and events. The “somewhere” here is a traffic source, for example if they came from a Google search.

Create a lookup table on domain_sessionid

This model pulls out the various parameters from the first referrer, uses the Campaign attribution enrichment fields and also has a monster CASE statement classifying traffic sources on a number of dimensions but mostly using first_refr_urlhost.

The giant CASE statement defining channel is probably the piece you want to most dig into and play with for your purposes. You’ll notice that I classify a first_refr_urlhost of NULL as Unknown in contrast to Direct or Typed/Bookmarked. This is a very conscious decision as Direct and Typed/Bookmarked give the invalid impression that there is some certainty about the source when in fact we just don’t know where the session came from.

DROP TABLE IF EXISTS lookup_sessions;
CREATE TABLE lookup_sessions
  DISTKEY(1)
  SORTKEY(1)
  AS (
    WITH sessions AS
    (SELECT DISTINCT
      collector_tstamp,
      domain_sessionid,
      refr_urlhost,
      page_referrer,
      mkt_campaign,
      refr_urlscheme,
      refr_urlhost,
      refr_urlport,
      refr_urlpath,
      refr_urlquery,
      refr_urlfragment,
      refr_medium,
      refr_source,
      refr_term,
      mkt_content,
      mkt_medium,
      mkt_source,
      page_urlpath,
      FIRST_VALUE(refr_urlhost)
      OVER (
        PARTITION BY domain_sessionid
        ORDER BY domain_sessionid, collector_tstamp
        rows between unbounded preceding and unbounded following ) AS first_refr_urlhost,
       FIRST_VALUE(page_referrer)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_page_referrer,
       FIRST_VALUE(refr_urlscheme)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlscheme,
       FIRST_VALUE(refr_urlport)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlport,
       FIRST_VALUE(refr_urlpath)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlpath,
       FIRST_VALUE(refr_urlquery)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlquery,
       FIRST_VALUE(refr_urlfragment)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_urlfragment,
       FIRST_VALUE(refr_medium)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_medium,
       FIRST_VALUE(refr_source)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_source,
       FIRST_VALUE(refr_term)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_refr_term,
       FIRST_VALUE(mkt_campaign)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_mkt_campaign,
       FIRST_VALUE(mkt_content)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_mkt_content,
       FIRST_VALUE(mkt_medium)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_mkt_medium,
       FIRST_VALUE(mkt_source)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_mkt_source,
       FIRST_VALUE(page_urlpath)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS first_page_urlpath,
       LAST_VALUE(page_urlpath)
       OVER (
         PARTITION BY domain_sessionid
         ORDER BY domain_sessionid, collector_tstamp
         rows between unbounded preceding and unbounded following ) AS last_page_urlpath
      FROM atomic.events
      WHERE app_id IN ('snowflake-analytics.com')
      ORDER BY
        1 DESC)
    SELECT DISTINCT
      domain_sessionid,
      first_refr_urlhost,
      first_page_referrer,
      first_refr_urlscheme,
      first_refr_urlport,
      first_refr_urlpath,
      first_refr_urlquery,
      first_refr_urlfragment,
      first_refr_medium,
      first_refr_source,
      first_refr_term,
      first_mkt_campaign,
      first_mkt_content,
      first_mkt_medium,
      first_mkt_source,
      first_page_urlpath,
      last_page_urlpath,
      CASE
        WHEN first_refr_urlhost LIKE '%snowflake-analytics.com'
          THEN 'internal'
        WHEN first_refr_urlhost = 'localhost'
          THEN 'localhost'
        WHEN first_refr_urlhost = '127.0.0.1'
          THEN 'localhost'

        -- Google
        WHEN first_mkt_source = 'adwords'
          THEN 'Adwords'
        WHEN first_refr_urlhost LIKE 'www.google.%'
          THEN 'Google'
        WHEN first_refr_urlhost = 'com.google.android.googlequicksearchbox'
          THEN 'Google'
        WHEN first_refr_urlhost = 'encrypted.google.com'
          THEN 'Google'
        WHEN first_refr_urlhost LIKE '%doubleclick.net'
          THEN 'Google Doubleclick'
        WHEN first_refr_urlhost = 'plus.google.com'
          THEN 'Google Plus'
        WHEN first_refr_urlhost = 'plus.url.google.com'
          THEN 'Google Plus'
        WHEN first_refr_urlhost = 'www.googleadservices.com'
          THEN 'Google Ad Services'
        WHEN first_refr_urlhost = 'cse.google.com'
          THEN 'Google Custom Search'

        -- Email platforms
        WHEN first_refr_urlhost = 'outlook.live.com'
          THEN 'Email'
        WHEN first_refr_urlhost = 'mailchi.mp'
          THEN 'Email'
        WHEN first_refr_urlhost = 'mail.google.com'
          THEN 'Email'
        WHEN first_refr_urlhost = 'com.google.android.gm'
          THEN 'Email'

        WHEN first_refr_urlhost LIKE '%ecosia.org'
          THEN 'Ecosia Search'
        WHEN first_refr_urlhost LIKE '%search.yahoo.com'
          THEN 'Yahoo Search'
        WHEN first_refr_urlhost = 't.co'
          THEN 'Twitter'
        WHEN first_refr_urlhost LIKE '%bing.com'
          THEN 'Bing'
        WHEN first_refr_urlhost = 'discourse.snowplowanalytics.com'
          THEN 'Discourse'
        WHEN first_refr_urlhost LIKE '%snowplowanalytics.com'
          THEN 'Snowplow'
        WHEN first_refr_urlhost = 'yandex.ru'
          THEN 'Yandex'
        WHEN first_refr_urlhost = 'www.linkedin.com'
          THEN 'LinkedIn'
        WHEN first_refr_urlhost = 'com.linkedin.android'
          THEN 'LinkedIn'
        WHEN first_refr_urlhost = 'lnkd.in'
          THEN 'LinkedIn'
        WHEN first_refr_urlhost = 'chrome.google.com'
          THEN 'Chrome'
        WHEN first_refr_urlhost = 'duckduckgo.com'
          THEN 'DuckDuckGo'
        WHEN first_refr_urlhost LIKE '%measurecamp.org'
          THEN 'MeasureCamp'
        WHEN first_refr_urlhost LIKE '%meetup.com'
          THEN 'Meetup'
        WHEN first_refr_urlhost LIKE '%wawsydney.com'
          THEN 'WAW Sydney'
        WHEN first_page_referrer LIKE '%meetup.com/Web-Analytics-Wednesday-Sydney%'
          THEN 'WAW Sydney'
        WHEN first_refr_urlhost LIKE '%wawmelbourne.com'
          THEN 'WAW Melbourne'
        WHEN first_page_referrer LIKE '%meetup.com/Web-Analytics-Wednesdays-Melbourne%'
          THEN 'WAW Melbourne'
        WHEN first_refr_urlhost LIKE '%meetup.com'
          THEN 'Meetup'
        WHEN first_refr_urlhost = 'com.Slack'
          THEN 'Slack'
        WHEN first_refr_urlhost LIKE '%reddit.com'
          THEN 'Reddit'
        WHEN first_refr_urlhost LIKE '%facebook.com'
          THEN 'Facebook'
        WHEN first_refr_urlhost LIKE '%baidu.com'
          THEN 'Baidu'
        WHEN first_refr_urlhost LIKE '%github.com'
          THEN 'Github'
        WHEN first_refr_urlhost LIKE '%getpocket.com'
          THEN 'Pocket'
        WHEN first_refr_urlhost LIKE '%evernote.com'
          THEN 'Evernote'
        WHEN first_refr_urlhost IS NULL
          THEN 'Unknown'
        ELSE 'other'
        END AS channel
      FROM sessions
      ORDER BY channel
  )

Reporting

To build the report shown above, we’re just looking at the channel categorizations and the associated sessions. Your app_id would be different.

-- Landing pages
SELECT se.channel, COUNT(DISTINCT ev.domain_sessionid) AS sessions
FROM atomic.events ev, lookup_sessions se
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '7 days'
      AND EVENT = 'page_view'
  AND ev.domain_sessionid= se.domain_sessionid
  AND ev.app_id = 'snowflake-analytics.com'
GROUP BY 1
ORDER BY 2 DESC

A more advanced report breaks out traffic sources by landing pages, showing source then the first page in the session. Your app_id would be different.

SELECT se.channel,  se.first_page_urlpath, COUNT(DISTINCT ev.domain_sessionid) AS sessions, COUNT(*) AS pv
FROM atomic.events ev, lookup_sessions se
WHERE CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '7 days'
      AND EVENT = 'page_view'
  AND ev.domain_sessionid= se.domain_sessionid
  AND ev.app_id = 'snowflake-analytics.com'
GROUP BY 1, 2
ORDER BY 3 DESC

Next steps

In the next installation we’ll look at user data. The most interesting component of that model is the ability to identify users retrospectively and across devices.

Continue to Part 4: Users

About

We exist to make organisations better understand their businesses by enabling all decision makers in a company to work with the same version of the truth.

Social Links