Modelling your Snowplow event data: Part 4 Users

Wednesday 29 August, 2018 | By: Simon Rumble

In the first three parts of this series, we looked at modelling out pageview events to include accurate time spent and scroll depth, then classifying sessions based on what we know about where the user came from. Now we’re going to look at what we know about users.

One of the best things about having raw behavioural event data in an SQL database is that you can update your knowledge about a user's behaviour after the behaviour is recorded. The classic case is a user logging in or otherwise identifying themselves: now we know the identity behind the previous behaviour. If your application has some way of identifying users, such as a required login or email click-throughs, you can merge a user's behaviour across multiple devices, such as a desktop browser and a mobile browser.
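As a sketch of what that merging might look like, assuming your tracker sets the user_id field (for example via setUserId) once users log in, you could map each cookie to the most recent known identity seen on it:

    -- Sketch of cross-device stitching: for each first-party cookie
    -- (domain_userid), take the most recent business user_id seen on it.
    -- Assumes your tracker populates user_id on login; adjust to taste.
    WITH identity_map AS (
        SELECT DISTINCT
          domain_userid,
          LAST_VALUE(user_id IGNORE NULLS)
          OVER (
            PARTITION BY domain_userid
            ORDER BY collector_tstamp ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) stitched_user_id
        FROM atomic.events
    )
    SELECT stitched_user_id,
           COUNT(DISTINCT domain_userid) AS browsers_seen
    FROM identity_map
    WHERE stitched_user_id IS NOT NULL
    GROUP BY 1;

Any identity with more than one browser seen is a candidate for merging behaviour across devices.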

In our company’s case we don’t have a good way to identify individual users. At some point we’ll start a newsletter and drip-feed nurturing process, which will allow us to pass identifiers into the click-through URLs. Until then, most of the information we can get comes from Clearbit data, which is created through a lookup of the IP address and takes a stab at guessing the company behind the browser.

Bug in sessions post

A quick aside: while writing this component on users I discovered a bug in the sessions piece. The part of my big CASE statement dealing with sessions that don’t have a referrer should be at the bottom of the statement; otherwise the clause looking for UTM querystring components (the mkt_* columns) will never execute for sessions without a referrer. UTM strings are specifically designed to handle situations where there is no referrer.

I’ve updated the post; if you want to find the change, look for the location of this piece of code:

        WHEN first_refr_urlhost IS NULL
          THEN 'Unknown'
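The corrected ordering looks something like this sketch (clauses and channel names abbreviated here; see the sessions post for the full statement):

    -- Sketch of the corrected clause ordering: the mkt_* checks must come
    -- before the no-referrer fallback, or they never fire for direct
    -- traffic that carries UTM parameters.
    CASE
      WHEN first_mkt_medium IS NOT NULL
        THEN first_mkt_medium          -- campaign-tagged traffic, referrer or not
      WHEN first_refr_urlhost IS NULL
        THEN 'Unknown'                 -- no referrer and no UTM tagging
      ELSE 'Other'
    END AS channel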

Create a lookup table on domain_userid

For this installment we create a lookup_users table keyed off the first-party cookie ID, which is stored in domain_userid.

Labelling internal traffic

The first component looks for any users who’ve behaved in a manner that might be considered “internal” traffic. For us that means they’ve sent in data with a page host of localhost or they’ve appeared from one of our internal IP addresses.

In larger companies you might also put a data collection pixel on your intranet homepage, internal systems and the like. Any browser that’s ever visited those locations is considered “internal”. You want to tackle labelling internal browsers from multiple angles to be sure you capture everything. It’s entirely likely that developers will use a browser for development and testing that never sees the intranet or internal systems, so you also want to look at IP ranges and any other signals you have. Also think about any automated tests you run and how you can label those.

Clearbit company details

The second component looks for the latest Clearbit data available for the user. This isn’t going to be perfect: I looked up my own domain_userid and saw a bunch of different places I’d been, including client sites and hotels, but it will give you some idea. More complex logic might score users with multiple Clearbit identifiers. Potentially you could just give a count of unique Clearbit companies for the user, to drill down when there’s uncertainty.
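That simpler approach might look like this sketch, counting distinct Clearbit companies per cookie as a rough uncertainty signal:

    -- Sketch: how many distinct Clearbit companies have been seen for each
    -- domain_userid? More than one or two suggests a roaming laptop
    -- (client sites, hotels, home) rather than a single office.
    SELECT
      domain_userid,
      COUNT(DISTINCT domain) AS clearbit_companies
    FROM atomic.events
      LEFT JOIN atomic.com_clearbit_enrichment_company_1
        ON event_id = root_id
           AND collector_tstamp = root_tstamp
    GROUP BY 1
    HAVING COUNT(DISTINCT domain) > 1
    ORDER BY 2 DESC;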

DROP TABLE IF EXISTS lookup_users;
CREATE TABLE lookup_users
  DISTKEY(domain_userid)
  SORTKEY(domain_userid)
  AS (

    WITH internal_users AS (
        SELECT DISTINCT
          domain_userid,
          TRUE user_internal
        FROM atomic.events
        WHERE page_urlhost IN ('127.0.0.1', 'localhost')
              OR
              user_ipaddress IN ('52.123.123.123', '150.123.123.123')
    ),
        clearbit_data AS (
          SELECT
            DISTINCT
            domain_userid,
            FIRST_VALUE(domain_sessionid)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) first_session,
            LAST_VALUE(domain_sessionid)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_session,
            LAST_VALUE(domain IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_domain,
            LAST_VALUE(name IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_name,
            LAST_VALUE(legal_name IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_legal_name,
            LAST_VALUE(description IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_description,
            LAST_VALUE(type IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_type,
            LAST_VALUE("linkedin.handle" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_linkedin_handle,
            LAST_VALUE(logo IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_logo,
            LAST_VALUE("metrics.alexa_global_rank" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_alexa_global_rank,
            LAST_VALUE("metrics.annual_revenue" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_annual_revenue,
            LAST_VALUE("metrics.employees" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_employees,
            LAST_VALUE("metrics.market_cap" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_metrics_market_cap,
            LAST_VALUE("crunchbase.handle" IGNORE NULLS)
            OVER (
              PARTITION BY domain_userid
              ORDER BY collector_tstamp ASC
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) last_crunchbase_handle

          FROM atomic.events
            LEFT JOIN atomic.com_clearbit_enrichment_company_1
              ON event_id = root_id
                 AND collector_tstamp = root_tstamp
          WHERE app_id = 'snowflake-analytics.com'
          ORDER BY collector_tstamp
      )
    SELECT *
    FROM clearbit_data
      LEFT JOIN internal_users
      USING (domain_userid)
  );

Pulling it all together

Here’s a simple query that shows recent site visitors ranked by unique browsers and time spent on the site, along with traffic source categories and any Clearbit identification information we’ve been able to get for the IP address. You can see the output below with the company details blurred out to protect the innocent. This query pulls together data from all the lookup tables we’ve created in this series: pageviews, sessions and users.

Example user activity report

SELECT
  ev.geo_country,
  ev.geo_city,
  NVL(us.last_domain, ev.user_ipaddress) AS last_domain,
  us.last_name,
  se.channel,
  SUM(pv.time_spent_seconds) AS time_spent_seconds,
  COUNT(*) AS pageviews,
  COUNT(DISTINCT ev.domain_sessionid) AS sessions,
  COUNT(DISTINCT ev.domain_userid) AS unique_browsers,
  ROUND(AVG(pv.scroll_depth_percent) * 100) AS avg_scroll_depth
FROM
  atomic.events ev,
  atomic.com_snowplowanalytics_snowplow_web_page_1 pg,
  lookup_pageview pv,
  lookup_sessions se,
  lookup_users us
WHERE
  ev.app_id = 'snowflake-analytics.com'
  AND ev.event = 'page_view'
  AND CONVERT_TIMEZONE('Australia/Sydney', ev.derived_tstamp) > CONVERT_TIMEZONE('Australia/Sydney', SYSDATE) - INTERVAL '7 days'
  AND ev.event_id = pg.root_id
  AND pg.id = pv.pageview_id
  AND ev.domain_sessionid = se.domain_sessionid
  AND ev.domain_userid = us.domain_userid
  AND us.user_internal IS NOT TRUE
GROUP BY 1, 2, 3, 4, 5
ORDER BY 9 DESC, 6 DESC;

Next steps

So we’ve now got techniques to build up a custom view of our audience behaviours using the three different scopes of interest. In the next post we’ll look at automating this so that it updates every time a new batch of data is added to Redshift, then some convenience views to make querying simple for everyday use and to expose to your end users.

Continue to Part 5: Automation
