How does Snowplow Analytics compare to other vendors?

Aug 9, 2017. | By: Simon Rumble

Tonight Snowflake Analytics team members Mike and Narbeh are debating the merits of Snowplow Analytics with representatives of Google Analytics and Adobe Analytics at Web Analytics Wednesday. The head-to-head aspect of it is meant to be lighthearted, but it’s forced us to think about some of the ways Snowplow Analytics is a better match for many types of digital analytics problems.

So without further ado, here’s a rundown of features where Snowplow has unique qualities over Google and Adobe Analytics.

| Feature | Snowplow Insights | Google Analytics | Adobe Analytics |
| --- | --- | --- | --- |
| Real-time | Yes | Some | Some |
| Unsampled data and performance | Yes | | |
| Custom structured data models | Yes | | |
| Personally Identifiable Information | Yes | | |
| SQL interface | Yes | | |
| Cross-domain tracking and user identification | Yes | | |
| Data sovereignty | Yes | | |
| Change goals with historical data | Yes | | |
| Accurate time spent metric | Yes | | |
| Custom enrichments | Yes | | |
| Incoming Webhooks | Yes | | |
| High-cardinality | Yes | (other) | (Low-Traffic) |
| IP addresses and user-agent strings | Yes | | |

Real-time

Snowplow Analytics can operate in real-time mode with latency down to about 5 seconds from collection to availability. Google and Adobe Analytics offer this with a subset of dimensions and events, but Snowplow gives you access to the entire enriched event stream, including all the richness of custom structured data models. See also: Custom structured data models.

Unsampled data and performance

Because Snowplow runs in your own cloud infrastructure, you have control over the inevitable price-performance tradeoffs involved in collecting, processing and analysing large event volumes. You can decide to have full, unsampled data available speedily by throwing more infrastructure at the problem, or have it available less quickly to reduce cost. See also: High cardinality.

Custom structured data models

Snowplow’s event model lets customers take their existing data models and attach them to events as context, without transforming them first. Other tools force you to shoehorn that data into a flat dimension/metric model.

This is perhaps best illustrated with an example. Let’s say you want to record an action, say a page print, for a recipe on your content site.

You could record this as an event in Google Analytics, but the only context you can provide comes from the URL of the page itself, and potentially some custom dimensions. Custom dimensions can’t hold multiple values, so you can only model something like an ingredients list as a string, which gets really ugly once you come to report on it. And Google Analytics truncates any custom dimension value over 150 bytes. Nice!

Adobe Analytics has the concept of List Props and List Vars which would enable this, but you have to shoehorn the result into Adobe’s model. That means no nesting of things, like a list containing other items like recommended recipes.

By contrast, here’s how you would attach a self-describing context for recipes onto a Structured Event in a Snowplow implementation. Note how the model is readable and looks just like how you might store it in your CMS. These contexts can be attached to any kind of event you like, including things like pageviews and social shares.

window.snowplow('trackStructEvent', 'article', 'print', 'Chicken Chow Mein recipe', null, null,
  [
    {
      schema: 'iglu:com.snowflake-analytics/recipe/jsonschema/1-0-1',
      data: {
        title: 'Chicken Chow Mein recipe',
        author: ['John Smith', 'Mary Jones'],
        category: 'recipes',
        cuisine: 'Chinese',
        course: 'main',
        ingredients: ['chicken', 'noodles', 'vegetables'],
        content_tags: ['chinese recipes', 'chicken', 'quick meals']
      }
    }
  ]
);

Personally Identifiable Information

Google take a particularly hard line on Personally Identifiable Information. You can’t send it, and if they find it they’ll delete all your data. Ouch.

Adobe are a bit more lenient but still cautious.

With Snowplow, the data is collected and stored in your own cloud infrastructure. It’s still best practice to be quite careful with PII, and particularly with sensitive information, but that boundary line is up to the customer. That means if PII ends up inside your Snowplow data, it can be controlled under your own policies, not those of a third-party organisation. See also: Data sovereignty.

SQL interface

Snowplow’s data shows up in a database with an SQL interface, running in your own cloud infrastructure. You can connect it with your own systems, create joins with other tables, point your existing BI teams at it and generally use it like it’s your own data, because it is.

You also get to keep the raw event data, unprocessed and ready to re-process or re-examine with tools like AWS Athena or Presto.
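
To make that concrete, here’s a minimal sketch of the kind of query this enables: joining Snowplow events to one of your own tables. The crm.customers table and its columns are hypothetical; the atomic.events columns are the Snowplow defaults, so adjust to your schema.

SELECT
  c.customer_segment,                               -- a column from your own CRM table
  COUNT(DISTINCT e.domain_sessionid) AS sessions    -- Snowplow session identifier
FROM atomic.events e
JOIN crm.customers c
  ON c.customer_id = e.user_id                      -- user_id set by your tracker at login
WHERE e.event = 'page_view'
  AND e.collector_tstamp >= DATEADD(day, -30, GETDATE())
GROUP BY 1
ORDER BY sessions DESC;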

Adobe Analytics will provide the raw data for you to ingest into a database, and these days they even give you the header row for that table, but it’s up to you to make sense of the data, stitch together visits and the like.

Google Analytics data, with the premium product, can be pushed automatically to BigQuery, which is great. BigQuery is a very capable SQL data analysis tool, and support for it is becoming much more common than in the past, so Google probably gets a pass on this, although the latency for data to appear is currently around four hours, while Snowplow can deliver hourly or faster. See also: Change goals with historical data.

Cross-domain tracking and user identification

Snowplow gives you access to all the identifiers: first-party domain cookie, third-party domain cookie, all the ingredients for fingerprinting, IDFA, IDFV, AdId and as many versions of login information as you have. You can then make up your own mind about how you stitch together user identities and sessions.
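
As an illustration only (the right identity rules are yours to define), here’s a minimal sketch of one possible stitching approach in SQL: map each first-party cookie ID to the most recent logged-in user_id seen for it. Column names are the Snowplow defaults.

-- One possible stitching rule; a sketch, not a recommendation.
SELECT DISTINCT
  domain_userid,                                    -- first-party cookie ID
  LAST_VALUE(user_id IGNORE NULLS) OVER (
    PARTITION BY domain_userid
    ORDER BY collector_tstamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS stitched_user_id                             -- latest login ID seen for that cookie
FROM atomic.events;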

Adobe Analytics does third-party domain cookies with fallback to first-party domain cookies. Google Analytics has a feature with extremely narrow utility (basically, if you have a hosted checkout, it might be useful). See also: IP addresses and user-agent strings.

Data sovereignty

This is simple: you get to decide where your data lives, and the security and access policies. Adobe will charge you extra to store your data in some specific jurisdictions, but you have to trust them on the security and access policies. Google, well it’s in “the cloud”. See also: Personally Identifiable Information.

Change goals with historical data

You’ve got full control of your data, so you can recrunch the numbers any time. It might be a big, slow query, but you can do it. That means if you decide to add an OR condition to your goal and funnel after you’ve created them, it’s totally doable.
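
For example, here’s a minimal sketch of recomputing a conversion goal over all historical data after widening its definition with an OR condition. The page paths are hypothetical; the columns are the Snowplow defaults.

SELECT
  DATE_TRUNC('week', collector_tstamp) AS week,
  COUNT(DISTINCT domain_sessionid) AS converting_sessions
FROM atomic.events
WHERE event = 'page_view'
  AND (page_urlpath = '/signup/thanks'              -- original goal definition
       OR page_urlpath = '/contact/thanks')         -- condition added after the fact
GROUP BY 1
ORDER BY 1;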

The Google Analytics premium offering has a beta dynamic funnels report, but those don’t show up as conversions.

Accurate time spent metric

The way Google Analytics and Adobe Analytics (and Nielsen and lots of others) calculate the time spent metric is badly flawed, and it has got even worse with modern traffic patterns. Snowplow is better. I’ve dealt with this in detail here.

Custom enrichments

Snowplow allows you to query external sources to automatically enrich your data as it passes through the pipeline. For example, you could look up the current exchange rate from your actual bank to record alongside a transaction’s local value.

Incoming Webhooks

Real-time integrations with external systems can be done through Webhooks, meaning your Snowplow pipeline can have immediate access to things like email opens, logistics state changes, incoming phone calls and the like.

High-cardinality

There is no (other) or (Low-Traffic) bucket in Snowplow reports: every value is kept at full cardinality. See also: Unsampled data and performance.
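
A quick sketch of what that means in practice: group by any dimension and every distinct value comes back, with no catch-all bucket hiding the long tail. Column names are the standard atomic.events ones.

SELECT
  page_urlpath,              -- every distinct path is returned, however rare
  COUNT(*) AS pageviews
FROM atomic.events
WHERE event = 'page_view'
GROUP BY 1
ORDER BY pageviews DESC;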

IP addresses and user-agent strings

All the raw materials to be able to track users, do custom fingerprinting and analyse platforms are available. Other tools throw away these raw materials. We have customers looking at fraud patterns based on IP addresses. In the past we’ve looked for dark social traffic by finding the Facebook in-app browsers that aren’t sending through accurate Referrer headers. (Hint: look for “FBAN” in the user-agent). See also: Cross-domain tracking and user identification.


Make big data small again with Redshift ZSTD compression

Jul 12, 2017. | By: Mike Robins

A new compression option in Redshift allows you to make big storage savings, up to two-thirds in our tests, over the standard Snowplow setup. This guide shows how it works and how to get it happening.

In late 2016 Facebook open sourced a compression algorithm known as Zstandard that combines Lempel Ziv and tANS to achieve a compression ratio better than many algorithms with a slight tradeoff in speed.

In January 2017 AWS added ZSTD support to Redshift, so you can now use this powerful compression algorithm to reduce the storage size of your data. One immense perk of ZSTD is that it can be applied across all supported data types. It gives exceptional performance on long varchars: perfectly suited for large JSON strings that regularly appear in shredded tables.

We’ve had great success both experimentally and practically by applying ZSTD compression to multiple Snowplow tables. Experimentally, across datasets ranging between 10 million and 5 billion rows, we’ve achieved a mean compression ratio of ~3, meaning the newly compressed table takes up approximately a third of the space on disk compared to the compression defaults in atomic.events 0.8.0.

In particular we’ve found:

  • Negligible impact on speed of queries for atomic.events (note this will be highly workload dependent so benchmarks here are less useful)
  • So far we haven’t experimented with the performance/space savings of compressing SORTKEYs. AWS recommend against compressing SORTKEYs, so use ENCODE RAW for collector_tstamp and root_tstamp (as sketched after this list).
  • ZSTD in almost all instances replaces LZO as the default compression method suggested by ANALYZE COMPRESSION.
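
To illustrate, here’s a minimal, hypothetical sketch of a replacement table with ZSTD on most columns and the sort key left as ENCODE RAW. This is deliberately not the full atomic.events 0.8.0 DDL; in practice you would take the complete DDL and change only the ENCODE clauses.

CREATE TABLE atomic.events_new (
  event_id          CHAR(36)      ENCODE ZSTD,
  collector_tstamp  TIMESTAMP     ENCODE RAW,   -- sort key: leave uncompressed
  event             VARCHAR(128)  ENCODE ZSTD,
  page_url          VARCHAR(4096) ENCODE ZSTD,  -- long varchars compress especially well
  useragent         VARCHAR(1000) ENCODE ZSTD
)
DISTSTYLE KEY
DISTKEY (event_id)
SORTKEY (collector_tstamp);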

Things to take into account

  • Depending on your workload you may want to run ANALYZE COMPRESSION on your table, which will return Redshift’s recommended column encodings. One caveat of this approach is that you are limited to sampling 1 billion rows, so if possible choose a sample data set that contains representative variability within columns.
  • If you have the opportunity we also recommend benchmarking common queries/jobs on an identical sample of data for a) the 0.8.0 compression defaults and b) the newly compressed tables.
  • You cannot modify compression on existing columns in a table, so consider deep copying the data, particularly if a large region of your table is unsorted (query 1 below), as this will outperform a VACUUM; see the sketch after this list. If performing a deep copy, ensure you have sufficient disk space to complete the action: it’s difficult to know how much space is required, but we opt for at least 50% of the cluster storage remaining. You may need to consider resizing the cluster and/or temporarily pausing your pipeline to complete this action.
  • If you decide to drop the original atomic.events table ensure that you either DROP CASCADE or individually drop any dependencies that may rely on this table such as views or foreign key constraints from shredded or derived tables.
  • If you’ve dropped any foreign keys ensure that you recreate them and that they reference the new atomic.events object. Although Redshift does not enforce these constraints, they are used by the query planner to generate optimised query plans.
  • As this is a new object you may also need to regrant permissions to users/groups as well as ensure the table owner is identical to the original table(s) so that Snowplow can continue to load data.
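
Putting those points together, here’s a minimal sketch of the core statements, assuming atomic.events_new has already been created from the full atomic.events DDL with the revised encodings:

ANALYZE COMPRESSION atomic.events;                  -- Redshift's suggested column encodings

INSERT INTO atomic.events_new                       -- deep copy, written in sort-key order
SELECT * FROM atomic.events
ORDER BY collector_tstamp;

SELECT COUNT(*) FROM atomic.events;                 -- verify the row counts match
SELECT COUNT(*) FROM atomic.events_new;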

Getting started

  1. Analyze compression on your existing table by using ANALYZE COMPRESSION.
  2. If Redshift recommends using ZSTD for columns, consider benchmarking a sample of data (e.g., 3 months) with the original column compression and with ZSTD column compression.
  3. If performance is better or equivalent, deep copy data from the original atomic.events table to a new atomic.events_new table. You may need to resize Redshift or free up additional disk space before performing this action.
  4. Verify that the two tables contain identical data by comparing a total row count as well as a row count per day.
  5. Drop the original table as well as any references (query 2 below). Rename the new table to atomic.events (query 3 below). Ensure that this new table has an identical owner.
  6. Sit back and enjoy that free disk space.

Need help?

There’s a few gotchas and edge cases involved with doing this. If you’d like help or advice on how to do it feel free to post in the comments below or get in touch.

Queries

Query 1

 SELECT
 "database",
 "table",
 diststyle,
 sortkey1,
 sortkey1_enc,
 size,
 pct_used,
 unsorted,
 stats_off,
 tbl_rows
FROM
 svv_table_info
WHERE
 schema = 'atomic'
ORDER BY
 size DESC

Query 2

SELECT DISTINCT c_p.oid AS tbloid
,n_p.nspname AS schemaname
,c_p.relname AS NAME
,n_c.nspname AS refbyschemaname
,c_c.relname AS refbyname
,c_c.oid AS viewoid
FROM pg_class c_p
JOIN pg_depend d_p ON c_p.relfilenode = d_p.refobjid
JOIN pg_depend d_c ON d_p.objid = d_c.objid
JOIN pg_class c_c ON d_c.refobjid = c_c.relfilenode
LEFT JOIN pg_namespace n_p ON c_p.relnamespace = n_p.oid
LEFT JOIN pg_namespace n_c ON c_c.relnamespace = n_c.oid
WHERE
 n_p.nspname = 'atomic'

Query 3

ALTER TABLE atomic.events_new RENAME TO events;


Snowplow Inspector: a debug tool for Snowplow beacons

Jun 20, 2017. | By: Simon Rumble

Snowplow Insights is an amazingly flexible way to collect data, but with great flexibility comes some complexity. If you work on Snowplow implementations a lot, you’re likely familiar with Base 64 Decode and JSON Formatter when you’re digging into custom contexts and testing your implementation.

Snowplow Inspector

We wanted a tool to make debugging easier so we created Snowplow Inspector for the Chrome browser. It lives in the Developer Tools (☰ > More Tools > Developer Tools or Ctrl-Shift-I) under its own Snowplow tab. Here’s how it looks on the Snowplow site:

Snowplow Inspector screenshot

On the left-hand side of the panel you see the detected Snowplow pixels for each site you’re loading. Click on a pixel to see its details, including the custom contexts and unstructured events decoded and formatted to be readable.

Install the extension

To install the extension, go to the Chrome store and click Add to Chrome.

Feedback and changes

The extension is released under the GNU General Public License, so you’re free to fork it and make changes. Check out the Github repo. If you find any bugs or have any feature requests, create a Github issue.


Accurate time spent: A killer feature of Snowplow Analytics

Mar 7, 2017. | By: Simon Rumble

Web analytics tools commonly have a Time Spent metric. Understanding how long people have spent reading a page is a really valuable thing for some businesses. For publishers, the quality of engagement with content is vital, given they’re effectively selling the attention of their readers.

What many people don’t realise is that the way this metric is calculated is critically flawed on the most popular web analytics tools, Adobe Analytics and Google Analytics. These tools calculate Time Spent by counting the time between pageview events.

How is Time Spent calculated?

Snowplow Analytics time spent metric

In this example we see a user who:

  • Lands on a page and reads it for 30 seconds
  • Moves to another page and reads it for 45 seconds
  • Moves to another page and reads it for 90 seconds
  • Leaves the site

So the actual Time Spent is 2 minutes 45 seconds. However, Google Analytics and Adobe Analytics will allocate zero seconds to the final pageview, because these tools only work from the difference in timestamps between pageview events and there is no pageview event following the final page. So Google Analytics and Adobe Analytics will record 1 minute 15 seconds for this session.

The pathological example of this problem is for a single page session:

General Analytics time spent metric done poorly

In this example, the user enters the site, views the content for 30 seconds and then leaves. Traditional web analytics tools will record zero seconds against this session as there is only the single pageview event.

Many publishers now receive a huge amount of traffic in the form of single-page visits, primarily coming from aggregators and social networks. This means despite the fact your content may be receiving significant attention, your analytics will be showing very low Time Spent and a high bounce rate.

How Snowplow does this

How Snowplow Analytics tracks this

Snowplow’s JavaScript tracker contains an option called Activity Tracking which enables a Page Ping that will be sent every n seconds. In this diagram I’ve used 5 seconds as the ping interval starting 5 seconds after the page loads:

snowplow('enableActivityTracking', 5, 5);
snowplow('trackPageView');

So once the page loads, a ping is sent every five seconds recording that the page is still open, which lets us calculate the actual time a user spent to within about five seconds.

Other tools

For tools that use the traditional mechanism of measuring Time Spent, there are some workarounds to get better numbers, though none are ideal. The biggest problem is that there is no reliable mechanism to ensure a pixel is sent when a user leaves your site.


CHARTBEAT

Chartbeat uses a similar approach for data collection, although they attempt to measure actual engagement by monitoring user activity in the window as well. It would be interesting to apply this approach to the Snowplow page ping. Shouldn’t be too hard to update the tracker to support this.


GOOGLE ANALYTICS

  • The Riveted plugin, as described here, tracks actual user engagement.
  • Simo Ahava, always worth following, has a few different approaches to the problem.


ADOBE ANALYTICS

  • Adobe’s calculation method has changed a couple of times, so be sure you understand that.
  • Some have attempted to trap clicks on exit links and the unload event in the browser and then send a custom link to give the tool another event to extend the session for the last pageview, but these methods aren’t reliable across browsers and platforms.
  • I have a conceptual approach of how to do this very accurately in Adobe Analytics without costing enormous amounts. Drop me a note if you’d like to know more.


How to analyse?

Page pings create a large number of event rows inside your Snowplow database, consistent with the way Snowplow does things. That opens up a bunch of different ways for you to calculate the Time Spent metric. In the next blog, Mike will go through different approaches for modelling the page ping data to analyse Time Spent.
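
As a taster before that post, here’s a minimal sketch of one naive approach: count the page pings per page per session and multiply by the ping interval (5 seconds, matching the configuration above). Column names are the Snowplow defaults; adjust for your own model.

SELECT
  domain_sessionid,
  page_urlpath,
  COUNT(CASE WHEN event = 'page_ping' THEN 1 END) * 5 AS approx_seconds_on_page
FROM atomic.events
WHERE event IN ('page_view', 'page_ping')
  AND collector_tstamp >= DATEADD(day, -7, GETDATE())
GROUP BY 1, 2
ORDER BY approx_seconds_on_page DESC;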

Until then you can check out Snowplow’s Measuring content page performance article.


Decoding Snowplow real-time bad rows (Thrift)

Dec 14, 2016. | By: Mike Robins

In this tutorial we’ll look at decoding the bad rows data that comes out of Snowplow real-time. In the real-time pipeline, bad rows inserted into Elasticsearch (and S3) are stored as base64-encoded, binary-serialized Thrift records. We’ll walk through, step by step, how to first decode and then deserialize these records in Python.

You can find the code for this tutorial without the accompanying text on Github.

Before we start you’ll need 3 things.

  1. A copy of Python 3
  2. The thriftpy library which can be installed using pip install thriftpy
  3. A local copy of the Snowplow Analytics collector payload thrift file

If you’re running the real-time pipeline you can take a sample payload from your Elasticsearch index (the line property of a record in the bad rows index), or use the sample event I’ve generated below.

import base64
import thriftpy
from thriftpy.protocol import TCyBinaryProtocolFactory
from thriftpy.utils import deserialize, serialize

First up we’ll import the libraries we need. base64 is part of the standard library and the thriftpy imports are from the thriftpy library which has been installed above. Let’s define our example payload below.

sample_payload = "CwFAAAAAAi9pCwBkAAAACTEyNy4wLjAuMQoAyAAAAVjbnjdoC3ppAAAAQWlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLnNub3dwbG93L0NvbGxlY3RvclBheWxvYWQvdGhyaWZ0LzEtMC0wCwFKAAABaHN0bT0xNDgxMTUzMzI5MDAwJmU9cHYmdXJsPWh0dHAlM0ElMkYlMkZzbm93Zmxha2UtYW5hbHl0aWNzLmNvbSZ0dj1qcy0yLjYuMCZ0bmE9anMtMy42LjAmYWlkPXNub3dmbGFrZSZwPXdlYiZ0ej1BdXN0cmFsaWElMkZTeWRuZXkmbGFuZz1lbi1BVSZjcz1VVEYtOCZyZXM9MzYweDY0MCZjZD0zMiZjb29raWU9MSZlaWQ9YzI1OWMyNWUtZjk0Yi00ZDJjLWExMWMtMGQyNzhjMmU2ZDFhJmR0bT0xNDc5OTI3ODU3MjAxJnZwPTB4LTU2JmRzPTIwMHgyNjI5NSZ2aWQ9NCZzaWQ9N2ZiOTdmQzYtNmUwZi00MDIyLWFkYmQtMDE3NDMxNTIwZGRiJmR1aWQ9NGQxMGQzZDAtYzJiNC00NzNlLWE0ODMtODEyNzk5ZTgyNGQxJmZwPTEyOTExMjMzMgsBLAAAAG1Nb3ppbGxhLzUuMCAoV2luZG93cyBOVCAxMC4wOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzU0LjAuMjg0MC43MSBTYWZhcmkvNTM3LjM2CwGQAAAAIWNvbGxlY3Rvci5zbm93Zmxha2UtYW5hbHl0aWNzLmNvbQsA0gAAAAVVVEYtOAsBNgAAAB9odHRwczovL3Nub3dmbGFrZS1hbmFseXRpY3MuY29tCwGaAAAAJDRkMTBkM2QwLWMyYjQtNDczZS1hNDgzLTgxMjc5OWU4MjRkMQsA3AAAABFzc2MtMC43LjAta2luZXNpcw8BXgsAAAALAAAAJ0hvc3Q6IGNvbGxlY3Rvci5zbm93Zmxha2UtYW5hbHl0aWNzLmNvbQAAAB1BY2NlcHQ6IGltYWdlL3dlYnAsICovKjtxPTAuOAAAACRBY2NlcHQtRW5jb2Rpbmc6IGd6aXAsIGRlZmxhdGUsIHNkY2gAAAA3QWNjZXB0LUxhbmd1YWdlOiBlbi1BVSwgZW47cT0wLjgsIGVuLVVTO3E9MC42LCBlbjtxPTAuNAAAABRDb29raWU6IHNwPWFiY2QtMTIzNAAAACdSZWZlcmVyOiBodHRwOi8vc25vd2ZsYWtlLWFuYWx5dGljcy5jb20AAAB6VXNlci1BZ2VudDogIE1vemlsbGEvNS4wIChXaW5kb3dzIE5UIDEwLjA7IFdPVzY0KSBBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBDaHJvbWUvNTQuMC4yODQwLjcxIFNhZmFyaS81MzcuMzYAAAAaWC1Gb3J3YXJkZWQtRm9yOiAxMjcuMC4wLjEAAAAVWC1Gb3J3YXJkZWQtUG9ydDogNDQzAAAAGFgtRm9yd2FyZGVkLVByb3RvOiBodHRwcwAAABZDb25uZWN0aW9uOiBrZWVwLWFsaXZlAA=="

Now let’s use the b64decode method to decode the payload.

decoded_payload = base64.b64decode(sample_payload)
print(decoded_payload)

b"\x0b\x01@\x00\x00\x00\x02/i\x0b\x00d\x00\x00\x00\t127.0.0.1\n\x00\xc8\x00\x00\x01X\xdb\x9e7h\x0bzi\x00\x00\x00Aiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0\x0b\x01J\x00\x00\x01hstm=1481153329000&e=pv&url=http%3A%2F%2Fsnowflake-analytics.com&tv=js-2.6.0&tna=js-3.6.0&aid=snowflake&p=web&tz=Australia%2FSydney&lang=en-AU&cs=UTF-8&res=360x640&cd=32&cookie=1&eid=c259c25e-f94b-4d2c-a11c-0d278c2e6d1a&dtm=1479927857201&vp=0x-56&ds=200x26295&vid=4&sid=7fb97fC6-6e0f-4022-adbd-017431520ddb&duid=4d10d3d0-c2b4-473e-a483-812799e824d1&fp=129112332\x0b\x01,\x00\x00\x00mMozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36\x0b\x01\x90\x00\x00\x00!collector.snowflake-analytics.com\x0b\x00\xd2\x00\x00\x00\x05UTF-8\x0b\x016\x00\x00\x00\x1fhttps://snowflake-analytics.com\x0b\x01\x9a\x00\x00\x00$4d10d3d0-c2b4-473e-a483-812799e824d1\x0b\x00\xdc\x00\x00\x00\x11ssc-0.7.0-kinesis\x0f\x01^\x0b\x00\x00\x00\x0b\x00\x00\x00'Host: collector.snowflake-analytics.com\x00\x00\x00\x1dAccept: image/webp, */*;q=0.8\x00\x00\x00$Accept-Encoding: gzip, deflate, sdch\x00\x00\x007Accept-Language: en-AU, en;q=0.8, en-US;q=0.6, en;q=0.4\x00\x00\x00\x14Cookie: sp=abcd-1234\x00\x00\x00'Referer: http://snowflake-analytics.com\x00\x00\x00zUser-Agent:  Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36\x00\x00\x00\x1aX-Forwarded-For: 127.0.0.1\x00\x00\x00\x15X-Forwarded-Port: 443\x00\x00\x00\x18X-Forwarded-Proto: https\x00\x00\x00\x16Connection: keep-alive\x00"

You’ll see that some fields are recognisable here. We can make out a query string and some headers but there’s a lot of data in between that makes this a bit difficult to read and parse. This is our binary serialized Thrift record that we’ll deserialize next.

collector = thriftpy.load("collector-payload.thrift")
collector_payload = collector.CollectorPayload()
raw_payload = deserialize(collector_payload, decoded_payload, TCyBinaryProtocolFactory())

In the cell above we’re doing quite a bit. First we load the Snowplow-provided .thrift file, which acts as a specification for both serializing and deserializing the data. Next we initialise a structure based on the CollectorPayload struct, and finally we deserialize our base64-decoded payload by providing the class, the decoded bytes and the protocol factory.

print(raw_payload)

CollectorPayload(querystring='stm=1481153329000&e=pv&url=http%3A%2F%2Fsnowflake-analytics.com&tv=js-2.6.0&tna=js-3.6.0&aid=snowflake&p=web&tz=Australia%2FSydney&lang=en-AU&cs=UTF-8&res=360x640&cd=32&cookie=1&eid=c259c25e-f94b-4d2c-a11c-0d278c2e6d1a&dtm=1479927857201&vp=0x-56&ds=200x26295&vid=4&sid=7fb97fC6-6e0f-4022-adbd-017431520ddb&duid=4d10d3d0-c2b4-473e-a483-812799e824d1&fp=129112332', collector='ssc-0.7.0-kinesis', ipAddress='127.0.0.1', timestamp=1481153329000, path='/i', contentType=None, refererUri='https://snowflake-analytics.com', schema='iglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0', encoding='UTF-8', hostname='collector.snowflake-analytics.com', userAgent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', body=None, headers=['Host: collector.snowflake-analytics.com', 'Accept: image/webp, */*;q=0.8', 'Accept-Encoding: gzip, deflate, sdch', 'Accept-Language: en-AU, en;q=0.8, en-US;q=0.6, en;q=0.4', 'Cookie: sp=abcd-1234', 'Referer: http://snowflake-analytics.com', 'User-Agent:  Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36', 'X-Forwarded-For: 127.0.0.1', 'X-Forwarded-Port: 443', 'X-Forwarded-Proto: https', 'Connection: keep-alive'], networkUserId='4d10d3d0-c2b4-473e-a483-812799e824d1')

Excellent! That’s looking a lot more readable already. If you haven’t already noticed, the properties of our CollectorPayload class match the fields defined in our collector-payload.thrift record. Let’s access some of them directly.

querystring = raw_payload.querystring
network_userid = raw_payload.networkUserId
print(querystring)

stm=1481153329000&e=pv&url=http%3A%2F%2Fsnowflake-analytics.com&tv=js-2.6.0&tna=js-3.6.0&aid=snowflake&p=web&tz=Australia%2FSydney&lang=en-AU&cs=UTF-8&res=360x640&cd=32&cookie=1&eid=c259c25e-f94b-4d2c-a11c-0d278c2e6d1a&dtm=1479927857201&vp=0x-56&ds=200x26295&vid=4&sid=7fb97fC6-6e0f-4022-adbd-017431520ddb&duid=4d10d3d0-c2b4-473e-a483-812799e824d1&fp=129112332

We can now access these properties directly but at this stage we’ve just got the raw querystring - you’ll recognise this as the network request that is sent to the collector when using the Snowplow Javascript library (tna=js-3.6.0). Let’s split this out so we can look at the individual key value pairs that are in the payload.

from urllib import parse
params = parse.parse_qs(querystring)
print(params)

{'aid': ['snowflake'],
 'cd': ['32'],
 'cookie': ['1'],
 'cs': ['UTF-8'],
 'ds': ['200x26295'],
 'dtm': ['1479927857201'],
 'duid': ['4d10d3d0-c2b4-473e-a483-812799e824d1'],
 'e': ['pv'],
 'eid': ['c259c25e-f94b-4d2c-a11c-0d278c2e6d1a'],
 'fp': ['129112332'],
 'lang': ['en-AU'],
 'p': ['web'],
 'res': ['360x640'],
 'sid': ['7fb97fC6-6e0f-4022-adbd-017431520ddb'],
 'stm': ['1481153329000'],
 'tna': ['js-3.6.0'],
 'tv': ['js-2.6.0'],
 'tz': ['Australia/Sydney'],
 'url': ['http://snowflake-analytics.com'],
 'vid': ['4'],
 'vp': ['0x-56']}

If you’re familiar with Python you’ll note that our parse_qs function returns a dictionary whose values are lists. We can access these reasonably easily; in this case let’s extract the app_id for the event.

app_id = params.get('aid')[0]
print(app_id)

snowflake

In the sample above we’re using the built-in get method (on a dictionary) to access the aid property rather than app_id; you can find a list of the mappings from the URL parameters to their friendly names here. The [0] accesses the first element in the list for the aid key. In our example each of these lists only contains one value, so we don’t need to worry about addressing additional values.

That concludes this tutorial. Please post below if you’ve got questions or if parts of the post require additional clarification; this is quite an involved process.

In our next tutorial we’ll cover how to scale this process from decoding a single bad row to decoding all the bad rows from a run, and how to loop that data back into Redshift so we can query it!


Monitoring Snowplow bad rows using Lambda and Cloudwatch

Oct 19, 2016. | By: Mike Robins

In this tutorial we’ll use AWS Lambda and Amazon Cloudwatch to set up monitoring for the number of bad rows that are inserted into Elasticsearch over a period of time. This allows us to set an alert on a threshold of bad rows, generating an email or notification when that threshold has been exceeded. Snowplow users on the real-time pipeline will find this most useful; however, users running batch loads can also adapt this monitoring.

We’ll use the following services throughout this tutorial so check first that they are available in your AWS region of choice:

  • Cloudwatch rules
  • Lambda

Below is an overview of what we’ll be setting up.

A Cloudwatch rule will trigger a Lambda function every 5 minutes. This Lambda function will query our Elasticsearch cluster, and return the number of rows in the bad index for the last 5 minutes. Finally, we’ll send this result to our own custom metric to allow for monitoring and alerting using Cloudwatch alarms.

Lambda execution role

First, let’s set up an execution role for our AWS Lambda function. This execution role will allow access to the appropriate resources, namely being able to submit a metric to Cloudwatch.

  1. Navigate to IAM in the AWS console.
  2. Select ‘Policy’ and ‘Create Policy’ in this interface.
  3. Select ‘Create Your Own Policy’
  4. Name the policy document and provide a description. Under the ‘Policy document’ we’ll use the following configuration.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        }
    ]
}

Lambda bad rows AWS IAM Policy

This will allow the Lambda function to log to Cloudwatch for debugging as well as pushing data for the bad rows metric we will define shortly.

Next, we’ll create a role for the Lambda function to use, and attach this policy.

  1. In IAM select ‘Role’ and ‘Create New Role’.
  2. Under ‘AWS Service Roles’ select ‘AWS Lambda’
  3. Search and select the policy you created above, review the policy and hit ‘Create Role’.

Creating our Lambda function to retrieve the number of bad rows

  1. In the AWS Lambda dashboard select ‘Create a Lambda function’
  2. Select the blueprint as ‘Blank function’ as we’ll be writing our own code for this function.
  3. For the moment, leave triggers blank and hit ‘Next’.
  4. Name your function, e.g., snowplow_bad_rows_count
  5. In this example we will use the ‘Python 2.7’ runtime.
  6. For the function code copy and paste the function below.
import json
import urllib2
import boto3
# import required libraries

def lambda_handler(event, context):
    host = 'your-hostname-here.com' # your elasticsearch host
    port = 9200 # your elasticsearch port
    index_name = 'bad-alias' # your elasticsearch bad index

    url = 'http://{0}:{1}/{2}/_count'.format(host, port, index_name)
    data = '{"query": {"range": {"failure_tstamp": {"gt": "now-5m"}}}}' # query to filter data

    request = urllib2.urlopen(url, data) # connect to Elasticsearch
    result = json.loads(request.read()) # read and decode JSON response
    bad_row_count = result.get('count') # select the count property
    request.close() # close the connection

    client = boto3.client('cloudwatch') # setup a cloudwatch client
    data = [
            {
                'MetricName': 'bad_rows_count',
                'Value': bad_row_count,
                'Unit': 'Count'
            }
            ]
    # prepare a payload to send with the name 'bad_rows_count'
    client.put_metric_data(Namespace='snowplow', MetricData=data) # send the data to Cloudwatch using the 'snowplow' namespace
    return bad_row_count # return the number of bad rows in the last 5 minutes

In the above snippet change the host variable to point to your Elasticsearch cluster. This is likely to be either a load balancer if you’re running Elasticsearch on EC2 instances, or the Elasticsearch endpoint if using AWS Elasticsearch Service.

You should also change port to the appropriate value, which is likely to be either 9200 (the default) or 80 if using the Elasticsearch Service or port-forwarding on a load balancer.

Finally, enter the name of the index. If you’re unsure what this is, add /_cat/indices to the end of your Elasticsearch URL to list all indices in your Elasticsearch cluster.

  1. To modify the period in which Elasticsearch looks for bad events change the 'now-5m' string to the period required, e.g., for 1 hour use 'now-1h'.
  2. To modify the name of the metric, change the 'MetricName' in the data payload
  3. To modify the event namespace, change the Namespace in the put_metric_data call.
  4. The handler field can be left the same.
  5. Under ‘Existing role’ select the role we created in the previous steps.
  6. The memory and timeout of advanced settings can be left the same.
  7. Select an appropriate VPC. If running an Elasticsearch cluster within a VPC you may need to use this VPC; however, if your cluster is accessible externally ‘No VPC’ may work.
  8. Review the function and select ‘Create function’.
  9. Let’s quickly test the function to ensure that it is able to connect to Elasticsearch successfully and produces the expected output.
  10. Select ‘Test’ in the interface, and ‘Hello World’ as the sample event template. The input data does not matter, as our function ignores the incoming event.
  11. Select ‘Save and test’
  12. Your function should successfully execute and return the number of bad rows in the Elasticsearch index selected. You can verify this using the Kibana interface if required.
  13. If the function times out, double check the address and settings for your Elasticsearch cluster, the function may not be able to connect.

Create our Cloudwatch trigger

  1. In the Cloudwatch interface, select ‘Rules’ under ‘Events’ from the left-hand pane.
  2. Select ‘Create Rule’
  3. For the event source select ‘Schedule’, and for the moment we’ll leave it as the default of fixed rate of every 5 minutes. This sets the option of when, and how often our Lambda function will run and our metric will update. More information about the cron syntax AWS uses can be found here.
  4. Under ‘Targets’, select ‘Lambda function’ and the name of the Lambda function you created above.
  5. Select ‘Configure details’
  6. Add a meaningful name and description for this rule and leave the enabled checkbox selected. Select ‘Create Rule’.

Check our metric is working in Cloudwatch

  1. In the Cloudwatch interface, scroll to the bottom of the left-hand panel and, under ‘Custom Metrics…’, select ‘snowplow’. Your metric name should appear here, and by ticking the checkbox you should see a data point every 5 minutes. If you’ve only just set this up you may only see one data point.

To create an alarm

  1. Select the metric and select the ‘Create Alarm’ button in the right hand panel under ‘Actions’.
  2. Set a meaningful name and description for the alarm, and a threshold to trigger at based upon the number of bad rows you expect to have in a 5 minute period. Anything above this threshold will trigger the alarm.
  3. Under Actions, you can select a notification list (if you have an SNS topic set up) or, alternatively, select ‘New list’ and enter a comma-delimited list of email addresses.
  4. This will notify users when the alarm is triggered based on the criteria you have set.

Cost

Running this monitoring will cost approximately $0.69 a month depending on your usage, frequency and availability zone. For many users some of this cost may be included in the AWS free tier.

  • $0.50 per custom metric
  • $0.10 per alarm
  • ~$0.09 for Cloudwatch PutMetricData (9000 requests/month, assuming rule triggers every 5 minutes)

If you have any questions or comments feel free to leave them in the comment section below, or in the Snowplow discourse forum.


About

We exist to help organisations better understand their businesses by enabling all decision makers in a company to work with the same version of the truth.


Our Bunker

Level 5, 104 Commonwealth Street
Surry Hills, NSW 2010
Australia
contact@snowflake-analytics.com
AU: 1300 971 915
US: 1 415 963 4782
Privacy