Classifying browsers: Which visitors are real people?

Wednesday 28 November, 2018 | By: Simon Rumble

Key points

  • Browser classifications in atomic.events are deprecated (there’s now a better way)
  • Use the UA Parser enrichment to classify browsers
  • Use the IAB Bots & Spiders enrichment to find bad bots
  • You’ll still need to look at your own data carefully to remove bots that aren’t detected

Finding the robots

The Internet has a lot of robots crawling around. Some are well behaved: they follow the robots.txt rules for whether or not to crawl your site, and they identify themselves as robots in the User-Agent header sent with their requests.

Others are not so well behaved.

Most site owners want a reliable way to exclude robot traffic from analytics data.

This never used to be a problem with JavaScript-based tracking: crawlers simply didn’t execute JavaScript, so you could safely ignore them. Since at least 2014, crawlers have been updated to better mimic the real user experience, which means the JavaScript runs and your analytics records the hit. Plenty of simple crawlers will still never show up in your Snowplow data, but a good number will.

Stop using atomic.events browser data

Snowplow’s atomic.events table has a bunch of browser classification fields that are populated by the User Agent Utils enrichment. That enrichment is now deprecated and unmaintained, so it is out of date for the latest browsers. You should stop using the fields in atomic.events to classify browsers.

Use ua-parser-enrichment

The replacement for the User Agent Utils enrichment, ua-parser-enrichment, uses the open source ua-parser library, which is actively maintained. It works a little differently to the old model: instead of reading columns in atomic.events, you join a separate table to the core events table, as you would in any other custom data model.

With this enrichment you can exclude events where device_family is “Spider”, and perhaps go further and also exclude useragent_family values of “HeadlessChrome” and “PhantomJS”.
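
As a rough sketch of that exclusion, the query below keeps only events whose UA Parser context does not report one of those values. It assumes the UA Parser context table and the root_id / root_tstamp join keys used in the example query in this post; the app_id and date filters are placeholders to swap for your own.

```sql
-- Sketch: drop events that the UA Parser enrichment identifies as crawlers
-- or headless browsers. The app_id and date values are placeholders.
SELECT ev.*
FROM atomic.events AS ev
  LEFT JOIN atomic.com_snowplowanalytics_snowplow_ua_parser_context_1 AS ua
    ON ev.event_id = ua.root_id
    AND ev.collector_tstamp = ua.root_tstamp
WHERE
  ev.app_id = 'snowflake-analytics.com'
  AND ev.derived_tstamp > '2018-10-01'
  -- COALESCE keeps events with no UA Parser row; a bare <> would drop NULLs
  AND COALESCE(ua.device_family, '') <> 'Spider'
  AND COALESCE(ua.useragent_family, '') NOT IN ('HeadlessChrome', 'PhantomJS')
```

Whether events with no UA Parser row should be kept or dropped is a judgment call; this sketch keeps them.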

Use the IAB Bots & Spiders list

Of course, not all bots and spiders are so well behaved as to announce their presence. Snowplow has another enrichment for this purpose: the IAB/ABC International Bots & Spiders list client. This enrichment uses the list curated by the IAB and UK ABC organisations, which is used extensively in the ad tech space. As well as user-agent strings, the list includes known bot IP address ranges.

Subscribing to the list alone can be quite expensive but it is available at a discount to our Snowplow Managed Service customers. Just create a ticket for it and we’ll get it set up for you.

Querying the data

Bringing these two data sources together we can create a query that shows the different classifications of the browser and user. Below is an example query you can use to amend your data model.

SELECT
  ev.derived_tstamp,
  ev.page_urlpath,
  ev.useragent,
  iab.category,
  iab.primary_impact,
  iab.reason,
  iab.spider_or_robot,
  ua.useragent_family,
  ua.device_family
FROM atomic.events AS ev
  -- IAB Bots & Spiders enrichment output
  LEFT JOIN atomic.com_iab_snowplow_spiders_and_robots_1 AS iab
    ON ev.event_id = iab.root_id
    AND ev.collector_tstamp = iab.root_tstamp
  -- UA Parser enrichment output
  LEFT JOIN atomic.com_snowplowanalytics_snowplow_ua_parser_context_1 AS ua
    ON ev.event_id = ua.root_id
    AND ev.collector_tstamp = ua.root_tstamp
WHERE
  ev.app_id = 'snowflake-analytics.com'
  AND ev.derived_tstamp > '2018-10-01'
  AND iab.spider_or_robot = true  -- show only events the IAB list flags
ORDER BY 1
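
Before changing a data model, it can help to see how much traffic each classification actually covers. The following is a sketch along those lines, not a definitive report: it reuses the same tables and join keys as the query above, drops the spider_or_robot filter so flagged and unflagged traffic appear side by side, and counts events per combination. The event_count alias and the iab_flagged column name are made up for this example.

```sql
-- Sketch: count events per classification so flagged and unflagged
-- traffic can be compared. The app_id and date values are placeholders.
SELECT
  COALESCE(iab.spider_or_robot, false) AS iab_flagged,  -- no IAB row counts as not flagged
  iab.category,
  ua.device_family,
  ua.useragent_family,
  COUNT(*) AS event_count
FROM atomic.events AS ev
  LEFT JOIN atomic.com_iab_snowplow_spiders_and_robots_1 AS iab
    ON ev.event_id = iab.root_id
    AND ev.collector_tstamp = iab.root_tstamp
  LEFT JOIN atomic.com_snowplowanalytics_snowplow_ua_parser_context_1 AS ua
    ON ev.event_id = ua.root_id
    AND ev.collector_tstamp = ua.root_tstamp
WHERE
  ev.app_id = 'snowflake-analytics.com'
  AND ev.derived_tstamp > '2018-10-01'
GROUP BY 1, 2, 3, 4
ORDER BY event_count DESC
```

Rows where iab_flagged is false but device_family is “Spider” are the self-declared crawlers that the IAB list did not flag in your data.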

What’s missing?

You’ll notice there’s no dvce_type field in the UA Parser table. This is because the UA Parser project deems only information inside the User-Agent header to be relevant. To break traffic out into device types such as desktop, mobile, tablet, TV and games console, you’d need to create your own lookup table using something like the device_family column. Unfortunately that column isn’t especially consistent, and you’d need to keep monitoring the values that appear there to be sure you don’t miss anything. Tablet detection might also need a screen size dimension.
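
One way to start such a lookup is a CASE expression over device_family, sketched below. The device_family values and the device_type labels here are illustrative guesses rather than a complete or verified mapping, so check them against what actually appears in your own data. Anything unmapped falls into an “Unclassified” bucket, which doubles as the list of values to keep monitoring.

```sql
-- Sketch: derive a coarse device_type from device_family.
-- The mapped values are examples only; extend them from your own data.
SELECT
  ua.device_family,
  CASE
    WHEN ua.device_family = 'Spider' THEN 'Bot'
    WHEN ua.device_family = 'iPad' THEN 'Tablet'
    WHEN ua.device_family IN ('iPhone', 'iPod') THEN 'Mobile'
    WHEN ua.device_family = 'Mac' THEN 'Desktop'
    ELSE 'Unclassified'  -- review this bucket regularly for new values
  END AS device_type,
  COUNT(*) AS event_count
FROM atomic.com_snowplowanalytics_snowplow_ua_parser_context_1 AS ua
GROUP BY 1, 2
ORDER BY event_count DESC
```

Once the mapping settles down, moving it into a real lookup table makes it easier to maintain than a long CASE expression.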

Bots that aren’t captured

There are plenty of bots that won’t be identified using these mechanisms. Bad actors, particularly in ad fraud, have strong financial incentives to evade blocking mechanisms and actively work to stay off the IAB list. Some perfectly legitimate bots and spiders operate in narrow niches or regions and might not have made it onto these lists, which are built primarily to serve online advertising.

You’ll need to dig through your own data and explore to find bot-like patterns of behaviour and make your own classifications to catch these.
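
As a starting point for that exploration, the sketch below looks for one simple bot-like pattern: a single IP address and user agent generating an implausible number of page views in an hour. The threshold of 500 is an arbitrary placeholder, not a recommendation, and the approach assumes user_ipaddress has not been anonymised in your pipeline. DATE_TRUNC is available in Redshift, Snowflake and PostgreSQL; other warehouses may need a different function.

```sql
-- Sketch: surface IP / user agent pairs with unusually high hourly volume.
-- The threshold, app_id and date values are placeholders to tune.
SELECT
  ev.user_ipaddress,
  ev.useragent,
  DATE_TRUNC('hour', ev.derived_tstamp) AS hour_start,
  COUNT(*) AS page_views,
  COUNT(DISTINCT ev.page_urlpath) AS distinct_pages
FROM atomic.events AS ev
WHERE
  ev.app_id = 'snowflake-analytics.com'
  AND ev.derived_tstamp > '2018-10-01'
  AND ev.event = 'page_view'
GROUP BY 1, 2, 3
HAVING COUNT(*) > 500  -- hypothetical cut-off for "more than a person would browse"
ORDER BY page_views DESC
```

High volume alone is not proof of a bot (think of a large office behind one IP address), so treat the results as candidates to inspect rather than a final classification.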

Next steps

Device categorisation in Snowplow (and other platforms) isn’t perfect, but it has improved. Snowplow are working on new enrichments all the time and have plans to use some commercially-available classification systems to get even better at classification.

If there’s anything else you’d like us to help with, let us know.
