The browser fields in atomic.events are deprecated (there’s now a better way)
The Internet has a lot of robots crawling around. Some are well behaved: they follow the robots.txt rules on whether to crawl your site, and they identify themselves as robots in the User-Agent header sent with their requests.
Others are not so well behaved.
Most site owners want a reliable way to exclude robot traffic from analytics data.
The atomic.events table has a number of browser classification fields that are populated by the User Agent Utils enrichment. This enrichment is now deprecated and unmaintained, so it’s out of date for the latest browsers. You should stop using these fields in atomic.events to classify browsers.
Its replacement, the ua-parser enrichment, uses the open-source ua-parser library, which is actively maintained. It works a little differently from the old model: it requires you to join an external table to the core events table, like any other custom data model.
Using this enrichment, you might exclude rows with a device_family value of “Spider”, and perhaps look deeper and also exclude useragent_family values such as “HeadlessChrome” and “PhantomJS”.
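As a sketch of what that exclusion looks like, the snippet below runs the join-and-filter against SQLite using simplified stand-in tables. The table and column layouts here (an `events` table and a `ua_parser_context` table keyed by `event_id`) are assumptions for illustration; the real Snowplow table names and join keys in your warehouse will differ.

```python
import sqlite3

# Simplified, hypothetical stand-ins for atomic.events and the
# ua-parser context table. Real Snowplow schemas differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id TEXT, page_url TEXT);
CREATE TABLE ua_parser_context (
    event_id TEXT, useragent_family TEXT, device_family TEXT);
INSERT INTO events VALUES ('e1', '/home'), ('e2', '/home'), ('e3', '/pricing');
INSERT INTO ua_parser_context VALUES
    ('e1', 'Chrome', 'Other'),
    ('e2', 'Googlebot', 'Spider'),      -- well-behaved crawler
    ('e3', 'HeadlessChrome', 'Other');  -- headless browser
""")

# Join the ua-parser context to events, then filter out known robots
# by device family and user-agent family.
rows = conn.execute("""
SELECT e.event_id, e.page_url
FROM events e
JOIN ua_parser_context u ON u.event_id = e.event_id
WHERE u.device_family != 'Spider'
  AND u.useragent_family NOT IN ('HeadlessChrome', 'PhantomJS')
ORDER BY e.event_id
""").fetchall()

print(rows)  # only e1 survives the filter
```

The same WHERE clause, with your warehouse’s real table names substituted in, is the core of the exclusion.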
Of course, not all bots and spiders are so well behaved as to announce their presence. Snowplow has another enrichment for this purpose: the IAB/ABC International Bots & Spiders list client. This enrichment uses a list curated by the IAB and the UK’s ABC, which is used extensively in the ad tech space. As well as user-agent strings, it also maintains a list of known bot IP address ranges.
Subscribing to the list alone can be quite expensive, but it is available at a discount to our Snowplow Managed Service customers. Just create a ticket for it and we’ll get it set up for you.
Bringing these two data sources together, we can create a query that shows the different classifications of each browser and user. Below is an example query you can use to amend your data model.
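Here is a sketch of such a query, run against SQLite with simplified stand-in tables. The column `spider_or_robot` on the IAB context is an assumed field name for illustration, as are the table layouts; check the actual schemas the enrichments produce in your warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified stand-ins for the real Snowplow tables.
CREATE TABLE events (event_id TEXT);
CREATE TABLE ua_parser_context (event_id TEXT, device_family TEXT);
CREATE TABLE iab_context (event_id TEXT, spider_or_robot INTEGER);
INSERT INTO events VALUES ('e1'), ('e2'), ('e3');
INSERT INTO ua_parser_context VALUES
    ('e1', 'Other'), ('e2', 'Spider'), ('e3', 'Other');
INSERT INTO iab_context VALUES ('e1', 0), ('e2', 1), ('e3', 1);
""")

# Classify each event using both sources: treat it as robot traffic if
# either the ua-parser device family or the IAB list flags it.
rows = conn.execute("""
SELECT e.event_id,
       CASE
         WHEN u.device_family = 'Spider' OR i.spider_or_robot = 1
           THEN 'robot'
         ELSE 'human'
       END AS classification
FROM events e
LEFT JOIN ua_parser_context u ON u.event_id = e.event_id
LEFT JOIN iab_context i ON i.event_id = e.event_id
ORDER BY e.event_id
""").fetchall()

print(rows)
```

The left joins matter: an event with no context row still comes through, defaulting to “human”, rather than silently disappearing from your counts.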
You’ll notice there’s no dvce_type field in the UA Parser table. This is because the UA Parser project deems only information inside the User-Agent header to be relevant.
To break out into device classes like desktop, mobile, tablet, TV, and games console, you’d need to create your own lookup table based on something like the device_family column. Unfortunately, that column isn’t enormously consistent, and you’d need to keep monitoring which values appear there to be sure you don’t miss anything. Tablet detection might also need a screen-size dimension in the mix.
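A minimal sketch of such a lookup table is below. The device_family values in the mapping are illustrative assumptions, not an exhaustive or authoritative list; the point is the fallback, which surfaces unrecognised families for review rather than miscategorising them.

```python
# Hypothetical mapping from ua-parser device_family values to coarse
# device classes. The keys here are illustrative examples only; you'd
# need to keep monitoring your own data for new values.
DEVICE_CLASS = {
    "iPhone": "mobile",
    "iPad": "tablet",
    "Samsung SM-G991B": "mobile",
    "Spider": "bot",
}

def classify_device(device_family: str) -> str:
    # Fall back to "unknown" so unseen families get flagged for review
    # instead of being silently dropped into the wrong bucket.
    return DEVICE_CLASS.get(device_family, "unknown")

print(classify_device("iPad"))   # tablet
print(classify_device("Other"))  # unknown
```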
There are plenty of bots that won’t be identified by these mechanisms. Bad actors, particularly in ad fraud, have strong financial incentives to evade blocking and actively work to get around the IAB list. And some perfectly legitimate bots and spiders operate in narrow niches or regions and might not have made it onto these lists, which are built primarily to serve online advertising.
You’ll need to dig through and explore your own data to find bot-like patterns of behaviour, then make your own classifications to catch these.
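As one illustrative heuristic of this kind (the threshold and session shape here are assumptions to tune against your own data, not a Snowplow recommendation), you might flag sessions whose sustained event rate is implausibly high for a human:

```python
from datetime import datetime, timedelta

def looks_bot_like(event_times, max_events_per_minute=30):
    """Flag a session whose sustained event rate is implausibly high
    for a human. The threshold is an assumption; tune it on your data."""
    if len(event_times) < 2:
        return False
    duration_min = (max(event_times) - min(event_times)).total_seconds() / 60
    # Guard against zero-length sessions (all events in the same instant).
    rate = len(event_times) / max(duration_min, 1 / 60)
    return rate > max_events_per_minute

start = datetime(2019, 1, 1, 12, 0)
human = [start + timedelta(seconds=20 * i) for i in range(10)]  # ~3/min
robot = [start + timedelta(seconds=i) for i in range(200)]      # ~60/min

print(looks_bot_like(human), looks_bot_like(robot))  # False True
```

Rate alone is a blunt instrument; in practice you’d combine several signals (pages per session, time-of-day patterns, lack of engagement events) before classifying anything.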
Device categorisation in Snowplow (and other platforms) isn’t perfect, but it has improved. Snowplow are working on new enrichments all the time and plan to use some commercially available classification systems to get even better.
If there’s anything else you’d like us to help with, let us know.