Make big data small again with Redshift ZSTD compression

Jul 12, 2017. | By: Mike Robins

A new compression option in Redshift allows you to make big storage savings, up to two-thirds in our tests, over the standard Snowplow setup. This guide shows how it works and how to get it happening.

In late 2016 Facebook open sourced a compression algorithm known as Zstandard that combines Lempel Ziv and tANS to achieve a compression ratio better than many algorithms with a slight tradeoff in speed.

In January 2017 AWS added ZSTD support to Redshift, so you can now use this powerful compression algorithm to reduce the storage size of your data. One immense perk of ZSTD is that it can be applied across all supported data types. It gives exceptional performance on long varchars: perfectly suited for large JSON strings that regularly appear in shredded tables.

We’ve had great success both experimentally and practically by applying ZSTD compression to multiple Snowplow tables. Experimentally across datasets ranging between 10 million and 5 billion rows we’ve achieved a mean compression ratio of ~3 meaning that the newly compressed table takes up approximately a third of the original table on disk when compared to the compression defaults in atomic.events 0.8.0.

In particular we’ve found:

  • Negligible impact on speed of queries for atomic.events (note this will be highly workload dependent so benchmarks here are less useful)
  • So far we haven’t experimented with the performance/space savings of compressing SORTKEYs. AWS recommend against compressing SORTKEYs (so use ENCODE RAW for collector_tstamp and root_tstamp).
  • ZSTD in almost all instances replaces LZO as the default compression method suggested by ANALYZE COMPRESSION.

Things to take into account

  • Depending on your workload you may want to run ANALYZE COMPRESSION on your table which will provide some recommendations by Redshift as to what the suggested column encodings are. One caveat of this approach is that you are limited to sampling 1 billion rows so if possible choose a sample data set that contains representative variability within columns.
  • If you have the opportunity we also recommend benchmarking common queries/jobs on an identical sample of data for a) the 0.8.0 compression defaults and b) the newly compressed tables.
  • You cannot modify compression on existing columns in a table so consider deep copying the data particularly if a large region of your table is unsorted1 as this will outperform a VACUUM. If performing a deep copy ensure you have sufficient disk space to complete the action: it’s difficult to know how much space is required but we opt for at least 50% of the cluster storage remaining. You may need to consider resizing the cluster and/or temporarily pausing your pipeline to complete this action.
  • If you decide to drop the original atomic.events table ensure that you either DROP CASCADE or individually drop any dependencies that may rely on this table such as views or foreign key constraints from shredded or derived tables.
  • If you’ve dropped any foreign keys ensure that you recreate them and they reference the new atomic.events object. Although Redshift does not enforce these constraints they are used by planner to generate optimised query plans.
  • As this is a new object you may also need to regrant permissions to users/groups as well as ensure the table owner is identical to the original table(s) so that Snowplow can continue to load data.

Getting started

  1. Analyze compression on your existing table by using ANALYZE COMPRESSION.
  2. If Redshift recommends using ZSTD for columns consider benchmarking a sample of data (e.g., 3 months) with the original column compression and ZSTD column compression
  3. If performance is better or equivalent deep copy data from the original atomic.events table to a new atomic.events_new table. You may need to resize Redshift or free up additional disk space before performing this action.
  4. Verify that the two tables contain identical data by comparing a total row count as well as a row count per day.
  5. Drop the original table as well as any references2. ALTER the new table to atomic events3. Ensure that this new table has an identical owner.
  6. Sit back and enjoy that free disk space.

Need help?

There’s a few gotchas and edge cases involved with doing this. If you’d like help or advice on how to do it feel free to post in the comments below or get in touch.

Queries

1

 SELECT
 "database",
 "table",
 diststyle,
 sortkey1,
 sortkey1_enc,
 size,
 pct_used,
 unsorted,
 stats_off,
 tbl_rows
FROM
 svv_table_info
WHERE
 schema = 'atomic'
ORDER BY
 size DESC

2

SELECT DISTINCT c_p.oid AS tbloid
,n_p.nspname AS schemaname
,c_p.relname AS NAME
,n_c.nspname AS refbyschemaname
,c_c.relname AS refbyname
,c_c.oid AS viewoid
FROM pg_class c_p
JOIN pg_depend d_p ON c_p.relfilenode = d_p.refobjid
JOIN pg_depend d_c ON d_p.objid = d_c.objid
JOIN pg_class c_c ON d_c.refobjid = c_c.relfilenode
LEFT JOIN pg_namespace n_p ON c_p.relnamespace = n_p.oid
LEFT JOIN pg_namespace n_c ON c_c.relnamespace = n_c.oid
WHERE
 schemaname = 'atomic'

3

ALTER TABLE atomic.events_new RENAME TO atomic.events;

About

We exist to make organisations better understand their businesses by enabling all decision makers in a company to work with the same version of the truth.

Social Links

Our Bunker

Level 5, 104 Commonwealth Street
Surry Hills, 2010, Surry Hills,
NSW, Australia
contact@snowflake-analytics.com
AU: 1300 971 915
US: 1 415 963 4782
Privacy